Turkish
From SuperMemopedia
Important! This problem has been fixed in the newest version of SuperMemo 2004. SuperMemo 2006 can interpret a wider range of BOMs.
Problem statement
Import the following Turkish item (in the question-and-answer format), stored in a Microsoft Word document (DOC extension), into a SuperMemo collection without losing diacritical characters:
Q: çarşaf ağıt
A: önür
Solution
Word 2000
- choose Save As in Word
- choose Save As Type : Encoded Text
- choose Save and Yes
- choose Unicode (UTF-8)
- choose OK (this will save the file as UTF-8 encoded plain text)
- import this file to SuperMemo with File : Tools : Import : Q&A text and use an HTML template (to correctly display Unicode). Do not use the plain text template
Word XP or 2002/2003
- In Microsoft Word XP or 2002/2003, on the File menu, click Open to open a document that contains items in question-and-answer format
- In the newly opened document, place the caret (insertion point) at the very beginning of the document (before the first Q:), and press ENTER to move the text to the second line (otherwise the question of the first item will not be imported)
- Click File, and then click Save As...
- At the bottom of the Save As dialog box, in the Save as type combo box, select Plain Text, and then click Save
- In the File Conversion dialog box, select Other encoding » Unicode (UTF-8), and then click OK
- Close Microsoft Word XP or 2002/2003 before you perform the import procedure in SuperMemo 2004 (otherwise SuperMemo 2004 will report the unintuitive error "Error importing Q&A texts; I/O error 32")
- In SuperMemo 2004, on the File menu, click Tools » Import » Q&A text to import the plain text file created in the third, fourth, and fifth steps above (Save As, Plain Text, Unicode (UTF-8))
- In the Import Text dialog box, click OK to begin importing
Diagnosis
The Unicode Standard defines one code point (U+FEFF, decimal 65279) as the byte order mark (BOM). In files stored as UTF-16 (in either byte order), the leading bytes FE FF or FF FE are a good indication of UTF-16, and they also determine the byte order. UTF-8 has no byte order problem, yet MS Word XP/2002/2003 still inserts a BOM (encoded as the bytes EF BB BF) in front of UTF-8 text. When SuperMemo hits the BOM before Q:, it considers the line an illegal Q&A format and skips the question. Inserting an empty line before the first Q: leaves the BOM alone on the first line, which SuperMemo skips, so the import is not affected. MS Word 2000 does not insert a BOM, which is why SuperMemo imports its files without complaint.
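To check whether a saved file starts with a BOM, it is enough to look at its first few bytes. The following is a minimal Python sketch, not part of SuperMemo; the function name `detect_bom` and the sample data are illustrative assumptions, and UTF-32 marks are deliberately not handled.

```python
import codecs

def detect_bom(data: bytes) -> str:
    """Return a label for the byte order mark at the start of data, or 'none'.

    Hypothetical helper for illustration; UTF-32 BOMs are not considered.
    """
    if data.startswith(codecs.BOM_UTF8):       # bytes EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return "none"

sample = "Q: çarşaf ağıt\nA: önür\n"

# Simulates a file written with a leading BOM (as Word XP/2002/2003 does).
with_bom = codecs.BOM_UTF8 + sample.encode("utf-8")
# Simulates a file written without a BOM (as Word 2000 does).
without_bom = sample.encode("utf-8")

print(detect_bom(with_bom))     # utf-8
print(detect_bom(without_bom))  # none
```

For a real file, pass the result of `open(path, "rb").read(4)` to the function.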
Proposition
SuperMemo cannot remove BOM bytes arbitrarily, but once it detects UTF-8, it could safely skip a BOM placed at the very beginning of the imported file. This would not affect any imported texts and would make life easier for users of newer versions of Word. It would also spare users hours of trying to figure out why UTF-8 files saved by MS Word 2000 import correctly while those saved by MS Word XP/2003 do not.
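The proposed behavior can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not SuperMemo's actual import code; `strip_leading_utf8_bom` and `read_qa_lines` are hypothetical names.

```python
import codecs

def strip_leading_utf8_bom(data: bytes) -> bytes:
    """Drop a UTF-8 BOM only when it is the very first thing in the data."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def read_qa_lines(data: bytes) -> list[str]:
    """Hypothetical helper: decode UTF-8 bytes and return the non-empty lines."""
    text = strip_leading_utf8_bom(data).decode("utf-8")
    return [line for line in text.splitlines() if line.strip()]

# Simulates a UTF-8 file with a leading BOM, as saved by Word XP/2002/2003.
word_xp_output = codecs.BOM_UTF8 + "Q: çarşaf ağıt\nA: önür\n".encode("utf-8")

lines = read_qa_lines(word_xp_output)
print(lines[0].startswith("Q:"))  # True
```

Only a BOM at offset 0 is removed, so U+FEFF occurring later in the text is left untouched. In Python, decoding with the built-in `utf-8-sig` codec has the same effect on a leading BOM.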
Do not use these!
These methods will not work:
- SuperMemo cannot import DOC files with Q&A import
- SuperMemo cannot import RTF files with Q&A import
- SuperMemo cannot import the same file saved as UTF-16 Unicode (it reads it byte by byte)
- SuperMemo will not import it correctly if the plain text is saved with Turkish encoding (it only displays ASCII or Unicode and it does not do Unicode conversions on import)
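If a file has already been saved as UTF-16 or in a Turkish code page, it can be converted to UTF-8 without a BOM outside of Word before importing. The sketch below is a hedged example in Python; it assumes that "Turkish encoding" means Windows-1254 (Python codec `cp1254`), and `to_utf8_without_bom` is a hypothetical helper name.

```python
def to_utf8_without_bom(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from source_encoding to UTF-8 with no leading BOM.

    Assumption: the caller knows the source encoding, e.g. "utf-16" or
    "cp1254" (Windows Turkish).
    """
    text = data.decode(source_encoding)
    # The "utf-16" codec already consumes the BOM; lstrip covers codecs
    # such as "utf-16-le" that leave it in the text as U+FEFF.
    return text.lstrip("\ufeff").encode("utf-8")

sample = "Q: çarşaf ağıt\nA: önür\n"

from_turkish = to_utf8_without_bom(sample.encode("cp1254"), "cp1254")
from_utf16 = to_utf8_without_bom(sample.encode("utf-16"), "utf-16")

print(from_turkish == sample.encode("utf-8"))  # True
print(from_utf16 == sample.encode("utf-8"))    # True
```

To convert a real file, read it with `open(path, "rb").read()` and write the returned bytes to a new file opened in `"wb"` mode.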
