Language Setup
This section shows some language setups ("RegExp Split Sentences", "RegExp Word Characters", "Make each character a word", "Remove spaces") for different languages. They are only recommendations, and you may change them according to your needs (and texts). See also the New/Edit Language section in the user guide.
If you are unsure, try the "Language Settings Wizard" first. Later you can adjust the settings.
Before changing these settings, please read up on Unicode (general information and the table of Unicode characters) and on the characters that occur in the language you are learning.
| Language | RegExp Split Sentences | RegExp Word Characters | Make each character a word | Remove spaces |
|---|---|---|---|---|
| Languages with a Latin-derived alphabet (English, French, German, etc.) | .!?:; | a-zA-ZÀ-ÖØ-öø-ȳ | No | No |
| Languages with a Cyrillic-derived alphabet (Russian, Bulgarian, Ukrainian, etc.) | .!?:; | a-zA-ZÀ-ÖØ-öø-ȳЀ-ӹ | No | No |
| Greek | .!?:; | \x{0370}-\x{03FF}\x{1F00}-\x | No | No |
| Hebrew (Right-To-Left = Yes) | .!?:; | \x{0590}-\x | No | No |
| Thai | .!?:; | ก-๛ | No | Yes |
| Chinese | .!?:;。!?:; | 一-龥 | Yes or No | Yes |
| Japanese (Without MeCab) | .!?:;。!?:; | 一-龥ぁ-ヾ | Yes or No | Yes |
| Japanese (With MeCab) | .!?:;。!?:; | mecab | Yes or No | Yes |
| Japanese (With MeCab Python) | .!?:;。!?:; | mecab-python | Yes or No | Yes |
| Chinese (With Jieba) | .!?:;。!?:; | jieba | No | Yes |
| Korean | .!?:;。!?:; | 가-힣ᄀ-ᇂ | No | No or Yes |
External Parsers
For Chinese and Japanese, external NLP parsers (Jieba, MeCab) provide better word segmentation than character-by-character splitting. See the Text Parsers documentation for installation and configuration.
"\'" = Apostrophe, and/or "\-" = Dash, may be added to "RegExp Word Characters", then words like "aujourd'hui" or "non-government-owned" are one word, instead of two or more single words. If you omit "\'" and/or "\-" here, you can still create a multi-word expression "aujourd'hui", etc., later.
":" and ";" may be omitted in "RegExp Split Sentences", but longer example sentences may result from this.
"Make each character a word" = "Yes" should only be set in Chinese, Japanese, and similar languages. Normally words are split by any non-word character or whitespace. If you choose "Yes", then you do not need to insert spaces to specify word endings. If you choose "No", then you must prepare texts without whitespace by inserting whitespace to specify words. If you are a beginner, "Yes" may be better for you. If you are an advanced learner, and you have a possibility to prepare a text in the above described way, then "No" may be better for you.
"Remove spaces" = "Yes" should only be set in Chinese, Japanese, and similar languages to remove whitespace that has been automatically or manually inserted to specify words.