Language Setup

  • This section lists recommended language settings ("RegExp Split Sentences", "RegExp Word Characters", "Make each character a word", "Remove spaces") for different languages. They are only recommendations; change them according to your needs and your texts. See also the New/Edit Language section in the user guide.

  • If you are unsure, try the "Language Settings Wizard" first. Later you can adjust the settings.

  • Read up on Unicode (general information and the table of Unicode characters) and on the characters that occur in the language you are learning.
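
To find out which characters actually occur in a text, and so which ranges "RegExp Word Characters" has to cover, you can list their code points. The following Python sketch is not part of LWT; the function name `describe_chars` is made up for illustration.

```python
import unicodedata

def describe_chars(text):
    """Return (character, code point, Unicode name) for each distinct
    non-whitespace character in text, sorted by code point."""
    rows = []
    for ch in sorted(set(text)):
        if ch.isspace():
            continue
        code_point = f"U+{ord(ch):04X}"
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, code_point, name))
    return rows

# Paste a sample of your own text here.
for ch, code_point, name in describe_chars("Ελληνικά"):
    print(ch, code_point, name)
```

The lowest and highest code points printed for your sample give a first idea of the range to put into "RegExp Word Characters".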

| Language | RegExp Split Sentences | RegExp Word Characters | Make each character a word | Remove spaces |
| --- | --- | --- | --- | --- |
| Latin derived alphabet (English, French, German, etc.) | .!?:; | a-zA-ZÀ-ÖØ-öø-ȳ | No | No |
| Languages with a Cyrillic-derived alphabet (Russian, Bulgarian, Ukrainian, etc.) | .!?:; | a-zA-ZÀ-ÖØ-öø-ȳЀ-ӹ | No | No |
| Greek | .!?:; | \x{0370}-\x{03FF}\x{1F00}-\x | No | No |
| Hebrew (Right-To-Left = Yes) | .!?:; | \x{0590}-\x | No | No |
| Thai | .!?:; | ก-๛ | No | Yes |
| Chinese | .!?:;。!?:; | 一-龥 | Yes or No | Yes |
| Japanese (Without MeCab) | .!?:;。!?:; | 一-龥ぁ-ヾ | Yes or No | Yes |
| Japanese (With MeCab) | .!?:;。!?:; | mecab | Yes or No | Yes |
| Japanese (With MeCab Python) | .!?:;。!?:; | mecab-python | Yes or No | Yes |
| Chinese (With Jieba) | .!?:;。!?:; | jieba | No | Yes |
| Korean | .!?:;。!?:; | 가-힣ᄀ-ᇂ | No | No or Yes |
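
To get a feel for what the two RegExp settings do, the following Python sketch applies the values from the "Latin derived alphabet" row to a short text. It is only an illustration under assumptions, not LWT's actual parsing code: the function names `split_sentences` and `find_words` are hypothetical, and both settings are simply treated as the body of a regular-expression character class. Rows written in `\x{....}` notation would first have to be translated, because Python's `re` module does not accept that escape syntax.

```python
import re

# Values copied from the "Latin derived alphabet" row of the table.
SPLIT_SENTENCES = ".!?:;"
WORD_CHARACTERS = "a-zA-ZÀ-ÖØ-öø-ȳ"

def split_sentences(text, split_chars=SPLIT_SENTENCES):
    """Split text into sentences that end at any of the split characters."""
    cls = re.escape(split_chars)
    # One sentence = a run of non-terminators plus the terminators that follow.
    pattern = re.compile(f"[^{cls}]+[{cls}]*")
    pieces = (m.group().strip() for m in pattern.finditer(text))
    return [p for p in pieces if p]

def find_words(sentence, word_chars=WORD_CHARACTERS):
    """Return every maximal run of word characters in the sentence."""
    # word_chars is used verbatim inside [...], so ranges such as a-z work.
    return re.findall(f"[{word_chars}]+", sentence)

text = "Bonjour! Ça va? Très bien."
print(split_sentences(text))   # ['Bonjour!', 'Ça va?', 'Très bien.']
print(find_words(text))        # ['Bonjour', 'Ça', 'va', 'Très', 'bien']
```

Any character outside the word-character ranges (spaces, punctuation, digits) ends a word in this sketch, which is why a missing range shows up as words that are cut into pieces.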

External Parsers

For Chinese and Japanese, external NLP parsers (Jieba, MeCab) provide better word segmentation than character-by-character splitting. See the Text Parsers documentation for installation and configuration.

  • An apostrophe ("\'") and/or a dash ("\-") may be added to "RegExp Word Characters". Words like "aujourd'hui" or "non-government-owned" are then treated as one word instead of two or more separate words. If you omit "\'" and/or "\-", you can still create a multi-word expression such as "aujourd'hui" later.

  • ":" and ";" may be omitted in "RegExp Split Sentences", but longer example sentences may result from this.

  • "Make each character a word" = "Yes" should only be set in Chinese, Japanese, and similar languages. Normally, words are split at any non-word character or whitespace. If you choose "Yes", you do not need to insert spaces to mark word endings. If you choose "No", you must prepare texts that are written without whitespace by inserting whitespace between words yourself. If you are a beginner, "Yes" may be better for you. If you are an advanced learner and are able to prepare a text in the way described above, "No" may be better for you.

  • "Remove spaces" = "Yes" should only be set in Chinese, Japanese, and similar languages to remove whitespace that has been automatically or manually inserted to specify words.
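
The effect of these options can be tried out with a few lines of Python. This is a rough sketch under assumptions, not LWT's implementation: the helper names `find_words`, `split_text` and `remove_spaces` are hypothetical, the word-character setting is treated as the body of a character class, and `remove_spaces` drops every kind of whitespace, which may be broader than what LWT does.

```python
import re

LATIN = "a-zA-ZÀ-ÖØ-öø-ȳ"   # "RegExp Word Characters" for Latin alphabets
CJK = "一-龥"                # "RegExp Word Characters" for Chinese

def find_words(text, word_chars):
    """Return every maximal run of word characters."""
    return re.findall(f"[{word_chars}]+", text)

def split_text(text, word_chars, each_char_is_word):
    """Mimic "Make each character a word" = Yes / No."""
    if each_char_is_word:
        # Yes: every single word character is its own word.
        return re.findall(f"[{word_chars}]", text)
    # No: words are the runs between whitespace or other non-word characters.
    return find_words(text, word_chars)

def remove_spaces(text):
    """Mimic "Remove spaces" = Yes (assumption: all whitespace is dropped)."""
    return re.sub(r"\s+", "", text)

# Apostrophe and dash in "RegExp Word Characters"
print(find_words("aujourd'hui", LATIN))                    # ['aujourd', 'hui']
print(find_words("aujourd'hui", LATIN + r"\'"))            # ["aujourd'hui"]
print(find_words("non-government-owned", LATIN + r"\'\-")) # ['non-government-owned']

# A Chinese text prepared with spaces between words
prepared = "我 喜欢 学习"
print(split_text(prepared, CJK, each_char_is_word=False))  # ['我', '喜欢', '学习']
print(split_text(prepared, CJK, each_char_is_word=True))   # ['我', '喜', '欢', '学', '习']
print(remove_spaces(prepared))                             # 我喜欢学习
```

With "No", the word boundaries come entirely from the whitespace you inserted; with "Yes", the whitespace is irrelevant because every character is already a word.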

Released into the Public Domain under the Unlicense.