TextParsing
Text parsing and processing utilities.
Provides methods for parsing texts into sentences and words, handling Japanese text with MeCab, and managing text items in the database.
Table of Contents
Methods
- checkText() : array{sentences: int, words: int, unknownPercent: float, preview: string}
- Check/preview text and return parsing statistics without saving.
- parseAndDisplayPreview() : void
- Parse text and display preview HTML for validation.
- parseAndSave() : void
- Parse text and save to database.
- splitIntoSentences() : array<string|int, string>
- Split text into sentences without database operations.
- applyInitialTransformations() : string
- Apply initial text transformations (before display preview).
- applyWordSplitting() : string
- Apply word-splitting transformations (after display preview).
- checkExpressions() : void
- Check a language that contains expressions.
- checkValid() : void
- Echo the sentences in a text. Prepare JS data for words and word count.
- displayJapanesePreview() : void
- Display preview HTML for Japanese text.
- displayStandardPreview() : void
- Display preview HTML for standard text.
- displayStatistics() : void
- Display statistics about a text.
- getLanguageSettings() : array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null
- Get language settings for parsing.
- getMultiWordLengths() : array<string|int, int>
- Get all multi-word expression lengths for a language.
- parseJapaneseToDatabase() : void
- Parse Japanese text with MeCab and insert into temp_word_occurrences.
- parseStandardToDatabase() : void
- Parse standard text and insert into temp_word_occurrences.
- registerSentencesTextItems() : void
- Append sentences and text items in the database.
- saveWithSql() : void
- Insert a processed text into the database using pure SQL.
- saveWithSqlFallback() : void
- Fallback method to insert text data when LOAD DATA LOCAL INFILE is disabled.
- splitJapaneseSentences() : array<string|int, string>
- Split Japanese text into sentences (split-only mode).
- splitStandardSentences() : array<string|int, string>
- Split standard text into sentences (split-only mode).
Methods
checkText()
Check/preview text and return parsing statistics without saving.
public static checkText(string $text, int $lid) : array{sentences: int, words: int, unknownPercent: float, preview: string}
Use this method to get text statistics for preview purposes. Does not output any HTML or save to database.
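
As a rough illustration of how a caller might consume the documented return shape, the sketch below formats the statistics array into a one-line summary. The helper `summarizeTextStats()` and the sample values are hypothetical and not part of this class; the array literal merely stands in for the result of a real `checkText()` call.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of TextParsing): turns the statistics
 * array documented for checkText() into a one-line summary.
 *
 * @param array{sentences: int, words: int, unknownPercent: float, preview: string} $stats
 */
function summarizeTextStats(array $stats): string
{
    return sprintf(
        '%d sentences, %d words, %.1f%% unknown',
        $stats['sentences'],
        $stats['words'],
        $stats['unknownPercent']
    );
}

// Made-up sample data standing in for the result of a call such as
// TextParsing::checkText($text, $lid).
$stats = [
    'sentences' => 3,
    'words' => 42,
    'unknownPercent' => 12.5,
    'preview' => 'The quick brown fox...',
];

echo summarizeTextStats($stats), PHP_EOL; // prints "3 sentences, 42 words, 12.5% unknown"
```

Because `checkText()` is documented as producing no output and no database writes, a helper like this can decide on its own how and where to present the numbers.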
Parameters
- $text : string - Text to parse
- $lid : int - Language ID
Return values
array{sentences: int, words: int, unknownPercent: float, preview: string}
parseAndDisplayPreview()
Parse text and display preview HTML for validation.
public static parseAndDisplayPreview(string $text, int $lid) : void
Use this method for the text checking UI. Outputs HTML directly to show parsed sentences and word statistics.
Parameters
- $text : string - Text to parse
- $lid : int - Language ID
parseAndSave()
Parse text and save to database.
public static parseAndSave(string $text, int $lid, int $textId) : void
Use this method when creating or updating texts. Parses the text and inserts sentences and text items into the database.
Parameters
- $text : string - Text to parse
- $lid : int - Language ID
- $textId : int - Text ID (must be positive)
splitIntoSentences()
Split text into sentences without database operations.
public static splitIntoSentences(string $text, int $lid) : array<string|int, string>
Use this method when you only need to split text into sentences without saving to the database (e.g., for long text splitting).
Parameters
- $text : string - Text to parse
- $lid : int - Language ID
Return values
array<string|int, string> - Array of sentences
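
The idea of database-free sentence splitting can be sketched with a plain regular expression. The function `sketchSplitSentences()` and its `$sentenceEndChars` parameter are hypothetical; the real method reads its split rules from the language settings identified by `$lid`, and this sketch does not reproduce that logic or any exception patterns.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: split text on whitespace that follows a
 * sentence-ending character. Not the implementation used by TextParsing.
 *
 * @return list<string>
 */
function sketchSplitSentences(string $text, string $sentenceEndChars = '.!?'): array
{
    $text = trim($text);
    if ($text === '') {
        return [];
    }
    if ($sentenceEndChars === '') {
        // Nothing to split on: treat the whole text as one sentence.
        return [$text];
    }
    // Escape the characters so they are safe inside a character class.
    $class = preg_quote($sentenceEndChars, '/');
    $parts = preg_split('/(?<=[' . $class . '])\s+/u', $text);
    if ($parts === false) {
        return [$text];
    }
    // Drop empty fragments and reindex from zero.
    return array_values(array_filter($parts, static fn (string $s): bool => $s !== ''));
}

print_r(sketchSplitSentences('Hello world. How are you? Fine!'));
// yields ['Hello world.', 'How are you?', 'Fine!']
```

A lookbehind keeps the terminating punctuation attached to each sentence, which is usually what a preview or a long-text splitter wants.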
applyInitialTransformations()
Apply initial text transformations (before display preview).
private static applyInitialTransformations(string $text, bool $splitEachChar) : string
Parameters
- $text : string - Raw text
- $splitEachChar : bool - Whether to split each character
Return values
string - Text after initial transformations
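
The source does not say what "split each character" produces. One plausible reading, sketched below, is that every character becomes its own token, which suits scripts written without spaces between words. The function `sketchSplitEachChar()` and its `$separator` parameter are assumptions made for illustration only.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: place a separator between all non-whitespace
 * characters so that each one can later be treated as a separate token.
 */
function sketchSplitEachChar(string $text, string $separator = ' '): string
{
    // The empty pattern with the u modifier splits between UTF-8 characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $text; // invalid UTF-8: return the input unchanged
    }
    // Drop existing ASCII whitespace so separators are not doubled.
    $chars = array_filter($chars, static fn (string $c): bool => trim($c) !== '');
    return implode($separator, $chars);
}

echo sketchSplitEachChar('你好 世界'), PHP_EOL; // prints "你 好 世 界"
```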
applyWordSplitting()
Apply word-splitting transformations (after display preview).
private static applyWordSplitting(string $text, string $splitSentence, string $noSentenceEnd, string $termchar) : string
Parameters
- $text : string - Text after initial transformations
- $splitSentence : string - Sentence split regex
- $noSentenceEnd : string - Exception patterns
- $termchar : string - Word character regex
Return values
string - Preprocessed text ready for parsing
checkExpressions()
Check a language that contains expressions.
private static checkExpressions(array<string|int, int> $wl) : void
Parameters
- $wl : array<string|int, int> - All distinct expression lengths in the language.
checkValid()
Echo the sentences in a text. Prepare JS data for words and word count.
private static checkValid(int $lid) : void
Parameters
- $lid : int - Language ID
displayJapanesePreview()
Display preview HTML for Japanese text.
private static displayJapanesePreview(string $text) : void
Parameters
- $text : string - Preprocessed text
displayStandardPreview()
Display preview HTML for standard text.
private static displayStandardPreview(string $text, bool $rtlScript) : void
Parameters
- $text : string - Preprocessed text (after initial transformations)
- $rtlScript : bool - Whether text is right-to-left
displayStatistics()
Display statistics about a text.
private static displayStatistics(int $lid, bool $rtlScript, bool $multiwords) : void
Parameters
- $lid : int - Language ID
- $rtlScript : bool - Whether the language is right-to-left
- $multiwords : bool - Whether the text has multi-words to display
getLanguageSettings()
Get language settings for parsing.
private static getLanguageSettings(int $lid) : array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null
Parameters
- $lid : int - Language ID
Return values
array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null - Language settings, or null if not found
getMultiWordLengths()
Get all multi-word expression lengths for a language.
private static getMultiWordLengths(int $lid) : array<string|int, int>
Parameters
- $lid : int - Language ID
Return values
array<string|int, int> - Array of distinct word counts (e.g., [2, 3] for 2-word and 3-word expressions)
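
To make the `[2, 3]` example concrete, the sketch below derives distinct word counts from an in-memory list of expressions. The real method works per language ID and presumably reads from the database; the function `sketchMultiWordLengths()` and the whitespace-based word counting are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: collect the distinct word counts of all
 * expressions that consist of more than one word.
 *
 * @param list<string> $expressions
 * @return list<int>
 */
function sketchMultiWordLengths(array $expressions): array
{
    $lengths = [];
    foreach ($expressions as $expression) {
        $words = preg_split('/\s+/u', trim($expression), -1, PREG_SPLIT_NO_EMPTY);
        $count = $words === false ? 0 : count($words);
        if ($count > 1) {
            $lengths[$count] = true; // array keys deduplicate the counts
        }
    }
    $distinct = array_keys($lengths);
    sort($distinct);
    return $distinct;
}

print_r(sketchMultiWordLengths(['of course', 'as well as', 'in spite of', 'hello']));
// yields [2, 3]
```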
parseJapaneseToDatabase()
Parse Japanese text with MeCab and insert into temp_word_occurrences.
private static parseJapaneseToDatabase(string $text, bool $useMaxSeID) : void
Parameters
- $text : string - Preprocessed text
- $useMaxSeID : bool - Whether to query for max sentence ID (true for existing texts)
parseStandardToDatabase()
Parse standard text and insert into temp_word_occurrences.
private static parseStandardToDatabase(string $text, string $termchar, string $removeSpaces, bool $useMaxSeID) : void
Parameters
- $text : string - Preprocessed text
- $termchar : string - Word character regex
- $removeSpaces : string - Space removal setting
- $useMaxSeID : bool - Whether to query for max sentence ID
registerSentencesTextItems()
Append sentences and text items in the database.
private static registerSentencesTextItems(int $tid, int $lid, bool $hasmultiword) : void
TiSeID in temp_word_occurrences is pre-computed to match future SeID values. When parseStandardToDatabase runs with useMaxSeID=true, it sets TiSeID to MAX(SeID)+1, MAX(SeID)+2, etc. When we insert sentences here, they get those exact SeID values via auto-increment, so TiSeID = SeID.
Parameters
- $tid : int - ID of the text whose data is inserted
- $lid : int - ID of the language of the text
- $hasmultiword : bool - Set to true to insert multi-words as well.
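
The TiSeID note above can be illustrated with a small in-memory simulation. The class `FakeSentenceTable` is a hypothetical stand-in for an auto-increment sentences table, not part of the codebase, and the sketch assumes nothing else inserts rows between the two steps.

```php
<?php
declare(strict_types=1);

/** Hypothetical in-memory stand-in for a table with an auto-increment SeID. */
final class FakeSentenceTable
{
    /** @var array<int, string> rows keyed by SeID */
    private array $rows = [];
    private int $nextId = 1;

    /** Equivalent of SELECT MAX(SeID) in this simplified model. */
    public function maxId(): int
    {
        return $this->nextId - 1;
    }

    /** Inserts a row and returns the auto-assigned SeID. */
    public function insert(string $sentence): int
    {
        $id = $this->nextId++;
        $this->rows[$id] = $sentence;
        return $id;
    }
}

$table = new FakeSentenceTable();
$table->insert('An older sentence.'); // existing row with SeID 1

$newSentences = ['First new sentence.', 'Second new sentence.'];

// Step 1: pre-compute TiSeID values as MAX(SeID)+1, MAX(SeID)+2, ...
$base = $table->maxId();
$precomputed = [];
foreach (array_keys($newSentences) as $offset) {
    $precomputed[] = $base + $offset + 1;
}

// Step 2: insert the sentences; auto-increment hands out the same IDs.
$assigned = [];
foreach ($newSentences as $sentence) {
    $assigned[] = $table->insert($sentence);
}

var_dump($precomputed === $assigned); // bool(true)
```

The scheme only holds while the pre-computed range stays unclaimed, which is why the sketch states the no-concurrent-insert assumption explicitly.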
saveWithSql()
Insert a processed text into the database using pure SQL.
private static saveWithSql(string $text, int $id) : void
Parameters
- $text : string - Preprocessed text to insert
- $id : int - Text ID
saveWithSqlFallback()
Fallback method to insert text data when LOAD DATA LOCAL INFILE is disabled.
private static saveWithSqlFallback(string $text, int $id) : void
Parameters
- $text : string - Preprocessed text to insert
- $id : int - Text ID
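
The source does not describe how the fallback inserts rows. A common substitute for LOAD DATA LOCAL INFILE is a multi-row INSERT with bound parameters, sketched below. The helper `sketchBuildBatchInsert()` is hypothetical, and the column names other than TiSeID are invented for the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build one multi-row INSERT statement with
 * positional placeholders. Table and column names are interpolated
 * as-is, so they must come from trusted code, never from user input.
 *
 * @param list<string> $columns
 * @param list<list<int|string>> $rows
 * @return array{sql: string, params: list<int|string>}
 */
function sketchBuildBatchInsert(string $table, array $columns, array $rows): array
{
    if ($rows === []) {
        throw new InvalidArgumentException('At least one row is required.');
    }
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $params = [];
    foreach ($rows as $row) {
        if (count($row) !== count($columns)) {
            throw new InvalidArgumentException('Row width does not match column count.');
        }
        // Flatten the row values into one parameter list, in row order.
        array_push($params, ...array_values($row));
    }
    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $rowPlaceholder))
    );
    return ['sql' => $sql, 'params' => $params];
}

// TiOrder and TiText are made-up column names for illustration.
$batch = sketchBuildBatchInsert(
    'temp_word_occurrences',
    ['TiSeID', 'TiOrder', 'TiText'],
    [[1, 1, 'Hello'], [1, 2, 'world']]
);
echo $batch['sql'], PHP_EOL;
```

The resulting statement and parameter list could then be passed to a prepared-statement API; batching many rows per statement keeps the number of round trips low when bulk loading is unavailable.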
splitJapaneseSentences()
Split Japanese text into sentences (split-only mode).
private static splitJapaneseSentences(string $text) : array<string|int, string>
Parameters
- $text : string - Preprocessed text
Return values
array<string|int, string> - Array of sentences
splitStandardSentences()
Split standard text into sentences (split-only mode).
private static splitStandardSentences(string $text, string $removeSpaces) : array<string|int, string>
Parameters
- $text : string - Preprocessed text
- $removeSpaces : string - Space removal setting
Return values
array<string|int, string> - Array of sentences