StandardTextParser
in package
Standard text parsing with sentence splitting.
Handles language settings retrieval, text transformations, splitting, previewing, and database insertion for non-Japanese text.
Table of Contents
Methods
- applyInitialTransformations() : string
  Apply initial text transformations (before display preview).
- applyWordSplitting() : string
  Apply word-splitting transformations (after display preview).
- displayStandardPreview() : void
  Display preview HTML for standard text.
- getLanguageSettings() : array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null
  Get language settings for parsing.
- parseStandardToDatabase() : void
  Parse standard text and insert into temp_word_occurrences.
- splitStandardSentences() : array<string|int, string>
  Split standard text into sentences (split-only mode).
- quoteChars() : string
  Build the Unicode quotation-mark character class fragment used in regex patterns.
Methods
applyInitialTransformations()
Apply initial text transformations (before display preview).
public
static applyInitialTransformations(string $text, bool $splitEachChar) : string
Parameters
- $text : string
  Raw text
- $splitEachChar : bool
  Whether to split each character
Return values
string —Text after initial transformations
applyWordSplitting()
Apply word-splitting transformations (after display preview).
public
static applyWordSplitting(string $text, string $splitSentence, string $noSentenceEnd, string $termchar) : string
Parameters
- $text : string
  Text after initial transformations
- $splitSentence : string
  Sentence split regex
- $noSentenceEnd : string
  Exception patterns (text that must not be treated as a sentence end)
- $termchar : string
  Word character regex
Return values
string —Preprocessed text ready for parsing
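The interplay between a sentence-split pattern and its exception patterns can be illustrated with a small standalone sketch. This is not the implementation of applyWordSplitting(): the function name markSentenceBreaks(), the use of a newline as the sentence marker, and the sample patterns are all assumptions made for illustration, and $termchar is ignored here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of StandardTextParser).
 *
 * $splitSentence is assumed to be character-class content such as ".!?".
 * $noSentenceEnd is assumed to be a "|"-separated list of regex alternatives
 * that contains no "/" character.
 */
function markSentenceBreaks(string $text, string $splitSentence, string $noSentenceEnd): string
{
    // A run of non-space characters ending in a split character, then whitespace.
    $tokenPattern = '/([^\s]*[' . $splitSentence . '])\s+/u';
    // The whole token must equal one of the exceptions to be skipped.
    $exceptionPattern = '/^(?:' . $noSentenceEnd . ')$/u';

    $result = preg_replace_callback(
        $tokenPattern,
        static function (array $m) use ($exceptionPattern): string {
            if (preg_match($exceptionPattern, $m[1]) === 1) {
                return $m[0]; // exception such as "Dr.": keep the text unchanged
            }
            return $m[1] . "\n"; // real sentence end: replace the whitespace with a newline
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo markSentenceBreaks('Dr. Smith arrived. He sat down!  Good.', '.!?', 'Mr\.|Dr\.'), PHP_EOL;
```

In this sketch "Dr." stays attached to the following word, while "arrived." and "down!" each end a line.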
displayStandardPreview()
Display preview HTML for standard text.
public
static displayStandardPreview(string $text, bool $rtlScript) : void
Parameters
- $text : string
  Preprocessed text (after initial transformations)
- $rtlScript : bool
  Whether text is right-to-left
getLanguageSettings()
Get language settings for parsing.
public
static getLanguageSettings(int $lid) : array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null
Parameters
- $lid : int
  Language ID
Return values
array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null —Language settings or null if not found
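Because the return type is nullable and rtlScript is typed as mixed, callers probably want to check for null and cast that flag before use. The sketch below uses a hypothetical stand-in, demoLanguageSettings(), with invented values; only the array keys are taken from the documented return type.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for StandardTextParser::getLanguageSettings().
 * The values are invented; only the keys follow the documented array shape.
 */
function demoLanguageSettings(int $lid): ?array
{
    $known = [
        1 => [
            'removeSpaces'  => '',
            'splitSentence' => '.!?',
            'noSentenceEnd' => 'Mr\.|Dr\.',
            'termchar'      => 'a-zA-Z',
            'rtlScript'     => 0,
            'splitEachChar' => false,
        ],
    ];

    return $known[$lid] ?? null;
}

/** Summarize a language, or report that the ID is unknown. */
function describeLanguage(int $lid): string
{
    $settings = demoLanguageSettings($lid);
    if ($settings === null) {
        // Unknown language ID: stop before any parsing step runs.
        return "language $lid not found";
    }

    // rtlScript is documented as mixed, so cast it before treating it as a flag.
    $direction = (bool) $settings['rtlScript'] ? 'rtl' : 'ltr';

    return "language $lid: $direction, termchar [{$settings['termchar']}]";
}

echo describeLanguage(1), PHP_EOL;
echo describeLanguage(99), PHP_EOL;
```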
parseStandardToDatabase()
Parse standard text and insert into temp_word_occurrences.
public
static parseStandardToDatabase(string $text, string $termchar, string $removeSpaces, bool $useMaxSeID) : void
Parameters
- $text : string
  Preprocessed text
- $termchar : string
  Word character regex
- $removeSpaces : string
  Space removal setting
- $useMaxSeID : bool
  Whether to query for max sentence ID
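The method summaries suggest a call order in which this method comes last: settings lookup, initial transformations, optional preview, word splitting, then database insertion. The sketch below shows that order under stated assumptions. importStandardText() and RecordingParserStub are hypothetical names, the stub only records calls, and passing true for $useMaxSeID is an arbitrary choice. In real code the first argument would be the actual StandardTextParser class name.

```php
<?php
declare(strict_types=1);

/** Hypothetical stub that mimics the documented static signatures and logs each call. */
final class RecordingParserStub
{
    /** @var list<string> */
    public static array $calls = [];

    public static function getLanguageSettings(int $lid): ?array
    {
        self::$calls[] = __FUNCTION__;
        if ($lid !== 1) {
            return null;
        }
        return [
            'removeSpaces' => '', 'splitSentence' => '.!?', 'noSentenceEnd' => '',
            'termchar' => 'a-zA-Z', 'rtlScript' => false, 'splitEachChar' => false,
        ];
    }

    public static function applyInitialTransformations(string $text, bool $splitEachChar): string
    {
        self::$calls[] = __FUNCTION__;
        return $text;
    }

    public static function displayStandardPreview(string $text, bool $rtlScript): void
    {
        self::$calls[] = __FUNCTION__;
    }

    public static function applyWordSplitting(string $text, string $splitSentence, string $noSentenceEnd, string $termchar): string
    {
        self::$calls[] = __FUNCTION__;
        return $text;
    }

    public static function parseStandardToDatabase(string $text, string $termchar, string $removeSpaces, bool $useMaxSeID): void
    {
        self::$calls[] = __FUNCTION__;
    }
}

/** Run the assumed pipeline; returns false when the language is unknown. */
function importStandardText(string $parser, int $lid, string $text, bool $preview): bool
{
    $settings = $parser::getLanguageSettings($lid);
    if ($settings === null) {
        return false; // nothing is transformed, previewed, or written
    }

    $text = $parser::applyInitialTransformations($text, $settings['splitEachChar']);
    if ($preview) {
        $parser::displayStandardPreview($text, (bool) $settings['rtlScript']);
    }
    $text = $parser::applyWordSplitting(
        $text,
        $settings['splitSentence'],
        $settings['noSentenceEnd'],
        $settings['termchar']
    );
    // Passing true for $useMaxSeID is an assumption made for this sketch.
    $parser::parseStandardToDatabase($text, $settings['termchar'], $settings['removeSpaces'], true);

    return true;
}

importStandardText(RecordingParserStub::class, 1, 'Hello there.', true);
echo implode(' -> ', RecordingParserStub::$calls), PHP_EOL;
```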
splitStandardSentences()
Split standard text into sentences (split-only mode).
public
static splitStandardSentences(string $text, string $removeSpaces) : array<string|int, string>
Parameters
- $text : string
  Preprocessed text
- $removeSpaces : string
  Space removal setting
Return values
array<string|int, string> —Array of sentences
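The documented key type is string|int, so consumers should probably not assume contiguous zero-based keys. A minimal sketch of normalizing such an array follows. normalizeSentences() is a hypothetical helper and the sample data is invented; trimming and dropping empty entries are choices made for the example, not documented behavior.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: turn an array with arbitrary keys into a plain list
 * of trimmed, non-empty sentences.
 *
 * @param array<string|int, string> $sentences
 * @return list<string>
 */
function normalizeSentences(array $sentences): array
{
    $clean = [];
    foreach ($sentences as $sentence) {
        $sentence = trim($sentence);
        if ($sentence !== '') {
            $clean[] = $sentence; // appending re-indexes from 0
        }
    }

    return $clean;
}

// Invented sample shaped like the documented return type (gaps and a string key).
$sample = [0 => 'First sentence.', 2 => '  Second sentence. ', 'x' => ''];
print_r(normalizeSentences($sample));
```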
quoteChars()
Build the Unicode quotation-mark character class fragment used in regex patterns.
private
static quoteChars() : string
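Contains: RIGHT DOUBLE QUOTATION MARK (U+201D), close parenthesis, LEFT and RIGHT SINGLE QUOTATION MARK (U+2018, U+2019), single angle quotation marks (U+2039, U+203A), LEFT DOUBLE QUOTATION MARK (U+201C), DOUBLE LOW-9 QUOTATION MARK (U+201E), guillemets (U+00AB, U+00BB), and CJK brackets.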
Return values
string —Character class content (without surrounding brackets)
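Since the fragment comes without surrounding brackets, a caller wraps it in [...] when composing a pattern. The sketch below shows one way such a fragment might be built and used. demoQuoteChars() and breakAfterClosingQuotes() are hypothetical names, and the choice of U+300C and U+300D for "CJK brackets" is an assumption, because the documentation does not say which brackets are meant.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the private quoteChars() method.
 * Returns character-class content only; the caller adds the brackets.
 */
function demoQuoteChars(): string
{
    // ")" is a literal inside a character class, so it needs no escaping.
    return "\u{201D})\u{2018}\u{2019}\u{2039}\u{203A}\u{201C}\u{201E}"
        . "\u{00AB}\u{00BB}\u{300C}\u{300D}"; // last two code points are assumed
}

/** Keep closing quotes attached to the sentence they end. */
function breakAfterClosingQuotes(string $text): string
{
    // Sentence punctuation, any trailing quote characters, then whitespace.
    $pattern = '/([.!?][' . demoQuoteChars() . ']*)\s+/u';
    $result = preg_replace($pattern, '$1' . "\n", $text);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo breakAfterClosingQuotes("She said \u{201C}stop.\u{201D} Then she left."), PHP_EOL;
```

The u modifier matters here: without it, PCRE would treat each multi-byte quote as separate bytes inside the class.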