Documentation

TextParsing
in package

Text parsing and processing utilities.

Provides methods for parsing texts into sentences and words, handling Japanese text with MeCab, and managing text items in the database.

Tags
since
3.0.0

Table of Contents

Methods

checkText()  : array{sentences: int, words: int, unknownPercent: float, preview: string}
Check/preview text and return parsing statistics without saving.
parseAndDisplayPreview()  : void
Parse text and display preview HTML for validation.
parseAndSave()  : void
Parse text and save to database.
splitIntoSentences()  : array<string|int, string>
Split text into sentences without database operations.
applyInitialTransformations()  : string
Apply initial text transformations (before display preview).
applyWordSplitting()  : string
Apply word-splitting transformations (after display preview).
checkExpressions()  : void
Check a language that contains expressions.
checkValid()  : void
Echo the sentences in a text. Prepare JS data for words and word count.
displayJapanesePreview()  : void
Display preview HTML for Japanese text.
displayStandardPreview()  : void
Display preview HTML for standard text.
displayStatistics()  : void
Display statistics about a text.
getLanguageSettings()  : array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null
Get language settings for parsing.
getMultiWordLengths()  : array<string|int, int>
Get all multi-word expression lengths for a language.
parseJapaneseToDatabase()  : void
Parse Japanese text with MeCab and insert into temp_word_occurrences.
parseStandardToDatabase()  : void
Parse standard text and insert into temp_word_occurrences.
registerSentencesTextItems()  : void
Append sentences and text items in the database.
saveWithSql()  : void
Insert a processed text into the database using pure SQL.
saveWithSqlFallback()  : void
Fallback method to insert text data when LOAD DATA LOCAL INFILE is disabled.
splitJapaneseSentences()  : array<string|int, string>
Split Japanese text into sentences (split-only mode).
splitStandardSentences()  : array<string|int, string>
Split standard text into sentences (split-only mode).

Methods

checkText()

Check/preview text and return parsing statistics without saving.

public static checkText(string $text, int $lid) : array{sentences: int, words: int, unknownPercent: float, preview: string}

Use this method to get text statistics for preview purposes. It does not output any HTML and does not save anything to the database.

Parameters
$text : string

Text to parse

$lid : int

Language ID

Return values
array{sentences: int, words: int, unknownPercent: float, preview: string}
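A minimal sketch of consuming an array with the documented shape. `formatTextStats()` is a hypothetical helper, not part of TextParsing, and the sample values stand in for a real `checkText()` call. The sketch assumes `unknownPercent` is on a 0 to 100 scale, which this page does not state.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of TextParsing): format a statistics array
 * shaped like the documented return value of checkText().
 *
 * @param array{sentences: int, words: int, unknownPercent: float, preview: string} $stats
 */
function formatTextStats(array $stats): string
{
    // Assumption: unknownPercent is already expressed on a 0-100 scale.
    return sprintf(
        '%d sentences, %d words, %.1f%% unknown',
        $stats['sentences'],
        $stats['words'],
        $stats['unknownPercent']
    );
}

// Stand-in for the result of a real call; these values are made up.
$stats = [
    'sentences' => 2,
    'words' => 9,
    'unknownPercent' => 33.3,
    'preview' => 'The cat sat. It slept.',
];

echo formatTextStats($stats), PHP_EOL;
```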

parseAndDisplayPreview()

Parse text and display preview HTML for validation.

public static parseAndDisplayPreview(string $text, int $lid) : void

Use this method for the text checking UI. Outputs HTML directly to show parsed sentences and word statistics.

Parameters
$text : string

Text to parse

$lid : int

Language ID

parseAndSave()

Parse text and save to database.

public static parseAndSave(string $text, int $lid, int $textId) : void

Use this method when creating or updating texts. Parses the text and inserts sentences and text items into the database.

Parameters
$text : string

Text to parse

$lid : int

Language ID

$textId : int

Text ID (must be positive)

Tags
throws
InvalidArgumentException

If textId is not positive
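The documented contract (a non-positive text ID raises `InvalidArgumentException`) can be illustrated with a small guard. This is only a sketch of the pattern: `requirePositiveTextId()` is a hypothetical name, and the actual check inside `parseAndSave()` may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical guard mirroring the documented contract of parseAndSave():
 * a text ID that is zero or negative is rejected.
 */
function requirePositiveTextId(int $textId): int
{
    if ($textId <= 0) {
        throw new InvalidArgumentException(
            "Text ID must be positive, got $textId"
        );
    }
    return $textId;
}

// A caller can catch the exception to report invalid input.
try {
    requirePositiveTextId(0);
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```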

splitIntoSentences()

Split text into sentences without database operations.

public static splitIntoSentences(string $text, int $lid) : array<string|int, string>

Use this method when you only need to split text into sentences without saving to the database (e.g., for long text splitting).

Parameters
$text : string

Text to parse

$lid : int

Language ID

Tags
psalm-return

non-empty-list

Return values
array<string|int, string>

Array of sentences
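To make the idea of split-only processing concrete, here is a simplified sketch that splits on whitespace following a sentence-ending character. It is not the library's algorithm: `splitSentencesSketch()` and its `$sentenceEndChars` parameter are hypothetical, and the real method also applies language settings such as exception patterns. The fallback to a single-element list is an assumption made to honor the `non-empty-list` return annotation.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical, simplified sentence splitter (not the TextParsing code).
 *
 * @return non-empty-list<string>
 */
function splitSentencesSketch(string $text, string $sentenceEndChars = '.!?'): array
{
    $text = trim($text);

    // Split on whitespace that directly follows a sentence-ending character.
    $pattern = '/(?<=[' . preg_quote($sentenceEndChars, '/') . '])\s+/u';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // Assumption: always return at least one element, even for empty input.
    if ($parts === false || $parts === []) {
        return [$text];
    }
    return $parts;
}

print_r(splitSentencesSketch('Hello world. How are you? Fine!'));
```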

applyInitialTransformations()

Apply initial text transformations (before display preview).

private static applyInitialTransformations(string $text, bool $splitEachChar) : string
Parameters
$text : string

Raw text

$splitEachChar : bool

Whether to split each character

Return values
string

Text after initial transformations

applyWordSplitting()

Apply word-splitting transformations (after display preview).

private static applyWordSplitting(string $text, string $splitSentence, string $noSentenceEnd, string $termchar) : string
Parameters
$text : string

Text after initial transformations

$splitSentence : string

Sentence split regex

$noSentenceEnd : string

Exception patterns

$termchar : string

Word character regex

Return values
string

Preprocessed text ready for parsing

checkExpressions()

Check a language that contains expressions.

private static checkExpressions(array<string|int, int> $wl) : void
Parameters
$wl : array<string|int, int>

All the distinct expression lengths in the language.

checkValid()

Echo the sentences in a text. Prepare JS data for words and word count.

private static checkValid(int $lid) : void
Parameters
$lid : int

Language ID

displayJapanesePreview()

Display preview HTML for Japanese text.

private static displayJapanesePreview(string $text) : void
Parameters
$text : string

Preprocessed text

displayStandardPreview()

Display preview HTML for standard text.

private static displayStandardPreview(string $text, bool $rtlScript) : void
Parameters
$text : string

Preprocessed text (after initial transformations)

$rtlScript : bool

Whether text is right-to-left

displayStatistics()

Display statistics about a text.

private static displayStatistics(int $lid, bool $rtlScript, bool $multiwords) : void
Parameters
$lid : int

Language ID

$rtlScript : bool

true if language is right-to-left

$multiwords : bool

Whether the text has multi-word expressions to display

getLanguageSettings()

Get language settings for parsing.

private static getLanguageSettings(int $lid) : array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null
Parameters
$lid : int

Language ID

Return values
array{removeSpaces: string, splitSentence: string, noSentenceEnd: string, termchar: string, rtlScript: mixed, splitEachChar: bool}|null

Language settings or null if not found

getMultiWordLengths()

Get all multi-word expression lengths for a language.

private static getMultiWordLengths(int $lid) : array<string|int, int>
Parameters
$lid : int

Language ID

Return values
array<string|int, int>

Array of distinct word counts (e.g., [2, 3] for 2-word and 3-word expressions)
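The shape of the return value can be illustrated without a database. The sketch below derives distinct word counts from an in-memory list of expressions; `distinctWordCounts()` is a hypothetical helper, and splitting on whitespace is an assumption (the real method reads the lengths from stored data for the language).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: collect the distinct word counts of multi-word
 * expressions, ignoring single words.
 *
 * @param list<string> $expressions
 * @return list<int>
 */
function distinctWordCounts(array $expressions): array
{
    $seen = [];
    foreach ($expressions as $expression) {
        // Assumption: words are separated by whitespace.
        $words = preg_split('/\s+/u', trim($expression), -1, PREG_SPLIT_NO_EMPTY) ?: [];
        $count = count($words);
        if ($count > 1) {
            $seen[$count] = true;
        }
    }

    $lengths = array_keys($seen);
    sort($lengths);
    return $lengths;
}

print_r(distinctWordCounts(['of course', 'as well as', 'in fact', 'hello']));
```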

parseJapaneseToDatabase()

Parse Japanese text with MeCab and insert into temp_word_occurrences.

private static parseJapaneseToDatabase(string $text, bool $useMaxSeID) : void
Parameters
$text : string

Preprocessed text

$useMaxSeID : bool

Whether to query for max sentence ID (true for existing texts)

parseStandardToDatabase()

Parse standard text and insert into temp_word_occurrences.

private static parseStandardToDatabase(string $text, string $termchar, string $removeSpaces, bool $useMaxSeID) : void
Parameters
$text : string

Preprocessed text

$termchar : string

Word character regex

$removeSpaces : string

Space removal setting

$useMaxSeID : bool

Whether to query for max sentence ID

registerSentencesTextItems()

Append sentences and text items in the database.

private static registerSentencesTextItems(int $tid, int $lid, bool $hasmultiword) : void

TiSeID in temp_word_occurrences is pre-computed to match future SeID values. When parseStandardToDatabase runs with useMaxSeID=true, it sets TiSeID to MAX(SeID)+1, MAX(SeID)+2, etc. When we insert sentences here, they get those exact SeID values via auto-increment, so TiSeID = SeID.

Parameters
$tid : int

ID of the text whose data is inserted

$lid : int

ID of the language of the text

$hasmultiword : bool

Set to true to insert multi-words as well.
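The TiSeID pre-computation described above can be sketched in isolation. `precomputeSentenceIds()` is a hypothetical helper, and the auto-increment behavior is simulated with a counter. The sketch assumes the next auto-increment value equals MAX(SeID)+1, which holds only when there are no gaps at the end of the sequence and no concurrent inserts.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: compute the sentence IDs that new sentences are
 * expected to receive, i.e. MAX(SeID)+1, MAX(SeID)+2, and so on.
 *
 * @return list<int>
 */
function precomputeSentenceIds(int $maxSeId, int $sentenceCount): array
{
    $ids = [];
    for ($i = 1; $i <= $sentenceCount; $i++) {
        $ids[] = $maxSeId + $i;
    }
    return $ids;
}

$existingMax = 41;  // made-up current MAX(SeID)
$precomputed = precomputeSentenceIds($existingMax, 3);

// Simulated auto-increment: each inserted sentence takes the next counter value.
// Assumption: the counter starts at MAX(SeID)+1.
$nextAutoIncrement = $existingMax + 1;
$assigned = [];
foreach ($precomputed as $unused) {
    $assigned[] = $nextAutoIncrement++;
}

// The pre-computed TiSeID values line up with the assigned SeID values.
var_dump($precomputed === $assigned);
```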

saveWithSql()

Insert a processed text into the database using pure SQL.

private static saveWithSql(string $text, int $id) : void
Parameters
$text : string

Preprocessed text to insert

$id : int

Text ID

saveWithSqlFallback()

Fallback method to insert text data when LOAD DATA LOCAL INFILE is disabled.

private static saveWithSqlFallback(string $text, int $id) : void
Parameters
$text : string

Preprocessed text to insert

$id : int

Text ID

splitJapaneseSentences()

Split Japanese text into sentences (split-only mode).

private static splitJapaneseSentences(string $text) : array<string|int, string>
Parameters
$text : string

Preprocessed text

Tags
psalm-return

non-empty-list

Return values
array<string|int, string>

Array of sentences

splitStandardSentences()

Split standard text into sentences (split-only mode).

private static splitStandardSentences(string $text, string $removeSpaces) : array<string|int, string>
Parameters
$text : string

Preprocessed text

$removeSpaces : string

Space removal setting

Tags
psalm-return

non-empty-list

Return values
array<string|int, string>

Array of sentences


        