ParserConfig
Configuration passed to parsers from language settings.
This value object encapsulates all the language-specific settings that affect how text is parsed into words and sentences.
Table of Contents
Properties
- $characterSubstitutions : string
- $exceptionsSplitSentences : string
- $languageId : int
- $parserOptions : array<string, mixed>
- $regexpSplitSentences : string
- $regexpWordCharacters : string
- $removeSpaces : bool
- $rightToLeft : bool
- $splitEachChar : bool
Methods
- __construct() : mixed
- Create a new parser configuration.
- fromDatabaseRow() : self
- Create configuration from database row array.
- fromLanguage() : self
- Create configuration from a Language entity.
- getCharacterSubstitutions() : string
- Get character substitution rules.
- getExceptionsSplitSentences() : string
- Get exceptions to sentence splitting.
- getLanguageId() : int
- Get the language ID.
- getParserOption() : mixed
- Get a specific parser option.
- getParserOptions() : array<string, mixed>
- Get parser-specific options.
- getRegexpSplitSentences() : string
- Get the sentence split regex pattern.
- getRegexpWordCharacters() : string
- Get the word character regex pattern.
- isRightToLeft() : bool
- Check if text is right-to-left.
- shouldRemoveSpaces() : bool
- Check if spaces should be removed before parsing.
- shouldSplitEachChar() : bool
- Check if each character should be treated as a separate word.
Properties
$characterSubstitutions
private string $characterSubstitutions
$exceptionsSplitSentences
private string $exceptionsSplitSentences
$languageId
private int $languageId
$parserOptions
private array<string, mixed> $parserOptions = []
$regexpSplitSentences
private string $regexpSplitSentences
$regexpWordCharacters
private string $regexpWordCharacters
$removeSpaces
private bool $removeSpaces
$rightToLeft
private bool $rightToLeft
$splitEachChar
private bool $splitEachChar
Methods
__construct()
Create a new parser configuration.
public __construct(int $languageId, string $regexpSplitSentences, string $exceptionsSplitSentences, string $regexpWordCharacters, string $characterSubstitutions, bool $removeSpaces, bool $splitEachChar, bool $rightToLeft[, array<string, mixed> $parserOptions = [] ]) : mixed
Parameters
- $languageId : int
  Language ID
- $regexpSplitSentences : string
  Regex pattern for sentence boundaries
- $exceptionsSplitSentences : string
  Exceptions to sentence splitting
- $regexpWordCharacters : string
  Regex pattern defining word characters
- $characterSubstitutions : string
  Character replacement rules (pipe-separated)
- $removeSpaces : bool
  Whether to remove spaces (CJK languages)
- $splitEachChar : bool
  Whether to split each character (CJK languages)
- $rightToLeft : bool
  Whether text is right-to-left
- $parserOptions : array<string, mixed> = []
  Additional parser-specific options
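With nine parameters, six of them strings or booleans in a row, positional calls are easy to get wrong. The sketch below uses PHP 8 named arguments against a stand-in class, ParserConfigSketch, which only mirrors the documented constructor signature. The stand-in, its three getters, and all argument values are assumptions for illustration; the real class body is not shown in this reference.

```php
<?php
// Stand-in mirroring the documented constructor signature (PHP 8.0+).
// Illustrative only: not the actual ParserConfig implementation.
final class ParserConfigSketch
{
    public function __construct(
        private int $languageId,
        private string $regexpSplitSentences,
        private string $exceptionsSplitSentences,
        private string $regexpWordCharacters,
        private string $characterSubstitutions,
        private bool $removeSpaces,
        private bool $splitEachChar,
        private bool $rightToLeft,
        private array $parserOptions = []
    ) {
    }

    public function getLanguageId(): int
    {
        return $this->languageId;
    }

    public function shouldRemoveSpaces(): bool
    {
        return $this->removeSpaces;
    }

    public function getParserOptions(): array
    {
        return $this->parserOptions;
    }
}

// Named arguments make each setting explicit; values are hypothetical.
$german = new ParserConfigSketch(
    languageId: 1,
    regexpSplitSentences: '.!?',
    exceptionsSplitSentences: 'Dr.|z.B.',
    regexpWordCharacters: 'a-zA-ZäöüÄÖÜß',
    characterSubstitutions: '',
    removeSpaces: false,
    splitEachChar: false,
    rightToLeft: false
);

echo $german->getLanguageId(), "\n"; // prints 1
```

Omitting $parserOptions leaves it at the documented default of an empty array.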
fromDatabaseRow()
Create configuration from database row array.
public static fromDatabaseRow(array<string|int, mixed> $row) : self
Parameters
- $row : array<string|int, mixed>
  Database row with Lg* prefixed columns
Return values
self —Parser configuration
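Database drivers often return every column as a string, so a row-based factory typically has to cast values before passing them on. The following is a minimal sketch of that casting step only. The helper normalizeLanguageRow() and the column names LgID, LgRemoveSpaces, LgSplitEachChar and LgRightToLeft are assumptions; this reference states only that the columns carry an Lg prefix.

```php
<?php
// Column names are ASSUMED for illustration; only the "Lg" prefix is
// documented. This helper is hypothetical, not part of ParserConfig.
function normalizeLanguageRow(array $row): array
{
    return [
        // Missing columns fall back to neutral defaults.
        'languageId'    => (int) ($row['LgID'] ?? 0),
        // The strings "0" and "" cast to false, "1" casts to true.
        'removeSpaces'  => (bool) ($row['LgRemoveSpaces'] ?? false),
        'splitEachChar' => (bool) ($row['LgSplitEachChar'] ?? false),
        'rightToLeft'   => (bool) ($row['LgRightToLeft'] ?? false),
    ];
}

$row = ['LgID' => '3', 'LgRemoveSpaces' => '1', 'LgSplitEachChar' => '0'];
$normalized = normalizeLanguageRow($row);
var_dump($normalized['removeSpaces']); // bool(true)
```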
fromLanguage()
Create configuration from a Language entity.
public static fromLanguage(Language $language) : self
Parameters
- $language : Language
  Language entity
Return values
self —Parser configuration
getCharacterSubstitutions()
Get character substitution rules.
public getCharacterSubstitutions() : string
Pipe-separated list of "from=to" replacements applied before parsing. Example: "ß=ss|ä=ae"
Return values
string —Character substitution rules
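To illustrate the "from=to" format, here is one way a consumer might apply such a rule string to text. The helper applySubstitutions() is hypothetical, and how the actual parsers interpret the rules (for example, whitespace handling or escaping) is not specified here.

```php
<?php
/**
 * Apply pipe-separated "from=to" rules to a text.
 * Hypothetical helper; not part of ParserConfig.
 */
function applySubstitutions(string $text, string $rules): string
{
    $map = [];
    foreach (explode('|', $rules) as $rule) {
        // Split on the first "=" only, so a replacement may contain "=".
        $parts = explode('=', $rule, 2);
        if (count($parts) === 2 && $parts[0] !== '') {
            $map[$parts[0]] = $parts[1];
        }
    }
    // strtr() with an array tries the longest keys first and does not
    // re-scan text it has already replaced.
    return $map === [] ? $text : strtr($text, $map);
}

echo applySubstitutions('Straße und Käse', 'ß=ss|ä=ae'), "\n";
// prints "Strasse und Kaese"
```

Using strtr() rather than chained str_replace() calls avoids one rule rewriting the output of another.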
getExceptionsSplitSentences()
Get exceptions to sentence splitting.
public getExceptionsSplitSentences() : string
Patterns that should not trigger a sentence split even if they contain sentence-ending characters. Example: "Mr.|Dr.|etc."
Return values
string —Exception patterns
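The effect of the exception list can be sketched as follows. This simplified version treats each exception as a literal whitespace-delimited token and the sentence-ending characters as a plain list; both are assumptions, as is the helper name splitSentences(). The real parsers may match exceptions as patterns instead. It relies on the mbstring extension.

```php
<?php
/**
 * Split text into sentences, skipping tokens listed as exceptions.
 * Hypothetical helper; exceptions are compared as literal tokens.
 */
function splitSentences(string $text, string $endChars, string $exceptions): array
{
    $skip = array_filter(
        explode('|', $exceptions),
        fn (string $e): bool => $e !== ''
    );
    $sentences = [];
    $current = [];
    $tokens = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($tokens as $token) {
        $current[] = $token;
        // A token closes a sentence if it ends in a sentence-ending
        // character and is not one of the listed exceptions.
        $last = mb_substr($token, -1);
        $endsSentence = mb_strpos($endChars, $last) !== false;
        if ($endsSentence && !in_array($token, $skip, true)) {
            $sentences[] = implode(' ', $current);
            $current = [];
        }
    }
    if ($current !== []) {
        // Keep trailing text that has no closing punctuation.
        $sentences[] = implode(' ', $current);
    }
    return $sentences;
}

$result = splitSentences('Dr. Smith arrived. He was late!', '.!?', 'Mr.|Dr.|etc.');
echo implode("\n", $result), "\n";
// prints "Dr. Smith arrived." and "He was late!" on separate lines
```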
getLanguageId()
Get the language ID.
public getLanguageId() : int
Return values
int —Language ID
getParserOption()
Get a specific parser option.
public getParserOption(string $key[, mixed $default = null ]) : mixed
Parameters
- $key : string
  Option key
- $default : mixed = null
  Default value if option not set
Return values
mixed —Option value or default
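A keyed lookup with a fallback can be written in more than one way, and the choice matters when an option is explicitly set to null. The sketch below uses array_key_exists(), which returns a stored null instead of the default; an implementation based on the ?? operator would return the default in that case. Which behavior getParserOption() has is not stated in this reference. The helper lookupOption() and the option key are hypothetical, and the mixed type requires PHP 8.0 or later.

```php
<?php
/**
 * Look up an option, falling back to a default when the key is absent.
 * Hypothetical helper; not the actual getParserOption() body.
 */
function lookupOption(array $options, string $key, mixed $default = null): mixed
{
    // array_key_exists() treats a stored null as "set".
    return array_key_exists($key, $options) ? $options[$key] : $default;
}

$options = ['dictionary' => '/tmp/dict', 'mode' => null]; // hypothetical keys

echo lookupOption($options, 'dictionary', 'none'), "\n"; // prints "/tmp/dict"
echo lookupOption($options, 'missing', 'none'), "\n";    // prints "none"
var_dump(lookupOption($options, 'mode', 'none'));        // NULL
```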
getParserOptions()
Get parser-specific options.
public getParserOptions() : array<string, mixed>
Return values
array<string, mixed> —Parser-specific options
getRegexpSplitSentences()
Get the sentence split regex pattern.
public getRegexpSplitSentences() : string
Characters in this pattern mark sentence boundaries. Example: ".!?" for English.
Return values
string —Sentence split regex
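As a rough illustration, a value such as ".!?" could be placed inside a character class and used to split text after each boundary character. This sketch assumes the returned string is valid inside a bracket expression and contains no unescaped "/" or "]"; the helper name splitOnBoundaries() is hypothetical.

```php
<?php
/**
 * Split text after any of the given boundary characters.
 * Hypothetical helper; assumes $boundaryChars is safe inside [...].
 */
function splitOnBoundaries(string $text, string $boundaryChars): array
{
    // The lookbehind keeps the boundary character with its sentence;
    // only the whitespace that follows it is consumed.
    $pattern = '/(?<=[' . $boundaryChars . '])\s+/u';
    $parts = preg_split($pattern, trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on a malformed pattern.
    return $parts === false ? [$text] : $parts;
}

$parts = splitOnBoundaries('Hello there. How are you? Fine!', '.!?');
echo count($parts), "\n"; // prints 3
```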
getRegexpWordCharacters()
Get the word character regex pattern.
public getRegexpWordCharacters() : string
Defines which characters can form words. Everything else is considered a non-word (punctuation, whitespace). Example: "a-zA-Z0-9" for basic English.
Return values
string —Word character regex
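The word/non-word distinction can be demonstrated by wrapping the returned value in a character class and collecting each maximal run of matching characters. This is a sketch under the assumption that the value is a bracket-expression body such as "a-zA-Z0-9"; extractWords() is a hypothetical helper.

```php
<?php
/**
 * Return every maximal run of word characters in the text.
 * Hypothetical helper; assumes $wordChars is safe inside [...].
 */
function extractWords(string $text, string $wordChars): array
{
    $pattern = '/[' . $wordChars . ']+/u';
    if (preg_match_all($pattern, $text, $matches) === false) {
        return []; // malformed pattern
    }
    return $matches[0];
}

echo implode(',', extractWords("It's 9 o'clock, OK?", 'a-zA-Z0-9')), "\n";
// prints "It,s,9,o,clock,OK"
```

Because the apostrophe is not in the character set, "It's" splits into two words; adding it to the pattern would keep such forms together.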
isRightToLeft()
Check if text is right-to-left.
public isRightToLeft() : bool
Return values
bool —True for RTL languages like Arabic, Hebrew
shouldRemoveSpaces()
Check if spaces should be removed before parsing.
public shouldRemoveSpaces() : bool
Used for CJK languages where spaces are not word boundaries.
Return values
bool —True if spaces should be removed
shouldSplitEachChar()
Check if each character should be treated as a separate word.
public shouldSplitEachChar() : bool
Used for Chinese and similar languages without word boundaries.
Return values
bool —True if character-by-character splitting is enabled
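The two CJK-oriented flags can be combined as in the sketch below, which strips spaces first and then splits the remainder into single characters. The helper tokenizeWithFlags() is hypothetical, it removes only the ASCII space (whether the real parsers also strip other whitespace is not documented here), and mb_str_split() requires PHP 7.4 or later with the mbstring extension.

```php
<?php
/**
 * Apply the removeSpaces and splitEachChar flags to a text.
 * Hypothetical helper; not the actual parser logic.
 */
function tokenizeWithFlags(string $text, bool $removeSpaces, bool $splitEachChar): array
{
    if ($removeSpaces) {
        // Only the ASCII space is removed in this sketch.
        $text = str_replace(' ', '', $text);
    }
    if ($splitEachChar) {
        // mb_str_split() splits on characters, not bytes.
        return mb_str_split($text);
    }
    return [$text];
}

echo implode('|', tokenizeWithFlags('你好 世界', true, true)), "\n";
// prints "你|好|世|界"
```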