Documentation

ParserConfig

Configuration derived from language settings and passed to parsers.

This value object encapsulates all the language-specific settings that affect how text is parsed into words and sentences.

Tags
since
3.0.0

Table of Contents

Properties

$characterSubstitutions  : string
$exceptionsSplitSentences  : string
$languageId  : int
$parserOptions  : array<string, mixed>
$regexpSplitSentences  : string
$regexpWordCharacters  : string
$removeSpaces  : bool
$rightToLeft  : bool
$splitEachChar  : bool

Methods

__construct()  : mixed
Create a new parser configuration.
fromDatabaseRow()  : self
Create a configuration from a database row array.
fromLanguage()  : self
Create a configuration from a Language entity.
getCharacterSubstitutions()  : string
Get character substitution rules.
getExceptionsSplitSentences()  : string
Get exceptions to sentence splitting.
getLanguageId()  : int
Get the language ID.
getParserOption()  : mixed
Get a specific parser option.
getParserOptions()  : array<string, mixed>
Get parser-specific options.
getRegexpSplitSentences()  : string
Get the sentence split regex pattern.
getRegexpWordCharacters()  : string
Get the word character regex pattern.
isRightToLeft()  : bool
Check if text is right-to-left.
shouldRemoveSpaces()  : bool
Check if spaces should be removed before parsing.
shouldSplitEachChar()  : bool
Check if each character should be treated as a separate word.

Properties

$characterSubstitutions

private string $characterSubstitutions

$exceptionsSplitSentences

private string $exceptionsSplitSentences

$parserOptions

private array<string, mixed> $parserOptions = []

$regexpSplitSentences

private string $regexpSplitSentences

$regexpWordCharacters

private string $regexpWordCharacters

Methods

__construct()

Create a new parser configuration.

public __construct(int $languageId, string $regexpSplitSentences, string $exceptionsSplitSentences, string $regexpWordCharacters, string $characterSubstitutions, bool $removeSpaces, bool $splitEachChar, bool $rightToLeft[, array<string, mixed> $parserOptions = [] ]) : mixed
Parameters
$languageId : int

Language ID

$regexpSplitSentences : string

Regex pattern for sentence boundaries

$exceptionsSplitSentences : string

Exceptions to sentence splitting

$regexpWordCharacters : string

Regex pattern defining word characters

$characterSubstitutions : string

Character replacement rules (pipe-separated)

$removeSpaces : bool

Whether to remove spaces before parsing (for CJK languages)

$splitEachChar : bool

Whether to treat each character as a separate word (for CJK languages)

$rightToLeft : bool

Whether text is right-to-left

$parserOptions : array<string, mixed> = []

Additional parser-specific options
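
As a minimal sketch of how the nine constructor arguments fit together, the example below builds a configuration with named arguments. The namespace and internals of the real class are not shown on this page, so `ParserConfigSketch` is a hypothetical stand-in that only mirrors the documented signature (using PHP 8.1 readonly promoted properties for brevity), and all argument values are illustrative.

```php
<?php

// Hypothetical stand-in that mirrors the documented constructor signature.
// It is NOT the library class; public readonly properties are used only to
// keep the sketch short.
final class ParserConfigSketch
{
    /** @param array<string, mixed> $parserOptions */
    public function __construct(
        public readonly int $languageId,
        public readonly string $regexpSplitSentences,
        public readonly string $exceptionsSplitSentences,
        public readonly string $regexpWordCharacters,
        public readonly string $characterSubstitutions,
        public readonly bool $removeSpaces,
        public readonly bool $splitEachChar,
        public readonly bool $rightToLeft,
        public readonly array $parserOptions = [],
    ) {
    }
}

// Illustrative values for a space-delimited, left-to-right language.
$config = new ParserConfigSketch(
    languageId: 1,
    regexpSplitSentences: '.!?',
    exceptionsSplitSentences: 'Mr.|Dr.|etc.',
    regexpWordCharacters: 'a-zA-Z0-9',
    characterSubstitutions: 'ß=ss|ä=ae',
    removeSpaces: false,
    splitEachChar: false,
    rightToLeft: false,
);
```

Named arguments make the three adjacent boolean flags harder to mix up than a positional call would.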

fromDatabaseRow()

Create a configuration from a database row array.

public static fromDatabaseRow(array<string|int, mixed> $row) : self
Parameters
$row : array<string|int, mixed>

Database row with Lg*-prefixed columns

Return values
self

Parser configuration

fromLanguage()

Create a configuration from a Language entity.

public static fromLanguage(Language $language) : self
Parameters
$language : Language

Language entity

Return values
self

Parser configuration

getCharacterSubstitutions()

Get character substitution rules.

public getCharacterSubstitutions() : string

Pipe-separated list of "from=to" replacements applied before parsing. Example: "ß=ss|ä=ae"

Return values
string

Character substitution rules
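
The pipe-separated "from=to" format described above could be consumed roughly as follows. This is a sketch under assumptions, not the parser's actual code: the helper names `parseSubstitutionRules` and `applySubstitutions` are hypothetical, and skipping malformed entries is a choice made here.

```php
<?php

/**
 * Hypothetical helper: turn "from=to|from=to" rules into a replacement map.
 *
 * @return array<string, string>
 */
function parseSubstitutionRules(string $rules): array
{
    $map = [];
    foreach (explode('|', $rules) as $rule) {
        // Split on the first "=" only, so a replacement may itself contain "=".
        $parts = explode('=', $rule, 2);
        if (count($parts) !== 2 || $parts[0] === '') {
            continue; // Assumption: empty or malformed entries are skipped.
        }
        $map[$parts[0]] = $parts[1];
    }
    return $map;
}

/** Hypothetical helper: apply the rules to a text before parsing. */
function applySubstitutions(string $text, string $rules): string
{
    $map = parseSubstitutionRules($rules);
    // strtr() with an array replaces all keys in a single pass.
    return $map === [] ? $text : strtr($text, $map);
}

echo applySubstitutions('Straße', 'ß=ss|ä=ae'), PHP_EOL; // prints "Strasse"
```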

getExceptionsSplitSentences()

Get exceptions to sentence splitting.

public getExceptionsSplitSentences() : string

Patterns that should not trigger a sentence split even if they contain sentence-ending characters. Example: "Mr.|Dr.|etc."

Return values
string

Exception patterns

getLanguageId()

Get the language ID.

public getLanguageId() : int
Return values
int

Language ID

getParserOption()

Get a specific parser option.

public getParserOption(string $key[, mixed $default = null ]) : mixed
Parameters
$key : string

Option key

$default : mixed = null

Default value if option not set

Return values
mixed

Option value or default
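
The lookup-with-default behaviour can be pictured with the stand-in below. The function name and the option keys are hypothetical, and whether a key that is present but set to null returns null or the default is not stated here; the sketch assumes the present key wins.

```php
<?php

/**
 * Hypothetical stand-in for a lookup with a default value.
 *
 * @param array<string, mixed> $options
 */
function parserOptionOrDefault(array $options, string $key, mixed $default = null): mixed
{
    // Assumption: a key that exists wins even when its value is null.
    return array_key_exists($key, $options) ? $options[$key] : $default;
}

// Illustrative option keys; they are not taken from the library.
$options = ['mode' => 'strict', 'limit' => null];

var_dump(parserOptionOrDefault($options, 'mode', 'lenient'));    // existing key
var_dump(parserOptionOrDefault($options, 'missing', 'lenient')); // falls back to the default
```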

getParserOptions()

Get parser-specific options.

public getParserOptions() : array<string, mixed>
Return values
array<string, mixed>

Parser-specific options

getRegexpSplitSentences()

Get the sentence split regex pattern.

public getRegexpSplitSentences() : string

Characters in this pattern mark sentence boundaries. Example: ".!?" for English.

Return values
string

Sentence split regex
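
One way to read a value such as ".!?" is as the inside of a regex character class. The sketch below splits on whitespace that follows any of those characters. The function name is hypothetical, the character-class interpretation is an assumption, and the sentence-split exceptions are deliberately ignored here.

```php
<?php

/**
 * Hypothetical helper: split text after any of the sentence-ending characters.
 *
 * @return list<string>
 */
function splitSentencesSketch(string $text, string $sentenceEndChars): array
{
    // Assumption: the setting is character-class content, e.g. ".!?".
    // Only the "~" delimiter is escaped; the rest is treated as regex source.
    $class = str_replace('~', '\~', $sentenceEndChars);

    // Split on whitespace that is directly preceded by a sentence-ending character.
    $parts = preg_split('~(?<=[' . $class . '])\s+~u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for an invalid pattern.
    return $parts === false ? [$text] : $parts;
}

print_r(splitSentencesSketch('Hello world. How are you? Fine!', '.!?'));
// Yields three sentences: "Hello world.", "How are you?", "Fine!"
```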

getRegexpWordCharacters()

Get the word character regex pattern.

public getRegexpWordCharacters() : string

Defines which characters can form words. Everything else is considered a non-word (punctuation, whitespace). Example: "a-zA-Z0-9" for basic English.

Return values
string

Word character regex
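
To illustrate the split between word and non-word characters, the sketch below wraps a value such as "a-zA-Z0-9" in a character class and collects every run of matching characters. The function name is hypothetical, and treating the setting as character-class content is an assumption.

```php
<?php

/**
 * Hypothetical helper: collect runs of word characters from a text.
 *
 * @return list<string>
 */
function extractWordsSketch(string $text, string $wordCharacters): array
{
    // Assumption: the setting is character-class content, e.g. "a-zA-Z0-9".
    // Only the "~" delimiter is escaped, so ranges such as "a-z" stay intact.
    $class = str_replace('~', '\~', $wordCharacters);

    // Everything outside the class (punctuation, whitespace) separates words.
    $count = preg_match_all('~[' . $class . ']+~u', $text, $matches);

    // preg_match_all() returns 0 for no match and false on failure.
    return $count ? $matches[0] : [];
}

print_r(extractWordsSketch('Hello, world 42!', 'a-zA-Z0-9'));
// Yields: "Hello", "world", "42"
```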

isRightToLeft()

Check if text is right-to-left.

public isRightToLeft() : bool
Return values
bool

True for right-to-left languages such as Arabic and Hebrew

shouldRemoveSpaces()

Check if spaces should be removed before parsing.

public shouldRemoveSpaces() : bool

Used for CJK languages, where spaces do not mark word boundaries.

Return values
bool

True if spaces should be removed

shouldSplitEachChar()

Check if each character should be treated as a separate word.

public shouldSplitEachChar() : bool

Used for Chinese and similar languages whose scripts do not mark word boundaries.

Return values
bool

True if character-by-character splitting is enabled
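
The two CJK-oriented flags, space removal and character-by-character splitting, might combine in a pre-tokenization step like the sketch below. The function name is hypothetical, and which whitespace characters count as "spaces" is an assumption (ASCII space and the ideographic space U+3000).

```php
<?php

/**
 * Hypothetical helper: apply the removeSpaces and splitEachChar flags to a text.
 *
 * @return list<string>
 */
function preTokenizeSketch(string $text, bool $removeSpaces, bool $splitEachChar): array
{
    if ($removeSpaces) {
        // Assumption: only the ASCII space and the ideographic space are removed.
        $text = str_replace([' ', "\u{3000}"], '', $text);
    }

    if (!$splitEachChar) {
        return [$text];
    }

    // Split into Unicode characters; the "u" modifier keeps multi-byte
    // UTF-8 sequences together.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return $chars === false ? [] : $chars;
}

print_r(preTokenizeSketch('你 好', true, true));
// Yields two single-character words: "你" and "好"
```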


        