ParserConfig
in package Lwt Modules Language Domain Parser

Configuration passed to parsers from language settings.

This value object encapsulates all the language-specific settings that affect how text is parsed into words and sentences.

Properties

$characterSubstitutions : string
$exceptionsSplitSentences : string
$languageId : int
$parserOptions : array<string, mixed>
$regexpSplitSentences : string
$regexpWordCharacters : string
$removeSpaces : bool
$rightToLeft : bool
$splitEachChar : bool

Methods

__construct() : mixed: Create a new parser configuration.
fromDatabaseRow() : self: Create configuration from database row array.
fromLanguage() : self: Create configuration from a Language entity.
getCharacterSubstitutions() : string: Get character substitution rules.
getExceptionsSplitSentences() : string: Get exceptions to sentence splitting.
getLanguageId() : int: Get the language ID.
getParserOption() : mixed: Get a specific parser option.
getParserOptions() : array<string, mixed>: Get parser-specific options.
getRegexpSplitSentences() : string: Get the sentence split regex pattern.
getRegexpWordCharacters() : string: Get the word character regex pattern.
isRightToLeft() : bool: Check if text is right-to-left.
shouldRemoveSpaces() : bool: Check if spaces should be removed before parsing.
shouldSplitEachChar() : bool: Check if each character should be treated as a separate word.

$characterSubstitutions

    private string $characterSubstitutions

$exceptionsSplitSentences

    private string $exceptionsSplitSentences

$languageId

    private int $languageId

$parserOptions

    private array<string, mixed> $parserOptions = []

$regexpSplitSentences

    private string $regexpSplitSentences

$regexpWordCharacters

    private string $regexpWordCharacters

$removeSpaces

    private bool $removeSpaces

$rightToLeft

    private bool $rightToLeft

$splitEachChar

    private bool $splitEachChar

__construct()

Create a new parser configuration.


    public __construct(int $languageId, string $regexpSplitSentences, string $exceptionsSplitSentences, string $regexpWordCharacters, string $characterSubstitutions, bool $removeSpaces, bool $splitEachChar, bool $rightToLeft[, array<string, mixed> $parserOptions = [] ]) : mixed

Parameters

$languageId : int: Language ID
$regexpSplitSentences : string: Regex pattern for sentence boundaries
$exceptionsSplitSentences : string: Exceptions to sentence splitting
$regexpWordCharacters : string: Regex pattern defining word characters
$characterSubstitutions : string: Character replacement rules (pipe-separated)
$removeSpaces : bool: Whether to remove spaces (CJK languages)
$splitEachChar : bool: Whether to split each character (CJK languages)
$rightToLeft : bool: Whether text is right-to-left
$parserOptions : array<string, mixed> = []: Additional parser-specific options

fromDatabaseRow()

Create configuration from database row array.


    public static fromDatabaseRow(array<string|int, mixed> $row) : self

Parameters

$row : array<string|int, mixed>: Database row with Lg* prefixed columns

Return values

self: Parser configuration

fromLanguage()

Create configuration from a Language entity.


    public static fromLanguage(Language $language) : self

Parameters

$language : Language: Language entity

Return values

self: Parser configuration

getCharacterSubstitutions()

Get character substitution rules.


    public getCharacterSubstitutions() : string

Pipe-separated list of "from=to" replacements applied before parsing. Example: "ß=ss|ä=ae".

Return values

string: Character substitution rules
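
To make the "from=to" rule format concrete, here is a minimal sketch of how such a string could be applied to a text. The helper names parseSubstitutionRules() and applySubstitutions() are hypothetical and not part of ParserConfig, and the actual parser may process the rules differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ParserConfig): turn a
 * "from=to|from=to" rule string into a lookup map.
 *
 * @return array<string, string>
 */
function parseSubstitutionRules(string $rules): array
{
    $map = [];
    foreach (explode('|', $rules) as $rule) {
        // Split on the first "=" only, so a replacement may itself contain "=".
        $parts = explode('=', $rule, 2);
        if (count($parts) !== 2 || $parts[0] === '') {
            continue; // Skip empty or malformed entries.
        }
        $map[$parts[0]] = $parts[1];
    }
    return $map;
}

/** Hypothetical helper: apply the rules to a text before parsing. */
function applySubstitutions(string $text, string $rules): string
{
    $map = parseSubstitutionRules($rules);
    // strtr() with an array never re-scans text it has already replaced.
    return $map === [] ? $text : strtr($text, $map);
}

echo applySubstitutions('Straße und Bär', 'ß=ss|ä=ae'), PHP_EOL; // Strasse und Baer
```

In real use the rule string would come from getCharacterSubstitutions() rather than a literal.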

getExceptionsSplitSentences()

Get exceptions to sentence splitting.


    public getExceptionsSplitSentences() : string

Patterns that should not trigger a sentence split even if they contain sentence-ending characters. Example: "Mr.|Dr.|etc."

Return values

string: Exception patterns
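
The interaction between sentence-ending characters and exceptions can be sketched as follows. This is an illustration only: splitWithExceptions() is a hypothetical helper, and it assumes the exceptions are literal, pipe-separated tokens, whereas the real parser may treat them as regular expressions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: end a sentence after a token that finishes with a
 * sentence-ending character, unless the token is a listed exception.
 *
 * @return list<string>
 */
function splitWithExceptions(string $text, string $splitChars, string $exceptions): array
{
    // Assumption: exceptions are literal tokens separated by "|".
    $exceptionList = array_values(array_filter(
        explode('|', $exceptions),
        static fn (string $e): bool => $e !== ''
    ));
    // Break the split characters into single UTF-8 characters.
    $splitList = preg_split('//u', $splitChars, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $tokens = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $sentences = [];
    $current = [];
    foreach ($tokens as $token) {
        $current[] = $token;
        $endsWithSplitChar = false;
        foreach ($splitList as $char) {
            if (str_ends_with($token, $char)) {
                $endsWithSplitChar = true;
                break;
            }
        }
        if ($endsWithSplitChar && !in_array($token, $exceptionList, true)) {
            $sentences[] = implode(' ', $current);
            $current = [];
        }
    }
    if ($current !== []) {
        $sentences[] = implode(' ', $current); // Trailing text without a terminator.
    }
    return $sentences;
}

print_r(splitWithExceptions('Dr. Smith arrived. He was late!', '.!?', 'Mr.|Dr.|etc.'));
// "Dr." is an exception, so the result has two sentences:
// "Dr. Smith arrived." and "He was late!"
```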

getLanguageId()

Get the language ID.


    public getLanguageId() : int

Return values

int: Language ID

getParserOption()

Get a specific parser option.


    public getParserOption(string $key[, mixed $default = null ]) : mixed

Parameters

$key : string: Option key
$default : mixed = null: Default value if option not set

Return values

mixed: Option value or default
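
The "value or default" behaviour can be illustrated with a standalone function. lookupOption() and the option key 'mode' are hypothetical; the sketch assumes that an option explicitly set to null counts as set, which the documentation above does not confirm.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the lookup performed by getParserOption().
 *
 * @param array<string, mixed> $options
 */
function lookupOption(array $options, string $key, mixed $default = null): mixed
{
    // array_key_exists() treats a key holding null as "set";
    // the shorter ($options[$key] ?? $default) would return the default instead.
    return array_key_exists($key, $options) ? $options[$key] : $default;
}

var_dump(lookupOption(['mode' => 'strict'], 'mode', 'lenient')); // string(6) "strict"
var_dump(lookupOption([], 'mode', 'lenient'));                   // string(7) "lenient"
```

Which of the two semantics the real method follows matters only for options that may legitimately be null.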

getParserOptions()

Get parser-specific options.


    public getParserOptions() : array<string, mixed>

Return values

array<string, mixed>: Parser-specific options

getRegexpSplitSentences()

Get the sentence split regex pattern.


    public getRegexpSplitSentences() : string

Characters in this pattern mark sentence boundaries. Example: ".!?" for English.

Return values

string: Sentence split regex
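
As a rough illustration of how a set of sentence-ending characters such as ".!?" can drive splitting, the sketch below builds a character class from it. splitSentences() is a hypothetical helper, and it assumes the value is a plain list of literal characters; a stored value that contains regex syntax would need different handling.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: split a text after any of the given
 * sentence-ending characters when whitespace follows.
 *
 * @return list<string>
 */
function splitSentences(string $text, string $splitChars): array
{
    $trimmed = trim($text);
    if ($trimmed === '') {
        return [];
    }
    if ($splitChars === '') {
        return [$trimmed]; // Nothing to split on.
    }
    // Assumption: the characters are literals, so quote them for the class.
    $class = '[' . preg_quote($splitChars, '/') . ']';
    // The lookbehind keeps the terminator attached to its sentence.
    $parts = preg_split('/(?<=' . $class . ')\s+/u', $trimmed, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

print_r(splitSentences('Hello there. How are you? Fine!', '.!?'));
// Three sentences: "Hello there.", "How are you?", "Fine!"
```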

getRegexpWordCharacters()

Get the word character regex pattern.


    public getRegexpWordCharacters() : string

Defines which characters can form words. Everything else is considered a non-word (punctuation, whitespace). Example: "a-zA-Z0-9" for basic English.

Return values

string: Word character regex
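
The example value "a-zA-Z0-9" reads like the body of a regex character class. Under that assumption, word extraction might look like the sketch below; extractWords() is a hypothetical helper, and the code further assumes the value contains no unescaped "/" delimiter.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: collect every run of word characters in a text.
 *
 * @return list<string>
 */
function extractWords(string $text, string $wordChars): array
{
    if ($wordChars === '') {
        return [];
    }
    // The value is placed inside [...] unquoted, so ranges such as "a-z" stay active.
    $pattern = '/[' . $wordChars . ']+/u';
    if (preg_match_all($pattern, $text, $matches) === false) {
        return []; // Invalid pattern or invalid UTF-8 input.
    }
    return $matches[0];
}

print_r(extractWords("Hello, world! It's 2024.", 'a-zA-Z0-9'));
// The apostrophe is not a word character, so "It's" yields "It" and "s".
```

This also shows why a language with apostrophes or hyphens inside words would normally list those characters in the pattern.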

isRightToLeft()

Check if text is right-to-left.


    public isRightToLeft() : bool

Return values

bool: True for right-to-left languages such as Arabic and Hebrew

shouldRemoveSpaces()

Check if spaces should be removed before parsing.


    public shouldRemoveSpaces() : bool

Used for CJK languages where spaces are not word boundaries.

Return values

bool: True if spaces should be removed

shouldSplitEachChar()

Check if each character should be treated as a separate word.


    public shouldSplitEachChar() : bool

Used for Chinese and similar languages without word boundaries.

Return values

bool: True if character-by-character splitting is enabled
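
The two CJK-related flags can be pictured together in a small sketch. tokenizeCjk() is a hypothetical function, not part of this class; it assumes that "removing spaces" means dropping ASCII spaces only and that each UTF-8 character becomes one token.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: apply the removeSpaces and splitEachChar flags.
 *
 * @return list<string>
 */
function tokenizeCjk(string $text, bool $removeSpaces, bool $splitEachChar): array
{
    if ($removeSpaces) {
        // Assumption: only the ASCII space is removed.
        $text = str_replace(' ', '', $text);
    }
    if (!$splitEachChar) {
        return $text === '' ? [] : [$text];
    }
    // An empty pattern with the /u modifier splits between UTF-8 characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

print_r(tokenizeCjk('你 好', true, true)); // two tokens: "你" and "好"
```

In real use the flags would be read from shouldRemoveSpaces() and shouldSplitEachChar().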
