DictionaryLemmatizer
implements LemmatizerInterface
Lemmatizer that uses dictionary files for lookup.
Dictionary files are in TSV format with two columns: word_form and lemma. Files are loaded from data/lemma-dictionaries/{lang}_lemmas.tsv.
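As a rough illustration of the file format described above, the sketch below parses TSV text with two tab-separated columns into a word_form => lemma map. The helper name `parseLemmaTsv` and the sample entries are hypothetical, and skipping blank or malformed rows is an assumption; this page does not say how the class itself treats such lines.

```php
<?php
declare(strict_types=1);

/**
 * Parse TSV text (word_form<TAB>lemma per line) into a lookup map.
 * Hypothetical helper, not part of DictionaryLemmatizer.
 *
 * @return array<string, string>
 */
function parseLemmaTsv(string $tsv): array
{
    $map = [];
    foreach (preg_split('/\R/', $tsv) as $line) {
        $parts = explode("\t", $line);
        // Assumption: rows without exactly two non-empty columns are skipped.
        if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
            continue;
        }
        $map[$parts[0]] = $parts[1];
    }
    return $map;
}

// Sample content: two valid rows, one blank line, one row with no tab.
$sample = "running\trun\nmice\tmouse\n\nbroken-line\n";
$dictionary = parseLemmaTsv($sample);
```

Only the two well-formed rows end up in `$dictionary`; the blank line and the row without a tab are dropped.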
Table of Contents
Interfaces
- LemmatizerInterface
- Interface for lemmatization strategies.
Properties
- $availableLanguages : array<string|int, string>|null
- List of available dictionaries (language codes with dictionary files).
- $dictionaries : array<string, array<string, string>>
- Loaded dictionaries keyed by language code.
- $dictionaryPath : string
- Base directory for dictionary files.
Methods
- __construct() : mixed
- Constructor.
- clearCache() : void
- Clear all loaded dictionaries from memory.
- getDictionaryPath() : string
- Get the dictionary path.
- getStatistics() : array<string, array{entries: int, file_size: int|false}>
- Get statistics about loaded dictionaries.
- getSupportedLanguages() : array<string|int, string>
- Get the list of supported language codes.
- lemmatize() : string|null
- Find the lemma (base form) of a word.
- lemmatizeBatch() : array<string, string|null>
- Lemmatize multiple words in batch.
- loadDictionary() : bool
- Load a dictionary file for a language.
- supportsLanguage() : bool
- Check if this lemmatizer supports a given language.
- dictionaryFileExists() : bool
- Check if a dictionary file exists.
- ensureDictionaryLoaded() : void
- Ensure a dictionary is loaded.
- getDefaultDictionaryPath() : string
- Get the default dictionary path.
- getDictionaryFilePath() : string
- Get the file path for a language dictionary.
- normalizeLanguageCode() : string
- Normalize a language code to standard format.
- parseDictionaryLine() : void
- Parse a single dictionary line.
Properties
$availableLanguages
List of available dictionaries (language codes with dictionary files).
private
array<string|int, string>|null
$availableLanguages
= null
$dictionaries
Loaded dictionaries keyed by language code.
private
array<string, array<string, string>>
$dictionaries
= []
$dictionaryPath
Base directory for dictionary files.
private
string
$dictionaryPath
Methods
__construct()
Constructor.
public
__construct([string|null $dictionaryPath = null ]) : mixed
Parameters
- $dictionaryPath : string|null = null
- Base path for dictionary files
clearCache()
Clear all loaded dictionaries from memory.
public
clearCache() : void
getDictionaryPath()
Get the dictionary path.
public
getDictionaryPath() : string
Return values
string
getStatistics()
Get statistics about loaded dictionaries.
public
getStatistics() : array<string, array{entries: int, file_size: int|false}>
Return values
array<string, array{entries: int, file_size: int|false}>
getSupportedLanguages()
Get the list of supported language codes.
public
getSupportedLanguages() : array<string|int, string>
Return values
array<string|int, string> —Array of ISO language codes
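One plausible way to derive such a list is to scan the dictionary directory for files matching the `{lang}_lemmas.tsv` naming pattern. The sketch below does that with a hypothetical helper, `listDictionaryLanguages`. Whether the class actually discovers languages by scanning the directory is an assumption, not something this page confirms.

```php
<?php
declare(strict_types=1);

/**
 * List language codes that have a "{lang}_lemmas.tsv" file in a directory.
 * Hypothetical helper; directory scanning is an assumed strategy.
 *
 * @return list<string>
 */
function listDictionaryLanguages(string $dictionaryPath): array
{
    $codes = [];
    $pattern = rtrim($dictionaryPath, '/\\') . '/*_lemmas.tsv';
    // glob() returns false on error, so fall back to an empty list.
    foreach (glob($pattern) ?: [] as $file) {
        // basename() with a suffix strips "_lemmas.tsv", leaving the code.
        $codes[] = basename($file, '_lemmas.tsv');
    }
    sort($codes);
    return $codes;
}
```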
lemmatize()
Find the lemma (base form) of a word.
public
lemmatize(string $word, string $languageCode) : string|null
Parameters
- $word : string
- The word to lemmatize
- $languageCode : string
- ISO language code (e.g., 'en', 'de', 'fr')
Return values
string|null —The lemma, or null if not found
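The lemma-or-null contract can be pictured with a small stand-in that reads from the same nested shape as the `$dictionaries` property (language code => word_form => lemma). The function `lookupLemma` and the sample data are hypothetical. The sketch does not model language code normalization, lazy loading, or any case handling the class may perform.

```php
<?php
declare(strict_types=1);

/**
 * Stand-in for a dictionary lookup: returns the lemma, or null if unknown.
 *
 * @param array<string, array<string, string>> $dictionaries language code => (word_form => lemma)
 */
function lookupLemma(array $dictionaries, string $word, string $languageCode): ?string
{
    return $dictionaries[$languageCode][$word] ?? null;
}

// Sample data only; real entries would come from the TSV files.
$dictionaries = ['en' => ['running' => 'run', 'mice' => 'mouse']];

$lemma = lookupLemma($dictionaries, 'mice', 'en');
$missing = lookupLemma($dictionaries, 'xyzzy', 'en');

// A caller might fall back to the surface form when no lemma is known.
$token = lookupLemma($dictionaries, 'xyzzy', 'en') ?? 'xyzzy';
```

Because a miss is reported as null rather than as the input word, the caller decides whether to keep the original token, try another strategy, or drop it.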
lemmatizeBatch()
Lemmatize multiple words in batch.
public
lemmatizeBatch(array<string|int, mixed> $words, string $languageCode) : array<string, string|null>
Parameters
- $words : array<string|int, mixed>
- Array of words to lemmatize
- $languageCode : string
- ISO language code
Return values
array<string, string|null> —Word => lemma mapping
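The shape of the returned mapping can be sketched as follows, using a hypothetical `lookupBatch` helper over a single in-memory dictionary. This only illustrates the word => lemma structure and is not the class's implementation.

```php
<?php
declare(strict_types=1);

/**
 * Build a word => lemma map, with null for words that have no entry.
 *
 * @param array<string, string> $dictionary word_form => lemma (sample data)
 * @param list<string>          $words
 * @return array<string, string|null>
 */
function lookupBatch(array $dictionary, array $words): array
{
    $result = [];
    foreach ($words as $word) {
        $result[$word] = $dictionary[$word] ?? null;
    }
    return $result;
}

$result = lookupBatch(['mice' => 'mouse'], ['mice', 'xyzzy', 'mice']);
```

Since the result is keyed by word, repeated inputs collapse into a single key. Note also that PHP converts purely numeric string keys such as "123" to integers.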
loadDictionary()
Load a dictionary file for a language.
public
loadDictionary(string $languageCode) : bool
Parameters
- $languageCode : string
- Normalized language code
Return values
bool —True if loaded successfully
supportsLanguage()
Check if this lemmatizer supports a given language.
public
supportsLanguage(string $languageCode) : bool
Parameters
- $languageCode : string
- ISO language code
Return values
bool —True if the language is supported
dictionaryFileExists()
Check if a dictionary file exists.
private
dictionaryFileExists(string $languageCode) : bool
Parameters
- $languageCode : string
- The language code
Return values
bool
ensureDictionaryLoaded()
Ensure a dictionary is loaded.
private
ensureDictionaryLoaded(string $languageCode) : void
Parameters
- $languageCode : string
- The language code
getDefaultDictionaryPath()
Get the default dictionary path.
private
getDefaultDictionaryPath() : string
Return values
string
getDictionaryFilePath()
Get the file path for a language dictionary.
private
getDictionaryFilePath(string $languageCode) : string
Parameters
- $languageCode : string
- The language code
Return values
string —The file path
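Given the `{lang}_lemmas.tsv` naming convention stated in the class description, the path is presumably built along the following lines. The function name `dictionaryFilePathSketch` is hypothetical, and trimming a trailing separator from the base path is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Compose "<base>/<lang>_lemmas.tsv" from a base directory and a language code.
 * Hypothetical helper illustrating the documented naming convention.
 */
function dictionaryFilePathSketch(string $basePath, string $languageCode): string
{
    // Assumption: a trailing slash or backslash on the base path is tolerated.
    return rtrim($basePath, '/\\') . '/' . $languageCode . '_lemmas.tsv';
}

$path = dictionaryFilePathSketch('data/lemma-dictionaries/', 'en');
```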
normalizeLanguageCode()
Normalize a language code to standard format.
private
normalizeLanguageCode(string $code) : string
Handles variations such as "en-US" -> "en" and "eng" -> "en".
Parameters
- $code : string
- The language code
Return values
string —Normalized code
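The two variations mentioned above (a region subtag and a three-letter code) could be handled roughly as in this sketch. The function name `normalizeCodeSketch` and the small three-letter table are hypothetical; the mapping the class actually uses is not shown on this page.

```php
<?php
declare(strict_types=1);

/**
 * Reduce a language tag to a two-letter code where possible.
 * Hypothetical helper; the real normalization rules may differ.
 */
function normalizeCodeSketch(string $code): string
{
    // Illustrative ISO 639-2 to ISO 639-1 entries only, not a complete table.
    $threeToTwo = [
        'eng' => 'en',
        'deu' => 'de',
        'ger' => 'de',
        'fra' => 'fr',
        'fre' => 'fr',
    ];

    $code = strtolower(trim($code));
    // Keep only the primary subtag: "en-us" or "en_us" becomes "en".
    $primary = preg_split('/[-_]/', $code)[0];

    // Unknown codes pass through unchanged.
    return $threeToTwo[$primary] ?? $primary;
}
```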
parseDictionaryLine()
Parse a single dictionary line.
private
parseDictionaryLine(string $line, string $languageCode) : void
Parameters
- $line : string
- The line to parse
- $languageCode : string
- The language code