LemmaService
Service for managing lemmatization of vocabulary items.
Provides methods for:
- Suggesting lemmas for new words
- Batch lemmatization of existing vocabulary
- Word family queries
- NLP integration via factory pattern
Table of Contents
Properties
- $lemmatizer : LemmatizerInterface
- $repository : MySqlTermRepository
Methods
- __construct() : mixed
- Constructor.
- applyLemmasToVocabulary() : array{processed: int, updated: int, skipped: int}
- Apply lemmas to existing vocabulary for a language.
- bulkUpdateTermStatus() : int
- Apply status to multiple terms (for bulk family updates).
- clearLemmas() : int
- Clear all lemmas for a language.
- findPotentialLemmaGroups() : array<int, array{base: string, variants: string[]}>
- Find terms that might benefit from lemmatization.
- findWordIdByLemma() : int|null
- Find a word ID by its lemma.
- getAllNlpLanguages() : array<string|int, string>
- Get all languages potentially supported by NLP (including uninstalled models).
- getAvailableLanguages() : array<string|int, string>
- Get all languages with available lemmatization support.
- getLemmaAggregateStats() : array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}
- Get aggregate lemma statistics for a language.
- getLemmaStatistics() : array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}
- Get lemma statistics for a language.
- getLemmatizerByType() : LemmatizerInterface
- Get a lemmatizer by type.
- getLemmatizerForLanguage() : LemmatizerInterface
- Get the best available lemmatizer for a language.
- getNlpSupportedLanguages() : array<string|int, string>
- Get languages supported by the NLP service.
- getSuggestedFamilyUpdate() : array{suggestion: string, affected_count: int, term_ids: int[]}
- Suggest status update for related forms when one form's status changes.
- getUnmatchedStatistics() : array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}
- Get statistics about unmatched text items that could benefit from lemma linking.
- getWordFamilies() : array<string, array{lemma: string, count: int, terms: string[]}>
- Get words grouped by their lemma.
- getWordFamily() : array<string|int, Term>
- Get the word family (all words sharing a lemma).
- getWordFamilyByLemma() : array<string|int, mixed>|null
- Get word family by lemma directly (without requiring a term ID).
- getWordFamilyDetails() : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
- Get detailed word family information for a term.
- getWordFamilyList() : array{families: array, pagination: array}
- Get paginated list of word families for a language.
- isAvailableForLanguage() : bool
- Check if lemmatization is available for a language.
- isNlpServiceAvailable() : bool
- Check if NLP service (spaCy) is available.
- linkTextItemsByLemma() : array{linked: int, unmatched: int, errors: int}
- Link unmatched text items to words by lemma.
- linkTextItemsByLemmaSql() : int
- Link text items directly using SQL (efficient for large datasets).
- propagateLemma() : int
- Copy lemma from one term to all related terms.
- setLemma() : bool
- Set lemma for a specific term.
- suggestLemma() : string|null
- Suggest a lemma for a word.
- suggestLemmasBatch() : array<string, string|null>
- Suggest lemmas for multiple words.
- updateWordFamilyStatus() : int
- Update status for all words in a word family.
- buildSingleTermFamily() : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
- Build a "family" response for a term without a lemma.
- fetchTermsWithoutLemma() : array<int, array<string, mixed>>
- Fetch terms without a lemma.
- fetchUnmatchedTextItems() : array<int, array<string, mixed>>
- Fetch unmatched text items (Ti2WoID IS NULL) for a language.
- getWordOccurrenceCount() : int
- Get occurrence count for a word across all texts.
- linkItemsToWord() : int
- Link text items to a word.
- updateTermLemma() : void
- Update the lemma for a term.
Properties
$lemmatizer
private
LemmatizerInterface $lemmatizer
$repository
private
MySqlTermRepository $repository
Methods
__construct()
Constructor.
public
__construct([LemmatizerInterface|null $lemmatizer = null ][, MySqlTermRepository|null $repository = null ]) : mixed
Parameters
- $lemmatizer : LemmatizerInterface|null = null
-
Lemmatizer implementation
- $repository : MySqlTermRepository|null = null
-
Term repository
applyLemmasToVocabulary()
Apply lemmas to existing vocabulary for a language.
public
applyLemmasToVocabulary(int $languageId, string $languageCode[, int $batchSize = 100 ]) : array{processed: int, updated: int, skipped: int}
Parameters
- $languageId : int
-
Language ID
- $languageCode : string
-
ISO language code for lemmatizer
- $batchSize : int = 100
-
Number of words to process per batch
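The batch pass can be pictured with a small in-memory sketch. It is not the service's implementation: the function name, the term array shape, and the `$lemmatize` callable are hypothetical stand-ins, and it assumes that terms which already have a lemma are left alone and that a word with no suggestion counts as skipped.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of a batched lemma pass over in-memory terms.
 *
 * @param array<int, array{text: string, lemma: ?string}> $terms keyed by term ID
 * @param callable(string): ?string $lemmatize stand-in for a lemmatizer
 * @return array{terms: array, stats: array{processed: int, updated: int, skipped: int}}
 */
function sketchApplyLemmas(array $terms, callable $lemmatize, int $batchSize = 100): array
{
    $stats = ['processed' => 0, 'updated' => 0, 'skipped' => 0];

    // Assumption: only terms without a lemma are candidates.
    $pending = array_filter($terms, fn (array $t): bool => $t['lemma'] === null);

    // Work through the candidates in fixed-size batches, keeping term IDs as keys.
    foreach (array_chunk($pending, max(1, $batchSize), true) as $batch) {
        foreach ($batch as $id => $term) {
            $stats['processed']++;
            $lemma = $lemmatize($term['text']);
            if ($lemma === null) {
                $stats['skipped']++; // no suggestion for this word
                continue;
            }
            $terms[$id]['lemma'] = $lemma;
            $stats['updated']++;
        }
    }

    return ['terms' => $terms, 'stats' => $stats];
}

$dictionary = ['runs' => 'run'];
$applied = sketchApplyLemmas(
    [
        1 => ['text' => 'runs', 'lemma' => null],
        2 => ['text' => 'xyz', 'lemma' => null],
        3 => ['text' => 'run', 'lemma' => 'run'],
    ],
    fn (string $word): ?string => $dictionary[$word] ?? null,
    1
);
```

A real pass would more likely hand each batch to the lemmatizer in one call; the per-word call here only keeps the sketch short.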
Return values
array{processed: int, updated: int, skipped: int}
bulkUpdateTermStatus()
Apply status to multiple terms (for bulk family updates).
public
bulkUpdateTermStatus(array<string|int, int> $termIds, int $status) : int
Parameters
- $termIds : array<string|int, int>
-
Term IDs to update
- $status : int
-
New status
Return values
int —Number of terms updated
clearLemmas()
Clear all lemmas for a language.
public
clearLemmas(int $languageId) : int
Parameters
- $languageId : int
-
Language ID
Return values
int —Number of terms affected
findPotentialLemmaGroups()
Find terms that might benefit from lemmatization.
public
findPotentialLemmaGroups(int $languageId[, int $limit = 20 ]) : array<int, array{base: string, variants: string[]}>
Identifies terms with similar text that could share a lemma.
Parameters
- $languageId : int
-
Language ID
- $limit : int = 20
-
Maximum suggestions
Return values
array<int, array{base: string, variants: string[]}>
findWordIdByLemma()
Find a word ID by its lemma.
public
findWordIdByLemma(int $languageId, string $lemmaLc) : int|null
Returns the word that has this lemma (preferring the base form).
Parameters
- $languageId : int
-
Language ID
- $lemmaLc : string
-
Lowercase lemma to match
Return values
int|null —Word ID or null if not found
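One way to read "preferring the base form" is sketched below. This is an interpretation, not the documented behavior: it assumes the base form is the word whose own lowercase text equals the lemma, and the function name and row shape are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: pick a word ID for a lemma, preferring the base form.
 *
 * @param list<array{id: int, textLc: string, lemmaLc: ?string}> $words
 */
function sketchFindWordIdByLemma(array $words, string $lemmaLc): ?int
{
    $fallback = null;
    foreach ($words as $word) {
        if ($word['lemmaLc'] !== $lemmaLc) {
            continue;
        }
        // Assumption: the base form is the word whose text is the lemma itself.
        if ($word['textLc'] === $lemmaLc) {
            return $word['id'];
        }
        // Otherwise remember the first inflected form as a fallback.
        $fallback ??= $word['id'];
    }
    return $fallback;
}

$sampleWords = [
    ['id' => 1, 'textLc' => 'runs', 'lemmaLc' => 'run'],
    ['id' => 2, 'textLc' => 'run', 'lemmaLc' => 'run'],
    ['id' => 3, 'textLc' => 'walked', 'lemmaLc' => 'walk'],
];
$baseFormId = sketchFindWordIdByLemma($sampleWords, 'run');
$inflectedId = sketchFindWordIdByLemma($sampleWords, 'walk');
$missingId = sketchFindWordIdByLemma($sampleWords, 'jump');
```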
getAllNlpLanguages()
Get all languages potentially supported by NLP (including uninstalled models).
public
getAllNlpLanguages() : array<string|int, string>
Return values
array<string|int, string>
getAvailableLanguages()
Get all languages with available lemmatization support.
public
getAvailableLanguages() : array<string|int, string>
Return values
array<string|int, string> —Array of language codes
getLemmaAggregateStats()
Get aggregate lemma statistics for a language.
public
getLemmaAggregateStats(int $languageId) : array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}
Parameters
- $languageId : int
-
Language ID
Return values
array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}
getLemmaStatistics()
Get lemma statistics for a language.
public
getLemmaStatistics(int $languageId) : array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}
Parameters
- $languageId : int
-
Language ID
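The four counters in the return shape relate to each other as in the following sketch. It works on an in-memory list rather than the database; the function name and row shape are hypothetical, and it assumes an empty string counts as "no lemma".

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the lemma counters for one language.
 *
 * @param list<array{text: string, lemmaLc: ?string}> $terms
 * @return array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}
 */
function sketchLemmaStatistics(array $terms): array
{
    $lemmas = [];
    foreach ($terms as $term) {
        // Assumption: null and '' both mean the term has no lemma.
        if ($term['lemmaLc'] !== null && $term['lemmaLc'] !== '') {
            $lemmas[] = $term['lemmaLc'];
        }
    }

    $total = count($terms);
    $withLemma = count($lemmas);

    return [
        'total_terms' => $total,
        'with_lemma' => $withLemma,
        'without_lemma' => $total - $withLemma,
        'unique_lemmas' => count(array_unique($lemmas)),
    ];
}

$lemmaStats = sketchLemmaStatistics([
    ['text' => 'runs', 'lemmaLc' => 'run'],
    ['text' => 'ran', 'lemmaLc' => 'run'],
    ['text' => 'walks', 'lemmaLc' => 'walk'],
    ['text' => 'xyz', 'lemmaLc' => null],
    ['text' => 'foo', 'lemmaLc' => ''],
]);
```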
Return values
array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}
getLemmatizerByType()
Get a lemmatizer by type.
public
getLemmatizerByType(string $type) : LemmatizerInterface
Parameters
- $type : string
-
Lemmatizer type ('dictionary', 'spacy', 'hybrid')
Return values
LemmatizerInterface
getLemmatizerForLanguage()
Get the best available lemmatizer for a language.
public
getLemmatizerForLanguage(string $languageCode) : LemmatizerInterface
Uses the LemmatizerFactory to select the appropriate lemmatizer based on language configuration and availability.
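The factory-based selection can be illustrated with a self-contained sketch. Every name here (`SketchLemmatizer`, `SketchLemmatizerFactory`, the `lemmatize()` method) is a hypothetical stand-in, since this page does not list the methods of LemmatizerInterface or LemmatizerFactory; the null-object fallback is likewise an assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in: the real LemmatizerInterface methods are not listed on this page.
interface SketchLemmatizer
{
    public function lemmatize(string $word): ?string;
}

// Dictionary-backed lemmatizer: looks up a lowercase form in a fixed map.
final class SketchDictionaryLemmatizer implements SketchLemmatizer
{
    /** @param array<string, string> $entries lowercase form => lemma */
    public function __construct(private array $entries)
    {
    }

    public function lemmatize(string $word): ?string
    {
        // strtolower() is ASCII-only; real text would need multibyte handling.
        return $this->entries[strtolower($word)] ?? null;
    }
}

// Fallback returned when no lemmatizer is configured for a language.
final class SketchNullLemmatizer implements SketchLemmatizer
{
    public function lemmatize(string $word): ?string
    {
        return null;
    }
}

final class SketchLemmatizerFactory
{
    /** @param array<string, SketchLemmatizer> $byLanguage language code => lemmatizer */
    public function __construct(private array $byLanguage)
    {
    }

    public function forLanguage(string $languageCode): SketchLemmatizer
    {
        return $this->byLanguage[$languageCode] ?? new SketchNullLemmatizer();
    }
}

$factory = new SketchLemmatizerFactory([
    'en' => new SketchDictionaryLemmatizer(['runs' => 'run', 'ran' => 'run']),
]);

$english = $factory->forLanguage('en')->lemmatize('Runs'); // 'run'
$unknown = $factory->forLanguage('xx')->lemmatize('Runs'); // null
```

Returning a null-object lemmatizer instead of throwing keeps callers free of per-language branching; whether the real factory does the same is not stated here.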
Parameters
- $languageCode : string
-
ISO language code
Return values
LemmatizerInterface
getNlpSupportedLanguages()
Get languages supported by the NLP service.
public
getNlpSupportedLanguages() : array<string|int, string>
Return values
array<string|int, string>
getSuggestedFamilyUpdate()
Suggest status update for related forms when one form's status changes.
public
getSuggestedFamilyUpdate(int $termId, int $newStatus) : array{suggestion: string, affected_count: int, term_ids: int[]}
Based on the "suggested" inheritance mode from the proposal.
Parameters
- $termId : int
-
Term that was updated
- $newStatus : int
-
The new status that was set
Return values
array{suggestion: string, affected_count: int, term_ids: int[]}
getUnmatchedStatistics()
Get statistics about unmatched text items that could benefit from lemma linking.
public
getUnmatchedStatistics(int $languageId) : array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}
Parameters
- $languageId : int
-
Language ID
Return values
array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}
getWordFamilies()
Get words grouped by their lemma.
public
getWordFamilies(int $languageId[, int $limit = 50 ]) : array<string, array{lemma: string, count: int, terms: string[]}>
Parameters
- $languageId : int
-
Language ID
- $limit : int = 50
-
Maximum number of lemma groups to return
Return values
array<string, array{lemma: string, count: int, terms: string[]}>
getWordFamily()
Get the word family (all words sharing a lemma).
public
getWordFamily(int $languageId, string $lemmaLc) : array<string|int, Term>
Parameters
- $languageId : int
-
Language ID
- $lemmaLc : string
-
Lowercase lemma
Return values
array<string|int, Term> —Array of terms in the word family
getWordFamilyByLemma()
Get word family by lemma directly (without requiring a term ID).
public
getWordFamilyByLemma(int $languageId, string $lemmaLc) : array<string|int, mixed>|null
Parameters
- $languageId : int
-
Language ID
- $lemmaLc : string
-
Lowercase lemma
Return values
array<string|int, mixed>|null
getWordFamilyDetails()
Get detailed word family information for a term.
public
getWordFamilyDetails(int $termId) : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
Returns all words sharing the same lemma with full details for display.
Parameters
- $termId : int
-
Term ID to get family for
Return values
array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
getWordFamilyList()
Get paginated list of word families for a language.
public
getWordFamilyList(int $languageId[, int $page = 1 ][, int $perPage = 50 ][, string $sortBy = 'lemma' ][, string $sortDir = 'asc' ]) : array{families: array, pagination: array}
Parameters
- $languageId : int
-
Language ID
- $page : int = 1
-
Page number (1-based)
- $perPage : int = 50
-
Items per page
- $sortBy : string = 'lemma'
-
Sort field: 'lemma', 'count', 'status'
- $sortDir : string = 'asc'
-
Sort direction: 'asc', 'desc'
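How the paging and sorting parameters interact can be sketched on an in-memory list. The function name, the family row shape, and the keys inside `pagination` are assumptions, as is the choice to clamp out-of-range pages rather than return an empty page; the page documents only the outer `families` and `pagination` keys.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of sorting and 1-based pagination over word families.
 *
 * @param list<array{lemma: string, count: int, status: int}> $families
 * @return array{families: array, pagination: array}
 */
function sketchPaginateFamilies(
    array $families,
    int $page = 1,
    int $perPage = 50,
    string $sortBy = 'lemma',
    string $sortDir = 'asc'
): array {
    // Fall back to the default sort field for anything outside the documented values.
    $sortBy = in_array($sortBy, ['lemma', 'count', 'status'], true) ? $sortBy : 'lemma';
    $direction = strtolower($sortDir) === 'desc' ? -1 : 1;
    usort($families, fn (array $a, array $b): int => $direction * ($a[$sortBy] <=> $b[$sortBy]));

    $perPage = max(1, $perPage);
    $total = count($families);
    $totalPages = max(1, (int) ceil($total / $perPage));
    $page = min(max(1, $page), $totalPages); // assumption: clamp instead of erroring

    return [
        'families' => array_slice($families, ($page - 1) * $perPage, $perPage),
        // Assumed key names; the real pagination array is not specified here.
        'pagination' => [
            'page' => $page,
            'per_page' => $perPage,
            'total' => $total,
            'total_pages' => $totalPages,
        ],
    ];
}

$sampleFamilies = [
    ['lemma' => 'b', 'count' => 2, 'status' => 1],
    ['lemma' => 'a', 'count' => 5, 'status' => 2],
    ['lemma' => 'c', 'count' => 1, 'status' => 3],
];
$firstPage = sketchPaginateFamilies($sampleFamilies, 1, 2);
$byCountDesc = sketchPaginateFamilies($sampleFamilies, 2, 2, 'count', 'desc');
```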
Return values
array{families: array, pagination: array}
isAvailableForLanguage()
Check if lemmatization is available for a language.
public
isAvailableForLanguage(string $languageCode) : bool
Parameters
- $languageCode : string
-
ISO language code
Return values
bool —True if lemmatization is available
isNlpServiceAvailable()
Check if NLP service (spaCy) is available.
public
isNlpServiceAvailable() : bool
Return values
bool
linkTextItemsByLemma()
Link unmatched text items to words by lemma.
public
linkTextItemsByLemma(int $languageId, string $languageCode[, int|null $textId = null ]) : array{linked: int, unmatched: int, errors: int}
When a text item doesn't have an exact word match (Ti2WoID IS NULL), this method tries to find a word whose lemma matches the text item's lemmatized form.
Example: Text item "runs" with no exact match → lemmatize to "run" → find word with WoLemmaLC = "run" → link text item to that word
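The flow in the example above can be sketched with plain arrays in place of the database. Nothing here is the service's code: the function name, the item shape, the lemma index, and the `$lemmatize` callable are hypothetical, and treating a lemmatizer exception as an "error" is an assumption about what the `errors` counter means.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: link unmatched text items to words via their lemma.
 *
 * @param list<array{id: int, text: string, wordId: ?int}> $textItems
 * @param array<string, int> $wordIdByLemmaLc lowercase lemma => word ID
 * @param callable(string): ?string $lemmatize stand-in for a lemmatizer
 * @return array{items: array, result: array{linked: int, unmatched: int, errors: int}}
 */
function sketchLinkByLemma(array $textItems, array $wordIdByLemmaLc, callable $lemmatize): array
{
    $result = ['linked' => 0, 'unmatched' => 0, 'errors' => 0];

    foreach ($textItems as $i => $item) {
        if ($item['wordId'] !== null) {
            continue; // already has an exact word match
        }
        try {
            $lemma = $lemmatize($item['text']);
        } catch (Throwable $e) {
            $result['errors']++; // assumption: lemmatizer failures are counted, not rethrown
            continue;
        }
        $key = $lemma === null ? null : strtolower($lemma);
        if ($key !== null && isset($wordIdByLemmaLc[$key])) {
            $textItems[$i]['wordId'] = $wordIdByLemmaLc[$key];
            $result['linked']++;
        } else {
            $result['unmatched']++;
        }
    }

    return ['items' => $textItems, 'result' => $result];
}

$forms = ['runs' => 'run'];
$linked = sketchLinkByLemma(
    [
        ['id' => 10, 'text' => 'runs', 'wordId' => null],
        ['id' => 11, 'text' => 'walked', 'wordId' => null],
        ['id' => 12, 'text' => 'run', 'wordId' => 7],
    ],
    ['run' => 7],
    fn (string $word): ?string => $forms[$word] ?? null
);
```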
Parameters
- $languageId : int
-
Language ID
- $languageCode : string
-
ISO language code for lemmatizer
- $textId : int|null = null
-
Optional: limit to specific text
Return values
array{linked: int, unmatched: int, errors: int}
linkTextItemsByLemmaSql()
Link text items directly using SQL (efficient for large datasets).
public
linkTextItemsByLemmaSql(int $languageId[, int|null $textId = null ]) : int
This method links text items to words where the text item's lowercase text matches a word's lemma. It's more efficient than the PHP-based approach for large datasets.
Parameters
- $languageId : int
-
Language ID
- $textId : int|null = null
-
Optional text ID filter
Return values
int —Number of text items linked
propagateLemma()
Copy lemma from one term to all related terms.
public
propagateLemma(int $termId, int $languageId, string $languageCode) : int
For example, when a user sets the lemma "run" for "running", this method can propagate "run" to other forms such as "runs" and "ran", provided the lemmatizer suggests the same lemma for them.
Parameters
- $termId : int
-
Source term ID
- $languageId : int
-
Language ID
- $languageCode : string
-
Language code for lemmatizer
Return values
int —Number of terms updated
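A minimal in-memory sketch of this propagation follows. The function name, term shape, and `$lemmatize` callable are hypothetical, and the sketch assumes that terms which already carry a lemma are never overwritten; the page does not say how the real method treats them.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: copy the source term's lemma to terms the lemmatizer maps to it.
 *
 * @param array<int, array{text: string, lemma: ?string}> $terms keyed by term ID
 * @param callable(string): ?string $lemmatize stand-in for a lemmatizer
 * @return array{terms: array, updated: int}
 */
function sketchPropagateLemma(array $terms, int $sourceId, callable $lemmatize): array
{
    $lemma = $terms[$sourceId]['lemma'] ?? null;
    if ($lemma === null || $lemma === '') {
        return ['terms' => $terms, 'updated' => 0]; // nothing to propagate
    }

    $updated = 0;
    foreach ($terms as $id => $term) {
        // Assumption: skip the source itself and any term that already has a lemma.
        if ($id === $sourceId || $term['lemma'] !== null) {
            continue;
        }
        if ($lemmatize($term['text']) === $lemma) {
            $terms[$id]['lemma'] = $lemma;
            $updated++;
        }
    }

    return ['terms' => $terms, 'updated' => $updated];
}

$suggestions = ['runs' => 'run', 'ran' => 'run', 'walks' => 'walk', 'runner' => 'run'];
$propagated = sketchPropagateLemma(
    [
        1 => ['text' => 'running', 'lemma' => 'run'],
        2 => ['text' => 'runs', 'lemma' => null],
        3 => ['text' => 'ran', 'lemma' => null],
        4 => ['text' => 'walks', 'lemma' => null],
        5 => ['text' => 'runner', 'lemma' => 'runner'],
    ],
    1,
    fn (string $word): ?string => $suggestions[$word] ?? null
);
```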
setLemma()
Set lemma for a specific term.
public
setLemma(int $termId, string $lemma) : bool
Parameters
- $termId : int
-
Term ID
- $lemma : string
-
The lemma to set
Return values
bool —True if updated
suggestLemma()
Suggest a lemma for a word.
public
suggestLemma(string $word, string $languageCode) : string|null
Parameters
- $word : string
-
The word to lemmatize
- $languageCode : string
-
ISO language code (e.g., 'en', 'de')
Return values
string|null —The suggested lemma, or null if not found
suggestLemmasBatch()
Suggest lemmas for multiple words.
public
suggestLemmasBatch(array<string|int, string> $words, string $languageCode) : array<string, string|null>
Parameters
- $words : array<string|int, string>
-
Array of words
- $languageCode : string
-
ISO language code
Return values
array<string, string|null> —Word => lemma mapping
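The shape of the returned mapping is easiest to see in a small sketch. The function name and the `$lemmatize` callable are hypothetical; the point is only that the result is keyed by the input word, so a word the lemmatizer cannot resolve maps to null and repeated inputs collapse into one entry.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of a word => lemma mapping for a batch of words.
 *
 * @param array<string|int, string> $words
 * @param callable(string): ?string $lemmatize stand-in for a lemmatizer
 * @return array<string, string|null>
 */
function sketchSuggestBatch(array $words, callable $lemmatize): array
{
    $suggestions = [];
    foreach ($words as $word) {
        // Keyed by the word itself, so duplicates overwrite the same entry.
        $suggestions[$word] = $lemmatize($word);
    }
    return $suggestions;
}

$known = ['runs' => 'run', 'mice' => 'mouse'];
$batch = sketchSuggestBatch(
    ['runs', 'xyz', 'mice', 'runs'],
    fn (string $word): ?string => $known[$word] ?? null
);
```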
updateWordFamilyStatus()
Update status for all words in a word family.
public
updateWordFamilyStatus(int $languageId, string $lemmaLc, int $status) : int
Parameters
- $languageId : int
-
Language ID
- $lemmaLc : string
-
Lowercase lemma
- $status : int
-
New status (1-5, 98, 99)
Return values
int —Number of words updated
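The status constraint and the family match can be sketched on an in-memory list. The function name and term shape are hypothetical; rejecting an out-of-range status with an exception and counting every matching term (rather than only changed ones) are assumptions, since the page gives only the allowed values and the return type.

```php
<?php
declare(strict_types=1);

// Allowed values as listed for $status: 1-5, 98, 99.
const SKETCH_ALLOWED_STATUSES = [1, 2, 3, 4, 5, 98, 99];

/**
 * Hypothetical sketch: set one status on every term sharing a lemma in a language.
 *
 * @param array<int, array{langId: int, lemmaLc: ?string, status: int}> $terms keyed by term ID
 * @return array{terms: array, updated: int}
 */
function sketchUpdateFamilyStatus(array $terms, int $languageId, string $lemmaLc, int $status): array
{
    if (!in_array($status, SKETCH_ALLOWED_STATUSES, true)) {
        // Assumption: invalid statuses are rejected up front.
        throw new InvalidArgumentException("Unsupported status: {$status}");
    }

    $updated = 0;
    foreach ($terms as $id => $term) {
        if ($term['langId'] === $languageId && $term['lemmaLc'] === $lemmaLc) {
            $terms[$id]['status'] = $status;
            $updated++; // counts every family member, changed or not
        }
    }

    return ['terms' => $terms, 'updated' => $updated];
}

$familyTerms = [
    1 => ['langId' => 1, 'lemmaLc' => 'run', 'status' => 1],
    2 => ['langId' => 1, 'lemmaLc' => 'run', 'status' => 2],
    3 => ['langId' => 2, 'lemmaLc' => 'run', 'status' => 1],
    4 => ['langId' => 1, 'lemmaLc' => 'walk', 'status' => 1],
];
$familyUpdate = sketchUpdateFamilyStatus($familyTerms, 1, 'run', 5);
```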
buildSingleTermFamily()
Build a "family" response for a term without a lemma.
private
buildSingleTermFamily(int $termId) : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
Parameters
- $termId : int
-
Term ID
Return values
array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
fetchTermsWithoutLemma()
Fetch terms without a lemma.
private
fetchTermsWithoutLemma(int $languageId, int $limit, int $offset) : array<int, array<string, mixed>>
Parameters
- $languageId : int
-
Language ID
- $limit : int
-
Maximum number to fetch
- $offset : int
-
Starting offset
Return values
array<int, array<string, mixed>>
fetchUnmatchedTextItems()
Fetch unmatched text items (Ti2WoID IS NULL) for a language.
private
fetchUnmatchedTextItems(int $languageId[, int|null $textId = null ]) : array<int, array<string, mixed>>
Parameters
- $languageId : int
-
Language ID
- $textId : int|null = null
-
Optional text ID filter
Return values
array<int, array<string, mixed>>
getWordOccurrenceCount()
Get occurrence count for a word across all texts.
private
getWordOccurrenceCount(int $wordId) : int
Parameters
- $wordId : int
-
Word ID
Return values
int
linkItemsToWord()
Link text items to a word.
private
linkItemsToWord(array<int, array<string, mixed>> $items, int $wordId) : int
Parameters
- $items : array<int, array<string, mixed>>
-
Text items to link
- $wordId : int
-
Word ID to link to
Return values
int —Number of items linked
updateTermLemma()
Update the lemma for a term.
private
updateTermLemma(int $termId, string $lemma) : void
Parameters
- $termId : int
-
Term ID
- $lemma : string
-
The lemma to set