LemmaService

in package Lwt\Modules\Vocabulary\Application\Services

Service for managing lemmatization of vocabulary items.

Provides methods for:

Suggesting lemmas for new words
Batch lemmatization of existing vocabulary
Word family queries
NLP integration via factory pattern

Properties

$lemmatizer : LemmatizerInterface
$repository : MySqlTermRepository

Methods

__construct() : mixed: Constructor.
applyLemmasToVocabulary() : array{processed: int, updated: int, skipped: int}: Apply lemmas to existing vocabulary for a language.
bulkUpdateTermStatus() : int: Apply status to multiple terms (for bulk family updates).
clearLemmas() : int: Clear all lemmas for a language.
findPotentialLemmaGroups() : array<int, array{base: string, variants: string[]}>: Find terms that might benefit from lemmatization.
findWordIdByLemma() : int|null: Find a word ID by its lemma.
getAllNlpLanguages() : array<string|int, string>: Get all languages potentially supported by NLP (including uninstalled models).
getAvailableLanguages() : array<string|int, string>: Get all languages with available lemmatization support.
getLemmaAggregateStats() : array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}: Get aggregate lemma statistics for a language.
getLemmaStatistics() : array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}: Get lemma statistics for a language.
getLemmatizerByType() : LemmatizerInterface: Get a lemmatizer by type.
getLemmatizerForLanguage() : LemmatizerInterface: Get the best available lemmatizer for a language.
getNlpSupportedLanguages() : array<string|int, string>: Get languages supported by the NLP service.
getSuggestedFamilyUpdate() : array{suggestion: string, affected_count: int, term_ids: int[]}: Suggest status update for related forms when one form's status changes.
getUnmatchedStatistics() : array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}: Get statistics about unmatched text items that could benefit from lemma linking.
getWordFamilies() : array<string, array{lemma: string, count: int, terms: string[]}>: Get words grouped by their lemma.
getWordFamily() : array<string|int, Term>: Get the word family (all words sharing a lemma).
getWordFamilyByLemma() : array<string|int, mixed>|null: Get word family by lemma directly (without requiring a term ID).
getWordFamilyDetails() : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null: Get detailed word family information for a term.
getWordFamilyList() : array{families: array, pagination: array}: Get paginated list of word families for a language.
isAvailableForLanguage() : bool: Check if lemmatization is available for a language.
isNlpServiceAvailable() : bool: Check if NLP service (spaCy) is available.
linkTextItemsByLemma() : array{linked: int, unmatched: int, errors: int}: Link unmatched text items to words by lemma.
linkTextItemsByLemmaSql() : int: Link text items directly using SQL (efficient for large datasets).
propagateLemma() : int: Copy lemma from one term to all related terms.
setLemma() : bool: Set lemma for a specific term.
suggestLemma() : string|null: Suggest a lemma for a word.
suggestLemmasBatch() : array<string, string|null>: Suggest lemmas for multiple words.
updateWordFamilyStatus() : int: Update status for all words in a word family.
buildSingleTermFamily() : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null: Build a "family" response for a term without a lemma.
fetchTermsWithoutLemma() : array<int, array<string, mixed>>: Fetch terms without a lemma.
fetchUnmatchedTextItems() : array<int, array<string, mixed>>: Fetch unmatched text items (Ti2WoID IS NULL) for a language.
getWordOccurrenceCount() : int: Get occurrence count for a word across all texts.
linkItemsToWord() : int: Link text items to a word.
updateTermLemma() : void: Update the lemma for a term.

$lemmatizer


    private LemmatizerInterface $lemmatizer

$repository


    private MySqlTermRepository $repository

__construct()

Constructor.


    public __construct([LemmatizerInterface|null $lemmatizer = null ][, MySqlTermRepository|null $repository = null ]) : mixed

Parameters

$lemmatizer : LemmatizerInterface|null = null: Lemmatizer implementation
$repository : MySqlTermRepository|null = null: Term repository

applyLemmasToVocabulary()

Apply lemmas to existing vocabulary for a language.


    public applyLemmasToVocabulary(int $languageId, string $languageCode[, int $batchSize = 100 ]) : array{processed: int, updated: int, skipped: int}

Parameters

$languageId : int: Language ID
$languageCode : string: ISO language code for lemmatizer
$batchSize : int = 100: Number of words to process per batch

Return values

array{processed: int, updated: int, skipped: int}
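
The batching and the three counters can be pictured with a minimal, self-contained sketch. It is not the service's implementation: `applyLemmasSketch()`, the in-memory `$terms` rows, and the dictionary-backed lemmatizer closure are hypothetical stand-ins, and counting a term as "skipped" when no lemma is found is an assumption.

```php
<?php
// Hypothetical stand-in for batch lemmatization; the real method works
// against the database through the term repository.

/**
 * @param array<int, array{id: int, text: string, lemma: ?string}> $terms
 * @param callable(string): ?string $lemmatize
 * @return array{processed: int, updated: int, skipped: int}
 */
function applyLemmasSketch(array &$terms, callable $lemmatize, int $batchSize = 100): array
{
    $stats = ['processed' => 0, 'updated' => 0, 'skipped' => 0];

    // Only terms that still lack a lemma are candidates.
    $pending = array_keys(array_filter($terms, fn(array $t): bool => $t['lemma'] === null));

    // Work through the candidates in fixed-size batches.
    foreach (array_chunk($pending, max(1, $batchSize)) as $batch) {
        foreach ($batch as $key) {
            $stats['processed']++;
            $lemma = $lemmatize($terms[$key]['text']);
            if ($lemma === null || $lemma === '') {
                $stats['skipped']++; // assumption: no suggestion means "skipped"
                continue;
            }
            $terms[$key]['lemma'] = $lemma;
            $stats['updated']++;
        }
    }

    return $stats;
}

$dictionary = ['runs' => 'run', 'ran' => 'run', 'mice' => 'mouse'];
$terms = [
    ['id' => 1, 'text' => 'runs', 'lemma' => null],
    ['id' => 2, 'text' => 'ran', 'lemma' => null],
    ['id' => 3, 'text' => 'xyzzy', 'lemma' => null],
    ['id' => 4, 'text' => 'mice', 'lemma' => 'mouse'], // already has a lemma
];

$stats = applyLemmasSketch(
    $terms,
    fn(string $word): ?string => $dictionary[strtolower($word)] ?? null,
    2
);
print_r($stats);
```

In this sketch the term that already carries a lemma is never counted, so `processed` equals `updated` plus `skipped`. Whether the real method keeps the same invariant is not stated in this reference.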

bulkUpdateTermStatus()

Apply status to multiple terms (for bulk family updates).


    public bulkUpdateTermStatus(array<string|int, int> $termIds, int $status) : int

Parameters

$termIds : array<string|int, int>: Term IDs to update
$status : int: New status

Return values

int: Number of terms updated

clearLemmas()

Clear all lemmas for a language.


    public clearLemmas(int $languageId) : int

Parameters

$languageId : int: Language ID

Return values

int: Number of terms affected

findPotentialLemmaGroups()

Find terms that might benefit from lemmatization.


    public findPotentialLemmaGroups(int $languageId[, int $limit = 20 ]) : array<int, array{base: string, variants: string[]}>

Identifies terms with similar text that could share a lemma.

Parameters

$languageId : int: Language ID
$limit : int = 20: Maximum suggestions

Return values

array<int, array{base: string, variants: string[]}>

findWordIdByLemma()

Find a word ID by its lemma.


    public findWordIdByLemma(int $languageId, string $lemmaLc) : int|null

Returns the word that has this lemma (preferring the base form).

Parameters

$languageId : int: Language ID
$lemmaLc : string: Lowercase lemma to match

Return values

int|null: Word ID or null if not found

getAllNlpLanguages()

Get all languages potentially supported by NLP (including uninstalled models).


    public getAllNlpLanguages() : array<string|int, string>

Return values

array<string|int, string>

getAvailableLanguages()

Get all languages with available lemmatization support.


    public getAvailableLanguages() : array<string|int, string>

Return values

array<string|int, string>: Array of language codes

getLemmaAggregateStats()

Get aggregate lemma statistics for a language.


    public getLemmaAggregateStats(int $languageId) : array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}

Parameters

$languageId : int: Language ID

Return values

array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}

getLemmaStatistics()

Get lemma statistics for a language.


    public getLemmaStatistics(int $languageId) : array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}

Parameters

$languageId : int: Language ID

Return values

array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}

getLemmatizerByType()

Get a lemmatizer by type.


    public getLemmatizerByType(string $type) : LemmatizerInterface

Parameters

$type : string: Lemmatizer type ('dictionary', 'spacy', 'hybrid')

Return values

LemmatizerInterface

getLemmatizerForLanguage()

Get the best available lemmatizer for a language.


    public getLemmatizerForLanguage(string $languageCode) : LemmatizerInterface

Uses the LemmatizerFactory to select the appropriate lemmatizer based on language configuration and availability.

Parameters

$languageCode : string: ISO language code

Return values

LemmatizerInterface

getNlpSupportedLanguages()

Get languages supported by the NLP service.


    public getNlpSupportedLanguages() : array<string|int, string>

Return values

array<string|int, string>

getSuggestedFamilyUpdate()

Suggest status update for related forms when one form's status changes.


    public getSuggestedFamilyUpdate(int $termId, int $newStatus) : array{suggestion: string, affected_count: int, term_ids: int[]}

Based on the "suggested" inheritance mode from the proposal.

Parameters

$termId : int: Term that was updated
$newStatus : int: The new status that was set

Return values

array{suggestion: string, affected_count: int, term_ids: int[]}

getUnmatchedStatistics()

Get statistics about unmatched text items that could benefit from lemma linking.


    public getUnmatchedStatistics(int $languageId) : array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}

Parameters

$languageId : int: Language ID

Return values

array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}

getWordFamilies()

Get words grouped by their lemma.


    public getWordFamilies(int $languageId[, int $limit = 50 ]) : array<string, array{lemma: string, count: int, terms: string[]}>

Parameters

$languageId : int: Language ID
$limit : int = 50: Maximum number of lemma groups to return

Return values

array<string, array{lemma: string, count: int, terms: string[]}>
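
To make the return shape concrete, the following sketch groups in-memory rows by lemma. `groupByLemmaSketch()` and the row layout are hypothetical. Ordering families by size before applying the limit is an assumption, and so is keying the result by the lowercased lemma.

```php
<?php
// Hypothetical in-memory grouping; the real method queries the database.

/**
 * @param array<int, array{text: string, lemma: string}> $terms
 * @return array<string, array{lemma: string, count: int, terms: string[]}>
 */
function groupByLemmaSketch(array $terms, int $limit = 50): array
{
    $families = [];
    foreach ($terms as $term) {
        // strtolower() is ASCII-only; real vocabulary data would need a
        // multibyte-aware lowercase instead.
        $key = strtolower($term['lemma']);
        if (!isset($families[$key])) {
            $families[$key] = ['lemma' => $term['lemma'], 'count' => 0, 'terms' => []];
        }
        $families[$key]['count']++;
        $families[$key]['terms'][] = $term['text'];
    }

    // Assumption: largest families first, then cap the number of groups.
    uasort($families, fn(array $a, array $b): int => $b['count'] <=> $a['count']);

    return array_slice($families, 0, max(0, $limit), true);
}

$terms = [
    ['text' => 'mice', 'lemma' => 'mouse'],
    ['text' => 'runs', 'lemma' => 'run'],
    ['text' => 'ran', 'lemma' => 'run'],
    ['text' => 'running', 'lemma' => 'run'],
];

$families = groupByLemmaSketch($terms);
$top = groupByLemmaSketch($terms, 1);
print_r($families);
```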

getWordFamily()

Get the word family (all words sharing a lemma).


    public getWordFamily(int $languageId, string $lemmaLc) : array<string|int, Term>

Parameters

$languageId : int: Language ID
$lemmaLc : string: Lowercase lemma

Return values

array<string|int, Term>: Array of terms in the word family

getWordFamilyByLemma()

Get word family by lemma directly (without requiring a term ID).


    public getWordFamilyByLemma(int $languageId, string $lemmaLc) : array<string|int, mixed>|null

Parameters

$languageId : int: Language ID
$lemmaLc : string: Lowercase lemma

Return values

array<string|int, mixed>|null

getWordFamilyDetails()

Get detailed word family information for a term.


    public getWordFamilyDetails(int $termId) : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null

Returns all words sharing the same lemma with full details for display.

Parameters

$termId : int: Term ID to get family for

Return values

array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null

getWordFamilyList()

Get paginated list of word families for a language.


    public getWordFamilyList(int $languageId[, int $page = 1 ][, int $perPage = 50 ][, string $sortBy = 'lemma' ][, string $sortDir = 'asc' ]) : array{families: array, pagination: array}

Parameters

$languageId : int: Language ID
$page : int = 1: Page number (1-based)
$perPage : int = 50: Items per page
$sortBy : string = 'lemma': Sort field: 'lemma', 'count', 'status'
$sortDir : string = 'asc': Sort direction: 'asc', 'desc'

Return values

array{families: array, pagination: array}
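
The interplay of the 1-based `$page`, `$perPage`, and the two sort parameters can be sketched as follows. This is an illustration only: `paginateFamiliesSketch()` is hypothetical, the keys inside `pagination` (`page`, `per_page`, `total`, `total_pages`) are assumed because this reference does not list them, and falling back to defaults on unknown sort values is also an assumption.

```php
<?php
// Hypothetical in-memory pagination; the real method presumably sorts and
// limits in SQL.

/**
 * @param array<int, array{lemma: string, count: int, status: int}> $families
 * @return array{families: array, pagination: array}
 */
function paginateFamiliesSketch(
    array $families,
    int $page = 1,
    int $perPage = 50,
    string $sortBy = 'lemma',
    string $sortDir = 'asc'
): array {
    // Whitelist the sort inputs rather than trusting caller-supplied strings.
    $sortBy = in_array($sortBy, ['lemma', 'count', 'status'], true) ? $sortBy : 'lemma';
    $direction = strtolower($sortDir) === 'desc' ? -1 : 1;
    usort($families, fn(array $a, array $b): int => $direction * ($a[$sortBy] <=> $b[$sortBy]));

    $perPage = max(1, $perPage);
    $total = count($families);
    $totalPages = max(1, (int) ceil($total / $perPage));
    $page = min(max(1, $page), $totalPages); // clamp into the valid range
    $offset = ($page - 1) * $perPage;        // 1-based page to 0-based offset

    return [
        'families' => array_slice($families, $offset, $perPage),
        'pagination' => [
            'page' => $page,
            'per_page' => $perPage,
            'total' => $total,
            'total_pages' => $totalPages,
        ],
    ];
}

$families = [
    ['lemma' => 'a', 'count' => 3, 'status' => 1],
    ['lemma' => 'b', 'count' => 5, 'status' => 2],
    ['lemma' => 'c', 'count' => 1, 'status' => 5],
    ['lemma' => 'd', 'count' => 4, 'status' => 3],
    ['lemma' => 'e', 'count' => 2, 'status' => 99],
];

$result = paginateFamiliesSketch($families, 2, 2, 'count', 'desc');
print_r($result);
```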

isAvailableForLanguage()

Check if lemmatization is available for a language.


    public isAvailableForLanguage(string $languageCode) : bool

Parameters

$languageCode : string: ISO language code

Return values

bool: True if lemmatization is available

isNlpServiceAvailable()

Check if NLP service (spaCy) is available.


    public isNlpServiceAvailable() : bool

Return values

bool

linkTextItemsByLemma()

Link unmatched text items to words by lemma.


    public linkTextItemsByLemma(int $languageId, string $languageCode[, int|null $textId = null ]) : array{linked: int, unmatched: int, errors: int}

When a text item doesn't have an exact word match (Ti2WoID IS NULL), this method tries to find a word whose lemma matches the text item's lemmatized form.

Example: Text item "runs" with no exact match → lemmatize to "run" → find word with WoLemmaLC = "run" → link text item to that word

Parameters

$languageId : int: Language ID
$languageCode : string: ISO language code for lemmatizer
$textId : int|null = null: Optional: limit to specific text

Return values

array{linked: int, unmatched: int, errors: int}
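
The "runs" → "run" → linked word flow described for this method can be sketched in memory. Everything here is a hypothetical stand-in: `linkByLemmaSketch()`, the item rows with a `wordId` field playing the role of `Ti2WoID`, the lemma-to-word-ID index playing the role of a `WoLemmaLC` lookup, and the closure-based lemmatizer. Counting a thrown exception as an error is an assumption about what `errors` means.

```php
<?php
// Hypothetical in-memory version of lemma-based linking.

/**
 * @param array<int, array{id: int, text: string, wordId: ?int}> $items
 * @param array<string, int> $wordIdByLemma lowercase lemma => word ID
 * @param callable(string): ?string $lemmatize
 * @return array{linked: int, unmatched: int, errors: int}
 */
function linkByLemmaSketch(array &$items, array $wordIdByLemma, callable $lemmatize): array
{
    $result = ['linked' => 0, 'unmatched' => 0, 'errors' => 0];

    foreach ($items as &$item) {
        if ($item['wordId'] !== null) {
            continue; // already linked through an exact match
        }
        try {
            $lemma = $lemmatize($item['text']);
        } catch (Throwable $e) {
            $result['errors']++; // assumption: lemmatizer failures count as errors
            continue;
        }
        $lemmaLc = $lemma === null ? null : strtolower($lemma);
        if ($lemmaLc !== null && isset($wordIdByLemma[$lemmaLc])) {
            $item['wordId'] = $wordIdByLemma[$lemmaLc];
            $result['linked']++;
        } else {
            $result['unmatched']++;
        }
    }
    unset($item); // break the reference left behind by foreach

    return $result;
}

$dictionary = ['runs' => 'run', 'ran' => 'run'];
$lemmatize = function (string $word) use ($dictionary): ?string {
    if ($word === 'boom') {
        throw new RuntimeException('simulated lemmatizer failure');
    }
    return $dictionary[strtolower($word)] ?? null;
};

$items = [
    ['id' => 1, 'text' => 'runs', 'wordId' => null],
    ['id' => 2, 'text' => 'Ran', 'wordId' => null],
    ['id' => 3, 'text' => 'xyzzy', 'wordId' => null],
    ['id' => 4, 'text' => 'boom', 'wordId' => null],
    ['id' => 5, 'text' => 'cat', 'wordId' => 7],
];

$result = linkByLemmaSketch($items, ['run' => 42], $lemmatize);
print_r($result);
```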

linkTextItemsByLemmaSql()

Link text items directly using SQL (efficient for large datasets).


    public linkTextItemsByLemmaSql(int $languageId[, int|null $textId = null ]) : int

This method links text items to words where the text item's lowercase text matches a word's lemma. It's more efficient than the PHP-based approach for large datasets.

Parameters

$languageId : int: Language ID
$textId : int|null = null: Optional text ID filter

Return values

int: Number of text items linked

propagateLemma()

Copy lemma from one term to all related terms.


    public propagateLemma(int $termId, int $languageId, string $languageCode) : int

When a user sets the lemma "run" for "running", this method can propagate that lemma to other forms such as "runs" and "ran", provided the lemmatizer suggests the same lemma for them.

Parameters

$termId : int: Source term ID
$languageId : int: Language ID
$languageCode : string: Language code for lemmatizer

Return values

int: Number of terms updated
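
One way to picture the propagation rule is the sketch below. It is a hedged illustration, not the service's code: `propagateLemmaSketch()` and the in-memory rows are hypothetical, and leaving terms that already have a lemma untouched is an assumption.

```php
<?php
// Hypothetical in-memory propagation of a lemma to related forms.

/**
 * @param array<int, array{id: int, text: string, lemma: ?string}> $terms
 * @param callable(string): ?string $lemmatize
 * @return int Number of terms updated
 */
function propagateLemmaSketch(array &$terms, int $sourceId, callable $lemmatize): int
{
    // Look up the lemma of the source term.
    $sourceLemma = null;
    foreach ($terms as $candidate) {
        if ($candidate['id'] === $sourceId) {
            $sourceLemma = $candidate['lemma'];
            break;
        }
    }
    if ($sourceLemma === null || $sourceLemma === '') {
        return 0; // nothing to propagate
    }

    $updated = 0;
    foreach ($terms as &$term) {
        // Assumption: an existing lemma is never overwritten.
        if ($term['id'] === $sourceId || $term['lemma'] !== null) {
            continue;
        }
        // Propagate only when the lemmatizer agrees with the source lemma.
        if ($lemmatize($term['text']) === $sourceLemma) {
            $term['lemma'] = $sourceLemma;
            $updated++;
        }
    }
    unset($term);

    return $updated;
}

$dictionary = ['runs' => 'run', 'ran' => 'run', 'walks' => 'walk', 'rerun' => 'run'];
$terms = [
    ['id' => 1, 'text' => 'running', 'lemma' => 'run'],
    ['id' => 2, 'text' => 'runs', 'lemma' => null],
    ['id' => 3, 'text' => 'ran', 'lemma' => null],
    ['id' => 4, 'text' => 'walks', 'lemma' => null],
    ['id' => 5, 'text' => 'rerun', 'lemma' => 'rerun'],
];

$updated = propagateLemmaSketch(
    $terms,
    1,
    fn(string $word): ?string => $dictionary[strtolower($word)] ?? null
);
echo $updated, PHP_EOL;
```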

setLemma()

Set lemma for a specific term.


    public setLemma(int $termId, string $lemma) : bool

Parameters

$termId : int: Term ID
$lemma : string: The lemma to set

Return values

bool: True if updated

suggestLemma()

Suggest a lemma for a word.


    public suggestLemma(string $word, string $languageCode) : string|null

Parameters

$word : string: The word to lemmatize
$languageCode : string: ISO language code (e.g., 'en', 'de')

Return values

string|null: The suggested lemma, or null if not found

suggestLemmasBatch()

Suggest lemmas for multiple words.


    public suggestLemmasBatch(array<string|int, string> $words, string $languageCode) : array<string, string|null>

Parameters

$words : array<string|int, string>: Array of words
$languageCode : string: ISO language code

Return values

array<string, string|null>: Word => lemma mapping
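
A short sketch of the mapping shape, with `null` for words that have no suggestion. `suggestLemmasBatchSketch()` and the dictionary-backed closure are hypothetical, and collapsing duplicate input words is an assumption.

```php
<?php
// Hypothetical batch suggestion over an in-memory dictionary.

/**
 * @param array<string|int, string> $words
 * @param callable(string): ?string $lemmatize
 * @return array<string, string|null> Word => lemma mapping
 */
function suggestLemmasBatchSketch(array $words, callable $lemmatize): array
{
    $suggestions = [];
    // Duplicates collapse naturally because the word itself is the key.
    foreach (array_unique($words) as $word) {
        $suggestions[$word] = $lemmatize($word);
    }
    return $suggestions;
}

$dictionary = ['runs' => 'run', 'ran' => 'run'];
$suggestions = suggestLemmasBatchSketch(
    ['runs', 'ran', 'xyzzy', 'runs'],
    fn(string $word): ?string => $dictionary[strtolower($word)] ?? null
);
var_dump($suggestions);
```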

updateWordFamilyStatus()

Update status for all words in a word family.


    public updateWordFamilyStatus(int $languageId, string $lemmaLc, int $status) : int

Parameters

$languageId : int: Language ID
$lemmaLc : string: Lowercase lemma
$status : int: New status (1-5, 98, 99)

Return values

int: Number of words updated

buildSingleTermFamily()

Build a "family" response for a term without a lemma.


    private buildSingleTermFamily(int $termId) : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null

Parameters

$termId : int: Term ID

Return values

array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null

fetchTermsWithoutLemma()

Fetch terms without a lemma.


    private fetchTermsWithoutLemma(int $languageId, int $limit, int $offset) : array<int, array<string, mixed>>

Parameters

$languageId : int: Language ID
$limit : int: Maximum number to fetch
$offset : int: Starting offset

Return values

array<int, array<string, mixed>>

fetchUnmatchedTextItems()

Fetch unmatched text items (Ti2WoID IS NULL) for a language.


    private fetchUnmatchedTextItems(int $languageId[, int|null $textId = null ]) : array<int, array<string, mixed>>

Parameters

$languageId : int: Language ID
$textId : int|null = null: Optional text ID filter

Return values

array<int, array<string, mixed>>

getWordOccurrenceCount()

Get occurrence count for a word across all texts.


    private getWordOccurrenceCount(int $wordId) : int

Parameters

$wordId : int: Word ID

Return values

int

linkItemsToWord()

Link text items to a word.


    private linkItemsToWord(array<int, array<string, mixed>> $items, int $wordId) : int

Parameters

$items : array<int, array<string, mixed>>: Text items to link
$wordId : int: Word ID to link to

Return values

int: Number of items linked

updateTermLemma()

Update the lemma for a term.


    private updateTermLemma(int $termId, string $lemma) : void

Parameters

$termId : int: Term ID
$lemma : string: The lemma to set
