LemmaService

Service for managing lemmatization of vocabulary items.

Provides methods for:

  • Suggesting lemmas for new words
  • Batch lemmatization of existing vocabulary
  • Word family queries
  • NLP integration via factory pattern

Tags

since: 3.0.0

Table of Contents

Properties

$lemmatizer  : LemmatizerInterface
$repository  : MySqlTermRepository

Methods

__construct()  : mixed
Constructor.
applyLemmasToVocabulary()  : array{processed: int, updated: int, skipped: int}
Apply lemmas to existing vocabulary for a language.
bulkUpdateTermStatus()  : int
Apply a status to multiple terms (for bulk family updates).
clearLemmas()  : int
Clear all lemmas for a language.
findPotentialLemmaGroups()  : array<int, array{base: string, variants: string[]}>
Find terms that might benefit from lemmatization.
findWordIdByLemma()  : int|null
Find a word ID by its lemma.
getAllNlpLanguages()  : array<string|int, string>
Get all languages potentially supported by NLP (including uninstalled models).
getAvailableLanguages()  : array<string|int, string>
Get all languages with available lemmatization support.
getLemmaAggregateStats()  : array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}
Get aggregate lemma statistics for a language.
getLemmaStatistics()  : array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}
Get lemma statistics for a language.
getLemmatizerByType()  : LemmatizerInterface
Get a lemmatizer by type.
getLemmatizerForLanguage()  : LemmatizerInterface
Get the best available lemmatizer for a language.
getNlpSupportedLanguages()  : array<string|int, string>
Get languages supported by the NLP service.
getSuggestedFamilyUpdate()  : array{suggestion: string, affected_count: int, term_ids: int[]}
Suggest status update for related forms when one form's status changes.
getUnmatchedStatistics()  : array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}
Get statistics about unmatched text items that could benefit from lemma linking.
getWordFamilies()  : array<string, array{lemma: string, count: int, terms: string[]}>
Get words grouped by their lemma.
getWordFamily()  : array<string|int, Term>
Get the word family (all words sharing a lemma).
getWordFamilyByLemma()  : array<string|int, mixed>|null
Get word family by lemma directly (without requiring a term ID).
getWordFamilyDetails()  : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
Get detailed word family information for a term.
getWordFamilyList()  : array{families: array, pagination: array}
Get paginated list of word families for a language.
isAvailableForLanguage()  : bool
Check if lemmatization is available for a language.
isNlpServiceAvailable()  : bool
Check if the NLP service (spaCy) is available.
linkTextItemsByLemma()  : array{linked: int, unmatched: int, errors: int}
Link unmatched text items to words by lemma.
linkTextItemsByLemmaSql()  : int
Link text items directly using SQL (efficient for large datasets).
propagateLemma()  : int
Copy lemma from one term to all related terms.
setLemma()  : bool
Set lemma for a specific term.
suggestLemma()  : string|null
Suggest a lemma for a word.
suggestLemmasBatch()  : array<string, string|null>
Suggest lemmas for multiple words.
updateWordFamilyStatus()  : int
Update status for all words in a word family.
buildSingleTermFamily()  : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
Build a "family" response for a term without a lemma.
fetchTermsWithoutLemma()  : array<int, array<string, mixed>>
Fetch terms without a lemma.
fetchUnmatchedTextItems()  : array<int, array<string, mixed>>
Fetch unmatched text items (Ti2WoID IS NULL) for a language.
getWordOccurrenceCount()  : int
Get occurrence count for a word across all texts.
linkItemsToWord()  : int
Link text items to a word.
updateTermLemma()  : void
Update the lemma for a term.

Methods

applyLemmasToVocabulary()

Apply lemmas to existing vocabulary for a language.

public applyLemmasToVocabulary(int $languageId, string $languageCode[, int $batchSize = 100 ]) : array{processed: int, updated: int, skipped: int}
Parameters
$languageId : int

Language ID

$languageCode : string

ISO language code for lemmatizer

$batchSize : int = 100

Number of words to process per batch

Return values
array{processed: int, updated: int, skipped: int}
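
The following is a minimal sketch of how a batched lemmatization pass could produce the `processed`/`updated`/`skipped` counters. It is not the service's actual implementation: `applyLemmasSketch()`, the in-memory `$terms` array, and the callable lemmatizer are hypothetical stand-ins, and it assumes that only terms without a lemma are processed and that a term is counted as skipped when no lemma is suggested.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the batch loop; not the actual LemmaService code.
 *
 * @param array<int, array{id: int, text: string, lemma: ?string}> $terms
 * @param callable(string): ?string $lemmatize
 * @return array{processed: int, updated: int, skipped: int}
 */
function applyLemmasSketch(array &$terms, callable $lemmatize, int $batchSize = 100): array
{
    $result = ['processed' => 0, 'updated' => 0, 'skipped' => 0];

    // Assumption: only terms that still lack a lemma are candidates.
    $pendingKeys = array_keys(array_filter($terms, fn (array $t): bool => $t['lemma'] === null));

    // Work through the candidates in chunks of at most $batchSize.
    foreach (array_chunk($pendingKeys, max(1, $batchSize)) as $batch) {
        foreach ($batch as $key) {
            $result['processed']++;
            $lemma = $lemmatize($terms[$key]['text']);
            if ($lemma === null) {
                $result['skipped']++; // no suggestion available
                continue;
            }
            $terms[$key]['lemma'] = $lemma;
            $result['updated']++;
        }
    }

    return $result;
}

$terms = [
    ['id' => 1, 'text' => 'runs', 'lemma' => null],
    ['id' => 2, 'text' => 'xyzzy', 'lemma' => null],
    ['id' => 3, 'text' => 'house', 'lemma' => 'house'],
];
$dictionary = ['runs' => 'run']; // toy lemmatizer data
$stats = applyLemmasSketch($terms, fn (string $w): ?string => $dictionary[$w] ?? null, 2);
```

In a real call the counters would come from `applyLemmasToVocabulary($languageId, $languageCode, $batchSize)`; a smaller `$batchSize` presumably trades throughput for lower memory use per batch.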

bulkUpdateTermStatus()

Apply a status to multiple terms (for bulk family updates).

public bulkUpdateTermStatus(array<string|int, int> $termIds, int $status) : int
Parameters
$termIds : array<string|int, int>

Term IDs to update

$status : int

New status

Return values
int

Number of terms updated

clearLemmas()

Clear all lemmas for a language.

public clearLemmas(int $languageId) : int
Parameters
$languageId : int

Language ID

Return values
int

Number of terms affected

findPotentialLemmaGroups()

Find terms that might benefit from lemmatization.

public findPotentialLemmaGroups(int $languageId[, int $limit = 20 ]) : array<int, array{base: string, variants: string[]}>

Identifies terms with similar text that could share a lemma.

Parameters
$languageId : int

Language ID

$limit : int = 20

Maximum suggestions

Return values
array<int, array{base: string, variants: string[]}>

findWordIdByLemma()

Find a word ID by its lemma.

public findWordIdByLemma(int $languageId, string $lemmaLc) : int|null

Returns the ID of a word that has this lemma, preferring the base form.

Parameters
$languageId : int

Language ID

$lemmaLc : string

Lowercase lemma to match

Return values
int|null

Word ID or null if not found

getAllNlpLanguages()

Get all languages potentially supported by NLP (including uninstalled models).

public getAllNlpLanguages() : array<string|int, string>
Return values
array<string|int, string>

getAvailableLanguages()

Get all languages with available lemmatization support.

public getAvailableLanguages() : array<string|int, string>
Return values
array<string|int, string>

Array of language codes

getLemmaAggregateStats()

Get aggregate lemma statistics for a language.

public getLemmaAggregateStats(int $languageId) : array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}
Parameters
$languageId : int

Language ID

Return values
array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array}
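
To make the fields of the returned shape concrete, here is a sketch that derives them from an in-memory list of terms. The function name `aggregateLemmaStatsSketch()` and the input layout are hypothetical, and the keying of `status_distribution` by status code is an assumption, since the documented type is only `array`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical illustration of the aggregate fields; not the service's query.
 *
 * @param array<int, array{lemmaLc: string, status: int}> $terms
 * @return array{total_lemmas: int, single_form: int, multi_form: int, avg_forms_per_lemma: float, status_distribution: array<int, int>}
 */
function aggregateLemmaStatsSketch(array $terms): array
{
    $formsPerLemma = [];
    $statusDistribution = [];
    foreach ($terms as $term) {
        $formsPerLemma[$term['lemmaLc']] = ($formsPerLemma[$term['lemmaLc']] ?? 0) + 1;
        // Assumption: the distribution maps status code => number of terms.
        $statusDistribution[$term['status']] = ($statusDistribution[$term['status']] ?? 0) + 1;
    }
    ksort($statusDistribution);

    $totalLemmas = count($formsPerLemma);
    $singleForm = count(array_filter($formsPerLemma, fn (int $n): bool => $n === 1));

    return [
        'total_lemmas' => $totalLemmas,
        'single_form' => $singleForm,
        'multi_form' => $totalLemmas - $singleForm,
        // Guard against division by zero for languages without lemmas.
        'avg_forms_per_lemma' => $totalLemmas > 0 ? round(count($terms) / $totalLemmas, 2) : 0.0,
        'status_distribution' => $statusDistribution,
    ];
}

$aggregate = aggregateLemmaStatsSketch([
    ['lemmaLc' => 'run', 'status' => 1],
    ['lemmaLc' => 'run', 'status' => 2],
    ['lemmaLc' => 'run', 'status' => 1],
    ['lemmaLc' => 'go', 'status' => 5],
    ['lemmaLc' => 'be', 'status' => 99],
    ['lemmaLc' => 'be', 'status' => 1],
]);
```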

getLemmaStatistics()

Get lemma statistics for a language.

public getLemmaStatistics(int $languageId) : array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}
Parameters
$languageId : int

Language ID

Return values
array{total_terms: int, with_lemma: int, without_lemma: int, unique_lemmas: int}

getLemmatizerForLanguage()

Get the best available lemmatizer for a language.

public getLemmatizerForLanguage(string $languageCode) : LemmatizerInterface

Uses the LemmatizerFactory to select the appropriate lemmatizer based on language configuration and availability.

Parameters
$languageCode : string

ISO language code

Return values
LemmatizerInterface

getNlpSupportedLanguages()

Get languages supported by the NLP service.

public getNlpSupportedLanguages() : array<string|int, string>
Return values
array<string|int, string>

getSuggestedFamilyUpdate()

Suggest status update for related forms when one form's status changes.

public getSuggestedFamilyUpdate(int $termId, int $newStatus) : array{suggestion: string, affected_count: int, term_ids: int[]}

Based on the "suggested" inheritance mode from the proposal.

Parameters
$termId : int

Term that was updated

$newStatus : int

The new status that was set

Return values
array{suggestion: string, affected_count: int, term_ids: int[]}

getUnmatchedStatistics()

Get statistics about unmatched text items that could benefit from lemma linking.

public getUnmatchedStatistics(int $languageId) : array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}
Parameters
$languageId : int

Language ID

Return values
array{unmatched_count: int, unique_words: int, matchable_by_lemma: int}

getWordFamilies()

Get words grouped by their lemma.

public getWordFamilies(int $languageId[, int $limit = 50 ]) : array<string, array{lemma: string, count: int, terms: string[]}>
Parameters
$languageId : int

Language ID

$limit : int = 50

Maximum number of lemma groups to return

Return values
array<string, array{lemma: string, count: int, terms: string[]}>
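
The return shape can be pictured as a grouping of terms by lowercase lemma. The sketch below builds that structure from an in-memory array; `groupByLemmaSketch()` and its input layout are hypothetical, and ordering the groups by size before applying the limit is an assumption, not documented behavior.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical grouping helper; not the actual LemmaService code.
 *
 * @param array<int, array{text: string, lemma: string, lemmaLc: string}> $terms
 * @return array<string, array{lemma: string, count: int, terms: string[]}>
 */
function groupByLemmaSketch(array $terms, int $limit = 50): array
{
    $families = [];
    foreach ($terms as $term) {
        $key = $term['lemmaLc'];
        // Create the group on first sight, then append the surface form.
        $families[$key] ??= ['lemma' => $term['lemma'], 'count' => 0, 'terms' => []];
        $families[$key]['count']++;
        $families[$key]['terms'][] = $term['text'];
    }

    // Assumption: larger families come first.
    uasort($families, fn (array $a, array $b): int => $b['count'] <=> $a['count']);

    return array_slice($families, 0, max(0, $limit), true);
}

$families = groupByLemmaSketch([
    ['text' => 'went', 'lemma' => 'go', 'lemmaLc' => 'go'],
    ['text' => 'running', 'lemma' => 'run', 'lemmaLc' => 'run'],
    ['text' => 'runs', 'lemma' => 'run', 'lemmaLc' => 'run'],
    ['text' => 'is', 'lemma' => 'be', 'lemmaLc' => 'be'],
], 2);
```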

getWordFamily()

Get the word family (all words sharing a lemma).

public getWordFamily(int $languageId, string $lemmaLc) : array<string|int, Term>
Parameters
$languageId : int

Language ID

$lemmaLc : string

Lowercase lemma

Return values
array<string|int, Term>

Array of terms in the word family

getWordFamilyByLemma()

Get word family by lemma directly (without requiring a term ID).

public getWordFamilyByLemma(int $languageId, string $lemmaLc) : array<string|int, mixed>|null
Parameters
$languageId : int

Language ID

$lemmaLc : string

Lowercase lemma

Return values
array<string|int, mixed>|null

getWordFamilyDetails()

Get detailed word family information for a term.

public getWordFamilyDetails(int $termId) : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null

Returns all words sharing the same lemma with full details for display.

Parameters
$termId : int

Term ID to get family for

Return values
array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null

getWordFamilyList()

Get paginated list of word families for a language.

public getWordFamilyList(int $languageId[, int $page = 1 ][, int $perPage = 50 ][, string $sortBy = 'lemma' ][, string $sortDir = 'asc' ]) : array{families: array, pagination: array}
Parameters
$languageId : int

Language ID

$page : int = 1

Page number (1-based)

$perPage : int = 50

Items per page

$sortBy : string = 'lemma'

Sort field: 'lemma', 'count', 'status'

$sortDir : string = 'asc'

Sort direction: 'asc', 'desc'

Return values
array{families: array, pagination: array}
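
As an illustration of how the paging and sorting parameters interact, the sketch below sorts and slices an in-memory list. `paginateFamiliesSketch()` is hypothetical, and the keys inside `pagination` (`page`, `per_page`, `total`, `total_pages`) are assumptions, because the documented type does not name them.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical pagination helper; not the actual LemmaService code.
 *
 * @param array<int, array{lemma: string, count: int, status: int}> $families
 * @return array{families: array, pagination: array}
 */
function paginateFamiliesSketch(
    array $families,
    int $page = 1,
    int $perPage = 50,
    string $sortBy = 'lemma',
    string $sortDir = 'asc'
): array {
    // Whitelist the sort inputs and fall back to the documented defaults.
    $sortBy = in_array($sortBy, ['lemma', 'count', 'status'], true) ? $sortBy : 'lemma';
    $direction = strtolower($sortDir) === 'desc' ? -1 : 1;
    $perPage = max(1, $perPage);
    $page = max(1, $page);

    usort($families, fn (array $a, array $b): int => $direction * ($a[$sortBy] <=> $b[$sortBy]));

    $total = count($families);

    return [
        // Pages are 1-based, so page 1 starts at offset 0.
        'families' => array_slice($families, ($page - 1) * $perPage, $perPage),
        'pagination' => [
            'page' => $page,
            'per_page' => $perPage,
            'total' => $total,
            'total_pages' => (int) ceil($total / $perPage),
        ],
    ];
}

$list = paginateFamiliesSketch([
    ['lemma' => 'run', 'count' => 3, 'status' => 1],
    ['lemma' => 'walk', 'count' => 1, 'status' => 2],
    ['lemma' => 'be', 'count' => 8, 'status' => 5],
    ['lemma' => 'go', 'count' => 5, 'status' => 3],
    ['lemma' => 'see', 'count' => 2, 'status' => 1],
], 2, 2, 'count', 'desc');
```

Whitelisting `$sortBy` and `$sortDir` matters most when the values end up in an SQL `ORDER BY` clause, where they cannot be bound as parameters.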

isAvailableForLanguage()

Check if lemmatization is available for a language.

public isAvailableForLanguage(string $languageCode) : bool
Parameters
$languageCode : string

ISO language code

Return values
bool

True if lemmatization is available

isNlpServiceAvailable()

Check if the NLP service (spaCy) is available.

public isNlpServiceAvailable() : bool
Return values
bool

linkTextItemsByLemma()

Link unmatched text items to words by lemma.

public linkTextItemsByLemma(int $languageId, string $languageCode[, int|null $textId = null ]) : array{linked: int, unmatched: int, errors: int}

When a text item doesn't have an exact word match (Ti2WoID IS NULL), this method tries to find a word whose lemma matches the text item's lemmatized form.

Example: Text item "runs" with no exact match → lemmatize to "run" → find word with WoLemmaLC = "run" → link text item to that word

Parameters
$languageId : int

Language ID

$languageCode : string

ISO language code for lemmatizer

$textId : int|null = null

Optional: limit to specific text

Return values
array{linked: int, unmatched: int, errors: int}
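
The lemmatize, look up, link flow described above can be sketched against in-memory data. Everything here is a hypothetical stand-in: `linkByLemmaSketch()`, the `wordId` field (playing the role of Ti2WoID), and the lemma lookup table (playing the role of WoLemmaLC). Counting a lemmatizer exception as an error is also an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical in-memory version of lemma-based linking.
 *
 * @param array<int, array{text: string, wordId: ?int}> $textItems
 * @param array<string, int> $wordIdByLemmaLc lowercase lemma => word ID
 * @param callable(string): ?string $lemmatize
 * @return array{linked: int, unmatched: int, errors: int}
 */
function linkByLemmaSketch(array &$textItems, array $wordIdByLemmaLc, callable $lemmatize): array
{
    $result = ['linked' => 0, 'unmatched' => 0, 'errors' => 0];

    foreach ($textItems as &$item) {
        if ($item['wordId'] !== null) {
            continue; // already linked through an exact match
        }
        try {
            $lemma = $lemmatize($item['text']);
        } catch (Throwable $e) {
            $result['errors']++; // assumption: lemmatizer failures are counted, not rethrown
            continue;
        }
        $lemmaLc = $lemma === null ? null : mb_strtolower($lemma, 'UTF-8');
        if ($lemmaLc === null || !isset($wordIdByLemmaLc[$lemmaLc])) {
            $result['unmatched']++;
            continue;
        }
        $item['wordId'] = $wordIdByLemmaLc[$lemmaLc];
        $result['linked']++;
    }
    unset($item); // break the reference left by foreach

    return $result;
}

$textItems = [
    ['text' => 'runs', 'wordId' => null],
    ['text' => 'blorp', 'wordId' => null],
    ['text' => 'run', 'wordId' => 7],
];
$dictionary = ['runs' => 'run']; // toy lemmatizer data
$linkStats = linkByLemmaSketch(
    $textItems,
    ['run' => 7],
    fn (string $w): ?string => $dictionary[$w] ?? null
);
```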

linkTextItemsByLemmaSql()

Link text items directly using SQL (efficient for large datasets).

public linkTextItemsByLemmaSql(int $languageId[, int|null $textId = null ]) : int

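This method links text items to words where the text item's lowercase text matches a word's lemma. No lemmatizer is involved, so it only catches text items whose text is itself a lemma. For large datasets it is more efficient than the PHP-based approach of linkTextItemsByLemma().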

Parameters
$languageId : int

Language ID

$textId : int|null = null

Optional text ID filter

Return values
int

Number of text items linked

propagateLemma()

Copy lemma from one term to all related terms.

public propagateLemma(int $termId, int $languageId, string $languageCode) : int

When a user sets the lemma "run" for "running", this method can propagate that lemma to other forms such as "runs" and "ran", provided the lemmatizer suggests the same lemma for them.

Parameters
$termId : int

Source term ID

$languageId : int

Language ID

$languageCode : string

Language code for lemmatizer

Return values
int

Number of terms updated
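
The propagation rule can be sketched as follows, under two assumptions that the documentation does not confirm: terms that already have a lemma are left untouched, and a candidate is updated only when the lemmatizer's suggestion equals the source term's lemma exactly. `propagateLemmaSketch()` and the in-memory `$terms` array are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical in-memory version of lemma propagation.
 *
 * @param array<int, array{text: string, lemma: ?string}> $terms keyed by term ID
 * @param callable(string): ?string $lemmatize
 * @return int number of terms updated
 */
function propagateLemmaSketch(array &$terms, int $sourceId, callable $lemmatize): int
{
    $sourceLemma = $terms[$sourceId]['lemma'] ?? null;
    if ($sourceLemma === null) {
        return 0; // nothing to propagate
    }

    $updated = 0;
    foreach ($terms as $id => &$term) {
        // Assumption: existing lemmas are never overwritten.
        if ($id === $sourceId || $term['lemma'] !== null) {
            continue;
        }
        if ($lemmatize($term['text']) === $sourceLemma) {
            $term['lemma'] = $sourceLemma;
            $updated++;
        }
    }
    unset($term); // break the reference left by foreach

    return $updated;
}

$vocabulary = [
    1 => ['text' => 'running', 'lemma' => 'run'],
    2 => ['text' => 'runs', 'lemma' => null],
    3 => ['text' => 'ran', 'lemma' => null],
    4 => ['text' => 'walks', 'lemma' => null],
];
$dictionary = ['runs' => 'run', 'ran' => 'run', 'walks' => 'walk']; // toy lemmatizer data
$propagated = propagateLemmaSketch($vocabulary, 1, fn (string $w): ?string => $dictionary[$w] ?? null);
```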

setLemma()

Set lemma for a specific term.

public setLemma(int $termId, string $lemma) : bool
Parameters
$termId : int

Term ID

$lemma : string

The lemma to set

Return values
bool

True if updated

suggestLemma()

Suggest a lemma for a word.

public suggestLemma(string $word, string $languageCode) : string|null
Parameters
$word : string

The word to lemmatize

$languageCode : string

ISO language code (e.g., 'en', 'de')

Return values
string|null

The suggested lemma, or null if not found

suggestLemmasBatch()

Suggest lemmas for multiple words.

public suggestLemmasBatch(array<string|int, string> $words, string $languageCode) : array<string, string|null>
Parameters
$words : array<string|int, string>

Array of words

$languageCode : string

ISO language code

Return values
array<string, string|null>

Word => lemma mapping
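
A small sketch of the word => lemma mapping, using a hypothetical `suggestBatchSketch()` helper and a callable in place of the real lemmatizer. Words without a suggestion map to null, mirroring the `string|null` value type; deduplicating the input is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical batch helper; not the actual LemmaService code.
 *
 * @param array<int|string, string> $words
 * @param callable(string): ?string $lemmatize
 * @return array<string, string|null>
 */
function suggestBatchSketch(array $words, callable $lemmatize): array
{
    $suggestions = [];
    // Assumption: each distinct word is lemmatized once.
    foreach (array_unique($words) as $word) {
        $suggestions[$word] = $lemmatize($word);
    }

    return $suggestions;
}

$dictionary = ['mice' => 'mouse', 'went' => 'go']; // toy lemmatizer data
$suggestions = suggestBatchSketch(
    ['mice', 'went', 'mice', 'qwrt'],
    fn (string $w): ?string => $dictionary[$w] ?? null
);
```

One PHP detail to keep in mind when consuming such a mapping: array keys that look like decimal integers (for example "2024") are cast to int, so code running under strict types may need to cast keys back to string.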

updateWordFamilyStatus()

Update status for all words in a word family.

public updateWordFamilyStatus(int $languageId, string $lemmaLc, int $status) : int
Parameters
$languageId : int

Language ID

$lemmaLc : string

Lowercase lemma

$status : int

New status (1-5, 98, 99)

Return values
int

Number of words updated

buildSingleTermFamily()

Build a "family" response for a term without a lemma.

private buildSingleTermFamily(int $termId) : array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null
Parameters
$termId : int

Term ID

Return values
array{lemma: string, lemmaLc: string, langId: int, terms: array, stats: array}|null

fetchTermsWithoutLemma()

Fetch terms without a lemma.

private fetchTermsWithoutLemma(int $languageId, int $limit, int $offset) : array<int, array<string, mixed>>
Parameters
$languageId : int

Language ID

$limit : int

Maximum number to fetch

$offset : int

Starting offset

Return values
array<int, array<string, mixed>>

fetchUnmatchedTextItems()

Fetch unmatched text items (Ti2WoID IS NULL) for a language.

private fetchUnmatchedTextItems(int $languageId[, int|null $textId = null ]) : array<int, array<string, mixed>>
Parameters
$languageId : int

Language ID

$textId : int|null = null

Optional text ID filter

Return values
array<int, array<string, mixed>>

getWordOccurrenceCount()

Get occurrence count for a word across all texts.

private getWordOccurrenceCount(int $wordId) : int
Parameters
$wordId : int

Word ID

Return values
int

linkItemsToWord()

Link text items to a word.

private linkItemsToWord(array<int, array<string, mixed>> $items, int $wordId) : int
Parameters
$items : array<int, array<string, mixed>>

Text items to link

$wordId : int

Word ID to link to

Return values
int

Number of items linked

updateTermLemma()

Update the lemma for a term.

private updateTermLemma(int $termId, string $lemma) : void
Parameters
$termId : int

Term ID

$lemma : string

The lemma to set
