WiktionaryEnrichmentService
Enriches imported vocabulary with translations from kaikki.org (Wiktextract structured data) or monolingual definitions from Wiktionary APIs.
Designed to be called in small batches via AJAX polling, so the UI can show progress without blocking.
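As a rough illustration of how a polling endpoint might turn one batch result into a progress figure, the sketch below computes a percentage from the `remaining` and `total` keys of the documented return shape. The helper name `progressPercent()` and the sample numbers are hypothetical, and the assumption that `total - remaining` equals the work already done is not confirmed by this page.

```php
<?php
/**
 * Hypothetical helper: convert one batch result into a progress percentage.
 *
 * @param array{enriched: int, failed: int, remaining: int, total: int, warning: string} $result
 */
function progressPercent(array $result): int
{
    if ($result['total'] <= 0) {
        return 100; // nothing to enrich, so report completion
    }
    // Assumption: words that are no longer "remaining" count as done.
    $done = $result['total'] - $result['remaining'];

    return (int) floor(100 * $done / $result['total']);
}

// Sample values shaped like the documented return array (made up for illustration).
$result = ['enriched' => 18, 'failed' => 2, 'remaining' => 60, 'total' => 100, 'warning' => ''];
echo progressPercent($result), "%\n";
```

A client could keep requesting batches until `remaining` reaches zero or `warning` is non-empty.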
Table of Contents
Constants
- BATCH_SIZE = 20
- FETCH_TIMEOUT = 10
- KAIKKI_BASE_URL = 'https://kaikki.org/dictionary'
- MAX_CONSECUTIVE_FAILURES = 5
- WIKTIONARY_API_TEMPLATE = 'https://%s.wiktionary.org/w/api.php'
Methods
- buildKaikkiUrl() : string
- Build the kaikki.org URL for a word.
- countTotal() : int
- Count total words for a language (for progress calculation).
- countUnenriched() : int
- Count remaining unenriched words for progress tracking.
- enrichBatchDefinition() : array{enriched: int, failed: int, remaining: int, total: int, warning: string}
- Enrich a batch of words with monolingual definitions from Wiktionary.
- enrichBatchTranslation() : array{enriched: int, failed: int, remaining: int, total: int, warning: string}
- Enrich a batch of words with English translations from kaikki.org.
- fetchKaikkiTranslation() : string|null
- Fetch English translation from kaikki.org for a single word.
- fetchWiktionaryDefinition() : string|null
- Fetch monolingual definition from Wiktionary API.
- getUnenrichedWords() : array<int, array{WoID: int, WoText: string}>
- Get the next batch of unenriched words for a language.
- parseKaikkiResponse() : string|null
- Parse kaikki.org JSONL response to extract the first English gloss.
- parseWikitext() : string|null
- Parse wikitext to extract the first definition line.
- cleanWikitext() : string
- Clean wikitext markup to produce readable text.
- fetchFromWiktionaryApi() : string|null
- Fetch a definition from the Wiktionary parse API.
- httpGet() : string|null
- Perform an HTTP GET with timeout.
- updateTranslation() : void
- Update a word's translation in the database.
Constants
BATCH_SIZE
private mixed BATCH_SIZE = 20
FETCH_TIMEOUT
private mixed FETCH_TIMEOUT = 10
KAIKKI_BASE_URL
private mixed KAIKKI_BASE_URL = 'https://kaikki.org/dictionary'
MAX_CONSECUTIVE_FAILURES
private mixed MAX_CONSECUTIVE_FAILURES = 5
WIKTIONARY_API_TEMPLATE
private mixed WIKTIONARY_API_TEMPLATE = 'https://%s.wiktionary.org/w/api.php'
Methods
buildKaikkiUrl()
Build the kaikki.org URL for a word.
public
buildKaikkiUrl(string $word, string $kaikkiLangName) : string
Path format: /dictionary/{Language}/meaning/{w[0]}/{w[0:2]}/{word}.jsonl
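The path format can be sketched as follows. This is a minimal illustration, not the service's actual code: the function name `buildKaikkiUrlSketch()` and the constant `KAIKKI_BASE_URL_SKETCH` are hypothetical, and percent-encoding each path segment is an assumption. The first-character and first-two-character segments are taken per Unicode character rather than per byte.

```php
<?php
// Mirrors the documented KAIKKI_BASE_URL value; the constant name is hypothetical.
const KAIKKI_BASE_URL_SKETCH = 'https://kaikki.org/dictionary';

function buildKaikkiUrlSketch(string $word, string $kaikkiLangName): string
{
    // Split into Unicode characters so multi-byte first letters stay intact.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false || $chars === []) {
        throw new InvalidArgumentException('Word must be a non-empty UTF-8 string.');
    }

    $first = $chars[0];                                  // {w[0]}
    $firstTwo = implode('', array_slice($chars, 0, 2));  // {w[0:2]}

    // Assumption: every path segment is percent-encoded.
    return sprintf(
        '%s/%s/meaning/%s/%s/%s.jsonl',
        KAIKKI_BASE_URL_SKETCH,
        rawurlencode($kaikkiLangName),
        rawurlencode($first),
        rawurlencode($firstTwo),
        rawurlencode($word)
    );
}

echo buildKaikkiUrlSketch('casa', 'Spanish'), "\n";
// prints https://kaikki.org/dictionary/Spanish/meaning/c/ca/casa.jsonl
```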
Parameters
- $word : string
- $kaikkiLangName : string
Return values
string
countTotal()
Count total words for a language (for progress calculation).
public
countTotal(int $langId) : int
Parameters
- $langId : int
Return values
int
countUnenriched()
Count remaining unenriched words for progress tracking.
public
countUnenriched(int $langId) : int
Parameters
- $langId : int
Return values
int
enrichBatchDefinition()
Enrich a batch of words with monolingual definitions from Wiktionary.
public
enrichBatchDefinition(int $langId, string $languageName) : array{enriched: int, failed: int, remaining: int, total: int, warning: string}
Parameters
- $langId : int
- $languageName : string
Return values
array{enriched: int, failed: int, remaining: int, total: int, warning: string}
enrichBatchTranslation()
Enrich a batch of words with English translations from kaikki.org.
public
enrichBatchTranslation(int $langId, string $languageName) : array{enriched: int, failed: int, remaining: int, total: int, warning: string}
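The page does not describe the batch loop itself. The sketch below shows one plausible shape for it, assuming that the `MAX_CONSECUTIVE_FAILURES` constant is used to stop a batch early and report the reason in `warning`. That behaviour is inferred from the constant's name only. `runBatchSketch()` and its callables are hypothetical stand-ins for the fetch and database-update steps, and the `remaining` and `total` keys are left out because they would come from the count methods.

```php
<?php
/**
 * Hypothetical batch loop; not the service's actual implementation.
 *
 * @param array<int, array{WoID: int, WoText: string}> $words  Rows shaped like getUnenrichedWords() output.
 * @param callable $fetch  fn(string $word): ?string, returns null on failure.
 * @param callable $store  fn(int $wordId, string $translation): void.
 * @return array{enriched: int, failed: int, warning: string}
 */
function runBatchSketch(array $words, callable $fetch, callable $store, int $maxConsecutiveFailures = 5): array
{
    $enriched = 0;
    $failed = 0;
    $streak = 0;   // failures in a row, reset on every success
    $warning = '';

    foreach ($words as $word) {
        $text = $fetch($word['WoText']);
        if ($text === null) {
            $failed++;
            $streak++;
            if ($streak >= $maxConsecutiveFailures) {
                // Assumption: a run of failures suggests the remote source is unavailable.
                $warning = 'Too many consecutive failures; batch stopped early.';
                break;
            }
            continue;
        }
        $streak = 0;
        $store($word['WoID'], $text);
        $enriched++;
    }

    return ['enriched' => $enriched, 'failed' => $failed, 'warning' => $warning];
}

// Usage with in-memory stubs instead of HTTP and database calls.
$words = [['WoID' => 1, 'WoText' => 'casa'], ['WoID' => 2, 'WoText' => 'xyzzy']];
$saved = [];
$summary = runBatchSketch(
    $words,
    fn (string $w): ?string => $w === 'casa' ? 'house' : null,
    function (int $id, string $t) use (&$saved): void { $saved[$id] = $t; }
);
print_r($summary);
```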
Parameters
- $langId : int
- $languageName : string
Return values
array{enriched: int, failed: int, remaining: int, total: int, warning: string}
fetchKaikkiTranslation()
Fetch English translation from kaikki.org for a single word.
public
fetchKaikkiTranslation(string $word, string $kaikkiLangName) : string|null
Parameters
- $word : string
- $kaikkiLangName : string
Return values
string|null —First English gloss, or null on failure
fetchWiktionaryDefinition()
Fetch monolingual definition from Wiktionary API.
public
fetchWiktionaryDefinition(string $word, string $wiktCode, string $kaikkiLangName) : string|null
Strategy: first try kaikki.org for the raw_glosses/glosses in the target language. If that fails, fall back to the Wiktionary parse API and extract the first definition line from wikitext.
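The two-step strategy amounts to trying sources in order and returning the first usable result. A minimal sketch, assuming each source can be wrapped in a closure that returns a string or null; `firstAvailable()` is a hypothetical helper, and the closures below are stubs standing in for the kaikki.org lookup and the Wiktionary parse API call.

```php
<?php
/**
 * Hypothetical helper: call each source in order and return the first non-empty result.
 * Later sources are never invoked once an earlier one succeeds.
 */
function firstAvailable(callable ...$sources): ?string
{
    foreach ($sources as $source) {
        $value = $source();
        if (is_string($value) && $value !== '') {
            return $value;
        }
    }

    return null;
}

// Stubs: the primary source fails, so the fallback supplies the definition.
$calls = [];
$definition = firstAvailable(
    function () use (&$calls): ?string { $calls[] = 'kaikki'; return null; },
    function () use (&$calls): ?string { $calls[] = 'wiktionary'; return 'Edificio para habitar.'; }
);
echo $definition, "\n";
```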
Parameters
- $word : string
- $wiktCode : string
- $kaikkiLangName : string
Return values
string|null —First definition, or null on failure
getUnenrichedWords()
Get the next batch of unenriched words for a language.
public
getUnenrichedWords(int $langId[, int $batchSize = self::BATCH_SIZE ]) : array<int, array{WoID: int, WoText: string}>
Parameters
- $langId : int
- $batchSize : int = self::BATCH_SIZE
Return values
array<int, array{WoID: int, WoText: string}>
parseKaikkiResponse()
Parse kaikki.org JSONL response to extract the first English gloss.
public
parseKaikkiResponse(string $jsonl) : string|null
Prefers non-form-of entries (lexical definitions over inflection forms).
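One way to express that preference is sketched below. It assumes the Wiktextract JSONL layout in which every line is a JSON object with a `senses` list, each sense carries a `glosses` list, and inflected forms carry a non-empty `form_of` key. The function name `parseKaikkiJsonlSketch()` is hypothetical and the real method may select glosses differently.

```php
<?php
function parseKaikkiJsonlSketch(string $jsonl): ?string
{
    $formOfFallback = null; // first gloss seen on a form-of sense, used only as a last resort

    foreach (explode("\n", $jsonl) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $entry = json_decode($line, true);
        if (!is_array($entry) || !is_array($entry['senses'] ?? null)) {
            continue; // skip malformed lines instead of failing the whole response
        }
        foreach ($entry['senses'] as $sense) {
            if (!is_array($sense)) {
                continue;
            }
            $gloss = $sense['glosses'][0] ?? null;
            if (!is_string($gloss) || $gloss === '') {
                continue;
            }
            if (empty($sense['form_of'])) {
                return $gloss; // lexical definition wins immediately
            }
            $formOfFallback ??= $gloss;
        }
    }

    return $formOfFallback;
}

// Two entries for the same spelling: an inflected verb form first, then the noun.
$jsonl = '{"word":"casa","pos":"verb","senses":[{"glosses":["inflection of casar"],"form_of":[{"word":"casar"}]}]}' . "\n"
       . '{"word":"casa","pos":"noun","senses":[{"glosses":["house"]}]}';
echo parseKaikkiJsonlSketch($jsonl), "\n"; // prints house
```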
Parameters
- $jsonl : string
Return values
string|null —First gloss or null
parseWikitext()
Parse wikitext to extract the first definition line.
public
parseWikitext(string $wikitext) : string|null
Wikitext definition lines begin with "#" and look like:
# [[house]]
# {{lb|es|architecture}} [[building]]
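A minimal sketch of this extraction, assuming definition lines start with a single `#` (lines starting with `#:`, `#*` or `##` are treated as examples, quotations or sub-senses and skipped). The function name `firstDefinitionLineSketch()` is hypothetical, and the cleaning rules are a simplified guess at what cleanWikitext() does: templates are dropped and links are reduced to their visible label.

```php
<?php
function firstDefinitionLineSketch(string $wikitext): ?string
{
    foreach (explode("\n", $wikitext) as $line) {
        // Match "# text" but not "#:", "#*" or "##".
        if (preg_match('/^#\s*([^#:*\s].*)$/u', rtrim($line), $m) !== 1) {
            continue;
        }
        $text = $m[1];
        // Drop templates such as {{lb|es|architecture}}.
        $text = (string) preg_replace('/\{\{[^{}]*\}\}/u', '', $text);
        // [[target|label]] becomes "label"; [[house]] becomes "house".
        $text = (string) preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/u', '$1', $text);
        // Collapse leftover whitespace.
        $text = trim((string) preg_replace('/\s+/u', ' ', $text));
        if ($text !== '') {
            return $text;
        }
    }

    return null;
}

$wikitext = implode("\n", [
    '==Spanish==',
    '===Noun===',
    '# {{lb|es|architecture}} [[building]]',
    '#: example sentence',
    '# [[house]]',
]);
echo firstDefinitionLineSketch($wikitext), "\n"; // prints building
```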
Parameters
- $wikitext : string
Return values
string|null —Cleaned definition or null
cleanWikitext()
Clean wikitext markup to produce readable text.
private
cleanWikitext(string $text) : string
Parameters
- $text : string
Return values
string
fetchFromWiktionaryApi()
Fetch a definition from the Wiktionary parse API.
private
fetchFromWiktionaryApi(string $word, string $wiktCode) : string|null
Uses {lang}.wiktionary.org to get a monolingual definition.
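The request URL could be assembled along these lines. The `action=parse`, `page`, `prop=wikitext` and `format=json` parameters are standard MediaWiki API parameters, but whether the service sends exactly this set is an assumption. The names `buildWiktionaryParseUrlSketch()` and `WIKTIONARY_API_TEMPLATE_SKETCH` are hypothetical.

```php
<?php
// Mirrors the documented WIKTIONARY_API_TEMPLATE value; the constant name is hypothetical.
const WIKTIONARY_API_TEMPLATE_SKETCH = 'https://%s.wiktionary.org/w/api.php';

function buildWiktionaryParseUrlSketch(string $word, string $wiktCode): string
{
    // Assumption: the parse API is asked for raw wikitext in JSON form.
    $query = http_build_query([
        'action' => 'parse',
        'page'   => $word,
        'prop'   => 'wikitext',
        'format' => 'json',
    ], '', '&', PHP_QUERY_RFC3986);

    return sprintf(WIKTIONARY_API_TEMPLATE_SKETCH, $wiktCode) . '?' . $query;
}

echo buildWiktionaryParseUrlSketch('casa', 'es'), "\n";
// prints https://es.wiktionary.org/w/api.php?action=parse&page=casa&prop=wikitext&format=json
```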
Parameters
- $word : string
- $wiktCode : string
Return values
string|null —First definition line or null
httpGet()
Perform an HTTP GET with timeout.
private
httpGet(string $url) : string|null
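A GET with a timeout that maps every failure to null can be sketched with PHP's stream wrappers. The page does not say which HTTP client the service uses, so treat this as one possible approach. `httpGetSketch()` and the User-Agent string are hypothetical; the default of 10 seconds mirrors the documented FETCH_TIMEOUT value.

```php
<?php
function httpGetSketch(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeoutSeconds,                      // seconds before the read gives up
            'header'  => "User-Agent: enrichment-sketch/0.1\r\n", // hypothetical identifier
        ],
    ]);

    // file_get_contents() returns false on network errors, timeouts and HTTP error statuses;
    // the @ operator suppresses the accompanying warning so callers only see null.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

Returning null rather than throwing keeps the calling loop simple: a failed fetch is counted and the batch moves on to the next word.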
Parameters
- $url : string
Return values
string|null —Response body or null on failure
updateTranslation()
Update a word's translation in the database.
private
updateTranslation(int $wordId, string $translation) : void
Parameters
- $wordId : int
- $translation : string