DifficultyEstimationService
in package Lwt Modules Text Application Services

Estimates text difficulty relative to a user's known vocabulary.

Constants

EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']: Subject keywords that indicate easier texts.
HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']: Subject keywords that indicate harder texts.
LOOKUP_BATCH_SIZE = 500: Maximum number of words per vocabulary lookup batch.
SAMPLE_WORD_COUNT = 2000: Maximum number of words to sample for accurate coverage.

Methods

analyzeTextSample() : array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}: Analyze a text sample for accurate vocabulary coverage.
classifySubjectsPublic() : string: Classify subjects into a difficulty tier (public API).
estimateQuickTiers() : array<int, string>: Estimate quick difficulty tiers for a batch of books.
classifySubjects() : string: Classify subject list into a difficulty tier.
computeQuickTier() : string: Compute quick difficulty tier from vocabulary size and subjects.
fetchTextContent() : string|null: Fetch text content from a URL.
getKnownWordCount() : int: Count words the user knows for a language.
getLanguageParseSettings() : array{regex: string, splitEachChar: bool}|null: Get language parsing settings for tokenization.
labelFromCoverage() : string: Map coverage percentage to a human-readable difficulty label.
lookupKnownWords() : array<int, string>: Look up which words from a list the user already knows.
tokenize() : array<int, string>: Tokenize text into words using the language's word character regex.

EASY_SUBJECTS

Subject keywords that indicate easier texts.


    private array<int, string> EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']

HARD_SUBJECTS

Subject keywords that indicate harder texts.


    private array<int, string> HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']

LOOKUP_BATCH_SIZE

Maximum number of words per vocabulary lookup batch.


    private int LOOKUP_BATCH_SIZE = 500

SAMPLE_WORD_COUNT

Maximum number of words to sample for accurate coverage.


    private int SAMPLE_WORD_COUNT = 2000

analyzeTextSample()

Analyze a text sample for accurate vocabulary coverage.


    public analyzeTextSample(string $textUrl, int $languageId) : array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}

Fetches the text, extracts a sample, tokenizes it, and computes the percentage of unique words the user already knows.

Parameters

$textUrl : string: URL of the plain text
$languageId : int: Language ID (for tokenization and vocab lookup)

Return values

array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
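
The coverage step described above can be sketched roughly as follows. This is a minimal illustration, not the service's code: `sketchCoverage()`, its `$maxUnknown` parameter, and the error message are hypothetical, and `difficulty_label` is left out because the label strings are not documented here. It assumes the tokens are already lowercased and that the known-word list comes from a vocabulary lookup.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: compute unique-word coverage for a tokenized sample.
 * Key names follow the documented return shape of analyzeTextSample().
 *
 * @param list<string> $tokens     Lowercase word tokens from the sample
 * @param list<string> $knownWords Words found in the user's vocabulary
 * @param int          $maxUnknown Cap on returned unknown words (assumed value)
 * @return array<string, mixed>
 */
function sketchCoverage(array $tokens, array $knownWords, int $maxUnknown = 20): array
{
    if ($tokens === []) {
        // The wording of the error message is an assumption.
        return ['error' => 'No words found in sample'];
    }

    // Coverage is measured over unique words, not running words.
    $unique = array_values(array_unique($tokens));
    $knownSet = array_flip($knownWords);

    $unknown = [];
    foreach ($unique as $word) {
        if (!isset($knownSet[$word])) {
            $unknown[] = $word;
        }
    }

    $totalUnique = count($unique);
    $knownCount = $totalUnique - count($unknown);

    return [
        'total_words' => count($tokens),
        'total_unique_words' => $totalUnique,
        'known_words' => $knownCount,
        'unknown_words' => count($unknown),
        'coverage_percent' => round($knownCount / $totalUnique * 100, 1),
        'sample_unknown_words' => array_slice($unknown, 0, $maxUnknown),
    ];
}
```

Counting over unique words means a frequent known word such as "the" contributes once, so the percentage reflects vocabulary breadth rather than running-text coverage.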

classifySubjectsPublic()

Classify subjects into a difficulty tier (public API).


    public classifySubjectsPublic(array<int, string> $subjects) : string

Parameters

$subjects : array<int, string>: Subject categories

Return values

string: 'easy'|'medium'|'hard'

estimateQuickTiers()

Estimate quick difficulty tiers for a batch of books.


    public estimateQuickTiers(int $languageId, array<int, array<int, string>> $booksSubjects) : array<int, string>

Performs a single DB query for vocabulary size, then classifies each book based on its subjects.

Parameters

$languageId : int: Language ID
$booksSubjects : array<int, array<int, string>>: Map of bookId => subjects

Return values

array<int, string>: Map of bookId => 'easy'|'medium'|'hard'
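
The "one query, then classify each book" pattern can be sketched as below. The two callables are hypothetical stand-ins for the private getKnownWordCount() and computeQuickTier() methods, whose internals (including any vocabulary-size thresholds) are not documented on this page.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the batch flow: look up the vocabulary size once,
 * then reuse it for every book in the batch.
 *
 * @param array<int, array<int, string>> $booksSubjects   Map of bookId => subjects
 * @param callable(int): int             $countKnownWords Stand-in for getKnownWordCount()
 * @param callable(int, array): string   $computeTier     Stand-in for computeQuickTier()
 * @return array<int, string> Map of bookId => tier
 */
function sketchEstimateQuickTiers(
    int $languageId,
    array $booksSubjects,
    callable $countKnownWords,
    callable $computeTier
): array {
    // Single lookup for the whole batch, regardless of how many books there are.
    $knownCount = $countKnownWords($languageId);

    $tiers = [];
    foreach ($booksSubjects as $bookId => $subjects) {
        $tiers[$bookId] = $computeTier($knownCount, $subjects);
    }

    return $tiers;
}
```

Hoisting the count out of the loop keeps the database cost constant for a page of search results.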

classifySubjects()

Classify subject list into a difficulty tier.


    private classifySubjects(array<int, string> $subjects) : string

Picks the most favorable (lowest difficulty) match.

Parameters

$subjects : array<int, string>: Subject categories

Return values

string: 'easy'|'medium'|'hard'
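
A minimal sketch of the "most favorable match wins" rule, assuming PHP 8 (for `str_contains`). The case-insensitive substring matching and the 'medium' fallback for unmatched subjects are assumptions, and the function and constant names are hypothetical; only the keyword lists are taken from EASY_SUBJECTS and HARD_SUBJECTS.

```php
<?php
declare(strict_types=1);

// Keyword lists copied from the documented constants.
const SKETCH_EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery',
    'picture book', 'fable', 'primer', 'easy reading'];
const SKETCH_HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics',
    'political science', 'mathematics', 'psychology', 'theology', 'metaphysics',
    'logic', 'jurisprudence', 'historiography'];

/**
 * True if any subject contains any keyword (case-insensitive substring match,
 * which is an assumption about how the real method compares them).
 *
 * @param array<int, string> $subjects
 * @param array<int, string> $keywords
 */
function sketchContainsKeyword(array $subjects, array $keywords): bool
{
    foreach ($subjects as $subject) {
        $lower = strtolower($subject);
        foreach ($keywords as $keyword) {
            if (str_contains($lower, $keyword)) {
                return true;
            }
        }
    }

    return false;
}

/** @param array<int, string> $subjects */
function sketchClassifySubjects(array $subjects): string
{
    // Checking the easy list first makes the lowest difficulty win
    // when a book matches both lists.
    if (sketchContainsKeyword($subjects, SKETCH_EASY_SUBJECTS)) {
        return 'easy';
    }
    if (sketchContainsKeyword($subjects, SKETCH_HARD_SUBJECTS)) {
        return 'hard';
    }

    return 'medium'; // Assumed default when nothing matches.
}
```

Under this sketch, a book tagged both "Fairy tales" and "Philosophy" would be classified as 'easy'.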

computeQuickTier()

Compute quick difficulty tier from vocabulary size and subjects.


    private computeQuickTier(int $knownCount, array<int, string> $subjects) : string

Parameters

$knownCount : int: Number of known words
$subjects : array<int, string>: Gutenberg subject categories

Return values

string: 'easy'|'medium'|'hard'

fetchTextContent()

Fetch text content from a URL.


    private fetchTextContent(string $url) : string|null

Uses GutenbergClient for Gutenberg URLs (simpler, follows redirects) and falls back to WebPageExtractor for all other URLs.

Parameters

$url : string: Text URL

Return values

string|null: Extracted text, or null on fetch failure

getKnownWordCount()

Count words the user knows for a language.


    private getKnownWordCount(int $languageId) : int

A word counts as "known" when its status is 5 (learned), 98 (ignored), or 99 (well-known).

Parameters

$languageId : int: Language ID

Return values

int: Number of known words

getLanguageParseSettings()

Get language parsing settings for tokenization.


    private getLanguageParseSettings(int $languageId) : array{regex: string, splitEachChar: bool}|null

Parameters

$languageId : int: Language ID

Return values

array{regex: string, splitEachChar: bool}|null: Settings, or null if not found

labelFromCoverage()

Map coverage percentage to a human-readable difficulty label.


    private labelFromCoverage(float $percent) : string

Thresholds are based on reading research: coverage of 95% or more means comfortable reading, 90-95% is challenging but feasible, and below 90% is frustrating.

Parameters

$percent : float: Coverage percentage

Return values

string: Difficulty label
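
The thresholds above map to a label roughly as in this sketch. The label strings are hypothetical, since the actual labels are not listed on this page, and treating exactly 90% and 95% as belonging to the higher band is an assumption about the boundary handling.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical mapping from coverage percentage to a label.
 * Only the 95 and 90 thresholds come from the documentation;
 * the returned strings are placeholders.
 */
function sketchLabelFromCoverage(float $percent): string
{
    if ($percent >= 95.0) {
        return 'comfortable';
    }
    if ($percent >= 90.0) {
        return 'challenging';
    }

    return 'frustrating';
}
```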

lookupKnownWords()

Look up which words from a list the user already knows.


    private lookupKnownWords(int $languageId, array<int, string> $words) : array<int, string>

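Here a word counts as known if it has any status (1-5, 98, 99), meaning the user has encountered it. This is broader than the definition used by getKnownWordCount(), which counts only statuses 5, 98, and 99.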

Parameters

$languageId : int: Language ID
$words : array<int, string>: Lowercase words to look up

Return values

array<int, string>: Words that exist in the user's vocabulary
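
Batching by LOOKUP_BATCH_SIZE can be sketched as follows. The `$lookupBatch` callable is a hypothetical stand-in for the real database query, which is not shown on this page; only the batch size of 500 comes from the documented constant.

```php
<?php
declare(strict_types=1);

const SKETCH_LOOKUP_BATCH_SIZE = 500; // Mirrors the documented LOOKUP_BATCH_SIZE.

/**
 * Hypothetical sketch: split the word list into fixed-size batches so that
 * no single lookup receives more than $batchSize words.
 *
 * @param array<int, string>     $words       Lowercase words to look up
 * @param callable(array): array $lookupBatch Returns the subset of a batch found in the vocabulary
 * @return array<int, string> Words found across all batches
 */
function sketchLookupInBatches(
    array $words,
    callable $lookupBatch,
    int $batchSize = SKETCH_LOOKUP_BATCH_SIZE
): array {
    $found = [];
    foreach (array_chunk($words, $batchSize) as $batch) {
        foreach ($lookupBatch($batch) as $word) {
            $found[] = $word;
        }
    }

    return $found;
}
```

With SAMPLE_WORD_COUNT at 2000, a sample would need at most four such batches.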

tokenize()

Tokenize text into words using the language's word character regex.


    private tokenize(string $text, string $wordRegex, int $maxWords[, bool $splitEachChar = false]) : array<int, string>

For languages with splitEachChar enabled (Chinese, for example), each matched character becomes its own token.

Parameters

$text : string: Text to tokenize
$wordRegex : string: Word character regex class content
$maxWords : int: Maximum number of words to return
$splitEachChar : bool = false: Whether to treat each character as a word

Return values

array<int, string>: Word tokens
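
A minimal sketch of this tokenization, assuming `$wordRegex` is the inside of a character class (such as `a-zA-Z`) that contains no unescaped `/`, and assuming the mbstring extension is available. Lowercasing the tokens is also an assumption, made because lookupKnownWords() expects lowercase input. The function name is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical tokenizer sketch built on preg_match_all().
 *
 * @return array<int, string> Up to $maxWords lowercase tokens
 */
function sketchTokenize(
    string $text,
    string $wordRegex,
    int $maxWords,
    bool $splitEachChar = false
): array {
    // Without the "+" quantifier every matched character is its own token,
    // which is the behavior described for splitEachChar languages.
    $pattern = $splitEachChar
        ? '/[' . $wordRegex . ']/u'
        : '/[' . $wordRegex . ']+/u';

    // preg_match_all() returns 0 for no matches and false on error.
    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }

    $tokens = array_map('mb_strtolower', $matches[0]);

    return array_slice($tokens, 0, $maxWords);
}
```

The `u` modifier makes the pattern and subject UTF-8 aware, so a range such as `\x{4E00}-\x{9FFF}` matches whole characters rather than bytes.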
