DifficultyEstimationService
Estimates text difficulty relative to a user's known vocabulary.
Table of Contents
Constants
- EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']
- Subject keywords that indicate easier texts.
- HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']
- Subject keywords that indicate harder texts.
- LOOKUP_BATCH_SIZE = 500
- Maximum number of words per vocabulary lookup batch.
- SAMPLE_WORD_COUNT = 2000
- Maximum number of words to sample for accurate coverage.
Methods
- analyzeTextSample() : array{total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
- Analyze a text sample for accurate vocabulary coverage.
- classifySubjectsPublic() : string
- Classify subjects into a difficulty tier (public API).
- estimateQuickTiers() : array<int, string>
- Estimate quick difficulty tiers for a batch of books.
- classifySubjects() : string
- Classify subject list into a difficulty tier.
- computeQuickTier() : string
- Compute quick difficulty tier from vocabulary size and subjects.
- getKnownWordCount() : int
- Count words the user knows for a language.
- getWordCharRegex() : string|null
- Get the word character regex for a language.
- labelFromCoverage() : string
- Map coverage percentage to a human-readable difficulty label.
- lookupKnownWords() : array<int, string>
- Look up which words from a list the user already knows.
- tokenize() : array<int, string>
- Tokenize text into words using the language's word character regex.
Constants
EASY_SUBJECTS
Subject keywords that indicate easier texts.
private
array<int, string>
EASY_SUBJECTS
= ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']
HARD_SUBJECTS
Subject keywords that indicate harder texts.
private
array<int, string>
HARD_SUBJECTS
= ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']
LOOKUP_BATCH_SIZE
Maximum number of words per vocabulary lookup batch.
private
mixed
LOOKUP_BATCH_SIZE
= 500
SAMPLE_WORD_COUNT
Maximum number of words to sample for accurate coverage.
private
mixed
SAMPLE_WORD_COUNT
= 2000
Methods
analyzeTextSample()
Analyze a text sample for accurate vocabulary coverage.
public
analyzeTextSample(string $textUrl, int $languageId) : array{total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
Fetches the text, extracts a sample, tokenizes it, and computes the percentage of unique words the user already knows.
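The coverage step described above can be sketched as follows. This is a minimal illustration, not the service's actual code: `coverageStats()` is a hypothetical helper, and it assumes the tokens are already lowercased and that coverage is rounded to one decimal place.

```php
<?php
/**
 * Hypothetical helper: compute coverage over the unique words of a sample.
 *
 * @param array<int, string> $tokens     Lowercase word tokens from the sample
 * @param array<int, string> $knownWords Words the user already knows
 * @return array{total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float}
 */
function coverageStats(array $tokens, array $knownWords): array
{
    // Coverage is measured over unique words, not running words.
    $unique = array_values(array_unique($tokens));
    $total = count($unique);

    // Flip the known list into a hash set for constant-time membership checks.
    $knownSet = array_flip($knownWords);
    $known = 0;
    foreach ($unique as $word) {
        if (isset($knownSet[$word])) {
            $known++;
        }
    }

    // Guard against an empty sample; the rounding precision is an assumption.
    $percent = $total > 0 ? round($known / $total * 100, 1) : 0.0;

    return [
        'total_unique_words' => $total,
        'known_words' => $known,
        'unknown_words' => $total - $known,
        'coverage_percent' => $percent,
    ];
}
```

For example, a sample tokenized as `the cat sat the mat` has four unique words; if the user knows `the`, `cat`, and `sat`, coverage is 75%.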
Parameters
- $textUrl : string
-
URL of the plain text
- $languageId : int
-
Language ID (for tokenization and vocab lookup)
Return values
array{total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
classifySubjectsPublic()
Classify subjects into a difficulty tier (public API).
public
classifySubjectsPublic(array<int, string> $subjects) : string
Parameters
- $subjects : array<int, string>
-
Subject categories
Return values
string —'easy'|'medium'|'hard'
estimateQuickTiers()
Estimate quick difficulty tiers for a batch of books.
public
estimateQuickTiers(int $languageId, array<int, array<int, string>> $booksSubjects) : array<int, string>
Performs a single DB query for vocabulary size, then classifies each book based on its subjects.
Parameters
- $languageId : int
-
Language ID
- $booksSubjects : array<int, array<int, string>>
-
Map of bookId => subjects
Return values
array<int, string> —Map of bookId => 'easy'|'medium'|'hard'
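The "single query, then classify" pattern can be sketched as below. The function name and both callables are hypothetical stand-ins: `$countKnownWords` represents the one vocabulary-size query, and `$computeTier` represents whatever rule the service applies per book, which this page does not document.

```php
<?php
/**
 * Hypothetical sketch of batch tier estimation.
 *
 * @param int $languageId Language ID
 * @param array<int, array<int, string>> $booksSubjects Map of bookId => subjects
 * @param callable $countKnownWords fn(int $languageId): int (assumed stand-in for the DB query)
 * @param callable $computeTier fn(int $knownCount, array $subjects): string (assumed tier rule)
 * @return array<int, string> Map of bookId => tier
 */
function estimateQuickTiersSketch(
    int $languageId,
    array $booksSubjects,
    callable $countKnownWords,
    callable $computeTier
): array {
    // Fetch the vocabulary size once, outside the loop.
    $knownCount = $countKnownWords($languageId);

    $tiers = [];
    foreach ($booksSubjects as $bookId => $subjects) {
        // Each book reuses the same count; only its subjects differ.
        $tiers[$bookId] = $computeTier($knownCount, $subjects);
    }

    return $tiers;
}
```

Keeping the count outside the loop means the cost of the lookup does not grow with the number of books in the batch.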
classifySubjects()
Classify subject list into a difficulty tier.
private
classifySubjects(array<int, string> $subjects) : string
Picks the most favorable (lowest difficulty) match.
Parameters
- $subjects : array<int, string>
-
Subject categories
Return values
string —'easy'|'medium'|'hard'
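A minimal sketch of the "most favorable match" rule, assuming case-insensitive substring matching against the two keyword lists and a fallback of 'medium' when nothing matches. Both assumptions are illustrative; the actual matching logic is not documented here, and the function name is hypothetical.

```php
<?php
/**
 * Hypothetical sketch: classify a subject list into 'easy'|'medium'|'hard'.
 *
 * @param array<int, string> $subjects Subject categories
 */
function classifySubjectsSketch(array $subjects): string
{
    $easy = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book',
        'fable', 'primer', 'easy reading'];
    $hard = ['philosophy', 'science', 'law', 'economics', 'political science',
        'mathematics', 'psychology', 'theology', 'metaphysics', 'logic',
        'jurisprudence', 'historiography'];

    // Easy keywords are checked first, so an easy match wins over a hard one.
    foreach ([['easy', $easy], ['hard', $hard]] as [$tier, $keywords]) {
        foreach ($subjects as $subject) {
            foreach ($keywords as $keyword) {
                // Assumption: case-insensitive substring match.
                if (stripos($subject, $keyword) !== false) {
                    return $tier;
                }
            }
        }
    }

    // Assumption: no keyword match means the middle tier.
    return 'medium';
}
```

With this rule, a book tagged both "Philosophy" and "Children's stories" is classified as 'easy'. Note that plain substring matching can over-match (for instance, "law" inside "Outlaws"), so a stricter comparison may be preferable in practice.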
computeQuickTier()
Compute quick difficulty tier from vocabulary size and subjects.
private
computeQuickTier(int $knownCount, array<int, string> $subjects) : string
Parameters
- $knownCount : int
-
Number of known words
- $subjects : array<int, string>
-
Gutenberg subject categories
Return values
string —'easy'|'medium'|'hard'
getKnownWordCount()
Count words the user knows for a language.
private
getKnownWordCount(int $languageId) : int
A word counts as "known" if its status is 5 (learned), 98 (ignored), or 99 (well-known).
Parameters
- $languageId : int
-
Language ID
Return values
int —Number of known words
getWordCharRegex()
Get the word character regex for a language.
private
getWordCharRegex(int $languageId) : string|null
Parameters
- $languageId : int
-
Language ID
Return values
string|null —Regex character class content, or null if not found
labelFromCoverage()
Map coverage percentage to a human-readable difficulty label.
private
labelFromCoverage(float $percent) : string
Thresholds follow reading research: coverage of 95% or more means comfortable reading, 90-95% is challenging but feasible, and below 90% is frustrating.
Parameters
- $percent : float
-
Coverage percentage
Return values
string —Difficulty label
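The thresholds above translate into a short mapping. In this sketch the function name and the label strings ('comfortable', 'challenging', 'frustrating') are assumptions for illustration; the labels the service actually returns are not listed on this page.

```php
<?php
/**
 * Hypothetical sketch: map a coverage percentage to a difficulty label.
 */
function labelFromCoverageSketch(float $percent): string
{
    if ($percent >= 95.0) {
        return 'comfortable';   // 95%+ coverage
    }
    if ($percent >= 90.0) {
        return 'challenging';   // 90% up to, but not including, 95%
    }
    return 'frustrating';       // below 90%
}
```

Checking the higher threshold first keeps the boundaries unambiguous: exactly 95% falls in the top band and exactly 90% in the middle one.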
lookupKnownWords()
Look up which words from a list the user already knows.
private
lookupKnownWords(int $languageId, array<int, string> $words) : array<int, string>
Words with any status (1-5, 98, 99) are considered "encountered".
Parameters
- $languageId : int
-
Language ID
- $words : array<int, string>
-
Lowercase words to look up
Return values
array<int, string> —Words that exist in the user's vocabulary
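Batching with a size limit such as LOOKUP_BATCH_SIZE can be sketched with `array_chunk()`. The function name and the `$queryBatch` callable are hypothetical; the callable stands in for the database query, whose real form is not shown on this page.

```php
<?php
/**
 * Hypothetical sketch: look up words in fixed-size batches.
 *
 * @param array<int, string> $words      Lowercase words to look up
 * @param callable           $queryBatch fn(array $batch): array of matching words (assumed)
 * @param int                $batchSize  Maximum words per query
 * @return array<int, string> Words found across all batches
 */
function lookupInBatches(array $words, callable $queryBatch, int $batchSize = 500): array
{
    $found = [];
    // array_chunk() splits the list into consecutive slices of at most $batchSize.
    foreach (array_chunk($words, $batchSize) as $batch) {
        foreach ($queryBatch($batch) as $word) {
            $found[] = $word;
        }
    }
    return $found;
}
```

A list of 1,200 words is sent as three queries of 500, 500, and 200 words, which keeps each query's parameter list bounded.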
tokenize()
Tokenize text into words using the language's word character regex.
private
tokenize(string $text, string $wordRegex, int $maxWords) : array<int, string>
Parameters
- $text : string
-
Text to tokenize
- $wordRegex : string
-
Word character regex class content
- $maxWords : int
-
Maximum number of words to return
Return values
array<int, string> —Word tokens
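A minimal tokenization sketch, assuming `$wordRegex` holds the inside of a character class (for example `a-zA-Z`) that contains no `/` delimiter, and that tokens are lowercased before being returned. The function name is hypothetical and the real method may differ in both respects.

```php
<?php
/**
 * Hypothetical sketch: split text into lowercase word tokens.
 *
 * @return array<int, string> At most $maxWords tokens, in text order
 */
function tokenizeSketch(string $text, string $wordRegex, int $maxWords): array
{
    // Wrap the class content in brackets; the "u" flag enables UTF-8 matching.
    $pattern = '/[' . $wordRegex . ']+/u';

    // preg_match_all() returns false on failure (e.g. an invalid pattern).
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }

    // Prefer mb_strtolower() for non-ASCII text; fall back if mbstring is absent.
    $lower = function_exists('mb_strtolower') ? 'mb_strtolower' : 'strtolower';
    $tokens = array_map($lower, $matches[0]);

    // Cap the result at $maxWords tokens.
    return array_slice($tokens, 0, $maxWords);
}
```

Called as `tokenizeSketch('The cat sat; the CAT ran.', 'a-zA-Z', 4)`, this returns `['the', 'cat', 'sat', 'the']`: punctuation is skipped and the cap stops the list before "cat" repeats.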