Documentation

DifficultyEstimationService

Estimates text difficulty relative to a user's known vocabulary.

Tags
since: 3.0.0

Table of Contents

Constants

EASY_SUBJECTS  = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']
Subject keywords that indicate easier texts.
HARD_SUBJECTS  = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']
Subject keywords that indicate harder texts.
LOOKUP_BATCH_SIZE  = 500
Maximum number of words per vocabulary lookup batch.
SAMPLE_WORD_COUNT  = 2000
Maximum number of words sampled from a text when computing vocabulary coverage.

Methods

analyzeTextSample()  : array{total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
Analyze a text sample for accurate vocabulary coverage.
classifySubjectsPublic()  : string
Classify subjects into a difficulty tier (public API).
estimateQuickTiers()  : array<int, string>
Estimate quick difficulty tiers for a batch of books.
classifySubjects()  : string
Classify subject list into a difficulty tier.
computeQuickTier()  : string
Compute quick difficulty tier from vocabulary size and subjects.
getKnownWordCount()  : int
Count words the user knows for a language.
getWordCharRegex()  : string|null
Get the word character regex for a language.
labelFromCoverage()  : string
Map coverage percentage to a human-readable difficulty label.
lookupKnownWords()  : array<int, string>
Look up which words from a list the user already knows.
tokenize()  : array<int, string>
Tokenize text into words using the language's word character regex.

Constants

EASY_SUBJECTS

Subject keywords that indicate easier texts.

private array<int, string> EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']

HARD_SUBJECTS

Subject keywords that indicate harder texts.

private array<int, string> HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']

Methods

analyzeTextSample()

Analyze a text sample for accurate vocabulary coverage.

public analyzeTextSample(string $textUrl, int $languageId) : array{total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}

Fetches the text, extracts a sample, tokenizes it, and computes the percentage of unique words the user already knows.

Parameters
$textUrl : string

URL of the plain text

$languageId : int

Language ID (used for tokenization and vocabulary lookup)

Return values
array{total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
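The coverage step described above can be sketched as follows. This is a minimal illustration, not the service's actual code: `sketchCoverage()` is a hypothetical helper, and rounding to one decimal place and returning 0.0 for an empty sample are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: compute vocabulary coverage over unique words.
 *
 * @param array<int, string> $tokens     Lowercase word tokens from the sample
 * @param array<int, string> $knownWords Words found in the user's vocabulary
 * @return array{total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float}
 */
function sketchCoverage(array $tokens, array $knownWords): array
{
    // Coverage is measured over unique words, so repeats count once.
    $unique = array_values(array_unique($tokens));
    $total = count($unique);
    $known = count(array_intersect($unique, $knownWords));

    return [
        'total_unique_words' => $total,
        'known_words' => $known,
        'unknown_words' => $total - $known,
        // Assumption: an empty sample reports 0.0 rather than dividing by zero.
        'coverage_percent' => $total > 0 ? round($known / $total * 100, 1) : 0.0,
    ];
}

// "the" appears twice but is counted once: 4 unique words, 2 of them known.
$coverage = sketchCoverage(['the', 'cat', 'sat', 'the', 'mat'], ['the', 'cat']);
```

Callers of the real method should check for the `error` key first, since the documented return type is a union with `array{error: string}`.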

classifySubjectsPublic()

Classify subjects into a difficulty tier (public API).

public classifySubjectsPublic(array<int, string> $subjects) : string
Parameters
$subjects : array<int, string>

Subject categories

Return values
string

'easy'|'medium'|'hard'

estimateQuickTiers()

Estimate quick difficulty tiers for a batch of books.

public estimateQuickTiers(int $languageId, array<int, array<int, string>> $booksSubjects) : array<int, string>

Performs a single DB query for vocabulary size, then classifies each book based on its subjects.

Parameters
$languageId : int

Language ID

$booksSubjects : array<int, array<int, string>>

Map of bookId => subjects

Return values
array<int, string>

Map of bookId => 'easy'|'medium'|'hard'
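The "query once, classify many" flow might look like the sketch below. Both callables are hypothetical stand-ins: `$fetchKnownCount` replaces the database query and `$tierFor` replaces the private tier computation, whose actual rules are not documented here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the batch flow: fetch the vocabulary size once,
 * then reuse it for every book.
 *
 * @param callable $fetchKnownCount Stand-in for the DB query; returns int
 * @param callable $tierFor         Stand-in taking (int $knownCount, array $subjects)
 * @param array<int, array<int, string>> $booksSubjects Map of bookId => subjects
 * @return array<int, string> Map of bookId => tier
 */
function sketchEstimateQuickTiers(callable $fetchKnownCount, callable $tierFor, array $booksSubjects): array
{
    // One lookup, regardless of how many books are in the batch.
    $knownCount = $fetchKnownCount();

    $tiers = [];
    foreach ($booksSubjects as $bookId => $subjects) {
        $tiers[$bookId] = $tierFor($knownCount, $subjects);
    }

    return $tiers;
}
```

Keeping book IDs as array keys means the result can be merged back into a book list without a second lookup.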

classifySubjects()

Classify subject list into a difficulty tier.

private classifySubjects(array<int, string> $subjects) : string

When the subjects match keywords from more than one tier, the most favorable (lowest-difficulty) tier wins.

Parameters
$subjects : array<int, string>

Subject categories

Return values
string

'easy'|'medium'|'hard'
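A minimal sketch of keyword-based classification, using the keyword lists from EASY_SUBJECTS and HARD_SUBJECTS. Two details are assumptions rather than documented behavior: matching is done by case-insensitive substring, and subjects that match neither list fall back to 'medium'. The `sketch*` names are hypothetical.

```php
<?php
declare(strict_types=1);

const SKETCH_EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery',
    'picture book', 'fable', 'primer', 'easy reading'];
const SKETCH_HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics',
    'political science', 'mathematics', 'psychology', 'theology', 'metaphysics',
    'logic', 'jurisprudence', 'historiography'];

/** Returns true if any subject contains any keyword (assumed substring match). */
function sketchMatchesAny(array $subjects, array $keywords): bool
{
    foreach ($subjects as $subject) {
        $lower = strtolower($subject);
        foreach ($keywords as $keyword) {
            if (str_contains($lower, $keyword)) {
                return true;
            }
        }
    }

    return false;
}

function sketchClassifySubjects(array $subjects): string
{
    // Easy is checked first so the lowest-difficulty match wins.
    if (sketchMatchesAny($subjects, SKETCH_EASY_SUBJECTS)) {
        return 'easy';
    }
    if (sketchMatchesAny($subjects, SKETCH_HARD_SUBJECTS)) {
        return 'hard';
    }

    return 'medium'; // Assumption: no keyword match means the middle tier.
}
```

Note that substring matching is coarse: under this assumption a subject such as "Science fiction" would match the 'science' keyword.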

computeQuickTier()

Compute quick difficulty tier from vocabulary size and subjects.

private computeQuickTier(int $knownCount, array<int, string> $subjects) : string
Parameters
$knownCount : int

Number of known words

$subjects : array<int, string>

Gutenberg subject categories

Return values
string

'easy'|'medium'|'hard'

getKnownWordCount()

Count words the user knows for a language.

private getKnownWordCount(int $languageId) : int

A word counts as "known" when its status is 5 (learned), 98 (ignored), or 99 (well-known).

Parameters
$languageId : int

Language ID

Return values
int

Number of known words
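The status rule can be illustrated in memory as below. The real method queries the database; the table and column names are not given here, so this sketch takes a hypothetical word => status map instead.

```php
<?php
declare(strict_types=1);

// Statuses treated as "known": learned, ignored, well-known.
const SKETCH_KNOWN_STATUSES = [5, 98, 99];

/**
 * Hypothetical in-memory equivalent of the count.
 *
 * @param array<string, int> $statusByWord Map of word => status
 */
function sketchCountKnown(array $statusByWord): int
{
    return count(array_filter(
        $statusByWord,
        static fn (int $status): bool => in_array($status, SKETCH_KNOWN_STATUSES, true)
    ));
}
```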

getWordCharRegex()

Get the word character regex for a language.

private getWordCharRegex(int $languageId) : string|null
Parameters
$languageId : int

Language ID

Return values
string|null

Regex character class content, or null if not found

labelFromCoverage()

Map coverage percentage to a human-readable difficulty label.

private labelFromCoverage(float $percent) : string

The thresholds follow reading-comprehension research: coverage of 95% or more means comfortable reading, 90-95% is challenging but feasible, and below 90% is frustrating.

Parameters
$percent : float

Coverage percentage

Return values
string

Difficulty label
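The thresholds above map to a simple cascade. The label strings in this sketch are placeholders, since the actual labels returned by the method are not documented here; treating exactly 95% as comfortable and exactly 90% as challenging is also an assumption.

```php
<?php
declare(strict_types=1);

/** Hypothetical mapping from coverage percentage to a label. */
function sketchLabelFromCoverage(float $percent): string
{
    if ($percent >= 95.0) {
        return 'comfortable'; // placeholder label
    }
    if ($percent >= 90.0) {
        return 'challenging'; // placeholder label
    }

    return 'frustrating'; // placeholder label
}
```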

lookupKnownWords()

Look up which words from a list the user already knows.

private lookupKnownWords(int $languageId, array<int, string> $words) : array<int, string>

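A word with any status (1-5, 98, 99) counts as "encountered" and is returned. This is broader than the status 5/98/99 definition of "known" used by getKnownWordCount().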

Parameters
$languageId : int

Language ID

$words : array<int, string>

Lowercase words to look up

Return values
array<int, string>

Words that exist in the user's vocabulary
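Batching the lookup with LOOKUP_BATCH_SIZE could be sketched like this. `$fetchBatch` is a hypothetical callable standing in for the per-batch database query; it receives up to 500 words and returns the ones that exist in the vocabulary.

```php
<?php
declare(strict_types=1);

const SKETCH_LOOKUP_BATCH_SIZE = 500;

/**
 * Hypothetical batched lookup.
 *
 * @param array<int, string> $words      Lowercase words to look up
 * @param callable           $fetchBatch Stand-in for the per-batch query
 * @return array<int, string> Words found in the vocabulary
 */
function sketchLookupKnownWords(array $words, callable $fetchBatch): array
{
    $found = [];
    // array_chunk keeps each query's parameter list bounded.
    foreach (array_chunk($words, SKETCH_LOOKUP_BATCH_SIZE) as $batch) {
        foreach ($fetchBatch($batch) as $word) {
            $found[] = $word;
        }
    }

    return $found;
}
```

Bounding the batch size keeps each query's placeholder list small, which matters for databases that cap the number of bound parameters.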

tokenize()

Tokenize text into words using the language's word character regex.

private tokenize(string $text, string $wordRegex, int $maxWords) : array<int, string>
Parameters
$text : string

Text to tokenize

$wordRegex : string

Word character regex class content

$maxWords : int

Maximum number of words to return

Return values
array<int, string>

Word tokens
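One way to tokenize with a character-class string is sketched below. It assumes `$wordRegex` holds only the content of a character class (for example `a-zA-Z`), that tokens are lowercased, and that `$maxWords` caps the total number of tokens rather than unique ones. `sketchTokenize()` is a hypothetical name.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical tokenizer built on a character-class string.
 *
 * @return array<int, string> Lowercase word tokens, at most $maxWords
 */
function sketchTokenize(string $text, string $wordRegex, int $maxWords): array
{
    // Escape the delimiter, wrap the class content in [...], and match
    // runs of word characters. The /u flag enables UTF-8 matching.
    $pattern = '/[' . str_replace('/', '\/', $wordRegex) . ']+/u';

    $count = preg_match_all($pattern, $text, $matches);
    if ($count === false || $count === 0) {
        return [];
    }

    // Prefer multibyte-safe lowercasing when the mbstring extension is loaded.
    $lower = function_exists('mb_strtolower') ? 'mb_strtolower' : 'strtolower';
    $tokens = array_map($lower, $matches[0]);

    return array_slice($tokens, 0, $maxWords);
}
```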


        