Documentation

DifficultyEstimationService

Estimates text difficulty relative to a user's known vocabulary.

Tags
since
3.0.0

Table of Contents

Constants

BEGINNER_VOCAB_THRESHOLD  = 500
Known-word count below which a reader is treated as a beginner.
EASY_SUBJECTS  = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']
Subject keywords that indicate easier texts.
HARD_SUBJECTS  = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']
Subject keywords that indicate harder texts.
LOOKUP_BATCH_SIZE  = 500
Maximum number of words per vocabulary lookup batch.
SAMPLE_WORD_COUNT  = 2000
Maximum number of words to sample for accurate coverage.

Methods

analyzeTextSample()  : array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
Analyze a text sample for accurate vocabulary coverage.
classifySubjectsPublic()  : string
Classify subjects into a difficulty tier (public API).
estimateQuickTiers()  : array<int, string>
Estimate quick difficulty tiers for a batch of books.
getReaderProfile()  : array{vocabularySize: int, beginner: bool}
Build a reader profile for a language; the profile is used to order home suggestions.
isBeginnerVocabulary()  : bool
Decide whether a known-word count puts the reader at beginner level.
classifySubjects()  : string
Classify subject list into a difficulty tier.
computeQuickTier()  : string
Compute quick difficulty tier from vocabulary size and subjects.
fetchTextContent()  : string|null
Fetch text content from a URL.
getKnownWordCount()  : int
getLanguageParseSettings()  : array{regex: string, splitEachChar: bool}|null
Get language parsing settings for tokenization.
labelFromCoverage()  : string
Map coverage percentage to a human-readable difficulty label.
lookupKnownWords()  : array<int, string>
Look up which words from a list the user already knows.
tokenize()  : array<int, string>
Tokenize text into words using the language's word character regex.

Constants

BEGINNER_VOCAB_THRESHOLD

Known-word count below which a reader is treated as a beginner.

public mixed BEGINNER_VOCAB_THRESHOLD = 500

Matches the lower vocabulary threshold used by computeQuickTier(): under ~500 known words, even "easy" classics read hard, so beginners are better served by the Global Digital Library's early-grade readers.

EASY_SUBJECTS

Subject keywords that indicate easier texts.

private array<int, string> EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']

HARD_SUBJECTS

Subject keywords that indicate harder texts.

private array<int, string> HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']

Methods

analyzeTextSample()

Analyze a text sample for accurate vocabulary coverage.

public analyzeTextSample(string $textUrl, int $languageId) : array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}

Fetches the text, extracts a sample, tokenizes it, and computes the percentage of unique words the user already knows.

Parameters
$textUrl : string

URL of the plain text

$languageId : int

Language ID (for tokenization and vocab lookup)

Return values
array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
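
The coverage step described above can be sketched as follows. This is a minimal illustration, not the service's actual code: summarizeCoverage(), its $maxUnknownShown parameter, and the error message are hypothetical, and the sketch assumes the tokens and the known-word list have already been produced. It leaves out difficulty_label, which the service derives separately from the coverage percentage.

```php
<?php
/**
 * Hypothetical helper: compute unique-word coverage for a tokenized sample.
 *
 * @param array<int, string> $tokens     Lowercase word tokens from the sample
 * @param array<int, string> $knownWords Words the user already knows
 */
function summarizeCoverage(array $tokens, array $knownWords, int $maxUnknownShown = 20): array
{
    $unique = array_values(array_unique($tokens));
    if ($unique === []) {
        // Assumed error text; the real message is not documented.
        return ['error' => 'No words found in sample'];
    }

    // Flip to a set so each membership check is a hash lookup.
    $knownSet = array_flip($knownWords);
    $unknown = [];
    foreach ($unique as $word) {
        if (!isset($knownSet[$word])) {
            $unknown[] = $word;
        }
    }

    $knownCount = count($unique) - count($unknown);

    return [
        'total_words' => count($tokens),
        'total_unique_words' => count($unique),
        'known_words' => $knownCount,
        'unknown_words' => count($unknown),
        // Coverage is measured over unique words, not running words.
        'coverage_percent' => round($knownCount / count($unique) * 100, 1),
        'sample_unknown_words' => array_slice($unknown, 0, $maxUnknownShown),
    ];
}
```

Because coverage is computed over unique words, a frequent known word such as "the" counts once, however often it appears in the sample.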

classifySubjectsPublic()

Classify subjects into a difficulty tier (public API).

public classifySubjectsPublic(array<int, string> $subjects) : string
Parameters
$subjects : array<int, string>

Subject categories

Return values
string

'easy'|'medium'|'hard'

estimateQuickTiers()

Estimate quick difficulty tiers for a batch of books.

public estimateQuickTiers(int $languageId, array<int, array<int, string>> $booksSubjects) : array<int, string>

Performs a single DB query for vocabulary size, then classifies each book based on its subjects.

Parameters
$languageId : int

Language ID

$booksSubjects : array<int, array<int, string>>

Map of bookId => subjects

Return values
array<int, string>

Map of bookId => 'easy'|'medium'|'hard'
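
The batch flow described above (one vocabulary count, reused for every book) might look roughly like the sketch below. The function estimateTiers() and the $computeTier callable are hypothetical stand-ins: the real tier rule lives in the private computeQuickTier() and is not documented here, so the toy rule in the usage lines is purely illustrative.

```php
<?php
/**
 * Hypothetical sketch: classify every book using one pre-fetched vocabulary count.
 *
 * @param int                                   $knownCount    Vocabulary size, fetched once by the caller
 * @param array<int, array<int, string>>        $booksSubjects Map of bookId => subjects
 * @param callable(int, array<int, string>): string $computeTier Stand-in for the tier rule
 *
 * @return array<int, string> Map of bookId => tier
 */
function estimateTiers(int $knownCount, array $booksSubjects, callable $computeTier): array
{
    $tiers = [];
    foreach ($booksSubjects as $bookId => $subjects) {
        // The book ID is preserved as the key, matching the documented return shape.
        $tiers[$bookId] = $computeTier($knownCount, $subjects);
    }
    return $tiers;
}

// Usage with a toy rule (an assumption, not the service's logic).
$toyRule = fn (int $known, array $subjects): string =>
    in_array('fable', $subjects, true) ? 'easy' : 'medium';

$tiers = estimateTiers(1200, [11 => ['fable'], 84 => ['horror']], $toyRule);
```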

getReaderProfile()

Build a reader profile for a language; the profile is used to order home suggestions.

public getReaderProfile(int $languageId) : array{vocabularySize: int, beginner: bool}
Parameters
$languageId : int

Language ID

Return values
array{vocabularySize: int, beginner: bool}

isBeginnerVocabulary()

Decide whether a known-word count puts the reader at beginner level.

public static isBeginnerVocabulary(int $knownWordCount) : bool
Parameters
$knownWordCount : int

Number of known words in the language

Return values
bool

True if the reader is a beginner
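
A minimal sketch of the beginner check, assuming a strict "below the threshold" comparison as the constant's description suggests. The free functions here are hypothetical stand-ins for the class members, and buildReaderProfile() only illustrates the array shape documented for getReaderProfile().

```php
<?php
// Value taken from the documented BEGINNER_VOCAB_THRESHOLD constant.
const BEGINNER_VOCAB_THRESHOLD = 500;

/**
 * Stand-in for isBeginnerVocabulary(): counts strictly below the threshold are beginner level.
 */
function isBeginnerVocabulary(int $knownWordCount): bool
{
    return $knownWordCount < BEGINNER_VOCAB_THRESHOLD;
}

/**
 * Hypothetical helper that builds the documented profile shape from a known-word count.
 *
 * @return array{vocabularySize: int, beginner: bool}
 */
function buildReaderProfile(int $knownWordCount): array
{
    return [
        'vocabularySize' => $knownWordCount,
        'beginner' => isBeginnerVocabulary($knownWordCount),
    ];
}
```

Under this assumption a reader with exactly 500 known words is no longer treated as a beginner.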

classifySubjects()

Classify subject list into a difficulty tier.

private classifySubjects(array<int, string> $subjects) : string

When the subjects match more than one tier, picks the most favorable (lowest-difficulty) match.

Parameters
$subjects : array<int, string>

Subject categories

Return values
string

'easy'|'medium'|'hard'
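
The "most favorable match wins" rule can be sketched as below. The keyword lists are copied from the documented constants, but the matching strategy (case-insensitive substring search), the 'medium' default for unmatched subjects, and the function names are assumptions made for illustration.

```php
<?php
const EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book',
    'fable', 'primer', 'easy reading'];
const HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics', 'political science',
    'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence',
    'historiography'];

/**
 * Hypothetical helper: true if any subject contains any keyword (ASCII case-insensitive).
 */
function matchesAny(array $subjects, array $keywords): bool
{
    foreach ($subjects as $subject) {
        $lower = strtolower($subject);
        foreach ($keywords as $keyword) {
            if (str_contains($lower, $keyword)) {
                return true;
            }
        }
    }
    return false;
}

/**
 * Sketch of the tier choice: easy is checked first, so it wins over hard.
 */
function classifySubjectsSketch(array $subjects): string
{
    if (matchesAny($subjects, EASY_SUBJECTS)) {
        return 'easy';
    }
    if (matchesAny($subjects, HARD_SUBJECTS)) {
        return 'hard';
    }
    return 'medium'; // Assumed default when nothing matches.
}
```

Checking the easy list first is what makes a book tagged with both a fairy-tale subject and a philosophy subject come out as 'easy'. Note that plain substring matching is coarse: a keyword such as 'science' would also match a subject like "Science fiction".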

computeQuickTier()

Compute quick difficulty tier from vocabulary size and subjects.

private computeQuickTier(int $knownCount, array<int, string> $subjects) : string
Parameters
$knownCount : int

Number of known words

$subjects : array<int, string>

Gutenberg subject categories

Return values
string

'easy'|'medium'|'hard'

fetchTextContent()

Fetch text content from a URL.

private fetchTextContent(string $url) : string|null

Uses GutenbergClient for Gutenberg URLs (simpler, follows redirects) and falls back to WebPageExtractor for all other URLs.

Parameters
$url : string

Text URL

Return values
string|null

Extracted text or null on fetch failure
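
The dispatch between the two fetchers might be sketched like this. The APIs of GutenbergClient and WebPageExtractor are not shown in this page, so both are represented by hypothetical callables, and the host rule used to recognize a Gutenberg URL (gutenberg.org or any subdomain of it) is an assumption.

```php
<?php
/**
 * Hypothetical check: does the URL's host belong to gutenberg.org?
 */
function isGutenbergUrl(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // Malformed URL or no host component.
    }
    $host = strtolower($host);
    return $host === 'gutenberg.org' || str_ends_with($host, '.gutenberg.org');
}

/**
 * Sketch of the dispatch: Gutenberg URLs go to one fetcher, everything else to the fallback.
 *
 * @param callable(string): ?string $gutenbergFetch Stand-in for GutenbergClient
 * @param callable(string): ?string $genericFetch   Stand-in for WebPageExtractor
 */
function fetchText(string $url, callable $gutenbergFetch, callable $genericFetch): ?string
{
    return isGutenbergUrl($url) ? $gutenbergFetch($url) : $genericFetch($url);
}
```

Matching on the parsed host rather than on the whole URL avoids treating a path such as example.com/gutenberg.org as a Gutenberg address.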

getKnownWordCount()

private getKnownWordCount(int $languageId) : int
Parameters
$languageId : int

Language ID

Return values
int

Number of known words

getLanguageParseSettings()

Get language parsing settings for tokenization.

private getLanguageParseSettings(int $languageId) : array{regex: string, splitEachChar: bool}|null
Parameters
$languageId : int

Language ID

Return values
array{regex: string, splitEachChar: bool}|null

Settings or null if not found

labelFromCoverage()

Map coverage percentage to a human-readable difficulty label.

private labelFromCoverage(float $percent) : string

Thresholds are based on vocabulary-coverage research: 95% or more is comfortable reading, 90-95% is challenging but feasible, and below 90% is frustrating.

Parameters
$percent : float

Coverage percentage

Return values
string

Difficulty label
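
The thresholds above translate into a simple cascade. In this sketch the cut-offs (95 and 90) come from the description, while the label strings and the function name are hypothetical, since the actual labels returned by the service are not listed on this page.

```php
<?php
/**
 * Sketch of the coverage-to-label mapping; label strings are assumed, not documented.
 */
function labelFromCoverageSketch(float $percent): string
{
    if ($percent >= 95.0) {
        return 'comfortable'; // 95%+ coverage
    }
    if ($percent >= 90.0) {
        return 'challenging'; // 90% up to (not including) 95%
    }
    return 'frustrating';     // below 90%
}
```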

lookupKnownWords()

Look up which words from a list the user already knows.

private lookupKnownWords(int $languageId, array<int, string> $words) : array<int, string>

A word with any status (1-5, 98, 99) is considered "encountered" and is returned as known.

Parameters
$languageId : int

Language ID

$words : array<int, string>

Lowercase words to look up

Return values
array<int, string>

Words that exist in the user's vocabulary
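
Given the documented LOOKUP_BATCH_SIZE of 500, the lookup presumably splits long word lists into batches. A minimal sketch of that batching, assuming the database access is hidden behind a hypothetical $queryBatch callable that returns the known subset of each batch:

```php
<?php
// Value taken from the documented LOOKUP_BATCH_SIZE constant.
const LOOKUP_BATCH_SIZE = 500;

/**
 * Hypothetical sketch: look up words in fixed-size batches and merge the results.
 *
 * @param array<int, string> $words Lowercase words to look up
 * @param callable(array<int, string>): array<int, string> $queryBatch
 *        Stand-in for the vocabulary query; returns the known words from one batch
 *
 * @return array<int, string> Known words across all batches
 */
function lookupInBatches(array $words, callable $queryBatch): array
{
    $known = [];
    foreach (array_chunk($words, LOOKUP_BATCH_SIZE) as $batch) {
        foreach ($queryBatch($batch) as $word) {
            $known[] = $word;
        }
    }
    return $known;
}
```

Batching keeps each query's parameter list bounded, which matters when a sample yields well over a thousand unique words.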

tokenize()

Tokenize text into words using the language's word character regex.

private tokenize(string $text, string $wordRegex, int $maxWords[, bool $splitEachChar = false ]) : array<int, string>

For languages with splitEachChar enabled (e.g., Chinese), each matched character becomes its own token.

Parameters
$text : string

Text to tokenize

$wordRegex : string

Contents of the regex character class that matches word characters

$maxWords : int

Maximum number of words to return

$splitEachChar : bool = false

Whether to treat each character as a word

Return values
array<int, string>

Word tokens
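
A rough sketch of how such a tokenizer could work, assuming $wordRegex is placed inside a bracketed character class and matched in UTF-8 mode. The function name is hypothetical, and details such as lowercasing or escaping a "/" inside the class are left out because the page does not describe them.

```php
<?php
/**
 * Sketch of regex-based tokenization; not the service's actual implementation.
 *
 * @return array<int, string> At most $maxWords tokens, in text order
 */
function tokenizeSketch(string $text, string $wordRegex, int $maxWords, bool $splitEachChar = false): array
{
    // Without the "+" quantifier every matched character is its own token.
    $pattern = $splitEachChar
        ? '/[' . $wordRegex . ']/u'
        : '/[' . $wordRegex . ']+/u';

    $matches = [];
    if (preg_match_all($pattern, $text, $matches) === false) {
        return []; // Invalid pattern or malformed UTF-8 input.
    }

    return array_slice($matches[0], 0, $maxWords);
}
```

For example, a class content of 'a-zA-Z' splits "The cat sat." into three word tokens, while a CJK range with $splitEachChar set to true yields one token per character.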


        