
DifficultyEstimationService

Estimates text difficulty relative to a user's known vocabulary.

Tags
since: 3.0.0

Table of Contents

Constants

EASY_SUBJECTS  = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']
Subject keywords that indicate easier texts.
HARD_SUBJECTS  = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']
Subject keywords that indicate harder texts.
LOOKUP_BATCH_SIZE  = 500
Maximum number of words per vocabulary lookup batch.
SAMPLE_WORD_COUNT  = 2000
Maximum number of words to sample for accurate coverage.

Methods

analyzeTextSample()  : array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
Analyze a text sample for accurate vocabulary coverage.
classifySubjectsPublic()  : string
Classify subjects into a difficulty tier (public API).
estimateQuickTiers()  : array<int, string>
Estimate quick difficulty tiers for a batch of books.
classifySubjects()  : string
Classify subject list into a difficulty tier.
computeQuickTier()  : string
Compute quick difficulty tier from vocabulary size and subjects.
fetchTextContent()  : string|null
Fetch text content from a URL.
getKnownWordCount()  : int
Count words the user knows for a language.
getLanguageParseSettings()  : array{regex: string, splitEachChar: bool}|null
Get language parsing settings for tokenization.
labelFromCoverage()  : string
Map coverage percentage to a human-readable difficulty label.
lookupKnownWords()  : array<int, string>
Look up which words from a list the user already knows.
tokenize()  : array<int, string>
Tokenize text into words using the language's word character regex.

Constants

EASY_SUBJECTS

Subject keywords that indicate easier texts.

private array<int, string> EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery', 'picture book', 'fable', 'primer', 'easy reading']

HARD_SUBJECTS

Subject keywords that indicate harder texts.

private array<int, string> HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics', 'political science', 'mathematics', 'psychology', 'theology', 'metaphysics', 'logic', 'jurisprudence', 'historiography']

Methods

analyzeTextSample()

Analyze a text sample for accurate vocabulary coverage.

public analyzeTextSample(string $textUrl, int $languageId) : array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}

Fetches the text, extracts a sample, tokenizes it, and computes the percentage of unique words the user already knows.

Parameters
$textUrl : string

URL of the plain text

$languageId : int

Language ID (for tokenization and vocab lookup)

Return values
array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float, difficulty_label: string, sample_unknown_words: list}|array{error: string}
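
The coverage figure described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the function name `sketchCoverage()` is hypothetical, the one-decimal rounding is an assumption, and the fetch, tokenize, and lookup steps are taken as already done.

```php
<?php
/**
 * Hypothetical helper: unique-word coverage for an already tokenized sample.
 *
 * @param array<int, string> $tokens     Tokens from the sample (duplicates allowed)
 * @param array<int, string> $knownWords Words found in the user's vocabulary
 *
 * @return array{total_words: int, total_unique_words: int, known_words: int, unknown_words: int, coverage_percent: float}
 */
function sketchCoverage(array $tokens, array $knownWords): array
{
    $unique = array_values(array_unique($tokens));
    $knownSet = array_flip($knownWords); // word => index, for O(1) membership checks

    $known = 0;
    foreach ($unique as $word) {
        if (isset($knownSet[$word])) {
            $known++;
        }
    }

    $totalUnique = count($unique);
    // Assumption: coverage is rounded to one decimal; an empty sample yields 0.0.
    $percent = $totalUnique > 0 ? round($known / $totalUnique * 100, 1) : 0.0;

    return [
        'total_words' => count($tokens),
        'total_unique_words' => $totalUnique,
        'known_words' => $known,
        'unknown_words' => $totalUnique - $known,
        'coverage_percent' => $percent,
    ];
}
```

Coverage is computed over unique words, so a frequent word such as "the" counts once regardless of how often it appears in the sample.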

classifySubjectsPublic()

Classify subjects into a difficulty tier (public API).

public classifySubjectsPublic(array<int, string> $subjects) : string
Parameters
$subjects : array<int, string>

Subject categories

Return values
string

'easy'|'medium'|'hard'

estimateQuickTiers()

Estimate quick difficulty tiers for a batch of books.

public estimateQuickTiers(int $languageId, array<int, array<int, string>> $booksSubjects) : array<int, string>

Performs a single DB query for vocabulary size, then classifies each book based on its subjects.

Parameters
$languageId : int

Language ID

$booksSubjects : array<int, array<int, string>>

Map of bookId => subjects

Return values
array<int, string>

Map of bookId => 'easy'|'medium'|'hard'
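
The "one query, then classify each book" shape can be sketched like this. It is an illustration under assumptions: `sketchEstimateQuickTiers()` is a hypothetical name, and the two callables stand in for the private getKnownWordCount() and computeQuickTier() methods, whose internals this page does not show.

```php
<?php
/**
 * Hypothetical sketch: fetch the vocabulary size once, then tier every book.
 *
 * @param callable(int): int                       $countKnownWords Stand-in for the vocabulary-size query
 * @param callable(int, array<int, string>): string $tierFor         Stand-in for the per-book tier computation
 * @param array<int, array<int, string>>           $booksSubjects   Map of bookId => subjects
 *
 * @return array<int, string> Map of bookId => tier
 */
function sketchEstimateQuickTiers(
    callable $countKnownWords,
    callable $tierFor,
    int $languageId,
    array $booksSubjects
): array {
    // The count is looked up once, outside the loop, so the cost does not grow with the batch.
    $knownCount = $countKnownWords($languageId);

    $tiers = [];
    foreach ($booksSubjects as $bookId => $subjects) {
        $tiers[$bookId] = $tierFor($knownCount, $subjects);
    }

    return $tiers;
}
```

Book IDs are preserved as array keys, matching the documented return shape.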

classifySubjects()

Classify subject list into a difficulty tier.

private classifySubjects(array<int, string> $subjects) : string

When the subjects match keywords from more than one tier, the most favorable (lowest-difficulty) tier wins.

Parameters
$subjects : array<int, string>

Subject categories

Return values
string

'easy'|'medium'|'hard'
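
A minimal sketch of this classification, using the keyword lists documented in EASY_SUBJECTS and HARD_SUBJECTS. The function and constant names are hypothetical, and two details are assumptions not confirmed by this page: matching is a case-insensitive substring test, and subjects that match no keyword fall into 'medium'.

```php
<?php
const SKETCH_EASY_SUBJECTS = ['children', 'juvenile', 'fairy tale', 'nursery',
    'picture book', 'fable', 'primer', 'easy reading'];
const SKETCH_HARD_SUBJECTS = ['philosophy', 'science', 'law', 'economics',
    'political science', 'mathematics', 'psychology', 'theology', 'metaphysics',
    'logic', 'jurisprudence', 'historiography'];

/**
 * Hypothetical sketch: classify a subject list into 'easy', 'medium', or 'hard'.
 *
 * @param array<int, string> $subjects Subject categories
 */
function sketchClassifySubjects(array $subjects): string
{
    $lowered = array_map('strtolower', $subjects);

    // Easy keywords are checked first, so an easy match beats a hard one.
    $tiers = ['easy' => SKETCH_EASY_SUBJECTS, 'hard' => SKETCH_HARD_SUBJECTS];
    foreach ($tiers as $tier => $keywords) {
        foreach ($lowered as $subject) {
            foreach ($keywords as $keyword) {
                if (strpos($subject, $keyword) !== false) {
                    return $tier;
                }
            }
        }
    }

    // Assumption: no keyword match means the middle tier.
    return 'medium';
}
```

Substring matching is a deliberate simplification here: it lets 'fairy tale' match "Fairy tales", but it would also let 'science' match "Science fiction", so a real implementation may need stricter matching.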

computeQuickTier()

Compute quick difficulty tier from vocabulary size and subjects.

private computeQuickTier(int $knownCount, array<int, string> $subjects) : string
Parameters
$knownCount : int

Number of known words

$subjects : array<int, string>

Gutenberg subject categories

Return values
string

'easy'|'medium'|'hard'

fetchTextContent()

Fetch text content from a URL.

private fetchTextContent(string $url) : string|null

Uses GutenbergClient for Gutenberg URLs (simpler, follows redirects) and falls back to WebPageExtractor for all other URLs.

Parameters
$url : string

Text URL

Return values
string|null

Extracted text or null on fetch failure

getKnownWordCount()

Count words the user knows for a language.

private getKnownWordCount(int $languageId) : int

"Known" = status 5 (learned), 98 (ignored), 99 (well-known).

Parameters
$languageId : int

Language ID

Return values
int

Number of known words

getLanguageParseSettings()

Get language parsing settings for tokenization.

private getLanguageParseSettings(int $languageId) : array{regex: string, splitEachChar: bool}|null
Parameters
$languageId : int

Language ID

Return values
array{regex: string, splitEachChar: bool}|null

Settings or null if not found

labelFromCoverage()

Map coverage percentage to a human-readable difficulty label.

private labelFromCoverage(float $percent) : string

Thresholds follow vocabulary-coverage research: 95% or higher means comfortable reading, 90-95% is challenging but feasible, and below 90% is frustrating.

Parameters
$percent : float

Coverage percentage

Return values
string

Difficulty label
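
The thresholds above map to a simple cascade. In this sketch the function name and the three label strings are hypothetical, since this page does not list the labels the method actually returns. Treating exactly 90% and 95% as belonging to the easier band is also an assumption.

```php
<?php
/**
 * Hypothetical sketch: map a coverage percentage to a difficulty label.
 */
function sketchLabelFromCoverage(float $percent): string
{
    if ($percent >= 95.0) {
        return 'comfortable'; // 95%+ coverage
    }
    if ($percent >= 90.0) {
        return 'challenging'; // 90-95% coverage
    }

    return 'frustrating'; // below 90% coverage
}
```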

lookupKnownWords()

Look up which words from a list the user already knows.

private lookupKnownWords(int $languageId, array<int, string> $words) : array<int, string>

A word with any status (1-5, 98, 99) counts as "encountered" here, which is broader than the "known" definition used by getKnownWordCount().

Parameters
$languageId : int

Language ID

$words : array<int, string>

Lowercase words to look up

Return values
array<int, string>

Words that exist in the user's vocabulary
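
Given the documented LOOKUP_BATCH_SIZE of 500, the lookup presumably splits long word lists into batches. The sketch below shows one way to do that with `array_chunk()`. The names `sketchLookupInBatches()` and `$queryBatch` are hypothetical, and the callable stands in for whatever database query the service runs per batch.

```php
<?php
const SKETCH_LOOKUP_BATCH_SIZE = 500; // mirrors the documented LOOKUP_BATCH_SIZE

/**
 * Hypothetical sketch: look up words in fixed-size batches.
 *
 * @param array<int, string>                               $words      Lowercase words to look up
 * @param callable(array<int, string>): array<int, string> $queryBatch Stand-in for the per-batch query
 *
 * @return array<int, string> Words found across all batches
 */
function sketchLookupInBatches(array $words, callable $queryBatch): array
{
    $found = [];
    foreach (array_chunk($words, SKETCH_LOOKUP_BATCH_SIZE) as $batch) {
        // Each batch has at most 500 words, which bounds the size of any single query.
        foreach ($queryBatch($batch) as $word) {
            $found[] = $word;
        }
    }

    return $found;
}
```

With a 2000-word sample (SAMPLE_WORD_COUNT), this would issue at most four queries.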

tokenize()

Tokenize text into words using the language's word character regex.

private tokenize(string $text, string $wordRegex, int $maxWords[, bool $splitEachChar = false ]) : array<int, string>

For languages that set splitEachChar (such as Chinese), each matched character becomes its own token.

Parameters
$text : string

Text to tokenize

$wordRegex : string

Contents of the regex character class that matches word characters

$maxWords : int

Maximum number of words to return

$splitEachChar : bool = false

Whether to treat each character as a word

Return values
array<int, string>

Word tokens
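
A minimal sketch of this tokenization with `preg_match_all()`. The function name is hypothetical. Wrapping `$wordRegex` in square brackets, using the `u` (UTF-8) modifier, and escaping the `/` delimiter are assumptions about how the class content is turned into a pattern.

```php
<?php
/**
 * Hypothetical sketch: split text into word tokens using a character class.
 *
 * @return array<int, string> At most $maxWords tokens, in text order
 */
function sketchTokenize(
    string $text,
    string $wordRegex,
    int $maxWords,
    bool $splitEachChar = false
): array {
    // Escape the delimiter in case the class content contains a slash.
    $class = str_replace('/', '\/', $wordRegex);

    // Without "+", every matched character is its own token (splitEachChar).
    $pattern = $splitEachChar
        ? '/[' . $class . ']/u'
        : '/[' . $class . ']+/u';

    if (preg_match_all($pattern, $text, $matches) === false) {
        return []; // invalid pattern or malformed UTF-8 input
    }

    return array_slice($matches[0], 0, $maxWords);
}
```

For example, `sketchTokenize('The cat sat.', 'a-zA-Z', 2)` yields the first two words, while passing `true` for `$splitEachChar` with a CJK range such as `\x{4E00}-\x{9FFF}` yields one token per character.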


        