GdlImportService
Downloads and extracts reading text from Global Digital Library ePUBs.
Table of Contents
Constants
- MIN_WORDS = 30
- Minimum word count for a book to be importable.
Properties
- $client : GdlClient
- $epubParser : EpubParserService
Methods
- __construct() : mixed
- extractText() : array{title: string, text: string, sourceUri: string}|array{error: string}
- Download a GDL ePUB and extract its reading text.
- buildText() : string
- Concatenate chapter contents into a single reading text.
- parseEpub() : array{title: string, text: string}
- Buffer ePUB bytes to disk and parse them into title + text.
- wordCount() : int
- Count whitespace-separated words in a text.
Constants
MIN_WORDS
Minimum word count for a book to be importable.
private mixed MIN_WORDS = 30
Many GDL titles are image-only picture books whose ePUB carries almost no extractable text; importing those yields an empty reading text. The threshold is deliberately low — a few short sentences — so genuine beginner readers still pass while pure picture books are rejected.
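As an illustration of the threshold described above, here is a minimal sketch of such a check. The function name `isImportable`, the global constant, and the choice of `>=` for the comparison are assumptions made for this example; the real constant is private to the class and the source does not show how it is compared.

```php
<?php
// Hypothetical stand-in for the private GdlImportService::MIN_WORDS constant.
const MIN_WORDS = 30;

/**
 * True when the text has at least MIN_WORDS whitespace-separated words.
 * Assumption: the boundary is inclusive (>=); the source does not say.
 */
function isImportable(string $text): bool
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    return $words !== false && count($words) >= MIN_WORDS;
}

$pictureBook = 'Look! A cat.';                               // 3 words
$beginnerReader = str_repeat('The cat sat on the mat. ', 6); // 36 words

var_dump(isImportable($pictureBook));    // bool(false)
var_dump(isImportable($beginnerReader)); // bool(true)
```

Six short sentences are already enough to pass, which matches the stated intent of keeping the bar low for beginner readers.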
Properties
$client
private GdlClient $client
$epubParser
private EpubParserService $epubParser
Methods
__construct()
public __construct([GdlClient|null $client = null][, EpubParserService|null $epubParser = null]) : mixed
Parameters
- $client : GdlClient|null = null
- $epubParser : EpubParserService|null = null
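Both constructor parameters are nullable with a null default, a shape commonly used for optional dependency injection. The sketch below shows that pattern only. The classes `StubClient`, `StubParser`, and `ImportServiceSketch` are hypothetical stand-ins, and the fallback to a freshly constructed default is an assumption, since the source does not document what happens when null is passed.

```php
<?php
// Hypothetical stand-ins for GdlClient and EpubParserService.
final class StubClient {}
final class StubParser {}

final class ImportServiceSketch
{
    private StubClient $client;
    private StubParser $epubParser;

    public function __construct(?StubClient $client = null, ?StubParser $epubParser = null)
    {
        // Assumption: a null argument falls back to a default instance.
        $this->client = $client ?? new StubClient();
        $this->epubParser = $epubParser ?? new StubParser();
    }

    public function client(): StubClient
    {
        return $this->client;
    }
}

// Production-style call: no arguments, defaults are created internally.
$default = new ImportServiceSketch();

// Test-style call: inject a specific collaborator.
$injected = new StubClient();
$service = new ImportServiceSketch($injected);
var_dump($service->client() === $injected); // bool(true)
```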
extractText()
Download a GDL ePUB and extract its reading text.
public extractText(string $epubUrl) : array{title: string, text: string, sourceUri: string}|array{error: string}
Parameters
- $epubUrl : string
  ePUB URL from a GdlClient search result
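Because the method returns one of two array shapes, a caller has to check for the `error` key before reading `title`, `text`, or `sourceUri`. The sketch below shows one way to branch on the result. The helper `describeResult`, the sample values, and the error message text are hypothetical, and the arrays are built by hand rather than by calling the service.

```php
<?php
/**
 * Turn an extractText()-style result into a one-line status message.
 *
 * @param array{title: string, text: string, sourceUri: string}|array{error: string} $result
 */
function describeResult(array $result): string
{
    // The error shape carries only the 'error' key, so test for it first.
    if (isset($result['error'])) {
        return 'Import failed: ' . $result['error'];
    }

    return sprintf(
        'Imported "%s" (%d bytes) from %s',
        $result['title'],
        strlen($result['text']),
        $result['sourceUri']
    );
}

// Hand-built arrays standing in for real extractText() results.
$ok = [
    'title' => 'The Red Kite',
    'text' => 'A kite flies high.',
    'sourceUri' => 'https://example.org/book.epub',
];
$failed = ['error' => 'Download failed'];

echo describeResult($ok), PHP_EOL;
echo describeResult($failed), PHP_EOL;
```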
Return values
array{title: string, text: string, sourceUri: string}|array{error: string}
buildText()
Concatenate chapter contents into a single reading text.
protected buildText(array<string|int, array{num?: int, title?: string, content?: string}> $chapters) : string
Parameters
- $chapters : array<string|int, array{num?: int, title?: string, content?: string}>
Return values
string: Blank-line-separated chapter text
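The concatenation could look roughly like the sketch below. It relies only on the documented parameter shape and the blank-line separator. Trimming each chapter, skipping chapters without content, and ignoring chapter titles are assumptions of this example, and `buildTextSketch` is a hypothetical name.

```php
<?php
/**
 * Join chapter contents with a blank line between them.
 *
 * @param array<string|int, array{num?: int, title?: string, content?: string}> $chapters
 */
function buildTextSketch(array $chapters): string
{
    $parts = [];
    foreach ($chapters as $chapter) {
        // Every key is optional, so default a missing content to ''.
        $content = trim($chapter['content'] ?? '');
        if ($content !== '') {
            $parts[] = $content; // assumption: empty chapters are skipped
        }
    }

    return implode("\n\n", $parts);
}

$chapters = [
    ['num' => 1, 'title' => 'One', 'content' => "The sun is up.\n"],
    ['num' => 2, 'title' => 'Two'],               // no content key
    ['num' => 3, 'content' => 'The bird sings.'],
];

echo buildTextSketch($chapters), PHP_EOL;
// The sun is up.
//
// The bird sings.
```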
parseEpub()
Buffer ePUB bytes to disk and parse them into title + text.
protected parseEpub(string $bytes) : array{title: string, text: string}
Isolated as a seam so the download/filter orchestration can be tested without a real ePUB or the zip extension.
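To show how a protected seam like this supports testing, the sketch below overrides the parsing step in an anonymous subclass so the surrounding orchestration runs without any ePUB file. `ImporterSketch`, its `import()` method, the empty-text check (a simplified stand-in for the real word-count filter), and the error message are all hypothetical.

```php
<?php
class ImporterSketch
{
    /** Orchestration under test: parse, then reject unusable results. */
    public function import(string $bytes): array
    {
        $parsed = $this->parseEpub($bytes);

        // Simplified stand-in for the real filter step.
        if (trim($parsed['text']) === '') {
            return ['error' => 'No extractable text'];
        }

        return $parsed;
    }

    /** The seam: the real method would buffer to disk and call the ePUB parser. */
    protected function parseEpub(string $bytes): array
    {
        throw new RuntimeException('Real parsing is not part of this sketch.');
    }
}

// Test double: replaces only the seam, leaving import() untouched.
$fake = new class extends ImporterSketch {
    protected function parseEpub(string $bytes): array
    {
        return ['title' => 'Fake Book', 'text' => $bytes];
    }
};

$empty = $fake->import("  \n ");
$good = $fake->import('Some readable text.');

var_dump(isset($empty['error'])); // bool(true)
var_dump($good['title']);         // string(9) "Fake Book"
```

Overriding one protected method keeps the test free of temporary files and the zip extension, while still exercising the public entry point.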
Parameters
- $bytes : string
  Raw ePUB bytes
Return values
array{title: string, text: string}
wordCount()
Count whitespace-separated words in a text.
private wordCount(string $text) : int
Parameters
- $text : string
  Text to measure
Return values
int: Word count (0 for empty or whitespace-only text)
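One way to get the documented behaviour, including the zero result for empty or whitespace-only input, is to split on runs of whitespace and discard empty pieces. This is a sketch under that assumption, not the class's actual implementation, and `wordCountSketch` is a hypothetical name.

```php
<?php
/** Count whitespace-separated words; 0 for empty or whitespace-only text. */
function wordCountSketch(string $text): int
{
    // PREG_SPLIT_NO_EMPTY drops the empty strings produced by leading or
    // trailing whitespace, so no separate trim() is needed.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. invalid UTF-8 with /u.
    return $words === false ? 0 : count($words);
}

var_dump(wordCountSketch(''));               // int(0)
var_dump(wordCountSketch("  \n\t "));        // int(0)
var_dump(wordCountSketch("Uno  dos\ntres")); // int(3)
```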