Documentation

GdlImportService

Downloads and extracts reading text from Global Digital Library ePUBs.

Tags
since
3.1.0

Table of Contents

Constants

MIN_WORDS  = 30
Minimum word count for a book to be importable.

Properties

$client  : GdlClient
$epubParser  : EpubParserService

Methods

__construct()  : mixed
extractText()  : array{title: string, text: string, sourceUri: string}|array{error: string}
Download a GDL ePUB and extract its reading text.
buildText()  : string
Concatenate chapter contents into a single reading text.
parseEpub()  : array{title: string, text: string}
Buffer ePUB bytes to disk and parse them into title + text.
wordCount()  : int
Count whitespace-separated words in a text.

Constants

MIN_WORDS

Minimum word count for a book to be importable.

private mixed MIN_WORDS = 30

Many GDL titles are image-only picture books whose ePUB carries almost no extractable text; importing those yields an empty reading text. The threshold is deliberately low (a few short sentences) so that genuine beginner readers still pass while pure picture books are rejected.
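
The boundary behaviour of the threshold can be sketched as follows. This is an illustration only: `isImportable()` is a hypothetical helper, the global constant stands in for the private class constant, and treating the minimum as inclusive (`>=`) is an assumption based on the word "minimum".

```php
<?php
declare(strict_types=1);

// Stand-in for the private class constant; the value 30 comes from the docs.
const MIN_WORDS = 30;

/**
 * Hypothetical helper: decide importability from an already computed word count.
 * Assumption: a book with exactly MIN_WORDS words is accepted.
 */
function isImportable(int $wordCount): bool
{
    return $wordCount >= MIN_WORDS;
}

var_dump(isImportable(4));   // a near-empty picture book
var_dump(isImportable(29));  // just under the threshold
var_dump(isImportable(30));  // exactly at the threshold
```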

Methods

extractText()

Download a GDL ePUB and extract its reading text.

public extractText(string $epubUrl) : array{title: string, text: string, sourceUri: string}|array{error: string}
Parameters
$epubUrl : string

ePUB URL from a GdlClient search result

Return values
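array{title: string, text: string, sourceUri: string}|array{error: string}

On success, the book's title, extracted reading text, and source URI; on failure, an array holding a single error message.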
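
Because the return type is a union of two array shapes, callers need to check for the `error` key before reading the other fields. A minimal caller-side sketch is shown here; `summarizeImport()`, the sample values, and the error wording are all hypothetical and not part of the service.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical caller-side helper: turn either result shape into a message.
 *
 * @param array{title: string, text: string, sourceUri: string}|array{error: string} $result
 */
function summarizeImport(array $result): string
{
    // The error shape carries only the 'error' key.
    if (isset($result['error'])) {
        return 'Import failed: ' . $result['error'];
    }

    return sprintf(
        'Imported "%s" (%d bytes) from %s',
        $result['title'],
        strlen($result['text']),
        $result['sourceUri']
    );
}

// Made-up results in the two documented shapes.
echo summarizeImport(['error' => 'sample error message']), PHP_EOL;
echo summarizeImport([
    'title' => 'Sample Book',
    'text' => 'Hello there.',
    'sourceUri' => 'https://example.org/book.epub',
]), PHP_EOL;
```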

buildText()

Concatenate chapter contents into a single reading text.

protected buildText(array<string|int, array{num?: int, title?: string, content?: string}> $chapters) : string
Parameters
$chapters : array<string|int, array{num?: int, title?: string, content?: string}>

Parsed chapters; each entry may carry a number, a title, and content
Return values
string

Blank-line-separated chapter text
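
One way to produce blank-line-separated text from such a chapter list is sketched below. It is not the service's actual implementation: `buildTextSketch()` is a hypothetical free function, and trimming each chapter, skipping chapters without content, and ignoring chapter titles are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: join chapter contents with one blank line between them.
 *
 * @param array<string|int, array{num?: int, title?: string, content?: string}> $chapters
 */
function buildTextSketch(array $chapters): string
{
    $parts = [];
    foreach ($chapters as $chapter) {
        // 'content' is optional in the documented shape, so default to ''.
        $content = trim($chapter['content'] ?? '');
        if ($content !== '') {
            $parts[] = $content;
        }
    }

    // "\n\n" leaves exactly one empty line between chapters.
    return implode("\n\n", $parts);
}

echo buildTextSketch([
    ['num' => 1, 'title' => 'One', 'content' => 'The cat sat. '],
    ['num' => 2, 'title' => 'Pictures only'],
    ['num' => 3, 'content' => 'The dog ran.'],
]), PHP_EOL;
```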

parseEpub()

Buffer ePUB bytes to disk and parse them into title + text.

protected parseEpub(string $bytes) : array{title: string, text: string}

Isolated as a seam so the download/filter orchestration can be tested without a real ePUB or the zip extension.

Parameters
$bytes : string

Raw ePUB bytes

Tags
throws
RuntimeException

If the bytes cannot be buffered or parsed

Return values
array{title: string, text: string}
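
The seam described above can be exercised by overriding `parseEpub()` in a test double. The sketch below uses a simplified stand-in class, since the real constructor and download logic are not shown on this page; `ImportServiceSketch`, `importBytes()`, and the conversion of the exception into an `error` entry are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Simplified stand-in for GdlImportService (hypothetical, not the real class).
class ImportServiceSketch
{
    /** @return array{title: string, text: string} */
    protected function parseEpub(string $bytes): array
    {
        // The real method would need a valid ePUB and the zip extension.
        throw new RuntimeException('cannot parse ePUB bytes');
    }

    /** Hypothetical orchestration step that receives already downloaded bytes. */
    public function importBytes(string $bytes): array
    {
        try {
            return $this->parseEpub($bytes);
        } catch (RuntimeException $e) {
            return ['error' => $e->getMessage()];
        }
    }
}

// Test double: override the seam so no real ePUB is needed.
$stub = new class extends ImportServiceSketch {
    protected function parseEpub(string $bytes): array
    {
        return ['title' => 'Stub Book', 'text' => 'Canned reading text.'];
    }
};

$stubResult = $stub->importBytes('not a real epub');
$failResult = (new ImportServiceSketch())->importBytes('not a real epub');
```

Because the override returns canned data, a test can drive the surrounding logic with any byte string and assert on the outcome directly.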

wordCount()

Count whitespace-separated words in a text.

private wordCount(string $text) : int
Parameters
$text : string

Text to measure

Return values
int

Word count (0 for empty/whitespace-only text)
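
A whitespace-based count with the documented zero result for empty input might look like the sketch below. `wordCountSketch()` is a hypothetical free function; using `preg_split()` with `PREG_SPLIT_NO_EMPTY` is one possible approach and not necessarily what the private method does.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: count runs of non-whitespace characters.
 */
function wordCountSketch(string $text): int
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces produced by leading,
    // trailing, or whitespace-only input.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure; treat that as no words.
    return $words === false ? 0 : count($words);
}

var_dump(wordCountSketch(''));
var_dump(wordCountSketch("  \n\t "));
var_dump(wordCountSketch("one  two\nthree"));
```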


        