
EpubParserService

Service for parsing EPUB files and extracting content.

Uses the kiwilan/php-ebook library to read EPUB files and extract metadata and chapter content for import into LWT.

Tags
since
3.0.0

Table of Contents

Constants

MAX_DECOMPRESSED_BYTES  = 500 * 1024 * 1024
Cap on total decompressed bytes across all entries (the zip-bomb defense). 500 MB is well above any plausible legitimate EPUB and far below the level that would crash a typical worker.
MAX_ENTRIES  = 2000
Cap on the number of files inside the EPUB. EPUBs typically contain dozens of HTML chapters plus assets, not thousands.
MAX_FILE_SIZE  = 100 * 1024 * 1024
Hard cap on the uploaded EPUB size (compressed bytes on disk).

Methods

cleanHtmlContent()  : string
Clean HTML content to plain text suitable for LWT.
getMetadata()  : array{title: string, author: string|null, description: string|null, language: string|null}|null
Get just the metadata without parsing chapters.
isNavigationFile()  : bool
Detect EPUB 3 navigation / TOC documents that should not appear as chapters.
isValidEpub()  : bool
Validate that a file is an EPUB.
parse()  : array{metadata: array{title: string, author: string|null, description: string|null, language: string|null, sourceHash: string}, chapters: array{num: int, title: string, content: string}[]}
Parse an EPUB file and extract metadata and chapters.
assertZipWithinLimits()  : void
Walk the ZIP central directory and refuse archives that exceed the entry-count or decompressed-size budget (the zip-bomb defense). Skipped silently when ext-zip is unavailable, in which case the minimal magic-byte check in isValidEpub() remains the only validation.
extractAuthor()  : string|null
Extract the primary author name from an ebook.
extractChapters()  : array<string|int, array{num: int, title: string, content: string}>
Extract chapters from an ebook.
extractFromHtmlFiles()  : array<string|int, array{num: int, title: string, content: string}>
Extract content from HTML files in the EPUB as fallback.
extractTitleFromContent()  : string
Extract a title from content if possible.
getEpubModule()  : EpubModule|null
Get the EpubModule from an Ebook.
resolveFormat()  : string|null
Resolve the format hint for the underlying ebook library.

Constants

MAX_DECOMPRESSED_BYTES

Cap on total decompressed bytes across all entries (the zip-bomb defense). 500 MB is well above any plausible legitimate EPUB and far below the level that would crash a typical worker.

public mixed MAX_DECOMPRESSED_BYTES = 500 * 1024 * 1024

MAX_ENTRIES

Cap on the number of files inside the EPUB. EPUBs typically contain dozens of HTML chapters plus assets, not thousands.

public mixed MAX_ENTRIES = 2000

MAX_FILE_SIZE

Hard cap on the uploaded EPUB size (compressed bytes on disk).

public mixed MAX_FILE_SIZE = 100 * 1024 * 1024

Real EPUBs are usually under 5 MB; even very large illustrated books stay under ~50 MB. 100 MB leaves headroom for outliers without inviting trivially DoSable uploads.

Methods

cleanHtmlContent()

Clean HTML content to plain text suitable for LWT.

public cleanHtmlContent(string $html) : string

Strips HTML tags while preserving paragraph structure with double newlines for paragraph breaks.

Parameters
$html : string

The HTML content

Return values
string

Clean plain text
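
The description above can be illustrated with a minimal sketch. It is not the service's actual implementation: the function name htmlToPlainTextSketch, the list of block-level tags, and the choice to turn <br> into a single newline are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for cleanHtmlContent(): strips tags and keeps
 * paragraph boundaries as double newlines.
 */
function htmlToPlainTextSketch(string $html): string
{
    // Remove <script> and <style> elements together with their bodies.
    $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $html) ?? $html;
    // Assumption: <br> becomes a single line break.
    $text = preg_replace('~<br\s*/?>~i', "\n", $text) ?? $text;
    // Assumption: these closing block tags mark a paragraph break.
    $text = preg_replace('~</(p|div|h[1-6]|li|blockquote)>~i', "\n\n", $text) ?? $text;
    // Drop the remaining markup and decode entities such as &amp;.
    $text = strip_tags($text);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse horizontal whitespace and cap blank runs at one empty line.
    $text = preg_replace('~[ \t]+~', ' ', $text) ?? $text;
    $text = preg_replace('~ *\n *~', "\n", $text) ?? $text;
    $text = preg_replace('~\n{3,}~', "\n\n", $text) ?? $text;
    return trim($text);
}
```

Regular expressions keep the sketch dependency-free; a DOM-based approach would be more robust against malformed markup.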

getMetadata()

Get just the metadata without parsing chapters.

public getMetadata(string $filePath[, string $originalName = '' ]) : array{title: string, author: string|null, description: string|null, language: string|null}|null
Parameters
$filePath : string

Path to the EPUB file

$originalName : string = ''

Original filename (used to derive the format when $filePath has no extension, e.g. PHP upload temp paths)

Return values
array{title: string, author: string|null, description: string|null, language: string|null}|null

Metadata or null on failure

isNavigationFile()

Detect EPUB 3 navigation / TOC documents that should not appear as chapters.

public isNavigationFile(EpubHtml $htmlFile) : bool

The kiwilan library's NCX-based getChapters() ignores nav.xhtml, but when an EPUB ships without an NCX the HTML fallback would otherwise include the nav document as a phantom chapter. Filename heuristics cover the common cases (nav.xhtml, toc.xhtml); the body sniff catches less conventionally named EPUB 3 nav documents, identified by the epub:type="toc" (or related) attribute on a <nav> element.

Parameters
$htmlFile : EpubHtml
Return values
bool
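
The two-stage check described above (filename heuristic, then body sniff) might look roughly like the sketch below. It works on plain strings rather than an EpubHtml object; the function name looksLikeNavDocument, the exact filename list, and the "related" epub:type values (landmarks, page-list) are assumptions, not taken from the service.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the navigation-document heuristic.
 *
 * @param string $filename Path of the HTML entry inside the EPUB
 * @param string $body     Raw markup of that entry
 */
function looksLikeNavDocument(string $filename, string $body): bool
{
    // Stage 1: conventional filenames (assumed list).
    $base = strtolower(basename($filename));
    if (in_array($base, ['nav.xhtml', 'nav.html', 'toc.xhtml', 'toc.html'], true)) {
        return true;
    }
    // Stage 2: a <nav> element whose epub:type contains a navigation role.
    $pattern = '~<nav\b[^>]*\bepub:type\s*=\s*["\'][^"\']*\b(toc|landmarks|page-list)\b~i';
    return preg_match($pattern, $body) === 1;
}
```

Checking the filename first avoids scanning the body of every file when the common naming convention already settles the question.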

isValidEpub()

Validate that a file is an EPUB.

public isValidEpub(string $filePath[, string $originalName = '' ]) : bool
Parameters
$filePath : string

Path to the file

$originalName : string = ''

Original filename (used to derive the format when $filePath has no extension, e.g. PHP upload temp paths)

Return values
bool

True if valid EPUB
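
As an illustration of what a minimal magic-byte check can look like, the sketch below tests for the ZIP local-file-header signature that every EPUB container starts with. It is an assumption that isValidEpub() works this way in detail; the function name looksLikeZipContainer and the decision to apply the 100 MB size cap at this point are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: cheap pre-check before handing a file to a parser.
 *
 * @param int $maxBytes Assumed to mirror MAX_FILE_SIZE
 */
function looksLikeZipContainer(string $filePath, int $maxBytes = 100 * 1024 * 1024): bool
{
    if (!is_file($filePath) || !is_readable($filePath)) {
        return false;
    }
    $size = filesize($filePath);
    if ($size === false || $size < 4 || $size > $maxBytes) {
        return false;
    }
    $handle = fopen($filePath, 'rb');
    if ($handle === false) {
        return false;
    }
    // ZIP local file header signature: "PK" followed by 0x03 0x04.
    $magic = fread($handle, 4);
    fclose($handle);
    return $magic === "PK\x03\x04";
}
```

A signature match only shows that the file is a ZIP archive; confirming that it is a well-formed EPUB still requires opening it.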

parse()

Parse an EPUB file and extract metadata and chapters.

public parse(string $filePath[, string $originalName = '' ]) : array{metadata: array{title: string, author: string|null, description: string|null, language: string|null, sourceHash: string}, chapters: array{num: int, title: string, content: string}[]}
Parameters
$filePath : string

Absolute path to the EPUB file

$originalName : string = ''

Original filename (used to derive the format when $filePath has no extension, e.g. PHP upload temp paths)

Tags
throws
InvalidArgumentException

If file doesn't exist

throws
RuntimeException

If file cannot be parsed

Return values
array{metadata: array{title: string, author: string|null, description: string|null, language: string|null, sourceHash: string}, chapters: array{num: int, title: string, content: string}[]}
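
A caller typically needs to handle both documented exceptions and then read the returned shape. The sketch below shows one way to do that. The helper summarizeEpubImport and its result format are hypothetical; it accepts any object exposing a compatible parse() method so that it does not assume the service's namespace or constructor.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical caller-side helper around parse().
 *
 * @param object $parser Any object with parse(string, string): array,
 *                       for example an EpubParserService instance
 * @return array{ok: bool, message: string, chapters: int}
 */
function summarizeEpubImport(object $parser, string $filePath, string $originalName = ''): array
{
    try {
        $result = $parser->parse($filePath, $originalName);
    } catch (\InvalidArgumentException $e) {
        // Documented for a missing file.
        return ['ok' => false, 'message' => 'Rejected: ' . $e->getMessage(), 'chapters' => 0];
    } catch (\RuntimeException $e) {
        // Documented for a file that cannot be parsed.
        return ['ok' => false, 'message' => 'Unreadable: ' . $e->getMessage(), 'chapters' => 0];
    }

    $meta = $result['metadata'];
    // author is nullable in the documented return shape.
    $author = $meta['author'] ?? 'unknown author';

    return [
        'ok' => true,
        'message' => sprintf('%s by %s', $meta['title'], $author),
        'chapters' => count($result['chapters']),
    ];
}
```

Passing the original upload name as the second argument matters for PHP temp paths, which carry no extension of their own.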

assertZipWithinLimits()

Walk the ZIP central directory and refuse archives that exceed the entry-count or decompressed-size budget (the zip-bomb defense). Skipped silently when ext-zip is unavailable, in which case the minimal magic-byte check in isValidEpub() remains the only validation.

private assertZipWithinLimits(string $filePath) : void
Parameters
$filePath : string

Path to the EPUB on disk

Tags
throws
InvalidArgumentException

When the archive exceeds limits
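
The walk over the central directory can be sketched with PHP's ZipArchive class, which reports each entry's declared uncompressed size through statIndex() without extracting anything. This is an illustrative approximation only: the function name, the default limits passed as parameters, the error messages, and the choice to throw when the archive cannot be opened are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the entry-count and decompressed-size budget.
 *
 * @throws \InvalidArgumentException When a limit is exceeded
 */
function assertZipWithinLimitsSketch(
    string $filePath,
    int $maxEntries = 2000,
    int $maxBytes = 500 * 1024 * 1024
): void {
    // Without ext-zip there is nothing to inspect, so skip silently.
    if (!class_exists(\ZipArchive::class)) {
        return;
    }
    $zip = new \ZipArchive();
    if ($zip->open($filePath) !== true) {
        // Assumption: an unreadable archive is treated as invalid input.
        throw new \InvalidArgumentException('Archive could not be opened');
    }
    try {
        if ($zip->numFiles > $maxEntries) {
            throw new \InvalidArgumentException('Archive has too many entries');
        }
        $total = 0;
        for ($i = 0; $i < $zip->numFiles; $i++) {
            $stat = $zip->statIndex($i);
            if ($stat === false) {
                continue;
            }
            // 'size' is the uncompressed size declared for the entry.
            $total += (int) $stat['size'];
            if ($total > $maxBytes) {
                throw new \InvalidArgumentException('Archive expands beyond the size budget');
            }
        }
    } finally {
        $zip->close();
    }
}
```

Checking the running total inside the loop lets the walk stop as soon as the budget is exceeded instead of summing every entry first.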

extractAuthor()

Extract the primary author name from an ebook.

private extractAuthor(Ebook $ebook) : string|null
Parameters
$ebook : Ebook

The ebook object

Return values
string|null

Author name or null if not found

extractChapters()

Extract chapters from an ebook.

private extractChapters(Ebook $ebook) : array<string|int, array{num: int, title: string, content: string}>
Parameters
$ebook : Ebook

The ebook object

Return values
array<string|int, array{num: int, title: string, content: string}>

extractFromHtmlFiles()

Extract content from HTML files in the EPUB as fallback.

private extractFromHtmlFiles(Ebook $ebook) : array<string|int, array{num: int, title: string, content: string}>
Parameters
$ebook : Ebook

The ebook object

Return values
array<string|int, array{num: int, title: string, content: string}>

extractTitleFromContent()

Extract a title from content if possible.

private extractTitleFromContent(string $content, int $num) : string
Parameters
$content : string

The text content

$num : int

Default chapter number

Return values
string

The extracted or default title
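
The documentation does not state which heuristic is used, so the following is only one plausible sketch: take the first non-empty line as the title if it is short enough, otherwise fall back to a default built from the chapter number. The function name, the 80-byte threshold, and the "Chapter N" default format are all assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical title heuristic: first non-empty line, or a numbered default.
 *
 * @param int $maxLength Assumed upper bound (in bytes) for a plausible title
 */
function titleFromContentSketch(string $content, int $num, int $maxLength = 80): string
{
    $lines = preg_split('~\R~', $content) ?: [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        // Only the first non-empty line is considered a title candidate.
        if (strlen($line) <= $maxLength) {
            return $line;
        }
        break;
    }
    // Assumed default format.
    return 'Chapter ' . $num;
}
```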

getEpubModule()

Get the EpubModule from an Ebook.

private getEpubModule(Ebook $ebook) : EpubModule|null
Parameters
$ebook : Ebook

The ebook object

Return values
EpubModule|null

The EPUB module or null if not an EPUB

resolveFormat()

Resolve the format hint for the underlying ebook library.

private resolveFormat(string $filePath, string $originalName) : string|null

Falls back to the original filename's extension when the path itself has none (PHP upload temp paths look like /tmp/phpXXXXXX). Returns null when no extension can be determined, letting the library decide.

Parameters
$filePath : string
$originalName : string
Return values
string|null
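
The fallback order described above can be sketched with pathinfo(). This is an approximation rather than the method's actual body: the function name resolveFormatSketch is hypothetical, and lowercasing the extension is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for resolveFormat(): path extension first,
 * then the original filename's extension, else null.
 */
function resolveFormatSketch(string $filePath, string $originalName): ?string
{
    // pathinfo() returns '' when the path has no extension,
    // which is the case for upload temp paths like /tmp/phpXXXXXX.
    $ext = pathinfo($filePath, PATHINFO_EXTENSION);
    if ($ext === '' && $originalName !== '') {
        $ext = pathinfo($originalName, PATHINFO_EXTENSION);
    }
    // Assumption: the hint is normalised to lowercase.
    return $ext === '' ? null : strtolower($ext);
}
```

With this sketch, a temp path paired with the original name "Novel.EPUB" resolves to "epub", and a temp path with an empty original name resolves to null.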
