EpubParserService
Service for parsing EPUB files and extracting content.
Uses the kiwilan/php-ebook library to read EPUB files and extract metadata and chapter content for import into LWT.
Table of Contents
Constants
- MAX_DECOMPRESSED_BYTES = 500 * 1024 * 1024
- Cap on total decompressed bytes across all entries — the zip-bomb defense. 500 MB is well above any plausible legitimate EPUB and far below the level that would crash a typical worker.
- MAX_ENTRIES = 2000
- Cap on the number of files inside the EPUB. EPUBs typically contain dozens of HTML chapters + assets, not thousands.
- MAX_FILE_SIZE = 100 * 1024 * 1024
- Hard cap on the uploaded EPUB size (compressed bytes on disk).
Methods
- cleanHtmlContent() : string
- Clean HTML content to plain text suitable for LWT.
- getMetadata() : array{title: string, author: string|null, description: string|null, language: string|null}|null
- Get just the metadata without parsing chapters.
- isNavigationFile() : bool
- Detect EPUB 3 navigation / TOC documents that should not appear as chapters.
- isValidEpub() : bool
- Validate that a file is an EPUB.
- parse() : array{metadata: array{title: string, author: string|null, description: string|null, language: string|null, sourceHash: string}, chapters: array{num: int, title: string, content: string}[]}
- Parse an EPUB file and extract metadata and chapters.
- assertZipWithinLimits() : void
- Walk the ZIP central directory and refuse archives that exceed the entry-count or decompressed-size budget (the zip-bomb defense). When ext-zip is unavailable the check is skipped silently, leaving the minimal magic-byte test in isValidEpub as the only validation.
- extractAuthor() : string|null
- Extract the primary author name from an ebook.
- extractChapters() : array<string|int, array{num: int, title: string, content: string}>
- Extract chapters from an ebook.
- extractFromHtmlFiles() : array<string|int, array{num: int, title: string, content: string}>
- Extract content from HTML files in the EPUB as fallback.
- extractTitleFromContent() : string
- Extract a title from content if possible.
- getEpubModule() : EpubModule|null
- Get the EpubModule from an Ebook.
- resolveFormat() : string|null
- Resolve the format hint for the underlying ebook library.
Constants
MAX_DECOMPRESSED_BYTES
Cap on total decompressed bytes across all entries — the zip-bomb defense. 500 MB is well above any plausible legitimate EPUB and far below the level that would crash a typical worker.
public
mixed
MAX_DECOMPRESSED_BYTES
= 500 * 1024 * 1024
MAX_ENTRIES
Cap on the number of files inside the EPUB. EPUBs typically contain dozens of HTML chapters + assets, not thousands.
public
mixed
MAX_ENTRIES
= 2000
MAX_FILE_SIZE
Hard cap on the uploaded EPUB size (compressed bytes on disk).
public
mixed
MAX_FILE_SIZE
= 100 * 1024 * 1024
Real EPUBs are usually < 5 MB; even very large illustrated books stay under ~50 MB. 100 MB leaves head-room for outliers without inviting trivially-DoSable uploads.
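As a rough illustration of the upload cap, the sketch below rejects a file whose on-disk size exceeds 100 MB before any parsing starts. The function `exceedsUploadCap` and the constant `SKETCH_MAX_FILE_SIZE` are hypothetical names used only for this example; how EpubParserService itself applies MAX_FILE_SIZE is not shown on this page.

```php
<?php
// Hypothetical stand-in for EpubParserService::MAX_FILE_SIZE.
const SKETCH_MAX_FILE_SIZE = 100 * 1024 * 1024;

/**
 * Return true when the file is unreadable or larger than the cap.
 * (Assumption: unreadable files are treated as rejected.)
 */
function exceedsUploadCap(string $filePath, int $cap = SKETCH_MAX_FILE_SIZE): bool
{
    $size = @filesize($filePath);
    if ($size === false) {
        return true;
    }
    // Compressed bytes on disk, compared against the cap.
    return $size > $cap;
}
```

Checking the compressed size first is cheap, so the more expensive per-entry checks only run on files that already fit the upload budget.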
Methods
cleanHtmlContent()
Clean HTML content to plain text suitable for LWT.
public
cleanHtmlContent(string $html) : string
Strips HTML tags while preserving paragraph structure with double newlines for paragraph breaks.
Parameters
- $html : string
-
The HTML content
Return values
string —Clean plain text
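One way such a conversion could look is sketched below. It is not the service's actual implementation: the function name `htmlToPlainTextSketch`, the list of block-level tags, and the whitespace rules are all assumptions chosen to show the "double newline per paragraph" idea.

```php
<?php
/**
 * Sketch: convert HTML to plain text, keeping paragraph breaks as blank lines.
 */
function htmlToPlainTextSketch(string $html): string
{
    // Drop script/style blocks entirely, including their contents.
    $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $html) ?? $html;
    // Turn the end of block-level elements into paragraph breaks.
    $text = preg_replace('~</(p|div|h[1-6]|li|blockquote)>~i', "\n\n", $text) ?? $text;
    // Treat <br> as a single line break.
    $text = preg_replace('~<br\s*/?>~i', "\n", $text) ?? $text;
    // Remove the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of spaces/tabs, trim around newlines, allow at most one blank line.
    $text = preg_replace('~[ \t]+~', ' ', $text) ?? $text;
    $text = preg_replace('~ *\n *~', "\n", $text) ?? $text;
    $text = preg_replace('~\n{3,}~', "\n\n", $text) ?? $text;
    return trim($text);
}
```

For example, `htmlToPlainTextSketch('<p>Hello <b>world</b></p><p>Tom &amp; Jerry</p>')` yields two paragraphs separated by one blank line.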
getMetadata()
Get just the metadata without parsing chapters.
public
getMetadata(string $filePath[, string $originalName = '' ]) : array{title: string, author: string|null, description: string|null, language: string|null}|null
Parameters
- $filePath : string
-
Path to the EPUB file
- $originalName : string = ''
-
Original filename (used to derive the format when $filePath has no extension, e.g. PHP upload temp paths)
Return values
array{title: string, author: string|null, description: string|null, language: string|null}|null —Metadata or null on failure
isNavigationFile()
Detect EPUB 3 navigation / TOC documents that should not appear as chapters.
public
isNavigationFile(EpubHtml $htmlFile) : bool
The kiwilan library's NCX-based getChapters() ignores nav.xhtml, but
when an EPUB ships without an NCX the HTML fallback would otherwise
include the nav document as a phantom chapter. Filename heuristics
cover the common cases (nav.xhtml, toc.xhtml); the body sniff catches
less conventionally-named EPUB 3 nav documents identified by the
epub:type="toc" (or related) attribute on a <nav> element.
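The two heuristics described above can be sketched on plain strings. The function `looksLikeNavDocument` is hypothetical and takes a filename and body directly rather than an EpubHtml object, since that class's accessors are not documented here. Only `epub:type="toc"` is checked; which related values the real method accepts is not stated on this page.

```php
<?php
/**
 * Sketch: guess whether an XHTML file is an EPUB 3 navigation document.
 */
function looksLikeNavDocument(string $filename, string $body): bool
{
    // Filename heuristic: the conventional names mentioned in the description.
    $base = strtolower(basename($filename));
    if (in_array($base, ['nav.xhtml', 'toc.xhtml'], true)) {
        return true;
    }
    // Body sniff: a <nav> element whose epub:type attribute contains "toc".
    return preg_match('~<nav\b[^>]*\bepub:type\s*=\s*["\'][^"\']*\btoc\b~i', $body) === 1;
}
```

The filename check runs first because it avoids scanning the body for the common case.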
Parameters
- $htmlFile : EpubHtml
Return values
bool
isValidEpub()
Validate that a file is an EPUB.
public
isValidEpub(string $filePath[, string $originalName = '' ]) : bool
Parameters
- $filePath : string
-
Path to the file
- $originalName : string = ''
Return values
bool —True if valid EPUB
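Because an EPUB is a ZIP container, a minimal magic-byte test can look at the first four bytes of the file. The sketch below is an assumption about what such a test might be; the hypothetical `hasZipMagicBytes` is not the body of isValidEpub, and the exact bytes the real method inspects are not documented here.

```php
<?php
/**
 * Sketch: check whether a file starts with the ZIP local file header signature.
 */
function hasZipMagicBytes(string $filePath): bool
{
    $handle = @fopen($filePath, 'rb');
    if ($handle === false) {
        return false; // unreadable or missing file
    }
    $header = fread($handle, 4);
    fclose($handle);
    // "PK\x03\x04" is the signature that begins a non-empty ZIP archive.
    return $header === "PK\x03\x04";
}
```

A check like this only rules out obvious non-ZIP uploads; it says nothing about whether the archive's contents form a well-formed EPUB.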
parse()
Parse an EPUB file and extract metadata and chapters.
public
parse(string $filePath[, string $originalName = '' ]) : array{metadata: array{title: string, author: string|null, description: string|null, language: string|null, sourceHash: string}, chapters: array{num: int, title: string, content: string}[]}
Parameters
- $filePath : string
-
Absolute path to the EPUB file
- $originalName : string = ''
-
Original filename (used to derive the format when $filePath has no extension, e.g. PHP upload temp paths)
Return values
array{metadata: array{title: string, author: string|null, description: string|null, language: string|null, sourceHash: string}, chapters: array{num: int, title: string, content: string}[]}
assertZipWithinLimits()
Walk the ZIP central directory and refuse archives that exceed the entry-count or decompressed-size budget (the zip-bomb defense). When ext-zip is unavailable the check is skipped silently, leaving the minimal magic-byte test in isValidEpub as the only validation.
private
assertZipWithinLimits(string $filePath) : void
Parameters
- $filePath : string
-
Path to the EPUB on disk
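A minimal sketch of such a walk, using PHP's ZipArchive, is shown below. The function name `assertZipWithinLimitsSketch`, the use of RuntimeException, and the behaviour for unreadable archives are assumptions; only the two budgets and the silent skip without ext-zip come from the description above.

```php
<?php
/**
 * Sketch: reject archives with too many entries or too many declared
 * decompressed bytes. Defaults mirror MAX_ENTRIES and MAX_DECOMPRESSED_BYTES.
 */
function assertZipWithinLimitsSketch(
    string $filePath,
    int $maxEntries = 2000,
    int $maxBytes = 500 * 1024 * 1024
): void {
    if (!class_exists(ZipArchive::class)) {
        return; // ext-zip unavailable: skip silently
    }
    $zip = new ZipArchive();
    if ($zip->open($filePath) !== true) {
        throw new RuntimeException('Not a readable ZIP archive'); // assumed behaviour
    }
    try {
        if ($zip->numFiles > $maxEntries) {
            throw new RuntimeException('Too many entries in archive');
        }
        $total = 0;
        for ($i = 0; $i < $zip->numFiles; $i++) {
            $stat = $zip->statIndex($i);
            if ($stat === false) {
                continue; // entry metadata unavailable; skipped in this sketch
            }
            // 'size' is the uncompressed size declared for the entry.
            $total += $stat['size'];
            if ($total > $maxBytes) {
                throw new RuntimeException('Decompressed size budget exceeded');
            }
        }
    } finally {
        $zip->close();
    }
}
```

The loop reads only entry metadata, so nothing is decompressed while the budget is being checked.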
extractAuthor()
Extract the primary author name from an ebook.
private
extractAuthor(Ebook $ebook) : string|null
Parameters
- $ebook : Ebook
-
The ebook object
Return values
string|null —Author name or null if not found
extractChapters()
Extract chapters from an ebook.
private
extractChapters(Ebook $ebook) : array<string|int, array{num: int, title: string, content: string}>
Parameters
- $ebook : Ebook
-
The ebook object
Return values
array<string|int, array{num: int, title: string, content: string}>
extractFromHtmlFiles()
Extract content from HTML files in the EPUB as fallback.
private
extractFromHtmlFiles(Ebook $ebook) : array<string|int, array{num: int, title: string, content: string}>
Parameters
- $ebook : Ebook
-
The ebook object
Return values
array<string|int, array{num: int, title: string, content: string}>
extractTitleFromContent()
Extract a title from content if possible.
private
extractTitleFromContent(string $content, int $num) : string
Parameters
- $content : string
-
The text content
- $num : int
-
Default chapter number
Return values
string —The extracted or default title
getEpubModule()
Get the EpubModule from an Ebook.
private
getEpubModule(Ebook $ebook) : EpubModule|null
Parameters
- $ebook : Ebook
-
The ebook object
Return values
EpubModule|null —The EPUB module or null if not an EPUB
resolveFormat()
Resolve the format hint for the underlying ebook library.
private
resolveFormat(string $filePath, string $originalName) : string|null
Falls back to the original filename's extension when the path itself has none (PHP upload temp paths look like /tmp/phpXXXXXX). Returns null when no extension can be determined, letting the library decide.
Parameters
- $filePath : string
- $originalName : string
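The fallback described above can be sketched with `pathinfo()`. The function name `resolveFormatSketch` is hypothetical, and lower-casing the extension is an assumption not stated in the description.

```php
<?php
/**
 * Sketch: take the extension from the path, else from the original
 * filename, else return null so the library can decide.
 */
function resolveFormatSketch(string $filePath, string $originalName): ?string
{
    // PHP upload temp paths such as /tmp/phpXXXXXX have no extension.
    $ext = pathinfo($filePath, PATHINFO_EXTENSION);
    if ($ext === '') {
        $ext = pathinfo($originalName, PATHINFO_EXTENSION);
    }
    // Assumption: normalise case so "EPUB" and "epub" are treated alike.
    return $ext === '' ? null : strtolower($ext);
}
```

With a temp path and an original name of `Book.EPUB`, this returns `epub`; with neither carrying an extension it returns null.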