Documentation

WebPageExtractor
in package Lwt Shared Infrastructure Http

Extracts readable text content from web pages.

Performs URL validation (SSRF protection), fetches HTML, detects charset, and extracts the main article text.

Constants

CONTENT_TAGS = ['article', 'main', '[role="main"]', '[id="mw-content-text"]']: Tags that indicate main content areas (in priority order).
FETCH_TIMEOUT = 15: HTTP fetch timeout in seconds.
MAX_RESPONSE_SIZE = 2 * 1024 * 1024: Maximum response size in bytes (2 MB).
NOISE_XPATHS = [13 XPath expressions, listed in full in the NOISE_XPATHS section]: XPath selectors for noisy elements to remove (class/id-based).
STRIP_TAGS = ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript', 'iframe', 'svg', 'figure', 'figcaption']: Tags to strip before extracting text.

Methods

extractFromUrl() : array{title: string, text: string, sourceUri: string}|array{error: string}: Extract title and text content from a URL.
stripGutenbergBoilerplatePublic() : string: Strip Gutenberg boilerplate and clean up text (public API).
cleanText() : string: Clean extracted text: normalize whitespace and line breaks.
detectCharset() : string|null: Detect charset from HTML meta tags.
extractBodyText() : string: Extract the main body text from the HTML document.
extractTitle() : string: Extract title from HTML document.
fetchPage() : string|null: Fetch page content from URL.
isPlainText() : bool: Check if content is plain text (no HTML tags).
looksLikeBinary() : bool: Check if content looks like binary (not HTML/text).
normalizeCharset() : string: Detect charset and convert to UTF-8.
stripGutenbergBoilerplate() : string: Strip Project Gutenberg header and footer boilerplate.
titleFromUrl() : string: Derive a title from a URL's path.
unwrapHardLineBreaks() : string: Unwrap hard line breaks typical of plain text files (e.g. ~72-char wraps).
findLargestTextBlock() : DOMNode|null: Find the child element with the most text content.
getTextFromNodeList() : string: Get cleaned text from a node list.
parseHtml() : DOMDocument|null: Parse HTML string into DOMDocument.
queryNodes() : DOMNodeList|null: Query nodes using either tag name or CSS-like selector.
stripByXPath() : void: Remove elements matching XPath selectors.
stripElements() : void: Remove specified element types from the DOM.

CONTENT_TAGS

Tags that indicate main content areas (in priority order).


    private array<int, string> CONTENT_TAGS = ['article', 'main', '[role="main"]', '[id="mw-content-text"]']

FETCH_TIMEOUT

HTTP fetch timeout in seconds.


    private mixed FETCH_TIMEOUT = 15

MAX_RESPONSE_SIZE

Maximum response size in bytes (2 MB).


    private mixed MAX_RESPONSE_SIZE = 2 * 1024 * 1024

NOISE_XPATHS

XPath selectors for noisy elements to remove (class/id-based).


    private array<int, string> NOISE_XPATHS = [
        // Common reference/citation sections
        '//*[contains(@class,"reflist") or contains(@class,"references") or contains(@class,"refbegin")]',
        '//*[contains(@class,"navbox") or contains(@class,"sidebar") or contains(@class,"infobox")]',
        '//*[contains(@class,"catlinks") or contains(@class,"mw-jump-link")]',
        '//*[@id="toc" or @class="toc"]',
        '//nav[contains(@class,"toc")]',
        '//*[contains(@class,"noprint")]',
        // Common ad/social/cookie elements
        '//*[contains(@class,"share") or contains(@class,"social")]',
        '//*[contains(@class,"cookie") or contains(@class,"banner")]',
        '//*[contains(@class,"related-") or contains(@class,"recommended")]',
        '//*[contains(@class,"comment") and not(contains(@class,"content"))]',
        // Wikipedia-specific
        '//*[contains(@class,"mw-editsection")]',
        '//*[contains(@class,"sistersitebox")]',
        '//*[contains(@class,"authority-control")]',
    ]

STRIP_TAGS

Tags to strip before extracting text.


    private array<int, string> STRIP_TAGS = ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript', 'iframe', 'svg', 'figure', 'figcaption']

extractFromUrl()

Extract title and text content from a URL.


    public extractFromUrl(string $url[, string $titleHint = '' ]) : array{title: string, text: string, sourceUri: string}|array{error: string}

Parameters

$url : string: The URL to fetch and extract from
$titleHint : string = '': Optional pre-filled title (used as fallback for plain text files)

Return values

array{title: string, text: string, sourceUri: string}|array{error: string}
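
The return type is a union of two array shapes, so a caller has to tell success from failure before reading `title` or `text`. The following is a minimal sketch of that check. `summarizeExtraction()` is a hypothetical helper, not part of this class, and it works on a plain array so it runs without the extractor itself.

```php
<?php
/**
 * Hypothetical helper: turns an extractFromUrl()-shaped result into one line.
 *
 * @param array{title: string, text: string, sourceUri: string}|array{error: string} $result
 */
function summarizeExtraction(array $result): string
{
    // The failure shape carries only an 'error' key, so its presence decides the branch.
    if (isset($result['error'])) {
        return 'Extraction failed: ' . $result['error'];
    }

    return sprintf(
        '%s (%d bytes of text) from %s',
        $result['title'],
        strlen($result['text']),
        $result['sourceUri']
    );
}
```

Checking for the `error` key first keeps the success branch free of further key checks, because the documented success shape always carries all three keys.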

stripGutenbergBoilerplatePublic()

Strip Gutenberg boilerplate and clean up text (public API).


    public stripGutenbergBoilerplatePublic(string $text) : string

Parameters

$text : string: Raw Gutenberg text

Return values

string: Cleaned text

cleanText()

Clean extracted text: normalize whitespace and line breaks.


    protected cleanText(string $text) : string

Parameters

$text : string: Raw text content

Return values

string: Cleaned text
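
One way to normalize whitespace and line breaks is sketched below. The exact rules used by `cleanText()` are not documented here, so `sketchCleanText()` and every rule in it (one space between words, at most one blank line between paragraphs) are assumptions for illustration.

```php
<?php
/**
 * Hypothetical stand-in for a whitespace-normalizing cleaner.
 */
function sketchCleanText(string $text): string
{
    // Unify Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;
    // Drop a space that sits directly before or after a line break.
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;
    // Allow at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    return trim($text);
}
```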

detectCharset()

Detect charset from HTML meta tags.


    protected detectCharset(string $html) : string|null

Parameters

$html : string: HTML content

Return values

string|null: Detected charset or null
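
Charset detection from meta tags can be sketched with a single regular expression that covers both common declaration forms. `sketchDetectCharset()`, the 4096-byte scan window, and the lowercased return value are assumptions, not the documented behavior of `detectCharset()`.

```php
<?php
/**
 * Hypothetical sketch: find a charset declared in HTML meta tags.
 */
function sketchDetectCharset(string $html): ?string
{
    // Charset declarations sit near the top of a page, so scan only the start.
    $head = substr($html, 0, 4096);

    // Matches <meta charset="utf-8"> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, $head, $m) === 1) {
        return strtolower($m[1]);
    }

    return null;
}
```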

extractBodyText()

Extract the main body text from the HTML document.


    protected extractBodyText(DOMDocument $dom) : string

Strategy: strip noise elements, then try to find the main content area (<article>, <main>, etc.). If none is found, fall back to the largest text-bearing child element.

Parameters

$dom : DOMDocument: Parsed HTML document

Return values

string: Extracted text content
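
The strategy described above can be sketched roughly as follows. This is not the class's implementation: `sketchExtractBodyText()` is hypothetical, the tag lists are shortened copies of STRIP_TAGS and CONTENT_TAGS, and the fallback only looks at direct children of <body>. It assumes PHP 8 with the DOM extension.

```php
<?php
/**
 * Hypothetical sketch of "strip noise, then find the main content area".
 */
function sketchExtractBodyText(DOMDocument $dom): string
{
    // 1. Remove noise. getElementsByTagName() returns a live list, so copy it
    //    to an array before removing nodes.
    foreach (['script', 'style', 'nav', 'footer'] as $tag) {
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            $node->parentNode?->removeChild($node);
        }
    }

    // 2. Try the preferred content containers in priority order.
    $xpath = new DOMXPath($dom);
    foreach (['//article', '//main', '//*[@role="main"]'] as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            return trim($nodes->item(0)->textContent);
        }
    }

    // 3. Fall back to the direct child of <body> with the most text.
    $best = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    foreach ($body?->childNodes ?? [] as $child) {
        $text = trim($child->textContent);
        if (strlen($text) > strlen($best)) {
            $best = $text;
        }
    }

    return $best;
}

/** Hypothetical loader that silences libxml warnings about HTML5 tags. */
function sketchLoadHtml(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}
```

Trying the containers in a fixed order mirrors the "in priority order" note on CONTENT_TAGS: a page that has both <article> and <main> yields the <article> text.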

extractTitle()

Extract title from HTML document.


    protected extractTitle(DOMDocument $dom) : string

Tries the og:title meta tag first, then the <title> element.

Parameters

$dom : DOMDocument: Parsed HTML document

Return values

string: Extracted title
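
The lookup order can be sketched with DOMXPath. `sketchExtractTitle()` is a hypothetical name, and returning an empty string when neither source exists is an assumption rather than documented behavior.

```php
<?php
/**
 * Hypothetical sketch: prefer og:title, then fall back to <title>.
 */
function sketchExtractTitle(DOMDocument $dom): string
{
    $xpath = new DOMXPath($dom);

    // Prefer the Open Graph title when the page declares a non-empty one.
    $og = $xpath->query('//meta[@property="og:title"]/@content');
    if ($og !== false && $og->length > 0) {
        $title = trim($og->item(0)->nodeValue ?? '');
        if ($title !== '') {
            return $title;
        }
    }

    // Otherwise use the <title> element, or an empty string if it is missing.
    $titleNode = $dom->getElementsByTagName('title')->item(0);

    return $titleNode !== null ? trim($titleNode->textContent) : '';
}
```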

fetchPage()

Fetch page content from URL.


    protected fetchPage(string $url) : string|null

Routes through UrlUtilities::safeHttpGet so the entry URL and every redirect hop are run through validateUrlForFetch. With the older follow_location => true setup, an attacker-owned public host could 302 the fetch into a private range; that vector is closed here.

Parameters

$url : string: URL to fetch

Return values

string|null: HTML content or null on failure
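
To illustrate why every redirect hop needs validation, here is a minimal sketch of a manual redirect loop. It is not the UrlUtilities::safeHttpGet implementation, whose signature is not shown on this page. `sketchSafeGet()`, the `$isAllowed` validator, the `$send` transport, and the hop limit are all hypothetical stand-ins, injected so the example runs without network access (PHP 8 assumed).

```php
<?php
/**
 * Hypothetical sketch of hop-by-hop redirect validation.
 *
 * @param callable(string): bool $isAllowed URL validator, e.g. one that rejects private ranges
 * @param callable(string): array{status: int, location: ?string, body: string} $send
 */
function sketchSafeGet(string $url, callable $isAllowed, callable $send, int $maxHops = 5): ?string
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        // Validate the entry URL and every redirect target before requesting it.
        if (!$isAllowed($url)) {
            return null;
        }

        $response = $send($url);
        $isRedirect = $response['status'] >= 300 && $response['status'] < 400;
        if (!$isRedirect) {
            return $response['status'] === 200 ? $response['body'] : null;
        }
        if ($response['location'] === null) {
            return null;
        }
        $url = $response['location'];
    }

    return null; // Too many redirects.
}

// Fake transport: a public host that redirects to a link-local address.
$send = fn (string $url): array => $url === 'https://public.example/'
    ? ['status' => 302, 'location' => 'http://169.254.169.254/', 'body' => '']
    : ['status' => 200, 'location' => null, 'body' => 'secret'];
// Fake validator: rejects anything that mentions the link-local range.
$isAllowed = fn (string $url): bool => !str_contains($url, '169.254.');

// The redirect target fails validation, so the result is null.
var_dump(sketchSafeGet('https://public.example/', $isAllowed, $send));
```

With automatic redirect following, the transport would request the second URL before any validator saw it. Handling the Location header by hand is what gives the validator a chance to run on each hop.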

isPlainText()

Check if content is plain text (no HTML tags).


    protected isPlainText(string $content) : bool

Parameters

$content : string: Content to check

Return values

bool: True if content appears to be plain text

looksLikeBinary()

Check if content looks like binary (not HTML/text).


    protected looksLikeBinary(string $content) : bool

Parameters

$content : string: Content to check

Return values

bool: True if content appears to be binary
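
A common heuristic for this kind of check is to look for NUL bytes and a high share of control characters in a leading sample. The sketch below shows that heuristic only. `sketchLooksLikeBinary()`, the 1024-byte sample, and the 10% threshold are assumptions, and the real method may use different signals. PHP 8 is assumed for `str_contains()`.

```php
<?php
/**
 * Hypothetical sketch of a binary-content heuristic.
 */
function sketchLooksLikeBinary(string $content): bool
{
    // Inspect only a leading sample so large responses stay cheap to check.
    $sample = substr($content, 0, 1024);
    if ($sample === '') {
        return false;
    }

    // A NUL byte does not normally appear in ASCII or UTF-8 text.
    if (str_contains($sample, "\0")) {
        return true;
    }

    // Otherwise, count control characters other than tab, LF, and CR.
    $control = (int) preg_match_all('/[\x01-\x08\x0B\x0C\x0E-\x1F]/', $sample);

    return $control / strlen($sample) > 0.1;
}
```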

normalizeCharset()

Detect charset and convert to UTF-8.


    protected normalizeCharset(string $html) : string

Parameters

$html : string: HTML content

Return values

string: UTF-8 encoded HTML
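
The conversion step can be sketched with `iconv()`. Unlike `normalizeCharset()`, the hypothetical `sketchToUtf8()` below takes the charset as a parameter instead of detecting it, and falling back to the original bytes on failure is an assumption.

```php
<?php
/**
 * Hypothetical sketch: convert content to UTF-8 given a known charset.
 */
function sketchToUtf8(string $html, ?string $charset): string
{
    // Nothing to do when the charset is unknown or already UTF-8.
    if ($charset === null || strcasecmp($charset, 'utf-8') === 0) {
        return $html;
    }

    // iconv() returns false and raises a notice for unknown charsets or
    // invalid input, so silence the notice and keep the original bytes.
    $converted = @iconv($charset, 'UTF-8', $html);

    return $converted === false ? $html : $converted;
}
```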

stripGutenbergBoilerplate()

Strip Project Gutenberg header and footer boilerplate.


    protected stripGutenbergBoilerplate(string $text) : string

Gutenberg plain text files have a preamble ending with "*** START OF THE PROJECT GUTENBERG EBOOK ..." and a footer starting with "*** END OF THE PROJECT GUTENBERG EBOOK ...".

Parameters

$text : string: Raw Gutenberg text

Return values

string: Text with boilerplate removed (unchanged if no markers found)
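
The marker-based trimming can be sketched with two anchored regular expressions. `sketchStripGutenberg()` is a hypothetical name. Accepting "THIS" as well as "THE" in the marker is an assumption about older files, not something this page states.

```php
<?php
/**
 * Hypothetical sketch: cut text to the span between the START and END markers.
 */
function sketchStripGutenberg(string $text): string
{
    $start = '/^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$/mi';
    $end = '/^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK/mi';

    // Drop everything up to and including the START marker line.
    if (preg_match($start, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        $text = ltrim(substr($text, $m[0][1] + strlen($m[0][0])));
    }

    // Drop the END marker line and everything after it.
    if (preg_match($end, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        $text = rtrim(substr($text, 0, $m[0][1]));
    }

    // With neither marker present, the input is returned unchanged.
    return $text;
}
```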

titleFromUrl()

Derive a title from a URL's path.


    protected titleFromUrl(string $url) : string

Parameters

$url : string: The source URL

Return values

string: A human-readable title
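
One plausible way to derive a readable title from a URL path is sketched below. The rules in `sketchTitleFromUrl()` (last path segment, extension dropped, hyphens and underscores turned into spaces, words capitalized, host as fallback) are assumptions for illustration and may differ from what `titleFromUrl()` does.

```php
<?php
/**
 * Hypothetical sketch: build a title from the last path segment of a URL.
 */
function sketchTitleFromUrl(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);

    // Fall back to the host when the URL has no usable path.
    if (!is_string($path) || trim($path, '/') === '') {
        $host = parse_url($url, PHP_URL_HOST);

        return is_string($host) ? $host : $url;
    }

    // Take the last path segment, decode %XX escapes, and drop the extension.
    $name = pathinfo(rawurldecode(basename($path)), PATHINFO_FILENAME);
    // Treat hyphens and underscores as word separators.
    $name = trim(str_replace(['-', '_'], ' ', $name));

    return ucwords($name);
}
```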

unwrapHardLineBreaks()

Unwrap hard line breaks typical of plain text files (e.g. ~72-char wraps).


    protected unwrapHardLineBreaks(string $text) : string

Joins consecutive non-blank lines into paragraphs. Blank lines are treated as paragraph separators.

Parameters

$text : string: Text with hard line breaks

Return values

string: Text with natural paragraphs
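
The joining rule described above can be sketched as a split on blank lines followed by a per-paragraph join. `sketchUnwrap()` is a hypothetical name, and treating whitespace-only lines as blank is an assumption.

```php
<?php
/**
 * Hypothetical sketch: join hard-wrapped lines into paragraphs.
 */
function sketchUnwrap(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // One or more blank (or whitespace-only) lines separate paragraphs.
    $paragraphs = preg_split('/\n(?:[ \t]*\n)+/', trim($text)) ?: [];

    $joined = [];
    foreach ($paragraphs as $paragraph) {
        // Within a paragraph, each hard line break becomes a single space.
        $joined[] = preg_replace('/[ \t]*\n[ \t]*/', ' ', trim($paragraph)) ?? $paragraph;
    }

    return implode("\n\n", $joined);
}
```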

findLargestTextBlock()

Find the child element with the most text content.


    private findLargestTextBlock(DOMNode $parent) : DOMNode|null

Parameters

$parent : DOMNode: Parent node to search within

Return values

DOMNode|null: The node with the most text, or null

getTextFromNodeList()

Get cleaned text from a node list.


    private getTextFromNodeList(DOMNodeList $nodes) : string

Parameters

$nodes : DOMNodeList: Node list

Return values

string: Combined text

parseHtml()

Parse HTML string into DOMDocument.


    private parseHtml(string $html) : DOMDocument|null

Parameters

$html : string: HTML content

Return values

DOMDocument|null: Parsed document or null on failure

queryNodes()

Query nodes using either tag name or CSS-like selector.


    private queryNodes(DOMXPath $xpath, string $selector) : DOMNodeList|null

Parameters

$xpath : DOMXPath: XPath instance
$selector : string: Tag name or attribute selector

Return values

DOMNodeList|null
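
Since CONTENT_TAGS mixes plain tag names with attribute selectors such as [role="main"], the query step needs some translation into XPath. The sketch below shows one way to do that translation as a pure string function. `sketchSelectorToXPath()` is hypothetical, and it supports only the two selector forms that appear in CONTENT_TAGS.

```php
<?php
/**
 * Hypothetical sketch: translate a tag name or [attr="value"] selector to XPath.
 */
function sketchSelectorToXPath(string $selector): ?string
{
    // Attribute selector such as [role="main"] becomes //*[@role="main"].
    if (preg_match('/^\[([A-Za-z_][\w-]*)="([^"]*)"\]$/', $selector, $m) === 1) {
        return sprintf('//*[@%s="%s"]', $m[1], $m[2]);
    }

    // Plain tag name such as article becomes //article.
    if (preg_match('/^[A-Za-z][A-Za-z0-9]*$/', $selector) === 1) {
        return '//' . strtolower($selector);
    }

    return null; // Unsupported selector syntax.
}
```

The resulting expression can be passed to `DOMXPath::query()`. Because the value pattern excludes double quotes, the generated XPath string literal cannot be broken out of.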

stripByXPath()

Remove elements matching XPath selectors.


    private stripByXPath(DOMXPath $xpath, array<int, string> $selectors) : void

Parameters

$xpath : DOMXPath: XPath instance
$selectors : array<int, string>: XPath selectors
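
Removing matched nodes can be sketched as below. `sketchStripByXPath()` is a hypothetical stand-in with the same parameter list. Skipping an expression whose query fails is an assumption. PHP 8 is assumed for the nullsafe operator.

```php
<?php
/**
 * Hypothetical sketch: remove every node matched by a list of XPath expressions.
 *
 * @param array<int, string> $selectors
 */
function sketchStripByXPath(DOMXPath $xpath, array $selectors): void
{
    foreach ($selectors as $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes === false) {
            continue; // Query failed: skip this expression rather than abort.
        }
        // Iterate over an array copy so removals cannot affect the loop.
        foreach (iterator_to_array($nodes) as $node) {
            $node->parentNode?->removeChild($node);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div class="navbox">Links</div><p>Body text</p></body></html>');
sketchStripByXPath(new DOMXPath($dom), ['//*[contains(@class,"navbox")]']);

echo trim($dom->documentElement->textContent), PHP_EOL;
```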

stripElements()

Remove specified element types from the DOM.


    private stripElements(DOMDocument $dom, array<int, string> $tags) : void

Parameters

$dom : DOMDocument: Document to modify
$tags : array<int, string>: Tag names to remove
