Documentation

WebPageExtractor

Extracts readable text content from web pages.

Performs URL validation (SSRF protection), fetches HTML, detects charset, and extracts the main article text.

Tags
since 3.0.0

Table of Contents

Constants

CONTENT_TAGS  = ['article', 'main', '[role="main"]', '[id="mw-content-text"]']
Tags that indicate main content areas (in priority order).
FETCH_TIMEOUT  = 15
HTTP fetch timeout in seconds.
MAX_RESPONSE_SIZE  = 2 * 1024 * 1024
Maximum response size in bytes (2 MB).
NOISE_XPATHS  = [
    // Common reference/citation sections
    '//*[contains(@class,"reflist") or contains(@class,"references") or contains(@class,"refbegin")]',
    '//*[contains(@class,"navbox") or contains(@class,"sidebar") or contains(@class,"infobox")]',
    '//*[contains(@class,"catlinks") or contains(@class,"mw-jump-link")]',
    '//*[@id="toc" or @class="toc"]',
    '//nav[contains(@class,"toc")]',
    '//*[contains(@class,"noprint")]',
    // Common ad/social/cookie elements
    '//*[contains(@class,"share") or contains(@class,"social")]',
    '//*[contains(@class,"cookie") or contains(@class,"banner")]',
    '//*[contains(@class,"related-") or contains(@class,"recommended")]',
    '//*[contains(@class,"comment") and not(contains(@class,"content"))]',
    // Wikipedia-specific
    '//*[contains(@class,"mw-editsection")]',
    '//*[contains(@class,"sistersitebox")]',
    '//*[contains(@class,"authority-control")]',
]
XPath selectors for noisy elements to remove (class/id-based).
STRIP_TAGS  = ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript', 'iframe', 'svg', 'figure', 'figcaption']
Tags to strip before extracting text.

Methods

extractFromUrl()  : array{title: string, text: string, sourceUri: string}|array{error: string}
Extract title and text content from a URL.
cleanText()  : string
Clean extracted text: normalize whitespace and line breaks.
detectCharset()  : string|null
Detect charset from HTML meta tags.
extractBodyText()  : string
Extract the main body text from the HTML document.
extractTitle()  : string
Extract title from HTML document.
fetchPage()  : string|null
Fetch page content from URL.
findLargestTextBlock()  : DOMNode|null
Find the child element with the most text content.
getTextFromNodeList()  : string
Get cleaned text from a node list.
isPlainText()  : bool
Check if content is plain text (no HTML tags).
looksLikeBinary()  : bool
Check if content looks like binary (not HTML/text).
normalizeCharset()  : string
Detect charset and convert to UTF-8.
parseHtml()  : DOMDocument|null
Parse HTML string into DOMDocument.
queryNodes()  : DOMNodeList|null
Query nodes using either tag name or CSS-like selector.
stripByXPath()  : void
Remove elements matching XPath selectors.
stripElements()  : void
Remove specified element types from the DOM.
stripGutenbergBoilerplate()  : string
Strip Project Gutenberg header and footer boilerplate.
titleFromUrl()  : string
Derive a title from a URL's path.
unwrapHardLineBreaks()  : string
Unwrap hard line breaks typical of plain text files (e.g. ~72-char wraps).

Constants

CONTENT_TAGS

Tags that indicate main content areas (in priority order).

private array<int, string> CONTENT_TAGS = ['article', 'main', '[role="main"]', '[id="mw-content-text"]']

FETCH_TIMEOUT

HTTP fetch timeout in seconds.

private mixed FETCH_TIMEOUT = 15

MAX_RESPONSE_SIZE

Maximum response size in bytes (2 MB).

private mixed MAX_RESPONSE_SIZE = 2 * 1024 * 1024

NOISE_XPATHS

XPath selectors for noisy elements to remove (class/id-based).

private array<int, string> NOISE_XPATHS = [
    // Common reference/citation sections
    '//*[contains(@class,"reflist") or contains(@class,"references") or contains(@class,"refbegin")]',
    '//*[contains(@class,"navbox") or contains(@class,"sidebar") or contains(@class,"infobox")]',
    '//*[contains(@class,"catlinks") or contains(@class,"mw-jump-link")]',
    '//*[@id="toc" or @class="toc"]',
    '//nav[contains(@class,"toc")]',
    '//*[contains(@class,"noprint")]',
    // Common ad/social/cookie elements
    '//*[contains(@class,"share") or contains(@class,"social")]',
    '//*[contains(@class,"cookie") or contains(@class,"banner")]',
    '//*[contains(@class,"related-") or contains(@class,"recommended")]',
    '//*[contains(@class,"comment") and not(contains(@class,"content"))]',
    // Wikipedia-specific
    '//*[contains(@class,"mw-editsection")]',
    '//*[contains(@class,"sistersitebox")]',
    '//*[contains(@class,"authority-control")]',
]

STRIP_TAGS

Tags to strip before extracting text.

private array<int, string> STRIP_TAGS = ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript', 'iframe', 'svg', 'figure', 'figcaption']

Methods

extractFromUrl()

Extract title and text content from a URL.

public extractFromUrl(string $url) : array{title: string, text: string, sourceUri: string}|array{error: string}
Parameters
$url : string

The URL to fetch and extract from

Return values
array{title: string, text: string, sourceUri: string}|array{error: string}
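
Because the return type is a union of a success shape and an error shape, callers need to branch on the `error` key. A minimal sketch of that branching follows; `summarizeExtraction()` is a hypothetical helper, and the sample array stands in for a real call because this page does not show how the extractor is constructed.

```php
<?php
/**
 * Hypothetical helper: turn the documented result shape into a one-line summary.
 *
 * @param array{title: string, text: string, sourceUri: string}|array{error: string} $result
 */
function summarizeExtraction(array $result): string
{
    // The error variant carries only an 'error' key.
    if (isset($result['error'])) {
        return 'Extraction failed: ' . $result['error'];
    }

    return sprintf(
        '%s (%d characters) from %s',
        $result['title'],
        strlen($result['text']),
        $result['sourceUri']
    );
}

// In real use, $result would come from $extractor->extractFromUrl($url).
$result = ['title' => 'Example', 'text' => 'Hello world', 'sourceUri' => 'https://example.com/'];
echo summarizeExtraction($result), PHP_EOL;
```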

cleanText()

Clean extracted text: normalize whitespace and line breaks.

private cleanText(string $text) : string
Parameters
$text : string

Raw text content

Return values
string

Cleaned text
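
The method body is not shown on this page, so the following is only a sketch of what "normalize whitespace and line breaks" could look like. The function name `cleanTextSketch()` and the exact rules (single spaces, at most one blank line between paragraphs) are assumptions.

```php
<?php
function cleanTextSketch(string $text): string
{
    // Normalize Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Drop the single space left on either side of a line break.
    $text = preg_replace('/ ?\n ?/', "\n", $text);
    // Allow at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}
```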

detectCharset()

Detect charset from HTML meta tags.

private detectCharset(string $html) : string|null
Parameters
$html : string

HTML content

Return values
string|null

Detected charset or null
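
One plausible way to read a charset from meta tags is a single regular expression that covers both the HTML5 form and the older `http-equiv` form. This is a sketch under that assumption; `detectCharsetSketch()` is a hypothetical name, and lowercasing the result is a choice made here, not documented behavior.

```php
<?php
function detectCharsetSketch(string $html): ?string
{
    // Matches <meta charset="..."> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }

    return null;
}
```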

extractBodyText()

Extract the main body text from the HTML document.

private extractBodyText(DOMDocument $dom) : string

Strategy: strip noise elements, then try to find the main content area (<article>, <main>, etc.). If none is found, fall back to the largest text-bearing element.

Parameters
$dom : DOMDocument

Parsed HTML document

Return values
string

Extracted text content
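
The strategy described above can be sketched in three steps. This is an illustration, not the class's code: `extractBodyTextSketch()` is a hypothetical name, only two tags stand in for STRIP_TAGS, and the fallback simply compares the direct children of `<body>`.

```php
<?php
function extractBodyTextSketch(DOMDocument $dom): string
{
    $xpath = new DOMXPath($dom);

    // 1. Strip noise (a two-tag stand-in for the full STRIP_TAGS list).
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // 2. Prefer a dedicated content container, in priority order.
    foreach (['//article', '//main', '//*[@role="main"]'] as $query) {
        $hit = $xpath->query($query)->item(0);
        if ($hit !== null && trim($hit->textContent) !== '') {
            return trim($hit->textContent);
        }
    }

    // 3. Fall back to the <body> child with the most text.
    $best = '';
    foreach ($xpath->query('//body/*') as $child) {
        $text = trim($child->textContent);
        if (strlen($text) > strlen($best)) {
            $best = $text;
        }
    }

    return $best;
}

libxml_use_internal_errors(true); // HTML5 tags such as <main> trigger libxml warnings
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div>Menu</div><main>Main story text.</main></body></html>');
echo extractBodyTextSketch($dom), PHP_EOL;
```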

extractTitle()

Extract title from HTML document.

private extractTitle(DOMDocument $dom) : string

Tries the og:title meta tag first, then falls back to the <title> tag.

Parameters
$dom : DOMDocument

Parsed HTML document

Return values
string

Extracted title
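
The two-step lookup can be expressed with XPath. In this sketch, `extractTitleSketch()` is a hypothetical name, and returning an empty string when neither source exists is an assumption.

```php
<?php
function extractTitleSketch(DOMDocument $dom): string
{
    $xpath = new DOMXPath($dom);

    // Open Graph title first.
    $og = $xpath->query('//meta[@property="og:title"]/@content')->item(0);
    if ($og !== null && trim($og->nodeValue) !== '') {
        return trim($og->nodeValue);
    }

    // Then the regular <title> element.
    $title = $xpath->query('//title')->item(0);

    return $title !== null ? trim($title->textContent) : '';
}
```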

fetchPage()

Fetch page content from URL.

private fetchPage(string $url) : string|null
Parameters
$url : string

URL to fetch

Return values
string|null

HTML content or null on failure
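
The class summary mentions URL validation (SSRF protection) before fetching, but this page does not describe how it is done. A minimal sketch of one common approach follows: allow only http(s) and reject hosts that resolve to private or reserved addresses. The name `isFetchableUrlSketch()` and the injectable `$resolver` parameter are assumptions made for illustration and testability.

```php
<?php
function isFetchableUrlSketch(string $url, ?callable $resolver = null): bool
{
    $resolver ??= 'gethostbynamel'; // returns a list of IPv4 strings, or false

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // IP literals are checked directly; host names are resolved first.
    $host = trim($parts['host'], '[]');
    $ips = filter_var($host, FILTER_VALIDATE_IP) !== false
        ? [$host]
        : ($resolver($host) ?: []);
    if ($ips === []) {
        return false;
    }

    // Reject loopback, link-local, private, and other reserved ranges.
    $flags = FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE;
    foreach ($ips as $ip) {
        if (filter_var($ip, FILTER_VALIDATE_IP, $flags) === false) {
            return false;
        }
    }

    return true;
}
```

A check like this is only a first line of defense: the fetch itself should connect to the address that was validated, otherwise a host name could resolve differently between the check and the request.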

findLargestTextBlock()

Find the child element with the most text content.

private findLargestTextBlock(DOMNode $parent) : DOMNode|null
Parameters
$parent : DOMNode

Parent node to search within

Return values
DOMNode|null

The node with the most text, or null
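
A sketch of how such a search might work, assuming that only direct element children are compared and that "most text" means the longest trimmed `textContent`. `findLargestTextBlockSketch()` is a hypothetical name.

```php
<?php
function findLargestTextBlockSketch(DOMNode $parent): ?DOMNode
{
    $best = null;
    $bestLength = 0;

    foreach ($parent->childNodes as $child) {
        // Skip text nodes, comments, and other non-element children.
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        $length = strlen(trim($child->textContent));
        if ($length > $bestLength) {
            $best = $child;
            $bestLength = $length;
        }
    }

    return $best;
}
```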

getTextFromNodeList()

Get cleaned text from a node list.

private getTextFromNodeList(DOMNodeList $nodes) : string
Parameters
$nodes : DOMNodeList

Node list

Return values
string

Combined text
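
As an illustration only, combining a node list into text might look like the sketch below. The name `getTextFromNodeListSketch()`, the whitespace collapsing, and the blank-line separator between nodes are all assumptions.

```php
<?php
function getTextFromNodeListSketch(DOMNodeList $nodes): string
{
    $parts = [];
    foreach ($nodes as $node) {
        // Collapse internal whitespace so each node yields one clean line.
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    // Separate the surviving nodes with a blank line.
    return implode("\n\n", $parts);
}
```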

isPlainText()

Check if content is plain text (no HTML tags).

private isPlainText(string $content) : bool
Parameters
$content : string

Content to check

Return values
bool

True if content appears to be plain text
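
One simple heuristic for "no HTML tags" is to look for anything shaped like an opening or closing tag. The sketch below assumes that approach; `isPlainTextSketch()` is a hypothetical name and the actual check may differ.

```php
<?php
function isPlainTextSketch(string $content): bool
{
    // "<" followed by an optional "/" and a tag name counts as markup;
    // a bare "<" as in "3 < 5" does not.
    return preg_match('/<\/?[a-z][a-z0-9]*[\s>\/]/i', $content) !== 1;
}
```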

looksLikeBinary()

Check if content looks like binary (not HTML/text).

private looksLikeBinary(string $content) : bool
Parameters
$content : string

Content to check

Return values
bool

True if content appears to be binary
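
A common way to guess that a response is binary is to sample the first bytes and look for NUL bytes or a high share of control characters. The sketch below assumes that heuristic; the name `looksLikeBinarySketch()`, the 1024-byte sample, and the 10% threshold are illustrative choices.

```php
<?php
function looksLikeBinarySketch(string $content): bool
{
    $sample = substr($content, 0, 1024);
    if ($sample === '') {
        return false;
    }

    // NUL bytes practically never occur in HTML or plain text.
    if (strpos($sample, "\0") !== false) {
        return true;
    }

    // Count control characters other than tab, line feed, and carriage return.
    $control = preg_match_all('/[\x01-\x08\x0B\x0C\x0E-\x1F]/', $sample);

    return $control / strlen($sample) > 0.1;
}
```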

normalizeCharset()

Detect charset and convert to UTF-8.

private normalizeCharset(string $html) : string
Parameters
$html : string

HTML content

Return values
string

UTF-8 encoded HTML
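
The conversion step could be sketched with the mbstring extension as shown below. Note the differences from the documented signature: `normalizeCharsetSketch()` is a hypothetical function that takes the already-detected charset as a second argument, and it leaves the input untouched when the label is missing or unknown to mbstring.

```php
<?php
function normalizeCharsetSketch(string $html, ?string $declared): string
{
    $charset = strtoupper($declared ?? 'UTF-8');
    if ($charset === 'UTF-8' || $charset === 'UTF8') {
        return $html;
    }

    // mb_convert_encoding() throws a ValueError for unknown labels on PHP 8,
    // so check against the list of supported encodings first.
    if (!in_array($charset, array_map('strtoupper', mb_list_encodings()), true)) {
        return $html;
    }

    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```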

parseHtml()

Parse HTML string into DOMDocument.

private parseHtml(string $html) : DOMDocument|null
Parameters
$html : string

HTML content

Return values
DOMDocument|null

Parsed document or null on failure
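
Parsing real-world HTML with `DOMDocument` usually means silencing libxml warnings and guarding against empty input. A sketch under those assumptions (`parseHtmlSketch()` is a hypothetical name):

```php
<?php
function parseHtmlSketch(string $html): ?DOMDocument
{
    // loadHTML() rejects an empty string on PHP 8, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $ok = $dom->loadHTML($html, LIBXML_NONET); // LIBXML_NONET: no network access while parsing

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $dom : null;
}
```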

queryNodes()

Query nodes using either tag name or CSS-like selector.

private queryNodes(DOMXPath $xpath, string $selector) : DOMNodeList|null
Parameters
$xpath : DOMXPath

XPath instance

$selector : string

Tag name or attribute selector

Return values
DOMNodeList|null
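
The entries in CONTENT_TAGS mix plain tag names with attribute selectors such as `[role="main"]`, so some translation to XPath is needed. The sketch below shows one way to do that translation; `selectorToXPathSketch()` is a hypothetical helper and handles only the two selector shapes that appear in CONTENT_TAGS.

```php
<?php
function selectorToXPathSketch(string $selector): string
{
    // [attr="value"] becomes //*[@attr="value"]
    if (preg_match('/^\[([\w-]+)="([^"]*)"\]$/', $selector, $m)) {
        return sprintf('//*[@%s="%s"]', $m[1], $m[2]);
    }

    // Anything else is treated as a tag name.
    return '//' . $selector;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div id="mw-content-text">Wiki body</div></body></html>');
$xpath = new DOMXPath($dom);
$nodes = $xpath->query(selectorToXPathSketch('[id="mw-content-text"]'));
echo $nodes->item(0)->textContent, PHP_EOL;
```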

stripByXPath()

Remove elements matching XPath selectors.

private stripByXPath(DOMXPath $xpath, array<int, string> $selectors) : void
Parameters
$xpath : DOMXPath

XPath instance

$selectors : array<int, string>

XPath selectors
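
Removing every match for a list of XPath expressions might be implemented along these lines. `stripByXPathSketch()` is a hypothetical name; the two selectors in the usage lines are taken from NOISE_XPATHS.

```php
<?php
function stripByXPathSketch(DOMXPath $xpath, array $selectors): void
{
    foreach ($selectors as $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes === false) {
            continue; // malformed expression
        }
        // Snapshot the matches before mutating the tree.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div class="navbox">Nav</div><p>Keep</p><div class="cookie-banner">Cookies</div></body></html>');
stripByXPathSketch(new DOMXPath($dom), [
    '//*[contains(@class,"navbox") or contains(@class,"sidebar") or contains(@class,"infobox")]',
    '//*[contains(@class,"cookie") or contains(@class,"banner")]',
]);
echo trim($dom->getElementsByTagName('body')->item(0)->textContent), PHP_EOL;
```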

stripElements()

Remove specified element types from the DOM.

private stripElements(DOMDocument $dom, array<int, string> $tags) : void
Parameters
$dom : DOMDocument

Document to modify

$tags : array<int, string>

Tag names to remove
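
Tag-based stripping can be sketched with `getElementsByTagName()`. The function name `stripElementsSketch()` is hypothetical; the copy into an array matters because the list returned by `getElementsByTagName()` is live and shrinks as nodes are removed.

```php
<?php
function stripElementsSketch(DOMDocument $dom, array $tags): void
{
    foreach ($tags as $tag) {
        // Copy the live node list before removing anything from it.
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
}

libxml_use_internal_errors(true); // <nav> is an HTML5 tag and triggers libxml warnings
$dom = new DOMDocument();
$dom->loadHTML('<html><body><script>x()</script><p>Text</p><nav>Menu</nav></body></html>');
stripElementsSketch($dom, ['script', 'nav']);
echo trim($dom->getElementsByTagName('body')->item(0)->textContent), PHP_EOL;
```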

stripGutenbergBoilerplate()

Strip Project Gutenberg header and footer boilerplate.

private stripGutenbergBoilerplate(string $text) : string

Gutenberg plain text files have a preamble ending with "*** START OF THE PROJECT GUTENBERG EBOOK ..." and a footer starting with "*** END OF THE PROJECT GUTENBERG EBOOK ...".

Parameters
$text : string

Raw Gutenberg text

Return values
string

Text with boilerplate removed (unchanged if no markers found)
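
Based on the two markers quoted above, the trimming could be sketched as follows. `stripGutenbergBoilerplateSketch()` is a hypothetical name, and the regular expressions match only the exact marker wording given in the description.

```php
<?php
function stripGutenbergBoilerplateSketch(string $text): string
{
    $start = '/^\*\*\* ?START OF THE PROJECT GUTENBERG EBOOK.*$/mi';
    $end   = '/^\*\*\* ?END OF THE PROJECT GUTENBERG EBOOK.*$/mi';
    $found = false;

    // Drop everything up to and including the start marker line.
    if (preg_match($start, $text, $m, PREG_OFFSET_CAPTURE)) {
        $text = substr($text, $m[0][1] + strlen($m[0][0]));
        $found = true;
    }
    // Drop the end marker line and everything after it.
    if (preg_match($end, $text, $m, PREG_OFFSET_CAPTURE)) {
        $text = substr($text, 0, $m[0][1]);
        $found = true;
    }

    // Leave the input untouched when no marker was present.
    return $found ? trim($text) : $text;
}
```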

titleFromUrl()

Derive a title from a URL's path.

private titleFromUrl(string $url) : string
Parameters
$url : string

The source URL

Return values
string

A human-readable title
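
How the title is derived is not specified here. One plausible sketch takes the last path segment, drops a file extension, and turns separators into spaces; `titleFromUrlSketch()` and the fallback to the host name are assumptions.

```php
<?php
function titleFromUrlSketch(string $url): string
{
    $path = (string) parse_url($url, PHP_URL_PATH);
    $slug = basename(rtrim($path, '/'));

    // No usable path segment: fall back to the host name.
    if ($slug === '') {
        return (string) parse_url($url, PHP_URL_HOST);
    }

    // Decode percent-escapes and drop a short file extension such as ".html".
    $slug = preg_replace('/\.[a-z0-9]{1,5}$/i', '', rawurldecode($slug));

    // Turn hyphens and underscores into spaces and capitalize each word.
    return ucwords(trim(str_replace(['-', '_'], ' ', $slug)));
}
```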

unwrapHardLineBreaks()

Unwrap hard line breaks typical of plain text files (e.g. ~72-char wraps).

private unwrapHardLineBreaks(string $text) : string

Joins consecutive non-blank lines into paragraphs. Blank lines are treated as paragraph separators.

Parameters
$text : string

Text with hard line breaks

Return values
string

Text with natural paragraphs
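
The joining rule described above can be sketched by splitting on blank lines and then replacing the remaining line breaks with spaces. `unwrapHardLineBreaksSketch()` is a hypothetical name, and separating the resulting paragraphs with exactly one blank line is an assumption.

```php
<?php
function unwrapHardLineBreaksSketch(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Blank lines (possibly containing whitespace) separate paragraphs.
    $paragraphs = preg_split('/\n\s*\n/', trim($text));

    // Within a paragraph, every line break becomes a single space.
    $joined = array_map(
        static fn (string $p): string => preg_replace('/\s*\n\s*/', ' ', trim($p)),
        $paragraphs
    );

    return implode("\n\n", $joined);
}
```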


        