WebPageExtractor
Extracts readable text content from web pages.
Performs URL validation (SSRF protection), fetches HTML, detects charset, and extracts the main article text.
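The URL validation step is not documented in detail here. As a rough illustration of what an SSRF check can look like, the sketch below accepts only http(s) URLs whose host resolves to public IP addresses. The function name `isPublicHttpUrl()` and its `$resolver` parameter are hypothetical and are not part of this class.

```php
<?php
// Hypothetical helper, not part of WebPageExtractor's documented API.
function isPublicHttpUrl(string $url, ?callable $resolver = null): bool
{
    $parts = parse_url($url);
    if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // parse_url() keeps the brackets around IPv6 literals such as [::1].
    $host = trim($parts['host'], '[]');

    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        $ips = [$host];
    } else {
        // Assumption: IPv4 lookup only; $resolver exists so callers can avoid DNS.
        $ips = $resolver !== null ? $resolver($host) : gethostbynamel($host);
        if (!is_array($ips) || $ips === []) {
            return false;
        }
    }

    // Reject private (10/8, 192.168/16, ...) and reserved (127/8, 169.254/16, ...) ranges.
    $flags = FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE;
    foreach ($ips as $ip) {
        if (filter_var($ip, FILTER_VALIDATE_IP, $flags) === false) {
            return false;
        }
    }
    return true;
}
```

A check like this covers only part of the problem: if the later fetch resolves the host name again, it may connect to a different address than the one validated here.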
Table of Contents
Constants
- CONTENT_TAGS = ['article', 'main', '[role="main"]', '[id="mw-content-text"]']
- Tags that indicate main content areas (in priority order).
- FETCH_TIMEOUT = 15
- HTTP fetch timeout in seconds.
- MAX_RESPONSE_SIZE = 2 * 1024 * 1024
- Maximum response size in bytes (2 MB).
- NOISE_XPATHS = [ // Common reference/citation sections '//*[contains(@class,"reflist") or contains(@class,"references") or contains(@class,"refbegin")]', '//*[contains(@class,"navbox") or contains(@class,"sidebar") or contains(@class,"infobox")]', '//*[contains(@class,"catlinks") or contains(@class,"mw-jump-link")]', '//*[@id="toc" or @class="toc"]', '//nav[contains(@class,"toc")]', '//*[contains(@class,"noprint")]', // Common ad/social/cookie elements '//*[contains(@class,"share") or contains(@class,"social")]', '//*[contains(@class,"cookie") or contains(@class,"banner")]', '//*[contains(@class,"related-") or contains(@class,"recommended")]', '//*[contains(@class,"comment") and not(contains(@class,"content"))]', // Wikipedia-specific '//*[contains(@class,"mw-editsection")]', '//*[contains(@class,"sistersitebox")]', '//*[contains(@class,"authority-control")]', ]
- XPath selectors for noisy elements to remove (class/id-based).
- STRIP_TAGS = ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript', 'iframe', 'svg', 'figure', 'figcaption']
- Tags to strip before extracting text.
Methods
- extractFromUrl() : array{title: string, text: string, sourceUri: string}|array{error: string}
- Extract title and text content from a URL.
- stripGutenbergBoilerplatePublic() : string
- Strip Gutenberg boilerplate and clean up text (public API).
- cleanText() : string
- Clean extracted text: normalize whitespace and line breaks.
- detectCharset() : string|null
- Detect charset from HTML meta tags.
- extractBodyText() : string
- Extract the main body text from the HTML document.
- extractTitle() : string
- Extract title from HTML document.
- fetchPage() : string|null
- Fetch page content from URL.
- isPlainText() : bool
- Check if content is plain text (no HTML tags).
- looksLikeBinary() : bool
- Check if content looks like binary (not HTML/text).
- normalizeCharset() : string
- Detect charset and convert to UTF-8.
- stripGutenbergBoilerplate() : string
- Strip Project Gutenberg header and footer boilerplate.
- titleFromUrl() : string
- Derive a title from a URL's path.
- unwrapHardLineBreaks() : string
- Unwrap hard line breaks typical of plain text files (e.g. ~72-char wraps).
- findLargestTextBlock() : DOMNode|null
- Find the child element with the most text content.
- getTextFromNodeList() : string
- Get cleaned text from a node list.
- parseHtml() : DOMDocument|null
- Parse HTML string into DOMDocument.
- queryNodes() : DOMNodeList|null
- Query nodes using either tag name or CSS-like selector.
- stripByXPath() : void
- Remove elements matching XPath selectors.
- stripElements() : void
- Remove specified element types from the DOM.
Constants
CONTENT_TAGS
Tags that indicate main content areas (in priority order).
private
array<int, string>
CONTENT_TAGS
= ['article', 'main', '[role="main"]', '[id="mw-content-text"]']
FETCH_TIMEOUT
HTTP fetch timeout in seconds.
private
int
FETCH_TIMEOUT
= 15
MAX_RESPONSE_SIZE
Maximum response size in bytes (2 MB).
private
int
MAX_RESPONSE_SIZE
= 2 * 1024 * 1024
NOISE_XPATHS
XPath selectors for noisy elements to remove (class/id-based).
private
array<int, string>
NOISE_XPATHS
= [
// Common reference/citation sections
'//*[contains(@class,"reflist") or contains(@class,"references") or contains(@class,"refbegin")]',
'//*[contains(@class,"navbox") or contains(@class,"sidebar") or contains(@class,"infobox")]',
'//*[contains(@class,"catlinks") or contains(@class,"mw-jump-link")]',
'//*[@id="toc" or @class="toc"]',
'//nav[contains(@class,"toc")]',
'//*[contains(@class,"noprint")]',
// Common ad/social/cookie elements
'//*[contains(@class,"share") or contains(@class,"social")]',
'//*[contains(@class,"cookie") or contains(@class,"banner")]',
'//*[contains(@class,"related-") or contains(@class,"recommended")]',
'//*[contains(@class,"comment") and not(contains(@class,"content"))]',
// Wikipedia-specific
'//*[contains(@class,"mw-editsection")]',
'//*[contains(@class,"sistersitebox")]',
'//*[contains(@class,"authority-control")]',
]
STRIP_TAGS
Tags to strip before extracting text.
private
array<int, string>
STRIP_TAGS
= ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript', 'iframe', 'svg', 'figure', 'figcaption']
Methods
extractFromUrl()
Extract title and text content from a URL.
public
extractFromUrl(string $url[, string $titleHint = '' ]) : array{title: string, text: string, sourceUri: string}|array{error: string}
Parameters
- $url : string
-
The URL to fetch and extract from
- $titleHint : string = ''
-
Optional pre-filled title (used as fallback for plain text files)
Return values
array{title: string, text: string, sourceUri: string}|array{error: string}
stripGutenbergBoilerplatePublic()
Strip Gutenberg boilerplate and clean up text (public API).
public
stripGutenbergBoilerplatePublic(string $text) : string
Parameters
- $text : string
-
Raw Gutenberg text
Return values
string —Cleaned text
cleanText()
Clean extracted text: normalize whitespace and line breaks.
protected
cleanText(string $text) : string
Parameters
- $text : string
-
Raw text content
Return values
string —Cleaned text
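The exact rules `cleanText()` applies are not listed here. A minimal sketch of whitespace and line-break normalization, under the assumption that runs of spaces collapse to one and that at most one blank line separates paragraphs, could look like this (the function name is hypothetical):

```php
<?php
// Hypothetical sketch; the real cleanText() may apply different rules.
function normalizeWhitespaceSketch(string $text): string
{
    // Unify Windows and old Mac line endings.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;
    // Drop spaces that hug a line break.
    $text = preg_replace('/ *\n */', "\n", $text) ?? $text;
    // Allow at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    return trim($text);
}
```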
detectCharset()
Detect charset from HTML meta tags.
protected
detectCharset(string $html) : string|null
Parameters
- $html : string
-
HTML content
Return values
string|null —Detected charset or null
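Detection from meta tags can be approximated with a single regular expression that covers both the HTML5 `<meta charset>` form and the older `http-equiv` form. The sketch below is an assumption about how such detection might work, not the method's actual code; the 4096-byte window and the function name are illustrative.

```php
<?php
// Hypothetical sketch of meta-tag charset detection.
function detectCharsetSketch(string $html): ?string
{
    // Only inspect the start of the document, where <meta> tags normally live.
    $head = substr($html, 0, 4096);

    // Matches <meta charset="utf-8"> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, $head, $m) === 1) {
        return strtolower($m[1]);
    }
    return null;
}
```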
extractBodyText()
Extract the main body text from the HTML document.
protected
extractBodyText(DOMDocument $dom) : string
Strategy: strip noise elements, then try to find the main content area.
Parameters
- $dom : DOMDocument
-
Parsed HTML document
Return values
string —Extracted text content
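The strategy described above (strip noise, then look for a main content container) can be sketched as follows. This is an illustration under stated assumptions: it strips only a subset of STRIP_TAGS, tries three of the CONTENT_TAGS selectors in order, and falls back to `<body>`, which is a guess rather than documented behavior. The function name is hypothetical and the code needs PHP 8.0 or later.

```php
<?php
// Hypothetical sketch; takes raw HTML for brevity instead of a DOMDocument.
function extractMainTextSketch(string $html): string
{
    libxml_use_internal_errors(true); // HTML5 tags trigger libxml warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    // 1. Strip noise elements (a small subset of STRIP_TAGS).
    foreach (['script', 'style', 'nav', 'footer'] as $tag) {
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            $node->parentNode?->removeChild($node);
        }
    }

    // 2. Try content containers in priority order; <body> is an assumed fallback.
    $xpath = new DOMXPath($dom);
    foreach (['//article', '//main', '//*[@role="main"]', '//body'] as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }
    return '';
}
```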
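extractTitle()
Extract title from HTML document.
protected
extractTitle(DOMDocument $dom) : string
Tries og:title first, then the <title> tag.
Parameters
- $dom : DOMDocument
-
Parsed HTML document
Return values
string —Extracted title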
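The og:title-then-`<title>` lookup order can be expressed with DOMXPath. The sketch below assumes the Open Graph tag uses the `property` attribute; the function name is hypothetical.

```php
<?php
// Hypothetical sketch of the lookup order: og:title first, then <title>.
function extractTitleSketch(DOMDocument $dom): string
{
    $xpath = new DOMXPath($dom);

    // Prefer the Open Graph title when present and non-empty.
    $og = $xpath->query('//meta[@property="og:title"]/@content');
    if ($og !== false && $og->length > 0) {
        $title = trim((string) $og->item(0)->nodeValue);
        if ($title !== '') {
            return $title;
        }
    }

    // Otherwise fall back to the <title> element.
    $titles = $dom->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : '';
}
```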
fetchPage()
Fetch page content from URL.
protected
fetchPage(string $url) : string|null
Parameters
- $url : string
-
URL to fetch
Return values
string|null —HTML content or null on failure
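How `fetchPage()` performs the request is not documented. One way to combine a timeout with a response size cap (compare FETCH_TIMEOUT and MAX_RESPONSE_SIZE) is a stream context plus a bounded read, sketched below. Using `file_get_contents()` and returning null for oversized responses are both assumptions; the real method may use cURL and may truncate instead.

```php
<?php
// Hypothetical sketch: bounded fetch via PHP streams.
function fetchCappedSketch(string $url, int $timeout = 15, int $maxBytes = 2097152): ?string
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeout], // seconds; applies to http(s) URLs
    ]);

    // Read one byte past the cap so oversized responses can be detected.
    $body = @file_get_contents($url, false, $context, 0, $maxBytes + 1);
    if ($body === false || strlen($body) > $maxBytes) {
        return null;
    }
    return $body;
}
```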
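isPlainText()
Check if content is plain text (no HTML tags).
protected
isPlainText(string $content) : bool
Parameters
- $content : string
-
Content to check
Return values
bool —True if content appears to be plain text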
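A simple way to test for "no HTML tags" is to search for anything shaped like an opening or closing tag. The sketch below is one possible heuristic, not necessarily the one this method uses; the function name is hypothetical.

```php
<?php
// Hypothetical sketch: content counts as plain text when no tag-like token is found.
function isPlainTextSketch(string $content): bool
{
    // "<" optionally followed by "/", then a tag name, then anything up to ">".
    return preg_match('/<\/?[a-z][a-z0-9]*\b[^>]*>/i', $content) !== 1;
}
```

A bare comparison such as `2 < 3` does not count as a tag, because the `<` must be followed directly by a letter or a slash.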
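looksLikeBinary()
Check if content looks like binary (not HTML/text).
protected
looksLikeBinary(string $content) : bool
Parameters
- $content : string
-
Content to check
Return values
bool —True if content appears to be binary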
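Binary detection is usually a heuristic. The sketch below flags content whose first bytes contain a NUL byte or a high share of control characters. The sample size, the 10% threshold, and the function name are assumptions chosen for illustration. The code needs PHP 8.0 or later for `str_contains()`.

```php
<?php
// Hypothetical sketch of a binary-content heuristic.
function looksLikeBinarySketch(string $content, int $sampleSize = 1024): bool
{
    $sample = substr($content, 0, $sampleSize);
    if ($sample === '') {
        return false;
    }
    // NUL bytes do not occur in UTF-8 or single-byte HTML/text.
    if (str_contains($sample, "\0")) {
        return true;
    }
    // Count control characters other than tab, LF, VT, FF and CR.
    $control = (int) preg_match_all('/[\x01-\x08\x0E-\x1F]/', $sample);
    return $control / strlen($sample) > 0.1;
}
```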
normalizeCharset()
Detect charset and convert to UTF-8.
protected
normalizeCharset(string $html) : string
Parameters
- $html : string
-
HTML content
Return values
string —UTF-8 encoded HTML
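Conversion to UTF-8 can be sketched with the mbstring extension. In the example below the declared charset is passed in (for instance the result of a detection step). Falling back to ISO-8859-1 for undeclared, non-UTF-8 input is an assumption, and the function name is hypothetical. Requires PHP 8.0 or later and ext-mbstring.

```php
<?php
// Hypothetical sketch: convert HTML to UTF-8 given an optional declared charset.
function toUtf8Sketch(string $html, ?string $declaredCharset): string
{
    if ($declaredCharset === null) {
        // No declaration: keep valid UTF-8, otherwise assume ISO-8859-1.
        return mb_check_encoding($html, 'UTF-8')
            ? $html
            : mb_convert_encoding($html, 'UTF-8', 'ISO-8859-1');
    }
    if (strcasecmp($declaredCharset, 'UTF-8') === 0) {
        return $html;
    }
    try {
        return mb_convert_encoding($html, 'UTF-8', $declaredCharset);
    } catch (ValueError) {
        return $html; // unknown charset label: leave the input untouched
    }
}
```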
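stripGutenbergBoilerplate()
Strip Project Gutenberg header and footer boilerplate.
protected
stripGutenbergBoilerplate(string $text) : string
Gutenberg plain text files have a preamble ending with
"*** START OF THE PROJECT GUTENBERG EBOOK ..." and a footer
starting with "*** END OF THE PROJECT GUTENBERG EBOOK ...".
Parameters
- $text : string
-
Raw Gutenberg text
Return values
string —Text with boilerplate removed (unchanged if no markers found)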
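The marker-based stripping described above can be sketched with two anchored regular expressions. The patterns below match only the marker wording quoted in this documentation; whether the real method also accepts other variants is not stated. The function name is hypothetical.

```php
<?php
// Hypothetical sketch of marker-based boilerplate removal.
function stripGutenbergSketch(string $text): string
{
    $found = false;

    // Drop everything up to and including the START marker line.
    $start = '/^\*\*\* ?START OF THE PROJECT GUTENBERG EBOOK.*$/mi';
    if (preg_match($start, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        $text = substr($text, $m[0][1] + strlen($m[0][0]));
        $found = true;
    }

    // Drop the END marker line and everything after it.
    $end = '/^\*\*\* ?END OF THE PROJECT GUTENBERG EBOOK/mi';
    if (preg_match($end, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        $text = substr($text, 0, $m[0][1]);
        $found = true;
    }

    // Leave the input untouched when no marker was present.
    return $found ? trim($text) : $text;
}
```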
titleFromUrl()
Derive a title from a URL's path.
protected
titleFromUrl(string $url) : string
Parameters
- $url : string
-
The source URL
Return values
string —A human-readable title
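One plausible way to turn a URL path into a readable title is to take the last path segment, drop the file extension, and replace separators with spaces. The capitalization and the host-name fallback below are assumptions, and the function name is hypothetical.

```php
<?php
// Hypothetical sketch: derive a title from the last path segment of a URL.
function titleFromUrlSketch(string $url): string
{
    $path = (string) parse_url($url, PHP_URL_PATH);

    // Last path segment, URL-decoded, without its file extension.
    $name = pathinfo(rawurldecode(basename($path)), PATHINFO_FILENAME);

    // Turn common separators into spaces and capitalize each word.
    $title = ucwords(trim(preg_replace('/[-_+]+/', ' ', $name) ?? $name));

    // Assumed fallback: use the host name when the path yields nothing.
    return $title !== '' ? $title : (string) parse_url($url, PHP_URL_HOST);
}
```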
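unwrapHardLineBreaks()
Unwrap hard line breaks typical of plain text files (e.g. ~72-char wraps).
protected
unwrapHardLineBreaks(string $text) : string
Joins consecutive non-blank lines into paragraphs. Blank lines are
treated as paragraph separators.
Parameters
- $text : string
-
Text with hard line breaks
Return values
string —Text with natural paragraphs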
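The unwrapping rule (join consecutive non-blank lines, keep blank lines as paragraph separators) can be sketched by splitting on blank lines first and then joining within each paragraph. The function name is hypothetical, and treating whitespace-only lines as blank is an assumption.

```php
<?php
// Hypothetical sketch of hard-wrap removal.
function unwrapSketch(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // One or more blank (or whitespace-only) lines separate paragraphs.
    $paragraphs = preg_split('/\n(?:[ \t]*\n)+/', trim($text)) ?: [];

    // Inside a paragraph, every line break becomes a single space.
    $joined = array_map(
        static fn (string $p): string => trim(preg_replace('/\s*\n\s*/', ' ', $p) ?? $p),
        $paragraphs
    );

    $nonEmpty = array_filter($joined, static fn (string $p): bool => $p !== '');
    return implode("\n\n", $nonEmpty);
}
```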
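findLargestTextBlock()
Find the child element with the most text content.
private
findLargestTextBlock(DOMNode $parent) : DOMNode|null
Parameters
- $parent : DOMNode
-
Parent node to search within
Return values
DOMNode|null —The node with the most text, or null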
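Picking the child with the most text is a linear scan over the parent's child elements. The sketch below compares the byte length of each child's trimmed text content; whether the real method counts bytes or characters, or considers only direct children, is not documented. The function name is hypothetical.

```php
<?php
// Hypothetical sketch: return the direct child element with the longest text.
function largestTextChildSketch(DOMNode $parent): ?DOMNode
{
    $best = null;
    $bestLength = 0;

    foreach ($parent->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // skip text and comment nodes
        }
        $length = strlen(trim($child->textContent));
        if ($length > $bestLength) {
            $best = $child;
            $bestLength = $length;
        }
    }
    return $best; // null when no child element contains text
}
```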
getTextFromNodeList()
Get cleaned text from a node list.
private
getTextFromNodeList(DOMNodeList $nodes) : string
Parameters
- $nodes : DOMNodeList
-
Node list
Return values
string —Combined text
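parseHtml()
Parse HTML string into DOMDocument.
private
parseHtml(string $html) : DOMDocument|null
Parameters
- $html : string
-
HTML content
Return values
DOMDocument|null —Parsed document or null on failure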
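Parsing real-world HTML with DOMDocument usually means silencing libxml warnings, since HTML5 elements and sloppy markup trigger many of them. The sketch below shows that pattern. Returning null for empty input and passing LIBXML_NONET are assumptions about reasonable behavior, not documented facts; the function name is hypothetical.

```php
<?php
// Hypothetical sketch of tolerant HTML parsing.
function parseHtmlSketch(string $html): ?DOMDocument
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }

    // Collect parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $ok = $dom->loadHTML($html, LIBXML_NONET); // LIBXML_NONET: no network access while parsing

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $ok ? $dom : null;
}
```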
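queryNodes()
Query nodes using either tag name or CSS-like selector.
private
queryNodes(DOMXPath $xpath, string $selector) : DOMNodeList|null
Parameters
- $xpath : DOMXPath
-
XPath instance
- $selector : string
-
Tag name or attribute selector
Return values
DOMNodeList|null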
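The selectors in CONTENT_TAGS are either plain tag names (`article`) or attribute selectors (`[role="main"]`). A minimal sketch of translating both forms to XPath follows. The translation rules and both function names are hypothetical; the real method may support more or fewer selector shapes.

```php
<?php
// Hypothetical sketch: translate a tag name or [attr="value"] selector to XPath.
function selectorToXPathSketch(string $selector): string
{
    // Attribute selector of the form [name="value"].
    if (preg_match('/^\[([\w-]+)="([^"]*)"\]$/', $selector, $m) === 1) {
        return sprintf('//*[@%s="%s"]', $m[1], $m[2]);
    }
    // Anything else is treated as a plain tag name.
    return '//' . $selector;
}

function queryNodesSketch(DOMXPath $xpath, string $selector): ?DOMNodeList
{
    $nodes = $xpath->query(selectorToXPathSketch($selector));
    return $nodes === false ? null : $nodes;
}
```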
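stripByXPath()
Remove elements matching XPath selectors.
private
stripByXPath(DOMXPath $xpath, array<int, string> $selectors) : void
Parameters
- $xpath : DOMXPath
-
XPath instance
- $selectors : array<int, string>
-
XPath selectors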
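Removing every node matched by a list of XPath expressions (such as NOISE_XPATHS) can be sketched as below. The function name is hypothetical, and skipping invalid expressions silently is an assumption. Requires PHP 8.0 or later.

```php
<?php
// Hypothetical sketch: remove all nodes matched by each XPath expression.
function stripByXPathSketch(DOMXPath $xpath, array $selectors): void
{
    foreach ($selectors as $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes === false) {
            continue; // invalid expression
        }
        // Snapshot the result so removals cannot disturb the iteration.
        foreach (iterator_to_array($nodes) as $node) {
            $node->parentNode?->removeChild($node);
        }
    }
}
```

XPath `contains()` is a substring test, so a pattern such as `contains(@class,"banner")` also matches `class="page-banner-wrap"`. That keeps the selectors short but can remove more than intended.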
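stripElements()
Remove specified element types from the DOM.
private
stripElements(DOMDocument $dom, array<int, string> $tags) : void
Parameters
- $dom : DOMDocument
-
Document to modify
- $tags : array<int, string>
-
Tag names to remove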
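Removing elements by tag name (for example every entry in STRIP_TAGS) has one pitfall: the list returned by `getElementsByTagName()` is live and shrinks as nodes are removed. The sketch below walks the list backwards to avoid skipping entries. The function name is hypothetical. Requires PHP 8.0 or later.

```php
<?php
// Hypothetical sketch: remove every element with one of the given tag names.
function stripElementsSketch(DOMDocument $dom, array $tags): void
{
    foreach ($tags as $tag) {
        $nodes = $dom->getElementsByTagName($tag);
        // The list is live, so iterate from the end while removing.
        for ($i = $nodes->length - 1; $i >= 0; $i--) {
            $node = $nodes->item($i);
            $node?->parentNode?->removeChild($node);
        }
    }
}
```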