WebPageExtractor
Extracts readable text content from web pages.
Performs URL validation (SSRF protection), fetches HTML, detects charset, and extracts the main article text.
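The URL validation step is not documented in detail here. As a minimal sketch of what an SSRF check of this kind might look like, the hypothetical helper below (isPublicHttpUrl() is not part of this class) accepts only http and https URLs whose host resolves to public addresses. The use of filter_var() range flags is an assumption about the approach, not a description of the actual implementation.

```php
<?php
/**
 * Hypothetical helper (not part of WebPageExtractor): returns true only for
 * http(s) URLs whose host maps exclusively to public IP addresses.
 */
function isPublicHttpUrl(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // parse_url() keeps the brackets around IPv6 literals, e.g. "[::1]".
    $host = trim($parts['host'], '[]');

    // IP literals are checked directly; hostnames are resolved first.
    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        $ips = [$host];
    } else {
        $ips = gethostbynamel($host) ?: [];
    }
    if ($ips === []) {
        return false;
    }

    // Reject private (10/8, 192.168/16, ...) and reserved (127/8, ...) ranges.
    $flags = FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE;
    foreach ($ips as $ip) {
        if (filter_var($ip, FILTER_VALIDATE_IP, $flags) === false) {
            return false;
        }
    }
    return true;
}
```

A check like this only covers the address seen at validation time. A hostname can resolve differently when the page is actually fetched, so a production implementation would typically also pin the resolved address or re-validate after redirects.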
Table of Contents
Constants
- CONTENT_TAGS = ['article', 'main', '[role="main"]', '[id="mw-content-text"]']
- Tags that indicate main content areas (in priority order).
- FETCH_TIMEOUT = 15
- HTTP fetch timeout in seconds.
- MAX_RESPONSE_SIZE = 2 * 1024 * 1024
- Maximum response size in bytes (2 MB).
- NOISE_XPATHS = [
  // Common reference/citation sections
  '//*[contains(@class,"reflist") or contains(@class,"references") or contains(@class,"refbegin")]',
  '//*[contains(@class,"navbox") or contains(@class,"sidebar") or contains(@class,"infobox")]',
  '//*[contains(@class,"catlinks") or contains(@class,"mw-jump-link")]',
  '//*[@id="toc" or @class="toc"]',
  '//nav[contains(@class,"toc")]',
  '//*[contains(@class,"noprint")]',
  // Common ad/social/cookie elements
  '//*[contains(@class,"share") or contains(@class,"social")]',
  '//*[contains(@class,"cookie") or contains(@class,"banner")]',
  '//*[contains(@class,"related-") or contains(@class,"recommended")]',
  '//*[contains(@class,"comment") and not(contains(@class,"content"))]',
  // Wikipedia-specific
  '//*[contains(@class,"mw-editsection")]',
  '//*[contains(@class,"sistersitebox")]',
  '//*[contains(@class,"authority-control")]',
  ]
- XPath selectors for noisy elements to remove (class/id-based).
- STRIP_TAGS = ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript', 'iframe', 'svg', 'figure', 'figcaption']
- Tags to strip before extracting text.
Methods
- extractFromUrl() : array{title: string, text: string, sourceUri: string}|array{error: string}
- Extract title and text content from a URL.
- cleanText() : string
- Clean extracted text: normalize whitespace and line breaks.
- detectCharset() : string|null
- Detect charset from HTML meta tags.
- extractBodyText() : string
- Extract the main body text from the HTML document.
- extractTitle() : string
- Extract title from HTML document.
- fetchPage() : string|null
- Fetch page content from URL.
- findLargestTextBlock() : DOMNode|null
- Find the child element with the most text content.
- getTextFromNodeList() : string
- Get cleaned text from a node list.
- isPlainText() : bool
- Check if content is plain text (no HTML tags).
- looksLikeBinary() : bool
- Check if content looks like binary (not HTML/text).
- normalizeCharset() : string
- Detect charset and convert to UTF-8.
- parseHtml() : DOMDocument|null
- Parse HTML string into DOMDocument.
- queryNodes() : DOMNodeList|null
- Query nodes using either tag name or CSS-like selector.
- stripByXPath() : void
- Remove elements matching XPath selectors.
- stripElements() : void
- Remove specified element types from the DOM.
- stripGutenbergBoilerplate() : string
- Strip Project Gutenberg header and footer boilerplate.
- titleFromUrl() : string
- Derive a title from a URL's path.
- unwrapHardLineBreaks() : string
- Unwrap hard line breaks typical of plain text files (e.g. ~72-char wraps).
Constants
CONTENT_TAGS
Tags that indicate main content areas (in priority order).
private
array<int, string>
CONTENT_TAGS
= ['article', 'main', '[role="main"]', '[id="mw-content-text"]']
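The entries mix plain tag names with attribute selectors, so whatever consumes them has to translate both forms into something DOMXPath can evaluate. The sketch below shows one way that could work. Both function names are hypothetical, and the translation rule is an assumption based only on the selector shapes listed above.

```php
<?php
/**
 * Hypothetical helper: turn a tag name or an [attr="value"] selector
 * into an XPath expression.
 */
function selectorToXPath(string $selector): string
{
    if (preg_match('/^\[([\w:-]+)="([^"]*)"\]$/', $selector, $m) === 1) {
        return sprintf('//*[@%s="%s"]', $m[1], $m[2]);
    }
    return '//' . $selector;
}

/**
 * Hypothetical helper: return the first node matched by the selectors,
 * trying them in the order given (earlier entries win).
 */
function firstContentNode(DOMDocument $dom, array $selectors): ?DOMNode
{
    $xpath = new DOMXPath($dom);
    foreach ($selectors as $selector) {
        $nodes = $xpath->query(selectorToXPath($selector));
        if ($nodes !== false && $nodes->length > 0) {
            return $nodes->item(0);
        }
    }
    return null;
}
```

Because the list is tried in order, a page that contains both a main element and a MediaWiki content div would yield the main element under this sketch.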
FETCH_TIMEOUT
HTTP fetch timeout in seconds.
private
mixed
FETCH_TIMEOUT
= 15
MAX_RESPONSE_SIZE
Maximum response size in bytes (2 MB).
private
mixed
MAX_RESPONSE_SIZE
= 2 * 1024 * 1024
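How the timeout and size limit are enforced is not shown on this page. As a rough sketch under stated assumptions, the hypothetical fetchCapped() below applies both limits with PHP's stream functions: it reads one byte past the cap and rejects the response if that extra byte arrives. Whether the real fetchPage() rejects or truncates oversized responses, and whether it uses streams or cURL, is not stated in the source.

```php
<?php
/**
 * Hypothetical helper: fetch a resource with a timeout and a hard size cap.
 * Returns null on failure or when the body exceeds $maxBytes.
 */
function fetchCapped(
    string $url,
    int $timeoutSeconds = 15,
    int $maxBytes = 2 * 1024 * 1024
): ?string {
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // Read one byte past the cap so an oversized body can be detected.
    $body = @file_get_contents($url, false, $context, 0, $maxBytes + 1);

    if ($body === false || strlen($body) > $maxBytes) {
        return null;
    }
    return $body;
}
```

Reading at most the cap plus one byte keeps memory use bounded even when the remote server sends a much larger body.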
NOISE_XPATHS
XPath selectors for noisy elements to remove (class/id-based).
private
array<int, string>
NOISE_XPATHS
= [
// Common reference/citation sections
'//*[contains(@class,"reflist") or contains(@class,"references") or contains(@class,"refbegin")]',
'//*[contains(@class,"navbox") or contains(@class,"sidebar") or contains(@class,"infobox")]',
'//*[contains(@class,"catlinks") or contains(@class,"mw-jump-link")]',
'//*[@id="toc" or @class="toc"]',
'//nav[contains(@class,"toc")]',
'//*[contains(@class,"noprint")]',
// Common ad/social/cookie elements
'//*[contains(@class,"share") or contains(@class,"social")]',
'//*[contains(@class,"cookie") or contains(@class,"banner")]',
'//*[contains(@class,"related-") or contains(@class,"recommended")]',
'//*[contains(@class,"comment") and not(contains(@class,"content"))]',
// Wikipedia-specific
'//*[contains(@class,"mw-editsection")]',
'//*[contains(@class,"sistersitebox")]',
'//*[contains(@class,"authority-control")]',
]
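To illustrate how selectors like these could be applied, here is a minimal sketch that removes every matching node from a document. The function name removeByXPath() is hypothetical, and the real stripByXPath() may differ in its details.

```php
<?php
/**
 * Hypothetical helper: remove all nodes matching any of the XPath selectors.
 * Returns the number of nodes removed.
 */
function removeByXPath(DOMDocument $dom, array $selectors): int
{
    $xpath = new DOMXPath($dom);
    $removed = 0;

    foreach ($selectors as $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes === false) {
            continue; // Malformed expression: skip it.
        }
        // Copy the result to an array before mutating the tree.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
                $removed++;
            }
        }
    }
    return $removed;
}
```

Note that contains(@class, "...") is a substring match on the whole class attribute, so a selector for "share" would also match a class such as "shared-header". The "comment" selector above guards against one such false positive by excluding classes that contain "content".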
STRIP_TAGS
Tags to strip before extracting text.
private
array<int, string>
STRIP_TAGS
= ['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript', 'iframe', 'svg', 'figure', 'figcaption']
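Stripping by tag name can be sketched with the standard DOM extension as follows. The function name removeTags() is hypothetical, and this is an assumption about the general approach rather than the actual stripElements() code.

```php
<?php
/**
 * Hypothetical helper: remove every element with one of the given tag names.
 */
function removeTags(DOMDocument $dom, array $tags): void
{
    foreach ($tags as $tag) {
        // getElementsByTagName() returns a live list, so copy it first;
        // removing nodes while iterating the live list would skip elements.
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
}
```

Removing these elements before reading textContent keeps script source, navigation labels, and footer text out of the extracted result.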
Methods
extractFromUrl()
Extract title and text content from a URL.
public
extractFromUrl(string $url) : array{title: string, text: string, sourceUri: string}|array{error: string}
Parameters
- $url : string
-
The URL to fetch and extract from
Return values
array{title: string, text: string, sourceUri: string}|array{error: string}
cleanText()
Clean extracted text: normalize whitespace and line breaks.
private
cleanText(string $text) : string
Parameters
- $text : string
-
Raw text content
Return values
string —Cleaned text
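The exact normalization rules are not listed here. One plausible reading of "normalize whitespace and line breaks" is sketched below. The function name cleanTextSketch() is hypothetical, and each rule (unify line endings, collapse runs of spaces, cap blank lines at one) is an assumption.

```php
<?php
/**
 * Hypothetical sketch of whitespace normalization; the rules are assumed.
 */
function cleanTextSketch(string $text): string
{
    // Unify Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Drop spaces that hug a line break.
    $text = preg_replace('/ *\n */', "\n", $text);
    // Allow at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}
```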
detectCharset()
Detect charset from HTML meta tags.
private
detectCharset(string $html) : string|null
Parameters
- $html : string
-
HTML content
Return values
string|null —Detected charset or null
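As an illustration, charset detection from meta tags could be done with a single regular expression that covers both the HTML5 form and the older http-equiv form. The function name detectCharsetSketch() and the pattern are assumptions, not the class's actual code.

```php
<?php
/**
 * Hypothetical sketch: find a charset declaration in a <meta> tag.
 * Matches both <meta charset="..."> and
 * <meta http-equiv="Content-Type" content="text/html; charset=...">.
 */
function detectCharsetSketch(string $html): ?string
{
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?\s*([a-zA-Z0-9_\-]+)/i';
    if (preg_match($pattern, $html, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null;
}
```

A null result would leave the caller free to fall back on another source, such as the HTTP Content-Type header or a default encoding.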
extractBodyText()
Extract the main body text from the HTML document.
private
extractBodyText(DOMDocument $dom) : string
Strategy: strip noise elements, then try to find the main content area.
Parameters
- $dom : DOMDocument
-
Parsed HTML document
Return values
string
Extracted text content
extractTitle()
Extract title from HTML document.
private
extractTitle(DOMDocument $dom) : string
Tries og:title first, then <title> tag.
Parameters
- $dom : DOMDocument
-
Parsed HTML document
Return values
string
Extracted title
fetchPage()
Fetch page content from URL.
private
fetchPage(string $url) : string|null
Parameters
- $url : string
-
URL to fetch
Return values
string|null
HTML content or null on failure
findLargestTextBlock()
Find the child element with the most text content.
private
findLargestTextBlock(DOMNode $parent) : DOMNode|null
Parameters
- $parent : DOMNode
-
Parent node to search within
Return values
DOMNode|null
The node with the most text, or null
getTextFromNodeList()
Get cleaned text from a node list.
private
getTextFromNodeList(DOMNodeList $nodes) : string
Parameters
- $nodes : DOMNodeList
-
Node list
Return values
string
Combined text
isPlainText()
Check if content is plain text (no HTML tags).
private
isPlainText(string $content) : bool
Parameters
- $content : string
-
Content to check
Return values
bool
True if content appears to be plain text
looksLikeBinary()
Check if content looks like binary (not HTML/text).
private
looksLikeBinary(string $content) : bool
Parameters
- $content : string
-
Content to check
Return values
bool
True if content appears to be binary
normalizeCharset()
Detect charset and convert to UTF-8.
private
normalizeCharset(string $html) : string
Parameters
- $html : string
-
HTML content
Return values
string
UTF-8 encoded HTML
parseHtml()
Parse HTML string into DOMDocument.
private
parseHtml(string $html) : DOMDocument|null
Parameters
- $html : string
-
HTML content
Return values
DOMDocument|null
Parsed document or null on failure
queryNodes()
Query nodes using either tag name or CSS-like selector.
private
queryNodes(DOMXPath $xpath, string $selector) : DOMNodeList|null
Parameters
- $xpath : DOMXPath
-
XPath instance
- $selector : string
-
Tag name or attribute selector
Return values
DOMNodeList|null
stripByXPath()
Remove elements matching XPath selectors.
private
stripByXPath(DOMXPath $xpath, array<int, string> $selectors) : void
Parameters
- $xpath : DOMXPath
-
XPath instance
- $selectors : array<int, string>
-
XPath selectors
stripElements()
Remove specified element types from the DOM.
private
stripElements(DOMDocument $dom, array<int, string> $tags) : void
Parameters
- $dom : DOMDocument
-
Document to modify
- $tags : array<int, string>
-
Tag names to remove
stripGutenbergBoilerplate()
Strip Project Gutenberg header and footer boilerplate.
private
stripGutenbergBoilerplate(string $text) : string
Gutenberg plain text files have a preamble ending with
"*** START OF THE PROJECT GUTENBERG EBOOK ..." and a footer
starting with "*** END OF THE PROJECT GUTENBERG EBOOK ...".
Parameters
- $text : string
-
Raw Gutenberg text
Return values
string
Text with boilerplate removed (unchanged if no markers found)
titleFromUrl()
Derive a title from a URL's path.
private
titleFromUrl(string $url) : string
Parameters
- $url : string
-
The source URL
Return values
string
A human-readable title
unwrapHardLineBreaks()
Unwrap hard line breaks typical of plain text files (e.g. ~72-char wraps).
private
unwrapHardLineBreaks(string $text) : string
Joins consecutive non-blank lines into paragraphs. Blank lines are
treated as paragraph separators.
Parameters
- $text : string
-
Text with hard line breaks
Return values
string
Text with natural paragraphs
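The two plain-text steps, stripping Project Gutenberg boilerplate and unwrapping hard line breaks, can be sketched together as shown below. Both function names are hypothetical. The marker patterns follow the strings quoted in the documentation, while the handling of a file with only one of the two markers is an assumption.

```php
<?php
/**
 * Hypothetical sketch: drop everything up to the START marker line and
 * everything from the END marker line onward. Returns the input unchanged
 * when neither marker is present.
 */
function stripGutenbergSketch(string $text): string
{
    $start = '/^\*\*\* START OF THE PROJECT GUTENBERG EBOOK.*$/m';
    $end = '/^\*\*\* END OF THE PROJECT GUTENBERG EBOOK.*$/m';
    $changed = false;

    if (preg_match($start, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        // Keep only what follows the marker line.
        $text = substr($text, $m[0][1] + strlen($m[0][0]));
        $changed = true;
    }
    if (preg_match($end, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        // Keep only what precedes the marker line.
        $text = substr($text, 0, $m[0][1]);
        $changed = true;
    }
    return $changed ? trim($text) : $text;
}

/**
 * Hypothetical sketch: join consecutive non-blank lines with a space and
 * keep blank lines as paragraph separators.
 */
function unwrapSketch(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Split on one or more blank (or whitespace-only) lines.
    $paragraphs = preg_split('/\n(?:[ \t]*\n)+/', trim($text));
    $joined = array_map(
        fn (string $p): string => preg_replace('/[ \t]*\n[ \t]*/', ' ', trim($p)),
        $paragraphs
    );
    return implode("\n\n", $joined);
}
```

Running the strip step first matters in this sketch: the markers are matched as whole lines, and unwrapping beforehand would merge them into the surrounding paragraph.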