ArticleExtractor
Service for extracting text content from web articles.
Provides HTML content extraction using XPath selectors, charset detection, and content cleaning.
Table of Contents
Constants
- DEFAULT_FILTER_TAGS = '//img | //script | //meta | //noscript | //link | //iframe'
  Default filter tags to remove from extracted content.
Methods
- detectCharset() : string
  Detect charset from HTTP headers, meta tags, or content.
- extract() : array<int|string, array<string, mixed>>
  Extract text content from feed article data.
- fetchArticleContent() : string
  Fetch article content from URL with charset detection.
- mapWindowsCharset() : string
  Map Windows charset to UTF-8 locale equivalent.
- buildFilterTagsList() : array<string|int, string>
  Build filter tags list from string.
- cleanExtractedText() : string
  Clean extracted text.
- convertLineBreaks() : string
  Convert HTML line break tags to newlines.
- detectCharsetFromHeaders() : string|null
  Detect charset from HTTP headers.
- detectCharsetFromMeta() : string|null
  Detect charset from HTML meta tags.
- extractNewArticleHtml() : array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string}
  Extract full HTML for 'new' article mode.
- extractSingle() : array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null
  Extract content from a single article.
- extractWithXPath() : string
  Extract content using XPath selectors.
- formatErrorMessage() : string
  Format error message for failed extraction.
- handleRedirect() : string
  Handle redirect article section to find actual article URL.
- parseHtml() : DOMDocument
  Parse HTML into DOMDocument.
- prepareInlineHtml() : string
  Prepare inline HTML content.
- processInlineLink() : string
  Process inline link (handle # prefix for feed link references).
Constants
DEFAULT_FILTER_TAGS
Default filter tags to remove from extracted content.
private mixed DEFAULT_FILTER_TAGS = '//img | //script | //meta | //noscript | //link | //iframe'
Methods
detectCharset()
Detect charset from HTTP headers, meta tags, or content.
public detectCharset(string $url, string $htmlString[, string|null $override = null ]) : string
Parameters
- $url : string
  URL being fetched
- $htmlString : string
  HTML content
- $override : string|null = null
  Override charset
Return values
string —Detected charset
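The summary above names three sources (HTTP headers, meta tags, content) plus an override. A minimal sketch of one possible precedence follows. The function name `sketchDetectCharset`, its parameter list, the ordering, and the ISO-8859-1 fallback are all assumptions made for illustration; the real method takes a URL and may behave differently.

```php
<?php
// Hypothetical sketch, not the class's actual code.
// Assumed precedence: override, then HTTP header, then <meta> tag, then the bytes.
function sketchDetectCharset(
    string $htmlString,
    ?string $headerCharset = null,
    ?string $metaCharset = null,
    ?string $override = null
): string {
    foreach ([$override, $headerCharset, $metaCharset] as $candidate) {
        if ($candidate !== null && $candidate !== '') {
            return strtoupper($candidate);
        }
    }
    // Content-based guess: an empty pattern with the /u flag only matches
    // when the subject is valid UTF-8. The fallback value is an assumption.
    return preg_match('//u', $htmlString) === 1 ? 'UTF-8' : 'ISO-8859-1';
}
```

Passing the header and meta results in as arguments keeps the sketch free of network access, so it can be exercised in isolation.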
extract()
Extract text content from feed article data.
public extract(array<int|string, array{link: string, title: string, audio?: string, text?: string}> $feedData, string $articleSection[, string $filterTags = '' ][, string|null $charset = null ]) : array<int|string, array<string, mixed>>
Handles various scenarios:
- Inline text from feed (description, content, encoded)
- Fetching full article from webpage
- Redirect handling for intermediate pages
- Charset detection and conversion
- XPath-based content extraction
Parameters
- $feedData : array<int|string, array{link: string, title: string, audio?: string, text?: string}>
  Array of feed items with link, title, etc.
- $articleSection : string
  XPath selector(s) for article content
- $filterTags : string = ''
  XPath selector(s) for elements to remove
- $charset : string|null = null
  Override charset (null for auto-detect)
Return values
array<int|string, array<string, mixed>> —Extracted text data with 'error' key for failed extractions
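To make the documented input and output shapes concrete, here is a rough sketch of a per-item loop that records failures under an 'error' key. The function `sketchExtractAll`, the callback parameter, and the exact layout of the 'error' entry are assumptions; the source only states that failed extractions appear under an 'error' key.

```php
<?php
// Hypothetical sketch of the result shape, not the class's actual code.
// $extractOne stands in for the per-article step and returns null on failure.
function sketchExtractAll(array $feedData, callable $extractOne): array
{
    $results = [];
    foreach ($feedData as $key => $item) {
        $single = $extractOne($item);
        if ($single === null) {
            // Assumption: one message per failed item, grouped under 'error'.
            $results['error'][$key] = 'Could not extract: ' . $item['link'];
            continue;
        }
        $results[$key] = $single;
    }
    return $results;
}
```

A caller would then check for the 'error' key before treating every entry as an extracted article.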
fetchArticleContent()
Fetch article content from URL with charset detection.
public fetchArticleContent(string $url[, string|null $charset = null ]) : string
Parameters
- $url : string
  Article URL
- $charset : string|null = null
  Override charset (null for auto-detect)
Return values
string —HTML content
mapWindowsCharset()
Map Windows charset to UTF-8 locale equivalent.
public mapWindowsCharset(string $charset) : string
Parameters
- $charset : string
  Input charset
Return values
string —Mapped charset
buildFilterTagsList()
Build filter tags list from string.
private buildFilterTagsList(string $filterTags) : array<string|int, string>
Parameters
- $filterTags : string
  Additional filter tags
Return values
array<string|int, string> —Filter tags array
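One plausible reading is that the additional selectors are merged with DEFAULT_FILTER_TAGS and split on the XPath union operator. The sketch below illustrates that reading only; `SKETCH_DEFAULT_FILTER_TAGS` and `sketchBuildFilterTagsList` are hypothetical names, and the merge order and de-duplication are assumptions.

```php
<?php
// Hypothetical sketch, not the class's actual code.
const SKETCH_DEFAULT_FILTER_TAGS = '//img | //script | //meta | //noscript | //link | //iframe';

function sketchBuildFilterTagsList(string $filterTags): array
{
    $combined = SKETCH_DEFAULT_FILTER_TAGS;
    if (trim($filterTags) !== '') {
        $combined .= ' | ' . $filterTags;
    }
    // Naive split on the union operator; a selector whose predicate
    // contains a literal "|" would be cut in two by this approach.
    $parts = array_map('trim', explode('|', $combined));
    $parts = array_filter($parts, fn (string $p): bool => $p !== '');
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($parts));
}
```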
cleanExtractedText()
Clean extracted text.
private cleanExtractedText(string $text) : string
Parameters
- $text : string
  Raw extracted text
Return values
string —Cleaned text
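The source does not say which cleaning steps are applied. As an illustration only, a typical cleanup decodes entities, normalizes line endings, and collapses runs of whitespace. Every step in `sketchCleanExtractedText` (a hypothetical name) is an assumption.

```php
<?php
// Hypothetical sketch, not the class's actual code.
function sketchCleanExtractedText(string $text): string
{
    // Decode entities such as &amp; into their characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Normalize Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces, tabs, and non-breaking spaces into one space.
    // preg_replace returns null on invalid UTF-8, so keep the input then.
    $text = preg_replace('/[ \t\x{00A0}]+/u', ' ', $text) ?? $text;
    // Strip a single space on either side of a newline.
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;
    // Allow at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;
    return trim($text);
}
```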
convertLineBreaks()
Convert HTML line break tags to newlines.
private convertLineBreaks(string $html) : string
Parameters
- $html : string
  HTML content
Return values
string —Converted content
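A minimal sketch of such a conversion, assuming a regular expression is acceptable for this narrow case. The name `sketchConvertLineBreaks` is hypothetical, and the actual method may match a different set of tag spellings.

```php
<?php
// Hypothetical sketch, not the class's actual code.
function sketchConvertLineBreaks(string $html): string
{
    // Matches <br>, <br/>, <br /> and <br class="x"> in any letter case.
    // The \b keeps longer tag names that merely start with "br" untouched.
    return preg_replace('/<br\b[^>]*>/i', "\n", $html) ?? $html;
}
```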
detectCharsetFromHeaders()
Detect charset from HTTP headers.
private detectCharsetFromHeaders(string $url) : string|null
Parameters
- $url : string
  URL to check
Return values
string|null —Charset or null if not found
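The documented method takes a URL. The sketch below covers only the parsing half and assumes the header lines have already been fetched, for example with PHP's `get_headers()` in its default indexed mode. The function name and the choice to let the last Content-Type header win are assumptions.

```php
<?php
// Hypothetical sketch, not the class's actual code.
function sketchCharsetFromHeaderLines(array $headerLines): ?string
{
    $found = null;
    foreach ($headerLines as $line) {
        if (!is_string($line)) {
            continue;
        }
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)/i', $line, $m) === 1) {
            // Assumption: with redirects, the final response's header wins.
            $found = $m[1];
        }
    }
    return $found;
}
```

With a live URL the input could come from `get_headers($url) ?: []`, since `get_headers()` returns false on failure.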
detectCharsetFromMeta()
Detect charset from HTML meta tags.
private detectCharsetFromMeta(string $htmlString) : string|null
Parameters
- $htmlString : string
  HTML content
Return values
string|null —Charset or null if not found
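As a sketch of what meta-tag detection can look like, the pattern below accepts both the short `<meta charset="...">` form and the older `http-equiv` form. The name `sketchCharsetFromMeta` and the 4096-byte scan window are assumptions, not details taken from the class.

```php
<?php
// Hypothetical sketch, not the class's actual code.
function sketchCharsetFromMeta(string $htmlString): ?string
{
    // Charset declarations sit near the top, so only scan the beginning.
    // The 4096-byte window is an arbitrary choice for this sketch.
    $head = substr($htmlString, 0, 4096);
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)/i';
    if (preg_match($pattern, $head, $m) === 1) {
        return $m[1];
    }
    return null;
}
```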
extractNewArticleHtml()
Extract full HTML for 'new' article mode.
private extractNewArticleHtml(DOMDocument $dom, array<string|int, string> $filterTags) : array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string}
Parameters
- $dom : DOMDocument
  DOM document
- $filterTags : array<string|int, string>
  Tags to filter out
Return values
array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string} —Result with TxText containing cleaned HTML
extractSingle()
Extract content from a single article.
private extractSingle(array{link: string, title: string, audio?: string, text?: string} $item, string $articleSection, string $filterTags, string|null $charset) : array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null
Parameters
- $item : array{link: string, title: string, audio?: string, text?: string}
  Feed item data
- $articleSection : string
  XPath selector(s)
- $filterTags : string
  Filter selectors
- $charset : string|null
  Override charset
Return values
array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null —Extracted data or null on failure
extractWithXPath()
Extract content using XPath selectors.
private extractWithXPath(DOMDocument $dom, string $articleSection, array<string|int, string> $filterTags, bool $isInlineText) : string
Parameters
- $dom : DOMDocument
  DOM document
- $articleSection : string
  Article selectors
- $filterTags : array<string|int, string>
  Filter selectors
- $isInlineText : bool
  Whether source is inline text
Return values
string —Extracted text
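The general technique of filtering and then selecting with XPath can be sketched with PHP's `DOMXPath`. This is an illustration under assumptions: `sketchExtractWithXPath` is a hypothetical name, the `$isInlineText` flag is left out because its effect is not documented here, and joining matches with newlines is a guess.

```php
<?php
// Hypothetical sketch, not the class's actual code.
function sketchExtractWithXPath(DOMDocument $dom, string $articleSection, array $filterTags): string
{
    $xpath = new DOMXPath($dom);

    // Step 1: remove every node matched by a filter selector.
    foreach ($filterTags as $filter) {
        $nodes = $xpath->query($filter);
        if ($nodes === false) {
            continue; // invalid expression, skip it
        }
        // Defensive copy so removals cannot disturb the iteration.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Step 2: collect the text of every node matched by the article selector.
    $parts = [];
    $matches = $xpath->query($articleSection);
    if ($matches !== false) {
        foreach ($matches as $node) {
            $parts[] = trim($node->textContent);
        }
    }
    $parts = array_filter($parts, fn (string $p): bool => $p !== '');
    return implode("\n", $parts);
}
```

Filtering before selecting means unwanted descendants such as scripts never reach the extracted text.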
formatErrorMessage()
Format error message for failed extraction.
private formatErrorMessage(array{link: string, title: string, audio?: string, text?: string} $item) : string
Parameters
- $item : array{link: string, title: string, audio?: string, text?: string}
  Feed item
Return values
string —Error message HTML
handleRedirect()
Handle redirect article section to find actual article URL.
private handleRedirect(string $link, string $articleSection, string &$newSection) : string
Parameters
- $link : string
  Original link
- $articleSection : string
  Full article section string
- $newSection : string
  Output: updated article section
Return values
string —Updated link
parseHtml()
Parse HTML into DOMDocument.
private parseHtml(string $htmlString) : DOMDocument
Parameters
- $htmlString : string
  HTML content
Return values
DOMDocument —Parsed document
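Real-world pages are rarely well-formed, so a parser wrapper usually silences libxml warnings. The sketch below shows that pattern with `DOMDocument::loadHTML()`. The name `sketchParseHtml` and the empty-input guard are assumptions, and charset handling is deliberately left out.

```php
<?php
// Hypothetical sketch, not the class's actual code.
function sketchParseHtml(string $htmlString): DOMDocument
{
    $dom = new DOMDocument();
    if (trim($htmlString) === '') {
        // loadHTML() rejects an empty string, so return an empty document.
        return $dom;
    }
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($htmlString);
    libxml_clear_errors();
    // Restore whatever error mode the caller had.
    libxml_use_internal_errors($previous);
    return $dom;
}
```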
prepareInlineHtml()
Prepare inline HTML content.
private prepareInlineHtml(string $text) : string
Parameters
- $text : string
  Inline text from feed
Return values
string —Prepared HTML
processInlineLink()
Process inline link (handle # prefix for feed link references).
private processInlineLink(string $link) : string
Parameters
- $link : string
  Original link
Return values
string —Processed link