Documentation

ArticleExtractor

Service for extracting text content from web articles.

Provides HTML content extraction using XPath selectors, charset detection, and content cleaning.

Tags
since
3.0.0

Table of Contents

Constants

DEFAULT_FILTER_TAGS  = '//img | //script | //meta | //noscript | //link | //iframe'
Default filter tags to remove from extracted content.

Methods

detectCharset()  : string
Detect charset from HTTP headers, meta tags, or content.
extract()  : array<int|string, array<string, mixed>>
Extract text content from feed article data.
fetchArticleContent()  : string
Fetch article content from URL with charset detection.
mapWindowsCharset()  : string
Map Windows charset to UTF-8 locale equivalent.
buildFilterTagsList()  : array<string|int, string>
Build filter tags list from string.
cleanExtractedText()  : string
Clean extracted text.
convertLineBreaks()  : string
Convert HTML line break tags to newlines.
detectCharsetFromHeaders()  : string|null
Detect charset from HTTP headers.
detectCharsetFromMeta()  : string|null
Detect charset from HTML meta tags.
extractNewArticleHtml()  : array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string}
Extract full HTML for 'new' article mode.
extractSingle()  : array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null
Extract content from a single article.
extractWithXPath()  : string
Extract content using XPath selectors.
formatErrorMessage()  : string
Format error message for failed extraction.
handleRedirect()  : string
Handle redirect article section to find actual article URL.
parseHtml()  : DOMDocument
Parse HTML into DOMDocument.
prepareInlineHtml()  : string
Prepare inline HTML content.
processInlineLink()  : string
Process inline link (handle # prefix for feed link references).

Constants

DEFAULT_FILTER_TAGS

Default filter tags to remove from extracted content.

private mixed DEFAULT_FILTER_TAGS = '//img | //script | //meta | //noscript | //link | //iframe'

Methods

detectCharset()

Detect charset from HTTP headers, meta tags, or content.

public detectCharset(string $url, string $htmlString[, string|null $override = null ]) : string
Parameters
$url : string

URL being fetched

$htmlString : string

HTML content

$override : string|null = null

Override charset

Return values
string

Detected charset
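
The lookup order suggested by the summary (override, then HTTP headers, then meta tags, then the content itself) can be sketched as follows. This is only an illustration under stated assumptions: `sketchDetectCharset()` is a hypothetical stand-in, it takes a Content-Type header value instead of a URL so that it runs without network access, and the precedence and the `UTF-8` fallback are assumed rather than taken from the class.

```php
<?php
// Hypothetical stand-in for detectCharset(); not the class's actual code.
// Assumption: the Content-Type value is passed in directly instead of a URL.
function sketchDetectCharset(?string $contentType, string $htmlString, ?string $override = null): string
{
    // 1. An explicit override wins (assumed precedence).
    if ($override !== null && $override !== '') {
        return $override;
    }
    // 2. charset parameter of the Content-Type header, e.g. "text/html; charset=ISO-8859-2".
    if ($contentType !== null && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // 3. <meta charset="..."> or <meta http-equiv="Content-Type" content="...; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $htmlString, $m)) {
        return $m[1];
    }
    // 4. Guess from the bytes; fall back to UTF-8 (assumed default).
    $guess = mb_detect_encoding($htmlString, ['UTF-8', 'ISO-8859-1'], true);
    return $guess !== false ? $guess : 'UTF-8';
}
```

Passing a non-null override skips every detection step, which matches the "Override charset" description of `$override`.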

extract()

Extract text content from feed article data.

public extract(array<int|string, array{link: string, title: string, audio?: string, text?: string}> $feedData, string $articleSection[, string $filterTags = '' ][, string|null $charset = null ]) : array<int|string, array<string, mixed>>

Handles various scenarios:

  • Inline text from feed (description, content, encoded)
  • Fetching full article from webpage
  • Redirect handling for intermediate pages
  • Charset detection and conversion
  • XPath-based content extraction
Parameters
$feedData : array<int|string, array{link: string, title: string, audio?: string, text?: string}>

Array of feed items with link, title, etc.

$articleSection : string

XPath selector(s) for article content

$filterTags : string = ''

XPath selector(s) for elements to remove

$charset : string|null = null

Override charset (null for auto-detect)

Return values
array<int|string, array<string, mixed>>

Extracted text data with 'error' key for failed extractions
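
To make the documented `$feedData` shape concrete, the sketch below builds two feed items and checks them against it. The URLs are placeholders and `isValidFeedItem()` is a hypothetical helper, not part of the class.

```php
<?php
// Feed items in the documented shape: 'link' and 'title' are required,
// 'audio' and 'text' are optional. The URLs are placeholders.
$feedData = [
    ['link' => 'https://example.com/a', 'title' => 'Fetched from the web page'],
    ['link' => 'https://example.com/b', 'title' => 'Inline text', 'text' => '<p>Body from the feed</p>'],
];

// Hypothetical helper: checks one item against the documented array shape.
function isValidFeedItem(array $item): bool
{
    foreach (['link', 'title'] as $key) {
        if (!isset($item[$key]) || !is_string($item[$key])) {
            return false;
        }
    }
    foreach (['audio', 'text'] as $key) {
        if (isset($item[$key]) && !is_string($item[$key])) {
            return false;
        }
    }
    return true;
}

// Keep only the items that match the shape before handing them to extract().
$valid = array_filter($feedData, 'isValidFeedItem');
```

Given an ArticleExtractor instance (how it is obtained is not covered on this page), the validated items would then be passed as the first argument, for example `extract($valid, '//article')`, where `'//article'` is an assumed selector.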

fetchArticleContent()

Fetch article content from URL with charset detection.

public fetchArticleContent(string $url[, string|null $charset = null ]) : string
Parameters
$url : string

Article URL

$charset : string|null = null

Override charset (null for auto-detect)

Return values
string

HTML content

mapWindowsCharset()

Map Windows charset to UTF-8 locale equivalent.

public mapWindowsCharset(string $charset) : string
Parameters
$charset : string

Input charset

Return values
string

Mapped charset

buildFilterTagsList()

Build filter tags list from string.

private buildFilterTagsList(string $filterTags) : array<string|int, string>
Parameters
$filterTags : string

Additional filter tags

Return values
array<string|int, string>

Filter tags array
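
One plausible way to build such a list is to append the additional selectors to the defaults and split on the XPath union operator. The sketch below assumes exactly that; `sketchBuildFilterTagsList()` and `SKETCH_DEFAULT_FILTER_TAGS` are hypothetical names, and the real method may merge or order entries differently.

```php
<?php
// Copy of the documented DEFAULT_FILTER_TAGS value, under a hypothetical name.
const SKETCH_DEFAULT_FILTER_TAGS = '//img | //script | //meta | //noscript | //link | //iframe';

// Hypothetical stand-in for buildFilterTagsList().
function sketchBuildFilterTagsList(string $filterTags): array
{
    $combined = SKETCH_DEFAULT_FILTER_TAGS;
    if (trim($filterTags) !== '') {
        $combined .= ' | ' . $filterTags;
    }
    // Split on the union operator and trim each selector.
    $parts = array_map('trim', explode('|', $combined));
    // Drop empty entries and duplicates, then reindex from 0.
    return array_values(array_unique(array_filter($parts, fn($p) => $p !== '')));
}
```

Splitting on `|` is a simplification: a selector that contains `|` inside a predicate would be cut in two.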

cleanExtractedText()

Clean extracted text.

private cleanExtractedText(string $text) : string
Parameters
$text : string

Raw extracted text

Return values
string

Cleaned text
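
The page does not say which cleaning steps are applied. As an illustration only, the sketch below assumes three common ones: decoding HTML entities, normalizing line endings, and collapsing redundant whitespace. `sketchCleanExtractedText()` is a hypothetical name.

```php
<?php
// Hypothetical stand-in for cleanExtractedText(); the steps are assumptions.
function sketchCleanExtractedText(string $text): string
{
    // Decode entities such as &amp; into their characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Normalize Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;
    // Remove a single space on either side of a newline.
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;
    // Allow at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;
    return trim($text);
}
```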

convertLineBreaks()

Convert HTML line break tags to newlines.

private convertLineBreaks(string $html) : string
Parameters
$html : string

HTML content

Return values
string

Converted content
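
A minimal sketch of this conversion, assuming a single regular expression that covers `<br>`, `<br/>`, and `<br />` in any letter case. `sketchConvertLineBreaks()` is a hypothetical name, and the actual method may recognize more variants.

```php
<?php
// Hypothetical stand-in for convertLineBreaks().
function sketchConvertLineBreaks(string $html): string
{
    // Replace <br>, <br/> and <br /> (case-insensitive) with a newline.
    // preg_replace() returns null on failure, so fall back to the input.
    return preg_replace('/<br\s*\/?>/i', "\n", $html) ?? $html;
}
```

This pattern does not match `<br>` tags that carry attributes, such as `<br class="x">`.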

detectCharsetFromHeaders()

Detect charset from HTTP headers.

private detectCharsetFromHeaders(string $url) : string|null
Parameters
$url : string

URL to check

Return values
string|null

Charset or null if not found

detectCharsetFromMeta()

Detect charset from HTML meta tags.

private detectCharsetFromMeta(string $htmlString) : string|null
Parameters
$htmlString : string

HTML content

Return values
string|null

Charset or null if not found

extractNewArticleHtml()

Extract full HTML for 'new' article mode.

private extractNewArticleHtml(DOMDocument $dom, array<string|int, string> $filterTags) : array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string}
Parameters
$dom : DOMDocument

DOM document

$filterTags : array<string|int, string>

Tags to filter out

Return values
array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string}

Result with TxText containing cleaned HTML

extractSingle()

Extract content from a single article.

private extractSingle(array{link: string, title: string, audio?: string, text?: string} $item, string $articleSection, string $filterTags, string|null $charset) : array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null
Parameters
$item : array{link: string, title: string, audio?: string, text?: string}

Feed item data

$articleSection : string

XPath selector(s)

$filterTags : string

Filter selectors

$charset : string|null

Override charset

Return values
array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null

Extracted data or null on failure

extractWithXPath()

Extract content using XPath selectors.

private extractWithXPath(DOMDocument $dom, string $articleSection, array<string|int, string> $filterTags, bool $isInlineText) : string
Parameters
$dom : DOMDocument

DOM document

$articleSection : string

Article selectors

$filterTags : array<string|int, string>

Filter selectors

$isInlineText : bool

Whether source is inline text

Return values
string

Extracted text
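
The general technique (remove the filtered nodes first, then collect the text of the nodes matched by the article selector) can be sketched with PHP's `DOMXPath`. `sketchExtractWithXPath()` is a hypothetical stand-in; it omits `$isInlineText`, whose effect is not described on this page, and joining matches with newlines is an assumption.

```php
<?php
// Hypothetical stand-in for extractWithXPath(); $isInlineText is omitted.
function sketchExtractWithXPath(DOMDocument $dom, string $articleSection, array $filterTags): string
{
    $xpath = new DOMXPath($dom);

    // Remove every node matched by a filter selector.
    foreach ($filterTags as $filter) {
        $nodes = $xpath->query($filter);
        if ($nodes === false) {
            continue; // invalid expression: skip it
        }
        // Copy the matches into an array before modifying the tree.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Collect the text content of each node matched by the article selector.
    $chunks = [];
    $matches = $xpath->query($articleSection);
    if ($matches !== false) {
        foreach ($matches as $node) {
            $chunks[] = trim($node->textContent);
        }
    }
    return implode("\n", $chunks);
}
```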

formatErrorMessage()

Format error message for failed extraction.

private formatErrorMessage(array{link: string, title: string, audio?: string, text?: string} $item) : string
Parameters
$item : array{link: string, title: string, audio?: string, text?: string}

Feed item

Return values
string

Error message HTML

handleRedirect()

Handle redirect article section to find actual article URL.

private handleRedirect(string $link, string $articleSection, string &$newSection) : string
Parameters
$link : string

Original link

$articleSection : string

Full article section string

$newSection : string

Output: updated article section

Return values
string

Updated link

parseHtml()

Parse HTML into DOMDocument.

private parseHtml(string $htmlString) : DOMDocument
Parameters
$htmlString : string

HTML content

Return values
DOMDocument

Parsed document
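
Real-world HTML is rarely well formed, so a parser wrapper usually has to suppress libxml warnings. The sketch below shows one common approach with `DOMDocument::loadHTML()`. `sketchParseHtml()` is a hypothetical name, and both the UTF-8 encoding hint and the assumption that the input is already UTF-8 are mine, not documented behavior.

```php
<?php
// Hypothetical stand-in for parseHtml(). Assumes $htmlString is non-empty UTF-8.
function sketchParseHtml(string $htmlString): DOMDocument
{
    $dom = new DOMDocument();
    // Collect parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML encoding hint keeps libxml from treating the input as ISO-8859-1.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $htmlString);
    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}
```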

prepareInlineHtml()

Prepare inline HTML content.

private prepareInlineHtml(string $text) : string
Parameters
$text : string

Inline text from feed

Return values
string

Prepared HTML

processInlineLink()

Process inline link (handle # prefix for feed link references).

private processInlineLink(string $link) : string
Parameters
$link : string

Original link

Return values
string

Processed link


        