ArticleExtractor
in package Lwt Modules Feed Application Services

Service for extracting text content from web articles.

Provides HTML content extraction using XPath selectors, charset detection, and content cleaning.

Constants

DEFAULT_FILTER_TAGS = '//img | //script | //meta | //noscript | //link | //iframe': Default filter tags to remove from extracted content.

Methods

detectCharset() : string: Detect charset from HTTP headers, meta tags, or content.
extract() : array<int|string, array<string, mixed>>: Extract text content from feed article data.
fetchArticleContent() : string: Fetch article content from URL with charset detection.
mapWindowsCharset() : string: Map Windows charset to UTF-8 locale equivalent.
buildFilterTagsList() : array<string|int, string>: Build filter tags list from string.
cleanExtractedText() : string: Clean extracted text.
convertLineBreaks() : string: Convert HTML line break tags to newlines.
detectCharsetFromHeaders() : string|null: Detect charset from HTTP headers.
detectCharsetFromMeta() : string|null: Detect charset from HTML meta tags.
extractNewArticleHtml() : array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string}: Extract full HTML for 'new' article mode.
extractSingle() : array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null: Extract content from a single article.
extractWithXPath() : string: Extract content using XPath selectors.
formatErrorMessage() : string: Format error message for failed extraction.
handleRedirect() : string: Handle redirect article section to find actual article URL.
parseHtml() : DOMDocument: Parse HTML into DOMDocument.
prepareInlineHtml() : string: Prepare inline HTML content.
processInlineLink() : string: Process inline link (handle # prefix for feed link references).

DEFAULT_FILTER_TAGS

Default filter tags to remove from extracted content.


    private mixed DEFAULT_FILTER_TAGS = '//img | //script | //meta | //noscript | //link | //iframe'

detectCharset()

Detect charset from HTTP headers, meta tags, or content.


    public detectCharset(string $url, string $htmlString[, string|null $override = null ]) : string

Parameters

$url : string: URL being fetched
$htmlString : string: HTML content
$override : string|null = null: Override charset

Return values

string: Detected charset
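
The summary lists three sources for the charset (HTTP headers, meta tags, content) plus an override parameter. The sketch below shows one plausible precedence order. It is an assumption, not the method's confirmed logic: `sketchResolveCharset()` is a hypothetical helper, and the UTF-8 fallback stands in for whatever content-based detection the real method performs.

```php
<?php
// Hypothetical helper, not part of ArticleExtractor.
// Assumed precedence: explicit override, then header charset, then meta charset.
function sketchResolveCharset(?string $override, ?string $fromHeaders, ?string $fromMeta): string
{
    foreach ([$override, $fromHeaders, $fromMeta] as $candidate) {
        if ($candidate !== null && trim($candidate) !== '') {
            // Normalise case so callers can compare names reliably.
            return strtoupper(trim($candidate));
        }
    }
    // Assumed fallback; the real method may inspect the content instead.
    return 'UTF-8';
}
```

Passing the header and meta results in as plain values keeps the precedence rule testable without any network access.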

extract()

Extract text content from feed article data.


    public extract(array<int|string, array{link: string, title: string, audio?: string, text?: string}> $feedData, string $articleSection[, string $filterTags = '' ][, string|null $charset = null ]) : array<int|string, array<string, mixed>>

Handles various scenarios:

Inline text from feed (description, content, encoded)
Fetching full article from webpage
Redirect handling for intermediate pages
Charset detection and conversion
XPath-based content extraction

Parameters

$feedData : array<int|string, array{link: string, title: string, audio?: string, text?: string}>: Array of feed items with link, title, etc.
$articleSection : string: XPath selector(s) for article content
$filterTags : string = '': XPath selector(s) for elements to remove
$charset : string|null = null: Override charset (null for auto-detect)

Return values

array<int|string, array<string, mixed>>: Extracted text data with 'error' key for failed extractions
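
To make the input and output shapes concrete, here is a minimal sketch of a per-item loop that collects failures under an 'error' key. `sketchExtractAll()` and the toy extractor are hypothetical; in particular, storing the failed links as a list under 'error' is an assumption, since the documentation only states that such a key exists.

```php
<?php
/**
 * Hypothetical stand-in for the per-item loop.
 * $extractOne plays the role of extractSingle() and returns an array or null.
 */
function sketchExtractAll(array $feedData, callable $extractOne): array
{
    $results = [];
    foreach ($feedData as $key => $item) {
        $extracted = $extractOne($item);
        if ($extracted === null) {
            // Assumed layout: failed items are listed under 'error'.
            $results['error'][] = $item['link'];
            continue;
        }
        $results[$key] = $extracted;
    }
    return $results;
}

$feedData = [
    ['link' => 'https://example.com/a', 'title' => 'First', 'text' => '<p>Inline body</p>'],
    ['link' => 'https://example.com/b', 'title' => 'Second'],
];

// Toy extractor: succeeds only for items that carry inline text.
$results = sketchExtractAll($feedData, function (array $item): ?array {
    if (!isset($item['text'])) {
        return null;
    }
    return [
        'TxTitle' => $item['title'],
        'TxAudioURI' => $item['audio'] ?? '',
        'TxText' => strip_tags($item['text']),
        'TxSourceURI' => $item['link'],
    ];
});
```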

fetchArticleContent()

Fetch article content from URL with charset detection.


    public fetchArticleContent(string $url[, string|null $charset = null ]) : string

Parameters

$url : string: Article URL
$charset : string|null = null: Override charset (null for auto-detect)

Return values

string: HTML content
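
Once a charset is known, the fetched bytes usually need converting before parsing. The following is a minimal sketch of that conversion step only, assuming the target encoding is UTF-8; `sketchToUtf8()` is a hypothetical helper, and whether the real method uses `iconv` is not stated here.

```php
<?php
// Hypothetical helper: converts fetched bytes to UTF-8 once the charset is known.
function sketchToUtf8(string $html, string $charset): string
{
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $html; // already in the target encoding
    }
    // @ silences the warning or notice iconv raises on unknown charsets or invalid bytes.
    $converted = @iconv($charset, 'UTF-8', $html);
    // Fall back to the original bytes if the conversion fails.
    return $converted === false ? $html : $converted;
}
```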

mapWindowsCharset()

Map Windows charset to UTF-8 locale equivalent.


    public mapWindowsCharset(string $charset) : string

Parameters

$charset : string: Input charset

Return values

string: Mapped charset

buildFilterTagsList()

Build filter tags list from string.


    private buildFilterTagsList(string $filterTags) : array<string|int, string>

Parameters

$filterTags : string: Additional filter tags

Return values

array<string|int, string>: Filter tags array
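
One way to build such a list is to join the default selectors with the caller's additions and split on the XPath union operator. This is a sketch under that assumption; `sketchBuildFilterTagsList()` and `SKETCH_DEFAULT_FILTER_TAGS` are hypothetical names, and the naive split would break on a `|` that appears inside a predicate string.

```php
<?php
// Default selectors copied from DEFAULT_FILTER_TAGS above.
const SKETCH_DEFAULT_FILTER_TAGS = '//img | //script | //meta | //noscript | //link | //iframe';

// Hypothetical helper, not the class's private method.
function sketchBuildFilterTagsList(string $filterTags): array
{
    $combined = SKETCH_DEFAULT_FILTER_TAGS;
    if (trim($filterTags) !== '') {
        $combined .= ' | ' . $filterTags;
    }
    // Split on the union operator, trim each selector, drop empties and duplicates.
    $parts = array_map('trim', explode('|', $combined));
    $parts = array_filter($parts, fn(string $p): bool => $p !== '');
    return array_values(array_unique($parts));
}
```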

cleanExtractedText()

Clean extracted text.


    private cleanExtractedText(string $text) : string

Parameters

$text : string: Raw extracted text

Return values

string: Cleaned text
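
The documentation does not say which cleaning steps are applied. As an illustration only, the sketch below assumes typical ones: entity decoding, newline normalisation, and whitespace collapsing. `sketchCleanText()` is a hypothetical helper.

```php
<?php
// Hypothetical helper; the cleaning steps are assumptions, not the documented behaviour.
function sketchCleanText(string $text): string
{
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = str_replace(["\r\n", "\r"], "\n", $text);   // normalise line endings
    $text = preg_replace('/[ \t]+/', ' ', $text);       // collapse runs of spaces and tabs
    $text = preg_replace('/ ?\n ?/', "\n", $text);      // drop spaces around newlines
    $text = preg_replace('/\n{3,}/', "\n\n", $text);    // keep at most one blank line
    return trim($text);
}
```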

convertLineBreaks()

Convert HTML line break tags to newlines.


    private convertLineBreaks(string $html) : string

Parameters

$html : string: HTML content

Return values

string: Converted content
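
A conversion like this can be done with a single regular expression. The sketch below assumes only `<br>` variants are replaced; `sketchConvertLineBreaks()` is a hypothetical helper, and the real method may handle other tags as well.

```php
<?php
// Hypothetical helper: replaces <br>, <br/>, <br /> and <BR ...> with a newline.
function sketchConvertLineBreaks(string $html): string
{
    // \b stops the pattern from matching longer tag names that start with "br".
    return preg_replace('/<br\b[^>]*>/i', "\n", $html) ?? $html;
}
```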

detectCharsetFromHeaders()

Detect charset from HTTP headers.


    private detectCharsetFromHeaders(string $url) : string|null

Parameters

$url : string: URL to check

Return values

string|null: Charset or null if not found
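
The parsing half of header detection can be sketched without a network call. The helper below is hypothetical: it takes raw header lines, such as the list PHP's `get_headers()` returns for a URL, and reads the charset from the `Content-Type` line. How the real method obtains the headers is not documented here.

```php
<?php
/**
 * Hypothetical helper, not the class's private method.
 *
 * @param string[] $headerLines Raw header lines, e.g. from get_headers($url).
 */
function sketchCharsetFromHeaderLines(array $headerLines): ?string
{
    $found = null;
    foreach ($headerLines as $line) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $line, $m)) {
            // Keep the last match so the final response after redirects wins.
            $found = $m[1];
        }
    }
    return $found;
}
```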

detectCharsetFromMeta()

Detect charset from HTML meta tags.


    private detectCharsetFromMeta(string $htmlString) : string|null

Parameters

$htmlString : string: HTML content

Return values

string|null: Charset or null if not found
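
HTML declares its charset in two common forms: `<meta charset="...">` and the older `http-equiv="Content-Type"` variant. The sketch below covers both with one pattern; `sketchCharsetFromMeta()` is a hypothetical helper, and the regular expression is an assumption rather than the one the class uses.

```php
<?php
// Hypothetical helper: returns the first charset declared in a <meta> tag, or null.
function sketchCharsetFromMeta(string $html): ?string
{
    // Matches both <meta charset="x"> and content="text/html; charset=x".
    if (preg_match('/<meta\b[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}
```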

extractNewArticleHtml()

Extract full HTML for 'new' article mode.


    private extractNewArticleHtml(DOMDocument $dom, array<string|int, string> $filterTags) : array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string}

Parameters

$dom : DOMDocument: DOM document
$filterTags : array<string|int, string>: Tags to filter out

Return values

array{TxTitle: string, TxText: string, TxSourceURI: string, TxAudioURI: string}: Result with TxText containing cleaned HTML

extractSingle()

Extract content from a single article.


    private extractSingle(array{link: string, title: string, audio?: string, text?: string} $item, string $articleSection, string $filterTags, string|null $charset) : array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null

Parameters

$item : array{link: string, title: string, audio?: string, text?: string}: Feed item data
$articleSection : string: XPath selector(s)
$filterTags : string: Filter selectors
$charset : string|null: Override charset

Return values

array{TxTitle: string, TxAudioURI: string, TxText: string, TxSourceURI: string}|null: Extracted data or null on failure

extractWithXPath()

Extract content using XPath selectors.


    private extractWithXPath(DOMDocument $dom, string $articleSection, array<string|int, string> $filterTags, bool $isInlineText) : string

Parameters

$dom : DOMDocument: DOM document
$articleSection : string: Article selectors
$filterTags : array<string|int, string>: Filter selectors
$isInlineText : bool: Whether source is inline text

Return values

string: Extracted text
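
The general technique (remove the filtered nodes, then query the article selectors) can be sketched with PHP's `DOMXPath`. `sketchExtractWithXPath()` is a hypothetical helper that omits the `$isInlineText` flag, whose effect is not documented here; joining matches with newlines is also an assumption.

```php
<?php
// Hypothetical helper, not the class's private method.
function sketchExtractWithXPath(DOMDocument $dom, string $articleSection, array $filterTags): string
{
    $xpath = new DOMXPath($dom);

    // Remove unwanted nodes first so they cannot leak into the extracted text.
    foreach ($filterTags as $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes === false) {
            continue; // invalid selector: skipped in this sketch
        }
        // Copy to an array so removal does not disturb iteration.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    $parts = [];
    $matches = $xpath->query($articleSection);
    if ($matches !== false) {
        foreach ($matches as $node) {
            $parts[] = trim($node->textContent);
        }
    }
    return implode("\n", array_filter($parts, fn(string $p): bool => $p !== ''));
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect markup
$dom->loadHTML('<html><body><div id="main"><p>Hello</p><script>x()</script><p>World</p></div></body></html>');
libxml_clear_errors();

$text = sketchExtractWithXPath($dom, '//div[@id="main"]/p', ['//script']);
```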

formatErrorMessage()

Format error message for failed extraction.


    private formatErrorMessage(array{link: string, title: string, audio?: string, text?: string} $item) : string

Parameters

$item : array{link: string, title: string, audio?: string, text?: string}: Feed item

Return values

string: Error message HTML

handleRedirect()

Handle redirect article section to find actual article URL.


    private handleRedirect(string $link, string $articleSection, string &$newSection) : string

Parameters

$link : string: Original link
$articleSection : string: Full article section string
$newSection : string: Output: updated article section

Return values

string —

Updated link

parseHtml()

Parse HTML into DOMDocument.


    private parseHtml(string $htmlString) : DOMDocument

Parameters

$htmlString : string: HTML content

Return values

DOMDocument: Parsed document
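
Parsing real-world HTML with `DOMDocument` usually means suppressing libxml warnings and hinting the encoding. The sketch below shows that common pattern; `sketchParseHtml()` is a hypothetical helper, and it assumes the input has already been converted to UTF-8.

```php
<?php
// Hypothetical helper, not the class's private method.
function sketchParseHtml(string $htmlString): DOMDocument
{
    $dom = new DOMDocument();
    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    // The XML encoding hint asks libxml to read the input as UTF-8 (assumed input encoding).
    $dom->loadHTML('<?xml encoding="UTF-8">' . $htmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting
    return $dom;
}
```

Restoring the previous libxml error mode keeps the helper from changing error handling for the rest of the application.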

prepareInlineHtml()

Prepare inline HTML content.


    private prepareInlineHtml(string $text) : string

Parameters

$text : string: Inline text from feed

Return values

string: Prepared HTML

processInlineLink()

Process inline link (handle # prefix for feed link references).


    private processInlineLink(string $link) : string

Parameters

$link : string: Original link

Return values

string: Processed link
