RssParser
in package Lwt\Modules\Feed\Application\Services

Service for parsing RSS and Atom feeds.

Provides pure parsing functionality without database access. Supports both RSS 2.0 and Atom feed formats.

Methods

detectAndParse() : array<int|string, array<string, string>|string>|null: Detect and parse feed, determining best text source.
getFeedTitle() : string|null: Get the feed title from a feed URL.
parse() : array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null: Parse RSS/Atom feed and return article items with metadata.
cleanDescription() : string: Clean and normalize description text.
cleanDescriptionForDetection() : string: Clean description for detection mode.
cleanTitle() : string: Clean and normalize title text.
cleanTitleForDetection() : string: Clean title for detection mode.
convertToHtmlEntities() : string: Convert HTML to HTML entities.
countTextLengths() : array{desc: array{long: int, short: int}, encoded: array{long: int, short: int}}: Count text lengths for source detection.
determineBestTextSource() : array<int|string, array<string, string>|string>: Determine best text source and update items.
extractAudioEnclosure() : string: Extract audio enclosure URL.
extractInlineText() : string|null: Extract inline text from item node.
extractLink() : string: Extract link from node based on feed type.
formatParsedDate() : string: Format parsed date array to MySQL datetime.
getFeedTagMapping() : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}|null: Get tag mapping for RSS/Atom feed format.
parseFeedDate() : string: Parse feed date string to MySQL datetime format.
parseItem() : array{title: string, link: string, desc: string, date: string, audio: string, text: string}|null: Parse a single feed item.
parseItemForDetection() : array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string}: Parse item for detection mode (includes raw text content).

detectAndParse()

Detect and parse feed, determining best text source.


    public detectAndParse(string $sourceUri) : array<int|string, array<string, string>|string>|null

Analyzes the feed to determine which text source to use:

content (Atom)
description (RSS)
encoded (RSS with content:encoded)
webpage link (external fetch)

Parameters

$sourceUri : string: Feed URL

Return values

array<int|string, array<string, string>|string>|null: Feed data with a feed_text indicator, or null on error
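
The return type mixes item arrays and plain strings in one array, which can be awkward to consume. The following is a minimal sketch of how a caller might separate the two. It assumes items sit under integer keys and the indicator under a feed_text key; the helper name splitDetectionResult and the sample data are hypothetical and not part of RssParser.

```php
<?php
/**
 * Split a detectAndParse()-style result into items and the feed_text indicator.
 * Assumption: items use integer keys, the indicator uses the 'feed_text' key.
 */
function splitDetectionResult(?array $result): ?array
{
    if ($result === null) {
        return null; // the parser reported an error
    }
    $items = [];
    foreach ($result as $key => $value) {
        if (is_int($key) && is_array($value)) {
            $items[] = $value;
        }
    }
    $feedText = $result['feed_text'] ?? '';
    return [
        'items' => $items,
        'feed_text' => is_string($feedText) ? $feedText : '',
    ];
}

// Hand-written sample standing in for a real parser result.
$sample = [
    0 => ['title' => 'First', 'desc' => 'Short', 'link' => 'https://example.com/1'],
    1 => ['title' => 'Second', 'desc' => 'Short', 'link' => 'https://example.com/2'],
    'feed_text' => 'encoded',
];
$split = splitDetectionResult($sample);
```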

getFeedTitle()

Get the feed title from a feed URL.


    public getFeedTitle(string $sourceUri) : string|null

Parameters

$sourceUri : string: Feed URL

Return values

string|null: Feed title or null on error
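
As an illustration of what reading a feed title involves, here is a sketch that works on an XML string rather than a URL. It assumes RSS keeps the title under channel and Atom keeps it directly under feed; sketchFeedTitle is a hypothetical helper, not the method's actual implementation.

```php
<?php
function sketchFeedTitle(string $xml): ?string
{
    $doc = new DOMDocument();
    // loadXML() returns false on malformed input; @ silences libxml warnings.
    if (!@$doc->loadXML($xml)) {
        return null;
    }
    $root = $doc->documentElement;
    // RSS nests feed metadata in <channel>; Atom keeps <title> under <feed>.
    $container = $root->localName === 'rss'
        ? $root->getElementsByTagName('channel')->item(0)
        : $root;
    if ($container === null) {
        return null;
    }
    // Only direct children are inspected, so item titles are never picked up.
    foreach ($container->childNodes as $child) {
        if ($child instanceof DOMElement && $child->localName === 'title') {
            return trim($child->textContent);
        }
    }
    return null;
}

$rssTitle = sketchFeedTitle(
    '<rss version="2.0"><channel><item><title>Post</title></item>'
    . '<title>My Feed</title></channel></rss>'
);
$atomTitle = sketchFeedTitle(
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>Atom Feed</title></feed>'
);
```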

parse()

Parse RSS/Atom feed and return article items with metadata.


    public parse(string $sourceUri[, string $articleSection = '' ]) : array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null

Supports both RSS 2.0 and Atom feed formats. Extracts:

Title, description, link, publication date
Audio enclosures (podcast support)
Inline text content (if article section specified)

Parameters

$sourceUri : string: Feed URL
$articleSection : string = '': Tag name for inline text extraction

Return values

array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null: Array of feed items or null on error
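
To make the item shape concrete, the sketch below extracts title, link and desc from an RSS 2.0 string with DOMDocument. It is a simplified illustration only: it skips dates, audio and inline text, handles RSS but not Atom, and the helper names sketchParseRss and firstChildText are hypothetical.

```php
<?php
function firstChildText(DOMElement $node, string $tag): string
{
    $found = $node->getElementsByTagName($tag)->item(0);
    return $found === null ? '' : trim($found->textContent);
}

function sketchParseRss(string $xml): ?array
{
    $doc = new DOMDocument();
    // Return null on malformed XML, mirroring the documented "null on error".
    if (!@$doc->loadXML($xml)) {
        return null;
    }
    $items = [];
    foreach ($doc->getElementsByTagName('item') as $node) {
        $items[] = [
            'title' => firstChildText($node, 'title'),
            'link'  => firstChildText($node, 'link'),
            'desc'  => firstChildText($node, 'description'),
        ];
    }
    return $items;
}

$parsed = sketchParseRss(
    '<rss version="2.0"><channel><title>Feed</title>'
    . '<item><title>Hello</title><link>https://example.com/hello</link>'
    . '<description>First post</description></item>'
    . '</channel></rss>'
);
```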

cleanDescription()

Clean and normalize description text.


    private cleanDescription(string $desc) : string

Parameters

$desc : string: Raw description

Return values

string: Cleaned description
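
The exact cleaning rules are not documented here. As a rough sketch of what "clean and normalize" commonly means for feed text, the hypothetical helper below strips tags, decodes entities and collapses whitespace; the real method may apply different or additional steps.

```php
<?php
function sketchCleanText(string $raw): string
{
    // Strip tags before decoding entities so that escaped markup such as
    // &lt;b&gt; survives as literal text instead of being removed.
    $text = strip_tags($raw);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (including newlines) into single spaces.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;
    return trim($text);
}

$cleaned = sketchCleanText("  <b>Hello</b> &amp;\n  world ");
```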

cleanDescriptionForDetection()

Clean description for detection mode.


    private cleanDescriptionForDetection(string $desc) : string

Parameters

$desc : string: Raw description

Return values

string: Cleaned description

cleanTitle()

Clean and normalize title text.


    private cleanTitle(string $title) : string

Parameters

$title : string: Raw title

Return values

string: Cleaned title

cleanTitleForDetection()

Clean title for detection mode.


    private cleanTitleForDetection(string $title) : string

Parameters

$title : string: Raw title

Return values

string: Cleaned title

convertToHtmlEntities()

Convert HTML to HTML entities.


    private convertToHtmlEntities(string $html) : string

Parameters

$html : string: HTML content

Return values

string: Converted content

countTextLengths()

Count text lengths for source detection.


    private countTextLengths(array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string} $item, string $descKey, string $encKey) : array{desc: array{long: int, short: int}, encoded: array{long: int, short: int}}

Parameters

$item : array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string}: Item data
$descKey : string: Description key
$encKey : string: Encoded key

Return values

array{desc: array{long: int, short: int}, encoded: array{long: int, short: int}}: Counts array
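
The return shape suggests that each text field is classified as long or short. The sketch below shows one way that could work. The 900-character threshold, the decision to skip empty fields, and the name sketchCountTextLengths are all assumptions made for illustration; the page does not state the actual rule.

```php
<?php
// Hypothetical threshold; the real cut-off is not documented on this page.
const SKETCH_LONG_TEXT_THRESHOLD = 900;

function sketchCountTextLengths(array $item, string $descKey, string $encKey): array
{
    $counts = [
        'desc'    => ['long' => 0, 'short' => 0],
        'encoded' => ['long' => 0, 'short' => 0],
    ];
    foreach (['desc' => $descKey, 'encoded' => $encKey] as $bucket => $key) {
        // Assumption: missing or empty fields are not counted at all.
        if (!isset($item[$key]) || $item[$key] === '') {
            continue;
        }
        // Measure visible text only, so markup does not inflate the length.
        $length = strlen(strip_tags($item[$key]));
        $size = $length > SKETCH_LONG_TEXT_THRESHOLD ? 'long' : 'short';
        $counts[$bucket][$size]++;
    }
    return $counts;
}

$counts = sketchCountTextLengths(
    [
        'title' => 'T',
        'desc' => 'x',
        'link' => 'https://example.com/1',
        'description' => str_repeat('a', 1000),
        'encoded' => '<p>short</p>',
    ],
    'description',
    'encoded'
);
```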

determineBestTextSource()

Determine best text source and update items.


    private determineBestTextSource(array<int|string, array<string, string>|string> $rssData, array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string} $feedTags, int $descCount, int $descNocount, int $encCount, int $encNocount) : array<int|string, array<string, string>|string>

Parameters

$rssData : array<int|string, array<string, string>|string>: Feed items
$feedTags : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}: Tag mapping
$descCount : int: Long description count
$descNocount : int: Short description count
$encCount : int: Long encoded count
$encNocount : int: Short encoded count

Return values

array<int|string, array<string, string>|string>: Updated feed data
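
The four counters lend themselves to a simple majority decision. The following sketch shows one plausible rule, purely as an assumption: prefer encoded text when most items carry a long encoded body, fall back to the description when most descriptions are long, and otherwise fetch the linked webpage. The function name and the returned labels are hypothetical.

```php
<?php
function sketchChooseTextSource(
    int $descLong,
    int $descShort,
    int $encLong,
    int $encShort
): string {
    // Assumed rule: a source wins when long entries outnumber short ones.
    if ($encLong > $encShort) {
        return 'encoded';
    }
    if ($descLong > $descShort) {
        return 'description';
    }
    // Neither inline source looks like full text, so use the article page.
    return 'webpage link';
}

$choice = sketchChooseTextSource(2, 8, 9, 1);
```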

extractAudioEnclosure()

Extract audio enclosure URL.


    private extractAudioEnclosure(DOMElement $node, array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string} $feedTags) : string

Parameters

$node : DOMElement: Item node
$feedTags : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}: Tag mapping

Return values

string: Audio URL or empty string
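
For podcast feeds, an RSS item usually carries an enclosure element with url and type attributes. A minimal sketch of picking the first audio enclosure follows. It hard-codes the RSS tag and attribute names instead of reading them from $feedTags, and sketchAudioEnclosure is a hypothetical name.

```php
<?php
function sketchAudioEnclosure(DOMElement $item): string
{
    foreach ($item->getElementsByTagName('enclosure') as $enclosure) {
        // getAttribute() returns '' when the attribute is absent.
        $type = $enclosure->getAttribute('type');
        if (strncmp($type, 'audio/', 6) === 0) {
            return $enclosure->getAttribute('url');
        }
    }
    return ''; // no audio enclosure found
}

$doc = new DOMDocument();
$doc->loadXML(
    '<rss version="2.0"><channel><item>'
    . '<enclosure url="https://example.com/cover.jpg" type="image/jpeg"/>'
    . '<enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/>'
    . '</item><item><title>No audio</title></item></channel></rss>'
);
$itemNodes = $doc->getElementsByTagName('item');
$audio = sketchAudioEnclosure($itemNodes->item(0));
$noAudio = sketchAudioEnclosure($itemNodes->item(1));
```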

extractInlineText()

Extract inline text from item node.


    private extractInlineText(DOMElement $node, string $articleSection) : string|null

Parameters

$node : DOMElement: Item node
$articleSection : string: Tag name for text extraction

Return values

string|null: Extracted text or null
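
A sketch of inline text extraction, assuming $articleSection names a child element of the item whose text content should be returned. The helper sketchInlineText and the fulltext tag used in the sample are hypothetical; treating an empty section as null is also an assumption.

```php
<?php
function sketchInlineText(DOMElement $item, string $articleSection): ?string
{
    if ($articleSection === '') {
        return null; // no section requested
    }
    $section = $item->getElementsByTagName($articleSection)->item(0);
    if ($section === null) {
        return null; // the item has no such element
    }
    $text = trim($section->textContent);
    return $text === '' ? null : $text;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<rss version="2.0"><channel><item>'
    . '<title>Post</title><fulltext>Full article body.</fulltext>'
    . '</item></channel></rss>'
);
$itemNode = $doc->getElementsByTagName('item')->item(0);
$inline = sketchInlineText($itemNode, 'fulltext');
$missing = sketchInlineText($itemNode, 'summary');
```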

extractLink()

Extract link from node based on feed type.


    private extractLink(DOMElement|null $linkNode, array<string|int, mixed> $feedTags) : string

Parameters

$linkNode : DOMElement|null: Link node
$feedTags : array<string|int, mixed>: Tag mapping

Return values

string: Link URL
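
The two formats store links differently: RSS 2.0 puts the URL in the element's text, while Atom puts it in an href attribute. The sketch below distinguishes them by checking for that attribute rather than consulting $feedTags, which is a simplification; sketchExtractLink is a hypothetical name.

```php
<?php
function sketchExtractLink(?DOMElement $linkNode): string
{
    if ($linkNode === null) {
        return '';
    }
    // Atom: <link href="..."/>; RSS: <link>...</link>.
    if ($linkNode->hasAttribute('href')) {
        return trim($linkNode->getAttribute('href'));
    }
    return trim($linkNode->textContent);
}

$rss = new DOMDocument();
$rss->loadXML('<item><link> https://example.com/a </link></item>');
$rssLink = sketchExtractLink($rss->getElementsByTagName('link')->item(0));

$atom = new DOMDocument();
$atom->loadXML('<entry><link href="https://example.com/b"/></entry>');
$atomLink = sketchExtractLink($atom->getElementsByTagName('link')->item(0));
```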

formatParsedDate()

Format parsed date array to MySQL datetime.


    private formatParsedDate(array<string|int, mixed> $pubDate, int $fallback) : string

Parameters

$pubDate : array<string|int, mixed>: Parsed date array
$fallback : int: Fallback offset

Return values

string: MySQL datetime string
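
A MySQL datetime string has the form YYYY-MM-DD HH:MM:SS. The sketch below formats an array shaped like the output of PHP's date_parse(). That input shape, the extra $now parameter (added so the example is deterministic), and the reading of $fallback as seconds subtracted from the current time are assumptions; sketchFormatParsedDate is a hypothetical name.

```php
<?php
function sketchFormatParsedDate(array $parsed, int $fallback, int $now): string
{
    // date_parse() sets fields it could not determine to false.
    foreach (['year', 'month', 'day'] as $field) {
        if (!isset($parsed[$field]) || $parsed[$field] === false) {
            // Assumed fallback: current time minus the offset, in UTC.
            return gmdate('Y-m-d H:i:s', $now - $fallback);
        }
    }
    return sprintf(
        '%04d-%02d-%02d %02d:%02d:%02d',
        $parsed['year'],
        $parsed['month'],
        $parsed['day'],
        (int) ($parsed['hour'] ?? 0),   // false or missing becomes 0
        (int) ($parsed['minute'] ?? 0),
        (int) ($parsed['second'] ?? 0)
    );
}

$formatted = sketchFormatParsedDate(date_parse('2021-09-06 12:30:45'), 0, 0);
$fallbackDate = sketchFormatParsedDate(
    ['year' => false, 'month' => false, 'day' => false],
    5,
    1000000000
);
```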

getFeedTagMapping()

Get tag mapping for RSS/Atom feed format.


    private getFeedTagMapping(DOMDocument $rss) : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}|null

Parameters

$rss : DOMDocument: Feed document

Return values

array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}|null: Tag mapping or null if unknown format
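
The mapping lets the rest of the parser use one set of keys for both formats. The sketch below shows what such a mapping might look like. The RSS values follow the RSS 2.0 element names; the Atom values (entry, summary, published, href) are plausible guesses and may differ from what the class actually returns. sketchTagMapping is a hypothetical name.

```php
<?php
function sketchTagMapping(DOMDocument $doc): ?array
{
    $root = $doc->documentElement;
    if ($root === null) {
        return null;
    }
    switch ($root->localName) {
        case 'rss':
            return [
                'item' => 'item', 'title' => 'title',
                'description' => 'description', 'link' => 'link',
                'pubDate' => 'pubDate', 'enclosure' => 'enclosure',
                'url' => 'url',
            ];
        case 'feed':
            // Assumed Atom equivalents; not confirmed by this page.
            return [
                'item' => 'entry', 'title' => 'title',
                'description' => 'summary', 'link' => 'link',
                'pubDate' => 'published', 'enclosure' => 'link',
                'url' => 'href',
            ];
        default:
            return null; // unknown format
    }
}

$rssDoc = new DOMDocument();
$rssDoc->loadXML('<rss version="2.0"><channel/></rss>');
$atomDoc = new DOMDocument();
$atomDoc->loadXML('<feed xmlns="http://www.w3.org/2005/Atom"/>');
$otherDoc = new DOMDocument();
$otherDoc->loadXML('<html/>');

$rssMap = sketchTagMapping($rssDoc);
$atomMap = sketchTagMapping($atomDoc);
$otherMap = sketchTagMapping($otherDoc);
```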

parseFeedDate()

Parse feed date string to MySQL datetime format.


    private parseFeedDate(string|null $dateStr, int $fallback) : string

Parameters

$dateStr : string|null: Date string from feed
$fallback : int: Fallback offset for ordering

Return values

string: MySQL datetime string
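
As an illustration, the sketch below converts a feed date to a MySQL datetime with strtotime() and gmdate(). Several points are assumptions: output is in UTC, $fallback is treated as a number of seconds subtracted from the current time so that undated items keep their feed order, and $now is an extra parameter added to make the example deterministic. sketchParseFeedDate is a hypothetical name.

```php
<?php
function sketchParseFeedDate(?string $dateStr, int $fallback, int $now): string
{
    // strtotime() returns false for strings it cannot interpret.
    $timestamp = $dateStr === null ? false : strtotime($dateStr);
    if ($timestamp === false) {
        // Assumed fallback: later items get slightly older timestamps.
        $timestamp = $now - $fallback;
    }
    return gmdate('Y-m-d H:i:s', $timestamp);
}

$rssDate = sketchParseFeedDate('Mon, 06 Sep 2021 12:00:00 +0000', 0, 0);
$atomDate = sketchParseFeedDate('2021-09-06T12:00:00Z', 0, 0);
$undated = sketchParseFeedDate(null, 5, 1000000000);
```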

parseItem()

Parse a single feed item.


    private parseItem(DOMElement $node, array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string} $feedTags, int $index, string $articleSection) : array{title: string, link: string, desc: string, date: string, audio: string, text: string}|null

Parameters

$node : DOMElement: Item node
$feedTags : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}: Tag mapping
$index : int: Item index (for date fallback)
$articleSection : string: Tag for inline text extraction

Return values

array{title: string, link: string, desc: string, date: string, audio: string, text: string}|null: Parsed item or null if invalid

parseItemForDetection()

Parse item for detection mode (includes raw text content).


    private parseItemForDetection(DOMElement $node, array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string} $feedTags) : array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string}

Parameters

$node : DOMElement: Item node
$feedTags : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}: Tag mapping

Return values

array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string}: Parsed item
