Documentation

RssParser

Service for parsing RSS and Atom feeds.

Provides pure parsing functionality without database access. Supports both RSS 2.0 and Atom feed formats.

Tags
since
3.0.0

Table of Contents

Constants

MAX_FEED_BYTES  = 8 * 1024 * 1024
Cap individual feed downloads at 8 MB. The largest legitimate podcast feeds we've seen are ~3 MB; anything beyond that is either misconfigured or an attempt to OOM the parser.

Methods

detectAndParse()  : array<int|string, array<string, string>|string>|null
Detect and parse feed, determining best text source.
detectAndParseXml()  : array<int|string, array<string, string>|string>|null
Detect and parse feed from XML already in memory.
getFeedTitle()  : string|null
Get the feed title from a feed URL.
getFeedTitleFromXml()  : string|null
Extract the feed title from XML already in memory.
parse()  : array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null
Parse RSS/Atom feed and return article items with metadata.
parseXml()  : array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null
Parse RSS/Atom feed XML already fetched into a string.
cleanDescription()  : string
Clean and normalize description text.
cleanDescriptionForDetection()  : string
Clean description for detection mode.
cleanTitle()  : string
Clean and normalize title text.
cleanTitleForDetection()  : string
Clean title for detection mode.
convertToHtmlEntities()  : string
Convert HTML to HTML entities.
countTextLengths()  : array{desc: array{long: int, short: int}, encoded: array{long: int, short: int}}
Count text lengths for source detection.
determineBestTextSource()  : array<int|string, array<string, string>|string>
Determine best text source and update items.
extractAudioEnclosure()  : string
Extract audio enclosure URL.
extractInlineText()  : string|null
Extract inline text from item node.
extractLink()  : string
Extract link from node based on feed type.
fetchFeedBody()  : string|null
Fetch the feed body with SSRF guards applied.
formatParsedDate()  : string
Format parsed date array to MySQL datetime.
getFeedTagMapping()  : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}|null
Get tag mapping for RSS/Atom feed format.
parseFeedDate()  : string
Parse feed date string to MySQL datetime format.
parseItem()  : array{title: string, link: string, desc: string, date: string, audio: string, text: string}|null
Parse a single feed item.
parseItemForDetection()  : array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string}
Parse item for detection mode (includes raw text content).

Constants

MAX_FEED_BYTES

Cap individual feed downloads at 8 MB. The largest legitimate podcast feeds we've seen are ~3 MB; anything beyond that is either misconfigured or an attempt to OOM the parser.

private mixed MAX_FEED_BYTES = 8 * 1024 * 1024
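
The page does not show where the cap is enforced, so the following is only a minimal sketch of one way to apply such a limit while reading a response body. The constant SKETCH_MAX_FEED_BYTES and the function sketchReadCapped() are hypothetical names for illustration and are not part of RssParser.

```php
<?php
// Hypothetical stand-in that mirrors the documented 8 MB cap.
const SKETCH_MAX_FEED_BYTES = 8 * 1024 * 1024;

/**
 * Read a stream in chunks and give up as soon as the cap is exceeded.
 *
 * @param resource $stream Readable stream positioned at the start of the body
 */
function sketchReadCapped($stream, int $maxBytes = SKETCH_MAX_FEED_BYTES): ?string
{
    $body = '';
    while (!feof($stream)) {
        $chunk = fread($stream, 8192);
        if ($chunk === false) {
            return null; // read error
        }
        $body .= $chunk;
        if (strlen($body) > $maxBytes) {
            return null; // oversized feed: stop instead of buffering the rest
        }
    }
    return $body;
}
```

Checking the size inside the read loop, rather than after the download completes, is what keeps an oversized body from ever being held in memory in full.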

Methods

detectAndParse()

Detect and parse feed, determining best text source.

public detectAndParse(string $sourceUri) : array<int|string, array<string, string>|string>|null

Analyzes feed to determine whether to use:

  • content (Atom)
  • description (RSS)
  • encoded (RSS with content:encoded)
  • webpage link (external fetch)
Parameters
$sourceUri : string

Feed URL

Return values
array<int|string, array<string, string>|string>|null

Feed data with feed_text indicator or null on error

detectAndParseXml()

Detect and parse feed from XML already in memory.

public detectAndParseXml(string $xml) : array<int|string, array<string, string>|string>|null

Plays the same role for the detection variant that parseXml() plays for parse(): it separates the SSRF-guarded fetch from the parsing logic so tests don't have to round-trip through HTTP.

Parameters
$xml : string

Raw feed XML

Return values
array<int|string, array<string, string>|string>|null

Feed data with feed_text indicator or null on parse error

getFeedTitle()

Get the feed title from a feed URL.

public getFeedTitle(string $sourceUri) : string|null
Parameters
$sourceUri : string

Feed URL

Return values
string|null

Feed title or null on error

getFeedTitleFromXml()

Extract the feed title from XML already in memory.

public getFeedTitleFromXml(string $xml) : string|null
Parameters
$xml : string

Raw feed XML

Return values
string|null

Feed title or null on parse error
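
As an illustration of title extraction from an in-memory string, here is a small standalone sketch. It assumes the feed title is the first title element in document order, which holds for typical RSS 2.0 and Atom feeds but is not confirmed to be the rule RssParser applies. The function name sketchFeedTitleFromXml() is hypothetical.

```php
<?php
/**
 * Return the first <title> found in the XML, or null if it cannot be parsed.
 */
function sketchFeedTitleFromXml(string $xml): ?string
{
    if ($xml === '') {
        return null; // DOMDocument::loadXML() rejects an empty string
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $loaded = $doc->loadXML($xml, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // Assumption: the channel/feed title precedes any item titles.
    $first = $doc->getElementsByTagName('title')->item(0);

    return $first === null ? null : trim($first->textContent);
}
```

Because the input is a plain string, a test can call this kind of function with a literal feed and never touch the network.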

parse()

Parse RSS/Atom feed and return article items with metadata.

public parse(string $sourceUri[, string $articleSection = '' ]) : array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null

Supports both RSS 2.0 and Atom feed formats. Extracts:

  • Title, description, link, publication date
  • Audio enclosures (podcast support)
  • Inline text content (if article section specified)
Parameters
$sourceUri : string

Feed URL

$articleSection : string = ''

Tag name for inline text extraction

Return values
array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null

Array of feed items or null on error

parseXml()

Parse RSS/Atom feed XML already fetched into a string.

public parseXml(string $xml[, string $articleSection = '' ]) : array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null

The URI variant parse() adds an SSRF-guarded HTTP fetch on top of this; tests and any other in-memory caller should land here directly to skip the network entirely.

Parameters
$xml : string

Raw feed XML

$articleSection : string = ''

Tag name for inline text extraction

Return values
array<int, array{title: string, link: string, desc: string, date: string, audio: string, text: string}>|null

Array of feed items or null on parse error

cleanDescription()

Clean and normalize description text.

private cleanDescription(string $desc) : string
Parameters
$desc : string

Raw description

Return values
string

Cleaned description

cleanDescriptionForDetection()

Clean description for detection mode.

private cleanDescriptionForDetection(string $desc) : string
Parameters
$desc : string

Raw description

Return values
string

Cleaned description

cleanTitle()

Clean and normalize title text.

private cleanTitle(string $title) : string
Parameters
$title : string

Raw title

Return values
string

Cleaned title

cleanTitleForDetection()

Clean title for detection mode.

private cleanTitleForDetection(string $title) : string
Parameters
$title : string

Raw title

Return values
string

Cleaned title

convertToHtmlEntities()

Convert HTML to HTML entities.

private convertToHtmlEntities(string $html) : string
Parameters
$html : string

HTML content

Return values
string

Converted content

countTextLengths()

Count text lengths for source detection.

private countTextLengths(array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string} $item, string $descKey, string $encKey) : array{desc: array{long: int, short: int}, encoded: array{long: int, short: int}}
Parameters
$item : array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string}

Item data

$descKey : string

Description key

$encKey : string

Encoded key

Return values
array{desc: array{long: int, short: int}, encoded: array{long: int, short: int}}

Counts array
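
The documented return shape suggests that each item votes "long" or "short" for its description and encoded fields. The sketch below shows one way such a tally could work. The cutoff SKETCH_LONG_TEXT_THRESHOLD, the use of byte length after strip_tags(), and the handling of missing keys are all assumptions; the real threshold and rules are not given on this page.

```php
<?php
// Hypothetical cutoff in bytes; not taken from RssParser.
const SKETCH_LONG_TEXT_THRESHOLD = 900;

/**
 * @param array<string, string> $item
 * @return array{desc: array{long: int, short: int}, encoded: array{long: int, short: int}}
 */
function sketchCountTextLengths(array $item, string $descKey, string $encKey): array
{
    $counts = [
        'desc'    => ['long' => 0, 'short' => 0],
        'encoded' => ['long' => 0, 'short' => 0],
    ];

    foreach (['desc' => $descKey, 'encoded' => $encKey] as $bucket => $key) {
        if (!isset($item[$key])) {
            continue; // assumption: a missing field counts in neither bucket
        }
        // Measure the text without markup so tag-heavy snippets do not look long.
        $length = strlen(strip_tags($item[$key]));
        $size = $length > SKETCH_LONG_TEXT_THRESHOLD ? 'long' : 'short';
        $counts[$bucket][$size]++;
    }

    return $counts;
}
```

Summing these per-item counts across a feed would give the four integers that determineBestTextSource() accepts.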

determineBestTextSource()

Determine best text source and update items.

private determineBestTextSource(array<int|string, array<string, string>|string> $rssData, array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string} $feedTags, int $descCount, int $descNocount, int $encCount, int $encNocount) : array<int|string, array<string, string>|string>
Parameters
$rssData : array<int|string, array<string, string>|string>

Feed items

$feedTags : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}

Tag mapping

$descCount : int

Long description count

$descNocount : int

Short description count

$encCount : int

Long encoded count

$encNocount : int

Short encoded count

Return values
array<int|string, array<string, string>|string>

Updated feed data

extractAudioEnclosure()

Extract audio enclosure URL.

private extractAudioEnclosure(DOMElement $node, array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string} $feedTags) : string
Parameters
$node : DOMElement

Item node

$feedTags : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}

Tag mapping

Return values
string

Audio URL or empty string
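
For readers unfamiliar with podcast enclosures, this standalone sketch shows the general idea for an RSS 2.0 item: look for an enclosure element whose type attribute is an audio MIME type and return its url attribute. Filtering on the "audio/" prefix is an assumption about how the method decides what counts as audio, and sketchAudioEnclosureUrl() is a hypothetical name. The two optional parameters stand in for the enclosure and url entries of the tag mapping.

```php
<?php
/**
 * Return the URL of the first audio enclosure in an item, or '' if none.
 */
function sketchAudioEnclosureUrl(
    DOMElement $item,
    string $enclosureTag = 'enclosure',
    string $urlAttribute = 'url'
): string {
    foreach ($item->getElementsByTagName($enclosureTag) as $enclosure) {
        if (!$enclosure instanceof DOMElement) {
            continue;
        }
        $type = strtolower($enclosure->getAttribute('type'));
        // Assumption: only MIME types starting with "audio/" are treated as audio.
        if (strncmp($type, 'audio/', 6) === 0) {
            return $enclosure->getAttribute($urlAttribute);
        }
    }

    return '';
}
```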

extractInlineText()

Extract inline text from item node.

private extractInlineText(DOMElement $node, string $articleSection) : string|null
Parameters
$node : DOMElement

Item node

$articleSection : string

Tag name for text extraction

Return values
string|null

Extracted text or null

extractLink()

Extract link from node based on feed type.

private extractLink(DOMElement|null $linkNode, array<string|int, mixed> $feedTags) : string
Parameters
$linkNode : DOMElement|null

Link node

$feedTags : array<string|int, mixed>

Tag mapping

Return values
string

Link URL
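
The two formats carry links differently: RSS 2.0 puts the URL in the element's text, while Atom puts it in an href attribute. The sketch below illustrates that difference by checking for the attribute first. The documented method consults the tag mapping instead, so this is a simplified stand-in with a hypothetical name, not the actual logic.

```php
<?php
/**
 * Return the link URL from an RSS or Atom <link> element, or '' if absent.
 */
function sketchExtractLink(?DOMElement $linkNode): string
{
    if ($linkNode === null) {
        return '';
    }

    // Atom style: <link href="https://example.com/post"/>
    $href = trim($linkNode->getAttribute('href'));
    if ($href !== '') {
        return $href;
    }

    // RSS style: <link>https://example.com/post</link>
    return trim($linkNode->textContent);
}
```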

fetchFeedBody()

Fetch the feed body with SSRF guards applied.

private fetchFeedBody(string $sourceUri) : string|null

DOMDocument::load($url) would route the network fetch through PHP's stream wrappers, and LIBXML_NONET does not help there: it only blocks libxml's external entity resolution, not the initial document load. The fetch is therefore done explicitly via safeHttpGet, which provides URL validation and re-validates every redirect hop, and the resulting bytes are passed to loadXML instead.

Parameters
$sourceUri : string

Feed URL

Return values
string|null

Raw feed XML, or null if the URL is invalid, unreachable, or redirects to a private range.
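
The internals of safeHttpGet are not documented here. As a rough idea of what "URL validation" can mean in an SSRF guard, the sketch below rejects non-HTTP schemes and hosts that resolve to private or reserved addresses. The function names are hypothetical, and the sketch deliberately omits the per-hop redirect re-validation described above, so it is not a complete guard on its own.

```php
<?php
/** True if the IP is outside private and reserved ranges. */
function sketchIsPublicIp(string $ip): bool
{
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) !== false;
}

/** True if the URL uses http(s) and every resolved address is public. */
function sketchIsFetchableUrl(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    $host = trim($parts['host'], '[]'); // strip brackets from IPv6 literals
    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        return sketchIsPublicIp($host);
    }

    // Hostname: resolve it and require every address to be public.
    // Note: gethostbynamel() only returns IPv4 addresses.
    $ips = gethostbynamel($host);
    if ($ips === false || $ips === []) {
        return false;
    }
    foreach ($ips as $ip) {
        if (!sketchIsPublicIp($ip)) {
            return false;
        }
    }

    return true;
}
```

A check like this has to run again for each redirect target, since a public URL can redirect to an internal address.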

formatParsedDate()

Format parsed date array to MySQL datetime.

private formatParsedDate(array<string|int, mixed> $pubDate, int $fallback) : string
Parameters
$pubDate : array<string|int, mixed>

Parsed date array

$fallback : int

Fallback offset

Return values
string

MySQL datetime string

getFeedTagMapping()

Get tag mapping for RSS/Atom feed format.

private getFeedTagMapping(DOMDocument $rss) : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}|null
Parameters
$rss : DOMDocument

Feed document

Return values
array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}|null

Tag mapping or null if unknown format
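
To make the return shape concrete, here is a sketch that picks a mapping from the document's root element. The keys come from the documented return type, but the values (especially the Atom ones, such as summary and published) are plausible guesses rather than the mapping RssParser actually uses, and sketchFeedTagMapping() is a hypothetical name.

```php
<?php
/**
 * @return array{item: string, title: string, description: string, link: string,
 *               pubDate: string, enclosure: string, url: string}|null
 */
function sketchFeedTagMapping(DOMDocument $doc): ?array
{
    $root = $doc->documentElement;
    if ($root === null) {
        return null; // empty document
    }

    switch (strtolower((string) $root->localName)) {
        case 'rss': // RSS 2.0: <rss><channel><item>...
            return [
                'item' => 'item', 'title' => 'title', 'description' => 'description',
                'link' => 'link', 'pubDate' => 'pubDate',
                'enclosure' => 'enclosure', 'url' => 'url',
            ];
        case 'feed': // Atom: <feed><entry>... (values below are assumptions)
            return [
                'item' => 'entry', 'title' => 'title', 'description' => 'summary',
                'link' => 'link', 'pubDate' => 'published',
                'enclosure' => 'link', 'url' => 'href',
            ];
        default:
            return null; // unknown format
    }
}
```

Keeping the format differences in one lookup table lets the per-item parsing code stay identical for both feed types.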

parseFeedDate()

Parse feed date string to MySQL datetime format.

private parseFeedDate(string|null $dateStr, int $fallback) : string
Parameters
$dateStr : string|null

Date string from feed

$fallback : int

Fallback offset for ordering

Return values
string

MySQL datetime string
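
The sketch below illustrates the general shape of this conversion under several assumptions: it uses strtotime() rather than the parsed date array that formatParsedDate() works with, it formats in UTC, and it treats the fallback as a number of seconds subtracted from the current time so that undated items keep their feed order. The $now parameter and the function name are hypothetical additions that make the example testable.

```php
<?php
/**
 * Convert a feed date string to "Y-m-d H:i:s", falling back to now minus an offset.
 */
function sketchParseFeedDate(?string $dateStr, int $fallback, ?int $now = null): string
{
    $now = $now ?? time();

    $timestamp = ($dateStr === null || trim($dateStr) === '')
        ? false
        : strtotime($dateStr); // handles RFC 822 (RSS) and ISO 8601 (Atom) dates

    if ($timestamp === false) {
        // Assumption: later items get a larger offset, so they sort as older.
        $timestamp = $now - $fallback;
    }

    return gmdate('Y-m-d H:i:s', $timestamp); // assumption: stored in UTC
}
```

For example, an RSS date such as "Mon, 06 Sep 2021 12:00:00 +0000" and the Atom date "2021-09-06T12:00:00Z" both map to the same datetime string.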

parseItem()

Parse a single feed item.

private parseItem(DOMElement $node, array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string} $feedTags, int $index, string $articleSection) : array{title: string, link: string, desc: string, date: string, audio: string, text: string}|null
Parameters
$node : DOMElement

Item node

$feedTags : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}

Tag mapping

$index : int

Item index (for date fallback)

$articleSection : string

Tag for inline text extraction

Return values
array{title: string, link: string, desc: string, date: string, audio: string, text: string}|null

Parsed item or null if invalid

parseItemForDetection()

Parse item for detection mode (includes raw text content).

private parseItemForDetection(DOMElement $node, array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string} $feedTags) : array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string}
Parameters
$node : DOMElement

Item node

$feedTags : array{item: string, title: string, description: string, link: string, pubDate: string, enclosure: string, url: string}

Tag mapping

Return values
array{title: string, desc: string, link: string, encoded?: string, description?: string, content?: string}

Parsed item


        