Filtering and the Universal Feed Format
Summary:
This document is intended for anyone wanting to understand the details of the FeedSweep filtering function.
FeedSweep Universal Feed Format
Internally, the FeedSweep engine converts all feeds into a Universal Feed Format. Atom versions 1.0 and 0.3, RSS 2.0 and its variants, and the RSS/RDF protocols all share the same basic data model: a container that holds some global feed data and any number of entries.
For each protocol, the format is defined by an XML-based base schema, but it can be extended using foreign namespaces. The FeedSweep Universal Feed Format maps all the common elements to a universal feed model. Elements that do not have a direct equivalent in the other protocols are ignored.
However, while some elements of some protocols are close matches for each other, many are only approximations. For example, every protocol can define a language, but it may appear as a differently named element or as part of XML itself (for example, xml:lang). As a result, the FeedSweep Universal mapping function is inexact at best. Where several elements of a feed could match a single element of the Universal model, the candidates are read in sequence, starting with the element that matches most closely.
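The "closest match first" rule can be pictured as an ordered list of candidate paths per Universal field. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the names FEED_AUTHOR_RSS and first_match are hypothetical, and stopping at the first non-empty element is an assumption, not a description of FeedSweep's internal code.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core namespace, needed to resolve the "dc:" prefix.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical candidate list for the FeedAuthor item of an RSS feed,
# ordered closest match first. Paths are relative to the <rss> root.
FEED_AUTHOR_RSS = [
    "channel/managingEditor",
    "channel/dc:creator",
    "channel/dc:author",
]


def first_match(root, candidates):
    """Return the text of the first candidate element that has content."""
    for path in candidates:
        node = root.find(path, NS)
        if node is not None and node.text and node.text.strip():
            return node.text.strip()
    return None  # no candidate element was present


RSS_SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example</title>
    <dc:creator>Jane Doe</dc:creator>
  </channel>
</rss>"""

root = ET.fromstring(RSS_SAMPLE)
# managingEditor is absent, so the lookup falls through to dc:creator.
print(first_match(root, FEED_AUTHOR_RSS))
```

Keeping the candidates in a plain list makes the priority order explicit and easy to change per protocol.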
Universal Feed Format and Filtering
The following table shows a mapping of the FeedSweep Universal Feed against the Atom and RSS representations of the elements of their schema. The mapping is limited to those elements used in the filtering function. All data mentioned in these tables is treated as plain XML and shows up the same in all representations. Unless indicated otherwise, the XML elements in a given column are in the namespace corresponding to that column. This summary uses standard XPath notation: in particular, slashes show the element hierarchy, and an @ sign indicates an attribute of an element.
The following table shows which elements of each feed protocol map to each Feed filter item of the filtering syntax. Where a cell lists several elements, they appear in the order shown in the original mapping.
| Filter Item | Atom | RSS (2.0, 0.91, 0.92) | RDF/RSS (1.0, 0.90) |
| --- | --- | --- | --- |
| | /atom/feed | /rss/channel | /rdf:RDF or /rdf:channel |
| FeedFormat | /atom/feed | /rss/channel | /rdf:RDF or /rdf:channel |
| FeedAuthor | /atom/feed/author | /rss/channel/managingEditor; /rss/channel/dc:creator; /rss/channel/dc:author | /rdf:RDF/channel/dc:creator |
| Copyright | /atom/feed/copyright | /rss/channel/copyright; /rss/channel/dc:rights | /rdf:RDF/channel/dc:rights |
| FeedDescription | /atom/feed/subtitle | /rss/channel/description; /rss/channel/dc:description | /rdf:RDF/channel/rdf:description; /rdf:RDF/channel/dc:description |
| Language | /atom/feed/xml:lang | /rss/channel/language; /rss/channel/dc:language | /rdf:RDF/channel/dc:language |
| FeedLink | /atom/feed/link | /rss/channel/link | /rdf:RDF/channel/rdf:link |
| LastUpdatedOn | /atom/feed/updated; /atom/feed/modified (Atom 0.3) | /rss/channel/lastBuildDate; /rss/channel/pubDate; /rss/channel/dc:date | /rdf:RDF/channel/dc:date |
| FeedTags | /atom/feed/category; /atom/feed/category (Atom 0.3) | /rss/channel/category; /rss/channel/dc:subject | /rdf:RDF/channel/dc:subject |
| FeedTitle | /atom/feed/title | /rss/channel/title; /rss/channel/dc:title | /rdf:RDF/channel/rdf:title; /rdf:RDF/channel/dc:title |
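The root elements in the first rows of this table are enough to tell the three feed families apart. Here is a minimal sketch, assuming Python's standard xml.etree.ElementTree module. The function name detect_family and the labels it returns are illustrative only and are not FeedSweep's actual FeedFormat values; the namespace URIs are the standard ones for Atom 1.0, Atom 0.3, and RDF.

```python
import xml.etree.ElementTree as ET

ATOM_10 = "http://www.w3.org/2005/Atom"
ATOM_03 = "http://purl.org/atom/ns#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def detect_family(xml_text):
    """Classify a feed by its root element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    tag = root.tag  # ElementTree writes namespaced tags as "{uri}name"
    if tag in ("{%s}feed" % ATOM_10, "{%s}feed" % ATOM_03):
        return "Atom"
    if tag == "rss":
        return "RSS"
    if tag == "{%s}RDF" % RDF:
        return "RDF/RSS"
    return None  # not a feed family covered by the table


print(detect_family('<feed xmlns="http://www.w3.org/2005/Atom"/>'))
print(detect_family('<rss version="2.0"><channel/></rss>'))
```

Once the family is known, the matching column of the table tells the engine which candidate elements to read for each filter item.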
The following table shows which elements of each feed protocol map to each Entry (or Article) filter item of the filtering syntax.
| Filter Item | Atom | RSS (2.0, 0.91, 0.92) | RDF/RSS (1.0, 0.90) |
| --- | --- | --- | --- |
| | /atom/feed/entry | /rss/channel/item | /rdf:RDF/item |
| Author | /atom/feed/entry/author[name] | /rss/channel/item/dc:creator; /rss/channel/item/dc:author | /rdf:RDF/item/dc:creator; /rdf:RDF/item/dc:author |
| Description | /atom/feed/entry/content | /rss/channel/item/description; /rss/channel/item/dc:description | /rdf:RDF/item/description; /rdf:RDF/item/dc:description |
| ID | /atom/feed/id | /rss/channel/item/guid | /rdf:RDF/item/@rdf:about |
| Link | /atom/feed/entry/link[href] | /rss/channel/item/link | /rdf:RDF/item/link |
| PublishedOn | /atom/feed/entry/published; /atom/feed/entry/issued (Atom 0.3) | /rss/channel/item/dcterms:issued | /rdf:RDF/item/dcterms:issued |
| UpdatedOn | /atom/feed/entry/updated; /atom/feed/entry/modified (Atom 0.3) | /rss/channel/item/pubDate; /rss/channel/item/dc:date; /rss/channel/item/dcterms:modified | /rdf:RDF/item/dc:date; /rdf:RDF/item/dcterms:modified |
| Tags | /atom/feed/entry/summary; /atom/feed/dc:subject | /rss/channel/item/category; /rss/channel/item/dc:subject | /rdf:RDF/channel/item/dc:subject |
| Title | /atom/feed/entry/title | /rss/channel/item/title; /rss/channel/item/dc:title | /rdf:RDF/item/title; /rdf:RDF/item/dc:title |
Date Handling
Date handling is the most contentious issue when working with feeds. The first problem is that a large number of feeds, if not the majority, are published with incorrectly formatted dates. Our educated guess is that date formatting is simply too complicated.
The net result is that the FeedSweep engine has to go through several steps to try to obtain Filter Item dates from feeds:
- Attempt to parse the first potential date element matched to the Filter Item
- If the parse fails, attempt all other candidate elements in succession, continuing until one succeeds
- If the parse still fails, try reading the date using the following RFC date representations:
| Atom | RSS (2.0, 0.91, 0.92) | RDF/RSS (1.0, 0.90) |
| --- | --- | --- |
| RFC 3339 | RFC 822 | RFC 822 |
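The fallback steps above can be sketched as follows. This is a simplified illustration using only the Python standard library, not FeedSweep's implementation: the names read_date and PARSERS are hypothetical, the steps are collapsed into a single loop that applies the protocol's RFC format to each candidate in turn, and datetime.fromisoformat accepts only a subset of RFC 3339 on older Python versions.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_rfc3339(text):
    """Parse a common RFC 3339 timestamp such as 2008-12-10T14:30:00Z."""
    text = text.strip()
    # Map a trailing "Z" to an explicit offset so older Python 3
    # versions of fromisoformat accept it.
    if text[-1:] in ("Z", "z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def parse_rfc822(text):
    """Parse an RFC 822 date such as Wed, 10 Dec 2008 14:30:00 GMT."""
    return parsedate_to_datetime(text.strip())


# One parser per column of the table above.
PARSERS = {"Atom": parse_rfc3339, "RSS": parse_rfc822, "RDF/RSS": parse_rfc822}


def read_date(candidates, family):
    """Try each candidate string in order; return the first that parses."""
    parser = PARSERS[family]
    for text in candidates:
        if not text:
            continue
        try:
            return parser(text)
        except (TypeError, ValueError):
            continue  # malformed date, move on to the next candidate
    return None  # no candidate could be parsed


# The first candidate is malformed, so the second one is used.
print(read_date(["10th of December", "2008-12-10T14:30:00Z"], "Atom"))
```

Returning None when nothing parses leaves it to the caller to decide how an undated entry should be filtered or sorted.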
The second problem with feed dates is the uncertainty between published dates and updated dates. It would make sense for an updated date to supersede a published date, but many feeds do not distinguish between the two. Here again there is much confusion about what these elements mean.
FeedSweep assumes the published date (the PublishedOn Filter Item) is the most accurate and up-to-date element. This is particularly relevant for sorting: all sorting is done on the published date, and the updated date (the UpdatedOn Filter Item) is ignored.
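To illustrate the effect on sorting, here is a small sketch in Python. The dictionary keys reuse the Filter Item names, but the data and the newest-first order are assumptions made for the example; the article does not state the sort direction.

```python
from datetime import datetime, timezone


def day(d):
    """Build a UTC timestamp in December 2008 (sample data helper)."""
    return datetime(2008, 12, d, tzinfo=timezone.utc)


# Hypothetical entries; "C" has the most recent update but not the
# most recent publication date.
entries = [
    {"Title": "A", "PublishedOn": day(1), "UpdatedOn": day(9)},
    {"Title": "B", "PublishedOn": day(5), "UpdatedOn": day(5)},
    {"Title": "C", "PublishedOn": day(3), "UpdatedOn": day(10)},
]

# Sort on PublishedOn only; UpdatedOn plays no part in the ordering.
newest_first = sorted(entries, key=lambda e: e["PublishedOn"], reverse=True)
print([e["Title"] for e in newest_first])
```

Sorting on UpdatedOn instead would put "C" first, which is why the choice of date element matters for feeds that populate both.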
Posted by:
Admin, on
12/10/2008, in category "Questions and Answers"