Filtering and the Universal Feed Format

Summary: This document is intended for anyone wanting to understand the details of the FeedSweep filtering function.


FeedSweep Universal Feed Format

Internally, the FeedSweep engine converts all feeds into a Universal Feed Format. Atom (versions 1.0 and 0.3), RSS 2.0 and its variants, and the RDF/RSS protocols all share the same basic data model: a container that holds some global feed data and any number of entries.

For each protocol, the format is defined by an XML-based base schema, but it can be extended using foreign namespaces. The FeedSweep Universal Feed Format maps all the common elements to a universal feed model. Elements that do not have a direct equivalent in the other protocols are ignored.

However, while some elements are close matches across protocols, many are only approximations. For example, every protocol can declare a language, but it may appear as differently named elements or as part of the parent XML protocol (such as xml:lang). As a result, the FeedSweep Universal mapping function is inexact at best. When several elements of a feed could map to the same element of the Universal model, they are read in sequence, starting with the element that matches most closely.
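The "closest match first" lookup can be sketched as an ordered list of candidate paths that is walked until one yields a value. This is only an illustration, not FeedSweep's implementation: the names `first_match` and `LANGUAGE_CANDIDATES` and the sample feed are assumptions, and the candidate order is taken from the Language row for RSS in the mapping table.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the Dublin Core elements used in the candidate paths.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical candidate list for the Language item of an RSS feed,
# ordered from the closest match to the loosest one.
LANGUAGE_CANDIDATES = ["channel/language", "channel/dc:language"]


def first_match(root, candidates):
    """Return the text of the first candidate element that has content."""
    for path in candidates:
        element = root.find(path, NS)
        if element is not None and element.text and element.text.strip():
            return element.text.strip()
    return None


RSS_SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example</title>
    <dc:language>en-us</dc:language>
  </channel>
</rss>"""

root = ET.fromstring(RSS_SAMPLE)  # root is the <rss> element
language = first_match(root, LANGUAGE_CANDIDATES)
print(language)
```

The sample feed has no plain language element, so the lookup falls through to the Dublin Core element.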

Universal Feed Format and Filtering

The following tables map the FeedSweep Universal Feed Format to the Atom and RSS representations of the elements of their schemas. The mapping is limited to the elements used by the filtering function. All data mentioned in these tables is treated as plain XML and appears the same in all representations. Unless indicated otherwise, the XML elements in a given column are in the namespace corresponding to that column. This summary uses standard XPath notation: slashes show the element hierarchy, and an @ sign indicates an attribute of an element.

The following table shows which elements of each feed protocol map to each Feed filter item of the filtering syntax.

| Filter Item | Atom | RSS (2.0, 0.91, 0.92) | RDF/RSS (1.0, 0.90) |
|---|---|---|---|
| (feed element) | /atom/feed | /rss/channel | /rdf:RDF or /rdf:channel |
| FeedFormat | /atom/feed | /rss/channel | /rdf:RDF or /rdf:channel |
| FeedAuthor | /atom/feed/author | /rss/channel/managingEditor | /rdf:RDF/channel/dc:creator |
| | | /rss/channel/dc:creator | |
| | | /rss/channel/dc:author | |
| Copyright | /atom/feed/copyright | /rss/channel/copyright | /rdf:RDF/channel/dc:rights |
| | | /rss/channel/dc:rights | |
| FeedDescription | /atom/feed/subtitle | /rss/channel/description | /rdf:RDF/channel/rdf:description |
| | | /rss/channel/dc:description | /rdf:RDF/channel/dc:description |
| Language | /atom/feed/xml:lang | /rss/channel/language | /rdf:RDF/channel/dc:language |
| | | /rss/channel/dc:language | |
| FeedLink | /atom/feed/link | /rss/channel/link | /rdf:RDF/channel/rdf:link |
| LastUpdatedOn | /atom/feed/updated | /rss/channel/lastBuildDate | /rdf:RDF/channel/dc:date |
| | /atom/feed/modified (Atom 0.3) | /rss/channel/pubDate | |
| | | /rss/channel/dc:date | |
| FeedTags | /atom/feed/category | /rss/channel/category | /rdf:RDF/channel/dc:subject |
| | /atom/feed/category (Atom 0.3) | /rss/channel/dc:subject | |
| FeedTitle | /atom/feed/title | /rss/channel/title | /rdf:RDF/channel/rdf:title |
| | | /rss/channel/dc:title | /rdf:RDF/channel/dc:title |
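Because each protocol has a different root element, the protocol column that applies to a feed can be chosen by inspecting the root of the parsed document. The sketch below is an assumption about how that could be done, not a description of the FeedSweep engine; `detect_format` is a hypothetical helper, and the namespace URIs are the published ones for Atom 1.0, Atom 0.3 and RDF.

```python
import xml.etree.ElementTree as ET

ATOM_10 = "http://www.w3.org/2005/Atom"
ATOM_03 = "http://purl.org/atom/ns#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def detect_format(xml_text):
    """Return the table column ("Atom", "RSS" or "RDF/RSS") for a feed."""
    # ElementTree reports namespaced tags as "{namespace-uri}localname".
    tag = ET.fromstring(xml_text).tag
    if tag in ("{%s}feed" % ATOM_10, "{%s}feed" % ATOM_03):
        return "Atom"
    if tag == "rss":
        return "RSS"
    if tag == "{%s}RDF" % RDF:
        return "RDF/RSS"
    return None  # not a feed format covered by the table


print(detect_format('<rss version="2.0"><channel/></rss>'))
print(detect_format('<feed xmlns="%s"/>' % ATOM_10))
```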

The following table shows which elements of each feed protocol map to each Entry (or Article) filter item of the filtering syntax.

| Filter Item | Atom | RSS (2.0, 0.91, 0.92) | RDF/RSS (1.0, 0.90) |
|---|---|---|---|
| (entry element) | /atom/feed/entry | /rss/channel/item | /rdf:RDF/item |
| Author | /atom/feed/entry/author[name] | /rss/channel/item/dc:creator | /rdf:RDF/item/dc:creator |
| | | /rss/channel/item/dc:author | /rdf:RDF/item/dc:author |
| Description | /atom/feed/entry/content | /rss/channel/item/description | /rdf:RDF/item/description |
| | | /rss/channel/item/dc:description | /rdf:RDF/item/dc:description |
| ID | /atom/feed/id | /rss/channel/item/guid | /rdf:RDF/item/@rdf:about |
| Link | /atom/feed/entry/link[href] | /rss/channel/item/link | /rdf:RDF/item/link |
| PublishedOn | /atom/feed/entry/published | /rss/channel/item/dcterms:issued | /rdf:RDF/item/dcterms:issued |
| | /atom/feed/entry/issued (Atom 0.3) | | |
| UpdatedOn | /atom/feed/entry/updated | /rss/channel/item/pubDate | /rdf:RDF/item/dc:date |
| | /atom/feed/entry/modified (Atom 0.3) | /rss/channel/item/dc:date | /rdf:RDF/item/dcterms:modified |
| | | /rss/channel/item/dcterms:modified | |
| Tags | /atom/feed/entry/summary | /rss/channel/item/category | /rdf:RDF/channel/item/dc:subject |
| | /atom/feed/dc:subject | /rss/channel/item/dc:subject | |
| Title | /atom/feed/entry/title | /rss/channel/item/title | /rdf:RDF/item/title |
| | | /rss/channel/item/dc:title | /rdf:RDF/item/dc:title |
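As an illustration of the Atom column, the sketch below reads the Title, Author and Link items from each entry of a small Atom 1.0 feed. It assumes that author[name] refers to the name child of the author element and that link[href] refers to the href attribute of the link element; `read_entry` and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix used in the lookup paths for the Atom 1.0 namespace.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>First post</title>
    <author><name>Jane Doe</name></author>
    <link href="http://example.org/first"/>
  </entry>
</feed>"""


def read_entry(entry):
    """Map one Atom entry element to a dict keyed by filter item name."""
    link = entry.find("atom:link", ATOM)
    return {
        "Title": entry.findtext("atom:title", default=None, namespaces=ATOM),
        # Assumption: the Author item is the text of author/name.
        "Author": entry.findtext("atom:author/atom:name", default=None, namespaces=ATOM),
        # Assumption: the Link item is the href attribute, not the element text.
        "Link": link.get("href") if link is not None else None,
    }


feed = ET.fromstring(ATOM_SAMPLE)
entries = [read_entry(e) for e in feed.findall("atom:entry", ATOM)]
print(entries)
```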

 

Date Handling

The handling of dates is the most contentious issue when working with feeds. The first problem is that a large number of feeds, if not the majority, are created with date formats that are incorrect. Our educated guess is that date formatting is just plain too complicated.

The net result is that the FeedSweep engine has to go through several steps to obtain Filter Item dates from feeds:

  1. Attempt to parse the first potential date element mapped to the Filter Item.
  2. If that parse fails, try all other mapped elements in succession until one succeeds.
  3. If every parse fails, try reading the date using the following RFC date representations:

 

| Atom | RSS (2.0, 0.91, 0.92) | RDF/RSS (1.0, 0.90) |
|---|---|---|
| RFC 3339 | RFC 822 | RFC 822 |
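A simplified sketch of this fallback is shown below. It collapses the steps into one loop: each candidate date string is tried in order with the RFC representation listed for the feed's protocol, and the first successful parse wins. The helper names (`parse_feed_date`, `PARSERS`) are assumptions, and the parsers come from the Python standard library rather than from FeedSweep.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_rfc3339(text):
    """Parse an RFC 3339 timestamp such as 2008-12-10T08:30:00Z."""
    text = text.strip()
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    if text[-1:] in ("Z", "z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def parse_rfc822(text):
    """Parse an RFC 822 date such as Wed, 10 Dec 2008 08:30:00 GMT."""
    return parsedate_to_datetime(text.strip())


# Date representation per protocol, as listed in the table above.
PARSERS = {"Atom": parse_rfc3339, "RSS": parse_rfc822, "RDF/RSS": parse_rfc822}


def parse_feed_date(candidates, protocol):
    """Return the first candidate string that parses, or None if none do."""
    parser = PARSERS[protocol]
    for raw in candidates:
        if not raw:
            continue  # element missing or empty, move on to the next one
        try:
            return parser(raw)
        except (TypeError, ValueError):
            continue  # malformed date, try the next candidate element
    return None


published = parse_feed_date(["not a date", "Wed, 10 Dec 2008 08:30:00 GMT"], "RSS")
print(published)
```

Returning None instead of raising leaves it to the caller to decide what to do with an entry that has no usable date.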

The second problem when working with feed dates is the uncertainty between published dates and updated dates. It would make sense for an updated date to supersede a published date, but many feeds do not distinguish between the two. In fact, there is again much confusion regarding these elements.

FeedSweep assumes the published date (PublishedOn Filter Item) is the most accurate and up-to-date element. This is particularly relevant when it comes to sorting: all sorting is done on the published date, and the updated date (UpdatedOn Filter Item) is ignored.
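The sorting rule can be sketched as follows. The entry dictionaries, the `sort_by_published` helper, the newest-first default and the choice to place undated entries last are all assumptions made for illustration; the only behaviour taken from the text is that PublishedOn is the sort key and UpdatedOn is never consulted.

```python
from datetime import datetime, timezone


def sort_by_published(entries, newest_first=True):
    """Order entries by PublishedOn only; UpdatedOn is deliberately ignored."""
    dated = [e for e in entries if e.get("PublishedOn") is not None]
    undated = [e for e in entries if e.get("PublishedOn") is None]
    dated.sort(key=lambda e: e["PublishedOn"], reverse=newest_first)
    # Assumption: entries without a published date go to the end.
    return dated + undated


utc = timezone.utc
entries = [
    {"Title": "A", "PublishedOn": datetime(2008, 12, 1, tzinfo=utc),
     "UpdatedOn": datetime(2008, 12, 9, tzinfo=utc)},
    {"Title": "B", "PublishedOn": datetime(2008, 12, 5, tzinfo=utc),
     "UpdatedOn": None},
    {"Title": "C", "PublishedOn": None,
     "UpdatedOn": datetime(2008, 12, 10, tzinfo=utc)},
]

ordered = [e["Title"] for e in sort_by_published(entries)]
print(ordered)
```

Entry A has the most recent update, but it still sorts after B because only the published date is compared.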


Posted by: Admin, on 12/10/2008, in category "Questions and Answers"