The two dominant news syndication formats (and the only ones I bother to parse) are RSS 2.0 and Atom. From what I've seen so far, Atom is mostly well-behaved because of its stricter set of item-level required tags. I also get the feeling (although I can't be sure of this) that there are fewer Atom libraries out in the wild than RSS libraries, which would make things more consistent and predictable on the Atom side.
RSS is a different beast, however. It's a beautifully loose specification, in the sense that you really can customize its structure to the format of the content you want to deliver. It doesn't assume it knows what your content is. That's really nice for the content deliverers, but not so easy on the consumption side. Here are a few things that have tripped me up and how I solved them.
The RSS 2.0 spec has this to say about items:

> A channel may contain any number of `<item>`s. An item may represent a "story" — much like a story in a newspaper or magazine; if so its description is a synopsis of the story, and the link points to the full story. An item may also be complete in itself, if so, the description contains the text (entity-encoded HTML is allowed; see examples), and the link and title may be omitted. All elements of an item are optional, however at least one of title or description must be present.
That means that this is a valid feed:
```xml
<rss version="2.0">
  <channel>
    <title>Test Feed</title>
    <link>http://test.com</link>
    <description>My little test feed.</description>
    <item>
      <description>This is a test.</description>
      <enclosure url="http://test.com/post4/file.mp3" length="0" type="audio/mpeg" />
    </item>
    <item>
      <description>This is a test.</description>
      <link>http://test.com/post3</link>
    </item>
    <item>
      <description>This is a test.</description>
    </item>
    <item>
      <description>This is a test.</description>
      <guid>http://test.com/post2</guid>
    </item>
    <item>
      <description>This is a test.</description>
      <link>http://test.com/post1</link>
    </item>
  </channel>
</rss>
```
This might look like a stupid example that we wouldn't want to see in our feed reader anyway. But imagine those aren't blog posts. Imagine that's a USGS seismograph feed. You'd want to see all the duplicates if they were actually new posts, and not duplication errors. That's tough, though. Since the only requirement for an item is to have one of either description or title, there's no way to tell whether these are meant to be separate posts without tracking duplicates on every potential element an item could contain. That would be very messy.
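To see the problem concretely, here's a quick sketch (using Python's standard-library `ElementTree`, purely for illustration) of what a parser is handed when two items carry nothing but identical descriptions:

```python
import xml.etree.ElementTree as ET

FEED = """<rss version="2.0"><channel>
  <title>Test Feed</title>
  <link>http://test.com</link>
  <description>My little test feed.</description>
  <item><description>This is a test.</description></item>
  <item><description>This is a test.</description></item>
</channel></rss>"""

items = ET.fromstring(FEED).findall("./channel/item")
print(len(items))  # → 2

for item in items:
    # No guid, link, or title to tell the items apart -- only the description,
    # and the descriptions are byte-identical.
    print(item.findtext("guid"), item.findtext("description"))
```

Both items are perfectly valid per the spec, and nothing in the XML distinguishes them.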
The solution I settled on is to use the RSS guid element as-is if it's defined in the item. If it isn't, we create one by hashing the string representation of the entire item into a 40-character hash. All we care about here is speed, so we use SHA-1. We then put a unique index on the feed items table over the combination of "feed id" and "guid". This ensures that guids, whether defined or hashed, are unique within their own feeds.
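The guid fallback might look something like this sketch (the `item_guid` name is mine, and this assumes items arrive as `ElementTree` elements; the unique index itself would live in the database schema, e.g. something like `CREATE UNIQUE INDEX ... ON feed_items (feed_id, guid)`):

```python
import hashlib
import xml.etree.ElementTree as ET

def item_guid(item: ET.Element) -> str:
    """Return the item's <guid> if present; otherwise a SHA-1 hash of the item."""
    guid = item.findtext("guid")
    if guid:
        return guid.strip()
    # No guid defined: hash the serialized item. Byte-identical items will
    # collide -- which is exactly the dedup behavior described above.
    raw = ET.tostring(item)
    return hashlib.sha1(raw).hexdigest()  # 40 hex characters
```

Note that two byte-identical items hash to the same guid, so the unique index silently collapses them; that's the trade-off versus tracking duplicates across every element.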
Pub Dates That Lie
Another issue to deal with is publish dates for feed items. Since "pubDate" isn't required in items, we have to generate one if it's missing, using the channel's pubDate, or the current time if that's missing too. No big deal. But what you see in the real world of crappy CMSs is insane pubDates. I've seen brand-new items with a pubDate of two days ago when the description content on the item lists today as the post date. What's worse is that some feeds mix up their time zone settings and constantly show pubDates in the future, sometimes by a day or more.
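The fallback chain is simple enough to sketch. This assumes RFC 822 date strings (what RSS pubDates are supposed to be) and uses the standard library's parser; `effective_pub_date` is a hypothetical name, not anything from a real feed library:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def effective_pub_date(item_pub: "str | None", channel_pub: "str | None") -> datetime:
    """Fallback chain: item pubDate -> channel pubDate -> current time."""
    for raw in (item_pub, channel_pub):
        if not raw:
            continue
        try:
            return parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            continue  # malformed date string: fall through to the next source
    return datetime.now(timezone.utc)
```

Malformed dates get the same treatment as missing ones here, which feeds from broken CMSs make more or less mandatory.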
To deal with this, we chose to determine newness and sort order based on the time the post shows up in the feed. We still track and show pubDate if one exists. Even bad ones. But we mostly ignore it for purposes of sorting and displaying feed items. This "time added" timestamp becomes gospel.
A lot of CMSs simply query their database for the most recent X posts and build their RSS feed from that. This means that if you delete a post somewhere in that range, an older post that had previously rolled off the feed will come back. From what I've seen, this is very common. It isn't a problem if you choose to honor the "pubDate" element for sorting. But since we sort based on "time added", these old items reappear at the top of the list as if they were newly published. That's no good.
One way you could fix this is to simply never delete items as they roll off their respective feeds. But when you're dealing with hundreds of feeds, that's really not doable. Database size would become a serious issue.
The fix we chose instead was adding a pubDate check to all new items being added. If we think an item is new, we check its pubDate. If the pubDate is more than 36 hours older than the current time, we set "time added" equal to the pubDate, which sorts the item back in time where it belongs.
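The rule reduces to a few lines. This is a minimal sketch of that check, with hypothetical names (`time_added_for`, `BACKDATE_THRESHOLD`) and the 36-hour cutoff from above:

```python
from datetime import datetime, timedelta, timezone

BACKDATE_THRESHOLD = timedelta(hours=36)

def time_added_for(pub_date: "datetime | None", now: datetime) -> datetime:
    """Use `now` as the sort timestamp, unless pubDate is suspiciously old."""
    if pub_date is not None and now - pub_date > BACKDATE_THRESHOLD:
        # Probably an old post that reappeared after a deletion upstream:
        # sort it back in time instead of surfacing it as new.
        return pub_date
    return now
```

Items with a recent (or slightly future) pubDate still get `now`, so the timezone-confused feeds from earlier don't float above genuinely new posts.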
But there's another problem: what if there's no pubDate in the item to compare against? Since we can't assume RSS feeds are structured in reverse chronological order, and since pubDate isn't required, this is a possible scenario. Actually, I have no answer for this at the moment. And, luckily, I haven't seen it happen yet. If anyone can think of a solution, I'd love to hear it.