**I know most of my blog readers are not programmers, but this type of thing still might be interesting in a theoretical way.
I’ve been writing a news feed aggregator as a side project for a few months now and I’d like to share some of what I’ve learned. In particular, the most recent change that I made was implementing support for the rssCloud protocol. This protocol has been around for a long time, and its purpose is very simple: provide a way for apps that generate RSS feeds to “ping” the readers of their feeds when the feeds add new items. Think of it as paleo-Twitter. If you write aggregator code, you should definitely implement it. It makes the web a better place.
While the concept is simple, the implementation ended up being quite a chore. This was mostly because my aggregator is running on an Amazon EC2 micro-instance (600MB of RAM, 8GB hard drive, 1 CPU core). That’s a very limited environment when you’re already running a MySQL database and a copy of Apache. The system currently has close to 700 feeds in it, of which roughly 250 have a cloud tag. When you drop in the potential for 250 publishers to “ping” you at any given moment, you have to be able to handle that without crashing. But, no matter how you slice it, a micro-instance can’t handle that. It was this issue that took most of the planning.
What was needed was to spread the incoming traffic out as widely as possible, and to keep ping subscriptions to the minimum needed to satisfy users. I think I mostly achieved that, although time will tell.
So, here are the high points of the implementation as it’s running now:
- Only subscribe to cloud pings for feeds that have been updated within the last 14 days. There’s no reason to keep re-registering every day for instant updates to feeds that rarely update. Once the feed updates, we’ll subscribe to it again.
- Only subscribe to cloud pings for feeds that have a subscriber count greater than zero.
- The CGI handler for incoming pings simply flags the feed as “updated” in the feed table. No other processing is done during a ping.
- A scanner runs once per minute in a cron job and grabs fresh content for any feeds marked with the “updated” flag. It then un-flags the feeds it updated. We use locking to make sure the scanner doesn’t run on top of itself.
- A feed’s cloud ping subscription is renewed only if its “last rsscloud reg time” is more than 20 hours ago.
- Renewals are randomized within a four hour window to ensure that we don’t have a large batch of incoming subscription challenges all at once.
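The subscription rules in the list above can be sketched in a few lines of Python. This is only an illustration, not the aggregator’s actual code: the `Feed` fields, the function names, and the way the four-hour randomization is applied (a uniform random delay) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACTIVE_WINDOW = timedelta(days=14)   # feed must have updated this recently
RENEW_AFTER = timedelta(hours=20)    # minimum age of the last registration
JITTER_WINDOW = timedelta(hours=4)   # renewals are spread across this window


@dataclass
class Feed:
    # Hypothetical columns; the real feed table is not shown in the post.
    url: str
    has_cloud_tag: bool
    subscriber_count: int
    last_updated: datetime
    last_cloud_reg: Optional[datetime]  # None if never registered


def wants_cloud_subscription(feed: Feed, now: datetime) -> bool:
    """Only active feeds that somebody actually reads get a subscription."""
    return (
        feed.has_cloud_tag
        and feed.subscriber_count > 0
        and now - feed.last_updated <= ACTIVE_WINDOW
    )


def renewal_due(feed: Feed, now: datetime) -> bool:
    """Renew only when the last registration is more than 20 hours old."""
    if not wants_cloud_subscription(feed, now):
        return False
    if feed.last_cloud_reg is None:
        return True
    return now - feed.last_cloud_reg > RENEW_AFTER


def jittered_renewal_time(now: datetime, rng=random) -> datetime:
    """Pick a random moment in the next four hours (assumed uniform)."""
    delay = rng.uniform(0, JITTER_WINDOW.total_seconds())
    return now + timedelta(seconds=delay)
```

Spreading the renewals this way means the subscription challenges that come back from the cloud servers arrive as a trickle rather than all at once.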
All of this means that incoming pings are effectively queued and then handled the next time the scanner gets a chance, which is normally once per minute. In this context I consider a one-minute ping time to be more than acceptable. It might not be as fast as something like the Twitter API, but it’s pretty darn close.
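Here is a rough sketch of that flag-then-scan queue in Python. It is an illustration under stated assumptions, not the real implementation: it uses an in-memory SQLite table in place of MySQL, the table and column names are made up, the `fetch` callback stands in for whatever actually downloads a feed, and the lock is a Unix `flock` on a hypothetical lock file.

```python
import fcntl
import os
import sqlite3
import tempfile
from contextlib import contextmanager

# Hypothetical lock file location for the once-a-minute scanner.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "feed-scanner.lock")


def make_db(urls):
    """Build a tiny stand-in for the feed table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feeds (url TEXT PRIMARY KEY, updated INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO feeds (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()
    return conn


def handle_ping(conn, feed_url):
    """Ping handler: flag the feed and do nothing else."""
    conn.execute("UPDATE feeds SET updated = 1 WHERE url = ?", (feed_url,))
    conn.commit()


@contextmanager
def scanner_lock(path=LOCK_PATH):
    """Yield True if we hold the lock, False if another scanner run has it."""
    with open(path, "w") as fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        try:
            yield acquired
        finally:
            if acquired:
                fcntl.flock(fh, fcntl.LOCK_UN)


def scan_once(conn, fetch, lock_path=LOCK_PATH):
    """Fetch every flagged feed, then clear its flag. Returns the URLs handled."""
    with scanner_lock(lock_path) as acquired:
        if not acquired:
            return []  # a previous run is still going; skip this minute
        rows = conn.execute("SELECT url FROM feeds WHERE updated = 1").fetchall()
        urls = [row[0] for row in rows]
        for url in urls:
            fetch(url)
            conn.execute("UPDATE feeds SET updated = 0 WHERE url = ?", (url,))
        conn.commit()
        return urls
```

A nice side effect of using a flag instead of a real queue is that a burst of pings for the same feed collapses into a single fetch on the next scanner run.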
I’ll post more as I learn more. And this isn’t meant to be a how-to. It’s just a log of what I’m learning and think might be of value to others wanting to write an aggregator.