Building notes, projects, and occasional rants

P2P? Think NNTP...

Celso wrote about using P2P technology to distribute content, in the context of blogging.

There are two types of content that you might want to distribute: the syndication feeds, and the HTML pages.

Syndication feeds are a hot topic right now, especially in the Feedmesh group. Feedmesh already has notification distribution, and it seems to be working great. Over the past two weeks there were some messages on the Feedmesh mailing list about full-content distribution of feeds.

My personal view about syndication feed distribution is well known to Rui and Nuno, who are forced to listen to me every time the subject comes up. I’m a firm believer that NNTP is the way to go for distributing RSS/Atom feeds. Notice that I’m not talking about Usenet, but the NNTP protocol itself.

If you use NNTP to flood your content network, and write the files in an easy-to-export layout so that you can put any HTTP server on top of it and use that as a syndication proxy, you have a basic Content Distribution Network (CDN) for small files.
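The "easy-to-export layout" could be as simple as mirroring the feed URL on disk. A minimal sketch of that mapping (the spool directory and the `.gz` suffix are my own conventions, not anything standard):

```python
from urllib.parse import urlparse

def feed_path(feed_url: str, root: str = "/var/spool/feeds") -> str:
    """Map a feed URL to an on-disk path that mirrors the URL, so a plain
    HTTP server with its document root at `root` can serve the cached
    copy directly. Files are stored pre-gzipped, hence the .gz suffix."""
    parts = urlparse(feed_url)
    path = parts.path.lstrip("/") or "index"  # bare hostnames get a default name
    return f"{root}/{parts.hostname}/{path}.gz"
```

With a layout like this, any node in the flood becomes a syndication proxy just by pointing an HTTP server at the spool directory.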

There are a lot of small issues that can be improved. There is no need for each node to keep a copy of every feed. If the HTTP server's logs are used as a lazy-subscription mechanism (falling back to a plain HTTP proxy when a feed is not found locally, but signaling the NNTP server that it should accept that feed from now on), you have an organic beast that follows the trends of your client base. You can also expire old feeds that nobody seems to be reading, or expire only the feed body itself but keep the metadata of feeds that are not being updated (so you can still answer 304 to your clients). We can (and should) store the feed data as gzipped files, so there is no need to gzip on the fly.
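The cache behaviour described above can be sketched in a few lines. This is an illustrative in-memory model, not a real server; the class and method names are mine:

```python
import gzip

class FeedCache:
    """Sketch of the lazy-subscription node: miss -> proxy and subscribe,
    expired body but kept metadata -> proxy, fresh metadata -> 304,
    hit -> serve the pre-gzipped body."""

    def __init__(self):
        self.store = {}        # url -> (mtime, gzipped body or None)
        self.subscribed = set()

    def request(self, url, if_modified_since=None):
        if url not in self.store:
            # Cache miss: act as a plain HTTP proxy (fetch elided here)
            # and signal the NNTP side to accept this feed from now on.
            self.subscribed.add(url)
            return ("PROXY", None)
        mtime, body = self.store[url]
        if if_modified_since is not None and mtime <= if_modified_since:
            return ("304", None)   # answered from metadata alone
        if body is None:
            return ("PROXY", None)  # body expired, metadata kept
        return ("200", body)        # already gzipped, no gzip-on-the-fly

    def store_from_nntp(self, url, mtime, raw_bytes):
        # Articles flooded over NNTP are compressed once at write time.
        self.store[url] = (mtime, gzip.compress(raw_bytes))
```

The point of keeping metadata after expiring a body is exactly the 304 path: for a feed that stopped updating, the node keeps answering conditional requests without holding the content at all.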

You can point out a lot of drawbacks to this system: the publisher loses readership statistics, you can’t do password-protected feeds (at least not in the traditional HTTP Basic authentication sense), among others. But those are the tradeoffs we make to save everyone a lot of traffic.

A side effect I personally care about is that if this kind of infrastructure were available, we could extend Feedmesh to the syndication clients. The current Feedmesh is only useful between the major syndication aggregation services, because it sends all the feeds that were modified. But if each node in this CDN also provided a filtered Feedmesh feed based on the subscriptions of its clients, then those clients could drop all the polling and switch to trigger-based content retrieval if they so wished (care should be taken to prevent a swarm of requests, but that could be done by controlling the stream of updates sent by this service).
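The filtered stream is just a per-client view of the firehose. A toy sketch of what a node would keep (again, hypothetical names, and the rate-limiting a real node would need is only noted in a comment):

```python
class FilteredMesh:
    """Hypothetical filtered view of the Feedmesh update stream: each
    client registers its subscriptions and only sees pings for feeds
    it actually reads, so it can stop polling entirely."""

    def __init__(self):
        self.subs = {}  # client id -> set of subscribed feed URLs

    def subscribe(self, client, feed_url):
        self.subs.setdefault(client, set()).add(feed_url)

    def on_update(self, feed_url):
        # Clients to notify for this ping. A real node would pace this
        # stream to avoid triggering a swarm of simultaneous fetches.
        return [c for c, feeds in self.subs.items() if feed_url in feeds]
```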

The question is who will support this. Well, the optimist inside me would like to believe that ISPs would be interested in providing a better service to their customers: if they had a simple system that just worked, they would install it on their own networks and give it to their clients. The advantages are that they would save some bandwidth (negligible in a world of BitTorrent, I know) and give better service to their customers.

But the main push, the bulk collection of feeds, seems to be a perfect fit for the current syndication aggregation services. They already need to do it, because their business model is based on having the freshest content, so each one of them fetches the changed feeds whenever we ping them. If they start pushing that content via NNTP, we just might have the seed we need to implement something like this.
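Pushing a changed feed over NNTP just means wrapping it in an article. A sketch of what an aggregator might post (the newsgroup name and the use of the Subject header to carry the feed URL are my own conventions, not part of any standard):

```python
import uuid

def make_article(feed_url: str, body: str, group: str = "feeds.updates") -> str:
    """Build the text of an NNTP article carrying one changed feed.
    Headers and body are separated by a blank line, lines end in CRLF."""
    headers = [
        f"Newsgroups: {group}",
        f"Subject: {feed_url}",                 # convention: feed URL as Subject
        "From: pusher@aggregator.example",
        f"Message-ID: <{uuid.uuid4()}@aggregator.example>",
        "Content-Type: application/atom+xml",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body
```

An aggregator would post an article like this whenever a pinged feed actually changed, and the flood takes care of the rest.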

The HTML distribution part is different, and I think we are starting to see the solution: Google Web Cache.

Google is happy to do that service for you, to be your Coral network. It seems obvious to me that Google’s value is directly related to the freshness of its own index, so it wants your HTML as soon as it changes. If Google taps the Feedmesh notification stream, it gains two things: it can see that a specific page has changed, so it can fetch it, index it, and cache it (and feed its own GWA); and, as a bonus, it can mark that page as more than likely being a blog, and with that improve PageRank by mitigating the problems that blogs are causing.

Akamai is probably going to be relegated to DNS and static content CDN from now on.