Wednesday, February 13, 2008

Squid Cache: Searching Our Own Dog Food

Yesterday we loaded 115,000 messages from the Squid mailing lists. We're particularly pleased about this because Squid plays a prominent role in the MarkMail site architecture and we plan to use these searchable archives to help with our own development.

Squid is probably the most famous caching proxy out there. It's been around for years, is fire-tested, and has oodles of configuration options. At MarkMail we use Squid as our "reverse proxy cache". In case you're not familiar with that term, let me explain.

On the web a "proxy" is a piece of software that sits between the user and the web server. When a user wants a web page, the user makes the request to the proxy and the proxy makes the request to the web server. Simple proxies provide a means to poke through firewalls, mask user identity, and things like that.

A "caching proxy" is a proxy that remembers the traffic passing through it, so later requests for the same content can (subject to configurable rules) be delivered to the user without actually connecting to the destination server. Schools, companies, and even countries use caching proxies to reduce their bandwidth costs and speed their users' web browsing. For example, once any user has pulled a logo image from a web site, every other user at that organization can just pull the proxy's version. Caching proxies make web browsing better, faster, and cheaper.

A "reverse proxy cache" is a caching proxy that runs on the server-side instead of the client-side. It gets first crack at each user request. In many cases, like when the requested page is already in its cache, a reverse proxy cache can handle the user request on its own and reduce the load on the actual web server.

On MarkMail, Squid sits in front of our MarkLogic Server instance (our web server) and gets first crack at all user requests. It handles several tasks (there's a rough config sketch after the list):

  • URL rewriting. This lets us present friendly URLs like /message/xyzzy to our users, while we actually serve the content from a .xqy XQuery page in the MarkLogic Server back-end. We use a redirector program called Squirm for this. It lets us map public URLs to private URLs.
  • Caching. Almost every page in MarkMail is dynamic, even the home page with all those count statistics, but that doesn't mean we should regenerate the page on every request. We let Squid cache the results of each page for a few minutes. If you're looking at a page that anyone else saw recently, we're probably serving it to you from cache.
  • Connection pooling. On the web there's a feature called Keep-Alive that lets browsers hold connections to the web server open in case they make later requests. A common Keep-Alive period is 30 seconds. Keep-Alive saves the cost of opening new connections, but holding all those open connections can be resource intensive for a web server. With Squid in front, Squid holds all the Keep-Alive connections to end users (hundreds of them) while MarkLogic Server only talks to Squid. This reduces the load on the actual web server, leaving it free to focus its energy on searching, rendering, counting, etc.
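
To make those three jobs concrete, here's a rough sketch of the relevant squid.conf directives. The file paths and timings are simplified placeholders, not our production settings:

    # --- URL rewriting: hand each request URL to Squirm ---
    redirect_program /usr/local/squirm/bin/squirm
    redirect_children 5

    # --- Caching: let dynamic pages stay fresh for a few minutes ---
    # refresh_pattern regex min-minutes percent max-minutes
    refresh_pattern . 3 20% 10

    # --- Connection pooling: hold persistent connections on both sides ---
    client_persistent_connections on
    server_persistent_connections on

On the Squirm side, a rule in its patterns file maps the friendly public URL onto the private .xqy page (again, the pattern below is a made-up simplification of what we actually do):

    # Rewrite /message/xyzzy to the XQuery page that renders it
    regexi ^http://markmail\.org/message/(.*)$ http://markmail.org/message.xqy?id=\1
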
Of course there's more we'd like Squid to do for us. We'd like some help in blocking abusive users, automatically gzipping content, and things like that. We'll probably look for those features in a new load balancer. More on that later.

In the meantime, we hope you enjoy the Squid archives.

1 comment:

Frank Rubino said...

Pieces like this, which have general background information on technology of interest, are a real community service. And written so well too, with an engaging, conversational style. Thanks!