The Making of MarkMail: February 2008

Wednesday, February 27, 2008

New Feature: Top 10 expands to Top 100

Every time you do a search on MarkMail, the leftmost pane shows you the top 10 lists, senders, attachments, and message types for all emails matching your query. OK, it's not always 10 that you see. Sometimes it's more, sometimes fewer. Exactly how many you see depends on your browser size. But even if you're the proud owner of one of those new 17" MacBook Pro laptops with the 1920x1200 screen, the view maxes out around 25.

We've added a new feature to help improve this. When there are more values than will fit in the selection box, you'll see a "View more" link in the top right corner.

Clicking on "View more" shows the top 100 in an overlay.

Clicking on any of the values in the overlay will limit your search, same as clicking on a value in the short list. Enjoy!
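
For the curious, here's a rough sketch in Python of the idea behind the short list and the overlay. It is not our actual implementation (MarkMail runs on MarkLogic Server); the function names, field names, and sample messages are made up for illustration.

```python
from collections import Counter

def top_values(messages, field, limit=10):
    """Return the `limit` most common values of `field`, with their counts."""
    counts = Counter(m[field] for m in messages if field in m)
    return counts.most_common(limit)

def has_more(messages, field, shown):
    """True when there are more distinct values than the pane can show."""
    return len({m[field] for m in messages if field in m}) > shown

# Hypothetical messages matching some query.
messages = [
    {"list": "pgsql-general", "sender": "alice"},
    {"list": "pgsql-general", "sender": "bob"},
    {"list": "xml-dev", "sender": "alice"},
]

short_list = top_values(messages, "sender", limit=1)   # what fits in the pane
overlay = top_values(messages, "sender", limit=100)    # the "View more" overlay
show_view_more = has_more(messages, "sender", shown=len(short_list))
```

The only difference between the two views in this sketch is the limit: the same counting feeds both, and the "View more" link appears only when something was left out of the short list.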

Monday, February 25, 2008

A Place for Xen

At MarkMail you can now find your Zen. Or, to be more accurate, you can find your Xen.

Xen is an open source "hypervisor" (similar to VMware) that enables operating system virtualization. It's supported by Citrix and used by Amazon EC2, among others.

I can joke about finding Xen at MarkMail because we recently loaded a bit over 100,000 messages from the Xen community. If you're into virtualization, enjoy!

If you want to compare VMware with Xen, you'll find some good discussion in the archive.

Wednesday, February 20, 2008

PostgreSQL: More Traffic than MySQL (and a first Google spotting)

When we announced back in December we'd loaded the MySQL database mailing lists, we heard from several people who asked us to load the PostgreSQL lists also. We said we'd be happy to, and MarkMail now has 635,000 PostgreSQL emails loaded and searchable.

Comparing PostgreSQL and MySQL is kind of interesting. With all the talk about the LAMP (Linux/Apache/MySQL/PHP-Perl-Python) architecture, you'd think MySQL had a lock on the open source database market, but based on simple message traffic analytics, PostgreSQL has a much higher level of community involvement. Looking at January 2000 onward, the MySQL lists have amassed 340,000 messages with about 3,000 new messages each month:

In the same time period, the PostgreSQL lists have hit 583,000 messages with 7,000 new each month:

I wouldn't have thought it, but there it is.

Also in the PostgreSQL lists we find the very first mention of Google in all of the messages loaded so far! The first Google sighting was on the pgsql-interfaces list, January 28, 1999, in a post by James Thomson:

"I've been using the Oracle Pro*C precompiler manual. I don't have the URL here at work but I found an online copy using www.google.com"

The first mention in another community happened on the xml-dev list a couple of months later, March 10, 1999, in a post by Andrew McNaughton:

"You need a new search engine. I've recently been using www.google.com with results an order of magnitude better than what I got from altavista (though altavista still has it's place for more complex query definitions)."

Here's the query if you want to look for yourself:

http://markmail.org/search/google+order:df

Maybe as we load more community archives we'll get even earlier sightings.

Wednesday, February 13, 2008

Announcing an Informal Partnership with Codehaus

We're happy to announce we've developed an informal partnership with Codehaus to load all their mail archives and receive automatic notification of new Codehaus lists as they get created.

The automatic update is particularly important because Codehaus is a fast-growing home to open source projects, with new lists being created all the time. How fast is Codehaus growing? The traffic chart shows a beautiful upward trend line. For comparison, Codehaus has the same level of activity today as Apache had in late 2000.

Previously we loaded the Groovy, Mule, and XFire archives from Codehaus. We now have the archives from Grails, Castor, Mojo, JRuby, Plexus, PicoContainer, Cargo, Drools, OpenEJB, and XStream as well as almost a hundred more. In total we're archiving 400,000 emails across the Codehaus lists.

P.S. Curious what happened in May 2006? They had some ISP troubles that month. And of course the last month is short since February has only just begun.

Squid Cache: Searching our own Dog Food

Yesterday we loaded 115,000 messages from the Squid mailing lists. We're particularly pleased about this because Squid plays a prominent role in the MarkMail site architecture and we plan to use these searchable archives to help with our own development.

Squid is probably the most famous caching proxy out there. It's been around for years, is fire-tested, and has oodles of configuration options. At MarkMail we use Squid as our "reverse proxy cache". In case you're not familiar with that term, let me explain.

On the web a "proxy" is a piece of software that sits between the user and the web server. When a user wants a web page, the user makes the request to the proxy and the proxy makes the request to the web server. Simple proxies provide a means to poke through firewalls, mask user identity, and things like that.

A "caching proxy" is a proxy that remembers the traffic passing through it, so later requests for the same content can (subject to configurable rules) be delivered to the user without actually connecting to the destination server. Schools, companies, and even countries use caching proxies to reduce their bandwidth costs and speed their users' web browsing. For example, once any user has pulled a logo image from a web site, every other user at that organization can just pull the proxy's version. Caching proxies make web browsing better, faster, and cheaper.

A "reverse proxy cache" is a caching proxy that runs on the server side instead of the client side. It gets first crack at each user request. In many cases, like when the requested page is already in its cache, a reverse proxy cache can handle the user request on its own and reduce the load on the actual web server.
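
To make that concrete, here's a toy sketch in Python of what a reverse proxy cache does. This is not how Squid is implemented; the class name, the `origin` function standing in for the web server, and the 180-second lifetime are all assumptions chosen for illustration.

```python
import time

class ReverseProxyCache:
    """Answer requests from cache when possible, else ask the origin server."""

    def __init__(self, origin, ttl_seconds=180, clock=time.monotonic):
        self.origin = origin      # callable: url -> response body
        self.ttl = ttl_seconds    # how long a cached response stays fresh
        self.clock = clock        # injectable so expiry can be tested
        self.cache = {}           # url -> (expires_at, body)

    def get(self, url):
        now = self.clock()
        entry = self.cache.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]       # cache hit: the origin is never contacted
        body = self.origin(url)   # miss or expired: forward to the origin
        self.cache[url] = (now + self.ttl, body)
        return body

# A stand-in for the real web server that records every request it receives.
origin_hits = []

def origin(url):
    origin_hits.append(url)
    return "page for " + url

proxy = ReverseProxyCache(origin)
first = proxy.get("/message/xyzzy")
second = proxy.get("/message/xyzzy")   # served from cache, origin not asked
```

After both requests the stand-in origin has been contacted only once. A real cache also honors HTTP cache headers and configurable rules, which this sketch leaves out.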

On MarkMail, Squid sits in front of our MarkLogic Server instance (our web server) and gets first crack at all user requests. It handles several tasks:

URL rewriting. This lets us present friendly URLs like /message/xyzzy to our users, while we actually serve the content from a .xqy XQuery page in the MarkLogic Server back-end. We use a Squid plug-in called Squirm for this. It lets us map public URLs to private URLs.
Caching. Almost every page in MarkMail is dynamic, even the home page with all those count statistics, but that doesn't mean we should regenerate the page on every request. We let Squid cache the results of each page for a few minutes. If you're looking at a page that anyone else saw recently, we're probably serving it to you from cache.
Connection pooling. On the web there's a feature called Keep-Alive that lets users hold open connections to the web server in case they make later requests. A common Keep-Alive period is 30 seconds. Keep-Alive saves the cost of opening up new connections but holding all the open connections can be resource intensive for a web server. By using Squid, we let Squid hold all the Keep-Alive connections to end users (hundreds of connections) while MarkLogic Server only talks to Squid. This reduces the load on the actual web server, leaving it free to focus its energy on searching, rendering, counting, etc.
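
The URL rewriting step is easy to picture in code. Here's a minimal sketch in Python of pattern-based rewriting. It is not Squirm's configuration syntax, and the private URLs (`/message.xqy?id=...`, `/search.xqy?q=...`) are hypothetical stand-ins rather than our real page names.

```python
import re

# Hypothetical rules: (public URL pattern, private URL template).
REWRITE_RULES = [
    (re.compile(r"^/message/(\w+)$"), r"/message.xqy?id=\1"),
    (re.compile(r"^/search/(.+)$"), r"/search.xqy?q=\1"),
]

def rewrite(public_url):
    """Map a public URL to a private one; pass it through if no rule matches."""
    for pattern, template in REWRITE_RULES:
        match = pattern.match(public_url)
        if match:
            # Substitute the captured groups into the private URL template.
            return match.expand(template)
    return public_url

private_url = rewrite("/message/xyzzy")
```

The first matching rule wins, and anything that matches no rule goes to the back-end unchanged.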

Of course there's more we'd like Squid to do for us. We'd like some help in blocking abusive users, automatically gzipping content, and things like that. We'll probably look for those features in a new load balancer. More on that later.

In the meantime, we hope you enjoy the Squid archives.

Saturday, February 9, 2008

New Feature: Sweep the Chart to Select a Date Range

People often write us saying they want to click or sweep on the chart to select a date range. We're happy to announce that's now possible.

To demonstrate, if you search for JavaOne you see a repeating yearly spike that correlates with the dates of the annual Java developer conference. Let's say you want to investigate what people said about the show in just the last few years. You can click and sweep your mouse over the time period of interest.

This adds a date: constraint to the query and automatically updates the search results. You can remove the date constraint by clicking on the "Remove date refinements" link in the top right of the graph.

You can sweep to select or click on individual months, and if you hold down control (command on a Mac) it toggles the selection, enabling you to create non-contiguous selections.
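
If you're wondering how the selection behaves, here's a small Python sketch of the logic as described above. It models the selection as a set of (year, month) pairs and assumes that a control/command sweep flips each swept month on or off. The function names are made up, and this is not the site's actual code.

```python
def month_range(start, end):
    """All (year, month) pairs from start to end, inclusive."""
    y, m = start
    months = []
    while (y, m) <= end:
        months.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

def sweep(selection, start, end, toggle=False):
    """Apply a sweep (or a single click when start == end) to the selection."""
    swept = set(month_range(min(start, end), max(start, end)))
    if not toggle:
        return swept              # a plain sweep replaces the selection
    return selection ^ swept      # control/command flips the swept months

selection = sweep(set(), (2006, 5), (2006, 7))                    # May to July 2006
selection = sweep(selection, (2007, 5), (2007, 5), toggle=True)   # add May 2007
selection = sweep(selection, (2006, 6), (2006, 6), toggle=True)   # drop June 2006
```

The end result is a non-contiguous selection: May 2006, July 2006, and May 2007.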

Monday, February 4, 2008

Saxon: Loaded 10,000 emails about XSLT and XQuery

We've recently begun archiving the saxon-help mailing list, with its 10,000 emails about the famous XSLT and XQuery processor written and maintained by Michael Kay.

Michael's a great guy, in person and online, and he writes long detailed emails answering people's questions. He stays considerate even when the receiver is being a little "dense", ignoring people's help and exasperating the others who try to help out. A recent quote from Michael on the xquery-talk list:

"I've spent five or ten minutes writing this response in the hope that you will learn from it and not make the same mistake again, which will save everyone time in the future. If you come back with another query showing the same error in a week's time, I shall give up."

I hope that by making the saxon-help archives more easily searchable than the built-in SourceForge search we'll be able to save him and the readers even more time.

Extra tidbit: If you admin a project on SourceForge and want your archives in MarkMail, there's an easy way to work with SourceForge to make that happen. Just let us know.
