Tuesday, January 29, 2008

Give us a Date, and We'll Search It

In MarkMail we like to have both a search box way to do something (easy for experts) and a graphical way to do something (easy for novices). Recently we added support for date-based query constraints. At the moment it's only available in the search box, but we thought it worth talking about anyway. A graphical version will be coming soon (we know, we can't wait to click and swipe the months on the chart either).

With the new feature you specify a date or date range by adding a date: term to a query. For example, let's say you'd like to investigate the cause of the sudden spike in messages from the PHP lists for the query "register globals". You see this histogram:

So let's satisfy that curiosity. In April of 2002 things really start to heat up and it remains a pretty hot topic until roughly November 2003. To restrict our query to these dates all we have to do is add "date:2002/04-2003/11" to our register globals query, yielding "register globals date:2002/04-2003/11". Two dates separated by a hyphen indicate a range. This gives you only the matching messages from April 2002 through November 2003. The chart even highlights the selection range:

The statistics for this query show that most of the messages were posted to the discuss list, so it's likely that users were having problems. We might assume that something changed in the language, so let's add "release" to the query. Sure enough, it looks like the default behavior of register globals changed in the 4.2.0 release, which came out on April 22nd, 2002.

We support more than just date ranges. Here are just a few of the formats that we support:

date:today                 Messages posted today
date:"last week"           Messages posted in the last 7 days
date:"last month"          Messages posted in the last 30 days
date:lastmonth             Spaces are optional, for convenience
date:2008/01/26            Selection by day
date:20080126              Slashes are optional, if you prefer
date:2008/01               Selection by month
date:2007                  Selection by year
date:2005/06-              Everything from June 2005 onward, because of the trailing hyphen
date:-2005/06              Everything up to the end of June 2005, because of the leading hyphen
date:2007-2008             This year and last
date:"July 4th, 2007"      Human readable formats are supported too
date:"Julio 4th, 2007"     For all of you Spanish speakers
date:t90d                  Messages from the last 90 days, don't forget the t
-date:2008                 Negation is also allowed, put the hyphen in front of the date:
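
If you're curious how terms like these could map onto actual date ranges, here is a minimal Python sketch. It is not MarkMail's implementation, just an illustration of the numeric and relative formats listed above. The function names (parse_date_term, _bounds) are made up for this example, the inclusive treatment of "the last N days" is an assumption, and the human readable formats and negation are left out.

```python
import calendar
import re
from datetime import date, timedelta

# One date token: YYYY, YYYY/MM, YYYY/MM/DD, or the slash-free YYYYMMDD.
_TOKEN = re.compile(r"^(\d{4})(?:/?(\d{2}))?(?:/?(\d{2}))?$")

# Relative terms from the list above, with spaces removed.
_RELATIVE_DAYS = {"lastweek": 7, "lastmonth": 30}


def _bounds(token):
    """Return the (first_day, last_day) covered by a single date token."""
    m = _TOKEN.match(token)
    if not m:
        raise ValueError(f"unsupported date token: {token!r}")
    year = int(m.group(1))
    month = int(m.group(2)) if m.group(2) else None
    day = int(m.group(3)) if m.group(3) else None
    if month is None:
        # A bare year covers January 1st through December 31st.
        return date(year, 1, 1), date(year, 12, 31)
    if day is None:
        # A year and month cover the whole month.
        last = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last)
    d = date(year, month, day)
    return d, d


def parse_date_term(term, today=None):
    """Turn a date: term into an inclusive (start, end) pair.

    None on either side means the range is open-ended on that side.
    """
    today = today or date.today()
    value = term.strip().lower()
    if value.startswith("date:"):
        value = value[len("date:"):]
    value = value.strip('"')

    if value == "today":
        return today, today
    compact = value.replace(" ", "")
    if compact in _RELATIVE_DAYS:
        return today - timedelta(days=_RELATIVE_DAYS[compact]), today
    rel = re.fullmatch(r"t(\d+)d", value)
    if rel:
        # Assumption: "last N days" reaches back N days from today, inclusive.
        return today - timedelta(days=int(rel.group(1))), today
    if "-" in value:
        # A hyphen marks a range; an empty side leaves that end open.
        left, right = value.split("-", 1)
        start = _bounds(left)[0] if left else None
        end = _bounds(right)[1] if right else None
        return start, end
    return _bounds(value)


print(parse_date_term("date:2002/04-2003/11"))
# (datetime.date(2002, 4, 1), datetime.date(2003, 11, 30))
```

A range takes the first day of its left side and the last day of its right side, which is why 2002/04-2003/11 runs from April 1st, 2002 through November 30th, 2003.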

Don't worry if you can't remember all this. The question mark graphic next to the search box will pop up a reminder. So now while we continue to work on allowing you to interact with the graphs, give this a spin and let us know what you think.

P.S. Jason really likes this because it lets him examine all the changes between JDOM 1.0 and JDOM 1.1.

Sunday, January 27, 2008

Ruby vs Groovy: What Can List Traffic Tell Us?

Over the weekend we loaded the main Ruby lists from ruby-lang.org, about 300,000 messages across the last six years. The ruby-talk list alone weighs in at 245,000 messages and is our new second place traffic champ, trailing only php-general.

With both the Ruby and Groovy archives loaded, we have an exciting opportunity to compare the two communities. In my experience talking with developers at Java conferences, they often look to both Ruby and Groovy as possible next languages to learn. The Java developers have a natural desire to go toward Groovy because it lets them keep their Java stack, but they're concerned about Groovy's level of support relative to Ruby, which has been around much longer and has a larger community.

How much larger? Is the community growing or shrinking? It can be hard to tell with open source, having no revenue numbers and with download counts skewed by bundling. I think looking at email list traffic patterns is about as good a gauge as anything.

Below you'll see a composite graphic showing the traffic from the five Ruby lists compared to the five Groovy lists, with the Groovy lists inlaid at matching scale:

The Ruby lists are more active, by about double. In the months before the Groovy 1.0 launch in January 2007, the spread was even larger. Both communities seem to have plateaued in 2007. I look forward to seeing what 2008 brings.

P.S. The Ruby lists are half English and half Japanese. Yukihiro (Matz) Matsumoto, who created Ruby, is Japanese, and the language first took off in Japan. If you speak Japanese, feel free to search for Japanese words. It should work, but do let us know if you spot any issues.

Thursday, January 24, 2008

Groovy: Traffic Doubled with a Formal Release

A few days ago we loaded the Groovy lists and their 70,000 messages. The list traffic chart helped change my mind about the language.

The Groovy project calls itself "an agile and dynamic language for the Java Virtual Machine". I'd call it a cool scripting language that compiles to Java bytecode, so it lets you write in a scripting language while accessing the vast set of Java libraries out there.

The first time I saw Groovy, years back, I got very excited -- but then it didn't seem to be catching on, and I thought it was slowly on the downturn. In fact it's not like that at all, it just takes time to develop a language. Look at the shape of the message traffic histogram:

Looks like the project caught some fire. Guillaume Laforge blogged that the big jump you see here starting in January 2007 was due to the release of Groovy 1.0. I see they're on Groovy 1.5 now, as of a month ago. The 100+ messages per day rate will probably continue. My friend Scott Davis was right.

Monday, January 21, 2008

Stuffing Six Million Pages Down Google's Throat

Tim O'Reilly in the O'Reilly Radar blog just reposted an email I sent him discussing the challenges with getting our millions of emails indexed by the major search engines. Here's a follow-up email he didn't repost, concerning the techniques we do use:

We do two things to help search engines with crawling. If there's more we can or should do, I'd be happy to hear it. Hopefully Google will agree that getting these emails (some of which aren't on the web anywhere else) into their index is a Good Thing.

First, we use sitemap files. Our robots.txt file points at a sitemap index file which points at several dozen sitemap files.
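
For readers who haven't set this up before, here is a rough Python sketch of the robots.txt-to-index-to-sitemaps chain, following the public sitemaps.org protocol. The example.org URLs and the function names are placeholders for illustration, not our actual files or code.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document listing the given sitemap files."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(root, encoding="unicode")


def robots_txt_line(index_url):
    """Return the robots.txt directive that points crawlers at the index."""
    return f"Sitemap: {index_url}"


# Hypothetical URLs; a real site would list its several dozen sitemap files.
print(robots_txt_line("http://example.org/sitemap-index.xml"))
print(build_sitemap_index([
    "http://example.org/sitemap-1.xml",
    "http://example.org/sitemap-2.xml",
]))
```

Each sitemap file named in the index then lists individual page URLs, so a crawler starting from robots.txt can reach every page in two hops.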

Second, we have a "Browse" link in the footer. It's not real prominent because it's not very friendly to humans but it provides a full navigation tree ideal for spiders. The top page links to all hosted lists, each list's page shows the messages by month, and each month page shows the messages for that month.

Our sitemap explicitly excludes mentioning any commit emails (which we view as lower value) but we notice Google still crawls them, so we can deduce their crawler found the Browse link. Their crawler also pulls the sitemap files regularly so it found them also.

Some people recommend having no more than 100 links on a page, and (because we have not implemented any browse paging) it's true that for some months on popular lists we exceed that. In some basic testing though, Google does appear to index emails both at the top and bottom of the long lists, so I'm not sure this is the problem.

The above discussion pertains to Google, but the real challenge right now is how to get the Yahoo spider count (19k pages) and MSN spider count (4k pages) even half as high as the Google spider count (851k pages).

Saturday, January 19, 2008

We've loaded PHP and PEAR (700,000 emails)

Move over tomcat-users, there's a new king in town: php-general.

In the last few weeks we loaded the PHP and PEAR mailing lists, a sum total of about 700,000 new messages. Contained within the new load is the php-general list, now statistically our largest list at 266,000 messages, passing by the old king tomcat-users with its 225,000 messages. Third place now goes to the main MySQL list.

Hmm, I wonder if http://php.markmail.org/ could be the logic behind the http://php.net/ mailing list search box, instead of the MARC archives that are supporting it today?