The Making of MarkMail: 2008

Tuesday, December 9, 2008

MarkMail at One Year: Looking Back

It's now been a little over a year since we launched MarkMail. We've sure come a long way!

We're now seeing well over a million unique visitors every month and more than 5 million page views.

The Googlebot crawler (whose activity isn't included in the above statistics) has also been active. It now crawls between 1.0 and 1.3 million pages every day to keep its index fresh. At the high end that's about 15 page hits every second -- or 15 Hertz, enough to make a nice low background rumble. It's really enjoyable to get so much Google attention because it wasn't that long ago that we were just trying to get Google to index more than a million of our pages, never mind crawl that many in a day.
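For anyone who wants to check the arithmetic, here is a minimal Python sketch that turns the daily crawl range quoted above into an average per-second rate. The function name is made up for illustration; only the two daily figures come from our logs.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def pages_per_second(pages_per_day: float) -> float:
    """Convert a daily page count into an average per-second rate."""
    return pages_per_day / SECONDS_PER_DAY


low = pages_per_second(1_000_000)   # bottom of the quoted crawl range
high = pages_per_second(1_300_000)  # top of the quoted crawl range
print(f"{low:.1f} to {high:.1f} page hits per second")
```

The top of the range is where the "about 15 Hertz" figure comes from; quieter days sit closer to 12.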

Our content size has grown also, from 4 million messages at launch, covering just the Apache Software Foundation archives, to 34 million messages today, spanning all sorts of communities. For us to grow so big so fast has been possible only because of the community support we've received. There's a long list of various community members who have worked with us to accumulate and load their list archives. We'd like to thank all those folks, as well as the people who placed a MarkMail search box or other MarkMail link on their site or helped spread the word in blogs and emails and tweets.

Looking forward, where do we go from here? We have some big plans. I'll get into details with a later post.

Thursday, October 9, 2008

Google Code Adds Gadgets: MarkMail Helps

Google today announced new support for embeddable "gadgets" on Google Code project pages. Particularly exciting to us, they introduced MarkMail as the recommended gadget for viewing and searching Google Code project list archives.

For those who haven't encountered one in the wild, a Google Gadget is an embeddable web object that puts a bit of third-party dynamic content into the middle of a web page. Gadgets are the things you place on your iGoogle home page or your Google Desktop, but you can also add them to your own web page with one line of JavaScript, or anyone else's page if it supports the OpenSocial APIs.

We've coordinated with the Google Code team over the last several months to load about 500 GoogleGroups lists (3.8 million emails) and build a new MarkMail Gadget (launching today!) to let Google Code developers search and analyze their lists using MarkMail.

The new MarkMail gadget lets you view messages, threads, attachments, senders, and a traffic chart (wouldn't be MarkMail without it!) for any set of messages you want to follow. The messages you track with the gadget can come from a single list, a set of lists, or a single person, can contain a term or phrase, or can match any combination of these. In fact, anything you can use in a search on MarkMail can be used as input to the gadget view. The new gadget offers two features not yet available on MarkMail.org: a daily traffic chart (MarkMail.org only does monthly traffic charts) and a view that coalesces threads.

So what does this mean for you? If you're a project leader (either on Google Code or somewhere else) it's now easier than ever to embed a MarkMail traffic chart and recent message list inside any of your project pages. If you're just a lurker, you can personalize your view on MarkMail traffic and embed that view into iGoogle or Google Desktop, or any other page.

To help you set up the right links, we created a Gadget Embedding Wizard that guides you through the process of embedding. You can also find our gadget in the Google Directory where they have additional embedding instructions.

Tim O'Reilly, in describing Web 2.0, says, "A platform beats an application every time." We agree. We think you should be able to access mailing list archives whenever and wherever you want, be it at MarkMail.org or on another page that's been MarkMail-enabled via a gadget. So have fun, and let us know how this works for you!

Thursday, October 2, 2008

1.4% of Emails Mention Google

As Google celebrates its 10-year anniversary we thought it'd be fun to use our archive of 30 million mailing list messages to see how Google's popularity has grown over time across the list-o-sphere. Boy has it grown!

In 2008 (so far) the word "Google" appears in 1.4% of emails in our archive, up from 1.15% last year and 0.75% five years ago.

While shockingly high, that 1.4% number is actually calculated with some conservative restrictions. We're excluding all mentions that occur inside quote blocks (where someone replies to another who said the word). It'd be 2% if we didn't have that rule. We're also excluding from our calculations all the Google Groups lists we follow, where Google is often the topic of discussion. With those lists added in? It's 13%.

You can explore this yourself with our public interface. You'll want to query for "google", use the "opt:noquote" flag, and set "-list:googlegroups" to exclude those lists. Then you can add date constraints either by typing "date:2008" in the search or dragging on the chart. Note the result counts from your searches, do a little division, and you get your percentages.

You'll see that so far in 2008 there were 50,826 emails saying "google" across 3,607,973 emails total. That's 1.4%. For 2003 it's 21,165 emails out of 2,770,480 total, or 0.75%.
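If you'd rather script the division, here is a rough Python sketch. The query operators ("opt:noquote", "-list:", "date:") and the message counts are the ones quoted above; the helper names are invented for illustration, and the search URL shape is an assumption you should confirm against your browser's address bar after running a search by hand.

```python
from urllib.parse import urlencode


def build_query(term, year=None, exclude_lists=(), skip_quotes=True):
    """Assemble a MarkMail query string from the operators described above."""
    parts = [term]
    if skip_quotes:
        parts.append("opt:noquote")       # ignore mentions inside quote blocks
    parts.extend(f"-list:{name}" for name in exclude_lists)
    if year is not None:
        parts.append(f"date:{year}")
    return " ".join(parts)


def percent(matching, total):
    """Share of messages that matched, as a percentage."""
    return 100.0 * matching / total


query = build_query("google", year=2008, exclude_lists=["googlegroups"])
# Assumed URL shape, not taken from any documentation.
url = "http://markmail.org/search/?" + urlencode({"q": query})

print(query)
print(f"2008: {percent(50_826, 3_607_973):.2f}%")  # about 1.4%
print(f"2003: {percent(21_165, 2_770_480):.2f}%")  # about 0.76%
```

Run the same query twice per year, once with the term and once without it, to get the numerator and the denominator.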

Monday, September 22, 2008

Ruby on Rails on MarkMail: 200,000 Emails

Interested in Ruby on Rails? If so, you'll be happy to learn we've loaded the full RoR mailing list archive. It holds about 200,000 emails and includes both the original Mailman lists from 2004-2006 and the GoogleGroups lists from 2006 onward.

Fun facts:

Frederick Cheung is the #1 most frequent poster
DHH is #22
The traffic never fully recovered after it transitioned from rubyonrails.org to GoogleGroups. You can compare the two charts (keep an eye on the y-axis).
Maybe it's because DHH didn't make the move to GG?

Don't forget, we have the regular Ruby lists too.

Thursday, September 11, 2008

FreeBSD, the Unknown Giant

My last entry about NetBeans and OpenOffice.org and their million messages reminded me that I've never announced here our load of the FreeBSD archives, an even larger and older community. They have more than 2.5 million messages, stretching back to 1994.

FreeBSD doesn't get as much attention as Linux but is a great operating system. Here's a description from an IBM developerWorks article:

"The FreeBSD operating system is the unknown giant among free operating systems. Starting out from the 386BSD project, it is an extremely fast UNIX®-like operating system mostly for the Intel® chip and its clones. In many ways, FreeBSD has always been the operating system that GNU/Linux®-based operating systems should have been. It runs on out-of-date Intel machines and 64-bit AMD chips, and it serves terabytes of files a day on some of the largest file servers on earth."

Here's the historic traffic chart (excluding automated bug and check-in messages):

Looks like it's a giant in traffic as well. The freebsd-questions list alone gets a couple thousand emails a month, half a million in its history. Got a FreeBSD problem? I bet the answer's in there.

Announcing NetBeans and OpenOffice.org

Last week we finished adding the NetBeans and OpenOffice.org mailing lists to the MarkMail archive. Both communities host more than a million messages each. Here's the NetBeans activity graph (with automated bugs and check-in messages removed):

Looks like they've seen a resurgence, with activity climbing for the last 4 years. They have more list activity than Eclipse, too. (Eclipse directs user questions to web forums that aren't included in our stats.)

Here's the OpenOffice.org traffic (same automated message removals):

The folks at CollabNet worked with us to transfer the massive archives, and yesterday we issued a joint press release announcing the new list availability. We also boasted passing 27.5 million emails in total. That was yesterday. Today we're passing 28 million. Chugga, chugga!

Tuesday, September 2, 2008

A Tale of Two Search Engines, Revisited

As Jason announced previously, last Wednesday night I delivered A Tale of Two Search Engines, a presentation for the Software Architecture and Modeling SIG of SDForum about building and running the Krugle and MarkMail vertical search engines for code and email, respectively.

Here are my tidied-up slides.

Note carefully that my presentation style is a very visual, story-telling approach for live, interactive audiences -- i.e., the slide deck is quite large and not geared towards a reading-at-home audience. Heck, I only broke down and used bullet points on 4 slides right at the end. :-)

That said, I'll start blogging some of the stories, go deeper on various technical details, and/or get into any of the "fun topics" that people are interested in. Feel free to leave comments here about any that you particularly want to hear about.

Special thanks to Ron Lichty for dragging me into giving this presentation and the wonderful SAMSIG audience for making it so much fun.

Enjoy,

John

Monday, August 25, 2008

Loaded Red Hat: A Thousand Emails a Day

How did we celebrate the success of our memory swap ballet last week? We loaded 1.7 million Red Hat emails. It's geeky, but so are the lists! We've now got the complete set archived.

The first Red Hat messages start back in May 1996. At that time there were just a few hundred emails each month. The chatter has grown a lot since, with recent numbers topping 30,000 messages a month. That's 1,000 per day.

It's interesting to see the #1 most common file attachment is of type patch. That makes sense as these are mostly developer lists.

But can anyone explain why on a Linux list the #2 most common attachment is the Outlook-generated winmail.dat!? Is that a good sign or bad sign?

Tuesday, August 19, 2008

How to Shut Down All Your Machines Without Anyone Noticing

Last week we discovered we had to replace some bad memory chips in 2 of the 3 machines we use to run the MarkMail service. This blog post tells the story of how we managed to replace these memory chips without (almost) any of our visitors noticing.

Architecture

First, a word about our architecture. The three machines I'm talking about here all run MarkLogic Server. We have some other machines in the overall MarkMail system that do things like queue and handle incoming mail, but they're not directly involved in the web site experience. I'm talking here about the three MarkLogic machines that work together in a cluster and that you interact with when you hit http://markmail.org.

The MarkLogic machines have specialized roles. One machine (I like to picture it up front) listens for client connections. It's responsible for running XQuery script code, gathering answers from the other two machines, and formatting responses. The other two machines manage the stored content, about half on each. They support the front machine by actually executing queries and returning content.

I'll refer to the front machine as E1, which stands for evaluator #1. We don't have an E2 yet but we're planning for that when user load requires. The back-end machines are D1 and D2, which stand for data manager #1 and #2.
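To make the division of labor concrete, here is a toy Python sketch of the evaluator and data manager roles. It is not MarkLogic code and says nothing about how MarkLogic does this internally; the class names, the substring "search", and the sample subjects are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataNode:
    """Toy stand-in for a data manager (D1, D2): stores content, runs queries."""
    name: str
    messages: List[str] = field(default_factory=list)

    def search(self, term: str) -> List[str]:
        term = term.lower()
        return [m for m in self.messages if term in m.lower()]


@dataclass
class Evaluator:
    """Toy stand-in for the front machine (E1): holds no content or state."""
    name: str
    data_nodes: List[DataNode]

    def handle(self, term: str) -> str:
        hits: List[str] = []
        for node in self.data_nodes:          # gather answers from each data node
            hits.extend(node.search(term))
        return f"{len(hits)} result(s): " + "; ".join(sorted(hits))


d1 = DataNode("D1", ["Re: forest backup", "Apache vote"])
d2 = DataNode("D2", ["Forest merge timing", "Python release"])
e1 = Evaluator("E1", [d1, d2])
print(e1.handle("forest"))
```

Because the evaluator in this sketch keeps nothing of its own, a second one built over the same data nodes would give identical answers, which is the property the rest of this story leans on.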

The bad memory was on E1 and D1.

We'll Fix E1 First

We decided to fix E1 first because it's easiest. We gathered the MarkMail team and started at 5pm. That's the time period with our lowest traffic. It's a little counter-intuitive but since we're a global site we're as busy at 2am (Pacific) as we are at 2pm. The time around 5pm Pacific still sees a lot of traffic, but relatively less. Why? We theorize it's because we get the most traffic during the visitor's local business hours, and the 5pm to 8pm Pacific time slot puts the local business hours out in the middle of the Pacific.

The E1 server is important because it catches all requests. Our plan was to place a new host, essentially E2, into the cluster and route all traffic through it instead of E1. There's no state held by the front-end machines, so this is an easy change. We borrowed a machine, added it to the MarkLogic cluster, told it to join the "group" that would make it act like E1, and had our reverse proxy start routing traffic to it instead. We did all this with the MarkLogic web-based administration. It was far too easy, frankly.

We immediately saw the E1 access logs go silent and we knew our patient was, in effect, on a heart-lung bypass machine. We told our sysadmin in the colo to proceed.

That's when he told us that on more careful inspection the memory problems were on D1 and D2. The E1 server was just fine. Hmm...

We decided to call the maneuver good practice for later and put things back like we found them.

OK, We'll Fix D1 First

Performing maintenance on a machine like D1 requires more consideration because it's managing content. If we were to just unplug it, the content on the site would appear to be reduced by half. It'd be like winding the clock back to April, with our home page saying we just passed the 10 million message mark.

All email messages go into MarkLogic data structures called "forests". (Get it? Forests are collections of trees, each document being an XML tree.) Our D1 server manages forests MarkMail 1 and MarkMail 2, the oldest two. They're effectively read-only because we now load into higher-numbered forests on D2.

Turns out that's a highly convenient fact. It means we could back up the content from D1 and put it on our spare box, now acting like a D3. Then with a single transactional call to MarkLogic we could enable the two backup forests on D3 and disable the two original forests on D1. No one on the outside would see a difference. Zero downtime.
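The idea behind that single transactional call can be sketched in a few lines of Python. This is a toy model, not the MarkLogic admin API: the class, the method, and the "(D3 copy)" forest names are hypothetical. It only shows what "all or nothing" buys you, which is that a query sees either the old set of forests or the new set and never a half-swapped mix.

```python
class ForestConfigError(Exception):
    """Raised when a requested swap cannot be applied as a whole."""


class Database:
    """Toy model: the set of forests currently answering queries."""

    def __init__(self, enabled):
        self.enabled = frozenset(enabled)

    def swap(self, disable, enable, available):
        """Disable some forests and enable others in one all-or-nothing step."""
        disable, enable = set(disable), set(enable)
        not_enabled = disable - self.enabled
        not_ready = enable - set(available)
        if not_enabled or not_ready:
            # Validate everything first, so a bad request changes nothing.
            raise ForestConfigError(
                f"cannot swap: not enabled {sorted(not_enabled)}, "
                f"not ready {sorted(not_ready)}"
            )
        # Build the new state on the side, then publish it in one assignment.
        self.enabled = frozenset((self.enabled - disable) | enable)


db = Database({"MarkMail 1", "MarkMail 2", "MarkMail 3", "MarkMail 4"})
copies = {"MarkMail 1 (D3 copy)", "MarkMail 2 (D3 copy)"}  # hypothetical names
db.swap(disable={"MarkMail 1", "MarkMail 2"}, enable=copies, available=copies)
print(sorted(db.enabled))
```

The switch-back after the repair is the same call with the two sets reversed.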

It worked great! It took a few hours to copy things because it's hundreds of gigs of messages, but like a chef on TV we knew what we were going to need for showtime and prepared things in advance.

With the new memory chip placed in D1 we did a transactional switch-back, put the two original forests back into service and had the spare box unused again, ready to help with D2.

We Need an Alternate Approach for D2

Had we planned in advance to work on D2 we probably would have followed the same "use a backup forest" approach we used to work on D1 because it allowed for zero downtime. It would have required pushing ingestion activities to another machine like D1 so the forests could settle down and be read-only, but that's done easily enough. We didn't do this, however, because we were too impatient to wait for the data to copy between machines. Instead we decided to leave the data in place and do a SAN mount switch.

We host all our forest content on a big SAN (a storage area network, basically a bunch of drives working together to act like a big fast disk). All the data managing machines (D1, D2, and the spare acting as D3) have access to it. Usually we partition things into individual mount points so they can't step on each other's toes and corrupt things. You never want two databases to operate against the same data! Here we decided to remove the isolation. We'd have D2 "detach" the MarkMail 3 and MarkMail 4 forests and have our spare machine (acting like D3) quickly "attach" them. We would essentially transfer a few hundred gigs in seconds.

This system change couldn't be made transactionally, so we had a decision to make: Is it better to turn off the MarkMail site for a short time or let the world see a MarkMail with only half its content? We decided to just turn off the site. Our total downtime for the switch was 43 seconds going over, and just over a minute coming back after the memory change.
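Here is a toy Python sketch of why the SAN switch has a gap in it. Again, nothing here is a MarkLogic or SAN API; the class and function names are made up. The one rule it enforces is the one from above: a forest must be detached from its old host before any other host may attach it, so there is always a window when nobody is serving it.

```python
import time


class SanForests:
    """Toy model: which host, if any, currently has each forest attached."""

    def __init__(self, owners):
        self.owners = dict(owners)  # forest name -> host name, or None

    def detach(self, forest, host):
        if self.owners[forest] != host:
            raise RuntimeError(f"{forest} is not attached to {host}")
        self.owners[forest] = None

    def attach(self, forest, host):
        if self.owners[forest] is not None:
            # Two hosts operating on the same data is what we must never allow.
            raise RuntimeError(f"{forest} is still attached to {self.owners[forest]}")
        self.owners[forest] = host


def hand_off(san, forests, old_host, new_host):
    """Move forests between hosts; return how long they were unattached."""
    start = time.monotonic()
    for forest in forests:
        san.detach(forest, old_host)
    for forest in forests:          # the content is offline until this finishes
        san.attach(forest, new_host)
    return time.monotonic() - start


san = SanForests({"MarkMail 3": "D2", "MarkMail 4": "D2"})
gap = hand_off(san, ["MarkMail 3", "MarkMail 4"], "D2", "D3")
print(san.owners, f"offline for {gap:.6f}s in this toy run")
```

In the toy the gap is microseconds. On the real cluster it was the 43 seconds mentioned above, which is why we chose to take the site down for the duration.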

We think we could do it faster next time with some optimizations in the MarkLogic configuration -- turning off things like index compatibility checks, which we know we don't need. Maybe 20 seconds, or even 15.

The Moral

Looking back, we're happy that we could cycle through disabling every machine in our MarkLogic cluster yet not have any substantial downtime. Looking forward, we expect operations like this will get easier. If and when we add a permanent E2 machine to the cluster it means we won't have to do anything special to take one of them out of commission. Our load balancer will just automatically route around any unresponsive front-end servers. We were also happy to see that our configuration for SAN-based manual failover works. We proved that as long as another machine can access the SAN, we'll be able to bring the content back online should a back-end machine fail.

Everyone on the MarkMail team works at Mark Logic, the company that makes the core technology that powers our site. In fact, in years past some of us have been directly involved in building the technology. But despite our familiarity, we were still delighted to take the production MarkLogic cluster out for a walk and get it to do tricks. It did the right thing time after time with every disconnect and reconnect and reconfiguration, and we couldn't help but feel a point of pride. This is some fun software! If you're a Mark Logic customer, we trust you know what we mean.

A non-techie friend once asked why managing a high-uptime web site was hard. I said, "It's like we're driving from California to New York and we're not allowed to stop the car. We have to fill the gas tank, change the tires, wash the windows, and tune the engine but never reduce our speed. And really, because we're trying to add new features and load new content as we go, we need to leave California driving a Mini Cooper S and arrive in New York with a Mercedes ML320."

So far so good! Here's to the long roads ahead...

Sunday, August 17, 2008

Pillow Talk

We have a bit of a tradition at MarkMail where we give away T-shirts at the conferences we attend. Printed on the front of the T-shirts we put the mailing list traffic chart generated by the community whose conference we're attending. Last year we did it at ApacheCon in November, then again at XML 2007 in December. We did it at JavaOne too. They're fun because they're personalized, and to the recipients the long bars often bring back memories of fast growth, new product releases, and raging flamewars.

One of the recipients of a T-shirt at XML 2007 was B Tommie Usdin. Tommie doesn't like to wear T-shirts. No, she likes to make T-pillows out of them instead. Recently she emailed us a picture of her handiwork:

We just had to share.

Saturday, August 16, 2008

MarkMail has Procmail

This last week we loaded the Procmail list archives into MarkMail and I wanted to pause and mention it here because it's the kind of thing that readers of our blog would probably appreciate.

Procmail, for those who don't know, is a tool for filtering email. It lets you define complex rules for email processing. You can file messages into folders, quarantine spam, block viruses, and more. First released back in 1990, it's an oldie but goodie for people who want to do advanced things with email and aren't afraid to do some rule file hacking.

Of course not all is great with Procmail. It's arcane and fickle, with a rule syntax that confuses new users. It hasn't had a new release in a long while, nor have its official docs been kept up to date. Answers to common questions aren't on the web site. As a result, every time I've wanted to do something non-trivial with Procmail, I've had to spend a fair amount of time Googling for answers and hunting for samples.

I think that can change. With this list load MarkMail lets you search 25,000 emails spanning the last 8 years where people have been doing Q&A for each other. I expect those emails will give me some good A's for my Q's. Hopefully they'll do the same for you.

P.S. Of course there are almost as many emails about Procmail outside the Procmail list as there are emails inside it.

Thursday, August 7, 2008

Interview with KDE

Earlier this week KDE News published an interview with us about our recent loading of the KDE community list archives.

Our interviewer, Jos Poortvliet, asked some interesting questions on topics we haven't spoken much about before: how we select which lists to load, and what technical challenges we hit in gathering and loading 2.7 million emails. If you're curious about how things work at MarkMail on the loading side, check it out.

Wednesday, August 6, 2008

Blogger Names Us a "Blog of Note"

Earlier today we received a comment on one of our blog posts that said, "Congratulations on being named a Blog of Note this week!". It seemed like a perfect comment-spam ploy: Say something nice so the blog owner won't delete your comment. Yet something about the post smelled non-fishy. The comment didn't have any sketchy links like most spams do. I thought maybe it was real. To my surprise and happiness, it was!

We were listed by the Blogger Team as a "Blog of Note" for August 4th:

Thanks, Blogger!

Tuesday, July 22, 2008

Here Comes the Sun

Last week, in collaboration with Sun and CollabNet, we loaded the mail archive histories for java.net, Sun's open source developer playground for Java projects and home to projects like GlassFish, jMaki, AppFuse, Grizzly, Hudson and WebWork.

The load includes more than 1,000 mailing lists and roughly 1,000,000 messages. Their growth curve is fantastic (the last month is partial):

Just about half the java.net mails are auto-generated as a result of checkins or bugs. If we remove those, the curve is still beautiful. Looks like people are writing more than 15,000 human-to-human emails every month on java.net.

With such a large community, it's fun to look at community-wide analytics. It's a little-known feature that you can go to our browse page and add an arbitrary query to the URL and it'll show you list-by-list numbers for all messages matching that query. For example, you can view the total number of messages per list throughout time, or the counts for just last week.

You can browse the lists where people from "sun.com" have written the most. If you want to see the top lists, do it as a regular search.

There's been a lot of coverage about this, from Marla Parker, Eduardo Pelegri-Llopart, Clark Richey, and javaHispano. Plus we issued our own press release!

Friday, July 18, 2008

A Tale of Two Search Engines

If you're local to the Bay Area, you may be interested in attending an upcoming talk from the SDForum Software Architecture & Modeling SIG on August 27th. It's titled A Tale of Two Search Engines and will be given by our own John Mitchell, one of the developers on MarkMail. Here's his abstract:

Betwixt the rigid structure of relational databases and the unbridled chaos of random content lies the world of search engines. Search engines shine in the middle ground where the messy complexity of reality makes everything harder than we imagine.

While the soap operas of general-purpose search engines dominate the news, specialized search engines are coming to dominate their vertical niches. Special-purpose search engines can aggressively leverage domain-specific intelligence to return highly relevant results.

This talk will present the architecture, implementation, and stories behind the creation of two specialized search engines for code and email: Krugle and MarkMail.

If you're interested in MarkMail I think you'll enjoy it. (And no, John doesn't really use words like "betwixt" in daily conversations.)

Wednesday, July 2, 2008

The Perl Review: Now with Video

The folks at The Perl Review recently enhanced the interview I mentioned here Monday with a new screencast video showing MarkMail in action. The intro is terrific. There's a guy hitting his Mac with a hammer!

It's a strange (happy) feeling to have others produce advertising videos for you. Thanks, brian d foy!

Monday, June 30, 2008

Interview with The Perl Review

The Perl Review, a quarterly newsletter about all things Perl, recently published an interview with us where we discuss several topics relating to MarkMail:

How we load mail
Our choice between Java and Perl
Our model of permalinking
Comparative community sizes
What's in store for the future
How this is different than Google

It's a more technical interview than some of the ones we've done previously with Apache, The Content Wrangler, and InfoQ.

Tuesday, June 10, 2008

Diacritics, or should I say dịẫçritícs

We changed our indexing this week regarding how we handle diacritics -- those accent marks you see on vowels and some consonants in many languages.

Previously we resolved all queries in a diacritic insensitive manner. That meant that a search for "francois" would match both "francois" and "françois", and a search for "françois" would do the same. Basically we specified in our MarkLogic Server configuration that the c versus ç difference be ignored.

Now we've changed the configuration so the diacritic sensitivity choice depends on the search term. A term containing diacritics will trigger a diacritic sensitive match, while a term without diacritics will remain diacritic insensitive. That means a search for "francois" will match with and without diacritics (the same as before), but a search for "françois" will respect the ç character constraint and won't match "francois" anymore.

To summarize: If you care enough to type a diacritic, we'll care enough to match it for you!
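For the curious, the rule can be sketched in Python with the standard unicodedata module. This only illustrates the matching behavior described above; it is not our MarkLogic Server configuration, the function names are made up, and the case folding is an extra assumption of the sketch.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks, so 'françois' becomes 'francois'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_diacritics(text: str) -> bool:
    """True if any character carries a combining mark once decomposed."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def term_matches(term: str, word: str) -> bool:
    """Diacritic-sensitive only when the search term itself has diacritics."""
    term, word = term.casefold(), word.casefold()  # assumption: ignore case
    if has_diacritics(term):
        # The searcher typed an accent, so respect it exactly.
        return unicodedata.normalize("NFC", term) == unicodedata.normalize("NFC", word)
    # Plain term: match with or without accents, as before.
    return strip_diacritics(term) == strip_diacritics(word)


for term in ("francois", "françois"):
    for word in ("francois", "françois"):
        print(f"{term!r} vs {word!r}: {term_matches(term, word)}")
```

Only the pairing of the search term "françois" with the word "francois" comes back False, which is exactly the change we made.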

This is a particularly helpful change as we've expanded from English-only content into lists written in Japanese, Vietnamese, Spanish, German, Italian, Dutch, Portuguese, Slovak, Polish, and Farsi. We even have one mail in Frisian. Who knew!

On a per-message basis, we get more traffic from these lists than from our English lists. Perhaps they're underserved by other email archive systems? Maybe the other systems have issues hosting messages with non-ASCII characters. We've definitely had trouble finding "clean" historical archive records for non-English lists, ones where the diacritics were reliably preserved. Luckily for us, being built on XML, we have native support for all Unicode characters.

We hope you find the new indexing logic helpful.

Wednesday, May 21, 2008

Loaded TAXACOM: A List about Biodiversity

A couple days ago we received a request from Roderic Page to load TAXACOM. He describes it as:

...a mailing list that dates back to the early '90s, and is a forum for taxonomists and other researchers interested in biodiversity. It is lively, with some long conversations. It's also featured as the source for sociological research, such as "Systematics as Cyberscience: Computers, Change, and Continuity in Science". Given interest in the Encyclopedia of Life (see also Ed Wilson's TED talk), which could be viewed as one response to the issues raised on TAXACOM posts over the years, I think it would be a very timely addition to MarkMail.

With a description like that, how could we resist! So I'm happy to say we've loaded the list, and (for trivia buffs) it even sets a new earliest list record in MarkMail, with archives starting back in 1992.

For more, see Rod's blog and email announcements.

Monday, May 19, 2008

Loaded OpenMoko: An Open Source Smartphone Platform

Last week we received on our feedback form a request to load the OpenMoko mailing lists. These folks are creating an open source smartphone platform, very cool stuff. Along with the request, the requester explained the benefit he saw in having MarkMail archive the OpenMoko lists, from the perspective of a project participant. I've reprinted it here with permission:

I would LOVE to see the OpenMoko lists get into MarkMail.. for 2 reasons..

1. From a developer perspective, I'm new to the OpenMoko platform and still learning the build system, etc. and am eager to start writing my own applications. But there's only so much info on the wiki and like many young communities all the juicy info is buried in the Mailing lists. So I'd love to be able to search all the lists for things like installing the sim card, what new hardware bugs they've found on the dev list, how to modify the dialer application, etc. This is where MarkMail really shines and is the best platform out there for this type of information gathering from community lists. If these lists were in MarkMail it would be one of the ONLY places one could find some of this information because of the advanced search functions in MarkMail. I think this holds true for a lot of young open source communities and MarkMail can really help out.

2. I think it would help the community in general by giving users an avenue to find the information they need to start better participating and contributing back. There would be less duplication of questions which distracts everyone on the list and hopefully more "I see how things are being done because I got all caught up by searching MarkMail, how about we do it this way.." etc.

Of course we loaded the lists for him. Here's the activity chart:

Monday, May 12, 2008

Loaded Perforce: High-End Revision Control

Recently we loaded seven mailing lists dedicated to the Perforce SCM system. If you haven't heard of Perforce, it's a high-end revision control system with a long list of corporate clients. It's known for speed and features.

I've been using a Perforce system to manage my own files for over a decade now, appreciating their free individual license.

Their mailing lists have a lot of technical Q&A discussion, so I hope having these lists more easily searched will help people find the answers they need. Here's the historic traffic pattern:

Thursday, May 1, 2008

Loaded Eclipse and NetcoolUsers

Yesterday we loaded the email archive histories for two new communities: Eclipse and NetcoolUsers. Normally I wouldn't talk about these two communities in the same blog post, but after the load it occurred to me that both projects (coincidentally) relate to IBM. More about that at the end.

Eclipse (eclipse.org) is an extremely popular open source development tool project, initiated by IBM back in 2001. (The name was widely seen as an attack on Sun.) It took off and lots of Java developers use it as their IDE. They have a beautiful growth chart:

NetcoolUsers (netcoolusers.org) is a user community focused on IBM Tivoli Netcool. For a single list it's quite hopping (25 posts/day):

The fact that both these communities relate to IBM is purely coincidental, but it's also interesting because it reflects the direction of pull we're seeing from the MarkMail user base. It's an early sign of what you can expect in the future: more technical content beyond open source.

Technical lists come in many flavors: pure open source (Apache, JDOM), corporate-sponsored open source (Eclipse, Xen), standards development (W3C), technical user groups (NANOG), and groups focused on proprietary technology (NetcoolUsers). We plan to expand along each of these axes.

If you have a list you'd like us to load, let us hear about it.

Tuesday, April 29, 2008

Loaded NANOG: North American Network Operators' Group

Today we loaded a new list, NANOG, a discussion forum for the North American Network Operators' Group. In its 100,000 messages it holds some fascinating discussions about internet operations. The chatter around 9/11, Katrina, and Y2K stands out especially.

The list extends back to April 1994, two months earlier than any list we previously loaded. It's always fun to break little records like that. It could be a while before we break this one again.

If you're intrigued by internet operations, the NANOG FAQ has lots of good factoids.

Tuesday, April 15, 2008

Loaded Python: A Cool Million Messages

Happy news: We've just finished loading the Python Software Foundation mailing lists. (Python is a popular programming language, overseen by the PSF.) With this load we're breaking a few records:

Weighing in at 1,022,479 total messages, Python is now the largest community ever loaded since our initial launch. (We went live back in November with roughly 4 million Apache messages.)
Half of those million mails are from a single list, python-list. That means python-list holds the new record for Crazy Huge What The Heck Can They Talk About So Much list. (And, would you believe, there are even more python-list histories from 1992-1995 still to load.)
This puts our total combined MarkMail message count above 10,000,000. There was much hooting and hollering (and page refreshing) around here as the numbers clicked over.
It's the biggest community ever loaded by our new hire Evan Paull. OK, it's the only community ever loaded by Evan. He started just a couple weeks ago. We figure after he's wrangled together a million-message history, everything else will look easy.

Among the million mails are the archives for the Mailman project, something I'm especially happy about because much of our work here involves interfacing with Mailman, and this should help us understand it better.

As always, here's the traffic chart:

Monday, April 7, 2008

Loaded GNOME: 750,000 emails

Over the weekend we loaded the mailing list history for the GNOME project. GNOME is an immensely popular GNU project, a free software desktop environment and development framework. Their message traffic shows it has a vibrant and active community. They boast a history of 750,000 emails across more than 200 lists:

The peak in 2007? That's because in 2007 they started a new svn-commits-list (a list that captures emails about code check-ins) and it's been archived while the older cvs-commits-list wasn't. If we add -type:checkins to the query, we can graph the history without that list:

It took a fair amount of work to load the GNOME history because the archives had more spam and virus mails than could suitably be removed by hand. We had to use procmail and SpamAssassin to remove the junk.

One neat factoid: It's easier to remove spam from mail sent in 2004 than mail sent today. Spam blocking has always been a competitive arms race, but in this case we're fighting yesterday's war with today's technology! Even running in offline mode, SpamAssassin did a darn fine job.

I just wish it ran faster. If anyone out there is a SpamAssassin performance guru, please let us know.

Our thanks to Jeff Waugh for helping us get the histories.

Monday, March 17, 2008

World Wide Web Consortium Lists: 400,000 emails

HTML 4.0, XML, PNG, CSS, DOM, and XQuery: These are but a few of the technologies to come out of the World Wide Web Consortium, commonly referred to as the W3C. We're proud to announce that MarkMail (which by the way uses all of those technologies!) has loaded the full W3C public mailing lists. They start in 1994 and cover 400,000 emails across 200 mailing lists.

With such a long and deep history it's fun to do a little archaeology: You can find the first mention of XML back in 1996. I tried to find the formal "XML 1.0" announcement and saw there wasn't one, but on launch day (February 10, 1998) you can find people complaining about rendering issues with the spec. Isn't that always the way with mailing lists? By the way, it's fun to use XML to search on the birth of XML.

Google first came up as a topic in August 1998, back when its domain ended in "stanford.edu". That beats any other list by 5 months. The first mention of XQuery didn't come until January 2001, well after xml-dev and other lists were talking about it. I expect there's more chatter in the private W3C archives.

Finally, the first mention of MarkMail came in December 2007. And what a great post it was! :)

Thursday, March 13, 2008

Loaded Perl: 530,000 emails

Perl is the duct tape of the internet. Created by Larry Wall in 1987 and made famous with his Programming Perl "camel book" published by O'Reilly, it's the tool sysadmins use to keep things running.

We're proud to announce we've finished loading the Perl.org mailing list history into MarkMail. A total of 530,000 emails across 75 lists. The lists don't go back to 1987 (boy that'd be cool if they did). But that's all right; who really needs tech support against Perl 1.000?

What we have here is traffic starting with the migration to the Perl.org setup in 1999:

Enjoy! And if anyone has earlier archives, let us know.

Tuesday, March 11, 2008

New Search Feature: "opt:nostem"

In the science of Information Retrieval there's a constant tug of war between precision and recall. As Wikipedia defines the terms, precision is the fraction of the documents retrieved that are relevant to the user's information need, and recall is the fraction of the documents that are relevant to the query that are successfully retrieved. Or as I define the terms, precision is how much of what you got is what you wanted, and recall is how much of what you wanted you actually got.
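
To make the two terms concrete, here's a minimal Python sketch that computes both for a single query. The message ids are made up for illustration, and this is not how MarkMail measures anything internally; it simply applies the standard definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the search engine returned
    relevant:  ids the user actually wanted
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # returned AND wanted
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical message ids, for illustration only.
returned_by_search = {"m1", "m2", "m3", "m4"}
wanted_by_user = {"m2", "m3", "m5"}

p, r = precision_recall(returned_by_search, wanted_by_user)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four returned messages were wanted (precision 0.50), and two of the three wanted messages were returned (recall about 0.67). Returning more messages can only keep recall the same or raise it, but it usually costs precision.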

MarkMail increases recall by running stemmed searches. This loosens the query constraint so that searching for proxies will match proxy as well. Sometimes this is good, and sometimes we hear from users who don't like the behavior all that much! They want more precision.
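
As a rough illustration of the idea, here's a toy stemmer in Python. It is a deliberately crude suffix stripper written for this example; the real stemming in our search engine is far more careful, and the function names here are hypothetical.

```python
def toy_stem(word):
    """Very rough English suffix stripping, for illustration only."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # proxies -> proxy
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # lists -> list
    return w


def matches(query_term, document_words, stem=True):
    """True if the query term matches any word, with or without stemming."""
    normalize = toy_stem if stem else str.lower
    target = normalize(query_term)
    return any(normalize(w) == target for w in document_words)


doc = ["Configure", "the", "proxy", "first"]
stemmed_hit = matches("proxies", doc)            # stemming on
exact_hit = matches("proxies", doc, stem=False)  # stemming off
print(stemmed_hit, exact_hit)
```

With stemming on, the query term and the document word both reduce to "proxy" and the document matches. With stemming off, only a literal "proxies" would match, so this document is skipped.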

So we're happy to announce a new feature, opt:nostem, that when added to the search string turns off stemming for that query. You can try it for yourself:

http://markmail.org/search/?q=proxies
http://markmail.org/search/?q=proxies+opt%3Anostem
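
If you're building these URLs from a script, the colon needs to be percent-encoded, as in the second link above. Here's a small Python sketch using only the standard library; the `search_url` helper is a name made up for this example.

```python
from urllib.parse import urlencode


def search_url(query, stem=True):
    """Build a MarkMail search URL, optionally turning stemming off."""
    if not stem:
        query = f"{query} opt:nostem"
    # urlencode turns spaces into "+" and ":" into "%3A".
    return "http://markmail.org/search/?" + urlencode({"q": query})


stemmed = search_url("proxies")
unstemmed = search_url("proxies", stem=False)
print(stemmed)
print(unstemmed)
```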

Friday, March 7, 2008

Average Load Time: 0.1 Seconds

There are many challenges in running a high-traffic web site. Performance is a challenge we particularly focus on at MarkMail because users get frustrated if they have to wait more than a second for a reply.

The challenge in maintaining performance increases as more of a site's content gets built dynamically -- meaning on the fly in response to user requests rather than ahead of time where it can be directly served (like a McDonald's hamburger).

With MarkMail we build every page dynamically using XQuery. Even a page that at first blush seems as if it could be pre-built, like an individual email message, we actually build dynamically because we want to highlight the search terms from your query.

All this is why I was so happy to notice that Alexa.com calls us a "Very Fast" site...

Markmail.org has a traffic rank of: 128,666 (UP 745,248)
Speed: Very Fast (99% of sites are slower), Avg Load Time: 0.1 Secs

Here's some background on how Alexa tracks performance.

Wednesday, February 27, 2008

New Feature: Top 10 expands to Top 100

Every time you do a search on MarkMail the leftmost pane shows you the top 10 lists, senders, attachments, and message types for all emails matching your query. OK, it's not always 10 that you see. Sometimes it's more, sometimes less. Exactly how many you see depends on your browser size. But even if you're the proud owner of one of those new 17" MacBook Pro laptops with the 1920x1200 screen, the view maxes out around 25.

We've added a new feature to help improve this. When there are more values than will fit in the selection box, you'll see a "View more" link in the top right corner.

Clicking on "View more" shows the top 100 in an overlay.

Clicking on any of the values in the overlay will limit your search, same as clicking on a value in the short list. Enjoy!

Monday, February 25, 2008

A Place for Xen

At MarkMail you can now find your Zen. Or, to be more accurate, you can find your Xen.

Xen is an open source "hypervisor" (similar to VMware) that enables operating system virtualization. It's supported by Citrix and used by Amazon EC2, among others.

I can joke about finding Xen at MarkMail because we recently loaded a bit over 100,000 messages from the Xen community. If you're into virtualization, enjoy!

If you want to compare VMware with Xen, you'll find some good discussion in the archive.

Wednesday, February 20, 2008

PostgreSQL: More Traffic than MySQL (and a first Google spotting)

When we announced back in December we'd loaded the MySQL database mailing lists, we heard from several people who asked us to load the PostgreSQL lists also. We said we'd be happy to, and MarkMail now has 635,000 PostgreSQL emails loaded and searchable.

Comparing PostgreSQL and MySQL is kind of interesting. With all the talk about the LAMP (Linux/Apache/MySQL/PHP-Perl-Python) architecture you'd think MySQL had a lock on the open source database market, but based on simple message traffic analytics, PostgreSQL has a much higher level of community involvement. Looking at January 2000 onward, the MySQL lists have amassed 340,000 messages with about 3,000 new messages each month:

In the same time period, the PostgreSQL lists have hit 583,000 messages with 7,000 new each month:

I wouldn't have thought it, but there it is.

Also in the PostgreSQL lists we find the very first mention of Google in all of the messages loaded so far! The first Google sighting was on the pgsql-interfaces list, January 28, 1999, in a post by James Thomson:

"I've been using the Oracle Pro*C precompiler manual. I don't have the URL here at work but I found an online copy using www.google.com"

The first mention in another community happened on the xml-dev list a couple months later, March 10, 1999, in a post by Andrew McNaughton:

"You need a new search engine. I've recently been using www.google.com with results an order of magnitude better than what I got from altavista (though altavista still has it's place for more complex query definitions)."

Here's the query if you want to look for yourself:

http://markmail.org/search/google+order:df

Maybe as we load more community archives we'll get even earlier sightings.

Wednesday, February 13, 2008

Announcing an Informal Partnership with Codehaus

We're happy to announce we've developed an informal partnership with Codehaus to load all their mail archives and receive automatic notification of new Codehaus lists as they get created.

The automatic update is particularly important because Codehaus is a fast-growing home to open source projects with new lists being created all the time. How fast is Codehaus growing? Looking at the traffic chart, it shows a beautiful upward trend line. For comparison, it has the same level of activity as Apache had in late 2000.

Previously we loaded the Groovy, Mule, and XFire archives from Codehaus. We now have the archives from Grails, Castor, Mojo, JRuby, Plexus, PicoContainer, Cargo, Drools, OpenEJB, and XStream as well as almost a hundred more. In total we're archiving 400,000 emails across the Codehaus lists.

P.S. Curious what happened in May 2006? They had some ISP troubles that month. And of course the last month is short since February has only just begun.

Squid Cache: Searching our own Dog Food

Yesterday we loaded 115,000 messages from the Squid mailing lists. We're particularly pleased about this because Squid plays a prominent role in the MarkMail site architecture and we plan to use these searchable archives to help with our own development.

Squid is probably the most famous caching proxy out there. It's been around for years, is fire-tested, and has oodles of configuration options. At MarkMail we use Squid as our "reverse proxy cache". In case you're not familiar with that term, let me explain.

On the web a "proxy" is a piece of software that sits between the user and the web server. When a user wants a web page, the user makes the request to the proxy and the proxy makes the request to the web server. Simple proxies provide a means to poke through firewalls, mask user identity, and things like that.

A "caching proxy" is a proxy that remembers the traffic passing through it, so later requests for the same content can (subject to configurable rules) be delivered to the user without actually connecting to the destination server. Schools, companies, and even countries use caching proxies to reduce their bandwidth costs and speed their users' web browsing. For example, once any user has pulled a logo image from a web site, every other user at that organization can just pull the proxy's version. Caching proxies make web browsing better, faster, and cheaper.

A "reverse proxy cache" is a caching proxy that runs on the server-side instead of the client-side. It gets first crack at each user request. In many cases, like when the requested page is already in its cache, a reverse proxy cache can handle the user request on its own and reduce the load on the actual web server.
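
The core idea fits in a few lines. Below is a minimal Python sketch of a cache that sits in front of an origin server and answers repeat requests itself. It's a toy model, not Squid: the class name, the fixed time-to-live, and the fake origin function are all assumptions made for this example, and a real proxy also honors HTTP cache headers.

```python
import time


class TinyReverseProxyCache:
    """Serve repeat requests from memory until the entry expires."""

    def __init__(self, fetch_from_origin, ttl_seconds=180, clock=time.monotonic):
        self._fetch = fetch_from_origin   # function that hits the real server
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}                  # url -> (expires_at, body)

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]               # cache hit: origin is not contacted
        body = self._fetch(url)           # cache miss or expired entry
        self._store[url] = (now + self._ttl, body)
        return body


origin_hits = []


def fake_origin(url):
    """Stand-in for the real web server; records each request it receives."""
    origin_hits.append(url)
    return f"<html>page for {url}</html>"


fake_now = [0.0]  # controllable clock so the example doesn't have to sleep
cache = TinyReverseProxyCache(fake_origin, ttl_seconds=180, clock=lambda: fake_now[0])

first = cache.get("/search/?q=squid")
second = cache.get("/search/?q=squid")   # served from cache
fake_now[0] = 181.0
third = cache.get("/search/?q=squid")    # entry expired, fetched again
print(len(origin_hits))
```

Three user requests reach the cache, but only two reach the origin: the second request is answered from memory, and the third goes back to the server because the entry has aged out.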

On MarkMail, Squid sits in front of our MarkLogic Server instance (our web server) and gets first crack at all user requests. It handles several tasks:

URL rewriting. This lets us present friendly URLs like /message/xyzzy to our users, while we actually serve the content from a .xqy XQuery page in the MarkLogic Server back-end. We use a Squid plug-in called Squirm for this. It lets us map public URLs to private URLs.
Caching. Almost every page in MarkMail is dynamic, even the home page with all those count statistics, but that doesn't mean we should regenerate the page on every request. We let Squid cache the results of each page for a few minutes. If you're looking at a page that anyone else saw recently, we're probably serving it to you from cache.
Connection pooling. On the web there's a feature called Keep-Alive that lets users hold open connections to the web server in case they make later requests. A common Keep-Alive period is 30 seconds. Keep-Alive saves the cost of opening up new connections but holding all the open connections can be resource intensive for a web server. By using Squid, we let Squid hold all the Keep-Alive connections to end users (hundreds of connections) while MarkLogic Server only talks to Squid. This reduces the load on the actual web server, leaving it free to focus its energy on searching, rendering, counting, etc.
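
To give a feel for the URL rewriting step, here's a small Python sketch of regex-based mapping from public to private paths. This is not Squirm's configuration syntax, and the private `.xqy` path and `id` parameter are hypothetical; only the friendly /message/xyzzy form comes from our actual setup.

```python
import re

# Each rule pairs a public URL pattern with a private URL template.
# The private path below is a made-up example, not our real layout.
REWRITE_RULES = [
    (re.compile(r"^/message/([A-Za-z0-9]+)$"), r"/message.xqy?id=\1"),
]


def rewrite(public_path):
    """Return the back-end path for a public path, or the path unchanged."""
    for pattern, template in REWRITE_RULES:
        if pattern.match(public_path):
            return pattern.sub(template, public_path)
    return public_path


rewritten = rewrite("/message/xyzzy")
untouched = rewrite("/about")
print(rewritten, untouched)
```

Paths that match no rule pass through untouched, which keeps the rule list short: only the friendly URLs need an entry.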

Of course there's more we'd like Squid to do for us. We'd like some help in blocking abusive users, automatically gzipping content, and things like that. We'll probably look for those features in a new load balancer. More on that later.

In the meantime, hope you enjoy the Squid archives.

Saturday, February 9, 2008

New Feature: Sweep the Chart to Select a Date Range

People often write us saying they want to click or sweep on the chart to select a date range. We're happy to announce that's now possible.

To demonstrate, if you search for JavaOne you see a repeating yearly spike which correlates to the dates of the annual Java developer conference. Let's say you want to investigate what people said in just the last few years about the show. You can click and swipe your mouse over the time period of interest:

This adds a date: constraint to the query and automatically updates the search results. You can remove the date constraint by clicking on the "Remove date refinements" link in the top right of the graph.

You can sweep to select or click on individual months, and if you hold down control (command on a Mac) it toggles the selection, enabling you to create non-contiguous selections.

Monday, February 4, 2008

Saxon: Loaded 10,000 emails about XSLT and XQuery

We've recently begun archiving the saxon-help mailing list, with its 10,000 emails about the famous XSLT and XQuery processor written and maintained by Michael Kay.

Michael's a great guy, in person and online, and he writes long detailed emails answering people's questions. He stays considerate even when the receiver is being a little "dense", ignoring people's help and exasperating the others who try to help out. A recent quote from Michael on the xquery-talk list:

I've spent five or ten minutes writing this response in the hope that you will learn from it and not make the same mistake again, which will save everyone time in the future. If you come back with another query showing the same error in a week's time, I shall give up.

I hope that by making the saxon-help archives more easily searchable than the built-in SourceForge search we'll be able to save him and the readers even more time.

Extra tidbit: If you admin a project on SourceForge and want your archives in MarkMail, there's an easy way to work with SourceForge to make that happen. Just let us know.

Tuesday, January 29, 2008

Give us a Date, and We'll Search It

In MarkMail we like to have both a search box way to do something (easy for experts) and a graphical way to do something (easy for novices). Recently we added support for date-based query constraints. At the moment it's only available in the search box, but we thought it worth talking about anyway. A graphical version will be coming soon (we know, we can't wait to click and swipe the months on the chart either).

With the new feature you specify a date or date range by adding a date: term to a query. For example, let's say you'd like to investigate the cause of the sudden spike in messages from the PHP lists for the query "register globals". You see this histogram:

So let's satisfy that curiosity. In April of 2002 things really start to heat up and it remains a pretty hot topic until roughly November 2003. To restrict our query to these dates all we have to do is add "date:2002/04-2003/11" to our register globals query, yielding "register globals date:2002/04-2003/11". Two dates separated by a hyphen indicate a range. This gives you only the matching messages from April 2002 through November 2003. The chart even highlights the selection range:

Looking at the statistics for this query, most of the messages were posted to the discuss list so it's likely that users are having problems. We might assume that something changed with the language so let's add "release" to the query. Sure enough, looks like they changed the default behavior of register globals in the 4.2.0 release which was released on April 22nd, 2002.

We support more than just date ranges. Here are just a few of the formats that we support:

date:today	Messages posted today
date:"last week"	Messages posted in the last 7 days
date:"last month"	Messages posted in the last 30 days
date:lastmonth	Spaces are optional, for convenience
date:2008/01/26	Selection by day
date:20080126	Slashes are optional, if you prefer
date:2008/01	Selection by month
date:2007	Selection by year
date:2005/06-	Everything from June 2005 onward, because of the trailing hyphen
date:-2005/06	Everything up to the end of June 2005, because of the leading hyphen
date:2007-2008	This year and last
date:"July 4th, 2007"	Human readable formats are supported too
date:"Julio 4th, 2007"	For all of you Spanish speakers
date:t90d	Messages from the last 90 days, don't forget the t
-date:2008	Negation is also allowed, put the hyphen in front of the date:
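
For the curious, here's a minimal Python sketch of how the numeric formats above could be turned into a start and end date. It's an illustration under stated assumptions, not our parser: it handles only the year, month, day, and hyphenated range forms, takes the text after "date:" as input, and ignores the keyword, human-readable, relative (t90d), and negated forms. The function names are made up for this example.

```python
import calendar
import re
from datetime import date

# Year, optional month, optional day; slashes are optional (2008/01/26 or 20080126).
_POINT = re.compile(r"^(\d{4})(?:/?(\d{2}))?(?:/?(\d{2}))?$")


def _span(text):
    """Return the (first_day, last_day) covered by a single date value."""
    m = _POINT.match(text)
    if not m:
        raise ValueError(f"unsupported date: {text!r}")
    year = int(m.group(1))
    month = int(m.group(2)) if m.group(2) else None
    day = int(m.group(3)) if m.group(3) else None
    if month is None:
        return date(year, 1, 1), date(year, 12, 31)
    if day is None:
        last = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last)
    return date(year, month, day), date(year, month, day)


def parse_date_constraint(value):
    """Parse the text after 'date:' into (start, end); None means open-ended."""
    if value.startswith("-"):                 # date:-2005/06
        return None, _span(value[1:])[1]
    if value.endswith("-"):                   # date:2005/06-
        return _span(value[:-1])[0], None
    if "-" in value:                          # date:2002/04-2003/11
        left, right = value.split("-", 1)
        return _span(left)[0], _span(right)[1]
    return _span(value)                       # date:2008/01


php_spike = parse_date_constraint("2002/04-2003/11")
print(php_spike[0].isoformat(), php_spike[1].isoformat())
```

For the register globals range used earlier, this yields 2002-04-01 through 2003-11-30: a range starts on the first day of its left value and ends on the last day of its right value.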

Don't worry if you can't remember all this. The question mark graphic next to the search box will pop up a reminder. So now while we continue to work on allowing you to interact with the graphs, give this a spin and let us know what you think.

P.S. Jason really likes this because it lets him examine all the changes between JDOM 1.0 and JDOM 1.1.

Sunday, January 27, 2008

Ruby vs Groovy: What Can List Traffic Tell Us?

Over the weekend we loaded the main Ruby lists from ruby-lang.org, about 300,000 messages across the last six years. The ruby-talk list alone weighs in at 245,000 messages and is our new second place traffic champ, trailing only php-general.

With both the Ruby and Groovy archives loaded, we have an exciting opportunity to compare the two communities. In my experience talking with developers at Java conferences, they often look to both Ruby and Groovy as possible next languages to learn. The Java developers have a natural desire to go toward Groovy because it lets them keep their Java stack, but they're concerned about Groovy's level of support relative to Ruby, which has been around much longer and has a larger community.

How much larger? Is the community growing or shrinking? It can be hard to tell with open source, having no revenue numbers and with download counts skewed by bundling. I think looking at email list traffic patterns is about as good a gauge as anything.

Below you'll see a composite graphic showing the traffic from the five Ruby lists compared to the five Groovy lists, with the Groovy lists inlaid at matching scale:

The Ruby lists are more active, by about double. In the months before the Groovy 1.0 launch in January 2007, the spread was even larger. Both communities seem to have plateaued in 2007. I look forward to seeing what 2008 brings.

P.S. The Ruby lists are half English and half Japanese. Yukihiro (Matz) Matsumoto, who created Ruby, is Japanese, and the language first took off in Japan. If you speak Japanese, feel free to search for Japanese words. It should work, but do let us know if you spot any issues.

Thursday, January 24, 2008

Groovy: Traffic Doubled with a Formal Release

A few days ago we loaded the Groovy lists and their 70,000 messages. The list traffic chart helped change my mind about the language.

The Groovy project calls itself "an agile and dynamic language for the Java Virtual Machine". I'd call it a cool scripting language that compiles to Java bytecode and so lets you write in a scripting language while accessing the vast set of Java libraries out there.

The first time I saw Groovy, years back, I got very excited -- but then it didn't seem to be catching on, and I thought it was slowly on the downturn. In fact it's not like that at all; it just takes time to develop a language. Look at the shape of the message traffic histogram:

Looks like the project caught some fire. Guillaume Laforge blogged that the big jump you see here starting in January 2007 was due to the release of Groovy 1.0. I see they're on Groovy 1.5 now, as of a month ago. The 100+ messages per day rate will probably continue. My friend Scott Davis was right.

Monday, January 21, 2008

Stuffing Six Million Pages Down Google's Throat

Tim O'Reilly in the O'Reilly Radar blog just reposted an email I sent him discussing the challenges with getting our millions of emails indexed by the major search engines. Here's a follow-up email he didn't repost, concerning the techniques we do use:

We do two things to help search engines with crawling. If there's more we can or should do, I'd be happy to hear it. Hopefully Google will agree that getting these emails (some of which aren't on the web anywhere else) into their index is a Good Thing.

First, we use sitemap files. Our robots.txt file points at a sitemap index file which points at several dozen sitemap files.
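
For anyone setting up something similar, here's a minimal Python sketch of that arrangement: a sitemap index that lists the individual sitemap files, plus the robots.txt line that points at the index. The example.org URLs and function names are placeholders, not our real file names. One reason a large site ends up with many sitemap files is that the sitemap protocol caps each file at 50,000 URLs.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML listing each individual sitemap file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(root, encoding="unicode")


def robots_txt_line(index_url):
    """Return the robots.txt directive that advertises the sitemap index."""
    return f"Sitemap: {index_url}\n"


# Placeholder URLs, for illustration only.
sitemaps = [f"http://example.org/sitemap-{n:03d}.xml" for n in range(1, 4)]
index_xml = build_sitemap_index(sitemaps)
robots_line = robots_txt_line("http://example.org/sitemap-index.xml")
print(index_xml)
print(robots_line)
```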

Second, we have a "Browse" link in the footer. It's not real prominent because it's not very friendly to humans but it provides a full navigation tree ideal for spiders. The top page links to all hosted lists, each list's page shows the messages by month, and each month page shows the messages for that month.

Our sitemap explicitly excludes mentioning any commit emails (which we view as lower value) but we notice Google still crawls them, so we can deduce their crawler found the Browse link. Their crawler also pulls the sitemap files regularly so it found them also.

Some people recommend having no more than 100 links on a page, and (because we have not implemented any browse paging) it's true that for some months on popular lists we exceed that. In some basic testing though, Google does appear to index emails both at the top and bottom of the long lists, so I'm not sure this is the problem.

The above discussion pertains to Google, but the real challenge right now is how to get the Yahoo spider count (19k pages) and MSN spider count (4k pages) even half as high as the Google spider count (851k pages).

Saturday, January 19, 2008

We've loaded PHP and PEAR (700,000 emails)

Move over tomcat-users, there's a new king in town: php-general.

In the last few weeks we loaded the PHP and PEAR mailing lists, a sum total of about 700,000 new messages. Contained within the new load is the php-general list, now statistically our largest list at 266,000 messages, passing by the old king tomcat-users with its 225,000 messages. Third place now goes to the main MySQL list.

Hmm, I wonder if http://php.markmail.org/ could be the logic behind the http://php.net/ mailing list search box, instead of the MARC archives that are supporting it today?