Monday, January 21, 2008

Stuffing Six Million Pages Down Google's Throat

Over on the O'Reilly Radar blog, Tim O'Reilly just reposted an email I sent him discussing the challenges of getting our millions of emails indexed by the major search engines. Here's a follow-up email he didn't repost, describing the techniques we do use:

We do two things to help search engines with crawling. If there's more we can or should do, I'd be happy to hear it. Hopefully Google will agree that getting these emails (some of which aren't on the web anywhere else) into their index is a Good Thing.

First, we use sitemap files. Our robots.txt file points at a sitemap index file, which in turn points at several dozen sitemap files.
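
To make that layout concrete, here's a rough Python sketch of how such a setup can be generated. The base URL, file names, and helper functions are made up for illustration; this isn't our production code.

    # Sketch only: robots.txt carries a single Sitemap directive pointing at an
    # index file, and the index file lists the child sitemaps (hypothetical
    # paths and URLs).
    from xml.sax.saxutils import escape

    SITE = "http://markmail.org"                              # assumed base URL
    CHILD_SITEMAPS = ["sitemap-000.xml", "sitemap-001.xml"]   # really several dozen

    def write_robots_txt(path="robots.txt"):
        with open(path, "w") as f:
            f.write("User-agent: *\n")
            f.write("Sitemap: %s/sitemap-index.xml\n" % SITE)

    def write_sitemap_index(path="sitemap-index.xml"):
        with open(path, "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for name in CHILD_SITEMAPS:
                f.write("  <sitemap><loc>%s/%s</loc></sitemap>\n" % (SITE, escape(name)))
            f.write("</sitemapindex>\n")

    def write_sitemap(name, message_urls):
        # Each child sitemap may list at most 50,000 URLs under the protocol,
        # which is why the index level exists at all.
        with open(name, "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in message_urls:
                f.write("  <url><loc>%s</loc></url>\n" % escape(url))
            f.write("</urlset>\n")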

Second, we have a "Browse" link in the footer. It's not very prominent, because it's not especially friendly to humans, but it provides a full navigation tree that's ideal for spiders. The top page links to all hosted lists, each list's page shows the messages by month, and each month page shows the messages for that month.
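
For illustration only, here's roughly what generating that three-level tree could look like. The URL scheme and function names are hypothetical; our server doesn't literally render the pages this way.

    # Sketch of the Browse tree: top page -> list pages -> month pages.
    def render_browse_top(lists):
        # Top page: one link per hosted list.
        links = ['<a href="/browse/%s">%s</a>' % (l, l) for l in lists]
        return "<ul>%s</ul>" % "".join("<li>%s</li>" % a for a in links)

    def render_list_page(list_name, months):
        # List page: one link per month that has archived messages.
        links = ['<a href="/browse/%s/%s">%s</a>' % (list_name, m, m) for m in months]
        return "<ul>%s</ul>" % "".join("<li>%s</li>" % a for a in links)

    def render_month_page(list_name, month, messages):
        # Month page: one link per message -- the leaves a spider actually indexes.
        links = ['<a href="/message/%s">%s</a>' % (mid, subject)
                 for mid, subject in messages]
        return "<ul>%s</ul>" % "".join("<li>%s</li>" % a for a in links)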

Our sitemaps explicitly exclude commit emails (which we view as lower value), but we notice Google still crawls them, so we can deduce its crawler found the Browse link. The crawler also pulls the sitemap files regularly, so it found those as well.

Some people recommend having no more than 100 links on a page, and because we have not implemented any browse paging, it's true that for some months on popular lists we exceed that. In some basic testing, though, Google does appear to index emails at both the top and the bottom of those long lists, so I'm not sure this is the problem.
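
If we ever did add browse paging, one simple approach would be to split a month's messages into pages of at most 100 links each. A sketch, not something we've implemented:

    # Hypothetical browse paging: chunk a month's message list into pages of
    # at most `per_page` links.
    def paginate(messages, per_page=100):
        """Yield (page_number, slice_of_messages) pairs."""
        for i in range(0, len(messages), per_page):
            yield i // per_page + 1, messages[i:i + per_page]
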
The above discussion pertains to Google, but the real challenge right now is how to get the number of our pages that Yahoo (19k) and MSN (4k) have picked up even half as high as Google's count (851k).
