Tuesday, June 10, 2008

Diacritics, or should I say dịẫçritícs

We changed our indexing this week regarding how we handle diacritics -- those accent marks you see on vowels and some consonants in many languages.

Previously we resolved all queries in a diacritic-insensitive manner. That meant a search for "francois" would match both "francois" and "françois", and a search for "françois" would do the same. In our MarkLogic Server configuration, we simply specified that the difference between c and ç be ignored.

Now we've changed the configuration so that diacritic sensitivity depends on the search term. A term containing diacritics triggers a diacritic-sensitive match, while a term without diacritics remains diacritic-insensitive. That means a search for "francois" will match words with and without diacritics (the same as before), but a search for "françois" will respect the ç and won't match "francois" anymore.
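
For the curious, here is a rough sketch of that rule in Python. It is only an illustration of the behavior described above, not how MarkLogic Server implements it. The function names are made up for this post, and the lowercasing step is an assumption added so the example ignores case.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop the combining marks (accents, cedillas, ...)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_diacritics(text: str) -> bool:
    """True if the text contains at least one combining mark once decomposed."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def matches(term: str, word: str) -> bool:
    """Hypothetical matcher: the search term decides the sensitivity."""
    # Assumption: matching is case-insensitive; NFC makes equivalent forms compare equal.
    term_n = unicodedata.normalize("NFC", term.lower())
    word_n = unicodedata.normalize("NFC", word.lower())
    if has_diacritics(term_n):
        # The searcher typed a diacritic, so require an exact match.
        return term_n == word_n
    # Plain term: ignore diacritics on the indexed word.
    return strip_diacritics(term_n) == strip_diacritics(word_n)
```

With this sketch, matches("francois", "françois") is true but matches("françois", "francois") is false. One known limitation: letters such as ø or đ have no Unicode decomposition, so this simple approach treats them as distinct base letters rather than as accented o or d.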

To summarize: If you care enough to type a diacritic, we'll care enough to match it for you!

This is a particularly helpful change now that we've expanded from English-only content into lists written in Japanese, Vietnamese, Spanish, German, Italian, Dutch, Portuguese, Slovak, Polish, and Farsi. We even have one message in Frisian. Who knew!


On a per-message basis, we get more traffic from these lists than from our English lists. Perhaps they're underserved by other email archive systems? Maybe the other systems have trouble hosting messages with non-ASCII characters. We've definitely had trouble finding "clean" historical archive records for non-English lists, ones where the diacritics were reliably preserved. Luckily for us, being built on XML, we have native support for all Unicode characters.

We hope you find the new indexing logic helpful.
