Tuesday, August 19, 2008

How to Shut Down All Your Machines Without Anyone Noticing

Last week we discovered we had to replace some bad memory chips in two of the three machines we use to run the MarkMail service. This blog post tells the story of how we managed to replace these memory chips with (almost) none of our visitors noticing.

Architecture

First, a word about our architecture. The three machines I'm talking about here all run MarkLogic Server. We have some other machines in the overall MarkMail system that do things like queue and handle incoming mail, but they're not directly involved in the web site experience. I'm talking here about the three MarkLogic machines that work together in a cluster and that you interact with when you hit http://markmail.org.

The MarkLogic machines have specialized roles. One machine (I like to picture it up front) listens for client connections. It's responsible for running XQuery script code, gathering answers from the other two machines, and formatting responses. The other two machines manage the stored content, about half on each. They support the front machine by actually executing queries and returning content.

I'll refer to the front machine as E1, which stands for evaluator #1. We don't have an E2 yet but we're planning for that when user load requires. The back-end machines are D1 and D2, which stand for data manager #1 and #2.
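For readers who think in code, here is a rough sketch of that division of labor: one evaluator fans a query out to the data managers and merges what comes back. It's a toy model, not MarkLogic code. The sample messages and the names `D_NODES`, `query_data_node`, and `evaluate` are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the two data managers.
# Each one holds roughly half of the stored content.
D_NODES = {
    "D1": ["hello from list A", "release notes for A"],
    "D2": ["hello from list B", "bug report on B"],
}


def query_data_node(node, term):
    """Run the search on one data manager and return its matching messages."""
    return [msg for msg in D_NODES[node] if term in msg]


def evaluate(term):
    """Play the role of E1: fan the query out, then merge the partial answers."""
    nodes = sorted(D_NODES)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: query_data_node(n, term), nodes))
    # Flatten the per-node results into one sorted response.
    merged = [msg for part in partials for msg in part]
    return sorted(merged)


print(evaluate("hello"))
```

Notice that if one entry disappears from `D_NODES`, the evaluator still answers, just with half the content. That is the failure mode the rest of this story is about avoiding.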

The bad memory was on E1 and D1.

We'll Fix E1 First

We decided to fix E1 first because it's easiest. We gathered the MarkMail team and started at 5pm. That's the time period with our lowest traffic. It's a little counter-intuitive but since we're a global site we're as busy at 2am (Pacific) as we are at 2pm. The time around 5pm Pacific still sees a lot of traffic, but relatively less. Why? We theorize it's because we get the most traffic during the visitor's local business hours, and the 5pm to 8pm Pacific time slot puts the local business hours out in the middle of the Pacific.

The E1 server is important because it catches all requests. Our plan was to place a new host, essentially E2, into the cluster and route all traffic through it instead of E1. There's no state held by the front-end machines, so this is an easy change. We borrowed a machine, added it to the MarkLogic cluster, told it to join the "group" that would make it act like E1, and had our reverse proxy start routing traffic to it instead. We did all this with the MarkLogic web-based administration interface. It was far too easy, frankly.

We immediately saw the E1 access logs go silent and we knew our patient was, in effect, on a heart-lung bypass machine. We told our sysadmin in the colo to proceed.

That's when he told us that on more careful inspection the memory problems were on D1 and D2. The E1 server was just fine. Hmm...

We decided to call the maneuver good practice for later and put things back like we found them.

OK, We'll Fix D1 First

Performing maintenance on a machine like D1 requires more consideration because it's managing content. If we were to just unplug it, the content on the site would appear to be reduced by half. It'd be like winding the clock back to April, with our home page saying we just passed the 10 million message mark.

All email messages go into MarkLogic data structures called "forests". (Get it? Forests are collections of trees, each document being an XML tree.) Our D1 server manages forests MarkMail 1 and MarkMail 2, the oldest two. They're effectively read-only because we're now loading into higher-numbered forests on D2.

Turns out that's a highly convenient fact. It means we could back up the content from D1 and put it on our spare box, now acting like a D3. Then with a single transactional call to MarkLogic we could enable the two backup forests on D3 and disable the two original forests on D1. No one on the outside would see a difference. Zero downtime.
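The idea behind that single transactional call can be sketched as a toy model: the set of enabled forests is replaced in one step, so a reader sees either the old set or the new one, never a half-switched state. This is not MarkLogic's API. The `ClusterConfig` class, its `swap` method, and the `host:forest` naming are assumptions made purely for illustration.

```python
import threading


class ClusterConfig:
    """Toy model: the set of enabled forests is replaced in a single step."""

    def __init__(self, enabled):
        self._lock = threading.Lock()
        self._enabled = frozenset(enabled)

    def enabled(self):
        # Readers get an immutable snapshot: the old set or the new one.
        return self._enabled

    def swap(self, disable, enable):
        """Disable one group of forests and enable another, all or nothing."""
        with self._lock:
            missing = set(disable) - self._enabled
            if missing:
                # Refuse the whole change rather than apply part of it.
                raise ValueError(f"not currently enabled: {sorted(missing)}")
            self._enabled = (self._enabled - set(disable)) | frozenset(enable)


config = ClusterConfig({"D1:MarkMail 1", "D1:MarkMail 2"})

# Switch the two read-only forests from D1 to the copies on the spare box.
config.swap(
    disable={"D1:MarkMail 1", "D1:MarkMail 2"},
    enable={"D3:MarkMail 1", "D3:MarkMail 2"},
)
print(sorted(config.enabled()))
```

The switch-back after the repair is the same call with the two arguments reversed.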

It worked great! It took a few hours to copy things because it's hundreds of gigs of messages, but like a chef on TV we knew what we were going to need for showtime and prepared things in advance.

With the new memory chip placed in D1 we did a transactional switch-back, put the two original forests back into service and had the spare box unused again, ready to help with D2.

We Need an Alternate Approach for D2

Had we planned in advance to work on D2 we probably would have followed the same "use a backup forest" approach we used to work on D1 because it allowed for zero downtime. It would have required pushing ingestion activities to another machine like D1 so the forests could settle down and be read-only, but that's done easily enough. We didn't do this, however, because we were too impatient to wait for the data to copy between machines. Instead we decided to leave the data in place and do a SAN mount switch.

We host all our forest content on a big SAN (a storage area network, basically a bunch of drives working together to act like a big fast disk). All the data managing machines (D1, D2, and the spare acting as D3) have access to it. Usually we partition things into individual mount points so they can't step on each other's toes and corrupt things. You never want two databases to operate against the same data! Here we decided to remove the isolation. We'd have D2 "detach" the MarkMail 3 and MarkMail 4 forests and have our spare machine (acting like D3) quickly "attach" them. We would essentially transfer a few hundred gigs in seconds.
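The rule that matters in a move like this is that a forest is attached to at most one machine at any moment. Here is a minimal sketch of that invariant, again as a toy model and not MarkLogic's API. `SharedStorage`, `attach`, `detach`, and `move_forests` are hypothetical names.

```python
class SharedStorage:
    """Toy model of forests on shared storage, each attached to at most one host."""

    def __init__(self):
        self._owner = {}  # forest name -> host currently attached to it

    def attach(self, forest, host):
        current = self._owner.get(forest)
        if current is not None:
            # Two hosts operating on the same data is exactly what we must avoid.
            raise RuntimeError(f"{forest} is still attached to {current}")
        self._owner[forest] = host

    def detach(self, forest, host):
        if self._owner.get(forest) != host:
            raise RuntimeError(f"{forest} is not attached to {host}")
        del self._owner[forest]

    def owner(self, forest):
        return self._owner.get(forest)


def move_forests(storage, forests, source, target):
    """Detach every forest from the source host, then attach each to the target."""
    for forest in forests:
        storage.detach(forest, source)
    # Between these two loops the forests are offline; no data is copied.
    for forest in forests:
        storage.attach(forest, target)


storage = SharedStorage()
for name in ("MarkMail 3", "MarkMail 4"):
    storage.attach(name, "D2")

move_forests(storage, ["MarkMail 3", "MarkMail 4"], source="D2", target="D3")
print(storage.owner("MarkMail 3"), storage.owner("MarkMail 4"))
```

The gap between the detach loop and the attach loop is where the content is unavailable, which is why the step below involves a choice about downtime.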

This system change couldn't be made transactionally, so we had a decision to make: Is it better to turn off the MarkMail site for a short time or let the world see a MarkMail with only half its content? We decided to just turn off the site. Our total downtime for the switch was 43 seconds going over and just over a minute coming back after the memory change.

We think we could do it faster next time with some optimizations in the MarkLogic configuration -- turning off things like index compatibility checks, which we know we don't need. Maybe 20 seconds, or even 15.

The Moral

Looking back, we're happy that we could cycle through disabling every machine in our MarkLogic cluster yet not have any substantial downtime. Looking forward, we expect operations like this will get easier. If and when we add a permanent E2 machine to the cluster, we won't have to do anything special to take either front-end machine out of commission. Our load balancer will just automatically route around any unresponsive front-end servers. We were also happy to see that our configuration for SAN-based manual failover works. We proved that as long as another machine can access the SAN, we'll be able to bring the content back online should a back-end machine fail.
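As a rough illustration of what "route around" means, here is a sketch of round-robin selection that skips front-end servers failing a health check. It doesn't describe any particular load balancer. The `pick_backend` function and the way health is reported are assumptions for the example.

```python
def pick_backend(backends, is_healthy, counter):
    """Round-robin over the backends, skipping any that fail the health check."""
    for offset in range(len(backends)):
        candidate = backends[(counter + offset) % len(backends)]
        if is_healthy(candidate):
            return candidate
    raise RuntimeError("no healthy front-end server available")


front_ends = ["E1", "E2"]
down = {"E1"}  # pretend E1 is out for a memory swap

# Every request lands on E2 while E1 is unresponsive.
choices = [pick_backend(front_ends, lambda host: host not in down, n) for n in range(4)]
print(choices)
```

Because the front-end machines hold no state, nothing else has to change when one of them drops out of the rotation.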

Everyone on the MarkMail team works at Mark Logic, the company that makes the core technology that powers our site. In fact, in years past some of us have been directly involved in building the technology. But despite our familiarity, we were still delighted to take the production MarkLogic cluster out for a walk and get it to do tricks. It did the right thing time after time with every disconnect and reconnect and reconfiguration, and we couldn't help but feel a touch of pride. This is some fun software! If you're a Mark Logic customer, we trust you know what we mean.

A non-techie friend once asked why managing a high-uptime web site was hard. I said, "It's like we're driving from California to New York and we're not allowed to stop the car. We have to fill the gas tank, change the tires, wash the windows, and tune the engine but never reduce our speed. And really, because we're trying to add new features and load new content as we go, we need to leave California driving a Mini Cooper S and arrive in New York with a Mercedes ML320."

So far so good! Here's to the long roads ahead...

2 comments:

Juergen said...

Nice Story! ;-)

Jukka said...

Sounds fun, and it's nice to read stories about global services that _don't_ need a cloud or a huge cluster of redundant servers to achieve good uptime and response times. Keep up the good work!

BR,

Jukka Zitting