OMG! GMAIL FAILED!

Sep 2, 2009 at 8:56 am

Yesterday Gmail went down for about two-and-a-half hours. So many people went to Twitter to kvetch about it that it nearly took down that site as well. To be accurate, people who use Gmail via an email client like Apple Mail or Microsoft Outlook had no problems getting their email. It was the web interface that went down. The POP and IMAP servers didn't seem to skip a beat.

Google made the following statement via their Official Gmail Blog explaining the outage and what they are doing to prevent this from happening again:


Here's what happened: This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem -- we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.

However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers -- servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.

The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google's architecture), distributed the traffic across the request routers, and the Gmail web interface came back online.

What's next: We've turned our full attention to helping ensure this kind of event doesn't happen again. Some of the actions are straightforward and are already done -- for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle -- for example, we have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements -- Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity.
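The chain reaction Google describes, where overloaded request routers refuse traffic and push their load onto the rest, can be sketched with a toy model. This is only an illustration and not Google's actual system: the function name, the router capacities, and the load figures are all made up, and the model assumes traffic is split evenly across whichever routers are still accepting it.

```python
def surviving_routers(capacities, total_load):
    """Toy model of load shedding (hypothetical, not Google's implementation).

    Traffic is split evenly across active routers. Any router whose share
    exceeds its capacity stops accepting traffic, and the rest absorb it.
    Returns the number of active routers after each round.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        share = total_load / len(active)
        still_ok = [c for c in active if c >= share]
        if len(still_ok) == len(active):
            break  # nobody is overloaded, so the system is stable
        active = still_ok
        history.append(len(active))
    return history


# Assumed numbers: eight routers that can handle 100 units each, two that handle 90.
routers = [100] * 8 + [90] * 2

# With headroom, each router gets 85 units and nothing happens.
print(surviving_routers(routers, 850))

# With slightly more load, the two weaker routers drop out first,
# and the extra traffic then overloads every remaining router.
print(surviving_routers(routers, 950))
```

In this sketch the total capacity (980 units) is actually larger than the heavier load (950 units), yet the whole pool still collapses because routers drop out instead of slowing down. That is the behavior Google says it wants to replace with graceful degradation.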

Of course, if you are using Gmail for critical business communications, this event might have made you think twice about doing so, or at the very least consider alternatives to Gmail. But to be fair, no email system is foolproof. Still, there are things you can do to avoid getting burned in the future. Here are a couple of suggestions:

  • Set up Gmail to work with a desktop client. This is fairly easy to do and Google has instructions here on how to do it. The advantage of using a desktop client like Outlook, Mail, or Thunderbird is that all your mail gets downloaded from the server and stored locally on your machine. So even if Gmail or Yahoo Mail goes down for a couple of hours, you still have access to most of your mail. Even if you use the web interface most of the time, it is still a good idea to set this up and sync the client every so often as a local backup of your web-based email. There have been cases of people having their free web-mail accounts deleted, so this is a good hedge against that happening.
  • Get your own domain and mail hosting. If you use your own hosting and your own domain for your email address, you aren't as dependent on Gmail or another service for email. And that doesn't mean you have to use one or the other. Your hosted email and Gmail can complement each other. If your host's mail server goes down, you still have Gmail. If Gmail goes down, you still have your hosted email on your own domain. This does have a cost, but if your email is critical to you, it's worth it. This link explains how to use email with your email on your own domain.
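If you would rather script the local backup than rely on a desktop client, here is a minimal sketch using Python's standard imaplib module. It assumes IMAP access is enabled on the account and that the credentials you supply are accepted for IMAP login. The function names and the backup folder are hypothetical, and the files are named after IMAP sequence numbers, which are not stable between sessions, so treat this as a starting point and not a finished backup tool.

```python
import imaplib
from pathlib import Path


def backup_mailbox(conn, mailbox, dest_dir):
    """Save every message in `mailbox` as a .eml file; return how many were saved.

    `conn` is an already logged-in imaplib connection (or any object with the
    same select/search/fetch methods).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Open read-only so the backup never changes message flags on the server.
    status, _ = conn.select(mailbox, readonly=True)
    if status != "OK":
        raise RuntimeError(f"could not open mailbox {mailbox!r}")

    status, data = conn.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("search failed")

    saved = 0
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue  # skip messages the server would not return
        raw_message = parts[0][1]  # raw bytes of the full message
        (dest / f"{num.decode()}.eml").write_bytes(raw_message)
        saved += 1
    return saved


def backup_gmail_inbox(user, password, dest_dir="gmail-backup"):
    """Hypothetical convenience wrapper: log in to Gmail over IMAP and back up INBOX."""
    conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    try:
        conn.login(user, password)
        return backup_mailbox(conn, "INBOX", dest_dir)
    finally:
        conn.logout()
```

Calling backup_gmail_inbox with your address and password would write one .eml file per message into the gmail-backup folder, and most desktop mail clients can open those files directly.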

Now, you can use both of these methods together. And if you do, you can add one more layer of protection by making sure you give all your contacts both your email address on your own domain and your Gmail address. That way, if one or the other fails, your contacts can still get in touch with you.

Of course, doing these things won't protect you if both your web host and Gmail go down at the same time. But what are the chances of that happening? And if you are using Gmail for critical business communications, there isn't any reason you shouldn't be doing this.
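For a rough sense of those chances, here is a back-of-the-envelope calculation. It takes the 99.9% availability figure from Google's statement as a floor for Gmail, assumes (purely for illustration) that your web host manages the same 99.9%, and assumes the two fail independently of each other. Real outages are not always independent, so read the result as an optimistic estimate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def both_down_probability(availability_a, availability_b):
    """Chance that two independent services are down at the same moment."""
    return (1 - availability_a) * (1 - availability_b)


def expected_downtime_minutes(probability_down):
    """Expected minutes per year spent in a state with the given probability."""
    return probability_down * MINUTES_PER_YEAR


gmail = 0.999  # from Google's statement ("more than 99.9% available")
host = 0.999   # assumption: your own mail host is about as reliable

one_down = expected_downtime_minutes(1 - gmail)
both_down = expected_downtime_minutes(both_down_probability(gmail, host))

print(f"Gmail alone unavailable: about {one_down / 60:.1f} hours per year")
print(f"Both unavailable at once: about {both_down * 60:.0f} seconds per year")
```

Under these assumptions, relying on a single 99.9% service means roughly nine hours a year without email, while having both fail at once works out to around half a minute a year.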