As some of you know, MeloTel suffered an outage for a total of 64 minutes this morning. We take the reliability of our systems very seriously, and are writing this to give you full disclosure on what happened, what we did wrong, what we did right, and what we’re doing to help prevent this in the future.
We also want to let you know that we are very sorry this outage happened. We have been working hard over the past 12 months on re-engineering our systems to be fully fault tolerant. We are tantalizingly close, but not quite there yet. Read on for the full details and steps we are taking to make sure this never happens again.
MeloTel became aware of a looming problem at 10:04am EST, and our services became inaccessible at 10:44am EST this morning.
At 11:01am EST we identified a spike in IOPS usage on our infrastructure but were not able to determine its cause.
At 11:08am EST (24 minutes after the start of the outage), Amazon Support Engineers were engaged to investigate the IOPS slowdown.
At 11:18am EST Amazon Support Engineers informed us that our gp2 volume had exhausted its burst credits at approximately 8:55am EST, which meant the volume had dropped back to its baseline speed. This was causing significant performance issues.
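For context, a gp2 volume earns burst credits at its baseline rate of 3 IOPS per provisioned GiB (with a 100 IOPS minimum) and spends them whenever it bursts above that rate, up to 3,000 IOPS. Once the credit bucket empties, the volume falls back to baseline. Amazon publishes the remaining credits as the BurstBalance metric in CloudWatch. Here is a rough sketch of how that balance can be pulled with boto3; the region and volume ID are placeholders, not our actual values:

    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    # BurstBalance is reported per EBS volume as a percentage (100 = full bucket).
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=3),
        EndTime=datetime.datetime.utcnow(),
        Period=300,            # one sample every 5 minutes
        Statistics=["Average"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f'{point["Average"]:.1f}%')

A steadily declining BurstBalance over the morning is exactly the kind of signal that would have warned us before the credits ran out.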
At 11:36am EST we followed the instructions provided by Amazon to increase the size of our gp2 volume, which would immediately give us a higher IOPS baseline and faster burst credit accrual. The resize started right away, but the optimization would continue working behind the scenes for some time.
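For those curious, the resize itself is a single API call, and the background optimization can then be tracked until it completes. A rough sketch with boto3; the region, volume ID, and target size are illustrative placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    # Growing a gp2 volume raises both its baseline IOPS and its credit accrual rate.
    ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=1000)  # placeholder size in GiB

    # The change takes effect immediately, but the volume remains in an "optimizing"
    # state for a while; this call reports progress on that background work.
    mods = ec2.describe_volumes_modifications(VolumeIds=["vol-0123456789abcdef0"])
    for mod in mods["VolumesModifications"]:
        print(mod["ModificationState"], mod.get("Progress"))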
At 11:48am EST all services were fully restored.
64 minutes seems like a long time between when our outage began and when we resolved it. And it is.
We use multiple external monitoring systems to monitor MeloTel and alert all of us when there are issues. After careful examination, we found that none of these systems were monitoring Amazon instance IOPS. As a result, we didn't see this coming and responded to the incident later than we should have.
This is obviously an action item for us to remedy. These minutes count, and we know they are very important to you. We will look at switching or augmenting our monitoring systems as soon as possible.
We’ve previously taken steps to mitigate these kinds of large-scale EC2 events when they happen.
One such step is our fallback environment in a second AWS Availability Zone. It is an (expensive) solution to a rare problem. We regularly run internal fire drills where we test and practice the procedure to flip to this environment. We will continue these drills.
We are implementing additional monitoring of EBS IOPS burst credits in Amazon CloudWatch to ensure that if we ever approach a threshold like this again we can be proactive instead of reactive.
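As a sketch of the direction we are taking, an alarm on the BurstBalance metric can page us well before the credit bucket empties. The alarm name, threshold, and SNS topic below are illustrative placeholders, not our final configuration:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    # Alert when the volume's burst credit bucket drops below 30% so we can act
    # long before it hits zero and the volume falls back to baseline speed.
    cloudwatch.put_metric_alarm(
        AlarmName="ebs-burst-balance-low",  # placeholder name
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=30.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
    )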