As some of you know, MeloTel suffered an outage for a total of 64 minutes this morning. We take the reliability of our systems very seriously, and are writing this to give you full disclosure on what happened, what we did wrong, what we did right, and what we’re doing to help prevent this in the future.
We also want to let you know that we are very sorry this outage happened. We have been working hard over the past 12 months on re-engineering our systems to be fully fault tolerant. We are tantalizingly close, but not quite there yet. Read on for the full details and steps we are taking to make sure this never happens again.
MeloTel became aware of a looming problem at 10:04am EST, and our services became inaccessible at 10:44am EST this morning.
At 11:01am EST we identified a spike in IOPS usage on our infrastructure but were not able to determine its cause.
At 11:08am EST (24 minutes after the start of the outage), Amazon Support Engineers were engaged to investigate the IOPS slowdown.
At 11:18am EST Amazon Support Engineers informed us that our gp2 volume had exhausted its burst credits at approximately 8:55am EST, which meant the volume had dropped back to its baseline speed. This was causing significant performance issues.
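For context, a gp2 volume earns burst credits at its baseline rate of 3 IOPS per provisioned GiB (with a 100 IOPS minimum) and spends them whenever it bursts above that rate, up to 3,000 IOPS. Once the credit bucket empties, the volume falls back to baseline. Amazon publishes the remaining credits as the BurstBalance metric in CloudWatch. Here is a rough sketch of how that balance can be pulled with boto3; the region and volume ID are placeholders, not our actual values:

    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    # BurstBalance is reported per EBS volume as a percentage (100 = full bucket).
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=3),
        EndTime=datetime.datetime.utcnow(),
        Period=300,            # one sample every 5 minutes
        Statistics=["Average"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f'{point["Average"]:.1f}%')

A steadily declining BurstBalance over the morning is exactly the kind of signal that would have warned us before the credits ran out.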
At 11:36am EST we followed the instructions provided by Amazon to increase the size of our gp2 volume, which would immediately give us a higher IOPS baseline and faster burst credit accrual. The resize started right away, but the optimization would continue working behind the scenes for some time.
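For those curious, the resize itself is a single API call, and the background optimization can then be tracked until it completes. A rough sketch with boto3; the region, volume ID, and target size are illustrative placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    # Growing a gp2 volume raises both its baseline IOPS and its credit accrual rate.
    ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=1000)  # placeholder size in GiB

    # The change takes effect immediately, but the volume remains in an "optimizing"
    # state for a while; this call reports progress on that background work.
    mods = ec2.describe_volumes_modifications(VolumeIds=["vol-0123456789abcdef0"])
    for mod in mods["VolumesModifications"]:
        print(mod["ModificationState"], mod.get("Progress"))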
At 11:48am EST all services were fully restored.
64 minutes seems like a long time between when our outage began and when we resolved it. And it is.
We use multiple external monitoring systems to monitor MeloTel and alert all of us when there are issues. After careful examination, we found that none of these systems were monitoring Amazon instance IOPS. As a result, we didn't see this coming and responded to the incident later than we should have.
This is obviously an action item for us to remedy. These minutes count, and we know they are very important to you. We will look at switching or augmenting our monitoring systems as soon as possible.
We’ve previously taken steps to mitigate these kinds of large-scale EC2 events when they happen.
One such step is our fallback environment in a second AWS Availability Zone. It is an (expensive) solution to a rare problem. We regularly run internal fire drills where we test and practice the procedure to flip to this environment. We will continue these drills.
We are implementing additional monitoring of EBS IOPS burst credits in Amazon CloudWatch to ensure that if we ever approach a threshold like this again we can be proactive instead of reactive.
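As a sketch of the direction we are taking, an alarm on the BurstBalance metric can page us well before the credit bucket empties. The alarm name, threshold, and SNS topic below are illustrative placeholders, not our final configuration:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    # Alert when the volume's burst credit bucket drops below 30% so we can act
    # long before it hits zero and the volume falls back to baseline speed.
    cloudwatch.put_metric_alarm(
        AlarmName="ebs-burst-balance-low",  # placeholder name
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=30.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
    )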