Incoming calls to some DIDs Failing

Incident Report for MeloTel Network Operations (NOC)

Postmortem

We have received a postmortem from our underlying carrier (ULC).

To our valued customers,

Yesterday between 11:47 ET and 15:48 ET, we experienced a routing issue on one of the five SBC clusters serving our Wholesale Trunking product. The impact was limited to customers on this cluster, whose DID calls were affected. The root cause was human error on the part of the engineering team in validating a software upgrade on our SBC provisioning system, not on the SBCs themselves. The vendor software upgrade was completed on Thursday, May 23rd, 2019. Yesterday at 11:47 ET, during a routine Least-Cost-Routing (LCR) push to the affected cluster, an incorrect system setting resulting from the upgrade caused this LCR update to render a large amount of DID routing data ineffective.

We have a mature and well-defined network change control process that identifies risks and builds out a set of test cases to mitigate risk severity by reducing time-to-detect and time-to-resolve. However, it did not capture and account for such severe negligence on the part of our engineering team. We executed our test plans, which, unfortunately, did not include verifying that our global configuration settings maintained correct values after the upgrade. Once we identified the root cause, we attempted to restore the system configuration for the SBC provisioning server and re-apply the LCR update; however, this didn’t work correctly on the new major release. We then moved on to a complete system restoration from an SBC DB backup taken yesterday morning at 04:00 ET.
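As an illustration only, the missing post-upgrade check described above could look something like the following minimal sketch. It assumes the global configuration settings can be exported as simple key/value snapshots before and after an upgrade; the function name and the setting keys are hypothetical and are not taken from the actual SBC provisioning system.

```python
# Sketch of a post-upgrade configuration check (hypothetical names throughout).
# Assumption: settings can be exported as flat key/value dictionaries.

ABSENT = "<absent>"  # marker for a setting that exists in only one snapshot


def diff_settings(before: dict, after: dict) -> dict:
    """Return {key: (old_value, new_value)} for every setting that differs."""
    drift = {}
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, ABSENT)
        new = after.get(key, ABSENT)
        if old != new:
            drift[key] = (old, new)
    return drift


# Hypothetical snapshots taken before and after a provisioning-system upgrade.
pre_upgrade = {"lcr.apply_mode": "merge", "did.routing_enabled": True}
post_upgrade = {"lcr.apply_mode": "replace", "did.routing_enabled": True}

drift = diff_settings(pre_upgrade, post_upgrade)
if drift:
    # A non-empty result would block further pushes until it is reviewed.
    for key, (old, new) in drift.items():
        print(f"setting changed: {key}: {old!r} -> {new!r}")
```

A check of this kind would run as one step of the upgrade test plan, so that any setting silently changed by the upgrade is flagged before the next routine LCR push.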

Improvement Plan

We are committed to delivering the highest quality VoIP Trunking service to our customers. We recognize that yesterday’s service impact reflects poorly on us, and we sincerely apologize. We are restructuring our change control committee and adding more engineers and more checkpoints so that we don’t miss potential risk factors, even when executing what should be completely innocuous maintenance.

Regards,

Comwave

Posted May 28, 2019 - 09:32 EDT

Resolved

Per our vendor's confirmation, the issue was resolved at 3:47 PM ET.

They are in the process of investigating further to determine the root cause of the issue.

Posted May 27, 2019 - 15:58 EDT

Update

Our underlying carrier has advised us that they have identified the issue and are presently working on resolving it.

They estimate that this should be fixed within the hour. Should it not be fixed within the hour, they/we will provide further updates.

Posted May 27, 2019 - 15:26 EDT

Identified

Our underlying carrier has confirmed they are experiencing an outage at their network operations center. We are awaiting updates and will provide details as soon as they are available. There is no estimated time to resolution at this time. We are monitoring this very closely.

Posted May 27, 2019 - 14:19 EDT

Investigating

We have several reports of customers not being able to receive calls to their incoming telephone numbers. We have a critical ticket open with the underlying carrier for those telephone numbers. Updates will follow.

Posted May 27, 2019 - 13:59 EDT

This incident affected: VoIP Services (Canadian DIDs, USA DIDs, Toll Free DIDs).