We have received a Postmortem from our ULC.
To our valued customers,
Yesterday between 11:47 ET and 15:48 ET, we experienced a routing issue on one of five SBC clusters serving our Wholesale Trunking product. The impact was confined to customers on this cluster, whose DID calls were affected. The root cause was human error on the part of the engineering team in validating a software upgrade on our SBC provisioning system, not the SBCs themselves. The vendor software upgrade was completed on Thursday, May 23rd, 2019. Yesterday at 11:47 ET, during a routine Least-Cost-Routing (LCR) push to the affected cluster, an incorrect system setting resulting from the upgrade caused the LCR update to render a large amount of DID routing data ineffective.
We have a mature and well-defined network change control process that identifies risks and builds out a set of test cases to mitigate risk severity by reducing time-to-detect and time-to-resolve; however, it did not capture and account for such severe negligence on the part of our engineering team. We executed our test plans, which unfortunately did not include verifying that our global configuration settings maintained correct values after the upgrade. Once we identified the root cause, we attempted to restore the system configuration for the SBC provisioning server and re-apply the LCR update; however, this did not work correctly on the new major release. We then moved on to a complete system restoration from an SBC DB backup taken yesterday morning at 04:00 ET.
We are committed to delivering the highest quality VoIP Trunking service to our customers. We recognize that yesterday’s service impact reflects poorly on us, and we sincerely apologize. We are working on restructuring our change control committee and adding more engineers and more checkpoints so that we don’t miss potential risk factors, even when executing what should be completely innocuous maintenance.