Firewall Upgrade Failure at Optus Leads to Emergency Call Outage and Fatalities
==============================================================================
In September, Australian telco Optus suffered a significant network failure during a routine firewall upgrade. The failure left the company unable to route some emergency calls, and two people died. An independent report identifies a series of critical mistakes by technicians, including at least ten errors made during the upgrade process.
Overview of the Incident
----------------------------
On September 18, for approximately 14 hours, Optus could not route some 000 emergency calls, leaving 455 calls unanswered. The company was initially unaware of the issue and discovered it only through customer complaints. Two people died after their emergency calls could not be connected in time.
Findings of the Investigation
------------------------------
The report, authored by Dr. Kerry Schott, reveals that Optus had successfully completed 15 out of 18 planned firewall upgrades without issues. The problems arose on the 16th upgrade when multiple oversight failures occurred:
- Incorrect Instructions to Nokia: Optus provided flawed guidance to its outsourced provider, Nokia, leading to incorrect procedures being followed.
- Lack of Staff Attention and Attendance: Network engineers failed to attend critical meetings that discussed the upgrade’s impact.
- Unusual Device Isolation Procedures: Changes intended to isolate devices and lock gateways, which had not been used in previous upgrades, caused traffic redirection failures.
- Use of Outdated Methodology: Nokia used a 2022 Method of Procedure that was unsuitable for the current upgrade.
- Misclassification of the Upgrade's Impact: The task was incorrectly labeled as having no network impact, leading to bypassing proper reviews.
- Insufficient Monitoring and Data Granularity: Post-implementation checks showed rising call failure rates, but these signals were not recognized or acted upon swiftly. Aggregate data was too broad to detect localized issues early.
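
The granularity problem in the last finding can be illustrated with a minimal sketch. It is not drawn from the report or from Optus' actual monitoring: the region names, call counts, and the 5% alert threshold are all hypothetical, chosen only to show how a network-wide failure rate can look healthy while one small region is failing badly.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Call counts for one region over a monitoring window (hypothetical data)."""
    region: str
    attempted: int
    failed: int

    @property
    def failure_rate(self) -> float:
        # Guard against regions with no call attempts in the window.
        return self.failed / self.attempted if self.attempted else 0.0


def aggregate_failure_rate(stats: list[RegionStats]) -> float:
    """Failure rate across all regions combined."""
    attempted = sum(s.attempted for s in stats)
    failed = sum(s.failed for s in stats)
    return failed / attempted if attempted else 0.0


def regions_over_threshold(stats: list[RegionStats], threshold: float) -> list[str]:
    """Names of regions whose own failure rate exceeds the threshold."""
    return [s.region for s in stats if s.failure_rate > threshold]


# Hypothetical figures: two large healthy regions and one small failing region.
SAMPLE = [
    RegionStats("region_a", attempted=10_000, failed=50),
    RegionStats("region_b", attempted=8_000, failed=40),
    RegionStats("region_c", attempted=200, failed=120),
]
THRESHOLD = 0.05  # assumed alert threshold, not taken from the report

if __name__ == "__main__":
    print(f"aggregate failure rate: {aggregate_failure_rate(SAMPLE):.3%}")
    print("regions over threshold:", regions_over_threshold(SAMPLE, THRESHOLD))
```

With these assumed numbers, the combined failure rate stays well under the threshold because the two large regions dominate the total, while the per-region check still flags `region_c`. An aggregate-only dashboard would show nothing unusual.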
Additional Issues Revealed
-----------------------------
The report criticizes Optus' call center for not recognizing the early warning signs of call failures, which could have served as an initial alert about the outage. It also mentions the difficulty of routing 000 calls during outages, complicated by device incompatibilities and the lack of comprehensive testing for all smartphone models, especially those bought online or overseas.
Recommendations and Criticism
-----------------------------
Schott recommends that Optus and other Australian telcos move away from siloed operations and strengthen their crisis management procedures. The report condemns the execution of the firewall upgrade as "inexcusable," calling for greater discipline and supervision among network teams and providers like Nokia.
Related Issues and Broader Impact
-------------------------------------
- Outdated devices, including Samsung handsets, contributing to emergency call failures.
- Calls for the telecom industry to avoid shaming providers during outages.
- Ongoing efforts by all Australian telcos to identify and rectify network vulnerabilities, including comprehensive testing of consumer devices.
Summary
-----------
This incident underscores the importance of meticulous planning, execution, and monitoring in critical infrastructure upgrades. It also highlights the need for better communication channels and testing procedures to prevent tragedies resulting from technical failures.