Microsoft says this week’s five-hour Microsoft 365 worldwide outage was caused by a router IP address change that led to packet forwarding problems between all other routers in its Wide Area Network (WAN).
Redmond said at the time that the outage resulted from DNS and WAN configuration issues caused by a WAN update, and that users across all regions served by the impacted infrastructure were having trouble accessing the affected Microsoft 365 services.
The issue led to service impact in waves peaking approximately every 30 minutes as shared on the Microsoft Azure service status page (this status page was also impacted as it intermittently displayed “504 Gateway Timeout” errors).
The list of services affected by the outage included Microsoft Teams, Exchange Online, Outlook, SharePoint Online, OneDrive for Business, Power BI, Microsoft 365 Admin Center, Microsoft Graph, Microsoft Intune, Microsoft Defender for Cloud Apps, and Microsoft Defender for Identity.
In total, it took Redmond over five hours to fix the problem, from 07:05 UTC, when it began investigating, until 12:43 UTC, when service was restored.
“Between 07:05 UTC and 12:43 UTC on January 25, 2023, customers experienced network connectivity issues manifesting as long network latency and/or timeout when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services, including Microsoft 365 and Power Platform,” Microsoft said in a preliminary post-incident report published today.
“While most regions and services were restored by 09:00 UTC, intermittent packet loss issues were fully resolved by 12:43 UTC. This incident also affected Azure Government cloud services that rely on the Azure public cloud.”
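The timestamps in the report can be turned into durations with a few lines of Python. This is only an illustrative sketch: the helper name `elapsed` is made up here, and it assumes both times fall on the same UTC day, as they do in the report.

```python
from datetime import datetime


def elapsed(start: str, end: str):
    """Return the time between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Times quoted in Microsoft's preliminary post-incident report (all UTC).
total = elapsed("07:05", "12:43")  # start of impact to full resolution
bulk = elapsed("07:05", "09:00")   # start of impact to most services restored

print(total)  # 5:38:00
print(bulk)   # 1:55:00
```

In other words, most regions were back in under two hours, while the remaining intermittent packet loss stretched the incident to roughly five hours and 38 minutes.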
We have confirmed that the affected services have been restored and remain stable. We are investigating a potential impact to the Exchange Online service. Additionally, updates on the Exchange investigation will be available in your admin center under SI# EX502694.
— Microsoft 365 Status (@MSFT365Status) 25 January 2023
Microsoft has now also revealed that the problem was triggered when changing the IP address of a WAN router using a command that had not been thoroughly vetted and behaves differently on different network devices.
“As part of a planned change to update the IP address of a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, resulting in them all recalculating their neighbor and forwarding tables,” Microsoft said.
“During this recalculation process, the routers were unable to properly forward packets traversing them.”
While the network began to recover from 08:10 UTC, the automated systems responsible for maintaining Wide Area Network (WAN) health paused due to the impact on the network.
These systems included those for identifying and eliminating unhealthy devices as well as traffic engineering systems for optimizing data flow across the network.
As a result of this pause, some network paths continued to experience increased packet loss from 09:35 UTC until the systems were manually restarted, returning the WAN to optimal operating conditions and completing the recovery process at 12:43 UTC.
Following this incident, Microsoft says it is now blocking highly impactful commands from being executed, and will also require all command execution to follow guidelines for secure configuration changes.
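Microsoft has not published how this blocking works. As a rough illustration of the idea only, a pre-execution guard could refuse commands on a deny list and reject any change that has not been reviewed and tested on the target device type. Everything in the sketch below is hypothetical: the command strings are placeholders rather than real router CLI syntax, and the `ChangeRequest` fields and `check_change` function are invented for this example.

```python
from dataclasses import dataclass

# Hypothetical deny list: placeholder prefixes, not real vendor commands.
HIGH_IMPACT_PREFIXES = ("set wan-router-address", "reset routing-tables")


@dataclass
class ChangeRequest:
    device: str
    command: str
    reviewed: bool = False            # assumed flag: change passed review
    tested_on_platform: bool = False  # assumed flag: verified on this device type


def check_change(req: ChangeRequest):
    """Return (allowed, reason) for a proposed configuration command."""
    cmd = req.command.strip().lower()
    # Block anything that matches the high-impact deny list outright.
    if any(cmd.startswith(prefix) for prefix in HIGH_IMPACT_PREFIXES):
        return False, "blocked: high-impact command"
    # Everything else must still follow the change guidelines.
    if not req.reviewed:
        return False, "rejected: change has not been reviewed"
    if not req.tested_on_platform:
        return False, "rejected: not tested on this device platform"
    return True, "allowed"


print(check_change(ChangeRequest("wan-router-1", "set wan-router-address 10.0.0.1", True, True)))
print(check_change(ChangeRequest("wan-router-1", "show interfaces", True, True)))
```

The per-platform test flag reflects the detail in the report that the same command behaves differently on different network devices, so a check on one device type would not necessarily cover another.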