Someone (who can self-identify if desired) shared Google's summary of the recent email outage (PDF). This is the outage that caused my address (and many others) to start returning permanent bounce messages.
Background: The Gmail SMTP inbound service uses a configuration system that allows specific service options and flags to be changed while the service is already deployed in production. The "gmail.com" domain name is specified as one of these configuration options. An ongoing migration was in effect to update this underlying configuration system to meet Google internal best practices.
A configuration change during this migration shifted the formatting behavior of a service option so that it incorrectly provided an invalid domain name, instead of the intended "gmail.com" domain name, to the Gmail SMTP inbound service. As a result, the service incorrectly transformed lookups of certain email addresses ending in "@gmail.com" into non-existent email addresses. When the Gmail user accounts service checked each of these non-existent email addresses, the service could not detect a valid user, resulting in SMTP error code 550.
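To make that failure mode concrete, here is a toy sketch in Python. Everything in it is hypothetical: the function names, the user table, and the quote-adding formatter are my own illustrative assumptions. The summary does not say what the invalid domain value actually looked like, and this is certainly not Google's code.

```python
# Toy model of the failure described above. All names and the specific
# formatting bug (added quotes) are hypothetical, chosen only for illustration.
USERS = {"alice@gmail.com", "bob@gmail.com"}


def canonical_address(local_part: str, configured_domain: str) -> str:
    """Build the address used for the account lookup from the configured domain."""
    return f"{local_part}@{configured_domain}"


def smtp_rcpt_status(local_part: str, configured_domain: str) -> int:
    """Return 250 if the recipient exists, else 550 (mailbox unavailable)."""
    address = canonical_address(local_part, configured_domain)
    return 250 if address in USERS else 550


def old_format(value: str) -> str:
    """Pre-migration rendering: the option value is passed through unchanged."""
    return value


def new_format(value: str) -> str:
    """Assumed post-migration bug: string options are wrapped in quotes."""
    return f'"{value}"'


print(smtp_rcpt_status("alice", old_format("gmail.com")))
print(smtp_rcpt_status("alice", new_format("gmail.com")))
```

The first lookup finds a valid user; the second builds an address with a mangled domain, matches nobody, and returns 550. Since 550 is a permanent failure, sending servers bounce the message immediately instead of retrying.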
To guard against the issue recurring and to reduce the impact of similar events, we are taking the following actions:
- Update the existing configuration difference tests to detect unexpected changes to the SMTP service configuration before applying the change.
- Improve internal service logging to allow more accurate and faster diagnosis of similar types of errors.
- Implement additional restrictions on configuration changes that may affect production resources globally.
- Improve static analysis tooling for configuration differences to more accurately project differences in production behavior.
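The first item in that list, a configuration difference test, might look something like the sketch below. It assumes the old and new configurations can each be rendered to a flat dictionary before rollout; `config_diff`, `check_rollout`, and the option names are hypothetical, not anything from the summary.

```python
# Minimal sketch of a configuration difference test. Assumes both configs can
# be rendered to flat dicts before rollout; all names here are hypothetical.
def config_diff(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = old.keys() | new.keys()
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }


def check_rollout(old: dict, new: dict, expected_changes: set) -> None:
    """Raise ValueError if any key changed that was not declared as expected."""
    unexpected = set(config_diff(old, new)) - expected_changes
    if unexpected:
        raise ValueError(f"unexpected config changes: {sorted(unexpected)}")


old_cfg = {"smtp_domain": "gmail.com", "max_connections": 100}
new_cfg = {"smtp_domain": '"gmail.com"', "max_connections": 200}

try:
    # Only max_connections was meant to change in this rollout.
    check_rollout(old_cfg, new_cfg, expected_changes={"max_connections"})
except ValueError as err:
    print(err)
```

The idea is that a migration of the configuration system itself should produce an empty diff in the rendered values, so any change outside the declared set blocks the rollout before it reaches production.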
Fixing things in production systems is hard. I've been there; things can go wrong, sometimes badly wrong. I'm used to thinking of Google as having near-infinite resources, including a replica of their production system to test changes on. Perhaps that's unrealistic.