This short post is a follow-up to last week’s post about Amazon’s culture and its outage. Amazon has finally explained what caused that outage. The explanation is long and complex, but it boils down to an avalanche: one relatively small problem caused a few more, which cascaded into a large one, the way a small snowball rolling down a mountainside can set off an avalanche. I have been there, I know it, it hurts. Amazon appears to be serious about preventing such cascades in the future, and its methods for doing so are highly technical.
Buried two-thirds of the way through the article is the proximate cause of the outage, the catalyst that set it off: a change that an Amazon employee performed or, in this case, misperformed. The engineer in question made a very simple human error, and that error was the snowball that set off the avalanche.
It is hard to admit fault. But after all of this, and with all of the heat Amazon has taken for its lack of communication, was it really so hard to say, “We screwed up, and we are sorry,” and to say it at the very beginning, loud and clear?
What happens to the employee who made the mistake will offer an interesting insight into Amazon’s internal culture. A bad company will penalize him; an awful one will fire him. But a great company will look at this and say, “Gee, why did we set this up so that one person can make a very human error? How can we change this so that the human error simply cannot be repeated?”