March 24th, 2011 by Editor
Major social news website Reddit.com suffered a serious outage last week. On its “You Broke Reddit (the site is down)” page, the company clearly blamed Amazon for its problems, stating: “Amazon is currently having an issue with their storage network. Please stand by.”
Reddit has released an official statement on their blog regarding the outage:
“As you will see, the blame was partly ours and partly Amazon’s (our hosting provider)… We’ll certainly admit that more could be done from our side to prevent hosting issues from affecting us so gravely. However this was a very serious outage which affected a large proportion of our disks. We would be lying if we said Amazon didn’t have some fault here.”
Then there is a more detailed explanation of the problem, including how data was committed to “slave” database replication clusters even though it was not committed to the “masters”.
“In a normal replication scenario, this should never, ever happen. The master commits the data, then tells the slave it is safe to commit the same data…The replication issue resulted in key conflicts on some of our slave databases. If you work in RDBMS at all, you know this is an extremely bad thing.”
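To make the ordering in that quote concrete, here is a toy, in-memory sketch in Python. It is not how Reddit's databases or Postgres replication actually work; the `Node` class, the `replicate` function and the `KeyConflict` exception are hypothetical names invented for illustration. It only shows why a row that lands on the replica (the "slave" in the quote) without going through the master leads to a key conflict later.

```python
class KeyConflict(Exception):
    """Raised when a node is asked to insert a primary key it already holds."""


class Node:
    """A toy database node: just a dict of primary key -> value (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def commit(self, key, value):
        # Inserting an existing primary key is the "key conflict" case.
        if key in self.rows:
            raise KeyConflict(f"{self.name}: key {key!r} already exists")
        self.rows[key] = value


def replicate(master, replica, key, value):
    """Normal ordering: the master commits first, then the replica applies."""
    master.commit(key, value)
    replica.commit(key, value)


if __name__ == "__main__":
    master, replica = Node("master"), Node("replica")

    # Healthy case: both nodes end up with the same row.
    replicate(master, replica, 1, "first post")

    # Failure case (assumed for illustration): a row reaches the replica
    # without ever being committed on the master.
    replica.commit(2, "stray write")

    # The master knows nothing about key 2, so it hands it out again.
    try:
        replicate(master, replica, 2, "second post")
    except KeyConflict as err:
        print("replication broke:", err)
```

After the conflict the two nodes hold different rows for the same key, and in this toy model the replica can no longer apply the master's change without someone stepping in.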
Extremely bad things aside, Reddit explained that Amazon’s Elastic Block Store (EBS) had “reliability issues,” and that it had decided to move its Cassandra database to local storage, which offers less functionality but more reliability. It may also do the same with its Postgres clusters.
Even though Reddit admitted partial culpability, three former Reddit employees posted on Reddit that the current Reddit team was “giving Amazon too much credit here,” calling Amazon’s EBS “a barrel of laughs in terms of performance and reliability, and are a constant (and the single largest) source of failure across Reddit… Their EBS product alone accounts for probably 80% of Reddit’s downtime.”
“EBSs are too slow to use single EBSs for a database machine servicing as many requests as Reddit’s are. So you have to RAID them together. But that increases your surface-area for risk in failure. Couple that with ridiculously [emphasis King’s] high variability in their actual performance … This … caus[es] many many micro-downtimes throughout the day, accounting for the vast majority of the ‘You broke reddit’ pages.”
Which brings up an interesting question: if you could ensure performance without RAID, would that increase reliability?
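A rough way to see the "surface-area for risk" point is a back-of-the-envelope calculation. The Python sketch below assumes a striped array in which the loss of any one volume takes the whole array down, and assumes that volumes fail independently. Neither assumption comes from Reddit's statement, which does not say which RAID level was used, and the 1% per-volume failure chance is a made-up number for illustration only.

```python
def array_failure_probability(p_volume: float, n_volumes: int) -> float:
    """Chance that at least one of n independent volumes fails.

    Assumes (for illustration) a striped array where a single volume
    failure is enough to take the array down.
    """
    if not 0.0 <= p_volume <= 1.0:
        raise ValueError("p_volume must be between 0 and 1")
    if n_volumes < 1:
        raise ValueError("n_volumes must be at least 1")
    # The array survives only if every volume survives.
    return 1.0 - (1.0 - p_volume) ** n_volumes


if __name__ == "__main__":
    p = 0.01  # hypothetical chance that one volume misbehaves in some period
    for n in (1, 2, 4, 8):
        print(f"{n} volume(s): {array_failure_probability(p, n):.4f}")
```

Under those assumptions the risk grows almost linearly with the number of volumes while the per-volume probability is small, so a single volume that was fast enough on its own would be exposed to a fraction of the risk of a multi-volume stripe.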
It would be disingenuous to say that I don’t have an agenda here, as I currently split my time between two companies: one (ScienceLogic) makes a product that monitors systems to catch issues before they become problems, and the other (CacheIQ) improves storage performance. It would be easy for me to say: “Well – buy our wonderful, awesome, guaranteed to save your eternal soul or triple-your-money back technology product(s).”
Unfortunately that probably wouldn’t help in this case. While I think that Amazon does deserve some blame here, the real root cause of the problem is that Reddit has been chronically underfunded.
Reddit is one of the largest, if not the largest, social news sites, with a billion pageviews to its name. Plus, it’s loaded with tons of yummy market-targeting information. For example, if you sold, I don’t know, weasel shaving lotion, you would be able to reach a targeted audience by buying an ad in the Reddit section devoted to weasel shaving, reddit.com/r/Weaselshaving.
But it only has a handful of staff – in fact, only one developer – and is seriously underfunded compared to similar sites. So you could make the argument that better performing hardware is important; but you could also make the argument that they would be on better performing hardware if they had the budget and staff.
Of course, Reddit isn’t exactly the favorite subdivision of the executives at Conde Nast, its parent company. One example: Reddit was told by Conde Nast that it “could not accept money” for political ads supporting California Proposition 19 (the legalization of marijuana), despite accepting advertising money from other controversial groups. In response, Reddit’s administration decided to comply with both corporate policy and their conscience by running the ads for free.
The point here is this:
Yes, there are a lot of products (like CacheIQ’s and ScienceLogic’s) and services out there which will help you make the most of your existing investments once you have them in place. But ultimately, if you’re not supporting your critical infrastructure – well, you’re not really treating it like critical infrastructure, are you?
Tagged with: Amazon, cloud computing, performance, reliability