Hadoop Map/Reduce versus DBMS, benchmarks

Here’s a recent benchmark published at SIGMOD ’09 by a team of researchers and students from Brown, M.I.T. and Wisconsin-Madison universities. The details of their setup here and this is the paper (PDF).

They ran a few simple tasks such as loading, „grepping” (as described in the original M/R paper), aggregation, selection and join on a total of 1TB of data. On the same 100-nodes RedHat cluster they compared Vertica (a well-known MPP), „plain” Hadoop with custom-coded Map/Reduce tasks and an unnamed DBMS-X (probably Oracle Exadata, which is mentioned in the article).

The final result shows Vertica and DBMS-X being (not astonishing at all!) 2, respectively 3 times faster than the brute M/R approach. What they also mention is that Hadoop was surprisingly easy to install and run, while the DBMS-X installation process was a relatively complex one, followed by tuning. Parallel databases were using space more efficiently due to compression, while Hadoop needed at least 3 times the space due to redundancy mechanism. A good point for Hadoop was the failure model allowing for quick recovery from faults and uninterrupted long-running jobs.

The authors recommend parallel DBMS-es against „brute force” models. “[…] we are wary of devoting huge computational clusters and “brute force” approaches to computation when sophisticated software would could do the same processing with far less hardware and consume far less energy, or in less time, thereby obviating the need for a sophisticated fault tolerance model. A multithousand- node cluster of the sort Google, Microsoft, and Yahoo! run uses huge amounts of energy, and as our results show, for many data processing tasks a parallel DBMS can often achieve the same performance using far fewer nodes. As such, the desirable approach is to use high-performance algorithms with modest parallelism rather than brute force approaches on much larger clusters.

What do you think, dear reader? I would be curious to see the same benchmark replicated on other NoSQL systems. Also, I find 1TB too low for most web-scale apps today.