M/R vs DBMS benchmark paper rebutted

In a recent ACM article, Jeffrey Dean and Sanjay Ghemawat discuss some pitfalls of the Hadoop vs. DBMS comparison benchmarks that I mentioned in one of my previous posts. They address three misconceptions about MapReduce held by the benchmark paper:

- MapReduce cannot use indexes and implies a full scan of all input data;
- MapReduce inputs and outputs are always simple files in a file system;
- MapReduce requires the use of inefficient textual data formats.
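To make the third misconception concrete: MapReduce jobs routinely read and write compact binary records (protocol buffers, SequenceFiles, and so on) rather than plain text. Here is a minimal, purely illustrative sketch of a fixed-width binary record format in Python; the record layout and function names are my own invention, not any Hadoop API.

```python
import struct

# Toy binary record format: each record is a 4-byte big-endian int key
# followed by a 4-byte big-endian int value. This stands in for a real
# binary format such as a SequenceFile record; the layout is illustrative.
RECORD = struct.Struct(">ii")

def encode_records(pairs):
    """Serialize (key, value) int pairs into a compact binary blob."""
    return b"".join(RECORD.pack(k, v) for k, v in pairs)

def read_records(blob):
    """Iterate (key, value) pairs back out of the binary blob."""
    for offset in range(0, len(blob), RECORD.size):
        yield RECORD.unpack_from(blob, offset)

blob = encode_records([(1, 10), (2, 20)])
print(list(read_records(blob)))  # [(1, 10), (2, 20)]
```

The point is simply that nothing in the MapReduce model forces text parsing on every record; a mapper can consume binary splits like this one directly.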

They also emphasize several MapReduce strong points that the benchmark paper did not cover.

The biggest drawback, the lack of indexes, is partially compensated in certain use cases by the range query feature, and is typically solved by using an external indexing service such as Lucene/Solr, or even a dedicated RDBMS. By applying vertical and horizontal sharding to these pre-built indexes, queries can be answered against them instead of scanning the whole data set, as the authors of the comparison paper imply is necessary.
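The pattern described above amounts to consulting the external index first and feeding only the matching inputs to the job. A toy sketch, with an in-memory dict standing in for Lucene/Solr or an RDBMS (the index layout, file names, and function names are all illustrative):

```python
# Pre-built inverted index (stand-in for an external indexing service):
# maps a search term to the data files that contain it, so a MapReduce
# job reads only those files instead of scanning every input file.
inverted_index = {
    "error": ["logs/part-00003", "logs/part-00017"],
    "timeout": ["logs/part-00017"],
}

def select_inputs(term, all_files):
    """Return only the input files the index says are relevant."""
    relevant = set(inverted_index.get(term, []))
    return [f for f in all_files if f in relevant]

all_files = [f"logs/part-{i:05d}" for i in range(32)]
print(select_inputs("error", all_files))
# ['logs/part-00003', 'logs/part-00017'] -- 2 files scanned instead of 32
```

Sharding the index itself (by term range, or by document partition) follows the same idea: each query touches only the shards that can contain an answer.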

Some performance assumptions are also discussed in the second part of the article. While the benchmark results were not challenged per se, here is Jeffrey and Sanjay's conclusion:

“In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis.


MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogenous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.”
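The "fine-grain fault tolerance" point is worth unpacking: the scheduler retries individual failed tasks, so one machine failure in hour five of a run does not restart hours one through four. A minimal toy sketch of that retry-per-task behavior (the scheduler loop, failure model, and names are all illustrative, not Hadoop's actual implementation):

```python
# Toy sketch of fine-grain fault tolerance: each map task is retried
# independently, so a single task failure never restarts the whole job.

def run_task(task_id, attempt, fail_first_attempt_for):
    """Simulated map task; some tasks fail on their first attempt."""
    if task_id in fail_first_attempt_for and attempt == 0:
        raise RuntimeError(f"task {task_id} failed")
    return task_id * task_id  # stand-in for real map work

def run_job(num_tasks, fail_first_attempt_for=(), max_attempts=3):
    results = {}
    for task_id in range(num_tasks):
        for attempt in range(max_attempts):
            try:
                results[task_id] = run_task(task_id, attempt,
                                            fail_first_attempt_for)
                break  # only this task was retried; others kept their results
            except RuntimeError:
                continue
    return results

print(run_job(4, fail_first_attempt_for={2}))  # {0: 0, 1: 1, 2: 4, 3: 9}
```

Contrast this with a monolithic parallel query, where a mid-execution node failure typically means re-running the whole query from scratch.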