links for 2009-07-22

  • If you talk to prominent DBMS researchers, they'll tell you that shared-nothing parallel database systems horizontally scale indefinitely, with near linear scalability. If you talk to a vendor of a shared-nothing MPP DBMS, such as Teradata, Aster Data, Greenplum, ParAccel, and Vertica, they'll tell you the same thing. Unfortunately, they're all wrong. (Well, sort of.)

    Parallel database systems scale really well into the tens and even low hundreds of machines. Until recently, this was sufficient for the vast majority of analytical database applications. Even the enormous eBay 6.5 petabyte database (the biggest data warehouse I've seen written about) was implemented on a (only) 96-node Greenplum DBMS.

  • An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  • As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

    1. A giant step backward in the programming paradigm for large-scale data intensive applications

    2. A sub-optimal implementation, in that it uses brute force instead of indexing

    3. Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago

    4. Missing most of the features that are routinely included in current DBMS

    5. Incompatible with all of the tools DBMS users have come to depend on

    First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.