Sunday, May 2, 2010

The NOSQL Debate

I attended the Stanford InfoLab conference, which had two panel discussions on cloud computing: one on transaction processing and one on analytic processing.

The session turned out to be a debate between people from academia and open source developers. Both sides have their points ...

Background of RDBMS

The RDBMS is based on a clear separation between application logic and data management. This loose coupling allows application and DB technologies to evolve independently.

This philosophy drives the DBMS architecture to be more general in order to support a wide range of applications. Over many years of evolution, the RDBMS has acquired a well-defined data model and query model based on relational algebra, as well as a well-defined transaction processing model.

Applications also benefit from having a unified data model: they have more freedom to switch DB vendors without too many code changes.

For OLTP applications, the RDBMS model has been proven to be highly successful in many large enterprises. The success of both Oracle and MySQL can speak to that.

For analytic applications, the RDBMS model is also widely used to implement data warehousing based on a star schema (composed of fact tables and dimension tables).

This model also puts DBAs in a very important position in the enterprise. They are equipped with sophisticated management tools and specialized knowledge to control the commonly shared DB schema.
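To make the star schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are all made up for illustration; a real warehouse would have many more dimensions and far larger fact tables.

```python
import sqlite3

# In-memory database; all table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
-- The fact table holds the measurements plus a foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product,
    date_id    INTEGER REFERENCES dim_date,
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'music');
INSERT INTO dim_date    VALUES (10, 2009), (11, 2010);
INSERT INTO fact_sales  VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 5.0);
""")

def sales_by_category_and_year(connection):
    """Typical star-schema query: join the facts to each dimension, then aggregate."""
    return connection.execute("""
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date    d ON d.date_id    = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """).fetchall()

rows = sales_by_category_and_year(conn)
```

The fact table sits in the middle and every analytic query fans out to the dimension tables through joins, which is where the "star" shape comes from.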

Background of NOSQL

In the last 10 years, a small number of highly successful web 2.0 companies have built applications that have grown beyond what a centralized RDBMS can handle. Partitioning data across many servers and spreading the workload across them seems a reasonable first thing to try. Many RDBMS solutions also provide a master/slave replication mechanism that lets one spread the read-only workload across many servers, and it helps.
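As a rough illustration of these two techniques, the sketch below hashes each key to a partition and, within a partition, sends writes to a master and spreads reads across read-only replicas. The names (shard_for, ReplicatedShard, put, get) are my own, plain dictionaries stand in for real servers, and the replicas are updated synchronously only to keep the example short.

```python
import hashlib
from itertools import cycle

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

class ReplicatedShard:
    """One master that takes writes, plus read-only replicas that serve reads."""

    def __init__(self, num_replicas: int):
        self.master = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self._next_replica = cycle(range(num_replicas))

    def write(self, key, value):
        self.master[key] = value
        # Assumption: copy to replicas right away; real systems often do this asynchronously.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Round-robin over the replicas to spread the read-only workload.
        return self.replicas[next(self._next_replica)].get(key)

shards = [ReplicatedShard(num_replicas=2) for _ in range(4)]

def put(key, value):
    shards[shard_for(key, len(shards))].write(key, value)

def get(key):
    return shards[shard_for(key, len(shards))].read(key)

put("user:42", {"name": "Ann"})
profile = get("user:42")
```

Note that a simple modulo scheme like this forces most keys to move when the number of shards changes, which is one reason real systems tend to use more elaborate partitioning schemes.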

In a partitioned environment, applications need to be more tolerant of data integrity issues, especially when data is asynchronously replicated between servers. The famous CAP theorem from Eric Brewer captures the essence of the tradeoff: highly scalable applications (which must have partition tolerance) have to choose between "consistency" and "availability".

Fortunately, most of these web-scale applications have a high tolerance for data integrity issues, so they choose "availability" over "consistency". This is a major deviation from the transaction processing model of the RDBMS, which typically weights "consistency" much higher than "availability".
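The choice can be pictured with a toy store that replicates asynchronously. This is only a sketch under simplifying assumptions: one primary, one replica, and a queue standing in for the replication channel. The class and method names (AsyncReplicatedStore, read_available, read_consistent) are hypothetical.

```python
from collections import deque

class AsyncReplicatedStore:
    """Toy primary/replica pair where replication happens later, not at write time."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # writes accepted by the primary but not yet shipped

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_available(self, key):
        """Always answer from the replica, even if the answer may be stale."""
        return self.replica.get(key)

    def read_consistent(self, key):
        """Refuse to answer from the replica while it lags behind the primary."""
        if self.pending:
            raise RuntimeError("replica is behind; cannot guarantee a consistent read")
        return self.replica.get(key)

    def replicate(self):
        """Ship all pending writes to the replica."""
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value

store = AsyncReplicatedStore()
store.write("cart", "book")
stale = store.read_available("cart")   # replica has not seen the write yet
store.replicate()
fresh = store.read_available("cart")   # replica has caught up
```

An application that picks read_available keeps serving requests but has to cope with stale answers; one that picks read_consistent gets correct answers but may be turned away while the replica is behind.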

In addition, the data structures used in these web-scale applications (e.g. user profiles, preferences, social graphs, etc.) are much richer than the simple row/column/table model. Some operations involve navigating a large social graph, which requires too many join operations in an RDBMS model.

The "higher tolerance for data integrity issues" as well as the need for "efficient support for rich data structures" challenge some of the most fundamental assumptions of the RDBMS. Amazon has built its large-scale key/value store called Dynamo, and Google has its BigTable column-oriented storage model; both were designed from the ground up with a distributed architecture in mind.

Since then, many open source clones based on these two models have appeared. To represent this trend, Eric Evans from Rackspace coined the term "NOSQL". In my opinion the term does not accurately reflect the underlying technologies, but it nevertheless provides a marketing banner under which all non-RDBMS technologies can gather, even those (e.g. CouchDB, Neo4j) that were not originally trying to tackle the scalability problem.
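A small sketch may help show why graph navigation is awkward in a relational model. Below, the same made-up friendship data is held once as an adjacency structure and once as an edge table; within_hops and join_hop are hypothetical helper names. Each extra hop in the graph corresponds to one more self-join on the edge table.

```python
# Hypothetical social graph stored as an adjacency structure.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

def within_hops(graph, start, hops):
    """Everyone reachable from `start` in at most `hops` steps (breadth-first)."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        frontier = {n for person in frontier for n in graph.get(person, ())} - seen
        seen |= frontier
    return seen - {start}

# The same data as a relational edge table of (src, dst) rows.
edges = [(src, dst) for src, neighbours in friends.items() for dst in neighbours]

def join_hop(edge_rows, people):
    """One self-join step: SELECT dst FROM edges WHERE src IN (people)."""
    return {dst for src, dst in edge_rows if src in people}

two_hops_graph = within_hops(friends, "ann", 2)
two_hops_joins = join_hop(edges, join_hop(edges, {"ann"}))  # two joins for two hops
```

With the adjacency structure, each hop touches only the neighbours of the current frontier. With the edge table, each hop is another pass (or index lookup) over the whole table, and the number of joins grows with the depth of the traversal.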

Today, NOSQL serves as an umbrella term for non-relational databases that offer an alternative to the RDBMS.

For analytical applications, these systems also take a highly parallel, brute-force processing approach based on the Map/Reduce model.
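For readers who have not seen the Map/Reduce model, here is a single-process sketch of its three steps (map, shuffle, reduce) using the usual word-count example. The function names are my own, and a real framework would run the map and reduce steps in parallel across many machines rather than in one loop.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = map_reduce(["a b a", "B c"])
```

The approach is "brute force" in the sense that every record is scanned on every job; the scalability comes from the fact that map calls are independent of each other, and so are reduce calls for different keys.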

The Debate

There was relatively little criticism of the data model aspects. Jun Rao from IBM Lab summarized the key difference between the two philosophies. The traditional RDBMS approach is to first figure out the right model, then provide an implementation and hope it is scalable. The modern NOSQL approach does the opposite: first figure out what a highly scalable architecture looks like, then layer a data model on top of it. Basically, people in both camps agree that we should use a data model that is optimized for the application's access patterns, which weakens the "one-size-fits-all" philosophy of the RDBMS.

Most of the debate centered on the transaction processing model itself. Basically, RDBMS proponents think the NOSQL camp hasn't spent enough time understanding the theoretical foundation of the transaction processing model. The new "eventual consistency" model is not well-defined, and different implementations may differ significantly from each other. This means the burden of figuring out all these inconsistent behaviors lands on application developers and makes their lives much harder. A DB whose behavior is hard to reason about can be very dangerous if the application makes wrong assumptions about the underlying data integrity guarantees.

While agreeing that application developers now have more to worry about, the NOSQL camp argues that this is actually a benefit, because it gives domain-specific optimization opportunities back to application developers, who are no longer constrained by a one-size-fits-all model. But they admit that making such optimization decisions requires a lot of experience and can be very error-prone and dangerous if not done by experts.

The academic side also noted that the movement to NOSQL may seem fashionable and cool to new technologists who may not have the expertise and skills. The community as a whole should articulate the pros and cons of NOSQL.

Note that this is not the first time this debate has taken place. Stonebraker has written a very good article from the RDBMS side.
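To see how two "eventually consistent" stores can behave differently, consider two replicas that each accepted a different write for the same key while they could not talk to each other. The sketch below shows two possible reconciliation policies; both function names and the version format (a dict with "value" and "ts") are assumptions for illustration, not the behavior of any particular product.

```python
def merge_last_write_wins(a, b):
    """Keep the version with the higher timestamp; the other write is silently dropped."""
    return max(a, b, key=lambda version: version["ts"])

def merge_keep_siblings(a, b):
    """Keep both values and leave conflict resolution to the application."""
    return sorted({a["value"], b["value"]})

# Two conflicting versions of the same key, written on different replicas.
on_replica_1 = {"value": "red", "ts": 1}
on_replica_2 = {"value": "blue", "ts": 2}

lww_result = merge_last_write_wins(on_replica_1, on_replica_2)
sibling_result = merge_keep_siblings(on_replica_1, on_replica_2)
```

Both policies converge, so both can be called eventually consistent, yet under the first one the application never learns that a write was lost, while under the second it has to be prepared to receive several values and pick one. An application written against one policy can break quietly when run against the other.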

7 comments:

Prof. Dr. Stefan Edlich said...

Thanks for this interesting posting.

Especially the comment from Jun Rao was interesting for me: model->scale vs. scale->model and "should use a data model that is optimized for the application's access patterns".

This looks like a lot of work to write up all access patterns with all possible data models, link them correctly and then propose the best system...

Regards
Stefan Edlich
http://nosql-database.org

Anonymous said...

I see a tie-in between NoSQL and SOA:

As you note, the RDBMS historically played a "central integration point" role, putting the DBA at the crossroads of all projects. This often meant a lengthy approval process for developers just to get trivial changes into the database. Switching databases was nearly impossible.

In SOA, each service is its own island, with its own private database. Not only is there no "DBA as peace maker" role, but it frees each service to pick a different database to suit its needs. (At Amazon, each service can pick its own implementation language too.) This is the perfect environment to 'experiment' with different databases.

> NOSQL may deem fashionable and cool to new technologists who may not have the expertise and skills

I think this is a good thing. Every time technology gets easier to use, more people try it and bring new ideas. Even in the current RDBMS world, there are plenty of dumb people (see dailywtf).

Some of the NoSQL models are so simple to reason about that it makes the RDBMS seem "baroque". Think of someone who grows up with MongoDB as their first database, then tries to learn SQL. "Why do I have to create an empty table? Why do I have to create multiple tables to represent one object? Why can't I add fields dynamically? How do I write 'order.item.price'?"

Ricky Ho said...

Thanks for the great comment!

To Stefan's comment:

I think this concept is not new. It is similar to picking the right data structure when we write a program (e.g. List, Stack, Tree, Hashtable, etc.). But now we are extending the same idea to the underlying storage model (e.g. RDBMS, Document DB, Graph DB, Distributed Hashtable, BTree, etc.).

Although the concept is simple, implementing that persistent data structure model in a highly distributed architecture is non-trivial.

Thanks for the next comment highlighting the difference between two system integration approaches: a) interface centric, b) DB centric. I think each of them has its pros and cons. The fundamental question is how "tight" the coupling between different systems should be. I think this is a case-by-case judgement. Within a single organization, a tighter coupling is sometimes not a bad thing.

Claudio Martella said...

I completely agree with you, it's very similar to the issue of choosing the data structure in your program. The big problem, as you state too, is the absence of a unifying approach to eventual consistency. There should be an approach to transaction-like operations that is well understood and widely accepted. It's true that delegating the decisions to the developer pushes the developer to optimize the solution for his problem space, but personally I can tell that a lot of people just don't give NoSQL a try because of the lack of information available (there are no books, best practices or anything, yet). I wrote a post about this debate from a different perspective, if you don't mind looking at it.

Prof. Dr. Stefan Edlich said...

Dear Ricky, dear Claudio,

Of course you are right. The concept is not new. And of course it's similar to picking the right data structures in a program.

Nevertheless my point was: there is no written document that serves as a guideline, showing the access patterns, guiding you through all the non-functional DB requirements and leading you to a sound DB decision. I think it's worth a try (at least as a Ph.D. topic?) to do this. Perhaps it has never been done because the DB or the language field is exploding too fast and any write-up would be outdated immediately ...

Best
Stefan E.

(P.S.: I can see no view on NoSQL:EU on advantages and drawbacks. There are slides about Redis, Tokyo, Couch and Mongo.)

Dominique De Vito said...

While NoSQL database programming and selection can be hard, IMHO categorizing NoSQL databases is simpler than expected.

I think NoSQL databases are somewhat disguised object databases.

See my post: Thinking about NoSQL databases (classification and use cases)

Unknown said...

I invite you to read my post on NoSQL and SQL, looking at the broader aspects, such as design, operation etc:
http://blog.xeround.com/2010/12/nosql-the-sequel