Tuesday, 6 January 2015

The rise of the multimodel database

By mapping documents, graphs, and relational tables to a collection of keys and values, a single data store can support multiple data models

NoSQL entered the scene nearly six years ago as an alternative to traditional relational databases. The offerings from the major relational vendors couldn’t cut it in terms of the cost, scalability, and fault tolerance that developers need to build reliable, modern Web applications.
Fast forward to today, and vendors everywhere tout their NoSQL solutions. Open source projects have sprouted all over the place, with thousands of developers contributing to them. In fact, more than 200 different NoSQL products and companies are vying for developers' attention.

Beyond SQL

To understand the NoSQL boom, it helps to take a quick look at how we got here. The relational data model has been around since the early 1970s, and it became popular for good reasons. In the form of SQL, it offers a general query language based on a rigorous data model. Query planners can usually optimize queries without requiring detailed knowledge of the physical data layout on the part of the user.
Since that time, many data formats that go beyond the relational model have gained popularity. For example, JSON is a common format used in software development and for document-oriented data. Some SQL vendors allow you to store JSON as a serialized string, but it’s not a first-class citizen in terms of querying or indexing. You can decompose a JSON object into multiple tables, but you’ll have to use multiple joins to query the data, paying a large performance penalty.
Graphs, based on nodes and links, are another popular data model. Graphs are often used to store data structured as networks, such as social networks. As with JSON, there are straightforward ways to translate a graph into relational tables, but the relevant graph structure is lost, and the resulting queries require expensive, iterated joins. As a result, the likes of a shortest path query, which is natural and straightforward in a graph data model, become extremely complex in SQL.
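To make the contrast concrete, here is a minimal sketch of a shortest path query when the data is already held as a graph. It assumes an unweighted, directed graph stored as a plain adjacency dictionary; the `shortest_path` function and the `friends` data are illustrative names, not part of any particular product.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest path as a list of nodes, or None."""
    previous = {start: None}  # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the chain of predecessors back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start


# Hypothetical social network: each key lists the people that person follows.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
}
path = shortest_path(friends, "alice", "erin")
```

Each step of the search follows links directly. In a relational layout, each of those steps would typically correspond to another self-join on an edge table.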
Other data models, like time series, blobs, and geospatial data, pose similar challenges. Many SQL vendors support proprietary add-ons for these data types, but there is no uniform and efficient way to represent them in a relational model.

The NoSQL response

The proliferation of NoSQL databases is a response to the needs of modern applications, which work with different types of data with different storage requirements. Not all data can be shoehorned into a particular model, whether relational or otherwise. That’s why there are so many different database options in the market. The need for multiple data models in modern applications is our reality.
While developers need multiple data models, they shouldn’t have to adopt different databases to get them. It’s not uncommon to hear that an application has multiple databases in its back end. Martin Fowler has advocated an architectural pattern of “polyglot persistence,” meaning the application stores data in separate databases of different types. Polyglot persistence with separate databases responds to a real need, but it leads to an operational nightmare. Running multiple data silos creates as many problems as it solves, beginning with operational complexity.

One back end, multiple data models

A multimodel database provides a single back end that supports multiple data models. It’s all about being able to choose the best data model for the job with a single storage substrate. By eliminating the back-end fragmentation already discussed, multimodel databases provide two key benefits:
Easing operational complexity. The fragmented environments caused by running different databases side by side increase the complexity of both operations and development. For example, a polyglot application stack might include Redis as a caching layer, MongoDB for collecting logs, Postgres for metadata, and Elasticsearch for indexing and search. The goal is to use the best component for the job.
However commendable the intention, polyglot persistence means you end up with multiple databases, each with its own storage and operational requirements; integrating them for scalability and fault tolerance is up to you. Assuring that a system with many such components is fault-tolerant is challenging, to say the least. The need to integrate multiple databases imposes significant engineering and operational costs. Your team needs to have experts in each database technology. For the application to stay online, all of the databases need to remain up. This makes the application only as fault-tolerant as the weakest link in the stack.
Consistency. Even worse, there is no support for transactions across different databases, so there is no good way to maintain consistency among different models. Suppose your application receives a stream of data on user activity, and you decide to store related data elements in time series, graph, and document stores. You usually require these elements to reflect a consistent state, but without ACID transactions, this requirement can be difficult if not impossible to achieve.

A new approach

Can we somehow keep the good parts of polyglot persistence but lose the bad parts? It turns out, we can. The main idea is to keep all state in a single store that supports multiple data models by mapping the higher-level models to a lower-level representation. To pull off this trick, the storage substrate needs to have some important properties. At a minimum, it needs to support true, multikey ACID transactions in a performant manner. It turns out that ordering among keys is another important tool for efficient data modeling. These considerations lead to an ordered, transactional key-value store as the basic storage substrate we’ll need.
With these building blocks in place, supporting new data models becomes a matter of mapping the higher-level representation to a collection of keys and values. JSON documents, graphs, and relational tables can all be efficiently mapped to key-value pairs. By taking advantage of the ordering property among keys, we can even design optimizations that let us avoid joins in many cases where a traditional SQL database would require them.
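As a rough illustration of such a mapping, the sketch below flattens a JSON-style document and a set of graph edges into tuple keys in a toy ordered store. Everything here is an assumption made for the example: `OrderedKV` is an in-memory stand-in rather than a real database API, the `"doc"` and `"edge"` key prefixes are arbitrary, and all key elements are strings so that keys sort consistently.

```python
import bisect


class OrderedKV:
    """Toy in-memory ordered key-value store; keys are tuples of strings."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def set(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_range(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i][:len(prefix)] == prefix:
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


def flatten(doc, path=()):
    """Yield (path, leaf_value) for every leaf of a nested dict."""
    for field, value in doc.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (field,))
        else:
            yield path + (field,), value


def put_document(db, doc_id, doc):
    # One key per leaf: ("doc", id, field, subfield, ...) -> value
    for path, value in flatten(doc):
        db.set(("doc", doc_id) + path, value)


def get_document(db, doc_id):
    # A single range read over the document's prefix rebuilds the whole document.
    prefix = ("doc", doc_id)
    doc = {}
    for key, value in db.get_range(prefix):
        node = doc
        *parents, leaf = key[len(prefix):]
        for field in parents:
            node = node.setdefault(field, {})
        node[leaf] = value
    return doc


def add_edge(db, source, target):
    db.set(("edge", source, target), "")  # the key itself carries the information


def neighbors(db, source):
    # All outgoing edges of a node are adjacent in key order: one range read, no join.
    return [key[2] for key, _ in db.get_range(("edge", source))]


db = OrderedKV()
user = {"name": "Alice", "address": {"city": "Vienna", "zip": "1010"}}
put_document(db, "u1", user)
add_edge(db, "u1", "u3")
add_edge(db, "u1", "u2")
add_edge(db, "u2", "u1")
```

Because keys that share a prefix sit next to each other, `get_document(db, "u1")` and `neighbors(db, "u1")` each need only one contiguous scan. The sketch handles nested objects only; arrays and mixed key types would need an order-preserving encoding.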
This new approach gives you several big wins. All of the state that your application stores can be kept in a single component. You can have transactions across your data models. Best of all, you can use the data models your application really needs. ACID transactions give you the “glue” to keep all your data synchronized across different models.
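The “glue” role of transactions can be sketched with a toy example in which one user action updates a time series, a graph, and a document together. This is a single-threaded illustration of atomicity only; isolation, durability, and distribution are not modeled. The `Transaction` class, the `transactional` helper, and the key layout are hypothetical names for this example, not the API of any real database.

```python
class Transaction:
    """Buffers writes; they reach the store only when commit() is called."""

    def __init__(self, store):
        self._store = store
        self._writes = {}

    def set(self, key, value):
        self._writes[key] = value

    def commit(self):
        # Applied in one step in this single-threaded toy.
        self._store.update(self._writes)


def transactional(store, work):
    """Run work(tr); apply all of its writes, or none of them if it raises."""
    tr = Transaction(store)
    work(tr)     # an exception here skips commit, so nothing is written
    tr.commit()


def record_follow(tr, user, followed, timestamp):
    tr.set(("ts", "activity", timestamp), f"{user} followed {followed}")  # time series
    tr.set(("edge", user, followed), "")                                   # graph
    tr.set(("doc", user, "last_active"), timestamp)                        # document field


def broken_follow(tr, user, followed, timestamp):
    tr.set(("edge", user, followed), "")
    raise RuntimeError("simulated failure partway through the update")


store = {}
transactional(store, lambda tr: record_follow(tr, "u1", "u2", "2015-01-06T10:00:00"))
try:
    transactional(store, lambda tr: broken_follow(tr, "u1", "u3", "2015-01-06T10:05:00"))
except RuntimeError:
    pass  # the failed update left no trace in the store
```

After both calls, the store holds the three keys written by the successful update and nothing from the failed one, so the graph, document, and time series views never disagree about what happened.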
Of course, to meet the demands of modern applications, these features need to be delivered in a way that retains the advantages of a NoSQL database running on a distributed cluster, especially horizontal scalability and fault tolerance. ACID transactions are especially powerful in combination with those capabilities.

A multimodel database

Rather than having to integrate multiple databases, it’s much simpler if your development team can build the data models you need on a single back end. That’s the approach FoundationDB has taken. FoundationDB is a multimodel database that combines scalability, fault tolerance, and high performance with incredibly powerful multikey ACID transactions. The “secret sauce” that enables this approach is performant ACID transactions. Building a custom data model that supports concurrent updates usually requires synchronization of disparate data elements. Such synchronization is easy with ACID transactions and very difficult without them.
This is where the database market is heading: toward ACID-compliant, multimodel databases that can meet an application’s requirements for fault tolerance, scalability, and performance.
