With so many NoSQL choices, how do you decide on one? Here’s a handy guide for narrowing your choice to three.
Hadoop gets much of the big data credit, but the reality is that NoSQL
databases are far more broadly deployed -- and far more broadly
developed. In fact, while shopping for a Hadoop vendor is relatively
straightforward, picking a NoSQL database is anything but. There are,
after all, in excess of 100 NoSQL databases, as the DB-Engines database popularity ranking shows.
Which should you choose?
Spoiled for choice
Because choose you must. As nice as it might be to live in a happy utopia of so-called polyglot persistence,
“where any decent-sized enterprise will have a variety of different
data storage technologies for different kinds of data,” as Martin Fowler
argues, the reality is you can’t afford to invest in learning more than
a few.
Fortunately, the choice is getting easier as the market coalesces around
three dominant NoSQL databases: MongoDB (backed by my former employer),
Cassandra (primarily developed by DataStax, though hatched at
Facebook), and HBase (closely aligned with Hadoop and developed by the
same community).
Note that I purposefully exclude Redis from this list. While a great
data store, it’s primarily used for caching data and isn’t well suited
for a wide array of workloads.
LinkedIn data from 451 Research shows how the market is gravitating to MongoDB, Cassandra, and HBase.
That’s LinkedIn profile data. A more complete view is DB-Engines',
which aggregates jobs, search, and other data to understand database
popularity. While Oracle, SQL Server, and MySQL reign supreme, MongoDB
(no. 5), Cassandra (no. 9), and HBase (no. 15) are giving them a run for
their money.
While it’s too soon to call every other NoSQL database a rounding error,
we’re rapidly reaching that point, exactly as happened in the
relational database market.
To better understand why these three databases shine, I asked
representatives from each to identify key attributes for their success:
Kelly Stirman, director of products at MongoDB; Patrick McFadin, chief
Cassandra evangelist at DataStax; and Justin Kestelyn, senior director
of developer relations at Cloudera.
But first, we need to understand why NoSQL matters.
A world built with unstructured data
We increasingly live in a world where data doesn’t fit nicely into the
tidy rows and columns of an RDBMS. Mobile, social, and cloud computing
have spawned a massive flood of data. According to a variety of
estimates, 90 percent of the world’s data was created in the last two
years, with Gartner pegging 80 percent of all enterprise data as
unstructured. What's more, unstructured data is growing at twice the
rate of structured data.
As the world changes, data management requirements go beyond the
effective scope of traditional relational databases. The first
organizations to observe the need for alternative solutions were Web
pioneers, government agencies, and companies that specialize in
information services.
Increasingly now, companies of all stripes are looking to capitalize on
the advantage of alternatives like NoSQL and Hadoop: NoSQL to build
operational applications that drive their business through systems of
engagement, and Hadoop to build applications that analyze their data
retrospectively and help deliver powerful insights.
MongoDB: Of the developers, for the developers
Among the NoSQL options, MongoDB's Stirman points out, MongoDB has aimed
for a balanced approach suited to a wide variety of applications. While
the functionality is close to that of a traditional relational
database, MongoDB allows users to capitalize on the benefits of cloud
infrastructure with its horizontal scalability and to easily work with
the diverse data sets in use today thanks to its flexible data model.
MongoDB is often the first NoSQL database developers will try because
it’s so easy to learn. Will Shulman, CEO of MongoLab (a
MongoDB-as-a-service provider), says it this way:
The disproportionate success of MongoDB is largely based on its innovation as a data structure store that lets us more easily and expressively model the "things" at the heart of our applications….
Having the same basic data model in our code and in the database is the superior method for most use cases, as it dramatically simplifies the task of application development, and eliminates the layers of complex mapping code that are otherwise required.
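To make Shulman's point concrete, here is a minimal Python sketch of keeping one nested shape in both code and storage. It deliberately uses no MongoDB driver: JSON from the standard library stands in for the document format, and the `post` object, its fields, and the two helper functions are hypothetical names chosen for illustration.

```python
import json

# Hypothetical application object: a blog post with nested author, tags, and comments.
post = {
    "title": "Choosing a NoSQL database",
    "author": {"name": "Ada", "email": "ada@example.com"},
    "tags": ["nosql", "mongodb"],
    "comments": [
        {"user": "bob", "text": "Nice overview"},
        {"user": "eve", "text": "What about Redis?"},
    ],
}


def to_document(obj: dict) -> str:
    """Serialize the in-memory object as one self-contained document."""
    return json.dumps(obj, sort_keys=True)


def from_document(doc: str) -> dict:
    """Load a stored document back into the same nested shape."""
    return json.loads(doc)


stored = to_document(post)
restored = from_document(stored)
```

In a relational schema the same post would typically be spread across separate tables for posts, authors, tags, and comments, with mapping code to reassemble it. Here the round trip returns the original structure unchanged.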
Notably, MongoDB, like the other databases on this list, is not a
one-trick pony. Enterprises that learn MongoDB “can amortize their
investments in MongoDB across many, many projects, making it one of
a short list of standards they rely upon for all data management,” as
Stirman told me.
Of course, like any technology, MongoDB has its strengths and weaknesses.
MongoDB is designed for OLTP workloads. It can do complex queries, but
it’s not necessarily the best fit for reporting-style workloads. And if
you need complex transactions, it’s not going to be a good choice.
However, MongoDB’s simplicity makes it a great place to start.
Cassandra: Safely run at scale
There are at least two kinds of database simplicity: development
simplicity and operational simplicity. While MongoDB rightly gets credit
for an easy out-of-the-box experience, Cassandra earns full marks for
being easy to manage at scale.
As DataStax's McFadin told me, users tend to gravitate to Cassandra the
more they butt their heads against the difficulty of making relational
databases faster and more reliable, particularly at scale. A former
Oracle DBA, McFadin was elated to discover that “replication and linear
scaling are primitives” with Cassandra, and the features were “the
primary design goal from the beginning.”
In the RDBMS world, database features like scaling and replication are
the hard parts left to the user. This worked fine in yesterday’s
enterprise when scale wasn’t a big issue. Today it’s quickly becoming the issue.
As I heard from McFadin and others, Cassandra particularly shines in
scale-out deployments. Cassandra comes with baked-in support for
multiple data centers. As for adding capacity to a cluster, “You simply
boot up a new machine and tell Cassandra where the other nodes are,"
McFadin said, "and it takes care of the rest.”
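Cassandra distributes data around a ring of tokens using consistent hashing, which is a large part of why adding a machine is so undramatic. The following is a toy Python sketch of that idea, not Cassandra's implementation: the `HashRing` class, the node names, and the choice of MD5 (used only because it ships with the standard library) are all assumptions for illustration.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; each node owns several positions on it."""

    def __init__(self, vnodes: int = 8) -> None:
        self.vnodes = vnodes
        self.ring = []  # sorted list of (token, node_name)

    def add_node(self, name: str) -> None:
        # A new node only inserts its own tokens; existing tokens do not move.
        for i in range(self.vnodes):
            self.ring.append((token(f"{name}#{i}"), name))
        self.ring.sort()

    def owner(self, key: str) -> str:
        # The owner is the first ring position clockwise from the key's token,
        # wrapping around to the start of the ring if necessary.
        tokens = [t for t, _ in self.ring]
        i = bisect_right(tokens, token(key)) % len(self.ring)
        return self.ring[i][1]
```

Because existing tokens stay where they are, any key that changes owner after a node joins can only move to the new node. The rest of the cluster keeps serving the data it already had.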
This ease of scaling, coupled with exceptional write performance (“All
you’re doing is appending to the end of a log file”) and predictable
query performance, adds up to a high-performance workhorse in Cassandra.
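The "appending to the end of a log file" write path that McFadin describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Cassandra's storage engine: the `AppendOnlyLog` class, the JSON record format, and the file name are hypothetical.

```python
import json
import os
import tempfile


class AppendOnlyLog:
    """Toy write path: every write is appended to the end of one file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def write(self, key: str, value: str) -> None:
        # Append a record; bytes already in the file are never rewritten.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")

    def replay(self) -> dict:
        # Rebuild the current state by reading the log in order;
        # the last record written for a key wins.
        state = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                state[record["k"]] = record["v"]
        return state


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        log = AppendOnlyLog(os.path.join(tmp, "commit.log"))
        log.write("user:1", "alice")
        log.write("user:2", "bob")
        log.write("user:1", "alice-updated")  # an update is just another append
        return log.replay()
```

Writes stay cheap because an update never has to find and modify an earlier record. The cost shifts to reads and cleanup, which a real system handles with in-memory structures and background compaction.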
One article of NoSQL faith I’ve long held is that Cassandra may be
powerful at scale, but it requires a doctorate to get started.
Not so, McFadin insisted:
The replication and read and write paths are purposefully simple. You can learn the core internals of Cassandra in a few hours. That can bring a lot of confidence as you deploy new technology because there are less “black box” details that introduce complex failure modes.
This means that the price for admission to effective Cassandra
development is in understanding the data model and how it will work with
your application. Given the familiarity of Cassandra’s CQL query
language (intended to be “exactly like SQL except when it’s not”), McFadin said, it’s not a steep learning curve.
More important, he told me, “Cassandra rewards you with the one thing
you want from a database: no drama. This is why users love to use
Cassandra.”
HBase: Bosom buddies with Hadoop
HBase, like Cassandra a column-oriented key-value store, gets a lot of
use in large part because of its common pedigree with Hadoop. Indeed, as
Cloudera's Kestelyn put it, “HBase provides a record-based storage
layer that enables fast, random reads and writes to data, complementing
Hadoop by emphasizing high throughput at the expense of low-latency
I/O.”
Kestelyn goes on:
Changes are efficiently cataloged in memory to achieve maximum access while the data is persisted to HDFS. This design enables a Hadoop-based EDH [enterprise data hub] to serve random reads and writes to users and applications in real time, yet still enjoy the fault-tolerance and durability of HDFS.
Affinity with Hadoop isn’t the only reason HBase keeps rising in the
database popularity ranks, though that might be enough. Similar to
Cassandra, HBase’s roots as an open source implementation of Google’s Bigtable translate into the database being highly scalable by design.
Because it can utilize the storage, memory, and CPU resources of any
number of servers and includes scale-out features like automatic
sharding, HBase can scale limitlessly as load and performance demands
increase simply by adding server nodes. HBase was designed from the
ground up to provide optimal performance when consistency is critical.
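HBase shards a table into contiguous ranges of sorted row keys and splits a range when it grows too large. The Python sketch below imitates that behavior in memory, purely as an illustration: the `RangeShardedTable` class, the row-count threshold, and the median split point are assumptions, not HBase's actual split policy.

```python
from bisect import bisect_right


class RangeShardedTable:
    """Toy table whose sorted row keys are split into contiguous ranges."""

    def __init__(self, max_rows_per_shard: int = 4) -> None:
        self.max_rows = max_rows_per_shard
        self.start_keys = [""]  # start_keys[i] is the lowest key shard i may hold
        self.shards = [{}]      # one dict of rows per shard

    def _shard_index(self, row_key: str) -> int:
        # The owning shard is the last one whose start key is <= row_key.
        return bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key: str, value: str) -> None:
        i = self._shard_index(row_key)
        self.shards[i][row_key] = value
        if len(self.shards[i]) > self.max_rows:
            self._split(i)

    def get(self, row_key: str):
        return self.shards[self._shard_index(row_key)].get(row_key)

    def _split(self, i: int) -> None:
        # Split at the median key: the lower half stays in place and the
        # upper half moves to a new shard inserted right after it.
        keys = sorted(self.shards[i])
        mid = keys[len(keys) // 2]
        lower = {k: v for k, v in self.shards[i].items() if k < mid}
        upper = {k: v for k, v in self.shards[i].items() if k >= mid}
        self.shards[i] = lower
        self.start_keys.insert(i + 1, mid)
        self.shards.insert(i + 1, upper)
```

Callers never choose a shard. They read and write by row key, and the table routes each request and splits itself as it grows. In a real cluster, each resulting range could then be served by a different machine.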
But scale isn’t its only utility. As Kestelyn noted, “Thanks to its
tight integration with the rest of the Hadoop ecosystem, data is readily
available to users and applications via SQL queries (using Cloudera
Impala, Apache Phoenix, or Apache Hive) or even faceted free-text search
(using Cloudera Search).” Thus, HBase gives developers a way to
leverage existing expertise with SQL while building on a more modern,
distributed database.
Each database comes with its own strengths and shortcomings, but each of
the three profiled here has filled a major hole in the big data
landscape. While it’s possible that a new database will come along to
claim a spot in the NoSQL top three (DynamoDB?), the reality is that
developers and the enterprises they serve are already standardizing on a
few strong options: MongoDB, Cassandra, and HBase.
Now VP of mobile at Adobe, Matt Asay was previously vice president
of community at MongoDB, Inc. He is an emeritus board member of the Open
Source Initiative (OSI) and earned his juris doctorate at Stanford,
where he focused on open source and other intellectual property
licensing issues. He also holds a master's from the University of Kent at
Canterbury and a bachelor's from Brigham Young University. Asay was
one of InfoWorld's first bloggers.
Source: http://www.infoworld.com