Is Hadoop the new tape? — Data-Driven Blog by John Webster

By John Webster, Tuesday, March 27th 2012

I attended GigaOM’s Structure:Data 2012 conference in New York City last week. This is the second one I’ve attended and I’m now a confirmed advocate of this event. Om Malik brings together people who, in one way or another, represent much of the creative thinking around so-called big data. I got the feeling that I could strike up a conversation with anyone there and learn something new.

I noticed at least two major differences between the Structure:Data event I attended last year and this year’s version. Last year, most if not all of the exhibiting vendors represented the xSQL community (MySQL, NoSQL, etc.). Much more diversity was on display this year. Hadoop vendors were there in force; no surprise there. Hadoop is an open-source distributed computing platform commonly used for data-intensive analytics and business intelligence applications. But I was surprised by the number of storage vendors making the case for using shared storage in big-data analytics architectures that are typically classified as “shared nothing.” Shared storage mixes with shared nothing like oil mixes with water. (See below.)

So in keeping with the theme of storage in big-data analytics, here are a few choice comments I heard during sessions and on the show floor that relate to storage and Hadoop:

“Hadoop is a revamp of how we store and access data.”
This one got me thinking about Hadoop as a storage device. One of the presenters mentioned that if you put 1 terabyte of RAM into each of 1,000 Hadoop data nodes in a single cluster you would have, in aggregate, 1 petabyte of very high-performance storage that’s built on open-source software and commodity hardware. And there’s more to this story. Hadoop has an embedded, distributed file system, as do some scale-out network-attached storage (NAS) implementations. Data protection is built in. It’s not RAID (redundant array of independent disks), but Hadoop does maintain multiple copies of data (typically three) across data nodes. So, should you as an IT administrator evaluate Hadoop on the basis of it being a storage device? I think you should.
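For anyone who wants to check the arithmetic, here is a minimal Python sketch of the presenter's figure, plus what three-way copying would do to usable capacity. The function names are hypothetical, and applying the three-copy factor to the same 1,000-node pool is my assumption for illustration only; the presenter's 1 petabyte was a raw aggregate, and decimal units (1 PB = 1,000 TB) are assumed throughout.

```python
def aggregate_capacity_tb(nodes: int, per_node_tb: float) -> float:
    """Raw capacity of the cluster: every node contributes the same amount."""
    return nodes * per_node_tb


def usable_capacity_tb(raw_tb: float, copies: int = 3) -> float:
    """Capacity left for unique data when each block is stored `copies` times.

    Assumption: every block is copied the same number of times, and no
    space is reserved for anything else.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_tb / copies


if __name__ == "__main__":
    raw_tb = aggregate_capacity_tb(nodes=1000, per_node_tb=1.0)
    usable_tb = usable_capacity_tb(raw_tb, copies=3)
    print(f"Raw aggregate: {raw_tb / 1000:.2f} PB")
    print(f"Usable with 3 copies: {usable_tb:.2f} TB")
```

Under those assumptions, the raw figure comes out to 1 petabyte, and keeping three copies of everything would leave roughly a third of that for unique data. That trade-off is worth keeping in mind when sizing Hadoop as a storage device.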

“Hadoop is not about real time.”
The Hadoop community prefers to use disk storage embedded in each data node rather than large, centralized, shared storage such as NAS or a SAN (storage area network), for two reasons: 1. Speed. Using DAS (direct-attached storage) in each data node reduces overall cluster latency. Users get closer to having results available to them in real time. 2. Cost. DAS is inexpensive. NAS and SAN are perceived to be not so. But if your Hadoop cluster isn’t about real time, and you want your Hadoop environment to have some of the features that shared storage brings to the table, such as dynamic capacity expansion, snapshotting, and data deduplication, then shared storage is worth considering.

“The big elephant doesn’t move through the little pipes especially well.”
Getting data into and out of Hadoop from remote locations is a problem that has been identified by service providers. Storage system and networking developers have been working out the “data here, data there” issues for years. Distributed storage architectures are emerging that could address this problem.

“Hadoop is the new tape.”
Yikes. We go from Hadoop being the cool new storage thing to yesterday’s toast in the span of a single blog post. Here’s how I interpret this comment. There’s a debate going on within the Hadoop community regarding the need for better responsiveness from Hadoop developers. Known issues with Apache Hadoop need to be addressed more quickly. The user learning curve needs to be shortened. There are other knocks too, all of which lead some people to believe that Hadoop is merely a bridge to some better, future platform. Me? I’m in the definite maybe camp. I do see an opportunity for Hadoop implementations with applications built on top that would address the issue of the long user learning curve.

More to come from this event in future blog posts. Here’s another comment I could start one with: “We don’t actually need big data. What we need is small data.”