Of the numerous ways that have emerged in the last few years to do “big data” analytics, none has received more media attention than Hadoop. Hadoop began conceptually with papers published by Google in 2003 and 2004 describing a distributed file system (the Google File System) and a process for parallelizing the processing of Web-based data (MapReduce). Shortly thereafter, Apache Hadoop was born as an open source implementation of the MapReduce framework. The community surrounding it is now growing dramatically and producing add-ons that expand its usability within corporate data centers.
The potential Hadoop user is now confronted with a growing list of platform sourcing choices that range from open source (Apache Hadoop and Cloudera) to more commercialized and “enterprise ready” versions (EMC Greenplum HD, Hortonworks, MapR). Because each uses storage differently, it’s important for data storage administrators to first understand storage in the context of Apache Hadoop as the commercial versions address what their vendors believe to be storage-related shortcomings. We’ll look more closely at these perceived shortcomings in an upcoming tip on managing big data.
Hadoop in some ways mimics the behavior of a file system, but it really isn’t a file system. Rather, it’s defined in Wikipedia as a software framework that can be used to support data-intensive applications that require the processing of petabyte-scale data sets. The Hadoop framework incorporates a distributed file system (the Hadoop Distributed File System [HDFS]).
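The MapReduce model that the framework implements can be sketched in a few lines of code. The following is a minimal, hypothetical Python illustration of the programming model itself (not Hadoop's actual Java API): a map function emits key/value pairs, a shuffle step groups them by key, and a reduce function aggregates each group — the classic word-count example.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data analytics", "big data"]
print(reduce_phase(shuffle(map_phase(records))))
# {'big': 2, 'data': 2, 'analytics': 1}
```

In a real Hadoop cluster, the map and reduce functions run in parallel on many nodes and the shuffle moves data between them; the sketch above only shows the division of labor.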
A Hadoop cluster is commonly referred to as “shared nothing.” One of the things that distinguishes HDFS from some of the more common file systems like NFS and CIFS is its ability to support a distributed computing, shared nothing architecture.
Shared nothing means almost exactly what it says. In a distributed computing cluster composed of parallelized nodes, the only thing that’s actually shared is the cluster network that interconnects the compute nodes. Nothing else is shared, including storage, which is implemented as disk-based direct-attached storage (DAS). Usually DAS here consists of one set of eight to 10 disks per node configured as RAID or JBOD for maximum performance. Solid-state drives (SSDs) aren’t typically used because of cost.
One of the objectives of the shared nothing paradigm is to reduce processing latency. Keep in mind that we want to process queries that grind through an enormous amount of data, often in five seconds or less. So minimizing cluster-wide latency is a critical priority for Hadoop developers and system architects.
As Hadoop ingests raw data, often from multiple sources, it spreads data across many disks within a Hadoop cluster. Query processes are then assigned to processors in the cluster that operate in parallel on the data spread across the disks directly attached to each processor in the cluster. This minimizes system latency.
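The scheduling idea behind this can be illustrated with a toy simulation (made-up node names and a simple round-robin placement, not HDFS's actual placement policy): blocks of a file are spread across nodes' local disks, and each processing task is then assigned to a node that already holds its block, so no data has to cross the cluster network to reach the task.

```python
# Toy illustration of data locality; not HDFS's real placement logic.
NODES = ["node1", "node2", "node3"]

def place_blocks(block_ids):
    # Spread the file's blocks round-robin across the cluster's local disks.
    return {b: NODES[i % len(NODES)] for i, b in enumerate(block_ids)}

def schedule(placement):
    # Assign each task to the node holding its block (a "data-local" task),
    # so the task reads from direct-attached storage rather than the network.
    return [(block, node) for block, node in placement.items()]

placement = place_blocks(["blk_0", "blk_1", "blk_2", "blk_3"])
for block, node in schedule(placement):
    print(f"{block} processed locally on {node}")
```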
To better understand why latency is minimized, it helps to have at least a rudimentary understanding of the CAP theorem, which is basically stated as follows:
It is impossible for a distributed computer system (Hadoop, for example) to simultaneously provide all three of the following guarantees:

- Consistency: every node sees the same data at the same time
- Availability: every request receives a response
- Partition tolerance: the system continues to operate despite the failure of part of the system

A distributed system can satisfy any two of these guarantees at the same time, but not all three.
Hadoop developers, who understand they can’t have consistency, availability and partition tolerance all at once, typically go for “A” and “P” first and settle for what’s commonly called “eventual consistency.” Cluster-wide performance is maximized to support these assumptions. Latency is squeezed out of the Hadoop compute cluster whenever and wherever possible. Storage is one of the first places developers look for free-floating latency they can grab and throw away, which is why they bring the data closer to the compute node by using DAS. In distributed computing, this concept is commonly referred to as “data locality.” Networked storage (networked-attached storage [NAS] and storage-area networks [SANs]) doesn’t provide data locality to the cluster.
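What “eventual consistency” means in practice can be sketched with a toy two-replica example (illustrative only, not how any particular Hadoop component is implemented): a write is acknowledged as soon as it lands on one replica, other replicas catch up only during a later synchronization pass, and reads in between may return stale data.

```python
# Toy sketch of eventual consistency; illustrative only.
class Replica:
    def __init__(self):
        self.data = {}

def write(primary, key, value):
    # The write is acknowledged once one replica has it...
    primary.data[key] = value

def sync(source, target):
    # ...and propagates to the other replica only later.
    target.data.update(source.data)

a, b = Replica(), Replica()
write(a, "row42", "new")
print(b.data.get("row42"))  # stale read before sync: None
sync(a, b)
print(b.data.get("row42"))  # consistent after sync: new
```

The tradeoff is deliberate: by not forcing every replica to agree before acknowledging a write, the cluster stays available and fast even when some nodes lag behind.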
The intense scrutiny that has followed the media attention has been both good news and bad news for the Hadoop community. The bad news is that commercial vendors have been busy pointing out Hadoop’s shortcomings as they position add-ons and/or competing versions aimed at providing a better Hadoop experience for users, particularly those within the enterprise IT community. The good news is that the heightened scrutiny will, in the long run, accelerate the propagation of the Hadoop analytics framework because commercial vendors will want to make it better, as well as faster and easier to use within the enterprise. The clear implication for storage administrators is that if you haven’t encountered Hadoop yet, don’t assume you’ll never see it in your environment or that it will never fall under your list of management responsibilities.
However, here’s the challenge you may face as Hadoop makes its enterprise charge: HDFS more or less assumes storage is distributed across Hadoop cluster nodes as DAS to provide data locality and to reduce latency. SAN and NAS storage, while scalable and resilient, isn’t appreciated. Therefore, the storage environment you’re familiar with, along with your skills for managing it, may not be appreciated either. But does it have to be that way?
We’re now seeing the use of SAN and NAS as “secondary” storage for Hadoop clusters — storage that essentially functions as a data protection and/or archival storage layer in conjunction with Hadoop’s DAS-based primary storage layer. Also emerging is the unthinkable — replacing DAS with NAS or SAN as Hadoop’s primary storage layer. We may even see scale-out storage systems using distributed computing architectures that run Hadoop’s MapReduce framework on top of their internal file systems.