Vendors have a penchant for attaching the hype cycle of the moment to any new product announcement. This time around, companies are “big data” washing everything they do. As a data storage administrator, you may find this confusing when it comes to the management of big data in your own environment. Vendors talk about big data storage and big data analytics almost in the same breath, so you could understandably conclude the two are connected — that big data storage is storage for big data analytics. At the moment, however, they’re two separate computing technology areas: one devoted to the development of storage platforms at petabyte and even exabyte scale (big data storage), and the other focused on processing very large and diverse data sets in minimum time (big data analytics).
Yet there are at least two connection points between these rapidly moving trends that will become increasingly important to storage administrators. First, big data analytics processes, which are distinctly different from those of traditional data warehousing, are moving into the enterprise at both the business department and data center level. This is where storage administrators enter the picture. As user groups rely more heavily on the platforms these processes run on, Hadoop chief among them, those platforms become business critical and, in turn, subject to enterprise security, data protection and data governance policies.
Second, and for reasons that will be explained later, storage within the distributed computing platforms typically used for big data analytics isn't the network-attached storage (NAS) and storage-area network (SAN) you're used to dealing with; it's direct-attached storage (DAS) buried among and within the distributed computing nodes that make up the cluster. That makes managing big data more complex because you can't apply your established security, protection and preservation processes to this data as you normally would. However, the need to enforce these policies is integral to managing a distributed computing cluster and is changing the way its compute and storage "layers" interact.
In this first article in our series on managing big data in your organization, we'll look at how big data analytics differs from traditional data warehousing and introduce the distributed computing cluster as the foundation of big data analytics. Next, we'll look at storage in distributed computing and take a deeper look into how Hadoop creates and uses a storage layer. After that, we'll examine a three-stage storage model that incorporates NAS and SAN with Hadoop's storage layer. Finally, we'll evaluate Hadoop as a storage device by using some of the same decision points you as a storage administrator would use to evaluate a storage array.
Big data analytics is an area of rapidly growing technological diversity, so trying to define it in terms of a single technology such as Hadoop isn't helpful at this point. However, identifying the characteristics common to the technologies associated with big data analytics is illuminating. These include:
Traditional data warehousing systems typically pull data from existing relational databases. However, it's estimated that more than 80% of stored corporate data is unstructured, meaning data not encompassed by a relational database management system (RDBMS) such as DB2 or Oracle. Generally speaking, and for the purposes of this discussion, unstructured data is all data that doesn't fit easily into a structured relational database. Unstructured data types that organizations now want to extract informational value from include:
In the context of big data analytics, it’s critical to view these data types as far more diverse than RDBMS data — ones that represent a variety of important new information sources. And with the amount of stored unstructured data growing at an annual rate that’s 10 to 50 times faster than structured data, this data becomes even more enticing from a business perspective.
From a big data analytics perspective, the challenge for business executives lies in capturing data from these sources and performing analytical processes to unlock their informational value. Traditional data warehousing technology wasn’t designed to process large volumes of unstructured data in relatively short periods of time (five seconds or less), so new approaches to managing big data are required.
Enter the distributed computing cluster. The distributed computing cluster concept has been around for decades, but it lived at the margins of IT for most of that time. In 2004, Google published a paper describing a process called MapReduce that used such an architecture. Under the MapReduce process, queries are split and distributed across parallel nodes, where they're processed simultaneously (the Map step). The results are then gathered and delivered (the Reduce step). Google's enormous success with the approach inspired others to replicate it, and an open source implementation of MapReduce eventually took shape as the Apache Hadoop project.
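To make the Map and Reduce steps concrete, here's a minimal single-process sketch of the classic MapReduce word-count example. The function names (`map_step`, `reduce_step`) and the sample documents are illustrative, not part of any Hadoop API; in a real cluster, the map calls would run in parallel on separate nodes against data stored locally on each node.

```python
from collections import defaultdict

def map_step(document):
    # Map: emit an intermediate (key, value) pair for each word in one document
    return [(word.lower(), 1) for word in document.split()]

def reduce_step(mapped_pairs):
    # Shuffle/Reduce: group the intermediate pairs by key and sum their values
    counts = defaultdict(int)
    for word, count in mapped_pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data storage", "big data analytics"]

# A real framework distributes these map calls across cluster nodes;
# here they simply run one after another in a single process.
intermediate = []
for doc in documents:
    intermediate.extend(map_step(doc))

result = reduce_step(intermediate)
print(result)  # {'big': 2, 'data': 2, 'storage': 1, 'analytics': 1}
```

The appeal of the model is that each map call is independent, so the framework can move computation to wherever the data already sits rather than moving the data to the computation.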
In part two of this series, we’ll take a closer look at storage from the perspective of the distributed computing cluster, and talk specifically about how Hadoop uses storage.