In my previous article I wrote about the five factors that limit Hadoop’s role in the enterprise datacenter. To recap, the limitations are:
The first three issues stem directly from the architecture recommended for Hadoop clusters: many separate compute nodes, each with its own embedded Direct Attached Storage (DAS). Enterprises choosing Hadoop are forced to accept these limitations in exchange for the power of Hadoop analytics. Is DAS the only choice despite its limitations?
In a well-researched article, John Webster of Evaluator Group poses and answers the question of alternatives to DAS for running Hadoop. Webster writes:
Hadoop storage has issues, and these issues can be addressed by using more robust and scalable storage platforms to support Hadoop clusters.
Based on his research, Webster proposes a three-stage approach to Hadoop storage:
Stage 1: External high-performance storage arrays that still function as DAS
Stage 2: Address Hadoop’s three-copies problem by keeping the primary copy on an external storage array, thereby limiting the size of the DAS and of the cluster (see the configuration sketch after this list)
Stage 3: Use SAN or NAS storage instead of DAS
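To make Stage 2 concrete, here is a rough sketch in Java against Hadoop’s standard Configuration API. The idea is that if the external array’s own RAID or replication supplies the protection that HDFS’s three copies normally provide, the cluster-side copy count can be reduced. The dfs.replication property is standard HDFS; the value of 1 is illustrative only, not a recommendation from Webster’s article.

```java
import org.apache.hadoop.conf.Configuration;

public class Stage2ReplicationSketch {
    public static Configuration reducedReplicationConf() {
        Configuration conf = new Configuration();
        // HDFS defaults to three copies of each block. In the Stage 2 model,
        // the primary copy sits on an external array that already provides
        // its own protection (RAID, snapshots, remote replication), so a
        // single cluster-side copy may suffice. Illustrative value only.
        conf.setInt("dfs.replication", 1);
        return conf;
    }
}
```

Cutting the copy count is what shrinks both the DAS footprint and the number of nodes needed to hold it, which is the whole point of the stage.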
This is a good migration path for an organization that has already gone down the path of building a distributed cluster of compute nodes with DAS. The template provides a way to fix the storage issues and regain the benefits that were forgone with the adoption of DAS.
Looking at this staged approach, it is clear that an enterprise does not have to start at Stage 1 and move through the stages sequentially. Enterprises already have SAN storage and have built experience managing it; in other words, they are already starting at Stage 3. Why move from there to DAS, incurring additional investment and making costly trade-offs, only to return later to where they started?
This does not mean simply running HDFS as it exists today on a file system that supports SAN storage. What this means is enterprises need a way to run Hadoop, utilizing SAN storage, taking advantage of all the benefits and without making trade-offs. This is exactly the solution we are building at Symantec, to enable enterprises run Hadoop on their existing infrastructure.
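What makes such an approach feasible is that Hadoop’s storage layer is pluggable: the framework talks to the org.apache.hadoop.fs.FileSystem abstraction, not to HDFS directly. The Java sketch below is a minimal illustration of that mechanism, not Symantec’s implementation. It points Hadoop at a locally mounted shared file system via the standard fs.defaultFS property (fs.default.name in older releases); /mnt/cfs is a hypothetical mount point that every node would share.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedStorageSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point Hadoop's pluggable FileSystem layer at the local (shared)
        // file system instead of HDFS. In this model, every compute node
        // mounts the same SAN-backed file system at the same path.
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);

        // "/mnt/cfs" is a hypothetical mount point used for illustration.
        Path out = new Path("/mnt/cfs/hadoop/demo.txt");
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(fs.create(out, true)))) {
            writer.write("visible to every node that mounts the shared file system");
        }
        System.out.println("Wrote " + out + " via " + fs.getUri());
    }
}
```

Because every node sees the same namespace through the shared mount, data written this way does not need HDFS-style replication across node-local disks; the shared storage provides the durability.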
The solution is built on Cluster File System, our high-performance file system designed for fast failover of applications and databases. Running Hadoop on Cluster File System enables:
Our position is that the Cluster File System-based Hadoop solution enables enterprises to take advantage of Big Data analytics without the trade-offs.
See here to learn more about the solution and sign up for early access.