Shared Storage in a Shared Nothing Environment — Blog by John Webster

By John Webster, Wednesday, February 9th 2011

Tags: analytics, Big Data, business analytics, Greenplum, Hadoop, Isilon, mapreduce, Netezza, ParAccel, shared disk, shared nothing

The computing industry is seeing dramatic growth in the use of “shared nothing” database architectures, in which each node is self-sufficient and functions independently of the others (the Hadoop Distributed File System, for example). For the sake of performance, these architectures dedicate storage resources to each node and so avoid contention among nodes for shared disk resources (SAN and NAS). In other words, there is no shared disk.

While these computing architectures are best-known in the context of Web-based applications and development activities, they are no longer confined to the Web. EMC Greenplum, IBM Netezza, and ParAccel are all examples of shared-nothing database architectures that are being used increasingly in “big data” business analytics applications and within corporate data centers.

The goal of shared nothing in the context of business analytics is to ingest massive amounts of data, often from multiple data sources, and produce results that can be used in real-time or near-real-time decision making. To that end, a shared-nothing architecture makes extensive use of parallelism, distributing processing across independent and self-sufficient nodes. In addition, the architecture eliminates single points of contention.
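To make the parallelism concrete, here is a minimal sketch in Python of the general shared-nothing pattern. It is not how Greenplum, Netezza, or ParAccel implement it. The function names (partition, local_sum, parallel_sum), the CRC32 hash, and the use of threads to stand in for nodes are all assumptions made for illustration. Each row lives on exactly one node, each node aggregates only its own rows, and a coordinator merges the partial results.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_nodes):
    """Assign each (key, value) row to exactly one node by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        nodes[zlib.crc32(key.encode()) % num_nodes].append((key, value))
    return nodes


def local_sum(node_rows):
    """Work done by a single node; it touches only its own rows."""
    totals = Counter()
    for key, value in node_rows:
        totals[key] += value
    return totals


def parallel_sum(rows, num_nodes=4):
    """Run local_sum on every node in parallel, then merge the partial sums."""
    nodes = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values key by key
    return dict(merged)
```

Because rows with the same key always hash to the same node, no node ever needs to read another node's storage to compute its partial sums. Only the small partial results travel to the coordinator.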

That brings us to shared storage, which in the context of shared nothing is seen as a single point of contention. Shared-nothing nodes operate with their own dedicated, direct-attached storage, typically disk, although some vendors are moving solid-state drives into this role to accelerate performance. However, issues with the dedicated storage-per-node approach can arise.

One occurs when a user initiates a query that requires the system to combine data points by looking across multiple nodes. In this case, response time depends on the system’s inter-nodal communications network. The other revolves around assuring system availability and disaster recovery capabilities. Disk dedicated to each node can be RAID-1-protected, but doing so adds significantly to the cost of these systems, because every node must then be configured with its own mirrored RAID-1 array.
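The cost point is simple arithmetic, sketched below in Python. The function name raw_capacity_tb and the node count and capacity in the usage note are hypothetical numbers chosen for illustration, not figures from any vendor. The only fact relied on is that RAID-1 keeps two copies of every block.

```python
def raw_capacity_tb(num_nodes, usable_tb_per_node, mirrored):
    """Raw disk that must be purchased across the cluster.

    RAID-1 mirroring stores two copies of every block, so it doubles
    the raw capacity needed for the same usable capacity.
    """
    copies = 2 if mirrored else 1
    return num_nodes * usable_tb_per_node * copies
```

For a hypothetical 16-node system with 4 TB usable per node, raw_capacity_tb(16, 4, mirrored=False) gives 64 TB and raw_capacity_tb(16, 4, mirrored=True) gives 128 TB. The mirroring overhead grows linearly with the number of nodes.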

One alternative we have seen recently combines node-based JBOD (just a bunch of disks) storage with a storage-area network (SAN) or network-attached storage (NAS) system to provide system availability, disaster recovery, and better performance when processing queries that involve multiple nodes. An example is ParAccel’s Analytic Database (PADB), which can use a shared storage system (in this case NetApp SAN or NAS) by leveraging a feature ParAccel calls “Blended Scan.”

While the PADB uses a shared-nothing architecture, it can “blend” the use of direct-attached storage (DAS) with SAN or NAS. Distributed DAS handles input/output for each node while mirroring data to a back-end SAN or NAS system. Response to complex queries is accelerated by directing I/Os to both DAS and shared storage. System availability and data protection capabilities are also enhanced through the use of shared storage-based snapshot copy functions.
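The general idea can be pictured with a toy model in Python. This is emphatically not ParAccel's implementation, whose internals are not described here. The class BlendedNode, its methods, and the rule of sending every Nth read to shared storage are all assumptions invented for illustration. The sketch shows only the three behaviors described above: writes are mirrored, scans draw on both stores, and the shared copy can repopulate a node's local storage.

```python
class BlendedNode:
    """Toy node with local DAS blocks and a mirror on shared storage."""

    def __init__(self):
        self.das = {}     # stands in for direct-attached storage
        self.shared = {}  # stands in for a back-end SAN or NAS

    def write(self, block_id, data):
        """Write locally and mirror the block to shared storage."""
        self.das[block_id] = data
        self.shared[block_id] = data

    def scan(self, block_ids, shared_every=2):
        """Read blocks in order, sending every Nth read to shared storage.

        shared_every is a hypothetical tuning knob; 0 means DAS only.
        Returns the data plus a count of reads served by each store.
        """
        results = []
        sources = {"das": 0, "shared": 0}
        for i, block_id in enumerate(block_ids):
            use_shared = shared_every > 0 and i % shared_every == shared_every - 1
            if use_shared:
                results.append(self.shared[block_id])
                sources["shared"] += 1
            else:
                results.append(self.das[block_id])
                sources["das"] += 1
        return results, sources

    def rebuild_das(self):
        """Repopulate local storage from the shared mirror after a DAS loss."""
        self.das = dict(self.shared)
```

In a real system the split between the two stores would presumably depend on measured bandwidth and load rather than a fixed ratio. The fixed ratio here just keeps the sketch deterministic.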

I believe that, over time, shared-nothing database architectures will have a profound impact on storage, both the physical and the more abstract notions of storage. For example, it is common to talk about structured vs. unstructured data. The Greenplum shared-nothing database, however, may be an example of a way to give structure to what are unstructured data sources. The result represents a new class of data yet to be named.

As mentioned, it is now common to see shared-nothing database architectures as an alternative to shared disk. But shared-storage approaches to shared-nothing environments are now appearing as well. While the ParAccel Analytic Database is one, we would not be surprised to see EMC combine the Greenplum Database with Isilon NAS, a recent acquisition that EMC also positions in the “big data” space. As a consequence, our traditional notions about data and storage are about to change.