Categories: Analyst Blogs
Tags: Apache Spark, EXASOL, Forbes, GridGain, In-Memory Computing, John Webster, SAP HANA, Spark, Tier-1 Storage
We’ve been watching the advance of solid state disk (SSD) as its ever-improving price per unit of capacity makes it an attractive replacement for relatively slow, spinning, mechanical disk media. Disk drive manufacturers like Hitachi saw this coming years ago when they projected the discontinuance of high-performance, enterprise-oriented Fibre Channel drives by 2013. Now in 2015, enterprise users have indeed begun what is in some cases a wholesale replacement of primary disk storage arrays with all-flash storage systems.
But what I think has yet to really sink in is that the same kinds of technology advancements that are driving the flash storage revolution are also quietly driving significant changes in the ways that computer system memory can now be implemented and used, as is the case with non-volatile RAM (NVRAM). What may be even less obvious is that the rise of in-memory computing could impact, maybe significantly, the opportunities for solid state disk. Memory is riding down the same cost-per-capacity curve as SSD. And technologies are now coming to market that allow discrete memory modules inside clustered servers to be networked together to form a scalable memory fabric, as I will discuss.
It is now possible for an entire database to live inside memory, potentially obviating the need to page data out to mechanical disk and back in again. Yes, memory has traditionally not been a place to persist data, which is an absolutely essential function for most applications. But that view of memory could be changing as well.
One of SAP’s hottest offerings right now is HANA, which uses in-memory database technology. Oracle and IBM also have products in this category. These products essentially try to keep as much of a database as possible in computer memory, eliminating the disk I/O operations that add significant latency to compute performance. As a result, users see not only better application performance but also an opening for real-time analytics applications: no more waiting until the following day for the data warehouse to produce reports. Users can query the database in real time.
However, SAP HANA and the like use in-memory computing techniques to accelerate the performance of specific vendor-oriented applications. Emerging now are more general-purpose in-memory computing solutions that can be used across a range of applications. Examples include:
GridGain’s Data Fabric – an in-memory analytics software implementation that can be used for real-time Big Data applications. It is designed for a distributed, massively parallel processing environment composed of commodity servers. Distributed cluster memory is used as primary storage for computation, while disk becomes secondary storage for data protection and longer-term persistence. The Data Fabric was originally developed by GridGain as a proprietary solution but is now also an Apache Incubator open source project called Ignite.
EXASOL – a high-performance database with a strong focus on analytic querying. EXASOL’s genesis goes back to the early days of Massively Parallel Processing (remember Kendall Square, MasPar, Thinking Machines?). During processing, data is maintained in memory and CPU cache across a cluster while disk accesses are minimized. EXASOL is somewhat more conservative with memory usage, taking a less aggressive approach to minimizing disk I/O. It uses compression and pipelining to stream data between distributed memory and disk, which remains the persistence layer. The objective here is to use memory more efficiently while greatly reducing disk-induced latency. It is also interesting to note that, because of the streaming nature of EXASOL I/Os, SSD is of little performance benefit.
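The "memory as primary storage, disk as persistence layer" arrangement described above can be illustrated with a toy sketch. This is not how GridGain or EXASOL is implemented; the `WriteThroughStore` class, its append-only log format, and the `demo` function are all hypothetical names invented for illustration. Reads are served only from memory, while every write is also appended to a file so the in-memory state can be rebuilt after a restart.

```python
import json
import os
import tempfile


class WriteThroughStore:
    """Toy key-value store (hypothetical): memory is primary,
    an append-only file on disk is the persistence layer."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memory = {}      # every read is served from this dict
        self._replay_log()    # rebuild in-memory state after a restart

    def _replay_log(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                # Later records overwrite earlier ones for the same key.
                self.memory[record["key"]] = record["value"]

    def put(self, key, value):
        # Persist first, then update memory, so a write that has
        # returned is never lost if the process dies afterwards.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.memory[key] = value

    def get(self, key):
        # No disk I/O on the read path.
        return self.memory.get(key)


def demo():
    log_path = os.path.join(tempfile.mkdtemp(), "store.log")
    store = WriteThroughStore(log_path)
    store.put("order:1", {"total": 42})
    store.put("order:1", {"total": 43})
    # Creating a second instance on the same log simulates a restart.
    restarted = WriteThroughStore(log_path)
    return restarted.get("order:1")


if __name__ == "__main__":
    print(demo())
```

The design choice worth noticing is that disk sits only on the write path. A real product would add replication across cluster nodes, log compaction, and batching of disk writes, none of which is attempted here.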
The open source community is also actively involved in advancing in-memory computing, as in:
Spark – one of the hottest Apache Software Foundation projects going at the moment, hotter than Hadoop in terms of code contributions by committers. Spark is a Big Data analytics platform that leverages both data in memory and data on disk across a distributed cluster with the objective of aggressively maintaining data in memory. It can run within a Hadoop cluster using HDFS as a data store, or stand-alone with other data stores like Amazon’s S3. Levels of data persistence in memory can be set by the developer. Commercial distributions are now appearing, including those from Databricks and Stratio. Currently, there is an active debate over whether or not SSD can improve Hadoop performance enough to justify the cost.
Tachyon – a project under development at the UC Berkeley AMPLab. Tachyon is a memory-centric, fault-tolerant distributed file system that enables file sharing at memory speed across distributed clusters such as Hadoop. Existing Spark and MapReduce programs can run on top of Tachyon without code changes. It is another example of aggressively maintaining data in memory to reduce latency.
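To make the idea of developer-selectable persistence levels concrete, here is a small self-contained sketch. It does not use Spark's API; `PartitionCache`, its `spill_to_disk` flag, and `count_recomputations` are hypothetical names for illustration only. The two settings loosely mirror the difference between Spark's `MEMORY_ONLY` and `MEMORY_AND_DISK` storage levels: when memory is too small, a partition that has been evicted is either recomputed from scratch or read back from disk.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class PartitionCache:
    """Toy partition cache (hypothetical) with a fixed memory budget."""

    def __init__(self, compute, max_in_memory, spill_to_disk):
        self.compute = compute                # function: partition id -> data
        self.max_in_memory = max_in_memory    # budget, counted in partitions
        self.spill_to_disk = spill_to_disk    # False: recompute, True: spill
        self.memory = OrderedDict()           # ordered least to most recently used
        self.spill_dir = tempfile.mkdtemp()
        self.compute_calls = 0

    def _spill_path(self, pid):
        return os.path.join(self.spill_dir, f"part-{pid}.pkl")

    def _evict_if_needed(self):
        while len(self.memory) > self.max_in_memory:
            # Drop the least recently used partition from memory.
            pid, data = self.memory.popitem(last=False)
            if self.spill_to_disk:
                with open(self._spill_path(pid), "wb") as f:
                    pickle.dump(data, f)

    def get(self, pid):
        if pid in self.memory:
            self.memory.move_to_end(pid)      # mark as most recently used
            return self.memory[pid]
        path = self._spill_path(pid)
        if self.spill_to_disk and os.path.exists(path):
            with open(path, "rb") as f:       # slower than memory, no recompute
                data = pickle.load(f)
        else:
            self.compute_calls += 1           # the expensive path
            data = self.compute(pid)
        self.memory[pid] = data
        self._evict_if_needed()
        return data


def count_recomputations(spill_to_disk):
    # Three partitions, room for two in memory, two full passes over the data.
    cache = PartitionCache(lambda pid: [pid] * 3,
                           max_in_memory=2,
                           spill_to_disk=spill_to_disk)
    for _ in range(2):
        for pid in range(3):
            cache.get(pid)
    return cache.compute_calls


if __name__ == "__main__":
    print("memory only:    ", count_recomputations(False), "computations")
    print("memory and disk:", count_recomputations(True), "computations")
```

In this toy run the memory-only setting computes partitions six times, while the memory-and-disk setting computes each of the three partitions once and serves later misses from disk. Which trade-off is better depends on how expensive recomputation is relative to disk reads, which is the kind of decision the persistence level leaves to the developer.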
It is well known that SSD can dramatically accelerate the performance of transaction-oriented database applications. But in-memory computing can take that acceleration one giant step further. Yes, it could be more expensive than SSD and at the same time not address all of an application’s data persistence requirements. But as we have seen with SSD, the price of memory is coming down while functionality defined in software is going up.
Storage professionals often think of SSD as a high-performance storage tier in a logical structure that includes disk as the next tier down in performance, followed by tape and optical. Here, SSD is commonly referred to as Tier 0. Could they now see distributed, in-memory storage as Tier -1? And could the advance of in-memory computing diminish the usefulness of SSD?
Register for John Webster’s March 12 webinar on “Storage for Big Data: It’s Way More than Just JBODs and DAS”