Hadoop has progressed from a large-scale, batch-oriented analytics tool used by a handful of webscalers to a multi-application processing platform for webscale and enterprise users. The vendors of Hadoop’s three major distributions uniformly characterize it as a modern architecture for Big Data that will integrate with and complement the enterprise’s existing Business Intelligence (BI) systems. And they’ve been saying this for at least a year.
The problem is that the adoption rate of Hadoop by enterprise data center administrators has been slow at best. I believe that there are a number of reasons for this, including a desire for real-time applications and a reluctance to move large volumes of data from existing data stores to Hadoop’s. As a result, some wonder if there are other platforms that could address these needs. To date, at least two have surfaced: Apache Spark and Facebook’s Presto.
Apache Spark combines the analysis of stored data in batch mode (the kind of work MapReduce does) with the analysis of data entering the system in real time, all on a single platform. It achieves real-time processing speed by aggressively keeping data in the memory of its processing nodes, and its processing engine is considerably more lightweight than Hadoop’s in terms of lines of code.
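To make those two ideas concrete, here is a minimal sketch in plain Python. It is not Spark's API: CachedDataset and count_words are hypothetical names invented for illustration, and the lambda stands in for a read from slow storage. The point is only that data is read once and then served from memory, and that the same analysis code runs over stored data and over small batches of newly arriving data.

```python
from typing import Callable, Dict, Iterable, List, Optional


class CachedDataset:
    """Toy stand-in (hypothetical) for a dataset pinned in memory after its first read."""

    def __init__(self, loader: Callable[[], Iterable[str]]):
        self._loader = loader                    # assumed to read from slow storage
        self._rows: Optional[List[str]] = None   # filled on first access
        self.loads = 0                           # how many times storage was read

    def rows(self) -> List[str]:
        if self._rows is None:                   # only the first access touches storage
            self._rows = list(self._loader())
            self.loads += 1
        return self._rows


def count_words(lines: Iterable[str],
                counts: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """One piece of analysis code, used for both stored and newly arrived lines."""
    counts = dict(counts or {})                  # copy so earlier results stay intact
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


stored = CachedDataset(lambda: ["spark presto", "spark hadoop"])

# Batch mode: two passes over the stored data, but only one read from storage.
batch_counts = count_words(stored.rows())
count_words(stored.rows())

# Streaming mode: fold small batches of new lines into the running counts.
live_counts = batch_counts
for micro_batch in (["presto"], ["spark spark"]):
    live_counts = count_words(micro_batch, live_counts)
```

After this runs, stored.loads is 1 even though the data was analyzed twice, and live_counts holds the batch totals plus whatever arrived afterwards. A real cluster spreads the cached rows across many nodes, which this single-process sketch does not attempt to show.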
At present, the majority of Spark implementations are tethered to Hadoop, either by using Hadoop’s HDFS as a data store or by running Spark concurrently with other applications within a Hadoop cluster. However, Spark can use a number of different data stores as sources and repositories, including Apache Cassandra and Amazon S3. And because it’s not bound to HDFS, it doesn’t inherit the shortcomings of HDFS. Furthermore, it can use these data sources as an independent analytics tool in stand-alone mode, with no dependence on Hadoop at all. It is also currently the hottest of the Apache Software Foundation’s projects in terms of the number of committers engaged and the number of lines of code contributed.
Presto is not another better/faster MapReduce implementation. Rather, according to Facebook, Presto is a new interactive query system that operates fast at petabyte scale, founded on a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. And, like Spark, it does all of its processing in memory. Facebook recently open-sourced the code and the Presto community can be found here.
Unlike Spark or Hadoop, Presto can query a number of data stores concurrently. All that is needed are “connectors” that provide interfaces for metadata, data locations, and data access. This obviates the need to move data around in order to query it, which is becoming a critical requirement for many IT administrators. Simply plug the data source into Presto and (presto!) it can be interactively queried in real time. Connectors are currently available for Hadoop/Hive (Apache and Cloudera distributions) and Cassandra. But one can imagine that more could be built for an enterprise’s existing data stores.
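The connector idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Presto's actual connector interface: InMemoryConnector, ToyEngine, and every method name below are hypothetical, chosen only to mirror the three responsibilities named above (metadata, data locations, data access). The two "stores" are in-memory dictionaries standing in for Hive and Cassandra.

```python
from typing import Dict, Iterator, List, Tuple


class InMemoryConnector:
    """Hypothetical connector; a real one would talk to Hive, Cassandra, and so on."""

    def __init__(self, tables: Dict[str, Dict[str, List[dict]]]):
        # table name -> split id (a data location) -> rows stored there
        self._tables = tables

    def list_tables(self) -> List[str]:                 # metadata
        return sorted(self._tables)

    def get_splits(self, table: str) -> List[str]:      # data locations
        return sorted(self._tables[table])

    def read_split(self, table: str, split: str) -> Iterator[dict]:  # data access
        yield from self._tables[table][split]


class ToyEngine:
    """Hypothetical engine that reads every store in place through its connector."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, InMemoryConnector] = {}

    def register(self, catalog: str, connector: InMemoryConnector) -> None:
        self._catalogs[catalog] = connector

    def scan(self, catalog: str, table: str) -> Iterator[dict]:
        connector = self._catalogs[catalog]
        for split in connector.get_splits(table):
            yield from connector.read_split(table, split)

    def join(self, left: Tuple[str, str], right: Tuple[str, str],
             key: str) -> Iterator[dict]:
        # Simple hash join: index the right side, then stream the left side past it.
        index: Dict[object, List[dict]] = {}
        for row in self.scan(*right):
            index.setdefault(row[key], []).append(row)
        for row in self.scan(*left):
            for match in index.get(row[key], []):
                yield {**match, **row}


engine = ToyEngine()
engine.register("hive", InMemoryConnector({
    "orders": {"part-0": [{"user_id": 1, "total": 30}],
               "part-1": [{"user_id": 2, "total": 12}]},
}))
engine.register("cassandra", InMemoryConnector({
    "users": {"node-a": [{"user_id": 1, "name": "Ana"},
                         {"user_id": 2, "name": "Bo"}]},
}))

# One query spanning two stores; no data was copied into a common store first.
rows = list(engine.join(("hive", "orders"), ("cassandra", "users"), key="user_id"))
```

Here rows combines each order with its user's name even though the two tables live behind different connectors. Supporting another data store in this sketch would mean writing one more class with the same three methods, which is the appeal of the connector approach.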
Both Spark and Presto answer the call from enterprise users for a real-time analytics platform that functions at large scale. In addition, Presto has the potential to bridge gaps between the enterprise’s existing data silos by obviating the need to move data into HDFS prior to querying. And because both can be run independently of Hadoop, they can be seen as emerging alternatives. Why do I think that’s news? Surveys continue to report slow adoption rates for Hadoop within enterprise IT. Yet interest in Big Data solutions among these same potential users is at an all-time high. Therefore, the Hadoop alternatives will get increasing attention.