I’ve been accepted to Google Summer of Code 2015 with my proposal for Apache Gora. The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support. Gora uses Apache Avro: the beans that hold the data and the RPC interfaces are defined with a JSON schema. To map the data beans to data store specific settings, Gora relies on mapping files, which are specific to each data store. Unlike other OTD (Object-to-Datastore) mapping implementations, in Gora the mapping between a data bean and a data store specific schema is explicit. This has the advantage that, when using data stores such as HBase and Cassandra, you always know how the values are persisted.
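To make this concrete, here is an illustrative sketch of how such a pair might look: an Avro JSON schema for a hypothetical WebPage bean, and the kind of HBase mapping file Gora uses alongside it. The bean, table, column family and qualifier names here are all my own invented examples, not taken from any real project:

```json
{
  "type": "record",
  "name": "WebPage",
  "namespace": "org.example.gora",
  "fields": [
    {"name": "url",       "type": "string"},
    {"name": "content",   "type": ["null", "string"], "default": null},
    {"name": "fetchTime", "type": "long", "default": 0}
  ]
}
```

And a matching gora-hbase mapping file, which spells out exactly where each field is persisted:

```xml
<gora-orm>
  <table name="webpage">
    <family name="f"/>
  </table>
  <class name="org.example.gora.WebPage" keyClass="java.lang.String" table="webpage">
    <field name="url"       family="f" qualifier="u"/>
    <field name="content"   family="f" qualifier="c"/>
    <field name="fetchTime" family="f" qualifier="t"/>
  </class>
</gora-orm>
```

This explicitness is exactly the advantage mentioned above: the mapping file tells you, field by field, which column family and qualifier holds each value.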
Apache Spark, on the other hand, is advertised as “lightning-fast cluster computing”. It has a thriving open source community and is the most active Apache project at the moment. Spark provides a faster and more general data processing platform: it lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines, and it also became the fastest open source engine for sorting a petabyte. There are also some alternatives to Apache Spark, such as Apache Tez.
Even though Spark is this powerful compared to the MapReduce engine Gora currently supports, there is no Spark backend for Gora. This proposal aims to develop a solution for Gora to support Spark. To support Spark within Gora, Spark’s own data format, the RDD (Resilient Distributed Dataset), should be supported.
The following behaviors are specific to Hadoop’s implementation rather than to the idea of MapReduce in the abstract:
• Mappers and Reducers always use key-value pairs as input and output.
• A Reducer reduces values per key only.
• A Mapper or Reducer may emit 0, 1 or more key-value pairs for every input.
• Mappers and Reducers may emit any arbitrary keys or values, not just subsets or transformations of those in the input.
• Mapper and Reducer objects have a lifecycle that spans many map() and reduce() calls. They support a setup() and cleanup() method, which can be used to take actions before or after a batch of records is processed.
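As a rough, framework-free sketch of this contract (plain Python standing in for Hadoop’s Java API; every class and method name here is made up for illustration), the behaviors above look like this:

```python
from collections import defaultdict

class WordCountMapper:
    """Lifecycle spans many map() calls, with setup()/cleanup() hooks."""
    def setup(self):
        self.pairs_emitted = 0          # state kept across map() calls

    def map(self, key, value, emit):
        # May emit 0, 1 or more key-value pairs per input record,
        # and the emitted keys need not appear in the input at all.
        for word in value.split():
            emit(word, 1)
            self.pairs_emitted += 1

    def cleanup(self):
        pass                            # e.g. flush buffers, close resources

class SumReducer:
    def setup(self):
        pass

    def reduce(self, key, values, emit):
        # Reduces the values of a single key only.
        emit(key, sum(values))

    def cleanup(self):
        pass

def run(records, mapper, reducer):
    """Minimal driver: map phase, shuffle by key, reduce phase."""
    shuffled = defaultdict(list)
    out = {}
    mapper.setup()
    for k, v in records:                # many map() calls per Mapper object
        mapper.map(k, v, lambda mk, mv: shuffled[mk].append(mv))
    mapper.cleanup()
    reducer.setup()
    for k, vs in shuffled.items():      # one reduce() call per key
        reducer.reduce(k, vs, lambda rk, rv: out.__setitem__(rk, rv))
    reducer.cleanup()
    return out

result = run([(1, "gora spark gora")], WordCountMapper(), SumReducer())
# result == {"gora": 2, "spark": 1}
```

Spark’s RDD model generalizes this: transformations are not forced into the key-value, one-reduce-per-key mold, which is part of why a Spark backend needs more than a thin shim over the existing MapReduce support.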
Here is what I’ll try to implement during my GSoC period:
1) Gora Input Format to RDD Transformation: Gora has its own input/output formats, while Spark works with RDDs. Gora input formats should be transformed into RDDs.
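To make this first item concrete, here is a toy, framework-free Python sketch (standing in for the Java/Scala APIs; every name below is invented): an input format that yields its records split by split is wrapped into a minimal RDD-like object whose partitions mirror the splits. In a real implementation one would more likely reuse Spark’s built-in ability to read Hadoop input formats (e.g. `SparkContext.newAPIHadoopRDD`) on top of Gora’s existing input format, but the shape of the transformation is the same:

```python
class ToyInputFormat:
    """Stand-in for a Gora-style input format: splits plus per-split records."""
    def __init__(self, data, num_splits):
        self.data, self.num_splits = data, num_splits

    def get_splits(self):
        # Divide the sorted (key, value) pairs into roughly equal splits.
        items = sorted(self.data.items())
        n = self.num_splits
        return [items[i::n] for i in range(n)]

class ToyRDD:
    """Minimal RDD-like wrapper: one partition per input split."""
    def __init__(self, partitions):
        self.partitions = partitions

    @classmethod
    def from_input_format(cls, input_format):
        return cls(input_format.get_splits())

    def map(self, fn):
        # A transformation produces a new RDD; partitioning is preserved.
        return ToyRDD([[fn(rec) for rec in part] for part in self.partitions])

    def collect(self):
        # An action materializes all partitions on the "driver".
        return [rec for part in self.partitions for rec in part]

store = ToyInputFormat({"k1": 10, "k2": 20, "k3": 30}, num_splits=2)
rdd = ToyRDD.from_input_format(store)
doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2)).collect()
# sorted(doubled) == [("k1", 20), ("k2", 40), ("k3", 60)]
```

The key design point is that each input split becomes one RDD partition, so Spark can schedule work with the same data locality the input format already exposes.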
2) Generic Abstraction Layer Backend: Gora was developed for Hadoop MapReduce and does not have an execution-engine abstraction. The necessary infrastructure should be changed so that a Spark backend can be supported properly.
3) Data Storage via GoraInputmap: Gora’s internal classes should be updated so that data can be read and written both the way it is done now and in the Spark style.
You can check my proposal on the Google Melange web site. I’ll try to share my experiences during the GSoC 2015 period.