In-memory Data Model and Persistence for Big Data

ORM frameworks help developers interact with relational databases. Many mature ORM frameworks exist for relational databases, such as Hibernate and Apache OpenJPA.

Big Data

Big data is on the rise, and more and more people develop applications that run on it. Different kinds of NoSQL databases, such as column stores and document stores, have been developed to store data at this scale.

ORM frameworks solve many problems (although they have drawbacks) and are common on the relational side, but the situation is different for NoSQL databases, because NoSQL databases do not share a common standard.

Apache Gora

Apache Gora aims to give users an easy-to-use in-memory data model and persistence framework for big data, with data store specific mappings. The overall goal for Apache Gora is to become the standard data representation and persistence framework for big data.

Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Gora uses Apache Avro and depends on mapping files, which are specific to each data store. Unlike other OTD (Object-to-Datastore) mapping implementations, in Gora the data bean to data store specific schema mapping is explicit. This has the advantage that, when using data models such as HBase and Cassandra, you can always know how the values are persisted.
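
To make the idea of an explicit mapping concrete, here is a minimal sketch in plain Java. It does not use Gora or Avro: the `Column` class, the `toColumns` method, and the field and column names are all hypothetical, chosen only for illustration. It shows how each bean field is assigned to a named column by a mapping table, so you can read off exactly where a value ends up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExplicitMappingSketch {

    /** Hypothetical mapping entry: one bean field goes to one column (family:qualifier). */
    static final class Column {
        final String family;
        final String qualifier;

        Column(String family, String qualifier) {
            this.family = family;
            this.qualifier = qualifier;
        }

        String qualifiedName() {
            return family + ":" + qualifier;
        }
    }

    /**
     * Applies an explicit field-to-column mapping to a bean, represented here as a
     * plain map of field name to value. Fields without a mapping are rejected, so
     * nothing is persisted implicitly.
     */
    static Map<String, Object> toColumns(Map<String, Object> bean, Map<String, Column> mapping) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : bean.entrySet()) {
            Column column = mapping.get(field.getKey());
            if (column == null) {
                throw new IllegalArgumentException("No mapping for field: " + field.getKey());
            }
            row.put(column.qualifiedName(), field.getValue());
        }
        return row;
    }

    public static void main(String[] args) {
        // Illustrative mapping; in Gora this information lives in a per-store mapping file.
        Map<String, Column> mapping = new LinkedHashMap<>();
        mapping.put("url", new Column("common", "url"));
        mapping.put("byteSize", new Column("http", "size"));

        Map<String, Object> bean = new LinkedHashMap<>();
        bean.put("url", "/index.html");
        bean.put("byteSize", 4096);

        System.out.println(toColumns(bean, mapping));
    }
}
```

Rejecting unmapped fields is a design choice of this sketch, not a statement about how Gora behaves; it simply keeps the mapping the single source of truth.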

The roadmap of Apache Gora:

Data Persistence: Persisting objects to column stores such as HBase, Cassandra and Hypertable; key-value stores such as Voldemort and Redis; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
Data Access: An easy to use Java-friendly common API for accessing the data regardless of its location.
Indexing: Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
Analysis: Accessing and analyzing the data through adapters for Apache Pig, Apache Hive and Cascading.
MapReduce support: Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
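
The "Data Access" item above, a common API regardless of where the data lives, can be sketched as follows. This is an illustration under stated assumptions, not Gora's actual API: `SimpleStore`, `InMemoryStore` and `totalBytes` are hypothetical names, and the in-memory map only stands in for a real backend.

```java
import java.util.Map;
import java.util.TreeMap;

class CommonApiSketch {

    /** Hypothetical backend-neutral store interface; not Gora's actual API. */
    interface SimpleStore<K, T> {
        void put(K key, T value);
        T get(K key);
        boolean delete(K key);
    }

    /** In-memory backend, standing in for a real column or document store. */
    static final class InMemoryStore<K extends Comparable<K>, T> implements SimpleStore<K, T> {
        private final Map<K, T> rows = new TreeMap<>();

        @Override public void put(K key, T value) { rows.put(key, value); }
        @Override public T get(K key) { return rows.get(key); }
        @Override public boolean delete(K key) { return rows.remove(key) != null; }
    }

    /** Application code depends only on the interface, so the backend can be swapped. */
    static int totalBytes(SimpleStore<Long, Integer> store, long... keys) {
        int total = 0;
        for (long key : keys) {
            Integer size = store.get(key);
            if (size != null) {
                total += size; // keys that are not present are skipped
            }
        }
        return total;
    }

    public static void main(String[] args) {
        SimpleStore<Long, Integer> store = new InMemoryStore<>();
        store.put(1L, 4096);
        store.put(2L, 1024);
        System.out.println(totalBytes(store, 1L, 2L, 3L));
    }
}
```

Because `totalBytes` only sees the interface, replacing `InMemoryStore` with another implementation would not change the calling code, which is the property a common data access API is meant to provide.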

What are the differences between Apache Gora and current solutions?

• Gora is specifically focused on NoSQL data stores, but also has limited support for SQL databases.
• The main use case for Gora is to access/analyze big data using Hadoop.
• Gora uses Avro for bean definition, not byte code enhancement or annotations.
• Object-to-data store mappings are backend specific, so that the full data model of each backend can be utilized.
• Gora is simple since it ignores complex SQL mappings.
• Gora will support persistence, indexing and analysis of data, using Pig, Lucene, Hive, etc.

Datastores supported by Apache Gora:

• Apache Accumulo
• Apache Cassandra
• Amazon DynamoDB
• Apache HBase
• Apache Solr
• MongoDB

Apache Spark

Apache Spark is a standout project for big data developers. Spark provides a faster and more general data processing platform. According to the Spark project, programs can run up to 100x faster in memory, or 10x faster on disk, than on Hadoop MapReduce. Currently Gora doesn’t support Spark, and during my GSoC period I’m implementing a Spark backend for Apache Gora to fill that gap.

kamaci
