Apache HBase is an open source, column-oriented NoSQL database management system that runs on top of HDFS (Hadoop Distributed File System), unlike relational database management systems. It is natively integrated with Hadoop and works alongside other data access engines through YARN. HBase does not support a structured query language like SQL. It works well with sparse data sets, which are common in many big data use cases. A dataset is called sparse not because the information it represents is less comprehensive, but because most rows populate only a few of the possible columns. HBase stores only the cells that actually hold values, so the empty ones cost nothing, which is part of what makes it valuable to business owners.
Much like a traditional database, an HBase database contains a set of tables with rows and columns. Each table must define a primary key, known in HBase as the row key, and all access to the table goes through this key.
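The idea of a sparse table that is reached only through its key can be illustrated with a small in-memory model. This is a toy sketch in plain Python, not the HBase client API; the class `ToyTable`, its methods, and the sample row keys and column names are all hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a sparse, key-addressed table (not the HBase API)."""

    def __init__(self):
        # row key -> {"family:qualifier": value}; cells without a value are never stored
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell under the given row key."""
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of all cells stored for a row key (empty dict if none)."""
        return dict(self._rows.get(row_key, {}))

    def cell_count(self):
        """Count the cells that are actually stored."""
        return sum(len(cells) for cells in self._rows.values())


table = ToyTable()
table.put("user#1001", "profile:name", "Asha")
table.put("user#1001", "profile:city", "Pune")
table.put("user#1002", "profile:name", "Ben")  # no city stored for this row

print(table.get("user#1002"))  # {'profile:name': 'Ben'}
print(table.cell_count())      # 3
```

The second row has no city, so only three cells exist in total rather than four; nothing is reserved for the missing value.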
HBase is modeled on Google's Bigtable. The data in its tables is stored in key-value format. HBase provides an API that enables development in almost any programming language, and it gives random, real-time read/write access to data stored in the Hadoop file system.
Apache Hadoop is widely used to store, manage and process big data, and has gained popularity for handling large volumes of multi-structured data. However, HDFS struggles with a high velocity of random reads and writes, and it cannot change a file without completely rewriting it. HBase is a NoSQL database built on top of Hadoop to overcome these drawbacks of HDFS: it allows fast random reads and writes in an optimized way. Traditional relational databases, for their part, find it difficult to maintain performance while handling a large variety of exponentially growing data. HBase provides scalability and partitioning for efficient storage and retrieval of that data.
The HBase architecture comprises four main components: HMaster, HRegionServer, HRegions and ZooKeeper.
The first component is HMaster, the implementation of the master server in the HBase architecture. HMaster acts as a monitoring agent that watches all Region Server instances in the cluster, and it is the interface for all metadata changes. In a distributed cluster environment, the master runs on the NameNode and also starts several background threads.
The HMaster performs the following functions.
An HBase architecture contains multiple region servers. When clients send many read and write requests, the Region Server assigns each request to a specific region, which is where the actual column family data resides.
A client can also communicate directly with HRegion servers without any permission from HMaster. The help of HMaster is required for metadata manipulation and schema change operations.
HRegionServer is the implementation of the Region Server. It serves and manages the regions, that is, the data, present in a distributed cluster. The region servers run on the DataNodes of the Hadoop cluster.
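How a request ends up at a specific region can be sketched as a lookup over sorted start keys, assuming, as in HBase, that each region covers a contiguous, sorted range of row keys. The start keys, server names and the function `locate_region` below are hypothetical and only illustrate the routing idea; they are not part of any HBase API.

```python
import bisect

# Hypothetical layout: each region starts at its start key and runs up to the next one.
REGION_START_KEYS = ["", "g", "n", "t"]        # sorted; "" means "from the very first key"
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # region server hosting each region


def locate_region(row_key):
    """Return (region index, server name) for the region whose key range contains row_key."""
    # The owning region is the last one whose start key is <= row_key.
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]


print(locate_region("alice"))  # (0, 'rs1')
print(locate_region("nina"))   # (2, 'rs1')
print(locate_region("zed"))    # (3, 'rs3')
```

A binary search keeps the lookup cheap even with many regions, and one server can host several regions, as `rs1` does here.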
The HMaster along with the different HRegion Servers performs the following functions.
HRegions are the basic building blocks of an HBase cluster; they hold the distribution of tables and are made up of column families. Each column family has its own store. The two main components of an HRegion are the MemStore and the HFile.
HFile is the file on disk in which the data written by clients is ultimately stored.
MemStore holds all in-memory modifications to the store until they are written out to an HFile.
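The interplay between the in-memory MemStore and the on-disk HFile can be sketched with a toy store. This is a simplified model in plain Python, not HBase code: the class `ToyStore` and its `flush_threshold` parameter are hypothetical, the flush is triggered by a cell count rather than by memory size, and details such as write-ahead logging and compaction are left out.

```python
class ToyStore:
    """Toy model of one column family's store: an in-memory buffer plus flushed files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical: flush after this many buffered cells
        self.memstore = {}                      # recent writes, held in memory
        self.hfiles = []                        # flushed files, oldest first; each is a sorted list

    def put(self, key, value):
        """Buffer a write in memory and flush once the buffer is full."""
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered cells out as one sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Read the newest value: check memory first, then files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")  # buffer is full, so both cells are flushed to a file
store.put("row1", "c")  # newer value for row1 stays in memory

print(store.get("row1"))  # c
print(store.get("row2"))  # b
```

Writes never modify a flushed file; a newer value simply shadows the older one at read time, which is how random writes can sit on top of a file system that cannot edit files in place.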
To avoid a single point of failure, the ZooKeeper service is provided by a cluster of servers. ZooKeeper uses a distributed consensus protocol to determine which node is the leader at any given time. The leader assigns a timestamp to each update to keep updates in order. Once a majority of nodes have acknowledged receipt of a time-stamped update, the leader declares a quorum, and the data in the update can then be applied to the data store. The quorum ensures that the service always returns consistent answers.
ZooKeeper is a centralized monitoring server that maintains configuration information and provides distributed synchronization. Distributed synchronization means coordinating access to the distributed applications running across the cluster and maintaining coordination services between the nodes. A client has to contact ZooKeeper first in order to communicate with regions.
ZooKeeper is an open source project, and it provides many important services.
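The majority rule described above can be sketched in a few lines. This is a toy illustration of stamping updates in order and committing only on a majority of acknowledgements; it is not ZooKeeper's actual protocol or API, and the class `ToyLeader` and its `propose` method are hypothetical.

```python
import itertools


class ToyLeader:
    """Toy leader that orders updates and commits only with a majority of acknowledgements."""

    def __init__(self, ensemble_size):
        self.ensemble_size = ensemble_size  # number of servers in the cluster
        self._stamps = itertools.count(1)   # increasing stamps stand in for timestamps

    def propose(self, update, acks):
        """Stamp the update and report whether `acks` acknowledgements form a quorum."""
        stamp = next(self._stamps)
        committed = acks > self.ensemble_size // 2  # strict majority
        return stamp, update, committed


leader = ToyLeader(ensemble_size=5)
print(leader.propose("set /config/a", acks=3))  # (1, 'set /config/a', True)
print(leader.propose("set /config/b", acks=2))  # (2, 'set /config/b', False)
```

With five servers, three acknowledgements are enough to commit, so the service can keep answering consistently even when two servers are unreachable.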
The main requirement of the telecom industry is to store the billions of CDR (call detail record) log records generated every day. Once the records are stored, users and customer care executives need real-time access to the CDR logs and billing information. The industry also needs a solution that is cost effective compared with a traditional database system.
HBase is used to meet all of these requirements. It stores billions of rows of call detail records. If 20 TB of data is added per month to an existing RDBMS database, performance will deteriorate; HBase is a better fit for managing data at this volume. HBase also queries and displays records quickly.
Millions of records are generated every day in the banking industry. It also needs analytical solutions that can detect fraud in money transactions.
HBase integrated with several Hadoop ecosystem components is an ideal solution for storing, processing and updating this huge volume of data and for performing analytics on it.
Apart from that, HBase can be used in the following areas.
HBase has applications in the medical domain as well. It is used to store genome sequences, which can be further refined by running MapReduce on them. It can also store the disease history of people by area, city or country, among other things.
HBase can be used in sports to store match histories. This dataset can then be used for better analytics and prediction.
The history of users accessing the web can be stored in HBase for better customer targeting and for developing predictive models in the analytics domain.
The exploration data in the oil and petroleum industries can be stored for analysis using HBase. This can be used for predicting the places where oil can be found.
HBase finds application in e-commerce as well. It can be used to record and store logs and essential details of customers' search history. Analytics can be performed later to target advertisements for better business.
Across production and development, Adobe has about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes. Adobe uses HBase in several areas, from social services to structured data and processing for internal use. Data is written to HBase continuously, and MapReduce jobs are run to process it; the results are stored back in HBase or in external systems. Adobe has been running its production cluster since October 2008.
Astro provides fast Spark SQL/DataFrame capabilities for HBase data. It offers highly efficient access to multi-dimensional HBase rows through native Spark execution in HBase coprocessors, together with systematic and accurate partition pruning and predicate pushdown from arbitrarily complex data filtering logic. Batch load is optimized to run on the Spark execution engine.
Pinterest is deployed entirely on Amazon EC2. It uses a follow model in which users follow other users. This requires a following feed for every user that is updated every time someone they follow creates or updates a pin, which is a classic social media application problem. For Pinterest, this amounts to hundreds of millions of pins per month that get fanned out as billions of writes per day.
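The way a single pin turns into many writes can be sketched as fan-out on write. This is a toy illustration in plain Python, not Pinterest's implementation; the follow graph, the `feeds` structure and the function `publish_pin` are hypothetical.

```python
from collections import defaultdict

# Hypothetical follow graph: author -> users who follow that author.
followers = {
    "ana": ["bob", "cy"],
    "bob": ["ana"],
}
feeds = defaultdict(list)  # user -> list of (author, pin) entries in their following feed


def publish_pin(author, pin):
    """Write the pin into every follower's feed and return the number of writes."""
    writes = 0
    for user in followers.get(author, []):
        feeds[user].append((author, pin))
        writes += 1
    return writes


print(publish_pin("ana", "pin-1"))  # 2
print(feeds["bob"])                 # [('ana', 'pin-1')]
```

One pin costs one write per follower, which is why a comparatively small number of pins multiplies into a much larger number of writes and calls for a store that handles heavy random writes well.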
Groupon has two use cases.
The first is delivering deals to users via email, which is a batch processing job; the second is providing a relevant user experience on the website. Groupon has implemented personalization at the user level in order to make its deals more accurate and relevant.
MapReduce was used for email deal delivery and MySQL for the online application. The Relevance and Personalization system runs on HBase. To satisfy the very different workload characteristics of the two systems (email and online), two HBase clusters are used. The clusters are replicated so that they hold the same content, but they are tuned and accessed differently.
Groupon uses a wide schema with two column families: one for the user's history and profile, and the other for email history.