Apache HBase is an open source NoSQL database management system which is column-oriented and runs on top of HDFS (Hadoop Distributed File System), unlike Relational Database Management Systems. It is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN. HBase does not support a structured query language like SQL. It works well in case of sparse data sets and is common in many big data use cases. A dataset is referred as sparse not solely on the basis of information that it represents is less comprehensive. A sparse data can be considered fairly a rare dataset. This is the reason why it is of more value for the business owners.
Much like a traditional database, an HBase database contains a set of tables having rows and columns. It is necessary to define a Primary Key in each table of HBase database. This Primary Key should be used for making all the access attempts to this database.
HBase is based on Google's Big Table. The data contained in these tables are in the form of key value format. This database is able to provide an API enabling development in almost any programming language. The Hadoop ecosystem provides a random real-time read/write access to the data present in Hadoop File System.
2. Need for HBase
In order to store, manage and process big data Apache Hadoop is widely used and has gained popularity for handling a large volume of multi-structured data in the big data space. On the other side, it is difficult for Apache Hadoop to handle high velocity of random writes and reads. Also, it is not able to change or manipulate a file without completely rewriting it. In order to overcome the drawbacks of HDFS, HBase is a NoSQL database built on top of a Hadoop ecosystem as it allows fast random writes and reads in an optimized way. Also, it is difficult for traditional relational databases to render performance while handling a large variety of exponentially growing data. HBase is successful in providing scalability as well as partitioning for efficient storage and retrieval of the data.
3. Features of HBase
- Scalability: HBase supports scalability in both linear and modular form over cluster environment.
- Low Latency Operations: This database is built for use of low latency operations.
- Sharding: It gives support to automatic sharding of tables which is also configurable.
- Distributed storage: Like Hadoop Distributed File System, HBase supports distributed storage as well.
- Consistency: Consistent read and write operations are supported by HBase.
- Failover support: Automatic failover support is provided between Region Servers.
- API support: Clients can access HBase easily using Java APIs client.
- MapReduce support: It supports parallel processing of a large volume of data using MapReduce.
- Back up support: The tables present in HBase supports the back up of Hadoop MapReduce jobs.
- Real time processing: Real time query processing can be made easy in HBase using block cache and Bloom filters.
4. HBase Architecture
HBase architecture comprises of four main components.
The first component of HBase is HMaster which is the implementation of the Master server in HBase architecture. The HMaster behaves as a monitoring agent and keeps monitoring all Region Server instances which are part of the cluster. All the metadata changes are provided through an interface by the HMaster. In a distributed cluster environment, Master runs on NameNode and initiates several background threads as well.
4.1.1 Roles performed by HMaster
The HMaster performs the following functions.
- The performance of the nodes in a cluster is maintained by HMaster.
- Provides performance to the admin and distributes services along with different region servers.
- Regions to the region servers are assigned through HMaster.
- Provides feature of load balancing and failover handling by distributing load to different nodes present in the cluster.
- Responsible for changing operations related to metadata and schema.
4.2 HRegions Servers
Multiple region servers are present in an HBase architecture. At the time of multiple read and write requests handling from the client, the Region Server assigns those requests to specific regions. These regions are present where actual column family resides.
Also, it is possible to make a direct communication or contact between a client and HRegion servers without any permission from HMaster. The help from HMaster is required during manipulation of the metadata and schema change operations.
HRegionServer is nothing but the implementation of the Region Servers. The regions or data present in a distributed cluster is served and managed using HRegionServer. The region servers run on Data Nodes present in the Hadoop cluster.
4.2.1 Roles performed by HRegions Servers
The HMaster along with the different HRegion Servers performs the following functions.
- The regions are hosted and managed using HMaster along with HRegion Servers.
- The HRegions Server performs the role of splitting the regions automatically.
- It is also responsible for handling read and writes requests.
- It can be used to initiate a direct communication with the client.
The tables distribution and column families are present in HRegions which are the basic building elements of HBase cluster. A store is present for each column family. Memstore and Hfile comprise the two main components of the HRegions.
The operations involving read and write requests by the client are stored into HFile.
- A client first communicates with the Region Servers and then Regions in order to write data to the HFile.
- The different Regions performs the operation of contacting memstore for the purpose of storing associated with the column family.
- The data is stored and sorted into a Memstore, followed by flushing of the data into HFile. The Memstore stores the data in the Distributed file system based on Row Key mechanism. HFiles are flushed or written into HDFS while Memstore is placed in the main memory of the Region servers.
- When a Client wants to read a data or make requests related to the data, the data can be accessed directly from the Mem store.
- The client approaches HFiles to get the data. The data are fetched and retrieved by the Client.
All the modifications related to the in-memory is kept by the Memstore.
To avoid a single point of failure, a Zookeeper service is provided by a cluster of servers under Hadoop. A distributed consensus protocol is used by the Zookeeper to determine the node which is the leader at any given time. The leader performs the role of assigning a timestamp to each update for smooth functioning and to keep the order of services. The data contained in the update can be coordinated with elements of the data store, once a majority of nodes have acknowledged receipt of a time-stamped update followed by a declaration of a quorum by the leader. Quorum performs the job of ensuring that the service always returns consistent answers.
Zookeeper is a centralized monitoring server which performs the functioning of maintaining information related to the configuration and providing distributed synchronization. Distributed synchronization is to access the distributed applications running across the cluster while maintaining coordination services between the nodes. The servers’ client has to approach Zookeeper first in order to communicate with regions.
4.4.1 Services provided by Zookeeper
It is an open source project, and it provides so many important services.
- Zookeeper is responsible for maintaining the information related to the configuration.
- It performs the functioning of providing distributed synchronization.
- Communication at the client side is established with the region servers using Zookeeper as a medium.
- Also, provides the ephemeral nodes which represent different region servers.
- The available servers in the cluster are determined by Zookeeper in order to provide master server’s usability of ephemeral nodes.
- Server failure and network partitions are tracked using Zookeeper as well.
5. HBase Use Cases
5.1 Telecom Industry
The main requirement of Telecom Industry is to store billions of CDR (Call detailed recording) log records generated by telecom domain every day. Once the log records are stored, it becomes necessary to provide real-time access to CDR logs and billing information as well as per the requirement of the users as well as customer care executives. Need for providing cost effective solutions as compared to the traditional database system was also present.
To counter all the above-mentioned requirements HBase is used. The HBase stores billions of rows of call detailed records. If 20TB of data is added per month to the existing RDBMS database, performance will deteriorate. To manage a wide amount of data in this scenario, HBase is the best solution. HBase performs fast querying and displays records.
5.2 Banking Industry
A million of records have been generated on daily basis in a Banking industry. Also, analytical solutions are required as well can detect Fraud in money transactions.
To satisfy the above needs of storing, processing and updating a huge volume of data and performing analytics, an ideal solution is - HBase integrated with several Hadoop eco system components.
That apart, HBase can be used -
- Whenever there is a need to write heavy applications.
- Performing online log analytics and to generate compliance reports.
5.3 Medical Field
HBase is having applications in the medical domain as well. It is used in the medical field in order to store genome sequences which can be further refined by running MapReduce on it. Also, it can be used to store the history of diseases of people on the basis of an area, city or country, and many others.
HBase can be used in the field of sports in order to store match histories. This dataset can be used for a better analytical and prediction purposes as well.
History of users accessing the web can be stored using HBase for the purpose of better customer targeting and predictive modal developing for the analytics domain.
5.6 Oil and petroleum
The exploration data in the oil and petroleum industries can be stored for analysis using HBase. This can be used for predicting the places where oil can be found.
HBase finds application in the domain of e-commerce as well. This can be used for the purpose of recording and storing logs and essential details about the customer search history. Analytics can be performed later to target the advertisement for the better business.
On both production and development, adobe has about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes. HBase is been used by Adobe in several areas from social services to structured data and processing for internal use. Data has been consistently written to HBase and MapReduce jobs have been executed to process these data. The data is again stored back to HBase or external systems. Adobe is running the production cluster since Oct 2008.
5.9 Project Astro
Fast Spark SQL/Data Frame capabilities are provided to HBase data by Astro. It is providing a super-efficient access to multi-dimensional HBase rows through native Spark execution in HBase coprocessor plus systematic and accurate partition pruning and predicate pushdown from arbitrarily complex data filtering logic. The batch load is optimized to run on the Spark execution engine.
Pinterest is completely deployed on Amazon EC2. A follow model is used by Pinterest where users follow other users. The following feed for every user is required that gets updated every time a follower creates or updates a pin. This is a classic social media application problem. For Pinterest, this amounts to 100s of millions of pins per month that gets fanned out as billions of writes per day.
Two use cases are there at Groupon.
First, involves delivering deals to users via email which is a batch processing job and to provide a relevant user experience on the website. They have implemented personalization at user level in order to tune their deals to be more accurate and relevant.
MapReduce was used for email deal delivery and MySQL was used for their online application. The Relevance and Personalization system is running on the HBase. In order to satisfy the very different workload characteristics of the two systems (email, online), two HBase clusters are used. These clusters are replicated so they have the same content but are tuned and accessed differently.
A wide schema is used by Groupon one is a column family schema for the history of the user and their profile and the other one for email history.