Apache Pig

Introduction to Apache Pig

Apache Pig is an abstraction developed on top of Hadoop MapReduce in the Apache Hadoop ecosystem to process large and varied datasets. Pig is a platform originally developed at Yahoo and later donated as an open source Apache project. Apache Pig analyzes large amounts of data by expressing computations as data flows. In the MapReduce framework, programs must be translated into a series of Map and Reduce stages, a low-level programming model that data analysts often find difficult to get accustomed to. Apache Pig was built on top of the Apache Hadoop ecosystem to fill this gap.

Pig Latin and Runtime environment

Pig Latin is a scripting language that can be used to perform ETL operations (Extract, Transform and Load) as well as raw data analysis. Like the SQL scripting and query language, Pig is used to load data, apply various constraints and filters, and dump it in the required structure.

A Java runtime environment (JRE) is required to run Pig programs. Pig converts all operations into Map and Reduce tasks, which are then efficiently processed on Hadoop. This allows the programmer to focus on the overall operation rather than on each individual Mapper and Reducer function.

The main reason behind the development of Pig was to find an easier and better approach to programming MapReduce applications. Previously, Java was the only programming language commonly used to process and analyze datasets stored in HDFS.

Pig finds applications in data analysis wherever query operations are run on a dataset: for example, finding all rows where the variable income is greater than $50,000 (a sketch of this case follows below), combining two different datasets on the basis of a key value, or applying an algorithm to a dataset iteratively. Ideal for ETL operations and for managing data with irregular schemas, it also promotes expressing transformations as a sequential procedure.
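
As a minimal sketch of the first case, the filter below uses a hypothetical comma-separated file people.csv with name and income columns (these names are illustrative assumptions, not part of the dataset used later in this article):

-- load a hypothetical dataset and keep only rows with income above $50,000
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, income:int);
high_income = FILTER people BY income > 50000;
DUMP high_income;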

Features of Apache Pig

Let’s go through the following features of Apache Pig:

  • Pig provides a rich set of operators for operations such as filter, join, and sort.
  • Pig Latin is easy to write and is similar to the SQL query language; being good at one helps with the other, which makes programs easy to write.
  • Pig provides various optimization opportunities: the programmer can concentrate on the semantics of the language, while Apache Pig itself handles how the programs are executed.
  • Apache Pig provides the ability to extend the system, i.e. extensibility. Using the existing operators, users can develop their own functions to read, process, and write data.
  • User defined functions can also be developed in other programming languages such as Java, and these functions can be called and combined in Pig scripts, as in the sketch after this list.
  • A wide range of data can be handled and managed using Apache Pig, both structured and unstructured, and the results are stored in HDFS.
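
As a brief sketch of this extensibility (the jar name myudfs.jar and the class myudfs.ToUpper are hypothetical, used here for illustration only), a Java UDF can be registered and then called from a Pig script:

-- register a hypothetical jar containing a user defined function written in Java
REGISTER myudfs.jar;
lines = LOAD 'input.txt' AS (line:chararray);
upper_lines = FOREACH lines GENERATE myudfs.ToUpper(line);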

Apache Pig vs. MapReduce

Apache Pig is often the preferred approach over writing MapReduce modules directly, since the latter requires knowledge of a programming language such as Java.

  • Apache Pig differs from MapReduce in its approach and data flow: MapReduce is a low-level model, while Apache Pig is a high-level model for data processing.
  • Pig Latin delivers the benefits of MapReduce without requiring the Java programming language and its implementations, making it much easier for users to get started with Pig.
  • Work that would require several MapReduce jobs can often be expressed in a single Pig Latin query. This shortens the length of the code by a great extent and reduces the development period by a large degree.
  • Data operations such as sorting, joins, and filtering are a big task in raw MapReduce; the same functions can be performed easily using Apache Pig.
  • A join operation is simple to execute in Apache Pig (see the sketch after this list), whereas in MapReduce it requires creating and initiating multiple MapReduce tasks that must be executed sequentially to complete the desired functionality.
  • Pig also provides nested data types such as maps, tuples, and bags, which are not present in MapReduce.
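
Below is a minimal sketch of such a join; the file names and schemas are illustrative assumptions. A single JOIN statement replaces what would otherwise be a hand-written sequence of MapReduce jobs:

-- join two hypothetical comma-separated datasets on a common key
users = LOAD 'users.csv' USING PigStorage(',') AS (user_id:chararray, name:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (order_id:int, user_id:chararray, total:double);
user_orders = JOIN orders BY user_id, users BY user_id;
DUMP user_orders;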

Apache Pig Components

The following components form the fundamentals of the Apache Pig architecture and its operation. Let’s go through these components one by one.

1. Parser

The Parser handles the Pig Latin script, performing syntax checking, type checking, and related validations. The Parser's output is a directed acyclic graph (DAG) that represents the Pig Latin statements and their logical operators.

In the directed acyclic graph, the nodes represent the logical operators and the edges represent the flow of the data.

2. Optimizer

The logical plan, represented as the directed acyclic graph, is passed to the logical optimizer, which carries out optimizations such as projection and pushdown.

3. Compiler

The logical plan from the optimizer is given to the compiler, which converts it into a sequence of MapReduce jobs.

4. Execution engine

After the above steps, the resulting MapReduce jobs are handed to Hadoop in sequence. The execution of the MapReduce jobs then takes place on Hadoop to produce the required results.
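
All of these plans can be inspected from the Grunt shell using Pig's EXPLAIN statement, which prints the logical, physical, and MapReduce plans that Pig generates for an alias. A minimal sketch, assuming an alias named data has already been created with LOAD:

grunt> EXPLAIN data;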

Download and Install Apache Pig

Apache Pig can be downloaded using the following command on a Linux operating system.
$ wget http://mirror.symnds.com/software/Apache/pig/pig-0.12.0/pig-0.12.0.tar.gz
The downloaded package can be untarred using the following command.
$ tar xvzf pig-0.12.0.tar.gz
Add the Apache Pig bin directory to the PATH using the following command.
export PATH=$PATH:/home/hduser/pig/bin
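
Assuming the extracted pig-0.12.0 directory is moved to /home/hduser/pig (the location used in the export above), the installation can be verified by printing the version:

$ mv pig-0.12.0 /home/hduser/pig
$ pig -version
Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14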

Executing Apache Pig Script

Yelp Dataset

We are going to use the Yelp reviews dataset, which contains reviews of different businesses. The data has been extracted from the Yelp website and stored in a file named yelp.csv. The yelp.csv file contains the columns user_id, business_id, date, stars, review_length, votes_funny, votes_useful, votes_cool, votes_total, pos_words, neg_words and net_sentiment.

Below are the sample rows extracted from the YELP reviews dataset.

Xqd0DzHaiyRqVH3WRG7hzg	vcNAWiLM4dR7D2nwwJ7nCA	5/17/2007	5	94	0	2	1	3	4	1	3
H1kH6QZV7Le4zqTRNxoZow	vcNAWiLM4dR7D2nwwJ7nCA	3/22/2010	2	114	0	2	0	2	3	7	-4
zvJCcrpm2yOZrxKffwGQLA	vcNAWiLM4dR7D2nwwJ7nCA	2/14/2012	4	55	0	1	1	2	6	0	6
KBLW4wJA_fwoWmMhiHRVOA	vcNAWiLM4dR7D2nwwJ7nCA	3/2/2012	4	97	0	0	0	0	3	0	3
zvJCcrpm2yOZrxKffwGQLA	vcNAWiLM4dR7D2nwwJ7nCA	5/15/2012	4	53	0	2	1	3	1	2	-1
Qrs3EICADUKNFoUq2iHStA	vcNAWiLM4dR7D2nwwJ7nCA	4/19/2013	1	212	0	0	0	0	4	8	-4
jE5xVugujSaskAoh2DRx3Q	vcNAWiLM4dR7D2nwwJ7nCA	1/2/2014	5	62	0	0	0	0	6	0	6
QnhQ8G51XbUpVEyWY2Km-A	vcNAWiLM4dR7D2nwwJ7nCA	1/8/2014	5	67	0	0	0	0	4	1	3
tAB7GJpUuaKF4W-3P0d95A	vcNAWiLM4dR7D2nwwJ7nCA	8/1/2014	1	194	0	1	0	1	5	2	3
GP-h9colXgkT79BW7aDJeg	vcNAWiLM4dR7D2nwwJ7nCA	12/12/2014	5	52	0	0	0	0	8	0	8
uK8tzraOp4M5u3uYrqIBXg	UsFtqoBl7naz8AVUBZMjQQ	11/8/2013	5	75	0	0	0	0	12	0	12
I_47G-R2_egp7ME5u_ltew	UsFtqoBl7naz8AVUBZMjQQ	3/29/2014	3	137	0	0	0	0	5	0	5
PP_xoMSYlGr2pb67BbqBdA	UsFtqoBl7naz8AVUBZMjQQ	10/29/2014	2	61	0	0	0	0	10	0	10
JPPhyFE-UE453zA6K0TVgw	           	        11/28/2014	        4	63	1	1	1	3	7	2	5
fhNxoMwwTipzjO8A9LFe8Q	cE27W9VPgO88Qxe4ol6y_g	8/19/2012	3	86	0	1	0	1	8	3	5
-6rEfobYjMxpUWLNxszaxQ	cE27W9VPgO88Qxe4ol6y_g	4/18/2013	1	218	0	1	0	1	7	4	3
KZuaJtFindQM9x2ZoMBxcQ	cE27W9VPgO88Qxe4ol6y_g	7/14/2013	1	108	0	0	0	0	3	1	2
H9E5VejGEsRhwcbOMFknmQ	cE27W9VPgO88Qxe4ol6y_g	8/16/2013	4	186	0	0	0	0	7	0	7
ljwgUJowB69klaR8Au-H7g	cE27W9VPgO88Qxe4ol6y_g	7/11/2014	4	74	0	0	0	0	3	1	2
JbAeIYc89Sk8SWmrBCJs9g	HZdLhv6COCleJMo7nPl-RA	6/10/2013	5	121	3	7	7	17	6	2	4
l_szjd-ken3ma6oHDkTYXg	HZdLhv6COCleJMo7nPl-RA	12/23/2013	2	50	1	1	1	3	4	1	3
zo_soThZw8eVglPbCRNC9A	HZdLhv6COCleJMo7nPl-RA	9/4/2014	4	27	0	0	0	0	3	0	3
LWbYpcangjBMm4KPxZGOKg	mVHrayjG3uZ_RLHkLj-AMg	12/1/2012	5	184	0	5	0	5	14	1	13
m1FpV3EAeggaAdfPx0hBRQ	mVHrayjG3uZ_RLHkLj-AMg	3/15/2013	5	10	0	0	0	0	1	1	0
8fApIAMHn2MZJFUiCQto5Q	mVHrayjG3uZ_RLHkLj-AMg	3/30/2013	5	228	0	2	1	3	17	6	11
uK8tzraOp4M5u3uYrqIBXg	mVHrayjG3uZ_RLHkLj-AMg	10/20/2013	4	75	0	1	0	1	7	1	6
6wvlM5L4_EroGXbnb_92xQ	mVHrayjG3uZ_RLHkLj-AMg	11/7/2013	5	37	0	0	0	0	6	1	5
345nDw0oC-jOcglqxmzweQ	mVHrayjG3uZ_RLHkLj-AMg	3/22/2014	5	67	0	2	1	3	6	0	6
u9ULAsnYTdYH65Haj5LMSw	mVHrayjG3uZ_RLHkLj-AMg	9/29/2014	4	24	0	0	0	0	2	1	1

This file has 1,569,264 rows of data across 12 columns.

Apache Pig can be used in the following modes:

  1. Local Mode
  2. Cluster Mode

Running Apache Pig in Local Mode:

We can run Apache Pig in local mode using the following command.

$ pig -x local

After executing the above command on the terminal, the output below is observed.

2017-08-03 17:00:23,258 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2017-08-03 17:00:23,259 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig/myscripts/pig_1388027786256.log
2017-08-03 17:00:23,281 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2017-08-03 17:00:23,381 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>

Running Apache Pig in Cluster Mode:

We can run Apache Pig in cluster mode using the following command.

$ pig
2017-08-03 17:37:23,274 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2017-08-03 17:37:23,274 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig/myscripts/pig_1388027982272.log
2017-08-03 17:37:23,300 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2017-08-03 17:37:23,463 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:54310
2017-08-03 17:37:23,672 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: hdfs://localhost:9001
grunt>

The above command opens the Grunt shell, where Pig Latin statements can be executed quickly and interactively. It can be used to verify or test data flows without writing a complete script. Now let’s move ahead and work with our data using Pig Latin.

Pig Latin

The data present in the dataset can be queried to test various Pig Latin operations and procedures. The first step is to make the data accessible to Pig.

The following command can be used to load the Yelp reviews data into a relation using Pig Latin.

grunt> reviews = LOAD '/home/hadoop/pig/myscripts/yelp.csv' USING PigStorage(',') AS (user_id,business_id,date,stars,review_length,votes_funny,votes_useful,votes_cool,votes_total,pos_words,neg_words,net_sentiment);

In the previous command, reviews is what Pig calls a relation or alias; it is not a variable. The statement does not cause any MapReduce task to execute. PigStorage(',') is specified because the records in the file are separated by commas.

Names for the fields present in the dataset are given using the AS keyword, which assigns a name to every field (column) in the dataset.
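
Types can also be attached to each field in the AS clause, and the resulting schema can then be inspected with the DESCRIBE command. Below is a sketch of the same LOAD with an explicit schema; the types are our assumptions based on the sample rows shown earlier.

grunt> reviews = LOAD '/home/hadoop/pig/myscripts/yelp.csv' USING PigStorage(',') AS (user_id:chararray, business_id:chararray, date:chararray, stars:int, review_length:int, votes_funny:int, votes_useful:int, votes_cool:int, votes_total:int, pos_words:int, neg_words:int, net_sentiment:int);
grunt> DESCRIBE reviews;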

Testing the loaded data

To test whether the data has been successfully loaded using the previous command, a DUMP command can be used.

grunt> DUMP reviews;

After executing the previous command, the terminal shows a large amount of text on the screen, which forms the output of the DUMP command. Only partial output is shown below.

2017-03-08 17:40:08,550 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2017-03-08 17:40:08,633 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2017-03-08 17:40:08,748 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-03-08 17:40:08,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-03-08 17:40:08,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-03-08 17:40:08,853 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
................
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.1.2 0.12.0 hadoop 2013-12-25 23:03:04 2013-12-25 23:03:05 UNKNOWN

Success!
Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0006 reviews MAP_ONLY file:/ptmp/ptemp-5323122347/tmp2718191010,
Input(s):
Successfully read records from: "/home/hadoop/pig/myscripts/yelp.csv"
Output(s):
Successfully stored records in: "file:/ptmp/ptemp-5323122347/tmp2718191010,"
Job DAG:
job_local_0006
................
(Xqd0DzHaiyRqVH3WRG7hzg,vcNAWiLM4dR7D2nwwJ7nCA,17-05-2007,5,94,0,2)
(H1kH6QZV7Le4zqTRNxoZow,vcNAWiLM4dR7D2nwwJ7nCA,22-03-2010,2,114,0,2)
(zvJCcrpm2yOZrxKffwGQLA,vcNAWiLM4dR7D2nwwJ7nCA,14-02-2012,4,55,0,1)
(KBLW4wJA_fwoWmMhiHRVOA,vcNAWiLM4dR7D2nwwJ7nCA,02-03-2012,4,97,0,0)
(zvJCcrpm2yOZrxKffwGQLA,vcNAWiLM4dR7D2nwwJ7nCA,15-05-2012,4,53,0,2)
(Qrs3EICADUKNFoUq2iHStA,vcNAWiLM4dR7D2nwwJ7nCA,19-04-2013,1,212,0,0)
(jE5xVugujSaskAoh2DRx3Q,vcNAWiLM4dR7D2nwwJ7nCA,02-01-2014,5,62,0,0)
(QnhQ8G51XbUpVEyWY2Km-A,vcNAWiLM4dR7D2nwwJ7nCA,08-01-2014,5,67,0,0)
(tAB7GJpUuaKF4W-3P0d95A,vcNAWiLM4dR7D2nwwJ7nCA,01-08-2014,1,194,0)

Once a DUMP statement is executed, a MapReduce job starts. From the previous output, it can be seen that the data has been successfully loaded into the reviews relation.

Performing Queries

After loading the data, the desired queries can be performed on the dataset. Let's list the reviews whose net_sentiment value is less than 10.

grunt> netsentiment_less_than_ten = FILTER reviews BY (int)net_sentiment < 10;
grunt> DUMP netsentiment_less_than_ten;

The above statements filter the alias reviews and store the results in a new alias netsentiment_less_than_ten, which will hold only the review records whose net_sentiment is less than 10.

The DUMP command only displays information on the standard output. The data can be stored in a file using the following command.

grunt> STORE netsentiment_less_than_ten INTO '/user/hadoop/netsentiment_less_than_ten';
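
As a final illustrative sketch (this aggregation is our own example, not part of the walkthrough above), the loaded reviews can also be grouped and counted per star rating:

grunt> by_stars = GROUP reviews BY stars;
grunt> star_counts = FOREACH by_stars GENERATE group AS stars, COUNT(reviews) AS num_reviews;
grunt> DUMP star_counts;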