Introduction to PIG Latin for Beginners
Organisations today gather huge amounts of data, and popular websites such as Facebook and Instagram use Big Data technology to store, process and analyse that data for later use. The Hadoop framework is a widely used solution for handling big data. It consists mainly of two components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores the data, while MapReduce processes it. MapReduce programs are typically written in Java.
Apache PIG is a tool for processing and analysing massive datasets (big data) as data flows. It offers a high-level scripting language, built over MapReduce, for expressing data analysis programs. Apache PIG provides an abstraction that reduces the complexity of writing MapReduce programs. The scripting language used for PIG is Pig Latin. Apache Pig was developed at Yahoo in 2006 to create and run MapReduce jobs on large datasets.
Architecture of Apache PIG:
- PIG Latin – A high-level data processing language that enables developers to write code for processing and analysing data.
- Runtime Environment – The execution mechanism (platform) that runs PIG Latin programs.
The PIG architecture comprises several elements: a parser, an optimiser, a compiler and, finally, an execution engine.
Apache PIG execution modes:
- Local mode: In this mode, the files are accessed from the local host and local file system.
- MapReduce Mode: In this mode, the files are accessed from the Hadoop file system (HDFS).
Apache PIG execution mechanism:
The programs written in Apache PIG can be executed in three ways:
- Interactive Mode (Grunt Shell)
- Batch Mode (Script)
- Embedded Mode (UDF)
Invoking Grunt shell:
The Grunt shell can be invoked in the desired mode (local/MapReduce) in the following ways:
- Local Mode: The command to invoke the Grunt shell in local mode is
$ ./pig -x local
- MapReduce Mode: The command to invoke the Grunt shell in MapReduce mode is
$ ./pig -x mapreduce
Either of the commands will give the Grunt shell prompt as shown:
grunt>
You can exit this shell using ‘ctrl + d’. Once the Grunt shell is invoked, PIG Latin statements can be executed directly.
PIG Latin Basics
PIG Latin is a high-level language used to analyse data in Hadoop with Apache PIG. The data model of PIG is fully nested, and the relation is its outermost structure. A relation is a bag, where:
- A bag is a collection of tuples
- A tuple is an ordered set of fields
- A field is a piece of data and is atomic
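The nesting described above can be pictured with ordinary Python containers. This is only an analogy for reading the rest of the article, not how Pig stores data internally; the type names are hypothetical, and a Python list is ordered whereas a Pig bag is not.

```python
from typing import List, Tuple

# Hypothetical aliases that mirror the vocabulary of the data model.
Field = object                  # an atomic piece of data, e.g. 1 or "2"
PigTuple = Tuple[Field, ...]    # an ordered set of fields
Bag = List[PigTuple]            # a collection of tuples (duplicates allowed)

# A relation is the outermost bag.
relation: Bag = [(1, "2", "3"), (4, "2", "1")]

first_tuple = relation[0]       # one tuple of the relation
first_field = first_tuple[0]    # one field of that tuple

# "Fully nested" means a field may itself hold a bag of tuples.
nested_tuple = ("2", [(1, "2", "3"), (4, "2", "1")])
inner_bag = nested_tuple[1]
```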
Statements are the basic constructs of PIG Latin. A statement can include expressions and schemas, and works with relations. Each statement in PIG Latin ends with a semicolon (;).
The simple data types of PIG include int, long, float, double, chararray, bytearray, boolean, datetime, biginteger and bigdecimal, whereas bag, tuple and map are the complex data types. A value of any of these types can be NULL, which means an unknown or non-existent value. Apache PIG treats NULL values in a similar way to SQL, and its arithmetic and comparison operators also behave much like their SQL counterparts. The relational operators in PIG Latin include LOAD, STORE, FILTER, DISTINCT, FOREACH GENERATE, JOIN, GROUP etc.
- LOAD is the operator used to read data from the file system (local or HDFS). For example, let's assume that data.csv is the file from which the data needs to be read. The data file has the following contents:
1,2,3
4,2,1
8,3,4
4,3,3
grunt> A = LOAD 'data.csv' USING PigStorage(',') AS (f1:int, f2:chararray, f3:chararray);
Here, A is a relation, that is, a bag of tuples.
The above statement creates a bag of tuples named ‘A’ that holds the data from the csv file shown above.
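To make the effect of the schema concrete, here is a rough Python sketch of what loading a comma-delimited file into typed tuples amounts to. The `load` helper and the in-memory `DATA` string are assumptions made for illustration; they are not part of Pig.

```python
import csv
import io

# Stand-in for the contents of data.csv from the example above.
DATA = "1,2,3\n4,2,1\n8,3,4\n4,3,3\n"

def load(text, delimiter=","):
    """Hypothetical helper: split each line on the delimiter and apply the
    schema (f1:int, f2:chararray, f3:chararray) to the three fields."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [(int(f1), f2, f3) for f1, f2, f3 in rows]

A = load(DATA)
print(A)  # [(1, '2', '3'), (4, '2', '1'), (8, '3', '4'), (4, '3', '3')]
```

Only f1 is converted to an integer; f2 and f3 stay as strings because the schema declares them as chararray.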
- DUMP is the diagnostic operator to display the results on the screen
grunt> DUMP A;
The above statement displays the contents of ‘A’ on the screen:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
- STORE operator is used to store the result into a file
grunt> STORE A INTO 'path' USING PigStorage(',');
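Here, the relation ‘A’ is stored at the output location named ‘path’. Pig treats this location as a directory and writes the data into one or more part files inside it.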
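As a rough picture of what storing a relation with a delimiter involves, the Python sketch below writes one delimited line per tuple into a file inside an output directory. The `store` helper and the file name `part-00000` are illustrative assumptions, not Pig's actual implementation or naming.

```python
import os
import tempfile

A = [(1, "2", "3"), (4, "2", "1")]

def store(relation, out_dir, delimiter=","):
    """Hypothetical helper: write each tuple as one delimited line."""
    # Refuse to overwrite an existing output location (raises FileExistsError).
    os.makedirs(out_dir)
    part = os.path.join(out_dir, "part-00000")  # illustrative file name
    with open(part, "w") as fh:
        for tup in relation:
            fh.write(delimiter.join(str(field) for field in tup) + "\n")
    return part

# Use a temporary parent directory so the sketch leaves the working dir alone.
out_dir = os.path.join(tempfile.mkdtemp(), "path")
part_file = store(A, out_dir)

with open(part_file) as fh:
    stored_text = fh.read()
```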
- GROUP operator is used to group the data in one or more relations based on a key
grunt> B = GROUP A BY f2;
Running DUMP B; would then give the following output:
(2,{(1,2,3),(4,2,1)})
(3,{(8,3,4),(4,3,3)})
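The shape of the grouped output, one tuple per key holding a bag of the matching tuples, can be sketched in Python as follows. The `group_by` helper is hypothetical and only models the result, not how Pig executes the grouping.

```python
from collections import defaultdict

A = [(1, "2", "3"), (4, "2", "1"), (8, "3", "4"), (4, "3", "3")]

def group_by(relation, key_index):
    """Hypothetical helper: collect tuples that share the same key field."""
    groups = defaultdict(list)
    for tup in relation:
        groups[tup[key_index]].append(tup)
    # Each result is (key, bag of the original tuples with that key).
    return list(groups.items())

B = group_by(A, 1)  # index 1 corresponds to f2
for key, bag in B:
    print(key, bag)
```

Note that the grouped tuples are kept whole inside each bag, which is why the key value appears again in every inner tuple.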
- JOIN operator is used to combine the data from two or more relations. One or more fields from each relation are used as the key to perform the join operation. Joins can be of the following types:
- Self-join
- Inner-join
- Outer-join: left join, right join or full join
These joins behave in the same way as they would in a relational database.
grunt> A = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:chararray, f3:chararray);
grunt> B = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:chararray, f3:chararray);
grunt> C = JOIN A BY f1, B BY f1;
grunt> DUMP C;
The above script is an example of an inner join: it joins ‘A’ and ‘B’ on the field f1 of each relation, and the DUMP command displays the result C. Assuming ‘data’ holds the same four rows as data.csv, the key 4 occurs twice on each side, so every pairing of the matching tuples appears and C has six tuples (their order may vary):
(1,2,3,1,2,3)
(4,2,1,4,2,1)
(4,2,1,4,3,3)
(4,3,3,4,2,1)
(4,3,3,4,3,3)
(8,3,4,8,3,4)
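A relational inner join pairs each tuple on the left with every tuple on the right that shares its key, so a key that occurs twice on both sides contributes four rows. The Python sketch below models this for the four-row data set; `inner_join` is a hypothetical helper, not Pig's join implementation.

```python
from collections import defaultdict

A = [(1, "2", "3"), (4, "2", "1"), (8, "3", "4"), (4, "3", "3")]
B = list(A)  # the second relation is loaded from the same data

def inner_join(left, left_key, right, right_key):
    """Hypothetical helper: hash the right side, then probe with the left."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    # Concatenate the fields of every matching (left, right) pair.
    return [l + r for l in left for r in index.get(l[left_key], [])]

C = inner_join(A, 0, B, 0)  # index 0 corresponds to f1
print(len(C))  # 6
```

Keys 1 and 8 each match once, while key 4 matches in 2 x 2 = 4 combinations, giving six joined tuples in total.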
- FILTER operator is used to select the required tuples from a relation based on some condition.
grunt> A = LOAD 'data' USING PigStorage(',') as (f1:int, f2:chararray, f3:chararray);
grunt> B = FILTER A BY f1 == 4;
grunt> DUMP B;
The above script keeps only the tuples whose f1 field equals 4 and displays them with the DUMP command.
(4,2,1)
(4,3,3)
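In Python terms, the filter is a predicate applied to every tuple, with only the tuples that satisfy it retained. A minimal sketch, assuming the same four rows as before:

```python
A = [(1, "2", "3"), (4, "2", "1"), (8, "3", "4"), (4, "3", "3")]

# Keep only the tuples whose first field (f1) equals 4.
B = [tup for tup in A if tup[0] == 4]
print(B)  # [(4, '2', '1'), (4, '3', '3')]
```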
- DISTINCT operator is used to remove the duplicate (redundant) tuples from a relation.
We have a bag ‘A’ having the following tuples:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(4,2,1)
The DISTINCT operator can be applied as in the below script.
grunt> B = DISTINCT A;
grunt> DUMP B;
The result of the DISTINCT operator will be:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
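The same de-duplication can be sketched in Python. Whole tuples are compared, so two tuples count as duplicates only when every field matches. This version happens to keep the first occurrence of each tuple in its original position, which is a property of the sketch rather than something to rely on in Pig.

```python
A = [(1, "2", "3"), (4, "2", "1"), (8, "3", "4"), (4, "3", "3"), (4, "2", "1")]

# dict.fromkeys keeps one entry per distinct tuple, in first-seen order.
B = list(dict.fromkeys(A))
print(len(A), len(B))  # 5 4
```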
- FOREACH operator is used to apply a specified transformation to each tuple of a relation, for example to select particular columns.
We have a bag ‘A’ having the following tuples:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(4,2,1)
The FOREACH operator can be applied as:
grunt> B = FOREACH A GENERATE f1, f3;
grunt> DUMP B;
The result of the FOREACH operator will be:
(1,3)
(4,1)
(8,4)
(4,3)
(4,1)
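Selecting columns for every tuple is a simple projection. The Python sketch below picks the first and third fields of each tuple; the indices 0 and 2 stand in for f1 and f3 and are an assumption about the field order.

```python
A = [(1, "2", "3"), (4, "2", "1"), (8, "3", "4"), (4, "3", "3"), (4, "2", "1")]

# Build a new two-field tuple from each input tuple.
B = [(tup[0], tup[2]) for tup in A]
print(B)  # [(1, '3'), (4, '1'), (8, '4'), (4, '3'), (4, '1')]
```

The projection produces one output tuple per input tuple, so duplicates in the input remain duplicates in the output.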
- ORDER BY operator is used to sort the tuples in a relation based on one or more fields
We have a bag ‘A’ having the following tuples:
(8,3,4)
(1,2,3)
(4,3,3)
The ORDER BY command can be applied on f1 as:
grunt> B = ORDER A BY f1;
grunt> DUMP B;
The result of the ORDER BY command will be:
(1,2,3)
(4,3,3)
(8,3,4)
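Sorting on f1 can be mirrored in Python with a key function that extracts the first field. This is a sketch of the resulting order only; it says nothing about how Pig performs the sort across a cluster.

```python
A = [(8, "3", "4"), (1, "2", "3"), (4, "3", "3")]

# Sort ascending by the first field (f1).
B = sorted(A, key=lambda tup: tup[0])
print(B)  # [(1, '2', '3'), (4, '3', '3'), (8, '3', '4')]
```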
- LIMIT operator is used to extract a limited number of tuples from a relation.
We have a bag ‘A’ having the following tuples:
(8,3,4)
(1,2,3)
(4,3,3)
The LIMIT command can be applied as:
grunt> B = LIMIT A 1;
grunt> DUMP B;
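The result of the above command will be a single tuple from ‘A’. Without a preceding ORDER BY, Pig does not guarantee which tuple is returned; one possible result is: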
(8,3,4)
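The idea of taking at most n tuples can be sketched in Python with `itertools.islice`. The second half shows the common pattern of sorting first so that the limited result is predictable; both variable names are hypothetical.

```python
from itertools import islice

A = [(8, "3", "4"), (1, "2", "3"), (4, "3", "3")]

# Take at most one tuple; here it is simply the first one in the list.
first_one = list(islice(A, 1))

# Sorting by f1 before limiting makes the choice deterministic.
smallest_one = list(islice(sorted(A, key=lambda tup: tup[0]), 1))

print(first_one, smallest_one)
```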
This concludes a basic overview of the Pig Latin data processing language for big data platforms.