Wednesday 6 April 2016

How does COUNT work in Apache Pig?

I am learning Pig and ran into questions such as how grouping works and how to get a count.
I thought I would put together an example that could be useful to others who need to group data and count the results.

I am not going to cover what Pig is or what its advantages are. I will jump straight into an example and elaborate on it.

Let's start with it.

Here is sample data for the example.

1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333


The data describes movies: an ID, the movie name, release year, rating and number of likes.

Our assignment is to get the count of movies for each year.
Below are the steps to analyse the data.

1. Loading data: We will use the LOAD operator to read the data into Pig. LOAD only defines where the data comes from and does not read anything yet. Pig starts execution only when STORE or DUMP is requested.

movies = load '/home/abhijit/Downloads/movies.txt' Using PigStorage(',');

2. To verify the output, or to test and debug your changes, you can use DUMP.
DUMP displays the results on the terminal. (Never use it in production. Use STORE instead.)

DUMP movies;

The result on the terminal is :

(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
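
As a rough sketch of the STORE alternative mentioned in step 2, the relation can be written to disk instead of the terminal. The output path below is made up for illustration. Pig treats it as a directory, and the job fails if that directory already exists.

```pig
-- Output directory is an assumed path, not part of the original example.
STORE movies INTO '/home/abhijit/Downloads/movies_out' USING PigStorage(',');
```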


3. Group the data by release year. We will use the GROUP operator for this. It groups together the tuples that have the same key, which is called the group key. In our case the release year is the group key.

groupByYear = group movies by $2;

$2 is the position of the release year field; positions start at $0. (We are using the position here because we have not defined a schema. We will talk about schemas later.)
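
For comparison, here is a minimal sketch of the same grouping with a schema, so the field can be referenced by name instead of position. The field names and types (id, name, year, rating, likes) and the relation names are my own assumptions for illustration, not something the data file defines.

```pig
-- Field names and types are assumed for illustration.
moviesWithSchema = LOAD '/home/abhijit/Downloads/movies.txt' USING PigStorage(',')
    AS (id:int, name:chararray, year:int, rating:double, likes:int);

-- Group by the named field instead of the position $2.
groupByYearNamed = GROUP moviesWithSchema BY year;
```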

Again call DUMP to verify the result.

DUMP groupByYear ;

The result on the terminal is :

(1921,{(3,Orphans of the Storm,1921,3.2,9062)})
(1929,{(9,Nosferatu: Original Version,1929,3.5,5651)})
(1932,{(2,The Mummy,1932,3.5,4388)})
(1963,{(5,Night Tide,1963,2.8,5126)})
(1985,{(6,One Magic Christmas,1985,3.8,5333)})
(1991,{(4,The Object of Beauty,1991,2.8,6150)})
(1993,{(1,The Nightmare Before Christmas,1993,3.9,4568)})
(1994,{(7,Muriel's Wedding,1994,3.5,6323),(8,Mother's Boys,1994,3.4,5733)})
(1995,{(10,Nick of Time,1995,3.4,5333)})


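4. Now from the output it is clear why we group before we count: each year is paired with the collection of its movies, so counting that collection gives the number of movies for the year.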

(1921,{(3,Orphans of the Storm,1921,3.2,9062)})

1921 is the key, and {(3,Orphans of the Storm,1921,3.2,9062)} is its value. Here the value holds a single tuple, (3,Orphans of the Storm,1921,3.2,9062).

A tuple is an ordered set of fields:
 (3,Orphans of the Storm,1921,3.2,9062)

A bag is a collection of tuples. It is written in curly braces:
{(3,Orphans of the Storm,1921,3.2,9062)}
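
If you want to check this key-plus-bag shape without dumping the data, DESCRIBE prints the schema of a relation. A small sketch using the relation from step 3:

```pig
-- Print the schema of the grouped relation (a group key plus a bag of tuples).
DESCRIBE groupByYear;
```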

Now we want to count the tuples in the bag for each key.

movieCountOfYear = foreach groupByYear generate $0, COUNT($1);

$0 is the group key and $1 is the bag of tuples for that key.
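
The same statement can also be written with names instead of positions. After a GROUP, Pig names the key field group and names the bag after the relation that was grouped (movies here). The aliases year and movie_count and the new relation names below are my own choice. One thing to keep in mind is that COUNT skips a tuple whose first field is null, while COUNT_STAR counts every tuple.

```pig
-- Same count as above, using the field names Pig assigns after GROUP.
-- The aliases year and movie_count are assumed names for illustration.
movieCountByName = FOREACH groupByYear GENERATE group AS year, COUNT(movies) AS movie_count;

-- COUNT_STAR also counts tuples whose first field is null.
movieCountStar = FOREACH groupByYear GENERATE group AS year, COUNT_STAR(movies) AS movie_count;
```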

To verify the result call DUMP.

DUMP movieCountOfYear;

(1921,1)
(1929,1)
(1932,1)
(1963,1)
(1985,1)
(1991,1)
(1993,1)
(1994,2)
(1995,1)

The result shows each year alongside its movie count.
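
COUNT always works on a bag, so even a count over the whole data set needs a grouping step first. A small sketch, assuming you want the total number of movies (the relation names allMovies and totalMovies are made up for illustration):

```pig
-- GROUP ... ALL puts every tuple of movies into a single bag.
allMovies = GROUP movies ALL;

-- Count the tuples in that one bag.
totalMovies = FOREACH allMovies GENERATE COUNT(movies);

DUMP totalMovies;
```

With the ten sample rows above, the result should be a single tuple holding 10.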


