Big Data Unit 3 Notes | Hive and Pig

🎯 Unit 3 Overview

Big Data Unit 3 mainly covers two important Hadoop ecosystem tools: Hive and Pig. Hive is used for SQL-like querying on Hadoop, while Pig is used for data flow and ETL processing.

Exam Tip: Hive Architecture, HiveQL, Pig Architecture, Pig Latin and Hive vs Pig are very important for RGPV exams.

🏢 Introduction to Hive

Hive is a data warehouse tool built on top of Hadoop. It allows users to write SQL-like queries called HiveQL to analyze large datasets stored in HDFS.

Why is Hive Used?

To query Big Data using an SQL-like language.
To process structured data stored in Hadoop.
To avoid writing complex MapReduce programs manually.
To perform data summarization and analysis.

Simple Meaning: Hive is used to run SQL-like queries on data stored in Hadoop.

🏗️ Hive Architecture

Hive architecture contains several components that convert HiveQL queries into execution jobs (MapReduce, Tez or Spark) and run them on Hadoop.

Component	Work
User Interface	Allows user to submit HiveQL queries.
Driver	Receives query and manages execution lifecycle.
Compiler	Checks query syntax and converts query into execution plan.
Metastore	Stores metadata such as table name, columns, data types and location.
Execution Engine	Executes the query plan using MapReduce, Tez or Spark.
HDFS	Stores actual data files.

Important: The Hive Metastore does not store the actual data; it stores only metadata.
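
The division of work in the table above can be sketched in plain Python. This is only a toy model, not Hive's real code: the `METASTORE` and `HDFS` dictionaries, the file path, the sample rows and the functions `compile_query`, `execute` and `driver` are all hypothetical names made up for illustration. It shows that the metastore holds only the schema and file location, while the rows sit in a separate file store.

```python
# Toy model of the Hive components (all names are illustrative, not Hive APIs).

# Metastore: metadata only (columns, types, file location), no rows.
METASTORE = {
    "student": {
        "columns": [("id", int), ("name", str), ("marks", int)],
        "location": "/warehouse/student/data.txt",  # hypothetical path
    }
}

# Stand-in for HDFS: the actual data files (invented sample rows).
HDFS = {
    "/warehouse/student/data.txt": "1,Amit,72\n2,Neha,55\n3,Ravi,81\n",
}


def compile_query(table, min_marks):
    """Compiler: look up metadata and build a simple execution plan."""
    meta = METASTORE[table]  # raises KeyError for an unknown table
    return {
        "path": meta["location"],
        "columns": meta["columns"],
        "min_marks": min_marks,
    }


def execute(plan):
    """Execution engine: read the file and apply the plan."""
    rows = []
    for line in HDFS[plan["path"]].splitlines():
        values = line.split(",")
        row = {name: cast(v) for (name, cast), v in zip(plan["columns"], values)}
        if row["marks"] > plan["min_marks"]:
            rows.append(row)
    return rows


def driver(table, min_marks):
    """Driver: manages the lifecycle (compile first, then execute)."""
    return execute(compile_query(table, min_marks))


result = driver("student", 60)
print(result)
```

Here `driver("student", 60)` plays the role of a query that selects students with marks above 60.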

📂 Hive Data Types

1. Primitive Data Types

INT
BIGINT
FLOAT
DOUBLE
BOOLEAN
STRING
DATE
TIMESTAMP

2. Complex Data Types

ARRAY
MAP
STRUCT
UNIONTYPE

Data Type	Example
INT	101
STRING	'RGPV'
ARRAY	['Java','Python','SQL']
MAP	{'name':'Amit','city':'Bhopal'}
STRUCT	Fields accessed as student.name, student.rollno
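
As a rough analogy only, the complex types can be compared with Python containers. This is not Hive code; the `Student` class and the sample values (including roll number 101) are assumptions made for illustration.

```python
from typing import NamedTuple

# ARRAY: ordered collection of elements of one type -> Python list
skills = ["Java", "Python", "SQL"]

# MAP: key-value pairs -> Python dict
info = {"name": "Amit", "city": "Bhopal"}


# STRUCT: fixed set of named fields -> NamedTuple (hypothetical example)
class Student(NamedTuple):
    name: str
    rollno: int


student = Student(name="Amit", rollno=101)

print(skills[0])      # similar to skills[0] in HiveQL (arrays start at index 0)
print(info["city"])   # similar to info['city'] in HiveQL
print(student.name)   # similar to student.name in HiveQL
```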

💻 Hive Query Language

Hive Query Language or HiveQL is similar to SQL. It is used to create tables, load data, query data and perform analysis on Hadoop data.

Common HiveQL Commands

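-- Create a database and switch to it
CREATE DATABASE college;
USE college;
-- Table for comma-separated records (assumes student.txt uses commas)
CREATE TABLE student(id INT, name STRING, marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Move the HDFS file /student.txt into the table
LOAD DATA INPATH '/student.txt' INTO TABLE student;
SELECT * FROM student;
SELECT name, marks FROM student WHERE marks > 60;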
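
To try the same query pattern without a Hadoop cluster, the sketch below uses Python's built-in `sqlite3` module as a stand-in. SQLite is not Hive: it has no `LOAD DATA` statement and uses `TEXT` instead of `STRING`, so the rows are inserted by hand. The contents of `STUDENT_TXT` are invented sample data.

```python
import sqlite3

# Sample rows standing in for the contents of /student.txt (invented data).
STUDENT_TXT = "1,Amit,72\n2,Neha,55\n3,Ravi,81\n"

conn = sqlite3.connect(":memory:")  # in-memory SQLite database, not Hive
conn.execute("CREATE TABLE student(id INT, name TEXT, marks INT)")

# SQLite has no LOAD DATA statement, so parse the text and insert the rows.
rows = [line.split(",") for line in STUDENT_TXT.splitlines()]
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(int(i), n, int(m)) for i, n, m in rows],
)

# Same SELECT statements as in the HiveQL commands.
all_rows = conn.execute("SELECT * FROM student").fetchall()
passed = conn.execute(
    "SELECT name, marks FROM student WHERE marks > 60"
).fetchall()

print(all_rows)
print(passed)
```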

Uses of HiveQL

Create databases and tables.
Load data into tables.
Filter and group data.
Perform aggregation using SUM, COUNT, AVG etc.
Analyze large structured datasets.
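
Grouping and aggregation can be sketched in the same way. The example again uses SQLite through Python as a stand-in for Hive; the `dept` column and all rows are assumptions added only for this illustration. The `GROUP BY` query itself uses standard SQL that HiveQL also supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Hive
conn.execute("CREATE TABLE student(id INT, name TEXT, dept TEXT, marks INT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [
        (1, "Amit", "CSE", 72),   # invented sample rows
        (2, "Neha", "CSE", 58),
        (3, "Ravi", "IT", 80),
        (4, "Sara", "IT", 90),
    ],
)

# One output row per department with COUNT, SUM and AVG of marks.
summary = conn.execute(
    "SELECT dept, COUNT(*), SUM(marks), AVG(marks) "
    "FROM student GROUP BY dept ORDER BY dept"
).fetchall()

for dept, n, total, avg in summary:
    print(dept, n, total, avg)
```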

🐷 Introduction to Pig

Pig is a high-level platform used for analyzing large datasets in Hadoop. Pig uses a scripting language called Pig Latin.

Features of Pig

Easy to write compared to Java MapReduce.
Supports ETL operations.
Works with structured and semi-structured data.
Allows data transformation and analysis.
Reduces development time.

Simple Meaning: Pig is used to clean, transform and process large data.

🏗️ Pig Architecture

Pig architecture explains how Pig Latin scripts are converted into MapReduce jobs.

Component	Work
Pig Latin Script	User writes data processing commands.
Parser	Checks syntax and creates logical plan.
Optimizer	Improves the logical plan for better performance.
Compiler	Converts logical plan into MapReduce jobs.
Execution Engine	Runs MapReduce jobs on Hadoop.
HDFS	Stores input and output data.

🔄 Pig on Hadoop

Pig runs on Hadoop and uses HDFS for storage and MapReduce for processing. The user writes Pig Latin scripts, and Pig automatically converts them into MapReduce jobs.

User writes Pig Latin script.
Pig parses the script.
Logical plan is generated.
Plan is optimized.
MapReduce jobs are created.
Jobs are executed on Hadoop cluster.
Output is stored in HDFS.
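
The steps above can be sketched as a toy pipeline in Python. This is not how Pig is implemented: the functions `parse`, `optimize` and `compile_plan` are hypothetical, the parser understands only the simple statements in `SCRIPT`, and the single optimization rule (dropping relations that no STORE or DUMP needs) is a simplified stand-in for Pig's real optimizer.

```python
import re

SCRIPT = """
student = LOAD 'student.txt';
passed = FILTER student BY marks > 60;
grouped = GROUP passed BY name;
STORE passed INTO 'out';
"""


def parse(script):
    """Parser: turn each statement into one step of a logical plan."""
    plan = []
    for stmt in filter(None, (s.strip() for s in script.split(";"))):
        m = re.match(r"(\w+)\s*=\s*(\w+)\s+(\S+)", stmt)
        if m:  # form: alias = OPERATOR input ...
            alias, op, src = m.groups()
            plan.append(
                {"alias": alias, "op": op, "input": None if op == "LOAD" else src}
            )
        else:  # form: STORE alias ... or DUMP alias
            op, src = stmt.split()[:2]
            plan.append({"alias": None, "op": op, "input": src})
    return plan


def optimize(plan):
    """Optimizer (toy rule): drop relations that no STORE/DUMP depends on."""
    needed = set()
    kept = []
    for step in reversed(plan):
        if step["alias"] is None or step["alias"] in needed:
            kept.append(step)
            if step["input"]:
                needed.add(step["input"])
    return list(reversed(kept))


def compile_plan(plan):
    """Compiler: name one stage per step (real Pig creates MapReduce jobs)."""
    return [f"{step['op']}({step['input'] or step['alias']})" for step in plan]


logical_plan = parse(SCRIPT)
stages = compile_plan(optimize(logical_plan))
print(stages)
```

In this sketch the `grouped` relation is removed because nothing is stored or dumped from it.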

🧾 Pig Latin

Pig Latin is a data flow language used in Apache Pig. It is simple and suitable for data transformation.

Example Pig Latin Script

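-- Load comma-separated records with a schema
student = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
-- Keep students with marks above 60
passed = FILTER student BY marks > 60;
-- Group the passed students by name
grouped = GROUP passed BY name;
-- Display the grouped result (DUMP triggers execution)
DUMP grouped;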
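
For readers who know Python better than Pig Latin, the same data flow can be written step by step as below. It is a plain Python analogy, not Pig code, and the contents of `STUDENT_TXT` are invented sample data standing in for student.txt.

```python
import csv
import io
from collections import defaultdict

# Sample contents standing in for student.txt (invented data).
STUDENT_TXT = "1,Amit,72\n2,Neha,55\n3,Ravi,81\n4,Amit,90\n"

# LOAD ... USING PigStorage(',') AS (id:int, name:chararray, marks:int)
student = [
    (int(i), name, int(marks))
    for i, name, marks in csv.reader(io.StringIO(STUDENT_TXT))
]

# FILTER student BY marks > 60
passed = [row for row in student if row[2] > 60]

# GROUP passed BY name: one (key, bag of tuples) pair per name
grouped = defaultdict(list)
for row in passed:
    grouped[row[1]].append(row)

# DUMP: print each group on the screen
for name, bag in grouped.items():
    print(name, bag)
```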

Common Pig Latin Commands

LOAD
STORE
DUMP
FILTER
GROUP
JOIN
ORDER
FOREACH

⚙️ ETL Processing in Pig

ETL means Extract, Transform and Load. Pig is widely used for ETL operations in Big Data.

Step	Meaning
Extract	Data is collected from different sources.
Transform	Data is cleaned, filtered and converted into a useful format.
Load	Processed data is stored in the target system.
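
The three ETL steps can be sketched in Python as three small functions. This is only an illustration of the idea, not Pig code: the `RAW` text, the cleaning rules (trim spaces, drop records with missing marks) and the JSON output are all assumptions.

```python
import csv
import io
import json

# Raw input with untidy spacing and one missing value (invented data).
RAW = "id,name,marks\n1, amit ,72\n2,Neha,\n3,ravi,81\n"


def extract(text):
    """Extract: read records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: clean names, convert types, drop incomplete records."""
    cleaned = []
    for rec in records:
        marks = rec["marks"].strip()
        if not marks:  # skip records with missing marks
            continue
        cleaned.append(
            {
                "id": int(rec["id"]),
                "name": rec["name"].strip().title(),
                "marks": int(marks),
            }
        )
    return cleaned


def load(records):
    """Load: write to the target (a JSON string stands in for a real store)."""
    return json.dumps(records)


output = load(transform(extract(RAW)))
print(output)
```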

🧮 Operators in Pig

Operator	Use
LOAD	Loads data from the file system.
FILTER	Selects records that satisfy a condition.
FOREACH	Processes each record and generates the required fields (used with GENERATE).
GROUP	Groups records that share the same key.
JOIN	Combines two datasets on a common field.
ORDER	Sorts data by one or more fields.
DISTINCT	Removes duplicate records.
STORE	Saves output to the file system.
DUMP	Displays output on the screen.
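
The effect of DISTINCT, FOREACH, JOIN and ORDER can be shown with small Python equivalents. These are analogies on in-memory lists, not Pig code, and the `student` and `city` relations are invented sample data.

```python
# Relations as lists of tuples (invented sample data).
student = [(1, "Amit", 72), (2, "Neha", 55), (3, "Ravi", 81), (3, "Ravi", 81)]
city = [(1, "Bhopal"), (2, "Indore"), (3, "Bhopal")]

# DISTINCT: remove duplicate tuples (first occurrence is kept)
distinct = list(dict.fromkeys(student))

# FOREACH ... GENERATE name, marks: pick fields from each tuple
projected = [(name, marks) for _, name, marks in distinct]

# JOIN distinct BY id, city BY id: inner join, fields of both sides are kept
city_by_id = dict(city)
joined = [
    (sid, name, marks, sid, city_by_id[sid])
    for sid, name, marks in distinct
    if sid in city_by_id
]

# ORDER distinct BY marks DESC
ordered = sorted(distinct, key=lambda r: r[2], reverse=True)

print(distinct)
print(projected)
print(joined)
print(ordered)
```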

🧩 Functions in Pig

Pig supports built-in functions and user-defined functions for data processing.

Common Built-in Functions

COUNT()
SUM()
AVG()
MIN()
MAX()
SIZE()
TOKENIZE()

User Defined Functions

If built-in functions are not sufficient, users can create their own functions using Java, Python or other supported languages.
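
The idea behind the built-in functions and a user-defined function can be sketched in Python as follows. This is an analogy only: the `grade` function and the sample values are hypothetical, and a real Pig UDF must be registered in the Pig script before it can be called.

```python
marks = [72, 55, 81]  # a bag of values (invented sample data)

count = len(marks)                          # like COUNT
total = sum(marks)                          # like SUM
average = total / count                     # like AVG
lowest, highest = min(marks), max(marks)    # like MIN and MAX
size = len("RGPV")                          # like SIZE on a chararray
tokens = "big data notes".split()           # like TOKENIZE (split into words)


def grade(m):
    """User-defined function (hypothetical): convert marks to a result."""
    return "PASS" if m > 60 else "FAIL"


# Apply the function to every value, like FOREACH ... GENERATE grade(marks)
grades = [grade(m) for m in marks]

print(count, total, average, lowest, highest, size)
print(tokens)
print(grades)
```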

📂 Data Types in Pig

Data Type	Meaning
int	32-bit signed integer
long	64-bit signed integer
float	32-bit floating-point number
double	64-bit floating-point number
chararray	String (character array)
bytearray	Raw bytes (the default type when no schema is given)
tuple	Ordered set of fields
bag	Collection of tuples
map	Set of key-value pairs
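
A rough Python picture of how these types nest is given below. It is an analogy, not Pig code, and all values are invented. The last part shows why a cast is needed when a field arrives as raw bytes.

```python
# tuple: an ordered set of fields
record = (1, "Amit", 72)

# bag: a collection of tuples (duplicates are allowed)
bag = [(1, "Amit", 72), (4, "Amit", 90)]

# A grouped result is itself a tuple: (group key, bag of matching tuples)
group_result = ("Amit", bag)

# map: key-value pairs
details = {"name": "Amit", "city": "Bhopal"}

# bytearray vs chararray vs int: raw bytes must be converted before use
raw = b"72"                  # like a bytearray field
text = raw.decode("utf-8")   # like casting to chararray
number = int(text)           # like casting to int

print(record[1], len(group_result[1]), details["city"], number + 1)
```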

⚖️ Hive vs Pig

Hive	Pig
Uses HiveQL	Uses Pig Latin
SQL-like query language	Data flow scripting language
Best for structured data	Works with structured and semi-structured data
Mostly used by data analysts	Mostly used by programmers
Good for reporting	Good for ETL processing
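
The difference between a query language and a data flow language can be seen by solving one small task in both styles. The sketch below uses Python only: SQLite stands in for the Hive-like query style, and plain list operations stand in for the Pig-like step-by-step style. Neither part is real Hive or Pig code, and the rows are invented.

```python
import sqlite3

rows = [(1, "Amit", 72), (2, "Neha", 55), (3, "Ravi", 81)]  # invented data

# Query style (Hive-like): describe the wanted result in a single statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student(id INT, name TEXT, marks INT)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", rows)
sql_result = conn.execute(
    "SELECT name FROM student WHERE marks > 60 ORDER BY marks DESC"
).fetchall()

# Data flow style (Pig-like): build the result through named steps.
passed = [r for r in rows if r[2] > 60]                      # FILTER
ordered = sorted(passed, key=lambda r: r[2], reverse=True)   # ORDER
flow_result = [(r[1],) for r in ordered]                     # FOREACH

print(sql_result)
print(flow_result)
```

Both styles give the same answer; the choice is mainly about how the logic is expressed.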

⭐ Important Questions

Explain Hive Architecture with diagram.
What is Hive? Explain features of Hive.
Explain Hive data types.
What is HiveQL? Write common HiveQL commands.
Explain Pig Architecture.
What is Pig Latin? Explain with example.
Explain Pig on Hadoop.
Explain ETL processing in Pig.
Explain operators and functions in Pig.
Differentiate between Hive and Pig.

🔥 Last Minute Revision

Hive = SQL-like query tool on Hadoop.
HiveQL = Query language of Hive.
Metastore stores metadata.
Pig = Data flow tool for large data processing.
Pig Latin = Scripting language of Pig.
Pig is mostly used for ETL processing.
Hive is suitable for structured data.
Pig is suitable for structured and semi-structured data.
