Hive, HiveQL, Pig, Pig Latin, Pig Architecture, ETL Processing, Operators, Functions and Data Types
Big Data Unit 3 mainly covers two important Hadoop ecosystem tools: Hive and Pig. Hive is used for SQL-like querying on Hadoop, while Pig is used for data flow and ETL processing.
Hive is a data warehouse tool built on top of Hadoop. It allows users to write SQL-like queries called HiveQL to analyze large datasets stored in HDFS.
The Hive architecture consists of components that compile HiveQL queries into execution jobs (MapReduce, Tez or Spark) and run them on the Hadoop cluster.
| Component | Work |
|---|---|
| User Interface | Allows user to submit HiveQL queries. |
| Driver | Receives query and manages execution lifecycle. |
| Compiler | Checks query syntax and converts query into execution plan. |
| Metastore | Stores metadata such as table name, columns, data types and location. |
| Execution Engine | Executes the query plan using MapReduce, Tez or Spark. |
| HDFS | Stores actual data files. |
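The split between the Metastore (metadata) and HDFS (data files) can be pictured with a small Python sketch. It is only an analogy, not how Hive is implemented: the table name `students`, its columns, the path and the sample lines are all invented for illustration. The schema is kept in a dictionary and applied to raw text lines only when the table is read.

```python
# Toy analogy: metadata is stored apart from the raw data files.
# All names (students, columns, location) are hypothetical.
METASTORE = {
    "students": {
        "columns": [("rollno", int), ("name", str), ("city", str)],
        "location": "/warehouse/students",  # hypothetical HDFS path
        "delimiter": ",",
    }
}

# Stand-in for the contents of a data file stored in HDFS.
RAW_LINES = ["101,Amit,Bhopal", "102,Neha,Indore"]


def read_table(table_name, raw_lines):
    """Apply the stored schema to raw lines at read time."""
    meta = METASTORE[table_name]
    rows = []
    for line in raw_lines:
        parts = line.split(meta["delimiter"])
        row = {name: cast(value)
               for (name, cast), value in zip(meta["columns"], parts)}
        rows.append(row)
    return rows


rows = read_table("students", RAW_LINES)
print(rows[0])
```

The data lines themselves carry no column names or types; those come only from the metadata entry, which is the role the Metastore plays for Hive tables.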
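
Hive supports primitive data types such as INT and STRING, and complex data types such as ARRAY, MAP and STRUCT.

| Data Type | Example |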
|---|---|
| INT | 101 |
| STRING | 'RGPV' |
| ARRAY | ['Java','Python','SQL'] |
| MAP | {'name':'Amit','city':'Bhopal'} |
| STRUCT | {'name':'Amit','rollno':101}, with fields read as student.name and student.rollno |
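The three complex types can be compared with familiar Python structures. This is a rough analogy under the assumption that a list stands in for ARRAY, a dict for MAP and a named tuple for STRUCT; the variable names are made up for the example.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """Plays the role of a STRUCT with named fields."""
    name: str
    rollno: int


skills = ["Java", "Python", "SQL"]            # like ARRAY<STRING>
details = {"name": "Amit", "city": "Bhopal"}  # like MAP<STRING,STRING>
student = Student(name="Amit", rollno=101)    # like STRUCT<name:STRING,rollno:INT>

first_skill = skills[0]      # HiveQL element access: skills[0]
city = details["city"]       # HiveQL key lookup: details['city']
student_name = student.name  # HiveQL field access: student.name
print(first_skill, city, student_name)
```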
Hive Query Language or HiveQL is similar to SQL. It is used to create tables, load data, query data and perform analysis on Hadoop data.
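Because HiveQL is close to SQL, the create, load, query and analyze steps can be tried out with any SQL engine. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in. SQLite is not Hive: in Hive the data would sit in HDFS, loading would typically use a `LOAD DATA` statement, and the queries would run as distributed jobs. The `students` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create a table (HiveQL also uses CREATE TABLE, with its own type names).
conn.execute(
    "CREATE TABLE students (rollno INTEGER, name TEXT, city TEXT, marks INTEGER)"
)

# Load data (stand-in for loading a file into a Hive table).
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [(101, "Amit", "Bhopal", 82),
     (102, "Neha", "Indore", 91),
     (103, "Ravi", "Bhopal", 67)],
)

# Query data: rows that satisfy a condition.
high_scorers = conn.execute(
    "SELECT name FROM students WHERE marks > 80 ORDER BY name"
).fetchall()

# Analysis: aggregate per group.
avg_by_city = conn.execute(
    "SELECT city, AVG(marks) FROM students GROUP BY city ORDER BY city"
).fetchall()

print(high_scorers)
print(avg_by_city)
conn.close()
```

The `SELECT ... WHERE` and `GROUP BY` statements are the part that carries over most directly to HiveQL.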
Pig is a high-level platform used for analyzing large datasets in Hadoop. Pig uses a scripting language called Pig Latin.
Pig architecture explains how Pig Latin scripts are converted into MapReduce jobs.
| Component | Work |
|---|---|
| Pig Latin Script | User writes data processing commands. |
| Parser | Checks syntax and creates logical plan. |
| Optimizer | Improves the logical plan for better performance. |
| Compiler | Converts logical plan into MapReduce jobs. |
| Execution Engine | Runs MapReduce jobs on Hadoop. |
| HDFS | Stores input and output data. |
Pig runs on Hadoop and uses HDFS for storage and MapReduce for processing. The user writes Pig Latin scripts, and Pig automatically converts them into MapReduce jobs.
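The kind of job produced for a grouping followed by a count can be pictured as three phases: map, shuffle and reduce. The following single-process Python sketch only imitates those phases on a hypothetical list of (name, city) records; a real job runs the same phases in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical input records: (name, city)
records = [("Amit", "Bhopal"), ("Neha", "Indore"), ("Ravi", "Bhopal")]


def map_phase(record):
    """Emit (key, 1) for each input record; the key is the city."""
    _name, city = record
    return (city, 1)


def shuffle(pairs):
    """Collect all emitted values under their key, as happens between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine the values for one key into a single count."""
    return (key, sum(values))


mapped = [map_phase(r) for r in records]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```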
Pig Latin is a data flow language used in Apache Pig. It is simple and suitable for data transformation.
ETL means Extract, Transform and Load. Pig is widely used for ETL operations in Big Data.
| Step | Meaning |
|---|---|
| Extract | Data is collected from different sources. |
| Transform | Data is cleaned, filtered and converted into useful format. |
| Load | Processed data is stored into target system. |
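The three ETL steps can be sketched in plain Python. This is a minimal illustration and not a Pig script: the source text, the column names and the cleaning rules (trim names, drop rows with missing marks) are assumptions chosen for the example, and the "target system" is just an in-memory CSV string.

```python
import csv
import io

# Stand-in for a source file; the columns and values are hypothetical.
SOURCE = "rollno,name,marks\n101, amit ,82\n102,Neha,\n103,ravi,67\n"


def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean, filter and convert records into a useful format."""
    cleaned = []
    for row in rows:
        if not row["marks"].strip():
            continue  # drop records with missing marks
        cleaned.append({
            "rollno": int(row["rollno"]),
            "name": row["name"].strip().title(),
            "marks": int(row["marks"]),
        })
    return cleaned


def load(rows):
    """Load: write processed records to the target (here, a CSV string)."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["rollno", "name", "marks"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


result = load(transform(extract(SOURCE)))
print(result)
```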

Commonly used Pig Latin operators are:

| Operator | Use |
|---|---|
| LOAD | Loads data from file system. |
| FILTER | Selects data based on condition. |
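| FOREACH | Processes each record and generates the required fields (used with GENERATE). |
| GROUP | Groups records that share the same key. |
| JOIN | Combines two or more datasets on a common field. |
| ORDER | Sorts data by one or more fields. |
| DISTINCT | Removes duplicate records. |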
| STORE | Saves output into file system. |
| DUMP | Displays output on screen. |
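What several of these operators do to a dataset can be imitated with ordinary Python collections. The sketch below is an analogy, not Pig Latin; the `students` records are hypothetical, and JOIN and STORE are left out to keep it short. Each line is marked with the operator it mimics.

```python
from itertools import groupby
from operator import itemgetter

# LOAD: stand-in for reading tuples (name, city, marks) from a file.
students = [
    ("Amit", "Bhopal", 82),
    ("Neha", "Indore", 91),
    ("Ravi", "Bhopal", 67),
    ("Amit", "Bhopal", 82),  # duplicate record
]

unique = sorted(set(students))                                # DISTINCT
passed = [s for s in unique if s[2] >= 70]                    # FILTER ... BY marks >= 70
projected = [(name, marks) for name, _city, marks in passed]  # FOREACH ... GENERATE name, marks
ranked = sorted(projected, key=itemgetter(1), reverse=True)   # ORDER ... BY marks DESC

# GROUP ... BY city: groupby needs its input sorted on the grouping key.
by_city = {
    city: [s[0] for s in group]
    for city, group in groupby(sorted(unique, key=itemgetter(1)), key=itemgetter(1))
}

print(ranked)   # DUMP
print(by_city)  # DUMP
```

Each step takes a collection and produces a new one, which matches the data flow style of Pig Latin, where every statement defines a new relation from an earlier one.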
Pig supports built-in functions and user-defined functions for data processing.
If built-in functions are not sufficient, users can create their own functions using Java, Python or other supported languages.
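A user-defined function is, at its core, an ordinary function applied to each record. The Python sketch below shows the idea with a hypothetical `grade` function. Pig's Python UDF support offers a decorator (`outputSchema`) for declaring the return type; so that this example runs outside Pig, a local stand-in named `output_schema` is defined here and is not part of Pig.

```python
def output_schema(schema):
    """Hypothetical stand-in decorator: only records the schema string."""
    def wrap(func):
        func.schema = schema
        return func
    return wrap


@output_schema("grade:chararray")
def grade(marks):
    """Map numeric marks to a letter grade; a missing value stays missing."""
    if marks is None:
        return None
    if marks >= 80:
        return "A"
    if marks >= 60:
        return "B"
    return "C"


print(grade(82), grade(67), grade(40), grade.schema)
```

Handling `None` explicitly matters because fields that are missing in the input reach a UDF as null values.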

Pig Latin supports the following simple and complex data types:

| Data Type | Meaning |
|---|---|
| int | 32-bit signed integer |
| long | 64-bit signed integer |
| float | 32-bit floating-point number |
| double | 64-bit floating-point number |
| chararray | String value (character array) |
| bytearray | Raw bytes; the default type when no schema is given |
| tuple | Ordered set of fields |
| bag | Collection of tuples |
| map | Set of key-value pairs |
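The three complex types nest inside each other: fields make a tuple, tuples make a bag. A rough Python analogy, with hypothetical values, is shown below. A list is used for the bag only for convenience, since a Pig bag does not guarantee any order.

```python
# tuple: an ordered set of fields
student = ("Amit", 101, 82.5)

# bag: a collection of tuples (a list is used here; a real bag is unordered)
bag = [("Amit", 101, 82.5), ("Neha", 102, 91.0)]

# map: key-value pairs with string keys
info = {"city": "Bhopal", "branch": "CSE"}

name = student[0]                     # Pig Latin positional reference: $0
marks_total = sum(t[2] for t in bag)  # aggregate over one field of a bag
city = info["city"]                   # Pig Latin map lookup: info#'city'
print(name, marks_total, city)
```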

The main differences between Hive and Pig are:

| Hive | Pig |
|---|---|
| Uses HiveQL | Uses Pig Latin |
| SQL-like query language | Data flow scripting language |
| Best suited to structured data | Handles structured and semi-structured data |
| Used by analysts | Used by developers |
| Good for reporting | Good for ETL processing |