🎯 Unit 2 Overview
Unit 2 focuses on Hadoop and its ecosystem. Hadoop is a central Big Data topic and is frequently asked in RGPV examinations.
Exam Tip: Hadoop Architecture, HDFS, YARN and MapReduce are among the most frequently repeated 7-mark and 14-mark questions.
📘 Introduction to Hadoop
Hadoop is an open-source framework developed by the Apache Software Foundation for storing and processing very large amounts of data across a cluster of computers.
Features of Hadoop
- Distributed Storage
- Fault Tolerance
- Scalability
- High Availability
- Parallel Processing
- Cost Effective
Advantages
- Handles massive data efficiently
- Works on commodity hardware
- Reliable and fault tolerant
- Supports structured and unstructured data
🏗 Hadoop Core Components
| Component | Purpose |
| --- | --- |
| HDFS | Distributed storage system |
| YARN | Resource management system |
| MapReduce | Data processing framework |
| Common Utilities | Support libraries and APIs |
🌍 Hadoop Ecosystem
The Hadoop Ecosystem consists of multiple tools working together for Big Data processing.
| Tool | Function |
| --- | --- |
| HDFS | Data Storage |
| MapReduce | Data Processing |
| YARN | Resource Management |
| Hive | SQL Query Processing |
| Pig | Data Flow Language |
| HBase | NoSQL Database |
| Sqoop | Data Transfer |
| Flume | Log Collection |
🗄 HDFS (Hadoop Distributed File System)
HDFS is Hadoop's storage layer. It stores large files across multiple machines and provides fault tolerance.
Main Components
- NameNode
- DataNode
- Secondary NameNode (takes periodic checkpoints of NameNode metadata; it is not a standby NameNode)
Working of HDFS
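- The file is divided into fixed-size blocks (128 MB by default in Hadoop 2.x and later).
- Blocks are stored on DataNodes.
- The NameNode manages metadata (the file-to-block mapping and block locations).
- Replication (three copies of each block by default) ensures fault tolerance.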
Important: NameNode stores metadata while DataNodes store actual data.
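The working of HDFS described above can be sketched in a few lines of Python. This is only a toy model, not the HDFS API: the function names `split_into_blocks` and `place_replicas`, the DataNode names, and the round-robin placement are all assumptions for illustration (real HDFS uses rack-aware placement).

```python
BLOCK_SIZE_MB = 128  # HDFS default block size in Hadoop 2.x and later
REPLICATION = 3      # HDFS default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement (assumption, not the real HDFS policy).

    Returns the kind of mapping the NameNode keeps as metadata:
    block index -> list of DataNodes holding a copy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


# A hypothetical 300 MB file on a hypothetical 4-node cluster
blocks = split_into_blocks(300)
namenode_metadata = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
raw_storage_mb = sum(blocks) * REPLICATION

print(blocks)             # [128, 128, 44]
print(namenode_metadata)  # block -> DataNodes holding its replicas
print(raw_storage_mb)     # 900
```

Note that the 300 MB file occupies 900 MB of raw disk, which is the storage cost paid for fault tolerance.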
⚡ Hadoop Limitations
- High latency
- Not suitable for real-time processing
- Complex setup
- Requires technical expertise
- Consumes large storage (replication keeps multiple copies of every block)
- MapReduce is comparatively slow (intermediate results are written to disk)
⚖️ RDBMS vs Hadoop
| RDBMS | Hadoop |
| --- | --- |
| Structured Data | Structured + Unstructured Data |
| GB Scale | TB/PB Scale |
| Vertical Scaling | Horizontal Scaling |
| Expensive Hardware | Commodity Hardware |
| Low Fault Tolerance | High Fault Tolerance |
🏢 Hive Physical Architecture
Hive is a data warehouse tool built on Hadoop that allows SQL-like queries using HiveQL.
Hive Components
- User Interface
- Driver
- Compiler
- Metastore
- Execution Engine
- HDFS Storage Layer
Hive compiles HiveQL queries into jobs for an execution engine: classically MapReduce, while newer versions can also run on Tez or Spark.
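To see how a SQL-like query can become a map step and a reduce step, here is a rough Python sketch of a `GROUP BY` aggregation. The `sales` table, its rows, and the function names are hypothetical, and this is not the plan Hive actually generates; it only shows the idea.

```python
from collections import defaultdict

# HiveQL being illustrated (kept as a string only; nothing is sent to Hive)
QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region"

# Hypothetical rows of a table sales(region, amount)
rows = [("north", 100), ("south", 50), ("north", 25)]


def map_phase(table_rows):
    """Emit (group key, value) pairs: the GROUP BY column becomes the key."""
    for region, amount in table_rows:
        yield region, amount


def reduce_phase(pairs):
    """Apply the aggregate (SUM) to all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


result = reduce_phase(map_phase(rows))
print(result)  # {'north': 125, 'south': 50}
```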
⚙️ YARN (Yet Another Resource Negotiator)
YARN manages resources and scheduling in Hadoop clusters.
Main Components
- Resource Manager
- Node Manager
- Application Master
- Containers
Functions
- Resource Allocation
- Job Scheduling
- Cluster Monitoring
- Application Management
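Resource allocation in YARN can be pictured with a small simulation. This is a simplified sketch, not YARN's API: the classes below are hypothetical stand-ins, and the first-fit rule is an assumption (real YARN uses pluggable schedulers such as the Capacity Scheduler and Fair Scheduler).

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for a worker node that hosts containers."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy stand-in for the cluster-wide resource allocator."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mem_mb):
        """Grant a container on the first node with enough free memory.

        Returns the node name, or None if the request cannot be met.
        """
        for node in self.nodes:
            if node.free_mb >= mem_mb:
                node.free_mb -= mem_mb
                node.containers.append((app_id, mem_mb))
                return node.name
        return None


# Hypothetical cluster: two nodes with 4 GB and 2 GB of free memory
rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])

# An application asks for four containers of 2 GB each
grants = [rm.allocate("app-1", 2048) for _ in range(4)]
print(grants)  # ['nm1', 'nm1', 'nm2', None]
```

The fourth request returns `None` because the cluster is full; in a real cluster the request would wait until resources are released.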
🔄 MapReduce Programming
MapReduce is a programming model used for processing huge datasets in parallel.
Map Phase
Input data is divided into splits. Each split is processed independently by a mapper, which emits intermediate key-value pairs.
Reduce Phase
Intermediate values from the Map phase are grouped by key and combined by reducers to generate the final output.
Steps of MapReduce
- Input Splitting
- Mapping
- Shuffling
- Sorting
- Reducing
- Output Generation
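The steps above can be traced with the classic word-count example. This is a single-machine sketch in plain Python, not the Hadoop MapReduce API; the function names are chosen for illustration, and each input line is treated as one split.

```python
from collections import defaultdict


def mapper(line):
    """Mapping: emit (word, 1) for every word in one input split."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffling + sorting: group values by key, then order by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reducing: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(text):
    splits = text.splitlines()                               # input splitting
    mapped = [pair for line in splits for pair in mapper(line)]  # mapping
    grouped = shuffle_and_sort(mapped)                       # shuffling, sorting
    return dict(reducer(k, vs) for k, vs in grouped)         # reducing, output


counts = word_count("big data\nbig cluster")
print(counts)  # {'big': 2, 'cluster': 1, 'data': 1}
```

In a real cluster the mappers and reducers run in parallel on different nodes, and the shuffle moves data across the network.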
⭐ Most Important Questions
- Explain Hadoop Architecture with diagram.
- Explain Hadoop Ecosystem.
- What is HDFS? Explain its architecture.
- Explain NameNode and DataNode.
- Differentiate between Hadoop and RDBMS.
- Explain Hive Architecture.
- Explain YARN Architecture and working.
- Explain MapReduce programming model.
- Write advantages and limitations of Hadoop.
- Explain Hadoop Core Components.
🔥 Last Minute Revision
- Hadoop = Storage + Processing Framework
- HDFS = Storage Layer
- NameNode = Metadata
- DataNode = Actual Data
- YARN = Resource Management
- MapReduce = Processing Engine
- Hive = SQL on Hadoop
- RDBMS vs Hadoop frequently asked