📊 Big Data Unit 2

Introduction to Hadoop, Hadoop Ecosystem, HDFS, YARN, MapReduce and Hive Architecture

🎯 Unit 2 Overview

Unit 2 mainly focuses on Hadoop and its ecosystem. Hadoop is one of the most important topics in Big Data and comes up frequently in RGPV examinations.

Exam Tip: Hadoop Architecture, HDFS, YARN and MapReduce are among the most frequently repeated 7-mark and 14-mark questions.

📘 Introduction to Hadoop

Hadoop is an open-source framework developed by the Apache Software Foundation for storing and processing huge amounts of data across a cluster of multiple computers.

Features of Hadoop

Advantages

🏗 Hadoop Core Components

Component          Purpose
HDFS               Distributed storage system
YARN               Resource management system
MapReduce          Data processing framework
Common Utilities   Support libraries and APIs

🌍 Hadoop Ecosystem

The Hadoop Ecosystem consists of multiple tools working together for Big Data processing.

Tool        Function
HDFS        Data Storage
MapReduce   Data Processing
YARN        Resource Management
Hive        SQL Query Processing
Pig         Data Flow Language
HBase       NoSQL Database
Sqoop       Data Transfer
Flume       Log Collection

🗄 HDFS (Hadoop Distributed File System)

HDFS is Hadoop's storage layer. It stores large files across multiple machines and provides fault tolerance.

Main Components

Working of HDFS

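  1. The file is divided into fixed-size blocks.
  2. The blocks are stored on DataNodes.
  3. The NameNode manages the metadata (which blocks make up each file and where they are stored).
  4. Each block is replicated on several DataNodes, which ensures fault tolerance.

Important: NameNode stores metadata while DataNodes store actual data.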
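
The working of HDFS described above can be pictured with a small simulation. This is only a sketch, not real HDFS code: the function names, the `file#blk_N` naming and the round-robin placement are made up for illustration, and the 128 MB block size and replication factor of 3 are assumed here as commonly used defaults. Real HDFS uses a rack-aware placement policy.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB   # assumed block size (common default)
REPLICATION = 3         # assumed replication factor (common default)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Step 1: cut a file of file_size bytes into blocks; the last may be smaller."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(file_name, file_size, datanodes,
                 replication=REPLICATION, block_size=BLOCK_SIZE):
    """Steps 2-4: build the metadata a toy NameNode would keep.

    Each block is mapped to the DataNodes holding its replicas.
    Placement is plain round-robin (hypothetical, for illustration only).
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    metadata = {}
    for i, size in enumerate(split_into_blocks(file_size, block_size)):
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        metadata[f"{file_name}#blk_{i}"] = {"size": size, "datanodes": replicas}
    return metadata


def readable_after_failure(metadata, failed_node):
    """Fault tolerance: every block must still have a replica on a live node."""
    return all(
        any(dn != failed_node for dn in info["datanodes"])
        for info in metadata.values()
    )


if __name__ == "__main__":
    meta = place_blocks("logs.txt", 300 * MB, ["dn1", "dn2", "dn3", "dn4"])
    for block, info in meta.items():
        print(block, info["size"] // MB, "MB", info["datanodes"])
    print("readable if dn2 fails:", readable_after_failure(meta, "dn2"))
```

Note that the metadata dictionary holds only block names, sizes and locations. The file contents never pass through it, which mirrors the NameNode/DataNode split.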

⚡ Hadoop Limitations

⚖️ RDBMS vs Hadoop

RDBMS                 Hadoop
Structured Data       Structured + Unstructured Data
GB Scale              TB/PB Scale
Vertical Scaling      Horizontal Scaling
Expensive Hardware    Commodity Hardware
Low Fault Tolerance   High Fault Tolerance

🏢 Hive Physical Architecture

Hive is a data warehouse tool built on Hadoop that allows SQL-like queries using HiveQL.

Hive Components

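Hive compiles HiveQL queries into MapReduce jobs that run on the Hadoop cluster, so users do not have to write MapReduce code by hand. Newer Hive versions can also run queries on other execution engines such as Tez or Spark.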
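
To picture how a SQL-like query becomes a MapReduce job, the sketch below hand-translates one query into a map step and a reduce step. It is an illustration only and not how Hive's compiler is implemented: the `employees` table, its columns, the sample rows and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical HiveQL query (table and column names are made up):
#   SELECT dept, SUM(salary) FROM employees
#   WHERE salary >= 30000 GROUP BY dept;

ROWS = [
    {"name": "Asha", "dept": "sales", "salary": 40000},
    {"name": "Ravi", "dept": "sales", "salary": 25000},
    {"name": "Meena", "dept": "hr", "salary": 35000},
    {"name": "John", "dept": "hr", "salary": 30000},
    {"name": "Sara", "dept": "it", "salary": 20000},
]


def map_row(row):
    """Map side: apply the WHERE filter and emit (GROUP BY key, value)."""
    if row["salary"] >= 30000:
        yield row["dept"], row["salary"]


def reduce_group(dept, salaries):
    """Reduce side: apply the aggregate (SUM) to all values of one key."""
    return dept, sum(salaries)


def run_query(rows):
    """Run map, group the pairs by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            grouped[key].append(value)
    return [reduce_group(dept, vals) for dept, vals in sorted(grouped.items())]


if __name__ == "__main__":
    print(run_query(ROWS))
```

In this picture the WHERE clause and column selection happen in the map step, the GROUP BY column becomes the shuffle key, and the aggregate function becomes the reduce step.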

⚙️ YARN (Yet Another Resource Negotiator)

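YARN is Hadoop's resource management layer. It allocates cluster resources (such as memory and CPU) to applications and schedules their tasks on the nodes of the cluster.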
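
The idea of resource allocation and scheduling can be sketched with a toy model. This is not YARN's API or its actual scheduler: the class `ToyResourceManager`, its methods, the node names and the first-fit FIFO policy are all assumptions made for illustration, and only memory is tracked.

```python
from collections import deque


class ToyResourceManager:
    """Toy model: grants memory 'containers' on nodes in FIFO order."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)  # node name -> free memory in MB
        self.pending = deque()            # waiting requests: (app, memory_mb)
        self.running = []                 # granted: (app, node, memory_mb)

    def submit(self, app, memory_mb):
        """An application asks for one container of the given size."""
        self.pending.append((app, memory_mb))
        self._schedule()

    def finish(self, app):
        """An application finishes; its memory is returned to the nodes."""
        for entry in [e for e in self.running if e[0] == app]:
            self.running.remove(entry)
            self.free[entry[1]] += entry[2]
        self._schedule()

    def _schedule(self):
        """Strict FIFO: stop at the first request that fits on no node."""
        while self.pending:
            app, mem = self.pending[0]
            node = next((n for n, f in self.free.items() if f >= mem), None)
            if node is None:
                break
            self.pending.popleft()
            self.free[node] -= mem
            self.running.append((app, node, mem))


if __name__ == "__main__":
    rm = ToyResourceManager({"node1": 4096, "node2": 2048})
    rm.submit("job-a", 3072)
    rm.submit("job-b", 2048)
    rm.submit("job-c", 2048)   # must wait: no node has 2048 MB free
    print(rm.running, list(rm.pending))
    rm.finish("job-a")         # frees node1, so job-c can start
    print(rm.running, list(rm.pending))
```

The point of the sketch is the separation of concerns: applications only ask for resources, while a central manager decides where and when they run.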

Main Components

Functions

🔄 MapReduce Programming

MapReduce is a programming model used for processing huge datasets in parallel.

Map Phase

Input data is divided into splits, and each split is processed independently by a map task that emits intermediate key-value pairs.

Reduce Phase

The intermediate key-value pairs from the Map phase are grouped by key and combined by reduce tasks to generate the final output.

Steps of MapReduce

  1. Input Splitting
  2. Mapping
  3. Shuffling
  4. Sorting
  5. Reducing
  6. Output Generation
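
The six steps above can be traced with the classic word-count example. The sketch below runs everything in a single Python process purely to show the data flow. It does not use the Hadoop API, and all function names are made up for illustration.

```python
from collections import defaultdict


def split_input(text, num_splits):
    """1. Input Splitting: deal the lines out into num_splits chunks."""
    lines = text.splitlines()
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(lines):
    """2. Mapping: emit a (word, 1) pair for every word in the split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_and_sort(mapped_outputs):
    """3-4. Shuffling and Sorting: group values by key, then order by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """5. Reducing: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(text, num_splits=2):
    """6. Output Generation: run the whole pipeline and collect the results."""
    splits = split_input(text, num_splits)
    mapped = [map_phase(split) for split in splits]  # independent per split
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped)


if __name__ == "__main__":
    sample = "big data\nhadoop stores big data\nhadoop"
    print(word_count(sample))
```

On a real cluster each call to `map_phase` would run as a separate task on a different machine, and the shuffle would move data across the network. The logic of each step stays the same.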

⭐ Most Important Questions

  1. Explain Hadoop Architecture with diagram.
  2. Explain Hadoop Ecosystem.
  3. What is HDFS? Explain its architecture.
  4. Explain NameNode and DataNode.
  5. Differentiate between Hadoop and RDBMS.
  6. Explain Hive Architecture.
  7. Explain YARN Architecture and working.
  8. Explain MapReduce programming model.
  9. Write advantages and limitations of Hadoop.
  10. Explain Hadoop Core Components.

🔥 Last Minute Revision