📘 Data Mining & Warehousing Unit 5

Clustering Algorithms, BIRCH, DBSCAN, CURE, Association Rule Mining, Apriori and FP-Growth

Unit 5

🎯 Unit 5 Overview

Unit 5 covers unsupervised learning techniques such as clustering and association rule mining. It includes hierarchical algorithms, partitional algorithms, clustering large databases, BIRCH, DBSCAN, CURE, Apriori algorithm and FP-Growth algorithm.

Exam Tip: Clustering, hierarchical vs partitional clustering, DBSCAN, BIRCH, Apriori and FP-Growth are very important for RGPV exams.

🔗 Clustering

Clustering is an unsupervised learning technique used to group similar data objects together. Objects inside one cluster are similar to each other and different from objects of other clusters.

Example

Customers can be grouped based on buying behavior such as low buyers, medium buyers and high buyers.

Applications

📂 Types of Clustering Algorithms

Type Description
Hierarchical Clustering Creates a tree-like structure of clusters.
Partitional Clustering Divides data into fixed number of clusters.
Density-Based Clustering Forms clusters based on dense regions.
Grid-Based Clustering Divides data space into grids.
Model-Based Clustering Uses mathematical models to form clusters.

🌳 Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters. It represents clusters using a tree-like diagram called dendrogram.

Types

Advantages

Limitations

📊 Partitional Clustering

Partitional clustering divides data into a fixed number of clusters. Each data object belongs to one cluster.

K-Means Algorithm

  1. Select number of clusters K.
  2. Choose K initial cluster centers.
  3. Assign each data object to nearest cluster center.
  4. Recalculate cluster centers.
  5. Repeat until clusters become stable.

Advantages

Limitations

⚖️ Hierarchical vs Partitional Clustering

Hierarchical Clustering Partitional Clustering
Creates hierarchy of clusters. Creates fixed number of clusters.
Uses dendrogram. Uses cluster centers or partitions.
No need to specify K initially. Need to specify K.
Slow for large datasets. Fast for large datasets.
Example: Agglomerative clustering. Example: K-Means clustering.

🏢 Clustering Large Databases

Large databases contain huge amounts of data, so normal clustering algorithms may become slow. Special scalable algorithms are used for clustering large datasets.

Requirements

🌿 BIRCH Algorithm

BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is designed for clustering very large datasets.

Features

Steps

  1. Scan data and build clustering feature tree.
  2. Compress data into smaller summaries.
  3. Apply clustering algorithm on leaf entries.
  4. Refine clusters if required.

🧭 DBSCAN Algorithm

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It forms clusters based on density of data points.

Important Terms

Advantages

🎯 CURE Algorithm

CURE stands for Clustering Using Representatives. It uses representative points to identify clusters and can handle non-spherical cluster shapes.

Features

🛒 Association Rule Mining

Association rule mining finds relationships among items in large datasets. It is commonly used in market basket analysis.

Example

If customers buy bread and butter together, then store can recommend jam also.

Rule Format

A → B means if item A is purchased, item B is likely to be purchased.

📌 Support, Confidence and Lift

Measure Meaning
Support How frequently an itemset appears in transactions.
Confidence How often rule is true when condition occurs.
Lift Strength of association between items.

🧮 Apriori Algorithm

Apriori algorithm is used to find frequent itemsets and generate association rules. It works on the principle that all subsets of a frequent itemset must also be frequent.

Steps

  1. Set minimum support and confidence.
  2. Find frequent 1-itemsets.
  3. Generate candidate 2-itemsets.
  4. Remove itemsets below minimum support.
  5. Repeat for larger itemsets.
  6. Generate association rules from frequent itemsets.

Advantages

Limitation

⚡ FP-Growth Algorithm

FP-Growth stands for Frequent Pattern Growth. It finds frequent itemsets without generating large candidate sets like Apriori.

Steps

  1. Scan database and find frequent items.
  2. Sort frequent items by frequency.
  3. Build FP-tree.
  4. Generate frequent patterns from FP-tree.

Advantages

⚖️ Apriori vs FP-Growth

Apriori FP-Growth
Generates candidate itemsets. Does not generate candidate itemsets.
Requires multiple database scans. Requires fewer scans.
Slower for large datasets. Faster for large datasets.
Easy to understand. Uses FP-tree structure.

⭐ Important Questions

  1. Define clustering and explain its applications.
  2. Explain hierarchical clustering and its types.
  3. Explain partitional clustering with K-Means algorithm.
  4. Differentiate between hierarchical and partitional clustering.
  5. Explain clustering large databases.
  6. Explain BIRCH algorithm.
  7. Explain DBSCAN algorithm with important terms.
  8. Explain CURE algorithm.
  9. Explain association rule mining with support and confidence.
  10. Explain Apriori and FP-Growth algorithms.

🔥 Last Minute Revision

🔗 Related Links