Data Mining Unit 5 Notes | Clustering & Association Rule Mining

Unit 5

🎯 Unit 5 Overview

Unit 5 covers unsupervised learning techniques such as clustering and association rule mining. It includes hierarchical algorithms, partitional algorithms, clustering large databases, BIRCH, DBSCAN, CURE, Apriori algorithm and FP-Growth algorithm.

Exam Tip: Clustering, hierarchical vs partitional clustering, DBSCAN, BIRCH, Apriori and FP-Growth are very important for RGPV exams.

🔗 Clustering

Clustering is an unsupervised learning technique used to group similar data objects together. Objects inside one cluster are similar to each other and different from objects of other clusters.

Example

Customers can be grouped based on buying behavior such as low buyers, medium buyers and high buyers.

Applications

Customer segmentation
Image segmentation
Document grouping
Market analysis
Fraud detection
Social network analysis

📂 Types of Clustering Algorithms

Type	Description
Hierarchical Clustering	Creates a tree-like structure of clusters.
Partitional Clustering	Divides data into fixed number of clusters.
Density-Based Clustering	Forms clusters based on dense regions.
Grid-Based Clustering	Divides data space into grids.
Model-Based Clustering	Uses mathematical models to form clusters.

🌳 Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters. It represents clusters using a tree-like diagram called dendrogram.

Types

Agglomerative: Bottom-up approach. Each object starts as separate cluster and clusters are merged step by step.
Divisive: Top-down approach. All objects start in one cluster and are divided step by step.

Advantages

No need to specify number of clusters initially
Easy to understand using dendrogram
Useful for small datasets

Limitations

Computationally expensive for large data
Once merge or split is done, it cannot be undone

📊 Partitional Clustering

Partitional clustering divides data into a fixed number of clusters. Each data object belongs to one cluster.

K-Means Algorithm

Select number of clusters K.
Choose K initial cluster centers.
Assign each data object to nearest cluster center.
Recalculate cluster centers.
Repeat until clusters become stable.

Advantages

Simple and fast
Works well for large datasets
Easy to implement

Limitations

Need to choose K in advance
Sensitive to outliers
Initial centroid selection affects result

⚖️ Hierarchical vs Partitional Clustering

Hierarchical Clustering	Partitional Clustering
Creates hierarchy of clusters.	Creates fixed number of clusters.
Uses dendrogram.	Uses cluster centers or partitions.
No need to specify K initially.	Need to specify K.
Slow for large datasets.	Fast for large datasets.
Example: Agglomerative clustering.	Example: K-Means clustering.

🏢 Clustering Large Databases

Large databases contain huge amounts of data, so normal clustering algorithms may become slow. Special scalable algorithms are used for clustering large datasets.

Requirements

Scalability
High speed
Ability to handle noise
Minimum memory usage
Support for high dimensional data

🌿 BIRCH Algorithm

BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is designed for clustering very large datasets.

Features

Uses clustering feature tree
Works efficiently on large databases
Reduces data size before clustering
Memory efficient
Good for incremental clustering

Steps

Scan data and build clustering feature tree.
Compress data into smaller summaries.
Apply clustering algorithm on leaf entries.
Refine clusters if required.

🧭 DBSCAN Algorithm

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It forms clusters based on density of data points.

Important Terms

Epsilon: Radius around a point.
MinPts: Minimum points required to form dense region.
Core Point: Point having at least MinPts within epsilon radius.
Border Point: Point near a core point but not core itself.
Noise Point: Point that does not belong to any cluster.

Advantages

Finds arbitrary shaped clusters
Detects noise and outliers
No need to specify number of clusters

🎯 CURE Algorithm

CURE stands for Clustering Using Representatives. It uses representative points to identify clusters and can handle non-spherical cluster shapes.

Features

Uses multiple representative points
Handles outliers better
Works with complex cluster shapes
Suitable for large datasets with sampling

🛒 Association Rule Mining

Association rule mining finds relationships among items in large datasets. It is commonly used in market basket analysis.

Example

If customers buy bread and butter together, then store can recommend jam also.

Rule Format

A → B means if item A is purchased, item B is likely to be purchased.

📌 Support, Confidence and Lift

Measure	Meaning
Support	How frequently an itemset appears in transactions.
Confidence	How often rule is true when condition occurs.
Lift	Strength of association between items.

🧮 Apriori Algorithm

Apriori algorithm is used to find frequent itemsets and generate association rules. It works on the principle that all subsets of a frequent itemset must also be frequent.

Steps

Set minimum support and confidence.
Find frequent 1-itemsets.
Generate candidate 2-itemsets.
Remove itemsets below minimum support.
Repeat for larger itemsets.
Generate association rules from frequent itemsets.

Advantages

Simple and easy to understand
Useful for market basket analysis
Finds frequent itemsets systematically

Limitation

Generates many candidate itemsets
Requires multiple database scans
Can be slow for large datasets

⚡ FP-Growth Algorithm

FP-Growth stands for Frequent Pattern Growth. It finds frequent itemsets without generating large candidate sets like Apriori.

Steps

Scan database and find frequent items.
Sort frequent items by frequency.
Build FP-tree.
Generate frequent patterns from FP-tree.

Advantages

Faster than Apriori
No candidate generation
Requires fewer database scans
Efficient for large datasets

⚖️ Apriori vs FP-Growth

Apriori	FP-Growth
Generates candidate itemsets.	Does not generate candidate itemsets.
Requires multiple database scans.	Requires fewer scans.
Slower for large datasets.	Faster for large datasets.
Easy to understand.	Uses FP-tree structure.

⭐ Important Questions

Define clustering and explain its applications.
Explain hierarchical clustering and its types.
Explain partitional clustering with K-Means algorithm.
Differentiate between hierarchical and partitional clustering.
Explain clustering large databases.
Explain BIRCH algorithm.
Explain DBSCAN algorithm with important terms.
Explain CURE algorithm.
Explain association rule mining with support and confidence.
Explain Apriori and FP-Growth algorithms.

🔥 Last Minute Revision

Clustering groups similar objects.
Hierarchical clustering uses dendrogram.
Partitional clustering uses fixed K clusters.
K-Means is fast but needs K value.
BIRCH uses clustering feature tree.
DBSCAN is density-based and detects noise.
CURE uses representative points.
Association rule mining finds item relationships.
Apriori generates candidate itemsets.
FP-Growth uses FP-tree and is faster than Apriori.

📘 Data Mining & Warehousing Unit 5

🎯 Unit 5 Overview

🔗 Clustering

Example

Applications

📂 Types of Clustering Algorithms

🌳 Hierarchical Clustering

Types

Advantages

Limitations

📊 Partitional Clustering

K-Means Algorithm

Advantages

Limitations

⚖️ Hierarchical vs Partitional Clustering

🏢 Clustering Large Databases

Requirements

🌿 BIRCH Algorithm

Features

Steps

🧭 DBSCAN Algorithm

Important Terms

Advantages

🎯 CURE Algorithm

Features

🛒 Association Rule Mining

Example

Rule Format

📌 Support, Confidence and Lift

🧮 Apriori Algorithm

Steps

Advantages

Limitation

⚡ FP-Growth Algorithm

Steps

Advantages

⚖️ Apriori vs FP-Growth

⭐ Important Questions

🔥 Last Minute Revision

🔗 Related Links