Unit 5
🎯 Unit 5 Overview
Unit 5 covers unsupervised learning techniques such as clustering and association rule mining.
It includes hierarchical algorithms, partitional algorithms, clustering large databases,
BIRCH, DBSCAN, CURE, Apriori algorithm and FP-Growth algorithm.
Exam Tip: Clustering, hierarchical vs partitional clustering, DBSCAN, BIRCH, Apriori and FP-Growth are very important for RGPV exams.
🔗 Clustering
Clustering is an unsupervised learning technique used to group similar data objects together.
Objects inside one cluster are similar to each other and different from objects of other clusters.
Example
Customers can be grouped based on buying behavior such as low buyers, medium buyers and high buyers.
Applications
- Customer segmentation
- Image segmentation
- Document grouping
- Market analysis
- Fraud detection
- Social network analysis
📂 Types of Clustering Algorithms
| Type |
Description |
| Hierarchical Clustering |
Creates a tree-like structure of clusters. |
| Partitional Clustering |
Divides data into fixed number of clusters. |
| Density-Based Clustering |
Forms clusters based on dense regions. |
| Grid-Based Clustering |
Divides data space into grids. |
| Model-Based Clustering |
Uses mathematical models to form clusters. |
🌳 Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters. It represents clusters using a tree-like
diagram called dendrogram.
Types
- Agglomerative: Bottom-up approach. Each object starts as separate cluster and clusters are merged step by step.
- Divisive: Top-down approach. All objects start in one cluster and are divided step by step.
Advantages
- No need to specify number of clusters initially
- Easy to understand using dendrogram
- Useful for small datasets
Limitations
- Computationally expensive for large data
- Once merge or split is done, it cannot be undone
📊 Partitional Clustering
Partitional clustering divides data into a fixed number of clusters. Each data object belongs to
one cluster.
K-Means Algorithm
- Select number of clusters K.
- Choose K initial cluster centers.
- Assign each data object to nearest cluster center.
- Recalculate cluster centers.
- Repeat until clusters become stable.
Advantages
- Simple and fast
- Works well for large datasets
- Easy to implement
Limitations
- Need to choose K in advance
- Sensitive to outliers
- Initial centroid selection affects result
⚖️ Hierarchical vs Partitional Clustering
| Hierarchical Clustering |
Partitional Clustering |
| Creates hierarchy of clusters. |
Creates fixed number of clusters. |
| Uses dendrogram. |
Uses cluster centers or partitions. |
| No need to specify K initially. |
Need to specify K. |
| Slow for large datasets. |
Fast for large datasets. |
| Example: Agglomerative clustering. |
Example: K-Means clustering. |
🏢 Clustering Large Databases
Large databases contain huge amounts of data, so normal clustering algorithms may become slow.
Special scalable algorithms are used for clustering large datasets.
Requirements
- Scalability
- High speed
- Ability to handle noise
- Minimum memory usage
- Support for high dimensional data
🌿 BIRCH Algorithm
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
It is designed for clustering very large datasets.
Features
- Uses clustering feature tree
- Works efficiently on large databases
- Reduces data size before clustering
- Memory efficient
- Good for incremental clustering
Steps
- Scan data and build clustering feature tree.
- Compress data into smaller summaries.
- Apply clustering algorithm on leaf entries.
- Refine clusters if required.
🧭 DBSCAN Algorithm
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It forms clusters based on density of data points.
Important Terms
- Epsilon: Radius around a point.
- MinPts: Minimum points required to form dense region.
- Core Point: Point having at least MinPts within epsilon radius.
- Border Point: Point near a core point but not core itself.
- Noise Point: Point that does not belong to any cluster.
Advantages
- Finds arbitrary shaped clusters
- Detects noise and outliers
- No need to specify number of clusters
🎯 CURE Algorithm
CURE stands for Clustering Using Representatives. It uses representative points to identify
clusters and can handle non-spherical cluster shapes.
Features
- Uses multiple representative points
- Handles outliers better
- Works with complex cluster shapes
- Suitable for large datasets with sampling
🛒 Association Rule Mining
Association rule mining finds relationships among items in large datasets.
It is commonly used in market basket analysis.
Example
If customers buy bread and butter together, then store can recommend jam also.
Rule Format
A → B means if item A is purchased, item B is likely to be purchased.
📌 Support, Confidence and Lift
| Measure |
Meaning |
| Support |
How frequently an itemset appears in transactions. |
| Confidence |
How often rule is true when condition occurs. |
| Lift |
Strength of association between items. |
🧮 Apriori Algorithm
Apriori algorithm is used to find frequent itemsets and generate association rules.
It works on the principle that all subsets of a frequent itemset must also be frequent.
Steps
- Set minimum support and confidence.
- Find frequent 1-itemsets.
- Generate candidate 2-itemsets.
- Remove itemsets below minimum support.
- Repeat for larger itemsets.
- Generate association rules from frequent itemsets.
Advantages
- Simple and easy to understand
- Useful for market basket analysis
- Finds frequent itemsets systematically
Limitation
- Generates many candidate itemsets
- Requires multiple database scans
- Can be slow for large datasets
⚡ FP-Growth Algorithm
FP-Growth stands for Frequent Pattern Growth. It finds frequent itemsets without generating
large candidate sets like Apriori.
Steps
- Scan database and find frequent items.
- Sort frequent items by frequency.
- Build FP-tree.
- Generate frequent patterns from FP-tree.
Advantages
- Faster than Apriori
- No candidate generation
- Requires fewer database scans
- Efficient for large datasets
⚖️ Apriori vs FP-Growth
| Apriori |
FP-Growth |
| Generates candidate itemsets. |
Does not generate candidate itemsets. |
| Requires multiple database scans. |
Requires fewer scans. |
| Slower for large datasets. |
Faster for large datasets. |
| Easy to understand. |
Uses FP-tree structure. |
⭐ Important Questions
- Define clustering and explain its applications.
- Explain hierarchical clustering and its types.
- Explain partitional clustering with K-Means algorithm.
- Differentiate between hierarchical and partitional clustering.
- Explain clustering large databases.
- Explain BIRCH algorithm.
- Explain DBSCAN algorithm with important terms.
- Explain CURE algorithm.
- Explain association rule mining with support and confidence.
- Explain Apriori and FP-Growth algorithms.
🔥 Last Minute Revision
- Clustering groups similar objects.
- Hierarchical clustering uses dendrogram.
- Partitional clustering uses fixed K clusters.
- K-Means is fast but needs K value.
- BIRCH uses clustering feature tree.
- DBSCAN is density-based and detects noise.
- CURE uses representative points.
- Association rule mining finds item relationships.
- Apriori generates candidate itemsets.
- FP-Growth uses FP-tree and is faster than Apriori.