Unit 4
🎯 Unit 4 Overview
Unit 4 covers supervised learning and classification techniques used in data mining.
Classification assigns data objects to predefined classes using algorithms such as
statistical-based, distance-based, decision tree, neural network, rule-based and
probabilistic classifiers.
Exam Tip: Classification, decision tree classifier, neural network classifier, rule-based classifier and probabilistic classifier are very important for RGPV exams.
🤖 Supervised Learning
Supervised learning is a machine learning technique where the model is trained using labeled data.
Labeled data means that each input record already carries its correct output, or class label.
Example
If student data contains attendance, marks and result (Pass/Fail), then result is the class label.
The model learns from this data and predicts the result for new students.
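A minimal sketch of how the labeled data in this example could be represented. The student records and the variable names are hypothetical, chosen only for illustration.

```python
# Hypothetical student records: (attendance %, marks) are the input features,
# and "Pass"/"Fail" is the class label supplied with each training record.
training_data = [
    ((90, 75), "Pass"),
    ((80, 55), "Pass"),
    ((60, 35), "Fail"),
    ((40, 20), "Fail"),
]

features = [x for x, _ in training_data]  # inputs the model sees
labels = [y for _, y in training_data]    # correct outputs it learns from

# A new student has features but no label; predicting it is the model's job.
new_student = (85, 60)

print(features)
print(labels)
```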
Applications
- Email spam detection
- Medical disease prediction
- Loan approval prediction
- Customer churn prediction
- Student performance prediction
🏷️ Classification
Classification is a supervised learning task in which data objects are assigned to predefined classes.
Basic Steps of Classification
- Collect labeled training data.
- Train classification model.
- Test model using test data.
- Evaluate accuracy.
- Use model for prediction on new data.
The output of classification is a category, such as Yes/No, Pass/Fail or Spam/Not Spam.
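The basic steps of classification can be sketched end to end as follows. This is only an illustration: the data is made up, the train/test split is fixed by hand, and the nearest-centroid model is an assumed stand-in for whichever classifier is actually used.

```python
# Step 1: collect labeled data (hypothetical: (attendance %, marks) -> result).
data = [
    ((90, 75), "Pass"), ((85, 60), "Pass"), ((78, 50), "Pass"), ((95, 88), "Pass"),
    ((60, 35), "Fail"), ((40, 20), "Fail"), ((55, 30), "Fail"), ((30, 38), "Fail"),
]

# Hold out two records for testing (fixed split so the run is repeatable).
train = data[:3] + data[4:7]
test = [data[3], data[7]]

def fit_centroids(rows):
    """Step 2: train by averaging the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)

model = fit_centroids(train)

# Steps 3 and 4: test on the held-out records and evaluate accuracy.
correct = sum(predict(model, x) == y for x, y in test)
accuracy = correct / len(test)
print("accuracy:", accuracy)

# Step 5: use the model on a new, unlabeled record.
print("prediction:", predict(model, (82, 58)))
```

Accuracy here is the fraction of test records whose predicted class matches the true label.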
⚖️ Classification vs Regression
| Classification | Regression |
| --- | --- |
| Predicts class/category. | Predicts continuous numeric value. |
| Output is discrete. | Output is continuous. |
| Example: Spam or Not Spam. | Example: House price prediction. |
| Used for decision problems. | Used for forecasting numeric values. |
📊 Statistical-Based Algorithms
Statistical-based classification algorithms use statistical methods and mathematical models
to classify data.
Characteristics
- Based on probability and statistics
- Uses data distribution
- Works well when the data satisfies the model's statistical assumptions
- Useful for prediction and classification
Examples
- Naive Bayes classifier
- Logistic regression
- Linear discriminant analysis
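As an illustration of a statistical-based classifier, here is a minimal sketch of logistic regression with one feature, trained by plain gradient descent. The data (hours studied versus pass/fail), the learning rate and the number of iterations are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: hours studied -> 1 (Pass) or 0 (Fail).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0        # model parameters, learned from the data
learning_rate = 0.1    # assumed value

for _ in range(5000):
    # Gradient of the average log loss with respect to w and b.
    grad_w = grad_b = 0.0
    for x, y in zip(hours, passed):
        error = sigmoid(w * x + b) - y
        grad_w += error * x
        grad_b += error
    w -= learning_rate * grad_w / len(hours)
    b -= learning_rate * grad_b / len(hours)

def predict_proba(x):
    """Estimated probability that a student with x hours passes."""
    return sigmoid(w * x + b)

def predict(x):
    """Turn the probability into a class using a 0.5 threshold."""
    return 1 if predict_proba(x) >= 0.5 else 0

print(predict(2), predict(5))
```

The model outputs a probability, and the class is obtained by thresholding it, which is why logistic regression is used for classification despite its name.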
📏 Distance-Based Algorithms
Distance-based algorithms classify data based on distance or similarity between data points.
K-Nearest Neighbor Algorithm
KNN classifies a new data point based on the majority class among its nearest neighbors.
Steps of KNN
- Select value of K.
- Calculate distance between new point and training points.
- Find K nearest neighbors.
- Check majority class among neighbors.
- Assign that class to new point.
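The KNN steps above can be sketched as a short function. The training records are hypothetical, Euclidean distance is assumed, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

# Hypothetical training data: (attendance %, marks) -> class label.
training = [
    ((90, 75), "Pass"), ((85, 60), "Pass"), ((78, 50), "Pass"),
    ((60, 35), "Fail"), ((40, 20), "Fail"), ((55, 30), "Fail"),
]

def knn_predict(training, query, k=3):
    # Compute the distance from the query to every training point,
    # then keep the K closest ones.
    neighbors = sorted(training, key=lambda row: math.dist(row[0], query))[:k]
    # Majority vote among the K neighbors decides the class.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict(training, (82, 58), k=3))
print(knn_predict(training, (45, 25), k=3))
```

An odd value of K is a common choice for two-class problems because it avoids tied votes.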
Common Distance Measures
- Euclidean distance
- Manhattan distance
- Cosine similarity (a similarity measure: a higher value means closer, so 1 - similarity is used as the distance)
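The three measures listed above can be written directly from their definitions. This is a sketch for small numeric vectors; the function names are illustrative.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = (1, 2), (4, 6)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7
print(cosine_similarity((1, 0), (0, 1)))  # 0.0
```

Euclidean and Manhattan distance depend on the magnitude of the values, while cosine similarity depends only on direction, so features are usually scaled before using the first two.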
🌳 Decision Tree-Based Algorithms
A decision tree is a classification technique that represents decisions in the form of a tree.
Each internal node represents a test on an attribute, each branch represents an outcome of that test,
and each leaf node represents a class label.
Important Terms
- Root Node: Starting node of tree.
- Internal Node: Represents condition or test.
- Branch: Represents outcome of test.
- Leaf Node: Represents final class label.
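To make these terms concrete, here is a small hand-built tree stored as nested dictionaries. The tree itself and the attribute values ("high"/"low") are hypothetical; a real tree would be learned from training data.

```python
# Root node: tests "attendance". Each key under "branches" is one outcome.
# A string value is a leaf node (class label); a dict is an internal node.
tree = {
    "attribute": "attendance",
    "branches": {
        "low": "Fail",
        "high": {
            "attribute": "marks",
            "branches": {"low": "Fail", "high": "Pass"},
        },
    },
}

def classify(node, record):
    """Follow branches from the root until a leaf (class label) is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node

print(classify(tree, {"attendance": "high", "marks": "high"}))
print(classify(tree, {"attendance": "low", "marks": "high"}))
```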
Advantages
- Easy to understand
- Easy to represent using diagram
- Handles categorical and numerical data
- Useful for classification rules
Examples
- ID3 algorithm
- C4.5 algorithm
- CART algorithm
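These algorithms differ mainly in how they choose the attribute to split on. ID3 picks the attribute with the highest information gain, which is based on entropy. A minimal sketch of that calculation, using a made-up four-record dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy obtained by splitting rows on the given attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical records.
rows = [
    {"attendance": "high", "marks": "high", "result": "Pass"},
    {"attendance": "high", "marks": "low", "result": "Pass"},
    {"attendance": "low", "marks": "high", "result": "Fail"},
    {"attendance": "low", "marks": "low", "result": "Fail"},
]

print(information_gain(rows, "attendance", "result"))
print(information_gain(rows, "marks", "result"))
```

In this toy data, attendance separates the classes perfectly and marks does not help at all, so ID3 would place attendance at the root.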
🧠 Neural Network-Based Algorithms
Neural networks are inspired by the human brain. They consist of interconnected nodes called neurons.
Neural networks are useful for complex classification problems.
Layers in Neural Network
- Input Layer: Receives input features.
- Hidden Layer: Processes information.
- Output Layer: Produces final class prediction.
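The flow of data through the three layers can be sketched as a forward pass. The weights and biases below are arbitrary hypothetical values, not trained ones, and the sigmoid activation is an assumed choice; in practice the weights are learned from training data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of inputs, adds a bias, applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

# Hypothetical parameters: 2 inputs -> 2 hidden neurons -> 2 output neurons.
hidden_w = [[0.5, -0.2], [0.3, 0.8]]
hidden_b = [0.1, -0.1]
output_w = [[1.0, -1.0], [-1.0, 1.0]]
output_b = [0.0, 0.0]

classes = ["Fail", "Pass"]

def predict(x):
    hidden = layer(x, hidden_w, hidden_b)        # hidden layer processes the input
    output = layer(hidden, output_w, output_b)   # output layer gives one score per class
    return classes[output.index(max(output))], output

label, scores = predict([1.0, 0.5])
print(label, scores)
```

The predicted class is the one whose output neuron has the highest score.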
Advantages
- Can handle complex patterns
- Useful for large datasets
- Good for image, speech and pattern recognition
- Learns automatically from data
Limitations
- Requires large training data
- Training may take more time
- Difficult to interpret compared to decision tree
📜 Rule-Based Algorithms
Rule-based classification uses IF-THEN rules to classify data.
Example Rule
IF attendance > 75% AND marks > 40 THEN class = Pass
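The example rule can be expressed in code as an ordered list of rules. The default class "Fail" is an assumption added here, since the rule above does not say what happens when its condition is not met.

```python
# Each rule is a (condition, class) pair. Rules are checked in order
# and the first rule whose condition holds decides the class.
rules = [
    (lambda r: r["attendance"] > 75 and r["marks"] > 40, "Pass"),
]

# Assumption: records that match no rule fall back to this default class.
DEFAULT_CLASS = "Fail"

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label
    return DEFAULT_CLASS

print(classify({"attendance": 80, "marks": 55}))
print(classify({"attendance": 75, "marks": 90}))
```

The second record fails the rule because attendance must be strictly greater than 75%.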
Advantages
- Easy to understand
- Rules are simple to explain
- Useful for expert systems
- Can be generated from decision trees
Limitations
- Too many rules may become complex
- Rules may conflict
- Not suitable for very noisy data
🎲 Probabilistic Classifiers
Probabilistic classifiers classify data based on probability. They calculate the probability
of each class and assign the class with highest probability.
Naive Bayes Classifier
Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that features are independent of each other given the class.
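A minimal sketch of Naive Bayes for categorical features: the score of each class is its prior probability multiplied by the probability of each feature value given that class. The data is hypothetical, and the add-one (Laplace) smoothing is an addition made here so that unseen values do not produce a zero probability.

```python
from collections import Counter, defaultdict

def train(rows):
    """Count classes and, per class, how often each feature value occurs."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)            # feature index -> distinct values seen
    for features, label in rows:
        for i, v in enumerate(features):
            value_counts[(label, i)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values

def predict(model, features):
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior probability P(class)
        for i, v in enumerate(features):
            # P(value | class) with add-one smoothing (an assumption of this sketch).
            score *= (value_counts[(label, i)][v] + 1) / (n + len(values[i]))
        scores[label] = score
    # The class with the highest score is the prediction.
    return max(scores, key=scores.get), scores

# Hypothetical records: (attendance level, marks level) -> result.
rows = [
    (("high", "high"), "Pass"), (("high", "low"), "Pass"), (("high", "high"), "Pass"),
    (("low", "low"), "Fail"), (("low", "high"), "Fail"), (("high", "low"), "Fail"),
]

model = train(rows)
label, scores = predict(model, ("high", "high"))
print(label, scores)
```

The scores are proportional to the class probabilities, which is enough to pick the most probable class.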
Features
- Simple and fast
- Works well for text classification
- Used in spam detection
- Can work with relatively little training data
Applications
- Email spam filtering
- Sentiment analysis
- Document classification
- Disease prediction
📌 Classification Algorithm Summary
| Algorithm Type | Main Idea | Example |
| --- | --- | --- |
| Statistical-Based | Uses statistics and probability. | Logistic Regression |
| Distance-Based | Uses distance between data points. | KNN |
| Decision Tree-Based | Uses tree-like decision structure. | ID3, C4.5 |
| Neural Network-Based | Uses interconnected neurons. | Artificial Neural Network |
| Rule-Based | Uses IF-THEN rules. | Rule Classifier |
| Probabilistic | Uses probability of classes. | Naive Bayes |
⚖️ Decision Tree vs Neural Network
| Decision Tree | Neural Network |
| --- | --- |
| Easy to understand and interpret. | Difficult to interpret. |
| Tree-based structure. | Layer-based structure. |
| Works well for rule extraction. | Works well for complex patterns. |
| Requires less training time. | May require more training time. |
⭐ Important Questions
- Define supervised learning with example.
- Explain classification and its steps.
- Differentiate between classification and regression.
- Explain statistical-based classification algorithms.
- Explain distance-based classification algorithm KNN.
- Explain decision tree classifier with diagram.
- Explain neural network-based classification.
- Explain rule-based classifier with example.
- Explain probabilistic classifier and Naive Bayes.
- Compare decision tree and neural network classifier.
🔥 Last Minute Revision
- Supervised learning uses labeled data.
- Classification predicts category/class.
- KNN is distance-based classifier.
- Decision tree has root, internal and leaf nodes.
- Neural network has input, hidden and output layers.
- Rule-based classifier uses IF-THEN rules.
- Naive Bayes is probabilistic classifier.
- Decision tree is easy to interpret; neural network handles complex patterns.