Data Types, Data Quality, Preprocessing, Similarity Measures, KDD, Data Mining Tasks and Fuzzy Logic
Unit 3 introduces data and data mining concepts. It covers data types, quality of data, data preprocessing, similarity measures, summary statistics, data distributions, basic data mining tasks, KDD, issues in data mining and fuzzy logic.
Data is a collection of facts, values, observations or records. In data mining, data is analyzed to discover useful patterns and knowledge.
| Data Type | Description | Example |
|---|---|---|
| Nominal Data | Categories without order. | Gender, city, branch |
| Ordinal Data | Categories with order. | Low, medium, high |
| Interval Data | Numeric data with meaningful differences but no true zero. | Temperature in Celsius |
| Ratio Data | Numeric data with true zero. | Age, income, weight |
| Discrete Data | Countable values. | Number of students |
| Continuous Data | Measured values. | Height, time, distance |
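Data quality describes how accurate, complete, consistent and useful data is for analysis. Poor-quality data leads to wrong or misleading results in data mining.
Data preprocessing converts raw data into clean, usable data before mining. Typical steps include cleaning (handling missing or noisy values), integration, transformation (such as normalization) and reduction.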
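Two common preprocessing steps can be sketched as follows. This is only a minimal illustration: the helper names `fill_missing` and `min_max_scale` and the sample column `raw_ages` are hypothetical, and mean imputation is just one of several possible ways to handle missing values.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    avg = mean(known)
    return [avg if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw column with one missing value.
raw_ages = [20, None, 40, 30, 30]

cleaned = fill_missing(raw_ages)   # missing value replaced by the mean (30)
scaled = min_max_scale(cleaned)    # values mapped onto [0, 1]

print(cleaned)
print(scaled)
```

Scaling matters mainly for distance-based methods, where an attribute with a large numeric range would otherwise dominate the result.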
Similarity measures quantify how similar or different two data objects are. They are widely used in clustering and in distance-based classification.
| Measure | Use |
|---|---|
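| Euclidean Distance | Straight-line distance between two points in space. |
| Manhattan Distance | Sum of absolute coordinate differences, as if moving along right-angle paths. |
| Cosine Similarity | Cosine of the angle between two vectors; compares direction, not magnitude. |
| Jaccard Similarity | Size of the intersection of two sets divided by the size of their union. |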
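The four measures in the table can be written out in a few lines of plain Python. This is a sketch using the standard textbook definitions; the function names and sample inputs are chosen here for illustration only.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (undefined for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def jaccard(s, t):
    """Size of the intersection divided by size of the union of two sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(s & t) / len(s | t)

print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(cosine_similarity((1, 0), (0, 1)))  # 0.0
print(jaccard({"milk", "bread"}, {"bread", "butter"}))
```

Euclidean and Manhattan are distances (smaller means more similar), while cosine and Jaccard are similarities (larger means more similar).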
Summary statistics describe the main features of data using numerical values.
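As an illustration, the commonly used summary statistics (mean, median, mode, range and standard deviation) can be computed with Python's standard `statistics` module. The sample list `data` is made up for this sketch.

```python
import statistics

# Hypothetical sample of eight observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(data),        # arithmetic average
    "median": statistics.median(data),    # middle value of the sorted data
    "mode": statistics.mode(data),        # most frequent value
    "range": max(data) - min(data),       # spread between the extremes
    "std_dev": statistics.pstdev(data),   # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here `pstdev` treats the list as the whole population; `statistics.stdev` would be the choice if the list were a sample drawn from a larger population.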
Data distribution shows how data values are spread over a range.
Understanding distribution helps in selecting suitable data mining algorithms.
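One simple way to look at a distribution is a frequency histogram. The sketch below counts values into equal-width bins; the function name `histogram`, the bin width and the `marks` list are assumptions made for illustration.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical exam marks.
marks = [35, 42, 47, 51, 55, 58, 61, 64, 66, 72, 78, 91]

hist = histogram(marks, 10)

# Print a simple text bar chart, one row per non-empty bin.
for start, count in hist.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

A histogram like this quickly shows whether the values are roughly symmetric, skewed to one side, or contain isolated extreme values.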
| Task | Description |
|---|---|
| Classification | Assigns data into predefined classes. |
| Clustering | Groups similar data objects. |
| Association Rule Mining | Finds relationships between items. |
| Regression | Predicts continuous numeric values. |
| Prediction | Estimates unknown or future values from existing data. |
| Anomaly Detection | Finds abnormal or unusual data. |
KDD means Knowledge Discovery in Databases. It is the complete process of discovering useful knowledge from large datasets. Data mining is one important step of KDD.
| Data Mining | KDD |
|---|---|
| It is a step in KDD. | It is the complete knowledge discovery process. |
| Focuses on pattern extraction. | Includes selection, cleaning, mining and interpretation. |
| Uses algorithms. | Uses a complete methodology. |
| Output is patterns. | Output is useful knowledge. |
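The relationship between the two can be sketched as a small pipeline in which mining is only one stage. Everything here is hypothetical: the `records` data, the four function names, the use of item frequency as the mined "pattern", and the 0.7 support threshold are assumptions made for illustration, not a prescribed KDD implementation.

```python
from collections import Counter

# Hypothetical transaction records; one has missing data.
records = [
    {"id": 1, "items": ["milk", "bread"]},
    {"id": 2, "items": ["bread", "butter"]},
    {"id": 3, "items": None},
    {"id": 4, "items": ["milk", "bread", "butter"]},
]

def select(rows):
    """Selection: keep only the attribute relevant to the analysis."""
    return [row["items"] for row in rows]

def clean(baskets):
    """Cleaning: drop missing or empty baskets."""
    return [b for b in baskets if b]

def mine(baskets):
    """Mining: compute the fraction of baskets that contain each item."""
    counts = Counter(item for b in baskets for item in set(b))
    return {item: n / len(baskets) for item, n in counts.items()}

def interpret(support, min_support=0.7):
    """Interpretation: keep only items frequent enough to be useful."""
    return sorted(item for item, s in support.items() if s >= min_support)

patterns = mine(clean(select(records)))   # output of the data mining step
knowledge = interpret(patterns)           # output of the whole KDD process

print(patterns)
print(knowledge)
```

The `mine` function alone corresponds to data mining; the full chain from `select` to `interpret` corresponds to KDD.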
A fuzzy set allows partial membership. In a classical set, an element either belongs to the set or does not. In a fuzzy set, the membership value can be any number between 0 and 1.
For example, a person can be partially tall: their membership in the set "tall" may be 0.7, rather than only 0 (false) or 1 (true).
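A membership function for "tall" could be sketched as a simple linear ramp. The thresholds of 160 cm and 190 cm and the function name `tall_membership` are assumptions chosen for illustration; a real system would pick them to suit the application.

```python
def tall_membership(height_cm, lower=160.0, upper=190.0):
    """Degree (0 to 1) to which a height belongs to the fuzzy set 'tall'.

    Assumed shape: 0 at or below `lower`, 1 at or above `upper`,
    and a straight-line increase in between.
    """
    if height_cm <= lower:
        return 0.0
    if height_cm >= upper:
        return 1.0
    return (height_cm - lower) / (upper - lower)

for h in (150, 175, 181, 195):
    print(h, round(tall_membership(h), 2))
```

With these assumed thresholds, a height of 181 cm gets a membership of about 0.7, whereas a classical set would have to label the person as either tall or not tall.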
Fuzzy logic is a form of logic that handles uncertainty and approximate reasoning. It is useful where answers are not simply true or false.
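Reasoning with degrees of truth needs fuzzy versions of AND, OR and NOT. The sketch below uses the standard min, max and complement operators, which are one common choice among several; the membership values for `tall` and `heavy` are made up.

```python
def fuzzy_and(a, b):
    """Fuzzy intersection: the smaller of the two membership degrees."""
    return min(a, b)

def fuzzy_or(a, b):
    """Fuzzy union: the larger of the two membership degrees."""
    return max(a, b)

def fuzzy_not(a):
    """Fuzzy complement: one minus the membership degree."""
    return 1.0 - a

# Hypothetical membership degrees for one person.
tall = 0.7
heavy = 0.4

print("tall AND heavy:", fuzzy_and(tall, heavy))
print("tall OR heavy: ", fuzzy_or(tall, heavy))
print("NOT tall:      ", round(fuzzy_not(tall), 2))
```

When the inputs are restricted to exactly 0 and 1, these operators give the same results as ordinary Boolean logic.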