Data Mining

Class at Faculty of Mathematics and Physics | NDBI023

Syllabus

Introduction to the area of data mining

Motivation for data mining and its importance for practice, an overview of frequent data mining tasks, main data mining methodologies.

Main principles of machine learning – supervised training, self-organization, semi-supervised learning, training set, test set and validation set, generalization and overfitting, Occam's razor.

Fundamental paradigms of the data mining process

Data gathering, preparation and preprocessing – sampling, variability and confidence, discretization of numeric attributes and handling nonnumerical variables, replacement of missing and empty values, series variables.

Transformation, reduction and cleaning of the data – relationships among the attributes (similarity measures, hypothesis testing, correlation, regression, discriminant and cluster analysis), dimensionality reduction.

Validation of the obtained results – cross-validation, overall accuracy, confusion matrix, learning curve, lift curve, ROC curve, combination of models (bagging, boosting).
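As a small illustration of two of the validation measures listed above, the following sketch builds a confusion matrix and derives the overall accuracy from it. The function names and the toy spam/ham labels are assumptions made for this example only; they are not part of the course materials.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return matrix[i][j]: how often true class labels[i] was predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def overall_accuracy(matrix):
    """Fraction of examples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical toy data: true classes and a classifier's predictions.
actual = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]

cm = confusion_matrix(actual, predicted, labels=["spam", "ham"])
acc = overall_accuracy(cm)
print(cm, acc)
```

Rows correspond to the true class and columns to the predicted class, so off-diagonal entries show which classes are confused with each other.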

Techniques for association rule mining

Market basket analysis – frequent itemsets, association rules, their formulation and main characteristics.
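The main characteristics of an association rule are usually taken to be its support, confidence and lift. A minimal sketch, assuming transactions are represented as Python sets and using a made-up four-basket data set:

```python
# Hypothetical market baskets; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support; 1.0 means independence."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
l = lift({"bread"}, {"milk"}, transactions)
print(s, c, l)
```

In this toy data the lift of bread -> milk is below 1, so buying bread makes milk slightly less likely than its baseline frequency.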

Generation of frequent item combinations – the Apriori algorithm, frequent-pattern-growth techniques (FP-Growth and TD-FP-Growth), combinational data analysis.
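The level-wise idea behind Apriori can be sketched as follows: frequent k-itemsets are joined into (k+1)-item candidates, and a candidate is pruned when any of its k-item subsets is not frequent. This is a simplified, unoptimized sketch; the function name, the data representation and the toy baskets are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return all itemsets (as frozensets) whose support is at least min_support."""
    n = len(transactions)

    def keep_frequent(candidates):
        return {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}

    # Level 1: frequent single items.
    level = keep_frequent({frozenset([item]) for t in transactions for item in t})
    frequent = set(level)
    k = 2
    while level:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = keep_frequent(candidates)
        frequent |= level
        k += 1
    return frequent


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=0.5)
print(sorted(sorted(s) for s in result))
```

The pruning step is what keeps the candidate set small: here {bread, milk, butter} is never counted against the data because {milk, butter} is already known to be infrequent.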

Constraint-based search for interesting rules (specification of time, items, etc.).

Methods for cluster analysis

The k-means algorithm, the choice of a suitable metric, evaluation of the obtained results (cluster validity), representation and visualization of the found clusters.
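A minimal sketch of the standard k-means iteration (often called Lloyd's algorithm) with the Euclidean metric is shown below. Passing the initial centroids explicitly, the function name and the toy points are assumptions made for this example; practical implementations typically use randomized or k-means++ initialization.

```python
import math


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid-update steps until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, assignment = kmeans(points, initial_centroids=[points[0], points[3]])
print(centroids, assignment)
```

Replacing `math.dist` with another distance function is the point where the choice of metric enters; note that the mean-based update step is tied to the Euclidean case.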

Clustering based on the fuzzy set approach (FCM-clustering), neural approach and hierarchical clustering.

Advanced concepts – scalable techniques (CLARANS, BIRCH, CURE), outlier analysis.

Approaches to data classification and prediction

Decision trees and their induction – algorithms ID3, C4.5, CART and CHAID.
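ID3 chooses the attribute to split on by information gain, the reduction in class entropy achieved by the split. The sketch below computes both quantities; the function names, the dictionary-based row format and the weather-style toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of labels minus the weighted entropy after splitting on attribute."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical training data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rainy", "windy": True},
    {"outlook": "rainy", "windy": False},
]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
print(gain_outlook, gain_windy)
```

C4.5 and CART use different criteria (gain ratio and, typically, the Gini index), so this sketch covers only the ID3 case.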

Probabilistic classifiers – Bayesian models and techniques for their training and inference.

Nature-inspired models – artificial neural networks of the perceptron type, support vector machines (SVM), extreme learning machines (ELM), genetic algorithms.

Annotation

The rapid development of data mining is motivated by the need to "translate" huge amounts of processed and stored data into meaningful information that is easy to use in practice. This course focuses on the principal concepts and techniques applicable to data mining.

Students will apply the basic principles of these techniques to practical tasks in a student project that forms part of the course. Possible application areas are mainly business and Web applications, but others are possible as well.

Knowledge of mathematical principles and programming at the level of a BSc. in computer science is assumed.