Introduction to clustering analysis with R

Date:

July 4 to 8. Morning sessions, 9:00 to 12:00

Classroom:

Not defined yet

Instructor

Dr. Marta Nai Ruscone holds a degree in Economics and Statistical Sciences from Università Cattolica del Sacro Cuore of Milan. In 2015 she became an Assistant Professor of Statistics (SECS-S/01) at the LIUC University of Castellanza (Italy). In December 2019 she joined the Statistics group in Genova. She has taught several bachelor's, master's and PhD courses in statistics, most often on clustering and classification. Her current research interest revolves around the development of statistical methods based on copulas and their applications to multivariate data. In particular, she is interested in measures of association for ordinal and rank variables when the underlying distributions of those variables cannot be assumed to be normal. The main application is the analysis and full understanding of complex and hidden dependence patterns in correlated multivariate data, with special emphasis on model-based clustering methods.

Language

English

Description

Clustering analysis comprises a class of applied statistical techniques for classifying multivariate data into groups, or clusters, based on their similar features. Clustering is used in a wide range of domains, including social sciences, psychology, and marketing. This short course provides an accessible and comprehensive introduction to clustering and its best-known methods, and offers practical guidelines for applying clustering tools and conducting extensive data analyses. The clustering techniques will be illustrated and practiced using real-life examples and case studies and the open-source statistical software R and RStudio. The software is introduced gradually, with a detailed explanation of the code. The mix of theoretical and practical aspects is motivated by the fact that clustering methods should not be considered as “black-box” tools producing an output given some input.
The course will be delivered through a mix of lectures, practical sessions and discussion sessions.

Course goals

The course provides students with current methods and techniques of data science, using modern computational methods and software, with an emphasis on rigorous statistical thinking.
On completion of this course, students should have acquired the following skills:

An understanding of the theory behind all the clustering methods introduced
The ability to use the different techniques according to the context and the purpose of the analysis
The ability to assess the performance of the statistical learning methods introduced
The ability to use the statistical software R to implement these methods and to interpret the relevant output

Course contents

Introduction to clustering
Distance-based clustering

Introduction to distance measures
Hierarchical algorithms: agglomerative hierarchical clustering, divisive hierarchical clustering
Non-hierarchical algorithms: k-Means; k-Medoids; partitioning around medoids (PAM); divisive k-Means

Model-based clustering

Finite mixture models
EM algorithm for Finite Mixture Models
EM-type algorithms
Initializing the EM algorithm
Stopping criteria for the EM algorithm
On identifiability of Finite Mixture Models
Model selection
Gaussian Mixture Models for continuous data
Departures from Gaussian Mixture Models: Mixtures of t distributions; Mixtures of Skew-Normal distributions
Mixture Models for categorical and mixed data

Fuzzy clustering

Fuzzy k-Means
Fuzzy k-Medoids

Introductory discussion on clustering big data

The application of these methods is illustrated through examples and practical sessions involving the analysis of real-life data using the software R and RStudio.

Prerequisites

A working knowledge of statistical methods. Familiarity with the R statistical software.

Targeted at

MSc and PhD students.
Data analysis professionals interested in learning how to apply clustering.
Practitioners who aim to apply clustering algorithms using R.

Evaluation

A small applied project based on a clustering analysis of real data.

Software requirements

R
RStudio