Introduction to clustering analysis with R

Date:

July 4 to 8. Mornings: 9:00 to 12:00

Classroom:

Not defined yet

Instructor

Marta Nai Ruscone 

Dr. Marta Nai Ruscone holds a degree in Economics and Statistical Sciences from Università Cattolica del Sacro Cuore of Milan. She has been an Assistant Professor of Statistics (SECS-S/01) at the LIUC University of Castellanza (Italy) since 2015. In December 2019 she joined the Statistics group in Genova. She has taught several bachelor's, master's, and PhD courses in statistics, most often on clustering and classification. Her current research interests revolve around the development of statistical methods based on copulas and their applications to multivariate data. In particular, she is interested in measures of association for ordinal and rank variables when the underlying distributions of those variables cannot be assumed to be normal. The main application is the analysis and full understanding of complex and hidden dependence patterns in correlated multivariate data, with special emphasis on model-based clustering methods.

Language

English

Description

Clustering analysis comprises a class of applied statistical techniques for classifying multivariate data into groups, or clusters, based on their similar features. Clustering is used in a wide range of domains, including the social sciences, psychology, and marketing. This short course provides an accessible and comprehensive introduction to clustering and its best-known methods, and offers practical guidelines for applying clustering tools, along with extensive data analyses. The clustering techniques will be illustrated and practiced using real-life examples and case studies with the open-source statistical software R and RStudio. The software is introduced gradually, with a detailed explanation of the code. The mix of theoretical and practical aspects is motivated by the fact that clustering methods should not be treated as “black-box” tools that produce an output given some input.
The course will be delivered through a mix of lectures, practical sessions and discussion sessions.

Course goals

The course provides students with the current methods and techniques of data science using modern computational methods and software with an emphasis on rigorous statistical thinking.
On completion of this course, students should be able to:

  • Understand the theory behind all the clustering methods introduced
  • Use the different techniques according to the context and the purpose of the analysis
  • Assess the performance of the statistical learning methods introduced
  • Use the statistical software R to implement these methods and interpret the relevant output

Course contents

  1. Introduction to clustering
  2. Distance-based clustering
    1. Introduction to distance measures
    2. Hierarchical algorithms: agglomerative hierarchical clustering, divisive hierarchical clustering
    3. Non-hierarchical algorithms: k-Means; k-Medoids; partitioning around medoids (PAM); divisive k-Means
  3. Model-based clustering
    1. Finite mixture models
    2. EM algorithm for Finite Mixture Models
    3. EM-type algorithms
    4. Initializing the EM algorithm
    5. Stopping criteria for the EM algorithm
    6. On identifiability of Finite Mixture Models
    7. Model selection
    8. Gaussian Mixture Models for continuous data
    9. Departures from Gaussian Mixture Models: Mixtures of t distributions; Mixtures of Skew-Normal distributions
    10. Mixture Models for categorical and mixed data
  4. Fuzzy clustering
    1. Fuzzy k-Means
    2. Fuzzy k-Medoids
  5. Introductory discussion on clustering big data

The application of these methods is illustrated through examples and practical sessions involving the analysis of real-life data using the software R and RStudio.

Prerequisites

A working knowledge of statistical methods. Familiarity with the R statistical software.

Targeted at

  • MSc and PhD students.
  • Data analysis professionals interested in learning how to apply clustering.
  • Practitioners who aim to apply clustering algorithms using R.

Evaluation

A small applied project based on a clustering analysis of real data.

Software requirements