Non-Gaussian Model-based Clustering

Date:

July 6 to 10, mornings, 9:00 to 12:00.

Instructor

Marta Nai Ruscone

Prof. Marta Nai Ruscone is an Associate Professor of Statistics at the University of Genoa. She began her academic journey at the Università Cattolica del Sacro Cuore in Milan, where she earned her degree in Economics and Statistical Sciences.
Before joining the statistics group in Genoa in December 2019, she served as an Assistant Professor at LIUC University of Castellanza. Throughout her career, Prof. Nai Ruscone has developed a robust teaching portfolio, delivering specialized courses at the Bachelor, Master, and PhD levels, with a particular focus on clustering and classification. Her current research focuses on the development of statistical methods based on copulas and their applications to multivariate data. In particular, she is interested in measures of association for ordinal and rank variables when the underlying distributions of those variables cannot be assumed to be normal. The main application is the analysis of complex and hidden dependence patterns in correlated multivariate data, with special emphasis on model-based clustering methods.

https://rubrica.unige.it/personale/UkJCUlth

Language

English

Course goals

The course introduces students to current data science methods and techniques, using modern computational methods and software, with an emphasis on rigorous statistical thinking. On completion of this course, students should be able to:

- Understand the theory behind all the clustering methods introduced
- Choose among the different techniques according to the context and the purpose of the analysis
- Assess the performance of the statistical learning methods introduced
- Use the statistical software R to implement these methods and interpret the relevant output

Course contents

Model-based Clustering: Basic Ideas:

1. Finite Mixture Models

2. Estimation by Maximum Likelihood

3. Initializing the EM Algorithm

3.1. Initialization by Hierarchical Model-based Clustering

4. Choosing the Number of Clusters and the Clustering Model

Non-Gaussian Model-based Clustering:

1. Mixture of t Distributions

2. Mixture of Skew-Normal Distributions

3. Mixture of Skew-t Distributions

4. Mixture of Restricted Skew-t Distributions

5. Mixture of Unrestricted Skew-t Distributions

6. Box–Cox Transformed Mixtures

7. Generalized Hyperbolic Distribution

Prerequisites

A working knowledge of statistical methods. Familiarity with the R statistical software.

Targeted at

MSc and PhD students. Data analysis professionals interested in learning how to approach clustering. Practitioners who aim to apply clustering algorithms using R.

Evaluation

A small applied project based on a clustering analysis of real data.

Teaching Methodology and Activities

Cluster analysis is a multivariate statistical technique that identifies homogeneous groups of units within the data. One of the most commonly used techniques is model-based clustering. Model-based clustering assumes that a population is a mixture of sub-populations. Each sub-population is a mixture component, modeled through its own probability density function, and each component can be regarded as a cluster.
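
As a sketch of the mixture assumption described above, the density of an observation x can be written as a weighted sum of component densities. The notation here (G components, mixing proportions \pi_g, component parameters \theta_g, posterior probabilities z_{ig}) is an assumption made for illustration and may differ from the notation used in the course.

```latex
f(x; \Psi) = \sum_{g=1}^{G} \pi_g \, f_g(x; \theta_g),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1 .

% Posterior probability that observation x_i belongs to component g:
z_{ig} = \frac{\pi_g \, f_g(x_i; \theta_g)}
              {\sum_{h=1}^{G} \pi_h \, f_h(x_i; \theta_h)} .
```

Under this formulation, a common way to obtain a clustering is to assign each observation to the component with the largest posterior probability z_{ig}.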

The aim of this short course is to introduce model-based clustering for continuous data.

The course will start by establishing a foundation in Gaussian Mixture Models (GMMs), including a technical deep dive into the underlying Expectation-Maximization (EM) algorithm. Recognizing the limitations of the normality assumption, the course will then move on to flexible non-Gaussian techniques designed to handle more complex data structures. To bridge theory and practice, the course concludes with a hands-on tutorial using R, demonstrating the implementation of these models through key statistical packages.
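
To give a flavour of the EM algorithm mentioned above, here is a minimal sketch for a univariate Gaussian mixture. It is written in plain Python (listed under the software requirements) purely for illustration; the course tutorial itself uses R. The function names, the simulated data, and the block-based initialization are all hypothetical choices, not the course's own material.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, n_components=2, n_iter=200, tol=1e-8):
    """Fit a univariate Gaussian mixture by EM (illustrative sketch only).

    Returns (weights, means, variances, loglik), where loglik is the
    log-likelihood evaluated at the start of the last iteration.
    """
    n = len(data)
    # Assumed, simple initialization: split the sorted data into equal blocks.
    ordered = sorted(data)
    blocks = [ordered[g * n // n_components:(g + 1) * n // n_components]
              for g in range(n_components)]
    weights = [1.0 / n_components] * n_components
    means = [sum(b) / len(b) for b in blocks]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * n_components

    prev_loglik = -math.inf
    loglik = prev_loglik
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        loglik = 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])

        # M-step: weighted updates of proportions, means, and variances.
        for g in range(n_components):
            n_g = sum(r[g] for r in resp)
            weights[g] = n_g / n
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / n_g
            var_g = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / n_g
            variances[g] = max(var_g, 1e-6)  # small floor to avoid a degenerate component

        # Stop when the log-likelihood no longer improves noticeably.
        if loglik - prev_loglik < tol:
            break
        prev_loglik = loglik

    return weights, means, variances, loglik


# Simulated example: two well-separated groups centred at 0 and 5.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
weights, means, variances, loglik = em_gmm_1d(data)
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

On data like this, the estimated means should land close to the two simulated centres. In practice, results depend on the initialization, which is why the course outline devotes a topic to initializing the EM algorithm.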

Software requirements

R (free)
RStudio (free)
Python (free)