Applied Machine Learning to Solve Real-World Problems

Date

July 6 to 10, afternoons from 15:00 to 18:00.

Instructor

Jordi Moragas

Degree in Mathematics and PhD in Mathematics from Universitat Politècnica de Catalunya (UPC), with specialization in Combinatorics, Graph Theory, and Additive Number Theory.

Currently Head of Decision Sciences within Global Risk Management at BBVA Group headquarters in Madrid, working for the bank's holding company. I lead the Decision Sciences function for Retail Credit Risk, translating high-level business objectives and regulatory requirements into concrete analytical and technical strategies across the organization. My role covers the definition of the multi-year roadmap for risk modeling and advanced analytics, alignment between business, data science and technology teams, governance of modeling standards and data infrastructure, and the adoption of Machine Learning and AI methods.

Former Head of Data Science at N26 Bank in Berlin. I led a multidisciplinary team responsible for the prototyping, development, and deployment of Machine Learning models across Credit Risk, Financial Crime prevention and detection, and real-time Credit Card transaction authorization.

Previously Head of Data and Analytics at Bluecap Management Consulting, a firm specialized in financial services advisory for top-tier European banks. In this role, I worked closely with senior banking stakeholders on quantitative risk and regulatory projects, with a strong focus on Credit Risk modeling (PD and LGD), stress testing for solvency and liquidity, capital and provisions calculation frameworks, and risk-adjusted pricing using RAROC-type approaches.

https://www.linkedin.com/in/jordi-moragas-vilarnau-32795b165/

Language

English

Description

This course offers a highly practical perspective on applying modeling techniques to real-world problems, ranging from fundamental to cutting-edge methods, within quantitative business environments such as banking and finance.

Course goals

To understand the challenges of building models in real-world scenarios, from data acquisition to selecting the model and validation methods best suited to each type of problem. Additionally, to promote good practices and provide insights into common issues encountered in large-scale modeling projects.

Course contents

Introduction to Modeling

Supervised vs. unsupervised learning
Supervised:

Classification
Regression

Metrics:

Classification (AUC, Gini, AR, Somers' D, Kolmogorov-Smirnov...)
Regression (R^2, RMSE...)

Best modeling practices (train/test partition, cross-validation, overfitting control)
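As an illustration of how the classification metrics listed above relate to each other, here is a minimal sketch in plain Python. For a binary target, the Gini coefficient (also called the accuracy ratio, AR) equals 2 * AUC - 1, and the Kolmogorov-Smirnov statistic is the largest gap between the score distributions of the two classes. The function names and the toy data are hypothetical and not part of the course material.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise version is O(n^2)
    and only meant for small illustrative samples.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient (accuracy ratio) derived from AUC."""
    return 2.0 * auc(y_true, scores) - 1.0


def ks_statistic(y_true, scores):
    """Largest gap between the empirical score CDFs of the two classes."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    gaps = []
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        gaps.append(abs(cdf_pos - cdf_neg))
    return max(gaps)


# Hypothetical toy data: 1 = default, higher score = riskier.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
print(auc(y_true, scores), gini(y_true, scores), ks_statistic(y_true, scores))
```

In practice a library routine would normally replace the pairwise loop; the point here is only to make the definitions concrete.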

Feature Engineering

Data processing in real-life and Big Data settings. Common errors in data management (e.g., use of future information and data leakage)
Handling missing values (mean/median/percentiles, control dummy, K-Nearest Neighbors...)
Treatment and creation of new features (categorical variables, Weight of Evidence/continuous bucketization, dummies and one-hot encoding, alert counters...)
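The Weight of Evidence encoding mentioned above can be sketched as follows. This is a minimal example under stated assumptions: the sign convention WoE = ln(share of goods / share of bads) is one of two in common use, the additive `smoothing` term is an illustrative choice for sparse categories, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def weight_of_evidence(categories, defaults, smoothing=0.5):
    """Return ({category: WoE}, information value).

    Convention assumed here: WoE = ln(share of goods / share of bads),
    so a positive WoE marks a category that is safer than average.
    `smoothing` is added to every count to avoid log(0) in sparse
    categories; with smoothing=0 every category needs at least one
    good and one bad observation.
    """
    goods, bads = Counter(), Counter()
    for cat, d in zip(categories, defaults):
        (bads if d == 1 else goods)[cat] += 1

    levels = sorted(set(categories))
    total_good = sum(goods.values()) + smoothing * len(levels)
    total_bad = sum(bads.values()) + smoothing * len(levels)

    woe, iv = {}, 0.0
    for cat in levels:
        share_good = (goods[cat] + smoothing) / total_good
        share_bad = (bads[cat] + smoothing) / total_bad
        woe[cat] = math.log(share_good / share_bad)
        iv += (share_good - share_bad) * woe[cat]
    return woe, iv


# Hypothetical toy data: segment A has 1 default in 4, segment B has 3 in 4.
segments = ["A", "A", "A", "A", "B", "B", "B", "B"]
defaults = [0, 0, 0, 1, 0, 1, 1, 1]
woe, iv = weight_of_evidence(segments, defaults, smoothing=0.0)
print(woe, iv)
```

To avoid the data leakage issue listed earlier in this block, the WoE table should be estimated on the training partition only and then applied unchanged to test data.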

Models and Applicability Cases

Basic problem examples:

Ratings in credit risk (classification)
House prices (regression)
Image recognition (classification)

Linear models (OLS/Logistic) and regularization (LASSO, RIDGE, ElasticNet)
Decision trees and random forests
Gradient Boosting (GBM, XGBoost, LGBM, CatBoost)
Neural networks (1D & 2D CNN) for text and image processing
Generative AI and Large Language Models (LLM):

APE (Automatic Prompt Engineering)
LLM prompting instead of rules

Stacking techniques
Parameter optimization (hyperparameter tuning)
Feature interpretation in black-box models using SHAP (effect of removing/adding each feature)
Using LLMs for explanations
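To make the SHAP item above concrete ("effect of removing/adding each feature"), the following sketch computes exact Shapley values for a tiny model by averaging each feature's marginal contribution over all feature orderings. It is an illustration under assumptions, not the SHAP library's algorithm: "removing" a feature is approximated by setting it to a baseline value, `toy_model` is a hypothetical scoring function, and the enumeration of orderings only scales to a handful of features.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    Starting from the baseline input, features are switched to their
    actual values one at a time; each switch's change in model output
    is credited to the feature that was switched.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # "add" feature i
            new = model(current)
            phi[i] += new - prev       # marginal contribution of feature i
            prev = new
    return [p / len(orderings) for p in phi]


def toy_model(f):
    # Hypothetical score with an interaction between features 0 and 2.
    return 2.0 * f[0] + 1.0 * f[1] + 0.5 * f[0] * f[2]


x = [1.0, 3.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions always add up to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between the two features involved.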

Calibration of Classification Models

Need for calibration
Confusion matrix and pay-off methodology
Basic calibration techniques of the probability curve (Platt scaling/binomial model, isotonic regression, Bayesian central tendency correction)
Advanced calibration techniques of the probability curve (Tasche: "The art of probability-of-default curve calibration")
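One common form of the Bayesian correction listed among the basic techniques above re-anchors a predicted probability of default (PD) from the default rate of the development sample to a target long-run rate, using Bayes' rule. The sketch below assumes that form; the function and parameter names are hypothetical, and the course may present a different variant.

```python
def shift_to_central_tendency(pd_model, sample_rate, central_tendency):
    """Re-anchor a PD from the sample default rate to a target rate.

    Equivalent to shifting the log-odds of the prediction by
    logit(central_tendency) - logit(sample_rate), which preserves
    the rank ordering of the scores.
    """
    w_bad = central_tendency / sample_rate
    w_good = (1.0 - central_tendency) / (1.0 - sample_rate)
    return pd_model * w_bad / (pd_model * w_bad + (1.0 - pd_model) * w_good)


# Hypothetical numbers: the model was trained on a sample with a 20%
# default rate, while the portfolio's long-run default rate is 5%.
for p in (0.05, 0.20, 0.60):
    print(p, shift_to_central_tendency(p, sample_rate=0.20, central_tendency=0.05))
```

A prediction equal to the sample default rate is mapped to the target rate, and predictions of 0 and 1 are left unchanged.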

Model Validation

Stability
Performance
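Stability checks are often based on comparing the distribution of a score or feature between the development sample and a more recent sample. The outline does not name a specific metric, so the Population Stability Index (PSI) used below is an assumption chosen for illustration; the function name, the `floor` guard, and the bucket shares are hypothetical.

```python
import math


def population_stability_index(expected_shares, actual_shares, floor=1e-6):
    """PSI = sum over buckets of (actual - expected) * ln(actual / expected).

    Both inputs are bucket shares that sum to 1 over the same buckets.
    `floor` guards against empty buckets, which would otherwise give log(0).
    """
    psi = 0.0
    for e, a in zip(expected_shares, actual_shares):
        e, a = max(e, floor), max(a, floor)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical score-bucket shares at development time vs. today.
development = [0.25, 0.25, 0.25, 0.25]
recent = [0.20, 0.25, 0.25, 0.30]
print(population_stability_index(development, recent))
```

The index is zero when the two distributions coincide and grows as they drift apart.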

Model Development in a Real Case and Kaggle Evaluation Contest

Building a classification model from scratch to predict the probability of default for corporations

Prerequisites

Basic algebra and statistics. Programming experience (the course will be taught in Python).

Targeted at

Students interested in applying current Machine Learning techniques in a production environment. Graduates in mathematics, statistics, engineering and related quantitative fields.

Evaluation

An internal modeling contest will be held on Kaggle.

Teaching Methodology and Activities

The course will be structured as a combination of theoretical instruction and hands-on practical exercises to ensure a strong understanding of the different concepts and methodologies.

Each class will be led by the instructor, who will present the core concepts through slides and structured explanations. Theoretical discussions will be alternated with guided coding exercises, where students will implement the concepts introduced in class. These coding tasks will be carefully designed to follow a step-by-step approach, with the instructor providing pre-written code to support learning and minimize setup difficulties. The primary objective of these practical exercises is to provide students with the necessary skills to train and evaluate Machine Learning models, aligning with the course evaluation criteria.

Student participation is highly encouraged throughout the sessions. Questions, discussions, and interactions will be welcomed to create an engaging learning environment. Active engagement will help reinforce key concepts and allow students to clarify any doubts in real time.

Given the hands-on nature of the course, it is recommended that students bring a laptop with the required software installed. Alternatively, students may collaborate in small groups (2-3 students) to share a laptop. Support will be provided during the sessions to assist with software installation and setup to ensure that all participants can actively engage in the practical exercises.

Software requirements

Python 3 will be the preferred tool for its ease of use and flexibility. Students may also use R if they feel more comfortable with it.