Applied Machine Learning to Solve Real Problems - July 1st to 5th

Date:

July 1st to 5th. Afternoon, 3:00 PM to 6:00 PM

Classroom:

PC2

Instructor

Jordi Moragas Vilarnau

Degree in Mathematics and PhD in Mathematics (Combinatorics, Graph Theory and Additive Number Theory) from the UPC, Department of Applied Mathematics IV.

Senior Manager at Bluecap Management Consulting, a firm specializing in financial services and consultancy projects for top-tier European banks in regulation, strategy and risk management. As an expert in quantitative risk analysis, his main tasks are the development, implementation and validation of internal credit risk models (rating/scoring systems and calibration of PD and LGD under the Basel III and IFRS 9 frameworks), stress testing (solvency and liquidity), capital and impairments, pricing and RAROC.

He makes extensive use of Machine Learning techniques, advanced programming tools (Python, R) and data management tools (SAS/Oracle/SQL).

Language

English

Description

This course aims to provide a highly practical view of how modeling techniques, from the most basic to the state of the art, are applied to real problems in highly quantitative business environments (banking and finance).

Course goals

To learn the challenges involved in building a model in the real world, from obtaining the data to choosing the best model and the validation mechanisms appropriate to the type of problem. The course also covers good practices and the usual problems encountered in a large modeling project.

Course contents

Introduction to modeling

  • Supervised vs. unsupervised learning
  • Supervised:
    • Classification
    • Regression
  • Metrics
    • Classification (AUC, Gini, AR, Somers' D, Kolmogorov-Smirnov...)
    • Regression (R^2, RMSE...)
  • Best modeling practices (train / test partition, cross-validation, overfitting control)
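
As a rough illustration of the classification metrics listed above, the following sketch computes AUC, the Gini coefficient (taken here as 2·AUC − 1, which for a binary target is commonly identified with the accuracy ratio) and the Kolmogorov-Smirnov statistic on a small made-up sample. The function names and the toy labels and scores are assumptions for illustration only, not course material.

```python
def split_scores(labels, scores):
    """Separate scores of positives (label 1) and negatives (label 0)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return pos, neg


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in sample size, so this
    is only meant for small illustrative inputs.
    """
    pos, neg = split_scores(labels, scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient derived from AUC for a binary target."""
    return 2 * auc(labels, scores) - 1


def ks_statistic(labels, scores):
    """Largest gap between the empirical score CDFs of the two classes."""
    pos, neg = split_scores(labels, scores)
    gaps = []
    for t in sorted(set(scores)):
        cdf_pos = sum(p <= t for p in pos) / len(pos)
        cdf_neg = sum(n <= t for n in neg) / len(neg)
        gaps.append(abs(cdf_pos - cdf_neg))
    return max(gaps)


# Hypothetical toy sample: 1 = event (e.g. default), higher score = riskier.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.70, 0.90]

print("AUC :", auc(labels, scores))
print("Gini:", gini(labels, scores))
print("KS  :", ks_statistic(labels, scores))
```

In practice these quantities would be computed on a held-out test partition or within cross-validation folds, not on the data used to fit the model.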

Feature Engineering

  • Data processing in real life and Big Data. Typical errors in data management (e.g. use of future information)
  • Missing-value treatment (mean / median / percentiles, control dummy, K-Nearest Neighbors...)
  • Treatment and creation of new factors (categorical variables, Weight of Evidence / continuous bucketization, dummies and one-hot encoding, alert counters...)
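
A minimal sketch of the Weight of Evidence encoding mentioned above, under one common convention: for each bucket, WoE = ln(share of non-events in the bucket / share of events in the bucket). The sign convention, the function name and the toy data are assumptions; some references use the opposite sign.

```python
import math
from collections import Counter


def weight_of_evidence(categories, events):
    """Return a {category: WoE} mapping for a binary event flag.

    Assumes every category contains at least one event and one non-event;
    otherwise the logarithm is undefined and some smoothing would be needed.
    """
    non_events = Counter(c for c, e in zip(categories, events) if e == 0)
    event_counts = Counter(c for c, e in zip(categories, events) if e == 1)
    total_non_events = sum(non_events.values())
    total_events = sum(event_counts.values())

    woe = {}
    for c in set(categories):
        share_non_events = non_events[c] / total_non_events
        share_events = event_counts[c] / total_events
        woe[c] = math.log(share_non_events / share_events)
    return woe


# Hypothetical toy data: bucket "A" is mostly non-events, "B" mostly events.
categories = ["A", "A", "A", "A", "B", "B", "B", "B"]
events = [0, 0, 0, 1, 0, 1, 1, 1]

woe = weight_of_evidence(categories, events)
encoded = [woe[c] for c in categories]  # numeric feature replacing the category
print(woe)
```

To avoid using future information, the mapping would normally be estimated on the training partition only and then applied unchanged to the test data.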

Models and applicability cases

  • Basic problem examples:
    • Ratings and counterparty risk (classification)
    • Pricing (regression)
    • Sentiment analysis (classification)
  • Linear models (logistic / OLS) and regularization (LASSO, Ridge, Elastic Net)
  • Decision trees and random forests
  • Gradient Boosting (GBM, XGBoost, LightGBM, CatBoost)
  • Neural networks (1D CNN) for text mining and natural language processing
  • Parameter optimization (hyperparameter tuning)
  • Stacking techniques (to win contests)
  • Factor interpretation in black-box models
    • SHAP (effect of removing/adding each feature)
    • LIME (local linear behavior)

Calibration of classification models

  • Need for calibration
  • Confusion matrix and pay-off metrics
  • Basic calibration techniques of the probability curve (binomial model)
  • Advanced calibration techniques of the probability curve (Tasche: "The art of probability-of-default curve calibration")
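
To make the need for calibration concrete, the sketch below shows one simple approach: shifting all scores by a constant on the log-odds scale so that the average predicted probability matches a target rate. This is an illustrative technique chosen for brevity, not necessarily the binomial model or any of the methods from the Tasche paper referenced above; the function names, the target rate and the raw probabilities are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def calibrate_intercept(raw_probs, target_rate, tol=1e-12):
    """Find the log-odds shift that makes the mean probability hit the target.

    Uses bisection: the mean of sigmoid(logit(p) + shift) is increasing in
    the shift, so the solution is unique within the search interval.
    """
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        mean_prob = sum(sigmoid(logit(p) + mid) for p in raw_probs) / len(raw_probs)
        if mean_prob < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical raw model outputs with a mean of 18.75%, to be calibrated
# to an assumed long-run event rate of 10%.
raw_probs = [0.05, 0.10, 0.20, 0.40]
target_rate = 0.10

shift = calibrate_intercept(raw_probs, target_rate)
calibrated = [sigmoid(logit(p) + shift) for p in raw_probs]
print("shift:", shift)
print("calibrated:", calibrated)
```

Because the shift is the same for every observation, the ranking of the scores (and therefore AUC and Gini) is unchanged; only the probability level moves.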

Model Validation

  • Stability
  • Performance

Model development in real cases and Kaggle competitions

  • Rating
  • House prices
  • Sentiment analysis of news (including web scraping)

Prerequisites

Basic algebra and statistics. Programming experience.

Targeted at

Students who wish to see the application of current Machine Learning techniques in a production environment. Graduates in mathematics, statistics and engineering.

Evaluation

An internal modeling contest will be held on Kaggle.

Computer class or student's laptop?

Student's laptop

Software requirements

Python 3 (free) will be the preferred tool because of its ease of use and flexibility (use the latest stable version that supports TensorFlow). Students may also use R if they feel more comfortable with it.