Introduction to Statistical Data Privacy

Date:

June 20 to 23. Mornings: 9:00 to 13:00 (June 20, 21, and 22) and 9:00 to 12:00 (June 23).

Instructor

Jingchen (Monika) Hu

Jingchen (Monika) Hu is an Associate Professor of Mathematics and Statistics at Vassar College. She completed a B.S. in Computing Mathematics at City University of Hong Kong and an M.S. and a Ph.D. in Statistical Science at Duke University in North Carolina, United States.

Monika’s main research interest is statistical data privacy. She focuses on developing Bayesian methodology for creating synthetic datasets that maintain high utility while preserving privacy protection. More recently, Monika has been working on creating synthetic datasets that satisfy differential privacy. She teaches an undergraduate senior seminar on statistical data privacy at Vassar College and has given short courses on statistical data privacy and Bayesian methods at several federal statistical agencies in the United States. Monika is currently a consultant on several privacy projects at the New York City Department of Health and Mental Hygiene.

Monika is an Associate Editor for the Journal of Survey Statistics and Methodology and the INFORMS Journal on Computing, and is on the editorial board of Transactions on Data Privacy. Her work is funded through several U.S. National Science Foundation grants. In addition to scholarly research, Monika publishes articles on statistics education and is a co-author of an undergraduate Bayesian textbook, Probability and Bayesian Modeling.

Language

English

Description

This course is an introduction to the ideas of statistical data privacy, with an emphasis on synthetic data for privacy purposes and differential privacy as a privacy measure.

We will begin with an overview of the field of statistical data privacy (approximately 3 hours). Using examples, we will explain what we mean by privacy and why it is important. We will describe the different contexts in which privacy is of interest and the different sub-fields of work on privacy, and review the methods classically used by agencies for statistical data privacy.

For the portion on synthetic data (approximately 6 hours), we will explain what synthetic data are, how to assess their utility and their privacy protection, and a few methods for generating them.

For the portion on differential privacy (approximately 6 hours), we will explain the origin of this formal privacy measure and look in detail at its mathematical definition, its meaning, and its limitations. We will also present concrete applications of differential privacy by statistical agencies and private companies. We will then delve into more technical details, covering different mechanisms for achieving differential privacy in statistical tasks.
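
For orientation, the mathematical definition referred to above is usually stated as follows. This is the standard textbook formulation, not necessarily the notation used in the course; the symbols $\mathcal{M}$, $D$, $D'$, $S$, $f$, and $\Delta f$ are introduced here for illustration only.

```latex
% epsilon-differential privacy: a randomized mechanism M satisfies it if,
% for every pair of datasets D, D' differing in one record and every set S
% of possible outputs,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S].
\]

% One simple mechanism (the Laplace mechanism) for a numeric query f:
% add noise scaled to the query's L1 sensitivity.
\[
  \Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1,
  \qquad
  \mathcal{M}(D) = f(D) + (Y_1, \dots, Y_k),
  \quad Y_i \overset{\text{iid}}{\sim} \mathrm{Laplace}\!\left(0, \tfrac{\Delta f}{\varepsilon}\right).
\]

% Worked example: a counting query has sensitivity 1. With epsilon = 0.5,
% the noise scale is b = 1 / 0.5 = 2, so the noise standard deviation is
\[
  \sqrt{2}\, b = 2\sqrt{2} \approx 2.83 .
\]
```

A smaller $\varepsilon$ gives a stronger privacy guarantee but requires a larger noise scale, which is the privacy-utility trade-off that the mechanisms in this portion of the course have to manage.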

All hands-on materials are in R.

Course goals

Students will learn the concept of statistical data privacy and two modern approaches: synthetic data and differential privacy. For each approach, we introduce the method along with hands-on applications in R. After taking the course, students will be able to choose appropriate methods for statistical data privacy tasks (such as producing a privacy-preserving version of a confidential dataset) and evaluate the effectiveness of the implemented methods.

Course contents

Overview of Statistical Data Privacy - part 1 Motivations and Definitions (1.5 hours)

Overview of Statistical Data Privacy - part 2 Different Approaches to Privacy (1.5 hours)

Synthetic Data - part 1 Bayesian Modeling and Generation of Synthetic Data (3 hours)

Synthetic Data - part 2 Utility Assessment of Synthetic Data (1.5 hours)

Synthetic Data - part 3 Risk Assessment of Synthetic Data (1.5 hours)

Differential Privacy - part 1 Understanding the Definition and Its Limits (2.5 hours)

Differential Privacy - part 2 Simple Mechanisms with Examples (2 hours)

Differential Privacy - part 3 Statistics, Models, and Practical Applications (1.5 hours)

Prerequisites

Experience in R programming

Targeted at

Students interested in statistical data privacy (undergraduates and postgraduates)

Evaluation

Learning will be evaluated by hands-on examples and exercises during the course

Software requirements

R and RStudio