Big Data management with R (July 2 to 6)

Prof. B. Skorulski (Schibsted), Language: English, AFTERNOON: 3.00 to 6.00 pm.

Course title

Big Data management with R.

Faculty

Bartek Skorulski. Senior Data Scientist SCRM (Lidl). bartekskorulski@gmail.com

Bartek Skorulski holds a PhD in Mathematics and works as a Data Scientist at Schibsted, an international media group that is one of the world’s leading online classified ads businesses. The companies he has worked for include King (Activision Blizzard) and SCRM (Lidl). The projects he has been involved in include recommender systems, forecasting tools, predictive models, A/B testing tools, analysis of complex scenarios, and the development of analysis strategies. He also has more than 10 years of experience as an academic researcher and teacher. His research areas include Dynamical Systems, Probability & Statistics, Numerical Analysis, and Formal Mathematics. He was also involved in developing Mizar, a computer system that automatically checks whether a proof of a mathematical theorem is correct.

Course language

English

Course schedule

July 2nd to 6th, from 15:00 to 18:00h

Description

We will explain how a Data Scientist trained to work with R can work with Hadoop without needing to learn Scala. We will discuss two possible workflows.

The first approach is to use SQL on Hive or Impala. This implies that the most important part of the analysis has to be done locally.
The second approach is to run part of the analysis on the cluster using SparkR/sparklyr.

Both methods have their advantages and limitations. We will discover them by working on a project that involves creating a predictive model.

Lecture plan

Data Scientist vs. Big Data: DS workflow.
Big Data Storage.
Hive/Impala/Spark: adding SQL to R.
Analytics and Machine Learning on clusters with SparkR: opportunities and limitations. Comparing R/Python/Scala.

Evaluation

A project that involves training and testing a predictive model.

Classroom

PC2