Big Data management with R

 

Course title

Big Data management with R.

 

Faculty

Bartek Skorulski, Senior Data Scientist at SCRM (Lidl).

Bartek Skorulski holds a PhD in Mathematics and works as a Data Scientist on the Lidl Plus project, a project that will change customer service and the way people shop in the chain of more than 10,000 Lidl supermarkets across Europe. His role is setting up a hypothesis-driven decision-making environment and creating recommender systems. He has previously worked for King (Activision Blizzard), a very successful game company with more than 300 million unique players every month and one of the largest user databases in the world. There he participated in developing and improving the best bubble shooters ever, Bubble Witch 2 and 3; his role was performing analysis of complex scenarios, developing analysis strategies, and creating predictive models. He has also worked for engineering and online marketing companies. In addition, he has more than 10 years of experience as an academic researcher and teacher. His areas of research include Dynamical Systems, Probability & Statistics, Numerical Analysis, and Formal Mathematics. He was also involved in developing the Mizar computer system, which automatically checks whether a proof of a mathematical theorem is correct. He co-organizes two meetups in Barcelona: the R Users Group and the Grup d'Estudi de Machine Learning.

 

Course language

English (however, the professor also speaks Spanish).

 

Course schedule

June 27 to 29: 9:00 to 13:00.

June 30: 9:00 to 12:00.

 

Description

We are going to explain how Data Scientists trained to work with R can work with Hadoop without needing to learn Scala. We will discuss two possible workflows:

  • The first approach is to use SQL on Hive or Impala. This implies that the most important part of the analysis has to be done locally.
  • The second method is to do part of the analysis on the cluster using sparkR/sparklyr.

Both methods have their advantages and limitations. We are going to discover them by working on a project that involves creating a predictive model; a minimal sketch of both workflows follows.
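The sketch below is illustrative only: the ODBC DSN name "HiveDSN", the table name "events", and the Spark master are assumptions, not part of the course material.

  # Workflow 1: push SQL down to Hive/Impala and pull only the aggregate back into R
  library(DBI)
  library(odbc)

  con <- dbConnect(odbc::odbc(), dsn = "HiveDSN")   # hypothetical DSN for the Hive/Impala endpoint
  daily <- dbGetQuery(con, "
    SELECT event_date, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date")
  dbDisconnect(con)

  # Workflow 2: keep the data on the cluster and express the analysis with sparklyr + dplyr
  library(sparklyr)
  library(dplyr)

  sc     <- spark_connect(master = "yarn-client")   # or "local" for testing
  events <- tbl(sc, "events")                       # lazy reference, no data is moved yet
  daily_sdf <- events %>%
    group_by(event_date) %>%
    summarise(n_events = n())                       # computed on the cluster
  collect(daily_sdf)                                # only the small summary comes back to R
  spark_disconnect(sc)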

 

Lecture plan

  • Hadoop/Spark: why, what, how.
  • Data Scientist vs. Big Data: DS workflow.
  • Hive/Impala/Spark: adding SQL to R.
  • User-Defined Functions with Spark.
  • Machine Learning on clusters with sparkR: opportunities and limitations (a short illustration follows this list).
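As a rough illustration of the last two points, the sketch below uses sparklyr (the exact API covered in class may differ): spark_apply() runs a user-defined R function on each partition of a Spark DataFrame, and ml_logistic_regression() fits a model on the cluster. The mtcars data, the local master, and the column choices are assumptions for illustration only.

  library(sparklyr)
  library(dplyr)

  sc <- spark_connect(master = "local")              # a real cluster would use e.g. "yarn-client"
  mtcars_sdf <- copy_to(sc, mtcars, overwrite = TRUE)

  # User-defined function: executed by R workers on each partition of the Spark DataFrame
  scaled <- spark_apply(mtcars_sdf, function(df) {
    df$mpg_scaled <- as.numeric(scale(df$mpg))
    df
  })

  # Machine learning on the cluster: a logistic regression fitted by Spark MLlib
  model       <- ml_logistic_regression(mtcars_sdf, am ~ mpg + wt)
  predictions <- ml_predict(model, mtcars_sdf)

  spark_disconnect(sc)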

 

Evaluation

A project that involves training and testing a predictive model.

 

Classroom

PC2