Web scraping: Getting data from the web with R (June 25 to 29)
Prof. A. Sánchez (UB). Language: English. Afternoon: 3:00 to 6:00 pm.
Title
Web scraping: Getting data from the web with R
Instructor
Alex Sánchez holds a degree in Biology (1980) and a PhD in Statistics (1996) from the Univ. de Barcelona, and a Master in Bioinformatics from the University of Manchester. He is a professor in the Genetics, Microbiology and Statistics Department at the Univ. de Barcelona and director of the Unidad en Estadística y Bioinformática del Instituto de Investigación del Hospital de la Vall d’Hebrón.
Course language
English
Schedule
June 25th to 29th, from 15:00 to 18:00
Description
Data found on the web often comes in formats that require some preprocessing before it can be analyzed. This course explores techniques for understanding these formats so that you can retrieve data from the web and extract the desired information. The first part introduces the most common web technologies, how they relate to one another, and some tools for manipulating and extracting information. The most common formats for storing web information (HTML, XML, JSON) are presented, together with tools for extracting data from them, such as XPath and CSS selectors. Finally, we introduce some R packages suited to processing web information.
Course goals
Specifically, at the end of the course students should:
- Be familiar with the main technologies for dealing with information stored on the web.
- Be able to recognize the different formats that can be used for storage.
- Know how to extract information from these formats using specific R packages.
Course contents
- Introduction to web technologies. Web scraping and web scraping projects.
- Data representation on the web: HTML, XML, JSON. Other technologies.
- Regular expressions for data manipulation.
- Parsing HTML and XML. Using CSS selectors and XPath.
- Case study: Scraping Twitter for Sentiment Analysis.
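As a flavour of the regular-expression topic listed above, here is a minimal sketch in base R. The HTML fragment, the `price` class and the variable names are hypothetical and chosen only for illustration; in the course, dedicated parsing packages would normally handle the HTML structure, with regular expressions used to clean the extracted text.

```r
# Hypothetical HTML fragment standing in for a downloaded page (an assumption)
html <- '<ul><li class="price">EUR 12.50</li><li class="price">EUR 7.00</li></ul>'

# Pattern: one or more digits, a literal dot, then exactly two digits
pattern <- "[0-9]+\\.[0-9]{2}"

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(html, gregexpr(pattern, html))[[1]]

# Convert the extracted strings to numbers so they can be analyzed
prices <- as.numeric(matches)
print(prices)
```

This approach works for simple, regular text, but it does not understand the document tree, which is why CSS selectors and XPath are covered separately.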
Evaluation
Evaluation will be based on a practical exercise. Each student will propose and implement a scraping task of at least minimal complexity, consistent with the course contents. Some proposals will be available for those who prefer a guided exercise.
Classroom
PC3