Algorithms for Data Science

This course introduces key algorithmic techniques for solving large-scale data science problems, focusing on efficient data processing and analysis methods. Students will explore fundamental topics such as frequent itemset mining, mining similar items, and data stream algorithms. The course blends theoretical lectures with practical lab sessions that reinforce concepts through hands-on experience. Students also complete a project applying these algorithms to real-world data, and the course ends with a written exam assessing their mastery of the material.

Organization

Each session begins with a lecture based on the provided slides, followed by a graded lab session applying the techniques introduced in the lecture. Labs are to be completed individually during the session and submitted as indicated below.

Learning Outcomes

By the end of the course, students should be able to:

  • Implement scalable algorithms for mining frequent and similar patterns.
  • Apply sketching and sampling techniques for large data streams.
  • Evaluate trade-offs between accuracy, memory, and time in streaming and approximate methods.
  • Design and test small-scale experiments on real datasets.

Course Structure

  • Week 1 - Intro, Frequent Itemset Mining

    wget "https://phparis.net/uploads/m2_ds_algods_lab1_frequent.ipynb"
    
  • Week 2 - Mining Similar Items

    wget "https://phparis.net/uploads/m2_ds_algods_lab2_similar.ipynb"
    
  • Week 3 - Data Stream Algorithms

    wget "https://phparis.net/uploads/m2_ds_algods_lab3_filtering.ipynb"
    
  • Week 4 - Data Stream Algorithms (continued)

    wget "https://phparis.net/uploads/m2_ds_algods_lab4_distinct.ipynb"
    
  • Week 5 - Project

    • Project
    • To upload your labs and the project:
      • Rename your lab file as last_first_labXXX.ipynb, where last is your last/family name, first is your first name, and XXX is the lab number.
      • Then upload it here: repository
      • 🚨 DO NOT USE EMAILS! ONLY NOTEBOOKS (NO PYTHON SCRIPT) 🚨
  • Week 6 - Advertising on the Web

    • ads
    • To download the lab, run:
      wget "https://phparis.net/uploads/m2_ds_algods_lab6_adwords.ipynb"
    
  • Week 7 - Exam

Evaluation

  • 10% labs
  • 40% project (programming assignment) – starting week 5
  • 50% written exam (exercises, course questions) – week 7

Environment

All labs use Python 3 and Jupyter notebooks. Required libraries include numpy, pandas, and matplotlib. Install them beforehand.
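
As a quick check before the first lab, the snippet below is a minimal sketch that reports which of the required libraries are not yet installed. The helper name `missing_libraries` and the constant `REQUIRED` are illustrative choices, not part of the course material.

```python
import importlib.util

# Libraries named in the Environment section above.
REQUIRED = ["numpy", "pandas", "matplotlib"]


def missing_libraries(names):
    """Return the names in `names` that the import system cannot locate."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_libraries(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```

Any library reported as missing can typically be installed with `python3 -m pip install numpy pandas matplotlib`, though the exact command may depend on how Python is set up on your machine.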

References

  1. J. Leskovec, A. Rajaraman, J. Ullman. “Mining of Massive Datasets”. site

Previous exams

Pierre-Henri Paris
Associate Professor in Artificial Intelligence

My research interests include Knowledge Graphs, Information Extraction, and NLP.