LSML 21: Large Scale Machine Learning
PSL week Spring Course 2021
Large-Scale Machine Learning
March 8-12, 2021
MINES ParisTech, 60 boulevard Saint-Michel, 75006 Paris
This course is co-organized by Chloé-Agathe Azencott (MINES ParisTech & Institut Curie) and Fabien Moutarde (MINES ParisTech).
Outline
Machine learning is a fast-growing field at the interface of mathematics, computer science and engineering, which provides computers with the ability to learn without being explicitly programmed, in order to make predictions or take rational actions. From cancer research to finance, natural language processing, marketing or self-driving cars, many fields are nowadays impacted by recent progress in machine learning algorithms that benefit from the ability to collect huge amounts of data and "learn" from them.
The goal of this intensive 5-day advanced course is to present the theoretical foundations and practical algorithms to implement and solve large-scale machine learning and data mining problems, and to expose the students to current applications and challenges of "big data" in science and industry.
Prerequisites:
- Numerical Python (i.e., familiarity with programming in Python and with the numpy, scipy and matplotlib libraries).
- Basics of machine learning (such as the content of the Apprentissage Artificiel course for MINES ParisTech students).
Schedule
Practical sessions are only open to officially enrolled PSL students taking the course for credit.
Monday, March 8, 2021
- 09:00 – 12:15 Lecture: Introduction to large-scale ML & optimization (C.-A. Azencott)
- [whiteboard] · [slides]
- Video recordings available upon request
- See also [the 2019 slides] for further information.
- 13:45 – 17:00 Practical session: ML on large data with scikit-learn; this session will also contain an introduction to scikit-learn for those who have not used the library before.
Tuesday, March 9, 2021
- 09:00 – 12:15 Lecture: Deep learning, convolutional neural networks, and generative models (F. Moutarde)
- 13:45 – 17:00 Practical session: Deep learning with Python
Wednesday, March 10, 2021
- 09:00 – 12:15 Lecture: Deep reinforcement learning (F. Moutarde)
- 13:45 – 17:00 Practical session: Deep reinforcement learning with Python
Thursday, March 11, 2021
- 09:00 – 12:15 Lecture: Systems for large-scale ML: focus on MapReduce (C.-A. Azencott)
- [whiteboard] · [slides]
- Video recordings available upon request
- 13:45 – 15:45 Practical session: Stochastic Gradient Descent
Friday, March 12, 2021
- 09:30 – 12:30 Guest lecture: Large-Scale Natural Language Processing (NLP) by Édouard Grave (Facebook AI Research Paris).
- 14:00 – 16:00 Exam.
Registration
PSL students must enroll officially through their institutions.
MINES ParisTech students and staff can attend the lectures remotely by connecting to room L.109 via Zoom. All course materials will be in English, but some lectures will be given in French.
PhD students who want to participate may email me to register and receive a certificate of attendance. These students will also be allowed to attend the practical sessions, although priority will be given to assisting engineering students who are officially enrolled.
Grade
If you are taking this class for credit, you will be asked to turn in the notebooks of all your practical sessions.
There will also be a written exam.
Total credits: 2 ECTS.
Textbook
There is no single textbook for this course, but the following resources are relevant:
- Mining of massive datasets by Leskovec, Rajaraman and Ullman;
- Deep learning by Goodfellow, Bengio and Courville;
- Large-Scale Optimization: Beyond Stochastic Gradient Descent and Convexity by Sra and Bach.
This course is not an introductory course to machine learning! If you want to learn the basics, or need a refresher, we recommend:
- In French, the lectures of the Parcours Data Scientist on OpenClassrooms (videos and texts are freely accessible);
- In French, Introduction au Machine Learning. Chloé-Agathe Azencott, Collection InfoSup, Dunod, 2018. EAN: 9782100780808. An electronic version (without exercises) is available here;
- In French, Apprentissage statistique supervisé by Fabien Moutarde in Techniques de l'Ingénieur;
- In English, Machine learning by Andrew Ng on Coursera;
- In English, The elements of statistical learning by Hastie, Tibshirani and Friedman;
- In English, Pattern recognition and machine learning by Bishop.
Practical sessions
Practical sessions will take the form of Jupyter notebooks, available on the course GitHub repo.
Personal installation
You will need to have Python 3 and all the relevant packages installed.
The easiest way to install all the requirements is to install Anaconda. You can test your installation by downloading one or several of the SciPy 2016 notebooks, starting Anaconda and then Jupyter, opening the notebook(s) and running them.
If you prefer, you can also install only the required packages (numpy, scipy, matplotlib, seaborn, joblib, scikit-learn, tensorflow, keras and jupyterlab) with pip or conda.
An alternative (sometimes preferable for deep learning notebooks) is to use Google Colab, for which you will need a Google account.
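Whichever installation route you choose, a quick way to see which of the required packages Python can find is a short script like the one below. This is only a minimal sketch: the helper name check_packages and the REQUIRED list are ours, not part of the course materials, and the list uses import names, which sometimes differ from the package names (scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names for the packages listed above (an assumption of this sketch:
# scikit-learn is imported as "sklearn", JupyterLab as "jupyterlab").
REQUIRED = ["numpy", "scipy", "matplotlib", "seaborn", "joblib",
            "sklearn", "tensorflow", "keras", "jupyterlab"]


def check_packages(names):
    """Map each top-level module name to True if Python can locate it.

    find_spec only looks the module up; it does not import it, so the
    check stays fast even for heavy packages such as tensorflow.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(f"{name:12s} {'OK' if found else 'MISSING'}")
```

Any package reported as MISSING can then be installed with pip or conda. Note that this only checks that a package can be located, not that its version is compatible with the notebooks.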
Git-what?
GitHub is a web-based repository hosting service that provides version control and source code management. It is built on the git version control system. A version control system automatically manages the different versions and drafts of a document; in essence, it is the grown-up version of lab1_final_v2.2_chloe-copy-1.ipynb. You can read more about the benefits of version control here. Git (and GitHub) are widely used in tech nowadays.
GitHub offers both private and public repositories, and supports free accounts for academics. Here is a short tutorial on how to use GitHub to version control your own copy of the labs:
- Log onto GitHub (start by signing up if you do not have an account)
- Create a fork of the lsml19 repository. A fork is a copy you own and can experiment with without changing the project. To do so: navigate to https://github.com/chagaz/lsml19 and click "Fork" in the upper right corner.
- Download and install git if it is not installed on your computer. To do so, follow the official instructions. If you do not know whether git is installed on your computer, try typing git in a terminal. If it returns a help message, then git is installed. If you'd rather use a graphical interface (a GUI) than the command line, have a look here.
- Set up git, following these instructions.
- Clone your fork. This means you’ll get a local version on your computer (for now your fork only exists on GitHub’s servers):
  - On the GitHub website, navigate to your fork of the lsml19 repository. Its URL should be something like https://github.com/<yourusername>/lsml19.
  - Click "Clone or download" (on the top right).
  - Copy the URL that was just displayed (it should be something like https://github.com/<yourusername>/lsml19.git).
  - In the terminal (I’m assuming Linux/MacOS), navigate to where you want your copy to be. For example, if you want it under Desktop > Courses > 2019, type cd Desktop/Courses/2019.
  - Then type git clone <the URL you just copied>. You should see a message telling you the repository is downloading.
For more on creating forks, see here.
- On your computer, edit the file you want to make changes to (for example, the first notebook).
- Push your changes (i.e. send them from your computer to your GitHub account). To do this, from your lsml19 repository, do:
git add <name of the file you edited>
git commit -m "<Short message explaining your modification>"
git push origin master
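If you would rather check from Python (for instance from a notebook cell) that git is installed before following the steps above, something like the following should work. It is a minimal sketch: the helper name git_version is ours, and it assumes that git, when installed, is on your PATH.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if git cannot be run."""
    # shutil.which returns None when no "git" executable is found on the PATH.
    if shutil.which("git") is None:
        return None
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=False
    )
    # An empty output (e.g. a stub that fails to run) is treated as "not installed".
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = git_version()
    print(version if version else "git does not appear to be installed")
```

If the function returns None, go back to the installation step of the tutorial.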