## MA2823 Foundations of Machine Learning (Fall 2015)

This is a course I am teaching at Ecole Centrale Paris, as part of the Engineering Program as well as the M.Sc. in Data Science.

**Syllabus** (downloadable as a pdf)

Machine learning lies at the heart of data science. It is essentially the intersection between statistics and computation, though the principles of machine learning have been rediscovered within many different traditions, including artificial intelligence, Bayesian statistics, and frequentist statistics. In this course, we view machine learning as the automatic learning of a prediction function given a training sample of data (labeled or not).

Machine learning methods form the foundation of many successful companies and technologies in multiple domains. Their applications include, to name a few, search engines, robotics, bioinformatics analyses of genetic data, algorithmic trading, social network analysis, targeted advertising, computer vision, and machine translation.

This course gives an overview of the most important trends in machine learning, with a particular focus on statistical risk and its minimization with respect to a prediction function. A substantial lab section involves group projects on data science competitions and gives students the ability to apply the course theory to real-world problems.

This course will be evaluated through a project report (on a data science competition) as well as a written exam (on **December 18**). Download the **Project report grading rubric (pdf)**.

This course is divided into 13 chapters of 1.5 hours each, as well as 9 labs of 1.5 hours each. The first two labs will be dedicated to a tutorial on the scikit-learn library for machine learning in Python. The other labs will give the students the opportunity to apply the course theory to a data science competition.

**Teaching team**

**Instructor:**

Chloé-Agathe Azencott `chloe-agathe.azencott@mines-paristech.fr`

**TAs:**

Jiaqian Yu `jiaqian.yu@centralesupelec.fr`

Eugene Belilovsky `eugene.belilovsky@inria.fr`

**Textbook**

*The Elements of Statistical Learning: Data Mining, Inference and Prediction.*
Trevor Hastie, Robert Tibshirani and Jerome Friedman.
Available online at http://statweb.stanford.edu/~tibs/ElemStatLearn/

In addition, slides, labs and other supplementary materials will be made available on this website throughout the course.

**Lectures**

**Chap 1. Introduction (Sep 23)** [slides (pdf)] [video]

We introduce machine learning, its applications, and various classes of problems.

**Chap 2. Supervised learning (Sep 23)** [slides (pdf)] [proof for p20 (pdf)]

We introduce and formalize a core problem of machine learning: supervised learning, in which the data is labeled and the goal is to predict the label of new, unseen data points.

Concepts: classification and regression, hypothesis space, Vapnik-Chervonenkis dimension, probably approximately correct (PAC) learning, overfitting.

**Chap 3. Model evaluation and selection (Sep 25)** [slides (pdf)]

We discuss the assessment and evaluation of supervised machine learning models.

Concepts: training and test sets, cross-validation, bootstrap, measures of performance for classification and regression, measures of model complexity.
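As a rough illustration of cross-validation, the sketch below splits sample indices into k folds, each fold serving once as the test set while the remaining folds form the training set. The function name `k_fold_indices` is hypothetical and not taken from the course materials, and the sketch assumes the samples have already been shuffled if their order matters.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only; it does not shuffle the data.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


folds = list(k_fold_indices(n_samples=7, n_folds=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so averaging a performance measure over the folds uses every sample for evaluation exactly once.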

Lab: Introduction to scikit-learn (part 1) [pdf]

**Chap 4. Bayesian decision theory (Oct 2)** [slides (pdf)]

We discuss the quantity to be optimized in statistical estimation, and its various finite-sample approximations.

Concepts: Bayes rule, losses and risks, Bayes risk, maximum a posteriori.

Lab: Introduction to scikit-learn (part 2) [pdf]

Handout: Kaggle project presentation [pdf] [slides (pdf)]

**Chap 5. Linear and logistic regression (Oct 9)** [handout (pdf)] [slides (pdf)] [proof of Gauss-Markov and derivations for the logistic regression (pdf)]

We introduce parametric approaches to supervised learning as well as the simplest linear models. We formulate linear regression as a maximum likelihood estimation problem and derive its estimator.

Concepts: parametric methods, maximum likelihood estimates, linear regression, logistic regression.
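As a minimal sketch of the resulting estimator, assume a single input variable and Gaussian noise on the outputs, in which case the maximum likelihood estimate coincides with the least-squares solution. The function name `fit_simple_linear_regression` and the toy data are illustrative assumptions, not part of the course materials.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals.

    Hypothetical helper: under a Gaussian noise assumption this is also
    the maximum likelihood estimate for a one-variable linear model.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1 without noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```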

Lab: Kaggle project.

**Chap 6. Regularized linear regression (Oct 16)** [handout (pdf)] [slides (pdf)]

We introduce the concept of regularization as a means of controlling the complexity of the hypothesis space, and apply it to linear models.

Concepts: Lasso, ridge regression, structured regularization.

Lab: Kaggle project.

**Chap 7. Nearest-neighbors methods (Nov 6)** [handout (pdf)] [slides (pdf)]

We introduce non-parametric methods, whose complexity grows with the size of the data sample. We illustrate them with nearest-neighbors approaches.

Concepts: non-parametric learning, nearest neighbor, k-nearest neighbors, instance-based learning, similarities, Voronoi tessellation, curse of dimensionality.
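To make the nearest-neighbors idea concrete, here is a small sketch of a k-nearest-neighbors classifier using Euclidean distance and majority voting. The function name `knn_predict` and the toy data are assumptions made for illustration; other distances or similarity measures could be substituted.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors.

    Hypothetical helper for illustration; uses Euclidean distance.
    """
    # Sort the training samples by their distance to the query point.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

Note that there is no training step: the model is the training sample itself, which is why its complexity grows with the size of the data.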

Lab: Kaggle project.

**Chap 8. Tree-based methods (Nov 13)** [handout (pdf)] [slides (pdf)]

We introduce decision trees, one of the most intuitive supervised learning algorithms, and show how to combine simple classifiers to yield state-of-the-art predictors.

Concepts: decision trees, ensemble methods, boosting, random forests.

Lab: Kaggle project.

**Chap 9a and 9b. Support vector machines (Nov 20)** [handout (pdf)] [slides (pdf)]

We introduce a very popular class of machine learning methods that has achieved state-of-the-art performance on a wide range of tasks. We derive the support-vector machine from first principles in the case of linearly separable data, extend it to non-separable data, and show how positive-definite kernels can be used to extend the approach to non-linear separating functions.

Concepts: maximum margin, soft-margin SVM, non-linear data mapping, kernel trick, kernels.

**Chap 10. Neural networks (Nov 27)** [handout (pdf)] [slides (pdf)]

We introduce the perceptron algorithm from Rosenblatt (1957), one of the earliest steps towards learning with computers, and discuss its many extensions.

Concepts: perceptrons, multi-layer networks, backpropagation.
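The perceptron update rule can be sketched in a few lines. The version below assumes labels in {-1, +1} and linearly separable toy data; the function names `train_perceptron` and `perceptron_predict` are hypothetical, and the learning rate and epoch count are arbitrary illustrative choices.

```python
def train_perceptron(points, labels, n_epochs=20, learning_rate=1.0):
    """Learn weights and a bias with the perceptron rule (labels in {-1, +1}).

    Hypothetical helper for illustration only.
    """
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(n_epochs):
        n_mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            # Update only when the sample is misclassified (or on the boundary).
            if y * activation <= 0:
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                n_mistakes += 1
        if n_mistakes == 0:
            break  # every training sample is correctly classified
    return weights, bias


def perceptron_predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the learned hyperplane."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


points = [(2, 1), (3, 2), (-1, -1), (-2, -3)]
labels = [1, 1, -1, -1]
weights, bias = train_perceptron(points, labels)
print([perceptron_predict(weights, bias, x) for x in points])
```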

Lab: Kaggle project.

Additional slides on training neural networks, not covered in class, are available [here (pdf)].

**Chap 11. Dimensionality reduction (Dec 4)** [handout (pdf)] [slides (pdf)]

We discuss how to approach high-dimensional learning problems, and present approaches to reduce this dimension.

Concepts: feature selection, wrapper approaches, principal component analysis, autoencoders.

Lab: Kaggle project.

**Chap 12. Clustering (Dec 11)** [handout (pdf)] [slides (pdf)]

We conclude this course with clustering, the most common unsupervised learning problem: how to find groups within data that is given without labels.

Concepts: hierarchical clustering, k-means.
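As an illustration of k-means, the sketch below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points (often called Lloyd's algorithm). The function name `k_means`, the explicit initial centroids, and the toy data are assumptions made for this example; in practice the initialization is usually randomized.

```python
import math


def k_means(points, initial_centroids, n_iterations=100):
    """Run Lloyd's algorithm; return (centroids, clusters).

    Hypothetical helper for illustration. `clusters` holds the points
    assigned to each centroid in the last assignment step.
    """
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda j: math.dist(p, centroids[j])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, clusters = k_means(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
```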

Lab: Kaggle project.