Chloé-Agathe Azencott

Associate Professor at the Centre for Computational Biology (CBIO) of Mines ParisTech, Institut Curie and INSERM.


Some work practices (2022 update)


This is a list of tools and working practices I am trying to adopt myself and to encourage in people working under my supervision.

Continue reading...


Un autre numérique reste possible


This text was written in French with other French researchers and is informed by our experience of the French system.

Mirror of the text published under a CC-BY-4.0 license on Medium

Signatories:

Chloé-Agathe Azencott, Faculty member in applied mathematics at MINES ParisTech

Anne Baillot, Full Professor at the Université du Mans, German studies and digital humanities

Frédéric Clavert, Assistant Professor in contemporary history, C2DH, Université du Luxembourg

Alix Deleporte, Associate Professor (Maître de Conférences), Institut Mathématique d'Orsay, Université Paris-Saclay

Julie Giovacchini, Research engineer in the analysis of ancient sources and digital humanities, CNRS, Centre Jean Pépin (UMR8230)

Anne Grand d'Esnon, PhD candidate in comparative literature, Université Bourgogne-Franche-Comté

Catherine Psilakis, Université de Lyon 1

A digital dystopia is taking shape in academia, and its emergence is being accelerated by the health crisis. There is still time, however, to make digital technology at the university a tool in the service of teacher-researchers, engineers, and students, and more broadly of all those who teach, do research, and pass on the fruits of their research.

Continue reading...


Machine learning approaches to disease prediction


I've had the great pleasure to spend a few days in Copenhagen attending, first, a symposium on Big Data approaches to health, disease and treatment trajectories, and second, a two-day workshop on machine learning approaches to disease prediction.

The workshop, organized by Rikke Linnemann Nielsen, Agnes Martine Nielsen and Ramneek Gupta, had around 40 attendees, and featured Jason Moore, Marylyn Ritchie, Andrea Califano, Laurent Gautier and myself as invited speakers.

There was a lot of time built in for discussion, and I wanted to summarize here some of the points that were raised because I think they can be very useful.

Understanding the machine learning algorithms you use is key. In particular, run simulations, and check whether the algorithm and implementation you are using behave as you expect on them. Yes, this is boring, but it is essential: how else are you going to trust that it's the right tool for your problem? Marylyn drove that point home very well.
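
As an illustration of this kind of sanity check, here is a minimal sketch using scikit-learn; the data, the choice of the lasso, and the regularization strength are all made up for the example. We simulate a regression problem where only the first three features carry signal, and verify that the method actually recovers them.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulate data where only the first 3 of 20 features carry signal.
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [2.0, -1.5, 1.0]              # known ground truth
y = X @ true_coef + rng.normal(scale=0.5, size=n)

# Does the tool we plan to use recover features 0, 1 and 2?
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("selected features:", sorted(selected))
```

If the selected set does not even contain the planted features on data this clean, that is a strong hint the method (or its settings) is wrong for your real problem.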

No algorithm is going to solve all your problems. Just because a method worked beautifully in a paper you've read doesn't mean it's going to be good for your problem, and certainly not with the default parameters. In my own words, there's this little thing called the no free lunch theorem.

Some algorithms are what Jason refers to as frozen accidents. Someone had an idea, tried it on some data, got it published, and then for the following twenty years the entire community believes the way to treat vaguely similar data is with that idea and nothing else. Challenge that. (You'll still probably need to use the frozen accident in your paper, but maybe you can also present an alternative that's better for your problem.)

Please take a step back and think before using t-SNE. What are you trying to do exactly? Remember, t-SNE is a dimensionality reduction tool, not a clustering algorithm. You can use it for visualization. What do you think of your visualization before DBSCAN or any other clustering algorithm colors the points in different clusters? What happens to it if you change the perplexity? Remove a couple of samples?
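
A minimal sketch of such a stability check, assuming scikit-learn and toy blob data: compute the embedding for several perplexity values and compare the pictures before reading anything into the apparent clusters.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Toy data with some genuine group structure.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Embed the same data under several perplexity values.
embeddings = {}
for perplexity in (5, 30, 50):
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=0).fit_transform(X)
    embeddings[perplexity] = emb

# Plot each embedding side by side (e.g. with matplotlib); if cluster
# structure appears or vanishes as perplexity changes, or when a couple
# of samples are removed, don't over-interpret it.
```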

I keep repeating this, and I will repeat it again: you do not evaluate your model on the data you've used to train it. Feature selection is part of training. Model selection is part of training. If your model changes when you add an observation to your evaluation set, it means that your validation is not evaluating generalization properly. Clean validation requires holding out a data set that you only use for evaluating performance after having selected features, tweaked hyperparameters, and so on and so forth. DREAM challenges are good for this, by the way.
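
A sketch of this protocol with scikit-learn, where the dataset, the selector, and the hyperparameter grid are all illustrative: hold out a test set first, put feature selection inside the cross-validated pipeline so it is refit on training folds only, and touch the held-out set exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)

# Hold out the evaluation set before doing anything else.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Feature selection lives inside the pipeline, so within each
# cross-validation split it only ever sees the training folds.
pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe,
                      {"select__k": [5, 10, 20], "clf__C": [0.1, 1.0]},
                      cv=5)
search.fit(X_train, y_train)

# The held-out test set is used exactly once, at the very end.
print("held-out accuracy:", search.score(X_test, y_test))
```

Running the selector on the full data before splitting, by contrast, leaks information from the evaluation set into training and inflates the reported performance.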

Choosing the right evaluation criterion (or criteria) is crucial. Area under the ROC curve may well not be informative for a problem with a high class imbalance, for instance.
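
A toy illustration of this point, assuming scikit-learn and simulated scores: with roughly 1% positives, a mediocre scorer can look flattering under ROC AUC while average precision, which summarizes the precision-recall view, tells a more sobering story.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.random(100_000) < 0.01          # ~1% positives
# Noisy scores that are only mildly higher for positives.
scores = rng.normal(size=y_true.size) + 1.5 * y_true

print("ROC AUC          :", roc_auc_score(y_true, scores))
print("average precision:", average_precision_score(y_true, scores))
```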

Electronic health records are bringing new machine learning challenges that seem far from solved. How do you deal with time series data where each sample has its own time points? How do you deal with heterogeneous data types? How do you deal with sloppy data? How do you deal with missing data that can be missing for very different reasons?

Regarding missing data, we spent a lot of time discussing imputation, and we're not big fans. It seems like a great way to introduce more noise and bias into data that already has more than its share of both. In addition, data from EHRs can be missing for very different reasons. Did the patient not get that blood work done because the doctor did not prescribe it? Because the patient hates needles? Because she cannot afford the test? Or were the results simply never entered in her record?
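
When imputation is nonetheless unavoidable, one pragmatic option is to keep explicit missingness indicators alongside the imputed values, so that "this test was never done" remains visible to the model. A sketch with scikit-learn's SimpleImputer on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two features, each with one missing value.
X = np.array([[7.0, np.nan],
              [np.nan, 3.0],
              [5.0, 4.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)
# X_out contains the two median-imputed columns, followed by one 0/1
# indicator column per feature that had missing values during fit.
print(X_out)
```

The indicators let a downstream model pick up on the missingness pattern itself, which in EHR data is often informative in its own right.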

If your goal is to do translational research, you need to understand what clinicians need. At the symposium on Tuesday, Thorkild Sørensen made the excellent point that the only thing clinicians care about is improving patient care. What is a good measure of clinical utility for your problem? A simple, interpretable model may not perform as well as a deep boosted random kernel forest (I'm expecting royalties if you actually create that algorithm, by the way), but it may still perform better than the current tools. Also, what does interpretability mean to them? Is it needed for this particular problem?

Regarding p-values, remember that statistics are not biology. If we can agree that all we're doing with our computers is generating hypotheses (rather than biological knowledge), there is no clear evidence that a p-value is a more meaningful score than a random forest feature importance, a regression weight, or whatever else you want to compute. On the other hand, p-values are a good tool to compare what you're doing with random chance, and you can construct a null for just about anything by permutation testing.
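
A sketch of such a permutation null, with scikit-learn and simulated data (the sample sizes, the forest, and the number of permutations are all arbitrary choices for the example): shuffle the labels repeatedly and see where the real feature importance falls in the resulting null distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
# Only feature 0 actually drives the labels.
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

def importance_of_feature_0(X, y):
    forest = RandomForestClassifier(n_estimators=50,
                                    random_state=0).fit(X, y)
    return forest.feature_importances_[0]

observed = importance_of_feature_0(X, y)
# Null distribution: the same score computed on permuted labels.
null = np.array([importance_of_feature_0(X, rng.permutation(y))
                 for _ in range(100)])
# Empirical p-value with the usual +1 correction.
pval = (1 + np.sum(null >= observed)) / (1 + len(null))
print("permutation p-value:", pval)
```

The same recipe works for regression weights, network scores, or any other statistic: the permutation defines the null, not the statistic.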

Finally, we also talked a lot about negative results, which for us are mainly the many methods we tried to apply to our data that led us nowhere. There was a large consensus that those are science, that they are interesting to the community, and that they should be published. There was also general agreement that publishing them is not easy, and that you cannot get a PhD, a faculty position, or a grant based on this type of results alone. Sadly.

Oh, and here are my Tuesday slides on network-guided sparsity and my Wednesday slides on multitask approaches.


Some work practices


This is a list of tools and working practices I am trying to develop for myself and people working under my supervision. They are not set in stone and are meant to evolve according to the people and project.

Continue reading...


Local user installation of gcc


On our compute cluster, I needed gcc-4.8.4 to compile some code. At the global level, gcc-4.4.7 is installed, and I do not have superuser privileges on the system (which is, all things considered, a good thing).

Here are my notes on how I installed gcc-4.8.4 locally, without superuser privileges, in case they might one day be of use to someone...

Continue reading...
