I've had the great pleasure of spending a few days in Copenhagen attending, first, a symposium on Big Data approaches to health, disease and treatment trajectories, and second, a two-day workshop on machine learning approaches to disease prediction.

The workshop, organized by Rikke Linnemann Nielsen, Agnes Martine Nielsen and Ramneek Gupta, had around 40 attendees, and featured Jason Moore, Marylyn Ritchie, Andrea Califano, Laurent Gautier and myself as invited speakers.

There was a lot of time built in for discussion, and I wanted to summarize here some of the points that were raised because I think they can be very useful.

Understanding the machine learning algorithms you use is key. In particular, run simulations, and check whether the algorithm / implementation you are using behaves on them as you expect. Yes, this is boring, but it is essential: how else are you going to trust that it's the right tool for your problem? Marylyn drove that point home very well.
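For what it's worth, here's the kind of minimal sanity check I have in mind: a sketch using scikit-learn on simulated data where I know which features carry signal (the sizes and the classifier are arbitrary choices of mine).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Simulate data where only the first 5 of 100 features carry signal.
rng = np.random.RandomState(0)
n_samples, n_features, n_informative = 200, 100, 5
X = rng.randn(n_samples, n_features)
y = (X[:, :n_informative].sum(axis=1) > 0).astype(int)

# If the algorithm behaves as expected, the informative features
# should dominate its importance scores.
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
top = np.argsort(clf.feature_importances_)[::-1][:n_informative]
print("Top-ranked features:", sorted(top))  # ideally 0, 1, 2, 3, 4
```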

No algorithm is going to solve all your problems. Just because a method worked beautifully in a paper you've read doesn't mean it's going to be good for your problem, and certainly not with the default parameters. In my own words, there's this little thing called the no free lunch theorem.

Some algorithms are what Jason refers to as frozen accidents. Someone had an idea, tried it on some data, got it published, and for the following twenty years the entire community believes that the way to treat vaguely similar data is with that idea and nothing else. Challenge that. (You'll probably still need to use the frozen accident in your paper, but maybe you can also present an alternative that's better suited to your problem.)

Please take a step back and think before using t-SNE. What are you trying to do exactly? Remember, t-SNE is a dimensionality reduction tool, not a clustering algorithm. You can use it for visualization. What do you think of your visualization before DBSCAN or any other clustering algorithm colors the points in different clusters? What happens to it if you change the perplexity? Remove a couple of samples?
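To make the perplexity point concrete, here's a small sketch of my own (scikit-learn on a toy dataset) that simply re-runs t-SNE with a few perplexity values and plots the embeddings side by side; if your "clusters" appear or merge depending on this one parameter, be careful about reading structure into them.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Same data, three perplexity values: compare the embeddings before
# letting any clustering algorithm color the points for you.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perplexity in zip(axes, [5, 30, 100]):
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title("perplexity = %d" % perplexity)
plt.show()
```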

I keep repeating this, and I will repeat it again: you do not evaluate your model on the data you've used to train it. Feature selection is part of training. Model selection is part of training. If your model changes when you add an observation to your evaluation set, it means that your validation is not evaluating generalization properly. Clean validation requires holding out a data set that you only use for evaluating performance after having selected features, tweaked hyperparameters, and so on and so forth. DREAM challenges are good for this, by the way.
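For concreteness, here's one way to set this up with scikit-learn (a sketch on toy data): feature selection and hyperparameter tuning both live inside the cross-validated pipeline, and the held-out set is touched exactly once, at the end.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=500,
                           n_informative=10, random_state=0)

# Hold out a test set that is never used for training,
# feature selection, or hyperparameter tuning.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# Feature selection is a pipeline step, so it is re-fit on each
# cross-validation fold rather than on all the data.
pipe = Pipeline([("select", SelectKBest(f_classif)),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe,
                    {"select__k": [10, 50, 100], "clf__C": [0.1, 1, 10]},
                    cv=5)
grid.fit(X_tr, y_tr)

# Only now do we look at the held-out set, once.
print("Held-out accuracy: %.3f" % grid.score(X_te, y_te))
```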

Choosing the right evaluation criterion (or criteria) is crucial. Area under the ROC curve may well not be informative for a problem with a high class imbalance, for instance.
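As a toy illustration (the class proportions and model are made up), average precision, which summarizes the precision-recall curve, is often the more revealing number when positives are rare:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# A toy problem with roughly 2% positives.
X, y = make_classification(n_samples=10000, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC AUC can look flattering on imbalanced data, while average
# precision reflects how hard the rare class actually is.
print("ROC AUC:           %.3f" % roc_auc_score(y_te, proba))
print("Average precision: %.3f" % average_precision_score(y_te, proba))
```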

Electronic health records are bringing new machine learning challenges that seem far from solved. How do you deal with time series data where each sample has its own time points? How do you deal with heterogeneous data types? How do you deal with sloppy data? How do you deal with missing data that can be missing for very different reasons?

About missing data, we spent a lot of time discussing imputation, and we're not big fans. It seems like a great way to introduce more noise and biases into data that already has more than its share of both. In addition, data from EHRs can be missing for very different reasons. Did the patient not get that blood work done because the medical doctor did not prescribe it? Because the patient hates needles? Because she cannot afford the test? Or were the results just not entered in her record?
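To make the bias point concrete, here's a toy simulation of my own (not from the discussion): a lab value that is more likely to be recorded when it is high, so that mean imputation shifts the whole distribution upwards.

```python
import numpy as np

rng = np.random.RandomState(0)
true_values = rng.normal(loc=5.0, scale=1.0, size=10000)

# Not missing at random: the test is more likely to be recorded
# when the underlying value is high.
p_observed = 1.0 / (1.0 + np.exp(-(true_values - 5.0)))
observed = np.where(rng.rand(10000) < p_observed, true_values, np.nan)

# Mean imputation, using only the observed values.
imputed = np.where(np.isnan(observed), np.nanmean(observed), observed)

print("True mean:    %.2f" % true_values.mean())
print("Imputed mean: %.2f" % imputed.mean())  # biased upwards
```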

If your goal is to do translational research, you need to understand what clinicians need. At the symposium on Tuesday, Thorkild Sørensen made the excellent point that the only thing clinicians care about is improving patient care. What is a good measure of clinical utility for your problem? A simple, interpretable model may not perform as well as a deep boosted random kernel forest (I'm expecting royalties if you actually create that algorithm, by the way), but it may still perform better than the current tools. Also, what does interpretability mean to them? Is it even needed for this particular problem?

About p-values, remember that statistics are not biology. If we can agree that all we're doing with our computers is generating hypotheses (rather than biological knowledge), there's no clear evidence that a p-value is a more meaningful score than a random forest feature importance, a regression weight, or whatever else you want to compute. On the other hand, p-values are a good tool to compare what you're doing with random chance, and you can construct a null for just about anything by permutation testing.
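And building such a null is straightforward for just about any score; here's a sketch for random forest feature importances (toy data, arbitrary number of permutations).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=3, random_state=0)

def importances(X, y):
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    return forest.fit(X, y).feature_importances_

observed = importances(X, y)

# Null distribution: shuffle the labels to destroy any feature/label
# association, then recompute the same score.
rng = np.random.RandomState(0)
null = np.array([importances(X, rng.permutation(y)) for _ in range(50)])

# Empirical p-value per feature: how often does chance do as well?
pvals = ((null >= observed).sum(axis=0) + 1) / (null.shape[0] + 1)
print(np.round(pvals, 3))
```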

Finally, we also talked a lot about negative results, which for us are mainly the whole bunch of methods we tried to apply to our data and that led us nowhere. There was broad consensus that these are science, that they're interesting to the community, and that they should be published. There was also general agreement that publishing them is not easy, and that you cannot get a PhD / faculty position / grant based on these kinds of results alone. Sadly.

Oh, and here are my Tuesday slides on network-guided sparsity and my Wednesday slides on multitask approaches.