2022-03-14
This is a list of tools and working practices I am trying to encourage in myself and people working under my supervision.
General tips
Backups
Back up your code. Back up your data (being mindful of what is possible when we are working on private data). Back up your results. Do so often. I use a combination of ownCloud (provided by Mines), private github repositories (more on Github below) and backups to a hard drive with Back In Time.
Time off and vacations
Although there is a lot of pressure on us to believe otherwise, I strongly believe that you do not need to work all the time to be successful in research. More than that, the current scientific evidence suggests that not working all the time is essential to be a good researcher in the long run. The "always-on" culture is strong in academia, but I encourage resisting that.
We have the freedom to organize our time as we see fit, working into the evening if we want to rise late in the morning, taking two hours off in the middle of the afternoon for a swim or music practice or a medical appointment if it doesn't interfere with classes and meetings, and so on. Nevertheless, this does not mean that working all the time is a good idea.
I avoid working evenings and weekends as much as I can, and I do not check email during that time, nor when I am on vacation. I really encourage you to take all of your paid leave, and to make sure there are periods of time where you do _not_ engage with work stuff (no email, no Slack, no reading papers, no debugging, no watching video recordings of conferences, etc).
Reproducible research
I cannot stress enough how important it is that your research is reproducible. Here are a few resources you can check out on this topic:
- Reproducible Research
- Posts tagged "reproducibility" at Titus Brown's blog Living in an Ivory Basement
- Ten simple rules for reproducible computational research by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig.
This implies in particular that your code should be (1) made publicly available and (2) well-documented. More tips on how to achieve this below.
Open Access
As a general policy, we make sure both code and publications are openly accessible. We typically post preprints to arxiv, biorxiv, chemrxiv or medrxiv (depending on topic), and make sure to publish our papers open access. This is sometimes required by funders, but mostly I believe this is the right thing to do.
A few more resources on preprint servers and open access publishing:
- The Open Access Movement in Scholarly Communication by Michael Eisen
- The statement of leading Machine Learning researchers that led to the creation of JMLR in 2000, and the much more recent statement on Nature Machine Intelligence
Daily log
Keep track (every day!) of your research activity. This includes:
- papers you have read (and ideas you have found in them)
- ideas you've had
- code you wrote (what it does, where it is located)
- results you obtained (not just publishable results, but also errors, bugs you found, etc.)
- next steps you are planning to take
Whether you do this on paper or electronically, in which format, and so on is left to you. Finding something that works for you might take some time, and may involve trying and discarding several approaches.
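If you go the electronic route, one possible starting point is a small script that appends a dated, empty template to a plain-text file. The sketch below is only an illustration: the file name `research_log.md`, the function names and the exact section titles (adapted from the list above) are hypothetical choices, not a prescribed format.

```python
from datetime import date
from pathlib import Path

# Section titles mirror the list above; rename or reorder them freely.
SECTIONS = [
    "Papers read",
    "Ideas",
    "Code written",
    "Results",
    "Next steps",
]


def make_entry(day):
    """Return an empty log entry for the given day as Markdown text."""
    lines = [f"## {day.isoformat()}", ""]
    for title in SECTIONS:
        lines += [f"### {title}", "- ", ""]
    return "\n".join(lines) + "\n"


def append_entry(log_path, day=None):
    """Append today's (or the given day's) template to the log file."""
    day = day or date.today()
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(make_entry(day))


if __name__ == "__main__":
    # Hypothetical file name; point this at wherever you keep your notes.
    append_entry("research_log.md")
```

Keeping the log as plain text means it can live in a repository and be searched with ordinary tools.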
If you're curious about some more evolved setups for electronic notebooks using github and want to experiment, you can have a look at the Madsen Lab Notebook or at Carl Boettiger's Lab Notebook.
Reading papers
A safe bet is that you are not reading enough papers. (Neither am I.) Knowing what others have already done about problems you're interested in and what the challenges are is essential.
I strongly encourage you to use a reference manager. It may not seem necessary at first, but by the time you have been doing research for several years, you will no longer be able to track references with a mere BibTeX file. I use Zotero; another popular option in the lab is Mendeley. Wikipedia has a good list of options.
Online presence
Having an academic website is something I strongly recommend. It allows people who may be interested in your work to know more about you, in a way that _you_ control. It makes it easier for people to invite you to give talks, it allows you to share all your contributions to a research topic in a single place, it gives you a place to share teaching material, and quite often, this is the first thing prospective employers or collaborators will be looking for. You can go all the way and host your own website, as I do, but you can also easily set up a GitHub Page. Those are only two of the many available options.
I'm personally quite active on Twitter, which I find a nice way to hear about recent research as well as to have discussions around research and teaching (funding and lack thereof, remote conferences, diversity, ecological concerns, etc.)
Technical tips
Github
Github is a web-based repository hosting service, allowing for version control and source code management. Github is based on the git version control system.
Github offers both private and public repositories, and supports free accounts for academics.
We therefore use github both for (1) backing up code, with version control and (2) sharing code, for reproducibility.
Unless there are strong reasons to proceed otherwise, each of your projects should have its own github repo.
Additional resources:
- Beanstalk tutorial on version control
- the Learning git branching interactive tutorial
- the pro-git book
Terminal
I strongly encourage you to look into using a modern terminal utility, like tmux on Linux or iTerm2 on Mac.
LaTeX
We write papers in LaTeX, a document preparation system much used in technical and scientific domains. Unlike What You See Is What You Get (WYSIWYG) software such as LibreOffice or Microsoft Word, LaTeX encourages you to focus on logical structure rather than format, and makes it easy to typeset mathematical formulas. As LaTeX documents are written in plain text, this also makes version control much easier.
A good place to start with LaTeX is Overleaf, a collaborative LaTeX editing platform.
Do take some time to set up a nice working environment for LaTeX on your own computer.
If, like me, you're a dinosaur using emacs, I recommend AUCTeX with the following configuration in your .emacs:
    ;; auto-complete for latex
    (require 'auto-complete-auctex)
    ;; make auctex use pdflatex to compile (when C-c C-c)
    (TeX-global-PDF-mode t)
    (setq TeX-engine 'pdflatex)
    ;; make auctex use evince and firefox for visualization (when C-c C-v)
    (setq TeX-output-view-style
          (quote (("^pdf$" "." "evince -f %o")
                  ("^html?$" "." "firefox %o"))))
Some resources:
Technical tips (Python)
Contributing to scikit-learn
No, I am not forcing you to become a regular contributor to scikit-learn. However, I strongly recommend that you take the time to go through this step-by-step tutorial on how to do it, as it will also be a way to learn about programming tools and practices such as:
- Github (including forking, branching, pull requests, continuous integration)
- conda environments
- VS code (I use emacs myself, with the emacs-for-python add-on, but I am old and set in my ways)
- pytest
- linting with tools such as flake8 and automated formatting with black.
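To give a flavour of what pytest looks like in practice, here is a minimal sketch. The `zscore` function and both tests are hypothetical examples, not taken from scikit-learn or from the tutorial. With its default settings, pytest collects functions whose names start with `test_` from files named `test_*.py` or `*_test.py`, and plain `assert` statements are enough to write a test, so the file does not even need to import pytest.

```python
import math


def zscore(values):
    """Standardize a list of numbers to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("values must not all be equal")
    return [(v - mean) / std for v in values]


def test_zscore_has_zero_mean_and_unit_variance():
    scaled = zscore([1.0, 2.0, 3.0, 4.0])
    mean = sum(scaled) / len(scaled)
    variance = sum(s ** 2 for s in scaled) / len(scaled)
    assert math.isclose(mean, 0.0, abs_tol=1e-12)
    assert math.isclose(variance, 1.0, rel_tol=1e-9)


def test_zscore_preserves_order():
    scaled = zscore([10.0, 30.0, 20.0])
    # The smallest input stays smallest and the largest stays largest.
    assert scaled[0] < scaled[2] < scaled[1]
```

Saved as, say, `test_zscore.py` (a hypothetical file name), these tests would be run with `pytest test_zscore.py`.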
Code readability
Documenting code is important for yourself, for your reviewers, and for maximizing impact (by making it easier for others to reuse your work). For each project, we will attempt to respect the following rules:
- The repository contains a README.md file, in Markdown syntax (easily displayed within Github and readable as plain text), which describes:
  - The goal of the software
  - Who created it
  - How to contact authors in case of issues
  - How to install it (be specific, list all dependencies)
  - How to use it (give specific examples, document each functionality)
- The repository contains a LICENSE file (plain text). I am partial to the MIT License; check out ChooseALicense for more possibilities.
- Options, parameters, variables, methods, classes must be documented. In Python, we will follow the NumPy style guide as well as PEP 0257 regarding docstrings.
- Each script/program that can be called from the command line must give useful information when called without arguments (or with the -h or --help options)
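As an illustration of the last two rules, here is a minimal sketch of a command-line script with NumPy-style docstrings, built on `argparse` from the standard library. The `rescale` function, its parameters and the script itself are hypothetical. `argparse` adds the `-h` and `--help` options automatically; the explicit check on `argv` is just one possible way to print the help message when the script is called without arguments.

```python
import argparse
import sys


def rescale(values, factor=1.0):
    """Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Numbers to rescale.
    factor : float, optional
        Multiplicative factor applied to every value (default is 1.0).

    Returns
    -------
    list of float
        The rescaled values, in the same order as the input.
    """
    return [factor * v for v in values]


def build_parser():
    """Create the command-line parser; argparse adds -h/--help itself."""
    parser = argparse.ArgumentParser(
        description="Rescale a list of numbers by a constant factor."
    )
    parser.add_argument("values", type=float, nargs="+",
                        help="numbers to rescale")
    parser.add_argument("--factor", type=float, default=1.0,
                        help="multiplicative factor (default: 1.0)")
    return parser


def main(argv=None):
    """Parse the arguments and print the rescaled values."""
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:
        # Called without arguments: show the full help, not a bare error.
        parser.print_help()
        return
    args = parser.parse_args(argv)
    print(" ".join(str(v) for v in rescale(args.values, args.factor)))


if __name__ == "__main__":
    main()
```

If this were saved as `rescale.py` (a hypothetical name), `python rescale.py 1 2 --factor 3` would print `3.0 6.0`, while `python rescale.py` alone would print the usage message.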
In Python, it will also help to follow PEP 8. Tools such as flake8 and black can help with that.
Jupyter Lab
Jupyter notebooks support a variety of programming languages, including R, Scala, Julia, and Python. Notebooks allow you to run commands from a web browser, track typed commands and obtained results (be they printouts or images), organize them in sections, and introduce, comment and annotate your work (with formatting). They are a great way to track your work and produce technical reports, and a good tool for reproducible research. I recommend Jupyter Lab over Jupyter Notebook.