GitHub

GitHub is a web-based repository hosting service that provides version control and source code management. It is built on the Git version control system.

GitHub offers both private and public repositories, and supports free accounts for academics.

Trainees should

  • create a GitHub account (and apply for the academic discount to get additional features for free, such as unlimited private repositories)
  • prepare their work environment following the Set Up Git guide
  • create a private repository for their electronic lab notebook, and share it with me by adding me (chagaz) as a collaborator (see Settings > Collaborators). What I'm interested in is a diary of key points in reverse chronological order, probably in Markdown format, that keeps track of specific tasks to accomplish, results, analyses, conclusions, next steps, overall goals, etc., as discussed in our meetings. You can also add Jupyter notebooks to this repository.
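
As an illustration of the reverse-chronological diary described above, here is a minimal sketch in Python. The function name prepend_entry, the section titles, and the sample dates and entries are all hypothetical; use whatever layout works for you.

```python
from datetime import date


def prepend_entry(notebook_text, entry_date, tasks, results, next_steps):
    """Return notebook_text with a new dated Markdown entry placed on top."""
    sections = (
        ("Tasks", tasks),
        ("Results", results),
        ("Next steps", next_steps),
    )
    lines = ["## " + entry_date.isoformat(), ""]
    for title, items in sections:
        lines.append("### " + title)
        lines.extend("- " + item for item in items)
        lines.append("")
    # The newest entry goes first, so the diary stays reverse-chronological.
    return "\n".join(lines) + "\n" + notebook_text


if __name__ == "__main__":
    # Hypothetical notebook content and entry, for illustration only.
    existing = "## 2016-01-11\n\n### Tasks\n- Set up the repository\n"
    print(prepend_entry(
        existing,
        date(2016, 1, 18),
        tasks=["Write the data loader"],
        results=["Loader handles the toy dataset"],
        next_steps=["Run the baseline"],
    ))
```

Writing entries by hand in a text editor works just as well; the point is only that the most recent entry sits at the top of the file.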

If you're curious about more elaborate setups for electronic notebooks using GitHub and want to experiment, have a look at the Madsen Lab Notebook or at Carl Boettiger's Lab Notebook.

The goal is to also use public repositories for code and papers, with one repository per project.

Jupyter notebooks

Jupyter notebooks support a variety of programming languages, including R, Scala, Julia, and Python. Notebooks let you run commands from a web browser, keep track of the commands you typed and the results you obtained (be they printouts or images), organize them in sections, and introduce, comment, and annotate your work with formatted text. They are a great way to track your work and produce technical reports, and a good tool for reproducible research.

Code documentation

Documenting code is important for yourself, for your reviewers, and for maximizing impact (by making it easier for others to reuse your work). For each project, we will attempt to respect the following rules:

  • The repository contains a README.md file, in Markdown syntax (rendered directly by GitHub and readable as plain text), which describes:
    • The goal of the software
    • Who created it
    • How to contact authors in case of issues
    • How to install it (be specific, list all dependencies)
    • How to use it (give specific examples, document each functionality)
  • The repository contains a LICENSE file (plain text). I am partial to the MIT License; check out ChooseALicense for more possibilities.
  • Options, parameters, variables, methods, classes must be documented. In Python, we will follow the NumPy style guide as well as PEP 0257 regarding docstrings.
  • Each script/program that can be called from the command line must give useful information when called without arguments (or with the -h or --help options)
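
To make the last two rules concrete, here is a minimal sketch of a command-line script with a NumPy-style docstring that prints its usage message when called without arguments. The script's purpose and the names mean, build_parser, and main are hypothetical and only serve the illustration; argparse adds the -h and --help options on its own.

```python
"""Print the mean of the numbers given on the command line."""
import argparse


def mean(values):
    """Compute the arithmetic mean of a sequence of numbers.

    Parameters
    ----------
    values : sequence of float
        Numbers to average. Must not be empty.

    Returns
    -------
    float
        The arithmetic mean of `values`.

    Raises
    ------
    ValueError
        If `values` is empty.
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def build_parser():
    """Create the command-line parser; argparse adds -h and --help."""
    parser = argparse.ArgumentParser(
        description="Print the mean of the given numbers.")
    parser.add_argument("numbers", nargs="*", type=float,
                        help="numbers to average")
    return parser


def main(argv=None):
    """Parse argv (defaults to the command line) and print the mean."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.numbers:
        # Called without arguments: show the usage message instead.
        parser.print_help()
        return
    print(mean(args.numbers))


if __name__ == "__main__":
    main()
```

If this is saved as, say, mean.py (a hypothetical file name), then python mean.py 1 2 3 prints the mean, while python mean.py and python mean.py -h both print the usage message.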

Python style

We will endeavor to follow PEP 0008. In particular:

  • Variable names and comments must be in English.
  • Indentations are done by blocks of 4 spaces (and not with tabs).
  • For spacing, check out the Pet Peeves section of PEP 0008.
  • CamelCase only applies to class names.
  • lowercase_with_underscores applies to:
    • package names
    • module names
    • function names
    • method names
    • class instance names
    • variables, parameters, arguments.
  • UPPER_CASE_WITH_UNDERSCORES only applies to constant names.
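
The naming rules above can be seen side by side in a short sketch. The names below (DEFAULT_LEARNING_RATE, GradientStep, run_single_step) are made up for the illustration and do not come from any of our projects.

```python
"""Naming conventions from PEP 8, shown on a hypothetical example."""

# Constant name: UPPER_CASE_WITH_UNDERSCORES.
DEFAULT_LEARNING_RATE = 0.1


# Class name: CamelCase.
class GradientStep:
    """Hold the step size used for one gradient update."""

    def __init__(self, learning_rate=DEFAULT_LEARNING_RATE):
        # Parameters and attributes: lowercase_with_underscores.
        self.learning_rate = learning_rate

    # Method name: lowercase_with_underscores.
    def apply_to(self, current_value, gradient):
        """Return the value after one gradient descent step."""
        return current_value - self.learning_rate * gradient


# Function name: lowercase_with_underscores.
def run_single_step(start_value, gradient):
    """Apply one step with the default learning rate."""
    # Class instance name: lowercase_with_underscores.
    gradient_step = GradientStep()
    return gradient_step.apply_to(start_value, gradient)
```

Note the 4-space indentation and the absence of spaces around = in the keyword argument default, both of which follow PEP 0008.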

Scientific Python

For scientific computing in Python, we routinely use NumPy, SciPy, PyTables or pandas, matplotlib (although maybe we should switch to seaborn or Bokeh), and of course scikit-learn.

As a side note, contributing to scikit-learn is a good way to familiarize yourself with GitHub, Python, and many practical aspects of machine learning. You can start with the Easy Issues.

I also recommend working with the interactive shell IPython. Some useful features of IPython:

  • tab completion
  • inline help (with help(module_name) or object_name?)
  • magic functions (predefined functions starting with %), such as %history, %paste (preserves indentation!), %run script.py, or %save.

If you're an Emacs user, I recommend the emacs-for-python package.

LaTeX

We write papers in LaTeX, a document preparation system widely used in technical and scientific fields. Unlike What You See Is What You Get (WYSIWYG) software such as LibreOffice or Microsoft Word, LaTeX encourages you to focus on logical structure rather than formatting, and makes it easy to typeset mathematical formulas. Because LaTeX documents are written in plain text, version control is also much easier.
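
To illustrate what "logical structure rather than formatting" means in practice, here is a minimal sketch of a LaTeX document. The section title, the formula, and the label name eq:mean are hypothetical choices made for the example.

```latex
\documentclass{article}
\usepackage{amsmath}

\begin{document}

\section{Sample mean}

% The section and the equation are marked up by role;
% numbering and layout are left to the document class.
The sample mean of $n$ observations $x_1, \dots, x_n$ is
\begin{equation}
  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .
  \label{eq:mean}
\end{equation}
Equation~\eqref{eq:mean} is referenced by its label, not by its number.

\end{document}
```

Since the source is plain text, a change to the formula shows up as a one-line diff in Git.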

A good place to start with LaTeX is Overleaf, a collaborative LaTeX editing platform.

Do take some time to set up a nice working environment for LaTeX on your own computer. For Emacs users, I recommend AUCTeX with the following configuration in your .emacs:

;; auto-complete for latex
(require 'auto-complete-auctex)

;; make auctex use pdflatex to compile (when C-c C-c)
(TeX-global-PDF-mode t)
(setq TeX-engine 'pdflatex)

;; make auctex use evince and firefox for visualization (when C-c C-v)
(setq TeX-output-view-style
      (quote
       (("^pdf$" "." "evince -f %o")
        ("^html?$" "." "firefox %o"))))

Reference management

Chances are that at some point you will want to manage your bibliographical references with something more advanced than a plain BibTeX (.bib) file. I use Zotero; another popular option in the lab is Mendeley. Wikipedia has a good list of other options.