December, 2014

Motivation

  • To do reproducible research & encourage others to do the same

  • Increase credibility: the article alone is only an advertisement; code and data show the correctness of my results

  • Increase impact: allow reuse of method; currently the main beneficiary is future me

Challenges

  • Pre-publication work viewed as trade secrets

  • Anxiety about exposure to ridicule

  • Wide variation in data analysis tools

  • Scripted analyses are uncommon

Culture shock

  • Oberg (1960) popularized the term culture shock as the "anxiety that results from losing all of our familiar signs and symbols of social intercourse"

  • Weaver (1994) says culture shock has three basic causal explanations:
    1. the loss of familiar cues,
    2. the breakdown of interpersonal communications, and
    3. an identity crisis

  • Phenomenological etiology: A researcher cannot convey and validate central aspects of their identity

Technical challenges

  • Dependencies

  • Imprecise documentation

  • Code rot

  • Barriers to adoption and reuse in existing solutions

Solutions

  • For the cultural problems
    1. Do all the work
    2. Validate
    3. Isolate
    4. Educate

  • For the technical problems
    1. Workflow software
    2. Virtual machines
    3. Linux containers

Do all the work

  • Repository with code and data (R markdown file, scripts, RProj, R package)

  • Review cycle means MS Word is a necessary format

  • Code is circulated with co-authors, but they don't do anything with it

  • Repository is cited and described in the methods section

Validation

  • Decipher analysis in Excel or SPSS file

  • Recompute all or some with R

  • Create repository with R code and data

  • More like a lab notebook, not cited in manuscript
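
A minimal sketch of the "recompute and compare" step, written in Python purely for illustration (the slides describe doing this in R). The raw values, the reported statistics, and the tolerance are all hypothetical stand-ins for what would be read from a collaborator's Excel or SPSS file.

```python
import math
import statistics

# Hypothetical raw measurements, standing in for a column from a spreadsheet.
raw_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Hypothetical summary statistics as reported in that spreadsheet.
reported = {"mean": 5.0, "sd": 2.14}


def recompute(values):
    """Recompute the summary statistics from the raw values."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


def compare(reported_stats, recomputed_stats, tolerance=0.01):
    """Return, per statistic, whether the recomputed value matches the reported one."""
    return {
        name: math.isclose(reported_stats[name], recomputed_stats[name], abs_tol=tolerance)
        for name in reported_stats
    }


recomputed = recompute(raw_values)
checks = compare(reported, recomputed)
for name, ok in checks.items():
    print(f"{name}: reported={reported[name]} recomputed={recomputed[name]:.3f} match={ok}")
```

Keeping a script like this in the repository next to the data gives the lab-notebook record described above: anyone can rerun it and see which reported numbers were checked.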

Isolation

  • Keep my contribution self-contained

  • Create repository for my contribution (from specific commit or release)

  • Cite repository in publication in the figure caption (with no explanation)

Education

  • Require student collaborators to acquire skills (sneak into coursework, require it for graduate student milestones, Software Carpentry)

  • Normalise scripted analyses by talking about them, showing them, citing them (at appropriate moments…)

  • Advocate Open Methods; flattery often works (open science may be a bridge too far for some)

Technical problems

  • Workflow software: elegant but esoteric & nobody uses it

  • Virtual machine: isolated & intelligible but heavyweight & black box

  • Linux containers: in their infancy…

Docker

  • Operating system-level virtualization: behaves like a very lightweight Linux VM

  • Plain text Dockerfile
    • defines and documents the image
    • includes all software dependencies down to the level of the OS
    • is easily stored, shared and versioned
    • minimizes dependency and code rot problems
  • My use is inspired by the rocker project of Carl Boettiger & Dirk Eddelbuettel, and Carl's paper
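
To make the "plain text Dockerfile" point concrete, here is a rough sketch in Python that assembles the text of such a file. The helper name `render_dockerfile`, the package list, and the project directory are hypothetical; `rocker/rstudio` is one of the rocker project's images, and the instructions shown are an assumption about what a research compendium might need, not the author's actual Dockerfile.

```python
def render_dockerfile(base_image, r_packages, project_dir="/home/rstudio/project"):
    """Return Dockerfile text: base image, R packages, then the project files.

    Hypothetical helper for illustration; the result is ordinary plain text
    that can be committed, diffed, and shared like any other source file.
    """
    quoted = ", ".join("'{}'".format(pkg) for pkg in r_packages)
    install = "install.packages(c({}), repos = 'https://cloud.r-project.org')".format(quoted)
    lines = [
        # The base image fixes the operating system, R, and RStudio Server.
        "FROM {}".format(base_image),
        "",
        "# Install the R packages the analysis depends on",
        'RUN Rscript -e "{}"'.format(install),
        "",
        "# Copy the code and data into the image",
        "COPY . {}".format(project_dir),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical package list for an analysis written in R Markdown.
dockerfile_text = render_dockerfile("rocker/rstudio", ["knitr", "ggplot2"])
print(dockerfile_text)
```

Saved as `Dockerfile` at the top of the repository, text like this could be built with `docker build`; every dependency is then spelled out in a file that is versioned alongside the analysis.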

Docker and barriers to adoption

  • Optimized at the level of single applications; for me these are RStudio and the shell

  • Less disruptive to established workflows; not a drain on my laptop, I can use my usual text editor, use RStudio in my web browser, etc.

  • Highly portable: gives identical environments across different machines; images can be snapshotted

  • Reusing and remixing images is trivial

  • Docker Hub gives free open hosting and continuous integration for images (can link to Dockerfiles hosted on GitHub, etc.)

Limitations of Docker

  • Docker does not provide complete virtualization but relies on the Linux kernel provided by the host

  • Docker is limited to 64-bit host machines

  • On Windows & OS X, Docker must still be run in a fully virtualized environment (VirtualBox). The boot2docker tool helps, but could be smoother

  • Potential security issues

  • Will Docker be significantly adopted by any scientific research or teaching community?
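
The host-related limitations above can be summarised as a small decision rule. The sketch below is a hypothetical Python helper, not part of Docker; it classifies a machine from the values returned by `platform.system()` and `platform.machine()`, and it assumes that only x86-64 counts as a supported 64-bit architecture, as was the case when these slides were written.

```python
import platform


def docker_host_support(system, machine):
    """Classify a host according to the limitations listed above.

    `system` and `machine` are strings in the form returned by
    platform.system() and platform.machine(). Hypothetical helper.
    """
    # Assumption: treat only x86-64 ("x86_64" or "AMD64") as a 64-bit host.
    if machine.lower() not in {"x86_64", "amd64"}:
        return "unsupported: Docker needs a 64-bit host"
    if system == "Linux":
        return "native: containers share the host's Linux kernel"
    if system in {"Darwin", "Windows"}:
        return "virtualized: needs a Linux VM (e.g. boot2docker with VirtualBox)"
    return "unknown: not covered by the limitations listed above"


# Report on the machine running this script.
print(docker_host_support(platform.system(), platform.machine()))
```

A check like this could sit in a workshop's setup instructions so that students on Windows or OS X know in advance that an extra virtualization layer is needed.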

Future directions

  • Increase visibility of scripted analyses to reduce culture shock

  • Create opportunities with minimal inessential weirdness for students to learn & peers to familiarize themselves (Software Carpentry, Open Methods)

  • Research project as R package (rather than RProj, scripts, etc.)

  • Document dependencies with Dockerfile and use Docker as a common computational environment for research and teaching

Colophon

Citations