Reproducible analysis

What is this all about?

These traps aren’t theoretical: they are far
more common than you might imagine.

Terms like “reproducible research” and “open research” get used a lot, and they can mean a few different things:

  • Documenting your methods such that others could repeat the work,
  • Publishing your data and code so that others can check the conclusions or undertake further research with the same source data, or
  • Being able to redo your analysis at some later point.

All of these sound good! We should probably all do what we can towards all of these, within what is reasonable, of course. Datasets can be confidential for privacy, commercial or legal reasons, and those limitations have to be respected. The last of these points could be described as “reproducible analysis”, which is much narrower than the very broad topic of “reproducible research”; the discussion below focusses on making the analysis specifically more reproducible.

Beware of the quicksand!

Invest time in learning how to avoid the
quicksand, rather than losing lots of time
trying to drag yourself out of it.

Modern publishing standards and expectations are pushing researchers towards the first two of these. However, even if you set dataset confidentiality issues aside, the third one remains critical. Here are some things that can happen that illustrate the importance of this type of reproducibility:

  • A reviewer or colleague asks you to modify an analysis you did months or years earlier, and you either get different answers or can no longer tell exactly what you did or why you did it;
  • It is no longer clear which of the data entries are the original values, or which ones have been “cleaned” at a later time;
  • You have a large and increasingly unmanageable number of data files or versions of the code that does the analysis; or
  • Your code worked yesterday but doesn’t work today, and it’s unclear what changed to make it break.

You can easily get caught in the quicksand of having to retrace your steps, figure out what you have done, and reproduce that crucial result. You can lose so much time this way!

Much of this can be avoided if you invest some time in learning practices of reproducible research. There’s no canonical set of steps that can make things fail-proof, but here are some things you can do to help.

My top tips for reproducible analysis

Work smarter, not harder.

  1. Separate original data files, cleaned/processed data files, code files, plots and documentation into different folders, or use a clear file naming convention
  2. Back up your input data and all other key inputs (e.g., code, text you have written) in some way
  3. Use code/scripts to do the data cleaning, analysis and plotting
  4. Write lots of comments in your code to explain what you are doing and why
    • Use “literate programming” to combine code, comments, tables and plots in a nicely formatted manner (e.g., R Markdown/Quarto documents, Jupyter notebooks)
  5. Make README files to explain what the project is about, what the key input files contain, and what each script does
    • This can include metadata (e.g., a “data dictionary”), and the when/how/why/who of the dataset access
  6. Use version control to keep track of your code: commit frequently, write meaningful commit messages
    • Commit any important versions of the code (e.g., associated with publications/reports)
    • Synchronise the tracked version of your code with services such as GitHub/GitLab/Bitbucket
    • The synchronised versions can be private if this is a single-user project, or if it is not ready or appropriate for publication
  7. Use “reproducible environments” for your projects, such that you keep a stable set of packages/modules
    • Within R, packages such as renv or packrat provide this kind of functionality
    • In Python, virtual environments (e.g., venv) or tools such as conda/Anaconda help manage this
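
As a rough illustration of tips 1, 3 and 4, here is a minimal Python sketch of a scripted cleaning step. The folder layout (`data/raw`, `data/processed`), the file names, the `value` column and the two cleaning rules are all made up for the example. The only point is that the raw file is read but never edited, and the cleaned copy is written somewhere else.

```python
import csv
from pathlib import Path


def clean_measurements(raw_path: Path, out_path: Path) -> int:
    """Read the raw CSV, drop unusable rows, and write a cleaned copy.

    The raw file is only ever opened for reading, so the original
    values stay untouched. Returns the number of rows kept.
    """
    kept = []
    with raw_path.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        for row in reader:
            # "value" is a hypothetical column name for this example
            value = (row.get("value") or "").strip()
            # Rule 1: drop rows with a missing measurement
            if value == "":
                continue
            # Rule 2: drop rows whose measurement is not a number
            try:
                float(value)
            except ValueError:
                continue
            row["value"] = value
            kept.append(row)

    # Write the cleaned rows to a separate folder (tip 1)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)


if __name__ == "__main__":
    raw = Path("data/raw/measurements.csv")  # hypothetical input file
    if raw.exists():
        n = clean_measurements(raw, Path("data/processed/measurements_clean.csv"))
        print(f"kept {n} rows")
```

Because every cleaning rule lives in the script, re-running it on the same raw file should give the same cleaned file, and the comments record why each row was dropped.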

This isn’t a strict list of priorities, but as you go further down the list, these practices take more time to learn and become familiar with. There shouldn’t be any reason not to get going with 1. and 2. at any stage of your work.
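
Getting going with 1. and 2. can be as small as one short script. The Python sketch below creates a folder layout and records a checksum for every raw data file. The folder names and function names are hypothetical choices, not a standard. The checksum manifest is not a backup in itself; it is just one way to check that your working copy (or a backup copy) of the raw data still matches the originals.

```python
import hashlib
from pathlib import Path

# Hypothetical folder names; any consistent convention works.
FOLDERS = ["data/raw", "data/processed", "code", "plots", "docs"]


def create_skeleton(project: Path) -> None:
    """Create the project folders without touching existing files."""
    for name in FOLDERS:
        (project / name).mkdir(parents=True, exist_ok=True)


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def raw_manifest(project: Path) -> dict[str, str]:
    """Map each file under data/raw to its checksum."""
    raw_dir = project / "data" / "raw"
    return {
        p.relative_to(raw_dir).as_posix(): sha256_of(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified between manifests."""
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )
```

A typical use would be to call `raw_manifest` once when the data first arrives, keep the result somewhere safe (for example, saved as JSON next to the README), and compare against a fresh manifest whenever you suspect a raw file has been edited.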

Who has the most to gain from this?

It’s “Data, data everywhere” in today’s world, so
it’s helpful to know how to navigate your path through!

These steps may well seem altogether too complicated, and more for the benefit of your supervisor/colleagues/company. In my experience, the main reason to do these is the very real and frequent issue of coming back to one’s own work 6 or 12 months later, and not remembering all of the details. These steps, particularly 1-6, make it easier to “jump back in” and either reproduce results you produced earlier or fix errors that have come to light long after the work was done. Most frequently, the person who gains the most from this is you.