Prerequisites

It’s assumed that you have some experience with programming and that you have already installed R and RStudio. If not, here are some resources for getting started:

Some videos in German for installing R, if you have not installed it before: R installieren (Windows) und RStudio installieren, Aufbau von RStudio.

1. Introduction

In the previous notebook we reviewed some of the basic functions of R. This helped us to understand the most frequent operators, the most basic classes of data, how to assign values, and some of the basic functions. However, during the previous session we were working online, and although we installed R, you have probably not yet started working locally on your own computer.

In this tutorial you will work in parallel with R from your computer. The goal is to prepare your computer for working on practical cases. As you know, R is a language and environment for statistical computing and graphics. It is also the tool used to create this tutorial.

The development of a statistical analysis is not restricted to displaying data or results or applying statistical techniques. Data analysis requires the development of a workflow: a sequence of tasks that processes a set of data. In statistics, how this workflow is documented and carried out is of vital importance. One of the major characteristics that a statistical analysis must meet is reproducibility.

Reproducibility is the ability of an analysis to be reproduced or replicated by others. Reproducibility is one of the pillars of the scientific method, falsifiability being the other.

Although there are conceptual differences according to the scientific discipline, in many disciplines, especially those involving the use of statistics and computational processes, it is understood that a study is reproducible if it is possible to recreate all the results exactly from the original data and the computer code used for the analyses.

2. R basics and workflows

Working with R from your computer requires much more than just using R as a calculator, applying statistical formulas and making graphs.

One day you will need to quit R, go do something else and return to your analysis later.

One day you will have multiple analyses going that use R and you want to keep them separate.

One day you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.

To handle these real life situations, you need to make two decisions:

  • What about your analysis is “real”, i.e. will you save it as your lasting record of what happened?
  • Where does your analysis “live”?

2.1 Workspace, .RData

As a beginning R user, it’s OK to consider your workspace “real”. Very soon, I urge you to evolve to the next level, where you consider your saved R scripts as “real”. (In either case, of course the input data is very much real and requires preservation!) With the input data and the R code you used, you can reproduce everything. You can make your analysis fancier. You can get to the bottom of puzzling results and discover and fix bugs in your code. You can reuse the code to conduct similar analyses in new projects. You can remake a figure with a different aspect ratio or save it as TIFF instead of PDF. You are ready to take questions. You are ready for the future.

If you regard your workspace as “real” (saving and reloading all the time), then if you need to redo an analysis … you’re going to either redo a lot of typing (making mistakes all the way) or have to mine your R history for the commands you used. Rather than becoming an expert on managing the R history, a better use of your time and psychic energy is to keep your “good” R code in a script for future reuse.

Because it can sometimes be useful, note that the commands you’ve recently run appear in the History pane.

But you don’t have to choose right now and the two strategies are not incompatible. Let’s demo the save / reload the workspace approach.

Upon quitting R, you have to decide if you want to save your workspace, for potential restoration the next time you launch R. Depending on your set up, R or your IDE, e.g. RStudio, will probably prompt you to make this decision.

Quit R/RStudio, either from the menu, using a keyboard shortcut, or by typing q() in the Console. You’ll get a prompt like this:

Save workspace image to ~/.Rdata?

Note where the workspace image is to be saved and then click “Save”.

Using your favorite method, visit the directory where image was saved and verify there is a file named .RData. You will also see a file .Rhistory, holding the commands submitted in your recent session.

Restart RStudio. In the Console you will see a line like this:

[Workspace loaded from ~/.RData]

indicating that your workspace has been restored. Look in the Workspace pane and you’ll see the same objects as before. In the History tab of the same pane, you should also see your command history. You’re back in business. This way of starting and stopping analytical work will not serve you well for long but it’s a start.

Working directory

Any process running on your computer has a notion of its “working directory”. In R, this is where R will look, by default, for files you ask it to load. It is also where, by default, any files you write to disk will go. Chances are your current working directory is the directory we inspected above, i.e. the one where RStudio wanted to save the workspace.

You can explicitly check your working directory with:

getwd()

It is also displayed at the top of the RStudio console.

As a beginning R user, it’s OK to let your home directory or any other weird directory on your computer be R’s working directory. Very soon, I urge you to evolve to the next level, where you organize your analytical projects into directories and, when working on project A, set R’s working directory to the associated directory.

Although I do not recommend it, in case you’re curious, you can set R’s working directory at the command line like so:

setwd("~/myNewProject")

Although I do not recommend it, you can also use RStudio’s Files pane to navigate to a directory and then set it as working directory from the menu: Session > Set Working Directory > To Files Pane Location. (You’ll see even more options there). Or within the Files pane, choose “More” and “Set As Working Directory”.

But there’s a better way. A way that also puts you on the path to managing your R work like an expert.

2.2 RStudio projects

Keeping all the files associated with a project organized together – input data, R scripts, analytical results, figures – is such a wise and common practice that RStudio has built-in support for this via its projects.

Let’s make one to use for the rest of this workshop/class. Do this: File > New Project…. The directory name you choose here will be the project name. Call it whatever you want (or follow me for convenience).

I created a directory, and therefore an RStudio project, called swc in my tmp directory, FYI.

setwd("~/tmp/swc")

Now check that the “home” directory for your project is the working directory of our current R process:

getwd()

I can’t print my output here because this document itself does not reside in the RStudio Project we just created.

Let’s enter a few commands in the Console, as if we are just beginning a project:

a <- 2
b <- -3
sig_sq <- 0.5
x <- runif(40)
y <- a + b * x + rnorm(40, sd = sqrt(sig_sq))

(avg_x <- mean(x))
## [1] 0.4971693
write(avg_x, "avg_x.txt")
plot(x, y)
abline(a, b, col = "purple")

dev.print(pdf, "toy_line_plot.pdf")
## png 
##   2

Let’s say this is a good start of an analysis and you’re ready to start preserving the logic and code. Visit the History tab of the upper right pane. Select these commands. Click “To Source”. Now you have a new pane containing a nascent R script. Click on the floppy disk to save. Give it a name ending in .R or .r; I used toy-line.r. Note that, by default, it will go in the directory associated with your project.

Quit RStudio. Inspect the folder associated with your project if you wish. Maybe view the PDF in an external viewer.

Restart RStudio. Notice that things, by default, restore to where we were earlier, e.g. objects in the workspace, the command history, which files are open for editing, where we are in the file system browser, the working directory for the R process, etc. These are all Good Things.

Change some things about your code. Top priority would be to set a sample size n at the top, e.g. n <- 40, and then replace all the hard-wired 40’s with n. Change some other minor-but-detectable stuff, e.g. alter the sample size n, the slope of the line b, the color of the line … whatever (a sketch of such a revised script appears after the list below). Practice the different ways to re-run the code:

Walk through it line by line, by keyboard shortcut (Command+Enter) or mouse (click “Run” in the upper right corner of the editor pane).

Source the entire document – equivalent to entering source('toy-line.r') in the Console – by keyboard shortcut (Shift+Command+S) or mouse (click “Source” in the upper right corner of the editor pane or select from the mini-menu accessible from the associated down triangle).

Source with echo from the Source mini-menu.
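For reference, a revised toy-line.r might look like the sketch below (the parameter values are illustrative, and your numbers will differ because the data are random):

n <- 40                              # sample size, now set once at the top
a <- 2                               # intercept
b <- -3                              # slope
sig_sq <- 0.5                        # error variance

x <- runif(n)
y <- a + b * x + rnorm(n, sd = sqrt(sig_sq))

(avg_x <- mean(x))
write(avg_x, "avg_x.txt")

plot(x, y)
abline(a, b, col = "purple")
dev.print(pdf, "toy_line_plot.pdf")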

Visit your figure in an external viewer to verify that the PDF is changing as you expect.

In your favorite OS-specific way, search your files for toy_line_plot.pdf and presumably you will find the PDF itself (no surprise) but also the script that created it (toy-line.r). This latter phenomenon is a huge win. One day you will want to remake a figure or just simply understand where it came from. If you rigorously save figures to file with R code and not ever ever ever the mouse or the clipboard, you will sing my praises one day. Trust me.

2.3 Scripts

So far you’ve been using the console to run code. That’s a great place to start, but you’ll find it gets cramped pretty quickly as you create more complex graphics and analyses. To give yourself more room to work, it’s a great idea to use the script editor. Open it up either by clicking the File menu and selecting New File, then R script, or by using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you’ll see four panes: the script editor in the upper left, the console below it, and the environment and files panes on the right.

The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it’s a good idea to save your scripts regularly and to back them up.

The script editor is also a great place to build up complex plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorise one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below. If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates not_cancelled. It will also move the cursor to the next statement (beginning with not_cancelled %>%). That makes it easy to run your complete script by repeatedly pressing Cmd/Ctrl + Enter.
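The code referred to above is not reproduced in this document; a sketch in the same spirit, assuming the dplyr and nycflights13 packages are installed, would be:

library(dplyr)
library(nycflights13)

# Keep only flights with recorded delays, i.e. those not cancelled
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

# Average departure delay per day
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay))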

Instead of running expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S. Doing this regularly is a great way to check that you’ve captured all the important parts of your code in the script.

I recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see what packages they need to install. Note, however, that you should never include install.packages() or setwd() in a script that you share. It’s very antisocial to change settings on someone else’s computer!
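For example, a script might begin like this (the packages named are only illustrative):

# Packages used in this analysis, listed at the top so that anyone
# receiving the script can immediately see what to install
library(dplyr)
library(ggplot2)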

When working through future chapters, I highly recommend starting in the editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won’t even think about it.

2.4 Stuff

It is traditional to save R scripts with a .R or .r suffix. Follow this convention unless you have some extraordinary reason not to.

Comments start with one or more # symbols. Use them. RStudio helps you (de)comment selected lines with Ctrl+Shift+C (Windows and Linux) or Command+Shift+C (Mac).
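A brief, hypothetical illustration:

# Simulate ten draws from a standard normal distribution
x <- rnorm(10)
mean(x)  # inline comments after code work too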

Clean out the workspace, i.e. pretend like you’ve just revisited this project after a long absence. Use the broom icon or rm(list = ls()). It is a good idea to do this, restart R (available from the Session menu), and re-run your analysis to truly check that the code you’re saving is complete and correct (or at least to rule out obvious problems!).
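A minimal sketch of this check, using the script created earlier:

# Clear all objects from the workspace (same effect as the broom icon)
rm(list = ls())
# Then restart R (Session > Restart R) and re-run the saved script:
source("toy-line.r")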

This workflow will serve you well in the future:

  • Create an RStudio project for an analytical project.
  • Keep inputs there (we’ll soon talk about importing).
  • Keep scripts there; edit them, run them in bits or as a whole from there.
  • Keep outputs there (like the PDF written above).
  • Avoid using the mouse for pieces of your analytical workflow, such as loading a dataset or saving a figure. This is terribly important for reproducibility and for making it possible to retrospectively determine how a numerical table or PDF was actually produced (searching the local disk on a filename, among .R files, will lead to the relevant script).

Many long-time users never save the workspace, never save .RData files (I’m one of them), never save or consult the history. Once/if you get to that point, there are options available in RStudio to disable the loading of .RData and permanently suppress the prompt on exit to save the workspace (go to Tools > Options > General).

For the record, when loading data into R and/or writing outputs to file, you can always specify the absolute path and thereby insulate yourself from the current working directory. This is rarely necessary when using RStudio projects properly.
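A hypothetical illustration (the file names are made up):

# Relative path: resolved against the current working directory,
# which an RStudio project sets to the project folder
dat <- read.csv("data/survey.csv")

# Absolute path: independent of the working directory
dat <- read.csv("/Users/me/projects/swc/data/survey.csv")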

3. Reproducible research and reproducible data analysis

Reproducible research, or reproducible data analysis, is the idea that data analyses, and more generally scientific claims, are published together with their data and software code so that others may verify the findings and build upon them. Outside academia the need for reproducibility is also increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually produced the analysis are available. This section will focus on literate statistical analysis tools, which allow one to publish data analyses in a single document so that others can easily execute the same analysis and obtain the same results. Here you can find a short article on Reproducible Research on Biostatistics, from an academic point of view.

One way to make work reproducible is to use literate statistical programming, or literate programming. The concept of literate programming comes from the computer scientist Don Knuth (note that even Don Knuth used R). The basic idea is to write programs while documenting the code at the same time. What can be called literate statistical programming is thus the practice of keeping both the code (and the data) for the statistical analysis and its documentation in the same document.

3.1 Literate Programming

As data analysts, our goal with collaborators is to build a reproducible analysis workflow. That means that all the instructions, the data, and the explanation of the methods used are collected so that any reader can reproduce the study. This is what is called a literate statistical analysis, and it is based on the same principles as reproducible research and literate programming.

In Knuth’s words: “Literate programming is a methodology that combines a programming language with a documentation language, thereby making programs more robust, more portable, more easily maintained, and arguably more fun to write than programs that are written only in a high-level language. The main idea is to treat a program as a piece of literature, addressed to human beings rather than to a computer. The program is also viewed as a hypertext document, rather like the World Wide Web.”

Literate Programming has many aliases and sometimes synonyms, including:

  • Reproducible research (RR)
  • Replicable science (RS)
  • Reproducible (data) analysis (RDA)
  • Dynamic data analysis
  • Dynamic report generation
  • Literate (data/statistical) analysis

This approach to statistical analysis requires a documentation language and a programming language. The first attempt in R (with a package called Sweave) used LaTeX as the documentation language and R as the programming language. Later, Yihui Xie, while he was a grad student at Iowa State University, developed the R package knitr, which supports other documentation languages such as Markdown (a simplified version of what is called a markup language), LaTeX and HTML, as well as other programming languages (Python, SQL, Bash, Rcpp, Stan, and others); see this webpage for further information. The output can then be exported to PDF and HTML or to other formats (Word document, slide show, notebook, handout, book, dashboard, webpage, package vignette, beamer, slidy, revealjs, or other formats) using tools such as Pandoc (see this webpage for further information).
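As a sketch of this toolchain in practice (assuming the rmarkdown package is installed and a file called analysis.Rmd exists):

library(rmarkdown)

# knitr executes the code chunks; Pandoc then converts the result
# into the requested output format
render("analysis.Rmd", output_format = "html_document")
render("analysis.Rmd", output_format = "pdf_document")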

3.2 Literate Programming, statistics and data visualization

In statistics, the ability to document both programming-language code and mathematical thought is critical to understandable, explainable, and reproducible data analysis. We will refer to the activities involved in statistical research and data analysis as statistical practice. These activities often involve computing, ranging from finding reference and background material to programming and computation.

Literate Statistical Analysis is a programming methodology, derived from Literate Programming, which encourages writing the documentation for data management and statistical analysis as the code for it is produced. All code and documentation are interwoven into each literate document. The resulting document should provide a clear description of the paths taken during the analyses to produce the working dataset and the descriptive, exploratory, and confirmatory analyses. It should describe results and lessons learned, both substantive and for statistical practice, as well as provide a means to reproduce all steps taken in the analysis, even those not used in a concise reconstruction. Literate Statistical Programming may be conceived as a stream of code chunks and human-readable text chunks.

Code chunks               Human-readable text chunks
load and prepare data     Describe the data
compute a result          Explain analysis
create a table or plot    Present a result

Tools:

There is a growing number of open-source tools that facilitate literate statistical programming (LSP).

  • R: software programming language / environment for statistical computing / graphics
  • RStudio: integrated development environment (IDE) for R
  • Sweave: R functionality for embedding R code in LaTeX documents
  • knitr: R functionality/package (subsumes Sweave)
  • LaTeX: document preparation system and markup language
  • HTML: standard markup language for web pages (HyperText Markup Language)
  • Markdown: plain-text formatting syntax easily convertible to HTML
  • pandoc: universal document converter

3.3 Literate Statistical Practice… in practice

Literate Statistical Programming includes the ability to read, write and communicate data in context, including an understanding of data sources and constructs, analytical methods and techniques applied — and the ability to describe the use case, application and resulting value.

Therefore, data literacy is an underlying component of Literate Statistical Practice. It refers to the ability to understand and communicate (visualize) in a common data language; it can make the difference between successfully deriving value from data and analytics and failing to do so. Further, data literacy is a component of digital dexterity, which in the private sector is considered the employee’s ability and desire to use existing and emerging technology to drive better business outcomes. You can learn more about data literacy from a firm’s perspective in this video; the second part is here.

If you have seen the videos, you can understand why both the people producing the data analysis and the people on the receiving end must have a foundational level of data literacy in order to communicate effectively via the data. But how can this be defined? It is expected that during this course you will learn how to develop a workflow that makes it possible for you to efficiently develop dashboards and visualisations of data adapted to your target audience.

In practice, many literary styles are possible. Each style reflects both how the statistical practitioner (data analyst) views the problem and the message to be conveyed to the target audience. While literate programming assumes a minimal level of programming competence on the part of the reader (final users usually do not check software code), literate statistical practice has many possible audiences, including other statisticians, customers, colleagues from other departments, students, and scientists with minimal statistical background.

One approach to literate statistical practice may be the one developed by a consultant. This style may involve a LaTeX article format, sectioning the various components of a short-term consulting project. The sections of the consultancy may be ordered like this:

  1. Introduction to the problem;
  2. Planned approach;
  3. Data management;
  4. Descriptive statistics;
  5. Inferential statistics;
  6. Discussion description;
  7. Conclusions and lessons for the analyst to remember, including the bibliography of methods employed and the practical issues (coding, data handling) associated with the methods;
  8. The concluding report for the consulting client.

R Notebooks and literate statistical analysis

R Notebooks, like Jupyter Notebooks (formerly known as IPython Notebooks), are ubiquitous in modern data analysis. The notebook format allows statistical code and its output to be viewed on any computer in a logical and reproducible manner, avoiding both the confusion caused by unclear code and the inevitable “it only works on my system” curse.

R Notebooks are a format maintained by RStudio, which develops and maintains a large number of open source R packages and tools, most notably the free-for-consumer RStudio R IDE. More specifically, R Notebooks are an extension of the earlier R Markdown .Rmd format, useful for rendering analyses into HTML/PDFs, or other cool formats.

Instead of having separate cells for code and text, an R Markdown file is all plain text. Code cells are delimited by three backticks and shown with a gray background in RStudio, which makes it easy to enter a code block, easy to identify code blocks at a glance, and easy to execute a notebook block by block. Each cell also has a green indicator bar which shows which code is running and which code is queued, line by line.

For Notebook files, an HTML webpage is automatically generated whenever the file is saved, and it can immediately be viewed in any browser (the generated webpage stores the cell output and any necessary dependencies).

In R Notebooks, each block of R input code executes in its own cell, and the output of the block appears inline; this allows the user to iterate on the results, both to make the data transformations explicit and to make sure the results are as expected.

Using Notebooks

You can create a new notebook in RStudio with the menu command File -> New File -> R Notebook, or by using the html_notebook output type in your document’s YAML metadata.

An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input. For example:
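A minimal sketch of an R Notebook source file (the title and chunk contents are illustrative):

---
title: "My analysis"
output: html_notebook
---

Text written in Markdown, explaining what the chunk below does.

```{r}
# An executable R chunk; its output appears immediately beneath it
summary(cars)
```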

By default, RStudio enables inline output (notebook mode) on all R Markdown documents, so you can interact with any R Markdown document as though it were a notebook. If you have a document with which you prefer the traditional console method of interaction, you can disable notebook mode by clicking the gear button in the editor toolbar and choosing Chunk Output in Console. You can find a good introduction to R Markdown here.