Motivation

Spreadsheets are a common entry point for many types of analyses. Whilst they are great for exploring and visualising data in an interactive and intuitive manner, they are not ideal for many production-level analyses.

  • Tedious and time-consuming to repeatedly process multiple files
  • Error-prone
  • Unwieldy and difficult to deal with large amounts of data
  • Difficult for you (or someone else) to repeat what you did several months, or even years, down the line

[Figure: spreadsheet]

This course aims to translate how we think of data in spreadsheets to a series of operations that can be performed and chained together in R.
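
As a small illustration of what "chained together" means, the sketch below filters rows, adds a calculated column and previews the result, much as you might filter, add a formula column and scroll in a spreadsheet. It assumes R 4.1 or later (for the native `|>` pipe) and uses only base R functions on the built-in iris data; the column name `Sepal.Ratio` is made up for this example, and the tools introduced in the course may look different.

```r
# Each step takes the data frame produced by the previous step.
setosa_ratio <- iris |>
  subset(Species == "setosa") |>                          # keep one species (like a filter)
  transform(Sepal.Ratio = Sepal.Length / Sepal.Width) |>  # add a calculated column
  head(3)                                                 # look at the first few rows

print(setosa_ratio)
```

Because every step is written down, the same chain can be re-run on a new file without repeating any manual clicks.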

A data analysis can be broken down into several stages. There is, however, no such thing as a typical analysis. Most datasets we will encounter will have their own issues and problems that need fixing. We will also need to spend a lot of time visualising our data in different ways in order to gain insights. For every figure or table presented in a paper, there may be tens or hundreds of exploratory analyses that were generated along the way. We will show how such exploratory analysis can be performed in R.

[Figure: data-cycle]

(from Hadley Wickham’s workshop at useR2014)

Unfortunately, in R there are many hundreds (thousands!) of functions for us to choose from to achieve our goals, and everyone will have their own set of favourites. The tools we will meet today help us to explore data in a consistent and “pipeline-able” manner.

Hadley also has these words of advice that we should bear in mind as we proceed through the course.

Whenever you’re learning a new tool, for a long time you’re going to suck… But the good news is that is typical, that’s something that happens to everyone, and it’s only temporary.

[Figure: duke-scandal]

Two Biostatisticians (later termed ‘Forensic Bioinformaticians’) from M.D. Anderson used R extensively during their re-analysis and investigation of a Clinical Prognostication paper from Duke. The subsequent scandal put Reproducible Research at the forefront of everyone’s mind.

Keith Baggerly’s talk on the subject is highly recommended.

How can R enable Reproducible Research?

Within RStudio we can write markdown documents, which are a mix of R code and text. The markdown file can be used as a template to generate PDF, HTML, or even Word documents. The clever bit is that all R code in the template can be executed and the results (tables, graphics etc) displayed along with the code. The compiled document can be passed to your collaborators and they should be able to generate the same results. Alternatively, you can choose to hide the code if your PI just wants to see the results, and not necessarily which packages and parameters you used. Long-term R users may have heard of Sweave. Markdown is the same concept, but with a syntax that is easier to write (and read).

Markdown can also generate presentations and courses. Indeed, all the materials for this course were written in markdown.

[Figure: markdown document with numbered parts]

  1. Header information
  2. Section heading
  3. Plain text
  4. R code to be run
  5. Plain text
  6. R code to be run

Each line of R code can be executed in the R console by placing the cursor on the line and pressing CTRL + ENTER. You can also highlight multiple lines of code and run them together. NB: you do not need to include the backtick (```) symbols in the highlighted region.

Hitting the Knit button (*) will run all R code in order and (providing there are no errors!) you will get a PDF or HTML document. The resultant document will contain all the plain text you wrote, the R code, and any outputs (including graphs, tables etc) that R produced. You can then distribute this document to have a reproducible account of your analysis.
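
To make the six numbered parts and the Knit step more concrete, here is a minimal sketch that writes a small markdown file from R and then compiles it. The file name `example.Rmd` and its contents are invented for illustration and are not the course template. The call to `rmarkdown::render()` is the programmatic counterpart of the Knit button; it is only attempted if the rmarkdown package and pandoc happen to be available.

```r
# Three backticks open and close a chunk of R code in a markdown document.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                    # 1. header information
  'title: "Iris example"',
  'author: "Your Name"',
  "output: html_document",
  "---",
  "",
  "# Loading the data",                     # 2. section heading
  "",
  "The iris dataset ships with R.",         # 3. plain text
  "",
  paste0(fence, "{r}"),                     # 4. R code to be run
  "summary(iris)",
  fence,
  "",
  "Next we plot one of the variables.",     # 5. plain text
  "",
  paste0(fence, "{r}"),                     # 6. R code to be run
  "boxplot(Sepal.Length ~ Species, data = iris)",
  fence
)

# Hypothetical file name, written to a temporary directory
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Compile the document; skipped when rmarkdown or pandoc is not installed.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
}
```

Opening the generated `.Rmd` file in RStudio should show the same layout as the list above: a header, a heading, and alternating text and code chunks.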

How to use the template

  • Change your name, add a title and date in the header section
  • Add notes, explanations of code etc in the white space between code chunks. You can add new lines with ENTER. Clicking the ? next to the Knit HTML button will give more information about how to format this text. You can introduce bold and italics for example.
  • Some code chunks are left blank. These are for you to write the R code required to answer the questions
  • You can try to knit the document at any point to see how it looks

A Short Analysis Example

  • Open the file iris.Rmd
    • this contains the code we have just looked at to load and visualise the iris dataset
    • change the header information to contain your name and the date
    • compile the document by pressing the Knit HTML button. This should produce a file called iris.html
    • modify the Rmd file so that the boxplot of the Petal.Length variable is produced, rather than Sepal.Length
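
One possible version of the modified chunk is sketched below. It assumes the original plot was drawn with base R's `boxplot()` and grouped by `Species`; the code in iris.Rmd may differ (for example, it may use a plotting package), in which case only the variable name needs to change.

```r
# Assumption: the original chunk plotted Sepal.Length per species with base R.
# Swapping the variable on the left of the formula changes what is plotted.
petal_box <- boxplot(Petal.Length ~ Species, data = iris,
                     ylab = "Petal length (cm)",
                     main = "Petal length by species")
```

After making the change, knit the document again to check that the new plot appears in iris.html.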