Spreadsheets are a common entry point for many types of analysis. Whilst they are great for exploring and visualising data in an interactive and intuitive manner, they are not ideal for many production-level analyses.
This course aims to translate how we think of data in spreadsheets into a series of operations that can be performed and chained together in R.
A data analysis can be broken down into several stages. There is, however, no such thing as a typical analysis. Most datasets we encounter will have their own issues and problems that need fixing. We will also need to spend a lot of time visualising our data in different ways in order to gain insights. For every figure or table presented in a paper, there may be tens or hundreds of exploratory analyses that were generated along the way. We will show how such exploratory analysis can be performed in R.
(from Hadley Wickham’s workshop at useR2014)
Unfortunately, in R there are many hundreds (thousands!) of functions for us to choose from to achieve our goals, and everyone will have their own set of favourites. The tools we will meet today help us to explore data in a consistent and “pipeline-able” manner.
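As a rough illustration of what “pipeline-able” means, the sketch below chains three spreadsheet-like steps (filter rows, add a calculated column, summarise by group) on R's built-in iris data. It uses only base R and the native pipe `|>`, which needs R 4.1 or later; that choice is an assumption made to keep the example self-contained, and the tools met later in the course may express the same steps with different functions. The threshold of 1.5 and the `Petal.Ratio` column are made up purely for illustration.

```r
# iris ships with R: 150 flowers, four measurements and a Species column.

# Steps 1 and 2: keep rows with longer petals, then add a calculated column.
# Each step receives the result of the previous one via the pipe.
long_petals <- iris |>
  subset(Petal.Length > 1.5) |>
  transform(Petal.Ratio = Petal.Length / Petal.Width)

# Step 3: summarise the new column per species (one row per group).
species_means <- aggregate(Petal.Ratio ~ Species, data = long_petals, FUN = mean)

print(species_means)
```

Each line corresponds to something we might do by hand in a spreadsheet (a filter, a formula column, a pivot table), but written down so that it can be re-run unchanged on new data.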
Hadley also has these words of advice that we should bear in mind as we proceed through the course.
“Whenever you’re learning a new tool, for a long time you’re going to suck… But the good news is that is typical, that’s something that happens to everyone, and it’s only temporary.”
Two Biostatisticians (later termed ‘Forensic Bioinformaticians’) from M.D. Anderson used R extensively during their re-analysis and investigation of a Clinical Prognostication paper from Duke. The subsequent scandal put Reproducible Research at the forefront of everyone’s mind.
Keith Baggerly’s talk on the subject is highly recommended.
Within RStudio we can write markdown documents, which are a mix of R code and text. The markdown file can be used as a template to generate PDF, HTML, or even Word documents. The clever bit is that all R code in the template can be executed and the results displayed (tables, graphics etc.) along with the code. The compiled document can be passed to your collaborators and they should be able to generate the same results. Alternatively, you can choose to hide the code if your PI just wants to see the results, and not necessarily which packages and parameters you used. Long-term R users may have heard of Sweave. Markdown is the same concept, but with a syntax that is easier to write (and read).
Markdown can also generate presentations and courses. Indeed, all the materials for this course were written in markdown.
Each line of R code can be executed in the R console by placing the cursor on the line and pressing CTRL + ENTER. You can also highlight multiple lines of code and run them together. NB: you do not need to include the backtick (```) symbols in the highlighted region.
Hitting the Knit button will run all R code in order and (providing there are no errors!) you will get a PDF or HTML document. The resultant document will contain all the plain text you wrote, the R code, and any outputs (including graphs, tables etc.) that R produced. You can then distribute this document to have a reproducible account of your analysis.
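To make the structure of such a document concrete, here is a minimal sketch that writes a tiny markdown file from R and then knits it programmatically. The file contents (title, text and the single code chunk) are invented for illustration and are not one of the course files. The call to `rmarkdown::render()` is the scripted counterpart of pressing Knit; it is only attempted if the rmarkdown package and pandoc happen to be installed, which is an assumption about your setup.

```r
# Three backticks open and close a code chunk inside a markdown document.
fence <- strrep("`", 3)

# The lines of a minimal document: a header, some formatted text, one R chunk.
rmd_lines <- c(
  "---",
  'title: "Iris example"',
  "output: html_document",
  "---",
  "",
  "Some plain text with **bold** and *italic* words.",
  "",
  paste0(fence, "{r}"),
  "summary(iris$Petal.Length)",
  "hist(iris$Petal.Length)",
  fence
)

# Save the document to a temporary file with the .Rmd extension.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Knit from code, but only when the required tools are available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  message("Compiled document written to: ", html_file)
}
```

The compiled HTML file would contain the sentence of text, the two lines of R code, and the summary table and histogram they produce.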
Clicking the ? next to the Knit HTML button will give more information about how to format this text. You can introduce bold and italics, for example.
Exercise: iris.Rmd analyses the iris dataset and compiles to iris.html. Modify the document so that output for the Petal.Length variable is produced, rather than for Sepal.Length.