Overview of this section

We will now turn out attention to some of the more advanced operations you can perform on data frames using dplyr.

Resources

There is a very useful cheat sheet on dplyr from the creators of RStudio.

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Summarising data

We can compute summary statistics for selected columns in our dataset using the summarise verb. For example, we could use summarise to calculate the average length of petals in the iris dataset.

# let's first remind ourselves what the iris dataset looks like
tbl_df(iris)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with 140 more rows
# now let's compute the mean petal length
summarise(iris, mean(Petal.Length))
  mean(Petal.Length)
1              3.758

If you prefer Oxford spelling, in which -ize is preferred to -ise, you’re in luck as dplyr accommodates the alternative spelling.

Note that the result is a data frame consisting in this case of a single row and a single column, unlike the more usual way of calculating the mean value for a vector or column, which results in a single numeric value (actually in R this is numeric vector of length 1).

mean(iris$Petal.Length)
[1] 3.758

Returning a data frame might be quite useful, particularly if we’re summarising multiple columns or using more than one function, for example computing the average and standard deviation.

summarise(iris, average = mean(Petal.Length), standard_deviation = sd(Petal.Length))
  average standard_deviation
1   3.758           1.765298

summarise collapses data into a single row of values. Notice how we also named the output columns in this last example.

Summary functions

summarise can take any R function that takes a vector of values and returns a single value. Some of the more useful functions include:

  • min minimum value
  • max maximum value
  • sum sum of values
  • mean mean value
  • sd standard deviation
  • median median value
  • IQR the interquartile range
  • n the number of values
  • n_distinct the number of distinct values

It is also possible to summarise using a function that takes more than one value, i.e. from multiple columns. For example, we could compute the correlation between petal width and length:

summarise(iris, correlation = cor(Petal.Width, Petal.Length))
  correlation
1   0.9628654

Summarizing multiple columns

We can apply the same function to each column using summarise_each.

summarise_each(iris, funs(mean))
Warning in mean.default(structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
argument is not numeric or logical: returning NA
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1     5.843333    3.057333        3.758    1.199333      NA

We got a warning from the mean function complaining about being given values that are not numerical. Can you spot what caused this?

A look at the documentation for the summarise_each function (type ‘?summarise_each’ at the command prompt) shows that it can take additional arguments to specify each of the columns of interest.

summarise_each(iris, funs(mean), Petal.Length, Petal.Width)
  Petal.Length Petal.Width
1        3.758    1.199333

Just like with the select operation, we can instead specify those columns to exclude.

summarise_each(iris, funs(mean), -Species)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.843333    3.057333        3.758    1.199333

Or we can use one of the helper functions to choose which columns to operate on.

summarise_each(iris, funs(mean), starts_with("Petal"))
summarise_each(iris, funs(mean), ends_with("Length"))

summarise_each(iris, funs(mean), contains("."))

# use regular expression to select columns on which to operate
summarise_each(iris, funs(mean), matches("^S.*th$"))

# use one_of if you have a vector of column names
columns <- c("Petal.Length", "Sepal.Width")
summarise_each(iris, funs(mean), one_of(columns))

Mutating multiple columns

When looking at the help documentation for summarise_each you may have noticed a related function, mutate_each. This is a very useful function for applying the same operation to several columns within a table. For example, in the iris dataset the measurements are in centimetres but we can convert these to millimetres quite easily.

mutate_each(iris, funs(. * 10), -Species) %>% tbl_df
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1            51          35           14           2  setosa
2            49          30           14           2  setosa
3            47          32           13           2  setosa
4            46          31           15           2  setosa
5            50          36           14           2  setosa
6            54          39           17           4  setosa
7            46          34           14           3  setosa
8            50          34           15           2  setosa
9            44          29           14           2  setosa
10           49          31           15           1  setosa
# ... with 140 more rows

Here, we’ve had to specify the ‘.’ character to represent the columns that we’re multiplying by 10.

Rounding

mutate_each is particularly useful for rounding values to a specified number of decimal places or significant figures.

# to show rounding in action let's read in a cleaned version of the patient dataset
patients <- read.delim("patient-data-cleaned.txt", stringsAsFactors = FALSE) %>% tbl_df
patients
# A tibble: 100 × 14
          ID      Name     Race   Age    Sex Smokes  Height Weight
       <chr>     <chr>    <chr> <int>  <chr>  <lgl>   <dbl>  <dbl>
1  AC/AH/001 Demetrius    White    43   Male   TRUE 176.479 77.071
2  AC/AH/017   Rosario    White    42   Male   TRUE 174.748 83.268
3  AC/AH/020     Julio    Black    43   Male  FALSE 173.945 85.632
4  AC/AH/022      Lupe    White    43   Male  FALSE 180.029 88.720
5  AC/AH/029    Lavern      Cat    42   Male  FALSE 178.637 97.054
6  AC/AH/033    Bernie   Native    42 Female  FALSE 159.778 68.312
7  AC/AH/037    Samuel    White    43   Male   TRUE 170.950 95.658
8  AC/AH/044     Clair    White    42 Female   TRUE 161.827 66.029
9  AC/AH/045   Shirley    White    42   Male  FALSE 169.662 77.295
10 AC/AH/048     Merle Hispanic    43 Female  FALSE 160.121 69.596
# ... with 90 more rows, and 6 more variables: Birth <chr>, State <chr>,
#   Pet <chr>, Grade <int>, Died <lgl>, Date <chr>
# now let's round the patient's weight and height to 1 decimal place
patients %>% mutate_each(funs(round(., digits = 1)), Height, Weight)
# A tibble: 100 × 14
          ID      Name     Race   Age    Sex Smokes Height Weight
       <chr>     <chr>    <chr> <int>  <chr>  <lgl>  <dbl>  <dbl>
1  AC/AH/001 Demetrius    White    43   Male   TRUE  176.5   77.1
2  AC/AH/017   Rosario    White    42   Male   TRUE  174.7   83.3
3  AC/AH/020     Julio    Black    43   Male  FALSE  173.9   85.6
4  AC/AH/022      Lupe    White    43   Male  FALSE  180.0   88.7
5  AC/AH/029    Lavern      Cat    42   Male  FALSE  178.6   97.1
6  AC/AH/033    Bernie   Native    42 Female  FALSE  159.8   68.3
7  AC/AH/037    Samuel    White    43   Male   TRUE  170.9   95.7
8  AC/AH/044     Clair    White    42 Female   TRUE  161.8   66.0
9  AC/AH/045   Shirley    White    42   Male  FALSE  169.7   77.3
10 AC/AH/048     Merle Hispanic    43 Female  FALSE  160.1   69.6
# ... with 90 more rows, and 6 more variables: Birth <chr>, State <chr>,
#   Pet <chr>, Grade <int>, Died <lgl>, Date <chr>

Grouping

While the summarise function is useful on its own, it becomes really powerful when applied to groups of observations within a dataset. For example, suppose we want to compute the mean petal length for each of the species in the iris dataset. We could take each species in turn and filter the data frame to only contain rows for a given species, then apply summarise, but that would be somewhat cumbersome. In dplyr, the group_by function allows this to be done in one simple step.

iris %>% group_by(Species) %>% summarise(mean(Petal.Length))
# A tibble: 3 × 2
     Species `mean(Petal.Length)`
      <fctr>                <dbl>
1     setosa                1.462
2 versicolor                4.260
3  virginica                5.552

As before, we can summarise multiple observations.

iris %>% group_by(Species) %>% summarise(n(), mean(Petal.Length), sd(Petal.Length))
# A tibble: 3 × 4
     Species `n()` `mean(Petal.Length)` `sd(Petal.Length)`
      <fctr> <int>                <dbl>              <dbl>
1     setosa    50                1.462          0.1736640
2 versicolor    50                4.260          0.4699110
3  virginica    50                5.552          0.5518947

We can make the output more presentable by renaming the columns and using the round function to round to a specified number of significant figures. Note the use of backticks ( ` ) for specifying column names that contain spaces.

iris %>%
  group_by(Species) %>%
  summarise(
    N = n(),
    `Average petal length` = mean(Petal.Length),
    `Standard deviation petal length` = sd(Petal.Length)
  ) %>%
  mutate_each(funs(signif(., digits = 2)), `Average petal length`, `Standard deviation petal length`)
# A tibble: 3 × 4
     Species     N `Average petal length`
      <fctr> <int>                  <dbl>
1     setosa    50                    1.5
2 versicolor    50                    4.3
3  virginica    50                    5.6
# ... with 1 more variables: `Standard deviation petal length` <dbl>

A shorthand for summarise(n()) for counting the number of observations of each group is available in the form of the count function.

# these both give the same output
iris %>% group_by(Species) %>% summarise(n = n())
count(iris, Species)
# A tibble: 3 × 2
     Species     n
      <fctr> <int>
1     setosa    50
2 versicolor    50
3  virginica    50

group_by can also be used in conjunction with other dplyr verbs.

iris %>% group_by(Species) %>% summarise_each(funs(mean))
# A tibble: 3 × 5
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
      <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

Returning to one of the earlier examples, we can also compute the correlation between petal width and length on a per-group basis.

iris %>% group_by(Species) %>% summarise(correlation = cor(Petal.Width, Petal.Length))
# A tibble: 3 × 2
     Species correlation
      <fctr>       <dbl>
1     setosa   0.3316300
2 versicolor   0.7866681
3  virginica   0.3221082

Hold on a minute, wasn’t the correlation coefficient for the overall dataset quite a lot higher? (yes, it was 0.96). What’s going on here? A plot might help to understand why.

library(ggplot2)
qplot(Petal.Width, Petal.Length, data = iris, colour = Species)

But we’re getting ahead of ourselves! Plotting with ggplot2 is for the next part of the course.

group_by - the technical details

Some of you might be wondering what’s going on under the hood with this group_by function. The help page for group_by is a little on the technical side but essentially tells us that the data frame we pass it gets converted into a grouped_df data frame object. dplyr functions that operate on a grouped_df object know to treat this in a special way, operating on each group separately. The following sequence of R commands might help to make this a bit clearer.

# first let's take a look at the class of the iris data frame
class(iris)
[1] "data.frame"
# now we'll create a grouped version with group_by and look at its class
iris_grouped <- group_by(iris, Species)
class(iris_grouped)
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
# the groups function lets us see the groups
groups(iris_grouped)
[[1]]
Species
# the ungroup function removes the grouping
iris_ungrouped <- ungroup(iris_grouped)
class(iris_ungrouped)
[1] "tbl_df"     "tbl"        "data.frame"
groups(iris_ungrouped)
NULL

ungroup can be quite helpful in more complicated chains of dplyr operations where the grouping is only required for some of the steps and would have unintended consequences for subsequent operations within the chain.

Exercise: Rmarkdown Template

Read a cleaned version of the patients dataset (patient-data-cleaned.txt) into RStudio and use summarise, summarise_each, mutate_each and group_by to answer the following questions:

  • What is the mean age, height and weight of patients in the patients dataset?
    • Modify the output by rounding these computed means to 2 decimal places
  • See what happens if you try to compute the mean of a logical (boolean) variable
    • What proportion of our patient cohort has died?
  • Compare the average height of males and females in this patient cohort.
  • Are smokers heavier or lighter on average than non-smokers in this dataset?

Joining

In many real life situations, data are spread across multiple tables or spreadsheets. Usually this occurs because different types of information about a subject, e.g. a patient, are collected from different sources. It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a common column, for example, an attribute that uniquely identifies the subject.

dplyr provides a set of join functions for combining two data frames based on matches within specified columns. These operations are very similar to carrying out join operations between tables in a relational database using SQL.

left_join

To illustrate join operations we’ll first consider the most common type, a “left join”. In the schematic below the two data frames share a common column, V1. We can combine the two data frames into a single data frame by matching rows in the first data frame with those in the second data frame that share the same value of V1.

dplyr_join

dplyr_join

left_join returns all rows from the first data frame regardless of whether there is a match in the second data frame. Rows with no match are included in the resulting data frame but have NA values in the additional columns coming from the second data frame. Here’s an example in which data about some interesting people are contained in two tables. The name column identifies each of the people concerned and is used for matching rows from the two tables.

# let's create some data frames that we can join
df1 <- data_frame(
  name = c("Matt", "Mark", "Tom"),
  home = c("Harston", "Histon", "London"),
  institute = c("CRUK-CI", "CRUK-CI", "MRC-CSC")
)
df2 <- data_frame(
  name = c("Matt", "Mark", "Jane"),
  colour = c("blue", "green", "cerise")
)
df1
# A tibble: 3 × 3
   name    home institute
  <chr>   <chr>     <chr>
1  Matt Harston   CRUK-CI
2  Mark  Histon   CRUK-CI
3   Tom  London   MRC-CSC
df2
# A tibble: 3 × 2
   name colour
  <chr>  <chr>
1  Matt   blue
2  Mark  green
3  Jane cerise
left_join(df1, df2, by = "name")
# A tibble: 3 × 4
   name    home institute colour
  <chr>   <chr>     <chr>  <chr>
1  Matt Harston   CRUK-CI   blue
2  Mark  Histon   CRUK-CI  green
3   Tom  London   MRC-CSC   <NA>

right_join is similar but returns all rows from the second data frame that have a match with rows in the first data frame based on the specified column.

right_join(df1, df2, by = "name")
# A tibble: 3 × 4
   name    home institute colour
  <chr>   <chr>     <chr>  <chr>
1  Matt Harston   CRUK-CI   blue
2  Mark  Histon   CRUK-CI  green
3  Jane    <NA>      <NA> cerise

inner_join

It is also possible to return only those rows where matches could be made. The inner_join function accomplishes this.

dplyr_join

dplyr_join

inner_join(df1, df2, by = "name")
# A tibble: 2 × 4
   name    home institute colour
  <chr>   <chr>     <chr>  <chr>
1  Matt Harston   CRUK-CI   blue
2  Mark  Histon   CRUK-CI  green

full_join

We’ve seen how missing rows from one table can be retained in the joined data frame using left_join or right_join but sometimes data for a given subject may be missing from either of the tables and we still want that subject to appear in the combined table. A full_join will return all rows and all columns from the two tables and where there are no matching values, NA values are used to fill in the missing values.

dplyr_join

dplyr_join

full_join(df1, df2, by = "name")
# A tibble: 4 × 4
   name    home institute colour
  <chr>   <chr>     <chr>  <chr>
1  Matt Harston   CRUK-CI   blue
2  Mark  Histon   CRUK-CI  green
3   Tom  London   MRC-CSC   <NA>
4  Jane    <NA>      <NA> cerise

Multiple matches in join operations

Where there are multiple rows in one or both of the two tables, these will be represented in the combined table. All combinations of the matching rows will be included.

df1 <- data_frame(
  name = c("Matt", "Mark", "Mark", "Tom"),
  home = c("Harston", "Histon", "Hardwick", "London"),
  institute = c("CRUK-CI", "CRUK-CI", "IPH", "MRC-CSC")
)
df2 <- data_frame(
  name = c("Matt", "Mark", "Mark", "Jane"),
  colour = c("blue", "yellow", "green", "cerise")
)
left_join(df1, df2, by = "name")
# A tibble: 6 × 4
   name     home institute colour
  <chr>    <chr>     <chr>  <chr>
1  Matt  Harston   CRUK-CI   blue
2  Mark   Histon   CRUK-CI yellow
3  Mark   Histon   CRUK-CI  green
4  Mark Hardwick       IPH yellow
5  Mark Hardwick       IPH  green
6   Tom   London   MRC-CSC   <NA>

Joining by matching on multiple columns

We can specify multiple columns to use for matching rows. This is useful where the uniquely-identifying information for a subject is containing in more than one column. The following extends our simple example to use both name and home for matching rows.

# let's create some data frames that we can join
df1 <- data_frame(
  name = c("Matt", "Mark", "Mark", "Tom"),
  home = c("Harston", "Histon", "Hardwick", "London"),
  institute = c("CRUK-CI", "CRUK-CI", "IPH", "MRC-CSC")
)
df2 <- data_frame(
  name = c("Matt", "Mark", "Mark", "Jane"),
  institute = c("CRUK-CI", "CRUK-CI", "IPH", "MRC-CSC"),
  colour = c("blue", "yellow", "green", "cerise")
)
left_join(df1, df2, by = c("name", "institute"))
# A tibble: 4 × 4
   name     home institute colour
  <chr>    <chr>     <chr>  <chr>
1  Matt  Harston   CRUK-CI   blue
2  Mark   Histon   CRUK-CI yellow
3  Mark Hardwick       IPH  green
4   Tom   London   MRC-CSC   <NA>

Filtering joins

A variation on the join operations we’ve considered are semi_join and anti_join that filter the rows in one table based on matches or lack of matches to rows in another table.

# semi_join returns all rows from the first table where there are matches in the other table
semi_join(df1, df2, by = "name")
# A tibble: 3 × 3
   name     home institute
  <chr>    <chr>     <chr>
1  Matt  Harston   CRUK-CI
2  Mark   Histon   CRUK-CI
3  Mark Hardwick       IPH
# anti_join returns all rows where there is no match in the other table
anti_join(df1, df2, by = "name")
# A tibble: 1 × 3
   name   home institute
  <chr>  <chr>     <chr>
1   Tom London   MRC-CSC

Exercise

The exercise uses a more realistic dataset, building on the patients table we’ve already been working with.

The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates.

diabetes <- read.delim("diabetes.txt", stringsAsFactors = FALSE) %>% tbl_df
diabetes
# A tibble: 1,316 × 4
          ID       Date Glucose    BP
       <chr>      <chr>   <int> <int>
1  AC/AH/001 2011-03-07     100    98
2  AC/AH/001 2011-03-14     110    89
3  AC/AH/001 2011-03-24      94    88
4  AC/AH/001 2011-03-31     111    92
5  AC/AH/001 2011-04-03      94    83
6  AC/AH/001 2011-05-21     110    93
7  AC/AH/001 2011-06-24     105    79
8  AC/AH/001 2011-07-11      88    86
9  AC/AH/001 2011-07-11     101    92
10 AC/AH/001 2011-07-13     112    88
# ... with 1,306 more rows
  • Find the number of visits made by each patient to the hospital to have blood glucose and pressure measurements taken.

  • Find the date of the last visit made by each patient.

  • Using the ID column to connect information about patients, add these values as additional columns to the patients table.
    • Can you do this as a single sequence of dplyr commands using the %>% symbol?