We will now turn our attention to some of the more advanced operations you can perform on data frames using dplyr.
There is a very useful cheat sheet on dplyr from the creators of RStudio.
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
We can compute summary statistics for selected columns in our dataset using the summarise verb. For example, we could use summarise to calculate the average length of petals in the iris dataset.
# let's first remind ourselves what the iris dataset looks like
tbl_df(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
# now let's compute the mean petal length
summarise(iris, mean(Petal.Length))
mean(Petal.Length)
1 3.758
If you prefer Oxford spelling, in which -ize is preferred to -ise, you’re in luck, as dplyr accommodates the alternative spelling (summarize).
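For example, the following call (shown simply to illustrate the alias) gives the same result as the summarise call above:
# summarize is an alias for summarise
summarize(iris, mean(Petal.Length))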
Note that the result is a data frame, in this case consisting of a single row and a single column. This is unlike the more usual way of calculating the mean value for a vector or column, which results in a single numeric value (strictly speaking, in R this is a numeric vector of length 1).
mean(iris$Petal.Length)
[1] 3.758
Returning a data frame might be quite useful, particularly if we’re summarising multiple columns or using more than one function, for example computing the average and standard deviation.
summarise(iris, average = mean(Petal.Length), standard_deviation = sd(Petal.Length))
average standard_deviation
1 3.758 1.765298
summarise collapses data into a single row of values. Notice how we also named the output columns in this last example.
summarise can take any R function that takes a vector of values and returns a single value. Some of the more useful functions include:
min - minimum value
max - maximum value
sum - sum of values
mean - mean value
sd - standard deviation
median - median value
IQR - the interquartile range
n - the number of values
n_distinct - the number of distinct values
It is also possible to summarise using a function that takes more than one value, i.e. from multiple columns. For example, we could compute the correlation between petal width and length:
summarise(iris, correlation = cor(Petal.Width, Petal.Length))
correlation
1 0.9628654
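As a quick illustration of some of the other summary functions listed above (just one possible combination), several can be used together in a single summarise call:
summarise(iris,
          n = n(),
          min_petal_length = min(Petal.Length),
          median_petal_length = median(Petal.Length),
          max_petal_length = max(Petal.Length))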
We can apply the same function to each column using summarise_each.
summarise_each(iris, funs(mean))
Warning in mean.default(structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
argument is not numeric or logical: returning NA
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.843333 3.057333 3.758 1.199333 NA
We got a warning from the mean function complaining about being given values that are not numerical. Can you spot what caused this?
A look at the documentation for the summarise_each function (type ‘?summarise_each’ at the command prompt) shows that it can take additional arguments specifying the columns of interest.
summarise_each(iris, funs(mean), Petal.Length, Petal.Width)
Petal.Length Petal.Width
1 3.758 1.199333
Just like with the select operation, we can instead specify those columns to exclude.
summarise_each(iris, funs(mean), -Species)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.843333 3.057333 3.758 1.199333
Or we can use one of the helper functions to choose which columns to operate on.
summarise_each(iris, funs(mean), starts_with("Petal"))
summarise_each(iris, funs(mean), ends_with("Length"))
summarise_each(iris, funs(mean), contains("."))
# use regular expression to select columns on which to operate
summarise_each(iris, funs(mean), matches("^S.*th$"))
# use one_of if you have a vector of column names
columns <- c("Petal.Length", "Sepal.Width")
summarise_each(iris, funs(mean), one_of(columns))
When looking at the help documentation for summarise_each you may have noticed a related function, mutate_each. This is a very useful function for applying the same operation to several columns within a table. For example, in the iris dataset the measurements are in centimetres but we can convert these to millimetres quite easily.
mutate_each(iris, funs(. * 10), -Species) %>% tbl_df
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 51 35 14 2 setosa
2 49 30 14 2 setosa
3 47 32 13 2 setosa
4 46 31 15 2 setosa
5 50 36 14 2 setosa
6 54 39 17 4 setosa
7 46 34 14 3 setosa
8 50 34 15 2 setosa
9 44 29 14 2 setosa
10 49 31 15 1 setosa
# ... with 140 more rows
Here, we’ve had to use the ‘.’ character to stand for each column’s values, so funs(. * 10) multiplies every value in the selected columns by 10. mutate_each is particularly useful for rounding values to a specified number of decimal places or significant figures.
# to show rounding in action let's read in a cleaned version of the patient dataset
patients <- read.delim("patient-data-cleaned.txt", stringsAsFactors = FALSE) %>% tbl_df
patients
# A tibble: 100 × 14
ID Name Race Age Sex Smokes Height Weight
<chr> <chr> <chr> <int> <chr> <lgl> <dbl> <dbl>
1 AC/AH/001 Demetrius White 43 Male TRUE 176.479 77.071
2 AC/AH/017 Rosario White 42 Male TRUE 174.748 83.268
3 AC/AH/020 Julio Black 43 Male FALSE 173.945 85.632
4 AC/AH/022 Lupe White 43 Male FALSE 180.029 88.720
5 AC/AH/029 Lavern Cat 42 Male FALSE 178.637 97.054
6 AC/AH/033 Bernie Native 42 Female FALSE 159.778 68.312
7 AC/AH/037 Samuel White 43 Male TRUE 170.950 95.658
8 AC/AH/044 Clair White 42 Female TRUE 161.827 66.029
9 AC/AH/045 Shirley White 42 Male FALSE 169.662 77.295
10 AC/AH/048 Merle Hispanic 43 Female FALSE 160.121 69.596
# ... with 90 more rows, and 6 more variables: Birth <chr>, State <chr>,
# Pet <chr>, Grade <int>, Died <lgl>, Date <chr>
# now let's round the patient's weight and height to 1 decimal place
patients %>% mutate_each(funs(round(., digits = 1)), Height, Weight)
# A tibble: 100 × 14
ID Name Race Age Sex Smokes Height Weight
<chr> <chr> <chr> <int> <chr> <lgl> <dbl> <dbl>
1 AC/AH/001 Demetrius White 43 Male TRUE 176.5 77.1
2 AC/AH/017 Rosario White 42 Male TRUE 174.7 83.3
3 AC/AH/020 Julio Black 43 Male FALSE 173.9 85.6
4 AC/AH/022 Lupe White 43 Male FALSE 180.0 88.7
5 AC/AH/029 Lavern Cat 42 Male FALSE 178.6 97.1
6 AC/AH/033 Bernie Native 42 Female FALSE 159.8 68.3
7 AC/AH/037 Samuel White 43 Male TRUE 170.9 95.7
8 AC/AH/044 Clair White 42 Female TRUE 161.8 66.0
9 AC/AH/045 Shirley White 42 Male FALSE 169.7 77.3
10 AC/AH/048 Merle Hispanic 43 Female FALSE 160.1 69.6
# ... with 90 more rows, and 6 more variables: Birth <chr>, State <chr>,
# Pet <chr>, Grade <int>, Died <lgl>, Date <chr>
While the summarise function is useful on its own, it becomes really powerful when applied to groups of observations within a dataset. For example, suppose we want to compute the mean petal length for each of the species in the iris dataset. We could take each species in turn and filter the data frame to only contain rows for a given species, then apply summarise, but that would be somewhat cumbersome. In dplyr, the group_by function allows this to be done in one simple step.
iris %>% group_by(Species) %>% summarise(mean(Petal.Length))
# A tibble: 3 × 2
Species `mean(Petal.Length)`
<fctr> <dbl>
1 setosa 1.462
2 versicolor 4.260
3 virginica 5.552
As before, we can compute multiple summaries at once.
iris %>% group_by(Species) %>% summarise(n(), mean(Petal.Length), sd(Petal.Length))
# A tibble: 3 × 4
Species `n()` `mean(Petal.Length)` `sd(Petal.Length)`
<fctr> <int> <dbl> <dbl>
1 setosa 50 1.462 0.1736640
2 versicolor 50 4.260 0.4699110
3 virginica 50 5.552 0.5518947
We can make the output more presentable by renaming the columns and using the signif function to round to a specified number of significant figures. Note the use of backticks ( ` ) for specifying column names that contain spaces.
iris %>%
group_by(Species) %>%
summarise(
N = n(),
`Average petal length` = mean(Petal.Length),
`Standard deviation petal length` = sd(Petal.Length)
) %>%
mutate_each(funs(signif(., digits = 2)), `Average petal length`, `Standard deviation petal length`)
# A tibble: 3 × 4
Species N `Average petal length`
<fctr> <int> <dbl>
1 setosa 50 1.5
2 versicolor 50 4.3
3 virginica 50 5.6
# ... with 1 more variables: `Standard deviation petal length` <dbl>
A shorthand for summarise(n()) for counting the number of observations of each group is available in the form of the count function.
# these both give the same output
iris %>% group_by(Species) %>% summarise(n = n())
count(iris, Species)
# A tibble: 3 × 2
Species n
<fctr> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
group_by can also be used in conjunction with other dplyr verbs.
iris %>% group_by(Species) %>% summarise_each(funs(mean))
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
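As a further illustrative sketch (not one of the original examples), group_by also changes the behaviour of verbs such as filter and mutate, which then operate within each group:
# the flower(s) with the longest petal within each species
iris %>% group_by(Species) %>% filter(Petal.Length == max(Petal.Length))
# petal length expressed relative to the average for that species
iris %>% group_by(Species) %>% mutate(Relative.Petal.Length = Petal.Length / mean(Petal.Length)) %>% tbl_df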
Returning to one of the earlier examples, we can also compute the correlation between petal width and length on a per-group basis.
iris %>% group_by(Species) %>% summarise(correlation = cor(Petal.Width, Petal.Length))
# A tibble: 3 × 2
Species correlation
<fctr> <dbl>
1 setosa 0.3316300
2 versicolor 0.7866681
3 virginica 0.3221082
Hold on a minute, wasn’t the correlation coefficient for the overall dataset quite a lot higher? (yes, it was 0.96). What’s going on here? A plot might help to understand why.
library(ggplot2)
qplot(Petal.Width, Petal.Length, data = iris, colour = Species)
But we’re getting ahead of ourselves! Plotting with ggplot2 is for the next part of the course.
group_by - the technical details
Some of you might be wondering what’s going on under the hood with this group_by function. The help page for group_by is a little on the technical side but essentially tells us that the data frame we pass it gets converted into a grouped_df data frame object. dplyr functions that operate on a grouped_df object know to treat this in a special way, operating on each group separately. The following sequence of R commands might help to make this a bit clearer.
# first let's take a look at the class of the iris data frame
class(iris)
[1] "data.frame"
# now we'll create a grouped version with group_by and look at its class
iris_grouped <- group_by(iris, Species)
class(iris_grouped)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
# the groups function lets us see the groups
groups(iris_grouped)
[[1]]
Species
# the ungroup function removes the grouping
iris_ungrouped <- ungroup(iris_grouped)
class(iris_ungrouped)
[1] "tbl_df" "tbl" "data.frame"
groups(iris_ungrouped)
NULL
ungroup can be quite helpful in more complicated chains of dplyr operations where the grouping is only required for some of the steps and would have unintended consequences for subsequent operations within the chain.
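For example, here is a small sketch using the iris dataset: the grouping is needed to compute a per-species mean, but is removed with ungroup before computing a mean across all rows.
iris %>%
  group_by(Species) %>%
  mutate(Species.Mean = mean(Petal.Length)) %>%   # computed within each species
  ungroup() %>%
  mutate(Overall.Mean = mean(Petal.Length)) %>%   # computed across all 150 rows
  tbl_df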
Read a cleaned version of the patients dataset (patient-data-cleaned.txt) into RStudio and use summarise, summarise_each, mutate_each and group_by to answer the following questions:
In many real life situations, data are spread across multiple tables or spreadsheets. Usually this occurs because different types of information about a subject, e.g. a patient, are collected from different sources. It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a common column, for example, an attribute that uniquely identifies the subject.
dplyr provides a set of join functions for combining two data frames based on matches within specified columns. These operations are very similar to carrying out join operations between tables in a relational database using SQL.
left_join
To illustrate join operations we’ll first consider the most common type, a “left join”. Suppose two data frames share a common column, V1. We can combine the two data frames into a single data frame by matching rows in the first data frame with those in the second data frame that share the same value of V1.
left_join returns all rows from the first data frame regardless of whether there is a match in the second data frame. Rows with no match are included in the resulting data frame but have NA values in the additional columns coming from the second data frame. Here’s an example in which data about some interesting people are contained in two tables. The name column identifies each of the people concerned and is used for matching rows from the two tables.
# let's create some data frames that we can join
df1 <- data_frame(
name = c("Matt", "Mark", "Tom"),
home = c("Harston", "Histon", "London"),
institute = c("CRUK-CI", "CRUK-CI", "MRC-CSC")
)
df2 <- data_frame(
name = c("Matt", "Mark", "Jane"),
colour = c("blue", "green", "cerise")
)
df1
# A tibble: 3 × 3
name home institute
<chr> <chr> <chr>
1 Matt Harston CRUK-CI
2 Mark Histon CRUK-CI
3 Tom London MRC-CSC
df2
# A tibble: 3 × 2
name colour
<chr> <chr>
1 Matt blue
2 Mark green
3 Jane cerise
left_join(df1, df2, by = "name")
# A tibble: 3 × 4
name home institute colour
<chr> <chr> <chr> <chr>
1 Matt Harston CRUK-CI blue
2 Mark Histon CRUK-CI green
3 Tom London MRC-CSC <NA>
right_join is similar but returns all rows from the second data frame, regardless of whether there is a match in the first data frame.
right_join(df1, df2, by = "name")
# A tibble: 3 × 4
name home institute colour
<chr> <chr> <chr> <chr>
1 Matt Harston CRUK-CI blue
2 Mark Histon CRUK-CI green
3 Jane <NA> <NA> cerise
inner_join
It is also possible to return only those rows where matches could be made. The inner_join function accomplishes this.
inner_join(df1, df2, by = "name")
# A tibble: 2 × 4
name home institute colour
<chr> <chr> <chr> <chr>
1 Matt Harston CRUK-CI blue
2 Mark Histon CRUK-CI green
full_join
We’ve seen how missing rows from one table can be retained in the joined data frame using left_join or right_join, but sometimes data for a given subject may be missing from either of the tables and we still want that subject to appear in the combined table. A full_join will return all rows and all columns from the two tables; where there are no matching values, NA values are used to fill in the missing values.
full_join(df1, df2, by = "name")
# A tibble: 4 × 4
name home institute colour
<chr> <chr> <chr> <chr>
1 Matt Harston CRUK-CI blue
2 Mark Histon CRUK-CI green
3 Tom London MRC-CSC <NA>
4 Jane <NA> <NA> cerise
Where there are multiple rows in one or both of the two tables, these will be represented in the combined table. All combinations of the matching rows will be included.
df1 <- data_frame(
name = c("Matt", "Mark", "Mark", "Tom"),
home = c("Harston", "Histon", "Hardwick", "London"),
institute = c("CRUK-CI", "CRUK-CI", "IPH", "MRC-CSC")
)
df2 <- data_frame(
name = c("Matt", "Mark", "Mark", "Jane"),
colour = c("blue", "yellow", "green", "cerise")
)
left_join(df1, df2, by = "name")
# A tibble: 6 × 4
name home institute colour
<chr> <chr> <chr> <chr>
1 Matt Harston CRUK-CI blue
2 Mark Histon CRUK-CI yellow
3 Mark Histon CRUK-CI green
4 Mark Hardwick IPH yellow
5 Mark Hardwick IPH green
6 Tom London MRC-CSC <NA>
We can specify multiple columns to use for matching rows. This is useful where the uniquely-identifying information for a subject is contained in more than one column. The following extends our simple example to use both name and institute for matching rows.
# let's create some data frames that we can join
df1 <- data_frame(
name = c("Matt", "Mark", "Mark", "Tom"),
home = c("Harston", "Histon", "Hardwick", "London"),
institute = c("CRUK-CI", "CRUK-CI", "IPH", "MRC-CSC")
)
df2 <- data_frame(
name = c("Matt", "Mark", "Mark", "Jane"),
institute = c("CRUK-CI", "CRUK-CI", "IPH", "MRC-CSC"),
colour = c("blue", "yellow", "green", "cerise")
)
left_join(df1, df2, by = c("name", "institute"))
# A tibble: 4 × 4
name home institute colour
<chr> <chr> <chr> <chr>
1 Matt Harston CRUK-CI blue
2 Mark Histon CRUK-CI yellow
3 Mark Hardwick IPH green
4 Tom London MRC-CSC <NA>
Variations on the join operations we’ve considered are semi_join and anti_join, which filter the rows in one table based on matches or lack of matches to rows in another table.
# semi_join returns all rows from the first table where there are matches in the other table
semi_join(df1, df2, by = "name")
# A tibble: 3 × 3
name home institute
<chr> <chr> <chr>
1 Matt Harston CRUK-CI
2 Mark Histon CRUK-CI
3 Mark Hardwick IPH
# anti_join returns all rows where there is no match in the other table
anti_join(df1, df2, by = "name")
# A tibble: 1 × 3
name home institute
<chr> <chr> <chr>
1 Tom London MRC-CSC
The exercise uses a more realistic dataset, building on the patients table we’ve already been working with.
The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates.
diabetes <- read.delim("diabetes.txt", stringsAsFactors = FALSE) %>% tbl_df
diabetes
# A tibble: 1,316 × 4
ID Date Glucose BP
<chr> <chr> <int> <int>
1 AC/AH/001 2011-03-07 100 98
2 AC/AH/001 2011-03-14 110 89
3 AC/AH/001 2011-03-24 94 88
4 AC/AH/001 2011-03-31 111 92
5 AC/AH/001 2011-04-03 94 83
6 AC/AH/001 2011-05-21 110 93
7 AC/AH/001 2011-06-24 105 79
8 AC/AH/001 2011-07-11 88 86
9 AC/AH/001 2011-07-11 101 92
10 AC/AH/001 2011-07-13 112 88
# ... with 1,306 more rows
Find the number of visits made by each patient to the hospital to have blood glucose and pressure measurements taken.
Find the date of the last visit made by each patient.
%>% symbol?
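One possible approach to the first two questions, sketched using the verbs covered above (other solutions are equally valid):
# number of visits made by each patient
diabetes %>% group_by(ID) %>% summarise(visits = n())
# or, using the count shorthand
count(diabetes, ID)
# date of the last visit made by each patient
# (dates in YYYY-MM-DD format sort correctly even as character values)
diabetes %>% group_by(ID) %>% summarise(last_visit = max(Date))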