select
and mutate
verbs in dplyrLike families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham)
http://vita.had.co.nz/papers/tidy-data.pdf
Tidy data are important because they are an entry point for the analysis cycle we will describe during the course.
As we saw, our data, and especially data we might find out “in the wild”, may need “cleaning”" before we can use it. We will discuss how to clean our data for analysis. However, there are steps you can take to make sure your data are tidy and organised before importing into R.
We have already seen a tidy dataset in the form of Fisher’s iris dataset.
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Consider the following….
Name | treatmenta | treatmentb |
---|---|---|
John Smith | - | 2 |
Jane Doe | 16 | 11 |
Mary Johnson | 3 | 1 |
Data in this format are quite familiar to us, but not easily-interpretable by the computer. We need to think of the dataset in terms of variables and values. In this example dataset we have 18 values and 3 variables
The guiding principles:
The tidy form of this data thus becomes;
## Name Treatment Result
## 1 John Smith a -
## 2 Jane Doe a 16
## 3 Mary Johnson a 3
## 4 John Smith b 2
## 5 Jane Doe b 11
## 6 Mary Johnson b 1
gather
is a function in the tidyr
package that can collapse multiple columns into key-value pairs.
stack
in base Rmelt
in the reshape
/ reshape2
pacakgesYou can get these example data here
library(tidyr)
untidy <- read.delim("tidyr-example.txt")
untidy
## Name treatmenta treatmentb
## 1 John Smith - 2
## 2 Jane Doe 16 11
## 3 Mary Johnson 3 1
The arguments are;
gather(untidy, Treatment, Result, c(treatmenta,treatmentb))
## Name Treatment Result
## 1 John Smith treatmenta -
## 2 Jane Doe treatmenta 16
## 3 Mary Johnson treatmenta 3
## 4 John Smith treatmentb 2
## 5 Jane Doe treatmentb 11
## 6 Mary Johnson treatmentb 1
When specifying columns, an alternative is to ignore some column names
gather(untidy, Treatment, Result, -Name)
## Name Treatment Result
## 1 John Smith treatmenta -
## 2 Jane Doe treatmenta 16
## 3 Mary Johnson treatmenta 3
## 4 John Smith treatmentb 2
## 5 Jane Doe treatmentb 11
## 6 Mary Johnson treatmentb 1
Can also specify column index
gather(untidy, Treatment, Result, 2:3)
## Name Treatment Result
## 1 John Smith treatmenta -
## 2 Jane Doe treatmenta 16
## 3 Mary Johnson treatmenta 3
## 4 John Smith treatmentb 2
## 5 Jane Doe treatmentb 11
## 6 Mary Johnson treatmentb 1
Note that after all these operations, the original data frame (untidy
) is unaltered
untidy
## Name treatmenta treatmentb
## 1 John Smith - 2
## 2 Jane Doe 16 11
## 3 Mary Johnson 3 1
tidy <- gather(untidy, Treatment, Result, c(treatmenta,treatmentb))
## Warning: attributes are not identical across measure variables; they will
## be dropped
Using data from the Mango Solutions training dataset
library(mangoTraining)
data("messyData")
Subject | Placebo.1 | Placebo.2 | Drug1.1 | Drug1.2 | Drug2.1 | Drug2.2 |
---|---|---|---|---|---|---|
1 | 45 | 41 | 40 | 30 | 51 | 37 |
2 | 56 | 57 | 53 | 43 | 47 | 49 |
3 | 39 | 30 | 37 | 28 | 40 | 32 |
4 | 42 | 38 | 34 | 25 | 43 | 26 |
5 | 50 | 49 | 35 | 29 | 38 | 31 |
6 | 47 | 44 | 51 | 43 | 33 | 46 |
tidyr
package to create a tidy version of the data## Subject Treatment Count
## 1 1 Placebo.1 45
## 2 2 Placebo.1 56
## 3 3 Placebo.1 39
## 4 4 Placebo.1 42
## 5 5 Placebo.1 50
## 6 6 Placebo.1 47
separate
from the same package will also sort out the treatment names.
separate(tidy,Treatment,into=c("Treatment","Rep"))
## Subject Treatment Rep Count
## 1 1 Placebo 1 45
## 2 2 Placebo 1 56
## 3 3 Placebo 1 39
## 4 4 Placebo 1 42
## 5 5 Placebo 1 50
## 6 6 Placebo 1 47
N.B. spread
has the opposite effect to gather and will translate tidy data back into human-readable form
spread(tidy, Treatment,Count)
## Subject Drug1.1 Drug1.2 Drug2.1 Drug2.2 Placebo.1 Placebo.2
## 1 1 40 30 51 37 45 41
## 2 2 53 43 47 49 56 57
## 3 3 37 28 40 32 39 30
## 4 4 34 25 43 26 42 38
## 5 5 35 29 38 31 50 49
## 6 6 51 43 33 46 47 44
We are going to explore some of the basic features of dplyr
using some patient data; the kind of data that we might encounter in the wild. However, rather than using real-life data we are going to make some up. There is a package called wakefield
that is particularly convenient for this task.
Various patient characteristics have been generated and saved in the file patient-data.txt.
patients <- read.delim("patient-data.txt")
In Rstudio , you can view the contents of this data frame in a tab.
View(patients)
Using dplyr
, patients
can be converted to a special kind of data frame; the tbl_df
. The main difference is that the whole data frame does not get printed to screen. All operations that we could do on a data frame can be applied to a tbl_df
.
library(dplyr)
patients <- tbl_df(patients)
patients
## # A tibble: 100 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <fctr>
## 1 1 Demetrius White 22 Male TRUE 176.479144276913cm
## 2 2 Rosario White 25 Male TRUE 174.748482460591cm
## 3 3 Julio Black 20 Male No 173.944873836774cm
## 4 4 Lupe White 26 Male FALSE 180.028888936188cm
## 5 5 Lavern Cat 35 Male No 178.636539798649cm
## 6 6 Bernie Native 30 Female FALSE 159.777836229305cm
## 7 7 Samuel White 32 Male TRUE 170.949966267411cm
## 8 8 Clair White 29 Female TRUE 161.826590769764cm
## 9 9 Shirley White 22 Male FALSE 169.66230745064cm
## 10 10 Merle Hispanic 26 Female FALSE 160.121017002151cm
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
The tbl_df
function in dplyr creates a new object with more-convenient default display
glimpse
can be used to get a snapshot of your data
str
function in base Rglimpse(patients)
## Observations: 100
## Variables: 15
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Name <fctr> Demetrius, Rosario, Julio, Lupe, Lavern, Bernie, ...
## $ Race <fctr> White, White, Black, White, Cat, Native, White, W...
## $ Age <int> 22, 25, 20, 26, 35, 30, 32, 29, 22, 26, 31, 30, 0,...
## $ Sex <fctr> Male, Male, Male, Male, Male, Female, Male, Fema...
## $ Smokes <fctr> TRUE, TRUE, No, FALSE, No, FALSE, TRUE, TRUE, FAL...
## $ Height <fctr> 176.479144276913cm, 174.748482460591cm, 173.94487...
## $ Weight <fctr> 77.0707023167105kg, 83.2683499179022kg, 85.632310...
## $ Birth <fctr> 1972-02-11, 1972-07-22, 1971-11-23, 1971-10-03, 1...
## $ State <fctr> New York, Florida, Connecticut, Massachusetts, Ka...
## $ Pet <fctr> CAT, Cat, DOG, Dog, Cat, Cat, Dog, Dog, Dog, Cat,...
## $ Grade_Level <int> 1, 1, 2, 1, 3, 2, 1, 1, 1, 1, 3, 99, 1, 1, 1, 3, 1...
## $ Died <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALS...
## $ Count <dbl> -0.74400114, -1.07145310, -0.82156356, 0.49766715,...
## $ Date <fctr> 2015-03-04, 2015-03-04, 2015-03-04, 2015-03-04, 2...
We are planning a study that will involve overweight smokers. So one of our first task is to identify a list of candidates.
select
verbAs we mentioned, dplyr
introduces a series of verbs that can be used to operate on our data. The first we will introduce is select
N.B. base R does provide a function called subset
which allowed similar operations.
The default dispaly of a tbl_df
object is to print the data frame in an intelligent manner, so not to flood the R console. If we want to control what variables are printed we can use the select
function, which allows us to print certain columns to the screen.
In the R introduction, we saw how to extract columns from a data frame with a $
symbol.
patients$Name
## [1] Demetrius Rosario Julio Lupe Lavern
## [6] Bernie Samuel Clair Shirley Merle
## [11] Martin Frances Courtney Francis Vernon
## [16] Lester Robin Albert Tommy Kyle
## [21] Dong Michel Jeremy Pat Eugene
## [26] Kris Tracy Jame Clyde Brett
## [31] Tony George Edward Rory Shane
## [36] Devin Jerry Drew Ronald Christopher
## [41] Dominique Van Bobbie Lawrence Keith
## [46] Son Charlie Jay Richard Carlos
## [51] Gail Marion Lindsay Sean Andrea
## [56] Jesus Jan Walter Dana Sammy
## [61] Theo Shaun Jimmie Carl Evan
## [66] Merrill Jon Shayne Thomas Valentine
## [71] Cameron Eddie Brian Matthew Leslie
## [76] Jason Sol Connie Rudy Michal
## [81] Darnell Daryl Jordan Kenneth Raymond
## [86] Elmer Jimmy Whitney Britt Logan
## [91] Terry Jamie Lacy Ronnie Joseph
## [96] Stacy Anthony Alva Dean Luis
## 100 Levels: Albert Alva Andrea Anthony Bernie Bobbie Brett Brian ... Whitney
The $
operator works fine for single columns, but for multiple columns it quickly gets a bit messy as we have to resort to using []
patients[,c("Name","Race")]
## # A tibble: 100 × 2
## Name Race
## <fctr> <fctr>
## 1 Demetrius White
## 2 Rosario White
## 3 Julio Black
## 4 Lupe White
## 5 Lavern Cat
## 6 Bernie Native
## 7 Samuel White
## 8 Clair White
## 9 Shirley White
## 10 Merle Hispanic
## # ... with 90 more rows
Non R users probably find these commands a bit obtuse
[ ]
?c
?[,...]
means display all rowsHowever, select
is a lot more intuitive; easier to convert to the sentance “Select the Name column from patients”
""
select(patients, Name)
## # A tibble: 100 × 1
## Name
## <fctr>
## 1 Demetrius
## 2 Rosario
## 3 Julio
## 4 Lupe
## 5 Lavern
## 6 Bernie
## 7 Samuel
## 8 Clair
## 9 Shirley
## 10 Merle
## # ... with 90 more rows
“Select the Name and Race columns”
select(patients, Name, Race)
## # A tibble: 100 × 2
## Name Race
## <fctr> <fctr>
## 1 Demetrius White
## 2 Rosario White
## 3 Julio Black
## 4 Lupe White
## 5 Lavern Cat
## 6 Bernie Native
## 7 Samuel White
## 8 Clair White
## 9 Shirley White
## 10 Merle Hispanic
## # ... with 90 more rows
Excluding certain columns in base R would require us to know the index we wanted to exclude, or to do something complicated with the column names
patients[,-2]
patients[,setdiff(colnames(patients), "Name")]
head(patients[,-2])
head(patients[,setdiff(colnames(patients), "Name")])
Compare to dplyr version….“Select all columns apart from Name”
select(patients, -Name)
## # A tibble: 100 × 14
## ID Race Age Sex Smokes Height
## <int> <fctr> <int> <fctr> <fctr> <fctr>
## 1 1 White 22 Male TRUE 176.479144276913cm
## 2 2 White 25 Male TRUE 174.748482460591cm
## 3 3 Black 20 Male No 173.944873836774cm
## 4 4 White 26 Male FALSE 180.028888936188cm
## 5 5 Cat 35 Male No 178.636539798649cm
## 6 6 Native 30 Female FALSE 159.777836229305cm
## 7 7 White 32 Male TRUE 170.949966267411cm
## 8 8 White 29 Female TRUE 161.826590769764cm
## 9 9 White 22 Male FALSE 169.66230745064cm
## 10 10 Hispanic 26 Female FALSE 160.121017002151cm
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
Similarly easier to get columns within a particular range. “Select all columns between Name and Sex”
select(patients, Name:Sex)
## # A tibble: 100 × 4
## Name Race Age Sex
## <fctr> <fctr> <int> <fctr>
## 1 Demetrius White 22 Male
## 2 Rosario White 25 Male
## 3 Julio Black 20 Male
## 4 Lupe White 26 Male
## 5 Lavern Cat 35 Male
## 6 Bernie Native 30 Female
## 7 Samuel White 32 Male
## 8 Clair White 29 Female
## 9 Shirley White 22 Male
## 10 Merle Hispanic 26 Female
## # ... with 90 more rows
Say we cannot remember the name of the column
select(patients, starts_with("Grade"))
## # A tibble: 100 × 1
## Grade_Level
## <int>
## 1 1
## 2 1
## 3 2
## 4 1
## 5 3
## 6 2
## 7 1
## 8 1
## 9 1
## 10 1
## # ... with 90 more rows
see also
Height
and Grade_Level
Height
and Grade_Level
, but NOT Pet
Height
and Weight
There are several ways to solve these. Feel free to explore
We can now begin to explore the data. Imagine that we want to do a simple barplot to the number of males and females. Using base graphics, we can do this with;
barplot(table(select(patients,Sex)))
nchar
function to determine the problemSo we notice that the Sex column has “whitespace”; not an uncommon problem if data have been entered manually.
table(select(patients, Sex))
##
## Female Male Female Male
## 6 4 42 48
as.character(patients$Sex)
## [1] "Male" " Male" "Male" "Male" "Male" "Female" "Male"
## [8] "Female" "Male" " Female" "Female" "Male" "Female" "Female"
## [15] "Female" "Male" "Female" "Female" "Male" "Female" "Male"
## [22] "Male" "Female" "Female" "Male" "Male" " Female" "Male"
## [29] "Male" "Female" "Female" "Female" "Male" "Female" "Female"
## [36] "Female" "Female" "Male" " Male" "Male" "Male" "Male"
## [43] "Male" " Female" "Female" "Male" "Male" "Male" "Male"
## [50] " Male" " Male" "Female" "Male" "Female" "Male" "Female"
## [57] "Female" "Male" "Female" "Male" "Female" "Female" "Male"
## [64] "Male" "Male" "Male" "Female" " Female" "Female" "Female"
## [71] "Female" "Female" "Male" "Female" "Female" "Male" "Female"
## [78] "Male" "Male" "Female" " Female" "Female" "Male" "Male"
## [85] "Male" "Male" "Female" " Female" "Female" "Male" "Female"
## [92] "Female" "Male" "Female" "Male" "Male" "Male" "Male"
## [99] "Female" "Male"
The base distribution of R has many functions for dealing with strings. However, the stringr
package (again from Hadley Wickham) aims to provide a clean modern interface to many common string operations. In a similar way, lubridate
provides a convenient interface to deal with date information.
Of particular interest here is the str_trim
(?str_trim
) function that will trim whitespace from a vector of strings. Arguments can be used to specify whether to remove whitespace from either side of the string, or just the left or right.
library(stringr)
str_trim(patients$Sex)
## [1] "Male" "Male" "Male" "Male" "Male" "Female" "Male"
## [8] "Female" "Male" "Female" "Female" "Male" "Female" "Female"
## [15] "Female" "Male" "Female" "Female" "Male" "Female" "Male"
## [22] "Male" "Female" "Female" "Male" "Male" "Female" "Male"
## [29] "Male" "Female" "Female" "Female" "Male" "Female" "Female"
## [36] "Female" "Female" "Male" "Male" "Male" "Male" "Male"
## [43] "Male" "Female" "Female" "Male" "Male" "Male" "Male"
## [50] "Male" "Male" "Female" "Male" "Female" "Male" "Female"
## [57] "Female" "Male" "Female" "Male" "Female" "Female" "Male"
## [64] "Male" "Male" "Male" "Female" "Female" "Female" "Female"
## [71] "Female" "Female" "Male" "Female" "Female" "Male" "Female"
## [78] "Male" "Male" "Female" "Female" "Female" "Male" "Male"
## [85] "Male" "Male" "Female" "Female" "Female" "Male" "Female"
## [92] "Female" "Male" "Female" "Male" "Male" "Male" "Male"
## [99] "Female" "Male"
Once we have worked-out how to created a cleaned version of the variable, we need to update the data frame. This can be done using the next of the dplyr verbs; mutate
.
mutate
verbmutate
allows us to compute and append one or more new columns to a data frame. We can specify a new column name, or overwrite an existing one. As above, we do not change the data frame so need to create a new variable if we want the changes to persist.
mutate(patients, Sex2 = str_trim(Sex))
## # A tibble: 100 × 16
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <fctr>
## 1 1 Demetrius White 22 Male TRUE 176.479144276913cm
## 2 2 Rosario White 25 Male TRUE 174.748482460591cm
## 3 3 Julio Black 20 Male No 173.944873836774cm
## 4 4 Lupe White 26 Male FALSE 180.028888936188cm
## 5 5 Lavern Cat 35 Male No 178.636539798649cm
## 6 6 Bernie Native 30 Female FALSE 159.777836229305cm
## 7 7 Samuel White 32 Male TRUE 170.949966267411cm
## 8 8 Clair White 29 Female TRUE 161.826590769764cm
## 9 9 Shirley White 22 Male FALSE 169.66230745064cm
## 10 10 Merle Hispanic 26 Female FALSE 160.121017002151cm
## # ... with 90 more rows, and 9 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>, Sex2 <chr>
mutate(patients, Sex = str_trim(Sex))
## # A tibble: 100 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <chr> <fctr> <fctr>
## 1 1 Demetrius White 22 Male TRUE 176.479144276913cm
## 2 2 Rosario White 25 Male TRUE 174.748482460591cm
## 3 3 Julio Black 20 Male No 173.944873836774cm
## 4 4 Lupe White 26 Male FALSE 180.028888936188cm
## 5 5 Lavern Cat 35 Male No 178.636539798649cm
## 6 6 Bernie Native 30 Female FALSE 159.777836229305cm
## 7 7 Samuel White 32 Male TRUE 170.949966267411cm
## 8 8 Clair White 29 Female TRUE 161.826590769764cm
## 9 9 Shirley White 22 Male FALSE 169.66230745064cm
## 10 10 Merle Hispanic 26 Female FALSE 160.121017002151cm
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
patients_clean <- mutate(patients, Sex = factor(str_trim(Sex)))
We might want to look at the distribution of heights amongst our patients. However, if we try this we are immediately hit with an error message
hist(select(patients,Height))
## Error in hist.default(select(patients, Height)) : 'x' must be numeric
The problem is that R thinks that values are not numbers. Why might that be?
select(patients,Height)
## # A tibble: 100 × 1
## Height
## <fctr>
## 1 176.479144276913cm
## 2 174.748482460591cm
## 3 173.944873836774cm
## 4 180.028888936188cm
## 5 178.636539798649cm
## 6 159.777836229305cm
## 7 170.949966267411cm
## 8 161.826590769764cm
## 9 169.66230745064cm
## 10 160.121017002151cm
## # ... with 90 more rows
To convert the values back to numbers we first need to remove the "cm"
at the end of each value. We can do this with a string replacement operation. In this case, the particular string we want to put in the place of "cm"
is the empty string ""
. The replacement can be done using the str_replace_all
function.
mutate(patients_clean, Height= str_replace_all(Height,pattern = "cm",""))
## # A tibble: 100 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <chr>
## 1 1 Demetrius White 22 Male TRUE 176.479144276913
## 2 2 Rosario White 25 Male TRUE 174.748482460591
## 3 3 Julio Black 20 Male No 173.944873836774
## 4 4 Lupe White 26 Male FALSE 180.028888936188
## 5 5 Lavern Cat 35 Male No 178.636539798649
## 6 6 Bernie Native 30 Female FALSE 159.777836229305
## 7 7 Samuel White 32 Male TRUE 170.949966267411
## 8 8 Clair White 29 Female TRUE 161.826590769764
## 9 9 Shirley White 22 Male FALSE 169.66230745064
## 10 10 Merle Hispanic 26 Female FALSE 160.121017002151
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
Note that the input to this function is the patients_clean
object that we created previously.
However, what do you notice about the class reported for the Height column after this operation.
Count
column, for exampleWe need an extra conversion step to make sure these are represented as numeric values.
patients_clean <- mutate(patients_clean, Height= as.numeric(str_replace_all(patients_clean$Height,pattern = "cm","")))
hist(patients_clean$Height)
Other useful functions from stringr
, and the lubridate
package, are described after the exercise. If time allows, we will go through these in the class. If you finish the exercise early, feel free to take a look.
Smokes
column to contain just TRUE
or FALSE
values## # A tibble: 100 × 17
## ID Name Race Age Sex Smokes Height Weight
## <int> <fctr> <fctr> <int> <fctr> <lgl> <dbl> <dbl>
## 1 1 Demetrius White 22 Male TRUE 176.4791 77.07070
## 2 2 Rosario White 25 Male TRUE 174.7485 83.26835
## 3 3 Julio Black 20 Male FALSE 173.9449 85.63231
## 4 4 Lupe White 26 Male FALSE 180.0289 88.72032
## 5 5 Lavern Cat 35 Male FALSE 178.6365 97.05405
## 6 6 Bernie Native 30 Female FALSE 159.7778 68.31247
## 7 7 Samuel White 32 Male TRUE 170.9500 95.65815
## 8 8 Clair White 29 Female TRUE 161.8266 66.02854
## 9 9 Shirley White 22 Male FALSE 169.6623 77.29546
## 10 10 Merle Hispanic 26 Female FALSE 160.1210 69.59613
## # ... with 90 more rows, and 9 more variables: Birth <fctr>, State <fctr>,
## # Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>,
## # BMI <dbl>, Overweight <lgl>
Rather than removing whitespace, sometimes we might want to adjust a set of strings so that they have a fixed width. We can do this be adding a padding character to each string if necessary. This operations is often done with a set of identifiers.
str_pad(patients$ID,pad = "0",width=3)
## [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011"
## [12] "012" "013" "014" "015" "016" "017" "018" "019" "020" "021" "022"
## [23] "023" "024" "025" "026" "027" "028" "029" "030" "031" "032" "033"
## [34] "034" "035" "036" "037" "038" "039" "040" "041" "042" "043" "044"
## [45] "045" "046" "047" "048" "049" "050" "051" "052" "053" "054" "055"
## [56] "056" "057" "058" "059" "060" "061" "062" "063" "064" "065" "066"
## [67] "067" "068" "069" "070" "071" "072" "073" "074" "075" "076" "077"
## [78] "078" "079" "080" "081" "082" "083" "084" "085" "086" "087" "088"
## [89] "089" "090" "091" "092" "093" "094" "095" "096" "097" "098" "099"
## [100] "100"
patients_clean <- mutate(patients_clean, ID=str_pad(ID,pad = "0",width=3))
Note that we can perform several operations in the same command.
patients_clean <- mutate(patients, Sex = factor(str_trim(Sex)),ID=str_pad(ID,pad = "0",width=3))
Similarly, we might want to trim all the strings between a particular start and end position
str_sub(patients$State, 1,3)
## [1] "New" "Flo" "Con" "Mas" "Kan" "Ill" "Col" "Sou" "Ill" "Mis" "Ala"
## [12] "Col" "Ida" "Flo" "New" "Pen" "Mis" "Ind" "Mic" "Mic" "Ore" "Tex"
## [23] "Nev" "Wis" "Uta" "Rho" "Pen" "Ill" "Vir" "Ill" "Tex" "Cal" "Geo"
## [34] "Nor" "Col" "Ida" "Tex" "Ari" "Iow" "Cal" "Mar" "Mic" "Tex" "Mar"
## [45] "New" "Ind" "Tex" "Ida" "Ohi" "New" "Ari" "Ind" "Ten" "Mar" "Mic"
## [56] "Pen" "New" "Ten" "New" "Cal" "Was" "Cal" "Nor" "Rho" "Mai" "Sou"
## [67] "Vir" "Cal" "Ida" "Ari" "Ken" "Col" "Tex" "Cal" "Pen" "Cal" "Mar"
## [78] "Wis" "Ohi" "Flo" "Wes" "Was" "Okl" "New" "New" "Ida" "Mis" "Flo"
## [89] "Kan" "Cal" "Nor" "Wis" "Ark" "Cal" "Mic" "Vir" "Ill" "Col" "Ken"
## [100] "Ari"
str_sub(patients$Name, 2,5)
## [1] "emet" "osar" "ulio" "upe" "aver" "erni" "amue" "lair" "hirl" "erle"
## [11] "arti" "ranc" "ourt" "ranc" "erno" "este" "obin" "lber" "ommy" "yle"
## [21] "ong" "iche" "erem" "at" "ugen" "ris" "racy" "ame" "lyde" "rett"
## [31] "ony" "eorg" "dwar" "ory" "hane" "evin" "erry" "rew" "onal" "hris"
## [41] "omin" "an" "obbi" "awre" "eith" "on" "harl" "ay" "icha" "arlo"
## [51] "ail" "ario" "inds" "ean" "ndre" "esus" "an" "alte" "ana" "ammy"
## [61] "heo" "haun" "immi" "arl" "van" "erri" "on" "hayn" "homa" "alen"
## [71] "amer" "ddie" "rian" "atth" "esli" "ason" "ol" "onni" "udy" "icha"
## [81] "arne" "aryl" "orda" "enne" "aymo" "lmer" "immy" "hitn" "ritt" "ogan"
## [91] "erry" "amie" "acy" "onni" "osep" "tacy" "ntho" "lva" "ean" "uis"
Finally, we can convert entire strings to upper- or lower-case
str_to_lower(patients$Pet)
## [1] "cat" "cat" "dog" "dog" "cat" "cat" "dog" "dog"
## [9] "dog" "cat" "cat" "dog" "none" "dog" "none" "cat"
## [17] "none" "cat" "none" "bird" "cat" "dog" "dog" "cat"
## [25] "dog" "none" "cat" "dog" "none" "dog" "dog" "dog"
## [33] "cat" "cat" "cat" "null" "none" "dog" "none" "none"
## [41] "none" "cat" "cat" "dog" "horse" "cat" "dog" "dog"
## [49] "dog" "dog" "cat" "none" "none" "none" "dog" "cat"
## [57] "none" "cat" "dog" "dog" "cat" "none" "none" "cat"
## [65] "none" NA "cat" "none" "cat" "null" "none" "none"
## [73] "cat" "dog" "dog" "cat" "none" "none" "none" "dog"
## [81] "dog" "dog" "dog" "none" "none" "dog" "cat" "cat"
## [89] "dog" "cat" "none" "none" NA "null" "none" "dog"
## [97] "none" "dog" "dog" "cat"
str_to_upper(patients$Name)
## [1] "DEMETRIUS" "ROSARIO" "JULIO" "LUPE" "LAVERN"
## [6] "BERNIE" "SAMUEL" "CLAIR" "SHIRLEY" "MERLE"
## [11] "MARTIN" "FRANCES" "COURTNEY" "FRANCIS" "VERNON"
## [16] "LESTER" "ROBIN" "ALBERT" "TOMMY" "KYLE"
## [21] "DONG" "MICHEL" "JEREMY" "PAT" "EUGENE"
## [26] "KRIS" "TRACY" "JAME" "CLYDE" "BRETT"
## [31] "TONY" "GEORGE" "EDWARD" "RORY" "SHANE"
## [36] "DEVIN" "JERRY" "DREW" "RONALD" "CHRISTOPHER"
## [41] "DOMINIQUE" "VAN" "BOBBIE" "LAWRENCE" "KEITH"
## [46] "SON" "CHARLIE" "JAY" "RICHARD" "CARLOS"
## [51] "GAIL" "MARION" "LINDSAY" "SEAN" "ANDREA"
## [56] "JESUS" "JAN" "WALTER" "DANA" "SAMMY"
## [61] "THEO" "SHAUN" "JIMMIE" "CARL" "EVAN"
## [66] "MERRILL" "JON" "SHAYNE" "THOMAS" "VALENTINE"
## [71] "CAMERON" "EDDIE" "BRIAN" "MATTHEW" "LESLIE"
## [76] "JASON" "SOL" "CONNIE" "RUDY" "MICHAL"
## [81] "DARNELL" "DARYL" "JORDAN" "KENNETH" "RAYMOND"
## [86] "ELMER" "JIMMY" "WHITNEY" "BRITT" "LOGAN"
## [91] "TERRY" "JAMIE" "LACY" "RONNIE" "JOSEPH"
## [96] "STACY" "ANTHONY" "ALVA" "DEAN" "LUIS"
Dates are often problematic as they may be encoded in lots of different schemes
If we wanted the year in which wach person was born it would be tempting to trim the first four characters of the string. However, getting the month or day might be a bit more problematic.
str_sub(patients$Birth,1,4)
## [1] "1972" "1972" "1971" "1971" "1972" "1972" "1971" "1972" "1972" "1972"
## [11] "1971" "1972" "1972" "1972" "1971" "1972" "1971" "1972" "1972" "1972"
## [21] "1971" "1972" "1972" "1971" "1972" "1971" "1971" "1971" "1972" "1972"
## [31] "1971" "1971" "1972" "1971" "1971" "1971" "1972" "1972" "1972" "1972"
## [41] "1972" "1972" "1971" "1972" "1971" "1971" "1972" "1972" "1971" "1972"
## [51] "1972" "1972" "1971" "1972" "1972" "1972" "1972" "1972" "1973" "1973"
## [61] "1971" "1972" "1972" "1972" "1971" "1972" "1971" "1971" "1971" "1972"
## [71] "1971" "1972" "1972" "1973" "1971" "1972" "1973" "1971" "1972" "1972"
## [81] "1972" "1971" "1972" "1972" "1971" "1972" "1972" "1971" "1971" "1971"
## [91] "1972" "1972" "1971" "1972" "1972" "1971" "1971" "1972" "1971" "1971"
The lubridate
package has lots of useful functions for dealing with dates (also by Hadley Wickham!). For example it allows the year, month and day to be extracted.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
year(patients$Birth)
## [1] 1972 1972 1971 1971 1972 1972 1971 1972 1972 1972 1971 1972 1972 1972
## [15] 1971 1972 1971 1972 1972 1972 1971 1972 1972 1971 1972 1971 1971 1971
## [29] 1972 1972 1971 1971 1972 1971 1971 1971 1972 1972 1972 1972 1972 1972
## [43] 1971 1972 1971 1971 1972 1972 1971 1972 1972 1972 1971 1972 1972 1972
## [57] 1972 1972 1973 1973 1971 1972 1972 1972 1971 1972 1971 1971 1971 1972
## [71] 1971 1972 1972 1973 1971 1972 1973 1971 1972 1972 1972 1971 1972 1972
## [85] 1971 1972 1972 1971 1971 1971 1972 1972 1971 1972 1972 1971 1971 1972
## [99] 1971 1971
month(patients$Birth)
## [1] 2 7 11 10 11 7 11 3 10 2 4 1 6 4 11 8 10 10 5 2 3 2 10
## [24] 4 5 11 5 5 5 10 4 3 11 9 9 2 4 10 9 3 1 7 10 10 8 3
## [47] 10 3 9 1 1 1 10 8 6 9 6 11 1 2 8 4 6 12 10 6 10 3 11
## [70] 7 4 2 4 1 12 1 1 9 1 9 2 6 1 12 3 5 4 9 4 12 6 10
## [93] 6 11 8 7 3 9 12 11
It also supporting some clever manipulation, such as being able to subtract and add times.
dob <- ymd(patients$Birth)
today <- ymd("20160509")
age <- interval(dob, today)
as.period(age)
## [1] "44y 2m 28d 0H 0M 0S" "43y 9m 17d 0H 0M 0S" "44y 5m 16d 0H 0M 0S"
## [4] "44y 7m 6d 0H 0M 0S" "43y 5m 16d 0H 0M 0S" "43y 9m 9d 0H 0M 0S"
## [7] "44y 5m 26d 0H 0M 0S" "44y 1m 25d 0H 0M 0S" "43y 6m 29d 0H 0M 0S"
## [10] "44y 2m 11d 0H 0M 0S" "45y 1m 7d 0H 0M 0S" "44y 3m 14d 0H 0M 0S"
## [13] "43y 11m 1d 0H 0M 0S" "44y 1m 0d 0H 0M 0S" "44y 6m 4d 0H 0M 0S"
## [16] "43y 8m 24d 0H 0M 0S" "44y 6m 16d 0H 0M 0S" "43y 6m 24d 0H 0M 0S"
## [19] "44y 0m 7d 0H 0M 0S" "44y 2m 27d 0H 0M 0S" "45y 2m 7d 0H 0M 0S"
## [22] "44y 3m 6d 0H 0M 0S" "43y 6m 18d 0H 0M 0S" "45y 0m 14d 0H 0M 0S"
## [25] "43y 11m 21d 0H 0M 0S" "44y 5m 21d 0H 0M 0S" "45y 0m 3d 0H 0M 0S"
## [28] "44y 11m 22d 0H 0M 0S" "44y 0m 6d 0H 0M 0S" "43y 6m 21d 0H 0M 0S"
## [31] "45y 1m 0d 0H 0M 0S" "45y 1m 28d 0H 0M 0S" "43y 5m 18d 0H 0M 0S"
## [34] "44y 7m 27d 0H 0M 0S" "44y 7m 12d 0H 0M 0S" "45y 2m 11d 0H 0M 0S"
## [37] "44y 0m 22d 0H 0M 0S" "43y 6m 10d 0H 0M 0S" "43y 8m 0d 0H 0M 0S"
## [40] "44y 1m 13d 0H 0M 0S" "44y 3m 18d 0H 0M 0S" "43y 9m 24d 0H 0M 0S"
## [43] "44y 7m 7d 0H 0M 0S" "43y 6m 26d 0H 0M 0S" "44y 8m 28d 0H 0M 0S"
## [46] "45y 1m 10d 0H 0M 0S" "43y 6m 24d 0H 0M 0S" "44y 1m 29d 0H 0M 0S"
## [49] "44y 7m 28d 0H 0M 0S" "44y 3m 16d 0H 0M 0S" "44y 3m 25d 0H 0M 0S"
## [52] "44y 4m 8d 0H 0M 0S" "44y 7m 0d 0H 0M 0S" "43y 9m 4d 0H 0M 0S"
## [55] "43y 11m 5d 0H 0M 0S" "43y 7m 22d 0H 0M 0S" "43y 10m 19d 0H 0M 0S"
## [58] "43y 6m 8d 0H 0M 0S" "43y 4m 8d 0H 0M 0S" "43y 2m 27d 0H 0M 0S"
## [61] "44y 9m 8d 0H 0M 0S" "44y 1m 6d 0H 0M 0S" "43y 10m 14d 0H 0M 0S"
## [64] "43y 5m 5d 0H 0M 0S" "44y 6m 13d 0H 0M 0S" "43y 11m 4d 0H 0M 0S"
## [67] "44y 6m 16d 0H 0M 0S" "45y 1m 24d 0H 0M 0S" "44y 5m 12d 0H 0M 0S"
## [70] "43y 9m 29d 0H 0M 0S" "45y 1m 8d 0H 0M 0S" "44y 2m 27d 0H 0M 0S"
## [73] "44y 0m 16d 0H 0M 0S" "43y 3m 18d 0H 0M 0S" "44y 4m 28d 0H 0M 0S"
## [76] "44y 3m 16d 0H 0M 0S" "43y 3m 16d 0H 0M 0S" "44y 7m 21d 0H 0M 0S"
## [79] "44y 3m 29d 0H 0M 0S" "43y 7m 11d 0H 0M 0S" "44y 2m 10d 0H 0M 0S"
## [82] "44y 10m 14d 0H 0M 0S" "44y 3m 13d 0H 0M 0S" "43y 4m 20d 0H 0M 0S"
## [85] "45y 2m 0d 0H 0M 0S" "44y 0m 4d 0H 0M 0S" "44y 0m 10d 0H 0M 0S"
## [88] "44y 7m 28d 0H 0M 0S" "45y 0m 28d 0H 0M 0S" "44y 4m 9d 0H 0M 0S"
## [91] "43y 11m 1d 0H 0M 0S" "43y 6m 28d 0H 0M 0S" "44y 10m 27d 0H 0M 0S"
## [94] "43y 5m 16d 0H 0M 0S" "43y 9m 3d 0H 0M 0S" "44y 9m 24d 0H 0M 0S"
## [97] "45y 1m 23d 0H 0M 0S" "43y 7m 21d 0H 0M 0S" "44y 4m 17d 0H 0M 0S"
## [100] "44y 5m 16d 0H 0M 0S"
patients_clean <- mutate(patients_clean, Age = year(as.period(age)))
patients_clean
## # A tibble: 100 × 15
## ID Name Race Age Sex Smokes Height
## <chr> <fctr> <fctr> <int> <fctr> <fctr> <fctr>
## 1 001 Demetrius White 44 Male TRUE 176.479144276913cm
## 2 002 Rosario White 43 Male TRUE 174.748482460591cm
## 3 003 Julio Black 44 Male No 173.944873836774cm
## 4 004 Lupe White 44 Male FALSE 180.028888936188cm
## 5 005 Lavern Cat 43 Male No 178.636539798649cm
## 6 006 Bernie Native 43 Female FALSE 159.777836229305cm
## 7 007 Samuel White 44 Male TRUE 170.949966267411cm
## 8 008 Clair White 44 Female TRUE 161.826590769764cm
## 9 009 Shirley White 43 Male FALSE 169.66230745064cm
## 10 010 Merle Hispanic 44 Female FALSE 160.121017002151cm
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>