Course Home

Overview of this section

tolstoy

tolstoy

hadley

hadley

Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham)

http://vimeo.com/33727555

http://vita.had.co.nz/papers/tidy-data.pdf

Tidy data are important because they are an entry point for the analysis cycle we will describe during the course.

data-cycle

data-cycle

As we saw, our data, and especially data we might find out “in the wild”, may need “cleaning”" before we can use it. We will discuss how to clean our data for analysis. However, there are steps you can take to make sure your data are tidy and organised before importing into R.

We have already seen a tidy dataset in the form of Fisher’s iris dataset.

iris

iris

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Example (from tidyr paper)

Consider the following….

Name treatmenta treatmentb
John Smith - 2
Jane Doe 16 11
Mary Johnson 3 1

Data in this format are quite familiar to us, but not easily-interpretable by the computer. We need to think of the dataset in terms of variables and values. In this example dataset we have 18 values and 3 variables

  1. Person (John, Jane or Mary)
  2. Treatment (A or B)
  3. “Result”

The guiding principles:

  • Each column is a variable
  • Each row is an observation
  • Each type of observational unit forms a table

The tidy form of this data thus becomes;

##           Name Treatment Result
## 1   John Smith         a      -
## 2     Jane Doe         a     16
## 3 Mary Johnson         a      3
## 4   John Smith         b      2
## 5     Jane Doe         b     11
## 6 Mary Johnson         b      1

gather is a function in the tidyr package that can collapse multiple columns into key-value pairs.

  • similar to stack in base R
  • or melt in the reshape / reshape2 pacakges

You can get these example data here

library(tidyr)
untidy <- read.delim("tidyr-example.txt")
untidy
##           Name treatmenta treatmentb
## 1   John Smith          -          2
## 2     Jane Doe         16         11
## 3 Mary Johnson          3          1

The arguments are;

  • the data frame you want to manipulate
  • the name of the key column you want to create
  • the name of the value column you want to create
  • the names of the columns in the data frame that are to be collapsed
    • note you don’t need “” to define the column names
gather(untidy, Treatment, Result, c(treatmenta,treatmentb))
##           Name  Treatment Result
## 1   John Smith treatmenta      -
## 2     Jane Doe treatmenta     16
## 3 Mary Johnson treatmenta      3
## 4   John Smith treatmentb      2
## 5     Jane Doe treatmentb     11
## 6 Mary Johnson treatmentb      1

When specifying columns, an alternative is to ignore some column names

gather(untidy, Treatment, Result, -Name)
##           Name  Treatment Result
## 1   John Smith treatmenta      -
## 2     Jane Doe treatmenta     16
## 3 Mary Johnson treatmenta      3
## 4   John Smith treatmentb      2
## 5     Jane Doe treatmentb     11
## 6 Mary Johnson treatmentb      1

Can also specify column index

gather(untidy, Treatment, Result, 2:3)
##           Name  Treatment Result
## 1   John Smith treatmenta      -
## 2     Jane Doe treatmenta     16
## 3 Mary Johnson treatmenta      3
## 4   John Smith treatmentb      2
## 5     Jane Doe treatmentb     11
## 6 Mary Johnson treatmentb      1

Note that after all these operations, the original data frame (untidy) is unaltered

  • we need to create a new variable to save the result
  • you should never work directly on the raw data
untidy
##           Name treatmenta treatmentb
## 1   John Smith          -          2
## 2     Jane Doe         16         11
## 3 Mary Johnson          3          1
tidy <- gather(untidy, Treatment, Result, c(treatmenta,treatmentb))
## Warning: attributes are not identical across measure variables; they will
## be dropped

Example: Simulated clinical data

Using data from the Mango Solutions training dataset

library(mangoTraining)
data("messyData")
Subject Placebo.1 Placebo.2 Drug1.1 Drug1.2 Drug2.1 Drug2.2
1 45 41 40 30 51 37
2 56 57 53 43 47 49
3 39 30 37 28 40 32
4 42 38 34 25 43 26
5 50 49 35 29 38 31
6 47 44 51 43 33 46
  • What variables and observations do we have?
  • What might a ‘tidy’ version of the dataset look like?
  • Use the tidyr package to create a tidy version of the data

##   Subject Treatment Count
## 1       1 Placebo.1    45
## 2       2 Placebo.1    56
## 3       3 Placebo.1    39
## 4       4 Placebo.1    42
## 5       5 Placebo.1    50
## 6       6 Placebo.1    47

separate from the same package will also sort out the treatment names.

separate(tidy,Treatment,into=c("Treatment","Rep"))
##   Subject Treatment Rep Count
## 1       1   Placebo   1    45
## 2       2   Placebo   1    56
## 3       3   Placebo   1    39
## 4       4   Placebo   1    42
## 5       5   Placebo   1    50
## 6       6   Placebo   1    47

N.B. spread has the opposite effect to gather and will translate tidy data back into human-readable form

spread(tidy, Treatment,Count)
##   Subject Drug1.1 Drug1.2 Drug2.1 Drug2.2 Placebo.1 Placebo.2
## 1       1      40      30      51      37        45        41
## 2       2      53      43      47      49        56        57
## 3       3      37      28      40      32        39        30
## 4       4      34      25      43      26        42        38
## 5       5      35      29      38      31        50        49
## 6       6      51      43      33      46        47        44

dpylr introduction, and more data-cleaning

We are going to explore some of the basic features of dplyr using some patient data; the kind of data that we might encounter in the wild. However, rather than using real-life data we are going to make some up. There is a package called wakefield that is particularly convenient for this task.

Various patient characteristics have been generated and saved in the file patient-data.txt.

patients <- read.delim("patient-data.txt")

In Rstudio , you can view the contents of this data frame in a tab.

View(patients)

Using dplyr, patients can be converted to a special kind of data frame; the tbl_df. The main difference is that the whole data frame does not get printed to screen. All operations that we could do on a data frame can be applied to a tbl_df.

library(dplyr)
patients <- tbl_df(patients)
patients
## # A tibble: 100 × 15
##       ID      Name     Race   Age     Sex Smokes             Height
##    <int>    <fctr>   <fctr> <int>  <fctr> <fctr>             <fctr>
## 1      1 Demetrius    White    22    Male   TRUE 176.479144276913cm
## 2      2   Rosario    White    25    Male   TRUE 174.748482460591cm
## 3      3     Julio    Black    20    Male     No 173.944873836774cm
## 4      4      Lupe    White    26    Male  FALSE 180.028888936188cm
## 5      5    Lavern      Cat    35    Male     No 178.636539798649cm
## 6      6    Bernie   Native    30  Female  FALSE 159.777836229305cm
## 7      7    Samuel    White    32    Male   TRUE 170.949966267411cm
## 8      8     Clair    White    29  Female   TRUE 161.826590769764cm
## 9      9   Shirley    White    22    Male  FALSE  169.66230745064cm
## 10    10     Merle Hispanic    26  Female  FALSE 160.121017002151cm
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>

The tbl_df function in dplyr creates a new object with more-convenient default display

  • The object behaves in exactly the same way as a data frame
  • glimpse can be used to get a snapshot of your data
    • similar to str function in base R
glimpse(patients)
## Observations: 100
## Variables: 15
## $ ID          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Name        <fctr> Demetrius, Rosario, Julio, Lupe, Lavern, Bernie, ...
## $ Race        <fctr> White, White, Black, White, Cat, Native, White, W...
## $ Age         <int> 22, 25, 20, 26, 35, 30, 32, 29, 22, 26, 31, 30, 0,...
## $ Sex         <fctr> Male,  Male, Male, Male, Male, Female, Male, Fema...
## $ Smokes      <fctr> TRUE, TRUE, No, FALSE, No, FALSE, TRUE, TRUE, FAL...
## $ Height      <fctr> 176.479144276913cm, 174.748482460591cm, 173.94487...
## $ Weight      <fctr> 77.0707023167105kg, 83.2683499179022kg, 85.632310...
## $ Birth       <fctr> 1972-02-11, 1972-07-22, 1971-11-23, 1971-10-03, 1...
## $ State       <fctr> New York, Florida, Connecticut, Massachusetts, Ka...
## $ Pet         <fctr> CAT, Cat, DOG, Dog, Cat, Cat, Dog, Dog, Dog, Cat,...
## $ Grade_Level <int> 1, 1, 2, 1, 3, 2, 1, 1, 1, 1, 3, 99, 1, 1, 1, 3, 1...
## $ Died        <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALS...
## $ Count       <dbl> -0.74400114, -1.07145310, -0.82156356, 0.49766715,...
## $ Date        <fctr> 2015-03-04, 2015-03-04, 2015-03-04, 2015-03-04, 2...

Aims

We are planning a study that will involve overweight smokers. So one of our first task is to identify a list of candidates.

Selecting variables; the select verb

As we mentioned, dplyr introduces a series of verbs that can be used to operate on our data. The first we will introduce is select

N.B. base R does provide a function called subset which allowed similar operations.

select

select

The default dispaly of a tbl_df object is to print the data frame in an intelligent manner, so not to flood the R console. If we want to control what variables are printed we can use the select function, which allows us to print certain columns to the screen.

In the R introduction, we saw how to extract columns from a data frame with a $ symbol.

  • Remember that we’re not changing the data frame, just exploring it
patients$Name
##   [1] Demetrius   Rosario     Julio       Lupe        Lavern     
##   [6] Bernie      Samuel      Clair       Shirley     Merle      
##  [11] Martin      Frances     Courtney    Francis     Vernon     
##  [16] Lester      Robin       Albert      Tommy       Kyle       
##  [21] Dong        Michel      Jeremy      Pat         Eugene     
##  [26] Kris        Tracy       Jame        Clyde       Brett      
##  [31] Tony        George      Edward      Rory        Shane      
##  [36] Devin       Jerry       Drew        Ronald      Christopher
##  [41] Dominique   Van         Bobbie      Lawrence    Keith      
##  [46] Son         Charlie     Jay         Richard     Carlos     
##  [51] Gail        Marion      Lindsay     Sean        Andrea     
##  [56] Jesus       Jan         Walter      Dana        Sammy      
##  [61] Theo        Shaun       Jimmie      Carl        Evan       
##  [66] Merrill     Jon         Shayne      Thomas      Valentine  
##  [71] Cameron     Eddie       Brian       Matthew     Leslie     
##  [76] Jason       Sol         Connie      Rudy        Michal     
##  [81] Darnell     Daryl       Jordan      Kenneth     Raymond    
##  [86] Elmer       Jimmy       Whitney     Britt       Logan      
##  [91] Terry       Jamie       Lacy        Ronnie      Joseph     
##  [96] Stacy       Anthony     Alva        Dean        Luis       
## 100 Levels: Albert Alva Andrea Anthony Bernie Bobbie Brett Brian ... Whitney

The $ operator works fine for single columns, but for multiple columns it quickly gets a bit messy as we have to resort to using []

patients[,c("Name","Race")]
## # A tibble: 100 × 2
##         Name     Race
##       <fctr>   <fctr>
## 1  Demetrius    White
## 2    Rosario    White
## 3      Julio    Black
## 4       Lupe    White
## 5     Lavern      Cat
## 6     Bernie   Native
## 7     Samuel    White
## 8      Clair    White
## 9    Shirley    White
## 10     Merle Hispanic
## # ... with 90 more rows

Non R users probably find these commands a bit obtuse

  • why the [ ]?
  • what is c?
  • need to remember the row and column index
  • [,...] means display all rows

However, select is a lot more intuitive; easier to convert to the sentance “Select the Name column from patients”

  • Notice the lack of ""
select(patients, Name)
## # A tibble: 100 × 1
##         Name
##       <fctr>
## 1  Demetrius
## 2    Rosario
## 3      Julio
## 4       Lupe
## 5     Lavern
## 6     Bernie
## 7     Samuel
## 8      Clair
## 9    Shirley
## 10     Merle
## # ... with 90 more rows

“Select the Name and Race columns”

select(patients, Name, Race)
## # A tibble: 100 × 2
##         Name     Race
##       <fctr>   <fctr>
## 1  Demetrius    White
## 2    Rosario    White
## 3      Julio    Black
## 4       Lupe    White
## 5     Lavern      Cat
## 6     Bernie   Native
## 7     Samuel    White
## 8      Clair    White
## 9    Shirley    White
## 10     Merle Hispanic
## # ... with 90 more rows

Excluding certain columns in base R would require us to know the index we wanted to exclude, or to do something complicated with the column names

patients[,-2]
patients[,setdiff(colnames(patients), "Name")]
head(patients[,-2])
head(patients[,setdiff(colnames(patients), "Name")])

Compare to dplyr version….“Select all columns apart from Name”

select(patients, -Name)
## # A tibble: 100 × 14
##       ID     Race   Age     Sex Smokes             Height
##    <int>   <fctr> <int>  <fctr> <fctr>             <fctr>
## 1      1    White    22    Male   TRUE 176.479144276913cm
## 2      2    White    25    Male   TRUE 174.748482460591cm
## 3      3    Black    20    Male     No 173.944873836774cm
## 4      4    White    26    Male  FALSE 180.028888936188cm
## 5      5      Cat    35    Male     No 178.636539798649cm
## 6      6   Native    30  Female  FALSE 159.777836229305cm
## 7      7    White    32    Male   TRUE 170.949966267411cm
## 8      8    White    29  Female   TRUE 161.826590769764cm
## 9      9    White    22    Male  FALSE  169.66230745064cm
## 10    10 Hispanic    26  Female  FALSE 160.121017002151cm
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>

Similarly easier to get columns within a particular range. “Select all columns between Name and Sex”

select(patients, Name:Sex)
## # A tibble: 100 × 4
##         Name     Race   Age     Sex
##       <fctr>   <fctr> <int>  <fctr>
## 1  Demetrius    White    22    Male
## 2    Rosario    White    25    Male
## 3      Julio    Black    20    Male
## 4       Lupe    White    26    Male
## 5     Lavern      Cat    35    Male
## 6     Bernie   Native    30  Female
## 7     Samuel    White    32    Male
## 8      Clair    White    29  Female
## 9    Shirley    White    22    Male
## 10     Merle Hispanic    26  Female
## # ... with 90 more rows

Say we cannot remember the name of the column

select(patients, starts_with("Grade"))
## # A tibble: 100 × 1
##    Grade_Level
##          <int>
## 1            1
## 2            1
## 3            2
## 4            1
## 5            3
## 6            2
## 7            1
## 8            1
## 9            1
## 10           1
## # ... with 90 more rows

see also

  • contains, matches, ends_with

Exercise

  • Print all the columns between Height and Grade_Level
  • Print all the columns between Height and Grade_Level, but NOT Pet
  • Print the columns Height and Weight

There are several ways to solve these. Feel free to explore


Transforming / cleaning the data

We can now begin to explore the data. Imagine that we want to do a simple barplot to the number of males and females. Using base graphics, we can do this with;

barplot(table(select(patients,Sex)))


  • What do you think the problem might be?
  • Try printing out all values for this column, or using the nchar function to determine the problem

So we notice that the Sex column has “whitespace”; not an uncommon problem if data have been entered manually.

table(select(patients, Sex))
## 
##  Female    Male  Female    Male 
##       6       4      42      48
as.character(patients$Sex)
##   [1] "Male"    " Male"   "Male"    "Male"    "Male"    "Female"  "Male"   
##   [8] "Female"  "Male"    " Female" "Female"  "Male"    "Female"  "Female" 
##  [15] "Female"  "Male"    "Female"  "Female"  "Male"    "Female"  "Male"   
##  [22] "Male"    "Female"  "Female"  "Male"    "Male"    " Female" "Male"   
##  [29] "Male"    "Female"  "Female"  "Female"  "Male"    "Female"  "Female" 
##  [36] "Female"  "Female"  "Male"    " Male"   "Male"    "Male"    "Male"   
##  [43] "Male"    " Female" "Female"  "Male"    "Male"    "Male"    "Male"   
##  [50] " Male"   " Male"   "Female"  "Male"    "Female"  "Male"    "Female" 
##  [57] "Female"  "Male"    "Female"  "Male"    "Female"  "Female"  "Male"   
##  [64] "Male"    "Male"    "Male"    "Female"  " Female" "Female"  "Female" 
##  [71] "Female"  "Female"  "Male"    "Female"  "Female"  "Male"    "Female" 
##  [78] "Male"    "Male"    "Female"  " Female" "Female"  "Male"    "Male"   
##  [85] "Male"    "Male"    "Female"  " Female" "Female"  "Male"    "Female" 
##  [92] "Female"  "Male"    "Female"  "Male"    "Male"    "Male"    "Male"   
##  [99] "Female"  "Male"

The base distribution of R has many functions for dealing with strings. However, the stringr package (again from Hadley Wickham) aims to provide a clean modern interface to many common string operations. In a similar way, lubridate provides a convenient interface to deal with date information.

Of particular interest here is the str_trim (?str_trim) function that will trim whitespace from a vector of strings. Arguments can be used to specify whether to remove whitespace from either side of the string, or just the left or right.

library(stringr)
str_trim(patients$Sex)
##   [1] "Male"   "Male"   "Male"   "Male"   "Male"   "Female" "Male"  
##   [8] "Female" "Male"   "Female" "Female" "Male"   "Female" "Female"
##  [15] "Female" "Male"   "Female" "Female" "Male"   "Female" "Male"  
##  [22] "Male"   "Female" "Female" "Male"   "Male"   "Female" "Male"  
##  [29] "Male"   "Female" "Female" "Female" "Male"   "Female" "Female"
##  [36] "Female" "Female" "Male"   "Male"   "Male"   "Male"   "Male"  
##  [43] "Male"   "Female" "Female" "Male"   "Male"   "Male"   "Male"  
##  [50] "Male"   "Male"   "Female" "Male"   "Female" "Male"   "Female"
##  [57] "Female" "Male"   "Female" "Male"   "Female" "Female" "Male"  
##  [64] "Male"   "Male"   "Male"   "Female" "Female" "Female" "Female"
##  [71] "Female" "Female" "Male"   "Female" "Female" "Male"   "Female"
##  [78] "Male"   "Male"   "Female" "Female" "Female" "Male"   "Male"  
##  [85] "Male"   "Male"   "Female" "Female" "Female" "Male"   "Female"
##  [92] "Female" "Male"   "Female" "Male"   "Male"   "Male"   "Male"  
##  [99] "Female" "Male"

Once we have worked-out how to created a cleaned version of the variable, we need to update the data frame. This can be done using the next of the dplyr verbs; mutate.

Creating new variables: The mutate verb

mutate allows us to compute and append one or more new columns to a data frame. We can specify a new column name, or overwrite an existing one. As above, we do not change the data frame so need to create a new variable if we want the changes to persist.

mutate(patients, Sex2 = str_trim(Sex))
## # A tibble: 100 × 16
##       ID      Name     Race   Age     Sex Smokes             Height
##    <int>    <fctr>   <fctr> <int>  <fctr> <fctr>             <fctr>
## 1      1 Demetrius    White    22    Male   TRUE 176.479144276913cm
## 2      2   Rosario    White    25    Male   TRUE 174.748482460591cm
## 3      3     Julio    Black    20    Male     No 173.944873836774cm
## 4      4      Lupe    White    26    Male  FALSE 180.028888936188cm
## 5      5    Lavern      Cat    35    Male     No 178.636539798649cm
## 6      6    Bernie   Native    30  Female  FALSE 159.777836229305cm
## 7      7    Samuel    White    32    Male   TRUE 170.949966267411cm
## 8      8     Clair    White    29  Female   TRUE 161.826590769764cm
## 9      9   Shirley    White    22    Male  FALSE  169.66230745064cm
## 10    10     Merle Hispanic    26  Female  FALSE 160.121017002151cm
## # ... with 90 more rows, and 9 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>, Sex2 <chr>
mutate(patients, Sex = str_trim(Sex))
## # A tibble: 100 × 15
##       ID      Name     Race   Age    Sex Smokes             Height
##    <int>    <fctr>   <fctr> <int>  <chr> <fctr>             <fctr>
## 1      1 Demetrius    White    22   Male   TRUE 176.479144276913cm
## 2      2   Rosario    White    25   Male   TRUE 174.748482460591cm
## 3      3     Julio    Black    20   Male     No 173.944873836774cm
## 4      4      Lupe    White    26   Male  FALSE 180.028888936188cm
## 5      5    Lavern      Cat    35   Male     No 178.636539798649cm
## 6      6    Bernie   Native    30 Female  FALSE 159.777836229305cm
## 7      7    Samuel    White    32   Male   TRUE 170.949966267411cm
## 8      8     Clair    White    29 Female   TRUE 161.826590769764cm
## 9      9   Shirley    White    22   Male  FALSE  169.66230745064cm
## 10    10     Merle Hispanic    26 Female  FALSE 160.121017002151cm
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>
patients_clean <- mutate(patients, Sex = factor(str_trim(Sex)))

We might want to look at the distribution of heights amongst our patients. However, if we try this we are immediately hit with an error message

hist(select(patients,Height))
## Error in hist.default(select(patients, Height)) : 'x' must be numeric

The problem is that R thinks that values are not numbers. Why might that be?

select(patients,Height)
## # A tibble: 100 × 1
##                Height
##                <fctr>
## 1  176.479144276913cm
## 2  174.748482460591cm
## 3  173.944873836774cm
## 4  180.028888936188cm
## 5  178.636539798649cm
## 6  159.777836229305cm
## 7  170.949966267411cm
## 8  161.826590769764cm
## 9   169.66230745064cm
## 10 160.121017002151cm
## # ... with 90 more rows

To convert the values back to numbers we first need to remove the "cm" at the end of each value. We can do this with a string replacement operation. In this case, the particular string we want to put in the place of "cm" is the empty string "". The replacement can be done using the str_replace_all function.

mutate(patients_clean, Height= str_replace_all(Height,pattern = "cm",""))
## # A tibble: 100 × 15
##       ID      Name     Race   Age    Sex Smokes           Height
##    <int>    <fctr>   <fctr> <int> <fctr> <fctr>            <chr>
## 1      1 Demetrius    White    22   Male   TRUE 176.479144276913
## 2      2   Rosario    White    25   Male   TRUE 174.748482460591
## 3      3     Julio    Black    20   Male     No 173.944873836774
## 4      4      Lupe    White    26   Male  FALSE 180.028888936188
## 5      5    Lavern      Cat    35   Male     No 178.636539798649
## 6      6    Bernie   Native    30 Female  FALSE 159.777836229305
## 7      7    Samuel    White    32   Male   TRUE 170.949966267411
## 8      8     Clair    White    29 Female   TRUE 161.826590769764
## 9      9   Shirley    White    22   Male  FALSE  169.66230745064
## 10    10     Merle Hispanic    26 Female  FALSE 160.121017002151
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>

Note that the input to this function is the patients_clean object that we created previously.

However, what do you notice about the class reported for the Height column after this operation.

  • compare to the Count column, for example

We need an extra conversion step to make sure these are represented as numeric values.

patients_clean <- mutate(patients_clean, Height= as.numeric(str_replace_all(patients_clean$Height,pattern = "cm","")))
hist(patients_clean$Height)

Other useful functions from stringr, and the lubridate package, are described after the exercise. If time allows, we will go through these in the class. If you finish the exercise early, feel free to take a look.

Exercise - See Rmarkdown template

  • For a follow-on study, we are interested in overweight smokers
    • clean the Smokes column to contain just TRUE or FALSE values
  • We need to calculate the Body Mass Index (BMI) for each of our patients
  • \(BMI = (Weight) / (Height^2)\)
    • where Weight is measured in Kilograms, and Height in Metres
  • A BMI of 25 is considered overweight, calculate a new variable to indicate which individuals are overweight
  • (EXTRA) What other problems can you find in the data?
## # A tibble: 100 × 17
##       ID      Name     Race   Age    Sex Smokes   Height   Weight
##    <int>    <fctr>   <fctr> <int> <fctr>  <lgl>    <dbl>    <dbl>
## 1      1 Demetrius    White    22   Male   TRUE 176.4791 77.07070
## 2      2   Rosario    White    25   Male   TRUE 174.7485 83.26835
## 3      3     Julio    Black    20   Male  FALSE 173.9449 85.63231
## 4      4      Lupe    White    26   Male  FALSE 180.0289 88.72032
## 5      5    Lavern      Cat    35   Male  FALSE 178.6365 97.05405
## 6      6    Bernie   Native    30 Female  FALSE 159.7778 68.31247
## 7      7    Samuel    White    32   Male   TRUE 170.9500 95.65815
## 8      8     Clair    White    29 Female   TRUE 161.8266 66.02854
## 9      9   Shirley    White    22   Male  FALSE 169.6623 77.29546
## 10    10     Merle Hispanic    26 Female  FALSE 160.1210 69.59613
## # ... with 90 more rows, and 9 more variables: Birth <fctr>, State <fctr>,
## #   Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>,
## #   BMI <dbl>, Overweight <lgl>

Solution

See

Supplementary:

Rather than removing whitespace, sometimes we might want to adjust a set of strings so that they have a fixed width. We can do this be adding a padding character to each string if necessary. This operations is often done with a set of identifiers.

str_pad(patients$ID,pad = "0",width=3)
##   [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011"
##  [12] "012" "013" "014" "015" "016" "017" "018" "019" "020" "021" "022"
##  [23] "023" "024" "025" "026" "027" "028" "029" "030" "031" "032" "033"
##  [34] "034" "035" "036" "037" "038" "039" "040" "041" "042" "043" "044"
##  [45] "045" "046" "047" "048" "049" "050" "051" "052" "053" "054" "055"
##  [56] "056" "057" "058" "059" "060" "061" "062" "063" "064" "065" "066"
##  [67] "067" "068" "069" "070" "071" "072" "073" "074" "075" "076" "077"
##  [78] "078" "079" "080" "081" "082" "083" "084" "085" "086" "087" "088"
##  [89] "089" "090" "091" "092" "093" "094" "095" "096" "097" "098" "099"
## [100] "100"
patients_clean <- mutate(patients_clean, ID=str_pad(ID,pad = "0",width=3))

Note that we can perform several operations in the same command.

patients_clean <- mutate(patients, Sex = factor(str_trim(Sex)),ID=str_pad(ID,pad = "0",width=3))

Similarly, we might want to trim all the strings between a particular start and end position

str_sub(patients$State, 1,3)
##   [1] "New" "Flo" "Con" "Mas" "Kan" "Ill" "Col" "Sou" "Ill" "Mis" "Ala"
##  [12] "Col" "Ida" "Flo" "New" "Pen" "Mis" "Ind" "Mic" "Mic" "Ore" "Tex"
##  [23] "Nev" "Wis" "Uta" "Rho" "Pen" "Ill" "Vir" "Ill" "Tex" "Cal" "Geo"
##  [34] "Nor" "Col" "Ida" "Tex" "Ari" "Iow" "Cal" "Mar" "Mic" "Tex" "Mar"
##  [45] "New" "Ind" "Tex" "Ida" "Ohi" "New" "Ari" "Ind" "Ten" "Mar" "Mic"
##  [56] "Pen" "New" "Ten" "New" "Cal" "Was" "Cal" "Nor" "Rho" "Mai" "Sou"
##  [67] "Vir" "Cal" "Ida" "Ari" "Ken" "Col" "Tex" "Cal" "Pen" "Cal" "Mar"
##  [78] "Wis" "Ohi" "Flo" "Wes" "Was" "Okl" "New" "New" "Ida" "Mis" "Flo"
##  [89] "Kan" "Cal" "Nor" "Wis" "Ark" "Cal" "Mic" "Vir" "Ill" "Col" "Ken"
## [100] "Ari"
str_sub(patients$Name, 2,5)
##   [1] "emet" "osar" "ulio" "upe"  "aver" "erni" "amue" "lair" "hirl" "erle"
##  [11] "arti" "ranc" "ourt" "ranc" "erno" "este" "obin" "lber" "ommy" "yle" 
##  [21] "ong"  "iche" "erem" "at"   "ugen" "ris"  "racy" "ame"  "lyde" "rett"
##  [31] "ony"  "eorg" "dwar" "ory"  "hane" "evin" "erry" "rew"  "onal" "hris"
##  [41] "omin" "an"   "obbi" "awre" "eith" "on"   "harl" "ay"   "icha" "arlo"
##  [51] "ail"  "ario" "inds" "ean"  "ndre" "esus" "an"   "alte" "ana"  "ammy"
##  [61] "heo"  "haun" "immi" "arl"  "van"  "erri" "on"   "hayn" "homa" "alen"
##  [71] "amer" "ddie" "rian" "atth" "esli" "ason" "ol"   "onni" "udy"  "icha"
##  [81] "arne" "aryl" "orda" "enne" "aymo" "lmer" "immy" "hitn" "ritt" "ogan"
##  [91] "erry" "amie" "acy"  "onni" "osep" "tacy" "ntho" "lva"  "ean"  "uis"

Finally, we can convert entire strings to upper- or lower-case

str_to_lower(patients$Pet)
##   [1] "cat"   "cat"   "dog"   "dog"   "cat"   "cat"   "dog"   "dog"  
##   [9] "dog"   "cat"   "cat"   "dog"   "none"  "dog"   "none"  "cat"  
##  [17] "none"  "cat"   "none"  "bird"  "cat"   "dog"   "dog"   "cat"  
##  [25] "dog"   "none"  "cat"   "dog"   "none"  "dog"   "dog"   "dog"  
##  [33] "cat"   "cat"   "cat"   "null"  "none"  "dog"   "none"  "none" 
##  [41] "none"  "cat"   "cat"   "dog"   "horse" "cat"   "dog"   "dog"  
##  [49] "dog"   "dog"   "cat"   "none"  "none"  "none"  "dog"   "cat"  
##  [57] "none"  "cat"   "dog"   "dog"   "cat"   "none"  "none"  "cat"  
##  [65] "none"  NA      "cat"   "none"  "cat"   "null"  "none"  "none" 
##  [73] "cat"   "dog"   "dog"   "cat"   "none"  "none"  "none"  "dog"  
##  [81] "dog"   "dog"   "dog"   "none"  "none"  "dog"   "cat"   "cat"  
##  [89] "dog"   "cat"   "none"  "none"  NA      "null"  "none"  "dog"  
##  [97] "none"  "dog"   "dog"   "cat"
str_to_upper(patients$Name)
##   [1] "DEMETRIUS"   "ROSARIO"     "JULIO"       "LUPE"        "LAVERN"     
##   [6] "BERNIE"      "SAMUEL"      "CLAIR"       "SHIRLEY"     "MERLE"      
##  [11] "MARTIN"      "FRANCES"     "COURTNEY"    "FRANCIS"     "VERNON"     
##  [16] "LESTER"      "ROBIN"       "ALBERT"      "TOMMY"       "KYLE"       
##  [21] "DONG"        "MICHEL"      "JEREMY"      "PAT"         "EUGENE"     
##  [26] "KRIS"        "TRACY"       "JAME"        "CLYDE"       "BRETT"      
##  [31] "TONY"        "GEORGE"      "EDWARD"      "RORY"        "SHANE"      
##  [36] "DEVIN"       "JERRY"       "DREW"        "RONALD"      "CHRISTOPHER"
##  [41] "DOMINIQUE"   "VAN"         "BOBBIE"      "LAWRENCE"    "KEITH"      
##  [46] "SON"         "CHARLIE"     "JAY"         "RICHARD"     "CARLOS"     
##  [51] "GAIL"        "MARION"      "LINDSAY"     "SEAN"        "ANDREA"     
##  [56] "JESUS"       "JAN"         "WALTER"      "DANA"        "SAMMY"      
##  [61] "THEO"        "SHAUN"       "JIMMIE"      "CARL"        "EVAN"       
##  [66] "MERRILL"     "JON"         "SHAYNE"      "THOMAS"      "VALENTINE"  
##  [71] "CAMERON"     "EDDIE"       "BRIAN"       "MATTHEW"     "LESLIE"     
##  [76] "JASON"       "SOL"         "CONNIE"      "RUDY"        "MICHAL"     
##  [81] "DARNELL"     "DARYL"       "JORDAN"      "KENNETH"     "RAYMOND"    
##  [86] "ELMER"       "JIMMY"       "WHITNEY"     "BRITT"       "LOGAN"      
##  [91] "TERRY"       "JAMIE"       "LACY"        "RONNIE"      "JOSEPH"     
##  [96] "STACY"       "ANTHONY"     "ALVA"        "DEAN"        "LUIS"

Dealing with dates

Dates are often problematic as they may be encoded in lots of different schemes

If we wanted the year in which wach person was born it would be tempting to trim the first four characters of the string. However, getting the month or day might be a bit more problematic.

str_sub(patients$Birth,1,4)
##   [1] "1972" "1972" "1971" "1971" "1972" "1972" "1971" "1972" "1972" "1972"
##  [11] "1971" "1972" "1972" "1972" "1971" "1972" "1971" "1972" "1972" "1972"
##  [21] "1971" "1972" "1972" "1971" "1972" "1971" "1971" "1971" "1972" "1972"
##  [31] "1971" "1971" "1972" "1971" "1971" "1971" "1972" "1972" "1972" "1972"
##  [41] "1972" "1972" "1971" "1972" "1971" "1971" "1972" "1972" "1971" "1972"
##  [51] "1972" "1972" "1971" "1972" "1972" "1972" "1972" "1972" "1973" "1973"
##  [61] "1971" "1972" "1972" "1972" "1971" "1972" "1971" "1971" "1971" "1972"
##  [71] "1971" "1972" "1972" "1973" "1971" "1972" "1973" "1971" "1972" "1972"
##  [81] "1972" "1971" "1972" "1972" "1971" "1972" "1972" "1971" "1971" "1971"
##  [91] "1972" "1972" "1971" "1972" "1972" "1971" "1971" "1972" "1971" "1971"

The lubridate package has lots of useful functions for dealing with dates (also by Hadley Wickham!). For example it allows the year, month and day to be extracted.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
year(patients$Birth)
##   [1] 1972 1972 1971 1971 1972 1972 1971 1972 1972 1972 1971 1972 1972 1972
##  [15] 1971 1972 1971 1972 1972 1972 1971 1972 1972 1971 1972 1971 1971 1971
##  [29] 1972 1972 1971 1971 1972 1971 1971 1971 1972 1972 1972 1972 1972 1972
##  [43] 1971 1972 1971 1971 1972 1972 1971 1972 1972 1972 1971 1972 1972 1972
##  [57] 1972 1972 1973 1973 1971 1972 1972 1972 1971 1972 1971 1971 1971 1972
##  [71] 1971 1972 1972 1973 1971 1972 1973 1971 1972 1972 1972 1971 1972 1972
##  [85] 1971 1972 1972 1971 1971 1971 1972 1972 1971 1972 1972 1971 1971 1972
##  [99] 1971 1971
month(patients$Birth)
##   [1]  2  7 11 10 11  7 11  3 10  2  4  1  6  4 11  8 10 10  5  2  3  2 10
##  [24]  4  5 11  5  5  5 10  4  3 11  9  9  2  4 10  9  3  1  7 10 10  8  3
##  [47] 10  3  9  1  1  1 10  8  6  9  6 11  1  2  8  4  6 12 10  6 10  3 11
##  [70]  7  4  2  4  1 12  1  1  9  1  9  2  6  1 12  3  5  4  9  4 12  6 10
##  [93]  6 11  8  7  3  9 12 11

It also supporting some clever manipulation, such as being able to subtract and add times.

dob <- ymd(patients$Birth)
today <- ymd("20160509")
age <- interval(dob, today) 
as.period(age)
##   [1] "44y 2m 28d 0H 0M 0S"  "43y 9m 17d 0H 0M 0S"  "44y 5m 16d 0H 0M 0S" 
##   [4] "44y 7m 6d 0H 0M 0S"   "43y 5m 16d 0H 0M 0S"  "43y 9m 9d 0H 0M 0S"  
##   [7] "44y 5m 26d 0H 0M 0S"  "44y 1m 25d 0H 0M 0S"  "43y 6m 29d 0H 0M 0S" 
##  [10] "44y 2m 11d 0H 0M 0S"  "45y 1m 7d 0H 0M 0S"   "44y 3m 14d 0H 0M 0S" 
##  [13] "43y 11m 1d 0H 0M 0S"  "44y 1m 0d 0H 0M 0S"   "44y 6m 4d 0H 0M 0S"  
##  [16] "43y 8m 24d 0H 0M 0S"  "44y 6m 16d 0H 0M 0S"  "43y 6m 24d 0H 0M 0S" 
##  [19] "44y 0m 7d 0H 0M 0S"   "44y 2m 27d 0H 0M 0S"  "45y 2m 7d 0H 0M 0S"  
##  [22] "44y 3m 6d 0H 0M 0S"   "43y 6m 18d 0H 0M 0S"  "45y 0m 14d 0H 0M 0S" 
##  [25] "43y 11m 21d 0H 0M 0S" "44y 5m 21d 0H 0M 0S"  "45y 0m 3d 0H 0M 0S"  
##  [28] "44y 11m 22d 0H 0M 0S" "44y 0m 6d 0H 0M 0S"   "43y 6m 21d 0H 0M 0S" 
##  [31] "45y 1m 0d 0H 0M 0S"   "45y 1m 28d 0H 0M 0S"  "43y 5m 18d 0H 0M 0S" 
##  [34] "44y 7m 27d 0H 0M 0S"  "44y 7m 12d 0H 0M 0S"  "45y 2m 11d 0H 0M 0S" 
##  [37] "44y 0m 22d 0H 0M 0S"  "43y 6m 10d 0H 0M 0S"  "43y 8m 0d 0H 0M 0S"  
##  [40] "44y 1m 13d 0H 0M 0S"  "44y 3m 18d 0H 0M 0S"  "43y 9m 24d 0H 0M 0S" 
##  [43] "44y 7m 7d 0H 0M 0S"   "43y 6m 26d 0H 0M 0S"  "44y 8m 28d 0H 0M 0S" 
##  [46] "45y 1m 10d 0H 0M 0S"  "43y 6m 24d 0H 0M 0S"  "44y 1m 29d 0H 0M 0S" 
##  [49] "44y 7m 28d 0H 0M 0S"  "44y 3m 16d 0H 0M 0S"  "44y 3m 25d 0H 0M 0S" 
##  [52] "44y 4m 8d 0H 0M 0S"   "44y 7m 0d 0H 0M 0S"   "43y 9m 4d 0H 0M 0S"  
##  [55] "43y 11m 5d 0H 0M 0S"  "43y 7m 22d 0H 0M 0S"  "43y 10m 19d 0H 0M 0S"
##  [58] "43y 6m 8d 0H 0M 0S"   "43y 4m 8d 0H 0M 0S"   "43y 2m 27d 0H 0M 0S" 
##  [61] "44y 9m 8d 0H 0M 0S"   "44y 1m 6d 0H 0M 0S"   "43y 10m 14d 0H 0M 0S"
##  [64] "43y 5m 5d 0H 0M 0S"   "44y 6m 13d 0H 0M 0S"  "43y 11m 4d 0H 0M 0S" 
##  [67] "44y 6m 16d 0H 0M 0S"  "45y 1m 24d 0H 0M 0S"  "44y 5m 12d 0H 0M 0S" 
##  [70] "43y 9m 29d 0H 0M 0S"  "45y 1m 8d 0H 0M 0S"   "44y 2m 27d 0H 0M 0S" 
##  [73] "44y 0m 16d 0H 0M 0S"  "43y 3m 18d 0H 0M 0S"  "44y 4m 28d 0H 0M 0S" 
##  [76] "44y 3m 16d 0H 0M 0S"  "43y 3m 16d 0H 0M 0S"  "44y 7m 21d 0H 0M 0S" 
##  [79] "44y 3m 29d 0H 0M 0S"  "43y 7m 11d 0H 0M 0S"  "44y 2m 10d 0H 0M 0S" 
##  [82] "44y 10m 14d 0H 0M 0S" "44y 3m 13d 0H 0M 0S"  "43y 4m 20d 0H 0M 0S" 
##  [85] "45y 2m 0d 0H 0M 0S"   "44y 0m 4d 0H 0M 0S"   "44y 0m 10d 0H 0M 0S" 
##  [88] "44y 7m 28d 0H 0M 0S"  "45y 0m 28d 0H 0M 0S"  "44y 4m 9d 0H 0M 0S"  
##  [91] "43y 11m 1d 0H 0M 0S"  "43y 6m 28d 0H 0M 0S"  "44y 10m 27d 0H 0M 0S"
##  [94] "43y 5m 16d 0H 0M 0S"  "43y 9m 3d 0H 0M 0S"   "44y 9m 24d 0H 0M 0S" 
##  [97] "45y 1m 23d 0H 0M 0S"  "43y 7m 21d 0H 0M 0S"  "44y 4m 17d 0H 0M 0S" 
## [100] "44y 5m 16d 0H 0M 0S"
patients_clean <- mutate(patients_clean, Age = year(as.period(age)))
patients_clean
## # A tibble: 100 × 15
##       ID      Name     Race   Age    Sex Smokes             Height
##    <chr>    <fctr>   <fctr> <int> <fctr> <fctr>             <fctr>
## 1    001 Demetrius    White    44   Male   TRUE 176.479144276913cm
## 2    002   Rosario    White    43   Male   TRUE 174.748482460591cm
## 3    003     Julio    Black    44   Male     No 173.944873836774cm
## 4    004      Lupe    White    44   Male  FALSE 180.028888936188cm
## 5    005    Lavern      Cat    43   Male     No 178.636539798649cm
## 6    006    Bernie   Native    43 Female  FALSE 159.777836229305cm
## 7    007    Samuel    White    44   Male   TRUE 170.949966267411cm
## 8    008     Clair    White    44 Female   TRUE 161.826590769764cm
## 9    009   Shirley    White    43   Male  FALSE  169.66230745064cm
## 10   010     Merle Hispanic    44 Female  FALSE 160.121017002151cm
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>