Course Home

Overview of this section

We’ve ended up with a long chain of steps to perform on our data. It is quite common to nest commands in R into a single line;

patients <- tbl_df(read.delim("patient-data.txt"))

we read as

apply the tbl_df to the result of reading the file patient-data.txt

Could also do the same for our mutate statements, although this would quickly become convoluted..

remove the cm characters from the Height variable of the data frame created by trimming white space from the sex variable from the patients data frame

patients_clean <- mutate(mutate(patients, Sex = factor(str_trim(Sex))),
                         Height = str_replace_all(patients$Height,pattern="cm",""))

We always have to work out what the first statement was, and work forwards from that.

Alternatively, we could write each command as a separate line

patients_clean<- mutate(patients, Sex = factor(str_trim(Sex)))
patients_clean <- mutate(patients_clean, Height= str_replace_all(patients_clean$Height,pattern = "cm",""))

Introducing piping

The output of one operations gets used as the input of the next

In computing, this is referring to as piping

  • unix commands use the | symbol

magrittr

not-a-pipe

not-a-pipe

also-not-a-pipe

also-not-a-pipe

  • the magrittr library implements this in R

Simple example

read the file patient-data.txt and then use the tbl_df function

  • Use the special function %>% at the end of the line
  • Shortcut in RStudio: CTRL + SHIFT + M, CMD + SHIFT + M (Mac)
read.delim("patient-data.txt") %>% 
  tbl_df

The read.delim function produces a data frame as an output, and the tbl_df function only has one argument (a data frame). All the dplyr verbs that we will discuss today can be pipelined with %>% as they take a data frame as input, and produce a data frame as output.

Note that we need an assignment if we want to save the result

patients <- read.delim("patient-data.txt") %>% 
  tbl_df

We can re-write our data cleaning steps from the previous section with;

patients_clean <- read.delim("patient-data.txt") %>% 
  tbl_df %>% 
  mutate(Sex = factor(str_trim(Sex))) %>% 
  mutate(Height= as.numeric(str_replace_all(Height,pattern = "cm","")))

patients_clean
## # A tibble: 100 × 15
##       ID      Name     Race   Age    Sex Smokes   Height
##    <int>    <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>
## 1      1 Demetrius    White    22   Male   TRUE 176.4791
## 2      2   Rosario    White    25   Male   TRUE 174.7485
## 3      3     Julio    Black    20   Male     No 173.9449
## 4      4      Lupe    White    26   Male  FALSE 180.0289
## 5      5    Lavern      Cat    35   Male     No 178.6365
## 6      6    Bernie   Native    30 Female  FALSE 159.7778
## 7      7    Samuel    White    32   Male   TRUE 170.9500
## 8      8     Clair    White    29 Female   TRUE 161.8266
## 9      9   Shirley    White    22   Male  FALSE 169.6623
## 10    10     Merle Hispanic    26 Female  FALSE 160.1210
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>

read the file patient-data.txt and then use the tbl_df function and then trim the whitespace from the Sex variable and then replace cm with blank characters in the Height variable

Note that we didn’t need to specify patients_clean as an argument to the mutate function

  • it as assumed that the output of one line acts as the first argument to the next line
  • we don’t need to create temporary variables

Exercise: Rmarkdown Template


Take the steps used to clean the patients dataset and calculate BMI (see template for the code)

  • Re-write in the piping framework
  • Add a step to print just the ID, Name, Date of Birth, Smokes and Overweight columns

## # A tibble: 100 × 6
##       ID      Name      Birth      BMI Smokes Overweight
##    <int>    <fctr>     <fctr>    <dbl>  <chr>      <lgl>
## 1      1 Demetrius 1972-02-11 24.74586   TRUE      FALSE
## 2      2   Rosario 1972-07-22 27.26799   TRUE       TRUE
## 3      3     Julio 1971-11-23 28.30182     No       TRUE
## 4      4      Lupe 1971-10-03 27.37403  FALSE       TRUE
## 5      5    Lavern 1972-11-23 30.41397     No       TRUE
## 6      6    Bernie 1972-07-30 26.75882  FALSE       TRUE
## 7      7    Samuel 1971-11-13 32.73286   TRUE       TRUE
## 8      8     Clair 1972-03-14 25.21343   TRUE       TRUE
## 9      9   Shirley 1972-10-10 26.85241  FALSE       TRUE
## 10    10     Merle 1972-02-28 27.14491  FALSE       TRUE
## # ... with 90 more rows

Now having displayed the relevant information for our patients, we want to extract rows of interest from the data frame.

Selecting rows: The filter verb

filter

filter

The filter verb is used to select rows from the data frame. The criteria we use to select can use the comparisons ==, >, <, !=

e.g. select all the males

filter(patients_clean, Sex == "Male") 
## # A tibble: 52 × 15
##       ID      Name   Race   Age    Sex Smokes   Height             Weight
##    <int>    <fctr> <fctr> <int> <fctr> <fctr>    <dbl>             <fctr>
## 1      1 Demetrius  White    22   Male   TRUE 176.4791 77.0707023167105kg
## 2      2   Rosario  White    25   Male   TRUE 174.7485 83.2683499179022kg
## 3      3     Julio  Black    20   Male     No 173.9449 85.6323107011968kg
## 4      4      Lupe  White    26   Male  FALSE 180.0289 88.7203170843447kg
## 5      5    Lavern    Cat    35   Male     No 178.6365 97.0540533225734kg
## 6      7    Samuel  White    32   Male   TRUE 170.9500 95.6581459256895kg
## 7      9   Shirley  White    22   Male  FALSE 169.6623 77.2954550926939kg
## 8     12   Frances  White    30   Male  FALSE 170.5112 84.3504821028372kg
## 9     16    Lester  Black    21   Male   TRUE 178.7927 88.8860701841483kg
## 10    19     Tommy  Black    23   Male  FALSE 167.5139 84.1505069519618kg
## # ... with 42 more rows, and 7 more variables: Birth <fctr>, State <fctr>,
## #   Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>

In base R, we would do

patients_clean[patients_clean$Sex == "Male",]

Again, to non R-users, this is less intuitive. However, the result is the same.

## # A tibble: 6 × 15
##      ID      Name   Race   Age    Sex Smokes   Height             Weight
##   <int>    <fctr> <fctr> <int> <fctr> <fctr>    <dbl>             <fctr>
## 1     1 Demetrius  White    22   Male   TRUE 176.4791 77.0707023167105kg
## 2     2   Rosario  White    25   Male   TRUE 174.7485 83.2683499179022kg
## 3     3     Julio  Black    20   Male     No 173.9449 85.6323107011968kg
## 4     4      Lupe  White    26   Male  FALSE 180.0289 88.7203170843447kg
## 5     5    Lavern    Cat    35   Male     No 178.6365 97.0540533225734kg
## 6     7    Samuel  White    32   Male   TRUE 170.9500 95.6581459256895kg
## # ... with 7 more variables: Birth <fctr>, State <fctr>, Pet <fctr>,
## #   Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>

The %in% function can check for matches in a pre-defined vector

filter(patients_clean, Pet %in% c("CAT", "Cat"))
## # A tibble: 30 × 15
##       ID      Name     Race   Age    Sex Smokes   Height
##    <int>    <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>
## 1      1 Demetrius    White    22   Male   TRUE 176.4791
## 2      2   Rosario    White    25   Male   TRUE 174.7485
## 3      5    Lavern      Cat    35   Male     No 178.6365
## 4      6    Bernie   Native    30 Female  FALSE 159.7778
## 5     10     Merle Hispanic    26 Female  FALSE 160.1210
## 6     11    Martin    White    31 Female  FALSE 159.3178
## 7     16    Lester    Black    21   Male   TRUE 178.7927
## 8     18    Albert    White    23 Female  FALSE 157.0079
## 9     21      Dong    White    22   Male  FALSE 168.1559
## 10    24       Pat    Black    32 Female  FALSE 158.9396
## # ... with 20 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>

Combining conditions in filter can be achieved adding , or using the Boolean operators |, &

filter(patients_clean, Sex == "Male", Died)
## # A tibble: 32 × 15
##       ID      Name     Race   Age    Sex Smokes   Height
##    <int>    <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>
## 1      1 Demetrius    White    22   Male   TRUE 176.4791
## 2      2   Rosario    White    25   Male   TRUE 174.7485
## 3      4      Lupe    White    26   Male  FALSE 180.0289
## 4      9   Shirley    White    22   Male  FALSE 169.6623
## 5     12   Frances    White    30   Male  FALSE 170.5112
## 6     19     Tommy    Black    23   Male  FALSE 167.5139
## 7     22    Michel    White    29   Male  FALSE 175.6678
## 8     26      Kris Hispanic     0   Male  FALSE 176.9776
## 9     29     Clyde Hispanic    26   Male  FALSE 184.4273
## 10    33    Edward    White    20   Male  FALSE 171.9395
## # ... with 22 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>

The code in base R is a lot more verbose

patients[patients_clean$Sex == "Male" & patients_clean$Died,]

Either Females, or Grade greater than 1

filter(patients_clean, Sex == "Female" | Grade_Level > 1)
## # A tibble: 85 × 15
##       ID     Name     Race   Age    Sex Smokes   Height             Weight
##    <int>   <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>             <fctr>
## 1      3    Julio    Black    20   Male     No 173.9449 85.6323107011968kg
## 2      5   Lavern      Cat    35   Male     No 178.6365 97.0540533225734kg
## 3      6   Bernie   Native    30 Female  FALSE 159.7778  68.312471247052kg
## 4      8    Clair    White    29 Female   TRUE 161.8266  66.028536339608kg
## 5     10    Merle Hispanic    26 Female  FALSE 160.1210 69.5961280454617kg
## 6     11   Martin    White    31 Female  FALSE 159.3178  64.920442105117kg
## 7     12  Frances    White    30   Male  FALSE 170.5112 84.3504821028372kg
## 8     13 Courtney    White     0 Female   TRUE 161.8411 69.9705548836931kg
## 9     14  Francis    White    33 Female   TRUE 162.9728 62.5016983970221kg
## 10    15   Vernon    White    32 Female   TRUE 168.4921 73.3540772541499kg
## # ... with 75 more rows, and 7 more variables: Birth <fctr>, State <fctr>,
## #   Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>

Exercise


  • Select the Smokers with Age over 30
  • Select the patients that are still alive
  • Select the patients who do NOT own dogs
  • Select the patients from New York or New Jersey

Feel free to experiment with different ways to do these


We now have all the tools in place in order to retrieve the overweight smokers from our dataset. An additional step might be to write the filtered data frame out to a file. Recall that functions write.table, write.csv are used to write a data frame to a tab-delimited or csv file. Because these functions accept a data frame as their first argument, they fit easily into the pipeline framework as the last step.

Exercise: workflow-exercise.Rmd


Modify the workflow to

  • select the candidates (overweight smokers)
  • make sure the candidates are still alive
  • write the result to a csv file

Ordering rows: The arrange verb

arrange

arrange

A related verb, that works on the rows of the table, is called arrange. You may have seen that we can use order in base R to re-order tables. Again, arrange is more intuitive.

Lets say we want to sort our patients by height;

arrange(patients_clean, Height)
## # A tibble: 100 × 15
##       ID      Name     Race   Age    Sex Smokes   Height
##    <int>    <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>
## 1     69    Thomas    White    24 Female  FALSE 156.0006
## 2     44  Lawrence Hispanic    31 Female    Yes 156.3782
## 3     61      Theo    Asian    32 Female  FALSE 156.4514
## 4     67       Jon    White     0 Female  FALSE 156.5484
## 5     18    Albert    White    23 Female  FALSE 157.0079
## 6     23    Jeremy    White     0 Female  FALSE 157.0379
## 7     72     Eddie Hispanic    25 Female     No 157.5677
## 8     71   Cameron    Black    30 Female  FALSE 157.6721
## 9     70 Valentine Hispanic    34 Female  FALSE 157.7601
## 10    34      Rory    Asian    34 Female   TRUE 158.0671
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>

Can specify descending order using desc

arrange(patients_clean, desc(Height))
## # A tibble: 100 × 15
##       ID        Name     Race   Age    Sex Smokes   Height
##    <int>      <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>
## 1     95      Joseph    White    34   Male  FALSE 191.5574
## 2     43      Bobbie    White    34   Male    Yes 184.9876
## 3     29       Clyde Hispanic    26   Male  FALSE 184.4273
## 4     73       Brian Hispanic     0   Male  FALSE 184.3635
## 5     28        Jame    White    24   Male  FALSE 184.3365
## 6     58      Walter    White    34   Male  FALSE 183.1878
## 7     40 Christopher    White    28   Male  FALSE 183.0974
## 8    100        Luis Hispanic    31   Male  FALSE 182.1725
## 9     50      Carlos    White    30   Male   TRUE 181.2262
## 10    48         Jay    White     0   Male  FALSE 180.9550
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## #   Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## #   Count <dbl>, Date <fctr>

Can use multiple variables in the sorting

e.g. Grade_Level and then Count

arrange(patients_clean, Grade_Level, Count)
## # A tibble: 100 × 15
##       ID    Name     Race   Age    Sex Smokes   Height             Weight
##    <int>  <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>             <fctr>
## 1      9 Shirley    White    22   Male  FALSE 169.6623 77.2954550926939kg
## 2     19   Tommy    Black    23   Male  FALSE 167.5139 84.1505069519618kg
## 3     94  Ronnie    White    26 Female  FALSE 164.5555 68.4283083462713kg
## 4     33  Edward    White    20   Male  FALSE 171.9395 82.1098370845095kg
## 5      2 Rosario    White    25   Male   TRUE 174.7485 83.2683499179022kg
## 6     43  Bobbie    White    34   Male    Yes 184.9876 85.5216473586341kg
## 7     72   Eddie Hispanic    25 Female     No 157.5677 68.2669628109541kg
## 8      8   Clair    White    29 Female   TRUE 161.8266  66.028536339608kg
## 9     26    Kris Hispanic     0   Male  FALSE 176.9776 90.6467686495169kg
## 10    49 Richard    White    26   Male   TRUE 173.6746 80.0520716781511kg
## # ... with 90 more rows, and 7 more variables: Birth <fctr>, State <fctr>,
## #   Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>

A really convenient function is top_n

top_n(patients_clean,10,Height)
## # A tibble: 10 × 15
##       ID        Name     Race   Age    Sex Smokes   Height
##    <int>      <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>
## 1     28        Jame    White    24   Male  FALSE 184.3365
## 2     29       Clyde Hispanic    26   Male  FALSE 184.4273
## 3     40 Christopher    White    28   Male  FALSE 183.0974
## 4     43      Bobbie    White    34   Male    Yes 184.9876
## 5     48         Jay    White     0   Male  FALSE 180.9550
## 6     50      Carlos    White    30   Male   TRUE 181.2262
## 7     58      Walter    White    34   Male  FALSE 183.1878
## 8     73       Brian Hispanic     0   Male  FALSE 184.3635
## 9     95      Joseph    White    34   Male  FALSE 191.5574
## 10   100        Luis Hispanic    31   Male  FALSE 182.1725
## # ... with 8 more variables: Weight <fctr>, Birth <fctr>, State <fctr>,
## #   Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>
top_n(patients_clean,10,Weight)
## # A tibble: 10 × 15
##       ID    Name     Race   Age    Sex Smokes   Height             Weight
##    <int>  <fctr>   <fctr> <int> <fctr> <fctr>    <dbl>             <fctr>
## 1      5  Lavern      Cat    35   Male     No 178.6365 97.0540533225734kg
## 2      7  Samuel    White    32   Male   TRUE 170.9500 95.6581459256895kg
## 3     26    Kris Hispanic     0   Male  FALSE 176.9776 90.6467686495169kg
## 4     28    Jame    White    24   Male  FALSE 184.3365  90.413290088493kg
## 5     53 Lindsay    White    29   Male   TRUE 177.9423 92.4209777707246kg
## 6     64    Carl Hispanic    21   Male  FALSE 180.9077 97.3835285144635kg
## 7     73   Brian Hispanic     0   Male  FALSE 184.3635 93.2462170418945kg
## 8     76   Jason    White    33   Male  FALSE 169.1347  92.756262807277kg
## 9     97 Anthony    White    28   Male     No 180.6568 91.9818422000367kg
## 10    98    Alva    White    33   Male  FALSE 178.4494   91.77367816887kg
## # ... with 7 more variables: Birth <fctr>, State <fctr>, Pet <fctr>,
## #   Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>

With the workflow we have developed in the previous exercises, we can add extra steps to arrange and then select the candidates with the highest BMI prior to writing to a file.

read.delim("patient-data.txt") %>% 
  tbl_df %>% 
  mutate(Sex = factor(str_trim(Sex))) %>% 
  mutate(Height= as.numeric(str_replace_all(Height,pattern = "cm",""))) %>% 
  mutate(Weight = as.numeric(str_replace_all(Weight,"kg",""))) %>% 
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>% 
  mutate(Smokes = str_replace_all(Smokes, "Yes", TRUE)) %>% 
  mutate(Smokes = as.logical(str_replace_all(Smokes, "No", FALSE))) %>% 
  select(ID, Name, Birth,BMI,Smokes,Overweight,Died) %>% 
  filter(Smokes & Overweight & !Died) %>% 
  arrange(desc(BMI)) %>% 
  top_n(10, BMI)
## # A tibble: 9 × 7
##      ID     Name      Birth      BMI Smokes Overweight  Died
##   <int>   <fctr>     <fctr>    <dbl>  <lgl>      <lgl> <lgl>
## 1     7   Samuel 1971-11-13 32.73286   TRUE       TRUE FALSE
## 2    17    Robin 1971-10-23 28.11525   TRUE       TRUE FALSE
## 3    45    Keith 1971-08-11 27.96600   TRUE       TRUE FALSE
## 4    16   Lester 1972-08-15 27.80572   TRUE       TRUE FALSE
## 5    44 Lawrence 1972-10-13 27.06717   TRUE       TRUE FALSE
## 6    92    Jamie 1972-10-11 26.94385   TRUE       TRUE FALSE
## 7    50   Carlos 1972-01-23 26.00887   TRUE       TRUE FALSE
## 8    15   Vernon 1971-11-05 25.83838   TRUE       TRUE FALSE
## 9     8    Clair 1972-03-14 25.21343   TRUE       TRUE FALSE
  write.csv("candidates.csv")
## "","x"
## "1","candidates.csv"

Re-usable pipelines

Imagine we have a second dataset that we want to process; cohort-data.txt.

This time we have a million patients, and shouldn’t really be dealing with these data in Excel. However, once we have defined a pipeline on our smaller dataset, we can see how it scales to the new data

 read.delim("cohort-data.txt") %>% 
  tbl_df %>% 
  mutate(Sex = factor(str_trim(Sex))) %>% 
  mutate(Weight = as.numeric(str_replace_all(Weight,"kg",""))) %>% 
  mutate(Height= as.numeric(str_replace_all(Height,pattern = "cm",""))) %>% 
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>% 
  mutate(Smokes = as.logical(str_replace_all(Smokes, "Yes", TRUE))) %>% 
  mutate(Smokes = as.logical(str_replace_all(Smokes, "No", FALSE))) %>%
  filter(Smokes & Overweight) %>% 
  select(ID, Name, Birth, BMI, Smokes,Overweight)  %>% 
  arrange(desc(BMI)) %>% 
  top_n(10, BMI) %>% 
  write.table("study-candidates.csv")

As the file is quite large, we might want to switch to readr for smarter and faster reading

library(readr)
 read_tsv("cohort-data.txt") %>% 
  tbl_df %>% 
  mutate(Sex = factor(str_trim(Sex))) %>% 
  mutate(Weight = as.numeric(str_replace_all(Weight,"kg",""))) %>% 
  mutate(Height= as.numeric(str_replace_all(Height,pattern = "cm",""))) %>% 
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>% 
  mutate(Smokes = str_replace_all(Smokes, "Yes", TRUE)) %>% 
  mutate(Smokes = as.logical(str_replace_all(Smokes, "No", FALSE))) %>%
  filter(Smokes & Overweight) %>% 
  select(ID, Name, Birth, BMI, Smokes,Overweight)  %>% 
  arrange(desc(BMI)) %>% 
  top_n(10, BMI) %>% 
  write.table("study-candidates.csv")

Summary

In the past two sections we have introduced the concept of pipeline and various verbs in dplyr that will help us to manipulate our data

  • select; for choosing columns
  • mutate; for adding new columns
  • filter; for selecting rows
  • arrange; for ordering rows

We can use these in any order we like to create a workflow. When faced with a dataset, we should asked ourselves what steps we want to perform.

  • The piping facility in R should allow us to express the workflow in a way that is intuitive and easy-to-ready

Solution

See here