We’ve ended up with a long chain of steps to perform on our data. It is quite common to nest commands in R into a single line;
patients <- tbl_df(read.delim("patient-data.txt"))
we read as
apply the
tbl_df
to the result of reading the filepatient-data.txt
Could also do the same for our mutate
statements, although this would quickly become convoluted..
remove the cm characters from the Height variable of the data frame created by trimming white space from the sex variable from the patients data frame
patients_clean <- mutate(mutate(patients, Sex = factor(str_trim(Sex))),
Height = str_replace_all(patients$Height,pattern="cm",""))
We always have to work out what the first statement was, and work forwards from that.
Alternatively, we could write each command as a separate line
patients_clean<- mutate(patients, Sex = factor(str_trim(Sex)))
patients_clean <- mutate(patients_clean, Height= str_replace_all(patients_clean$Height,pattern = "cm",""))
The output of one operations gets used as the input of the next
In computing, this is referring to as piping
|
symbolmagrittr
library implements this in Rread the file
patient-data.txt
and then use thetbl_df
function
%>%
at the end of the lineread.delim("patient-data.txt") %>%
tbl_df
The read.delim
function produces a data frame as an output, and the tbl_df
function only has one argument (a data frame). All the dplyr
verbs that we will discuss today can be pipelined with %>%
as they take a data frame as input, and produce a data frame as output.
Note that we need an assignment if we want to save the result
patients <- read.delim("patient-data.txt") %>%
tbl_df
We can re-write our data cleaning steps from the previous section with;
patients_clean <- read.delim("patient-data.txt") %>%
tbl_df %>%
mutate(Sex = factor(str_trim(Sex))) %>%
mutate(Height= as.numeric(str_replace_all(Height,pattern = "cm","")))
patients_clean
## # A tibble: 100 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl>
## 1 1 Demetrius White 22 Male TRUE 176.4791
## 2 2 Rosario White 25 Male TRUE 174.7485
## 3 3 Julio Black 20 Male No 173.9449
## 4 4 Lupe White 26 Male FALSE 180.0289
## 5 5 Lavern Cat 35 Male No 178.6365
## 6 6 Bernie Native 30 Female FALSE 159.7778
## 7 7 Samuel White 32 Male TRUE 170.9500
## 8 8 Clair White 29 Female TRUE 161.8266
## 9 9 Shirley White 22 Male FALSE 169.6623
## 10 10 Merle Hispanic 26 Female FALSE 160.1210
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
read the file
patient-data.txt
and then use thetbl_df
function and then trim the whitespace from the Sex variable and then replace cm with blank characters in the Height variable
Note that we didn’t need to specify patients_clean
as an argument to the mutate
function
Take the steps used to clean the patients dataset and calculate BMI (see template for the code)
## # A tibble: 100 × 6
## ID Name Birth BMI Smokes Overweight
## <int> <fctr> <fctr> <dbl> <chr> <lgl>
## 1 1 Demetrius 1972-02-11 24.74586 TRUE FALSE
## 2 2 Rosario 1972-07-22 27.26799 TRUE TRUE
## 3 3 Julio 1971-11-23 28.30182 No TRUE
## 4 4 Lupe 1971-10-03 27.37403 FALSE TRUE
## 5 5 Lavern 1972-11-23 30.41397 No TRUE
## 6 6 Bernie 1972-07-30 26.75882 FALSE TRUE
## 7 7 Samuel 1971-11-13 32.73286 TRUE TRUE
## 8 8 Clair 1972-03-14 25.21343 TRUE TRUE
## 9 9 Shirley 1972-10-10 26.85241 FALSE TRUE
## 10 10 Merle 1972-02-28 27.14491 FALSE TRUE
## # ... with 90 more rows
Now having displayed the relevant information for our patients, we want to extract rows of interest from the data frame.
filter
verbThe filter
verb is used to select rows from the data frame. The criteria we use to select can use the comparisons ==
, >
, <
, !=
e.g. select all the males
filter(patients_clean, Sex == "Male")
## # A tibble: 52 × 15
## ID Name Race Age Sex Smokes Height Weight
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl> <fctr>
## 1 1 Demetrius White 22 Male TRUE 176.4791 77.0707023167105kg
## 2 2 Rosario White 25 Male TRUE 174.7485 83.2683499179022kg
## 3 3 Julio Black 20 Male No 173.9449 85.6323107011968kg
## 4 4 Lupe White 26 Male FALSE 180.0289 88.7203170843447kg
## 5 5 Lavern Cat 35 Male No 178.6365 97.0540533225734kg
## 6 7 Samuel White 32 Male TRUE 170.9500 95.6581459256895kg
## 7 9 Shirley White 22 Male FALSE 169.6623 77.2954550926939kg
## 8 12 Frances White 30 Male FALSE 170.5112 84.3504821028372kg
## 9 16 Lester Black 21 Male TRUE 178.7927 88.8860701841483kg
## 10 19 Tommy Black 23 Male FALSE 167.5139 84.1505069519618kg
## # ... with 42 more rows, and 7 more variables: Birth <fctr>, State <fctr>,
## # Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>
In base R, we would do
patients_clean[patients_clean$Sex == "Male",]
Again, to non R-users, this is less intuitive. However, the result is the same.
## # A tibble: 6 × 15
## ID Name Race Age Sex Smokes Height Weight
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl> <fctr>
## 1 1 Demetrius White 22 Male TRUE 176.4791 77.0707023167105kg
## 2 2 Rosario White 25 Male TRUE 174.7485 83.2683499179022kg
## 3 3 Julio Black 20 Male No 173.9449 85.6323107011968kg
## 4 4 Lupe White 26 Male FALSE 180.0289 88.7203170843447kg
## 5 5 Lavern Cat 35 Male No 178.6365 97.0540533225734kg
## 6 7 Samuel White 32 Male TRUE 170.9500 95.6581459256895kg
## # ... with 7 more variables: Birth <fctr>, State <fctr>, Pet <fctr>,
## # Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>
The %in%
function can check for matches in a pre-defined vector
filter(patients_clean, Pet %in% c("CAT", "Cat"))
## # A tibble: 30 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl>
## 1 1 Demetrius White 22 Male TRUE 176.4791
## 2 2 Rosario White 25 Male TRUE 174.7485
## 3 5 Lavern Cat 35 Male No 178.6365
## 4 6 Bernie Native 30 Female FALSE 159.7778
## 5 10 Merle Hispanic 26 Female FALSE 160.1210
## 6 11 Martin White 31 Female FALSE 159.3178
## 7 16 Lester Black 21 Male TRUE 178.7927
## 8 18 Albert White 23 Female FALSE 157.0079
## 9 21 Dong White 22 Male FALSE 168.1559
## 10 24 Pat Black 32 Female FALSE 158.9396
## # ... with 20 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
Combining conditions in filter
can be achieved adding ,
or using the Boolean operators |
, &
filter(patients_clean, Sex == "Male", Died)
## # A tibble: 32 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl>
## 1 1 Demetrius White 22 Male TRUE 176.4791
## 2 2 Rosario White 25 Male TRUE 174.7485
## 3 4 Lupe White 26 Male FALSE 180.0289
## 4 9 Shirley White 22 Male FALSE 169.6623
## 5 12 Frances White 30 Male FALSE 170.5112
## 6 19 Tommy Black 23 Male FALSE 167.5139
## 7 22 Michel White 29 Male FALSE 175.6678
## 8 26 Kris Hispanic 0 Male FALSE 176.9776
## 9 29 Clyde Hispanic 26 Male FALSE 184.4273
## 10 33 Edward White 20 Male FALSE 171.9395
## # ... with 22 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
The code in base R is a lot more verbose
patients[patients_clean$Sex == "Male" & patients_clean$Died,]
Either Females, or Grade greater than 1
filter(patients_clean, Sex == "Female" | Grade_Level > 1)
## # A tibble: 85 × 15
## ID Name Race Age Sex Smokes Height Weight
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl> <fctr>
## 1 3 Julio Black 20 Male No 173.9449 85.6323107011968kg
## 2 5 Lavern Cat 35 Male No 178.6365 97.0540533225734kg
## 3 6 Bernie Native 30 Female FALSE 159.7778 68.312471247052kg
## 4 8 Clair White 29 Female TRUE 161.8266 66.028536339608kg
## 5 10 Merle Hispanic 26 Female FALSE 160.1210 69.5961280454617kg
## 6 11 Martin White 31 Female FALSE 159.3178 64.920442105117kg
## 7 12 Frances White 30 Male FALSE 170.5112 84.3504821028372kg
## 8 13 Courtney White 0 Female TRUE 161.8411 69.9705548836931kg
## 9 14 Francis White 33 Female TRUE 162.9728 62.5016983970221kg
## 10 15 Vernon White 32 Female TRUE 168.4921 73.3540772541499kg
## # ... with 75 more rows, and 7 more variables: Birth <fctr>, State <fctr>,
## # Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>
Feel free to experiment with different ways to do these
We now have all the tools in place in order to retrieve the overweight smokers from our dataset. An additional step might be to write the filtered data frame out to a file. Recall that functions write.table
, write.csv
are used to write a data frame to a tab-delimited or csv file. Because these functions accept a data frame as their first argument, they fit easily into the pipeline framework as the last step.
Modify the workflow to
arrange
verbA related verb, that works on the rows of the table, is called arrange
. You may have seen that we can use order
in base R to re-order tables. Again, arrange
is more intuitive.
Lets say we want to sort our patients by height;
arrange(patients_clean, Height)
## # A tibble: 100 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl>
## 1 69 Thomas White 24 Female FALSE 156.0006
## 2 44 Lawrence Hispanic 31 Female Yes 156.3782
## 3 61 Theo Asian 32 Female FALSE 156.4514
## 4 67 Jon White 0 Female FALSE 156.5484
## 5 18 Albert White 23 Female FALSE 157.0079
## 6 23 Jeremy White 0 Female FALSE 157.0379
## 7 72 Eddie Hispanic 25 Female No 157.5677
## 8 71 Cameron Black 30 Female FALSE 157.6721
## 9 70 Valentine Hispanic 34 Female FALSE 157.7601
## 10 34 Rory Asian 34 Female TRUE 158.0671
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
Can specify descending order using desc
arrange(patients_clean, desc(Height))
## # A tibble: 100 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl>
## 1 95 Joseph White 34 Male FALSE 191.5574
## 2 43 Bobbie White 34 Male Yes 184.9876
## 3 29 Clyde Hispanic 26 Male FALSE 184.4273
## 4 73 Brian Hispanic 0 Male FALSE 184.3635
## 5 28 Jame White 24 Male FALSE 184.3365
## 6 58 Walter White 34 Male FALSE 183.1878
## 7 40 Christopher White 28 Male FALSE 183.0974
## 8 100 Luis Hispanic 31 Male FALSE 182.1725
## 9 50 Carlos White 30 Male TRUE 181.2262
## 10 48 Jay White 0 Male FALSE 180.9550
## # ... with 90 more rows, and 8 more variables: Weight <fctr>,
## # Birth <fctr>, State <fctr>, Pet <fctr>, Grade_Level <int>, Died <lgl>,
## # Count <dbl>, Date <fctr>
Can use multiple variables in the sorting
e.g. Grade_Level
and then Count
arrange(patients_clean, Grade_Level, Count)
## # A tibble: 100 × 15
## ID Name Race Age Sex Smokes Height Weight
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl> <fctr>
## 1 9 Shirley White 22 Male FALSE 169.6623 77.2954550926939kg
## 2 19 Tommy Black 23 Male FALSE 167.5139 84.1505069519618kg
## 3 94 Ronnie White 26 Female FALSE 164.5555 68.4283083462713kg
## 4 33 Edward White 20 Male FALSE 171.9395 82.1098370845095kg
## 5 2 Rosario White 25 Male TRUE 174.7485 83.2683499179022kg
## 6 43 Bobbie White 34 Male Yes 184.9876 85.5216473586341kg
## 7 72 Eddie Hispanic 25 Female No 157.5677 68.2669628109541kg
## 8 8 Clair White 29 Female TRUE 161.8266 66.028536339608kg
## 9 26 Kris Hispanic 0 Male FALSE 176.9776 90.6467686495169kg
## 10 49 Richard White 26 Male TRUE 173.6746 80.0520716781511kg
## # ... with 90 more rows, and 7 more variables: Birth <fctr>, State <fctr>,
## # Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>
A really convenient function is top_n
top_n(patients_clean,10,Height)
## # A tibble: 10 × 15
## ID Name Race Age Sex Smokes Height
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl>
## 1 28 Jame White 24 Male FALSE 184.3365
## 2 29 Clyde Hispanic 26 Male FALSE 184.4273
## 3 40 Christopher White 28 Male FALSE 183.0974
## 4 43 Bobbie White 34 Male Yes 184.9876
## 5 48 Jay White 0 Male FALSE 180.9550
## 6 50 Carlos White 30 Male TRUE 181.2262
## 7 58 Walter White 34 Male FALSE 183.1878
## 8 73 Brian Hispanic 0 Male FALSE 184.3635
## 9 95 Joseph White 34 Male FALSE 191.5574
## 10 100 Luis Hispanic 31 Male FALSE 182.1725
## # ... with 8 more variables: Weight <fctr>, Birth <fctr>, State <fctr>,
## # Pet <fctr>, Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>
top_n(patients_clean,10,Weight)
## # A tibble: 10 × 15
## ID Name Race Age Sex Smokes Height Weight
## <int> <fctr> <fctr> <int> <fctr> <fctr> <dbl> <fctr>
## 1 5 Lavern Cat 35 Male No 178.6365 97.0540533225734kg
## 2 7 Samuel White 32 Male TRUE 170.9500 95.6581459256895kg
## 3 26 Kris Hispanic 0 Male FALSE 176.9776 90.6467686495169kg
## 4 28 Jame White 24 Male FALSE 184.3365 90.413290088493kg
## 5 53 Lindsay White 29 Male TRUE 177.9423 92.4209777707246kg
## 6 64 Carl Hispanic 21 Male FALSE 180.9077 97.3835285144635kg
## 7 73 Brian Hispanic 0 Male FALSE 184.3635 93.2462170418945kg
## 8 76 Jason White 33 Male FALSE 169.1347 92.756262807277kg
## 9 97 Anthony White 28 Male No 180.6568 91.9818422000367kg
## 10 98 Alva White 33 Male FALSE 178.4494 91.77367816887kg
## # ... with 7 more variables: Birth <fctr>, State <fctr>, Pet <fctr>,
## # Grade_Level <int>, Died <lgl>, Count <dbl>, Date <fctr>
With the workflow we have developed in the previous exercises, we can add extra steps to arrange and then select the candidates with the highest BMI prior to writing to a file.
read.delim("patient-data.txt") %>%
tbl_df %>%
mutate(Sex = factor(str_trim(Sex))) %>%
mutate(Height= as.numeric(str_replace_all(Height,pattern = "cm",""))) %>%
mutate(Weight = as.numeric(str_replace_all(Weight,"kg",""))) %>%
mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>%
mutate(Smokes = str_replace_all(Smokes, "Yes", TRUE)) %>%
mutate(Smokes = as.logical(str_replace_all(Smokes, "No", FALSE))) %>%
select(ID, Name, Birth,BMI,Smokes,Overweight,Died) %>%
filter(Smokes & Overweight & !Died) %>%
arrange(desc(BMI)) %>%
top_n(10, BMI)
## # A tibble: 9 × 7
## ID Name Birth BMI Smokes Overweight Died
## <int> <fctr> <fctr> <dbl> <lgl> <lgl> <lgl>
## 1 7 Samuel 1971-11-13 32.73286 TRUE TRUE FALSE
## 2 17 Robin 1971-10-23 28.11525 TRUE TRUE FALSE
## 3 45 Keith 1971-08-11 27.96600 TRUE TRUE FALSE
## 4 16 Lester 1972-08-15 27.80572 TRUE TRUE FALSE
## 5 44 Lawrence 1972-10-13 27.06717 TRUE TRUE FALSE
## 6 92 Jamie 1972-10-11 26.94385 TRUE TRUE FALSE
## 7 50 Carlos 1972-01-23 26.00887 TRUE TRUE FALSE
## 8 15 Vernon 1971-11-05 25.83838 TRUE TRUE FALSE
## 9 8 Clair 1972-03-14 25.21343 TRUE TRUE FALSE
write.csv("candidates.csv")
## "","x"
## "1","candidates.csv"
Imagine we have a second dataset that we want to process; cohort-data.txt
.
This time we have a million patients, and shouldn’t really be dealing with these data in Excel. However, once we have defined a pipeline on our smaller dataset, we can see how it scales to the new data
read.delim("cohort-data.txt") %>%
tbl_df %>%
mutate(Sex = factor(str_trim(Sex))) %>%
mutate(Weight = as.numeric(str_replace_all(Weight,"kg",""))) %>%
mutate(Height= as.numeric(str_replace_all(Height,pattern = "cm",""))) %>%
mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>%
mutate(Smokes = as.logical(str_replace_all(Smokes, "Yes", TRUE))) %>%
mutate(Smokes = as.logical(str_replace_all(Smokes, "No", FALSE))) %>%
filter(Smokes & Overweight) %>%
select(ID, Name, Birth, BMI, Smokes,Overweight) %>%
arrange(desc(BMI)) %>%
top_n(10, BMI) %>%
write.table("study-candidates.csv")
As the file is quite large, we might want to switch to readr
for smarter and faster reading
library(readr)
read_tsv("cohort-data.txt") %>%
tbl_df %>%
mutate(Sex = factor(str_trim(Sex))) %>%
mutate(Weight = as.numeric(str_replace_all(Weight,"kg",""))) %>%
mutate(Height= as.numeric(str_replace_all(Height,pattern = "cm",""))) %>%
mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>%
mutate(Smokes = str_replace_all(Smokes, "Yes", TRUE)) %>%
mutate(Smokes = as.logical(str_replace_all(Smokes, "No", FALSE))) %>%
filter(Smokes & Overweight) %>%
select(ID, Name, Birth, BMI, Smokes,Overweight) %>%
arrange(desc(BMI)) %>%
top_n(10, BMI) %>%
write.table("study-candidates.csv")
In the past two sections we have introduced the concept of pipeline and various verbs in dplyr that will help us to manipulate our data
We can use these in any order we like to create a workflow. When faced with a dataset, we should asked ourselves what steps we want to perform.
See here