Session 1 covered an introduction to R data types, inputting data, plotting and statistics.
R stores data in five main data types.
Data can be read into R as a table with the read.table() function and written to file with the write.table() function.
Table <- read.table("data/readThisTable.csv",sep=",",header=T,row.names=1)
Table[1:3,]
Sample_1.hi Sample_2.hi Sample_3.hi Sample_4.low Sample_5.low
Gene_a 4.570237 3.230467 3.351827 3.930877 4.098247
Gene_b 3.561733 3.632285 3.587523 4.185287 1.380976
Gene_c 3.797274 2.874462 4.016916 4.175772 1.988263
Sample_1.low
Gene_a 4.418726
Gene_b 5.936990
Gene_c 3.780917
write.table(Table,file="data/writeThisTable.csv", sep=",", row.names =F,col.names=T)
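A minimal round-trip sketch, assuming a temporary file rather than the course's data/ directory. The names demoTable, tmpFile and reRead, and the values in the table, are made up for illustration.

```r
# Hypothetical small table; gene and sample names are made up for illustration
demoTable <- data.frame(Sample_1 = c(4.5, 3.6), Sample_2 = c(3.2, 3.6),
                        row.names = c("Gene_a", "Gene_b"))

# Write to a temporary file, keeping the row names as the first column
tmpFile <- tempfile(fileext = ".csv")
write.table(demoTable, file = tmpFile, sep = ",", row.names = TRUE, col.names = NA)

# Read it back, using the first column as row names
reRead <- read.table(tmpFile, sep = ",", header = TRUE, row.names = 1)
reRead
```

Note that writing with row.names=F, as in the call above, drops the gene names from the file, so they cannot be recovered on reading.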
R has a rich set of statistical functions.
1- pnorm(8,mean=8,sd=3)
[1] 0.5
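The same upper-tail probability can probably be written more directly with the lower.tail argument of pnorm(). A short sketch, where the name upperTail is illustrative:

```r
# Probability of observing a value greater than 8 under N(mean = 8, sd = 3)
upperTail <- pnorm(8, mean = 8, sd = 3, lower.tail = FALSE)
upperTail

# Same quantity computed by subtracting the lower tail from 1
1 - pnorm(8, mean = 8, sd = 3)
```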
tTestExample <- read.table("data/tTestData.csv",sep=",",header=T)
Result <- t.test(tTestExample$A,tTestExample$B,alternative ="two.sided", var.equal = T)
Result
Two Sample t-test
data: tTestExample$A and tTestExample$B
t = -41.3528, df = 18, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-14.60253 -13.19051
sample estimates:
mean of x mean of y
26.50152 40.39804
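The object returned by t.test() is a list, so individual values can be pulled out for later use. The sketch below uses simulated data as a stand-in for tTestData.csv (the assumption being two numeric columns A and B); simData and simResult are illustrative names.

```r
set.seed(1)  # arbitrary seed so the simulated values are reproducible

# Simulated stand-in for tTestData.csv (assumption: two numeric columns A and B)
simData <- data.frame(A = rnorm(10, mean = 26, sd = 1),
                      B = rnorm(10, mean = 40, sd = 1))
simResult <- t.test(simData$A, simData$B, alternative = "two.sided", var.equal = TRUE)

names(simResult)      # components stored in the result
simResult$p.value     # p-value as a plain number
simResult$statistic   # t statistic
simResult$conf.int    # confidence interval for the difference in means
```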
We have looked at using logical vectors as a way to index other data types
x <- 1:10
x[x < 4]
[1] 1 2 3
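As a short reminder of how this works, the comparison itself produces a logical vector, which can be stored, combined with other conditions, or converted to positions. The name isSmall is illustrative.

```r
x <- 1:10
isSmall <- x < 4        # logical vector, one value per element of x
isSmall
x[isSmall]              # same result as x[x < 4]
x[x < 4 | x > 8]        # combine conditions with | (or) and & (and)
which(x < 4)            # positions where the condition is TRUE
```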
Logicals are also used in controlling how scripted procedures execute.
While I’m analysing data, if I need to execute complex statistical procedures on the data I will use R else I will use a calculator.
Conditional Branching is the evaluation of a logical to determine whether a chunk of code is executed.
In R, we use the if statement with the logical to be evaluated in () and dependent code to be executed in {}.
x <- TRUE
if(x){
message("x is true")
}
x is true
x <- FALSE
if(x){
message("x is true")
}
More often, we construct the logical value within () itself. This can be termed the condition.
x <- 10
y <- 4
if(x > y){
message("The value of x is ",x," is greater than ", y)
}
The value of x is 10 is greater than 4
Here the message is printed because x is greater than y.
y <- 20
if(x > y){
message("The value of x is ",x," is greater than ", y)
}
Here, x is no longer greater than y, so no message is printed.
We really still want a message telling us what was the result of the condition.
If we want to perform an operation when the condition is false we can follow the if() statement with an else statement.
x <- 10
if(x < 5){
message(x, " is less than 5")
}else{
message(x," is greater than or equal to 5")
}
10 is greater than or equal to 5
With the addition of the else statement, when x is not less than 5 the code following the else statement is executed.
x <- 3
if(x < 5){
message(x, " is less than 5")
}else{
message(x," is greater than or equal to 5")
}
3 is less than 5
We may wish to execute different procedures under multiple conditions. This can be controlled in R using the else if() following an initial if() statement.
x <- 5
if(x > 5){
message(x," is greater than 5")
}else if(x == 5){
message(x," is 5")
}else{
message(x, " is less than 5")
}
5 is 5
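A single condition can also combine several comparisons. A minimal sketch using && (and), which works on single logical values inside if(). The range 1 to 10 and the name rangeLabel are made up for illustration.

```r
x <- 5

# && and || combine single logical values inside if()
if (x >= 1 && x <= 10) {
  rangeLabel <- "inside"
} else {
  rangeLabel <- "outside"
}
message(x, " is ", rangeLabel, " the range 1 to 10")
```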
A useful function to evaluate conditional statements over vectors is the ifelse() function.
x <- 1:10
message(x)
The ifelse() function takes as arguments the condition to evaluate, the value to return where the condition is true and the value to return where the condition is false.
ifelse(x <= 3,"lessOrEqual","more")
[1] "lessOrEqual" "lessOrEqual" "lessOrEqual" "more" "more"
[6] "more" "more" "more" "more" "more"
This allows for multiple nested “else if” statements to be applied to vectors.
ifelse(x == 3,"same",
ifelse(x < 3,"less","more")
)
[1] "less" "less" "same" "more" "more" "more" "more" "more" "more" "more"
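The values returned by ifelse() do not have to be fixed strings; they can be taken from vectors of the same length as the condition. A small sketch replacing missing values with 0, where values and cleaned are illustrative names:

```r
values <- c(2.5, NA, 7.1, NA)   # made-up data with missing values

# Where the value is missing return 0, otherwise return the original value
cleaned <- ifelse(is.na(values), 0, values)
cleaned
```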
The two main generic methods of looping in R are while and for
while - while loops repeat the execution of code while a condition evaluates as true.
for - for loops repeat the execution of code for a range of specified values.
While loops are most useful if you know the condition will be satisfied but are not sure when (e.g. looking for the point at which a number first occurs in a list).
x <- 1
while(x != 3){
message("x is ",x," ")
x <- x+1
}
x is 1
x is 2
message("Finally x is 3")
Finally x is 3
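The case of looking for the first occurrence of a number can be sketched as below. The vector numbers and the value of target are made up. The extra check on length(numbers) is one way to guard against looping forever when the target is absent.

```r
numbers <- c(4, 8, 15, 16, 23, 42)  # made-up values
target <- 16
i <- 1

# Stop when the target is found or when we run off the end of the vector
while (i <= length(numbers) && numbers[i] != target) {
  i <- i + 1
}

if (i <= length(numbers)) {
  message(target, " first occurs at position ", i)
} else {
  message(target, " was not found")
}
```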
For loops allow the user to cycle through a range of values applying an operation for every value.
Here we cycle through a numeric vector and print out its value.
x <- 1:5
for(i in x){
message("Loop",i," ", appendLF = F)
}
Loop1 Loop2 Loop3 Loop4 Loop5
Similarly we can cycle through other vector types (or lists)
x <- toupper(letters[1:5])
for(i in x){
message("Loop",i," ", appendLF = F)
}
LoopA LoopB LoopC LoopD LoopE
We may wish to keep track of the position in x we are evaluating to retrieve the same index in other variables. A common practice is to loop through all possible index positions of x using the expression 1:length(x).
geneName <- c("Ikzf1","Myc","Igll1")
expression <- c(10.4,4.3,6.5)
1:length(geneName)
[1] 1 2 3
for(i in 1:length(geneName)){
message(geneName[i]," has an RPKM of ",expression[i])
}
Ikzf1 has an RPKM of 10.4
Myc has an RPKM of 4.3
Igll1 has an RPKM of 6.5
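One caveat with 1:length(x) is that it counts backwards when x is empty. The sketch below compares it with seq_along(), which may be a safer habit. The names emptyGenes and nIterations are illustrative.

```r
emptyGenes <- character(0)

1:length(emptyGenes)     # 1:0 counts down, so a loop over it would run twice
seq_along(emptyGenes)    # empty, so a loop over it does not run at all

nIterations <- 0
for (i in seq_along(emptyGenes)) {
  nIterations <- nIterations + 1
}
nIterations
```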
Loops can be combined with conditional statements to allow for complex control of their execution over R objects.
x <- 1:13
for(i in 1:13){
if(i > 10){
message("Number ",i," is greater than 10")
}else if(i == 10){
message("Number ",i," is 10")
}else{
message("Number ",i," is less than 10")
}
}
Number 1 is less than 10
Number 2 is less than 10
Number 3 is less than 10
Number 4 is less than 10
Number 5 is less than 10
Number 6 is less than 10
Number 7 is less than 10
Number 8 is less than 10
Number 9 is less than 10
Number 10 is 10
Number 11 is greater than 10
Number 12 is greater than 10
Number 13 is greater than 10
We can use conditionals with the break statement to exit a loop once a condition is satisfied, just like a while loop.
x <- 1:13
for(i in 1:13){
if(i < 10){
message("Number ",i," is less than 10")
}else if(i == 10){
message("Number ",i," is 10")
break
}else{
message("Number ",i," is greater than 10")
}
}
Number 1 is less than 10
Number 2 is less than 10
Number 3 is less than 10
Number 4 is less than 10
Number 5 is less than 10
Number 6 is less than 10
Number 7 is less than 10
Number 8 is less than 10
Number 9 is less than 10
Number 10 is 10
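Where break leaves the loop entirely, the related next statement skips only the current iteration. A minimal sketch, with evenNumbers as an illustrative name:

```r
evenNumbers <- c()
for (i in 1:6) {
  if (i %% 2 != 0) {
    next   # skip odd numbers and move on to the next value of i
  }
  evenNumbers <- c(evenNumbers, i)
  message("Number ", i, " is even")
}
evenNumbers
```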
There are functions which allow you to loop over a data type and apply a function to the subsection of that data.
apply - Apply function to rows or columns of a matrix/data frame and return results as a vector, matrix or list.
lapply - Apply function to every element of a vector or list and return results as a list.
sapply - Apply function to every element of a vector or list and return results as a vector, matrix or list.
The apply() function applies a function to the rows or columns of a matrix. The FUN argument specifies the function to apply and the MARGIN argument specifies whether to apply it by rows (1), by columns (2) or both.
apply(DATA,MARGIN,FUN,...)
matExample <- matrix(c(1:4),nrow=2,ncol=2,byrow=T)
matExample
[,1] [,2]
[1,] 1 2
[2,] 3 4
Get the mean of rows
apply(matExample,1,mean)
[1] 1.5 3.5
Get the mean of columns
apply(matExample,2,mean)
[1] 2 3
Additional arguments to be used by the function in the apply loop can be specified after the function argument.
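Arguments may be given in order as if passed to the function directly. For the paste() function, however, this isn't possible: unnamed arguments are treated as further strings to paste, so collapse must be given by name.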
apply(matExample,1,paste,collapse=";")
[1] "1;2" "3;4"
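The function passed to apply() can also be written on the spot as an anonymous function. A short sketch on the same 2x2 matrix, with rowRanges as an illustrative name:

```r
matExample <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)

# Spread (max - min) of each row using an anonymous function
rowRanges <- apply(matExample, 1, function(r) max(r) - min(r))
rowRanges

# A function returning several values per column gives a matrix,
# with one column of results per column of the input
colRanges <- apply(matExample, 2, range)
colRanges
```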
Similar to apply, lapply applies a function to every element of a vector or list.
lapply returns a list object containing the results of evaluating the function.
lapply(c(1,2),mean)
[[1]]
[1] 1
[[2]]
[1] 2
As with apply() additional arguments can be supplied after the function name argument.
lapply(list(1,NA,2),mean,na.rm=T)
[[1]]
[1] 1
[[2]]
[1] NaN
[[3]]
[1] 2
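lapply() keeps the names of a named list, which makes the results easier to retrieve. A sketch with made-up read counts, where readCounts and scaled are illustrative names:

```r
readCounts <- list(sampleA = c(10, 20, 30), sampleB = c(5, NA, 15))  # made-up values

# Scale each vector to proportions of its own total, ignoring missing values
scaled <- lapply(readCounts, function(v) v / sum(v, na.rm = TRUE))
scaled
scaled$sampleA
```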
sapply (simplifying apply) acts as lapply but attempts to simplify the results to the most appropriate data type.
Here sapply returns a vector where lapply would return lists.
exampleVector <- c(1,2,3,4,5)
exampleList <- list(1,2,3,4,5)
sapply(exampleVector,mean,na.rm=T)
[1] 1 2 3 4 5
sapply(exampleList,mean,na.rm=T)
[1] 1 2 3 4 5
In this example lapply returns a list of vectors from the quantile function.
exampleList <- list(row1=1:5, row2=6:10, row3=11:15)
exampleList
$row1
[1] 1 2 3 4 5
$row2
[1] 6 7 8 9 10
$row3
[1] 11 12 13 14 15
lapply(exampleList,quantile)
$row1
0% 25% 50% 75% 100%
1 2 3 4 5
$row2
0% 25% 50% 75% 100%
6 7 8 9 10
$row3
0% 25% 50% 75% 100%
11 12 13 14 15
Here is an example of sapply parsing a result from the quantile function in a smart way.
When a function always returns a vector of the same length, sapply will create a matrix with elements by column.
sapply(exampleList,quantile)
row1 row2 row3
0% 1 6 11
25% 2 7 12
50% 3 8 13
75% 4 9 14
100% 5 10 15
When sapply cannot parse the result to a vector or matrix, a list will be returned.
exampleList <- list(df=data.frame(sample=paste0("patient",1:2), data=c(1,12)), vec=c(1,3,4,5))
sapply(exampleList,summary)
$df
sample data
patient1:1 Min. : 1.00
patient2:1 1st Qu.: 3.75
Median : 6.50
Mean : 6.50
3rd Qu.: 9.25
Max. :12.00
$vec
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.50 3.50 3.25 4.25 5.00
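Because the type returned by sapply() depends on the data, scripts that need a predictable result may prefer vapply(), which takes a template for each result and stops with an error if a result does not match. A sketch assuming every result is a numeric vector of length 5; quantileList and quantileMatrix are illustrative names.

```r
quantileList <- list(row1 = 1:5, row2 = 6:10, row3 = 11:15)

# FUN.VALUE declares that every result must be a numeric vector of length 5
quantileMatrix <- vapply(quantileList, quantile, FUN.VALUE = numeric(5))
quantileMatrix
```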
Exercise on loops and conditional branching can be found here
Answers can be found here
As we have seen, a function is a command which requires one or more arguments and returns a single R object.
This allows the user to perform complex calculations and procedures with one simple operation.
x <- rnorm(100,70,10)
y <- jitter(x,amount=1)+20
mean(x)
[1] 70.40512
lmExample <- data.frame(X=x,Y=y)
lmResult <- lm(Y~X,data=lmExample)
plot(Y~X,data=lmExample,main="Line of best fit with lm()",
xlim=c(0,150),ylim=c(0,150))
abline(lmResult,col="red",lty=3,lwd=3)
Although we have access to many built-in functions in R, there will be many complex tasks we wish to perform regularly which are particular to our own work and for which no suitable function exists.
For these tasks we can construct our own functions with function()
Function_Name <- function(Arguments){
Result <- Arguments
return(Result)
}
To define a function with function() we need to decide
- the argument names within ()
- the expression to be evaluated within {}
- the variable to which the function will be assigned with <-
- the output from the function using return()
Function_name <- function(Argument1,Argument2){ Expression}
myFirstFunction <- function(myArgument1,myArgument2){
myResult <- (myArgument1*myArgument2)
return(myResult)
}
myFirstFunction(4,5)
[1] 20
In functions, a default value for an argument may be used. This allows the function to provide a value for an argument when the user does not specify one.
Default arguments can be specified by assigning a value to the argument with = operator
mySecondFunction <- function(myArgument1,myArgument2=10){
myResult <- (myArgument1*myArgument2)
return(myResult)
}
mySecondFunction(4,5)
[1] 20
mySecondFunction(4)
[1] 40
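Arguments can also be supplied by name, in which case their order does not matter. A short sketch reusing the definition of mySecondFunction() from above:

```r
mySecondFunction <- function(myArgument1, myArgument2 = 10) {
  myResult <- (myArgument1 * myArgument2)
  return(myResult)
}

mySecondFunction(myArgument2 = 5, myArgument1 = 4)  # named arguments, any order
mySecondFunction(4, myArgument2 = 2)                # mix of positional and named
```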
In some cases a function may wish to deal with missing arguments in a different way to setting a generic default for the argument. The missing() function can be used to evaluate whether an argument has been defined
mySecondFunction <- function(myArgument1,myArgument2){
if(missing(myArgument2)){
message("Value for myArgument2 not provided so will square myArgument1")
myResult <- myArgument1*myArgument1
}else{
myResult <- (myArgument1*myArgument2)
}
return(myResult)
}
mySecondFunction(4)
Value for myArgument2 not provided so will square myArgument1
[1] 16
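An alternative some people use in place of missing() is a NULL default tested with is.null(). The sketch below is one possible version, not the approach used in this course; the function name multiplyOrSquare is hypothetical.

```r
# Hypothetical function: squares its first argument when no second one is given
multiplyOrSquare <- function(myArgument1, myArgument2 = NULL) {
  if (is.null(myArgument2)) {
    myArgument2 <- myArgument1
  }
  return(myArgument1 * myArgument2)
}

multiplyOrSquare(4)
multiplyOrSquare(4, 5)
```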
We have seen that a function returns the value within the return() function. If no return is specified, the result of the last expression evaluated in the function is returned.
myforthFunction <- function(myArgument1,myArgument2=10){
myResult <- (myArgument1*myArgument2)
return(myResult)
print("I returned the result")
}
myfifthFunction <- function(myArgument1,myArgument2=10){
(myArgument1*myArgument2)
}
myforthFunction(4,5)
[1] 20
myfifthFunction(4,5)
[1] 20
Note that the print() statement after the return() is not evaluated in myforthFunction.
The return() function can only return one R object at a time. To return multiple data objects from one function call, a list can be used to contain other data objects.
mySixthFunction <- function(arg1,arg2){
result1 <- arg1*arg2
result2 <- date()
return(list(Calculation=result1,DateRun=result2))
}
result <- mySixthFunction(10,10)
result
$Calculation
[1] 100
$DateRun
[1] "Tue Feb 3 11:04:13 2015"
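The elements of the returned list can then be retrieved by name. A short sketch reusing mySixthFunction() as defined above:

```r
mySixthFunction <- function(arg1, arg2) {
  result1 <- arg1 * arg2
  result2 <- date()
  return(list(Calculation = result1, DateRun = result2))
}

result <- mySixthFunction(10, 10)
names(result)           # names of the list elements
result$Calculation      # access by name with $
result[["DateRun"]]     # or with [[ ]]
```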
When arguments or variables are created within a function, they only exist within that function and disappear once the function is complete.
mySeventhFunction <- function(arg1,arg2){
internalValue <- arg1*arg2
return(internalValue)
}
result <- mySeventhFunction(10,10)
internalValue
Error in eval(expr, envir, enclos): object 'internalValue' not found
arg1
Error in eval(expr, envir, enclos): object 'arg1' not found
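The same rule means that assigning to a name inside a function does not change a variable of the same name outside it. A minimal sketch, where counter and addOne are made-up names:

```r
counter <- 1

addOne <- function() {
  counter <- counter + 1   # creates a local variable; the outer counter is untouched
  return(counter)
}

addOne()    # returns the local value
counter     # the variable outside the function still holds its original value
```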
Exercise on functions can be found here
Answers can be found here
Once we have got our functions together and know how we want to analyse our data, we can save our analysis as a script. By convention R scripts typically end in .r or .R
To save a file in RStudio.
-> File -> Save as
To open a previous R script
->File -> Open File..
To save all the objects (workspace) with extension .RData
->Session -> Save workspace as
R scripts allow us to save and reuse custom functions we have written. To run the code from an R script we can use the source() function with the name of the R script as the argument.
The file dayOfWeek.r in the “scripts” directory contains a simple R script to tell you what day it is after your marathon R coding session.
#Contents of dayOfWeek.r
dayOfWeek <- function(){
return(gsub(" .*","",date()))
}
source("scripts/dayOfWeek.r")
dayOfWeek()
[1] "Tue"
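To try source() without the course's scripts directory, the same function can be written to a temporary file first. This is only a sketch; scriptFile is an illustrative name and the temporary file stands in for scripts/dayOfWeek.r.

```r
# Write the function definition to a temporary script file
scriptFile <- tempfile(fileext = ".R")
writeLines(c(
  "dayOfWeek <- function(){",
  "  return(gsub(\" .*\",\"\",date()))",
  "}"
), scriptFile)

# Run the script, which defines dayOfWeek() in the workspace
source(scriptFile)
dayOfWeek()
```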
R scripts can be run non-interactively from the command line with the Rscript command, usually with the option --vanilla to avoid saving or restoring workspaces. All messages/warnings/errors will be output to the console.
Rscript --vanilla myscript.r
An alternative to Rscript is R CMD BATCH. Here all messages/warnings/errors are directed to a file and the processing time appended.
R CMD BATCH myscript.r
To provide arguments to an R script at the command line we must add a call to the commandArgs() function to the script to parse the command line arguments.
args <- commandArgs(TRUE)
myFirstArgument <- args[1]
myFirstArgument
'10'
as.numeric(myFirstArgument)
10
Since vectors can only be one type, all command line arguments are returned as strings and must be converted to numeric if needed with as.numeric().
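A script may also want to cope with being run without any arguments. The sketch below shows one way to do this. It assumes a hypothetical script run as Rscript --vanilla myscript.r 10, and the fallback value of 1 is an arbitrary choice for illustration.

```r
# Contents of a hypothetical script, run as: Rscript --vanilla myscript.r 10
args <- commandArgs(trailingOnly = TRUE)

# Fall back to a default when no argument is supplied (the default of 1 is an assumption)
if (length(args) >= 1) {
  myFirstArgument <- as.numeric(args[1])
} else {
  myFirstArgument <- 1
}
message("Argument doubled: ", myFirstArgument * 2)
```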
Libraries can be loaded using the library() function with an argument of the name of the library
library(ggplot2)
You can see what libraries are available in the Packages panel or by the library() function with no arguments supplied
library()
Libraries can be installed through the RStudio menu
-> Tools -> Install packages ..
Or by using the install.packages() command
install.packages("Hmisc")
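Before installing, it can be useful to check whether a library is already available. A small sketch using requireNamespace(), which returns TRUE or FALSE without attaching the library; the name hmiscAvailable is illustrative.

```r
# "Hmisc" is the library named above; any other library name could be used
hmiscAvailable <- requireNamespace("Hmisc", quietly = TRUE)

if (hmiscAvailable) {
  message("Hmisc is installed")
} else {
  message("Hmisc is not installed; install.packages(\"Hmisc\") would install it")
}
```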
- Use vectorisation
- Keep 2D numeric data in matrices
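As a rough illustration of these two points, the sketch below computes the same squares with a loop and with a single vectorised operation, then uses rowMeans() on a matrix. All names and values are made up.

```r
values <- 1:5

# Loop version: fill the result one element at a time
squaredLoop <- numeric(length(values))
for (i in seq_along(values)) {
  squaredLoop[i] <- values[i]^2
}

# Vectorised version: one operation on the whole vector
squaredVec <- values^2

# Matrix functions such as rowMeans() work on whole 2D numeric objects
mat <- matrix(1:6, nrow = 2)
matRowMeans <- rowMeans(mat)
matRowMeans
```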