MRC Clinical Sciences Centre
http://mrccsc.github.io/r_course/introToR_Session1.html
R is a scripting language and environment for statistical computing.
Developed by Robert Gentleman and Ross Ihaka.
Inheriting much from S (Bell labs).
R comes with excellent “out-of-the-box” statistical and plotting capabilities.
R provides access to 1000s of packages (CRAN/MRAN/R-forge) which extend the basic functionality of R while maintaining high quality documentation.
In particular, Robert Gentleman developed the Bioconductor project where 100's of packages are directly related to computational biology and analysis of associated high-throughput experiments.
Freely available from R-project website.
RStudio provides an integrated development environment (IDE) which is freely available from RStudio site
We will be using RStudio and R already installed on your machines.
Four main panels
Let's load RStudio and take a look
At its most basic, R can be used as a simple calculator.
> 3+1
[1] 4
> 2*2
[1] 4
> sqrt(25)-1
[1] 4
The sqrt(25) demostrates the use of functions in R. A function performs a complex operation on it's arguments and returns the result.
In R, arguments are provided to a function within the parenthesis – ( ) – that follows the function name. So sqrt(ARGUMENT) will provide the square root of the value of ARGUMENT.
Other examples of functions include min(), sum(), max().
Note multiple arguments are separated by a comma.
min(2,4,6)
[1] 2
sum(2,4,6)
[1] 12
max(2,4,6)
[1] 6
R has many useful functions “built in” and ready to use as soon as R is loaded.
An incomplete, illustrative list can be seen here
In addition to R standard functions, additional functionality can be loaded into R using libraries. These include specialised tools for areas such as sequence alignment, read counting etc.
If you need to see how a function works try ? in front of the function name.
?sqrt
Lets run ?sqrt in RStudio and look at the help.
With functions such as min() and sqrt(), the arguments to be provided are obvious and the order of these arguments doesnt matter.
min(5,4,6)
[1] 4
min(6,4,5)
[1] 4
Many functions however have an order to their arguments. Try and look at the arguments for the dir() function.
?dir
Often we know the names of arguments but not necessarily their order. In cases where we want to be sure we specify the right argument, we provide names for the arguments used.
dir()
dir(full.names=T)
This also means we don't have to copy out all the defaults for arguments preceeding it.
dir(full.names=T)
# Is equivalent to...
dir(".",NULL,FALSE,T)
As with other programming languages and even graphical calculators, R makes use of variables.
A variable stores a value as a letter or word.
In R, we make use of the assignment operator <-
x <- 10
Now x holds the value of 10
x
[1] 10
x
[1] 10
Variables can be altered in place
x <- 20
x
[1] 20
Variables can be used just as the values they contain.
x + sqrt(25)
[1] 25
Variables can be used to create new variables
y <- x + sqrt(25)
y
[1] 25
In R the most basic variable or data type is a vector. A vector is an ordered collection of values. The x and y variables we have previously assigned are examples of a vector of length 1.
x
[1] 20
length(x)
[1] 1
To create a multiple value vector we use the function c() to combine the supplied arguments into one vector.
x <- c(1,2,3,4,5,6,7,8,9,10)
x
[1] 1 2 3 4 5 6 7 8 9 10
length(x)
[1] 10
Vectors of continuous stretches of values can be created by the shortcut - :
y <- 6:10
y
[1] 6 7 8 9 10
Other useful function to create stretchs of numeric vectors are seq() and rep(). The seq() function creates a sequence of numeric values from a specified start and end value, incrementing by a user defined amount. The rep() function repeats a variable a user-defined number of times.
seq(from=1,to=5,by=2)
[1] 1 3 5
rep(c(1,5,10),3)
[1] 1 5 10 1 5 10 1 5 10
Square brackets [] identify the position within a vector (the index). These indices can be used to extract relevant values from vectors.
x
[1] 1 2 3 4 5 6 7 8 9 10
x[1]
[1] 1
x[8]
[1] 8
Indices can be used to extract values from multiple positions within a vector.
x[c(1,6)]
[1] 1 6
Negative indices can be used to extract all positions except that specified
x[-5]
[1] 1 2 3 4 6 7 8 9 10
We can use indices to modify a specific position in vector
x
[1] 1 2 3 4 5 6 7 8 9 10
x[5] <- -5
x
[1] 1 2 3 4 -5 6 7 8 9 10
Indices can be specified using other vectors.
y
[1] 6 7 8 9 10
x[y] <- 0
x
[1] 1 2 3 4 -5 0 0 0 0 0
Square brackets [] for indexing
x[1]
[1] 1
Parentheses () for function argments.
sqrt(4)
[1] 2
Vectors in R can be used in arithmetic operations as seen with variables earlier. When a standard arithmetic operation is applied to vector, the operation is applied to each position in a vector.
x <- c(1,2,3,4,5,6,7,8,9,10)
x
[1] 1 2 3 4 5 6 7 8 9 10
y <- x*2
y
[1] 2 4 6 8 10 12 14 16 18 20
Multiple vectors can be used within arithmetic operations.
x+y
[1] 3 6 9 12 15 18 21 24 27 30
When applying an arithmetic operation between two vectors of unequal length, the shorter will be recycled.
x+c(1,2)
[1] 2 4 4 6 6 8 8 10 10 12
x+c(1,2,3)
[1] 2 4 6 5 7 9 8 10 12 11
So far we have only looked at numeric vectors.
In R we can also create character vectors again using c() function. These vectors can be indexed just the same.
y <- c("ICTEM","CommonWealth","Wolfson")
y[2]
[1] "CommonWealth"
Character vectors can be used to assign names to other vectors.
x <- c(1:3)
names(x) <- y
x
ICTEM CommonWealth Wolfson
1 2 3
These named vectors maybe indexed by a position's “name”.
x[c("ICTEM","Wolfson")]
ICTEM Wolfson
1 3
Index names missing from vectors will return special value “NA”
x[c("Strand")]
<NA>
NA
In R, like many languages, when a value in a variable is missing, the value is assigned a NA value.
Similarly, when a calculation can not be perfomed, R will input a NaN value.
NA values allow for R to handle missing data correctly but requires different handling than standard numeric or character values. We will discuss handling NA values later.
The unique() function can be used to retrieve all unique values from a vector.
geneList <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene1","Gene3")
unique(geneList)
[1] "Gene1" "Gene2" "Gene3" "Gene4" "Gene5"
A common task in R is to subset one vector by the values in another vector.
The %in% operator can be used to subset the values within one vector by those which are present in a second vector.
geneList <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene1","Gene3")
secondGeneList <- c("Gene5","Gene3")
geneList[geneList %in% secondGeneList]
[1] "Gene3" "Gene5" "Gene3"
Logical vectors are a class of vector made up of TRUE/T or FALSE/F boolean values.
z <- c(T,F,T,F,T,F,T,F,T,F)
# z <- c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
z
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
Logical vectors can be used like an index to specify postions in a vector. TRUE values will return the corresponding position in the vector being indexed.
x <- 1:10
x[z]
[1] 1 3 5 7 9
Other vectors may be evaluated to produce logical vectors. This can be very useful when using a logical to index.
Common examples are:
x > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
x[x > 5]
[1] 6 7 8 9 10
Logical vectors can be used in combination in order to index vectors. To combine logical vectors we can use some common R operators.
x <- 1:10
!x > 4
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x > 4 & x < 7
[1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
x > 4 | x < 7
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Such combinations can allow for complex selection of a vector's values.
x <- 1:10
x[x > 4 & x < 7]
[1] 5 6
x[x > 4 & !x < 7]
[1] 7 8 9 10
Exercise on vectors can be found here
Answers can be found here here
In programs such as Excel, we are used to tables.
All progamming languages have a concept of a table. In R, the most basic table type is a matrix.
A matrix can be created using the matrix() function with the arguments of nrow and ncol specifying the number of rows and columns respectively.
narrowMatrix <- matrix(1:10, nrow=5, ncol=2)
narrowMatrix
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
wideMatrix <- matrix(1:10, nrow=2, ncol=5)
wideMatrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
To find dimensions of a matrix, the dim() function will provide dimensions as row number and column number while nrow() and ncol() will return just row number and column number respectively.
dim(narrowMatrix)
[1] 5 2
nrow(narrowMatrix)
[1] 5
ncol(narrowMatrix)
[1] 2
A matrix can be created from multiple vectors or other matrices.
cbind() can be used to attach data to a matrix as columns.
x <- 1:10
y <- 11:20
z <- 21:22
newMatrix <- cbind(x,y)
newMatrix
x y
[1,] 1 11
[2,] 2 12
[3,] 3 13
[4,] 4 14
[5,] 5 15
[6,] 6 16
[7,] 7 17
[8,] 8 18
[9,] 9 19
[10,] 10 20
rbind() functions to bind to a matrix as rows.
newerMatrix <- rbind(newMatrix,z)
newerMatrix
x y
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
z 21 22
When creating a matrix using cbind() or matrix() from incompatable vectors then the shorter vector is recycled.
recycledMatrix2 <- matrix(1:5,ncol=2,nrow=3)
recycledMatrix2
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 1
For rbind() function, the longer vector is clipped.
recycledMatrix3 <- rbind(recycledMatrix2,c(1:5))
recycledMatrix3
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 1
[4,] 1 2
As with vectors, matrices can be named. For matrices the naming is done by columns and rows using colnames() and rownames() functions.
namedMatrix <- matrix(1:10,ncol=5,nrow=2)
colnames(namedMatrix) <- paste("Column",1:5,sep="_")
rownames(namedMatrix) <- paste("Row",1:2,sep="_")
namedMatrix
Column_1 Column_2 Column_3 Column_4 Column_5
Row_1 1 3 5 7 9
Row_2 2 4 6 8 10
Information on matrix names can also be retreived using the same functions.
colnames(namedMatrix)
[1] "Column_1" "Column_2" "Column_3" "Column_4" "Column_5"
rownames(namedMatrix)
[1] "Row_1" "Row_2"
Selecting and replacing portions of a matrix can be done by indexing using square brackets [] much like for vectors.
When indexing matrices, two values may be provided within the square brackets separated by a comma to retrieve information on a matrix position.
The first value(s) corresponds to row(s) and the second to column(s).
narrowMatrix
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
Value of first column, second row
narrowMatrix[2,1]
[1] 2
Similarly, whole rows or columns can be extracted. Single rows and columns will return a vector. When multiple columns or row indices are specified, a matrix is returned.
Values of second column (row index is empty!)
narrowMatrix[,2]
[1] 6 7 8 9 10
Values of third row (column index is empty!)
narrowMatrix[3,]
[1] 3 8
Values of second and third row (column index is empty!)
narrowMatrix[c(2,3),]
[,1] [,2]
[1,] 2 7
[2,] 3 8
As with vectors, names can be used for indexing when present
colnames(narrowMatrix) <- paste("Column",1:2,sep="_")
rownames(narrowMatrix) <- paste("Row",1:5,sep="_")
narrowMatrix[,"Column_1"]
Row_1 Row_2 Row_3 Row_4 Row_5
1 2 3 4 5
narrowMatrix["Row_1",]
Column_1 Column_2
1 6
narrowMatrix[,"Column_1"]
Row_1 Row_2 Row_3 Row_4 Row_5
1 2 3 4 5
narrowMatrix["Row_1","Column_1"]
[1] 1
As with vectors, matrices can be subset by logical vectors
narrowMatrix
Column_1 Column_2
Row_1 1 6
Row_2 2 7
Row_3 3 8
Row_4 4 9
Row_5 5 10
narrowMatrix[,1]
Row_1 Row_2 Row_3 Row_4 Row_5
1 2 3 4 5
narrowMatrix[,1] < 5
Row_1 Row_2 Row_3 Row_4 Row_5
TRUE TRUE TRUE TRUE FALSE
narrowMatrix[narrowMatrix[,1] < 5,]
Column_1 Column_2
Row_1 1 6
Row_2 2 7
Row_3 3 8
Row_4 4 9
As with vectors, matrices can have arithmetic operations applied to cells,rows, columns or the whole matrix
narrowMatrix
Column_1 Column_2
Row_1 1 6
Row_2 2 7
Row_3 3 8
Row_4 4 9
Row_5 5 10
narrowMatrix[1,1]+2
[1] 3
narrowMatrix[1,]+2
Column_1 Column_2
3 8
mean(narrowMatrix)
[1] 5.5
As with vectors, matrices can have their elements replaced
narrowMatrix
Column_1 Column_2
Row_1 1 6
Row_2 2 7
Row_3 3 8
Row_4 4 9
Row_5 5 10
narrowMatrix[1,1] <- 10
narrowMatrix[,2] <- 1
narrowMatrix
Column_1 Column_2
Row_1 10 1
Row_2 2 1
Row_3 3 1
Row_4 4 1
Row_5 5 1
Matrices must be all one type (i.e. numeric or character).
Here replacing one value with character will turn numeric matrix to character matrix.
narrowMatrix[,2] *2
Row_1 Row_2 Row_3 Row_4 Row_5
2 2 2 2 2
narrowMatrix[1,1] <- "Not_A_Number"
narrowMatrix
Column_1 Column_2
Row_1 "Not_A_Number" "1"
Row_2 "2" "1"
Row_3 "3" "1"
Row_4 "4" "1"
Row_5 "5" "1"
narrowMatrix[,2] *2
Error in narrowMatrix[, 2] * 2: non-numeric argument to binary operator
Exercise on matrices can be found here
Answers can be found here here
A special case of a vector is a factor.
Factors are used to store data which may be grouped in categories (categorical data). Specifying data as categorical allows R to properly handle the data and make use of functions specific to categorical data.
To create a factor from a vector we use the factor() function. Note that the factor now has an additional component called “levels” which identifies all categories within the vector.
vectorExample <- c("male","female","female","female")
factorExample <- factor(vectorExample)
factorExample
[1] male female female female
Levels: female male
levels(factorExample)
[1] "female" "male"
An example of the use of levels can be seen from applying the summary() function to the vector and factor examples
summary(vectorExample)
Length Class Mode
4 character character
summary(factorExample)
female male
3 1
In our factor example, the levels have been displayed in an alphabetical order. To adjust the display order of levels in a factor, we can supply the desired display order to levels argument in the factor() function call.
factorExample <- factor(vectorExample,levels=c("male","female"))
factorExample
[1] male female female female
Levels: male female
summary(factorExample)
male female
1 3
In some cases there is no natural order to the categories such that one category is greater than the other (nominal data). In this case we can see that R is gender neutral.
factorExample <- factor(vectorExample,levels=c("male","female"))
factorExample[1] < factorExample[2]
[1] NA
In other cases there will be a natural ordering to the categories (ordinal data). A factor can be specified to be ordered using the ordered argument in combination with specified levels argument.
factorExample <- factor(c("small","big","big","small"),ordered=TRUE,levels=c("small","big"))
factorExample
[1] small big big small
Levels: small < big
factorExample[1] < factorExample[2]
[1] TRUE
Unlike vectors, replacing elements within a factor isn't so easy. While replacing one element with an established level is possible, replacing with a novel element will result in a warning.
factorExample <- factor(c("small","big","big","small"))
factorExample[1] <- c("big")
factorExample
[1] big big big small
Levels: big small
factorExample[1] <- c("huge")
factorExample
[1] <NA> big big small
Levels: big small
To add a new level we can use the levels argument.
levels(factorExample) <- c("big","small","huge")
factorExample[1] <- c("huge")
factorExample
[1] huge big big small
Levels: big small huge
We saw that with matrices you can only have one type of data. We tried to create a matrix with a character element and the entire matrix became a character.
In practice, we would want to have a table which is a mixture of types (e.g a table with sample names (character), sample type (factor) and survival time (numeric))
In R, we make use of the data frame object which allows us to store tables with columns of different data types. To create a data frame we can simply use the data.frame() function.
patientName <- c("patient1","patient2","patient3","patient4")
patientType <- factor(rep(c("male","female"),2))
survivalTime <- c(1,30,2,20)
dfExample <- data.frame(Name=patientName, Type=patientType,Survival_Time=survivalTime)
dfExample
Name Type Survival_Time
1 patient1 male 1
2 patient2 female 30
3 patient3 male 2
4 patient4 female 20
Data frames may be indexed just as matrices.
dfExample
Name Type Survival_Time
1 patient1 male 1
2 patient2 female 30
3 patient3 male 2
4 patient4 female 20
dfExample[dfExample[,"Survival_Time"] > 10,]
Name Type Survival_Time
2 patient2 female 30
4 patient4 female 20
Unlike matrices, it is possible to index a column by using the $ symbol.
dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)
dfExample$Survival_Time
[1] 1 30 2 20
dfExample[dfExample$Survival_Time < 10,]
Name Type Survival_Time
1 patient1 male 1
3 patient3 male 2
Using the $ allows for R to autocomplete your selection and so can speed up coding.
dfExample$Surv
[1] 1 30 2 20
But this will not work..
dfExample[,"Surv"]
The $ operator also allows for the creation of new columns for a data frame on the fly.
dfExample
Name Type Survival_Time
1 patient1 male 1
2 patient2 female 30
3 patient3 male 2
4 patient4 female 20
dfExample$newColumn <- rep("newData",nrow(dfExample))
dfExample
Name Type Survival_Time newColumn
1 patient1 male 1 newData
2 patient2 female 30 newData
3 patient3 male 2 newData
4 patient4 female 20 newData
Certain columns can not be replaced in data frames. Numeric columns may have their values replaced but columns with character values may not by default. This occurs because character vectors are treated as factors by default.
dfExample[dfExample[,"Survival_Time"] < 10,"Survival_Time"] <- 0
dfExample
Name Type Survival_Time newColumn
1 patient1 male 0 newData
2 patient2 female 30 newData
3 patient3 male 0 newData
4 patient4 female 20 newData
dfExample[dfExample[,"Survival_Time"] < 10,"Name"] <- "patientX"
dfExample
Name Type Survival_Time newColumn
1 <NA> male 0 newData
2 patient2 female 30 newData
3 <NA> male 0 newData
4 patient4 female 20 newData
It is possible to update factors in data frames just as with standard factors.
dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)
levels(dfExample[,"Name"]) <- c(levels(dfExample[,"Name"]) , "patientX")
dfExample[dfExample[,"Survival_Time"] < 10,"Name"] <- "patientX"
dfExample
Name Type Survival_Time
1 patientX male 1
2 patient2 female 30
3 patientX male 2
4 patient4 female 20
If you wish to avoid using factors in data frames then the stringsAsFactors argument to data.frame() function should be set to FALSE
dfExample <- data.frame(Name=patientName,
Type=patientType,
Survival_Time=survivalTime,
stringsAsFactors = F)
levels(dfExample[,"Name"]) <- c(levels(dfExample[,"Name"]) , "patientX")
dfExample[dfExample[,"Survival_Time"] < 10,"Name"] <- "patientX"
dfExample
Name Type Survival_Time
1 patientX male 1
2 patient2 female 30
3 patientX male 2
4 patient4 female 20
A useful function in R is order()
For numeric vectors, order() by default returns the indices of a vector in that vector's increasing order. This behaviour can be altered by using the “decreasing” argument passed to order.
testOrder <- c(2,1,3)
testOrder
[1] 2 1 3
testOrder[order(testOrder)]
[1] 1 2 3
testOrder[order(testOrder,decreasing=T)]
[1] 3 2 1
When a vector contains NA values, these NA values will, by default, be placed last in ordering indices. This can be controlled by na.last argument.
testOrder <- c(2,1,NA,3)
testOrder[order(testOrder,decreasing=T,na.last=T)]
[1] 3 2 1 NA
testOrder[order(testOrder,decreasing=T,na.last=F)]
[1] NA 3 2 1
Since the order argument returns an index of intended order for a vector, we can use the order() function to order data frames by certain columns
dfExample
Name Type Survival_Time
1 patientX male 1
2 patient2 female 30
3 patientX male 2
4 patient4 female 20
dfExample[order(dfExample$Surv, decreasing=T),]
Name Type Survival_Time
2 patient2 female 30
4 patient4 female 20
3 patientX male 2
1 patientX male 1
We can also use order to arrange multiple columns in a data frame by providing multiple vectors to order() function. Ordering will be performed in order of arguments.
dfExample[order(dfExample$Type,
dfExample$Survival,
decreasing=T),]
Name Type Survival_Time
3 patientX male 2
1 patientX male 1
2 patient2 female 30
4 patient4 female 20
A common operation is to join two data frames by a column of common values.
dfExample <- data.frame(Name=patientName,
Type=patientType,
Survival_Time=survivalTime)
dfExample
Name Type Survival_Time
1 patient1 male 1
2 patient2 female 30
3 patient3 male 2
4 patient4 female 20
dfExample2 <- data.frame(Name=patientName[1:3],
height=c(6.1,5.1,5.5))
dfExample2
Name height
1 patient1 6.1
2 patient2 5.1
3 patient3 5.5
To do this we can use the merge() function with the data frames as the first two arguments. We can then specify the columns to merge by with the by argument. To keep only data pertaining to values common to both data frames the all argument is set to TRUE.
mergedDF <- merge(dfExample,dfExample2,by=1,all=F)
mergedDF
Name Type Survival_Time height
1 patient1 male 1 6.1
2 patient2 female 30 5.1
3 patient3 male 2 5.5
Exercise on data frames and factors can be found here
Answers can be found here here
Lists are the final data-type we will look at.
In R, lists provide a general container which may hold any data types of unequal lengths as part of its elements. To create a list we can simply use the list() function with arguments specifying the data we wish to include in the list.
firstElement <- c(1,2,3,4)
secondElement <- matrix(1:10,nrow=2,ncol=5)
thirdElement <- data.frame(colOne=c(1,2,4,5),colTwo=c("One","Two","Three","Four"))
myList <- list(firstElement,secondElement,thirdElement)
myList
[[1]]
[1] 1 2 3 4
[[2]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
[[3]]
colOne colTwo
1 1 One
2 2 Two
3 4 Three
4 5 Four
Just as with vectors, list elements can be assigned names.
myNamedList <- list(First=firstElement,Second=secondElement,Third=thirdElement)
myNamedList
$First
[1] 1 2 3 4
$Second
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
$Third
colOne colTwo
1 1 One
2 2 Two
3 4 Three
4 5 Four
List, as with other data types in R can be indexed. In contrast to other types, using [] on a list will subset the list to another list of selected indices. To retrieve an element from a list in R , two square brackets [[]] must be used.
myList <- list(firstElement,secondElement,thirdElement)
myList[1]
[[1]]
[1] 1 2 3 4
myList[[1]]
[1] 1 2 3 4
As with data.frames, the $ sign may be used to extract named elements from a list
myNamedList$First
[1] 1 2 3 4
Again, similar to vectors, lists can be joined together in R using the c() function
myNamedList <- list(First=firstElement,Second=secondElement,Third=thirdElement)
myNamedList <- c(myNamedList,list(fourth=c(4,4)))
myNamedList[c(1,4)]
$First
[1] 1 2 3 4
$fourth
[1] 4 4
Note that on last slide we are joining two lists. If we joined a vector to a list, all elements of the vector would become list elements.
myList <- c(myList,c(4,4))
myList
[[1]]
[1] 1 2 3 4
[[2]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
[[3]]
colOne colTwo
1 1 One
2 2 Two
3 4 Three
4 5 Four
[[4]]
[1] 4
[[5]]
[1] 4
Sometimes you will wish to “flatten” out a list. When a list contains compatable objects, i.e. list of all one type, the unlist() function can be used. Note the maintenance of names with their additional sufficies.
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
myNamedList
$First
[1] 1 2 3
$Second
[1] 2 6 7
$Third
[1] 1 4 7
flatList <- unlist(myNamedList)
flatList[1:7]
First1 First2 First3 Second1 Second2 Second3 Third1
1 2 3 2 6 7 1
A common step is to turn a list of standard results into matrix. This can be done in a few steps in R.
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
flatList <- unlist(myNamedList)
listAsMat <- matrix(flatList,
nrow=length(myNamedList),
ncol=3,
byrow=T,
dimnames=list(names(myNamedList)))
listAsMat
[,1] [,2] [,3]
First 1 2 3
Second 2 6 7
Third 1 4 7
Most of the time, you will not be generating data in R but will be importing data from external files.
A standard format for this data is a table
Gene_Name | Sample_1.hi | Sample_2.hi | Sample_3.hi | Sample_4.low | Sample_5.low | Sample_1.low |
---|---|---|---|---|---|---|
Gene_a | 2.181518 | 2.963433 | 3.618821 | 3.703849 | 3.621196 | 3.466088 |
Gene_b | 3.953413 | 5.003427 | 3.128916 | 3.505461 | 3.953361 | 4.261023 |
Gene_c | 3.831760 | 1.840931 | 3.259247 | 2.898734 | 3.822889 | 4.384299 |
Gene_d | 4.272112 | 3.795516 | 5.858744 | 6.944077 | 7.550353 | 9.242483 |
Gene_e | 9.914889 | 10.758701 | 10.140730 | 3.633438 | 1.285404 | 2.734017 |
Gene_f | 10.071533 | 10.252845 | 10.517175 | 5.378654 | 3.831697 | 1.848423 |
Gene_g | 11.316427 | 9.815200 | 9.354936 | 8.879610 | 9.591012 | 11.768586 |
Gene_h | 8.350820 | 9.495415 | 10.523986 | 12.283628 | 11.289700 | 10.012964 |
Tables from text files can be read with read.table() function
Table <- read.table("data/readThisTable.csv",sep=",",header=T)
Table[1:4,1:3]
Gene_Name Sample_1.hi Sample_2.hi
1 Gene_a 4.570237 3.230467
2 Gene_b 3.561733 3.632285
3 Gene_c 3.797274 2.874462
4 Gene_d 3.398242 4.415202
Here we have provided two arguments.
read.table() allows for significant control over reading files through its many arguments. Have a look at options by using ?read.table
The row.names argument can be used to specify a column to use as row names for the resulting data frame. Here we use the first column as row names.
Table <- read.table("data/readThisTable.csv",sep=",",header=T,row.names=1)
Table[1:4,1:3]
Sample_1.hi Sample_2.hi Sample_3.hi
Gene_a 4.570237 3.230467 3.351827
Gene_b 3.561733 3.632285 3.587523
Gene_c 3.797274 2.874462 4.016916
Gene_d 3.398242 4.415202 4.893561
As mentioned, data which is read into R through read.table() will be of data frame class.
To avoid character columns being converted into factors, we can specify the stringsAsFactors argument here.
Table <- read.table("data/readThisTable.csv",sep=",",header=T,stringsAsFactors=F)
Other very useful functions for read table include:
The read.table function can also read data from http.
URL <- "http://mrccsc.github.io/readThisTable.csv"
Table <- read.table(URL,sep=",",header=T)
Table[1:2,1:3]
Gene_Name Sample_1.hi Sample_2.hi
1 Gene_a 4.111851 3.837018
2 Gene_b 6.047822 5.683518
And the clipboard.(This is Windows version)
Table <- read.table(file="clipboard",sep=",",header=T)
read.table() function will by default read every row and column of a file.
The scan() function allows for the selection of particular columns to be read into R and so can save memory when files are large.
x <- scan("data/readThisTable.csv",sep=",",
what = c(list(""),rep(list(NULL), 6)),skip=1)
x[1:3]
[[1]]
[1] "Gene_a" "Gene_b" "Gene_c" "Gene_d" "Gene_e" "Gene_f" "Gene_g" "Gene_h"
[[2]]
NULL
[[3]]
NULL
Once we have our data analysed in R, we will want to export it to a file.
The most common method is to use the write.table() function
write.table(Table,file="data/writeThisTable.csv",sep=",")
Since our data has column names but no row names, I will provide the arguments col.names and row.names to write.table()
write.table(Table,file="data/writeThisTable.csv", sep=",", row.names =F,col.names=T)
Exercise on reading and writing data can be found here
Answers can be found here here
Introduction
Base graphics
Saving your plot
Lattice R package
ggplot2 R package
R has excellent graphics and plotting capabilities. They are mostly found in following three sources.
Lattice and ggplot2 packages are built on grid graphics package while the base graphics routines adopt a pen and paper model for plotting.
First we'll produce a very simple graph using the values in the treatment vector:
treatment <- c(0.02,1.8, 17.5, 55,75.7, 80)
Plot the treatment vector with default parameters
plot(treatment)
Now, let's add a title, a line to connect the points, and some color:
Plot treatment using blue points overlayed by a line
plot(treatment, type="o", col="blue")
Create a title with a red, bold/italic font
title(main="Treatment", col.main="red", font.main=4)
Now let's add a red line for a control vector and specify the y-axis range directly so it will be large enough to fit the data:
Define control vector
control <- c(0, 20, 40, 60, 80,100)
Plot treatment using a y axis that ranges from 0 to 100
plot(treatment, type="o", col="blue", ylim=c(0,100))
Plot control with red dashed line and square points
lines(control, type="o", pch=22, lty=2, col="red")
Create a title with a red, bold/italic font
title(main="Expression Data", col.main="red", font.main=4)
Next let's change the axes labels to match our data and add a legend. We'll also compute the y-axis values using the max function so any changes to our data will be automatically reflected in our graph.
Calculate range from 0 to max value of data
g_range <- range(0, treatment, control)
range returns a vector containing the minimum and maximum of all the given arguments.
Plot treatment using y axis that ranges from 0 to max value in treatment or control vector. Turn off axes and annotations (axis labels) so we can specify them ourselves.
plot(treatment, type="o", col="blue", ylim=g_range,axes=FALSE, ann=FALSE)
Make x axis using labels
axis(1, at=1:6, lab=c("Mon","Tue","Wed","Thu","Fri","Sat"))
Make y axis with horizontal labels that display ticks at every 20 marks.
axis(2, las=1, at=20*0:g_range[2])
Create box around plot
box()
Plot control vector with red dashed line and square points
lines(control, type="o", pch=22, lty=2, col="red")
Create a title with a red, bold/italic font
title(main="Data", col.main="red", font.main=4)
Label the x and y axes with dark green text
title(xlab="Days", col.lab=rgb(0,0.5,0))
title(ylab="Values", col.lab=rgb(0,0.5,0))
Create a legend at (1, g_range[2]) that is slightly smaller (cex) and uses the same line colors and points used by the actual plots
legend(1, g_range[2], c("treatment","control"), cex=0.8, col=c("blue","red"), pch=21:22, lty=1:2);
Let's start with a simple bar chart graphing the treatment vector: Plot treatment
barplot(treatment)
Let's now read the data from the example.txt data file, add labels, blue borders around the bars, and density lines:
Read values from tab-delimited example.txt
data <- read.table("data/example.txt", header=T, sep="\t")
Plot treatment with specified labels for axes. Use blue borders and diagonal lines in bars.
barplot(data$treatment, main="Treatment", xlab="Days",ylab="values", names.arg=c("Mon","Tue","Wed","Thu","Fri","Sat"), border="blue", density=c(10,20,30,40,50,60))
names.arg is a vector of names to be plotted below each bar or group of bars. density a vector giving the density of shading lines, in lines per inch, for the bars or bar components.
Now let's plot the treatment data using some color and show a legend:
Graph data with adjacent bars using colors
barplot(as.matrix(data), main="Data", ylab= "Total", beside=TRUE, col= c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk","maroon"))
Place the legend at the top-left corner with no frame
legend("topleft", c("Mon","Tue","Wed","Thu","Fri","Sat"), cex=0.8,bty="n",
fill= c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk","maroon"));
Let's start with a simple histogram plotting the distribution of the treatment vector:
Create a histogram for treatment
hist(treatment)
Concatenate the three vectors
all <- c(data$control, data$treatment)
Create a histogram for data in light blue with the y axis ranging from 0-10
hist(all, col="lightblue", ylim=c(0,10))
Now change the breaks so none of the values are grouped together and flip the y-axis labels horizontally.
Compute the largest value used in the data
max_num <- max(all)
Create a histogram for data with fire colors, set breaks so each number is in its own group, make x axis range from 0-max_num, disable right-closing of cell intervals, set heading, and make y-axis labels horizontal.
hist(all, col=heat.colors(max_num), breaks=max_num, xlim=c(0,max_num), right=F,
main="Histogram", las=1)
breaks: a single number giving the number of cells for the histogram, An open interval does not include its endpoints, and is indicated with parentheses.
For example (0,1) means greater than 0 and less than 1.
A closed interval includes its endpoints, and is denoted with square brackets. For example [0,1] means greater than or equal to 0 and less than or equal to 1.
Now let's create uneven breaks and graph the probability density.
Create uneven breaks
brk <- c(0,30,40,50,60,80,100)
Create a histogram for all data with fire colours, set uneven breaks, make x axis range from 0-max_num, disable right-closing of cell intervals, set heading, make y-axis labels horizontal, make axis labels smaller, make areas of each column proportional to the count
hist(all, col=heat.colors(length(brk)), breaks=brk,xlim=c(0,max_num), right=F, main="Probability Density",las=1, cex.axis=0.8, freq=F)
freq logical; if TRUE, the histogram graphic is a representation of frequencies
if FALSE, probability densities, component density, are plotted
Let's start with a simple pie chart graphing the treatment vector: Create a pie chart for treatment
pie(treatment)
Now let's add a heading, change the colours, and define our own labels:
Create a pie chart with defined heading and custom colours and labels
pie(treatment, main="Treatment", col= c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk","maroon"),
labels=c("Mon","Tue","Wed","Thu","Fri","Sat"))
Now let's change the colours, label using percentages, and create a legend:
Define some colours ideal for black & white print
colors <- c("white","grey70","grey90","grey50","black")
Calculate the percentage for each day, rounded to one decimal place
treatment_labels <- round(treatment/sum(treatment) * 100, 1)
Concatenate a '%' char after each value
treatment_labels <- paste(treatment_labels, "%", sep="")
Create a pie chart with defined heading and custom colors and labels
pie(treatment, main="treatment", col=colors, labels= treatment_labels,cex=0.8)
Create a legend at the right
legend(1.5, 0.5, c("Mon","Tue","Wed","Thu","Fri","Sat"), cex=0.8,fill=colors)
Let's start with a simple dot chart graphing the data:
Create a dot chart for data Function t returns the transpose of a matrix.
dotchart(t(data))
Let's make the dotchart a little more colorful:
Create a colored dotchart for autos with smaller labels
dotchart(t(data), color=c("red","blue","darkgreen"),main="Dotchart", cex=0.8)
R makes it easy to combine multiple plots into one overall graph, using either the par( ) or layout( ) function. With the par( ) function, you can include the option mfrow=c(nrows, ncols) to create a matrix of nrows x ncols plots that are filled in by row. mfcol=c(nrows, ncols) fills in the matrix by columns.
Define a layout with 2 rows and 2 columns
par(mfrow=c(2,2))
Here, we will use different dataset with two columns each for treated and untreated samples.
data1 <- read.table("data/gene_data.txt", header=T, sep="\t")
head(data1)
ensembl_gene_id Untreated1 Untreated2 Treated1 Treated2
1 ENSDARG00000093639 0.8616832 1.9311442 0.1041508 0.14055604
2 ENSDARG00000094508 0.9857575 2.0256352 0.1549917 0.20301609
3 ENSDARG00000095893 0.8498889 1.9875580 0.2317969 0.20925123
4 ENSDARG00000095252 0.9242996 2.0857620 0.2562264 0.24669079
5 ENSDARG00000078878 0.3571734 0.4653908 0.1167221 0.09710237
6 ENSDARG00000079403 1.0604071 1.2581398 0.3884836 0.31567299
Plot histograms for different columns in the data frame separately. This is not very efficient. You could also do it more efficiently using for loop.
hist(data1$Untreated1)
hist(data1$Treated2)
hist(data1$Untreated2)
boxplot(data1$Treated1)
There are many different ways of saving your plots in R.
The only argument you would need is name of file in which you want to save the plot.
Plotting commands then can be entered as usual. The output would be redirected to the file.
When you're done with your plotting commands, enter the dev.off() command.
bmp(filename, width = 480, height = 480, units = "px", point-size = 12)
jpeg(filename, width = 480, height = 480, units = "px", point-size = 12, quality = 75)
Saving in bitmap format
bmp(file = "control.bmp")
plot(control)
dev.off()
Saving in jpeg format
jpeg(file = "control.jpg", quality = 20)
plot(control)
dev.off()
Saving in postscript format
postscript(file = "control.ps")
plot(control)
dev.off()
saving in pdf format
pdf(file = "control.pdf", paper = "A4")
plot(control)
dev.off()
Lattice is an excellent package for visualizing multivariate data. It has great set of routines for quickly displaying complex data sets with ease.
Advantages of using lattice package are as following.
Plots with lattice package usually look better.
They can be extended in powerful ways.
The resulting output can be annotated, edited and saved.
Basic form for lattice function call is function.name (formula).
The general arrangement of a formula in a lattice function is:
vertical.axis.variable ~ horizontal.axis.variable
Note that the tilde operator (i.e., ~) must be used in a lattice function call, even if the graph only uses a single variable.
For e.g., histogram(~data$x), xyplot(data$y ~ data$x)
Some of the functions available in lattice package are as following:
histogram(), densityplot(),bwplot()
qqmath(),qq()
xyplot() for creating scatterplots
Let’s start by loading the lattice package.
library(lattice)
Read the data from file named gene_data.txt
data <- read.table("data/gene_data.txt", header=T, sep="\t")
A simple scatter plot can be produced as,
xyplot(Untreated2~Treated2, data=data)
or the output can be redirected to an object as,
tplot<-xyplot(Untreated2~Treated2, data=data)
and then printed as,
print(tplot)
The object containing the plot can further be modified. for e.g.
tplot2<-update(tplot, main="Drug treatment in Cells" )
print(tplot2)
Box and whisker plot can be produced with bwplot function. Here, we are again using a singer data frame which is bundled with lattice package. You would have to load lattice package first before using this database.
head(singer)
height voice.part
1 64 Soprano 1
2 62 Soprano 1
3 66 Soprano 1
4 65 Soprano 1
5 60 Soprano 1
6 61 Soprano 1
bwplot(voice.part ~ height, data=singer, xlab="Height (inches)")
A density plot can be drawn with densityplot function.
densityplot( ~ height | voice.part, data = singer, layout = c(2, 4), xlab = "Height (inches)", bw = 5)
qqmath function is used to draw quantile-Quantile plots of a sample against a theoretical distribution.
qqmath(~ rnorm(100), distribution = function(p) qt(p, df = 10))
Let's look at how to create a scatterplot in ggplot2. We'll use the iris data frame that's automatically loaded into R. What does the data frame contain? We can use the head function to look at the first few rows
library(ggplot2)
head(iris, n = 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
by default, head displays the first 6 rows.
head(iris, n = 10)
we can also explicitly set the number of rows to display
(The data frame actually contains three types of species: setosa, versicolor, and virginica.)
Let's plot Sepal.Length against Petal.Length using ggplot2's qplot() function.
qplot(Sepal.Length, Petal.Length, data = iris)
Plot Sepal.Length vs. Petal.Length, using data from the iris
data frame.
First argument Sepal.Length
goes on the x-axis.
Second argument Petal.Length
goes on the y-axis.
data = iris
means to look for this data in the iris
data frame.
To see where each species is located in this graph, we can color each point by adding a color = Species argument.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species)
Similarly, we can let the size of each point denote petal width, by adding a size = Petal.Width argument.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)
We see that Iris setosa flowers have the narrowest petals.
Finally, let's fix the axis labels and add a title to the plot.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, xlab = "Sepal Length", ylab = "Petal Length", main = "Sepal vs. Petal Length in Fisher's Iris data")
In the scatterplot examples above, we implicitly used a point geom, the default when you supply two arguments to qplot(). These two invocations are equivalent.
qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")
qplot(Sepal.Length, Petal.Length, data = iris)
But we can also easily use other types of geoms to create more kinds of plots. Barcharts: geom = “bar”
Let’s construct a data frame called movies.
movies = data.frame(
director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
minutes = c(124, 163, 195, 600, 187)
)
Plot the number of movies each director has.
qplot(director, data = movies, geom = "bar", ylab = "# movies")
By default, the height of each bar is simply a count. But we can also supply a different weight. Here the height of each bar is the total running time of the director's movies.
qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "total length (min.)")
qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species)
Orange
is another built-in data frame that describes the growth of orange trees.
qplot(age, circumference, data = Orange, geom = "line", colour = Tree,
main = "How does orange tree circumference vary with age?")
We can also plot both points and lines.
qplot(age, circumference, data = Orange, geom = c("point", "line"), colour = Tree)
Exercise on Plotting can be found here
Answers can be found here here
R has a powerful set of statistical methods. Including:
One of the simplest statistical tools are summary() and table(). These functions provide descriptive statistics for data frames and character vectors or factors.
patientName <- c("patient1","patient2","patient3","patient4")
patientType <- factor(rep(c("male","female"),2))
survivalTime <- c(1,30,2,20)
dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)
The table() function provides a breakdown of the frequency of occurrence of all unique values in vector or factor.
table(patientType)
patientType
female male
2 2
summary() provides a breakdown of occurrence for all character or factor columns and min,max,mean and quantiles for numeric columns.
summary(dfExample)
Name Type Survival_Time
patient1:1 female:2 Min. : 1.00
patient2:1 male :2 1st Qu.: 1.75
patient3:1 Median :11.00
patient4:1 Mean :13.25
3rd Qu.:22.50
Max. :30.00
The table() function can be used to generate frequency tables across data.frames.
table(dfExample)
, , Survival_Time = 1
Type
Name female male
patient1 0 1
patient2 0 0
patient3 0 0
patient4 0 0
, , Survival_Time = 2
Type
Name female male
patient1 0 0
patient2 0 0
patient3 0 1
patient4 0 0
, , Survival_Time = 20
Type
Name female male
patient1 0 0
patient2 0 0
patient3 0 0
patient4 1 0
, , Survival_Time = 30
Type
Name female male
patient1 0 0
patient2 1 0
patient3 0 0
patient4 0 0
ftable() function provides a method to generate or print the results from table() in a neater format.
ftable(dfExample)
Survival_Time 1 2 20 30
Name Type
patient1 female 0 0 0 0
male 1 0 0 0
patient2 female 0 0 0 1
male 0 0 0 0
patient3 female 0 0 0 0
male 0 1 0 0
patient4 female 0 0 1 0
male 0 0 0 0
A common task in statistical analysis is to investigate the relationship between pairs of numeric vectors.
This can be done by identifying the correlation between numeric vectors using the cor() function in R.
In this example we use cor() to identify the Pearson correlation between two variables. The method argument may be set to make use of different correlation methods.
x <- rnorm(100,10,2)
z <- rnorm(100,10,2)
y <- x
cor(x,y) #
[1] 1
cor(x,-y)
[1] -1
cor(x,z)
[1] -0.01135608
Often we wish to apply correlation analysis to all columns or rows in a matrix in a pair-wise manner. To do this in R, we can simply pass the cor() function a single argument of the numeric matrix of interest. The cor() function will then perform all pair-wise correlations between columns.
Sample_1.hi | Sample_2.hi | Sample_3.hi | Sample_4.low | Sample_5.low | Sample_1.low |
---|---|---|---|---|---|
2.181518 | 2.963433 | 3.618821 | 3.703849 | 3.621196 | 3.466088 |
3.953413 | 5.003427 | 3.128916 | 3.505461 | 3.953361 | 4.261023 |
cor(minRep)[1:2,2:5]
Sample_2.hi Sample_3.hi Sample_4.low Sample_5.low
Sample_1.hi 0.9451116 0.9202844 0.4639150 0.2544438
Sample_2.hi 1.0000000 0.9398130 0.4692492 0.1980213
image(cor(minRep),axes=F)
axis(1,at=seq(0,1,length.out=6), colnames(minRep))
axis(2,at=seq(0,1,length.out=6), colnames(minRep))
R comes with functions for extracting information from most common distibutions types. An example of standard R functions for dealing with distibution can be seen here using the normal distributions.
Similar functions are available for other distibution types including:
We can use rnorm to generate random values following a normal distribution. Here we produce 10 normally distributed numeric values with mean 8 and standard deviation of 3
rnorm(10,mean=8,sd=3)
[1] 3.523634 9.462042 4.183035 6.450474 13.042276 14.233895 6.862742
[8] 6.454255 13.119918 8.869193
We can also use these functions to interrogate values assuming a normal distribution for the data.
The probablity of a value being exactly 8 for a distribution of mean 8 and standard deviation 3.
dnorm(8,mean=8,sd=3)
[1] 0.1329808
The probablity of a value being less than 8 for a distribution of mean 8 and standard deviation 3.
pnorm(8,mean=8,sd=3)
[1] 0.5
The value for which i have a 50 percent being greater than given a normal distribution of mean 8 and standard deviation 3.
qnorm(0.5,mean=8,sd=3)
[1] 8
On top of descriptive statistics, R has several statistical tests covering a range of problems and data types.
Some common tests include:
var.test() - Comparing 2 variances (Fisher's F test)
t.test() - Comparing 2 sample means with normal errors (Student's t-test)
wilcox.test() - Comparing 2 means with non-normal errors (Wilcoxon's rank test)
fisher.test() - Testing for independence of 2 variables in a contingency table (Fisher's exact test)
To perform a t-test we will read in some datasets, test that the variances of the datasets are equal and then perform the actual t-tests.
tTestExample <- read.table("data/tTestData.csv",sep=",",header=T)
tTestExample
A B C
1 26.03199 40.65549 33.78191
2 27.00897 40.59726 39.48001
3 27.57468 39.86103 34.63261
4 27.15929 39.96254 42.62425
5 25.82156 42.54450 35.48016
6 26.18872 39.31856 40.49148
7 26.30709 40.80434 38.91200
8 26.21609 39.96368 40.74275
9 25.84095 40.08374 34.28705
10 26.86587 40.18931 37.80057
First we can specify the columns of interest using $ and calculate their variance using var().
var(tTestExample$A)
[1] 0.367875
var(tTestExample$B)
[1] 0.7614086
var(tTestExample$C)
[1] 9.681179
Now we can test for any differences in variances between A and B and A and C with an F-test using the var.test() function.
var.test(tTestExample$A,tTestExample$B)
F test to compare two variances
data: tTestExample$A and tTestExample$B
F = 0.4832, num df = 9, denom df = 9, p-value = 0.2936
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1200078 1.9451614
sample estimates:
ratio of variances
0.4831506
var.test(tTestExample$A,tTestExample$C)
F test to compare two variances
data: tTestExample$A and tTestExample$C
F = 0.038, num df = 9, denom df = 9, p-value = 4.092e-05
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.009438411 0.152983699
sample estimates:
ratio of variances
0.03799899
The data type holding the result var.test() is a little more complex than the data types we have looked.
In R, special objects (S3 or S4 objects) can be created which have methods associated to them. The result from var.test is an object of class htest.
Since we have not come across this before, in order to discover its structure we can use the str() function with the object of interest as the argument.
result <- var.test(tTestExample$A, tTestExample$B)
str(result)
List of 9
$ statistic : Named num 0.483
..- attr(*, "names")= chr "F"
$ parameter : Named int [1:2] 9 9
..- attr(*, "names")= chr [1:2] "num df" "denom df"
$ p.value : num 0.294
$ conf.int : atomic [1:2] 0.12 1.95
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 0.483
..- attr(*, "names")= chr "ratio of variances"
$ null.value : Named num 1
..- attr(*, "names")= chr "ratio of variances"
$ alternative: chr "two.sided"
$ method : chr "F test to compare two variances"
$ data.name : chr "tTestExample$A and tTestExample$B"
- attr(*, "class")= chr "htest"
Now we know the structure and class of the htest object we can access the slots containing information we want just as with a named list.
The p-value
result$p.value
[1] 0.293592
The statistic
result$statistic
F
0.4831506
The data used in function call
result$data.name
[1] "tTestExample$A and tTestExample$B"
We have ascertained that GroupA and GroupB have similar variances. We can therefore perform a standard t-test to assess the significance of differences between these groups.
Result <- t.test(tTestExample$A,tTestExample$B,alternative ="two.sided", var.equal = T)
Result
Two Sample t-test
data: tTestExample$A and tTestExample$B
t = -41.3528, df = 18, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-14.60253 -13.19051
sample estimates:
mean of x mean of y
26.50152 40.39804
To compare groups of unequal variance then the var.equal argument may be set to FALSE (which is the default).
Result <- t.test(tTestExample$A,tTestExample$C,alternative ="two.sided", var.equal = F)
Result
Welch Two Sample t-test
data: tTestExample$A and tTestExample$C
t = -11.2941, df = 9.683, p-value = 6.855e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.56531 -9.07821
sample estimates:
mean of x mean of y
26.50152 37.82328
The same result to that shown could be achieved by specifying a formula for the comparison. Here we wish to compare column A versus B so we could simply specify the formula and the data to be used.
longFrame <- data.frame(Group = c(rep("A",nrow(tTestExample)),rep("B",nrow(tTestExample))), Value=c(tTestExample$A,tTestExample$B))
result <- t.test(Value~Group,longFrame,alternative ="two.sided",
var.equal = T)
result
Two Sample t-test
data: Value by Group
t = -41.3528, df = 18, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-14.60253 -13.19051
sample estimates:
mean in group A mean in group B
26.50152 40.39804
We have seen how we can find the correlation between two sets of variables using cor() function.
R also provides a comprehensive set of tools for regression analysis including the well used linear modeling function lm()
To fit a linear regression we use a similar set of arguments as passed to the t-test fuction in the previous slide.
> lmExample <- read.table("data/lmExample.txt",h=T,sep="\t")
> lmResult <- lm(Y~X,data=lmExample)
> plot(Y~X,data=lmExample,main="Line of best fit with lm()",
+ xlim=c(0,150),ylim=c(0,200))
> abline(lmResult,col="red",lty=3,lwd=3)
The lm() function fits a linear regression to your data and provides useful information on the generated fit.
In the example below we fit a linear model using lm() on the lmExample dataset with column Y as the dependent variable and column X as the explanatory variable.
> lmResult <- lm(Y~X,data=lmExample)
> lmResult
Call:
lm(formula = Y ~ X, data = lmExample)
Coefficients:
(Intercept) X
7.001 1.972
Printing the result from lm() shows the call to lm() and the coefficients including the intercept.
From the previous slides we now know the formula for the line.
Y = 7.001 + 1.972*X
We can add the line of best fit using abline()
> plot(Y~X,data=lmExample,main="Line of best fit with lm()",
+ xlim=c(0,100),ylim=c(0,200))
> abline(lmResult,col="red",lty=3,lwd=3)
As we have seen, printing the model result provides the intercept and slope of line.
To get some more information on the model we can use the summary() function
> summary(lmResult)
Call:
lm(formula = Y ~ X, data = lmExample)
Residuals:
Min 1Q Median 3Q Max
-5.0150 -2.3688 -0.2079 2.6068 5.0538
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.00053 0.93207 7.511 3.91e-13 ***
X 1.97218 0.01325 148.894 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.858 on 398 degrees of freedom
Multiple R-squared: 0.9824, Adjusted R-squared: 0.9823
F-statistic: 2.217e+04 on 1 and 398 DF, p-value: < 2.2e-16
Call:
lm(formula = Y ~ X, data = lmExample)
Residuals:
Min 1Q Median 3Q Max
-5.0150 -2.3688 -0.2079 2.6068 5.0538
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.00053 0.93207 7.511 3.91e-13 ***
X 1.97218 0.01325 148.894 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.858 on 398 degrees of freedom
Multiple R-squared: 0.9824, Adjusted R-squared: 0.9823
F-statistic: 2.217e+04 on 1 and 398 DF, p-value: < 2.2e-16
The residuals are the difference between the predicted and actual values. To retrieve the residuals we can access the slot or use the resid() function.
> summary(resid(lmResult))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.0150 -2.3690 -0.2079 0.0000 2.6070 5.0540
> summary(lmResult$residual)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.0150 -2.3690 -0.2079 0.0000 2.6070 5.0540
Ideally you would want your residuals to be normally distributed around 0.
Call:
lm(formula = Y ~ X, data = lmExample)
Residuals:
Min 1Q Median 3Q Max
-5.0150 -2.3688 -0.2079 2.6068 5.0538
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.00053 0.93207 7.511 3.91e-13 ***
X 1.97218 0.01325 148.894 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.858 on 398 degrees of freedom
Multiple R-squared: 0.9824, Adjusted R-squared: 0.9823
F-statistic: 2.217e+04 on 1 and 398 DF, p-value: < 2.2e-16
The R-squared value represents the proportion of variability in the response variable that is explained by the explanatory variable.
A high R-squared here indicates that the line fits closely to the data.
> summary(lmResult)
Call:
lm(formula = Y ~ X, data = lmExample)
Residuals:
Min 1Q Median 3Q Max
-5.0150 -2.3688 -0.2079 2.6068 5.0538
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.00053 0.93207 7.511 3.91e-13 ***
X 1.97218 0.01325 148.894 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.858 on 398 degrees of freedom
Multiple R-squared: 0.9824, Adjusted R-squared: 0.9823
F-statistic: 2.217e+04 on 1 and 398 DF, p-value: < 2.2e-16
The results from linear models also provides a measure of significance for a variable not being relevant.
Exercise on statistics can be found here
Answers can be found here here