Bioinformatics Team

MRC LMS

Thomas Carroll

The Bioinformatics Team.

Tom Carroll (ex-Head of Bioinformatics)
Gopuraja Dharmalingam (Epigenetics)
Sanjay Khadayate (Genes and Metabolism)
Yi-Fang Wang (Integrative Biology)
Marion Dore (Genomics)
TBD (Proteomics)

Websites

http://mrccsc.github.io/

Computing and Bioinformatics

http://bioinformatics.csc.mrc.ac.uk/

Where to find the team.

ICTEM
2nd floor, MRC.
Central aisle,
Behind the printers.

Role

Experimental design.
Analysis
Bioinformatics Infrastructure.
Training.

Experimental Design

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

Fisher RA, 1938

Work closely with Genomics Team to help with design questions
- Replicate number.
- Sequencing depth.
- Sequencing strategy.

Nice example of experimental design

RNA-seq experiment (2014)
Graph shows major sources of variation.
Samples from same groups close together.
Samples from different experimental conditions separate.

Nice example of experimental design

Smaller sources of variance relating to other metadata.
Samples group according to the day on which RNA was extracted.
Known effects can be removed from analysis.
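
As a rough illustration of what removing a known effect can mean, the sketch below centres each extraction-day batch on the overall mean for a single gene. The expression values, day labels and function name are all hypothetical, and the approach assumes conditions are balanced across days. In a real analysis the known effect would more usually be included as a term in the statistical model than subtracted directly.

```python
from collections import defaultdict
from statistics import mean

def remove_batch_effect(values, batches):
    """Centre each batch on the overall mean (assumes a design balanced across batches)."""
    overall = mean(values)
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in by_batch.items()}
    # Subtract the batch mean, then add back the overall mean to keep the original scale.
    return [value - batch_means[batch] + overall for value, batch in zip(values, batches)]

# Hypothetical values for one gene: a control/treated pair extracted on each of two days.
expression = [10.0, 14.0, 12.0, 16.0]
extraction_day = ["day1", "day1", "day2", "day2"]
corrected = remove_batch_effect(expression, extraction_day)
print(corrected)
```

After correction the day-to-day shift is gone while the difference between the two samples within each day is preserved.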

Analysis

Initial data processing and QC.
Advice and support as needed.
Support throughout project.

Increased demand for long term support.

Authorship on more than 30 publications since 2014.

Analysis support

Increased use of high throughput techniques in projects.
Greater use for bioinformatics in projects.
Analysis across project lifetime or individual elements.
Requires reproducible research.

Reproducible research

Reproducing results from computational methods should be straightforward.
Common problems.
- Version and software changes.
- Lack of analysis documentation.
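
One way to guard against version drift is to record the software versions alongside every set of results, much as `sessionInfo()` does in R. The sketch below is a minimal Python equivalent using only the standard library; the function name and the package names passed to it are illustrative assumptions.

```python
import platform
import sys
from importlib import metadata

def session_info(packages):
    """Return interpreter, platform and package versions for inclusion in a report."""
    info = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence rather than failing, so the report is still produced.
            info[name] = "not installed"
    return info

# Hypothetical package list; replace with the packages the analysis actually uses.
report = session_info(["pip", "a-package-that-is-not-installed"])
for key, value in report.items():
    print(f"{key}: {value}")
```

Printing this at the end of each analysis keeps the versions on the same page as the results.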

rMarkdown

rMarkdown converts R (or Python and many other languages) code into dynamic reports.
Code, results and versions are reported within the same page.
HTML allows for inclusion of dynamic elements.

A do-it-yourself guide

Project tracking

Use Redmine software.
Multi-user interface to record project information.
Repository to version control scripts (SVN).
Wiki for internal documentation.

Infrastructure

Analysis pipelines.
Data delivery.
Software development.

ChIP-seq and RNA-seq pipelines

Common analysis steps can be automated.
Optimised for local resources.
Reproducible and comparable.
ChIP-seq and RNA-seq pipeline to automate alignment and quality control.
Freely available for use or customisation on GitHub.

http://mrccsc.github.io/

Genomics Pipeline

Joined forces with Genomics to better automate some processes of basecalling/demultiplexing data.
- Basecalling - Convert optical data to sequence information.
- Demultiplexing - Split sequence data by a sequence tag (index) which is unique per sample.
The process is:
- Users submit indexed samples and a sample sheet containing sample/index information.
- Genomics provide a set of QCed sample sequence files for further analysis.
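
To make the demultiplexing step concrete, here is a minimal sketch that assigns reads to samples by exact index match. The indexes, sample names, reads and the "Undetermined" label are hypothetical; production tools also tolerate a configurable number of index mismatches, which this sketch does not attempt.

```python
from collections import defaultdict

def demultiplex(reads, sample_sheet):
    """Group (index, sequence) reads by sample using exact index matches."""
    by_sample = defaultdict(list)
    for index, sequence in reads:
        # Reads whose index is not in the sample sheet are kept under a catch-all label.
        sample = sample_sheet.get(index, "Undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)

# Hypothetical sample sheet (index -> sample) and reads as (index, sequence) pairs.
sample_sheet = {"ACGTAC": "sample_A", "TTGCAA": "sample_B"}
reads = [
    ("ACGTAC", "GATTACA"),
    ("TTGCAA", "CCCGGGA"),
    ("NNNNNN", "TTTTAAA"),
    ("ACGTAC", "GGGATTC"),
]
demultiplexed = demultiplex(reads, sample_sheet)
for sample, sequences in demultiplexed.items():
    print(sample, len(sequences))
```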

Why?

Reduce time consuming steps.
- Sample sheet templates often misinterpreted by users or incorrectly filled out.
- Thankfully, the errors are not very original, so common ones can be caught automatically.
- Manual creation of basemasks for index/read layout.
- Remove dependence on Illumina's Windows point-and-click software - Sequencing Analysis Viewer (SAV).
- Reduce copy and paste.
Remove sources of potential error.
- An automated process will be more consistent (in both accuracy and error).
- Reduce copy and paste errors.
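
The kind of automatic checking described above might look like the following sketch. The checks chosen (stray whitespace and lower case, non-ACGT characters, duplicated indexes), the row layout and the function name are assumptions for illustration, not a description of the team's actual rules.

```python
import re
from collections import Counter

VALID_INDEX = re.compile(r"[ACGT]+")

def check_sample_sheet(rows):
    """Return a list of human-readable problems found in (sample_id, index) rows."""
    problems = []
    # Clean the easy mistakes first: stray whitespace and lower-case indexes.
    cleaned = [(sample.strip(), index.strip().upper()) for sample, index in rows]
    for sample, index in cleaned:
        if not sample:
            problems.append("missing sample ID")
        if not VALID_INDEX.fullmatch(index):
            problems.append(f"{sample or '?'}: invalid index '{index}'")
    # An index shared by several samples makes them impossible to separate.
    for index, count in Counter(index for _, index in cleaned).items():
        if count > 1:
            problems.append(f"index '{index}' used by {count} samples")
    return problems

# Hypothetical rows containing a duplicated index and an invalid character.
rows = [("sample_A", "acgtac "), ("sample_B", "ACGTAC"), ("sample_C", "ACG-TA")]
problems = check_sample_sheet(rows)
for problem in problems:
    print(problem)
```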

The MRC-LMS Genomics Pipeline

Automate samplesheet cleaning and updating.
Create basemasks describing the index/read layout.
Parse results from basecalling/demultiplexing:
- Binary files from Real Time Analysis software.
- XML files from bcl2fastq version 2.17 or above.
Summarise results to provide to user alongside resulting sequence files in fastq format.
Written in R, moved to Bioconductor in May 2017.
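
For illustration, a basemask can be derived from the run layout and the index lengths actually used in the sample sheet. The sketch below writes masks in the style of bcl2fastq's `--use-bases-mask` option (Y for read cycles, I for index cycles, N for ignored cycles). The function, its arguments and the run layout are hypothetical and do not reflect the pipeline's own R code.

```python
def make_basemask(run_layout, index_lengths):
    """Build a basemask string: Y = read cycles, I = index cycles used, N = cycles ignored."""
    parts = []
    used_lengths = iter(index_lengths)
    for kind, cycles in run_layout:
        if kind == "read":
            parts.append(f"Y{cycles}")
            continue
        # For an index read, use as many cycles as the sample sheet needs and mask the rest.
        used = min(next(used_lengths, 0), cycles)
        mask = (f"I{used}" if used else "") + (f"N{cycles - used}" if cycles > used else "")
        parts.append(mask)
    return ",".join(parts)

# Hypothetical paired-end, dual-index run where samples carry a single 6 bp index.
run_layout = [("read", 101), ("index", 8), ("index", 8), ("read", 101)]
basemask = make_basemask(run_layout, index_lengths=[6, 0])
print(basemask)
```

Generating the mask from the sample sheet avoids typing it by hand for each run, which is where layout mistakes tend to creep in.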

Others in the pipeline

Internal RNA-seq/ChIP-seq pipeline
- Written in R.
- Easily installed and maintained.
- Allows Core to move between systems easily (both install and parallelisation).
- Released soon.
Basecalling to ChIP/RNA-seq QC/initial analysis all in one reproducible, version controlled report.
Proteomics data pipeline?

UCSC genome browser

UCSC allows for visualisation of a range of genomics data types.
Public instances can be very slow.
CSC public instance maintained by Bioinformatics team.
web: http://ucsc

FTP: ftp://ucsc

Software

Develop and maintain software relevant to our work.
R packages and javascript toolsets.
Release software to public (peer-reviewed) repositories.
- Review improves code quality.
- Collaborative feedback.
- Automated build reports and checking.

ChIPQC

Lack of suitable R/Bioconductor quality control tools for ChIP-seq.
Require methods to assess quality across high volumes of samples.
ChIPQC developed and tested on 500 public datasets.

Package

Bioc2014 Tutorial

IGV is a popular alternative to UCSC.
Allows for inclusion of per sample metadata and complex sample display types.
Tracktables creates standalone and rMarkdown compliant tables.

Tracktables

Simple Example - Loading data

Visualising genomics data over regions of the genome.
Allows for rapid generation of profiles and subsetting by IDs or other regions.
Arithmetic operations between and within profiles allows for rapid, iterative investigation of hypotheses.
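
The idea of subsetting profiles by ID and then combining them arithmetically can be sketched as below. This is a plain-Python illustration with made-up signal values and hypothetical function names, not the package's own interface.

```python
import math

def average_profile(signal_by_region):
    """Average per-position signal across equal-width regions."""
    n_regions = len(signal_by_region)
    return [sum(position) / n_regions for position in zip(*signal_by_region)]

def log2_ratio(profile_a, profile_b, pseudocount=1.0):
    """Position-wise log2 ratio of two profiles, with a pseudocount to avoid log(0)."""
    return [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(profile_a, profile_b)
    ]

# Hypothetical binned signal over three regions, keyed by gene ID.
chip_signal = {"gene1": [1.0, 7.0, 3.0], "gene2": [1.0, 7.0, 3.0], "gene3": [0.0, 1.0, 0.0]}
input_signal = {"gene1": [1.0, 1.0, 1.0], "gene2": [1.0, 1.0, 1.0], "gene3": [1.0, 1.0, 1.0]}

# Subset by ID, build an average profile per sample, then compare the two profiles.
subset = ["gene1", "gene2"]
chip_profile = average_profile([chip_signal[gene] for gene in subset])
input_profile = average_profile([input_signal[gene] for gene in subset])
enrichment = log2_ratio(chip_profile, input_profile)
print(enrichment)
```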

soGGi

Peak calling in R is convenient.
Many peak callers in R have unwieldy input and are far from optimised.
triform contains a reliable peak calling algorithm in need of optimisation for speed and long marks.
MRC CSC took over maintenance of triform in 2015.

triform

basecallQC

Automate basecalling and demultiplexing.
Less manual intervention.
No dependence on proprietary software.
Customised for MRC-LMS use.

Training

Aim to develop courses to meet MRC Clinical Sciences requirements.
- R (8 courses)
- Python (1 course)
- High throughput sequencing analysis (3 courses)
Relevant and targeted training.
Real world examples.
Taught courses and material for online learning.

Material Repository

2014 (R Basics)
- Intro to R course.
- Reproducible R.
2015 (Basics of Bioinformatics)
- ChIP-seq.
- RNA-seq.
- Bioconductor.
- Genome Browsers and File Formats.
2016 (Intermediate R and Bioinformatics)
- Intermediate R.
- Alignments and Counting.
- Visualising High Throughput Sequencing Data.
- Introduction to Python.

CSC Bioinformatics Course

Current and upcoming Bioinformatics training material can be found at our site

http://mrccsc.github.io/training.html

GitHub/GitHub.io/Travis/AppVeyor and Training in 2017

All training is hosted on GitHub.
All pages automatically synced to GitHub.io.
Training courses automatically tested on Windows/Mac/Linux after any changes.

Training Collaborations

All material is open source (CC BY-NC-ND 4.0).
- Free to distribute or collaborate.
- No commercial use.
- No remixes.
Bioinformatics Core Shared Training (with Mark Dunning)
- Genomic File Formats
- Intermediate R
- ChIP-seq ->
- IGV ->
- -> Experimental Design

Training on the cloud.

Awarded grant from Amazon Web Services.
Use virtual Linux servers to host R and RStudio pre-loaded with course material.
Allow for larger, real world analysis tasks during training.
No need for dedicated classroom - train from anywhere.

Future!

Release and link RNA-seq and ChIP-seq pipelines.
- Submit samples and get QCed, aligned data back.
Support proteomics and analysis of other MRC high throughput technologies.
Training updates
- More Python!
- More non-programmer courses (e.g. deepTools)!
- Julia and other languages?!
- Interactive online training!!
More analysis!!!

Have a great week!

Bioinformatics Team

Tom - thomas.carroll@imperial.ac.uk

Gopu - gopuraja.dharmalingam@imperial.ac.uk

Sanjay - sanjay.khadayate@imperial.ac.uk

Yi-Fang - yifang.wang@imperial.ac.uk

Marion - marion.dore@imperial.ac.uk