1 SPEAQeasy introduction

Instructor: Leo

Congrats Nick https://t.co/O3u5XRPXy2 for your @biorxivpreprint first pre-print! 🙌🏽

SPEAQeasy is our @nextflowio implementation of the #RNAseq processing pipeline that produces @Bioconductor-friendly #rstats objects that we use at @LieberInstitute

📜 https://t.co/zKuBRtBCmY pic.twitter.com/F83fXI90eP
— 🇲🇽 Leonardo Collado-Torres (@lcolladotor) December 12, 2020

1.1 2022-04-20 overview slides

1.2 SPEAQeasy main links

Paper: https://doi.org/10.1186/s12859-021-04142-3
Documentation website: http://research.libd.org/SPEAQeasy/
- Source code: https://github.com/LieberInstitute/SPEAQeasy
Example website: http://research.libd.org/SPEAQeasy-example/
- Source code: https://github.com/LieberInstitute/SPEAQeasy-example
Differential expression analysis bootcamp: https://lcolladotor.github.io/bioc_team_ds/differential-expression-analysis.html
- 3 sessions, each 2 hours long

1.3 Pipeline outputs

Documentation chapter: http://research.libd.org/SPEAQeasy/outputs.html

That’s enough links! Let’s download some data and check it out. We’ll use BiocFileCache to keep the file in a local cache, so that re-running this example does not download the data from the web again.

## Load the container package for this type of data
library("SummarizedExperiment")

## Download and cache the file
library("BiocFileCache")
bfc <- BiocFileCache::BiocFileCache()
cached_rse_gene_example <- BiocFileCache::bfcrpath(
    x = bfc,
    "https://github.com/LieberInstitute/SPEAQeasy-example/raw/master/rse_speaqeasy.RData"
)

## adding rname 'https://github.com/LieberInstitute/SPEAQeasy-example/raw/master/rse_speaqeasy.RData'

## Check the local path on our cache
cached_rse_gene_example

##                                                                  BFC1 
## "/github/home/.cache/R/BiocFileCache/48f7acf1717_rse_speaqeasy.RData"

## Load the rse_gene object
load(cached_rse_gene_example, verbose = TRUE)

## Loading objects:
##   rse_gene

## General overview of the object
rse_gene

## class: RangedSummarizedExperiment 
## dim: 60609 40 
## metadata(0):
## assays(1): counts
## rownames(60609): ENSG00000223972.5 ENSG00000227232.5 ... ENSG00000210195.2 ENSG00000210196.2
## rowData names(10): Length gencodeID ... NumTx gencodeTx
## colnames(40): R13896_H7JKMBBXX R13903_HCTYLBBXX ... R15120_HFY2MBBXX R15134_HFFGHBBXX
## colData names(67): SAMPLE_ID FQCbasicStats ... AgeDeath BrNum

## We can check how big the object is with lobstr
lobstr::obj_size(rse_gene)

## 35.78 MB
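
The printed overview above lists three main components: the assays, the row (gene) information, and the colData (sample information). As a minimal sketch of how each one is accessed, the code below builds a tiny toy object. The name toy_rse and all of its values are made up for illustration; they do not come from SPEAQeasy. The same accessor functions should work on rse_gene.

```r
## Toy stand-in for rse_gene: 3 genes x 2 samples (all values are made up)
library("SummarizedExperiment")

counts <- matrix(
    c(10L, 0L, 5L, 3L, 8L, 1L),
    nrow = 3,
    dimnames = list(
        c("gene1", "gene2", "gene3"),
        c("sampleA", "sampleB")
    )
)

## Hypothetical gene coordinates with one gene-level metadata column
gene_ranges <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(1, 101, 201), width = 50),
    Length = c(50L, 50L, 50L)
)
names(gene_ranges) <- rownames(counts)

toy_rse <- SummarizedExperiment(
    assays = list(counts = counts),
    rowRanges = gene_ranges,
    colData = DataFrame(
        trimmed = c(TRUE, FALSE),
        row.names = colnames(counts)
    )
)

## Genes x samples matrix of counts
assay(toy_rse, "counts")

## Gene coordinates plus gene-level metadata
rowRanges(toy_rse)

## Sample-level metadata: one row per sample
colData(toy_rse)

## Number of genes and number of samples
dim(toy_rse)
```

Replacing toy_rse with rse_gene in the accessor calls is one way to start exploring the real object.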

1.4 Exercises

Exercise 1: Either by exploring the object rse_gene or by checking the SPEAQeasy documentation, what are the possible values for the variable trimmed?

Exercise 2: Across genes (rse_gene), exons (rse_exon), exon-exon junctions (rse_jx), and transcripts (rse_tx), what part of the output is identical?

If you want to answer this question with data, you could use the 4 RSE objects from the BrainSEQ Phase II study that are available at http://eqtl.brainseq.org/phase2/. They were created with the scripts at https://github.com/LieberInstitute/brainseq_phase2#rse_gene_unfilteredrdata. Note that these are much larger objects since they contain information for 900 samples.

1.5 Solutions

Solution 1: From http://research.libd.org/SPEAQeasy/outputs.html#quality-metrics the answer is:

A boolean value (“TRUE” or “FALSE”), indicating whether the given sample underwent trimming

We can reach the same answer with code by checking the class of the variable:

class(rse_gene$trimmed)

## [1] "logical"

## Logical vectors can take 2 values, TRUE or FALSE (plus `NA` when the value is missing)
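
The class tells us which values are possible. To see which ones occur in a given dataset, tabulating the variable is often handier. Here is a minimal sketch using a made-up vector as a stand-in for rse_gene$trimmed (the values below are hypothetical, not taken from the example data):

```r
## Hypothetical stand-in for rse_gene$trimmed (values are made up)
trimmed <- c(TRUE, FALSE, FALSE, NA)

## Count how often each value occurs, including missing values
table(trimmed, useNA = "ifany")

## List the distinct values that are present
unique(trimmed)
```

With the real object loaded, the same calls could be run on rse_gene$trimmed directly.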

Solution 2: From http://research.libd.org/SPEAQeasy/outputs.html#coldata-of-rse-objects the answer is that all objects have identical colData().
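
If you want to check this with data, one possible approach is to compare the colData() of two objects with identical(). The sketch below uses two toy objects that share the same sample table but have different numbers of features. The names toy_gene, toy_exon, and sample_info, and all values, are made up for illustration. For simplicity they are plain SummarizedExperiment objects without genomic ranges.

```r
## Two toy objects with the same samples but different features
library("SummarizedExperiment")

## Shared sample-level information (values are made up)
sample_info <- DataFrame(
    trimmed = c(TRUE, FALSE),
    row.names = c("sampleA", "sampleB")
)

## Toy gene-level object: 3 features
toy_gene <- SummarizedExperiment(
    assays = list(counts = matrix(
        1:6,
        nrow = 3,
        dimnames = list(paste0("gene", 1:3), rownames(sample_info))
    )),
    colData = sample_info
)

## Toy exon-level object: 5 features
toy_exon <- SummarizedExperiment(
    assays = list(counts = matrix(
        1:10,
        nrow = 5,
        dimnames = list(paste0("exon", 1:5), rownames(sample_info))
    )),
    colData = sample_info
)

## The feature dimension differs between the two objects
nrow(toy_gene)
nrow(toy_exon)

## The sample information is the same in both
identical(colData(toy_gene), colData(toy_exon))
```

With the BrainSEQ Phase II objects loaded, the same identical() call could be applied to pairs such as rse_gene and rse_exon, assuming the samples are stored in the same order in both objects.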