11 Differential gene expression exercise
Instructor: Leo
11.1 Recap
So far we know how to:
- choose a study from recount3
- download data for a study with recount3::create_rse()
- explore the data interactively with iSEE
- expand Sequence Read Archive (SRA) attributes
  - sometimes we need to clean them up a bit before we can use them
- use edgeR::calcNormFactors() to reduce composition bias
- build a differential gene expression model with model.matrix()
- explore and interpret the model with ExploreModelMatrix
- use limma::voom() and related functions to compute the differential gene expression statistics
- extract the DEG statistics with limma::topTable(sort.by = "none")
- use some limma functions for making MA or volcano plots
among several other plots and tools we learned along the way.
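As a compact reminder of how the modeling steps in this recap fit together, here is a minimal sketch on simulated counts. The count matrix, the pheno table, and the two-group design are made up for illustration; with real data you would start from the RangedSummarizedExperiment you downloaded and use the model you actually care about.

```r
## Minimal recap sketch on simulated data (not a real study)
library("edgeR")
library("limma")

set.seed(20240501)
n_genes <- 1000
n_samples <- 12

## Hypothetical count matrix: genes in rows, samples in columns
counts <- matrix(
    rnbinom(n_genes * n_samples, mu = 200, size = 5),
    nrow = n_genes,
    dimnames = list(
        paste0("gene", seq_len(n_genes)),
        paste0("sample", seq_len(n_samples))
    )
)

## Hypothetical sample information with a two-level group variable
pheno <- data.frame(
    group = factor(rep(c("control", "treated"), each = n_samples / 2)),
    row.names = colnames(counts)
)

## Normalization factors to reduce composition bias
dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)

## Statistical model
mod <- model.matrix(~group, data = pheno)

## voom weights, linear model fit, and empirical Bayes moderation
vGene <- voom(dge, mod, plot = FALSE)
eb_results <- eBayes(lmFit(vGene))

## DEG statistics for the group coefficient, kept in the input gene order
de_results <- topTable(
    eb_results,
    coef = 2,
    number = nrow(dge),
    sort.by = "none"
)
head(de_results)
```

Keeping sort.by = "none" means the rows of de_results stay in the same order as the input genes, which makes it easier to line up results across datasets later on.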
As an alternative to recount3, we have learned about the RangedSummarizedExperiment objects produced by SPEAQeasy, in particular the one we are using in the smokingMouse project.
You might have your own data already. Maybe you have it as an AnnData Python object. If so, you can convert it to an R object with zellkonverter.
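If your AnnData object was saved to an .h5ad file, a conversion could look roughly like the sketch below. The file name is hypothetical, and the code only runs the conversion when zellkonverter is installed and the file exists. zellkonverter::readH5AD() returns a SingleCellExperiment object, so you may still need to adapt some of the code from this course that expects a RangedSummarizedExperiment.

```r
## Hypothetical path to an .h5ad file exported from Python
h5ad_file <- "my_dataset.h5ad"

if (requireNamespace("zellkonverter", quietly = TRUE) && file.exists(h5ad_file)) {
    ## Read the AnnData file into a SingleCellExperiment object
    sce <- zellkonverter::readH5AD(h5ad_file)
    print(sce)
} else {
    message("Install zellkonverter and point h5ad_file to your own file to try this.")
}
```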
11.2 Exercise
Exercise option 1:
This will be an open-ended exercise. Think of it as time to practice what we’ve learnt using data from recount3 or another subset of the smokingMouse dataset. You could also choose to re-run code from earlier parts of the course and ask clarifying questions, or use this time to adapt some of the code we’ve covered to your own dataset.
If you prefer a more structured exercise:
Exercise option 2:
- Choose two recount3 studies that can be used to study similar research questions. For example, two studies with brain samples across age.
- Download and process each dataset independently, up to the point where you have differential expression t-statistics for both. Skip most of the exploratory data analysis steps since, for the purpose of this exercise, we are most interested in the DEG t-statistics.
  - If you don’t want to choose another recount3 study, you could use the smokingMouse data and subset it once to the pups in the nicotine arm of the study and a second time to the pups in the smoking arm of the study.
  - Or you could use the GTEx brain data from recount3, subset it to the prefrontal cortex (PFC), and compute age-related expression changes. That would be in addition to SRA study SRP045638 that we used previously.
    recount3::create_rse_manual(
        project = "BRAIN",
        project_home = "data_sources/gtex",
        organism = "human",
        annotation = "gencode_v26",
        type = "gene"
    )
- Make a scatterplot of the t-statistics between the two datasets to assess correlation / concordance. You might want to use GGally::ggpairs() for this https://ggobi.github.io/ggally/reference/ggpairs.html. Or ggpubr::ggscatter() https://rpkgs.datanovia.com/ggpubr/reference/ggscatter.html.
  - For example, between the GTEx PFC data and the data we used previously from SRA study SRP045638.
  - Or between the nicotine-exposed pups and the smoking-exposed pups in smokingMouse.
  - Or using the two recount3 studies you chose.
- Are there any DEGs with FDR < 5% in both datasets? Or DEGs with FDR < 5% in dataset 1 that have a p-value < 5% in the other one?
- You could choose to make a concordance at the top (CAT) plot like the one at http://leekgroup.github.io/recount-analyses/example_de/recount_SRP019936.html, though you will likely need more time to complete this.
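The comparison steps in option 2 can be sketched with base R as shown below. Everything here is simulated: de1 and de2 stand in for two limma::topTable(sort.by = "none") outputs with gene IDs as row names, and simulate_de() and cat_curve() are hypothetical helpers written for this example, not functions from any package. The concordance calculation is one simple way to get the numbers behind a concordance at the top plot, not necessarily the one used in the linked analysis.

```r
set.seed(20240502)
genes <- paste0("gene", 1:2000)

## Simulated shared effect: about 10% of genes have a true signal
effect <- setNames(
    ifelse(seq_along(genes) %% 10 == 0, rnorm(length(genes), sd = 4), 0),
    genes
)

## Hypothetical helper that mimics the columns of a topTable() result
simulate_de <- function(gene_ids, effect) {
    t_stat <- effect[gene_ids] + rnorm(length(gene_ids))
    p <- 2 * pnorm(-abs(t_stat))
    data.frame(
        t = t_stat,
        P.Value = p,
        adj.P.Val = p.adjust(p, method = "fdr"),
        row.names = gene_ids
    )
}

## The two datasets only partially overlap in the genes they measure
de1 <- simulate_de(genes[1:1800], effect)
de2 <- simulate_de(genes[201:2000], effect)

## Match the datasets by gene ID before comparing anything
common <- intersect(rownames(de1), rownames(de2))
m1 <- de1[common, ]
m2 <- de2[common, ]

## Scatterplot and correlation of the t-statistics
t_cor <- cor(m1$t, m2$t)
plot(
    m1$t, m2$t,
    xlab = "t-statistic (dataset 1)",
    ylab = "t-statistic (dataset 2)",
    main = paste("Correlation:", signif(t_cor, 3))
)
abline(0, 1, lty = 2)

## DEGs with FDR < 5% in both datasets
both_fdr <- common[m1$adj.P.Val < 0.05 & m2$adj.P.Val < 0.05]
## DEGs with FDR < 5% in dataset 1 and p-value < 5% in dataset 2
replicated <- common[m1$adj.P.Val < 0.05 & m2$P.Value < 0.05]
length(both_fdr)
length(replicated)

## Hypothetical helper: fraction of genes shared among the top n of each ranking
cat_curve <- function(p1, p2, ids, top_n) {
    rank1 <- ids[order(p1)]
    rank2 <- ids[order(p2)]
    vapply(top_n, function(n) {
        length(intersect(head(rank1, n), head(rank2, n))) / n
    }, numeric(1))
}

top_n <- seq(10, 500, by = 10)
concordance <- cat_curve(m1$P.Value, m2$P.Value, common, top_n)
plot(
    top_n, concordance,
    type = "l", ylim = c(0, 1),
    xlab = "Top n genes", ylab = "Concordance"
)
```

With real results, replace de1 and de2 with your own tables and check that both use the same type of gene ID (for example, Ensembl IDs with or without version suffixes) before calling intersect().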