4 smokingMouse RSE

Instructor: Daianna

Now that you have reviewed how to use RangedSummarizedExperiment (RSE) objects in R, we’ll start exploring the smokingMouse data.

As previously described, the smoking mouse project is a complex study with more than 200 samples from brain and blood, adult mice and pups, pregnant and non-pregnant mice, etc. The complete datasets of this project can be downloaded with the smokingMouse package (Gonzalez-Padilla and Collado-Torres, 2023). See the package documentation for more details.

4.1 Download data

For illustrative purposes, we’ll use the nicotine data at the gene level, which resides in a RangedSummarizedExperiment (RSE) object called rse_gene. We’ll use BiocFileCache to keep the data in a local cache, so that running this example again doesn’t require re-downloading the data from the web.

## Load the container package for this type of data
library("SummarizedExperiment")

## Download and cache the file
library("BiocFileCache")
bfc <- BiocFileCache::BiocFileCache()
cached_rse_gene <- BiocFileCache::bfcrpath(
    x = bfc,
    "https://github.com/LieberInstitute/SPEAQeasyWorkshop2023/raw/devel/provisional_data/rse_gene_mouse_RNAseq_nic-smo.Rdata"
)
#> adding rname 'https://github.com/LieberInstitute/SPEAQeasyWorkshop2023/raw/devel/provisional_data/rse_gene_mouse_RNAseq_nic-smo.Rdata'

## Check the local path on our cache
cached_rse_gene
#>                                                                                  BFC2 
#> "/github/home/.cache/R/BiocFileCache/48f3eb1e88e_rse_gene_mouse_RNAseq_nic-smo.Rdata"

## Load the rse_gene object
load(cached_rse_gene, verbose = TRUE)
#> Loading objects:
#>   rse_gene

## General overview of the object
rse_gene
#> class: RangedSummarizedExperiment 
#> dim: 55401 208 
#> metadata(1): Obtained_from
#> assays(2): counts logcounts
#> rownames(55401): ENSMUSG00000102693.1 ENSMUSG00000064842.1 ... ENSMUSG00000064371.1 ENSMUSG00000064372.1
#> rowData names(13): Length gencodeID ... DE_in_pup_brain_nicotine DE_in_pup_brain_smoking
#> colnames: NULL
#> colData names(71): SAMPLE_ID FQCbasicStats ... retained_after_QC_sample_filtering
#>   retained_after_manual_sample_filtering

4.2 Data overview

4.2.1 Assays

The dataset rse_gene contains the following assays:

  • counts: original read counts of the 55,401 mouse genes across 208 samples (including the 65 nicotine samples of interest).
  • logcounts: normalized and scaled counts (\(\log_2(\mathrm{CPM} + 0.5)\)) of the same genes across the same samples; normalization was carried out with the TMM method, using cpm(calcNormFactors()) from edgeR.
## Explore main assay (of raw counts)
assay(rse_gene)[1:3, 1:3] ## counts for first 3 genes and 3 samples
#>                      [,1] [,2] [,3]
#> ENSMUSG00000102693.1    0    0    0
#> ENSMUSG00000064842.1    0    0    0
#> ENSMUSG00000051951.5  811  710  812
## Access the same raw data with assays()
assays(rse_gene)$counts[1:3, 1:3]
#>                      [,1] [,2] [,3]
#> ENSMUSG00000102693.1    0    0    0
#> ENSMUSG00000064842.1    0    0    0
#> ENSMUSG00000051951.5  811  710  812
## Access lognorm counts
assays(rse_gene)$logcounts[1:3, 1:3]
#>                           [,1]      [,2]      [,3]
#> ENSMUSG00000102693.1 -5.985331 -5.985331 -5.985331
#> ENSMUSG00000064842.1 -5.985331 -5.985331 -5.985331
#> ENSMUSG00000051951.5  4.509114  4.865612  4.944597
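
To make the \(\log_2(\mathrm{CPM} + 0.5)\) definition of logcounts more concrete, here is a minimal base R sketch on a made-up count matrix. The object names (toy_counts, lib_size, toy_cpm, toy_logcounts) are hypothetical, and plain library sizes are used instead of TMM-normalized ones, so the numbers will not match the logcounts assay shown above, which was computed with edgeR.

```r
## Toy raw counts (made-up values): 3 genes (rows) x 2 samples (columns)
toy_counts <- matrix(
    c(0, 10, 90, 5, 45, 450),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), paste0("sample", 1:2))
)

## Library size: total counts per sample (no TMM scaling in this sketch)
lib_size <- colSums(toy_counts)

## Counts per million: divide each column by its library size, then scale
toy_cpm <- sweep(toy_counts, 2, lib_size, "/") * 1e6

## Literal reading of the formula in the text: log2(CPM + 0.5)
toy_logcounts <- log2(toy_cpm + 0.5)
toy_logcounts
```

With this reading, a gene with zero counts gets the same value in every sample, which is why unexpressed genes share one constant value in a log-scaled assay.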

4.2.2 Sample data

  • Yellow variables correspond to SPEAQeasy outputs that are going to be used in downstream analyses.
  • Pink variables are specific to the study, such as sample metadata and some others containing additional information about the genes.
  • Blue variables are quality-control metrics computed by addPerCellQC() of scuttle.

The same RSE contains the sample information in colData(RSE):

  • SAMPLE_ID : name of the sample.
  • ERCCsumLogErr : a summary statistic quantifying overall difference of expected and actual ERCC concentrations for one sample. For more about ERCC check their product page at https://www.thermofisher.com/order/catalog/product/4456740.
  • overallMapRate : the decimal fraction of reads which successfully mapped to the reference genome (i.e. numMapped / numReads).
  • mitoMapped : the number of reads which successfully mapped to the mitochondrial chromosome.
  • totalMapped : the number of reads which successfully mapped to the canonical sequences in the reference genome (excluding mitochondrial chromosomes).
  • mitoRate : the decimal fraction of reads which mapped to the mitochondrial chromosome, of those which map at all (i.e. mitoMapped / (totalMapped + mitoMapped)).
  • totalAssignedGene : the decimal fraction of reads assigned unambiguously to a gene (including mitochondrial genes), with featureCounts (Liao et al. 2014), of those in total.
  • rRNA_rate : the decimal fraction of reads assigned to a gene whose type is ‘rRNA’, of those assigned to any gene.
  • Tissue : tissue (mouse brain or blood) from which the sample comes.
  • Age : whether the sample comes from an adult or a pup mouse.
  • Sex : whether the sample comes from a female (F) or a male (M) mouse.
  • Expt : the experiment (nicotine or smoking exposure) that the sample’s mouse was part of; the mouse can be either an exposed or a control mouse of that experiment.
  • Group : whether the sample belongs to a nicotine/smoking-exposed mouse (Expt) or a nicotine/smoking control mouse (Ctrl).
  • plate : the plate (1, 2 or 3) in which the sample library was prepared.
  • Pregnancy : whether the sample comes from a pregnant (Yes) or a non-pregnant (No) mouse.
  • medium : the medium in which the sample was treated: water for brain samples and an elution buffer (EB) for blood samples.
  • flowcell : the sequencing batch of each sample.
  • sum : library size (total sum of counts across all genes for each sample).
  • detected : number of genes with non-zero counts in each sample.
  • subsets_Mito_sum : total read counts of mitochondrial (mt) genes in each sample.
  • subsets_Mito_detected : number of mt genes with non-zero counts in each sample.
  • subsets_Mito_percent : percentage of the sample’s total counts that come from mt genes.
  • subsets_Ribo_sum : total read counts of ribosomal genes in each sample.
  • subsets_Ribo_detected : number of ribosomal genes with non-zero counts in each sample.
  • subsets_Ribo_percent : percentage of the sample’s total counts that come from ribosomal genes.
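
The two rate definitions given in the list above (overallMapRate and mitoRate) can be checked by hand. The following is a small sketch with made-up numbers; the data frame toy_stats and its values are hypothetical and only illustrate the arithmetic, not how SPEAQeasy computes these metrics internally.

```r
## Made-up mapping statistics for two hypothetical samples
toy_stats <- data.frame(
    SAMPLE_ID = c("toy_1", "toy_2"),
    numReads = c(1250, 2000),
    numMapped = c(1000, 1500),
    totalMapped = c(800, 1400),
    mitoMapped = c(200, 100)
)

## overallMapRate = numMapped / numReads
toy_stats$overallMapRate <- toy_stats$numMapped / toy_stats$numReads

## mitoRate = mitoMapped / (totalMapped + mitoMapped)
toy_stats$mitoRate <- toy_stats$mitoMapped /
    (toy_stats$totalMapped + toy_stats$mitoMapped)

toy_stats[, c("SAMPLE_ID", "overallMapRate", "mitoRate")]
```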

Note: in our case, we’ll use samples from the nicotine experiment only, so all samples come from brain and were treated in water.

## Data for first 3 samples and 5 variables
colData(rse_gene)[1:3, 1:5]
#> DataFrame with 3 rows and 5 columns
#>     SAMPLE_ID FQCbasicStats perBaseQual perTileQual  perSeqQual
#>   <character>   <character> <character> <character> <character>
#> 1 Sample_2914          PASS        PASS        PASS        PASS
#> 2 Sample_4042          PASS        PASS        PASS        PASS
#> 3 Sample_4043          PASS        PASS        PASS        PASS
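
To see what the sum, detected and subsets_Mito_* variables measure, the sketch below recomputes them by hand in base R on a made-up count matrix. It follows the definitions listed above and is not scuttle’s implementation; toy_counts, is_mito and qc are hypothetical names.

```r
## Toy raw counts (made-up values): 4 genes x 2 samples;
## the first two genes play the role of mitochondrial genes
toy_counts <- matrix(
    c(10, 0, 30, 60,
      0, 0, 20, 80),
    nrow = 4,
    dimnames = list(c("mt_gene1", "mt_gene2", "gene3", "gene4"), c("s1", "s2"))
)
is_mito <- c(TRUE, TRUE, FALSE, FALSE)
mito_counts <- toy_counts[is_mito, , drop = FALSE]

qc <- data.frame(
    sum = colSums(toy_counts), ## library size
    detected = colSums(toy_counts > 0), ## genes with non-zero counts
    subsets_Mito_sum = colSums(mito_counts), ## counts in mt genes
    subsets_Mito_detected = colSums(mito_counts > 0) ## mt genes with non-zero counts
)

## Percentage of each sample's counts that come from mt genes
qc$subsets_Mito_percent <- 100 * qc$subsets_Mito_sum / qc$sum
qc
```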

4.2.3 Gene Information

Among the information in rowData(RSE), the following variables are of interest for the analysis:

  • gencodeID : GENCODE ID of each gene.
  • ensemblID : gene ID in Ensembl database.
  • EntrezID : identifier of each gene in NCBI Entrez database.
  • Symbol : official gene symbol for each mouse gene.
  • retained_after_feature_filtering : Boolean variable that equals TRUE if the gene passed the gene filtering (with filterByExpr() of edgeR) based on its expression levels and FALSE if not.
  • DE_in_adult_brain_nicotine : Boolean variable that equals TRUE if the feature is differentially expressed (DE) in adult brain samples exposed to nicotine and FALSE if not.
  • DE_in_pup_brain_nicotine : Boolean variable that equals TRUE if the feature is differentially expressed (DE) in pup brain samples exposed to nicotine and FALSE if not.
## Data for first 3 genes and 5 variables
rowData(rse_gene)[1:3, 1:5]
#> DataFrame with 3 rows and 5 columns
#>                         Length            gencodeID          ensemblID      gene_type    EntrezID
#>                      <integer>          <character>        <character>    <character> <character>
#> ENSMUSG00000102693.1      1070 ENSMUSG00000102693.1 ENSMUSG00000102693            TEC       71042
#> ENSMUSG00000064842.1       110 ENSMUSG00000064842.1 ENSMUSG00000064842          snRNA          NA
#> ENSMUSG00000051951.5      6094 ENSMUSG00000051951.5 ENSMUSG00000051951 protein_coding      497097
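
The Boolean variables in rowData() can be used directly to subset genes. Below is a minimal sketch on a toy object; toy_rse and its gene symbols are made up, and it assumes the flags are stored as logical vectors, as the descriptions above suggest.

```r
library("SummarizedExperiment")

## Toy stand-in for rse_gene: 4 genes x 3 samples with made-up gene data
toy_rse <- SummarizedExperiment(
    assays = list(counts = matrix(1:12, nrow = 4)),
    rowData = DataFrame(
        Symbol = c("GeneA", "GeneB", "GeneC", "GeneD"),
        retained_after_feature_filtering = c(TRUE, FALSE, TRUE, TRUE),
        DE_in_pup_brain_nicotine = c(FALSE, FALSE, TRUE, FALSE)
    )
)

## Keep only the genes that passed the expression filter
toy_filtered <- toy_rse[rowData(toy_rse)$retained_after_feature_filtering, ]
dim(toy_filtered)

## Count DE and non-DE genes, then list the symbols of the DE ones
table(rowData(toy_rse)$DE_in_pup_brain_nicotine)
de_symbols <- rowData(toy_rse)$Symbol[rowData(toy_rse)$DE_in_pup_brain_nicotine]
de_symbols
```

The same pattern should apply to rse_gene by replacing toy_rse with the real object.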

📑 Exercise 1: How would you access data of a specific sample variable?
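
As a hint, here is one possible approach, sketched on a toy object rather than on rse_gene itself. The object toy_rse and its values are made up; the accessors used (the $ shortcut and colData()) are the ones provided by SummarizedExperiment.

```r
library("SummarizedExperiment")

## Toy stand-in for rse_gene: 2 genes x 4 samples with made-up sample data
toy_rse <- SummarizedExperiment(
    assays = list(counts = matrix(1:8, nrow = 2)),
    colData = DataFrame(
        SAMPLE_ID = paste0("toy_", 1:4),
        Age = c("Adult", "Pup", "Pup", "Adult")
    )
)

## Shortcut: $ on the object looks up a column of colData()
age_short <- toy_rse$Age

## Equivalent, more explicit forms
age_explicit <- colData(toy_rse)$Age
age_by_name <- colData(toy_rse)[, "Age"]

age_short
```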

As mentioned above, we’ll use only the nicotine data at the gene level, so let’s subset the object to those samples.

## Original dimensions of the data
dim(rse_gene)
#> [1] 55401   208
## Keep only the samples from the nicotine experiment
rse_gene_nic <- rse_gene[, which(rse_gene$Expt == "Nicotine")]
## New dimensions
dim(rse_gene_nic)
#> [1] 55401    65

📑 Exercise 2: How could you check that all samples are from the nicotine experiment?

📑 Exercise 3: How many nicotine samples correspond to adults and how many to pups? How many pups were males and how many were females?
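
One possible way to approach Exercises 2 and 3 is with table() on the sample variables. The sketch below uses a toy object with made-up metadata (toy_rse, toy_nic), so the counts it prints are not the answers for the real data; the same calls should work on rse_gene_nic.

```r
library("SummarizedExperiment")

## Toy stand-in for rse_gene: 2 genes x 6 samples with made-up sample data
toy_rse <- SummarizedExperiment(
    assays = list(counts = matrix(1:12, nrow = 2)),
    colData = DataFrame(
        Expt = c("Nicotine", "Smoking", "Nicotine", "Nicotine", "Smoking", "Nicotine"),
        Age = c("Adult", "Adult", "Pup", "Pup", "Pup", "Adult"),
        Sex = c("F", "F", "M", "F", "M", "F")
    )
)

## Same subsetting step as in the text
toy_nic <- toy_rse[, which(toy_rse$Expt == "Nicotine")]

## Exercise 2: only one experiment should remain
table(toy_nic$Expt)
only_nicotine <- all(toy_nic$Expt == "Nicotine")

## Exercise 3: samples per age group, then age by sex
age_counts <- table(toy_nic$Age)
age_by_sex <- table(Age = toy_nic$Age, Sex = toy_nic$Sex)
age_by_sex
```

The "Pup" row of the two-way table gives the number of female and male pups.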
