6 RSE plotting basics with smokingMouse

Instructor: Daianna

6.1 Download the smokingMouse data

We’ve used these commands already in an earlier chapter. Here we’ll re-download the data, though technically we are using the cached data we had previously downloaded so this time the code below will run faster.

## Load the container package for this type of data
library("SummarizedExperiment")

## Download data
library("BiocFileCache")
bfc <- BiocFileCache::BiocFileCache()
cached_rse_gene <- BiocFileCache::bfcrpath(
    x = bfc,
    "https://github.com/LieberInstitute/SPEAQeasyWorkshop2023/raw/devel/provisional_data/rse_gene_mouse_RNAseq_nic-smo.Rdata"
)

## Check the local path on our cache
cached_rse_gene
#>                                                                                  BFC2 
#> "/github/home/.cache/R/BiocFileCache/48f3eb1e88e_rse_gene_mouse_RNAseq_nic-smo.Rdata"

## Load the rse_gene object
load(cached_rse_gene, verbose = TRUE)
#> Loading objects:
#>   rse_gene

## Nicotine data
rse_gene_nic <- rse_gene[, which(rse_gene$Expt == "Nicotine")]

In this part of the course you’ll have a plotting session in which you’ll learn what type of plots you can create to explore your data, but please note that this is not a differential expression analysis itself. The steps that will be performed are shown in Figure 1. The main objective of this first part is to explore the quality of the samples, their differences in gene expression variations and the impact of sample variables on gene expression variance. You’ll also learn how to visualize gene expression counts.

**Figure 1**: **Summary of the analyses for differential expression.** Steps in yellow will be properly performed. **1. Data Preparation**: in this initial part counts are normalized and scaled and low-expressed genes are filtered. **2. Exploratory Data Analysis**: the quality of the samples is compared, poor-quality samples are removed and both gene-level and sample-level effects are explored. **3. Differential Expression Analysis**: a linear model is fitted for each gene and log fold changes are obtained for the contrast of interest; other statistics are also computed and compared. Here the differentially expressed genes (DEGs) are determined and quantified. **4. GO & KEGG Analyses**: in this part Gene Ontology and KEGG enrichment analyses are performed to identify biological processes and functions in which DEGs are significantly involved. **5. DEG visualization**: heatmaps are created to visually contrast gene expression levels of DEGs in control and experimental samples. Abbreviations: CPM: counts per million; QC: quality control; PC: principal component; DEG: differentially expressed genes; GO: Gene Ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes.

6.2 Data preparation

Even before exploring the samples, we must normalize the counts and filter non-expressed genes; we won’t do the processes themselves since we already have normalized and filtered data but let’s check why these steps are important and where to extract the data we need for posterior analyses.

6.2.1 Data normalization

Data normalization is a relevant preliminary step when working with expression data because raw counts do not necessarily reflect real expression measures of the genes, since there are technical differences in the way the libraries are prepared and sequenced, as well as intrinsic differences in the genes that are translated into more or less mapped reads. Particularly, there are within-sample effects that are the differences between genes in the same sample, such as their length (the longer the gene, the more reads it will have) and GC content, factors that contribute to variations in their counts. On the other hand, between-sample effects are differences between samples such as the sequencing depth, i.e., the total number of molecules sequenced, and the library size, i.e., the total number of reads of each sample [1].

These variables lead to virtually different mRNA amounts but of course are not due to the biological or treatment conditions of interest (such as nicotine administration in this example) so in order to remove, or at least, to minimize this technical bias and obtain measures comparable and consistent across samples, raw counts must be normalized by these factors. The data that we’ll use in this case are already normalized in assays(rse_gene)$logcounts. Specifically, the assay contains counts per million (CPM), also known as reads per million (RPM), one basic gene expression unit that only normalizes by the sequencing depth and is computed by dividing the read counts of a gene in a sample by a scaling factor given by the total mapping reads of the sample per million [2]:

\[CPM = \frac{read \ \ counts \ \ of \ \ gene \ \ \times \ \ 10^6 }{Total \ \ mapping \ \ reads \ \ of \ \ sample}\]

As outlined in Data overview and download, the scaling factors were obtained with calcNormFactors() applying the Trimmed Mean of M-Values (TMM) method, the edgeR package’s default normalization method that assumes that most genes are not differentially expressed. The effective library sizes of the samples and the CPM of each observation were computed with the edgeR function cpm() setting the log argument to TRUE and prior.count to 0.5 to receive values in $log_2(CPM+0.5)$.

After data normalization and scaling, we’d expect the read counts to follow a normal distribution, something we can confirm by comparing the counts’ distribution before and after the normalization. Consider both datasets contain the exact same genes.

library("ggplot2")

## Histogram and density plot of read counts before and after normalization

## Raw counts
counts_data <- data.frame(counts = as.vector(assays(rse_gene_nic)$counts))
plot <- ggplot(counts_data, aes(x = counts)) +
    geom_histogram(colour = "black", fill = "lightgray") +
    labs(x = "read counts", y = "Frecuency") +
    theme_classic()
plot + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))


## Normalized counts
logcounts_data <- data.frame(logcounts = as.vector(assays(rse_gene_nic)$logcounts))
plot <- ggplot(logcounts_data, aes(x = logcounts)) +
    geom_histogram(aes(y = ..density..), colour = "darkgray", fill = "lightgray") +
    theme_classic() +
    geom_density(fill = "#69b3a2", alpha = 0.3) +
    labs(x = "log(CPM+0.5)", y = "Frecuency")
plot + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))

As presented, after data transformation, we can now see a more widespread distribution of the counts, but note that most of them are zeros in the first plot (the one with the raw counts) and those zeros remain after normalization, corresponding to counts below 0 in the second plot. That is because we haven’t filtered the low and zero-expressed genes.

6.2.2 Gene filtering

Lowly-expressed or non-expressed genes in many samples are not of biological interest in a study of differential expression because they don’t inform about the gene expression changes and they are, by definition, not differentially expressed, so we have to drop them using filterByExpr() from edgeR that only keeps genes with at least K CPM in n samples and with a minimum total number of counts across all samples.

## Retain genes that passed filtering step
rse_gene_filt <- rse_gene_nic[rowData(rse_gene_nic)$retained_after_feature_filtering == TRUE, ]

## Normalized counts and filtered genes
filt_logcounts_data <- data.frame(logcounts = as.vector(assays(rse_gene_filt)$logcounts))

## Plot
plot <- ggplot(filt_logcounts_data, aes(x = logcounts)) +
    geom_histogram(aes(y = ..density..), colour = "darkgray", fill = "lightgray") +
    theme_classic() +
    geom_density(fill = "#69b3a2", alpha = 0.3) +
    labs(x = "log(CPM+0.5)", y = "Frecuency")
plot + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))

In this third plot we can observe a curve that is closer (though not completely) to a normal distribution and with less lowly-expressed genes.

With the object rse_gene_filt we can proceed with downstream analyses.

6.3 Exploratory Data Analysis

The first formal step that we will be performing is the sample exploration. This crucial initial part of the analysis consists of an examination of differences and relationships between Quality-Control (QC) metrics of the samples from different groups in order to identify poor-quality samples that must be removed before DEA. After that, the sample variables in the metadata also need to be analyzed and filtered based on the percentage of gene expression variance that they explain for each gene.

6.4 Quality Control Analysis

First we have to explore and compare the the quality-control metrics of the samples in the different groups given by covariates such as age, sex, pregnancy state, group, plate and flowcell. See Sample Information section in chapter 04 for a description of these variables.

❓ Why is that relevant? As you could imagine, technical and methodological differences in all the steps that were carried out during the experimental stages are potential sources of variation in the quality of the samples. Just imagine all that could have been gone wrong or unexpected while experimenting with mice, during the sampling, in the RNA extraction using different batches, when treating samples in different mediums, when preparing libraries in different plates and sequencing in different flowcells. Moreover, the inherent features of the mice from which the samples come from such as age, tissue, sex and pregnancy state could also affect the samples’ metrics if, for example, they were separately analyzed and processed.

❓But why do we care about mitochodrial and ribosomal counts as QC metrics? In the process of mRNA extraction, either by mRNA enrichment (capturing polyadenilated mRNAs) or rRNA-depletion (removing rRNA), we’d expect to have a low number of ribosomal counts, i.e., counts that map to rDNA, and if we don’t, something must have gone wrong with the procedures. In the case of mitochondrial counts something similar occurs: higher mitochondrial rates will be obtained if the cytoplasmic mRNA capture was deficient or if the transcripts were lost by some technical issue, increasing the proportion of mitochondrial mRNAs. As a result, high mitoRate and rRNA_rate imply poor quality in the samples.

Note: the QC metrics were computed with the unprocessed datasets (neither filtered nor normalized) to preserve the original estimates of the samples.

6.4.1 Evaluate QC metrics for groups of samples

Fortunately, we can identify to some extent possible factors that could have influenced on the quality of the samples, as well as isolated samples that are problematic. To do that, we will create boxplots that present the distribution of the samples’ metrics separating them by sample variables.

library("Hmisc")
library("stringr")
library("cowplot")

## Define QC metrics of interest
qc_metrics <- c("mitoRate", "overallMapRate", "totalAssignedGene", "rRNA_rate", "sum", "detected", "ERCCsumLogErr")

## Define sample variables of interest
sample_variables <- c("Group", "Age", "Sex", "Pregnancy", "plate", "flowcell")


## Function to create boxplots of QC metrics for groups of samples

QC_boxplots <- function(qc_metric, sample_var) {
    ## Define sample colors depending on the sample variable
    if (sample_var == "Group") {
        colors <- c("Control" = "brown2", "Experimental" = "deepskyblue3")
    } else if (sample_var == "Age") {
        colors <- c("Adult" = "slateblue3", "Pup" = "yellow3")
    } else if (sample_var == "Sex") {
        colors <- c("F" = "hotpink1", "M" = "dodgerblue")
    } else if (sample_var == "Pregnancy") {
        colors <- c("Yes" = "darkorchid3", "No" = "darkolivegreen4")
    } else if (sample_var == "plate") {
        colors <- c("Plate1" = "darkorange", "Plate2" = "lightskyblue", "Plate3" = "deeppink1")
    } else if (sample_var == "flowcell") {
        colors <- c(
            "HKCG7DSXX" = "chartreuse2", "HKCMHDSXX" = "magenta", "HKCNKDSXX" = "turquoise3",
            "HKCTMDSXX" = "tomato"
        )
    }

    ## Axis labels
    x_label <- capitalize(sample_var)
    y_label <- str_replace_all(qc_metric, c("_" = ""))

    ## x-axis text angle and position
    if (sample_var == "flowcell") {
        x_axis_angle <- 18
        x_axis_hjust <- 0.5
        x_axis_vjust <- 0.7
        x_axis_size <- 4
    } else {
        x_axis_angle <- 0
        x_axis_hjust <- 0.5
        x_axis_vjust <- 0.5
        x_axis_size <- 6
    }

    ## Extract sample data in colData(rse_gene_filt)
    data <- data.frame(colData(rse_gene_filt))

    ## Sample variable separating samples in x-axis and QC metric in y-axis
    ## (Coloring by sample variable)
    plot <- ggplot(data = data, mapping = aes(x = !!rlang::sym(sample_var), y = !!rlang::sym(qc_metric), color = !!rlang::sym(sample_var))) +
        ## Add violin plots
        geom_violin(alpha = 0, size = 0.4, color = "black", width = 0.7) +
        ## Spread dots
        geom_jitter(width = 0.08, alpha = 0.7, size = 1.3) +
        ## Add boxplots
        geom_boxplot(alpha = 0, size = 0.4, width = 0.1, color = "black") +
        ## Set colors
        scale_color_manual(values = colors) +
        ## Define axis labels
        labs(y = y_label, x = x_label) +
        ## Get rid of the background
        theme_bw() +
        ## Hide legend and define plot margins and size of axis title and text
        theme(
            legend.position = "none",
            plot.margin = unit(c(0.5, 0.4, 0.5, 0.4), "cm"),
            axis.title = element_text(size = 7),
            axis.text = element_text(size = x_axis_size),
            axis.text.x = element_text(angle = x_axis_angle, hjust = x_axis_hjust, vjust = x_axis_vjust)
        )

    return(plot)
}



## Plots of all QC metrics for each sample variable
multiple_QC_boxplots <- function(sample_var) {
    i <- 1
    plots <- list()
    for (qc_metric in qc_metrics) {
        ## Call function to create each individual plot
        plots[[i]] <- QC_boxplots(qc_metric, sample_var)
        i <- i + 1
    }
    ## Arrange multiple plots into a grid
    print(plot_grid(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]], plots[[6]], plots[[7]], nrow = 2))
}

multiple_QC_boxplots("Age")

Initially, when we separate samples by Age, we can appreciate a clear segregation of adult and pup samples in mitoRate , with higher mitochondrial rates for adult samples and thus, being lower quality samples than the pup ones. We can also see that pup samples have higher totalAssignedGene , again being higher quality. The samples are very similar in the rest of the QC metrics. The former differences must be taken into account because they guide further sample separation by Age, which is necessary to avoid dropping most of the adult samples (that are lower quality) in the QC-based sample filtering (see below) and to prevent misinterpreting sample variation given by quality and not by mouse age itself.

multiple_QC_boxplots("Sex")

With Sex, female and male samples have similar QC metrics but there are some female samples that have high mitoRate and low totalAssignedGene , but males don´t. That is consistent with the fact that all males are pups and as we previously observed, pup samples present better QC metrics, but since not all pups are males, some female pups also have smaller mitoRate and greater totalAssignedGene . So in this case samples’ differences are actually dictated by Age and not by Sex.

multiple_QC_boxplots("Group")

Notably, for Group no evident contrasts are seen in the quality of the samples, which means that both control and exposed samples have similar metrics and therefore the differences between them won’t be determined by technical factors but effectively by gene expression changes.

multiple_QC_boxplots("Pregnancy")

Pregnancy variable is interesting because pregnant dams were obviously females and adults, which we already noted are lower quality. Accordingly to that, pregnant samples present overall higher mitoRate and lower totalAssignedGene , sum (library size) and detected (number of expressed genes) than some samples coming from non-pregnant mice. However, these samples have smaller | ERCCsumLogErr |, which means that they had smaller differences between expected and observed concentrations of control transcripts.

Notwithstanding, we must clarify one more time that as in Sex, the trends observed in Pregnancy are all given by Age: it is Age the variable that fully segregates samples in mitoRate and almost completely in , lower totalAssignedGene . Because all pregnant dams were adults, their metrics will be positioned where the adult samples were, but there were also not-pregnant adults that share similar QC values.

multiple_QC_boxplots("plate")

In plate, no alarming differences are presented but some samples from the 1st plate have low (though not much lower) overallMapRate , totalAssignedGene and sum .

multiple_QC_boxplots("flowcell")

For the flowcell, again no worrying distinctions are seen, with the exception of a few individual samples far from the rest.

In all the previous plots we can appreciate a group of samples placed below in mitoRate that correspond to pup samples. The relationship is more fuzzy in totalAssignedGene .

6.4.2 Examine relationships between QC metrics of samples

Secondly, we want to analyze if there are relationships between QC variables that explain the presence of samples with two or more particular QC metrics. For that, scatterplots are useful because they enable the evaluation of pairs of samples’ QC metrics.

## Scatterplots for a pair of QC metrics

QC_scatterplots <- function(sample_var, qc_metric1, qc_metric2) {
    ## Define sample colors
    if (sample_var == "Group") {
        colors <- c("Control" = "brown2", "Experimental" = "deepskyblue3")
    } else if (sample_var == "Age") {
        colors <- c("Adult" = "slateblue3", "Pup" = "yellow3")
    } else if (sample_var == "Sex") {
        colors <- c("F" = "hotpink1", "M" = "dodgerblue")
    } else if (sample_var == "Pregnancy") {
        colors <- c("Yes" = "darkorchid3", "No" = "darkolivegreen4")
    } else if (sample_var == "plate") {
        colors <- c("Plate1" = "darkorange", "Plate2" = "lightskyblue", "Plate3" = "deeppink1")
    } else if (sample_var == "flowcell") {
        colors <- c(
            "HKCG7DSXX" = "chartreuse2", "HKCMHDSXX" = "magenta", "HKCNKDSXX" = "turquoise3",
            "HKCTMDSXX" = "tomato"
        )
    }

    data <- colData(rse_gene_filt)

    ## Scatterplots for continuous variable vs continuous variable
    ## First QC metric in x-axis and second QC metric in y-axis
    plot <- ggplot(as.data.frame(data), aes(
        x = !!rlang::sym(qc_metric1),
        y = !!rlang::sym(qc_metric2),
        ## Color samples by a variable
        color = !!rlang::sym(sample_var)
    )) +
        ## Add scatterplot
        geom_point(size = 1) +
        ## Add regression line
        stat_smooth(geom = "line", alpha = 0.4, size = 0.4, span = 0.25, method = lm, color = "orangered3") +
        ## Colors
        scale_color_manual(name = sample_var, values = colors) +
        theme_bw() +
        ## Add Pearson correlation coefficient between the metrics as subtitle
        labs(
            subtitle = paste0("Corr: ", signif(cor(data[, qc_metric1], data[, qc_metric2], method = "pearson"), digits = 3)),
            ## Add axis labels
            y = gsub("_", " ", qc_metric2),
            x = gsub("_", " ", qc_metric1)
        ) +
        ## Plot margins and text size
        theme(
            plot.margin = unit(c(0.1, 1.2, 0.1, 1.2), "cm"),
            axis.title = element_text(size = (7)),
            axis.text = element_text(size = (6)),
            plot.subtitle = element_text(size = 7, color = "gray40"),
            legend.text = element_text(size = 6),
            legend.title = element_text(size = 7)
        )
    return(plot)
}



## QC scatterplots coloring by all sample variables
multiple_QC_scatterplots <- function(qc_metric1, qc_metric2) {
    sample_variables <- c("Age", "Sex", "plate", "Pregnancy", "Group", "flowcell")

    i <- 1
    plots <- list()
    for (sample_var in sample_variables) {
        plots[[i]] <- QC_scatterplots(sample_var, qc_metric1, qc_metric2)
        i <- i + 1
    }
    plot_grid(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]], plots[[6]], nrow = 3, rel_widths = c(1, 1))
}

multiple_QC_scatterplots("mitoRate", "rRNA_rate")

For mitoRate vs rRNA_rate plots there’s a negligible positive correlation which was anticipated since no obvious relationship exists between them, though if cytoplasmic mRNAs weren’t well captured (lowering the total number of mapping reads) we could expect an increase in both mitochondrial and ribosomal rates.

Note the complete separation of samples in mitoRate (but not in rRNA_rate ) by Age, which in turn causes the grouping of male samples (all pups).

multiple_QC_scatterplots("mitoRate", "totalAssignedGene")

Forcefully, mitoRate and totalAssignedGene have a strong negative correlation, which reveals the existence of very bad quality samples (with small totalAssignedGene and high mitoRate ).

multiple_QC_scatterplots("sum", "detected")

Noticeably, a positive correlation is present for sum and detected and that is also congruent because a higher number of expressed-genes implies bigger library sizes, but that is not necessarily true in the other way around: just a few highly expressed genes (small number of detected genes) could increase the library size. Interestingly, in this case samples are diagonally separated by age.

multiple_QC_scatterplots("sum", "totalAssignedGene")

Contrary to expectations, sum is not correlated to totalAssignedGene and this is unexpected because higher proportions of genes’ reads could initially increase library size. However, since totalAssignedGene is a fraction of the total reads, it could be that a high fraction of reads mapping to genes reflects a small number of total reads and does not necessarily imply more expressed genes (and therefore bigger libraries). In other words, even when totalAssignedGene equals 1, if the total number of reads is small, the genes won’t be widely covered by them and the libraries won’t be sizeable. Observe the separarion of samples by Age.

multiple_QC_scatterplots("detected", "totalAssignedGene")

Lastly, there’s a slight positive correlation between totalAssignedGene and detected , which makes sense because reads assigned to genes rise the number of non-zero expressed genes but as mentioned, those genes are not necessarily highly expressed and thus, library sizes are not perforce so much bigger. Also note that the correlation wouldn’t be conserved within pup samples.

📑 Exercise 1: What’s the correlation between subsets_Mito_sum and mitoMapped ? What factor could explain that correlation? Why are mitoMapped numbers bigger than subsets_Mito_sum values?

Remarkably, samples were not differentiated by any other variable (including Group!) but the differences between adult and pup metrics suggest that they must be splitted into two different groups for the following analyses.

6.4.3 QC sample filtering

After assessing how different or similar are the QC values between samples, we can now proceed to sample filtering based precisely, on these metrics. For that, we will use isOutlier() from scater to identify outlier samples only at the lower end or the higher end, depending on the QC metric.

library("scater")
library("rlang")
library("ggrepel")

## Separate data by Age
rse_gene_pups <- rse_gene_filt[, which(rse_gene_filt$Age == "Pup")]
rse_gene_adults <- rse_gene_filt[, which(rse_gene_filt$Age == "Adult")]


## Find outlier samples based on their QC metrics (samples that are 3 median-absolute-deviations away from the median)


## Filter all samples together

## Drop samples with lower library sizes (sum), detected number of genes and totalAssignedGene
outliers_library_size <- isOutlier(rse_gene_filt$sum, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_filt$detected, nmads = 3, type = "lower")
outliers_totalAssignedGene <- isOutlier(rse_gene_filt$totalAssignedGene, nmads = 3, type = "lower")
## Drop samples with higher mitoRates and rRNA rates
outliers_mito <- isOutlier(rse_gene_filt$mitoRate, nmads = 3, type = "higher")
outliers_rRNArate <- isOutlier(rse_gene_filt$rRNA_rate, nmads = 3, type = "higher")

## Keep not outlier samples
not_outliers <- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
rse_gene_qc <- rse_gene_filt[, not_outliers]

## Number of samples retained
dim(rse_gene_qc)[2]
#> [1] 39

## Add new variables to rse_gene_filt with info of samples retained/dropped
rse_gene_filt$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_filt$SAMPLE_ID, function(x) {
    if (x %in% rse_gene_qc$SAMPLE_ID) {
        "Retained"
    } else {
        "Dropped"
    }
}))



## Filter adult samples

outliers_library_size <- isOutlier(rse_gene_adults$sum, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_adults$detected, nmads = 3, type = "lower")
outliers_totalAssignedGene <- isOutlier(rse_gene_adults$totalAssignedGene, nmads = 3, type = "lower")
outliers_mito <- isOutlier(rse_gene_adults$mitoRate, nmads = 3, type = "higher")
outliers_rRNArate <- isOutlier(rse_gene_adults$rRNA_rate, nmads = 3, type = "higher")

not_outliers <- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
rse_gene_adults_qc <- rse_gene_adults[, not_outliers]

## Number of samples retained
dim(rse_gene_adults_qc)[2]
#> [1] 20

rse_gene_adults$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_adults$SAMPLE_ID, function(x) {
    if (x %in% rse_gene_adults_qc$SAMPLE_ID) {
        "Retained"
    } else {
        "Dropped"
    }
}))



## Filter pup samples

outliers_library_size <- isOutlier(rse_gene_pups$sum, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_pups$detected, nmads = 3, type = "lower")
outliers_totalAssignedGene <- isOutlier(rse_gene_pups$totalAssignedGene, nmads = 3, type = "lower")
outliers_mito <- isOutlier(rse_gene_pups$mitoRate, nmads = 3, type = "higher")
outliers_rRNArate <- isOutlier(rse_gene_pups$rRNA_rate, nmads = 3, type = "higher")

not_outliers <- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
rse_gene_pups_qc <- rse_gene_pups[, not_outliers]

## Number of samples retained
dim(rse_gene_pups_qc)[2]
#> [1] 41

rse_gene_pups$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_pups$SAMPLE_ID, function(x) {
    if (x %in% rse_gene_pups_qc$SAMPLE_ID) {
        "Retained"
    } else {
        "Dropped"
    }
}))

We already filtered outlier samples … but what have we removed? It is always important to trace the QC metrics of the filtered samples to verify that they really are poor quality. We don’t want to get rid of useful samples! So let’s go back to the QC boxplots but now color samples according to whether they passed or not the filtering step and also distinguishing samples’ groups by shape.

## Boxplots of QC metrics after sample filtering

## Boxplots
boxplots_after_QC_filtering <- function(rse_gene, qc_metric, sample_var) {
    ## Color samples
    colors <- c("Retained" = "deepskyblue", "Dropped" = "brown2")

    ## Sample shape by sample variables
    if (sample_var == "Group") {
        shapes <- c("Control" = 0, "Experimental" = 15)
    } else if (sample_var == "Age") {
        shapes <- c("Adult" = 16, "Pup" = 1)
    } else if (sample_var == "Sex") {
        shapes <- c("F" = 11, "M" = 19)
    } else if (sample_var == "Pregnancy") {
        shapes <- c("Yes" = 10, "No" = 1)
    } else if (sample_var == "plate") {
        shapes <- c("Plate1" = 12, "Plate2" = 5, "Plate3" = 4)
    } else if (sample_var == "flowcell") {
        shapes <- c(
            "HKCG7DSXX" = 3, "HKCMHDSXX" = 8, "HKCNKDSXX" = 14,
            "HKCTMDSXX" = 17
        )
    }

    y_label <- str_replace_all(qc_metric, c("_" = " "))

    data <- data.frame(colData(rse_gene))

    ## Median of the QC var values
    median <- median(eval(parse_expr(paste("rse_gene$", qc_metric, sep = ""))))
    ## Median-absolute-deviation of the QC var values
    mad <- mad(eval(parse_expr(paste("rse_gene$", qc_metric, sep = ""))))

    plot <- ggplot(data = data, mapping = aes(
        x = "", y = !!rlang::sym(qc_metric),
        color = !!rlang::sym("Retention_after_QC_filtering")
    )) +
        geom_jitter(alpha = 1, size = 2, aes(shape = eval(parse_expr((sample_var))))) +
        geom_boxplot(alpha = 0, size = 0.15, color = "black") +
        scale_color_manual(values = colors) +
        scale_shape_manual(values = shapes) +
        labs(x = "", y = y_label, color = "Retention after QC filtering", shape = sample_var) +
        theme_classic() +
        ## Median line
        geom_hline(yintercept = median, size = 0.5) +
        ## Line of median + 3 MADs
        geom_hline(yintercept = median + (3 * mad), size = 0.5, linetype = 2) +
        ## Line of median - 3 MADs
        geom_hline(yintercept = median - (3 * mad), size = 0.5, linetype = 2) +
        theme(
            axis.title = element_text(size = (9)),
            axis.text = element_text(size = (8)),
            legend.position = "right",
            legend.text = element_text(size = 8),
            legend.title = element_text(size = 9)
        )

    return(plot)
}

In the following plot we can confirm that taking all samples together (both adults and pups) all adult samples are dropped by their high mitoRate , but that is not desirable because we want to analyze these samples too, so we need to analyze pup and adult samples separately.

## Plots

## All samples together
p <- boxplots_after_QC_filtering(rse_gene_filt, "mitoRate", "Age")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))

For adults, only 3 controls are dropped, 3 of them by their mitoRate .

## Adult samples
p <- boxplots_after_QC_filtering(rse_gene_adults, "mitoRate", "Group")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))

For pups, 1 experimental sample was dropped by its rRNA_rate .

## Pup samples
p <- boxplots_after_QC_filtering(rse_gene_pups, "rRNA_rate", "Group")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))

📑 Exercise 2: Were the filtered adult samples from pregnant or not pregnant mice? From which plates and flowcells? What were their library sizes?

Try the function yourself with different QC metrics and sample variables!

6.5 Miscellaneous about CPM values

## Why do we see log(CPM + 0.5) values smaller than -1?
log2(0.5)
#> [1] -1

## Maybe the prior count is getting shrunk
log2(0.05)
#> [1] -4.321928

## logcounts values we see across all samples
summary(as.vector(assays(rse_gene)$logcounts))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -5.9853 -5.9853 -5.9853 -2.7514  0.4171 18.2732

## Sample 203 in particular has some cases like it
summary(assays(rse_gene)$logcounts[, 203])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -5.9853 -5.9853 -5.9853 -2.6620  0.6067 12.5785

## Where we find one gene in sample 203 with that property
i_gene <- which.min(assays(rse_gene)$logcounts[, 203])

## We can check the logcounts and counts
assays(rse_gene)$logcounts[i_gene, 203]
#> ENSMUSG00000102693.1 
#>            -5.985331
assays(rse_gene)$counts[i_gene, 203]
#> ENSMUSG00000102693.1 
#>                    0

## What if we try to reverse engineer the number?
2^assays(rse_gene)$logcounts[i_gene, 203]
#> ENSMUSG00000102693.1 
#>           0.01578469
log2(0.01578469)
#> [1] -5.98533
0.01578469 / 0.5
#> [1] 0.03156938
0.5 * 0.03156938
#> [1] 0.01578469

## Hm... 31 doesn't ring any bells
1 / 0.03156938
#> [1] 31.67626

Bioconductor website for browsing code: https://code.bioconductor.org/
Bioconductor support website: https://support.bioconductor.org/

Links Leo showed:

## Check the documentation
## ?edgeR::cpm
## > If log-values are computed, then a small count, given by prior.count but scaled to be proportional to the library size, is added to y to avoid taking the log of zero.

## https://github.com/LieberInstitute/smokingMouse_Indirects/blob/704692a357ec391348ebc3568188d41827328ba5/code/02_build_objects/02_build_objects.R#L100C51-L100C135
## Let's save the output of calcNormFactors
DGElist_with_norm <- edgeR::calcNormFactors(rse_gene, method = "TMM")
class(DGElist_with_norm)
#> [1] "DGEList"
#> attr(,"package")
#> [1] "edgeR"

## https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L9
## We can see that it has the lib.size and norm.factors values there
head(DGElist_with_norm$samples[, seq_len(3)])
#>         group lib.size norm.factors
#> Sample1     1 37119948    0.9601434
#> Sample2     1 24904754    0.9783991
#> Sample3     1 27083602    0.9740923
#> Sample4     1 48312479    0.9408362
#> Sample5     1 34921004    0.9772143
#> Sample6     1 31558549    0.9246274

## Exploring lib.size across all samples
summary(DGElist_with_norm$samples$lib.size)
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#>   6908739  26209304  29205003  29475155  34809355 114202414
DGElist_with_norm$samples$lib.size[203]
#> [1] 34869201

## https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L14
## We can see there how edgeR::cpm() computes the adjusted library sizes
summary(DGElist_with_norm$samples$norm.factors)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.3595  0.9656  1.1230  1.0306  1.1589  1.2336
DGElist_with_norm$samples$norm.factors[203]
#> [1] 1.19894

## adjusted library sizes
DGElist_with_norm$samples$lib.size[203] * DGElist_with_norm$samples$norm.factors[203]
#> [1] 41806065

## Let's save these values for all samples
adj_lib_size <- DGElist_with_norm$samples$lib.size * DGElist_with_norm$samples$norm.factors
adj_lib_size[203]
#> [1] 41806065

## Proportional adjusted library size to the mean adjusted library size
adj_lib_size[203] / mean(adj_lib_size)
#> [1] 1.319791
1 / (adj_lib_size[203] / mean(adj_lib_size))
#> [1] 0.7576956
## Hm.... we are missing something to get to the values we saw earlier when we
## tried to reverse engineer the issue.

## Hm.... I couldn't recalculate that -5.98 value manually
log2(0.5 / (adj_lib_size[203] / mean(adj_lib_size) * 2))
#> [1] -2.40031

In the end, I (Leo) don’t know enough C++ to fully understand the math of what edgeR is doing. But well, the point is that it is scaling that prior.count (0.5 in the smokingMouse case instead of the default value of 2) which results in a smaller number, and hence why the log2() value of it is smaller than -1.

References

Evans, C., Hardin, J., & Stoebel, D. M. (2018). Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in bioinformatics, 19(5), 776-792.
Bedre, R. (2023). Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq. Web site: https://www.reneshbedre.com/blog/expression_units.html