6 RSE plotting basics with smokingMouse
Instructor: Daianna
6.1 Download the smokingMouse data
We’ve used these commands already in an earlier chapter. Here we’ll re-download the data, though technically we are using the cached data we had previously downloaded so this time the code below will run faster.
## Load the container package for this type of data
library("SummarizedExperiment")
## Download data
library("BiocFileCache")
<- BiocFileCache::BiocFileCache()
bfc <- BiocFileCache::bfcrpath(
cached_rse_gene x = bfc,
"https://github.com/LieberInstitute/SPEAQeasyWorkshop2023/raw/devel/provisional_data/rse_gene_mouse_RNAseq_nic-smo.Rdata"
)
## Check the local path on our cache
cached_rse_gene#> BFC2
#> "/github/home/.cache/R/BiocFileCache/48f3eb1e88e_rse_gene_mouse_RNAseq_nic-smo.Rdata"
## Load the rse_gene object
load(cached_rse_gene, verbose = TRUE)
#> Loading objects:
#> rse_gene
## Nicotine data
<- rse_gene[, which(rse_gene$Expt == "Nicotine")] rse_gene_nic
In this part of the course you’ll have a plotting session in which you’ll learn what type of plots you can create to explore your data, but please note that this is not a differential expression analysis itself. The steps that will be performed are shown in Figure 1. The main objective of this first part is to explore the quality of the samples, their differences in gene expression variations and the impact of sample variables on gene expression variance. You’ll also learn how to visualize gene expression counts.
6.2 Data preparation
Even before exploring the samples, we must normalize the counts and filter non-expressed genes; we won’t do the processes themselves since we already have normalized and filtered data but let’s check why these steps are important and where to extract the data we need for posterior analyses.
6.2.1 Data normalization
Data normalization is a relevant preliminary step when working with expression data because raw counts do not necessarily reflect real expression measures of the genes, since there are technical differences in the way the libraries are prepared and sequenced, as well as intrinsic differences in the genes that are translated into more or less mapped reads. Particularly, there are within-sample effects that are the differences between genes in the same sample, such as their length (the longer the gene, the more reads it will have) and GC content, factors that contribute to variations in their counts. On the other hand, between-sample effects are differences between samples such as the sequencing depth, i.e., the total number of molecules sequenced, and the library size, i.e., the total number of reads of each sample [1].
These variables lead to virtually different mRNA amounts but of course are not due to the biological or treatment conditions of interest (such as nicotine administration in this example) so in order to remove, or at least, to minimize this technical bias and obtain measures comparable and consistent across samples, raw counts must be normalized by these factors. The data that we’ll use in this case are already normalized in assays(rse_gene)$logcounts
. Specifically, the assay contains counts per million (CPM), also known as reads per million (RPM), one basic gene expression unit that only normalizes by the sequencing depth and is computed by dividing the read counts of a gene in a sample by a scaling factor given by the total mapping reads of the sample per million [2]:
\[CPM = \frac{read \ \ counts \ \ of \ \ gene \ \ \times \ \ 10^6 }{Total \ \ mapping \ \ reads \ \ of \ \ sample}\]
As outlined in Data overview and download, the scaling factors were obtained with calcNormFactors()
applying the Trimmed Mean of M-Values (TMM) method, the edgeR package’s default normalization method that assumes that most genes are not differentially expressed. The effective library sizes of the samples and the CPM of each observation were computed with the edgeR function cpm()
setting the log
argument to TRUE
and prior.count
to 0.5 to receive values in \(log_2(CPM+0.5)\).
After data normalization and scaling, we’d expect the read counts to follow a normal distribution, something we can confirm by comparing the counts’ distribution before and after the normalization. Consider both datasets contain the exact same genes.
library("ggplot2")
## Histogram and density plot of read counts before and after normalization
## Raw counts
<- data.frame(counts = as.vector(assays(rse_gene_nic)$counts))
counts_data <- ggplot(counts_data, aes(x = counts)) +
plot geom_histogram(colour = "black", fill = "lightgray") +
labs(x = "read counts", y = "Frecuency") +
theme_classic()
+ theme(plot.margin = unit(c(2, 4, 2, 4), "cm")) plot
## Normalized counts
<- data.frame(logcounts = as.vector(assays(rse_gene_nic)$logcounts))
logcounts_data <- ggplot(logcounts_data, aes(x = logcounts)) +
plot geom_histogram(aes(y = ..density..), colour = "darkgray", fill = "lightgray") +
theme_classic() +
geom_density(fill = "#69b3a2", alpha = 0.3) +
labs(x = "log(CPM+0.5)", y = "Frecuency")
+ theme(plot.margin = unit(c(2, 4, 2, 4), "cm")) plot
As presented, after data transformation, we can now see a more widespread distribution of the counts, but note that most of them are zeros in the first plot (the one with the raw counts) and those zeros remain after normalization, corresponding to counts below 0 in the second plot. That is because we haven’t filtered the low and zero-expressed genes.
6.2.2 Gene filtering
Lowly-expressed or non-expressed genes in many samples are not of biological interest in a study of differential expression because they don’t inform about the gene expression changes and they are, by definition, not differentially expressed, so we have to drop them using filterByExpr()
from edgeR that only keeps genes with at least K CPM in n samples and with a minimum total number of counts across all samples.
## Retain genes that passed filtering step
<- rse_gene_nic[rowData(rse_gene_nic)$retained_after_feature_filtering == TRUE, ]
rse_gene_filt
## Normalized counts and filtered genes
<- data.frame(logcounts = as.vector(assays(rse_gene_filt)$logcounts))
filt_logcounts_data
## Plot
<- ggplot(filt_logcounts_data, aes(x = logcounts)) +
plot geom_histogram(aes(y = ..density..), colour = "darkgray", fill = "lightgray") +
theme_classic() +
geom_density(fill = "#69b3a2", alpha = 0.3) +
labs(x = "log(CPM+0.5)", y = "Frecuency")
+ theme(plot.margin = unit(c(2, 4, 2, 4), "cm")) plot
In this third plot we can observe a curve that is closer (though not completely) to a normal distribution and with less lowly-expressed genes.
With the object rse_gene_filt
we can proceed with downstream analyses.
6.3 Exploratory Data Analysis
The first formal step that we will be performing is the sample exploration. This crucial initial part of the analysis consists of an examination of differences and relationships between Quality-Control (QC) metrics of the samples from different groups in order to identify poor-quality samples that must be removed before DEA. After that, the sample variables in the metadata also need to be analyzed and filtered based on the percentage of gene expression variance that they explain for each gene.
6.4 Quality Control Analysis
First we have to explore and compare the the quality-control metrics of the samples in the different groups given by covariates such as age, sex, pregnancy state, group, plate and flowcell. See Sample Information section in chapter 04 for a description of these variables.
❓ Why is that relevant? As you could imagine, technical and methodological differences in all the steps that were carried out during the experimental stages are potential sources of variation in the quality of the samples. Just imagine all that could have been gone wrong or unexpected while experimenting with mice, during the sampling, in the RNA extraction using different batches, when treating samples in different mediums, when preparing libraries in different plates and sequencing in different flowcells. Moreover, the inherent features of the mice from which the samples come from such as age, tissue, sex and pregnancy state could also affect the samples’ metrics if, for example, they were separately analyzed and processed.
❓But why do we care about mitochodrial and ribosomal counts as QC metrics? In the process of mRNA extraction, either by mRNA enrichment (capturing polyadenilated mRNAs) or rRNA-depletion (removing rRNA), we’d expect to have a low number of ribosomal counts, i.e., counts that map to rDNA, and if we don’t, something must have gone wrong with the procedures. In the case of mitochondrial counts something similar occurs: higher mitochondrial rates will be obtained if the cytoplasmic mRNA capture was deficient or if the transcripts were lost by some technical issue, increasing the proportion of mitochondrial mRNAs. As a result, high mitoRate and rRNA_rate imply poor quality in the samples.
Note: the QC metrics were computed with the unprocessed datasets (neither filtered nor normalized) to preserve the original estimates of the samples.
6.4.1 Evaluate QC metrics for groups of samples
Fortunately, we can identify to some extent possible factors that could have influenced on the quality of the samples, as well as isolated samples that are problematic. To do that, we will create boxplots that present the distribution of the samples’ metrics separating them by sample variables.
library("Hmisc")
library("stringr")
library("cowplot")
## Define QC metrics of interest
<- c("mitoRate", "overallMapRate", "totalAssignedGene", "rRNA_rate", "sum", "detected", "ERCCsumLogErr")
qc_metrics
## Define sample variables of interest
<- c("Group", "Age", "Sex", "Pregnancy", "plate", "flowcell")
sample_variables
## Function to create boxplots of QC metrics for groups of samples
<- function(qc_metric, sample_var) {
QC_boxplots ## Define sample colors depending on the sample variable
if (sample_var == "Group") {
<- c("Control" = "brown2", "Experimental" = "deepskyblue3")
colors else if (sample_var == "Age") {
} <- c("Adult" = "slateblue3", "Pup" = "yellow3")
colors else if (sample_var == "Sex") {
} <- c("F" = "hotpink1", "M" = "dodgerblue")
colors else if (sample_var == "Pregnancy") {
} <- c("Yes" = "darkorchid3", "No" = "darkolivegreen4")
colors else if (sample_var == "plate") {
} <- c("Plate1" = "darkorange", "Plate2" = "lightskyblue", "Plate3" = "deeppink1")
colors else if (sample_var == "flowcell") {
} <- c(
colors "HKCG7DSXX" = "chartreuse2", "HKCMHDSXX" = "magenta", "HKCNKDSXX" = "turquoise3",
"HKCTMDSXX" = "tomato"
)
}
## Axis labels
<- capitalize(sample_var)
x_label <- str_replace_all(qc_metric, c("_" = ""))
y_label
## x-axis text angle and position
if (sample_var == "flowcell") {
<- 18
x_axis_angle <- 0.5
x_axis_hjust <- 0.7
x_axis_vjust <- 4
x_axis_size else {
} <- 0
x_axis_angle <- 0.5
x_axis_hjust <- 0.5
x_axis_vjust <- 6
x_axis_size
}
## Extract sample data in colData(rse_gene_filt)
<- data.frame(colData(rse_gene_filt))
data
## Sample variable separating samples in x-axis and QC metric in y-axis
## (Coloring by sample variable)
<- ggplot(data = data, mapping = aes(x = !!rlang::sym(sample_var), y = !!rlang::sym(qc_metric), color = !!rlang::sym(sample_var))) +
plot ## Add violin plots
geom_violin(alpha = 0, size = 0.4, color = "black", width = 0.7) +
## Spread dots
geom_jitter(width = 0.08, alpha = 0.7, size = 1.3) +
## Add boxplots
geom_boxplot(alpha = 0, size = 0.4, width = 0.1, color = "black") +
## Set colors
scale_color_manual(values = colors) +
## Define axis labels
labs(y = y_label, x = x_label) +
## Get rid of the background
theme_bw() +
## Hide legend and define plot margins and size of axis title and text
theme(
legend.position = "none",
plot.margin = unit(c(0.5, 0.4, 0.5, 0.4), "cm"),
axis.title = element_text(size = 7),
axis.text = element_text(size = x_axis_size),
axis.text.x = element_text(angle = x_axis_angle, hjust = x_axis_hjust, vjust = x_axis_vjust)
)
return(plot)
}
## Plots of all QC metrics for each sample variable
<- function(sample_var) {
multiple_QC_boxplots <- 1
i <- list()
plots for (qc_metric in qc_metrics) {
## Call function to create each individual plot
<- QC_boxplots(qc_metric, sample_var)
plots[[i]] <- i + 1
i
}## Arrange multiple plots into a grid
print(plot_grid(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]], plots[[6]], plots[[7]], nrow = 2))
}
multiple_QC_boxplots("Age")
Initially, when we separate samples by Age, we can appreciate a clear segregation of adult and pup samples in mitoRate , with higher mitochondrial rates for adult samples and thus, being lower quality samples than the pup ones. We can also see that pup samples have higher totalAssignedGene , again being higher quality. The samples are very similar in the rest of the QC metrics. The former differences must be taken into account because they guide further sample separation by Age, which is necessary to avoid dropping most of the adult samples (that are lower quality) in the QC-based sample filtering (see below) and to prevent misinterpreting sample variation given by quality and not by mouse age itself.
multiple_QC_boxplots("Sex")
With Sex, female and male samples have similar QC metrics but there are some female samples that have high mitoRate and low totalAssignedGene , but males don´t. That is consistent with the fact that all males are pups and as we previously observed, pup samples present better QC metrics, but since not all pups are males, some female pups also have smaller mitoRate and greater totalAssignedGene . So in this case samples’ differences are actually dictated by Age and not by Sex.
multiple_QC_boxplots("Group")
Notably, for Group no evident contrasts are seen in the quality of the samples, which means that both control and exposed samples have similar metrics and therefore the differences between them won’t be determined by technical factors but effectively by gene expression changes.
multiple_QC_boxplots("Pregnancy")
Pregnancy variable is interesting because pregnant dams were obviously females and adults, which we already noted are lower quality. Accordingly to that, pregnant samples present overall higher mitoRate and lower totalAssignedGene , sum (library size) and detected (number of expressed genes) than some samples coming from non-pregnant mice. However, these samples have smaller | ERCCsumLogErr |, which means that they had smaller differences between expected and observed concentrations of control transcripts.
Notwithstanding, we must clarify one more time that as in Sex, the trends observed in Pregnancy are all given by Age: it is Age the variable that fully segregates samples in mitoRate and almost completely in , lower totalAssignedGene . Because all pregnant dams were adults, their metrics will be positioned where the adult samples were, but there were also not-pregnant adults that share similar QC values.
multiple_QC_boxplots("plate")
In plate, no alarming differences are presented but some samples from the 1st plate have low (though not much lower) overallMapRate , totalAssignedGene and sum .
multiple_QC_boxplots("flowcell")
For the flowcell, again no worrying distinctions are seen, with the exception of a few individual samples far from the rest.
In all the previous plots we can appreciate a group of samples placed below in mitoRate that correspond to pup samples. The relationship is more fuzzy in totalAssignedGene .
6.4.2 Examine relationships between QC metrics of samples
Secondly, we want to analyze if there are relationships between QC variables that explain the presence of samples with two or more particular QC metrics. For that, scatterplots are useful because they enable the evaluation of pairs of samples’ QC metrics.
## Scatterplots for a pair of QC metrics
<- function(sample_var, qc_metric1, qc_metric2) {
QC_scatterplots ## Define sample colors
if (sample_var == "Group") {
<- c("Control" = "brown2", "Experimental" = "deepskyblue3")
colors else if (sample_var == "Age") {
} <- c("Adult" = "slateblue3", "Pup" = "yellow3")
colors else if (sample_var == "Sex") {
} <- c("F" = "hotpink1", "M" = "dodgerblue")
colors else if (sample_var == "Pregnancy") {
} <- c("Yes" = "darkorchid3", "No" = "darkolivegreen4")
colors else if (sample_var == "plate") {
} <- c("Plate1" = "darkorange", "Plate2" = "lightskyblue", "Plate3" = "deeppink1")
colors else if (sample_var == "flowcell") {
} <- c(
colors "HKCG7DSXX" = "chartreuse2", "HKCMHDSXX" = "magenta", "HKCNKDSXX" = "turquoise3",
"HKCTMDSXX" = "tomato"
)
}
<- colData(rse_gene_filt)
data
## Scatterplots for continuous variable vs continuous variable
## First QC metric in x-axis and second QC metric in y-axis
<- ggplot(as.data.frame(data), aes(
plot x = !!rlang::sym(qc_metric1),
y = !!rlang::sym(qc_metric2),
## Color samples by a variable
color = !!rlang::sym(sample_var)
+
)) ## Add scatterplot
geom_point(size = 1) +
## Add regression line
stat_smooth(geom = "line", alpha = 0.4, size = 0.4, span = 0.25, method = lm, color = "orangered3") +
## Colors
scale_color_manual(name = sample_var, values = colors) +
theme_bw() +
## Add Pearson correlation coefficient between the metrics as subtitle
labs(
subtitle = paste0("Corr: ", signif(cor(data[, qc_metric1], data[, qc_metric2], method = "pearson"), digits = 3)),
## Add axis labels
y = gsub("_", " ", qc_metric2),
x = gsub("_", " ", qc_metric1)
+
) ## Plot margins and text size
theme(
plot.margin = unit(c(0.1, 1.2, 0.1, 1.2), "cm"),
axis.title = element_text(size = (7)),
axis.text = element_text(size = (6)),
plot.subtitle = element_text(size = 7, color = "gray40"),
legend.text = element_text(size = 6),
legend.title = element_text(size = 7)
)return(plot)
}
## QC scatterplots coloring by all sample variables
<- function(qc_metric1, qc_metric2) {
multiple_QC_scatterplots <- c("Age", "Sex", "plate", "Pregnancy", "Group", "flowcell")
sample_variables
<- 1
i <- list()
plots for (sample_var in sample_variables) {
<- QC_scatterplots(sample_var, qc_metric1, qc_metric2)
plots[[i]] <- i + 1
i
}plot_grid(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]], plots[[6]], nrow = 3, rel_widths = c(1, 1))
}
multiple_QC_scatterplots("mitoRate", "rRNA_rate")
For mitoRate vs rRNA_rate plots there’s a negligible positive correlation which was anticipated since no obvious relationship exists between them, though if cytoplasmic mRNAs weren’t well captured (lowering the total number of mapping reads) we could expect an increase in both mitochondrial and ribosomal rates.
Note the complete separation of samples in mitoRate (but not in rRNA_rate ) by Age, which in turn causes the grouping of male samples (all pups).
multiple_QC_scatterplots("mitoRate", "totalAssignedGene")
Forcefully, mitoRate and totalAssignedGene have a strong negative correlation, which reveals the existence of very bad quality samples (with small totalAssignedGene and high mitoRate ).
multiple_QC_scatterplots("sum", "detected")
Noticeably, a positive correlation is present for sum and detected and that is also congruent because a higher number of expressed-genes implies bigger library sizes, but that is not necessarily true in the other way around: just a few highly expressed genes (small number of detected genes) could increase the library size. Interestingly, in this case samples are diagonally separated by age.
multiple_QC_scatterplots("sum", "totalAssignedGene")
Contrary to expectations, sum is not correlated to totalAssignedGene and this is unexpected because higher proportions of genes’ reads could initially increase library size. However, since totalAssignedGene is a fraction of the total reads, it could be that a high fraction of reads mapping to genes reflects a small number of total reads and does not necessarily imply more expressed genes (and therefore bigger libraries). In other words, even when totalAssignedGene equals 1, if the total number of reads is small, the genes won’t be widely covered by them and the libraries won’t be sizeable. Observe the separarion of samples by Age.
multiple_QC_scatterplots("detected", "totalAssignedGene")
Lastly, there’s a slight positive correlation between totalAssignedGene and detected , which makes sense because reads assigned to genes rise the number of non-zero expressed genes but as mentioned, those genes are not necessarily highly expressed and thus, library sizes are not perforce so much bigger. Also note that the correlation wouldn’t be conserved within pup samples.
📑 Exercise 1: What’s the correlation between subsets_Mito_sum and mitoMapped ? What factor could explain that correlation? Why are mitoMapped numbers bigger than subsets_Mito_sum values?
Remarkably, samples were not differentiated by any other variable (including Group!) but the differences between adult and pup metrics suggest that they must be splitted into two different groups for the following analyses.
6.4.3 QC sample filtering
After assessing how different or similar are the QC values between samples, we can now proceed to sample filtering based precisely, on these metrics. For that, we will use isOutlier()
from scater to identify outlier samples only at the lower end or the higher end, depending on the QC metric.
library("scater")
library("rlang")
library("ggrepel")
## Separate data by Age
<- rse_gene_filt[, which(rse_gene_filt$Age == "Pup")]
rse_gene_pups <- rse_gene_filt[, which(rse_gene_filt$Age == "Adult")]
rse_gene_adults
## Find outlier samples based on their QC metrics (samples that are 3 median-absolute-deviations away from the median)
## Filter all samples together
## Drop samples with lower library sizes (sum), detected number of genes and totalAssignedGene
<- isOutlier(rse_gene_filt$sum, nmads = 3, type = "lower")
outliers_library_size <- isOutlier(rse_gene_filt$detected, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_filt$totalAssignedGene, nmads = 3, type = "lower")
outliers_totalAssignedGene ## Drop samples with higher mitoRates and rRNA rates
<- isOutlier(rse_gene_filt$mitoRate, nmads = 3, type = "higher")
outliers_mito <- isOutlier(rse_gene_filt$rRNA_rate, nmads = 3, type = "higher")
outliers_rRNArate
## Keep not outlier samples
<- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
not_outliers <- rse_gene_filt[, not_outliers]
rse_gene_qc
## Number of samples retained
dim(rse_gene_qc)[2]
#> [1] 39
## Add new variables to rse_gene_filt with info of samples retained/dropped
$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_filt$SAMPLE_ID, function(x) {
rse_gene_filtif (x %in% rse_gene_qc$SAMPLE_ID) {
"Retained"
else {
} "Dropped"
}
}))
## Filter adult samples
<- isOutlier(rse_gene_adults$sum, nmads = 3, type = "lower")
outliers_library_size <- isOutlier(rse_gene_adults$detected, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_adults$totalAssignedGene, nmads = 3, type = "lower")
outliers_totalAssignedGene <- isOutlier(rse_gene_adults$mitoRate, nmads = 3, type = "higher")
outliers_mito <- isOutlier(rse_gene_adults$rRNA_rate, nmads = 3, type = "higher")
outliers_rRNArate
<- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
not_outliers <- rse_gene_adults[, not_outliers]
rse_gene_adults_qc
## Number of samples retained
dim(rse_gene_adults_qc)[2]
#> [1] 20
$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_adults$SAMPLE_ID, function(x) {
rse_gene_adultsif (x %in% rse_gene_adults_qc$SAMPLE_ID) {
"Retained"
else {
} "Dropped"
}
}))
## Filter pup samples
<- isOutlier(rse_gene_pups$sum, nmads = 3, type = "lower")
outliers_library_size <- isOutlier(rse_gene_pups$detected, nmads = 3, type = "lower")
outliers_detected_num <- isOutlier(rse_gene_pups$totalAssignedGene, nmads = 3, type = "lower")
outliers_totalAssignedGene <- isOutlier(rse_gene_pups$mitoRate, nmads = 3, type = "higher")
outliers_mito <- isOutlier(rse_gene_pups$rRNA_rate, nmads = 3, type = "higher")
outliers_rRNArate
<- which(!(outliers_library_size | outliers_detected_num | outliers_totalAssignedGene | outliers_mito | outliers_rRNArate))
not_outliers <- rse_gene_pups[, not_outliers]
rse_gene_pups_qc
## Number of samples retained
dim(rse_gene_pups_qc)[2]
#> [1] 41
$Retention_after_QC_filtering <- as.vector(sapply(rse_gene_pups$SAMPLE_ID, function(x) {
rse_gene_pupsif (x %in% rse_gene_pups_qc$SAMPLE_ID) {
"Retained"
else {
} "Dropped"
} }))
We already filtered outlier samples … but what have we removed? It is always important to trace the QC metrics of the filtered samples to verify that they really are poor quality. We don’t want to get rid of useful samples! So let’s go back to the QC boxplots but now color samples according to whether they passed or not the filtering step and also distinguishing samples’ groups by shape.
## Boxplots of QC metrics after sample filtering
## Boxplots
<- function(rse_gene, qc_metric, sample_var) {
boxplots_after_QC_filtering ## Color samples
<- c("Retained" = "deepskyblue", "Dropped" = "brown2")
colors
## Sample shape by sample variables
if (sample_var == "Group") {
<- c("Control" = 0, "Experimental" = 15)
shapes else if (sample_var == "Age") {
} <- c("Adult" = 16, "Pup" = 1)
shapes else if (sample_var == "Sex") {
} <- c("F" = 11, "M" = 19)
shapes else if (sample_var == "Pregnancy") {
} <- c("Yes" = 10, "No" = 1)
shapes else if (sample_var == "plate") {
} <- c("Plate1" = 12, "Plate2" = 5, "Plate3" = 4)
shapes else if (sample_var == "flowcell") {
} <- c(
shapes "HKCG7DSXX" = 3, "HKCMHDSXX" = 8, "HKCNKDSXX" = 14,
"HKCTMDSXX" = 17
)
}
<- str_replace_all(qc_metric, c("_" = " "))
y_label
<- data.frame(colData(rse_gene))
data
## Median of the QC var values
<- median(eval(parse_expr(paste("rse_gene$", qc_metric, sep = ""))))
median ## Median-absolute-deviation of the QC var values
<- mad(eval(parse_expr(paste("rse_gene$", qc_metric, sep = ""))))
mad
<- ggplot(data = data, mapping = aes(
plot x = "", y = !!rlang::sym(qc_metric),
color = !!rlang::sym("Retention_after_QC_filtering")
+
)) geom_jitter(alpha = 1, size = 2, aes(shape = eval(parse_expr((sample_var))))) +
geom_boxplot(alpha = 0, size = 0.15, color = "black") +
scale_color_manual(values = colors) +
scale_shape_manual(values = shapes) +
labs(x = "", y = y_label, color = "Retention after QC filtering", shape = sample_var) +
theme_classic() +
## Median line
geom_hline(yintercept = median, size = 0.5) +
## Line of median + 3 MADs
geom_hline(yintercept = median + (3 * mad), size = 0.5, linetype = 2) +
## Line of median - 3 MADs
geom_hline(yintercept = median - (3 * mad), size = 0.5, linetype = 2) +
theme(
axis.title = element_text(size = (9)),
axis.text = element_text(size = (8)),
legend.position = "right",
legend.text = element_text(size = 8),
legend.title = element_text(size = 9)
)
return(plot)
}
In the following plot we can confirm that taking all samples together (both adults and pups) all adult samples are dropped by their high mitoRate , but that is not desirable because we want to analyze these samples too, so we need to analyze pup and adult samples separately.
## Plots
## All samples together
<- boxplots_after_QC_filtering(rse_gene_filt, "mitoRate", "Age")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm")) p
For adults, only 3 controls are dropped, 3 of them by their mitoRate .
## Adult samples
<- boxplots_after_QC_filtering(rse_gene_adults, "mitoRate", "Group")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm")) p
For pups, 1 experimental sample was dropped by its rRNA_rate .
## Pup samples
<- boxplots_after_QC_filtering(rse_gene_pups, "rRNA_rate", "Group")
p + theme(plot.margin = unit(c(2, 4, 2, 4), "cm")) p
📑 Exercise 2: Were the filtered adult samples from pregnant or not pregnant mice? From which plates and flowcells? What were their library sizes?
Try the function yourself with different QC metrics and sample variables!
6.5 Miscellaneous about CPM values
## Why do we see log(CPM + 0.5) values smaller than -1?
log2(0.5)
#> [1] -1
## Maybe the prior count is getting shrunk
log2(0.05)
#> [1] -4.321928
## logcounts values we see across all samples
summary(as.vector(assays(rse_gene)$logcounts))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -5.9853 -5.9853 -5.9853 -2.7514 0.4171 18.2732
## Sample 203 in particular has some cases like it
summary(assays(rse_gene)$logcounts[, 203])
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -5.9853 -5.9853 -5.9853 -2.6620 0.6067 12.5785
## Where we find one gene in sample 203 with that property
<- which.min(assays(rse_gene)$logcounts[, 203])
i_gene
## We can check the logcounts and counts
assays(rse_gene)$logcounts[i_gene, 203]
#> ENSMUSG00000102693.1
#> -5.985331
assays(rse_gene)$counts[i_gene, 203]
#> ENSMUSG00000102693.1
#> 0
## What if we try to reverse engineer the number?
2^assays(rse_gene)$logcounts[i_gene, 203]
#> ENSMUSG00000102693.1
#> 0.01578469
log2(0.01578469)
#> [1] -5.98533
0.01578469 / 0.5
#> [1] 0.03156938
0.5 * 0.03156938
#> [1] 0.01578469
## Hm... 31 doesn't ring any bells
1 / 0.03156938
#> [1] 31.67626
- Bioconductor website for browsing code: https://code.bioconductor.org/
- Bioconductor support website: https://support.bioconductor.org/
Links Leo showed:
smokingMouse_indirects
code for computing thelogcounts
https://github.com/LieberInstitute/smokingMouse_Indirects/blob/704692a357ec391348ebc3568188d41827328ba5/code/02_build_objects/02_build_objects.R#L100- https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L9
- https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L14
- https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L78
- https://code.bioconductor.org/browse/edgeR/blob/devel/src/R_calculate_cpm.cpp#L30
- https://code.bioconductor.org/browse/edgeR/blob/devel/src/add_prior.cpp#L29
- https://support.bioconductor.org/post/search/?query=prior.count+edgeR
- https://support.bioconductor.org/p/59846/
- https://support.bioconductor.org/p/59846/
## Check the documentation
## ?edgeR::cpm
## > If log-values are computed, then a small count, given by prior.count but scaled to be proportional to the library size, is added to y to avoid taking the log of zero.
## https://github.com/LieberInstitute/smokingMouse_Indirects/blob/704692a357ec391348ebc3568188d41827328ba5/code/02_build_objects/02_build_objects.R#L100C51-L100C135
## Let's save the output of calcNormFactors
<- edgeR::calcNormFactors(rse_gene, method = "TMM")
DGElist_with_norm class(DGElist_with_norm)
#> [1] "DGEList"
#> attr(,"package")
#> [1] "edgeR"
## https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L9
## We can see that it has the lib.size and norm.factors values there
head(DGElist_with_norm$samples[, seq_len(3)])
#> group lib.size norm.factors
#> Sample1 1 37119948 0.9601434
#> Sample2 1 24904754 0.9783991
#> Sample3 1 27083602 0.9740923
#> Sample4 1 48312479 0.9408362
#> Sample5 1 34921004 0.9772143
#> Sample6 1 31558549 0.9246274
## Exploring lib.size across all samples
summary(DGElist_with_norm$samples$lib.size)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 6908739 26209304 29205003 29475155 34809355 114202414
$samples$lib.size[203]
DGElist_with_norm#> [1] 34869201
## https://code.bioconductor.org/browse/edgeR/blob/devel/R/cpm.R#L14
## We can see there how edgeR::cpm() computes the adjusted library sizes
summary(DGElist_with_norm$samples$norm.factors)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.3595 0.9656 1.1230 1.0306 1.1589 1.2336
$samples$norm.factors[203]
DGElist_with_norm#> [1] 1.19894
## adjusted library sizes
$samples$lib.size[203] * DGElist_with_norm$samples$norm.factors[203]
DGElist_with_norm#> [1] 41806065
## Let's save these values for all samples
<- DGElist_with_norm$samples$lib.size * DGElist_with_norm$samples$norm.factors
adj_lib_size 203]
adj_lib_size[#> [1] 41806065
## Proportional adjusted library size to the mean adjusted library size
203] / mean(adj_lib_size)
adj_lib_size[#> [1] 1.319791
1 / (adj_lib_size[203] / mean(adj_lib_size))
#> [1] 0.7576956
## Hm.... we are missing something to get to the values we saw earlier when we
## tried to reverse engineer the issue.
## Hm.... I couldn't recalculate that -5.98 value manually
log2(0.5 / (adj_lib_size[203] / mean(adj_lib_size) * 2))
#> [1] -2.40031
In the end, I (Leo) don’t know enough C++ to fully understand the math of what edgeR
is doing. But well, the point is that it is scaling that prior.count
(0.5
in the smokingMouse
case instead of the default value of 2
) which results in a smaller number, and hence why the log2()
value of it is smaller than -1
.
References
Evans, C., Hardin, J., & Stoebel, D. M. (2018). Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in bioinformatics, 19(5), 776-792.
Bedre, R. (2023). Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq. Web site: https://www.reneshbedre.com/blog/expression_units.html