Feature Selection

# <strong>Feature Selection</strong>
## Analyzing <strong>scRNA-seq</strong> data with <strong>Bioconductor</strong> for <strong>LCG-EJ-UNAM</strong> March 2020
### <a href="http://lcolladotor.github.io/">Leonardo Collado-Torres</a>
### 2020-03-25

---

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

]

.footnote[ Download the materials for this course with `usethis::use_course('lcolladotor/osca_LIIGH_UNAM_2020')` or view online at [**lcolladotor.github.io/osca_LIIGH_UNAM_2020**](http://lcolladotor.github.io/osca_LIIGH_UNAM_2020).]

/* https://stackoverflow.com/questions/50919104/horizontally-scrollable-output-on-xaringan-slides */
pre {
  max-width: 100%;
  overflow-x: scroll;
}

/* From https://github.com/yihui/xaringan/wiki/Font-Size */
.tiny{
  font-size: 40%
}

/* From https://github.com/yihui/xaringan/wiki/Title-slide */
.title-slide {
  background-image: url(https://raw.githubusercontent.com/Bioconductor/OrchestratingSingleCellAnalysis/master/images/Workflow.png);
  background-size: 33%;
  background-position: 0% 100%
}
</style>

---

# Slides by Peter Hickey

View them [here](https://docs.google.com/presentation/d/19J2FyjKlBQdAkku4Oa6UZ6SA-Y4P7AEKCRIbEQWA9ho/edit#slide=id.ga100bba375887aa_0)

---

# Code and output

```r
library('BiocFileCache')
```

```
## Loading required package: dbplyr
```

```r
bfc <- BiocFileCache()
raw.path <-
    bfcrpath(
        bfc,
        file.path(
            "http://cf.10xgenomics.com/samples",
            "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"
        )
    )
untar(raw.path, exdir = file.path(tempdir(), "pbmc4k"))

library('DropletUtils')
```

```
## Loading required package: SingleCellExperiment
```

```
## Loading required package: SummarizedExperiment
```

```
## Loading required package: GenomicRanges
```

```
## Loading required package: stats4
```

```
## Loading required package: BiocGenerics
```

```
## Loading required package: parallel
```

```
## 
## Attaching package: 'BiocGenerics'
```

```
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
```

```
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
```

```
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min
```

```
## Loading required package: S4Vectors
```

```
## 
## Attaching package: 'S4Vectors'
```

```
## The following object is masked from 'package:base':
## 
##     expand.grid
```

```
## Loading required package: IRanges
```

```
## Loading required package: GenomeInfoDb
```

```
## Loading required package: Biobase
```

```
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
```

```
## Loading required package: DelayedArray
```

```
## Loading required package: matrixStats
```

```
## 
## Attaching package: 'matrixStats'
```

```
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
```

```
## Loading required package: BiocParallel
```

```
## 
## Attaching package: 'DelayedArray'
```

```
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
```

```
## The following objects are masked from 'package:base':
## 
##     aperm, apply, rowsum
```

```r
library('Matrix')
```

```
## 
## Attaching package: 'Matrix'
```

```
## The following object is masked from 'package:S4Vectors':
## 
##     expand
```

```r
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names = TRUE)

# gene-annotation
library('scater')
```

```
## Loading required package: ggplot2
```

```r
rownames(sce.pbmc) <- uniquifyFeatureNames(rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)
library('EnsDb.Hsapiens.v86')
```

```
## Loading required package: ensembldb
```

```
## Loading required package: GenomicFeatures
```

```
## Loading required package: AnnotationDbi
```

```
## Loading required package: AnnotationFilter
```

```
## 
## Attaching package: 'ensembldb'
```

```
## The following object is masked from 'package:stats':
## 
##     filter
```

```r
location <- mapIds(
    EnsDb.Hsapiens.v86,
    keys = rowData(sce.pbmc)$ID,
    column = "SEQNAME",
    keytype = "GENEID"
)
```

```
## Warning: Unable to map 144 of 33694 requested IDs.
```

]

---

```r
# cell-detection
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[, which(e.out$FDR <= 0.001)]

# quality-control
stats <-
    perCellQCMetrics(sce.pbmc, subsets = list(Mito = which(location == "MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent,
    type = "higher")
sce.pbmc <- sce.pbmc[, !high.mito]

# normalization
library('scran')
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster = clusters)
```

```
## Warning in .calculate_sum_factors(assay(x, i = assay.type), subset.row =
## subset.row, : partial argument match of 'cluster' to 'clusters'
```

```r
sce.pbmc <- logNormCounts(sce.pbmc)
```

]

---

# Exercises

* How did we determine which genes were mitochondrial?

* How did we decide to filter cells?

* Can you explain how we normalized the data?

???

* Ensembl v86 for Human
* We used the output from both `emptyDrops()` with an FDR threshold of 0.1% and the 3 MAD filter on mitochondrial expression.
* We found some quick clusters for the cells and used that information for computing the sum factors.

---

```r
# Illustrative dataset: 416B ---------------------------------------------------

library('scRNAseq')
sce.416b <- LunSpikeInData(which = "416b")
```

```
## snapshotDate(): 2019-10-22
```

```
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
```

```
## loading from cache
```

```
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
```

```
## loading from cache
```

```
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
```

```
## loading from cache
```

```r
sce.416b$block <- factor(sce.416b$block)

# gene-annotation
library('AnnotationHub')
```

```
## 
## Attaching package: 'AnnotationHub'
```

```
## The following object is masked from 'package:Biobase':
## 
##     cache
```

```r
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
```

```
## snapshotDate(): 2019-10-29
```

```
## loading from cache
```

```r
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(
    ens.mm.v97,
    keys = rownames(sce.416b),
    keytype = "GENEID",
    column = "SYMBOL"
)
```

```
## Warning: Unable to map 563 of 46604 requested IDs.
```

```r
rowData(sce.416b)$SEQNAME <- mapIds(
    ens.mm.v97,
    keys = rownames(sce.416b),
    keytype = "GENEID",
    column = "SEQNAME"
)
```

```
## Warning: Unable to map 563 of 46604 requested IDs.
```

```r
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,
    rowData(sce.416b)$SYMBOL)

# quality-control
mito <- which(rowData(sce.416b)$SEQNAME == "MT")
stats <- perCellQCMetrics(sce.416b, subsets = list(Mt = mito))
qc <- quickPerCellQC(
    stats,
    percent_subsets = c("subsets_Mt_percent", "altexps_ERCC_percent"),
    batch = sce.416b$block
)
sce.416b <- sce.416b[, !qc$discard]

# normalization
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)
```

]

---

# Exercises

* How did we determine which genes were mitochondrial?

* How did we decide to filter cells?

* Can you explain how we normalized the data?

???

* Ensembl v87 for mouse
* We used the 3 MAD filter on mitochondrial expression as well as for the ERCC expression percent taking into account the sequencing batch.
* We computing the library size factors with default parameters without any extra work.

---

```r
# Variance of the log-counts ---------------------------------------------------

dec.pbmc <- modelGeneVar(sce.pbmc)

# Visualizing the fit:
fit.pbmc <- metadata(dec.pbmc)
plot(fit.pbmc$mean, fit.pbmc$var,
    xlab = "Mean of log-expression",
    ylab = "Variance of log-expression")
curve(fit.pbmc$trend(x),
    col = "dodgerblue",
    add = TRUE,
    lwd = 2)
```

![](05-feature-selection_files/figure-html/all_code4-1.png)

```r
# Ordering by most interesting genes for inspection.
dec.pbmc[order(dec.pbmc$bio, decreasing = TRUE), ]
```

```
## DataFrame with 33694 rows and 6 columns
##                     mean             total              tech                bio
##                <numeric>         <numeric>         <numeric>          <numeric>
## LYZ     1.97769673707568   5.1159500611095  0.82727664672734   4.28867341438216
## S100A9  1.94950920785793  4.58859014766227 0.827542445867915   3.76104770179436
## S100A8  1.71827632030156  4.45723403089262 0.819401996008416   3.63783203488421
## HLA-DRA 2.09694376970247   3.7268983984442 0.823662585789582   2.90323581265462
## CD74    2.89839753243019  3.30912153093723 0.793202990130979   2.51591854080625
## ...                  ...               ...               ...                ...
## PTMA    3.83013315155046 0.471105138449811 0.740524978380523 -0.269419839930711
## HLA-B   4.50160947331388 0.475347815893854 0.755806549658926 -0.280458733765072
## EIF1     3.2426121173403 0.478352466438016 0.771315504705692 -0.292963038267676
## TMSB4X  6.08482790661183 0.408393742972349 0.742839890904661 -0.334446147932313
## B2M     5.95481007289176 0.304436638362183 0.714660541046408 -0.410223902684225
##                       p.value                   FDR
##                     <numeric>             <numeric>
## LYZ      7.0321706653829e-271  6.9091076787387e-267
## S100A9  9.35453836302795e-209 6.12722262778331e-205
## S100A8    2.604683703797e-199 1.27955086949028e-195
## HLA-DRA 1.69875615764308e-126 2.38432560697761e-123
## CD74    7.30210305204499e-103 6.83268214155638e-100
## ...                       ...                   ...
## PTMA        0.993177414234264     0.999999999531493
## HLA-B       0.994058786260565     0.999999999531493
## EIF1         0.99498711208533     0.999999999531493
## TMSB4X      0.998864257564838     0.999999999531493
## B2M         0.999950168961814     0.999999999531493
```

]

---

# Exercises

* What type of object did `modelGeneVar()` return?

* Is `dec.pbmc` a table? Or does it contain extra information?

* What type of object is `fit.pbmc` and what named objects does it contain?

* What type of object is `fit.pbmc$trend`?

* Where can we find more details about this function?

???

Solutions

* A `DFrame`
* No, it contains more information inside `metadata(dec.pbmc)`
* `class(metadata(dec.pbmc))` and `sapply(metadata(dec.pbmc), class)`
* A function
* Check `?fitTrendVar` and ultimately look at the source code https://github.com/MarioniLab/scran/blob/master/R/fitTrendVar.R

---

```r
# Coefficient of variation -----------------------------------------------------

dec.cv2.pbmc <- modelGeneCV2(sce.pbmc)
# Visualizing the fit:
fit.cv2.pbmc <- metadata(dec.cv2.pbmc)
plot(fit.cv2.pbmc$mean, fit.cv2.pbmc$cv2, log = "xy")
```

```
## Warning in xy.coords(x, y, xlabel, ylabel, log): 14044 x values <= 0 omitted
## from logarithmic plot
```

```r
curve(fit.cv2.pbmc$trend(x),
    col = "dodgerblue",
    add = TRUE,
    lwd = 2)
```

![](05-feature-selection_files/figure-html/all_code5-1.png)

```r
# Ordering by most interesting genes for inspection.
dec.cv2.pbmc[order(dec.cv2.pbmc$ratio, decreasing = TRUE), ]
```

```
## DataFrame with 33694 rows and 6 columns
##                         mean            total            trend            ratio
##                    <numeric>        <numeric>        <numeric>        <numeric>
## HIST1H2AC  0.904516934666557 267.718236577368 1.55979164387129  171.63717835602
## GNG11      0.690568836016836 219.322909957405 1.98063875112505 110.733423665887
## PRTFDC1    0.041251143542588 3034.95248349643 29.9868178384793 101.209554806511
## TNNC2      0.102157654693312 1210.58511395578  12.228721423039 98.9952319688162
## PF4         1.10837584898308 128.809452421799 1.30994958879054 98.3316102574044
## ...                      ...              ...              ...              ...
## AC023491.2                 0              NaN              Inf              NaN
## AC233755.2                 0              NaN              Inf              NaN
## AC233755.1                 0              NaN              Inf              NaN
## AC213203.1                 0              NaN              Inf              NaN
## FAM231B                    0              NaN              Inf              NaN
##              p.value       FDR
##            <numeric> <numeric>
## HIST1H2AC          0         0
## GNG11              0         0
## PRTFDC1            0         0
## TNNC2              0         0
## PF4                0         0
## ...              ...       ...
## AC023491.2       NaN       NaN
## AC233755.2       NaN       NaN
## AC233755.1       NaN       NaN
## AC213203.1       NaN       NaN
## FAM231B          NaN       NaN
```

]

---

```r
# In the presence of spike-ins -------------------------------------------------

dec.spike.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC")
# Ordering by most interesting genes for inspection.
dec.spike.416b[order(dec.spike.416b$bio, decreasing = TRUE), ]
```

```
## DataFrame with 46604 rows and 6 columns
##                      mean             total             tech               bio
##                 <numeric>         <numeric>        <numeric>         <numeric>
## Lyz2     6.61096803891878  13.8497201762179 1.57130530147958  12.2784148747383
## Ccl9     6.67845998360554  13.1869005745724 1.50034784936951  11.6865527252029
## Top2a    5.81023942468506  14.1787245984925 2.54775542311768  11.6309691753748
## Cd200r3  4.83179821635761  15.5612599938701 4.22984248305335  11.3314175108168
## Ccnb2    5.97776034107469  13.1392817339087 2.30176764546838  10.8375140884403
## ...                   ...               ...              ...               ...
## Rpl5-ps2 3.60625401733281 0.612623344818866 6.32852661749413 -5.71590327267527
## Gm11942  3.38767658515201 0.798570319939665  6.5147301200209 -5.71615980008124
## Gm12816  2.91276213414496  0.83866968433397 6.57363561000052 -5.73496592566655
## Gm13623  2.72843841213055 0.708071335002565  6.4544837822736 -5.74641244727104
## Rps12l1  3.15419890541469 0.746615468755298 6.59331632615833 -5.84670085740304
##                        p.value                   FDR
##                      <numeric>             <numeric>
## Lyz2     1.48993187648261e-186 1.54156125498507e-183
## Ccl9     2.21855596985267e-185   2.199790683941e-182
## Top2a      3.8001564031599e-65  1.13040402407495e-62
## Cd200r3   9.46221237861832e-24  6.08573697226973e-22
## Ccnb2     3.68706262912477e-69  1.20193190938743e-66
## ...                        ...                   ...
## Rpl5-ps2      0.99961624700222     0.999726145742423
## Gm11942       0.99945891635767     0.999726145742423
## Gm12816      0.999422192505515     0.999726145742423
## Gm13623      0.999543762578621     0.999726145742423
## Rps12l1      0.999521783333319     0.999726145742423
```

```r
# In the absence of spike-ins --------------------------------------------------

set.seed(0010101)
dec.pois.pbmc <- modelGeneVarByPoisson(sce.pbmc)
```

```
## Warning in seq.default(from = log(min(means)), to = log(max(means)), length =
## npts): partial argument match of 'length' to 'length.out'
```

```r
# Ordering by most interesting genes for inspection.
dec.pois.pbmc[order(dec.pois.pbmc$bio, decreasing = TRUE), ]
```

```
## DataFrame with 33694 rows and 6 columns
##                      mean             total              tech
##                 <numeric>         <numeric>         <numeric>
## LYZ      1.97769673707568   5.1159500611095  0.62154711670504
## S100A9   1.94950920785793  4.58859014766227 0.627306158799179
## S100A8   1.71827632030156  4.45723403089262 0.669427971437804
## HLA-DRA  2.09694376970247   3.7268983984442 0.596371862296273
## CD74     2.89839753243019  3.30912153093723  0.42262399834624
## ...                   ...               ...               ...
## ATP5J    0.61952391166994 0.455658705950256 0.507720135991298
## NEDD8   0.846016230212089 0.561647992087122 0.614757941258374
## NDUFA1  0.866546068674659 0.561476949834516 0.621857770268662
## SAP18   0.765422347908533 0.515945653144755 0.582613562895256
## SUMO2    1.36130581955543 0.618400317390215 0.698945564859564
##                         bio           p.value       FDR
##                   <numeric>         <numeric> <numeric>
## LYZ        4.49440294440446                 0         0
## S100A9     3.96128398886309                 0         0
## S100A8     3.78780605945482                 0         0
## HLA-DRA    3.13052653614793                 0         0
## CD74       2.88649753259099                 0         0
## ...                     ...               ...       ...
## ATP5J    -0.052061430041042 0.947965420327746         1
## NEDD8   -0.0531099491712516 0.914572836636807         1
## NDUFA1  -0.0603808204341457 0.938118682234383         1
## SAP18   -0.0666679097505009 0.965153998959554         1
## SUMO2   -0.0805452474693483 0.966130141578016         1
```

```r
# Considering experimental factors ---------------------------------------------

dec.block.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC",
    block = sce.416b$block)
dec.block.416b[order(dec.block.416b$bio, decreasing = TRUE), ]
```

```
## DataFrame with 46604 rows and 7 columns
##                      mean             total             tech               bio
##                 <numeric>         <numeric>        <numeric>         <numeric>
## Lyz2     6.61235092956153  13.8618988024144 1.58416440849175  12.2777343939227
## Ccl9     6.67841214065115  13.2598761518269 1.44553397943297  11.8143421723939
## Top2a    5.81274666129111  14.0191605357462 2.74571164189595  11.2734488938502
## Cd200r3  4.83305175110888  15.5908569933105 4.31892122397075  11.2719357693398
## Ccnb2    5.97999269432625  13.0256084334992 2.46646680303025  10.5591416304689
## ...                   ...               ...              ...               ...
## Gm12816  2.91299478430894  0.84257353020541  6.6772969278893 -5.83472339768389
## Gm5786   2.90716973737035  0.87948493152832 6.71686487286208 -5.83737994133376
## Rpl9-ps4 3.26420500290278 0.807057424228047 6.64931798189563 -5.84226055766759
## Gm13623  2.72787943922598 0.700296494037514 6.63874744721625 -5.93845095317873
## Rps12l1  3.15425441544777 0.750775094843529 6.70032524451785 -5.94955014967432
##                        p.value                   FDR
##                      <numeric>             <numeric>
## Lyz2                         0                     0
## Ccl9                         0                     0
## Top2a    3.89855011267489e-137 8.43398154830222e-135
## Cd200r3   1.17783163724758e-54  7.00721486789514e-53
## Ccnb2     1.2038008051688e-151 2.98404664172937e-149
## ...                        ...                   ...
## Gm12816      0.999988544502341     0.999998947604576
## Gm5786       0.999994012598123     0.999998947604576
## Rpl9-ps4     0.999988234321942     0.999998947604576
## Gm13623       0.99999778490972     0.999998947604576
## Rps12l1      0.999995018990454     0.999998947604576
##                                                                                                                per.block
##                                                                                                              <DataFrame>
## Lyz2       6.35651616065101:13.3747698142285:2.08227421625012:...:6.86818569847206:14.3490277906004:1.08605460073337:...
## Ccl9       6.68726308721402:13.0777537500965:1.65923180202562:...:6.66956119408827:13.4419985535572:1.23183615684032:...
## Top2a      5.34890788917346:17.5971863642305:3.91641735592849:...:6.27658543340876:10.4411347072618:1.57500592786341:...
## Cd200r3    4.60114782212492:15.7869618177592:5.55587131248954:...:5.06495568009284:15.3947521688619:3.08197113545196:...
## Ccnb2         5.5670073427887:15.4149994633951:3.4693116817634:...:6.3929780458638:10.6362174036033:1.46362192429709:...
## ...                                                                                                                  ...
## Gm12816   2.86995450397276:0.624143273081851:7.43035752357821:...:2.95603506464512:1.06100378732897:5.92423633220039:...
## Gm5786   2.96006470849401:0.902872362148704:7.49910657772781:...:2.85427476624668:0.856097500907936:5.93462316799636:...
## Rpl9-ps4   3.60690125067732:0.54327593848473:7.36805467284382:...:2.92150875512824:1.07083890997136:5.93058129094744:...
## Gm13623  2.83128942657178:0.852900814696626:7.39441634068788:...:2.62446945188018:0.547692173378403:5.88307855374461:...
## Rps12l1   3.14398505932761:0.716669519607103:7.5724598084379:...:3.16452377156793:0.784880670079955:5.82819068059779:...
```

```r
dec.block.416b$per.block
```

```
## DataFrame with 46604 rows and 2 columns
##                                                         X20160113
##                                                       <DataFrame>
## 1                                                       0:0:0:...
## 2                                                       0:0:0:...
## 3                                                       0:0:0:...
## 4                                                       0:0:0:...
## 5     0.015818190192731:0.0232700081105271:0.0754413464457108:...
## ...                                                           ...
## 46600                                                   0:0:0:...
## 46601                                                   0:0:0:...
## 46602                                                   0:0:0:...
## 46603                                                   0:0:0:...
## 46604   14.8188636663469:2.90986400629689:0.00884548590566011:...
##                                                       X20160325
##                                                     <DataFrame>
## 1                                                     0:0:0:...
## 2                                                     0:0:0:...
## 3                                                     0:0:0:...
## 4                                                     0:0:0:...
## 5                                                     0:0:0:...
## ...                                                         ...
## 46600                                                 0:0:0:...
## 46601                                                 0:0:0:...
## 46602                                                 0:0:0:...
## 46603                                                 0:0:0:...
## 46604 15.2007368851855:3.41768404695488:0.00618921550122494:...
```

```r
dec.block.416b$per.block$X20160113
```

```
## DataFrame with 46604 rows and 6 columns
##                                 mean              total                tech
##                            <numeric>          <numeric>           <numeric>
## 4933401J01Rik                      0                  0                   0
## Gm26206                            0                  0                   0
## Xkr4                               0                  0                   0
## Gm18956                            0                  0                   0
## Gm37180            0.015818190192731 0.0232700081105271  0.0754413464457108
## ...                              ...                ...                 ...
## CAAA01098150.1                     0                  0                   0
## CAAA01064564.1                     0                  0                   0
## Vmn2r122                           0                  0                   0
## CAAA01147332.1                     0                  0                   0
## CBFB-MYH11-mcherry  14.8188636663469   2.90986400629689 0.00884548590566011
##                                    bio           p.value               FDR
##                              <numeric>         <numeric>         <numeric>
## 4933401J01Rik                        0               NaN               NaN
## Gm26206                              0               NaN               NaN
## Xkr4                                 0               NaN               NaN
## Gm18956                              0               NaN               NaN
## Gm37180            -0.0521713383351837 0.999748220051701 0.999999330234571
## ...                                ...               ...               ...
## CAAA01098150.1                       0               NaN               NaN
## CAAA01064564.1                       0               NaN               NaN
## Vmn2r122                             0               NaN               NaN
## CAAA01147332.1                       0               NaN               NaN
## CBFB-MYH11-mcherry    2.90101852039123                 0                 0
```

---

# Exercises

* What type of object is `dec.block.416b$per.block`?

???

* `class(dec.block.416b$per.block)` A `DataFrame` with 2 columns, each of which are also `DataFrame`s

]

---

```r
# Selecting HVGs on the largest metrics ----------------------------------------

# Works with modelGeneVar() output
hvg.pbmc.var <- getTopHVGs(dec.pbmc, n = 1000)
str(hvg.pbmc.var)
```

```
##  chr [1:1000] "LYZ" "S100A9" "S100A8" "HLA-DRA" "CD74" "CST3" "TYROBP" ...
```

```r
# Works with modelGeneVarWithSpikes() output
hvg.416b.var <- getTopHVGs(dec.spike.416b, n = 1000)
str(hvg.416b.var)
```

```
##  chr [1:1000] "Lyz2" "Ccl9" "Top2a" "Cd200r3" "Ccnb2" "Gm10736" "Hbb-bt" ...
```

```r
# Also works with modelGeneCV2() but note `var.field`
hvg.pbmc.cv2 <- getTopHVGs(dec.cv2.pbmc,
    var.field = "ratio", n = 1000)
str(hvg.pbmc.cv2)
```

```
##  chr [1:1000] "HIST1H2AC" "GNG11" "PRTFDC1" "TNNC2" "PF4" "HGD" "PPBP" ...
```

]

---

# Exercises

* What percent of the HVG genes overlap across the two sets pbmc sets?

* Bonus: make a venn diagram of the 2 sets of pbmc HVGs

???

* `table(hvg.pbmc.var %in% hvg.pbmc.cv2)`

---

* Search code written by others, for example: https://github.com/LieberInstitute/brainseq_phase2/search?p=2&q=venn&unscoped_q=venn

```r
if (!requireNamespace("gplots", quietly = TRUE))
    install.packages("gplots")

if (!requireNamespace("VennDiagram", quietly = TRUE))
    BiocManager::install("VennDiagram")

## Quick and simple venn diagram
gplots::venn(list('var' = hvg.pbmc.var, 'cv2' = hvg.pbmc.cv2))
```

![](05-feature-selection_files/figure-html/all_code7b-1.png)

```r
## Fancier one
v <- VennDiagram::venn.diagram(
    list('var' = hvg.pbmc.var, 'cv2' = hvg.pbmc.cv2),
    filename = NULL,
    fill = c('forest green', 'orange')
    
)
grid::grid.newpage()
grid::grid.draw(v)
```

![](05-feature-selection_files/figure-html/all_code7b-2.png)

]

---

```r
# Selecting HVGs on statistical significance -----------------------------------

# Works with modelGeneVar() output
hvg.pbmc.var.2 <- getTopHVGs(dec.pbmc, fdr.threshold = 0.05)
str(hvg.pbmc.var.2)
```

```
##  chr [1:651] "LYZ" "S100A9" "S100A8" "HLA-DRA" "CD74" "CST3" "TYROBP" ...
```

```r
# Works with modelGeneVarWithSpikes() output
hvg.416b.var.2 <- getTopHVGs(dec.spike.416b,
    fdr.threshold = 0.05)
str(hvg.416b.var.2)
```

```
##  chr [1:2568] "Lyz2" "Ccl9" "Top2a" "Cd200r3" "Ccnb2" "Gm10736" "Hbb-bt" ...
```

```r
# Also works with modelGeneCV2() but note `var.field`
hvg.pbmc.cv2.2 <- getTopHVGs(dec.cv2.pbmc,
    var.field = "ratio", fdr.threshold = 0.05)
str(hvg.pbmc.cv2.2)
```

```
##  chr [1:1699] "HIST1H2AC" "GNG11" "PRTFDC1" "TNNC2" "PF4" "HGD" "PPBP" ...
```

]

---

# Exercises

* Which list of pbmc HVGs is longer?

* Make a venn diagram of the pbmc lists.

???

* hvg.pbmc.cv2.2

---

```r
# Selecting genes above the trend as HVGs --------------------------------------

# Works with modelGeneVar() output
hvg.pbmc.var.3 <- getTopHVGs(dec.pbmc, var.threshold = 0)
str(hvg.pbmc.var.3)
```

```
##  chr [1:12791] "LYZ" "S100A9" "S100A8" "HLA-DRA" "CD74" "CST3" "TYROBP" ...
```

```r
# Works with modelGeneVarWithSpikes() output
hvg.416b.var.3 <- getTopHVGs(dec.spike.416b,
    var.threshold = 0)
str(hvg.416b.var.3)
```

```
##  chr [1:11087] "Lyz2" "Ccl9" "Top2a" "Cd200r3" "Ccnb2" "Gm10736" "Hbb-bt" ...
```

```r
# Also works with modelGeneCV2() but note `var.field` and
# value of `var.threshold`
hvg.pbmc.cv2.3 <- getTopHVGs(dec.cv2.pbmc,
    var.field = "ratio", var.threshold = 1)
str(hvg.pbmc.cv2.2)
```

```
##  chr [1:1699] "HIST1H2AC" "GNG11" "PRTFDC1" "TNNC2" "PF4" "HGD" "PPBP" ...
```

]

---

```r
# Putting it all together ------------------------------------------------------

dec.pbmc <- modelGeneVar(sce.pbmc)
chosen <- getTopHVGs(dec.pbmc, prop = 0.1)
str(chosen)
```

```
##  chr [1:1279] "LYZ" "S100A9" "S100A8" "HLA-DRA" "CD74" "CST3" "TYROBP" ...
```

```r
# Subsetting to just the HVGs --------------------------------------------------

sce.pbmc.hvg <- sce.pbmc[chosen, ]
sce.pbmc.hvg
```

```
## class: SingleCellExperiment 
## dim: 1279 3922 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(1279): LYZ S100A9 ... HIST1H2BN TTC39C
## rowData names(2): ID Symbol
## colnames(3922): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCACAGGTCCAC-1 TTTGTCATCCCAAGAT-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
```

```r
# Specifying HVGs in downstream functions --------------------------------------

# Example of specifying HVGs in a downstream function
# Performing PCA only on the chosen HVGs.
sce.pbmc <- runPCA(sce.pbmc, subset_row = chosen)
sce.pbmc
```

```
## class: SingleCellExperiment 
## dim: 33694 3922 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3922): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCACAGGTCCAC-1 TTTGTCATCCCAAGAT-1
## colData names(2): Sample Barcode
## reducedDimNames(1): PCA
## spikeNames(0):
## altExpNames(0):
```

```r
# Witchcraft -------------------------------------------------------------------

# Add the full SCE to the subsetted data SCE
altExp(sce.pbmc.hvg, "original") <- sce.pbmc
sce.pbmc.hvg
```

```r
altExp(sce.pbmc.hvg, "original")
```

]

---

# Exercises

* How did we tell `runPCA()` which were the highly variable genes?

???

* Through the `subset_row` argument.

---

# Thanks!

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan) and themed with [**xaringanthemer**](https://github.com/gadenbuie/xaringanthemer).

This course is based on the book [**Orchestrating Single Cell Analysis with Bioconductor**](https://osca.bioconductor.org/) by [Aaron Lun](https://www.linkedin.com/in/aaron-lun-869b5894/), [Robert Amezquita](https://robertamezquita.github.io/), [Stephanie Hicks](https://www.stephaniehicks.com/) and [Raphael Gottardo](http://rglab.org), plus [**WEHI's scRNA-seq course**](https://drive.google.com/drive/folders/1cn5d-Ey7-kkMiex8-74qxvxtCQT6o72h) by [Peter Hickey](https://www.peterhickey.org/).

You can find the files for this course at [lcolladotor/osca_LIIGH_UNAM_2020](https://github.com/lcolladotor/osca_LIIGH_UNAM_2020).

Instructor: [**Leonardo Collado-Torres**](http://lcolladotor.github.io/).

]

---

# R session information

```r
options(width = 120)
sessioninfo::session_info()
```

```
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.3 (2020-02-29)
##  os       macOS Catalina 10.15.3      
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2020-03-25                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package                * version  date       lib source                                   
##  AnnotationDbi          * 1.48.0   2019-10-29 [1] Bioconductor                             
##  AnnotationFilter       * 1.10.0   2019-10-29 [1] Bioconductor                             
##  AnnotationHub          * 2.18.0   2019-10-29 [1] Bioconductor                             
##  askpass                  1.1      2019-01-13 [1] CRAN (R 3.6.0)                           
##  assertthat               0.2.1    2019-03-21 [1] CRAN (R 3.6.0)                           
##  beeswarm                 0.2.3    2016-04-25 [1] CRAN (R 3.6.0)                           
##  Biobase                * 2.46.0   2019-10-29 [1] Bioconductor                             
##  BiocFileCache          * 1.10.2   2019-11-08 [1] Bioconductor                             
##  BiocGenerics           * 0.32.0   2019-10-29 [1] Bioconductor                             
##  BiocManager              1.30.10  2019-11-16 [1] CRAN (R 3.6.1)                           
##  BiocNeighbors            1.4.2    2020-02-29 [1] Bioconductor                             
##  BiocParallel           * 1.20.1   2019-12-21 [1] Bioconductor                             
##  BiocSingular             1.2.2    2020-02-14 [1] Bioconductor                             
##  BiocVersion              3.10.1   2019-06-06 [1] Bioconductor                             
##  biomaRt                  2.42.0   2019-10-29 [1] Bioconductor                             
##  Biostrings               2.54.0   2019-10-29 [1] Bioconductor                             
##  bit                      1.1-15.2 2020-02-10 [1] CRAN (R 3.6.0)                           
##  bit64                    0.9-7    2017-05-08 [1] CRAN (R 3.6.0)                           
##  bitops                   1.0-6    2013-08-17 [1] CRAN (R 3.6.0)                           
##  blob                     1.2.1    2020-01-20 [1] CRAN (R 3.6.0)                           
##  caTools                  1.18.0   2020-01-17 [1] CRAN (R 3.6.0)                           
##  cli                      2.0.2    2020-02-28 [1] CRAN (R 3.6.0)                           
##  codetools                0.2-16   2018-12-24 [1] CRAN (R 3.6.3)                           
##  colorout               * 1.2-1    2019-05-07 [1] Github (jalvesaq/colorout@7ea9440)       
##  colorspace               1.4-1    2019-03-18 [1] CRAN (R 3.6.0)                           
##  crayon                   1.3.4    2017-09-16 [1] CRAN (R 3.6.0)                           
##  curl                     4.3      2019-12-02 [1] CRAN (R 3.6.0)                           
##  DBI                      1.1.0    2019-12-15 [1] CRAN (R 3.6.0)                           
##  dbplyr                 * 1.4.2    2019-06-17 [1] CRAN (R 3.6.0)                           
##  DelayedArray           * 0.12.2   2020-01-06 [1] Bioconductor                             
##  DelayedMatrixStats       1.8.0    2019-10-29 [1] Bioconductor                             
##  digest                   0.6.25   2020-02-23 [1] CRAN (R 3.6.0)                           
##  dplyr                    0.8.5    2020-03-07 [1] CRAN (R 3.6.0)                           
##  dqrng                    0.2.1    2019-05-17 [1] CRAN (R 3.6.0)                           
##  DropletUtils           * 1.6.1    2019-10-30 [1] Bioconductor                             
##  edgeR                    3.28.1   2020-02-26 [1] Bioconductor                             
##  EnsDb.Hsapiens.v86     * 2.99.0   2020-03-23 [1] Bioconductor                             
##  ensembldb              * 2.10.2   2019-11-20 [1] Bioconductor                             
##  evaluate                 0.14     2019-05-28 [1] CRAN (R 3.6.0)                           
##  ExperimentHub            1.12.0   2019-10-29 [1] Bioconductor                             
##  fansi                    0.4.1    2020-01-08 [1] CRAN (R 3.6.0)                           
##  fastmap                  1.0.1    2019-10-08 [1] CRAN (R 3.6.0)                           
##  formatR                  1.7      2019-06-11 [1] CRAN (R 3.6.0)                           
##  futile.logger            1.4.3    2016-07-10 [1] CRAN (R 3.6.0)                           
##  futile.options           1.0.1    2018-04-20 [1] CRAN (R 3.6.0)                           
##  gdata                    2.18.0   2017-06-06 [1] CRAN (R 3.6.0)                           
##  GenomeInfoDb           * 1.22.0   2019-10-29 [1] Bioconductor                             
##  GenomeInfoDbData         1.2.2    2019-10-31 [1] Bioconductor                             
##  GenomicAlignments        1.22.1   2019-11-12 [1] Bioconductor                             
##  GenomicFeatures        * 1.38.2   2020-02-15 [1] Bioconductor                             
##  GenomicRanges          * 1.38.0   2019-10-29 [1] Bioconductor                             
##  ggbeeswarm               0.6.0    2017-08-07 [1] CRAN (R 3.6.0)                           
##  ggplot2                * 3.3.0    2020-03-05 [1] CRAN (R 3.6.0)                           
##  glue                     1.3.2    2020-03-12 [1] CRAN (R 3.6.0)                           
##  gplots                   3.0.3    2020-02-25 [1] CRAN (R 3.6.0)                           
##  gridExtra                2.3      2017-09-09 [1] CRAN (R 3.6.0)                           
##  gtable                   0.3.0    2019-03-25 [1] CRAN (R 3.6.0)                           
##  gtools                   3.8.1    2018-06-26 [1] CRAN (R 3.6.0)                           
##  HDF5Array                1.14.3   2020-02-29 [1] Bioconductor                             
##  hms                      0.5.3    2020-01-08 [1] CRAN (R 3.6.0)                           
##  htmltools                0.4.0    2019-10-04 [1] CRAN (R 3.6.0)                           
##  httpuv                   1.5.2    2019-09-11 [1] CRAN (R 3.6.0)                           
##  httr                     1.4.1    2019-08-05 [1] CRAN (R 3.6.0)                           
##  igraph                   1.2.5    2020-03-19 [1] CRAN (R 3.6.0)                           
##  interactiveDisplayBase   1.24.0   2019-10-29 [1] Bioconductor                             
##  IRanges                * 2.20.2   2020-01-13 [1] Bioconductor                             
##  irlba                    2.3.3    2019-02-05 [1] CRAN (R 3.6.0)                           
##  KernSmooth               2.23-16  2019-10-15 [1] CRAN (R 3.6.3)                           
##  knitr                    1.28     2020-02-06 [1] CRAN (R 3.6.0)                           
##  lambda.r                 1.2.4    2019-09-18 [1] CRAN (R 3.6.0)                           
##  later                    1.0.0    2019-10-04 [1] CRAN (R 3.6.0)                           
##  lattice                  0.20-40  2020-02-19 [1] CRAN (R 3.6.0)                           
##  lazyeval                 0.2.2    2019-03-15 [1] CRAN (R 3.6.0)                           
##  lifecycle                0.2.0    2020-03-06 [1] CRAN (R 3.6.0)                           
##  limma                    3.42.2   2020-02-03 [1] Bioconductor                             
##  locfit                   1.5-9.1  2013-04-20 [1] CRAN (R 3.6.0)                           
##  magrittr                 1.5      2014-11-22 [1] CRAN (R 3.6.0)                           
##  Matrix                 * 1.2-18   2019-11-27 [1] CRAN (R 3.6.3)                           
##  matrixStats            * 0.56.0   2020-03-13 [1] CRAN (R 3.6.0)                           
##  memoise                  1.1.0    2017-04-21 [1] CRAN (R 3.6.0)                           
##  mime                     0.9      2020-02-04 [1] CRAN (R 3.6.0)                           
##  munsell                  0.5.0    2018-06-12 [1] CRAN (R 3.6.0)                           
##  openssl                  1.4.1    2019-07-18 [1] CRAN (R 3.6.0)                           
##  pillar                   1.4.3    2019-12-20 [1] CRAN (R 3.6.0)                           
##  pkgconfig                2.0.3    2019-09-22 [1] CRAN (R 3.6.1)                           
##  prettyunits              1.1.1    2020-01-24 [1] CRAN (R 3.6.2)                           
##  progress                 1.2.2    2019-05-16 [1] CRAN (R 3.6.0)                           
##  promises                 1.1.0    2019-10-04 [1] CRAN (R 3.6.0)                           
##  ProtGenerics             1.18.0   2019-10-29 [1] Bioconductor                             
##  purrr                    0.3.3    2019-10-18 [1] CRAN (R 3.6.0)                           
##  R.methodsS3              1.8.0    2020-02-14 [1] CRAN (R 3.6.0)                           
##  R.oo                     1.23.0   2019-11-03 [1] CRAN (R 3.6.0)                           
##  R.utils                  2.9.2    2019-12-08 [1] CRAN (R 3.6.1)                           
##  R6                       2.4.1    2019-11-12 [1] CRAN (R 3.6.1)                           
##  rappdirs                 0.3.1    2016-03-28 [1] CRAN (R 3.6.0)                           
##  Rcpp                     1.0.4    2020-03-17 [1] CRAN (R 3.6.0)                           
##  RCurl                    1.98-1.1 2020-01-19 [1] CRAN (R 3.6.0)                           
##  rhdf5                    2.30.1   2019-11-26 [1] Bioconductor                             
##  Rhdf5lib                 1.8.0    2019-10-29 [1] Bioconductor                             
##  rlang                    0.4.5    2020-03-01 [1] CRAN (R 3.6.0)                           
##  rmarkdown                2.1      2020-01-20 [1] CRAN (R 3.6.0)                           
##  Rsamtools                2.2.3    2020-02-23 [1] Bioconductor                             
##  RSQLite                  2.2.0    2020-01-07 [1] CRAN (R 3.6.0)                           
##  rstudioapi               0.11     2020-02-07 [1] CRAN (R 3.6.0)                           
##  rsvd                     1.0.3    2020-02-17 [1] CRAN (R 3.6.0)                           
##  rtracklayer              1.46.0   2019-10-29 [1] Bioconductor                             
##  S4Vectors              * 0.24.3   2020-01-18 [1] Bioconductor                             
##  scales                   1.1.0    2019-11-18 [1] CRAN (R 3.6.1)                           
##  scater                 * 1.14.6   2019-12-16 [1] Bioconductor                             
##  scran                  * 1.14.6   2020-02-03 [1] Bioconductor                             
##  scRNAseq               * 2.0.2    2019-11-12 [1] Bioconductor                             
##  sessioninfo              1.1.1    2018-11-05 [1] CRAN (R 3.6.0)                           
##  shiny                    1.4.0.2  2020-03-13 [1] CRAN (R 3.6.0)                           
##  SingleCellExperiment   * 1.8.0    2019-10-29 [1] Bioconductor                             
##  statmod                  1.4.34   2020-02-17 [1] CRAN (R 3.6.0)                           
##  stringi                  1.4.6    2020-02-17 [1] CRAN (R 3.6.0)                           
##  stringr                  1.4.0    2019-02-10 [1] CRAN (R 3.6.0)                           
##  SummarizedExperiment   * 1.16.1   2019-12-19 [1] Bioconductor                             
##  tibble                   2.1.3    2019-06-06 [1] CRAN (R 3.6.0)                           
##  tidyselect               1.0.0    2020-01-27 [1] CRAN (R 3.6.0)                           
##  vctrs                    0.2.4    2020-03-10 [1] CRAN (R 3.6.0)                           
##  VennDiagram              1.6.20   2018-03-28 [1] CRAN (R 3.6.0)                           
##  vipor                    0.4.5    2017-03-22 [1] CRAN (R 3.6.0)                           
##  viridis                  0.5.1    2018-03-29 [1] CRAN (R 3.6.0)                           
##  viridisLite              0.3.0    2018-02-01 [1] CRAN (R 3.6.0)                           
##  withr                    2.1.2    2018-03-15 [1] CRAN (R 3.6.0)                           
##  xaringan                 0.15     2020-03-04 [1] CRAN (R 3.6.3)                           
##  xaringanthemer         * 0.2.0    2020-03-22 [1] Github (gadenbuie/xaringanthemer@460f441)
##  xfun                     0.12     2020-01-13 [1] CRAN (R 3.6.0)                           
##  XML                      3.99-0.3 2020-01-20 [1] CRAN (R 3.6.0)                           
##  xtable                   1.8-4    2019-04-21 [1] CRAN (R 3.6.0)                           
##  XVector                  0.26.0   2019-10-29 [1] Bioconductor                             
##  yaml                     2.2.1    2020-02-01 [1] CRAN (R 3.6.0)                           
##  zlibbioc                 1.32.0   2019-10-29 [1] Bioconductor                             
## 
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
```

]]