Publications

BiocMAP: A Bioconductor-friendly, GPU-Accelerated Pipeline for Bisulfite-Sequencing Data

Background Bisulfite sequencing is a powerful tool for profiling genomic methylation, an epigenetic modification critical in the understanding of cancer, psychiatric disorders, and many other conditions. Raw data generated by whole genome bisulfite sequencing (WGBS) requires several computational steps before it is ready for statistical analysis, and particular care is required to process data in a timely and memory-efficient manner. Alignment to a reference genome is one of the most computationally demanding steps in a WGBS workflow, taking several hours or even days with commonly used WGBS-specific alignment software. This naturally motivates the creation of computational workflows that can utilize GPU-based alignment software to greatly speed up the bottleneck step. In addition, WGBS produces raw data that is large and often unwieldy; a lack of memory-efficient representation of data by existing pipelines renders WGBS impractical or impossible to many researchers. Results We present BiocMAP, a Bioconductor-friendly Methylation Analysis Pipeline consisting of two modules, to address the above concerns. The first module performs computationally-intensive read alignment using Arioc, a GPU-accelerated short-read aligner. The extraction module extracts and merges DNA methylation proportions - the fractions of methylated cytosines across all cells in a sample at a given genomic site. Since GPUs are not always available on the same computing environments where traditional CPU-based analyses are convenient, BiocMAP is split into two modules, with just the alignment module requiring an available GPU. Bioconductor-based output objects in R utilize an on-disk data representation to drastically reduce required main memory and make WGBS projects computationally feasible to more researchers. Conclusions BiocMAP is implemented using Nextflow and available at http://research.libd.org/BiocMAP/. 
To enable reproducible analysis across a variety of typical computing environments, BiocMAP can be containerized with Docker or Singularity, and executed locally or with the SLURM or SGE scheduling engines. By providing Bioconductor objects, BiocMAP’s output can be integrated with powerful analytical open source software for analyzing methylation data.
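As a rough illustration of the methylation proportion described in the abstract above, the following Python sketch computes it from per-site read counts. It is not part of BiocMAP: the function name, site labels, and counts are hypothetical, and it assumes each cytosine site has already been summarized as a count of reads supporting methylation and a count of reads supporting no methylation.

```python
from typing import Optional


def methylation_proportion(methylated: int, unmethylated: int) -> Optional[float]:
    """Fraction of reads at a site that support a methylated cytosine.

    Returns None when no reads cover the site, because the proportion
    is undefined at zero coverage.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("read counts must be non-negative")
    coverage = methylated + unmethylated
    if coverage == 0:
        return None
    return methylated / coverage


# Hypothetical counts per site: (methylated reads, unmethylated reads)
sites = {
    "chr1:1000": (18, 2),
    "chr1:1050": (0, 12),
    "chr1:2000": (0, 0),  # uncovered site
}
proportions = {
    site: methylation_proportion(m, u) for site, (m, u) in sites.items()
}
```

Returning `None` for uncovered sites, rather than zero, keeps "no data" distinct from "fully unmethylated" when proportions are merged across samples.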

Genetics and Brain Transcriptomics of Completed Suicide

Objective: The authors sought to study the transcriptomic and genomic features of completed suicide by parsing the method chosen, to capture molecular correlates of the distinctive frame of mind of individuals who die by suicide, while reducing heterogeneity. Methods: The authors analyzed gene expression (RNA sequencing) from postmortem dorsolateral prefrontal cortex of patients who died by suicide with violent compared with nonviolent means, nonsuicide patients with the same psychiatric disorders, and a neurotypical group (total N=329). They then examined genomic risk scores (GRSs) for each psychiatric disorder included, and GRSs for cognition (IQ) and for suicide attempt, testing how they predict diagnosis or traits (total N=888). Results: Patients who died by suicide by violent means showed a transcriptomic pattern remarkably divergent from each of the other patient groups but less from the neurotypical group; consistently, their genomic profile of risk was relatively low for their diagnosed illness as well as for suicide attempt, and relatively high for IQ: they were more similar to the neurotypical group than to other patients. Differentially expressed genes (DEGs) associated with patients who died by suicide by violent means pointed to purinergic signaling in microglia, showing similarities to a genome-wide association study of Drosophila aggression. Weighted gene coexpression network analysis revealed that these DEGs were coexpressed in a context of mitochondrial metabolic activation unique to suicide by violent means. Conclusions: These findings suggest that patients who die by suicide by violent means are in part biologically separable from other patients with the same diagnoses, and their behavioral outcome may be less dependent on genetic risk for conventional psychiatric disorders and be associated with an alteration of purinergic signaling and mitochondrial metabolism.

Amygdala and anterior cingulate transcriptomes from individuals with bipolar disorder reveal downregulated neuroimmune and synaptic pathways

Recent genetic studies have identified variants associated with bipolar disorder (BD), but it remains unclear how brain gene expression is altered in BD and how genetic risk for BD may contribute to these alterations. Here, we obtained transcriptomes from subgenual anterior cingulate cortex and amygdala samples from post-mortem brains of individuals with BD and neurotypical controls, including 511 total samples from 295 unique donors. We examined differential gene expression between cases and controls and the transcriptional effects of BD-associated genetic variants. We found two coexpressed modules that were associated with transcriptional changes in BD: one enriched for immune and inflammatory genes and the other with genes related to the postsynaptic membrane. Over 50% of BD genome-wide significant loci contained significant expression quantitative trait loci (eQTLs), and these data converged on several individual genes, including SCN2A and GRIN2A. Thus, these data implicate specific genes and pathways that may contribute to the pathology of BD.

Single-nucleus transcriptome analysis reveals cell-type-specific molecular signatures across reward circuitry in the human brain

Single-cell gene expression technologies are powerful tools to study cell types in the human brain, but efforts have largely focused on cortical brain regions. We therefore created a single-nucleus RNA-sequencing resource of 70,615 high-quality nuclei to generate a molecular taxonomy of cell types across five human brain regions that serve as key nodes of the human brain reward circuitry: nucleus accumbens, amygdala, subgenual anterior cingulate cortex, hippocampus, and dorsolateral prefrontal cortex. We first identified novel subpopulations of interneurons and medium spiny neurons (MSNs) in the nucleus accumbens and further characterized robust GABAergic inhibitory cell populations in the amygdala. Joint analyses across the 107 reported cell classes revealed cell-type substructure and unique patterns of transcriptomic dynamics. We identified discrete subpopulations of D1- and D2-expressing MSNs in the nucleus accumbens to which we mapped cell-type-specific enrichment for genetic risk associated with both psychiatric disease and addiction.

Genome-wide sequencing-based identification of methylation quantitative trait loci and their role in schizophrenia risk

DNA methylation (DNAm) is an epigenetic regulator of gene expression and a hallmark of gene-environment interaction. Using whole-genome bisulfite sequencing, we have surveyed DNAm in 344 samples of human postmortem brain tissue from neurotypical subjects and individuals with schizophrenia. We identify genetic influence on local methylation levels throughout the genome, both at CpG sites and CpH sites, with 86% of SNPs and 55% of CpGs being part of methylation quantitative trait loci (meQTLs). These associations can further be clustered into regions that are differentially methylated by a given SNP, highlighting the genes and regions with which these loci are epigenetically associated. These findings can be used to better characterize schizophrenia GWAS-identified variants as epigenetic risk variants. Regions differentially methylated by schizophrenia risk-SNPs explain much of the heritability associated with risk loci, despite covering only a fraction of the genomic space. We provide a comprehensive, single base resolution view of association between genetic variation and genomic methylation, and implicate schizophrenia GWAS-associated variants as influencing the epigenetic plasticity of the brain.

SPEAQeasy: a scalable pipeline for expression analysis and quantification for R/Bioconductor-powered RNA-seq analyses

Background. RNA sequencing (RNA-seq) is a common and widespread biological assay, and an increasing amount of data is generated with it. In practice, there are a large number of individual steps a researcher must perform before raw RNA-seq reads yield directly valuable information, such as differential gene expression data. Existing software tools are typically specialized, only performing one step–such as alignment of reads to a reference genome–of a larger workflow. The demand for a more comprehensive and reproducible workflow has led to the production of a number of publicly available RNA-seq pipelines. However, we have found that most require computational expertise to set up or share among several users, are not actively maintained, or lack features we have found to be important in our own analyses. Results. In response to these concerns, we have developed a Scalable Pipeline for Expression Analysis and Quantification (SPEAQeasy), which is easy to install and share, and provides a bridge towards R/Bioconductor downstream analysis solutions. SPEAQeasy is portable across computational frameworks (SGE, SLURM, local, docker integration) and different configuration files are provided (http://research.libd.org/SPEAQeasy/). Conclusions. SPEAQeasy is user-friendly and lowers the computational-domain entry barrier for biologists and clinicians to RNA-seq data processing as the main input file is a table with sample names and their corresponding FASTQ files. The goal is to provide a flexible pipeline that is immediately usable by researchers, regardless of their technical background or computing environment.

Detection of pathogenic splicing events from RNA-sequencing data using dasper

Although next-generation sequencing technologies have accelerated the discovery of novel gene-to-disease associations, many patients with suspected Mendelian diseases still leave the clinic without a genetic diagnosis. An estimated one third of these patients will have disorders caused by mutations impacting splicing. RNA-sequencing has been shown to be a promising diagnostic tool; however, few methods have been developed to integrate RNA-sequencing data into the diagnostic pipeline. Here, we introduce dasper, an R/Bioconductor package that improves upon existing tools for detecting aberrant splicing by using machine learning to incorporate disruptions in exon-exon junction counts as well as coverage. dasper is designed for diagnostics, providing a rank-based report of how aberrant each splicing event looks, as well as including visualization functionality to facilitate interpretation. We validate dasper using 16 patient-derived fibroblast cell lines harbouring pathogenic variants known to impact splicing. We find that dasper is able to detect pathogenic splicing events with greater accuracy than existing LeafCutterMD or z-score approaches. Furthermore, by only applying a broad OMIM gene filter (without any variant-level filters), dasper is able to detect pathogenic splicing events within the top 10 most aberrant identified for each patient. Since using publicly available control data minimises costs associated with incorporating RNA-sequencing into diagnostic pipelines, we also investigate the use of 504 GTEx fibroblast samples as controls. We find that dasper leverages publicly available data effectively, ranking pathogenic splicing events in the top 25. Thus, we believe dasper can increase diagnostic yield for pathogenic splicing variants and enable the efficient implementation of RNA-sequencing for diagnostics in clinical laboratories.

Developmental Profile of Psychiatric Risk Associated With Voltage-Gated Cation Channel Activity

Background. Recent breakthroughs in psychiatric genetics have implicated biological pathways onto which genetic risk for psychiatric disorders converges. However, these studies do not reveal the developmental time point(s) at which these pathways are relevant. Methods. We aimed to determine the relationship between psychiatric risk and developmental gene expression relating to discrete biological pathways. We used postmortem RNA sequencing data (BrainSeq and BrainSpan) from brain tissue at multiple prenatal and postnatal time points, with summary statistics from recent genome-wide association studies of schizophrenia, bipolar disorder, and major depressive disorder. We prioritized gene sets for overall enrichment of association with each disorder and then tested the relationship between the association of their constituent genes with their relative expression at each developmental stage. Results. We observed relationships between the expression of genes involved in voltage-gated cation channel activity during early midfetal, adolescence, and early adulthood time points and association with schizophrenia and bipolar disorder, such that genes more strongly associated with these disorders had relatively low expression during early midfetal development and higher expression during adolescence and early adulthood. The relationship with schizophrenia was strongest for the subset of genes related to calcium channel activity, while for bipolar disorder, the relationship was distributed between calcium and potassium channel activity genes. Conclusions. Our results indicate periods during development when biological pathways related to the activity of calcium and potassium channels may be most vulnerable to the effects of genetic variants conferring risk for psychiatric disorders. Furthermore, they indicate key time points and potential targets for disorder-specific therapeutic interventions.

Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex

We used the 10x Genomics Visium platform to define the spatial topography of gene expression in the six-layered human dorsolateral prefrontal cortex. We identified extensive layer-enriched expression signatures and refined associations to previous laminar markers. We overlaid our laminar expression signatures on large-scale single nucleus RNA-sequencing data, enhancing spatial annotation of expression-driven clusters. By integrating neuropsychiatric disorder gene sets, we showed differential layer-enriched expression of genes associated with schizophrenia and autism spectrum disorder, highlighting the clinical relevance of spatially defined expression. We then developed a data-driven framework to define unsupervised clusters in spatial transcriptomics data, which can be applied to other tissues or brain regions in which morphological architecture is not as well defined as cortical laminae. Last, we created a web application for the scientific community to explore these raw and summarized data to augment ongoing neuroscience and spatial transcriptomics research (http://research.libd.org/spatialLIBD).

Genetic and environmental regulation of caudate nucleus transcriptome: insight into schizophrenia risk and the dopamine system

Increased dopamine (DA) signaling in the striatum has been a cornerstone hypothesis about psychosis for over 50 years. Increased dopamine release results in psychotic symptoms, while D2 dopamine receptor (DRD2) antagonists are antipsychotic. Recent schizophrenia GWAS identified risk-associated common variants near the DRD2 gene, but the risk mechanism has been unclear. To gain novel insight into risk mechanisms underlying schizophrenia, we performed a comprehensive analysis of the genetic and transcriptional landscape of schizophrenia in postmortem caudate nucleus from a cohort of 444 individuals. Integrating expression quantitative trait loci (eQTL) analysis, transcriptome wide association study (TWAS), and differential expression analysis, we found many new genes associated with schizophrenia through genetic modulation of gene expression. Using a new approach based on deep neural networks, we construct caudate nucleus gene expression networks that highlight interactions involving schizophrenia risk. Interestingly, we found that genetic risk for schizophrenia is associated with decreased expression of the short isoform of DRD2, which encodes the presynaptic autoreceptor, and not with the long isoform, which encodes the postsynaptic receptor. This association suggests that decreased control of presynaptic DA release is a potential genetic mechanism of schizophrenia risk. Altogether, these analyses provide a new resource for the study of schizophrenia that can bring insight into risk mechanisms and potential novel therapeutic targets.

Characterizing the dynamic and functional DNA methylation landscape in the developing human cortex

DNA methylation (DNAm) is a key epigenetic regulator of gene expression across development. The developing prenatal brain is a highly dynamic tissue, but our understanding of key drivers of epigenetic variability across development is limited. We, therefore, assessed genomic methylation at over 39 million sites in the prenatal cortex using whole-genome bisulfite sequencing and found loci and regions in which methylation levels are dynamic across development. We saw that DNAm at these loci was associated with nearby gene expression and enriched for enhancer chromatin states in prenatal brain tissue. Additionally, these loci were enriched for genes associated with neuropsychiatric disorders and genes involved with neurogenesis. We also found autosomal differences in DNAm between the sexes during prenatal development, though these have less clear functional consequences. We lastly confirmed that the dynamic methylation at this critical period is specifically CpG methylation, with generally low levels of CpH methylation. Our findings provide detailed insight into prenatal brain development as well as clues to the pathogenesis of psychiatric traits seen later in life.

Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders

Growing evidence suggests that human gene annotation remains incomplete; however, it is unclear how this affects different tissues and our understanding of different disorders. Here, we detect previously unannotated transcription from Genotype-Tissue Expression RNA sequencing data across 41 human tissues. We connect this unannotated transcription to known genes, confirming that human gene annotation remains incomplete, even among well-studied genes including 63% of the Online Mendelian Inheritance in Man–morbid catalog and 317 neurodegeneration-associated genes. We find the greatest abundance of unannotated transcription in brain and genes highly expressed in brain are more likely to be reannotated. We explore examples of reannotated disease genes, such as SNCA, for which we experimentally validate a previously unidentified, brain-specific, potentially protein-coding exon. We release all tissue-specific transcriptomes through vizER: http://rytenlab.com/browser/app/vizER. We anticipate that this resource will facilitate more accurate genetic analysis, with the greatest impact on our understanding of Mendelian and complex neurogenetic disorders.

Regulatory sites for splicing in human basal ganglia are enriched for disease-relevant information

Genome-wide association studies have generated an increasing number of common genetic variants associated with neurological and psychiatric disease risk. An improved understanding of the genetic control of gene expression in human brain is vital considering this is the likely modus operandi for many causal variants. However, human brain sampling complexities limit the explanatory power of brain-related expression quantitative trait loci (eQTL) and allele-specific expression (ASE) signals. We address this, using paired genomic and transcriptomic data from putamen and substantia nigra from 117 human brains, interrogating regulation at different RNA processing stages and uncovering novel transcripts. We identify disease-relevant regulatory loci, find that splicing eQTLs are enriched for regulatory information of neuron-specific genes, that ASEs provide cell-specific regulatory information with evidence for cellular specificity, and that incomplete annotation of the brain transcriptome limits interpretation of risk loci for neuropsychiatric disease. This resource of regulatory data is accessible through our web server, http://braineacv2.inf.um.es/.

Recounting the FANTOM CAGE–Associated Transcriptome

Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and noncoding genes, (b) perform differential expression analysis across thirteen cancer types, identifying novel noncoding genes potentially involved in tumor pathogenesis and progression, and (c) confirm the prognostic value of the expression of several enhancer lncRNAs in cancer. Our resource is instrumental for the systematic molecular characterization of lncRNA by the FANTOM6 Consortium. In conclusion, comprising over 70,000 samples, the FC-R2 atlas will empower other researchers to investigate functions and biological roles of both known coding genes and novel lncRNAs.

Dissecting transcriptomic signatures of neuronal differentiation and maturation using iPSCs

Human induced pluripotent stem cells (hiPSCs) are a powerful model of neural differentiation and maturation. We present a hiPSC transcriptomics resource on corticogenesis from 5 iPSC donor and 13 subclonal lines across 9 time points over 5 broad conditions: self-renewal, early neuronal differentiation, neural precursor cells (NPCs), assembled rosettes, and differentiated neuronal cells. We identify widespread changes in the expression of both individual features and global patterns of transcription. We next demonstrate that co-culturing human NPCs with rodent astrocytes results in mutually synergistic maturation, and that cell type-specific expression data can be extracted using only sequencing read alignments without cell sorting. We lastly adapt a previously generated RNA deconvolution approach to single-cell expression data to estimate the relative neuronal maturity of iPSC-derived neuronal cultures and human brain tissue. Using many public datasets, we demonstrate neuronal cultures are maturationally heterogeneous but contain subsets of neurons more mature than previously observed.

Divergent neuronal DNA methylation patterns across human cortical development reveal critical periods and a unique role of CpH methylation

Background: DNA methylation (DNAm) is a critical regulator of both development and cellular identity and shows unique patterns in neurons. To better characterize maturational changes in DNAm patterns in these cells, we profile the DNAm landscape at single-base resolution across the first two decades of human neocortical development in NeuN+ neurons using whole-genome bisulfite sequencing and compare them to non-neurons (primarily glia) and prenatal homogenate cortex. Results: We show that DNAm changes more dramatically during the first 5 years of postnatal life than during the entire remaining period. We further refine global patterns of increasingly divergent neuronal CpG and CpH methylation (mCpG and mCpH) into six developmental trajectories and find that in contrast to genome-wide patterns, neighboring mCpG and mCpH levels within these regions are highly correlated. We integrate paired RNAseq data and identify putative regulation of hundreds of transcripts and their splicing events exclusively by mCpH levels, independently from mCpG levels, across this period. We finally explore the relationship between DNAm patterns and development of brain-related phenotypes and find enriched heritability for many phenotypes within identified DNAm features. Conclusions: By profiling DNAm changes in NeuN-sorted neurons over the span of human cortical development, we identify novel, dynamic regions of DNAm that would be masked in homogenate DNAm data; expand on the relationship between CpG methylation, CpH methylation, and gene expression; and find enrichment particularly for neuropsychiatric diseases in genomic regions with cell type-specific, developmentally dynamic DNAm patterns. Keywords: DNA methylation, Neurodevelopment, Gene expression, Non-CpG methylation.

Comprehensive assessment of multiple biases in small RNA sequencing reveals significant differences in the performance of widely used methods

Background: RNA sequencing offers advantages over other quantification methods for microRNA (miRNA), yet numerous biases make reliable quantification challenging. Previous evaluations of these biases have focused on adapter ligation bias with limited evaluation of reverse transcription bias or amplification bias. Furthermore, evaluations of the quantification of isomiRs (miRNA isoforms) or the influence of starting amount on performance have been very limited. No study had yet evaluated the quantification of isomiRs of altered length or compared the consistency of results derived from multiple moderate starting inputs. We therefore evaluated quantifications of miRNA and isomiRs using four library preparation kits, with various starting amounts, as well as quantifications following removal of duplicate reads using unique molecular identifiers (UMIs) to mitigate reverse transcription and amplification biases. Results: All methods resulted in false isomiR detection; however, the adapter-free method tested was especially prone to false isomiR detection. We demonstrate that using UMIs improves accuracy and we provide a guide for input amounts to improve consistency. Conclusions: Our data show differences and limitations of current methods, thus raising concerns about the validity of quantification of miRNA and isomiRs across studies. We advocate for the use of UMIs to improve accuracy and reliability of miRNA quantifications.
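The duplicate-removal idea mentioned in the abstract above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the study's pipeline: reads are assumed to arrive as (insert sequence, UMI) pairs, duplicates are collapsed by exact match only (no correction for sequencing errors in the UMI), and the function name and example reads are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_unique_molecules(reads: Iterable[Tuple[str, str]]) -> Counter:
    """Count distinct UMIs observed for each insert sequence.

    Reads that share both the insert sequence and the UMI are treated as
    amplification duplicates of one original molecule and counted once.
    """
    unique_pairs = set(reads)
    return Counter(sequence for sequence, _umi in unique_pairs)


# Hypothetical reads: (insert sequence, UMI)
reads = [
    ("TGAGGTAGTAGGTTGTATAGTT", "ACGT"),
    ("TGAGGTAGTAGGTTGTATAGTT", "ACGT"),  # same sequence and UMI: duplicate
    ("TGAGGTAGTAGGTTGTATAGTT", "GGCA"),  # same sequence, different UMI
    ("TAGCTTATCAGACTGATGTTGA", "ACGT"),
]

raw_counts = Counter(sequence for sequence, _umi in reads)
dedup_counts = count_unique_molecules(reads)
```

Comparing `raw_counts` with `dedup_counts` shows how much of a sequence's apparent abundance comes from amplification rather than from distinct starting molecules.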

Regional heterogeneity in gene expression, regulation, and coherence in the frontal cortex and hippocampus across development and schizophrenia

The hippocampus formation, although prominently implicated in schizophrenia pathogenesis, has been overlooked in large-scale genomics efforts in the schizophrenic brain. We performed RNA-seq in hippocampi and dorsolateral prefrontal cortices (DLPFCs) from 551 individuals (286 with schizophrenia). We identified substantial regional differences in gene expression and found widespread developmental differences that were independent of cellular composition. We identified 48 and 245 differentially expressed genes (DEGs) associated with schizophrenia within the hippocampus and DLPFC, with little overlap between the brain regions. 124 of 163 (76.6%) of schizophrenia GWAS risk loci contained eQTLs in any region. Transcriptome-wide association studies in each region identified many novel schizophrenia risk features that were brain region-specific. Last, we identified potential molecular correlates of in vivo evidence of altered prefrontal-hippocampal functional coherence in schizophrenia. These results underscore the complexity and regional heterogeneity of the transcriptional correlates of schizophrenia and offer new insights into potentially causative biology.

recount-brain: a curated repository of human brain RNA-seq datasets metadata

The usability of publicly-available gene expression data is often limited by the availability of high-quality, standardized biological phenotype and experimental condition information (“metadata”). We released the recount2 project, which involved re-processing ∼70,000 samples in the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx), and The Cancer Genome Atlas (TCGA) projects. While samples from the latter two projects are well-characterized with extensive metadata, the ∼50,000 RNA-seq samples from SRA in recount2 are inconsistently annotated with metadata. Tissue type, sex, and library type can be estimated from the RNA sequencing (RNA-seq) data itself. However, more detailed and harder-to-predict metadata, like age and diagnosis, must ideally be provided by labs that deposit the data. To facilitate more analyses within human brain tissue data, we have complemented phenotype predictions by manually constructing a uniformly-curated database of public RNA-seq samples present in SRA and recount2. We describe the reproducible curation process for constructing recount-brain that involves systematic review of the primary manuscript, which can serve as a guide to annotate other studies and tissues. We further expanded recount-brain by merging it with GTEx and TCGA brain samples as well as linking to controlled vocabulary terms for tissue, Brodmann area and disease. Furthermore, we illustrate how to integrate the sample metadata in recount-brain with the gene expression data in recount2 to perform differential expression analysis. We then provide three analysis examples involving modeling postmortem interval, glioblastoma, and meta-analyses across GTEx and TCGA. Overall, recount-brain facilitates expression analyses and improves their reproducibility as individual researchers do not have to manually curate the sample metadata.
recount-brain is available via the add_metadata() function from the recount Bioconductor package at bioconductor.org/packages/recount.

Integrated Transcriptomic and Proteomic Analysis of Primary Human Umbilical Vein Endothelial Cells

Understanding the molecular profile of every human cell type is essential for understanding its role in normal physiology and disease. Technological advancements in DNA sequencing, mass spectrometry, and computational methods allow us to carry out multiomics analyses, although such approaches are not yet routine. Human umbilical vein endothelial cells (HUVECs) are a widely used model system to study pathological and physiological processes associated with the cardiovascular system. In this study, next-generation sequencing and high-resolution mass spectrometry are employed to profile the transcriptome and proteome of primary HUVECs. Analysis of 145 million paired-end reads from next-generation sequencing confirmed expression of 12,186 protein-coding genes (FPKM >= 0.1) and 439 novel long non-coding RNAs, and revealed 6089 novel isoforms that were not annotated in GENCODE. Proteomics analysis identifies 6477 proteins, including confirmation of N-termini for 1091 proteins, isoforms for 149 proteins, and 1034 phosphosites. A database search to specifically identify other post-translational modifications provides evidence for a number of modification sites on 117 proteins, which include ubiquitylation, lysine acetylation, and mono-, di- and tri-methylation events. Evidence for 11 ‘missing proteins’, which are proteins for which there was insufficient or no protein level evidence, is provided. Peptides supporting missing protein and novel events are validated by comparison of MS/MS fragmentation patterns with synthetic peptides. Finally, 245 variant peptides derived from 207 expressed proteins, in addition to alternate translational start sites for seven proteins and evidence for novel proteoforms for five proteins resulting from alternative splicing, are identified. Overall, it is believed that the integrated approach employed in this study is widely applicable to study any primary cell type for deeper molecular characterization.
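
The FPKM >= 0.1 expression threshold mentioned in this abstract can be illustrated with a short Python sketch. It assumes the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments); the gene names, counts, lengths, and library size are hypothetical and are not taken from the study, and the study's actual quantification software is not described here.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Standard FPKM: fragments * 1e9 / (transcript length in bp * total mapped fragments)."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def expressed_genes(counts, lengths, total_mapped_fragments, threshold=0.1):
    """Return {gene: FPKM} for genes whose FPKM meets the threshold."""
    result = {}
    for gene, n_fragments in counts.items():
        value = fpkm(n_fragments, lengths[gene], total_mapped_fragments)
        if value >= threshold:
            result[gene] = value
    return result


# Hypothetical toy data: fragment counts and transcript lengths (bp).
counts = {"GENE_A": 500, "GENE_B": 0, "GENE_C": 1}
lengths = {"GENE_A": 2_000, "GENE_B": 1_500, "GENE_C": 100_000}

expressed = expressed_genes(counts, lengths, total_mapped_fragments=10_000_000)
print(expressed)
```

In this toy example only GENE_A passes the filter: a single fragment on a 100 kb transcript in a library of 10 million fragments gives an FPKM well below 0.1.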

Integrated DNA methylation and gene expression profiling across multiple brain regions implicate novel genes in Alzheimer's disease

Late-onset Alzheimer’s disease (AD) is a complex age-related neurodegenerative disorder that likely involves epigenetic factors. To better understand the epigenetic state associated with AD, we surveyed 420,852 DNA methylation (DNAm) sites from neurotypical controls (N=49) and late-onset AD patients (N=24) across four brain regions (hippocampus, entorhinal cortex, dorsolateral prefrontal cortex and cerebellum). We identified 858 sites with robust differential methylation collectively annotated to 772 possible genes (FDR<5%, within 10 kb). These sites were overrepresented in AD genetic risk loci (p=0.00655) and were enriched for changes during normal aging (p<2.2×10^−16), and nearby genes were enriched for processes related to cell adhesion, immunity, and calcium homeostasis (FDR<5%). To functionally validate these associations, we generated and analyzed corresponding transcriptome data to prioritize 130 genes within 10 kb of the differentially methylated sites. These 130 genes were differentially expressed between AD cases and controls and their expression was associated with nearby DNAm (p<0.05). This integrated analysis implicates novel genes in Alzheimer’s disease, such as ANKRD30B. These results highlight DNAm differences in Alzheimer’s disease that have gene expression correlates, further implicating DNAm as an epigenetic mechanism underlying pathological molecular changes associated with AD. Furthermore, our framework illustrates the value of integrating epigenetic and transcriptomic data for understanding complex disease.
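
The "within 10 kb" annotation of methylation sites to genes can be sketched as a simple window lookup. The Python example below is only an illustration under stated assumptions: the gene names and coordinates are hypothetical, the function `genes_within` is invented for this sketch, and the abstract does not describe the annotation procedure the authors actually used.

```python
def genes_within(site_chrom, site_pos, genes, window=10_000):
    """Return names of genes whose body lies within `window` bp of a site.

    `genes` is an iterable of (name, chrom, start, end) tuples with start <= end.
    A site inside a gene body has distance 0.
    """
    hits = []
    for name, chrom, start, end in genes:
        if chrom != site_chrom:
            continue
        # Distance to the nearest gene boundary, or 0 if the site is inside the gene.
        distance = max(start - site_pos, 0, site_pos - end)
        if distance <= window:
            hits.append(name)
    return hits


# Hypothetical gene models (not real annotation).
genes = [
    ("GENE_A", "chr1", 50_000, 60_000),
    ("GENE_B", "chr1", 80_000, 90_000),
    ("GENE_C", "chr2", 50_000, 60_000),
]

near_a = genes_within("chr1", 68_000, genes)   # 8 kb downstream of GENE_A only
inside_a = genes_within("chr1", 55_000, genes) # inside GENE_A
both = genes_within("chr1", 70_000, genes)     # exactly 10 kb from GENE_A and GENE_B
print(near_a, inside_a, both)
```

A single site can map to more than one gene under this rule, which is consistent with the abstract's wording of sites "collectively annotated" to possible genes.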

Non-coding Class Switch Recombination-Related Transcription in Human Normal and Pathological Immune Responses

Background: Antibody class switch recombination (CSR) to IgG, IgA or IgE is a hallmark of adaptive immunity, allowing antibody function diversification beyond IgM. CSR involves a deletion of the IgM/IgD constant region genes, placing a new acceptor constant (CH) gene downstream of the VDJH exon. CSR depends on non-coding (CSRnc) transcription of donor Iμ and acceptor IH exons, located 5′ upstream of each CH coding gene. Although our knowledge of the role of CSRnc transcription has advanced greatly, little is known about its extent and importance in healthy and diseased humans. Methods: We analyzed CSRnc transcription in 70,603 publicly available RNA-seq samples, including GTEx, TCGA and the Sequence Read Archive (SRA), using recount2, an online resource consisting of normalized RNA-seq gene and exon counts, as well as coverage BigWig files that can be programmatically accessed through R. CSRnc transcription was validated with a qRT-PCR assay for Iμ, Iγ1 and Iγ3 in humans in response to vaccination. Results: We mapped IH transcription for the human IgH locus, including the less understood IGHD gene. CSRnc transcription was restricted to B cells and was widely distributed in normal adult tissues, but predominant in blood, spleen, MALT-containing tissues, visceral adipose tissue and some so-called ‘immune privileged’ tissues. However, significant Iγ4 expression was found even in non-lymphoid fetal tissues. CSRnc expression in cancer tissues mimicked the expression of their normal counterparts, with notable pattern changes in some common cancer subsets. CSRnc transcription in tumors appears to result from tumor infiltration by B cells, since CSRnc transcription was not detected in corresponding tumor-derived immortal cell lines. Additionally, significantly increased Iδ transcription in ileal mucosa in Crohn’s disease with ulceration was found.
Conclusions: CSRnc transcription occurs in multiple anatomical locations beyond classical secondary lymphoid organs, representing a potentially useful marker of effector B cell responses in normal and pathological immune responses. The pattern of IH exon expression may reveal clues of the local immune response (i.e. cytokine milieu) in health and disease. This is a great example of how the public recount2 data can be used to further our understanding of transcription, including regions outside the known transcriptome.

Developmental and genetic regulation of the human cortex transcriptome illuminate schizophrenia pathogenesis

Genome-wide association studies have identified 108 schizophrenia risk loci, but biological mechanisms for individual loci are largely unknown. Using developmental, genetic and illness-based RNA sequencing expression analysis in human brain, we characterized the human brain transcriptome around these loci and found enrichment for developmentally regulated genes with novel examples of shifting isoform usage across pre- and postnatal life. We found widespread expression quantitative trait loci (eQTLs), including many with transcript specificity and previously unannotated sequence that were independently replicated. We leveraged this general eQTL database to show that 48.1% of risk variants for schizophrenia associate with nearby expression. We lastly found 237 genes significantly differentially expressed between patients and controls, which replicated in an independent dataset, implicated synaptic processes, and were strongly regulated in early development. These findings together offer genetics- and diagnosis-related targets for better modeling of schizophrenia risk. This resource is publicly available at http://eqtl.brainseq.org/phase1.

Improving the value of public RNA-seq expression data by phenotype prediction

Background: Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. Results: We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements, using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project (https://jhubiostatistics.shinyapps.io/recount/). We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package (https://github.com/leekgroup/phenopredict) and the predictions for recount2 are available from the recount R package (https://bioconductor.org/packages/release/bioc/html/recount.html). Conclusion: By leveraging massive public data sets to generate a well-phenotyped set of expression data for more than 70,000 human samples, we make expression data available for use on a scale that was not previously feasible.

Fast Annotation-Agnostic Differential Expression Analysis

Interspecies interactions that result in Bacillus subtilis forming biofilms are mediated mainly by members of its own genus

Many different systems of bacterial interactions have been described. However, relatively few studies have explored how interactions between different microorganisms might influence bacterial development. To explore such interspecies interactions, we focused on Bacillus subtilis, which characteristically develops into matrix-producing cannibals before entering sporulation. We investigated whether organisms from the natural environment of B. subtilis, the soil, were able to alter the development of B. subtilis. To test this possibility, we developed a coculture microcolony screen in which we used fluorescent reporters to identify soil bacteria able to induce matrix production in B. subtilis. Most of the bacteria that influence matrix production in B. subtilis are members of the genus Bacillus, suggesting that such interactions may be predominantly with close relatives. The interactions we observed were mediated via two different mechanisms. One resulted in increased expression of matrix genes via the activation of a sensor histidine kinase, KinD. The second was kinase independent and conceivably functions by altering the relative subpopulations of B. subtilis cell types by preferentially killing noncannibals. These two mechanisms were grouped according to the inducing strain’s relatedness to B. subtilis. Our results suggest that bacteria preferentially alter their development in response to secreted molecules from closely related bacteria and do so using mechanisms that depend on the phylogenetic relatedness of the interacting bacteria.

RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units)

RegulonDB (http://regulondb.ccg.unam.mx/) is the primary reference database of the best-known regulatory network of any free-living organism, that of Escherichia coli K-12. The major conceptual change over the past 3 years is an expanded biological context, so that transcriptional regulation is now part of a unit that initiates with the signal and continues with the signal transduction to the core of regulation, modifying expression of the affected target genes responsible for the response. We call these genetic sensory response units, or Gensor Units. We have initiated their high-level curation, with graphic maps and superreactions with links to other databases. Additional connectivity uses expandable submaps. RegulonDB has summaries for every transcription factor (TF) and TF-binding sites with internal symmetry. Several DNA-binding motifs and their sizes have been redefined and relocated. In addition to data from the literature, we have incorporated our own information on transcription start sites (TSSs) and transcriptional units (TUs), obtained by using high-throughput whole-genome sequencing technologies. A new portable drawing tool for genomic features is also now available, as well as new ways to download the data, including web services, files for several relational database management systems and text files including BioPAX format.

Global Analysis of Transcription Start Sites and Transcription Units in Bacterial Genomes