The usability of publicly-available gene expression data is often limited by the availability of high-quality, standardized biological phenotype and experimental condition information (“metadata”). We released the recount2 project, which involved re-processing ∼70,000 samples in the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx), and The Cancer Genome Atlas (TCGA) projects. While samples from the latter two projects are well-characterized with extensive metadata, the ∼50,000 RNA-seq samples from SRA in recount2 are inconsistently annotated with metadata. Tissue type, sex, and library type can be estimated from the RNA sequencing (RNA-seq) data itself. However, more detailed and harder-to-predict metadata, like age and diagnosis, must ideally be provided by labs that deposit the data. To facilitate more analyses within human brain tissue data, we have complemented phenotype predictions by manually constructing a uniformly-curated database of public RNA-seq samples present in SRA and recount2. We describe the reproducible curation process for constructing recount-brain that involves systematic review of the primary manuscript, which can serve as a guide to annotate other studies and tissues. We further expanded recount-brain by merging it with GTEx and TCGA brain samples as well as linking to controlled vocabulary terms for tissue, Brodmann area and disease. Furthermore, we illustrate how to integrate the sample metadata in recount-brain with the gene expression data in recount2 to perform differential expression analysis. We then provide three analysis examples involving modeling postmortem interval, glioblastoma, and meta-analyses across GTEx and TCGA. Overall, recount-brain facilitates expression analyses and improves their reproducibility as individual researchers do not have to manually curate the sample metadata. recount-brain is available via the add_metadata() function from the recount Bioconductor package at

Genome-wide association studies have generated an increasing number of common genetic variants that affect neurological and psychiatric disease risk. Given that many causal variants are likely to operate by regulating gene expression, an improved understanding of the genetic control of gene expression in human brain is vital. However, the difficulties of sampling human brain, and its complexity, have meant that brain-related expression quantitative trait loci (eQTL) and allele specific expression (ASE) signals have been more limited in their explanatory power than might otherwise be expected. To address this, we use paired genomic and transcriptomic data from putamen and substantia nigra dissected from 117 brains, combined with a comprehensive set of analyses, to interrogate regulation at different stages of RNA processing and uncover novel transcripts. We identify disease-relevant regulatory loci and reveal the types of analyses and regulatory positions yielding the most disease-specific information. We find that splicing eQTLs are enriched for neuron-specific regulatory information; that ASE analyses provide highly cell-specific regulatory information; and that incomplete annotation of the brain transcriptome limits the interpretation of risk loci for neuropsychiatric disease. We release this rich resource of regulatory data through a searchable webserver,

Late-onset Alzheimer’s disease (AD) is a complex age-related neurodegenerative disorder that likely involves epigenetic factors. To better understand the epigenetic state associated with AD, we surveyed 420,852 DNA methylation (DNAm) sites from neurotypical controls (N=49) and late-onset AD patients (N=24) across four brain regions (hippocampus, entorhinal cortex, dorsolateral prefrontal cortex and cerebellum). We identified 858 sites with robust differential methylation collectively annotated to 772 possible genes (FDR<5%, within 10 kb). These sites were overrepresented in AD genetic risk loci (p=0.00655) and were enriched for changes during normal aging (p<2.2×10^−16), and nearby genes were enriched for processes related to cell-adhesion, immunity, and calcium homeostasis (FDR<5%). To functionally validate these associations, we generated and analyzed corresponding transcriptome data to prioritize 130 genes within 10 kb of the differentially methylated sites. These 130 genes were differentially expressed between AD cases and controls and their expression was associated with nearby DNAm (p<0.05). This integrated analysis implicates novel genes in Alzheimer’s disease, such as ANKRD30B. These results highlight DNAm differences in Alzheimer’s disease that have gene expression correlates, further implicating DNAm as an epigenetic mechanism underlying pathological molecular changes associated with AD. Furthermore, our framework illustrates the value of integrating epigenetic and transcriptomic data for understanding complex disease.

Although the increasing use of whole-exome and whole-genome sequencing has improved the yield of genetic testing for Mendelian disorders, an estimated 50% of patients still leave the clinic without a genetic diagnosis. This can be attributed in part to our limited ability to accurately interpret the genetic variation detected through next-generation sequencing. Variant interpretation is fundamentally reliant on accurate and complete gene annotation; however, numerous reports and discrepancies between gene annotation databases reveal that the knowledge of gene annotation remains far from comprehensive. Here, we detect and validate transcription in an annotation-agnostic manner across all 41 different GTEx tissues, then connect novel transcription to known genes, ultimately improving the annotation of 63% of the known OMIM-morbid genes. We find the majority of novel transcription to be tissue-specific in origin, with brain tissues being most susceptible to misannotation. Furthermore, we find that novel transcribed regions tend to be poorly conserved, but are significantly depleted for genetic variation within humans, suggesting they are functionally significant and potentially have human-specific functions. We present our findings through an online platform vizER, which enables individual genes to be visualised and queried for evidence of misannotation. We also release all tissue-specific transcriptomes in a BED format for ease of integration with whole-genome sequencing data. We anticipate that these resources will improve the diagnostic yield for a wide range of Mendelian disorders.

Background: Antibody class switch recombination (CSR) to IgG, IgA or IgE is a hallmark of adaptive immunity, allowing antibody function diversification beyond IgM. CSR involves a deletion of the IgM/IgD constant region genes, placing a new acceptor constant (CH) gene downstream of the VDJH exon. CSR depends on non-coding (CSRnc) transcription of donor Iμ and acceptor IH exons, located 5′ upstream of each CH coding gene. Although our knowledge of the role of CSRnc transcription has advanced greatly, information on its extent and importance in healthy and diseased humans is scarce. Methods: We analyzed CSRnc transcription in 70,603 publicly available RNA-seq samples, including GTEx, TCGA and the Sequence Read Archive (SRA) using recount2, an online resource consisting of normalized RNA-seq gene and exon counts, as well as coverage BigWig files that can be programmatically accessed through R. CSRnc transcription was validated with a qRT-PCR assay for Iμ, Iγ1 and Iγ3 in humans in response to vaccination. Results: We mapped IH transcription for the human IgH locus, including the less understood IGHD gene. CSRnc transcription was restricted to B cells and was widely distributed in normal adult tissues, but predominant in blood, spleen, MALT-containing tissues, visceral adipose tissue and some so-called ‘immune privileged’ tissues. However, significant Iγ4 expression was found even in non-lymphoid fetal tissues. CSRnc expression in cancer tissues mimicked the expression of their normal counterparts, with notable pattern changes in some common cancer subsets. CSRnc transcription in tumors appears to result from tumor infiltration by B cells, since CSRnc transcription was not detected in corresponding tumor-derived immortal cell lines. Additionally, significantly increased Iδ transcription in ileal mucosa in Crohn’s disease with ulceration was found.
Conclusions: CSRnc transcription occurs in multiple anatomical locations beyond classical secondary lymphoid organs, representing a potentially useful marker of effector B cell responses in normal and pathological immune responses. The pattern of IH exon expression may reveal clues of the local immune response (i.e. cytokine milieu) in health and disease. This is a great example of how the public recount2 data can be used to further our understanding of transcription, including regions outside the known transcriptome.

High-throughput sequencing offers advantages over other quantification methods for microRNA (miRNA), yet numerous biases make reliable quantification challenging. Previous evaluations of reverse transcription or amplification bias in small RNA sequencing have been limited. Furthermore, little work has evaluated isomiR (miRNA isoform) quantifications or the influence of starting amount on performance. We therefore evaluated quantifications of canonical miRNA and isomiRs using four library preparation kits, with various starting amounts (100ng to 2000ng), as well as quantifications following the removal of duplicate reads using unique molecular identifiers (UMIs) to mitigate reverse transcription and amplification biases. Randomized adapter and adapter-free methods mitigated bias; however, the adapter-free method was especially prone to false isomiR detection. We demonstrate that using UMIs improves accuracy and we provide a guide for input amounts to improve consistency. Our data show differences and limitations of current methods, thus raising concerns about the validity of quantification of miRNA and isomiRs.
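
The idea of removing duplicate reads with UMIs can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the read sequences, the UMIs, and the function names `count_raw` and `count_with_umis` are hypothetical, and the sketch assumes that reads sharing both sequence and UMI are amplification duplicates of one molecule.

```python
from collections import Counter

def count_raw(reads):
    """Count every read per sequence, duplicates included."""
    return Counter(seq for seq, _umi in reads)

def count_with_umis(reads):
    """Count molecules per sequence after collapsing identical (sequence, UMI) pairs."""
    unique_molecules = set(reads)  # each (sequence, UMI) pair is kept once
    return Counter(seq for seq, _umi in unique_molecules)

# Toy reads as (sequence, UMI) pairs; all values are made up for illustration.
canonical = "TGAGGTAGTAGGTTGTATAGTT"
isomir = "TGAGGTAGTAGGTTGTATAGT"  # same sequence, one nucleotide shorter at the 3' end
reads = [
    (canonical, "ACGT"),
    (canonical, "ACGT"),  # same sequence and UMI: treated as a duplicate
    (canonical, "TTAG"),  # same sequence, different UMI: a distinct molecule
    (isomir, "ACGT"),
    (isomir, "ACGT"),     # duplicate of the isomiR read
]

raw = count_raw(reads)
dedup = count_with_umis(reads)
print(raw[canonical], raw[isomir])
print(dedup[canonical], dedup[isomir])
```

Exact matching is the simplest possible rule. It ignores sequencing errors within the UMI itself, which dedicated tools typically handle by merging UMIs that differ by a small edit distance.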

We characterized the landscape of DNA methylation (DNAm) across the first two decades of human neocortical development in neurons and glia using whole-genome bisulfite sequencing. We show that the rate of DNAm changes more dramatically during the first five years of postnatal life than during the entire remaining period. We further refined global patterns of increasingly divergent neuronal CpG and CpH methylation (mCpG and mCpH) into six unique developmental trajectories, within which neighboring mC levels - independent of sequence context - were highly correlated, unlike across the genome, where mCpG levels were correlated but mCpH levels were not. We then integrated paired RNA-seq data and identified direct regulation of hundreds of transcripts and their splicing events exclusively by mCpH, independent of mCpG levels, across this period of human cortical development. In addition to expanding the relationship between mCpH and gene expression, these splicing-associated cytosines and developmentally dynamic DNAm regions were associated with neuropsychiatric disease risk sequence, providing insight into the cell type and timing of dynamic epigenomic remodeling in known disease risk genes and loci.

Recent large-scale genomics efforts have better characterized the molecular correlates of schizophrenia in postmortem human neocortex, but not hippocampus, which is a brain region prominently implicated in its pathogenesis. Here in the second phase of the BrainSeq Consortium (Phase II), we have generated RiboZero RNA-seq data for 900 samples across both the dorsolateral prefrontal cortex (DLPFC) and the hippocampus (HIPPO) for 551 individuals (286 affected by schizophrenia disorder: SCZD). We identify substantial regional differences in gene expression, in both pre- and post-natal life, and find widespread differences in how genes are regulated across development. By extending quality surrogate variable analysis (qSVA) to multiple brain regions, we identified 48 and 245 differentially expressed genes (DEG) by SCZD diagnosis (FDR<5%) in HIPPO and DLPFC, respectively, with surprisingly minimal overlap in DEG between the two brain regions. While there were widespread eQTLs in both brain regions, we identified 205,618 brain region-dependent eQTLs (FDR<1%). We further found that 124/163 (76.6%) schizophrenia GWAS risk loci contained eQTLs in at least one of the regions. We lastly identified potential molecular correlates of in vivo evidence of altered prefrontal-hippocampal functional coherence in schizophrenia. These results underscore the complexity and regional heterogeneity of the transcriptional correlates of schizophrenia, and suggest future schizophrenia therapeutics may need to target molecular pathologies localized to specific brain regions.

This paper is on smoking and its relation to gene expression at different life stages. These genes are also related to autism spectrum disorder.

Human induced pluripotent stem cells (hiPSCs) are a powerful model of neural differentiation and maturation. We present a hiPSC transcriptomics resource on corticogenesis from 5 iPSC donor and 13 subclonal lines across nine time points over 5 broad conditions: self-renewal, early neuronal differentiation, neural precursor cells (NPCs), assembled rosettes, and differentiated neuronal cells that were validated using electrophysiology. We identified widespread changes in the expression of individual transcript features and their splice variants, gene networks, and global patterns of transcription. We next demonstrated that co-culturing human NPCs with rodent astrocytes resulted in mutually synergistic maturation, and that cell type-specific expression data can be extracted using only sequencing read alignments without potentially disruptive cell sorting. We lastly developed and validated a computational tool to estimate the relative neuronal maturity of iPSC-derived neuronal cultures and human brain tissue, which were maturationally heterogeneous but contained subsets of cells most akin to adult human neurons.

Genome-wide association studies have identified 108 schizophrenia risk loci, but biological mechanisms for individual loci are largely unknown. Using developmental, genetic and illness-based RNA sequencing expression analysis in human brain, we characterized the human brain transcriptome around these loci and found enrichment for developmentally regulated genes with novel examples of shifting isoform usage across pre- and postnatal life. We found widespread expression quantitative trait loci (eQTLs), including many with transcript specificity and previously unannotated sequence that were independently replicated. We leveraged this general eQTL database to show that 48.1% of risk variants for schizophrenia associate with nearby expression. We lastly found 237 genes significantly differentially expressed between patients and controls, which replicated in an independent dataset, implicated synaptic processes, and were strongly regulated in early development. These findings together offer genetics- and diagnosis-related targets for better modeling of schizophrenia risk. This resource is publicly available at

Background: Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. Results: We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. Conclusion: Having leveraged massive public data sets to generate a well-phenotyped set of expression data for more than 70,000 human samples, expression data is available for use on a scale that was not previously feasible.

More than 70,000 short-read RNA-sequencing samples are publicly available through the recount2 project, a curated database of summary coverage data. However, no current methods can estimate transcript-level abundances using the reduced-representation information stored in this database. Here we present a linear model utilizing coverage of junctions and subdivided exons to generate transcript abundance estimates of comparable accuracy to those obtained from methods requiring read-level data. Our approach flexibly models bias, produces standard errors, and is easy to refresh given updated annotation. We illustrate our method on simulated and real data and release transcript abundance estimates for the samples in recount2.
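
A toy version of such a linear model can be sketched in Python. This is only an illustration under strong assumptions, not the published method: it uses ordinary least squares on a hypothetical gene with two transcripts, noise-free coverage, and no bias terms or standard errors. The feature layout, the coverage values, and the function name `ols_two_transcripts` are all invented for the example.

```python
def ols_two_transcripts(design, coverage):
    """Least-squares abundances for two transcripts via the 2x2 normal equations.

    design[i] = (x1, x2) says whether feature i belongs to transcript 1 and 2;
    coverage[i] is the observed coverage of feature i.
    """
    a11 = sum(x1 * x1 for x1, _ in design)
    a12 = sum(x1 * x2 for x1, x2 in design)
    a22 = sum(x2 * x2 for _, x2 in design)
    b1 = sum(x1 * y for (x1, _), y in zip(design, coverage))
    b2 = sum(x2 * y for (_, x2), y in zip(design, coverage))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("transcripts are not distinguishable from these features")
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# Hypothetical gene: transcript 1 uses exon segments A, B, C; transcript 2 skips B.
# Rows: segment A, segment B, segment C, junction A-B, junction B-C, junction A-C.
design = [(1, 1), (1, 0), (1, 1), (1, 0), (1, 0), (0, 1)]

# Coverage generated from made-up abundances of 10 (transcript 1) and 4 (transcript 2).
coverage = [14, 10, 14, 10, 10, 4]

estimates = ols_two_transcripts(design, coverage)
print(estimates)
```

The junction that skips segment B is what separates the two transcripts here: without it, and without segment B and its junctions, the two columns of the design would be identical and the system would have no unique solution.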

The recount2 resource is composed of over 70,000 uniformly processed human RNA-seq samples spanning TCGA and SRA, including GTEx. The processed data can be accessed via the recount2 website and the recount Bioconductor package. This workflow explains in detail how to use the recount package and how to integrate it with other Bioconductor packages for several analyses that can be carried out with the recount2 resource. In particular, we describe how the coverage count matrices were computed in recount2 as well as different ways of obtaining public metadata, which can facilitate downstream analyses. Step-by-step directions show how to do a gene-level differential expression analysis, visualize base-level genome coverage data, and perform an analysis at multiple feature levels. This workflow thus provides further information to understand the data in recount2 and a compendium of R code to use the data.

Expression of the gene set of HNMT, HRH1, HRH2 and HRH3 was significantly altered between ASD and matched controls, and this finding was replicated with an independent data set.

recount2 is a resource of processed and summarized expression data spanning over 70,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at

Across 21,504 samples, we found 56,861 junctions (18.6%) that were present in at least 1000 samples but were not annotated, and their expression was associated with tissue type. We compiled junction data into a resource called intropolis available at

derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor.

We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners.

regionReport is an R package for generating detailed interactive reports from region-level genomic analyses as well as feature-level RNA-seq. The report includes quality-control checks, an overview of the results, an interactive table of the genomic regions or features of interest and reproducibility information. regionReport provides specialised reports for exploring DESeq2, edgeR, or derfinder differential expression analysis results. regionReport is also flexible and can easily be expanded with report templates for other analysis pipelines.

Transcriptome analysis of human brain provides fundamental insight into development and disease, but it largely relies on existing annotation. We sequenced transcriptomes of 72 prefrontal cortex samples across six life stages and identified 50,650 differentially expressed regions (DERs) associated with development and aging, agnostic of annotation. While many DERs annotated to non-exonic sequence (41.1%), most were similarly regulated in cytosolic mRNA extracted from independent samples. The DERs were developmentally conserved across 16 brain regions and in the developing mouse cortex, and were expressed in diverse cell and tissue types. The DERs were further enriched for active chromatin marks and clinical risk for neurodevelopmental disorders such as schizophrenia. Lastly, we demonstrate quantitatively that these DERs associate with a changing neuronal phenotype related to differentiation and maturation. These data show conserved molecular signatures of transcriptional dynamics across brain development, have potential clinical relevance and highlight the incomplete annotation of the human brain transcriptome.

There has been a major shift from microarrays to RNA-sequencing (RNA-seq) for measuring gene expression as the price per measurement between these technologies has become comparable. The advantages of RNA-seq are increased measurement flexibility to detect alternative transcription, allele specific transcription, or transcription outside of known coding regions. The price of this increased flexibility is: (a) an increase in raw data size and (b) more decisions that must be made by the data analyst. Here we provide a selective review and extension of our previous work in attempting to measure variability in results due to different choices about how to summarize and analyze RNA-sequencing data. We discuss a standard model for gene expression measurements that breaks variability down into variation due to technology, biology, and measurement error. Finally, we show the importance of gene model selection, normalization, and choice of statistical model on the ultimate results of an RNA-sequencing experiment.
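
One common way to write a decomposition of this kind is given below as a sketch. The notation is an assumption made for illustration and is not taken from the paper: $Y_{gi}$ is the measured expression of gene $g$ in sample $i$, $\mu_g$ is its mean, and $b_{gi}$, $t_{gi}$ and $\varepsilon_{gi}$ are zero-mean, mutually independent terms for biology, technology and measurement error.

```latex
\[
Y_{gi} = \mu_g + b_{gi} + t_{gi} + \varepsilon_{gi},
\qquad
\operatorname{Var}(Y_{gi}) = \sigma^2_{\mathrm{bio},g} + \sigma^2_{\mathrm{tech},g} + \sigma^2_{\mathrm{err},g}
\]
```

Under the independence assumption the three variances simply add, so each source of variability can be discussed separately.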

Many different systems of bacterial interactions have been described. However, relatively few studies have explored how interactions between different microorganisms might influence bacterial development. To explore such interspecies interactions, we focused on Bacillus subtilis, which characteristically develops into matrix-producing cannibals before entering sporulation. We investigated whether organisms from the natural environment of B. subtilis (the soil) were able to alter the development of B. subtilis. To test this possibility, we developed a coculture microcolony screen in which we used fluorescent reporters to identify soil bacteria able to induce matrix production in B. subtilis. Most of the bacteria that influence matrix production in B. subtilis are members of the genus Bacillus, suggesting that such interactions may be predominantly with close relatives. The interactions we observed were mediated via two different mechanisms. One resulted in increased expression of matrix genes via the activation of a sensor histidine kinase, KinD. The second was kinase independent and conceivably functions by altering the relative subpopulations of B. subtilis cell types by preferentially killing noncannibals. These two mechanisms were grouped according to the inducing strain’s relatedness to B. subtilis. Our results suggest that bacteria preferentially alter their development in response to secreted molecules from closely related bacteria and do so using mechanisms that depend on the phylogenetic relatedness of the interacting bacteria.

RegulonDB is the primary reference database of the best-known regulatory network of any free-living organism, that of Escherichia coli K-12. The major conceptual change since 3 years ago is an expanded biological context so that transcriptional regulation is now part of a unit that initiates with the signal and continues with the signal transduction to the core of regulation, modifying expression of the affected target genes responsible for the response. We call these genetic sensory response units, or Gensor Units. We have initiated their high-level curation, with graphic maps and superreactions with links to other databases. Additional connectivity uses expandable submaps. RegulonDB has summaries for every transcription factor (TF) and TF-binding sites with internal symmetry. Several DNA-binding motifs and their sizes have been redefined and relocated. In addition to data from the literature, we have incorporated our own information on transcription start sites (TSSs) and transcriptional units (TUs), obtained by using high-throughput whole-genome sequencing technologies. A new portable drawing tool for genomic features is also now available, as well as new ways to download the data, including web services, files for several relational database manager systems and text files including BioPAX format.