Enhancers are important non-coding elements, but they have traditionally been hard to characterize experimentally. The development of massively parallel assays allows the characterization of large numbers of enhancers for the first time. Here, we developed a framework using Drosophila STARR-seq to create shape-matching filters based on meta-profiles of epigenetic features. We integrated these features with supervised machine-learning algorithms to predict enhancers. We further demonstrated that our model could be transferred to predict enhancers in mammals. We comprehensively validated the predictions using a combination of in vivo and in vitro approaches, involving transgenic assays in mice and transduction-based reporter assays in human cell lines (153 enhancers in total). The results confirmed that our model can accurately predict enhancers in different species without re-parameterization. Finally, we examined the transcription factor binding patterns at predicted enhancers versus promoters. We demonstrated that these patterns enable the construction of a secondary model that effectively distinguishes enhancers and promoters.
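The shape-matching idea can be illustrated with a minimal matched-filter sketch. The abstract does not say how the filters are implemented, so the scoring function (Pearson correlation over a sliding window), the template values, and the signal track below are all assumptions made for illustration.

```python
from math import sqrt

def normalized_correlation(window, template):
    """Pearson correlation between a signal window and a template of equal length."""
    n = len(template)
    mean_w = sum(window) / n
    mean_t = sum(template) / n
    cov = sum((w - mean_w) * (t - mean_t) for w, t in zip(window, template))
    var_w = sum((w - mean_w) ** 2 for w in window)
    var_t = sum((t - mean_t) ** 2 for t in template)
    if var_w == 0 or var_t == 0:
        return 0.0  # a flat window or flat template carries no shape information
    return cov / sqrt(var_w * var_t)

def matched_filter_scores(signal, template):
    """Slide the template along the signal and score every full-length window."""
    n = len(template)
    return [normalized_correlation(signal[i:i + n], template)
            for i in range(len(signal) - n + 1)]

# Hypothetical double-peak meta-profile (assumed shape, not taken from the paper).
template = [1.0, 3.0, 1.0, 3.0, 1.0]
# Hypothetical signal track containing a scaled copy of that shape at offset 3.
signal = [0.0, 0.0, 0.0, 2.0, 6.0, 2.0, 6.0, 2.0, 0.0, 0.0]

scores = matched_filter_scores(signal, template)
best_offset = max(range(len(scores)), key=scores.__getitem__)
```

Because the score is a correlation, a window that matches the template's shape scores highly regardless of its absolute signal level. In a pipeline like the one described, such per-feature scores would presumably serve as inputs to the supervised classifier.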
ENCODE comprises thousands of functional genomics datasets, and the encyclopedia covers hundreds of cell types, providing a universal annotation for genome interpretation. However, for particular applications, it may be advantageous to use a customized annotation. Here, we develop such a custom annotation by leveraging advanced assays, such as eCLIP, Hi-C, and whole-genome STARR-seq on a number of data-rich ENCODE cell types. A key aspect of this annotation is comprehensive and experimentally derived networks of both transcription factors and RNA-binding proteins (TFs and RBPs). Cancer, a disease of system-wide dysregulation, is an ideal application for such a network-based annotation. Specifically, for cancer-associated cell types, we put regulators into hierarchies and measure their network change (rewiring) during oncogenesis. We also extensively survey TF-RBP crosstalk, highlighting how SUB1, a previously uncharacterized RBP, drives aberrant tumor expression and amplifies the effect of MYC, a well-known oncogenic TF. Furthermore, we show how our annotation allows us to place oncogenic transformations in the context of a broad cell space; here, many normal-to-tumor transitions move towards a stem-like state, while oncogene knockdowns show an opposing trend. Finally, we organize the resource into a coherent workflow to prioritize key elements and variants, in addition to regulators. We showcase the application of this prioritization to somatic burdening, cancer differential expression and GWAS. Targeted validations of the prioritized regulators, elements and variants using siRNA knockdowns, CRISPR-based editing, and luciferase assays demonstrate the value of the ENCODE resource.
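One simple way to think about network rewiring is to compare each regulator's target set between two conditions. The sketch below uses Jaccard distance as a stand-in; it is not the measure used in the study, and the regulator and gene names are placeholders.

```python
def rewiring_score(targets_normal, targets_tumor):
    """Jaccard distance between a regulator's target sets in two conditions.

    0.0 means identical targets; 1.0 means no shared targets.
    """
    union = targets_normal | targets_tumor
    if not union:
        return 0.0  # regulator has no targets in either network
    shared = targets_normal & targets_tumor
    return 1.0 - len(shared) / len(union)

# Hypothetical regulator -> target-gene edges (placeholder names).
normal_network = {"TF_A": {"g1", "g2", "g3"}, "TF_B": {"g4", "g5"}}
tumor_network = {"TF_A": {"g1", "g6", "g7"}, "TF_B": {"g4", "g5"}}

scores = {
    reg: rewiring_score(normal_network.get(reg, set()), tumor_network.get(reg, set()))
    for reg in sorted(set(normal_network) | set(tumor_network))
}
# Regulators ordered from most to least rewired.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here the hypothetical TF_A keeps only one of its five distinct targets across conditions and ranks first, while TF_B is unchanged.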
Although transcriptional enhancers were first discovered in 1981, their textbook definition has remained largely unchanged over the past 37 years. High-throughput assays and genome editing are shifting the paradigm from bottom-up discovery and testing of individual enhancers to top-down, genome-wide profiling of enhancer activities, and it has become increasingly evident that the classical definition leaves substantial gray areas in several respects. Here we survey a representative set of recent research articles and report the definitions of enhancers they have adopted. The results reveal that a wide spectrum of definitions is in use, usually without being stated explicitly, which could lead to difficulties in data interpretation and downstream analyses. Based on these findings, we discuss the practical implications and offer suggestions for future studies.
Data science allows the extraction of practical insights from large-scale data. Here, we contextualize it as an umbrella term, encompassing several disparate subdomains. We focus on how genomics fits as a specific application subdomain, in terms of the well-known 3V data and 4M process frameworks (volume-velocity-variety and measurement-mining-modeling-manipulation, respectively). We further analyze the technical and cultural “exports” and “imports” between genomics and other data-science subdomains (e.g., astronomy). Finally, we discuss how data value, privacy, and ownership are pressing issues for data science applications in general and are especially relevant to genomics, due to the persistent nature of DNA.
Vertebrate tissues exhibit mechanical homeostasis, showing stable stiffness and tension over time and recovery after changes in mechanical stress. However, the regulatory pathways that mediate these effects are unknown. Comprehensive profiling of Argonaute 2-associated microRNAs and mRNAs in endothelial cells identified a network of 122 microRNA families that target 73 mRNAs encoding cytoskeletal, contractile, adhesive and extracellular matrix (CAM) proteins. The levels of these microRNAs increased in cells plated on stiff versus soft substrates, consistent with homeostasis, and the microRNAs suppressed their targets via microRNA recognition elements within the 3′ untranslated regions of CAM mRNAs. Inhibition of DROSHA or Argonaute 2, or disruption of microRNA recognition elements within individual target mRNAs, such as connective tissue growth factor, induced hyper-adhesive, hyper-contractile phenotypes in endothelial and fibroblast cells in vitro, and increased tissue stiffness, contractility and extracellular matrix deposition in the zebrafish fin fold in vivo. Thus, a network of microRNAs buffers CAM expression to mediate tissue mechanical homeostasis.
Despite progress in defining genetic risk for psychiatric disorders, their molecular mechanisms remain elusive. Addressing this, the PsychENCODE Consortium has generated a comprehensive online resource for the adult brain across 1866 individuals. The PsychENCODE resource contains ~79,000 brain-active enhancers, sets of Hi-C linkages, and topologically associating domains; single-cell expression profiles for many cell types; expression quantitative-trait loci (QTLs); and further QTLs associated with chromatin, splicing, and cell-type proportions. Integration shows that varying cell-type proportions largely account for the cross-population variation in expression (with >88% reconstruction accuracy). It also allows building of a gene regulatory network, linking genome-wide association study variants to genes (e.g., 321 for schizophrenia). We embed this network into an interpretable deep-learning model, which improves disease prediction by ~6-fold versus polygenic risk scores and identifies key genes and pathways in psychiatric disorders.
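The claim that cell-type proportions account for bulk expression variation can be pictured as a linear mixing model: bulk expression is approximated by cell-type signatures weighted by their fractions. The sketch below is only an illustration of that idea. The signatures, fractions, observed values, and the accuracy formula are all assumptions, and the resource may define reconstruction accuracy differently.

```python
def reconstruct_bulk(signatures, fractions):
    """Mix per-cell-type expression signatures using cell-type fractions.

    signatures: {cell_type: [expression per gene]}
    fractions:  {cell_type: proportion of that cell type in the sample}
    """
    n_genes = len(next(iter(signatures.values())))
    return [sum(fractions[ct] * signatures[ct][g] for ct in signatures)
            for g in range(n_genes)]

def reconstruction_accuracy(observed, reconstructed):
    """Illustrative score: 1 - (squared residual / squared observed signal)."""
    residual = sum((o - r) ** 2 for o, r in zip(observed, reconstructed))
    total = sum(o ** 2 for o in observed)
    return 1.0 - residual / total

# Hypothetical signatures for three genes in two cell types.
signatures = {"neuron": [10.0, 0.0, 4.0], "astrocyte": [0.0, 8.0, 4.0]}
# Hypothetical cell-type fractions for one tissue sample.
fractions = {"neuron": 0.75, "astrocyte": 0.25}
# Hypothetical observed bulk expression for the same sample.
observed = [8.0, 2.0, 4.0]

reconstructed = reconstruct_bulk(signatures, fractions)
accuracy = reconstruction_accuracy(observed, reconstructed)
```

In practice the fractions are unknown and must be estimated from the bulk data (deconvolution); this sketch covers only the forward, mixing direction.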
Biomedical data scientists study many types of networks, ranging from those formed by neurons to those created by molecular interactions. People often criticize these networks as uninterpretable diagrams termed hairballs; however, here we show that molecular biological networks can be interpreted in several straightforward ways. First, we can break down a network into smaller components, focusing on individual pathways and modules. Second, we can compute global statistics describing the network as a whole. Third, we can compare networks. These comparisons can be within the same context (e.g., between two gene regulatory networks) or cross-disciplinary (e.g., between regulatory networks and governmental hierarchies). The latter comparisons can transfer a formalism, such as that for Markov chains, from one context to another or relate our intuitions in a familiar setting (e.g., social networks) to the relatively unfamiliar molecular context. Finally, key aspects of molecular networks are dynamics and evolution, i.e., how they evolve over time and how genetic variants affect them. By studying the relationships between variants in networks, we can begin to interpret many common diseases, such as cancer and heart disease.
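As a small illustration of the second approach, global statistics, the sketch below summarizes an undirected network from its edge list. The choice of statistics (density, mean degree, highest-degree nodes) and the toy edges are assumptions made for the example, not a prescribed set.

```python
from collections import Counter

def global_statistics(edges):
    """Summarize an undirected network given as a list of (node, node) edges."""
    # Treat (a, b) and (b, a) as the same edge and ignore self-loops.
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = Counter()
    for edge in unique_edges:
        for node in edge:
            degree[node] += 1
    n_nodes = len(degree)  # counts only nodes that have at least one edge
    n_edges = len(unique_edges)
    possible = n_nodes * (n_nodes - 1) / 2
    max_degree = max(degree.values(), default=0)
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "density": n_edges / possible if possible else 0.0,
        "mean_degree": 2 * n_edges / n_nodes if n_nodes else 0.0,
        "hubs": sorted(n for n, d in degree.items() if d == max_degree),
    }

# Hypothetical interaction network; ("B", "A") duplicates ("A", "B").
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "A")]
stats = global_statistics(edges)
```

The same summary computed on two networks gives a first, coarse basis for the comparisons described above.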
During the maternal-to-zygotic transition (MZT), transcriptionally silent embryos rely on post-transcriptional regulation of maternal mRNAs until zygotic genome activation (ZGA). RNA-binding proteins (RBPs) are important regulators of post-transcriptional RNA processing events, yet their identities and functions during developmental transitions in vertebrates remain largely unexplored. Using mRNA interactome capture, we identified 227 RBPs in zebrafish embryos before and during ZGA, hereafter named the zebrafish MZT mRNA-bound proteome. This protein constellation consists of many conserved RBPs, some of which are potential stage-specific mRNA interactors that likely reflect the dynamics of RNA–protein interactions during MZT. The enrichment of numerous splicing factors, such as hnRNP proteins, before ZGA was surprising, because maternal mRNAs were found to be fully spliced. To address potentially unique roles of these RBPs in embryogenesis, we focused on Hnrnpa1. iCLIP and subsequent mRNA reporter assays revealed a function for Hnrnpa1 in the regulation of poly(A) tail length and translation of maternal mRNAs through sequence-specific association with 3′ UTRs before ZGA. Comparison of iCLIP data from two developmental stages revealed that Hnrnpa1 dissociates from maternal mRNAs at ZGA and instead regulates the nuclear processing of pri-mir-430 transcripts, which we validated experimentally. The shift from cytoplasmic to nuclear RNA targets was accompanied by a dramatic translocation of Hnrnpa1 and other pre-mRNA splicing factors to the nucleus in a transcription-dependent manner. Thus, our study identifies global changes in RNA–protein interactions during vertebrate MZT and shows that Hnrnpa1 RNA-binding activities are spatially and temporally coordinated to regulate RNA metabolism during early development.
Proper functioning of an organism requires cells and tissues to behave in uniform, well-organized ways. How this optimum of phenotypes is achieved during the development of vertebrates is unclear. Here, we carried out a multi-faceted and single-cell resolution screen of zebrafish embryonic blood vessels upon mutagenesis of single and multi-gene microRNA (miRNA) families. We found that embryos lacking particular miRNA-dependent signaling pathways develop a vascular trait similar to wild-type, but with a profound increase in phenotypic heterogeneity. Aberrant trait variance in miRNA mutant embryos uniquely sensitizes their vascular system to environmental perturbations. We discovered a previously unrecognized role for specific vertebrate miRNAs to protect tissue development against phenotypic variability. This discovery marks an important advance in our comprehension of how miRNAs function in the development of higher organisms.
Alternative splicing is a ubiquitous mechanism of post-transcriptional regulation of gene expression that produces multiple isoforms from the same gene. Expression quantitative trait locus (eQTL) analysis has been a major method for finding associations between gene expression and genomic variation. Differences among alternative splicing isoforms result from differences in the expression of exons. We therefore propose using exon expression QTLs (eeQTLs) to study the genomic variations that are associated with splicing regulation. We adopted a stringent criterion to study gene-level eQTLs and exon-level eeQTLs for both cis- and trans-factors. In experiments on an RNA-sequencing (RNA-Seq) data set of HapMap samples, we observed that, compared with eQTLs, more eeQTL trans-factors can be found than cis-factors, and many of the eeQTLs cannot be found at the gene level. This work highlights that the regulation of exons adds another layer of regulation to gene expression, and that eeQTL analysis is a new approach for investigating genome-wide genomic variations that are involved in the regulation of alternative splicing.
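To make the gene-level versus exon-level contrast concrete, the sketch below regresses expression on allele dosage at a single variant. The abstract does not specify the statistical test, so ordinary least squares is an assumption here, and all genotypes and expression values are invented. A real analysis would also need covariates, significance testing, and multiple-testing correction.

```python
from math import sqrt

def linear_association(dosages, expression):
    """Least-squares slope and Pearson r of expression against allele dosage."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx  # assumes the variant is not monomorphic (sxx > 0)
    r = sxy / sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, r

# Hypothetical genotypes at one SNP (copies of the alternate allele per sample).
dosages = [0, 0, 1, 1, 2, 2]
# Hypothetical normalized expression: the gene total shows no trend, one exon does.
gene_expression = [10.0, 10.2, 10.1, 9.9, 10.0, 10.2]
exon_expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

gene_slope, gene_r = linear_association(dosages, gene_expression)
exon_slope, exon_r = linear_association(dosages, exon_expression)
```

In this toy data the exon's expression rises by about one unit per allele copy while the gene-level total shows no trend, which is the kind of signal that a gene-level eQTL scan would miss.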