Can you have exons and introns of the same gene separated by hundreds of kb in a genome? If so, how is the full mRNA assembled over such a distance?
I'm working on plant mitochondria and I've seen an annotated gene that is split between very distant parts of the genome.
Yes, exons and introns of the same gene can be separated by hundreds (even thousands!) of kilobases.
Here is an example for the human genome: "On average, there are 8.8 exons and 7.8 introns per gene. About 80% of the exons on each chromosome are < 200 bp in length. < 0.01% of the introns are < 20 bp in length and < 10% of introns are more than 11,000 bp in length."
For the splicing process, distance is not really a problem. Let me reframe the question by challenging the assumption you are making: that a long distance along the sequence corresponds to a long distance in space.
Actually, DNA and RNA molecules can adopt a lot of folded structure, for example G-quartets in DNA and hairpin structures in RNA, which can bring distant parts of the sequence close together.
Source: https://upload.wikimedia.org/wikipedia/commons/5/5d/ATPC_secondary_structure.jpg
Answered Mar 22 '20 at 9:48 by Dr. H. Lecter (edited Mar 22 '20 at 10:50)
Adeno-associated viruses (AAV) are small viruses that infect humans and some other primate species. They belong to the genus Dependoparvovirus, which in turn belongs to the family Parvoviridae. They are small (20 nm), replication-defective, nonenveloped viruses and have a linear single-stranded DNA (ssDNA) genome of approximately 4.8 kilobases (kb).
AAV are not currently known to cause disease. The viruses cause a very mild immune response. Several additional features make AAV an attractive candidate for creating viral vectors for gene therapy, and for the creation of isogenic human disease models.  Gene therapy vectors using AAV can infect both dividing and quiescent cells and persist in an extrachromosomal state without integrating into the genome of the host cell, although in the native virus integration of virally carried genes into the host genome does occur.  Integration can be important for certain applications, but can also have unwanted consequences. Recent human clinical trials using AAV for gene therapy in the retina have shown promise. 
Analysis of simulated chromosomes
Topology weighting provides an informative summary of the genealogical data and highlights differences between the simulated scenarios (Figure 2). As described above, there are three possible unrooted topologies for the four taxa. In the Neutral scenario, the most prevalent topology, <[(A,B),C],D>, which reflects the population split times, has an average weighting of 71% across the chromosome. The other two topologies are both fairly rare, but <[(B,C),A],D> is more common on average (17%) than <[(A,C),B],D> (12%). This is because the former can result from both gene flow and incomplete lineage sorting (ILS), whereas the latter can only result from ILS, as there was no simulated migration between A and C or between B and D. In the Adaptive Introgression scenario, the weightings are very similar to the Neutral scenario on average, but in the center of the chromosome there is a strong excess of the topology <[(B,C),A],D>, created by the spread of a beneficial allele from population C into B. Finally, in the Barrier Locus scenario, high migration from C to B causes a swamping by the topology <[(B,C),A],D>, which has an average weighting of 65%. However, there is a broad peak at the center of the chromosome where the population branching topology <[(A,B),C],D> had not been eroded, due to selection limiting introgression.
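To make the idea of a weighting concrete, the following is a minimal sketch and not the Twisst implementation. It assumes that the weighting of a topology is the fraction of four-tip subtrees (one sampled tip per taxon) that match it, and it represents a binary tree as nested tuples. The function names, the tip labels, and the toy tree are all hypothetical.

```python
from itertools import product


def clades(tree):
    """Return (leaf set of `tree`, leaf sets of every internal subtree)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet_split(clade_sets, tips):
    """Find the pair-versus-pair split that the tree induces on four tips."""
    quartet = set(tips)
    for clade in clade_sets:
        inside = quartet & clade
        if len(inside) == 2:
            return frozenset(inside), frozenset(quartet - inside)
    raise ValueError("quartet is not resolved by this tree")


def topology_weights(tree, groups):
    """Fraction of one-tip-per-taxon quartets matching each unrooted topology.

    `groups` maps each of four taxon names to the list of its tip labels.
    """
    _, clade_sets = clades(tree)
    names = list(groups)
    counts = {}
    for combo in product(*(groups[name] for name in names)):
        taxon_of = dict(zip(combo, names))
        split = quartet_split(clade_sets, combo)
        # Express the split in terms of taxa rather than individual tips.
        key = frozenset(frozenset(taxon_of[t] for t in side) for side in split)
        counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


# Toy genealogy: two samples each from A and B, one each from C and D.
toy_tree = ((("a1", "b1"), ("a2", ("b2", "c1"))), "d1")
toy_groups = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"], "D": ["d1"]}
weights = topology_weights(toy_tree, toy_groups)
```

In this toy tree the four possible subtrees are split across all three topologies, so no topology receives a weighting of 1. Enumerating every combination is only practical for small samples.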
In the corresponding simulations with five taxa, there are 15 possible taxon topologies (Figure S7 in File S2). There is greater topological variation overall, as there are more ways that incomplete sorting can occur. Nonetheless, topology weights clearly detect the differences among the scenarios, highlighting the most abundant topologies as well as the location of the selected locus (Figure S7 in File S2).
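The counts of 3 topologies for four taxa and 15 for five taxa follow from the standard formula for the number of unrooted binary trees on n labelled tips, (2n - 5)!!. A short sketch (the function name is illustrative):

```python
def count_unrooted_topologies(n_taxa):
    """Number of unrooted binary topologies for n labelled taxa: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


four_taxa = count_unrooted_topologies(4)
five_taxa = count_unrooted_topologies(5)
```

The rapid growth of this number (105 for six taxa) is one reason topology weighting becomes harder to summarise as more taxa are added.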
Inferring weightings from simulated sequence data
Above, we computed the weightings directly from the simulated genealogies, but we are also able to show that topology weightings can be reliably estimated when the genealogies are inferred from simulated sequence data (Figure 2D and Figure S7D in File S2). Because neither the genealogies nor the recombination breakpoints at which genealogies switch are known, we tested several approaches for inferring genealogies for narrow intervals across the chromosome. First, we performed extensive power analyses, covering a range of demographic scenarios and sampling designs, to explore the relationship between the number of SNPs used for tree inference and the accuracy of topology weighting. Across the range of scenarios investigated, we find a consistent lower bound of 50 SNPs to achieve >90% accuracy (Figure S4, Figure S5, and Figure S6 in File S2). Focusing specifically on TISs (see above) makes no discernible difference, probably because most SNPs in our simulations are taxon informative. These tests also indicate that neighbor-joining trees provide more accurate weightings than maximum likelihood trees, in addition to much faster computation (Figure S4, Figure S5, and Figure S6 in File S2).
We then analyzed trees inferred for nonoverlapping windows across our simulated recombining chromosomes. A fixed window size of 50 SNPs gives results that most closely approximate the true weightings (Figure 2D and Figure S7D in File S2). In agreement with our power analyses, with <50 SNPs the estimates are less accurate and tend to underestimate the weighting of the most prevalent topology (Figure S8 and Figure S9 in File S2). Weightings tending toward intermediate values are expected as the underlying trees become less well resolved. Interestingly, windows of ≥100 SNPs also result in reduced accuracy, but with a tendency to overestimate support for the most prevalent topology and underestimate support for others (Figure S8 and Figure S9 in File S2). This can be explained by the fact that large windows are forced to average over regions of distinct ancestry, therefore favoring the most widespread signal. To confirm this hypothesis, we repeated our neutral simulation using a 10-fold lower population recombination rate. In this new dataset, 100 SNP windows give the most accurate weightings, and even 200 SNP windows have high accuracy, while 50 SNP windows perform only marginally less well (Figure S10 and Figure S11 in File S2).
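The fixed-window scheme can be sketched as follows. This is an illustration only: it assumes SNP positions are already sorted along the chromosome, and the function name and the choice to drop a trailing partial window are assumptions rather than details taken from the text.

```python
def snp_windows(positions, snps_per_window=50):
    """Split sorted SNP positions into nonoverlapping windows of fixed SNP count.

    Returns a list of (start_position, end_position, positions_in_window).
    A trailing window with fewer than `snps_per_window` SNPs is dropped.
    """
    windows = []
    last_start = len(positions) - snps_per_window
    for i in range(0, last_start + 1, snps_per_window):
        chunk = positions[i:i + snps_per_window]
        windows.append((chunk[0], chunk[-1], chunk))
    return windows


# 130 evenly spaced SNPs give two full windows; the last 30 SNPs are dropped.
example_positions = list(range(10, 1310, 10))
example_windows = snp_windows(example_positions)
```

Because windows are defined by SNP count rather than physical length, their span in base pairs varies with local SNP density.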
We tested whether bootstrapping over the SNPs in each window can be used to validate the accuracy of the observed weightings. Bootstrap weights tend to be similar but marginally more conservative, underestimating the weight of the most prevalent topology (Figure 2D). This is because the bootstrap trees tend to be slightly less well resolved, leading to more intermediate weightings. Bootstrapping is therefore a useful means to test the strength of support for an observed peak in the weighting of a particular topology. However, being inherently conservative, bootstrapping would not be able to determine whether an observed intermediate weighting was accurate or simply the result of a poorly resolved tree.
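Bootstrapping over the SNPs in a window amounts to resampling alignment columns with replacement before re-inferring the tree. A minimal sketch of the resampling step only, assuming the window is stored as a dictionary of equal-length strings with one character per SNP (the data layout and names are hypothetical, and tree inference is not shown):

```python
import random


def bootstrap_snps(alignment, rng):
    """Resample SNP columns with replacement, keeping columns intact across samples."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


window = {"a1": "ACGTA", "b1": "ACGTT", "c1": "TCGAT", "d1": "TGGAT"}
replicate = bootstrap_snps(window, random.Random(0))
```

Each replicate has the same number of SNPs as the original window, but some columns appear more than once and others not at all, which is why replicate trees tend to be slightly less well resolved.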
Because real recombination breakpoints are not evenly spaced, we also tested two approaches in which the window boundaries are inferred from the data itself. In our first approach, we used the R package GenWin (Beissinger et al. 2015) to fit a smooth spline to the weightings from 10 SNP windows and identify likely window boundaries as inflection points, and then inferred trees for the newly defined window regions. The resulting topology weightings match the true weightings fairly well, but not as well as for the fixed 50 SNP windows (Figure S12 and Figure S13 in File S2). As above, this appears to be due to poor tree inference in the smallest windows. The second approach used the method Saguaro (Zamani et al. 2013), which combines a hidden Markov model and a self-organizing map to infer both the trees and window boundaries. This approach poorly recapitulates the true weightings, greatly overestimating support for the most prevalent topology (Figure S12 and Figure S13 in File S2). We therefore used fixed windows of 50 SNPs for all further analyses.
Branch lengths differ among topology types
Topology weighting is primarily a descriptive method, but the weightings do carry information that can aid inferences about population history. The simulated Barrier Locus scenario (Figure 2) provides an interesting test case. Due to the overwhelming signal of introgression, it would be difficult to know which topology corresponds to the true population branching order (i.e., the species tree) if this was not known. The topology <[(B,C),A],D> is prevalent across much of the chromosome, but <[(A,B),C],D> is prevalent around the chromosome center. It has been proposed that the original population branching order can be identified by considering branch lengths (Fontaine et al. 2015; Gante et al. 2016). Taxa that cluster together due to recent introgression tend to be separated by short branches, whereas those that cluster according to the population branching order should have deeper splits. Indeed, in trees inferred from 50 SNP windows, pairwise branch distances between the taxa suggest that subtrees matching <[(B,C),A],D> tend to result from recent introgression between B and C (Figure S14 in File S2), thus implying that <[(A,B),C],D> is the more likely population branching order.
Analysis of real genomic data
The Neurospora dataset consists of four taxa (three possible topologies) and is the simpler of the two real datasets analyzed (Figure 3, A and B). It was selected to test how well Twisst is able to detect the signal of a previously described adaptive introgression event from N. hispaniola into N. tetrasperma individuals of the A mating type (Corcoran et al. 2016). This introgression covers the entire (∼7 Mb) nonrecombining region of linkage group I (LGI). Indeed, we find a dramatic shift in the pattern of topology weightings in the central part of LGI (Figure 3C). The species-tree topology (topo1), which groups the two N. tetrasperma mating types as closest relatives, is prevalent across most of the genome but has very little weighting in the central part of LGI. Instead, it is replaced by topo3, which groups mating type A individuals of N. tetrasperma with N. hispaniola. Elsewhere, topo3 has limited weighting, nearly identical to that of topo2, and consistent with a low level of ILS throughout the genome. However, a region of linkage group IV also shows a weak shift in support toward topo3, potentially reflecting a separate introgression signal involving a small number of sequences.
Neurospora analysis. (A) The putative species tree. Note that mating type a and A individuals of N. tetrasperma are shown as separate branches, while in reality, apart from the nonrecombining region of LGI, these samples represent a single recombining population. The putative introgression from N. hispaniola into N. tetrasperma mat A individuals (Corcoran et al. 2016) is indicated by an arrow. (B) The three possible taxon topologies for these four taxa. (C) Topology weightings for 50 SNP windows plotted across all seven linkage groups, with loess smoothing (span = 500 kb). The top and bottom plots show the same data, plotted as stacked or as separate lines, respectively.
The Heliconius dataset represents a more complex, five-taxon test case. The five taxa include an outgroup and two pairs of sympatric, nonsister taxa, between which gene flow is known to occur (Figure 4A). Of the 15 possible topologies (Figure 4B), the two most common across these chromosomes are topo3 and topo6. topo3 is consistent with the accepted species branching order, in which the allopatric H. c. chioneus and H. t. thelxinoe are sister taxa, whereas topo6 groups populations by geography, consistent with interspecific gene flow in both Panama and Peru. The former is by far the most prevalent throughout the Z chromosome (Figure 4C). By contrast, the species topology has variable weighting across chromosome 18, and is outweighed in places by topologies consistent with gene flow (topo4, topo5, topo6, topo11, and topo14). In particular, there is a strong peak in the region of optix for topo11, which groups the taxa by wing pattern, and is consistent with the previously described adaptive introgression of the red-band allele between H. m. amaryllis and H. t. thelxinoe in Peru (Pardo-Diaz et al. 2012; The Heliconius Genome Consortium 2012). Zooming in on this peak shows a clear block of ∼150 kb over which the introgression topology is weighted highly (Figure S15 in File S2). This block includes the regulatory region downstream of optix that is known to control wing pattern variation in these species (Baxter et al. 2010; Wallbank et al. 2016). Another four topologies that partially match the species branching order (topo1, topo2, topo10, and topo15) have moderate weightings throughout, whereas topologies consistent with neither the species tree nor gene flow (topo7, topo8, topo9, topo12, and topo13) have low weightings, especially across the Z chromosome, implying less ILS than on chromosome 18.
Heliconius analysis. (A) The putative species tree. Shaded arrows indicate ongoing gene flow between sympatric, nonsister taxa in Panama and Peru, respectively (Martin et al. 2013). The solid red arrow indicates the putative adaptive introgression of the red wing-patterning allele near the gene optix (Pardo-Diaz et al. 2012; The Heliconius Genome Consortium 2012). (B) The 15 possible taxon topologies for these five taxa. (C) Topology weightings for 50 SNP windows plotted across chromosomes 18 and 21 (Z), with loess smoothing (span = 500 kb). The top and bottom plots show the same data, plotted as stacked or as separate lines, respectively. The location of optix on chromosome 18 is indicated by a dashed vertical line.
Expression and mutational signatures
Global variations in mutational patterns can be quantified using mutational signatures, which tag mutational processes specific to their tissue-of-origin and environmental exposures [19]. However, the extraction of mutational signatures is an intrinsically statistical process that requires a posteriori functional annotation. We performed a pan-cancer association analysis between genome-wide mutational signatures and gene expression levels to decipher the molecular processes that accompany the presence of mutational signatures.
We considered 28 mutational signatures derived using non-negative matrix factorization of context-specific mutation frequencies [9]. We tested for association between signature prevalence in donors and total gene expression, accounting for total mutational burden, cancer type, and other technical and biological confounders. This identified 1,176 genes associated with at least one signature (FDR ≤ 10%) (Extended Data Fig. 10, Supplementary Table 19).
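The text reports associations at FDR ≤ 10% without stating which procedure was used. As an illustration only, the sketch below applies the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate; the choice of procedure, the function name, and the p-values are assumptions and not taken from the study.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a list of booleans marking which tests pass at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    # Find the largest rank k whose p-value is at most fdr * k / m.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant


# Made-up p-values for five gene-signature tests.
calls = benjamini_hochberg([0.001, 0.5, 0.03, 0.04, 0.9], fdr=0.10)
```

Here the three smallest p-values pass, because each is at or below its rank-scaled threshold (0.02, 0.04, and 0.06 for ranks 1 to 3 of 5).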
We considered 18 signatures with 20 or more associated genes for further annotation (Extended Data Fig. 11) and assessed enrichment using GO categories [20] and Reactome pathways [21]. We found that 11 signatures were enriched for at least one category (FDR ≤ 10%) (Supplementary Table 19), revealing associations consistent with known and unknown aetiologies (Fig. 1d). For example, signature 38, which is correlated with the canonical UV signature 7 (r² = 0.375, P = 5 × 10⁻⁴⁰) (Extended Data Fig. 11c), was linked to melanin processes (Fig. 1d). The synthesis of melanin causes oxidative stress to melanocytes [22], and we found signature 38 associated with the oxidative-stress-promoting gene TYR [23] (P = 1.0 × 10⁻⁴). A hallmark of signature 38 genes is C>A mutations, a typical product of reactive oxygen species [24]. This suggests that signature 38 may capture DNA damage that is indirectly caused by UV-induced oxidative damage after direct sun exposure [25], with TYR as a possible mediator of the effect.
The complete sequence of the HIV-1 genome, extracted from infectious virions, has been solved to single-nucleotide resolution. The HIV genome encodes a small number of viral proteins, invariably establishing cooperative associations among HIV proteins and between HIV and host proteins, to invade host cells and hijack their internal machineries. HIV is different in structure from other retroviruses. The HIV virion is about 100 nm in diameter. Its innermost region consists of a cone-shaped core that includes two copies of the (positive sense) ssRNA genome, the enzymes reverse transcriptase, integrase and protease, some minor proteins, and the major core protein. The genome of human immunodeficiency virus (HIV) encodes 8 viral proteins playing essential roles during the HIV life cycle.
HIV-1 is composed of two copies of noncovalently linked, unspliced, positive-sense single-stranded RNA enclosed by a conical capsid composed of the viral protein p24, typical of lentiviruses.   The two copies of RNA strands are vital in contributing to HIV-1 recombination, which occurs during reverse transcription of viral replication. The containment of two copies of single-stranded RNA within a virion but the production of only a single DNA provirus is called pseudodiploidy.  The RNA component is 9749 nucleotides long   and bears a 5’ cap (Gppp), a 3’ poly(A) tail, and many open reading frames (ORFs).  Viral structural proteins are encoded by long ORFs, whereas smaller ORFs encode regulators of the viral life cycle: attachment, membrane fusion, replication, and assembly. 
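To illustrate what scanning a genomic RNA for open reading frames involves, here is a deliberately naive sketch. It only looks for AUG-to-stop stretches in the three forward frames and ignores splicing and ribosomal frameshifting, so it would not recover every HIV gene product. The function name, the minimum-length parameter, and the toy sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(rna, min_codons=2):
    """Return sorted (start, end) index pairs of AUG...stop ORFs in three frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop, including the start codon.
    """
    rna = rna.upper().replace("T", "U")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None and codon == "AUG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


toy_orfs = find_orfs("CCAUGAAAUAGGG")
```

For the toy sequence the only ORF is AUG AAA UAG, found in the third reading frame.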
The single-stranded RNA is tightly bound to p7 nucleocapsid proteins, late assembly protein p6, and enzymes essential to the development of the virion, such as reverse transcriptase and integrase. Lysine tRNA is the primer of the magnesium-dependent reverse transcriptase. The nucleocapsid associates with the genomic RNA (one molecule per hexamer) and protects the RNA from digestion by nucleases. Also enclosed within the virion particle are Vif, Vpr, Nef, and viral protease. The envelope of the virion is formed by a plasma membrane of host cell origin, which is supported by a matrix composed of the viral p17 protein, ensuring the integrity of the virion particle. The surface of the virion carries a limited number of copies of the envelope glycoprotein (Env) of HIV, a trimer formed by heterodimers of gp120 and gp41. Env is responsible for binding to its primary host receptor, CD4, and its co-receptor (mainly CCR5 or CXCR4), leading to viral entry into its target cell.
As the only proteins on the surface of the virus, the envelope glycoproteins (gp120 and gp41) are the major targets for HIV vaccine efforts. Over half of the mass of the trimeric envelope spike is N-linked glycans. The density is high as the glycans shield underlying viral protein from neutralisation by antibodies. This is one of the most densely glycosylated molecules known and the density is sufficiently high to prevent the normal maturation process of glycans during biogenesis in the endoplasmic reticulum and Golgi apparatus. The majority of the glycans are therefore stalled as immature 'high-mannose' glycans not normally present on secreted or cell surface human glycoproteins. The unusual processing and high density mean that almost all broadly neutralising antibodies that have so far been identified (from a subset of patients that have been infected for many months to years) bind to, or are adapted to cope with, these envelope glycans.
The molecular structure of the viral spike has now been determined by X-ray crystallography and cryo-electron microscopy. These advances in structural biology were made possible by the development of stable recombinant forms of the viral spike through the introduction of an intersubunit disulphide bond and an isoleucine to proline mutation in gp41. The so-called SOSIP trimers not only reproduce the antigenic properties of the native viral spike but also display the same degree of immature glycans as presented on the native virus. Recombinant trimeric viral spikes are promising vaccine candidates as they display fewer non-neutralising epitopes than recombinant monomeric gp120; such epitopes act to suppress the immune response to target epitopes.
HIV has several major genes coding for structural proteins that are found in all retroviruses as well as several nonstructural ("accessory") genes unique to HIV. The HIV genome contains nine genes that encode fifteen viral proteins. These are synthesized as polyproteins which produce proteins for the virion interior, called Gag (group-specific antigen); the viral enzymes (Pol, polymerase); or the glycoproteins of the virion envelope (Env). In addition to these, HIV encodes proteins which have certain regulatory and auxiliary functions. HIV-1 has two important regulatory elements, Tat and Rev, and a few important accessory proteins such as Nef, Vpr, Vif and Vpu, which are not essential for replication in certain tissues. The gag gene provides the basic physical infrastructure of the virus, and pol provides the basic mechanism by which retroviruses reproduce, while the others help HIV to enter the host cell and enhance its reproduction. Though they may be altered by mutation, all of these genes except tev exist in all known variants of HIV (see Genetic variability of HIV).
HIV employs a sophisticated system of differential RNA splicing to obtain nine different gene products from a genome of less than 10 kb. HIV has a 9.2 kb unspliced genomic transcript, which encodes the Gag and Pol precursors; a singly spliced, 4.5 kb transcript encoding Env, Vif, Vpr and Vpu; and a multiply spliced, 2 kb mRNA encoding Tat, Rev and Nef.
|Class||Gene name||Primary protein products||Processed protein products|
|Viral structural proteins||gag||Gag polyprotein||MA, CA, SP1, NC, SP2, P6|
|Viral structural proteins||pol||Pol polyprotein||RT, RNase H, IN, PR|
|Essential regulatory elements||tat||Tat||-|
|Accessory regulatory proteins||nef||Nef||-|
Viral structural proteins
- gag (group-specific antigen) codes for the precursor Gag polyprotein, which is processed by viral protease during maturation to MA (matrix protein, p17), CA (capsid protein, p24), SP1 (spacer peptide 1, p2), NC (nucleocapsid protein, p7), SP2 (spacer peptide 2, p1) and P6 protein.
- pol codes for viral enzymes reverse transcriptase (RT) and RNase H, integrase (IN), and HIV protease (PR).  HIV protease is required to cleave the precursor Gag polyprotein to produce structural proteins, RT is required to transcribe DNA from RNA template, and IN is necessary to integrate the double-stranded viral DNA into the host genome. 
- env (for "envelope") codes for gp160, which is cleaved by a host protease, furin, in the Golgi apparatus of the host cell. The post-translational processing produces a surface glycoprotein, gp120 or SU, which attaches to the CD4 receptors present on lymphocytes, and gp41 or TM, which embeds in the viral envelope to enable the virus to attach to and fuse with target cells.
Essential regulatory elements
- tat (HIV trans-activator) plays an important role in regulating the reverse transcription of viral genome RNA, ensuring efficient synthesis of viral mRNAs and regulating the release of virions from infected cells. Tat is expressed as a 72-amino-acid one-exon Tat as well as an 86–101-amino-acid two-exon Tat, and plays an important role early in HIV infection. Tat (14–15 kDa) binds to the bulged genomic RNA stem-loop secondary structure near the 5' LTR region that forms the trans-activation response element (TAR).
- rev (regulator of expression of virion proteins): The Rev protein binds to the viral genome via an arginine-rich RNA-binding motif that also acts as a nuclear localization signal (NLS), required for the transport of Rev to the nucleus from the cytosol during viral replication. Rev recognizes a complex stem-loop structure in the env mRNA, located in the intron separating the coding exons of Tat and Rev and known as the HIV Rev response element (RRE). Rev is important for the synthesis of major viral proteins and is hence essential for viral replication.
Accessory regulatory proteins
- vpr (lentivirus protein R): Vpr is a virion-associated, nucleocytoplasmic shuttling regulatory protein. It is believed to play an important role in replication of the virus, specifically, nuclear import of the preintegration complex. Vpr also appears to cause its host cells to arrest their cell cycle in the G2 phase. This arrest activates the host DNA repair machinery, which may enable integration of the viral DNA. HIV-2 and SIV encode an additional Vpr-related protein called Vpx, which functions in association with Vpr.
- vif: Vif is a highly conserved, 23 kDa phosphoprotein important for the infectivity of HIV-1 virions depending on the cell type. HIV-1 has been found to require Vif to synthesize infectious viruses in lymphocytes, macrophages, and certain human cell lines. It does not appear to require Vif for the same process in HeLa cells or COS cells, among others.
- nef: Nef, negative factor, is an N-terminally myristoylated, membrane-associated phosphoprotein. It is involved in multiple functions during the replication cycle of the virus. It is believed to play an important role in cell apoptosis and to increase virus infectivity.
- vpu (virus protein U): Vpu is specific to HIV-1. It is a class I oligomeric integral membrane phosphoprotein with numerous biological functions. Vpu is involved in CD4 degradation via the ubiquitin proteasome pathway as well as in the successful release of virions from infected cells.
- tev: This gene is only present in a few HIV-1 isolates. It is a fusion of parts of the tat, env, and rev genes, and codes for a protein with some of the properties of Tat, but little or none of the properties of Rev.
Several conserved secondary structure elements have been identified within the HIV RNA genome. The 5'UTR structure consists of a series of stem-loop structures connected by small linkers. These stem-loops (5' to 3') include the trans-activation region (TAR) element, the 5' polyadenylation signal [poly(A)], the PBS, the DIS, the major SD and the ψ hairpin structure located within the 5' end of the genome, and the HIV Rev response element (RRE) within the env gene. Another RNA structure that has been identified is gag stem loop 3 (GSL3), thought to be involved in viral packaging. RNA secondary structures have been proposed to affect the HIV life cycle by altering the function of HIV protease and reverse transcriptase, although not all elements identified have been assigned a function.
An RNA secondary structure determined by SHAPE analysis has been shown to contain three stem loops and is located between the HIV protease and reverse transcriptase genes. This cis regulatory RNA has been shown to be conserved throughout the HIV family and is thought to influence the viral life cycle.
The third variable loop, or V3 loop, is a region of the envelope glycoprotein of the human immunodeficiency virus. The V3 loop of the virion's envelope glycoprotein, gp120, allows it to infect human immune cells by binding to a chemokine receptor on the target cell, such as CCR5 or CXCR4, depending on the strain of HIV. The envelope glycoprotein (Env) gp120/gp41 is essential for HIV-1 entry into cells. Env serves as a molecular target for drugs treating individuals with HIV-1 infection and as a source of immunogens for AIDS vaccine development. However, the structure of the functional Env trimer has remained elusive.
All DNA-templated processes that occur in eukaryotic cells do so in the context of chromatin. Chromatin is composed of an array of nucleosomes consisting of 147 base pairs of double-stranded DNA wrapped around an octamer of histone proteins (Kornberg and Lorch 1999). Chromatin is highly regulated to facilitate proper function of DNA-templated processes at the levels of individual nucleosomes, DNA accessibility, and higher-order structures—all of which are regulated by chromatin-interacting factors. These chromatin-interacting factors are directed to regions of the genome as both a cause and consequence of local chromatin architecture, creating discrete patterns of factor localization. What emerges is a complex system of reciprocity in which chromatin regulatory factors affect nucleosome architecture, which in turn affects the binding of new regulatory factors. With the dynamic interplay between these processes, diverse methods are necessary to examine nucleosome architecture and regulatory factor binding.
Regulatory elements within a cell are primarily found at open or accessible regions of the genome. Identifying cell-specific regulatory elements is therefore primarily accomplished through accessibility assays. Detecting open chromatin can also identify binding sites for chromatin-interacting proteins. In this review, we will first discuss techniques in the field of chromatin biology for examining chromatin accessibility, including digestion with DNase I and deep sequencing (DNase-seq) (Crawford et al. 2006a, b; Sabo et al. 2006; Song and Crawford 2010), formaldehyde-assisted isolation of regulatory elements (FAIRE-seq) (Giresi et al. 2007; Simon et al. 2012), micrococcal nuclease (MNase) digestion followed by deep sequencing (MNase-seq) (Cui and Zhao 2012a; Henikoff et al. 2011; Mieczkowski et al. 2016; Ramani et al. 2019), and an assay for transposase accessibility (ATAC-seq) (Buenrostro et al. 2013, 2015; Chen et al. 2016; Corces et al. 2017) (Fig. 1). These techniques provide important context for gene regulation, especially with respect to nucleosome occupancy and positioning.
Fig. 1. Methods for mapping genome accessibility. (A) DNase-seq identifies open regions of chromatin. DNase-seq relies upon preferential digestion of regions of chromatin that are unprotected by bound proteins, leaving behind accessible regions that are known as DNase I hypersensitive sites (DHSs). (B) FAIRE-seq is dependent on crosslinking of chromatin-interacting proteins to DNA using formaldehyde. Chromatin is then sheared, and regions that are unbound by proteins (e.g., histones) remain in the aqueous layer of a phenol-chloroform extraction, while crosslinked DNA remains in the organic layer. (C) MNase-seq profiles nucleosome occupancy and positioning. After formaldehyde crosslinking, added MNase digests DNA that is unprotected by bound proteins, allowing one to infer increased accessibility by decreased presence in the sequencing library. (D) ATAC-seq relies on the hyperactive Tn5 transposase to insert sequencing adapters at accessible regions of the genome. Following transposition, genomic DNA can be isolated and amplified by PCR, then subjected to deep sequencing. Figure created with Biorender.com
Importantly, the genomic location of factors or histone proteins cannot be predicted in cell types by DNA sequence or accessibility alone. Individual protein profiling technologies are therefore used to identify the cell-specific characteristics of functional binding. We will discuss techniques for determining factor binding to and localization on chromatin, including chromatin immunoprecipitation (ChIP) (Albert et al. 2007; Furey 2012; Gilmour and Lis 1984; Gilmour et al. 1991; O’Neill 2003; Solomon and Varshavsky 1985), DNA adenine methyltransferase identification (DamID) (Greil et al. 2006; van Steensel and Henikoff 2000), and chromatin immunocleavage-derived techniques (ChIC/CUT&RUN) (Schmid et al. 2004; Skene and Henikoff 2017) (Fig. 2).
Fig. 2. Methods for profiling protein localization on chromatin. (A) DamID exploits the E. coli DNA adenine methyltransferase (Dam) by fusing it to a factor of interest and transfecting that plasmid into a cell. This construct methylates adenines located near factor binding sites. Genomic DNA can then be isolated and digested with DpnI, which specifically cleaves at the sequence GᵐATC. A portion of the digested DNA is then digested with DpnII, which cleaves unmethylated GATC to identify potential methylated sites out of Dam’s range. Side-by-side libraries are built and subjected to deep sequencing. (B) ChIP-seq is an antibody-based technology that begins with crosslinking of factors to DNA, followed by chromatin shearing and antibody pulldowns for the factor of interest on either magnetic or agarose beads. Crosslinks are then reversed, and DNA is isolated for deep sequencing. (C) CUT&RUN makes use of a recombinant Protein A-MNase (pA-MNase) fusion construct to bind to a primary antibody recognizing the factor of interest and specifically cleave DNA at factor binding sites, thereby creating small fragments that can be isolated from nuclei and used as a template for library construction and deep sequencing. CUT&RUN offers near-base pair resolution and can be carried out under native (i.e., non-crosslinking) conditions due to its high sequencing signal-to-noise ratio. Figure created with Biorender.com
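Because DamID depends on GATC motifs, its achievable resolution at a locus is bounded by how densely GATC occurs there. The sketch below simply locates GATC motifs in a sequence and reports the spacing between them; it does not model methylation or enzyme behaviour, and the function names and toy sequence are hypothetical.

```python
def find_gatc_sites(sequence):
    """Return 0-based start positions of every GATC motif.

    GATC is its own reverse complement, so scanning one strand is enough.
    """
    seq = sequence.upper()
    sites = []
    start = seq.find("GATC")
    while start != -1:
        sites.append(start)
        start = seq.find("GATC", start + 1)
    return sites


def gatc_spacings(sequence):
    """Distances between consecutive GATC motifs."""
    sites = find_gatc_sites(sequence)
    return [later - earlier for earlier, later in zip(sites, sites[1:])]


toy_sites = find_gatc_sites("AAGATCTTGATCGATCAA")
toy_spacings = gatc_spacings("AAGATCTTGATCGATCAA")
```

Regions with few GATC motifs give long spacings and therefore coarser DamID signal.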
Together, the chromatin profiling technologies that assess either accessibility or localization have been refined with increasing precision to improve target signal over background and to reduce necessary cell input in recent years, often reaching their peak with the development of single-cell adaptations of the techniques. Here, we review the technology development, methods, advantages and disadvantages, and optimization for low cell applications.
Section 1: Methods in examining DNA accessibility and chromatin state
Eukaryotic DNA is compacted into the nucleus through interactions between DNA and histone proteins to form chromatin (Lammerding 2011). Generally, the basic repeating unit of chromatin, the nucleosome, poses a significant obstacle to DNA-templated processes, as factors are unable to occupy regions on DNA that are occluded by histone proteins (Beato and Eisfeld 1997; Felsenfeld 1992; Wallrath et al. 1994). Regions of open chromatin, however, are accessible to DNA-binding proteins and are often found at regulatory regions of the genome (Song and Crawford 2010; Thurman et al. 2012). Identifying regions of the genome that are accessible to non-histone proteins therefore provides important information for putative genomic regulatory regions, such as enhancers, promoters, and insulators, as well as describing the nucleosome structure of known regulatory regions of the genome (Thurman et al. 2012).
Genomic methods used to examine chromatin accessibility have traditionally been based on preferential enzymatic digestion or modification of accessible DNA relative to DNA that is protected by bound histone proteins or transcription factors (Fig. 1). Many genomic accessibility techniques (e.g., DNase-seq and MNase-seq) have evolved from long-used nuclease footprinting experiments (Cappabianca et al. 1999; Dingwall et al. 1981; Galas and Schmitz 1978), taking advantage of next-generation sequencing developments to assess genome-wide nucleosome architecture rather than locus-specific footprinting (Crawford et al. 2006b; Schones et al. 2008). The techniques that have emerged are numerous, powerful, and capable of providing high-resolution data describing chromatin accessibility. For a general bioinformatic pipeline for assessing these datasets, see Fig. 3. Though many of the enzymes used to profile accessibility bear slight biases, the portraits of genome architecture that emerge are generally consistent when compared with each other.
Fig. 3 A general bioinformatic pipeline for analyzing genome-wide accessibility or profiling datasets. Although analyses vary depending on the technique used so as to minimize biases, we have presented a general pipeline for analyzing NGS-generated datasets. Following quality control (Andrews 2010), all sequencing experiments involve mapping to the genome of interest, generating files containing the sequence, alignment information, and quality information, known as .sam files (or, when compressed, .bam files; Langmead et al. 2009; Langmead and Saltzburg 2012; Li and Durbin 2009). These aligned files are filtered and used in downstream analyses; for studying nucleosome and factor occupancy and positioning, size classes are created to divide inaccessible regions by the factors blocking their availability (Li, Handsaker et al. 2009; Schep et al. 2015). From the size-divided accessibility .bam files and the quality-filtered localization .bam files, peaks can be called above local background scoring and/or compared with an input file (Heinz et al. 2010; Meers, Tenenbaum, and Henikoff 2019; Zhang et al. 2008). From factor peaks, motifs can be called to determine which factors most likely bind these locations. Genomic data are typically viewed in the form of either heatmaps or metaplots (Heinz et al. 2010; Ramírez et al. 2016). Figure created with Biorender.com
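As a rough illustration of the "peaks called above local background" step, the sketch below flags positions whose coverage exceeds a multiple of the local median and merges neighbouring hits into intervals. It is a toy stand-in and not the algorithm used by any of the cited peak callers; the function name `call_peaks`, the `window` and `fold` parameters, and the background floor of 1.0 are assumptions made for illustration.

```python
from statistics import median
from typing import List, Sequence, Tuple


def call_peaks(coverage: Sequence[float], window: int = 5,
               fold: float = 3.0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals enriched over local background.

    Background at each position is the median coverage of its neighbours
    within `window` positions on either side (hypothetical choice).
    """
    values = list(coverage)
    n = len(values)
    flagged = []
    for i, value in enumerate(values):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = values[lo:i] + values[i + 1:hi]
        background = median(neighbours) if neighbours else 0.0
        # Floor the background at 1.0 so empty regions do not inflate the ratio.
        flagged.append(value > fold * max(background, 1.0))

    # Merge consecutive flagged positions into intervals.
    peaks = []
    start = None
    for i, is_hit in enumerate(flagged):
        if is_hit and start is None:
            start = i
        elif not is_hit and start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, n))
    return peaks
```

The median is used rather than the mean so that a multi-position peak does not raise its own background estimate; real tools use statistical models and, where available, an input or control track.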
DNase-seq is a method used to examine chromatin accessibility with the non-specific DNA endonuclease DNase I, which preferentially degrades DNA unprotected by bound proteins (e.g., histone proteins; Fig. 1A). Prior to DNase-seq, DNase I had been used for footprinting, in which a gel would be run after DNase treatment both in the presence and absence of the protein of interest; blank regions on the gel would be inferred to be protected and/or inaccessible regions, whereas more nucleosome-depleted (or accessible) regions would be marked by greater cleavage site presence on a gel (Cappabianca et al. 1999; Dingwall et al. 1981; Galas and Schmitz 1978). Francis Collins’ group first applied DNase I footprinting genome-wide in 2006, using microarray chips (DNase-chip) and massively parallel Sanger sequencing (Crawford et al. 2006a, b; Sabo et al. 2006). In 2008, Gregory Crawford’s group further developed this technology through combination with next-generation sequencing (Boyle et al. 2008) to greater success than the previous DNase-chip and DNase-seq experiments due to the increased resolution and quality offered over microarray technology. DNase-seq is applicable to all eukaryotic chromatin, including that of the common lab systems of plants, yeast, nematodes, flies, and mammalian cells.
DNase-seq is performed by isolating nuclei from cells, subjecting nuclei to general DNA digestion by DNase I, degrading RNA and proteins using RNases and Proteinase K, respectively, purifying the DNA using a phenol-chloroform extraction and ethanol precipitation, and gel-extracting fragments of sizes corresponding to the desired class of factors (typically 50–100 bp for transcription factors and 130–160 bp for nucleosomes; He et al. 2014). Purified and size-selected DNA is then used as a template for library construction. Those regions least frequently identified in sequencing of DNase-seq libraries have been most frequently degraded by DNase I and are inferred to be most accessible.
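The size selection described above has a simple computational counterpart when fragment lengths are known after sequencing. The sketch below bins fragment lengths into the two size classes quoted in the text (50–100 bp and 130–160 bp); the names `size_class` and `summarize`, and the catch-all "other" class, are hypothetical and introduced only for illustration.

```python
from collections import Counter
from typing import Iterable

# Size ranges (inclusive, in bp) taken from the protocol description above.
FACTOR_RANGE = (50, 100)
NUCLEOSOME_RANGE = (130, 160)


def size_class(length: int) -> str:
    """Assign a fragment length to a size class."""
    if FACTOR_RANGE[0] <= length <= FACTOR_RANGE[1]:
        return "factor"
    if NUCLEOSOME_RANGE[0] <= length <= NUCLEOSOME_RANGE[1]:
        return "nucleosome"
    return "other"


def summarize(lengths: Iterable[int]) -> Counter:
    """Count how many fragments fall into each size class."""
    return Counter(size_class(n) for n in lengths)
```

For example, `summarize([60, 75, 147, 155, 120, 300])` counts two fragments in each of the three classes.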
There is an intrinsic bias for DNase I to degrade DNA differently based on sequence, and this effect has been suggested to be related to the width of the minor groove (Lazarovici et al. 2013). This limitation must be considered when preparing a DNase-seq experiment (He et al. 2014). For factors that are difficult to profile by DNase-seq, a recent modification has incorporated the use of 0.1% formaldehyde crosslinking to assist in identification, termed XL-DNase-seq (Oh et al. 2019). Another DNase-seq modification, single-cell DNase-seq (scDNase-seq) has applied DNase-seq to individual cells and low-input primary tissue samples (Jin et al. 2015). While similar to traditional DNase-seq, scDNase-seq has been further optimized, applying the following alterations: inclusion of bacterial carrier DNA, lack of nuclear isolation, optimized DNase I digestion, lack of agarose gel separation, and altered PCR conditions. These optimizations are designed to minimize sample loss and facilitate amplification of small DNA fragments (Cooper et al. 2017).
DNase-seq has been highly influential in identifying putative regulatory regions of the genome. Regions that seldom appear in DNase-seq libraries, known as DNase I hypersensitive sites (DHSs), are often used as a proxy for active regulatory regions, such as enhancers and promoters. Attempts to identify these DHSs have resulted in highly influential papers covering almost all known cis-regulatory regions, including over 2.9 million DHSs (Thurman et al. 2012) and over 45 million transcription factor occupancy events (Neph et al. 2012). Additionally, DNase-seq has become a valuable tool for investigating epigenetic tissue– and cell type–specific differences, largely through the efforts of the ENCODE project and the Roadmap Epigenomic Consortium (Consortium 2012; Maurano et al. 2015; Roadmap Epigenomics et al. 2015).
As an alternative to DNase-seq to identify accessible regions throughout the genome, formaldehyde-assisted isolation of regulatory elements (FAIRE) was developed in 2007. Rather than digesting unprotected DNA, FAIRE relies on crosslinking of histones to DNA, while unbound DNA is inferred to be accessible (Fig. 1B). FAIRE was first developed for use with DNA microarrays (Giresi et al. 2007) but was soon combined with next-generation sequencing technologies (Gaulton et al. 2010). Similar to DNase-seq, FAIRE-seq can be used to examine regulatory regions (including TSSs, promoters, and enhancers), also referred to as DHSs. FAIRE-seq has been validated in plant, yeast, nematode, fly, mouse, and human cells.
A typical FAIRE-seq experiment involves formaldehyde crosslinking, with the most abundant crosslinking targets being histone proteins (Rodríguez-Gil et al. 2018; Simon et al. 2012). Crosslinked chromatin is then sheared by sonication to approximately 200–300 bp in size, and DNA is isolated via a phenol-chloroform extraction, wherein the highly crosslinked DNA remains in the organic phase and the non-crosslinked DNA is pulled to the aqueous phase. Non-crosslinked DNA from the aqueous phase can then be amplified and sequenced. Reads enriched in the sequencing pool tend to have lower nucleosome and factor binding and are therefore inferred to come from accessible regions.
A key disadvantage of FAIRE-seq experiments is that, while informative for histone-based chromatin architecture, regulatory regions that are bound by transcription factors or actively transcribed are also able to crosslink. The technique therefore relies on the presence of a mixed population for accurate accessibility profiling and is consequently lower resolution than the other techniques described in this review. As a result, fewer research groups have employed this technology; however, FAIRE-seq has been used to identify regulatory regions driving tumor development (Davie et al. 2015), to differentiate between ground-state and primed-pluripotent cells (Murtha et al. 2015), and, similarly to the ENCODE and Roadmap Epigenomic Consortium’s DNase-seq efforts, to globally map accessible regulatory regions of chromatin (Bianco et al. 2015).
MNase-seq is a method to assay nucleosome positioning and occupancy throughout the genome (Fig. 1C). Micrococcal nuclease (MNase) is an enzyme isolated from Staphylococcus aureus that displays both endo- and exonuclease activity to digest free DNA (Axel 1975; Dingwall et al. 1981). Similar to DNase I, MNase was used in DNA footprinting experiments to examine DNA accessibility before the invention of next-generation sequencing technologies (Cappabianca et al. 1999; Dingwall et al. 1981). MNase tiling arrays (MNase-chip) were used by Ollie Rando, Corey Nislow, and Frank Pugh’s groups, among others, to identify nucleosome positioning at high resolution before the advent of deep sequencing (Lee et al. 2007; Mavrich et al. 2008; Yuan et al. 2005). As with other techniques, MNase profiling was soon paired with next-generation sequencing technologies (Schones et al. 2008). MNase-seq has been used to map nucleosome architecture throughout eukaryotes from plants to yeast to humans.
An MNase-seq experiment begins with an in vivo formaldehyde crosslinking step that is designed to capture the interaction between proteins and DNA. This crosslinking allows bound proteins to shield their associated DNA from digestion by MNase. Following crosslinking, cells are lysed and digested with MNase, which is specifically activated by addition of Ca²⁺ to the lysis buffer. This digestion is halted by chelating the reaction, at which point the samples are RNase treated, crosslinks are reversed, and proteins are digested away from the chromatin. DNA is then isolated via a phenol-chloroform extraction and examined on an agarose gel to ensure proper digestion of the DNA without degradation. As the most abundant DNA-contacting proteins are histones, this gel will typically display periodic laddering every 147 base pairs, representing mono-, di-, and trinucleosomes, and so on.
Traditional MNase-seq protocols advise excision of the mono-nucleosome band to enrich for these protected DNA fragments (Cui and Zhao 2012b; Rando 2010; Zhang and Pugh 2011); however, it is also possible to perform deep sequencing on the entirety of an MNase-digested sample (Henikoff et al. 2011). Fragments remaining after MNase cleavage were protected from digestion and are therefore inferred to have been protein-bound. Sequencing DNA protected by all crosslinked proteins can provide additional footprinting corresponding to both small proteins (< 80 bp shielded from digestion, e.g., transcription factors) and the traditional nucleosome arrays (Hainer and Fazzio 2015; Henikoff et al. 2011).
Importantly, MNase displays different digestion kinetics based on the amount of enzyme used to digest a population of cells (Mieczkowski et al. 2016); in addition, in the case of some genomic loci (such as fragile nucleosomes), high and low digestion profiles can provide drastically different information (Chereji et al. 2017; Mieczkowski et al. 2016; Weiner et al. 2010). It is therefore crucial to perform MNase-seq experiments on a uniform population with no-MNase, low-MNase, and high-MNase replicates. While MNase-seq has traditionally been limited by cellular input available, single-cell MNase-seq has recently been published (Lai et al. 2018).
MNase has a well-documented preference for cleavage of AT-rich naked DNA (Chung et al. 2010); however, this sequence preference is minute compared with preference due to chromatin accessibility (Allan et al. 2012). Nonetheless, techniques are available that can minimize bias due to MNase preference. Jay Shendure’s lab has published an alternative, single-stranded library building protocol for MNase-seq, known as MNase-SSP, that displays low sequence bias and enriches for shorter fragments than traditional MNase-seq, making for robust profiling of transcription factors (Ramani et al. 2019). In addition, a few closely related alternatives have been developed that utilize chemical cleavage of DNA, rather than enzymatic digestion. MPE-seq, developed by Bing Ren’s group, uses methidiumpropyl-EDTA-Fe(II) (MPE) to preferentially cleave linker DNA between histones (Ishii et al. 2015). Steve Henikoff’s group has also developed a chemical DNA cleavage technique, using a mutation in H4 (S47C) to create a site-specific nuclease by phenanthroline-mediated chelation of copper, which locally cleaves DNA at the dyad axis in the presence of peroxide (Chereji et al. 2018).
MNase-seq has been used to profile nucleosome occupancy and positioning changes at regulatory regions as a result of cellular differentiation, highlighting key changes in embryonic stem cell enhancers (West et al. 2014). Furthermore, MNase-seq can even be used to profile paused Pol II positioning, a trend that has been confirmed by parallel Pol II ChIP-seq (Teves and Henikoff 2011). Interestingly, MNase-seq profiling can be used to reliably predict 3D genome interactions and higher-order chromatin structures (Schwartz et al. 2019; Zhang et al. 2017). Because of its ability to capture transitory interactions via crosslinking, MNase-seq is one of the most versatile chromatin accessibility profiling techniques.
The assay for transposase-accessible chromatin using sequencing (ATAC-seq) is an additional technology to assess accessible chromatin. ATAC-seq involves the use of a hyperactive Tn5 transposase to insert sequencing adapters into open regions of chromatin so that those regions can then be sequenced by next-generation sequencing (Buenrostro et al. 2013; Fig. 1D). Unlike other accessibility-profiling techniques, ATAC-seq was only recently developed (Buenrostro et al. 2013), though it has been adapted for use at a single locus (ATAC-qPCR; Yost et al. 2018). Although ATAC-seq is a relatively new technique, the enzyme used, Tn5 transposase, was one of the first transposases identified, and has been used for in vitro transposition experiments for over 20 years (Goryshin and Reznikoff 1998; Naumann and Reznikoff 2002; Reznikoff 2003; Reznikoff 2008). Tn5 operates by a DNA-mediated “cut-and-paste” mechanism, wherein the transposase excises a segment of DNA, binds to a target DNA site, induces a double-strand break, and inserts the transposon into the new locus (Ivics et al. 2009). In ATAC-seq, Tn5 is loaded with a transposon designed to add sequencing adapters at the insertion point, forming a functional transposome. ATAC-seq has been used to map open chromatin in yeast, plants, nematodes, flies, mammals, and even frozen tissues (Corces et al. 2017).
ATAC-seq is performed in two to three basic steps: cellular lysis and DNA transposition, followed by DNA extraction and amplification (Buenrostro et al. 2013). Various ATAC-seq protocols have been developed, including the original ATAC-seq (Buenrostro et al. 2013), FAST-ATAC-seq, which was designed for blood cells (Corces et al. 2016), and Omni-ATAC-seq (Corces et al. 2017), largely differing in the detergents used in cellular lysis. Because ATAC-seq relies on insertion into accessible DNA, rather than digestion of protected DNA, the technique is prone to sequencing contamination by mitochondrial DNA. Because of this prevalence, methods have been developed to reduce mitochondrial reads in ATAC-seq (Corces et al. 2017; Montefiori et al. 2017; Rickner et al. 2019).
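A common first check for the mitochondrial contamination described above is the fraction of mapped reads assigned to the mitochondrial genome. The sketch below computes that fraction from per-chromosome read counts; the function name `mito_fraction` and the list of mitochondrial contig names are assumptions, since naming differs between reference assemblies.

```python
from typing import Iterable, Mapping


def mito_fraction(read_counts: Mapping[str, int],
                  mito_names: Iterable[str] = ("chrM", "MT", "chrMT")) -> float:
    """Return the fraction of mapped reads on mitochondrial contigs.

    `read_counts` maps contig name to number of mapped reads.
    `mito_names` is a hypothetical default; adjust it to the reference used.
    """
    names = set(mito_names)
    total = sum(read_counts.values())
    if total == 0:
        return 0.0
    mito = sum(count for contig, count in read_counts.items() if contig in names)
    return mito / total
```

For example, `mito_fraction({"chr1": 600, "chr2": 300, "chrM": 100})` gives 0.1, meaning one in ten mapped reads is mitochondrial and carries no information about nuclear chromatin.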
ATAC-seq has successfully been used to assess chromatin accessibility in single cells (Buenrostro et al. 2015; Mulqueen et al. 2019) and from frozen tissue (Corces et al. 2017), and therefore the technique is a valuable tool for confronting core genomic issues of cell heterogeneity and low sample availability. Indeed, Jay Shendure’s group has published 85 different chromatin accessibility patterns (largely cell type-specific) based on single-cell indexed ATAC-seq in various mouse tissues (Cusanovich et al. 2018). In addition, Howard Chang’s and William Greenleaf’s groups have published accessibility studies in a litany of primary human cancers using ATAC-seq (Corces et al. 2018). ATAC has further been paired with visualization and flow cytometry (ATAC-see) to allow direct imaging, quantitation, and cell sorting based on genome accessibility (Chen et al. 2016).
Techniques used to measure chromatin accessibility rely on two basic principles: first, that proteins can shield DNA from digestion; and second, that histone proteins are the most prominent proteins interacting with DNA. DNase-seq, MNase-seq, and ATAC-seq fundamentally rely on the first principle, while FAIRE-seq and MNase-seq rely more on the second principle; however, both principles are important to the discrete patterns of accessibility uncovered by each technique. The aforementioned techniques provide distinct yet consistent snapshots of nucleosome positioning and chromatin accessibility, and each technique has particular advantages and disadvantages (Table 1). These technologies have illuminated and verified the accessible state of the genome by orthogonal approaches and led to identification of approximately 3 million putative regulatory regions of the human genome (Thurman et al. 2012).
In parallel to mapping generally accessible regions of the genome, investigating the factors that interact with chromatin and regulate these accessible regions through factor-specific protein localization profiling is equally important to understanding the basic principles of genome architecture.
Section 2: Methods in protein localization profiling on chromatin
Depending on their specific roles within the nucleus, chromatin-interacting proteins display characteristic patterns of genomic localization. By identifying the genomic regions at which proteins are found, it is possible to identify functional roles, motifs important for binding, and regulatory networks of DNA-templated processes in vivo. As with methods of measuring DNA accessibility, numerous approaches to identifying genomic binding sites of chromatin-interacting proteins have gained popularity in recent years (Fig. 2), each of which has advantages and disadvantages (Table 1). Broadly, profiling methods must balance resolution of binding site identification against the amount of sample necessary to perform the experiment. Some methods, like ChIP-exo (Rhee and Pugh 2012), prioritize base-pair resolution at the expense of increased necessary sample input; others, like DamID (van Steensel and Henikoff 2000), provide robust interaction data without the input limitations of higher-resolution techniques. More recently, techniques derived from the chromatin immunocleavage (ChIC) method (Schmid et al. 2004) have emerged and are capable of providing high-resolution identification of binding sites with even ultra-low input samples. For a general bioinformatic pipeline on how to identify these genomic binding sites, see Fig. 3.
The most commonly used technique to assess the localization of chromatin-binding proteins, chromatin immunoprecipitation (ChIP) (Fig. 2A), was developed for use at a single locus using radioactive DNA labeling by Gilmour and Lis (1984) and formaldehyde crosslinking and gel-based imaging by Solomon and Varshavsky (1985). This technique had been in use for many years before being adapted for deep sequencing after library construction to examine genomic identification of a chromatin-interacting protein’s binding site (Albert et al. 2007). Based on the initial radiolabeling experiments, ChIP-chip, a technique in which ChIP DNA is hybridized to DNA microarrays against various genomic loci, was developed in 2000 as the first broad genomic application of ChIP (Ren et al. 2000). ChIP was combined with quantitative PCR (ChIP-qPCR) as a way to examine protein occupancy at multiple locations in a quantitative manner that was more targeted than ChIP-chip, but less restrictive than single-locus radiolabeled ChIP (Irvine et al. 2002). ChIP-seq robustly profiles protein-DNA interactions throughout eukaryotic species.
A ChIP experiment typically begins with a formaldehyde incubation designed to crosslink the lysines of interacting proteins with local DNA. Cells are then lysed to release crosslinked chromatin and subjected to unbiased sonication to shear the chromatin into short segments (typically between 100 and 400 base pairs). The sheared chromatin is then incubated with an antibody targeting the protein of interest, followed by addition of a secondary, IgG-recognizing antibody that is typically coupled to sepharose or magnetic beads. Upon recognition of the epitope, the interacting region of DNA is pulled down with the protein to which it is crosslinked, thereby specifically isolating regions of DNA at which the protein crosslinks (and to which the protein is necessarily in close proximity, approximately 2 Å; Perez-Romero and Imperiale 2007). Crosslinks are then reversed, protein is digested, and the DNA is isolated to be used as a template for locus-specific qPCR or to be run on a gel.
ChIP-seq has been combined with various techniques to provide heightened resolution, including lambda exonuclease digestion (ChIP-exo and ChIP-nexus; He et al. 2015; Rhee and Pugh 2012), UV-crosslinking (UV-ChIP; Gilmour et al. 1991), and MNase digestion (Native ChIP; O’Neill 2003). ChIP-exo and ChIP-nexus are two techniques that utilize nuclease digestion to improve ChIP-seq resolution to a near-base-pair level. ChIP-exo uses lambda exonuclease to digest unbound dsDNA 5′-3′ until reaching a protein-DNA crosslink through which the nuclease cannot proceed (Rhee and Pugh 2012). Similar to ChIP-exo, ChIP-nexus relies on digestion of crosslinked DNA using lambda exonuclease; however, ChIP-nexus also incorporates a modified library build protocol and a barcode-based monitor of overamplification (He et al. 2015). In addition, ChIP-nexus requires only one 3′ sequencing adaptor, reducing input requirements relative to traditional ChIP-seq (He et al. 2015). UV-ChIP utilizes UV light as a zero-length in vivo crosslinking agent that tests direct protein interaction; however, UV crosslinking provides low yields, making it unsuitable for low-input samples or infrequent interactions (Toth and Biggin 2000). Native ChIP uses MNase digestion as a gentler alternative to sonication that allows for identification of protein binding on non-crosslinked chromatin, and at substantially higher resolution than traditional ChIP-seq because it is no longer limited by sonication efficiency (O’Neill 2003).
The most pressing limitation to ChIP-seq experimentation is input: to produce a high signal-to-noise ratio, ChIP-seq typically requires millions of input cells, particularly to examine transcription factor binding. As histones are far more abundant than other DNA-binding proteins, optimizing ChIP-seq technologies for low input has been far more fruitful using histones than factors. For traditional, crosslinking-based ChIP-seq techniques, μChIP-seq has been sufficient to profile histone modifications in 400 cells (Dahl et al. 2016), and ChIP has been paired with microfluidics technology (Cao et al. 2015; Rotem et al. 2015) to reduce necessary input to 100 cells for profiling histone modifications. Native ChIP-seq techniques have been more successful in reducing cellular input due to gentler chromatin shearing. In 2006, Carrier ChIP was successfully used to profile histone modifications in 50 cells (albeit with millions of “carrier” cells to reduce sample loss; O’Neill et al. 2006), while more recent attempts have reduced cellular input for histone modification profiling to 500 cells (MINT-ChIP and ULI-NChIP) and 200 cells (STAR-ChIP) (Liu et al. 2016; van Galen et al. 2016; Zhang et al. 2016). While transcription factors’ lower abundance and transitory binding make them harder to profile in low-input samples, two ChIP-based techniques have successfully lowered cell input: ChIPmentation and carrier-assisted ChIP-seq. The first, ChIPmentation, was developed by Christoph Bock’s group and utilizes Tn5 transposase to ligate sequencing adapters directly onto chromatin on beads (Schmidl et al. 2015); ChIPmentation was used to profile transcription factors in 100,000 cells. In addition, Jason Carroll’s group has used carrier-assisted ChIP-seq to profile transcription factor localization in as few as 10,000 cells (Zwart et al. 2013).
As one of the first and most prominent genomic techniques, ChIP and its derivatives have been extraordinarily impactful in understanding regulation of chromatin interactions and transcription. To date, the term “chromatin immunoprecipitation” has almost 23,000 PubMed hits and over 9000 publicly available datasets in the ENCODE database, with far more stored in the NCBI Sequence Read Archive (Consortium 2012). Although ChIP-seq remains the gold standard of factor localization profiling, other techniques have been developed over the past 30 years to examine factor localization through different approaches.
DamID presents a non-ChIP alternative to locating proteins on chromatin (Fig. 2B) (van Steensel and Henikoff 2000). DamID makes use of a recombinant protein (Escherichia coli DNA adenine methyltransferase, or Dam) fused to the chromatin-interacting protein of interest to identify genomic regions at which the protein interacts. Dam methylates adenine within the sequence GATC (Barras and Marinus 1989; Boivin and Dura 1998; Wines et al. 1996). As adenine methylation does not occur in most eukaryotes, DamID provides a native and specific readout for factor localization (Barras and Marinus 1989). Dam methylation can spread up to 5 kb from the protein-binding site (van Steensel and Henikoff 2000), highlighting the tradeoff between resolution and specificity balanced in DamID experiments. Additionally, more accessible regions of the genome are more likely to be methylated by Dam (Greil et al. 2006), a variable that is controlled for by profiling cells transfected with unfused Dam. Although DamID was pioneered with Southern blotting and quantitative PCR (qPCR) to quantify methylation, these readouts have since been supplanted by next-generation sequencing technologies (Aughey et al. 2019; Greil et al. 2006). DamID is most commonly applied in Drosophila cells but has been used in yeast, C. elegans, Arabidopsis, mice, and human cells, illustrating a versatile range of profiling.
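One way the unfused-Dam control can enter an analysis is as the denominator of a per-bin log2 ratio, so that accessibility-driven methylation cancels out. The sketch below is a minimal illustration under that assumption; the function name `log2_dam_ratio`, the binning of reads into equal-length lists, and the pseudocount of 1.0 are all hypothetical choices rather than a published procedure.

```python
import math
from typing import List, Sequence


def log2_dam_ratio(fusion: Sequence[float], dam_only: Sequence[float],
                   pseudocount: float = 1.0) -> List[float]:
    """Per-bin log2(Dam-fusion / Dam-only) after depth normalization.

    `fusion` and `dam_only` hold read counts for the same genomic bins.
    A positive pseudocount avoids division by zero and log of zero.
    """
    if len(fusion) != len(dam_only):
        raise ValueError("both samples must cover the same bins")
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    n = len(fusion)
    # Totals include the pseudocounts so each sample's fractions sum to 1.
    fusion_total = sum(fusion) + pseudocount * n
    control_total = sum(dam_only) + pseudocount * n
    ratios = []
    for f, c in zip(fusion, dam_only):
        fusion_frac = (f + pseudocount) / fusion_total
        control_frac = (c + pseudocount) / control_total
        ratios.append(math.log2(fusion_frac / control_frac))
    return ratios
```

Bins with positive values are methylated more by the fusion than by free Dam; bins near zero reflect accessibility alone under this simplification.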
A typical DamID experiment involves construction of a plasmid with Dam fused to the N- or C-terminus of the protein of interest. The plasmid is then transfected into the cells to be examined, as are a control plasmid containing Dam alone and an empty vector. Genomic DNA is then isolated from the transfected cells and digested with the DpnI restriction enzyme. As DpnI exclusively and specifically digests GᵐATC, fragments generated from this digestion are inferred to have been in close proximity to the chromatin-interacting protein of interest. Adapters are ligated to the DpnI-digested fragments, and the DNA is then treated with DpnII, a restriction enzyme that cleaves only unmethylated GATC, to doubly select for GᵐATC in the genome. DNA libraries are then amplified and can be submitted for deep sequencing.
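The double selection by DpnI and DpnII can be sketched in silico: a fragment survives to amplification only if both of its ends are methylated GATC sites and no unmethylated GATC lies between them, which means the two sites must be adjacent. The code below is a simplified model under that reading; the function names are hypothetical, coordinates are GATC start positions, and exact cut offsets within the site are ignored.

```python
from typing import List, Set, Tuple


def gatc_sites(sequence: str) -> List[int]:
    """Return 0-based start positions of every GATC in the sequence."""
    seq = sequence.upper()
    sites = []
    pos = seq.find("GATC")
    while pos != -1:
        sites.append(pos)
        pos = seq.find("GATC", pos + 1)
    return sites


def amplifiable_fragments(sequence: str,
                          methylated: Set[int]) -> List[Tuple[int, int]]:
    """Return (left_site, right_site) pairs expected to survive both digests.

    DpnI cuts methylated sites, producing the fragment ends; DpnII cuts
    any unmethylated site, destroying fragments that span one. Only pairs
    of neighbouring methylated sites therefore remain intact.
    """
    sites = gatc_sites(sequence)
    return [(left, right)
            for left, right in zip(sites, sites[1:])
            if left in methylated and right in methylated]
```

With the sequence `AAGATCTTTTGATCCCCCGATCAA` (sites at 2, 10, and 18), methylation at 2 and 10 yields one fragment, whereas methylation at 2 and 18 yields none because the unmethylated site at 10 would be cut by DpnII.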
DamID has not reached the same popularity as ChIP-seq but presents some notable strengths. First, DamID is not dependent on antibodies to profile factor binding, a significant advantage for profiling understudied proteins. Additionally, DamID was the first method by which one could confirm ChIP data by an alternate approach. DamID is, however, disadvantaged by the fact that the profiled protein is not endogenous to the host cells. The binding sites of a Dam fusion construct will often be comparable with those of the endogenous protein, but likely not identical, due to the presence of the Dam construct itself as well as its plasmid-based expression. Additionally, DamID requires a genetically tractable system that can be transfected with the Dam fusion plasmid. Furthermore, DamID is limited by its low resolution because Dam can methylate residues up to 5 kb from the fusion protein’s binding site, and extensive false positives can be found (van Steensel and Henikoff 2000). Because of this range of methylation, DamID is unlikely to reach the resolution offered by ChIP-based techniques; DamID is not, however, constrained by the same input limitations, and has been used to profile transcription factor binding from 1000 ES cells (Tosti et al. 2018) and even single cells (Lai et al. 2019). Although ChIP-seq (and more recently, CUT&RUN) has largely superseded DamID for factor localization, DamID is becoming more popular in studying broader chromatin features; for instance, Chromatin Accessibility Targeted DamID (CATaDa) has been developed to assess open chromatin (Aughey et al. 2018). CATaDa utilizes an untethered Dam protein to methylate regions of open chromatin, leaving nucleosome-bound DNA unmethylated (Aughey et al. 2018). Split DamID has also been used to profile co-occupancy of two proteins at genomic loci, acting in a similar manner to a yeast two-hybrid screen (Hass et al. 2015), and a catalytically inactive DpnI-GFP fusion construct has been used to examine Dam-driven GATC methylation in real-time using microscopy (Kind et al. 2015).
Cleavage under targets and release using nuclease (CUT&RUN) was developed by Skene and Henikoff in 2017 as a genome-wide modification of Ulrich Laemmli’s group’s 2004 ChIC technique, in which a recombinant Protein A fused to micrococcal nuclease (pA-MNase) can be combined with a primary antibody to specifically target MNase and cleave DNA surrounding sites where the protein of interest binds (Fig. 2C; Schmid et al. 2004). Similar techniques include chromatin endogenous cleavage (ChEC; Schmid et al. 2004), which involves a C-terminal fusion of MNase to a protein of interest, and ChEC-seq, a genome-wide pairing of ChEC and next-generation sequencing (Zentner et al. 2015). While ChEC has been successfully applied to assess the localization of multiple proteins (Baptista et al. 2017; Grunberg et al. 2016; Grunberg and Zentner 2017; Warfield et al. 2017; Zentner et al. 2015), the technique is limited by a need to specifically tag the protein of interest. CUT&RUN, on the other hand, utilizes a recombinant pA-MNase protein to recognize any primary antibody with a compatible IgG backbone. Although CUT&RUN is a recently developed technique, it has been used to profile protein-DNA interactions in Arabidopsis, yeast, flies, mice, and human cells, demonstrating a versatile range of application.
A CUT&RUN experiment involves either nuclear isolation with a hypotonic buffer to lyse the cells (Hainer and Fazzio 2019; Skene and Henikoff 2017) or cell permeabilization with digitonin (Skene et al. 2018), together with concanavalin A (lectin)-coated magnetic beads to isolate the nuclei. Subsequent steps are carried out on the bead-bound nuclei until the protected DNA fragments are released prior to library preparation. Primary antibody targeting the protein of interest is added and allowed to diffuse freely into the nuclei, followed by addition of recombinant pA-MNase, which recognizes the IgG backbone of the primary antibody and is therefore directed specifically to the binding sites of the protein of interest on chromatin. The MNase is then activated by addition of Ca²⁺, and digestion is carried out in an ice-water bath (for sub-optimal MNase digestion kinetics) to cleave DNA and release the protein-bound fragments into the supernatant. Released fragments are then RNase treated, digested with Proteinase K, purified, and used as input for library construction. CUT&RUN experiments are performed in tandem with a control sample in which the primary antibody is either left out or replaced with an IgG control, measuring background cutting by the free pA-MNase construct and correcting for an inherent bias towards more accessible regions of the genome. In addition, heterologous DNA can be spiked into the reaction when the MNase digestion is stopped by chelation (Skene and Henikoff 2017), or contaminating E. coli DNA from the pA-MNase purification can be used as a spike-in (Meers et al. 2019). CUT&RUN provides a high signal-to-noise ratio, with the reduced background allowing thorough sequencing with approximately 10 million reads, whereas a ChIP-seq experiment requires 20–40 million reads to accurately assess protein binding.
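The text does not spell out how spike-in reads are used, so the following is only a minimal sketch of one common convention: each sample is scaled by a factor inversely proportional to the number of reads that map to the spike-in genome. The function names, the sample labels, the read counts, and the arbitrary constant of 10,000 are all assumptions made for illustration.

```python
def spike_in_scale_factors(spike_reads, constant=10_000):
    """Map each sample name to constant / (reads mapped to the spike-in genome).

    `constant` is arbitrary; it only keeps the scaled values in a readable range.
    """
    if any(n <= 0 for n in spike_reads.values()):
        raise ValueError("every sample needs a positive spike-in read count")
    return {name: constant / n for name, n in spike_reads.items()}


def normalize(counts, factor):
    """Scale per-bin read counts by a sample's spike-in factor."""
    return [c * factor for c in counts]


# Hypothetical spike-in read counts for an antibody sample and its IgG control.
factors = spike_in_scale_factors({"target": 2000, "igg": 5000})
scaled_target = normalize([10, 4, 0], factors["target"])
print(factors, scaled_target)
```

Because the same amount of spike-in DNA is assumed to be present in every sample, a sample that yields fewer spike-in reads is scaled up and one that yields more is scaled down, making the antibody and control tracks comparable.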
CUT&RUN has proven adaptable to numerous modifications that suit different experimental contexts, most of which have been developed by Steve Henikoff’s group. One such adaptation is robotic automation of the protocol for high-throughput profiling (AutoCUT&RUN; Janssens et al. 2018). In addition, Henikoff’s group has published CUT&RUN.Salt, a method that allows chromatin fractionation based on solubility and is especially useful for profiling centromeric chromatin or other chromatin that is insoluble under typical conditions (Thakur and Henikoff 2018). To improve the efficiency of pA-MNase-antibody binding, Henikoff’s group has engineered a recombinant Protein A-Protein G-MNase fusion construct that allows for profiling with non-rabbit antibodies without a secondary antibody step (Meers et al. 2019). Finally, CUT&RUN has been combined with traditional ChIP (CUT&RUN.ChIP), which allows one to ChIP for protein complexes present within released CUT&RUN fragments (Brahma and Henikoff 2019). The general CUT&RUN technique therefore appears flexible enough to profile protein localization for a variety of experimental designs and desired outcomes.
In 2019, the first single-cell genome-wide profiling of chromatin-bound proteins using CUT&RUN was published to examine pluripotency factors in murine embryonic stem cells (Hainer et al. 2019). In addition to profiling in single cells, factor binding was profiled in individual early blastocysts (consisting of 30 to 50 cells each), an application not previously possible using ChIP-based techniques. More recently, Cleavage Under Targets and Tagmentation, or CUT&Tag, was developed as a modification of CUT&RUN that uses a recombinant Protein A-Tn5 transposase fusion instead of a recombinant pA-MNase fusion protein (Kaya-Okur et al. 2019). CUT&Tag has been used to profile histone modifications in single cells, although it has not yet been used to profile transcription factor binding in single cells (Kaya-Okur et al. 2019). In addition to CUT&Tag, a similar single-cell modification of ChIC, scChIC-seq, has been developed (Ku et al. 2019). It involves tethering MNase to a specific antibody, using the antibody to direct MNase cleavage to target sites, and then selectively amplifying the cleaved fragments by PCR. Between CUT&RUN, uliCUT&RUN, CUT&Tag, ChEC-seq, and ChIC-seq, ChIC- and ChEC-derived techniques appear poised to facilitate the next era of chromatin-interacting factor profiling.
As refinements in genomic techniques have allowed researchers to identify factor binding sites on chromatin and DNA accessibility with high resolution, the limitations of standard techniques have become increasingly apparent. Because of differences due to cellular heterogeneity, inconsistent enzyme digestion kinetics, and untargeted sample isolation, recent advances in genomic techniques have focused on reducing necessary sample input and background signal. These technical improvements have made it possible to examine genome architecture and factor-binding profiles in individual cells, low-input samples like patient biopsies, and subsets of heterogeneous cellular populations. What has emerged from genomic studies of accessibility and factor binding is a complex picture of DNA-templated activities regulated by chromatin architecture.
Profiling of genome accessibility and factor binding has set the stage for identification of genomic regulatory mechanisms; however, these techniques are merely a start towards understanding gene regulation on a mechanistic level. These data must be integrated to understand how transcriptional and cellular networks function cooperatively and antagonistically to shape the functional genome. Additionally, comparisons between cell types will be important to provide insight into the ways in which a common suite of factors drives cell type-specific functions.
We thank M. Garlovsky, S. Martin, C. Cooney, C. Roux, J. Larson, and J. Mallet for critical feedback and for discussion. K. Lohse, M. de la Cámara, J. Cerca, M. A. Chase, C. Baskett, A. M. Westram, and N. H. Barton gave feedback on a draft of the manuscript. O. Seehausen, two anonymous reviewers, and the AE (Michael Kopp) provided comments that greatly improved the manuscript. V. Holzmann made many corrections to the proofs. G. Bisschop and K. Lohse kindly contributed the simulations and analyses presented in Box 3. We would also like to extend our thanks to everyone who took part in the speciation survey, which received ethical approval through the University of Sheffield Ethics Review Procedure (Application 029768). We are especially grateful to R. K. Butlin for stimulating discussion throughout the writing of the manuscript and for feedback on an earlier draft.
Health Conditions Related to Genetic Changes
Nearly 400 mutations in the HBB gene have been found to cause beta thalassemia. Most of the mutations involve a change in a single DNA building block (nucleotide) within or near the HBB gene. Other mutations insert or delete a small number of nucleotides in the HBB gene.
HBB gene mutations that decrease beta-globin production result in a condition called beta-plus (β⁺) thalassemia. Mutations that prevent cells from producing any beta-globin result in beta-zero (β⁰) thalassemia.
Problems with the subunits that make up hemoglobin, including low levels of beta-globin, reduce or eliminate the production of this molecule. A lack of hemoglobin disrupts the normal development of red blood cells. A shortage of mature red blood cells can reduce the amount of oxygen that is delivered to tissues to below what is needed to satisfy the body's energy needs. A lack of oxygen in the body's tissues can lead to poor growth, organ damage, and other health problems associated with beta thalassemia.
Methemoglobinemia, beta-globin type
More than 10 mutations in the HBB gene have been found to cause methemoglobinemia, beta-globin type, which is a condition that alters the hemoglobin within red blood cells. These mutations often affect the region of the protein that binds to heme. For hemoglobin to bind to oxygen, the iron within the heme molecule needs to be in a form called ferrous iron (Fe²⁺). The iron within the heme can change to another form of iron called ferric iron (Fe³⁺), which cannot bind to oxygen. Hemoglobin that contains ferric iron is known as methemoglobin and is unable to efficiently deliver oxygen to the body's tissues.
In methemoglobinemia, beta-globin type, mutations in the HBB gene alter the beta-globin protein and promote the conversion of heme iron from the ferrous to the ferric form. This altered hemoglobin gives the blood a brown color and causes a bluish appearance of the skin, lips, and nails (cyanosis). The signs and symptoms of methemoglobinemia, beta-globin type are generally limited to cyanosis, which does not cause any health problems. However, in rare cases, severe methemoglobinemia, beta-globin type can cause headaches, weakness, and fatigue.
Sickle cell disease
Sickle cell anemia (also called homozygous sickle cell disease or HbSS disease) is the most common form of sickle cell disease. This form is caused by a particular mutation in the HBB gene that results in the production of an abnormal version of beta-globin called hemoglobin S or HbS. In this condition, hemoglobin S replaces both beta-globin subunits in hemoglobin. The mutation that causes hemoglobin S changes a single protein building block (amino acid) in beta-globin. Specifically, the amino acid glutamic acid is replaced with the amino acid valine at position 6 in beta-globin, written as Glu6Val or E6V. Replacing glutamic acid with valine causes the abnormal hemoglobin S subunits to stick together and form long, rigid molecules that bend red blood cells into a sickle (crescent) shape. The sickle-shaped cells die prematurely, which can lead to a shortage of red blood cells (anemia). The sickle-shaped cells are rigid and can block small blood vessels, causing severe pain and organ damage.
Mutations in the HBB gene can also cause other abnormalities in beta-globin, leading to other types of sickle cell disease. These abnormal forms of beta-globin are often designated by letters of the alphabet or sometimes by a name. In these other types of sickle cell disease, just one beta-globin subunit is replaced with hemoglobin S. The other beta-globin subunit is replaced with a different abnormal variant, such as hemoglobin C or hemoglobin E.
In hemoglobin SC (HbSC) disease, the beta-globin subunits are replaced by hemoglobin S and hemoglobin C. Hemoglobin C results when the amino acid lysine replaces the amino acid glutamic acid at position 6 in beta-globin (written Glu6Lys or E6K). The severity of hemoglobin SC disease is variable, but it can be as severe as sickle cell anemia. Hemoglobin E (HbE) is caused when the amino acid glutamic acid is replaced with the amino acid lysine at position 26 in beta-globin (written Glu26Lys or E26K). In some cases, the hemoglobin E mutation is present with hemoglobin S. In these cases, a person may have more severe signs and symptoms associated with sickle cell anemia, such as episodes of pain, anemia, and abnormal spleen function.
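The three substitutions described above can each arise from a change in a single nucleotide of a glutamic acid codon. The sketch below applies one-base changes to the codon GAG and reports the result in the Glu6Val style of notation. The specific nucleotide changes are not given in the text and are included here as an assumption consistent with the stated amino acid changes; the helper names and the three-entry codon table are illustrative only.

```python
# Minimal codon table covering only the codons used in this example.
CODON_TO_AA = {"GAG": "Glu", "GTG": "Val", "AAG": "Lys"}


def substitute(codon, offset, base):
    """Return `codon` with the base at 0-based `offset` replaced by `base`."""
    return codon[:offset] + base + codon[offset + 1:]


def describe(codon, offset, base, position):
    """Describe a single-base change as e.g. 'Glu6Val' for the given protein position."""
    mutant = substitute(codon, offset, base)
    return f"{CODON_TO_AA[codon]}{position}{CODON_TO_AA[mutant]}"


# Assumed nucleotide changes (not stated in the text above).
hb_s = describe("GAG", 1, "T", 6)   # second base A -> T
hb_c = describe("GAG", 0, "A", 6)   # first base G -> A
hb_e = describe("GAG", 0, "A", 26)  # first base G -> A at position 26
print(hb_s, hb_c, hb_e)
```

Hemoglobin C and hemoglobin E involve the same amino acid change (glutamic acid to lysine) and differ only in the position affected, which is why the position number is part of the variant name.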
Other conditions, known as hemoglobin sickle-beta thalassemias (HbSBetaThal), are caused when mutations that produce hemoglobin S and beta thalassemia occur together. Mutations that combine sickle cell disease with beta-zero (β⁰) thalassemia lead to severe disease, while sickle cell disease combined with beta-plus (β⁺) thalassemia is generally milder.
Hundreds of variations have been identified in the HBB gene. These changes result in the production of different versions of beta-globin. Some of these variations cause no noticeable signs or symptoms and are found when blood work is done for other reasons, while other HBB gene variations may affect a person's health. Two of the most common variants are hemoglobin C and hemoglobin E.
Hemoglobin C (HbC), caused by the Glu6Lys mutation in beta-globin, is more common in people of West African descent than in other populations. People who have two hemoglobin C subunits in their hemoglobin, instead of normal beta-globin, have a mild condition called hemoglobin C disease. This condition often causes chronic anemia, in which the red blood cells are broken down prematurely.
Hemoglobin E (HbE), caused by the Glu26Lys mutation in beta-globin, is a variant of hemoglobin most commonly found in the Southeast Asian population. When a person has two hemoglobin E subunits in their hemoglobin in place of beta-globin, a mild anemia called hemoglobin E disease can occur. In some cases, the mutations that produce hemoglobin E and beta thalassemia are found together. People with this hemoglobin combination can have signs and symptoms ranging from mild anemia to severe thalassemia major.
Health Conditions Related to Chromosomal Changes
The following chromosomal conditions are associated with changes in the structure or number of copies of chromosome 9.
9q22.3 microdeletion is a chromosomal change in which a small piece of the long (q) arm of chromosome 9 is deleted in each cell. Affected individuals are missing at least 352,000 base pairs, also written as 352 kilobases (kb), in the q22.3 region of chromosome 9. This 352-kb segment is known as the minimum critical region because it is the smallest deletion that has been found to cause the signs and symptoms related to 9q22.3 microdeletions. These signs and symptoms include delayed development, intellectual disability, certain physical abnormalities, and the characteristic features of a genetic condition called Gorlin syndrome (also known as nevoid basal cell carcinoma syndrome). 9q22.3 microdeletions can also be much larger; the largest reported deletion included 20.5 million base pairs (20.5 Mb).
People with a 9q22.3 microdeletion are missing two to more than 270 genes on chromosome 9. All known 9q22.3 microdeletions include the PTCH1 gene. Researchers believe that many of the features associated with 9q22.3 microdeletions, particularly the signs and symptoms of Gorlin syndrome, result from a loss of the PTCH1 gene. Other signs and symptoms related to 9q22.3 microdeletions probably result from the loss of additional genes in the q22.3 region. Researchers are working to determine which missing genes contribute to the other features associated with the deletion.
Deletions of part or all of chromosome 9 are commonly found in bladder cancer. Bladder cancer is a disease in which certain cells in the bladder become abnormal and multiply uncontrollably to form a tumor. Bladder cancer may cause blood in the urine, pain during urination, frequent urination, the feeling of needing to urinate without being able to, or lower back pain.
Bladder cancer is generally divided into two types, non-muscle invasive bladder cancer (NMIBC) and muscle-invasive bladder cancer (MIBC), based on where in the bladder the tumor is located. Many cases of NMIBC tumors have a chromosome 9 deletion, which typically occurs early in tumor formation. These chromosomal changes are seen only in cancer cells. Research shows that several genes that control cell growth and division are located on chromosome 9. Many of these genes are tumor suppressors, which means they normally help prevent cells from growing and dividing in an uncontrolled way. It is likely that a loss of one or more of these genes plays a role in the early development and progression of bladder cancer.
Chronic myeloid leukemia
A rearrangement (translocation) of genetic material between chromosomes 9 and 22 causes a type of cancer of blood-forming cells called chronic myeloid leukemia. This slow-growing cancer leads to an overproduction of abnormal white blood cells. Common features of the condition include excessive tiredness (fatigue), fever, weight loss, and an enlarged spleen.
The translocation involved in this condition, written as t(9;22), fuses part of the ABL1 gene from chromosome 9 with part of the BCR gene from chromosome 22, creating an abnormal fusion gene called BCR-ABL1. The abnormal chromosome 22, containing a piece of chromosome 9 and the fusion gene, is commonly called the Philadelphia chromosome. The translocation is acquired during a person's lifetime and is present only in the abnormal blood cells. This type of genetic change, called a somatic mutation, is not inherited.
The protein produced from the BCR-ABL1 gene signals cells to continue dividing abnormally and prevents them from self-destructing, which leads to overproduction of the abnormal cells.
The Philadelphia chromosome also has been found in some cases of rapidly progressing blood cancers known as acute leukemias. It is likely that the form of blood cancer that develops is influenced by the type of blood cell that acquires the mutation and other genetic changes that occur. The presence of the Philadelphia chromosome provides a target for molecular therapies.
Most people with Kleefstra syndrome, a disorder with signs and symptoms involving many parts of the body, are missing a sequence of about 1 million DNA building blocks (base pairs) on one copy of chromosome 9 in each cell. The deletion occurs near the end of the long (q) arm of the chromosome at a location designated q34.3, a region containing a gene called EHMT1. Some affected individuals have shorter or longer deletions in the same region.
The loss of the EHMT1 gene from one copy of chromosome 9 in each cell is believed to be responsible for the characteristic features of Kleefstra syndrome in people with the 9q34.3 deletion. However, the loss of other genes in the same region may lead to additional health problems in some affected individuals.
The EHMT1 gene provides instructions for making an enzyme called euchromatic histone methyltransferase 1. Histone methyltransferases are enzymes that modify proteins called histones. Histones are structural proteins that attach (bind) to DNA and give chromosomes their shape. By adding a molecule called a methyl group to histones, histone methyltransferases can turn off (suppress) the activity of certain genes, which is essential for normal development and function. A lack of euchromatic histone methyltransferase 1 enzyme impairs proper control of the activity of certain genes in many of the body's organs and tissues, resulting in the abnormalities of development and function characteristic of Kleefstra syndrome.
Other chromosomal conditions
Other changes in the structure or number of copies of chromosome 9 can have a variety of effects. Intellectual disability, delayed development, distinctive facial features, and an unusual head shape are common features. Changes to chromosome 9 include an extra piece of the chromosome in each cell (partial trisomy), a missing segment of the chromosome in each cell (partial monosomy), and a circular structure called a ring chromosome 9. A ring chromosome occurs when both ends of a broken chromosome are reunited. Rearrangements (translocations) of genetic material between chromosome 9 and other chromosomes can also lead to extra or missing chromosome segments.
Changes in the structure of chromosome 9 have been found in many types of cancer. These changes, which occur only in cells that give rise to cancer, usually involve a loss of part of the chromosome or a rearrangement of chromosomal material. For example, a loss of part of the long (q) arm of chromosome 9 has been identified in some types of brain tumor. In addition, chromosomal rearrangements that fuse the ABL1 gene with genes other than BCR have been found in a small number of acute leukemias. The exact mechanisms by which these genetic changes lead to cancer are not completely understood, although it is likely that the proteins produced from them promote uncontrolled growth of cells.