From haplotype to phenotype: a systems integration of allelic variation, chromatin state and 3D genome data

Periodic Reporting for period 4 - HAP-PHEN (From haplotype to phenotype: a systems integration of allelic variation, chromatin state and 3D genome data)

Reporting period: 2020-03-01 to 2020-08-31

In the last decade tremendous progress has been made in the field of high-throughput sequencing, leading to a rapid decline in sequencing costs. It is now possible to sequence the genomes of hundreds of thousands of individuals to create a deep catalog of the bulk of human genetic variation. However, sequence information alone is not enough: we need to understand the function of the variation in genetic sequences between individuals. Assigning function to genetic variation is known as functional annotation. For genetic variation that occurs in protein-coding sequences it is relatively straightforward to predict the effect of a genetic variant. However, the vast majority of genetic variants (~97%) lie outside of protein-coding regions (i.e. in non-coding regions). For non-coding variants the power to predict the effect of a genetic variant drops effectively to zero. Yet genome-wide association studies have already shown that the majority of all loci significantly associated with human traits and diseases are found in these non-coding, supposedly regulatory regions.
One of the challenges with human genome sequencing is that it basically generates a list of genetic variants that are unlinked. For every chromosome we inherit one copy from our father and one from our mother. Genetic variants can lie on the paternal or the maternal copy of a chromosome. Functional genetic variants will affect expression from that same chromosome copy, also known as an allele. When we can link non-coding genetic variants to genetic variants that are expressed, we can determine their effect on gene expression more directly. Regions where genetic variants can be linked to the same parental chromosome are called haplotypes. The overarching aim of this project was to develop novel technologies to resolve haplotypes in order to identify genetic variants that affect gene expression.

Understanding the effect of non-coding genetic variants in gene regulation is particularly important in complex human genetics, which studies traits that are influenced by multiple genetic loci. Improving our understanding of complex genetic traits will enable better prediction of disease risk. Genetic risk assessment is complicated by the fact that every individual harbors millions of genetic variants, of which only a subset affects phenotypic traits (e.g. height, blood pressure or cardiovascular disease). Precisely because the vast majority of non-coding genetic variants are not functional, assigning function to genetic variants is far from trivial. We have used a combination of multiple genomics methods to assign function to non-coding genetic variants.

A better understanding of human genetics, for both coding and non-coding sequences, can lead to improvements in genetic risk profiles that can be used to encourage people to make lifestyle choices that improve healthy living and aging by preventing the onset of disease.
In the project we generated chromosome-wide haplotypes for a set of lymphoblastoid cell lines by combining 10X Genomics linked-read sequencing with chromosome conformation capture data (Hi-C). After generating these data we developed a computational analysis pipeline that enabled the haplotyping of these cell lines. We were able to haplotype >99% of the genetic variants in the genome. To measure transcription rates in these cells we generated RNA-seq data. In addition, we identified regulatory regions at high resolution in these cells. We can use the haplotype information to identify genes that are expressed specifically from one of the parental chromosomes. By intersecting these data with regulatory regions that are also found specifically on that chromosome we can prioritize functional non-coding variants. These results will be disseminated through scientific publication. In addition, all our genomics data will be made available to the research community through databases such as NCBI GEO and the European Nucleotide Archive.
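The intersection step described above can be illustrated with a minimal Python sketch. It is not the project's pipeline: the record types, field names, example coordinates and the 1 Mb distance window are all hypothetical choices made for illustration. The sketch keeps a phased variant when it lies inside a regulatory region that is active on one haplotype and a gene on the same chromosome is preferentially expressed from that same haplotype.

```python
from dataclasses import dataclass

# All record types and field names below are hypothetical illustrations.

@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    tss: int           # transcription start site (bp)
    favoured_hap: int  # 1 or 2: haplotype with the higher expression

@dataclass(frozen=True)
class Region:
    chrom: str
    start: int         # half-open interval [start, end)
    end: int
    active_hap: int    # 1 or 2: haplotype on which the region is active

@dataclass(frozen=True)
class Variant:
    var_id: str
    chrom: str
    pos: int

def prioritize(genes, regions, variants, max_dist=1_000_000):
    """Return (variant, gene) pairs where the variant sits in a
    haplotype-specific region and the gene is expressed from the
    same haplotype within max_dist bp (an assumed window)."""
    hits = []
    for v in variants:
        for r in regions:
            if r.chrom != v.chrom or not (r.start <= v.pos < r.end):
                continue
            for g in genes:
                if (g.chrom == r.chrom
                        and g.favoured_hap == r.active_hap
                        and abs(g.tss - v.pos) <= max_dist):
                    hits.append((v.var_id, g.name))
    return hits

# Toy data: one region active on haplotype 1, two genes, two variants.
genes = [Gene("GENE_A", "chr1", 100_000, 1), Gene("GENE_B", "chr1", 150_000, 2)]
regions = [Region("chr1", 90_000, 91_000, 1)]
variants = [Variant("var1", "chr1", 90_500), Variant("var2", "chr1", 95_000)]
candidates = prioritize(genes, regions, variants)
```

In this toy example only var1 is paired with GENE_A, because var2 lies outside the region and GENE_B is expressed from the other haplotype. The nested loops are for readability; a real analysis over millions of variants would need an interval index.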

In addition to our human genetics analyses, our expertise in the analysis of the 3D genome enabled us to study cohesin biology in more detail. This has resulted in a number of high-profile papers that study the role of cohesin and cohesin-interacting proteins in the organization of the genome. For instance, we showed that loss of the cohesin regulator WAPL results in longer loops (Haarhuis et al. 2017, Cell). Furthermore, we showed that mutations in the architectural protein CTCF resulted in a loss of all CTCF-anchored chromatin loops (Li et al. 2020, Nature). Finally, we have shown that dynamic cohesin is crucial to the regulation of cell-type specific genes (Liu et al. 2020, bioRxiv).

We are now planning to further understand how genetic variants located at a distance from the promoter of genes contribute to the regulation of these genes. Our work has shown that cohesin and the 3D genome play a crucial role in this regulation. We aim to combine these two disciplines to better understand gene regulation in general and distal gene regulation in particular.
Although haplotype information would be the preferred type of genomic data for human genomes (or any genome for that matter), technical limitations have hampered the implementation of haplotype sequencing in clinical diagnostics. We have shown a relatively straightforward way of generating haplotypes using bulk cell samples. We combined 10X Genomics long-range DNA sequencing with Hi-C data and generated whole-chromosome de novo haplotypes for five different primary lymphoblastoid cell lines. As far as we are aware this is the first example of high-resolution, de novo (i.e. not requiring trio, parent or population information) whole-chromosome haplotypes using short-read sequencing. By phasing >99% of all single nucleotide variants and insertions/deletions (indels) we have reached a level of completeness that is, to our knowledge, unprecedented. The relative ease of implementation and low cost should enable the uptake of this method of haplotype resolution, particularly when sequencing costs go down further, as is expected. This should also make haplotype information more widely available in the future.
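To illustrate why long-range Hi-C contacts help extend local haplotypes to whole chromosomes, the following Python sketch joins phase blocks by majority vote. It is an assumed, simplified scheme and not the pipeline developed in the project. It relies on the general observation that the two ends of a Hi-C read pair mostly come from the same chromosome copy. The data layout is hypothetical: each block maps a variant to the local haplotype (0 or 1) carrying its alternate allele, and each Hi-C pair records which allele (0 = reference, 1 = alternate) was seen at two variants.

```python
def hap_of(phase, allele):
    """Haplotype on which an observed allele lies, given the haplotype
    (0 or 1) that carries the alternate allele of that variant."""
    return phase if allele == 1 else 1 - phase

def join_blocks(blocks, hic_pairs):
    """Greedily orient each phase block against those already placed.

    blocks:    list of dicts {variant_id: haplotype carrying the alt allele}
    hic_pairs: iterable of (var_a, allele_a, var_b, allele_b) tuples
    Returns one dict with a consistent phase for all variants.
    """
    global_phase = dict(blocks[0])  # the first block fixes the labels
    for block in blocks[1:]:
        same = flip = 0
        for var_a, al_a, var_b, al_b in hic_pairs:
            # Reorder so var_a is already placed and var_b is in this block.
            if var_a in block and var_b in global_phase:
                var_a, al_a, var_b, al_b = var_b, al_b, var_a, al_a
            if var_a in global_phase and var_b in block:
                if hap_of(global_phase[var_a], al_a) == hap_of(block[var_b], al_b):
                    same += 1
                else:
                    flip += 1
        # Flip the block only if most linking pairs disagree with its
        # current orientation; with no evidence it is left unchanged.
        do_flip = flip > same
        for var, phase in block.items():
            global_phase[var] = 1 - phase if do_flip else phase
    return global_phase

# Toy data: block 2 is inverted relative to block 1; one pair is noise.
blocks = [{"v1": 0, "v2": 1}, {"v3": 0, "v4": 1}]
hic_pairs = [
    ("v1", 1, "v3", 0),
    ("v2", 1, "v4", 0),
    ("v2", 0, "v3", 0),
    ("v1", 1, "v4", 0),  # contradicting pair, e.g. a contact between copies
]
phased = join_blocks(blocks, hic_pairs)
```

Three of the four pairs indicate that the second block is inverted, so it is flipped before merging. A majority vote tolerates a minority of contradicting pairs, which is the reason to pool many contacts instead of trusting any single read pair.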

We are expanding our data analysis pipeline to include a statistical framework to identify functional non-coding variants. To our knowledge, our method is the first to identify functional non-coding variants in single individuals at high throughput. At the moment it is still necessary to analyze large cohorts of individuals to identify putative functional genetic variants. However, once we can identify functional non-coding genetic variants in individuals, this should open up possibilities to identify (non-coding) driver mutations for rare genetic diseases as well, which by definition cannot be studied in large cohorts.
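The statistical framework itself is not described here. As a hedged illustration of the kind of test such a framework might build on, the sketch below applies a two-sided exact binomial test to haplotype-resolved read counts, asking whether the reads are split evenly between the two parental copies. The function name and inputs are hypothetical, and the test is a common textbook choice for allelic imbalance, not necessarily the one used in the project.

```python
from math import comb

def allelic_imbalance_pvalue(hap1_reads, hap2_reads):
    """Two-sided exact binomial p-value for the null hypothesis that
    reads are drawn from both haplotypes with equal probability."""
    n = hap1_reads + hap2_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    k = min(hap1_reads, hap2_reads)
    # Probability of a split at least as uneven in one direction ...
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    # ... doubled, because the null distribution is symmetric at p = 0.5.
    return min(1.0, 2 * tail)

# Ten reads, all from haplotype 1, versus a balanced five-to-five split.
p_skewed = allelic_imbalance_pvalue(10, 0)
p_balanced = allelic_imbalance_pvalue(5, 5)
```

Because many genes and regions are tested in one individual, p-values of this kind would need correction for multiple testing before any variant is called functional.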
Hi-C data for a human chromosome arm