Deciphering de novo gene birth in populations

Project information

NovoGenePop

Grant agreement ID: 101052538

DOI

10.3030/101052538

EC signature date: 24 May 2022

Start date: 1 June 2022

End date: 31 May 2027

Funded under

European Research Council (ERC)

Total cost

€ 2 453 751,00

EU contribution

€ 2 453 751,00

Coordinated by

FUNDACIO INSTITUT HOSPITAL DEL MAR D INVESTIGACIONS MEDIQUES
Spain

Periodic Reporting for period 1 - NovoGenePop (Deciphering de novo gene birth in populations)

Reporting period: 2022-06-01 to 2024-11-30

Genes are fundamental units of life and their origin has fascinated researchers since the beginning of the molecular era. Many of the studies on the formation of new genes in genomes have focused on gene duplication and subsequent divergence of the two gene copies. In recent years, however, we have learnt that genes can also arise de novo from previously non-genic sequences. The discovery of de novo genes has become possible through the sequencing of complete genomes and the comparison of gene sets between closely related species. Here we wish to test a novel hypothesis: that the dynamics of de novo gene formation in populations result in substantial differences in gene content between individuals. If they exist, these differences would not be visible with the current methods for studying gene variation, which are based on comparing the sequences of each individual to a common set of reference genes. To test our hypothesis, we will need to develop novel computational approaches to first obtain an accurate representation of all transcripts and translated open reading frames in each individual, and then integrate the information at the population level. We propose to apply these methods to two very distinct biological systems: a large collection of Saccharomyces cerevisiae world isolates and a human lymphoblastoid cell line (LCL) panel. For this, we will collect and generate RNA sequencing (RNA-Seq) and ribosome profiling (Ribo-Seq) data. In order to identify de novo gene birth events that occurred within populations, as opposed to phylogenetically conserved genes that have been lost in some individuals, we will also generate similar data from a set of closely related species in each of the two systems. Combined with genomics data, we will identify the spectrum of mutations associated with de novo gene birth with an unprecedented level of detail and uncover footprints of adaptation linked to the birth of new genes.

We have developed a computational pipeline for transcript clustering that uses Illumina RNA-Seq data. The pipeline is reference-free, which allows the identification of transcripts that are missing from the reference gene annotation.

We have developed a computational pipeline to identify translated open reading frames (ORFs) that uses Illumina Ribo-Seq data. The pipeline includes several steps to map the Ribo-Seq reads to the genome, define the P-site (central codon position) for reads of different lengths, and identify ORFs that show significant translation signatures.
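As an illustration of the P-site and translation-signature steps, the following is a minimal sketch and not the project's pipeline. It assumes reads are already mapped and given as (5' position, read length) pairs, that the P-site offsets per read length are known (the values in `PSITE_OFFSETS` are hypothetical), and that three-nucleotide periodicity is used as the translation signature. All function names are made up for this example.

```python
from collections import Counter

# Hypothetical offsets (nt) from the read 5' end to the P-site, per read length.
# In practice these are estimated from the data; the values here are illustrative.
PSITE_OFFSETS = {28: 12, 29: 12, 30: 13}

def psite_positions(reads, offsets=PSITE_OFFSETS):
    """Convert (five_prime_pos, read_length) pairs to P-site positions.
    Reads whose length has no known offset are discarded."""
    return [pos + offsets[length] for pos, length in reads if length in offsets]

def frame_counts(psites, orf_start, orf_end):
    """Count P-sites falling in frame 0, 1 and 2 of the ORF [orf_start, orf_end)."""
    counts = Counter((p - orf_start) % 3 for p in psites if orf_start <= p < orf_end)
    return [counts.get(frame, 0) for frame in range(3)]

def in_frame_fraction(counts):
    """Fraction of P-sites in frame 0; values well above 1/3 suggest translation."""
    total = sum(counts)
    return counts[0] / total if total else 0.0

# Toy reads on a single transcript; the 27-nt read is dropped (no offset defined).
reads = [(0, 28), (3, 28), (6, 29), (8, 30), (10, 28), (50, 27)]
psites = psite_positions(reads)
counts = frame_counts(psites, orf_start=12, orf_end=42)
fraction = in_frame_fraction(counts)
```

A real analysis would add a statistical test on the frame counts and a minimum read coverage; both are left out here to keep the sketch short.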

We have used Nanopore RNA-Seq to validate the transcripts reconstructed with our pipeline in different Saccharomyces species. In Nanopore RNA-Seq data each transcript corresponds to a single sequencing read, which provides accurate information on the extent of a given transcript.

We have developed a methodology to identify orthologous transcripts across yeast species and Saccharomyces cerevisiae strains, which is based on conserved blocks of genomic synteny. Using this pipeline we have been able to cluster all the transcripts into gene families.
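The grouping of transcripts into gene families can be pictured as finding connected components in a graph of pairwise orthology links. The sketch below assumes such links have already been derived from syntenic blocks (that step is not shown) and uses a simple union-find structure; the transcript identifiers and the function name `cluster_families` are hypothetical.

```python
def cluster_families(transcripts, links):
    """Group transcripts into families: two transcripts share a family
    if they are connected by a chain of pairwise orthology links."""
    parent = {t: t for t in transcripts}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for t in transcripts:
        families.setdefault(find(t), []).append(t)
    # Sort members and families so the output is deterministic.
    return sorted(sorted(members) for members in families.values())

# Toy input: links are assumed to come from conserved synteny between genomes.
transcripts = ["strainA_t1", "strainA_t2", "strainB_t7", "strainC_t3"]
links = [("strainA_t1", "strainB_t7"), ("strainB_t7", "strainC_t3")]
families = cluster_families(transcripts, links)
```

In this toy case `strainA_t2` has no link to any other transcript and ends up as a single-member family, which is the kind of pattern that would flag a candidate for further inspection.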

We have identified putative de novo genes in seven S. cerevisiae strains using information from the gene families. This shows that de novo gene birth events can be detected at short evolutionary time scales in the yeast population.

We have used Ribo-Seq data from 72 Yoruba-derived lymphoblastoid cell lines (LCLs) to evaluate the level of polymorphism of translated non-canonical ORFs in this human population. We have found that a collection of previously defined de novo ORFs tends to be more polymorphic in the population than canonical protein-coding sequences.
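One simple way to picture translation polymorphism is to record, for each ORF, whether it is detected as translated in each individual and then ask which ORFs are translated in some individuals but not in all. The sketch below works on that assumption with invented toy data; the definition of "polymorphic" used here and the function names are illustrative and may differ from the measure used in the project.

```python
def translated_fraction(calls):
    """Fraction of individuals in which an ORF is called as translated."""
    return sum(calls) / len(calls)

def polymorphic_share(orf_calls):
    """Share of ORFs translated in some, but not all, individuals."""
    fractions = [translated_fraction(calls) for calls in orf_calls.values()]
    polymorphic = [f for f in fractions if 0 < f < 1]
    return len(polymorphic) / len(fractions)

# Toy presence/absence calls for four individuals (invented values).
canonical = {
    "cds_1": [True, True, True, True],
    "cds_2": [True, True, True, False],
}
de_novo = {
    "orf_a": [True, False, False, True],
    "orf_b": [False, True, True, True],
}

canonical_share = polymorphic_share(canonical)
de_novo_share = polymorphic_share(de_novo)
```

With real data the comparison would also need to control for expression level and ORF length, since both affect the power to detect translation.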

We have measured the strength of purifying selection in yeast putative de novo genes using single nucleotide polymorphism data. We have shown that recently emerged de novo genes are under weak purifying selection when compared to older genes.
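A common first step in this kind of analysis is to classify each coding SNP as synonymous or nonsynonymous. The sketch below does only that, using the standard genetic code; it is an assumption that the project follows a similar route, and a complete estimate of selection strength would additionally normalise by the number of synonymous and nonsynonymous sites. The sequence and SNPs are invented.

```python
# Standard genetic code, built from the usual TCAG-ordered amino acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_snp(cds, pos, alt):
    """Classify a SNP at 0-based position `pos` of an in-frame coding sequence."""
    start = pos - pos % 3              # first base of the affected codon
    ref_codon = cds[start:start + 3]
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "nonsynonymous"

# Toy coding sequence (Met-Ala-Lys) and two invented SNPs.
cds = "ATGGCTAAA"
snps = [(5, "C"), (6, "G")]   # GCT->GCC (Ala->Ala), AAA->GAA (Lys->Glu)
labels = [classify_snp(cds, pos, alt) for pos, alt in snps]
```

Under weak purifying selection, the proportion of nonsynonymous SNPs is expected to be closer to what random mutation alone would produce than it is in strongly constrained genes.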

We have applied the computational pipeline for transcript clustering that uses Illumina RNA-Seq data to study tumor transcript diversity in a hepatocellular carcinoma cohort. This has allowed us to identify transcripts that are tumor-specific yet shared by different individuals.

We have published three research articles related to the project (see Publications section).

We have developed a pipeline to cluster transcripts into families that takes into account genomic synteny information and which can be used to identify genes born de novo within the population. We have shown that, with this tool, it is possible to identify genes that are specific to yeast strains from a given geographical location and are missing from strains at other geographical locations. This supports our hypothesis that de novo genes can be detected at the short evolutionary time scales separating different S. cerevisiae strains.
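Given a table of which strains carry a member of each gene family, location-specific families can be found with a simple filter. The sketch below assumes a presence map and a strain-to-region map as inputs; the strain names, region labels and the function name `location_specific` are all hypothetical.

```python
def location_specific(presence, location):
    """Return {family: region} for families whose member strains
    all come from one and the same region."""
    result = {}
    for family, strains in presence.items():
        regions = {location[s] for s in strains}
        if len(regions) == 1:
            result[family] = next(iter(regions))
    return result

# Toy inputs (invented): which strains carry each family, and where strains were sampled.
presence = {
    "fam1": {"s1", "s2"},
    "fam2": {"s1", "s3"},
    "fam3": {"s3"},
}
location = {"s1": "Europe", "s2": "Europe", "s3": "Asia"}

specific = location_specific(presence, location)
```

Here `fam2` is excluded because it is found in strains from two regions. To call a family de novo rather than lost elsewhere, it would also need to be absent from the related outgroup species, which this sketch does not check.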

We have shown that the translation of de novo ORFs in human transcripts, which are human- or primate-specific, is more polymorphic across individuals than the translation of annotated coding sequences, which tend to be more conserved across species. This provides the first evidence that recently originated proteins are likely to be lost more frequently in the population than older proteins.

We have shown that recently born proteins tend to be positively charged, but subsequently they accumulate changes that promote gain of acidic amino acids, and their charge becomes neutral. These changes are mostly due to mutation biases in the genome, although we also see an effect of natural selection in promoting them.
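The charge shift described above can be illustrated with a rough net-charge estimate that counts basic residues (K, R) as +1 and acidic residues (D, E) as -1. This is a simplification chosen for the example (histidine and terminal groups are ignored), not necessarily the measure used in the project, and the two sequences are invented.

```python
def net_charge(protein):
    """Approximate net charge: +1 per Lys/Arg, -1 per Asp/Glu."""
    positive = sum(protein.count(aa) for aa in "KR")
    negative = sum(protein.count(aa) for aa in "DE")
    return positive - negative

# Invented sequences: a "young" basic peptide and an "older" one
# that has gained acidic residues.
young = "MKRKLLR"
older = "MKRDEL"

young_charge = net_charge(young)
older_charge = net_charge(older)
```

In this toy example the gain of two acidic residues and the loss of two basic ones brings the charge from +4 to 0, mirroring the trend towards neutrality described in the text.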

We have successfully applied the pipeline developed for transcript clustering to tumor RNA-Seq data from different patients. This has allowed us to identify tumor-specific transcripts that are shared by a subset of the patients. We have shown that thirteen of these transcripts are likely to translate small proteins that generate tumor-specific antigens, and which could be used to develop anti-cancer vaccines.
