Periodic Reporting for period 1 - NovoGenePop (Deciphering de novo gene birth in populations)
Reporting period: 2022-06-01 to 2024-11-30
We have developed a computational pipeline that uses Illumina Ribo-Seq data to identify translated open reading frames (ORFs). The pipeline includes several steps: mapping the Ribo-Seq reads to the genome, defining the P-site (central codon position) for reads of different lengths, and identifying ORFs that show significant translation signatures.
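As an illustration of the last two steps, the following is a minimal sketch, not the project's actual pipeline. It assumes that a "translation signature" can be approximated by an excess of P-sites in the ORF's reading frame, tested with a one-sided binomial test against a uniform distribution over the three frames. The offset table `PSITE_OFFSETS` and the function names are hypothetical, and only forward-strand reads are handled.

```python
from math import comb

# Hypothetical P-site offsets (distance from the read 5' end to the P-site),
# one per read length; real values would be calibrated from the data.
PSITE_OFFSETS = {28: 12, 29: 12, 30: 13}


def frame_counts(reads, orf_start, orf_end, offsets=PSITE_OFFSETS):
    """Count P-sites falling in each of the three frames of an ORF.

    reads: iterable of (five_prime_pos, read_length), forward strand, 0-based.
    orf_start, orf_end: 0-based, half-open ORF coordinates.
    """
    counts = [0, 0, 0]
    for pos, length in reads:
        offset = offsets.get(length)
        if offset is None:
            continue  # skip read lengths without a calibrated offset
        psite = pos + offset
        if orf_start <= psite < orf_end:
            counts[(psite - orf_start) % 3] += 1
    return counts


def in_frame_pvalue(counts):
    """One-sided binomial P(X >= in-frame count) when each frame has p = 1/3."""
    n = sum(counts)
    k = counts[0]
    return sum(comb(n, i) * (1 / 3) ** i * (2 / 3) ** (n - i)
               for i in range(k, n + 1))


# Toy example: four usable reads over an ORF spanning positions 100-130.
reads = [(88, 28), (91, 28), (94, 29), (97, 30), (50, 31)]
counts = frame_counts(reads, 100, 130)
pvalue = in_frame_pvalue(counts)
```

In practice the offsets would be estimated per read length from reads around annotated start codons, and the resulting p-values would be corrected for multiple testing across ORFs.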
We have used Nanopore RNA-Seq to validate the transcripts reconstructed with our pipeline in different Saccharomyces species. In Nanopore RNA-Seq data each transcript corresponds to a single sequencing read, which provides accurate information on the full extent of a given transcript.
We have developed a methodology, based on conserved blocks of genomic synteny, to identify orthologous transcripts across yeast species and across Saccharomyces cerevisiae strains. Using this methodology we have been able to cluster all the transcripts into gene families.
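The clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's method: it assumes that pairwise orthology calls (for example, transcripts overlapping within the same syntenic block) are already available as `ortholog_pairs`, and it groups transcripts into families by single linkage using a union-find structure. The function name and the transcript identifiers are hypothetical.

```python
def cluster_transcripts(transcripts, ortholog_pairs):
    """Group transcripts into families: connected components of orthology links."""
    parent = {t: t for t in transcripts}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in ortholog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for t in transcripts:
        families.setdefault(find(t), set()).add(t)
    # Return a deterministic, sorted list of sorted member lists.
    return sorted(sorted(members) for members in families.values())


# Toy example with hypothetical identifiers (strain_transcript).
transcripts = ["s1_t1", "s2_t1", "s3_t1", "s1_t2"]
ortholog_pairs = [("s1_t1", "s2_t1"), ("s2_t1", "s3_t1")]
families = cluster_transcripts(transcripts, ortholog_pairs)
```

A family with members in only one or a few strains, such as the singleton above, is the kind of cluster that would then be examined as a candidate de novo gene.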
We have identified putative de novo genes in seven S. cerevisiae strains using information from the gene families. This shows that de novo gene birth events can be detected at short evolutionary time scales in the yeast population.
We have utilized Ribo-Seq data from 72 Yoruba-derived lymphoblastoid cell lines (LCLs) to evaluate the level of polymorphism of translated non-canonical ORFs in this human population. We have found that a collection of previously defined de novo ORFs tend to be more polymorphic in the population than canonical protein-coding sequences.
We have measured the strength of purifying selection in yeast putative de novo genes using single nucleotide polymorphism data. We have shown that recently emerged de novo genes are under weaker purifying selection than older genes.
We have applied the computational pipeline for transcript clustering, which uses Illumina RNA-Seq data, to study tumor transcript diversity in a hepatocellular carcinoma cohort. This has allowed us to identify transcripts that are tumor-specific yet shared by different individuals.
We have published three research articles related to the project (see Publications section).
We have shown that the translation of de novo ORFs in human transcripts, which are human- or primate-specific, is more polymorphic across individuals than the translation of annotated coding sequences, which tend to be more conserved across species. This provides the first evidence that recently originated proteins are likely to be lost more frequently in the population than older proteins.
We have shown that recently born proteins tend to be positively charged, but that they subsequently accumulate changes that promote the gain of acidic amino acids, so that their charge becomes neutral. These changes are mostly due to mutation biases in the genome, although we also see an effect of natural selection in promoting them.
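For reference, the net charge of a protein can be approximated from its sequence. The sketch below is a simplified illustration, not the measure used in the project: it assumes lysine and arginine each contribute +1, aspartate and glutamate each contribute -1, and it ignores histidine and the chain termini. The function name is hypothetical.

```python
POSITIVE = set("KR")  # lysine, arginine
NEGATIVE = set("DE")  # aspartate, glutamate (the acidic amino acids)


def net_charge(protein):
    """Approximate net charge at neutral pH as (K + R) - (D + E)."""
    seq = protein.upper()
    positive = sum(aa in POSITIVE for aa in seq)
    negative = sum(aa in NEGATIVE for aa in seq)
    return positive - negative


# Toy sequences: a basic peptide and a variant that has gained acidic residues.
young = net_charge("MKRKLS")
older = net_charge("MKDKES")
```

Under this approximation, each basic residue replaced by an acidic one lowers the net charge by two, which is how a positively charged young protein can drift toward neutrality.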
We have successfully applied the pipeline developed for transcript clustering to tumor RNA-Seq data from different patients. This has allowed us to identify tumor-specific transcripts that are shared by a subset of the patients. We have shown that thirteen of these transcripts are likely to be translated into small proteins that generate tumor-specific antigens, which could be used to develop anti-cancer vaccines.