CORDIS - EU research results

cOmpRession of Genomic dAta to facilitate precision MedIcine

Periodic Reporting for period 1 - ORIGAMI (cOmpRession of Genomic dAta to facilitate precision MedIcine)

Reporting period: 2019-08-01 to 2020-01-31

In Europe, chronic diseases are responsible for a loss of 115 billion EUR per year. To reduce this loss, countries are transforming the way these diseases are treated through personalized medicine. High-throughput sequencing of a patient's DNA enables more accurate diagnosis and the use of targeted therapies. Sequencing a panel of selected genes has already shown its value in breast cancer treatment, for instance. More recently, hospitals have been moving towards whole exome sequencing (20,000 genes) and whole genome sequencing for even greater diagnostic precision.

This increase in the portion of the genome sequenced produces astronomical amounts of digital genomic data. At these volumes, transferring, storing, archiving and analyzing the data become a challenge. To address these issues, ENANCIO has developed LENA, an algorithm that compresses genomic data and improves the entire genomic data workflow. ENANCIO aims to provide a solution that speeds up the transfer and lowers the storage cost of these data, while also improving the speed and precision of data analysis.

Illumina, the world's leading provider of sequencing instruments, has launched new instruments to follow the trend towards whole exome and whole genome sequencing. The objective of this project was for ENANCIO to evaluate the performance of its compression technology on the newest Illumina platforms, to study the market's responsiveness to this algorithm, to establish the new specifications that need to be implemented to facilitate its adoption, and to refine its business plan.

The technical study showed excellent performance for LENA on data generated by the latest sequencing instruments. Storage savings can reach 75%, and transfers of large files can be 5 times faster. The market survey also revealed that this performance was in line with customers' expectations and highlighted further points that need to be addressed depending on the storage strategy.
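The storage and transfer figures above can be related through simple arithmetic. The sketch below is illustrative only: it assumes savings are measured against a fixed baseline file size, that transfer time scales linearly with file size at constant bandwidth, and that compression and decompression time are ignored. The function names and the 100 GB / 1 Gbit/s example values are hypothetical and do not come from the project.

```python
def ratio_from_saving(saving_fraction: float) -> float:
    """Compression ratio implied by a fractional storage saving (0.75 means 75%)."""
    return 1.0 / (1.0 - saving_fraction)


def saving_from_ratio(ratio: float) -> float:
    """Fractional storage saving implied by a compression ratio."""
    return 1.0 - 1.0 / ratio


def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float,
                     ratio: float = 1.0) -> float:
    """Idealised transfer time for a file compressed by `ratio` (no overheads)."""
    return (size_bytes / ratio) / bandwidth_bytes_per_s


# A 75% saving corresponds to a file one quarter of the baseline size.
ratio_for_75_percent = ratio_from_saving(0.75)

# A file 5 times smaller corresponds to an 80% saving.
saving_for_factor_5 = saving_from_ratio(5.0)

# Hypothetical example: 100 GB over a 1 Gbit/s link (125 MB/s).
baseline_time = transfer_seconds(100e9, 125e6)
compressed_time = transfer_seconds(100e9, 125e6, ratio=5.0)
speedup = baseline_time / compressed_time
```

Under these assumptions, a transfer that is 5 times faster implies a file roughly 5 times smaller than the baseline, which is a slightly larger reduction than a 75% saving; the two reported figures are upper bounds and need not come from the same dataset.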
The work performed at the beginning of the project consisted of collecting relevant datasets generated on Illumina's latest platforms in order to evaluate the compression algorithm. The metrics tested were compression ratio, compression/decompression time, compute resources and MD5 checksum. The output of the compression algorithm was also tested to make sure it was directly usable with the most commonly used bioinformatics tools. In parallel, a market survey was carried out to evaluate the current use of sequencing instruments per segment, the added value of a compression solution, and the expectations to be met for good adoption. Based on the information collected, further tests were performed on other sequencing instruments, and additional functionalities for the compression algorithm were tested as prototypes.
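As an illustration of how metrics of this kind can be collected, the following minimal sketch measures compression ratio, compression/decompression time and an MD5 round-trip check. It uses Python's standard gzip module as a stand-in compressor, because LENA's interface is not described here. The function name, the synthetic FASTQ-like record and the reading of the MD5 checksum as a check that decompressed data matches the original are all assumptions. Compute resources (CPU and memory) are not measured in this sketch.

```python
import gzip
import hashlib
import time


def evaluate_roundtrip(raw: bytes, level: int = 6) -> dict:
    """Compress and decompress `raw` with gzip and report basic metrics."""
    md5_before = hashlib.md5(raw).hexdigest()

    start = time.perf_counter()
    compressed = gzip.compress(raw, compresslevel=level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = gzip.decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    md5_after = hashlib.md5(restored).hexdigest()
    return {
        # Original size divided by compressed size (higher is better).
        "compression_ratio": len(raw) / len(compressed),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
        # Matching checksums indicate a lossless round trip.
        "lossless": md5_before == md5_after,
    }


# Synthetic FASTQ-like record repeated many times (illustrative data only;
# real sequencing data is far less repetitive and compresses less well).
record = b"@read1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n"
sample = record * 10_000
metrics = evaluate_roundtrip(sample)
```

The timings depend on the machine and on the compression level, so a realistic evaluation would repeat each measurement on real sequencing files and report averages.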
The ORIGAMI project led to a better understanding of the main barriers to adoption, a better understanding of customers' expectations according to the segment they belong to, and a better positioning of LENA. It also produced a roadmap for future developments and a commercial strategy to put in place.
ENANCIO showed that LENA is the only compression solution on the market that combines three main characteristics: a high compression ratio, ultra-fast compression/decompression and low compute requirements. This combination serves the purpose of a genomic data compression algorithm, which is to decrease cost, since LENA does not need costly, CPU-intensive servers to run. Moreover, with its high compression ratio, LENA confirms its positive impact on the environment, with the possibility of reducing the number of storage servers by a factor of 5.
Compression ratio performance of LENA vs. Gzip files on several Illumina's platforms