Periodic Reporting for period 3 - OLIGOARCHIVE (Oligoarchive - Intelligent DNA Storage for Archival)
Reporting period: 2021-10-01 to 2023-03-31
Synthetic DNA is a storage medium that has recently received attention due to its high density and durability. DNA possesses four key properties that make it relevant for archival storage. First, it is an extremely dense three-dimensional storage medium which can store 455 Exabytes of data in 1 gram; in contrast, a 3.5” hard disk today stores 10 Terabytes and weighs 600 grams. Second, DNA can last several centuries even in harsh storage environments; traditional storage technologies (e.g. magnetic hard disk and tape) have lifetimes of roughly five and thirty years, respectively. Third, in-vitro replication of DNA is very easy, quick, and cheap; with tape and magnetic hard disk drives, copying large Exabyte-sized archives takes hours or days because of their limited bandwidth. Fourth, given the demand for digital data storage, we will soon run out of silicon, while DNA is abundantly available to cover our storage needs.
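As a rough back-of-the-envelope check of the density figures quoted above, the following Python sketch compares storage capacity per gram. It uses only the numbers given in the text and assumes decimal units (1 Exabyte = 10^6 Terabytes); the constant and function names are illustrative and not part of the project.

```python
# Figures are taken from the text above; decimal units are an assumption.
TB_PER_EB = 1_000_000      # 1 exabyte = 10^6 terabytes (decimal convention)

DNA_EB_PER_GRAM = 455      # capacity of 1 gram of DNA, in exabytes
HDD_CAPACITY_TB = 10       # capacity of one 3.5" hard disk, in terabytes
HDD_WEIGHT_GRAMS = 600     # weight of that disk, in grams


def density_ratio() -> float:
    """How many times more data per gram DNA holds than the hard disk."""
    dna_tb_per_gram = DNA_EB_PER_GRAM * TB_PER_EB
    hdd_tb_per_gram = HDD_CAPACITY_TB / HDD_WEIGHT_GRAMS
    return dna_tb_per_gram / hdd_tb_per_gram


def disks_per_gram_of_dna() -> float:
    """Number of 10 TB disks needed to hold the contents of 1 gram of DNA."""
    return DNA_EB_PER_GRAM * TB_PER_EB / HDD_CAPACITY_TB


if __name__ == "__main__":
    print(f"Density ratio (per gram): {density_ratio():.3g}")
    print(f"Disks per gram of DNA:    {disks_per_gram_of_dna():,.0f}")
```

Under these assumptions, one gram of DNA holds as much data as about 45.5 million such disks, which corresponds to a per-gram density advantage of roughly 2.7 × 10^10.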
The vision of this project is to develop an end-to-end prototype for DNA storage which enables storing and analysing arbitrary information in DNA. We will carry out the research to develop the fundamental building blocks needed in a series of scientific breakthroughs:
- Near-molecule data analysis: to develop approaches to analyse data stored in DNA using biomolecular techniques directly in storage. The approaches will be faster and more energy-efficient than traditional computers thanks to the unprecedented parallelism.
- Accelerated sequencing: speeding up reading data from DNA storage by developing novel sequencing techniques for DNA storage.
- Optimal encoding for different types of data: novel, tuneable error correction for different types of data (e.g. imaging data can tolerate some errors whereas text cannot).
- Synthesis: novel synthesis methods to cheaply write data to DNA for storage (which can tolerate small imprecisions as opposed to DNA used for biological purposes).
- End-to-end automation: automatic translation from binary data to DNA, synthesis, data analysis, selective retrieval and sequencing to read back to binary information, all based on robotic equipment.
These building blocks enable us to build the first efficient end-to-end prototype for DNA storage comparable to today’s archival storage in terms of speed but more cost-effective.
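To make the "translation from binary data to DNA" step listed above concrete, here is a minimal Python sketch of the simplest possible mapping, two bits per nucleotide. This is an illustrative assumption and not the project's encoding: the base ordering and the `encode`/`decode` names are hypothetical, and a practical scheme would add error correction and avoid sequences that are hard to synthesise or sequence.

```python
# Naive 2-bits-per-nucleotide mapping (assumed for illustration only):
# 00 -> A, 01 -> C, 10 -> G, 11 -> T
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Translate binary data into a nucleotide string, 4 bases per byte."""
    out = []
    for byte in data:
        # Take the byte two bits at a time, most significant pair first.
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def decode(strand: str) -> bytes:
    """Translate a nucleotide string produced by encode() back into bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # str.index raises ValueError for anything other than A, C, G, T.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)           # CAGACGGC
    print(decode(strand))   # b'Hi'
```

The round trip only works on error-free strands; tolerating synthesis and sequencing errors is precisely what the tuneable error correction objective addresses.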
However, the first year of the project also faced difficulties: as soon as the project started, with infrastructure being set up and hiring under way, the COVID-19 pandemic began. Everybody had to work remotely, hiring and procurement were interrupted, and wet labs closed. This affected some lines of work more than others: computational work proceeded with little delay, whereas wet lab experiments could not be carried out and were therefore postponed.
Still, despite a difficult first year, considerable progress has been made. We have carried out much of the groundwork (e.g. encoding survey, thermodynamic models, computational modelling of experiments, design of encodings and many others) and have also accomplished impressive results (direct oligonucleotide sequencing to accelerate sequencing, enzymatic synthesis, image data encoding and many more). Some of the results have already been published (or are in the process of being patented), while at the same time we are filling the publication pipeline. This past year has put us on a very strong footing to produce the results outlined in the Description of Action (DoA).
All of the objectives listed above move us well beyond the state of the art. In addition, while working towards them, we have developed, and will continue to develop, many related technologies, such as accelerated alignment (needed for speeding up sequencing), enzymatic DNA synthesis, the study of error characteristics of DNA storage, efficient encoding approaches for image information in DNA and many more.