
Oligoarchive - Intelligent DNA Storage for Archival

Periodic Reporting for period 2 - OLIGOARCHIVE (Oligoarchive - Intelligent DNA Storage for Archival)

Reporting period: 2020-10-01 to 2021-09-30

The demand for analysing archival data, coupled with the need to retain data for regulatory compliance, has resulted in a rapid increase in the amount of archival data stored by industry and academia. As data generation far outpaces the rate of improvement in traditional storage technology, finding a new storage medium that can store such archival data at very low cost is pivotal.
Synthetic DNA is one such storage medium that has recently received attention due to its high density and durability. DNA possesses four key properties that make it relevant for archival storage. First, it is an extremely dense three-dimensional storage medium which can store 455 Exabytes of data in 1 gram; in contrast, a 3.5” hard disk today stores 10 Terabytes and weighs 600 grams. Second, DNA can last several centuries even in harsh storage environments; traditional storage technologies (e.g. magnetic hard disk and tape) have lifetimes of five and thirty years, respectively. Third, in-vitro replication of DNA is very easy, quick, and cheap; copying large Exabyte-sized archives on tape or magnetic hard disk drives takes hours or days due to their limited bandwidth. Fourth, given the demand for digital data storage, we will soon run out of silicon, while DNA is abundantly available to cover our storage needs.
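
To put the density figures in perspective, the comparison can be worked through as simple arithmetic. The sketch below uses only the numbers quoted above (455 Exabytes per gram of DNA, 10 Terabytes in a 600 gram hard disk); it assumes decimal (SI) units, and the function names are illustrative only.

```python
# Density comparison using the figures quoted in the text.
# Assumption: decimal (SI) units, i.e. 1 Exabyte = 10**18 bytes.
EXABYTE = 10**18
TERABYTE = 10**12

DNA_BYTES_PER_GRAM = 455 * EXABYTE   # 455 Exabytes in 1 gram of DNA
HDD_CAPACITY_BYTES = 10 * TERABYTE   # capacity of one 3.5-inch hard disk
HDD_MASS_GRAMS = 600                 # mass of that hard disk


def hdd_bytes_per_gram() -> float:
    """Bytes stored per gram of hard disk."""
    return HDD_CAPACITY_BYTES / HDD_MASS_GRAMS


def density_ratio() -> float:
    """How many times denser DNA is than the hard disk, by mass."""
    return DNA_BYTES_PER_GRAM / hdd_bytes_per_gram()


def dna_grams_for(num_bytes: int) -> float:
    """Mass of DNA (in grams) needed to hold num_bytes at the quoted density."""
    return num_bytes / DNA_BYTES_PER_GRAM


if __name__ == "__main__":
    print(f"Hard disk: {hdd_bytes_per_gram():.3e} bytes per gram")
    print(f"DNA is about {density_ratio():.2e} times denser by mass")
    nanograms = dna_grams_for(HDD_CAPACITY_BYTES) * 1e9
    print(f"One disk's worth of data needs about {nanograms:.0f} ng of DNA")
```

Under these assumptions DNA is roughly 2.7 × 10^10 times denser by mass, and the contents of one such hard disk would fit in about 22 nanograms of DNA. This is a theoretical density; it ignores redundancy, indexing and any physical packaging.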

The vision of this project is to develop an end-to-end prototype for DNA storage which enables storing and analysing arbitrary information in DNA. We will carry out the research to develop the fundamental building blocks needed in a series of scientific breakthroughs:
- Near-molecule data analysis: to develop approaches to analyse data stored in DNA using biomolecular techniques directly in storage. The approaches will be faster and more energy efficient compared to traditional computers thanks to the unprecedented parallelism.
- Accelerated sequencing: speeding up reading data from DNA storage by developing novel sequencing techniques for DNA storage.
- Optimal encoding for different types of data: novel, tuneable error correction for different types of data (e.g. imaging data can tolerate some errors whereas text cannot).
- Synthesis: novel synthesis methods to cheaply write data to DNA for storage (which can tolerate small imprecisions as opposed to DNA used for biological purposes).
- End-to-end automation: automatic translation from binary data to DNA, synthesis, data analysis, selective retrieval and sequencing to read back to binary information, all based on robotic equipment.
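
As an illustration of the "binary data to DNA" translation step mentioned in the list above, the following is a minimal sketch of a naive mapping that stores two bits per nucleotide. It is not the project's encoding: the mapping table and the function names are hypothetical, and a practical scheme would also need error correction, addressing, and constraints on the sequences produced (for example avoiding long runs of the same base).

```python
# Naive 2-bits-per-nucleotide mapping (illustrative assumption only):
# 00 -> A, 01 -> C, 10 -> G, 11 -> T
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Translate bytes into a nucleotide string, four bases per byte."""
    strand = []
    for byte in data:
        # Walk the byte from its most significant bit pair to its least.
        for shift in (6, 4, 2, 0):
            strand.append(BASES[(byte >> shift) & 0b11])
    return "".join(strand)


def decode(strand: str) -> bytes:
    """Translate a nucleotide string produced by encode() back into bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # str.index raises ValueError for characters outside ACGT.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    message = b"archive"
    strand = encode(message)
    print(strand)
    print(decode(strand) == message)
```

The round trip only works on an error-free strand. Synthesis and sequencing introduce substitutions, insertions and deletions, which is why the encoding objective above calls for error correction that can be tuned to the type of data.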

These building blocks enable us to build the first efficient end-to-end prototype for DNA storage comparable to today’s archival storage in terms of speed but more cost-effective.
The consortium started the first year of the project well, with the kick-off meeting in November 2019. Collaboration among members has been very good, with many scheduled as well as impromptu meetings and generally excellent communication. All partners have also made a strong start in working towards the goals set out.

However, the first year of the project also faced difficulties: as soon as the project had started, with infrastructure being set up and hiring under way, the COVID-19 pandemic began. Everybody had to work remotely, hiring and procurement were interrupted, and wet labs closed. This affected some lines of work more than others: computational work proceeded with little delay, whereas wet lab experiments could not be carried out and were thus postponed.

Still, despite a difficult first year, considerable progress has been made. We have carried out much of the groundwork (e.g. encoding survey, thermodynamic models, computational modelling of experiments, design of encodings and many others) and have also accomplished impressive results (direct oligonucleotide sequencing to accelerate sequencing, enzymatic synthesis, image data encoding and many more). Some of the results have already been published (or are in the process of being patented), while at the same time we are filling the publication pipeline. This past year has put us on a very strong footing to produce the results outlined in the Description of Action (DoA).
The project will deliver on its objectives as set out in the Description of Action:

- Near-molecule data analysis: to develop approaches to analyse data stored in DNA using biomolecular techniques directly in storage. The approaches will be faster and more energy efficient compared to traditional computers thanks to the unprecedented parallelism.
- Accelerated sequencing: speeding up reading data from DNA storage by developing novel sequencing techniques for DNA storage.
- Optimal encoding for different types of data: novel, tuneable error correction for different types of data (e.g. imaging data can tolerate some errors whereas text cannot).
- Synthesis: novel synthesis methods to cheaply write data to DNA for storage (which can tolerate small imprecisions as opposed to DNA used for biological purposes).
- End-to-end automation: automatic translation from binary data to DNA, synthesis, data analysis, selective retrieval and sequencing to read back to binary information, all based on robotic equipment.

All of these objectives move us well beyond the state of the art. In addition, whilst working towards these objectives, we have developed and will continue to develop many related technologies, such as accelerated alignment (needed for speeding up sequencing), enzymatic DNA synthesis, the study of error characteristics of DNA storage, efficient encoding approaches for image information in DNA and many more.