Coding for DNA Storage

Project Information

DNAStorage

Grant agreement ID: 101045114

DOI

10.3030/101045114

EC signature date 10 May 2022

Start date 1 June 2022

End date 31 May 2027

Funded under

European Research Council (ERC)

Total cost

€ 1 999 096.00

EU contribution

€ 1 999 096.00

Coordinated by

TECHNION - ISRAEL INSTITUTE OF TECHNOLOGY
Israel

Periodic Reporting for period 1 - DNAStorage (Coding for DNA Storage)

Reporting period: 2022-06-01 to 2024-11-30

DNA-based storage has attracted significant attention due to recent demonstrations that storing information in macromolecules is viable. Unlike classical optical and magnetic storage technologies, DNA-based storage does not require an electrical supply to maintain data integrity, and given the falling costs of DNA synthesis and sequencing, it is estimated that within the next decade DNA storage may become a highly competitive archiving technology. The goal of this research is to develop coding methods and techniques that are specifically targeted at the unique structure and error behavior of DNA-based storage systems. The proposed analytical framework will make it possible to address coding-theoretic challenges arising in the synthesis, storage, and sequencing of DNA strands. To achieve these goals, we design codes for clustering, trace-reconstruction techniques, error-correcting codes, and constrained codes. These codes are applicable to long-term storage and recovery of data recorded in DNA, and they address the unique challenges associated with the DNA storage channel. We expect that the knowledge, techniques, and qualitative insights gained in our investigation will advance DNA storage technologies capable of accommodating massive amounts of data. Lastly, the accompanying experimental testing will allow for practical assessments of system performance and cost.

Clustering algorithms: In this part, we review different methods for evaluating clustering algorithms and introduce a novel clustering algorithm for DNA storage systems, named Gradual Hash-based clustering (GradHC). The primary strength of GradHC is its ability to cluster a wide range of designs with high accuracy, including varying strand lengths, varying cluster sizes (including extremely small clusters), and different error ranges. Benchmark analysis demonstrates that GradHC is significantly more stable and robust than other clustering algorithms previously proposed for DNA storage, while also producing highly reliable clustering results.

Reconstruction algorithms: In this work, we present several new algorithms for the DNA reconstruction problem. Our algorithms look globally at the entire sequence of each trace and use dynamic programming algorithms, originally developed for the shortest common supersequence and longest common subsequence problems, to decode the original string. Our algorithms impose no restrictions on the input or on the number of traces; moreover, they perform well even for error probabilities as high as 0.27. The algorithms have been tested on simulated data, on data from previous DNA storage experiments, and on a newly synthesized dataset, and are shown to outperform previous algorithms in reconstruction accuracy.
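The dynamic-programming building blocks named above can be illustrated on a pair of traces. The following is a minimal Python sketch and not the project's reconstruction algorithm: the function names `longest_common_subsequence` and `scs_length` are hypothetical, and an actual decoder would combine many traces rather than two.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (classic O(len(a)*len(b)) DP)."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the top-left corner to recover one optimal subsequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def scs_length(a: str, b: str) -> int:
    """Length of a shortest common supersequence, via |a| + |b| - |LCS(a, b)|."""
    return len(a) + len(b) - len(longest_common_subsequence(a, b))
```

For example, if one trace is `"ACGTACGT"` and a second trace `"ACGACGT"` has lost a single symbol to a deletion, the longest common subsequence is the shorter trace itself, and the shortest common supersequence has the length of the longer one.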

Error-correcting codes: In the first work, we study error-correcting codes for a channel that models nanopore sequencing. In the second work, we study codes that correct errors and also support performing clustering and reconstruction together.

Constrained codes: In our first work, we propose a universal framework for tackling any parametric constraint problem with far fewer requirements, through a simple iterative algorithm. We demonstrate how to apply this algorithm to the run-length-limited, minimal Hamming weight, and local almost-balanced Hamming weight constraints, as well as to repeat-free and secondary-structure constraints. In the second paper, we show how to apply this technique, together with several additional tools, to almost-balanced codes. In the third and fourth papers, we study how constrained codes are used for DNA labeling, a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is labeled by k specific patterns and is then imaged. The resulting image is modeled as a (k + 1)-ary sequence in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for any single label, and several results are provided for multiple labels as well. Moreover, we provide the minimal number of labels of length one or two that are needed to achieve a labeling capacity of 2. Lastly, in the final paper, we study the capacity of the weighted read channel, which models how sequencing is performed using nanopore sequencing.
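As a concrete instance of one constraint named above, the run-length-limited constraint over the DNA alphabet can be checked and counted directly. The sketch below is illustrative only and is not the iterative framework from the project: the names `max_run_length`, `satisfies_rll`, `count_rll_sequences`, and `rll_rate` are hypothetical, and the constraint is assumed to bound the longest run of identical symbols (homopolymer run) by `max_run`.

```python
import math
from itertools import groupby


def max_run_length(seq: str) -> int:
    """Length of the longest run of identical consecutive symbols (0 for an empty string)."""
    return max((sum(1 for _ in group) for _, group in groupby(seq)), default=0)


def satisfies_rll(seq: str, max_run: int) -> bool:
    """True if no symbol repeats more than max_run times in a row."""
    return max_run_length(seq) <= max_run


def count_rll_sequences(n: int, max_run: int, q: int = 4) -> int:
    """Count length-n sequences over a q-ary alphabet whose runs are all <= max_run."""
    if n == 0:
        return 1
    if max_run < 1:
        return 0
    # counts[r] = number of valid sequences whose final run has length exactly r
    counts = [0] * (max_run + 1)
    counts[1] = q
    for _ in range(n - 1):
        total = sum(counts)
        new = [0] * (max_run + 1)
        new[1] = (q - 1) * total          # append a symbol different from the last one
        for r in range(1, max_run):
            new[r + 1] = counts[r]        # extend the current run by one symbol
        counts = new
    return sum(counts)


def rll_rate(n: int, max_run: int, q: int = 4) -> float:
    """Information rate in bits per symbol achievable at length n under the constraint."""
    return math.log2(count_rll_sequences(n, max_run, q)) / n
```

Calling `rll_rate(n, 3)` for growing `n` gives a numerical feel for how little rate is lost when homopolymer runs longer than three are forbidden; the unconstrained rate over four symbols is 2 bits per symbol.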

Codes for sequencing: We aim to reduce not only the cost but also the latency of DNA storage by studying the DNA coverage depth problem, which seeks to reduce the number of reads required to retrieve information from the storage system. We show how to optimally pair an error-correcting code with a given retrieval algorithm to minimize the sequencing coverage depth, while guaranteeing retrieval of the information with high probability. Additionally, we study the DNA coverage depth problem under the random-access setup for either a single strand or a single file.
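A simplified calculation shows why coding can reduce the number of reads. The sketch below rests on assumptions that are not taken from the report: each read returns one of `n` stored strands uniformly at random and without errors, and the data can be recovered from any `k` distinct strands (as with an MDS code). Under these assumptions the expected number of reads is the partial coupon-collector sum n/n + n/(n-1) + ... + n/(n-k+1). The function name `expected_reads` is hypothetical.

```python
from fractions import Fraction


def expected_reads(n: int, k: int) -> Fraction:
    """Expected number of uniform random reads until k distinct strands out of n are seen.

    Assumes error-free reads drawn uniformly and independently from the n strands.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    # After i distinct strands have been seen, a new one appears with probability
    # (n - i) / n, so the expected waiting time for it is n / (n - i).
    return sum((Fraction(n, n - i) for i in range(k)), Fraction(0))


if __name__ == "__main__":
    uncoded = expected_reads(4, 4)   # all 4 of 4 strands are needed
    coded = expected_reads(8, 4)     # any 4 of 8 coded strands suffice
    print(float(uncoded), float(coded))
```

In this toy setting, storing 4 information strands uncoded requires 25/3 (about 8.33) reads on average, while encoding them into 8 strands of which any 4 suffice requires 533/105 (about 5.08) reads on average, at the price of synthesizing twice as many strands.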

Codes for synthesis: In conventional DNA synthesis machines, many strands are usually synthesized in parallel by iterating through a supersequence s and adding, in each cycle, a single nucleotide to a subset of the strands. The length of s therefore determines the number of cycles, and hence the time and cost of the synthesis process. Recently, to optimize the synthesis process, researchers have suggested appending a shortmer in each cycle instead of a single nucleotide. This work studies this optimization from a theoretical point of view. It discusses which shortmers are the best to use, and how to calculate the number of cycles required to synthesize a set of strands in parallel using a set of shortmers. Lastly, following a previously described connection between the DNA synthesis problem and costly constrained graphs, this work investigates calculating the capacities of such non-deterministic graphs.
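The cycle count in the single-nucleotide setting can be sketched as follows. This is an illustration under stated assumptions, not the method studied in the project: the supersequence s is assumed to be the periodic repetition of `"ACGT"`, one nucleotide is offered per cycle, and shortmers are not modeled. The function names `cycles_needed` and `parallel_cycles` are hypothetical.

```python
def cycles_needed(strand: str, order: str = "ACGT") -> int:
    """Number of cycles to synthesize one strand when cycle t offers order[t % len(order)].

    Each strand is extended greedily: it takes the offered nucleotide whenever that
    nucleotide is the next one it needs, which is optimal for a fixed supersequence.
    """
    if not order:
        raise ValueError("order must be non-empty")
    unknown = set(strand) - set(order)
    if unknown:
        raise ValueError(f"symbols not in synthesis order: {sorted(unknown)}")
    cycles = 0
    pos = 0  # index of the next nucleotide the strand still needs
    while pos < len(strand):
        if strand[pos] == order[cycles % len(order)]:
            pos += 1
        cycles += 1
    return cycles


def parallel_cycles(strands, order: str = "ACGT") -> int:
    """Cycles needed to synthesize all strands in parallel: the slowest strand decides."""
    return max((cycles_needed(s, order) for s in strands), default=0)
```

With this periodic order, the strand `"ACGT"` finishes in 4 cycles, while `"TGCA"` needs 13 because each of its nucleotides arrives almost a full period after the previous one. This gap is what makes the choice of supersequence, and of the strands themselves, matter for synthesis time and cost.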

In many of our works, we show how to take the practical model of sequencing, synthesis or other steps and build a theoretical model that describes it properly.
This enables us to study these important practical problems using information-theoretic tools and to derive fundamental results and bounds in each case.
More work remains, mostly on error correction, which is crucial for constructing practical end-to-end solutions for DNA-based storage. Codes are needed that correct probabilistic edit errors, both in a single strand and across multiple strands. Furthermore, we still lack efficient reconstruction solutions and algorithms that operate on the output signal of the nanopore. We plan to address these needs in the rest of the project.

A DNA-based Storage System
