CORDIS - EU research results

Interactive web-based tool for the design of multi-RNA binding protein binding site cassettes

Periodic Reporting for period 1 - CARBP (Interactive web-based tool for the design of multi-RNA binding protein binding site cassettes)

Reporting period: 2019-07-01 to 2020-06-30

Problem:

For the past two decades, synthetic biologists have built a portfolio of increasingly sophisticated biological circuits that can perform logical functions inside living cells [1–4]. Such circuits are made from “biological parts”, which are biochemical analogs of the electronic components routinely used in the design of electrical circuits. Unfortunately, unlike their electronic counterparts, biological parts often fail when connected to form circuits. This is mostly because many parts are short sequences of DNA or RNA, and connecting them introduces unpredictable and undesirable sequence effects [5]. As a result, many iterations of trial and error are often needed before a successful design is achieved. This is termed the design-build-test (DBT) cycle in synthetic biology and is considered a major bottleneck for progress in the field. Specifically, the field lacks computational methods that allow users to reliably design their system of choice without going through multiple time-consuming DBT cycles. The challenge of formulating such algorithms is rooted in the large space of biomolecules that make up the biological parts and the variety of interactions that are possible between them. This translates to a plethora of molecular mechanisms, each governed by different kinetics, thermodynamic parameters, and free-energy considerations. Consequently, devising a reliable design algorithm requires case-specific kinetic and/or thermodynamic modelling. Reliable algorithms are especially needed for the design of RNA-centric functional modules for various applications. In a recent study, we demonstrated model-based functional design of non-repetitive sgRNA cassettes for targeting multiple metabolic genes in bacteria [11].
Another RNA-based system in which a reliable design algorithm can help realize the full potential of the technology is the encoding of multiple repeats of phage coat protein (CP) binding elements on an RNA molecule of choice. Such cassettes have been used in many studies for a variety of applications, including gene editing and RNA tracking [12–17]. However, a limited understanding of CP binding in vivo has forced cassette designs to incorporate repeated hairpin-like sequence elements, making them cumbersome to synthesize using current oligo-based technology. Subsequent steps, including cloning and genome maintenance, are also badly affected by the repetitive nature of the cassette. Finally, repeated sequence elements are notoriously unstable [18], which damages protein binding to the cassette and causes occupancy-related experimental noise. Consequently, these limitations hinder the use of these cassettes for robust quantitative measurements [19] as well as their expansion to more complex multi-genic applications. Previous studies have shown that specificity in phage CP binding to RNA is determined by the structural elements formed by specific sequence motifs [20–26]. This implies that, for a given phage CP, many different sequences may become potential binding sites by folding into a common functional structure. The DBT problem for phage CP-binding cassette design can thus be solved by generating a database of functional binding sites that are divergent in sequence, and then using different sequences with the same functional structure in place of multiple repeats of the same wild-type (WT) sequence. The emergence in recent years of high-throughput oligo library (OL)-based experiments [5,8,27–29] provides a platform for testing hundreds of thousands of potential binding-site variants.
While extremely useful for identifying functional variants, the OL scale is much smaller than the available sequence space for ~20-nt-long binding sites, and thus many functional variants are not sampled. Recently developed machine-learning (ML) algorithms [30–32] provide the necessary tool for computationally expanding the variant database to millions of potentially functional sequences, using the OL as an empirical training dataset. The result is an ML algorithm that can computationally score any sequence for the desired functionality.
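
To make the scale gap concrete, the short Python sketch below computes the size of the sequence space for a 20-nt site and compares it with a library of 100,000 variants (a round number chosen here to stand in for "hundreds of thousands"). It also shows one-hot encoding, a common way to present sequences to an ML model. The encoding choice and all function names are assumptions for illustration; the report does not state how CARBP represents its input sequences.

```python
# Illustrative sketch only: the names and the encoding choice are assumptions,
# not taken from the CARBP implementation.
ALPHABET = "ACGU"


def sequence_space_size(length):
    """Number of distinct RNA sequences of the given length."""
    return len(ALPHABET) ** length


def one_hot(seq):
    """Encode an RNA sequence as a list of 4-element indicator rows (A, C, G, U)."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    # DNA-style input is accepted by mapping T to U.
    for base in seq.upper().replace("T", "U"):
        row = [0] * len(ALPHABET)
        row[index[base]] = 1  # raises KeyError for characters outside A/C/G/U
        rows.append(row)
    return rows


space = sequence_space_size(20)
library = 100_000  # assumed round number for an oligo-library size
print(f"20-nt sequence space: {space:,}")
print(f"fraction sampled by a {library:,}-variant library: {library / space:.1e}")
print(one_hot("ACGU"))
```

Even under these generous assumptions, the library covers well under a millionth of the possible 20-nt sequences, which is why a trained model is needed to score unsampled variants.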


Importance to society:
The development of reliable design software for genetically encoded synthetic biomolecules is of utmost importance to society, as such algorithms are needed to accelerate the development of drugs and other nucleic-acid-based therapeutics. Such a new generation of products could provide many hundreds of thousands of jobs in start-up companies and in a next-generation pharmaceutical sector. Any algorithm similar to CARBP adds to a continuously growing body of design knowledge, thus supporting the emerging synthetic biology industry.

Objectives:

The aim of CARBP was to develop a reliable design algorithm for synthetic long non-coding RNA (slncRNA) molecules that encode multiple non-repeating binding sites for the phage coat proteins of MS2, PP7, and Qβ. The final result of this work is CARBP, an online software tool for the design of DNA CAssettes incorporating sites for RNA-binding proteins (RBPs).
There were four main tasks, all of which were completed successfully:

T1 Experimental generation of CARBP training set.
- We used an oligo library and high-throughput SORT-seq experiments to identify several thousand functional binding sites, which were subsequently used to train a machine-learning algorithm. Together, the binding sites and the trained model form the backbone of the CARBP algorithm.
T2 Development of a Machine Learning (ML) model for CARBP.
- We used the OL data to train a convolutional neural network (CNN). The CNN model, in turn, vastly expanded the known or "predicted" binding spaces for the three CPs. This allowed us to develop a design algorithm for CP binding sites that can draw on a nearly unlimited number of functional sites from the various CP binding spaces.
T3 Validation of ML model.
- We validated the reliability of the CARBP algorithm using three cassettes based on "characterized" OL binding sites and an additional three cassettes that encoded multiple "predicted" or "unseen" binding sites. All six cassettes functioned as designed, thus validating the CARBP algorithm.
T4 Description of CARBP GUI interface.
- Finally, we set up a user-friendly GUI so that any user within the synthetic biology, biology, or biotech communities can design any cassette of choice.
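
The design step described under T2, picking functional sites from a predicted binding space while avoiding repeats, can be sketched roughly as follows. This is a hypothetical illustration, not the CARBP procedure: the greedy strategy, the Hamming-distance criterion, the thresholds, and the toy sequences and scores are all assumptions, since the report does not describe how sites are actually selected.

```python
# Hypothetical sketch: greedy choice of non-repeating binding sites.
# The scores, thresholds, and distance criterion are assumptions made for
# illustration; the report does not specify how CARBP selects sites.

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pick_sites(scored, n_sites, min_score=0.5, min_distance=3):
    """Pick up to n_sites sequences, best score first, each at least
    min_distance mismatches away from every site already picked."""
    chosen = []
    # Sort by descending score; ties are broken alphabetically for determinism.
    for seq, score in sorted(scored.items(), key=lambda kv: (-kv[1], kv[0])):
        if score < min_score:
            break  # all remaining candidates score lower still
        if all(hamming(seq, prev) >= min_distance for prev in chosen):
            chosen.append(seq)
        if len(chosen) == n_sites:
            break
    return chosen


# Toy sequences with made-up scores (not real binding-site data).
candidates = {
    "ACGUACGUACGU": 0.95,
    "ACGUACGUACGA": 0.90,  # one mismatch from the first site: rejected
    "GGCAUUCCGAAU": 0.80,
    "UUUUAAAACCCC": 0.40,  # below min_score: never considered
}
cassette_sites = pick_sites(candidates, n_sites=3)
print(cassette_sites)
```

In this toy run only two sites pass both filters, so the function returns fewer sites than requested. A real tool would presumably need a richer similarity measure than Hamming distance, for example one that also accounts for shared subsequences and RNA secondary structure.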
The expected impact of this project is a public-domain website that allows anyone to design CP-binding cassettes for any application. We also expect start-up companies to use this technology more frequently now that it is accessible.