Game-changing advances in protein design and engineering

Implementation of statistical models and machine learning algorithms can assist the design and engineering of novel proteins with improved functionality.

Designing novel proteins with desired functions is complex, but it has far-reaching implications for the pharmaceutical, biomedical and industrial sectors. Although medical applications are currently the most lucrative market for engineered protein products, synthetic enzymes are also used in food processing. Artificial enzymes have environmental applications as well, both in the direct detoxification of pollutants and in the design of modified microorganisms that eliminate environmental pollutants such as plastics.

Simplifying the design of new proteins

Designing new proteins with improved target functionality is difficult because the sequence space is vast and many structural constraints must be satisfied. For example, a small protein of 100 amino acids has 20^100, or about 10^130, possible variants. That is far more than the number of atoms in the observable universe (commonly estimated at around 10^80), yet the overwhelming majority of these variants are non-functional. It is becoming increasingly clear that finding the best sequence variant for a given purpose requires sophisticated experimental solutions combined with advanced computational approaches. To this end, the INFERNET project developed effective tools for inference and optimisation from large-scale data. The research was undertaken with the support of the Marie Skłodowska-Curie Actions (MSCA) programme. “To draw conclusions or make predictions based on observed patterns and trends, we built statistical models and machine learning algorithms that helped us analyse the data and identify relationships and correlations between variables,” explains MSCA research fellow Andrea Pagnani.
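The size of the sequence space quoted above can be checked with a few lines of arithmetic. The sketch below assumes the standard alphabet of 20 amino acids and works in base-10 logarithms; the function and constant names are illustrative only.

```python
import math

ALPHABET_SIZE = 20      # standard amino acids (assumption stated in the lead-in)
PROTEIN_LENGTH = 100    # the "small protein" from the example in the text


def log10_sequence_space(length, alphabet_size=ALPHABET_SIZE):
    """Return log10 of the number of possible sequences of a given length."""
    # alphabet_size ** length, expressed as a power of ten
    return length * math.log10(alphabet_size)


exponent = log10_sequence_space(PROTEIN_LENGTH)
print(f"20^{PROTEIN_LENGTH} is about 10^{exponent:.1f}")
```

Because the exponent grows linearly with protein length, every additional residue multiplies the number of candidate sequences by 20, which is why exhaustive experimental screening is not an option even for short proteins.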

Modelling genotype-phenotype relationships

The combination of accurate high-throughput biochemical assays with sequencing techniques has established large-scale genetic screening as a fundamental tool for studying the relationship between evolution, fitness and related biological properties. Such screens make it possible to investigate the relationship between genotype and phenotype under controlled selective pressure from external factors, and they are routinely used to select molecules with specific properties. INFERNET developed a data-driven probabilistic approach for modelling the genotype-phenotype association derived from these experiments. This method can be used as a generative model to find new genetic variants with high fitness, and it can be incorporated into a machine learning-based process of directed evolution.

Predicting mutations during evolution

A key requirement for predicting the distribution and frequency of genetic mutations is the ability to efficiently generate artificial sequences with a given target specificity. Different computational strategies and modelling approaches have been devised to this end. “Generating artificial sequences, from our standpoint, means being able to efficiently generate a set of sequences with indistinguishable statistical characterisations from the training set,” outlines Pagnani. INFERNET proposed a new computational strategy to generate sequences that are very different from the natural ones. This computational pipeline must be followed by experimental validation of the biological activity of the selected set of artificial sequences.
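To make the idea of a generative sequence model concrete, here is a minimal sketch in Python. It is not the INFERNET method, which the article does not specify: it is a deliberately simplified site-independent profile model that learns amino acid frequencies at each position of a toy alignment and samples new sequences from them. The alignment, the pseudocount value and all function names are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy alignment; real training sets contain thousands of sequences.
TOY_ALIGNMENT = ["MKVLA", "MKILA", "MRVLG", "MKVIA", "MRILA"]


def fit_profile(alignment, pseudocount=1.0):
    """Estimate per-position amino acid probabilities from aligned sequences."""
    length = len(alignment[0])
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        # The pseudocount keeps unseen residues at a small non-zero probability.
        profile.append([(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS])
    return profile


def sample_sequences(profile, n, seed=0):
    """Draw n artificial sequences, sampling each position independently."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(AMINO_ACIDS, weights=column)[0] for column in profile)
        for _ in range(n)
    ]


profile = fit_profile(TOY_ALIGNMENT)
artificial = sample_sequences(profile, n=10)
print(artificial)
```

A model of this kind reproduces only the single-site frequencies of the training set. Matching richer statistics, such as correlations between pairs of positions, requires models with interaction terms, and any generated sequence would still need the experimental validation described above.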

INFERNET methodology to improve protein functionality

A key validation of the INFERNET methodology was the design of artificial chorismate mutase, a fundamental enzyme in the biosynthesis of aromatic amino acids. Researchers were able to design new natural-like variants with conserved or improved functionality. The INFERNET sequence-based statistical models were sufficient to specify proteins and provide access to an enormous space of functional sequences. This result provided a foundation for a general process for the evolution-based design of artificial proteins. “Such evolution-based statistical approaches may provide an informed guide for the quest for functional proteins with an improved target functionality,” concludes Pagnani.

Keywords

INFERNET, proteins, evolution, statistical models, machine learning algorithms, engineering, protein design, genetic mutations, inference, chorismate mutase

New algorithms for inference and optimization from large-scale biological data