BIO knowLEDGe Extractor and Modeller for Protein Production

Final Report Summary - BIOLEDGE (BIO knowLEDGe Extractor and Modeller for Protein Production)

Executive Summary:
Biotechnology is recognized by the EU as a Key Enabling Technology for Europe. By the year 2025, an increasing number of chemicals and materials will be produced using biotechnology in one or more of the processing steps. Biotechnological processes will be used to produce chemicals and materials which are hard or impossible to produce conventionally, or to make existing products in a more efficient way. The development and use of Industrial Biotechnology is essential for the future competitiveness of European industry and provides a sound technological base for the sustainable society of the future.
One of the main research areas is the identification of novel enzymes and micro-organisms that will provide tomorrow’s new products and improved processes. European companies already produce around 70% of the world’s industrial enzymes, and have a well-established research base. However, there is an increasing demand for new enzymes as novel biotechnological processes are developed and refined.
Rapid progress has been made in the techniques and equipment for DNA sequencing, enabling relatively fast mapping of microbial genomes. The vast amount of information generated in this way has to be stored, organised, indexed, and analysed. The BIOLEDGE platform combines the bioinformatics platform development with the construction of genome-wide as well as dynamic models relevant for the protein production process.
The BIOLEDGE project has had a clear impact on the competitiveness of the participating entities. Significant advances were made in developing mathematical modeling tools for:

• Cultivation and genome wide data integration and interpretation
• Scientific publication text mining to support biotechnology research
• Cultivation condition optimization
• Intracellular interaction network prediction
• High performance computing platform for biological modelling

The models and tools developed in BIOLEDGE all aim to guide research along a more targeted path by first exploring the possible alternatives and options in the solution space in silico, before cost- and labour-intensive work is conducted. Such predictive models, pipelines, and design tools reduce the workforce and financial requirements for achieving set research goals and suggest more rapid routes to overcome the accompanying challenges. The achievements in reducing process times or production costs are expected to have a direct impact on the quality of the service that society receives. The tools developed in the BIOLEDGE project establish a substantial and robust foundation on which industry can build in order to optimise strain and process design as well as the execution and control of the manufacturing process itself.

Project Context and Objectives:
The collaborative project BIOLEDGE specifically addressed the call FP7 Cooperation Work Programme: KBBE.2011.3.6-01: Increasing the accessibility, usability and predictive capacities of bioinformatics tools for biotechnology applications. Better bioinformatics tools are needed to capitalize on vast amounts of data becoming accessible in biotechnology applications. The theme of this collaborative project was protein production for biotechnology applications.

Proteins are needed in the chemical industry in specific conversion reactions to create specialty chemicals (e.g. oxidases, transferases), in pharmaceutical applications in treatment of diseases (e.g. insulin, therapeutic antibodies), in diagnostics (receptors, antibodies) and as industrial enzymes in (for instance) food, feed and biorefinery applications (e.g. amylases, lipases, cellulases). Various uses of proteins are appearing and envisaged also in nanotechnology applications, and these may differ from the current protein products (e.g. self-assembling and structural proteins). The diversity of protein products will further multiply in the future with engineered variants and multifunctional fusion proteins. With increasing interest in replacing chemical synthesis reactions with biotechnology and oil-based production processes with biorefineries, the need for efficient production of high quality proteins will increase significantly.

The best industrial strains of the filamentous fungi Trichoderma reesei and Aspergillus niger produce significant amounts of industrial enzymes. However, it is not understood why these fungi, and particularly the mutant strains, are such efficient producers of native secreted proteins. In addition, success in production of heterologous proteins is still very much a matter of trial and error in any organism. We do not understand why certain proteins are produced better in a particular host while others are not, and why some proteins fail to be efficiently produced at all. Nevertheless, there exist several examples of rational engineering of production hosts, such as the significant improvements of full-length antibody production (to grams per liter) in protease-deficient filamentous fungal strains and engineering of the yeast Pichia (Komagataella) pastoris for production of proteins with human-type glycosylation. Human insulin production in the yeast Saccharomyces cerevisiae is amongst the best established examples of commercial heterologous protein production. The significantly increased protein production capacity of mutant strains and the successful examples with heterologous proteins demonstrate the potential that exists for microorganisms in general, and especially eukaryotic microbes, to be engineered into efficient and versatile protein production hosts.

Until recently, the biological complexity of cellular mechanisms related to secreted protein production and the lack of genome data from important classes of organisms hampered the understanding of protein production. Genome sequencing has now become routine, however, and a great number of genome sequences have become available and annotated, including those of fungi (yeasts and filamentous fungi). This enables comparison of gene content, sequence variation, and the reactions involved in protein production.

Due to the increasing demand for proteins in various applications and the possibilities provided by systems biology tools, it is now urgent and timely to develop new efficient computing tools for data mining, analysis and mathematical modelling that enable faster and more rational strain improvement for protein production. The BIOLEDGE project is addressing this demand.

The current status is that

1. protein production is an important topic in biotechnology, with applications ranging from production of therapeutic proteins for the pharmaceutical industry to uses in industrial biotechnology and in biorefineries;
2. vast amounts of genomic and post-genomic data from industrial microorganisms and mutants are being generated using new technologies such as next-generation sequencing;
3. tools and computational platforms which could capitalize on such new data in the context of protein production for biotechnology applications are lacking;
4. advanced systems biology modelling tools exist or are being developed which aim to reconstruct biological networks from the genomic and post-genomic data, opening new opportunities in biotechnology applications.

In order to address the currently unmet needs in bioinformatics applications for biotechnology, with specific focus on protein production, new bioinformatics, high-performance computing, and modelling platforms will be developed in the BIOLEDGE project. As part of these efforts, new methods will be developed for reconstruction of biological networks from genome-wide data, for data and text mining, and for modelling of protein production. These developments will be supported by wet-lab experiments to generate data needed for platform and tool development and testing, as well as for the model validation.

BIOLEDGE objectives

The main objective of BIOLEDGE was to develop bioinformatics and related modelling and computing platforms to support biotechnology applications in the domain of protein production in an industrial setting. Better bioinformatics tools are needed to capitalize on the vast amounts of data becoming accessible in biotechnology applications. This main objective was divided into six specific objectives:

• O1. To investigate in detail physiological and genome-based data on protein production in industrially relevant microorganisms in order to provide the data and biological insight needed to construct and validate the novel bioinformatics and modelling tools.
• O2. To develop and implement methods for reconstruction of biological networks from the genomic and post-genomic data obtained from industrial microorganisms in the context of protein production.
• O3. To develop and implement knowledge extraction methods from biological databases and literature in the context of industrial biotechnology applications.
• O4. To implement a high-performance computing platform to support bioinformatics mining and modelling for biotechnology applications.
• O5. To develop and implement a bioinformatics platform for knowledge extraction and modelling in the context of industrial biotechnology applications involving protein production.
• O6. To develop, implement and validate predictive models for studies of protein production in industrial microorganisms.

Project Results:
The BIOLEDGE project fulfilled its objectives almost fully as described in the original work plan. All the deliverables were realised and milestones reached as promised. Throughout the project, collaboration within the consortium was very good, ‘high spirited’, and fluent. Four project meetings, three training courses and an IPR-related workshop were organized. In addition, the consortium had several work planning meetings where, depending on the topic at hand, all or some of the partners were represented.

The BIOLEDGE project has had a clear impact on the competitiveness of the participating entities. Significant advances were made in developing mathematical modeling tools for:

• Cultivation and genome wide data integration and interpretation (BIOLEDGE Platform hosted by NorayBio, UCAM CLUSTERnGO)
• Scientific publication text mining to support biotechnology research (UMA and integrated into BIOLEDGE Platform)
• Cultivation condition optimization (CAM Optimus (UCAM and Aalto))
• Intracellular network prediction (Aalto & VTT: CoReCo pipeline & Protein-Protein Interaction prediction (integrated into BIOLEDGE Platform))
• High performance computing platform for biological modelling (Techila)

In addition, for the first time a logical model of protein production, a metabolic model of Trichoderma reesei and a consensus metabolic model of Komagataella pastoris were developed. Altogether, these enable faster and better strain design and cultivation condition development of industrially relevant microbes. The mathematical tools developed are directly applicable even for previously poorly characterized organisms. This is likely to be a great asset in the future when previously underutilized microbes are taken for industrial use. Importantly, the BIOLEDGE project resulted in a novel tool (CAM Optimus (UCAM and Aalto)) and cultivation condition development for a participating SME c-LEcta and in technology development opportunities and exploitation possibilities for SMEs NorayBio and Techila. Thus the integration of the SMEs into the project’s R&D and especially into the technology transfer and demonstration activities resulted in fluent knowledge and innovation transfer from R&D to industrial applications providing the SMEs competitive advantages.

The BIOLEDGE bioinformatics platform supports integration and mining of heterogeneous data related to industrial biotechnology. The BIOLEDGE Web Platform hosted by NorayBio is an innovative tool to enable knowledge extraction from biological repositories, with specific focus on protein production. This platform reduces the gap between data generation and its analysis and successful exploitation, addressing the real need for new methods to analyse such data much faster.
The platform will be a valuable new tool for any kind of study that involves data such as transcriptomics, proteomics and metabolomics data, together with metadata such as phenotype or almost any other kind of variable.

The tools generated in the BIOLEDGE project establish a substantial and robust foundation on which industry can build in order to optimise strain and process design as well as the execution and control of the manufacturing process itself.

1.3.2 Progress in the individual work packages

WP1 Protein production case studies

Objectives:

O1.1 To provide physiological parameters and insight for modelling (e.g. carbon, nutrient and oxygen requirements, yield and productivity figures, energy and redox status, flux analyses)
O1.2 To provide genome-based (-omics) data to aid bioinformatics and platform construction and modelling (e.g. for metabolic and secretion pathway construction, building up regulatory interactions, module interactions)
O1.3 To provide experimental validation to software and models constructed in BIOLEDGE

Progress:

Komagataella (Pichia) pastoris strains expressing mutant versions (T70N and I56T) of Human lysozyme (HuLy) with different degrees of misfolding and reduced secretion ability were grown in chemostat cultures. A strain expressing non-mutated Human Lysozyme (HuLy) and a strain harbouring only the empty vector and thus not expressing any Human lysozyme were used as reference. Continuous fermentations were carried out using these four strains of K. pastoris. The cultures were grown first on sorbitol as the principal carbon source and sorbitol steady-state samples were collected before the expression of HuLy was induced by introducing methanol into the culture medium. Samples were collected 1 hour and 3 hours post induction as well as when the cultures reached a second steady state growing on both methanol and sorbitol.

The physiological parameters obtained from these fermentations were evaluated in conjunction with previous studies that were conducted to evaluate the use of a fed-batch methanol feeding strategy for recombinant protein production by K. pastoris in the presence of sorbitol as a co-substrate (Celik et al., 2009). It was determined that adding sorbitol batch-wise to the medium in addition to methanol eliminated the long lag phase for the cells and enabled attainment of high-cell-density production at an early stage of the process. This achieved higher recombinant protein titres and reduced specific protease production. It also eliminated the build-up of lactic acid, lowered the oxygen uptake rate, and achieved higher overall yield coefficients without affecting the maximum specific oxidase activity. Samples were also collected for transcriptome (RNAseq), metabolome, and proteome analyses. The maximal induction of the HuLy transgene was determined to vary between 3- and 16-fold in different strains, while KAR2 induction was much lower. Splicing of HAC1 was found to be constitutive in these strains, highlighting the usefulness of RNAseq analysis.

In order to gain information on the expression of different types of heterologous proteins, in addition to the Human lysozyme expression cultivations in K. (Pichia) pastoris, Lipase B from Candida antarctica (CalB) and Hydroxynitrile Lyase from Manihot esculenta (HNL) were expressed in Saccharomyces cerevisiae, and the strains were grown in multiple fermentation cultivations under production conditions to provide physiological parameters for modelling.
In parallel with the work producing biological cultivation data in Saccharomyces cerevisiae, new genetic tools were established for the K. (Pichia) pastoris protein production host at c-LEcta for more efficient utilization of this organism in their business. For this, novel integration plasmids and a set of 12 new promoters were created and tested, and two high-expression-level promoters were identified and taken into use at c-LEcta.

Strains of the filamentous fungus Trichoderma reesei producing heterologous proteins or an altered composition of endogenous proteins were cultivated in bioreactors using defined minimal medium in order to obtain information on the effects of different protein loads on the secretory system. The cultivation set-up allowed comparison of the changes caused by production of the endogenous cellulases, or of a heterologous protein (a cutinase or a lipase), either in the absence of the major endogenous cellulase CBHI or in the presence of the full pattern of endogenous cellulases. From 18 bioreactor cultivations, a broad range of extracellular metabolites was measured from 108 time points, free intracellular amino acids from 54 time points, and RNAseq data were generated from 72 sampling points for the analysis and development of the models in WP2 and WP6. Furthermore, in order to validate the functionality of the metabolic model, the T. reesei biomass composition was measured and transformed into stoichiometric coefficients for modelling, i.e. flux balance analysis. The data were compared with the literature for confirmation and interpretation, and then used to confirm the functionality of the genome-wide metabolic model generated in WP2. In particular, the T. reesei metabolic model developed in WP2 was found to be able to predict the protein production rate in the above experiments.
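The conversion of a measured biomass composition into stoichiometric coefficients can be illustrated with a minimal sketch. It assumes the composition is available as mass fractions (g per g dry weight) together with a molar mass per component; the component names and all numbers below are hypothetical placeholders, not BIOLEDGE measurements.

```python
# Hypothetical biomass composition: (mass fraction in g/gDW, molar mass in g/mol).
# Placeholder values for illustration only, not T. reesei measurements.
BIOMASS = {
    "protein": (0.40, 110.0),  # assumed average amino-acid residue mass
    "glucan": (0.30, 162.0),   # anhydroglucose unit
    "lipid": (0.05, 700.0),    # assumed average lipid mass
}


def biomass_coefficients(composition):
    """Convert g/gDW mass fractions into mmol/gDW stoichiometric coefficients."""
    coefficients = {}
    for name, (mass_fraction, molar_mass) in composition.items():
        # g/gDW divided by g/mol gives mol/gDW; multiply by 1000 for mmol/gDW.
        coefficients[name] = 1000.0 * mass_fraction / molar_mass
    return coefficients


coeffs = biomass_coefficients(BIOMASS)
for name, value in coeffs.items():
    print(f"{name}: {value:.3f} mmol/gDW")
```

In a flux balance analysis model, coefficients of this kind would typically enter the biomass reaction as consumed amounts per gram of dry weight formed.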

An integrative analysis of the first sets of experimental data, the existing models of yeasts, and the models developed within the scope of BIOLEDGE showed that the predictive capabilities of the models needed improvement. Specifically, the focus was on predicting how physiological characteristics such as growth respond to variations in nutrient availability, and on how the predictions are affected by identification of the correct localization of enzymes.

To target the first issue, K. pastoris cells expressing recombinant HuLy and the Fab3H6 fragment were grown in different medium compositions, which were determined using an environmental optimization tool based on a genetic algorithm implemented by Aalto. The growth performance and the recombinant protein expression/secretion capability of the strains under different environmental conditions were measured with the aim of identifying the major contributors to the performance of the cell. In order to address the second issue, a genome-wide protein localization study was carried out in both K. pastoris and Saccharomyces cerevisiae.

For this purpose, the first aim was to optimize the available yeast protocols for efficient subcellular fractionation. Once the optimized subcellular fractionation protocols were ready, the major membrane-bound compartments of yeast could be resolved and the genome-scale study of the subcellular compartments of yeast could thus be facilitated. Subcellular fractionation, LC2-MS2, and multivariate statistics were applied to initiate a global and simultaneous high-throughput analysis of protein localisation in S. cerevisiae in order to define the proteomes of the organelles and membrane systems. 1606 proteins, with false-positive error frequency lower than 0.01 covering more than a quarter of the whole yeast proteome, were classified into eight distinct subcellular compartments: plasma membrane, cytoplasm, ER, mitochondria, Golgi apparatus, ribosome and proteasome, vacuole, and nucleus, defining unique locations for ca. 1,000 proteins. Numerous proteins (170) with unknown or putative functions have been assigned to defined subcellular locations, thereby either confirming the putative functions assigned by sequence analysis or providing strong suggestions as to the functional domains in which proteins of so-far unknown function operate.

The environmental optimization tool was validated and developed further as a collaboration between UCAM and Aalto. The outcome of the tool was improved through fine-tuning via experimental analysis. For this purpose, the AOX-inducible K. pastoris strain expressing and secreting HuLy protein was used as the test strain. Medium parameters including pH, glycerol, methanol, and sorbitol concentrations, as well as the concentrations of the main ammonium, phosphate, sulphate, iron and calcium sources of the K. pastoris cells, were optimized using the optimisation tool based on genetic algorithm principles. A substantial improvement was observed in total recombinant protein activity and productivity in the culture supernatant after 3 generations of optimisation, at which time the various parameters started to display convergence. Experiments were then conducted to confirm the optimized conditions by cultivating the clones expressing HuLy under the AOX promoter under commonly employed conditions and defined medium as well as under the conditions optimized by the tool. Precipitation problems were observed under the optimized conditions. Ultimately, through population profiling and resulting optimisation of new medium compositions, a medium was found that produced a 75 % increase in activity and a 124 % increase in productivity. The optimised set of conditions was used in scale-up cultivations in order to determine whether the culture characteristics could be maintained. The results of this analysis indicated, repeatably, that a 15-fold increase in scale changed neither the activity of the HuLy protein produced nor the productivity of the culture, firmly validating the usefulness of the medium optimisation tool CAM Optimus.
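The general idea of optimising medium parameters over successive generations can be sketched as follows. This is not the CAM Optimus implementation: the parameter ranges, the `mock_activity` function standing in for a wet-lab measurement, and all operator settings are assumptions chosen only to make the sketch runnable.

```python
import random

# Hypothetical search ranges for three medium parameters (illustrative only).
BOUNDS = {
    "pH": (4.0, 7.0),
    "methanol_g_per_l": (0.0, 20.0),
    "sorbitol_g_per_l": (0.0, 50.0),
}
# Arbitrary optimum used by the mock measurement below.
TARGET = {"pH": 5.5, "methanol_g_per_l": 8.0, "sorbitol_g_per_l": 30.0}


def mock_activity(medium):
    """Stand-in for a measured activity; higher is better, maximum is 0."""
    return -sum(((medium[k] - TARGET[k]) / (hi - lo)) ** 2
                for k, (lo, hi) in BOUNDS.items())


def random_medium(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def crossover(a, b, rng):
    # Each parameter is inherited from one of the two parents.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}


def mutate(medium, rng, rate=0.2):
    child = dict(medium)
    for k, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            # Gaussian perturbation, clipped to the allowed range.
            child[k] = min(hi, max(lo, child[k] + rng.gauss(0.0, 0.1 * (hi - lo))))
    return child


def optimise(generations=3, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_medium(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=mock_activity, reverse=True)
        history.append(mock_activity(ranked[0]))
        parents = ranked[: population_size // 2]  # the better half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    best = max(population, key=mock_activity)
    history.append(mock_activity(best))
    return best, history


best_medium, best_per_generation = optimise()
print(best_medium)
print(best_per_generation)
```

Because the better half of each population is carried over unchanged, the best score never decreases from one generation to the next. In a real setting each evaluation would be a cultivation experiment rather than a function call.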

WP2 Network reconstruction

Objectives:

O2.1 To improve the predictive accuracy of biological network reconstruction tools from heterogeneous biological data via advanced machine learning technology with emphasis on networks and pathways relevant to protein production.
O2.2 To build a high-performance network reconstruction tool with capacity to process a large number of industrially useful microorganisms via parallel distributed algorithms.
O2.3 To build network reconstructions for protein production case studies to facilitate the analysis, interpretation and building predictive tools for protein production.

Progress: Advanced machine learning methods for predicting protein interactions and function.

A major focus of the project was the development of new machine learning methods for protein-protein interaction (PPI) prediction. We developed new methods for integrating several heterogeneous data sources for proteins using Multiple Kernel Learning (MKL), and methods for predicting the interaction strengths in a PPI network using tree-based and regularized-regression-based multi-output methods. In particular, we focussed on a cross-species setup where the PPI network for a new species or strain needs to be reconstructed using the known interaction data from related species. The methods have been incorporated in the BIOLEDGE bioinformatics platform developed in the project.
To facilitate functional annotation of transport proteins, we developed machine learning methods to automatically classify proteins against the Transport Classification Database (TCDB) hierarchical classification system. To this end we developed novel structured output prediction methods able to explicitly consider the hierarchical relationships between TCDB categories. The proposed classifier exploits state-of-the-art Multiple Kernel Learning (MKL) strategies to integrate a very large set of features extracted from up-to-date databases, and it is designed to be applicable to virtually any organism for the TCDB-wide and proteome-wide prediction of transporter categories. The overall classification performance of the model, roughly 85% in F1 score, is remarkably good considering the high number of functional classes (over 3400) in the TCDB.
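The core idea behind kernel-based data integration can be illustrated with a small sketch: each data source yields its own kernel (similarity) matrix over the same proteins, and the matrices are merged as a weighted sum. In actual MKL the weights are learned from data; here they are fixed by hand, and the feature values are hypothetical.

```python
def linear_kernel(rows):
    """Gram matrix of dot products between feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in rows] for x in rows]


def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices, with weights normalised to sum to 1."""
    if len(kernels) != len(weights):
        raise ValueError("one weight per kernel is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative to keep a valid kernel")
    total = sum(weights)
    n = len(kernels[0])
    return [
        [sum((w / total) * k[i][j] for k, w in zip(kernels, weights))
         for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical data sources describing the same three proteins.
sequence_features = [[1, 0], [0, 1], [1, 1]]
expression_features = [[2], [0], [1]]

k_sequence = linear_kernel(sequence_features)
k_expression = linear_kernel(expression_features)

# Fixed equal weights; an MKL method would learn these instead.
combined = combine_kernels([k_sequence, k_expression], [0.5, 0.5])
for row in combined:
    print(row)
```

A non-negative weighted sum of valid kernels is again a valid kernel, which is why the combined matrix can be passed to any kernel-based learner.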

Reconstruction and comparative analysis of biological networks for protein production

A prototype of the metabolic model reconstruction pipeline CoReCo (Comparative ReConstruction) had been published previously. In this project the prototype was developed into an industry-level tool by developing a reaction database and improving the algorithm, usability and parallelization of the pipeline. The CoReCo framework builds metabolic networks in parallel for a set of related species. By using the evolutionary relationships of the species, accurate enzyme predictions are achieved. CoReCo builds gapless metabolic networks, meaning that the resulting models are executable by network and flux analysis software. In addition, the models are described at the atom level, making them usable in conjunction with isotope tracing experiments. In the project, metabolic reconstructions for 57 fungal species were automatically generated using the CoReCo software, and subsequently curated using expert knowledge.
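What a "gap" means in a metabolic network can be shown with a simple check for metabolites that are consumed but never produced, or produced but never consumed. This sketch is not part of CoReCo: the toy network is hypothetical and, as a simplifying assumption, every reaction is treated as irreversible.

```python
# Hypothetical toy network: reaction name -> (substrates, products).
# Exchange reactions have an empty substrate or product list.
REACTIONS = {
    "uptake": ([], ["glucose"]),
    "glycolysis": (["glucose"], ["pyruvate"]),
    "side_reaction": (["pyruvate"], ["metabolite_x"]),
    "orphan_reaction": (["metabolite_y"], ["pyruvate"]),
    "biomass": (["pyruvate"], []),
}


def find_gaps(reactions):
    """Return (never_produced, never_consumed) metabolite sets."""
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed.update(substrates)
        produced.update(products)
    # A consumed metabolite with no source blocks its reaction;
    # a produced metabolite with no sink accumulates as a dead end.
    return consumed - produced, produced - consumed


never_produced, never_consumed = find_gaps(REACTIONS)
print("never produced:", sorted(never_produced))
print("never consumed:", sorted(never_consumed))
```

In a steady-state flux analysis, any reaction touching such a metabolite is forced to carry zero flux, which is why gap removal matters for executable models.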

Furthermore, a special emphasis was put to metabolic reconstruction of two industrially important production hosts, namely Komagataella (Pichia) pastoris and Trichoderma reesei. The reconstructed models were rigorously curated for removal of gaps and other inconsistencies. Thus, unlike many competing models, these are fully executable by metabolic network and flux analysis software, and are thus valuable tools for strain design and improvement.
T. reesei models were produced with CoReCo, evaluated and corrected, and the improvements were fed back to the CoReCo reaction bag. These mainly involved correction of bounds, i.e. directions of reactions. The CoReCo reaction bag includes bounds retrieved from the metabolic models that were used to build it. Typically, bounds derived from KEGG and MetaCyc needed correction based on thermodynamic calculations or literature. When a functional model capable of simulating T. reesei protein production in the BIOLEDGE cultivations was achieved, metabolic models for other fungal species were reconstructed with CoReCo. Their quality was assessed by simulating growth towards a simple yeast biomass.

For Komagataella (Pichia) pastoris, a consensus network was constructed from previously published models using supporting information from KEGG, MetaCyc, YMDB, and the literature. This consensus network is composed of 979 reactions and 1163 metabolites in 8 different compartments. This network was further curated to improve the fatty acid and iron metabolism. To arrive at more accurate predictions, the model uses information on the differences in biomass composition of K. pastoris under different aeration conditions in a context-dependent manner.

WP3 Model annotation and data mining

Objectives:

O3.1 To develop and implement text mining techniques to annotate the genome-wide models from industrial organisms.
O3.2 To bind text and data mining techniques to the domain of protein production and to implement knowledge extraction methods for the text and data mining techniques.
O3.3 To improve the quality of text and data mining techniques using domain specific knowledge.

Progress:

Task 3.1 Text Mining and model annotation

The goal for Task 3.1 was to develop algorithms for text mining to extract information from scientific texts. These algorithms had to be extended to deal with the use of the semantic model developed in WP5.
The text mining module was to be kept as modular as possible, so a substantial effort was devoted to maintaining its standalone character. The text mining module communicates with other parts of the BIOLEDGE system through an API which was designed during the first 18 months of the project. An overview of the architecture is presented in the Figure on the right side.
The following software and data components were used in this task:

• GATE (Generalized Architecture for Text Engineering). The platform provides basic natural language processing (NLP) components as well as a framework to sequentially connect those components into a text mining pipeline.
• Apache Lucene. Within Bioledge it serves mainly as a text search component that is used in order to retrieve the most relevant documents from a large document collection.
• During the development phase of the project a collection of 800K articles from PMC OA subset has been used as a text mining corpus. The corpus can be easily extended with new resources by updating the Lucene index.
• Several publicly accessible biological databases are used in the process of integrating a large Knowledge Base (KB) for the text mining component. These include KEGG, UniProt and Gene Ontology. The KB is publicly available using SPARQL at http://150.214.214.5/virtuoso-bioledge/sparql.
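As an illustration of how a SPARQL endpoint such as the one listed above is queried, the sketch below builds a request URL following the standard SPARQL protocol, in which the query text is passed in the `query` parameter. The query is deliberately schema-agnostic because the KB vocabulary is not described here; the `format` parameter is an assumption based on common Virtuoso usage, and the endpoint may no longer be reachable, so the sketch does not send the request.

```python
from urllib.parse import urlencode

ENDPOINT = "http://150.214.214.5/virtuoso-bioledge/sparql"

# Generic query returning any ten triples; it assumes nothing about the KB schema.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_request_url(endpoint, query):
    """Build a SPARQL protocol GET URL (the 'format' parameter is an assumption)."""
    params = urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    })
    return f"{endpoint}?{params}"


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could be fetched with any HTTP client, for example `urllib.request.urlopen(url)`, if the service is available.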

The methods used within the text mining system are:

• Basic NLP tasks (splitting text into tokens and sentences, annotating tokens with appropriate Part-of-Speech tags)
• Gazetteer lookup (a dictionary-style lookup is performed on a text in order to annotate names of things that appear in the text)
• Co-reference resolution (spotting different expressions that refer to the same entity; for example, if we identify the expression ‘Pichia pastoris’ as a reference to Pichia pastoris, we should also be able to identify ‘P. pastoris’ as such a reference)
• TGSP (an algorithm for collecting frequent terms from texts)
• Rule-based strategy (for 'special-case' names)
• Semantic disambiguation rules
• Annotation of numerical values
• Relationship extraction (rule based)
• Fact extraction (the annotation of organisms has been performed on the entire corpus to enable organism-based filtering for the widest possible array of search scenarios)
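Gazetteer lookup combined with a simple form of abbreviation handling can be sketched as below. This is a plain-Python illustration rather than the GATE components used in the project; the dictionary contents, the abbreviation rule (genus initial plus species) and the example sentence are assumptions.

```python
import re

# Hypothetical gazetteer of canonical organism names.
GAZETTEER = ["Pichia pastoris", "Trichoderma reesei", "Saccharomyces cerevisiae"]


def name_variants(full_name):
    """Return the full binomial name and its abbreviated form, e.g. 'P. pastoris'."""
    genus, species = full_name.split(" ", 1)
    return [full_name, f"{genus[0]}. {species}"]


def annotate(text, gazetteer=GAZETTEER):
    """Return sorted (start, end, canonical_name) tuples for every mention found."""
    annotations = []
    for canonical in gazetteer:
        for variant in name_variants(canonical):
            pattern = r"\b" + re.escape(variant)
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                # Both spellings are mapped to the same canonical entity.
                annotations.append((match.start(), match.end(), canonical))
    return sorted(annotations)


sentence = ("Pichia pastoris was grown on methanol. "
            "P. pastoris secreted the protein, unlike S. cerevisiae.")
annotations = annotate(sentence)
for start, end, canonical in annotations:
    print(sentence[start:end], "->", canonical)
```

A production system would also need to resolve ambiguous abbreviations, since different genera can share an initial; that is where disambiguation rules of the kind listed above come in.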

The TEXT MINING tool has been refined during the project both in terms of its quality and time-efficiency. A version of the TEXT MINING architecture has also been developed on a Hadoop framework, so that the document analysis can be performed in parallel. Research has also been done on possible applications of machine learning to the problems of disambiguation, relationship extraction and results filtering. The nature of the problems at hand seems to call for an unsupervised or semi-supervised approach that could benefit from the data stored in the Knowledge Base.

The development done includes the creation of a maintenance and deployment protocol for the Text Mining system. Automated procedures for KB creation, article repository update and indexing and metadata harvesting have been created.

Task 3.2 Data mining over RDF Data

We have implemented a group of pre-processing and analysis algorithms with the aim of 1) preparing the data and 2) extracting hidden information of interest for the research. First, we implemented a group of statistical tests to pre-filter the data, removing the variables with lower influence. These include the Kruskal-Wallis test, an extension of the Mann-Whitney U test to more than two groups (it is the non-parametric equivalent of ANOVA), and a correlation filter, which selects a number of variables with the highest degree of correlation with a numerical reference variable. A set of data mining tools has been studied and implemented:

• Variable selection tool based on the ReliefF algorithm.
• Association rules. This tool, based on Apriori algorithm, finds rules among the variables of the data set ordered by two parameters called confidence and support.
• Decision trees. Generates tree-shaped predictive structures and obtains the best groups of variables to classify the instances.
• Clustering algorithms: K-Means and K-Prototypes, an extension of K-Means that allows the usage of categorical variables.
• Algorithms for dimensionality reduction like NIPALS-PCA and Landmark Isomap have been also implemented.
• Logo algorithm based tool for variable selection.
• Partitional Hierarchical clustering. Generates a tree-like hierarchical structure between the instances of the dataset.
• Biclustering algorithms: Find biclusters, submatrices in the dataset matrix that shows unique patterns. 4 methods have been implemented: Qubic, BiMax, BCCA and ISA.
• DSM-SOM, a clustering algorithm based on a SOM network. This clustering method, which is faster than other agglomerative algorithms, is not restricted to convex data and, unlike other two level clustering algorithms such as SOM+K-Means, detects automatically the number of clusters.
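To make the two ranking parameters of the association rules tool concrete, the following minimal sketch computes support and confidence for single-antecedent rules. It is not the project's implementation: it enumerates item pairs by brute force, whereas Apriori prunes candidate itemsets level by level. The transactions, item names and thresholds are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) that pass both thresholds.

    Brute force over item pairs, for illustration only.
    """
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            c = s / support({lhs}, transactions)
            if s >= min_support and c >= min_confidence:
                found.append((lhs, rhs, s, c))
    # Order by confidence, then by support, highest first.
    return sorted(found, key=lambda r: (r[3], r[2]), reverse=True)


# Hypothetical discretised observations, one set of flags per cultivation.
transactions = [
    frozenset({"high_pH", "high_yield"}),
    frozenset({"high_pH", "high_yield", "low_temp"}),
    frozenset({"high_pH", "low_temp"}),
    frozenset({"low_temp"}),
]
result = rules(transactions)
```

With these toy data only the rule high_yield -> high_pH passes both thresholds (support 0.5, confidence 1.0); the reverse rule has the same support but a confidence of only about 0.67.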

A new tool has been developed that provides a combined score of variable importance by applying data mining and text mining tools sequentially. It works as follows: 1) it takes the output from a feature selection tool (the list of selected variables); 2) it searches the text mining results for each selected variable together with either the class variable name or the class variable elements; 3) it provides the data mining score for each variable, the text mining score for each pair of terms {variable, class variable} or {variable, class variable group}, and a combination of those scores (averaged value).
The Combined TM-DM scoring tool must be preceded by a filtering tool, combined in a workflow (e.g. [SYS] [FILTER+DM+TM] Logo + Combined TM-DM scoring). The tool can be used to analyze together variables (previously screened by data mining) and terms that define groups of the target variable.
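The averaging step can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: both scores are assumed to be already scaled to a common range, a variable without a text mining hit is assumed to score 0, and all variable names and values are hypothetical.

```python
def combined_scores(dm_scores, tm_scores):
    """Average the data mining and text mining score of each selected variable.

    dm_scores: variable -> score from the feature selection tool.
    tm_scores: variable -> score for the pair {variable, class variable}.
    Variables without a text mining score are assumed to score 0.0.
    """
    combined = {}
    for variable, dm in dm_scores.items():
        tm = tm_scores.get(variable, 0.0)
        combined[variable] = {"dm": dm, "tm": tm, "combined": (dm + tm) / 2}
    return combined


# Hypothetical scores for three variables kept by a filtering tool.
dm_scores = {"var_a": 0.75, "var_b": 0.5, "var_c": 0.25}
tm_scores = {"var_a": 0.25, "var_b": 1.0}

scores = combined_scores(dm_scores, tm_scores)
# Rank variables by the combined score, highest first.
ranking = sorted(scores, key=lambda v: scores[v]["combined"], reverse=True)
```

Here var_b ranks first because strong literature support compensates for a moderate data mining score, which is the behaviour a combined score is meant to capture.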

WP 4 High performance computing platform

Objectives:

O4.1 Study on the requirements set for High-Performance Computing in bioinformatics. Focus on modelling and data mining applications within the scope of BIOLEDGE project.
O4.2 Evaluation of technical feasibility and prioritization of the requirements identified as a part of work related to O2.1. Roadmapping requirements on BIOLEDGE schedule, according to this work and resources available in the WP.
O4.3 Design, implementation and testing of a High-Performance Computing platform to provide integration for BIOLEDGE applications on Techila technology base. Efforts aligned with the work in O4.2.

Progress:

The work in WP4 started with the identification of bottlenecks of high-performance computing in biotechnology. Interviews were performed using direct and community interview methodologies. The results were then compared with those of a study conducted by Techila Technologies and CERN in 2009. The results and requirements identified during this work were presented in “Research report - Challenges of High-Performance Computing in Biotechnology” (2013).

A leading requirement that Techila Technologies identified in the research was that users want to focus their energy on biotechnology and not on becoming professional computer scientists. The research showed that ease of deployment, ease of use and avoidance of vendor lock-in were high on the community’s agenda. Cloud-based processing and data were seen as interesting topics and possible enablers of higher quality research and more efficient international collaboration activities.

The three dominant application areas in the discussions about computational pain-points in biotechnology were experimentation, text mining and machine learning techniques, and algorithmic data analysis. A high-performance computing platform that is easy to use can benefit the development of biotechnological algorithms and the validation of models in these areas. Ease of use allows users to devote undistracted effort to biotechnology, which can shorten the time-to-market of biotechnological innovations.

In order to process the requirements efficiently, Techila Technologies changed the strategy used in the set-up of the BIOLEDGE high-performance computing platform from Waterfall to Agile methods. This enabled the required feedback loop with project partners and the extended community, and a faster set-up workflow. The set-up was executed using a bottom-up build approach and proof-of-concept work in co-operation with Aalto, VTT and UMA.

The first milestone consisted of implementation of deployment tools, which can support automated deployment of the high-performance computing platform in the cloud or on-premises using in-house servers. Techila’s ability to remove the vendor lock-in and enable integration of computing capacity from various services was acknowledged by media as a unique solution and as a possible enabler of future competitive advantages.

The second milestone consisted of development and documentation of integration interfaces (APIs) to support the required languages and environments. Interfaces to support packaging of the platform into 3rd party applications (ISV applications) were also developed. Packaging the high-performance computing features in applications can support further and wider exploitation of the benefits of the high-performance computing platform among users who want to focus their energy on biotechnological innovation instead of training themselves to become professional computer scientists.
After the set-up of the platform, Techila, Aalto and VTT organized, as a part of WP8, a training course on high performance computing, advanced modelling and systems biology for biotechnology (2014). This training included industry application demos which used the BIOLEDGE high-performance computing platform for metabolic modelling tasks and machine learning algorithms (WP2, WP6).

WP5. Integrative bioinformatics platform

Objectives:

O5.1 To produce a Bioinformatics Platform (called BIOLEDGE) that enables knowledge extraction from biological repositories focused on the development of new biotechnical products, which will be available through the internet to the scientific community.
O5.2 To develop and implement the semantic model to represent the scientific information managed in the project.
O5.3 To develop and implement information extraction methods from biological databases.
O5.4 To enable protein production related tools taking advantage of the platform’s analysis engine.
O5.5 To develop a new environment that allows the integration of new analysis tools produced beyond the context of this project together with existing biological databases.

Progress:

The main objective of WP5 is the development of a Web Platform to enable knowledge extraction from biological repositories, with specific focus on protein production. This platform, which has to integrate the algorithms and tools developed in other WPs, should be able to analyse the data produced in the consortium, but also be flexible enough to accept other kinds of data.

The design and development of BIOLEDGE platform

Based on a preliminary detailed requirement analysis, we designed the architecture of the platform: the web server hosts the website and the relational database, while the processes launched by the users are controlled by a local application installed on the processing server. The output from the local application is interpreted by the web platform and reported to the user in an intuitive manner.
We implemented a relational database using SQL 2008 technology.
We created a general web interface and then we implemented modules and functionalities for authentication and authorization, user management, data extraction and management, datasets creation and management, workflows definition, analysis and results visualization.
We defined a dynamic variable manager to allow the users to define, manage and use their own variables. This functionality makes the data input more flexible, easier, and more secure.

Development of the semantic Model

The semantic model to be used in the different components of the project architecture was developed. The model consists of a core Knowledge Base (KB) later extended with auxiliary models that provide specialised structures for document metadata and text annotations.

Querying services

A public RDF database endpoint has been set up at http://150.214.214.5/virtuoso-bioledge/sparql. It can be used to directly access both the KB and the annotations produced in WP3 through SPARQL queries.
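As a hedged illustration of how such an endpoint is typically addressed, the sketch below prepares a SPARQL Protocol GET request using only the Python standard library. The endpoint URL is the one given above; whether it is still reachable is not known, so the request is built but not sent. The query is a generic placeholder and does not use the actual KB vocabulary, which is not described in this report.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://150.214.214.5/virtuoso-bioledge/sparql"

# Generic placeholder query: list ten arbitrary triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a GET request that passes the query in the `query` parameter."""
    url = endpoint + "?" + urlencode({"query": query.strip()})
    # Ask for the standard JSON serialisation of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)

# To execute it (network access required):
#   from urllib.request import urlopen
#   import json
#   with urlopen(request, timeout=30) as response:
#       bindings = json.load(response)["results"]["bindings"]
```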
Additionally, a tool for semantic relatedness enhanced fact extraction was developed and tested. The tool is focused on extracting sentences (or sentence sequences) related to the information needs specified by the users. The users can specify both rigid and soft restrictions on their query, as well as filter the results with respect to the TM annotations produced for WP3.

Integration of data and text mining tools

A total of 16 data mining tools produced in WP3 were integrated: ReliefF, Logo variable selection, PCA using NIPALS, Landmark ISOMAP, Kruskal-Wallis filter, Correlation filter, Hierarchical clustering, DSM-SOM, Qubic biclustering, BiMax biclustering, BCCA biclustering, ISA biclustering, K-prototypes, Partitional Hierarchical clustering, Decision Trees and DM Apriori. To that end, the system was made flexible enough to allow fast integration of new tools without extra programming effort. We defined a document in XML format that describes the tools available and provides the web interface with all the information needed to configure and run them.
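The idea of a declarative tool description can be illustrated with the sketch below. The actual XML schema used by the platform is not given in this report, so every element name, attribute name and parameter here is a hypothetical stand-in; the point is only that a new tool becomes available by adding an entry to the document, without changing code.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool description document (not the platform's real schema).
TOOLS_XML = """
<tools>
  <tool id="kruskal_wallis" category="filter">
    <name>Kruskal-Wallis filter</name>
    <param name="p_value" type="float" default="0.05"/>
  </tool>
  <tool id="k_prototypes" category="clustering">
    <name>K-prototypes</name>
    <param name="k" type="int" default="3"/>
    <param name="max_iter" type="int" default="100"/>
  </tool>
</tools>
"""

# Converters for the parameter types allowed in the document.
CASTS = {"int": int, "float": float, "str": str}


def load_tools(xml_text):
    """Parse the document into a registry: tool id -> name, category, params."""
    registry = {}
    for tool in ET.fromstring(xml_text.strip()).findall("tool"):
        params = {
            p.get("name"): CASTS[p.get("type")](p.get("default"))
            for p in tool.findall("param")
        }
        registry[tool.get("id")] = {
            "name": tool.findtext("name"),
            "category": tool.get("category"),
            "params": params,
        }
    return registry


registry = load_tools(TOOLS_XML)
```

A web interface could iterate over such a registry to render one configuration form per tool, using the typed defaults as initial values.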

Besides the manual creation of custom workflows made of filtering, data mining and text mining tools, a list of preconfigured workflows (WFs) was added to make it easier to launch new analyses. In total, 41 System Workflows were published.
To enable easy interpretation of the results, we developed a visualization framework that includes XY plots, heatmaps, trees, interaction networks and dendrograms, among other kinds of graphs.

Text Mining tools were integrated in the web platform in two different ways:
• A module was produced to allow an administrator to define Text Mining Queries. The user then configures and launches the query and accesses the results.
• Analysis tools based on the Text Mining functionalities were developed: 1) a search through key variables and 2) a search of related variables found through the algorithm Apriori. The system allows running them individually against datasets or including them in a workflow.

Both TM Queries and TM Analysis Tools produce a list of results that includes an RDF model. The platform transforms the RDF into XGMML and visualises it through Cytoscape libraries.
Three kinds of functionalities to exploit the resulting models were developed: 1) a data viewer, to show information on the selected nodes; 2) three statistical functions; and 3) a tool that allows running SPARQL queries against the model. Graphical functionalities to manipulate the graph layout were also developed.
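The RDF-to-XGMML step can be pictured with the following minimal sketch, which is not the platform's converter. It assumes the RDF model has already been reduced to plain (subject, predicate, object) tuples, maps each resource to an XGMML node and each statement to a labelled edge, and uses illustrative triples rather than real text mining output.

```python
import xml.etree.ElementTree as ET

XGMML_NS = "http://www.cs.rpi.edu/XGMML"


def triples_to_xgmml(triples, label="TM result"):
    """Serialise (subject, predicate, object) triples as an XGMML graph."""
    graph = ET.Element("graph", {"xmlns": XGMML_NS, "label": label, "directed": "1"})
    node_ids = {}
    # First pass: one node per distinct subject or object.
    for s, _, o in triples:
        for resource in (s, o):
            if resource not in node_ids:
                node_ids[resource] = str(len(node_ids) + 1)
                ET.SubElement(graph, "node",
                              {"id": node_ids[resource], "label": resource})
    # Second pass: one edge per triple, labelled with the predicate.
    for s, p, o in triples:
        ET.SubElement(graph, "edge",
                      {"source": node_ids[s], "target": node_ids[o], "label": p})
    return ET.tostring(graph, encoding="unicode")


# Illustrative triples; in a real RDF model the terms would be URIs.
triples = [
    ("IRE1", "activates", "HAC1"),
    ("HAC1", "regulates", "KAR2"),
]
xgmml = triples_to_xgmml(triples)
```

The resulting string is a small directed graph document of the kind that Cytoscape-based viewers can load.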

In parallel, the Protein-Protein Interaction (PPI) tools produced in WP2 were also integrated in the web platform. The three machine learning algorithms for PPI prediction were made available, and a graphical output was developed to make the results easy to interpret.
The access to TM results is provided through a public API, which provides an application-oriented view of the RDF data. The API and its documentation can be accessed at http://150.214.214.3/KhaosAPI.

An example application for the exploration of the TM results that is based on the API can be accessed at http://150.214.214.3/bioledge.

Combined TM-DM scoring

We also developed a tool that provides a combined score of variable importance as a result of applying sequentially data and text mining tools for feature selection. For the best variables according to their capacity of classifying samples in the groups defined by the user, a graphical output shows the scores obtained by the data mining tool used, the text mining one and the combination of those scores.

Validation, testing and installation

Datasets produced in WP1 were used for the validation of the individual tools. Moreover, results produced by the platform were compared with those obtained by other means and interpreted in collaboration with experts. An installation procedure was defined. The platform was then installed on partners’ servers on demand. The software was also installed on a public server.

WP6 Predictive models for protein production

Objectives:

O6.1 To produce an integrated logical model of the pathways of protein synthesis, the unfolded protein response (UPR), and protein secretion.
O6.2 To develop an effective algorithm for the design of proteins with improved stability.
O6.3 To construct an integrative model of flux from metabolites to recombinant protein that will permit the rational design of customised growth media for the microbial production of specific recombinant proteins.

Progress:

Task 6.1 Construction of an integrated model of protein synthesis, secretion, and the UPR
Task 6.2 Transformation of the integrated model into a multi-valued logical format
Task 6.3 Improvement of the prediction and design of the stability of protein products
Task 6.4 Development of computational tools for the design of growth media optimised for the production of specific proteins by different industrial microbes.

One of the tasks in WP6 was the construction of an integrated model of protein synthesis, secretion, and the UPR, as well as its transformation into a multi-valued logical format. The first step in this process was to identify the members of the protein machinery based on the Gene Ontology structure. The gene pool for the model was formed using the Gene Ontology terms of genes. In total, the functionalities of 1932 genes in S. cerevisiae, combined into protein complexes whenever available, formed 1292 protein or protein complex entities as the starting point of the model.

A structural model was constructed to represent the natural flow of material building blocks through the events of protein biosynthesis, modification, translocation, secretion or degradation, along with biomass formation. The model traces the route from the uptake of amino acids into the system towards recombinant and native protein production and/or secretion and degradation routes. The Petri net formalism was used to analyse this network model. In the directed bipartite graph representation of the system, the nodes are either transitions (events) or places (conditions). Pre-places are connected to transitions, and transitions to post-places, through arcs. The first step in the process was to establish the Petri net structure of the protein machinery for recombinant protein synthesis. For this purpose, the weight and direction of every arc between a transition and the places to which it is connected were determined. Human Lysozyme was selected as the recombinant protein for the proof-of-principle study. These data allowed the logical form of the model to be constructed.
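The Petri net vocabulary used above (places, transitions, weighted arcs, markings) can be made concrete with a toy example. The sketch below is not the project's model: the four places, three transitions and all arc weights are hypothetical and stand in for the 1292 entities of the real network. It implements only the standard firing rule, in which a transition is enabled when each pre-place holds at least as many tokens as the weight of the connecting arc.

```python
class PetriNet:
    """Minimal place/transition net with weighted arcs."""

    def __init__(self, marking):
        self.marking = dict(marking)   # place -> number of tokens
        self.transitions = {}          # name -> (pre-place arcs, post-place arcs)

    def add_transition(self, name, inputs, outputs):
        """`inputs` and `outputs` map a place to the weight of its arc."""
        self.transitions[name] = (dict(inputs), dict(outputs))

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= w for p, w in inputs.items())

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place, weight in inputs.items():
            self.marking[place] = self.marking.get(place, 0) - weight
        for place, weight in outputs.items():
            self.marking[place] = self.marking.get(place, 0) + weight


# Hypothetical toy net: amino acids -> nascent -> folded -> secreted protein.
net = PetriNet({"amino_acids": 10, "nascent": 0, "folded": 0, "secreted": 0})
net.add_transition("translation", {"amino_acids": 4}, {"nascent": 1})
net.add_transition("folding", {"nascent": 1}, {"folded": 1})
net.add_transition("secretion", {"folded": 1}, {"secreted": 1})

for step in ("translation", "translation", "folding", "secretion"):
    net.fire(step)
```

After these four firings only two amino acid tokens remain, so translation is no longer enabled; in this small net the external amino acid supply limits how many protein tokens can be produced.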

The next step was the determination of the initial markings of the places, in order to be able to carry out a structural analysis using the constructed net. The input of amino acids was constrained by the amount of amino acids that would be externally provided to the system, and the initial markings within the boundaries of the net were determined using data available in the literature. The constructed model was then used to evaluate the competing interests between biomass formation and recombinant protein production, and to identify the bottlenecks in the recombinant protein production sub-network by keeping track of the native protein content of the system as a measure of the protein constituent of the biomass. To this end, the GLPK (GNU Linear Programming Kit) package was used from Python (running in a Unix environment) to handle the optimization tasks. The first test case investigated the minimum number of amino acids required for the sum of the markings in the functional active protein places to double. This analysis highlighted the relatively low requirement for tryptophan, methionine, histidine and cysteine. In earlier analyses, methionine and histidine had been highlighted among the amino acids that show substantial variability in biomass composition, which underlines their importance in recombinant protein production as well as in biomass production.

A further analysis of the structure was carried out using the same objective function in order to investigate which proteins were affected the most during this increase. Several proteins were identified as being affected more extensively, including Sis1p (YNL007c), Avt2p (YEL064c), and Shp1p (YBL058W). The second test case investigated how the network would be affected by the maximisation of the overall number of proteins in the net with a given amount of amino acids. In this analysis, Rer2p (YBR002c), Sly1p (YDR189w), Gin4p (YDR507c), Fhn1p (YGR131w), Scs7p (YMR272c) and Lsp1p (YPL004c) were highlighted as being produced more with a fixed upper limit on the number of proteins provided into the protein machinery. Following these analyses, the network structure was extended by integrating regulatory information into the model.

Part of the work in WP6 involved the development of a protocol for the design of protein products with reduced aggregation propensity. The route adopted to achieve this goal was to improve the stability of the recombinant protein product. With improved stability, more of the protein product can be secreted, which reduces the burden on the host cell by relieving the stress caused by a highly active unfolded protein response pathway. Furthermore, improved stability of the secreted protein product could also serve as a proxy for more of the active protein product being available for commercial use. To demonstrate the feasibility of the protocol, we used Human Lysozyme (HuLy) for the analyses.

The first step in developing the strategy was to configure the basic roadmap. The primary structure of the protein was mutated in silico by replacing every individual amino acid in the peptide sequence with each of the other 19 amino acids, in an SNP-analysis-like fashion. The change in the Gibbs free energy of the new configuration was calculated and compared to that of the original configuration. Any mutation that reduced it was identified as a candidate with improved stability. As a preliminary step, the structure file was repaired and the change in the Gibbs free energy of the original configuration was lowered further prior to any modifications. The SNP analysis was initially carried out in a high-throughput manner, using FoldX, in order to scan the possible solution space of changes in Gibbs free energy in response to mutational variations. Any mutation leading to improvements (i.e. lower values) in the change in Gibbs free energy under these two different sets of conditions was selected as a candidate for the next step in the SNP analysis. The candidate mutations were further evaluated using two additional tools, CUPSAT and Rosetta Design. The final list of candidates comprised the specific mutations that were suggested by all three of the algorithms employed. This list comprised 7 mutations out of 2470 possible candidates: Y45F, R50G, H78N, N88D, A90K, V99I and A111M. Although it is not completely exhaustive, the protocol provides a rational and structured in silico approach to the otherwise labour-intensive and costly endeavour of improving the stability of a protein structure.
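The enumeration and consensus steps of this protocol can be sketched as follows. The sketch does not call FoldX, CUPSAT or Rosetta Design; the per-tool candidate sets and the free energy values are hypothetical, and the five-residue peptide is a toy input rather than the HuLy sequence. It also shows where the candidate count comes from: 19 substitutions at each position, so a 130-residue chain gives 19 × 130 = 2470 single-point mutants.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_point_mutations(sequence):
    """Yield every single substitution as e.g. 'K1A' (wild type, position, mutant)."""
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{wild_type}{position}{mutant}"


def stabilising(ddg_by_mutation):
    """Keep mutations whose change in folding free energy is negative."""
    return {m for m, ddg in ddg_by_mutation.items() if ddg < 0}


def consensus(*candidate_sets):
    """Mutations proposed by every tool."""
    return set.intersection(*map(set, candidate_sets))


toy_peptide = "KVFER"  # illustrative input only
mutants = list(single_point_mutations(toy_peptide))

# Hypothetical outputs of three independent stability predictors.
tool_a = stabilising({"K1A": 0.8, "V2I": -0.6, "F3Y": -0.2})
tool_b = {"V2I", "F3Y", "R5K"}
tool_c = {"V2I", "E4D", "F3Y"}
final_candidates = consensus(tool_a, tool_b, tool_c)
```

Requiring agreement between all tools trades recall for precision, which fits the stated aim of reducing the number of variants that need to be tested experimentally.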

The final task addressed in WP6 was to develop computational tools that will enable industry to design environmental parameter configurations and media feeding regimes that optimise metabolic flux towards their desired protein products. We employed Komagataella (Pichia) pastoris expression systems as the host for recombinant protein production to develop and test the efficacy of the tool. Human Lysozyme, expressed under the strong methanol-inducible alcohol oxidase (AOX) promoter, was selected as the recombinant protein in this proof-of-principle study owing to its desirable properties as a model protein. The first phase in the development of the environmental optimisation tool was the identification of the biological objectives to be achieved. For our test case, we employed a multi-parametric objective with inter-dependent parameters. The next stage in the process was the selection of the parameters that were considered to contribute substantially to these objectives. We selected pH and 8 medium constituents as important for achieving the objectives set in the previous phase. A feasible range for each parameter was determined. A genetic algorithm, which is based on the evolutionary ideas of natural selection and genetics, was employed as the adaptive heuristic search algorithm to conduct the optimization study. The algorithm uses a population of possible solutions to a problem to explore the solution space; over generations, subsequent populations become fitter and therefore more adapted to their environment, as dictated by the objective function.

The initial step of the algorithm was concerned with the generation of an initial population of solutions, which provided the randomly generated environmental parameters from a provided range of values. The experiments were conducted for each of these populations in triplicate, and the fitness of each population, as defined by the objective function, was determined. The fitness values were evaluated and the best performing individuals were selected. The population values for the best performing individuals were mated, random mutations were introduced and a new generation of populations was created. The fitness was evaluated and the procedure was repeated until a satisfactory convergence was observed in the objectives, which were represented by the convergence of the productivity and the protein activity values. Convergence in the performance metrics was observed after three generations of experiments for the test case employed. The optimised medium composition was fine-tuned via population profiling to eliminate problems with precipitation.
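The loop described above (random initial population, fitness evaluation, selection of the best individuals, mating, mutation) can be sketched as follows. This is not the project's MATLAB tool. In the project the fitness came from triplicate cultivation experiments; here a made-up analytic function stands in for it, and the parameter ranges, population size, number of parents kept and mutation rate are all assumptions chosen for illustration.

```python
import random


def run_ga(fitness, bounds, pop_size=12, generations=3, elite=4,
           mutation_rate=0.2, seed=0):
    """Small genetic algorithm with elitism, uniform crossover and resampling mutation."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        parents = ranked[:elite]          # best performers survive unchanged
        children = []
        while len(children) < pop_size - elite:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each gene is taken from one of the parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: occasionally resample a gene within its feasible range.
            child = [rng.uniform(lo, hi) if rng.random() < mutation_rate else gene
                     for gene, (lo, hi) in zip(child, bounds)]
            children.append(child)
        population = parents + children
    return max(population, key=fitness), best_per_generation


# Hypothetical ranges: pH followed by 8 medium constituents (arbitrary units).
BOUNDS = [(3.0, 7.0)] + [(0.0, 10.0)] * 8
TARGET = [5.0] + [4.0] * 8  # made-up optimum of the stand-in objective


def toy_fitness(individual):
    """Stand-in for the measured objective: higher is better, maximum at TARGET."""
    return -sum((x - t) ** 2 for x, t in zip(individual, TARGET))


best, history = run_ga(toy_fitness, BOUNDS)
```

Because the selected parents are carried over unchanged, the best fitness in `history` cannot decrease from one generation to the next. When every evaluation is a wet-lab experiment, the population size and the number of generations are bounded by experimental capacity, which is consistent with the three generations reported above.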

The new recipe was shown to boost productivity and the activity of the recombinantly expressed protein more than any other medium reported for K. pastoris. The optimized conditions were also demonstrated to yield reproducible results in scale-up studies. The tool was developed in a commercially available language, MATLAB. It was also made available, through a graphical user interface under free licensing, in a format that can readily be used by industry without any preliminary requirements regarding programming languages or any commercial obligations.

WP7 Awareness and dissemination

Objectives:

O7.1 To attain a high level of public awareness of BIOLEDGE activities and discoveries and of the relevance to protein production research and industry.
O7.2 To protect BIOLEDGE intellectual property.
O7.3 To maximize exploitation of BIOLEDGE discoveries in the industry.

Progress:

The main dissemination activities carried out during the project include the maintenance of the project website, the publication of scientific articles, and oral and poster presentations at international meetings and congresses. Dissemination has been carried out according to the dissemination plan. The dissemination activities under WP7 are discussed in detail in Section 1.4.2.

WP8 Training

Objectives:

Exploiting the conceptual advances and technological developments created by BIOLEDGE, the objective of the training is to prepare the young European researchers in the early phase of their career to become the interdisciplinary researchers of tomorrow. The research that BIOLEDGE proposes combines different fields, and is highly challenging and multifaceted. It is thus essential that its researchers have a grasp of a wide range of theoretical and technical expertise from different disciplines. In BIOLEDGE, these include bioinformatics, computational biology, high-performance computing, metabolic engineering and protein production. Our objective is to train the young researchers in as many of these disciplines as possible through the network.

Progress:

As scheduled in the work plan for the BIOLEDGE project, three expert training courses have been carried out:

1. Applied bioinformatics in biotechnology industry, 7-8 November 2012, Bilbao, Spain.
2. Modelling for industrial biotechnology, 10 December 2013, Cambridge, UK.
3. Metabolic engineering and synthetic biology in industrial biotechnology, 4 September 2014, Espoo, Finland.

In addition, an innovation-related training workshop, “Innovation and IPR in Industrial Biotechnology”, was held in two parts. The first part was held in September 2014 in Espoo, Finland, and the second part in February 2015 in Malaga, Spain.

WP9 Management activities

Objectives:

O9.1 Provide overall scientific management and coordination of the work programme, including the timely implementation of the work plan, networking and achievement of scientific goals to ensure the overall smooth operation of the project.
O9.2 Provide the overall coordination of all financial, legal, administrative and contractual requirements within the contract, including audit certificates and the maintenance of the consortium agreement.
O9.3 Oversee information and knowledge management, including dissemination, Intellectual Property Rights and exploitation.
O9.4 Oversee implementation and execution of training activities.
O9.5 Oversee risk assessment and contingency plans.
O9.6 Oversee gender, ethical and wider societal issues.
O9.7 Provide communication with the European Commission.

Progress:

VTT’s management team for the BIOLEDGE project, composed of the project coordinator, a project manager and a financial officer, has held monthly meetings throughout the project to discuss and deal with project finances and required actions. At the beginning of the project, the project coordinator coordinated the preparation of the consortium agreement.
The progress of the project against the work plan has been followed closely. Four consortium meetings have been organised, and BIOLEDGE Steering Group and General Assembly meetings have been held in association with them. In addition, three training courses have been organized as planned in the work description for the BIOLEDGE project. The coordinator has also overseen the activities of the Foreground evaluation committee (chaired by NorayBio).
The management team has generated five six-month progress reports, three periodic reports and a final report. In addition, the coordinator has submitted all the planned Deliverables (44) to the EU-project portal.

Potential Impact:
Impact

The BIOLEDGE project has resulted in diverse impacts on the competitiveness of the participating entities. Significant advances have been made in developing mathematical modeling tools for:

• Cultivation and genome wide data integration and interpretation (BIOLEDGE Platform hosted by NorayBio, UCAM CLUSTERnGO)
• Scientific publication text mining to support biotechnology research (UMA and integrated into BIOLEDGE Platform)
• Cultivation condition optimization (CAM Optimus (UCAM and Aalto))
• Intracellular network prediction (Aalto & VTT: CoReCo pipeline & Protein-Protein Interaction prediction (integrated into BIOLEDGE Platform))
• High performance computing platform for biological modelling (Techila)

In addition, for the first time a logical model of protein production, a metabolic model of Trichoderma reesei and a consensus metabolic model of Komagataella pastoris were developed. Altogether, these enable faster and better strain design and cultivation condition development of industrially relevant microbes. The mathematical tools developed are directly applicable even for previously poorly characterized organisms. This is likely to be a great asset in the future when previously underutilized microbes are taken for industrial use. Importantly, the BIOLEDGE project resulted in a novel tool (CAM Optimus (UCAM and Aalto)) and cultivation condition development for a participating SME c-LEcta and in technology development opportunities and exploitation possibilities for SMEs NorayBio and Techila. Thus the integration of the SMEs into the project’s R&D and especially into the technology transfer and demonstration activities resulted in fluent knowledge and innovation transfer from R&D to industrial applications providing the SMEs competitive advantages.

The development and use of Industrial Biotechnology is essential to the future competitiveness of European industry and provides a sound technological base for the sustainable society of the future. By the year 2025, an increasing number of chemicals and materials will be produced using biotechnology in one or more of the processing steps. Biotechnological processes will be used to produce chemicals and materials which are hard or impossible to produce conventionally, or to make existing products in a more efficient way.

To identify novel enzymes and micro-organisms which will provide tomorrow’s new products and improved processes is one of the main research areas. European companies already produce around 70% of the world’s industrial enzymes, and have a well-established research base. However, there is an increasing demand for new enzymes as novel biotechnological processes are developed and refined.

Rapid progress has been made in the techniques and equipment for DNA sequencing, enabling relatively fast mapping of microbial genomes. The vast amount of information generated in this way has to be stored, organised, indexed, and analysed. However, the data on candidate genes is of no use until it has been successfully mined and translated into actual knowledge. Biological data across the diverse sources should not be viewed statically. Data mining methods may detect patterns in data and retrieve relevant information, but they may not predict the outcomes of biological processes, e.g. protein production in a specific industrial microorganism. For that, models are needed that rely on measured (e.g. genome-wide) data and, in addition, accurately estimate the dependencies or relationships between the different factors, such as specific proteins or metabolites, involved in the protein production process. The information across the existing diverse data sources can thus be put to use only if validated models exist to address specific biological questions. For this reason, BIOLEDGE combines the bioinformatics platform development with the construction of genome-wide as well as dynamic models relevant to the protein production process.

In this context, the BIOLEDGE bioinformatics platform supports integration and mining of heterogeneous data related to industrial biotechnology. The BIOLEDGE Web Platform hosted by NorayBio is an innovative tool to enable knowledge extraction from biological repositories, with specific focus on protein production. The platform helps to close the gap between data generation and its analysis and successful exploitation, addressing the real need for new methods to analyse this data much faster.

The platform will be a valuable new tool for any kind of study that involves data such as transcriptomics, proteomics and metabolomics data, together with metadata such as phenotype or almost any other kind of variable. In that sense, the number of research groups that could benefit from the platform is very large. The platform could thus be of high interest to the scientific community, and it aims to become a reference in the study of biotechnological processes.
Text mining results from the BIOLEDGE project can have an impact on daily biotechnological research, as these techniques can filter large amounts of text and extract relevant information without human intervention. This will reduce the time needed to discover novel studies in a given area of interest. These results will ultimately help the protein production industry by reducing the initial analysis needed for a given species.

In parallel, the CoReCo framework developed for metabolic model reconstruction is expected to speed up the optimization of production strains towards desired products. In this way the tools lower the hurdle for the biotechnology industry to make use of genome-scale data in its R&D operations. Furthermore, the new machine learning tools developed for protein-protein interaction and transport function prediction improve the accuracy of network reconstructions and can thus provide further gains in model development by reducing the amount of manual curation needed.

As a result of the BIOLEDGE project, the participating SME Techila was able to add features to its high performance computing platform that address the pain points and respond to the requirements reported by users in biotechnology. Ease of deployment, ease of use, and the absence of hardware or cloud vendor lock-in enable “HPC To Every Desk” scenarios. By enabling cloud-based processing, high-performance computing also becomes accessible to organizations that have not been able to afford traditional server-based investments. The results make high-performance computing viable also for those users who need faster computing in their biotechnological innovation processes but do not want to spend time learning computer science.

The models and tools developed in the BIOLEDGE project all aim to guide research along a more targeted path by first exploring the possible alternatives and options in the solution space in silico, before cost- and labour-intensive work is conducted. Such predictive models, pipelines, and design tools reduce the workforce and financial requirements for achieving set research goals and suggest more rapid routes to overcome the accompanying challenges. The field of recombinant protein production concerns the manufacture of goods with immediate use as therapeutics, bioinsecticides, bioremediators, diagnostic kits, or other useful pharmaceutical chemicals in health, agriculture and environmental sciences. Therefore, reductions in process times or production costs, which could potentially be delivered by the use of in silico tools, would have a direct impact on the quality of the service that society receives. We believe these tools establish a substantial and robust foundation on which industry can build in order to optimise strain and process design as well as the execution and control of the manufacturing process itself. In fact, in addition to the molecular biological tools that were developed at the participating SME c-LEcta, the modelling tools developed within the BIOLEDGE project also found use in the design of their cultivation regimes.

The BIOLEDGE project has contributed to the implementation of the following European policies and strategies:

1. EU strategy for Key Enabling Technologies
• The BIOLEDGE project has developed enabling methods to improve biotechnological production of proteins.

Main dissemination activities

The main dissemination activities carried out during the BIOLEDGE project include the maintenance of the project website, publication of scientific articles, and presentations at international meetings and congresses. At this stage, nine peer-reviewed scientific articles have been published. A further 11 manuscripts have been submitted for publication or are currently in preparation. In addition, ten poster presentations and 18 oral presentations have been given at scientific meetings.

For the BIOLEDGE platform a public Resource Description Framework (RDF) database endpoint has been set up at http://150.214.214.5/virtuoso-bioledge/sparql. It can be used to directly access both the Knowledge Base and the annotations produced for the WP3 through SPARQL queries.
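
As an illustration of how such an endpoint is typically queried, the following is a minimal Python sketch that uses only the standard library and follows the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter and a JSON results media type). The endpoint URL is the one given above and may no longer be reachable. The sample query is a generic triple listing that assumes nothing about the BIOLEDGE Knowledge Base schema, and the function names `build_request` and `run_query` are hypothetical helpers introduced here, not part of any BIOLEDGE software.

```python
import json
import urllib.parse
import urllib.request

# Endpoint URL as stated in the report; availability is not guaranteed.
ENDPOINT = "http://150.214.214.5/virtuoso-bioledge/sparql"

# Generic query (assumption): lists five arbitrary triples and does not
# rely on any BIOLEDGE-specific classes or predicates.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint, query, timeout=30):
    """Send the query and return the list of result bindings."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    # SPARQL JSON results keep the rows under results -> bindings.
    return data["results"]["bindings"]
```

If the endpoint is online, `run_query(ENDPOINT, SAMPLE_QUERY)` would return a list of dictionaries, one per result row, in which each variable name maps to an object whose `"value"` field holds the URI or literal.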

Access to the text mining results is provided through a public Application Programming Interface (API), which offers an application-oriented view of the Resource Description Framework (RDF) data. The API and its documentation can be accessed at http://150.214.214.3/KhaosAPI.

An example application for exploring the text mining results, based on the API, can be accessed at http://150.214.214.3/bioledge.
The SME partner Techila has produced two videos and five technical flyers about its high performance computing tools and their use.

Within the BIOLEDGE project, two workshops ("Modelling for Industrial Biotechnology" and "Metabolic engineering and synthetic biology in industrial biotechnology") were arranged.
The following open access websites for tools generated in the project have been established: CANTO Online Annotation Tool - Komagataella pastoris community curation tool (http://curation.pombase.org/), CLUSTERnGO: a user-defined modelling platform for two-stage clustering of time-series data (http://www.cmpe.boun.edu.tr/content/cng), and esyN: Network Building, Sharing and Publishing (http://www.esyn.org/).

Two PhD theses and one MSci thesis have been successfully submitted by students working within the frame of this project, and one further MSci thesis will be submitted in December 2015.

Exploitation of results

The BIOLEDGE project has developed processes and technologies according to plan and, to a large extent, the developed technologies were demonstrated. No patent applications were filed during the BIOLEDGE project. However, it is highly possible that a few applications around technologies established in this project will be filed during the next few years.

The metabolic reconstructions prepared in the project can be used in microbial bioproduction processes for which a detailed manual reconstruction is not available. Models created with CoReCo can be used to improve the yield and purity of the product by means of metabolic simulations. They can also be used to select production hosts for novel bioproduction processes. The CoReCo software will be used extensively at VTT in future projects, and it is available to the other partners in the BIOLEDGE consortium. After the planned publication of the reaction database associated with CoReCo, the software is expected to find broad use also in the design of microbial bioprocesses within industry. VTT expects to build new contract research collaborations around CoReCo and the reaction database. The development of the protein-protein interaction prediction tools has provided novel candidate protein secretion factor genes, which will be pursued experimentally in future projects.

The BIOLEDGE bioinformatics platform supports integration and mining of heterogeneous data related to industrial biotechnology. This platform combines the analysis of valuable data in unstructured textual form with data mining techniques. Data mining methods may detect patterns in data and retrieve relevant information, but they may not predict the outcomes of biological processes, e.g. protein production in a specific industrial microorganism. This information is complemented with information from existing studies, obtained by analysing scientific texts with the developed text mining algorithms. The BIOLEDGE platform and its component tools form new products for NorayBio that will be marketed to industry clients.

Furthermore, the BIOLEDGE project has produced several additional tools that are expected to find extensive use in academia and industry. These include: a stand-alone tool for identifying patterns in time-series data and assigning functional associations to the clusters identified by these patterns (CLUSTERnGO); an online tool for detailed curation of published molecular data, configured for access by the Komagataella (Pichia) pastoris community (CANTO K. pastoris); a stand-alone tool for the optimisation of environmental parameters in microbial or mammalian cultivation (CAM Optimus); the first community model of the metabolic network of K. pastoris and its recombinant protein production routes (Kp1.0); a formal executable model of the protein biosynthetic machinery of yeast that describes the structural hierarchy of molecular organization; a protein re-design pipeline to improve the stability of recombinant protein products during both their manufacture and use; and a free and open-source tool to facilitate the exchange of biological network models between researchers (esyN).

In addition, within the BIOLEDGE project a yeast proteome-compartmentation study was carried out to complement our efforts to improve metabolic model predictions, as well as a time-series RNAseq analysis of inducible K. pastoris variants expressing Human Lysozyme with various folding abilities in continuous cultures. The largest ever cultivation and genome-wide data set on protein production cultivation of Trichoderma reesei was produced. Together with the T. reesei metabolic model built with CoReCo, this data set provides the basis for improving protein production in filamentous fungi by modelling. These data sets are or will soon be publicly available and are expected to find extensive use in academia and industry.

List of Websites:
The URL of the project website is http://www.bioledge.eu/

Dr Jussi Jäntti
Leader of the Synthetic biology research team
P.O. Box 1000
FI 02044 VTT
Finland
Tel: +358 50 5227846
