A genomics toolbox to enhance business for SMEs in the market of starter cultures and probiotics

Final Report Summary - GENOBOX (A genomics toolbox to enhance business for SMEs in the market of starter cultures and probiotics)

Executive Summary:
Fermented foods, such as cheese, yoghurt, bread and wine, constitute a large part of the human diet. These fermented foods are produced from substrates such as milk, fruits, and cereals by the action of microorganisms: yeasts or lactic acid bacteria. Microorganisms from food (or other sources) that might exert a health benefit (e.g. prevention of diarrhea) are known as probiotic strains. The functional potential of a microorganism is determined to a large extent by its genomic sequence, which encodes enzymes and other proteins. It is therefore important to determine the genomic sequences of important industrial microorganisms and to develop tools for interpreting these sequences, because such tools can provide rapid insight into the potential functionalities of these organisms.
To address the need to efficiently analyze genomic sequences from larger collections of strains, the GENOBOX consortium was founded. It consists of SMEs from the dairy and probiotic industry (LB Bulgaricum, Sacco, Winclove and Bioprox) and two RTDs (NIZO Food research and Radboud University Nijmegen) specialized in microbiology, screening and bioinformatics of lactic acid bacteria. The goal of the consortium is to construct a toolbox in which genomic analysis can be used to predict functional properties of microorganisms, based on their genomic sequences and associated phenotypic data, and to produce experimental data to validate and refine these predictive models.
The work carried out in the consortium over the last two years was sponsored by the European Commission within the Framework 7 program under the funding scheme ‘Research for the Benefit of SMEs’. During this period, a large number of strains from the consortium were sequenced, including species from the Lactobacillus, Lactococcus, Streptococcus and Bifidobacterium genera. The sequence data were supplemented by sequences from publicly available reference genomes, providing an extensive starting point for predictive modelling and the construction of metabolic models. In addition to the collection of genomic data, phenotypic arrays and transcriptomics experiments were performed to identify potential molecular markers that could be predictive for certain phenotypic characteristics.
The genomic and experimental data were combined in a software platform (also called GENOBOX as an abbreviation for GENOmic toolBOX) that allows for predictive analysis of new genomic sequences.
Using this platform, predictions were made for new routes to higher yields in fermentations and for genetic markers that can be used for strain identification and tracking of strains in complex mixtures. The predictions were validated by the SMEs in their laboratories.

Project Context and Objectives:
Introduction
Throughout the European Union (EU), companies supply starters and probiotic strains to other businesses (B2B) or to end-users (B2C) for application in food products or supplements. Besides a limited number of large multinational companies, producers of starters and (novel) probiotics are mainly SMEs. The competition in the field of starters and probiotics is driven by the ability to provide (i) well characterized and traceable strains belonging to species that have been granted QPS (Qualified Presumption of Safety) status by EFSA (European Food Safety Authority), (ii) cost-efficient production processes, and (iii) a wide portfolio of products with specific in-product functionalities or proven health or product benefits.

Yield improvement
An important economic target in the market of starters and probiotics is increasing production efficiency, either by improving the biomass yield per unit medium or by replacing ingredients in production media with more efficient and affordable alternatives. The final yield is also affected by the survival rate of the strains after they have been dried (for example by freeze-drying), packaged and transported to the end-users. Increasing the robustness of strains can therefore further increase the revenues of companies. For example, increasing the robustness of strains to the freeze-drying process and their stability during subsequent storage can increase the revenues of a strain up to 10-fold. Upon re-hydration of the strain, the activity and survival (the overall robustness) of the culture can be negatively affected, which in turn can affect the production turnover of cheese or yoghurt or the survival of a probiotic.

Strain safety
The safety of a strain is determined by multiple factors. Screening assays for determining a strain’s safety cover resistance to antibiotics, the production of biogenic amines and the presence of toxin or virulence factor genes, as well as more complicated assays such as liver bacterial translocation in mice, unwanted reductase activity and thrombin induction. Early knowledge of the potential risks in the safety profile of a strain is important and could aid in selecting the best strain from a set of potential new strains. Since many of the characteristics that determine safety are encoded in genes, analysis of the genome already gives a good indication of strain safety. Eventually, studies in humans may be needed to prove the safety of a strain before it can be produced for the market.

Product benefits
Microorganisms may have a beneficial action by interacting directly with the human host or by displacing pathogenic organisms in the gut or in other relevant tissues. If they exert their influence in this way they are called probiotics.
An alternative way in which microorganisms exert their influence is by modifying a fermented product through the production of additional metabolites that influence the flavor and taste of the product (e.g. the production of diacetyl, acetaldehyde or acetoin) or the texture of the product (e.g. by the production of EPS). The production of these metabolites is regulated by well-known biological pathways and can be readily predicted from the genome sequence. The exact mechanisms by which microorganisms exert their probiotic effect are less well known, and prediction of these properties from the genome sequence alone may be considered more challenging.

State-of-the-art in R&D
Chr. Hansen and Danisco are the world leaders in both the market of starters and probiotics, claiming about 80% of total revenue. The annual turnover of these companies increases each year and a significant percentage of their turnover is spent on R&D (6% for Chr. Hansen and 7% for Danisco) which demonstrates the importance of R&D activities for these companies.
Within these R&D activities both at multinational companies and at smaller SMEs, genome sequencing and annotation of microbial organisms are starting to play an ever more important role. Genome sequences are used to predict beneficial functionalities in strains such as improved production of flavor molecules or vitamins, or increased adhesion and survival in the gastro-intestinal tract. The efficient use of these genome sequences enables faster strain and product development and improvement as outlined above. Moreover genome sequences are used to predict whether organisms are potentially toxic, pathogenic or important for spoilage.
The current bottleneck for SMEs to routinely apply genomics (in terms of costs, time and expertise) lies in the management and analysis of these complex data, because the analyses require access to, and understanding of, large public databases and bioinformatics tools, as well as the know-how to use these tools. Also, licenses for software packages are costly for SMEs to purchase.
The objective of this project is to provide the SMEs with sufficient genomics and bioinformatics tools to enhance their competitiveness through more efficient production of their strains, guaranteed safety of their products and potential for development of new products. The use of genomics and bioinformatics tools will allow SMEs to obtain fast results while reducing the number of expensive, time consuming high throughput screening projects for the discovery of new functionalities and the subsequent large scale validation experiments.

Project Results:
Web framework for genomic analysis

One of the most important deliverables from this consortium is the GENOBOX server. The backbone of the GENOBOX webserver is written in the Python programming language using the Django web framework. Python and Django were chosen because they enable high-performing and elegant web applications to be built quickly and because they are available as open-source software. A PostgreSQL database implementing a CHADO schema was created to store all biological information. These tools are also open source, and the CHADO schema is widely used in the academic environment to store genomic and experimental data on model organisms. The web application was developed with the readability and maintainability of the source code in mind, using object-oriented programming where possible. High importance was given to the security and privacy of the GENOBOX users. Moreover, we made GENOBOX available as a virtual appliance to maximize and facilitate its portability to the SMEs.
These design principles mean that any updates to the server, whether made in the context of the consortium or by the SMEs themselves, can be performed easily. Moreover, the exclusive use of open-source software means that the SMEs do not need to buy expensive third-party licenses when they want to use this tool in-house.
The functions implemented in GENOBOX are the following:
• Upload and refinement of strain sequence and metadata.
• Upload and refinement of strain phenotypic data.
• Database browsing.
• Strain comparisons based on phenotypic and genomic data.
• Sequence alignment between strains and gene containers.
• Orthology calculations.
• Predictive modelling based on the genome sequence.
• Various export functions to text and Excel files for use in other bioinformatic tools.

The most important feature of the GENOBOX server is predictive modelling based on the genome sequence that is uploaded for an organism. These predictive models are based on sets of genes, called gene containers, that are predictive for a given functional trait. The construction of these gene containers is described in more detail below under the heading “Construction of phenotypic rules based on gene containers”. The prediction of strain characteristics is presented via a so-called traffic light system, in which red, yellow and green color coding indicates the absence, potential presence and presence, respectively, of a functionality. The traffic light has a drill-down function that allows for detailed inspection of the data underlying the predictive model. This traffic light system was chosen at the specific request of the SMEs so that non-expert users can also interpret the results of the predictive models. The GENOBOX webserver will be hosted by NIZO for another 3 years after the end of the EU financing period.
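As a rough illustration of the traffic light idea described above, the following Python sketch scores a genome against a single gene container. The function name, the gene identifiers and the two thresholds are assumptions made for this example; the report does not state how GENOBOX maps gene presence to colours.

```python
from enum import Enum


class Light(Enum):
    RED = "absent"
    YELLOW = "potentially present"
    GREEN = "present"


def traffic_light(container_genes, genome_genes, green_at=1.0, yellow_at=0.5):
    """Classify a trait from the fraction of container genes found in a genome.

    green_at and yellow_at are hypothetical cut-offs, not values from the report.
    Returns the light plus the sorted list of missing genes (the "drill-down").
    """
    container = set(container_genes)
    if not container:
        raise ValueError("gene container is empty")
    found = container & set(genome_genes)
    fraction = len(found) / len(container)
    if fraction >= green_at:
        light = Light.GREEN
    elif fraction >= yellow_at:
        light = Light.YELLOW
    else:
        light = Light.RED
    return light, sorted(container - found)


# Hypothetical container of four genes and a genome that carries two of them.
light, missing = traffic_light(
    ["geneA", "geneB", "geneC", "geneD"],
    ["geneA", "geneB", "geneX"],
)
print(light.name, missing)
```

Returning the missing genes alongside the colour mirrors the drill-down function: a non-expert can read the colour, while an expert can inspect which genes led to it.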

Annotated sequences
Within the project a large number of strains have been sequenced. All these sequences are available through the GENOBOX webserver. Genome sequences are valuable because they do not change significantly: they are a relatively immutable property of the organism. This means that whenever new annotation methods, algorithms or predictive models based on other genomic sequences become available, the genomic sequences generated within GENOBOX can be used as input to predict the functionality they encode. For the sequenced organisms, specific attention was given to genes with functionality of particular interest to the SMEs (those in the gene containers). The function of these genes was manually curated by comparing the automatically annotated genes in the SMEs’ genomes to the curated annotations in reference genomes. This was done by determining, for each species separately, the orthologous genes (genes with common ancestry) between the SMEs’ genomes and the public genomes. To determine whether start codons were correctly assigned and no genes were missed in the SMEs’ genomes, contigs containing the genes of interest were aligned to reference genomes. Furthermore, the latter approach made it possible to identify pseudogenes, for instance genes that were truncated (and likely not functional) in the SMEs’ strains.

Metabolic models
Based on the genome sequence, metabolic models were constructed. These metabolic models can be considered as a map that indicates the flow of molecules through the organism. To this end, all enzymes that are known to perform metabolic reactions were identified in the genomic sequences. Enzymes were then linked together in metabolic pathways on the basis of their reaction equations, i.e. on the substrates that they use and the metabolic products that they produce. By connecting reactions and pathways in this way, substrate conversion into biomass and CO2 can be modelled and the fate of individual molecules can be predicted. The refined metabolic models were delivered to the SMEs in the form of static metabolic maps that allow manual inspection of the metabolic network of the strain of interest. In addition, SBML-based files of the genome-scale models were generated so that the SMEs can use these models in open-source modelling software (e.g. PySCeS or VANTED). Excel files listing the metabolic reactions were also made available.
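The way reactions are chained through shared substrates and products can be sketched in a few lines of Python. This is a toy illustration only: the reaction names, the lumped reactions and the metabolite lists are assumptions for the example and are not taken from the GENOBOX models, and the sketch ignores stoichiometry and fluxes.

```python
def reachable_metabolites(reactions, medium):
    """Expand the set of metabolites that can be formed from a growth medium.

    reactions maps a reaction name to a (substrates, products) pair of sets.
    A reaction can run once all of its substrates are available; its products
    then become available to other reactions. Repeats until nothing changes.
    """
    available = set(medium)
    active = set()
    changed = True
    while changed:
        changed = False
        for name, (substrates, products) in reactions.items():
            if name not in active and set(substrates) <= available:
                available |= set(products)
                active.add(name)
                changed = True
    return available, active


# Hypothetical, heavily lumped reactions for illustration.
toy_reactions = {
    "R1": ({"lactose"}, {"glucose", "galactose"}),
    "R2": ({"glucose"}, {"pyruvate"}),
    "R3": ({"pyruvate"}, {"lactate"}),
    "R4": ({"citrate"}, {"diacetyl"}),
}

metabolites, active = reachable_metabolites(toy_reactions, {"lactose"})
print(sorted(metabolites))
print(sorted(active))
```

In this toy network R4 never becomes active on a lactose-only medium because citrate is missing, which is the kind of gap that suggests a candidate medium supplement to test.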

Barcoding of strains
Based on the genome sequence of the individual organisms, sets of barcode sequences can be determined. These sets of sequences are unique in the sense that they can discriminate the strain of interest in a mixture of other strains from the same species. The barcodes were derived by comparing the sequence of the strain with all other sequences from related organisms that are available in the GENOBOX database. Because we collected sequences from multiple SME strains and reference strains in the database, unique sequences could be predicted much more reliably than if a smaller set, e.g. only publicly available genomic sequences, had been used.
Since the barcode prediction was based on sequences available in the GENOBOX server, the real test of specificity would come from using the barcodes in an experimental setting with new strains that were not in GENOBOX. This validation experiment was carried out at two SMEs and showed that the barcodes were indeed highly specific and able to detect the strain of interest within a set of strains from the same species.
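The report does not describe the algorithm used to derive the barcodes. One simple way to illustrate the principle is an exact k-mer comparison: keep only those subsequences of the target strain that occur in none of the other genomes, on either strand. The function names, the toy sequences and the value of k below are assumptions for this sketch; real barcodes would be much longer and would need additional checks before use as markers.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_barcodes(target, others, k):
    """Return k-mers of target that are absent from every other genome.

    Both strands of the other genomes are considered, so a candidate is
    discarded if it matches another strain in either orientation.
    """
    background = set()
    for seq in others:
        background |= kmers(seq, k)
        background |= kmers(reverse_complement(seq), k)
    return kmers(target, k) - background


# Toy sequences; a real database would hold many genomes per species.
barcodes = candidate_barcodes("ACGTTGCA", ["ACGTAAAA"], k=4)
print(sorted(barcodes))
```

As the text notes, the more genomes the background contains, the more likely it is that the surviving candidates are truly specific to the strain of interest.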

Construction of phenotypic rules based on gene containers
The most important step in developing the GENOBOX predictive models is the development of the so-called gene containers. Gene containers are sets of genes that are believed to be responsible for establishing a certain phenotypic trait. These genes often cooperate in one or more biological pathways that lead to the production of a metabolite, the conversion of a given substrate or the synthesis of certain cellular components (e.g. the cell wall). The gene containers were assembled using expert knowledge, literature findings and experimental data, and consist of genes from multiple species and strains from a given family. In total ~50 gene containers were developed, predicting phenotypic traits ranging from GI tract stability to the production of flavours and EPS. The gene containers are available in the GENOBOX webserver for the SMEs and are not available in the public version of the webserver. The gene-based markers that were derived from the literature can be augmented with the markers derived in WP4 and WP5. This list of markers is too long to show here, but for each SME a set of ~100 - 200 markers for phenotypes of interest was generated based on the results of the microarray experiments. Whether the markers will be incorporated in the final predictive model depends on the outcome of multiple validation experiments that are ongoing, and on the users' decision on how specific the predictive models should be and how many false positive predictions are allowed.

Yield increase
One of the most important economic drivers in fermentation is the biomass yield that can be obtained per unit medium. The metabolic models that were constructed as described above were used to predict the yield of the organisms, one organism per SME. For all four organisms, a yield increase upon addition of an extra compound as substrate or co-factor could be predicted. Validation of these predictions on a lab scale showed that three out of four predictions could be confirmed. Further validation at the sites of the SMEs showed that in a number of cases a yield improvement was also observed in an industrial setting.
The metabolic models that were derived still contain large gaps, in the sense that only about 30-40% of the genes of a genome can be assigned a metabolic function and thus have a place in the metabolic map. However, more genome sequences will become available in the public domain, and as a consequence annotation algorithms will continue to improve. The metabolic models can thus be extended with new information as more genes of the genomes are annotated, thereby refining the accuracy of the predictions that can be made with these models.

Phenotypic data
An important result is the large amount of data that was produced by doing the high-throughput screenings for the organisms of the SMEs. All data were transferred to SMEs in the form of Excel files and also as graphical representations of the data. These data were also uploaded into the GENOBOX web portal. While these data were of immediate use in the current project for selection of interesting fermentation conditions in WP4 and WP5 and read outs for transfer to large scale fermentations, these data are also valuable for future research if the SMEs decide to focus on additional phenotypic read outs that have already been measured in this project but have not yet been followed up in the validation experiments. In addition the protocols that were developed in the course of this screening program were shared with the SMEs enabling them to repeat the experiments in their own R&D facilities and to also use these protocols on new strains that were not part of the collection that was analyzed in the current project.

Potential Impact:
Impact for the SMEs
Probiotics and starter cultures are high value ingredients for the development of functional food. At the moment, there are very clear trends visible that the large food-producing industries are moving their efforts more and more toward added value products directed at specific target groups (people with obesity, cardiovascular problems, diabetes, high blood pressure, etc.) or even individuals. As a consequence, the demand for new and improved probiotic strains and starter cultures is expected to grow significantly. SMEs in this area can profit from these trends provided that they can deliver products with known genomic sequences and proven safety, stability, and functionalities. The economic impact of the results from this project can be categorized as follows:

Yield increase: This directly saves money because more product can be obtained with less starting material. It is hard to calculate the exact net benefit at this stage, because the yield increase is only one factor in the entire production chain and the quality and yield of the final product can only be determined once the whole manufacturing process has been performed, including harvesting, downstream processing, storage, packaging and transport. There was not enough time in the project to run the entire production chain.

Sales increase: The fact that an SME is able to demonstrate a lot of background knowledge on the characteristics of a strain strengthens the value proposition for their customers. This is expected to lead to an increase in market share. The exact size of increase in market share is hard to measure at this stage in the project since the final improved products still need to be produced on a large scale.

IP protection: The unique barcodes that were derived for strains from the SMEs may serve two possible goals. On the one hand, they may be used to track the presence of the strains in complicated mixtures, thereby giving handles for the optimization of complex fermentation processes or assisting in troubleshooting processes that do not deliver the required quality. On the other hand, the barcodes can be used to monitor the products of competitors for the occurrence of strains owned by the SMEs. Here too, the economic benefit can clearly be imagined but is hard to express in actual amounts of money at this stage.

Shared ownership of GENOBOX: A large part of the GENOBOX server has been developed in this consortium and is the property of the consortium. As such, this ownership can generate income through licensing of the software. A number of companies have already expressed interest in the GENOBOX framework (see below). In deliverable 6.4 we have laid out our current thinking on how to proceed with the commercialisation of this server. We strongly stress that the submitted document is still a first version of the final document concerning the dissemination of GENOBOX, and that it contains a proposal for the division between what will be GENOBOX Basic and what will be GENOBOX Extended. We intend to cast the final plan in the form of an agreement that will be signed by the legal representatives of the consortium partners.


Impact for other SMEs in related fields
During dissemination activities (publications and presentations at congresses), several SMEs and large companies from the starter & probiotic business, as well as software development companies, have expressed interest in the GENOBOX system. This shows that there is considerable interest in this system and that GENOBOX provides a solution to a technology gap in a wider community.
GENOBOX will not be made available to large companies within the starter & probiotic business, as this would undermine the competitive position of the SMEs of the GENOBOX consortium. However, for SMEs in the areas of public health & safety, sustainability and medicines (see below), the GENOBOX technology can be provided in collaborations or under licensing agreements.

Impact for society at large
An ever increasing amount of data is being generated in all disciplines of science and industrial R&D. This is fueled on the one hand by the development of sophisticated experimental technologies for high throughput data generation, such as next generation sequencing and a variety of other ~omics techniques and on the other hand by the decreasing costs of data storage and network based data transfer.
Efficient use of the data that are being generated, i.e. the efficient aggregation of heterogeneous data and subsequent filtering to identify important patterns, will lead to faster and more informed decision making in multiple disciplines. Algorithm development and the availability of computing power are in practice no longer limiting factors in big data analytics. Rather, communicating the results in a comprehensible manner to stakeholders who are not experts in computer science remains a bottleneck. We believe that the methodology developed within this consortium has efficiently addressed this bottleneck in the current project, and that the principles can be applied in a wider scope in society at large.
In the GENOBOX consortium the focus has been on the analysis of genome sequences and gene expression data from microorganisms. However, the techniques for the development of the web framework and the predictive models based on gene containers can easily be transferred and applied to other organisms such as viruses, yeast and more complicated organisms such as mice and humans.
For example, in pharmaceutical drug development a major problem is that people react differently to drugs depending on their genetic make-up. Although genome sequences for individuals are becoming more and more available, it is still difficult for scientists and physicians to harness these data given the large amount of data per individual. It is quite feasible that a ‘gene container’ with disease-related genes, coupled to a traffic light system, would enable researchers to quickly drill down into the relevant areas of an individual’s genome and assess whether a particular drug could work or would fail. Given that such disease containers are already available in a number of specific databases, the step towards such an integrated system based on GENOBOX technology would be relatively small.

List of Websites:
www.genobox.eu