CORDIS - EU research results
Content archived on 2024-06-18

Comparative Genomics and Next Generation Sequencing

Final Report Summary - COGANGS (Comparative Genomics and Next Generation Sequencing)

Executive Summary:
Gene regulation is central in governing how our body translates our genomic blueprints into biologically active pathways, cells and organs. The analysis of gene regulation is therefore of highest value for understanding our biological makeup, and relevant for nearly every biological process by helping to unravel which factors influence gene regulation, how much impact they have on gene regulation, how they can be identified in the genome of interest, how different factors influence each other, and how they work in combination. Gene regulation on the DNA is effected by genomic sites that bind regulatory proteins, and thereby govern which genes are expressed.
The COGANGS project has been focused on developing a software suite, the COGANGS engine, to improve significantly on the ability to predict such functional regulatory genomic sites. It does so by combining prior knowledge on the sequences of known functional sites and evolutionary conservation. For this type of analysis, a large number of genomes from different species can be used as knowledge input. This kind of software is able to provide completely new knowledge, and will thus have tremendous value to life science researchers globally, including pharmaceutical companies, biotech companies, agricultural companies, biofuel companies, and research hospitals, as well as universities and governmental research organizations.
The software solution developed provides novel tools for performing detailed gene regulation analysis using a comprehensive collection of known DNA sequence motifs. The software is built on probabilistic algorithms that provide a sound theoretical foundation for a scientific tool to unravel gene regulatory mechanisms. An important feature of the COGANGS engine is to enable identification of functionally relevant variations in regulatory regions. To identify such functional regulatory SNPs (Single Nucleotide Polymorphisms) the researcher provides one or more sequences that harbor interesting variants. The COGANGS engine will then analyze the sequences integrating knowledge on known binding motifs and comparative information about homologous regions in other species. From the analysis results the researcher may be able to deduce how variants within the sequences influence gene expression and to assess hypotheses in a probabilistic way. The software is implemented to be fully flexible in order to satisfy the desired needs of the end-users: the regulatory regions can be predicted from a single sequence or from multiple sequences and any prior knowledge can be incorporated into the analysis, if available.
The COGANGS engine carries out a Bayesian analysis of genomic sequences that combines comparative and knowledge-based methods in a statistical alignment framework. Using a Markov chain Monte Carlo approach, the COGANGS engine considers all possible multiple alignments, phylogenies, and arrangements of conserved elements. The statistical analysis also includes an optimal phylogenetic segmentation of large sequence sets that relieves the scientist of the need to manually select a set of sequences best suited for comparative study.
The COGANGS engine is developed to be usable either as a command line tool or via a graphical user interface using CLC bio’s Genomics Workbench. It makes use of experimentally proven binding sites and the library of DNA sequence motifs from BIOBASE's TRANSFAC® database, and it applies this comprehensive collection of functional elements to characterize the gene regulatory potential of genomic sequences.

Project Context and Objectives:
The activity of genes is absolutely essential for all life - from viruses and bacteria to crops and human beings. Despite the many technological breakthroughs within life science research during the last 20-30 years, we are still far away from fully understanding how genes are regulated. Understanding gene regulation is one of the most important prerequisites for the development of drugs which can act with high precision on a given target gene in a given person, to combat a disease without the usual side effects. This is one of the major goals in the development towards a truly personalized medicine. Every step forward in understanding gene regulation has tremendous value for life science research, drug development, and thus ultimately for society.

In the last few years, revolutionary new technologies have been developed for DNA sequencing. These methods of high-throughput sequencing, commonly called Next-Generation Sequencing (NGS), are significantly faster and much cheaper than traditional sequencing technology. Sequencing is tens of thousands of times faster, and the volume of data being generated correspondingly larger, than just a few years ago. The price per sequenced DNA base has been reduced accordingly. These new technologies will in the near future create a vast amount of sequence data that can be exploited, among other things, to discover genetic causes of diseases and traits. The "1,000 Genomes" project was started with the aim of sequencing more than 1,000 human beings, and recently, a "10,000 genome" project started (Genome 10K), with the aim of sequencing the full genomes of 10,000 different species.

The objective of the COGANGS project was to develop a software suite, the COGANGS engine, where up to a thousand genomes, e.g. from the 10,000 genome project, can be used as knowledge input in gene regulation analysis - analysis of which factors influence gene regulation, how much impact they have on gene regulation, how they can be identified in the genome of interest, how different gene regulation factors influence each other, and how they work in combination. Such software would be able to provide completely new knowledge, and would thus have tremendous value to life science researchers globally, including pharmaceutical companies, biotech companies, agricultural companies, biofuel companies, and research hospitals, as well as universities and governmental research organizations.


The COGANGS engine provides an integrated analysis of gene regulation of transcription. Gene regulation of transcription is effected by genomic sites that bind regulatory proteins, so-called transcription factors, and thereby govern which genes are expressed. The COGANGS engine provides tools to improve significantly on the ability to predict such transcription factor binding sites (TFBSs).

To make it accessible, the COGANGS engine is built by integrating three components: a stand-alone command-line tool called TransFoot, which implements the algorithms and can be accessed programmatically; the CLC Genomics Workbench, an environment that provides intuitive graphical user interfaces; and a web-based interface called the Match Portal, which extends BIOBASE’s ExPlain system.

There are two main approaches to predict TFBSs. The first approach is based on knowledge, where already known patterns are searched for in promoter regions.
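The knowledge-based approach can be illustrated with a position weight matrix scan. The following is a minimal sketch and not the COGANGS or TRANSFAC implementation: the count matrix, the uniform background, the pseudocount and the score threshold are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical 4-position count matrix for an illustrative motif (not a TRANSFAC entry).
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [1, 0, 9, 1],
    "G": [0, 10, 0, 1],
    "T": [1, 0, 1, 7],
}
BACKGROUND = 0.25   # assumed uniform base frequencies
PSEUDOCOUNT = 0.5   # assumed smoothing so that unseen bases get a finite score


def log_odds_matrix(counts):
    """Convert per-position base counts into log2 odds against the background."""
    length = len(counts["A"])
    matrix = []
    for i in range(length):
        total = sum(counts[b][i] for b in "ACGT") + 4 * PSEUDOCOUNT
        matrix.append({
            b: math.log2(((counts[b][i] + PSEUDOCOUNT) / total) / BACKGROUND)
            for b in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


PWM = log_odds_matrix(COUNTS)
hits = scan("TTAGCTAGCATT", PWM, threshold=5.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

With these toy values only the window matching the matrix consensus passes the threshold; in practice the threshold controls the trade-off between missed sites and false predictions.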

The second approach is an ab initio search, based on comparative bioinformatics considerations: the parts of the promoter regions that are conserved under selection in evolution are assumed to be functional regions, and thus potential TFBSs. By comparing homologous sequences, it is possible to predict which positions are more conserved than their surrounding regions and thus predicted as TFBSs.
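The comparative idea can be sketched by scoring each column of an existing alignment and reporting runs of conserved columns. This is a deliberately crude stand-in: per-column identity replaces the evolutionary models and statistical alignment that the project describes, and the aligned sequences, species labels and cut-offs are hypothetical.

```python
from collections import Counter

# Hypothetical pre-aligned promoter fragments (gaps as "-"); labels are illustrative.
ALIGNMENT = {
    "human": "ACGTTGACGTCAAT",
    "mouse": "ATGATGACGTCATT",
    "dog":   "GCGTTGACGTCAGA",
    "cow":   "ACTTTGACGTCACT",
}


def column_identity(alignment):
    """Fraction of sequences sharing the most common base in each column."""
    seqs = list(alignment.values())
    scores = []
    for column in zip(*seqs):
        bases = [b for b in column if b != "-"]
        top = Counter(bases).most_common(1)[0][1] if bases else 0
        scores.append(top / len(column))
    return scores


def conserved_blocks(scores, min_identity=1.0, min_length=5):
    """Return half-open (start, end) intervals of runs of conserved columns."""
    blocks, start = [], None
    # A trailing -1.0 sentinel closes a run that reaches the end of the alignment.
    for i, score in enumerate(scores + [-1.0]):
        if score >= min_identity and start is None:
            start = i
        elif score < min_identity and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


scores = column_identity(ALIGNMENT)
blocks = conserved_blocks(scores)
for start, end in blocks:
    print(start, end, ALIGNMENT["human"][start:end])
```

Here the block of fully conserved columns is reported as a candidate site, while the more variable flanking columns are ignored.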

The COGANGS engine combines these two approaches, and is able to predict already known TFBSs as well as potential, so far unknown sites. The software is implemented to be flexible to the needs of potential users: the TFBSs can be predicted from a single sequence or from multiple sequences. Any prior knowledge – whether from evolution or known binding sites – can be incorporated into the analysis, and the user can choose between fast but less accurate approaches and slower but more accurate ones.

Furthermore, in the case of multiple orthologous sequences, the user can “segment” the phylogeny, that is, group similar sequences and obtain predictions for each segment. Knowledge may be propagated from segment to segment. Phylogenetic segmentation provides a sensible way to decompose the input rooted evolutionary tree into a number of smaller overlapping components (segments) that can be analysed separately using the above-described methods. The TFBS prediction results provided by TransFoot for each component can be transferred to the next components to be analysed, as prior information about the sequences in which consecutive components overlap. As a result, the prediction of TFBSs obtained for a component depends both on the TransFoot analysis of the component itself and on the prior information transferred from previous components. The components are ordered such that the “sequence of interest”, for which the user would like to obtain TFBS annotation predictions, belongs to the last component analysed. Therefore, the predicted annotations of the sequence of interest depend on information gathered from the entire phylogenetic tree (i.e. also from sequences that are evolutionarily more distant than those belonging to the same component). Since the analysis of a series of small tree components is far less time-consuming than the analysis of the original input tree as a whole, phylogenetic segmentation extends TransFoot’s capability by allowing the analysis of a larger number of sequences in realistic time. For the simplest case with no evolutionary information available, the “Match Portal” component of the COGANGS engine provides less sophisticated but fast prediction algorithms.
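The transfer of results between components can be pictured as a chain of Bayesian updates, where the posterior from one component serves as the prior for the next. The sketch below is an assumption-laden illustration, not TransFoot's algorithm: it treats each component's evidence as an independent per-position likelihood ratio, assumes positions map one-to-one between components, and uses hypothetical numbers throughout.

```python
def update(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


def propagate(segment_evidence, initial_prior=0.1):
    """Chain components so each posterior becomes the prior for the next one.

    segment_evidence holds one list of per-position likelihood ratios per
    component, ordered so the component with the sequence of interest is last.
    """
    n_positions = len(segment_evidence[0])
    belief = [initial_prior] * n_positions
    for ratios in segment_evidence:
        belief = [update(p, lr) for p, lr in zip(belief, ratios)]
    return belief


# Hypothetical likelihood ratios (binding site vs. background) for three positions.
SEGMENTS = [
    [4.0, 1.0, 0.5],  # most distant component, analysed first
    [3.0, 1.0, 0.5],  # intermediate component
    [2.0, 1.0, 1.0],  # component containing the sequence of interest
]
posterior = propagate(SEGMENTS)
print([round(p, 3) for p in posterior])
```

Each component is small, so the chain costs roughly the sum of the component analyses rather than the cost of analysing the whole tree at once. Treating the components as independent is a simplification, since overlapping components share sequences.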


Project Results:
Since this description contains formatting that would be changed extensively upon pasting into this section and additionally contains multiple figures and illustrations, the description of the main S & T results/foregrounds is attached as a pdf file named "Description of the main S and T results foregrounds".
Potential Impact:
All project results have been achieved with some deviations from the original requirement specifications for the COGANGS Engine, and this software solution is expected to have a societal and political impact in the near future as follows:

Public health
Personalised medicine is a major focus area for bioinformatics, where improving the safety and efficacy of drugs through identifying variations in genes will take us beyond the usual approach to evidence-based medicine rooted in epidemiologic studies of large cohorts. Such an approach is not beneficial for many common diseases like diabetes, Alzheimer’s or asthma, where treatment is very much customized to each patient. Improving the safety of such drugs directed to the individual will have a major effect. A study in Sweden from September 2007 (Incidence of fatal adverse drug reactions: a population based study) found that fatal adverse drug reactions account for around 3% of all deaths in the general population. Fatal adverse drug reactions thereby rank as the seventh most common cause of death in Sweden.

In the long term, the COGANGS software will help move the EU into the next phase of personalized medicine, where the major aim is to use genetic information to aid diagnosis and treatment selection. Thus for each patient, knowledge about the interaction between specific drugs and genetic variation will help in selecting the most appropriate treatment option and will minimize the risk that drugs are administered to individuals who are genetically predisposed to serious side effects.

European Research Strategy
A recent report from the NMP Expert Advisory Group (November 2009, http://ec.europa.eu/research/industrial_technologies/pdf/nmp-expert-advisory-group-report_en.pdf) highlights how essential it is to do bioinformatics research in strategic research areas such as understanding and adapting microbial metabolic pathways for better manufacturing processes. Such adaptation requires changing the regulation of expression of the genes participating in metabolic pathways, and for this we need a better understanding of how these genes are regulated. The COGANGS software will help in this work.

European competitiveness:
European companies operating in the bioinformatics market are under pressure from larger US competitors and from rapidly increasing Chinese investment in the development of combinatorial data mining technologies. Consequently, investment in the European bioinformatics industry will ensure that Europe’s competitive position remains strong, or that it outperforms that of the USA and China.

Another indication of the problem for the European bioinformatics community is the brain drain of leading European scientists: all SME partners and also the RTD partners have seen some of their top scientists move to the US for more prestigious work. For example, Bjarne Knudsen, Chief Scientific Officer of CLC bio, moved to the USA in 2001 to do bioinformatics research to the benefit of the USA and the University of Florida, only to return to the EU/Denmark when good funding opportunities arose and CLC bio was started, which is now the world’s leading company within NGS analysis software solutions. Many other examples of brain drain away from the EU in the bioinformatics area are seen every month.

Energy savings:
With higher speed for analysing gene regulation, power consumption will be significantly lower when analysing large datasets, probably by a factor of up to 10-20 (i.e. 1/10 to 1/20 of the power consumption).

Dissemination of project results:
External dissemination activities can include:
• Websites, including the COGANGS project-specific site
• Local, regional and national press, television and radio outlets
• Publishing and interviews in trade journals
• Scientific publications through e-journals as well as traditional paper journals – e.g. IEEE/ACM transactions on Computational Biology and Bioinformatics, BMC Bioinformatics, Bioinformatics, Virology, Mol Biol Evol, Comput Biol Chem, Nucleic Acids Res, and Genetics
• The EC’s own press – e.g. CORDIS focus, euroabstracts, RTD info, etc.
• Exhibitions, trade shows, conferences and seminars – e.g. RECOMB, WABI, ECCB, American Chemical Society National Meeting & Exposition, Biotech Forum Scanlab, American Society of Human Genetics, Joint Meeting of the Signal Transduction Society, and MEDICA.
• Scientific talks at scientific conferences
• Workshops – for partners and interest groups in the pharmaceutical industries, as well as scientific community

In addition to the media-related dissemination, the commercial dissemination, which is the most important part, will take the form of sales and marketing activities. These will draw upon existing expertise for the existing products from CLC bio and BIOBASE and include the following sales activities:
• Dissemination through CLC bio's and BIOBASE's salespersons, presenting and selling the solution alongside with (and integrated in) all other software solutions from these two companies
o In pre-sales presentations
o At face-to-face customer meetings
o At conferences
• Similar dissemination activities can be carried out by CLC bio's more than 20 resellers and sales partners globally, as well as by BIOBASE's resellers and sales partners.

As the prototype still has to be developed into a marketable product, these dissemination activities will only start after the end of the project, when such a product is advanced enough to be marketed. These sales activities will be complemented by marketing activities from CLC bio and BIOBASE. Marketing activities can be split into the development of marketing material to be used in sales, marketing and dissemination, and the execution of marketing initiatives. The categories of marketing material to be developed, or added to existing product materials, for disseminating the project results are listed below; the most effective will be selected for implementation:
• Integration of product descriptions into the websites of the SMEs
• Development of product sheets, fliers, leaflets and brochures on the separate products as well as the integrated solutions
• Development of PowerPoint presentations for download or for use in online Seminars
• Development of case studies, white papers, method reviews and tutorials
• Shooting of videos – both tutorials and customer testimonials
• Development of trial access products – prospective customers may download a time-limited demo version of the various products on the websites of CLC bio and BIOBASE
• Development of messaging and formats for newsletters, and campaign letters to be used (both for electronic and paper formats)
• Development of online banner and print advertisements
• Development and refinement of key words and online search engine text advertisements
• Integration of information into Press Kits on the companies and their offerings

The marketing material will be used in the dissemination and sales activities listed above, as well as in the following activities specific for marketing:
• Newsletter mailings about the company products and the project (CLC bio’s monthly newsletter has more than 45,000 readers, BIOBASE has more than 60,000 registered users in its contact database)
• On-line presentations (e.g. WebEx, GoTo meeting) of the integrated products, also to be presented jointly by both companies
• Targeted e-mail campaigns, postcard or other paper mailings
• Advertising through banner ads, search engine ads, print ads and free information material

Each company will be free to implement such dissemination steps as it fits into their existing marketing plan and schedule. All substantial activities such as presentations at scientific conferences planned during the project have to be reported in the Dissemination Roster (see Dissemination Process, above). Individual discussions with potential users do not need to be reported.
For an overview of the project dissemination activities, see table A2 “list of dissemination activities” in this report.


Exploitation of project results
Ownership of project results by result lists the planned exploitation for revenue generation from the results for each SME partner (for details, see table B2, “overview table with exploitable foreground” in this report).

Market strategy and product
The product resulting from this project is the COGANGS engine, which provides high quality algorithms for analyzing gene regulation. It is defined with an open API for easy integration into CLC bio’s and BIOBASE’s product line.
The COGANGS engine will be brought to market in several ways, leveraging the flexibility on the product offerings already being present in CLC bio’s and BIOBASE’s market strategy.
The main solution will be software where the COGANGS engine is embedded as a plug-in, with integrated TRANSFAC derived prior knowledge, in a combined platform consisting of CLC Genomics Server and CLC Genomics Workbench with links to BIOBASE’s TRANSFAC database. These three products are already brought to market as a shared solution, and integrating the COGANGS engine is thus technically easy.
The above solutions from CLC bio and BIOBASE are already in the market with great commercial success, and the COGANGS engine will accelerate and leverage the sales of these solutions through penetration to new segments of the market.

The COGANGS Engine will be sold in three different ways:
• Concept #1: Sale of the full solution consisting of all the above components
• Concept #2: Sale of additional modules driven by the value of the COGANGS engine in combination with other products. Example: Sale of the COGANGS engine and the TRANSFAC database to an existing CLC bio customer, using CLC bio Genomics Workbench and CLC bio Genomics Server.
• Concept #3: Sale of the TransFoot software on a stand-alone basis.

By far the greatest revenue source is expected to be realized through concept #1 – the COGANGS engine as a part of a large, combined BIOBASE/CLC bio system. The products and solutions will be sold through CLC bio’s and BIOBASE’s existing sales network of internal salespeople and resellers, located in more than 20 countries globally. The bioinformatics market is moving towards consolidation and integration of many analyses in meta-solutions, because of the need for easy-to-use solutions and solutions that reduce manual work and increase automation of workflows. This solution is valuable to the market because it will be the premier solution for analysing gene regulation – e.g. for diagnostics purposes. As the system, in the present version, cannot be applied to whole genomes due to performance limitations, we expect that the addressable market for the present version of the software is smaller than the market for the originally envisioned solution. The SMEs will investigate whether this performance bottleneck is inherent in the scientific approach, or whether it can be circumvented by appropriate optimization efforts. The COGANGS engine will provide better results than other analysis solutions for small, focused areas of investigation, which is also what life science researchers are focusing on at present. Applying the solution at genome scale would require additional hardware investments and power consumption, which may deter adoption in its current form. If these issues can be overcome, the market potential is huge, at least 100 million USD in sales per year, because gene regulation is a key component of a vast number of research areas today – and of molecular diagnostics in the future. Due to the extreme impact that NGS has on the market – i.e. the many new types of research, the new volumes of data to base research on, and the many new applications for research and research results – this market is expected to grow at a very high rate (probably more than 30% per year) for at least the next 10 years.

Routes to market
At present CLC bio has 35 salespeople in Denmark, Germany, France, UK, Sweden, USA, Brazil, India, Taiwan and Japan. BIOBASE has salespeople in Germany, India, USA and Japan. CLC bio presently has resellers/distributors in a range of countries, including Canada, Mexico, Brazil, Poland, France, Germany (BIOBASE), Turkey, Israel, Russia, South Africa, Malaysia, Singapore, Taiwan, South Korea, Japan, China, Australia, New Zealand. BIOBASE has resellers in several Asian countries, including Korea, China, and Japan. Both CLC bio and BIOBASE thus have existing, very well performing global sales organizations. Both companies also have very well performing marketing departments. It is these sales and marketing organizations that will be carrying out the sales and marketing work (the dissemination/exploitation – see Dissemination plan). The products developed in the COGANGS project will be sold as fully integrated modules for CLC bio’s existing products and BIOBASE’s existing products – i.e. products that the existing sales and marketing organizations already know. The strategy of both CLC bio and BIOBASE is to exploit these advantages by branding and marketing the COGANGS products along with the existing, successfully marketed products, and by selling them along with, and integrated with, the existing, successfully sold products. All of the above activities will be carried out by CLC bio, BIOBASE, and the many resellers/distributors of the two companies. To conclude, the COGANGS products will be sold by both the internal salespeople and the external resellers/distributors of CLC bio and BIOBASE. The above sales and marketing concepts are the same as have been successfully implemented for CLC bio’s and BIOBASE’s other products, and they are therefore expected to have a significant market impact.

Pricing of the COGANGS engine
The price for a full solution plug-in (Concept #1 above) will be 5,000 EUR, and up to 20% of the purchase price for each year of maintenance, upgrades, and support following the first. These prices are lower than originally projected, as the solutions are more limited in scope and throughput. The price for TransFoot on a stand-alone basis (Concept #3 above) would be around 3,000 EUR for a stand-alone license (a license for one computer), including maintenance, upgrade, and support for 12 months. Additional years of maintenance, upgrade and support will be 2,000 EUR per year. It is expected that the software use and thus the sales pattern will follow other bioinformatics solutions sold by CLC bio and BIOBASE, and large customers will thus buy multiple licenses. As the performance of the software increases in future versions, the price level will be raised.
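The pricing model above can be turned into a small multi-year cost comparison. This sketch uses only the figures quoted in the text; it assumes the maximum 20% maintenance rate for the plug-in and a single licence, and the function names are hypothetical.

```python
def plugin_cost(years, price=5000.0, maintenance_rate=0.20):
    """Full-solution plug-in (Concept #1): purchase price plus maintenance
    for each year following the first, at the assumed maximum rate of 20%."""
    return price + max(years - 1, 0) * price * maintenance_rate


def standalone_cost(years, first_year=3000.0, renewal=2000.0):
    """Stand-alone TransFoot licence (Concept #3): the first 12 months of
    maintenance are included, then a flat fee per additional year."""
    return first_year + max(years - 1, 0) * renewal


for years in (1, 3, 5):
    print(years, plugin_cost(years), standalone_cost(years))
```

Under these assumptions the two options cost the same over three years (7,000 EUR), and the stand-alone licence becomes the more expensive one from the fourth year onwards because of its higher yearly renewal fee.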

Time-to-market
Before the consortium can enter the market our product idea must be validated. The consortium has a strong history in market driven innovation, and the validation will thus be starting already in WP6, “Real life testing and feedback”. After finalization of the COGANGS engine, more field tests will be carried out, and appropriate improvements and adjustments will be made. The market entry will be late 2013 or early 2014 with a public beta version. As with all other CLC bio and BIOBASE products, customers will be able to try out the software on a trial basis – typically 1 month for small and medium sized solutions and 2-4 months for larger solutions. Traditionally, this trial period either ends up with a purchase and/or structured feedback on what could be improved in order to increase the value of the solution for the customer. At deCODE, variants indicated by the COGANGS machine to be better candidates for risk assessment than previously used variants will be validated in large case control sample sets from external collaborators. If a higher predictive value for the new variant is confirmed, it can be immediately incorporated into the respective test. For the refinement of previous risk locations, it can be expected that the testing and validation will take 3-6 months from the time that the first prototype product is ready. For each new disease locus, the machine will be applied to the region as soon as it is implicated in disease. deCODE offers a variety of data analyses for the users of the company’s genomic services, including analysis of sequence data. The COGANGS solution will be among the software applications offered for the customers as soon as the product enters the market in 2013.

Market penetration rate
It is the experience of CLC bio and BIOBASE that the sales cycle is long (up to 24 months for large solutions), and that larger solutions are often purchased in steps, gradually increasing the size of the solution. The time from the purchase of the initial test setup to the purchase of a large site license can be several years. It is expected that this sales pattern will be the same for solutions including the COGANGS engine. This slow rate of market penetration is accelerated by bundling the COGANGS engine with existing solutions, and by up-selling it to existing customers. In addition, the project partners will continuously work on improving and expanding the software.

Profitability
Based on the estimated market potential, time to market and the price model above, we have calculated the economic benefits for the two SMEs selling the COGANGS engine in Table 7 (Total profit for the COGANGS engine + associated software sales leveraged through the COGANGS engine sales) on p. 19 of deliverable 2.5 "Final Exploitation Plan". This is to verify that each of the partners will regain their investment within a reasonable time.

CLC bio and BIOBASE
CLC bio’s customer portfolio has increased drastically in recent years, resulting in a growth in sales of more than 100% per year in 2008 as well as in 2009. 16 out of 20 Big Pharma companies and hundreds of other companies, hospitals, and research organizations are customers of CLC bio. CLC bio thus has a perfect platform for up-selling COGANGS engines and related software to existing customers. CLC bio’s global sales force and more than 20 resellers globally will be key in these sales and in the sale of COGANGS solutions to new customers. BIOBASE sells its products and services mainly via direct sales activities through its regional branch offices in Europe, USA, India and Japan as well as via a network of distributors in Korea, Taiwan and Japan. As of May 2009, BIOBASE employed 15 people in sales and business development: 3 in Germany, 5 in the USA, 5 in India and 2 in Japan. With these distribution channels in place, BIOBASE is well prepared to market and sell the COGANGS engine and the products based upon and enhanced by the COGANGS engine. The sales will be a worldwide up-sell to existing customers, and new sales to new customers. CLC bio’s profit is estimated at an average of 25% of the sales revenue, costs being sales commission and salaries for customer support, salespeople, and further product development.
Sales of the COGANGS engine are projected with a sales increase from 20 units in 2014 to 300 in 2018 with an average sales price of 5,000 EUR per single user license, resulting in fees for maintenance, upgrades, and support of 1,000 EUR.
The above are meant to be conservative sales numbers as they do not include the increased sales that will be realized if/when the performance of the software is improved. The remaining 70% of the sales are other solutions, such as the CLC Genomics Workbench or TRANSFAC, whose sale is driven completely by the fact that the COGANGS engine is an integrated part of that combined solution.

deCODE
deCODE’s exploitation of the methods developed in COGANGS will first and foremost be indirect, i.e. the methods will be used to help find genetic variants that cause increased risk of disease. These variants can subsequently be incorporated into diagnostic tests for genetic risk of common diseases. Furthermore, deCODE will be able to offer these methods for analysis of genomic data to the customers of deCODE’s genetic services. For deCODE, as an SME end-user, its profit from participating in the project will ultimately come from the marketing of tests to assess genetic risk for common diseases. deCODE has already put tests on the market that assess the genetic risk of 6 common diseases; however, for all these diseases, additional variants remain to be found. In order to find such variants and to improve the specificity and sensitivity of the genetic tests, deCODE is actively searching for additional variants that affect the risk of these and other diseases. Furthermore, whole-genome genotyping identifies variants that “tag” a statistical association to the disease but are unlikely to be the actual causal variants themselves, which is not optimal for inclusion in a genetic test. If application of the methods developed in the project to NGS data proves to be useful for annotating functional parts of the genome, this will guide deCODE’s efforts in finding the candidate causative variants which, in turn, will improve the specificity of the test. The actual financial benefit or “profitability” of this use of the end product is difficult to estimate at this point, but the COGANGS project will clearly provide us with a competitive edge, leading to a reasonable return on investment for the project costs. With the projected sales envisaged above, the participating SMEs will each obtain additional turnover, profit and employee growth as estimated in Table 8 (Estimated growth for participating SMEs for the two software selling SMEs) on p. 20 of deliverable 2.5 "Final Exploitation Plan".
We have not estimated a turnover and profit for deCODE, but during 2006, 2007 and 2008 deCODE’s revenue was $40.5M, $40.4M and $58.1M respectively, of which genomic services contributed about $2.5M, $5.8M and $23.6M during the same years. This growth in the genomic services was sustained by the research and development programs in various fields of genomic analysis, and continued growth is expected for as long as deCODE maintains its position at the cutting edge. The COGANGS project is one of the activities that is considered a major contributor to deCODE’s continued success. Even if the competitive edge provided by the success of the COGANGS end product contributes only a fraction of the future growth, the return on investment in the project would nonetheless be quite high.


List of Websites:
http://cogangs.com/

Contact person: Liselotte Kahns, Scientific Project Manager at CLC bio; lkahns@clcbio.com
final1-description-of-the-main-s-and-t-results-foregrounds.pdf