Multi-layer network modules to identify markers for personalized medication in complex diseases

Final Report Summary - MULTIMOD (Multi-layer network modules to identify markers for personalized medication in complex diseases)

Executive summary:

It has been estimated that most drugs currently on the market work in less than half of patients. Thus, lack of treatment response causes both increased suffering and costs. Ideally, physicians should be able to personalise treatment based on diagnostic markers that can be routinely measured in the clinic. The identification of such markers is complicated by the involvement of thousands of disease genes, in combinations that may vary between patients that do or do not respond to treatment. In the MULTIMOD project, we developed a systems medical strategy to identify markers for personalised medication, based on studies of a model disease, namely seasonal allergic rhinitis (SAR). We propose that this strategy is generally applicable to complex diseases. Briefly, we identified disease-associated genes by high-throughput analyses of allergen-challenged T lymphocytes from patients with SAR and healthy controls. Those genes were mapped on a network map of human protein interactions. We found that the most relevant genes tended to cluster in the same area of the map. The genes in those clusters were highly interconnected and functionally related, forming so-called modules. This entailed extensive statistical and bioinformatics research and development. In order to understand the functions of the modules, we used bioinformatics tools to identify pathways and performed extensive experimental studies of individual genes. We also studied why those modules differed between patients, by searching for genetic variants as well as epigenetic causes. This entailed analysing original and public genome-wide association data from hundreds of thousands of patients. We developed several different methods to identify modules. One of those methods was shown to be generally applicable to complex diseases. This method is now available on the MULTIMOD website, as a standardised tool for researchers to identify modules.
We propose that our strategy is generally applicable to find markers for personalised medicine based on functional understanding of disease mechanisms and how they vary between patients. It is likely that this, or similar strategies, will contribute to the implementation of systems medicine and personalised medication. Another important implication is that an improved understanding of why some patients do not respond to treatment will help drug development: non-responders can be excluded from therapeutic trials and instead be targeted for the development of novel drugs. It is also possible that the same methods can be applied to predict disease, and thereby to preventative medicine.

Project Context and Objectives:

Most common diseases, including allergy, cancer and diabetes, are complex. Instead of single genes or gene products, thousands of environmental, genetic and epigenetic factors may be involved. High-throughput analyses have implicated thousands of genes. Moreover, there is considerable individual variability. A clinical consequence of all this is variable response to medication, which increases both suffering and cost. Most drugs are effective in less than half of patients (Editorial. Nature Biotechnol 2012). Physicians should ideally be able to personalize medication based on diagnostic markers that can be routinely measured. The identification of such markers is thus an important goal, but also a formidable challenge, and one that requires understanding complex pathogenic mechanisms and how they vary across populations. Because complex diseases depend on altered interactions between multiple genes, it is difficult to obtain a functional understanding based on detailed studies of individual genes. Systems medicine is an emerging discipline that addresses this challenge by combining high-throughput genomics, computer science, bioinformatics and systems biology as outlined below (Henney A. Nature 2008, Auffray et al. Genome Med 2009).

The MULTIMOD project is based on the hypothesis that markers for individualized medication can be identified by network-based analysis of high-throughput data. Our approach is outlined as follows:

1. Disease-associated genes are identified and organized into putative interaction networks.
2. Networks are dissected to find modules of genes with distinct biological functions.
3. Modules are further decomposed to elucidate putative pathways and individual genes with probable key regulatory functions.
4. The transcriptomal modules are expanded to include other layers ranging from DNA to protein. This is done by adding data from complementary high-throughput experiments. The ultimate aim is to obtain multi-layer modules (MLM) that include information about all layers and regulatory elements.
5. Protein markers for individual variations are extracted from those modules.
6. Gene expression modules and/or protein markers are tested to stratify patients, aiming for personalized medication.

The project was based on high-throughput methods to analyse SNPs, proteins and different regulatory elements such as microRNAs and DNA methylation. Moreover, because the layers and elements are interdependent, an analysis of dependencies can be used for step-wise cross-validation (for example, altered mRNA expression due to microRNAs).

The project was faced with several noteworthy challenges:
a) the heterogeneity of complex diseases, and in many cases limited knowledge about causal mechanisms;
b) the difficulties in finding representative study models;
c) methodological problems involved in the development of computational and bioinformatics methods to build modules;
d) experimental validation of disease mechanisms that may involve great numbers of genes, many of which have unknown or poorly defined functions;
e) because the external causes are often unknown it is difficult to model the disease process experimentally.

MULTIMOD was based on ongoing multi-disciplinary collaborations between clinically active researchers and leading experts in genomics, systems medicine, computer science, bioinformatics and statistics. In summary the aims were:
- to develop and apply methods to form multi-layer modules in a complex disease,
- to analyse the modules to understand disease mechanisms and individual variations,
- to find gene expression or protein markers of those variations,
- to apply the markers diagnostically in an effort to predict treatment response, and
- to make the resultant bioinformatics methods widely available in a standardized form (e.g. as web-based tools) in order to facilitate other studies of complex diseases.

The development and application of methods were based on network medicine (Barabasi et al. Science 1999, New England J Med 2007), as well as the applicants' experiences of genome-wide association studies (Sladek et al. Nature 2007), network models of gene expression data (Jenssen et al. Nat Genet 2001, Voy et al. PLoS Comput Biol 2006) and linking such models to genetic variations (Chesler et al. Nat Genet 2005). In MULTIMOD those methods were applied and further developed on allergen-challenged lymphocytes from patients with seasonal allergic rhinitis (SAR). This is an optimal disease model because it is common, well-defined, has a known external cause (pollen) and can be analyzed in both experimental and clinical studies. In the experimental studies, peripheral blood mononuclear cells were challenged with allergen in vitro. This mimics the in vivo situation, in which antigen presenting cells process the allergen, and present it in peptide forms to T lymphocytes. In allergic patients, those T lymphocytes differentiate into Th2 cells, which activate different effector cells. In clinical studies, signs and symptoms are readily assessed using physical examination and symptom scores. Moreover, we and others have found that gene expression patterns in allergen-challenged T cells are partially mirrored by mRNA and protein changes in the nose (for example, Benson et al. J Allergy Clin Immunol 2006). It is thus possible to link diagnostic markers to symptoms and signs, and thereby treatment response.

Some reservations should be mentioned: The Th1/Th2 cytokine concept is simplified. Other T cell subsets have important regulatory roles. T regulatory (Treg) cells produce the anti-inflammatory cytokines transforming growth factor and IL-10 (Umetsu 1997, Ling 2004). Increased activity of Treg is seen in healthy subjects compared to allergic patients (Akdis 2004) and may explain the beneficial effects of immunotherapy (Akdis 2007). Th17- and NKT cells are other examples of important T cell subsets (Akbari 2003, Kaiko 2008). Moreover, B cells, eosinophils, mast cells and even epithelial cells are capable of producing Th1 or Th2 like cytokine profiles (Holgate 2007, Schleimer 2007).

Despite these reservations, the involvement of multiple cells and genes in SAR resembles that in other diseases. The analytical methods and principles of this project may therefore be applicable to other complex diseases. In order to facilitate such applications and to disseminate the methods, the analytical tools have been made available on the Internet in a standardized format for studies of other complex diseases.

An important objective of MULTIMOD has been that the identification of diagnostic markers and gene expression patterns should be based on functional understanding of disease mechanisms. This has been sought on two levels: 1) by siRNA screens of candidate genes. These screens have been performed by our SME partner, Cenix, using Th2 cytokines as read-outs. The knockdowns have been repeated for genes that affected Th2 cytokines, and their functions have been studied by mRNA microarray and pathway analyses; 2) by detailed experimental studies, for example using a mouse model of allergy, and comparing with a knock-out mouse.

Project Results:

The main Science and Technology (S&T) results/foregrounds for WP1

Summary
The MULTIMOD project revolves around a central concept, i.e. that disease-associated modules can be found in gene expression networks. These transcriptional modules are used as templates on which other layers and regulatory elements are superimposed. The resulting MLM are dissected to find individual variations and protein markers for those variations. The clinical relevance of these markers is tested by examining if they can be used to stratify patients for personalized medication.

Objectives

- to develop and apply methods to form MLM in a complex disease,
- to analyse the MLM to understand disease mechanisms and individual variations,
- to find protein markers of those variations,
- to apply the markers diagnostically in an effort to predict treatment response
- to make the resultant bioinformatics methods widely available in a standardized form (e.g. as web-based tools) in order to facilitate other studies of complex diseases

The MULTIMOD WPs are highly integrated. Clinical work is being performed in WP1, computational analyses in WP2, statistical in WP3, bioinformatics in WP4 and experimental work in WP1, WP3 and WP5. The deliverables are also highly interconnected and have been addressed with complementary methods. Therefore the same deliverable may be referred to in more than one place in the description below:

In silico construction and validation of a module responsible for Th1/Th2 differentiation (D1:1)

Since Th1/Th2 differentiation plays a key role in allergy and other inflammatory diseases, WP1 and WP3 have constructed and validated a Th1/Th2 differentiation module in silico. This was done by automated extraction of genes from 20 million abstracts in Medline. Simulation studies of activation of the module indicated that genes previously thought to antagonize each other in fact tended to synergize. This was supported by analyses of mRNA expression data from 11 inflammatory diseases, which showed that those genes were positively correlated (Pedicini M et al, PLoS Comput Biol. 2010). We could validate this in clinical and experimental studies of allergen-challenged T cells from allergic patients (Wang H et al, J Allergy Clin Immunol. 2009). We propose that our strategy for in silico studies may be generally applicable to complex diseases and of value for mining the vast amounts of data available in the public domain.

A module of genes relevant for Th1/Th2 cell differentiation, identified by automated mining of 20 million abstracts in Medline. Black edges depict positive regulation; red edges depict negative regulation (Pedicini et al. PLoS Comput Biol 2010).

Identification of protein markers based on transcriptomal modules in monozygotic twins (D1:1,2,4)

With the help of the Swedish and Italian twin registries we identified monozygotic twins that were discordant for SAR. We performed mRNA, exon, methylation and microRNA microarray analyses of allergen/diluent-challenged T cells from these subjects. We identified differentially expressed genes and showed that the corresponding proteins were also differentially expressed.

However, similar to a study of monozygotic twins discordant for multiple sclerosis (Baranzini et al. Nature 2010), microarray studies showed no differences in methylation or microRNAs (Sjogren et al. Allergy 2012). This led us to develop novel methods to define modules, first by focusing on genes of known relevance for allergy and secondly by using network principles to prioritize disease-associated genes.

Development and validation of a network-based method to identify modules in complex diseases (D1:1) and to test whether they were enriched for disease-associated SNPs (D5). This method was also used to define module subtypes (D4) and for the standardised analytical program (D7)

Summary:

In a published study we used network principles to show that mRNA modules were generally applicable to prioritize disease genes in complex diseases (D1:1). The network principle is that genes whose protein products interact tend to be co-expressed. This implies that genes that are differentially expressed in patients with the same disease will co-localize in the protein-protein interaction (PPI) network and form disease-related modules. Another important principle is that the most interconnected genes tend to be most important for the disease. In this study, we referred to such modules as susceptibility modules or SuMs. We analyzed original gene-expression and genome-wide association study (GWAS) data from almost 5000 subjects, as well as public data representing hundreds of thousands of patients. We found that an mRNA module of highly interconnected disease genes from SAR patients was highly enriched for disease-associated SNPs defined by the GWAS of 5000 patients. siRNA-mediated knockdown of two novel candidate genes in the module resulted in significant changes in known disease genes. Similar observations were made for modules defined by studies of other diseases, showing the general applicability of the strategy. The article is highlighted as highly accessed (see http://genomebiology.com/2012/13/6/R46 online). Since we found the method generally applicable to complex diseases, we also performed a study that focused only on T cell associated diseases, and asked whether there were systems-level principles to define modules and module subtypes (D4). The methods from this work have been made publicly available on the Internet to identify modules (D7).

This project is described in more detail below:

The SuM can be isolated from the rest of the PPI network and analyzed using methods from the field of network theory. The aim of this study was to use SuMs to identify the most important disease genes. We used genes harbouring disease-associated genetic polymorphisms to measure our success in this effort.

A common measure of a gene's importance in a network is its centrality in the network. In other words, if a gene is closely connected to all parts of the network, this implies that it is of importance. We hypothesized that the most interconnected genes in SuMs (referred to as Core SuM) would be enriched with disease-associated polymorphisms.

An important limitation in any study that is based on PPIs is that such resources contain inherent biases. For example, well-known disease genes (like the cancer genes BRCA1 and TP53) and their interactions are extensively studied, compared to newly discovered genes without disease associations. To overcome this limitation, we constructed SuMs from gene expression microarray data, which measure the expression of all genes without knowledge bias. Furthermore, we tested the hypothesis using disease-associated polymorphisms identified by genome-wide association studies, which are also free from knowledge bias.

We developed a novel method to identify SuMs in the human PPI network, using gene expression microarray data from allergen-challenged CD4+ cells. We then tested the hypothesis using GWAS data from the North Finland Birth Cohort. Genetic polymorphisms in the most interconnected SuM genes were 3.4 times more likely to be disease-associated than random genes. Many of these genetic polymorphisms resided within the gene FGF2, which has not previously been implicated in allergic disease. We therefore performed RNAi knockdown of FGF2 in Th2-polarized T cells, which served as a model of allergen-challenged CD4+ cells. We found that knockdown of FGF2 affected two genes in the core SuM, namely NFKB1 and TLR2. NFKB1 is a well-known transcription factor which takes part in the T cell receptor signalling pathway, among others. TLR2 is involved in the recognition of bacterial antigens and has been shown to activate NFKB1.

To examine if the findings in the study of SAR were applicable to other complex diseases, we proceeded to define SuMs and core SuMs in complex diseases other than SAR. We examined if those SuMs were enriched for genes with disease-associated SNPs identified by GWAS. These analyses were performed in diseases for which gene expression microarray data from relevant cells or tissues were available in the public domain, and where GWAS genes had been described. Thirteen oncological, auto-immune or psychiatric diseases fulfilled these criteria. We found that compared to the whole PPI network both the SuM and core SuM were significantly enriched for GWAS genes associated with their respective diseases. The enrichment was stronger in the core SuMs than in the SuMs (4.71-fold and 2.22-fold respectively). By contrast, using only differentially expressed genes we found a mere 1.15-fold enrichment of GWAS genes, which was not statistically significant.
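The fold-enrichment figures above can be illustrated with a minimal sketch. It assumes that enrichment is the fraction of GWAS genes in a module divided by the fraction in the whole PPI network, and it assumes a one-sided hypergeometric test for significance; the report does not state which test was used. The function name `fold_enrichment` and the toy gene labels are hypothetical.

```python
from math import comb


def fold_enrichment(module_genes, gwas_genes, background_genes):
    """Return (fold, p) for GWAS-gene enrichment in a module versus a background.

    fold = (GWAS fraction in module) / (GWAS fraction in background).
    p    = one-sided hypergeometric tail probability (an assumed choice of test).
    """
    background = set(background_genes)
    module = set(module_genes) & background
    gwas = set(gwas_genes) & background
    N, n, K = len(background), len(module), len(gwas)
    k = len(module & gwas)
    if n == 0 or K == 0:
        raise ValueError("module and GWAS gene set must overlap the background")
    fold = (k / n) / (K / N)
    # Probability of drawing at least k GWAS genes in a random module of size n.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return fold, tail / comb(N, n)


# Toy data (hypothetical): 100 background genes, 10 GWAS genes,
# and a 10-gene module that contains 5 of the GWAS genes.
background = [f"g{i}" for i in range(100)]
gwas = [f"g{i}" for i in range(10)]
module = [f"g{i}" for i in range(5)] + [f"g{i}" for i in range(50, 55)]

fold, p = fold_enrichment(module, gwas, background)
print(f"fold enrichment = {fold:.2f}, one-sided p = {p:.2e}")
```

In this toy case half of the module genes are GWAS genes against one tenth of the background, so the fold enrichment is 5.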

During these analyses, core SuMs were defined as the 10% of the genes in each SuM with the lowest average shortest path length (ASPL). We reproduced this test with different cut-offs, varying from 100% to 1% of the SuM. We found that a more stringent cut-off resulted in a stronger enrichment of GWAS genes in the resulting core SuMs.
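The core selection described above can be sketched as follows. The sketch assumes that ASPL means the average shortest path length from a gene to every other gene within the module, that the module is a single connected, unweighted graph, and that ties are broken alphabetically; none of these details is specified in the report. The names `core_module` and `average_shortest_path_length` and the toy edges are hypothetical.

```python
from collections import deque


def average_shortest_path_length(adjacency, source):
    """Mean breadth-first-search distance from source to all other reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others) if others else float("inf")


def core_module(edges, fraction=0.10):
    """Return (core_genes, aspl) where core_genes is the given fraction of
    module genes with the lowest ASPL. Assumes a connected module."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    aspl = {gene: average_shortest_path_length(adjacency, gene) for gene in adjacency}
    n_core = max(1, round(fraction * len(adjacency)))
    ranked = sorted(aspl, key=lambda gene: (aspl[gene], gene))
    return ranked[:n_core], aspl


# Toy module (hypothetical): one hub connected to nine peripheral genes.
toy_edges = [("HUB", f"gene{i}") for i in range(9)]
core, aspl = core_module(toy_edges, fraction=0.10)
print(core, aspl["HUB"])
```

Lowering `fraction` gives the more stringent cut-offs mentioned in the text; the sketch does not attempt to reproduce how the module itself was built.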

We also observed an overlap between the core SuMs, which prompted us to ask if the core SuMs were generally associated with increased susceptibility for complex diseases. To test this, we examined the union of all the genes in the 13 core SuMs for enrichment of all reported GWAS genes from complex diseases. This test comprised 694 GWAS genes associated with 114 complex diseases. We found a 3.9-fold enrichment, compared to the whole PPI network. The enrichment was stronger (7.7-fold) when only considering GWAS genes associated with more than one disease.

Define modules and module subtypes in allergy and other T cell associated diseases (D4)

In a meta-analysis of GWAS genes we found that Th cell differentiation was the most enriched pathway. This led us to focus only on T cell associated diseases. Using the method described above we defined modules in 8 T cell associated diseases, based on analysis of public mRNA microarray data. Similar to above, we found a shared susceptibility module (SuM). This SuM was enriched for key pathways, like T cell differentiation but also proliferative and metabolic pathways. A likely interpretation is that these pathways are interdependent and highly interconnected. A negative consequence of this is that a malfunction in one pathway may spill over to other pathways, and cause not only one but several diseases. This led us to examine if the shared SuM was diagnostically and therapeutically relevant. Indeed, the SuM was highly enriched for known diagnostic markers and drug targets. Next, we focused on original data from two T cell associated diseases, thought to represent opposing ends of T cell differentiation, namely multiple sclerosis (MS) and SAR. Analysis of original GWAS data from 30,000 patients and healthy controls showed that genes harboring disease-associated SNPs in both diseases were enriched in the shared SuM. We tested if the SuM would have module subtypes in the two diseases. The MS patients were followed during one year of treatment with Tysabri, a specific antibody against an adhesion molecule that is important for T cell passage through the blood brain barrier, but that also may affect T cell differentiation. The SAR patients were followed during two weeks of treatment with local glucocorticoids (GCs). In both diseases, high and low responders were defined (HR and LR, respectively). In MS we found no SuM subtypes between HR and LR. Instead, MS-specific genes differed between HR and LR. By contrast, in SAR, SuM subtypes were found, and these correctly classified HR and LR with 85% accuracy.
The importance of this study lies not only in the definition of module subtypes, but also in the identification of systems-level principles to define module subtypes.

Define transcriptomal disease module by finding transcription factors (TFs) that can regulate key proteins in allergy, and annotation of genes with poorly defined functions (D1:1, D1:6, D3)

We hypothesised that a transcriptomal disease module could be defined by finding TFs that regulate two key proteins in allergy, namely IL-5 and IL-13. The underlying hypotheses were that those TFs would also regulate other key genes in allergy and that those genes would form a module. We searched for known and novel TFs regulating IL-5 and IL-13 using either the literature or sequence-based predictions. WP5 (Cenix) knocked down thirty TFs using siRNA. We found that knockdown of 7 of the 30 TFs had an effect on IL-5 and IL-13. We repeated the knockdowns for those 7 TFs and performed mRNA microarrays before and after knockdown. This showed that about 100 genes were co-regulated by the 7 TFs. Those genes were highly interconnected (P < 0.0001) and formed a module. In this module, we found 15 genes whose functions were novel in the context of allergy or poorly annotated. We examined if they affected the release of IL-5 and IL-13. We found that the knockdowns were effective for eleven of the genes, namely S100A4, PVRL2, TGFBI, MX1, ASCL2, IRF7, IFI35, PYCARD, DHX58 and SYTL2. Knockdown of each successfully down-regulated gene resulted in a strong change in IL-5 and IL-13 expression.

We proceeded to validate three of the novel genes, namely S100A4, TGFBI and IL1A. First, we analysed the proteins in nasal fluids and found increased levels during the pollen season, compared to before the season, which decreased following treatment with glucocorticoids. We also analysed allergen- and diluent-challenged skin from allergic patients and found increased expression after allergen challenge. We also performed functional studies of S100A4 in a murine model of allergy, using S100A4 knock-out mice.

Mice were immunized with OVA in Alum, a well characterized Th2-polarizing sensitization and then challenged in the ear with OVA, which resulted in a typical allergic inflammation. S100A4 deficient mice showed a significantly suppressed response to challenge with OVA. Ear swelling was reduced by more than 70% in S100A4-/- mice compared with wild-type controls. The reduced ear swelling was associated with decreased Th2 cell activity, antibody levels and effector cells in the ear, draining lymph nodes or serum. Specifically, significantly reduced neutrophil and dendritic cell infiltration were observed in the provoked ears of the S100A4-/- mice. Although not significant, a trend of reduced eosinophil infiltration was also observed in S100A4-/- mice. The recruitment of CD8+ T cells, which contributes to tissue damage, was also compromised in S100A4-/- mice. No substantial CD4+ T cell infiltration was observed after provocation in the ears of either S100A4+/+ or S100A4-/- mice. The severity of the dermatological inflammatory reaction can also be reflected in the recruitment of inflammatory cells to the cervical lymph nodes that drain the area of provocation. The recruitment of CD4+ and CD8+ T cells, neutrophils and dendritic cells in the cervical lymph nodes 24 hours after challenge was lower in S100A4 deficient mice as compared to wild-type mice. Taken together, these data suggest that leukocyte recruitment and migration both at the effector site, i.e. ear, and the regulatory site, i.e. draining lymph nodes, were suppressed in S100A4-/- mice following intradermal allergen provocation. Impaired recruitment of leukocytes to sites of inflammation in S100A4-/- mice has been reported previously.

Define disease modules co-regulated by the same microRNAs (D1:2)

The above results led us to examine if microRNAs could also be used to find co-regulated disease modules. WP3 (RR-HF) analysed allergen/diluent-challenged CD4+ T cells with 80 mRNA and 80 microRNAs. We found two distinct modules, which contained both known and novel disease genes. We also identified three candidate microRNAs, miR139, miR647 and miR139p, which were validated with QPCR.

Over-expression and down-regulation of individual microRNAs or pairs of microRNAs in Th2-polarised cells showed their importance for the release of IL-5 and IL-13.

DNA methylation, but not gene expression, splice variants or CNVs, stratifies patients with SAR (D1:3-6)

Altered DNA methylation patterns in CD4+ T-cells indicate the importance of epigenetic mechanisms in inflammatory diseases. However, the identification of these alterations is complicated by the heterogeneity of most inflammatory diseases. SAR is an optimal disease model for the study of DNA methylation because of its well-defined phenotype and etiology. We generated genome-wide DNA methylation profiles of CD4+ T-cells from SAR patients and healthy controls, during and outside the pollen season. DNA methylation profiles clearly and robustly distinguished SAR patients from controls, during and outside the pollen season. Moreover, we found that this methylation signature correlated with symptom severity. In agreement with previously published studies, gene expression profiles of the same samples failed to separate patients and controls. Separation by methylation, but not by gene expression was also observed in an in vitro model system in which purified PBMCs from patients and healthy controls were challenged with allergen. We associated these differences with changes of central memory T-cell populations between patients and controls, and to targeting of a transcription factor of known relevance to allergy in CD4+ T-cells. Our study highlights the potential of genome-wide epigenetic technologies in the stratification of immune diseases.

Identification of protein markers based on transcriptomal modules and module subtypes (D1:6, D4, D6).

We defined differentially expressed genes in a study of allergen/diluent-challenged CD4+ T cells in monozygotic twins that were discordant for allergic rhinitis. We showed that the corresponding proteins were also differentially expressed. However, microarray studies showed no differences in methylation or microRNAs (Sjogren et al. Allergy 2012).

In order to find diagnostic protein markers for treatment with glucocorticoids in nasal fluids we analyzed mRNA microarray data from nasal fluids, nasal fluid cells and nasal mucosa in combination with proteomic data from nasal fluids. We identified several disease-associated pathways. Candidate diagnostic proteins were identified from those pathways and validated with ELISA (Wang et al. Allergy 2011).

In the next study we combined proteomic, pathway and multivariate analyses to find potential diagnostic markers. We hypothesised that proteomic analysis of nasal fluids from patients that showed a high response (HR) to treatment with glucocorticoids would help to identify such markers. We started by comparing HR to low responders (LR).

We proceeded to examine the protein that had the highest scores using ELISA. We also examined if cells and cellular subsets from patients with SAR would respond to treatment with glucocorticoids. In the patients, GCs led to a significant decrease of VAS, from 8.22 ± 0.33 to 3.58 ± 0.70 (P < 0.0001). ENO was significantly higher in patients than in controls, 66.50 ± 15.72 compared to 22.44 ± 3.17, and this value decreased to 48.75 ± 5.31 after treatment with GCs (both P < 0.01). We analyzed the expression of all known major immune subsets in blood from patients after GC treatment. Among the major immune subsets, we found a significant change of NK cells between patients before and after GC treatment (P < 0.05). However, no other major immune subsets changed between patients before and after GC treatment. We then compared the expression of CD4+ T subsets in patients with SAR before and after GC treatment and found that CD4+ TCM cells were reversed by GC treatment (P < 0.05). However, Th1 cells were found to be decreased in patients after GC treatment (P < 0.05). In the analysis of B cells and corresponding subsets, we found decreased memory B cells and increased naive B cells in patients with SAR after GC treatment (P < 0.05). In addition, CD8+ T effector memory (TEM) cells increased in patients with SAR after GC treatment (P < 0.05).

Analysis of gender differences (D2)

Gene expression microarray analysis was performed with freshly isolated CD4+ T cells from 48 patients with SAR during the season, of which 26 were female. Multivariate analysis was used to identify gender differences in patients with SAR, namely principal component analysis (PCA) and orthogonal partial least squares-discriminant analysis (OPLS-DA). OPLS-DA predictive loadings plot with significant confidence intervals and the S plot were used for the extraction of genes correlated to the discrimination. The Ingenuity Pathways Analysis (IPA) software was used to map genes onto known canonical pathways. A Fisher's exact test was used to calculate a P value determining the probability that the association between the proteins in the dataset and the canonical pathway is explained by chance alone. Pathways with a P value less than 0.05 were considered to be statistically significant.

Results

PCA with the GEM data from 26 female and 22 male patients with SAR showed that the two groups could be partially discriminated in the 6th component. Next, OPLS-DA was performed to identify genes that correlated to the discrimination between the two groups. The predictive variation between the GEM data and the discriminator variable used 3% of the GEM data (according to R2X) with an explained discrimination value of 98% (according to R2Y). The total predicted variation was 50.4%. The model was indicated to be significant according to CV-ANOVA (P = 0.006). The loading plot with confidence intervals and the S plot from the OPLS-DA model were used for the extraction of genes that correlated to the discrimination: genes with a p(corr)[1] ≤ -0.3 or ≥ 0.3 and a confidence interval that does not cross the 0 axis in the loading plot were selected for the pathway analysis.

Development of a standardized analytical program to define multi-layer modules that will be made available on the Internet in order to facilitate other studies of complex diseases (D7)

We developed a standardized analytical program that identifies markers that separate any groups of subjects. The program first identifies markers using a graph-based method (Barrenas et al. Genome Biology 2012) and plots a histogram showing how well the groups are separated by these markers. To facilitate the prioritization and categorization of these markers, the program also maps the markers onto a network provided by the user. It then identifies a sub-network in which the identified markers are enriched and outputs it as a two-column from-to matrix, where every row represents an interaction between two markers.
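The output format can be illustrated with a small sketch. It does not reproduce the enrichment step of the program: as a stand-in, it simply keeps the interactions of the user-supplied network whose two endpoints are both markers. The function name and the node labels are hypothetical.

```python
def marker_subnetwork(edges, markers):
    """Return a from-to table of interactions between markers.

    edges:   iterable of (node_a, node_b) pairs from the user's network
    markers: iterable of marker names
    Each returned row is one undirected interaction; duplicates and
    self-loops are dropped, and rows are sorted for reproducibility.
    """
    marker_set = set(markers)
    rows = {
        tuple(sorted((a, b)))
        for a, b in edges
        if a in marker_set and b in marker_set and a != b
    }
    return sorted(rows)


# Hypothetical network with one duplicated interaction (A-B and B-A).
network = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "A")]
table = marker_subnetwork(network, ["A", "B", "C"])
for source, target in table:
    print(source, target, sep="\t")
```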

The main Science and Technology (S&T) results/foregrounds for WP2

Summary of Project Objectives

- Develop novel graph algorithms for the analysis of high-throughput biological data.
- Synthesize scalable algorithmic implementations on cutting-edge computing platforms.
- Use these implementations to extract modules (distilled gene sets) suggestive of co-regulation.
- Perform genomic data mining in order to highlight the most promising modules for detailed scrutiny.
- Integrate knowledge of these modules with other biological data types in order to obtain a more comprehensive understanding of allergic disease.

Summary of Project Results

- We have designed and synthesized innovative graph theoretical tools for biological network analysis.
- We have used our experience in fixed-parameter tractability to make our methods parallel, scalable and highly efficient.
- We have implemented our techniques on a variety of high performance computational platforms.
- We have used our tools to analyze transcriptomic, methylation and other data as provided to us by Professor Benson and his colleagues.
- We have constructed algorithmic toolchains so as to provide open access methods to the community at large.
- We have expanded the notion of differential expression to differential correlation and differential topology.
- We have investigated new forms of differential analysis that utilize entropy and coefficient of variation as supplements to differential expression.
- We have performed comprehensive clustering comparisons on well annotated data and across a host of popular methods.
- We have devised new algorithms for related problems such as out of core computations and maximum clique enumeration.

The main Science and Technology (S&T) results/foregrounds for WP3

Project context and objectives

The advent of high-throughput technologies within genomics is generating huge amounts of data that could provide new insights with respect to the genetic basis of complex disease phenotypes. A common output in many of these analyses is a list of genes that are associated with a given phenotype. The latter may be coined a disease-associated module. Using RNA expression profiling as an example, a module will represent a set of genes that are highly differentially expressed between cells from healthy and diseased human tissue. A major challenge in this respect is to identify and understand, using current knowledge and other available data, how the gene module may be mechanistically related to the disease state.

The primary objective of WP3 (Functional annotation of modules) is to develop and invent new tools and databases that may increase our ability to interpret the functional role of modules. A central part of our work builds upon the customization of a powerful tool for visualizing and analyzing biological relationships that are apparent within the biomedical literature. Specifically, we take advantage of previously developed text mining techniques to extract a network of biomedical knowledge, known as PubGene. At the core of the PubGene tool lies a database of significant literature associations between biomedical entities, where the latter encompass a range of different types, such as diseases, genes, drugs, biological processes and symptoms (D12).

The following lists the objectives for both a) inventions in understanding gene modules/networks and regulatory mechanisms of gene expression, and b) an improved and customized version of PubGene that seeks to address the main challenges for functional annotation of MULTIMOD gene modules:
- For genes with no literature associations, develop a method that utilizes sequence homology to genes with literature associations to assess their putative function (D13.1)
- A customized interface to the literature association database that targets the field of interest (i.e. immunology, although the method should also be applicable to other domains) (D13.2)
- Integrate information about common coding and regulatory DNA variation in human genes from publicly available databases. This will enable rapid identification of possible variants that could modulate gene expression (D13.3)
- Assemble thesauri of tissue-specific cell lines and index them within the biomedical literature so that gene modules may be interpreted in the context of relevant cell lines and tissues (D13.4)
- A quantitative method to score genes of unknown function in a context-sensitive way, i.e. to what extent is gene A related to immunology? The method will perform a customized text mining scheme to index genes with a list of expert keywords and is applicable also to other domains (D14)
- Invent a method that utilizes the topological properties of protein interaction networks to predict higher-order protein complex interactions and functional communities in the cell (D15)
- Combine recent data on DNA footprints in accessible chromatin and DNA variation data from population-scale sequencing to assess the selection pressure on transcription factor binding sites (D16)

Summary of main results and achievements

Standard text-mining methodologies have been applied to capture the current knowledge of relationships between biomedical entities (D12). The various classes of entities represent multiple dimensions in the global set of cellular interactions, and will thus be explored to identify the potentially multiple layers of modules associated with the complex disease that is studied.

Our system consists of two primary parts:
- A database of interactions between biomedical entities, as extracted from the scientific literature (i.e. MEDLINE)
- A web-based tool for browsing the full set of interactions in the database and exploring gene modules of interest in the context of current biomedical knowledge

This database has more specifically been created based on:
- Custom-made software for extraction of biomedical terms in text
- A set of publicly available thesauri on biomedical concepts, subject to minor degrees of manual curation
- Measures of connectivity between biomedical entities (e.g. as a means to identify interactions), primarily the frequency of co-occurrence in scientific abstracts
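The co-occurrence measure can be sketched as follows. This is a minimal illustration only: it counts in how many abstracts two terms appear together, and it assumes that term extraction has already produced one collection of terms per abstract. The significance testing that the database applies to these counts is not shown, and the function name and the example terms are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstract_terms):
    """Count, for every pair of terms, the abstracts mentioning both.

    abstract_terms: iterable with one collection of extracted terms per
    abstract. A term repeated within an abstract is counted once.
    Pairs are stored as alphabetically sorted tuples.
    """
    counts = Counter()
    for terms in abstract_terms:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Three hypothetical abstracts with placeholder term labels.
abstracts = [
    ["A", "B", "C"],
    ["A", "B", "B"],
    ["C"],
]
pair_counts = cooccurrence_counts(abstracts)
```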

The dictionaries that have been used include official gene and protein names from NCBI, Gene Ontology (molecular function, biological process, cellular component) terms, chemical terms (PubChem), diseases (Medical Subject Headings (MeSH)), anatomy terms (MeSH), and drugs (DrugBank). A web interface to the tool and database (PubGene, now known as Coremine Medical Browser) is available at http://www.coremine.com/ and customized version

In order to increase the scope of functional annotation of gene modules, RR-HF has invented a method that incorporates sequence homology data as an additional information layer on top of the basic literature network (D13.1). The idea is that a gene with a poorly defined function can be subject to a 'knowledge transfer' from genes that are highly similar in sequence and thus presumably also in function. The first part of the method involves the actual computation of homology relationships. For this we performed an all-to-all sequence comparison of human proteins and created a database of the results that is easily accessible from the Coremine Medical Browser (in the right panel of search results for a given gene, click on Show more information > Sequence homology). Whereas most protein sequence similarity searches use the common tool BLAST, we have used a new algorithm known as SWIPE that provides better accuracy with only minor sacrifices in terms of speed [1]. In the next part, we identified approximately 2,000 human protein-coding genes that do not carry any associations within the literature, i.e. no significant co-occurrences were found between any of these genes and other concepts during indexing of MEDLINE.

In order to assign putative relationships to other concepts for these genes, we used the following scheme:
1) Assume gene X has no literature associations, but several significant (p < 0.05) homologous human proteins (e.g. H1, H2 and H3).
2) The homology relationship between gene X and gene H1 is usually given in the form of an E-value, but this can easily be converted to a p-value P [2].
3) If gene H1 is significantly associated with another concept A (e.g. a GO process) with a p-value of Q1 (< 0.05) in the literature, assign a p-value of W = (1 - P)(1 - Q1) to the relationship between concept A and gene X.
4) If more than one homolog of X is significantly associated with concept A (e.g. homolog H2), use the minimum value of W that can be obtained (i.e. best-hit).

An implementation of the method on human genes of unknown function can be viewed online (see http://folk.uio.no/sigven/multimod/homology_knowledge_transfer/ online), where we only show the 'inferred' associations for concepts within the Gene Ontology biological process category.
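The conversion mentioned in step 2 can be sketched under a common assumption. In BLAST-style statistics the number of chance hits is modelled as Poisson distributed, which gives P = 1 - exp(-E). Whether reference [2] uses exactly this relationship is an assumption on our part, and the function name is hypothetical.

```python
import math


def evalue_to_pvalue(e_value):
    """Convert an alignment E-value to a p-value, assuming a Poisson model.

    P = 1 - exp(-E) is the probability of observing at least one hit with
    a score this good by chance. expm1 keeps precision for very small E.
    """
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    return -math.expm1(-e_value)


# For small E-values the p-value is almost identical to the E-value,
# whereas large E-values saturate towards a p-value of 1.
small = evalue_to_pvalue(1e-10)
large = evalue_to_pvalue(10.0)
```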

An important intention of our knowledge discovery browser is to provide the users with a focused version of the interactions that appear relevant to the complex disease in question (D13.2). RR-HF has chosen to focus on the immunology field, since this field is tightly linked to the pathogenic mechanisms of seasonal allergic rhinitis (SAR). Our strategy for customizing our database towards immunology was performed in three steps:
1. Retrieve a list of highly relevant journals within immunology (using ISI Web of Knowledge impact factor)
2. Map the unique identifier of each of these journals (ISSN) to entries (e.g. publications) in the MEDLINE database
3. Extract biomedical terms from the identified abstracts and create the association database
The current 'immunology filter' has provided an immunology-focused version of the PubGene system (see http://www.coremine.com/prototype_immunology online). A key figure for the underlying database is the approximately 4,000 human genes that are deemed relevant to immunology. We believe this method is applicable to any other domain with a known set of domain-relevant journals.

An ultimate goal during interpretation of the disease-associated modules is to find the genetic variants that are responsible for the phenotypic variation. In order to obtain potentially phenotypically relevant SNPs in immune-related genes, we have queried the most comprehensive database of single nucleotide polymorphisms (dbSNP, NCBI, USA) at the coordinates of roughly 4,000 genes that were deemed relevant to immunology.

We have further filtered the set of retrieved SNPs according to those that satisfy either of two different criteria:
1) SNPs changing the amino acid sequence of genes (i.e. missense SNPs), and
2) those SNPs that have been shown to be associated with complex disease phenotypes in published genome-wide association studies (GWAS).

With regard to the missense SNPs, we also provide predictions of their phenotypic effect on the encoded protein, by means of known computational algorithms (e.g. SNPeffect, see http://snpeffect.vib.be online). Complex-disease associated SNPs were collected from SNPedia (see http://www.snpedia.com online), an online resource on SNP-trait associations. In the general Coremine Medical Browser, we have recently also incorporated high-confidence variants from the 1000 Genomes project and annotated these with known genomic locations where transcription factor binding sites cluster and chromatin accessibility is high (specifically the two UCSC genomic tracks wgEncodeRegDnaseClustered and wgEncodeRegTfbsClustered developed by the ENCODE project). The latter is done to pinpoint potentially important variants in regulatory regions of human genes (D13.3).

Functional annotation of gene modules should incorporate information about tissue and cell specificity (D13.4). In order to address this need, we have added two features to the knowledge discovery browser. First, we take advantage of the fact that a significant number of MEDLINE abstracts have been manually annotated with tissue terms from the Medical Subject Headings (MeSH). This information is used during recording of concept co-occurrences and thus represents a complementary literature approach for linking tissues with genes, diseases, drugs etc. Second, using several publicly available cell line providers and databases, we have manually compiled a collection of more than 1,600 commonly used cell lines in biomedical research, and indexed these in the biomedical literature. For the latter results, we have created a novel web database (see http://dev.pubgene.com/cellmine online) that can be queried with cell line names to retrieve significant concept associations in the biomedical literature (Nakken et al., accepted, Bioinformation, 2012).

To further develop methods to annotate genes of unknown functions, we have implemented a general algorithm for the assignment of context scores to a given biomedical entity (D14).

Although many genes carry annotations that link them to certain diseases, molecular processes etc., the large body of scientific literature provides a novel means of automatically assigning a context score to any given gene. Based on a recently published approach [3], we take a controlled collection of terms relevant for a given context (in our case 1,921 immunological terms, manually created by domain experts), and use these terms and their relationships to gene citations in MEDLINE to quantify the 'immune messages' encoded by human genes. By this, it is implied that immune-relevant genes have a level of immune information content, quantified using this combined set of immune terms in MEDLINE, which is greater than that of genes that play a lesser role in the immune system. Information theory calculations were used to measure the size of the immunological message stored for each human gene with respect to these terms. The probabilities in the information theory calculations are defined through the frequency with which a given gene is cited with a given immune term, relative to the number of times the immune term is cited in MEDLINE among all human genes with that term. This measure of immune information content for a gene may be biased by the higher frequency with which certain genes are associated overall with the sources of the immune terms, i.e. the popularity of a gene among all terms in the biomedical vocabularies. This bias was corrected for using a measure from information theory known as the Kullback-Leibler (KL) divergence.
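For reference, the KL divergence itself can be computed as in the sketch below. It only illustrates the measure: how the gene-specific and background distributions are built from MEDLINE citations follows [3] and is not reproduced here, and the function name and the term labels are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    p and q map each term to a probability; both must sum to 1 and q must
    be non-zero wherever p is non-zero. Terms with p = 0 contribute 0.
    """
    total = 0.0
    for term, p_i in p.items():
        if p_i > 0:
            total += p_i * math.log2(p_i / q[term])
    return total


# Hypothetical example: a gene cited only with term_1, against a
# background in which both terms are equally common.
gene_profile = {"term_1": 1.0, "term_2": 0.0}
background = {"term_1": 0.5, "term_2": 0.5}
score = kl_divergence(gene_profile, background)
```

A gene whose term profile matches the background receives a divergence of zero, so only departures from the general popularity pattern contribute to the score.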

Protein interactions constitute a key element in our understanding of disease-associated gene modules. Although the human protein-protein interaction network per se is of importance as a layer of annotation for understanding complex disease modules, an additional layer that has not yet been fully explored appears at the level of protein complexes. In order to understand the higher-order topological structure of protein interactions in the cell, we have thus developed a method that predicts interactions between protein complexes (D15). We combined data on manually curated binary protein interactions from the literature (iRefIndex) and a collection of manually curated protein complexes (CORUM) to develop a quantitative score that assesses the likelihood of complex interaction (known as complex-complex degree). Using this method on a known set of complex interactions in yeast, we are able to predict nearly 50% of interactions between these molecular machines of the cell. We also demonstrate that the higher-order structure of complex interactions forms specific functional communities, revealing a sensible biological network structure of the cellular proteome. Importantly, our method can be applied to empirical protein-protein interaction data to extract new regulatory relationships in complex disease.

There is currently a large gap in our ability to interpret and assess the pathogenicity of noncoding disease-associated variants. A first step in this process is to have a map of several, potentially important non-coding sites in the genome, such as transcription factor binding sites, microRNA target sites, transcriptional enhancers, and other functional non-coding motifs.

Understanding the relative importance of these classes of regulatory sites and how they could affect expression variation is now an area of much research. We have used population-scale sequence variation data from the 1000 Genomes project in combination with position weight matrices (PWMs) of transcription factors to assess the selective pressure on sites of transcription factor binding (D16). Importantly, we have judged the importance of chromatin accessibility as a guide to functional relevance for regulatory factors. A large set of accessible sites across different cell types (DNase1 footprints) was recently discovered and made available through the ENCODE project [4]. Using our current view of transcriptionally important regions, defined as 5kb upstream and 5'-UTR regions of protein-coding genes, we found in the European population a slight excess of rare derived alleles (DAF < 0.05, Fisher exact test) in predicted binding sites within footprints compared to those predicted outside. This result was not replicated across other populations. In addition to our analysis of the derived allele frequency distribution, we developed a binding site 'perturbation' score that assessed the relative impact of the observed variant allele on motif binding (through PWM score) by comparison with all other possible mutations that could have occurred. Interestingly, we find that variants associated with disadvantageous perturbation scores are enriched within the regulatory regions of genes that are both highly differentially expressed and linked to diseases of unknown molecular origin [5] (p < 2.2e-16, Fisher exact test, Nakken et al., manuscript in preparation).
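The exact definition of the perturbation score is not given in this summary. The sketch below shows one plausible reading, offered purely as an assumption: the change in PWM score caused by the observed variant is ranked against the changes caused by every other possible single-base substitution in the same site. The function name, the data layout and the toy matrix are hypothetical.

```python
def perturbation_rank(pwm, site, position, alt):
    """Rank the observed variant among all possible substitutions.

    pwm:      list with one dict per motif position, mapping each base
              to its (log-odds) score
    site:     reference sequence of the binding site, same length as pwm
    position: 0-based index of the variant within the site
    alt:      observed variant base
    Returns the fraction of all possible single-base substitutions that
    are less damaging to the PWM score than the observed one.
    """
    observed = pwm[position][alt] - pwm[position][site[position]]
    deltas = [
        pwm[i][base] - pwm[i][ref]
        for i, ref in enumerate(site)
        for base in "ACGT"
        if base != ref
    ]
    less_damaging = sum(1 for delta in deltas if delta > observed)
    return less_damaging / len(deltas)


# Toy two-position motif with the reference site "AC".
toy_pwm = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -2.0},
    {"A": 0.0, "C": 1.0, "G": 0.5, "T": 0.0},
]
worst = perturbation_rank(toy_pwm, "AC", 0, "T")
mildest = perturbation_rank(toy_pwm, "AC", 1, "G")
```

Under this reading, a value close to 1 marks a variant that disrupts the motif more than almost any alternative mutation would have done.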

The main Science and Technology (S&T) results/foregrounds for WP4

Allergy is a common complex disease, for which prevalence is steadily increasing. The molecular basis of the allergic response is still poorly understood. However, there is evidence that the mechanism involves dysregulation of CD4+ T cells. Biological regulation of the allergic response is likely to involve multiple levels of biological information, including genetic and epigenetic information, mRNA and microRNA expression levels, and levels of protein expression, each of which interacts with the others in biological networks involving multiple cell and tissue types. The goal of the MULTIMOD project was to apply a multi-level approach to unravelling the molecular modules that are involved in the allergic response, with a focus on understanding the CD4+ T cell response.

There has been a substantial amount of research into network and pathway based approaches for modelling a single level of biological information. For example, network-based approaches have been most widely applied to the analysis of gene-expression datasets, as very high-dimensional datasets can be acquired cheaply using gene expression arrays. Gene expression arrays allow profiling of the entire transcriptome under different conditions, across multiple samples. There has been a substantial body of work published on methodology for identifying transcripts which are differentially expressed between conditions; and for identification of networks of genes which have correlated patterns of expression. Pathway based approaches have also been widely applied to the analysis of genome-wide association studies, in order to identify pathways which are enriched for association with disease outcome.

Summary of main results:

The first task we addressed in this work package was the question of what sample size would be required to identify disease-associated modules from multi-layer data. We developed approaches for estimating the sample size required to detect edges in correlation networks from gene expression data, and found that a sample size as small as 18 is sufficient to detect an edge with a probability of 0.9 at a type I error of 0.05, provided the correlation between nodes is high (greater than 0.7), but this sample size requirement becomes much higher (approximately 100) when the correlation is 0.3. This approach was extended to identify sample size requirements to detect differential edges, i.e. edges present under one condition but not another. In this case, a sample size of 18 was sufficient to detect a differential edge with 90% power and a type I error of 0.1, provided the case correlation was greater than 0.7 and the control correlation was zero, but this sample size requirement increased to approximately 130 when the case correlation was weaker (0.3), as might be expected between different biological layers.
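For the single-edge case, a textbook approximation based on Fisher's z-transformation gives sample sizes of the same order as those quoted above. The sketch below is not the estimation approach developed in this work package, which is not detailed here. It assumes a two-sided test of zero correlation, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_correlation(r, alpha=0.05, power=0.9):
    """Approximate sample size needed to detect a correlation r.

    Uses the Fisher z-transformation: atanh(r) is approximately normal
    with standard error 1 / sqrt(n - 3), so
    n = ((z_alpha + z_power) / atanh(r))^2 + 3, rounded up.
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_power = normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


n_strong = sample_size_for_correlation(0.7)
n_weak = sample_size_for_correlation(0.3)
```

The steep increase in required sample size as the correlation weakens is a direct consequence of atanh(r) appearing squared in the denominator.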

The main Science and Technology (S&T) results/foregrounds for WP5

Work Package objectives:

In this work package Cenix aimed to functionally validate sets of genes suggested by the other partners (UGOT) to have a role in allergic rhinitis through siRNA mediated knockdown and functional analysis in primary T cells.

Cenix's goal was to establish a high-throughput RNAi gene silencing protocol for primary T cells. After challenge with allergen, the functional response was to be assessed by a cytokine secretion read-out. Optimal assay conditions were to be established by use of positive control genes, confirming the RNAi-mediated silencing by qRT-PCR. This screening paradigm was then to be applied in an iterative fashion to gene sets nominated by UGOT.

Summary description of work

Initially, extensive optimization to establish a HT-RNAi and T-cell response protocol was carried out in two systems:
- Freshly isolated naïve T-Cells from buffy coats containing blood samples from non-allergic donors, using the appropriate cell isolation kits recommended by the Benson lab at UGOT.
- Frozen primary T-Cells from several donors, commercially available from Lonza.

In both cell systems we were able to achieve good RNAi-mediated silencing of control genes using electroporation (also called nucleofection) in a high-throughput format (96-well). Also, both systems responded well with the expected change in cytokine production, detected by qRT-PCR and ELISA.

Although the T-Cells provided by Lonza were isolated via a slightly different isolation protocol compared to the preferred method used by the Benson lab at UGOT, Lonza cells were chosen for the screening process for the following reasons:

- Enough cells were available from a single donor to be used for the whole RNAi screening procedure of up to 100 genes, whereas with freshly isolated T-Cells from buffy coats, multiple donors would have had to be used. By choosing the Lonza cells, donor-to-donor variability was excluded. By testing several Lonza donors, it was assured that a representative donor was chosen for screening.
- Since the Lonza cells were delivered as frozen aliquots, experimental handling and throughput were easily adjustable. With freshly isolated cells, yields varied and the experimental throughput had to be adjusted accordingly, whereas using the frozen cells allowed all genes in the first RNAi screening pass to be tested in two experimental rounds, thus minimizing the influence of experimental variability and improving comparability of the results.

Potential Impact:

The clinical relevance of systems medicine is increasingly recognized. However, bridging the gap between systems biological, experimental studies of model organisms and clinical research is a formidable challenge. Furthermore, there have been relatively few research groups that have focused on applying systems biological principles to research involving patients and clinicians. MULTIMOD is one of the first of an increasing number of EU projects that have clear clinical objectives. The reasons for the shortage of clinically oriented systems medical projects include the greater complexity of human disease and the fact that clinical research has traditionally been based on detailed studies of individual genes or proteins. Most clinical researchers do not have the training or experience to perform high-throughput studies. On the other hand, it is clear that 'reductionist' approaches are unlikely to solve problems like finding markers for personalized medication. This has been recognized as a key medical and economic problem: some 90% of all drugs are only effective for about 40% of patients. In the US this corresponds to 350 billion dollars/year for ineffective medications (Editorial. Nat Biotechnol January 2012). To our knowledge there are no similar figures available for Europe. However, personalized medication would significantly improve health care and also save the cost of prescribing sub-optimal treatment. In the case of allergic disease, which affects some 30-40% of the population in the EU, this could contribute both to health and to lowering of pharmaceutical costs. The MULTIMOD project may contribute to bridging the gap between systems biological genomic studies on model organisms and clinical systems medical research because we have chosen an optimal model of complex diseases and because a multi-disciplinary team of leading experts has been gathered to solve the analytical challenges.
We propose that the results of our project will contribute to developing and applying methods to find markers for personalized medication in allergic disease.

Expected impacts listed in the work programme

This project directly meets the description of HEALTH-2007-2.1.2-5. It is a multidisciplinary project based on genomics and systems biology. It addresses basic biological processes at all levels of systems complexity and may contribute to new concepts, such as using modules rather than individual genes as functional units. It has been aimed at the solution of a concrete clinical problem, namely to personalize medication. Essentially, this required translating principles and methods developed from systems biological genomic studies of model organisms to systems medicine. This involved solving a number of theoretical, computational, bioinformatic, statistical and genomic challenges that may have significant impact on clinical research as well as the treatment of complex diseases. This could, in turn, be of commercial and educational importance as detailed below:

Research impacts: doors opened for the participants and other researchers

The project may provide a bridge between systems biological genomic studies of model organisms and clinical research. The concept of network-based analysis is well established in model organisms but has only recently been applied to high-throughput studies of complex diseases by us and others. However, to our knowledge this project is among the first to use high-throughput technologies to construct multi-layer modules (MLM), and to use those modules to find diagnostic markers in clinical studies. The project has led to publications in widely read journals and one of the key articles is rated as highly accessed. This may have significant impact on the clinical application of systems medical research. The participants have promoted and will actively promote this not only through publications, but also via presentations at conferences, educational efforts and collaborations with pharmaceutical companies. Another important way to disseminate the analytical methods is to make them freely available on the Internet so that they can be applied to other complex diseases. Taken together, these efforts may have significant impact on clinical research.

Impacts of new therapies

The primary aim of this project has been to develop methods to find markers for personalized medication, which in itself is a great challenge. However, we also identified a potential therapeutic target. Since we showed that the methods work not only in SAR but also in multiple other diseases, our methods are likely to be used in research on other diseases. Thus, the main therapeutic impact is personalized medication. This may decrease both suffering and costs. However, the same methods may also be useful to predict adverse effects, which could also decrease suffering. In addition, if it were possible to predict adverse effects, drugs that have been removed from the market because of such effects might be relaunched. Since the time and cost of developing new drugs are enormous, this could have significant commercial impact. Moreover, detailed analysis of how disease mechanisms vary between individuals could lead to development of new drugs for those who do not respond to current medications.

Impacts on patients

The project may contribute to decreased suffering by disseminating methods for the identification of markers for personalized medication. As stated above the methods could also decrease suffering by development of new drugs. The elucidation of gender differences in diagnostic markers is an important part of this.

Commercial impacts

If medication can be shown to be potentially personalized based on measuring gene expression patterns or a limited number of diagnostic protein markers, there would be commercial interest in developing methods for doing this routinely. We speculate that this will lead to combined diagnostic and therapeutic products that include methods to measure the proteins, perhaps nanochips, and software to analyse the results. This could be developed in collaborative efforts between biotechnological and pharmaceutical industries. Another important commercial aspect is the possible renewed use of drugs that have been withdrawn because of adverse effects and the development of new drugs for patients who do not respond to existing medications.

Use of new results in training/education of young researchers

The project has been a unique multi-disciplinary collaboration based on employing post-doctoral researchers in clinical sciences, genomics, computer science, bioinformatics and statistics. These researchers have analysed the same materials from different perspectives. This has resulted not only in synergies but also in a multi-disciplinary training which is likely to be very helpful in their careers. Post docs from the MULTIMOD project have already contributed to the dissemination of the methods and principles in various locations in the world. For example, one post doc received an EMBO fellowship and is now working at an MRC laboratory in Cambridge. He is currently being interviewed for a leadership position in his native India. Other recently graduated PhD students have gone on to industry or post doc positions in the US.

Reasons why the project requires a European rather than a national or local approach.

The partners have been selected because they have unique expertise in their fields, which is required for addressing the challenges in the project. Even if less qualified experts had been selected, it would have been difficult to find them in one centre.

List of Websites:

http://www.multimod-project.eu/
