
Molecular signatures: a systems biology tool to understand how leaf development is constrained by drought

Final Report Summary - MOLSIG (Molecular signatures: a systems biology tool to understand how leaf development is constrained by drought)


In virtually all agricultural regions, crop yields are periodically reduced by drought. To produce more drought-resistant crops, it is vital to have a comprehensive 'systems biology' understanding of:

(i) how drought affects plant growth and development;
(ii) what physiological mechanisms the plant employs to tolerate drought;
(iii) how and what genes regulate these mechanisms.

Furthermore, identification and characterisation of genes governing plant development under drought will in turn facilitate identification of candidate genes for improving crop drought tolerance. Plant responses to drought are complex and involve the activation of a variety of mechanisms to enable the plant to survive. High-throughput 'omics' technologies have begun to catalogue transcripts, proteins and metabolites whose abundance changes in response to stresses such as drought. However, these molecular profiling techniques are relatively expensive. An alternative option would be to select a limited suite of biomolecules (transcripts, metabolites etc.) which represent molecular signatures reflecting biological processes affected under non-stress and stress conditions. Using targeted sets of 'signature' biomolecules would allow the expression of a relatively low number of process-representative genes to be measured by QPCR.

Identification of such signature biomolecules could be achieved by algorithms such as K-means that cluster together groups of genes with similar expression patterns. This assumes that genes participating in a particular process will exhibit similar expression patterns over a large number of experiments or time points in a given experiment (the guilt-by-association principle). A gene showing the average expression in a cluster could then be assigned as a representative of the processes represented by genes in that cluster. However, K-means yields different clusters upon each run even with identical input data. In addition, such algorithms can only work on single graphs. They are unable to combine multiple graphs, and thus important sources of information are lost that could be critical to selecting process-representative genes. To address this problem, Dr Alberto Paccanaro's lab developed a novel algorithm designated the MOLSIG algorithm.
It is based upon the computation of the maximum eigenvalue of a matrix related to the graph Laplacian and yielded excellent results for determining process-representative genes based on toy and artificial data. MOLSIG is not a clustering algorithm. Rather, it is able to directly pick genes which are representative of common gene expression patterns. A set of genes whose expression is highly correlated with the representative gene can then be identified. The results obtained are stable over different runs and, uniquely, this algorithm is able to combine multigraphs containing identical nodes but with different types of connections in each network.
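
The K-means route to a representative gene described above, and the reason its result depends on the run, can be sketched as follows. This is a minimal illustration in plain Python and is not the MOLSIG algorithm; the function names, the toy expression profiles and the use of Euclidean distance are assumptions made for the example.

```python
import math
import random

def kmeans(profiles, k, seed, n_iter=50):
    """Plain Lloyd's k-means on a list of equal-length expression profiles."""
    rng = random.Random(seed)
    # The starting centroids are k randomly chosen genes, so the result
    # can depend on the seed: this is the source of run-to-run variation.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(n_iter):
        # Assign every gene to its nearest centroid (Euclidean distance).
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
        # Recompute each centroid as the mean profile of its members.
        for c in range(k):
            members = [p for p, l in zip(profiles, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def representatives(profiles, labels, centroids):
    """For each non-empty cluster, index of the gene closest to the cluster mean."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            reps[c] = min(members, key=lambda i: math.dist(profiles[i], centroid))
    return reps

# Toy data (hypothetical): six genes measured at four time points.
toy_profiles = [
    [0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.1, 3.1], [0.2, 0.9, 1.9, 3.2],
    [3.0, 2.0, 1.0, 0.0], [3.1, 2.1, 1.1, 0.1], [2.9, 1.9, 0.9, 0.2],
]

for seed in (0, 1, 2):
    labels, centroids = kmeans(toy_profiles, k=2, seed=seed)
    print(seed, labels, representatives(toy_profiles, labels, centroids))
```

Comparing the printed labels and representatives across seeds on real data is a quick way to check how stable a K-means based choice of representative genes is.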


Our task was to further develop this method by:

(i) examining how to use MOLSIG on real drought gene expression data;
(ii) determining whether MOLSIG could identify processes related to plant drought responses;
(iii) testing whether using multigraphs gave better results than a single graph;
(iv) assessing whether MOLSIG performed better than commonly used clustering algorithms.


(i) Investigating the parameters to use for the MOLSIG algorithm

The data used for all experiments consisted of a microarray dataset comprising a timecourse (0, 15 min, 30 min, 1 hr, 3 hr, 6 hr, 12 hr and 24 hr) of the Arabidopsis response to drought stress. The dataset was obtained from the AtGenExpress public database of Arabidopsis microarray experiments and passed various quality controls (M/A, RNA degradation, NUSE plots etc.). The data were re-normalised and used to calculate the fold-change in expression of each gene in response to drought compared to the control. To determine the optimal parameters for MOLSIG, the algorithm was run on the Arabidopsis drought dataset to identify process-representative genes. Gene groups were formed initially by identifying a fixed number of the genes most highly correlated with the representative gene. Gene groups exhibiting clear stress-mediated changes in gene expression were subjected to functional annotation over-representation analysis (ORA) with the DAVID functional annotation bioinformatics webtool, using GO-term_Biological process_FAT, GO-term_Molecular function_FAT, SP_PIR_Keywords and Interpro. The following parameters were found to be optimal for MOLSIG analysis of the drought dataset:

(i) Genes showing at least a two-fold change in expression were used as input data.
(ii) The optimal number of gene groups generated was k = 15.
(iii) Gene groups were generated from genes showing a Pearson correlation coefficient of at least 0.85 with respect to the process-representative gene picked by the algorithm, rather than from a fixed number of highly correlated genes.
(iv) To prevent genes appearing in more than one gene group, genes were forced into the groups with which they were most highly correlated.
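
The grouping rules in the list above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the representative genes are supplied by hand because the MOLSIG selection step is not reproduced here, the input is assumed to be log2 fold-change profiles, the two-fold filter is assumed to apply to the largest change at any time point, the 0.85 cut-off is treated as a minimum, and all function and gene names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def passes_fold_change(log2_fc, min_fold=2.0):
    """True if the gene changes at least min_fold (up or down) at any time point."""
    return max(abs(v) for v in log2_fc) >= math.log2(min_fold)

def build_groups(log2_fc_by_gene, reps, min_r=0.85):
    """Assign each filtered gene to the single representative it correlates with best."""
    groups = {rep: [] for rep in reps}
    for gene, profile in log2_fc_by_gene.items():
        if gene in groups or not passes_fold_change(profile):
            continue  # skip the representatives themselves and weak responders
        # Forcing the gene into its best-correlated group keeps groups disjoint.
        best_rep, best_r = max(
            ((rep, pearson(profile, log2_fc_by_gene[rep])) for rep in reps),
            key=lambda pair: pair[1],
        )
        if best_r >= min_r:
            groups[best_rep].append(gene)
    return groups

# Toy log2 fold-change profiles (hypothetical gene names and values).
toy = {
    "repUp":   [0.0, 0.5, 1.0, 2.0],
    "repDown": [0.0, -0.5, -1.0, -2.0],
    "g1":      [0.1, 0.6, 1.2, 1.9],    # follows repUp
    "g2":      [0.0, -0.4, -1.1, -2.2], # follows repDown
    "g3":      [0.0, 0.1, 0.2, 0.3],    # below the two-fold filter
    "g4":      [1.5, -1.5, 1.5, -1.5],  # changes strongly but matches neither
}

print(build_groups(toy, ["repUp", "repDown"]))
```

Using a correlation cut-off rather than a fixed group size means that group sizes vary with how many genes actually follow each representative pattern.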

(ii) Could MOLSIG identify processes related to plant drought responses?

(iii) Does MOLSIG perform better using multigraphs?

We tested the unique ability of MOLSIG to combine two or more graphs. The two graphs used were the gene expression co-correlation (GE) graph and the GO term_biological semantic similarity (SS) graph. We surmised that MOLSIG would pick better process-representative genes using the combined graphs because these genes would not only be the center of a group of highly correlated genes but would also be significantly different from other gene centers in terms of biological function. MOLSIG was run with a weighting of 0.9 for the GE graph and 0.1 for the SS graph. Many more functional terms were identified specifically by using GE and SS together (105) than by using GE alone (32).
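
One simple way to picture the 0.9/0.1 weighting is an entry-wise weighted sum of the two adjacency matrices, followed by construction of a graph Laplacian. The sketch below assumes exactly that; the report does not state how MOLSIG combines the graphs or which Laplacian-related matrix it uses, so the combination rule, the unnormalised Laplacian L = D - W, the function names and the toy matrices are all assumptions made for illustration.

```python
def combine_graphs(ge, ss, w_ge=0.9, w_ss=0.1):
    """Entry-wise weighted sum of two adjacency matrices over the same genes."""
    n = len(ge)
    return [[w_ge * ge[i][j] + w_ss * ss[i][j] for j in range(n)]
            for i in range(n)]

def laplacian(w):
    """Unnormalised graph Laplacian L = D - W, where D holds the row sums of W."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]

# Toy symmetric graphs over three genes (hypothetical edge weights).
ge = [[0.0, 0.9, 0.1],   # gene expression co-correlation graph
      [0.9, 0.0, 0.2],
      [0.1, 0.2, 0.0]]
ss = [[0.0, 0.5, 0.8],   # GO semantic similarity graph
      [0.5, 0.0, 0.3],
      [0.8, 0.3, 0.0]]

combined = combine_graphs(ge, ss)
for row in laplacian(combined):
    print(row)
```

Both graphs must share the same node order for the sum to be meaningful, which matches the requirement that the multigraphs contain identical nodes.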

Many of the functional terms identified specifically by using GE and SS are known to be important processes responding to abiotic and biotic stress in plants, such as 'response to salicylic acid stimulus', 'phytoalexin metabolic process', 'reproductive structure development', 'root system development', 'response to nutrient levels' and 'response to cadmium ion'. Thus, the unique multigraph feature of MOLSIG showed that adding extra information, by using both the gene expression co-correlation and GO term_biological semantic similarity graphs, significantly improved the ability of MOLSIG to pinpoint process-representative genes. Consequently, many functional annotation terms were identified that would have been missed using gene expression data alone. This finding has clear implications for hypothesis generation and gene discovery, in addition to picking genes that could be used in real-time PCR and would represent multiple processes taking place in response to drought.

(iv) Does MOLSIG perform better than commonly used clustering algorithms?

We also tested whether MOLSIG (using GE and SS) could identify more functional annotation terms than one commonly used clustering algorithm (K-means) and one relatively new clustering algorithm (affinity propagation). Of the terms specifically identified by MOLSIG, many are known to be important in the response to abiotic and biotic stresses. Indeed, in contrast to MOLSIG, the affinity propagation algorithm was unable to identify 'abscisic acid signaling pathway'. Abscisic acid is a plant hormone that is essential for activating many of the plant's stress defense mechanisms. It should also be noted again that, unlike MOLSIG, different runs of K-means generate different gene clusters.

In conclusion, the MOLSIG algorithm is able to combine two or more graphs, can outperform clustering algorithms and is able to choose genes that represent multiple biological processes. Thus, using MOLSIG has clear implications for hypothesis generation and gene discovery, in addition to picking genes that could be used in real-time PCR and would represent multiple processes taking place in response to drought, or in the global response of any other biological phenomenon being investigated. Hence, MOLSIG could be used both in academia and in the biotechnology industry.

Furthermore, the MOLSIG algorithm has wider applications than biomolecule expression data alone. For instance, MOLSIG's ability to combine graphs could have uses in product marketing by combining social networks with geographical location, thereby providing marketing targets who are representative people, the most connected to other people in the same geographical area.
