User-driven Development of Statistical Methods for Experimental Planning, Data Gathering, and Integrative Analysis of Next Generation Sequencing, Proteomics and Metabolomics data

Final Report Summary - STATEGRA (User-driven Development of Statistical Methods for Experimental Planning, Data Gathering, and Integrative Analysis of Next Generation Sequencing, Proteomics and Metabolomics data)

Executive Summary:
The main goal of the STATegra project is to develop a new generation of bioinformatics resources for the integrative analysis of multiple types of omics data. These resources include both novel statistical methodologies and user-friendly software implementations.

STATegra methods address many aspects of the omics data integration problem: design of multi-omics experiments, integrative transcriptional and regulatory networks, integrative variable selection, data fusion, integration of public domain data, and integrative pathway analysis. To support method development, STATegra uses a model biological system, namely the differentiation process of mouse pre-B-cells. On this system the consortium has created a high-quality data collection consisting of a replicated time course measured on seven different omics platforms: RNA-seq, miRNA-seq, ChIP-seq, DNase-seq, RRBS-seq, Proteomics and Metabolomics. This collection is used to assess and validate STATegra methods and is available from public repositories. Novel integration methodologies follow a double implementation track: as free R packages or web-based tools, and as commercial user-friendly software.

Methods developed by the STATegra project include Multi-omics clustering, Multi-omics component analysis, Multi-omics regulatory gene networks, Transcriptional Networks, Feature Selection methods based on multi-omics evidence, Pathway Network Analysis, Causal Meta-analysis and Multi-omics Pathway Analysis & visualization. We have also developed tools for optimal multi-omics experimental design, method validation and annotation of multi-omics experiments. The project has also initiated the development of semantic structures to communicate results of the integrative statistical analysis in a structured way.

Commercially the project resulted in a collection of plugins for the CLCbio GxWB version 8.5 that constitute the Qiagen Systems Biology Platform. STATegra plugins include an interface to the Biomax Knowledge Base, an R executor for STATegra R (and third-party) packages, Transfac and IPA interfaces, a novel multi-purpose Peak Caller and several tools for Genome browsing and genomics track manipulations.

Publication-wise, the project has motivated around 50 manuscripts, half of them already published, and has led three special issues in Open Access journals. Moreover, we organized several workshops and courses, and launched the Statistical Methods for Omics Data Integration and Analysis Conference, held in 2014 and 2015, which aims to continue in the future as a dedicated meeting for multi-omics data analysis research.

Project Context and Objectives:
Recent developments in the omics field have resulted in the availability of a wide array of high-throughput technologies that allow the study of cell biology at different levels of molecular organization. In particular, the explosion in recent years of next generation sequencing (NGS) applications and their continuing drop in price make genome-wide, system-oriented approaches in biomedical research increasingly affordable for many molecular biology labs. At the same time, these technological developments have fuelled statistical research on how to extract the best signal-to-noise ratio for a given data type. There are now statistical pipelines of varying maturity available for the analysis of different types of omics: transcriptomics, proteomics, metabolomics, and the novel "-seq" approaches: RNA-seq, ChIP-seq (Johnson 2007, Barski 2007), Methyl-seq (Brunner 2009), etc. However, there is still a gap between the available tools for statistical analysis of a single data type and the requirements of biomedical scientists who address their studies through multiple omics approaches and are faced with the challenge of understanding the combined results in an integrative fashion. Furthermore, researchers need guidance on how to design such integrative analyses more efficiently, both when planning the experiments and when collecting the data. These analytical challenges remain critical points for the successful translation of omics experimental investments into significant knowledge advances for biomedicine.

The STATegra project aims to fill this scientific and technological gap in current genomics research. The goals of this project are a primary R&D target of leading European bioinformatics companies, who understand that genomics research is increasingly moving towards data-intensive experimental designs and that efficient solutions for data processing and sharing need to be provided to preserve competitiveness in the global bioinformatics software market. Our vision is that the development of an appropriate and accurate analysis framework for omics data will permit a more efficient use of the data and a better understanding of the results, and that this can only be achieved through an intimate collaboration between statistical experts, biomedical researchers, data producers and software developers. Through these interconnections we can make sure that we stay close to the needs of the experimentalists, understand the nature of the data, create sound analytical solutions and make them available. Moreover, by developing our analysis framework within such a multidisciplinary environment we can evaluate statistically, experimentally and operationally the usefulness of our novel methodologies in planning experiments, collecting data and integrating different types of measurements.

Specifically, the objectives of the STATegra project are:

Objective 1: Development of methods for data gathering and visualization. STATegra will develop methods to tackle the challenges of data retrieval and visualization. These developments include algorithms for data gathering that are specific for the System under Study (SuS), as well as procedures for semantic mapping of the experimental data with prior knowledge on specific biomedical domains. STATegra will create novel visualization strategies that facilitate the interpretation of information-rich data models and structures.

Objective 2: Development of statistical methods for integrative analysis of omics data measured on the same samples. The STATegra project aims to develop sound and comprehensive statistical solutions for the integrative analysis of different types of omics data in order to fully leverage the information content and the discovery potential of experiments where different omics levels are measured on the same set of samples. Our goal is integrative analysis beyond mere data integration, i.e. we strive to develop statistical methodologies that will incorporate different types of omics data into one analysis with models of the SuS that integrate different molecular layers. We will analyze the uncertainty (noisy data and missing values) in different data types to evaluate how effective the integrative approach is in handling these characteristics of the data and how robust integration methods are to the different levels of noise that might appear in the experiment.

Objective 3: Development of statistical methods for integrative analysis of omics data obtained in different sets of samples. We will employ causal modelling methods to model interventions and to generalize meta-analysis methods that enable using datasets obtained under different experimental conditions. In addition, we will develop analysis methods that can co-analyze several datasets measuring different, but overlapping feature sets (values missing by design). The above methods should be able to improve their inductions by intelligently incorporating prior causal knowledge.

Objective 4: Development of statistical framework for multilayer integration and interpretation of omics data. We aim to provide an example of a data analysis scenario where different omics data sources are integrated in a global systems biology model. This example will collect statistical innovations obtained from the project and will use the STATegra mouse B-cell differentiation system. We intend that this example, rather than a fixed integration methodology, could serve as a reference on how data modalities and analysis algorithms can be put together to gain biological insights that would be hard to obtain otherwise.

Objective 5: Development of algorithms for experimental design. We will develop computer-assisted methodologies for estimating optimal experimental designs in multivariate and multiplatform genomics studies. Our objective is to develop methods (possibly by Monte Carlo simulations of heteroscedastic measurements) to infer which experimental setting, combined with specific measurement platforms and sets of parameters, leads to the largest gain in information within the available sampling resources.

Objective 6: Development of methodologies to feed back statistical results into current knowledge. STATegra aims at the generation of a formal representation of knowledge that can form the basis to feed back causal discoveries. We will generate methods to compare knowledge models described in different formal languages and subsequently generate semantic mappings for data derived from/fed back into these sources. Moreover, we will propose community aligned formats to extend current formal representations of knowledge to ensure that results from causal discovery can also be represented. Another important objective is to enable intuitive interpretation of the implications of the causal discovery for the understanding of the SuS by powerful visualization engines.

Objective 7: Validate statistical methods and create procedures for validation of statistical results. We aim to develop analytical procedures for the identification of optimal validation experiments, perform the suggested validation tests and contrast the results with the newly generated knowledge for the SuS. These validation experiments have two goals: first, to check predictions derived from the integration models generated through objectives 2 and 3, and second, to verify the adequacy of the validation process in suggesting suitable follow-up experiments.

Objective 8: Implement developed algorithms into user-friendly packages. The STATegra project aims to eliminate the obstacles for translating cutting-edge omics data and statistical modelling research into real use by the wider biomedical research community. Our objective is to build a new, user-friendly and integrated bioinformatics platform for statistical analysis and visualization of selected omics data in a systems biology context. The software will enable users to combine reference omics data with their own experimental data so that multi-omics analysis can be performed in their specific research context, addressing their specific needs and including their specific biological knowledge. The software will be modular and will provide straightforward interfaces to other software, so that cutting-edge statistical methods and tools can be included in the platform.

Objective 9: Ensure dissemination of these methodologies among the genomics community. One of the primary goals of the STATegra project is to establish an intensive set of dissemination actions that will ensure an efficient translation of the methodological developments of the project into the current practice of genomic analysis. Our dissemination goals will target three aspects of translational bioinformatics. First, we aim to establish strategic liaisons with other research consortia involved in large genomic projects for the critical and end-user compliant development of our statistical tools. Second, we have a very specific goal in the creation of user-friendly software solutions that can serve the wider genomics community. Finally, we intend to organize training activities for our developed methods and tools to maximize the actual uptake of our newly developed statistical methods by the biomedical research community.

Project Results:
The STATegra project has successfully met the objectives of creating data, methods and software for the integrative analysis of multi-omics experiments. As planned, we have targeted very different aspects of the processing of this type of data, namely data gathering, experimental design, statistical algorithms and visualization. We have also been successful in creating a comprehensive multi-omics dataset and in disseminating and exploiting the results of our efforts. In the following, we summarize the project achievements on all these fronts.

1. The STATegra multi-omics data collection
We have created a unique data collection that has the potential of becoming a gold standard for data integration studies. In contrast to other large-scale data repositories such as ENCODE, BluePrint, or TCGA, which collect a limited number of data types over a large diversity of samples, frequently with low replication, the STATegra collection focuses on a specific biological system: the differentiation of the mouse pre-B-like B3 cell line under the induction of the transcription factor Ikaros. The STATegra data collection has a defined experimental design, with six time-points sampled in triplicate for both the Ikaros and Control series, and covers a diverse set of omics data types: Expression-seq (RNA-seq, microRNA-seq and single-cell RNA-seq), Genome-seq (ChIP-seq, RRBS-seq and DNase-seq) and non-nucleic acid omics such as metabolomics and proteomics. In total we have collected 560 datasets (Figure 1, See document final report attached).

This unique collection required that experimental design and platform-compatible normalization issues be thoroughly addressed. By carefully designing sample distribution at library construction and sequencing, we were able to estimate and eliminate the batch effects that inevitably appear when large experiments are run in several rounds. The use of internal standards allowed us to co-normalize data from omics technologies with substantially different sample preparation protocols. This allowed us, for example, to correct for differentiation-associated cell shrinkage effects during our time course across multiple omics methods. Finally, we have used information from some datasets to make data processing decisions in other omics types, which facilitated analysis and reduced sample preparation costs. For example, gene expression values have been used to support pre-processing of proteomics data, and the integration of ChIP-seq and DNase-seq data has been used to define signal confidence intervals. We concluded that experimental protocols are important information to keep in mind when designing normalization and data integration approaches that involve heterogeneous omics data types. This information is frequently neglected during data analysis. The STATegra project, by integrating experimentalists, bioinformaticians and biostatisticians, stresses the necessity of a good communication flow from data generation to analysis for the proper processing of complex data structures.

The STATegra data is now available through three different interfaces:
1) STATegraEMS (Deliverable 3.1). This is a software development of the project. STATegraEMS is a management system for the annotation of multi-omics experiments. The STATegraEMS was developed to host the STATegra data collection but was designed as a general tool that could be used in any genome research laboratory. The STATegraEMS differs from other information systems in being experiment-centric rather than sample-centric. While most NGS LIMS are conceived to help track sample processing at sequencing facilities, the STATegraEMS allows the storage of multiple types of omics experiments, separates sample-related from sequencing-related metadata and includes fields to record and query experimental factors. The STATegraEMS has been published and is freely available to the scientific community.
2) The STATegra Knowledge Base (KB). The STATegra KB is a fully structured database that gathers multiple information sources on the STATegra experimental system. STATegra KB semantically integrates prior knowledge from public resources on B-cell differentiation together with public and project specific experimental data. This integration and semantic mapping of heterogeneous prior knowledge and information creates extremely information rich structures with thousands of objects (genes, proteins, compounds, etc.) from different organisms, millions of connections (e.g. protein-protein interactions (PPI), transcription factor-target relations) and several levels of modularity (e.g. signal pathways or sub-cellular localization). In addition this network involves context specific information (e.g. transcription factor binding true under a specific condition), quality information (e.g. evidences of functional relations) and is overlaid with large amounts of multi-dimensional data (e.g. large scale gene expression, NGS results). STATegra KB is available at: https://ssl.biomax.de/stategra/cgi/login_bioxm_portal.cgi?BIOXM_WEB=BioXM
3) Public repositories (Deliverable 3.2). All raw and processed data are available from different public sources. NGS data are available at Gene Expression Omnibus (GEO):

- mRNA-seq data at GSE75417 Private, waiting for publication
- miRNA-seq data at GSE75394 Private, waiting for publication
- Methyl-seq data at GSE75393 Private, waiting for publication
- DNase-seq data at GSE75390 Private, waiting for publication

Metabolomics and Proteomics datasets are located at dedicated databases hosted by EMBL-EBI:
- Metabolomics at MTBLS283 Private, waiting for review
- Proteomics at PXD003263 Private, waiting for publication

2. Development of methods for data gathering and visualization
The primary aim of integrating datasets and prior knowledge in a specific knowledge base (KB) is to ensure that a comprehensive collection of large-scale datasets mapped to a network of prior knowledge is available for data mining, statistical modeling, causal discovery and experimental design. To meet this need we created a data model that semantically links existing knowledge with experimental data. We used the STATegra System under Study (SuS), the mouse B-cell differentiation system, as use case for this data modeling effort. Basically, three levels of information were considered:
a) Quantitative experimental data (from STATegra omics and public domain experiments)
b) Experimentally derived information (such as protein-protein interactions)
c) General causal information (e.g. hematopoietic differentiation processes)
Moreover, we distinguish between experimental data related to the same biological system as the STATegra SuS (S1), related experiments (S2) and new designs regarding protocols and technologies with a potential to study the SuS (S3). Integrated data types will be classified accordingly (S1-3).
The proposed data model (Figure 2, See document final report attached; Milestones 11 and 12) integrates metadata from experimental design (sample metadata), experimental technologies (specific metadata for each of the omics methods), sources of public domain knowledge (public repositories) and relationships between data elements (evidence, ontologies, etc.). This model was implemented in the STATegra KB and has been used to link project data to public domain information. This is also the underlying model that supports the visualization engines available at the KB.

STATegra has also resulted in several developments that meet the difficult challenge of visualizing multi-omics data to facilitate the interpretation of information-rich data models and structures. We have focused on visualizing pathway and molecular network graphs, on enabling the integration and visualization of several kinds of heterogeneous datasets and also on enabling the visualization of changes over time. Our methods enable interactive network construction and navigation at different scales (cell, molecular and single entity level) and allow breakdown (pathway editing) or pre-selection of a defined subset of nodes for visual representation to avoid overcrowded networks.
Visualization resources have been implemented on two fronts:
I) Within the Knowledge Base (Figure 3, See document final report attached). Three different novel implementations are available in the STATegra KB that target three levels of visual representation (Deliverable D2.1):
a) Visualization of Expert Knowledge at the cellular process level (Figure 3a). This visualization represents the consolidated previous knowledge existing in the consortium.
b) Visualization of integrative local molecular networks (Figure 3b). This visualization aims to be a flexible framework to display different types of data around the regulation of a particular factor.
c) Visualization of multiple sources of data at the gene level (gene cards-like) (Figure 3c), i.e. gene-centric charts that collect all public and consortium data available for the gene.

II) The Paintomics tool (http://bioinfo.cipf.es/paintomics/ Figure 4 and Figure 5, See document final report attached). Paintomics is a web tool for the integrative analysis and visualization of multi-omics data on the template of KEGG pathways that can be used for virtually any organism. The tool provides 3 visualization levels:
II.a) Pathway Network level. A global visualization of (coordinated) molecular changes at the systems level.
II.b) Pathway level. The different omics signals are represented for each node of the KEGG pathway.
II.c) Gene or node level. Multi-omics profiles for all genes in a pathway node.
In this way Paintomics supports navigation at different information levels. Zooming in and out across these levels permits interactive understanding of global and local features of the omics experiment.
Paintomics accepts virtually any type of omics data that can be mapped to genes or metabolites. In the case of genomic coordinate data (such as ChIP-seq, Methyl-seq or DNase-seq), Paintomics uses the RGmatch algorithm to link regions to genes and provides an aggregated measurement for each gene that is then displayed at the pathway node together with other omics values. Similarly, when microRNA-mRNA mappings are available, Paintomics can show microRNA regulation at each pathway node.
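The region-to-gene step described above can be illustrated with a minimal Python sketch. This is not the RGmatch algorithm, whose association rules are configurable and are not reproduced here. The nearest-TSS rule, the 10 kb cutoff, the mean aggregation, and all function and gene names are assumptions made only for illustration.

```python
from collections import defaultdict


def aggregate_regions_by_gene(regions, tss, max_distance=10_000):
    """Assign each region to the nearest TSS on the same chromosome and
    average the region signals per gene.

    regions: list of (chrom, start, end, signal) tuples (hypothetical format)
    tss:     dict mapping gene name -> (chrom, position)
    """
    per_gene = defaultdict(list)
    for chrom, start, end, signal in regions:
        midpoint = (start + end) // 2
        best_gene, best_dist = None, None
        for gene, (gene_chrom, position) in tss.items():
            if gene_chrom != chrom:
                continue
            dist = abs(midpoint - position)
            if best_dist is None or dist < best_dist:
                best_gene, best_dist = gene, dist
        # Regions too far from any TSS are left unassigned (assumed rule).
        if best_gene is not None and best_dist <= max_distance:
            per_gene[best_gene].append(signal)
    return {gene: sum(values) / len(values) for gene, values in per_gene.items()}


# Toy input: four peaks and three genes, all values invented.
regions = [
    ("chr1", 900, 1100, 4.0),
    ("chr1", 1400, 1600, 2.0),
    ("chr1", 50_000, 50_200, 7.0),  # farther than 10 kb from any TSS
    ("chr2", 100, 300, 1.0),
]
tss = {"GeneA": ("chr1", 1200), "GeneB": ("chr1", 90_000), "GeneC": ("chr2", 500)}
gene_signal = aggregate_regions_by_gene(regions, tss)
```

The resulting gene-level values are the kind of quantity that can be displayed next to expression values at a pathway node.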

3. Development of statistical methods for integrative analysis of omics data measured on the same samples
We have completed the development of statistical methods for the integrative analysis of multi-omics experiments. The rationale of these methods is that the data structures are supported by a common experimental design, which is used in the integrative process. These methods address the data integration problem from different perspectives using dimension reduction techniques, reverse engineering, linear models, etc. We next briefly describe these methodologies:
a. Omics Component Analysis. UvA and CIPF have investigated the use of different data fusion approaches based on Component Analysis for the joint multivariate analysis of different datasets. These methods separate data variability between shared and distinct components and provide a global overview of the relationship between omics modalities.
b. OmicsClustering. This is a clustering method based on the combined and weighted distances between genes calculated on the basis of several omics measurements. The algorithm requires a mapping strategy to assign non-gene features (such as ChIP-seq peaks) to genes. The approach is interesting to see gene associations due to different regulatory characteristics of genes.
c. Reverse engineering methods. KI developed a novel pipeline to reconstruct gene regulatory networks as a function of time, by integrating different sources of data and using a systematic approach to uncover the network structure in a step-by-step procedure. The principle underlying this approach is the assumption that network topology is in a quasi-steady state, and we seek to identify the time-dependent edges. The tools used are mutual information-based pairwise associations, normalized differential expression across time, and modified regression with an L1 penalty for predicting node values. We use RNA-seq expression data and DNase footprint binding information for our reconstruction.
d. Pathway linear models. We have extended CIPF’s previous Pathway Network Analysis (PANA) methods to process multiple omics data types. PANA creates a network of pathways based on correlation patterns among their components, which can now include metabolomics and proteomics data. This approach helps to investigate functional connections and underlying regulators between pathways that change their activity during a given biological process. Paintomics maps multi-omics data to KEGG pathways and displays graphically the value that each gene (and metabolite) takes for each omics type over the topology of the pathway. The tool can also perform pathway enrichment analysis for each omics type separately and jointly for all features.
e. Pairwise linear models. This approach was developed by CIPF during the first reporting period to study gene-wise time-related covariation patterns between a pair of omics data types. The method returns groups of genes with similar patterns between the 2 omics. In this second reporting period we have added gene set enrichment analysis to this methodology.
f. Time-resolved Clustering. In the previous reporting period KI described a novel methodology to quantitatively relate gene clusters of different sizes. The methodology has been extended to incorporate multi-omics datasets. A new cluster-based network has been inferred that represents gene expression changes of the System under Study as modules (clusters), which are then related to each other based on their temporal profiles (Figure 6, See project final report document attached).
g. IntegRa. This is a machine learning approach developed by CIPF that seeks to identify the set of NGS-quantified features that potentially regulate the expression of one gene or group of genes. The potential regulators of gene expression considered were TFs (measured by RNA-seq and ChIP-seq), microRNAs (measured by microRNA-seq), DNA methylation events (measured by Methyl-seq), and chromatin accessible regions (measured by DNase-seq). The method first uses decision trees as a variable selection strategy and then generalized linear regression and structural equations to find a regulatory program for each gene.
h. Other developments: RGmatch (CIPF) is a highly configurable Python package for assigning genomic regions defined by chromatin-related omics (such as ChIP-seq, Methyl-seq or DNase-seq) to annotated genes. NextmaSigPro is an extension of CIPF’s software maSigPro for the analysis of time course NGS data. NextmaSigPro has been used for the statistical analysis of the mRNA-seq, miRNA-seq and DNase-seq data of the STATegra data collection.
Figure 7, See document final report attached, summarizes the statistical methods developed in STATegra, indicating the application domain and type of output provided. Several of these methods are now available at the STATegRa R package from Bioconductor (https://www.bioconductor.org/packages/3.3/bioc/html/STATegRa.html , Deliverable 4.2)
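As an illustration of the idea behind OmicsClustering (item b above), the following minimal Python sketch combines per-omics gene-to-gene distances with weights and clusters genes on the combined distance. It is not the STATegRa implementation. It assumes that non-gene features have already been mapped to genes, and the correlation distance, the weights, the average-linkage clustering and all data values are hypothetical choices.

```python
import math


def correlation_distance(x, y):
    """1 - Pearson correlation between two profiles (0 = same shape)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0  # flat profile: treated as uncorrelated
    return 1.0 - sxy / math.sqrt(sxx * syy)


def combined_distance(omics, weights, genes):
    """Weighted average of per-omics distances for every gene pair."""
    total = sum(weights.values())
    dist = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            d = sum(weights[name] * correlation_distance(data[g1], data[g2])
                    for name, data in omics.items())
            dist[frozenset((g1, g2))] = d / total
    return dist


def average_linkage(genes, dist, n_clusters):
    """Plain agglomerative clustering with average linkage."""
    clusters = [[g] for g in genes]

    def between(c1, c2):
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy gene-level time profiles for two omics layers (invented values).
genes = ["A", "B", "C", "D"]
omics = {
    "rna": {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8.5], "C": [4, 3, 2, 1], "D": [8, 6, 4.5, 2]},
    "dnase": {"A": [0.1, 0.2, 0.3, 0.5], "B": [0.2, 0.3, 0.5, 0.6],
              "C": [0.9, 0.7, 0.4, 0.2], "D": [0.8, 0.6, 0.5, 0.1]},
}
weights = {"rna": 2.0, "dnase": 1.0}  # hypothetical weighting
clusters = average_linkage(genes, combined_distance(omics, weights, genes), n_clusters=2)
```

Changing the weights shifts how strongly each omics layer drives the grouping, which is the point of using a combined rather than a single-omics distance.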

4. Development of statistical methods for integrative analysis of omics data obtained in different sets of samples
We addressed the task of developing a set of theories, methods and algorithms for integratively co-analyzing diverse omics data in the context of the available prior knowledge and with the goal of discovering causal relationships.
Methods developed towards this objective include:
a) COmbINE. This algorithm infers causal structures from the integrative analysis of collections of datasets that measure overlapping sets of variables under different experimental conditions. The method assumes that there are common underlying causal relationships in the data measured in different experiments and can infer relationships between variables even if they have not been measured together.
b) The Causal Network Meta-Analysis (CNMA) integrates datasets that share experimental conditions and variables with the aim of creating a larger dataset that can be efficiently used in causal network inferential analysis. The goal in this case is to account for possible batch effects and remove them prior to causal network analysis.
c) Methods for integrating prior knowledge and data that learn causal networks in the context of prior causal knowledge, e.g. that X causally affects, directly or indirectly, Y. These algorithms can be used in combination with the previous ones, particularly with the Causal Network Meta-Analysis algorithms.
d) Pairing methods. We have developed a novel methodology to pair datasets from two omics technologies that have not been created on the same sets of samples. In particular, we have applied this to the integration of transcriptomics (T) and methylation (M) data from public repositories. This method relies on the assumption that paired T-M datasets should be those that have significant negative correlations. Significance is assessed by comparing against true paired samples and using permutation to obtain a null distribution.
e) HolistOmics. HolistOmics is a novel application of the Non Parametric Combination (NPC) methodology, specifically tailored for the idiosyncrasies of omics data. First, each data type is analyzed independently using the appropriate method. Currently, HolistOmics analyses static (one time point) RNA-seq data, using the voom+limma method, and static microarray data, using limma. In the future HolistOmics will be extended to analyze more types of data as well as data with measurements over several time points. The resulting p-values are combined employing the Fisher, Liptak and Tippett combining functions. The Tippett function returns findings that are supported by at least one omics modality. The Liptak function returns findings that are supported by most modalities. The Fisher function has an intermediate behavior between those of Tippett and Liptak.
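To make the difference between the three combining functions concrete, the following minimal Python sketch combines the per-omics p-values of a single feature. It is not the HolistOmics implementation: NPC assesses the combined statistic by permutation, whereas this sketch assumes independent p-values and uses closed-form null distributions instead. The function names and the example p-values are hypothetical.

```python
import math
from statistics import NormalDist

EPS = 1e-12  # keeps p-values away from 0 and 1 so logs and quantiles stay finite


def _clip(p_values):
    return [min(max(p, EPS), 1.0 - EPS) for p in p_values]


def fisher(p_values):
    """Fisher: -2 * sum(log p) is chi-square with 2k d.o.f. under independence."""
    p = _clip(p_values)
    half_stat = -sum(math.log(v) for v in p)  # the statistic divided by 2
    # Chi-square survival function for an even number (2k) of degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, len(p)):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def liptak(p_values):
    """Liptak: sum of normal quantiles, rescaled to a standard normal."""
    p = _clip(p_values)
    normal = NormalDist()
    z = sum(normal.inv_cdf(1.0 - v) for v in p) / math.sqrt(len(p))
    return 1.0 - normal.cdf(z)


def tippett(p_values):
    """Tippett: driven by the smallest p-value only."""
    p = _clip(p_values)
    return 1.0 - (1.0 - min(p)) ** len(p)


# Hypothetical feature: strong evidence in one omics layer, none in the other.
one_sided = [0.01, 0.9]
combined = {f.__name__: f(one_sided) for f in (fisher, liptak, tippett)}
```

For this input, Tippett yields the smallest combined p-value, Liptak the largest, and Fisher falls in between, which mirrors the behavior described in item e).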
HolistOmics is also part of the STATegRa R package from Bioconductor (https://www.bioconductor.org/packages/3.3/bioc/html/STATegRa.html , Deliverable 5.2)
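The permutation logic of the pairing method (item d above) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not specify the correlation statistic or the permutation scheme, so the Pearson correlation across genes, the shuffling of gene labels and all data values below are hypothetical, and the comparison against true paired samples is not reproduced.

```python
import math
import random


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: no measurable correlation
    return sxy / math.sqrt(sxx * syy)


def negative_correlation_test(expression, methylation, n_perm=2000, seed=0):
    """One-sided permutation test for a negative T-M correlation.

    Both vectors hold one value per gene, in the same gene order.
    Returns the observed correlation and the permutation p-value.
    """
    rng = random.Random(seed)
    observed = pearson(expression, methylation)
    shuffled = list(methylation)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the gene-to-gene correspondence
        if pearson(expression, shuffled) <= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Invented values for eight genes: one expression sample and two candidate
# methylation samples, of which only the first looks like a plausible pair.
expr = [9.1, 7.4, 6.8, 5.0, 3.9, 2.2, 1.5, 0.7]
meth_candidate_1 = [0.05, 0.20, 0.15, 0.45, 0.50, 0.80, 0.75, 0.95]
meth_candidate_2 = [0.40, 0.10, 0.85, 0.30, 0.60, 0.25, 0.70, 0.55]

r1, p1 = negative_correlation_test(expr, meth_candidate_1)
r2, p2 = negative_correlation_test(expr, meth_candidate_2)
```

Under this scheme a candidate pair is retained only when its negative correlation is stronger than what gene-label shuffling produces by chance.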

5. Development of statistical framework for multilayer integration and interpretation of omics data
We applied the different methods described under the two previous objectives to the STATegra data collection to a) probe the potential of different integration technologies to provide complementary analysis results and b) obtain a global model for B-cell differentiation that encompasses the dynamics of epigenetic, transcriptome, metabolome and proteome changes. We have shown that through this meta-integrative analysis approach we have been able to:
1. Provide a comparative analysis of the information content at each omics layer.
2. Highlight the most important epigenetic changes.
3. Infer a transcriptional network through the combination of DNase-seq and RNA-seq data.
4. Profile the dynamics of transcriptomics changes during the differentiation process, including the putative role of microRNA regulation.
5. Reveal the metabolic switch that follows cell-cycle arrest and onset of differentiation.
6. Verify some of the transcriptomics changes at the protein level.

We next report the most important conclusions of this integrative effort and indicate which STATegra integrative approaches have helped in the analysis leading to each conclusion (Deliverables D6.1 and D6.2).
1. Different omics modalities have different signal-to-noise ratios. We found proteomics data especially hard to integrate due to high variability and abundant missing values. Metabolomics was also noisy but informative and demanded new validation measurements. Analysis of noise levels within the framework of methods for experimental design provided the data for this conclusion.
2. Cell differentiation changes were primarily driven by transcriptional regulation. Hardly any changes can be observed in this short time-frame at the DNA methylation level. Chromatin accessibility changes moderately and TF regulation occurs on existing open structures due to changes in expression level. Data Fusion, Omics Component Analysis and Pairwise linear models support this conclusion.
3. Cell cycle arrest and differentiation result in a global reduction of gene and protein expression, concomitant with a global reduction in cell-size and chromatin accessibility. An important number of changes occur as early as 2 hours but the majority of the regulation is observed around 12 hours. Fewer changes occurred between 18 and 24 hours. Time-resolved omics clustering and OmicsClustering support this conclusion.
4. Gene expression changes follow four major regulatory patterns, namely monotonic induction, monotonic repression, and quadratic (up-down, down-up) progression. microRNA expression changes, however, are concentrated in the second part of the time course and point to a role in fine-tuning the transcriptional regulation, resulting in further down-regulation of protein levels (Figure 6, See document final report attached). Time-resolved omics clustering supports this conclusion.
5. Ikaros binds to a large number of differentially expressed genes, pointing to a global role of this TF as transcriptional repressor. Generic and specific TFs are part of the transcriptional network. Myc and Foxo1 (both with increased expression) are relevant players. We have created a gene-level regulatory network that includes TF and microRNA regulations for each gene, and that also records whether any chromatin modification (accessibility or methylation) occurs. ChIP-seq analysis and TF-regulatory network analysis support this conclusion.
6. Metabolism undergoes a massive and synchronized down-regulation. This is mirrored by genetic information processing pathways (RNA and DNA processes). On the contrary, most signaling pathways (Wnt, NFkB, TGF-beta, Foxo, Ras, Rap1, Notch) are induced, especially after 12 hours (Figure 4, See document final report attached). Paintomics and IntegRa analysis support this conclusion.
7. Metabolic reprogramming results in a down-regulation of glycolysis. The glucose transporter is also down-regulated. On the contrary, the autophagy pathway is up-regulated. This points to a role of autophagy in providing energy in a glycolysis-arrested environment and fits the cell-size reduction at 24 hours (Figure 8, See document final report attached). Integrative pathway analysis supports this conclusion.

8. Genes differentially expressed in the in vitro B3-cell differentiation largely overlap genes regulated in in vivo pre-B cells and those found dysregulated in leukemia patients, which suggests that the STATegra system and models can be a useful tool in the study of the biology of this disease. Integrative meta-analysis supports this conclusion.

6. Development of algorithms for experimental design
We have addressed the problem of experimental design and validation in multiomics projects through the following five developments.
1. We concluded that before any integrated multi-omics design can be proposed, a common understanding of the Figures of Merit (FoM, or performance metrics) across technologies should be obtained. Hence, our approach to providing methods for multi-omics experimental design has focused first on the definition of these FoMs. We have identified 7 FoMs that need attention in omics technologies: Sensitivity, Reproducibility, Selectivity, Detection Limit, Dynamic Range, Coverage and Identification. The Deliverable D7.1 report describes these terms and compares their meaning and significance for each type of omics data used in STATegra, namely metabolomics, proteomics, gene expression (RNA-seq), methods based on the identification of genomic regions (ChIP-seq and DNase-seq) and methods based on the detection of nucleotide variants (RRBS-seq). Figure 10 gives a comparative summary of relevant aspects within each of these FoMs across all these omics methods.

2. Comparative analysis of noise and missing values in multiomics data.
We investigated different metrics to evaluate noise levels and compared them across omics data types (Deliverable D4.1). Important aspects to take into account for a useful comparative analysis were the dimensionality of the data type and the possible occurrence of outliers. Our final choice was to use the standard error among replicates and the within-condition signal change as two measures of the variability versus effect size in each omics, and to compare them graphically (Figure 9A, See document final report attached).

From this figure we can conclude that RNA-seq very nicely combines a low standard error with a large change in signal, both useful properties to identify significant differences between experimental conditions. scRNA-seq has lower standard error due to the high number of replicates, but the lower sequencing depth results in narrower effect sizes. On the other hand, proteomics, ChIP-seq and RRBS-seq were among the omics technologies with the highest measurement variability, whereas metabolomics, DNase-seq and miRNA-seq displayed medium variability.
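
As an illustration of this kind of comparison, the following Python sketch summarizes one omics data set by two numbers: the typical standard error among replicates and the typical signal change. It is a simplified reading of the metrics described above, not the procedure of Deliverable D4.1. In particular, it assumes that "signal change" is the absolute difference between the means of two conditions, and it uses medians across features to limit the influence of outliers. The function names and the toy data are hypothetical.

```python
import math
from statistics import mean, median, stdev


def standard_error(replicates):
    """Standard error of the mean among replicate measurements of one feature."""
    return stdev(replicates) / math.sqrt(len(replicates))


def noise_summary(features):
    """Summarize an omics data set by (median standard error, median signal change).

    `features` is a list of (replicates_condition_a, replicates_condition_b)
    pairs, one pair per measured feature (gene, protein, metabolite, ...).
    """
    errors, changes = [], []
    for cond_a, cond_b in features:
        errors.append(mean([standard_error(cond_a), standard_error(cond_b)]))
        changes.append(abs(mean(cond_b) - mean(cond_a)))
    return median(errors), median(changes)


# Toy data: two features, three replicates per condition (values are made up).
toy = [
    ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    ([10.0, 10.5, 11.0], [10.2, 10.4, 10.9]),
]
print(noise_summary(toy))
```

A data type with a low first value and a high second value corresponds to the favorable corner of the plot, where differences between conditions are easiest to detect.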
An important aspect to consider in missing values is whether they are random or not, and whether they show association. Association can be sample-wise (a sample has many missing feature values) or feature-wise (a feature is frequently missing across samples). Randomly missing values are easier to impute than systematically missing values. Other considerations include the relationship between missing values and the limit of detection. When missing values arise at the platform's limit of detection, they can be imputed as zeros, but not otherwise.
The comparative analysis of the omics datasets revealed that: i) Missing values in NGS-based technologies are frequently related to the limit of detection and can therefore be addressed by deeper sequencing. ii) Missing values in proteomics were mostly sample-wise, indicating that some samples “failed” to be correctly measured. Also in proteomics, there was no clear association between missing values and the limit of detection. iii) We found no missing values in metabolomics, but by default the technology could not quantify many metabolites that were actually present in the cell extract. Here missing values are associated with coverage.
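
The distinction between sample-wise and feature-wise missingness can be checked with a few lines of Python. This is only a sketch under simple assumptions: the data are held as a list of rows (one row per feature, one column per sample), `None` marks a missing value, and the function name is hypothetical.

```python
def missing_fractions(matrix):
    """Return (per-feature, per-sample) fractions of missing values.

    `matrix` has one row per feature and one column per sample;
    a missing measurement is encoded as None.
    """
    n_features = len(matrix)
    n_samples = len(matrix[0])
    per_feature = [sum(v is None for v in row) / n_samples for row in matrix]
    per_sample = [
        sum(row[j] is None for row in matrix) / n_features
        for j in range(n_samples)
    ]
    return per_feature, per_sample


# Toy example (made-up values): the second sample failed almost completely.
toy = [
    [1.2, None, 0.9],
    [3.4, None, 3.1],
    [None, None, 5.0],
]
per_feature, per_sample = missing_fractions(toy)
print(per_feature)  # high values flag features that are often missing
print(per_sample)   # high values flag samples that failed to be measured
```

A sample-wise pattern, as reported above for proteomics, shows up as a few columns with a very high missing fraction, whereas missingness tied to the limit of detection tends to concentrate in low-abundance features.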

3. We have developed a new algorithm called MultiPower to perform power analysis in the context of multi-omics experiments. The goal of the algorithm is to determine the sample size per omics, i.e. the number of biological replicates per experimental group, that minimizes the total cost of the experiment subject to the constraint that a minimum statistical power must be achieved for each omics. We have formulated this as an optimization problem that can be solved with an integer linear programming algorithm. Optimal sample size estimation is based on: i) the size of the effect to be detected, ii) the variability range of the data (which can be obtained from related public domain data), iii) the choice of significance level, iv) the individual cost of producing each biological replicate, and v) the targeted minimum statistical power. With this information, the MultiPower method generates the objective function, incorporates the constraints of the optimization problem and returns the optimal sample size for each omics (Figure 9B, See document final report attached). This approach can also be used to validate existing experimental designs if data variability is computed from actual experiment samples.
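
The structure of this optimization problem can be sketched in Python. The sketch is not the MultiPower algorithm: it replaces the integer linear programming step with a direct search, uses a normal approximation to the power of a two-group comparison of a single feature, and ignores multiple-testing correction. All function names, parameter names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def power_two_groups(n, effect, sd, alpha=0.05):
    """Approximate power of a two-sided, two-group test with n replicates per group."""
    z_crit = _NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _NORMAL.cdf(effect / (sd * sqrt(2.0 / n)) - z_crit)


def min_replicates(effect, sd, alpha=0.05, target_power=0.8, n_max=1000):
    """Smallest number of replicates per group that reaches the target power."""
    for n in range(2, n_max + 1):
        if power_two_groups(n, effect, sd, alpha) >= target_power:
            return n
    raise ValueError("target power not reachable within n_max replicates")


def cheapest_design(omics, alpha=0.05, target_power=0.8, same_n=False):
    """Return (replicates per omics, total cost) meeting the power constraint.

    `omics` maps an omics name to a dict with keys "effect", "sd" and "cost"
    (cost per replicate). With same_n=True all omics share one sample size.
    """
    sizes = {
        name: min_replicates(p["effect"], p["sd"], alpha, target_power)
        for name, p in omics.items()
    }
    if same_n:
        largest = max(sizes.values())
        sizes = {name: largest for name in sizes}
    total_cost = sum(sizes[name] * omics[name]["cost"] for name in omics)
    return sizes, total_cost


# Made-up inputs: effect size, standard deviation and cost per replicate.
inputs = {
    "rna_seq": {"effect": 2.0, "sd": 1.0, "cost": 100.0},
    "proteomics": {"effect": 1.0, "sd": 1.0, "cost": 300.0},
}
print(cheapest_design(inputs))
print(cheapest_design(inputs, same_n=True))
```

When each omics is allowed its own sample size, the constraints are independent and the cheapest design is simply the smallest adequate sample size per omics. Forcing a common sample size raises the cost, because the noisiest data type dictates the number of replicates for all.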

4. Multi-omics dataset simulation algorithm. This pioneering tool is the first to simulate several interconnected omics data types. The algorithm generates read counts for RNA-seq, miRNA-seq, ChIP-seq, DNase-seq and RRBS-seq, taking into account the dynamic range and error model of each omics type. It allows for flexible experimental designs with different experimental conditions, time points and numbers of replicates. Most importantly, the tool defines regulatory mechanisms between genes and the other omics features, so that the simulated data behave as if they originated from the same biological system. The simulation algorithm is part of the STATegRa R package.

5. Methods for Optimal Experimental Design when prior information is available. A new method for optimal design based on a metabolic flux model of the biological system is still under development. This method combines a dynamic FBA approach with a novel node activity metric developed in the MeTRA algorithm. This metric combines the expression values of each gene and its regulators to provide an “activity” score for the gene in the metabolic network. The experimental design optimization strategy relies on the analysis of flux variability when fixing all fluxes but one, to infer which parts of the network have more uncertainty and require additional data. Similarly, by comparing predicted metabolic changes with upcoming metabolomics data we can also identify parts of the network where more information is needed.

7. Development of methodologies to feed back statistical results into current knowledge
We first investigated the recovery of experimental design information from published literature using a text-mining approach. For this we compiled a dictionary of experimental design/results terms and phrases that contained a total of 86748 items describing perturbation types, perturbation results, expression-related methods, statistical techniques, investigation techniques, experimental design, NGS-methods and organismal anatomy (Deliverable D2.3 and D2.4). We applied this dictionary to mine the literature of experiments related to the STATegra System under Study (SuS). Overall, the information on biological systems and measurement methods was extractable with reasonable recall and precision, while the perturbation type and result were, on average, not extracted with reasonable precision, although the individual term matches were of good to high quality. In addition, the statements about result quality were often hidden in figures or tables, which present additional challenges for literature mining. When combining the results from the four stages of our analysis, the number of true, meaningful statements was extremely low.
Based on these results, we propose to develop rule-based or machine-learning-based systems that focus on methods and results sections and learn to distinguish cause (perturbation type) and effect (perturbation result). Moreover, to extract the quality of results, such as significance, we propose the integration of dedicated parsers for figures and tables.
The main goal of our proposed method is to facilitate the feedback of experimental results into the public domain. Our strategy has been to define structures to capture information on Study Design, Data Attributes, Biological Samples, Experimental Data Production, Data Processing and Analysis, and to use existing ontologies to support annotations. For example, the Study Design aspect has a field Objective that can accept terms from several ontologies including FMA, MeSH and OBI. The knowledge being represented and fed back into the public domain consists of logic statements, rules, and weighted object associations that can be represented in graph networks with boolean or differential equation type interaction dependency definitions. The representation is mapped to the corresponding data analyses (0-n) from which it results.
We have implemented a generic data model (Figure 11, See document final report attached) that connects statements about biological mechanisms such as “human miRNA hsa-mir-20a regulates the mRNA of HIF1A in B-cell differentiation during stage F’C” with the corresponding evidence, including the full evidence production process. Based on the generic knowledge representation model we propose a simple communication standard format aligned with current trends in life-science information communication, the ISA-TAB-Knowledge format. This format is based on the popular three-tier, tab-delimited ISA-TAB format promoted for data exchange by the MIBBI biological standards community and now accepted by multiple public data repositories for deposits. Its three tables describe the context of data production, the samples used therein and the actual experiments. In our proposal an additional table will contain the derived knowledge in terms of associations between two biological concepts, such as a gene and a disease or a miRNA and its target mRNA. We have extended the format by two additional tables to enable the full reference between knowledge and evidence. The Analysis table is similar in structure to the Assay table but describes the data analysis process. The Data table provides meta-information about the parameters used in the knowledge representation. Further details on the ISA-TAB tables used in the feedback form proposal can be found in the D2.3_D2.4 documents.

8. Validate statistical methods and create procedures for validation of statistical results
One of the most interesting aspects of the STATegra project is the possibility of experimental validation of project results. As STATegra aims to develop multi-omics methods at the levels of experimental design, statistical analysis and visualization, we have performed validation of all these different aspects. Validation has consisted of i) the identification of validation experiments, ii) performing these assays, and iii) analysis/interpretation of the results in the context of the existing models. These validation experiments are described in detail in Deliverables D8.1, D8.2 and D8.3.
1. For validation of the STATegra experimental design we applied our MultiPower method to our own data. This analysis indicated that gene expression experiments were correctly dimensioned but other omics were underpowered. The one furthest from the recommended sample number was metabolomics, which was therefore selected for additional experiments. Leiden University has developed analytical metabolomics methods that extend to central and carbon metabolism to carry out the validation experiments for the project. Moreover, both internal and external metabolites have been obtained.
2. Validation integration model: prediction of the molecular elements of the system.
We aimed to validate specific molecular components identified as relevant in the B-cell model. We anticipated different types of validation experiments (knock-out experiments, verification of microRNAs, specific metabolic interventions). These and additional experiments have been carried out:
a) Knockdown or over-expression of Ikaros co-factors or Ikaros target genes. Both the gene expression and the transcriptional network analysis revealed the importance of Myc in our SuS. To test the model that Ikaros-mediated silencing of Myc expression is an important step in the sequence of Ikaros-imposed reconfiguration of gene expression networks, ICL has provided RNA-seq data derived from B3 cells transduced with Ikaros or Ikaros plus Myc. KI is exploring the impact of Ikaros-resistant Myc expression.
b) Transduction or inhibition of validated miRNAs and genes regulated through methylation changes. IDIBELL has performed two sets of validation experiments within WP8. First, using quantitative RT-PCR with LNA probes for miRNAs and standard primers for their target genes, they profiled expression levels over the time course to validate the data obtained in WP3 and the results of the integrative analysis in WP6. The selection of miRNAs and corresponding targets has been done in collaboration with CIPF and KI. Specifically, they tested 10 up-regulated miRNAs and 10 down-regulated miRNAs and two targets for each of them. The second validation experiment involved bisulphite sequencing of a selection of genes displaying significant methylation changes.
c) Specific metabolic interventions such as 13C tracer experiments to zoom into the dynamics of metabolic pathways. ICL has conducted sample preparation for metabolomics, full time course with PCR validation of Ikaros target genes prepared in triplicate, sampling intracellular and extracellular metabolites, as indicated previously. The samples have been sent to Leiden for processing. ICL has also prepared samples for metabolic tracer analysis of glucose, glutamate and Acetyl-CoA (ICL).
d) Validation of phosphorylation activities. To test predictions of omics models derived from the integrative analysis of STATegra data, ICL has carried out Ikaros induction and control time course experiments with PCR validation of Ikaros target genes for the analysis of protein expression and phosphorylation. The samples were used for western blotting for Foxo1, Foxo3a, Akt, Akt phosphorylation, S6 ribosomal protein and S6 ribosomal protein phosphorylation. The prediction of reduced mTOR activity in B3 cells in response to Ikaros induction was validated by the finding of a time-dependent reduction in the phosphorylation of S6 ribosomal protein following Ikaros induction.
e) Validation of open chromatin changes by single-cell ATAC-seq. The changes in open chromatin detected by DNase-seq during the time course could be happening in all cells or only in a subset of cells. In particular, whether a small open chromatin signal means that a region is open at a low level in all cells, or fully open in only a few cells, affects both the interpretation of the significance of small changes and the choice of regions to focus on in the integrative analysis. Therefore RUC carried out single-cell ATAC-seq at the beginning and end time-points of the Ikaros induction in B3 cells to identify the regions that change robustly in a large fraction of cells. The data is being analyzed jointly with the single-cell RNA-seq data to pinpoint those regulatory elements that change in conjunction with changes in expression.

3. Validation integration model: predictions on functional components of the system.
The integrative analysis of the STATegra data indicated several metabolic and signaling pathways with strong and consistent regulation patterns across the differentiation process. The validation of these functional elements of the system has been achieved by leveraging the results of the metagenomics analyses performed in WP8, which in turn were also motivated by the ongoing findings of the statistical analyses. We obtained two validation results:
a) Validation of B3 cell metabolic switches in primary mouse B cell progenitors. We concluded that a number of Ikaros-regulated pathways are reprogrammed not only in the B3 cell line model but also in primary mouse B cell progenitors. These include glycolysis, cell cycle, autophagy and others. In addition, the time-resolved nature of STATegra data has allowed us to assemble the temporal order in which these pathways are reprogrammed with respect to each other, as well as the sequence of gene expression changes within individual pathways.
b) Validation of STATegra data in human leukemia cells. In WP5, FORTH developed a meta-data integration analysis strategy to identify genes and pathways regulated in leukemia. We leveraged these results to validate the STATegra B3 cell system by investigating leukemia-related processes. Interestingly, an important number of genes frequently dysregulated in leukemia patients change their expression during the STATegra B3 cell differentiation course. Moreover, significant signaling and cancer pathways differentially regulated in differentiating B3 cells were also enriched among leukemia genes, and the expression pattern of genes in these pathways is highly similar (see, for example, the Jak-Stat signaling pathway in Figure 12, See document final report attached). Because STATegra's integrative omics approach provides a deep understanding not only of individual Ikaros target genes but also of the pathways they belong to and how these pathways are interconnected, it is possible that this approach will identify potential therapeutic targets in IKZF1-mutated B-ALL.

Potential Impact:
1. Impact
Based on available market reports, it is currently difficult to separate the segment for analysis of multi-omics (or X-omics) data from the major NGS application areas. However, we expect that in the coming 12 months 40-60% of the users will require support for RNA-Seq analysis and 5-15% will require support for epigenomics data analysis. Epigenomics data analysis is mostly driven by an increasing uptake of bisulfite sequencing and, to a lesser extent, of ChIP-seq data. Although the fraction of users that will incorporate these and other omics data types into their multi-omics analysis is difficult to estimate, it is clear that the STATegra project has anticipated a market that is now beginning to emerge and will continue to grow. Biomarker discovery within pharmaceutical research, and to some extent in basic research facilities, is the main driver of the demand for multi-omics data analysis. The global biomarkers market was estimated at $8.09 billion in 2014 and is expected to reach $18.30 billion in 2020, growing at a robust CAGR (Compound Annual Growth Rate) of 14.6% (2014–2020) [1]. It is estimated that 43.1% of this market is driven by genomics, corresponding to $3.49 billion in 2014 and a 17% CAGR [2]. Key opinion leaders and researchers believe that ‘cross-omics’ approaches will be the key to effectively understanding and managing complex diseases in the future.
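
The market figures quoted above are internally consistent, as a short arithmetic check in Python shows. The sketch only uses the numbers given in the text; the function name is hypothetical.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Global biomarkers market: $8.09 billion (2014) to $18.30 billion (2020).
growth = cagr(8.09, 18.30, 2020 - 2014)
print(f"CAGR 2014-2020: {growth * 100:.1f}%")

# Genomics-driven share of the 2014 market (43.1%), in billions of dollars.
genomics_2014 = 0.431 * 8.09
print(f"Genomics share in 2014: ${genomics_2014:.2f} billion")
```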
STATegra identified a number of scientific, societal and economics impacts that were expected to be met through the activities of the project. Here we summarize these expected impacts and how they have been realized.
(a) “use high-throughput technologies to generate data for elucidating the function of genes and gene products in biological processes”: STATegra has developed statistical methodologies enabling causal characterization of biological processes that shed light on the functionality of genes and proteins. Therefore STATegra will impact the efficient use of high-throughput technologies to this end.
(b) “New and improved statistical tools allowing better use, analysis and interpretation of large scale, multivariate and/or small-sample -omics data and better experimental design”. STATegra has delivered new statistical methodologies that (i) will integrate different (multivariate) -omics data types, (ii) will be scalable for large samples, and (iii) will allow optimal experimental design by the integration of prior information and publicly available data.
(c) “The new methods should meet the scientific needs and have the potential for rapid uptake in practice”. STATegra was designed to target the real scientific need for statistical methods that provide a statistical basis for a number of challenges inherent in real scientific questions. By incorporating leading bioinformatics SMEs such as Qiagen Aarhus, we ensure the development of user-friendly implementations. Moreover, by capitalizing on its large current user base, Qiagen Aarhus is poised to deliver a rapid uptake of the statistical methodologies produced by STATegra.
(d) STATegra has addressed (1) the identification and automation of those processes that act as time bottlenecks in post-processed multi-level data analysis, (2) the design of methodologies that allow non-bioinformaticians to analyze (self-produced) omics data, and (3) the design of guidelines and methodologies for proper multilayer experimental designs. STATegra is well positioned to provide solutions in these three aspects, and therefore has an impact on the reduction of data analysis costs.
(e) “clinical use of -omics approaches and the analysis of their outcomes”. STATegra has disseminated the user-friendly version of the statistical methodologies through the Qiagen Aarhus Systems Biology Platform. Since the biomedical and clinical researcher is a core user profile for Qiagen Aarhus, STATegra favourably impacts and facilitates the clinical use of -omics approaches and their outcomes.
(f) “In the post-genome era the -omics technologies (genomics, proteomics, structural biology, epigenomics, interactomics, metabolomics, pharmacogenomics, etc.) enable new innovative approaches in diagnosis, drug development, and individualised therapy”. Individualized therapy or drug development can, in our view, only be achieved by efficient design of statistical methodologies that enable integration of several different data types, since as a rule a single data type, such as a genetic variant, is insufficient for understanding a biological process for drug development or for providing personalized therapy. STATegra shows how to use different omics layers to generate comprehensive models of molecular systems. We expect that this innovation will impact the utilization of the multi-omics approach as an effective strategy for the further development of precision medicine.
(g) “supporting more topics aimed at generating knowledge to deliver new and more innovative products, processes and services”. STATegra delivers innovative outcomes aligned with the present needs of the research community (tools for “omic data integration and experimental design”). The outcome has three innovative aspects. First, a commercial software tool oriented to biomedical researchers with no coding expertise; such software will affect numerous different research profiles, as it will provide user-friendly tools for biologists, clinicians and others. Furthermore, a final product will be delivered after completing a pilot and using it for demonstration and validation. Second, STATegra innovates by developing new model-oriented analytical methodologies. This innovative aspect impacts the method-development bioinformatics community by providing novel integrative methodologies embedded within open-source (R) code that has been made publicly available for the community to test and improve. Third, the project has considered both a commercial version (a user-friendly version designed for biologists and clinicians) and an open-source tool (for instance, R packages oriented to method developers). By doing so we are able to obtain the best of both communities: (1) the user-oriented efficiency of commercial tools, and (2) the enhanced support, development and accountability of open-source solutions.

2. Dissemination Activities
STATegra has maintained an important number of dissemination activities consisting of workshops, courses, publications, participation in conferences, internet communication and publication of data and graphical material. Below we summarize this intensive dissemination effort.
a) Workshops. We organized four workshops and one summer school:
No. | Place | Event | Month | Date
1 | Barcelona | High-Throughput Omics and Data-Integration Workshop | 5 | 13-15 February 2013
2 | Heraklion | Data Integration Workshop SMODIA2014 | 26 | 12-14 November 2014
3 | Valencia | Data Integration Workshop SMODIA2015 | 36 | 14-16 September 2015
4 | Amsterdam | Workshop in Experimental pipelines and post-analysis in NGS and omics data | 18 | 24-25 March 2015
5 | Benicassim | Summer School | 36 | 7-11 September 2015

b) Publications. The project has so far produced 13 scientific publications in high-impact biomedical and biostatistical journals, and 4 book chapters. 4 more papers are under review and 19 additional manuscripts are being drafted.
c) Special issues. We have edited one special issue of the journal BMC Systems Biology (January 2014), in connection with the Barcelona Workshop “High-throughput Omics and Data-Integration Workshop”. Additionally, we are editing two special issues of BMC Bioinformatics in connection with SMODIA14 and SMODIA15.
d) Courses. STATegra has participated in the following courses:

Activity | Organizer | Partner | Type | Date
Enzymes and Multienzyme Complexes acting on Nucleic Acids | Giessen University | LMU | Workshop | 17-20 September 2013, Germany
Case Studies of Causal Discovery with Model Search | Carnegie Mellon University | FORTH | Workshop | 25-27 October 2013, US
International Course on Massive Data Analysis: Transcriptomics | CIPF | CIPF | Course | 10-14 March 2014, Spain
Bioinformatics and Oncology | Thessaloniki, Greece | FORTH | Conference | 10 April 2014
EMBO/FEBS Lecture Course on Nuclear Proteomics | EMBO, FEBS | LMU | EMBO Course | 17-22 May 2014
International Course on Massive Data Analysis: Transcriptomics | CIPF | CIPF | Course | 9-13 March 2015, Spain

e) Conferences. We have presented STATegra results at 6 international conferences. In 3 cases the STATegra coordinator presented the project in a keynote lecture.
f) Data publication. STATegra data collection, both with raw and processed data, is available from the GEO and xxx data repositories.
g) Internet. We have maintained an active web site (http://www.stategra.eu/) and twitter (https://twitter.com/STATegra) and Facebook accounts (STATegra account).
h) We have started to publish video lectures in the STATegra youtube channel. Some examples are:
https://www.dropbox.com/s/40sube38u87kwxh/CIPF_Visualization%20Rafa.mp4?dl=0
https://www.dropbox.com/s/wuzoe9j76r8wa7w/NetworkReverse_Venky.mp4?dl=0
Additionally we have maintained scientific contact with several EU projects such as SeqAhead (Cost Action), Epiconcept (Cost Action), Mimomics (FP7), Radiant (FP7), ALLbio (FP7), MeDALL (FP7), FANTOM6, ENCODE, DEANN Marie Curie, Frailomic (FP7), CombiMS (FP7), Casym (FP7), REACTION (FP7), AirPROM (FP7), METSY (FP7), CANCERMOTISYS (FP7), CONGANS (FP7), EpiGeneSys (FP7), Profolic (FP7).

3. Exploitation of results
In accordance with the philosophy of the project, STATegra followed a double implementation/exploitation track:
ACADEMIC TRACK.
Statistical methods developed in the project have been implemented as freely accessible software packages made available to the scientific community through the Bioconductor project or other public software repositories. This academic track includes the following products, extensively discussed in the S & T section of this Final Report:
STATegRa R package. Integrates several of the data integration tools developed in STATegra, such as OmicsPCA, OmicsClustering and HolistOmics. Additional methods, such as MultiPower and the Cross-querying system, are being implemented.
NextmaSigPro. R package for time series analysis of count-based data.
RGMatch. Python package for customizable mapping of genome coordinates to genes.
Mens x Machina. R package for causal discovery analysis.
Paintomics. Web tool for integrative multi-omics visualization and analysis over KEGG pathways.
Other STATegra methods will follow a similar academic dissemination track as algorithms get published.
COMMERCIAL TRACK.
The commercial exploitation has consisted of the development of the Qiagen Aarhus Systems Biology platform, which integrates several of the STATegra methods and additional features into a comprehensive commercial x-omics analysis product.
The Systems Biology platform builds on the CLC Genomics Workbench (GxWB) framework with its flexible plugin architecture. All extensions were initially implemented as plugins. Some (e.g. the ChIP-seq analysis tools) have already moved into the standard distribution and have become an integral part of the GxWB, where they are accessible for download via the plugin manager. From a user perspective, the plugins work similarly to tools that are part of the standard distribution. Figure 1 gives a schematic overview of the Systems Biology platform architecture, distinguishing between extension plugins and interface plugins. Extension plugins run within the platform itself and extend secondary NGS data handling, analysis and visualization capabilities (e.g. ChIP-seq, Track Visuals), while interface plugins provide access to externally hosted services (IPA, Biobase, R/Bioconductor, Biomax). Figure 2 shows the plugins that are available for the latest GxWB version 8.5 via the plugin store, and hence currently under commercial exploitation.

Figure 1. Systems Biology platform architecture - Schematic overview. The GxWB interfaces to IPA, Biobase, TRANSFAC, R/Bioconductor, and the Biomax KnowledgeBase are illustrated above. Plugins that add functionality within the platform are listed on the lower left. The screenshot illustrates the track visualisation and calculus for visual analytics of heterogeneous omics datasets.

Figure 2. Screenshot of the GxWB plugin manager showing some of the STATegra related plugins installed. The Toolbox on the lower left shows the Epigenomics Tools for TransFac, ChIP-Seq (Histone, Transcription Factors, Advanced Tools) and Annotate with Nearby Gene Information.
The STATegra related plug-ins that have been implemented and are currently under exploitation include:

Biomax KB - bioXM interface. Biomax develops and hosts the semantic KnowledgeBase (KB) for the STATegra project, containing integrative data and analysis results in combination with existing knowledge and annotations. The interface between the GxWB and the KB makes it easy to query the KB through an intuitive interface based on genomic regions and lists of genes in the GxWB. For example, after selecting a region of interest, the known genes in that particular region are retrieved and displayed in a table. From this list of genes, users can then look up known interaction partners, upstream regulators or downstream targets.
R-executer. The R plugin provides a mechanism to execute R/Bioconductor packages from within the GUI of the GxWB environment and hence incorporates the statistical packages developed by STATegra into the systems biology platform. In order to connect arbitrary R scripts with the parameterization by the GUI wizards, and their input/output with the corresponding data structures, the R code has to be “wrapped” with a few special lines of code that enable seamless communication between the GxWB and the R script. The final version of the R-executer includes the option to directly import example data for testing and several code examples for R-script parameterization and error handling. Simple test and demonstration scripts are included. R packages from each of the statistical partners are available, i.e. maSigPro (CIPF) and tools from the STATegRa package (led by partner KI). A popular Bioconductor package (edgeR) is also readily available to be executed via the GxWB. Finally, the generation of graphical reports as PDF files is supported.
Biobase TRANSFAC. With the TRANSFAC extension users can search DNA sequences for putative transcription factor binding sites, which is an important downstream analysis step after calling peaks from ChIP-seq data or finding open chromatin regions from DNase-seq data. It relies on the TRANSFAC knowledge base, which contains published data on eukaryotic transcription factors and miRNAs, their experimentally proven binding sites, and regulated genes. The extensive compilation of binding sites provides the most comprehensive data set of transcription factor-gene interactions available. The same data also forms the basis of derived positional weight matrices, which are used with the included Match™ tool. Data and tools are seamlessly integrated and produce annotation tracks for further analysis and visualization. For the final release, the very latest database version (BIOBASE 2015.2) was incorporated into the plugin.
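To illustrate how a positional weight matrix can be used to search a DNA sequence for putative binding sites, the following is a minimal sketch of a generic log-odds scan. It is not the Match™ algorithm and uses no TRANSFAC data: the count matrix, the uniform background, the pseudocount and the score threshold are all assumptions chosen for illustration.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
# The numbers are illustrative only and are NOT taken from TRANSFAC.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
BACKGROUND = 0.25   # assumed uniform base frequencies
PSEUDOCOUNT = 0.5   # assumed smoothing to avoid log(0)


def log_odds_matrix(counts, pseudocount=PSEUDOCOUNT, background=BACKGROUND):
    """Convert per-position base counts into log2 odds against the background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        if any(base not in "ACGT" for base in site):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[i][base] for i, base in enumerate(site))
        if score >= threshold:
            hits.append((start, site, score))
    return hits


pwm = log_odds_matrix(COUNTS)
hits = scan("TTACGTGGACGTNN", pwm, threshold=5.0)
for start, site, score in hits:
    print(start, site, round(score, 2))
```

In a workflow like the one described above, each hit would become one feature on an annotation track. A production tool would also scan the reverse strand and use calibrated, matrix-specific cut-offs, which this sketch omits.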

Ingenuity Pathway Analysis (IPA). The interface allows direct upload of RNA-seq/gene-expression data from the GxWB into IPA for downstream biological interpretation. By linking the data with high-quality information regarding molecular interactions, cellular phenotypes and disease processes, it helps researchers understand complex biological systems. Besides links to biological processes, pathways and the associated literature, it also identifies upstream regulators in regulatory networks (Figure 3).

Figure 3. An example screenshot of an analysis using IPA
Peak caller. To allow integration with chromatin-region-based omics, Qiagen has developed a proprietary peak caller algorithm. The Qiagen shape-based peak caller is probably unique in its combination of ease of use, flexibility and accuracy, which stems from its ability to build optimized filters for different analysis tasks and datasets. While the general peak-shape recognition algorithm is still available, the functionality has been cast into two specialized tools for the most popular use cases, namely the analysis of narrow peaks from transcription factor data and of broad peaks resulting from histone ChIP-seq.
Track Visualisation. When dealing with NGS data, a reference genome provides a central coordinate system for heterogeneous data integration in the form of “genomic tracks”. The simultaneous display of many tracks creates “Vertical-Browser-Bloat”, that is, the problem of many tracks quickly exhausting both the available screen estate and the cognitive resources of the users. We followed the concept of integrated visual analytics, aiming at a tight coupling of visualization and analytical tools in an interactive and intuitive way. The new track visualisation is a hierarchically structured interactive visualization framework capable of visually overlaying several tracks (Figure 4).
TrackCalculus. The TrackCalculus handles the general computational aspects of the analysis and serves as a template for rapid prototyping and development of integrated analysis and visualization pipelines. For interactive analysis of the data, the TrackCalculus offers a minimal yet sufficient set of operations for data transformation (such as addition, ratio, log, smoothing with variable window size and thresholding). These operations can be freely combined into more complex mathematical expressions. In this way, linear pipelines of data transformation and analysis on tracks can be quickly established without the need for coding. As an application example, a simple peak caller is built by mapping, normalizing and smoothing the original data, and finally thresholding the log-odds ratios between two tracks representing two time points of STATegra DNase-seq data. The TrackCalculus has proven to be a powerful tool for bioinformaticians carrying out explorative research on heterogeneous x-omics data in the form of genomic tracks.
Map to Proximal Genes. Inspired by the RGmatch Python script developed by partner CIPF, we created a plugin called “Map to Proximal Genes” as an update to the existing “annotate with nearby genes” tool available in the GxWB. While featuring a more fine-grained, biologically inspired classification, the goal of the GxWB implementation was to arrive at a simple-to-use tool that associates regions resulting, e.g., from peak finding with a single gene for downstream analysis. It was tested by partner CIPF, with positive feedback on the functionality and suggestions regarding improvements to the manual and tutorials.
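The kind of pipeline described for the TrackCalculus can be sketched in a few lines. This is only an illustration of the idea of composing normalizing, smoothing, log-ratio and thresholding steps, not the plugin's implementation: the function names, the mean-based normalization, the pseudocount, the window size, the cut-off and the toy coverage values are all assumptions, and the mapping step is taken as already done.

```python
import math


def normalize(track):
    """Scale a coverage track to mean 1 so tracks of different depth are comparable."""
    mean = sum(track) / len(track)
    return [value / mean for value in track] if mean else list(track)


def smooth(track, window):
    """Centered moving average; windows are truncated at the track edges."""
    half = window // 2
    out = []
    for i in range(len(track)):
        chunk = track[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def log_ratio(a, b, pseudocount=1.0):
    """Position-wise log2 ratio of two tracks, with a pseudocount against zeros."""
    return [math.log2((x + pseudocount) / (y + pseudocount)) for x, y in zip(a, b)]


def threshold(track, cutoff):
    """Boolean mask of positions whose value reaches the cutoff."""
    return [value >= cutoff for value in track]


def to_intervals(mask):
    """Merge consecutive True positions into half-open (start, end) intervals."""
    intervals, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(mask)))
    return intervals


# Toy per-position coverage for two time points (made-up numbers).
early = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
late = [1, 1, 1, 1, 8, 8, 8, 1, 1, 1, 1, 1]

# The whole "peak caller" is one composed expression over the two tracks.
peaks = to_intervals(
    threshold(
        log_ratio(smooth(normalize(late), 3), smooth(normalize(early), 3)),
        cutoff=0.5,
    )
)
print(peaks)
```

Because every operation takes tracks and returns a track, the steps can be reordered or swapped without touching the others, which is the property that makes such expressions convenient for exploratory work.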
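The association of a region with a single gene can be illustrated with a minimal sketch. It is neither the RGmatch rule set nor the GxWB plugin: the three-way classification (overlapping, upstream, downstream), the distance cut-off, the nearest-gene rule and the gene names are assumptions made for the example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # 0-based, half-open interval [start, end)
    end: int
    strand: str  # "+" or "-"


def distance_and_label(gene: Gene, start: int, end: int) -> Tuple[int, str]:
    """Distance from a region to the gene body and a coarse positional label."""
    if start < gene.end and gene.start < end:
        return 0, "overlapping"
    if end <= gene.start:  # region lies to the left of the gene
        return gene.start - end, "upstream" if gene.strand == "+" else "downstream"
    # region lies to the right of the gene
    return start - gene.end, "downstream" if gene.strand == "+" else "upstream"


def map_to_proximal_gene(
    chrom: str, start: int, end: int, genes: List[Gene], max_distance: int = 10_000
) -> Optional[Tuple[str, int, str]]:
    """Return (gene name, distance, label) for the closest gene, or None."""
    best = None
    for gene in genes:
        if gene.chrom != chrom:
            continue
        dist, label = distance_and_label(gene, start, end)
        if dist <= max_distance and (best is None or dist < best[1]):
            best = (gene.name, dist, label)
    return best


# Hypothetical annotation and peak regions, for illustration only.
GENES = [
    Gene("geneA", "chr1", 1000, 2000, "+"),
    Gene("geneB", "chr1", 5000, 6000, "-"),
    Gene("geneC", "chr2", 100, 900, "+"),
]

for region in [("chr1", 1500, 1600), ("chr1", 4000, 4200), ("chr1", 50000, 50100)]:
    print(region, "->", map_to_proximal_gene(*region, GENES))
```

The linear scan over all genes keeps the sketch short. With a genome-scale annotation, a sorted list per chromosome or an interval index would be the usual choice.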

Figure 4. Track Visualisation

List of Websites:
http://stategra.eu
