
Cloud Based Software Solution for Next Generation Diagnostics in Infectious Diseases

Final Report Summary - CLOUDX-I (Cloud Based Software Solution for Next Generation Diagnostics in Infectious Diseases)

The ClouDx-i project focussed on developing new computational techniques to aid diagnosis and prognosis of neonatal infection. Traditional wet-lab culture techniques often fail to recognise particular bacterial strains, and they do not take the resulting host response into account. The principal scientific and technological objective of the project was therefore to develop cloud computing techniques that can support rapid molecular diagnosis of infection, and to embed these techniques in an efficient, usable, auditable and secure end-to-end diagnostic process.

During the project all objectives were achieved, including collection and DNA extraction of 12 clinical isolates, which were sequenced using next generation sequencing techniques. A comprehensive bioinformatics pipeline was constructed to assemble, annotate and perform comparative genomics analysis of the samples. Whole genome sequencing of these isolates and comparison to other sequences in open databases enabled detailed characterisation of genome structure, virulence factors and SNP mapping. This allowed us to design hybridisation probes capable of detecting and identifying sepsis-causing pathogens in whole blood samples. The SNP maps also made new studies possible within the project, such as a comparison of genetic changes between fresh isolates and pathogens cultured for several generations. This comparison allowed us to assess the validity of using cultured pathogens to characterise the virulence and antibiotic resistance of these bacteria.
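
The comparison between fresh and cultured isolates can be pictured with a small sketch. It is only an illustration, not the project's pipeline: it assumes the two sequences have already been aligned to the same length, and the function name `snp_positions` and the toy sequences are hypothetical.

```python
def snp_positions(reference: str, sample: str) -> list[int]:
    """Return 0-based positions where two pre-aligned sequences differ
    by a single-base substitution. Gaps and ambiguous bases are skipped."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    bases = set("ACGT")
    return [
        i
        for i, (r, s) in enumerate(zip(reference.upper(), sample.upper()))
        if r != s and r in bases and s in bases
    ]


# Toy data (hypothetical): a fresh isolate and the same strain after culturing.
fresh = "ACGTACGTAC"
cultured = "ACGTTCGTAA"
print(snp_positions(fresh, cultured))  # [4, 9]
```

A real SNP map would be built from whole-genome alignments and would also record the substituted bases and their annotation; the sketch only shows the position-by-position comparison.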

Following the DNA extraction, sequencing, assembly and annotation of 12 samples from microbes isolated from neonates at the Royal Infirmary Edinburgh, six papers were published in the Genome Announcements (genomeA) journal at http://genomea.asm.org/. The genome announcements are listed alongside a number of other academic papers published in connection with this work package. These publications provide important metadata on the genomic sequences and clinical context of the sepsis-causing microbial pathogens, which is essential for future reproducibility and downstream analysis.

In addition, we designed and published a classification procedure for use with microarray transcription data in a population of hospitalised neonates with or without bacterial infections. For this, we specified algorithm design parameters for feature selection, classifier training, classifier testing and independent classifier validation. In our Nature Communications paper we established a 52-gene RNA-based classifier for bacterial sepsis in human neonates. This involved training a ROC-based classifier on whole blood samples (RNA extracted) from 62 patients, 35 of whom were healthy controls and 27 of whom had a bacterial infection confirmed by culture. These samples were used to train and internally test the classifier. As part of the same study, the classifier was tested on new samples from the same population of Edinburgh Royal Infirmary neonates (10 control samples, 16 bacterial infection samples, 3 viral infection samples). The classifier was also validated on a different microarray platform, using a subset of the original samples (18 bacterial infections, 24 controls) together with a set of 30 new ‘suspected’ bacterial infection cases.
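
The published procedure is not reproduced in this summary. As a rough sketch of what ROC-based feature selection can look like, the code below ranks genes by the area under the ROC curve (AUC) of their expression values between controls and infected samples. This is an assumption made for illustration: the function names, the direction-agnostic scoring and the toy data are hypothetical and are not taken from the paper.

```python
def auc(control_values, infected_values):
    """Area under the ROC curve for one gene: the probability that an
    infected sample has a higher value than a control (ties count half)."""
    wins = 0.0
    for x in infected_values:
        for c in control_values:
            if x > c:
                wins += 1.0
            elif x == c:
                wins += 0.5
    return wins / (len(infected_values) * len(control_values))


def rank_genes(expression, labels):
    """Rank genes by how well they separate the two groups.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of 0 (control) or 1 (infected), one per sample
    """
    scores = {}
    for gene, values in expression.items():
        controls = [v for v, y in zip(values, labels) if y == 0]
        infected = [v for v, y in zip(values, labels) if y == 1]
        a = auc(controls, infected)
        # Up- and down-regulated genes are treated as equally informative.
        scores[gene] = max(a, 1.0 - a)
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): three genes measured in 3 controls and 3 infected.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "geneA": [1, 2, 3, 4, 5, 6],  # separates the groups perfectly
    "geneB": [1, 5, 2, 4, 3, 6],  # partial separation
    "geneC": [3, 3, 3, 3, 3, 3],  # uninformative
}
print(rank_genes(expression, labels))
```

In practice the ranking would be computed on the training samples only, with the test and validation sets held back so that the reported performance is not biased by feature selection.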

Software pipelines devised by the project team to scan and compare the genomes of major pathogens for unique molecular identifiers have been distributed across a master-slave cluster topology to allow large scale “big data” processing. The team experimented with various strategies for building bioinformatics workflows and load balancing, and evaluated several parallel processing software packages. This resulted in a novel parallel alignment approach: a new method, with new software, for running BLAST algorithms.
A reference private cloud implementation was also constructed. Several popular virtualisation software solutions from major vendors were deployed and evaluated, and the most flexible and cost-effective was chosen. The developed software can be deployed within relevant intranet systems to support security and data protection best practice. Moreover, because development took place in a virtualised environment, the project can take full advantage of cloud computing: the complete software installation can be readily deployed to both public and private cloud infrastructure. The solution architecture does not depend on the underlying hardware and is portable to clouds of any size and vendor. The solution is scalable and efficient, and it provides for fast deployment and easy management. Following on from the pipeline implementation, and in particular the development and analysis of the parallel alignment, we published a paper entitled “HBLAST: Parallelised Sequence Similarity - A Hadoop MapReducable Basic Local Alignment Search Tool”.
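
The general idea behind distributing sequence searches can be sketched as a map and reduce over chunks of query sequences. The code below is not HBLAST and does not use Hadoop or BLAST: as a stated assumption, it uses local threads in place of cluster nodes and an exact substring search as a stand-in for the alignment step. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line)
    if name is not None:
        records.append((name, "".join(parts)))
    return records


def split_round_robin(records, n_chunks):
    """Distribute records over n_chunks lists to balance the load."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return [c for c in chunks if c]


def search_chunk(chunk, reference):
    """Map step. Stand-in for alignment: position of the first exact
    match of each query in the reference, or -1 if there is none."""
    return {name: reference.find(seq) for name, seq in chunk}


def parallel_search(records, reference, workers=4):
    """Run the map step on each chunk concurrently, then merge (reduce)."""
    chunks = split_round_robin(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: search_chunk(c, reference), chunks))
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


# Toy data (hypothetical).
queries = parse_fasta(">q1\nGTAC\n>q2\nGGTT\n>q3\nAAAA\n")
hits = parallel_search(queries, "ACGTACGGTTAC", workers=2)
print(hits)
```

Because each query is searched independently, the chunks can be processed in any order and on any node, which is what makes this kind of workload a natural fit for a MapReduce-style cluster.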

Another finding of the project was the disparity between academic bioinformatics software development and commercial software development. This led to the analysis and publication of a journal paper, “Engineering bioinformatics: building reliability, performance and productivity into bioinformatics software”. A survey of over one hundred professionals in the bioinformatics community was conducted, and a significant gap was found between how software engineers and life scientists develop cloud based molecular diagnostic systems. The paper also goes on to recommend best practices for implementing software in this domain.

Another significant finding was the lack of reproducibility in infection diagnostics, which led to the presentation and publication of a paper on “Enhancing Reproducibility in Bioinformatics for Microbiology” at the prestigious SMBE conference in Vienna in 2015. We addressed this lack of reproducibility by implementing fully reproducible pipelines in the ClouDx-i software outputs. We demonstrated this by fully sequencing and analysing the genomes of bacterial pathogens implicated in clinical cases of neonatal sepsis, and by showing that all bioinformatics analysis related to this clinical study is fully reproducible through the use of a novel cloud based bioinformatics framework.
Transfer of knowledge events were held, as well as dissemination activities to the wider research community and knowledge economy in the form of conferences, trade shows, journal publications and educational outreach events. It is noteworthy that research outputs from ClouDx-i have resulted in a number of high impact publications, one of which is a Nature Communications paper, with another Nature paper currently in submission. Research output also includes gene transcription microarray data for host-side neonatal responses, and pathogen DNA from neonate isolates that was sequenced with deep coverage and annotated during the project.

Six specialist ‘transfer of knowledge’ workshops have been delivered to research fellows in both molecular diagnostics and cloud computing. Two international research conferences have also been organised around ClouDx-i, in which all partners participated. Overall, 50 internationally peer reviewed publications have been produced, along with press articles, 6 international conferences and trade shows, and 3 educational outreach events.

Contact Point: Paul.Walsh@nsilico.com
Website: www.cloudxi.eu