COuNt data TimE SerieS Analysis: significance tests and sequencing data application

Periodic Reporting for period 1 - CONTESSA (COuNt data TimE SerieS Analysis: significance tests and sequencing data application)

Reporting period: 2015-08-01 to 2017-07-31

The aim of this project is to develop methods for the analysis of time series based on count data. The target of my project broadens to general analysis of count time-series data, such as clustering, classification, perturbation inference and machine learning over sequential count data. The project focuses on count data sets from ribonucleic acid sequencing (RNA-seq) time-course experiments. The project has promising applications in biology; recent examples include high-throughput sequencing analyses such as RNA-seq and chromatin immunoprecipitation sequencing (ChIP-seq) and, more recently, single-cell sequencing.
The work carried out during the fellowship was scheduled as both a learning period and a very productive research development period. The project led me to develop six research lines and a dissemination and communication project, which are detailed in the next section. One work is under review, some are ready for submission and some are still ongoing. To further develop and publish these research results, I was recently awarded a visiting academic position at the Department of Computer Science.
To disseminate intermediate results, I took part in many research meetings and workshops. I also took part in summer schools, organised scientific events myself, and had weekly meetings with my supervisors Neil Lawrence and Eleni Vasilaki.
As soon as I started my fellowship I realised that the new technology of single-cell RNA sequencing was emerging for the study of gene expression. Given the advent of this technology, I focused on this kind of sequencing data and the problems connected to it. I broadened my project to general analysis of RNA-seq count data from single cells, such as clustering, classification and perturbation inference. At the beginning of my fellowship I joined the group of Professor Neil Lawrence and learned about Gaussian Processes (GP), attending the GP Summer Schools in September 2015 and 2016 and helping with their organisation. I also learned how to use GitHub to share work, and the basics of Python and Jupyter Notebooks.

In Sheffield I collaborated with Marta Milo and Guillaume Hautbergue, working on Amyotrophic Lateral Sclerosis (ALS) data. I joined a very important project in which a new experimental technology was tested. I was able to handle new RNA-seq data and develop a custom pipeline running from the alignment phase to differential expression and protein-protein interaction network estimation. The newly proposed technology is called GRASPS (Genome-wide RNA Analysis of Stalled Protein Synthesis): a novel translatome technology to identify functional consequences of widespread RNA dysregulation in neurodegeneration. This collaborative work was presented at the Sheffield neuroscience conference and is currently under revision for journal submission.

In the meantime I started my secondment at the University of Manchester, where I collaborated with the group of Professor Magnus Rattray. There I could interact with a computational biology team and start a project on finding co-oscillating genes in a given set of RNA-seq data (bulk or single-cell). This study led us to develop a method and software called PyScope: Detecting oscillatory gene networks. We propose a full analysis pipeline on the resulting graph to identify communities of significantly co-oscillating genes. This collaborative work was presented at the Data Science 2017 meeting in Manchester and at the ISMB 2017 meeting in Prague.

I also focused on network community extraction methods and their validation. This is extremely important when dealing with real biological or social networks. Indeed, one way of summarising a network is via its main representative groups of strongly connected nodes (elements), that is, via its communities. It is therefore crucial to be able to rely on robust community extraction methods. This led me to develop, in collaboration with Annamaria Carissimo and Italia Defeis, a method for validating community robustness in networks. We show the results obtained with the proposed technique on simulated and real datasets. This work is currently under a second round of revision in a top statistical journal and was presented at the machine learning conference NIPS 2016 in Barcelona.

Through discussions with the ML group in Sheffield, I was introduced to the team of Professor Ernst Wit, who leads a COST action on networks called COSTNET (COST Action CA15109). I took part in the first COSTNET meeting in 2016. There I exchanged ideas on network validation with Mirko Signorelli, which led us to a fruitful collaboration on network validation techniques. We developed an inferential procedure for community structure validation in networks. This work is currently under revision for submission to a statistical journal.

I had the opportunity to take part in the launch of the single-cell facility at BMS, where I was invited to give a talk. Within the facility I started a project on how to address the Fluidigm C1 doublets problem and how to detect the developmental stage of a single cell before sequencing. This work is in collaboration with Max Zwießele, Paul J Gokhale, Marcelo Rivolta and Marta Milo. The Fluidigm C1 is a single-cell analysis system that uses simplified single-cell isolation and cell processing based on Integrated Fluidic Circuits (IFCs). Our approach has the great advantage of characterising cells before the RNA-seq assay and therefore adds considerable interpretive power to the subsequent RNA-seq data analysis. Future improvements of this approach will rely on optimising the prior selection and the data features extracted from the IFC images. This work was presented at the ISMB 2017 meeting in Prague.

Together with Professor Neil Lawrence I developed a method to estimate, from the same dataset, both the graph between cells and the graph between genes. We rely on previous work by Lawrence and Kalaitzis, in which a Bigraphical Lasso approach was implemented. When dealing with single-cell data we are simultaneously interested in estimating the interrelations among cells and among genes. The bigraphical lasso is a model for matrix-variate data that preserves their column/row structure and simultaneously learns two graphs, one over the rows and one over the columns of the matrix samples. This model has the time complexity of 2 GLasso problems of O(n + p), preserves the matrix structure by using a Kronecker sum (KS) structure for the precision matrix, and also enhances the sparsity of the graph.
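The Kronecker sum structure mentioned above can be illustrated with a small numerical sketch. It is not the estimation procedure itself, only a toy construction of a KS precision matrix from two given graphs. The names `psi` (precision over n cells), `theta` (precision over p genes) and `kronecker_sum` are hypothetical, the numerical values are made up for illustration, and the ordering convention (cells as rows, data matrix flattened row by row) is an assumption.

```python
import numpy as np


def kronecker_sum(psi, theta):
    """Kronecker sum of psi (n x n) and theta (p x p).

    Computes kron(psi, I_p) + kron(I_n, theta), an (n*p) x (n*p) matrix.
    Assumed convention: it acts on an n x p data matrix flattened row by row.
    """
    n = psi.shape[0]
    p = theta.shape[0]
    return np.kron(psi, np.eye(p)) + np.kron(np.eye(n), theta)


# Toy (made-up) precision matrices: 3 cells and 2 genes.
# Cells 0 and 1 are linked; cell 2 is isolated.
psi = np.array([[2.0, -0.5, 0.0],
                [-0.5, 2.0, 0.0],
                [0.0, 0.0, 1.0]])
# The two genes are linked.
theta = np.array([[1.5, 0.3],
                  [0.3, 1.5]])

omega = kronecker_sum(psi, theta)

# The full precision matrix has (n*p)^2 entries but is determined by
# only n^2 + p^2 numbers, and it inherits zeros from psi and theta.
print(omega.shape)
print(np.count_nonzero(omega), "non-zero entries out of", omega.size)
```

In this sketch the entry linking (cell 0, gene 0) to (cell 2, gene 0) is zero because cells 0 and 2 are not connected in `psi`, which shows how sparsity in the two small graphs carries over to the joint precision matrix.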

Alongside the main research lines related to my project, I also started a dissemination and communication project with the Research Software Engineering team in Sheffield, led by Michael Croucher. The aim of this project was to make the Bioinformatics Awareness Days, which are days devoted to bioinformatics, accessible to a worldwide public. Together with Tania Allard and Mike Croucher at the DCS, we decided to share this material publicly with the whole interested scientific community. The sessions are self-contained and a full run should last at most 2 hours. All of the session material is now also available on a website based around the Jupyter notebooks. This also involved the use of the new Microsoft Azure Notebooks.