Accurate reconstruction of microbial genomes from the environment

Project Information

Metagenome binning

Grant agreement ID: 101111457

DOI: 10.3030/101111457

Project closed

EC signature date: 19 June 2023

Start date: 1 August 2023

End date: 31 July 2025

Funded under: Marie Skłodowska-Curie Actions (MSCA)

Total cost: No data

EU contribution: € 189 687,36

Coordinated by: MAX-PLANCK-GESELLSCHAFT ZUR FORDERUNG DER WISSENSCHAFTEN EV, Germany

Periodic Reporting for period 1 - Metagenome binning (Accurate reconstruction of microbial genomes from the environment)

Reporting period: 2023-08-01 to 2025-07-31

The primary goal of the project is to improve metagenome binning, the computational process in which metagenomic contigs are grouped according to their presumed genome of origin. State-of-the-art tools perform binning in two stages: (i) computing distances or similarities between the abundance and k-mer frequency profiles of contigs, and (ii) clustering contigs based on these similarities. Binning tools differ in their similarity measures and clustering algorithms. The current challenges in metagenome binning are (i) incorrect binning of conserved regions caused by cross-mapping of reads to those regions, (ii) poor recovery of low-abundance species, and (iii) assessment of intermediate bins based on single-copy marker genes (SCMGs), which yields overly optimistic measures of bin quality.

We proposed a new binning algorithm that addresses these issues through three key improvements. First, a linear mixture model is applied to account for cross-mapping of reads. Second, Poisson statistics are applied to handle low read counts effectively. Third, bins are refined during clustering using analyses of read counts and k-mer frequencies, without relying on SCMGs. The algorithm uses Bayesian theory to derive novel distance measures that identify contigs belonging to the same genome, and it assigns contigs to genomic bins probabilistically to improve clustering accuracy and completeness.
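To illustrate why Poisson statistics suit low read counts, the sketch below compares two contigs with a Poisson likelihood-ratio statistic. It is a simplified stand-in and not the Bayesian distance measure derived in the project. It assumes that contigs from the same genome share one per-base read rate in each sample; the function names and the example counts are hypothetical.

```python
import math


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return x * math.log(y) if x > 0 else 0.0


def poisson_llr_distance(counts_a, counts_b, len_a, len_b):
    """Poisson log-likelihood ratio between 'separate rates' and 'shared rate'.

    counts_a, counts_b: read counts per sample for two contigs.
    len_a, len_b: contig lengths in bases.
    Returns 0 when both contigs have the same per-base coverage in every
    sample and grows as their coverage profiles diverge.
    """
    p = len_a / (len_a + len_b)  # share of reads expected on contig a
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        n = a + b
        if n == 0:
            continue  # a sample without reads carries no information
        stat += xlogy(a, a) + xlogy(b, b) - xlogy(a, n * p) - xlogy(b, n * (1 - p))
    return stat


# Hypothetical read counts across three samples.
same = poisson_llr_distance([2, 0, 4], [4, 0, 8], 1000, 2000)
different = poisson_llr_distance([3, 0, 12], [12, 0, 3], 1000, 1000)
```

Because the statistic is built from the Poisson likelihood itself, samples with zero or very few reads contribute little evidence instead of distorting the comparison, which is the motivation for using count statistics with low-abundance genomes.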

Overall, the algorithm has several important advantages over existing methods: (i) it promises to be more accurate in binning conserved regions because the mixture model accounts for cross-mapping of reads, (ii) it models the read and k-mer count distributions with appropriate (Poisson) statistics to recover low-abundance genomes effectively, and (iii) it does not rely on single-copy marker genes, permitting an unbiased quality assessment.

The project was carried out in four work packages. Work package 1 involved preparing input data for metagenome binning. I created synthetic metagenomic datasets using the CAMISIM tool. Using these datasets, I implemented the scripts required to generate a k-mer frequency matrix from contig sequences and read coverage data from read-to-contig alignment files (.sam format). I then generated input k-mer and abundance matrices for three simulated metagenomic datasets from the CAMI2 study for further analysis.
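As a minimal sketch of what a k-mer frequency matrix looks like, the following code counts k-mers per contig and normalises them to frequencies. It is an illustration under stated assumptions, not the project's scripts: the function names are hypothetical, k defaults to 4, only the forward strand is counted, and windows containing ambiguous bases are skipped.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=4):
    """Return normalised k-mer frequencies for one contig.

    The result has 4**k entries ordered lexicographically (AAAA, AAAC, ...).
    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set(ALPHABET):
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


def kmer_matrix(contigs, k=4):
    """Map each contig name to its k-mer frequency row."""
    return {name: kmer_frequencies(seq, k) for name, seq in contigs.items()}


# Hypothetical toy contigs; real input would be parsed from a FASTA file.
matrix = kmer_matrix({"contig_1": "ACGTACGTAC", "contig_2": "GGGCCCNNAT"}, k=2)
```

Each row of the resulting matrix is one contig's composition profile. Binning tools commonly also merge each k-mer with its reverse complement, which this sketch leaves out for brevity.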

Work package 2 involved a series of steps to deconvolute the abundance matrix into metagenome-assembled genomes (MAGs). I developed a probabilistic distance measure using Bayesian statistics to calculate pairwise distances between contigs and tested it on millions of contigs from CAMI2 simulated metagenomic samples. Using these pairwise distances, I implemented an agglomerative clustering algorithm to cluster similar contigs and estimate the number of genomes likely to be present in the samples. The initial clusters of contigs were used to initialise the W and Z matrices, which were then optimised through non-negative matrix factorisation (NMF) with Poisson statistics and non-negativity constraints. An optimal solution was found for the metagenomic datasets, and MAGs were reconstructed from the input abundance matrix. All these steps were combined into a pipeline and released as open-access bioinformatics software under the GNU General Public License (https://github.com/yazhinia/McDevol).
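The factorisation step can be sketched with the classical multiplicative updates for the generalised Kullback-Leibler divergence, which correspond to maximising a Poisson likelihood under non-negativity constraints. This is a generic textbook scheme, not necessarily the optimiser used in McDevol. The code assumes that X is a contigs-by-samples read-count matrix, W is contigs-by-genomes and Z is genomes-by-samples; it uses a random start where the project initialises from the agglomerative clusters, and all function names are hypothetical.

```python
import math
import random

EPS = 1e-12  # guards against division by zero


def matmul(W, Z):
    """Plain-Python matrix product of W (n x r) and Z (r x m)."""
    return [[sum(W[i][k] * Z[k][j] for k in range(len(Z)))
             for j in range(len(Z[0]))] for i in range(len(W))]


def poisson_deviance(X, W, Z):
    """Generalised KL divergence between X and WZ.

    Equals the negative Poisson log-likelihood up to a constant in X.
    """
    total = 0.0
    for x_row, r_row in zip(X, matmul(W, Z)):
        for x, r in zip(x_row, r_row):
            total += r - x
            if x > 0:
                total += x * math.log(x / (r + EPS))
    return total


def nmf_step(X, W, Z):
    """One round of multiplicative updates; entries stay non-negative."""
    n, m, r = len(X), len(X[0]), len(Z)
    R = matmul(W, Z)
    ratio = [[X[i][j] / (R[i][j] + EPS) for j in range(m)] for i in range(n)]
    W = [[W[i][k] * sum(Z[k][j] * ratio[i][j] for j in range(m))
          / (sum(Z[k]) + EPS) for k in range(r)] for i in range(n)]
    R = matmul(W, Z)
    ratio = [[X[i][j] / (R[i][j] + EPS) for j in range(m)] for i in range(n)]
    Z = [[Z[k][j] * sum(W[i][k] * ratio[i][j] for i in range(n))
          / (sum(W[i][k] for i in range(n)) + EPS) for j in range(m)]
         for k in range(r)]
    return W, Z


def poisson_nmf(X, r, iterations=200, seed=0):
    """Factorise X into W and Z with r components from a random start."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in X]
    Z = [[rng.random() + 0.1 for _ in X[0]] for _ in range(r)]
    for _ in range(iterations):
        W, Z = nmf_step(X, W, Z)
    return W, Z


# Hypothetical counts: four contigs, three samples, two underlying genomes.
X = [[10, 0, 5], [20, 0, 10], [0, 8, 1], [0, 16, 2]]
W, Z = poisson_nmf(X, r=2)
```

Under these assumptions, assigning each contig to the genome with the largest entry in its row of W gives a hard binning, while the full row can be read as a soft assignment of the contig's reads across genomes.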

In work package 3, I developed an improved version of our binning tool using a deep learning approach and the Leiden community detection algorithm (https://github.com/soedinglab/McDevol). It generates genomic bins independently of SCMGs. I improved the computational efficiency of the tool with the help of runtime and memory profilers. I carried out an extensive benchmarking study on CAMI2 datasets to compare the binning performance of deep learning binners (https://github.com/soedinglab/binning_benchmarking). I also developed automated Snakemake workflows to reproduce this benchmarking study (https://github.com/soedinglab/binning_benchmarking/tree/main/workflow).

Work package 4 focused on improving the yield of MAGs from single-sample and multi-sample binning. For this, I developed a new dereplication tool, MAGmax, that outperformed the widely used dereplication tool dRep in recovering high-quality MAGs. The source code is publicly available under the GNU General Public License on GitHub (https://github.com/soedinglab/MAGmax.git), as a Bioconda package (https://anaconda.org/bioconda/magmax), and as a Docker image (https://github.com/soedinglab/MAGmax/pkgs/container/magmax).

Our new data augmentation strategy will facilitate the design of fast binning tools based on contrastive learning models. MAGmax, developed in this project, will improve the recovery of high-quality microbial genomes from large-scale metagenome studies.
