Periodic Reporting for period 1 - Metagenome binning (Accurate reconstruction of microbial genomes from the environment)
Reporting period: 2023-08-01 to 2025-07-31
We proposed a new binning algorithm that addresses these issues through three key improvements. First, a linear mixture model accounts for cross-mapping of reads. Second, Poisson statistics are applied to handle low read counts properly. Third, bins are refined during clustering by analysing read counts and k-mer frequencies, without relying on single-copy marker genes (SCMGs). The algorithm uses Bayesian theory to derive novel distance measures that identify contigs belonging to the same genome, and it assigns contigs to genomic bins probabilistically to improve clustering accuracy and completeness.
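The report does not state the exact form of the distance measure. As a minimal sketch of how a Bayesian, Poisson-based comparison of two contigs can work, the example below computes a log Bayes factor between "both contigs share one per-base read rate in each sample" and "each contig has its own rate", assuming a Gamma(alpha, beta) prior on the rate. The function names, the prior, its default parameters and the example counts are all assumptions made for illustration and are not taken from McDevol.

```python
import math


def log_marginal(count, exposure, alpha, beta):
    """Log Gamma-Poisson marginal likelihood of one read count.

    The factor exposure**count / count! is dropped because it cancels
    when the two hypotheses are compared in the Bayes factor.
    """
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(count + alpha)
            - (count + alpha) * math.log(beta + exposure))


def bayes_dissimilarity(counts_a, len_a, counts_b, len_b, alpha=1.0, beta=1.0):
    """Negative summed log Bayes factor over samples (hypothetical measure).

    Smaller values mean the coverage profiles are better explained by a
    shared rate, i.e. the contigs look more like the same genome.
    The value is a score, not a metric: it can be negative.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both contigs need counts for the same samples")
    log_bf = 0.0
    for x, y in zip(counts_a, counts_b):
        shared = log_marginal(x + y, len_a + len_b, alpha, beta)
        separate = (log_marginal(x, len_a, alpha, beta)
                    + log_marginal(y, len_b, alpha, beta))
        log_bf += shared - separate
    return -log_bf


# Illustrative read counts in three samples; contig lengths are in kb.
contig_a = [50, 5, 100]    # 10 kb
contig_b = [100, 10, 200]  # 20 kb, same coverage profile as contig_a
contig_c = [5, 200, 10]    # 10 kb, discordant coverage profile

d_ab = bayes_dissimilarity(contig_a, 10.0, contig_b, 20.0)
d_ac = bayes_dissimilarity(contig_a, 10.0, contig_c, 10.0)
```

Because the prior is integrated out analytically, the score stays well defined for zero or very low counts, which is the regime where Gaussian approximations of coverage tend to break down.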
Overall, the algorithm has several important advantages over existing methods: i) it promises to be more accurate in binning conserved regions, because the mixture model resolves the cross-mapping of reads; ii) it models the read and k-mer count distributions with the appropriate (Poisson) statistics, so that low-abundance genomes are recovered effectively; and iii) it does not rely on single-copy marker genes, which permits an unbiased quality assessment.
Work package 2 involved a series of steps to deconvolute the abundance matrix into metagenome-assembled genomes (MAGs). I developed a probabilistic distance measure based on Bayesian statistics to calculate pairwise distances between contigs, and tested it on millions of contigs from CAMI2 simulated metagenomic samples. Using these pairwise distances, I implemented an agglomerative clustering algorithm that groups similar contigs and estimates the number of genomes likely to be present in the samples. The initial contig clusters were used to initialise the W and Z matrices, which were then optimised through non-negative matrix factorisation (NMF) under Poisson statistics and non-negativity constraints. The optimisation successfully found an optimal solution for the metagenomic datasets, and MAGs were reconstructed from the input abundance matrix. All these steps were combined into a pipeline and released as open-access bioinformatics software under the GNU General Public License (https://github.com/yazhinia/McDevol).
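The factorisation step can be illustrated with a small sketch. It assumes that the abundance matrix X (contigs by samples) is modelled as Poisson counts with rates W Z, where W (contigs by genomes) and Z (genomes by samples) are non-negative, and it uses the standard multiplicative updates for the Kullback-Leibler objective, which is equivalent to maximising the Poisson likelihood. The matrix orientation, the initialisation scheme, the function names and the toy data are assumptions made for illustration; the optimiser actually used in McDevol is not described in this report.

```python
import math


def matmul(W, Z):
    """Plain-Python matrix product of W (n x k) and Z (k x m)."""
    k, m = len(Z), len(Z[0])
    return [[sum(row[c] * Z[c][j] for c in range(k)) for j in range(m)]
            for row in W]


def poisson_nll(X, W, Z, eps=1e-12):
    """Negative Poisson log-likelihood of X given rates W Z.

    The log(x!) term is omitted because it does not depend on W or Z.
    """
    M = matmul(W, Z)
    return sum(m - x * math.log(m + eps)
               for x_row, m_row in zip(X, M)
               for x, m in zip(x_row, m_row))


def init_from_clusters(X, labels, k, floor=0.01):
    """Build starting W and Z from hard cluster labels (hypothetical scheme).

    A small positive floor is used instead of zeros, because entries that
    start at zero can never change under multiplicative updates.
    """
    n, m = len(X), len(X[0])
    W = [[1.0 if labels[i] == c else floor for c in range(k)]
         for i in range(n)]
    Z = []
    for c in range(k):
        members = [X[i] for i in range(n) if labels[i] == c]
        size = max(len(members), 1)
        Z.append([sum(r[j] for r in members) / size + floor
                  for j in range(m)])
    return W, Z


def poisson_nmf(X, W, Z, n_iter=200, eps=1e-12):
    """Multiplicative updates for X ~ Poisson(W Z); W and Z stay non-negative."""
    n, m, k = len(X), len(X[0]), len(Z)
    W = [row[:] for row in W]
    Z = [row[:] for row in Z]
    for _ in range(n_iter):
        # Update Z with W fixed.
        M = matmul(W, Z)
        R = [[X[i][j] / (M[i][j] + eps) for j in range(m)] for i in range(n)]
        col_w = [sum(W[i][c] for i in range(n)) + eps for c in range(k)]
        Z = [[Z[c][j] * sum(W[i][c] * R[i][j] for i in range(n)) / col_w[c]
              for j in range(m)] for c in range(k)]
        # Update W with the new Z fixed.
        M = matmul(W, Z)
        R = [[X[i][j] / (M[i][j] + eps) for j in range(m)] for i in range(n)]
        row_z = [sum(Z[c]) + eps for c in range(k)]
        W = [[W[i][c] * sum(Z[c][j] * R[i][j] for j in range(m)) / row_z[c]
              for c in range(k)] for i in range(n)]
    return W, Z


# Toy abundance matrix: 4 contigs x 3 samples, with two initial clusters.
X = [[10, 0, 5], [20, 1, 9], [0, 8, 2], [1, 15, 4]]
labels = [0, 0, 1, 1]
W0, Z0 = init_from_clusters(X, labels, k=2)
W, Z = poisson_nmf(X, W0, Z0)
```

Each row of W can then be read as the soft membership of a contig in the candidate genomes, which is one way to obtain a probabilistic assignment of contigs to bins.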
In work package 3, I developed an improved version of our binning tool based on a deep learning approach and the Leiden community detection algorithm (https://github.com/soedinglab/McDevol). It generates genomic bins independently of SCMGs. I improved the computational efficiency of the tool, guided by runtime and memory profiling. I carried out an extensive benchmarking study on CAMI2 datasets to compare the binning performance of deep learning binners (https://github.com/soedinglab/binning_benchmarking). I also developed automated Snakemake workflows to reproduce this benchmarking study (https://github.com/soedinglab/binning_benchmarking/tree/main/workflow).
Work package 4 focused on improving the yield of MAGs from single-sample and multi-sample binning. For this, I developed a new dereplication tool that outperformed the widely used dereplication tool dRep in recovering high-quality MAGs. The source code is publicly available under the GNU General Public License on GitHub (https://github.com/soedinglab/MAGmax.git), as a Bioconda package (https://anaconda.org/bioconda/magmax) and as a Docker image (https://github.com/soedinglab/MAGmax/pkgs/container/magmax).