The primary goal of the proposed project is to improve metagenome binning, a computational process in which metagenomic contigs are grouped according to their presumed genome of origin. State-of-the-art tools perform binning in two stages: (i) computing distances or similarities between the abundance and k-mer frequency profiles of contigs, and (ii) clustering contigs based on these similarities. Binning tools differ in their similarity measures and clustering algorithms. The current challenges in metagenome binning are (i) incorrect binning of conserved regions caused by cross-mapping of reads to those regions, (ii) poor recovery of low-abundance species, and (iii) assessment of intermediate bins based on single-copy marker genes (SCMGs), which yields overly optimistic measures of bin quality.
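To make stage (i) of the generic two-stage scheme concrete, the following minimal Python sketch computes a normalized k-mer frequency profile for a contig and a Euclidean distance between two such profiles. It is an illustration only and does not reproduce any particular binning tool: the function names `kmer_profile` and `euclidean` and the default `k=4` are assumptions made for this example, and reverse-complement merging and the abundance profile are omitted for brevity.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    Illustrative helper (hypothetical name). K-mers containing
    characters other than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    # Fixed ordering of all 4**k possible k-mers.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        # Sequence too short or without any valid k-mer.
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def euclidean(p, q):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

In stage (ii), a clustering algorithm would operate on pairwise distances of this kind, typically combined with a distance between abundance profiles.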
We propose a new binning algorithm that addresses these issues through three key improvements. First, a linear mixture model is applied to account for cross-mapping of reads. Second, Poisson statistics are applied to handle low read counts effectively. Third, bins are refined during clustering using analyses of read counts and k-mer frequencies, without relying on SCMGs. The algorithm uses Bayesian theory to derive novel distance measures that identify contigs belonging to the same genome, and it assigns contigs to genomic bins probabilistically to improve clustering accuracy and completeness.
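The first two ideas can be illustrated with a minimal Python sketch. It is not the proposed algorithm: the Bayesian distance measures are not specified here, so a standard Poisson likelihood-ratio statistic is swapped in to show how low read counts can be compared without a Gaussian approximation, and a two-component least-squares fit stands in for the linear mixture model. The function names `poisson_lr` and `mixture_weight`, and the use of contig length as the Poisson exposure, are assumptions made for this example.

```python
import math


def poisson_lr(n1, len1, n2, len2):
    """Likelihood-ratio statistic for 'both contigs share one coverage rate'.

    n1, n2 are read counts; len1, len2 are contig lengths (exposure).
    Larger values mean stronger evidence of different abundances.
    Illustrative stand-in, not the proposal's Bayesian distance.
    """
    total = n1 + n2
    if total == 0:
        return 0.0  # no reads at all: no evidence either way
    lam = total / (len1 + len2)  # pooled rate under the null hypothesis
    stat = 0.0
    for n, length in ((n1, len1), (n2, len2)):
        if n > 0:  # convention: 0 * log(0) = 0
            stat += n * math.log(n / (lam * length))
    return max(0.0, 2.0 * stat)


def mixture_weight(y, a, b):
    """Least-squares weight w in the model y ~ w * a + (1 - w) * b.

    y is the observed coverage of a cross-mapped contig across samples;
    a and b are coverage profiles of two candidate genomes (assumed known).
    The result is clipped to [0, 1].
    """
    num = sum((yi - bi) * (ai - bi) for yi, ai, bi in zip(y, a, b))
    den = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    if den == 0:
        return 0.5  # identical profiles: the weight is not identifiable
    return min(1.0, max(0.0, num / den))
```

For example, two contigs with 10 reads over 1,000 bp and 20 reads over 2,000 bp have the same coverage rate and give a statistic of zero, whereas a conserved contig whose coverage is a 25/75 blend of two genome profiles yields a weight of 0.25.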
Overall, the algorithm has several important advantages over existing methods: (i) it promises to bin conserved regions more accurately, because the mixture model addresses the cross-mapping of reads; (ii) it models the read and k-mer count distributions with the appropriate (Poisson) statistics to recover low-abundance genomes effectively; and (iii) it does not rely on single-copy marker genes, permitting an unbiased quality assessment.