"Global protein similarity networks to identify LGT were constructed for a set of genomes including pathogenic kinetoplastid (e.g. Trypanosoma and Leishmania) protozoa, representatives of the 5 main super groups of eukaryotes, the major groups of Archaea and a representative sample of Bacteria. To benchmark the sensitivity of our approach for detecting LGT, we also included eukaryotic genomes where systematic screens for LGT using tree-based methods have already been published. All of the data were downloaded from public databases. To assemble a representative sample of bacterial gene diversity we used the Eggnog v 4 data-base of orthologous groups and linear programming (www.gnu.org/software/glpk/) to maximize coverage and minimize redundancy of the bacterial protein sequence diversity in our sample. The ""Evolutionary gene and genome network"" (EGN) software was then used to make a sequence similarity network for all of the proteins in our data set. To remove weaker edges based upon short proteins or alignment segments we used a quality criterion based upon Homology-derived Secondary Structure of Proteins (HSSP) scores. The HSSP distance is a measure of sequence similarity that considers both pairwise sequence identity and alignment length - higher HSSP values are thus required to infer homology between short proteins or alignments. The use of HSSP scores greater than 5 above the HSSP threshold curve, has previously been shown to be improve homology predictions in tree-based analyses for detecting LGT. The sequence similarity networks obtained using EGN represent sequences as nodes and all pairwise sequence relationships as edges. The networks were screened for potential LGTs using the R/Igraph package v. 1.0.1 to identify eukaryotic nodes showing linkages with taxonomic distributions that were not consistent with simple vertical inheritance. 
The simplest cases of LGT were those in which eukaryotic nodes were connected only to prokaryotic nodes in the network, but we also identified eukaryotic nodes that showed significantly fewer connections to other eukaryotes (assessed using randomization) together with significant similarity to prokaryotes. To test further whether these outliers were candidate LGTs, we measured geodesic distances (the number of edges in the shortest path between eukaryotic nodes, normalized by the diameter of the network) as a measure of the proximity of eukaryotic nodes to each other, and identified eukaryotic nodes that had significantly larger geodesic distances than other eukaryotes. We also used the Jaccard similarity index (JI) to quantify the similarity of node interaction profiles, and so to identify eukaryotic nodes whose interaction profiles were more similar to those of adjacent prokaryotic nodes than to those of adjacent eukaryotic nodes. Analyses that combine complementary approaches to infer LGT often improve the quality of predictions, so we subjected the LGTs identified by these network-based methods to further analysis by tree-based and non-tree-based methods to identify a “consensus set” of LGTs. One of the most interesting findings of this work was that LGTs affect all eukaryote genomes, suggesting that prokaryote-to-eukaryote LGT is a pervasive force shaping the genomes of microbial eukaryotes. The transferred genes affect metabolic pathways such as glycolysis and amino acid metabolism, but also include genes of potential adaptive significance for the pathogenic lifestyle. The identity of the prokaryotic donor lineages correlates strongly with shared habitat. The results of these analyses are currently being prepared for publication. The dataset of LGTs will also be deposited in public databases, providing a resource of potential targets for therapeutic intervention against pathogenic protozoa.
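The network measures used in the screen described above (neighbor taxonomy, geodesic distance normalized by the network diameter, and the Jaccard index of interaction profiles) can be illustrated on a toy graph. This is a sketch in plain Python rather than the R/igraph code used for the analysis; the graph, the node names and the helper functions are all hypothetical, and the significance testing by randomization is omitted.

```python
from collections import deque

# Hypothetical toy network (undirected edges); 'E' = eukaryote, 'P' = prokaryote.
edges = [
    ('euk1', 'euk2'), ('euk2', 'euk3'), ('euk1', 'euk3'),
    ('euk3', 'bac1'), ('bac1', 'bac2'), ('bac2', 'bac3'),
    ('bac1', 'bac3'), ('eukX', 'bac2'), ('eukX', 'bac3'),
]
domain = {'euk1': 'E', 'euk2': 'E', 'euk3': 'E', 'eukX': 'E',
          'bac1': 'P', 'bac2': 'P', 'bac3': 'P'}

graph = {node: set() for node in domain}
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def shortest_paths(source):
    # Breadth-first search: number of edges from source to each reachable node.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

# The toy graph is connected, so every pair of nodes has a finite distance.
all_dist = {n: shortest_paths(n) for n in graph}
diameter = max(max(d.values()) for d in all_dist.values())

def mean_normalized_distance_to_eukaryotes(node):
    # Mean geodesic distance to the other eukaryotic nodes, divided by the diameter.
    others = [n for n in graph if domain[n] == 'E' and n != node]
    return sum(all_dist[node][n] for n in others) / (len(others) * diameter)

def jaccard(a, b):
    # Jaccard index of the neighbor sets (interaction profiles) of two nodes.
    union = graph[a] | graph[b]
    return len(graph[a] & graph[b]) / len(union) if union else 0.0

# Simplest candidates: eukaryotic nodes whose neighbors are all prokaryotic.
only_prokaryotic = [n for n in graph if domain[n] == 'E' and graph[n]
                    and all(domain[nb] == 'P' for nb in graph[n])]

print(only_prokaryotic, diameter)
print(mean_normalized_distance_to_eukaryotes('eukX'), jaccard('eukX', 'bac2'))
```

In this toy example the node eukX is linked only to prokaryotic nodes and sits further from the other eukaryotes than they do from each other, the kind of pattern the screen flags as a candidate for follow-up by tree-based methods.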
"