Final Report Summary - HUMVAR (Human variation: causes, patterns and consequences)

Introduction

In this project we aimed to analyze mutations. Mutations are the source of genetic variation in natural populations; they provide the material for molecular evolution and, importantly, cause human genetic diseases. Studies of mutations are therefore extremely important. Hundreds of resequenced human genomes (including cancerous samples) and hundreds of completely sequenced genomes of other species are accumulating at a growing pace. We took advantage of these data and conducted bioinformatic analyses of mutations that accumulated in completely sequenced genomes over varying evolutionary timescales. This allowed us to uncover the intricacies of mutagenesis processes. Compared with wet-lab experiments, computational analyses enabled us to study mutations in their native genomic environment and on a whole-genome scale.

It is known that mutation rates fluctuate greatly from locus to locus in mammalian genomes, and that the rates of some mutation types co-vary regionally. However, this phenomenon has been neither characterized simultaneously for multiple mutation types nor anchored to specific regions of the genome. A complete understanding of its mechanistic basis has also been lacking.

Results and Conclusions

In this project, we used human resequencing data in conjunction with completely sequenced mammalian genomes to study regional (primarily intrachromosomal) variation and co-variation in the rates of different mutation types. Using newly developed bioinformatic tools (see below), we segmented the genome into contiguous regions, each characterized by a specific mutation profile, or state, and associated these states with 35 genomic landscape features, parsing the contributions of different biochemical processes to mutagenesis.
This characterization was similar between mutations that accumulated in neutrally evolving regions of the human genome over millions of years (e.g. in a human-orangutan comparison) and recent mutations still segregating in human populations (from the 1000 Genomes Project). Deviations from these general trends highlight the evolutionary history of primate chromosomes. We further demonstrated that genes and non-coding functional marks localize in the genome according to the underlying mutation rates.

Socio-economic impact

These segmentations allow screening of personal genome variants, including those associated with cancer and other diseases, and provide a framework for accurate computational prediction of non-coding functional elements. Our results therefore provide a powerful resource for biomedical data analysis. This research advances our understanding of mutagenesis: we were able to decipher mutagenesis mechanisms by examining the associations between regions with elevated rates of particular mutations and the enrichment of particular genomic features. It also provides information vital for improving models of the evolutionary process, alignment algorithms, and algorithms for the prediction of functional elements.

Several multivariate and multi-scale statistical techniques were developed and implemented in GALAXY, an open-source software suite for genomics research. The tools developed here bridge interdisciplinary differences in concepts and data between biology and statistics, as well as between bioinformatics and experimental biochemistry. They can be used by other researchers in genomics and in other scientific disciplines. The tools we developed fall into several categories. The first comprises general statistical tools useful in any scientific discipline.
These include multiple regression tools ('Perform Linear Regression', 'Perform Best-subsets Regression', and 'Compute RCVE'), a segmentation tool ('Fit Hidden Markov models'), and a data graphing tool ('Heatmap'). We are confident that many researchers will find these tools useful and will be able to use them in their own research projects. The second category comprises more specialized tools that will nevertheless be extremely useful to genomicists. These include alignment pre-processing tools ('Make genomic windows', 'Feature coverage', 'Filter nucleotides', and 'Mask CpG/non-CpG sites') and tools for identifying mutations and computing their rates ('Fetch indels', 'Estimate indel rates', 'Fetch substitutions', 'Estimate substitution rates', 'Extract orthologous microsatellites', and 'Estimate microsatellite mutability').
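
To illustrate the kind of computation that windowing and rate-estimation tools perform, the following is a minimal Python sketch that splits a pairwise alignment into non-overlapping windows and estimates a substitution rate in each. It is not the implementation of the GALAXY tools named above. The function name `window_substitution_rates`, the input representation (two equal-length aligned strings), and the use of the Jukes-Cantor correction are all assumptions made for illustration.

```python
import math


def window_substitution_rates(ref, other, window_size):
    """Estimate per-window substitution rates from two aligned sequences.

    Hypothetical helper for illustration only. Columns containing a gap
    or an ambiguous base in either sequence are skipped.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    if window_size <= 0:
        raise ValueError("window_size must be positive")

    valid = set("ACGT")
    results = []
    for start in range(0, len(ref), window_size):
        end = min(start + window_size, len(ref))
        sites = 0  # aligned columns with an unambiguous base in both sequences
        diffs = 0  # columns where the two bases differ
        for a, b in zip(ref[start:end].upper(), other[start:end].upper()):
            if a in valid and b in valid:
                sites += 1
                if a != b:
                    diffs += 1

        if sites == 0:
            p = jc = None  # no usable columns in this window
        else:
            p = diffs / sites  # proportion of differing sites
            if p == 0:
                jc = 0.0
            elif p < 0.75:
                # Jukes-Cantor distance (an assumed choice of correction)
                jc = -0.75 * math.log(1 - 4 * p / 3)
            else:
                jc = None  # correction is undefined at saturation
        results.append({"start": start, "end": end, "sites": sites,
                        "diffs": diffs, "p": p, "jc": jc})
    return results


if __name__ == "__main__":
    ref = "ACGTACGTACGT"
    other = "ACGTACTTAC-T"
    for row in window_substitution_rates(ref, other, window_size=6):
        print(row)
```

In a real analysis the windows would be far larger and the alignments would be restricted to putatively neutral sites. Per-window rates for several mutation types could then serve as the input to a segmentation step such as a hidden Markov model.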