Final Report Summary - GENEREGULATION (Deciphering the code of gene regulation using massively parallel assays of designed sequence libraries)
Genetic variation in non-coding regulatory regions accounts for a significant fraction of changes in gene expression among individuals from the same species. However, without a ‘regulatory code’ that informs us how DNA sequences determine expression levels, we cannot predict which sequence changes will affect expression, by how much, and by what mechanism. To address this challenge, we developed a high-throughput method for constructing libraries of thousands of fully designed regulatory sequences and measuring their expression levels in parallel, within a single experiment, with an accuracy similar to that obtained when each sequence is constructed and measured individually. Using this ~1000-fold increase in the scale at which we can study the effect of sequence on expression, we designed libraries in which the location, number, affinity and organization of different types of regulatory elements were systematically perturbed, and measured their expression across different types of regulatory regions, including 5' and 3' UTRs as well as promoters. We also went beyond gene expression outcomes and measured global cellular fitness and growth in response to tens of thousands of different designed genetic variants. We developed algorithms that use these insights to predict the expression of unseen regulatory regions, as well as the expression of genes in human individuals given only their genotype in the local genomic vicinity of the gene being predicted. Our results provide several new insights into the principles of gene regulation, bringing us closer to a mechanistic and quantitative understanding of how expression levels are encoded in DNA sequence.