Final Report Summary - HEROGEN (The Molecular Genetics of Heroin Dependence)
Dependence on illicit drugs such as heroin can have a devastating impact on the lives of affected individuals, their families and wider society. Heroin use is on the increase in Europe; the risk of death in drug users is 20 to 30 times greater than in non-drug users, mostly from overdose or acquired infections. Although governments of various countries have been putting financial resources into understanding the cultural and socio-political aspects of heroin addiction, far less effort has been invested in understanding why some individuals become addicted and why they find it so hard to give up. Previous evidence from animal studies, as well as twin and family studies in humans, suggests that there is a biological basis for addiction and that genes may predispose some individuals to become dependent on the drug once they have tried it. Thus, the aim of this project was to identify risk genes using an ethnically homogeneous sample of addicts and control subjects selected from the same ethnic region. This European Commission funded study is the first to test a sample of Han Chinese individuals for association with heroin dependence. The choice of ethnicity was largely due to the fact that this sample was available at the host institute. This sample is larger than most previously tested samples of heroin addicts and controls, which adds considerable value to the study. In addition, the host institute’s facilities include cutting-edge technologies and analytical expertise; this enabled me to deviate slightly from the proposed project to ensure that the results of this EC funded project remained up-to-date and scientifically valid.
Methods
The study was performed with the Illumina HumanCoreExome-12v1_A beadchip, which was developed to capture extensive genomic variation, including rare variants and indels. Stringent quality control (QC) measures were implemented at every stage. Prior to hybridization to the chip, DNA quantity and quality were assessed using the most accurate methods available, to avoid spurious results. Automated procedures were used to hybridize 400 cases and 170 controls to the beadchip, which was scanned on the Illumina HiScan platform. This chip has 547,644 markers in total, including 264,909 common variants and over 240,000 exonic and rare variants. Initial QC of plate hybridization and scanning results showed >99% accuracy in most cases; individuals with <98% accuracy were excluded, which left 512 samples for the statistical analysis. Marker accuracy was stringently evaluated in the BeadStudio/GenomeStudio suite of programs; all markers that failed to meet the accuracy threshold of 98%, were discordant for gender, or had heterozygosity levels of <1% were removed.
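The sample-level exclusion step can be illustrated with a minimal Python sketch. It assumes that the reported "accuracy" corresponds to a per-sample call rate (the fraction of markers with a successful genotype call); that reading, the function names and the toy data are assumptions made for illustration and are not taken from the study pipeline, which used BeadStudio/GenomeStudio.

```python
from typing import Dict, List, Optional

# Assumption: a genotype is coded as 0, 1 or 2 copies of the minor allele,
# and None marks a marker with no call for that sample.
Genotype = Optional[int]


def call_rate(calls: List[Genotype]) -> float:
    """Return the fraction of markers with a non-missing genotype call."""
    if not calls:
        return 0.0
    called = sum(1 for c in calls if c is not None)
    return called / len(calls)


def filter_samples(
    genotypes: Dict[str, List[Genotype]], min_rate: float = 0.98
) -> Dict[str, List[Genotype]]:
    """Keep only samples whose call rate is at least `min_rate`."""
    return {
        sample: calls
        for sample, calls in genotypes.items()
        if call_rate(calls) >= min_rate
    }


if __name__ == "__main__":
    # Hypothetical toy data: 100 markers per sample.
    toy = {
        "sample_A": [0] * 100,               # no missing calls
        "sample_B": [1] * 99 + [None],       # 1 missing call
        "sample_C": [2] * 95 + [None] * 5,   # 5 missing calls
    }
    kept = filter_samples(toy)
    print(sorted(kept))
```

The same pattern applies to markers instead of samples by computing the call rate across individuals for each marker.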
Data Analysis
Several quality control measures were implemented prior to conducting the case-control association analysis using PLINK (Purcell et al, 2007). Although the common variants on the beadchip were tagging SNPs, there is always a possibility of some underlying linkage disequilibrium (LD), leading to correlated structure. This is particularly relevant when the associated SNPs are located within genes, resulting in non-independent signals. To check for correlated structure, the clumping procedure was applied in PLINK, using the default setting of r² = 0.5. The procedure takes all SNPs with p ≤ 0.001, termed index SNPs, and creates clumps of SNPs that are in LD with the index SNPs at the threshold of r² = 0.5 within a defined distance; the default distance is 250 kb. For this analysis, the default settings were used. A test for any residual population stratification was also implemented using the MDS (multidimensional scaling) function in PLINK. Following these QC procedures, the dataset was analyzed for case-control association using PLINK. The analysis included 534,333 SNPs; however, a number of them did not yield any results. Following the case-control analysis, the dataset was cleaned further prior to imputation; the cleaning was conducted using set procedures in PLINK, so that SNPs with minor allele frequencies of less than 0.01 and genotype frequencies of 0.01 were removed. The remaining markers, totalling 263,084, were imputed with the Minimac procedure and IMPUTE2 algorithm. This procedure can be run quickly and easily on the University of Michigan’s server (http://imputationserver.sph.umich.edu/) after the data have been separated into chromosomes; the program produces a zipped file which can be converted into PLINK format. The imputation produced 30M SNPs, which were analyzed for association.
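To make the clumping idea concrete, the following is a simplified Python sketch of greedy LD-based clumping using the thresholds quoted above (p ≤ 0.001 for index SNPs, r² ≥ 0.5, 250 kb window). It is not PLINK's implementation: it omits PLINK's secondary p-value threshold for clumped SNPs, and the `Snp` type, the `clump` function and the `r2` lookup callback are hypothetical names introduced for illustration only.

```python
from typing import Callable, Dict, List, NamedTuple


class Snp(NamedTuple):
    snp_id: str
    chrom: int
    pos: int    # base-pair position
    p: float    # association p-value


def clump(
    snps: List[Snp],
    r2: Callable[[str, str], float],  # hypothetical pairwise LD lookup
    p1: float = 0.001,                # index SNP p-value threshold
    r2_min: float = 0.5,              # LD threshold
    window_kb: int = 250,             # maximum distance from the index SNP
) -> Dict[str, List[str]]:
    """Greedy clumping: map each index SNP to the SNPs clumped with it."""
    ordered = sorted(snps, key=lambda s: s.p)  # most significant first
    assigned = set()
    clumps: Dict[str, List[str]] = {}
    for index in ordered:
        if index.p > p1:
            break  # remaining SNPs are too weak to act as index SNPs
        if index.snp_id in assigned:
            continue  # already belongs to a stronger SNP's clump
        assigned.add(index.snp_id)
        members: List[str] = []
        for other in ordered:
            if other.snp_id in assigned or other.chrom != index.chrom:
                continue
            if abs(other.pos - index.pos) > window_kb * 1000:
                continue
            if r2(index.snp_id, other.snp_id) >= r2_min:
                members.append(other.snp_id)
                assigned.add(other.snp_id)
        clumps[index.snp_id] = members
    return clumps


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the study.
    toy = [
        Snp("snp_a", 17, 1_000, 1e-7),
        Snp("snp_b", 17, 5_000, 1e-5),
        Snp("snp_c", 17, 900_000, 1e-4),
        Snp("snp_d", 1, 1_000, 0.5),
    ]
    ld = {frozenset(("snp_a", "snp_b")): 0.8}
    print(clump(toy, lambda x, y: ld.get(frozenset((x, y)), 0.0)))
```

Each key in the returned dictionary is an approximately independent signal; SNPs listed under it are treated as correlated with that index SNP rather than as separate findings.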
Results
The case-control association analysis did not produce any SNP that met the genome-wide significance level of 10⁻⁸. The top hits were two SNPs, both located on chromosome 17, with p-values of the order of 10⁻⁷, and 9 SNPs with p-values of the order of 10⁻⁶. The most significant result was for an exonic marker located in the gene CCDC42 (coiled coil domain 42) on chromosome 17, yielding a p-value of 1.19 × 10⁻⁷ and an odds ratio (OR) of 0.4. Interestingly, there is a cluster of SNPs on chromosome 17p13.1; a previous linkage study in Han Chinese highlighted a region on chromosome 17, but that region (17q11.2) is different from the one observed here. Among the top 100 associated SNPs are multiple signals in the gene BRSK2 (BR serine/threonine kinase 2). Other genes identified in the top hits include PPP1R12B (protein phosphatase 1, regulatory subunit 12B) and ATP9B (ATPase, class II, type 9B). More detailed analysis is in progress.
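For readers less familiar with these statistics, the sketch below shows how an allelic odds ratio and a 1-degree-of-freedom chi-square p-value can be computed from a 2×2 table of allele counts, and how a p-value compares with the genome-wide threshold of 10⁻⁸ stated above. It is a minimal illustration, not the PLINK computation used in the study; the function names and the allele counts are hypothetical.

```python
import math
from typing import Tuple

GENOME_WIDE_ALPHA = 1e-8  # significance level stated in this report


def allelic_test(
    case_minor: int, case_major: int, ctrl_minor: int, ctrl_major: int
) -> Tuple[float, float, float]:
    """Return (odds ratio, chi-square statistic, p-value) for a 2x2 allele table."""
    a, b, c, d = case_minor, case_major, ctrl_minor, ctrl_major
    if min(a, b, c, d) <= 0:
        raise ValueError("all four allele counts must be positive")
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    # Pearson chi-square for a 2x2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


def is_genome_wide_significant(p_value: float) -> bool:
    """True if the p-value is below the genome-wide threshold."""
    return p_value < GENOME_WIDE_ALPHA


if __name__ == "__main__":
    # Hypothetical allele counts, not taken from the study.
    or_, chi2, p = allelic_test(40, 760, 40, 300)
    print(f"OR={or_:.2f}, chi2={chi2:.2f}, p={p:.2e}")
    # The top hit reported above (p = 1.19e-7) falls short of the threshold.
    print(is_genome_wide_significant(1.19e-7))
```

An odds ratio below 1, as reported for the CCDC42 marker, indicates that the tested allele is less frequent in cases than in controls.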
Discussion
Testing for association between cases and controls in this limited number of individuals revealed three top hits in the same chromosomal region, suggesting that these signals are unlikely to be spurious. Interestingly, the top three associated SNPs are located on chromosome 17, which has previously been linked to heroin dependence. Of the three markers, two (exm1292458, rs9894347) are located within the gene CCDC42, and one (rs2101939) is located 12 kb from the 3’ end of CCDC42. There is long-range linkage disequilibrium in the region, suggesting that the results may not be due to chance. The gene is conserved in higher primates and other mammals and has been identified as a site of ubiquitylation (Danielsen et al, 2011), suggesting that it might be involved in epigenetic regulation in response to heroin consumption. BRSK2 has multiple SNPs in the top results, but considering its role in endoplasmic reticulum protein degradation, it is hard to see how it could affect heroin addiction. Other genes in the top 100 hits, such as PPP1R12B and PRKG1 (protein kinase, cGMP-dependent, type 1), play a functional role in signal transduction and hence may be related to heroin dependence. However, it should be noted that although some of the associated genes may be functionally relevant to heroin addiction, the results would need to be replicated in independent, larger samples and in well-designed functional studies. Replication is all the more important because the sample size is quite small for a complex disorder such as heroin addiction. The power analysis in the original proposal was based on a sample size of about 700-800 cases. In the end, the stringent quality control measures implemented in the study meant that reliable data were available for about 400 individuals; this has obviously reduced statistical power. The results quoted here do not include the results of the imputation and pathway analysis, which are ongoing.
Conclusion and Socio-economic impact
The value of this study lies in its novelty as the first genome-wide study in a Chinese sample. Nonetheless, the results would be of value to the wider scientific community, as few studies have tested for genetic association with heroin dependence. In addition, the data obtained in this EC funded study will be used in a mega-analysis by the Psychiatric Genomics Consortium (PGC) for addiction; this is a global effort to analyze the various datasets from around the world. This EC-funded dataset is the only one of its kind in Europe and therefore represents an important contribution to the wider analysis by the PGC-addiction consortium. In addition, due to the extensive quality control and data cleaning efforts, my dataset is a robust and reliable one.