
Semantic mining of phenotype associations from the biomedical literature

Final Report Summary - PHENOMINER (Semantic mining of phenotype associations from the biomedical literature)

Summary of Research

Phenotypic descriptions such as “increased intraocular pressure” describe abnormalities in bodily structures, physiological processes or behaviours. These observations form the basis on which clinicians determine the existence and treatment of a disease. In recent years great efforts have been made to generate standardised phenotypic vocabularies (“ontologies”) for humans and a variety of experimental organisms, specifically those with a genetic disposition for diseases such as Alzheimer’s and multiple sclerosis, as well as many rare disorders. Given further progress in developing vocabulary standards, we hope to see the semantic alignment of clinical and biomedical data resources through the phenotypic descriptions so that (a) clinicians can have greater access to findings from molecular biology for the evaluation of individual dispositions and (b) researchers can have access to harmonised data in patient records to help discover new cures and treatments. However, progress in constructing standard phenotype vocabularies has been slow due to the quantity of primary literature which needs to be assessed, usually by expert human curation efforts, and the high level of semantic complexity in phenotype names.
The major objective of the PHENOMINER project, carried out over a 2-year period, has been to exploit state-of-the-art solutions in human language technologies together with existing ontological resources to discover human phenotypic descriptions in the open access scientific literature and to bring these into a novel database for the interpretation of human diseases. The phenotypic descriptions have been automatically encoded in a machine-understandable representation, making them semantically interoperable with existing coding standards and immediately available to the bioinformatics community for integration with ongoing laboratory work.

Figure 1: Overview of PhenoMiner project showing key stages and resources

Main Results

The system developed in the fellowship was the first to bring together a hybrid approach to phenotype capture in the literature: supervised machine learning from annotated texts (“corpora”), existing ontologies, syntactic parsing and a semantic/syntactic rule-based approach that encodes expert intuitions about how phenotype descriptions are structured. As shown in Figure 1, these techniques have been used to mine phenotype descriptions in the free-text literature stored in Europe PubMed Central (EuropePMC) and to discover statistically plausible associations with Mendelian diseases through data-mining technology. Importantly, the resulting data set of 4,898 phenotypes and 28,155 phenotype-disorder associations has been assessed for quality by experts against existing gold standards such as the Online Mendelian Inheritance in Man (OMIM) database and the Human Phenotype Ontology (HPO). A semantic database of automatically mined phenotypes and phenotype-disorder associations was made available in two releases through public open access repositories (GitHub and Zenodo) as well as in a demonstration portal.
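As an illustration of how phenotype-disorder associations can be scored from literature co-occurrence, the following is a minimal sketch only. The report does not state which statistic PHENOMINER used; pointwise mutual information (PMI) over document-level co-occurrence is an assumption chosen for simplicity, and the function name `score_associations`, the `min_count` threshold and the toy documents are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def score_associations(docs, min_count=2):
    """Score phenotype-disorder pairs by PMI over document co-occurrence.

    docs: iterable of (phenotypes, disorders) pairs, one per document.
    Returns {(phenotype, disorder): pmi} for pairs that co-occur in at
    least min_count documents. PMI is an assumed, illustrative statistic.
    """
    n_docs = 0
    pheno_counts = Counter()
    disorder_counts = Counter()
    pair_counts = Counter()
    for phenotypes, disorders in docs:
        n_docs += 1
        # Use sets so each term counts once per document.
        phenotypes, disorders = set(phenotypes), set(disorders)
        pheno_counts.update(phenotypes)
        disorder_counts.update(disorders)
        pair_counts.update(product(phenotypes, disorders))

    scores = {}
    for (pheno, disorder), count in pair_counts.items():
        if count < min_count:
            continue  # too rare to be statistically plausible
        # PMI = log2( P(pheno, disorder) / (P(pheno) * P(disorder)) )
        scores[(pheno, disorder)] = math.log2(
            count * n_docs / (pheno_counts[pheno] * disorder_counts[disorder])
        )
    return scores


# Toy data (hypothetical, not taken from the PHENOMINER release).
docs = [
    ({"increased intraocular pressure"}, {"glaucoma"}),
    ({"increased intraocular pressure", "optic atrophy"}, {"glaucoma"}),
    ({"seizures"}, {"epilepsy"}),
    ({"seizures"}, {"epilepsy"}),
]
scores = score_associations(docs)
```

A positive score means the pair co-occurs more often than expected if phenotype and disorder mentions were independent. Pairs seen in fewer than `min_count` documents are dropped, since PMI is unreliable for rare events.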
The 13 publications (2 currently under submission) which resulted from the fellowship include journal articles in PLoS One (IF: 3.5), Genome Biology (IF: 10.5), Database (IF: 4.4) and the Journal of Biomedical Semantics (IF: 2.1), as well as conference/workshop papers at the International Conference on Computational Linguistics (COLING), Force 2015 and LOUHI at the conference of the European Chapter of the Association for Computational Linguistics (EACL).
The fellowship brought together and has benefited from extensive discussions and collaborations with experts in phenotype knowledge representation (Dr. Anika Oellrich, Wellcome Trust Sanger Institute and Dr. Tudor Groza, University of Queensland), computational biology (Dr. Peter Robinson, Charité Berlin), comparative genomics (Dr. Damian Smedley, Wellcome Trust Sanger Institute), chemical informatics (Dr. John Overington, EMBL-EBI), text mining (Dr. Dietrich Rebholz-Schuhmann, EMBL-EBI and University of Zurich), literature services (Dr. Johanna McEntyre, EMBL-EBI), computational linguistics (Dr. Anna Korhonen, University of Cambridge and Dr. Yusuke Miyao, National Institute of Informatics in Tokyo) and machine learning (Dr. Quang-Thuy Ha, Vietnam National University).
In particular, several user-driven applications for the PHENOMINER data were explored: detection of novel phenotype-OMIM disorder associations (collaborators: Dr. Peter Robinson, Dr. Dietrich Rebholz-Schuhmann), automated assignment of OMIM terms to EuropePMC literature (collaborator: Dr. Jo McEntyre) and verification of cross-species human-mouse gene-disease associations (collaborator: Dr. Damian Smedley). These studies yielded valuable evidence about the quality of the mined phenotypes from a variety of user perspectives.
In terms of knowledge transfer, the fellowship has delivered the beginnings of a community of practice around phenotype knowledge representation, acquisition, application and interoperability through the Phenotype Day workshop at the Intelligent Systems for Molecular Biology (ISMB) conference in Boston in 2014. Dr. Collier launched this initiative with collaborating colleagues; the workshop was attended by over 50 researchers and supported by publication in a special issue of the Journal of Biomedical Semantics. A second Phenotype Day workshop has been accepted for ISMB 2015. Separately, Dr. Collier has participated in the organisation of the Biomedical Linked Annotation Hackathon, to take place in February in Tokyo, which will promote re-use of literature annotations including PHENOMINER phenotypes.
In addition, Dr. Collier has given seminars about PHENOMINER to key groups involved in cross-disciplinary research involving human language technologies and bioinformatics (Dr. Rinaldi – University of Zurich, Dr. Nenadic – University of Manchester, Dr. Nedellec – INRA, Prof. Moulton – University of East Anglia, Dr. Korhonen – University of Cambridge, Dr. Mark Stevenson) as well as a number of seminars on text mining at EMBL-EBI. Dr. Collier has also contributed research findings to students in the MPhil lecture series on Bioinformatics at the University of Cambridge.

Relevant Target Groups

The research reported here is highly relevant to the ongoing work of several groups: (1) life scientists and clinicians involved in translational studies, who will benefit from having a novel database of evidence about phenotype associations in human diseases that links the existing scientific literature to coding standards. We already have preliminary evidence of the utility of this data in a cross-species study for predicting disease genes conducted with a group at the Wellcome Trust Sanger Institute; (2) bioinformaticians and database curators involved in knowledge discovery and data integration, who will benefit from data that they can incorporate into their own workflows. Again, early fruits of this are already apparent when we look at some of the novel phenotype mentions we were able to discover and compare them to the Human Phenotype Ontology; (3) researchers and engineers in human language technologies, e-Science and information retrieval, who now have a clearer framework and a better understanding of the methods necessary to encode the important conceptual class of phenotypes.


The highly complex nature of phenotype descriptions makes capturing them a major challenge. Text/data mining has proven to be the essential link between free-text literature and expert conceptual encodings. Taken together, the simulations Dr. Collier has done indicate that hybrid methods can successfully identify a range of human phenotypes in situ in the scientific literature and that these phenotype candidates are relevant for a wide range of human heritable diseases. The PHENOMINER approach provides a novel set of techniques and gold-standard benchmark data for flexibly capturing diverse phenotypes and, in the future, a pathway to answering new research questions such as which phenotype forms are actually used by authors and how frequently. The project has made good progress towards taking phenotype candidates and using them to discover potentially novel associations with OMIM disorders. In future work the investigation should be extended to enable analysis of typed relationships that are reported between a phenotype and a disorder, as well as to include relationships to genes and mutations. Dr. Collier is in discussions with collaborators to provide an expert curation study of those phenotypes not matched to existing database terms in order to speed up the expansion of gold-standard manually curated data.