Collaborative Annotation of a Large Biomedical Corpus

Project description

Intelligent Content and Semantics
CALBC transforms a large set of documents into a corpus with rich semantic links to biomedical data resources

The biomedical scientific literature is the key resource for the exchange of scientific facts: researchers write publications for their peer group to propose novel theories and report groundbreaking findings. New open access policies from publishers have removed the barriers that hindered integration of literature content into the infrastructure of fact databases. This change has led to a standardization process in which scientific publications are seamlessly connected to scientific databases.
The CALBC support action will engage the community of biomedical text mining researchers in a challenge that will lead to the exchange of a large set of annotated scientific documents. This community research effort will answer a very difficult question: “If we take all available semantic resources, for example terminologies, and use them to annotate a large set of documents, what will the documents finally look like under the best possible conditions?” The solutions to this problem will deliver biomedical literature in a standardized way and will enable more sophisticated retrieval methods for the literature, i.e. retrieval with better semantic support. In addition, automatic interlinking of the documents with biomedical fact databases will become possible.
This project addresses the difficult problem of annotating an unrestricted number of text documents with a large set of semantic types from the biomedical domain. We propose a collaborative approach to this annotation task in the form of an open challenge to the biomedical text mining community. The task is the annotation of named entities in a large biomedical corpus, for a variety of semantic categories. The outcome of the project is a large, collaboratively annotated corpus, marked up with mentions of biomedical entities. The annotated corpus becomes a community resource that serves as a reference for improving text-mining applications.
The biomedical text mining research community has a long tradition of organizing such challenges as a way of evaluating techniques, sharing technical knowledge, and helping to improve the results of text mining programs. However, such challenges have typically addressed relatively small corpora in a narrow sub-domain, in part because evaluating the results is extremely time-consuming and costly. As a result, the annotated corpora they generate are too small and too narrowly annotated to be useful across a variety of text mining applications.
In contrast, we propose to create a broadly scoped, large annotated corpus (at least 100,000 Medline abstracts annotated with 5-10 semantic types) by integrating the annotations from different named entity recognition systems. Metadata will also be added to the corpus. The participating systems have different application scopes and annotation strategies and therefore complement each other; the annotated corpus will reflect these different scopes and strategies. A secondary goal of the project is to define a standardized format for representing the annotations contributed by the participants and for comparing them effectively. The current lack of such a format hinders progress in the evaluation of named entity recognition systems. The final corpus will also be made available in RDF for use in Semantic Web applications.
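As an illustration of what integrating annotations from several named entity recognition systems could look like, the following is a minimal Python sketch. The project description does not say how annotations are merged, so the majority-vote rule, the `Annotation` record, the `harmonize` function, and the `min_votes` threshold are all assumptions made for this example, not the CALBC method.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation record: a typed text span in one document."""
    doc_id: str
    start: int          # character offset where the mention begins
    end: int            # character offset where the mention ends
    semantic_type: str  # e.g. "protein", "disease"


def harmonize(system_outputs: Iterable[Iterable[Annotation]],
              min_votes: int = 2) -> List[Annotation]:
    """Keep annotations proposed by at least `min_votes` systems.

    Assumption: two annotations agree only if document, span, and
    semantic type match exactly.
    """
    votes: Counter = Counter()
    for annotations in system_outputs:
        # set() ensures each system votes at most once per annotation
        votes.update(set(annotations))
    return sorted(a for a, n in votes.items() if n >= min_votes)


# Made-up outputs of three hypothetical systems for one abstract
system_a = [Annotation("PMID1", 0, 5, "protein"),
            Annotation("PMID1", 10, 18, "disease")]
system_b = [Annotation("PMID1", 0, 5, "protein")]
system_c = [Annotation("PMID1", 0, 5, "protein"),
            Annotation("PMID1", 10, 18, "disease"),
            Annotation("PMID1", 20, 24, "species")]

consensus = harmonize([system_a, system_b, system_c])
for annotation in consensus:
    print(annotation)
```

With these inputs, the "protein" and "disease" mentions are kept because at least two systems agree on them, while the "species" mention proposed by a single system is dropped. Exact-match voting is the simplest possible choice; a real harmonization would probably also need to handle overlapping spans and differing type inventories.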
The corpus will be used to organize challenges in which participants download the corpus, annotate it with their own text mining solutions, submit the annotated corpus to a central server, and receive an assessment of their results through a fully automated analysis. Over a half-year period, submissions can be contributed and assessed at any time. At the end of that period, all submitted annotated corpora will be used to generate the next fully annotated corpus, which will then be used for the next round of the challenge.
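The automated assessment step can be pictured with a short Python sketch. The description does not name the evaluation measures, so the use of exact-match precision, recall, and F1 against the current reference corpus is an assumption, as are the `score` function and the tuple layout `(doc_id, start, end, semantic_type)`.

```python
from typing import Iterable, Tuple

# Hypothetical annotation layout: (doc_id, start, end, semantic_type)
Span = Tuple[str, int, int, str]


def score(submitted: Iterable[Span],
          reference: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact-match comparison.

    Assumption: a submitted annotation counts as correct only if the
    same document, span, and semantic type appear in the reference.
    """
    submitted_set, reference_set = set(submitted), set(reference)
    true_positives = len(submitted_set & reference_set)
    precision = true_positives / len(submitted_set) if submitted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up reference corpus and participant submission
reference = [("PMID1", 0, 5, "protein"), ("PMID1", 10, 18, "disease")]
submitted = [("PMID1", 0, 5, "protein"), ("PMID1", 20, 24, "species")]

print(score(submitted, reference))  # -> (0.5, 0.5, 0.5)
```

Because the comparison needs no human judgment, a server could run such a check on every submission as it arrives, which fits the idea of assessments being available at any time during the challenge period.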

This proposal defines a support action project that brings together researchers from international biomedical text-mining groups to address the difficult issue of annotating large text corpora with a large set of semantic types.

Fields of science (EuroSciVoc)

CORDIS classifies projects with EuroSciVoc, a multilingual taxonomy of fields of science, through a semi-automatic process based on NLP techniques. See: The European Science Vocabulary.

This project has not yet been classified with EuroSciVoc.

Programme(s)

Multi-annual funding programmes that define the EU's priorities for research and innovation.

FP7-ICT - Specific Programme "Cooperation": Information and communication technologies

Topic(s)

Calls for proposals are divided into topics. A topic defines a specific subject or area for which applicants can submit proposals. The description of a topic includes its specific scope and the expected impact of the funded project.

ICT-2007.4.4 - Intelligent content and semantics (ICT-2007.4.4)

Call for proposals

Procedure by which applicants are invited to submit project proposals with a view to receiving EU funding.

FP7-ICT-2007-3

Funding scheme

Funding scheme (or “type of action”) within a programme with common features. The funding scheme specifies the scope of what is funded, the reimbursement rate, the specific evaluation criteria to qualify for funding, and the simplified forms of cost coverage, such as lump sums.

CSA - Coordination and support action

Coordinator

EUROPEAN MOLECULAR BIOLOGY LABORATORY

EU contribution

€ 685,697.00

Address

Meyerhofstrasse 1
69117 Heidelberg
Germany

Region

Baden-Württemberg Karlsruhe Heidelberg, Stadtkreis

Activity type

Research Organisations

Total cost

No data

Participants (3)

FRIEDRICH-SCHILLER-UNIVERSITÄT JENA

Germany

EU contribution

€ 262,364.00

ERASMUS UNIVERSITAIR MEDISCH CENTRUM ROTTERDAM

Netherlands

EU contribution

€ 442,700.00

LINGUAMATICS LIMITED

United Kingdom

EU contribution

€ 108,926.00
