Common Language Resources and Technology Infrastructure

Final Report Summary - CLARIN (Common language resources and technology infrastructure)

The easiest way to describe CLARIN in non-technical terms is to say that its goals were threefold. First, it aimed to unite existing digital archives in Europe that contained language-based material into a federation that would give the social sciences and humanities research communities unified access to the content, irrespective of the location of researcher or data. Secondly, it wanted to make the wealth of language and speech processing tools developed over recent years available to interested researchers, with a view to increasing their productivity and opening up new research avenues. Thirdly, it wanted to provide web-based services that would allow non-expert users to perform complex tasks on the materials contained in the archives, such as 'Summarize all news about environmental issues in Le Monde of 17 March 2008, in Polish'. To achieve this, various technical challenges had to be addressed. The main goal of the CLARIN preparatory phase project was to lay the foundations for the construction of this infrastructure.

CLARIN was shaped as a distributed data infrastructure with centres in all participating countries and a strong governing and coordination body to ensure interoperability between centres. Countries would fund their own technical operations and the population of the infrastructure with data and tools and would jointly fund the governance and coordination body. The final result of the project consisted of:

1. a full technical specification of the future infrastructure, including a first set of recommended standards
2. an experimental prototype embodying it, which served both as a proof of concept and as a starting point for the further development and expansion
3. an initial set of language resources and services that could already be accessed and used
4. an active community representing over 200 centres in 33 countries
5. active working relationships with sister projects from ESFRI and other European Community programmes
6. intensive contacts with user communities in linguistics and initial contacts with other user communities
7. a specification of the governance and coordination structure and a financial plan
8. firm financial commitments from five to eight countries and expressions of an intention to join or to collaborate from 10 other countries or institutions
9. a 'ready to submit' application for the creation of a legal entity, shaped as a consortium of countries (ERIC), as the governing and coordinating body of the future CLARIN infrastructure, which would be hosted by the Netherlands and was expected to start its operations on 1 January 2012.

The CLARIN mission was to create an infrastructure that would make language resources and technology available and readily usable to scholars of all disciplines, in particular the humanities and social sciences (HSS). In our age we are presented with many challenges as we deal with language in electronic formats, in spoken, written and multimodal forms. The sheer size of this material made the use of computer-aided methods indispensable for many scholars in the humanities and in related fields who were concerned with linguistic material.

The CLARIN infrastructure was based on the firm belief that the days of pencil and paper research were numbered, even in the humanities. Computer-aided language processing was already used by a wide variety of disciplines in the humanities and social sciences, addressing one or more of the multiple roles language plays: as carrier of cultural content and knowledge, instrument of communication, component of identity and object of study. However, achieving advanced analysis of linguistic material with the resources then available required an effort that no single humanities or social sciences scholar should be expected to make.

CLARIN proposed that any single user would have access to guidance and advice through distributed knowledge centres and, via a single sign-on, the user would have access to repositories of data with standardised descriptions and processing tools ready to operate on standardised data. The nature of the project was therefore primarily to turn existing, fragmented technology and resources into accessible and stable services that any user could share or adapt and repurpose. CLARIN could build upon a rich history of national and European initiatives in this domain and would ensure that Europe maintained the leading position in humanities and social science research in the current highly competitive era.

The CLARIN preparatory phase paved the way for the implementation of the infrastructure along four dimensions, namely funding and governance, technical specifications and validation, adequacy of the approach for all participating languages and suitability of the proposal for the end users' needs. The main objective was not to generate new foreground knowledge but rather to lay the foundations for the future CLARIN infrastructure. All research activities served to support the design and the specifications to be fed into the construction phase. Some of them were technical in nature, while others had more to do with the functioning of the future infrastructure as a means to support humanities and social sciences research in a broader sense. In terms of technical aspects, CLARIN achieved progress in:

1. component metadata formats, through linking resources to a component based description
2. linking abstracts and persistent identifiers, called handles, to address digital obsolescence and enhance the sustainability and transparency of electronic humanities research
3. providing federated identity by setting up cross-connected repositories and cross-linking the identity federations of the European member states, allowing users to access data repositories located in other countries
4. defining a standardised way of performing searches so as to effectively send the 'search' command to several data centres and bundle the returned results in a user-friendly and unified way.
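
The federated search idea in item 4 can be illustrated with a minimal sketch. It is not the CLARIN protocol or any CLARIN software: the data centres are hypothetical in-memory stand-ins created by `make_centre`, and the names `Hit` and `federated_search` are invented for illustration. The sketch only shows the general pattern of sending one query to several centres in parallel and bundling the answers into a single result list that records where each hit came from.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hit:
    centre: str  # name of the data centre that returned the hit
    text: str    # matching snippet


def make_centre(documents: List[str]) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for a remote data centre (assumption, not a real API)."""
    def search(query: str) -> List[str]:
        # Case-insensitive substring match over the centre's own documents.
        return [d for d in documents if query.lower() in d.lower()]
    return search


def federated_search(query: str,
                     centres: Dict[str, Callable[[str], List[str]]]) -> List[Hit]:
    """Send the same query to every centre and bundle the results."""
    def ask(item):
        name, search = item
        try:
            return [Hit(name, text) for text in search(query)]
        except Exception:
            # A centre that fails is skipped so the others can still answer.
            return []

    # Fixed centre order keeps the bundled result list deterministic.
    ordered = sorted(centres.items(), key=lambda kv: kv[0])
    with ThreadPoolExecutor(max_workers=max(1, len(ordered))) as pool:
        per_centre = list(pool.map(ask, ordered))
    return [hit for hits in per_centre for hit in hits]
```

Calling `federated_search("rhine", centres)` with a dictionary of such stand-in centres returns one flat list of `Hit` objects. A real deployment would replace the stand-ins with network calls to each centre and would need an agreed query language and result format, which is what the standardisation work described above was meant to provide.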

Moreover, regarding language resources and services, an overview of the current state of language resources, tools and standards was compiled during the project's preparatory phase. Metadata on all available language resources and tools from CLARIN partners were collected and made available through the Virtual Language Observatory (VLO, http://www.clarin.eu/vlo/).

As a counterpart to the VLO, user needs in several social sciences and humanities disciplines were explored within a basic language resources and tools kit (BLARK). For major and minor European languages, these needs were mapped onto the freely available language resources and tools, and gaps in the infrastructure were identified.

Moreover, the project facilitated the use and interoperability of language resources and tools by promoting standards for them, as well as for different kinds of annotations. By using those standards as pivot formats, language tools could be used with a wider range of resources and other tools.

In terms of reaching the users, CLARIN was first conceived to alleviate the fragmentation of resources, tools and projects in HSS. As well as addressing the fragmentation of the natural language processing domain from which our actors, tools and resources originate, CLARIN played a key role in wider developments across the HSS and beyond. Primarily, enriching textual data with linguistic information yielded a major improvement over manual text analysis and synthesis, since it allowed more precise hits in content analysis and was a sound means of replicating experimental results. Furthermore, as digital resources are continually updated, users would be able to access the latest versions of tools and datasets, which would enhance HSS workflows considerably. Building bridges to HSS communities that had so far been only marginally familiar, or not familiar at all, with linguistic processing tools and methods was a core accomplishment of CLARIN. The project initiated a step change in the mentality of both the language technology and humanities communities.

We also conducted fieldwork by collaborating in humanities research projects to demonstrate the use of research infrastructure in scientific investigation. We monitored and advised on how researchers managed and enriched their own digital research data. Users' scholarship advanced considerably and, in turn, CLARIN gained knowledge about the research interests, methods and conceptual frameworks that form the basis of HSS scholars' research. This knowledge would be used in tailoring the services and infrastructure so that users from these heterogeneous fields could utilise language technology and resources to their full potential.

Furthermore, specific attention was paid to the need to support future users of the infrastructure. CLARIN made a help desk available, since access to experts and expertise within CLARIN was considered absolutely necessary. Users could easily submit questions using a very simple interface. We also looked for ways to increase the efficiency of the help desk, for example its average answering speed and its ability to detect frequently asked questions.

An additional objective was to investigate the intellectual property rights and data privacy issues concerning linguistic tools and resources. The aim was to provide a framework for licensing and authorisation between CLARIN and external providers, so that new and existing resources and technology could be incorporated into CLARIN. As the end result, a network of agreements and licences was drafted in order to achieve and maintain sufficient levels of trust while conforming to the law. In addition, the legal basis for the technical infrastructure was provided by drafting a set of formal agreements, in conjunction with some software technology specifications, to authenticate and identify researchers in a secure way when they worked on distributed language resources and applications.

The project also addressed issues of governance of the future infrastructure. The main activity under this heading was the preparation of an agreement between the funding agencies in the participating countries about the construction and exploitation phase of the infrastructure. This included the investigation of possible legal, financial and organisational models. The key deliverable was the CLARIN construction and exploitation agreement that should form the basis for the joint construction and exploitation of the infrastructure by the participating countries. ERIC was the adopted form, so the main results were the documents required for the submission of a request to establish an ERIC. The relevant preparatory work was conducted in very close collaboration with the ministry of research of the host country, the Netherlands. This activity generated a significant body of knowledge concerning the creation of ERICs, both in terms of the various options to be considered and the process to get there. The group that undertook this activity operated in close collaboration with the European Community, and the activity continued beyond the completion of the project on a more structural basis under the auspices of the COPORI project.

While the problem of fragmentation of isolated silos of digital activity remained, important European initiatives resolved to work together on advocacy for improved infrastructure and on aligning their infrastructure initiatives to allow the maximum interoperability of services, collaboration and reuse of resources. In this context CLARIN played a key role. A newly created project was dedicated to widening participation in the CLARIN cross search demonstrator to other centres.

CLARIN was a successful project in the sense that it managed to mobilise a large community of researchers and research institutions interested in using digital language data and tools to support and conduct electronic humanities research. At the same time it brought together a number of countries that were willing to take joint responsibility for the operation of the project as a European research infrastructure. As the project's target was the HSS community, the social impact would be indirect rather than direct. The use of CLARIN would give researchers better access to existing material from all over Europe. This would allow them to ask old questions of more data, to ask new questions of old and new data, and to increase their research productivity. We expected that the effects would be seen in a number of areas and that research supported by the CLARIN infrastructure could contribute to the study and understanding of social climate change along historical, linguistic, geographical and social dimensions.