Article Category

News

CorpEus, a "web as corpus" tool for Basque, is on-line

The Elhuyar Fundazioa R+D Team, in collaboration with the IXA Team at the Faculty of Informatics of the University of the Basque Country (UPV/EHU), has launched an on-line web service known as CorpEus. The CorpEus service lets users query the Internet as if it were an enormous corpus of Euskera, the Basque language.

CorpEus can search every Basque-language web page on the Internet for a word or phrase and show all occurrences in their context, together with graphs based on various parameters. It was presented at the international WAC3 (Web as Corpus) congress at Louvain-la-Neuve (Belgium) on 15 and 16 September, where it was received with great interest by participants, given that the new tool uses a methodology that could be useful for other languages.

Today all languages need corpora. They are a highly important resource for developing linguistic technologies, for drawing up dictionaries and for normalising the language itself for use in translation. In a nutshell, corpora provide information about the real usage of words: whether one word is used more than another, how it is declined or written, its collocations, and so on. But drawing up a corpus is an arduous and costly task, and it is not easy to keep it continuously updated. As a result, corpora of Euskera are few and small, at least in comparison with other languages.

We have, nevertheless, the Internet: an enormous collection of texts, available to all, with many more texts than any corpus in Basque and, moreover, constantly updated. It is, when all is said and done, a corpus, although it does not advertise itself as such. It would be a fine thing to be able to consult and exploit it as a corpus. This is precisely what CorpEus does.

CorpEus uses the APIs of Internet search engines to find out which web pages contain the words consulted. But, unlike other Internet search engines and tools, it solves two problems specific to Euskera: it searches by lemma, and it returns only pages in Basque. This is achieved by means of morphological generation and filter words, employing various tools from the IXA Team at the Faculty of Informatics of the University of the Basque Country (UPV/EHU).

Once the search has been carried out, CorpEus shows all the words found, with their context, together with the number of appearances and graphs based on a number of factors such as form, category and the lemma of the previous word. It can also order the words according to various parameters and present a linguistic analysis of the results. It works with various types of documents (HTML, XML, RSS, RDF, TXT, DBF, DOC, RTF, PDF, PPT, PPS, XLS).

Moreover, it detects whether the word consulted has variants and, apart from carrying out the search, informs the user of these variants or, if the word consulted is itself a variant, of the standard form. If the word is not recognised, CorpEus checks whether a standard word can be arrived at using phonological rules and, if so, suggests it to the user. When the user keys in an unknown or ambiguous term, he or she can choose from the returned analyses. The user can also search for terminological lemmas or whole syntagms by keying in the words between inverted commas.

CorpEus is programmed to use the APIs of the principal search engines (Google, Google AJAX, Yahoo!, Windows Live Search), but the public service will, for the present, be provided through Windows Live, as it is the API that offers the best conditions (more than 25,000 queries a day, compared to 1,000 for Google and 10,000 for Yahoo!).

CorpEus is on-line at . More information is available on its presentation and help pages.
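
The lemma-based, Basque-only search described above can be pictured with a short Python sketch. It is a toy illustration under stated assumptions, not the CorpEus implementation: the suffix list, the filter words, the function names and the generic boolean query syntax are all hypothetical, and real morphological generation for Basque (as done with the IXA Team's tools) is far richer than simple concatenation. The idea shown is only that a lemma is expanded into several inflected forms, any of which may match, while a few very frequent Basque words are required so that results lean towards pages written in Basque.

```python
# Toy illustration only: the suffix list and filter words are assumptions,
# not the IXA Team's morphological tools or the real CorpEus configuration.

# A handful of Basque case endings, attached naively to a vowel-final lemma.
TOY_SUFFIXES = ["", "a", "ak", "an", "ko", "ra", "tik"]

# Very frequent Basque words; requiring them biases results towards Basque pages.
TOY_FILTER_WORDS = ["eta", "da", "ez", "du"]


def generate_forms(lemma, suffixes=TOY_SUFFIXES):
    """Return candidate inflected forms of a lemma (naive concatenation)."""
    return [lemma + suffix for suffix in suffixes]


def build_query(lemma, filter_words=TOY_FILTER_WORDS):
    """Build a generic boolean query: any inflected form, plus every filter word."""
    forms = " OR ".join(generate_forms(lemma))
    filters = " ".join(filter_words)
    return f"({forms}) {filters}"


if __name__ == "__main__":
    # "etxe" is the Basque lemma for "house".
    print(build_query("etxe"))
```

A query string built this way would then be sent to a search engine API, and the returned pages would still need to be downloaded and analysed to show each occurrence in context.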

Keywords

linguistics

Countries

Spain