OCR for the Basque language

The ELEKA company has developed a project which adds Basque spelling, together with morphological information on Basque words, to the OCR software most commonly used today, Omnipage. The development has been supported by the Basque Government and will soon be available.

OCR (Optical Character Recognition) is the recognition by computer of printed or written characters. When a book is scanned, each character is initially captured as an image. The scanned image is then analysed and each character is converted into a standard character code, such as ASCII.

Most OCR programs used today rely on a Spanish dictionary when analysing a text in Basque. Using a dictionary from another language, however, introduces errors into the text. For example, with an English dictionary, almost every time the Basque word sei (six) appears, it is replaced by the word set. With a Spanish dictionary, whenever the word energia appears, it is replaced by energía with the Spanish accent, although Basque words do not bear accents.

The ELEKA linguistic engineering company has developed a project which adds Basque spelling, together with morphological information on Basque words, to the OCR software most commonly used today, Omnipage. This OCR can process text in 114 different languages, Basque among them, and convert the scanned images into characters; in other words, it can recognise the Basque alphabet. Until now, however, it has not offered the checking and correction of Basque texts that it provides for other languages such as German and English. Now the OCR can make the relevant corrections in Basque and suggest alternatives for words it does not recognise.

ELEKA's next step will be to develop an OCR corrector for word processors such as Microsoft Word or Open Office. Users who do not have Omnipage normally scan a text and create a document in Microsoft Word or a similar program. To correct such documents, a new Xuxen-type proof-reader has to be created, one that takes into account the mistakes generated by the OCR. Users without Omnipage will then also have an OCR system for Basque.
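
The substitutions of sei and energia described above can be illustrated with a minimal sketch. It assumes a corrector that replaces any unknown word with the closest dictionary entry by edit distance; this is an illustration only, not Omnipage's actual algorithm, and the tiny word lists and the names `edit_distance` and `correct` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word: str, lexicon: set[str], max_distance: int = 1) -> str:
    """Return the word unchanged if known, else the nearest lexicon entry."""
    if word in lexicon:
        return word
    # sorted() makes the choice deterministic when several entries tie
    best = min(sorted(lexicon), key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else word


# Toy lexicons, assumed for illustration only
ENGLISH = {"set", "six", "energy"}
SPANISH = {"seis", "energ\u00eda"}  # "energía" with a precomposed accent
BASQUE = {"sei", "energia"}

for name, lexicon in [("English", ENGLISH), ("Spanish", SPANISH), ("Basque", BASQUE)]:
    print(name, correct("sei", lexicon), correct("energia", lexicon))
```

With these toy lists, sei becomes set under the English lexicon and energia becomes energía under the Spanish one, while the Basque lexicon leaves both words unchanged.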
As part of its work on linguistic tools for Basque, ELEKA has thus developed one that digitises Basque texts far more accurately: a tool capable of automatically interpreting and correcting digitised texts in Euskara. The project has been supported by the Basque Government's sub-department for Linguistic Policy, which will be responsible for distributing the new application.
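
A proof-reader that takes OCR-generated mistakes into account could, as one possible approach, try to undo typical character confusions and keep only the results found in a Basque word list. The sketch below is a hedged illustration of that idea, not ELEKA's or Xuxen's method; the confusion table, the word list and the names `candidates` and `suggest` are all assumptions.

```python
# (what the OCR printed, what was probably on the page); illustrative pairs only
CONFUSIONS = [("rn", "m"), ("1", "l"), ("0", "o"), ("t", "i")]


def candidates(word: str) -> set[str]:
    """All words obtained by undoing exactly one confusion at one position."""
    out = set()
    for seen, intended in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            out.add(word[:start] + intended + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return out


def suggest(word: str, lexicon: set[str]) -> list[str]:
    """Known words are accepted; otherwise return lexicon words one confusion away."""
    if word in lexicon:
        return [word]
    return sorted(c for c in candidates(word) if c in lexicon)


# Toy Basque word list, assumed for illustration
BASQUE = {"sei", "etxe", "mendi", "liburu"}

for scanned in ["set", "rnendi", "1iburu", "etxe"]:
    print(scanned, suggest(scanned, BASQUE))
```

Restricting suggestions to known confusions keeps the corrector from proposing arbitrary near neighbours, at the cost of missing errors that the table does not list.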

Keywords

Linguistic engineering

Countries

Spain
