Cultural Heritage Language Technologies

CHLT has provided a Web-based Lemmatisation Module for Early Modern Latin, called CHLT-LEMLAT, that works with the texts in the CHLT Digital Library Collections as well as with 'free text' in Latin taken from the Internet. CHLT-LEMLAT allows greater integration between Cultural Heritage resources of Latin texts and ICT tools and applications, giving a range of users (students, teachers, scholars, life-long learners and the general public) ways of understanding and interpreting these rare materials without the benefit of a classical education. Users are now able to take advantage of complex automatic morphological tools (parsers, morpho-syntactic disambiguators, wordform queries) to develop their understanding of Latin, which lowers the barriers to comprehending the meaning of words and texts. This tool will be particularly helpful to those who are teaching elementary and intermediate Latin in secondary and higher education, yet it is also essential to the advanced scholar who is uncertain about irregular word forms. Although CHLT has allowed the first stages of the development of such a tool, there is still a considerable way to go: all of the wordforms in Latin remain to be added, and new algorithms for the management of non-segmented wordforms remain to be implemented. This work will take at least three more years to bring to market. We are contacting publishers and education experts in the EC, as well as academic institutions involved in the teaching of Latin, to further develop this tool for wide distribution as a Web-Based Tool.

The creation of Old Norse Texts and Digitised Images with Morphological Analyser in an Integrated Reading Environment is a landmark achievement and the culmination of 35 months' work by CHLT partners working together to produce a fully integrated system. CHLT has made available on the web (i) images of rare and fragile texts alongside (ii) TEI-tagged texts (diplomatic and normalised) in XML with (iii) morphological analysis tools within (iv) a digital library environment (cf. www.CHLT.org and www.perseus.tufts.edu). The morphological analyser is linked to diplomatic and normalised editions of the manuscripts (as well as the Standard Edition texts of the Fornaldar Sagas), which in turn are linked to images of the manuscript pages. All of these are integrated into the CHLT-Perseus Digital Library System, where we have incorporated the morphological analyser and look-up tool; we have also integrated the CHLT Visualisation and Clustering Tool with our Old Norse Texts and Tools. The results of our work break down into five sections: (i) Underlying Code - the morpho-syntactic parser is 'object oriented', so that the rule set for Old Norse takes the form of 'modules' (not hard coded); (ii) Rules - precise rules for word classes; (iii) Phenomena - strategies for dealing with phenomena unique to Old Norse that increase parser accuracy; (iv) Transcription Guidelines; and (v) Future Work - work that builds on CHLT results in a National Science Foundation project.

CHLT has developed a collaborative infrastructure for a distributed cultural heritage library that allows metadata sharing between two digital collections at different CHLT sites, and has extended this model to a fully functional general metadata-sharing model for external partners. CHLT has taken up the challenge of applying FRBR standards to all of the CHLT and Perseus catalogue material, so that we can now share metadata with existing digital collections around the world, including the Library of Congress and OCLC, and are therefore not limited to sharing only with CHLT partners (as was originally envisioned at the start of the project). Adoption of FRBR has allowed us not only to link to international collections but also to get involved with the harmonisation efforts between the CIDOC CRM (ISO 21127) Special Interest Group and the IFLA FRBR Special Interest Group. Classical texts are available in many forms: editions, translations, and manuscripts. As a result, collections of cultural heritage works are a good test-bed for implementation of IFLA FRBR. After updating the CHLT catalogue information in FRBR for core collections, we were able to harvest metadata from the Library of Congress and OCLC's WorldCat and make use of the LC's search/retrieval web service (SRW) gateway using Z39.50; we then organised documents in our collection into FRBR categories and created a new CHLT Catalogue Database. The challenge of visualising, displaying, and sharing the catalogue was addressed by introducing three interface modes: a web-based catalogue system; a Classical Text server protocol; and a web-standards approach using OAI and the emerging SRU/W protocol (these systems allowed for the distribution of catalogue records in a Z39.50-type interface). As these standards become more widely implemented, we expect that CHLT material will be ever more available for metadata sharing on an international scale.
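The step of organising documents into FRBR categories can be pictured with a small sketch. It nests flat catalogue records by the FRBR levels Work, Expression and Manifestation. The `Record` class, the `group_by_frbr` function and the sample entries are illustrative assumptions, not the CHLT catalogue schema or its data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical flat catalogue record (not the CHLT schema)."""
    work: str           # FRBR Work: the abstract creation
    expression: str     # FRBR Expression: e.g. original text or a translation
    manifestation: str  # FRBR Manifestation: a specific published edition


def group_by_frbr(records):
    """Nest flat records as work -> expression -> [manifestations]."""
    tree = defaultdict(lambda: defaultdict(list))
    for rec in records:
        tree[rec.work][rec.expression].append(rec.manifestation)
    # Convert to plain dicts so the result is easy to print or serialise.
    return {work: dict(expressions) for work, expressions in tree.items()}


# Invented sample entries, for illustration only.
records = [
    Record("Iliad", "Greek text", "hypothetical edition A"),
    Record("Iliad", "Greek text", "hypothetical edition B"),
    Record("Iliad", "English translation", "hypothetical edition C"),
]

catalogue = group_by_frbr(records)
```

Grouping in this way is what lets editions and translations of the same classical work appear under a single catalogue entry.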

CHLT has provided an Early Modern Latin Corpus with over 300MB of Early Modern Renaissance texts and 60MB of Isaac Newton's papers as a test-bed for the CHLT Digital Library System, which also incorporates newly digitised works in the History of Science from the Linda Hall Library in Kansas City, Missouri (over 200MB of text that was not anticipated at the beginning of the project). Texts from the Stoa Project (Univ of Kentucky, Lexington) and The Newton Project (ICSTM) were tagged by hand with XML-TEI; considerable editorial work was undertaken so that the materials would be electronically archived (web-archived) using standard protocols, both for use within the CHLT Digital Library System (i.e. metadata harvesting and information retrieval) and for purposes of preservation (i.e. correct spelling, variant readings, etc.). For instance, because the Erasmus Colloquia and the text of the Encomium morias came from various sources, it was important to have the encoded material reflect a single edition (1867 ed. Desiderii Erasmi Roterodami colloquia familiaria et encomium morias). With this goal in mind we went through 63 colloquia and the Encomium morias, adjusting any orthographical and/or editorial discrepancies. Faithfulness to the text has been an important guiding principle in transcription, and the Newton Project papers offer diplomatic and normalised versions of each text with (when possible) facing page images of the manuscripts. Dissemination of these results has given access to rare and fragile source materials and, when coupled with morphological tools, has lowered the barriers to understanding these important texts for those without the benefit of a classical education.

CHLT has developed a text documentation, visualisation and clustering tool called VISHNU, which integrates within a single programme: (i) full text document indexing, (ii) collection building, (iii) keyword extraction, (iv) document search and retrieval, (v) keyword-based document clustering, and (vi) a suite of interactive interfaces that provide the end-user with a variety of different visualisations for search results. VISHNU aims to discover structure within the set of documents retrieved for a query and to expose this structure to the user to facilitate document search; it then interacts with the user for resource discovery and discrete searching across corpora. The software is especially suited to non-specialists working in specialist domains (i.e. those who have only partial knowledge of the original language). VISHNU has been developed for use with CHLT corpora in Ancient Greek, Latin and Old Norse. These texts are typically studied by non-specialist users who nevertheless have a desire to understand the texts. VISHNU allows users to search corpora in innovative ways and bridges the gap between passive awareness and active understanding. For this reason it is suitable for development in the education sector at secondary and higher education levels, although it has a much wider application potential in multimedia information retrieval and human-computer interaction research. Please contact the Head of MMIS Group, Stefan Rueger (s.rueger@ic.ac.uk) for further information.
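As a rough illustration of keyword extraction followed by keyword-based document clustering, the sketch below picks the most frequent non-stopword terms in each document and groups documents that share a keyword. This is a deliberately simple stand-in and not VISHNU's actual algorithm; the function names, the stopword list and the toy documents are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"the", "a", "of", "and", "in", "to"}


def extract_keywords(text, n=2):
    """Return the n most frequent non-stopword terms, ties broken alphabetically."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return sorted(counts, key=lambda w: (-counts[w], w))[:n]


def cluster_by_keyword(documents, n=2):
    """Map each extracted keyword to the ids of the documents that carry it."""
    clusters = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword in extract_keywords(text, n):
            clusters[keyword].append(doc_id)
    return dict(clusters)


# Toy documents, invented for illustration.
documents = {
    "d1": "king ship king saga",
    "d2": "ship voyage ship king",
    "d3": "voyage ship voyage",
}

clusters = cluster_by_keyword(documents)
```

A document may fall into more than one cluster here, which suits browsing: a reader who starts from one keyword can move to neighbouring groups through the documents they share.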

CHLT results in Ancient Language Technologies in Greek and Latin can be divided into three groups: (i) multi-lingual retrieval facilities for digital library systems, (ii) vocabulary word profile tools for texts and corpora, and (iii) syntactic parsing tools for Greek texts. (Demos of results can be viewed at http://www.chlt.org or http://www.perseus.tufts.edu). The work followed a set order. First came the initial development of vocabulary profile tools and the integration of user feedback. Next, we addressed problems of document architecture and the establishment of unique identifiers for documents in the digital library system, as well as multi-lingual information retrieval issues and multi-lingual thesauri automatically extracted from multi-lingual lexica. Then came search tools that used the thesauri to generate Greek and Latin search queries from queries that were originally entered in English. Finally, we developed a Greek syntactic parsing tool that is designed to answer questions about the distribution of grammatical features within texts, common subjects of verbs, and the identification of modifiers in Ancient Greek and Latin. Three CHLT Prototypes (Multilingual Retrieval Tool, Vocabulary Word Profile Tool and Syntactic Parser) are groundbreaking achievements and are currently running on two websites; feedback from their use will inform future development of these morphological tools.
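The idea of generating Latin search queries from English input by way of a thesaurus extracted from a lexicon can be sketched as follows. The lexicon is inverted so that each English gloss points to the Latin lemmas it translates, and an English query is then expanded into those lemmas. The function names and the three-entry lexicon are assumptions for illustration, not the CHLT tools or their data.

```python
from collections import defaultdict


def build_thesaurus(lexicon):
    """Invert a lemma -> glosses lexicon into a gloss -> lemmas thesaurus."""
    thesaurus = defaultdict(set)
    for lemma, glosses in lexicon.items():
        for gloss in glosses:
            thesaurus[gloss.lower()].add(lemma)
    return thesaurus


def translate_query(query, thesaurus):
    """Expand an English query into the Latin lemmas glossed by its words."""
    terms = set()
    for word in query.lower().split():
        # .get avoids creating empty entries for unknown words.
        terms |= thesaurus.get(word, set())
    return sorted(terms)


# Toy Latin-English lexicon, for illustration only.
LEXICON = {
    "equus": ["horse"],
    "caballus": ["horse", "nag"],
    "bellum": ["war"],
}

thesaurus = build_thesaurus(LEXICON)
latin_terms = translate_query("horse war", thesaurus)
```

In a fuller system each lemma would presumably be expanded further into its inflected forms by a morphological tool before the texts are searched; that step is left out of this sketch.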
