SCARRIE will develop a high-quality proof reading tool for the Scandinavian publishing industry. The system is designed to meet the needs of users from Danish, Norwegian and Swedish newspapers and publishing houses. It will handle spelling and grammatical errors and also check consistency with house standards or styles. SCARRIE will address special issues for the languages concerned: recognising novel compound words, handling hyphenation and correcting fixed multi-word expressions. SCARRIE will be interfaced to the word processing systems used at the validation sites. These interfaces will be simple, giving users the opportunity to evaluate the system and to feed their experience back into the development cycle. On successful completion of the project, provision will be made for the systematic development of advanced interfaces to widely used word processing environments.
Document creation and authoring are expensive and time-consuming processes, and until recently newspapers and publishing houses were among the most labour-intensive of businesses. This has changed: rapid technical development, combined with new distribution channels, is forcing investment in new media technology and driving efforts to reduce costs.
Scandinavia is one of the most active areas in the world as regards publishing. Norway has a higher net circulation of newspapers than any other country, with Sweden number three after Japan. Competition to sustain viable circulation levels is fierce, with the situation in Sweden being further complicated by a combination of tax regulations and state subsidies.
In response to these pressures, technical production has undergone extensive automation during the past fifteen years. In the 1980s, computerised writing tools were introduced by the printing houses that could afford them: English-language software with limited features and functions for local languages. These tools ran on mini-computer systems, which in the 1990s have largely been replaced by networked desktop computers and imported software that is still far too limited to cope with current requirements. As an example, a system for Swedish spell checking has no compound analysis, no mechanism for handling errors in punctuation or grammar, and inadequate word list coverage. Market size is also a contributory factor in the absence of suitably customised systems for the Danish, Norwegian and Swedish languages.
These languages share features that are inadequately handled in existing products. These include compounding, which requires intelligent language-dependent analysis, and the correction of words both in fixed multi-word expressions and in their syntactic context. Norway has the additional problem of two official language standards (Bokmål and Nynorsk).
The lack of suitable systems and the pressure to lower costs have prompted many newspapers to abandon manual proof reading before suitable high-quality software has become available. The resulting decline in quality could endanger circulation figures and so diminish income.
These factors were the driving force behind the formation of the SCARRIE consortium. The project will address these issues by developing a high quality proof reading tool for the Scandinavian publishing industry as a whole. Similarities in the languages involved make it more economic to approach the tasks at a multi-national than at a national level. SCARRIE will support human proof readers in the formal aspects of language, handling errors in the expression of words and syntactic structure. In so doing, it takes up the challenge of closing the gap between the Scandinavian languages (Danish, Norwegian and Swedish) and the more widely spoken major European languages with respect to the availability of language tools and resources.
At word level, SCARRIE will be based on the CORRie pilot, developed in the Esprit programme, while the grammar component will be based on a finite state machine. The word-level part can thus be regarded as a further development of initial EU-funded research.
The CORRie model contains the following modules, which will be used in this project:
- Dictionaries with grammatical word category information.
- Pre-processing and post-processing to handle mark-up codes and identify words to be proof read.
- Type/frequency list for all words in the text.
- Sound and spelling correction (a refined version of Triphone Analysis), combining statistical and linguistic approaches.
- Compound analysis, with a resolution mechanism for multiple variations in one text (e.g. Dutch - investeringselectie and investeringsselectie).
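The interplay of the type/frequency list and the compound analysis listed above can be illustrated with a minimal sketch. It is not the CORRie implementation: the two-word lexicon, the set of linking elements and all function names are assumptions made for illustration. The sketch splits a word into two known components with an optional linking "s", groups spellings that share the same components, and proposes the most frequent spelling in the text for the others.

```python
from collections import Counter, defaultdict
import re

# Hypothetical mini-lexicon of compound components (illustration only).
LEXICON = {"investering", "selectie"}
# Assumed linking elements: none, or a linking "s".
LINKERS = ("", "s")


def split_compound(word):
    """Return (first, second) if word = first + linker + second, else None."""
    for i in range(1, len(word)):
        first, rest = word[:i], word[i:]
        if first not in LEXICON:
            continue
        for linker in LINKERS:
            if rest.startswith(linker) and rest[len(linker):] in LEXICON:
                return (first, rest[len(linker):])
    return None


def type_frequencies(text):
    """Type/frequency list: each distinct lower-cased word with its count."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def inconsistent_compounds(text):
    """Map each minority spelling of a compound to the majority spelling."""
    freqs = type_frequencies(text)
    variants = defaultdict(list)
    for word in freqs:
        parts = split_compound(word)
        if parts is not None:
            variants[parts].append(word)
    suggestions = {}
    for forms in variants.values():
        if len(forms) > 1:
            # Prefer the spelling the author used most often in this text.
            preferred = max(forms, key=lambda w: freqs[w])
            for form in forms:
                if form != preferred:
                    suggestions[form] = preferred
    return suggestions


SAMPLE_TEXT = (
    "De investeringsselectie is klaar. Een investeringsselectie kost tijd. "
    "De investeringselectie volgt."
)
print(inconsistent_compounds(SAMPLE_TEXT))
```

For the sample text, the single occurrence of investeringselectie is mapped to investeringsselectie, which occurs twice. A real system would need a far larger lexicon, more linking elements, compounds of more than two parts, and a policy for ties.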
The project co-ordinator is a Swedish company which has sold 30,000 copies of dictionary software for different languages, as well as reference works and document authoring systems since 1991. Two national research centres from Norway and Denmark will contribute their expertise in computational linguistics, especially in the areas of compound analysis and grammar. University departments from each of the Scandinavian countries are represented, bringing their experience in computational lexicography and morphology, phonetics, and language resource construction.
The user side is made up of three newspapers (two national and one local) and two publishing houses, which will act as test beds for the project results. These users will be complemented by a large user reference group which will act as a discussion forum and contribute to the further specification of user requirements and the evaluation of proposed solutions.
The major user concern is the need for reliable, automatic tools to minimise expensive manual proof reading. This concern is shared by the publishing and printing service industries and by the newspaper industry, although each also has its particular requirements. Newspapers have their own house style and readership to consider, and printers must tune their product to different language variants and terminology. Users share the following core system requirements:
- Compatibility with as many word processors and operating systems as possible.
- The ability to run on a standard PC or Macintosh, occupy limited storage, work fast, and be user friendly.
- Adequate lexical coverage based on contemporary newspaper texts and books.
- Punctuation and spell checking.
- Addition of words and phrases, recognition of new compounds, and syntactic analysis to help identify errors such as confused homophones.
Word and grammatical error detection and correction are the basic features of the system. Word error detection will use word form dictionaries in combination with special mechanisms for handling compounds, numerical expressions, and proper nouns (outside the dictionaries). The word form dictionaries comprise two kinds of entries: correctly spelled words or approved words, and incorrectly spelled words or misspellings. The approved words are tagged with information about category, lemma (basic form), standard, style, morpho-syntactic features, and optionally morpheme boundaries for syllabification. Misspellings are fed back with recommendations for corrections.
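The two kinds of dictionary entry described above could be modelled roughly as follows. This is only a sketch under stated assumptions: the class names, field names, tag values and sample entries are hypothetical and do not reflect the project's actual dictionary format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApprovedWord:
    form: str
    category: str                      # grammatical word category, e.g. "noun"
    lemma: str                         # basic form
    standard: str                      # language standard the form belongs to
    style: str = "neutral"
    features: tuple = ()               # morpho-syntactic features
    boundaries: Optional[str] = None   # optional morpheme boundaries


@dataclass(frozen=True)
class Misspelling:
    form: str
    corrections: tuple                 # recommended approved forms


class WordFormDictionary:
    def __init__(self, entries):
        self._entries = {entry.form: entry for entry in entries}

    def check(self, word):
        """Return (status, suggestions) for one word form."""
        entry = self._entries.get(word.lower())
        if isinstance(entry, ApprovedWord):
            return "approved", ()
        if isinstance(entry, Misspelling):
            return "misspelling", entry.corrections
        # Unknown forms would go on to compound, number and name handling.
        return "unknown", ()


# Hypothetical Swedish sample entries (illustration only).
SAMPLE = WordFormDictionary([
    ApprovedWord("huset", "noun", "hus", "standard",
                 features=("neuter", "singular", "definite"),
                 boundaries="hus-et"),
    ApprovedWord("abonnemang", "noun", "abonnemang", "standard"),
    Misspelling("abbonemang", ("abonnemang",)),
])

for word in ("Huset", "abbonemang", "husbil"):
    print(word, SAMPLE.check(word))
```

Storing known misspellings alongside approved forms lets frequent errors be corrected by a single lookup, while forms found in neither set are passed on to the special mechanisms for compounds, numerical expressions and proper nouns.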
Word form dictionaries will be derived from corpora collected from specific areas of newspapers and publishing houses, and from existing base form dictionaries. Part of the linguistic information needed will be assigned manually, with the rest being assigned automatically using appropriate tools. An existing lexical database will be used as a frame of reference to make sure that highly frequent word forms are included.
Three grammar checkers currently offered on the Swedish market will be analysed to find which types of constructions they can handle. Together with error and test corpora the analysis will serve as a baseline for the grammar checker.
Consistency with regard to standard or style will also be checked. A common standard will be defined for each language regarding punctuation, use of capital letters and hyphens, numerical expressions and foreign names.
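One way to picture a consistency check against a chosen standard is to tag word forms with the standards they belong to and flag forms that fall outside the house standard. The sketch below does this for a handful of Norwegian forms. The tag table, the function name and the reporting format are assumptions for illustration, not the project's design.

```python
import re

# Hypothetical tags: which Norwegian standard(s) each word form belongs to.
STANDARD_OF = {
    "jeg": {"bokmål"},
    "eg": {"nynorsk"},
    "ikke": {"bokmål"},
    "ikkje": {"nynorsk"},
    "og": {"bokmål", "nynorsk"},
}


def standard_violations(text, house_standard):
    """List (position, word) pairs whose form is not valid in the house standard."""
    violations = []
    for match in re.finditer(r"[^\W\d_]+", text):
        standards = STANDARD_OF.get(match.group().lower())
        # Untagged words are skipped; this sketch only judges tagged forms.
        if standards is not None and house_standard not in standards:
            violations.append((match.start(), match.group()))
    return violations


print(standard_violations("Jeg vet ikkje", "bokmål"))
print(standard_violations("Jeg vet ikkje", "nynorsk"))
```

With Bokmål as the house standard the sketch flags "ikkje", and with Nynorsk it flags "Jeg". Checks on punctuation, capitalisation, hyphens and numerical expressions would need their own rules and are not covered here.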
Funding Scheme: CSC - Cost-sharing contracts