Project ID: IST-2000-29452
Funded under: FP5-IST
Country: United Kingdom

Tools for corpus annotation and building dialogue act classification models

This item consists of two tools:
- A Web-based editor for dialogue corpora to be marked up in the Annotation Graph notation.

- An editor for producing classification models from dialogue corpora marked up in the Annotation Graph notation.

The first item allows the corpus annotator to select attribute values for the turns in a transcribed corpus of spoken language data. (The corpus is assumed to have been previously transcribed with the Annotation Graph toolkit and aligned with the recorded speech using the PRAAT tool.) The annotator can select one or more dialogue acts to apply to a turn, and can assign additional attribute values to the turn. The tool is implemented as a Java Servlet and uses a relational database (currently Microsoft Access).

The second item facilitates the construction of a classification model for use with the WEKA machine learning toolkit from a dialogue corpus annotated in the Annotation Graph notation. A typical classification model draws its features from a 'window' of the n turns preceding the one to be classified. The editor allows the user to select n and to specify the attributes to be included in the model. Since the dialogue turns in the DUMAS English WOZ corpus had been parsed using the FDG parser, a facility was added to specify regular expressions that extract syntactic attribute values from the parser output.
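
The windowing idea can be illustrated with a minimal Python sketch. It is not the project's editor: the function names, the turn dictionaries, the "parse" strings and the regular expression are all hypothetical, and the parse strings do not reproduce real FDG parser output. The only external convention relied on is WEKA's ARFF text format, in which "?" marks a missing value. Each turn becomes one row holding the chosen attributes of its n predecessors, any regex-extracted values, and the dialogue act as the class.

```python
import re

def window_rows(turns, n, attributes, patterns=None):
    """Build one row per turn from the n preceding turns (hypothetical helper)."""
    patterns = patterns or {}
    rows = []
    for i, turn in enumerate(turns):
        row = {}
        # Attributes of the k-th previous turn; "?" when the dialogue is too short.
        for k in range(1, n + 1):
            prev = turns[i - k] if i - k >= 0 else None
            for attr in attributes:
                row[f"{attr}_prev{k}"] = prev.get(attr, "?") if prev else "?"
        # Syntactic attributes: first capture group of each user-supplied regex.
        for name, pattern in patterns.items():
            match = re.search(pattern, turn.get("parse", ""))
            row[name] = match.group(1) if match else "?"
        # The class to predict goes last, as WEKA expects by default.
        row["act"] = turn["act"]
        rows.append(row)
    return rows

def to_arff(rows, relation="dialogue_acts"):
    """Serialise rows as ARFF with nominal attributes.

    Simplification: values are assumed to need no quoting, and every
    attribute is assumed to have at least one non-missing value.
    """
    names = list(rows[0])
    lines = [f"@relation {relation}", ""]
    for name in names:
        domain = ",".join(sorted({r[name] for r in rows} - {"?"}))
        lines.append(f"@attribute {name} {{{domain}}}")
    lines += ["", "@data"]
    lines += [",".join(r[name] for name in names) for r in rows]
    return "\n".join(lines)

# Made-up turns; the "parse" field is an invented stand-in for parser output.
turns = [
    {"speaker": "user", "act": "request", "parse": "subj:I main:want obj:ticket"},
    {"speaker": "system", "act": "question", "parse": "main:go subj:you"},
    {"speaker": "user", "act": "answer", "parse": "no verb here"},
]
rows = window_rows(turns, n=1, attributes=["speaker", "act"],
                   patterns={"main_verb": r"main:(\w+)"})
arff = to_arff(rows)
print(arff)
```

Increasing n adds one further group of columns per extra turn of context, so the choice of n trades richer context against a larger and sparser feature set.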

William BLACK (Senior Lecturer)
Tel.: +44-1612-003096
Fax: +44-1612-003324