Trevor Johnston, Macquarie University
Onno Crasborn, Radboud University

The Use of Annotation Software in the Creation of Signed Language Corpora

Two digital signed language corpora are currently being created: one of Australian Sign Language ("Auslan") and one of Nederlandse Gebarentaal ("NGT"). We begin this presentation by identifying the major reasons for these two projects: (1) the endangerment of signed languages in deaf communities; (2) the relative scarcity of language documentation in most communities that use a signed language; (3) the general problems in linguistic analysis and corpus building associated with unwritten languages; and (4) the particular problems presented by the absence of a commonly accepted written representation of signed languages (i.e., there is no 'IPA' for sign). These last two issues mean that signed language archives, insofar as they exist at all, are of limited use to signed language linguists. Indeed, progress in signed language research is being seriously hampered by the absence of representative collections of naturalistic data for signed languages in any form, and especially in a form that can be subjected to computer-based enquiry, allowing for complex searches.

After discussing the technological and linguistic reasons why signed language corpora were impossible to create until relatively recently, we examine one specific tool, the ELAN annotation software developed at the Max Planck Institute for Psycholinguistics (Nijmegen), and discuss how it is being applied in the creation of the Auslan and NGT corpora. In particular, we deal with the issues of transcription, glossing and annotation, and the way in which ELAN can compensate for, if not circumvent, the serious problem of the lack of an accepted transcription system for signed languages: it allows search results to be inspected with direct access to the video sources. The problem is not only that there is no agreement on which transcription system to use but, more fundamentally, that the sign language signal is highly complex and multidimensional, and that our linguistic understanding of many aspects of both the manual and the nonmanual properties of sign language production remains limited. A direct link between some form of annotation and the source signal may break the circular problem that linguists face in developing transcription systems.

We will then describe how the corpus materials are being collected, the criteria used for selecting participants, and the elicitation and recording techniques. We conclude by describing the structure of the two corpora and the kinds of annotations and tags being added to the data in the first instance (examples of annotated signed language texts will be shown). The aim of both projects is to create a relatively large and richly annotated digital corpus of each language over a relatively short period (several years), adopting common XML standards for storing data. Each documentation project will use a 'cumulative' annotation protocol whereby later annotations can be added to existing annotated texts. Full-resolution PAL video files will be included in addition to lower-resolution versions for day-to-day use. Such a project would have been technically impossible only a few years ago. Close collaboration with the ELAN developers ensures that software functionality specific to working with sign language data is implemented during the course of the documentation projects.
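To give a concrete sense of how annotations can be linked directly to the video signal rather than to a written transcript, the fragment below is a simplified sketch of the kind of XML structure ELAN uses (its EAF format): annotations sit on named tiers and point to time slots in the media. The tier name, gloss value, and file name here are invented for illustration, and the real format carries additional required attributes and header information.

```xml
<ANNOTATION_DOCUMENT>
  <!-- The header points at the media file; time values are in milliseconds -->
  <HEADER MEDIA_FILE="auslan_clip.mpg" TIME_UNITS="milliseconds"/>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1750"/>
  </TIME_ORDER>
  <!-- One tier of right-hand glosses (hypothetical tier name and value) -->
  <TIER TIER_ID="RH-gloss" LINGUISTIC_TYPE_REF="gloss">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>PT:PRO1SG</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
```

Because each annotation refers to time codes in the source media rather than to a transcription, a search over gloss tiers can return the corresponding video segment directly, and new tiers can later be added alongside existing ones, which is what makes a 'cumulative' annotation protocol possible.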