Automatic speech endpoint detection and labelling technology

Speech recordings usually contain occasional short periods of silence, background noise or other non-speech sound. If such recordings are to be used for training automatic speech recognition (ASR) systems such as OPTACIA phonetic maps, these unwanted non-speech acoustic artefacts must be excluded by identifying precisely where they occur within the recording. The technique for locating the start and end points of speech and/or non-speech sounds is known as endpointing or segmenting.

It is also important to associate the target speech sounds with items in the ASR application’s vocabulary. This process, known as labelling, is often combined with endpointing. Typically the speech sound’s start and end points are specified (in some unit of time) along with the identifying symbol, e.g. 0.20 0.50 a (indicating that the speech sound -- the vowel a -- started 0.2 seconds into the signal and ended 0.3 seconds later, i.e. at 0.5 seconds).
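As an illustration of the label format described above, the following minimal Python sketch reads and writes lines of the form "start end symbol". The names Segment, parse_label_line and format_label_line are hypothetical and are not taken from any existing tool; the sketch also assumes that times are given in seconds and that fields are separated by whitespace.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One labelled stretch of signal (hypothetical helper type)."""
    start: float  # start time, assumed to be in seconds
    end: float    # end time, assumed to be in seconds
    label: str    # identifying symbol, e.g. the vowel "a"


def parse_label_line(line: str) -> Segment:
    """Parse a line such as '0.20 0.50 a' into a Segment."""
    start_text, end_text, label = line.split(maxsplit=2)
    segment = Segment(float(start_text), float(end_text), label.strip())
    if segment.end < segment.start:
        raise ValueError(f"end point precedes start point: {line!r}")
    return segment


def format_label_line(segment: Segment) -> str:
    """Write a Segment back out with two decimal places per time."""
    return f"{segment.start:.2f} {segment.end:.2f} {segment.label}"


if __name__ == "__main__":
    seg = parse_label_line("0.20 0.50 a")
    print(seg.label, "lasts", round(seg.end - seg.start, 2), "seconds")
    print(format_label_line(seg))
```

Storing the end point rather than the duration matches the example in the text, where the vowel ends at 0.5 seconds; the duration can always be recovered by subtraction.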

Manual endpointing of speech data can be a tedious process, so this technology automates it. The endpointing algorithm assumes that speech sounds contain more energy than background noise when the speaker is near the microphone (e.g. when the microphone is head-mounted). It computes the average background noise energy level at the start of the recording, before any speech is encountered, and then uses this value as the energy threshold for distinguishing between speech and non-speech. The application also supports segment separation and separate labelling of single consonant-vowel (CV) clusters.
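The energy-threshold idea can be sketched roughly as follows. This is not the STAPTK implementation, only a minimal illustration under stated assumptions: the signal is a list of float samples, energy is measured over fixed non-overlapping frames, the first noise_frames frames contain no speech, and the threshold is the average noise energy multiplied by a safety margin. The function names and the frame_ms, noise_frames and margin parameters are all hypothetical.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_endpoints(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: float = 10.0,   # assumed frame length
    noise_frames: int = 10,   # assumed speech-free lead-in
    margin: float = 4.0,      # assumed safety factor above the noise level
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of stretches above the threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    energies = frame_energies(samples, frame_len)
    if len(energies) <= noise_frames:
        raise ValueError("recording too short to estimate background noise")

    # Average background noise energy at the start of the recording.
    noise_level = sum(energies[:noise_frames]) / noise_frames
    threshold = noise_level * margin

    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current segment, if any
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            segments.append((start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            start = None
    if start is not None:  # signal ended while still above the threshold
        segments.append((start * frame_len / sample_rate,
                         len(energies) * frame_len / sample_rate))
    return segments


if __name__ == "__main__":
    # Synthetic signal at 1000 Hz: quiet, loud, quiet.
    signal = [0.01] * 200 + [0.5] * 300 + [0.01] * 200
    print(find_endpoints(signal, sample_rate=1000))
```

A practical endpointer would normally also smooth the decision, for example by requiring a minimum segment duration, so that brief clicks are not marked as speech; that refinement is left out here to keep the sketch short.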

This technology has been tested extensively and found to be robust under normal conditions. It is currently implemented as part of the STAPTK software, but the source code can easily be modified to make it a stand-alone application. Such an application would be highly useful for endpointing and labelling speech signals featuring isolated words and sub-word units (phones).

