Danish Stød and Automatic Speech Recognition

OPEN ARCHIVE


Title: Danish Stød and Automatic Speech Recognition
Author: Kirkedal, Andreas Søeborg
Abstract: Stød is a prosodic feature of spoken Danish that can distinguish lexemes. This distinction can also identify word class and has the potential to improve the performance of automatic speech recognisers for spoken Danish. The manifestation of stød is highly variable and may be perceptual in nature, because stød can in some cases be heard yet not be visible in a spectrogram. This variability is the primary reason there is currently no agreed-upon acoustic or phonetic definition of stød. The working definition of stød is “... a kind of creaky voice” (Grønnum, 2005) and “stød is not just creak” (Hansen, 2015). In the present work, we investigate whether stød can be exploited in automatic speech recognition (ASR). To exploit stød without an acoustic or phonetic definition, we need an (almost) zero-knowledge, data-driven approach, which rests on a number of assumptions that we investigate before conducting ASR experiments. We assume that stød can be detected in audio input using acoustic features. To detect stød, we need to identify features that signal stød, which requires annotated data. To select the right features, the stød annotation must be reliable and accurate. We therefore conduct a reliability study of stød annotation with inter-annotator agreement measures, rank acoustic features for stød detection by feature importance using a forest of randomised decision trees, and experiment with stød detection as a binary and a multi-class classification task. The experiments identify a set of features important for stød detection and confirm that we can detect stød in audio. Lastly, we model stød in automatic speech recognition and show that significant improvements in word error rate can be gained simply by annotating stød in the phonetic dictionary, at the expense of decoding speed.
Extending the acoustic feature vectors with pitch-related features and other voice-quality features also gives significant performance improvements on both read-aloud and spontaneous speech. Decoding speed increases when we extend the acoustic feature vectors, and it even improves over the baseline where stød is not modelled.
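The abstract reports its recognition results as word error rate. As a reading aid, the following is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words). It is not the scoring tool used in the thesis, which the abstract does not name; the function name `word_error_rate` and the example sentences are assumptions made for illustration. The example uses the textbook stød minimal pair "hun" (she) and "hund" (dog).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only (hypothetical name, not from the thesis).
    Substitutions, deletions and insertions all cost 1.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Assumed example: the recogniser confuses "hund" (with stød) and "hun".
wer = word_error_rate("hun ser en hund", "hun ser en hun")
print(f"WER = {wer:.2f}")
```

In this assumed example a single confusion between the two members of the minimal pair costs one substitution in four reference words, which is the kind of error that stød annotation in the phonetic dictionary is meant to reduce.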
URI: http://hdl.handle.net/10398/9336
Date: 2016-08-16

This work is licensed under a Creative Commons License.

File: Andreas_Søeborg_Kirkedal.pdf
Size: 5.647 MB
Format: PDF
Description: PhD thesis
