Abstract:
Today's synthetic voices are largely based on diphone synthesis (DiSyn) and unit selection synthesis (UnitSyn). In most
DiSyn systems, prosodic envelopes are generated with formal models, whereas UnitSyn systems draw on extensive, highly
indexed sound databases. Each approach has its drawbacks, such as low naturalness (DiSyn) and dependence on huge
amounts of background data (UnitSyn). We present a hybrid model based on high-level speech data. Preliminary
tests suggest that prosodic models combining DiSyn style at the phone level with UnitSyn style at the supra-segmental levels
may approach UnitSyn quality on a DiSyn footprint. Our test data are Danish, but our algorithm is language-neutral.