Text-to-Speech Synthesis in the Era of Mobile Devices

David Malah
Elron-Elbit Professor of Electrical Engineering
Head, Signal and Image Processing Lab (SIPL)
Department of Electrical Engineering
Technion – Israel Institute of Technology
Haifa, Israel

Abstract – Text-to-Speech (TTS) systems produce synthetic speech from text. Such systems are used in
a variety of applications, such as assisting people with disabilities (for example, reading to the blind), telecommunication,
entertainment, and other human-machine interactions. Important considerations in developing such
systems are achieving natural quality and high intelligibility. With advances in computer technology, in both
memory capacity and computation speed, these goals became achievable by applying concatenative
synthesis, using very large stored databases of basic speech units. The stored units are used to produce
desired utterances by elaborate concatenation, which also requires extensive CPU power. With the emergence
of mobile devices and the need for hands-free communication, along with the classical applications of TTS
synthesis on these devices, the challenge became reducing the storage size (“system footprint”) and
CPU power to fit such embedded systems, without compromising the quality of the synthesized speech.
Following an overview of speech synthesis and TTS systems, I’ll describe in this talk our recent work, done
in collaboration with the speech processing group at IBM Research Labs in Haifa. Specifically, I’ll address a
new approach for improving the speech quality of reduced-footprint TTS systems, based on a hybrid
technique that combines concatenative and statistical TTS approaches. I’ll also present an efficient
compression method that is applied to entire acoustic leaves, each containing a set of speech units, of
which the TTS database is composed.

Bio - David Malah received his PhD degree in 1971 from the University of Minnesota, Minneapolis, MN, in
Electrical Engineering. He obtained the BSc and MSc degrees in EE from the Technion, Haifa, Israel, in
1964 and 1967, respectively. He has been at the Technion since 1972, where he is an Elron-Elbit Professor of
Electrical Engineering. In 1975 he co-founded the Signal and Image Processing Laboratory (SIPL) at the
Technion, and has served as its head ever since. Between 1979 and 2001 he spent a cumulative 6 years on leave at
AT&T Bell Laboratories and AT&T Labs, NJ. From 2006 to 2010 he served as the director of the Center
for Communication and Information Technologies – CCIT, Technion. He was elected Fellow of the IEEE in
1987 and is a Life Fellow as of 2009. He is a recipient of the 2007 IBM Faculty Award and the 2011
Outstanding Achievement Award from his alma mater, the University of Minnesota. He has been on the Editorial Board
of the Journal of Visual Communication and Image Representation since 1999, and on the Senior Editorial
Board of the IEEE Journal of Selected Topics in Signal Processing since 2010. His main research interests are in
Image, Video, Speech and Audio Coding; Speech and Image Enhancement; Text-to-Speech Synthesis;
Hyperspectral Image Analysis; and Digital Signal Processing Techniques.