Technotes: Speech Synthesis
This is the first in a series of ITD articles on the basics of adaptive technologies. In simple, non-technical language, the author provides a brief introduction to synthetic speech technology which is currently used by individuals with speech and/or visual impairments.
Language is probably one of the most important features which distinguish humans from other animals, and speech is the most important medium of language. So people have attempted over the centuries to build machines which imitate the sounds of speech. It is only in recent years, with the advent of digital electronic technology, that this goal has come close to being achieved. It has to be said that as yet no one has really succeeded in synthesizing a voice which is indistinguishable from a human voice - but I would predict that this is not far away. In the meantime, speech synthesizers are now available which produce speech of a quality adequate for many applications. This article presents a brief introduction to the current technology. Anyone interested in further details might wish to consult my book (Edwards, 1991).
The first characteristic of synthetic speech which a listener notices is its quality, that is to say how closely the voice resembles a human one. It has taken so long to develop adequate synthetic speech because speech is very complex and there are a number of factors which affect its quality. We need to explore some of these factors before we can look at the technology further.
It is generally assumed that the basic building blocks of speech are _phonemes_. The phoneme is the smallest segment of sound such that if one phoneme in a word is substituted with another, the meaning may be changed. For example, substituting the first phoneme in "coffee" could change the word to "toffee."
Because the definition is somewhat subjective, it is not possible to say precisely how many phonemes there are in the English language, but it is generally agreed that there are around forty. One approach to dealing with the problem of differing pronunciations of phonemes is to recognize that variations on the basic phonemes exist which are known as "allophones." So, for example the phoneme "r" may be seen to have two allophones. In a word such as "red" it is voiced, but in "try" it is unvoiced. These are _not_ separate phonemes, because if we substitute one for the other, the word would still be recognizable - it just might sound a little odd. As with phonemes, linguists disagree as to how many distinct allophones there are in the English language, but high-quality speech synthesizers have been developed on the basis of around _sixty_ of them.
In addition to how the individual phonemes or "segments" of speech are generated, speech also includes features which span segments, i.e., "suprasegmental" features. Suprasegmental features impart more information than is contained in the words alone. For instance differences in prosody (the 'tune' of the utterance) can signal the difference between a statement ("It's raining!") and a question ("It's raining?"). Timing is important too. The difference between, "The last time we met Alistair was horrible." and "The last time we met, Alistair was horrible." is signalled by a pause. (Notice how in written language we attempt to communicate some of the suprasegmental aspects of speech through punctuation - in this case a comma). The quality of a voice is judged on both its segmental fidelity (how authentic the phonemes sound) and its suprasegmental features.
The designer of a speech synthesizer has to consider a number of competing requirements; quality is just one of them. In some applications quality is paramount, and in particular this is true when the synthetic voice is being used as a replacement for the natural voice of someone who cannot speak. However, it should be borne in mind that quality is not always so important. There are even situations when a degraded quality is desirable, such as when you want the listener to be aware that they are being spoken to by a machine and not an intelligent person. Also quality may be sacrificed for other factors, notably monetary cost.
Now that we have some idea of what quality is in this context we can look at the sorts of technologies currently employed to try to achieve the goal. Klatt (1987) provides a very comprehensive review of text-to-speech technology for anyone interested in the details.
If a limited vocabulary is required, then the technique of _copy synthesis_ can be appropriate. This consists essentially of recording a real human voice and storing it in a digital form (much in the way that music is stored on CDs). These stored utterances can then be retrieved and strung together to make meaningful messages. An important choice is how big the stored utterances should be. If you store individual words, then it is possible to create completely new sentences. However they may sound a bit odd because the prosody will be artificial. In particular, the way we say a word depends on its position in the sentence. For example the pitch of words (in British and American English, at least) tends to fall throughout a sentence. Thus, a word at the end of a sentence will be spoken at a lower pitch than if it came near the beginning of the sentence. Copy synthesis is useful in particular applications. For instance, some telephone enquiry systems use it. A telephone number can be pronounced by a program which simply strings together recordings of its digits. Careful listeners may notice that in practice different recordings of the same digit are sometimes used, giving a more appropriate pitch pattern. There are also augmentative communication devices which use copy synthesis. General, useful utterances are stored in the device and can be spoken in response to some form of selection action.
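The telephone-number case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clip file names and the two pitch variants ("medial" and "final") are hypothetical, and a real system would play the recordings rather than just list them.

```python
# Copy synthesis sketch: a telephone number is "spoken" by stringing
# together stored recordings of its digits.
# ASSUMPTION: each digit was recorded twice, once with a level pitch
# ("medial") and once with a falling, sentence-final pitch ("final").
# The file naming scheme is invented for this example.

def clip_sequence(number: str) -> list:
    """Return the recordings to play, in order, for a telephone number."""
    # Ignore spaces, hyphens and any other non-digit characters.
    digits = [ch for ch in number if ch.isdigit()]
    clips = []
    for position, digit in enumerate(digits):
        is_last = position == len(digits) - 1
        # The last digit uses the falling-pitch recording so that the
        # number sounds finished rather than cut off.
        variant = "final" if is_last else "medial"
        clips.append(f"{digit}_{variant}.wav")
    return clips


print(clip_sequence("555-0142"))
```

Keeping more than one recording per digit is what allows the pitch pattern to sound more natural than a single recording repeated in every position.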
Copy synthesis implies a limited vocabulary - limited to those particular words or phrases which have been stored. Most applications require a much broader, essentially unlimited vocabulary. In general the requirement is to turn text (in a computer-readable form) into speech, usually known as "text-to-speech synthesis." Most such synthesizers are based on translation from text into streams of phonemes (or allophones), which are then made audible by sound-production hardware.
The quality of the pronunciation depends on how good the rules for translation from text to phonemes are. However, pronunciation is often irregular and cannot always be expressed in general rules. English is probably one of the worst languages in this respect. For example, what rules could capture the differences between the pronunciation of "bough," "cough," and "though"? To cope with these irregularities, synthesizers generally have "exception dictionaries" which map such words directly into the appropriate phonemes. When a word is passed to the synthesizer, it is first looked up in the dictionary. Only if it is not present are the rules for generation of the phoneme strings applied.
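The dictionary-first lookup can be sketched as follows. This is a toy under stated assumptions, not how any particular synthesizer works: the phoneme symbols are an informal ARPAbet-like notation chosen for illustration, and the fallback "rules" are a deliberately naive one-letter-to-one-sound table that is far simpler than real letter-to-sound rules.

```python
# Text-to-phoneme sketch: consult the exception dictionary first and
# apply the general rules only if the word is not listed there.
# ASSUMPTION: the phoneme symbols and both tables are illustrative only.

EXCEPTIONS = {
    "bough": ["B", "AW"],
    "cough": ["K", "AO", "F"],
    "though": ["DH", "OW"],
}

# Naive fallback rules: one sound per letter, for a handful of letters.
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH",
    "g": "G", "i": "IH", "n": "N", "o": "AA", "p": "P",
    "r": "R", "s": "S", "t": "T",
}


def to_phonemes(word: str) -> list:
    """Translate one word into a list of phoneme symbols."""
    key = word.lower()
    # Step 1: irregular words are mapped directly to their phonemes.
    if key in EXCEPTIONS:
        return list(EXCEPTIONS[key])
    # Step 2: otherwise apply the general rules letter by letter.
    # Letters with no rule in this toy table are silently skipped.
    return [LETTER_RULES[ch] for ch in key if ch in LETTER_RULES]


print(to_phonemes("cough"))  # found in the exception dictionary
print(to_phonemes("red"))    # falls through to the rules
```

Without the exception dictionary the naive rules would give "bough," "cough," and "though" the same ending, which is exactly the problem the dictionary exists to solve.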
Most synthesizers have a male voice. This is mainly because it is more difficult to synthesize a realistic female one; it is not simply a matter of playing the male voice at a higher pitch. Furthermore, most synthesizers speak English - with an American accent. This is not so much a technical matter as a reflection of their country of origin. There are British English synthesizers and synthesizers for an increasing number of other languages, though the quality of the speech in other languages does not yet seem to match that of the English ones. A few synthesizers are "multi-lingual," the switch from one language to another involving the selection of a different ROM chip.
Most synthesizers are constructed from hardware components. However, with increases in processor power, and with computers now having built-in sound generators, it is becoming more common to find synthesizers which are implemented in software. A hardware synthesizer may be an external device, in which case it will be attached to one of the computer's input/output ports. Alternatively it may be on a "card" which is fitted internally into the machine.
The text fed to the synthesizer is usually in the form of standard ASCII computer code, including punctuation. The connections are usually standard too (such as an RS-232 serial connection). This level of standardization means that there is some scope for substituting different synthesizers; if you attach any synthesizer to speech-based software you will probably get some kind of output. However, synthesizers vary a great deal in the facilities they have and the way the software controls those facilities. With one synthesizer it may be possible to embed commands in the text to alter the way the speech will be pronounced (inserting a pause or altering the pitch, for instance). That same command passed to another synthesizer may have no effect - or worse, it may have a completely different effect (inserting a pause _instead of_ altering the pitch, perhaps).
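One way software can cope with these differences is to keep a separate command table for each synthesizer, as in the sketch below. Everything here is hypothetical: the synthesizer names and the embedded command codes are invented for illustration and do not correspond to any real product.

```python
# Embedded-command sketch: the software describes what it wants in
# abstract terms ("pause", "raise_pitch") and translates each request
# into whatever code the attached synthesizer understands.
# ASSUMPTION: the synthesizer names and command codes are made up.

COMMAND_TABLES = {
    # On this synthesizer "[+]" raises the pitch.
    "synth_a": {"pause": "[P]", "raise_pitch": "[+]"},
    # On this one the same code "[+]" means a pause, and there is no
    # pitch command at all.
    "synth_b": {"pause": "[+]"},
}


def render(parts, synth: str) -> str:
    """Build the character stream to send to the named synthesizer.

    `parts` is a list of ("text", string) or ("cmd", name) pairs.
    """
    table = COMMAND_TABLES[synth]
    out = []
    for kind, value in parts:
        if kind == "text":
            out.append(value)
        elif value in table:
            out.append(table[value])
        # Commands the synthesizer does not support are dropped rather
        # than sent, since they might be misinterpreted.
    return "".join(out)


message = [("text", "Hello"), ("cmd", "pause"),
           ("text", " world"), ("cmd", "raise_pitch")]
print(render(message, "synth_a"))
print(render(message, "synth_b"))
```

Sending the stream prepared for one synthesizer to the other would turn the pitch change into a pause, which is the kind of mismatch described above.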
There are many uses to which synthetic speech can be applied, but there are two particular prosthetic applications for people with disabilities: augmentative communication and computer access for blind people.
"Augmentative communication" refers to the use of technology (usually computer-based) to facilitate personal communication. In other words someone who cannot speak for one reason or another may use a device through which he or she can specify utterances to be communicated to another person. The obvious medium of communication for such a device is (synthetic) speech, since it is the voice which is being replaced (though there are similar devices which rely on text - displayed on a screen or a piece of paper). Professor Stephen Hawking, author of _A Brief History of Time_, is perhaps the most celebrated user of such technology.
A person's voice is a very personal attribute; it is part of their self image. Witness people's reaction to hearing a tape recording of themselves. Often their voice does not sound to them as they would like to think they sound to other people. So, for someone who might use a synthetic voice to replace natural speech, its quality is most important. Indeed there are many people who might use this technology but choose not to do so because of the quality.
The other major application relies on the fact that text on a computer screen can be converted into speech. This is carried out by a "screen reader," which is a piece of (usually) software which runs alongside other programs, capturing whatever they display on the screen. The great advantage of a screen reader is that it will work with standard application software, so it is not necessary to develop (say) a talking word processor and a talking spreadsheet. With a single screen reader, both standard packages can be made accessible. The task of a screen reader is quite complex because it is not simply a case of dumping the entire contents of the screen into speech; that would overwhelm the user. Instead the user requires some degree of control.
The user of a screen reader may have quite different requirements of the speech synthesizer from those of someone using it as a replacement voice. As a regular user, the person will quickly learn to understand the speech, so that its quality may be less important. Other factors may be taken into account, such as cost and the speed of the speech. Speech (natural or synthetic) is slow - certainly compared to the speed of silent reading. Hence, many blind speech users prefer to hear the speech at an increased speed. In other words, they prefer to put up with a further degraded quality for the sake of efficiency. This is a factor which ought to be borne in mind by more synthesizer manufacturers. (See Blenkhorn, 1994, for further discussion.) Also, cost can be a major consideration and at least one blind user of synthetic speech has suggested that he would rather halve the price of his synthesizer than double the quality.
An important development has been the marrying of speech synthesis with optical character recognition, whereby printed texts can be read aloud by a machine to people who cannot read them for themselves (principally blind people, but also those with other "print handicaps" such as dyslexia). A number of such machines exist and their price has dropped dramatically in recent years. Here again quality becomes important. It is very hard to listen to long passages in a poor quality voice. Given current speech quality, reading machines do not offer a real substitute for a human reader (be they live or recorded) particularly when reading for entertainment.
Like all computer technology, speech synthesizers have greatly reduced in price over recent years. As always, the main aim of research and development is to improve quality, and progress is being made. Already synthesizers exist, based on new technology, which produce voices which would be hard to distinguish from a recording of human speech (certainly over a telephone line). What they lack currently is the facility to convert from text to speech in real time. It is only a matter of time before the techniques and technology are developed to the point that this becomes feasible.
This article has briefly outlined the way in which current synthetic speech technology attempts to achieve the goal of producing human-sounding speech from a machine. Enormous progress has been made in recent years such that synthetic speech of reasonable quality is available at affordable prices. All the signs are that this development will continue, particularly to the benefit of those people who _need_ to use it because other forms of communication are not available to them.
References
Blenkhorn, P. (1994) "Producing a text-to-speech synthesizer for use by blind people" in Edwards, A. D. N. (ed.) _Extra-Ordinary Human-Computer Interaction_. New York: Cambridge University
Edwards, A. D. N. (1991) _Speech Synthesis: Technology for Disabled People_. London: Paul Chapman Ltd. (distributed in the USA by Paul Brookes).
Klatt, D. H. (1987) "Review of text-to-speech conversion for English," _Journal of the Acoustical Society of America_, Vol. 82, no. 3, pp. 737-793.
Alistair D. N. Edwards (firstname.lastname@example.org)
Human-Computer Interaction Research Group
Department of Computer Science
University of York
York England YO1 5DD