Volume XIV Number 1, March 2014

Scratch That: 20 Years of Change with Speech Recognition

Darren Gabbert
University of Missouri

Talk is cheap, even in terms of speech recognition technology, but it hasn’t always been that way. In 1993, I was fortunate to be using one of the first commercially available speech recognition systems, DragonDictate (not to be confused with Dragon Dictate, with a space between the words, which was released for the Mac in 2010). Due to the progression of my neurological disease, I needed a hands-free typing accommodation to keep me competitive in my IT career. DragonDictate fit the bill. The speech system, which included a beefed-up machine (with a whole megabyte of RAM!), cost my department over $10,000. No, it wasn’t cheap. But for the next 20 years, speech recognition would increase my productivity faster than my disease could diminish it.

From the beginning, speech recognition systems fell into one of two basic categories: speaker-independent or speaker-dependent. Speaker-independent systems require no training and are designed to recognize different voices and dialects, while speaker-dependent systems require user training and base their recognition on voice samples held in memory. Systems were further divided between continuous speech, which is much like natural language, and discrete speech, which requires the user to pause between each word or short phrase.

DragonDictate was a speaker-dependent, discrete speech system that ran in DOS and Microsoft Windows. User training was a 45- to 60-minute process that was not for the faint of heart. For best results, the user was expected to repeat over 200 words and commands three times each. DragonDictate could have been more accurately categorized as speaker-adaptive, because the user’s speech was not required to be in memory prior to dictating. The system had a 30,000-word base vocabulary that adapted to the user with successive use.

There are both advantages and disadvantages to using discrete speech. A clear disadvantage is the unnatural pausing required between each word. This, combined with the training requirements, deterred most keyboardists from seeing DragonDictate as a viable alternative. The advantages go to those, like me, who are forced to find a keyboard alternative. Anyone committed to productivity via DragonDictate had to accept one simple axiom: man trains machine, and machine trains man. After two weeks of daily dictation, discrete speech became unnaturally natural. And I don’t recall ever showing up... for... a... meeting... talking... like... this. DragonDictate also had extensive macro capabilities that could be leveraged to boost productivity. Command-driven text macros could insert boilerplate phrases and even paragraphs into documents on the fly. And mouse movement commands that included left click, right click, and drag made DragonDictate the only truly hands-free game in town.

Did I mention that I was actually a Mac guy? In the 1990s, that was not a good thing for someone needing a hands-free solution. With the help of Articulate Systems’ Voice Navigator, I clung to my Mac II like a limpet sticking to a rock. Voice Navigator was a speaker-dependent, discrete speech system with a 1,000-word capacity. This offered substantial support to my diminishing keyboarding abilities, but it was by no means a hands-free solution. I soon transitioned to a two-workstation setup, using the Windows platform with DragonDictate for composing text and the Mac platform with Voice Navigator for web browsing and graphic design. My use of DragonDictate grew rapidly.

Dragon’s 1997 release of its first continuous speech software, NaturallySpeaking, was a game changer that made every keyboardist take a second look. It began to open up a huge market in legal and medical transcription, and product versions tailored to the specific needs of each field soon followed. For some, speech recognition had finally arrived, and they eagerly embraced NaturallySpeaking. But some who were quick to embrace it found frustration as they sought to increase their productivity. This was primarily due to unguided user expectations. When the average person thinks of speech recognition, they often think of someone leaning back in their chair and dictating hands-free. While NaturallySpeaking was and is capable of such dictation, there is a learning curve of commands and techniques that the average person was unwilling to climb. In addition to increasing recognition accuracy, subsequent editions of NaturallySpeaking would focus on reducing training time, simplifying command sets, and automating format options to flatten this learning curve. Those finding the most immediate success used NaturallySpeaking for composing and the keyboard for recognition correction and text formatting. This approach allowed users to learn gradually the command sets specific to their needs.

Persons with mobility limitations, as well as learning disabilities, benefited greatly from NaturallySpeaking’s continuous speech. But segments of this population struggled with voice recognition accuracy. Anyone with unclear speech or limited breathing capacity often found NaturallySpeaking’s continuous speech a formidable challenge. One of the fundamentals of continuous speech is that it relies heavily on analyzing words in context. Thus, the best recognition results are more likely to be realized by dictating utterances of sentence length. My own 1- to 2-word-per-utterance limitation was less than effective with early versions of NaturallySpeaking. In fact, it wasn’t until version 10 came out in 2008 that I found NaturallySpeaking’s recognition so improved that I let DragonDictate fall into antiquity. And thanks to an advanced option that adjusts the maximum pause between spoken words, I was able to dictate entire sentences as a single utterance, which gave me a significant productivity boost.
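For readers who like to see the idea in code, the effect of a maximum-pause setting can be sketched roughly as follows. This is only an illustration under stated assumptions, not how NaturallySpeaking works internally: the function name `group_into_utterances`, the `max_pause` parameter, and the sample timings are all hypothetical. The sketch groups timed words into utterances, starting a new utterance whenever the silence before a word exceeds the allowed pause.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a recognized word: (text, start_seconds, end_seconds).
Word = Tuple[str, float, float]


def group_into_utterances(words: List[Word], max_pause: float) -> List[List[str]]:
    """Split timed words into utterances wherever the silence exceeds max_pause."""
    utterances: List[List[str]] = []
    previous_end: Optional[float] = None
    for text, start, end in words:
        # Start a new utterance on the first word or after a long enough silence.
        if previous_end is None or start - previous_end > max_pause:
            utterances.append([])
        utterances[-1].append(text)
        previous_end = end
    return utterances


# Made-up timings for a slow speaker: silences of roughly 0.8 s and 2.0 s.
timed_words = [("scratch", 0.0, 0.4), ("that", 1.2, 1.5), ("please", 3.5, 3.9)]

short_pause = group_into_utterances(timed_words, max_pause=0.5)
long_pause = group_into_utterances(timed_words, max_pause=2.5)
print(short_pause)  # each word becomes its own utterance
print(long_pause)   # all three words stay in one utterance
```

With the short pause, a slow speaker's words are cut into one-word utterances with no surrounding context; with the longer pause, the same words stay together as a single utterance, which is the kind of input a context-based recognizer handles best.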

But what about the Mac guy within that never dies? He was awakened somewhat in the 1990s by Articulate Systems’ Power Secretary, but DragonDictate for Windows was still the best hands-free solution. Likewise, MacSpeech’s releases of iListen and MacSpeech Dictate lacked the functionality to which I had become accustomed. Features such as keyboard corrections, multiple recognition modes, user-friendly macro tools, and varied training options continue to be lacking on the Mac platform, despite Nuance’s acquisition of MacSpeech and relabeling of its product as Dragon Dictate for Mac.

Today, speech recognition capabilities come bundled on both iOS and Windows systems. While not as feature-rich as Dragon NaturallySpeaking, Windows Speech Recognition offers excellent accuracy for some users and a slick Show Numbers utility for selecting icons, buttons, and menu items via voice. But whether you are looking for a hands-free computing accommodation or simply want to integrate speech recognition into various aspects of your workflow, Dragon NaturallySpeaking is still the industry leader, with the most flexibility to adapt to every user.

Much has changed... since... my... introduction... to... DragonDictate. Today, for under $150, someone can buy the Premium version of NaturallySpeaking and begin dictating with amazing accuracy within 10 minutes of installing the software. And for under $600, NaturallySpeaking’s Professional version provides advanced custom command capabilities, as well as enterprise administration of user speech profiles. Nor is speech recognition limited to desktop computers anymore. Nuance didn’t waste any time taking its cloud-based interactive voice recognition technology to the mobile market. Unlike NaturallySpeaking, which processes speech locally, Nuance’s cloud service powered Apple’s Siri and Samsung’s Android-based S Voice by analyzing voice spectrograms sent to its own servers. Google followed suit, and actually raised the bar, with its own neural-network-like Voice Search. Besides native speech applications, there are numerous third-party speech apps for Windows, Android, Apple, and BlackBerry-based devices that offer hands-free access to features and text transcription. I realized just how far speech recognition had come when I overheard my mother-in-law (who... isn’t young) dictate a text message to someone. I listened for a correction to be made. None was forthcoming. I listened for an expression of amazement. None was forthcoming. Interestingly enough, the “Lucky Few” generation finds mobile speech recognition to be a matter of fact. Much has changed, indeed.

Just about the only thing that hasn’t changed in the past 20 years is that charming speech recognition command that everyone knows and loves, “SCRATCH THAT.” Granted, more recent versions of NaturallySpeaking can boast the ability to say “Scratch That 6 Times” to avoid repeating oneself. But who would want to use it? There is a refreshing satisfaction in saying something stupid and, with those two magic words, making it vanish! If only all of life were that way...

Honey, have you gained a few pounds over the holidays?

SCRATCH THAT!

Gabbert, D. (2014). Scratch That: 20 Years of Change with Speech Recognition. Information Technology and Disabilities E-Journal, 14(1).