Talking Books: Toward A Digital Model
We propose to develop performance criteria for next-generation digital talking books (DTB) by using the National Information Standards Organization (NISO) standardization process. This process entails soliciting advice from all interested parties including users, consumer organizations, and manufacturers, then seeking consensus on the characteristics of the contemplated product. NISO is accredited by the American National Standards Institute (ANSI) to develop and maintain technical standards for information services, libraries, publishers, and others involved in the business of creation, storage, preservation, sharing, accession, and dissemination of data.
At present, library access for blind and physically handicapped persons is served by analog cassette tape. This technology has enjoyed the acceptance and economy found in the consumer entertainment market for over twenty-five years. As digital technology gains in market share, the analog cassette is likely to become less attractive from both a cost and consumer-preference standpoint. The two forces, cost and preference, will ultimately converge to motivate a move to digital methods. In anticipation of this, a digital talking book (DTB) standard is needed to define requirements, examine the feasibility of proposed features, and most importantly, explore user-control preferences.
Developing a standard necessarily begins far in advance of system-wide hardware implementation so that when the forces for change become compelling, a practical system, with acceptable controls, will be fully defined and tested. Developing the standard under the auspices of the National Information Standards Organization (NISO) is appropriate because NISO is the only organization accredited by the American National Standards Institute (ANSI) to develop and maintain technical standards for information services, libraries, publishers, and others involved in the business of creation, storage, preservation, sharing, accession, and dissemination of data.
WHY A STANDARD?
A standard is required to define minimum performance and optional features for next-generation library-access equipment used by blind and physically handicapped individuals. It will address problems of control, audio quality, media compatibility, copyright protection, ease of international interlibrary loan, and affordability. Interested parties include patrons, particularly as represented by advocacy organizations, media producers (volunteer and commercial), rights owners, equipment producers, librarians, and international borrowers and lenders.
SCOPE AND APPLICATION
The proposed standard is intended to define minimum performance requirements for next-generation patron-access equipment. It will also describe optional features. While the standard will be written in a digital context, it will not define the software or hardware internals of a particular implementation. Emphasis will be on performance characteristics and control. Potential implementers would include manufacturers of digital and analog hardware, developers of multimedia authoring and presentation software, and media producers.
Organizations with expertise in the subject matter include various advocacy and service organizations such as the American Council of the Blind, Association for Education and Rehabilitation of the Blind and Visually Impaired, American Foundation for the Blind, American Printing House for the Blind, Blinded Veterans Association, National Federation of the Blind, Recording for the Blind and Dyslexic, and National Library Service for the Blind and Physically Handicapped (NLS/BPH). Engineering expertise is available at NLS/BPH, as are funds.
Developing the proposal into an American National Standard will require engineering talent, financial support for developing software that tests control concepts, and funding for program administration, including communications and travel.
The effect on users of moving from existing technology to the proposed new standard will likely range from virtually transparent to profound. "Virtually transparent" means that, although tactile interfacing will not necessarily be the same as with a cassette player, it could be functionally identical. Similarly, sound quality could be identical to today's performance but would very likely be improved through the use of digital methods. "Profound" means that readers would have access to multiple levels of optional complexity and product-specific presentation-software features. This dual approach satisfies the demands of patrons who prefer simplicity while accommodating those who prefer more elaborate features.
Because the technology, and associated costs, found in the consumer audio market are changing, limiting the time for development of the proposed standard is important. The existing system is based on analog cassette tape, while the standard will define a system that will be digitally based but not restricted to any particular distribution medium or implementation. Projecting at least ten more years of acceptable and economical use for cassette tape suggests that the standard should be finished within five years to allow for a gradual transition period. A gradual transition will control technical risk, enhance user acceptance, and encourage financial support.
When recommending a new standard, NISO requires the proposer to estimate what portion of the user community will be motivated to adopt the standard. For talking books, the entire community will be motivated to adopt it by expanded functionality and lower cost. The "entire community" includes blind and physically handicapped patrons as well as the infrastructure of people who support and implement the library system. This community includes librarians; producers of talking books and magazines, both commercial and volunteer; equipment manufacturers; and software developers. The new standard will, again, range from transparent to profound in its impact on this community. For example, audio studios may continue to narrate into conventional analog equipment, but their product would become usable only after processing through digital encoding software that is not found in today's production stream.
Related standards include the various levels of MPEG (ISO/IEC 11172, 13818, and 14496), MPEG-7 (under development), HyTime (Hypermedia/Time-based Structuring Language, ISO/IEC 10744), the Standard Generalized Markup Language (SGML, ISO 8879), and MHEG, which is under development by an ISO/IEC JTC1/SC29 working group.
Work in Progress
DTB development has been undertaken by the Swedish Library of Talking Books and Braille and by Recording for the Blind and Dyslexic (RFB&D). The Swedish library has produced a "Digital Audio-based Information System" (DAISY) that partitions digitized spoken audio into phrases that can be randomly accessed from an audio version of the book's table of contents. RFB&D has experimented with a design based on PWWebspeak playing audio files.
PWWebspeak is a text-based web browser that communicates with the user via a speech synthesizer. We intend to build on the experience and expertise acquired in these and other projects by inviting their developers to participate in the standardization process.
Work to be Undertaken
In support of the standardization process, we intend to begin simulating DTBs in software using common off-the-shelf software systems, typically called multimedia authoring/presentation (A/P) software. The objective of this experimentation is to assess the technical and cost implications of implementing various DTB features and performance characteristics identified in the standardization process. A central concern with each feature is the structure and control preferred by users. Thus an essential requirement of simulation will be real-time user control. This will allow us to explore, on a feature-by-feature basis, the range of control options preferred by users. Examples of A/P systems that run on a personal computer are Macromedia's Director and Asymetrix's Toolbook.
Using authoring/presentation software, we will test PC-based presentation of recorded digital audio and will explore use of the corresponding text for indexing. These features are supported well enough by A/P packages to demonstrate the general concepts by developing short DTB excerpts with in-house expertise. Going beyond excerpts, complete books average twelve hours of spoken audio and require about 3,000 megabytes (3 GB) of storage. Interactively presenting a data set this large on a PC is difficult, but the difficulty can be eased by eliminating inaudible components through application of a "smart" data-reduction algorithm. One widely used family of such encoding algorithms is MPEG audio coding. Using such an encoder, it should be possible to reduce the data set, without perceptible distortion, to about 300 megabytes. At present, coding of this type is not directly supported by A/P software. Moreover, linking coded audio closely with the corresponding text is not supported either. Implementation will require considerable programming expertise that we will obtain from external sources. Control of the product, however, will be alterable so that the simulation will be an efficient vehicle for exploring user-control preferences.
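The storage figures above imply average bit rates that can be checked with a little arithmetic. The following Python sketch uses only the numbers quoted in the text (twelve hours, 3,000 megabytes, 300 megabytes); treating one megabyte as 10**6 bytes is an assumption, and the function name is illustrative only.

```python
# Figures quoted in the text. One megabyte is taken as 10**6 bytes (an assumption).
HOURS_PER_BOOK = 12
UNCOMPRESSED_MB = 3000
COMPRESSED_MB = 300


def implied_bitrate_kbps(size_mb: float, hours: float) -> float:
    """Average bit rate, in kilobits per second, implied by a size and a duration."""
    bits = size_mb * 1_000_000 * 8
    seconds = hours * 3600
    return bits / seconds / 1000


raw_kbps = implied_bitrate_kbps(UNCOMPRESSED_MB, HOURS_PER_BOOK)
coded_kbps = implied_bitrate_kbps(COMPRESSED_MB, HOURS_PER_BOOK)
reduction = UNCOMPRESSED_MB / COMPRESSED_MB

print(f"uncompressed: about {raw_kbps:.0f} kbit/s")
print(f"coded:        about {coded_kbps:.0f} kbit/s")
print(f"reduction:    {reduction:.0f}:1")
```

On these assumptions the uncompressed audio averages roughly 556 kbit/s and the coded audio roughly 56 kbit/s, a ten-to-one reduction.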
The complexity of a PC keyboard makes it unsuitable for studying user-control preferences. We are investigating the use of hand-held remote controls for this purpose. A typical programmable device is marketed as a multimedia-presentation controller or remote mouse. By itself, however, such a remote is not useful without spoken audio feedback and prompting. Thus, the simulator system must support voice synthesis. Again, A/P software does not support this capability, so integration of an external program is necessary.
Bookmarks that are retained from session to session are also important for a basic DTB simulation. Making them user-specific so that multiple users are supported may be desirable. The bookmarking concept, along with most other DTB features, would probably be implemented by making use of the scripting languages built into the authoring/presentation software. These languages are interpreted, like BASIC, and allow control structures similar to those of the C programming language. For simple functions, they are manageable by casual users, but for complex functions, such as multiple bookmarks for multiple users, they require a seasoned programmer. The end user is not expected to know or care how the scripting language works.
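As an illustration of the bookmarking idea only, the Python sketch below keeps bookmarks per user and per book and saves them to a file so that they survive between sessions. It is not written in an A/P scripting language, and the class name, method names, JSON layout, and the choice to store positions in seconds are all assumptions made for the example.

```python
import json
from pathlib import Path


class BookmarkStore:
    """Hypothetical per-user bookmark store persisted to a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Assumed layout: {user: {book_id: [position_seconds, ...]}}
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.data = {}

    def add(self, user, book_id, position_seconds):
        """Record a bookmark and write the store to disk immediately."""
        marks = self.data.setdefault(user, {}).setdefault(book_id, [])
        if position_seconds not in marks:
            marks.append(position_seconds)
            marks.sort()
        self._save()

    def marks_for(self, user, book_id):
        """Return a copy of the user's bookmarks for one book, in playing order."""
        return list(self.data.get(user, {}).get(book_id, []))

    def _save(self):
        self.path.write_text(json.dumps(self.data, indent=2), encoding="utf-8")
```

A later session that opens the same file sees the bookmarks left by earlier sessions, and each user's bookmarks stay separate from those of other users.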
Beyond basic simulation, there are many DTB features that could be explored. We will strive to implement those identified in the standards development process that can be evaluated in the PC environment. At the outset, we will consider time-scale modification (TSM) and multiple-book support. As in the case of basic simulation, our objective is to assess cost and technical feasibility while studying user-control preferences. In the long term, simulation of voice control might be explored.
TSM is variable-rate playback of spoken audio without pitch distortion. In TSM, the presentation rate is changed while pitch is preserved by repeating or deleting short speech segments. Intelligibility and listener comfort depend on which segments are repeated or deleted and on the temporal length of the segments. We will experiment with methods that distinguish between vowels and consonants. Consonants are left unchanged, while repetitive intervals (pitch periods) found within vowel utterances, or silences, may be deleted or repeated.
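The segment-level idea can be sketched in Python as follows. This is a simplified illustration, not the method to be tested: it assumes the audio has already been divided into labeled segments by some earlier analysis step, and the labels ("consonant", "pitch_period", "silence") and the function name are hypothetical. A practical implementation would also need to smooth the joins between segments.

```python
from fractions import Fraction

# Segment kinds that may be repeated or deleted (assumed labels).
STRETCHABLE = {"pitch_period", "silence"}


def time_scale(segments, rate):
    """Return a new segment list for playback at `rate`.

    rate > 1 plays faster by deleting stretchable segments;
    rate < 1 plays slower by repeating them.
    Segments of any other kind (e.g. consonants) pass through unchanged.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Each stretchable segment should appear 1/rate times on average.
    step = 1 / Fraction(rate).limit_denominator(1000)
    credit = Fraction(0)
    out = []
    for kind, samples in segments:
        if kind not in STRETCHABLE:
            out.append((kind, samples))
            continue
        credit += step
        copies = int(credit)  # whole copies owed so far
        credit -= copies
        out.extend([(kind, samples)] * copies)
    return out


segments = [
    ("consonant", [1, 2]),
    ("pitch_period", [3]),
    ("pitch_period", [4]),
    ("pitch_period", [5]),
    ("pitch_period", [6]),
    ("consonant", [7]),
]
faster = time_scale(segments, 2)    # drops every other pitch period
slower = time_scale(segments, 0.5)  # plays every pitch period twice
```

Because consonants are never altered in this sketch, the overall change in duration is smaller than the requested rate; only the vowel and silence portions are scaled.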
We will strive to allow placement of bookmarks wherever index points are supported. For basic DTB simulation, this will be at the chapter and page level. For augmented simulation with finer indexing resolution, corresponding bookmarks will be allowed. How many bookmarks, for how many books, how they are identified, how they are accessed, and related questions are the kind of user-control considerations that the simulator will help us to investigate.
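One way to tie bookmarks to index points is to snap a requested position to the nearest preceding index point at an allowed level. The Python sketch below illustrates this under stated assumptions: the `IndexPoint` fields, the level names, and the use of seconds as the position unit are invented for the example and are not part of any proposed standard.

```python
from bisect import bisect_right
from typing import NamedTuple


class IndexPoint(NamedTuple):
    offset_seconds: float
    level: str  # e.g. "chapter" or "page"; finer levels are hypothetical
    label: str


def snap_to_index(points, position_seconds, levels=("chapter", "page")):
    """Return the last index point at or before the position, or None.

    Only index points whose level is in `levels` are considered, so the
    default models the basic simulation (chapter and page level).
    """
    allowed = sorted(p for p in points if p.level in levels)
    offsets = [p.offset_seconds for p in allowed]
    i = bisect_right(offsets, position_seconds)
    return allowed[i - 1] if i > 0 else None


points = [
    IndexPoint(0.0, "chapter", "Chapter 1"),
    IndexPoint(95.0, "page", "Page 2"),
    IndexPoint(130.0, "paragraph", "Page 2, paragraph 3"),
    IndexPoint(200.0, "page", "Page 3"),
]
basic = snap_to_index(points, 150.0)
finer = snap_to_index(points, 150.0, levels=("chapter", "page", "paragraph"))
```

Here `basic` lands on "Page 2", while `finer` lands on the paragraph-level point, mirroring the difference between basic and augmented simulation.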
We have undertaken standardization of the next-generation digital talking book through the NISO standardization process. During the process, we will explore the technical feasibility, cost, and user-control preferences associated with various DTB features and performance characteristics through PC-based simulation. We solicit the comments and participation of all interested parties.
This paper, which was delivered at the California State University Northridge Conference "Where Assistive Technology Meets the Information Age" (March 18-22, 1997), is reprinted with permission of CSUN's Founder and Director, Harry J. Murphy, Ed.D.