AsTeR: Audio System for Technical Readings

T. V. Raman

Digital Equipment Corporation

ABSTRACT

The advent of electronic documents makes information available in more than its visual form; electronic information can now be display-independent. In this article, the author describes a computing system, AsTeR, that audio formats electronic documents to produce audio documents. AsTeR can speak both literary texts and highly technical documents (presently in (La)TeX) that contain complex mathematics. Visual communication is characterized by the eye's ability to actively access parts of a two-dimensional display. The reader is active, while the display is passive. This active-passive role is reversed by the temporal nature of oral communication: information flows actively past a passive listener. This prohibits multiple views -- it is impossible to first obtain a high-level view and then "look" at details. These shortcomings become severe when presenting complex mathematics orally.

Audio formatting, which renders information structure in a manner attuned to an auditory display, overcomes these problems. AsTeR is interactive, and the ability to browse information structure and obtain multiple views enables active listening.

This article describes a system for producing audio renderings. Print is not the ideal medium for describing such renderings (and ASCII is an even poorer one!). RFB members can acquire an audio formatted version of the author's thesis (this article is a slightly edited version of its first chapter), rendered by AsTeR, from Recording for the Blind (RFB order number FB190). Non-RFB customers may request a two track (standard commercial format) tape of AsTeR examples. Requests should be addressed to info@RFB.org; ask for Raman's Math Examples Tape.

Finally, readers with access to the WWW can experience an interactive demo of AsTeR at

http://www.cs.cornell.edu/Info/People/raman/aster/aster-toplevel.html

http://www.research.digital.com/CRL/personal/raman/aster/aster-toplevel.html

1. MOTIVATION

Documents encapsulate structured information. Visual formatting renders this structure on a two-dimensional display (paper or a video screen) using accepted conventions. The visual layout helps the reader recreate, internalize and browse the underlying structure. The ability to selectively access portions of the display, combined with the layout, enables multiple views. For example, a reader can first skim a document to obtain a high-level view and then read portions of it in detail.

The rendering is attuned to the visual mode of communication, which is characterized by the spatial nature of the display and the eye's ability to actively access parts of this display. The reader is active, while the rendering itself is passive.

This active-passive role is reversed in oral communication: information flows actively past a passive listener. This is particularly evident in traditional forms of reproducing audio, e.g., cassette tapes. Here, a listener can only browse the audio with respect to the underlying time-line -- by rewinding or forwarding the tape. The passive nature of listening prohibits multiple views -- it is impossible to first obtain a high-level view and then "look" at portions of the information in detail.

Traditionally, documents have been made available in audio by trained readers speaking the contents onto a cassette tape to produce "talking books." Being non-interactive, these do not permit browsing. They do have the advantage that the reader can interpret the information and convey a particular view of the structure to the listener. However, the listener is restricted to the single view present on the tape. In the early 1980's, text-to-speech technology was combined with OCR (Optical Character Recognition) to produce "reading machines." In addition to being non-interactive, renderings produced from scanning visually formatted text convey very little structure. Thus, the true audio document was non-existent when we started our work.

We overcome these problems of oral communication by developing the notion of audio formatting -- and a computing system that implements it. Audio formatting renders information structure orally, using speech augmented by non-speech sound cues. The renderings produced by this process are attuned to an auditory display -- audio layout present in the output conveys information structure. Multiple audio views are enabled by making the renderings interactive. A listener can change how specific information structures are rendered and browse them selectively. Thus, the listener becomes an active participant in oral communication.

In the past, information was available only in a visual form, and it required a human to recreate its inherent structure. Electronic information has opened a new world: information can now be captured in a display-independent manner -- using, e.g., tools like SGML and LaTeX (1). Though the principal mode of display is still visual, we can now produce alternative renderings, such as oral and tactile displays. We take advantage of this to audio-format information structure present in LaTeX documents. The resulting audio documents achieve effective oral communication of structured information from a wide range of sources, including literary texts and highly technical documents containing complex mathematics.

The results of this thesis are equally applicable to producing audio renderings of structured information from such diverse sources as information databases and electronic libraries. Audio formatting clients can be developed to allow seamless access to a variety of electronic information, available on both local and remote servers. Thus, the server provides the information, and various clients, such as visual or audio formatters, provide appropriate views of the information. Our work is therefore significant in the area of developing adaptive computer technologies.

Today's computer interfaces are like the silent movies of the past! As speech becomes a more integral part of human-computer interaction, our work will become more relevant in the general area of user-interface design, by adding audio as a new dimension to computer interfaces.

2. WHAT IS AsTeR?

AsTeR (2) is a computing system for producing audio renderings of electronic documents. The present implementation works with documents written in the TeX family of markup (3) languages, i.e., TeX, LaTeX and AMSTeX. But the design of AsTeR is not restricted to any single markup language. Though motivated by the need to render technical documents, our system works equally well on structured documents from non-technical subjects.

AsTeR is founded on the belief that all information is display-independent. Information has structure, and this structure is rendered on paper or on a visual display, but the information itself is not restricted to these output modes. Thus, AsTeR renders this same information in audio. AsTeR recognizes the logical structure of a document as embodied in the markup source and represents this structure internally. The internal representation is then rendered in audio by applying a collection of rendering rules written in AFL, a language for audio formatting. Think of AFL as a high-level audio analogue to a visual rendering language like Postscript. Rendering an internalized high-level representation enables AsTeR to produce different audio views of the information. A user can either listen to entire documents, or browse the internal structure and selectively read portions of a document. The rendering and browsing components of AsTeR can work equally well with high-level representations we may get from sources such as OCR-based document recognition.

This article gives a high-level view of how the various components of AsTeR are used. AsTeR is implemented in CLOS (4) with an Emacs front-end. The recommended way of using the system is to run Lisp as a subprocess of Emacs. Throughout this article, we will assume familiarity with basic Emacs concepts. Section 3 introduces the system by showing how simple documents can be read and browsed. Section 4 explains how AsTeR can be extended to read newly defined document structures in (La)TeX (5). Section 5 gives some examples of changing between different ways of rendering the same information. Section 6 presents some advanced techniques that can be used to advantage when reading complex documents such as textbooks. AsTeR can render information produced by various sources. We give an example of this by demonstrating how AsTeR can be used to interact with the Emacs calculator, a full-fledged symbolic algebra system.

3. READING DOCUMENTS

This section assumes that AsTeR has been installed and initialized. At this point, text within any file being visited in Emacs (in general, text in any Emacs buffer) can be rendered in audio. To listen to a piece of text, mark it using standard Emacs commands and invoke read-aloud-region (6). This results in the marked text being audio formatted using a standard rendering style. The text can constitute an entire document or book; it could also be a short paragraph or a single equation from a document. AsTeR renders both partial and complete documents.

This is the simplest and also the most common type of interaction with AsTeR. All markup commands appearing in the text are recognized to produce audio renderings that reflect the structure represented by the markup. The input may be plain ASCII text; in this case, AsTeR will still recognize the minimal document structure present, i.e., paragraph breaks, quoted text, etc. (La)TeX markup helps the system recognize more of the document's logical structure, and as a consequence produce more sophisticated renderings.

3.1 BROWSING THE DOCUMENT

Next to getting the system to speak, the most important thing is to get it to stop speaking. Once an audio rendering has been launched, rendering can be interrupted at any time by executing reader-quit-reading (7). The listener can then traverse the internal structure by moving the current selection, which represents the current position in the document, by executing any of the browser commands reader-move-previous, reader-move-next, reader-move-up or reader-move-down.

To orient the user within the document structure, the current selection is summarized by verbalizing a short message of the form "<context> is <type>", e.g., moving down one level from the top of the equation

ABC = 0    (1)

produces the message "left hand side is a product". The user has the option of either listening to just the current selection, or reading the rest of the document. In the interest of brevity, we will not give all of the browser key-bindings.
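The four browser commands amount to moving a cursor over a tree. The following is a minimal sketch of that idea in Common Lisp, not AsTeR's own code: the function names (node-at, move-down, move-up, move-sideways) and the representation of the current selection as a list of child indexes are assumptions made for illustration, and the spoken summary is left out.

```lisp
;;; Hypothetical model of the browser's current selection.
;;; A node is either an atom (a leaf) or a list (type child1 child2 ...).
;;; The selection is a path: a list of child indexes starting at the root.

(defun node-children (node)
  "Return the children of NODE, or the empty list for a leaf."
  (if (consp node) (rest node) '()))

(defun node-at (tree path)
  "Return the node reached from TREE by following PATH."
  (if (null path)
      tree
      (node-at (nth (first path) (node-children tree)) (rest path))))

(defun move-down (tree path)
  "Select the first child of the current node; stay put on a leaf."
  (if (node-children (node-at tree path))
      (append path (list 0))
      path))

(defun move-up (path)
  "Select the parent of the current node; stay put at the root."
  (butlast path))

(defun move-sideways (tree path step)
  "Select the sibling STEP positions away (1 for next, -1 for previous).
Stay put when no such sibling exists."
  (if (null path)
      path
      (let* ((parent (butlast path))
             (index (+ (first (last path)) step))
             (siblings (node-children (node-at tree parent))))
        (if (< -1 index (length siblings))
            (append parent (list index))
            path))))
```

With the equation above modelled as (equation (product a b c) 0), moving down from the root selects the product on the left hand side, and moving to the next sibling selects the right hand side.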

3.2 EXAMPLES OF USE

AsTeR can be used:

- To read technical articles and books: The files for such documents may be available on the local system or on the global Internet (8). Resources retrieved over the network can be audio formatted by AsTeR since they are just text in Emacs buffers. Currently, the system audio formats 10 textbooks available to the author on his local system. In addition, AsTeR also renders a wide collection of technical documents available on the Internet including technical reports and AMS bulletins.

- For entertainment: At present about 200 electronic texts are available on the Internet, in addition to the complete works of Shakespeare. The majority of these documents are in plain ASCII, but the quality of audio renderings produced by AsTeR based on the minimal document structure that can be recognized still surpasses conventional reading machines. Increased availability of electronic texts marked up in (La)TeX, SGML and HTML will enable better recognition of document structure, and as a consequence, better audio renderings.

- In proof-reading: This feature is especially useful when typesetting complex mathematical formulae. AsTeR can render both partial and complete documents. Thus, although designed as a system for reading documents, the flexible design, combined with the power afforded by the Emacs editor, turns AsTeR into a very useful document preparation aid.

4. EXTENDING AsTeR

As explained in the previous section, the quality of audio renderings produced by AsTeR is dependent on how much of the document's logical structure is recognized. Authors of (La)TeX documents often use their own macros (9) to encapsulate specific structures. AsTeR of course does not know of these extensions to start with. Occurrences of user-defined (La)TeX macros are initially rendered in a canonical way; typically, the user-defined macros are read aloud as they appear in the running text.

Thus, given a document containing

$A \kronecker B$

AsTeR would produce

cap a kronecker cap b

In this case, this canonical rendering is quite acceptable. In general, how AsTeR renders such user-defined structures is fully customizable. The first step is to extend the recognizer to handle the new construct, in this case \kronecker. Here, we give the reader a brief example of how this mechanism is used in practice.

The recognizer is extended by calling Lisp macro define-text-object. In the case of the \kronecker macro, this call takes the form:

;; Represent \kronecker as a binary operator object named kronecker-product.
(define-text-object :macro-name "kronecker"
  :number-args 0
  :processing-function kronecker-expand
  :object-name kronecker-product
  :supers (binary-operator)
  :precedence multiplication)

This extends the recognizer to represent instances of macro "kronecker" as instances of object kronecker-product. The user can now define any number of ways in which an instance of object kronecker-product should be rendered.

AFL, our language for audio formatting, is used to define rendering rules. Here, we give a rendering rule for object kronecker-product.

(def-reading-rule (kronecker-product simple)
  "Simple rendering rule for object kronecker-product."
  (read-aloud (first (children kronecker-product)))
  (read-aloud "kronecker product")
  (read-aloud (second (children kronecker-product))))

which produces

cap a kronecker product cap b

for the input text shown earlier.

Notice, however, that the rendering rule is free to render the use of the kronecker product in more complex ways; in particular, the order in which the expression is spoken can be completely independent of how it appears on paper. Thus, it is straightforward to write a rendering rule that produces

"The kronecker product of A and B "

AsTeR derives its power from representing document content internally as objects and by allowing several user-defined rendering rules for individual object types. Such rendering rules can cause any number of audio events, ranging from speaking a simple phrase to playing a digitized sound, when an instance of a particular object type is rendered. The mechanism for extending the recognizer affords this same power when rendering user-defined constructs. Once the recognizer has been extended by an appropriate call to define-text-object, such constructs can be handled just as well as any standard (La)TeX construct.
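The idea of several rendering rules attached to one object type can be sketched in plain Common Lisp, without AsTeR itself. This is an illustrative model only: the class definition, the generic function render, the helper say (which records phrases instead of speaking them) and the rule name descriptive as used here are assumptions, not AFL or AsTeR internals.

```lisp
;;; Hypothetical stand-in for AsTeR's object model and speech output.

(defclass kronecker-product ()
  ((children :initarg :children :reader children)))

(defvar *spoken* '() "Phrases spoken so far, most recent first.")

(defun say (phrase)
  "Stand-in for speech output: record PHRASE instead of speaking it."
  (push phrase *spoken*))

(defgeneric render (object rule)
  (:documentation "Render OBJECT using the rendering rule named RULE."))

;; Infix order, mirroring the printed form.
(defmethod render ((object kronecker-product) (rule (eql 'simple)))
  (say (first (children object)))
  (say "kronecker product")
  (say (second (children object))))

;; Prefix order, independent of how the expression appears on paper.
(defmethod render ((object kronecker-product) (rule (eql 'descriptive)))
  (say "the kronecker product of")
  (say (first (children object)))
  (say "and")
  (say (second (children object))))

(defun rendering (object rule)
  "Return the list of phrases produced by rendering OBJECT with RULE."
  (let ((*spoken* '()))
    (render object rule)
    (reverse *spoken*)))
```

Calling (rendering k 'simple) and (rendering k 'descriptive) on the same instance k yields the two spoken orders discussed above; dispatching on both the object and the rule name is one way to keep the rules independent of each other.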

5. PRODUCING DIFFERENT RENDERINGS OF THE SAME OBJECT

AsTeR can produce more than one kind of rendering for a given object. When perusing printed information, a reader has the luxury of viewing a complex piece of mathematics from different perspectives, and AsTeR provides this same functionality. The listener can switch between any of several pre-defined renderings for a given object, or add to these by defining new rendering rules. Switching between different rendering rules produces different audio views of a given object.

Activating a rendering rule is the simplest way of changing how a given object is rendered. The statement

(activate-rule <object-name> <rule-name>)

activates rule <rule-name> for object <object-name>. Thus, executing (activate-rule 'paragraph 'summarize) results in paragraphs being summarized.

Suppose we wish to skip all instances of verbatim text in a LaTeX document. We could define the following quiet rendering rule:

(def-reading-rule (verbatim quiet) nil)

and activate it by executing

(activate-rule 'verbatim 'quiet)

To later hear the verbatim text in a document, rule quiet is deactivated by executing

(deactivate-rule 'verbatim)

Notice that at any given time, only one rendering rule is active for any object. Hence, we only need specify the object when deactivating a rendering rule. AsTeR provides an Emacs interface to activating and deactivating rendering rules.
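The "one active rule per object" behaviour can be modelled with a single table keyed by object name. The sketch below is a self-contained approximation rather than AsTeR's implementation: the table *active-rules*, the accessor active-rule, and the fallback to a rule named default after deactivation are assumptions.

```lisp
;;; Hypothetical model of rule activation. The function names mirror the
;;; commands used in this article, but the bodies are illustrative only.

(defvar *active-rules* (make-hash-table :test #'eq)
  "Maps an object name to the name of its explicitly activated rule.")

(defun activate-rule (object rule)
  "Make RULE the single active rendering rule for OBJECT."
  (setf (gethash object *active-rules*) rule))

(defun deactivate-rule (object)
  "Drop the explicitly activated rule for OBJECT, if any."
  (remhash object *active-rules*))

(defun active-rule (object)
  "Return the active rule for OBJECT, or DEFAULT when none is activated.
Falling back to DEFAULT is an assumption of this sketch."
  (gethash object *active-rules* 'default))
```

Because each object name has at most one entry, activating a second rule simply replaces the first, which is why deactivation needs only the object name.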

Activating a single rendering rule is a convenient way of changing how a specific object is rendered. Rendering styles allow making more global changes to the renderings. Activating style style-1 by executing

(activate-style 'style-1)

makes the rendering rule named style-1 active for all objects for which this rendering rule is defined. All other objects continue to be rendered as before. This is also true when a sequence of rendering styles is successively activated.

Thus, activating rendering styles is a convenient way of progressively customizing the rendering of a complex document. The effect of activating a style can be undone at any time by executing

(deactivate-style <style-name>)

AsTeR provides the following rendering styles:

- Variable-substitution: Use variable substitution when rendering complex mathematical expressions.

- Use-special-pattern: Recognize special patterns in mathematical expressions to produce context-specific renderings.

- Descriptive: Produce descriptive, context-specific renderings for mathematical expressions.

- Simple: Produce a base-level audio notation for mathematical expressions.

- Default: Produce default renderings.

- Summarize: Provide a short summary.

- Quiet: Skip objects.

When AsTeR is initialized, the following styles are active:

(use-special-pattern descriptive simple default)

with the leftmost style the most recently activated style. Defining a new rendering style amounts to defining a collection of rendering rules having the same name. Note that a rendering style need not provide rendering rules for all objects in the document logical structure. As explained earlier, activating a rendering style only affects the renderings of those objects for which the style provides a rule.
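One plausible reading of this behaviour is that the rule used for an object comes from the most recently activated style that defines a rule for it. The sketch below models that reading in self-contained Common Lisp; the helper names define-rule and effective-style, the rule table, and the resolution order itself are assumptions made for illustration and may differ from what AsTeR actually does.

```lisp
;;; Hypothetical model of rendering styles as an ordered list.

(defvar *active-styles* '(use-special-pattern descriptive simple default)
  "Active styles, most recently activated first.")

(defvar *rules* (make-hash-table :test #'equal)
  "Set of defined rules, keyed by (object style) lists.")

(defun define-rule (object style)
  "Record that STYLE provides a rendering rule for OBJECT."
  (setf (gethash (list object style) *rules*) t))

(defun activate-style (style)
  "Put STYLE at the front of the active styles."
  (setf *active-styles* (cons style (remove style *active-styles*))))

(defun deactivate-style (style)
  "Remove STYLE from the active styles."
  (setf *active-styles* (remove style *active-styles*)))

(defun effective-style (object)
  "Return the most recently activated style that has a rule for OBJECT,
or NIL when no active style provides one."
  (find-if (lambda (style) (gethash (list object style) *rules*))
           *active-styles*))
```

Under this model, activating a style that defines a rule only for paragraphs changes how paragraphs are rendered and leaves every other object with the style it resolved to before.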

6. USING THE FULL POWER OF AsTeR

This section demonstrates some advanced features of AsTeR that are useful when rendering complex documents. AsTeR recognizes cross-references and allows the listener to traverse these as hypertext links. Cross-referenceable objects can be labelled interactively and these labels used when referring to such objects within renderings. The ability to switch between rendering rules allows the listener to quickly locate portions of interest in a document. By activating rendering rules, all instances of a particular object can be floated to the end of the containing hierarchical unit, or entirely skipped. This is convenient when getting a quick overview of a document. AsTeR also provides a simple bookmark facility for marking positions of interest to be returned to later. Finally, AsTeR can be interfaced with sources of structured information other than electronic documents. We demonstrate this by interfacing AsTeR to the Emacs calculator.

6.1 Cross-References

Cross-reference tags occurring in the body of a document are represented internally as instances of object cross-reference and contain a link to the object being referenced. How such cross-reference tags are rendered of course depends on the currently active rule for object cross-reference. The default rendering rule for cross-references presents the user with a summary of the object being cross-referenced, e.g., the number and title of a sectional unit. This is followed by a non-speech audio prompt. Pressing a key at this prompt results in the entire cross-referenced object being rendered at this point. Reading continues if no key is pressed within a certain time interval. In addition, the listener can interrupt the rendering and move through the cross-reference tags. This is useful in cases where many such tags occur within the same sentence.

6.2 Labelling a cross-referenceable object

Consider a proof that reads:

By theorem 2.1 and lemma 3.5 we get equation 8 and hence the result.

If the above looks abstruse in print, it sounds meaningless in audio. This is in fact a serious drawback when listening to mathematical books on cassette where it is practically impossible to locate the cross-reference. AsTeR is more effective since these cross-reference links can be traversed; but traversing each link while listening to the above proof can be distracting. Typically, we only glance back at the cross-references to get sufficient information about what theorem 2.1 is about. AsTeR provides a convenient mechanism for building such information into the renderings. When a cross-referenceable object such as an equation is rendered, the system verbalizes an automatically generated label, i.e., the equation number, and then generates an audible prompt. If the user presses a key at this prompt, he can specify a more meaningful label which will be used in preference to the system-generated label when rendering cross-reference tags.

To continue the current example, when listening to theorem 2.1, the user could have specified the label "Fermat's theorem". Then the proof shown earlier would be read as:

By Fermat's theorem and lemma 3.5 we get equation 8 and hence the result.

Of course, the user could have specified labels for the other cross-referenced objects as well, in which case the rendering produced almost obviates the need to look back at the cross-referenced objects.
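The preference for a listener-supplied label over the system-generated one can be sketched as a table lookup with a fallback. This is an illustrative model, not AsTeR's code: the names set-label, spoken-label and render-proof-step, and the use of label strings as keys, are assumptions.

```lisp
;;; Hypothetical model of user-supplied cross-reference labels.

(defvar *user-labels* (make-hash-table :test #'equal)
  "Maps a system-generated label to a label supplied by the listener.")

(defun set-label (system-label user-label)
  "Remember USER-LABEL as the preferred name for SYSTEM-LABEL."
  (setf (gethash system-label *user-labels*) user-label))

(defun spoken-label (system-label)
  "Return the listener's label for SYSTEM-LABEL, else SYSTEM-LABEL itself."
  (gethash system-label *user-labels* system-label))

(defun render-proof-step (theorem lemma equation)
  "Return the text of the example proof with preferred labels substituted."
  (format nil "By ~a and ~a we get ~a and hence the result."
          (spoken-label theorem)
          (spoken-label lemma)
          (spoken-label equation)))
```

After (set-label "theorem 2.1" "Fermat's theorem"), the proof step is produced with the listener's label for the theorem and the system-generated labels for the lemma and the equation.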

6.3 Locating portions of interest

Printed books allow the reader to skim through the text and quickly locate portions of interest. Experienced readers use several different techniques to achieve this. One of these is to locate an equation or table of interest, and then read the text surrounding this object. AsTeR provides this functionality to some extent.

We explained in Section 5 that different rules can be activated to change the type of renderings produced. Using this mechanism, we can activate a rendering rule that only reads the equations occurring in a document. Once an equation of interest is located, rendering can be interrupted and the rendering rule changed. Using the browser, the listener can now move the current selection to the enclosing hierarchical unit and then read the surrounding text.

6.4 Getting an overview of a document

Rendering rules can be activated to obtain different views of a document. For instance, activating rendering rule quiet for object paragraph provides a thumbnail view of a document. Activating rendering rule quiet is a convenient way of temporarily skipping over all occurrences of a specific object. We often do this when perusing printed documents; we skip over complex material at the first reading and return to it later. We may skip instances of some objects entirely, e.g., source code; in other cases we may merely defer the reading. This notion of delaying the reading of an object is aptly captured by the concept of floating an object to the end of the enclosing unit. Typesetting systems like (La)TeX permit the author to float all figures and tables to the end of the containing section or chapter. However, only specific objects can be floated, and this is exclusively under the control of the author, not the consumer of the document.

AsTeR provides a much more general framework for floating objects. Any object can be floated to the end of any enclosing hierarchical unit, e.g., instances of object footnote can be floated to the end of the containing paragraph. The ability to float objects is very useful when producing audio renderings. This is because audio takes time, and it is advantageous to delay the rendering of some objects when obtaining an overview. Printed documents use footnotes and floating figures for precisely this reason. The interactive nature of AsTeR allows us to extend this functionality.
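Floating can be viewed as a stable reordering of the children of the enclosing unit. The following sketch shows that view in self-contained Common Lisp; the function name float-to-end and the representation of each child as a list beginning with its object type are assumptions for illustration, not AsTeR's data structures.

```lisp
;;; Hypothetical model of floating objects to the end of a unit.

(defun float-to-end (children floated-type)
  "Return CHILDREN reordered so that items of FLOATED-TYPE come last.
Each child is a list whose first element names its object type.
Relative order within both groups is preserved."
  (flet ((floated-p (child)
           (eq (first child) floated-type)))
    (append (remove-if #'floated-p children)
            (remove-if-not #'floated-p children))))
```

Applied to the children of a paragraph with footnote as the floated type, the running text is read first and the footnotes follow in their original order.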

6.5 Bookmarks

The browser provides a simple bookmark facility for marking positions of interest to be returned to later. The browser command mark-read-pointer, bound to C-b m, prompts for a bookmark name and marks the current selection. The listener can later read the object at this marked position, or move the current selection to the marked position by executing browser command follow-bookmark and specifying the appropriate bookmark name.

6.6 Reading using variable substitution

When reading complex mathematics in print, we often get a high-level view of an equation first, and read the leaves of an expression once we have understood the top-level structure. Thus, when presented with a complex equation, an experienced reader of mathematics might view it as an equation with a double summation on the left-hand side and a double integral on the right-hand side, and only then attempt to read the equation in full detail. In an audio rendering that simply produces a linear rendering, the temporal nature of audio prevents a listener from getting such high-level views. We compensate by providing a variable substitution rendering style. When active, this results in AsTeR replacing sub-expressions in complex mathematics with meaningful phrases. Having thus provided a top-level view, AsTeR then reads the sub-expressions that were substituted for earlier upon request.
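The core of variable substitution can be sketched on expressions written as prefix-notation lists. This is a deliberately simplified model and not AsTeR's algorithm: the weight measure (a count of atoms), the threshold parameter, the placeholder names "x1", "x2", and the function names are all assumptions, and AsTeR itself substitutes meaningful phrases rather than bare names.

```lisp
;;; Hypothetical model of variable substitution on prefix expressions.

(defun expression-weight (expression)
  "Count the atoms in EXPRESSION, a prefix-notation tree."
  (if (atom expression)
      1
      (reduce #'+ (mapcar #'expression-weight expression))))

(defun substitute-variables (expression threshold)
  "Replace each argument of EXPRESSION heavier than THRESHOLD by a name.
Return two values: the shortened expression, and an alist of
(name . original-subexpression) pairs in order of appearance."
  (if (atom expression)
      (values expression '())
      (let ((bindings '())
            (counter 0))
        (flet ((shorten (argument)
                 (if (> (expression-weight argument) threshold)
                     (let ((name (format nil "x~d" (incf counter))))
                       (push (cons name argument) bindings)
                       name)
                     argument)))
          (let ((shortened (cons (first expression)
                                 (mapcar #'shorten (rest expression)))))
            (values shortened (reverse bindings)))))))
```

For an equation with a double summation on one side and a double integral on the other, the first value is the top-level view, an equation between two names, and the second value holds the sub-expressions to be read on request.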

6.7 Interfacing AsTeR with other information sources

AsTeR has been presented as a system for reading documents. More generally, AsTeR is a system for presenting structured information in audio. This fact is amply demonstrated by the following example where we interface AsTeR to the Emacs calculator, a full-fledged symbolic algebra system.

The Emacs calculator is a freely available symbolic algebra system distributed under the terms of the GNU General Public License. It provided an excellent source of examples for trying out the variable substitution rendering style for mathematical expressions. Providing an audio interface to a symbolic algebra system is challenging since the expressions produced are quite complex. The flexible design of AsTeR and the power of Emacs make this interface easy. AsTeR can render any information present in an Emacs buffer. The output of the Emacs calculator satisfies this requirement. A collection of Emacs Lisp functions arranges for the output from the calculator to be sent to AsTeR.

A user of the Emacs calculator can now perform a computation and execute command read-previous-calc-answer to have the output rendered by AsTeR. The expression can be browsed, summarized, transformed by applying variable substitution, and the rendering manipulated in any of the ways described so far in the context of documents.

NOTES

(1) Standard Generalized Markup Language (SGML) captures information in a layout independent form; LaTeX, designed by Leslie Lamport, is a document preparation system based on the TeX typesetting system developed by Donald Knuth.

(2) In real life, AsTeR is the name of the author's guide-dog, a big friendly black Labrador.

(3) To most people, "markup" means an increase in the price of an article. Here, "markup" is a term from the publishing and printing business, where it means the instructions for the typesetter, written on a typescript or manuscript copy by an editor. Typesetting systems like LaTeX have these commands embedded in the electronic source. A markup language is a set of means (constructs) to express how text (i.e., that which is not markup) should be processed, or handled in other ways.

(4) CLOS (Common Lisp Object System) is an object oriented extension of Common Lisp.

(5) In this article, the notation (La)TeX represents the entire "family" of markup languages including TeX, LaTeX, and AMSTeX.

(6) This is an Emacs Lisp command, and in the author's setup, it is bound to C-z d.

(7) reader-quit-reading is bound to C-b q.

(8) ANGE-FTP, an Emacs utility written by Andy Norman, allows seamless access to such files. In addition, Emacs clients are available for networked information retrieval systems like GOPHER, WWW and WAIS.

(9) Macros permit an author to define new language constructs in TeX and specify how these constructs should be rendered on paper.

Raman, T. V. (1994). AsTeR: Audio system for technical readings. Information Technology and Disabilities E-Journal, 1(4).