W3C

Voice Extensible Markup Language (VoiceXML) Version 2.0

W3C Recommendation 16 March 2004

This Version:
http://www.w3.org/TR/2004/REC-voicexml20-20040316/
Latest Version:
http://www.w3.org/TR/voicexml20/
Previous Version:
http://www.w3.org/TR/2004/PR-voicexml20-20040203/
Editors:
Scott McGlashan, Hewlett-Packard (Editor-in-Chief)
Daniel C. Burnett, Nuance Communications
Jerry Carter, Invited Expert
Peter Danielsen, Lucent (until October 2002)
Jim Ferrans, Motorola
Andrew Hunt, ScanSoft
Bruce Lucas, IBM
Brad Porter, Tellme Networks
Ken Rehor, Vocalocity
Steph Tryphonas, Tellme Networks

Please refer to the errata for this document, which may include some normative corrections.

See also translations.


Abstract

This document specifies VoiceXML, the Voice Extensible Markup Language. VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document has been reviewed by W3C Members and other interested parties, and it has been endorsed by the Director as a W3C Recommendation. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

This specification is part of the W3C Speech Interface Framework and has been developed within the W3C Voice Browser Activity by participants in the Voice Browser Working Group (W3C Members only).

The design of VoiceXML 2.0 has been widely reviewed (see the disposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the VoiceXML 2.0 implementation report, along with the associated test suite.

Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.

The W3C maintains a list of any patent disclosures related to this work.

Conventions of this Document

In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate requirement levels for compliant VoiceXML implementations.

Table of Contents

Abbreviated Contents

Full Contents


1. Overview

This document defines VoiceXML, the Voice Extensible Markup Language. Its background, basic concepts and use are presented in Section 1. The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties as well as resource handling are specified in Section 6. The appendices provide additional information including the VoiceXML Schema, a detailed specification of the Form Interpretation Algorithm and timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy.

The origins of VoiceXML began in 1995 as an XML-based dialog design language intended to simplify the speech recognition application development process within an AT&T project called Phone Markup Language (PML). As AT&T reorganized, teams at AT&T, Lucent and Motorola continued working on their own PML-like languages.

In 1998, W3C hosted a conference on voice browsers. By this time, AT&T and Lucent had different variants of their original PML, while Motorola had developed VoxML, and IBM was developing its own SpeechML. Many other attendees at the conference were also developing similar languages for dialog design; for example, HP's TalkML and PipeBeach's VoiceHTML.

The VoiceXML Forum was then formed by AT&T, IBM, Lucent, and Motorola to pool their efforts. The mission of the VoiceXML Forum was to define a standard dialog design language that developers could use to build conversational applications. They chose XML as the basis for this effort because it was clear to them that this was the direction technology was going.

In 2000, the VoiceXML Forum released VoiceXML 1.0 to the public. Shortly thereafter, VoiceXML 1.0 was submitted to the W3C as the basis for the creation of a new international standard. VoiceXML 2.0 is the result of this work based on input from W3C Member companies, other W3C Working Groups, and the public.

Developers familiar with VoiceXML 1.0 are particularly directed to Changes from Previous Public Version which summarizes how VoiceXML 2.0 differs from VoiceXML 1.0.

1.1 Introduction

VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

Here are two short examples of VoiceXML. The first is the venerable "Hello World":

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next. This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.

Our second example asks the user for a choice of drink and then submits it to a server script:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>

A <field> is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:

C (computer): Would you like coffee, tea, milk, or nothing?

H (human): Orange juice.

C: I did not understand what you said. (a platform-specific default message.)

C: Would you like coffee, tea, milk, or nothing?

H: Tea

C: (continues in document drink2.asp)

1.2 Background

This section contains a high-level architectural model, whose terminology is then used to describe the goals of VoiceXML, its scope, its design principles, and the requirements it places on the systems that support it.

1.2.1 Architectural Model

The architectural model assumed by this document has the following components:

VoiceXML interpreter fits between document server and implementation platform
Figure 1: Architectural Model

A document server (e.g. a Web server) processes requests from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.

1.2.2 Goals of VoiceXML

VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user's session with other dialogs.

VoiceXML is a markup language that:

  • Minimizes client/server interactions by specifying multiple interactions per document.
  • Shields application authors from low-level, platform-specific details.
  • Separates user interaction code (in VoiceXML) from service logic (e.g. CGI scripts).
  • Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers.
  • Is easy to use for simple interactions, and yet provides language features to support complex dialogs.

While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control.

1.2.3 Scope of VoiceXML

The language describes the human-machine interaction provided by voice response systems, which includes:

  • Output of synthesized speech (text-to-speech).
  • Output of audio files.
  • Recognition of spoken input.
  • Recognition of DTMF input.
  • Recording of spoken input.
  • Control of dialog flow.
  • Telephony features such as call transfer and disconnect.

The language provides means for collecting character and/or spoken input, assigning the input results to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Universal Resource Identifiers (URIs).

1.2.4 Principles of Design

VoiceXML is an XML application [XML].

  1. The language promotes portability of services through abstraction of platform resources.

  2. The language accommodates platform diversity in supported audio file formats, speech grammar formats, and URI schemes. While producers of platforms may support various grammar formats, the language requires a common grammar format, namely the XML Form of the W3C Speech Recognition Grammar Specification [SRGS], to facilitate interoperability. Similarly, while various audio formats for playback and recording may be supported, the audio formats described in Appendix E must be supported.

  3. The language supports ease of authoring for common types of interactions.

  4. The language has well-defined semantics that preserves the author's intent regarding the behavior of interactions with the user. Client heuristics are not required to determine document element interpretation.

  5. The language recognizes semantic interpretations from grammars and makes this information available to the application.

  6. The language has a control flow mechanism.

  7. The language enables a separation of service logic from interaction behavior.

  8. It is not intended for intensive computation, database operations, or legacy system operations. These are assumed to be handled by resources outside the document interpreter, e.g. a document server.

  9. General service logic, state management, dialog generation, and dialog sequencing are assumed to reside outside the document interpreter.

  10. The language provides ways to link documents using URIs, and also to submit data to server scripts using URIs.

  11. VoiceXML provides ways to identify exactly which data to submit to the server, and which HTTP method (GET or POST) to use in the submittal.

  12. The language does not require document authors to explicitly allocate and deallocate dialog resources, or deal with concurrency. Resource allocation and concurrent threads of control are to be handled by the implementation platform.

1.2.5 Implementation Platform Requirements

This section outlines the requirements on the hardware/software platforms that will support a VoiceXML interpreter.

Document acquisition. The interpreter context is expected to acquire documents for the VoiceXML interpreter to act on. The "http" URI scheme must be supported. In some cases, the document request is generated by the interpretation of a VoiceXML document, while other requests are generated by the interpreter context in response to events outside the scope of the language, for example an incoming phone call. When issuing document requests via http, the interpreter context identifies itself using the "User-Agent" header variable with the value "<name>/<version>", for example, "acme-browser/1.2".

Audio output. An implementation platform must support audio output using audio files and text-to-speech (TTS). The platform must be able to freely sequence TTS and audio output. If an audio output resource is not available, an error.noresource event must be thrown. Audio files are referred to by a URI. The language specifies a required set of audio file formats which must be supported (see Appendix E); additional audio file formats may also be supported.

Audio input. An implementation platform is required to detect and report character and/or spoken input simultaneously and to control input detection interval duration with a timer whose length is specified by a VoiceXML document. If an audio input resource is not available, an error.noresource event must be thrown.

Transfer. The platform should be able to support making a third party connection through a communications network, such as the telephone network.

1.3 Concepts

A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
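To make the transition mechanism concrete, here is a minimal sketch (the document and dialog names are hypothetical, not taken from this specification):

<form id="goodbye">
  <block>
    Thanks for calling.
    <!-- "feedback.vxml#survey" names both the next document and the dialog
         within it; a URI with no fragment selects the document's first dialog. -->
    <goto next="feedback.vxml#survey"/>
  </block>
</form>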

1.3.1 Dialogs and Subdialogs

There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.

A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Variable instances, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
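A minimal sketch of a subdialog call and return (file, form, and field names are hypothetical):

<!-- Calling document -->
<form id="order">
  <subdialog name="confirm" src="confirm.vxml">
    <filled>
      <!-- Variables returned by the subdialog appear as properties
           of its form item variable. -->
      <if cond="confirm.answer">
        <prompt>Your order has been placed.</prompt>
      </if>
    </filled>
  </subdialog>
</form>

<!-- confirm.vxml: the invoked subdialog -->
<form>
  <field name="answer" type="boolean">
    <prompt>Is the order correct?</prompt>
    <filled>
      <return namelist="answer"/>
    </filled>
  </field>
</form>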

1.3.2 Sessions

A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.

1.3.3 Applications

An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars remain active for the duration of the application, subject to the grammar activation rules discussed in Section 3.1.4.
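The following sketch shows the relationship (file and variable names are hypothetical):

<!-- app-root.vxml: the application root document -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <var name="favorite_drink"/>
</vxml>

<!-- leaf.vxml: a leaf document naming its application root -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      application="app-root.vxml">
  <form>
    <block>
      <!-- Root document variables are visible as application.* -->
      <assign name="application.favorite_drink" expr="'tea'"/>
      <!-- The root document stays loaded across this transition. -->
      <goto next="leaf2.vxml"/>
    </block>
  </form>
</vxml>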

Figure 2 shows the transition of documents (D) in an application that share a common application root document (root).

root over sequence of 3 documents
Figure 2: Transitioning between documents in an application.

1.3.4 Grammars

Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.

1.3.5 Events

VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.

Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Furthermore, catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.

1.3.6 Links

A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A link can be used to throw an event or go to a destination URI.
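Both uses can be sketched as follows (the destination URI and grammar wording are hypothetical):

<!-- Transfer to another document whenever "operator" is spoken in scope. -->
<link next="operator_transfer.vxml">
  <grammar version="1.0" root="root" type="application/srgs+xml">
    <rule id="root" scope="public">operator</rule>
  </grammar>
</link>

<!-- Throw an event instead of transitioning. -->
<link event="help">
  <grammar version="1.0" root="root" type="application/srgs+xml">
    <rule id="root" scope="public">help</rule>
  </grammar>
</link>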

1.4 VoiceXML Elements

Table 1: VoiceXML Elements
Element Purpose Section
<assign> Assign a variable a value 5.3.2
<audio> Play an audio clip within a prompt 4.1.3
<block> A container of (non-interactive) executable code 2.3.2
<catch> Catch an event 5.2.2
<choice> Define a menu item 2.2.2
<clear> Clear one or more form item variables 5.3.3
<disconnect> Disconnect a session 5.3.11
<else> Used in <if> elements 5.3.4
<elseif> Used in <if> elements 5.3.4
<enumerate> Shorthand for enumerating the choices in a menu 2.2.4
<error> Catch an error event 5.2.3
<exit> Exit a session 5.3.9
<field> Declares an input field in a form 2.3.1
<filled> An action executed when fields are filled 2.4
<form> A dialog for presenting information and collecting data 2.1
<goto> Go to another dialog in the same or different document 5.3.7
<grammar> Specify a speech recognition or DTMF grammar 3.1
<help> Catch a help event 5.2.3
<if> Simple conditional logic 5.3.4
<initial> Declares initial logic upon entry into a (mixed initiative) form 2.3.3
<link> Specify a transition common to all dialogs in the link's scope 2.5
<log> Generate a debug message 5.3.13
<menu> A dialog for choosing amongst alternative destinations 2.2.1
<meta> Define a metadata item as a name/value pair 6.2.1
<metadata> Define metadata information using a metadata schema 6.2.2
<noinput> Catch a noinput event 5.2.3
<nomatch> Catch a nomatch event 5.2.3
<object> Interact with a custom extension 2.3.5
<option> Specify an option in a <field> 2.3.1.3
<param> Parameter in <object> or <subdialog> 6.4
<prompt> Queue speech synthesis and audio output to the user 4.1
<property> Control implementation platform settings. 6.3
<record> Record an audio sample 2.3.6
<reprompt> Play a field prompt when a field is re-visited after an event 5.3.6
<return> Return from a subdialog. 5.3.10
<script> Specify a block of ECMAScript client-side scripting logic 5.3.12
<subdialog> Invoke another dialog as a subdialog of the current one 2.3.4
<submit> Submit values to a document server 5.3.8
<throw> Throw an event. 5.2.1
<transfer> Transfer the caller to another destination 2.3.7
<value> Insert the value of an expression in a prompt 4.1.4
<var> Declare a variable 5.3.1
<vxml> Top-level element in each VoiceXML document 1.5.1

Attributes of <audio>

Table 36: <audio> Attributes
src The URI of the audio prompt. See Appendix E for required audio file formats; additional formats may be used if supported by the platform.

Table 37: <audio> Attributes (continued)
fetchtimeout See Section 6.1. This defaults to the fetchtimeout property.
fetchhint See Section 6.1. This defaults to the audiofetchhint property.
maxage See Section 6.1. This defaults to the audiomaxage property.
maxstale See Section 6.1. This defaults to the audiomaxstale property.
expr An ECMAScript expression which determines the source of the audio to be played. The expression may be either a reference to audio previously recorded with the <record> item or evaluate to the URI of an audio resource to fetch.

Exactly one of "src" or "expr" must be specified; otherwise, an error.badfetch event is thrown.
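For instance, a sketch of both forms (the file and variable names are hypothetical); the content of <audio> is fallback content, played only if the resource cannot be retrieved:

<prompt>
  <!-- Static clip with inline fallback text. -->
  <audio src="welcome.wav">Welcome to the service.</audio>
</prompt>

<prompt>
  <!-- Clip selected at run time by an ECMAScript expression. -->
  <audio expr="greetingURI"/>
</prompt>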

Note that it is a platform optimization to stream audio: i.e. the platform may begin processing audio content as it arrives rather than waiting for full retrieval. The "prefetch" fetchhint can be used to request full audio retrieval prior to playback.

4.1.4 <value> Element

The <value> element is used to insert the value of an expression into a prompt. It has one attribute:

Table 38: <value> Attributes
expr The expression to render.

For example, if n is 12, the prompt

<prompt>
   <value expr="n*n"/> is the square of <value expr="n"/>.
</prompt>

will result in the text string "144 is the square of 12" being passed to the speech synthesis engine.

The manner in which the value attribute is played is controlled by the surrounding speech synthesis markup. For instance, a value can be played as a date in the following example:

<prompt>
    <say-as interpret-as="vxml:date">
        <value expr="date"/>
    </say-as>
</prompt>

The text inserted by the <value> element is not subject to any special interpretation; in particular, it is not parsed as an [SSML] document or document fragment. XML special characters (&, >, and <) are not treated specially and do not need to be escaped. The equivalent effect may be obtained by literally inserting the text computed by the <value> element in a CDATA section. For example, when the following variable assignment:

<var name="company" expr="'AT&amp;T'"/>

is referenced in a prompt element as

<prompt> The price of <value expr="company"/> is $1. </prompt>

the following output is produced.

 The price of AT&T is $1.

4.1.5 Bargein

If an implementation platform supports bargein, the application author can specify whether a user can interrupt, or "bargein" on, a prompt using speech or DTMF input. This speeds up conversations, but is not always desired. If the application author requires that the user must hear all of a warning, legal notice, or advertisement, bargein should be disabled. This is done with the bargein attribute:

<prompt bargein="false"><audio src="legalese.wav"/></prompt>
Users can interrupt a prompt whose bargein attribute is true, but must wait for completion of a prompt whose bargein attribute is false. In the case where several prompts are queued, the bargein attribute of each prompt is honored during the period of time in which that prompt is playing. If bargein occurs during any prompt in a sequence, all subsequent prompts are not played (even those whose bargein attribute is set to false). If the bargein attribute is not specified, then the value of the bargein property is used if set.

When the bargein attribute is false, input is not buffered while the prompt is playing, and any DTMF input buffered in a transition state is deleted from the buffer (Section 4.1.8 describes input collection during transition states).

Note that not all speech recognition engines or implementation platforms support bargein. For a platform to support bargein, it must support at least one of the bargein types described in Section 4.1.5.1.

4.1.5.1 Bargein type

When bargein is enabled, the bargeintype attribute can be used to suggest the type of bargein the platform will perform in response to voice or DTMF input. Possible values for this attribute are:

Table 39: bargeintype Values
speech The prompt will be stopped as soon as speech or DTMF input is detected. The prompt is stopped irrespective of whether or not the input matches a grammar and irrespective of which grammars are active.
hotword The prompt will not be stopped until a complete match of an active grammar is detected. Input that does not match a grammar is ignored (note that this even applies during the timeout period); as a consequence, a nomatch event will never be generated in the case of hotword bargein.

If the bargeintype attribute is not specified, then the value of the bargeintype property is used. Implementations that claim to support bargein are required to support at least one of these two types. Mixing these types within a single queue of prompts can result in unpredictable behavior and is discouraged.
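For instance (the prompt wording is hypothetical):

<!-- The announcement keeps playing unless input completely
     matches an active grammar. -->
<prompt bargein="true" bargeintype="hotword">
  Please listen carefully to the following announcement.
</prompt>

<!-- Or set a default for a whole scope with a property. -->
<property name="bargeintype" value="speech"/>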

In the case of "speech" bargeintype, the exact meaning of "speech input" is necessarily implementation-dependent, due to the complexity of speech recognition technology. It is expected that the prompt will be stopped as soon as the platform is able to reliably determine that the input is speech. Stopping the prompt as early as possible is desirable because it avoids the "stutter" effect in which a user stops in mid-utterance and re-starts if he does not believe that the system has heard him.

4.1.6 Prompt Selection

Tapered prompts are those that may change with each attempt. Information-requesting prompts may become more terse under the assumption that the user is becoming more familiar with the task. Help messages become more detailed perhaps, under the assumption that the user needs more help. Or, prompts can change just to make the interaction more interesting.

Each input item, <initial>, and menu has an internal prompt counter that is reset to one each time the form or menu is entered. Whenever the system selects a given input item in the select phase of FIA and FIA does perform normal selection and queuing of prompts (i.e., as described in Section 5.3.6, the previous iteration of the FIA did not end with a catch handler that had no <reprompt>), the input item's associated prompt counter is incremented. This is the mechanism supporting tapered prompts.

For instance, here is a form with a form level prompt and field level prompts:

<form id="tapered">
  <block>
    <prompt>
      Welcome to the ice cream survey.
    </prompt>
  </block>
  <field name="flavor">
    <grammar version="1.0" root="flavor" type="application/srgs+xml">
      <rule id="flavor" scope="public">
        <one-of>
          <item>vanilla</item>
          <item>chocolate</item>
          <item>strawberry</item>
        </one-of>
      </rule>
    </grammar>
    <prompt count="1">What is your favorite flavor?</prompt>
    <prompt count="3">Say chocolate, vanilla, or strawberry.</prompt>
    <help>Sorry, no help is available.</help>
  </field>
</form>

A conversation using this form follows:

C: Welcome to the ice cream survey.

C: What is your favorite flavor? (the "flavor" field's prompt counter is 1)

H: Pecan praline.

C: I do not understand.

C: What is your favorite flavor? (the prompt counter is now 2)

H: Pecan praline.

C: I do not understand.

C: Say chocolate, vanilla, or strawberry. (prompt counter is 3)

H: What if I hate those?

C: I do not understand.

C: Say chocolate, vanilla, or strawberry. (prompt counter is 4)

H: ...

This is just an example to illustrate the use of prompt counters. A polished form would need to offer a more extensive range of choices and to deal with out of range values in a more flexible way.

When it is time to select a prompt, the prompt counter is examined. The child prompt with the highest count attribute less than or equal to the prompt counter is used. If a prompt has no count attribute, a count of "1" is assumed.

A conditional prompt is one that is spoken only if its condition is satisfied. In this example, a prompt is varied on each visit to the enclosing form.

<form id="another_joke">
  <var name="r" expr="Math.random()"/>
  <field name="another" type="boolean">
    <prompt cond="r &lt; .50">
      Would you like to hear another elephant joke?
    </prompt>
    <prompt cond="r &gt;= .50">
      For another joke say yes. To exit say no.
    </prompt>
    <filled>
      <if cond="another">
        <goto next="#pick_joke"/>
      </if>
    </filled>
  </field>
</form>

When a prompt must be chosen, a set of prompts to be queued is chosen according to the following algorithm:

  1. Form an ordered list of prompts consisting of all prompts in the enclosing element in document order.
  2. Remove from this list all prompts whose cond evaluates to false after conversion to boolean.
  3. Find the "correct count": the highest count among the prompt elements still on the list less than or equal to the current count value.
  4. Remove from the list all the elements that don't have the "correct count".

All elements that remain on the list will be queued for play.

4.1.7 Timeout

The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value.

The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts. For example, the user may be given five seconds for the first input attempt, and ten seconds on the next.

The prompt timeout attribute determines the noinput timeout for the following input:

<field name="color">
  <grammar src="colors.grxml" type="application/srgs+xml"/>
  <prompt count="1">
    Pick a color for your new Model T.
  </prompt>
  <prompt count="2" timeout="120s">
    Please choose the color of your new nineteen twenty four
    Ford Model T. Possible colors are black, black, or
    black. Please take your time.
  </prompt>
</field>

If several prompts are queued before a field input, the timeout of the last prompt is used.

4.1.8 Prompt Queueing and Input Collection

A VoiceXML interpreter is at all times in one of two states:

  • waiting for input in an input item (such as <field>, <record>, or <transfer>), or
  • transitioning between input items in response to an input (including spoken utterances, DTMF key presses, and input-related events such as a noinput or nomatch event) received while in the waiting state. While in the transitioning state no speech input is collected, accepted or interpreted. Consequently root and document level speech grammars (such as those defined in <link>s) may not be active at all times. However, DTMF input (including timing information) should be collected and buffered in the transition state. Similarly, asynchronously generated events not related directly to execution of the transition should also be buffered until the waiting state (e.g. connection.disconnect.hangup).

The waiting and transitioning states are related to the phases of the Form Interpretation Algorithm as follows:

  • the waiting state is eventually entered in the collect phase of an input item (at the point at which the interpreter waits for input), and
  • the transitioning state encompasses the process and select phases, the collect phase for control items (such as <block>s), and the collect phase for input items up until the point at which the interpreter waits for input.

This distinction of states is made in order to greatly simplify the programming model. In particular, an important consequence of this model is that the VoiceXML application designer can rely on all executable content (such as the content of <block> and <filled> elements) being run to completion, because it is executed while in the transitioning state, which may not be interrupted by input.

While in the transitioning state various prompts are queued, either by the <prompt> element in executable content or by the <prompt> element in form items. In addition, audio may be queued by the fetchaudio attribute. The queued prompts and audio are played either

  • when the interpreter reaches the waiting state, at which point the prompts are played and the interpreter listens for input that matches one of the active grammars, or
  • when the interpreter begins fetching a resource (such as a document) for which fetchaudio was specified. In this case the prompts queued before the fetchaudio are played to completion, and then, if the resource actually needs to be fetched (i.e. it is not unexpired in the cache), the fetchaudio is played until the fetch completes. The interpreter remains in the transitioning state and no input is accepted during the fetch.

Note that when a prompt's bargein attribute is false, input is not collected and DTMF input buffered in a transition state is deleted (see Section 4.1.5).

When an ASR grammar is matched, if DTMF input was consumed by a simultaneously active DTMF grammar (but did not result in a complete match of the DTMF grammar), the DTMF input may, at processor discretion, be discarded.

Before the interpreter exits, all queued prompts are played to completion. The interpreter remains in the transitioning state and no input is accepted while the interpreter is exiting.

It is a permissible optimization to begin playing prompts queued during the transitioning state before reaching the waiting state, provided that correct semantics are maintained regarding processing of the input audio received while the prompts are playing, for example with respect to bargein and grammar processing.

The following examples illustrate the operation of these rules in some common cases.

Case 1

Typical non-fetching case: field, followed by executable content (such as <filled> and <block>), followed by another field.

<form> in document d0

    <field name="f0">
        ...
    </field>

    <block>
        executable content e1
        queues prompts {p1}
    </block>

    <field name="f2">
        queues prompts {p2}
        enables grammars {g2}
    </field>

As a result of input received while waiting in field f0 the following actions take place:

  • in transitioning state
    • execute e1 (without goto)
    • queue prompts {p1}
    • queue prompts {p2}
  • in waiting state, simultaneously
    • play prompts {p1,p2}
    • enable grammars {g2} and wait for input

Case 2

Typical fetching case: field, followed by executable content (such as <filled> and <block>) ending with a <goto> that specifies fetchaudio, ending up in a field in a different document that is fetched from a server.

<form> in document d0

    <field name="f0">
        ...
    </field>

    <block>
        executable content e1
        queues prompts {p1}
        ends with goto f2 in d1 with fetchaudio fa
    </block>

<form> in document d1

    <field name="f2">
        queues prompts {p2}
        enables grammars {g2}
    </field>

As a result of input received while waiting in field f0 the following actions take place:

  • in transitioning state
    • execute e1
    • queue prompts {p1}
    • simultaneously
      • fetch d1
      • play prompts {p1} to completion and then play fa until fetch completes
    • queue prompts {p2}
  • in waiting state, simultaneously
    • play prompts {p2}
    • enable grammars {g2} and wait for input

Case 3

As in Case 2, but no fetchaudio is specified.

<form> in document d0

    <field name="f0">
        ...
    </field>

    <block>
        executable content e1
        queues prompts {p1}
        ends with goto f2 in d1 (no fetchaudio specified)
    </block>

<form> in document d1

    <field name="f2">
        queues prompts {p2}
        enables grammars {g2}
    </field>

As a result of input received while waiting in field f0 the following actions take place:

  • in transitioning state
    • execute e1
    • queue prompts {p1}
    • fetch d1
    • queue prompts {p2}
  • in waiting state, simultaneously
    • play prompts {p1, p2}
    • enable grammars {g2} and wait for input

5. Control flow and scripting

5.1 Variables and Expressions

VoiceXML variables are in all respects equivalent to ECMAScript variables: they are part of the same variable space. VoiceXML variables can be used in a <script> just as variables defined in a <script> can be used in VoiceXML. Declaring a variable using <var> is equivalent to using a 'var' statement within a <script> element.

5.2 Event Handling

The platform throws events when the user does not respond, doesn't respond in a way that the application understands, requests help, etc. The interpreter throws events if it finds a semantic error in a VoiceXML document, or when it encounters a <throw> element. Events are identified by character strings.

Each element in which an event can occur has a set of catch elements, which include: <catch>, <error>, <help>, <noinput>, and <nomatch>.

An element inherits the catch elements ("as if by copy") from each of its ancestor elements, as needed. If a field, for example, does not contain a catch element for nomatch, but its form does, the form's nomatch catch element is used. In this way, common event handling behavior can be specified at any level, and it applies to all descendents.

The "as if by copy" semantics for inheriting catch elements implies that when a catch element is executed, variables are resolved and thrown events are handled relative to the scope where the original event originated, not relative to the scope that contains the catch element. For example, consider a catch element that is defined at document scope handling an event that originated in a within the document. In such a catch element variable references are resolved relative to the 's scope, and if an event is thrown by the catch element it is handled relative to the . Similarly, relative URI references in a catch element are resolved against the active document and not relative to the document in which they were declared. Finally, properties are resolved relative to the element where the event originated. For example, a prompt element defined as part of a document level catch would use the innermost property value of the active form item to resolve its timeout attribute if no value is explicitly specified.

5.2.1 <throw> element

The <throw> element throws an event. These can be the pre-defined ones:

<throw event="nomatch"/>
<throw event="connection.disconnect.hangup"/>

or application-defined events:

<throw event="com.att.portal.machine"/>
Attributes of <throw> are:

Table 41: <throw> Attributes
event The event being thrown.
eventexpr An ECMAScript expression evaluating to the name of the event being thrown.
message A message string providing additional context about the event being thrown. For the pre-defined events thrown by the platform, the value of the message is platform-dependent.
The message is available as the value of a variable within the scope of the catch element, see below.
messageexpr An ECMAScript expression evaluating to the message string.

Exactly one of "event" or "eventexpr" must be specified; otherwise, an error.badfetch event is thrown. Exactly one of "message" or "messageexpr" may be specified; otherwise, an error.badfetch event is thrown.

Unless explicitly stated otherwise, VoiceXML does not specify when events are thrown.

5.2.2 <catch> element

The <catch> element associates a catch with a document, dialog, or form item (except for blocks). It contains executable content.


<form id="launch_missiles">
  <field name="user_id" type="digits">
    <prompt>What is your username</prompt>
  </field>
  <field name="password">
    <prompt>What is the code word?</prompt>
    <grammar version="1.0" root="root" type="application/srgs+xml">
      <rule id="root" scope="public">rutabaga</rule>
    </grammar>
    <help>It is the name of an obscure vegetable.</help>
    <catch event="nomatch noinput" count="3">
      <prompt>Security violation!</prompt>
      <submit next="http://www.example.com/apprehend_felon"
              namelist="user_id"/>
    </catch>
  </field>
</form>

The catch element's anonymous variable scope includes the special variable _event which contains the name of the event that was thrown. For example, the following catch element can handle two types of events:

<catch event="event.foo event.bar">
  <if cond="_event=='event.foo'">
    <audio src="foo.wav"/>
  <else/>
    <audio src="bar.wav"/>
  </if>
  <!-- common handling for both event types -->
</catch>

The _event variable is inspected to select the audio to play based on the event that was thrown. The foo.wav file will be played for event.foo events. The bar.wav file will be played for event.bar events. The remainder of the catch element contains executable content that is common to the handling of both event types.

The catch element's anonymous variable scope also includes the special variable _message which contains the value of the message string from the corresponding element, or a platform-dependent value for the pre-defined events raised by the platform. If the thrown event does not specify a message, the value of _message is ECMAScript undefined.
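As a sketch (the event name and wording are hypothetical, following the naming convention of Section 5.2.6), a handler can speak the message supplied by the corresponding <throw>:

<catch event="org.example.order.invalid">
  <!-- _message carries the string from the matching throw. -->
  <prompt>Sorry: <value expr="_message"/></prompt>
</catch>

<throw event="org.example.order.invalid"
       message="that item is out of stock"/>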

If a <catch> element contains a <throw> element with the same event, then there may be an infinite loop:

<catch event="help">
    <throw event="help"/>
</catch>

A platform could detect this situation and throw a semantic error instead.

Attributes of <catch> are:

Table 42: <catch> Attributes
event The event or events to catch. A space-separated list of events may be specified, indicating that this element catches all the events named in the list. In such a case a separate event counter (see "count" attribute) is maintained for each event. If the attribute is unspecified, all events are to be caught.
count The occurrence of the event (default is 1). The count allows you to handle different occurrences of the same event differently.

Each <form>, <menu>, and form item maintains a counter for each event that occurs while it is being visited. Item-level event counters are used for events thrown while visiting individual form items and while executing <filled> elements contained within those items. Form-level and menu-level counters are used for events thrown during dialog initialization and while executing form-level <filled> elements.

Form-level and menu-level event counters are reset each time the <form> or <menu> is re-entered. Form-level and menu-level event counters are not reset by the <clear> element.

Item-level event counters are reset each time the <form> containing the item is re-entered. Item-level event counters are also reset when the item is reset with the <clear> element. An item's event counters are not reset when the item is re-entered without leaving the <form>.

Counters are incremented against the full event name and every prefix matching event name; for example, occurrence of the event "event.foo.1" increments the counters for "event.foo.1" plus "event.foo" and "event".

cond An expression which must evaluate to true after conversion to boolean in order for the event to be caught. Defaults to true.
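For instance, a sketch of tapered event handling using count (the grammar URI and wording are hypothetical):

<field name="city">
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <prompt>Which city?</prompt>
  <nomatch count="1">Sorry?</nomatch>
  <nomatch count="3">
    Please say the name of a city, for example Boston.
  </nomatch>
</field>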

5.2.3 Shorthand Notation

The <error>, <help>, <noinput>, and <nomatch> elements are shorthands for very common types of <catch> elements.

The <error> element is short for <catch event="error"> and catches all events of type error:

<error>
  An error has occurred -- please call again later.
  <exit/>
</error>

The <help> element is an abbreviation for <catch event="help">:

<help>No help is available.</help>

The <noinput> element abbreviates <catch event="noinput">:

<noinput>I didn't hear anything, please try again.</noinput>

And the <nomatch> element is short for <catch event="nomatch">:

<nomatch>I heard something, but it wasn't a known city.</nomatch>

These elements take the attributes:

Table 43: Shorthand Catch Attributes
count The event count (as in <catch>).
cond An optional condition to test to see if the event is caught by this element (as in <catch>, described in Section 5.2.2). Defaults to true.

5.2.4 <catch> element selection

An element inherits the catch elements ("as if by copy") from each of its ancestor elements, as needed. For example, if a <field> element inherits a <catch> element from the document

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

    <catch event="nomatch">
        <prompt>Please try again.</prompt>
    </catch>

    <form>
        <field name="color">
            <prompt>Please say a primary color</prompt>
            <grammar>red | yellow | blue</grammar>
            <catch event="noinput">
                <prompt>I didn't hear you.</prompt>
            </catch>
        </field>
    </form>

</vxml>

then the <catch> element is implicitly copied into the <field> as if defined below:

<field name="color">
    <prompt>Please say a primary color</prompt>
    <grammar>red | yellow | blue</grammar>
    <catch event="noinput">
        <prompt>I didn't hear you.</prompt>
    </catch>
    <catch event="nomatch">
        <prompt>Please try again.</prompt>
    </catch>
</field>

When an event is thrown, the scope in which the event is handled and its enclosing scopes are examined to find the best qualified catch element, according to the following algorithm:

  1. Form an ordered list of catches consisting of all catches in the current scope and all enclosing scopes (form item, form, document, application root document, interpreter context), ordered first by scope (starting with the current scope), and then within each scope by document order.
  2. Remove from this list all catches whose event name does not match the event being thrown or whose cond evaluates to false after conversion to boolean.
  3. Find the "correct count": the highest count among the catch elements still on the list less than or equal to the current count value.
  4. Select the first element in the list with the "correct count".

The name of a thrown event matches the catch element event name if it is an exact match, a prefix match or if the catch event attribute is not specified (note that the event attribute cannot be specified as an empty string - event="" is syntactically invalid). A prefix match occurs when the catch element event attribute is a token prefix of the name of the event being thrown, where the dot is the token separator, all trailing dots are removed, and a remaining empty string matches everything. For example,

<catch event="connection.disconnect">
   <prompt>Caught a connection dot disconnect event</prompt>
</catch>

will prefix match the event connection.disconnect.transfer.

<catch event="com.example.myevent">
   <prompt>Caught a com dot example dot my event</prompt>
</catch>

prefix matches com.example.myevent.event1., com.example.myevent. and com.example.myevent..event1 but not com.example.myevents.event1. Finally,

<catch event=".">
   <prompt>Caught an event</prompt>
</catch>

prefix matches all events (as does a <catch> without an event attribute).

Note that the catch element selection algorithm gives priority to catch elements that occur earlier in a document over those that occur later, but does not give priority to catch elements that are more specific over those that are less specific. Therefore it is generally advisable to specify catch elements in order from more specific to less specific. For example, it would be advisable to specify catch elements for "error.foo" and "error" in that order, as follows:

<catch event="error.foo">
  <prompt>Caught an error dot foo event</prompt>
</catch>
<catch event="error">
  <prompt>Caught an error event</prompt>
</catch>

If the catch elements were specified in the opposite order, the catch element for "error.foo" would never be executed.

5.2.5 Default catch elements

The interpreter is expected to provide implicit default catch handlers for the noinput, help, nomatch, cancel, exit, and error events if the author did not specify them.

The system default behavior of catch handlers for various events and errors is summarized by the definitions below that specify (1) whether any audio response is to be provided, and (2) how execution is affected. Note: where an audio response is provided, the actual content is platform dependent.

Table 44: Default Catch Handlers
Event Type Audio Provided Action
cancel no don't reprompt
error yes exit interpreter
exit no exit interpreter
help yes reprompt
noinput no reprompt
nomatch yes reprompt
maxspeechtimeout yes reprompt
connection.disconnect no exit interpreter
all others yes exit interpreter

Specific platforms will differ in the default prompts presented.

5.2.6 Event Types

There are pre-defined events, and application and platform-specific events. Events are also subdivided into plain events (things that happen normally), and error events (abnormal occurrences). The error naming convention allows for multiple levels of granularity.

A conforming browser may throw an event that extends a pre-defined event string so long as the event contains the specified pre-defined event string as a dot-separated exact initial substring of its event name. Applications that write catch handlers for the pre-defined events will be interoperable. Applications that write catch handlers for extended event names are not guaranteed interoperability. For example, if in loading a grammar file a syntax error is detected the platform must throw "error.badfetch". Throwing "error.badfetch.grammar.syntax" is an acceptable implementation.

Components of event names in italics are to be substituted with the relevant information; for example, in error.unsupported.element, element is substituted with the name of the VoiceXML element which is not supported, such as error.unsupported.transfer. All other event name components are fixed.

Further information about an event may be specified in the "_message" variable (see Section 5.2.2).

The pre-defined events are:

cancel
The user has requested to cancel playing of the current prompt.
connection.disconnect.hangup
The user has hung up.
connection.disconnect.transfer
The user has been transferred unconditionally to another line and will not return.
exit
The user has asked to exit.
help
The user has asked for help.
noinput
The user has not responded within the timeout interval.
nomatch
The user input something, but it was not recognized.
maxspeechtimeout
The user input was too long, exceeding the 'maxspeechtimeout' property.

In addition to transfer errors (Section 2.3.7.3), the pre-defined errors are:

error.badfetch
The interpreter context throws this event when a fetch of a document has failed and the interpreter context has reached a place in the document interpretation where the fetch result is required. Fetch failures result from unsupported scheme references, malformed URIs, client aborts, communication errors, timeouts, security violations, unsupported resource types, resource type mismatches, document parse errors, and a variety of errors represented by scheme-specific error codes.
If the interpreter context has speculatively prefetched a document and that document turns out not to be needed, error.badfetch is not thrown. Likewise, if the fetch of an <audio> resource fails and the <audio> element specifies fallback content, the fallback content is played and no error.badfetch is thrown.
When an interpreter context is transitioning to a new document, the interpreter context throws error.badfetch on an error until the interpreter is capable of executing the new document, but again only at the point in time where the new document is actually needed, not before. Whether or not variable initialization is considered part of executing the new document is platform-dependent.
error.badfetch.http.response_code
error.badfetch.protocol.response_code
In the case of a fetch failure, the interpreter context must use a detailed event type telling which specific HTTP or other protocol-specific response code was encountered. The value of the response code for HTTP is defined in [RFC2616]. This allows applications to differentially treat a missing document from a prohibited document, for instance. The value of the response code for other protocols (such as HTTPS, RTSP, and so on) is dependent upon the protocol.
error.semantic
A run-time error was found in the VoiceXML document, e.g. substring bounds error, or an undefined variable was referenced.
error.noauthorization
Thrown when the application tries to perform an operation that is not authorized by the platform. Examples would include dialing an invalid telephone number or one which the user is not allowed to call, attempting to access a protected database via a platform-specific <object>, inappropriate access to builtin grammars, etc.
error.noresource
A run-time error occurred because a requested platform resource was not available during execution.
error.unsupported.builtin
The platform does not support a requested builtin type/grammar.
error.unsupported.format
The requested resource has a format that is not supported by the platform, e.g. an unsupported grammar format, or media type.
error.unsupported.language
The platform does not support the language for either speech synthesis or speech recognition.
error.unsupported.objectname
The platform does not support a particular platform-specific object. Note that 'objectname' is a fixed string and is not substituted with the name of the unsupported object.
error.unsupported.element
The platform does not support the given element, where element is a VoiceXML element defined in this specification. For instance, if a platform does not implement <transfer>, it must throw error.unsupported.transfer. This allows an author to use event handling to adapt to different platform capabilities.

Errors encountered during document loading, including transport errors (no document found, HTTP status code 404, and so on) and syntactic errors (no <vxml> element, etc.) result in a badfetch error event raised in the calling document. Errors that occur after loading and before entering the initialization phase of the Form Interpretation Algorithm are handled in a platform-specific manner. Errors that occur after entering the FIA initialization phase, such as semantic errors, are raised in the new document. The handling of errors encountered during the loading of the first document in a session is platform-specific.

Application-specific and platform-specific event types should use the reversed Internet domain name convention to avoid naming conflicts. For example:

error.com.example.voiceplatform.noauth
The user is not authorized to dial out on this platform.
org.example.voice.someapplication.toomanynoinputs
The user is far too quiet.

Catches can catch specific events (cancel) or all those sharing a prefix (error.unsupported).
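For example (the prompt wording is hypothetical):

<!-- Catches the cancel event. -->
<catch event="cancel">
  <prompt>Stopping that prompt.</prompt>
</catch>

<!-- Catches error.unsupported.transfer, error.unsupported.format, etc. -->
<catch event="error.unsupported">
  <prompt>That feature is not available on this platform.</prompt>
</catch>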

5.3 Executable Content

Executable content refers to a block of procedural logic. Such logic appears in:

  • The <block> form item.

  • The <filled> actions in forms and input items.

  • Event handlers (<catch>, <help>, et cetera).

Executable elements are executed in document order in their block of procedural logic. If an executable element generates an error, that error is thrown immediately. Subsequent executable elements in that block of procedural logic are not executed.

This section covers the elements that can occur in executable content.

5.3.1 <var> element

This element declares a variable. It can occur in executable content or as a child of <form> or <vxml>. Examples:

<var name="home_phone"/>
<var name="pi" expr="3.14159"/>
<var name="city" expr="'Sacramento'"/>

If it occurs in executable content, it declares a variable in the anonymous scope associated with the enclosing <block>, <filled>, or catch element. This declaration is made only when the <var> element is executed. If the variable is already declared in this scope, subsequent declarations act as assignments, as in ECMAScript.

If a <var> is a child of a <form> element, it declares a variable in the dialog scope of the <form>. This declaration is made during the form's initialization phase as described in Section 2.1.6.1. The <var> element is not a form item, and so is not visited by the Form Interpretation Algorithm's main loop.

If a <var> is a child of a <vxml> element, it declares a variable in the document scope; and if it is the child of a <vxml> element in a root document then it also declares the variable in the application scope. This declaration is made when the document is initialized; initializations happen in document order.

Attributes of <var> include:

Table 45: <var> Attributes
name The name of the variable that will hold the result. Unlike the name attribute of the <assign> element (Section 5.3.2), this attribute must not specify a variable with a scope prefix (if a variable is specified with a scope prefix, then an error.semantic event is thrown). The scope in which the variable is defined is determined from the position in the document at which the element is declared.
expr The initial value of the variable (optional). If there is no expr attribute, the variable retains its current value, if any. Variables start out with the ECMAScript value undefined if they are not given initial values.

5.3.2 <assign> element

The <assign> element assigns a value to a variable:

<assign name="flavor" expr="'chocolate'"/>
<assign name="document.mycost" expr="document.mycost + 14"/>

It is illegal to make an assignment to a variable that has not been explicitly declared using a <var> element or a var statement within a <script>; attempting such an assignment throws an error.semantic event.

Variables declared by a <script> are usable by VoiceXML elements in the same scope, and vice versa. For example, a script can define a factorial function that a field then uses:

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <script>
    <![CDATA[
      function factorial(n) {
        return (n <= 1) ? 1 : n * factorial(n-1);
      }
    ]]>
  </script>
  <form>
    <field name="fact" type="number">
      <prompt>Tell me a number and I'll tell you its factorial.</prompt>
      <filled>
        <prompt>
          <value expr="fact"/> factorial is <value expr="factorial(fact)"/>
        </prompt>
      </filled>
    </field>
  </form>
</vxml>

A <script> can likewise compute values used by later prompts:

<form id="say_time">
  <var name="d"/>
  <block>
    <script>
      d = new Date();
    </script>
  </block>
  <field name="hear_another" type="boolean">
    <prompt>
      The time is <value expr="d.getHours()"/> hours,
      <value expr="d.getMinutes()"/> minutes, and
      <value expr="d.getSeconds()"/> seconds.
      Do you want to hear another time?
    </prompt>
    <filled>
      <if cond="hear_another">
        <clear/>
      </if>
    </filled>
  </field>
</form>

The content of a <script> element is evaluated in the scope in which the <script> element appears.

All variables must be declared before being referenced by ECMAScript scripts, or by VoiceXML elements as described in Section 5.1.1.

5.3.13 <log> element

The <log> element allows an application to generate a logging or debug message which a developer can use to help in application development or post-execution analysis of application performance.

The <log> element may contain any combination of text (CDATA) and <value> elements. The generated message consists of the concatenation of the text and the string form of the value of the "expr" attribute of the <value> elements.

The manner in which the message is displayed or logged is platform-dependent. The usage of label is platform-dependent. Platforms are not required to preserve white space.

ECMAScript expressions in <log> must be evaluated in document order. The use of the <log> element should have no other side-effects on interpretation.

<log>The card number was <value expr="card_num"/></log>

The <log> element has the following attributes:

Table 53: <log> Attributes
label An optional string which may be used, for example, to indicate the purpose of the log.
expr An optional ECMAScript expression evaluating to a string.
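For example, a sketch combining text, a <value> element, and the expr attribute (the label and the order.total variable are hypothetical; session.connection.remote.uri is a standard session variable):

<log label="checkout" expr="'total=' + order.total">
  Caller <value expr="session.connection.remote.uri"/> reached checkout.
</log>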

6. Environment and Resources

6.1 Resource Fetching

6.1.1 Fetching

A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as audio files, grammars, scripts, and objects. Each fetch of the content associated with a URI is governed by the following attributes:

Table 54: Fetch Attributes
fetchtimeout The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation (see Section 6.5). If not specified, a value derived from the innermost fetchtimeout property is used.
fetchhint Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used.
maxage Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used.
maxstale Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used.

When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.

The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource.  Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.

When transitioning from one dialog to another, through either a <choice>, <goto>, <link>, <submit>, or <subdialog> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, intermediate cache, or from an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or first dialog if none is specified) is then initialized and execution of the dialog begins.

Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.

Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see Section 1.5.2 and [RFC2396]). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.
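
To illustrate these rules (a sketch with hypothetical URIs), the first transition below stays within the current document and causes no fetch or reinitialization, while the second names a document and therefore causes it to be obtained and initialized:

  <goto next="#confirm_order"/>
  <goto next="checkout.vxml#confirm_order"/>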

Elements that fetch VoiceXML documents also support the following additional attribute:

Table 55: Additional Fetch Attribute
fetchaudio The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.

The fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.
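
For example, the following sketch (the URIs are hypothetical) plays hold music while a potentially slow document is retrieved; if the music is still playing when the document arrives, it is interrupted:

  <goto next="http://www.example.com/reports/monthly.vxml"
        fetchaudio="http://www.example.com/audio/hold_music.wav"/>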

6.1.2 Caching

The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage and maxstale attributes.

The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ([RFC2616]). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource:

  • If the resource is not present in the cache, fetch it from the server using get.
  • If the resource is in the cache,
    • If a maxage value is provided,
      • If age of the cached resource <= maxage,
        • If the resource has expired,
          • Perform maxstale check.
        • Otherwise, use the cached copy.
      • Otherwise, fetch it from the server using get.
    • Otherwise,
      • If the resource has expired,
        • Perform maxstale check.
      • Otherwise, use the cached copy.

The "maxstale check" is:

  • If maxstale is provided,
    • If cached copy has exceeded its expiration time by no more than maxstale seconds, then use the cached copy.
    • Otherwise, fetch it from the server using get.
  • Otherwise, fetch it from the server using get.

Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.

The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).

While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.

6.1.2.1 Controlling the Caching Policy

VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the <vxml> element's application attribute: there is no markup mechanism to control the caching policy for an application root document).

Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero.

Using maxstale enables the author to state that an expired copy of a resource, that is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files.
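
As an illustrative sketch (the URI is hypothetical), the first <audio> element below unconditionally requests a fresh copy of the resource, while the second accepts a cached copy that has been expired for up to sixty seconds:

  <audio src="http://www.example.com/greeting.wav" maxage="0"/>
  <audio src="http://www.example.com/greeting.wav" maxstale="60"/>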

6.1.3 Prefetching

Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a <submit> must not be moved prior to any assignments of the namelist variables.

The expiration status of a resource must be checked on each use of the resource, even if its fetchhint attribute is "prefetch" and the resource was prefetched. The check must follow the caching policy specified in Section 6.1.2.
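
For instance, a grammar marked as in the following sketch (the URI is hypothetical) may be obtained as soon as the page is loaded rather than when recognition begins:

  <grammar src="http://www.example.com/city_list.grxml"
           type="application/srgs+xml" fetchhint="prefetch"/>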

6.1.4 Protocols

The "http" URI scheme must be supported by VoiceXML platforms, the "https" protocol should be supported and other URI protocols may be supported.

6.2 Metadata Information

Metadata information is information about the document rather than the document's content. VoiceXML 2.0 provides two elements in which metadata information can be expressed: <meta> and <metadata>. The <metadata> element provides more general and powerful treatment of metadata information than <meta>.

VoiceXML does not specify required metadata information. However, it does recommend that metadata is expressed using the <metadata> element with information in Resource Description Framework (RDF) [RDF-SYNTAX] using the Dublin Core version 1.0 RDF schema [DC] (see Section 6.2.2).

6.2.1 <meta> element

The <meta> element specifies meta information as in [HTML]. There are two types of <meta>.

The first type specifies a metadata property of the document as a whole and is expressed by the pair of attributes, name and content. For example, to specify the maintainer of a VoiceXML document:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <meta name="maintainer" content="jpdoe@anycompany.example.com"/>
  <form>
    <block>Hello</block>
  </form>
</vxml>

The second type of <meta> specifies HTTP response headers and is expressed by the pair of attributes http-equiv and content. In the following example, the first <meta> element sets an expiration date that prevents caching of the document; the second <meta> element sets the Date header.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <meta http-equiv="Expires" content="0"/>
  <meta http-equiv="Date" content="Thu, 01 Jan 2004 12:00:00 GMT"/>
  <form>
    <block>Hello</block>
  </form>
</vxml>

Attributes of <meta> are:

Table 56: <meta> Attributes
name The name of the metadata property.
content The value of the metadata property.
http-equiv The name of an HTTP response header.

Exactly one of "name" or "http-equiv" must be specified; otherwise, an error.badfetch event is thrown.

6.2.2 <metadata> element

The <metadata> element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with <metadata>, it is recommended that the RDF schema is used in conjunction with metadata properties defined in the Dublin Core Metadata Initiative.

RDF is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] as well as the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Rights, etc.).

The following Dublin Core metadata properties are recommended in <metadata>:

Table 57: Recommended Dublin Core Metadata Properties
Creator An entity primarily responsible for making the content of the resource.
Rights Information about rights held in and over the resource.
Subject The topic of the content of the resource. Typically, a subject will be expressed as keywords, key phrases or classification codes. Recommended best practice is to select values from a controlled vocabulary or formal classification scheme.

Here is an example of how <metadata> can be included in a VoiceXML document using the Dublin Core version 1.0 RDF schema [DC]:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <metadata>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- Metadata about the VoiceXML document -->
      <rdf:Description rdf:about="http://www.example.com/meta.vxml"
          dc:Title="Directory Enquiry Service"
          dc:Language="en"
          dc:Format="application/voicexml+xml">
        <dc:Creator>
          <rdf:Seq ID="CreatorsAlphabeticalBySurname">
            <rdf:li>Jackie Crystal</rdf:li>
            <rdf:li>William Lee</rdf:li>
          </rdf:Seq>
        </dc:Creator>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <form>
    <block>Hello</block>
  </form>
</vxml>

6.3 <property> element

The <property> element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.

Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document.

If a platform detects that the value of a property is invalid, then it should throw an error.semantic event.

In some cases, <property> elements specify default values for element attributes, such as timeout or bargein. For example, to turn off bargein by default for all the prompts in a particular form:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ask_time">
    <property name="bargein" value="false"/>
    <block>
      <prompt>
        This introductory prompt cannot be barged into.
      </prompt>
      <prompt>
        And neither can this prompt.
      </prompt>
      <prompt bargein="true">
        But this one can be barged into.
      </prompt>
    </block>
    <field name="want_time" type="boolean">
      <prompt>
        Please say yes or no.
      </prompt>
    </field>
  </form>
</vxml>


The <property> element has the following attributes:

Table 58: <property> Attributes
name The name of the property.
value The value of the property.

6.3.1 Platform-Specific Properties

An interpreter context is free to provide platform-specific properties. For example, to set the "multiplication factor" for this platform in the scope of this document:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <property name="com.example.multiplicationfactor" value="3"/>
  <form>
    <block>Welcome</block>
  </form>
</vxml>

By definition, platform-specific properties introduce incompatibilities which reduce application portability. To minimize them, the following interpreter context guidelines are strongly recommended:

  • Platform-specific properties should use reverse domain names to eliminate potential collisions as in: com.example.foo, which is clearly different from net.example.foo

  • An interpreter context must not throw an error.unsupported.property event when encountering a property it cannot process; rather the interpreter context must just ignore that property.

6.3.2 Generic Speech Recognizer Properties

The generic speech recognizer properties are mostly taken from the Java Speech API [JSAPI]:

Table 59: Generic Speech Recognizer Properties
confidencelevel The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
sensitivity Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
speedvsaccuracy A hint specifying the desired balance between speed vs. accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
completetimeout

The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar.

A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see Section 6.5). The default is platform-dependent. See Appendix D.

Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below.

incompletetimeout

The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars.  In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event).

The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken.

A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately.

The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See Appendix D.

Platforms choosing not to support the completetimeout property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout.

The value is a Time Designation (see Section 6.5).

maxspeechtimeout

The maximum duration of user speech. If this time elapses before the user stops speaking, the event "maxspeechtimeout" is thrown. The value is a Time Designation (see Section 6.5). The default duration is platform-dependent.
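
As an illustrative sketch (the values are arbitrary examples), a document that prefers accuracy over speed and requires higher recognition confidence might set:

  <property name="confidencelevel" value="0.7"/>
  <property name="speedvsaccuracy" value="0.8"/>
  <property name="completetimeout" value="1s"/>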

6.3.3 Generic DTMF Recognizer Properties

Several generic properties pertain to DTMF grammar recognition:

Table 60: Generic DTMF Recognizer Properties
interdigittimeout The inter-digit timeout value to use when recognizing DTMF input. The value is a Time Designation (see Section 6.5). The default is platform-dependent. See Appendix D.
termtimeout The terminating timeout to use when recognizing DTMF input. The value is a Time Designation (see Section 6.5). The default value is "0s". See Appendix D.
termchar The terminating DTMF character for DTMF input recognition. The default value is "#". See Appendix D.
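
For example, a dialog that collects digit strings terminated by the star key and tolerates slower dialing might set (an illustrative sketch):

  <property name="termchar" value="*"/>
  <property name="interdigittimeout" value="3s"/>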

6.3.4 Prompt and Collect Properties

These properties apply to the fundamental platform prompt and collect cycle:

Table 61: Prompt and Collect Properties
bargein The bargein attribute to use for prompts. Setting this to true allows bargein by default. Setting it to false disallows bargein. The default value is "true".
bargeintype Sets the type of bargein to be speech or hotword. Default is platform-specific. See Section 4.1.5.1.
timeout The time after which a noinput event is thrown by the platform. The value is a Time Designation (see Section 6.5). The default value is platform-dependent. See Appendix D.
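
For instance, to allow users more time to respond before a noinput event is thrown and to use hotword bargein (an illustrative sketch):

  <property name="timeout" value="10s"/>
  <property name="bargeintype" value="hotword"/>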

6.3.5 Fetching Properties

These properties pertain to the fetching of new documents and resources (note that maxage and maxstale properties may have no default value - see Section 6.1.2):

Table 62: Fetching Properties
audiofetchhint This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the platform to pre-fetch the audio. The default value is prefetch.
audiomaxage Tells the platform the maximum acceptable age, in seconds, of cached audio resources. The default is platform-specific.
audiomaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. The default is platform-specific.
documentfetchhint Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch.
documentmaxage Tells the platform the maximum acceptable age, in seconds, of cached documents. The default is platform-specific.
documentmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. The default is platform-specific.
grammarfetchhint Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe.
grammarmaxage Tells the platform the maximum acceptable age, in seconds, of cached grammars. The default is platform-specific.
grammarmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. The default is platform-specific.
objectfetchhint Tells the platform whether the URI contents for may be pre-fetched or not. The values are prefetch (the default), or safe.
objectmaxage Tells the platform the maximum acceptable age, in seconds, of cached objects. The default is platform-specific.
objectmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. The default is platform-specific.
scriptfetchhint Tells whether scripts may be pre-fetched or not. The values are prefetch (the default), or safe.
scriptmaxage Tells the platform the maximum acceptable age, in seconds, of cached scripts. The default is platform-specific.
scriptmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. The default is platform-specific.
fetchaudio The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.

fetchaudiodelay

The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see Section 6.5). The default interval is platform-dependent, e.g. "2s".  The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off.

fetchaudiominimum

The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see Section 6.5). The default is platform-dependent, e.g., "5s".  The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly.

fetchtimeout The timeout for fetches. The value is a Time Designation (see Section 6.5). The default value is platform-dependent.
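
As an illustrative sketch (the URI and values are hypothetical), a document that tolerates day-old cached documents and plays hold audio after a two-second delay might set:

  <property name="documentmaxage" value="86400"/>
  <property name="fetchaudio" value="http://www.example.com/audio/hold.wav"/>
  <property name="fetchaudiodelay" value="2s"/>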

6.3.6 Miscellaneous Properties

Table 63: Miscellaneous Properties
inputmodes This property determines which input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active.

universals

Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items - see Section 3.1.4) and which generate specific events.

Production-grade applications often need to define their own universal command grammars, e.g., to increase application portability or to provide a distinctive interface. They specify new universal command grammars with <link> elements. They turn off the default grammars with this property. Default catch handlers are not affected by this property.

The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help".

maxnbest

This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. The default value is 1.
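
For example, to enable only the platform's help and cancel universal grammars while retaining up to three recognition results (an illustrative sketch):

  <property name="universals" value="help cancel"/>
  <property name="maxnbest" value="3"/>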

Our last example shows several of these properties used at multiple levels.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <property name="audiofetchhint" value="safe"/>
  <form id="address_book">
    <property name="bargein" value="false"/>
    <block>
      <prompt>Welcome to the Voice Address Book</prompt>
    </block>
    <field name="person">
      <property name="timeout" value="5s"/>
      <prompt>Who would you like to call?</prompt>
      <help>Say the name of the person you would like to call.</help>
    </field>
    <field name="location">
      <prompt>Say the location of the person you would like to call.</prompt>
    </field>
    <field name="confirm" type="boolean">
      <prompt>
        You said to call <value expr="person"/> at <value expr="location"/>.
        Is this correct?
      </prompt>
    </field>
  </form>
</vxml>

6.4 <param> element

The <param> element is used to specify values that are passed to subdialogs or objects. It is modeled on the [HTML] <param> element. Its attributes are:

Table 64: <param> Attributes
name The name to be associated with this parameter when the object or subdialog is invoked.
expr An expression that computes the value associated with name.
value Associates a literal string value with name.
valuetype One of data or ref, by default data; used to indicate to an object if the value associated with name is data or a URI (ref). This is not used for <subdialog> since values are always data.
type The media type of the result provided by a URI if the valuetype is ref; only relevant for uses of <param> in <object>.

Exactly one of "expr" or "value" must be specified; otherwise, an error.badfetch event is thrown.

The use of valuetype and type is optional in general, although they may be required by specific objects. When <param> is contained in a <subdialog> element, the values specified by it are used to initialize dialog <var> elements in the subdialog that is invoked. See Section 2.3.4 for details regarding initialization of variables in subdialogs using <param>. When <param> is contained in an <object>, the use of the parameter data is specific to the object that is being invoked, and is outside the scope of the VoiceXML specification.

Below is an example of <param> used as part of an <object>. In this case, the first two <param> elements have expressions (implicitly of valuetype="data"), the third has an explicit value, and the fourth is a URI that returns a media type of text/plain. The meaning of this data is specific to the object.

<object name="debit" classid="method://credit_card/gather_and_debit"
    data="http://www.example.com/gather_and_debit.jar">
  <param name="amount" expr="document.amt"/>
  <param name="vendor" expr="vendor_num"/>
  <param name="application" value="Super Chess Plus Plus"/>
  <param name="prompts" value="http://www.example.com/prompts/credit"
      valuetype="ref" type="text/plain"/>
</object>
The next example illustrates <param> used with <subdialog>. In this case, two expressions are used to initialize variables in the scope of the subdialog form.

<!-- form with calling dialog -->
<form id="get_customer_info">
  <subdialog name="result" src="http://another.example.com/#getssn">
    <param name="firstname" expr="document.first"/>
    <param name="lastname" expr="document.last"/>
  </subdialog>
</form>

<!-- subdialog in http://another.example.com -->
<form id="getssn">
  <var name="firstname"/>
  <var name="lastname"/>
  <field name="ssn">
    <prompt>Please say Social Security number.</prompt>
  </field>
</form>
Using <param> in a <subdialog> is a convenient way of passing data to a subdialog without requiring the use of server-side scripting.

6.5 Value Designations

Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheet Recommendation [CSS2].

Real numbers and integers are specified in decimal notation only. An integer consists of one or more digits "0" to "9". A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign.

Time designations consist of a non-negative real number followed by a time unit identifier. The time unit identifiers are:

  • ms: milliseconds

  • s: seconds

Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s".
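
For instance, both forms of time designation, as well as a real number designation, can appear as property values (an illustrative sketch):

  <property name="timeout" value="4s"/>
  <property name="completetimeout" value="850ms"/>
  <property name="confidencelevel" value="0.5"/>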

Appendices

Appendix A — Glossary of Terms


active grammar
A speech or DTMF grammar that is currently active. This is based on the currently executing element, and the scope attributes of the currently defined grammars.

application
A collection of VoiceXML documents that are tagged with the same application name attribute.

ASR
Automatic speech recognition.

author
The creator of a VoiceXML document.

catch element
A <catch> block or one of its abbreviated forms. Certain default catch elements are defined by the VoiceXML interpreter.

control item
A form item whose purpose is either to contain a block of procedural logic (<block>) or to allow initial prompts for a mixed initiative dialog (<initial>).

CSS
W3C Cascading Style Sheet specification. See [CSS2]

dialog
An interaction with the user specified in a VoiceXML document. Types of dialogs include forms and menus.

DTMF (Dual Tone Multi-Frequency)
Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency.

ECMAScript
A standard version of JavaScript backed by the European Computer Manufacturers Association. See [ECMASCRIPT]

event
A notification "thrown" by the implementation platform, VoiceXML interpreter context, VoiceXML interpreter, or VoiceXML code. Events include exceptional conditions (semantic errors), normal errors (user did not say something recognizable), normal events (user wants to exit), and user defined events.

executable content
Procedural logic that occurs in <block>, <filled>, and event handlers.

form
A dialog that interacts with the user in a highly flexible fashion with the computer and the user sharing the initiative.

FIA (Form Interpretation Algorithm)
An algorithm implemented in a VoiceXML interpreter which drives the interaction between the user and a VoiceXML form or menu. See Section 2.1.6 and Appendix C.

form item
An element of <form> that can be visited during form execution: <initial>, <block>, <field>, <record>, <object>, <subdialog>, and <transfer>.

form item variable
A variable, either implicitly or explicitly defined, associated with each form item in a form. If the form item variable is undefined, the form interpretation algorithm will visit the form item and use it to interact with the user.

implementation platform
A computer with the requisite software and/or hardware to support the types of interaction defined by VoiceXML.

input item
A form item whose purpose is to fill an input item variable. Input items include <field>, <record>, <object>, <subdialog>, and <transfer>.

language identifier
A language identifier labels information content as being of a particular human language variant. Following the XML specification for language identification [XML], a legal language identifier is identified by an RFC 3066 [RFC3066] code. A language code is required by RFC 3066. A country code or other subtag identifier is optional by RFC 3066.

link
A set of grammars that when matched by something the user says or keys in, either transitions to a new dialog or document or throws an event in the current form item.

menu
A dialog that presents the user with a set of choices and takes action on the selected one.

mixed initiative
A computer-human interaction in which either the computer or the human can take initiative and decide what to do next.

JSGF
Java Speech Grammar Format. A proposed standard for representing speech grammars. See [JSGF]

object
A platform-specific capability with an interface available via VoiceXML.

request
A collection of data including: a URI specifying a document server for the data, a set of name-value pairs of data to be processed (optional), and a method of submission for processing (optional).

script
A fragment of logic written in a client-side scripting language, especially ECMAScript, which must be supported by any VoiceXML interpreter.

session
A connection between a user and an implementation platform, e.g. a telephone call to a voice response system. One session may involve the interpretation of more than one VoiceXML document.

SRGS (Speech Recognition Grammar Specification)
A standard format for context-free speech recognition grammars being developed by the W3C Voice Browser group. Both ABNF and XML formats are defined [SRGS].

SSML (Speech Synthesis Markup Language)
A standard format for speech synthesis being developed by the W3C Voice Browser group [SSML].

subdialog
A VoiceXML dialog (or document) invoked from the current dialog in a manner analogous to function calls.

tapered prompts
A set of prompts used to vary a message given to the human. Prompts may be tapered to be more terse with use (field prompting), or more explicit (help prompts).

throw
An element that fires an event.

TTS
text-to-speech; speech synthesis.

user
A person whose interaction with an implementation platform is controlled by a VoiceXML interpreter.

URI
Uniform Resource Identifier.

URL
Uniform Resource Locator.

VoiceXML document
An XML document conforming to the VoiceXML specification.

VoiceXML interpreter
A computer program that interprets a VoiceXML document to control an implementation platform for the purpose of conducting an interaction with a user.

VoiceXML interpreter context
A computer program that uses a VoiceXML interpreter to interpret a VoiceXML Document and that may also interact with the implementation platform independently of the VoiceXML interpreter.

W3C
World Wide Web Consortium http://www.w3.org/

Appendix B — VoiceXML Document Type Definition

The VoiceXML DTD is located at http://www.w3.org/TR/voicexml20/vxml.dtd.

Due to DTD limitations, the VoiceXML DTD does not correctly express that the <metadata> element can contain elements from other XML namespaces.

Note: the VoiceXML DTD includes modified elements from the DTDs of the Speech Recognition Grammar Specification 1.0 [SRGS] and the Speech Synthesis Markup Language 1.0 [SSML].

Appendix C — Form Interpretation Algorithm

The form interpretation algorithm (FIA) drives the interaction between the user and a VoiceXML form or menu. A menu can be viewed as a form containing a single field whose grammar and whose <filled> action are constructed from the <choice> elements.

The FIA must handle:

  • Form initialization.

  • Prompting, including the management of the prompt counters needed for prompt tapering.

  • Grammar activation and deactivation at the form and form item levels.

  • Entering the form with an utterance that matched one of the form's document-scoped grammars while the user was visiting a different form or menu.

  • Leaving the form because the user matched another form, menu, or link's document-scoped grammar.

  • Processing multiple field fills from one utterance, including the execution of the relevant actions.

  • Selecting the next form item to visit, and then processing that form item.

  • Choosing the correct catch element to handle any events thrown while processing a form item.

First we define some terms and data structures used in the form interpretation algorithm:


active grammar set
The set of grammars active during a VoiceXML interpreter context's input collection operation.

utterance
A summary of what the user said or keyed in, including the specific grammar matched, and a semantic result consisting of an interpretation structure or, where there is no semantic interpretation, the raw text of the input (see Section 3.1.6). An example utterance might be: "grammar 123 was matched, and the semantic interpretation is {drink: "coke" pizza: {number: "3" size: "large"}}".

execute
To execute executable content – either a block, a filled action, or a set of filled actions. If an event is thrown during execution, the execution of the executable content is aborted. The appropriate event handler is then executed, and this may cause control to resume in a form item, in the next iteration of the form's main loop, or outside of the form. If a <goto> is executed, the transfer takes place immediately, and the remaining executable content is not executed.

Here is the conceptual form interpretation algorithm. The FIA can start with no initial utterance, or with an initial utterance passed in from another dialog:

//
// Initialization Phase
//

foreach ( <var>, <script>, and form item, in document order )