W3C

Voice Extensible Markup Language (VoiceXML) Version 2.0

W3C Recommendation 16 March 2004

This Version:
http://www.w3.org/TR/2004/REC-voicexml20-20040316/
Latest Version:
http://www.w3.org/TR/voicexml20/
Previous Version:
http://www.w3.org/TR/2004/PR-voicexml20-20040203/
Editors:
Scott McGlashan, Hewlett-Packard (Editor-in-Chief)
Daniel C. Burnett, Nuance Communications
Jerry Carter, Invited Expert
Peter Danielsen, Lucent (until October 2002)
Jim Ferrans, Motorola
Andrew Hunt, ScanSoft
Bruce Lucas, IBM
Brad Porter, Tellme Networks
Ken Rehor, Vocalocity
Steph Tryphonas, Tellme Networks

Please refer to the errata for this document, which may include some normative corrections.

See also translations.


Abstract

This document specifies VoiceXML, the Voice Extensible Markup Language. VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document has been reviewed by W3C Members and other interested parties, and it has been endorsed by the Director as a W3C Recommendation. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

This specification is part of the W3C Speech Interface Framework and has been developed within the W3C Voice Browser Activity by participants in the Voice Browser Working Group (W3C Members only).

The design of VoiceXML 2.0 has been widely reviewed (see the disposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the VoiceXML 2.0 implementation report, along with the associated test suite.

Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.

The W3C maintains a list of any patent disclosures related to this work.

Conventions of this Document

In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate requirement levels for compliant VoiceXML implementations.

Table of Contents

Abbreviated Contents

Full Contents


1. Overview

This document defines VoiceXML, the Voice Extensible Markup Language. Its background, basic concepts and use are presented in Section 1. The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties as well as resource handling are specified in Section 6. The appendices provide additional information including the VoiceXML Schema, a detailed specification of the Form Interpretation Algorithm and timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy.

The origins of VoiceXML began in 1995 as an XML-based dialog design language intended to simplify the speech recognition application development process within an AT&T project called Phone Markup Language (PML). As AT&T reorganized, teams at AT&T, Lucent and Motorola continued working on their own PML-like languages.

In 1998, W3C hosted a conference on voice browsers. By this time, AT&T and Lucent had different variants of their original PML, while Motorola had developed VoxML, and IBM was developing its own SpeechML. Many other attendees at the conference were also developing similar languages for dialog design; for example, HP's TalkML and PipeBeach's VoiceHTML.

The VoiceXML Forum was then formed by AT&T, IBM, Lucent, and Motorola to pool their efforts. The mission of the VoiceXML Forum was to define a standard dialog design language that developers could use to build conversational applications. They chose XML as the basis for this effort because it was clear to them that this was the direction technology was going.

In 2000, the VoiceXML Forum released VoiceXML 1.0 to the public. Shortly thereafter, VoiceXML 1.0 was submitted to the W3C as the basis for the creation of a new international standard. VoiceXML 2.0 is the result of this work based on input from W3C Member companies, other W3C Working Groups, and the public.

Developers familiar with VoiceXML 1.0 are particularly directed to Changes from Previous Public Version which summarizes how VoiceXML 2.0 differs from VoiceXML 1.0.

1.1 Introduction

VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

Here are two short examples of VoiceXML. The first is the venerable "Hello World":

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next. This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.

Our second example asks the user for a choice of drink and then submits it to a server script:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>

A <field> is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:

C (computer): Would you like coffee, tea, milk, or nothing?

H (human): Orange juice.

C: I did not understand what you said. (a platform-specific default message.)

C: Would you like coffee, tea, milk, or nothing?

H: Tea

C: (continues in document drink2.asp)

1.2 Background

This section contains a high-level architectural model, whose terminology is then used to describe the goals of VoiceXML, its scope, its design principles, and the requirements it places on the systems that support it.

1.2.1 Architectural Model

The architectural model assumed by this document has the following components:

VoiceXML interpreter fits between document server and implementation platform
Figure 1: Architectural Model

A document server (e.g. a Web server) processes requests from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.

1.2.2 Goals of VoiceXML

VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user's session with other dialogs.

VoiceXML is a markup language that:

  • Minimizes client/server interactions by specifying multiple interactions per document.
  • Shields application authors from low-level, platform-specific details.
  • Separates user interaction code (in VoiceXML) from service logic (e.g. CGI scripts).
  • Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers.
  • Is easy to use for simple interactions, and yet provides language features to support complex dialogs.

While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control.

1.2.3 Scope of VoiceXML

The language describes the human-machine interaction provided by voice response systems, which includes:

  • Output of synthesized speech (text-to-speech).
  • Output of audio files.
  • Recognition of spoken input.
  • Recognition of DTMF input.
  • Recording of spoken input.
  • Control of dialog flow.
  • Telephony features such as call transfer and disconnect.

The language provides means for collecting character and/or spoken input, assigning the input results to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Universal Resource Identifiers (URIs).

1.2.4 Principles of Design

VoiceXML is an XML application [XML].

  1. The language promotes portability of services through abstraction of platform resources.

  2. The language accommodates platform diversity in supported audio file formats, speech grammar formats, and URI schemes. While producers of platforms may support various grammar formats, the language requires a common grammar format, namely the XML Form of the W3C Speech Recognition Grammar Specification [SRGS], to facilitate interoperability. Similarly, while various audio formats for playback and recording may be supported, the audio formats described in Appendix E must be supported.

  3. The language supports ease of authoring for common types of interactions.

  4. The language has well-defined semantics that preserves the author's intent regarding the behavior of interactions with the user. Client heuristics are not required to determine document element interpretation.

  5. The language recognizes semantic interpretations from grammars and makes this information available to the application.

  6. The language has a control flow mechanism.

  7. The language enables a separation of service logic from interaction behavior.

  8. It is not intended for intensive computation, database operations, or legacy system operations. These are assumed to be handled by resources outside the document interpreter, e.g. a document server.

  9. General service logic, state management, dialog generation, and dialog sequencing are assumed to reside outside the document interpreter.

  10. The language provides ways to link documents using URIs, and also to submit data to server scripts using URIs.

  11. VoiceXML provides ways to identify exactly which data to submit to the server, and which HTTP method (GET or POST) to use in the submittal.

  12. The language does not require document authors to explicitly allocate and deallocate dialog resources, or deal with concurrency. Resource allocation and concurrent threads of control are to be handled by the implementation platform.

1.2.5 Implementation Platform Requirements

This section outlines the requirements on the hardware/software platforms that will support a VoiceXML interpreter.

Document acquisition. The interpreter context is expected to acquire documents for the VoiceXML interpreter to act on. The "http" URI scheme must be supported. In some cases, the document request is generated by the interpretation of a VoiceXML document, while other requests are generated by the interpreter context in response to events outside the scope of the language, for example an incoming phone call. When issuing document requests via http, the interpreter context identifies itself using the "User-Agent" header variable with the value "<name>/<version>", for example, "acme-browser/1.2".

Audio output. An implementation platform must support audio output using audio files and text-to-speech (TTS). The platform must be able to freely sequence TTS and audio output. If an audio output resource is not available, an error.noresource event must be thrown. Audio files are referred to by a URI. The language specifies a required set of audio file formats which must be supported (see Appendix E); additional audio file formats may also be supported.

Audio input. An implementation platform is required to detect and report character and/or spoken input simultaneously and to control input detection interval duration with a timer whose length is specified by a VoiceXML document. If an audio input resource is not available, an error.noresource event must be thrown.

Transfer. The platform should be able to support making a third party connection through a communications network, such as the telephone network.

1.3 Concepts

A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
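To make the transition mechanism concrete, here is a minimal sketch (the document and dialog names are hypothetical, not taken from this specification):

<form id="goodbye">
  <block>
    Thanks for calling.
    <!-- "feedback.vxml#survey" names both the next document and the dialog
         within it; a URI with no fragment selects the document's first dialog. -->
    <goto next="feedback.vxml#survey"/>
  </block>
</form>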

1.3.1 Dialogs and Subdialogs

There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.

A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Variable instances, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
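A minimal sketch of a subdialog call and return (file, form, and field names are hypothetical):

<!-- Calling document -->
<form id="order">
  <subdialog name="confirm" src="confirm.vxml">
    <filled>
      <!-- Variables returned by the subdialog appear as properties
           of its form item variable. -->
      <if cond="confirm.answer">
        <prompt>Your order has been placed.</prompt>
      </if>
    </filled>
  </subdialog>
</form>

<!-- confirm.vxml: the invoked subdialog -->
<form>
  <field name="answer" type="boolean">
    <prompt>Is the order correct?</prompt>
    <filled>
      <return namelist="answer"/>
    </filled>
  </field>
</form>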

1.3.2 Sessions

A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.

1.3.3 Applications

An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars remain active for the duration of the application, subject to the grammar activation rules discussed in Section 3.1.4.
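The following sketch shows the relationship (file and variable names are hypothetical):

<!-- app-root.vxml: the application root document -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <var name="favorite_drink"/>
</vxml>

<!-- leaf.vxml: a leaf document naming its application root -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      application="app-root.vxml">
  <form>
    <block>
      <!-- Root document variables are visible as application.* -->
      <assign name="application.favorite_drink" expr="'tea'"/>
      <!-- The root document stays loaded across this transition. -->
      <goto next="leaf2.vxml"/>
    </block>
  </form>
</vxml>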

Figure 2 shows the transition of documents (D) in an application that share a common application root document (root).

root over sequence of 3 documents
Figure 2: Transitioning between documents in an application.

1.3.4 Grammars

Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.

1.3.5 Events

VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.

Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Furthermore, catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.

1.3.6 Links

A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A link can be used to throw an event or go to a destination URI.
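Both uses can be sketched as follows (the destination URI and grammar wording are hypothetical):

<!-- Transfer to another document whenever "operator" is spoken in scope. -->
<link next="operator_transfer.vxml">
  <grammar version="1.0" root="root" type="application/srgs+xml">
    <rule id="root" scope="public">operator</rule>
  </grammar>
</link>

<!-- Throw an event instead of transitioning. -->
<link event="help">
  <grammar version="1.0" root="root" type="application/srgs+xml">
    <rule id="root" scope="public">help</rule>
  </grammar>
</link>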

1.4 VoiceXML Elements

Table 1: VoiceXML Elements
Element Purpose Section
<assign> Assign a variable a value 5.3.2
<audio> Play an audio clip within a prompt 4.1.3
<block> A container of (non-interactive) executable code 2.3.2
<catch> Catch an event 5.2.2
<choice> Define a menu item 2.2.2
<clear> Clear one or more form item variables 5.3.3
<disconnect> Disconnect a session 5.3.11
<else> Used in <if> elements 5.3.4
<elseif> Used in <if> elements 5.3.4
<enumerate> Shorthand for enumerating the choices in a menu 2.2.4
<error> Catch an error event 5.2.3
<exit> Exit a session 5.3.9
<field> Declares an input field in a form 2.3.1
<filled> An action executed when fields are filled 2.4
<form> A dialog for presenting information and collecting data 2.1
<goto> Go to another dialog in the same or different document 5.3.7
<grammar> Specify a speech recognition or DTMF grammar 3.1
<help> Catch a help event 5.2.3
<if> Simple conditional logic 5.3.4
<initial> Declares initial logic upon entry into a (mixed initiative) form 2.3.3
<link> Specify a transition common to all dialogs in the link's scope 2.5
<log> Generate a debug message 5.3.13
<menu> A dialog for choosing amongst alternative destinations 2.2.1
<meta> Define a metadata item as a name/value pair 6.2.1
<metadata> Define metadata information using a metadata schema 6.2.2
<noinput> Catch a noinput event 5.2.3
<nomatch> Catch a nomatch event 5.2.3
<object> Interact with a custom extension 2.3.5
<option> Specify an option in a <field> 2.3.1.3
<param> Parameter in <object> or <subdialog> 6.4
<prompt> Queue speech synthesis and audio output to the user 4.1
<property> Control implementation platform settings. 6.3
<record> Record an audio sample 2.3.6
<reprompt> Play a field prompt when a field is re-visited after an event 5.3.6
<return> Return from a subdialog. 5.3.10
<script> Specify a block of ECMAScript client-side scripting logic 5.3.12
<subdialog> Invoke another dialog as a subdialog of the current one 2.3.4
<submit> Submit values to a document server 5.3.8
<throw> Throw an event. 5.2.1
<transfer> Transfer the caller to another destination 2.3.7
<value> Insert the value of an expression in a prompt 4.1.4
<var> Declare a variable 5.3.1
<vxml> Top-level element in each VoiceXML document 1.5.1

Attributes of <audio>

Table 36: <audio> Attributes
src The URI of the audio prompt. See Appendix E for required audio file formats; additional formats may be used if supported by the platform.

Table 37: <audio> Attributes (continued)
fetchtimeout See Section 6.1. This defaults to the fetchtimeout property.
fetchhint See Section 6.1. This defaults to the audiofetchhint property.
maxage See Section 6.1. This defaults to the audiomaxage property.
maxstale See Section 6.1. This defaults to the audiomaxstale property.
expr An ECMAScript expression which determines the source of the audio to be played. The expression may be either a reference to audio previously recorded with the <record> item or evaluate to the URI of an audio resource to fetch.

Exactly one of "src" or "expr" must be specified; otherwise, an error.badfetch event is thrown.
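For instance, a sketch of both forms (the file and variable names are hypothetical); the content of <audio> is fallback content, played only if the resource cannot be retrieved:

<prompt>
  <!-- Static clip with inline fallback text. -->
  <audio src="welcome.wav">Welcome to the service.</audio>
</prompt>

<prompt>
  <!-- Clip selected at run time by an ECMAScript expression. -->
  <audio expr="greetingURI"/>
</prompt>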

Note that it is a platform optimization to stream audio: i.e. the platform may begin processing audio content as it arrives rather than waiting for full retrieval. The "prefetch" fetchhint can be used to request full audio retrieval prior to playback.

4.1.4 <value> Element

The <value> element is used to insert the value of an expression into a prompt. It has one attribute:

Table 38: <value> Attributes
expr The expression to render.

For example, if n is 12, the prompt

<prompt>
   <value expr="n*n"/> is the square of <value expr="n"/>.
</prompt>

will result in the text string "144 is the square of 12" being passed to the speech synthesis engine.

The manner in which the value attribute is played is controlled by the surrounding speech synthesis markup. For instance, a value can be played as a date in the following example:

<prompt>
    <say-as interpret-as="vxml:date">
        <value expr="date"/>
    </say-as>
</prompt>

The text inserted by the <value> element is not subject to any special interpretation; in particular, it is not parsed as an [SSML] document or document fragment. XML special characters (&, >, and <) are not treated specially and do not need to be escaped. The equivalent effect may be obtained by literally inserting the text computed by the <value> element in a CDATA section. For example, when the following variable assignment:

<var name="company" expr="'AT&amp;T'"/>

is referenced in a prompt element as

<prompt> The price of <value expr="company"/> is $1. </prompt>

the following output is produced.

 The price of AT&T is $1.

4.1.5 Bargein

If an implementation platform supports bargein, the application author can specify whether a user can interrupt, or "bargein" on, a prompt using speech or DTMF input. This speeds up conversations, but is not always desired. If the application author requires that the user must hear all of a warning, legal notice, or advertisement, bargein should be disabled. This is done with the bargein attribute:

<prompt bargein="false"><audio src="legalese.wav"/></prompt>
Users can interrupt a prompt whose bargein attribute is true, but must wait for completion of a prompt whose bargein attribute is false. In the case where several prompts are queued, the bargein attribute of each prompt is honored during the period of time in which that prompt is playing. If bargein occurs during any prompt in a sequence, all subsequent prompts are not played (even those whose bargein attribute is set to false). If the bargein attribute is not specified, then the value of the bargein property is used if set.

When the bargein attribute is false, input is not buffered while the prompt is playing, and any DTMF input buffered in a transition state is deleted from the buffer (Section 4.1.8 describes input collection during transition states).

Note that not all speech recognition engines or implementation platforms support bargein. For a platform to support bargein, it must support at least one of the bargein types described in Section 4.1.5.1.

4.1.5.1 Bargein type

When bargein is enabled, the bargeintype attribute can be used to suggest the type of bargein the platform will perform in response to voice or DTMF input. Possible values for this attribute are:

Table 39: bargeintype Values
speech The prompt will be stopped as soon as speech or DTMF input is detected. The prompt is stopped irrespective of whether or not the input matches a grammar and irrespective of which grammars are active.
hotword The prompt will not be stopped until a complete match of an active grammar is detected. Input that does not match a grammar is ignored (note that this even applies during the timeout period); as a consequence, a nomatch event will never be generated in the case of hotword bargein.

If the bargeintype attribute is not specified, then the value of the bargeintype property is used. Implementations that claim to support bargein are required to support at least one of these two types. Mixing these types within a single queue of prompts can result in unpredictable behavior and is discouraged.
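For instance (the prompt wording is hypothetical):

<!-- The announcement keeps playing unless input completely
     matches an active grammar. -->
<prompt bargein="true" bargeintype="hotword">
  Please listen carefully to the following announcement.
</prompt>

<!-- Or set a default for a whole scope with a property. -->
<property name="bargeintype" value="speech"/>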

In the case of "speech" bargeintype, the exact meaning of "speech input" is necessarily implementation-dependent, due to the complexity of speech recognition technology. It is expected that the prompt will be stopped as soon as the platform is able to reliably determine that the input is speech. Stopping the prompt as early as possible is desirable because it avoids the "stutter" effect in which a user stops in mid-utterance and re-starts if he does not believe that the system has heard him.

4.1.6 Prompt Selection

Tapered prompts are those that may change with each attempt. Information-requesting prompts may become more terse under the assumption that the user is becoming more familiar with the task. Help messages become more detailed perhaps, under the assumption that the user needs more help. Or, prompts can change just to make the interaction more interesting.

Each input item, <initial>, and menu has an internal prompt counter that is reset to one each time the form or menu is entered. Whenever the system selects a given input item in the select phase of FIA and FIA does perform normal selection and queuing of prompts (i.e., as described in Section 5.3.6, the previous iteration of the FIA did not end with a catch handler that had no <reprompt>), the input item's associated prompt counter is incremented. This is the mechanism supporting tapered prompts.

For instance, here is a form with a form level prompt and field level prompts:

<form id="tapered">
  <block>
    <prompt>
      Welcome to the ice cream survey.
    </prompt>
  </block>
  <field name="flavor">
    <grammar version="1.0" root="flavor" type="application/srgs+xml">
      <rule id="flavor" scope="public">
        <one-of>
          <item>vanilla</item>
          <item>chocolate</item>
          <item>strawberry</item>
        </one-of>
      </rule>
    </grammar>
    <prompt count="1">What is your favorite flavor?</prompt>
    <prompt count="3">Say chocolate, vanilla, or strawberry.</prompt>
    <help>Sorry, no help is available.</help>
  </field>
</form>

A conversation using this form follows:

C: Welcome to the ice cream survey.

C: What is your favorite flavor? (the "flavor" field's prompt counter is 1)

H: Pecan praline.

C: I do not understand.

C: What is your favorite flavor? (the prompt counter is now 2)

H: Pecan praline.

C: I do not understand.

C: Say chocolate, vanilla, or strawberry. (prompt counter is 3)

H: What if I hate those?

C: I do not understand.

C: Say chocolate, vanilla, or strawberry. (prompt counter is 4)

H: ...

This is just an example to illustrate the use of prompt counters. A polished form would need to offer a more extensive range of choices and to deal with out of range values in a more flexible way.

When it is time to select a prompt, the prompt counter is examined. The child prompt with the highest count attribute less than or equal to the prompt counter is used. If a prompt has no count attribute, a count of "1" is assumed.

A conditional prompt is one that is spoken only if its condition is satisfied. In this example, a prompt is varied on each visit to the enclosing form.

<form id="another_joke">
  <var name="r" expr="Math.random()"/>
  <field name="another" type="boolean">
    <prompt cond="r &lt; .50">
      Would you like to hear another elephant joke?
    </prompt>
    <prompt cond="r &gt;= .50">
      For another joke say yes. To exit say no.
    </prompt>
    <filled>
      <if cond="another">
        <goto next="#pick_joke"/>
      </if>
    </filled>
  </field>
</form>

When a prompt must be chosen, a set of prompts to be queued is chosen according to the following algorithm:

  1. Form an ordered list of prompts consisting of all prompts in the enclosing element in document order.
  2. Remove from this list all prompts whose cond evaluates to false after conversion to boolean.
  3. Find the "correct count": the highest count among the prompt elements still on the list less than or equal to the current count value.
  4. Remove from the list all the elements that don't have the "correct count".

All elements that remain on the list will be queued for play.

4.1.7 Timeout

The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value.

The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts. For example, the user may be given five seconds for the first input attempt, and ten seconds on the next.

The prompt timeout attribute determines the noinput timeout for the following input:

<field name="color">
  <grammar src="colors.grxml" type="application/srgs+xml"/>
  <prompt count="1">
    Pick a color for your new Model T.
  </prompt>
  <prompt count="2" timeout="120s">
    Please choose the color of your new nineteen twenty four
    Ford Model T. Possible colors are black, black, or
    black. Please take your time.
  </prompt>
</field>

If several prompts are queued before a field input, the timeout of the last prompt is used.

4.1.8 Prompt Queueing and Input Collection

A VoiceXML interpreter is at all times in one of two states:

  • waiting for input in an input item (such as <field>, <record>, or <transfer>), or
  • transitioning between input items in response to an input (including spoken utterances, DTMF key presses, and input-related events such as a noinput or nomatch event) received while in the waiting state. While in the transitioning state no speech input is collected, accepted or interpreted. Consequently root and document level speech grammars (such as those defined in <link>s) may not be active at all times. However, DTMF input (including timing information) should be collected and buffered in the transition state. Similarly, asynchronously generated events not related directly to execution of the transition should also be buffered until the waiting state (e.g. connection.disconnect.hangup).

The waiting and transitioning states are related to the phases of the Form Interpretation Algorithm as follows:

  • the waiting state is eventually entered in the collect phase of an input item (at the point at which the interpreter waits for input), and
  • the transitioning state encompasses the process and select phases, the collect phase for control items (such as <block>s), and the collect phase for input items up until the point at which the interpreter waits for input.

This distinction of states is made in order to greatly simplify the programming model. In particular, an important consequence of this model is that the VoiceXML application designer can rely on all executable content (such as the content of <block> and <filled> elements) being run to completion, because it is executed while in the transitioning state, which may not be interrupted by input.

While in the transitioning state various prompts are queued, either by the <prompt> element in executable content or by the <prompt> element in form items. In addition, audio may be queued by the fetchaudio attribute. The queued prompts and audio are played either

  • when the interpreter reaches the waiting state, at which point the prompts are played and the interpreter listens for input that matches one of the active grammars, or
  • when the interpreter begins fetching a resource (such as a document) for which fetchaudio was specified. In this case the prompts queued before the fetchaudio are played to completion, and then, if the resource actually needs to be fetched (i.e. it is not unexpired in the cache), the fetchaudio is played until the fetch completes. The interpreter remains in the transitioning state and no input is accepted during the fetch.

Note that when a prompt's bargein attribute is false, input is not collected and DTMF input buffered in a transition state is deleted (see Section 4.1.5).

When an ASR grammar is matched, if DTMF input was consumed by a simultaneously active DTMF grammar (but did not result in a complete match of the DTMF grammar), the DTMF input may, at processor discretion, be discarded.

Before the interpreter exits, all queued prompts are played to completion. The interpreter remains in the transitioning state and no input is accepted while the interpreter is exiting.

It is a permissible optimization to begin playing prompts queued during the transitioning state before reaching the waiting state, provided that correct semantics are maintained regarding processing of the input audio received while the prompts are playing, for example with respect to bargein and grammar processing.

The following examples illustrate the operation of these rules in some common cases.

Case 1

Typical non-fetching case: field, followed by executable content (such as <filled> and <block>), followed by another field.

<form> in document d0

    <field name="f0">
        ...
    </field>

    <block>
        executable content e1
        queues prompts {p1}
    </block>

    <field name="f2">
        queues prompts {p2}
        enables grammars {g2}
    </field>

As a result of input received while waiting in field f0 the following actions take place:

  • in transitioning state
    • execute e1 (without goto)
    • queue prompts {p1}
    • queue prompts {p2}
  • in waiting state, simultaneously
    • play prompts {p1,p2}
    • enable grammars {g2} and wait for input

Case 2

Typical fetching case: field, followed by executable content (such as <filled> and <block>) ending with a <goto> that specifies fetchaudio, ending up in a field in a different document that is fetched from a server.

<form> in document d0

    <field name="f0">
        ...
    </field>

    <block>
        executable content e1
        queues prompts {p1}
        ends with goto f2 in d1 with fetchaudio fa
    </block>

<form> in document d1

    <field name="f2">
        queues prompts {p2}
        enables grammars {g2}
    </field>

As a result of input received while waiting in field f0 the following actions take place:

  • in transitioning state
    • execute e1
    • queue prompts {p1}
    • simultaneously
      • fetch d1
      • play prompts {p1} to completion and then play fa until fetch completes
    • queue prompts {p2}
  • in waiting state, simultaneously
    • play prompts {p2}
    • enable grammars {g2} and wait for input

Case 3

As in Case 2, but no fetchaudio is specified.

<form> in document d0

    <field name="f0">
        ...
    </field>

    <block>
        executable content e1
        queues prompts {p1}
        ends with goto f2 in d1 (no fetchaudio specified)
    </block>

<form> in document d1

    <field name="f2">
        queues prompts {p2}
        enables grammars {g2}
    </field>

As a result of input received while waiting in field f0 the following actions take place:

  • in transitioning state
    • execute e1
    • queue prompts {p1}
    • fetch d1
    • queue prompts {p2}
  • in waiting state, simultaneously
    • play prompts {p1, p2}
    • enable grammars {g2} and wait for input

5. Control flow and scripting

5.1 Variables and Expressions

VoiceXML variables are in all respects equivalent to ECMAScript variables: they are part of the same variable space. VoiceXML variables can be used in a <script> just as variables defined in a <script> can be used in VoiceXML. Declaring a variable using <var> is equivalent to using a 'var' statement within a <script> element.

5.2 Event Handling

The platform throws events when the user does not respond, doesn't respond in a way that the application understands, requests help, etc. The interpreter throws events if it finds a semantic error in a VoiceXML document, or when it encounters a <throw> element. Events are identified by character strings.

Each element in which an event can occur has a set of catch elements, which include: <catch>, <error>, <help>, <noinput>, and <nomatch>.

An element inherits the catch elements ("as if by copy") from each of its ancestor elements, as needed. If a field, for example, does not contain a catch element for nomatch, but its form does, the form's nomatch catch element is used. In this way, common event handling behavior can be specified at any level, and it applies to all descendents.

The "as if by copy" semantics for inheriting catch elements implies that when a catch element is executed, variables are resolved and thrown events are handled relative to the scope where the original event originated, not relative to the scope that contains the catch element. For example, consider a catch element that is defined at document scope handling an event that originated in a within the document. In such a catch element variable references are resolved relative to the 's scope, and if an event is thrown by the catch element it is handled relative to the . Similarly, relative URI references in a catch element are resolved against the active document and not relative to the document in which they were declared. Finally, properties are resolved relative to the element where the event originated. For example, a prompt element defined as part of a document level catch would use the innermost property value of the active form item to resolve its timeout attribute if no value is explicitly specified.

5.2.1 <throw> element

The <throw> element throws an event. These can be the pre-defined ones:

<throw event="nomatch"/>
<throw event="connection.disconnect.hangup"/>

or application-defined events:

<throw event="com.att.portal.machine"/>
Attributes of <throw> are:

Table 41: <throw> Attributes
event The event being thrown.
eventexpr An ECMAScript expression evaluating to the name of the event being thrown.
message A message string providing additional context about the event being thrown. For the pre-defined events thrown by the platform, the value of the message is platform-dependent.
The message is available as the value of a variable within the scope of the catch element, see below.
messageexpr An ECMAScript expression evaluating to the message string.

Exactly one of "event" or "eventexpr" must be specified; otherwise, an error.badfetch event is thrown. Exactly one of "message" or "messageexpr" may be specified; otherwise, an error.badfetch event is thrown.

Unless explicitly stated otherwise, VoiceXML does not specify when events are thrown.

5.2.2 <catch> element

The <catch> element associates a catch with a document, dialog, or form item (except for blocks). It contains executable content.


<form id="launch_missiles">
  <field name="user_id" type="digits">
    <prompt>What is your username</prompt>
  </field>
  <field name="password">
    <prompt>What is the code word?</prompt>
    <grammar version="1.0" root="root" type="application/srgs+xml">
      <rule id="root" scope="public">rutabaga</rule>
    </grammar>
    <help>It is the name of an obscure vegetable.</help>
    <catch event="nomatch noinput" count="3">
      <prompt>Security violation!</prompt>
      <submit next="http://www.example.com/apprehend_felon"
              namelist="user_id"/>
    </catch>
  </field>
</form>

The catch element's anonymous variable scope includes the special variable _event which contains the name of the event that was thrown. For example, the following catch element can handle two types of events:

<catch event="event.foo event.bar">
  <if cond="_event=='event.foo'">
    <audio src="foo.wav"/>
  <else/>
    <audio src="bar.wav"/>
  </if>
  <!-- common handling for both event types -->
</catch>

The _event variable is inspected to select the audio to play based on the event that was thrown. The foo.wav file will be played for event.foo events. The bar.wav file will be played for event.bar events. The remainder of the catch element contains executable content that is common to the handling of both event types.

The catch element's anonymous variable scope also includes the special variable _message which contains the value of the message string from the corresponding element, or a platform-dependent value for the pre-defined events raised by the platform. If the thrown event does not specify a message, the value of _message is ECMAScript undefined.
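As a sketch (the event name and wording are hypothetical, following the naming convention of Section 5.2.6), a handler can speak the message supplied by the corresponding <throw>:

<catch event="org.example.order.invalid">
  <!-- _message carries the string from the matching throw. -->
  <prompt>Sorry: <value expr="_message"/></prompt>
</catch>

<throw event="org.example.order.invalid"
       message="that item is out of stock"/>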

If a <catch> element contains a <throw> element with the same event, then there may be an infinite loop:

<catch event="help">
    <throw event="help"/>
</catch>

A platform could detect this situation and throw a semantic error instead.

Attributes of <catch> are:

Table 42: <catch> Attributes
event The event or events to catch. A space-separated list of events may be specified, indicating that this element catches all the events named in the list. In such a case a separate event counter (see "count" attribute) is maintained for each event. If the attribute is unspecified, all events are to be caught.
count The occurrence of the event (default is 1). The count allows you to handle different occurrences of the same event differently.

Each <form>, <menu>, and form item maintains a counter for each event that occurs while it is being visited. Item-level event counters are used for events thrown while visiting individual form items and while executing <filled> elements contained within those items. Form-level and menu-level counters are used for events thrown during dialog initialization and while executing form-level <filled> elements.

Form-level and menu-level event counters are reset each time the <form> or <menu> is re-entered. Form-level and menu-level event counters are not reset by the <clear> element.

Item-level event counters are reset each time the <form> containing the item is re-entered. Item-level event counters are also reset when the item is reset with the <clear> element. An item's event counters are not reset when the item is re-entered without leaving the <form>.

Counters are incremented against the full event name and every prefix matching event name; for example, occurrence of the event "event.foo.1" increments the counters for "event.foo.1" plus "event.foo" and "event".

cond An expression which must evaluate to true after conversion to boolean in order for the event to be caught. Defaults to true.
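For instance, a sketch of tapered event handling using count (the grammar URI and wording are hypothetical):

<field name="city">
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <prompt>Which city?</prompt>
  <nomatch count="1">Sorry?</nomatch>
  <nomatch count="3">
    Please say the name of a city, for example Boston.
  </nomatch>
</field>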

5.2.3 Shorthand Notation

The <error>, <help>, <noinput>, and <nomatch> elements are shorthands for very common types of <catch> elements.

The <error> element is short for <catch event="error"> and catches all events of type error:

<error>
  An error has occurred -- please call again later.
  <exit/>
</error>

The <help> element is an abbreviation for <catch event="help">:

<help>No help is available.</help>

The <noinput> element abbreviates <catch event="noinput">:

<noinput>I didn't hear anything, please try again.</noinput>

And the <nomatch> element is short for <catch event="nomatch">:

<nomatch>I heard something, but it wasn't a known city.</nomatch>

These elements take the attributes:

Table 43: Shorthand Catch Attributes
count The event count (as in <catch>).
cond An optional condition to test to see if the event is caught by this element (as in <catch>, described in Section 5.2.2). Defaults to true.

5.2.4 <catch> element selection

An element inherits the catch elements ("as if by copy") from each of its ancestor elements, as needed. For example, if a <field> element inherits a <catch> element from the document

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

    <catch event="nomatch">
        <prompt>Please try again.</prompt>
    </catch>

    <form>
        <field name="color">
            <prompt>Please say a primary color</prompt>
            <grammar>red | yellow | blue</grammar>
            <catch event="noinput">
                <prompt>I didn't hear you.</prompt>
            </catch>
        </field>
    </form>

</vxml>

then the <catch> element is implicitly copied into the <field> as if defined below:

<field name="color">
    <prompt>Please say a primary color</prompt>
    <grammar>red | yellow | blue</grammar>
    <catch event="noinput">
        <prompt>I didn't hear you.</prompt>
    </catch>
    <catch event="nomatch">
        <prompt>Please try again.</prompt>
    </catch>
</field>

When an event is thrown, the scope in which the event is handled and its enclosing scopes are examined to find the best qualified catch element, according to the following algorithm:

  1. Form an ordered list of catches consisting of all catches in the current scope and all enclosing scopes (form item, form, document, application root document, interpreter context), ordered first by scope (starting with the current scope), and then within each scope by document order.
  2. Remove from this list all catches whose event name does not match the event being thrown or whose cond evaluates to false after conversion to boolean.
  3. Find the "correct count": the highest count among the catch elements still on the list less than or equal to the current count value.
  4. Select the first element in the list with the "correct count".

The name of a thrown event matches the catch element event name if it is an exact match, a prefix match or if the catch event attribute is not specified (note that the event attribute cannot be specified as an empty string - event="" is syntactically invalid). A prefix match occurs when the catch element event attribute is a token prefix of the name of the event being thrown, where the dot is the token separator, all trailing dots are removed, and a remaining empty string matches everything. For example,

<catch event="connection.disconnect">
   <prompt>Caught a connection dot disconnect event</prompt>
</catch>

will prefix match the event connection.disconnect.transfer.

<catch event="com.example.myevent">
   <prompt>Caught a com dot example dot my event</prompt>
</catch>

prefix matches com.example.myevent.event1., com.example.myevent. and com.example.myevent..event1 but not com.example.myevents.event1. Finally,

<catch event=".">
   <prompt>Caught an event</prompt>
</catch>

prefix matches all events (as does a <catch> without an event attribute).

Note that the catch element selection algorithm gives priority to catch elements that occur earlier in a document over those that occur later, but does not give priority to catch elements that are more specific over those that are less specific. Therefore it is generally advisable to specify catch elements in order from more specific to less specific. For example, it would be advisable to specify catch elements for "error.foo" and "error" in that order, as follows:

<catch event="error.foo">
  <prompt>Caught an error dot foo event</prompt>
</catch>
<catch event="error">
  <prompt>Caught an error event</prompt>
</catch>

If the catch elements were specified in the opposite order, the catch element for "error.foo" would never be executed.

5.2.5 Default catch elements

The interpreter is expected to provide implicit default catch handlers for the noinput, help, nomatch, cancel, exit, and error events if the author did not specify them.

The system default behavior of catch handlers for various events and errors is summarized by the definitions below that specify (1) whether any audio response is to be provided, and (2) how execution is affected. Note: where an audio response is provided, the actual content is platform dependent.

Table 44: Default Catch Handlers
Event Type Audio Provided Action
cancel no don't reprompt
error yes exit interpreter
exit no exit interpreter
help yes reprompt
noinput no reprompt
nomatch yes reprompt
maxspeechtimeout yes reprompt
connection.disconnect no exit interpreter
all others yes exit interpreter

Specific platforms will differ in the default prompts presented.

5.2.6 Event Types

There are pre-defined events, and application and platform-specific events. Events are also subdivided into plain events (things that happen normally), and error events (abnormal occurrences). The error naming convention allows for multiple levels of granularity.

A conforming browser may throw an event that extends a pre-defined event string so long as the event contains the specified pre-defined event string as a dot-separated exact initial substring of its event name. Applications that write catch handlers for the pre-defined events will be interoperable. Applications that write catch handlers for extended event names are not guaranteed interoperability. For example, if in loading a grammar file a syntax error is detected the platform must throw "error.badfetch". Throwing "error.badfetch.grammar.syntax" is an acceptable implementation.

Components of event names in italics are to be substituted with the relevant information; for example, in error.unsupported.element, element is substituted with the name of the VoiceXML element which is not supported, such as error.unsupported.transfer. All other event name components are fixed.

Further information about an event may be specified in the "_message" variable (see Section 5.2.2).

The pre-defined events are:

cancel
The user has requested to cancel playing of the current prompt.
connection.disconnect.hangup
The user has hung up.
connection.disconnect.transfer
The user has been transferred unconditionally to another line and will not return.
exit
The user has asked to exit.
help
The user has asked for help.
noinput
The user has not responded within the timeout interval.
nomatch
The user input something, but it was not recognized.
maxspeechtimeout
The user input was too long, exceeding the 'maxspeechtimeout' property.

In addition to transfer errors (Section 2.3.7.3), the pre-defined errors are:

error.badfetch
The interpreter context throws this event when a fetch of a document has failed and the interpreter context has reached a place in the document interpretation where the fetch result is required. Fetch failures result from unsupported scheme references, malformed URIs, client aborts, communication errors, timeouts, security violations, unsupported resource types, resource type mismatches, document parse errors, and a variety of errors represented by scheme-specific error codes.
If the interpreter context has speculatively prefetched a document and that document turns out not to be needed, error.badfetch is not thrown. Likewise, if the fetch of an <audio> resource fails and the <audio> element specifies fallback content, the fallback content is played and no error.badfetch is thrown.
When an interpreter context is transitioning to a new document, the interpreter context throws error.badfetch on an error until the interpreter is capable of executing the new document, but again only at the point in time where the new document is actually needed, not before. Whether or not variable initialization is considered part of executing the new document is platform-dependent.
error.badfetch.http.response_code
error.badfetch.protocol.response_code
In the case of a fetch failure, the interpreter context must use a detailed event type telling which specific HTTP or other protocol-specific response code was encountered. The value of the response code for HTTP is defined in [RFC2616]. This allows applications to differentially treat a missing document from a prohibited document, for instance. The value of the response code for other protocols (such as HTTPS, RTSP, and so on) is dependent upon the protocol.
error.semantic
A run-time error was found in the VoiceXML document, e.g. substring bounds error, or an undefined variable was referenced.
error.noauthorization
Thrown when the application tries to perform an operation that is not authorized by the platform. Examples would include dialing an invalid telephone number or one which the user is not allowed to call, attempting to access a protected database via a platform-specific <object>, inappropriate access to builtin grammars, etc.
error.noresource
A run-time error occurred because a requested platform resource was not available during execution.
error.unsupported.builtin
The platform does not support a requested builtin type/grammar.
error.unsupported.format
The requested resource has a format that is not supported by the platform, e.g. an unsupported grammar format, or media type.
error.unsupported.language
The platform does not support the language for either speech synthesis or speech recognition.
error.unsupported.objectname
The platform does not support a particular platform-specific object. Note that 'objectname' is a fixed string and is not substituted with the name of the unsupported object.
error.unsupported.element
The platform does not support the given element, where element is a VoiceXML element defined in this specification. For instance, if a platform does not implement <transfer>, it must throw error.unsupported.transfer. This allows an author to use event handling to adapt to different platform capabilities.

Errors encountered during document loading, including transport errors (no document found, HTTP status code 404, and so on) and syntactic errors (no <vxml> element, etc.) result in a badfetch error event raised in the calling document. Errors that occur after loading and before entering the initialization phase of the Form Interpretation Algorithm are handled in a platform-specific manner. Errors that occur after entering the FIA initialization phase, such as semantic errors, are raised in the new document. The handling of errors encountered during the loading of the first document in a session is platform-specific.

Application-specific and platform-specific event types should use the reversed Internet domain name convention to avoid naming conflicts. For example:

error.com.example.voiceplatform.noauth
The user is not authorized to dial out on this platform.
org.example.voice.someapplication.toomanynoinputs
The user is far too quiet.

Catches can catch specific events (cancel) or all those sharing a prefix (error.unsupported).
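For example (the prompt wording is hypothetical):

<!-- Catches the cancel event. -->
<catch event="cancel">
  <prompt>Stopping that prompt.</prompt>
</catch>

<!-- Catches error.unsupported.transfer, error.unsupported.format, etc. -->
<catch event="error.unsupported">
  <prompt>That feature is not available on this platform.</prompt>
</catch>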

5.3 Executable Content

Executable content refers to a block of procedural logic. Such logic appears in:

  • The <block> form item.

  • The <filled> actions in forms and input items.

  • Event handlers (<catch>, <help>, et cetera).

Executable elements are executed in document order in their block of procedural logic. If an executable element generates an error, that error is thrown immediately. Subsequent executable elements in that block of procedural logic are not executed.

This section covers the elements that can occur in executable content.

5.3.1 <var> element

This element declares a variable. It can occur in executable content or as a child of <form> or <vxml>. Examples:

<var name="home_phone"/>
<var name="pi" expr="3.14159"/>
<var name="city" expr="'Sacramento'"/>

If it occurs in executable content, it declares a variable in the anonymous scope associated with the enclosing <block>, <filled>, or catch element. This declaration is made only when the <var> element is executed. If the variable is already declared in this scope, subsequent declarations act as assignments, as in ECMAScript.

If a <var> is a child of a <form> element, it declares a variable in the dialog scope of the <form>. This declaration is made during the form's initialization phase as described in Section 2.1.6.1. The <var> element is not a form item, and so is not visited by the Form Interpretation Algorithm's main loop.

If a <var> is a child of a <vxml> element, it declares a variable in the document scope; and if it is the child of a <vxml> element in a root document then it also declares the variable in the application scope. This declaration is made when the document is initialized; initializations happen in document order.

Attributes of <var> include:

Table 45: <var> Attributes
name The name of the variable that will hold the result. Unlike the name attribute of the <assign> element (Section 5.3.2), this attribute must not specify a variable with a scope prefix (if a variable is specified with a scope prefix, then an error.semantic event is thrown). The scope in which the variable is defined is determined from the position in the document at which the element is declared.
expr The initial value of the variable (optional). If there is no expr attribute, the variable retains its current value, if any. Variables start out with the ECMAScript value undefined if they are not given initial values.

5.3.2 <assign> element

The <assign> element assigns a value to a variable:

<assign name="flavor" expr="'chocolate'"/>
<assign name="document.mycost" expr="document.mycost + 14"/>

It is illegal to make an assignment to a variable that has not been explicitly declared using a <var> element or a var statement within a <script>; attempting such an assignment throws an error.semantic event.

Variables declared by a <script> are usable by VoiceXML elements in the same scope, and vice versa. For example, a script can define a factorial function that a field then uses:

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <script>
    <![CDATA[
      function factorial(n) {
        return (n <= 1) ? 1 : n * factorial(n-1);
      }
    ]]>
  </script>
  <form>
    <field name="fact" type="number">
      <prompt>Tell me a number and I'll tell you its factorial.</prompt>
      <filled>
        <prompt>
          <value expr="fact"/> factorial is <value expr="factorial(fact)"/>
        </prompt>
      </filled>
    </field>
  </form>
</vxml>

A <script> can likewise compute values used by later prompts:

<form id="say_time">
  <var name="d"/>
  <block>
    <script>
      d = new Date();
    </script>
  </block>
  <field name="hear_another" type="boolean">
    <prompt>
      The time is <value expr="d.getHours()"/> hours,
      <value expr="d.getMinutes()"/> minutes, and
      <value expr="d.getSeconds()"/> seconds.
      Do you want to hear another time?
    </prompt>
    <filled>
      <if cond="hear_another">
        <clear/>
      </if>
    </filled>
  </field>
</form>

The content of a <script> element is evaluated in the scope in which the <script> element appears.

All variables must be declared before being referenced by ECMAScript scripts, or by VoiceXML elements as described in Section 5.1.1.

5.3.13 <log> element

The <log> element allows an application to generate a logging or debug message which a developer can use to help in application development or post-execution analysis of application performance.

The <log> element may contain any combination of text (CDATA) and <value> elements. The generated message consists of the concatenation of the text and the string form of the value of the "expr" attribute of the <value> elements.

The manner in which the message is displayed or logged is platform-dependent. The usage of label is platform-dependent. Platforms are not required to preserve white space.

ECMAScript expressions in <log> must be evaluated in document order. The use of the <log> element should have no other side-effects on interpretation.

<log>The card number was <value expr="card_num"/></log>

The <log> element has the following attributes:

Table 53: <log> Attributes
label An optional string which may be used, for example, to indicate the purpose of the log.
expr An optional ECMAScript expression evaluating to a string.
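For example, a sketch combining text, a <value> element, and the expr attribute (the label and the order.total variable are hypothetical; session.connection.remote.uri is a standard session variable):

<log label="checkout" expr="'total=' + order.total">
  Caller <value expr="session.connection.remote.uri"/> reached checkout.
</log>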

6. Environment and Resources

6.1 Resource Fetching

6.1.1 Fetching

A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as audio files, grammars, scripts, and objects. Each fetch of the content associated with a URI is governed by the following attributes:

Table 54: Fetch Attributes
fetchtimeout The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation (see Section 6.5). If not specified, a value derived from the innermost fetchtimeout property is used.
fetchhint Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used.
maxage Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used.
maxstale Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used.

When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.

The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource.  Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.

When transitioning from one dialog to another, through either a <choice>, <goto>, <link>, <submit>, or <subdialog> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, intermediate cache, or from an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or first dialog if none is specified) is then initialized and execution of the dialog begins.

Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.

Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see Section 1.5.2 and [RFC2396]). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.
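
To illustrate these rules (a sketch with hypothetical URIs), the first transition below stays within the current document and causes no fetch or reinitialization, while the second names a document and therefore causes it to be obtained and initialized:

  <goto next="#confirm_order"/>
  <goto next="checkout.vxml#confirm_order"/>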

Elements that fetch VoiceXML documents also support the following additional attribute:

Table 55: Additional Fetch Attribute
fetchaudio The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.

The fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.
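
For example, the following sketch (the URIs are hypothetical) plays hold music while a potentially slow document is retrieved; if the music is still playing when the document arrives, it is interrupted:

  <goto next="http://www.example.com/reports/monthly.vxml"
        fetchaudio="http://www.example.com/audio/hold_music.wav"/>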

6.1.2 Caching

The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage and maxstale attributes.

The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ([RFC2616]). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource:

  • If the resource is not present in the cache, fetch it from the server using get.
  • If the resource is in the cache,
    • If a maxage value is provided,
      • If age of the cached resource <= maxage,
        • If the resource has expired,
          • Perform maxstale check.
        • Otherwise, use the cached copy.
      • Otherwise, fetch it from the server using get.
    • Otherwise,
      • If the resource has expired,
        • Perform maxstale check.
      • Otherwise, use the cached copy.

The "maxstale check" is:

  • If maxstale is provided,
    • If cached copy has exceeded its expiration time by no more than maxstale seconds, then use the cached copy.
    • Otherwise, fetch it from the server using get.
  • Otherwise, fetch it from the server using get.

Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.

The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).

While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.

6.1.2.1 Controlling the Caching Policy

VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the <vxml> element's application attribute: there is no markup mechanism to control the caching policy for an application root document).

Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero.

Using maxstale enables the author to state that an expired copy of a resource, that is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files.
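
As an illustrative sketch (the URI is hypothetical), the first <audio> element below unconditionally requests a fresh copy of the resource, while the second accepts a cached copy that has been expired for up to sixty seconds:

  <audio src="http://www.example.com/greeting.wav" maxage="0"/>
  <audio src="http://www.example.com/greeting.wav" maxstale="60"/>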

6.1.3 Prefetching

Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a <submit> must not be moved prior to any assignments of the namelist variables.

The expiration status of a resource must be checked on each use of the resource, even if its fetchhint attribute is "prefetch" and the resource was prefetched. The check must follow the caching policy specified in Section 6.1.2.
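
For instance, a grammar marked as in the following sketch (the URI is hypothetical) may be obtained as soon as the page is loaded rather than when recognition begins:

  <grammar src="http://www.example.com/city_list.grxml"
           type="application/srgs+xml" fetchhint="prefetch"/>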

6.1.4 Protocols

The "http" URI scheme must be supported by VoiceXML platforms, the "https" protocol should be supported and other URI protocols may be supported.

6.2 Metadata Information

Metadata information is information about the document rather than the document's content. VoiceXML 2.0 provides two elements in which metadata information can be expressed: <meta> and <metadata>. The <metadata> element provides more general and powerful treatment of metadata information than <meta>.

VoiceXML does not specify required metadata information. However, it does recommend that metadata is expressed using the <metadata> element with information in Resource Description Framework (RDF) [RDF-SYNTAX] using the Dublin Core version 1.0 RDF schema [DC] (see Section 6.2.2).

6.2.1 <meta> element

The <meta> element specifies meta information as in [HTML]. There are two types of <meta>.

The first type specifies a metadata property of the document as a whole and is expressed by the pair of attributes, name and content. For example, to specify the maintainer of a VoiceXML document:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <meta name="maintainer" content="jpdoe@anycompany.example.com"/>
  <form>
    <block>Hello</block>
  </form>
</vxml>

The second type of <meta> specifies HTTP response headers and is expressed by the pair of attributes http-equiv and content. In the following example, the first <meta> element sets an expiration date that prevents caching of the document; the second <meta> element sets the Date header.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <meta http-equiv="Expires" content="0"/>
  <meta http-equiv="Date" content="Thu, 01 Jan 2004 12:00:00 GMT"/>
  <form>
    <block>Hello</block>
  </form>
</vxml>

Attributes of <meta> are:

Table 56: <meta> Attributes
name The name of the metadata property.
content The value of the metadata property.
http-equiv The name of an HTTP response header.

Exactly one of "name" or "http-equiv" must be specified; otherwise, an error.badfetch event is thrown.

6.2.2 <metadata> element

The <metadata> element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with <metadata>, it is recommended that the RDF schema is used in conjunction with metadata properties defined in the Dublin Core Metadata Initiative.

RDF is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] as well as the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Rights, etc.).

The following Dublin Core metadata properties are recommended in <metadata>:

Table 57: Recommended Dublin Core Metadata Properties
Creator An entity primarily responsible for making the content of the resource.
Rights Information about rights held in and over the resource.
Subject The topic of the content of the resource. Typically, a subject will be expressed as keywords, key phrases or classification codes. Recommended best practice is to select values from a controlled vocabulary or formal classification scheme.

Here is an example of how <metadata> can be included in a VoiceXML document using the Dublin Core version 1.0 RDF schema [DC]:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <metadata>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- Metadata about the VoiceXML document -->
      <rdf:Description rdf:about="http://www.example.com/meta.vxml"
          dc:Title="Directory Enquiry Service"
          dc:Language="en"
          dc:Format="application/voicexml+xml">
        <dc:Creator>
          <rdf:Seq ID="CreatorsAlphabeticalBySurname">
            <rdf:li>Jackie Crystal</rdf:li>
            <rdf:li>William Lee</rdf:li>
          </rdf:Seq>
        </dc:Creator>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <form>
    <block>Hello</block>
  </form>
</vxml>

6.3 <property> element

The <property> element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.

Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document.

If a platform detects that the value of a property is invalid, then it should throw an error.semantic event.

In some cases, <property> elements specify default values for element attributes, such as timeout or bargein. For example, to turn off bargein by default for all the prompts in a particular form:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ask_time">
    <property name="bargein" value="false"/>
    <block>
      <prompt>
        This introductory prompt cannot be barged into.
      </prompt>
      <prompt>
        And neither can this prompt.
      </prompt>
      <prompt bargein="true">
        But this one can be barged into.
      </prompt>
    </block>
    <field name="want_time" type="boolean">
      <prompt>
        Please say yes or no.
      </prompt>
    </field>
  </form>
</vxml>


The <property> element has the following attributes:

Table 58: <property> Attributes
name The name of the property.
value The value of the property.

6.3.1 Platform-Specific Properties

An interpreter context is free to provide platform-specific properties. For example, to set the "multiplication factor" for this platform in the scope of this document:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <property name="com.example.multiplicationfactor" value="3"/>
  <form>
    <block>Welcome</block>
  </form>
</vxml>

By definition, platform-specific properties introduce incompatibilities which reduce application portability. To minimize them, the following interpreter context guidelines are strongly recommended:

  • Platform-specific properties should use reverse domain names to eliminate potential collisions as in: com.example.foo, which is clearly different from net.example.foo

  • An interpreter context must not throw an error.unsupported.property event when encountering a property it cannot process; rather the interpreter context must just ignore that property.

6.3.2 Generic Speech Recognizer Properties

The generic speech recognizer properties are mostly taken from the Java Speech API [JSAPI]:

Table 59: Generic Speech Recognizer Properties
confidencelevel The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
sensitivity Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
speedvsaccuracy A hint specifying the desired balance between speed vs. accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
completetimeout

The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar.

A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see Section 6.5). The default is platform-dependent. See Appendix D.

Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below.

incompletetimeout

The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars.  In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event).

The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken.

A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately.

The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See Appendix D.

Platforms choosing not to support the completetimeout property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout.

The value is a Time Designation (see Section 6.5).

maxspeechtimeout

The maximum duration of user speech. If this time elapses before the user stops speaking, the event "maxspeechtimeout" is thrown. The value is a Time Designation (see Section 6.5). The default duration is platform-dependent.
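
As an illustrative sketch (the values are arbitrary examples), a document that prefers accuracy over speed and requires higher recognition confidence might set:

  <property name="confidencelevel" value="0.7"/>
  <property name="speedvsaccuracy" value="0.8"/>
  <property name="completetimeout" value="1s"/>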

6.3.3 Generic DTMF Recognizer Properties

Several generic properties pertain to DTMF grammar recognition:

Table 60: Generic DTMF Recognizer Properties
interdigittimeout The inter-digit timeout value to use when recognizing DTMF input. The value is a Time Designation (see Section 6.5). The default is platform-dependent. See Appendix D.
termtimeout The terminating timeout to use when recognizing DTMF input. The value is a Time Designation (see Section 6.5). The default value is "0s". See Appendix D.
termchar The terminating DTMF character for DTMF input recognition. The default value is "#". See Appendix D.
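
For example, a dialog that collects digit strings terminated by the star key and tolerates slower dialing might set (an illustrative sketch):

  <property name="termchar" value="*"/>
  <property name="interdigittimeout" value="3s"/>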

6.3.4 Prompt and Collect Properties

These properties apply to the fundamental platform prompt and collect cycle:

Table 61: Prompt and Collect Properties
bargein The bargein attribute to use for prompts. Setting this to true allows bargein by default. Setting it to false disallows bargein. The default value is "true".
bargeintype Sets the type of bargein to be speech or hotword. Default is platform-specific. See Section 4.1.5.1.
timeout The time after which a noinput event is thrown by the platform. The value is a Time Designation (see Section 6.5). The default value is platform-dependent. See Appendix D.
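
For instance, to allow users more time to respond before a noinput event is thrown and to use hotword bargein (an illustrative sketch):

  <property name="timeout" value="10s"/>
  <property name="bargeintype" value="hotword"/>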

6.3.5 Fetching Properties

These properties pertain to the fetching of new documents and resources (note that maxage and maxstale properties may have no default value - see Section 6.1.2):

Table 62: Fetching Properties
audiofetchhint This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the platform to pre-fetch the audio. The default value is prefetch.
audiomaxage Tells the platform the maximum acceptable age, in seconds, of cached audio resources. The default is platform-specific.
audiomaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. The default is platform-specific.
documentfetchhint Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch.
documentmaxage Tells the platform the maximum acceptable age, in seconds, of cached documents. The default is platform-specific.
documentmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. The default is platform-specific.
grammarfetchhint Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe.
grammarmaxage Tells the platform the maximum acceptable age, in seconds, of cached grammars. The default is platform-specific.
grammarmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. The default is platform-specific.
objectfetchhint Tells the platform whether the URI contents for may be pre-fetched or not. The values are prefetch (the default), or safe.
objectmaxage Tells the platform the maximum acceptable age, in seconds, of cached objects. The default is platform-specific.
objectmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. The default is platform-specific.
scriptfetchhint Tells whether scripts may be pre-fetched or not. The values are prefetch (the default), or safe.
scriptmaxage Tells the platform the maximum acceptable age, in seconds, of cached scripts. The default is platform-specific.
scriptmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. The default is platform-specific.
fetchaudio The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.

fetchaudiodelay

The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see Section 6.5). The default interval is platform-dependent, e.g. "2s".  The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off.

fetchaudiominimum

The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see Section 6.5). The default is platform-dependent, e.g., "5s".  The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly.

fetchtimeout The timeout for fetches. The value is a Time Designation (see Section 6.5). The default value is platform-dependent.
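
As an illustrative sketch (the URI and values are hypothetical), a document that tolerates day-old cached documents and plays hold audio after a two-second delay might set:

  <property name="documentmaxage" value="86400"/>
  <property name="fetchaudio" value="http://www.example.com/audio/hold.wav"/>
  <property name="fetchaudiodelay" value="2s"/>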

6.3.6 Miscellaneous Properties

Table 63: Miscellaneous Properties
inputmodes This property determines which input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active.

universals

Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items - see Section 3.1.4) and which generate specific events.

Production-grade applications often need to define their own universal command grammars, e.g., to increase application portability or to provide a distinctive interface. They specify new universal command grammars with <link> elements. They turn off the default grammars with this property. Default catch handlers are not affected by this property.

The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help".

maxnbest

This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. The default value is 1.
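
For example, to enable only the platform's help and cancel universal grammars while retaining up to three recognition results (an illustrative sketch):

  <property name="universals" value="help cancel"/>
  <property name="maxnbest" value="3"/>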

Our last example shows several of these properties used at multiple levels.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <property name="audiofetchhint" value="safe"/>
  <form id="address_book">
    <property name="bargein" value="false"/>
    <block>
      <prompt>Welcome to the Voice Address Book</prompt>
    </block>
    <field name="person">
      <property name="timeout" value="5s"/>
      <prompt>Who would you like to call?</prompt>
      <help>Say the name of the person you would like to call.</help>
    </field>
    <field name="location">
      <prompt>Say the location of the person you would like to call.</prompt>
    </field>
    <field name="confirm" type="boolean">
      <prompt>
        You said to call <value expr="person"/> at <value expr="location"/>.
        Is this correct?
      </prompt>
    </field>
  </form>
</vxml>

6.4 <param> element

The <param> element is used to specify values that are passed to subdialogs or objects. It is modeled on the [HTML] <param> element. Its attributes are:

Table 64: <param> Attributes
name The name to be associated with this parameter when the object or subdialog is invoked.
expr An expression that computes the value associated with name.
value Associates a literal string value with name.
valuetype One of data or ref, by default data; used to indicate to an object if the value associated with name is data or a URI (ref). This is not used for <subdialog> since values are always data.
type The media type of the result provided by a URI if the valuetype is ref; only relevant for uses of <param> in <object>.

Exactly one of "expr" or "value" must be specified; otherwise, an error.badfetch event is thrown.

The use of valuetype and type is optional in general, although they may be required by specific objects. When <param> is contained in a <subdialog> element, the values specified by it are used to initialize dialog <var> elements in the subdialog that is invoked. See Section 2.3.4 for details regarding initialization of variables in subdialogs using <param>. When <param> is contained in an <object>, the use of the parameter data is specific to the object that is being invoked, and is outside the scope of the VoiceXML specification.

Below is an example of <param> used as part of an <object>. In this case, the first two <param> elements have expressions (implicitly of valuetype="data"), the third has an explicit value, and the fourth is a URI that returns a media type of text/plain. The meaning of this data is specific to the object.

<object name="debit" classid="method://credit_card/gather_and_debit"
    data="http://www.example.com/gather_and_debit.jar">
  <param name="amount" expr="document.amt"/>
  <param name="vendor" expr="vendor_num"/>
  <param name="application" value="Super Chess Plus Plus"/>
  <param name="prompts" value="http://www.example.com/prompts/credit"
      valuetype="ref" type="text/plain"/>
</object>
The next example illustrates <param> used with <subdialog>. In this case, two expressions are used to initialize variables in the scope of the subdialog form.

<!-- form with calling dialog -->
<form id="get_customer_info">
  <subdialog name="result" src="http://another.example.com/#getssn">
    <param name="firstname" expr="document.first"/>
    <param name="lastname" expr="document.last"/>
  </subdialog>
</form>

<!-- subdialog in http://another.example.com -->
<form id="getssn">
  <var name="firstname"/>
  <var name="lastname"/>
  <field name="ssn">
    <prompt>Please say Social Security number.</prompt>
  </field>
</form>
Using <param> in a <subdialog> is a convenient way of passing data to a subdialog without requiring the use of server-side scripting.

6.5 Value Designations

Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheet Recommendation [CSS2].

Real numbers and integers are specified in decimal notation only. An integer consists of one or more digits "0" to "9". A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign.

Time designations consist of a non-negative real number followed by a time unit identifier. The time unit identifiers are:

  • ms: milliseconds

  • s: seconds

Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s".
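
For instance, both forms of time designation, as well as a real number designation, can appear as property values (an illustrative sketch):

  <property name="timeout" value="4s"/>
  <property name="completetimeout" value="850ms"/>
  <property name="confidencelevel" value="0.5"/>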

Appendices

Appendix A — Glossary of Terms


active grammar
A speech or DTMF grammar that is currently active. This is based on the currently executing element, and the scope attributes of the currently defined grammars.

application
A collection of VoiceXML documents that are tagged with the same application name attribute.

ASR
Automatic speech recognition.

author
The creator of a VoiceXML document.

catch element
A <catch> block or one of its abbreviated forms. Certain default catch elements are defined by the VoiceXML interpreter.

control item
A form item whose purpose is either to contain a block of procedural logic (<block>) or to allow initial prompts for a mixed initiative dialog (<initial>).

CSS
W3C Cascading Style Sheet specification. See [CSS2]

dialog
An interaction with the user specified in a VoiceXML document. Types of dialogs include forms and menus.

DTMF (Dual Tone Multi-Frequency)
Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency.

ECMAScript
A standard version of JavaScript backed by the European Computer Manufacturers Association. See [ECMASCRIPT]

event
A notification "thrown" by the implementation platform, VoiceXML interpreter context, VoiceXML interpreter, or VoiceXML code. Events include exceptional conditions (semantic errors), normal errors (user did not say something recognizable), normal events (user wants to exit), and user defined events.

executable content
Procedural logic that occurs in <block>, <filled>, and event handlers.

form
A dialog that interacts with the user in a highly flexible fashion with the computer and the user sharing the initiative.

FIA (Form Interpretation Algorithm)
An algorithm implemented in a VoiceXML interpreter which drives the interaction between the user and a VoiceXML form or menu. See Section 2.1.6 and Appendix C.

form item
An element of <form> that can be visited during form execution: <initial>, <block>, <field>, <record>, <object>, <subdialog>, and <transfer>.

form item variable
A variable, either implicitly or explicitly defined, associated with each form item in a form. If the form item variable is undefined, the form interpretation algorithm will visit the form item and use it to interact with the user.

implementation platform
A computer with the requisite software and/or hardware to support the types of interaction defined by VoiceXML.

input item
A form item whose purpose is to fill an input item variable. Input items include <field>, <record>, <object>, <subdialog>, and <transfer>.

language identifier
A language identifier labels information content as being of a particular human language variant. Following the XML specification for language identification [XML], a legal language identifier is identified by an RFC 3066 [RFC3066] code. A language code is required by RFC 3066. A country code or other subtag identifier is optional by RFC 3066.

link
A set of grammars that when matched by something the user says or keys in, either transitions to a new dialog or document or throws an event in the current form item.

menu
A dialog that presents the user with a set of choices and takes action on the selected one.

mixed initiative
A computer-human interaction in which either the computer or the human can take initiative and decide what to do next.

JSGF
Java Speech Grammar Format. A proposed standard for representing speech grammars. See [JSGF]

object
A platform-specific capability with an interface available via VoiceXML.

request
A collection of data including: a URI specifying a document server for the data, a set of name-value pairs of data to be processed (optional), and a method of submission for processing (optional).

script
A fragment of logic written in a client-side scripting language, especially ECMAScript, which must be supported by any VoiceXML interpreter.

session
A connection between a user and an implementation platform, e.g. a telephone call to a voice response system. One session may involve the interpretation of more than one VoiceXML document.

SRGS (Speech Recognition Grammar Specification)
A standard format for context-free speech recognition grammars being developed by the W3C Voice Browser group. Both ABNF and XML formats are defined [SRGS].

SSML (Speech Synthesis Markup Language)
A standard format for speech synthesis being developed by the W3C Voice Browser group [SSML].

subdialog
A VoiceXML dialog (or document) invoked from the current dialog in a manner analogous to function calls.

tapered prompts
A set of prompts used to vary a message given to the human. Prompts may be tapered to be more terse with use (field prompting), or more explicit (help prompts).

throw
An element that fires an event.

TTS
text-to-speech; speech synthesis.

user
A person whose interaction with an implementation platform is controlled by a VoiceXML interpreter.

URI
Uniform Resource Identifier.

URL
Uniform Resource Locator.

VoiceXML document
An XML document conforming to the VoiceXML specification.

VoiceXML interpreter
A computer program that interprets a VoiceXML document to control an implementation platform for the purpose of conducting an interaction with a user.

VoiceXML interpreter context
A computer program that uses a VoiceXML interpreter to interpret a VoiceXML Document and that may also interact with the implementation platform independently of the VoiceXML interpreter.

W3C
World Wide Web Consortium http://www.w3.org/

Appendix B — VoiceXML Document Type Definition

The VoiceXML DTD is located at http://www.w3.org/TR/voicexml20/vxml.dtd.

Due to DTD limitations, the VoiceXML DTD does not correctly express that the <metadata> element can contain elements from other XML namespaces.

Note: the VoiceXML DTD includes modified elements from the DTDs of the Speech Recognition Grammar Specification 1.0 [SRGS] and the Speech Synthesis Markup Language 1.0 [SSML].

Appendix C — Form Interpretation Algorithm

The form interpretation algorithm (FIA) drives the interaction between the user and a VoiceXML form or menu. A menu can be viewed as a form containing a single field whose grammar and whose <filled> action are constructed from the <choice> elements.

The FIA must handle:

  • Form initialization.

  • Prompting, including the management of the prompt counters needed for prompt tapering.

  • Grammar activation and deactivation at the form and form item levels.

  • Entering the form with an utterance that matched one of the form's document-scoped grammars while the user was visiting a different form or menu.

  • Leaving the form because the user matched another form, menu, or link's document-scoped grammar.

  • Processing multiple field fills from one utterance, including the execution of the relevant actions.

  • Selecting the next form item to visit, and then processing that form item.

  • Choosing the correct catch element to handle any events thrown while processing a form item.

First we define some terms and data structures used in the form interpretation algorithm:


active grammar set
The set of grammars active during a VoiceXML interpreter context's input collection operation.

utterance
A summary of what the user said or keyed in, including the specific grammar matched, and a semantic result consisting of an interpretation structure or, where there is no semantic interpretation, the raw text of the input (see Section 3.1.6). An example utterance might be: "grammar 123 was matched, and the semantic interpretation is {drink: "coke" pizza: {number: "3" size: "large"}}".

execute
To execute executable content – either a block, a filled action, or a set of filled actions. If an event is thrown during execution, the execution of the executable content is aborted. The appropriate event handler is then executed, and this may cause control to resume in a form item, in the next iteration of the form's main loop, or outside of the form. If a <goto> is executed, the transfer takes place immediately, and the remaining executable content is not executed.

Here is the conceptual form interpretation algorithm. The FIA can start with no initial utterance, or with an initial utterance passed in from another dialog:

//
// Initialization Phase
//

foreach ( <var>, <script>, and form item, in document order )