Independent Study – VoiceXML
By Chetan Sharma
webchetan@hotmail.com
Under Supervision of Professor C. Tappert, Pace University
VoiceXML is a Web-based markup language for
representing human-computer dialogs, just like HTML. While HTML assumes a
graphical web browser with display, keyboard, and mouse, VoiceXML assumes a
voice browser with audio output (computer-synthesized and/or recorded)
and audio input (voice and/or keypad tones). VoiceXML leverages the
Internet for voice application development and delivery, greatly simplifying
these difficult tasks and creating new opportunities.
VoiceXML 1.0 is also a specification of the VoiceXML Forum, an industry consortium of over 300 companies. The Forum is active in the conformance testing, education, and marketing of VoiceXML, and has given control over further language development to the World Wide Web Consortium (W3C). Because it is a specification, applications that work on one conformant VoiceXML platform will work on others as well.
VoiceXML is a programming language
for describing call flows for interactive voice applications. The VoiceXML
language provides a clean and simple means for:
VoiceXML
documents can perform programming functions such as arithmetic and text
manipulation. This allows a document to check the validity of the user's input.
Also, a user's session need not be a simple sequence that runs the same way
every time. The document may include "if-then-else" decision making
(branching) and other complex structures. Writing powerful documents is easier when
you use Nuance Speech Objects. These are pieces of software that are
pre-written, tested, and packaged in a form that is easy for a VoiceXML
document to use. Speech Objects conduct dialogs for common functions such as
accepting credit card numbers, times and dates, and dollar amounts.
Grammar is the most
important aspect of developing a VXML application. Motorola, IBM, Tellme
Studio, and BeVocal are among the few known firms
working on VXML application development. The standard version of VXML is yet to
develop, since each and every company.
VoiceXML is a derivative of the Extensible Markup Language
(XML). XML is the standard format for defining structured documents and data on
the Web. XML enables programmers to define an arbitrary vocabulary, formally
known as a schema, using a standard, well-defined, easily parsed syntax. One
XML schema might describe customer information, another might describe a
mathematical equation, and yet another might describe a recipe for chocolate
chip cookies.
The initial VoiceXML project resulted from collaboration between IBM, Motorola, Lucent and
AT&T. The current list of members covers a broad spectrum of the computer
industry. We will explore how VoiceXML goes beyond the graphical user
interfaces of HTML and provides a framework for the most natural form of
communication: spoken language.
The world of VoiceXML is
changing weekly. One of the first companies to offer a system for
experimentation was IBM, through their alphaWorks
program. They integrated an early version of VoiceXML with their ViaVoice speech technology. Most of the software can be
downloaded free from www.alphaworks.ibm.com/tech/voicexml.
The system supports Microsoft Windows and desktop recognition. Several
companies have already deployed VoiceXML systems for their internal development
work, but do not generally make their platforms available to developers. Nuance
and SpeechWorks, both providers of telephone-based
recognizers, have VoiceXML initiatives underway with extensive developer
programs. Nuance has announced a dial-up phone system for testing scripts. It
was scheduled for release in late August as of this writing and may be used
free for 60 days.
Tellme.com appears to be the furthest along
with their Tellme Labs developer program. They offer
a free dial-up number and a range of tools for developing and testing VoiceXML
scripts. They even include a window that may be used while browsing their
developer site to write and modify VoiceXML scripts that can then be
immediately tested from their free dial-up number.
A system can also be built
by assembling the necessary components outlined in Figure 1. This approach
requires a much more extensive effort and understanding in putting the parts
together. Telephony cards, DSPs and various servers
must all be made to work together. The most widely used recognizers from
AT&T, IBM, L&H/Dragon, Nuance and SpeechWorks
were not initially designed to work with VoiceXML. Several of the dynamic
aspects of VoiceXML make it more difficult to simply match the respective
speech APIs to the VoiceXML requirements. The VoiceXML committee is still
working on a standard capable of supporting the more complex grammar that each
of the recognizers referenced above is built on. This is resulting in several
different platform-dependent VoiceXML solutions that are likely to change over
time.
A key value
of VoiceXML, much like HTML, is its simplicity. It isn't a full-fledged
programming language, and to a large extent it would be expected that, much
like HTML, a new set of professionals would specialize in VoiceXML. Writing
good VoiceXML dialogs requires a sense of what makes human dialog work. Some of
the more proficient writers of VoiceXML dialogs that I've worked with have come
from a background in linguistics, audio engineering or the broadcast industry.
While XML can be a little alien for someone who has never programmed, the
commands are easily learned. The art of good dialog design is an open
challenge, a call to those with an understanding of verbal discourse.
Components of a VoiceXML System
Any web site can be a
VoiceXML content server. No special hardware or software is necessary. Servers
respond to requests by generating either canned or dynamically generated
VoiceXML scripts, which are passed by HTTP back to the gateway. VoiceXML
scripts look very much like HTML documents.
For example, a
<prompt> tag indicates that the gateway system should play synthesized speech
or a piece of recorded audio to the customer. A <field> tag is used to indicate an
input field. The presence of the <field> tag is a cue to the speech
recognition engine to listen for user input and interpret it according to a
grammar specified in the script.
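To make the two tags concrete, the following is a minimal sketch in Python that parses a small VoiceXML-style document and lists each input field together with its prompt. The form id, field name, prompt wording, station names, and inline grammar are assumptions made for illustration, and grammar syntax in particular varies from platform to platform. Element names are written in lowercase, as XML is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Illustrative VoiceXML 1.0-style document. The form id, field name,
# prompt wording, and inline grammar are assumptions for this sketch.
VXML_DOC = """<?xml version="1.0"?>
<vxml version="1.0">
  <form id="departure">
    <field name="origin">
      <prompt>Where are you departing from?</prompt>
      <grammar>white plains | scarsdale | grand central</grammar>
    </field>
  </form>
</vxml>
"""


def describe_fields(document: str) -> list[tuple[str, str]]:
    """Return (field name, prompt text) pairs found in a VoiceXML document."""
    root = ET.fromstring(document)
    pairs = []
    for field in root.iter("field"):
        prompt = field.find("prompt")
        # A field without a prompt is reported with an empty prompt string.
        text = prompt.text.strip() if prompt is not None and prompt.text else ""
        pairs.append((field.get("name", ""), text))
    return pairs


if __name__ == "__main__":
    for name, prompt_text in describe_fields(VXML_DOC):
        print(f"field {name!r} asks: {prompt_text}")
```

A real gateway would go further and hand the grammar to its recognizer; this sketch only shows how the prompt and the field relate inside the document.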
Like conventional web
pages, VoiceXML scripts may have embedded server-side or client (gateway-side)
script. A specialized tag called <object> allows the incorporation of
platform-specific functionality. Many VoiceXML scripts will probably contain a
combination of "pure" VoiceXML and pre-written modular components
written in Java or ActiveX.
Interpretation of the
script and the interaction with the user is controlled by the VoiceXML gateway.
Gateways are special collections of hardware and software which form the core
of VoiceXML technology. Essentially they provide the presentation services
component of VoiceXML, analogous to the web browser in conventional HTTP
service.
Goals of VoiceXML
VoiceXML’s main goal is to bring the full power of
web development and content delivery to voice response applications, and to
free the authors of such applications from low-level programming and resource
management. It enables integration of voice services with data services using
the familiar client-server paradigm. A voice service is viewed as a sequence of
interaction dialogs between a user and an implementation platform. The dialogs
are provided by document servers, which may be external to the implementation
platform. Document servers maintain overall service logic, perform database and
legacy system operations, and produce dialogs.
A VoiceXML document specifies each interaction dialog to
be conducted by a VoiceXML interpreter. User input affects dialog
interpretation and is collected into requests submitted to a document server.
The document server may reply with another VoiceXML document to continue the
user’s session with other dialogs.
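The request and reply cycle described above can be sketched as a small simulation. This is not how any particular gateway is implemented: the document server is a hypothetical in-memory table, each "document" is reduced to a prompt, one field, and the URI to submit to, and the user's speech is replaced by a list of prepared answers.

```python
from typing import Callable

# A "document" is a simplified stand-in for a VoiceXML document:
# a prompt, the variable to collect, and the URI to submit to next.
Document = dict


def welcome(_request: dict) -> Document:
    return {"prompt": "Where are you departing from?",
            "field": "origin", "next": "/dest"}


def destination(request: dict) -> Document:
    # The server uses a previously submitted request variable in its reply.
    return {"prompt": f"Leaving {request['origin']}. What is your destination?",
            "field": "dest", "next": None}


# Hypothetical document server: maps a URI to a function producing a dialog.
DOCUMENT_SERVER: dict[str, Callable[[dict], Document]] = {
    "/start": welcome,
    "/dest": destination,
}


def run_session(answers: list[str], start_uri: str = "/start") -> tuple[list[str], dict]:
    """Conduct dialogs until the server returns no follow-up document."""
    spoken, request, uri = [], {}, start_uri
    replies = iter(answers)
    while uri is not None:
        doc = DOCUMENT_SERVER[uri](request)    # fetch the next document
        spoken.append(doc["prompt"])           # interpreter plays the prompt
        request[doc["field"]] = next(replies)  # input becomes a request variable
        uri = doc["next"]                      # submit to the server, or finish
    return spoken, request
```

Calling run_session(["Scarsdale", "Grand Central"]) walks through both dialogs and returns the prompts that were played along with the collected request variables.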
Advantages of VoiceXML
- Minimizes client/server interactions by specifying multiple interactions per document.
- Shields application authors from low-level and platform-specific details.
- Separates user interaction code (in VoiceXML) from service logic (CGI scripts).
- Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers.
- Safely handles shared network-based applications. No arbitrary computations are allowed, and platform resources are protected.
- Is easy to use for simple interactions, and yet extensible for complex ones.
The language describes the human-machine interaction provided by voice response systems, which includes:
- Output of synthesized speech (text-to-speech)
- Output of audio files
- Recognition of spoken input
- Recognition of DTMF input
- Recording of spoken input
- Telephony features such as call transfer and disconnect
The language provides means for collecting character and/or spoken input,
assigning the input to document-defined request variables, and making decisions
that affect the interpretation of documents written in the language. A document
may be linked to other documents through Universal Resource Identifiers (URIs). When a link is followed, request variables and
their values, if present, are submitted to the link's URI.
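As a rough illustration of request variables being submitted to a link's URI, the sketch below appends them to the URI as a query string. The helper name, the example.com address, and the variable names are hypothetical, and a GET-style query string is only one possible way to submit values.

```python
from urllib.parse import urlencode, urlsplit


def build_submit_uri(link_uri: str, request_vars: dict[str, str]) -> str:
    """Append request variables to a link URI as a query string."""
    if not request_vars:
        # No variables present: the link is followed unchanged.
        return link_uri
    # Extend an existing query string rather than starting a second one.
    separator = "&" if urlsplit(link_uri).query else "?"
    return f"{link_uri}{separator}{urlencode(request_vars)}"


if __name__ == "__main__":
    print(build_submit_uri("http://example.com/schedule",
                           {"origin": "White Plains", "dest": "Grand Central"}))
```

The standard library's urlencode takes care of escaping spaces and other reserved characters in the values.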
Commuter: Needs
to know the next time a Metro-North train departs from a station on the
Dialog: (Note: Voice Server – VS)
1: VS: Welcome to the Metro-North
2: VS: This service will help you determine departure
and travel time
3: VS: from your point of departure.
4: VS: Let's start,
5: VS: Where are you departing from?
COMMUTER: <speaks the train station name> $origin
6: VS: What is your destination?
COMMUTER: <speaks the train station name> $dest
7: VS: How long will it take you to arrive at $origin?
COMMUTER: <speaks time in minutes>
8: VS:
1. Determines when the commuter will arrive at the start station, plus a 5-minute grace period: $departTime + $gracePeriod.
2. Determines the next train to arrive at the start station at or after this time: $arrive.
3. Then calculates the total travel time on the train to get to the destination station: $totalTime.
4. The next train will depart from $origin at $arrive, with a grace period of $gracePeriod. The total travel time will be $totalTime.
9: VS: Would you like to submit another train
itinerary?
COMMUTER: <speaks YES or NO>
10: VS:
1. Determines YES or NO
a. If YES then go to step 5
b. If NO then "Thank you for using the Metro-North Harlem line VOICE train scheduler."
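The calculation in step 8 of the dialog can be sketched as follows. The timetable entries, dates, and function name are invented for illustration and do not reflect an actual Metro-North schedule; only the 5-minute grace period comes from the dialog above.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=5)  # the 5-minute grace period from step 8

# Hypothetical timetable, sorted by departure time:
# (departure from the origin station, arrival at the destination station).
TIMETABLE = [
    (datetime(2001, 1, 1, 8, 0), datetime(2001, 1, 1, 8, 40)),
    (datetime(2001, 1, 1, 8, 30), datetime(2001, 1, 1, 9, 10)),
    (datetime(2001, 1, 1, 9, 0), datetime(2001, 1, 1, 9, 40)),
]


def next_train(now, minutes_to_station, timetable=TIMETABLE):
    """Return (departure time, travel time) of the first catchable train.

    Returns None when no train leaves late enough for the commuter.
    """
    # Step 8.1: time the commuter reaches the station, plus the grace period.
    ready_at = now + timedelta(minutes=minutes_to_station) + GRACE_PERIOD
    for departs, arrives in timetable:
        # Step 8.2: first train leaving at or after that time.
        if departs >= ready_at:
            # Step 8.3: total travel time on the train.
            return departs, arrives - departs
    return None
```

For a commuter who calls at 7:50 and needs 15 minutes to reach the station, the sketch skips the 8:00 train and selects the 8:30 departure with a travel time of 40 minutes.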