I haven’t really considered VoiceXML in a good long time. I remember the early excitement (before the dot-com bust) when a few start-up companies were helping to bring computer telephony into the web age: Voxeo, General Magic (extinct), VoiceGenie (now part of Genesys), et al. Like millions (OK, maybe thousands) of others, I built a simple “hello world” app within a demo environment hosted by one of the aforementioned firms. It was heady stuff, especially for those who remembered what computer telephony (CT) programming was like. Time passed, the enterprise PBX vendors introduced VoiceXML development environments (which, by the way, weren’t that easy to master) into their voicemail servers, and what was new became accepted.
I recently heard about Twilio through my membership in New York Tech Meetup. This SF-based company has a simple XML spawn called TwiML that makes voice programming approachable for the masses, while reaching past VoiceXML’s core strength, speech handling, to take on basic call control.
In other words, you can build useful features (conferencing, click-to-talk, SMS) that can be embedded in web apps with minimal effort and no hardware, using Twilio’s hosted service.
Twilio’s focus is on simplicity, like another well-known social media application that happens to have a similar-sounding name. While you won’t have the capabilities (or the complexities) of VXML, you’ll be able to get started quickly.
And no, you won’t have access to speech recognition grammars. It’s just plain phone pad (aka DTMF) processing.
Twilio provides a few basic verbs, which include say (a text-to-speech function), gather (collect DTMF), record and play (for voice mail), SMS, and dial. To get a sense of how to put these together to create a voice script, I’ve excerpted below part of a demo application to implement an auto attendant:
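<?php
// Note: the markup was lost from this excerpt; the tags are restored from TwiML's verbs, and the Dial action URL is a placeholder.
header("content-type: text/xml");
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
switch ($destination) {
case 'hours': ?>
<Response>
<Say>Initech is open Monday through Friday, 9am to 5pm</Say>
<Say>Saturday, 10am to 3pm and closed on Sundays</Say>
</Response>
<?php break;
case 'attendant': ?>
<Response>
<Say>Please wait while we connect you</Say>
<Dial action="dial-status.php">212-555-1234</Dial>
</Response>
<?php break;
}
?>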
As with any server-side scripting, you can marry the TwiML with your favorite interpreter (PHP, Python, etc.). This example is straightforward, though there are some subtleties in the call control that TwiML supports. The dial verb opens a separate leg to the called party and then bridges the two sides. The TwiML server then stays on the call to supervise its progress in case the called side is busy. You provide a URL (as shown) to take action on the call completion status, which is passed as an HTTP request parameter.
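To make the completion-status handoff concrete, here is a minimal Python sketch of what the script behind that URL might return. It rests on a few assumptions that are not from the demo: that the status arrives in a request parameter named DialCallStatus with a value of "completed" for a bridged call (check Twilio's docs for your API version), and that falling back to voice mail is the behavior you want. The function name dial_result_twiml is hypothetical.

```python
import xml.etree.ElementTree as ET


def dial_result_twiml(params):
    """Build a TwiML reply for the URL named in the dial verb's action attribute.

    `params` is a dict of the HTTP request parameters sent to that URL.
    """
    # Assumption: the status parameter is called DialCallStatus and is
    # "completed" when the call was answered and bridged.
    status = params.get("DialCallStatus", "")

    response = ET.Element("Response")
    if status == "completed":
        # The conversation already happened; just end the call.
        ET.SubElement(response, "Hangup")
    else:
        # Busy, no answer, or failure: offer voice mail instead.
        say = ET.SubElement(response, "Say")
        say.text = "Sorry, no one is available. Please leave a message after the tone."
        ET.SubElement(response, "Record", maxLength="60")
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(dial_result_twiml({"DialCallStatus": "busy"}))
```

Whatever web framework you use would serve this string with a text/xml content type, just as the PHP excerpt does with header().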
For some perspective, this level of call control wasn’t possible with VoiceXML 2.0 (not sure about the newer 3.0), and required yet another XML variant called CCXML to properly pull off. It becomes messy very quickly, so kudos to Twilio for making this easy.
In fact, using Twilio’s demo environment I was able to quickly modify the attendant to add a “find me” function by taking advantage of dial’s ability to call multiple numbers simultaneously.
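A rough sketch of that “find me” change, written in Python rather than the demo's PHP: it nests several Number elements inside a single Dial, which is how TwiML expresses simultaneous dialing. The function name, the second phone number, and the 20-second timeout are illustrative assumptions, not values from my modified demo.

```python
import xml.etree.ElementTree as ET


def find_me_twiml(numbers, timeout_seconds=20):
    """Return TwiML that rings every number in `numbers` at the same time."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = "Please wait while we connect you"

    # One Dial with several Number children rings them all at once;
    # the first party to answer gets bridged to the caller.
    dial = ET.SubElement(response, "Dial", timeout=str(timeout_seconds))
    for number in numbers:
        ET.SubElement(dial, "Number").text = number
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # 212-555-0100 is a made-up second number for illustration.
    print(find_me_twiml(["212-555-1234", "212-555-0100"]))
```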
Another big difference between the VoiceXML applications in the early part of the millennium and now is the development of web services and adoption of RESTful interfaces. It has become completely practical to embed telephony call control into web, desktop, and mobile applications.
That is a big deal. With Twilio’s REST APIs you can easily initiate a click-to-call application from a web page and then hand off the call to a TwiML script.
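For a feel of what click-to-call looks like on the wire, here is a hedged Python sketch that only builds the HTTP request and does not send it. The endpoint path and the To, From, and Url parameter names follow Twilio's 2010-04-01 REST API as I understand it, so treat them as assumptions and check the docs for the version you are on. The credentials, phone numbers, and script URL below are placeholders, and build_call_request is a hypothetical helper name.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_call_request(account_sid, auth_token, to_number, from_number, twiml_url):
    """Build (but do not send) a POST that asks Twilio to place a call."""
    # Assumption: Calls resource of the 2010-04-01 API version.
    endpoint = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"

    # Twilio fetches TwiML from `Url` once the called party answers.
    body = urlencode({"To": to_number, "From": from_number, "Url": twiml_url})

    # HTTP Basic auth with the account SID as user and the auth token as password.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode("utf-8"))

    request = Request(endpoint, data=body.encode("ascii"), method="POST")
    request.add_header("Authorization", "Basic " + credentials.decode("ascii"))
    return request


if __name__ == "__main__":
    req = build_call_request(
        "ACxxxxxxxx", "secret", "+12125551234", "+12125550100",
        "https://example.com/attendant.php",
    )
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would actually place the call, so the sketch stops short of that.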
Yes, I’m excited about Twilio. But... here’s my big quibble: with the rise of smartphones and highly capable mobile video chips, basic interactive voice response (IVR) systems ain’t enough.
What’s needed is the adoption of newer interactive voice and video response (IVVR) technologies, preferably based on Session Initiation Protocol (SIP). (I know, lots of acronyms in this post.)
IVVR would allow VoiceXML scripts to push web pages or videos in response to a DTMF or speech command. In other words, a spoken request for travel directions would result in a Google map popping up on a mobile phone’s display.
It would be quite nice if Twilio and/or others could support this ability in their next generation of VoiceXML software. Just a thought.