About a year ago, VoiceXML pioneer Voxeo started a cloud-based unified communications service called Tropo. It’s a tempting free development environment in which you craft unified communications apps in your favorite web programming language without having to wade too deeply into VoiceXML tags and voice grammars. The words “free”, “development environment”, and “VoiceXML” struck the right note with me.
My last telephony project involved a straightforward SMS headline alert service built from Twilio’s REST APIs. While I appreciate the simplicity of Twilio’s approach for getting started quickly—and in fact, I’m now addicted to my cell phone news messages—it was less than straightforward to evolve this software into an interactive environment.
With this latest influx of voice-in-the-cloud companies, “simple” and “easy to use” are relative terms: they all support rapid development.
What I had in mind was a text-to-speech news reader that would let me scroll through headlines and listen to story contents when something caught my ear. With Twilio, I would have had to create a separate TwiML file (Twilio’s own VoiceXML with training wheels) to collect speech commands and activate the appropriate scripts.
That’s not a lot of extra work, but Tropo lets me program directly in PHP and, if necessary, embed VoiceXML tag sauce into my text-to-speech stream.
Before I present my mini-project, let’s step back to gain needed perspective on the meaning of it all. Most of what I’m now able to do with web-service APIs, web scripting, and RSS feeds was just not possible in the early days (pre-dot-com bust) of hosted VoiceXML. Voice in the cloud is a marriage of hosted VoiceXML and web data technologies, creating practical and useful (I know these words are overused, but it’s true) unified communications.
With an afternoon to kill while I was waiting for a presentation I had recorded to be transcribed, I reworked my original PHP script that pulled headlines and stories from my Yahoo Pipes feed.
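The feed-reading step can be sketched roughly as follows. This is a minimal illustration, not the original script: the function name parseHeadlines(), the array keys, and the inline sample feed are all assumptions, and it expects a plain RSS 2.0 document on PHP 7 or later.

```php
<?php
// Hypothetical helper: turn an RSS 2.0 document into a list of
// headline/story pairs. Names and array keys are illustrative only.
function parseHeadlines(string $rssXml): array
{
    // Collect XML parse errors quietly instead of emitting warnings.
    libxml_use_internal_errors(true);
    $feed = simplexml_load_string($rssXml);
    if ($feed === false || !isset($feed->channel)) {
        return array();   // unparseable or empty feed: no headlines
    }
    $items = array();
    foreach ($feed->channel->item as $item) {
        $items[] = array(
            'headline' => trim((string) $item->title),
            // strip_tags() drops HTML markup embedded in the description
            'story'    => trim(strip_tags((string) $item->description)),
        );
    }
    return $items;
}

// Inline sample feed so the sketch runs without network access.
$sample = <<<XML
<rss version="2.0"><channel>
  <title>Sample</title>
  <item><title>First headline</title><description>&lt;p&gt;Story one.&lt;/p&gt;</description></item>
  <item><title>Second headline</title><description>Story two.</description></item>
</channel></rss>
XML;

$RSS_Content = parseHeadlines($sample);
echo count($RSS_Content), " headlines\n";
```

In a real script the sample string would be replaced by the body of the feed, fetched however the hosting environment allows.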
Plugging in Tropo’s PHP wrapper APIs for text-to-speech (say) and voice keyword detection (ask), I constructed my voice “home page”:
// $RSS_Content holds the headlines pulled from the feed.
$recents = $RSS_Content;
$arObject = new Navarticles($recents);

if (count($recents) == 0) {
    // Speak the message before hanging up; assigning it to a variable says nothing.
    say("You have no new headlines to review. Try back later.",
        array("voice" => "kate"));
    hangup();
    return;
}

$initialPrompt = 'You have '.count($recents).' headlines to review. '
    .'To listen to messages, say next or prev or repeat. '
    .'To hear the story, say story. Your first headline is: ';

say("Hi. Welcome to News Attendant.", array("voice" => "kate"));

$initialPrompt = $initialPrompt.$arObject->headline().' ';
do {
    $event = ask($initialPrompt,
        array(
            "choices" => "next, prev, repeat, story, done",
            "timeout" => 8,
            "silenceTimeout" => 2,
            "repeat" => 3,
            "voice" => "kate",
            // A create_function() body has its own scope, so $arObject
            // has to be pulled in with "global".
            "onChoice" =>
                create_function(
                    '$event',
                    'global $arObject; navigate($event->value, $arObject);'
                ),
            "onBadChoice" =>
                create_function(
                    '$event',
                    'say("I\'m sorry, I didn\'t understand what you said.");'
                ),
            "onTimeout" =>
                create_function(
                    '$event',
                    'global $arObject; timeout($event, $arObject);'
                )
        )
    );
    $initialPrompt = ' Your headline is '.$arObject->headline().' ';
} while (true); // loops until the call ends, for example if navigate() hangs up on "done"
The service was now able to present itself in the voice of Kate—my imaginary hyper-efficient English office manager—who cheerfully announces the number of headlines to review and instructions for scrolling through titles and listening to a particular story. I could now issue commands, “next”, “prev”, “repeat”, or “story”, which on detection would trigger my back-end navigation functions.
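The back-end navigation isn't listed in this post, so here is a hedged sketch of what it could look like. Only the Navarticles constructor, headline(), and the navigate() call appear in the script above; everything else here (the Sketch-suffixed names, next(), prev(), story(), and the choice to return the text to speak rather than call say() directly) is an assumption made so the example runs outside Tropo.

```php
<?php
// Hypothetical cursor over a non-empty list of articles, each an array
// with 'headline' and 'story' keys (an assumed layout).
class NavarticlesSketch
{
    private $articles;
    private $pos = 0;

    public function __construct(array $articles)
    {
        $this->articles = array_values($articles);
    }

    public function headline() { return $this->articles[$this->pos]['headline']; }
    public function story()    { return $this->articles[$this->pos]['story']; }

    // Move the cursor, clamping at either end of the list.
    public function next() { $this->pos = min($this->pos + 1, count($this->articles) - 1); }
    public function prev() { $this->pos = max($this->pos - 1, 0); }
}

// Maps a recognized keyword to an action and returns the text to speak
// next, or null when the caller asked to finish.
function navigateSketch($command, NavarticlesSketch $nav)
{
    switch ($command) {
        case 'next':  $nav->next(); return $nav->headline();
        case 'prev':  $nav->prev(); return $nav->headline();
        case 'story': return $nav->story();
        case 'done':  return null;
        default:      return $nav->headline(); // 'repeat' and anything unrecognized
    }
}

$nav = new NavarticlesSketch(array(
    array('headline' => 'Headline A', 'story' => 'Story A'),
    array('headline' => 'Headline B', 'story' => 'Story B'),
));
echo navigateSketch('next', $nav), "\n";
echo navigateSketch('story', $nav), "\n";
```

Inside a Tropo script, the returned text would be handed to say(), and a null result would be the cue to call hangup().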
I was helped in my debugging by Tropo’s real-time logging screen, which captures call handling events and reports (bless them) PHP parsing errors.
After examining the flow of SIP invites in the log, I am also incredibly thankful that Tropo lets me remain blissfully unaware of the underlying telephony.
However, you do need to address a few speech rendering issues if you want to achieve the right level of comprehensibility. Just as web text has to be marked up to render correctly in your browser, you’ll want to tag your spoken text as well.
Small illuminating example: “12” should be spoken as twelve and not one-two. Fortunately, Tropo lets you add VoiceXML tags such as say-as to handle that particular problem.
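As a rough sketch of the say-as fix, the helper below wraps every standalone number in a say-as tag before the text goes to say(). The function name sayAsNumbers() is made up for illustration, and the default interpret-as value "cardinal" is an assumption: the values a speech engine accepts vary, so check the platform documentation for the one that fits.

```php
<?php
// Hypothetical helper: wrap each standalone run of digits in a say-as
// tag and return the result as a <speak> fragment.
function sayAsNumbers(string $text, string $interpretAs = 'cardinal'): string
{
    // Escape &, < and > so the text is safe inside markup. Quotes are left
    // alone on purpose: numeric entities such as &#039; contain digits that
    // the pattern below would otherwise match.
    $escaped = htmlspecialchars($text, ENT_NOQUOTES | ENT_XML1, 'UTF-8');

    $tagged = preg_replace(
        '/\b\d+\b/',
        '<say-as interpret-as="' . $interpretAs . '">$0</say-as>',
        $escaped
    );
    return '<speak>' . $tagged . '</speak>';
}

echo sayAsNumbers('You have 12 headlines to review.'), "\n";
// In a Tropo script the result would be passed along, e.g.
// say(sayAsNumbers($prompt), array("voice" => "kate"));
```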
Other glitches often center on pronunciation. I wanted “prev” to be spoken more like “preeev”. VoiceXML handles this by letting you specify core pronunciation units known as “phonemes”, which let me fine-tune vowel and consonant sounds.
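Here is a minimal sketch of the phoneme approach, assuming the speech engine honors the standard phoneme tag with the IPA alphabet. The helper name withPronunciations() and the IPA transcription for “prev” are my own illustrations, not the exact markup I used.

```php
<?php
// Hypothetical helper: wrap listed words in a phoneme tag carrying an
// IPA pronunciation. $lexicon maps word => IPA string.
function withPronunciations(string $text, array $lexicon): string
{
    $escaped = htmlspecialchars($text, ENT_NOQUOTES | ENT_XML1, 'UTF-8');
    if (count($lexicon) === 0) {
        return '<speak>' . $escaped . '</speak>';
    }
    $lexicon = array_change_key_case($lexicon, CASE_LOWER);

    // One alternation pattern, applied in a single pass, so inserted
    // markup is never rescanned for further matches.
    $words = array_map(function ($w) {
        return preg_quote($w, '/');
    }, array_keys($lexicon));
    $pattern = '/\b(' . implode('|', $words) . ')\b/iu';

    $tagged = preg_replace_callback($pattern, function ($m) use ($lexicon) {
        $ipa = $lexicon[strtolower($m[1])];
        return '<phoneme alphabet="ipa" ph="' . $ipa . '">' . $m[1] . '</phoneme>';
    }, $escaped);

    return '<speak>' . $tagged . '</speak>';
}

echo withPronunciations('Say next or prev.', array('prev' => 'priːv')), "\n";
```

Whole-word matching keeps “previous” untouched while still catching “prev” on its own.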
For a business-oriented app, you’ll need to pay much more attention to the text stream than I did. In fact, you’ll be forced to build a layer of software to parse the text, removing extraneous punctuation, and adding VoiceXML tags where needed.
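A first cut at such a clean-up layer might look like the sketch below. It only covers the punctuation side; the function name cleanForSpeech() and the particular set of symbols it strips are assumptions, and a production version would need rules tuned to the actual feed.

```php
<?php
// Hypothetical clean-up pass for feed text before it is spoken.
function cleanForSpeech(string $raw): string
{
    // Remove HTML markup, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($raw), ENT_QUOTES, 'UTF-8');

    // Replace symbols a speech engine might read aloud or stumble over.
    $text = preg_replace('/[*_#|~^<>\[\]{}()"]+/u', ' ', $text);

    // Collapse runs of punctuation ("!!!", "...") to the first mark.
    $text = preg_replace('/([.!?,;:])[.!?,;:]+/u', '$1', $text);

    // Close up any space left in front of punctuation, then normalize whitespace.
    $text = preg_replace('/\s+([.!?,;:])/u', '$1', $text);
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo cleanForSpeech('<b>Breaking!!!</b> Markets *surge*  &amp; rally...'), "\n";
```

The cleaned string still contains raw characters such as “&”, so it should be escaped before any tags are wrapped around it.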
But for my purposes (keeping informed while using a cell phone in hands-free mode while driving), this solution works just fine.
I’m glad you’re liking Tropo. One of the things you mention here is the way Tropo offers an easy way to do the simple case (just say() something and we speak it) but if you want to do things like have precise control over the pronunciation, you can access that as well. But it goes deeper than that, even.
Those SIP messages you saw flying back and forth in the debugger and all those Java voice API calls you saw firing are accessible to your app. If you WANTED to care about the underlying telephony, you could.
But like you, we’re glad you don’t have to. Telephony sucks. We don’t want you to have to think about it.
Also, have you tried sending an SMS to your app? Your existing number can handle SMS. Or hooking up an IM name or Twitter handle to it? The same code you list above would also respond on all those channels. No extra work required on your part. You can certainly adjust the prompts and UI to match the medium if you’d like, but you don’t have to.
We did delve a little into VoiceXML, adding a phoneme tag to get pronunciation just right. And we are very appreciative that Tropo lets us embed tags into “say” as well as “ask”, instead of having to create separate VoiceXML files. The option to build grammars and voice forms is all there in Tropo, if you have the expertise.
Tropo is a true environment for rapidly building unified communications applications.