Cloud Telephony: High Availability with sipXecs

My world was shaken a little when Amazon’s Elastic Compute Cloud, or EC2, collapsed two weeks ago, temporarily closing the doors on such sites as Quora, Reddit, Foursquare, and others.

The trigger appears to have been a mysterious network event that occurred in an availability zone of Amazon’s “US-EAST-1” region, leading to delays in Amazon’s Elastic Block Store (EBS) and eventually bringing the show to a stop.

If you are not familiar with AWS—oh sorry, Amazon Web Services—and its terminology then most of the accounts in the news may have left you more, not less, anxious about the state of cloud computing.

Because I recently completed a DIY project (see reference below) in which I tested a very intriguing open-source SIP comm server called sipXecs (pronounced sipX, the ecs is silent) in Amazon’s EC2, my free-floating cloud concerns now settled on cloud telephony.

To paraphrase some of the overheated headlines I’ve read: Has Cloud Telephony Lost its Innocence?

So I decided it was the right time to chat with Mike Picher, author of Building Enterprise Ready Telephony Systems with sipXecs, and also the guy who gave me a tip on configuring proper DNS for my previously mentioned handiwork.

pbx4 and pbx4 are failovers with a higher priority number. pbx1, pbx2, and pbx3 are primary servers given weights of 50, 25, and 25, which determine how load is shared among them.

The real story about Amazon’s cloud crisis is that Internet technology is still well suited to handle server and networking failures, whether those racks are in the cloud or in the telecom room down the hall. And thanks to the foresight of the founding fathers, failure recovery is, has been, and will always be an inherent part of the grand design.

However, there are old truths that must be obeyed: as in, if you don’t add backups of data and computing in geographically distributed locations, or as Amazon calls them, “availability zones”, then bad things will happen.

Picher, who is Director of Technical Services at eZuce, pointed out that DNS has built-in server recovery mechanisms that are incredibly easy to exploit. It’s a matter of tweaking an SRV record, the same configuration text that allowed me to assign a port number to a subdomain in my SIP experiment.

In designing a recoverable sipXecs system, you’ll of course need to have multiple servers—they’re your failover resources. But you also must be sure to specify different priorities—it’s an attribute in the SRV records.

Servers assigned a higher priority level, which in DNS works out to a lower numeric value, will always receive connections ahead of any lower priority server.
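To make the priority rule concrete, here is a minimal Python sketch of how a SIP client might choose among SRV records. It is an illustration, not sipXecs code: the host names, port 5060, the priority values 10 and 20, and the weight on pbx4 are all assumptions of mine. Only the 50/25/25 weights come from the figure caption above. The selection rule is roughly the one described for SRV records in RFC 2782: take the lowest priority number, then choose within that group in proportion to weight.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower number = preferred
    weight: int    # relative share of traffic within one priority level
    port: int
    target: str


# Hypothetical records: names, port, and priority values are assumptions.
RECORDS = [
    SrvRecord(10, 50, 5060, "pbx1.example.com"),
    SrvRecord(10, 25, 5060, "pbx2.example.com"),
    SrvRecord(10, 25, 5060, "pbx3.example.com"),
    SrvRecord(20, 100, 5060, "pbx4.example.com"),  # failover only
]


def pick_server(records, rng=random):
    """Pick one record from the lowest-numbered priority group, by weight."""
    best = min(r.priority for r in records)
    group = [r for r in records if r.priority == best]
    weights = [r.weight for r in group]
    if sum(weights) == 0:
        # All weights zero: fall back to an even split.
        return rng.choice(group)
    return rng.choices(group, weights=weights, k=1)[0]
```

With these records, pbx4 is never chosen while any priority-10 server is in the list, and pbx1 should see about half of the new connections.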

How does this work in practice? If there is a failure in the part of the cloud where those higher-priority servers reside, then requests from SIP endpoint devices will be automatically routed to the lower-priority backup servers.

Failover accomplished!

When servers with a higher-priority level become available, new SIP connections will then automatically “fail back” to them—i.e., they receive the resolved DNS address of the new live servers.
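The failover and fail-back behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the server names and priority values are hypothetical, and the `is_up` callback stands in for what a real endpoint learns by attempting a connection and timing out.

```python
# Hypothetical (priority, target) pairs; lower number = preferred.
SERVERS = [
    (10, "pbx1.example.com"),
    (10, "pbx2.example.com"),
    (20, "pbx4.example.com"),
]


def candidates(servers, is_up):
    """Return reachable targets from the best priority group that has any."""
    reachable = [(prio, target) for prio, target in servers if is_up(target)]
    if not reachable:
        return []
    best = min(prio for prio, _ in reachable)
    return sorted(target for prio, target in reachable if prio == best)


# Simulate an outage that takes out both primary servers.
down = {"pbx1.example.com", "pbx2.example.com"}
during_outage = candidates(SERVERS, lambda t: t not in down)

# The primaries come back: new connections return to them.
down.clear()
after_recovery = candidates(SERVERS, lambda t: t not in down)
```

Here `during_outage` holds only the backup server, and `after_recovery` holds the two primaries again. No state about the outage is kept anywhere, which is why fail-back needs no extra work.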

Picher reminded me that there’s nothing new about any of this: the same idea is used in making mail servers (and, really, any other kind of server) highly available on the Internet.

SipXecs has kept true to the Internet’s point-to-point philosophy, allowing it to pull off high availability in voice communications as well.

In their approach, the SIP proxy servers step out of the call signaling path after setup, so that there’s not a problem if sipXecs fails during a call: it’s still just two endpoints exchanging signaling and media.

That’s not necessarily true in other SIP implementations.

Picher: “Once the call is set up, it’s device to device, so that server can be rebooted, it can be blown off the face of the earth, and that call will keep going… It’s not an Asterisk-type solution. In Asterisk, it wants to sit in the middle of the conversation so it can listen for touchtones.”

The Amazon dashboard captured the start of the outage on April 21.

Picher, of course, was referring to the SIP concept of a B2BUA, or back-to-back user agent, which allows a proxy server to relay signaling between endpoints and potentially act on it.

It’s not the approach of sipXecs, and in staying with an Internet model of smart endpoints, it can rely on DNS to do the failover work.

Does DNS handle all the failover issues for sipXecs? For media services (voice mail, auto attendant, conferencing, paging, etc.), the answer is currently no.

And what about Amazon-like public cloud solutions as a home for telephony?

Picher thinks it makes great sense for small companies, and that if large companies use Amazon’s EC2, it will more likely be as a backup to their own internal systems or to temporarily increase capacity to meet seasonal demand. Bigger players generally have tighter requirements for control and security, which may not necessarily be met in cloud environments.

In the recent past, Fortune 1000 and other large companies have dealt with their Web and data needs by building their own intranets and buying racks and lots of cabling, essentially “creating a private cloud.” This same need for IT control will likely drive voice communications into the private cloud as well.

My conclusion from the short time I spent chatting with Mike was that the cloud, whether public or private, is not a place of lost innocence. Companies that place their computing resources in Amazon or other cloud services without providing appropriate and smart recovery mechanisms are taking well-understood risks. Don’t blame the Internet.

And as for the remaining recovery holes in sipXecs not addressed by DNS? Picher said upcoming releases scheduled over the next two quarters will address both media services and on-premises call control, which will help achieve even higher availability.

Thanks, Mike!
