In our interview this month, Cisco Systems’ Cullen Jennings offers this call to arms for SIP (Session Initiation Protocol): “The vendors need to get on with implementing the standards that are made, and the standards guys need to hurry up and finish their standards.” And he would know. Jennings has spent his career both helping define IP telephony standards and developing products based on them. As a Distinguished Engineer in Cisco’s Voice Technology Group, Jennings’s current work focuses on VoIP, conferencing, security, and firewall and NAT traversal. His primary responsibility is setting the direction of the technology that will make up the next generation of Cisco’s voice products, especially those concerned with conferencing, presence, and rich media systems.
Jennings is also actively involved with the IETF (Internet Engineering Task Force), where he serves as realtime applications area director. In this role he leads the IETF’s activities involving voice, video, and instant messaging. Jennings also makes key contributions to all of the SIP security work at IETF. He was the original designer of SIP’s certificate management system and most recently was responsible for the SIP Identity RFC.
Joining Jennings in this month’s discussion is Doug Wadkins, chief technology officer for Edgewater Networks, a provider of IP voice and video solutions.
Wadkins and Jennings worked together at Vovida Networks, which developed open source call-control software for IP networks. Wadkins has spent more than 20 years evaluating technology for investment and acquisitions, and leading technology teams. Prior to joining Edgewater, Wadkins worked at Cisco Systems, where he held a number of positions focusing on technology evaluation and integration strategy for corporate business development and engineering and marketing roles in the voice technology group.
DOUG WADKINS About 10 years ago people started using the Internet to communicate almost like ham radio enthusiasts would use that medium, and they were using H.323 as a protocol. How is SIP different from that, and how did it evolve into what it has become?
CULLEN JENNINGS That’s a complicated question. For a long time there have been two other major signaling protocols for setting up voice: MGCP (Media Gateway Control Protocol) and H.323. So let me talk about how those compare with SIP.
MGCP was a very master-slave device control-type protocol. The idea was that you would have a call agent that had all the intelligence to do the call control, and it very closely controlled the device. The device was very basic and sent out simple things like key-press information, and it received commands, such as start streaming this audio to your speaker from this RTP (Realtime Transport Protocol) stream. It allowed people to make simple devices.
But it made for a fairly complex, intelligent thing in the middle that controlled it all. SIP was really a direction to push the intelligence toward the edge of the network, out toward the actual phones or softphones or whatever device you were using. SIP was also very much focused on multimedia, instant messaging, voice, and video. We speculate that maybe someday there will be a new medium called Smellovision and the question is: Will all of your SIP stuff continue to work with that without actually upgrading the middle of the network? With SIP, the idea was to push the features and complexity out to the edge and define the features in such a way that different phones that were made at different times and supported different features could all interoperate.
H.323 was also an architecture where a lot of the computation was pushed toward the edge. Yes, there were gatekeepers that helped route things, but a fair amount of it did happen at the edge. The problem with H.323—and it was really a large part of what drove the move to SIP—is that each time people wanted to add a new feature, they had to do a complete new standard that defined this new feature, and then every device in the network had to be upgraded to support that before anyone could really start using it.
The idea with SIP was that instead of defining features, it defined primitives that devices could support. For example, you might have a phone that supported a couple of primitives, and my phone might have some advanced features that took advantage of those primitives as building blocks so that it could interoperate its complex features with your device. With SIP, my phone’s features could interoperate with your phone, even though those features were never imagined when your phone was designed.
This made for a complicated design, but it also made for a system that allowed us to add new features without upgrading all of the devices in the network, which was a very difficult problem that really slowed down feature deployment. That, combined with adding features into SIP that H.323 never had—such as presence information and subscriptions and notifications about various things that were going on—provided a lot of flexibility beyond what some of the traditional protocols had. It’s not that it would have been impossible to do some of these things in other protocols; it was just that it would have been very difficult, and SIP had this design focused around features right from the beginning. That’s the essence of how SIP compares with H.323 and MGCP.
There have been many proprietary protocols along the way, and SIP compares at various levels with those. Today, however, the bulk of VoIP signaling development is all done around SIP. It’s pretty much the predominant signaling protocol at this point.
DW It’s clear that TDM (time-division multiplexing) systems are moving toward IP-based systems. In the past TDM PBXs had a proprietary signaling protocol that went to the handset itself, and many IP PBXs today have followed that same pattern. Is SIP changing that, or will SIP simply be the way to interconnect those PBXs over the wide area network?
CJ To a certain degree the most important point is that SIP is able to interoperate these functions and services between all of these PBXs, no matter what the PBXs do themselves. Many of the customers who buy PBXs today, however, do not want to have a lock-in type of environment. They don’t want their phones to cost outrageous amounts of money compared with what non-PBX phones cost. They are asking the vendors to build standards-compliant protocols between the phones and the PBXs.
You’ll see that most of the major vendors have SIP running between their phones and the PBXs. This provides the opportunity for third-party phones to use a standardized protocol to interoperate with the PBX, which has been very important to many PBX customers. I think you’ll see that the bulk of PBX manufacturers are actually moving toward using SIP between the phone and the PBX, and using SIP-compliant phones.
DW What about open source projects? Has Asterisk or any other open source project had any real impact on SIP in the enterprise?
CJ I’m sure if you ask lots of different people this question, you’ll get lots of different answers. Certainly, Asterisk is the most popular open source PBX right now. The Vovida PBX system, which you and I both worked on a long time ago, and sipX, which was the Pingtel-based system, are popular as well.
All of those provide various levels of SIP features. Asterisk is the most popular of all of those today.
Asterisk is built around a traditional design where there are voice channels, and the Asterisk system works on a model of connecting these channels together. The architecture is very much about building a smaller phone system. It’s not about a complex SIP-type presence or instant messaging or unified communications system. You hear a lot about unified communications today. Asterisk is not about that.
It’s about building something that can do voice calls, connect a bunch of phones, do some IVR (interactive voice response), a little conferencing, recording, and some voicemail. The most exciting part about Asterisk to me in enterprises is that this is something that an IT guy can download, install on a PC, start playing with, and maybe build some new little feature that integrates an existing voice system with the overall business process. Today we don’t find it weird at all that the IT department might make sure that when some machine goes down, somebody gets an e-mail. With Asterisk, the IT people can write some scripts so that somebody gets a phone call or maybe a couple people get a conference call, and those are connected with various business process services.
That ability to take functions that have traditionally been done in call centers and suddenly bring them out so that end users can start changing the phone system and interconnecting phone systems with the rest of the systems they build—that’s pretty powerful. I think that that’s one of the most interesting characteristics of Asterisk in particular and these open source projects in general.
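As a rough illustration of the kind of script described above, the sketch below builds the text of an Asterisk call file, a small plain-text file that asks Asterisk to originate a call. The channel name `SIP/1001` and the sound file name `server-down` are hypothetical placeholders, and the helper `render_call_file` is an invented name for this example only.

```python
def render_call_file(channel, application, data, max_retries=2, retry_seconds=60):
    """Return the text of an Asterisk call file as a string.

    channel      -- who to ring, e.g. "SIP/1001" (hypothetical extension)
    application  -- dialplan application to run once the call is answered
    data         -- argument passed to that application
    """
    lines = [
        f"Channel: {channel}",
        f"MaxRetries: {max_retries}",
        f"RetryTime: {retry_seconds}",
        f"Application: {application}",
        f"Data: {data}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Ring the on-call phone and play a (hypothetical) recorded alert.
    print(render_call_file("SIP/1001", "Playback", "server-down"), end="")
```

A monitoring script would typically write this text to a temporary file and then move it into Asterisk's outgoing spool directory (commonly `/var/spool/asterisk/outgoing`); the exact path and permissions depend on the installation.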
DW Asterisk doesn’t actually use SIP for its interconnection protocol. It has its own protocol, IAX (Inter-Asterisk eXchange). How is that different from SIP, and why was that used instead of SIP?
CJ Well, I spent a fair amount of time with Mark Spencer, who is the lead designer and inventor behind the IAX protocol. Asterisk has the ability to translate to lots of different protocols, including SIP, but its key protocol is IAX. One of its major differences from every other VoIP protocol right now is what would be called in traditional PSTN (public switched telephone network) terms, in-band versus out-of-band signaling.
What happens with SIP, H.323, and MGCP is that the signaling goes over one path—and it might go through several different computers that help route it and provide features to it—between two phones. Then the RTP—the actual audio data that’s being transported between the two phones—goes directly between the two phones. It doesn’t follow the same path.
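The split between the signaling path and the media path can be seen in a single SIP request. In the minimal sketch below, the Via headers record the hops the signaling has taken, while the SDP body names the address and port where the caller wants to receive RTP directly. The sample message, host names, and the helper names `signaling_hops` and `media_endpoint` are all made up for illustration, and the parsing is deliberately simplistic (it only handles a session-level `c=` line and a single audio stream).

```python
SAMPLE_INVITE = (
    "INVITE sip:bob@example.org SIP/2.0\r\n"
    "Via: SIP/2.0/UDP proxy.example.org;branch=z9hG4bK2\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK1\r\n"
    "From: <sip:alice@example.com>;tag=1\r\n"
    "To: <sip:bob@example.org>\r\n"
    "Call-ID: abc123@192.0.2.10\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Type: application/sdp\r\n"
    "\r\n"
    "v=0\r\n"
    "o=alice 1 1 IN IP4 192.0.2.10\r\n"
    "s=-\r\n"
    "c=IN IP4 192.0.2.10\r\n"
    "t=0 0\r\n"
    "m=audio 49170 RTP/AVP 0\r\n"
)


def signaling_hops(message):
    """List the sent-by value of each Via header, newest hop first."""
    head, _, _ = message.partition("\r\n\r\n")
    hops = []
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "via":
            # value looks like " SIP/2.0/UDP host[:port];params"
            hops.append(value.strip().split()[1].split(";")[0])
    return hops


def media_endpoint(message):
    """Return (address, port) where the sender asks to receive audio RTP."""
    _, _, body = message.partition("\r\n\r\n")
    address = port = None
    for line in body.split("\r\n"):
        if line.startswith("c="):
            address = line.split()[-1]
        elif line.startswith("m=audio"):
            port = int(line.split()[1])
    return address, port


if __name__ == "__main__":
    print("signaling passed through:", signaling_hops(SAMPLE_INVITE))
    print("media should be sent to:", media_endpoint(SAMPLE_INVITE))
```

In this example the request has traversed a proxy, but the RTP address in the body points straight at the caller's phone, so the audio need not pass through the proxy at all.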
IAX actually tunnels both the media and the signaling over the same channel, which makes NAT traversal easier because the phone always initiates the signaling connection, and it can just go through this same tunnel that is already set up.
The downside is that the media has to go through all of the processing elements, which puts a lot of load on them. It also means that the media has to take a longer path and go through more devices, which slows it down and adds to the latency. Both the scalability and voice-quality problems that occur by not sending the media directly are the primary reasons why none of the other existing systems went with that approach.
Asterisk works best in a phone system in a small office where you’ve got a phone on a LAN directly connected to Asterisk. It’s easy to run all of it on some small processor on a small system, and it works fine in that environment, but not in wider-scale environments. That’s another major difference from SIP.
DW I’d like to discuss some of the better-known VoIP services, such as Skype, Google Talk, and the voice services that Yahoo provides. Skype is the best-known worldwide as it has the largest user base. Skype does not use SIP, so how relevant is SIP for these types of service, and why were some of those technical choices made? In other words, why didn’t Skype use SIP?
CJ Skype is the premier service right now, not only because of its large user base, but also because of its excellent marketing and its basic usability. Its voice quality is good. It works through lots of NAT-type situations. You download it, it’s easy to use, and it provides a very good user experience.
In general, when new technologies are coming into a marketplace and replacing old ones, initially people will develop technologies that work only with themselves. That’s because they can control both ends of the connection at this point. They can upgrade both ends of the connection, and it’s much easier to get it all to work. Examples of this are the original e-mail systems, which didn’t send e-mail between each other. You couldn’t use any e-mail you wanted. You had a CompuServe client and you could send e-mail to other people on CompuServe. That was a very closed, proprietary system that was easier to get deployed initially.
Then, as more people started using those systems and the technology advanced, people wanted to start connecting them together, and eventually wanted to be able to choose their own clients and be able to connect them up to the system in a very interoperable way. So, traditionally with new technologies, you have seen fairly proprietary solutions deploy first and then migrate toward much more standardized solutions. I believe we’re already seeing that with Skype to a certain degree.
The people at Skype looked at what was out there and started with a protocol that they could easily get to work. Now, one of the very good insights that I believe Skype has shown the market is using P2P-type technology to move things out toward the edge and as much as possible reduce what they had to do on centralized servers. This reduced Skype’s support and operational costs, as well as bandwidth, data structures, infrastructure, and the amount of equipment needed to run its service. It still does use centralized services for certain things, such as controlling authorizations, making names, and security.
Initially, Skype users could call only other Skype clients. Then they moved to wanting to be able to connect to the PSTN. How Skype actually does this is by moving all its calls to SIP. That’s the protocol it uses to connect to all of the gateways that it uses in various countries to terminate to the PSTN and do SkypeIn and SkypeOut. It also uses SIP for connecting to other systems.
When Skype started, however, SIP didn’t have everything it needed, so the company added some proprietary elements. For example, it wanted to have the option of pushing advertising or certain messages down to Skype clients. Controlling the protocol was a good way to do that initially. Over time, I think we will see Skype migrate to more advanced forms of standards-based protocols where it can take advantage of advances in SIP and be able to use SIP that way.
Moving toward using SIP allows Skype to connect more devices to its network and have more users. With Skype, there is very much a network-type effect: the more people who have it, the more valuable it is to be on it. Skype is trying to get very big very quick, and it has been amazingly successful.
DW What do you see as the biggest technical hurdles to get over for SIP to be more widely adopted? You mentioned the NAT issue, which has obviously been a problem for several years, but I think we’re finally getting to the point where the ICE (Interactive Connectivity Establishment) and other NAT-related work is defined well enough that people can implement solutions to the problem. There are always concerns about security, some justified and some not. What do you think the biggest problems are that still need to be solved, or are maybe partially solved, and which ones are solved but not implemented?
CJ The most important thing that’s blocking widespread deployment is the ongoing standards work at the IETF around NAT traversal. (For more information on the NAT traversal challenge, see Robert Sparks’s article on page 22 in this issue.) That’s where the technical work has been widely done, but the IETF hasn’t managed to finish up all the details on it and get it published as an RFC.
In the security area there are some solutions that have been published as RFCs but they haven’t been fully implemented yet, so let me talk about security at a couple different levels. First, let’s consider authorization: knowing that the person making the call is really allowed to be making the call. That’s mostly done with digest-based authorization. It’s much like HTTP digest, where you don’t send your password in the clear over the network. If I am trying to make a phone call, my phone sends a request to the server, the server challenges the request, and the phone proves to the server that it knows my password. This allows the phone to authenticate to the server. This is widely implemented, and it works quite well. This really addressed the major concern in lots of voice systems, which is toll fraud.
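The challenge-and-response exchange described above can be sketched in a few lines. The example below follows the basic MD5 digest calculation from the HTTP digest scheme (without the optional `qop`, `nc`, and `cnonce` fields), which is the form SIP borrows. The user name, realm, password, and nonce are invented values, and `digest_response` and `server_accepts` are illustrative helper names rather than part of any SIP library.

```python
import hashlib
import hmac


def _md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce):
    """Compute the digest a phone sends back after being challenged."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")  # secret half
    ha2 = _md5_hex(f"{method}:{uri}")                 # request half
    return _md5_hex(f"{ha1}:{nonce}:{ha2}")


def server_accepts(stored_password, username, realm, method, uri, nonce, response):
    """Recompute the digest on the server side and compare in constant time."""
    expected = digest_response(username, realm, stored_password, method, uri, nonce)
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    # All values below are hypothetical.
    nonce = "f84f1cec41e6cbe5aea9c8e88d359"  # issued by the server in its challenge
    answer = digest_response("alice", "example.com", "s3cret",
                             "INVITE", "sip:bob@example.org", nonce)
    print(server_accepts("s3cret", "alice", "example.com",
                         "INVITE", "sip:bob@example.org", nonce, answer))
```

The password itself never crosses the network; only the hash that depends on both the password and the server's one-time nonce does, which is what makes a captured response hard to reuse.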
The next level of security has to do with encrypting the information so people don’t see whom you’re calling. This is handled by TLS (Transport Layer Security), which is the name for the latest version of SSL. It’s the same stuff we use to encrypt our HTTP connections, and it’s almost identical to how it’s used in SIP. TLS can encrypt and protect the integrity of all your signaling and provide authentication of the servers.
Now, this has been defined in SIP for a long time. It’s implemented in a fairly significant portion of products, but certainly not all. And even in lots of places where it is implemented, it’s not deployed because people have just not found practical attacks that they were worried about whose prevention required using it. But I think you’ll see TLS being adopted more.
Something that has been defined for a little while and implemented in a few products, but not many, is SRTP (secure RTP), which is a way of encrypting the actual media stream—the audio and the video. You’ll see more of this coming on. People are quite interested in having encrypted voice.
Something that exists in almost no products today, and that has just recently become RFC 4474, is identity solutions. It’s fine to have an encrypted call, but it’s not particularly useful unless you know whom it’s an encrypted call to. This identity work is about providing a strong cryptographically provable statement about the caller ID.
In the PSTN today, caller ID is highly spoofable. I don’t know if people know this, but there are 800 numbers that let you enter whatever phone number you want your call to look like it’s coming from. It doesn’t even have to be a valid phone number. You can make calls that the caller ID shows up as coming from 911. So, caller ID is very unsecure on the PSTN.
One of the reasons I think this is really important has to do with what in my mind is the biggest unsolved security problem in SIP: spit (spam over Internet telephony) and spim (spam over IM). This cryptographically authenticated caller ID is a good building block for being able to build whitelists, blacklists, and reputation services, which can help you start to deal with spit in telephone systems.
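To make the building-block idea concrete, here is a minimal sketch of a call-screening policy that only trusts its lists when the caller's identity has been cryptographically verified. The class name `CallPolicy`, the three outcomes, and the example addresses are assumptions made for this illustration; the verification step itself (checking the signature on the caller's identity) is assumed to have happened elsewhere and is passed in as a boolean.

```python
from dataclasses import dataclass, field


@dataclass
class CallPolicy:
    whitelist: set = field(default_factory=set)
    blacklist: set = field(default_factory=set)

    def decide(self, caller_uri, identity_verified):
        """Return 'accept', 'reject', or 'challenge' for an incoming call."""
        if not identity_verified:
            # An unverified caller ID could be spoofed, so neither list
            # can be trusted; fall back to some extra check.
            return "challenge"
        if caller_uri in self.blacklist:
            return "reject"
        if caller_uri in self.whitelist:
            return "accept"
        return "challenge"  # unknown but verified caller


if __name__ == "__main__":
    policy = CallPolicy(whitelist={"sip:alice@example.com"},
                        blacklist={"sip:spammer@example.net"})
    print(policy.decide("sip:alice@example.com", identity_verified=True))
    print(policy.decide("sip:alice@example.com", identity_verified=False))
```

The point of the sketch is the first branch: without a provable caller ID, a whitelist entry is only as good as the sender's honesty.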
That’s my rundown of the key work needed from the IETF. Now the vendors need to get on with implementing the standards that are made, and the standards guys need to hurry up and finish their standards.
DW Last but not least, I want to ask you about comments that I’ve heard from different people saying SIP is the worst protocol the IETF has ever developed. I think those comments were probably a result of the large number of RFCs and drafts around SIP. People look at that and say, “Well, there’s so much change, there’s so much churn here that there must be something inherently wrong with it to begin with.”
CJ SIP is the best of things and the worst of things. People look at all of these drafts and think that they have to implement all of them to build a useful service. That’s very much an incorrect view. They also look at it and say, “Oh, well, it must be changing a lot.” But that’s also not quite right.
If you take a SIP phone from 2000 and connect it to any of today’s SIP systems, it still works fine, and all of the features that worked on it then still interoperate quite well with the latest and greatest features. For example, the phone from 2000 wouldn’t support some of the abilities that you would need for my phone to be able to watch your phone and figure out when you hung it up so I could call you back. But it would still support all the things that it did back then, such as basic phone calls, caller ID, and transfer.
No one is arguing that the S in SIP stands for simple. It is a large and complicated system. But any advanced phone system that deals with such a large number of features and the amount of interoperability and the ability to add new features by upgrading only one end is going to have a similar level of complexity.
Many people have built very operational SIP systems that are fully compliant with the standards, have a specific set of features, and are actually quite small and lean. We’ve seen very small, simple SIP implementations for just voice phones or IM or presence clients that don’t do a lot of other things.
It has just been complicated for people to get their heads around all of these different standards and what they mean—and which ones they need to pay attention to and which ones they don’t. Recently the IETF has been working on a roadmap to understanding all of this. Called The SIP Hitchhiker’s Guide, written by Jonathan Rosenberg of Cisco Systems, it helps you understand how all these pieces fit together and which ones you need to pay attention to and which ones you don’t need for your service. I think that guide will help clarify some of these issues for people who don’t live and breathe SIP as part of their day.
DW Let’s hope this discussion can do its part in that as well.
Originally published in Queue vol. 5, no. 2.