Instant Instant Messaging: Just Add Web Sockets

Chat rooms and live peer-to-peer chat on the web are high on the list of stunning rich application features that can still drop jaws. Facebook recently launched an integrated web chat implementation to much fanfare. Their impressive Erlang and C++ chat infrastructure showcases real-time web techniques and live, interactive interface elements. However, following Facebook's lead is actually not that hard if you use the right approach.

The two main difficulties faced by Facebook, Google, or any other web giant that wishes to deploy a chat application are scaling out an inconceivably large messaging back end and sending real-time messages to the browser. The first problem is one that many of us would like to have, but few actually encounter. The second problem is getting much, much easier to solve. Web chat is about to go mainstream, thanks in large part to a new HTML 5 feature called Web Sockets.

Chat Is Full-Duplex; the Web Is Not
The challenges in bringing real-time messaging (chat, for example) to web applications stem from the fundamental design of the web. It is, after all, a system designed for navigating hypertext documents. The request-response model is perfect for document retrieval, but it's less perfect for an application platform. Some applications are exceptionally bad fits.

Chat requires full-duplex communication. That is, data must be able to flow in both directions at the same time. This is a clear necessity for chat, as messages may originate on either the server or the client and must be transmitted without delay. After all, instant messaging is of little use if it isn't instantaneous. HTTP rules this out: the client initiates every exchange and the server can only respond, so neither side can send information over the network at will.

Since nearly anything is possible, some tenacious developers have shoehorned full-duplex communication into existing web browsers. Comet long-polling, iframe streaming, and other techniques have all been applied in pursuit of this goal. Each technique has weaknesses, however, and even Comet techniques that allow data to stream down from the server cannot avoid initiating a new request for each upstream message. While a new request might not seem like much, each HTTP action causes several hundred bytes of header data to be generated, transmitted, and parsed. Even when bandwidth and computing resources are cheap, the additional latency incurred by a full round trip hurts real-time interactivity.
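To make the header cost concrete, here is a back-of-the-envelope sketch. The figures are illustrative assumptions rather than measurements: a 20-byte chat line, 400 bytes of HTTP headers per request, and the two delimiter bytes per Web Socket frame described in the framing section of this article.

```javascript
// Fraction of the bytes on the wire that are not the message itself.
function overheadRatio(payloadBytes, overheadBytes) {
  return overheadBytes / (payloadBytes + overheadBytes);
}

const chatMessageBytes = 20; // assumed: one short chat line
const httpHeaderBytes = 400; // assumed: "several hundred bytes" of headers
const frameDelimiterBytes = 2; // the 0x00 and 0xFF bytes around a frame

const httpShare = overheadRatio(chatMessageBytes, httpHeaderBytes);
const frameShare = overheadRatio(chatMessageBytes, frameDelimiterBytes);

console.log("HTTP request: " + (httpShare * 100).toFixed(1) + "% overhead"); // 95.2%
console.log("Framed message: " + (frameShare * 100).toFixed(1) + "% overhead"); // 9.1%
```

Under these assumed numbers, nearly all of the bytes in a polled chat message are headers, which is the inefficiency the paragraph above describes.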

Using today's browser technology for both upstream and downstream, there is significant overhead to real-time communication. Despite this, some have been willing to stomach the complexity and inefficiency to run applications like chat on the web without succumbing to the temptation of proprietary plugins.

HTML 5 Web Sockets
The HTML 5 standard specifies new APIs for storage, drawing, drag-and-drop, and other areas that have made web programming painful. Browsers have already begun incorporating parts of HTML 5 (canvas, for example) even though the specification is far from complete. The HTML 5 Communication section includes two additional connectivity features: Server-Sent Events, a standardization of HTTP push, and Web Sockets, a cross-domain safe, full-duplex connection. Server-Sent Events will make real-time updates and notifications easy, and Web Sockets provide the functionality necessary to build chat for the web without the previously required hackery.

Unlike traditional AJAX, in which each XMLHttpRequest consists of a round trip which sends and then receives data from a remote server, a Web Socket sends and receives asynchronously on a single connection. This allows WebSockets to

The WebSocket API is straightforward, as the interface definition in the current HTML 5 specification shows:

[Constructor(in DOMString url)]
interface WebSocket {
  readonly attribute DOMString URL;
  // ready state
  const unsigned short CONNECTING = 0;
  const unsigned short OPEN = 1;
  const unsigned short CLOSED = 2;
  readonly attribute long readyState;
  // networking
  attribute EventListener onopen;
  attribute EventListener onmessage;
  attribute EventListener onclosed;
  void postMessage(in DOMString data);
  void disconnect();
};

The only twist is that the constructor also initiates the outbound connection. Due to the single-threaded nature of JavaScript, event handlers can be safely attached to the newly constructed object. For instance, the following code adds a handler for the "open" event that will never be called prematurely, even though the process of opening the socket appears to have already begun. In another language, similar code might create a race condition, but in JavaScript, it's perfectly safe.

var mySocket = new WebSocket("ws://example.com/server");
mySocket.addEventListener("open", openHandler);

While at first glance this usage may be somewhat confusing, the choice is not wholly without benefit. Because the connection opens when the WebSocket object is created, Web Sockets cannot be reused, nor can they listen for incoming connections. This removes some of the ambiguity found in other socket APIs.

Web Socket objects dispatch events when the connection state changes ("open" and "closed") and upon receiving a new frame ("message"). Like XMLHttpRequest, Web Sockets have a readyState property, which takes one of the values defined in the interface above: CONNECTING (0), OPEN (1), or CLOSED (2).

The WebSocket interface, including events, callbacks, readyState, and postMessage, is consistent with existing browser APIs. In that respect, it's like an updated XMLHttpRequest with real-time capabilities.
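As a minimal sketch of how these pieces fit together, the function below wires up the three handlers named in the interface quoted above. It assumes a socket object with exactly the draft-era members shown there (onopen, onmessage, onclosed, postMessage). Later revisions of the specification renamed some of these members, so treat the names as specific to that draft. The `log` callback and the "hello" greeting are hypothetical, and reading the text from `event.data` is an assumption about the message event.

```javascript
// Attaches handlers to an already constructed socket.
// `socket` is expected to look like the draft WebSocket interface above;
// `log` is a hypothetical callback used only for illustration.
function attachChatHandlers(socket, log) {
  socket.onopen = function () {
    log("connected");
    socket.postMessage("hello"); // draft-era method name
  };
  socket.onmessage = function (event) {
    // Assumption: the received text is carried in event.data.
    log("received: " + event.data);
  };
  socket.onclosed = function () {
    log("disconnected");
  };
  return socket;
}

// In a browser implementing the draft, usage might look like:
//   attachChatHandlers(new WebSocket("ws://example.com/server"), console.log);
```

Taking the socket as a parameter keeps the handlers easy to exercise with a stand-in object when no Web Socket server is available.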

Cross-Domain Security
Security is a major concern with browser technology. After all, cross-site attacks have accounted for some truly dangerous exploits. The Web Socket protocol contains security mechanisms meant to stave off such attacks.

If low-level sockets were exposed directly to JavaScript, unsuspecting web visitors could be made to participate in sophisticated distributed attacks. Not the least of these potential exploits would connect browsers directly to unsuspecting mail servers to send spam. In order to avoid the sort of doomsday scenario that would "break the Internet," any sort of socket in the browser must impose additional security restrictions. Flash and Silverlight require a separate security policy server to grant permission before they allow socket connections. This is an unfortunate compromise, but it allows direct connectivity at the TCP level. Instead of adopting this approach, Web Sockets use a single connection with an opening handshake.

The Web Socket handshake is a strict initial exchange between the browser and server. The handshake identifies the protocol, destination host name, and origin. This allows services that are not expecting connections from the web to reject attacks. Likewise, services intended for use from a particular set of origin domains can enforce strict security policies, including same-origin.

[Client Sends]
GET /services/chat HTTP/1.1
Upgrade: WebSocket
Connection: Upgrade
Host: chat.example.com:81
Origin: http://www.example.com:80

[Server Responds]
HTTP/1.1 101 Web Socket Protocol Handshake
Upgrade: WebSocket
Connection: Upgrade
websocket-origin: http://www.example.com:80
websocket-location: ws://chat.example.com:81/services/chat

The Web Socket handshake resembles HTTP, but it is explicitly not HTTP. This duality ensures compatibility with proxies and other intermediaries while allowing the specification to define behavior that falls outside of standard HTTP.
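The origin check described above can be sketched from the server's point of view. This is an illustration only, assuming the request arrives as text in the shape of the client exchange shown earlier. The function names and the allow-list are hypothetical, and a real server would also have to produce the response and handle malformed input more carefully.

```javascript
// Parses the header lines of a handshake request into a lookup table
// keyed by lower-cased header name.
function parseHandshakeHeaders(requestText) {
  const headers = Object.create(null);
  const lines = requestText.split(/\r?\n/).slice(1); // skip the request line
  for (const line of lines) {
    const colon = line.indexOf(":");
    if (colon > 0) {
      const name = line.slice(0, colon).trim().toLowerCase();
      headers[name] = line.slice(colon + 1).trim();
    }
  }
  return headers;
}

// Accepts the request only if it asks for a Web Socket upgrade and
// comes from an origin on the (hypothetical) allow-list.
function isHandshakeAllowed(requestText, allowedOrigins) {
  const h = parseHandshakeHeaders(requestText);
  return h["upgrade"] === "WebSocket" &&
    h["connection"] === "Upgrade" &&
    allowedOrigins.includes(h["origin"]);
}

const sampleRequest = [
  "GET /services/chat HTTP/1.1",
  "Upgrade: WebSocket",
  "Connection: Upgrade",
  "Host: chat.example.com:81",
  "Origin: http://www.example.com:80"
].join("\r\n");

console.log(isHandshakeAllowed(sampleRequest, ["http://www.example.com:80"])); // true
console.log(isHandshakeAllowed(sampleRequest, ["http://other.example.org"])); // false
```

A service that never expects browser traffic would simply pass an empty allow-list and reject every such handshake.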

Web Sockets cannot connect directly to the same servers that TCP sockets can. In order for a Web Socket-capable browser to communicate with a server, either the server must be updated to accept Web Socket connections or a bridge must adapt the protocol. Kaazing's Web Socket server, Kaazing Enterprise Gateway, provides that bridging functionality by brokering TCP socket connections to web browsers. Kaazing Enterprise Gateway allows browsers and network servers to communicate efficiently, enforces access control, and includes client libraries for use with chat and other application protocols.

Framed Messages
The post-handshake portion of the Web Socket protocol consists of variable-length, framed messages. Users of native sockets know that the option that creates a TCP socket is SOCK_STREAM. Web Sockets, although designed to solve a problem similar to the one TCP sockets address, are not streaming.

The Web Socket protocol sits on top of TCP and consists of framed UTF-8 strings. There are provisions in the HTML 5 specification to eventually support binary frames. On the wire, a typical Web Socket frame appears as a byte of all zeros (0x00), a string, and a byte of all ones (0xFF). The byte 0xFF never appears in valid UTF-8, and 0x00 appears only as the encoding of the NUL character, so these two bytes can act as frame delimiters. This guarantees that each message event contains the text of a complete message. This may seem like the bare minimum. After all, XMLHttpRequest returns text responses. When using TCP directly, however, the only atomic units are bytes. The responsibility for interpreting higher-level constructs (including strings) lies on the shoulders of the developer. Since Web Socket frames contain complete UTF-8 strings, they eliminate the need to buffer and parse streams of bytes in simple text-oriented protocols.
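The framing described above (0x00, UTF-8 text, 0xFF) can be illustrated with a short sketch. The names `encodeFrame` and `FrameParser` are hypothetical, and the parser shows the buffering that a Web Socket implementation performs on the developer's behalf. It is not the code of any particular browser or server, and it simply ignores bytes that fall outside a frame.

```javascript
const FRAME_START = 0x00;
const FRAME_END = 0xff;

// Wraps a string as: 0x00, UTF-8 bytes, 0xFF.
function encodeFrame(text) {
  const body = new TextEncoder().encode(text);
  const frame = new Uint8Array(body.length + 2);
  frame[0] = FRAME_START;
  frame.set(body, 1);
  frame[frame.length - 1] = FRAME_END;
  return frame;
}

// Accumulates raw bytes as they arrive and returns every message
// completed by the latest chunk.
class FrameParser {
  constructor() {
    this.pending = []; // payload bytes of the frame being read
    this.inFrame = false;
  }

  push(chunk) {
    const messages = [];
    for (const byte of chunk) {
      if (!this.inFrame) {
        if (byte === FRAME_START) {
          this.inFrame = true;
          this.pending = [];
        }
        // Bytes outside a frame are ignored in this sketch.
      } else if (byte === FRAME_END) {
        messages.push(new TextDecoder().decode(new Uint8Array(this.pending)));
        this.inFrame = false;
      } else {
        this.pending.push(byte);
      }
    }
    return messages;
  }
}

const parser = new FrameParser();
const bytes = encodeFrame("hello");
console.log(parser.push(bytes.slice(0, 3))); // no complete message yet
console.log(parser.push(bytes.slice(3))); // the frame is now complete
```

Because decoding happens only once the closing 0xFF arrives, a message split across TCP segments, even in the middle of a multi-byte character, still surfaces as one complete string.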

The downside is that stream-oriented protocols that do not require framing are less efficient when implemented on top of framed messages. The not-insignificant upside is that it becomes trivial to send and receive complete strings out of the box. Sending strings is a very common case, and the practice of sending JSON or XML can continue easily over Web Sockets. Even with enhanced connectivity capabilities, text encodings are likely to dominate web programming for some time.
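Since each frame carries one complete string, a JSON-encoded chat message fits in a frame without any extra length prefix or delimiter. The sketch below assumes a hypothetical message shape (`type`, `from`, `text`); nothing in the Web Socket specification prescribes it.

```javascript
// Serializes one chat message as a JSON string, ready to be sent as a frame.
// The field names are assumptions made for this example.
function encodeChatMessage(from, text) {
  return JSON.stringify({ type: "chat", from: from, text: text });
}

// Parses the text of one received frame back into a message object.
function decodeChatMessage(frameText) {
  const message = JSON.parse(frameText);
  if (message === null || typeof message !== "object" || message.type !== "chat") {
    throw new Error("not a chat message");
  }
  return message;
}

const wireText = encodeChatMessage("alice", "Anyone there?");
console.log(wireText);
console.log(decodeChatMessage(wireText).text);
```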

Architecture: Simplify, Simplify
Web servers and frameworks are among the obstacles to real-time messaging on the web, and therefore also to chat. Most implementations of chat over HTTP involve running a native chat client on the server side and bridging or translating the semantics of chat into a different format for consumption in JavaScript. This bridging approach is cumbersome, inefficient, and now unnecessary. In real-time applications, web servers get in the way of simple client/server architectures. Access to a bidirectional communication API from JavaScript promotes applications with end-to-end participation. Now that web clients have emerged as legitimate platforms for rich applications, it makes sense to move some logic away from the middle tier.

There are scalability benefits from a socket-based architecture, as well. In addition to the obvious vertical scalability boost from shedding the overhead of HTTP, sockets put the responsibility for scaling out in the appropriate place. An application that connects to a cluster of chat servers scales just as well as the chat servers themselves. The fact that the connections originated from web browsers does not impose additional scalability limitations.

With a full-duplex connection, building a chat client for the web can be as straightforward as building chat for the desktop. On the desktop, the client would open a connection to the destination server and communicate using a standard chat protocol. On the web, we would like to do the same. The simple, elegant client/server approach suits web browsers with Web Sockets extremely well.

Futuristic Compatibility Layer
In older browsers, some of the benefits of Web Sockets can be attained through emulation. The Kaazing Enterprise Gateway enables full-duplex communication using several strategies. In the worst case (the fallback mode), HTTP-based Comet techniques ensure that browsers can communicate bidirectionally with the Kaazing Enterprise Gateway. Even then, the Kaazing client libraries expose the standard Web Socket API, allowing browsers as old as Internet Explorer 5.5 to run applications using Web Sockets.
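The general idea of choosing between a native and an emulated socket can be sketched with simple feature detection. This is not the Kaazing client library's actual API: `createSocket`, `env`, and `fallbackCtor` are hypothetical names, and the only assumption is that the emulation exposes the same constructor shape as the native interface.

```javascript
// Returns a socket built with the native constructor when the environment
// provides one, and with the supplied emulation constructor otherwise.
// `env` stands in for the browser's global object.
function createSocket(url, env, fallbackCtor) {
  const Ctor = typeof env.WebSocket === "function" ? env.WebSocket : fallbackCtor;
  return new Ctor(url);
}

// Stand-ins so the sketch can run outside a browser.
class NativeStandIn {
  constructor(url) { this.url = url; this.kind = "native"; }
}
class EmulatedStandIn {
  constructor(url) { this.url = url; this.kind = "emulated"; }
}

console.log(createSocket("ws://example.com/server", { WebSocket: NativeStandIn }, EmulatedStandIn).kind); // native
console.log(createSocket("ws://example.com/server", {}, EmulatedStandIn).kind); // emulated
```

Because both paths return an object with the same interface, application code written against the Web Socket API does not need to know which one it received.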

Whether using emulation or the native Web Socket protocol, applications written against the HTML 5 Web Socket interface can connect through the gateway to TCP servers for real-time messaging, mail, and chat.

Figure 1: Connecting to Google Talk through the Kaazing Enterprise Gateway

XMPP, the Extensible Messaging and Presence Protocol, is a popular protocol used by numerous chat servers and instant messaging networks. The Kaazing XMPP client library uses the Web Socket API in conjunction with the Kaazing Enterprise Gateway to connect to chat servers. Because of the gateway's ability to connect browsers to TCP servers, the chat servers can be unmodified daemons serving XMPP/TCP. This puts chat clients running in the browser on equal footing with desktop chat clients. Clients using an identical protocol on and off the web can connect to the same servers and message freely among themselves.
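To give a feel for what travels over the socket, the sketch below builds a single XMPP chat message stanza as a string. It is an illustration only: the function names are hypothetical, it does not reflect the Kaazing XMPP client library, and a real XMPP session also involves stream setup and authentication that are omitted here.

```javascript
// Escapes the characters that are special in XML text and attributes.
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Builds a one-to-one chat message stanza addressed to `to`.
function buildMessageStanza(to, body) {
  return '<message to="' + escapeXml(to) + '" type="chat">' +
    "<body>" + escapeXml(body) + "</body></message>";
}

console.log(buildMessageStanza("juliet@example.com", "Tea & <biscuits>?"));
```

The resulting string is the same text a desktop client would write to its TCP connection, which is why browser and desktop clients can share one server.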

Online communities have provided mostly static forms of communication such as forums and message boards. This forced users to go out-of-band and use desktop instant messaging clients to carry out live conversations. Users have naturally discovered that IM can be an excellent complement to less immediate communication media. By incorporating web chat into their applications, Facebook and others have provided a more complete user experience inside the browser. Thus far, adding chat to web applications has been difficult.

Web Sockets provide the bidirectional networking that chat needs. The concise, point-to-point socket API promotes simple and efficient end-to-end architectures. End-to-end architectures in turn improve scalability and interoperability with desktop chat clients. Best of all, this browser feature can be effectively emulated today on browsers that are nearly a decade old. In all, Web Sockets promise to simplify one of the greatest challenges involved in bringing real-time interactive applications such as chat to the web.

© 2008 SYS-CON Media