XTech 2007 paper cauldron
This is work-in-progress. The first part is the actual paper being written, and is due on April 5th. The second part is the cauldron of short notes and almost-articles that I've written over time and from which I'm cherry-picking ideas as I go. XTech page about the presentation is here.
You're very welcome to add comments! The page is editable, just log in first (if you don't want to create an account, log in with username "guest" and password "anon").
Paper
Abstract
Use of the web by people largely consists of 1) interaction between person and application and 2) interaction between person and person. Paradoxically, interaction with applications is direct, real-time, rich, and evolves quickly, while interaction with people is indirect, deferred, poor, and evolves slowly.
We take a few scenarios from everyday online life which stem from this contrast, trace them back to the constraining way in which the network handles person to person communication, and finally show a different approach, where direct, rich, real-time person to person interaction happens in a context that also encourages fast evolution, and is based on a mix of proven, standard technologies.
Introduction
Online interaction among people is full of paradoxes---things that should be trivial are close to impossible. But as we go through our online everyday lives, we experience these paradoxes from the inside and do not perceive anything odd around us.
Take for example the following questions, which at first may seem too absurd to deserve a serious reply:
- What if, for each new web application being written, users had to upgrade their browsers?
- What if interaction with web applications were limited to a command line?
- What if talking through an instant messenger meant that an intermediate application stored one's messages for several minutes, and the recipient had to poll the application repeatedly to get the message?
- What if changing the subject of conversation required subscribing to another such application?
Now move them just a few pixels across the screen, say from the browser to the instant messenger or vice-versa, and suddenly we appear to deem them reasonable. In fact:
- Adding a whiteboard, a card game, a shared editing area or any other extended form of interaction to an instant messenger is expected to happen through modification of the client, followed by massive upgrade campaigns across large userbases.
- Exchanging short text messages in linear fashion at a text prompt is the major (often, the only) component of the experience of instant messaging.
- Chunks of communication sent to web applications but intended for people (messages, links, pictures) are stored on servers and not presented to the recipient until requested.
- Being connected with a friend through a photo sharing web site does not also mean being able to e.g. share links with her. Both parties need to subscribe to another web application.
Where does this strange state of affairs stem from? Is it inherent in the structure of today's online world, or can it be escaped?
One cause: too much intelligence
Back in 1997, David Isenberg contrasted the Intelligent Network with the Stupid Network. [3]
The Intelligent Network is the telephony network. It's "intelligent" because infrastructure makes all sorts of assumptions about information travelling between endpoints.
AT&T True Voice was a valiant attempt to improve circuit switched voice quality as much as possible in the context of current network architecture. (...) But as we set out to implement this conceptually simple improvement, we kept running into the problem that there were too many places in the network that had built in "intelligent" assumptions about the voice signal - echo cancellers, conference bridges, voice messaging systems, etc. - and too many devices that depended on these acoustic assumptions for their correct operation - modems, fax machines, and a surprising number of strange devices with proprietary analog protocols.
On the other hand, a Stupid Network...
...would let you stuff bits in one end and get them out the other without getting tangled up in cobwebs of legacy assumptions. Want a different voice quality? With a Stupid Network, you'd get a different program, install it in your intelligent end user device and run it.
Clearly, "stupid" is good. Isenberg posits that the Internet is a kind of Stupid Network:
Now, suppose Internet Telephony gets as good as telephone company telephony (see below), and some enterprising independent programmer wants to make a product that solves the problem of being on hold. They would simply write an end-user application and sell it from their web site. If it works, and people like it, they will sell lots of it. If not, they might try again. But they don't have to go through any long, bureaucratic economic justification, business planning, and technical development processes - they just do it.
(...)
The Internet breaks the telephone company model by passing control to the end user. It does this by taking the underlying network details out of the picture.
(...)
The network provider becomes virtually irrelevant - the user controls the relevant capabilities.
However, does the example above stand the trial of a real life scenario?
...and some enterprising independent programmer wants to make a product that solves the problem of NOT BEING ABLE TO INSERT IMAGES ON THE FOOBAR WEB FORUM. They would simply write an end-user application and sell it from their web site. If it works, and people like it, they will sell lots of it. If not, they might try again. But they don't have to go through any long, bureaucratic economic justification, business planning, and technical development processes - they just do it.
Something's not quite right with this picture.
It is true (and fortunate) that physical devices at the lower level of the Internet have become "stupid".
But behind apparently confined annoyances---such as exchanging bookmarks with someone on Del.icio.us but having to join Flickr to also exchange photos with the same someone---it is not too hard to spot higher-order networks emerging above the physical one, that connect people rather than machines, are made of "virtual devices", and are just as assuming about the information they carry as the telephony network.
Intelligent endpoints are rare. Dumb layers (such as the web pages of a photo sharing service), whose purpose is only to connect to intelligent midpoints (the server-side applications actually storing and organizing the photos), are widespread.
The network is getting intelligent again and, if we are to listen to Isenberg's conclusions, this isn't good news.
Another cause: flow (or lack thereof)
Real-time communication (i.e. without delay-inducing intermediaries) was the first form of communication to develop in human history, and is the last to land on the web.
The difference between real-time and deferred communication may at first appear as the quantitative difference between completing a task in one minute and completing it in one hour.
But as anyone who has been repeatedly interrupted knows, there is only so much fragmentation a task can stand before a qualitative shift kicks in. [1] Or put another way, a conversation by mail is not just slower than a live one, it's an entirely different kind of conversation.
AJAX and broadband have increased the pace of "conversation" between users and applications, enabling qualitative shifts in user experience. Sadly, no matching shift in person to person interaction has taken place: systems which already supported a high pace (instant messaging) only support basic forms of interaction and evolve too slowly; attempts at increasing pace on the web (HTTP poll, Comet), where interaction can be richer, remain server-centric and of intelligent-network nature.
One entire class of experience is missing from the web. Rich applications that have so far been confined to the desktop using ad-hoc communication protocols have yet to make their way to the web. It is in the realms of this open web where they will truly flourish.
The alternate route
The rest of this paper proposes a design and implementation to support a different scenario, one where:
- users are more directly connected; if John and I are connected because we exchange bookmarks, we are also able to exchange photos, chat about yesterday's match, or play a card game without joining additional ad-hoc networks;
- real-time communication between person and person is as rich as communication between user and applications: chess pieces can be moved around, shapes can be drawn, documents can be edited, slides can be presented, maps can be navigated together, ...
Surprisingly, ingredients for this scenario are readily available, standards-compliant, popular, and made to be extended. They just need to be mixed together.
Scenario
Mary starts the web browser. Usually, the endpoints of interactions carried out through the browser are web servers and web applications. Mary's browser, though, has been augmented with support for real-time communication. People can be interaction endpoints as well.
People are shown in the familiar "contact list" form:
(TODO: Screenshot)
Clicking on a nickname in the list triggers a basic form of interaction (the traditional "chat"):
(TODO: Screenshot)
Mary and Sam begin to chat. At one point, Sam wants to refer to a picture. He tries to do so through minute verbal description:
(TODO: Screenshot)
Mary soon gets tired of building pictures in her mind, and points Sam to a web page where they can interact in a visual way. Both load the page:
(TODO: Screenshot)
Mary and Sam tell their browsers to make their communication channel also available to the two instances of the web page:
(TODO: Screenshot)
From now on, actions in the web page on Mary's side are propagated to the page on Sam's side, and vice-versa. Mary draws a line, a line appears on Sam's screen. Sam draws a circle, a circle appears on Mary's screen:
(TODO: Screenshot)
Since Mary and Sam are not interacting with a page but are interacting with each other on a page, we call this an application space rather than application.
Discussion
Some things are remarkable from the point of view of Mary and Sam:
- they did not have to download and install new software in order to access an additional form of interaction, they just loaded a web page;
- they did not have to subscribe to a service in order to recreate their social link in the context of a different form of interaction; the application was already usable through their already existing link;
- actions on one side are reproduced immediately on the other side.
From the developer's point of view, this scenario also means that:
- their deployment and upgrade procedure consists in uploading a web page, instead of packaging a client and pushing it to thousands of machines;
- they can rely on a standards-based, familiar, cross-platform toolset, namely XML-based technologies (XHTML, SVG, XUL) and JavaScript;
- they do not need to deploy complex server software in order to hack real-time communication on top of HTTP; XMPP, a technology designed precisely for this purpose, is used instead.
Other scenarios
More scenarios are possible. A couple are briefly outlined below:
- Enriching already existing web applications with ad-hoc synchronization code.
(TODO: mapshare screenshot)
MapShare detects a Google map in the current page and lets the user and his/her contact pan and zoom on it as if it were a single map shared between them.
(TODO S5Share screenshot)
S5Share and Laser let two or more users attend a synchronized S5 slideshow and share mouse cursors to point at things on the screen.
- Enriching already existing applications with general-purpose synchronization code.
(TODO RichDraw screenshot)
XML Sync Islands is a library that allows two or more users to keep a DOM subtree synchronized. The figure shows how this was used to turn the single-user RichDraw SVG application into a shared SVG whiteboard.
How it works
It begins by making real-time, end-to-end communication a first-class citizen in the browser, like HTTP.
For this implementation, the technology used for the browser is Mozilla (specifically, Firefox and Flock). Mozilla covers all major platforms, is open source, has a lively community, and can be extended.
For the communication side, the technology used is the IETF standard eXtensible Messaging and Presence Protocol (XMPP) [2]. It can be thought of as bi-directional exchanges of XML packets among client entities (human users, automated agents) routed by "stupid" servers---in the positive sense of the "Stupid Network", i.e. just delivering the packets without making assumptions about what they mean.
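The "routed by stupid servers" idea can be made concrete with a toy model. The sketch below is not an XMPP server and uses no XMPP library; the `StupidRouter` class and its `inboxes` attribute are hypothetical names made up for illustration. It only shows the principle: the router reads the addressing on the outer element and delivers the packet verbatim, whatever it carries.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class StupidRouter:
    """Hypothetical stand-in for a server that routes without interpreting."""

    def __init__(self):
        # One inbox (a list of raw stanza strings) per recipient address.
        self.inboxes = defaultdict(list)

    def route(self, stanza):
        # Only the addressing on the outer element is inspected.
        recipient = ET.fromstring(stanza).get("to")
        if recipient is None:
            raise ValueError("stanza has no 'to' attribute")
        # The stanza is handed over verbatim; its children are never examined.
        self.inboxes[recipient].append(stanza)
        return recipient


# A plain chat message and a whiteboard payload travel the same way.
chat = '<message to="sam@server.org/SamePlace"><body>Hello, Sam!</body></message>'
drawing = (
    '<message to="sam@server.org/SamePlace">'
    '<circle xmlns="http://apps.sameplace.cc/whiteboard">'
    '<center x="4" y="3"/><radius length="5"/></circle></message>'
)

router = StupidRouter()
router.route(chat)
router.route(drawing)
```

Adding a new kind of payload requires no change to `StupidRouter`; only the endpoints need to understand it.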
When connected users wish to interact in a richer form than usual, they load a web page providing that interaction form. The page is built with ordinary technologies (XHTML, SVG, JavaScript, XUL). They then tell the browser that their existing communication channel should be made available to the web page.
The browser monitors a pre-defined area of the web page (a DOM element with id="xmpp-outgoing"). When the web page writes something in "xmpp-outgoing", the browser picks it up and sends it over XMPP to the other side. The browser on the other side receives the data and writes it into another pre-defined area of the web page (a DOM element with id="xmpp-incoming"); the web page on the other side in turn reacts to data written in "xmpp-incoming", usually by synchronizing some state.
(Diagram)
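The relay between the two pre-defined areas can be simulated without a browser. The following is a minimal sketch in Python rather than the JavaScript a real page would use; `Mailbox`, `Page` and `connect` are hypothetical names, and `connect` merely stands in for the two browsers and the XMPP channel between them.

```python
class Mailbox:
    """Stands in for a DOM element such as the one with id="xmpp-outgoing"."""

    def __init__(self):
        self.children = []
        self.observers = []

    def append(self, payload):
        self.children.append(payload)
        # Everyone watching this element is told about the new child.
        for notify in self.observers:
            notify(payload)


class Page:
    """A loaded web page with the two pre-defined areas."""

    def __init__(self):
        self.outgoing = Mailbox()  # plays the role of id="xmpp-outgoing"
        self.incoming = Mailbox()  # plays the role of id="xmpp-incoming"
        self.shapes = []           # the state the page keeps synchronized
        # The page reacts to whatever is written into "xmpp-incoming".
        self.incoming.observers.append(self.shapes.append)

    def draw(self, shape):
        self.shapes.append(shape)    # update the local view
        self.outgoing.append(shape)  # announce the change to the other side


def connect(page_a, page_b):
    """Assumed glue: what one page writes out, the other page receives."""
    page_a.outgoing.observers.append(page_b.incoming.append)
    page_b.outgoing.observers.append(page_a.incoming.append)


mary, sam = Page(), Page()
connect(mary, sam)
mary.draw("line")
sam.draw("circle")
```

Note that a page never writes received data back into its outgoing area, so a change is not echoed between the two sides.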
Ordinarily, a message exchange through XMPP looks like the following (from Mary's point of view):
OUT: <message to="sam@server.org/SamePlace">
<body>
Hello, Sam!
</body>
</message>
IN: <message from="sam@server.org/SamePlace" to="mary@server.org/SamePlace">
<body>
Hello, Mary! I'm about to draw a circle, watch out...
</body>
</message>
OUT: <message to="sam@server.org/SamePlace">
<body>
Ok...
</body>
</message>
When the channel is made available to a web page, it looks like this:
IN: <message from="sam@server.org/SamePlace" to="mary@server.org/SamePlace">
<body>
Here it comes
</body>
</message>
IN: <message from="sam@server.org/SamePlace" to="mary@server.org/SamePlace">
<circle xmlns="http://apps.sameplace.cc/whiteboard">
<center x="4" y="3"/>
<radius length="5"/>
</circle>
</message>
The web page at the other end receives all messages inside the "xmpp-incoming" element. It ignores most of them, but it notices that the last one carries a child element with a namespace it recognizes, interprets it, and draws the circle.
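The "ignore what you don't recognize" step can be sketched as follows. This is an illustration in Python with the standard library's ElementTree, not the page's actual JavaScript; the function name `extract_circles` is hypothetical, while the namespace and element names are taken from the sample stanza above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the whiteboard payload in the sample stanza.
WHITEBOARD_NS = "http://apps.sameplace.cc/whiteboard"


def extract_circles(stanza):
    """Return (x, y, radius) for each whiteboard circle carried by a message."""
    message = ET.fromstring(stanza)
    circles = []
    for child in message:
        # ElementTree spells a namespaced tag as "{namespace}localname".
        if child.tag != f"{{{WHITEBOARD_NS}}}circle":
            continue  # e.g. <body>: plain chat, not meant for this page
        center = child.find(f"{{{WHITEBOARD_NS}}}center")
        radius = child.find(f"{{{WHITEBOARD_NS}}}radius")
        circles.append(
            (int(center.get("x")), int(center.get("y")), int(radius.get("length")))
        )
    return circles


chat_stanza = (
    '<message from="sam@server.org/SamePlace" to="mary@server.org/SamePlace">'
    "<body>Here it comes</body></message>"
)
circle_stanza = (
    '<message from="sam@server.org/SamePlace" to="mary@server.org/SamePlace">'
    '<circle xmlns="http://apps.sameplace.cc/whiteboard">'
    '<center x="4" y="3"/><radius length="5"/></circle></message>'
)
```

Here `extract_circles(chat_stanza)` yields an empty list, while `extract_circles(circle_stanza)` yields the one circle to draw.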
Conclusions and future work
Distinguishing features of the web augmented with real-time communication (RTC) capabilities are summarized below and contrasted with current approaches.
| | Traditional web | RTC-enabled web |
| Interaction | server-centric | user-centric |
| What is shared | content, results | live experience |
| Communication pace | slow | fast |

| | Traditional messaging | RTC-enabled web |
| Richness of communication | basic (short text messages) | high (anything that can be put on a web page) |
| Evolution rate | slow (fat clients, complex deployments and update cycles) | rapid (deploying to all parties consists of just uploading web pages) |
Both the XMPP layer and the user interface that gives access to shared web pages are currently implemented as extensions for the Firefox and Flock browsers and available from the project's web site [4] under open source licenses.
Further work will consist of experimenting with more interaction domains (games, productivity, media), both directly and by easing adoption from third parties through libraries and documentation, and seeing what limits they hit in the current implementation and what possibilities they uncover in the underlying philosophy.
References
[1] A. J. Dix. Pace and interaction. Proceedings of HCI'92: People and Computers VII, Eds. A. Monk, D. Diaper and M. Harrison. Cambridge University Press. 193-207. http://www.comp.lancs.ac.uk/computing/users/dixa/papers/pace/
[2] P. St. Andre. Streaming XML with Jabber/XMPP. Internet Computing, IEEE, vol. 9, n. 5, Sept.-Oct. 2005, pp. 82-89.
[3] David Isenberg. Rise of the Stupid Network. Computer Telephony, August 1997, pg 16-26. http://www.hyperorg.com/misc/stupidnet.html
[4] xmpp4moz and SamePlace Project Site. http://dev.hyperstruct.net/xmpp4moz
Paper ends here. :-)
Cauldron
What is the Web, anyway?
Technology tells us that it's the HTTP protocol, hyperlinks that let us browse HTML pages, audio and video we download.
Daily use tells us something different, something for which the "http://" or "<a href='...'" criteria don't hold.
What is it? Sending a photo to a person five thousand kilometers away from you. Writing a piece to be read by people five years into the future. Asking a question on a subject you never guessed anybody could care about. Finding out that your geek friend from high school now runs a carpentry shop.
In short, the Web isn't technology. It's perception. Experience.
Build a bridge from a web page to content served by some arcane "foobar://" protocol; make that bridge seamless so that the only difference people see is the "foobar://" thing in the address bar... it would still be perceived as the Web--and because of this it would be the Web.
We are at an impasse. People care about experience and perception; technologists often answer "HTTP", in Pavlovian fashion, to any question about the web. And the HTTP assumption is not free from perception and experience implications--quite the opposite.
HTTP, the "Hyper-Text Transfer Protocol", is focused on transferring texts (or rather "documents", given the large quantity of visual/aural information exchanged today), which are contents, results, or "states". HTTP mostly carries state snapshots.
XMPP, on the other hand, mostly carries events, or state changes. By transferring state changes, it is easier to reproduce a state mutation (a live experience) on the other side of a communication.
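The difference between carrying snapshots and carrying state changes can be illustrated with a toy whiteboard. The sketch below is plain Python with made-up names (`apply_event`, the `("add" | "remove", shape_id, shape)` event format); it involves neither HTTP nor XMPP and only contrasts the two styles of transfer.

```python
def apply_event(state, event):
    """Apply one state change to a whiteboard held as {shape_id: shape}."""
    kind, shape_id, shape = event
    if kind == "add":
        state[shape_id] = shape
    elif kind == "remove":
        state.pop(shape_id, None)
    return state


# What happened on the sender's side, in order.
events = [
    ("add", 1, "line"),
    ("add", 2, "circle"),
    ("remove", 1, None),
]

# Snapshot style: the receiver only ever sees the finished state.
snapshot = {2: "circle"}

# Event style: the receiver replays each change and passes through
# every intermediate state the sender went through.
replayed = {}
intermediate = []
for event in events:
    apply_event(replayed, event)
    intermediate.append(dict(replayed))
```

Both styles end in the same state, but only the event style lets the receiver see the line appear and then disappear, which is what reproducing a "live experience" amounts to.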
Axes along which the "traditional Web" vs. the "XMPP-enabled Web" distinction can be described:
| | Traditional Web | XMPP-enabled Web |
| Interaction | server-centric | user-centric |
| Focus | content, results | live experience |
| Communication pace | slow | fast |
And vs. traditional messaging:
| | Traditional messaging | XMPP-enabled Web |
| Richness of communication | basic (short text messages) | high (anything that can be put on a web page) |
| Evolution rate | slow (fat clients, complex deployments and update cycles) | rapid (deploying to all parties is just uploading web pages) |
On communication pace
Increasing communication pace from e.g. one exchange every five days to one exchange every five minutes only makes it into a more frequent communication. Accelerating it from e.g. one exchange every five minutes to one every five seconds, however, makes it into a different kind of communication, i.e. there is a threshold after which it undergoes a qualitative transformation.
Immediate, direct feedback is among the fundamental factors that allow dramatic shifts, such as the state of flow [Csikszentmihalyi 90], where one is...
"completely involved in an activity for its own sake. The ego falls away. Time flies. Every action, movement, and thought follows inevitably from the previous one, like playing jazz. Your whole being is involved, and you're using your skills to the utmost."
Even without such desirable but borderline situations, it is an intuition from everyday life that there are kinds of communication which only make sense when the rate of events falls within a certain frequency (it would be impossible to understand a movie by watching one frame per minute; it also isn't very easy to play ping-pong and take pauses when the ball is in mid-air).
One of the distinguishing marks of modern web applications is increased responsiveness in the user interface (by means of partial updates rather than full page reloads), which allows faster, tighter interaction pace. This translates not only to users being able to do the same work in less time, but also to being able to do better work.
On the other hand, human-human interaction on the web consists almost completely of long feedback cycles. Communicating through a web forum is very much like sending and receiving letters rather than talking, only the receiver might read them five minutes later rather than five days later. There is a class of experiences that the current way of "doing web" is stopping from happening. Paradoxically, it is also the class of experiences that is most ingrained in our unconscious (we were talking to each other long before we were sending letters to each other) and which would feel most "natural".
Questions
There are weird questions that interestingly seem to become reasonable when applied to a world whose limitations we have come to accept as natural.
What if, for each new web application that is written and you want to use, you had to upgrade your web client (the browser)?
That's what happens in the world of IM. For each new functionality that could be added, you would have to upgrade your IM client.
What if you could interact with web applications only via a command line?
In IM, most of the time you interact with people only via a command line.
What if, when you talked to your friends via IM, every message had to be stored on an intermediate application for e.g. five minutes, and your friends had to continually ask the application for those messages?
On the web, you send communication chunks to web applications. Other people only see them when they reload pages.
Web as communication medium or as a big panel with buttons?
It started as the former, but even ten years ago it was clear that it was "evolving from an information repository into a distributed interface to a global networked computational engine" (Dix 98) and led to wondering "[w]hat are we interacting with - is it information, is it computer systems?"
When used to convey functionality, the web is a communication endpoint. Communication is not between a person and another person through the web, it's between a person and the web.
Things come full circle, though, and it is clearer and clearer that one of the tasks we want this "global networked computational engine" to perform is improving communication among us. The success of "social software" speaks clearly. One of the simplest (both technically and conceptually) and most successful forms of the web as button panel, the "weblog" or "blog", goes back to being a communication medium.
Currently, though, there are few places where the web is both a medium and conveys functionality...
The network is getting intelligent again (and that isn't good news)
Isenberg wrote about the Intelligent Network vs. the Stupid Network. According to him, the Intelligent Network tries to do a lot of things with and to the data, while the Stupid Network just delivers the bits. So, he writes, the telephony network is an "intelligent" network. He talks about his time at AT&T on a project named "True Voice", which tried to improve the quality of voice in phone calls, and how he kept bumping against assumptions built into the network itself about what the data should be like. A stupid network, instead, wouldn't have dictated what the data should be like: it would have just delivered it and let intelligent end-points handle it.
There was too much "intelligence" intertwined with the basic transport.
The True Voice experience led me to see the advantages of a network - a Stupid Network - that would let you stuff bits in one end and get them out the other without getting tangled up in cobwebs of legacy assumptions.
Isenberg posits that, unlike telephone networks, the Internet is a "stupid" network: it just delivers the bits and lets end-points handle them. And that is true to a certain extent: there are no assumptions in network infrastructure about what data the IP protocol should carry.
Now, suppose Internet Telephony gets as good as telephone company telephony (see below), and some enterprising independent programmer wants to make a product that solves the problem of being on hold. They would simply write an end-user application and sell it from their web site. If it works, and people like it, they will sell lots of it. If not, they might try again. But they don't have to go through any long, bureaucratic economic justification, business planning, and technical development processes - they just do it.
(...)
The network provider becomes virtually irrelevant - the user controls the relevant capabilities.
(...)
The Internet breaks the telephone company model by passing control to the end user. It does this by taking the underlying network details out of the picture.
(...)
The vision is one in which the public communications network would be engineered for "always-on" use, not intermittence and scarcity. It would be engineered for intelligence at the end-user's device, not in the network.
(...)
End user devices would be free to behave flexibly because, in the Stupid Network the data is boss, bits are essentially free, and there is no assumption that the data is of a single data rate or data type.
At first sight it looks like we're pretty much there. Is it really so?
While the network of physical devices now just delivers bits, non-physical networks have emerged from the physical network, on which human-to-human (as opposed to machine-to-machine) communication happens. Web forums, collaborative bookmarking, social video sites. These are the real networks we're spending our online lives in.
Are these "stupid networks"? Let's take some of the above quotes as operative definitions of what a stupid network is, and let's see if they apply (changes emphasized):
...and some enterprising independent programmer wants to make a product that solves the problem of not being able to INSERT IMAGES ON THE WIDGETS WEB FORUM. They would simply write an end-user application and sell it from their web site. If it works, and people like it, they will sell lots of it. If not, they might try again. But they don't have to go through any long, bureaucratic economic justification, business planning, and technical development processes - they just do it.
(...)
The network provider becomes virtually irrelevant - the user controls the relevant capabilities.
Doesn't quite sound right.
In fact, a single web forum is a local intelligent network. It gets people in touch, like the intelligent telephony network. It controls and limits what endpoints are able to do, like the intelligent telephony network. It doesn't "pass control to the end user" at all nor "takes the underlying network details out of the picture". The network details here are the picture.
It might be objected that the Internet is a stupid network because it is possible to "just develop a different forum software" easily; it would be like saying that, if it were easier to start a telephone company, the design of the telephone network would turn from intelligent to stupid. No, it would only mean building one more intelligent network.
Isenberg further notes that...
The very name, Internet, denotes that it is designed to network networks.
Throughout the whole paper, the set of connections through which machines communicate and the set of connections through which humans communicate are implicitly taken to be one and the same. A P2P perspective, in a way. But it doesn't match reality.
When we're talking about human communication, the endpoints of the network are our friends, acquaintances, work contacts. We do have an Internet for human networks, though, even if there's much flashlight pointed straight into our eyes in the hope that we won't notice. But, reading things like:
Instead of fancy "intelligent" network routing translation, in a Stupid Network, intelligent end-user devices would be connected to one or more high speed access networks - always listening for relevant information, for data addressed to their owner.
...I am relieved to already see signs of the future.
References
A. J. Dix (1992). Pace and interaction. Proceedings of HCI'92: People and Computers VII, Eds. A. Monk, D. Diaper and M. Harrison. Cambridge University Press. 193-207. http://www.comp.lancs.ac.uk/computing/users/dixa/papers/pace/
Alan Dix (1998). The Active Web - part I. Interfaces, 38 pp. 18-21. Summer 1998. http://www.hcibook.com/alan/papers/ActiveWeb/
Csikszentmihalyi, Mihaly (1990). Flow: The Psychology of Optimal Experience. New York: Harper and Row. ISBN 0-06-092043-2
David Isenberg. Rise of the Stupid Network. Computer Telephony, August 1997, pg 16-26. http://www.hyperorg.com/misc/stupidnet.html
Misc thoughts to be expanded
- Representing stream of events linearly (e.g. traditional chat) vs. building a cohesive representation, possibly with events cancelling/overriding each other (e.g. whiteboard).
- Container or thin-client mode vs. fat client mode: the "container" model being one of the success factors for browsers, i.e. they bring only generic communication infrastructure where deployment is harder (users' machines) and allow domain functionality to be placed where deployment is easy (servers). Talk about the different model in instant messaging and about how the gap can be bridged.
- Increasingly, chat (real-time written verbal communication) has turned from an end in itself (many web based chat systems, IRC, and the tools built into the BBSes of the Eighties) to a support to other activities. Just like two co-workers were likely to exchange a drawing or a text document and then bounce comments and corrections back and forth via mail, they are now likely to do so in real-time over instant messaging.
- Shifting the attention from the human-machine communication channel to the human-human communication channel is not a recent matter (Dix 92), but the availability of general purpose, extensible, standardized and commoditised human-human communication channels is (XMPP). Experiments in richer forms of interaction have traditionally been confined to small groups, arguably because such forms were built into client software that had to be deployed to all interested parties. As such, this side of the research on communication and interaction never enjoyed the evolutionary boost provided by massive adoption (as is instead possible for web technologies) and, less obviously, never had a chance to explore emergent, higher-order effects resulting from it.
- Quote: "So a typical channel is used in chunks. The rate at which chunks are produced we shall call the pace of the channel. Although this paper concentrates on pace, it is, of course, not the only property of interest. Other interesting properties of individual channels include the size of chunks (granularity), the nature of the medium (visual, auditory etc.), and structural properties (linearity, persistence, etc.)."
