Extending WWW to support Platform Independent Virtual Reality
David Raggett, Hewlett Packard Laboratories
At the first WWW conference in May '94 I ran a BOF (birds of a feather session) with Tim Berners-Lee on virtual reality markup languages and the Web, where I presented my vision for a platform independent 3D standard for the Web, using the term VRML to cover the idea. The next month I presented the concept to the Internet Society conference in Prague. At about the same time, Mark Pesce and Brian Behlendorf set up the VRML mailing list and VRML home page, and the rest is history. Anyway, here is my original paper:
This is a proposal to allow VR environments to be incorporated into the World Wide Web, thereby allowing users to "walk" around and push through doors to follow hyperlinks to other parts of the Web. VRML is proposed as a logical markup format for non-proprietary platform independent VR. The format describes VR environments as compositions of logical elements. Additional details are specified using a universal resource naming scheme supporting retrieval of shared resources over the network. The paper closes with ideas for how to extend this to support virtual presence teleconferencing.
This paper describes preliminary ideas for extending the World Wide Web to incorporate virtual reality (VR). By the end of this decade, the continuing advances in price/performance will allow affordable desktop systems to run highly realistic virtual reality models. VR will become an increasingly important medium, and the time is now ripe to develop the mechanisms for people to share VR models on a global basis. The author invites help in building a proof of concept demo and can be contacted at the email address given above.
VR systems at the low end of the price range show a 3D view into the VR environment together with a means of moving around and interacting with that environment. At the minimum you could use the cursor keys for moving forwards and backwards, and turning left and right. Other keys would allow you to pick things up and put them down. A mouse improves the ease of control, but the "realism" is primarily determined by the latency of the feedback loop from control to changes in the display. Joysticks and SpaceBalls improve control, but cannot compete with the total immersion offered by head mounted displays (HMDs). High end systems use magnetic tracking of the user's head and limbs, together with devices like 3D mice and datagloves, to improve the illusion yet further.
Sound can be just as important to the illusion as the visual simulation: the sound of a clock gets stronger as you approach it; an aeroplane roars overhead, crossing from one horizon to the other. High end systems allow for tracking of multiple moving sources of sound. Distancing is the technique whereby you get to see and hear more detail as you approach an object. The VR environment can include objects with complex behavior, just like their physical analogues in the real world, e.g. drawers in an office desk, telephones, calculators, and cars. The simulation of behavior is frequently more demanding computationally than updating the visual and aural displays.
The virtual environment may impose the same restrictions as the real world, e.g. gravity, and motion limited to walking, climbing up/down stairs, and picking up or putting down objects. Alternatively, users can adopt superpowers and fly through the air with ease, or even through walls! When using a simple interface, e.g. a mouse, it may be easier to learn if the range of actions at any time is limited to a small set of possibilities, e.g. moving forwards towards a staircase causes you to climb the stairs. A separate action is unnecessary, as the VR environment builds in assumptions about how people move around. Avatars are used to represent the user in the VR environment. Typically these are simple disembodied hands, which allow you to grab objects. This avoids the problems in working out the positions of the user's limbs and cuts down on the computational load.
Platform Independent VR
Is it possible to define an interchange format for VR environments which can be visualized on a broad range of platforms, from PCs to high-end workstations?
At first sight there is little relationship between the capabilities of systems at either extreme. In practice, though, many VR scenes are composed from common elements, e.g. rooms have floors, walls, ceilings, doors, windows, tables and chairs. Outdoors, there are buildings, roads, cars, lawns, trees etc. Perhaps we can draw upon experience with document conversion and the Standard Generalized Markup Language (SGML) [ref. 4] and specify VR environments at a logical level, leaving browsers to fill in the details according to the capabilities of each platform.
The basic idea is to compose VR environments from a limited set of logical elements, e.g. chair, door, and floor. The dimensions of some of these elements can be taken by default. Others, like the dimensions of a room, require lists of points, e.g. to specify the polygon defining the floor plan. Additional parameters give the color and texture of surfaces. A picture frame hanging on a wall can be specified in terms of a bitmapped image.
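The "logical elements with defaults" idea can be sketched in a few lines of code. The element names and default dimensions below are invented for illustration; nothing here is part of any proposed standard:

```python
# A minimal sketch of logical elements whose dimensions can be taken by
# default. Names and numbers are illustrative assumptions only.

DEFAULTS = {
    "chair": {"width": 0.5, "depth": 0.5, "height": 0.9},
    "door":  {"width": 0.8, "height": 2.0},
}

def make_element(kind, **overrides):
    """Create an element, filling unspecified dimensions from defaults."""
    element = {"kind": kind, **DEFAULTS.get(kind, {})}
    element.update(overrides)
    return element

# A room needs an explicit floor plan polygon; a chair can rely on defaults.
room = {"kind": "room",
        "floor": [(0, 0), (5, 0), (5, 4), (0, 4)],  # floor plan vertices
        "wall_color": "white"}
chair = make_element("chair", height=1.1)  # one default overridden
```

A browser would interpret such elements according to its own capabilities, filling in whatever detail the markup leaves unspecified.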
These elements can be described at a richer level of detail by reference to external models. The basic chair element would have a subclassification, e.g. office chair, which references a detailed 3D model, perhaps in the DXF format. Keeping such details in separate files has several advantages:

- Simplifies the high level VR markup format. This makes it easier to create and revise VR environments than with a flat representation.

- Models can be cached for reuse in other VR environments. Keeping the definition separate from the environment makes it easy to create models in terms of existing elements, and saves resources.

- Allows for sharing models over the net. Directory services can be used to locate where to retrieve the model from. In this way, a vast collection of models can be shared across the net.

- Alternative models can be provided according to each browser's capabilities. Authors can model objects at different levels of detail according to the capabilities of low, mid and high end machines. The appropriate choice can be made when querying the directory service, e.g. by including machine capabilities in the request. This kind of negotiation is already in place as part of the World Wide Web's HTTP protocol [ref. 3].
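The capability-based choice of model variant could work along the following lines. The tier names and file names are purely illustrative assumptions, not part of HTTP or of this proposal:

```python
# Hypothetical sketch: pick the most detailed model variant that a client
# machine can handle. The tiers and variants table are invented.

TIERS = ["low", "mid", "high"]  # ordered from least to most capable

def select_model(available, client_tier):
    """Return the best variant no more demanding than the client's tier."""
    limit = TIERS.index(client_tier)
    candidates = [t for t in TIERS[:limit + 1] if t in available]
    return available[candidates[-1]] if candidates else None

variants = {"low": "chair-16poly.dxf", "high": "chair-2000poly.dxf"}
```

A mid-range client would receive the low-detail chair here, since no mid-tier variant was published, while a high-end client gets the detailed one.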
Limiting VR environments to compositions of
known elements would be overly restrictive. To
avoid this, it is necessary to provide a means of
specifying novel objects, including their appearance
and behavior. The high level VR markup format
should therefore be dynamically extendable. The
built-in definitions are merely a short cut to avoid
the need to repeat definitions for common objects.
Universal Resource Locators (URLs)
The World Wide Web uses a common naming
scheme to represent hypermedia links and links to
shared resources. It is possible to represent nearly
any file or service with a URL [ref. 2].
The first part always identifies the method of
access (or protocol). The next part generally names
an Internet host and is followed by path information
for the resource in question. The syntax varies
according to the access method given at the start.
Here are some examples:
This is the CERN home page for the World Wide
Web project. The prefix "http" implies that this
resource should be obtained using the hypertext
transfer protocol (HTTP).
The searchable catalog of WWW resources at
CUI, in Geneva. Updated daily.
The Usenet newsgroup "comp.infosystems.www".
This is accessed via the NNTP protocol.
This names an anonymous FTP server, ftp.ifi.uio.no, which holds a collection of information relating to the Standard Generalized Markup Language (SGML).
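The access-method/host/path decomposition described above is mechanical enough to demonstrate in code. Python's standard library (a modern convenience, used here purely for illustration) splits a URL into exactly these parts; the URL shown is the well-known CERN project page of the period:

```python
from urllib.parse import urlparse

# Split a URL into the parts described above: scheme (access method),
# Internet host, and path information for the resource.
url = "http://info.cern.ch/hypertext/WWW/TheProject.html"
parts = urlparse(url)

print(parts.scheme)   # the access method, e.g. "http"
print(parts.netloc)   # the Internet host
print(parts.path)     # the path to the resource on that host
```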
Application to VR
The URL notation can be used in a VR markup format for:

- Referencing wire frame models, image tiles and other shared resources. For example, a 3D model of a vehicle or an office chair. Resources may be defined intensionally, and generated by the server in response to the user's request.

- Hypermedia links to other parts of the Web. Major museums could provide educational VR models on particular topics. Hypermedia links would allow students to easily move from one museum to another by "walking" through links between the different sites.
One drawback of URLs is that they generally depend on particular servers. Work is in progress to provide widespread support for lifetime identifiers that are location independent. This will make it possible to provide automated directory services akin to X.500 for locating the nearest copy of a resource.
MIME: Multipurpose Internet Mail Extensions
MIME describes a set of mechanisms for specifying and describing the format of Internet message bodies. It is designed to allow multiple objects to be sent in a single message, and to support the use of multiple fonts plus non-textual material such as images and audio fragments. Although it was conceived for use with email messages, MIME has a much wider applicability. The hypertext transfer protocol HTTP uses MIME for its request and response message formats. This allows servers to use a standard notation for describing document contents, e.g. image/gif for GIF images and text/html for hypertext documents in the HTML format. When a client receives a MIME message, the content type is used to invoke the appropriate viewer. The bindings are specified in the mailcap configuration file. This makes it easy to add local support for a new format without changes to your mailer or web browser. You simply install the viewer for the new format and then add the binding to your mailcap file.
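The content-type-to-viewer dispatch can be sketched as a simple lookup table. Real mailcap entries carry more fields; the bindings below are invented for illustration:

```python
# A toy sketch of mailcap-style dispatch: map a MIME content type to the
# command that views it, substituting the filename for %s. The bindings
# here are illustrative assumptions, not a real mailcap file.

BINDINGS = {
    "image/gif": "xv %s",
    "text/html": "Mosaic %s",
}

def viewer_command(content_type, filename):
    """Return the shell command to view a file of the given content type."""
    template = BINDINGS.get(content_type)
    if template is None:
        raise KeyError("no viewer bound for " + content_type)
    return template.replace("%s", filename)
```

Adding support for a new format, such as the proposed video/vrml, amounts to installing a viewer and adding one more entry to the table.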
The author anticipates the development of a public domain viewer for a new MIME content type: video/vrml. A platform independent VR markup language would allow people to freely exchange VR models, either as email messages or as linked nodes in the World Wide Web.
A sketch of the proposed VR markup language
A major distinction appears to be between indoor and outdoor scenes. Indoors, the scene is constructed from a set of interconnected rooms. Outdoors, you have a landscape of plains, hills and valleys upon which you can place buildings, roads, fields, lakes and forests etc. The following sketch is in no way comprehensive, but should give a flavour of how VRML would model VR environments. Much work remains to turn this vision into a practical reality.
The starting point is to specify the outlines of the rooms. Architects' drawings describe each building as a set of floors, each of which is described as a set of interconnected rooms. The plan shows the position of windows, doors and staircases. Annotations define whether a door opens inwards or outwards, and whether a staircase goes up or down. VRML directly reflects this hierarchical decomposition, with separate markup elements for buildings, floors, rooms, doors and staircases etc. Each element can be given a unique identifier. The markup for adjoining rooms uses this identifier to name interconnecting doors. Rooms are made up from floors, walls and ceilings. Additional attributes define the appearance, e.g. the color of the walls and ceiling, the kind of plaster coving used to join walls to the ceiling, and the style of windows. The range of elements and their permitted attributes are defined by a formal specification analogous to an SGML document type definition.
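The hierarchical decomposition and the identifier-based door references can be sketched as nested data, with a simple consistency check of the kind a browser or validator might apply. All element and attribute names here are assumptions for the example, not proposed VRML syntax:

```python
# Illustrative sketch: a building holds floors, floors hold rooms, and
# adjoining rooms name a shared door by its unique identifier.

building = {
    "floors": [
        {"rooms": [
            {"id": "room1", "doors": ["d1"]},
            {"id": "room2", "doors": ["d1", "d2"]},  # d1 joins room1/room2
        ]},
    ],
    "doors": {"d1": {"opens": "inwards"}, "d2": {"opens": "outwards"}},
}

def check_door_ids(b):
    """Every door identifier named by a room must be defined for the building."""
    defined = set(b["doors"])
    used = {d for f in b["floors"] for r in f["rooms"] for d in r["doors"]}
    return used <= defined
```

A formal specification analogous to an SGML DTD would pin down exactly which elements may nest where and which attributes they carry.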
Rooms have fittings: carpets, paintings, bookcases, kitchen units, tables and chairs etc. A painting is described by reference to an image stored separately (like inlined images in HTML). The browser retrieves this image and then applies a parallax transformation to position the painting at the designated location on the wall. Wallpaper can be modelled as a tiling, where each point on the wall maps to a point in an image tile for the wallpaper. This kind of texture mapping is computationally expensive, and low power systems may choose to employ a uniform shading instead. Views through windows to the outside can be approximated by mapping the line of sight to a point on an image acting as a backcloth, effectively at infinity. Kitchen units, tables and chairs etc. are described by reference to external models. A simple hierarchical naming scheme can be used to substitute a simpler model when the more detailed one would overload a low power browser.
Hypermedia links can be represented in a variety of ways. The simple approach used in HTML documents for depicting links is almost certainly inadequate. A door metaphor makes good sense when transferring to another VR model or to a different location in the current model. If the link is to an HTML document, then an obvious metaphor is opening a book (by tapping on it with your virtual hand?). Similarly, a radio or audio system makes sense for listening to an audio link, and a television for viewing an MPEG movie.
A simple way of modelling the ground into plains, hills and valleys is to attach a rubber sheet to a set of vertical pins of varying lengths, placed at irregular locations: pin i at (x_i, y_i) holds the sheet at height z_i, so the surface is a single-valued function z = f(x, y), where x and y are orthogonal axes in the horizontal plane. Smooth terrain can be described by interpolating gradients specified at selected points. The process is only applied within polygons for which all vertices have explicit gradients. This makes it possible to restrict smoothing to selected regions as needed.
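One simple way to realise such a rubber sheet, offered as an illustration rather than anything the proposal mandates, is inverse-distance weighting of the pin heights:

```python
# Sketch: estimate the sheet height at (x, y) from irregularly placed pins.
# Inverse-distance weighting is one plausible interpolation; the surface is
# single-valued by construction, as the text requires.

def sheet_height(x, y, pins, power=2.0):
    """pins is a list of (xi, yi, zi); returns interpolated z at (x, y)."""
    num = den = 0.0
    for xi, yi, zi in pins:
        d2 = (x - xi) ** 2 + (y - yi) ** 2
        if d2 == 0.0:
            return zi          # exactly on a pin: use its height directly
        w = d2 ** (-power / 2)  # nearer pins get larger weights
        num += w * zi
        den += w
    return num / den

pins = [(0, 0, 1.0), (10, 0, 3.0), (0, 10, 3.0)]
```

A point equidistant from all three pins above lands at the mean of their heights; gradient-based smoothing, as described in the text, would refine this further within selected polygons.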
The next step is to add scenery onto the underlying ground surface:

- Texture wrapping - mapping an aerial photograph onto the ground surface. This works well if the end-user is flying across a landscape at a sufficient height that parallax effects can be neglected for surface detail like trees and buildings. Realism can be further enhanced by including an atmospheric haze that obscures distant details.
- Plants - these come in two categories: point-like objects such as individual trees, and area-like objects such as forests, fields, weed patches, lawns and flower beds. A tree can be placed at a given (x, y) coordinate and scaled to a given height. A range of tree types can be used, e.g. deciduous (summer/fall) and coniferous. The actual appearance of each type of tree is specified in a separate model, so VRML only needs the class name and a means of specifying the model's parameters (in many cases defaults will suffice). Extended objects like forests can be rendered by repeating an image tile or generated as a fractal texture, using attributes to reference external definitions for the image tile or texture.
- Water - streams, rivers and waterfalls; ponds, lakes and the sea. The latter involves attributes for describing the nature of the beach: muddy estuary, sandy, rocky, and cliffs.

- Borders - fences, hedges, walls etc., which are fundamentally line-like objects.

- Roads - number of lanes, types of junctions, details for signs, traffic lights etc. Each road can be described in terms of a sequence of points along its center and its width. Features like road lights and crash barriers can be generated by default according to the attributes describing the kind of road. Road junctions could be specified in detail, but it seems possible to generate much of this locally on the basis of the nature of the junction and the end points of the roads it connects: freeway exit, clover-leaf junction, 4-way stop, round-about etc. In general VRML should avoid specifying detail where this can be inferred by the browsing tool. This reduces the load on the network and allows browsers to show the scene in the detail appropriate to the power of each platform. Successive generations of kit can add more and more detail, leading to progressively more realistic scenes without changes to the original VRML documents.
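The centerline-plus-width description of a road is enough for a browser to generate the road's geometry locally. A minimal 2D sketch of that inference, assuming straight segments:

```python
# Sketch: expand a road centerline and width into left/right edge points,
# the kind of detail a browser can infer rather than transmit.
import math

def road_edges(centerline, width):
    """centerline: list of (x, y) points; returns (left_pts, right_pts)."""
    half = width / 2.0
    left, right = [], []
    n = len(centerline)
    for i in range(n):
        # Use the adjacent segment's direction (last segment at the end).
        a = centerline[i] if i + 1 < n else centerline[i - 1]
        b = centerline[i + 1] if i + 1 < n else centerline[i]
        dx, dy = b[0] - a[0], b[1] - a[1]
        length = math.hypot(dx, dy)
        nx, ny = -dy / length, dx / length   # unit normal to the segment
        x, y = centerline[i]
        left.append((x + nx * half, y + ny * half))
        right.append((x - nx * half, y - ny * half))
    return left, right
```

Lane markings, lights and barriers could be generated the same way from the road's attributes, keeping the transmitted markup compact.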
- Buildings - houses, skyscrapers, factories, filling stations, barns, silos, etc. Most buildings can be specified using constructive geometry, i.e. as a set of intersecting parts, each of which is defined by a rectangular base and some kind of roof. This approach describes buildings in a compact style and makes it feasible for VRML to deal with a rich variety of building types. The texture of walls and roofs, as well as the style of windows and doors, can be defined by reference to external models.
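The constructive-geometry idea, a building as a union of intersecting rectangular parts, can be sketched with a point-membership test. Roof shapes are omitted and all names are illustrative:

```python
# Sketch of constructive geometry for buildings: a building is a union of
# axis-aligned box parts; a point is inside the building if it is inside
# any part. Roofs and rotations are left out of this toy version.

def box(x, y, w, d, h):
    """Part with rectangular base: corner (x, y), width w, depth d, height h."""
    return (x, y, w, d, h)

def inside(part, px, py, pz):
    x, y, w, d, h = part
    return x <= px <= x + w and y <= py <= y + d and 0 <= pz <= h

def inside_building(parts, px, py, pz):
    return any(inside(p, px, py, pz) for p in parts)

# An L-shaped house built from two intersecting rectangular parts.
house = [box(0, 0, 10, 5, 6), box(0, 0, 5, 10, 6)]
```

Two overlapping boxes already yield an L-shaped plan; richer building types follow from adding parts, keeping the description compact.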
- Vehicles, and other moving objects. A scene could consist of a number of parked vehicles plus a number of vehicles moving along the road. Predetermined trajectories are rather unexciting. A more interesting approach is to let the behavior of the set of vehicles emerge from simple rules governing the motion of each vehicle. This could also apply to pedestrians moving on a sidewalk. The rules would be defined in scripts associated with the model and not part of VRML itself. The opportunities for several users to meet up in a shared VR scene are discussed in the next section.
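A deliberately tiny example of behavior emerging from a per-vehicle rule: each car tries to hold a target speed but slows to keep a safe gap to the car ahead. The rule and its parameters are invented for illustration and would live in a script outside VRML:

```python
# Sketch: one update step for cars in a single lane, ordered front to back.
# Each car applies the same local rule; queueing behavior emerges.

def step(positions, dt=0.1, target=10.0, safe_gap=5.0):
    """positions: car positions front-to-back; returns the new positions."""
    new = []
    for i, x in enumerate(positions):
        if i == 0:
            speed = target                      # lead car cruises freely
        else:
            gap = positions[i - 1] - x          # distance to the car ahead
            speed = target if gap > safe_gap else max(0.0, target * gap / safe_gap)
        new.append(x + speed * dt)
    return new
```

Iterating this step produces platoons that stretch and compress without any global choreography, which is the appeal of rule-based motion over predetermined trajectories.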
- Distant scenery, e.g. a mountain range on the horizon. This is effectively at infinity and can be represented as a backcloth hung in a cylinder around the viewer. It could be implemented using bitmap images (e.g. in GIF or JPEG formats). One issue is how to make the appearance change according to the weather/time of day.

Outdoor scenes wouldn't be complete without a range of different weather types! Objects should gradually lose their color and contrast as their distance increases. Haze is useful for washing out details, as the browser can then ignore objects beyond a certain distance. The opacity of the haze will vary according to the weather and time of day. Fractal techniques can be used to synthesize cloud formations. The color of the sky should vary as a function of the angle from the sun and the angle above the horizon. For VRML, the weather would be characterized as a set of predetermined weather types.

The illusion will be more complete if you can see progressively more detail the closer you get. Unfortunately, it is impractical to explicitly specify VR models in arbitrary detail. Another approach is to let individual models reference more detailed models in a chain of progressively finer detail, e.g. a model that defines a lawn as a green texture can reference a model that specifies how to draw individual blades of grass. The latter is only needed when the user zooms in on the lawn. The browser then runs the more detailed model to generate a forest of grass blades.
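The distance-dependent loss of color and contrast can be modelled with a simple exponential fog factor. The formula and the visibility parameter below are common graphics practice offered as an illustration, not something the proposal prescribes:

```python
# Sketch: blend an object's color toward the haze color as distance grows.
# f = exp(-d / visibility) is a standard fog model; once f is negligible
# the browser can skip drawing the object entirely.
import math

def hazed(color, haze, distance, visibility=100.0):
    """color, haze: (r, g, b) components in 0..255; returns blended color."""
    f = math.exp(-distance / visibility)
    return tuple(round(f * c + (1 - f) * h) for c, h in zip(color, haze))
```

Varying the visibility parameter with the weather type and time of day gives the changing opacity the text calls for.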
Actions and Scripts
Simple primitive actions are part of the VRML model, for example the ability of the user to change position/orientation and to pick up/put down or "press" objects. Other behaviour is the responsibility of the various objects and lies outside the scope of VRML. Thus a virtual calculator would allow users to press keys and carry out calculations just like the real thing. This rich behaviour is specified as part of the model for the calculator object class, along with details of its appearance. A scripting language is needed for this, but it will be independent of VRML, and indeed there could be a variety of different languages. The format negotiation mechanism in HTTP seems appropriate to this, as it would allow browsers to indicate which representations are supported when sending requests to servers.
Another issue is how to provide realism without excessive computational demands. To date the computer graphics community has focussed on mathematical models for realism, e.g. ray tracing with detailed models for how objects scatter or transmit light. An alternative approach could draw upon artistic metaphors for rendering scenes. Paintings are not like photographs: artists don't try to capture all details, rather they aim to distill the essentials with a much smaller number of brush strokes. This is akin to symbolic representations of scenes, and we may be able to apply it to VR. As an example, consider the difficulty of modelling the folds of cloth on your shirt as you move your arm around. Modelling this computationally is going to be very expensive; perhaps a few rules can be used to draw in folds when you fold your arms.
Virtual Presence Teleconferencing
The price/performance of computer systems currently doubles about every 15 months. This has happened for the last five years, and industry pundits see no end in sight. It therefore makes sense to consider approaches which today are impractical, but will soon come within reach.
A world without people would be a dull place indeed! The markup language described above allows us to define shared models of VR environments, so the next step is to work out how to allow people to meet in these environments. This comes down to two parts:

- The protocols needed to ensure that each user sees an up to date view of all the other people in the same virtual location, whether this is a room or somewhere outdoors.

- A way of visualising people in the virtual environment; this in turn begs the question of how to sense each user - their expressions, speech and movements.

For people to communicate effectively, the latency for synchronizing models must be of order 100 milliseconds or less. You can get by with longer delays, but it gets increasingly difficult. Adopting a formal system for turn taking helps, but you lose the ability for non-verbal communication. In meetings, it is common to exchange glances with a colleague to see how he or she is reacting to what is being said. The rapid feedback involved in such exchanges calls for high resolution views of people's faces together with very low latency.
A powerful technique will be to use video cameras to build real-time 3D models of people's faces. As the skull shape is fixed, the changes are limited to the orientation of the skull and the relative position of the jaw. The fine details in facial expressions can be captured by wrapping video images onto the 3D model. This approach greatly reduces the bandwidth needed to project lifelike figures into the VR environment. The view of the back of the head and the ears etc. is essentially unchanging and can be filled in from earlier shots, or if necessary synthesized from scratch to match visible cues.

In theory, the approach needs a smaller bandwidth than conventional video images, as head movements can be compressed into a simple change of coordinates. Further gains in bandwidth could be achieved, at a cost in accuracy, by characterizing facial gestures in terms of a composition of "identikit" stereotypes, e.g. shots of mouths which are open or closed, smiling or frowning. The face is then built up by blending the static model of the user's face and jaw with the stereotypes for the mouth, cheeks, eyes, and forehead.
Although head mounted displays offer total immersion, they also make it difficult to sense the user's facial expressions, and they are uncomfortable to wear. Virtual presence teleconferencing is therefore more likely to use conventional displays together with video cameras mounted around the user's workspace. Lightweight headsets are likely to be used in preference to stereo or quadraphonic loudspeaker systems, as they offer greater auditory realism as well as avoiding trouble when sound spills over into neighboring work areas.

The cameras also offer the opportunity for hands-free control of the user's position in the VR environment. Tracking of hands and fingers could be used for gesture control without the need for 3D mice or spaceballs etc. Another idea is to take cues from head movements, e.g. moving your head from side to side could be exaggerated in the VR environment to allow users to look from side to side without needing to look away from the display being used to visualize that environment.
For workstations running the X11 windowing system, the PEX library for 3D graphics is now available on most platforms. This makes it practical to start developing proof of concept platform independent VR. The proposed VRML interchange format could be used within the World Wide Web or for email messages. All users would need to do is download a public domain VRML browser and add it to their mailcap file. The author is interested in getting in touch with people willing to collaborate in turning this vision into a reality.
References
1. "Hypertext Markup Language (HTML)", Tim Berners-Lee, January 1993.
2. "Uniform Resource Locators", Tim Berners-Lee, January 1992.
3. "Protocol for the Retrieval and Manipulation of Textual and Hypermedia Information" (HTTP), Tim Berners-Lee, 1993.
4. "The SGML Handbook", Charles F. Goldfarb, Clarendon Press, Oxford, 1990.