W3C

WebDriver

W3C Working Draft 10 July 2012

This version:
http://www.w3.org/TR/2012/WD-webdriver-20120710/
Latest published version:
http://www.w3.org/TR/webdriver/
Latest editor's draft:
http://dvcs.w3.org/hg/webdriver/raw-file/default/webdriver-spec.html
Editors:
Simon Stewart, Google
David Burns, Mozilla

Abstract

This specification defines the WebDriver API, a platform- and language-neutral interface that allows programs or scripts to introspect into, and control the behaviour of, a web browser. The WebDriver API is primarily intended to allow developers to write tests that automate a browser from a separate controlling process, but may also be implemented in such a way as to allow in-browser scripts to control a browser.

The WebDriver API is defined by a set of interfaces to discover and manipulate DOM elements on a page, and to control the behaviour of the containing browser.

This specification also includes a non-normative reference serialisation (to JSON) of the interface's invocations and responses that may be useful for browser vendors.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

If you wish to make comments regarding this document, please email feedback to public-browser-tools-testing@w3.org. All feedback is welcome, and the editors will read and consider all feedback.

This specification is still under active development and may not be stable. Any implementors who are not actively participating in the preparation of this specification may find unexpected changes occurring. It is suggested that any implementors join the WG for this specification. Despite not being stable, it should be noted that this specification is strongly based on an existing Open Source project — Selenium WebDriver — and the existing implementations of the API defined within that project.

This document was published by the Browser Testing and Tools Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-browser-tools-testing@w3.org (subscribe, archives). All feedback is welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1. Introduction

The WebDriver API aims to provide a synchronous API that can be used for a variety of use cases, though it is primarily designed to support automated testing of web apps.

1.1 Intended Audience

This specification is intended for implementors of the WebDriver API. It is not intended as light bed time reading.

1.2 Relationship of WebDriver API and Existing Specifications

Where possible and appropriate, the WebDriver API references existing specifications. For example, the list of boolean attributes for elements is drawn from the HTML5 specification. When references are made, this specification will link to the relevant sections.

1.3 Naming the Two Sides of the API

The WebDriver API can be thought of as a client/server process. However, implementation details can mean that this terminology becomes confusing. For this reason, the two sides of the API are called the "local" and the "remote" ends.

Local
The user-facing API. Command objects are sent and Response objects are consumed by the local end of the WebDriver API. It can be thought of as being "local" to the user of the API.
Remote
The implementation of the user-facing API. Command objects are consumed and Response objects are sent by the remote end of the WebDriver API. The implementation of the remote end may be on a machine remote from the user of the local end.

There is no requirement that the local and remote ends be in different processes.

2. Commands and Responses

The WebDriver API is designed to be used both in-process and out-of-process. The IDL given in this specification and summarized in Appendix XXXX should be used as the basis for the user-facing API. When used out-of-process, the WebDriver API defines command/response objects that must be used. How these are encoded and transmitted between the browser being automated and the user of the API is left undefined, but a non-normative implementation of this as JSON over HTTP is given in Appendix XXXX.

2.1 Command

A command represents a call to the remote end of the WebDriver API.

interface Command {
    attribute string     name;
    attribute dictionary parameters;
    attribute string     sessionId;
};

2.1.1 Attributes

name of type string
The case-sensitive name of the command to execute
parameters of type dictionary
A map of the named parameters to an object representing its value.
sessionId of type string
A reference to the session to which this command is associated.

2.2 Response

A response represents the value returned from the remote end of the WebDriver API.

interface Response {
    readonly attribute string  sessionId;
    readonly attribute integer status;
    readonly attribute object  value;
};

2.2.1 Attributes

sessionId of type string, readonly
A reference to the session to which this command is associated.
status of type integer, readonly
The status code representing the success or failure of the method. Anything other than 0 indicates a failure of some kind.
value of type object, readonly
The return value of the method call. Its type is determined by the Command that has been executed. In the specification, each command definition will make clear what the expected return type is.

2.3 Processing Additional Fields on Commands and Responses

Any Command or Response may contain fields in addition to those listed above. The content of these fields must be maintained, unaltered, by any intermediate processing nodes. There is no requirement to maintain the ordering of fields.

Note

This requirement exists to allow for extension of the protocol, and to allow implementors to decorate Commands and Responses with additional information, perhaps giving context to a series of messages, or providing security information.
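
This example is non-normative. It sketches how an intermediate node might decode and re-encode a Command while carrying unknown fields through unaltered. The JSON encoding, the helper names (decode_command, encode_command) and the "extra" attribute are assumptions made for illustration; the specification leaves the encoding undefined.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

# The three fields defined by the Command interface.
KNOWN_FIELDS = ("name", "parameters", "sessionId")


@dataclass
class Command:
    name: str
    parameters: Dict[str, Any] = field(default_factory=dict)
    sessionId: str = ""
    # Hypothetical holder for any additional fields; never interpreted here.
    extra: Dict[str, Any] = field(default_factory=dict)


def decode_command(text: str) -> Command:
    """Parse a JSON-encoded Command, keeping unknown fields aside."""
    raw = json.loads(text)
    extra = {k: v for k, v in raw.items() if k not in KNOWN_FIELDS}
    return Command(raw["name"], raw.get("parameters", {}),
                   raw.get("sessionId", ""), extra)


def encode_command(command: Command) -> str:
    """Serialize a Command, writing the additional fields back unchanged."""
    body = dict(command.extra)
    body.update(name=command.name, parameters=command.parameters,
                sessionId=command.sessionId)
    return json.dumps(body)
```

Field ordering is not preserved by this sketch, which the text above permits.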

2.4 Error Codes

The WebDriver API indicates the success or failure of a command invocation via a status code on the Response object. The following values are used and have the following meanings.

Status Code Summary Detail
0 Success The command executed successfully.
7 NoSuchElement An element could not be located on the page using the given search parameters.
8 NoSuchFrame A request to switch to a frame could not be satisfied because the frame could not be found.
9 UnknownCommand The requested resource could not be found, or a request was received using an HTTP method that is not supported by the mapped resource.
10 StaleElementReference An element command failed because the referenced element is no longer attached to the DOM.
11 ElementNotVisible An element command could not be completed because the element is not visible on the page.
12 InvalidElementState An element command could not be completed because the element is in an invalid state (e.g. attempting to click a disabled element).
13 UnknownError An unknown server-side error occurred while processing the command.
15 ElementIsNotSelectable An attempt was made to select an element that cannot be selected.
17 JavaScriptError An error occurred while executing user supplied JavaScript.
19 XPathLookupError An error occurred while searching for an element by XPath.
21 Timeout An operation did not complete before its timeout expired.
23 NoSuchWindow A request to switch to a different window could not be satisfied because the window could not be found.
24 InvalidCookieDomain An illegal attempt was made to set a cookie under a different domain than the current page.
25 UnableToSetCookie A request to set a cookie's value could not be satisfied.
26 UnexpectedAlertOpen A modal dialog was open, blocking this operation.
27 NoAlertOpenError An attempt was made to operate on a modal dialog when one was not open.
28 ScriptTimeout A script did not complete before its timeout expired.
29 InvalidElementCoordinates The coordinates provided to an interactions operation are invalid.
30 IMENotAvailable IME was not available.
31 IMEEngineActivationFailed An IME engine could not be started.
32 InvalidSelector Argument was an invalid selector (e.g. XPath/CSS).
33 SessionNotCreatedException A new session could not be created.
34 MoveTargetOutOfBounds The target for mouse interaction is not on the viewport and cannot be brought into the viewport.
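
This example is non-normative. The table above can be captured as an enumeration so that the local end can turn a failing Response into an error. The numeric values and summary names are taken from the table; the WebDriverError class and the check function are hypothetical names, and the Response is assumed to be a plain dictionary.

```python
from enum import IntEnum


class Status(IntEnum):
    """Status codes and summaries copied from the table above."""
    Success = 0
    NoSuchElement = 7
    NoSuchFrame = 8
    UnknownCommand = 9
    StaleElementReference = 10
    ElementNotVisible = 11
    InvalidElementState = 12
    UnknownError = 13
    ElementIsNotSelectable = 15
    JavaScriptError = 17
    XPathLookupError = 19
    Timeout = 21
    NoSuchWindow = 23
    InvalidCookieDomain = 24
    UnableToSetCookie = 25
    UnexpectedAlertOpen = 26
    NoAlertOpenError = 27
    ScriptTimeout = 28
    InvalidElementCoordinates = 29
    IMENotAvailable = 30
    IMEEngineActivationFailed = 31
    InvalidSelector = 32
    SessionNotCreatedException = 33
    MoveTargetOutOfBounds = 34


class WebDriverError(Exception):
    """Hypothetical local-end error carrying the status and the value."""

    def __init__(self, status, value):
        super().__init__(f"{status.name} ({int(status)}): {value}")
        self.status = status
        self.value = value


def check(response):
    """Return the value of a successful Response, or raise on any failure.

    A status that is not in the table raises ValueError from Status().
    """
    status = Status(response["status"])
    if status is not Status.Success:
        raise WebDriverError(status, response.get("value"))
    return response.get("value")
```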

3. Browser Capabilities

Different browsers support different levels of various specifications. For example, some support SVG or the CSS Selector API, but only browsers that implement HTML5 will support LocalStorage. The WebDriver API provides a mechanism to query the supported capabilities of a browser. Each broad area of functionality within the WebDriver API has an associated capability string. Whether a particular capability must or may be supported — as well as fallback mechanisms for handling those cases where a capability is not supported — is discussed where the capability string is defined.

3.1 Capabilities

interface Capabilities {
    readonly attribute dictionary capabilities;
    boolean has (string capabilityName);
    (string or boolean or number)?    get (string capabilityName);
};

3.1.1 Attributes

capabilities of type dictionary, readonly
The underlying collection of capabilities, represented as a dictionary mapping strings to values which may be of type boolean, numerical or string.

3.1.2 Methods

get
Get the value of the key matching capabilityName in the underlying capabilities, or null if no value is defined.
Parameter: capabilityName, of type string
Return type: (string or boolean or number), nullable
has
Queries the underlying capabilities to see whether the value is set. This will return true if the capabilities contain a key with the given capabilityName and the value of that key is defined. If the value is a boolean, this function will return that boolean value. If the value is null, an empty string or a 0 then this method will return false.
Parameter: capabilityName, of type string
Return type: boolean

A Capabilities instance must be immutable. If a mutable Capabilities instance is required, then MutableCapabilities must be used instead.

3.2 MutableCapabilities

interface MutableCapabilities : Capabilities {
    void set (string capabilityName, (string or boolean or number)? value);
};

3.2.1 Methods

set
Set the value of the given capabilityName to the given value. If the value is not a boolean, numerical type or a string, a WebDriverException should be thrown.
Parameters: capabilityName, of type string; value, of type (string or boolean or number), nullable
Return type: void
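
This example is non-normative. It is a minimal sketch of the two interfaces above, written to make the behaviour of has() concrete. Accepting null in set() follows the nullable type in the IDL and is an interpretation; the read-only view returned by the capabilities attribute is one possible way to keep a Capabilities instance immutable.

```python
from types import MappingProxyType


class WebDriverException(Exception):
    """Raised by set() when a value has an unsupported type."""


class Capabilities:
    def __init__(self, values=None):
        self._values = dict(values or {})

    @property
    def capabilities(self):
        # Read-only view, so callers cannot alter the underlying dictionary.
        return MappingProxyType(self._values)

    def get(self, capability_name):
        # None stands in for null when no value is defined.
        return self._values.get(capability_name)

    def has(self, capability_name):
        value = self._values.get(capability_name)
        if isinstance(value, bool):
            return value  # a boolean value is returned as-is
        # null, the empty string and 0 all count as "not set"
        return value is not None and value != "" and value != 0


class MutableCapabilities(Capabilities):
    def set(self, capability_name, value):
        # Assumption: null is accepted because the IDL marks value nullable.
        if value is not None and not isinstance(value, (str, bool, int, float)):
            raise WebDriverException(
                f"unsupported value type for {capability_name!r}")
        self._values[capability_name] = value
```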

4. Sessions

Non-normative summary: A session is equivalent to a single instantiation of a particular browser, including all child windows. The WebDriver API gives each session a UUID stored as a string that can be used to differentiate one session from another, allowing multiple browsers to be controlled on the same machine if needed, and allowing sessions to be routed via a multiplexer. This ID is sent with every Command and returned with every Response and is stored on the sessionId field.

4.1 Creating a Session

The process for successfully creating a session follows.

  1. The local end creates a new Capabilities or MutableCapabilities instance describing the desired capabilities for the session. The Capabilities object may be empty, but must be defined.
  2. The local end creates a new Command with the "name" being "newSession" and the "parameters" containing an entry named "desiredCapabilities" with the value set to the Capabilities instance from the previous step. An optional "requiredCapabilities" entry may also be created and populated with a Capabilities instance. The "sessionId" field should be left empty.
  3. The Command is serialized and transmitted to the remote end.
  4. The remote end examines the two Capabilities parameters, and creates a new session matching as many of the Capabilities as possible from the "desiredCapabilities" and all of the Capabilities given in the "requiredCapabilities". How the new session is created depends on the implementation of this specification. In the case of a browser automation framework, it is expected that a new instance of the browser is started if possible.
    • If any of the "requiredCapabilities" cannot be fulfilled by the new session, the remote end must quit the session and return the SessionNotCreatedException error code. The error message should list all unmet required capabilities though only the first unmet required capability must be given.
  5. The session must be assigned a UUID which must be unique for each session (by definition). Generating the UUID may occur before the session is created. If the Command object had the "sessionId" field set, this may be discarded in favour of the freshly generated UUID. Because of this, it is recommended that UUID generation be done on the remote end. If the UUID has already been used, a Response must be sent with the status code set to SessionNotCreatedException and the value being an explanation that the UUID has previously been used.
  6. The remote end creates a new Response object.
    • The "sessionId" field is assigned the UUID associated with this session.
    • The session is described by filling a Capabilities instance with keys matching the parts of this specification that can be fulfilled. This is assigned to the "value" field of the Response. This field must be filled.
    • The "status" field is set to the Success status code.
  7. The Response is transmitted or returned back to the local end.

There is no requirement for the local end to validate that some or all of the fields sent on the Capabilities associated with the Command match those returned in the Response.
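
This example is non-normative. It sketches steps 4 to 6 as they might look on the remote end. Commands and Responses are modelled as plain dictionaries, and the function name new_session, the list of candidate browsers and the used_ids set are all hypothetical; how a real remote end discovers and starts browsers is implementation specific.

```python
import uuid

SUCCESS = 0
SESSION_NOT_CREATED = 33  # SessionNotCreatedException in the status table


def new_session(command, candidates, used_ids):
    """Pick a browser description and build the Response for "newSession".

    candidates: non-empty list of dicts, one per browser the remote end
                could start (hypothetical representation).
    used_ids:   set of session IDs that have already been handed out.
    """
    params = command["parameters"]
    desired = params["desiredCapabilities"]            # may be empty, must exist
    required = params.get("requiredCapabilities", {})  # optional entry

    def unmet(candidate):
        return sorted(k for k, v in required.items() if candidate.get(k) != v)

    def desired_hits(candidate):
        return sum(1 for k, v in desired.items() if candidate.get(k) == v)

    # All required capabilities first, then as many desired ones as possible.
    best = min(candidates, key=lambda c: (len(unmet(c)), -desired_hits(c)))
    missing = unmet(best)
    if missing:
        return {"sessionId": "", "status": SESSION_NOT_CREATED,
                "value": "Unmet required capabilities: " + ", ".join(missing)}

    # Any sessionId supplied on the Command is discarded in favour of a
    # freshly generated UUID.
    session_id = str(uuid.uuid4())
    if session_id in used_ids:
        return {"sessionId": "", "status": SESSION_NOT_CREATED,
                "value": "The session ID has previously been used"}
    used_ids.add(session_id)
    return {"sessionId": session_id, "status": SUCCESS, "value": dict(best)}
```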

4.1.1 Capability Names

The following keys are to be used in the Capabilities instances.

browserName
The name of the desired browser as a string
browserVersion
The version number of the browser, given as a string
platformName
The OS that the browser is running on, matching any of the platform names given below.
platformVersion
The version of the OS that the browser is running on as a string.

4.1.1.1 Platform Names

These should be named in the style of enums in C-like languages.

  • ANDROID
  • IOS
  • LINUX
  • MAC
  • UNIX
  • WINDOWS

In addition "ANY" may be used to indicate the underlying OS is either unknown or does not matter. Implementors may add additional platform names.

4.1.2 Error Handling

The following status codes must be returned by the "newSession" command. Please consult the table in the "commands" section for numerical values:

Success
The session was successfully created. The "value" field of the Response must contain a Capabilities object describing the session
Timeout
The new session could not be created within the time allowed for command execution on the remote end. This time may be infinite. The "value" field of the Response should contain a string explaining that a timeout has occurred, but it may be left empty or filled with the empty string.
UnknownError
An unhandled error of some sort has occurred. The "value" field of the Response should contain a more detailed description of the error.

4.1.3 Remote End Matching of Capabilities

This section is non-normative.

The suggested order for comparing keys in the Capabilities instance when creating a session is:

  1. browserName
  2. browserVersion
  3. platformName
  4. platformVersion

For all comparisons, if the key is missing (as determined by a call to Capabilities.has()), that particular criterion shall not factor into the comparison.
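
The suggested comparison order can be sketched as follows. The key names are those listed in section 4.1.1, capabilities are modelled as plain dictionaries, and the function name first_mismatch is hypothetical. Treating a platform name of "ANY" as matching every platform is an interpretation of the text in section 4.1.1.1, not a stated rule.

```python
MATCH_ORDER = ("browserName", "browserVersion", "platformName", "platformVersion")


def first_mismatch(requested, actual):
    """Return the first key, in the suggested order, on which the requested
    capabilities differ from the actual ones, or None if all of them match."""
    for key in MATCH_ORDER:
        wanted = requested.get(key)
        if wanted is None or wanted == "":
            continue  # a missing key does not factor into the comparison
        if key == "platformName" and wanted == "ANY":
            continue  # assumption: "ANY" means the OS does not matter
        if actual.get(key) != wanted:
            return key
    return None
```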

6. Controlling Windows

6.1 Defining "window" and "frame"

Within this specification, a window equates to anything that would be referred to as "window.top" in JavaScript. Put another way, within this spec browser tabs are counted as separate windows.

TODO: define "frame"

6.2 Window Handles

Each window has a "window handle" associated with it. This is an opaque string which is unique to the window. The suggested implementation is as a UUID. The "getWindowHandle" command can be used to obtain the window handle for the window that commands are currently acting upon:

Command Name getWindowHandle
Parameters "sessionId" {string} The key that identifies which session this request is for.
Return Value string

6.3 Iterating Over Windows

Command Name getWindowHandles
Parameters "sessionId" {string} The key that identifies which session this request is for.
Return Value Array.<string>

This array of returned strings must contain a handle for every window associated with the browser session and no others. In addition, at the time of collecting the window handles the JavaScript expression "window.top.closed" must evaluate to false.

The ordering of the keys is not defined, but should be determined by iterating over each top level browser window and returning the tabs within that window before iterating over the tabs of the next top level browser window. For example, in the diagram below, the window handles should be returned as the handles for: win1tab1, win1tab2, win2.

[Figure: two top level windows. The first window has two tabs, labelled win1tab1 and win1tab2. The second window has only one tab, labelled win2.]
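
The suggested ordering amounts to flattening the tabs of each top level window in turn while skipping anything already closed. In this non-normative sketch each tab is a dictionary with hypothetical "handle" and "closed" keys.

```python
def ordered_window_handles(top_level_windows):
    """Return handles window by window, tabs in order, omitting closed tabs."""
    handles = []
    for tabs in top_level_windows:
        for tab in tabs:
            if not tab["closed"]:
                handles.append(tab["handle"])
    return handles


# The arrangement described in the figure above.
browser = [
    [{"handle": "win1tab1", "closed": False},
     {"handle": "win1tab2", "closed": False}],
    [{"handle": "win2", "closed": False}],
]
```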

6.4 Closing Windows

Command Name close
Parameters "sessionId" {string} The key that identifies which session this request is for.
Return Value None

The close command closes the window that commands are currently being sent to. If this means that a call to get the list of window handles returns an empty list, then this close command must be the equivalent of calling "quit". In all other cases, control must be returned to the calling process once the window has been closed or an alert is displayed by the closing window.

Once the window has closed, future commands must return an error NoSuchWindowException until a new window is selected for receiving commands.
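
This example is non-normative. It models the bookkeeping that the close command implies: closing the last window is equivalent to quit, and after any close no window is selected until the user switches to one. The Session class and its method names are hypothetical.

```python
class NoSuchWindowException(Exception):
    """Hypothetical exception; corresponds to status code 23 (NoSuchWindow)."""


class Session:
    def __init__(self, handles, current):
        self.handles = list(handles)
        self.current = current  # handle of the window receiving commands
        self.has_quit = False

    def get_window_handle(self):
        if self.current is None:
            raise NoSuchWindowException("no window is selected")
        return self.current

    def switch_to_window(self, handle):
        if handle not in self.handles:
            raise NoSuchWindowException(handle)
        self.current = handle

    def close(self):
        self.handles.remove(self.get_window_handle())
        self.current = None  # later commands fail until a window is selected
        if not self.handles:
            self.quit()  # closing the last window is equivalent to "quit"

    def quit(self):
        self.handles.clear()
        self.current = None
        self.has_quit = True
```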

6.5 Resizing and Positioning Windows

Command Name setWindowSize
Parameters "sessionId" {string} The key that identifies which session this request is for.
"windowHandle" {string} The handle referring to the window to resize.
"width" {number} The new window width.
"height" {number} The new window height.
Return Value None
Errors UnsupportedOperationException if the window could not be resized.

Command Name getWindowSize
Parameters "sessionId" {string} The key that identifies which session this request is for.
"windowHandle" {string} The handle referring to the window to query.
Return Value An object with two keys:
"width" {number} The width of the specified window.
"height" {number} The height of the specified window.

Command Name maximizeWindow
Parameters "sessionId" {string} The key that identifies which session this request is for.
"windowHandle" {string} The handle referring to the window to maximize.
Return Value None
Errors UnsupportedOperationException if the window could not be resized.

Command Name fullscreenWindow
Parameters "sessionId" {string} The key that identifies which session this request is for.
"windowHandle" {string} The handle referring to the window to make full screen.
Return Value None
Errors UnsupportedOperationException if the window could not be resized.

Each of these commands accepts the window handles returned by "getWindowHandles" and "getWindowHandle". In addition, the window handle may be "current", in which case the window that is currently handling commands must be acted upon.

The "width" and "height" values refer to the "window.outerWidth" and "window.outerHeight" properties. For those browsers that do not support these properties, these represent the width and height of the whole browser window including window chrome and window resizing borders/handles.

After setWindowSize, the whole browser window must be left as if the restore button had been pressed, and must not be in the maximised state.

After maximizeWindow, the whole browser window must be left as if the maximise button had been pressed; it is not sufficient to leave the window "restored", but with the full screen dimensions.

If a request is made to resize a window to a size which cannot be performed (e.g. the browser has a minimum, or fixed window size), an UnsupportedOperationException must be thrown.

6.6 Scaling the Content of Windows

TODO

7. Where Commands Are Handled

Web applications can be composed of multiple windows and/or frames. For a normal user, the context in which an operation is performed is obvious: it's the window or frame that currently has OS focus and which has just received user input. The WebDriver API does not follow this convention. There is an expectation that many browsers using the WebDriver API may be used at the same time on the same machine. This section describes how WebDriver tracks which window or frame is currently the context in which commands are being executed.

7.1 Default Content

WebDriver's default context is the equivalent of window.top.

7.2 Switching Windows

Command Name switchToWindow
Parameters "sessionId" {string} The key that identifies which session this request is for.
"name" {string} The identifier used for a window.
Return Value None
Errors NoSuchWindowException if no matching window can be found

The "switchToWindow" command is used to select which window should currently be accepting commands. In order to determine which window should be used for accepting commands, the "switchToWindow" command will iterate over all windows. For each window, the following will be compared --- in this order --- with the "name" parameter:

  1. A window handle, obtained from "getWindowHandles" or "getWindowHandle".
  2. The window name, as defined when the window was opened (the value of "window.name")
  3. The "id" attribute of the window.

If no windows match, then a "NoSuchWindowException" must be thrown, otherwise the "default content" of the first window to match will be selected for accepting commands.
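
This example is non-normative. The matching rule can be sketched as a loop over the windows that compares the handle, then the name, then the "id" of each one. The Window record and the function name find_window are hypothetical, and skipping empty identifiers is an assumption made so that an unnamed window is never matched by accident.

```python
from dataclasses import dataclass


class NoSuchWindowException(Exception):
    """Hypothetical exception; corresponds to status code 23 (NoSuchWindow)."""


@dataclass
class Window:
    handle: str     # opaque window handle
    name: str = ""  # value of window.name
    id: str = ""    # "id" attribute of the window


def find_window(windows, name):
    """Return the first window whose handle, name or id equals name."""
    for window in windows:
        # Compared in the order given by the list above.
        for candidate in (window.handle, window.name, window.id):
            if candidate and candidate == name:
                return window
    raise NoSuchWindowException(name)
```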

When a new browser session is started by WebDriver and only a single window is present then the default content of that window becomes the "current" window. When more than one window is opened, the "current" window is undefined. Any commands that are executed at this point that require a window must throw an exception (TODO: Which exception? Ideally the same as if a window had just been closed). The correct way for a user to recover from this situation is to obtain the list of window handles and to "switch to" one of these.

7.3 Switching Frames

Command Name switchToFrame
Parameters "sessionId" {string} The key that identifies which session this request is for.
"id" {?(string|number|WebElement=)} The identifier used for a frame.
Return Value None
Errors NoSuchFrameException if no matching frame can be found

The "switchToFrame" command is used to select which frame within a window should be used for handling future commands. All frame switching is relative to the context in which commands are currently being handled. The "id" parameter can be a string, a number, or an element. WebDriver implementations must determine which frame to select using the following algorithm:

  1. If the "id" is a number the current context is set to the equivalent of the JS expression "window.frames[n]" where "n" is the number and "window" is the DOM window represented by the current context.
  2. If the "id" is null, the current context is set to the default context.
  3. If the "id" is a string:
    1. If the JS expression "window.frames[id]" evaluated in the current context returns a window, where "id" is the value of the "id" parameter, the current context is set to that.
    2. Otherwise for each value of "window.frames" (referred to as "window"):
      1. If "window" has a "name" property or attribute equal to the "id" parameter, this becomes the current context.
      2. If "window" has an "id" property or attribute equal to the "id" parameter, this becomes the current context.
  4. If the "id" represents a WebElement, and the corresponding DOM element represents a FRAME or an IFRAME, and the WebElement is part of the current context, the "window" property of that DOM element becomes the current context.

In all cases if no match is made a "NoSuchFrameException" must be thrown.
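
This example is non-normative. It walks through the algorithm above against a simplified frame tree. The Frame and FrameElement records and the function name switch_to_frame are hypothetical, and the "window.frames[id]" lookup of step 3.1 is approximated by a search on frame names, which is a simplification of what a browser does.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class NoSuchFrameException(Exception):
    """Hypothetical exception; corresponds to status code 8 (NoSuchFrame)."""


@dataclass(eq=False)
class Frame:
    name: str = ""
    id: str = ""
    frames: List["Frame"] = field(default_factory=list)  # like window.frames


@dataclass(eq=False)
class FrameElement:
    """Stand-in for a WebElement in this sketch."""
    tag: str
    window: Optional[Frame]  # the frame's own window, if it is a frame
    context: Frame           # the context the element belongs to


def switch_to_frame(current, default, frame_id):
    """Return the new current context for the given "id" parameter."""
    if frame_id is None:
        return default  # step 2: back to the default context
    if isinstance(frame_id, int) and not isinstance(frame_id, bool):
        if 0 <= frame_id < len(current.frames):
            return current.frames[frame_id]  # step 1: window.frames[n]
        raise NoSuchFrameException(frame_id)
    if isinstance(frame_id, str):
        # Step 3.1: window.frames[id], approximated as a lookup by name.
        for frame in current.frames:
            if frame.name == frame_id:
                return frame
        # Step 3.2: for each frame, compare "name" and then "id".
        for frame in current.frames:
            if frame_id in (frame.name, frame.id):
                return frame
        raise NoSuchFrameException(frame_id)
    if isinstance(frame_id, FrameElement):
        # Step 4: the element must be a FRAME or IFRAME in the current context.
        if (frame_id.tag.upper() in ("FRAME", "IFRAME")
                and frame_id.context is current
                and frame_id.window is not None):
            return frame_id.window
    raise NoSuchFrameException(frame_id)
```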

Frame switching must succeed even if doing so would cross a security origin, or JavaScript executing in window.top's context would otherwise not be able to access the frame being switched to.

8. Running Without Window Focus

All browsers must comply with the focus section of the [HTML5] spec. In particular, the requirement that the focused element within a top-level browsing context be independent of whether or not the top-level browsing context itself has system focus must be followed.

Note

This requirement is put in place to allow efficient machine utilization when using the WebDriver API to control several browsers independently on the same desktop.

9. Elements

One of the key abstractions of the WebDriver API is the WebElement interface. Each instance of this interface represents an Element as defined in the [DOM4] specification. Because the WebDriver API is designed to allow users to interact with apps as if they were actual users, the capabilities offered by the WebElement interface are somewhat different from those offered by the DOM Element interface.

Each WebElement instance must have an ID, which is distinct from the value of the DOM Element's "id" property. The ID for every WebElement representing the same underlying DOM Element must be the same. The IDs used to refer to different underlying DOM Elements must be unique.

interface WebElement {
    readonly attribute DOMString id;
};

9.1 Attributes

id of type DOMString, readonly
The WebDriver ID of this particular element. This should be a UUID.

Note

This requirement around WebElement IDs allows for efficient equality checks when the WebDriver API is being used out of process.
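
This example is non-normative. One way to satisfy the ID requirements is a registry on the remote end that hands out a UUID the first time a DOM element is seen and returns the same one thereafter. The ElementRegistry and DomNode names are hypothetical, and DomNode merely stands in for a real DOM Element.

```python
import uuid


class DomNode:
    """Placeholder for a DOM Element; hashed by object identity."""


class ElementRegistry:
    def __init__(self):
        self._ids = {}  # DOM node -> WebElement ID

    def id_for(self, node):
        """Return the WebElement ID for node, creating one on first use."""
        if node not in self._ids:
            self._ids[node] = str(uuid.uuid4())
        return self._ids[node]


def same_element(id_a, id_b):
    # Equality reduces to a string comparison; no call into the browser.
    return id_a == id_b
```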

This section of the specification covers finding elements. Later sections deal with querying and interacting with these located elements. The primary interface used by the WebDriver API for locating elements is the SearchContext.

9.2 Lists of WebElements

The primary grouping of WebElement instances is an array of WebElement instances.

A reference to a WebElement is obtained via a SearchContext. The key interfaces are:

interface Locator {
    readonly attribute DOMString strategy;
    readonly attribute DOMString value;
};

Attributes

strategy of type DOMString, readonly
The name of the strategy that should be used to locate elements.
value of type DOMString, readonly
The value to pass to the element finding strategy.

interface SearchContext {
    WebElement[] findElements (Locator locator);
    WebElement   findElement (Locator locator);
};

Methods

findElement
Parameter: locator, of type Locator
Return type: WebElement
findElements
Parameter: locator, of type Locator
Return type: WebElement[]
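
This example is non-normative. It sketches the two interfaces over an in-memory list of elements. The strategy name "id", the dictionary representation of elements and the choice to raise NoSuchElement from find_element when nothing matches are assumptions; this section does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    strategy: str  # name of the strategy used to locate elements
    value: str     # value passed to that strategy


class NoSuchElement(Exception):
    """Hypothetical exception; corresponds to status code 7."""


class SearchContext:
    def __init__(self, elements, strategies):
        self._elements = list(elements)
        self._strategies = dict(strategies)  # strategy name -> predicate

    def find_elements(self, locator):
        matches = self._strategies[locator.strategy]
        return [e for e in self._elements if matches(e, locator.value)]

    def find_element(self, locator):
        found = self.find_elements(locator)
        if not found:
            raise NoSuchElement(f"{locator.strategy}={locator.value}")
        return found[0]  # first match, in the order elements were supplied


# Hypothetical strategy table with a single entry.
STRATEGIES = {"id": lambda element, value: element.get("id") == value}
```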

9.3 Element Location Strategies

9.3.1 ARIA

This section is non-normative: It should be possible to find elements using their ARIA roles. It may be possible to find elements using their ARIA states and properties. All references to "ARIA" refer to [WAI-ARIA].

9.3.2 CSS Selectors

Capability Name Type
cssSelectors boolean

If a browser supports the CSS Selectors API ([SELECTORS-API]) it must support locating elements by CSS Selector. If the browser does not support the CSS Selectors API it may choose to implement locating by this mechanism. Elements must be returned in the same order as if "querySelectorAll" had been called. Compound selectors are allowed.

9.3.3 ECMAScript

Finding elements by ECMAScript is covered in the ECMAScript part of this spec.

9.3.4 Element ID

This strategy must be supported by all WebDriver implementations.

The HTML5 specification ([HTML5]) states that element IDs must be unique within their home subtree. Sadly, this uniqueness requirement is not always met. Consequently, this strategy is equally valid for finding a single element, or groups of elements. In the case of finding a single WebElement, this must be functionally identical to a call to "document.getElementById()" from the Web DOM Core specification ([DOM4]). When finding multiple elements, this is equivalent to a CSS query of "#value" where "value" is the ID being searched for with all "'" characters being properly escaped.

9.3.7 XPath

All WebDriver implementations must support finding elements by XPath 1.0 [XPATH] with the edits from section 3.3 of the [HTML5] specification made. If no native support is present in the browser, a pure JS implementation may be used. When called, the returned values must be equivalent to calling the "evaluate" function from the DOM Level 3 XPath spec [DOM-LEVEL-3-XPATH] with the result type set to "ORDERED_NODE_SNAPSHOT_TYPE" (7).

10. Reading Element State

10.1 Determining Visibility

The following algorithm is used to determine if an element has been displayed.

Command Name isDisplayed
Parameters "id" {string} The ID of the WebElement on which to operate.
Return Value {boolean} Whether the element is displayed.
Errors StaleElementReferenceException if the element referenced is no longer attached to the DOM

10.2 Determining Whether a WebElement Is Selected

WebDriver determines whether a WebElement is selected using the following algorithm:

  1. If the item is not "selectable", the WebElement is not selected. A selectable element is either an OPTION element or an INPUT element of type "checkbox" or "radio".
  2. If the WebElement represents an INPUT element, call the "getProperty" method described above looking for the "checked" property. This indicates whether the element is selected.
  3. Otherwise, call the "getProperty" method described above looking for the "selected" property. This indicates whether the element is selected.

Command Name isSelected
Parameters "id" {string} The ID of the WebElement on which to operate.
Return Value {boolean} Whether the element is selected, according to the above algorithm.
Errors StaleElementReferenceException if the element referenced is no longer attached to the DOM
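
This example is non-normative. The selection algorithm can be sketched over a simplified element record. Representing an element as a dictionary with "tag", "type" and "properties" keys is an assumption, and reading from "properties" stands in for the "getProperty" method mentioned in the algorithm.

```python
def is_selectable(element):
    """OPTION elements, and INPUT elements of type checkbox or radio."""
    tag = element["tag"].upper()
    if tag == "OPTION":
        return True
    return tag == "INPUT" and element.get("type", "").lower() in ("checkbox", "radio")


def is_selected(element):
    if not is_selectable(element):
        return False  # step 1: not selectable means not selected
    # Step 2 reads "checked" for INPUT; step 3 reads "selected" otherwise.
    prop = "checked" if element["tag"].upper() == "INPUT" else "selected"
    return bool(element.get("properties", {}).get(prop, False))
```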

10.3 Reading Attributes and Properties

Although the [HTML5] spec is very clear about the difference between the properties and attributes of a DOM element, users are frequently confused between the two. Because of this, the WebDriver API offers a single command ("getElementAttribute") which returns the value of either a DOM element's property or its attribute. If a user wishes to refer specifically to an attribute or a property, they should evaluate JavaScript in order to be unambiguous. In this section, the "attribute" with name name shall refer to the result of calling the JavaScript "getAttribute" function on the element, with the following exceptions:

Note

These aliases provide the commonly used names for element properties.

Command Name getElementAttribute
Parameters "sessionId" {string} The key that identifies which session this request is for.
"id" {string} The ID of the WebElement on which to operate.
"name" {string} The name of the property or attribute to return.
Return Value {string|null} The value returned by the above algorithm, coerced to a nullable string, or null if no value is defined.
Errors StaleElementReferenceException if the element is no longer attached to the DOM.

10.4 Rendering Text

All WebDriver implementations must support getting the visible text of a WebElement, with excess whitespace compressed.

The following definitions are used in this section:

Whitespace
Any text that matches the ECMAScript regular expression class \s.
Whitespace excluding non-breaking spaces
Any text that matches the ECMAScript regular expression [^\S\xa0]
Block level element
A block-level element is one which is not a table cell, and whose effective CSS display style is not in the set ['inline', 'inline-block', 'inline-table', 'none', 'table-cell', 'table-column', 'table-column-group']
Horizontal whitespace characters
Horizontal whitespace characters are defined by the ECMAScript regular expression [\x20\t\u2028\u2029].

The expected return value is roughly what a text-only browser such as Lynx would display. The algorithm for determining this text is as follows:

Let lines equal an empty array. Then:

  1. For each child of node, at time of execution, in order:
    1. Get whitespace, text-transform, and then, if child is:
      • a node which is not visible, do nothing
      • a [DOM4] text node let text equal the nodeValue property of child. Then:
        1. Remove any zero-width spaces (\u200b), form feeds (\f) or vertical tabs (\v) from text.
        2. Canonicalize any recognized single newline sequence in text to a single newline (greedily matching (\r\n|\r|\n) to a single \n)
        3. If the parent's effective CSS whitespace style is 'normal' or 'nowrap' replace each newline (\n) in text with a single space character (\x20). If the parent's effective CSS whitespace style is 'pre' or 'pre-wrap' replace each horizontal whitespace character with a non-breaking space character (\xa0). Otherwise replace each sequence of horizontal whitespace characters except non-breaking spaces (\xa0) with a single space character
        4. Apply the parent's effective CSS text-transform style as per the CSS 2.1 specification ([CSS21])
        5. If last(lines) ends with a space character and text starts with a space character, trim the first character of text.
        6. Append text to last(lines) in-place
      • an element which is visible. If the element is a:
        • BR element: Push '' to lines and continue
        • Block-level element and if last(lines) is not '', push '' to lines.
        And then recurse depth-first to step 1 with child set to the current element
      • If element is a TD element, or the effective CSS display style is 'table-cell', and last(lines) is not '', and last(lines) does not end with whitespace append a single space character to last(lines) [Note: Most innerText implementations append a \t here]
      • If element is a block-level element: push '' to lines
  2. For each line in lines trim any leading and trailing whitespace excluding non-breaking space characters.
  3. Let s be lines.join('\n')
  4. Trim any leading and trailing whitespace excluding non-breaking space characters from s.
  5. Replace any non-breaking spaces (\xa0) with spaces (\x20) in s.
  6. Return s.
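
The string-handling steps of this algorithm can be sketched as follows. This is an illustration only: it covers the per-text-node steps 1 to 3 and the final steps 2 to 6, and omits DOM traversal, visibility checks and text-transform. The function names and the whiteSpace parameter are hypothetical. The sketch also assumes that the "Otherwise" clause of step 3 applies whenever the style is not 'pre' or 'pre-wrap', including after newlines have been replaced for 'normal' and 'nowrap'.

```javascript
// "Whitespace excluding non-breaking spaces", anchored to both ends of a string.
const TRIM_EDGES = /^[^\S\xa0]+|[^\S\xa0]+$/g;

// Steps 1-3 for one text node. whiteSpace is the parent's effective CSS
// white-space value (hypothetical parameter name).
function normalizeTextNode(text, whiteSpace) {
  // Step 1: drop zero-width spaces, form feeds and vertical tabs.
  text = text.replace(/[\u200b\f\v]/g, '');
  // Step 2: canonicalize newline sequences to \n.
  text = text.replace(/\r\n|\r|\n/g, '\n');
  // Step 3: handle newlines and horizontal whitespace per the CSS style.
  if (whiteSpace === 'normal' || whiteSpace === 'nowrap') {
    text = text.replace(/\n/g, ' ');
  }
  if (whiteSpace === 'pre' || whiteSpace === 'pre-wrap') {
    text = text.replace(/[\x20\t\u2028\u2029]/g, '\xa0');
  } else {
    // Assumed reading of "Otherwise": collapse runs to a single space.
    text = text.replace(/[\x20\t\u2028\u2029]+/g, ' ');
  }
  return text;
}

// Final steps 2-6: trim each line, join, trim again, convert \xa0 to spaces.
function finalizeLines(lines) {
  const s = lines.map((line) => line.replace(TRIM_EDGES, '')).join('\n');
  return s.replace(TRIM_EDGES, '').replace(/\xa0/g, ' ');
}

console.log(JSON.stringify(normalizeTextNode('a \t b\r\nc', 'normal'))); // "a b c"
console.log(JSON.stringify(finalizeLines(['  hello ', 'a\xa0\xa0b ']))); // "hello\na  b"
```

Non-breaking spaces survive the trimming steps and are only converted to ordinary spaces at the very end, which is how preformatted spacing is preserved in the result.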

11. Executing Javascript

Note

Open questions: What happens if a user's JS triggers a modal dialog? Blocking seems like a reasonable idea, but there is an assumption that WebDriver is not threadsafe. What happens to unhandled JS errors? Caused by a user's JS? Caused by JS on a page? How does a user of the API obtain the list of errors? Is that list cleared upon read?

If a browser supports JavaScript and JavaScript is enabled, it must set the "javascriptEnabled" capability to true, and it must support the execution of arbitrary JavaScript.

Capability Name      Type
javascriptEnabled    boolean

11.1 Javascript Command Parameters

The Argument type is defined as being {(number|boolean|DOMString|WebElement|dictionary|Array.<Argument>)?}

interface JavascriptCommandParameters {
    readonly attribute DOMString  script;
    readonly attribute Argument[] args;
};

11.1.1 Attributes

args of type array of Argument, readonly
The parameters to the function defined by script.
script of type DOMString, readonly
The JavaScript to execute, in the form of a Function body.

When executing Javascript, it must be possible to reference the args parameter using the function's arguments object. The arguments must be in the same order as defined in args. Each WebDriver implementation must preprocess the values in args using the following algorithm:

For each index, index in args, if args[index] is...

  1. a number, boolean, DOMString, or null, then let args[index] = args[index].
  2. an array, then recursively apply this algorithm to args[index] and assign the result to args[index].
  3. a dictionary, then recursively apply this algorithm to each value in args[index] and assign the result to args[index].
  4. a WebElement, then:
    1. If the element's ID does not represent a DOMElement, or it represents a DOMElement that is no longer attached to the document's tree, then the WebDriver implementation must immediately abort the command and return a StaleElementReference error.
    2. Otherwise, let args[index] be the underlying DOMElement.
  5. Otherwise WebDriver implementations may throw an UnknownError indicating the index of the unhandled parameter (TODO: Should a more specific error be thrown?) but should attempt to convert the value into a dictionary.
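
The pre-processing steps above might look like the following sketch. WebElementRef, StaleElementReferenceError and the elements map (element ID to a stand-in DOM element carrying an attached flag) are hypothetical names introduced for this example; the specification does not define how element references or errors are represented in memory.

```javascript
// Hypothetical stand-in for a WebElement passed as a script argument.
class WebElementRef {
  constructor(id) { this.id = id; }
}

// Hypothetical error class for the StaleElementReference error.
class StaleElementReferenceError extends Error {}

// elements: Map of element ID -> stand-in DOM element { attached: boolean }.
function preprocess(value, elements) {
  // Step 1: primitives and null pass through unchanged.
  if (value === null || ['number', 'boolean', 'string'].includes(typeof value)) {
    return value;
  }
  // Step 2: arrays are processed recursively.
  if (Array.isArray(value)) return value.map((v) => preprocess(v, elements));
  // Step 4: WebElements resolve to the underlying element, or fail if stale.
  // This check must run before the generic dictionary case below.
  if (value instanceof WebElementRef) {
    const dom = elements.get(value.id);
    if (!dom || !dom.attached) {
      throw new StaleElementReferenceError(`stale element: ${value.id}`);
    }
    return dom;
  }
  // Step 3: dictionaries are processed value by value.
  if (typeof value === 'object') {
    const out = {};
    for (const [k, v] of Object.entries(value)) out[k] = preprocess(v, elements);
    return out;
  }
  // Step 5: anything else is reported as an UnknownError.
  const err = new Error(`unhandled argument type: ${typeof value}`);
  err.name = 'UnknownError';
  throw err;
}

function preprocessArgs(args, elements) {
  return args.map((arg) => preprocess(arg, elements));
}
```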

11.2 Synchronous Javascript Execution

Command Name executeScript
Parameters "sessionId" {string} The key that identifies which session this request is for.
"script" {string} The script to execute.
"args" {Array.<Argument>} The script arguments.
Return Value {Argument} The value returned by the script, or null.
Errors JavascriptError if the executing script threw an exception.
StaleElementReferenceException if a WebElement referenced is no longer attached to the DOM.
UnknownError if an argument or the return value is of an unhandled type.

When executing JavaScript, the WebDriver implementations must use the following algorithm:

  1. Let window be the Window object for WebDriver's current command context.
  2. Let script be the DOMString from the command's script parameter.
  3. Let fn be the Function object created by executing new Function(script);
  4. Let args be the JavaScript array created by the pre-processing algorithm defined above.
  5. Invoke fn.apply(window, args);
  6. If step #5 threw, then:
    1. Let error be the thrown value.
    2. Set the command's response status to JavascriptError.
    3. Set the command's response value to a dictionary, dict.
    4. If error is an Error, then set a "message" entry in dict whose value is the DOMString defined by error.message.
    5. Otherwise, set a "message" entry in dict whose value is the DOMString representation of error.
  7. Otherwise:
    1. Let result be the value returned by the function in step #5.
    2. Set the command's response status to Success.
    3. Let value be the result of the following algorithm:
      1. If result is:
        1. undefined or null, return null.
        2. a number, boolean, or DOMString, return result.
        3. a DOMElement, then return the corresponding WebElement for that DOMElement.
        4. an array or NodeList, then return the result of recursively applying this algorithm to result.
        5. an object, then return the dictionary created by recursively applying this algorithm to each property in result.
    4. Set the command's response value to value.
  8. Return the command response.
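
A minimal sketch of the synchronous algorithm follows, assuming the arguments have already been pre-processed. The response shape ({ status, value }) and the function names are assumptions made for this example, and globalThis stands in for the Window object so that the sketch can run outside a browser. The DOMElement and NodeList cases are omitted for the same reason.

```javascript
// Step 7.3: convert a script result into a response value.
function toResponseValue(result) {
  if (result === undefined || result === null) return null;
  if (['number', 'boolean', 'string'].includes(typeof result)) return result;
  if (Array.isArray(result)) return result.map(toResponseValue);
  if (typeof result === 'object') {
    const dict = {};
    for (const [k, v] of Object.entries(result)) dict[k] = toResponseValue(v);
    return dict;
  }
  // Assumption: other types (for example functions) are unhandled.
  const err = new Error(`unhandled return type: ${typeof result}`);
  err.name = 'UnknownError';
  throw err;
}

function executeScript(script, args, windowObject = globalThis) {
  // Step 3: the script is a function body. A syntax error thrown here is
  // not covered by the algorithm, so this sketch lets it propagate.
  const fn = new Function(script);
  let result;
  try {
    // Step 5: args are reachable through the function's arguments object.
    result = fn.apply(windowObject, args);
  } catch (error) {
    // Step 6: report the thrown value as a JavascriptError.
    const message = error instanceof Error ? error.message : String(error);
    return { status: 'JavascriptError', value: { message } };
  }
  // Step 7: convert the result and report success.
  return { status: 'Success', value: toResponseValue(result) };
}

console.log(executeScript('return arguments[0] + arguments[1];', [1, 2]));
// { status: 'Success', value: 3 }
```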

11.3 Asynchronous Javascript Execution

Command Name executeAsyncScript
Parameters "sessionId" {string} The key that identifies which session this request is for.
"script" {string} The script to execute.
"args" {Array.<Argument>} The script arguments.
Return Value {Argument} The value returned by the script, or null.
Errors JavascriptError if the executing script threw an exception.
StaleElementReferenceException if a WebElement referenced is no longer attached to the DOM.
Timeout if the callback is not called within the time specified by the "script" timeout.
UnknownError if an argument or the return value is of an unhandled type.

When executing asynchronous JavaScript, the WebDriver implementation must use the following algorithm:

  1. Let timeout be the value of the last "script" timeout command, or 0 if no such commands have been received.
  2. Let window be the Window object for WebDriver's current command context.
  3. Let script be the DOMString from the command's script parameter.
  4. Let fn be the Function object created by executing new Function(script);
  5. Let args be the JavaScript array created by the pre-processing algorithm defined above.
  6. Let callback be a Function object pushed to the end of args.
  7. Register a one-shot timer on window set to fire timeout milliseconds in the future.
  8. Invoke fn.apply(window, args);
  9. If step #8 threw, then:
    1. Let error be the thrown value.
    2. Set the command's response status to JavascriptError.
    3. Set the command's response value to a dictionary, dict.
    4. If error is an Error, then set a "message" entry in dict whose value is the DOMString defined by error.message.
    5. Otherwise, set a "message" entry in dict whose value is the DOMString representation of error.
  10. Otherwise, the WebDriver implementation must wait for one of the following to occur:
    1. if the timer from step #7 fires, the WebDriver implementation must immediately set the command response status to Timeout and return.
    2. if the window fires an unload event, the WebDriver implementation must immediately set the command response status to JavascriptError and return with the error message set to "Javascript execution context no longer exists.".
    3. if the callback function is invoked, then:
      1. Let result be the first argument passed to callback.
      2. Set the command's response status to Success.
      3. Let value be the result of the following algorithm:
        1. If result is:
          1. undefined or null, return null.
          2. a number, boolean, or DOMString, return result.
          3. a DOMElement, then return the corresponding WebElement for that DOMElement.
          4. an array or NodeList, then return the result of recursively applying this algorithm to result. WebDriver implementations should limit the recursion depth.
          5. an object, then return the dictionary created by recursively applying this algorithm to each property in result.
      4. Set the command's response value to value.
    4. Return the command response.
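
The asynchronous variant can be sketched in the same spirit. Here done is a hypothetical hook that receives the command response, and the { status, value } shape is again an assumption. To keep the example short, result conversion is reduced to the undefined/null case and the unload event is not modelled, since there is no window outside a browser.

```javascript
function executeAsyncScript(script, args, timeoutMs, done, windowObject = globalThis) {
  let settled = false;
  let timer = null;

  // Deliver exactly one response, whichever outcome happens first.
  const finish = (status, value) => {
    if (settled) return;
    settled = true;
    clearTimeout(timer);
    done({ status, value });
  };

  const fn = new Function(script);
  // Step 6: the callback is pushed to the end of the arguments.
  const callback = (result) => finish('Success', result === undefined ? null : result);
  const fullArgs = [...args, callback];
  // Step 7: one-shot timer for the "script" timeout.
  timer = setTimeout(() => finish('Timeout', null), timeoutMs);
  try {
    // Step 8: invoke the script.
    fn.apply(windowObject, fullArgs);
  } catch (error) {
    // Step 9: a synchronous throw is reported as a JavascriptError.
    const message = error instanceof Error ? error.message : String(error);
    finish('JavascriptError', { message });
  }
}

// Usage: the script signals completion by calling its last argument.
executeAsyncScript(
  'var cb = arguments[arguments.length - 1]; cb(arguments[0] * 2);',
  [21],
  1000,
  (response) => console.log(response),
);
```

The settled flag is a design choice of this sketch: it ensures that a late callback after a timeout, or a second callback invocation, cannot produce a second response.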

11.4 Reporting Errors

12. Cookies

13. Timeouts

This section describes how timeouts and implicit waits are handled within WebDriver.

The "timeouts" command sets how long commands of a given type are permitted to run.

Command Name timeouts
Parameters "sessionId" {string} The key that identifies which session this request is for.
"type" {string} The type of operation to set the timeout for. Valid values are: "implicit", "page load", "script"
"ms" {number} The amount of time, in milliseconds, that time-limited commands are permitted to run.
Return Value None
Errors None
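
As an illustration, per-session timeout state could be kept as in the sketch below. The session object and function names are hypothetical. The command is listed with no errors and the handling of an unrecognised type is not defined, so ignoring it and returning false is purely an assumption of this example. The default of 0 for the "script" timeout follows the asynchronous execution algorithm in section 11.3.

```javascript
const VALID_TYPES = new Set(['implicit', 'page load', 'script']);

// Hypothetical per-session state.
function createSession() {
  return { timeouts: new Map() };
}

// Handle a "timeouts" command. Returns false for an unrecognised type
// (an assumption; the behaviour is not specified).
function handleTimeouts(session, type, ms) {
  if (!VALID_TYPES.has(type)) return false;
  session.timeouts.set(type, ms);
  return true;
}

// The value of the last "script" timeout command, or 0 if none was received.
function scriptTimeout(session) {
  return session.timeouts.has('script') ? session.timeouts.get('script') : 0;
}

const session = createSession();
console.log(scriptTimeout(session)); // 0
handleTimeouts(session, 'script', 5000);
console.log(scriptTimeout(session)); // 5000
```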

14. User Input

There are two ways to interact with elements: directly or implicitly. The difference between the two is similar to the difference between "Do what I mean" vs "Do as I say": The commands for directly interacting with elements express explicit intention for the desired outcome. For this kind of interaction, the implementation of this specification should take additional steps to ensure the interaction would happen as the user intended (for example, by scrolling the element into the viewport). Implicit interaction differs by following the user's instructions without additional interpretation. Interaction would be with the currently active element, as defined by the browser. The implementation would maintain the current keyboard and mouse state in order to fulfill the user's instructions.

14.1 Interaction directly with elements

14.1.1 Interactable elements

Some user-input actions require the element to be interactable. The following conditions must be met for the element to be considered interactable:

  • The element must be visible, as defined in section 10.1.
  • The element must not be disabled:
    • If the currently loaded document is HTML4, the element does not support the disabled attribute (according to the [HTML401] spec), or the disabled attribute is not set.
    • If the currently loaded document is HTML5, the element is not disabled as defined in the [HTML5] spec.

14.1.2 Clicking

partial interface WebElement {
    void click ();
};
14.1.2.1 Methods
click
Clicks in the middle of the WebElement instance. The middle of the element is defined as the middle of the box returned by calling getBoundingClientRect on the underlying DOM Element, according to the [CSSOM-VIEW] spec. If the element is outside the viewport (according to the [CSS21] spec), the implementation should bring the element into view first. The implementation may invoke scrollIntoView on the underlying DOM Element. The element must be visible, as defined in section 10.1. See the note below for when the element is obscured by another element. Exceptions:
  • links (A elements): Clicking happens in the middle of the first bounding client rectangle. This is to overcome overflowing links where the middle of the bounding client rectangle does not actually fall on a clickable part of the link.
  • Select elements without the multiple attribute set: Clicking on the select element must open up a selection menu. After clicking on an option, the selection menu must be closed. Clicking directly on an option element (without clicking on the select element previously) must open a selection menu, as if the select element was clicked first. Ultimately, after clicking one of the options, the selection menu must be closed.

The possible errors for this command:

  • StaleElementReference if the given element is no longer attached to the DOM.
  • ElementNotVisible if the element is hidden and thus cannot be interacted with.
  • MoveTargetOutOfBounds if the element cannot be scrolled into view.
No parameters.
Return type: void
Note

As the goal is to emulate users as closely as possible, the implementation should not allow clicking on elements that are obscured by other elements. The implementation should try to scroll the element into view, but in case it is fully obscured, it should not be clickable.
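
The choice of click coordinates described above can be sketched as follows. The function name is hypothetical, and reading "the first bounding client rectangle" of a link as the first entry returned by getClientRects is an assumption of this example. The element only needs to expose tagName, getBoundingClientRect() and getClientRects(), so the sketch can be exercised with stand-in objects.

```javascript
function clickPoint(element) {
  const isLink = element.tagName.toUpperCase() === 'A';
  // Links: use the first client rectangle, since the middle of the overall
  // bounding box of a wrapped link may not lie on the link itself.
  const rects = isLink ? element.getClientRects() : [];
  const rect = rects.length > 0 ? rects[0] : element.getBoundingClientRect();
  return { x: rect.left + rect.width / 2, y: rect.top + rect.height / 2 };
}

// Stand-in for a link that wraps across two lines.
const wrappedLink = {
  tagName: 'A',
  getBoundingClientRect: () => ({ left: 0, top: 0, width: 100, height: 20 }),
  getClientRects: () => [
    { left: 80, top: 0, width: 20, height: 10 },
    { left: 0, top: 10, width: 30, height: 10 },
  ],
};
console.log(clickPoint(wrappedLink)); // { x: 90, y: 5 }
```

In this example the middle of the bounding box, (50, 10), falls between the two line fragments, which is the situation the link exception is meant to avoid.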

14.1.3 Typing keys

A prerequisite for key-based interaction with an element is that it is interactable (as defined earlier in this section). Typing into an element can take place if one of the following conditions is met:

  • The element is focusable as defined in the editing section of the [HTML5] spec.
  • The element could be the activeElement. In addition to focusable elements, this allows typing to the body element.
  • In an HTML5 document, the element is editable as a result of having its contentEditable attribute set or the containing document is in designMode.

Prior to any keyboard interaction, the element should be focused if it does not currently have the focus. This is the case if one of the following holds:

  • The element is not document.activeElement
  • The owner document of the element to be interacted with is not the focused document.

In case focusing is needed, the implementation must follow the focusing steps as described in the focus management section of the [HTML5] spec. The focus must not leave the element at the end of the interaction, other than as a result of the interaction itself (i.e. when the tab key is sent).

partial interface WebElement {
    void clear ();
    void sendKeys (string[] keysToSend);
};
14.1.3.1 Methods
clear
Clears the value of the element.
No parameters.
Return type: void
sendKeys
Sends a sequence of keyboard events representing the keys in the keysToSend parameter.

Caret positioning: If focusing was needed, after following the focusing steps, the caret must be positioned at the end of the text currently in the element. At the end of the interaction, the caret must be positioned at the end of the typed text sequence, unless the keys sent position it otherwise (e.g. using the LEFT key).

There are four different types of keys that are emulated:

  • Character literals - lower-case symbols.
  • Uppercase letters and symbols requiring the SHIFT key for typing.
  • Modifier keys
  • Special keys
The rest of this section details the values used to represent the different keys, as well as the expected behaviour for each key type.
Parameter    Type        Nullable    Optional    Description
keysToSend   string[]
Return type: void

When emulating user input, the implementation must generate the same sequence of events that would have been produced if a real user was sitting in front of the keyboard and typing the sequence of characters. In cases where there is more than one way to type this sequence, the implementation must choose one of the valid ways. For example, typing AB may be achieved by:

  • Holding down the Shift key
  • Pressing the letter 'a'
  • Pressing the letter 'b'
  • Releasing the Shift key
Alternatively, it can be achieved by:
  • Holding down the Shift key
  • Pressing the letter 'a'
  • Releasing the Shift key
  • Holding down the Shift key
  • Pressing the letter 'b'
  • Releasing the Shift key
Or by simply turning on the CAPS LOCK first. Since all methods are valid, any of them can be used.

The implementation may use the following algorithm to generate the events. If the implementation is using a different algorithm, it must adhere to the requirements listed below.

For each key, key in keysToSend, do

  1. If key is a lower-case symbol:
    1. If the Shift key is not pressed:
      1. Generate a sequence of key-down, key-press and key-up events with key as the character to emulate
    2. else (The Shift key is pressed)
      1. let uppercaseKey be the upper-case character matching key
      2. Generate a sequence of key-down, key-press and key-up events with uppercaseKey as the character to emulate
  2. Else if key is an upper-case symbol:
    1. If the Shift key is not pressed:
      1. Generate a key-down event of the Shift key.
      2. Generate a sequence of key-down, key-press and key-up events with key as the character to emulate
      3. Generate a key-up event of the Shift key.
    2. else (The Shift key is pressed)
      1. Generate a sequence of key-down, key-press and key-up events with key as the character to emulate
  3. Else if key represents a modifier key:
    1. let modifier be the modifier key represented by key
    2. If modifier is currently held down:
      1. Generate a key-up event of modifier
    3. Else:
      1. Generate a key-down event of modifier
    4. Maintain this key state and use it to modify the input until it is pressed again.
  4. Else if key represents the NULL key:
    1. Generate key-up events of all modifier keys currently held down.
    2. All modifier keys are now assumed to be released.
  5. Else if key represents a special key:
    1. Translate key to the special key it represents
    2. Generate a sequence of key-down, key-press and key-up events for the special key.

Once keyboard input is complete, an implicit NULL key is sent unless the final character is the NULL key.

Any implementation must comply with these requirements:

  • For uppercase letters and symbols that require the Shift key to be pressed, there are two options:
    • A single Shift key-down event is generated before the entire sequence of uppercase letters.
    • Before each such letter or symbol, a Shift key-down event is generated. After each letter or symbol, a Shift key-up event is generated.
  • A user-specified Shift press implies capitalization of all following characters.
  • If a user-specified Shift press precedes uppercase letters and symbols, a second Shift key-down event must not be generated. In that case, a Shift key-up event must not be generated implicitly by the implementation.
  • The NULL key releases all currently held down modifier keys.
  • The state of all modifier keys must be reset at the end of each sendKeys call and the appropriate key-up events generated

Character types

The keysToSend parameter contains a mix of printable characters and pressable keys that aren't text. Pressable keys that aren't text are represented by code points in the Unicode PUA (Private Use Area), 0xE000-0xF8FF. The following table describes the mapping between PUA and key:

Key              Code    Type
NULL             \uE000  NULL
CANCEL           \uE001  Special key
HELP             \uE002  Special key
BACK_SPACE       \uE003  Special key
TAB              \uE004  Special key
CLEAR            \uE005  Special key
RETURN           \uE006  Special key
ENTER            \uE007  Special key
SHIFT            \uE008  Modifier
LEFT_SHIFT       \uE008  Modifier
CONTROL          \uE009  Modifier
LEFT_CONTROL     \uE009  Modifier
ALT              \uE00A  Modifier
LEFT_ALT         \uE00A  Modifier
PAUSE            \uE00B  Special key
ESCAPE           \uE00C  Special key
SPACE            \uE00D  Special key
PAGE_UP          \uE00E  Special key
PAGE_DOWN        \uE00F  Special key
END              \uE010  Special key
HOME             \uE011  Special key
LEFT             \uE012  Special key
ARROW_LEFT       \uE012  Special key
UP               \uE013  Special key
ARROW_UP         \uE013  Special key
RIGHT            \uE014  Special key
ARROW_RIGHT      \uE014  Special key
DOWN             \uE015  Special key
ARROW_DOWN       \uE015  Special key
INSERT           \uE016  Special key
DELETE           \uE017  Special key
SEMICOLON        \uE018  Special key
EQUALS           \uE019  Special key
NUMPAD0          \uE01A  Special key
NUMPAD1          \uE01B  Special key
NUMPAD2          \uE01C  Special key
NUMPAD3          \uE01D  Special key
NUMPAD4          \uE01E  Special key
NUMPAD5          \uE01F  Special key
NUMPAD6          \uE020  Special key
NUMPAD7          \uE021  Special key
NUMPAD8          \uE022  Special key
NUMPAD9          \uE023  Special key
MULTIPLY         \uE024  Special key
ADD              \uE025  Special key
SEPARATOR        \uE026  Special key
SUBTRACT         \uE027  Special key
DECIMAL          \uE028  Special key
DIVIDE           \uE029  Special key
F1               \uE031  Special key
F2               \uE032  Special key
F3               \uE033  Special key
F4               \uE034  Special key
F5               \uE035  Special key
F6               \uE036  Special key
F7               \uE037  Special key
F8               \uE038  Special key
F9               \uE039  Special key
F10              \uE03A  Special key
F11              \uE03B  Special key
F12              \uE03C  Special key
META             \uE03D  Special key
COMMAND          \uE03D  Special key
ZENKAKU_HANKAKU  \uE040  Special key

The keys considered upper-case symbols are either defined by the current keyboard locale or are derived from the US 104 key Windows keyboard layout, which are:

  • A - Z
  • !$^*()+{}:?|~@#%_\" & < >

When the user input is emulated natively (see note below), the implementation should use the current keyboard locale to determine which symbols are upper case. In all other cases, the implementation must use the US 104 key Windows keyboard layout to determine those symbols.

The state of the physical keyboard must not affect emulated user input.
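
The suggested event-generation algorithm for sendKeys can be sketched as follows, using the key codes and the US layout upper-case symbols given in this section. Events are recorded as strings such as "down:a" purely for illustration; the event representation, the function names and the table of unshifted-to-shifted symbol pairs are assumptions of this example rather than part of the API.

```javascript
const NULL_KEY = '\uE000';
const SHIFT = '\uE008';
const MODIFIERS = new Set(['\uE008', '\uE009', '\uE00A']); // SHIFT, CONTROL, ALT

// Upper-case symbols per the US 104 key layout list in this section.
const UPPER_CASE = /^[A-Z!$^*()+{}:?|~@#%_"&<>]$/;

// Assumption: unshifted -> shifted symbol pairs on a US layout.
const UNSHIFTED = "`1234567890-=[]\\;',./";
const SHIFTED = '~!@#$%^&*()_+{}|:"<>?';

function shiftedForm(key) {
  const i = UNSHIFTED.indexOf(key);
  return i >= 0 ? SHIFTED[i] : key.toUpperCase();
}

function generateKeyEvents(keysToSend) {
  const events = [];
  const held = new Set(); // modifier keys currently held down
  const tap = (key) => events.push(`down:${key}`, `press:${key}`, `up:${key}`);
  const releaseAll = () => {
    for (const modifier of held) events.push(`up:${modifier}`);
    held.clear();
  };

  for (const key of Array.from(keysToSend.join(''))) {
    if (key === NULL_KEY) {
      releaseAll(); // must be tested before the generic PUA range below
    } else if (MODIFIERS.has(key)) {
      // A modifier toggles: down if released, up if currently held.
      if (held.delete(key)) events.push(`up:${key}`);
      else { held.add(key); events.push(`down:${key}`); }
    } else if (key >= '\uE000' && key <= '\uF8FF') {
      tap(key); // special key
    } else if (UPPER_CASE.test(key)) {
      if (held.has(SHIFT)) tap(key);
      else { events.push(`down:${SHIFT}`); tap(key); events.push(`up:${SHIFT}`); }
    } else {
      tap(held.has(SHIFT) ? shiftedForm(key) : key);
    }
  }
  releaseAll(); // implicit NULL key at the end of input
  return events;
}

console.log(generateKeyEvents(['A']));
// [ 'down:\uE008', 'down:A', 'press:A', 'up:A', 'up:\uE008' ]
```

This sketch wraps each upper-case character in its own Shift press and release, which is one of the two options the requirements allow.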

Internationalized input

Non-latin symbols: TBD

Complex scripts using Input Method Editor (IME): TBD

Note

User input should be emulated natively: The input events should be indistinguishable from a user, behind a screen and a keyboard, interacting with the browser. For that purpose, it is highly recommended that input events not be generated at the DOM level. Instead, emulated input events should originate from the browser's own event queue, just like other user input. This is the order of preference for methods to emulate user input:

Note

These input methods could be used to interact with the browser's chrome. However, the way to do so is not defined as it is most likely to be implementation-specific.

14.2 High Level APIs: Clicking and Typing

This section deals with implicit interaction. In this kind of interaction, a user describes a series of input actions which the implementation should fulfill with little to no additional actions. This kind of interaction allows dragging and dropping or combining keyboard and mouse actions.

interface Actions {
    void keyDown (Keys theKey);
    void keyUp (Keys theKey);
    void sendKeys (string[] keysToSend);
    void moveToElement (WebElement toElement);
    void moveByOffset (int xOffset, int yOffset);
    void clickAndHold ();
    void release ();
    void click ();
    void doubleClick ();
    void contextClick ();
};

14.2.1 Methods

click
Single-clicks the left mouse button, at the current mouse location.
No parameters.
Return type: void
clickAndHold
Clicks, without releasing, the left mouse button, at the current mouse location.
No parameters.
Return type: void
contextClick
Performs a context-click at the current mouse location.
No parameters.
Return type: void
doubleClick
Double-clicks the left mouse button, at the current mouse location.
No parameters.
Return type: void
keyDown
Performs a modifier key press. Does not release the modifier key; it is kept pressed for subsequent interactions. The only valid values for the theKey parameter are those defined as modifier keys in the previous section.
Parameter    Type    Nullable    Optional    Description
theKey       Keys
Return type: void
keyUp
Performs a modifier key release. The only valid values for the theKey parameter are those defined as modifier keys in the previous section.
Parameter    Type    Nullable    Optional    Description
theKey       Keys
Return type: void
moveByOffset
Moves the mouse cursor from its current position by the given offset. The offset provided may be negative. If the end coordinates are outside the viewport, then the viewport should be scrolled to match.
Parameter    Type    Nullable    Optional    Description
xOffset      int
yOffset      int
Return type: void
moveToElement
Moves the mouse cursor to the middle of toElement, after toElement has been scrolled into view (if outside the viewport). The middle of the element is calculated using getBoundingClientRect.
Parameter    Type          Nullable    Optional    Description
toElement    WebElement
Return type: void
release
Releases the previously-held left mouse button, at the current mouse location.
No parameters.
Return type: void
sendKeys
Sends a sequence of keys to the active element.
Parameter    Type        Nullable    Optional    Description
keysToSend   string[]
Return type: void

The following conditions must hold for implementation of this high-level API:

Mouse movement and scrolling: Despite the implicit nature of this API, some mouse movement actions still cause implicit scrolling. The alternative, of providing an API to scroll the viewport and forcing the user to do so, would be inconvenient. When moving to an element, the implementation should use the same method of scrolling as is used for WebElement.click(). When moving by an offset, the implementation has a greater freedom to decide how much to scroll. In both cases, the implementation must not scroll if the target coordinates are already in the viewport.

Note

The methods described in this interface are a minimal set. An implementation may add additional helper methods for convenience. Such a method could be dragAndDrop(WebElement, WebElement) as a shorthand form of calling moveToElement(WebElement), clickAndHold(), moveToElement(WebElement), release().
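
Such a helper could be composed from the minimal set as in the following sketch. RecordingActions is a hypothetical stand-in for the Actions interface that only logs the calls it receives, and the element objects are placeholders; neither is defined by this specification.

```javascript
// Hypothetical stand-in for the Actions interface; it records calls only.
class RecordingActions {
  constructor() { this.log = []; }
  moveToElement(element) { this.log.push(`moveToElement(${element.id})`); }
  clickAndHold() { this.log.push('clickAndHold()'); }
  release() { this.log.push('release()'); }
}

// Convenience helper built only from methods of the minimal interface.
function dragAndDrop(actions, source, target) {
  actions.moveToElement(source);
  actions.clickAndHold();
  actions.moveToElement(target);
  actions.release();
}

const actions = new RecordingActions();
dragAndDrop(actions, { id: 'card' }, { id: 'bin' });
console.log(actions.log);
// [ 'moveToElement(card)', 'clickAndHold()', 'moveToElement(bin)', 'release()' ]
```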

14.3 Low Level APIs

TODO: Describe the commands for basic control of keyboard and mouse.

14.3.1 Mouse

14.3.2 Keyboard

14.3.2.1 IME

14.3.3 Touch

16. Snapshots

16.1 Screen

16.2 Current Window

16.3 Element

17. Handling non-HTML Content

18. Extending the Protocol

A. Command Summary

This is an exhaustive list of the commands listed in this specification.

B. Command Format

This is essentially the content at the start of the json wire protocol

C. Thread Safety

D. Logging

E. Mapping to HTTP and JSON

F. Acknowledgements

Many thanks to Robin Berjon for making our lives so much easier with his cool tool. Thanks to Jason Leyba and Malcolm Rowe for proofreading and suggesting areas for improvement. Thanks to Jason Leyba, Eran Messeri and Daniel Wagner-Hall for contributing sections to this document.

G. References

G.1 Normative references

[CSS21]
Bert Bos; et al. Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) Specification. W3C Recommendation. URL: http://www.w3.org/TR/CSS21/
[CSSOM-VIEW]
Anne van Kesteren. CSSOM View Module. 22 February 2008. W3C Working Draft. (Work in progress.) URL: http://www.w3.org/TR/2008/WD-cssom-view-20080222
[DOM4]
Anne van Kesteren; Aryeh Gregor; Ms2ger. DOM4. 5 April 2012. W3C Working Draft. (Work in progress.) URL: http://www.w3.org/TR/dom/
[HTML401]
David Raggett; Ian Jacobs; Arnaud Le Hors. HTML 4.01 Specification. 24 December 1999. W3C Recommendation. URL: http://www.w3.org/TR/1999/REC-html401-19991224
[HTML5]
Ian Hickson; David Hyatt. HTML5. 25 May 2011. W3C Working Draft. (Work in progress.) URL: http://www.w3.org/TR/html5
[SELECTORS-API]
Lachlan Hunt; Anne van Kesteren. Selectors API. 14 November 2008. W3C Working Draft. (Work in progress.) URL: http://www.w3.org/TR/2008/WD-selectors-api-20081114
[WAI-ARIA]
Lisa Pappas; et al. Accessible Rich Internet Applications (WAI-ARIA) 1.0. 24 February 2009. W3C Working Draft. (Work in progress.) URL: http://www.w3.org/TR/2009/WD-wai-aria-20090224
[XPATH]
James Clark; Steven DeRose. XML Path Language (XPath) Version 1.0. 16 November 1999. W3C Recommendation. URL: http://www.w3.org/TR/1999/REC-xpath-19991116/

G.2 Informative references

[DOM-LEVEL-3-XPATH]
Ray Whitmer. Document Object Model (DOM) Level 3 XPath Specification. 26 February 2004. W3C Note. URL: http://www.w3.org/TR/2004/NOTE-DOM-Level-3-XPath-20040226