This specification extends the Media Capture and Streams specification [[!GETUSERMEDIA]] to allow a depth-only stream or combined depth+video stream to be requested from the web platform using APIs familiar to web authors.
This extension specification defines a new media type and constrainable properties per the Extensibility guidelines of the Media Capture and Streams specification [[!GETUSERMEDIA]]. Horizontal reviews and feedback from early implementations of this specification are encouraged.
Depth cameras are increasingly being integrated into devices such as phones, tablets, and laptops. Depth cameras provide a depth map, which conveys the distance information between points on an object's surface and the camera. With depth information, web content and applications can be enhanced by, for example, the use of hand gestures as an input mechanism, or by creating 3D models of real-world objects that can interact and integrate with the web platform. Concrete applications of this technology include more immersive gaming experiences, more accessible 3D video conferences, and augmented reality, to name a few.
To bring depth capability to the web platform, this specification
extends
the MediaStream interface [[!GETUSERMEDIA]] to
enable it to also contain depth-based
MediaStreamTracks. A depth-based
MediaStreamTrack, referred to as a depth stream
track, represents an abstraction of a stream of frames that can
each be converted to objects which contain an array of pixel data,
where each pixel represents the distance between the camera and the
objects in the scene for that point in the array. A
MediaStream object that contains one or more
depth stream tracks is referred to as a depth-only stream
or depth+video stream.
Depth cameras usually produce 16-bit depth values per pixel, so this specification defines a 16-bit grayscale representation of a depth map.
This specification attempts to address the Use Cases and Requirements for accessing depth stream from a depth camera. See also the Examples section for concrete usage examples.
This specification defines conformance criteria that apply to a single product: the user agent that implements the interfaces that it contains.
Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [[!WEBIDL]], as this specification uses that specification and terminology.
The
MediaStreamTrack and
MediaStream interfaces this specification extends
are defined in [[!GETUSERMEDIA]].
The
Constraints,
MediaStreamConstraints,
MediaTrackSettings,
MediaTrackConstraints,
MediaTrackSupportedConstraints,
MediaTrackCapabilities, and
MediaTrackConstraintSet dictionaries this
specification extends are defined in [[!GETUSERMEDIA]].
The
getUserMedia() and getSettings()
methods and the
NavigatorUserMediaSuccessCallback callback are
defined in [[!GETUSERMEDIA]].
The concepts muted,
disabled,
and
overconstrained as applied to
MediaStreamTrack are defined in [[!GETUSERMEDIA]].
The terms source and consumer are defined in [[!GETUSERMEDIA]].
The
MediaDeviceKind enumeration is defined in
[[!GETUSERMEDIA]].
The
video element and ImageData
(and its
data attribute and
Canvas Pixel ArrayBuffer),
VideoTrack,
HTMLMediaElement (and its
srcObject attribute),
HTMLVideoElement interfaces and the
CanvasImageSource enum are defined in [[!HTML]].
The terms media data, media provider object, assigned media provider object, and the concept potentially playing are defined in [[!HTML]].
The term permission
and the permission name "camera"
are defined in [[!PERMISSIONS]].
The DataView,
Uint8ClampedArray,
and Uint16Array
buffer source types are defined in [[WEBIDL]].
The term depth+video stream means a MediaStream
object that contains one or more MediaStreamTrack objects of
kind "depth" (depth stream track) and one or more
MediaStreamTrack objects of kind "video" (video
stream track).
The term depth-only stream means a MediaStream object
that contains one or more MediaStreamTrack objects of kind
"depth" (depth stream track) only.
The term video-only stream means a MediaStream object
that contains one or more MediaStreamTrack objects of kind
"video" (video stream track) only, and optionally
of kind "audio".
The term depth stream track means a MediaStreamTrack
object whose kind is "depth". It represents a media stream
track whose source is a depth camera.
The term video stream track means a MediaStreamTrack
object whose kind is "video". It represents a media stream
track whose source is a video camera.
A depth map is an abstract representation of a frame of a depth stream track. A depth map is an image that contains information relating to the distance of the surfaces of scene objects from a viewpoint.
A depth map has an associated near value which is a double. It represents the minimum range in meters.
A depth map has an associated far value which is a double. It represents the maximum range in meters.
A depth map has an associated focal length which is a double. It represents the focal length of the camera in millimeters.
A depth map has an associated horizontal field of view which is a double. It represents the horizontal angle of view in degrees.
A depth map has an associated vertical field of view which is a double. It represents the vertical angle of view in degrees.
The data type of a depth map is 16-bit unsigned integer. The algorithm to convert the depth map value to grayscale, given a depth map value d, is as follows:
The rules to convert using range linear are as given in the following formula:
dn = (d − near) / (far − near)
d16bit = ⌊dn · 65535⌋
The depth measurement d (in meter units) is recovered by solving the rules to convert using range linear for d as follows:
dn = d16bit / 65535
d = dn · (far − near) + near
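As a non-normative illustration, the conversion and its inverse above can be sketched in ECMAScript; the near and far values used in the usage note below are illustrative assumptions, not values mandated by this specification.

```javascript
// Non-normative sketch of the rules to convert using range linear.
// Quantize a depth value d (in meters) to a 16-bit grayscale value.
function depthToGrayscale(d, near, far) {
  var dn = (d - near) / (far - near);  // normalize to [0, 1]
  return Math.floor(dn * 65535);       // quantize to 16 bits
}

// Recover an approximation of the depth value d (in meters).
function grayscaleToDepth(d16bit, near, far) {
  var dn = d16bit / 65535;
  return dn * (far - near) + near;
}
```

For example, with an assumed range of near = 0.5 and far = 4.0 meters, a depth of 1.5 meters quantizes to 18724 and is recovered as approximately 1.49998 meters; the flooring step makes the conversion lossy.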
MediaStreamConstraints dictionary
partial dictionary MediaStreamConstraints {
(boolean or MediaTrackConstraints) depth = false;
};
If the depth dictionary member has the value
true, the MediaStream returned by the getUserMedia()
method MUST contain a depth stream track. If the depth
dictionary member is set to false, is not provided, or is set to
null, the MediaStream MUST NOT contain a depth stream
track. If the depth dictionary member is set to a valid
MediaTrackConstraints dictionary, the MediaStream returned by the
getUserMedia() method MUST contain a depth stream track
that fulfills the specified mandatory constraints.
The permission associated with a depth camera source is "camera".
MediaStream interface
partial interface MediaStream {
sequence<MediaStreamTrack> getDepthTracks();
};
The getDepthTracks() method, when invoked,
MUST return a sequence of depth
stream tracks in this stream.
The getDepthTracks() method MUST return a
sequence that represents a snapshot of all the
MediaStreamTrack objects in this stream's track
set whose kind is equal to "depth".
The conversion from the track set to the sequence is user
agent defined and the order does not have to be stable between
calls.
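Non-normatively, the behavior of getDepthTracks() can be expressed as a filter over the stream's track set; the helper function below is a hypothetical illustration, not part of the API.

```javascript
// Hypothetical, non-normative equivalent of stream.getDepthTracks():
// a snapshot of the MediaStreamTrack objects in the stream's track set
// whose kind is equal to "depth".
function getDepthTracksOf(stream) {
  return stream.getTracks().filter(function (track) {
    return track.kind === 'depth';
  });
}
```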
The MediaStream consumer for the depth-only
stream and depth+video stream is the video element [[!HTML]].
A video stream track and a depth stream track can be combined into one depth+video stream. The rendering of the two tracks is intended to be synchronized, the resolutions of the two tracks are intended to be the same, and the coordinate systems of the two tracks are intended to be calibrated. These are not hard requirements, since it might not be possible to synchronize tracks from sources.
MediaStreamTrack interface
The kind attribute MUST, on getting, return
the string "depth" if the object represents a depth
stream track.
If a MediaStreamTrack of kind "depth" is
muted or disabled, it MUST render black frames, or a
zero-information-content equivalent.
MediaDeviceInfo interface
The string "depthinput" is the MediaDeviceKind
value for the depth camera input device.
A media provider object can represent a depth-only stream (and specifically, not a depth+video stream). The user agent MUST support a HTMLMediaElement with an assigned media provider object that is a depth-only stream, and in particular, the srcObject IDL attribute that allows the HTMLMediaElement to be assigned a media provider object MUST, on setting and getting, behave as specified in [[!HTML]].
video element
When a video element is potentially playing and its assigned media provider object is a depth-only stream, the user agent MUST, for each pixel of the media data that is represented by a depth map, given a depth map value d, convert the depth map value to grayscale and render the returned value to the screen.
For a video element whose assigned media provider
object is a depth+video stream, the user agent MUST
act as if all the MediaStreamTracks of kind
"depth" were removed prior to when the
video element is potentially playing.
VideoTrack interface
For each depth stream track in the depth-only stream, the user agent MUST create a corresponding VideoTrack as defined in [[HTML]].
ImageData interface
When an ImageData object's pixel values represent a depth map (that is, the image source for the 2D rendering context CanvasImageSource is an HTMLVideoElement whose media data represents a depth map), the Uint8ClampedArray source assigned to its data attribute represents the 16-bit depth value by assigning the low 8 bits of the 16-bit depth value, dlow8bit, to its red component, and the high 8 bits, dhigh8bit, to its green component.
If the values are read from the default Uint8ClampedArray
view, they are represented as Canvas Pixel
ArrayBuffer data as follows:
red8bit = dlow8bit
green8bit = dhigh8bit
blue8bit = 0
alpha8bit = 0
var depthVideo = document.querySelector('video');
var canvas = document.querySelector('canvas');
var context = canvas.getContext('2d');
var w = depthVideo.videoWidth, h = depthVideo.videoHeight;
context.drawImage(depthVideo, 0, 0, w, h);
var imageData = context.getImageData(0, 0, w, h);
// Create a DataView over the underlying ArrayBuffer to read the
// pixel data as unsigned 16-bit values.
var dv = new DataView(imageData.data.buffer);
// Read the 16-bit depth value of each pixel: the red (low) and
// green (high) bytes, in little endian representation.
for (var i = 0; i < imageData.data.length; i += 4) {
console.log(dv.getUint16(i, true));
}
// Alternatively, reconstruct each unsigned 16-bit value from the red and
// green bytes. The result is the same as with DataView and getUint16().
for (var i = 0; i < imageData.data.length; i += 4) {
console.log(imageData.data[i] + (imageData.data[i+1] << 8));
}
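The byte layout above can also be illustrated in isolation. The following non-normative sketch packs a 16-bit value into the red (low byte) and green (high byte) components of a single RGBA pixel and reads it back both ways; the value 0xABCD is an arbitrary example, not a value taken from a real depth map.

```javascript
// One RGBA pixel of Canvas Pixel ArrayBuffer data.
var pixel = new Uint8ClampedArray(4);
var d16bit = 0xABCD;       // arbitrary example depth value
pixel[0] = d16bit & 0xff;  // red = low 8 bits
pixel[1] = d16bit >> 8;    // green = high 8 bits
// Both read-back methods from the example above give the same result.
var viaDataView = new DataView(pixel.buffer).getUint16(0, true);
var viaBytes = pixel[0] + (pixel[1] << 8);
```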
MediaTrackSettings dictionary
When the getSettings() method is invoked on a depth stream
track, the user agent MUST return the following dictionary
that extends the MediaTrackSettings dictionary:
partial dictionary MediaTrackSettings {
double near;
double far;
double focalLength;
double horizontalFieldOfView;
double verticalFieldOfView;
};
The near dictionary member represents the
depth map's near value.
The far dictionary member represents the
depth map's far value.
The focalLength dictionary member
represents the depth map's focal length.
The horizontalFieldOfView dictionary member
represents the depth map's horizontal field of view.
The verticalFieldOfView dictionary member
represents the depth map's vertical field of view.
| Property name | Values | Notes |
|---|---|---|
| near | ConstrainDouble | The near value, in meters. |
| far | ConstrainDouble | The far value, in meters. |
| focalLength | ConstrainDouble | The focal length, in millimeters. |
| horizontalFieldOfView | ConstrainDouble | The horizontal field of view, in degrees. |
| verticalFieldOfView | ConstrainDouble | The vertical field of view, in degrees. |
The near, far, focalLength,
horizontalFieldOfView, and
verticalFieldOfView constrainable properties are defined
to apply only to depth stream tracks.
focalLength, horizontalFieldOfView, and
verticalFieldOfView properties could be upstreamed to a
future version of the Media Capture and Streams
specification [[!GETUSERMEDIA]] to allow them to be applied to video
MediaStreamTrack objects as well.
The near and far constrainable properties,
when set, allow the implementation to pick the best depth camera mode
optimized for the range [near, far] and help minimize
the error introduced by the lossy conversion from the depth value
d to a quantized d16bit and back to an
approximation of the depth value d.
If the far property's value is less than the
near property's value, the depth stream track is
overconstrained.
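A non-normative sketch of this check, assuming plain double values for the near and far constraints (the helper function is hypothetical, not part of the API):

```javascript
// Returns true if the far constraint value is less than the near
// constraint value, in which case the depth stream track is
// overconstrained.
function isOverconstrained(constraints) {
  return typeof constraints.near === 'number' &&
         typeof constraints.far === 'number' &&
         constraints.far < constraints.near;
}
```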
If the near value, far value, focal length, horizontal field of view, or vertical field of view is fixed due to a hardware or software limitation, the corresponding constrainable property's value MUST be set to the value reported by the underlying implementation. (For example, the focal length of the lens may be fixed, or the underlying platform may not expose the focal length information.)
partial dictionary MediaTrackConstraintSet {
ConstrainDouble near;
ConstrainDouble far;
ConstrainDouble focalLength;
ConstrainDouble horizontalFieldOfView;
ConstrainDouble verticalFieldOfView;
};
partial dictionary MediaTrackSupportedConstraints {
boolean near = true;
boolean far = true;
boolean focalLength = true;
boolean horizontalFieldOfView = true;
boolean verticalFieldOfView = true;
};
partial dictionary MediaTrackCapabilities {
(double or DoubleRange) near;
(double or DoubleRange) far;
(double or DoubleRange) focalLength;
(double or DoubleRange) horizontalFieldOfView;
(double or DoubleRange) verticalFieldOfView;
};
WebGLRenderingContext interface
A video element whose source is a
MediaStream object containing a depth stream
track may be uploaded to a WebGL texture of format
RGB and type UNSIGNED_BYTE. [[WEBGL]]
For each pixel of this WebGL texture, the R component represents the lower 8 bits of the 16-bit depth value, the G component represents the upper 8 bits, and the value of the B component is not defined.
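When the texture is sampled in a shader, the R and G components are normalized by the GPU to the range [0, 1]. Non-normatively, the normalized 16-bit depth value dn can be reconstructed from them as follows (shown here in ECMAScript for clarity; the same arithmetic applies in GLSL):

```javascript
// Reconstruct the normalized depth value dn in [0, 1] from the sampled
// red (low byte) and green (high byte) components, each in [0, 1].
function reconstructDn(r, g) {
  return (r * 255 + g * 255 * 256) / 65535;
}
```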
navigator.mediaDevices.getUserMedia({
depth: true,
video: true
}).then(function (stream) {
// Wire the media stream into a <video> element for playback.
// The RGB video is rendered.
var video = document.querySelector('#video');
video.srcObject = stream;
video.play();
// Construct a depth-only stream out of the existing depth stream track.
var depthOnlyStream = new MediaStream([stream.getDepthTracks()[0]]);
// Wire the depth-only stream into another <video> element for playback.
// The depth information is rendered in its grayscale representation.
var depthVideo = document.querySelector('#depthVideo');
depthVideo.srcObject = depthOnlyStream;
depthVideo.play();
}
);
// This code sets up a video element from a depth stream, uploads it to a WebGL
// texture, and samples that texture in the fragment shader, reconstructing the
// 16-bit depth values from the red and green channels.
navigator.mediaDevices.getUserMedia({
depth: true,
}).then(function (stream) {
// wire the stream into a <video> element for playback
var depthVideo = document.querySelector('#depthVideo');
depthVideo.srcObject = stream;
depthVideo.play();
}).catch(function (reason) {
// handle gUM error here
});
// ... later, in the rendering loop ...
gl.texImage2D(
gl.TEXTURE_2D,
0,
gl.RGB,
gl.RGB,
gl.UNSIGNED_BYTE,
depthVideo
);
<script id="fragment-shader" type="x-shader/x-fragment">
varying vec2 v_texCoord;
// u_tex points to the texture unit containing the depth texture.
uniform sampler2D u_tex;
uniform float far;
uniform float near;
void main() {
vec4 floatColor = texture2D(u_tex, v_texCoord);
// Reconstruct the normalized 16-bit depth value from the lower 8 bits
// (red component) and upper 8 bits (green component).
float dn = (floatColor.r * 255. + floatColor.g * 255. * 256.) / 65535.;
// Recover the depth in meters using the range linear conversion.
float depth = dn * (far - near) + near;
// ...
}
</script>
The privacy and security considerations discussed in [[!GETUSERMEDIA]] apply to this extension specification.
Thanks to everyone who contributed to the Use Cases and Requirements, sent feedback and comments. Special thanks to Ningxin Hu for experimental implementations, as well as to the Project Tango for their experiments.
The range linear format [[!XDM]] is licensed under CC BY 4.0.