This specification extends the Media Capture and Streams specification [[!GETUSERMEDIA]] to allow a depth-only stream or combined depth+video stream to be requested from the web platform using APIs familiar to web authors.
This extension specification defines a new media type and constrainable properties per the Extensibility guidelines of the Media Capture and Streams specification [[!GETUSERMEDIA]]. Horizontal reviews and feedback from early implementations of this specification are encouraged.
Depth cameras are increasingly being integrated into devices such as phones, tablets, and laptops. Depth cameras provide a depth map, which conveys the distance information between points on an object's surface and the camera. With depth information, web content and applications can be enhanced by, for example, the use of hand gestures as an input mechanism, or by creating 3D models of real-world objects that can interact and integrate with the web platform. Concrete applications of this technology include more immersive gaming experiences, more accessible 3D video conferences, and augmented reality, to name a few.
To bring depth capability to the web platform, this specification
extends
the MediaStream interface [[!GETUSERMEDIA]] to
enable it to also contain depth-based
MediaStreamTracks. A depth-based
MediaStreamTrack, referred to as a depth stream
track, represents an abstraction of a stream of frames that can
each be converted to objects which contain an array of pixel data,
where each pixel represents the distance between the camera and the
objects in the scene for that point in the array. A
MediaStream object that contains one or more
depth stream tracks is referred to as a depth-only stream
or depth+video stream.
Depth cameras usually produce 16-bit depth values per pixel, so this specification defines a 16-bit grayscale representation of a depth map.
This specification attempts to address the Use Cases and Requirements for accessing depth stream from a depth camera. See also the Examples section for concrete usage examples.
This specification defines conformance criteria that apply to a single product: the user agent that implements the interfaces that it contains.
Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [[!WEBIDL]], as this specification uses that specification and terminology.
The
MediaStreamTrack and
MediaStream interfaces this specification extends
are defined in [[!GETUSERMEDIA]].
The
Constraints,
MediaStreamConstraints,
MediaTrackSettings,
MediaTrackConstraints,
MediaTrackSupportedConstraints,
MediaTrackCapabilities, and
MediaTrackConstraintSet dictionaries this
specification extends are defined in [[!GETUSERMEDIA]].
The
getUserMedia() and getSettings()
methods and the
NavigatorUserMediaSuccessCallback callback are
defined in [[!GETUSERMEDIA]].
The concepts muted,
disabled,
and
overconstrained as applied to
MediaStreamTrack are defined in [[!GETUSERMEDIA]].
The terms source and consumer are defined in [[!GETUSERMEDIA]].
The
MediaDeviceKind enumeration is defined in
[[!GETUSERMEDIA]].
The
video element and ImageData
(and its
data attribute and
Canvas Pixel ArrayBuffer),
VideoTrack,
HTMLMediaElement (and its
srcObject attribute),
HTMLVideoElement interfaces and the
CanvasImageSource enum are defined in [[!HTML]].
The terms media data, media provider object, assigned media provider object, and the concept potentially playing are defined in [[!HTML]].
The term permission
and the permission name "camera"
are defined in [[!PERMISSIONS]].
The DataView,
Uint8ClampedArray,
and Uint16Array
buffer source types are defined in [[WEBIDL]].
The term depth+video stream means a MediaStream
object that contains one or more MediaStreamTrack objects of
kind "depth" (depth stream track) and one or more
MediaStreamTrack objects of kind "video" (video
stream track).
The term depth-only stream means a MediaStream object
that contains one or more MediaStreamTrack objects of kind
"depth" (depth stream track) only.
The term video-only stream means a MediaStream object
that contains one or more MediaStreamTrack objects of kind
"video" (video stream track) only, and optionally
of kind "audio".
The term depth stream track means a MediaStreamTrack
object whose kind is "depth". It represents a media stream
track whose source is a depth camera.
The term video stream track means a MediaStreamTrack
object whose kind is "video". It represents a media stream
track whose source is a video camera.
A depth map is an abstract representation of a frame of a depth stream track. A depth map is an image that contains information relating to the distance of the surfaces of scene objects from a viewpoint.
A depth map has an associated near value which is a double. It represents the minimum range in meters.
A depth map has an associated far value which is a double. It represents the maximum range in meters.
A depth map has an associated focal length which is a double. It represents the focal length of the camera in millimeters.
A depth map has an associated horizontal field of view which is a double. It represents the horizontal angle of view in degrees.
A depth map has an associated vertical field of view which is a double. It represents the vertical angle of view in degrees.
The data type of a depth map is 16-bit unsigned integer. The algorithm to convert the depth map value to grayscale, given a depth map value d, is as follows:
The rules to convert using range linear are as given in the following formula:
dn = (d − near) / (far − near)
d16bit = ⌊dn · 65535⌋
The depth measurement d (in meter units) is recovered by solving the rules to convert using range linear for d as follows:
dn = d16bit / 65535
d = dn · (far − near) + near
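As a non-normative illustration, the conversion and its inverse above can be sketched in ECMAScript; the near and far values used in the usage note below are illustrative assumptions, not values mandated by this specification.

```javascript
// Non-normative sketch of the rules to convert using range linear.
// Quantize a depth value d (in meters) to a 16-bit grayscale value.
function depthToGrayscale(d, near, far) {
  var dn = (d - near) / (far - near);  // normalize to [0, 1]
  return Math.floor(dn * 65535);       // quantize to 16 bits
}

// Recover an approximation of the depth value d (in meters).
function grayscaleToDepth(d16bit, near, far) {
  var dn = d16bit / 65535;
  return dn * (far - near) + near;
}
```

For example, with an assumed range of near = 0.5 and far = 4.0 meters, a depth of 1.5 meters quantizes to 18724 and is recovered as approximately 1.49998 meters; the flooring step makes the conversion lossy.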
MediaStreamConstraints dictionary
partial dictionary MediaStreamConstraints {
(boolean or MediaTrackConstraints) depth = false;
};
If the depth dictionary member has the value
true, the MediaStream returned by the getUserMedia()
method MUST contain a depth stream track. If the depth
dictionary member is set to false, is not provided, or is set to
null, the MediaStream MUST NOT contain a depth stream
track. If the depth dictionary member is set to a valid
MediaTrackConstraints dictionary, the MediaStream returned by the
getUserMedia() method MUST contain a depth stream track
that fulfills the specified mandatory constraints.
The permission associated with a depth camera source is "camera".
MediaStream interface
partial interface MediaStream {
sequence<MediaStreamTrack> getDepthTracks();
};
The getDepthTracks() method, when invoked,
MUST return a sequence of depth
stream tracks in this stream.
The getDepthTracks() method MUST return a
sequence that represents a snapshot of all the
MediaStreamTrack objects in this stream's track
set whose kind is equal to "depth".
The conversion from the track set to the sequence is user
agent defined and the order does not have to be stable between
calls.
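Non-normatively, the behavior of getDepthTracks() can be expressed as a filter over the stream's track set; the helper function below is a hypothetical illustration, not part of the API.

```javascript
// Hypothetical, non-normative equivalent of stream.getDepthTracks():
// a snapshot of the MediaStreamTrack objects in the stream's track set
// whose kind is equal to "depth".
function getDepthTracksOf(stream) {
  return stream.getTracks().filter(function (track) {
    return track.kind === 'depth';
  });
}
```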
The MediaStream consumer for the depth-only
stream and depth+video stream is the video element [[!HTML]].
A video stream track and a depth stream track can be combined into one depth+video stream. The rendering of the two tracks is intended to be synchronized, the resolutions of the two tracks are intended to be the same, and the coordinate systems of the two tracks are intended to be calibrated. These are not hard requirements, since it might not be possible to synchronize tracks from sources.
MediaStreamTrack interface
The kind attribute MUST, on getting, return
the string "depth" if the object represents a depth
stream track.
If a MediaStreamTrack of kind "depth" is
muted or disabled, it MUST render black frames, or a
zero-information-content equivalent.
MediaDeviceInfo interface
The string "depthinput" is the MediaDeviceKind
value for the depth camera input device.
A media provider object can represent a depth-only stream (and specifically, not a depth+video stream). The user agent MUST support a HTMLMediaElement with an assigned media provider object that is a depth-only stream, and in particular, the srcObject IDL attribute that allows the HTMLMediaElement to be assigned a media provider object MUST, on setting and getting, behave as specified in [[!HTML]].
video element
When a video element is potentially playing and its assigned media provider object is a depth-only stream, the user agent MUST, for each pixel of the media data that is represented by a depth map, given a depth map value d, convert the depth map value to grayscale and render the returned value to the screen.
For a video element whose assigned media provider
object is a depth+video stream, the user agent MUST
act as if all the MediaStreamTracks of kind
"depth" were removed prior to when the
video element is potentially playing.
VideoTrack interface
For each depth stream track in the depth-only stream, the user agent MUST create a corresponding VideoTrack as defined in [[HTML]].
ImageData interface
When an ImageData object's pixel values represent a depth map (that is, the image source for the 2D rendering context CanvasImageSource is an HTMLVideoElement whose media data represents a depth map), the Uint8ClampedArray source assigned to its data attribute represents the 16-bit depth value by assigning the low 8 bits of the 16-bit depth value, dlow8bit, to its red component, and the high 8 bits, dhigh8bit, to its green component.
If the values are read from the default Uint8ClampedArray
view, they are represented as Canvas Pixel
ArrayBuffer data as follows:
red8bit = dlow8bit
green8bit = dhigh8bit
blue8bit = 0
alpha8bit = 0
var depthVideo = document.querySelector('video');
var canvas = document.querySelector('canvas');
var context = canvas.getContext('2d');
var w = depthVideo.videoWidth, h = depthVideo.videoHeight;
context.drawImage(depthVideo, 0, 0, w, h);
var imageData = context.getImageData(0, 0, w, h);
// Create a DataView over the underlying ArrayBuffer to read the
// pixel data as unsigned 16-bit values.
var dv = new DataView(imageData.data.buffer);
// Read the 16-bit depth value of each pixel: the red (low) and
// green (high) bytes, in little endian representation.
for (var i = 0; i < imageData.data.length; i += 4) {
console.log(dv.getUint16(i, true));
}
// Alternatively, reconstruct each unsigned 16-bit value from the red and
// green bytes. The result is the same as with DataView and getUint16().
for (var i = 0; i < imageData.data.length; i += 4) {
console.log(imageData.data[i] + (imageData.data[i+1] << 8));
}
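The byte layout above can also be illustrated in isolation. The following non-normative sketch packs a 16-bit value into the red (low byte) and green (high byte) components of a single RGBA pixel and reads it back both ways; the value 0xABCD is an arbitrary example, not a value taken from a real depth map.

```javascript
// One RGBA pixel of Canvas Pixel ArrayBuffer data.
var pixel = new Uint8ClampedArray(4);
var d16bit = 0xABCD;       // arbitrary example depth value
pixel[0] = d16bit & 0xff;  // red = low 8 bits
pixel[1] = d16bit >> 8;    // green = high 8 bits
// Both read-back methods from the example above give the same result.
var viaDataView = new DataView(pixel.buffer).getUint16(0, true);
var viaBytes = pixel[0] + (pixel[1] << 8);
```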
MediaTrackSettings dictionary
When the getSettings() method is invoked on a depth stream
track, the user agent MUST return the following dictionary
that extends the MediaTrackSettings dictionary:
partial dictionary MediaTrackSettings {
double near;
double far;
double focalLength;
double horizontalFieldOfView;
double verticalFieldOfView;
};
The near dictionary member represents the
depth map's near value.
The far dictionary member represents the
depth map's far value.
The focalLength dictionary member
represents the depth map's focal length.
The horizontalFieldOfView dictionary member
represents the depth map's horizontal field of view.
The verticalFieldOfView dictionary member
represents the depth map's vertical field of view.
| Property name | Values | Notes |
|---|---|---|
| near | ConstrainDouble | The near value, in meters. |
| far | ConstrainDouble | The far value, in meters. |
| focalLength | ConstrainDouble | The focal length, in millimeters. |
| horizontalFieldOfView | ConstrainDouble | The horizontal field of view, in degrees. |
| verticalFieldOfView | ConstrainDouble | The vertical field of view, in degrees. |
The near, far, focalLength,
horizontalFieldOfView, and
verticalFieldOfView constrainable properties are defined
to apply only to depth stream tracks.
focalLength, horizontalFieldOfView, and
verticalFieldOfView properties could be upstreamed to a
future version of the Media Capture and Streams
specification [[!GETUSERMEDIA]] to allow them to be applied to video
MediaStreamTrack objects as well.
The near and far constrainable properties,
when set, allow the implementation to pick the best depth camera mode
optimized for the range [near, far] and help minimize
the error introduced by the lossy conversion from the depth value
d to a quantized d16bit and back to an
approximation of the depth value d.
If the far property's value is less than the
near property's value, the depth stream track is
overconstrained.
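A non-normative sketch of this check, assuming plain double values for the near and far constraints (the helper function is hypothetical, not part of the API):

```javascript
// Returns true if the far constraint value is less than the near
// constraint value, in which case the depth stream track is
// overconstrained.
function isOverconstrained(constraints) {
  return typeof constraints.near === 'number' &&
         typeof constraints.far === 'number' &&
         constraints.far < constraints.near;
}
```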
If the near value, far value, focal length, horizontal field of view, or vertical field of view is fixed due to a hardware or software limitation, the corresponding constrainable property's value MUST be set to the value reported by the underlying implementation. (For example, the focal length of the lens may be fixed, or the underlying platform may not expose the focal length information.)
partial dictionary MediaTrackConstraintSet {
ConstrainDouble near;
ConstrainDouble far;
ConstrainDouble focalLength;
ConstrainDouble horizontalFieldOfView;
ConstrainDouble verticalFieldOfView;
};
partial dictionary MediaTrackSupportedConstraints {
boolean near = true;
boolean far = true;
boolean focalLength = true;
boolean horizontalFieldOfView = true;
boolean verticalFieldOfView = true;
};
partial dictionary MediaTrackCapabilities {
(double or DoubleRange) near;
(double or DoubleRange) far;
(double or DoubleRange) focalLength;
(double or DoubleRange) horizontalFieldOfView;
(double or DoubleRange) verticalFieldOfView;
};
WebGLRenderingContext interface
A video element whose source is a
MediaStream object containing a depth stream
track may be uploaded to a WebGL texture of format
RGB and type UNSIGNED_BYTE. [[WEBGL]]
For each pixel of this WebGL texture, the R component represents the lower 8 bits of the 16-bit depth value, the G component represents the upper 8 bits, and the value of the B component is not defined.
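When the texture is sampled in a shader, the R and G components are normalized by the GPU to the range [0, 1]. Non-normatively, the normalized 16-bit depth value dn can be reconstructed from them as follows (shown here in ECMAScript for clarity; the same arithmetic applies in GLSL):

```javascript
// Reconstruct the normalized depth value dn in [0, 1] from the sampled
// red (low byte) and green (high byte) components, each in [0, 1].
function reconstructDn(r, g) {
  return (r * 255 + g * 255 * 256) / 65535;
}
```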
navigator.mediaDevices.getUserMedia({
depth: true,
video: true
}).then(function (stream) {
// Wire the media stream into a <video> element for playback.
// The RGB video is rendered.
var video = document.querySelector('#video');
video.srcObject = stream;
video.play();
// Construct a depth-only stream out of the existing depth stream track.
var depthOnlyStream = new MediaStream([stream.getDepthTracks()[0]]);
// Wire the depth-only stream into another <video> element for playback.
// The depth information is rendered in its grayscale representation.
var depthVideo = document.querySelector('#depthVideo');
depthVideo.srcObject = depthOnlyStream;
depthVideo.play();
}
);
// This code sets up a video element from a depth stream, uploads it to a WebGL
// texture, and samples that texture in the fragment shader, reconstructing the
// 16-bit depth values from the red and green channels.
navigator.mediaDevices.getUserMedia({
depth: true,
}).then(function (stream) {
// wire the stream into a <video> element for playback
var depthVideo = document.querySelector('#depthVideo');
depthVideo.srcObject = stream;
depthVideo.play();
}).catch(function (reason) {
// handle gUM error here
});
// ... later, in the rendering loop ...
gl.texImage2D(
gl.TEXTURE_2D,
0,
gl.RGB,
gl.RGB,
gl.UNSIGNED_BYTE,
depthVideo
);
<script id="fragment-shader" type="x-shader/x-fragment">
varying vec2 v_texCoord;
// u_tex points to the texture unit containing the depth texture.
uniform sampler2D u_tex;
uniform float far;
uniform float near;
void main() {
vec4 floatColor = texture2D(u_tex, v_texCoord);
// Reconstruct the normalized 16-bit depth value from the lower 8 bits
// (red component) and upper 8 bits (green component).
float dn = (floatColor.r * 255. + floatColor.g * 255. * 256.) / 65535.;
// Recover the depth in meters using the range linear conversion.
float depth = dn * (far - near) + near;
// ...
}
</script>
The privacy and security considerations discussed in [[!GETUSERMEDIA]] apply to this extension specification.
Thanks to everyone who contributed to the Use Cases and Requirements, sent feedback and comments. Special thanks to Ningxin Hu for experimental implementations, as well as to the Project Tango for their experiments.
The range linear format [[!XDM]] is licensed under CC BY 4.0.