This specification defines the HTML microdata mechanism. This mechanism allows machine-readable data to be embedded in HTML documents similarly to the mechanism of RDFa. It is compatible with JSON, and can be written in a style which is convertible to RDF, although two-way conversion is not lossless.
This document is an editor's draft for the Web Platform Working Group, proposed as an update to the 4 May 2017 W3C First Public Working Draft.
This specification is an extension to HTML. All normative content in the HTML specification, unless specifically overridden by this specification, is intended to be the basis for this specification.
If you wish to comment on this document, please submit your comments as GitHub issues. All feedback is welcome, but please note that the contribution guidelines require agreement to the terms of the W3C Patent Policy for substantive contributions.
This specification depends on the HTML specification. [[!HTML52]]
Information expressed as microdata can be converted to JSON, as described in Section 6.1. Microdata can generally be converted to RDF, as described in Microdata to RDF, but only when an additional set of constraints is applied to the microdata content. [[!JSON]][[microdata-rdf]]
The URL specification, and RFC 3986, which uses the term URI, define a URL, valid URL, and absolute URL. For the purposes of this specification, the terms "URL" and "URI" are equivalent. [[RFC3986]][[URL]]
This specification relies on the HTML specification to define the following terms: [[!HTML52]]
HTML defines tree order and the concept of a node's home subtree.
HTML defines the terms space characters, split a string on spaces, and prefix match.
HTML defines the meaning of the term HTML elements, as well as all the elements referenced in this specification. In the context of content models it defines the terms flow content and phrasing content. It also defines what an element's ID or language is in HTML.
HTML defines a set of global attributes and the concepts of a boolean attribute and an unordered set of unique space-separated tokens.
HTML defines what the document's current address is.
Finally, HTML also defines the concepts of drag-and-drop initialization steps and of the list of dragged nodes, which come up in the context of drag-and-drop interfaces. [[!HTML52]]
Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.
For example, were the spec to say:
To eat an orange, the user must:
1. Peel the orange.
2. Separate each slice of the orange.
3. Eat the orange slices.
...it would be equivalent to the following:
To eat an orange:
1. The user must peel the orange.
2. The user must separate each slice of the orange.
3. The user must eat the orange slices.
Here the key word is "must".
The former (imperative) style is generally preferred in this specification for stylistic reasons.
Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
The microdata model consists of groups of name-value pairs known as items.
Each item can have item types, a global identifier (if the vocabulary specified by the item types supports global identifiers for items), and a list of name-value pairs. Each name in a name-value pair is known as a property, and each property has one or more values. Each value is either a string or itself a group of name-value pairs (an item). The names are unordered relative to each other, but if a particular name has multiple values, they do have a relative order.
Every HTML element may have an
itemscope attribute specified.
The itemscope attribute is a
boolean attribute.
An element with the itemscope attribute specified
creates a new item, a group of name-value pairs.
Elements with an itemscope attribute may have an
itemtype attribute specified, to give the
item types of the item.
The itemtype attribute, if specified, must have a value that
is an unordered set of unique space-separated tokens that are
case-sensitive, each of which is a valid URL that is an absolute
URL, and all of which are defined to use the same vocabulary. The attribute's value must
have at least one token.
The item types of an item are the tokens obtained
by splitting the element's itemtype attribute's value on spaces.
If the itemtype attribute is missing or parsing it
in this way finds no tokens, the item is said to have no
item types.
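The derivation of item types from the itemtype attribute can be sketched as follows. This is a minimal illustration, not a normative implementation; the function names split_on_spaces and item_types are hypothetical, and the attribute value is assumed to be passed in as a string, or None when the attribute is missing.

```python
# HTML space characters: space, tab, line feed, form feed, carriage return.
HTML_SPACE_CHARACTERS = " \t\n\f\r"


def split_on_spaces(value):
    """Split a string on HTML space characters, dropping empty tokens."""
    tokens = []
    current = ""
    for ch in value:
        if ch in HTML_SPACE_CHARACTERS:
            if current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def item_types(itemtype_value):
    """Return the item types for an itemtype attribute value.

    An empty list means the item has no item types, either because the
    attribute is missing (None) or because splitting finds no tokens.
    """
    if itemtype_value is None:
        return []
    return split_on_spaces(itemtype_value)
```

Python's built-in str.split() is deliberately not used here, because it treats additional characters as whitespace beyond the five space characters.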
The item types must all be types defined in applicable specifications and must all be defined to use the same vocabulary.
Except if otherwise specified by that specification, the URLs given as the item types should not be automatically dereferenced.
A specification could define that its item type can be dereferenced to provide the user with help information, for example. In fact, vocabulary authors are encouraged to provide useful information at the given URL.
Item types are opaque identifiers, and user agents must not dereference unknown item types, or otherwise deconstruct them, in order to determine how to process items that use them.
The itemtype attribute must not be specified on elements
that do not have an itemscope attribute specified.
An item is said to be a typed item when either it has an item type, or it is the value of a property of a typed item. The relevant types for a typed item are the item's item types, if it has any, or else the relevant types of the item for which it is a property's value.
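The definition of relevant types is recursive, which a short sketch may make easier to see. The Item class and its parent field are assumptions made for illustration: parent stands for the item for which this item is a property's value, or None for a top-level item.

```python
class Item:
    """Hypothetical in-memory model of a microdata item."""

    def __init__(self, types=(), parent=None):
        self.types = list(types)  # tokens from itemtype, possibly empty
        self.parent = parent      # item that holds this item as a property value


def relevant_types(item):
    """Return the item's own types, or else those of the nearest typed ancestor item."""
    current = item
    while current is not None:
        if current.types:
            return current.types
        current = current.parent
    return []


def is_typed_item(item):
    """An item is typed if it, or an item it is a property value of, has item types."""
    return bool(relevant_types(item))
```

For example, an untyped item nested as a property value inside a typed item inherits that item's types as its relevant types.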
Elements with an itemscope attribute and an
itemtype attribute that references a vocabulary
that is defined to support global identifiers for items
may also have an itemid attribute specified,
to give a global identifier for the item,
so that it can be related to other items on pages elsewhere on the Web.
The itemid attribute, if specified, must have a value that is
a valid URL potentially surrounded by spaces.
The global identifier of an item
is the value of its element's itemid attribute, if it has one,
resolved relative to the element on which the attribute is specified.
If the itemid attribute is missing or if resolving it fails, it
is said to have no global identifier.
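As a rough sketch of how a global identifier might be computed, the following uses Python's urllib.parse.urljoin, which follows RFC 3986 rather than the URL specification, so results can differ in edge cases. The function name global_identifier and the base_url parameter (standing in for the base URL of the element) are assumptions, as is the choice to treat a result without a scheme as a resolution failure.

```python
from urllib.parse import urljoin, urlsplit

# HTML space characters: space, tab, line feed, form feed, carriage return.
HTML_SPACE_CHARACTERS = " \t\n\f\r"


def global_identifier(itemid_value, base_url):
    """Return the resolved itemid value, or None if there is no global identifier."""
    if itemid_value is None:
        return None  # attribute missing
    stripped = itemid_value.strip(HTML_SPACE_CHARACTERS)
    try:
        resolved = urljoin(base_url, stripped)
    except ValueError:
        return None  # resolving failed
    # Assumption: a result with no scheme is not an absolute URL,
    # so it is treated as a failure to resolve.
    if not urlsplit(resolved).scheme:
        return None
    return resolved
```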
The itemid attribute must not be specified on elements
that do not have both an itemscope attribute and an
itemtype attribute specified, and must not be specified
on elements with an itemscope attribute whose
itemtype attribute specifies a vocabulary that does not
support global identifiers for items,
as defined by that vocabulary's specification.
The exact meaning of a global identifier is determined by the vocabulary's specification. It is up to such specifications to define whether multiple items with the same global identifier (whether on the same page or on different pages) are allowed to exist, and what the processing rules for that vocabulary are with respect to handling the case of multiple items with the same global identifier.
Elements with an itemscope attribute may have an
itemref attribute specified, to give a list of additional
elements to crawl to find the name-value pairs of the item.
The itemref attribute, if specified, must have a value that
is an unordered set of unique space-separated tokens that are
case-sensitive, consisting of IDs of elements in the same home subtree.
The itemref attribute must not be specified on elements that
do not have an itemscope attribute specified.
The itemref attribute is not part of the
microdata data model. It is merely a syntactic construct to aid authors in adding annotations to
pages where the data to be annotated does not follow a convenient tree structure. For example, it
allows authors to mark up data in a table so that each column defines a separate
item, while keeping the properties in the cells.
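This example shows a simple vocabulary used to describe the products of a model railway manufacturer. The vocabulary has just five property names: product-code, name, scale, digital, and track-type.
This vocabulary has four defined item types: http://md.example.com/loco, http://md.example.com/passengers, http://md.example.com/track, and http://md.example.com/lighting.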
Each item that uses this vocabulary can be given one or more of these types, depending on what the product is.
Thus, a locomotive might be marked up as:
<dl itemscope itemtype="http://md.example.com/loco
http://md.example.com/lighting">
<dt>Name:
<dd itemprop="name">Tank Locomotive (DB 80)
<dt>Product code:
<dd itemprop="product-code">33041
<dt>Scale:
<dd itemprop="scale">HO
<dt>Digital:
<dd itemprop="digital">Delta
</dl>
A turnout lantern retrofit kit might be marked up as:
<dl itemscope itemtype="http://md.example.com/track
http://md.example.com/lighting">
<dt>Name:
<dd itemprop="name">Turnout Lantern Kit
<dt>Product code:
<dd itemprop="product-code">74470
<dt>Purpose:
<dd>For retrofitting 2 <span itemprop="track-type">C</span> Track
turnouts. <meta itemprop="scale" content="HO">
</dl>
A passenger car with no lighting might be marked up as:
<dl itemscope itemtype="http://md.example.com/passengers">
<dt>Name:
<dd itemprop="name">Express Train Passenger Car (DB Am 203)
<dt>Product code:
<dd itemprop="product-code">8710
<dt>Scale:
<dd itemprop="scale">Z
</dl>
Great care is necessary when creating new vocabularies. Often, a hierarchical approach to types can be taken that results in a vocabulary where each item only ever has a single type, which is generally much simpler to manage.
Every HTML element may have an
itemprop attribute specified, if doing so
adds one or more properties to one or more
items (as defined below).
The itemprop attribute, if specified,
must have a value that is an
unordered set of unique space-separated tokens that are
case-sensitive, representing the names of the name-value pairs that it adds. The
attribute's value must have at least one token.
Each token must be either:
Specifications that introduce defined property names must ensure all such property names contain no "." (U+002E) characters, no ":" (U+003A) characters, and no space characters (defined in [[!HTML52]] as U+0020, U+0009, U+000A, U+000C, and U+000D).
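A vocabulary author could check candidate defined property names against this restriction with something like the sketch below. The function name is hypothetical, and rejecting the empty string is an added assumption (an empty name could never appear as a token anyway).

```python
# Characters that a defined property name must not contain:
# "." (U+002E), ":" (U+003A), and the five HTML space characters.
FORBIDDEN_CHARACTERS = set(".:" + " \t\n\f\r")


def is_valid_defined_property_name(name):
    """Return True if name is non-empty and contains no forbidden character."""
    return bool(name) and not any(ch in FORBIDDEN_CHARACTERS for ch in name)
```

For instance, "product-code" passes, while "dc.title" and "foaf:name" do not.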
When an element with an itemprop attribute
adds a property
to multiple items,
the requirement above regarding the tokens applies for each item
individually.
For the following code:
<div itemscope itemtype="http://example.com/a" itemref="x"></div>
<div itemscope itemtype="http://example.com/b" itemref="x"></div>
<meta id="x" itemprop="z" content="">
The author should be certain that z is valid for both the http://example.com/a and http://example.com/b vocabularies.
The property names of an element are the tokens that the element's
itemprop attribute is found to contain
when its value is split on spaces,
with the order preserved but with duplicates removed leaving only the first occurrence of each name.
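The computation of an element's property names can be sketched as below; the function name property_names is hypothetical and the itemprop value is assumed to be given as a string.

```python
import re


def property_names(itemprop_value):
    """Split on HTML space characters, keeping the first occurrence of each name."""
    names = []
    for token in re.split(r"[ \t\n\f\r]+", itemprop_value):
        # Skip empty strings produced by leading or trailing spaces,
        # and drop any name that has already been seen.
        if token and token not in names:
            names.append(token)
    return names
```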
Within an item, the properties are unordered with respect to each other, except for properties with the same name, which are ordered in the order they are given by the algorithm that defines the properties of an item.
In the following example, the "a" property has the values "1" and "2", in that order, but whether the "a" property comes before the "b" property or not is not important:
<div itemscope> <p itemprop="a">1</p> <p itemprop="a">2</p> <p itemprop="b">test</p> </div>
Thus, the following is equivalent:
<div itemscope> <p itemprop="b">test</p> <p itemprop="a">1</p> <p itemprop="a">2</p> </div>
As is the following:
<div itemscope> <p itemprop="a">1</p> <p itemprop="b">test</p> <p itemprop="a">2</p> </div>
And the following:
<div id="x">
<p itemprop="a">1</p>
</div>
<div itemscope itemref="x">
<p itemprop="b">test</p>
<p itemprop="a">2</p>
</div>
The property value of a name-value pair added by an
element with an itemprop attribute
is as given for the first matching case in the following list:
If the element also has an itemscope attribute: The value is the item created by the element.
If the element is a meta element: The value is the value of the element's content attribute, if any, or the empty string if there is no such attribute.
If the element is an audio, embed, iframe, img, source, track, or video element: The value is the absolute URL that results from resolving the value of the element's src attribute relative to the element at the time the attribute is set, or the empty string if there is no such attribute or if resolving it results in an error.
If the element is an a, area, or link element: The value is the absolute URL that results from resolving the value of the element's href attribute relative to the element at the time the attribute is set, or the empty string if there is no such attribute or if resolving it results in an error.
If the element is an object element: The value is the absolute URL that results from resolving the value of the element's data attribute relative to the element at the time the attribute is set, or the empty string if there is no such attribute or if resolving it results in an error.
If the element is a data element: The value is the value of the element's value attribute, if it has one, or the empty string otherwise.
If the element is a meter element: The value is the value of the element's value attribute, if it has one, or the empty string otherwise.
If the element is a time element: The value is the element's datetime value.
Otherwise: The value is the element's textContent.
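The "first matching case" structure of the list above maps naturally onto a chain of checks. The following is a sketch under several assumptions: Element is a hypothetical stand-in for a DOM element, base_url stands for the element's base URL, URL resolution uses urllib.parse.urljoin (RFC 3986 behaviour, not the URL specification), and the datetime value of a time element is taken to be its datetime attribute if present, or its text content otherwise, as HTML defines it.

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin


@dataclass
class Element:
    """Hypothetical minimal element: tag name, attributes, and text content."""
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""


SRC_ELEMENTS = {"audio", "embed", "iframe", "img", "source", "track", "video"}
HREF_ELEMENTS = {"a", "area", "link"}


def resolve(value, base_url):
    """Resolve a URL attribute value; empty string if missing or on error."""
    if value is None:
        return ""
    try:
        return urljoin(base_url, value)
    except ValueError:
        return ""


def property_value(element, base_url):
    """Return the property value for the first matching case."""
    if "itemscope" in element.attrs:
        return element  # the value is the item created by the element
    if element.tag == "meta":
        return element.attrs.get("content", "")
    if element.tag in SRC_ELEMENTS:
        return resolve(element.attrs.get("src"), base_url)
    if element.tag in HREF_ELEMENTS:
        return resolve(element.attrs.get("href"), base_url)
    if element.tag == "object":
        return resolve(element.attrs.get("data"), base_url)
    if element.tag in ("data", "meter"):
        return element.attrs.get("value", "")
    if element.tag == "time":
        return element.attrs.get("datetime", element.text)
    return element.text
```

Because the itemscope check comes first, an element such as an a element with both itemscope and itemprop yields the nested item rather than its href.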
The URL property elements are the a, area,
audio, embed, iframe, img, link,
object, source, track, and video elements.
If a property's value, as defined by the property's definition, is an absolute URL, the property must be specified using a URL property element.
These requirements do not apply just because a property value happens to match the syntax for a URL. They only apply if the property is explicitly defined as taking such a value.
For example, a book about the first moon landing could be
called "mission:moon". A "title" property from a vocabulary that defines a title as being a string
would not expect the title to be given in an a element, even though it looks like a
URL. On the other hand, if there was a (rather narrowly scoped!) vocabulary for
"books whose titles look like URLs" which had a "title" property defined to take a URL, then the
property would expect the title to be given in an a element (or one of the
other URL property elements), because of the requirement above.
To find the properties of an item defined by the element root, the user agent must run the following steps. These steps are also used to flag microdata errors.
Let results, memory, and pending be empty lists of elements.
Add the element root to memory.
Add the child elements of root, if any, to pending.
If root has an itemref attribute,
split the value of that itemref attribute on spaces.
For each resulting token ID, if there is an element in the home subtree
of root whose ID is ID, then add
the first such element to pending.
Loop: If pending is empty, jump to the step labeled end of loop.
Remove an element from pending and let current be that element.
If current is already in memory, there is a microdata error; return to the step labeled loop.
Add current to memory.
If current does not have an
itemscope attribute, then:
add all the child elements of current to pending.
If current has an
itemprop
attribute specified and has one or more property names, then add
current to results.
Return to the step labeled loop.
End of loop: Sort results in tree order.
Return results.
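The steps above can be sketched in Python as follows. The Node class is a hypothetical stand-in for a DOM element, subtree_root stands for the root of the home subtree, and the function returns a count of microdata errors alongside the results, which is an illustrative choice rather than something the algorithm requires.

```python
import re


class Node:
    """Hypothetical minimal element: tag, attributes, children, text."""

    def __init__(self, tag, attrs=None, children=None, text=""):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.children = list(children or [])
        self.text = text


def split_on_spaces(value):
    return [t for t in re.split(r"[ \t\n\f\r]+", value) if t]


def preorder(node):
    """Yield the nodes of a subtree in tree order."""
    yield node
    for child in node.children:
        yield from preorder(child)


def find_properties(root, subtree_root):
    """Return (results, error_count) for the item defined by root."""
    nodes = list(preorder(subtree_root))
    position = {id(n): i for i, n in enumerate(nodes)}
    first_with_id = {}
    for n in nodes:
        if "id" in n.attrs:
            first_with_id.setdefault(n.attrs["id"], n)  # first such element wins

    results, memory, errors = [], [root], 0
    pending = list(root.children)
    for token in split_on_spaces(root.attrs.get("itemref", "")):
        if token in first_with_id:
            pending.append(first_with_id[token])

    while pending:
        current = pending.pop()
        if any(current is seen for seen in memory):
            errors += 1  # microdata error: element reached twice
            continue
        memory.append(current)
        if "itemscope" not in current.attrs:
            pending.extend(current.children)  # do not descend into nested items
        if split_on_spaces(current.attrs.get("itemprop", "")):
            results.append(current)

    results.sort(key=lambda n: position[id(n)])  # tree order
    return results, errors
```

The order in which elements are removed from pending does not matter, because the results are sorted into tree order at the end.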
A document must not contain any items for which the algorithm to find the properties of an item finds any microdata error.
An item is a
top-level microdata item
if its element does not have an itemprop
attribute.
All itemref attributes in a Document must be
such that there are no cycles in the graph formed from representing each item in the Document as a node in the graph and each
property of an item whose value is another item as an edge in the graph connecting
those two items.
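A conformance checker might test this acyclicity requirement with an ordinary depth-first search. In the sketch below, the graph is assumed to be given as a dictionary mapping each item (identified here by an arbitrary hashable label) to the items that appear as its property values; both the representation and the function name are illustrative.

```python
def has_cycle(edges):
    """Return True if the directed item graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}

    def visit(node):
        state[node] = GREY
        for target in edges.get(node, ()):
            target_state = state.get(target, WHITE)
            if target_state == GREY:
                return True  # back edge: target is on the current path
            if target_state == WHITE and visit(target):
                return True
        state[node] = BLACK
        return False

    return any(state.get(node, WHITE) == WHITE and visit(node) for node in edges)
```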
A document must not contain any elements that have an
itemprop attribute
that would not be found to be a property of any of the items
in that document were their properties
all to be determined.
In this example, a single license statement is applied to two works, using
itemref from the items representing the works:
<!DOCTYPE HTML>
<html>
<head>
<title>Photo gallery</title>
</head>
<body>
<h1>My photos</h1>
<figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
<img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
<figcaption itemprop="title">The house I found.</figcaption>
</figure>
<figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
<img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
<figcaption itemprop="title">The mailbox.</figcaption>
</figure>
<footer>
<p id="licenses">All images licensed under the <a itemprop="license" href="http://www.opensource.org/licenses/mit-license.php">MIT license</a>.</p>
</footer>
</body>
</html>
The above results in two items with the type "http://n.whatwg.org/work",
one with:
work: images/house.jpeg
title: The house I found.
license: http://www.opensource.org/licenses/mit-license.php
...and one with:
work: images/mailbox.jpeg
title: The mailbox.
license: http://www.opensource.org/licenses/mit-license.php
Currently, the itemscope,
itemprop,
and other microdata attributes are only defined for HTML elements.
This means that attributes with the literal names "itemscope", "itemprop", etc,
do not cause microdata processing to occur on elements in other namespaces, such as SVG.
Thus, in the following example there is only one item, not two.
<p itemscope></p> <!-- this is an item (with no properties and no type) -->
<svg itemscope></svg> <!-- this is not, it's just an svg element with an invalid unknown attribute -->
Given a list of nodes nodes in a Document, a user agent must
run the following algorithm to extract the microdata from those nodes
into a JSON form:
Let result be an empty object.
Let items be an empty array.
For each node in nodes, check if the element is a top-level microdata item, and if it is then get the object for that element and add it to items.
Add an entry to result called "items" whose
value is the array items.
Return the result of serializing result to JSON in the shortest
possible way (meaning no whitespace between tokens, no unnecessary zero digits in numbers, and
only using Unicode escapes in strings for characters that do not have a dedicated escape
sequence), and with a lowercase "e" used, when appropriate, in the
representation of any numbers. [[JSON]]
This algorithm returns an object with a single property that is an array, instead of just returning an array, so that it is possible to extend the algorithm in the future if necessary.
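The final wrapping and serialization step might look like the sketch below, assuming the per-item objects have already been built as plain dictionaries. The function name is hypothetical. Python's json.dumps with compact separators and ensure_ascii=False is used as an approximation of "the shortest possible way"; it is believed to use Unicode escapes only for control characters without a dedicated escape, but it has not been checked against every edge case of the requirement.

```python
import json


def serialize_microdata(items):
    """Wrap the item objects in {"items": [...]} and serialize compactly."""
    result = {"items": list(items)}
    # separators=(",", ":") removes whitespace between tokens;
    # ensure_ascii=False leaves non-ASCII characters unescaped.
    return json.dumps(result, separators=(",", ":"), ensure_ascii=False)
```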
When the user agent is to get the object for an item item, potentially together with a list of elements memory, it must run the following substeps:
Let result be an empty object.
If no memory was passed to the algorithm, let memory be an empty list.
Add item to memory.
If the item has any item types, add an entry to result
called "type" whose value is an array listing the
item types of item, in the order they were specified on the
itemtype attribute.
If the item has a global identifier, add an entry to
result called "id" whose value is the global
identifier of item.
Let properties be an empty object.
For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of an item, run the following substeps:
Let value be the property value of element.
If value is an item, then:
If value is in memory, then let value be
the string "ERROR". Otherwise, get the object for
value, passing a copy of memory, and then replace value
with the object returned from those steps.
For each name name in element's property names, run the following substeps:
If there is no entry named name in properties, then add an entry named name to properties whose value is an empty array.
Append value to the entry named name in properties.
Add an entry to result called "properties" whose
value is the object properties.
Return result.
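The substeps above can be sketched as a recursive function. The Item class is a hypothetical model in which the properties of the item have already been found: props is a list of (names, value) pairs in the order given by the algorithm that returns the properties of an item, where value is either a string or another Item.

```python
class Item:
    """Hypothetical pre-resolved item: types, optional id, ordered properties."""

    def __init__(self, types=(), global_id=None):
        self.types = list(types)
        self.global_id = global_id
        self.props = []  # list of (names, value); value is a str or an Item


def get_object(item, memory=None):
    """Build the JSON-ready object for item, guarding against cycles."""
    result = {}
    memory = [] if memory is None else memory
    memory.append(item)
    if item.types:
        result["type"] = list(item.types)
    if item.global_id is not None:
        result["id"] = item.global_id
    properties = {}
    for names, value in item.props:
        if isinstance(value, Item):
            if any(value is seen for seen in memory):
                value = "ERROR"  # the item is already being processed
            else:
                value = get_object(value, list(memory))  # pass a copy of memory
        for name in names:
            properties.setdefault(name, []).append(value)
    result["properties"] = properties
    return result
```

Passing a copy of memory means that only items on the current nesting path trigger "ERROR"; the same item may still appear as a value in two sibling properties.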
For example, take this markup:
<!DOCTYPE HTML>
<title>My Blog</title>
<article itemscope itemtype="http://schema.org/BlogPosting">
<header>
<h1 itemprop="headline">Progress report</h1>
<p><time itemprop="datePublished" datetime="2013-08-29">today</time></p>
<link itemprop="url" href="?comments=0">
</header>
<p>All in all, he's doing well with his swim lessons. The biggest thing was he had trouble
putting his head in, but we got it down.</p>
<section>
<h1>Comments</h1>
<article itemprop="comment" itemscope itemtype="http://schema.org/UserComments" id="c1">
<link itemprop="url" href="#c1">
<footer>
<p>Posted by: <span itemprop="creator" itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Greg</span>
</span></p>
<p><time itemprop="commentTime" datetime="2013-08-29">15 minutes ago</time></p>
</footer>
<p>Ha!</p>
</article>
<article itemprop="comment" itemscope itemtype="http://schema.org/UserComments" id="c2">
<link itemprop="url" href="#c2">
<footer>
<p>Posted by: <span itemprop="creator" itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Charlotte</span>
</span></p>
<p><time itemprop="commentTime" datetime="2013-08-29">5 minutes ago</time></p>
</footer>
<p>When you say "we got it down"...</p>
</article>
</section>
</article>
It would be turned into the following JSON by the algorithm above (supposing that the page's
URL was http://blog.example.com/progress-report):
{
"items": [
{
"type": [ "http://schema.org/BlogPosting" ],
"properties": {
"headline": [ "Progress report" ],
"datePublished": [ "2013-08-29" ],
"url": [ "http://blog.example.com/progress-report?comments=0" ],
"comment": [
{
"type": [ "http://schema.org/UserComments" ],
"properties": {
"url": [ "http://blog.example.com/progress-report#c1" ],
"creator": [
{
"type": [ "http://schema.org/Person" ],
"properties": {
"name": [ "Greg" ]
}
}
],
"commentTime": [ "2013-08-29" ]
}
},
{
"type": [ "http://schema.org/UserComments" ],
"properties": {
"url": [ "http://blog.example.com/progress-report#c2" ],
"creator": [
{
"type": [ "http://schema.org/Person" ],
"properties": {
"name": [ "Charlotte" ]
}
}
],
"commentTime": [ "2013-08-29" ]
}
}
]
}
}
]
}
If the itemprop attribute is
present on link or meta, they are
flow content and phrasing content. The
link and meta elements may be used where
phrasing content is expected if the
itemprop attribute is present.
If a link element has an
itemprop
attribute, the rel attribute may be omitted.
If a meta element has an itemprop
attribute, the name, http-equiv, and
charset attributes must be omitted, and the
content attribute must be present.
If the itemprop is specified
on an a or area element, then the href attribute must also be
specified.
If the itemprop is specified
on an iframe element, then the src attribute must also be
specified.
If the itemprop is specified
on an embed element, then the src attribute must also be
specified.
If the itemprop is specified
on an object element, then the data attribute must also be
specified.
If the itemprop is specified
on a media element, then the src attribute must also be
specified.
The drag-and-drop initialization steps are:
The user agent must take the list of dragged nodes
and extract the microdata from those
nodes into a JSON form, and then must add the resulting
string to the dataTransfer member,
associated with the application/microdata+json format.
This section is not normative
Machine-readable data may be presented to users, for example by search engines. Identifying content that should, or should not, be translated, would be helpful but currently microdata strips markup. It is possible to use XMLLiterals in RDFa to ensure that markup is kept.
Vocabulary design is difficult. Different languages and cultures view ambiguity differently: two terms with different meanings in one situation may be most naturally translated by a single term that has both meanings, or a single term may have two natural translations. When developing for localisation, it is important to provide sufficient contextual information about terms in a vocabulary to enable accurate translation.
This section is not normative
Microdata does not generally interact with personally identifying information, being a static document format. Microdata can identify information more clearly, and can thus record personally identifying information more explicitly; however, this is not a new possibility and can be achieved just as easily without microdata.
This section is not normative
Microdata does not generally interact with browsers, being a static document format that lacks any DOM interface. Microdata makes information machine-readable, but does not automatically include provenance information for the statements it encodes. Processors of microdata should consider the trustworthiness of the sources they use, including the possibility that data is no longer accurate, and whether the connection over which the data was gathered is secure.
application/microdata+json
This registration is for community review and will be submitted to the IESG for review, approval, and registration with IANA.
The application/microdata+json type asserts that the
resource is a JSON text that consists of an object with a single
entry called "items" consisting of an array
of entries, each of which consists of an object with an entry
called "id" whose value is a string, an
entry called "type" whose value is an array
of strings, and an entry called "properties"
whose value is an object whose entries each have a value
consisting of an array of either objects or strings, the objects
being of the same form as the objects in the aforementioned "items" entry.
Thus, the relevant specifications
are the JSON specification and this specification. [[JSON]]
Applications that transfer data intended for use with HTML's microdata feature, especially in the context of drag-and-drop, are the primary application class for this type.
Fragment identifiers used with
application/microdata+json resources have the same
semantics as when used with application/json (namely,
at the time of writing, no semantics at all). [[JSON]]
Changes made between the First Public Working Draft and the 23 October 2013 W3C Note
The original specification for Microdata was developed by Ian Hickson. Uptake has been substantially driven by its use for the schema.org vocabulary.
The current editors would like to thank the following people for direct contributions to their work:
Gregg Kellogg, Ivan Herman, Tab Atkins, Xiaoqian Wu.