Living Standard — Last Updated 9 June 2025
This section only applies to user agents, data mining tools, and conformance checkers.
The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XML syntax".
User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser.
While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.
As stated in the terminology section, references to element types that do not explicitly specify a namespace always refer to elements in the HTML namespace. For example, if the spec talks about "a menu element", then that is an element with the local name "menu", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLMenuElement. Where possible, references to such elements are hyperlinked to their definition.
The input to the HTML parsing process consists of a stream of code points, which is passed through a tokenization stage followed by a tree construction stage. The output is a Document object.
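The two-stage shape described above can be sketched roughly as follows. This is a toy illustration only, not the tokenizer or tree builder this specification defines: the function names `tokenize` and `buildTree` and the node shapes are assumptions made for the example, and the sketch understands nothing beyond plain start tags, end tags, and text (attributes, comments, and stray "<" characters are not handled).

```javascript
// Stage 1 (toy tokenizer): turn the input into start tag, end tag, and text tokens.
function* tokenize(input) {
  const pattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)/g;
  for (const match of input.matchAll(pattern)) {
    if (match[3] !== undefined) {
      yield { type: "text", data: match[3] };
    } else if (match[1] === "/") {
      yield { type: "endTag", name: match[2].toLowerCase() };
    } else {
      yield { type: "startTag", name: match[2].toLowerCase() };
    }
  }
}

// Stage 2 (toy tree construction): consume tokens and build a tree of plain objects.
function buildTree(tokens) {
  const documentNode = { name: "#document", children: [] };
  const openElements = [documentNode]; // stack of currently open nodes
  for (const token of tokens) {
    const current = openElements[openElements.length - 1];
    if (token.type === "startTag") {
      const element = { name: token.name, children: [] };
      current.children.push(element);
      openElements.push(element);
    } else if (token.type === "endTag") {
      // Pop back to the nearest open element with this name, if there is one.
      const index = openElements.map((node) => node.name).lastIndexOf(token.name);
      if (index > 0) openElements.length = index;
    } else {
      current.children.push({ name: "#text", data: token.data });
    }
  }
  return documentNode;
}

const tree = buildTree(tokenize("<p>Hello <b>world</b></p>"));
console.log(JSON.stringify(tree, null, 2));
```

The tokenizer is written as a generator so that the tree builder pulls tokens one at a time, which loosely mirrors how tokens are handed from one stage to the other.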
Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.
In the common case, the data handled by the tokenization stage comes from the network, but it can also come from script running in the user agent, e.g. using the document.write() API.
There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.
In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" end tag token:
...
<script>
 document.write('<p>');
</script>
...
To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false.
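The reentrancy described above can be illustrated with a small model. This is only a sketch under simplifying assumptions: `ToyParser`, `run`, `handleToken`, and `scriptBody` are hypothetical names, tokens are represented as ready-made strings, and the parser pause flag is declared but never exercised.

```javascript
class ToyParser {
  constructor(scriptBody) {
    this.scriptNestingLevel = 0; // initially zero
    this.parserPauseFlag = false; // initially false; unused in this sketch
    this.scriptBody = scriptBody; // hypothetical: what the script does when it runs
    this.log = [];
  }

  // Stand-in for the tokenizer: each string in `tokens` is treated as one token.
  run(tokens) {
    for (const token of tokens) this.handleToken(token);
  }

  // Stand-in for the tree construction stage.
  handleToken(token) {
    this.log.push(`start ${token} at nesting level ${this.scriptNestingLevel}`);
    if (token === "</script>") {
      this.scriptNestingLevel += 1;
      // Running the script may call write(), which resumes the tokenizer
      // before this end tag token has finished being handled.
      this.scriptBody((tokens) => this.run(tokens));
      this.scriptNestingLevel -= 1;
    }
    this.log.push(`finish ${token}`);
  }
}

// The script writes a "p" start tag, as in the example above.
const parser = new ToyParser((write) => write(["<p>"]));
parser.run(["<script>", "</script>"]);
console.log(parser.log.join("\n"));
```

In the resulting log, the entries for the "p" start tag sit between the start and finish entries for the "script" end tag, and the nesting level is back at zero once parsing completes.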
This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (it consists of the processing rules described throughout this specification). However, user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.
Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document.
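One way to read the reporting requirement above is as a small selection rule. The following is a minimal sketch, assuming a hypothetical checker that has already collected parse error codes into an array; `selectErrorsToReport` and its `reportAll` parameter are names invented for this illustration.

```javascript
// Given every parse error found in a document, decide which ones to report.
// - No errors found: nothing may be reported.
// - One or more found: at least one must be reported; reporting more is optional.
function selectErrorsToReport(parseErrors, reportAll = false) {
  if (parseErrors.length === 0) return [];
  return reportAll ? [...parseErrors] : [parseErrors[0]];
}

const found = ["duplicate-attribute", "eof-in-comment"];
console.log(selectErrorsToReport([]));
console.log(selectErrorsToReport(found));
console.log(selectErrorsToReport(found, true));
```

Reporting only the first error is the smallest behavior that satisfies the rule; a checker intended for authors would more plausibly pass `reportAll`.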
Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.
Some parse errors have dedicated codes outlined in the table below that should be used by conformance checkers in reports.
Error descriptions in the table below are non-normative.
Code | Description |
---|---|
abrupt-closing-of-empty-comment | This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (>) code point (i.e., |
abrupt-doctype-public-identifier | This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE public identifier (e.g., |
abrupt-doctype-system-identifier | This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE system identifier (e.g., |
absence-of-digits-in-numeric-character-reference | This error occurs if the parser encounters a numeric character reference that doesn't contain any digits (e.g., |
cdata-in-html-content | This error occurs if the parser encounters a CDATA section outside of foreign content (SVG or MathML). The parser treats such CDATA sections (including leading " |
character-reference-outside-unicode-range | This error occurs if the parser encounters a numeric character reference that references a code point that is greater than the valid Unicode range. The parser resolves such a character reference to a U+FFFD REPLACEMENT CHARACTER. |
control-character-in-input-stream | This error occurs if the input stream contains a control code point that is not ASCII whitespace or U+0000 NULL. Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM. |
control-character-reference | This error occurs if the parser encounters a numeric character reference that references a control code point that is not ASCII whitespace or is a U+000D CARRIAGE RETURN. The parser resolves such character references as-is except C1 control references that are replaced according to the numeric character reference end state. |
duplicate-attribute | This error occurs if the parser encounters an attribute in a tag that already has an attribute with the same name. The parser ignores all such duplicate occurrences of the attribute. |
end-tag-with-attributes | This error occurs if the parser encounters an end tag with attributes. Attributes in end tags are ignored and do not make their way into the DOM. |
end-tag-with-trailing-solidus | This error occurs if the parser encounters an end tag that has a U+002F (/) code point right before the closing U+003E (>) code point (e.g., |
eof-before-tag-name | This error occurs if the parser encounters the end of the input stream where a tag name is expected. In this case the parser treats the beginning of a start tag (i.e., |
eof-in-cdata | This error occurs if the parser encounters the end of the input stream in a CDATA section. The parser treats such CDATA sections as if they are closed immediately before the end of the input stream. |
eof-in-comment | This error occurs if the parser encounters the end of the input stream in a comment. The parser treats such comments as if they are closed immediately before the end of the input stream. |
eof-in-doctype | This error occurs if the parser encounters the end of the input stream in a DOCTYPE. In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the |
eof-in-script-html-comment-like-text | This error occurs if the parser encounters the end of the input stream in text that resembles an HTML comment inside |
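Two of the recovery behaviors listed in the table, for character-reference-outside-unicode-range and duplicate-attribute, are simple enough to sketch directly. The helper names below are hypothetical, and the sketch covers only these two rows; the other character reference cases in the table (such as control character references) are deliberately left out.

```javascript
const REPLACEMENT_CHARACTER = "\uFFFD";

// Resolve the numeric value of a character reference, handling only the
// "greater than the valid Unicode range" case from the table.
function resolveOutOfRange(codePoint, errors) {
  if (codePoint > 0x10ffff) {
    errors.push("character-reference-outside-unicode-range");
    return REPLACEMENT_CHARACTER;
  }
  return String.fromCodePoint(codePoint);
}

// Collect a tag's attributes from [name, value] pairs in source order.
// A later attribute with an already-seen name is ignored, so the first one wins.
function collectAttributes(pairs, errors) {
  const attributes = new Map();
  for (const [name, value] of pairs) {
    if (attributes.has(name)) {
      errors.push("duplicate-attribute");
      continue;
    }
    attributes.set(name, value);
  }
  return attributes;
}

const errors = [];
const resolved = resolveOutOfRange(0x110000, errors);
const attrs = collectAttributes([["id", "a"], ["class", "x"], ["id", "b"]], errors);
console.log(resolved === REPLACEMENT_CHARACTER, attrs.get("id"), errors);
```

Both helpers push an error code and then carry on, which matches the general pattern in the table: a parse error is noted, and parsing continues with a defined result.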
This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font family names that are then inserted into a CSS
For example, consider the following markup:
This will be parsed into: The
As another example, consider the following markup:
This will be parsed into: That is, the
For historical reasons, this algorithm does not round-trip an initial U+000A LINE FEED (LF) character in
For example, consider the following markup:
When this document is first parsed, the
Because of the special role of the
When creating a customized built-in element via the parser, a developer uses the
But when creating a customized built-in element via its constructor or via
To ensure that serialize-parse roundtrips still work, the serialization process explicitly writes out the element's
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
13.4 Parsing HTML fragments
The HTML fragment parsing algorithm, given an
Parts marked fragment case in algorithms in the HTML parser section are parts that only occur if the parser was created for the purposes of this algorithm. The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.