HTML

Living Standard — Last Updated 9 June 2025

13.2 Parsing HTML documents

This section only applies to user agents, data mining tools, and conformance checkers.

The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XML syntax".

User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser.

While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.

For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.

As stated in the terminology section, references to element types that do not explicitly specify a namespace always refer to elements in the HTML namespace. For example, if the spec talks about "a menu element", then that is an element with the local name "menu", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLMenuElement. Where possible, references to such elements are hyperlinked to their definition.

13.2.1 Overview of the parsing model

The input to the HTML parsing process consists of a stream of code points, which is passed through a tokenization stage followed by a tree construction stage. The output is a Document object.

Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.

In the common case, the data handled by the tokenization stage comes from the network, but it can also come from script running in the user agent, e.g. using the document.write() API.

There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.

In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" end tag token:

...
<script>
 document.write('<p>');
</script>
...

To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false.
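The reentrancy described above can be illustrated with a small model. This is only a sketch, not the parsing algorithm itself: the `ToyParser` class, its `handleToken()` and `write()` methods, and the token shapes are hypothetical names chosen for illustration, and `write()` accepts tokens that are already tokenized rather than markup. The parser pause flag is declared but not otherwise exercised.

```javascript
// Toy model of a reentrant tree construction stage (illustrative only).
class ToyParser {
  constructor() {
    this.scriptNestingLevel = 0;  // initially zero
    this.parserPauseFlag = false; // initially false; unused in this sketch
    this.log = [];
  }

  // Handle a single token. Handling a "script" end tag runs the script,
  // which can feed further tokens back in before this call returns.
  handleToken(token) {
    this.log.push(`begin ${token.type} ${token.name}`);
    if (token.type === "endTag" && token.name === "script" && token.run) {
      this.scriptNestingLevel += 1;
      token.run(this); // may call write(), re-entering handleToken()
      this.scriptNestingLevel -= 1;
    }
    this.log.push(`end ${token.type} ${token.name}`);
  }

  // Hypothetical stand-in for document.write().
  write(tokens) {
    for (const token of tokens) {
      this.handleToken(token);
    }
  }
}

const parser = new ToyParser();
parser.handleToken({
  type: "endTag",
  name: "script",
  run: (p) => p.write([{ type: "startTag", name: "p" }]),
});
console.log(parser.log.join("\n"));
```

In this model the "p" start tag token is both begun and finished while the "script" end tag token is still being handled, and the script nesting level is back at zero once the outer call returns.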

13.2.2 Parse errors

This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (it is given by the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document.

Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.

Some parse errors have dedicated codes outlined in the table below that should be used by conformance checkers in reports.

Error descriptions in the table below are non-normative.

Code	Description
abrupt-closing-of-empty-comment	This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (>) code point (i.e., `<!-->` or `<!--->`). The parser behaves as if the comment is closed correctly.
abrupt-doctype-public-identifier	This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE public identifier (e.g., `<!DOCTYPE html PUBLIC "foo>`). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the `Document` to quirks mode.
abrupt-doctype-system-identifier	This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE system identifier (e.g., `<!DOCTYPE html PUBLIC "foo" "bar>`). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the `Document` to quirks mode.
absence-of-digits-in-numeric-character-reference	This error occurs if the parser encounters a numeric character reference that doesn't contain any digits (e.g., `&#qux;`). In this case the parser doesn't resolve the character reference.
cdata-in-html-content	This error occurs if the parser encounters a CDATA section outside of foreign content (SVG or MathML). The parser treats such CDATA sections (including leading "`[CDATA[`" and trailing "`]]`" strings) as comments.
character-reference-outside-unicode-range	This error occurs if the parser encounters a numeric character reference that references a code point that is greater than the valid Unicode range. The parser resolves such a character reference to a U+FFFD REPLACEMENT CHARACTER.
control-character-in-input-stream	This error occurs if the input stream contains a control code point that is not ASCII whitespace or U+0000 NULL. Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM.
control-character-reference	This error occurs if the parser encounters a numeric character reference that references a control code point that is not ASCII whitespace or is a U+000D CARRIAGE RETURN. The parser resolves such character references as-is except C1 control references that are replaced according to the numeric character reference end state.
duplicate-attribute	This error occurs if the parser encounters an attribute in a tag that already has an attribute with the same name. The parser ignores all such duplicate occurrences of the attribute.
end-tag-with-attributes	This error occurs if the parser encounters an end tag with attributes. Attributes in end tags are ignored and do not make their way into the DOM.
end-tag-with-trailing-solidus	This error occurs if the parser encounters an end tag that has a U+002F (/) code point right before the closing U+003E (>) code point (e.g., `</div/>`). Such a tag is treated as a regular end tag.
eof-before-tag-name	This error occurs if the parser encounters the end of the input stream where a tag name is expected. In this case the parser treats the beginning of a start tag (i.e., `<`) or an end tag (i.e., `</`) as text content.
eof-in-cdata	This error occurs if the parser encounters the end of the input stream in a CDATA section. The parser treats such CDATA sections as if they are closed immediately before the end of the input stream.
eof-in-comment	This error occurs if the parser encounters the end of the input stream in a comment. The parser treats such comments as if they are closed immediately before the end of the input stream.
eof-in-doctype	This error occurs if the parser encounters the end of the input stream in a DOCTYPE. In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the `Document` to quirks mode.
eof-in-script-html-comment-like-text	This error occurs if the parser encounters the end of the input stream in text that resembles an HTML comment inside `script` element content (e.g., `<script><!-- foo`).

[...] or having a `p` element that contains a `ul` element (as the `ul` element's start tag would imply the end tag for the `p`).

This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font family names that are then inserted into a CSS `style` block via the DOM and which then uses the `innerHTML` IDL attribute to get the HTML serialization of that `style` element: if the user enters "`</style><script>attack</script>`" as a font family name, `innerHTML` will return markup that, if parsed in a different context, would contain a `script` node, even though no `script` node existed in the original DOM.

For example, consider the following markup:

<form id="outer"><div></form><form id="inner"><input>

This will be parsed into:

html
  head
  body
    form id="outer"
      div
        form id="inner"
          input

The `input` element will be associated with the inner `form` element. Now, if this tree structure is serialized and reparsed, the `<form id="inner">` start tag will be ignored, and so the `input` element will be associated with the outer `form` element instead.

<html><head></head><body><form id="outer"><div><form id="inner"><input></form></div></form></body></html>

html
  head
  body
    form id="outer"
      div
        input

As another example, consider the following markup:

<a><table><a>

This will be parsed into:

html
  head
  body
    a
      a
      table

That is, the `a` elements are nested, because the second `a` element is foster parented. After a serialize-reparse roundtrip, the `a` elements and the `table` element would all be siblings, because the second `<a>` start tag implicitly closes the first `a` element.

<html><head></head><body><a><a></a><table></table></a></body></html>

html
  head
  body
    a
    a
    table

For historical reasons, this algorithm does not round-trip an initial U+000A LINE FEED (LF) character in `pre`, `textarea`, or `listing` elements, even though (in the first two cases) the markup being round-tripped can be conforming. The HTML parser will drop such a character during parsing, but this algorithm does not serialize an extra U+000A LINE FEED (LF) character.

For example, consider the following markup:

<pre>

Hello.</pre>

When this document is first parsed, the `pre` element's child text content starts with a single newline character. After a serialize-reparse roundtrip, the `pre` element's child text content is simply "`Hello.`".

Because of the special role of the `is` attribute in signaling the creation of customized built-in elements, in that it provides a mechanism for parsed HTML to set the element's `is` value, we special-case its handling during serialization. This ensures that an element's `is` value is preserved through serialize-parse roundtrips.

When creating a customized built-in element via the parser, a developer uses the `is` attribute directly; in such cases serialize-parse roundtrips work fine.

<script>
window.SuperP = class extends HTMLParagraphElement {};
customElements.define("super-p", SuperP, { extends: "p" });
</script>

<div id="container"><p is="super-p">Superb!</p></div>

<script>
console.log(container.innerHTML); // <p is="super-p">
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">
console.assert(container.firstChild instanceof SuperP);
</script>

But when creating a customized built-in element via its constructor or via `createElement()`, the `is` attribute is not added. Instead, the `is` value (which is what the custom elements machinery uses) is set without intermediating through an attribute.

<script>
container.innerHTML = "";
const p = document.createElement("p", { is: "super-p" });
container.appendChild(p);

// The is attribute is not present in the DOM:
console.assert(!p.hasAttribute("is"));

// But the element is still a super-p:
console.assert(p instanceof SuperP);
</script>

To ensure that serialize-parse roundtrips still work, the serialization process explicitly writes out the element's `is` value as an `is` attribute:

<script>
console.log(container.innerHTML); // <p is="super-p">
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">
console.assert(container.firstChild instanceof SuperP);
</script>

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

1. Replace any occurrence of the "`&`" character by the string "`&amp;`".
2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "`&nbsp;`".
3. Replace any occurrences of the "`<`" character by the string "`&lt;`".
4. Replace any occurrences of the "`>`" character by the string "`&gt;`".
5. If the algorithm was invoked in the attribute mode, then replace any occurrences of the "`"`" character by the string "`&quot;`".

13.4 Parsing HTML fragments

The HTML fragment parsing algorithm, given an `Element` node `context`, string `input`, and an optional boolean `allowDeclarativeShadowRoots` (default false) is the following steps. They return a list of zero or more nodes.

Parts marked fragment case in algorithms in the HTML parser section are parts that only occur if the parser was created for the purposes of this algorithm. The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.

1. Let `document` be a `Document` node whose type is "`html`".
2. If `context`'s node document is in quirks mode, then set `document`'s mode to "`quirks`". Otherwise, if `context`'s node document is in limited-quirks mode, then set `document`'s mode to "`limited-quirks`".
3. If `allowDeclarativeShadowRoots` is true, then set `document`'s allow declarative shadow roots to true.
4. Create a new HTML parser, and associate it with `document`.
5. Set the state of the HTML parser's tokenization stage as follows, switching on the `context` element:

   `title`, `textarea`: Switch the tokenizer to the RCDATA state.
   `style`, `xmp`, `iframe`, `noembed`, `noframes`: Switch the tokenizer to the RAWTEXT state.
   `script`: Switch the tokenizer to the script data state.
   `noscript`: If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
   `plaintext`: Switch the tokenizer to the PLAINTEXT state.
   Any other element: Leave the tokenizer in the data state.

   For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.
6. Let `root` be the result of creating an element given `document`, "`html`", the HTML namespace, null, null, false, and `context`'s custom element registry.
7. Append `root` to `document`.
8. Set up the HTML parser's stack of open elements so that it contains just the single element `root`.
9. If `context` is a `template` element, then push "in template" onto the stack of template insertion modes so that it is the new current template insertion mode.
10. Create a start tag token whose name is the local name of `context` and whose attributes are the attributes of `context`. Let this start tag token be the start tag token of `context`; e.g. for the purposes of determining if it is an HTML integration point.
11. Reset the parser's insertion mode appropriately. The parser will reference the `context` element as part of that algorithm.
12. Set the HTML parser's `form` element pointer to the nearest node to `context` that is a `form` element (going straight up the ancestor chain, and including the element itself, if it is a `form` element), if any. (If there is no such `form` element, the `form` element pointer keeps its initial value, null.)
13. Place the `input` into the input stream for the HTML parser just created. The encoding confidence is irrelevant.
14. Start the HTML parser and let it run until it has consumed all the characters just inserted into the input stream.
15. Return `root`'s children, in tree order.
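The tokenizer state selection in the fragment parsing steps above amounts to a lookup on the context element's local name. The following is a minimal sketch of that switch only, assuming the context element is in the HTML namespace; the function name `initialTokenizerState` and the use of plain strings for state names are hypothetical choices for illustration, not part of any API.

```javascript
// Context element local names grouped by the tokenizer state they select.
const RCDATA_ELEMENTS = new Set(["title", "textarea"]);
const RAWTEXT_ELEMENTS = new Set(["style", "xmp", "iframe", "noembed", "noframes"]);

// Returns the name of the initial tokenizer state for a fragment parser,
// given the context element's local name and the scripting flag.
function initialTokenizerState(contextLocalName, scriptingEnabled) {
  if (RCDATA_ELEMENTS.has(contextLocalName)) return "RCDATA";
  if (RAWTEXT_ELEMENTS.has(contextLocalName)) return "RAWTEXT";
  if (contextLocalName === "script") return "script data";
  if (contextLocalName === "noscript") {
    // RAWTEXT only when the scripting flag is enabled.
    return scriptingEnabled ? "RAWTEXT" : "data";
  }
  if (contextLocalName === "plaintext") return "PLAINTEXT";
  return "data"; // any other element
}

console.log(initialTokenizerState("textarea", true));
console.log(initialTokenizerState("noscript", false));
```

Only the `noscript` case depends on the scripting flag; every other context element maps to a fixed state.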

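The string-escaping steps given earlier for serialization can likewise be sketched as a small function. This is an illustration under stated assumptions rather than a reference implementation: the name `escapeString` and its `attributeMode` parameter are hypothetical, and the replacement strings are assumed to be the named character references `&amp;`, `&nbsp;`, `&lt;`, `&gt;`, and `&quot;`.

```javascript
// Escapes a string following the order of the steps in the text:
// "&" first, so that ampersands introduced by later replacements
// are not escaped a second time.
function escapeString(text, attributeMode = false) {
  let result = text
    .replaceAll("&", "&amp;")
    .replaceAll("\u00A0", "&nbsp;") // U+00A0 NO-BREAK SPACE
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;");
  if (attributeMode) {
    // Quotation marks are only escaped in the attribute mode.
    result = result.replaceAll('"', "&quot;");
  }
  return result;
}

console.log(escapeString("a & b < c"));
console.log(escapeString('say "hi"', true));
```

The order matters: if the "&" replacement ran last, the ampersands produced by the other replacements would themselves be escaped.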