1. 13.2 Parsing HTML documents
      1. 13.2.1 Overview of the parsing model
      2. 13.2.2 Parse errors
      3. 13.2.3 The input byte stream
        1. 13.2.3.1 Parsing with a known character encoding
        2. 13.2.3.2 Determining the character encoding
        3. 13.2.3.3 Character encodings
        4. 13.2.3.4 Changing the encoding while parsing
        5. 13.2.3.5 Preprocessing the input stream
      4. 13.2.4 Parse state
        1. 13.2.4.1 The insertion mode
        2. 13.2.4.2 The stack of open elements
        3. 13.2.4.3 The list of active formatting elements
        4. 13.2.4.4 The element pointers
        5. 13.2.4.5 Other parsing state flags
      5. 13.2.5 Tokenization
        1. 13.2.5.1 Data state
        2. 13.2.5.2 RCDATA state
        3. 13.2.5.3 RAWTEXT state
        4. 13.2.5.4 Script data state
        5. 13.2.5.5 PLAINTEXT state
        6. 13.2.5.6 Tag open state
        7. 13.2.5.7 End tag open state
        8. 13.2.5.8 Tag name state
        9. 13.2.5.9 RCDATA less-than sign state
        10. 13.2.5.10 RCDATA end tag open state
        11. 13.2.5.11 RCDATA end tag name state
        12. 13.2.5.12 RAWTEXT less-than sign state
        13. 13.2.5.13 RAWTEXT end tag open state
        14. 13.2.5.14 RAWTEXT end tag name state
        15. 13.2.5.15 Script data less-than sign state
        16. 13.2.5.16 Script data end tag open state
        17. 13.2.5.17 Script data end tag name state
        18. 13.2.5.18 Script data escape start state
        19. 13.2.5.19 Script data escape start dash state
        20. 13.2.5.20 Script data escaped state
        21. 13.2.5.21 Script data escaped dash state
        22. 13.2.5.22 Script data escaped dash dash state
        23. 13.2.5.23 Script data escaped less-than sign state
        24. 13.2.5.24 Script data escaped end tag open state
        25. 13.2.5.25 Script data escaped end tag name state
        26. 13.2.5.26 Script data double escape start state
        27. 13.2.5.27 Script data double escaped state
        28. 13.2.5.28 Script data double escaped dash state
        29. 13.2.5.29 Script data double escaped dash dash state
        30. 13.2.5.30 Script data double escaped less-than sign state
        31. 13.2.5.31 Script data double escape end state
        32. 13.2.5.32 Before attribute name state
        33. 13.2.5.33 Attribute name state
        34. 13.2.5.34 After attribute name state
        35. 13.2.5.35 Before attribute value state
        36. 13.2.5.36 Attribute value (double-quoted) state
        37. 13.2.5.37 Attribute value (single-quoted) state
        38. 13.2.5.38 Attribute value (unquoted) state
        39. 13.2.5.39 After attribute value (quoted) state
        40. 13.2.5.40 Self-closing start tag state
        41. 13.2.5.41 Bogus comment state
        42. 13.2.5.42 Markup declaration open state
        43. 13.2.5.43 Comment start state
        44. 13.2.5.44 Comment start dash state
        45. 13.2.5.45 Comment state
        46. 13.2.5.46 Comment less-than sign state
        47. 13.2.5.47 Comment less-than sign bang state
        48. 13.2.5.48 Comment less-than sign bang dash state
        49. 13.2.5.49 Comment less-than sign bang dash dash state
        50. 13.2.5.50 Comment end dash state
        51. 13.2.5.51 Comment end state
        52. 13.2.5.52 Comment end bang state
        53. 13.2.5.53 DOCTYPE state
        54. 13.2.5.54 Before DOCTYPE name state
        55. 13.2.5.55 DOCTYPE name state
        56. 13.2.5.56 After DOCTYPE name state
        57. 13.2.5.57 After DOCTYPE public keyword state
        58. 13.2.5.58 Before DOCTYPE public identifier state
        59. 13.2.5.59 DOCTYPE public identifier (double-quoted) state
        60. 13.2.5.60 DOCTYPE public identifier (single-quoted) state
        61. 13.2.5.61 After DOCTYPE public identifier state
        62. 13.2.5.62 Between DOCTYPE public and system identifiers state
        63. 13.2.5.63 After DOCTYPE system keyword state
        64. 13.2.5.64 Before DOCTYPE system identifier state
        65. 13.2.5.65 DOCTYPE system identifier (double-quoted) state
        66. 13.2.5.66 DOCTYPE system identifier (single-quoted) state
        67. 13.2.5.67 After DOCTYPE system identifier state
        68. 13.2.5.68 Bogus DOCTYPE state
        69. 13.2.5.69 CDATA section state
        70. 13.2.5.70 CDATA section bracket state
        71. 13.2.5.71 CDATA section end state
        72. 13.2.5.72 Character reference state
        73. 13.2.5.73 Named character reference state
        74. 13.2.5.74 Ambiguous ampersand state
        75. 13.2.5.75 Numeric character reference state
        76. 13.2.5.76 Hexadecimal character reference start state
        77. 13.2.5.77 Decimal character reference start state
        78. 13.2.5.78 Hexadecimal character reference state
        79. 13.2.5.79 Decimal character reference state
        80. 13.2.5.80 Numeric character reference end state
      6. 13.2.6 Tree construction
        1. 13.2.6.1 Creating and inserting nodes
        2. 13.2.6.2 Parsing elements that contain only text
        3. 13.2.6.3 Closing elements that have implied end tags
        4. 13.2.6.4 The rules for parsing tokens in HTML content
          1. 13.2.6.4.1 The "initial" insertion mode
          2. 13.2.6.4.2 The "before html" insertion mode
          3. 13.2.6.4.3 The "before head" insertion mode
          4. 13.2.6.4.4 The "in head" insertion mode
          5. 13.2.6.4.5 The "in head noscript" insertion mode
          6. 13.2.6.4.6 The "after head" insertion mode
          7. 13.2.6.4.7 The "in body" insertion mode
          8. 13.2.6.4.8 The "text" insertion mode
          9. 13.2.6.4.9 The "in table" insertion mode
          10. 13.2.6.4.10 The "in table text" insertion mode
          11. 13.2.6.4.11 The "in caption" insertion mode
          12. 13.2.6.4.12 The "in column group" insertion mode
          13. 13.2.6.4.13 The "in table body" insertion mode
          14. 13.2.6.4.14 The "in row" insertion mode
          15. 13.2.6.4.15 The "in cell" insertion mode
          16. 13.2.6.4.16 The "in select" insertion mode
          17. 13.2.6.4.17 The "in select in table" insertion mode
          18. 13.2.6.4.18 The "in template" insertion mode
          19. 13.2.6.4.19 The "after body" insertion mode
          20. 13.2.6.4.20 The "in frameset" insertion mode
          21. 13.2.6.4.21 The "after frameset" insertion mode
          22. 13.2.6.4.22 The "after after body" insertion mode
          23. 13.2.6.4.23 The "after after frameset" insertion mode
        5. 13.2.6.5 The rules for parsing tokens in foreign content
      7. 13.2.7 The end
      8. 13.2.8 Speculative HTML parsing
      9. 13.2.9 Coercing an HTML DOM into an infoset
      10. 13.2.10 An introduction to error handling and strange cases in the parser
        1. 13.2.10.1 Misnested tags: <b><i></b></i>
        2. 13.2.10.2 Misnested tags: <b><p></b></p>
        3. 13.2.10.3 Unexpected markup in tables
        4. 13.2.10.4 Scripts that modify the page as it is being parsed
        5. 13.2.10.5 The execution of scripts that are moving across multiple documents
        6. 13.2.10.6 Unclosed formatting elements
    2. 13.3 Serializing HTML fragments
    3. 13.4 Parsing HTML fragments

13.2 Parsing HTML documents

This section only applies to user agents, data mining tools, and conformance checkers.

The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XML syntax".

User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser.

While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.

For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.

As stated in the terminology section, references to element types that do not explicitly specify a namespace always refer to elements in the HTML namespace. For example, if the spec talks about "a menu element", then that is an element with the local name "menu", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLMenuElement. Where possible, references to such elements are hyperlinked to their definition.

13.2.1 Overview of the parsing model

The input to the HTML parsing process consists of a stream of code points, which is passed through a tokenization stage followed by a tree construction stage. The output is a Document object.

Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.

In the common case, the data handled by the tokenization stage comes from the network, but it can also come from script running in the user agent, e.g. using the document.write() API.

There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.

In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" end tag token:

...
<script>
 document.write('<p>');
</script>
...

To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false.
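The reentrancy described above can be illustrated with a toy model. The sketch below is not the HTML parser: `ToyParser`, `feed()`, `record()`, `handleScriptEndTag()`, and the regular-expression tokenizer are hypothetical simplifications that only recognize bare tags and a single `document.write('...')` call inside a script. It models the script nesting level but not the parser pause flag. Its purpose is to show a "p" start tag being handled while the "script" end tag is still being processed.

```javascript
// Toy model only: a hypothetical ToyParser, not the tokenizer or tree
// construction stage defined by the specification.
class ToyParser {
  constructor() {
    this.scriptNestingLevel = 0; // initially zero
    this.log = [];               // order in which tokens were handled
  }

  // Tokenization stage (toy): finds tags in `text` and hands each one on.
  feed(text) {
    const pattern = /<script>([\s\S]*?)<\/script>|<(\/?)([a-z]+)>/g;
    for (const match of text.matchAll(pattern)) {
      if (match[1] !== undefined) {
        this.handleScriptEndTag(match[1]);
      } else {
        const kind = match[2] === "/" ? "end" : "start";
        this.record(`${kind} tag "${match[3]}"`);
      }
    }
  }

  record(message) {
    this.log.push(`[level ${this.scriptNestingLevel}] ${message}`);
  }

  // Tree construction stage (toy) for a "script" end tag: runs the script,
  // which may write more input and so re-enter feed() before this returns.
  handleScriptEndTag(source) {
    this.record('end tag "script": begin');
    this.scriptNestingLevel += 1;
    const call = /document\.write\('([^']*)'\)/.exec(source);
    if (call) this.feed(call[1]); // stand-in for document.write()
    this.scriptNestingLevel -= 1;
    this.record('end tag "script": done');
  }
}

const parser = new ToyParser();
parser.feed("<div><script>document.write('<p>');</script></div>");
console.log(parser.log.join("\n"));
// [level 0] start tag "div"
// [level 0] end tag "script": begin
// [level 1] start tag "p"
// [level 0] end tag "script": done
// [level 0] end tag "div"
```

The "p" start tag is logged between the "begin" and "done" entries for the "script" end tag, which is the nesting the paragraph above describes.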

13.2.2 Parse errors

This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document.

Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.

Some parse errors have dedicated codes outlined in the table below that should be used by conformance checkers in reports.

Error descriptions in the table below are non-normative.

Code Description
abrupt-closing-of-empty-comment

This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (>) code point (i.e., <!--> or <!--->). The parser behaves as if the comment is closed correctly.

abrupt-doctype-public-identifier

This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE public identifier (e.g., <!DOCTYPE html PUBLIC "foo>). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

abrupt-doctype-system-identifier

This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE system identifier (e.g., <!DOCTYPE html PUBLIC "foo" "bar>). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

absence-of-digits-in-numeric-character-reference

This error occurs if the parser encounters a numeric character reference that doesn't contain any digits (e.g., &#qux;). In this case the parser doesn't resolve the character reference.

cdata-in-html-content

This error occurs if the parser encounters a CDATA section outside of foreign content (SVG or MathML). The parser treats such CDATA sections (including leading "[CDATA[" and trailing "]]" strings) as comments.

character-reference-outside-unicode-range

This error occurs if the parser encounters a numeric character reference that references a code point that is greater than the valid Unicode range. The parser resolves such a character reference to a U+FFFD REPLACEMENT CHARACTER.

control-character-in-input-stream

This error occurs if the input stream contains a control code point that is not ASCII whitespace or U+0000 NULL. Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM.

control-character-reference

This error occurs if the parser encounters a numeric character reference that references a control code point that is not ASCII whitespace or is a U+000D CARRIAGE RETURN. The parser resolves such character references as-is except C1 control references that are replaced according to the numeric character reference end state.

duplicate-attribute

This error occurs if the parser encounters an attribute in a tag that already has an attribute with the same name. The parser ignores all such duplicate occurrences of the attribute.

end-tag-with-attributes

This error occurs if the parser encounters an end tag with attributes. Attributes in end tags are ignored and do not make their way into the DOM.

end-tag-with-trailing-solidus

This error occurs if the parser encounters an end tag that has a U+002F (/) code point right before the closing U+003E (>) code point (e.g., </div/>). Such a tag is treated as a regular end tag.

eof-before-tag-name

This error occurs if the parser encounters the end of the input stream where a tag name is expected. In this case the parser treats the beginning of a start tag (i.e., <) or an end tag (i.e., </) as text content.

eof-in-cdata

This error occurs if the parser encounters the end of the input stream in a CDATA section. The parser treats such CDATA sections as if they are closed immediately before the end of the input stream.

eof-in-comment

This error occurs if the parser encounters the end of the input stream in a comment. The parser treats such comments as if they are closed immediately before the end of the input stream.

eof-in-doctype

This error occurs if the parser encounters the end of the input stream in a DOCTYPE. In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

eof-in-script-html-comment-like-text

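This error occurs if the parser encounters the end of the input stream in text that resembles an HTML comment inside script element content (e.g., <script><!-- foo).

[...]

13.3 Serializing HTML fragments

[...]

The output of the serialization algorithm, if parsed with an HTML parser, will not always return the original tree structure. Examples include making a script element contain a Text node with the text string "</script>", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).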

This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font family names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element: if the user enters "</style><script>attack</script>" as a font family name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.
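The attack can be sketched without a DOM. The example below assumes, as the serialization algorithm specifies for style elements, that Text node children are written out literally rather than escaped. `serializeStyleElement` is a hypothetical stand-in for reading innerHTML on the parent of such a style element, and the font family string is an illustrative input.

```javascript
// Hypothetical helper: serializes a style element whose only child is a Text
// node. Text inside style is emitted literally, with no escaping.
function serializeStyleElement(cssText) {
  return "<style>" + cssText + "</style>";
}

// Illustrative user input placed into the style sheet through the DOM.
const userFontFamily = "</style><script>attack</script>";
const css = "body { font-family: " + userFontFamily + "; }";

// In the DOM this is one style element holding one Text node.
const markup = serializeStyleElement(css);
console.log(markup);

// In the markup, "</style>" now appears before a script start tag, so a
// parser reading this string would close the style element and then create
// a script element.
const styleEnd = markup.indexOf("</style>");
const scriptStart = markup.indexOf("<script>");
console.log(scriptStart !== -1 && styleEnd < scriptStart);
```

The second log line reports whether the serialized string exposes a script start tag after the first style end tag, which is the condition that makes reparsing in another context dangerous.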

For example, consider the following markup:

<form id="outer"><div></form><form id="inner"><input>

This will be parsed into:

html
  head
  body
    form id="outer"
      div
        form id="inner"
          input

The input element will be associated with the inner form element. Now, if this tree structure is serialized and reparsed, the <form id="inner"> start tag will be ignored, and so the input element will be associated with the outer form element instead.

<html><head></head><body><form id="outer"><div><form id="inner"><input></form></div></form></body></html>

As another example, consider the following markup:

<a><table><a>

This will be parsed into:

html
  head
  body
    a
      a
      table

That is, the a elements are nested, because the second a element is foster parented. After a serialize-reparse roundtrip, the a elements and the table element would all be siblings, because the second <a> start tag implicitly closes the first a element.

<html><head></head><body><a><a></a><table></table></a></body></html>

For historical reasons, this algorithm does not round-trip an initial U+000A LINE FEED (LF) character in pre, textarea, or listing elements, even though (in the first two cases) the markup being round-tripped can be conforming. The HTML parser will drop such a character during parsing, but this algorithm does not serialize an extra U+000A LINE FEED (LF) character.

For example, consider the following markup:

<pre>

Hello.</pre>

When this document is first parsed, the pre element's child text content starts with a single newline character. After a serialize-reparse roundtrip, the pre element's child text content is simply "Hello.".
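The round trip above can be traced with two toy functions. `parsePre` and `serializePre` are hypothetical stand-ins that model only the leading-newline behavior described here. They ignore escaping and every other parsing rule.

```javascript
// Toy stand-in for parsing: one LF immediately after the <pre> start tag is
// dropped. Assumes `markup` is exactly "<pre>" + text + "</pre>".
function parsePre(markup) {
  const inner = markup.slice("<pre>".length, -"</pre>".length);
  return inner.startsWith("\n") ? inner.slice(1) : inner;
}

// Toy stand-in for serializing: no extra LF is written after the start tag.
function serializePre(text) {
  return "<pre>" + text + "</pre>";
}

const firstParse = parsePre("<pre>\n\nHello.</pre>");
const secondParse = parsePre(serializePre(firstParse));

console.log(JSON.stringify(firstParse));  // "\nHello."
console.log(JSON.stringify(secondParse)); // "Hello."
```

Each pass through the pair loses one leading newline, which is why the text content differs after the roundtrip.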

Because of the special role of the is attribute in signaling the creation of customized built-in elements, in that it provides a mechanism for parsed HTML to set the element's is value, we special-case its handling during serialization. This ensures that an element's is value is preserved through serialize-parse roundtrips.

When creating a customized built-in element via the parser, a developer uses the is attribute directly; in such cases serialize-parse roundtrips work fine.

<script>
window.SuperP = class extends HTMLParagraphElement {};
customElements.define("super-p", SuperP, { extends: "p" });
</script>

<div id="container"><p is="super-p">Superb!</p></div>

<script>
console.log(container.innerHTML); // <p is="super-p">Superb!</p>
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">Superb!</p>
console.assert(container.firstChild instanceof SuperP);
</script>

But when creating a customized built-in element via its constructor or via createElement(), the is attribute is not added. Instead, the is value (which is what the custom elements machinery uses) is set without intermediating through an attribute.

<script>
container.innerHTML = "";
const p = document.createElement("p", { is: "super-p" });
container.appendChild(p);

// The is attribute is not present in the DOM:
console.assert(!p.hasAttribute("is"));

// But the element is still a super-p:
console.assert(p instanceof SuperP);
</script>

To ensure that serialize-parse roundtrips still work, the serialization process explicitly writes out the element's is value as an is attribute:

<script>
console.log(container.innerHTML); // <p is="super-p"></p>
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p"></p>
console.assert(container.firstChild instanceof SuperP);
</script>

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

  1. Replace any occurrence of the "&" character by the string "&amp;".

  2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".

  3. Replace any occurrences of the "<" character by the string "&lt;".

  4. Replace any occurrences of the ">" character by the string "&gt;".

  5. If the algorithm was invoked in the attribute mode, then replace any occurrences of the """ character by the string "&quot;".
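The escaping steps above can be sketched as a small function. `escapeString` and its `attributeMode` parameter are names chosen for this illustration. The ampersand is replaced first so that the ampersands introduced by the later replacements are not escaped a second time.

```javascript
// A minimal sketch of the escaping steps; the function and parameter names
// are assumptions made for this example.
function escapeString(input, attributeMode = false) {
  let output = input
    .replace(/&/g, "&amp;")       // step 1
    .replace(/\u00A0/g, "&nbsp;") // step 2
    .replace(/</g, "&lt;")        // step 3
    .replace(/>/g, "&gt;");       // step 4
  if (attributeMode) {
    output = output.replace(/"/g, "&quot;"); // step 5, attribute mode only
  }
  return output;
}

console.log(escapeString("a < b & c > d"));     // a &lt; b &amp; c &gt; d
console.log(escapeString('say "hi"'));          // say "hi"
console.log(escapeString('say "hi"', true));    // say &quot;hi&quot;
```

Outside the attribute mode the quotation mark is left alone, so the same input serializes differently as text content and as an attribute value.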

13.4 Parsing HTML fragments

The HTML fragment parsing algorithm, given an Element node context, string input, and an optional boolean allowDeclarativeShadowRoots (default false), is the following steps. They return a list of zero or more nodes.

Parts marked fragment case in algorithms in the HTML parser section are parts that only occur if the parser was created for the purposes of this algorithm. The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.

  1. Let document be a Document node whose type is "html".

  2. If context's node document is in quirks mode, then set document's mode to "quirks".

  3. Otherwise, if context's node document is in limited-quirks mode, then set document's mode to "limited-quirks".

  4. If allowDeclarativeShadowRoots is true, then set document's allow declarative shadow roots to true.

  5. Create a new HTML parser, and associate it with document.

  6. Set the state of the HTML parser's tokenization stage as follows, switching on the context element:

    title
    textarea
        Switch the tokenizer to the RCDATA state.
    style
    xmp
    iframe
    noembed
    noframes
        Switch the tokenizer to the RAWTEXT state.
    script
        Switch the tokenizer to the script data state.
    noscript
        If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
    plaintext
        Switch the tokenizer to the PLAINTEXT state.
    Any other element
        Leave the tokenizer in the data state.

    For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.

  7. Let root be the result of creating an element given document, "html", the HTML namespace, null, null, false, and context's custom element registry.

  8. Append root to document.

  9. Set up the HTML parser's stack of open elements so that it contains just the single element root.

  10. If context is a template element, then push "in template" onto the stack of template insertion modes so that it is the new current template insertion mode.

  11. Create a start tag token whose name is the local name of context and whose attributes are the attributes of context.

    Let this start tag token be the start tag token of context; e.g. for the purposes of determining if it is an HTML integration point.

  12. Reset the parser's insertion mode appropriately.

    The parser will reference the context element as part of that algorithm.

  13. Set the HTML parser's form element pointer to the nearest node to context that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), if any. (If there is no such form element, the form element pointer keeps its initial value, null.)

  14. Place the input into the input stream for the HTML parser just created. The encoding confidence is irrelevant.

  15. Start the HTML parser and let it run until it has consumed all the characters just inserted into the input stream.

  16. Return root's children, in tree order.
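Two of the steps above lend themselves to a short sketch: choosing the tokenizer's initial state from the context element (step 6) and finding the form element pointer (step 13). The function names, the state strings, and the plain `{ localName, parent }` objects are assumptions made for this illustration, and the context element is assumed to be in the HTML namespace. This is not a fragment parser.

```javascript
// Step 6 (sketch): initial tokenizer state for a given context element.
function initialTokenizerState(contextLocalName, scriptingEnabled) {
  switch (contextLocalName) {
    case "title":
    case "textarea":
      return "RCDATA";
    case "style":
    case "xmp":
    case "iframe":
    case "noembed":
    case "noframes":
      return "RAWTEXT";
    case "script":
      return "script data";
    case "noscript":
      return scriptingEnabled ? "RAWTEXT" : "data";
    case "plaintext":
      return "PLAINTEXT";
    default:
      return "data";
  }
}

// Step 13 (sketch): nearest form element, starting with the context element
// itself and walking straight up the ancestor chain; null if there is none.
function nearestFormElement(context) {
  for (let node = context; node; node = node.parent) {
    if (node.localName === "form") return node;
  }
  return null;
}

// Hypothetical tree: form > div > textarea
const form = { localName: "form", parent: null };
const div = { localName: "div", parent: form };
const textarea = { localName: "textarea", parent: div };

console.log(initialTokenizerState(textarea.localName, true)); // RCDATA
console.log(initialTokenizerState("noscript", false));        // data
console.log(nearestFormElement(textarea) === form);           // true
console.log(nearestFormElement({ localName: "p", parent: null })); // null
```

Keeping the state choice as a pure function of the local name and the scripting flag makes the table in step 6 easy to check case by case.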