HTML::Tagset(3) data tables useful in parsing HTML

VARIABLES

Note that none of these variables are exported.

hashset %HTML::Tagset::emptyElement


This hashset (a hash used as a set: only the keys matter, and every value present is true) has as keys the tag-names (GIs) of elements that cannot have content. (For example, ``base'', ``br'', ``hr''.) So $HTML::Tagset::emptyElement{'hr'} exists and is true. $HTML::Tagset::emptyElement{'dl'} does not exist, and so is not true.
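
As a minimal sketch of how such a hashset might be queried: a plain hash lookup under the fully qualified name is enough, since nothing is exported. The helper name is_empty_element is hypothetical and is not part of HTML::Tagset.

```perl
use strict;
use warnings;
use HTML::Tagset;

# Hypothetical helper (not provided by HTML::Tagset):
# returns 1 if the named element can never have content, else 0.
sub is_empty_element {
    my ($tag) = @_;
    return $HTML::Tagset::emptyElement{ lc $tag } ? 1 : 0;
}

for my $tag (qw(hr dl)) {
    printf "%s: %s\n", $tag,
        is_empty_element($tag) ? 'empty' : 'may have content';
}
```

The same lookup pattern applies to every hashset described below; only the variable name changes.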

hashset %HTML::Tagset::optionalEndTag


This hashset lists tag-names for elements that can have content, but whose end-tags are generally, ``safely'', omissible. Example: $HTML::Tagset::optionalEndTag{'li'} exists and is true.

hash %HTML::Tagset::linkElements


Keys in this hash are tagnames for elements that might contain links, and the value for each is a reference to an array of the names of attributes whose values can be links.
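
A short sketch of reading this structure, assuming only the shape described above (tag name mapped to an array reference of attribute names). The helper name link_attributes is hypothetical.

```perl
use strict;
use warnings;
use HTML::Tagset;

# Hypothetical helper (not provided by HTML::Tagset):
# returns the link-bearing attribute names for a tag, or an empty
# list if the tag has no entry in %HTML::Tagset::linkElements.
sub link_attributes {
    my ($tag) = @_;
    my $attrs = $HTML::Tagset::linkElements{ lc $tag };
    return $attrs ? @$attrs : ();
}

for my $tag (qw(a img p)) {
    my @attrs = link_attributes($tag);
    print "$tag: ",
        ( @attrs ? join( ', ', @attrs ) : '(no link attributes)' ), "\n";
}
```

A link extractor could use the returned list to decide which attributes of a parsed start-tag to inspect.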

hash %HTML::Tagset::boolean_attr


This hash (not hashset) lists what attributes of what elements can be printed without showing the value (for example, the ``noshade'' attribute of ``hr'' elements). For elements with only one such attribute, its value is simply that attribute name. For elements with many such attributes, the value is a reference to a hashset containing all such attributes.
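
Because a value here may be either a plain attribute name or a reference to a hashset, callers have to handle both shapes. The following is a sketch of one way to do that; the helper name is_boolean_attr is hypothetical.

```perl
use strict;
use warnings;
use HTML::Tagset;

# Hypothetical helper (not provided by HTML::Tagset):
# returns 1 if $attr may be printed without a value on element $tag.
sub is_boolean_attr {
    my ( $tag, $attr ) = @_;
    my $entry = $HTML::Tagset::boolean_attr{ lc $tag };
    return 0 unless defined $entry;

    if ( ref($entry) eq 'HASH' ) {
        # Several boolean attributes: the value is a hashset reference.
        return $entry->{ lc($attr) } ? 1 : 0;
    }

    # Exactly one boolean attribute: the value is its name.
    return $entry eq lc($attr) ? 1 : 0;
}

print "noshade on hr: ", is_boolean_attr( 'hr', 'noshade' ), "\n";
print "width on hr:   ", is_boolean_attr( 'hr', 'width' ),   "\n";
```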

hashset %HTML::Tagset::isPhraseMarkup


This hashset contains all phrasal-level elements.

hashset %HTML::Tagset::is_Possible_Strict_P_Content


This hashset contains all phrasal-level elements that can be content of a P element, for a strict model of HTML.

hashset %HTML::Tagset::isHeadElement


This hashset contains all elements that should be present only in the ``head'' element of an HTML document.

hashset %HTML::Tagset::isList


This hashset contains all elements that can contain ``li'' elements.

hashset %HTML::Tagset::isTableElement


This hashset contains all elements that are to be found only in/under a ``table'' element.

hashset %HTML::Tagset::isFormElement


This hashset contains all elements that are to be found only in/under a ``form'' element.

hashset %HTML::Tagset::isBodyElement


This hashset contains all elements that are to be found only in/under the ``body'' element of an HTML document.

hashset %HTML::Tagset::isHeadOrBodyElement


This hashset includes all elements that I notice can fall either in the head or in the body.
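
The three placement hashsets (isHeadElement, isBodyElement, isHeadOrBodyElement) can be combined into a rough classifier. This is only a sketch: the helper name placement and the order in which the sets are consulted are assumptions, not something the module prescribes.

```perl
use strict;
use warnings;
use HTML::Tagset;

# Hypothetical helper (not provided by HTML::Tagset):
# reports where an element may appear, checking the
# "either place" set first so that it takes precedence.
sub placement {
    my ($tag) = @_;
    $tag = lc $tag;
    return 'head or body' if $HTML::Tagset::isHeadOrBodyElement{$tag};
    return 'head'         if $HTML::Tagset::isHeadElement{$tag};
    return 'body'         if $HTML::Tagset::isBodyElement{$tag};
    return 'unclassified';
}

print "$_: ", placement($_), "\n" for qw(title p script);
```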

hashset %HTML::Tagset::isKnown


This hashset lists all known HTML elements.

hashset %HTML::Tagset::canTighten


This hashset lists elements that might have ignorable whitespace as children or siblings.

array @HTML::Tagset::p_closure_barriers


This array has a meaning that I have only seen a need for in "HTML::TreeBuilder", but I include it here on the off chance that someone might find it of use:

When we see a "<p>" token, we look up the lineage for an open p element that we might have to close (minimize). At first sight, we might say that if there's a p anywhere in the lineage of this new p, it should be closed. But that's wrong. Consider this document:

  <html>
    <head>
      <title>foo</title>
    </head>
    <body>
      <p>foo
        <table>
          <tr>
            <td>
               foo
               <p>bar
            </td>
          </tr>
        </table>
      </p>
    </body>
  </html>

The second p is quite legally inside a much higher p.

My formalization of the reason why this is legal, but this:

  <p>foo<p>bar</p></p>

isn't, is that something about the table constitutes a ``barrier'' to the application of the rule about what p must minimize.

So @HTML::Tagset::p_closure_barriers is the list of all such barrier-tags.
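
The barrier rule can be sketched as a walk up the lineage that stops at the first open p or the first barrier, whichever comes first. This is an illustration only, not the code "HTML::TreeBuilder" actually uses; the helper name p_to_close and the representation of the lineage as a list of tag names (innermost first) are assumptions.

```perl
use strict;
use warnings;
use HTML::Tagset;

# Hashset view of the barrier list, for quick membership tests.
my %is_barrier = map { $_ => 1 } @HTML::Tagset::p_closure_barriers;

# Hypothetical helper (not provided by HTML::Tagset).
# @lineage lists the open elements from innermost to outermost.
# Returns the index of the open "p" that a new <p> should close,
# or undef if a barrier is reached first or no "p" is open.
sub p_to_close {
    my @lineage = @_;
    for my $i ( 0 .. $#lineage ) {
        return $i if $lineage[$i] eq 'p';
        return    if $is_barrier{ $lineage[$i] };
    }
    return;
}

# <p>foo<p>bar : the new p finds an open p directly above it.
my $plain = p_to_close(qw(p body html));

# The table example above: td is reached before the outer p.
my $nested = p_to_close(qw(td tr table p body html));

print defined $plain  ? "close p at index $plain\n"  : "nothing to close\n";
print defined $nested ? "close p at index $nested\n" : "nothing to close\n";
```

In the first call the open p is found immediately, so it would be closed. In the second, the walk hits a barrier before reaching the outer p, so the new p is left nested inside it.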

hashset %HTML::Tagset::isCDATA_Parent


This hashset includes all elements whose content is CDATA.