print validator($HTML_string);
sub validator {
($_) = @_;
my (%open, %close, $count);
# Safety net for WHILE
$count = () = /</g;
# Count opening tags
$x = 0;
while ($x <= $count &&
m#<(\w+)[^>]*>#g) {
$tag = lc($1);
if ($tag ne 'img' &&
$tag ne 'br') {
$open{$tag} = exists($open{$tag}) ? $open{$tag} + 1 : 1;
}
$x++;
}
# Count closing tags
$y = 0;
while ($y <= $count &&
m#</(\w+)>#g) {
$tag = lc($1);
$close{$tag} = exists($close{$tag}) ? $close{$tag} + 1 : 1;
$y++;
}
# If more opening tags than closing, add closing to make it fit
for $key (keys %open) {
if ($open{$key} > $close{$key}) {
$fix = $open{$key} = $close{$key};
# add $key to the end of $_, $fix number of times
$_ .= "</$key>" x $fix;
}
}
return $_;
}
my (%open, %close, $count);
# Safety net for WHILE
$count = () = /</g;
# Count tags
$x = 0;
while ($x <= $count &&
m#<(/)?(\w+)[^>]*>#g) {
$tag = lc($2);
if ($tag ne 'img' &&
$tag ne 'br') {
$open{$tag} //= 0;
$close{$tag} //= 0;
if ($1) { $close{$tag}++; }
else { $open{$tag}++; }
}
$x++;
}
# If more opening tags than closing, add closing to make it fit
for $key (keys %open) {
if ($open{$key} > $close{$key}) {
# not sure why the previous post had an = instead of a - ?
$fix = $open{$key} - $close{$key};
# add $key to the end of $_, $fix number of times
$_ .= "</$key>" x $fix;
}
}

HTML is rather harder to parse than people who write it generally suspect.
Here's the problem: HTML is a kind of SGML that permits "minimization" and "implication". In short, this means that you don't have to close every tag you open (because the opening of a subsequent tag may implicitly close it). It also means that if you use a tag that can't occur in the context you seem to be using it in, the parser will, under certain conditions, realize that you mean to leave the current context and enter the new one, that being the only context in which your code can be interpreted correctly.
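To make that concrete, here is a minimal sketch of how implied end tags trip up the tag-counting approach from earlier in the thread. The helper name count_tags and the sample markup are made up for illustration; the regex is essentially the one already posted. It assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper: tally opening and closing tags by name, the same
# way the counting approach earlier in the thread does.
sub count_tags {
    my ($html) = @_;
    my (%open, %close);
    while ($html =~ m#<(/?)(\w+)[^>]*>#g) {
        my $tag = lc $2;
        if ($1) { $close{$tag}++ } else { $open{$tag}++ }
    }
    return (\%open, \%close);
}

# Legal HTML 4: each <li> is implicitly closed by the next <li> or by </ul>.
my $html = '<ul><li>one<li>two</ul>';
my ($open, $close) = count_tags($html);

# Report every tag that has more opening than closing occurrences.
for my $tag (sort keys %$open) {
    my $missing = $open->{$tag} - ($close->{$tag} // 0);
    print "$tag: $missing unmatched\n" if $missing > 0;
}
```

The counter reports two unmatched li tags even though nothing is wrong with the markup, and a fixer that appends the missing closing tags to the end of the string would put them after the closing ul tag, turning a legal document into a broken one.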
Now, this would all work flawlessly and unproblematically if: 1) all the rules that both prescribe and describe HTML were (and had been) clearly set out, and 2) everyone was aware of these rules and wrote their code in compliance with them.
However, it didn't happen that way, and so most HTML pages are difficult if not impossible to correctly parse with nearly any set of straightforward SGML rules.
...