Ensuring that HTML string has proper number of closing tags

         

csdude55

5:53 am on Dec 14, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Any suggestions for a reliable method to ensure that a string of HTML code is valid? I'm not too concerned with W3C warnings; I really just need to make sure that every opening tag is closed.

What I'm really working with is user-submitted content that's often copied from another site (usually blurbs of news articles). I have a ton of regexes in place to remove tags, styles, etc that might cause a conflict with my site, so I'd really like to have a safety in place to make sure that I don't accidentally remove a closing tag or something that then messes up the display on my site.

My initial thought was to write a function that adds every opening tag in the string to an array, excluding <br> and <img> (all other void elements are stripped anyway), and then does the same with every closing tag. If the two arrays aren't the same length, it would append closing tags to the end of the string until the lengths match. It wouldn't be perfect, but it would (err, should) prevent a display error on the entire page.

Thoughts? Or is there a module that would do the same thing... but better?

csdude55

6:17 am on Dec 14, 2022 (gmt 0)




Quick and dirty:

print validator($HTML_string);

sub validator {
($_) = @_;

my (%open, %close, $count);

# Safety net for WHILE
$count = () = /</g;

# Count opening tags
$x = 0;
while ($x <= $count &&
m#<(\w+)[^>]*>#g) {
$tag = lc($1);

if ($tag ne 'img' &&
$tag ne 'br') {
$open{$tag} = exists($open{$tag}) ? $open{$tag} + 1 : 1;
}

$x++;
}

# Count closing tags
$y = 0;
while ($y <= $count &&
m#</(\w+)>#g) {
$tag = lc($1);
$close{$tag} = exists($close{$tag}) ? $close{$tag} + 1 : 1;

$y++;
}

# If more opening tags than closing, add closing to make it fit
for $key (keys %open) {
if ($open{$key} > $close{$key}) {
$fix = $open{$key} = $close{$key};

# add $key to the end of $_, $fix number of times
$_ .= "</$key>" x $fix;
}
}

return $_;
}

csdude55

7:10 pm on Dec 14, 2022 (gmt 0)




Simplified a little:

my (%open, %close, $count);

# Safety net for WHILE
$count = () = /</g;

# Count tags
$x = 0;
while ($x <= $count &&
m#<(/)?(\w+)[^>]*>#g) {
$tag = lc($2);

if ($tag ne 'img' &&
$tag ne 'br') {
$open{$tag} //= 0;
$close{$tag} //= 0;

if ($1) { $close{$tag}++; }
else { $open{$tag}++; }
}

$x++;
}

# If more opening tags than closing, add closing to make it fit
for $key (keys %open) {
if ($open{$key} > $close{$key}) {
# not sure why the previous post had an = instead of a - ?
$fix = $open{$key} - $close{$key};

# add $key to the end of $_, $fix number of times
$_ .= "</$key>" x $fix;
}
}

I should try to find a way to sort %open by the order in which its keys were added, then reverse that order so that the last tag opened is closed first; otherwise, this:

<div>this<p>that

might become:

<div>this<p>that</div></p>

The only way I can think to do this is to use arrays instead of hashes, but that's going to become a pretty complex pain. And I'm not sure if it's worth it for my purposes.
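For reference, the array-based version does not have to be much longer than the hash-based one. What follows is only a sketch under stated assumptions, not the poster's code: close_open_tags is a made-up name, the void list is limited to <br> and <img> because the thread says everything else is stripped, and the tag regex is the same simple one used above (so comments, a ">" inside an attribute value, and tags that HTML allows to go unclosed are not handled). It keeps a stack of open tags, closes inner tags before an outer closing tag, drops closing tags that have no opener, and closes whatever is left at the end, innermost first.

```perl
use strict;
use warnings;

# Assumption from the thread: only <br> and <img> survive the earlier filtering.
my %VOID = map { $_ => 1 } qw(br img);

# Hypothetical helper: returns a copy of $html with every opened tag closed.
sub close_open_tags {
    my ($html) = @_;
    my @stack;       # names of currently open tags, outermost first
    my $out = '';    # rebuilt string
    my $pos = 0;     # offset just past the previously processed tag

    while ($html =~ m{<(/?)(\w+)[^>]*>}g) {
        my ($is_close, $tag) = ($1, lc $2);
        my ($start, $end)    = ($-[0], $+[0]);

        $out .= substr($html, $pos, $start - $pos);    # text before this tag
        my $raw = substr($html, $start, $end - $start);
        $pos = $end;

        if ($VOID{$tag} || !$is_close) {
            # Opening or void tag: copy it through; only non-void tags are tracked.
            push @stack, $tag unless $VOID{$tag};
            $out .= $raw;
        }
        elsif (grep { $_ eq $tag } @stack) {
            # Closing tag with an opener: close anything opened after it first.
            while (@stack) {
                my $top = pop @stack;
                last if $top eq $tag;
                $out .= "</$top>";
            }
            $out .= $raw;
        }
        # Otherwise: closing tag with no opener. It is dropped so that it
        # cannot close markup belonging to the surrounding page.
    }

    $out .= substr($html, $pos);           # text after the last tag
    $out .= "</$_>" for reverse @stack;    # close leftovers, innermost first
    return $out;
}

print close_open_tags('<div>this<p>that'), "\n";
```

Because the string is rebuilt rather than appended to, something like <div><b>bold</div> gets its </b> inserted before the </div> instead of after it. Dropping stray closing tags is a design choice made here for the "don't break my layout" goal; it is not the only reasonable behavior.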

phranque

6:43 am on Dec 15, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



this row has already been hoed.

you might consider looking at the HTML::TreeBuilder [metacpan.org] and HTML::PrettyPrinter [metacpan.org] perl modules - if not to actually use them, you might get some useful ideas.
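As a rough idea of what HTML::TreeBuilder does with a fragment: it parses the string into a tree, inserting the missing structure as it goes, and the tree can then be written back out with every element closed. The sketch below assumes the module is installed from CPAN; balance_with_treebuilder is a made-up helper name, and the choice to re-serialize only the children of <body> is an assumption about what a post fragment needs, not something taken from the module's documentation.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Entities;

# Hypothetical helper: parse a fragment and write it back out with all tags closed.
sub balance_with_treebuilder {
    my ($fragment) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($fragment);

    # The parser wraps the fragment in implicit <html>, <head> and <body>
    # elements; only the children of <body> are wanted here.
    my $body = $tree->look_down(_tag => 'body');

    my $out = '';
    if ($body) {
        for my $node ($body->content_list) {
            $out .= ref $node
                ? $node->as_HTML('<>&', undef, {})     # {} = no end tag is optional
                : encode_entities($node, '<>&');       # bare text node
        }
    }

    $tree->delete;    # release the tree's memory
    return $out;
}

print balance_with_treebuilder('<div>this<b>that'), "\n";
```

The parser also lowercases tag names and may normalize whitespace and attribute quoting, so the output is a cleaned-up equivalent of the input rather than the same string with tags appended.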

phranque

6:46 am on Dec 15, 2022 (gmt 0)




from the HTML::TreeBuilder doc linked to above:
HTML is rather harder to parse than people who write it generally suspect.

Here's the problem: HTML is a kind of SGML that permits "minimization" and "implication". In short, this means that you don't have to close every tag you open (because the opening of a subsequent tag may implicitly close it), and if you use a tag that can't occur in the context you seem to using it in, under certain conditions the parser will be able to realize you mean to leave the current context and enter the new one, that being the only one that your code could correctly be interpreted in.

Now, this would all work flawlessly and unproblematically if: 1) all the rules that both prescribe and describe HTML were (and had been) clearly set out, and 2) everyone was aware of these rules and wrote their code in compliance to them.

However, it didn't happen that way, and so most HTML pages are difficult if not impossible to correctly parse with nearly any set of straightforward SGML rules.
...

csdude55

6:54 pm on Dec 15, 2022 (gmt 0)




Thanks, @phranque. I'm reading through both of them, but I think they assume that I already know what they do? It's very confusing, and the only thing I can think to do is install them, build some test scripts, and then figure out if they even do what I need :-/

Allowing user-submitted HTML has been a real pain to deal with! Something as simple as an unclosed <b> would end up bolding everything on the page. Or worse, an unclosed <div> would end up messing up the whole layout!

The only other idea I've had was to surround every post with a <table>. It's not perfect, but it solves 90% of my problems. It doesn't feel like a permanent solution, though.