Catching Non-SGML, High-ASCII, etc.

I have a news article script that lets editors cut and paste just about anything into an article. The problem is that some editors paste in MS Word documents, which bring along non-SGML / high-ASCII characters that are neither escaped nor encoded.

It handles the basic stripping of angle brackets, ampersands, etc., but I am looking for something more elegant.

I found a thread on thelist at evolt that gives a PHP example for a similar situation:

[lists.evolt.org...]

But I want to use Perl to convert things like:

“ or ” to &#034;
(left and right curved double quote to standard ASCII double quote)

‘ or ’ to &#039;
(left and right curved single quote to standard ASCII single quote)

– or — to &#045;
(en / em dash to hyphen)

etc...
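For what it's worth, the mapping above could be sketched as a small lookup table. This is only a rough sketch, not a finished solution: the helper name smart_to_ascii and the table %ascii_for are made up for illustration, and it assumes the input has already been decoded into Perl characters (not raw bytes). The ellipsis entry is my own addition to the list.

```perl
use strict;
use warnings;

# Hypothetical lookup table: typographic character => plain ASCII stand-in.
my %ascii_for = (
    "\x{201C}" => '"',    # left curved double quote
    "\x{201D}" => '"',    # right curved double quote
    "\x{2018}" => "'",    # left curved single quote
    "\x{2019}" => "'",    # right curved single quote
    "\x{2013}" => '-',    # en dash
    "\x{2014}" => '-',    # em dash
    "\x{2026}" => '...',  # horizontal ellipsis (assumed extra, not in the list above)
);

# Build one character class from the table keys and substitute in a single pass.
sub smart_to_ascii {
    my ($text) = @_;
    my $class = join '', keys %ascii_for;
    $text =~ s/([$class])/$ascii_for{$1}/g;
    return $text;
}

print smart_to_ascii("\x{201C}Hi\x{201D} \x{2013} it\x{2019}s"), "\n";
```

Once the text is back to plain ASCII, the usual escaping of quotes, angle brackets and ampersands can produce the numeric references.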

I started playing with the CPAN HTML::Entities code, but after a few minutes I decided to ask here if anyone knew of something better before I spend too much time.

If nobody has anything else, here's a little toy I started. Enter some text with unescaped high-ASCII, Unicode chars, Windows-1252 stuff, etc., then view the source of the page it returns and you'll see what HTML::Entities does.
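For the Windows-1252 case specifically, one possible approach is to decode the pasted bytes first and then let HTML::Entities emit numeric references. A minimal sketch, assuming the form data really arrives as Windows-1252 bytes (that depends on the page's charset and the browser); the helper name cp1252_to_numeric is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities_numeric);

# Hypothetical helper: interpret the input as Windows-1252 bytes, then
# replace unsafe and non-ASCII characters with numeric character references.
sub cp1252_to_numeric {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);    # bytes -> Perl characters
    return encode_entities_numeric($chars);  # default unsafe set
}

# 0x93/0x94 are the curved double quotes in Windows-1252, 0x96 the en dash.
print cp1252_to_numeric("\x93quoted\x94 \x96 done"), "\n";
```

The references come out in hexadecimal form (the left curved quote becomes a reference to code point 201C), which keeps the typographic character instead of flattening it to ASCII.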

BUT I AM HOPING SOMEONE HAS SOMETHING BETTER ALREADY...

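#!/usr/local/bin/perl
# ==========
# sgmconvt.pl
# ==========
#
use strict;
use warnings;
use CGI;
use HTML::Entities;
#
# The second argument to encode_entities() is the body of a character
# class, not a comma-separated list, so commas here would be encoded too.
my $unsafe_chars = "&<>\nü";
#
my $query = CGI->new;
#
# Default to an empty string so the first (empty) request does not warn.
my $string = $query->param("f_string");
$string = "" unless defined $string;
#
# Escape the echoed input so pasted markup cannot break out of the textarea.
# In non-void context encode_entities() returns an encoded copy.
my $echo = encode_entities( $string, '<>&"' );
#
print "Content-Type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Online SGML Entities / Character Encoder Decoder Tool</title>\n";
print "</head>\n";
print "<body>\n";
print "<center>\n";
#
print "<FORM ACTION=\"\" METHOD=\"post\">\n";
print "<TEXTAREA NAME=f_string ROWS=4 COLS=40>$echo</TEXTAREA><br>\n";
print "<INPUT TYPE=reset VALUE=Reset>\n";
print "<INPUT TYPE=submit VALUE=Submit>\n";
print "</FORM>\n";
#
# In void context encode_entities() modifies $string in place.
encode_entities( $string, $unsafe_chars );
print "<br>Encoded:<br>$string<br><br>\n";
#
print "</center>\n";
print "</body>\n";
print "</html>\n";
#
# eof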

Is HTML::Entities too low level?