
Catching Non-SGML, High-ASCII, etc.

Is HTML::Entities too low level?


lexipixel

9:02 am on Oct 29, 2006 (gmt 0)




I have a news article script that lets editors cut and paste just about anything into an article. The problem is that some editors paste in MS Word documents and other text containing non-SGML / high-ASCII characters that are neither escaped nor encoded.

It handles the basic stripping of angle brackets, ampersands, etc., but I am looking for something more elegant.

I found a thread on thelist at evolt that gives a PHP example for a similar situation:

[lists.evolt.org...]

But I want to use Perl to convert things like:

“ or ” to &#034;
(left and right curved double quote to standard ASCII double quote)

‘ or ’ to &#039;
(left and right curved single quote to standard ASCII single quote)

– or — to &#045;
(en / em dash to hyphen)

etc...

I started playing with the CPAN HTML::Entities code, but after a few minutes I decided to ask here if anyone knew of something better before I spend too much time.

If nobody has anything else, here's a little toy I started. Enter some text with unescaped high-ASCII, Unicode chars, Windows-1252 stuff, etc., then view the source of the page it returns and you'll see what "entities" does.

BUT I AM HOPING SOMEONE HAS SOMETHING BETTER ALREADY...

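#!/usr/local/bin/perl
# ==========
# sgmconvt.pl
# ==========
#
use strict;
use warnings;
use CGI;
use HTML::Entities;
#
# encode_entities() treats this string as the contents of a regex
# character class, so the characters are listed without separators
# (with commas in the list, commas themselves would get encoded).
# \x80-\xFF covers every high-bit byte instead of only "ü".
my $unsafe_chars = '&<>\n\x80-\xFF';
#
my $query  = CGI->new;
my $string = $query->param("f_string");
$string = "" unless defined $string;
#
# Escape the raw input before echoing it back into the textarea.
# Called in scalar context, encode_entities() returns an encoded copy
# and leaves $string untouched.
my $raw = encode_entities( $string, '&<>"' );
#
print "Content-Type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Online SGML Entities / Character Encoder Decoder Tool</title>\n";
print "</head>\n";
print "<body>\n";
print "<center>\n";
#
print "<FORM ACTION=\"\" METHOD=\"post\">\n";
print "<TEXTAREA NAME=f_string ROWS=4 COLS=40>$raw</TEXTAREA><br>\n";
print "<INPUT TYPE=reset VALUE=Reset>\n";
print "<INPUT TYPE=submit VALUE=Submit>\n";
print "</FORM>\n";
#
# Called in void context, encode_entities() modifies $string in place.
encode_entities( $string, $unsafe_chars );
print "<br>Encoded:<br>$string<br><br>\n";
#
print "</center>\n";
print "</body>\n";
print "</html>\n";
#
# eof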

encyclo

3:38 am on Nov 15, 2006 (gmt 0)




"un-encoded characters"

The characters are encoded, just not encoded in the charset you are using for the page. ;)

Basically, you can assume that content copied from Word is encoded in windows-1252 (given a Western European version of Windows). You are probably declaring either ISO-8859-1 or UTF-8 on your pages, and the extended windows-1252 characters that have no ISO-8859-1 equivalent, such as the curly quotes, are going to cause problems.

You can look at using iconv to convert the incoming data to the character encoding of your choice:

[search.cpan.org...]
[packages.debian.org...]
[gnu.org...]
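
As a rough illustration of that kind of conversion in Perl, here is a minimal sketch. It uses the core Encode module rather than iconv itself (a swapped-in technique, not something linked above), and it assumes the incoming bytes really are windows-1252. The sample string is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the submitted bytes are windows-1252 (cp1252).
# \x93 and \x94 are the cp1252 curly double quotes, \x96 is the en dash.
my $cp1252_bytes = "\x93quoted\x94 \x96 done";

# decode() turns the raw bytes into a Perl character string.
my $chars = decode('cp1252', $cp1252_bytes);

# encode() turns the characters into bytes in the target encoding.
my $utf8_bytes = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8_bytes);
```

Each of the three curly characters becomes a three-byte UTF-8 sequence, so the byte count grows while the character count stays the same.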

If you switch to UTF-8 for your site, you can add an accept-charset="UTF-8" attribute to your form and declare the page charset as UTF-8. IE (and all modern browsers) will then submit the textarea contents in UTF-8, with the curly quotes re-encoded as their UTF-8 counterparts.
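
A minimal sketch of what that could look like on the Perl side. The form markup reuses the f_string field from the script earlier in the thread. The helper name decode_submission and the fallback to cp1252 for input that is not valid UTF-8 are assumptions made for illustration, not part of the suggestion above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Send this header so the page itself is declared as UTF-8.
my $header = "Content-Type: text/html; charset=UTF-8\n\n";

# Form markup asking the browser to submit UTF-8.
my $form = <<'HTML';
<form action="" method="post" accept-charset="UTF-8">
<textarea name="f_string" rows="4" cols="40"></textarea>
<input type="submit" value="Submit">
</form>
HTML

# Hypothetical helper: decode the submitted bytes as strict UTF-8.
# If they are not valid UTF-8, assume (and this is only an assumption)
# that they are windows-1252 instead.
sub decode_submission {
    my ($bytes) = @_;
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars if defined $chars;
    return decode('cp1252', $bytes);
}
```

In a CGI script the argument would be the raw value returned by param("f_string"). After decoding, curly quotes arrive as single characters such as U+201C regardless of which of the two encodings the browser sent.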

lexipixel

4:42 am on Nov 17, 2006 (gmt 0)




Thanks for the reply encyclo.

I like to keep things simple, and this is a basic find-and-replace job, so I may just use one or more regexes to convert the worst offenders.
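
A find-and-replace along those lines might look like the sketch below. It assumes the text has already been decoded into Perl characters (not raw bytes). The table %ascii_for and the function plain_ascii are hypothetical names, and the list covers only the characters mentioned at the top of the thread.

```perl
use strict;
use warnings;

# The "worst offenders" mapped to plain ASCII; extend as needed.
my %ascii_for = (
    "\x{201C}" => '"',    # left curly double quote
    "\x{201D}" => '"',    # right curly double quote
    "\x{2018}" => "'",    # left curly single quote
    "\x{2019}" => "'",    # right curly single quote
    "\x{2013}" => '-',    # en dash
    "\x{2014}" => '-',    # em dash
);

# Build one character class from the table keys so a single
# substitution pass handles every listed character.
my $class = join '', keys %ascii_for;

sub plain_ascii {
    my ($text) = @_;
    $text =~ s/([$class])/$ascii_for{$1}/g;
    return $text;
}

print plain_ascii("\x{201C}Hi\x{201D} \x{2014} it\x{2019}s fine"), "\n";
# prints: "Hi" - it's fine
```

To emit the numeric references from the first post instead of bare ASCII, the table values could be changed to strings such as &#034; and the rest left as is.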