Forum Moderators: coopster


More on Unicode, ISO-xxxx, and UTF-8

         

csdude55

9:56 pm on Jan 24, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I know we were talking about this before, but I can't find it now. And I'm pretty sure it wasn't PHP related at the time, anyway.

I have about 20 years of data, and for the last 10 years or so users have had a contenteditable to submit their content. This was fine until recently, when I could just set the meta tag to UTF-8... but now I'm dealing with people on their phones, submitting Unicode smilies or whatever. And PHP7.x no longer recognized ENT_SUBSTITUTE in htmlspecialchars().

I've been using this to autocorrect everything:

function cleanup_text($text) {
$text = str_replace('​', '', $text);// zero width space

$text = preg_replace('#•?#', '', $text);
$text = preg_replace('#…?#', '...', $text);
$text = preg_replace('#[]?#', '-', $text);
$text = preg_replace('#(|[])?#', "'", $text);
$text = preg_replace('#("|||“??)#', '"', $text);

$text = trim($text);
$text = preg_replace('# {2,}#', ' ', $text);// Repeated whitespace

// PHP 7.x doesn't do substitutions anymore :-(
// return htmlspecialchars($text, ENT_SUBSTITUTE, 'UTF-8');

return htmlspecialchars($text, ENT_IGNORE, 'UTF-8');
}


And this WORKS, but I'm losing some context when a user submits something like this:

You guys suck &#9786;&#65039;&#128522;&#128522;


In his mind he submitted 4 smilies, but they're getting stripped out so it doesn't seem good natured like he intended.

Before I waste the next year trying to convert a full list of unicode and ISO-whatever codes to something I can manage (like images), just to have to do it again with the next iOS update... can you guys and gals suggest any other way to automatically substitute things?

phranque

11:32 pm on Jan 24, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I know we were talking about this before, but I can't find it now. And I'm pretty sure it wasn't PHP related at the time, anyway.

the thread you are thinking of was posted 3 weeks ago in the Webmaster General forum:
charset for user generated content [webmasterworld.com]

lucy24

12:51 am on Jan 25, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In his mind he submitted 4 smilies, but they're getting stripped out so it doesn't seem good natured like he intended.
Well, you could globally convert anything in the Advanced Smileys range to a mechanical :-) ... but sooner or later you'll run into some equivalent of the grandparent who thinks "lol" stands for "lots of love", as in "I was sorry to hear about the death of your dog, lol".

csdude55

12:59 am on Jan 25, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I haven't found a full list of unicode emojis yet, but I found this:

[unicode.org...]

So I'd still have to come up with a list of about 3,000 potentials to know whether to convert them to :-), :-(, >:-(, or maybe a handful of other possibilities. And, of course, the next update that brings in 10,000 more of them...

Or is it a simpler list of &#1000 - 8000; are :-), &#8001 - 12000 are :-(, and so on?

lucy24

2:15 am on Jan 25, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Or is it a simpler list of &#1000 - 8000; are :-), &#8001 - 12000 are :-(, and so on?
Yes, exactly. Unicode is divided into blocks. A comparatively early set is Dingbats (2700-27BF, or UTF-8 E29C80-E29EBF in three bytes), but mainly you're looking at Miscellaneous Symbols and Pictographs (1F300-1F5FF, or four-byte UTF-8 F09F8C80-F09F97BF) immediately followed by Emoticons (1F600-1F64F, or F09F9880-F09F998F). These are followed in turn by Transport and Map Symbols for another 8 rows, mysteriously including both an occupied and an unoccupied bathtub, but how often will your users need those?

Those are hexadecimal ranges--the form html expresses as &#x2700;--but if your program insists on decimal forms, conversion is trivial. (I've got at least one online converter bookmarked.)

Edit: No, I do not recommend going through the whole list and determining which simple keyboard emoticon is the most appropriate for each. How do you express Cat Face With Tears Of Joy, or See-No-Evil Monkey? You'd want to put in some generic squiggle, indicating Here Be Emoticons.
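A minimal sketch of the "generic squiggle" idea, assuming the text is already real UTF-8 (not numeric entities) by the time it runs. The function name replace_emoji() and the "[emoji]" placeholder are made up for illustration. The ranges are the blocks listed above, plus two assumed additions: Miscellaneous Symbols (2600-26FF), where &#9786; lives, and the variation selector FE0F (&#65039;).

```php
<?php
// Hypothetical helper: collapse any run of emoji-block characters into one placeholder.
function replace_emoji(string $text, string $placeholder = '[emoji]'): string
{
    $pattern = '/['
        . '\x{2600}-\x{26FF}'    // Miscellaneous Symbols (assumed addition)
        . '\x{2700}-\x{27BF}'    // Dingbats
        . '\x{1F300}-\x{1F5FF}'  // Miscellaneous Symbols and Pictographs
        . '\x{1F600}-\x{1F64F}'  // Emoticons
        . '\x{1F680}-\x{1F6FF}'  // Transport and Map Symbols
        . '\x{FE0F}'             // variation selector (assumed addition)
        . ']+/u';

    // preg_replace() returns null if the input is not valid UTF-8;
    // in that case fall back to the untouched text.
    return preg_replace($pattern, $placeholder, $text) ?? $text;
}

// The sample from the first post, written with code point escapes.
$sample = "You guys suck \u{263A}\u{FE0F}\u{1F60A}\u{1F60A}";
echo replace_emoji($sample), "\n";   // You guys suck [emoji]
```

Because of the trailing + in the character class, a whole run of smilies becomes a single placeholder rather than one per character.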

w3dk

6:01 pm on Jan 25, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



...a user submits something like this:

You guys suck &#9786;&#65039;&#128522;&#128522;


Is this what the user is actually submitting or after the input has been passed through one of your functions? These are "numeric HTML entities" - plain ASCII - they only become "unicode" when the browser converts them for display. (?)

but I'm losing some context


The cleanup_text() function you posted doesn't seem to be responsible for this? (Or is something missing?)

And PHP7.x no longer recognized ENT_SUBSTITUTE in htmlspecialchars().


I didn't think there were any changes in this regard to PHP 7.x (or PHP8 for that matter)? But I'm not sure how ENT_SUBSTITUTE would really help here?
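A small sketch of what the flags actually do, for comparison. ENT_SUBSTITUTE and ENT_IGNORE only affect byte sequences that are invalid in the stated charset; a valid UTF-8 emoji passes through htmlspecialchars() untouched either way. The variable names are just for illustration.

```php
<?php
// 0xFF can never appear in valid UTF-8, so this string is deliberately broken.
$bad = "a\xFFb";

// ENT_SUBSTITUTE swaps the invalid byte for U+FFFD (the replacement character).
$substituted = htmlspecialchars($bad, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// ENT_IGNORE silently drops the invalid byte.
$ignored = htmlspecialchars($bad, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

// With neither flag, the whole result is an empty string.
$neither = htmlspecialchars($bad, ENT_QUOTES, 'UTF-8');

// A valid emoji is left alone; only the markup characters are escaped.
$emoji = htmlspecialchars("<b>\u{1F60A}</b>", ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($substituted === "a\u{FFFD}b", $ignored === 'ab', $neither === '');
echo $emoji, "\n";
```

So if smilies vanish with ENT_IGNORE, the likely reading is that their bytes were already invalid UTF-8 before htmlspecialchars() ever saw them.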

csdude55

6:28 pm on Jan 25, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, what actually happens is this:

1. When they submit something, the very first thing I do is insert it to a MySQL table as-is, before I change anything. This is just a backup table so that, if there's a problem, I can see what was intended. The database is set to cp1252 West European (latin1), and what I pasted here is a direct copy from that table.

2. On my live site, I've tried setting the Content-Type to both ISO-8859-1 and to UTF-8, using:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

In ISO it will show the emojis as intended. In UTF I just get squares.

3. On the new layout I've been working on, I set the Content-Type using:

<meta charset="UTF-8">

Without using cleanup_text(), I get those same squares. When I do run it through the cleanup_text() function, though, it just comes back as blank.

When I first started developing I was still using PHP 5.x, and then ENT_SUBSTITUTE would fix a lot of weird characters for me automatically. I'm not sure about the emojis, though, I was more worried about quotes, etc. That's why I had to implement several preg_replace(), though; to try to fix them manually.

lucy24

7:05 pm on Jan 25, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



they only become "unicode" when the browser converts them for display.
And then, when posted to the present forum--which is strictly Latin-1 (i.e. de facto 1252)--it changes right back to decimal equivalents. This does not make it easy to ask or answer questions involving anything outside the Latin-1 range, whether it be emoticons, symbols, or a non-Roman script.

In UTF I just get squares.
Yuk. That means it has already been converted into a form that isn't recognizable as UTF-8--a common problem with 1252 or similar content. If everything were retained as numerical entities, the page's stated encoding would make no difference. You could even set the page to ASCII and everything would still display as intended.
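For what it's worth, one way to get that "everything as numerical entities" form in PHP is mb_encode_numericentity() from the mbstring extension. This is only a sketch: the wrapper name to_numeric_entities() is hypothetical, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical wrapper: turn every non-ASCII character into a decimal entity,
// leaving a pure-ASCII string that displays the same under any page charset.
function to_numeric_entities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask]
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo to_numeric_entities("caf\u{00E9} \u{1F60A}"), "\n";   // caf&#233; &#128522;
```

The output must not be run through htmlspecialchars() afterwards, or the ampersands get escaped again and the entities show up as literal text.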

w3dk

7:34 pm on Jan 25, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



The database is set to cp1252 West European (latin1), and what I pasted here is a direct copy from that table.


Well, "something" would seem to have HTML entity encoded it - as written, those aren't unicode characters. (?)

In ISO it will show the emojis as intended. In UTF I just get squares.


ISO-8859-1 and CP1252 are very similar.

(Just to note, if you are sending back an HTTP "Content-Type" response header with a "charset" directive then this will override anything you set in the HTML.)

Without using cleanup_text(), I get those same squares. When I do run it through the cleanup_text() function, though, it just comes back as blank.


It sounds like you need to convert the encoding from CP1252 (from your DB) to UTF-8 (for display):


$resultUTF8 = mb_convert_encoding($stringCP1252,'UTF-8','CP1252');


EDIT: I see (from your earlier thread) you've tried something like this before, but you have the encoding arguments reversed. For some reason the mb_convert_encoding() function takes the arguments in "reverse": TO, FROM. Not FROM, TO.
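A tiny sketch of the argument order, using a hand-made CP1252 string instead of real database content (the sample bytes are an assumption for illustration). It uses mbstring's canonical name 'Windows-1252' for CP1252.

```php
<?php
// CP1252 bytes: 0x93 and 0x94 are curly double quotes, 0xE9 is e-acute.
$stringCP1252 = "\x93caf\xE9\x94";

// Argument order is: string, TO encoding, FROM encoding.
$resultUTF8 = mb_convert_encoding($stringCP1252, 'UTF-8', 'Windows-1252');

// Compare against the same text written with Unicode code point escapes.
var_dump($resultUTF8 === "\u{201C}caf\u{00E9}\u{201D}");   // bool(true)
```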

w3dk

7:40 pm on Jan 25, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



And then, when posted to the present forum--which is strictly Latin-1 (i.e. de facto 1252)--it changes right back to decimal equivalents.


Are you saying this forum creates the numeric HTML entities (and then double-encodes them)?!


You guys suck &#9786;&#65039;&#128522;&#128522;


EDIT: Oh woo, so it does!

@csdude55: Check your original post, that's not actually what you posted is it?

csdude55

8:07 pm on Jan 25, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@csdude55: Check your original post, that's not actually what you posted is it?

What do you mean?

lucy24

9:49 pm on Jan 25, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



and then double-encodes them
Yes, if you look at the HTML you'll see all the & ampersands have been converted to &amp; so your browser can't just re-display the entities. (It seems as if there should be a way around this. When I'm making HTML out of a text file, the conversion goes
&([c\s]|&nbsp;)
>>
&amp;$1
so I don't inadvertently convert anything other than free-standing ampersands. It could of course be done with a lookahead instead of a capture. But this is an old discussion.)

w3dk

12:29 am on Jan 26, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



@csdude55: Check your original post, that's not actually what you posted is it?


What do you mean?


In your first post there is the text "You guys suck &#9786;&#65039;&#128522;&#128522;" - is that literally what you typed? Because, whilst the same text appears in my last post, that is NOT what I typed! I typed the actual unicode characters (smiley faces) and the WebmasterWorld forum has converted it into the text you see here (doubly encoded numeric HTML entities). If that's not what you actually typed, then some of my comments above are irrelevant.

lammert

9:45 am on Jan 26, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The database is set to cp1252 West European (latin1)
Your problem is not PHP, your problem is a mediocre database setup. Change it to accept 4 byte UTF-8 properly (the utf8mb4 character set with utf8mb4_unicode_ci or a similar collation in MySQL) and your problems are solved where they should have been solved from the beginning.

csdude55

5:40 am on Jan 29, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It sounds like you need to convert the encoding from CP1252 (from your DB) to UTF-8 (for display):
$resultUTF8 = mb_convert_encoding($stringCP1252,'UTF-8','CP1252');


@w3dk, I'm assuming that this would be in lieu of htmlspecialchars()? What about the str_replace() and preg_replace() that I had to do before htmlspecialchars()?

I tried it this evening as a replacement for the rest of the function:

$resultUTF8 = mb_convert_encoding($stringCP1252, 'UTF-8', 'CP1252, ISO-8859-1');

but immediately saw a problem with a from the database and an from an RSS feed (eg, "caf"). So I think that I still need to use the str_replace() and preg_replace commands to manually convert, and maybe I also need to manually convert , , , and ?

I also tried htmlentities(), hoping that it would at least fix the letters, but nope :-(

In your first post there is the text "You guys suck &#9786;&#65039;&#128522;&#128522;" - is that literally what you typed?

Oh, I see... yes, I actually copy and pasted exactly what you see (&#9786;&#65039;&#128522;&#128522;). That's what was in my database.

Your problem is not PHP, your problem is a mediocre database setup. Change it to accept 4 byte UTF-8 properly (utf8mb4_unicode_ci or one of the alikes in MySQL) and your problems are solved where they should have been solved from the beginning.

I don't disagree, @lammert, but wouldn't that require some sort of revisions to the existing data?

csdude55

10:17 pm on Jan 29, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, I think I might have stumbled across a... solution? I'm hesitant to use that word, but so far so good.

Step one was, after connecting to MySQL (using procedural mysqli), I added:

mysqli_set_charset($dbh, 'utf8');


That seemed to solve a lot of my issues :-) Then I modified the cleanup_text() function to this:


function cleanup_text($text) {

// zero width space
$text = str_replace('&#8203;', '', $text);

// changed the preg_replace() to str_replace() cause it's considerably faster
// I could have done this with a single array, but this was easier to read. I might
// change it before I go live
//
// note that htmlentities() used later would convert these to something like
// &acirc;&euro;&whatever; , but that just displays the â€œ on the screen
// instead of the desired " or whatever. There's probably a better way to
// handle these, but I haven't found it yet so for now I'm just converting
// the ones I know
$text = str_replace(['â€¢', '•'],
'', $text);

$text = str_replace(['â€¦', '…'],
'...', $text);

$text = str_replace(['â€“', '–', 'â€”', '—'],
'-', $text);

$text = str_replace(['â€˜', '‘', 'â€™', '’'],
"'", $text);

$text = str_replace(['â€œ', '“', 'â€', '”'],
'"', $text);

// repeated whitespace
$text = preg_replace('# {2,}#', ' ',
trim($text)
);

// the only lingering issues are with < and > being changed to &lt; and &gt; ,
// and sometimes &#foo; is changed to &amp;#foo; . So fix those
return str_replace(
// from
['&amp;#', '&lt;', '&gt;'],

// to
['&#', '<', '>'],

// htmlentities() uses ENT_SUBSTITUTE where htmlspecialchars() doesn't
htmlentities($text, ENT_SUBSTITUTE, 'UTF-8')
);
}


I tried using htmlspecialchars() and mb_convert_encoding(), but at this point it looks like htmlentities() was the only thing that did at least close to what I needed.

Any other suggestions on how to improve it? Or how to convert â€-whatever to the UTF8-friendly symbol?

lucy24

1:51 am on Jan 30, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Conceptually it's two steps:

Step 1: convert anything in the form blahblah into its 1252 encoding (one byte per character)
Step 2: reinterpret everything as UTF-8

where blahblah really means ([stuff][morestuff]|[stuff][morestuff][morestuff]|[stuff][morestuff][morestuff][morestuff]). The first [stuff] is a set of characters from Â to ß (30 total); the second and third are different sets of 16 characters. All [morestuff] are the 64 characters that Windows has in the 80 - BF range. Yes, they'd have to be listed individually.

The good news is that you don't need to understand the above, because I have no idea how, or even whether, the convert and reinterpret parts can be done. (As noted elsewhere, I only speak about three words of php, and this isn't one of them.)
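For anyone who does speak php, the two steps can be sketched with mbstring, without listing the characters by hand. This is a hedged sketch, not a tested fix for the data in this thread: fix_mojibake() is a hypothetical name, it assumes the damaged text is itself valid UTF-8, and it gives up on any string containing characters that CP1252 cannot represent (including the bytes Windows leaves undefined, such as 9D).

```php
<?php
// Hypothetical helper: undo "UTF-8 bytes read as CP1252" damage.
function fix_mojibake(string $text): string
{
    // Step 1: map every character back to its single CP1252 byte.
    $bytes = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Step 2: reinterpret those bytes as UTF-8, but only when they form valid
    // UTF-8 and converting them back reproduces the input exactly.
    // Otherwise the text was not this kind of mojibake, so leave it alone.
    if (mb_check_encoding($bytes, 'UTF-8')
        && mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252') === $text) {
        return $bytes;
    }

    return $text;
}

// a-circumflex, euro sign, oe ligature: the usual wreckage of a left curly quote.
$damaged = "\u{00E2}\u{20AC}\u{0153}Hi";
var_dump(fix_mojibake($damaged) === "\u{201C}Hi");   // bool(true)
```

The round-trip check is the design choice worth noting: text that is already correct UTF-8, such as a genuine curly quote or an emoji, fails the check and is returned unchanged.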

Are you really converting curly quotes and apostrophes (“ ” ‘ ’) into typewriter quotes (" ' only)? Why?

// the only lingering issues are with < and > being changed to &lt; and &gt;
This sounds like a textbook case of two steps forward, one back, as by far the easiest approach: Disregard the problem until everything else is done, and then convert them back again. May as well convert &quot; as well; I think this is the only other thing html5 has to, er, entitize.