Forum Moderators: coopster
function cleanup_text($text) {
$text = str_replace('​', '', $text);// zero width space
$text = preg_replace('#•#u', '', $text);// bullet
$text = preg_replace('#…#u', '...', $text);// ellipsis
$text = preg_replace('#[–—]#u', '-', $text);// en and em dash
$text = preg_replace('#[‘’]#u', "'", $text);// curly single quotes
$text = preg_replace('#[“”]#u', '"', $text);// curly double quotes
$text = trim($text);
$text = preg_replace('# {2,}#', ' ', $text);// Repeated whitespace
// PHP 7.x doesn't do substitutions anymore :-(
// return htmlspecialchars($text, ENT_SUBSTITUTE, 'UTF-8');
return htmlspecialchars($text, ENT_IGNORE, 'UTF-8');
}
You guys suck ☺️😊😊
I know we were talking about this before, but I can't find it now. And I'm pretty sure it wasn't PHP related at the time, anyway.
In his mind he submitted 4 smilies, but they're getting stripped out so it doesn't seem good natured like he intended.
Well, you could globally convert anything in the Advanced Smileys range to a mechanical :-) ... but sooner or later you'll run into some equivalent of the grandparent who thinks "lol" stands for "lots of love", as in "I was sorry to hear about the death of your dog, lol".
Or is it a simpler list of &#1000; - &#8000; are :-), &#8001; - &#12000; are :-(, and so on?
Yes, exactly. Unicode is divided into blocks. A comparatively early set is Dingbats (2700-27BF, or UTF-8 E29C80-E29EBF in three bytes), but mainly you're looking at Miscellaneous Symbols and Pictographs (1F300-1F5FF, or four-byte UTF-8 F09F8C80-F09F97BF) immediately followed by Emoticons (1F600-1F64F, or F09F9880-F09F998F). These are followed in turn by Transport and Map Symbols for another 8 rows, mysteriously including both an occupied and an unoccupied bathtub, but how often will your users need those?
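A minimal sketch of the block-based idea, assuming the input is valid UTF-8 and the mbstring extension is loaded (mb_ord() needs PHP 7.2 or later). The function name emoji_to_entities() is hypothetical, not something posted in this thread. It rewrites any character in the three blocks named above as a decimal entity and leaves everything else alone.

```php
<?php
// Hypothetical helper: turn characters from the Dingbats, Miscellaneous
// Symbols and Pictographs, and Emoticons blocks into decimal entities.
function emoji_to_entities(string $text): string
{
    // The /u flag makes PCRE treat pattern and subject as UTF-8,
    // so \x{...} refers to code points rather than bytes.
    $pattern = '/[\x{2700}-\x{27BF}\x{1F300}-\x{1F5FF}\x{1F600}-\x{1F64F}]/u';

    $result = preg_replace_callback(
        $pattern,
        static function (array $m): string {
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $text
    );

    // preg_replace_callback() returns null on invalid UTF-8 input;
    // fall back to the untouched text in that case.
    return $result ?? $text;
}

echo emoji_to_entities("You guys suck \u{1F60A}"), "\n";
// You guys suck &#128522;
```

Note that the ☺ in the sample post is U+263A, which lives in Miscellaneous Symbols (2600-26FF) and so is not covered by these three ranges; the character class would need extending for it.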
...a user submits something like this:
You guys suck ☺️😊😊
but I'm losing some context
And PHP7.x no longer recognized ENT_SUBSTITUTE in htmlspecialchars().
they only become "unicode" when the browser converts them for display.
And then, when posted to the present forum--which is strictly Latin-1 (i.e. de facto 1252)--it changes right back to decimal equivalents. This does not make it easy to ask or answer questions involving anything outside the Latin-1 range, whether it be emoticons, symbols, or a non-Roman script.
In UTF I just get squares.
Yuk. That means it has already been converted into a form that isn't recognizable as UTF-8--a common problem with 1252 or similar content. If everything were retained as numerical entities, the page's stated encoding would make no difference. You could even set the page to ASCII and everything would still display as intended.
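One way to keep everything as numerical entities, sketched here on the assumption that the text is valid UTF-8 at the point of conversion and that mbstring is available. The wrapper name to_numeric_entities() is hypothetical; mb_encode_numericentity() is the real mbstring function doing the work.

```php
<?php
// Hypothetical wrapper: encode every non-ASCII character as a decimal
// entity so the output is plain ASCII whatever charset the page declares.
function to_numeric_entities(string $utf8): string
{
    // convmap entries: [first code point, last code point, offset, mask]
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

echo to_numeric_entities("caf\u{E9} \u{1F60A}"), "\n";
// caf&#233; &#128522;
```

The result contains only ASCII bytes, so it survives a Latin-1 database column or a page served as ASCII.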
The database is set to cp1252 West European (latin1), and what I pasted here is a direct copy from that table.
In ISO it will show the emojis as intended. In UTF I just get squares.
Without using cleanup_text(), I get those same squares. When I do run it through the cleanup_text() function, though, it just comes back as blank.
$resultUTF8 = mb_convert_encoding($stringCP1252, 'UTF-8', 'CP1252');
And then, when posted to the present forum--which is strictly Latin-1 (i.e. de facto 1252)--it changes right back to decimal equivalents.
You guys suck ☺️😊😊
and then double-encodes them
Yes, if you look at the HTML you'll see all the & ampersands have been converted to &amp; so your browser can't just re-display the entities. (It seems as if there should be a way around this. When I'm making HTML out of a text file, the conversion goes
&([c\s]| )
>>
&amp;$1
so I don't inadvertently convert anything other than free-standing ampersands. It could of course be done with a lookahead instead of a capture. But this is an old discussion.)
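A rough sketch of the lookahead variant in PHP. This is not the poster's exact pattern: it uses a negative lookahead, so an ampersand is escaped only when it does not already begin a named, decimal, or hexadecimal entity. The function name escape_bare_ampersands() is hypothetical.

```php
<?php
// Hypothetical helper: escape only ampersands that do not already
// start an entity such as &amp; or &#9786; or &#x263A;
function escape_bare_ampersands(string $text): string
{
    // (?! ... ) is a negative lookahead: it checks what follows the
    // ampersand without consuming it, so no capture group is needed.
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';

    return preg_replace($pattern, '&amp;', $text) ?? $text;
}

echo escape_bare_ampersands('Tom & Jerry &amp; &#9786; AT&T'), "\n";
// Tom &amp; Jerry &amp; &#9786; AT&amp;T
```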
@csdude55: Check your original post, that's not actually what you posted is it?
What do you mean?
It sounds like you need to convert the encoding from CP1252 (from your DB) to UTF-8 (for display):
$resultUTF8 = mb_convert_encoding($stringCP1252, 'UTF-8', 'CP1252');
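A short illustration of that conversion, plus what happens when it is applied to text that is already UTF-8. The sample byte strings are made up for the demonstration; 'Windows-1252' is mbstring's canonical name for the encoding, with CP1252 as an alias.

```php
<?php
// CP1252 bytes as they might come out of a latin1 column:
// 0x93 and 0x94 are curly double quotes, 0x85 is an ellipsis.
$fromDb = "\x93Hello\x94 \x85";
$utf8   = mb_convert_encoding($fromDb, 'UTF-8', 'Windows-1252');
// $utf8 is now “Hello” … in UTF-8

// Running the same conversion on text that is ALREADY UTF-8 produces
// the familiar three-character debris: each byte of the right single
// quote (E2 80 99) is read as a separate CP1252 character.
$mojibake = mb_convert_encoding("\u{2019}", 'UTF-8', 'Windows-1252');
// $mojibake is now ’

// Converting in the opposite direction undoes that damage.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
// $repaired is the single character ’ again
```

This is where sequences like ’ come from, so it is worth checking which encoding a string is actually in before converting it.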
In your first post there is the text "You guys suck ☺️😊😊" - is that literally what you typed?
Your problem is not PHP, your problem is a mediocre database setup. Change it to accept 4-byte UTF-8 properly (in MySQL, the utf8mb4 character set with a collation such as utf8mb4_unicode_ci) and your problems are solved where they should have been solved from the beginning.
mysqli_set_charset($dbh, 'utf8mb4'); // MySQL's 'utf8' is the 3-byte utf8mb3 and cannot carry emoji
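To see why the 4-byte distinction matters, here is a small sketch that checks whether a string contains any character outside the Basic Multilingual Plane, which is what needs four bytes in UTF-8. The function name needs_utf8mb4() is hypothetical and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical check: does this text contain a character that takes
// four bytes in UTF-8 (code points U+10000 and above)?
function needs_utf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(strlen("\u{263A}"));                 // int(3): U+263A fits in three bytes
var_dump(strlen("\u{1F60A}"));                // int(4): U+1F60A needs four
var_dump(needs_utf8mb4("suck \u{263A}"));     // bool(false)
var_dump(needs_utf8mb4("suck \u{1F60A}"));    // bool(true)
```

So a column or connection limited to 3-byte UTF-8 can hold the older ☺ but not the 😊 from the Emoticons block.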
function cleanup_text($text) {
// zero width space
$text = str_replace('​', '', $text);
// changed the preg_replace() to str_replace() cause it's considerably faster
// I could have done this with a single array, but this was easier to read. I might
// change it before I go live
//
// note that htmlentities() used later would convert these to something like
// â€&whatever; , but that just displays the on the screen
// instead of the desired " or whatever. There's probably a better way to
// handle these, but I haven't found it yet so for now I'm just converting
// the ones I know
$text = str_replace(['•', '•'],
'', $text);
$text = str_replace(['…', '…'],
'...', $text);
$text = str_replace(['–', '–', '—', '—'],
'-', $text);
$text = str_replace(['’', '’', '’', '’'],
"'", $text);
$text = str_replace(['“', '“', '”', ''],
'"', $text);
// repeated whitespace
$text = preg_replace('# {2,}#', ' ',
trim($text)
);
// the only lingering issues are with < and > being changed to &lt; and &gt; ,
// and sometimes &#foo; is changed to &amp;#foo; . So fix those
return str_replace(
// from
['&amp;#', '&lt;', '&gt;'],
// to
['&#', '<', '>'],
// htmlentities() uses ENT_SUBSTITUTE where htmlspecialchars() doesn't
htmlentities($text, ENT_SUBSTITUTE, 'UTF-8')
);
}
// the only lingering issues are with < and > being changed to &lt; and &gt;
This sounds like a textbook case of two steps forward, one back being by far the easiest approach: disregard the problem until everything else is done, and then convert them back again. May as well convert &quot; as well; I think this is the only other thing HTML5 has to, er, entitize.
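An alternative to converting things back afterwards, offered only as a sketch: htmlspecialchars() takes a fourth argument, double_encode, and passing false tells it to leave existing entities alone. Unlike the function above, this keeps < and > escaped rather than restoring them. The wrapper name escape_once() is hypothetical.

```php
<?php
// Hypothetical wrapper: escape HTML special characters, but do not
// re-encode entities that are already present in the text.
function escape_once(string $text): string
{
    return htmlspecialchars(
        $text,
        ENT_QUOTES | ENT_SUBSTITUTE,
        'UTF-8',
        false // double_encode: leave &amp; and &#128522; as they are
    );
}

echo escape_once('1 < 2 &amp; &#128522; & "x"'), "\n";
// 1 &lt; 2 &amp; &#128522; &amp; &quot;x&quot;
```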