More on Unicode, ISO-xxxx, and UTF-8

I know we were talking about this before, but I can't find it now. And I'm pretty sure it wasn't PHP related at the time, anyway.

I have about 20 years of data, and for the last 10 years or so users have submitted their content through a contenteditable. This was fine until recently, because I could just set the meta tag to UTF-8... but now I'm dealing with people on their phones, submitting Unicode smilies or whatever. And PHP 7.x no longer seems to recognize ENT_SUBSTITUTE in htmlspecialchars().

I've been using this to autocorrect everything:

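function cleanup_text($text) {
 $text = str_replace('&#8203;', '', $text);// zero width space

 // Mis-decoded UTF-8 punctuation. "\u{9D}" (PHP 7+) is the invisible control
 // character that can trail these sequences. This assumes the file itself is
 // saved as UTF-8; groups are used instead of [...] classes because the
 // patterns run without the /u modifier.
 $text = preg_replace("#â€¢(?:\u{9D})?#", '•', $text);// bullet
 $text = preg_replace("#â€¦(?:\u{9D})?#", '...', $text);// ellipsis
 $text = preg_replace("#â€(?:“|”)(?:\u{9D})?#", '-', $text);// en and em dash
 $text = preg_replace("#(?:’|â€(?:™|˜))(?:\u{9D})?#", "'", $text);// single quotes
 $text = preg_replace("#(?:&quot;|“|”|â€(?:œ)?(?:\u{9D})?)#", '"', $text);// double quotes

 $text = trim($text);
 $text = preg_replace('# {2,}#', ' ', $text);// Repeated whitespace

// PHP 7.x doesn't do substitutions anymore :-(
// return htmlspecialchars($text, ENT_SUBSTITUTE, 'UTF-8');

 return htmlspecialchars($text, ENT_IGNORE, 'UTF-8');
}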

And this WORKS, but I'm losing some context when a user submits something like this:

You guys suck ☺️😊😊

In his mind he submitted 4 smilies, but they're getting stripped out so it doesn't seem good natured like he intended.
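For reference, here's a minimal sketch of how the two flags treat bytes that aren't valid UTF-8. The sample string is made up for illustration: a valid emoji followed by a stray 0x92 byte. As far as I know, a valid UTF-8 emoji passes through htmlspecialchars() untouched with either flag; only the invalid byte is dropped (ENT_IGNORE) or replaced with U+FFFD (ENT_SUBSTITUTE).

```php
<?php
// Hypothetical sample: a valid emoji followed by a stray byte (0x92)
// that is not valid UTF-8 on its own.
$sample = "ok \u{1F60A} \x92";

// ENT_IGNORE silently drops the invalid byte and keeps everything else.
$ignored = htmlspecialchars($sample, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

// ENT_SUBSTITUTE swaps the invalid byte for U+FFFD (the replacement character).
$substituted = htmlspecialchars($sample, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// With neither flag, one invalid sequence makes the whole result an empty string.
$strict = htmlspecialchars($sample, ENT_QUOTES, 'UTF-8');

var_dump($ignored === "ok \u{1F60A} ");
var_dump($substituted === "ok \u{1F60A} \u{FFFD}");
var_dump($strict === '');
```

If the emoji survives in a test like this but not in cleanup_text(), that would point at one of the earlier replacements rather than at htmlspecialchars() itself.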

Before I waste the next year trying to convert a full list of Unicode and ISO-whatever codes to something I can manage (like images), just to have to do it again with the next iOS update... can you guys and gals suggest any other way to automatically substitute things?

function cleanup_text($text) {
 // zero width space
 $text = str_replace('&#8203;', '', $text);

 // changed the preg_replace() to str_replace() cause it's considerably faster

 // I could have done this with a single array, but this was easier to read. I might
 // change it before I go live
 //
 // note that htmlentities() used later would convert these to something like
 // â€&whatever; , but that just displays the â€ on the screen
 // instead of the desired " or whatever. There's probably a better way to
 // handle these, but I haven't found it yet so for now I'm just converting
 // the ones I know
 //
 // "\u{9D}" (PHP 7+) is the invisible control character that can trail these
 // sequences; this assumes the file itself is saved as UTF-8
 $text = str_replace(["â€¢\u{9D}", 'â€¢'], '•', $text);
 $text = str_replace(["â€¦\u{9D}", 'â€¦'], '...', $text);
 $text = str_replace(["â€“\u{9D}", 'â€“', "â€”\u{9D}", 'â€”'], '-', $text);
 $text = str_replace(["â€™\u{9D}", 'â€™', "â€˜\u{9D}", 'â€˜'], "'", $text);
 $text = str_replace(["â€œ\u{9D}", 'â€œ', "â€\u{9D}", 'â€'], '"', $text);

 // repeated whitespace
 $text = preg_replace('# {2,}#', ' ', trim($text));

 // the only lingering issues are with < and > being changed to &lt; and &gt; ,
 // and sometimes &#foo; is changed to &amp;#foo; . So fix those
 return str_replace(
  // from
  ['&amp;#', '&lt;', '&gt;'],

  // to
  ['&#', '<', '>'],

  // htmlentities() uses ENT_SUBSTITUTE where htmlspecialchars() doesn't
  htmlentities($text, ENT_SUBSTITUTE, 'UTF-8')
 );
}
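If the goal is only to carry smilies and other non-ASCII characters through storage and display without listing them one by one, another possibility is to turn everything above ASCII into numeric character references. This is just a sketch under a couple of assumptions: the mbstring extension is loaded, and entities_for_non_ascii() is a made-up name, not something from this thread.

```php
<?php
// Hypothetical helper (not from the thread): escape the HTML specials first,
// then rewrite every code point above ASCII as a decimal numeric entity.
function entities_for_non_ascii(string $text): string
{
    // ENT_SUBSTITUTE turns invalid bytes into U+FFFD, so mbstring
    // always receives valid UTF-8.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo entities_for_non_ascii("You guys suck \u{1F60A}"), "\n";
```

The output is plain ASCII, so it no longer depends on the page or database charset, and a new emoji in the next iOS update would not need a new substitution rule. The trade-off is that the stored text is harder to search and edit than real UTF-8.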

csdude55
