charset for user generated content - Webmaster General forum at WebmasterWorld - WebmasterWorld

Forum Moderators: phranque

Message Too Old, No Replies

charset for user generated content

csdude55

6:00 am on Jan 4, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I can't quite decide how to handle this. I have some users (most likely mobile) that submit content that's not quite encoding properly; one submits something in UTF-8, the next submits something in ISO-8859.

For example:

csdude’s string
ðŸ¤ªðŸ¤ªðŸ¤ªðŸ˜€ (shows up when charset="utf-8" as 4 emojis)

The first one has an apostrophe that's encoded as iso-8859-1, and the second seems to be in utf-8. So if I set the charset on the page to ISO then the first one shows up correctly while the second shows weird characters, and vice versa if I set it to UTF-8.

Oddly, both of these were posted by the same user!

The data is written to MySQL through a Perl script, then read and shown back to the user through PHP. I tried to force both to utf-8 in Perl using Unicode::String, but it had no impact:

[metacpan.org...]

use Unicode::String qw(utf8);

$str_1 = utf8("csdude’s string");
$str_2 = utf8("ðŸ¤ªðŸ¤ªðŸ¤ªðŸ˜€");

print qq~
$str_1
$str_2
~;

I also use a few RSS feeds where some are UTF-8 and some are ISO-8859-1 (copied to MySQL for caching so I don't ping them constantly), so I guess that I would be better off to figure out a way to save it in the database with the original encoding, then force UTF-8 when I select it via PHP? So I tried using utf8_encode() and mb_convert_encoding() in PHP, but neither converted after the select, either.

Any thoughts or suggestions on a "better" way to do this?

phranque

6:33 am on Jan 4, 2021 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

lucy24 understands weird encoding issues well and I would await her response.

lucy24

6:32 pm on Jan 4, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Oddly, both of these were posted by the same user!

I'm reminded of the behavior of this very forum before its most recent revision (some 7 years ago by now, how time flies). If the first non-ASCII character in a thread was utf-8 encoded, the whole thread would be stored in the database as utf-8; otherwise it defaulted to latin-1. The current forums are strictly Latin-1, which may lead to picturesque results if you open a thread that predates the change.

Now the complication: Browsers are allowed, in fact encouraged, to interpret Latin-1 (8859-1) as Windows-Latin-1 (1252), a superset that includes up to 32 codepoints--the 80-9F range--which are not in 8859-1 and hence can’t be recognized by unicode. (More exactly: They can’t be recognized alone, though they can occur as part of a multi-byte character.) This becomes relevant when...

The first one has an apostrophe that's encoded as iso-8859-1

Curly apostrophes and quotation marks are among the non-ASCII characters that exist in 1-byte encodings such as Latin-1. But the apostrophe is not in the generic Latin-1 that overlaps with unicode, only in Windows-Latin-1. So once a string containing a ’ has been encoded as Latin-1, it can’t be reinterpreted as unicode.

The same issue shows itself in the second line. Although it claims to be Latin-1, it is again Windows-Latin-1; the giveaway character is Ÿ, again in that forbidden 80-9F range.

Now the bad news: I have no idea how to deal with this in your specific environment, in part because I don’t know whether the encoding happens at the user’s end or after it arrives in your database. Somehow you need to intercept material using Latin-1 (i.e. Windows-Latin-1) and change it to UTF-8, whether that's by converting (as required in the first example) or reinterpreting (second example).

If we were allowed to use smileys, I would plumb cosgan dot de for an array of “I dunno” shrugs.

JorgeV

6:47 pm on Jan 4, 2021 (gmt 0)

WebmasterWorld Senior Member

Top Contributors Of The Month

Hello,

You need to use the same charset all along your pipeline.

So, if you opt for UTF-8, you need to create your MySQL / MariaDB database with :

DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci

In your HTML page, you have to set:

<head>
<meta charset="utf-8">
</head>

And for your input form:

<form action="xxx" accept-charset="utf-8">

csdude55

9:48 pm on Jan 4, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Hmm. My MariaDB is set to cp1252 West European (latin1)... not sure if that's good or bad? I assume it's the default, since I don't remember ever changing it.

I should also mention that data is submitted through Rich HTML (contenteditable), which means that it looks right to the user when they're typing or pasting.

If I change it in the database and the form then will it automatically convert whatever it can to the closest utf-8 neighbor, then discard what it doesn't recognize?

csdude55

10:16 pm on Jan 4, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Follow up... my database is 13.5G in size, including indexes. If I change the character set for it, how long should I expect it to take? Do I need to take the database offline?

lucy24

1:23 am on Jan 5, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

cp1252 West European (latin1)

That explains much, since that is the very Windows codepage seen in your examples. I don't know if you want to take it as comforting, or the reverse, that your database appears to be doing exactly what it was told to do (whether by you, a factory default, a gremlin or some other agency).

csdude55

5:57 am on Jan 5, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Following the trail of logic... if the database is converted everything to CP1252 (aka Windows-1252) then, in theory, this would have worked:

$text = iconv(
 "CP1252", // "from"
 "UTF-8//IGNORE", // "to"; //IGNORE should force it to remove anything that can't be converted
 $text);

But alas, it didn't seem to do anything at all :-(

I also tried these, as suggested in the "contributions" on the PHP page, but they had no impact, either:

// this
$text = htmlentities($text, ENT_QUOTES, "Windows-1252");
$text = html_entity_decode($text, ENT_QUOTES , "utf-8");

// and then this
mb_convert_encoding($text, 'CP1252', 'UTF-8');

lucy24

6:10 pm on Jan 5, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

How long has the site been running, and how often is its content revisited? If it isn't too well established, it may be less trouble to toss the existing database, with all its content, and start over again in UTF-8.

csdude55

12:41 am on Jan 6, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

How long has the site been running

Pushing 19 years... :-O

and how often is its content revisited

I'm not entirely sure what you mean... like, old content being viewed by a user? With the exception of usernames, I would say that most data over a month old is really just for the sake of nostalgia.

And, well, also excluding the things that are indexed by Google, like message board posts.

lucy24

2:34 am on Jan 6, 2021 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Oh, dear. If you had been talking about something you just set up last November, it would be a different matter. I can see where you might not want to dump nineteen years' worth of content ;)

The tricky part is what happens if you change the code for the future, from 1252 to utf-8: what then happens when the before/after database is read? It may not be a huge calamity, especially if things like usernames are reliably all-ASCII.

phranque

4:25 am on Jan 25, 2021 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

further discussion of csdude55's issue in the PHP forum:
More on Unicode, ISO-xxxx, and UTF-8 [webmasterworld.com]