Making sure that param data is utf-8

         

csdude55

8:53 pm on Nov 1, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



On the HTML form, I use:

<meta charset="UTF-8">

But I'm still getting an occasional error that, best I can tell, is because the data coming through isn't UTF-8.

I convert the data to a hash using:

use CGI qw(:standard);

%contents = map { $_ => get_data($_) } param;
%_GET = %contents;
sub get_data {
    my $name = shift;
    my @values = param($name);
    return @values > 1 ? \@values : $values[0];
}


I used to use $contents but have begun to use $_GET (to match PHP), so I set both here to cover my bases.

Any suggestions on ensuring that the data is UTF-8?

lucy24

11:46 pm on Nov 1, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



On the HTML form, I use:

<meta charset="UTF-8">
On the form, or on the page as a whole? Either way, a charset declaration doesn't do anything to data coming in (from user to site). It just tells the browser how to render the HTML going in the other direction, from site to user.

:: thinking happily about one now-defunct site’s jaw-dropping mishmash of file encodings, making it impossible to use most of its input fields ::

:: wandering off to learn what on earth the browser’s menu item “Repair text encoding” is supposed to do ::

csdude55

1:10 am on Nov 2, 2022 (gmt 0)



I meant that's at the top of the page :-)

I also added this to the <form>:

accept-charset="UTF-8"

but it seemed to have no impact.

I was hoping there might be a way to modify the Perl script to decode whatever the param is and force it to be UTF-8. Maybe:

use CGI qw(:standard);
use Encode;

%contents = map { $_ => get_data($_) } param;
%_GET = %contents;
sub get_data {
    my $name = shift;
    my @values = encode("utf-8", param($name));
    return @values > 1 ? \@values : $values[0];
}
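
A note on the attempt above: Encode's encode() turns a Perl character string into bytes and takes a single string, so handing it the list from param($name) would treat a second value as its CHECK argument. Form data arrives as bytes, so decode() is probably the direction wanted here. A minimal sketch, assuming the values really are UTF-8; decode_values is a hypothetical helper name, and the multi_param call in the comment assumes a CGI.pm recent enough to provide it:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode each raw byte string as UTF-8.
# With no CHECK argument, malformed bytes are replaced with U+FFFD
# rather than causing an exception.
sub decode_values {
    my @values = map { decode('UTF-8', $_) } @_;
    return @values > 1 ? \@values : $values[0];
}

# In the CGI script this might be called as:
#   decode_values( multi_param($name) )
my $single = decode_values("Z\xC3\xBCrich");   # one value, UTF-8 bytes
my $multi  = decode_values("a", "b");          # several values -> array ref
```

This only interprets the bytes as UTF-8; it does not repair data that was sent in some other encoding.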

lucy24

2:21 am on Nov 2, 2022 (gmt 0)



to decode whatever the param is
Can’t be done. (While investigating “repair text encoding” I fell into a rather interesting rabbit hole [hsivonen.fi]. Interesting to me, at least.) One can often tell whether material is in a multi-byte encoding--most likely though not necessarily UTF-8--or a one-byte encoding. But unless you're matching against a finite set of possible inputs, it would be impossible for a computer of ordinary intelligence to tell which one-byte encoding.
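
To make that concrete, here is a small sketch using the core Encode module. The helper name is_valid_utf8 is hypothetical. It shows that well-formed UTF-8 can be checked mechanically, while a lone high byte is a perfectly legitimate character in more than one one-byte encoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my $utf8_ok  = is_valid_utf8("caf\xC3\xA9");  # e-acute as two UTF-8 bytes
my $utf8_bad = is_valid_utf8("caf\xE9");      # lone 0xE9 is not valid UTF-8

# The same byte decodes cleanly under different one-byte encodings,
# so the byte alone cannot say which one the sender meant.
my $as_1252 = decode('cp1252', "\xE9");  # U+00E9, Latin e with acute
my $as_1251 = decode('cp1251', "\xE9");  # U+0439, Cyrillic short i
```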

Now, you could proceed on the assumption that if it isn't UTF-8 it must be Windows-1252 (Windows Latin-1) and work from there. But only you know whether this is likely to be a safe assumption.
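
If that assumption does hold for your visitors, the fallback could look something like this sketch. bytes_to_text is a hypothetical helper name, and the Windows-1252 fallback is exactly the assumption described above, not something the script can verify:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8 first, then fall back.
sub bytes_to_text {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; eval catches that.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;
    # Assumption: anything that is not valid UTF-8 is Windows-1252.
    return decode('cp1252', $bytes);
}

my $from_utf8 = bytes_to_text("na\xC3\xAFve");  # valid UTF-8 input
my $from_1252 = bytes_to_text("na\xEFve");      # Windows-1252 input
```

Some Windows-1252 byte sequences also happen to be valid UTF-8, so this heuristic will occasionally guess wrong in favor of UTF-8.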