Maybe we need to elaborate a bit further:
As far as I understand (might not be perfect - so I very much welcome corrections)
MySQL 5
(I no longer use older versions, and I doubt they behave the same.)
It uses a character set: this is how the data is encoded. It determines the binary representation of an A, and likewise of every other character it supports.
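To make "binary representation" concrete, here is a minimal PHP sketch, assuming PHP 7+ with the mbstring extension. It shows the same character stored as different bytes under latin1 (ISO-8859-1) and UTF-8; the variable names are only illustrative.

```php
<?php
// "ü" written as a Unicode escape so this file's own encoding does not matter (PHP 7+).
$utf8 = "\u{00FC}";
// Re-encode the same character as latin1 (ISO-8859-1); needs the mbstring extension.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // c3bc (two bytes)
echo bin2hex($latin1), "\n"; // fc   (one byte)
echo bin2hex('A'), "\n";     // 41   (plain ASCII is the same in both)
```

This is why data written under one character set and read back under another comes out garbled: the bytes are fine, but they are interpreted with the wrong table.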
It also uses a collation: this is how data is compared and sorted. E.g. it determines whether sorting is case sensitive, whether an "ü" and an "ue" should be considered the same (useful e.g. in German), etc.
You can set the character set and the collation for the MySQL instance, for a database, for a table or even for a column. Each character set in MySQL has a default collation.
e.g.
- for the character set latin1, the default collation is latin1_swedish_ci
- for the character set utf8, the default collation is utf8_general_ci
You can see it with the SQL commands
show character set;
and
show collation;
MySQL also uses a character set and collation to communicate with the clients.
To set the communication between a client and the database in utf-8 you use:
SET NAMES 'utf8';
for e.g. interactive use or
charset 'utf8';
to make it persist across reconnections (this one is a command of the mysql command-line client), or if you use the mysqli interface from PHP:
$mysqli->set_charset("utf8");
(object-oriented style, assuming $mysqli is the connection object) or e.g.
mysql_set_charset('utf8', $conn);
for the obsolete mysql interface.
When creating a database you can specify the character set and/or the collation.
CREATE DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
If both are omitted, the server default is used (typically the compile-time default of latin1 and latin1_swedish_ci). If only one is supplied, the other is derived from it (e.g. set the collation to utf8_general_ci and the character set will be utf8).
To change the character set and/or collation afterwards, ALTER DATABASE (for the database default) or ALTER TABLE (for an existing table) can be used.
You can see the character sets and collations in use with commands like:
show variables like "collation%";
show variables like "character%";
When you look at these, do not worry if character_set_server and collation_server are still latin1 rather than utf8: they are only the server defaults, used when you specify nothing. The rest should be utf8.
HTML
Browsers need to know that the data they receive, and the data they send back to you, is in UTF-8, so they too need to be told:
- You can set an HTTP header (IE ignores this)
- You can add an HTML <meta> tag in the <head>
e.g.:
In Apache's httpd.conf (or, I guess, in a .htaccess file):
AddDefaultCharset UTF-8
AddType text/html;charset=utf-8 html
Older standards such as HTML 4 require the meta tag to be:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML5 wants us to use:
<meta charset="utf-8" />
[You want it in both the HTTP header (from Apache and/or header() in PHP) and in the meta tag.]
PHP
To output HTML:
Do not forget to set the content type header. E.g. I use
header('Content-Type: application/xhtml+xml;charset=UTF-8');
for my polyglot HTML5 (take care: with this content type browsers use a strict XML parser, so a single well-formedness error breaks the whole page).
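As a rough sketch of declaring the encoding in both places from PHP, something like the following could be used. It assumes the more forgiving text/html content type instead of the application/xhtml+xml shown above, and render_utf8_page() is a hypothetical helper name, not a built-in.

```php
<?php
// Hypothetical helper: wraps already-escaped body markup in a minimal HTML5 page.
function render_utf8_page(string $title, string $bodyHtml): string
{
    // Declare the encoding in the HTTP header (skipped if output has already started).
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
    // Escape the title as UTF-8 so multi-byte characters survive intact.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    // Declare the encoding again inside the document itself.
    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        . "<meta charset=\"utf-8\">\n"
        . "<title>{$safeTitle}</title>\n"
        . "</head>\n<body>\n{$bodyHtml}\n</body>\n</html>\n";
}

echo render_utf8_page("Gr\u{00FC}\u{00DF}e", '<p>Hello</p>');
```

The header has to be sent before any output, which is why the helper checks headers_sent() first.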
To process strings, you need the mb_ variants.
E.g.
- to validate input: you need to make sure the string is valid UTF-8:
if ( !mb_check_encoding($string,'UTF-8') ) {
//bad input
}
The reason is that a single character in UTF-8 can take several bytes, so a string could contain "unfinished" sequences, leading to a number of potential issues.
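A small sketch of what such "unfinished" sequences look like, assuming the mbstring extension is loaded; the sample byte strings are made up for illustration:

```php
<?php
$complete  = "\xC3\xBC"; // "ü": a complete two-byte UTF-8 sequence
$truncated = "\xC3";     // only the lead byte: an "unfinished" sequence
$mixed     = "abc\xFC";  // "ü" as a single latin1 byte, which is not valid UTF-8

var_dump(mb_check_encoding($complete, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($truncated, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($mixed, 'UTF-8'));     // bool(false)
```

The third case is the one you typically see when a client sends latin1 data to a page that expects UTF-8.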
- strlen() on UTF-8 data returns the size in bytes, not the number of characters, so you need:
mb_strlen($string, 'UTF-8')
There's a truckload of mb_ string functions (see references below) - use them whenever you deal with UTF-8 data.
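To illustrate the difference between the byte-based functions and their mb_ counterparts, here is a short sketch (PHP 7+ with mbstring assumed; the sample word is arbitrary):

```php
<?php
$word = "Gr\u{00FC}\u{00DF}e"; // "Grüße": 5 characters, 7 bytes in UTF-8

echo strlen($word), "\n";             // 7 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 5 (characters)

// substr() counts bytes and can cut a character in half;
// mb_substr() counts characters.
$byteCut = substr($word, 0, 3);             // "Gr" plus the first half of "ü"
$charCut = mb_substr($word, 0, 3, 'UTF-8'); // "Grü"

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
echo mb_strtoupper($charCut, 'UTF-8'), "\n";    // GRÜ
```

Note that the byte-based substr() produced exactly the kind of broken sequence that mb_check_encoding() rejects.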
References:
[php.net...]
[mysql.he.net...]
[php.net...]