The wonders of UTF-8

or: Why “ä” and “ä” aren’t the same

Don’t these characters look the same to you?
To me, they do. Well, now they do. During a project I noticed that one of these characters didn’t show up on screen, although it was clearly visible in the code inspection tools of Chrome and Firefox.

What had happened?
A colleague had copied text from a PDF file and pasted parts of it into a description text.

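It seems that some software, instead of writing the single precomposed character “ä” (U+00E4), writes the decomposed equivalent: a plain “a” (U+0061) followed by the combining mark “¨” (U+0308). Unicode treats both sequences as the same letter, but they are different code points and therefore different bytes in UTF-8.
Many freely available fonts do not include a glyph for the combining mark on its own, which is why the character did not render. This mark is called a trema or diaeresis (in Unicode: COMBINING DIAERESIS).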
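To make the difference visible, here is a small sketch that builds both variants from explicit code point escapes and prints their raw UTF-8 bytes. The variable names are only for illustration, and the \u{...} escapes assume PHP 7 or newer.

```php
<?php
// Both strings render as "ä" but are different byte sequences.
$precomposed = "\u{00E4}";  // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
$decomposed  = "a\u{0308}"; // U+0061 "a" followed by U+0308 COMBINING DIAERESIS

echo bin2hex($precomposed), PHP_EOL; // c3a4
echo bin2hex($decomposed), PHP_EOL;  // 61cc88

// strlen() counts bytes, so the two forms differ in length as well.
echo strlen($precomposed), PHP_EOL;  // 2
echo strlen($decomposed), PHP_EOL;   // 3

// A plain string comparison treats them as different values.
var_dump($precomposed === $decomposed); // bool(false)
```

This is also why a search or a string comparison can miss text that looks identical on screen.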

Fortunately, the php-intl extension already contains a solution for my problem: the Normalizer class.

Here is an example:

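<?php
// Requires the intl extension (php-intl) and PHP 7+ for the \u{...} escapes.
$a = "\u{00E4}";  // precomposed "ä" (U+00E4)
$b = "a\u{0308}"; // "a" followed by the combining diaeresis (U+0308)

// Before normalization the two strings encode differently.
echo urlencode($a);
echo ' ';
echo urlencode($b) . PHP_EOL . PHP_EOL; // %C3%A4 a%CC%88

// Form C (NFC) composes base letter + combining mark into one code point.
$a = Normalizer::normalize($a, Normalizer::FORM_C);
$b = Normalizer::normalize($b, Normalizer::FORM_C);

// After normalization both strings are identical.
echo urlencode($a);
echo ' ';
echo urlencode($b) . PHP_EOL; // %C3%A4 %C3%A4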
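
In a real project it probably makes sense to normalize text once, at the point where it enters the system, instead of at every comparison. The following is only a sketch of that idea: toNfc() is a hypothetical helper name, not something from php-intl, and the fallback on failure is my assumption about what is sensible here.

```php
<?php
// Hypothetical helper: bring incoming text into NFC before storing or comparing it.
function toNfc(string $text): string
{
    // Text that is already in NFC passes through unchanged.
    if (Normalizer::isNormalized($text, Normalizer::FORM_C)) {
        return $text;
    }

    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);

    // normalize() returns false on failure (for example invalid UTF-8);
    // in that case this sketch simply hands back the original text.
    return $normalized === false ? $text : $normalized;
}

// Text as it might arrive from a PDF: "a" + combining diaeresis.
$fromPdf = "Ha\u{0308}ndler";
var_dump(toNfc($fromPdf) === "H\u{00E4}ndler"); // bool(true)
```

Calling such a helper on form input or imported text keeps decomposed characters out of the stored data in the first place.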