The wonders of UTF-8

or: Why “ä” and “ä” aren’t the same

Don’t these characters look the same to you?
To me they do. Well, now they do: during a project I noticed that one of them didn’t show up on screen, even though it was clearly visible in the code inspection tools of Chrome and Firefox.

What had happened?
A colleague had copied and pasted text from a PDF file and used parts of it in a description text.

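It seems that some software, instead of writing the single precomposed character “ä” (U+00E4), writes an equivalent combination: a plain “a” (U+0061) followed by a combining “¨” (U+0308). Both spellings are valid UTF-8 and are meant to render identically.
Often, however, the combining “¨” is not included in publicly available fonts, so the dots simply go missing. This mark is called a trema or dieresis.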
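
To make the difference visible, here is a minimal sketch that prints the raw bytes behind both spellings. It assumes PHP 7.0 or newer for the "\u{...}" escape syntax, and the variable names are only illustrative:

```php
<?php
// Precomposed form: a single code point, U+00E4.
$composed = "\u{00E4}";
// Decomposed form: "a" (U+0061) followed by U+0308 COMBINING DIAERESIS.
$decomposed = "a\u{0308}";

// strlen() counts bytes, so the two spellings differ in length.
echo strlen($composed), ' ', strlen($decomposed), PHP_EOL;   // 2 3

// bin2hex() shows the UTF-8 bytes of each spelling.
echo bin2hex($composed), ' ', bin2hex($decomposed), PHP_EOL; // c3a4 61cc88

// A plain string comparison treats them as different values.
var_dump($composed === $decomposed); // bool(false)
```

So although both strings look like the same letter, a byte-wise comparison, a database lookup or a font without the combining mark will each treat them differently.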

Fortunately, the php-intl package already contains a solution for my problem, the Normalizer class: https://www.php.net/manual/en/normalizer.normalize.php

I attached an example for you:

<?php
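// Requires the intl extension (php-intl) and PHP 7.0+ for the \u{...} escapes.
// The escapes keep both spellings intact even if an editor normalizes pasted text.
$a = "\u{00E4}";   // precomposed: U+00E4
$b = "a\u{0308}";  // decomposed: "a" + U+0308 COMBINING DIAERESIS

// Before normalization the byte sequences differ.
echo urlencode($a);                       // %C3%A4
echo ' ';
echo urlencode($b) . PHP_EOL . PHP_EOL;   // a%CC%88

// Form C (NFC) composes the base letter and the combining mark into one code point.
$a = Normalizer::normalize($a, Normalizer::FORM_C);
$b = Normalizer::normalize($b, Normalizer::FORM_C);

// After normalization both strings are identical.
echo urlencode($a);   // %C3%A4
echo ' ';
echo urlencode($b);   // %C3%A4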
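
If text can arrive from many places (forms, imports, pasted PDF content), it may be easier to normalize it in one spot before saving it. The following is only a sketch under that assumption: toNfc() is a hypothetical helper name and not part of php-intl, and it simply falls back to the unchanged input when the intl extension is missing or normalization fails.

```php
<?php
// Hypothetical helper (not part of php-intl): returns $text in Unicode form C (NFC).
function toNfc(string $text): string
{
    // Without the intl extension there is no Normalizer class; return the input unchanged.
    if (!class_exists('Normalizer')) {
        return $text;
    }

    // Skip the work for strings that are already in form C.
    if (Normalizer::isNormalized($text, Normalizer::FORM_C)) {
        return $text;
    }

    // normalize() returns false on failure (for example invalid UTF-8); keep the original then.
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);

    return $normalized === false ? $text : $normalized;
}

// Example input with a decomposed "ä", as it might come out of a PDF.
$description = "Ba\u{0308}r";

// With intl installed, the result equals the precomposed spelling.
var_dump(toNfc($description) === "B\u{00E4}r");
```

Calling such a helper at the point where text enters the system keeps stored values and later comparisons consistent, instead of normalizing at every place the text is displayed.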

Sources:
https://chars.suikawiki.org/string?s=%C3%A4
https://chars.suikawiki.org/string?s=a%CC%88

https://blog.marcoka.de/index.php/posts/mit-umlauten-ins-21jahrhundert
https://www.php.net/manual/en/normalizer.normalize.php
