Aug 25, 2011

Using PHP to Replace Special Characters with their Equivalents

After having just completed an extensive text file parsing script, I discovered something very annoying. A small army of Microsoft Word-inspired characters had invaded the imported plain text files (courtesy of a number of citation management programs and websites), leaving the text sprinkled with all sorts of ‘fun’ symbols.

For example “Tâ��ms” => “Teams”, but with that extra something added in for visual highlight (or something).

So what was a programmer to do, except create a function to replace a myriad of annoying characters with their equivalents? Since I didn’t want to replace heaps of special characters with *, a space, or an underscore, the only option was to manually assemble a list of the characters that were creating problems, along with their equivalents. This includes the extra special MS Word curly quote set (“quote”), amongst many others.

Without further delay, the normalize_str function:


function normalize_str($str)
{
    // Problem characters (accented letters, MS Word quotes and dashes, stray
    // braces and tildes) mapped to their plain-text equivalents.
    $invalid = array(
        'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z',
        'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A',
        'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E',
        'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
        'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y',
        'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a',
        'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i',
        'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
        'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b',
        'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r', "`" => "'", "´" => "'", "„" => ",",
        "“" => "\"", "”" => "\"", "’" => "'", "{" => "", "~" => "",
        "–" => "-");

    // Swap every key found in the string for its corresponding value.
    $str = str_replace(array_keys($invalid), array_values($invalid), $str);

    return $str;
}
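
For comparison, here is a minimal sketch of the same lookup done with strtr(), which accepts the associative array directly. str_replace() with arrays applies the pairs one after another over the whole string, while strtr() walks the string once and never re-replaces its own output. For a map like the one above the results should be the same. The function name normalize_str_strtr and the shortened map are assumptions made purely for illustration.

```php
<?php
// Hypothetical variant for illustration only; the map is a short excerpt
// of the full list, not a replacement for it.
function normalize_str_strtr($str)
{
    $map = array(
        'é' => 'e', 'ï' => 'i', 'ü' => 'u',
        '“' => '"', '”' => '"', '’' => "'", '–' => '-',
        '{' => '', '~' => '',
    );

    // With an array argument, strtr() replaces each key with its value
    // in a single pass over the string.
    return strtr($str, $map);
}

echo normalize_str_strtr("“Café” – naïve"), "\n"; // prints: "Cafe" - naive
```

Either approach works on the raw bytes of the string, so the source file and the input both need to be saved as UTF-8 for the keys to match.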

And its usage:


$text = "However through the ´actuation of devices and objects´ in the user’s
physical environment pervasive computing also introduces other significant challenges
to a user’s physical privacy. “We introduce four principles” to guide the construction
of physical privacy policies and demonstrate how existing information privacy models can be extended
to address these aspects of physical privacy.<br />Published 2010 in Educãc{c~{ao Formac{c~{ao &
Tecnologias, pages: 59-67<br />Evaluation of New York&acirc;€™s driver improvement program";

echo normalize_str($text);
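
Since the list has to be assembled by hand, it helps to see which characters still slip through. The following is a minimal sketch, assuming valid UTF-8 input. The helper name find_non_ascii is hypothetical and not part of the original script. Run on the output of normalize_str(), it would return the distinct non-ASCII characters that have no entry in the map yet.

```php
<?php
// Hypothetical helper: returns each distinct non-ASCII character in $str,
// in order of first appearance. Assumes $str is valid UTF-8.
function find_non_ascii($str)
{
    // The /u modifier makes the pattern match whole UTF-8 characters
    // rather than individual bytes.
    $count = preg_match_all('/[^\x00-\x7F]/u', $str, $matches);
    if ($count === false || $count === 0) {
        return array(); // nothing found, or the input was not valid UTF-8
    }

    // Drop repeats and reindex so the result is a plain list.
    return array_values(array_unique($matches[0]));
}

print_r(find_non_ascii("user‘s “quoted” text, ü and ü again"));
```

Anything it reports is a candidate for a new key in the $invalid array.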

Enjoy!

9 Comments

  • I’m developing an “app” in PHP, and some file names have accents in them, but PHP does not support UTF-8 yet. So I used your function to replace them with normal characters o/
    I’m also adapting it for use in VBScript.
    Thank you for posting it and helping us! o/

  • Very useful trick. It works, and my problem is solved.
    Thanks, dear

  • You could also use PHP’s Normalizer class, available as of PHP 5.3.0:
    http://uk3.php.net/manual/en/class.normalizer.php

  • Thanks, very helpful post. What I needed was to display the special characters in the browser, so I needed to replace the copied-and-pasted characters with their HTML equivalents. Here is the replacement array:

    $invalid = array(‘Š’=>’S', ‘š’=>’s', ‘Ð’=>’Ð’, ‘d’=>’d', ‘Ž’=>’Z', ‘ž’=>’z',’C'=>’C', ‘c’=>’c', ‘C’=>’C', ‘c’=>’c', ‘À’=>’À’, ‘Á’=>’Á’, ‘Â’=>’Â’, ‘Ã’=>’Ã’,'Ä’=>’Ä’, ‘Å’=>’Å’, ‘Æ’=>’Æ’, ‘Ç’=>’Ç’, ‘È’=>’È’, ‘É’=>’É’, ‘Ê’=>’Ê’, ‘Ë’=>’Ë’,'Ì’=>’Ì’, ‘Í’=>’Í’, ‘Î’=>’Î’, ‘Ï’=>’Ï’, ‘Ñ’=>’Ñ’, ‘Ò’=>’Ò’, ‘Ó’=>’Ó’, ‘Ô’=>’Ô’,'Õ’=>’Õ’, ‘Ö’=>’Ö’, ‘Ø’=>’Ø’, ‘Ù’=>’Ù’, ‘Ú’=>’Ú’, ‘Û’=>’Û’, ‘Ü’=>’Ü’, ‘Ý’=>’Ý’,'Þ’=>’Þ’, ‘ß’=>’ß’, ‘à’=>’à’, ‘á’=>’á’, ‘â’=>’â’, ‘ã’=>’ã’, ‘ä’=>’ä’, ‘å’=>’å’,'æ’=>’æ’, ‘ç’=>’ç’, ‘è’=>’è’, ‘é’=>’é’, ‘ê’=>’ê’, ‘ë’=>’ë’, ‘ì’=>’ì’, ‘í’=>’í’,'î’=>’î’, ‘ï’=>’ï’, ‘ð’=>’ð’, ‘ñ’=>’ñ’, ‘ò’=>’ò’, ‘ó’=>’ó’, ‘ô’=>’ô’, ‘õ’=>’õ’,'ö’=>’ö’, ‘ø’=>’ø’, ‘ù’=>’ù’, ‘ú’=>’ú’, ‘û’=>’û’, ‘ü’=>’ü’, ‘ý’=>’ý’, ‘þ’=>’þ’,'ÿ’=>’ÿ’, ‘R’=>’R', ‘r’=>’r', “`” => “‘”, “´” => “‘”, “„” => “,”, “`” => “‘”,”´” => “‘”, ““” => “\”", “”” => “\”", “´” => “‘”, “’” => “‘”, “{” => “{”, “}” => “}”,”~” => “~”, “–” => “-”, “’” => “‘”, “‘” => “‘”);

    Hope this helps someone!

  • I prefer to use HTML entities. A good site: http://www.html-entities.org

  • I had to do that also. Here is my own take on it.

    http://beta.renoirboulanger.com/blog/2010/06/comment-remplacer-les-caracteres-bizzares-dans-wordpress-lorsqu-on-a-mal-fait-la-conversion/

    The blog post is in French, but I had to do exactly that. BTW, if you have navigation problems, the “beta” version is my Symfony2 bundle layer on top of the WordPress database. Just remove the “beta” in the URL for the full (old) site.

  • Thanks for a Beneficial post; I enjoyed it very much. Erin Yore

  • That is an instructive post. I enjoyed it very much. Nathanial Kendle

  • instructive material – I enjoyed it very much! Tanner Andrick
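
Following up on the Normalizer suggestion in the comments above, here is a minimal sketch of how that class might be used to strip accents. It assumes the intl extension is loaded and the input is valid UTF-8. The function name strip_accents_nfd is hypothetical. The idea is to decompose each accented letter into a base letter plus combining marks, then drop the marks.

```php
<?php
// Hypothetical helper (not part of the original post): strips accents by
// decomposing characters and removing the combining marks.
// Assumes the intl extension is available and $str is valid UTF-8.
function strip_accents_nfd($str)
{
    if (!class_exists('Normalizer')) {
        return $str; // intl not available: leave the input untouched
    }

    // NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($str, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $str; // normalization failed, e.g. on invalid UTF-8
    }

    // \p{Mn} matches non-spacing marks such as the combining accents.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    return $stripped === null ? $str : $stripped;
}

echo strip_accents_nfd("Š é ñ ü"), "\n";
```

Characters such as Đ, Ø, ß and Æ, as well as the curly quotes, have no canonical decomposition, so they pass through this sketch unchanged. A lookup table like the one in the post is still needed for those.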
