[PHP] Fix for UTF-8 characters when using file_get_contents() for sites in ISO-8859-1 charset

in #php, 8 years ago

While using Simple HTML DOM Parser at work, I noticed that my code broke when it parsed sites encoded in ISO-8859-1.

I got strings like this:

HKM Lederreitstiefel CROCO Langl�nge/enge Weite

After some googling, I found this solution:

mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true))

It fixed the previous case:

HKM Lederreitstiefel CROCO Langlänge/enge Weite
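The detect-then-convert step can be sketched as a small helper. This is only an illustration: the function name `toUtf8()` and the ISO-8859-1 fallback are assumptions, not part of the original snippet, and the sample string is a shortened version of the product title above.

```php
<?php
// Hypothetical helper (name and fallback are assumptions, not from the post):
// guess between UTF-8 and ISO-8859-1, then convert the content to UTF-8.
function toUtf8(string $content): string
{
    // Strict mode rejects a candidate encoding if the bytes are invalid in it.
    $source = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1'], true);
    if ($source === false) {
        // Assumption: treat undetectable input as ISO-8859-1.
        $source = 'ISO-8859-1';
    }
    return mb_convert_encoding($content, 'UTF-8', $source);
}

// In ISO-8859-1 the letter "ä" is the single byte 0xE4, which is not valid
// UTF-8 on its own, so the helper converts it to the two bytes 0xC3 0xA4.
$latin1 = "Langl\xE4nge";
echo toUtf8($latin1), PHP_EOL;
```

In real use, `$content` would be the string returned by `file_get_contents()` before it is handed to the parser.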

However, it broke another one:

["currency"] => string(2) "€"
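The `string(2)` length is a hint that something is off, because a euro sign takes three bytes in UTF-8. A plausible cause, assumed here and not verified against the site in question, is that the page was really Windows-1252: there the byte 0x80 is the euro sign, while in ISO-8859-1 it is a control character. A minimal sketch of the difference:

```php
<?php
// Assumption for illustration: the page stores the euro sign as byte 0x80,
// which is what Windows-1252 does.
$byte = "\x80";

// Read as ISO-8859-1, 0x80 maps to the control character U+0080 (two bytes).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Read as Windows-1252, 0x80 maps to U+20AC, the euro sign (three bytes).
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), PHP_EOL; // c280
echo bin2hex($asCp1252), PHP_EOL; // e282ac
```

If that is what happened, putting 'Windows-1252' in place of 'ISO-8859-1' in the conversion might be enough for such pages.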

Finally, I changed it to:

mb_convert_encoding(
    mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)),
    'HTML-ENTITIES',
    'UTF-8'
)

That produced strings like this:

Kard&auml;tsche -Flower- f&uuml;r Kinder

That is better, but still not right. After adding html_entity_decode() calls in the right places, I finally got:

Kardätsche -Flower- für Kinder
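The decoding step might look like the sketch below. The sample string is an entity-encoded version of the product name, of the kind an 'HTML-ENTITIES' conversion produces. The flags are an assumption, since the post does not say which ones were used.

```php
<?php
// Entity-encoded sample, as an 'HTML-ENTITIES' conversion would produce it.
$encoded = 'Kard&auml;tsche -Flower- f&uuml;r Kinder';

// ENT_QUOTES decodes both quote styles, ENT_HTML5 selects the HTML5 entity
// table, and 'UTF-8' is the encoding of the returned string.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL; // Kardätsche -Flower- für Kinder
```

Note that the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2, so on newer versions this round trip raises a deprecation notice.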

This solution rescued my project, which parses websites in different encodings.

I hope it helps somebody. Have a nice day :-)


P.S. Many thanks to the folks here: https://stackoverflow.com/questions/2236668/file-get-contents-breaks-up-utf-8-characters
