PHP: html_entity_decode removing/not showing character

Question

I am having a problem with Â character on my website.

I have a website where users can use a WYSIWYG editor (CKEditor) to fill out their profile. The content is run through HTML Purifier before being put into a database (for security reasons).

The database has all tables set up with the UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on declares a utf-8 charset, and I also use the header() function to set the content-type and charset.

When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the Â character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).

How can I prevent and/or remove this character so I can run the regular expression?

EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.

It is removing empty paragraph tags. For some reason users like to add extra lines when they edit, which makes the website look horrible. It should remove paragraph tags containing only whitespace and/or an &nbsp; entity. Example: dev.lovewichita.org/church/profile/25.html
– kkeith29, Commented Apr 12, 2012 at 17:32
Could you add the failing regexp? Then I can try to recreate the problem locally.
– ANisus, Commented Apr 12, 2012 at 18:40
The regex is: '#<p>([\s\r\n]*)( )?([\s\r\n]*)</p>#'. I threw it together pretty quickly, so I know there is a better way to write it. I used to be good at the syntax but it seems my memory is fading.
– kkeith29, Commented Apr 12, 2012 at 19:08

Community · Accepted Answer · 2017-05-23 11:56:27Z

1

You are probably facing a string that is not actually properly UTF-8 encoded (you wrote that it is, but it isn't). html_entity_decode might then drop invalid UTF-8 byte sequences (e.g. a single-byte-charset encoding of Â) or replace them with a substitution character.

Depending on the PHP version you're using, you get more control over how this is handled through the flags parameter.

Additionally, to find the character you can't see, create a hex dump of the string.
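As a rough sketch of the hex dump idea, the snippet below prints every byte of a string and checks whether the string is valid UTF-8. The helper names hexDump and isValidUtf8 and the sample input are made up for illustration; they are not from the question's code base.

```php
<?php
// Hypothetical helper: space-separated hex bytes of a string.
function hexDump(string $s): string
{
    if ($s === '') {
        return '';
    }
    return implode(' ', str_split(bin2hex($s), 2));
}

// Hypothetical helper: preg_match() with the /u modifier returns false
// when the subject is not valid UTF-8, so it doubles as a validity check.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Assumed sample: a paragraph holding only a non-breaking space entity.
$decoded = html_entity_decode('<p>&nbsp;</p>', ENT_QUOTES, 'UTF-8');

echo hexDump($decoded), "\n";
var_dump(isValidUtf8($decoded));
```

In the dump, the bytes c2 a0 between the tags are the UTF-8 encoding of the non-breaking space (U+00A0). Read as a single-byte charset, the same two bytes show up as Â followed by a space-like character.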

edited May 23, 2017 at 11:56 by Community Bot

answered Apr 12, 2012 at 18:30 by hakre


4 Comments

kkeith29 Over a year ago

I copied and pasted from the older version of the website. Would the text not get converted to a format readable under the UTF-8 charset?

hakre Over a year ago

@kkeith29: That depends. Using UTF-8 does not mean that magically everything works now, it's just a character encoding. I think it's most informative if you add the code you've got problems with to your question and the hexdump of the string you run into problems with.

kkeith29 Over a year ago

The code that produces the text is spread throughout the framework (form class, controllers, models, and helpers), so it is hard to post here. Thank you for mentioning the hexdump; it made me do a lot of research into how that would help, and it greatly expanded my knowledge of how data is turned into text and how charsets play into that. Thanks to you I confirmed it is a charset problem with that text (a space is the culprit; it is being displayed as two characters, Â and a space, due to multi-byte stuff from what I understand).

kkeith29 Over a year ago

It's actually kind of sad that after 7 years, it took me until now to take the time to research that and understand it better.

ANisus · Accepted Answer · 2012-04-12 18:34:43Z

1

Since the character you are talking about exists within the ANSI charset (ISO-8859-1, which is what utf8_decode converts to), you can do this:

utf8_encode(preg_replace($match, $replace, utf8_decode($utf8_text)));

This will, however, destroy any Unicode character that does not exist within the ANSI charset. To avoid this you can always try using mb_ereg_replace, which has multibyte (Unicode) support:

string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
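Another way to keep the text in UTF-8 is preg_replace with the /u modifier. This is a different technique from the two above, shown only as a minimal sketch. The function name removeEmptyParagraphs and the pattern are assumptions, and the pattern only handles plain <p> tags without attributes.

```php
<?php
// Hypothetical helper: strip paragraphs that contain only whitespace,
// a literal non-breaking space (U+00A0), or an undecoded &nbsp; entity.
function removeEmptyParagraphs(string $html): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{00A0} matches the two-byte sequence C2 A0.
    $pattern = '#<p>(?:\s|\x{00A0}|&nbsp;)*</p>#u';

    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on error (for example when the subject
    // is not valid UTF-8); in that case return the input unchanged.
    return $result ?? $html;
}

// Assumed sample input: one real paragraph and two "empty" ones.
$html = "<p>Hi</p><p>\xC2\xA0</p><p> &nbsp;\n</p>";
echo removeEmptyParagraphs($html), "\n";
```

Without the /u modifier the pattern is matched byte by byte, which is one way a decoded non-breaking space can slip past a whitespace check.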

answered Apr 12, 2012 at 18:34 by ANisus
