I am having a problem with the Â character on my website.

I have a website where users can use a WYSIWYG editor (CKEditor) to fill out their profile. The content is run through HTML Purifier before being put into a database (for security reasons).

The database has all tables set up with the UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (this has worked for years; I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8, and I also use the header() function to set the content-type and charset.

When displaying the text, all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding parameter 'utf-8') is removing or not showing the Â character for some reason, and it leaves something behind that causes all of my regexes to fail (there seems to be a character there, but I cannot see it in the source).

How can I prevent and/or remove this character so I can run the regular expression?

EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.

  • what is your regular expression doing? Commented Apr 12, 2012 at 17:29
  • It is removing empty paragraph tags. For some reason users like to add extra lines when they edit, which makes the website look horrible. It should remove paragraph tags containing only whitespace and/or an &nbsp; entity. Example: dev.lovewichita.org/church/profile/25.html Commented Apr 12, 2012 at 17:32
  • +1 for helping the church out Commented Apr 12, 2012 at 18:39
  • Could you add the failing regexp? Then I can try to recreate the problem locally Commented Apr 12, 2012 at 18:40
  • The regex is: '#<p>([\s\r\n]*)(&nbsp;)?([\s\r\n]*)</p>#'. I threw it together pretty quickly, so I know there is a better way to write it. I used to be good at the syntax, but it seems my memory is fading. Commented Apr 12, 2012 at 19:08

2 Answers

You are probably facing a situation where the string is not actually properly UTF-8 encoded (you wrote that it is, but it isn't). html_entity_decode might then replace any invalid UTF-8 byte sequence (e.g. a single-byte-charset encoding of Â) with a substitution character.

Depending on the PHP version you're using, the flags parameter of html_entity_decode gives you more control over how this is handled.

Additionally, to find the character you can't see, create a hexdump of the string.
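One way to produce such a hexdump is sketched below. This is only an illustration, not code from the question: the helper name hex_dump_bytes is made up for this example, and the sample input assumes the profile text contains an &nbsp; entity, as the question's regex suggests. It shows that decoding &nbsp; as UTF-8 yields the two bytes c2 a0, which is the kind of invisible character a byte-wise regex can trip over.

```php
<?php
// Hypothetical helper (not part of the original post): returns the bytes of
// $s as space-separated hex pairs so that invisible characters show up.
function hex_dump_bytes(string $s): string
{
    // bin2hex() gives two hex digits per byte; chunk_split() separates them.
    return trim(chunk_split(bin2hex($s), 2, ' '));
}

// Assumed sample input: a paragraph holding only a non-breaking space entity.
// html_entity_decode() turns &nbsp; into U+00A0, which is two bytes in UTF-8.
$decoded = html_entity_decode('<p>&nbsp;</p>', ENT_QUOTES, 'UTF-8');

echo hex_dump_bytes($decoded), PHP_EOL;
// prints: 3c 70 3e c2 a0 3c 2f 70 3e
```

If the dump shows c2 a0 between the tags, the string is valid UTF-8 containing a no-break space. A lone a0 without the leading c2 would instead point at single-byte (e.g. ISO-8859-1) data being treated as UTF-8.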

4 Comments

I copied and pasted from the older version of the website. Would the text not get converted to a format readable under the UTF-8 charset?
@kkeith29: That depends. Using UTF-8 does not mean that magically everything works now, it's just a character encoding. I think it's most informative if you add the code you've got problems with to your question and the hexdump of the string you run into problems with.
The code that produces the text is spread throughout the framework (form class, controllers, models, and helpers), so it is hard to post here. Thank you for mentioning the hexdump; it made me do a lot of research into how that would help, and it greatly expanded my knowledge of how data is turned into text and how charsets play into that. Thanks to you I confirmed it is a charset problem with that text (a space is the culprit: it is being displayed as two characters, Â and a space, due to multi-byte stuff from what I understand).
It's actually kind of sad that, after 7 years, it took me until now to take the time to research that and understand it better.

Since the character you are talking about exists within the ANSI charset (ISO-8859-1, which is what utf8_decode converts to), you can do this:

utf8_encode(preg_replace($match, $replace, utf8_decode($utf8_text)));

This will, however, destroy any Unicode character that does not exist within the ANSI charset. To avoid this, you can try mb_ereg_replace, which has multibyte (Unicode) support:

string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
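
As a rough sketch of how that call might be applied to the empty-paragraph problem from the question: the helper name strip_empty_paragraphs and the pattern are assumptions made for this illustration, not the asker's code. It also assumes the mbstring extension is loaded and the input is valid UTF-8.

```php
<?php
// Assumption: mbstring is available and the text is valid UTF-8.
mb_regex_encoding('UTF-8');

// Hypothetical helper: removes paragraphs that contain only whitespace,
// a literal &nbsp; entity, or an already-decoded no-break space (U+00A0).
function strip_empty_paragraphs(string $html): string
{
    $nbsp = "\xC2\xA0"; // UTF-8 bytes of U+00A0

    // Unlike preg_replace(), mb_ereg_replace() takes the pattern
    // without surrounding delimiters.
    $pattern = '<p>(?:\s|&nbsp;|' . $nbsp . ')*</p>';

    $result = mb_ereg_replace($pattern, '', $html);

    // On failure the function does not return a string;
    // fall back to the unchanged input in that case.
    return is_string($result) ? $result : $html;
}

$html = "<p>&nbsp;</p><p>Hello</p><p> \xC2\xA0\n</p>";
echo strip_empty_paragraphs($html), PHP_EOL;
```

Listing the no-break space explicitly in the pattern avoids relying on whether \s covers U+00A0 in a given regex configuration.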
