If we check out the documentation of the htmlspecialchars() function in PHP, we see that it has an $encoding parameter to specify the encoding of the input string.

Conversely, I would expect its counterpart, htmlspecialchars_decode(), to have an $encoding parameter as well. However, this is NOT the case.

I want to know exactly why this is the case. There has to be some reason for not including an $encoding parameter in htmlspecialchars_decode().

Surprisingly, there is an $encoding parameter in html_entity_decode(), so what is the point of including it in that function but not in htmlspecialchars_decode()?

  • Very interesting question. I can only guess: I think it is because you can set a default encoding with ini_set('default_charset', 'UTF-8'), and it is somewhat expected that when you "internalize" something, you want it in the format defined as the default. Just my guess, though. Commented Apr 26, 2023 at 13:38
  • This is where PHP has its ambiguities :) Commented Apr 26, 2023 at 13:39

1 Answer

I'd have to guess here slightly, but… htmlspecialchars_decode only decodes a small handful of characters, all of which are ASCII characters. So there's no need to specify the target encoding you want to decode these characters to, as they're represented identically in all ASCII-compatible encodings. Now what if you wanted to decode to a non-ASCII-compatible encoding? That is probably virtually never the case, and you can simply do some encoding conversion before and/or afterwards if you really need that.
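To illustrate the point, here is a minimal sketch that feeds the same markup to htmlspecialchars_decode() once with UTF-8 bytes and once with ISO-8859-1 bytes for "é". The sample strings and variable names are made up for this example; the expectation is that only the ASCII entities change and the non-ASCII bytes pass through untouched either way.

```php
<?php
// The same text around "é", once as UTF-8 bytes and once as ISO-8859-1 bytes.
// (Hypothetical sample strings, written as byte escapes so the source file's
// own encoding does not matter.)
$utf8   = "caf\xC3\xA9 &lt;b&gt; &amp; &quot;";
$latin1 = "caf\xE9 &lt;b&gt; &amp; &quot;";

// No encoding is passed, because the function has no such parameter.
$decodedUtf8   = htmlspecialchars_decode($utf8, ENT_QUOTES);
$decodedLatin1 = htmlspecialchars_decode($latin1, ENT_QUOTES);

// bin2hex makes it visible that the non-ASCII bytes were left alone.
echo bin2hex($decodedUtf8), "\n";
echo bin2hex($decodedLatin1), "\n";
```

In both cases the entities become `<`, `>`, `&` and `"`, which are single ASCII bytes in either encoding.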

PHP has always assumed ASCII for the things that matter to it and arbitrary bytes for anything else that doesn't matter to it, so this function has never received any unified encoding support, just as a lot of other functions haven't either.

The functions htmlspecialchars and html_entity_decode have received this treatment at some point, as the cases where the encoding does matter are probably encountered more often with them. In the case of html_entity_decode, it decodes a wider range of characters and it does matter what encoding you decode those to.
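A small sketch of why the target encoding matters for html_entity_decode(): the same named entity decodes to different bytes depending on the $encoding argument. The entity chosen and the variable names are just for illustration.

```php
<?php
$entity = '&eacute;';

// Same entity, two different target encodings.
$asUtf8   = html_entity_decode($entity, ENT_QUOTES | ENT_HTML401, 'UTF-8');
$asLatin1 = html_entity_decode($entity, ENT_QUOTES | ENT_HTML401, 'ISO-8859-1');

echo bin2hex($asUtf8), "\n";   // c3a9 (two bytes in UTF-8)
echo bin2hex($asLatin1), "\n"; // e9   (one byte in ISO-8859-1)

// htmlspecialchars_decode() does not handle this entity at all,
// so it never has to pick an encoding for it.
$untouched = htmlspecialchars_decode($entity, ENT_QUOTES);
echo $untouched, "\n";         // &eacute;
```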

htmlspecialchars appears to need to know the encoding to properly preserve the string's contents. I don't really understand why, as it would just need to look for certain ASCII bytes to replace, but not passing the correct encoding will garble your non-ASCII text.
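One plausible explanation, going by the documented ENT_SUBSTITUTE and ENT_IGNORE flags, is that htmlspecialchars() validates the input against the given encoding and treats byte sequences that are invalid in that encoding specially. The sketch below hand-writes the bytes `8A BF 8E 9A`, which should be the Shift_JIS encoding of 漢字, to avoid depending on the iconv or mbstring extensions; the variable names are only illustrative.

```php
<?php
// "<漢字 &>" as Shift_JIS bytes (assumed byte values, written out by hand).
$sjis = "<\x8A\xBF\x8E\x9A &>";

// Treated as UTF-8 without ENT_SUBSTITUTE or ENT_IGNORE: the bytes are not
// valid UTF-8, and the manual says an empty string is returned in that case.
$strict = htmlspecialchars($sjis, ENT_QUOTES, 'UTF-8');

// Treated as UTF-8 with ENT_SUBSTITUTE: invalid sequences are replaced
// with U+FFFD, which garbles the text.
$substituted = htmlspecialchars($sjis, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Treated as Shift_JIS: the multibyte characters survive and only the
// special characters are escaped.
$correct = htmlspecialchars($sjis, ENT_QUOTES, 'SJIS');

var_dump($strict === '');
echo bin2hex($substituted), "\n";
echo bin2hex($correct), "\n";
```

If this reading is right, the garbling comes from the validation step rather than from the replacement of the special characters themselves.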


8 Comments

I can relate to what you are saying, but the problem is that htmlspecialchars() does have the parameter, and there seems to be no reason for it to have one.
htmlspecialchars does seem to do… something… with strings besides just looking for the ASCII bytes of HTML special characters and encoding them. For example, htmlspecialchars(iconv("UTF-8", "SJIS", "<漢字 &>")) garbles the input when not passing "SJIS" as the htmlspecialchars $encoding parameter. I'm not entirely sure what it does, but here we are.
You're right. I also noticed that it does more than just look for ASCII bytes. But then I expect htmlspecialchars_decode() to do the same, what do you think?
Is there any way to ask the PHP developers about this, maybe on GitHub?
I imagine digging through the C implementation to see what it does would be a good first step.