19

How can I determine if a string contains non-printable characters/is likely binary data?

This is for unit testing/debugging -- it doesn't need to be exact.

2
  • 3
    How does this question get so little attention? This is a very common problem :( Commented May 21, 2019 at 20:45
  • 2
    How does PHP not make a distinction between a string and an arbitrary byte array? Insane. Commented May 15, 2020 at 6:55

9 Answers

18

This will have to do.

function isBinary($str) {
    return preg_match('~[^\x20-\x7E\t\r\n]~', $str) > 0;
}
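Since the goal here is debugging output, one way to use such a check is to hex-encode anything flagged as binary before printing it. This is only a sketch built on the regex above; `debugFormat()` is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Same heuristic as above: any byte outside printable ASCII, tab, CR, or LF counts as binary.
function isBinary(string $str): bool
{
    return preg_match('~[^\x20-\x7E\t\r\n]~', $str) > 0;
}

// Hypothetical helper: show binary-looking values as hex instead of raw bytes.
function debugFormat(string $value): string
{
    return isBinary($value) ? '0x' . bin2hex($value) : $value;
}

echo debugFormat("plain text"), PHP_EOL;   // printed unchanged
echo debugFormat("\x00\xFF\x10"), PHP_EOL; // printed as 0x00ff10
```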

5 Comments

Unfortunately, this doesn't work with non-English Western languages, as they include characters like ñ (Spanish), ö and ä (Swedish), è, ê and ç (French), and so on.
@IgnacioSegura Good point. I think it might be better to explicitly define the set of control characters, and enable the u flag.
Just add this at the beginning of the function: if (mb_detect_encoding($str)) return false;
@IgnacioSegura It is (apparently) possible to match all characters of all languages; see stackoverflow.com/questions/15861088/….
This code is inherently flawed as it only catches 62% of the cases in ASCII. It will NOT work with non-ASCII languages. This is, at best, NON-PRODUCTION CODE. Also, note, most people do not consider tab/return/linefeed to be printable characters.
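The suggestion in the comments (list the control characters explicitly and enable the `u` flag) might look like the sketch below. The function name `looksBinary()` and the exact set of control characters are assumptions; C1 controls (U+0080 to U+009F) are not covered.

```php
<?php
// Sketch: treat a string as binary if it is not valid UTF-8 or contains
// an ASCII control character other than tab, LF, or CR.
function looksBinary(string $str): bool
{
    // With the u flag, preg_match() returns false when $str is not valid UTF-8.
    $result = preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', $str);

    return $result === false || $result === 1;
}
```

With this variant, accented text such as "mañana" passes as text, while a NUL byte or an invalid UTF-8 sequence is flagged.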
13

I have studied all answers to this question, and ended up with a different solution.

  • The accepted answer, preg_match('~[^\x20-\x7E\t\r\n]~', $str) > 0, flags every non-ASCII character as binary. This includes Latin accents, Chinese, Russian, Greek, Hebrew, Arabic, etc.
  • ctype_print has the same problem as the above.
  • strpos($string, "\0")===FALSE is almost good, but you can have binary data without null characters.
  • preg_match('//u', $params[$index]) is almost identical to the solution I ended up using, but it might throw a warning when dealing with binary data, eg: Compilation failed: invalid UTF-8 string at offset 1, although I haven't been able to replicate this warning.

Detecting whether a string is binary is fuzzy by nature, as there is no specification that defines what is binary and what is not. There are no control characters that we can look for.

What we can do is look for bytes that do not represent a meaningful character in any language.

With that in mind, the most efficient way seems to be to check for UTF-8 compliance on the string:

protected function isBinary(string $data): bool
{        
    return ! mb_check_encoding($data, 'UTF-8');
}

I have written unit tests and it has correctly detected everything so far:

  • ASCII
  • Latin
  • Chinese
  • Greek
  • Hebrew
  • Russian
  • Arabic
  • Japanese

And correctly detected the binaries I used in the unit tests.
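To illustrate what this check does and does not catch, here is a small sketch. The sample byte strings are my own, and `isBinaryUtf8()` is just a standalone name for the method above. Note the last case: control bytes such as NUL are still valid UTF-8, so this check alone does not flag them.

```php
<?php
// Standalone version of the UTF-8 compliance check (requires the mbstring extension).
function isBinaryUtf8(string $data): bool
{
    return !mb_check_encoding($data, 'UTF-8');
}

var_dump(isBinaryUtf8("h\xC3\xA9llo"));              // bool(false): valid UTF-8 ("héllo")
var_dump(isBinaryUtf8(hex2bin('a670c89d4a324e47'))); // bool(true): starts with a stray continuation byte
var_dump(isBinaryUtf8("\x00\x01\x02"));              // bool(false): control bytes are valid UTF-8
```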

3v4l


5

After a few attempts using ctype_ and various workarounds like removing whitespace chars and checking for empty, I decided I was going in the wrong direction. The following approach uses mb_detect_encoding (with the strict flag!) and considers a string as "binary" if the encoding cannot be detected.

So far I haven't found a non-binary string which returns true, and the binary strings that return false only do so if the binary happens to be all printable characters.

/**
 * Determine whether the given value is a binary string by checking to see if it has detectable character encoding.
 *
 * @param string $value
 *
 * @return bool
 */
function isBinary($value): bool
{
    return false === mb_detect_encoding((string)$value, null, true);
}


2

To search for non-printable characters, you can use ctype_print (http://php.net/manual/en/function.ctype-print.php).

3 Comments

@MrTux: Well then combine it with a check for ctype_space
@CBroe Can it be combined? ctype_print($x) || ctype_space($x) won't work. They both check against the entire string.
There's a slight catch: ctype_print() only works reliably with ASCII strings. If you pass it a string that contains non-ASCII characters, it may return unexpected results. Non-ASCII characters include accented latin characters, such as á, greek, chinese, etc
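One way to combine the two ideas from the comments is to remove the allowed whitespace first and then run `ctype_print()` on what is left. This is a sketch only; `hasNonPrintable()` is a hypothetical name, and the ASCII-only caveat mentioned above still applies.

```php
<?php
// Sketch: ignore tab, CR, and LF, then require every remaining byte to be printable ASCII.
function hasNonPrintable(string $str): bool
{
    $stripped = str_replace(["\t", "\r", "\n"], '', $str);

    // ctype_print('') returns false, so the empty case is handled explicitly.
    return $stripped !== '' && !ctype_print($stripped);
}
```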
2

From Symfony database debug tool:

if (!preg_match('//u', $params[$index])) // the string is binary

Detect if a string contains non-Unicode characters.

5 Comments

What does that not match? It matches "hello" and "\x00" and empty strings and everything else I've tried.
It means that a string contains non-unicode characters. Worked for me - it detects 'text' files pulled from an external source that are not text, i.e. contains characters that cannot be entered into mysql text/longtext field. Original purpose: when a database query is exported for logging/debug, it displays "(binary data)" instead of original content, to keep logs readable. Perhaps "binary" is not clearly defined , so several incompatible solutions may exist.
Could you give an example string that returns 0?
@mpen tuobenessere.it/ads.txt. For obvious reasons, I cannot provide it quoted, the browser/formatter will remove the offending character.
preg_match('//u', hex2bin('a670c89d4a324e47')) Ahah..well that returns false. I wonder what's special about that string.
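To make the behaviour discussed in these comments concrete: with the `u` flag, PCRE validates the subject as UTF-8 before matching, so the empty pattern matches any valid string (result 1) and `preg_match()` returns false (not 0) for invalid UTF-8. In the hex example, the first byte 0xA6 is a continuation byte, which cannot start a UTF-8 sequence. A small sketch, with sample labels of my own choosing:

```php
<?php
$samples = [
    'hello'                        => 'hello',
    'NUL byte'                     => "\x00",
    'empty string'                 => '',
    'bytes from the comment above' => hex2bin('a670c89d4a324e47'),
];

foreach ($samples as $label => $value) {
    $result = preg_match('//u', $value);

    // Distinguish "invalid UTF-8" (false plus PREG_BAD_UTF8_ERROR) from a normal result.
    $note = ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR)
        ? 'false (PREG_BAD_UTF8_ERROR)'
        : var_export($result, true);

    echo $label, ': ', $note, PHP_EOL;
}
```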
0

A hacky solution (which I have seen quite often) would be to search for NUL \0 chars.

if (strpos($string, "\0")===FALSE) echo "not binary";

A more sophisticated approach would be to check if the string contains valid unicode.

2 Comments

That's not quite good enough. Many binary strings won't contain a NUL byte.
Yeah, but it's a good indicator. Just checking for unprintable chars (as tabs won't help you, too).
0

I would use a simple ctype_print. It works for me:

public function is_binary(string $string): bool
{
    // ctype_print() returns false if any byte is outside printable ASCII (0x20-0x7E)
    return !ctype_print($string);
}


-1

My assumption is that what the OP wants to do is the following:

$hex = hex2bin("0588196d706c65206865782064617461");
// how to determine if $hex is a BINARY string or a CHARACTER string?

Yeah, this is not possible. Let’s look at WHY:

$string = "1234";

In hexadecimal, the bytes of this string are 31323334. Guess what you get when you do the following?

hex2bin('31323334') == '1234'

You get true. But wait, you may be saying, I specified the BINARY and it should be the BINARY 0x31 0x32 0x33 0x34! Yeah, but PHP doesn’t know the difference. YOU know the difference, but how is PHP going to figure it out?

If the idea is to test for non-printable because reasons, that’s quite different. But no amount of Regex voodoo will allow the code to magically know that YOU want to think of this as a string of binary.

10 Comments

Yeah.. I was and am aware of this fact, but it's good to point out for others :-) I think the probability of at least 1 non-printable char is pretty good for a long enough "binary" string though. Again, it was just for debugging so that I could auto-convert to hex or something instead of printing gibberish.
The odds are 0.578125% that a character will be non-printable. That probability remains true for each byte no matter the length. Worse, it fails with non-ASCII languages. My point was that this is bad practice and should NEVER be used for production code. I would mark your answer as such.
That doesn't sound right at all. My answer claims 158/255 chars as non-printable which is 62%. Given a randomly distributed 16-byte string, the odds are near 100% that isBinary will return true. Where did you come up with your figure? And again, this isn't "production" code, it's "something went wrong and I want to echo that value to the terminal so I can see what it was" code.
My bad. I combined hex x20 with decimal 128 XD … SO 32 characters (0x00-0x1F) + DEL = 33 unprintable characters in ASCII (tab/return/linefeed are seldom considered printable characters, but to each his own). Add 128 = 161 unprintable. 161/256 = 0.6289% chance it will be unprintable. No. The odds are NOT 100% for 16 characters. It’s better odds than Las Vegas, but people in Vegas still win. Your isBinary WILL fail.
62% not 0.62%, very different. And yes, over 99% with just 5 bytes. I used a calculator omnicalculator.com/statistics/probability
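The arithmetic in this thread can be checked with a few lines. The sketch assumes uniformly random, independent bytes (real binary data may not be) and counts what the accepted answer's regex accepts as text: 0x20 to 0x7E (95 values) plus tab, CR, and LF. `chanceFlagged()` is a hypothetical name.

```php
<?php
// Bytes the regex accepts as text: 0x20-0x7E (95 values) plus tab, CR, LF.
const PRINTABLE_BYTES = 95 + 3;

// Probability that at least one of $length uniformly random bytes is flagged.
function chanceFlagged(int $length): float
{
    return 1.0 - (PRINTABLE_BYTES / 256) ** $length;
}

foreach ([1, 5, 16] as $n) {
    printf("%2d bytes: %.5f%%\n", $n, 100 * chanceFlagged($n));
}
```

Under that assumption a single byte is flagged about 61.7% of the time (158/256), five bytes a little over 99%, and 16 bytes almost always.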
-2

Try a regex replace: replace every match of '[[:print:]]' with "". If the result is "", the string contains only printable characters; otherwise it also contains non-printable characters.
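A minimal sketch of that idea, assuming PCRE's POSIX class syntax (the class goes inside a bracket expression) and no `u` flag, so only printable ASCII counts. `onlyPrintable()` is a hypothetical name.

```php
<?php
// Sketch: strip every printable ASCII character; anything left over is non-printable.
function onlyPrintable(string $str): bool
{
    return preg_replace('/[[:print:]]/', '', $str) === '';
}
```

Note that newlines and tabs are not in the print class, so multi-line text is reported as containing non-printable characters.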

