
I'd like to generate a unique identifier based on the content of an array. My initial approach was to simply do:

$key = md5(json_encode($array));

However, I'd like to be absolutely sure that the key is unique, and there is a remote possibility that two distinct arrays produce the same md5 hash. My current idea is to do:

$key = base64_encode(json_encode($array));

This is guaranteed to be unique but produces quite a long key. Can I use sha512 or does this type of hash also have the same potential for key collision as md5? Is there any way to generate a shorter key than the base64 method which is 100% guaranteed to be unique?

To be 100% clear, my question is: How can I generate the shortest possible 100% unique identifier for a set of data?

  • Should it be based on the content of the array? If not, use a UUID or something similar? Commented Apr 14, 2017 at 14:57
  • Yes the array contents - updated question. Not sure I understand your suggestion. The arrays do not contain any unique id. That's exactly what I'd like to generate. Commented Apr 14, 2017 at 14:59
  • base64 isn't a hash; it is the data itself, and it is reversible. The collision chance depends on the hash function you use. I don't think any of them is 100% collision-free, but they should be very close to it. Commented Apr 14, 2017 at 15:00
  • Your question is about generating a 100% guaranteed unique key, but you are using the contents of the array to generate a hash. Should the unique key be derived from the contents of the array or not? That was my question. Commented Apr 14, 2017 at 15:02
  • If you use a hashing function, which has a limited number of possible values, then by definition you can NEVER be 100% sure. It's improbable that you'll ever get a collision during your lifetime, even with MD5, but that's not 100% sure. This is an XY problem; why don't you ask about the real problem you're having? Obviously, you're working with data and you need to ensure you're not receiving duplicates or something similar. I'd leave this perceived solution aside and ask about the real issue which this hashing approach was supposed to solve. Commented Apr 14, 2017 at 15:06

1 Answer

If you want a 100% guaranteed unique key to match your content, then the only way is to use the full length of your content. You can use the json_encode()d string as-is, or you could run it through base64_encode(), bin2hex() or similar if you want a string that doesn't have any "special" characters. No hash function such as md5, sha1, sha256 or sha512 can be 100% unique: the output has a fixed length, so by the pigeonhole principle (https://en.wikipedia.org/wiki/Pigeonhole_principle) there are more possible inputs than hash values once the input can be larger than the hash, and some distinct inputs must share a result.
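As a minimal sketch of the full-content approach, the snippet below builds a key with both base64_encode() and bin2hex() and then decodes it back to the original array. The sample array is hypothetical; the point is only that these encodings are reversible, so two different JSON strings can never map to the same key.

```php
<?php
// Hypothetical sample data; any JSON-encodable array works the same way.
$array = ['id' => 42, 'tags' => ['a', 'b']];

$json = json_encode($array);

// Reversible encodings: distinct JSON strings always yield distinct keys.
$keyBase64 = base64_encode($json); // roughly 4/3 the length of $json
$keyHex    = bin2hex($json);       // exactly twice the length of $json

// Round trip: the original content can be recovered from either key.
$fromBase64 = json_decode(base64_decode($keyBase64), true);
$fromHex    = json_decode(hex2bin($keyHex), true);

var_dump($fromBase64 === $array, $fromHex === $array);
```

Note that the key is only as stable as the JSON string itself: two arrays with the same entries in a different key order encode to different strings, and therefore to different keys.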

In practice, collisions for md5 and sha1 have now been published, but there are stronger hash functions for which no collisions are known and none are expected for a long time. If you use a modern hash algorithm, you can be fairly confident that you will not get any duplicates.
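A short sketch of the hashed variant, assuming sha256 or sha512 counts as a "modern" algorithm for this purpose and using PHP's built-in hash() function. The sample array and variable names are illustrative only. The keys have a fixed length regardless of the array size, but uniqueness is extremely likely rather than guaranteed.

```php
<?php
// Hypothetical sample data.
$array = ['id' => 42, 'tags' => ['a', 'b']];
$json  = json_encode($array);

// Fixed-length hex digests: 64 characters for sha256, 128 for sha512.
$keySha256 = hash('sha256', $json);
$keySha512 = hash('sha512', $json);

// Shorter text form of the same sha256 digest: base64 of the raw
// 32-byte binary output is 44 characters instead of 64.
$keyShort = base64_encode(hash('sha256', $json, true));

echo strlen($keySha256), ' ', strlen($keySha512), ' ', strlen($keyShort), "\n";
```

Choosing sha512 over sha256 makes the key longer without changing the basic trade-off: both are fixed-length, so neither can be 100% collision-free.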


1 Comment

Thanks. I believe this answers my question.
