Hashing algorithms are widely used today to check the integrity of data, but why are they safe to use? A 256-bit hashing algorithm generates a 256-bit representation of the given data, so a 256-bit hash has only 2^256 possible values. But 1 KB of data has 2^8192 different possible values. By the pigeonhole principle, it's mathematically impossible for every piece of data in the world to have a distinct hash value. So why are hashing algorithms safe?
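The counting argument in the question can be checked directly; a quick sketch in Python (the numbers follow purely from the bit counts):

```python
# A 256-bit hash has 2**256 possible outputs.
hash_values = 2 ** 256

# 1 KB = 1024 bytes = 8192 bits, so 2**8192 possible inputs.
kb_inputs = 2 ** 8192

# There are astronomically more 1 KB inputs than hash values,
# so collisions must exist (pigeonhole principle).
print(kb_inputs > hash_values)                 # True
print(kb_inputs // hash_values == 2 ** 7936)   # on average, 2**7936 inputs per hash value
```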
-
What you are looking for is called a "hash collision": two pieces of content that have the same hash. They are very rare due to (a) the construction of the hash algorithms and (b) the entropy of the content itself. — njzk2, Jun 20, 2016
-
There are two main types of hash functions. 1. Those where collisions are allowable, such as those used in dictionary lookup functions, which use a secondary method such as re-hashing or full comparison to eliminate ambiguity. 2. Those that are collision resistant, such as cryptographic hash functions, where a single-bit difference in the input will cause approximately 50% of the output bits to change. SHA-256 is of the second type and safe to use to determine if two files are the same. — zaph, Jun 21, 2016
2 Answers
Hashing algorithms are considered safe for two main reasons:
- They are irreversible. You can't recover the input data by reverse-engineering the output hash value.
- A small change in the input produces a vastly different hash value. E.g. "hello" and "hellp" generate completely different values.
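The second point, often called the avalanche effect, is easy to observe with Python's standard `hashlib` (using SHA-256 as the example algorithm):

```python
import hashlib

# One character differs between the two inputs...
print(hashlib.sha256(b"hello").hexdigest())
print(hashlib.sha256(b"hellp").hexdigest())
# ...yet the two digests share no visible structure.
```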
The assumption being made with data integrity is that the majority of your input is going to be the same between a good copy of the input data and a bad (malicious) copy. A small change in the data will make the hash value completely different. Therefore, if I try to inject any malicious code or data, that small change will completely throw off the value of the hash. When a comparison is done against a known hash value, it is easy to determine whether the data has been modified or corrupted.
You are correct that there is a risk of collisions across an effectively unbounded number of datasets, but when you compare two datasets that are very similar, it is reasonable to assume that the hash values of those two almost-equivalent datasets will be completely different.
Not all hashes are safe. There are good hashes (for some value of "good") where it's sufficiently non-trivial to intentionally create collisions (I think FNV-1a may fall in this category). However, a cryptographic hash, used correctly, would be computationally expensive to generate a collision for.
"Good" hashes generally have the property that small changes in the input cause large changes in the output (rule of thumb is that a single-bit flip in the input cause roughly b bit flips in the output, for a 2b hash). There are some special-purpose hashes where "close inputs generate close hashes" is actually a feature, but you probably would not use those for error detecting, but they may be useful for other applications.
A specific use for FNV-1a is to hash large blocks of data, then compare the computed hash to that of other blocks. Only blocks with a matching hash need to be fully compared to see if they're identical, meaning that a large number of blocks can simply be ignored, speeding up the comparison by orders of magnitude (you can compare one 2 MB block to another in approximately the same time as you can compare its 64-bit hash to the hashes of 256Ki other blocks, although you will probably have a few blocks with colliding hashes).
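A minimal sketch of that block-comparison scheme, with a hand-rolled 64-bit FNV-1a (the constants are the standard FNV offset basis and prime; the block contents and size below are illustrative assumptions):

```python
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF  # truncate to 64 bits
    return h

def find_duplicate_blocks(blocks):
    """Group blocks by hash; byte-compare only blocks whose hashes collide."""
    by_hash = {}
    duplicates = []
    for i, block in enumerate(blocks):
        h = fnv1a_64(block)
        for j in by_hash.get(h, []):
            if blocks[j] == block:  # full comparison only on a hash match
                duplicates.append((j, i))
        by_hash.setdefault(h, []).append(i)
    return duplicates

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
print(find_duplicate_blocks(blocks))  # the first and third blocks match
```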
Note that "just a hash" may not be enough to provide security, you may also need to apply some sort of signing mechanism to ensure that you don't have the problem of someone modifying the hashed-over text as well as the hash.
For simply ensuring storage integrity (essentially "protect against accidental modification" as a threat model), a cryptographic hash without a signature, plus the original size, should be good enough. You would need a really, really unlikely sequence of random events to mutate a fixed-length bit string into another bit string of the same length that gives the same hash. Of course, this does not give you any sort of error correction ability, just error detection.
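A sketch of that hash-plus-size integrity check, assuming SHA-256 and a chunked read (the 64 KiB chunk size is an arbitrary choice):

```python
import hashlib

def fingerprint(path):
    """Return (sha256 hex digest, size in bytes) for a file."""
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):  # read in 64 KiB chunks
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size

def verify(path, expected_digest, expected_size):
    """Detect (not correct) modification against a previously stored fingerprint."""
    digest, size = fingerprint(path)
    return digest == expected_digest and size == expected_size
```

As the answer notes, this only detects corruption; recovering the original data would require redundancy (backups or error-correcting codes) on top of it.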