Hashing algorithms are widely used today to check the integrity of data, but why are they safe to use? A 256-bit hashing algorithm generates a 256-bit representation of the given data, so a 256-bit hash has only 2^256 possible values. But 1 KB of data has 2^8192 different possible values. By the pigeonhole principle, it's mathematically impossible for every piece of data in the world to have a distinct hash value. So why are hashing algorithms safe?
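The counting argument in the question can be checked directly; a quick sketch in Python (the numbers follow purely from the bit counts):

```python
# A 256-bit hash has 2**256 possible outputs.
hash_values = 2 ** 256

# 1 KB = 1024 bytes = 8192 bits, so 2**8192 possible inputs.
kb_inputs = 2 ** 8192

# There are astronomically more 1 KB inputs than hash values,
# so collisions must exist (pigeonhole principle).
print(kb_inputs > hash_values)                 # True
print(kb_inputs // hash_values == 2 ** 7936)   # on average, 2**7936 inputs per hash value
```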
-
What you are looking for is called a "hash collision": two pieces of content that have the same hash. They are very rare due to (a) the construction of the hash algorithms and (b) the entropy of the content itself. — njzk2, Jun 20, 2016
-
There are two main types of hash functions. 1. Those where collisions are allowable, such as those used in dictionary lookup functions, which use a secondary method such as re-hashing or full comparison to eliminate ambiguity. 2. Those that are collision resistant, such as cryptographic hash functions, where a single-bit difference in the input will cause approximately 50% of the output bits to change. SHA-256 is of the second type and safe to use to determine if two files are the same. — zaph, Jun 21, 2016
2 Answers
Hashing algorithms are considered safe for two main reasons:
- They are irreversible. You can't recover the input data by reverse-engineering the output hash value.
- A small change in the input produces a vastly different hash value. E.g. "hello" and "hellp" generate completely different values.
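The second point, often called the avalanche effect, is easy to observe with Python's standard `hashlib` (using SHA-256 as the example algorithm):

```python
import hashlib

# One character differs between the two inputs...
print(hashlib.sha256(b"hello").hexdigest())
print(hashlib.sha256(b"hellp").hexdigest())
# ...yet the two digests share no visible structure.
```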
The assumption being made with data integrity is that the majority of your input is going to be the same between a good copy of the input data and a bad (malicious) copy. A small change in the data will make the hash value completely different. Therefore, if I try to inject any malicious code or data, that small change will completely throw off the value of the hash. When a comparison is done against a known hash value, it is easy to determine whether the data has been modified or corrupted.
You are correct that there is a risk of collisions across an effectively unbounded number of datasets, but when you compare two datasets that are very similar, it is reasonable to assume that the hash values of those two almost-equivalent datasets will be completely different.
Not all hashes are safe. There are good hashes (for some value of "good") where it's sufficiently non-trivial to intentionally create collisions (I think FNV-1a may fall in this category). However, a cryptographic hash, used correctly, would be computationally expensive to generate a collision for.
"Good" hashes generally have the property that small changes in the input cause large changes in the output (rule of thumb is that a single-bit flip in the input cause roughly b bit flips in the output, for a 2b hash). There are some special-purpose hashes where "close inputs generate close hashes" is actually a feature, but you probably would not use those for error detecting, but they may be useful for other applications.
A specific use for FNV-1a is to hash large blocks of data, then compare the computed hash to that of other blocks. Only blocks with a matching hash need to be fully compared to see if they're identical, meaning that a large number of blocks can simply be ignored, speeding up the comparison by orders of magnitude (you can compare one 2 MB block to another in approximately the same time as you can compare its 64-bit hash to the hashes of 256Ki other blocks, although you will probably have a few blocks with colliding hashes).
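A minimal sketch of that block-comparison scheme, with a hand-rolled 64-bit FNV-1a (the constants are the standard FNV offset basis and prime; the block contents and size below are illustrative assumptions):

```python
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF  # truncate to 64 bits
    return h

def find_duplicate_blocks(blocks):
    """Group blocks by hash; byte-compare only blocks whose hashes collide."""
    by_hash = {}
    duplicates = []
    for i, block in enumerate(blocks):
        h = fnv1a_64(block)
        for j in by_hash.get(h, []):
            if blocks[j] == block:  # full comparison only on a hash match
                duplicates.append((j, i))
        by_hash.setdefault(h, []).append(i)
    return duplicates

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
print(find_duplicate_blocks(blocks))  # the first and third blocks match
```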
Note that "just a hash" may not be enough to provide security, you may also need to apply some sort of signing mechanism to ensure that you don't have the problem of someone modifying the hashed-over text as well as the hash.
For simply ensuring storage integrity (essentially "protect against accidental modification" as a threat model), a cryptographic hash without a signature, plus the original size, should be good enough. You would need a really, really unlikely sequence of random events to mutate a fixed-length bit string into another bit string of the same length that gives the same hash. Of course, this does not give you any sort of error correction ability, just error detection.
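A sketch of that hash-plus-size integrity check, assuming SHA-256 and a chunked read (the 64 KiB chunk size is an arbitrary choice):

```python
import hashlib

def fingerprint(path):
    """Return (sha256 hex digest, size in bytes) for a file."""
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):  # read in 64 KiB chunks
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size

def verify(path, expected_digest, expected_size):
    """Detect (not correct) modification against a previously stored fingerprint."""
    digest, size = fingerprint(path)
    return digest == expected_digest and size == expected_size
```

As the answer notes, this only detects corruption; recovering the original data would require redundancy (backups or error-correcting codes) on top of it.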