How should I implement a string hashing function for these requirements?

Question

Ok, I need a hashing function to meet the following requirements. The idea is to be able to link together directories that are part of the same logical structure but stored in different physical areas of the file system.

I need to implement it in Java; it must be consistent across execution sessions, and it can return a long.

I will be hashing directory names / strings. This should work so that "somefolder1" and "somefolder2" will return different hashes, as would "JJK" and "JJL". I'd also like some idea of when clashes are likely to occur.

Any suggestions?

Thanks

Joey · Accepted Answer · 2010-01-22 13:24:00Z

4

Well, nearly all good hashing functions have the property that small changes in the input yield large changes in the output, so "somefolder1" and "somefolder2" will almost certainly yield different hashes. No fixed-size hash can guarantee that for every pair of inputs, though.

As for clashes, just look at how large the hash output is. Java's own hashCode() returns a 32-bit int, so you can expect clashes more often than with MD5 or SHA-1, for example, which yield 128 and 160 bits, respectively.
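To make the output-size point concrete, here is a small sketch using only the standard String.hashCode(). Its formula is fixed in the String Javadoc, so the values are stable across sessions, but with only 32 bits distinct strings can share a hash. The class and method names are made up for illustration.

```java
// Illustrative only: HashCodeDemo and sameHash are hypothetical names.
class HashCodeDemo {
    // True when two strings share the same 32-bit String.hashCode() value.
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // The two directory names from the question hash differently.
        System.out.println(sameHash("somefolder1", "somefolder2"));
        // "Aa" and "BB" are different strings with the same hashCode().
        System.out.println(sameHash("Aa", "BB"));
    }
}
```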

You shouldn't try creating such a function from scratch, though.

However, I didn't quite understand whether collisions must never occur in your use case or whether they are acceptable if rare. For linking folders I'd definitely use a guaranteed-to-be-unique identifier instead of something that might occur more than once.

edited Jan 22, 2010 at 13:24

answered Jan 22, 2010 at 12:54

Joey


1 Comment

flesh Over a year ago

Clashes are unlikely - the max number of directories at the same level might be 10,000, so my feeling was 64 bits should be enough. What guaranteed-to-be-unique options do I have? I would need to use the hash as an indexed column in a db (not a PK though) ..

Jon Skeet · Accepted Answer · 2010-01-22 12:55:05Z

2

You haven't described under what circumstances different strings should return the same hash.

In general, I would approach designing a hashing function by first implementing the equality function. That should show you which bits of data you need to include in the hash, and which should be discarded. If the equality between two different bits of data is complicated (e.g. case-insensitivity) then hopefully there will be a corresponding hash function for that particular comparison.

Whatever you do, don't assume that equal hashes mean equal keys (i.e. that hashing is unique) - that's always a cause of potential problems.
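As a rough sketch of the equality-first approach, suppose (purely as an assumption, the question does not say so) that directory names should compare case-insensitively. The hypothetical DirectoryKey class below defines equality on a normalized form of the name and derives the hash from exactly the same data, so equal keys always have equal hashes.

```java
import java.util.Locale;

// Hypothetical wrapper: equality is defined first, the hash follows from it.
final class DirectoryKey {
    private final String normalized;

    DirectoryKey(String name) {
        // Assumed rule: names that differ only in case are the same directory.
        this.normalized = name.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof DirectoryKey
                && normalized.equals(((DirectoryKey) other).normalized);
    }

    @Override
    public int hashCode() {
        // Uses only the data that equals() compares.
        return normalized.hashCode();
    }

    public static void main(String[] args) {
        DirectoryKey a = new DirectoryKey("SomeFolder1");
        DirectoryKey b = new DirectoryKey("somefolder1");
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```

The reverse does not hold: two keys with the same hashCode() still need an equals() check before being treated as the same directory.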

answered Jan 22, 2010 at 12:55

Jon Skeet


7 Comments

flesh Over a year ago

"You haven't described under what circumstances different strings should return the same hash." The honest answer is I don't know! My guess is once a string is over a certain length - say 14 chars - because most directories will be short in name. Is that a reasonable requirement?

Jon Skeet Over a year ago

@flesh: Not really. When do you want to treat two strings as being equal? What do you need over and above the normal hashCode method of Java's String class?

flesh Over a year ago

The honest answer is I'm completely new to Java, so if Java's String hashCode is the best fit for the requirements I'm suggesting above (including it being reliable across sessions so I can use it as an id), then great. Otherwise, as I asked the other poster above, what guaranteed-to-be-unique options do I have, bearing in mind I need to use the hash as an indexed column in a db (not a PK or unique though)? Incidentally, when does string.hashCode treat two strings as equal? Thanks for the advice.

Jon Skeet Over a year ago

You don't have any sensible guaranteed-to-be-unique options, really. String.hashCode treats two strings as equal when they are equal - when they're the same sequence of characters. But please don't use hash codes as IDs in a database... hashing is not meant to be used that way.

Thilo Over a year ago

@Jon Skeet: "But please don't use hash codes as IDs in a database". Well, git uses SHA-1 hashes as IDs. I suppose it just depends on how much faith you have in there never being collisions.

Thilo · Accepted Answer · 2010-01-22 13:30:59Z

1

Java's String hashCode will give you an int. If you want a long, you could take the least-significant 64 bits of the MD5 sum for the String.

Collisions could occur, and your system must be prepared for that. If you give a little more detail as to what the hash codes will be used for, we can see whether collisions would cause problems or not.
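The MD5 suggestion above could be sketched roughly as follows. This assumes that "least-significant 64 bits" means the last 8 bytes of the 16-byte digest read as a big-endian long, and that names are encoded as UTF-8; both are choices made here, not something the answer specifies. Md5Hash64 and hash64 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: a 64-bit value derived from the MD5 digest of a string.
class Md5Hash64 {
    static long hash64(String s) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Fixing the encoding keeps the result stable across machines and sessions.
            byte[] digest = md5.digest(s.getBytes(StandardCharsets.UTF_8)); // 16 bytes
            // Assumption: bytes 8..15 of the digest, read big-endian.
            return ByteBuffer.wrap(digest).getLong(8);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(hash64("somefolder1")));
        System.out.println(Long.toHexString(hash64("somefolder2")));
    }
}
```

The result is still a hash, so a database column holding it should be indexed but not declared unique, and lookups should compare the full directory name as well.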

answered Jan 22, 2010 at 13:30

Thilo


RossFabricant · Accepted Answer · 2010-01-22 13:48:05Z

1

With a uniformly random hash function with M possible values, the odds of a collision happening after N hashes are 50% when

N = .5 + SQRT(.25 - 2 * M * ln(.5))

Look up the birthday problem for more analysis.
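For a feel of the numbers, the formula above can be evaluated directly. The sketch below also uses the standard birthday approximation p ≈ 1 - exp(-N(N-1)/(2M)) to estimate the collision chance for a given N; the class and method names are invented for illustration, and both methods assume an ideal, uniformly distributed hash.

```java
// Hypothetical helper for birthday-problem estimates.
class BirthdayBound {
    // Number of hashes N at which the collision probability reaches about 50%
    // for M equally likely hash values (the formula quoted above).
    static double halfCollisionCount(double m) {
        return 0.5 + Math.sqrt(0.25 - 2.0 * m * Math.log(0.5));
    }

    // Approximate probability of at least one collision among n hashes:
    // 1 - exp(-n(n-1)/(2m)), computed with expm1 to keep precision for tiny values.
    static double collisionProbability(double n, double m) {
        return -Math.expm1(-n * (n - 1.0) / (2.0 * m));
    }

    public static void main(String[] args) {
        System.out.println(halfCollisionCount(Math.pow(2, 32)));          // 32-bit int hash
        System.out.println(halfCollisionCount(Math.pow(2, 64)));          // 64-bit long hash
        System.out.println(collisionProbability(10_000, Math.pow(2, 64))); // 10,000 directories
    }
}
```

With a 32-bit hash the 50% point is reached after roughly 77,000 names, while for 10,000 names and a well-distributed 64-bit hash the estimated collision probability is on the order of 10^-12.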

You can avoid collisions if you know all your keys in advance, using perfect hashing.

answered Jan 22, 2010 at 13:48

RossFabricant

