
I am calculating a hash of a text string in Java and C#, the requirement being that identical text strings produce the same hash. I settled on Java's .hashCode() as it is quite simple and straightforward (and I can tolerate a potential collision), or so I thought.
My C# implementation turns out to be unbearably slow.

Here is the implementation in C# (the Java version is almost identical):

        char[] val = string.ToCharArray();
        int hash = 0;
        for (int i = 0; i < string.Count(); i++) {
            hash = 31 * hash + val[i];
        }

Now I pass in two text strings, both read from text files on disk (C#, System.IO.File.ReadAllText). The first is 10 KB, the second is 100 KB.

Java zips right through both of them and generates the result. C# takes about 600 ms for the 10 KB file and then a whopping 50 seconds for the latter. In essence, the C# version does not scale linearly, and at a certain size it becomes infeasible. Given what looks like exponential scaling, and since I can't fathom that ADD and MUL begin to take more time, I suspect some memory management goes haywire when C# indexes the char array. Is this expected behavior ... or what am I missing? :-)

Best regards.

  • Have you tried using val.Length, since the Count method might actually be counting the string each time? Commented Mar 19, 2014 at 14:21
  • Is "string" even legal as a variable name? Commented Mar 19, 2014 at 14:30
  • @ShellShock No, it is not. Commented Mar 19, 2014 at 14:37

1 Answer

for (int i = 0; i < string.Count(); i++) {

In this line, you should either use string.Length (no parentheses) or, preferably, val.Length.

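Count() is a LINQ extension method which, for a string, gets the length by enumerating every character each time you call it. Because the loop condition is evaluated on every iteration, the loop as a whole becomes O(N^2).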
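To make the cost concrete, here is a minimal sketch of what a count that knows nothing about its sequence has to do. NaiveCount is a hypothetical stand-in written for illustration, not the actual source of Enumerable.Count, which additionally short-circuits for collections that expose their size (string is not one of them).

```csharp
using System;
using System.Collections.Generic;

static class CountSketch
{
    // Hypothetical illustration: the only way to count an arbitrary
    // sequence is to walk it from start to end, which is O(N) per call.
    public static int NaiveCount<T>(IEnumerable<T> source)
    {
        int n = 0;
        foreach (T _ in source)
        {
            n++;
        }
        return n;
    }

    static void Main()
    {
        string text = "hello";
        Console.WriteLine(NaiveCount(text)); // walks all five characters
        Console.WriteLine(text.Length);      // reads a stored value, O(1)
    }
}
```

Calling something like NaiveCount in a for-loop condition repeats that full walk once per iteration, which is where the quadratic behavior comes from.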

A more conventional C# implementation of the same algorithm would be:

int hash = 0;
foreach(char c in string)
{
    hash = 31 * hash + c;
}

As pointed out in the comments, string is not a valid variable name in C# since it is a keyword (an alias for System.String), but I kept it here for clarity.
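For completeness, a compilable version might look like the sketch below. The names JavaStyleHash, Compute and text are placeholders chosen here, and the unchecked block is an addition of mine: it makes the 32-bit wraparound explicit, which is how Java's int arithmetic behaves and what the result has to match.

```csharp
using System;

static class JavaStyleHash
{
    // Same recurrence as in the question: hash = 31 * hash + c
    // for each UTF-16 code unit of the input.
    public static int Compute(string text)
    {
        int hash = 0;
        unchecked // let overflow wrap silently, as Java int arithmetic does
        {
            foreach (char c in text)
            {
                hash = 31 * hash + c;
            }
        }
        return hash;
    }

    static void Main()
    {
        // Prints 96354, the value Java gives for "abc".hashCode().
        Console.WriteLine(Compute("abc"));
    }
}
```

This visits each character exactly once, so the running time grows linearly with the length of the input.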


6 Comments

Indeed - using Count() it became an O(N^2) operation!
Thanks a lot, this is it :) .. Consider me wiser. "string" was substituted in my original question; I thought it was transparent.
I wasn't going to, but apparently I am, since I am here typing now... so here goes... If String/string is immutable, why would .Count have to 'count' the length each time? Seems unnecessarily inefficient?
It is unnecessarily inefficient; that's why you should use String.Length instead, which is just a property of a string. The Count() extension method is defined on any type that implements IEnumerable (similar to Iterable in Java) to get the length of a sequence without knowing anything particular about the sequence. String implements IEnumerable because it's essentially a sequence of chars, so you can use the Count() method, but it has no idea what kind of sequence String actually is, so the only way to determine the sequence's length is to enumerate it until it stops.
Okay, I got that. What I am asking is, would it not be safe to override .Count in string/String and just return the length? Not trying to cover or justify my original bad-form code, just curious :)