The difference is how many times equals will be called (and thus, your performance). As an example, let's take a look at the following class:
public class KeyTrial {
    // Counters so we can observe how often each method is called.
    public static Integer eCounter = 0, hCounter = 0;
    Integer value;
    public KeyTrial(Integer val) {
        value = val;
    }
    public Boolean equals(Object o) {
        // Only called when another item shares this item's hash code.
        eCounter++;
        return ((KeyTrial)o).value == value;
    }
    public Integer hashCode() {
        // Uses the value itself as the hash code.
        hCounter++;
        return value;
    }
}
Now, we'll execute the following code:
Set<KeyTrial> trials = new Set<KeyTrial>();
Long startTime = DateTime.now().getTime();
for(Integer i = 0; i < 1000; i++) {
    // Random values spread across a wide range, so collisions are rare.
    trials.add(new KeyTrial((Math.random() * 10000000).intValue()));
}
Long endTime = DateTime.now().getTime();
System.debug(LoggingLevel.ERROR, 'equals called: ' + KeyTrial.eCounter);
System.debug(LoggingLevel.ERROR, 'hashCode called: ' + KeyTrial.hCounter);
System.debug(LoggingLevel.ERROR, 'time elapsed: ' + (endTime - startTime));
The eCounter (the number of times equals is called) will vary from run to run, because equals is only called when there's a hash collision. In a sample run, I got the following output:
14:24:09.1 (129448217)|USER_DEBUG|[7]|ERROR|equals called: 0
14:24:09.1 (129469186)|USER_DEBUG|[8]|ERROR|hashCode called: 1000
14:24:09.1 (129491729)|USER_DEBUG|[9]|ERROR|time elapsed: 117
As you can see, there were no collisions, and adding 1,000 items took only 117 ms.
In contrast, suppose we change hashCode to always return 1. Only the return value changes; here is the modified method:
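public Integer hashCode() {
    hCounter++;
    // Degenerate hash: every KeyTrial now reports the same hash code.
    return 1;
}
Running the same loop again, we get the following output: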
14:23:16.11 (4860282035)|USER_DEBUG|[7]|ERROR|equals called: 508207
14:23:16.11 (4860302147)|USER_DEBUG|[8]|ERROR|hashCode called: 1000
14:23:16.11 (4860324908)|USER_DEBUG|[9]|ERROR|time elapsed: 4856
The time difference is staggering. It took about 41 times more CPU time, because equals had to be called 508,207 times to check whether each new item was a duplicate. With every item sharing one hash code, each insertion has to be compared against everything already stored, which is on the order of 1,000 × 999 / 2 ≈ 500,000 comparisons. The exact numbers will vary, but expect the performance to be absolutely terrible.
This is because Set collections are b-trees. In other words, entries are organized by hash code first, and equals is only called when more than one item already exists for that hash code, to verify whether the new item is a duplicate.
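As a rough illustration only (this is a simplified model, not the actual Apex Set implementation), a hash-first membership check looks something like this:
public class HashLookupSketch {
    // Buckets keyed by hash code; each bucket holds the items sharing that hash.
    Map<Integer, List<KeyTrial>> buckets = new Map<Integer, List<KeyTrial>>();
    public Boolean isDuplicate(KeyTrial candidate) {
        Integer h = candidate.hashCode();        // hashCode narrows the search to one bucket
        if(!buckets.containsKey(h)) {
            return false;                        // nothing shares this hash, so equals never runs
        }
        for(KeyTrial existing : buckets.get(h)) {
            if(existing.equals(candidate)) {     // equals only runs on items with the same hash
                return true;
            }
        }
        return false;
    }
}
With a well-spread hashCode each bucket stays tiny, so equals is rarely needed; with a constant hashCode everything piles into one bucket and equals has to scan all of it.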
B-trees are quick because they can locate an entry even among thousands of items. If you've ever played "guess my number", where someone thinks of a value between 1 and 100 and answers "higher" or "lower" for each guess, you know the right strategy is a binary search: guess 50, 75, 63, 69, 66, 65, instead of counting up from 1 until you hit it. In fact, with a binary search you are guaranteed to hit the correct value within 7 guesses.
Similarly, a binary search needs only one comparison per bit in the size of the item list. For example, with 1,000 items it requires no more than 10 decisions to find the correct hash code.
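If it helps to see the guessing game in code, here is a sketch of a plain binary search over a sorted list of hash codes; it only illustrates the one-comparison-per-bit behavior, and is not how the platform actually stores a Set:
public class BinarySearchSketch {
    // Returns the index of target in sortedHashes, or -1 if it is absent.
    // For 1,000 sorted values the loop body runs at most about 10 times.
    public static Integer find(List<Integer> sortedHashes, Integer target) {
        Integer lo = 0;
        Integer hi = sortedHashes.size() - 1;
        while(lo <= hi) {
            Integer mid = (lo + hi) / 2;
            if(sortedHashes[mid] == target) {
                return mid;
            } else if(sortedHashes[mid] < target) {
                lo = mid + 1;                    // target is in the upper half
            } else {
                hi = mid - 1;                    // target is in the lower half
            }
        }
        return -1;
    }
}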
You can read more about them by searching your favorite search engine, or by looking at how HashMap works in Java.
There are a variety of ways to generate a hash code; you should select an algorithm that suits your purposes. Simple multiplication is a bad choice, because x times y is the same as y times x, so you'd have far more hash collisions, and every hash collision means equals needs to be called, which is bad. Ideally, you want to spread your values across a wide range to avoid collisions. Something like (31 * x) ^ y, on the other hand, will not yield the same value as (31 * y) ^ x: the expression isn't symmetric in its inputs, so swapping the fields produces a different hash and there will be fewer hash collisions.
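As an example of that pattern, a hypothetical two-field key (the class and field names here are made up for illustration) might be written like this:
public class PointKey {
    Integer x, y;
    public PointKey(Integer xVal, Integer yVal) {
        x = xVal;
        y = yVal;
    }
    public Boolean equals(Object o) {
        if(!(o instanceof PointKey)) {
            return false;
        }
        PointKey other = (PointKey)o;
        return other.x == x && other.y == y;
    }
    public Integer hashCode() {
        // Asymmetric in x and y: swapping the fields changes the hash,
        // unlike x * y, where (3, 4) and (4, 3) would collide.
        return (31 * x) ^ y;
    }
}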
As for why 31 was selected, I found this answer, which says it better than I ever could:
Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations.
Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.
31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern java HashMap implementations keep the number of buckets to a power of 2).
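To make the quoted point concrete, here is a small sketch using the quote's hypothetical 8-bucket table, where the bucket index is hash & (buckets - 1):
// Hypothetical 8-bucket table, as in the quote above.
Integer bucketMask = 8 - 1;
Integer y = 5;
for(Integer x = 0; x < 4; x++) {
    Integer hashWith8  = (8 * x) ^ y;    // multiplier shares a factor with the bucket count
    Integer hashWith31 = (31 * x) ^ y;   // multiplier is coprime to the bucket count
    System.debug('x=' + x
        + ', bucket with 8: ' + (hashWith8 & bucketMask)
        + ', bucket with 31: ' + (hashWith31 & bucketMask));
}
// With multiplier 8, every x lands in bucket y & 7 (here, bucket 5), so x is ignored
// and keys differing only in x collide; with multiplier 31, the bucket moves as x changes.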