July 12, 2015

A short hashing tale

Introduction

Have you ever wondered what is the distribution of allocations in data structures which make use of hashing, such as hash tables or hash sets, when using different datasets? How often do we have collisions? How many slots remain empty? Is the size of the data structure important?

I will try to indirectly answer these questions with an experimental presentation.

Birthday paradox

I am just going to say that the birthday paradox has a really useful conclusion; it tells us when the first collision is going to happen! So, let’s say that we have a hash function that takes an arbitrary number of bits as input and outputs an n-bit hash. Then, we are going to see the first collision after 2n/2 additions in the collection (and if you look at this from a security point of view, an n-bit hash is going to break after 2n/2 evaluations of the hash function). Now say that we’re using the 32-bit CRC function. Then we are expecting to see the first collision after hashing 65536 items. However, when you’re hashing and then inserting in a collection – in case its size is less than 216 in this example – you’d expect to see the first collision earlier since we’re wrapping the integer representation of the hash around the collection’s size – unfortunately we don’t have infinite memory.

Description

I used a data set of 5163 male and female first names, which you can find here. I am using a simple integer array to represent whether a cell is empty, occupied, or hosts two or more collided entries. Now my workflow was the following:

  1. Hash each string – I used Java’s hashCode()
  2. Wrap the integer representation of the hash around the size of the collection
  3. Allocate the corresponding slot
  4. Number 0 signifies an empty cell, 1 an occupied cell and 2 or more a collision

Results

I experimented with six different sizes of the collection storing the hashed strings. I start at half the total size (2581) and keep adding half the size until I reach three times the size (15489). I can summarise the whole story with the following table – which is sort of self-explanatory:

Total Empty Collisions
2581 305 (11%) 1552 (60%)
5163 1892 (36%) 1348 (26%)
7744 3979 (51%) 1128 (14%)
10326 6261 (60%) 944 (9%)
12907 8627 (66%) 766 (5%)
15489 11095 (71%) 683 (4%)

Let's take a quick look at the collisions' distribution at Figure 1. There are 2581 slots (x-axis) and the number of collisions is depicted on y-axis.

  1. Hash each string – I used Java’s hashCode()
  2. Wrap the integer representation of the hash around the size of the collection
  3. Allocate the corresponding slot
  4. Number 0 signifies an empty cell, 1 an occupied cell and 2 or more a collision

Figure 1

Figure 2 depicts the percentage of the slots that remained unallocated (x-axis) and the percentage of the slots where we noted collisions (y-axis). Each dot on the curve is a different table size as described earlier. Essentially this is the graphical representation of the table above.

Figure 2

Snip, snap, snout, this tale's told out.