*** Welcome to piglix ***

Hash join


The hash join is an example of a join algorithm and is used in the implementation of a relational database management system.

Hash join is similar to nested loop join but faster than nested loop join and hash join is used for equi join.

The task of a join algorithm is to find, for each distinct value of the join attribute, the set of tuples in each relation which have that value.

Hash joins require an equijoin predicate (a predicate comparing values from one table with values from the other table using the equals operator '=').

The classic hash join algorithm for an inner join of two relations proceeds as follows:

The first phase is usually called the "build" phase, while the second is called the "probe" phase. Similarly, the join relation on which the hash table is built is called the "build" input, whereas the other input is called the "probe" input. It is like merge join algorithm.

This algorithm is simple, but it requires that the smaller join relation fits into memory, which is sometimes not the case. A simple approach to handling this situation proceeds as follows:

This is essentially the same as the block nested loop join algorithm. This algorithm scans more times than necessary.

A better approach is known as the "grace hash join", after the GRACE database machine for which it was first implemented.

This algorithm avoids rescanning the entire relation by first partitioning both and via a hash function, and writing these partitions out to disk. The algorithm then loads pairs of partitions into memory, builds a hash table for the smaller partitioned relation, and probes the other relation for matches with the current hash table. Because the partitions were formed by hashing on the join key, it must be the case that any join output tuples must belong to the same partition.


...
Wikipedia

...