Hashing
Static Hashing
- A bucket is a unit of storage containing one or more entries (a bucket is typically a disk block).
- We obtain the bucket of an entry from its search-key value using a hash function.
- Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
- The hash function is used to locate entries for access, insertion, and deletion.
- Entries with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate an entry.
- In a hash index, buckets store entries with pointers to records.
- In a hash file organization, buckets store the records themselves.
Hash Functions
- A good hash function gives an average-case lookup that is a small constant, independent of the number of search keys.
- We hope records are distributed uniformly among the buckets.
- The worst hash function maps all keys to the same bucket, while the best hash function maps all keys to distinct addresses.
- Ideally, distribution of keys to addresses is uniform and random.
- Suppose we have 26 buckets, and map names beginning with the ith letter of the alphabet to the ith bucket.
  - Problem: this does not give a uniform distribution.
  - Many more names will be mapped to A than to X.
- Typical hash functions perform some operation on the internal binary machine representations of characters in a key.
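A minimal sketch in Python of such a function, operating on the byte representation of a string key (the multiplier and bucket count are illustrative choices, not from any particular system):

```python
NUM_BUCKETS = 26

def h(search_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a search-key value to a bucket address in [0, num_buckets)."""
    value = 0
    for byte in search_key.encode("utf-8"):
        # Mix in each byte of the key's binary representation.
        value = (value * 31 + byte) % (2**32)
    return value % num_buckets

# Unlike the first-letter scheme, names starting with the same letter
# now spread across different buckets:
for name in ["Adams", "Allen", "Avery", "Xu"]:
    print(name, "->", h(name))
```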
Handling of Bucket Overflows
- Bucket overflow can occur because of:
  - insufficient buckets
  - skew in the distribution of records, which can arise because:
    - multiple records have the same search-key value
    - the chosen hash function produces a non-uniform distribution of key values
- Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets
- Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.
- The above scheme is called closed addressing (or open hashing, depending on the book you use).
- Alternative: open addressing (also called closed hashing)
  - It does not use overflow buckets, and is not suitable for database applications because it does not support deletion efficiently.
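A minimal sketch of overflow chaining in Python, assuming fixed-capacity buckets (the capacity and Python's built-in hash stand in for real disk blocks and a real hash function):

```python
BUCKET_CAPACITY = 2  # illustrative; a real bucket holds one disk block

class Bucket:
    def __init__(self):
        self.entries = []      # up to BUCKET_CAPACITY (key, record) pairs
        self.overflow = None   # next overflow bucket in the chain

class StaticHashFile:
    def __init__(self, num_buckets=4):
        self.buckets = [Bucket() for _ in range(num_buckets)]

    def _address(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, record):
        bucket = self.buckets[self._address(key)]
        # Walk the chain to the first bucket with free space,
        # allocating a new overflow bucket at the end if needed.
        while len(bucket.entries) >= BUCKET_CAPACITY:
            if bucket.overflow is None:
                bucket.overflow = Bucket()
            bucket = bucket.overflow
        bucket.entries.append((key, record))

    def lookup(self, key):
        """Scan the whole chain sequentially, as described above."""
        results = []
        bucket = self.buckets[self._address(key)]
        while bucket is not None:
            for k, record in bucket.entries:
                if k == key:
                    results.append(record)
            bucket = bucket.overflow
        return results
```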
Deficiencies of Static Hashing
In static hashing, the hash function maps search-key values to a fixed set of bucket addresses, but databases grow or shrink with time.
- If the initial number of buckets is too small, and the file grows, performance will degrade due to excessive overflow buckets.
- If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
- If the database later shrinks, space will again be wasted.
One solution: periodic re-organization of the file with a new hash function
- expensive, disrupts normal operations
Better solution: allow the number of buckets to be modified dynamically
Dynamic Hashing
- Periodic rehashing (see the sketch after this list)
  - If the number of entries in a hash table becomes (say) 1.5 times the size of the hash table:
    - create a new hash table of (say) twice the size of the previous hash table
    - rehash all entries into the new table
- Linear hashing
  - Do the rehashing in an incremental manner
- Extendable hashing
  - Tailored to disk-based hashing, with buckets shared by multiple hash values
  - Doubles the number of entries in the bucket address table without doubling the number of buckets
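A minimal sketch of the periodic-rehashing policy, taking the 1.5 load-factor trigger and the 2x growth factor as the example values above (in-memory chains stand in for disk buckets):

```python
class RehashingTable:
    def __init__(self, size=4):
        self.table = [[] for _ in range(size)]
        self.num_entries = 0

    def insert(self, key, value):
        # Trigger: number of entries exceeds 1.5 times the table size.
        if self.num_entries > 1.5 * len(self.table):
            self._rehash(2 * len(self.table))
        self.table[hash(key) % len(self.table)].append((key, value))
        self.num_entries += 1

    def _rehash(self, new_size):
        # Expensive, all-at-once step: every entry is moved to a new
        # table whose size is double the old one.
        old = self.table
        self.table = [[] for _ in range(new_size)]
        for chain in old:
            for key, value in chain:
                self.table[hash(key) % new_size].append((key, value))
```

Linear hashing and extendable hashing avoid the pause caused by this all-at-once rehash step by splitting one bucket at a time instead.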
Comparison of Ordered Indexing and Hashing
- Cost of periodic re-organization
- Relative frequency of insertions and deletions
- Is it desirable to optimize average access time at the expense of worst-case access time?
- Expected type of queries:
- Hashing is generally better at retrieving records having a specified value of the key.
- If range queries are common, ordered indices are to be preferred
- In practice:
  - PostgreSQL supports hash indices, but discourages their use due to poor performance
  - Oracle supports static hash organization, but not hash indices
  - SQL Server supports only B+-trees
Multiple-Key Access
- Composite search keys are search keys containing more than one attribute
  - E.g., (dept_name, salary)
- Lexicographic ordering: (a1, a2) < (b1, b2) if either
  - a1 < b1, or
  - a1 = b1 and a2 < b2
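Python tuples happen to compare by exactly this rule, which makes the definition easy to check:

```python
# a1 < b1: "Biology" precedes "Physics", so salaries are irrelevant.
assert ("Biology", 90000) < ("Physics", 70000)
# a1 = b1 and a2 < b2: same department, compare salaries.
assert ("Physics", 70000) < ("Physics", 95000)

keys = [("Physics", 95000), ("Biology", 90000), ("Physics", 70000)]
print(sorted(keys))
# [('Biology', 90000), ('Physics', 70000), ('Physics', 95000)]
```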
Creation of Indices
Most database systems allow specification of the type of index and whether it is clustering. Indices on the primary key are created automatically by all databases: when a tuple is inserted, the index can be used to check that the primary-key constraint is not violated.
Some databases also create indices on foreign key attributes.
Indices can greatly speed up lookups, but impose a cost on updates. Index tuning assistants/wizards, supported on several databases, help choose indices based on the query and update workload.
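As a concrete illustration, using SQLite through Python's standard sqlite3 module (syntax and automatic-index behavior vary across systems):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor (id INTEGER PRIMARY KEY, dept_name TEXT, salary REAL)"
)

# Explicitly create a secondary index on a non-key attribute.
conn.execute("CREATE INDEX idx_dept ON instructor(dept_name)")

conn.execute("INSERT INTO instructor VALUES (1, 'Physics', 95000)")
try:
    # The primary key is checked on insert; a duplicate key fails.
    conn.execute("INSERT INTO instructor VALUES (1, 'Biology', 72000)")
except sqlite3.IntegrityError as e:
    print("primary-key violation detected:", e)
```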
Write-Optimized Indices
- Performance of B+-trees can be poor for write-intensive workloads
  - One I/O per leaf, assuming all internal nodes are in memory
  - With magnetic disks: < 100 inserts per second per disk
  - With flash memory: one page overwrite per insert
- Two approaches to reducing the cost of writes (a sketch of the first follows this list)
  - Log-structured merge tree
  - Buffer tree
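A highly simplified sketch of the log-structured merge idea (the buffer size and structure are illustrative assumptions; real LSM trees merge sorted runs into levels on disk):

```python
import bisect

BUFFER_SIZE = 4  # illustrative; real memtables hold many megabytes

class TinyLSM:
    def __init__(self):
        self.buffer = {}   # in-memory buffer ("memtable")
        self.runs = []     # sorted runs, newest first (on disk in reality)

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= BUFFER_SIZE:
            # Flush the buffer as one sorted run: a sequential write of
            # many entries, instead of one random I/O per insert.
            self.runs.insert(0, sorted(self.buffer.items()))
            self.buffer = {}

    def lookup(self, key):
        # Check the buffer first, then runs from newest to oldest.
        if key in self.buffer:
            return self.buffer[key]
        for run in self.runs:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```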
Bitmap Indices
Bitmap indices are a special type of index designed for efficient querying on multiple keys. Records in a relation are assumed to be numbered sequentially, say from 0; given a record number, it must be easy to retrieve the corresponding record.
Bitmap indices are applicable to attributes that take on a relatively small number of distinct values.
A bitmap is simply an array of bits.
In its simplest form, a bitmap index on an attribute has a bitmap for each value of the attribute. Each bitmap has as many bits as there are records. In the bitmap for value v, the bit for a record is 1 if the record has the value v for that attribute, and is 0 otherwise.
Bitmap indices are useful for queries on multiple attributes. Queries are answered using bitmap operations (e.g., intersection via AND, union via OR). Each operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result bitmap.
Bitmap indices are generally very small when compared with relation size.
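A minimal sketch of bitmap construction and a two-attribute query answered with a bitwise AND (the relation and its attribute values are made up for illustration):

```python
records = [
    {"gender": "m", "income_level": "L1"},   # record 0
    {"gender": "f", "income_level": "L2"},   # record 1
    {"gender": "f", "income_level": "L1"},   # record 2
    {"gender": "m", "income_level": "L2"},   # record 3
]

def build_bitmap_index(records, attr):
    """One bitmap per attribute value; bit i describes record i."""
    index = {}
    for i, record in enumerate(records):
        bitmap = index.setdefault(record[attr], [0] * len(records))
        bitmap[i] = 1
    return index

gender_idx = build_bitmap_index(records, "gender")
income_idx = build_bitmap_index(records, "income_level")

# Query: gender = 'f' AND income_level = 'L1'
result = [a & b for a, b in zip(gender_idx["f"], income_idx["L1"])]
print(result)  # [0, 0, 1, 0] -> only record 2 matches
```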
Efficient Implementation of Bitmap Operations
Counting the number of 1s in a bitmap can be done quickly with a trick:
- Use each byte of the bitmap to index into a precomputed array of 256 elements, each storing the count of 1s in the binary representation of that byte value
  - Pairs of bytes can be used to speed this up further, at a higher memory cost
- Add up the retrieved counts
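A sketch of the byte-at-a-time counting trick in Python:

```python
# Precomputed table: COUNTS[b] = number of 1 bits in byte value b.
COUNTS = [bin(b).count("1") for b in range(256)]

def count_ones(bitmap: bytes) -> int:
    """Count 1 bits one byte at a time via table lookup, then sum."""
    return sum(COUNTS[b] for b in bitmap)

print(count_ones(b"\xff\x0f\x01"))  # 8 + 4 + 1 = 13
```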
Bitmaps can be used instead of tuple-ID lists at the leaf level of B+-trees, for values that have a large number of matching records.
- Worthwhile if > 1/64 of the records have that value, assuming a tuple ID is 64 bits: a bitmap costs 1 bit per record in the relation, while a tuple-ID list costs 64 bits per matching record, so the bitmap is smaller exactly when more than 1/64 of the records match.
The above technique merges the benefits of bitmap and B+-tree indices.