What Is a Hash Function?

This article will examine the following subjects: what a hash function is, the properties of a good hash function, examples of hash functions, and cryptographic hash functions.

Definition Of Hash Function

A hash function is any function that converts data of arbitrary size to data of a fixed size.

Hash Function Information

A hash function converts a large value, such as a phone number, into a small practical integer value. The hash table uses the mapped integer value as an index. In simple words, a hash function maps a large number or string to a small integer that can be used as the index into a hash table.

What Does “Good Hash Function” Imply?

The following are characteristics of a good hash function:

  • Easily computed
  • The keys should be distributed evenly (each table position is equally likely for each key).

For instance, taking the first three digits of a phone number is a terrible hash function, since many numbers share the same area code; using the last three digits is regarded as a better function. Keep in mind that even this may not be the most efficient hash function; there may be better options.

We may often use heuristic techniques to create a good hash function. Qualitative information about the distribution of the keys may be valuable in this design phase. In principle, a hash function should depend on every single bit of the key, so that two keys that differ in only one bit or one group of bits (whether the group is at the beginning, the end, or the middle of the key, or spread throughout it) hash into different values. As a result, a hash function that extracts only a portion of a key is ineffective. Similarly, two keys that are merely digit or character permutations of one another (such as 139 and 319) should hash to different values.

The two common heuristic methods are hashing by division and hashing by multiplication:

  • Division method: In this approach for constructing hash functions, we map a key into one of the table slots by taking the remainder of the key divided by the table size. To put it another way, the hash function is

h(key) = key mod table_size

i.e. key % table_size
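
For illustration, here is a minimal sketch of the division method in C; the function and parameter names are illustrative rather than taken from any particular library:

unsigned hash_division(unsigned key, unsigned table_size)
{
    /* map the key into one of the slots 0 .. table_size - 1 */
    return key % table_size;
}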

More On Good Hash Function

  • Hashing via division is quick since it just demands one division operation.
  • We purposely avoid certain values of table_size when using the division method. For example, table_size should not be a power of a radix r, since if table_size = r^p, then h(key) is merely the p lowest-order radix-r digits of the key (for r = 2, the p lowest-order bits). Unless we know that all low-order p-bit patterns are equally likely, we should construct the hash function to depend on all the bits of the key.
  • When using the division approach, the best results are obtained when the table size is prime. Even with a prime table size, an additional caveat applies: if the key is interpreted as a number in radix r and r % table_size equals 1, then h(key) = key % table_size is simply the sum of the key's radix-r digits mod table_size, so permuting the digits of a key does not change its hash.
  • Assume r = 256 and table_size = 17; then r % table_size = 256 % 17 = 1.
  • As a result, for key = 37599, the hash is

37599 % 17 = 12

  • However, for key = 573, the hash is also

573 % 17 = 12

  • As a result, numerous keys can have the same hash value under this hash function. This is called a collision.
  • For table size, a prime that isn’t too close to an exact power of two is frequently a decent choice.

Multiplication Method

  • In the multiplication method, we multiply the key k by a constant real number c in the range 0 < c < 1
    and extract the fractional part of k * c.
  • After that, we multiply this value by the table size m and take the floor of the result. It is written like this:

h(k) = floor (m * (k * c mod 1))
or
h(k) = floor (m * frac (k * c))

  • where floor(x) returns the integer part of the real number x and frac(x) = x - floor(x) returns the fractional part; both are available in the standard math library (math.h).
  • An advantage of the multiplication method is that the value of m is not critical. We typically choose it to be a power of 2 (m = 2^p for some integer p), since we can then easily implement the function on most computers.
  • Assume that the machine's word size is w bits and that the key fits in a single word.
  • We restrict c to fractions of the form s / 2^w, where s is an integer in the range 0 < s < 2^w.
  • We first multiply the key by the w-bit integer s = c * 2^w. The result is a 2w-bit value

key * s = r1 * 2^w + r0

where r1 = the product's high-order word
and r0 = the product's low-order word

  • This approach works with any value of the constant c, but certain values perform better than others. Knuth suggests that

c ~ (sqrt(5) - 1) / 2 = 0.6180339887...

  • is likely to work reasonably well.
  • Assume key = 123456 and p = 14,
  • m = 2^14 = 16384, and w = 32.
  • Following Knuth's suggestion, we take c to be a fraction of the form s / 2^32.
  • Then key * s = 327706022297664 = (76300 * 2^32) + 17612864,
  • so r1 = 76300 and r0 = 17612864.
  • The 14 most significant bits of r0 yield h(key) = 67.
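
The whole fixed-point computation can be sketched in a few lines of C. This is an illustrative sketch, not a canonical implementation; the constant 2654435769 is floor(0.6180339887 * 2^32):

#include <stdint.h>
#include <stdio.h>

#define W 32   /* machine word size in bits */
#define P 14   /* table size m = 2^P = 16384 */

uint32_t hash_mult(uint32_t key)
{
    const uint32_t s = 2654435769u;   /* s = c * 2^w, Knuth's constant */
    uint32_t r0 = key * s;            /* low-order word of key * s (wraps mod 2^32) */
    return r0 >> (W - P);             /* the P most significant bits of r0 */
}

int main(void)
{
    printf("%u\n", hash_mult(123456));   /* prints 67, matching the example above */
    return 0;
}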

Hash Function Interpretation

Any function that can be used to map data of arbitrary size to fixed-size values is known as a hash function. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are typically used to index a fixed-size table called a hash table. Using a hash function to index a hash table is called hashing, or scatter storage addressing.

In data storage and retrieval applications, hash functions and their associated hash tables are used to access data in a small and nearly constant time per retrieval, while requiring only a fraction of the storage space needed for the data or records themselves. Hashing is a form of data access that avoids the non-constant access time of sorted and unsorted lists and structured trees, as well as the often exponential storage requirements of direct access to state spaces of large or variable-length keys.

The use of hash functions relies on statistical properties of the interaction between the keys and the function: worst-case behavior is intolerably bad but occurs only with vanishingly small probability, while average-case behavior can be nearly optimal (minimal collisions).

Checksums, check digits, fingerprints, lossy compression, randomization functions, error-correcting codes, and ciphers are all linked to (and often confused with) hash functions. Although the concepts are somewhat similar, each has its own set of purposes and requirements, as well as being developed and optimized uniquely. In terms of data integrity, the hash function differs from these ideas.

Overview

In a data storage and retrieval application, a hash function takes a key that identifies a datum or record. The keys can be fixed length (such as an integer) or variable length (such as a name). Sometimes the data itself is the key. The output is a hash code that can be used to index a hash table containing the data or records, or references to them.

A hash function can be thought of as serving three purposes:

  • Fold variable-length keys by words or other units and use a parity-preserving operator like ADD or XOR to convert them into fixed-length (typically machine word length or less) values.
  • Scramble the key’s bits so that the resulting values are scattered evenly over the keyspace.
  • Transform the key values into numbers that are less than or equal to the table’s size.

Two basic properties of a good hash function are:

1) It should be very quick to compute;

2) it should reduce output value duplication (collisions).

For their effectiveness, hash functions rely on favorable probability distributions, which keep access times nearly constant. High table loading factors, pathological key sets, and poorly designed hash functions can cause access times to approach linear in the number of items in the table. Hash functions can be designed to give the best worst-case performance, good performance under high table loading factors, and, in special cases, perfect (collisionless) mapping of keys to hash codes. Implementations use parity-preserving bit operations (XOR and ADD), multiply, and divide. A necessary complement to the hash function is a collision-resolution method that uses an auxiliary data structure, such as linked lists, or systematic probing of the table to find an empty slot.

Hash Table

To collect and store data items or entries, hash functions are used together with hash tables. The hash function converts each data item's or record's key into a hash code, which is used to index the hash table. When an item is to be inserted into the table, the hash code may index an empty slot (also known as a bucket), in which case the item is placed there. If the hash code indexes a full (occupied) slot, collision resolution is necessary: the new item may be omitted (not added to the table), may replace the old item, or may be added to the table in a different location by a specified procedure. That procedure depends on the hash table's structure: in chained hashing, each slot is the head of a linked list or chain, and items that collide at the slot are added to the chain.

To allow faster access, chains may be kept in random order and searched linearly, kept in serial order, or kept as a self-ordering list ordered by access frequency. In open address hashing, the table is probed in a specific order, commonly by linear probing, quadratic probing, or double hashing, until an open slot is found or the entire table is probed (overflow). Searching for an item proceeds in the same way until the item is found, an open slot is found, or the entire table has been searched (item not in the table).
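
A minimal sketch of chained hashing in C makes the collision-resolution idea concrete; all names here are illustrative:

#include <stdlib.h>

#define TABLE_SIZE 101   /* a prime, per the division method above */

struct node {
    int key;
    int value;
    struct node *next;   /* next item in this slot's chain */
};

static struct node *table[TABLE_SIZE];

static unsigned slot_of(int key) { return (unsigned)key % TABLE_SIZE; }

/* Insert by pushing the new item onto the chain headed at the hashed slot. */
void ht_insert(int key, int value)
{
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->value = value;
    n->next = table[slot_of(key)];
    table[slot_of(key)] = n;
}

/* Search only the chain for the hashed slot; NULL means not in the table. */
struct node *ht_lookup(int key)
{
    for (struct node *n = table[slot_of(key)]; n != NULL; n = n->next)
        if (n->key == key)
            return n;
    return NULL;
}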

Advanced Applications

Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items.

The Bloom filter, a space-efficient probabilistic data structure for testing whether an element is a member of a set, also relies on hash functions.

Geometric hashing, often called the grid method, is a form of hashing in which the set of all inputs is a metric space, and the hash function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, or similar names), and the hash function returns an index tuple. This principle is widely used in computer graphics, computational geometry, and many other fields of study to solve a variety of plane and three-dimensional proximity problems, such as finding the closest pairs in a set of points, similar shapes in a list of shapes, images similar to a query image, etc.

Associative arrays and dynamic sets are also implemented using hash tables.

Hash Function Properties

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The rationale for this requirement is that the cost of hashing-based methods rises sharply as the number of collisions (pairs of inputs mapped to the same hash value) increases. If some hash values are more likely to occur than others, a larger fraction of lookup operations will have to search through a larger set of colliding table entries.

It is important to note that this criterion only requires the values to be uniformly distributed, not random in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. A club membership list, for example, may contain only a few hundred names out of the very large set of all possible names. In these circumstances, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the entire set of all possible entries.

More on Uniformity

In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. A small number of collisions is virtually inevitable even if n is much larger than m; see the birthday problem.

When the keys are known ahead of time and the key set is static, a hash function can be found that achieves absolute (collisionless) uniformity. Such a hash function is said to be perfect. There is no algorithmic way of constructing such a function; searching for one is a factorial function of the number of keys to be mapped versus the number of table slots they are mapped into. Finding a perfect hash function for more than a very small set of keys is generally computationally infeasible; the resulting function is likely to be more computationally complex than a standard hash function and provides only a marginal advantage over a function with good statistical properties that yields a minimum number of collisions.

Assessment and evaluation

The chi-squared test can be used to examine the uniformity of the hash value distribution when testing a hash function. This is a goodness-of-fit test comparing the actual distribution of items in the buckets against the expected (uniform) distribution. The formula is:

( sum_{j=0}^{m-1} b_j (b_j + 1) / 2 ) / ( (n / 2m) (n + 2m - 1) )

where n is the number of keys, m is the number of buckets, and b_j is the number of items in bucket j.

If the ratio is within one confidence interval (roughly 0.95 to 1.05), the hash function under test is expected to have a uniform distribution.
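
As a sketch in C (names illustrative), the ratio can be computed directly from the bucket counts:

/* n keys hashed into m buckets; b[j] holds the count for bucket j.
   A result near 1.0 (roughly 0.95 to 1.05) suggests uniform hashing. */
double uniformity_ratio(const unsigned *b, unsigned m, unsigned n)
{
    double num = 0.0;
    for (unsigned j = 0; j < m; j++)
        num += (double)b[j] * (b[j] + 1.0) / 2.0;
    double den = ((double)n / (2.0 * m)) * (n + 2.0 * m - 1.0);
    return num / den;
}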

Hash functions can have some technical properties that make it more likely that they will have a uniform distribution when applied. One is the strict avalanche criterion: whenever a single input bit is complemented, each output bit changes with a 50% probability. The rationale for this property is that selected subsets of the keyspace may have low variability; for the output to be evenly distributed, a small amount of variability, even a single bit, should translate into a large amount of variability (i.e. dispersion over the table space) in the output. Each output bit should change with a probability of 50% because, if some bits are reluctant to change, the keys become clustered around those values; if bits change too readily, the mapping approaches a fixed XOR function of a single bit. Standard tests for this property have been described in the literature, and the criterion's applicability to multiplicative hash functions has been evaluated.
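
The strict avalanche criterion can be checked empirically. The following C sketch (illustrative, for 32-bit hashes) flips each input bit in turn and reports the average fraction of output bits that change, which should be close to 0.5:

#include <stdint.h>

static int popcount32(uint32_t x)
{
    int c = 0;
    while (x) { c += (int)(x & 1u); x >>= 1; }
    return c;
}

/* Average fraction of output bits flipped per single-bit change of the key. */
double avalanche_score(uint32_t (*h)(uint32_t), uint32_t key)
{
    int flipped = 0;
    for (int i = 0; i < 32; i++)
        flipped += popcount32(h(key) ^ h(key ^ (1u << i)));
    return flipped / (32.0 * 32.0);   /* ideal value is 0.5 */
}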

Effectiveness

The usage of a hash function in data storage and retrieval applications involves a trade-off between search time and storage space. If search time were unbounded, a very compact unordered linear list would be the best medium; if storage space were unbounded, a randomly accessible structure indexable by the key value would be very large and very sparse, but very fast. A hash function takes a bounded amount of time to map a potentially enormous keyspace to a feasible amount of storage space that can be searched in a bounded amount of time, regardless of the number of keys. In most applications, the hash function should be computable with minimal latency and with a minimal number of instructions.

In terms of computational complexity, the simplest approaches (folding) are the fastest, multiplicative methods come next, and division-based procedures are the slowest.

Since collisions should be rare and cause only a marginal delay in any case, it is usually preferable to choose a faster hash function over one that requires more computation but saves a few collisions.

Division-based methods can be of particular concern because division is microprogrammed on nearly all chip architectures. Division (modulo) by a constant can be inverted to become a multiplication by the constant's word-size multiplicative inverse.

Universality

A universal hashing scheme is a randomized algorithm that selects a hash function h from a family of such functions, such that the probability of any two distinct keys colliding is 1/m, where m is the number of distinct hash values desired, independently of the two keys. Universal hashing guarantees (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function.

Usage

A hash function can be used in many different circumstances. A hash function that allows only certain table sizes, accepts strings only up to a certain length, or cannot accept a seed is less useful than one that can.

Determinism

A hash procedure must be deterministic, meaning it must always produce the same hash value for a given input value. In other words, it must be a function, in the mathematical sense, of the data to be hashed. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed, in cases where the address may change during execution (as can happen on systems that use certain methods of garbage collection), although sometimes rehashing the item is possible.

The determinism is in the context of the reuse of the function. Python, for example, uses hash functions (SipHash) that take a randomized seed, generated once when the Python process starts, in addition to the input to be hashed. The Python hash is still a valid hash function when used within a single run, but if the values are persisted (for example, written to disk), they can no longer be treated as valid hash values, since the random seed will differ on the next run.

Defined Range

The output of a hash function is frequently desired to have a fixed size (but see Variable Range below). If, for example, the output is constrained to 32-bit integer values, the hash values can be used to index an array. Such hashing is commonly used to accelerate data searches. Producing fixed-length output from variable-length input can be accomplished by breaking the input data into chunks of a specific size. For data searches, hash functions use an arithmetic expression that iteratively processes chunks of the input (such as the characters in a string) to produce the hash value.

Variable Range

In many applications, the range of hash values may be different for each run of the program, or may change over the course of a run, for instance when a hash table needs to be expanded. In those situations, a hash function that takes two parameters is needed: the input data z and the number n of allowed hash values.

A common solution is to compute a fixed hash function with a very large range (say 0 to 2^32 - 1), divide the result by n, and use the remainder of the division. If n is itself a power of 2, this can be done by bit masking and bit shifting. When this approach is used, the hash function must be chosen so that the result has a fairly uniform distribution between 0 and n - 1 for any value of n that may occur in the application; depending on the function, the remainder may be uniform only for certain values of n, e.g. odd or prime numbers.
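
In C, the two reductions look like this (names illustrative); the mask form is only valid when n is a power of 2:

/* Reduce a wide hash value h into the range 0 .. n - 1. */
unsigned reduce_mod(unsigned h, unsigned n)  { return h % n; }

/* Faster variant for n = 2^p: keep only the low p bits. */
unsigned reduce_mask(unsigned h, unsigned n) { return h & (n - 1u); }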

Variable Range with Minimal Movement (Dynamic Hash Function)

When a hash function is used to store values in a hash table that outlives the run of the program, and the table needs to be expanded or shrunk, the table is referred to as a dynamic hash table.

When the table is resized, it’s preferable to have a hash function that moves the smallest number of records. What is required is a hash function H(z,n), where z is the hashed key and n is the number of possible hash values, such that H(z,n + 1) = H(z,n) with probability close to n/(n + 1).

Linear hashing and spiral storage are examples of dynamic hash functions that execute in constant time but relax the uniformity property in order to achieve the minimal-movement property. Extendible hashing uses a dynamic hash function that requires space proportional to n to compute, and it becomes a function of the previously inserted keys. Several algorithms that preserve the uniformity property, but that require time proportional to n to compute the value of H(z,n), have been invented.

A hash function with minimal movement is especially useful in distributed hash tables.

Normalization of Data

In some cases, the supplied data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between uppercase and lowercase letters. For such data, any two inputs that are regarded as equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, for example by upper-casing all letters.
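
A sketch of this normalization in C; hash_str stands for any string hash function and is an assumed name, not a library call:

#include <ctype.h>

/* Upper-case the key into buf so that "Alice" and "ALICE" hash alike,
   then hash the normalized copy. */
unsigned hash_name(const char *s, unsigned (*hash_str)(const char *))
{
    char buf[64];
    unsigned i = 0;
    while (s[i] != '\0' && i < sizeof buf - 1) {
        buf[i] = (char)toupper((unsigned char)s[i]);
        i++;
    }
    buf[i] = '\0';
    return hash_str(buf);
}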

Hashing Integer Data

There are several standard algorithms for hashing integers. Which method gives the best distribution is data-dependent. One of the simplest and most common methods in practice is the modulo division method.

Identity Hash Function

If the data to be hashed is small enough, the hashed value can be the data itself, reinterpreted as an integer. The cost of computing this identity hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.

The meaning of "small enough" depends on the size of the type used as the hashed value. In Java, the hash code is a 32-bit integer. Thus the 32-bit integer Integer and 32-bit floating-point Float objects can simply use their value directly, whereas the 64-bit integer Long and 64-bit floating-point Double objects cannot.

Other types of data can also use this hashing scheme. For example, when mapping character strings between upper and lower case, the binary encoding of each character, interpreted as an integer, can be used to index a table containing the alternate form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in extended ASCII or ISO Latin 1), the table has only 2^8 = 256 entries; in the case of Unicode characters, the table would have 17 * 2^16 = 1114112 entries.

The same technique can be used to map two-letter country codes like "us" or "za" to country names (26^2 = 676 table entries), 5-digit zip codes like 13083 to city names (100000 entries), and so on. Invalid data values (such as "xx" or "00000") may be left undefined in the table or mapped to some appropriate "null" value.

Trivial Hash Function

If the keys are uniformly or sufficiently uniformly distributed over the keyspace, so that the key values are essentially random, they may be considered already "hashed". In this case, any number of bits from the key can be extracted and used as an index into the hash table. A simple such hash function, for instance, might mask off the m least significant bits and use the result as an index into a hash table of size 2^m, as sketched below.
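
As a one-line sketch in C:

/* Trivial hashing: the key is assumed to be well distributed already,
   so its low m bits serve directly as an index into a table of size 2^m. */
unsigned hash_trivial(unsigned key, unsigned m)
{
    return key & ((1u << m) - 1u);
}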

Folding

A folding hash code is produced by dividing the input into n sections of m bits, where 2^m is the table size, combining the sections with a parity-preserving bitwise operation such as ADD or XOR, and then trimming off any excess bits at the high or low end with a mask or shift. For example, for a table size of 15 bits and key value 0x0123456789ABCDEF, there are five sections; adding them together yields 0x7AA4, a 15-bit value.
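
A folding sketch in C for a 15-bit table; the section boundaries used here are one possible choice, and a different split of the 64-bit key gives a different sum:

#include <stdint.h>

/* Split the key into 15-bit sections from the low end, ADD them,
   and trim the sum back to 15 bits. */
unsigned hash_fold15(uint64_t key)
{
    unsigned sum = 0;
    while (key != 0) {
        sum += (unsigned)(key & 0x7FFFu);   /* next 15-bit section */
        key >>= 15;
    }
    return sum & 0x7FFFu;                   /* mask off excess high bits */
}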

Mid-squares

A mid-squares hash code is produced by squaring the input and extracting an appropriate number of middle digits or bits. For example, if the input is 123,456,789, squaring the key gives 15,241,578,750,190,521, and the hash code is taken as the middle 4 digits of the 17-digit value (ignoring the high digit), namely 8750. The mid-squares method produces a reasonable hash code if the key does not have many leading or trailing zeros. This is a variant of multiplicative hashing, but not as good, because an arbitrary key is not a good multiplier.
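
A decimal mid-squares sketch in C that reproduces the example; note that the key must be small enough for its square to fit in 64 bits, and the digits to extract depend on the length of the square:

#include <stdint.h>

/* Square the key and keep four middle digits of the 17-digit square:
   123456789^2 = 15241578750190521, so the hash is 8750. */
unsigned hash_midsquare(uint64_t key)
{
    uint64_t sq = key * key;
    return (unsigned)((sq / 1000000u) % 10000u);
}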

Division Hashing

A standard technique is to use a modulo function on the key, selecting a divisor M that is a prime number close to the table size, so h(K) = K mod M. The table size is usually a power of two. This gives a distribution from {0, M - 1}, and it yields good results over a wide range of key sets. A significant drawback of division hashing is that division is microprogrammed on most modern architectures, including x86, and can be 10 times slower than multiplication. A second drawback is that it will not break up clustered keys. For example, the keys 123000, 456000, 789000, and so on, modulo 1000, all map to the same address. In practice, this technique works well because many key sets are already sufficiently random, and the probability that a key set will be cyclic by a large prime number is small.

Coding Using Algebra

This coding is a variation of the division method of hashing that maps n bits to m bits by dividing by a polynomial modulo 2 rather than by an integer. In this method we set M = 2^m and postulate an m-th degree polynomial

Z(x) = x^m + ζ_{m-1} x^{m-1} + ... + ζ_0

A key K = (k_{n-1} ... k_1 k_0)_2 can be regarded as the polynomial

K(x) = k_{n-1} x^{n-1} + ... + k_1 x + k_0

The remainder using polynomial arithmetic modulo 2 is

K(x) mod Z(x) = h_{m-1} x^{m-1} + ... + h_1 x + h_0

Then h(K) = (h_{m-1} ... h_1 h_0)_2. If Z(x) is constructed to have t or fewer non-zero coefficients, then keys that share fewer than t bits are guaranteed to not collide.

Z is constructed as a function of k, t, and n (the last of which is a divisor of 2^k - 1) from the field GF(2^k). Knuth gives an example: for n = 15, m = 10, and t = 7, Z(x) = x^10 + x^8 + x^5 + x^4 + x^2 + x + 1.

Multiplicative Hashing

Multiplicative hashing computes the hash as h(K) = (a * K mod 2^w) >> (w - m), where w is the machine word size, m is the number of output bits, and a is a suitable constant. It is one of the fastest hash functions to compute because it needs only a multiply and a right shift.

Multiplicative hashing is prone to a "common mistake" that causes poor diffusion: higher-value input bits do not affect lower-value output bits. This can be corrected with an extra transmutation on the input before the multiplication step, which shifts the span of retained top bits down and XORs or ADDs them to the key. The resulting function is:

unsigned hash(unsigned K)
{
    /* w = word size in bits, m = number of output bits, a = multiplier */
    K ^= K >> (w - m);          /* fold the high bits down into the low bits */
    return (a * K) >> (w - m);  /* multiply, then keep the top m bits */
}

Fibonacci hashing

Fibonacci hashing is multiplicative hashing with the multiplier derived from the golden ratio, i.e. the constant c ~ (sqrt(5) - 1) / 2 suggested in the multiplication method above.

Tabulation hashing

This is also referred to as Zobrist hashing, after the American computer scientist Albert Zobrist. It is a method for creating universal families of hash functions by combining table lookups with XOR operations. For hashing purposes (particularly hashing of integer keys), this technique has proven to be very fast and of high quality.

Zobrist hashing was first introduced as a means of compactly representing chess positions in computer game-playing programs. A unique random number was assigned to each type of piece (six each for black and white) on each square of the board. Thus, a table of 64x12 such numbers is initialized at the start of the program. The random numbers could be of any length, but 64 bits was natural due to the 64 squares on the board.

A position was transcribed by cycling through the pieces in the position, indexing the corresponding random numbers (vacant squares were not included in the calculation), and XORing them together (the starting value could be 0, the identity value for XOR, or a random seed). The resulting value was then reduced by modulo, folding, or some other operation to produce a hash table index. The original Zobrist hash was stored in the table as the representation of the position.

Later, the method was extended to hashing integers by assigning a unique 32-bit random number to each possible byte value in each of four possible byte positions in the word. Thus, a table of 2^8 x 4 random numbers is constructed. A 32-bit hashed integer is produced by successively indexing the table with the value of each byte of the plain-text integer and XORing the loaded values together. The natural extension to 64-bit integers is the use of a table of 2^8 x 8 64-bit random numbers.
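
A sketch of the 32-bit integer version in C; filling the table with rand() is only a placeholder for a proper random source:

#include <stdint.h>
#include <stdlib.h>

static uint32_t T[4][256];   /* one row of random words per byte position */

/* Fill the table once at startup. */
void tab_init(void)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 256; j++)
            T[i][j] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

/* XOR together one table entry per byte of the key. */
uint32_t tab_hash(uint32_t key)
{
    return T[0][key & 0xFF] ^
           T[1][(key >> 8) & 0xFF] ^
           T[2][(key >> 16) & 0xFF] ^
           T[3][(key >> 24) & 0xFF];
}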

Functions of this type have several interesting theoretical properties, including 3-tuple independence: any 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values.

Customized Hash Functions

A hash function can be designed to exploit the keys' existing entropy. If the keys have leading or trailing zeros, or particular fields that are unused, always zero or some other constant, or that generally vary little, then masking out only the volatile bits and hashing on those will yield a better and possibly faster hash function. If the keys are cyclic or have other redundancies, selected divisors or multipliers in the division and multiplication schemes may produce more uniform hash functions.

Properties of Cryptographic Hash Functions

A hash function must have the following properties in order to be an effective cryptographic tool.

Pre-Image Resistance

  1. This property means that it should be computationally hard to reverse a hash function.
  2. In other words, if a hash function h produces a hash value z, it should be difficult to find any input value x that hashes to z.
  3. This property guards against an attacker who has only a hash value and is trying to figure out the input.

Second Pre-Image Resistance

  1. This property states that given an input and its hash, it should be hard to find a different input with the same hash.
  2. So, if a hash function h for an input x produces hash value h(x), then it should be difficult to find any other input value y such that h(y) = h(x).
  3. This characteristic of the hash function protects against an attacker who has an input value and its hash and wishes to replace the original input value with a different lawful value.

Resistance to Collision

  1. This property means that it should be hard to find two different inputs of any length that produce the same hash. It is also referred to as the collision-free property of the hash function.
  2. It’s difficult to identify any two separate inputs x and y for a hash function h such that h(x) = h(y).
  3. Since a hash function is a compression function with a fixed output length, it cannot avoid collisions entirely; the collision-resistance property only ensures that such collisions should be hard to find.
  4. This attribute makes finding two input values with the same hash very challenging for an intruder.
  5. A hash function that is collision-resistant is also second pre-image resistant.

Creation of Hashing Algorithms

At the foundation of hashing lies a mathematical function that operates on two fixed-size blocks of data to produce a hash code. This hash function is part of the hashing algorithm.

The algorithm determines the size of each data block; block sizes generally range from 128 to 512 bits.
Like a block cipher, the hashing algorithm uses rounds of the aforementioned hash function. Each round takes a fixed-size input, usually a combination of the most recent message block and the output of the previous round.

This procedure is done as many times as necessary to hash the entire message.
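
The round structure can be sketched as follows. This toy C example only illustrates how blocks are chained through rounds; the mixing function is made up for illustration and has none of the security properties of a real compression function:

#include <stdint.h>
#include <stddef.h>

/* Stand-in for a real compression function: mix one message block
   into the running state. */
static uint32_t round_fn(uint32_t state, uint32_t block)
{
    state ^= block;
    state *= 2654435769u;
    return state ^ (state >> 16);
}

/* Feed every block through the round function; each round's output
   becomes the next round's input, and the final state is the digest. */
uint32_t toy_digest(const uint32_t *blocks, size_t nblocks)
{
    uint32_t state = 0x12345678u;   /* arbitrary initial value */
    for (size_t i = 0; i < nblocks; i++)
        state = round_fn(state, blocks[i]);
    return state;
}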

Schematically, the first message block's hash value is used as an input to the second hash operation, whose output in turn alters the result of the third operation, and so on. This chaining gives rise to what is known as the avalanche effect of hashing.

When two messages differ by even a single bit of data, the avalanche effect produces significantly different hash values.

It is important to distinguish between the hash function and the hashing algorithm. The hash function produces a hash code by operating on two blocks of fixed-length binary data. The hashing algorithm specifies how the message is broken into blocks and how the results from successive message blocks are chained together.

