Computing practitioners encounter hash functions almost every day, although they are rarely the center of attention. When you install software packages or updates, the authenticity of a package is often verified with a cryptographic hash. Indeed, under the hood, everyone who browses the web these days is using cryptographic hashes to check that the certificates used by web servers are in order.
Non-cryptographic hash functions are also all around us. Those who write code often use dictionaries, which are usually implemented with hash functions (so much so that Perl calls them hashes). Those who use load balancers are often using hash functions to distribute the load. There are even smart algorithms for counting and deduplication of data that take advantage of non-cryptographic hashes for efficiency.
The common feature of both cryptographic and non-cryptographic hash functions is that they take input data of any size and convert it deterministically to an output of fixed length. Unlike encryption, this is intended to be a one-way process (i.e., decryption is never required).
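As a small illustration (a minimal Python sketch using the standard hashlib module, chosen here purely for convenience), inputs of very different sizes all map deterministically to a digest of the same fixed length:

```python
# Inputs of 1 byte, 11 bytes, and 1,000,000 bytes all produce a 256-bit
# (64 hex character) SHA-256 digest, and the same input always produces
# the same digest.
import hashlib

for data in [b"a", b"hello world", b"x" * 1_000_000]:
    digest = hashlib.sha256(data).hexdigest()
    print(len(data), len(digest), digest[:16] + "...")
```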
Cryptographic hashes have a security requirement, which is usually phrased in terms of resistance to collisions and pre-images. Indeed, for a cryptographic hash function, knowing the hash output should give you no clue about how to reconstruct the input data. In theory, you could achieve this by assigning a random output for every input and storing them in a huge table. In practice, however, clever designs for hash functions have emerged that involve bit operations, lookup tables, and repeated rounds to mix the input data.
In the case of non-cryptographic hash functions, security is not the primary requirement. Usually, you would like a hash function to map input data (typically strings, IP addresses, or objects) in a reasonable way across some set of resources (e.g., linked lists, servers, or data structures). Often, reasonable means uniformly distributed outputs, or at least as close to uniform as can be achieved relatively quickly.
This desired uniformity can come under attack, giving rise to security requirements in the non-cryptographic setting; as pointed out by Scott Crosby and Dan Wallach,6 an attacker might try to mess things up by presenting inputs that all get assigned to the same resource. Simpler functions, however, are often used for the sake of speed, at least when compared with the more demanding design goals of the cryptographic hash. These non-cryptographic hash functions are the focus of this article.
Over the years, many hash functions have been proposed,16 but let's look at some well-known simple examples listed in table 1.
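For concreteness, here are minimal Python sketches of three of them, written with the commonly published 32-bit constants; these are illustrative only, and table 1 remains the reference for the exact definitions used in the experiments below.

```python
# Illustrative 32-bit versions of FNV-1, FNV-1a, and DJBX33A, using the
# widely published constants; all arithmetic is kept to 32 bits.
FNV_PRIME = 16777619
FNV_OFFSET = 2166136261
MASK32 = 0xFFFFFFFF

def fnv1(data: bytes) -> int:
    h = FNV_OFFSET
    for b in data:
        h = (h * FNV_PRIME) & MASK32  # FNV-1: multiply by the prime first...
        h ^= b                        # ...then XOR in the byte
    return h

def fnv1a(data: bytes) -> int:
    h = FNV_OFFSET
    for b in data:
        h ^= b                        # FNV-1a: XOR in the byte first...
        h = (h * FNV_PRIME) & MASK32  # ...then multiply by the prime
    return h

def djbx33a(data: bytes) -> int:
    h = 5381
    for b in data:
        h = (h * 33 + b) & MASK32     # "times 33 and add"
    return h
```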
You can assess the behavior of a (non-cryptographic) hash function by putting it to work to see how it performs. One common use is in populating hash tables. In this case, we use modulo arithmetic to assign the inputs to one of M "buckets" by taking the hash of the input, h, and using h mod M to identify the bucket. How does the desired uniform-looking output compare with the actual results? To find out, let's pick a few example input sets (see table 2) and run some code. We have selected items that you would expect to populate a hash table (names, words, IP addresses) as well as something that might be troublesome (bit strings that differ from a fixed string in only one bit). This is called the Bias dataset.
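As a sketch of that bucketing step (reusing, say, the fnv1a function above and assuming each dataset is a list of strings):

```python
# Reduce each hash value modulo M to pick a bucket, and return the number
# of entries that landed in each of the M buckets.
def bucket_counts(items, hash_fn, M):
    counts = [0] * M
    for item in items:
        h = hash_fn(item.encode("utf-8"))
        counts[h % M] += 1
    return counts
```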
You would assume the main goal is to ensure the table is filled as evenly as possible. Indeed, for a hash table implemented with a list at each bucket, it is easy enough to show that a uniform output is optimal, as this avoids long chains of data being stored in any one bucket, while avoiding unused buckets. Let's see how uniformly the hash output is distributed modulo M.
As each of these datasets comprises 1,000 lines of input, using M = 500 buckets would therefore result in a load factor of two. This means ideally, there would be two entries per bucket and no empty buckets. To get a feel for what happens in practice, you can see in figure 1 the number of empty buckets observed for each case: about 64 empty buckets except for the Bias data, which leaves many buckets empty for three of the four hashes, and must, consequently, be crowding other buckets with data. Murmur2 is resistant to this problem; we will come back to this.
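A sketch of the empty-bucket count, reusing bucket_counts from above. (The uniform-random baseline in the comment is a back-of-the-envelope calculation offered for comparison, not a figure from the experiments.)

```python
# Count the buckets that received no entries. For 1,000 items and M = 500,
# a perfectly even spread leaves none empty, while a uniformly random
# assignment would leave about 500 * (1 - 1/500)**1000, roughly 68,
# buckets empty on average.
def empty_buckets(items, hash_fn, M=500):
    return sum(1 for c in bucket_counts(items, hash_fn, M) if c == 0)
```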
There are various ways to measure how far from uniform the distribution over the buckets is. For example, you might take a statistical approach and use a chi-squared test. In figure 2 you can see the p-value from the test. If each output value were really independently and uniformly distributed over the M buckets, then these p-values would be uniformly distributed between 0 and 1. On the other hand, repeated values close to 0 would be suspicious, which is again the case for the Bias data. This time, there is a cluster of small values for DJBX33A, a function that can be shown to preserve bit patterns in the input data. The Murmur2 values, curiously, are also lower than expected.
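A sketch of that check, assuming SciPy is available; with no expected frequencies given, scipy.stats.chisquare tests the observed bucket counts against the uniform (equal-counts) null hypothesis.

```python
# p-value of a chi-squared test of the bucket counts against uniformity.
from scipy.stats import chisquare

def uniformity_pvalue(items, hash_fn, M=500):
    return chisquare(bucket_counts(items, hash_fn, M)).pvalue
```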
Interestingly, while bad behavior is evident in the Bias dataset, there are more values close to 1 for the IP Addresses set. Further investigation shows that the IP addresses contain several runs of sequential addresses that behave particularly nicely for some of the hashes. This shows that there may sometimes be gains from matching your hash function to patterns that are common in your data.
Taking a step back, you can, to some degree, play a game of pick-your-metric, and it might be reasonable to do so based on the requirements of your application. You can also play pick-your-M-value. For comparison, figure 3 shows the incidence of empty buckets after picking the nearby numbers, M = 499 and M = 512. There is folklore around hash tables suggesting that an M that is prime or a power of two may be good. The original choice of M = 500 and the power of two M = 512 have similar behavior. This is because the structure of the Bias dataset results in recurring patterns in the low bits, which are preserved by working modulo numbers divisible by a power of two. A prime M, such as 499, will mitigate this sort of problem, as, in fact, will any odd value of M.
To explore this, we tried all numbers of buckets M between 488 and 522, using the Bias dataset again and combining it with FNV-1a (see figure 4, prime numbers highlighted in red).
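A sketch of the kind of sweep behind figure 4, reusing the helpers above; bias_dataset here is a hypothetical name for the list of 1,000 bias strings.

```python
# For each number of buckets M in a range, record the number of empty
# buckets and the chi-squared p-value for a given dataset and hash.
def sweep_buckets(items, hash_fn, lo=488, hi=522):
    return {M: (empty_buckets(items, hash_fn, M),
                uniformity_pvalue(items, hash_fn, M))
            for M in range(lo, hi + 1)}

# e.g., results = sweep_buckets(bias_dataset, fnv1a)
```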
The number of empty buckets (blue line, axis on the right-hand side) bounces quite regularly between about 60 and 270. In general, when the number of buckets is even, only even-numbered buckets are filled, meaning that half are left empty. Likewise, any even number of buckets resulted in a zero p-value (orange line in figure 4). While odd values of M also gave a consistent number of empty buckets, their p-values showed a more varied pattern. Interestingly, the p-values are largest at 507 and 513 buckets, which are odd but not prime. So much for folklore.
So, it would appear that when using the more "real life" datasets (i.e., excluding Bias), the number of empty buckets and the distributions are broadly similar for the first three hashes (both FNVs and Murmur2). It also seems you can hide some problems by using an odd value of M, although not necessarily a prime. You can also look at other metrics, such as the number of buckets with collisions, and the story stays the same.
This leads to the question, Is there some general principle that could be used to design a non-cryptographic hash function? One commonly adopted measure is the avalanche criterion. To understand this, think back to the theoretical example of designing a hash function by listing every input and storing a corresponding randomly generated N-bit output. If you selected any two inputs, then the outputs would be independent of one another. In fact, each bit of the output would be set independently, meaning there would be a 50/50 chance of a flip when comparing each bit of the two outputs.
The avalanche criterion tests for this and is usually stated as, If you change one bit in your input, this change affects every bit of the output with a probability of 50 percent.12 For small input sizes, hash designers can test for every possible input, but for larger sizes, inputs are often randomly sampled.
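A minimal sketch of such a test, assuming a 32-bit hash function such as the fnv1a sketch above and little-endian packing of the 32-bit inputs (both choices are assumptions made here for illustration):

```python
# Estimate, for each (input bit i, output bit j) pair, the probability
# that toggling input bit i flips output bit j, over random 32-bit inputs.
import random

def avalanche_matrix(hash_fn, trials=10_000):
    flips = [[0] * 32 for _ in range(32)]
    for _ in range(trials):
        x = random.getrandbits(32)
        h = hash_fn(x.to_bytes(4, "little"))
        for i in range(32):
            h2 = hash_fn((x ^ (1 << i)).to_bytes(4, "little"))
            diff = h ^ h2
            for j in range(32):
                if diff & (1 << j):
                    flips[i][j] += 1
    return [[count / trials for count in row] for row in flips]
```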
Figure 5 shows how the avalanche metric is typically evaluated. For each hash function, we have followed the work of Bret Mulvey12 and chosen random 32-bit inputs and then toggled each input bit to observe the chance of each of the 32 output bits changing over 10,000 tests. The results are shown in an output matrix, colored as follows:
• Green ⇒ toggling input bit i resulted in a change to output bit j approximately 50 percent of the time (45–55 percent). This is the ideal.
• Yellow ⇒ a chance of change either 25–45 percent or 55–75 percent.
• Red ⇒ the effect of toggling input bit i was a change to output bit j either 0–25 percent or 75–100 percent of the time. This is the worst-case scenario.
Murmur2 shows an impressive avalanche performance, with all entries in green. Indeed, Murmur2 was designed with the avalanche criterion in mind. Murmur2's structure in table 1 shows that it reads in data four bytes at a time; each four-byte block goes through a thorough mixing step on its own before being combined with the previous blocks in a way that is like FNV. This mixing step essentially hides simple bit changes within a more random-looking change. Murmur2 also uses a "finalization step" to mix the output bits before producing the final hash output.
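For illustration, here is a sketch of that per-block mixing and finalization, using the widely published 32-bit MurmurHash2 constants (m = 0x5bd1e995, r = 24); the seed and length initialization and the handling of any trailing bytes are omitted.

```python
# Mix one four-byte block k into the running hash h (MurmurHash2 style).
def murmur2_mix_block(h, k, m=0x5BD1E995, r=24, mask=0xFFFFFFFF):
    k = (k * m) & mask
    k ^= k >> r               # mix the current block on its own...
    k = (k * m) & mask
    h = (h * m) & mask        # ...then fold it into the running hash
    h ^= k                    # in a multiply-and-XOR style similar to FNV
    return h

# Final mixing of the output bits before the hash is returned.
def murmur2_finalize(h, m=0x5BD1E995, mask=0xFFFFFFFF):
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h
```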
In contrast, the FNV hashes involve only two simple steps per byte of data (XOR and multiplication by a prime), and DJBX33A is even simpler: taking in each byte of data, multiplying by 33, and adding to the previous value. For FNV, this results in some patterns from the input bytes being preserved in the output bytes. For example, the distinctive sawtooth pattern observed in the lower output bits of both FNV-1a and FNV-1 is because no output bit lower than the toggled input bit will ever change.
This is a result of the distribution of 1-bits in the FNV prime, which also causes the large swaths of red across the top of both FNV matrices: These represent the most recently processed byte, where changes have not yet had a chance to dissipate into all the bits. For DJBX33A, multiplying by 33, or 0b00100001, is equivalent to shifting five places to the left and then adding the original value. Therefore, a lot of information about the input is preserved mod 32, and it takes many runs through the algorithm to propagate changes to the higher output bits. These issues can be clearly observed in figure 5.
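A quick arithmetic check of that observation:

```python
# 33 is 0b100001, so multiplying by 33 is a five-place left shift plus the
# original value; modulo 32 the multiplication changes nothing, because
# 33 is congruent to 1 mod 32.
x = 0xAB
assert 33 * x == (x << 5) + x
assert (33 * x) % 32 == x % 32
```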
Note that although Murmur2 shows much better behavior in terms of the avalanche criterion, its collision resistance and distribution performance on these datasets were not so exceptional, except for the Bias dataset. That dataset was constructed to have many single-bit input differences, which is exactly the type of change targeted by the avalanche criterion. It is possible to produce a challenging dataset for Murmur2 by applying the inverse of its input-mixing step to the Bias dataset. This is a general lesson: every hash function has a dataset that it will find challenging, if the data is selected carefully enough.
Avalanche is an interesting criterion, but it targets a specific form of input differences, where single-bit changes are important. Given that this doesn't seem so common in real data, let's take a quick trip through history to understand how the avalanche criterion became interesting for designers.
Some early references to something like the avalanche criterion come up when discussing S-boxes (substitution boxes) in the DES (Data Encryption Standard) cipher. One strategy, when attacking a cipher, can be to make small changes to the input and then attempt to learn about the key. Here, avalanche is described as ensuring that changing a single bit of the input will change at least two bits in the S-box output,10 which in turn leads to an avalanche of changed bits as the internal rounds of the DES cipher are repeated. In Applied Cryptography, when analyzing the DES system, Bruce Schneier describes the cryptographic benefit of this:
By allowing one bit to affect two substitutions, the dependency of the output bits on the input bits spreads faster. This is called an avalanche effect. DES is designed to reach the condition of having every bit of the ciphertext depend on every bit of the plaintext and every bit of the key as quickly as possible.13
These original uses of avalanche are clearly in a cryptographic context and might reasonably be applied to cryptographic hashes, where similar security requirements are present. The question now is, How has avalanche migrated from a cryptographic background into the non-cryptographic world? We looked at a number of papers to see where references to avalanche as a criterion for non-cryptographic hashes may have originated and progressed through the literature.
For example, in their paper, "Novel non-cryptographic hash functions for networking and security applications on FPGA," Claesen, et al., state, "Non-cryptographic hashes for applications such as Bloom filters, hash tables, and sketches must be fast, uniformly distributed and must have excellent avalanche properties."4 Two papers are cited in support of this (Bloom3 and Cormode and Muthukrishnan5), but neither one directly mentions avalanche.
Similarly, in "Performance of the most common non-cryptographic hash functions, " Estébanez, et al., mention that the avalanche effect is important for non-cryptographic hash functions, providing several references.7 One is to Valloud's book on assessing hash functions, Hashing in Smalltalk: Theory and Practice, which is actually lukewarm on the avalanche criterion:
Although the avalanche test is quite popular, note that... it says nothing of the distribution of the actual hash values. In particular, it is quite possible to construct hash functions of horrific quality that nevertheless effortlessly achieve general avalanche.16
Another common reference is Uzgalis's "Hashing concepts and the Java programming language" from 1996.15 The author does note that one of the two criteria used to choose a candidate hash for Java was a mixing criterion: "The algorithm must guarantee that a change in any single bit of a key should result in an equally probable change to every bit of the hash value." This is clearly the avalanche criterion, but the author goes on to make a distinction between randomized-looking output and uniformly distributed outputs. Again, avalanche is vital only for those in search of a random result.
In fact, in all the applicable papers we identified, the avalanche criterion is either presented as a goal in itself or supported by references that are dead ends or that lead to discussions guarded about its universal applicability.1,8,9
Another challenge that arises with avalanche is that it really targets output that is uniform over, say, 32 bits. What the application would usually like, however, is an output that is uniform over the M resources. In fact, it is impossible to pack a uniform 32-bit output uniformly into M buckets unless M is a power of two, which brings its own drawbacks, as seen earlier. This problem was recognized some years ago for random-number generators and is described as modulo bias. It resulted in the introduction of functions such as arc4random_uniform(), which use extra random bits to resolve the bias. Work is ongoing to optimize fixes for modulo bias for random numbers.11 The situation for hash functions is somewhat less developed, probably because the input data itself can always be a source of non-uniformity.
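To make the fix concrete, here is a sketch of the rejection idea applied to a 32-bit hash value rather than a random number; the rehash argument is a hypothetical stand-in for however extra bits would be obtained (arc4random_uniform applies the same rejection principle to its random input).

```python
# Map a 32-bit hash value to one of M buckets without modulo bias by
# rejecting values from the uneven "tail" of the 32-bit range.
def unbiased_bucket(h32, M, rehash):
    limit = (1 << 32) - ((1 << 32) % M)  # largest multiple of M <= 2**32
    while h32 >= limit:                  # these values would favor low residues
        h32 = rehash(h32)                # hypothetical source of fresh bits
    return h32 % M
```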
Although cryptographic and non-cryptographic hash functions are everywhere, there seems to be a gap in how they are designed. Lots of criteria exist for cryptographic hashes motivated by various security requirements, but on the non-cryptographic side there is a certain amount of folklore that, despite the long history of hash functions, has not been fully explored. While targeting a uniform distribution makes a lot of sense for real-world datasets, it can be a challenge when confronted by a dataset with particular patterns.
Avalanche may be an important criterion for cryptographic hash functions, but its focus on randomizing the relationship between input and output may not have a big impact on the distribution of entries in a hash table. Distributing outputs in a hash table raises the long-debated question of whether to choose your M to be a prime or a power of two, but does not generally consider the option of simply using an odd number.
This evidence all fits with the idea that non-cryptographic hash functions perhaps deserve more attention. Some already think that too much is being asked of individual cryptographic hashes: Maybe a one-size-fits-all approach is not workable.14 Another approach has been to design hashes targeting a middle ground; these would have some security requirements but maintain the speed and efficiency of a non-cryptographic hash: for example, SipHash.2 Perhaps studying hashes for a range of specific situations will help clarify how (or if) we should evaluate non-cryptographic hash functions in general.
1. Akoto-Adjepong, V., Asante, M., Okyere-Gyamfi, S. 2020. An enhanced non-cryptographic hash function. International Journal of Computer Applications 176 (15), 10–17; https://www.ijcaonline.org/archives/volume176/number15/akotoadjepong-2020-ijca-920014.pdf.
2. Aumasson, J.-P., Bernstein, D. J. 2012. SipHash: a fast short-input PRF. In Progress in Cryptology (IndoCrypt), 489–508. Springer; https://link.springer.com/chapter/10.1007/978-3-642-34931-7_28.
3. Bloom, B. H. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13 (7), 422–426; https://dl.acm.org/doi/10.1145/362686.362692.
4. Claesen, T., Sateesan, A., Vliegen, J., Mentens, N. 2021. Novel non-cryptographic hash functions for networking and security applications on FPGA. Euromicro Conference on Digital System Design; https://www.computer.org/csdl/proceedings-article/dsd/2021/270300a347/1xCb7oBYCvS.
5. Cormode, G., Muthukrishnan, S. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (1), 58–75; https://dl.acm.org/doi/10.1016/j.jalgor.2003.12.001.
6. Crosby, S. A., Wallach, D. S. 2003. Denial of service via algorithmic complexity attacks. 12th Usenix Security Symposium; https://www.usenix.org/legacy/events/sec03/tech/full_papers/crosby/crosby.pdf.
7. Estébanez, C., Saez, Y., Recio, G., Isasi, P. 2014. Performance of the most common non-cryptographic hash functions. Software: Practice and Experience 44 (6), 681–698. John Wiley & Sons; https://onlinelibrary.wiley.com/doi/10.1002/spe.2179; https://e-archivo.uc3m.es/rest/api/core/bitstreams/c2735ec2-8ae9-47f1-b663-dcda0c4f6204/content.
8. Estébanez, C., Saez, Y., Recio, G., Isasi, P. Automatic design of noncryptographic hash functions using genetic programming; https://e-archivo.uc3m.es/bitstream/handle/10016/30762/automatic_CI_2014_ps.pdf.
9. Henke, C., Schmoll, C., Zseby, T. Empirical evaluation of hash functions for multipoint measurements. ACM SIGCOMM Computer Communication Review 38 (3), 39–50; https://dl.acm.org/doi/10.1145/1384609.1384614.
10. Katz, J., Lindell, Y. 2014. Substitution permutation networks (chapter 6.2.1). In Introduction to Modern Cryptography. CRC Press.
11. Lemire, D. 2019. Fast random integer generation in an interval. ACM Transactions on Modeling and Computer Simulation 29 (1), 1–12; https://dl.acm.org/doi/10.1145/3230636.
12. Mulvey, B. The Pluto Scarab; https://web.archive.org/web/20230603152138/https://papa.bretmulvey.com/post/124027987928/hash-functions.
13. Schneier, B. 1996. Description of DES (chapter 12.2). In Applied Cryptography: Protocols, Algorithms, and Source Code in C, second edition. John Wiley and Sons, Inc.
14. Noll, L. C. Too many demands for a single cryptographic hash function to satisfy; http://www.isthe.com/chongo/tech/comp/crypto-hash.html.
15. Uzgalis, R. 1996. Hashing concepts and the Java programming language; http://www.serve.net/buz/hash.adt/java.000.html.
16. Valloud, A. 2008. Hashing in Smalltalk: Theory and Practice. Self-published, lulu.com.
Catherine Hayes obtained her mathematics degree from Maynooth University in 2004 and has worked in financial portfolio management for the 20 years since. When the lure of mathematics became too strong, she returned to Maynooth University to complete an MSc, supervised by David Malone, and was introduced to the fascinating world of hash functions.
David Malone is a researcher, sysadmin, or professor, depending on when you talk to him. He has worked on FreeBSD, IPv6, guessing passwords, and even warcycling around Dublin. He did his degrees in mathematics at Trinity College Dublin and works at Maynooth University.
Copyright © 2024 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 22, no. 4.