Why should hash functions use a prime number modulus?

A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation for a hashing function said that it should ultimately mod by a prime number because of "the nature of math".

What do you expect from a $1.25 book?

Anyway, I've had years to think about the nature of math, and still can't figure it out.

Is the distribution of numbers truly more even when there are a prime number of buckets? Or is this an old programmer's tale that everyone accepts because everybody else accepts it?

13 answers

#1


Usually a simple hash function works by taking the "component parts" of the input (characters in the case of a string), and multiplying them by the powers of some constant, and adding them together in some integer type. So for example a typical (although not especially good) hash of a string might be:

(first char) + k * (second char) + k^2 * (third char) + ...

Then if a bunch of strings all having the same first char are fed in, then the results will all be the same modulo k, at least until the integer type overflows.
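
As a rough illustration of the point above, here is a minimal Python sketch. The function name `simple_hash` and the word list are made up for this example, and Python integers never overflow, so the relationship holds for strings of any length here.

```python
def simple_hash(s: str, k: int = 31) -> int:
    """(first char) + k * (second char) + k^2 * (third char) + ..."""
    return sum(ord(ch) * k**i for i, ch in enumerate(s))

# Hypothetical inputs that all share the same first character.
words = ["apple", "avocado", "almond", "anchovy"]

# Every term except the first is a multiple of k, so the hash modulo k
# depends only on the first character.
residues = {simple_hash(w) % 31 for w in words}
print(residues)  # a single value, equal to ord("a") % 31
```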

[As an example, Java's string hashCode is eerily similar to this: it processes the characters in reverse order, with k=31. So you get striking relationships modulo 31 between strings that end the same way, and striking relationships modulo 2^32 between strings that are the same except near the end. This doesn't seriously mess up hashtable behaviour.]

A hashtable works by taking the modulus of the hash over the number of buckets.

It's important in a hashtable not to produce collisions for likely cases, since collisions reduce the efficiency of the hashtable.

Now, suppose someone puts a whole bunch of values into a hashtable that have some relationship between the items, like all having the same first character. This is a fairly predictable usage pattern, I'd say, so we don't want it to produce too many collisions.

It turns out that "because of the nature of maths", if the constant used in the hash, and the number of buckets, are coprime, then collisions are minimised in some common cases. If they are not coprime, then there are some fairly simple relationships between inputs for which collisions are not minimised. All the hashes come out equal modulo the common factor, which means they'll all fall into the 1/n th of the buckets which have that value modulo the common factor. You get n times as many collisions, where n is the common factor. Since n is at least 2, I'd say it's unacceptable for a fairly simple use case to generate at least twice as many collisions as normal. If some user is going to break our distribution into buckets, we want it to be a freak accident, not some simple predictable usage.
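
The effect of a common factor can be sketched in a few lines of Python. The multiplier k=6 is deliberately bad and, like the helper names and the key set, is an assumption made only for this illustration.

```python
from itertools import product
from string import ascii_lowercase

def simple_hash(s: str, k: int) -> int:
    """(first char) + k * (second char) + k^2 * (third char) + ..."""
    return sum(ord(ch) * k**i for i, ch in enumerate(s))

def buckets_used(keys, k: int, bucket_count: int) -> set:
    """Set of bucket indices that the given keys land in."""
    return {simple_hash(key, k) % bucket_count for key in keys}

# All three-letter keys that start with the same character.
keys = ["a" + "".join(rest) for rest in product(ascii_lowercase, repeat=2)]

shared = buckets_used(keys, k=6, bucket_count=12)   # gcd(6, 12) = 6
coprime = buckets_used(keys, k=6, bucket_count=13)  # gcd(6, 13) = 1

print(len(shared), "of 12 buckets used")
print(len(coprime), "of 13 buckets used")
```

With the shared factor, the keys can only reach buckets that agree with the first character modulo 6; with coprime values they spread over the whole table.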

Now, hashtable implementations obviously have no control over the items put into them. They can't prevent them being related. So the thing to do is to ensure that the constant and the bucket counts are coprime. That way you aren't relying on the "last" component alone to determine the modulus of the bucket with respect to some small common factor. As far as I know they don't have to be prime to achieve this, just coprime.

But if the hash function and the hashtable are written independently, then the hashtable doesn't know how the hash function works. It might be using a constant with small factors. If you're lucky it might work completely differently and be nonlinear. If the hash is good enough, then any bucket count is just fine. But a paranoid hashtable can't assume a good hash function, so should use a prime number of buckets. Similarly a paranoid hash function should use a largeish prime constant, to reduce the chance that someone uses a number of buckets which happens to have a common factor with the constant.

In practice, I think it's fairly normal to use a power of 2 as the number of buckets. This is convenient and saves having to search around or pre-select a prime number of the right magnitude. So you rely on the hash function not to use even multipliers, which is generally a safe assumption. But you can still get occasional bad hashing behaviours based on hash functions like the one above, and prime bucket count could help further.

The principle that "everything has to be prime" is, as far as I know, a sufficient but not a necessary condition for good distribution over hashtables. It allows everybody to interoperate without needing to assume that the others have followed the same rule.

[Edit: there's another, more specialized reason to use a prime number of buckets, which is if you handle collisions with linear probing. Then you calculate a stride from the hashcode, and if that stride comes out to be a factor of the bucket count then you can only do (bucket_count / stride) probes before you're back where you started. The case you most want to avoid is stride = 0, of course, which must be special-cased, but to avoid also special-casing bucket_count / stride equal to a small integer, you can just make the bucket_count prime and not care what the stride is provided it isn't 0.]
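
A small sketch of the probing argument, assuming a hypothetical helper `probe_sequence` that keeps stepping by the stride until it revisits a slot:

```python
from math import gcd

def probe_sequence(start: int, stride: int, bucket_count: int) -> list:
    """Slots visited by stepping `stride` at a time until a slot repeats."""
    seen = []
    index = start % bucket_count
    while index not in seen:
        seen.append(index)
        index = (index + stride) % bucket_count
    return seen

# Stride 4 is a factor of 16, so only 16 / 4 = 4 slots are ever probed.
short_cycle = probe_sequence(start=0, stride=4, bucket_count=16)

# With a prime bucket count, any non-zero stride reaches every slot.
full_cycle = probe_sequence(start=0, stride=4, bucket_count=17)

print(len(short_cycle), 16 // gcd(4, 16))
print(len(full_cycle), 17 // gcd(4, 17))
```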

#2


The first thing you do when inserting into or retrieving from a hash table is to calculate the hashCode for the given key and then find the correct bucket by trimming the hashCode to the size of the hashTable by doing hashCode % table_length. Here are 2 'statements' that you have most probably read somewhere:

  1. If you use a power of 2 for table_length, finding (hashCode(key) % 2^n ) is as simple and quick as (hashCode(key) & (2^n -1)). But if your function to calculate hashCode for a given key isn't good, you will definitely suffer from clustering of many keys in a few hash buckets.

  2. But if you use prime numbers for table_length, hashCodes calculated could map into the different hash buckets even if you have a slightly stupid hashCode function.
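
The equivalence in the first statement, and the clustering it warns about, can be checked with a short Python sketch. The sample hash codes are arbitrary non-negative values chosen for illustration.

```python
n = 4
table_length = 1 << n  # 2^n = 16 buckets

# For non-negative hash codes, modulo 2^n and masking with 2^n - 1 agree.
samples = (0, 5, 16, 31, 12345, 2**31 - 1)
agree = all(h % table_length == h & (table_length - 1) for h in samples)
print(agree)

# Hash codes that differ only above the low n bits all land in one bucket.
poor_hash_codes = [0x10, 0x20, 0x30, 0x40]
buckets = {h & (table_length - 1) for h in poor_hash_codes}
print(buckets)
```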

And here is the proof.

Suppose your hashCode function results in the following hashCodes, among others: {x, 2x, 3x, 4x, 5x, 6x...}. Then all of these are going to be clustered in just m buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this.) Now you can do one of the following to avoid clustering:
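
The formula for m can be checked empirically. The helper `buckets_hit` and the sample (table_length, x) pairs below are assumptions picked for this sketch.

```python
from math import gcd

def buckets_hit(x: int, table_length: int, count: int = 1000) -> set:
    """Buckets reached by the hash codes x, 2x, 3x, ..., count * x."""
    return {(i * x) % table_length for i in range(1, count + 1)}

results = {}
for table_length, x in [(12, 8), (12, 5), (13, 8)]:
    m = table_length // gcd(table_length, x)
    results[(table_length, x)] = (len(buckets_hit(x, table_length)), m)
    print(table_length, x, results[(table_length, x)])
```

In each case the number of buckets actually hit matches m, and only the coprime pairs reach the whole table.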

Make sure that you don't generate too many hashCodes that are multiples of another hashCode, as in {x, 2x, 3x, 4x, 5x, 6x...}. But this may be kind of difficult if your hashTable is supposed to have millions of entries. Or simply make m equal to table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, then make sure that table_length is a prime number.

From - https://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html

#3


https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

Pretty clear explanation, with pictures too.

Edit: As a summary, primes are used because you have the best chance of obtaining a unique value when multiplying values by the prime number chosen and adding them all up. For example given a string, multiplying each letter value with the prime number and then adding those all up will give you its hash value.

A better question would be, why exactly the number 31?

#4


tl;dr

index[hash(input)%2] would result in a collision for half of all possible hashes and a range of values. index[hash(input)%prime] results in a collision of <2 of all possible hashes. Fixing the divisor to the table size also ensures that the number cannot be greater than the table.

#5


Primes are used because you have a good chance of obtaining a unique value for a typical hash function which uses polynomials modulo P. Say you use such a hash function for strings of length <= N, and you have a collision. That means that 2 different polynomials produce the same value modulo P. The difference of those polynomials is again a polynomial of the same degree N (or less). It has no more than N roots (this is where the nature of math shows itself, since this claim is only true for a polynomial over a field, which requires a prime modulus). So if N is much less than P, you are likely not to have a collision. After that, experiment can probably show that 37 is big enough to avoid collisions for a hash table of strings which have length 5-10, and is small enough to use for calculations.
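
The root-counting claim can be illustrated with a tiny brute-force sketch. The helper `roots` and the choice of polynomial x^2 - 1 are assumptions for this example; the moduli 7 and 8 are picked only because they are small.

```python
def roots(coeffs, modulus: int) -> list:
    """Roots of a polynomial modulo `modulus`; coeffs[i] multiplies x**i."""
    def value(x: int) -> int:
        return sum(c * x**i for i, c in enumerate(coeffs)) % modulus
    return [x for x in range(modulus) if value(x) == 0]

poly = [-1, 0, 1]  # x^2 - 1, degree 2

prime_roots = roots(poly, 7)      # integers modulo a prime form a field
composite_roots = roots(poly, 8)  # integers modulo 8 do not

print(prime_roots)      # at most 2 roots
print(composite_roots)  # more roots than the degree allows in a field
```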

#6


Just to provide an alternate viewpoint there's this site:

https://www.codexon.com/posts/hash-functions-the-modulo-prime-myth

It contends that you should use the largest number of buckets possible, as opposed to rounding down to a prime number of buckets. It seems like a reasonable possibility. Intuitively, I can certainly see how a larger number of buckets would be better, but I'm unable to make a mathematical argument for this.

#7


Primes are unique numbers. They are unique in that the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself, of course) due to the fact that a prime is used to compose it. This property is used in hashing functions.

Given a string “Samuel”, you can generate a unique hash by multiplying each of the constituent digits or letters by a prime number and adding them up. This is why primes are used.

However, using primes is an old technique. The key here is to understand that as long as you can generate a sufficiently unique key, you can move to other hashing techniques too. For more on this topic, see https://www.azillionmonkeys.com/qed/hash.html

https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

#8


It depends on the choice of hash function.

Many hash functions combine the various elements in the data by multiplying them with some factors modulo the power of two corresponding to the word size of the machine (that modulus is free by just letting the calculation overflow).

You don't want any common factor between a multiplier for a data element and the size of the hash table, because then it could happen that varying the data element doesn't spread the data over the whole table. If you choose a prime for the size of the table such a common factor is highly unlikely.

On the other hand, those factors are usually made up from odd primes, so you should also be safe using powers of two for your hash table (e.g. Eclipse uses 31 when it generates the Java hashCode() method).

#9


Suppose your table size (or the number for modulo) is T = (B*C). Now if the hash for your input is like (N*A*B), where N can be any integer, then your output won't be well distributed, because every time N becomes C, 2C, 3C etc., your output will start repeating, i.e. your output will be distributed over only C positions. Note that C here is (T / HCF(table-size, hash)).

This problem can be eliminated by making HCF 1. Prime numbers are very good for that.

Another interesting case is when T is 2^N. The output is then exactly the lowest N bits of the input hash. Every number can be written as a sum of powers of 2, and taking it modulo T discards every power of 2 whose exponent is >= N, so the result always follows a specific pattern that depends only on the low bits of the input. This is also a bad choice.

Similarly, T as 10^N is bad as well because of similar reasons (pattern in decimal notation of numbers instead of binary).

So, prime numbers tend to give better distributed results, and hence are a good choice for table size.

#10


I'd like to add something to Steve Jessop's answer (I can't comment on it since I don't have enough reputation), as I found some helpful material. His answer is very helpful, but he made a mistake: the number of buckets should not be a power of 2. I'll just quote from the book "Introduction to Algorithms" by Thomas Cormen, Charles Leiserson, et al., page 263:

When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.
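
Both observations in the quoted passage can be reproduced with a short sketch. It assumes p = 8, so each ASCII character is one radix-256 digit; the function name `division_hash` and the sample anagrams are made up for this example.

```python
def division_hash(key: str, m: int) -> int:
    """Division method h(k) = k mod m, with the string read in radix 2^8."""
    k = int.from_bytes(key.encode("ascii"), "big")
    return k % m

anagrams = ["listen", "silent", "enlist"]

# m = 2^p - 1 = 255: since 256 mod 255 == 1, the hash is just the sum of
# the character codes mod 255, so permuting the characters changes nothing.
permuted = {division_hash(w, 255) for w in anagrams}
print(permuted)

# m = 2^p = 256: the hash is only the lowest p bits, i.e. the last character.
low_bits = [division_hash(w, 256) for w in anagrams]
print(low_bits, [ord(w[-1]) for w in anagrams])
```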

Hope it helps.

#11


Copying from my other answer https://stackoverflow.com/a/43126969/917428. See it for more details and examples.

I believe that it just has to do with the fact that computers work in base 2. Just think about how the same thing works for base 10:

  • 8 % 10 = 8

  • 18 % 10 = 8

  • 87865378 % 10 = 8

It doesn't matter what the number is: as long as it ends with 8, its modulo 10 will be 8.

Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.

#12


For a hash function, it's important not only to minimize collisions in general but also to make it impossible to keep the same hash while changing a few bytes.

Say you have an equation: (x + y*z) % key = x with 0<x<key and 0<z<key. If key is a prime number, n*y=key is true for every n in N and false for every other number.

An example where key isn't a prime: x=1, z=2 and key=8. Because key/z=4 is still a natural number, 4 becomes a solution for our equation, and in this case (n/2)*y = key is true for every n in N. The number of solutions for the equation has practically doubled because 8 isn't a prime.

If our attacker already knows that 8 is a possible solution for the equation, he can change the file from producing 8 to 4 and still get the same hash.

#13


I've read the popular WordPress article linked in some of the answers above. From what I've understood, I'd like to share a simple observation I made.

You can find all the details in the article here, but assume the following holds true:

  • Using a prime number gives us the "best chance" of a unique value

A general hashmap implementation wants 2 things to be unique.

  • Unique hash code for the key

  • Unique index to store the actual value

How do we get the unique index? By making the initial size of the internal container a prime as well. So basically, prime is involved because it possesses this unique trait of producing unique numbers which we end up using to ID objects and finding indexes inside the internal container.

Example:

key = "key"

value = "value"

uniqueId = "k" * 31 ^ 2 + "e" * 31 ^ 1 + "y"

maps to unique id

Now we want a unique location for our value, so we compute

uniqueId % internalContainerSize == uniqueLocationForValue , assuming internalContainerSize is also a prime.
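
The example can be worked through in Python. The container size of 13 is a hypothetical prime chosen for this sketch, and the function name `unique_id` is likewise made up; the character codes come from `ord`.

```python
def unique_id(key: str, k: int = 31) -> int:
    """'k' * 31^2 + 'e' * 31^1 + 'y' for the key "key", via Horner's rule."""
    h = 0
    for ch in key:
        h = h * k + ord(ch)
    return h

internal_container_size = 13  # hypothetical prime table size

uid = unique_id("key")
location = uid % internal_container_size
print(uid, location)
```

The location is only unique as long as no other stored key reduces to the same index; a prime size makes systematic clashes less likely but cannot rule collisions out.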

I know this is simplified, but I'm hoping to get the general idea through.
