Programming Puzzles & Code Golf Stack Exchange is a question and answer site for programming puzzle enthusiasts and code golfers. It's 100% free, no registration required.

Sign up
Here's how it works:
  1. Anybody can ask a question
  2. Anybody can answer
  3. The best answers are voted up and rise to the top

In this you will write a hash function in 140 bytes1 or less of source code. The hash function must take an ASCII string as input, and return a 24-bit unsigned integer ([0, 224-1]) as output.

Your hash function will be evaluated for every word in this large British English dictionary2. Your score is the amount of words that share a hash value with another word (a collision).

The lowest score wins, ties broken by first poster.

Test case

Before submitting, please test your scoring script on the following input:

duplicate
duplicate
duplicate
duplicate

If it gives any score other than 4, it is buggy.


Clarifying rules:

  1. Your hash function must run on a single string, not a whole array. Also, your hash function may not do any other I/O than the input string and output integer.
  2. Built-in hash functions or similar functionality (e.g. encryption to scramble bytes) is disallowed.
  3. Your hash function must be deterministic.
  4. Contrary to most other contests optimizing specifically for the scoring input is allowed.

1 I am aware Twitter limits characters instead of bytes, but for simplicity we will use bytes as a limit for this challenge.
2 Modified from Debian's wbritish-huge, removing any non-ASCII words.

share|improve this question
6  
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch's ? What the...? – Luis Mendo yesterday
3  
@DonMuesli en.wikipedia.org/wiki/Llanfairpwllgwyngyll (fun fact: that word is also in Jelly's built-in compression dictionary) – Martin Büttner yesterday
6  
I think you should disallow built-in dictionaries. – Dennis yesterday
2  
For reference: Taking the 24 MSB of SHA-512 would achieve a score of 6816. – Dennis yesterday
3  
Some back-of-the-envelope calculations: With D=340275 words and R=2^24 hash outputs, a random hash has an expected D^2/(2*R) = 3450 colliding pairs, some of which overlap. There are an expected D^3/(6*R^2) = 23 colliding triples and a negligible number of larger collisions, which means these triples are likely disjoint. This gives an expected 6829 words that share a hash value, ~70 in triples and the rest in pairs. The standard deviation is estimated at 118, so getting <6200 with a random hash is roughly a 5 sigma event. – xnor 14 hours ago

18 Answers 18

Python, 5333 4991

I believe this is the first contender to score significantly better than a random oracle.

def H(s):n=int(s.encode('hex'),16);return n%(8**8-ord('+%:5O![/5;QwrXsIf]\'k#!__u5O}nQ~{;/~{CutM;ItulA{uOk_7"ud-o?y<Cn~-`bl_Yb'[n%70]))
share|improve this answer
1  
Sorcery! def H(s):n=int(s.encode('hex'),16);return n%... saves 5 bytes, in case you can use them somehow... – Dennis yesterday
2  
@Dennis I could have used 5 bytes to make the string constant 5 bytes longer. But I would have to start over building the string constant from scratch if I change the length. And I am not sure those 5 bytes will give me enough improvement that it is worth starting over building the string. I have spend hours of CPU time optimizing the string constant already. – kasperd yesterday
    
@Dennis I guess a few extra bytes would give me the freedom to use some characters in the constant needing escaping. That way I could potentially use a few extra bytes without having to construct the string all over again. – kasperd yesterday
2  
If you want another byte, 2**24 == 8**8. – Anders Kaseorg yesterday

Python 2, 140 bytes, 4662 4471 4362 colliding words

def f(s):n=int(s.encode('hex'),16);return n%(8**8+3-60*ord('4BZp%(jTvy"WTf.[Lbjk6,-[LVbSvF[Vtw2e,NsR?:VxC0h5%m}F5,%d7Kt5@SxSYX-=$N>'[n%71]))

Inspired by the form of kasperd’s solution, obviously—but with the important addition of an affine transformation on the modulus space, and entirely different parameters.

share|improve this answer
    
+1 I am not giving up without a fight. But I think I have to stop optimizing my current solution and find another approach, because I am not going to beat you if I keep using my current approach to optimizing the parameters. I'll be back with an edit to my solution once I have beaten yours.... – kasperd yesterday
    
@kasperd: Awesome, bring it on. :-P – Anders Kaseorg yesterday
1  
@AndersKaseorg How do you find the string? – somebody 19 hours ago
    
@AndersKaseorg I managed to speed up my parameter search a lot. And I removed an obstacle which was causing my search to get stuck on suboptimal solutions. But I still couldn't make it any better than 4885. After some pondering on why I couldn't make it any further I suddenly realized what is wrong with my solution and how it can be fixed. Now the affine transformation in your solution makes perfect sense to me. I think the only way I can possibly catch up is by using an affine transformation myself. – kasperd 15 hours ago
1  
@kasperd: Very nice. When searching for a better string in n%(8**8-ord('…'[n%70])) without other parameter changes, I had only managed to get to 4995, so it looks like your new optimizer has caught up to mine. Now this gets more interesting! – Anders Kaseorg 14 hours ago

Python, 6446 6372


This solution achieves lower collision count than all previous entries, and it needs only 44 of the 140 bytes allowed for code:

H=lambda s:int(s.encode('hex'),16)%16727401
share|improve this answer
    
I think this is invalid. The output must be in the inclusive range [0, 2**24-1]. Or am I interpreting the instructions incorrectly? I'm pretty sure a hash for the range needs to be able to output all values in that range. – mbomb007 yesterday
2  
@mbomb007 orlp's own submission does %(2**24-1), so I think it might be good to ask for clarification – Sp3000 yesterday
10  
@mbomb007 The challenge says no such thing. It says the function must take an ASCII string as input and output an integer in that range. Regardless of which input you give my function, the output will be in that range. The mathematical definition of the word function does not require it to produce every allowed output. If that's what you wanted the mathematical term you'd be using was surjective function. But the word surjective was not used in the requirements. – kasperd yesterday
    
@kasperd My point is that maybe the OP meant that, whether the word was included or not, hence why I'm asking for clarification. In general, hash functions usually work that way. – mbomb007 yesterday
    
@mbomb007: There is no requirement that hash functions be surjective. For example, many memory-address-based hash functions can only produce multiples of some small power of 2 due to memory alignment, including the default object hash in older versions of Python. Many hash functions even have a smaller domain than codomain, so they couldn't be surjective anyway. – user2357112 yesterday

JavaScript (ES6), 6389

The hash function (105 bytes):

s=>[...s.replace(/[A-Z]/g,a=>(b=a.toLowerCase())+b+b)].reduce((a,b)=>(a<<3)*28-a^b.charCodeAt(),0)<<8>>>8

The scoring function (NodeJS) (170 bytes):

h={},c=0,l=require('fs').readFileSync(process.argv[2],'utf8').split('\n').map(a=>h[b=F(a)]=-~h[b])
for(w of Object.getOwnPropertyNames(h)){c+=h[w]>1&&h[w]}
console.log(c)

Call as node hash.js dictionary.txt, where hash.js is the script, dictionary.txt is the dictionary text file (without the final newline), and F is defined as the hashing function.

Thanks Neil for shaving 9 bytes off the hashing function!

share|improve this answer
    
Why the assignment to a? Also, instead of ((...)>>>0)%(1<<24) you can probably use (...)<<8>>>8. – Neil yesterday
    
@Neil Because alphabet, and I'm boring =P Also, nice bitwise mathing! That saved 7 bytes =) – Mwr247 yesterday
    
Good thing this isn't code golf, otherwise I'd have to ding you for the unused variable i as well. – Neil yesterday
    
@Neil Crap >_< I use that while testing some alternative hashing ideas and forgot to remove XD Yes, good thing this isn't a golf, though all the same, I'd love it if I could compress the hash and scoring functions into the same 140 bytes, so every bit helps ;) – Mwr247 yesterday
1  
@Sp3000 Gah, I see what you mean. Mine is not counting the ones that are in there initially when the collision is found. I'll fix that. – Mwr247 yesterday

CJam, 6273

{49f^245b16777213%}

XOR each character with 49, reduce the resulting string via x, y ↦ 245x + y, and take the residue modulo 16,777,213 (the largest 24-bit prime).

Scoring

$ cat hash.cjam
qN% {49f^245b16777213%} %N*N
$ all=$(wc -l < british-english-huge.txt)
$ unique=$(cjam hash.cjam < british-english-huge.txt | sort | uniq -u | wc -l)
$ echo $[all - unique]
6273
share|improve this answer
    
I reimplemented the algorithm in python from your description. I can confirm that your score checks out with the official score computation. – kasperd yesterday

CJam, 4125

0000000: 7b 5f 37 62 32 30 30 25 5f 22 51 f0 5b 0a  {_7b200%_"Q.[.
000000e: 25 c6 ca 91 94 3c 16 85 db 59 b7 d0 7c f3  %....<...Y..|.
000001c: 83 3c 8c 8a 07 94 20 10 10 f5 82 24 ab 14  .<.... ....$..
000002a: 33 28 ce 33 db ce 1b ac 99 5c 22 d4 a0 be  3(.3.....\"...
0000038: 2e 74 84 14 15 f7 14 10 d4 ca 60 2d 5a 59  .t........`-ZY
0000046: dc 49 b1 a6 21 1a 60 e2 0b a5 a7 52 36 35  .I..!.`....R65
0000054: 1d 75 03 49 b2 7c 57 14 a9 8f 25 4a c2 53  .u.I.|W...%J.S
0000062: 17 bd 59 78 59 f1 1b c1 53 f1 12 7b 69 22  ..YxY...S..{i"
0000070: 32 35 36 62 47 62 3d 32 2a 31 34 39 2b 40  256bGb=2*149+@
000007e: 62 38 33 38 37 33 25 32 30 30 2a 2b 7d     b83873%200*+}

This approach divides domain and codomain into 200 disjoint sets, and defines a different hash function for each pair.

The involved constants have been chosen more or less arbitrarily, and tweaking them will hopefully improve the score. I'll try to find better ones when I have time.

Scoring

$ echo $LANG
en_US
$ cat gen.cjam
"qN%{_7b200%_"
[5 1 15 0 5 11 0 10 2 5 12 6 12 10 9 1 9 4 3 12 1 6 8 5 13 11 5 9 11 7 13 0 7 12 15 3 8 3 3 12 8 12 8 10 0 7 9 4 2 0 1 0 1 0 15 5 8 2 2 4 10 11 1 4 3 3 2 8 12 14 3 3 13 11 12 14 1 11 10 12 9 9 2 2 13 4 10 0 11 14 2 14 7 4 8 4 1 4 1 5 15 7 1 4 1 0 13 4 12 10 6 0 2 13 5 10 5 9 13 12 4 9 11 1 10 6 2 1 1 10 6 0 14 2 0 11 10 5 10 7 5 2 3 6 3 5 1 13 7 5 0 3 4 9 11 2 7 12 5 7 1 4 10 9 8 15 2 5 4 10 12 2 5 3 1 7 11 13 5 9 7 8 5 9 15 1 1 11 12 1 5 3 15 1 1 2 7 11 6 9]
Gb256b:c`
"256bGb=2*149+@b83873%200*+}%$e`0f={1>},1bN"
$ cjam gen.cjam > test.cjam
$ cjam test.cjam < british-english-huge.txt
4125
share|improve this answer

Mathematica, 6473

The next step up... instead of summing the character codes we treat them as the digits of a base-151 number, before taking them modulo 224.

hash[word_] := Mod[FromDigits[ToCharacterCode @ word, 151], 2^24]

Here is a short script to determine the number of collisions:

Total[Last /@ DeleteCases[Tally[hash /@ words], {_, 1}]]

I've just tried all bases systematically from 1 onwards, and so far base 151 yielded the fewest collisions. I'll try a few more to bring down the score a bit further, but the testing is a bit slow.

share|improve this answer

Javascript (ES5), 6765

This is CRC24 shaved down to 140 Bytes. Could golf more but wanted to get my answer in :)

function(s){c=0xb704ce;i=0;while(s[i]){c^=(s.charCodeAt(i++)&255)<<16;for(j=0;j++<8;){c<<=1;if(c&0x1000000)c^=0x1864cfb}}return c&0xffffff}

Validator in node.js:

var col = new Array(16777215);
var n = 0;

var crc24_140 = 
function(s){c=0xb704ce;i=0;while(s[i]){c^=(s.charCodeAt(i++)&255)<<16;for(j=0;j++<8;){c<<=1;if(c&0x1000000)c^=0x1864cfb}}return c&0xffffff}

require('fs').readFileSync('./dict.txt','utf8').split('\n').map(function(s){ 
    var h = crc24_140(s);
    if (col[h]===1) {
        col[h]=2;
        n+=2;
    } else if (col[h]===2) {
        n++;
    } else {
        col[h]=1;
    }
});

console.log(n);
share|improve this answer
    
Welcome to Programming Puzzles & Code Golf! – Alex A. yesterday
    
...And thanks for the warm welcome @AlexA.! – binarymax yesterday

Python, 9310


Yeah, not the best, but at least it is something. As we say in crypto, never write your own hash function.

This is exactly 140 bytes long, as well.

F=lambda x,o=ord,m=map:int((int(''.join(m(lambda z:str(o(z)^o(x[-x.find(z)])^o(x[o(z)%len(x)])),x)))^(sum(m(int,m(o,x))))^o(x[-1]))%(2**24))
share|improve this answer

Matlab, 30828 8620 6848

It builds the hash by assigning a prime number to each ascii character/position combo and calculating their product for each word modulo the largest prime smaller than 2^24. Note that for testing I moved the call to primes outside into the tester directly before the while loop and passed it into the hash function, because it sped it up by about a factor of 1000, but this version works, and is self-contained. It may crash with words longer than about 40 characters.

function h = H(s)
p = primes(1e6);
h = 1;
for i=1:length(s)
    h = mod(h*p(double(s(i))*i),16777213);
end
end

Tester:

clc
clear variables
close all

file = fopen('british-english-huge.txt');
hashes = containers.Map('KeyType','uint64','ValueType','uint64');

words = 0;
p = primes(1e6);
while ~feof(file)
    words = words + 1;
    word = fgetl(file);
    hash = H(word,p);
    if hashes.isKey(hash)
        hashes(hash) = hashes(hash) + 1;
    else
        hashes(hash) = 1;
    end
end

collisions = 0;
for key=keys(hashes)

    if hashes(key{1})>1
        collisions = collisions + hashes(key{1});
    end
end
share|improve this answer
    
If you want to save space in your program, you don't have to convert your char to a double explicitly. Also you could use numel rather than length. Not sure what you'd do with all those extra bytes though! – Suever 7 hours ago

Python, 340053

A terrible score from a terrible algorithm, this answer exists more to give a small Python script that displays scoring.

H=lambda s:sum(map(ord, s))%(2**24)

To score:

hashes = []
with open("british-english-huge.txt") as f:
    for line in f:
        word = line.rstrip("\n")
        hashes.append(H(word))

from collections import Counter
print(sum(v for k, v in Counter(hashes).items() if v > 1))
share|improve this answer
1  
It could be useful to have the scoring code assert that the return value from the hash function is an integer in the permitted range. – kasperd 12 hours ago

Python, 6390 6376 6359

H=lambda s:reduce(lambda a,x:a*178+ord(x),s,0)%(2**24-48)

May be considered a trivial modification to Martin Büttner's answer.

share|improve this answer
    
I think this is invalid. The output must be in the inclusive range [0, 2**24-1]. Or am I interpreting the instructions incorrectly? I'm pretty sure a hash for the range needs to be able to output all values in that range. – mbomb007 yesterday
1  
@mbomb007 That's not true. If your function always outputs 4 it's still outputting in the range [0, 2**24-1]. The only thing that's not allowed is outputting any number not inside that range, e.g. -1 or 2**24. – orlp yesterday

C#, 6251 6335

int H(String s){int h = 733;foreach (char c in s){h = (h * 533 + c);}return h & 0xFFFFFF;}

The constants 533 and 733 889 and 155 give the best score out of all of the ones I've searched so far.

share|improve this answer

C++: 7112 6694 6483 6479 6412 6339 collisions, 90 bytes

I implemented a naïve genetic algorithm for my coefficient array. I'll update this code as it finds better ones. :)

int h(const char*s){uint32_t t=0,p=0;while(*s)t="cJ~Z]q"[p++%6]*t+*s++;return t%16777213;}

Test function:

int main(void)
{
    std::map<int, int> shared;

    std::string s;
    while (std::cin >> s) {
        shared[h(s.c_str())]++;
    }

    int count = 0;
    for (auto c : shared) {
        if ((c.first & 0xFFFFFF) != c.first) { std::cerr << "invalid hash: " << c.first << std::endl; }
        if (c.second > 1) { count += c.second; }
    }

    std::cout << count << std::endl;
    return 0;
}
share|improve this answer

Ruby, 9309 collisions, 107 bytes

def hash(s);require'prime';p=Prime.first(70);(0...s.size).reduce(0){|a,i|a+=p[i]**(s[i].ord)}%(2**24-1);end 

Not a good contender, but I wanted to explore a different idea from other entries.

Assign the first n primes to the first n positions of the string, then sum all prime[i] ** (ascii code of string[i]), then mod 2**24-1.

share|improve this answer

Java 8, 7054 6467

This is inspired by (but not copied from) the builtin java.lang.String.hashCode function, so feel free to disallow according to rule #2.

w -> { return w.chars().reduce(53, (acc, c) -> Math.abs(acc * 79 + c)) % 16777216; };

To score:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TweetableHash {
    public static void main(String[] args) throws Exception {
        List<String> words = Files.readAllLines(Paths.get("british-english-huge.txt"));

        Function<String, Integer> hashFunc = w -> { return w.chars().reduce(53, (acc, c) -> Math.abs(acc * 79 + c)) % 16777216; };

        Map<Integer, Integer> hashes = new HashMap<>();
        for (String word : words) {
            int hash = hashFunc.apply(word);
            if (hash < 0 || hash >= 16777216) {
                throw new Exception("hash too long for word: " + word + " hash: " + hash);
            }

            Integer numOccurences = hashes.get(hash);
            if (numOccurences == null) {
                numOccurences = 0;
            }
            numOccurences++;

            hashes.put(hash, numOccurences);
        }

        int numCollisions = hashes.values().stream().filter(i -> i > 1).reduce(Integer::sum).get();
        System.out.println("num collisions: " + numCollisions);
    }
}
share|improve this answer
    
@muddyfish could you check out the current version? i think ive accounted for 3-way-collisions and am still getting the same result. – Bewusstsein yesterday
    
This doesn't account for three-way collisions. If you replace hashes with Map<Integer, Integer> hashes = new HashMap<>() and then count the number of words for each hash, you can account for them correctly. – Peter Taylor yesterday
    
Your scoring still looks incorrect. I think to compute a correct score, you have to output numHashes + numCollisions. (Which I guess will put you close to my 6832 estimate for a random oracle.) – kasperd yesterday
    
Modify the grading portions to this: pastebin.com/nLeg4qut – Bent Nee Humor yesterday
    
yeah, fixed the grading and it looks like a much more reasonable value now, ty – Bewusstsein yesterday

Ruby, 6473 collisions, 129 bytes

h=->(w){@p=@p||(2..999).select{|i|(2..i**0.5).select{|j|i%j==0}==[]};c=w.chars.reduce(1){|a,s|(a*@p[s.ord%92]+179)%((1<<24)-3)}}

The @p variable is filled with all the primes below 999.

This converts ascii values into primes numbers and takes their product modulo a large prime. The fudge factor of 179 deals with the fact that the original algorithm was for use in finding anagrams, where all words that are rearrangements of the same letters get the same hash. By adding the factor in the loop, it makes anagrams have distinct codes.

I could remove the **0.5 (sqrt test for prime) at the expense of poorer performance to shorten the code. I could even make the prime number finder execute in the loop to remove nine more characters, leaving 115 bytes.

To test, the following tries to find the best value for the fudge factor in the range 1 to 300. It assume that the word file in in the /tmp directory:

h=->(w,y){
  @p=@p||(2..999).
    select{|i|(2..i**0.5). 
    select{|j|i%j==0}==[]};
  c=w.chars.reduce(1){|a,s|(a*@p[s.ord%92]+y)%((1<<24)-3)}
}

american_dictionary = "/usr/share/dict/words"
british_dictionary = "/tmp/british-english-huge.txt"
words = (IO.readlines british_dictionary).map{|word| word.chomp}.uniq
wordcount = words.size

fewest_collisions = 9999
(1..300).each do |y|
  whash = Hash.new(0)
  words.each do |w|
    code=h.call(w,y)
    whash[code] += 1
  end
  hashcount = whash.size
  collisions = whash.values.select{|count| count > 1}.inject(:+)
  if (collisions < fewest_collisions)
    puts "y = #{y}. #{collisions} Collisions. #{wordcount} Unique words. #{hashcount} Unique hash values"
    fewest_collisions = collisions
  end
end
share|improve this answer
1  
The score looks suspicious. Are you sure you are counting all the colliding words? Several previous answers mistakenly counted only one word for each colliding hash value. – kasperd yesterday
    
You may be right. I must consider how I counted and see if it is the same as your definition. I am counting how many words there are and subtracting how many unique hashcodes were generated. If words A and B get the same hashcode, is that one collision or two? I count it as one. – Paul Chernoch yesterday
1  
I didn't define the scoring function. I just copied it from the example answer posted by the same user who posted the challenge. Most answers have scores ranging between 6273 and 6848. There have been multiple answers each making the same mistake in the score computation leading to computing a score approximately half of what it should have been. (Exactly half the correct score if there are no cases of three colliding words.) – kasperd yesterday
1  
Yes, I made the same mistake. I will amend my answer later. Got to catch a bus. – Paul Chernoch yesterday
    
Fixed the scoring. – Paul Chernoch yesterday

Python, 6995 6862 6732

Just a simple RSA function. Pretty lame, but beats some answers.

M=0x5437b3a3b1
P=0x65204c34d
def H(s):
    n=0
    for i in range(len(s)):
        n+=pow(ord(s[i]),P,M)<<i
    return n%(8**8)
share|improve this answer

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.