Find matching phone numbers without actually knowing them

Question

We have a mobile app that, given two users, needs to let them see what common contacts they have based on their phone numbers. How can we do this in a cryptographically secure way and respecting the users' privacy (i.e. without sharing the numbers in plain-text between them or with a server)?

Our current solution is:

(On phone A) Generate a single random salt.
(On phone A) Take all phone numbers from phone A and generate a SHA512 hash with multiple rounds for each phone number using the salt generated from step 1. Something like sha512(sha512(sha512(phonenumber + salt) + phonenumber + salt) + phonenumber + salt).
(On phone A) Send these hashes along with the salt to phone B (via a server).
(On phone B) Repeat step 2 to generate hashes for its own phone numbers using the same salt generated on phone A.
(On phone B) Compare the two list of hashes - a matching hash means a matching phone number, and therefore a common contact.

Is this a flawed approach, prone to rainbow-tables/brute-force attacks, and if so is there any other solution that is more suitable? Maybe using bcrypt with a given salt is better than doing multiple rounds of sha512?

"Is this a flawed approach, prone to rainbow-tables/brute-force attacks"... In my country, a mobile phone number is made of 10 digits. Therefore, it would be pretty easy to build a list of all the 10^10 possible combinations. In fact, since 10^10<28^8, it would be easier than to build a list of all possible 8-bit passwords with 6 letters and 2 digits. — A. Darwin, 2 days ago
@A.Darwin I think you are right. Let's say we make a hash take 0.001 seconds to compute. For a phone list of 1000 contacts, it would take 1 second which is acceptable. But for brute-forcing 10^10 numbers it would only take 115 days, possibly much less on a powerful CPU. Maybe this approach can work for emails, though. — liviucmg, 2 days ago
@drewbenn - if you made your comments an answer, I'd upvote it — Neil Smithline, 2 days ago
I'm pretty sure what you are looking for is called "private information retrieval" (PIR) which is feasible for small data- and userbases but gets too hard to be done for larger ones. — SEJPM, 2 days ago
You mentioned via server. Could you just have a server do the comparison? — Paparazzi, yesterday

lserni · Accepted Answer · 2016-05-06 07:08:13Z

bcrypt would be a somewhat better approach because it is designed to be (programmably) slow.

Using a large enough salt and a reasonable complexityFactor, bcrypt(salt + number, complexityFactor) should yield a viable hash and you avoid "rolling your own cryptography", which could possibly turn out to be a difficult sell. To increase security you just crank up complexityFactor.

An attacker would now have to generate the bcrypt not only of every 10-digit phone number (which could be feasible: there are only 10¹⁰ numbers after all), but of every possible salted sequence. With a 10-character base64 salt (60 bits of entropy), the complexity increases of twenty orders of magnitude.

Brute forcing

Suppose you have 1,000 contacts . The CPU of your average phone seems to be two orders of magnitude slower than a server core array. I think it's reasonable to say that it will be three orders of magnitude slower than a semi-dedicated GPU implementation of bcrypt, which is said not to be so efficient.

So we tune bcrypt to take 100 milliseconds for each encoding. This means that we need 1 minute 40 seconds to generate our 1,000 hashes, which is a reasonable time for one thousand contacts (a progress bar seems in order). If the user only has 100 contacts, he'll be done in 10 seconds.

The attacker, given the salt of one number, has to generate perhaps 10⁸ numbers to reasonably cover the mobile number space (the first number, and possibly the first two, aren't really 10 or 100 -- I count them as 1). It will take three orders of magnitude less than 10⁸ times 100 milliseconds, i.e. than 10⁷ seconds. This is down to 10⁴ seconds, or around two hours and a half (or a whole day if the GPU optimization thingy turns out not to work).

In less than four months, the whole 1,000 contacts will have been decrypted - using one optimized server. Use ten such servers, and the attacker will be done in two weeks.

The problem, as pointed out by Ángel's answer and Neil Smithline's comments, is that the key space is small.

In practice user A will produce something (a hash block, or whatever) to be made available somehow to B. User B must have a method that works like

matches = (boolean|NumberArray) function(SomethingFromA, NumberFromB)

(little changes if the second parameter is a set of N numbers, since UserB can build a set using one true number and N-1 numbers known to be fake or not interesting. It can lengthen attack time by a factor of N).

This function works in a time T... actually this function must work in a time T short enough that user B, in a commercial real world application, is satisfied.

Therefore one bound we can't easily dodge is that M numbers must be checked in an acceptable time on an average smartphone. Another bound we can't reasonably dodge is that User B can supply fake numbers to the algorithm (i.e. people that aren't really contacts, and possibly do not even exist).

Both bounds also are enforced if the checker is on a third server; this only assures a lower execution time limit that can thwart some scenarios, such as "decrypt all UserA's numbers", but not others such as "verify who has this number", as in drewbenn's answer).

From these two bounds arise the fact that using a smartphone (or a third party server with minimum execution time enforcement), cycling through all 10⁸ reasonable numbers takes about 10⁸ smartphoneTime time, or on the order of one thousand months.

Attack strategies to decrease this time are distributing the attack between several verifiers, or running it on a non-throttled and faster server (this requires the algorithm being available, but assuming the contrary is security through obscurity), and they look both feasible and affordable.

A loophole?

One possibility, not easy to implement with a local algorithm, could be to introduce a small probability of false positives. I.e., the above oracle function will occasionally (say once every ten thousand contacts), and deterministically on UserA's input, return true to one of UserB's numbers.

This means that the brute force attack on 10⁸ numbers will yield UserA's contacts mingled with 10⁴ other numbers. Determinism on UserA's input means that two successive checks on these 10⁴ found items won't further whittle them. Unless UserB's can grab a different copy of UserA's input, which will yield a different set of false positives and allow filtering the true contacts as the intersection of the two sets, this may make the bruteforced answer less appealing. This comes at a cost - honest UserBs will have to get the occasional false hit.

Are you proposing a unique salt per phone number? If not, how is what you said different from what the OP proposes? — Neil Smithline, 2 days ago
I don't believe that you can crank the complexity high enough to provide any real protection. As the client is a relatively weak device (a mobile), any complexity high enough to stop an attacker using a server to crack the hashes will be way too slow for the mobile. — Neil Smithline, 2 days ago
Hi @NeilSmithline , since the common bcrypt representation already contains the salt as well as the complexity factor, and it has to be sent, it's no storage cost and very little CPU cost to randomly generate a new salt for each new number. Otherwise yes, what I'm proposing is exactly what the OP was already concluding by himself... I'm just sort of voting between the options he presented. (update) thanks for your second insight too, you're right that it is a concern that needs addressing. I'll update my answer. — lserni, 2 days ago
Separate salts greatly increases the complexity of doing comparison. Let's assume the received list has N phone numbers and my list has M phone numbers. I would need to run bcrypt on each of my M phone numbers using each of the N salts from the received list. That's M*N calls to bcrypt. That seems too costly. It was to avoid this computational explosion that led the OP to use only a single hash. — Neil Smithline, 2 days ago
I think it may be the case that there's not a hash-based solution. @Ángel's answer points to a solution by Wisper Systems, but it is non-trivial to understand, not to mention implement. You would really need a team of cryptographic developers to pull something like that together and have confidence that it is actually secure. — Neil Smithline, 2 days ago

drewbenn · Answer 2 · 2016-05-05 19:54:55Z

up vote 15 down vote

There are potentially other privacy issues you're not considering yet. By design your app makes it easy to see who is connected to a certain target. So an attacker creates one contact on their phone (the activist/informant/terrorist/victim they are interested in) and then connects to many other users through your app, to create a list of the target's contacts. So for example a DV abuser could use this app to make a list of people still in contact with his ex: even Google has had problems getting this right.

answered 2 days ago

drewbenn

1,508818

4

This doesn't answer the question but it's an excellent point. – TTT 2 days ago

@TTT it does not fully answer the question because there are multiple points to the question. It does address the main issue: Is this a flawed approach? A: yes. drewbenn even goes further and points a new facet where it fails. +1'd - - - - P.S.: It is very common on SE sites once an authoritative user has made his answer to place these and to add to @ userAbove's answer... And to me it is very constructive. – Mindwin yesterday

@Mindwin - I completely agree, which is why it earned my vote as well. – TTT yesterday

+1 This was pretty much my exact thought when I read the question. Although.... it could just as easily be "her ex" – Justin Ohms yesterday

add a comment |

Ángel · Answer 3 · 2016-05-05 18:56:32Z

up vote 10 down vote

Yes, it is (a bit) flawed. The problem is that the space is too small, so even with the multiple rounds and salts, it's relatively easy to bruteforce.

Open Whisper Systems had a witty system where they provided an encrypted bloom filter that can be queried locally using blind signatures. They explain the process (as well as providing a good discussion of the private information retrieval problems) at https://whispersystems.org/blog/contact-discovery/

Sadly, they had to discontinue this on TextSecure due to practical issues with a big user base. In your case, as you are sharing numbers between two final users, it should be feasible, either with their same method or using another protocol like those published ones that are mentioned by Moxie.

answered 2 days ago

Ángel

5,0151725

1

That's a very cool solution. – Neil Smithline 2 days ago

There was another implementation called LOAF, but its site is currently down (loaf.cantbedone.org) so I fear it may be abandonware. It might be possible to find descriptions of it, though. It was by Maciej Ceglowski, Joshua Schachter, and Peter Gadjokov. Maciej has a semi-active blog so he can probably be questioned for details if required ;-) – Steve Jessop yesterday

@SteveJessop archive.org to the rescue! It appears to use a Bloom filter. – drewbenn yesterday

+1 Bllom filter was also whyt came to my mind as possible improvement. As in: Both ends send bloom filters (based on individual hash function) matching their contacts and with a false positives rate witin predetermined bounds, after that they exchange the underlying hash functions, and finally they may perform (encrypted) Q&A sessions where one ends asks about a specific number and the other replies yes or no, but refuses to answer if the queried number does not match the opponents Bloom filter – Hagen von Eitzen 22 hours ago

add a comment |

TTT · Answer 4 · 2016-05-05 22:15:36Z

up vote 5 down vote

How can we do this in a cryptographically secure way and respecting the users' privacy (i.e. without sharing the numbers in plain-text between them or with a server)?

tldr: You can't.

Hashing is great for certain uses, but this is probably not one of them. The reason is that an attacker would know that there are only 10 billion possibilities (for 10 digit phone numbers), and this makes it too easy to brute force any discovered hashes.

Instead, you could accomplish what you want if:

one of the phones is willing to trust the other. One phone shares it's contact list with the other, and the other one does the compare. The second phone does not need to share its list with the first.
you are willing to use a trusted mediator, i.e. a server. This is the best approach from the users point of view because neither party needs to share their contact list with the other. Both would share their lists with the mediator and it would report back with the matches, and promise not to save anything. Communication with each involved party would use standard public/private encryption key pairs, and no one's data would be at risk. Of course this assumes your users will trust you when you promise not to retain their contact lists. But if you can get your users to trust you enough to install your app in the first place, then you've already earned their trust.

edited 2 days ago

answered 2 days ago

TTT

3,1872517

1

Note that the two parties don't necessarily need to trust the mediator. Let both parties together pick a One Time Pad (each picks a random sequence, XOR'ed together), and then both XOR each number with this OTP. (Remember, it's an OTP - it must be long enough for all numbers!) The mediator receives two lists, and returns the intersection. (Bytes 17-27 match, bytes 36-46 match etc). The mediator provably cannot know what those bytes mean, lacking the OTP. – MSalters 2 days ago

1

@MSalters For this to work you would need the same number beeing stored at the same position. So there must be an algorithm which says which phonenumber is stored at which position. If this algorithm is known to server, the server might get to know a match at bytes 17-27 means phonenumber +123456789. – H. Idden yesterday

@HIdden no, the numbers are not concatenated and then crypted, but crypted before concatenation. In fact it isn't concatenation into a larger opaque stream at all, the boundaries are told to the server, so it can evaluate all pairwise comparisons. – Ben Voigt yesterday

@MSalters - what is the point of each party choosing their own OTP and XOR'ing the two together? Since they then have to use the XOR'ed OPT they could easily figure out the other party's own OTP, so couldn't one just generate the OPT and pass it to the other? – TTT yesterday

@MSalters - so you're saying that the same OTP would be used to encrypt each number? If yes, then that is no longer a One-Time-Pad since it is being re-used. My gut tells me that with enough phone numbers it should be possible for the mediator to determine what the pad is. – TTT yesterday

| show 5 more comments

drewbenn · Answer 5 · 2016-05-06 17:52:23Z

Let's do some tests!

I started with a naive bash implementation, and calculated 10k numbers in 33 seconds:

#!/bin/bash

phone="2125551212"
salt="abcdefghijklmnopqrstuvwxyz"

shasalt() { echo "$* $phone $salt" | sha512sum; }

for f in {1..10000}
do
    shasalt $(shasalt $(shasalt)) >/dev/null # or write to a file...
    ((phone++))
done
echo $phone>&2

Then using a SHA512 C++ algorithm I found online I wrote some ugly sample C++ code. I generated hashes for an entire area code in 84 seconds, or about 160 seconds when I wrote them out to a file (generating a 1.4GiB rainbow table):

#include "sha512.hh"
#include <iostream>
#include <sstream>
#include <string.h>
using namespace std;

int main(int argc, const char* argv[])
{
  char salt[10] = "liviucmg";
  int x = 2120000000; // won't work above 2^31 i.e. past area code 213
  char y[21];
  snprintf(y, 21, "%010d,%s", x, salt);
  const int len = strnlen(y, 21);

  for (int i = 0; i < 10000000; i++, x++) {
      snprintf(y, 21, "%010d,%s", x, salt);
      //cout << y << "," << sw::sha512::calculate(y, len) << endl;
      string FIRST(sw::sha512::calculate(y, len));
      string FIRST_GO(FIRST + y);
      string SECOND(sw::sha512::calculate(&FIRST_GO, 1+FIRST_GO.length()));
      string SECOND_GO(SECOND + y);
      // uncomment this line to generate a hash table:
      //cout << y << "," << sw::sha512::calculate(&SECOND_GO, 1+SECOND_GO.length()) << endl;
  }
  return 0;
}

This was all on my 3-year-old laptop, which can be found online in a similar configuration for about $300. So for a pretty minimal expense an attacker could brute force any 10-digit number and unique salt combination in about a day (less if they only tested real area codes), or all the interesting numbers (those in the target's area code and physically adjacent area codes) in about 5 minutes. I didn't try to tune (or debug!) my test code, so I imagine those times could be cut in half by tuning the code. Running it on better hardware like a graphics card or FPGA would also cut down the time significantly: I would assume any number+salt could be brute-forced in about an hour or less by any competent attacker.

Fun fact: Brazil international phone numbers can begin with 555. (country code 55, area code starting with a 5). So a 12 or 13 digit phone number beginning with 555 can be real. — Mindwin, yesterday

H. Idden · Answer 6 · 2016-05-06 10:06:52Z

With Isemis mention of "probability of false positives" I thought about Zero-knowledge proof. This answer makes no claims to be secure as it was never reviewed, so others should review and comment it. I am no professional security expert either and I didn't have the time to make sure the low number of possible phone numbers might be a problem.

User A and User B connect to each other (directly or via server) and assign a random identifier (RI) to each contact their contacts and stores it in a list (LA and LB) together with the phonenumber padded with the nonce and after applying a Key derivation function. LA stays at A and LB stays at B.
User A proves one round of knowlege (see Zero-knowledge proof) for each of the numbers in list LA together with RI. User B has to find out by bruteforce to which of the contacts in LB each RI of A might fit. Each contact without fit is dropped out of LB. Possible fits are stored in the list for next round to improve bruteforce speed.
User B proves one round of knowlege for each of the numbers in list LB together with RI. User A has to find out by bruteforce to which of the contacts in LA each RI of B might fit. Each contact without fit is dropped out of LA. Possible fits are stored in the list for next round to improve bruteforce speed.
Repeat step 2 and 3 often enough to be statistically secure that each one has proven each of the numbers to the other user.
Optional: exchange the remainings in lists LA and LB in a secure way mentioned by others or via a secure channel between User A and User B to be sure the matching of RIs was correct.

With this algorithm the server or an observer never learns any relevant information about the contacts of A or B except their count and how many they have in common. User A and B only learn the same information as the server and the common contacts.

Since the first steps of the algorithm are exponential and ressource intensive one should include protection against DOS in implementations.

Aurora · Answer 7 · 2016-05-06 07:57:24Z

Why hash them? Why not encrypt instead, and send them to your own server. That way neither client device has to have access to each other's contacts, and the server does most of the work.

This does introduce a single point of failure though, the server. If it was compromised, the attackers could potentially access all numbers. This could be controlled if nothing is really stored and everything is done in memory.

NikoNyrh · Answer 8 · 2016-05-07 11:59:44Z

This method should have following properties:

Probability of false positives can be made as small as desired
Server learns only the (approximate) number of items on each person's phone book, but not numbers themselves and cannot brute-force them
Client-side brute-force attacks are impossible, as server can enforce policies against them
Phones do not learn the number of other phone's contacts

Suggested steps:

Phones A and B generate a shared secret S by using a Key-agreement protocol (authentication would be nice here to defend against MitM)
Bloom filter size and number of hashes can be hard-coded, protocol could enforce that at max 10.000 phone numbers are allowed / person (to midigate against hash collisions and abuse)
S is used to salt hashes of Bloom filter by some suitable method
A and B store their phone numbers to a Bloom filter and submit them to the server
Server confirms that estimated cardinality of filters isn't too high, nobody can claim to know all possible numbers
Server calculates the intersection of sets and sends the result to A and B
Now A and B know (fairly accurately) which numbers they have in common by checking which numbers went to intersected bins.

Actually the Bloom filter could be replaced by submitting plain SHA-256 hashes of numbers to the server and matching hashes can be submitted to clients. This simplifies the logic a lot and has lower chance for false positives.

asked	2 days ago
viewed	3231 times
active	today

current community

your communities

more stack exchange communities

Find matching phone numbers without actually knowing them

Our current solution is:

8 Answers 8

Brute forcing

A loophole?

Your Answer

Not the answer you're looking for? Browse other questions tagged cryptography hash privacy brute-force salt or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

Find matching phone numbers without actually knowing them

Our current solution is:

8 Answers 8

Brute forcing

A loophole?

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged cryptography hash privacy brute-force salt or ask your own question.

Linked

Related

Hot Network Questions