Randomized algorithm for file comparison

Introduction

The next problem we use randomization to solve might seem a bit closer to one we might face in reality. It goes like this. Alice and Bob each have copies of the same file that they need to keep synchronized (call Alice’s file $A$ and Bob’s $B$). Over time, however, it’s possible that the Alice’s and Bob’s files get out of sync. Our task today is to come up with a protocol by which Alice and Bob can check that $A = B$ without one having to send the other his/her entire file.

The algorithm

In order to throw math at the problem the way we want to, we need to make it a bit more abstract. To this end, we stipulate that $A$ and $B$ are represented by $n$-bit strings. The comparison protocol works as follows:

Alice picks a prime $p \in \{2..n^2\lg n\}$ (fear not; we will explain the choice of this range soon). Because $A$ is an $n$-bit string, we can look at it as an $n$- bit binary integer. Alice computes $A \pmod p$ and sends Bob the prime $p$ and $A \pmod p$. (Note: computing $A \pmod p$ means the remainder of $A$ left over when we divide it by $p$.)
Bob computes $B \pmod p$. If $A = B \pmod p$ ($A$ and $B$ leave the same $p$- remainder), Bob outputs that the files are the same. Otherwise, he should conclude that the files he and Alice have are out of sync.

The question we need to answer is now: How confident can we be that the files are indeed the same when Bob says “same”? In other words, we want to know what the probability is that the algorithm errs.

Analysis

There are two cases to analyze here. If the files were the same to begin with, the algorithm will never fail. Mathematically, if $A = B$, then $A = B \pmod p$ so Bob will always output “same” in this case.

The interesting case to analyze is the case in which $A \neq B$. In this case, we are interested in

$$\Pr[A = B \pmod p ~|~ A \neq B].$$

In English, we are interested in the probability that Bob outputs “same” even when his and Alice’s files are not in sync. To analyze this probability, we need to entertain a quick tangent.

In particular, we need to motivate our preference that $p \in \{2..n^2\lg n\}$. There is a neat theorem in number theory called the Prime Number Theorem which states that in the range $\{2..N\}$, there are about $\frac{N}{\lg N}$ primes. We can show, using the theorem and some algebra that we need not bother ourselves with here, that there are about $n^2$ primes in the range from which we drew $p$ (start by substituting $N = n^2 \lg n$ in the theorem).

Keep this fact in mind. Let $C =	A - B	$. We can reformulate the probability of the
protocol failing as $\Pr[C = 0 \pmod p ~	~ C \neq 0]$.

Next, we note that because $A$ and $B$ both have $n$-bits, $C$ too must have $n$- bits. This means that $1 \leq C \leq 2^n$. Note that 2 is prime, so a nice feature of $C$ that we can observe is that $C$ has at most $n$ prime divisors. This is the key fact. Because we had $n^2$ primes from which to choose $p$ and there are (at most) $n$ bad choices among them (a choice is bad if $p$ divides $C$ and thus $A- B$, whence $p \mid A$, $p \mid B$), the probability that $p$ is a “bad prime” is at most $\frac{n}{n^2} = \frac{1}{n}$.

The probability of success is thus at least $1 - 1/n$, which is great because, intuitively, it means that as our strings get larger, the odds of the algorithm failing get very, very small.

Space complexity

The last minor thing we note is that the number of bits required for this protocol is only the number of bits required to represent $p$ and $A \pmod p$ (what Alice sends to Bob). The $A \pmod p$ is smaller than $p$, so we can write an upper bound on the total number of bits required as $2\lg p$ bits. Because $p$ is at most $n^2 \lg n$, we require $2 \lg(n^2\lg n) = \lg (n^4(\lg n)^2) = 4 \lg n + 2\lg(\lg n) = O(\lg n)$ bits to, with high probability, successfully compare the files. Our goal was to share a sublinear number of bits relative to the size of the file, so the protocol above achieves the desired aim.

Conclusion

You’ll note that if you go back and look at the matrix multiplication post I put up a little while ago, the analysis here and there are very similar. In each case, we had a problem in which we needed to compare two objects without comparing the entire objects to one another. In this case, the objects were strings; in the matrix case, the objects were matrices. In both cases, we devised schemes wherein we mapped the larger objects to smaller ones that were much easier to compare; although the representations of the smaller objects are lossy, we choose our mapping carefully so that the information we sacrifice only introduces a small probability of failure. The technique described above is called fingerprinting, and it is a very powerful tool used in the study and design of randomized algorithms.

Randomized algorithm for file comparison

Introduction

The algorithm

Analysis

Space complexity

Conclusion

Further Reading

How does OpenAI's DALL-E work?

Faster language model inference

RANSAC for robust data fitting

Trending Tags