<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Computer-Science on Jack Gindi</title><link>https://www.jgindi.me/tags/computer-science/</link><description>Recent content in Computer-Science on Jack Gindi</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sun, 17 Nov 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://www.jgindi.me/tags/computer-science/index.xml" rel="self" type="application/rss+xml"/><item><title>A bound on sorting performance</title><link>https://www.jgindi.me/posts/2024-11-17-sorting-bound/</link><pubDate>Sun, 17 Nov 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-11-17-sorting-bound/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>If you wake any programmer up in the middle of the night and ask them to name an algorithm, a sizable fraction would probably invoke some kind of sorting procedure. Some might name &lt;a href="https://en.wikipedia.org/wiki/Quicksort#:~:text=Quicksort%20is%20a%20divide%2Dand,or%20greater%20than%20the%20pivot.">quicksort&lt;/a>, some &lt;a href="https://en.wikipedia.org/wiki/Merge_sort">merge sort&lt;/a>, still others &lt;a href="https://en.wikipedia.org/wiki/Insertion_sort">insertion sort&lt;/a>, and some might troll you by naming &lt;a href="https://en.wikipedia.org/wiki/Bogosort">Bogosort&lt;/a>.&lt;/p>
&lt;p>The first three of those algorithms are all what are known as comparison-based sorts: they work by comparing elements and making decisions based on the results of those comparisons. In this post, I want to talk about a lower bound on the efficiency of comparison-based sorting algorithms. In other words, I want to show that if you invented a new comparison-based sorting algorithm, then even without knowing how it works, I could tell you the best worst-case runtime it could conceivably achieve (as a function of the input size).&lt;/p>
&lt;p>To get a better sense for what I mean, let&amp;rsquo;s dive in.&lt;/p>
&lt;h1 id="comparison-based-sorting">Comparison-based sorting&lt;/h1>
&lt;p>To understand what we mean by comparison-based sorting, let&amp;rsquo;s walk through one of the algorithms I mentioned earlier: merge sort.&lt;/p>
&lt;p>Merge sort essentially works by sorting the first half of the input, sorting the second half, and then merging the two sorted results. But how do we sort the first and second halves? We sort the first half of the first half, sort the second half of the first half, then merge them. And so on and so forth. In order for this recursive process to work, though, the process has to bottom out, right? Right! It bottoms out when a &amp;ldquo;half&amp;rdquo; is empty or has one element, since empty and singleton lists are (trivially) sorted.&lt;/p>
&lt;p>To show this with an example, let&amp;rsquo;s say we start with the input list [1, 3, 8, 4, 5, 2, 6, 9, 7]. We would&lt;/p>
&lt;ol>
&lt;li>Split the list into two halves: [1, 3, 8, 4] and [5, 2, 6, 9, 7].&lt;/li>
&lt;li>Split the first half into two halves: [1, 3] and [8, 4].&lt;/li>
&lt;li>Split [1, 3] into two halves: [1] and [3].&lt;/li>
&lt;li>Each of [1] and [3] is sorted, so we merge them into [1, 3].&lt;/li>
&lt;li>Split [8, 4] into two halves: [8] and [4].&lt;/li>
&lt;li>Each of [8] and [4] is sorted, so we merge them into [4, 8].&lt;/li>
&lt;li>Now we merge [1, 3] and [4, 8] into [1, 3, 4, 8].&lt;/li>
&lt;li>Carry out the same recursive process for [5, 2, 6, 9, 7] to get [2, 5, 6, 7, 9].&lt;/li>
&lt;li>Merge [1, 3, 4, 8] with [2, 5, 6, 7, 9] to get the final result: [1, 2, 3, 4, 5, 6, 7, 8, 9].&lt;/li>
&lt;/ol>
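&lt;p>The walk-through above can be turned into a short Python implementation (a sketch of my own, kept simple rather than optimized):&lt;/p>

```python
def merge(left, right):
    # Merge two already-sorted lists into one sorted list.
    # Every element comparison that merge sort performs happens here.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two extends
    merged.extend(right[j:])  # actually appends anything
    return merged

def merge_sort(xs):
    # Base case: empty and singleton lists are (trivially) sorted.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

print(merge_sort([1, 3, 8, 4, 5, 2, 6, 9, 7]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```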
&lt;p>The &amp;ldquo;comparisons&amp;rdquo; in merge sort occur in the merging stage, which we won&amp;rsquo;t go into detail about here. Now that we&amp;rsquo;ve seen one example of a comparison-based sort, we can turn to thinking about sorting more generally using decision trees.&lt;/p>
&lt;h1 id="performance-bound">Performance bound&lt;/h1>
&lt;p>So how can we possibly say anything important about the efficiency of a whole class of algorithms without considering every possible implementation?&lt;/p>
&lt;p>First, let&amp;rsquo;s suppose we have some input list of size $n$. The &lt;em>indices&lt;/em> of this list &amp;ndash; i.e., the numbers $1, \dots, n$ &amp;ndash; have $n! = n \cdot (n-1) \cdot (n-2) \cdot \dots \cdot 3 \cdot 2 \cdot 1$ possible orderings, exactly one of which puts the &lt;em>elements&lt;/em> in sorted order. We want to say something about the minimum number of comparisons required to find this ordering.&lt;/p>
&lt;p>One way to think about this sorting problem is to use the abstraction of a decision tree. To make this more specific, the leaf nodes (the nodes at the bottom of the tree) each represent one possible ordering of the list. The other nodes (called internal nodes) represent comparisons between elements at different indices of the list. An example of this tree is shown in the image below (&lt;a href="https://genome.sph.umich.edu/w/images/b/b8/Biostat615-lecture6-presentation.pdf">source&lt;/a>):&lt;/p>
&lt;img src="https://www.jgindi.me/posts/sorting-bound/tree.png" width="550" height="350"/>
&lt;p>Each of the ovals represents a comparison between the elements at two &lt;em>indices&lt;/em> of the array. To understand how to read this tree, let&amp;rsquo;s say that our input array is called $A$. At the root node of the tree, if $A[1] \leq A[2]$, then we would proceed to take the left branch and compare $A[2]$ with $A[3]$. If $A[2] \leq A[3]$, then we would take the left branch again and reach the leftmost leaf, which would indicate that $A$ was already in sorted order. With other input orderings, though, the index order that results in $A$ being sorted might be some other leaf, which we could similarly determine by doing a bunch of comparisons. (The key to avoiding confusion here is to remember that the numbers in the ovals are &lt;em>indices&lt;/em>, not the actual elements of the input list.)&lt;/p>
&lt;p>Now, if there are $n!$ possible orderings of $A$, then there must be $n!$ leaves in the tree. Furthermore, we know that the length of the longest root-to-leaf path is the largest possible (worst-case) number of comparisons we would need to do to get our sorted order. Thus, in order to understand the best possible worst-case performance of our comparison-based sort, we want to find &lt;strong>the length of the longest possible root-to-leaf path in this decision tree.&lt;/strong>&lt;/p>
&lt;p>Let&amp;rsquo;s suppose that the algorithm always completes after $h$ steps. Another way of stating what we want is to say we&amp;rsquo;re looking for a lower bound on $h$. With $h$ comparisons, we can distinguish between at most $2^h$ orderings, since each comparison has two possible outcomes and the &lt;em>indices&lt;/em> are distinct (even if the elements aren&amp;rsquo;t). In order to guarantee that we find the sorted order, those at most $2^h$ distinguishable outcomes must cover all $n!$ possible orderings. In other words, we need this inequality to hold:
&lt;/p>
$$
2^h \geq n!.
$$&lt;p>Taking the log (base 2, as is customary in computer science) of both sides, this can be rewritten as
&lt;/p>
$$
h \geq \log(n!).
$$&lt;p>That&amp;rsquo;s great!&amp;hellip; But what is $\log(n!)$? On the one hand, $n!$ is huge, but on the other, maybe the $\log$ tames it? Well, we know that $n! \geq n(n-1)\dots (n/2) \geq (n/2)^{n/2}$, so we can rewrite our inequality again as
&lt;/p>
$$
h \geq \log(n!) \geq \log((n/2)^{n/2}) = \frac{n}{2}\log \biggl( \frac{n}{2} \biggr).
$$&lt;p>
(The equality holds because of a property of logarithms: $\log(a^b) = b\log(a)$.) Ignoring constants, we get that $h$ is bounded below by $n \log n$. To put it in a way that underscores how cool this proof is, what we&amp;rsquo;re saying here is that no comparison sort can work using a worst-case number of comparisons that is (ignoring constants) smaller than $n \log n$. Again, we did this without looking at any particular implementations!&lt;/p>
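&lt;p>We can sanity-check the chain of inequalities numerically for a small $n$ (a quick illustration, not part of the proof):&lt;/p>

```python
import math

n = 16
exact = math.log2(math.factorial(n))  # log2(n!), about 44.25 for n = 16
bound = (n / 2) * math.log2(n / 2)    # (n/2) * log2(n/2) = 8 * 3 = 24
assert exact >= bound                 # log2(n!) really does dominate the bound
```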
&lt;h1 id="coda-can-we-do-better">Coda: can we do better?&lt;/h1>
&lt;p>There&amp;rsquo;s one question left to answer: What if we relax the requirement that our algorithm be based on comparisons? Can we achieve a better worst-case performance than $n \log n$?&lt;/p>
&lt;p>The answer is yes, and if our inputs follow a couple of additional (important) assumptions, we can do it with a pretty simple algorithm at that. If the elements of the input list are nonnegative integers that take values up to some maximum $M$, we can use the following algorithm:&lt;/p>
&lt;ol>
&lt;li>Create a list $C$ of zeros of size $M + 1$ (one slot for each possible value $0, \dots, M$). The counting array $C$ effectively acts as a frequency table for the input, where $C[j]$ holds the count of occurrences of the integer $j$.&lt;/li>
&lt;li>For each element of the array, increment the count at that element&amp;rsquo;s index. In other words, if you see a 5, increment $C[5]$.&lt;/li>
&lt;li>Once you&amp;rsquo;ve iterated through the entire list, iterate over $C$ and add $C[j]$ copies of $j$ to the output list.&lt;/li>
&lt;/ol>
&lt;p>This algorithm, called counting sort, is probably the most famous non-comparison sort. If the initial array has $n$ elements and $C$ has size $M$, then the algorithm takes $n + M$ steps to complete (ignoring constants). This can be good or it can be bad. It can be awesome if $M$ is not too much larger than $n$, since then (ignoring constants again), it would take approximately $n$ steps, which is faster than our comparison-sort lower bound of $n \log n$. If $M$ is very large, however, say $M \approx n^2$, then we&amp;rsquo;ve lost our performance edge. This algorithm also doesn&amp;rsquo;t work with non-integers, which makes it less generally applicable than we&amp;rsquo;d like.&lt;/p>
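&lt;p>Here is a minimal sketch of counting sort under those assumptions (nonnegative integers bounded by a known maximum $M$):&lt;/p>

```python
def counting_sort(xs, M):
    # Frequency table: C[j] counts occurrences of the value j.
    # M + 1 slots so that the maximum value M has a valid index.
    C = [0] * (M + 1)
    for x in xs:
        C[x] += 1
    # Emit C[j] copies of each value j, in increasing order of j.
    out = []
    for j, count in enumerate(C):
        out.extend([j] * count)
    return out

print(counting_sort([1, 3, 8, 4, 5, 2, 6, 9, 7], M=9))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```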
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>In this post, we started with a hands-on example, established theoretical bounds using a decision tree formulation of the sorting problem, and finally explored how changing our assumptions about the inputs can unlock faster algorithms.&lt;/p>
&lt;p>Sorting algorithms are a cornerstone of computer science, and understanding their limits helps us appreciate their clever design and implementation. The balance between theory and practice highlights the necessity of mathematics to the design of efficient algorithms that power our world.&lt;/p></description></item><item><title>Simulated annealing</title><link>https://www.jgindi.me/posts/2024-03-08-sim-ann/</link><pubDate>Fri, 08 Mar 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-03-08-sim-ann/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Optimization problems are everywhere!&lt;/p>
&lt;p>Whether it&amp;rsquo;s finding the most efficient way to deliver packages to customers,
determining the best next move in a game of chess, or figuring out how to adjust the parameters of a
gigantic machine learning model, many important practical problems are, at their cores, optimization problems.
In this post, we will learn an optimization meta-algorithm called &lt;strong>simulated annealing&lt;/strong>, a
general approach to (approximately) finding global solutions to optimization problems&amp;hellip; which is,
interestingly, inspired by a physical process from material science.&lt;/p>
&lt;h1 id="simulated-annealing-overview">Simulated annealing: overview&lt;/h1>
&lt;h2 id="annealing">Annealing&lt;/h2>
&lt;p>Before discussing its algorithmic analog, we should sketch out what annealing is and how it works.
Annealing is a process that alters the physical and chemical properties of a metal so that it can be worked
more easily. It begins by heating the metal above its recrystallization temperature so that it
enters a state in which its internal structure can change more freely. We then slowly cool the metal to allow it
to settle into a chemically superior state.&lt;/p>
&lt;h2 id="relationship-to-optimization">Relationship to optimization&lt;/h2>
&lt;p>One way to solve an optimization problem is to carry out the following iterative process:&lt;/p>
&lt;ol>
&lt;li>Start in some state (e.g., a chessboard just after it has been set up).&lt;/li>
&lt;li>Transition from the current state to the new state that most decreases the value of
an objective function you want to minimize (e.g., make a move that decreases your probability of losing).&lt;/li>
&lt;li>Repeat step 2 until a stopping condition is met (e.g., the game ends).&lt;/li>
&lt;/ol>
&lt;p>Simulated annealing modifies this process. It would instead look something like this:&lt;/p>
&lt;ol>
&lt;li>Start in an initial state.&lt;/li>
&lt;li>Sample a random candidate state to transition to.&lt;/li>
&lt;li>With some probability that depends on the current and candidate state, accept the candidate
transition. Otherwise, stay put.&lt;/li>
&lt;li>Repeat steps 2 and 3 until a stopping condition is met.&lt;/li>
&lt;/ol>
&lt;p>The analogy to physical annealing comes from the fact that step 3 depends on a parameter called the temperature.
In physics, the higher the temperature of a system, the more jittery the system is &amp;ndash; that is, the
more random motion there is among its constituent particles. Early in the optimization process, we set a
high temperature; this allows the algorithm to explore by accepting riskier transitions, i.e., those that result in a
higher (worse) objective value than that of our current state. As the annealing progresses, we lower the temperature;
this causes the optimization process to become more conservative. Eventually, the space of acceptable
next states will contain only those that are better than the current state (in terms of objective value).&lt;/p>
&lt;h2 id="global-vs-local-optimization">Global vs local optimization&lt;/h2>
&lt;p>One question you might ask is: Why bother with the high temperature phase at all? If low temperatures
will allow the algorithm to only move toward better solutions, why not always make those kinds of moves?
The key to answering this question lies in the difference between globally and locally optimal solutions
to a problem. Suppose you are on a quest to see the view from the highest point in San Francisco
(which has lots of hills). If you only ever step in the direction of steepest ascent from where you are,
you will reach the top of &lt;em>some&lt;/em> hill, but it&amp;rsquo;s possible that in order to reach the top of the &lt;em>highest&lt;/em> hill,
you should have walked downhill for a while in another direction first and only then started to ascend.
The hill whose acme you reached by steepest ascent &amp;ndash; a local optimum &amp;ndash; is not very difficult to find.
By contrast, finding the true tallest peak in San Francisco &amp;ndash; the global optimum &amp;ndash; is trickier.&lt;/p>
&lt;p>For many problems &amp;ndash; like finding the best set of parameters for a machine learning model &amp;ndash; local
optima work very well, and in many cases we have powerful algorithms for efficiently finding them.
Finding global optima, on the other hand, is far more challenging in general, and good algorithms
are scarcer, if they exist at all. Simulated annealing is a probabilistic strategy for searching for
&lt;em>global&lt;/em> optima by exploring aggressively enough early on to find the base of the right hill.&lt;/p>
&lt;p>In the remainder of this post, we will more explicitly discuss how we carry out steps 2 and 3, and show how
we might apply this meta-algorithm to the &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">traveling salesman problem&lt;/a>,
one of the most difficult discrete optimization problems we have.&lt;/p>
&lt;h1 id="acceptance-probability">Acceptance probability&lt;/h1>
&lt;p>The key detail that I want to make more precise is how we formulate the probability of accepting a
transition from one state to another. In terms of notation, $\mathcal S$ is the entire state
space, $s \in \mathcal S$ refers to the current state, $s' \in \mathcal S$ refers
to the candidate state, $E:\mathcal S \to \mathbb R$ is the objective function (lower is better) that we want
to minimize, and $T_k$ is the temperature parameter value on the $k$th step (a real number). Our objective
is to find the state $s^\star$ that minimizes $E$. In other words, we want to find
&lt;/p>
$$
s^\star := \underset{s \in \mathcal S}{\text{argmin}} ~ E(s).
$$&lt;p>
(The &amp;ldquo;$:=$&amp;rdquo; symbol means that the right hand side is the definition of $s^\star$, rather than some
equation to prove or solve.)&lt;/p>
&lt;p>For simplicity, define $e = E(s)$ and $e' = E(s')$. If we are currently in state $s$
and $e' &lt; e$, we automatically transition to $s'$. If $e' > e$, then we transition
with probability
&lt;/p>
$$
P_{\rm acc}(e, e'; T_k) = \exp(-(e' - e) / T_k).
$$&lt;p>
(Note: This is not a probability distribution over states. Instead, here, $P_{\rm acc}$ is used to make a decision about
whether to transition to a &lt;em>particular&lt;/em> successor state. For this purpose, after we compute
$p = P_{\rm acc}(e, e'; T_k)$, we can use a random number generator to generate a random
number $r$. If $r &lt; p$, we transition. To sample a state from the entire state space, we would need the
transition probabilities for each possible transition to sum to 1.)&lt;/p>
&lt;p>Let&amp;rsquo;s take a minute to think through why this acceptance probability works the way we want it to:&lt;/p>
&lt;ul>
&lt;li>Since we would have automatically accepted if $e' &lt; e$, we can assume that $e' - e > 0$. If
this difference is large, the negative sign and the exponential around it make
$P_{\rm acc}$ very small. This means that the probability of accepting a transition
decays exponentially for less desirable candidates.&lt;/li>
&lt;li>Decreasing the value of $T_k$ (as $k$ increases) causes the exponent to become large and negative,
producing probabilities close to 0. This means that as we run more steps and decrease the temperature,
the same differences in objective value will become less and less acceptable. This aligns with the
intuition that as the temperature decreases, the optimization process becomes more conservative.&lt;/li>
&lt;/ul>
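&lt;p>The acceptance rule fits in a few lines of Python (a sketch; the function names are mine, chosen to match the $P_{\rm acc}$ notation above):&lt;/p>

```python
import math
import random

def p_acc(e, e_new, T):
    # Probability of accepting a candidate: 1 for improvements,
    # exp(-(e_new - e) / T) for moves to a worse (higher) objective value.
    if e_new <= e:
        return 1.0
    return math.exp(-(e_new - e) / T)

def accept(e, e_new, T):
    # Accept the transition if a uniform random draw falls below p_acc.
    return random.random() < p_acc(e, e_new, T)
```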
&lt;h1 id="application-the-traveling-salesman-problem-tsp">Application: The Traveling Salesman Problem (TSP)&lt;/h1>
&lt;p>If you&amp;rsquo;ve never heard of the traveling salesman problem, check out &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">this wikipedia article&lt;/a>
before continuing. To summarize:&lt;/p>
&lt;ul>
&lt;li>There are $n$ cities to visit.&lt;/li>
&lt;li>There are roads connecting every pair of cities.&lt;/li>
&lt;li>Each road has a (nonnegative) toll associated with it.&lt;/li>
&lt;li>&lt;strong>Goal:&lt;/strong> Find the minimum cost path that ends where you start and visits each city exactly once.&lt;/li>
&lt;/ul>
&lt;p>The first thing to remember when using simulated annealing is that for most problems we would
apply it to, we should expect to not obtain the globally optimal solution at the end; instead, we
hope for a result that is just good enough. In the case of TSP, simulated annealing can give us a
reasonable approximation, but we cannot really guarantee anything more than that.&lt;/p>
&lt;p>Another practical consideration that arises is how to define the state space for the problem at hand.
Before continuing, think about how you might define it for TSP.&lt;/p>
&lt;p>A sensible way to define it is to consider any ordering of the $n$ cities to be a state. Defining states
this way, there are $n!$ states &amp;ndash; for $n \geq 20$, this number is absolutely massive. With such large state
spaces, one typically also has to narrow the space of transitions under consideration at each step. In the case of TSP,
we might do this by only allowing transitions that swap a pair of cities in the order. Can you think of
other methods?&lt;/p>
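&lt;p>The pair-swap transition mentioned above might look like this in Python (a sketch; a state is just a list of city indices):&lt;/p>

```python
import random

def randomly_swap_pair(route):
    # Candidate transition: swap the cities at two distinct positions,
    # leaving the current route untouched.
    i, j = random.sample(range(len(route)), 2)
    candidate = route[:]  # copy so the current state is preserved
    candidate[i], candidate[j] = candidate[j], candidate[i]
    return candidate
```

&lt;p>Another common family of moves reverses an entire contiguous segment of the route (the classic 2-opt move).&lt;/p>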
&lt;p>Finally, we define our objective function $E$ to just be the total cost of a particular route, and we
stop when we&amp;rsquo;ve gone some number of iterations without making progress over the best solution we&amp;rsquo;ve obtained
so far. With this setup, we can implement our algorithm following the pseudo-python below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">simulated_annealing_TSP&lt;/span>(G, max_iters_with_no_improvement):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> G: the initial problem structure (tolls for each road)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> max_iters_with_no_improvement: The maximum number of iterations allowed without
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> surpassing the best seen so far before termination.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># initialize the temperature and pick an initial state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> T &lt;span style="color:#f92672">=&lt;/span> initialize_temperature()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> s &lt;span style="color:#f92672">=&lt;/span> pick_random_city_order(G)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter, best_state, lowest_so_far &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#66d9ef">None&lt;/span>, inf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># while we&amp;#39;re making progress...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">while&lt;/span> no_improvement_counter &lt;span style="color:#f92672">&amp;lt;&lt;/span> max_iters_with_no_improvement:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># select a candidate next state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> candidate &lt;span style="color:#f92672">=&lt;/span> randomly_swap_pair(s) &lt;span style="color:#75715e"># using any restriction&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># compute the costs of the current state and the candidate&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> e_s, e_cand &lt;span style="color:#f92672">=&lt;/span> total_cost(s, G), total_cost(candidate, G)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># decide whether to accept by comparing a uniform random number&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># to the acceptance probability described earlier&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> uniform(&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">&amp;lt;&lt;/span> p_acc(e_s, e_cand, T):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> s &lt;span style="color:#f92672">=&lt;/span> candidate
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># if we see a new best, reset the progress counter and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># save the best state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> e_cand &lt;span style="color:#f92672">&amp;lt;&lt;/span> lowest_so_far:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>                best_state, lowest_so_far &lt;span style="color:#f92672">=&lt;/span> s, e_cand
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># reduce the temperature&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> T &lt;span style="color:#f92672">=&lt;/span> reduce_temperature(T)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># return the best state when the iteration completes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> best_state
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are some details and optimizations left out, but hopefully the code feels straightforward enough for
you to try to implement this on your own!&lt;/p>
&lt;p>(Note: One detail we left out in the above is the schedule to use to reduce $T_k$ over time. This is
a subtle problem, since if we lower it too quickly, our optimization process will not sufficiently explore,
whereas if we lower it too slowly, we may not make forward progress fast enough.)&lt;/p>
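&lt;p>One common concrete choice is a geometric schedule, which multiplies the temperature by a constant $\alpha$ slightly below 1 at every step (the value of $\alpha$ here is illustrative, not prescribed):&lt;/p>

```python
def reduce_temperature(T, alpha=0.995):
    # Geometric (exponential) cooling: after k steps, T_k = T_0 * alpha**k.
    # An alpha close to 1 cools slowly (more exploration early on);
    # a smaller alpha cools faster but risks freezing into a local optimum.
    return alpha * T
```

&lt;p>Starting from $T_0 = 1$, a thousand steps of this schedule bring the temperature down to roughly $0.995^{1000} \approx 0.007$.&lt;/p>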
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>In this post, we briefly described a meta-algorithm called simulated annealing that can help approximate
global optima for properly formulated optimization problems, many of which are extremely computationally difficult.
It is often most useful when there are not other more direct, problem-specific algorithms we can bring to bear.
In addition to describing the general setup, we also looked at how we could apply this approach to the TSP,
which gave us a flavor for some of the practical considerations that arise when trying to fit a problem
into the SA framework.&lt;/p>
&lt;p>Simulated annealing is a powerful tool that is employed to solve a variety of thorny optimization problems
across the sciences. It&amp;rsquo;s a good tool to have in your toolkit &amp;ndash; I hope it comes in handy!&lt;/p></description></item><item><title>Distinct values in a data stream</title><link>https://www.jgindi.me/posts/2022-09-11-data-stream/</link><pubDate>Sun, 11 Sep 2022 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2022-09-11-data-stream/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I detail a randomized algorithm (that looks rather like black magic) to count the number of distinct elements in a data stream.&lt;/p>
&lt;h2 id="naive-solution">Naive solution&lt;/h2>
&lt;p>Suppose that data is presented as a sequence of values $\sigma = s_1, \dots, s_m$ where, for simplicity, the $s_i \in \lbrace 0, \dots, n - 1\rbrace$. I want to know the number of distinct values that were in the stream. For example, if the sequence were $\sigma = 1,2,3,4,5,5,7$, our algorithm should output 6. How might we accomplish this?&lt;/p>
&lt;p>A very simple way is to keep an $n$-bit vector $v$ ($n$ because that is the size of the set our values are being drawn from) where $v_i$ represents whether we have seen the element $i$. Once we have seen all of the data, we sum the values in $v$ and output that as our result. Easy, right?&lt;/p>
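&lt;p>In Python, the naive approach looks like this (a sketch; a list of booleans stands in for the bit vector):&lt;/p>

```python
def count_distinct_naive(stream, n):
    # v plays the role of the n-bit vector: v[i] marks whether value i appeared.
    v = [False] * n
    for s in stream:
        v[s] = True
    return sum(v)  # True counts as 1, so this sums the set bits

print(count_distinct_naive([1, 2, 3, 4, 5, 5, 7], n=8))  # 6
```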
&lt;h2 id="does-it-work">Does it work?&lt;/h2>
&lt;p>The issue here is that for sufficiently large data sets, using $n$ bits of storage is not feasible. The above approach is (provably) optimal in terms of space&amp;hellip; but that &amp;ldquo;optimality&amp;rdquo; is only with respect to deterministic algorithms that produce the correct result every time. What if we used randomization? Might we be able to achieve sublinear space usage?&lt;/p>
&lt;h2 id="a-better-solution">A better solution&lt;/h2>
&lt;p>The rest of this post will detail a randomized algorithm that uses sublinear space to solve the above problem with an answer that is approximately correct as close to 100% of the time as we&amp;rsquo;d like&amp;hellip; it&amp;rsquo;s rather like magic. Let&amp;rsquo;s see how it works.&lt;/p>
&lt;p>The algorithm is as follows:&lt;/p>
&lt;ul>
&lt;li>Choose $\varepsilon \in (0,1)$.&lt;/li>
&lt;li>Let $t = \frac{400}{\varepsilon^2}$.&lt;/li>
&lt;li>Pick a pairwise independent hash function $h: \{0,\dots,n-1\} \to \{0,\dots,n-1\}$&lt;/li>
&lt;li>Upon receiving $s_i$, compute $h(s_i)$ and update $D$, a data structure with which we keep track of the $t$ smallest distinct hash values we&amp;rsquo;ve computed.&lt;/li>
&lt;li>When we stop receiving values, let $X$ be the $t^{th}$ smallest hash value and output $\frac{nt}{X}$.&lt;/li>
&lt;/ul>
&lt;p>Before we move on, note that the algorithm only requires space for (1) the hash function (discussed below) and (2) the data structure $D$, which holds just $t$ hash values &amp;ndash; an amount of space that depends on $\varepsilon$ but not on the length of the stream&amp;hellip; Assuming that our hash function doesn&amp;rsquo;t take up too much space, the algorithm satisfies our space requirement.&lt;/p>
&lt;p>I imagine you&amp;rsquo;re thinking what I&amp;rsquo;m thinking (or what I was thinking)&amp;hellip; namely, that there is absolutely no reason why that should work. Before we dive into some completely mind-blowing mathematical analysis, I want to quickly digress to explain what a pairwise independent hash function is and then to fill in some details and provide an &amp;ldquo;intuitive&amp;rdquo; flavor for where this algorithm comes from.&lt;/p>
&lt;h3 id="pairwise-independent-hash-functions">Pairwise independent hash functions&lt;/h3>
&lt;p>Imagine we have a hash function $h$ chosen at random and two arbitrary distinct inputs $x_1$ and $x_2$. The hash function $h$ is pairwise independent if knowing the value of $h(x_1)$ gives us no information about the value of $h(x_2)$.&lt;/p>
&lt;p>In mathematical terms, $h$ is pairwise independent if, for all distinct inputs $x_1 \neq x_2$ and all outputs $y_1, y_2$,
$\Pr[h(x_1)=y_1 \wedge h(x_2) = y_2] = \frac{1}{n^2}$
($n$ is the size of the output space in our case, and the probability is over the random choice of $h$). The natural question we ask when we present a definition is: Do such objects exist? Without going into too much detail about why, be assured they do indeed exist, and we are going to pick ours from the family
$\mathcal{H} = \lbrace h_{ab}: \lbrace 0,\dots,p-1 \rbrace \to \lbrace 0,\dots, p-1\rbrace \rbrace $
where $p$ is prime, $0 \leq a \leq p -1$ and $0 \leq b \leq p-1$, defined, for some input $k$, by $h_{ab}(k) = ak + b \mod p$. In our case, if $n$ is not prime, we can find a prime near $n$ and let $p$ be that prime (an interesting proof for another time &amp;ndash; Bertrand&amp;rsquo;s postulate &amp;ndash; is that
for any $n$, there is always a prime between $n$ and $2n$). Note that the only things we have to store about this hash function to use it are $a$ and $b$ &amp;ndash; each of which only requires $\log p$ bits of storage (so we are still well under the linear space we are trying to avoid).&lt;/p>
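&lt;p>Sampling a member of this family is a one-liner in Python (a sketch; the caller is responsible for supplying a prime $p$):&lt;/p>

```python
import random

def make_hash(p):
    # Draw h_ab(k) = (a*k + b) mod p uniformly from the family.
    # Storing the function costs only the two integers a and b.
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda k: (a * k + b) % p
```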
&lt;h2 id="how-do-we-get-">How do we get $\frac{nt}{X}$?&lt;/h2>
&lt;p>Next, I&amp;rsquo;ll try to motivate where $\frac{nt}{X}$ comes from. Suppose that there are $k$ distinct values in the stream (that is, suppose that $k$ is the solution to our problem). Let those values be $a_1,\dots,a_k$. Because the hash function spreads $k$ distinct values roughly evenly over $n$ possible outputs, we expect the gap between consecutive values of $h(a_i)$ to be about $\frac{n}{k}$. In particular, we expect the $t$th smallest hash value to land near $t \cdot \frac{n}{k}$. Thus, $X \approx \frac{nt}{k}$. Solving for $k$, we see that $k \approx \frac{nt}{X}$, so that&amp;rsquo;s exactly what we output. In what follows, we will need to distinguish between the right answer and our output, so the correct answer will henceforth be referred to as $k$ and our output, $nt/X$, as $\hat k$.&lt;/p>
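&lt;p>Putting the pieces together, the whole estimator is only a few lines (a self-contained sketch with illustrative names; keeping the $t$ smallest distinct hash values in a plain set costs $O(t)$ per update &amp;ndash; a heap would be faster, but this keeps the logic transparent):&lt;/p>

```python
import random

def estimate_distinct(stream, n, p, eps=0.5):
    t = int(400 / eps**2)
    # One random member of the pairwise independent family h(k) = (a*k + b) mod p.
    a, b = random.randrange(p), random.randrange(p)
    smallest = set()  # the (up to) t smallest distinct hash values seen so far
    for s in stream:
        smallest.add((a * s + b) % p)
        if len(smallest) > t:
            smallest.discard(max(smallest))
    if len(smallest) < t:
        # Fewer than t distinct hash values ever appeared, so we have
        # effectively counted the distinct hash values exactly.
        return len(smallest)
    X = max(smallest)  # the t-th smallest hash value
    return n * t // X
```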
&lt;h2 id="so-does-the-fancy-algorithm-work">So does the fancy algorithm work?&lt;/h2>
&lt;p>All that&amp;rsquo;s left to do now is show that $\hat k$ is very close to $k$. Mathematically speaking, we want to show that
&lt;/p>
$$\frac{k}{1 + \varepsilon} \leq \hat k \leq (1+\varepsilon)k$$&lt;p>
with a probability $\geq \frac{99}{100}$ (where $\varepsilon$ is the parameter we chose in the first step of the algorithm).
If we can show that $\Pr[\hat k > (1+\varepsilon)k] \leq \frac{1}{200}$ and that $\Pr[\hat k &lt; k/(1 +\varepsilon)] \leq \frac{1}{200}$, we get
&lt;/p>
$$\Pr[k/(1+\varepsilon) \leq \hat k \leq (1+\varepsilon)k] = 1 -\Pr[\hat k > (1+\varepsilon)k] -\Pr[\hat k &lt; k/(1 +\varepsilon)].$$&lt;p>Because each of the probabilities on the right side is $\leq \frac{1}{200}$, we can rewrite the above as
$\Pr[k/(1+\varepsilon) \leq \hat k \leq (1+\varepsilon)k] \geq 1 - \frac{2}{200} = \frac{99}{100}$
which is what we want.&lt;/p>
&lt;p>So all we have left to do is show that the two probabilities are indeed both $\leq \frac{1}{200}$.
We will analyze only one of them, because a symmetric argument takes care of the other side. All of this put together means we only have one claim left to prove: $\Pr[\hat k > (1+\varepsilon)k] \leq \frac{1}{200}$.&lt;/p>
&lt;p>First, we note that
&lt;/p>
$$
\begin{align*}
\Pr[\hat k > (1+\varepsilon)k] &amp;= \Pr[ (nt)/X > (1+\varepsilon)k] \\
&amp;= \Pr\biggl[X &lt; \frac{nt}{(1+\varepsilon)k}\biggr].
\end{align*}
$$&lt;p>With this in mind, define a random variable $Y_i$ which takes the value 1 if $h(a_i) &lt; \frac{nt}{(1+\varepsilon)k}$ and 0 otherwise. Now, observe that the probability of $h(a_i)$ taking a value less than $\frac{nt}{(1+\varepsilon)k}$ is the number of possible hash values between 0 and $\frac{nt}{(1 + \varepsilon)k}$ divided by the total number of values $h(a_i)$ can take. We can write this mathematically as
&lt;/p>
$$
E[Y_i] = \frac{tn}{(1+\varepsilon)kn} = \frac{t}{(1+\varepsilon)k}.
$$&lt;p>Next, let the random variable $Y$ be the sum of the $Y_i$. Because expectation is linear, we can infer that $E[Y] = \sum_{i = 1}^k E[Y_i] = k \cdot\frac{t}{(1+\varepsilon)k} = \frac{t}{1 + \varepsilon}$. Because $h$ is pairwise independent, the $Y_i$ are pairwise independent as well, so their variances add, giving
$\text{Var}(Y) = \sum_{i=1}^k \text{Var}(Y_i) = \frac{t}{1+\varepsilon} - \frac{t^2}{(1+\varepsilon)^2 k} \leq \frac{t}{1 + \varepsilon} = E[Y]$.
We&amp;rsquo;re almost there!&lt;/p>
&lt;p>We can now more readily examine the probability we were interested in above in terms of $Y$. That is, we can say
&lt;/p>
$$\Pr \biggl[X &lt; \frac{nt}{(1+\varepsilon)k} \biggr] = \Pr[Y \geq t].$$&lt;p>
Why? The left-hand probability represents the chance that the $t$th smallest hash value we saw was less than some value; let&amp;rsquo;s call said nasty value $M$ for a minute. $Y$ is the number of hash values we saw that were less than $M$. If at least $t$ values hashed to values less than $M$, then $X$, the $t$th smallest hash value, will be less than $M$, hence the equality.&lt;/p>
&lt;p>From our expectation computation, $t = (1 + \varepsilon) E[Y]$, so what we really want to bound is $\Pr[Y \geq (1 + \varepsilon) E[Y]]$. This is bounded above by $\Pr[|Y - E[Y]| \geq \varepsilon E[Y]]$ (adding in the absolute value can only make the event more likely). Chebyshev&amp;rsquo;s inequality tells us that
&lt;/p>
$$\Pr[|Y - E[Y]| \geq \varepsilon E[Y]] \leq \frac{\text{Var}(Y)}{\varepsilon^2 E[Y]^2}.$$&lt;p>
Because $\text{Var}(Y) \leq E[Y]$, we can write
&lt;/p>
$$\frac{\text{Var}(Y)}{\varepsilon^2 E[Y]^2} \leq \frac{E[Y]}{\varepsilon^2 E[Y]^2} = \frac{1}{\varepsilon^2 E[Y]}.$$&lt;p>
Now recall that earlier, we said that $t = (1 + \varepsilon) E[Y] \iff E[Y] = \frac{t}{1 + \varepsilon}$. We can substitute this in for $E[Y]$ above and get
&lt;/p>
$$\frac{1}{\varepsilon^2 E[Y]} = \frac{1 + \varepsilon}{\varepsilon^2 t}.$$&lt;p>
Because $\varepsilon$ is at most 1 (so $1 + \varepsilon \leq 2$) and we chose $t = \frac{400}{\varepsilon^2}$, we conclude
&lt;/p>
$$\frac{1 + \varepsilon}{\varepsilon^2 t} \leq \frac{2}{\varepsilon^2 \frac{400}{\varepsilon^2}} = \frac{2}{400} = \frac{1}{200}.$$&lt;p>Thus, all in all, we&amp;rsquo;ve shown that the odds of our return value being an over-estimate is bounded above by 1/200. A similar argument shows that the probability of underestimating is also bounded above by 1/200, so the probability of erring is at most 1/200 + 1/200 = 1/100 which means our probability of success is at least 99/100, as desired.&lt;/p></description></item><item><title>Solving Wordle</title><link>https://www.jgindi.me/posts/2022-01-12-solving-wordle/</link><pubDate>Wed, 12 Jan 2022 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2022-01-12-solving-wordle/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>For those who don&amp;rsquo;t know what Wordle is, &lt;a href="https://www.powerlanguage.co.uk/wordle/">check it out&lt;/a>.
It&amp;rsquo;s essentially a word game that works like the game MasterMind. If you&amp;rsquo;ve been on the internet
in the past couple of weeks, you&amp;rsquo;ve probably seen your friends or follows posting
little images that show how quickly, and by which path, they solved the day&amp;rsquo;s puzzle.
After trying it, I thought it might be fun to try to write some code that solves the puzzle
(most of the time). The rest of this post will walk through how I came up with the solution,
how I put the code together, and some insights I gleaned using my solver.&lt;/p>
&lt;h2 id="so-what-are-the-rules">So what are the rules?&lt;/h2>
&lt;p>Before going any further, I want to review the rules. The game proceeds
as follows. A target word is chosen and hidden from the player. On each turn,
the player guesses a five-letter word. After each guess, the player receives feedback
about the letter at each position of their guess. For each position, the player might
receive:&lt;/p>
&lt;ol>
&lt;li>&lt;span style="color:green">&lt;strong>Green:&lt;/strong>&lt;/span> if the letter of the guess matches the letter of the target at that position.&lt;/li>
&lt;li>&lt;span style="color:orange">&lt;strong>Yellow:&lt;/strong>&lt;/span> if the letter of the guess matches the letter at some other position of the target.&lt;/li>
&lt;li>&lt;span style="color:grey">&lt;strong>Grey:&lt;/strong>&lt;/span> if the letter of the guess is not in the target.&lt;/li>
&lt;/ol>
&lt;p>For example, if the target word is &amp;ldquo;taker&amp;rdquo; and the guess is &amp;ldquo;talks&amp;rdquo;, the feedback
would be
&lt;span style="color:green">ta&lt;/span>&lt;span style="color:grey">l&lt;/span>&lt;span style="color:orange">k&lt;/span>&lt;span style="color:grey">s&lt;/span>
, because the first two letters are exactly right, &amp;ldquo;l&amp;rdquo; and &amp;ldquo;s&amp;rdquo; are
not in the target word at all, and &amp;ldquo;k&amp;rdquo; is in the target but in a different position than
it occupies in the guess.&lt;/p>
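&lt;p>The feedback rules can be sketched in code. This is my own reconstruction (the function name and the single-letter encoding are mine): "G", "Y", and "B" stand for green, yellow, and grey, and repeated letters earn yellows only up to the number of unmatched copies in the target.&lt;/p>

```python
from collections import Counter

def feedback(guess, target):
    """Return one of 'G' (green), 'Y' (yellow), 'B' (grey) per position."""
    result = ["B"] * len(guess)
    unmatched = Counter()          # target letters not matched by a green
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = "G"
        else:
            unmatched[t] += 1
    for i, g in enumerate(guess):  # award yellows from the unmatched pool
        if result[i] != "G" and unmatched[g] > 0:
            result[i] = "Y"
            unmatched[g] -= 1
    return "".join(result)
```

For the example above, &lt;code>feedback("talks", "taker")&lt;/code> comes out to &lt;code>"GGBYB"&lt;/code>, matching the colors shown.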
&lt;h2 id="coming-up-with-a-solution">Coming up with a solution&lt;/h2>
&lt;p>At first, I tried to apply some concepts I had been studying as part of a reinforcement
learning class I&amp;rsquo;d been taking online. It&amp;rsquo;s possible that the formulation I came up
with just wasn&amp;rsquo;t a good one, but a simple approach without any fancy AI turned out
to actually work very well. I&amp;rsquo;ve learned, both through my job and some independent
study, that conceptual simplicity is often underrated.&lt;/p>
&lt;p>My general approach was simple. After collecting the body of words that Wordle
uses (which can actually be obtained pretty easily by inspecting the page source
of the game&amp;rsquo;s webpage), I thought through what the skeleton of an algorithm would
look like, and I came up with this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>guesses_made &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>set current guess to an initial guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">while&lt;/span> (guesses_made &lt;span style="color:#f92672">&amp;lt;&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span>) &lt;span style="color:#f92672">and&lt;/span> (current_guess &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#f92672">not&lt;/span> the target):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> get feedback on current guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> use feedback to reduce the set of valid words
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> make make another guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> increment guesses_made
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>One way of looking at this skeleton is that we begin the game with no constraints
on which words are and are not valid. As
we make guesses and see feedback, we gain additional information that
allows us to further and further constrain the set of available words until &amp;ndash; hopefully &amp;ndash;
we&amp;rsquo;ve narrowed it all the way down and we&amp;rsquo;re certain of the answer.&lt;/p>
&lt;p>As described just above, the algorithm skeleton is missing a few important details,
namely:&lt;/p>
&lt;ul>
&lt;li>How does the feedback allow us to determine the pool of valid words we can choose from?&lt;/li>
&lt;li>How do we make our next guess given a set of guessable words?&lt;/li>
&lt;/ul>
&lt;p>The implementation choices we make to answer those two questions ultimately lead
to different algorithms. In this post, we discuss
some quick-and-dirty, very simple choices that turn out to perform well, but
I&amp;rsquo;d encourage you to come up with interesting alternatives on your own to see if you
can come up with something even better!&lt;/p>
&lt;h3 id="using-the-feedback">Using the feedback&lt;/h3>
&lt;p>Using the feedback requires specifying what kinds of words each type of feedback allows
us to eliminate.&lt;/p>
&lt;p>When we receive grey feedback, we know to eliminate all words that contain the grey letter.&lt;/p>
&lt;p>When we receive green feedback, at position 2, say, we know to eliminate all words
that do not contain the green letter at position 2.&lt;/p>
&lt;p>When we receive yellow feedback, there are two kinds of elimination we can perform. If
we get yellow feedback at position 3, then the letter of our guess at position 3 cannot be
in the target word at position 3, so we can eliminate all words whose letter at position 3
matches our guess&amp;rsquo;s. We can also eliminate any words that do not contain the yellow
letter at all, as we know it must appear somewhere in the target.&lt;/p>
&lt;p>Finally, there is the problem of words with the same letter repeated multiple times.
Thinking things through a little bit, we realize that the number of yellow and green
copies of a given letter is a lower bound on the number of copies of that letter that
must be in the target word. For example, if we have a yellow &amp;ldquo;t&amp;rdquo; and a green &amp;ldquo;t&amp;rdquo; in
the feedback, we know that the target word must have at least 2 &amp;ldquo;t&amp;quot;s, so we can eliminate
all words with 0 or 1 &amp;ldquo;t&amp;quot;s.&lt;/p>
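&lt;p>A sketch of these elimination rules in code (my own illustrative version; the name and the per-position "G"/"Y"/"B" feedback encoding are assumptions):&lt;/p>

```python
from collections import Counter

def survives(word, guess, fb):
    """Return True if `word` is still a valid target given `guess` and
    its feedback string `fb` ('G'/'Y'/'B' at each position)."""
    # yellow + green copies give a lower bound on copies in the target
    lower = Counter(g for g, f in zip(guess, fb) if f in "GY")
    for i, (g, f) in enumerate(zip(guess, fb)):
        if f == "G" and word[i] != g:
            return False           # green: letter must match at position i
        if f == "Y" and word[i] == g:
            return False           # yellow: letter cannot sit at position i
        if f == "B" and lower[g] == 0 and g in word:
            return False           # grey with no flagged copies: letter absent
    # the word must contain at least the flagged number of copies
    return all(word.count(c) >= cnt for c, cnt in lower.items())
```

Note that the "yellow letter must appear somewhere" rule is enforced by the final count check, since a yellow letter contributes at least 1 to its lower bound.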
&lt;h3 id="making-the-next-guess">Making the next guess&lt;/h3>
&lt;p>Each time we use feedback to cull the set of valid words, we have to then choose from
a potentially vast set of remaining words. In order to do this, we have to come up
with some heuristic to narrow the field.&lt;/p>
&lt;p>In my case, I chose to give each word a score and then pick the word with the highest score (breaking ties randomly when required).
To compute the score, I first came up with the distribution of letters in each position.
For example, at position 1, maybe &amp;ldquo;s&amp;rdquo; was the most common letter, making up 6% of the letters
found in position 1 across the set of possible words.
If the word under evaluation contains an &amp;ldquo;s&amp;rdquo; at position 1, the word
would accrue a credit of 0.06 for the &amp;ldquo;s&amp;rdquo;. The sum of these credits across the 5 positions
determines the word score. At each point, I select the valid word with the highest score.&lt;/p>
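&lt;p>The positional scoring heuristic is simple enough to sketch in a few lines (again, my own illustrative code, assuming five-letter words):&lt;/p>

```python
from collections import Counter

def best_guess(words):
    """Score each candidate by summing, position by position, the
    frequency of its letter at that position across the pool, then
    return a highest-scoring word."""
    n = len(words)
    freq = [Counter(w[i] for w in words) for i in range(5)]
    score = lambda w: sum(freq[i][w[i]] / n for i in range(5))
    return max(words, key=score)
```

In the real solver, the pool passed in would be the set of words that survive all the feedback collected so far.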
&lt;p>This scoring system has (at least) one obvious weakness! If the target word has letters in certain positions that are very
uncommon for that position, the algorithm will pick other words first and possibly run out of guesses. Trying to figure out
how to remedy this would make the algorithm more robust, but I haven&amp;rsquo;t given it enough thought as of this writing.&lt;/p>
&lt;h2 id="how-well-does-the-solver-work">How well does the solver work?&lt;/h2>
&lt;p>With an allowance of 6 guesses, on a random sample of 5k target words, my solver successfully found the target word about 90% of the time
in an average of 5 guesses.
Increasing the allowance to 9 guesses, it succeeds 98.5% of the time. With 15 guesses, it succeeds on all 5k examples. In the instances
where it fails with 6 guesses, there are, on average, about 7 valid choices left. That&amp;rsquo;s pretty good!&lt;/p>
&lt;h2 id="what-is-the-best-word-to-start-with">What is the best word to start with?&lt;/h2>
&lt;p>Before trying to answer this question, it should be noted that &amp;ldquo;best&amp;rdquo; in this context
depends on your algorithm. Different scoring methodologies, for example, would imply
a different ordering on the quality of initial guesses. The results below are obtained
using the algorithm we just described; if you vary the algorithm, you might find something
different.&lt;/p>
&lt;p>For each possible initial guess &amp;ndash; around 13k of them &amp;ndash; I chose 200 random target words.
(I could have chosen more than 200, but I was constrained by computation time.)
For each guess, we record the fraction of the 200 problems that were solved successfully and assign
that as a score for that initial guess.&lt;/p>
&lt;p>Some of the words that achieved the top 5% of scores were&lt;/p>
&lt;ul>
&lt;li>chapt&lt;/li>
&lt;li>chimp&lt;/li>
&lt;li>germs&lt;/li>
&lt;li>compt&lt;/li>
&lt;li>match&lt;/li>
&lt;li>chems&lt;/li>
&lt;li>frump&lt;/li>
&lt;li>bumph&lt;/li>
&lt;li>spick&lt;/li>
&lt;li>crumb&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;p>and some of the words in the bottom 5% were&lt;/p>
&lt;ul>
&lt;li>nanna&lt;/li>
&lt;li>zooea&lt;/li>
&lt;li>gazoo&lt;/li>
&lt;li>zexes&lt;/li>
&lt;li>vairy&lt;/li>
&lt;li>roque&lt;/li>
&lt;li>navvy&lt;/li>
&lt;li>ninon&lt;/li>
&lt;li>ozzie&lt;/li>
&lt;li>nouny&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;p>(there were others, but I don&amp;rsquo;t show them here for brevity). None of the words in the top
5% of scorers repeated letters, while 90% of those in the bottom 5% did, suggesting
that when picking your first word, it would be unwise to pick a word with repeated letters.&lt;/p>
&lt;p>I also looked at how much of the best initial guesses were made up of the five most common
(s, e, a, o, and r) and five least common (v, z, j, x, and q) letters of the alphabet.
(Here, most and least common are relative to the 12k Wordle words.) At first, the results
seem a bit counterintuitive &amp;ndash; 51% of the letters in the worst 5% of first guesses are
made of the five most common letters of the alphabet, while only 28% of those in the best 5% are!
What gives!?&lt;/p>
&lt;p>One hypothesis is that feedback on common
letters may help significantly narrow the field, but repeating them doesn&amp;rsquo;t provide much additional
information, so it&amp;rsquo;s worth diversifying. But then, you might ask, why aren&amp;rsquo;t there words in there
with repeated infrequent letters? I imagine that this is probably because there just aren&amp;rsquo;t that many
words to begin with that have multiple qs, js, vs, xs, or zs. The better scorers use common
letters, but avoid repeating them, so the fraction of common letters is smaller in that set.
If we look at the fraction of letters in high and low scorers that come from the five least common letters,
what we see is striking: the high scorers do not contain &lt;em>any&lt;/em> of the five least
frequent letters, while 14% of the low scorers&amp;rsquo; letters come from that set.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>I hope you found this exercise as fun and interesting as I did. I&amp;rsquo;ll also be posting the code
I used soon, so feel free to have a look at it if you&amp;rsquo;re interested
in seeing what the actual implementation looks like.
Even though the insights we arrived at were pretty intuitive, I hope that
you enjoyed putting a little bit of rigor to it. Happy Wordling!&lt;/p>
&lt;p>Edit (2022-01-17): I&amp;rsquo;ve posted the code &lt;a href="https://github.com/gindij/wordle">here&lt;/a>. In response to some feedback I received about the post,
I also changed the word-scoring algorithm to encourage helpful
exploration, rather than only using words that satisfy the constraints
we&amp;rsquo;ve accumulated information about.&lt;/p></description></item><item><title>Solving sudoku as a linear program</title><link>https://www.jgindi.me/posts/2021-04-11-sudoku-lp/</link><pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-04-11-sudoku-lp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>During the last several months, my wife and I went through a Kenken phase. For
those not familiar with Kenken, check out this &lt;a href="https://www.kenkenpuzzle.com">website&lt;/a>.
Kenken is a great example of a class of problems that are more broadly categorized as
constraint satisfaction problems (CSPs). I wrote some code that generates and solves Kenkens,
which I&amp;rsquo;d like to write about when I have more time, but in this post I want to talk about
an interesting, nonstandard way to set up and solve a different, far more popular CSP: sudoku!&lt;/p>
&lt;h2 id="what-is-sudoku">What is sudoku?&lt;/h2>
&lt;p>For those who are not familiar with sudoku puzzles, they are designed as follows.
The initial board is a 9x9, partially filled in grid. The goal is for the solver
to find (the unique) values for the empty squares so that the board satisfies the following
rules:&lt;/p>
&lt;ol>
&lt;li>There can be no repeated values in any row. (Each row must contain each value
from 1 up to 9 exactly once.)&lt;/li>
&lt;li>There can be no repeated values in any column. (Each column must contain each
value from 1 up to 9 exactly once.)&lt;/li>
&lt;li>There can be no repeated values in each of the 9 3x3 groups of cells outlined
in a bolded black line in the figures below.&lt;/li>
&lt;/ol>
&lt;div align='center'>
&lt;img src="https://www.jgindi.me/posts/sudoku-lp/initial.png" alt="drawing" width="200" height="200"/>
&lt;img src="https://www.jgindi.me/posts/sudoku-lp/solved.png" alt="drawing" width="200" height="200"/>
&lt;p>An unfilled board (left) and its solution (right).&lt;/p>
&lt;/div>
&lt;h2 id="modeling-the-problem">Modeling the problem&lt;/h2>
&lt;p>In order to formulate this as an optimization problem we need to come up with a
variable to optimize and constraints that tell the optimization algorithm what kinds
of solutions are valid. In any sudoku problem, there are 81 cells that need to
be filled in (including those that are already populated at the start). Each of those
cells has 9 possible values, so we will index our variable $x$ with 3 indices: the first
will indicate the row, the second will indicate the column, and the third will refer to a
value between 1 and 9. Furthermore, we will require each $x_{ijk}$ to take either the
value 0 or the value 1: $x_{ijk} = 1$ if the cell at position $(i, j)$ on the board
has the value $k$ and 0 otherwise.&lt;/p>
&lt;p>This way of modeling a sudoku board is not the most intuitive one possible, but it
will make formulating our constraints easier, which is the topic we turn to next. (If
the above wasn&amp;rsquo;t clear, give it another read before moving on.)&lt;/p>
&lt;h2 id="constraints">Constraints&lt;/h2>
&lt;p>There are a few types of constraints we need to respect to make sure that our
optimization algorithm comes up with a valid solution to our sudoku problems. We
will discuss each type of constraint in turn.&lt;/p>
&lt;h3 id="respecting-given-values">Respecting given values&lt;/h3>
&lt;p>The first constraint the solution needs to satisfy is that the values that are
already provided must be respected. That is, if the first cell is already
filled with the value 5, the optimizer is not allowed to change that. By the way
we defined $x$ earlier, this would correspond to the constraint $x_{115} = 1$. Similarly,
if the lower right value were set to 1, we would have another constraint corresponding
to this value given by $x_{991} = 1$. We add a constraint like this for every value
that is provided on the initial board.&lt;/p>
&lt;h3 id="each-cell-contains-a-single-value">Each cell contains a single value&lt;/h3>
&lt;p>This is a relatively simple constraint, but an important one. To model this constraint
for cell $(1, 1)$, we would require that $\sum_{k=1}^9 x_{11k} = 1$. Because each
entry of $x$ is either 0 or 1, this constraint says that exactly one of the entries
corresponding to the first cell must be set. We add a constraint like the one I described
for the first cell for each of the 81 cells.&lt;/p>
&lt;h3 id="row-column-and-box-constraints">Row, column, and box constraints&lt;/h3>
&lt;p>We require that each row contain each digit from 1 to 9. To encode that the digits
in row $i$ must be unique, we need to make sure that for each value $k$, we have
$\sum_{j=1}^9 x_{ijk} = 1$ (in the sum, $i$ and $k$ are fixed). There are 81 such
constraints corresponding to the rows; another 81 can be analogously formulated
for the columns, and another 81 model the box constraints.&lt;/p>
&lt;h3 id="consolidating-constraints">Consolidating constraints&lt;/h3>
&lt;p>Once we&amp;rsquo;ve modeled all the constraints, we can stack them into a single matrix
equality constraint $Ax = b$. Although we&amp;rsquo;ve been discussing $x$ as though
it were three dimensional, when we pass the optimization problem to the computer, we
flatten it into one long 729-vector (9 rows $\times$ 9 columns $\times$ 9
candidate values per cell). Each row of $A$ corresponds to a single constraint.
Coefficients corresponding to the variables that are active in that constraint
are set to 1 and the other coefficients in that row are all set to 0. Because
all of our constraints have 1s on their right hand sides, we have $b = \mathbf 1$.&lt;/p>
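&lt;p>To make the consolidation concrete, here is a sketch of how the constraint rows could be assembled (variable names and the 0-indexed flattening convention are illustrative choices of mine):&lt;/p>

```python
import numpy as np

def build_A(givens):
    """Assemble the rows of A for the sudoku problem; b is all ones.

    `givens` maps 0-indexed (row, col) -> value index k in 0..8 for
    pre-filled cells. x is flattened so x[81*i + 9*j + k] = x_{ijk}.
    """
    idx = lambda i, j, k: 81 * i + 9 * j + k
    rows = []
    def add(cells):
        row = np.zeros(729)
        row[list(cells)] = 1.0
        rows.append(row)
    for i in range(9):                      # each cell holds one value
        for j in range(9):
            add(idx(i, j, k) for k in range(9))
    for k in range(9):
        for i in range(9):                  # value k once per row
            add(idx(i, j, k) for j in range(9))
        for j in range(9):                  # value k once per column
            add(idx(i, j, k) for i in range(9))
        for b in range(9):                  # value k once per 3x3 box
            add(idx(3 * (b // 3) + d // 3, 3 * (b % 3) + d % 3, k)
                for d in range(9))
    for (i, j), k in givens.items():        # respect the given values
        add([idx(i, j, k)])
    return np.array(rows)
```

With no givens this produces the 324 structural constraints (81 cell + 81 row + 81 column + 81 box), each touching exactly 9 variables.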
&lt;h2 id="formulating-and-solving-the-problem">Formulating and solving the problem&lt;/h2>
&lt;p>We can now formulate our optimization problem:
&lt;/p>
$$
\begin{align*}
&amp;\underset{x \in \{0, 1\}^{729}}{\text{minimize}} ~~ 0\\
&amp;\text{subject to} ~~ Ax = \mathbf 1.
\end{align*}
$$&lt;p>
Because sudokus have unique solutions, we just need to find the $x$ that satisfies
our constraints &amp;ndash; there will only be one such $x$! Because we just need to find that
$x$, the objective value doesn&amp;rsquo;t have to help us discriminate between competing feasible
points for this problem, so we can safely use 0 as our objective function. (This type
of problem is known as a feasibility problem.)*&lt;/p>
&lt;p>With the optimization problem in hand, the following short piece of code
uses &lt;a href="https://www.cvxpy.org">CVXPY&lt;/a> to solve any sudoku puzzle very quickly:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> cvxpy &lt;span style="color:#66d9ef">as&lt;/span> cp
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> numpy &lt;span style="color:#66d9ef">as&lt;/span> np
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ... &amp;lt;code that formulates the constraints&amp;gt; ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>x &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Variable(N, boolean&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>array(list(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>b &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(len(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>objective &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Minimize(&lt;span style="color:#ae81ff">0&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>constraints &lt;span style="color:#f92672">=&lt;/span> [A &lt;span style="color:#f92672">@&lt;/span> x &lt;span style="color:#f92672">==&lt;/span> b]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Problem(objective, constraints)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem&lt;span style="color:#f92672">.&lt;/span>solve(solver&lt;span style="color:#f92672">=&lt;/span>cp&lt;span style="color:#f92672">.&lt;/span>ECOS_BB)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The full code (~100 lines) can be found &lt;a href="https://github.com/gindij/SudokuLP">here&lt;/a>.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Sudoku is typically framed and solved as a CSP with
algorithms that involve some guessing and checking. I thought this was an interesting
application of optimization to a problem that maybe doesn&amp;rsquo;t immediately lend itself
to such a formulation. Hope you enjoyed!&lt;/p>
&lt;p>*As a note for a slightly more technically inclined reader, a friend of mine
pointed out that the algorithm that the solver uses, called &lt;a href="https://web.stanford.edu/class/ee364b/lectures/bb_slides.pdf">branch and bound&lt;/a>,
ends up degenerating into an exponential tree search, akin to just trying out
all possibilities with no way of discriminating between &amp;ldquo;better&amp;rdquo; and &amp;ldquo;worse&amp;rdquo;
points. To remedy this (at least in part), we can use the objective function
$x^T \mathbf 1$, which counts the number of ones in any solution. By using
this objective instead of $0$, we give the algorithm a way to &amp;ldquo;prune&amp;rdquo; the tree it
is searching, which may lead to performance benefits. The code implementing this
approach would only be slightly different:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> cvxpy &lt;span style="color:#66d9ef">as&lt;/span> cp
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> numpy &lt;span style="color:#66d9ef">as&lt;/span> np
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ... &amp;lt;code that formulates the constraints&amp;gt; ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>x &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Variable(N, boolean&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>array(list(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>b &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(len(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>objective &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Minimize(x&lt;span style="color:#f92672">.&lt;/span>T &lt;span style="color:#f92672">@&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(N)) &lt;span style="color:#75715e"># &amp;lt;-- new objective function&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>constraints &lt;span style="color:#f92672">=&lt;/span> [A &lt;span style="color:#f92672">@&lt;/span> x &lt;span style="color:#f92672">==&lt;/span> b]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Problem(objective, constraints)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem&lt;span style="color:#f92672">.&lt;/span>solve(solver&lt;span style="color:#f92672">=&lt;/span>cp&lt;span style="color:#f92672">.&lt;/span>ECOS_BB)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>Distributed hash tables</title><link>https://www.jgindi.me/posts/2018-01-10-dht/</link><pubDate>Wed, 10 Jan 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-01-10-dht/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>While watching some lectures on distributed/cloud computing, I came across the distributed hash table, which is a way to
implement a hash table (if you’re unfamiliar, see &lt;a href="https://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=13&amp;amp;cad=rja&amp;amp;uact=8&amp;amp;ved=0ahUKEwjm4Yuk4L_YAhVH3WMKHWcECpIQFghgMAw&amp;amp;url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHash_table&amp;amp;usg=AOvVaw35HhLIk5wS4LY8oElN6VI7">here&lt;/a>) distributed across a bunch of different servers
connected by a network. The goal is to implement the table so that (1) finding a key’s location is “easy” (read: efficient)
and (2) the user does not have to worry about the underlying network topology. In other words, to the client, using the
distributed table should feel like using an in-memory one on a single machine.
I chose to write about this because it’s a place where theory translates gracefully into practice. It turns out
that by using a mathematically fun and interesting hashing technique and a clever data structure in concert, we can achieve
a pretty efficient distributed hash table.&lt;/p>
&lt;h1 id="hash-tables">Hash tables&lt;/h1>
&lt;p>For those unfamiliar with the basic idea of hashing and hash tables, imagine that you have a bunch of objects that you need
to store. You have a bunch of cabinets in which to store said objects. You’d ideally like to put the items into the
cabinets in such a way that when you want to retrieve a particular item from storage, you can do so quickly. To help you
accomplish this, I give you a crystal ball which, based on some combination of the object’s characteristics, determines
which cabinet to put it in/retrieve it from. For example, if the object you’d like to store is red, spherical and
manufactured by Hasbro, your crystal ball might assign it cabinet #1 (note: a crucial property of this crystal ball is
that if it says cabinet #1 for a particular item, it will always say cabinet #1 for that item; it won’t change its
opinion).&lt;/p>
&lt;p>Then, if at some later point you want to retrieve the red ball manufactured by Hasbro, you would consult the crystal ball I
gave you to figure out where you put it. For this to work well and grant the efficiency you seek, you wouldn’t want too
many objects landing in the same cabinet; if the crystal ball assigned several items to cabinet #1, for instance, then
when your crystal ball reveals that the red Hasbro ball is in cabinet #1, you would have to look at all of the items that
landed there to find it. In the worst possible case, if every object somehow landed in cabinet #1, you would potentially
have to rifle through every item you deposited. As the number of items you need to store gets larger and larger, the
inconvenience of items in the same cabinet could range anywhere from mildly annoying to intractable.&lt;/p>
&lt;p>Before we continue, I want to attach some technical terminology to the basic aspects of hashing just discussed. In the
above example, the &lt;strong>hash table&lt;/strong> itself is the set of cabinets you have for storage. We will denote the number of cabinets in our
table by $m$ and the number of items we have to store by $n$. The items are called &lt;strong>key&lt;/strong>s. The
crystal ball you consulted in order to know where to put/find each item is known as a &lt;strong>hash function&lt;/strong>. The easiest way to
think of a hash function is as a mathematical function that takes some object in and uses its properties to
deterministically output an integer. One of the properties we care about when we choose hash functions to suit our
applications is whether they assign lots of different keys to the same buckets. If keys $k_1$ and $k_2$ are mapped to the
same bucket by the hash function, we say that the function &lt;strong>collides&lt;/strong> on these keys.
(As a note before we get into the specifics of the hashing technique we’ll use to build our distributed table, any
algorithm that uses hashing usually relies on an assumption about hash functions called simple uniform hashing (see &lt;a href="https://en.wikipedia.org/wiki/SUHA_(computer_science)">here&lt;/a>), which basically guarantees that the probability that a key lands in any particular bucket
is $1/\#\text{buckets}$. More technically, the assumption asserts that the hash function distributes keys uniformly.
We can actually show mathematically that if the hash function distributes keys uniformly, we expect any particular cabinet
to have $n/m$ items in it, implying that in the worst case, we expect that the runtime of a lookup/insert/update/delete is
$O(n/m)$. Provided that we choose $m$ close to $n$, we effectively expect that the aforementioned operations take constant
time (read: are extremely fast).)&lt;/p>
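&lt;p>To make the cabinet analogy concrete, here is a minimal Python sketch of a chained hash table. The bucket count, the use of the built-in hash function, and the list-per-bucket collision handling are all toy choices for illustration:&lt;/p>

```python
# Toy chained hash table: m "cabinets", each holding a list of (key, value) pairs.
class ToyHashTable:
    def __init__(self, m=8):
        self.m = m                      # number of buckets ("cabinets")
        self.buckets = [[] for _ in range(m)]

    def _bucket(self, key):
        # The "crystal ball": deterministically map a key to a bucket index.
        return hash(key) % self.m

    def put(self, key, value):
        bucket = self.buckets[self._bucket(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))     # a collision just appends to the chain

    def get(self, key):
        # Worst case we scan the whole chain; under simple uniform hashing
        # the expected chain length is n/m.
        for k, v in self.buckets[self._bucket(key)]:
            if k == key:
                return v
        return None
```

&lt;p>If we choose $m$ close to $n$, the expected chain length $n/m$ stays constant, which is where the expected constant-time operations come from.&lt;/p>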
&lt;h2 id="complexities-of-a-distributed-network">Complexities of a distributed network&lt;/h2>
&lt;p>Now let’s say I have a bunch of files that I want to store across a collection of machines. For this particular example,
say I have 3 servers available labeled 0, 1, and 2. The simplest way to store the files across my servers is to take the
hash of the file I want to store (an integer), compute its remainder when you divide by 3 (note that this value can only be
0, 1 or 2) and stash the file in the server (“cabinet”) with the corresponding label. To find out which server a file $f$
has been stored on, simply compute $\text{hash}(f) \bmod 3$ (this means the remainder that $\text{hash}(f)$ leaves when
divided by 3) and ask that server for the file.&lt;/p>
&lt;p>This technique seems all well and good, until you consider one of the realities of distributed systems: that machines
(machine will henceforth be interchangeable with “node” or “server”) join and leave the system all the time. Can you see
the challenge this presents? What happens if I store a file on server 1, a new server joins, and I then ask my hash function where the file is? See
if you can spot the problem before continuing.&lt;/p>
&lt;p>The problem is if a new node joins or leaves the system, the hash function might give me the wrong answer. Let’s say that
for some file $f$, $\text{hash}(f) = 40$. If I stored $f$ when my system only had 3 nodes, my algorithm would have placed
$f$ on node 1, because $40 / 3 = 13$ remainder $1$. Now let’s say I add a 4th node and subsequently ask where $f$ is. Well
$\text{hash}(f) = 40$, and $40 / 4 = 10$ remainder $0$! But from earlier, we saw that $f$ is actually sitting on node 1!
How might we solve this problem? Think about this before reading the next paragraph.&lt;/p>
&lt;p>We can solve this problem by rehashing every file given the new number of servers whenever the system registers a new
machine (or some machine leaves). This solution, however, is very computationally cumbersome. Each time a new node
registers (or leaves), the system has to stop serving clients, compute the new hash of every single file in the system and
move the files to their new homes. (This can be somewhat mitigated by storing metadata about files instead of the files
themselves on the servers in the system, but given enough files, even moving around all the metadata would be pretty
slow.)&lt;/p>
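&lt;p>This failure mode is easy to see in a few lines of Python (the hash value 40 is the made-up value from the example above):&lt;/p>

```python
# With hash(f) = 40, modular placement breaks when the server count changes.
file_hash = 40

assert file_hash % 3 == 1   # with 3 servers, f lives on server 1
assert file_hash % 4 == 0   # after a 4th server joins, we look on server 0!

# Worse, almost every key moves: count how many of the first 1000 hash
# values map to a different server after going from 3 servers to 4.
moved = sum(1 for h in range(1000) if h % 3 != h % 4)
```

&lt;p>Here roughly three quarters of all keys change servers when going from 3 servers to 4, which is why naive rehashing is so expensive.&lt;/p>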
&lt;p>Can we avoid this pitfall somehow? The answer, as you might’ve guessed, is yes, and the technique is called consistent
hashing.&lt;/p>
&lt;h2 id="consistent-hashing">Consistent hashing&lt;/h2>
&lt;p>In consistent hashing, we pick some integer $s$ and imagine a logical ring with $2^s$ discrete slots labeled $0, 1,
2,\dots, 2^s - 1$. For example, if $s = 3$, then our logical ring would have slots labeled $0, 1, \dots, 7$ ($= 2^3 - 1$).
Ideally we choose an $s$ big enough that $2^s$ is a lot bigger than the number of nodes we expect.&lt;/p>
&lt;p>(For the remainder of these steps when I talk about locations on the ring, I’m not talking about an actual ring. If you’re
familiar with modular arithmetic, all we’re doing is wrapping hashes around a modulus. If you’re thinking of it as an
actual ring, that’ll work too.)&lt;/p>
&lt;p>First we place the servers on the ring by hashing their (ip address, port) pairs and taking remainders modulo $2^s$. If
for some server $A$, $\text{hash}(A)$ gives a remainder of $a$ modulo $2^s$, it would logically occupy slot $a$ on the
ring.&lt;/p>
&lt;p>Next, in order to make accesses fast, we introduce a clever data structure called a finger table at each node. Each finger table maintains $s$ pointers to other nodes on the ring as follows. For $i$ between $0$ and $s - 1$ (inclusive), the $i$th finger table entry for a node at ring position $N$ is the first node whose position is greater than or equal to $(N + 2^i) \pmod {2^s}$. Let’s look at a quick example.&lt;/p>
&lt;p>Suppose that $s = 5$ and we had nodes at positions 3, 7, 16, and 27 on our logical ring. To compute 3’s finger table, we compute
&lt;/p>
$$
\begin{align*}
3 + 2^0 &amp;= 3 + 1 &amp;&amp;= 4 &amp;&amp;&amp;\pmod{32} \\\\
3 + 2^1 &amp;= 3 + 2 &amp;&amp;= 5 &amp;&amp;&amp;\pmod{32} \\\\
3 + 2^2 &amp;= 3 + 4 &amp;&amp;= 7 &amp;&amp;&amp;\pmod {32} \\\\
3 + 2^3 &amp;= 3 + 8 &amp;&amp;= 11 &amp;&amp;&amp;\pmod {32} \\\\
3 + 2^4 &amp;= 3 + 16 &amp;&amp;= 19 &amp;&amp;&amp;\pmod {32}
\end{align*}
$$&lt;p>
The first node larger than or equal to the first entry in the finger table is 7, so the first finger pointer is to 7. The same goes for the second and third entries. The fourth entry would point to 16 and the final entry would point to 27. So for node 3, the finger table would look like $(7, 7, 7, 16, 27)$.&lt;/p>
&lt;p>Each node also maintains pointers to the nodes to its right and left along the ring. The &lt;strong>successor&lt;/strong> of 3 is 7 and the &lt;strong>predecessor&lt;/strong> of 3 is 27. In a similar vein, we can more generally refer to the predecessor and successor of any slot on the ring.&lt;/p>
&lt;p>To perform a lookup, we make use of both the logical ring we introduced and the per-node finger tables. Let’s say that machine 7 wants to make a query for a key that hashes to 2 (and would thus reside on node 3 — the first node to its right on the ring). It would look in its finger table for the largest node that is to the left of the key’s position — this would be node 27 — and then route the query there. Node 27 would then route the query to its successor (because it is greater than the key we’re looking for), which would then be able to send the data associated with the requested key back to 7.&lt;/p>
&lt;p>(In this example the ring is small and we don’t require that many hops. This won’t always be the case. For instance, if $s = 10$ and a query initiated at 3 for a key that hashed to 999, the largest node to the left of the key that 3 knows about would be at most $3 + 2^9 = 3 + 512 = 515 \pmod{1024} (=2^{10})$, but there might be a node at 997 that is much closer that 3 doesn’t have in its finger table.)&lt;/p>
&lt;h2 id="why-do-this">Why do this?&lt;/h2>
&lt;p>The last things I want to briefly discuss are the benefits this whole complicated system confers. The first thing we get is relatively fast lookup time. We can actually prove that (with high probability) lookups are logarithmic. In essence, what this means is that on every hop that we take using the finger tables, we traverse at least half of the remaining distance between the current node and the node containing the desired key (the key’s successor). A sketch of the proof goes like this:&lt;/p>
&lt;p>Suppose we are at node $n$ and the node immediately to the left of a key $k$ is some node $p$. Let $i$ be the index for which the (clockwise) distance from $n$ to $p$ is between $2^i$ and $2^{i+1}$. The $i$th entry of $n$’s finger table points to the first node at or past $n + 2^i$; since $p$ itself sits in that window, this finger is at or to the left of $p$, and hence to the left of $k$. So the node $m$ that $n$ actually routes to (the largest node it knows about to the left of $k$) is also between $2^i$ and $2^{i+1}$ away from $n$. But then $m$ and $p$ are at most $2^i$ away from one another (because $2^{i+1} - 2^i = 2^i$), while the hop from $n$ to $m$ covered at least $2^i$ slots. The remaining distance is thus at most the distance already traveled, which means each hop at least halves the distance to $p$, which is what we wanted.&lt;/p>
&lt;p>Using this halving result, we can note further that after $t$ hops, the distance remaining to our destination is at most $(\text{total number of slots}) \cdot (1/2)^t = 2^s/2^t$ (because we halved the distance between where we started and our destination $t$ times). If we make $\log n$ hops (where $n$ is the number of nodes in the system), we have at most $2^s/2^{\log n} = 2^s/n$ slots left to search. Because we assumed that the hash function we chose distributed nodes uniformly about the ring, we only expect there to be 1 node in this window; that there are $\log n$ or more such nodes has a very small probability. This means that after $\log n$ finger table hops, we will (with high probability) have to make at most $\log n$ more hops to get to our destination. If you’re familiar with big-Oh notation, the total runtime for a lookup in the worst case is, with high probability, $O(\log n + \log n) = O(2\log n) = O(\log n)$, as we wanted.&lt;/p>
&lt;p>As a final note, insertions, deletions, and updates are all bottlenecked by the lookup operation. Once we know we can expect logarithmic lookups, we automatically know that those other operations are logarithmic too.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>For those not familiar with what logarithmic runtime means, let’s just say it’s pretty fast. The other cool thing we get out of this logical ring, which is potentially even more important when you’re talking about systems that frequently gain and lose machines, is that very few keys have to move when a node joins or leaves. For example, if a node leaves, only the keys it was storing have to move (to its successor), and only the finger tables that used to point to the departed node need updating. (Can you figure out what you would have to do if a node joined?)&lt;/p>
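&lt;p>We can check the claim about departures with a few lines of Python: place keys on their successors, remove a node, and see which keys change owners. (The node positions are again the ones from the earlier example; this is an illustration, not a real implementation.)&lt;/p>

```python
# Toy check that removing a node only relocates the keys it stored.
# Ring of 32 slots with nodes at 3, 7, 16, and 27.
RING = 32

def owner(slot, nodes):
    """A key at `slot` lives on its successor: the first node at or after it."""
    for n in sorted(nodes):
        if n >= slot:
            return n
    return min(nodes)  # wrap around

before = {k: owner(k, [3, 7, 16, 27]) for k in range(RING)}
after = {k: owner(k, [3, 7, 27]) for k in range(RING)}  # node 16 leaves

moved = [k for k in range(RING) if before[k] != after[k]]
```

&lt;p>Only the slots that node 16 owned (8 through 16) change hands, and they all move to its successor, 27.&lt;/p>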
&lt;p>I think this is a great example of the way that algorithms and mathematical reasoning are a &lt;em>huge&lt;/em> part of the push toward more scalable system architectures. If they aren’t using this exact algorithm, engineers at Google, Facebook, Amazon, Netflix and others are using similar ideas to push the boundaries of what it means for distributed systems to be available, scalable, efficient and maintainable.&lt;/p></description></item><item><title>TSP is inapproximable</title><link>https://www.jgindi.me/posts/2017-04-26-inapprox-tsp/</link><pubDate>Wed, 26 Apr 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-04-26-inapprox-tsp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As an introductory computer science student, I was enamored by the Traveling Salesman
Problem (if you’ve never heard of it, see &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">Travelling salesman problem -
Wikipedia&lt;/a>). It is very
easy to state and has very simple and important practical applications, yet somehow
my professors were telling me that we don’t, at present, have an efficient
algorithm to solve it. There are heuristics, yes, but you can (easily) design
pathological inputs on which whatever heuristic you come up with performs horribly.&lt;/p>
&lt;h2 id="approximation-algorithm">Approximation algorithm&lt;/h2>
&lt;p>It’s often the case that when faced with $NP$-hard optimization problems (such as
TSP), instead of heuristics, we try to design &lt;em>approximation algorithms&lt;/em>:
algorithms that provably produce outputs within some acceptable factor of the
optimal answer. To be a bit more precise about it:&lt;/p>
&lt;ol>
&lt;li>An algorithm $A$ is an $\alpha$-approximation for a maximization problem $P$ if on every instance $I$ of $P$, $A$ produces a solution of size at least $OPT/\alpha$ (where $OPT$ is the size of the optimal solution for the instance $I$).&lt;/li>
&lt;li>$A$ is an $\alpha$-approximation for a minimization problem if $A$ produces a solution of size at most $\alpha OPT$.&lt;/li>
&lt;/ol>
&lt;p>Note that $\alpha \geq 1$.&lt;/p>
&lt;h2 id="inapproximability">Inapproximability&lt;/h2>
&lt;p>There are myriad approximation algorithms out there for a bunch of different
problems; I may make them the subject of a future blog post… but in the remainder of
this post, I want to actually show that for any $\alpha \geq 1$, it is impossible
to come up with an $\alpha$-approximation for TSP unless $P = NP$ (in which
case we wouldn’t need approximations, we would have a polynomial time algorithm to
solve the problem exactly!). In other words, unless $P = NP$, not only do we not
have an efficient algorithm for TSP, we can’t even &lt;em>approximate&lt;/em> it efficiently!
The proof of this fact, which I find surprising and kind of amazing given the
amount of effort and brainpower that have been thrown at this problem over the
years, is pretty simple, so I thought it would be fun to go through it here.&lt;/p>
&lt;p>(Before we do, if you aren’t familiar with it, read &lt;a href="https://en.wikipedia.org/wiki/Hamiltonian_path_problem">Hamiltonian path problem -
Wikipedia&lt;/a>)&lt;/p>
&lt;h2 id="the-proof">The proof&lt;/h2>
&lt;p>We&amp;rsquo;ll start with a sketch of the proof, then move on to the actual proof.&lt;/p>
&lt;p>&lt;strong>Proof idea:&lt;/strong> The proof is by contradiction. We will assume we have a polynomial
time approximation for TSP and use it as a black box to solve a known $NP$-complete
problem, HAM-CYCLE, in polynomial time. Assuming $P \neq NP$, this is impossible, so
such an approximation cannot exist.&lt;/p>
&lt;p>&lt;strong>Proof:&lt;/strong> Suppose that $A$ is a polynomial time approximation for TSP. Use $A$ to
construct the following algorithm $A’$ which, on some input graph $G = (V,E)$,
computes a solution (a YES or NO) to HAM-CYCLE:&lt;/p>
&lt;ol>
&lt;li>Create the graph $G’ = (V, E’)$ by completing $G$ (i.e. by adding edges to $E$
until there are edges between every pair of vertices in $V$).&lt;/li>
&lt;li>Give the edges in $E$ weights of 0, and give those in $E’ - E$ weights of 1. (Note
that $G’$ is an instance of TSP.)&lt;/li>
&lt;li>Use $A$ to approximate the least cost tour $T$ in $G’$.&lt;/li>
&lt;li>Output NO if $T$ has weight $> 0$ and YES otherwise.&lt;/li>
&lt;/ol>
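&lt;p>Here is a Python sketch of the reduction. Since the approximation algorithm $A$ is hypothetical, a brute-force exact TSP solver stands in for it below; the point is only to show how steps 1 through 4 turn a TSP oracle into a HAM-CYCLE decider. (Any $\alpha$-approximation would also return a weight-0 tour whenever one exists, since $\alpha \cdot 0 = 0$.)&lt;/p>

```python
from itertools import permutations

# Decide HAM-CYCLE on G = (V, E) by building the 0/1-weighted complete
# graph G' and asking a TSP solver for a cheap tour. A brute-force exact
# solver stands in for the hypothetical approximation algorithm A.
def has_hamiltonian_cycle(n, edges):
    edge_set = {frozenset(e) for e in edges}

    def tour_weight(order):
        # weight 0 for edges of G, weight 1 for the edges added to complete G
        total = 0
        for i in range(n):
            e = frozenset((order[i], order[(i + 1) % n]))
            if e not in edge_set:
                total += 1
        return total

    # Fix vertex 0 as the tour's start and try every ordering of the rest.
    best = min(tour_weight((0,) + p) for p in permutations(range(1, n)))
    return best == 0    # weight-0 tour exists iff G has a Hamiltonian cycle
```

&lt;p>On the 4-cycle the decider answers YES; on a star (which has no Hamiltonian cycle) it answers NO.&lt;/p>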
&lt;p>We just need to argue that $A$ outputs a tour of weight 0 if and only if there is a
Hamiltonian cycle in $G$. To see this, note that by definition, $A$ finds a tour
whose combined weight is within some factor of that of the optimal tour on $G’$. If the
optimal tour on $G’$ can indeed be made up only of edges from $G$ (that is, if $G$ has a
Hamiltonian cycle), it has weight 0,
in which case $A$ would have to return an answer within some factor of 0… namely 0.
If $A$ finds a tour with weight $> 0$, then there must not be any
tour using only edges of $G$, in which case we can safely output that there is no
Hamiltonian cycle in $G$. &lt;strong>QED.&lt;/strong>&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>And with that, just a few short words, we were, assuming $P \neq NP$, able to rule
out all possible approximation schemes that anyone could ever think of! Imagine all
the time and effort we’ve saved!&lt;/p></description></item><item><title>Randomized algorithm for file comparison</title><link>https://www.jgindi.me/posts/2017-01-25-file-comp/</link><pubDate>Sat, 25 Feb 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-01-25-file-comp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The next problem we use randomization to solve might seem a bit closer to one we
might face in reality. It goes like this. Alice and Bob each have copies of the same
file that they need to keep synchronized (call Alice&amp;rsquo;s file $A$ and Bob&amp;rsquo;s $B$). Over
time, however, it&amp;rsquo;s possible that Alice&amp;rsquo;s and Bob&amp;rsquo;s files get out of sync. Our
task today is to come up with a protocol by which Alice and Bob can check that $A =
B$ without one having to send the other his/her entire file.&lt;/p>
&lt;h2 id="the-algorithm">The algorithm&lt;/h2>
&lt;p>In order to throw math at the problem the way we want to, we need to make it a bit
more abstract. To this end, we stipulate that $A$ and $B$ are represented by $n$-bit
strings. The comparison protocol works as follows:&lt;/p>
&lt;ul>
&lt;li>Alice picks a prime $p$ uniformly at random from $\{2..n^2\lg n\}$ (fear not; we will explain the choice
of this range soon). Because $A$ is an $n$-bit string, we can look at it as an $n$-
bit binary integer. Alice computes $A \pmod p$ and sends Bob the prime $p$ and $A
\pmod p$. (Note: computing $A \pmod p$ means the remainder of $A$ left over when we
divide it by $p$.)&lt;/li>
&lt;li>Bob computes $B \pmod p$. If $A = B \pmod p$ ($A$ and $B$ leave the same $p$-
remainder), Bob outputs that the files are the same. Otherwise, he should conclude
that the files he and Alice have are out of sync.&lt;/li>
&lt;/ul>
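&lt;p>The protocol can be sketched in Python. The trial-division primality test and the rejection sampling of a random prime below are naive stand-ins chosen for clarity, not what one would use for large $n$:&lt;/p>

```python
import math
import random

# Fingerprinting sketch. Alice's and Bob's "files" are n-bit strings,
# viewed as n-bit binary integers.
def is_prime(q):
    if q == 2:
        return True
    if q % 2 == 0 or q == 1:
        return False
    d = 3
    while q >= d * d:          # trial division by odd numbers up to sqrt(q)
        if q % d == 0:
            return False
        d += 2
    return True

def random_prime(bound, rng):
    """Pick a prime uniformly at random from {2, ..., bound}."""
    while True:
        q = rng.randint(2, bound)
        if is_prime(q):
            return q

def files_probably_equal(a_bits, b_bits, rng):
    """Alice sends (p, A mod p); Bob compares against B mod p."""
    n = len(a_bits)                              # both strings have n bits
    bound = max(2, int(n * n * math.log2(n)))    # the range {2..n^2 lg n}
    a, b = int(a_bits, 2), int(b_bits, 2)
    p = random_prime(bound, rng)                 # Alice's random prime
    return a % p == b % p                        # Bob's check
```

&lt;p>Note that all Alice ever transmits is the pair $(p, A \bmod p)$, which is the source of the $O(\lg n)$ communication cost discussed below.&lt;/p>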
&lt;p>The question we need to answer is now: How confident can we be that the files are
indeed the same when Bob says &amp;ldquo;same&amp;rdquo;? In other words, we want to know what the
probability is that the algorithm errs.&lt;/p>
&lt;h2 id="analysis">Analysis&lt;/h2>
&lt;p>There are two cases to analyze here. If the files were the same to begin with, the
algorithm will never fail. Mathematically, if $A = B$, then $A = B \pmod p$ so Bob
will always output &amp;ldquo;same&amp;rdquo; in this case.&lt;/p>
&lt;p>The interesting case to analyze is the case in which $A \neq B$. In this case, we are
interested in
&lt;/p>
$$\Pr[A = B \pmod p ~|~ A \neq B].$$&lt;p>
In English, we are interested in the probability that Bob outputs &amp;ldquo;same&amp;rdquo; even when
his and Alice&amp;rsquo;s files are not in sync. To analyze this probability, we need to
entertain a quick tangent.&lt;/p>
&lt;p>In particular, we need to motivate our preference that $p \in \{2..n^2\lg n\}$.
There is a neat theorem in number theory called the Prime Number Theorem, which states
that the range $\{2..N\}$ contains about $\frac{N}{\ln N}$ primes (within constant factors, $\frac{N}{\lg N}$). We can
show, using the theorem and some algebra that we need not bother ourselves with here,
that there are about $n^2$ primes in the range from which we drew $p$ (start by
substituting $N = n^2 \lg n$ in the theorem).&lt;/p>
&lt;p>Keep this fact in mind. Let $C = |A - B|$. We can reformulate the probability of the
protocol failing as $\Pr[C = 0 \pmod p ~|~ C \neq 0]$.&lt;/p>
&lt;p>Next, we note that because $A$ and $B$ both have $n$ bits, $C$ too has at most $n$
bits. This means that $1 \leq C \leq 2^n$. Because every prime is at least 2, a nice feature of
$C$ that we can observe is that $C$ has at most $n$ prime divisors (counted with multiplicity). This is the key
fact. Because we had $n^2$ primes from which to choose $p$ and there are (at most)
$n$ bad choices among them (a choice is bad if $p$ divides $C = A - B$, i.e., if $A$ and $B$
leave the same remainder modulo $p$), the probability that $p$ is a &amp;ldquo;bad prime&amp;rdquo; is at most
$\frac{n}{n^2} = \frac{1}{n}$.&lt;/p>
&lt;p>The probability of success is thus at least $1 - 1/n$, which is great because,
intuitively, it means that as our strings get larger, the odds of the algorithm
failing get very, very small.&lt;/p>
&lt;h2 id="space-complexity">Space complexity&lt;/h2>
&lt;p>The last minor thing we note is that the number of bits required for this protocol is
only the number of bits required to represent $p$ and $A \pmod p$ (what Alice sends
to Bob). The $A \pmod p$ is smaller than $p$, so we can write an upper bound on the
total number of bits required as $2\lg p$ bits. Because $p$ is at most $n^2 \lg n$,
we require $2 \lg(n^2\lg n) = \lg (n^4(\lg n)^2) = 4 \lg n + 2\lg(\lg n) = O(\lg n)$ bits to, with high probability, successfully compare the files. Our goal was
to share a sublinear number of bits relative to the size of the file, so the
protocol above achieves the desired aim.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>You&amp;rsquo;ll note that if you go back and look at the matrix multiplication post I put up a
little while ago, the analysis here and there are very similar. In each case, we had
a problem in which we needed to compare two objects without comparing the entire
objects to one another. In this case, the objects were strings; in the matrix case,
the objects were matrices. In both cases, we devised schemes wherein we mapped the
larger objects to smaller ones that were much easier to compare; although the
representations of the smaller objects are lossy, we choose our mapping carefully so
that the information we sacrifice only introduces a small probability of failure. The
technique described above is called fingerprinting, and it is a very powerful tool
used in the study and design of randomized algorithms.&lt;/p></description></item><item><title>Randomized matrix multiplication checking</title><link>https://www.jgindi.me/posts/2017-02-25-check-matmul/</link><pubDate>Sat, 25 Feb 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-02-25-check-matmul/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Warning: This post is pretty technical. It details a part of a lecture from a class
I&amp;rsquo;m in this semester. It requires some mathematical maturity. I&amp;rsquo;ll try my best to make
it as accessible as possible, but how effective I&amp;rsquo;ll be at that remains a mystery&amp;hellip;
Here we go!&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>The simplest algorithm for matrix multiplication takes $O(n^3)$ time. As of right now, the
fastest algorithm we have for matrix multiplication runs in $O(n^{2.373})$ time. This
algorithm is very complicated and I won&amp;rsquo;t go into the details here (mostly because I don&amp;rsquo;t
know them), but should we devise an algorithm that is even faster than that, we might
want an efficient algorithm to check that it computes the correct result. A little
bit of thought shows us that the best conceivable complexity for matrix
multiplication is $O(n^2)$ &amp;mdash; that is the amount of time it would take us to write
out the result even if no additional computation was necessary. This post details a
randomized algorithm that checks the correctness of matrix multiplication in $O(n^2)$
time. Now, to the formal problem statement&amp;hellip;&lt;/p>
&lt;p>Suppose we have three $n \times n$ matrices $A, B$, and $C$, and we want to know
whether $AB = C$.&lt;/p>
&lt;h2 id="first-approach">First approach&lt;/h2>
&lt;p>A first approach is to choose a random entry $C_{ij}$ and check that it was computed
correctly by checking that it equals the dot product of row $i$ of $A$ and column $j$
of $B$. Each such check takes linear time, so we can do up to $n$ of them and still
stay at our desired runtime of $O(n^2)$. This seems clever enough&amp;hellip; I mean, $n$
entries is pretty good, right? If all of those entries were computed correctly, we
can be reasonably sure that $AB = C$&amp;hellip; right?&lt;/p>
&lt;h2 id="trying-again">Trying again&lt;/h2>
&lt;p>Nope! The problem with this approach can be better understood by asking the following
question: If there is some entry of $C$ that was computed incorrectly, what are the
odds that the above algorithm would catch it? Well, if there are $n^2$ entries in $C$
and we choose $n$ of them to check, provided that we pick them uniformly at random,
we have a $\frac{n}{n^2} = \frac{1}{n}$ chance of catching the entry that was
computed incorrectly. Because asymptotically $n^2$ runs away from $n$, as our
matrices get bigger, our odds of catching a mistake shrink. That isn&amp;rsquo;t good! How
else might we go about this?&amp;hellip;&lt;/p>
&lt;p>Consider the following algorithm:&lt;/p>
&lt;ol>
&lt;li>Generate an $n$-bit vector $r$ where each component of the vector is selected
uniformly at random from $\{0,1\}$.&lt;/li>
&lt;li>Compute $ABr$ and $Cr$.&lt;/li>
&lt;li>If $ABr = Cr$, return true, else return false.&lt;/li>
&lt;/ol>
&lt;p>Note that step 1 takes $O(n)$ time. Because $ABr = A(Br)$ and $Br$ is just a vector
of length $n$, step 2 takes $O(n^2 + n^2) = O(n^2)$ time. The last step takes linear
time, so, in total, the above approach takes $O(n^2)$ time &amp;mdash; just what we need. But
how does it compare to the alternative at finding mistakes?&lt;/p>
&lt;h2 id="does-it-work">Does it work?&lt;/h2>
&lt;p>There are two cases to consider here:&lt;/p>
&lt;ol>
&lt;li>$AB = C$&lt;/li>
&lt;li>$AB \neq C$&lt;/li>
&lt;/ol>
&lt;p>In the first case, there aren&amp;rsquo;t any mistakes to catch, so our analysis need only
consider the second case. Define $D = C - AB$. What we are formally interested in is
the probability that $ABr = Cr \iff (C - AB)r = Dr = 0$ when $AB \neq C \iff C - AB =
D \neq 0$. In other words, we are interested in $\Pr[Dr = 0 ~|~ D \neq 0]$. To
compute this probability, suppose there was indeed an entry of $C$ that was computed
incorrectly. Without loss of generality, assume that $D_{11} \neq 0$. Note that this
means that $(AB)_{11}$ and $C_{11}$ are different, so $D_{11}$ is of interest to us &amp;mdash;
it is the elusive mistake. To better understand what&amp;rsquo;s going on here, consider the
following:
&lt;/p>
$$\begin{bmatrix}D_{11} &amp; \dots &amp; D_{1n}\\\vdots &amp; \ddots &amp;\\&amp;
&amp;\end{bmatrix}\begin{bmatrix}r_1\\ r_2\\ \vdots\end{bmatrix}=\begin{bmatrix}D_{11}r_1 + \dots + D_{1n}r_n\\ \vdots\end{bmatrix}$$&lt;p>
(computation of $Dr$).&lt;/p>
&lt;p>The only way the algorithm can be fooled is if, somehow, $D_{11}r_1 + \dots +
D_{1n}r_n = 0$. That is, if the first entry of $Dr$ is $0$, we find ourselves in a
potential case wherein $Dr = 0$ even though $D \neq 0$ &amp;mdash; in English, we find
ourselves in a case, when $ABr = Cr$ even though $AB \neq C$. What is the probability
of this happening? Some thought suggests that we are looking for the probability that
$D_{11}r_1 + \dots +D_{1n}r_n = 0$, or, equivalently, the odds that $r_1 = -
\frac{D_{12}r_2 + \dots + D_{1n}r_n}{D_{11}}$. Recall that $r_1 \in \{0,1\}$. If that
ugly fraction is neither 0 nor 1, we&amp;rsquo;re good to go because $r_1$ cannot possibly take
on that value. If, however, that fraction does equal 0 or 1, then there is a
$\frac{1}{2}$ chance that we assigned $r_1$ that value. Thus, $\Pr[Dr = 0 ~|~ D \neq
0] \leq \frac{1}{2}$, which means that our algorithm will catch a mistake at least
half of the time.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Isn&amp;rsquo;t math cool?!?! In lecture, this was used to show that
randomization is a powerful tool that allows us to do all kinds of mathematically
rigorous magic. I was blown away; I hope you were too.&lt;/p></description></item></channel></rss>