<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Computer-Science on Jack Gindi</title><link>https://www.jgindi.me/tags/computer-science/</link><description>Recent content in Computer-Science on Jack Gindi</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sun, 17 Nov 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://www.jgindi.me/tags/computer-science/index.xml" rel="self" type="application/rss+xml"/><item><title>A bound on sorting performance</title><link>https://www.jgindi.me/posts/2024-11-17-sorting-bound/</link><pubDate>Sun, 17 Nov 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-11-17-sorting-bound/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>If you wake any programmer up in the middle of the night and ask them to name an algorithm, a sizable fraction would probably invoke some kind of sorting procedure. Some might name &lt;a href="https://en.wikipedia.org/wiki/Quicksort#:~:text=Quicksort%20is%20a%20divide%2Dand,or%20greater%20than%20the%20pivot.">quicksort&lt;/a>, some &lt;a href="https://en.wikipedia.org/wiki/Merge_sort">merge sort&lt;/a>, still others &lt;a href="https://en.wikipedia.org/wiki/Insertion_sort">insertion sort&lt;/a>, and some might troll you by naming &lt;a href="https://en.wikipedia.org/wiki/Bogosort">Bogosort&lt;/a>.&lt;/p>
&lt;p>The first three of those algorithms are all what are known as comparison-based sorts: they work by comparing elements and making decisions based on the results of those comparisons. In this post, I want to talk about a lower bound on the efficiency of comparison-based sorting algorithms. In other words, I want to show that if you invented a new comparison-based sorting algorithm, then even without knowing how it works, I could tell you the best worst-case runtime it could conceivably achieve (as a function of the input size).&lt;/p>
&lt;p>To get a better sense for what I mean, let&amp;rsquo;s dive in.&lt;/p>
&lt;h1 id="comparison-based-sorting">Comparison-based sorting&lt;/h1>
&lt;p>To understand what we mean by comparison-based sorting, let&amp;rsquo;s walk through one of the algorithms I mentioned earlier: merge sort.&lt;/p>
&lt;p>Merge sort essentially works by sorting the first half of the input, sorting the second half, and then merging the two sorted results. But how do we sort the first and second halves? We sort the first half of the first half, sort the second half of the first half, then merge them. And so on and so forth. In order for this recursive process to work, though, the process has to bottom out, right? Right! It bottoms out when a &amp;ldquo;half&amp;rdquo; is empty or has one element, since empty and singleton lists are (trivially) sorted.&lt;/p>
&lt;p>To show this with an example, let&amp;rsquo;s say we start with the input list [1, 3, 8, 4, 5, 2, 6, 9, 7]. We would&lt;/p>
&lt;ol>
&lt;li>Split the list into two halves: [1, 3, 8, 4] and [5, 2, 6, 9, 7].&lt;/li>
&lt;li>Split the first half into two halves: [1, 3] and [8, 4].&lt;/li>
&lt;li>Split [1, 3] into two halves: [1] and [3].&lt;/li>
&lt;li>Each of [1] and [3] is sorted, so we merge them into [1, 3].&lt;/li>
&lt;li>Split [8, 4] into two halves: [8] and [4].&lt;/li>
&lt;li>Each of [8] and [4] is sorted, so we merge them into [4, 8].&lt;/li>
&lt;li>Now we merge [1, 3] and [4, 8] into [1, 3, 4, 8].&lt;/li>
&lt;li>Carry out the same recursive process for [5, 2, 6, 9, 7] to get [2, 5, 6, 7, 9].&lt;/li>
&lt;li>Merge [1, 3, 4, 8] with [2, 5, 6, 7, 9] to get the final result: [1, 2, 3, 4, 5, 6, 7, 8, 9].&lt;/li>
&lt;/ol>
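&lt;p>The walk-through above can be turned into a short Python implementation (a sketch of my own, kept simple rather than optimized):&lt;/p>

```python
def merge(left, right):
    # Merge two already-sorted lists into one sorted list.
    # Every element comparison that merge sort performs happens here.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two extends
    merged.extend(right[j:])  # actually appends anything
    return merged

def merge_sort(xs):
    # Base case: empty and singleton lists are (trivially) sorted.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

print(merge_sort([1, 3, 8, 4, 5, 2, 6, 9, 7]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```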
&lt;p>The &amp;ldquo;comparisons&amp;rdquo; in merge sort occur in the merging stage, which we won&amp;rsquo;t go into detail about here. Now that we&amp;rsquo;ve seen one example of a comparison-based sort, we can turn to thinking about sorting more generally using decision trees.&lt;/p>
&lt;h1 id="performance-bound">Performance bound&lt;/h1>
&lt;p>So how can we possibly say anything important about the efficiency of a whole class of algorithms without considering every possible implementation?&lt;/p>
&lt;p>First, let&amp;rsquo;s suppose we have some input list of size $n$. The &lt;em>indices&lt;/em> of this list &amp;ndash; i.e., the numbers $1, \dots, n$ &amp;ndash; have $n! = n \cdot (n-1) \cdot (n-2) \cdot \dots \cdot 3 \cdot 2 \cdot 1$ possible orderings, exactly one of which puts the &lt;em>elements&lt;/em> in sorted order. We want to say something about the minimum number of comparisons required to find this ordering.&lt;/p>
&lt;p>One way to think about this sorting problem is to use the abstraction of a decision tree. To make this more specific, the leaf nodes (the nodes at the bottom of the tree) each represent one possible ordering of the list. The other nodes (called internal nodes) represent comparisons between elements at different indices of the list. An example of this tree is shown in the image below (&lt;a href="https://genome.sph.umich.edu/w/images/b/b8/Biostat615-lecture6-presentation.pdf">source&lt;/a>):&lt;/p>
&lt;img src="https://www.jgindi.me/posts/sorting-bound/tree.png" width="550" height="350"/>
&lt;p>Each of the ovals represents a comparison between the elements at two &lt;em>indices&lt;/em> of the array. To understand how to read this tree, let&amp;rsquo;s say that our input array is called $A$. At the root node of the tree, if $A[1] \leq A[2]$, then we would proceed to take the left branch and compare $A[2]$ with $A[3]$. If $A[2] \leq A[3]$, then we would take the left branch again and reach the leftmost leaf, which would indicate that $A$ was already in sorted order. With other input orderings, though, the index order that results in $A$ being sorted might be some other leaf, which we could similarly determine by doing a bunch of comparisons. (The key to avoiding confusion here is to remember that the numbers in the ovals are &lt;em>indices&lt;/em>, not the actual elements of the input list.)&lt;/p>
&lt;p>Now, if there are $n!$ possible orderings of $A$, then there must be $n!$ leaves in the tree. Furthermore, we know that the length of the longest root-to-leaf path is the largest possible (worst-case) number of comparisons we would need to do to get our sorted order. Thus, in order to understand the best possible worst-case performance of our comparison-based sort, we want to find &lt;strong>the length of the longest possible root-to-leaf path in this decision tree.&lt;/strong>&lt;/p>
&lt;p>Let&amp;rsquo;s suppose that the algorithm always completes after $h$ steps. Another way of stating what we want is to say we&amp;rsquo;re looking for a lower bound on $h$. With $h$ comparisons, we can distinguish between at most $2^h$ orderings, since each comparison has two possible outcomes and the &lt;em>indices&lt;/em> are distinct (even if the elements aren&amp;rsquo;t). In order to guarantee that we find the sorted order, those at most $2^h$ distinguishable outcomes must cover all $n!$ possible orderings. In other words, we need this inequality to hold:
&lt;/p>
$$
2^h \geq n!.
$$&lt;p>Taking the log (base 2, as is customary in computer science) of both sides, this can be rewritten as
&lt;/p>
$$
h \geq \log(n!).
$$&lt;p>That&amp;rsquo;s great!&amp;hellip; But what is $\log(n!)$? On the one hand, $n!$ is huge, but on the other, maybe the $\log$ tames it? Well, we know that $n! \geq n(n-1)\dots (n/2) \geq (n/2)^{n/2}$, so we can rewrite our inequality again as
&lt;/p>
$$
h \geq \log(n!) \geq \log((n/2)^{n/2}) = \frac{n}{2}\log \biggl( \frac{n}{2} \biggr).
$$&lt;p>
(The equality holds because of a property of logarithms: $\log(a^b) = b\log(a)$.) Ignoring constants, we get that $h$ is bounded below by $n \log n$. To put it in a way that underscores how cool this proof is, what we&amp;rsquo;re saying here is that no comparison sort can work using a worst-case number of comparisons that is (ignoring constants) smaller than $n \log n$. Again, we did this without looking at any particular implementations!&lt;/p>
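&lt;p>We can sanity-check the chain of inequalities numerically for a small $n$ (a quick illustration, not part of the proof):&lt;/p>

```python
import math

n = 16
exact = math.log2(math.factorial(n))  # log2(n!), about 44.25 for n = 16
bound = (n / 2) * math.log2(n / 2)    # (n/2) * log2(n/2) = 8 * 3 = 24
assert exact >= bound                 # log2(n!) really does dominate the bound
```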
&lt;h1 id="coda-can-we-do-better">Coda: can we do better?&lt;/h1>
&lt;p>There&amp;rsquo;s one question left to answer: What if we relax the requirement that our algorithm be based on comparisons? Can we achieve a better worst-case performance than $n \log n$?&lt;/p>
&lt;p>The answer is yes, and if our inputs follow a couple of additional (important) assumptions, we can do it with a pretty simple algorithm at that. If the elements of the input list are nonnegative integers that take values up to some maximum $M$, we can use the following algorithm:&lt;/p>
&lt;ol>
&lt;li>Create a list $C$ of zeros of size $M + 1$ (one slot for each possible value $0, \dots, M$). The counting array $C$ effectively acts as a frequency table for the input, where $C[j]$ holds the count of occurrences of the integer $j$.&lt;/li>
&lt;li>For each element of the array, increment the count at that element&amp;rsquo;s index. In other words, if you see a 5, increment $C[5]$.&lt;/li>
&lt;li>Once you&amp;rsquo;ve iterated through the entire list, iterate over $C$ and add $C[j]$ copies of $j$ to the output list.&lt;/li>
&lt;/ol>
&lt;p>This algorithm, called counting sort, is probably the most famous non-comparison sort. If the initial array has $n$ elements and $C$ has size $M$, then the algorithm takes $n + M$ steps to complete (ignoring constants). This can be good or it can be bad. It can be awesome if $M$ is not too much larger than $n$, since then (ignoring constants again), it would take approximately $n$ steps, which is faster than our comparison-sort lower bound of $n \log n$. If $M$ is very large, however, say $M \approx n^2$, then we&amp;rsquo;ve lost our performance edge. This algorithm also doesn&amp;rsquo;t work with non-integers, which makes it less generally applicable than we&amp;rsquo;d like.&lt;/p>
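&lt;p>Here is a minimal sketch of counting sort under those assumptions (nonnegative integers bounded by a known maximum $M$):&lt;/p>

```python
def counting_sort(xs, M):
    # Frequency table: C[j] counts occurrences of the value j.
    # M + 1 slots so that the maximum value M has a valid index.
    C = [0] * (M + 1)
    for x in xs:
        C[x] += 1
    # Emit C[j] copies of each value j, in increasing order of j.
    out = []
    for j, count in enumerate(C):
        out.extend([j] * count)
    return out

print(counting_sort([1, 3, 8, 4, 5, 2, 6, 9, 7], M=9))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```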
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>In this post, we started with a hands-on example, established theoretical bounds using a decision tree formulation of the sorting problem, and finally explored how changing our assumptions about the inputs can unlock faster algorithms.&lt;/p>
&lt;p>Sorting algorithms are a cornerstone of computer science, and understanding their limits helps us appreciate their clever design and implementation. The balance between theory and practice highlights the necessity of mathematics to the design of efficient algorithms that power our world.&lt;/p></description></item><item><title>Simulated annealing</title><link>https://www.jgindi.me/posts/2024-03-08-sim-ann/</link><pubDate>Fri, 08 Mar 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-03-08-sim-ann/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Optimization problems are everywhere!&lt;/p>
&lt;p>Whether it&amp;rsquo;s finding the most efficient way to deliver packages to customers,
determining the best next move in a game of chess, or figuring out how to adjust the parameters of a
gigantic machine learning model, many important practical problems are, at their cores, optimization problems.
In this post, we will learn an optimization meta-algorithm called &lt;strong>simulated annealing&lt;/strong>, a
general approach to (approximately) finding global solutions to optimization problems&amp;hellip; which is,
interestingly, inspired by a physical process from material science.&lt;/p>
&lt;h1 id="simulated-annealing-overview">Simulated annealing: overview&lt;/h1>
&lt;h2 id="annealing">Annealing&lt;/h2>
&lt;p>Before discussing its algorithmic analog, we should sketch out what annealing is and how it works.
Annealing is a process that alters the physical and chemical properties of a metal so that it can be worked
more easily. It begins by heating the metal above its recrystallization temperature so that it
enters a state in which its internal structure can change more freely. We then slowly cool the metal to allow it
to settle into a chemically superior state.&lt;/p>
&lt;h2 id="relationship-to-optimization">Relationship to optimization&lt;/h2>
&lt;p>One way to solve an optimization problem is to carry out the following iterative process:&lt;/p>
&lt;ol>
&lt;li>Start in some state (e.g., a chessboard just after it has been set up).&lt;/li>
&lt;li>Transition from the current state to the new state that most decreases the value of
an objective function you want to minimize (e.g., make a move that decreases your probability of losing).&lt;/li>
&lt;li>Repeat step 2 until a stopping condition is met (e.g., the game ends).&lt;/li>
&lt;/ol>
&lt;p>Simulated annealing modifies this process. It would instead look something like this:&lt;/p>
&lt;ol>
&lt;li>Start in an initial state.&lt;/li>
&lt;li>Sample a random candidate state to transition to.&lt;/li>
&lt;li>With some probability that depends on the current and candidate state, accept the candidate
transition. Otherwise, stay put.&lt;/li>
&lt;li>Repeat steps 2 and 3 until a stopping condition is met.&lt;/li>
&lt;/ol>
&lt;p>The analogy to physical annealing comes from the fact that step 3 depends on a parameter called the temperature.
In physics, the higher the temperature of a system, the more jittery the system is &amp;ndash; that is, the
more random motion there is among its constituent particles. Early in the optimization process, we set a
high temperature; this allows the algorithm to explore by accepting riskier transitions, i.e., those that result in a
higher (worse) objective value than that of our current state. As the annealing progresses, we lower the temperature;
this causes the optimization process to become more conservative. Eventually, the space of acceptable
next states will contain only those that are better than the current state (in terms of objective value).&lt;/p>
&lt;h2 id="global-vs-local-optimization">Global vs local optimization&lt;/h2>
&lt;p>One question you might ask is: Why bother with the high temperature phase at all? If low temperatures
will allow the algorithm to only move toward better solutions, why not always make those kinds of moves?
The key to answering this question lies in the difference between globally and locally optimal solutions
to a problem. Suppose you are on a quest to see the view from the highest point in San Francisco
(which has lots of hills). If you only ever step in the direction of steepest ascent from where you are,
you will reach the top of &lt;em>some&lt;/em> hill, but it&amp;rsquo;s possible that in order to reach the top of the &lt;em>highest&lt;/em> hill,
you should have walked downhill for a while in another direction first and only then started to ascend.
The hill whose acme you reached by steepest ascent &amp;ndash; a local optimum &amp;ndash; is not very difficult to find.
By contrast, finding the true tallest peak in San Francisco &amp;ndash; the global optimum &amp;ndash; is trickier.&lt;/p>
&lt;p>For many problems &amp;ndash; like finding the best set of parameters for a machine learning model &amp;ndash; local
optima work very well, and in many cases we have powerful algorithms for efficiently finding them.
Finding global optima, on the other hand, is far more challenging in general, and good algorithms
are scarcer, if they exist at all. Simulated annealing is a probabilistic strategy for searching for
&lt;em>global&lt;/em> optima by exploring aggressively enough early on to find the base of the right hill.&lt;/p>
&lt;p>In the remainder of this post, we will more explicitly discuss how we carry out steps 2 and 3, and show how
we might apply this meta-algorithm to the &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">traveling salesman problem&lt;/a>,
one of the most difficult discrete optimization problems we have.&lt;/p>
&lt;h1 id="acceptance-probability">Acceptance probability&lt;/h1>
&lt;p>The key detail that I want to make more precise is how we formulate the probability of accepting a
transition from one state to another. In terms of notation, $\mathcal S$ is the entire state
space, $s \in \mathcal S$ refers to the current state, $s' \in \mathcal S$ refers
to the candidate state, $E:\mathcal S \to \mathbb R$ is the objective function (lower is better) that we want
to minimize, and $T_k$ is the temperature parameter value on the $k$th step (a real number). Our objective
is to find the state $s^\star$ that minimizes $E$. In other words, we want to find
&lt;/p>
$$
s^\star := \underset{s \in \mathcal S}{\text{argmin}} ~ E(s).
$$&lt;p>
(The &amp;ldquo;$:=$&amp;rdquo; symbol means that the right hand side is the definition of $s^\star$, rather than some
equation to prove or solve.)&lt;/p>
&lt;p>For simplicity, define $e = E(s)$ and $e' = E(s')$. If we are currently in state $s$
and $e' &lt; e$, we automatically transition to $s'$. If $e' > e$, then we transition
with probability
&lt;/p>
$$
P_{\rm acc}(e, e'; T_k) = \exp(-(e' - e) / T_k).
$$&lt;p>
(Note: This is not a probability distribution over states. Instead, here, $P_{\rm acc}$ is used to make a decision about
whether to transition to a &lt;em>particular&lt;/em> successor state. For this purpose, after we compute
$p = P_{\rm acc}(e, e'; T_k)$, we can use a random number generator to generate a random
number $r$. If $r &lt; p$, we transition. To sample a state from the entire state space, we would need the
transition probabilities for each possible transition to sum to 1.)&lt;/p>
&lt;p>Let&amp;rsquo;s take a minute to think through why this acceptance probability works the way we want it to:&lt;/p>
&lt;ul>
&lt;li>Since we would have automatically accepted if $e' &lt; e$, we can assume that $e' - e > 0$. If
this difference is large, the negative sign and the exponential around it make
$P_{\rm acc}$ very small. This means that the probability of accepting a transition
decays exponentially for less desirable candidates.&lt;/li>
&lt;li>Decreasing the value of $T_k$ (as $k$ increases) causes the exponent to become large and negative,
producing probabilities close to 0. This means that as we run more steps and decrease the temperature,
the same differences in objective value will become less and less acceptable. This aligns with the
intuition that as the temperature decreases, the optimization process becomes more conservative.&lt;/li>
&lt;/ul>
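&lt;p>The acceptance rule fits in a few lines of Python (a sketch; the function names are mine, chosen to match the $P_{\rm acc}$ notation above):&lt;/p>

```python
import math
import random

def p_acc(e, e_new, T):
    # Probability of accepting a candidate: 1 for improvements,
    # exp(-(e_new - e) / T) for moves to a worse (higher) objective value.
    if e_new <= e:
        return 1.0
    return math.exp(-(e_new - e) / T)

def accept(e, e_new, T):
    # Accept the transition if a uniform random draw falls below p_acc.
    return random.random() < p_acc(e, e_new, T)
```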
&lt;h1 id="application-the-traveling-salesman-problem-tsp">Application: The Traveling Salesman Problem (TSP)&lt;/h1>
&lt;p>If you&amp;rsquo;ve never heard of the traveling salesman problem, check out &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">this wikipedia article&lt;/a>
before continuing. To summarize:&lt;/p>
&lt;ul>
&lt;li>There are $n$ cities to visit.&lt;/li>
&lt;li>There are roads connecting every pair of cities.&lt;/li>
&lt;li>Each road has a (nonnegative) toll associated with it.&lt;/li>
&lt;li>&lt;strong>Goal:&lt;/strong> Find the minimum cost path that ends where you start and visits each city exactly once.&lt;/li>
&lt;/ul>
&lt;p>The first thing to remember when using simulated annealing is that for most problems we would
apply it to, we should expect to not obtain the globally optimal solution at the end; instead, we
hope for a result that is just good enough. In the case of TSP, simulated annealing can give us a
reasonable approximation, but we cannot really guarantee anything more than that.&lt;/p>
&lt;p>Another practical consideration that arises is how to define the state space for the problem at hand.
Before continuing, think about how you might define it for TSP.&lt;/p>
&lt;p>A sensible way to define it is to consider any ordering of the $n$ cities to be a state. Defining states
this way, there are $n!$ states &amp;ndash; for $n \geq 20$, this number is absolutely massive. With such large state
spaces, one typically also has to narrow the space of transitions under consideration at each step. In the case of TSP,
we might do this by only allowing transitions that swap a pair of cities in the order. Can you think of
other methods?&lt;/p>
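&lt;p>The pair-swap transition mentioned above might look like this in Python (a sketch; a state is just a list of city indices):&lt;/p>

```python
import random

def randomly_swap_pair(route):
    # Candidate transition: swap the cities at two distinct positions,
    # leaving the current route untouched.
    i, j = random.sample(range(len(route)), 2)
    candidate = route[:]  # copy so the current state is preserved
    candidate[i], candidate[j] = candidate[j], candidate[i]
    return candidate
```

&lt;p>Another common family of moves reverses an entire contiguous segment of the route (the classic 2-opt move).&lt;/p>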
&lt;p>Finally, we define our objective function $E$ to just be the total cost of a particular route, and we
stop when we&amp;rsquo;ve gone some number of iterations without making progress over the best solution we&amp;rsquo;ve obtained
so far. With this setup, we can implement our algorithm following the pseudo-python below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">simulated_annealing_TSP&lt;/span>(G, max_iters_with_no_improvement):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> G: the initial problem structure (tolls for each road)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> max_iters_with_no_improvement: The maximum number of iterations allowed without
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> surpassing the best seen so far before termination.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># initialize the temperature and pick an initial state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> T &lt;span style="color:#f92672">=&lt;/span> initialize_temperature()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> s &lt;span style="color:#f92672">=&lt;/span> pick_random_city_order(G)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter, best_state, lowest_so_far &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#66d9ef">None&lt;/span>, inf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># while we&amp;#39;re making progress...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">while&lt;/span> no_improvement_counter &lt;span style="color:#f92672">&amp;lt;&lt;/span> max_iters_with_no_improvement:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># select a candidate next state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> candidate &lt;span style="color:#f92672">=&lt;/span> randomly_swap_pair(s) &lt;span style="color:#75715e"># using any restriction&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># compute the costs of the current state and the candidate&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> e_s, e_cand &lt;span style="color:#f92672">=&lt;/span> total_cost(s, G), total_cost(candidate, G)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># decide whether to accept by comparing a uniform random number&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># to the acceptance probability described earlier&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> uniform(&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">&amp;lt;&lt;/span> p_acc(e_s, e_cand, T):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> s &lt;span style="color:#f92672">=&lt;/span> candidate
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># if we see a new best, reset the progress counter and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># save the best state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> e_cand &lt;span style="color:#f92672">&amp;lt;&lt;/span> lowest_so_far:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>                best_state, lowest_so_far &lt;span style="color:#f92672">=&lt;/span> s, e_cand
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># reduce the temperature&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> T &lt;span style="color:#f92672">=&lt;/span> reduce_temperature(T)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># return the best state when the iteration completes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> best_state
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are some details and optimizations left out, but hopefully the code feels straightforward enough for
you to try to implement this on your own!&lt;/p>
&lt;p>(Note: One detail we left out in the above is the schedule to use to reduce $T_k$ over time. This is
a subtle problem, since if we lower it too quickly, our optimization process will not sufficiently explore,
whereas if we lower it too slowly, we may not make forward progress fast enough.)&lt;/p>
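&lt;p>One common concrete choice is a geometric schedule, which multiplies the temperature by a constant $\alpha$ slightly below 1 at every step (the value of $\alpha$ here is illustrative, not prescribed):&lt;/p>

```python
def reduce_temperature(T, alpha=0.995):
    # Geometric (exponential) cooling: after k steps, T_k = T_0 * alpha**k.
    # An alpha close to 1 cools slowly (more exploration early on);
    # a smaller alpha cools faster but risks freezing into a local optimum.
    return alpha * T
```

&lt;p>Starting from $T_0 = 1$, a thousand steps of this schedule bring the temperature down to roughly $0.995^{1000} \approx 0.007$.&lt;/p>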
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>In this post, we briefly described a meta-algorithm called simulated annealing that can help approximate
global optima for properly formulated optimization problems, many of which are extremely computationally difficult.
It is often most useful when there are not other more direct, problem-specific algorithms we can bring to bear.
In addition to describing the general setup, we also looked at how we could apply this approach to the TSP,
which gave us a flavor for some of the practical considerations that arise when trying to fit a problem
into the SA framework.&lt;/p>
&lt;p>Simulated annealing is a powerful tool that is employed to solve a variety of thorny optimization problems
across the sciences. It&amp;rsquo;s a good tool to have in your toolkit &amp;ndash; I hope it comes in handy!&lt;/p></description></item><item><title>Distinct values in a data stream</title><link>https://www.jgindi.me/posts/2022-09-11-data-stream/</link><pubDate>Sun, 11 Sep 2022 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2022-09-11-data-stream/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I detail a randomized algorithm (that looks rather like black magic) to count the number of distinct elements in a data stream.&lt;/p>
&lt;h2 id="naive-solution">Naive solution&lt;/h2>
&lt;p>Suppose that data is presented as a sequence of values $\sigma = s_1, \dots, s_m$ where, for simplicity, the $s_i \in \lbrace 0, \dots, n - 1\rbrace$. I want to know the number of distinct values that were in the stream. For example, if the sequence were $\sigma = 1,2,3,4,5,5,7$, our algorithm should output 6. How might we accomplish this?&lt;/p>
&lt;p>A very simple way is to keep an $n$-bit vector $v$ ($n$ because that is the size of the set our values are being drawn from) where $v_i$ represents whether we have seen the element $i$. Once we have seen all of the data, we sum the values in $v$ and output that as our result. Easy, right?&lt;/p>
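&lt;p>In Python, the naive approach looks like this (a sketch; a list of booleans stands in for the bit vector):&lt;/p>

```python
def count_distinct_naive(stream, n):
    # v plays the role of the n-bit vector: v[i] marks whether value i appeared.
    v = [False] * n
    for s in stream:
        v[s] = True
    return sum(v)  # True counts as 1, so this sums the set bits

print(count_distinct_naive([1, 2, 3, 4, 5, 5, 7], n=8))  # 6
```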
&lt;h2 id="does-it-work">Does it work?&lt;/h2>
&lt;p>The issue here is that for sufficiently large data sets, using $n$ bits of storage is not feasible. The above approach is (provably) optimal in terms of space&amp;hellip; but that &amp;ldquo;optimality&amp;rdquo; is only with respect to deterministic algorithms that produce the correct result every time. What if we used randomization? Might we be able to achieve sublinear space usage?&lt;/p>
&lt;h2 id="a-better-solution">A better solution&lt;/h2>
&lt;p>The rest of this post will detail a randomized algorithm that uses sublinear space to solve the above problem with an answer that is approximately correct as close to 100% of the time as we&amp;rsquo;d like&amp;hellip; it&amp;rsquo;s rather like magic. Let&amp;rsquo;s see how it works.&lt;/p>
&lt;p>The algorithm is as follows:&lt;/p>
&lt;ul>
&lt;li>Choose $\varepsilon \in (0,1)$.&lt;/li>
&lt;li>Let $t = \frac{400}{\varepsilon^2}$.&lt;/li>
&lt;li>Pick a pairwise independent hash function $h: \{0,\dots,n-1\} \to \{0,\dots,n-1\}$&lt;/li>
&lt;li>Upon receiving $s_i$, compute $h(s_i)$ and update $D$, a data structure with which we keep track of the $t$ smallest distinct hash values we&amp;rsquo;ve computed.&lt;/li>
&lt;li>When we stop receiving values, let $X$ be the $t^{th}$ smallest hash value and output $\frac{nt}{X}$.&lt;/li>
&lt;/ul>
&lt;p>Before we move on, note that the algorithm only requires space for (1) the hash function (discussed below) and (2) the data structure $D$, which holds just $t$ hash values &amp;ndash; an amount of space that depends on $\varepsilon$ but not on the length of the stream&amp;hellip; Assuming that our hash function doesn&amp;rsquo;t take up too much space, the algorithm satisfies our space requirement.&lt;/p>
&lt;p>I imagine you&amp;rsquo;re thinking what I&amp;rsquo;m thinking (or what I was thinking)&amp;hellip; namely, that there is absolutely no reason why that should work. Before we dive into some completely mind-blowing mathematical analysis, I want to quickly digress to explain what a pairwise independent hash function is and then to fill in some details and provide an &amp;ldquo;intuitive&amp;rdquo; flavor for where this algorithm comes from.&lt;/p>
&lt;h3 id="pairwise-independent-hash-functions">Pairwise independent hash functions&lt;/h3>
&lt;p>Imagine we have a hash function $h$ chosen at random and two arbitrary distinct inputs $x_1$ and $x_2$. The hash function $h$ is pairwise independent if knowing the value of $h(x_1)$ gives us no information about the value of $h(x_2)$.&lt;/p>
&lt;p>In mathematical terms, $h$ is pairwise independent if, for all distinct inputs $x_1 \neq x_2$ and all outputs $y_1, y_2$,
$\Pr[h(x_1)=y_1 \wedge h(x_2) = y_2] = \frac{1}{n^2}$
($n$ is the size of the output space in our case, and the probability is over the random choice of $h$). The natural question we ask when we present a definition is: Do such objects exist? Without going into too much detail about why, be assured they do indeed exist, and we are going to pick ours from the family
$\mathcal{H} = \lbrace h_{ab}: \lbrace 0,\dots,p-1 \rbrace \to \lbrace 0,\dots, p-1\rbrace \rbrace $
where $p$ is prime, $0 \leq a \leq p -1$ and $0 \leq b \leq p-1$, defined, for some input $k$, by $h_{ab}(k) = ak + b \mod p$. In our case, if $n$ is not prime, we can find a prime near $n$ and let $p$ be that prime (an interesting proof for another time &amp;ndash; Bertrand&amp;rsquo;s postulate &amp;ndash; is that
for any $n$, there is always a prime between $n$ and $2n$). Note that the only things we have to store about this hash function to use it are $a$ and $b$ &amp;ndash; each of which only requires $\log p$ bits of storage (so we are still well under the linear space we are trying to avoid).&lt;/p>
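&lt;p>Sampling a member of this family is a one-liner in Python (a sketch; the caller is responsible for supplying a prime $p$):&lt;/p>

```python
import random

def make_hash(p):
    # Draw h_ab(k) = (a*k + b) mod p uniformly from the family.
    # Storing the function costs only the two integers a and b.
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda k: (a * k + b) % p
```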
&lt;h2 id="how-do-we-get-">How do we get $\frac{nt}{X}$?&lt;/h2>
&lt;p>Next, I&amp;rsquo;ll try to motivate where $\frac{nt}{X}$ comes from. Suppose that there are $k$ distinct values in the stream (that is, suppose that $k$ is the solution to our problem). Let those values be $a_1,\dots,a_k$. Because the hash function spreads $k$ distinct values roughly evenly over $n$ possible outputs, we expect the gap between consecutive values of $h(a_i)$ to be about $\frac{n}{k}$. In particular, we expect the $t$th smallest hash value to land near $t \cdot \frac{n}{k}$. Thus, $X \approx \frac{nt}{k}$. Solving for $k$, we see that $k \approx \frac{nt}{X}$, so that&amp;rsquo;s exactly what we output. In what follows, we will need to distinguish between the right answer and our output, so the correct answer will henceforth be referred to as $k$ and our output, $nt/X$, as $\hat k$.&lt;/p>
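&lt;p>Putting the pieces together, the whole estimator is only a few lines (a self-contained sketch with illustrative names; keeping the $t$ smallest distinct hash values in a plain set costs $O(t)$ per update &amp;ndash; a heap would be faster, but this keeps the logic transparent):&lt;/p>

```python
import random

def estimate_distinct(stream, n, p, eps=0.5):
    t = int(400 / eps**2)
    # One random member of the pairwise independent family h(k) = (a*k + b) mod p.
    a, b = random.randrange(p), random.randrange(p)
    smallest = set()  # the (up to) t smallest distinct hash values seen so far
    for s in stream:
        smallest.add((a * s + b) % p)
        if len(smallest) > t:
            smallest.discard(max(smallest))
    if len(smallest) < t:
        # Fewer than t distinct hash values ever appeared, so we have
        # effectively counted the distinct hash values exactly.
        return len(smallest)
    X = max(smallest)  # the t-th smallest hash value
    return n * t // X
```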
&lt;h2 id="so-does-the-fancy-algorithm-work">So does the fancy algorithm work?&lt;/h2>
&lt;p>All that&amp;rsquo;s left to do now is show that $\hat k$ is very close to $k$. Mathematically speaking, we want to show that
&lt;/p>
$$\frac{k}{1 + \varepsilon} \leq \hat k \leq (1+\varepsilon)k$$&lt;p>
with a probability $\geq \frac{99}{100}$ (where $\varepsilon$ is the parameter we chose in the first step of the algorithm).
If we can show that $\Pr[\hat k > (1+\varepsilon)k] \leq \frac{1}{200}$ and that $\Pr[\hat k &lt; k/(1 +\varepsilon)] \leq \frac{1}{200}$, we get
&lt;/p>
$$\Pr[k/(1+\varepsilon) \leq \hat k \leq (1+\varepsilon)k] = 1 -\Pr[\hat k > (1+\varepsilon)k] -\Pr[\hat k &lt; k/(1 +\varepsilon)].$$&lt;p>Because each of the probabilities on the right side is $\leq \frac{1}{200}$, we can rewrite the above as
$\Pr[k/(1+\varepsilon) \leq \hat k \leq (1+\varepsilon)k] \geq 1 - \frac{2}{200} = \frac{99}{100}$
which is what we want.&lt;/p>
&lt;p>So all we have left to do is show that the two probabilities are indeed both $\leq \frac{1}{200}$.
We will analyze only one of them, because a symmetric argument takes care of the other side. All of this put together means we only have one claim left to prove: $\Pr[\hat k > (1+\varepsilon)k] \leq \frac{1}{200}$.&lt;/p>
&lt;p>First, we note that
&lt;/p>
$$
\begin{align*}
\Pr[\hat k > (1+\varepsilon)k] &amp;= \Pr[ (nt)/X > (1+\varepsilon)k] \\
&amp;= \Pr\biggl[X &lt; \frac{nt}{(1+\varepsilon)k}\biggr].
\end{align*}
$$&lt;p>With this in mind, define a random variable $Y_i$ which takes the value 1 if $h(a_i) &lt; \frac{nt}{(1+\varepsilon)k}$ and 0 otherwise. Now, observe that the probability of $h(a_i)$ taking a value less than $\frac{nt}{(1+\varepsilon)k}$ is the number of possible hash values between 0 and $\frac{nt}{(1 + \varepsilon)k}$ divided by the total number of values $h(a_i)$ can take. We can write this mathematically as
&lt;/p>
$$
E[Y_i] = \frac{tn}{(1+\varepsilon)kn} = \frac{t}{(1+\varepsilon)k}.
$$&lt;p>Next, let the random variable $Y$ be the sum of the $Y_i$. Because expectation is linear, we can infer that $E[Y] = \sum_{i = 1}^k E[Y_i] = k \cdot\frac{t}{(1+\varepsilon)k} = \frac{t}{1 + \varepsilon}$. Because $h$ is pairwise independent, the $Y_i$ are pairwise independent as well, so their variances add, giving
$\text{Var}(Y) = \sum_{i=1}^k \text{Var}(Y_i) = \frac{t}{1+\varepsilon} - \frac{t^2}{(1+\varepsilon)^2 k} \leq \frac{t}{1 + \varepsilon} = E[Y]$.
We&amp;rsquo;re almost there!&lt;/p>
&lt;p>We can now more readily examine the probability we were interested in above in terms of $Y$. That is, we can say
&lt;/p>
$$\Pr \biggl[X &lt; \frac{nt}{(1+\varepsilon)k} \biggr] = \Pr[Y \geq t].$$&lt;p>
Why? The left-hand probability represents the chance that the $t$th smallest hash value we saw was less than some value; let&amp;rsquo;s call said nasty value $M$ for a minute. $Y$ is the number of hash values we saw that were less than $M$. If at least $t$ values hashed to values less than $M$, then $X$, the $t$th smallest hash value, will be less than $M$, hence the equality.&lt;/p>
&lt;p>From our expectation computation, $t = (1 + \varepsilon) E[Y]$, so what we really want to bound is $\Pr[Y \geq (1 + \varepsilon) E[Y]]$. This is bounded above by $\Pr[|Y - E[Y]| \geq \varepsilon E[Y]]$ (adding in the absolute value can only make the event more likely). Chebyshev&amp;rsquo;s inequality tells us that
&lt;/p>
$$\Pr[|Y - E[Y]| \geq \varepsilon E[Y]] \leq \frac{\text{Var}(Y)}{\varepsilon^2 E[Y]^2}.$$&lt;p>
Because $\text{Var}(Y) \leq E[Y]$, we can write
&lt;/p>
$$\frac{\text{Var}(Y)}{\varepsilon^2 E[Y]^2} \leq \frac{E[Y]}{\varepsilon^2 E[Y]^2} = \frac{1}{\varepsilon^2 E[Y]}.$$&lt;p>
Now recall that earlier, we said that $t = (1 + \varepsilon) E[Y] \iff E[Y] = \frac{t}{1 + \varepsilon}$. We can substitute this in for $E[Y]$ above and get
&lt;/p>
$$\frac{1}{\varepsilon^2 E[Y]} = \frac{1 + \varepsilon}{\varepsilon^2 t}.$$&lt;p>
Because $\varepsilon$ is at most 1 (so $1 + \varepsilon \leq 2$) and we chose $t = \frac{400}{\varepsilon^2}$, we conclude
&lt;/p>
$$\frac{1 + \varepsilon}{\varepsilon^2 t} \leq \frac{2}{\varepsilon^2 \frac{400}{\varepsilon^2}} = \frac{2}{400} = \frac{1}{200}.$$&lt;p>Thus, all in all, we&amp;rsquo;ve shown that the odds of our return value being an over-estimate is bounded above by 1/200. A similar argument shows that the probability of underestimating is also bounded above by 1/200, so the probability of erring is at most 1/200 + 1/200 = 1/100 which means our probability of success is at least 99/100, as desired.&lt;/p></description></item><item><title>Solving Wordle</title><link>https://www.jgindi.me/posts/2022-01-12-solving-wordle/</link><pubDate>Wed, 12 Jan 2022 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2022-01-12-solving-wordle/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>For those who don&amp;rsquo;t know what Wordle is, &lt;a href="https://www.powerlanguage.co.uk/wordle/">check it out&lt;/a>.
It&amp;rsquo;s essentially a word game that works like the game MasterMind. If you&amp;rsquo;ve been on the internet
in the past couple of weeks, you&amp;rsquo;ve probably seen your friends or follows posting
little images that show how quickly, and by which path, they solved the day&amp;rsquo;s puzzle.
After trying it, I thought it might be fun to try to write some code that solves the puzzle
(most of the time). The rest of this post will walk through how I came up with the solution,
how I put the code together, and some insights I gleaned using my solver.&lt;/p>
&lt;h2 id="so-what-are-the-rules">So what are the rules?&lt;/h2>
&lt;p>Before going any further, I want to review the rules. The game proceeds
as follows. A target word is chosen and hidden from the player. On each turn,
the player guesses a five-letter word. After each guess, the player receives feedback
about the letter at each position of their guess. For each position, the player might
receive:&lt;/p>
&lt;ol>
&lt;li>&lt;span style="color:green">&lt;strong>Green:&lt;/strong>&lt;/span> if the letter of the guess matches the letter of the target at that position.&lt;/li>
&lt;li>&lt;span style="color:orange">&lt;strong>Yellow:&lt;/strong>&lt;/span> if the letter of the guess matches the letter at some other position of the target.&lt;/li>
&lt;li>&lt;span style="color:grey">&lt;strong>Grey:&lt;/strong>&lt;/span> if the letter of the guess is not in the target.&lt;/li>
&lt;/ol>
&lt;p>For example, if the target word is &amp;ldquo;taker&amp;rdquo; and the guess is &amp;ldquo;talks&amp;rdquo;, the feedback
would be
&lt;span style="color:green">ta&lt;/span>&lt;span style="color:grey">l&lt;/span>&lt;span style="color:orange">k&lt;/span>&lt;span style="color:grey">s&lt;/span>
, because the first two letters are exactly right, &amp;ldquo;l&amp;rdquo; and &amp;ldquo;s&amp;rdquo; are
not in the target word at all, and &amp;ldquo;k&amp;rdquo; is in the target but in a different position than
it occupies in the guess.&lt;/p>
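&lt;p>The feedback rules can be sketched in code. This is my own reconstruction (the function name and the single-letter encoding are mine): "G", "Y", and "B" stand for green, yellow, and grey, and repeated letters earn yellows only up to the number of unmatched copies in the target.&lt;/p>

```python
from collections import Counter

def feedback(guess, target):
    """Return one of 'G' (green), 'Y' (yellow), 'B' (grey) per position."""
    result = ["B"] * len(guess)
    unmatched = Counter()          # target letters not matched by a green
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = "G"
        else:
            unmatched[t] += 1
    for i, g in enumerate(guess):  # award yellows from the unmatched pool
        if result[i] != "G" and unmatched[g] > 0:
            result[i] = "Y"
            unmatched[g] -= 1
    return "".join(result)
```

For the example above, &lt;code>feedback("talks", "taker")&lt;/code> comes out to &lt;code>"GGBYB"&lt;/code>, matching the colors shown.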
&lt;h2 id="coming-up-with-a-solution">Coming up with a solution&lt;/h2>
&lt;p>At first, I tried to apply some concepts I had been studying as part of a reinforcement
learning class I&amp;rsquo;d been taking online. It&amp;rsquo;s possible that the formulation I came up
with just wasn&amp;rsquo;t a good one, but a simple approach without any fancy AI turned out
to actually work very well. I&amp;rsquo;ve learned, both through my job and some independent
study, that conceptual simplicity is often underrated.&lt;/p>
&lt;p>My general approach was simple. After collecting the body of words that Wordle
uses (which can actually be obtained pretty easily by inspecting the page source
of the game&amp;rsquo;s webpage), I thought through what the skeleton of an algorithm would
look like, and I came up with this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>guesses_made &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>set current guess to an initial guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">while&lt;/span> (guesses_made &lt;span style="color:#f92672">&amp;lt;&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span>) &lt;span style="color:#f92672">and&lt;/span> (current_guess &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#f92672">not&lt;/span> the target):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> get feedback on current guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> use feedback to reduce the set of valid words
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> make make another guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> increment guesses_made
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>One way of looking at this skeleton is that we begin the game with no constraints
on which words are and are not valid. As
we make guesses and see feedback, we gain additional information that
allows us to further and further constrain the set of available words until &amp;ndash; hopefully &amp;ndash;
we&amp;rsquo;ve narrowed it all the way down and we&amp;rsquo;re certain of the answer.&lt;/p>
&lt;p>As described just above, the algorithm skeleton is missing a few important details,
namely:&lt;/p>
&lt;ul>
&lt;li>How does the feedback allow us to determine the pool of valid words we can choose from?&lt;/li>
&lt;li>How do we make our next guess given a set of guessable words?&lt;/li>
&lt;/ul>
&lt;p>The implementation choices we make to answer those two questions ultimately lead
to different algorithms. In this post, we discuss
some quick-and-dirty, very simple choices that turn out to perform well, but
I&amp;rsquo;d encourage you to come up with interesting alternatives on your own to see if you
can come up with something even better!&lt;/p>
&lt;h3 id="using-the-feedback">Using the feedback&lt;/h3>
&lt;p>Using the feedback requires specifying what kinds of words each type of feedback allows
us to eliminate.&lt;/p>
&lt;p>When we receive grey feedback, we know to eliminate all words that contain the grey letter.&lt;/p>
&lt;p>When we receive green feedback, at position 2, say, we know to eliminate all words
that do not contain the green letter at position 2.&lt;/p>
&lt;p>When we receive yellow feedback, there are two kinds of elimination we can perform. If
we get yellow feedback at position 3, then the letter of our guess at position 3 cannot be
in the target word at position 3, so we can eliminate all words whose letter at position 3
matches our guess&amp;rsquo;s. We can also eliminate any words that do not contain the yellow
letter at all, as we know it must appear somewhere in the target.&lt;/p>
&lt;p>Finally, there is the problem of words with the same letter repeated multiple times.
Thinking things through a little bit, we realize that the number of yellow and green
copies of a given letter is a lower bound on the number of copies of that letter that
must be in the target word. For example, if we have a yellow &amp;ldquo;t&amp;rdquo; and a green &amp;ldquo;t&amp;rdquo; in
the feedback, we know that the target word must have at least 2 &amp;ldquo;t&amp;quot;s, so we can eliminate
all words with 0 or 1 &amp;ldquo;t&amp;quot;s.&lt;/p>
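&lt;p>A sketch of these elimination rules in code (my own illustrative version; the name and the per-position "G"/"Y"/"B" feedback encoding are assumptions):&lt;/p>

```python
from collections import Counter

def survives(word, guess, fb):
    """Return True if `word` is still a valid target given `guess` and
    its feedback string `fb` ('G'/'Y'/'B' at each position)."""
    # yellow + green copies give a lower bound on copies in the target
    lower = Counter(g for g, f in zip(guess, fb) if f in "GY")
    for i, (g, f) in enumerate(zip(guess, fb)):
        if f == "G" and word[i] != g:
            return False           # green: letter must match at position i
        if f == "Y" and word[i] == g:
            return False           # yellow: letter cannot sit at position i
        if f == "B" and lower[g] == 0 and g in word:
            return False           # grey with no flagged copies: letter absent
    # the word must contain at least the flagged number of copies
    return all(word.count(c) >= cnt for c, cnt in lower.items())
```

Note that the "yellow letter must appear somewhere" rule is enforced by the final count check, since a yellow letter contributes at least 1 to its lower bound.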
&lt;h3 id="making-the-next-guess">Making the next guess&lt;/h3>
&lt;p>Each time we use feedback to cull the set of valid words, we have to then choose from
a potentially vast set of remaining words. In order to do this, we have to come up
with some heuristic to narrow the field.&lt;/p>
&lt;p>In my case, I chose to give each word a score and then pick the word with the highest score (breaking ties randomly when required).
To compute the score, I first came up with the distribution of letters in each position.
For example, at position 1, maybe &amp;ldquo;s&amp;rdquo; was the most common letter, making up 6% of the letters
found in position 1 across the set of possible words.
If the word under evaluation contains an &amp;ldquo;s&amp;rdquo; at position 1, the word
would accrue a credit of 0.06 for the &amp;ldquo;s&amp;rdquo;. The sum of these credits across the 5 positions
determines the word score. At each point, I select the valid word with the highest score.&lt;/p>
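&lt;p>The positional scoring heuristic is simple enough to sketch in a few lines (again, my own illustrative code, assuming five-letter words):&lt;/p>

```python
from collections import Counter

def best_guess(words):
    """Score each candidate by summing, position by position, the
    frequency of its letter at that position across the pool, then
    return a highest-scoring word."""
    n = len(words)
    freq = [Counter(w[i] for w in words) for i in range(5)]
    score = lambda w: sum(freq[i][w[i]] / n for i in range(5))
    return max(words, key=score)
```

In the real solver, the pool passed in would be the set of words that survive all the feedback collected so far.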
&lt;p>This scoring system has (at least) one obvious weakness! If the target word has letters in certain positions that are very
uncommon for that position, the algorithm will pick other words first and possibly run out of guesses. Trying to figure out
how to remedy this would make the algorithm more robust, but I haven&amp;rsquo;t given it enough thought as of this writing.&lt;/p>
&lt;h2 id="how-well-does-the-solver-work">How well does the solver work?&lt;/h2>
&lt;p>With an allowance of 6 guesses, on a random sample of 5k target words, my solver successfully found the target word about 90% of the time
in an average of 5 guesses.
Increasing the allowance to 9 guesses, it succeeds 98.5% of the time. With 15 guesses, it succeeds on all 5k examples. In the instances
where it fails with 6 guesses, there are, on average, about 7 valid choices left. That&amp;rsquo;s pretty good!&lt;/p>
&lt;h2 id="what-is-the-best-word-to-start-with">What is the best word to start with?&lt;/h2>
&lt;p>Before trying to answer this question, it should be noted that &amp;ldquo;best&amp;rdquo; in this context
depends on your algorithm. Different scoring methodologies, for example, would imply
a different ordering on the quality of initial guesses. The results below are obtained
using the algorithm we just described; if you vary the algorithm, you might find something
different.&lt;/p>
&lt;p>For each possible initial guess &amp;ndash; around 13k of them &amp;ndash; I chose 200 random target words.
(I could have chosen more than 200, but I was constrained by computation time.)
For each guess, we record the fraction of the 200 problems that were solved successfully and assign
that as a score for that initial guess.&lt;/p>
&lt;p>Some of the words that achieved the top 5% of scores were&lt;/p>
&lt;ul>
&lt;li>chapt&lt;/li>
&lt;li>chimp&lt;/li>
&lt;li>germs&lt;/li>
&lt;li>compt&lt;/li>
&lt;li>match&lt;/li>
&lt;li>chems&lt;/li>
&lt;li>frump&lt;/li>
&lt;li>bumph&lt;/li>
&lt;li>spick&lt;/li>
&lt;li>crumb&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;p>and some of the words in the bottom 5% were&lt;/p>
&lt;ul>
&lt;li>nanna&lt;/li>
&lt;li>zooea&lt;/li>
&lt;li>gazoo&lt;/li>
&lt;li>zexes&lt;/li>
&lt;li>vairy&lt;/li>
&lt;li>roque&lt;/li>
&lt;li>navvy&lt;/li>
&lt;li>ninon&lt;/li>
&lt;li>ozzie&lt;/li>
&lt;li>nouny&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;p>(there were others, but I don&amp;rsquo;t show them here for brevity). None of the words in the top
5% of scorers repeated letters, while 90% of those in the bottom 5% did, suggesting
that when picking your first word, it would be unwise to pick a word with repeated letters.&lt;/p>
&lt;p>I also looked at how much of the best initial guesses were made up of the five most common
(s, e, a, o, and r) and five least common (v, z, j, x, and q) letters of the alphabet.
(Here, most and least common are relative to the 12k Wordle words.) At first, the results
seem a bit counterintuitive &amp;ndash; 51% of the letters in the worst 5% of first guesses are
made of the five most common letters of the alphabet, while only 28% of those in the best 5% are!
What gives!?&lt;/p>
&lt;p>One hypothesis is that feedback on common
letters may help significantly narrow the field, but repeating them doesn&amp;rsquo;t provide much additional
information, so it&amp;rsquo;s worth diversifying. But then, you might ask, why aren&amp;rsquo;t there words in there
with repeated infrequent letters? I imagine that this is probably because there just aren&amp;rsquo;t that many
words to begin with that have multiple qs, js, vs, xs, or zs. The better scorers use common
letters, but avoid repeating them, so the fraction of common letters is smaller in that set.
If we look at the fraction of letters in high and low scorers that come from the five least common letters,
what we see is striking: the high scorers do not contain &lt;em>any&lt;/em> of the five least
frequent letters, while 14% of the low scorers&amp;rsquo; letters come from that set.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>I hope you found this exercise as fun and interesting as I did. I&amp;rsquo;ll also be posting the code
I used soon, so feel free to have a look at it if you&amp;rsquo;re interested
in seeing what the actual implementation looks like.
Even though the insights we arrived at were pretty intuitive, I hope that
you enjoyed putting a little bit of rigor to it. Happy Wordling!&lt;/p>
&lt;p>Edit (2022-01-17): I&amp;rsquo;ve posted the code &lt;a href="https://github.com/gindij/wordle">here&lt;/a>. In response to some feedback I received about the post,
I also changed the word-scoring algorithm to encourage helpful
exploration, rather than only using words that satisfy the constraints
we&amp;rsquo;ve accumulated information about.&lt;/p></description></item><item><title>Solving sudoku as a linear program</title><link>https://www.jgindi.me/posts/2021-04-11-sudoku-lp/</link><pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-04-11-sudoku-lp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>During the last several months, my wife and I went through a Kenken phase. For
those not familiar with Kenken, check out this &lt;a href="https://www.kenkenpuzzle.com">website&lt;/a>.
Kenken is a great example of a class of problems that are more broadly categorized as
constraint satisfaction problems (CSPs). I wrote some code that generates and solves Kenkens,
which I&amp;rsquo;d like to write about when I have more time, but in this post I want to talk about
an interesting, nonstandard way to set up and solve a different, far more popular CSP: sudoku!&lt;/p>
&lt;h2 id="what-is-sudoku">What is sudoku?&lt;/h2>
&lt;p>For those who are not familiar with sudoku puzzles, they are designed as follows.
The initial board is a 9x9, partially filled in grid. The goal is for the solver
to find (the unique) values for the empty squares so that the board satisfies the following
rules:&lt;/p>
&lt;ol>
&lt;li>There can be no repeated values in any row. (Each row must contain each value
from 1 up to 9 exactly once.)&lt;/li>
&lt;li>There can be no repeated values in any column. (Each column must contain each
value from 1 up to 9 exactly once.)&lt;/li>
&lt;li>There can be no repeated values in each of the 9 3x3 groups of cells outlined
in a bolded black line in the figures below.&lt;/li>
&lt;/ol>
&lt;div align='center'>
&lt;img src="https://www.jgindi.me/posts/sudoku-lp/initial.png" alt="drawing" width="200" height="200"/>
&lt;img src="https://www.jgindi.me/posts/sudoku-lp/solved.png" alt="drawing" width="200" height="200"/>
&lt;p>An unfilled board (left) and its solution (right).&lt;/p>
&lt;/div>
&lt;h2 id="modeling-the-problem">Modeling the problem&lt;/h2>
&lt;p>In order to formulate this as an optimization problem we need to come up with a
variable to optimize and constraints that tell the optimization algorithm what kinds
of solutions are valid. In any sudoku problem, there are 81 cells that need to
be filled in (including those that are already populated at the start). Each of those
cells has 9 possible values, so we will index our variable $x$ with 3 indices: the first
will indicate the row, the second will indicate the column, and the third will refer to a
value between 1 and 9. Furthermore, we will require each $x_{ijk}$ to take either the
value 0 or the value 1: $x_{ijk} = 1$ if the cell at position $(i, j)$ on the board
has the value $k$ and 0 otherwise.&lt;/p>
&lt;p>This way of modeling a sudoku board is not the most intuitive one possible, but it
will make formulating our constraints easier, which is the topic we turn to next. (If
the above wasn&amp;rsquo;t clear, give it another read before moving on.)&lt;/p>
&lt;h2 id="constraints">Constraints&lt;/h2>
&lt;p>There are a few types of constraints we need to respect to make sure that our
optimization algorithm comes up with a valid solution to our sudoku problems. We
will discuss each type of constraint in turn.&lt;/p>
&lt;h3 id="respecting-given-values">Respecting given values&lt;/h3>
&lt;p>The first constraint the solution needs to satisfy is that the values that are
already provided must be respected. That is, if the first cell is already
filled with the value 5, the optimizer is not allowed to change that. By the way
we defined $x$ earlier, this would correspond to the constraint $x_{115} = 1$. Similarly,
if the lower right value were set to 1, we would have another constraint corresponding
to this value given by $x_{991} = 1$. We add a constraint like this for every value
that is provided on the initial board.&lt;/p>
&lt;h3 id="each-cell-contains-a-single-value">Each cell contains a single value&lt;/h3>
&lt;p>This is a relatively simple constraint, but an important one. To model this constraint
for cell $(1, 1)$, we would require that $\sum_{k=1}^9 x_{11k} = 1$. Because each
entry of $x$ is either 0 or 1, this constraint says that exactly one of the entries
corresponding to the first cell must be set. We add a constraint like the one I described
for the first cell for each of the 81 cells.&lt;/p>
&lt;h3 id="row-column-and-box-constraints">Row, column, and box constraints&lt;/h3>
&lt;p>We require that each row contain each digit from 1 to 9. To encode that the digits
in row $i$ must be unique, we need to make sure that for each value $k$, we have
$\sum_{j=1}^9 x_{ijk} = 1$ (in the sum, $i$ and $k$ are fixed). There are 81 such
constraints corresponding to the rows; another 81 can be analogously formulated
for the columns, and another 81 model the box constraints.&lt;/p>
&lt;h3 id="consolidating-constraints">Consolidating constraints&lt;/h3>
&lt;p>Once we&amp;rsquo;ve modeled all the constraints, we can stack them into a single matrix
equality constraint $Ax = b$. Although we&amp;rsquo;ve been discussing $x$ as though
it were three dimensional, when we pass the optimization problem to the computer, we
flatten it into one long 729-vector (9 rows $\times$ 9 columns $\times$ 9
candidate values per cell). Each row of $A$ corresponds to a single constraint.
Coefficients corresponding to the variables that are active in that constraint
are set to 1 and the other coefficients in that row are all set to 0. Because
all of our constraints have 1s on their right hand sides, we have $b = \mathbf 1$.&lt;/p>
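&lt;p>To make the consolidation concrete, here is a sketch of how the constraint rows could be assembled (variable names and the 0-indexed flattening convention are illustrative choices of mine):&lt;/p>

```python
import numpy as np

def build_A(givens):
    """Assemble the rows of A for the sudoku problem; b is all ones.

    `givens` maps 0-indexed (row, col) -> value index k in 0..8 for
    pre-filled cells. x is flattened so x[81*i + 9*j + k] = x_{ijk}.
    """
    idx = lambda i, j, k: 81 * i + 9 * j + k
    rows = []
    def add(cells):
        row = np.zeros(729)
        row[list(cells)] = 1.0
        rows.append(row)
    for i in range(9):                      # each cell holds one value
        for j in range(9):
            add(idx(i, j, k) for k in range(9))
    for k in range(9):
        for i in range(9):                  # value k once per row
            add(idx(i, j, k) for j in range(9))
        for j in range(9):                  # value k once per column
            add(idx(i, j, k) for i in range(9))
        for b in range(9):                  # value k once per 3x3 box
            add(idx(3 * (b // 3) + d // 3, 3 * (b % 3) + d % 3, k)
                for d in range(9))
    for (i, j), k in givens.items():        # respect the given values
        add([idx(i, j, k)])
    return np.array(rows)
```

With no givens this produces the 324 structural constraints (81 cell + 81 row + 81 column + 81 box), each touching exactly 9 variables.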
&lt;h2 id="formulating-and-solving-the-problem">Formulating and solving the problem&lt;/h2>
&lt;p>We can now formulate our optimization problem:
&lt;/p>
$$
\begin{align*}
&amp;\underset{x \in \{0, 1\}^{729}}{\text{minimize}} ~~ 0\\
&amp;\text{subject to} ~~ Ax = \mathbf 1.
\end{align*}
$$&lt;p>
Because sudokus have unique solutions, we just need to find the $x$ that satisfies
our constraints &amp;ndash; there will only be one such $x$! Because we just need to find that
$x$, the objective value doesn&amp;rsquo;t have to help us discriminate between competing feasible
points for this problem, so we can safely use 0 as our objective function. (This type
of problem is known as a feasibility problem.)*&lt;/p>
&lt;p>With the optimization problem in hand, the following short piece of code
uses &lt;a href="https://www.cvxpy.org">CVXPY&lt;/a> to solve any sudoku puzzle very quickly:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> cvxpy &lt;span style="color:#66d9ef">as&lt;/span> cp
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> numpy &lt;span style="color:#66d9ef">as&lt;/span> np
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ... &amp;lt;code that formulates the constraints&amp;gt; ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>x &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Variable(N, boolean&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>array(list(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>b &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(len(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>objective &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Minimize(&lt;span style="color:#ae81ff">0&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>constraints &lt;span style="color:#f92672">=&lt;/span> [A &lt;span style="color:#f92672">@&lt;/span> x &lt;span style="color:#f92672">==&lt;/span> b]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Problem(objective, constraints)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem&lt;span style="color:#f92672">.&lt;/span>solve(solver&lt;span style="color:#f92672">=&lt;/span>cp&lt;span style="color:#f92672">.&lt;/span>ECOS_BB)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The full code (~100 lines) can be found &lt;a href="https://github.com/gindij/SudokuLP">here&lt;/a>.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Sudoku is typically framed and solved as a CSP with
algorithms that involve some guessing and checking. I thought this was an interesting
application of optimization to a problem that maybe doesn&amp;rsquo;t immediately lend itself
to such a formulation. Hope you enjoyed!&lt;/p>
&lt;p>*As a note for a slightly more technically inclined reader, a friend of mine
pointed out that the algorithm that the solver uses, called &lt;a href="https://web.stanford.edu/class/ee364b/lectures/bb_slides.pdf">branch and bound&lt;/a>,
ends up degenerating into an exponential tree search, akin to just trying out
all possibilities with no way of discriminating between &amp;ldquo;better&amp;rdquo; and &amp;ldquo;worse&amp;rdquo;
points. To remedy this (at least in part), we can use the objective function
$x^T \mathbf 1$, which counts the number of ones in any solution. By using
this objective instead of $0$, we give the algorithm a way to &amp;ldquo;prune&amp;rdquo; the tree it
is searching, which may lead to performance benefits. The code implementing this
approach would only be slightly different:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> cvxpy &lt;span style="color:#66d9ef">as&lt;/span> cp
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> numpy &lt;span style="color:#66d9ef">as&lt;/span> np
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ... &amp;lt;code that formulates the constraints&amp;gt; ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>x &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Variable(N, boolean&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>array(list(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>b &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(len(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>objective &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Minimize(x&lt;span style="color:#f92672">.&lt;/span>T &lt;span style="color:#f92672">@&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(N)) &lt;span style="color:#75715e"># &amp;lt;-- new objective function&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>constraints &lt;span style="color:#f92672">=&lt;/span> [A &lt;span style="color:#f92672">@&lt;/span> x &lt;span style="color:#f92672">==&lt;/span> b]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Problem(objective, constraints)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem&lt;span style="color:#f92672">.&lt;/span>solve(solver&lt;span style="color:#f92672">=&lt;/span>cp&lt;span style="color:#f92672">.&lt;/span>ECOS_BB)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>Distributed hash tables</title><link>https://www.jgindi.me/posts/2018-01-10-dht/</link><pubDate>Wed, 10 Jan 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-01-10-dht/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>While watching some lectures on distributed/cloud computing, I came across the distributed hash table, which is a way to
implement a hash table (if you’re unfamiliar, see &lt;a href="https://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=13&amp;amp;cad=rja&amp;amp;uact=8&amp;amp;ved=0ahUKEwjm4Yuk4L_YAhVH3WMKHWcECpIQFghgMAw&amp;amp;url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHash_table&amp;amp;usg=AOvVaw35HhLIk5wS4LY8oElN6VI7">here&lt;/a>) distributed across a bunch of different servers
connected by a network. The goal is to implement the table so that (1) finding a key’s location is “easy” (read: efficient)
and (2) the user does not have to worry about the underlying network topology. In other words, to the client, using the
distributed table should feel like using an in-memory one on a single machine.
I chose to write about this because it’s a place where theory translates gracefully into practice. It turns out
that by using a mathematically fun and interesting hashing technique and a clever data structure in concert, we can achieve
a pretty efficient distributed hash table.&lt;/p>
&lt;h1 id="hash-tables">Hash tables&lt;/h1>
&lt;p>For those unfamiliar with the basic idea of hashing and hash tables, imagine that you have a bunch of objects that you need
to store. You have a bunch of cabinets in which to store said objects. You’d ideally like to put the items into the
cabinets in such a way that when you want to retrieve a particular item from storage, you can do so quickly. To help you
accomplish this, I give you a crystal ball which, based on some combination of the object’s characteristics, determines
which cabinet to put it in/retrieve it from. For example, if the object you’d like to store is red, spherical and
manufactured by Hasbro, your crystal ball might assign it cabinet #1 (note: a crucial property of this crystal ball is
that if it says cabinet #1 for a particular item, it will always say cabinet #1 for that item; it won’t change its
opinion).&lt;/p>
&lt;p>Then, if at some later point you want to retrieve the red ball manufactured by Hasbro, you would consult the crystal ball I
gave you to figure out where you put it. For this to work well and grant the efficiency you seek, you wouldn’t want too
many objects landing in the same cabinet; if the crystal ball assigned several items to cabinet #1, for instance, then
when your crystal ball reveals that the red Hasbro ball is in cabinet #1, you would have to look at all of the items that
landed there to find it. In the worst possible case, if every object somehow landed in cabinet #1, you would potentially
have to rifle through every item you deposited. As the number of items you need to store gets larger and larger, the
inconvenience of items in the same cabinet could range anywhere from mildly annoying to intractable.&lt;/p>
&lt;p>Before we continue, I want to attach some technical terminology to the basic aspects of hashing just discussed. In the
above example, the &lt;strong>hash table&lt;/strong> itself is the set of cabinets you have for storage. We will denote the number of cabinets in our
table by $m$ and the number of items we have to store by $n$. The items are called &lt;strong>key&lt;/strong>s. The
crystal ball you consulted in order to know where to put/find each item is known as a &lt;strong>hash function&lt;/strong>. The easiest way to
think of a hash function is as a mathematical function that takes some object in and uses its properties to
deterministically output an integer. One of the properties we care about when we choose hash functions to suit our
applications is whether they assign lots of different keys to the same buckets. If keys $k_1$ and $k_2$ are mapped to the
same bucket by the hash function, we say that the function &lt;strong>collides&lt;/strong> on these keys.
(As a note before we get into the specifics of the hashing technique we’ll use to build our distributed table, any
algorithm that uses hashing usually relies on an assumption about hash functions called simple uniform hashing (see &lt;a href="https://en.wikipedia.org/wiki/SUHA_(computer_science)">here&lt;/a>), which basically guarantees that the probability that a key lands in any particular bucket
is $1/\#\text{buckets}$. More technically, the assumption asserts that the hash function distributes keys uniformly.
We can actually show mathematically that if the hash function distributes keys uniformly, we expect any particular cabinet
to have $n/m$ items in it, implying that in the worst case, we expect that the runtime of a lookup/insert/update/delete is
$O(n/m)$. Provided that we choose $m$ close to $n$, we effectively expect that the aforementioned operations take constant
time (read: are extremely fast).)&lt;/p>
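&lt;p>To make the cabinet analogy concrete, here is a minimal Python sketch of a chained hash table. The bucket count, the use of the built-in hash function, and the list-per-bucket collision handling are all toy choices for illustration:&lt;/p>

```python
# Toy chained hash table: m "cabinets", each holding a list of (key, value) pairs.
class ToyHashTable:
    def __init__(self, m=8):
        self.m = m                      # number of buckets ("cabinets")
        self.buckets = [[] for _ in range(m)]

    def _bucket(self, key):
        # The "crystal ball": deterministically map a key to a bucket index.
        return hash(key) % self.m

    def put(self, key, value):
        bucket = self.buckets[self._bucket(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))     # a collision just appends to the chain

    def get(self, key):
        # Worst case we scan the whole chain; under simple uniform hashing
        # the expected chain length is n/m.
        for k, v in self.buckets[self._bucket(key)]:
            if k == key:
                return v
        return None
```

&lt;p>If we choose $m$ close to $n$, the expected chain length $n/m$ stays constant, which is where the expected constant-time operations come from.&lt;/p>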
&lt;h2 id="complexities-of-a-distributed-network">Complexities of a distributed network&lt;/h2>
&lt;p>Now let’s say I have a bunch of files that I want to store across a collection of machines. For this particular example,
say I have 3 servers available labeled 0, 1, and 2. The simplest way to store the files across my servers is to take the
hash of the file I want to store (an integer), compute its remainder when you divide by 3 (note that this value can only be
0, 1 or 2) and stash the file in the server (“cabinet”) with the corresponding label. To find out which server a file $f$
has been stored on, simply compute $\text{hash}(f) \bmod 3$ (this means the remainder that $\text{hash}(f)$ leaves when
divided by 3) and ask that server for the file.&lt;/p>
&lt;p>This technique seems all well and good, until you consider one of the realities of distributed systems: that machines
(machine will henceforth be interchangeable with “node” or “server”) join and leave the system all the time. Can you see
the challenge this presents? What happens if I store a file on server 1, a new server joins, and I then ask my hash function where the file is? See
if you can spot the problem before continuing.&lt;/p>
&lt;p>The problem is if a new node joins or leaves the system, the hash function might give me the wrong answer. Let’s say that
for some file $f$, $\text{hash}(f) = 40$. If I stored $f$ when my system only had 3 nodes, my algorithm would have placed
$f$ on node 1, because $40 / 3 = 13$ remainder $1$. Now let’s say I add a 4th node and subsequently ask where $f$ is. Well
$\text{hash}(f) = 40$, and $40 / 4 = 10$ remainder $0$! But from earlier, we saw that $f$ is actually sitting on node 1!
How might we solve this problem? Think about this before reading the next paragraph.&lt;/p>
&lt;p>We can solve this problem by rehashing every file given the new number of servers whenever the system registers a new
machine (or some machine leaves). This solution, however, is very computationally cumbersome. Each time a new node
registers (or leaves), the system has to stop serving clients, compute the new hash of every single file in the system and
move the files to their new homes. (This can be somewhat mitigated by storing metadata about files instead of the files
themselves on the servers in the system, but given enough files, even moving around all the metadata would be pretty
slow.)&lt;/p>
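&lt;p>This failure mode is easy to see in a few lines of Python (the hash value 40 is the made-up value from the example above):&lt;/p>

```python
# With hash(f) = 40, modular placement breaks when the server count changes.
file_hash = 40

assert file_hash % 3 == 1   # with 3 servers, f lives on server 1
assert file_hash % 4 == 0   # after a 4th server joins, we look on server 0!

# Worse, almost every key moves: count how many of the first 1000 hash
# values map to a different server after going from 3 servers to 4.
moved = sum(1 for h in range(1000) if h % 3 != h % 4)
```

&lt;p>Here roughly three quarters of all keys change servers when going from 3 servers to 4, which is why naive rehashing is so expensive.&lt;/p>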
&lt;p>Can we avoid this pitfall somehow? The answer, as you might’ve guessed, is yes, and the technique is called consistent
hashing.&lt;/p>
&lt;h2 id="consistent-hashing">Consistent hashing&lt;/h2>
&lt;p>In consistent hashing, we pick some integer $s$ and imagine a logical ring with $2^s$ discrete slots labeled $0, 1,
2,\dots, 2^s - 1$. For example, if $s = 3$, then our logical ring would have slots labeled $0, 1, \dots, 7$ ($= 2^3 - 1$).
Ideally we choose an $s$ big enough that $2^s$ is a lot bigger than the number of nodes we expect.&lt;/p>
&lt;p>(For the remainder of these steps when I talk about locations on the ring, I’m not talking about an actual ring. If you’re
familiar with modular arithmetic, all we’re doing is wrapping hashes around a modulus. If you’re thinking of it as an
actual ring, that’ll work too.)&lt;/p>
&lt;p>First we place the servers on the ring by hashing their (ip address, port) pairs and taking remainders modulo $2^s$. If
for some server $A$, $\text{hash}(A)$ gives a remainder of $a$ modulo $2^s$, it would logically occupy slot $a$ on the
ring.&lt;/p>
&lt;p>Next, in order to make accesses fast, we introduce a clever data structure called a finger table at each node. Each finger table maintains $s$ pointers to other nodes on the ring as follows. For $i$ between $0$ and $s - 1$ (inclusive), the $i$th finger table entry for a node at ring position $N$ is the first node whose position is greater than or equal to $(N + 2^i) \pmod {2^s}$. Let’s look at a quick example.&lt;/p>
&lt;p>Suppose that $s = 5$ and we had nodes at positions 3, 7, 16, and 27 on our logical ring. To compute 3’s finger table, we compute
&lt;/p>
$$
\begin{align*}
3 + 2^0 &amp;= 3 + 1 &amp;&amp;= 4 &amp;&amp;&amp;\pmod{32} \\\\
3 + 2^1 &amp;= 3 + 2 &amp;&amp;= 5 &amp;&amp;&amp;\pmod{32} \\\\
3 + 2^2 &amp;= 3 + 4 &amp;&amp;= 7 &amp;&amp;&amp;\pmod {32} \\\\
3 + 2^3 &amp;= 3 + 8 &amp;&amp;= 11 &amp;&amp;&amp;\pmod {32} \\\\
3 + 2^4 &amp;= 3 + 16 &amp;&amp;= 19 &amp;&amp;&amp;\pmod {32}
\end{align*}
$$&lt;p>
The first node larger than or equal to the first entry in the finger table is 7, so the first finger pointer is to 7. The same goes for the second and third entries. The fourth entry would point to 16 and the final entry would point to 27. So for node 3, the finger table would look like $(7, 7, 7, 16, 27)$.&lt;/p>
&lt;p>Each node also maintains pointers to the nodes to its right and left along the ring. The &lt;strong>successor&lt;/strong> of 3 is 7 and the &lt;strong>predecessor&lt;/strong> of 3 is 27. In a similar vein, we can more generally refer to the predecessor and successor of any slot on the ring.&lt;/p>
&lt;p>To perform a lookup, we make use of both the logical ring we introduced and the per-node finger tables. Let’s say that machine 7 wants to make a query for a key that hashes to 2 (and would thus reside on node 3 — the first node to its right on the ring). It would look in its finger table for the largest node that is to the left of the key’s position — this would be node 27 — and then route the query there. Node 27 would then route the query to its successor (because it is greater than the key we’re looking for), which would then be able to send the data associated with the requested key back to 7.&lt;/p>
&lt;p>(In this example the ring is small and we don’t require that many hops. This won’t always be the case. For instance, if $s = 10$ and a query initiated at 3 for a key that hashed to 999, the largest node to the left of the key that 3 knows about would be at most $3 + 2^9 = 3 + 512 = 515 \pmod{1024} (=2^{10})$, but there might be a node at 997 that is much closer that 3 doesn’t have in its finger table.)&lt;/p>
&lt;h2 id="why-do-this">Why do this?&lt;/h2>
&lt;p>The last things I want to briefly discuss are the benefits this whole complicated system confers. The first thing we get is relatively fast lookup time. We can actually prove that (with high probability) lookups are logarithmic. In essence, what this means is that on every hop that we take using the finger tables, we traverse at least half of the remaining distance between the current node and the node containing the desired key (the key’s successor). A sketch of the proof goes like this:&lt;/p>
&lt;p>Suppose we are at node $n$ and the node immediately to the left of a key $k$ is some node $p$. Let $i$ be the index for which the (clockwise) distance from $n$ to $p$ is between $2^i$ and $2^{i+1}$. The $i$th entry of $n$’s finger table points to the first node at or past $n + 2^i$; since $p$ itself sits in that window, this finger is at or to the left of $p$, and hence to the left of $k$. So the node $m$ that $n$ actually routes to (the largest node it knows about to the left of $k$) is also between $2^i$ and $2^{i+1}$ away from $n$. But then $m$ and $p$ are at most $2^i$ away from one another (because $2^{i+1} - 2^i = 2^i$), while the hop from $n$ to $m$ covered at least $2^i$ slots. The remaining distance is thus at most the distance already traveled, which means each hop at least halves the distance to $p$, which is what we wanted.&lt;/p>
&lt;p>Using this halving result, we can note further that after $t$ hops, the distance remaining to our destination is at most $(\text{total number of slots}) \cdot (1/2)^t = 2^s/2^t$ (because we halved the distance between where we started and our destination $t$ times). If we make $\log n$ hops (where $n$ is the number of nodes in the system), we have at most $2^s/2^{\log n} = 2^s/n$ slots left to search. Because we assumed that the hash function we chose distributed nodes uniformly about the ring, we only expect there to be 1 node in this window; that there are $\log n$ or more such nodes has a very small probability. This means that after $\log n$ finger table hops, we will (with high probability) have to make at most $\log n$ more hops to get to our destination. If you’re familiar with big-Oh notation, the total runtime for a lookup in the worst case is, with high probability, $O(\log n + \log n) = O(2\log n) = O(\log n)$, as we wanted.&lt;/p>
&lt;p>As a final note, insertions, deletions, and updates are all bottlenecked by the lookup operation. Once we know we can expect logarithmic lookups, we automatically know that those other operations are logarithmic too.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>For those not familiar with what logarithmic runtime means, let’s just say it’s pretty fast. The other cool thing we get out of this logical ring, which is potentially even more important when you’re talking about systems that frequently gain and lose machines, is that very few keys have to move when a node joins or leaves. For example, if a node leaves, only the keys it was storing have to move (to its successor), and only the finger tables that used to point to the departed node need updating. (Can you figure out what you would have to do if a node joined?)&lt;/p>
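&lt;p>We can check the claim about departures with a few lines of Python: place keys on their successors, remove a node, and see which keys change owners. (The node positions are again the ones from the earlier example; this is an illustration, not a real implementation.)&lt;/p>

```python
# Toy check that removing a node only relocates the keys it stored.
# Ring of 32 slots with nodes at 3, 7, 16, and 27.
RING = 32

def owner(slot, nodes):
    """A key at `slot` lives on its successor: the first node at or after it."""
    for n in sorted(nodes):
        if n >= slot:
            return n
    return min(nodes)  # wrap around

before = {k: owner(k, [3, 7, 16, 27]) for k in range(RING)}
after = {k: owner(k, [3, 7, 27]) for k in range(RING)}  # node 16 leaves

moved = [k for k in range(RING) if before[k] != after[k]]
```

&lt;p>Only the slots that node 16 owned (8 through 16) change hands, and they all move to its successor, 27.&lt;/p>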
&lt;p>I think this is a great example of the way that algorithms and mathematical reasoning are a &lt;em>huge&lt;/em> part of the push toward more scalable system architectures. If they aren’t using this exact algorithm, engineers at Google, Facebook, Amazon, Netflix and others are using similar ideas to push the boundaries of what it means for distributed systems to be available, scalable, efficient and maintainable.&lt;/p></description></item><item><title>TSP is inapproximable</title><link>https://www.jgindi.me/posts/2017-04-26-inapprox-tsp/</link><pubDate>Wed, 26 Apr 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-04-26-inapprox-tsp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As an introductory computer science student, I was enamored by the Traveling Salesman
Problem (if you’ve never heard of it, see &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">Travelling salesman problem -
Wikipedia&lt;/a>). It is very
easy to state and has very simple and important practical applications, yet somehow
my professors were telling me that we don’t, at present, have an efficient
algorithm to solve it. There are heuristics, yes, but you can (easily) design
pathological inputs on which whatever heuristic you come up with performs horribly.&lt;/p>
&lt;h2 id="approximation-algorithm">Approximation algorithm&lt;/h2>
&lt;p>It’s often the case that when faced with $NP$-hard optimization problems (such as
TSP), instead of heuristics, we try to design &lt;em>approximation algorithms&lt;/em>:
algorithms that provably produce outputs within some acceptable factor of the
optimal answer. To be a bit more precise about it:&lt;/p>
&lt;ol>
&lt;li>An algorithm $A$ is an $\alpha$-approximation for a maximization problem $P$ if on every instance $I$ of $P$, $A$ produces a solution of size at least $OPT/\alpha$ (where $OPT$ is the size of the optimal solution for the instance $I$).&lt;/li>
&lt;li>$A$ is an $\alpha$-approximation for a minimization problem if $A$ produces a solution of size at most $\alpha OPT$.&lt;/li>
&lt;/ol>
&lt;p>Note that $\alpha \geq 1$.&lt;/p>
&lt;h2 id="inapproximability">Inapproximability&lt;/h2>
&lt;p>There are myriad approximation algorithms out there for a bunch of different
problems; I may make them the subject of a future blog post… but in the remainder of
this post, I want to actually show that for any $\alpha \geq 1$, it is impossible
to come up with an $\alpha$-approximation for TSP unless $P = NP$ (in which
case we wouldn’t need approximations, we would have a polynomial time algorithm to
solve the problem exactly!). In other words, unless $P = NP$, not only do we not
have an efficient algorithm for TSP, we can’t even &lt;em>approximate&lt;/em> it efficiently!
The proof of this fact, which I find surprising and kind of amazing given the
amount of effort and brainpower that have been thrown at this problem over the
years, is pretty simple, so I thought it would be fun to go through it here.&lt;/p>
&lt;p>(Before we do, if you aren’t familiar with it, read &lt;a href="https://en.wikipedia.org/wiki/Hamiltonian_path_problem">Hamiltonian path problem -
Wikipedia&lt;/a>)&lt;/p>
&lt;h2 id="the-proof">The proof&lt;/h2>
&lt;p>We&amp;rsquo;ll start with a sketch of the proof, then move on to the actual proof.&lt;/p>
&lt;p>&lt;strong>Proof idea:&lt;/strong> The proof is by contradiction. We will assume we have a polynomial
time approximation for TSP and use it as a black box to solve a known $NP$-complete
problem, HAM-CYCLE, in polynomial time. Assuming $P \neq NP$, this is impossible, so
such an approximation cannot exist.&lt;/p>
&lt;p>&lt;strong>Proof:&lt;/strong> Suppose that $A$ is a polynomial time approximation for TSP. Use $A$ to
construct the following algorithm $A’$ which, on some input graph $G = (V,E)$,
computes a solution (a YES or NO) to HAM-CYCLE:&lt;/p>
&lt;ol>
&lt;li>Create the graph $G’ = (V, E’)$ by completing $G$ (i.e. by adding edges to $E$
until there are edges between every pair of vertices in $V$).&lt;/li>
&lt;li>Give the edges in $E$ weights of 0, and give those in $E’ - E$ weights of 1. (Note
that $G’$ is an instance of TSP.)&lt;/li>
&lt;li>Use $A$ to approximate the least cost tour $T$ in $G’$.&lt;/li>
&lt;li>Output NO if $T$ has weight $> 0$ and YES otherwise.&lt;/li>
&lt;/ol>
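&lt;p>Here is a Python sketch of the reduction. Since the approximation algorithm $A$ is hypothetical, a brute-force exact TSP solver stands in for it below; the point is only to show how steps 1 through 4 turn a TSP oracle into a HAM-CYCLE decider. (Any $\alpha$-approximation would also return a weight-0 tour whenever one exists, since $\alpha \cdot 0 = 0$.)&lt;/p>

```python
from itertools import permutations

# Decide HAM-CYCLE on G = (V, E) by building the 0/1-weighted complete
# graph G' and asking a TSP solver for a cheap tour. A brute-force exact
# solver stands in for the hypothetical approximation algorithm A.
def has_hamiltonian_cycle(n, edges):
    edge_set = {frozenset(e) for e in edges}

    def tour_weight(order):
        # weight 0 for edges of G, weight 1 for the edges added to complete G
        total = 0
        for i in range(n):
            e = frozenset((order[i], order[(i + 1) % n]))
            if e not in edge_set:
                total += 1
        return total

    # Fix vertex 0 as the tour's start and try every ordering of the rest.
    best = min(tour_weight((0,) + p) for p in permutations(range(1, n)))
    return best == 0    # weight-0 tour exists iff G has a Hamiltonian cycle
```

&lt;p>On the 4-cycle the decider answers YES; on a star (which has no Hamiltonian cycle) it answers NO.&lt;/p>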
&lt;p>We just need to argue that $A$ outputs a tour of weight 0 if and only if there is a
Hamiltonian cycle in $G$. To see this, note that by definition, $A$ finds a tour
whose combined weight is within some factor of that of the optimal tour on $G’$. If the
optimal tour on $G’$ can indeed be made up only of edges from $G$ (that is, if $G$ has a
Hamiltonian cycle), it has weight 0,
in which case $A$ would have to return an answer within some factor of 0… namely 0.
If $A$ finds a tour with weight $> 0$, then there must not be any
tour using only edges of $G$, in which case we can safely output that there is no
Hamiltonian cycle in $G$. &lt;strong>QED.&lt;/strong>&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>And with that, just a few short words, we were, assuming $P \neq NP$, able to rule
out all possible approximation schemes that anyone could ever think of! Imagine all
the time and effort we’ve saved!&lt;/p></description></item><item><title>Randomized algorithm for file comparison</title><link>https://www.jgindi.me/posts/2017-01-25-file-comp/</link><pubDate>Sat, 25 Feb 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-01-25-file-comp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The next problem we use randomization to solve might seem a bit closer to one we
might face in reality. It goes like this. Alice and Bob each have copies of the same
file that they need to keep synchronized (call Alice&amp;rsquo;s file $A$ and Bob&amp;rsquo;s $B$). Over
time, however, it&amp;rsquo;s possible that Alice&amp;rsquo;s and Bob&amp;rsquo;s files get out of sync. Our
task today is to come up with a protocol by which Alice and Bob can check that $A =
B$ without one having to send the other his/her entire file.&lt;/p>
&lt;h2 id="the-algorithm">The algorithm&lt;/h2>
&lt;p>In order to throw math at the problem the way we want to, we need to make it a bit
more abstract. To this end, we stipulate that $A$ and $B$ are represented by $n$-bit
strings. The comparison protocol works as follows:&lt;/p>
&lt;ul>
&lt;li>Alice picks a prime $p$ uniformly at random from $\{2..n^2\lg n\}$ (fear not; we will explain the choice
of this range soon). Because $A$ is an $n$-bit string, we can look at it as an $n$-
bit binary integer. Alice computes $A \pmod p$ and sends Bob the prime $p$ and $A
\pmod p$. (Note: computing $A \pmod p$ means the remainder of $A$ left over when we
divide it by $p$.)&lt;/li>
&lt;li>Bob computes $B \pmod p$. If $A = B \pmod p$ ($A$ and $B$ leave the same $p$-
remainder), Bob outputs that the files are the same. Otherwise, he should conclude
that the files he and Alice have are out of sync.&lt;/li>
&lt;/ul>
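&lt;p>The protocol can be sketched in Python. The trial-division primality test and the rejection sampling of a random prime below are naive stand-ins chosen for clarity, not what one would use for large $n$:&lt;/p>

```python
import math
import random

# Fingerprinting sketch. Alice's and Bob's "files" are n-bit strings,
# viewed as n-bit binary integers.
def is_prime(q):
    if q == 2:
        return True
    if q % 2 == 0 or q == 1:
        return False
    d = 3
    while q >= d * d:          # trial division by odd numbers up to sqrt(q)
        if q % d == 0:
            return False
        d += 2
    return True

def random_prime(bound, rng):
    """Pick a prime uniformly at random from {2, ..., bound}."""
    while True:
        q = rng.randint(2, bound)
        if is_prime(q):
            return q

def files_probably_equal(a_bits, b_bits, rng):
    """Alice sends (p, A mod p); Bob compares against B mod p."""
    n = len(a_bits)                              # both strings have n bits
    bound = max(2, int(n * n * math.log2(n)))    # the range {2..n^2 lg n}
    a, b = int(a_bits, 2), int(b_bits, 2)
    p = random_prime(bound, rng)                 # Alice's random prime
    return a % p == b % p                        # Bob's check
```

&lt;p>Note that all Alice ever transmits is the pair $(p, A \bmod p)$, which is the source of the $O(\lg n)$ communication cost discussed below.&lt;/p>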
&lt;p>The question we need to answer is now: How confident can we be that the files are
indeed the same when Bob says &amp;ldquo;same&amp;rdquo;? In other words, we want to know what the
probability is that the algorithm errs.&lt;/p>
&lt;h2 id="analysis">Analysis&lt;/h2>
&lt;p>There are two cases to analyze here. If the files were the same to begin with, the
algorithm will never fail. Mathematically, if $A = B$, then $A = B \pmod p$ so Bob
will always output &amp;ldquo;same&amp;rdquo; in this case.&lt;/p>
&lt;p>The interesting case to analyze is the case in which $A \neq B$. In this case, we are
interested in
&lt;/p>
$$\Pr[A = B \pmod p ~|~ A \neq B].$$&lt;p>
In English, we are interested in the probability that Bob outputs &amp;ldquo;same&amp;rdquo; even when
his and Alice&amp;rsquo;s files are not in sync. To analyze this probability, we need to
entertain a quick tangent.&lt;/p>
&lt;p>In particular, we need to motivate our preference that $p \in \{2..n^2\lg n\}$.
There is a neat theorem in number theory called the Prime Number Theorem, which states
that the range $\{2..N\}$ contains about $\frac{N}{\ln N}$ primes (within constant factors, $\frac{N}{\lg N}$). We can
show, using the theorem and some algebra that we need not bother ourselves with here,
that there are about $n^2$ primes in the range from which we drew $p$ (start by
substituting $N = n^2 \lg n$ in the theorem).&lt;/p>
&lt;p>Keep this fact in mind. Let $C = |A - B|$. We can reformulate the probability of the
protocol failing as $\Pr[C = 0 \pmod p ~|~ C \neq 0]$.&lt;/p>
&lt;p>Next, we note that because $A$ and $B$ both have $n$ bits, $C$ too has at most $n$
bits. This means that $1 \leq C \leq 2^n$. Because every prime is at least 2, a nice feature of
$C$ that we can observe is that $C$ has at most $n$ prime divisors (counted with multiplicity). This is the key
fact. Because we had $n^2$ primes from which to choose $p$ and there are (at most)
$n$ bad choices among them (a choice is bad if $p$ divides $C = A - B$, i.e., if $A$ and $B$
leave the same remainder modulo $p$), the probability that $p$ is a &amp;ldquo;bad prime&amp;rdquo; is at most
$\frac{n}{n^2} = \frac{1}{n}$.&lt;/p>
&lt;p>The probability of success is thus at least $1 - 1/n$, which is great because,
intuitively, it means that as our strings get larger, the odds of the algorithm
failing get very, very small.&lt;/p>
&lt;h2 id="space-complexity">Space complexity&lt;/h2>
&lt;p>The last minor thing we note is that the number of bits required for this protocol is
only the number of bits required to represent $p$ and $A \pmod p$ (what Alice sends
to Bob). The $A \pmod p$ is smaller than $p$, so we can write an upper bound on the
total number of bits required as $2\lg p$ bits. Because $p$ is at most $n^2 \lg n$,
we require $2 \lg(n^2\lg n) = \lg (n^4(\lg n)^2) = 4 \lg n + 2\lg(\lg n) = O(\lg n)$ bits to, with high probability, successfully compare the files. Our goal was
to share a sublinear number of bits relative to the size of the file, so the
protocol above achieves the desired aim.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>You&amp;rsquo;ll note that if you go back and look at the matrix multiplication post I put up a
little while ago, the analysis here and there are very similar. In each case, we had
a problem in which we needed to compare two objects without comparing the entire
objects to one another. In this case, the objects were strings; in the matrix case,
the objects were matrices. In both cases, we devised schemes wherein we mapped the
larger objects to smaller ones that were much easier to compare; although the
representations of the smaller objects are lossy, we choose our mapping carefully so
that the information we sacrifice only introduces a small probability of failure. The
technique described above is called fingerprinting, and it is a very powerful tool
used in the study and design of randomized algorithms.&lt;/p></description></item><item><title>Randomized matrix multiplication checking</title><link>https://www.jgindi.me/posts/2017-02-25-check-matmul/</link><pubDate>Sat, 25 Feb 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-02-25-check-matmul/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Warning: This post is pretty technical. It details a part of a lecture from a class
I&amp;rsquo;m in this semester. It requires some mathematical maturity. I&amp;rsquo;ll try my best to make
it as accessible as possible, but how effective I&amp;rsquo;ll be at that remains a mystery&amp;hellip;
Here we go!&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>The simplest algorithm for matrix multiplication takes $O(n^3)$ time. As of right now, the
fastest algorithm we have for matrix multiplication runs in $O(n^{2.373})$ time. This
algorithm is very complicated and I won&amp;rsquo;t go into the details here (mostly because I don&amp;rsquo;t
know them), but should we devise an algorithm that is even faster than that, we might
want an efficient algorithm to check that it computes the correct result. A little
bit of thought shows us that the best conceivable complexity for matrix
multiplication is $O(n^2)$ &amp;mdash; that is the amount of time it would take us to write
out the result even if no additional computation was necessary. This post details a
randomized algorithm that checks the correctness of matrix multiplication in $O(n^2)$
time. Now, to the formal problem statement&amp;hellip;&lt;/p>
&lt;p>Suppose we have three $n \times n$ matrices $A, B$, and $C$, and we want to know
whether $AB = C$.&lt;/p>
&lt;h2 id="first-approach">First approach&lt;/h2>
&lt;p>A first approach is to choose a random entry $C_{ij}$ and check that it was computed
correctly by checking that it equals the dot product of row $i$ of $A$ and column $j$
of $B$. Each such check takes linear time, so we can do up to $n$ of them and still
stay at our desired runtime of $O(n^2)$. This seems clever enough&amp;hellip; I mean, $n$
entries is pretty good, right? If all of those entries were computed correctly, we
can be reasonably sure that $AB = C$&amp;hellip; right?&lt;/p>
&lt;h2 id="trying-again">Trying again&lt;/h2>
&lt;p>Nope! The problem with this approach can be better understood by asking the following
question: If there is some entry of $C$ that was computed incorrectly, what are the
odds that the above algorithm would catch it? Well, if there are $n^2$ entries in $C$
and we choose $n$ of them to check, provided that we pick them uniformly at random,
we have a $\frac{n}{n^2} = \frac{1}{n}$ chance of catching the entry that was
computed incorrectly. Because asymptotically $n^2$ runs away from $n$, as our
matrices get bigger, our odds of catching a mistake shrink. That isn&amp;rsquo;t good! How
else might we go about this?&amp;hellip;&lt;/p>
&lt;p>Consider the following algorithm:&lt;/p>
&lt;ol>
&lt;li>Generate an $n$-bit vector $r$ where each component of the vector is selected
uniformly at random from $\{0,1\}$.&lt;/li>
&lt;li>Compute $ABr$ and $Cr$.&lt;/li>
&lt;li>If $ABr = Cr$, return true, else return false.&lt;/li>
&lt;/ol>
&lt;p>Note that step 1 takes $O(n)$ time. Because $ABr = A(Br)$ and $Br$ is just a vector
of length $n$, step 2 takes $O(n^2 + n^2) = O(n^2)$ time. The last step takes linear
time, so, in total, the above approach takes $O(n^2)$ time &amp;mdash; just what we need. But
how does it compare to the alternative at finding mistakes?&lt;/p>
&lt;h2 id="does-it-work">Does it work?&lt;/h2>
&lt;p>There are two cases to consider here:&lt;/p>
&lt;ol>
&lt;li>$AB = C$&lt;/li>
&lt;li>$AB \neq C$&lt;/li>
&lt;/ol>
&lt;p>In the first case, there aren&amp;rsquo;t any mistakes to catch, so our analysis need only
consider the second case. Define $D = C - AB$. What we are formally interested in is
the probability that $ABr = Cr \iff (C - AB)r = Dr = 0$ when $AB \neq C \iff C - AB =
D \neq 0$. In other words, we are interested in $\Pr[Dr = 0 ~|~ D \neq 0]$. To
compute this probability, suppose there was indeed an entry of $C$ that was computed
incorrectly. Without loss of generality, assume that $D_{11} \neq 0$. Note that this
means that $(AB)_{11}$ and $C_{11}$ are different, so $D_{11}$ is of interest to us &amp;mdash;
it is the elusive mistake. To better understand what&amp;rsquo;s going on here, consider the
following:
&lt;/p>
$$\begin{bmatrix}D_{11} &amp; \dots &amp; D_{1n}\\\vdots &amp; \ddots &amp;\\&amp;
&amp;\end{bmatrix}\begin{bmatrix}r_1\\ r_2\\ \vdots\end{bmatrix}=\begin{bmatrix}D_{11}r_1 + \dots + D_{1n}r_n\\ \vdots\end{bmatrix}$$&lt;p>
(computation of $Dr$).&lt;/p>
&lt;p>The only way the algorithm can be fooled is if, somehow, $D_{11}r_1 + \dots +
D_{1n}r_n = 0$. That is, if the first entry of $Dr$ is $0$, we find ourselves in a
potential case wherein $Dr = 0$ even though $D \neq 0$ &amp;mdash; in English, we find
ourselves in a case, when $ABr = Cr$ even though $AB \neq C$. What is the probability
of this happening? Some thought suggests that we are looking for the probability that
$D_{11}r_1 + \dots +D_{1n}r_n = 0$, or, equivalently, the odds that $r_1 = -
\frac{D_{12}r_2 + \dots + D_{1n}r_n}{D_{11}}$. Recall that $r_1 \in \{0,1\}$. If that
ugly fraction is neither 0 nor 1, we&amp;rsquo;re good to go because $r_1$ cannot possibly take
on that value. If, however, that fraction does equal 0 or 1, then there is a
$\frac{1}{2}$ chance that we assigned $r_1$ that value. Thus, $\Pr[Dr = 0 ~|~ D \neq
0] \leq \frac{1}{2}$, which means that our algorithm will catch a mistake at least
half of the time.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Isn&amp;rsquo;t math cool?!?! In lecture, this was used to show that
randomization is a powerful tool that allows us to do all kinds of mathematically
rigorous magic. I was blown away; I hope you were too.&lt;/p></description></item></channel></rss>