
Distributed hash tables

Introduction

While watching some lectures on distributed/cloud computing, I came across the distributed hash table, which is a way to implement a hash table (if you’re unfamiliar, see here) distributed across a bunch of different servers connected by a network. The goal is to implement the table so that (1) finding a key’s location is “easy” (read: efficient) and (2) the user does not have to worry about the underlying network topology. In other words, to the client, using the distributed table should feel like using an in-memory table on a single machine. I chose to write about this because it’s a place where theory translates gracefully into practice. It turns out that by using a mathematically fun and interesting hashing technique and a clever data structure in concert, we can achieve a pretty efficient distributed hash table.

Hash tables

For those unfamiliar with the basic idea of hashing and hash tables, imagine that you have a bunch of objects that you need to store. You have a bunch of cabinets in which to store said objects. You’d ideally like to put the items into the cabinets in such a way that when you want to retrieve a particular item from storage, you can do so quickly. To help you accomplish this, I give you a crystal ball which, based on some combination of the object’s characteristics, determines which cabinet to put it in/retrieve it from. For example, if the object you’d like to store is red, spherical and manufactured by Hasbro, your crystal ball might assign it cabinet #1 (note: a crucial property of this crystal ball is that if it says cabinet #1 for a particular item, it will always say cabinet #1 for that item; it won’t change its opinion).

Then, if at some later point you want to retrieve the red ball manufactured by Hasbro, you would consult the crystal ball I gave you to figure out where you put it. For this to work well and grant the efficiency you seek, you wouldn’t want too many objects landing in the same cabinet; if the crystal ball assigned several items to cabinet #1, for instance, then when your crystal ball reveals that the red Hasbro ball is in cabinet #1, you would have to look at all of the items that landed there to find it. In the worst possible case, if every object somehow landed in cabinet #1, you would potentially have to rifle through every item you deposited. As the number of items you need to store gets larger and larger, the inconvenience of items in the same cabinet could range anywhere from mildly annoying to intractable.

Before we continue, I want to attach some technical terminology to the basic aspects of hashing just discussed. In the above example, the hash table itself is the set of cabinets you have for storage. We will let the number of cabinets (also called buckets) in our table be denoted by $m$, and we use $n$ to denote the number of items we have to store. The items are called keys. The crystal ball you consulted in order to know where to put/find each item is known as a hash function. The easiest way to think of a hash function is as a mathematical function that takes some object in and uses its properties to deterministically output an integer. One of the properties we care about when we choose hash functions to suit our applications is whether they assign lots of different keys to the same bucket. If keys $k_1$ and $k_2$ are mapped to the same bucket by the hash function, we say that the function collides on these keys. (As a note before we get into the specifics of the hashing technique we’ll use to build our distributed table: any algorithm that uses hashing usually relies on a standard assumption about hash functions (see here) which basically guarantees that the probability that a key lands in any particular bucket is $1/\#\text{buckets}$. More technically, the assumption asserts that the hash function distributes keys uniformly. We can actually show mathematically that under this assumption, we expect any particular cabinet to have $n/m$ items in it, implying that we expect the runtime of a lookup/insert/update/delete to be $O(n/m)$. Provided that we choose $m$ close to $n$, we effectively expect that the aforementioned operations take constant time (read: are extremely fast).)
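To make the terminology concrete, here is a minimal Python sketch of a chained hash table with $m$ buckets. The class and method names are my own, and it ignores everything a real implementation would worry about (resizing, careful hash function choice, and so on); it just illustrates "crystal ball picks the cabinet, then search within the cabinet."

```python
# A minimal chained hash table: m "cabinets" (buckets), each holding the
# key/value pairs the hash function assigns to it. Illustrative only.
class SimpleHashTable:
    def __init__(self, m):
        self.buckets = [[] for _ in range(m)]  # m empty cabinets

    def _cabinet(self, key):
        # The "crystal ball": deterministically map a key to a cabinet number.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        cabinet = self.buckets[self._cabinet(key)]
        for i, (k, _) in enumerate(cabinet):
            if k == key:                 # key already stored: update it
                cabinet[i] = (key, value)
                return
        cabinet.append((key, value))     # otherwise add it to the cabinet

    def get(self, key):
        # Rifle through the (hopefully short) cabinet the key was assigned to.
        for k, v in self.buckets[self._cabinet(key)]:
            if k == key:
                return v
        raise KeyError(key)


table = SimpleHashTable(m=8)
table.put("red Hasbro ball", "stored in some cabinet")
print(table.get("red Hasbro ball"))
```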

Complexities of a distributed network

Now let’s say I have a bunch of files that I want to store across a collection of machines. For this particular example, say I have 3 servers available, labeled 0, 1, and 2. The simplest way to store the files across my servers is to take the hash of the file I want to store (an integer), compute its remainder when divided by 3 (note that this value can only be 0, 1, or 2) and stash the file in the server (“cabinet”) with the corresponding label. To find out which server a file $f$ has been stored on, simply compute $\text{hash}(f) \bmod 3$ (this means the remainder that $\text{hash}(f)$ leaves when divided by 3) and ask that server for the file.
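Here is a sketch of that placement rule, assuming SHA-256 as the hash function (any reasonable hash works) and a made-up file name:

```python
import hashlib

NUM_SERVERS = 3  # servers labeled 0, 1, 2

def server_for(filename):
    # Hash the file's name to an integer, then take the remainder mod 3.
    h = int(hashlib.sha256(filename.encode()).hexdigest(), 16)
    return h % NUM_SERVERS

# The same file always maps to the same server label (0, 1, or 2).
print(server_for("vacation-photo.jpg"))
```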

This technique seems all well and good, until you consider one of the realities of distributed systems: machines (“machine” will henceforth be interchangeable with “node” or “server”) join and leave the system all the time. Can you see the challenge this presents? What happens if I store a file while the system has 3 servers and then ask my hash function where it is after a 4th server has joined? See if you can spot the problem before continuing.

The problem is that if a new node joins or leaves the system, the hash function might give me the wrong answer. Let’s say that for some file $f$, $\text{hash}(f) = 40$. If I stored $f$ when my system only had 3 nodes, my algorithm would have placed $f$ on node 1, because $40 / 3 = 13$ remainder $1$. Now let’s say I add a 4th node and subsequently ask where $f$ is. Well $\text{hash}(f) = 40$, and $40 / 4 = 10$ remainder $0$! But from earlier, we saw that $f$ is actually sitting on node 1! How might we solve this problem? Think about this before reading the next paragraph.

We can solve this problem by rehashing every file given the new number of servers whenever the system registers a new machine (or some machine leaves). This solution, however, is very computationally cumbersome. Each time a new node registers (or leaves), the system has to stop serving clients, recompute the placement of every single file under the new modulus, and move the files to their new homes. (This can be somewhat mitigated by storing metadata about files instead of the files themselves on the servers in the system, but given enough files, even moving around all the metadata would be pretty slow.)
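To get a feel for how bad this is, here is a quick illustrative experiment (file names and key count made up) that counts how many keys change servers when a 4th server joins under the mod-$N$ scheme; with remainder-based placement, roughly three quarters of the keys have to move.

```python
import hashlib

def placement(key, num_servers):
    # Same rule as before: hash the key and take the remainder.
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % num_servers

keys = [f"file-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if placement(k, 3) != placement(k, 4))
print(f"{moved / len(keys):.0%} of keys change servers")  # around 75%
```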

Can we avoid this pitfall somehow? The answer, as you might’ve guessed, is yes, and the technique is called consistent hashing.

Consistent hashing

In consistent hashing, we pick some integer $s$ and imagine a logical ring with $2^s$ discrete slots labeled $0, 1, 2, \dots, 2^s - 1$. For example, if $s = 3$, then our logical ring would have slots labeled $0, 1, \dots, 7$ (that is, up to $2^3 - 1$). Ideally we choose an $s$ big enough that $2^s$ is a lot bigger than the number of nodes we expect.

(For the remainder of these steps when I talk about locations on the ring, I’m not talking about an actual ring. If you’re familiar with modular arithmetic, all we’re doing is wrapping hashes around a modulus. If you’re thinking of it as an actual ring, that’ll work too.)

First we place the servers on the ring by hashing their (ip address, port) pairs and taking remainders modulo $2^s$. If for some server $A$, $\text{hash}(A)$ gives a remainder of $a$ modulo $2^s$, it would logically occupy slot $a$ on the ring.
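A sketch of that placement step, again assuming SHA-256 and using a made-up address; the only requirement is that hashing the same (ip address, port) pair always yields the same slot.

```python
import hashlib

S = 32                  # ring with 2**S slots (an illustrative choice of s)
RING_SIZE = 2 ** S

def ring_position(ip, port):
    # Place a server on the ring by hashing its (ip, port) pair mod 2**S.
    h = int(hashlib.sha256(f"{ip}:{port}".encode()).hexdigest(), 16)
    return h % RING_SIZE

print(ring_position("10.0.0.1", 8080))  # some slot in [0, 2**32)
```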

Next, in order to make accesses fast, we introduce a clever data structure called a finger table at each node. Each finger table maintains $s$ pointers to other nodes on the ring as follows. For $i$ between $0$ and $s - 1$ (inclusive), the $i$th finger table entry for a node at ring position $N$ is the first node whose position is greater than or equal to $(N + 2^i) \pmod {2^s}$. Let’s look at a quick example.

Suppose that $s = 5$ and we had nodes at positions 3, 7, 16, and 27 on our logical ring. To compute 3’s finger table, we compute

\begin{align*} 3 + 2^0 &= 3 + 1 &&= 4 &&&\pmod{32} \\\\ 3 + 2^1 &= 3 + 2 &&= 5 &&&\pmod{32} \\\\ 3 + 2^2 &= 3 + 4 &&= 7 &&&\pmod {32} \\\\ 3 + 2^3 &= 3 + 8 &&= 11 &&&\pmod {32} \\\\ 3 + 2^4 &= 3 + 16 &&= 19 &&&\pmod {32} \end{align*}

The first node at or after the first target value (4) is 7, so the first finger points to 7. The same goes for the second and third targets (5 and 7). The fourth target (11) gives 16 and the final target (19) gives 27. So for node 3, the finger table would look like $(7, 7, 7, 16, 27)$.
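Here is a small sketch that reproduces this example; `successor` and `finger_table` are helper names I’m introducing, and the node positions are the ones from above.

```python
S = 5
RING = 2 ** S                    # 32 slots
nodes = sorted([3, 7, 16, 27])   # node positions from the example

def successor(slot):
    # The first node at or after `slot`, wrapping around the ring.
    for n in nodes:
        if n >= slot:
            return n
    return nodes[0]

def finger_table(n):
    # The i-th finger points at the successor of (n + 2**i) mod 2**S.
    return [successor((n + 2 ** i) % RING) for i in range(S)]

print(finger_table(3))   # [7, 7, 7, 16, 27]
```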

Each node also maintains pointers to the nodes to its right and left along the ring, called its successor and predecessor. The successor of 3 is 7 and the predecessor of 3 is 27. In a similar vein, we can more generally refer to predecessors and successors of any slot on the ring.

To perform a lookup, we make use of both the logical ring we introduced and the per-node finger tables. Let’s say that machine 7 wants to make a query for a key that hashes to 2 (and would thus reside on node 3, the first node at or after that slot on the ring). Node 7 would look in its finger table for the largest node that is to the left of the key’s position (this would be node 27) and route the query there. Node 27 would see that the key falls between itself and its successor, node 3, so it would route the query to node 3, which would then send the data associated with the requested key back to 7.

(In this example the ring is small and we don’t require that many hops. This won’t always be the case. For instance, if $s = 10$ (so the ring has $2^{10} = 1024$ slots) and a query is initiated at node 3 for a key that hashes to 999, the largest node to the left of the key that 3 knows about would sit at position at most $3 + 2^9 = 3 + 512 = 515$, but there might be a node at 997 that is much closer which 3 doesn’t have in its finger table.)
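Here is a toy version of that routing rule, reusing the ring and finger tables from the previous sketch (the helper names and hop counting are mine; a real system would also have to handle node state, failures, and concurrency). Starting at node 7 with a key at slot 2, it hops to 27 and then to 3, just as described above.

```python
S = 5
RING = 2 ** S
nodes = sorted([3, 7, 16, 27])

def successor(slot):
    # The first node at or after `slot`, wrapping around the ring.
    for n in nodes:
        if n >= slot:
            return n
    return nodes[0]

def finger_table(n):
    return [successor((n + 2 ** i) % RING) for i in range(S)]

def between(x, a, b):
    # True if x lies on the clockwise arc (a, b], wrapping around the ring.
    return a < x <= b if a < b else (x > a or x <= b)

def lookup(current, key, hops=0):
    succ = successor((current + 1) % RING)
    if between(key, current, succ):
        return succ, hops + 1            # key lives on our successor: done
    # Otherwise jump to the closest finger that precedes the key's slot.
    for finger in reversed(finger_table(current)):
        if between(finger, current, (key - 1) % RING):
            return lookup(finger, key, hops + 1)
    return lookup(succ, key, hops + 1)   # fallback: just step to our successor

print(lookup(current=7, key=2))  # (3, 2): node 3 holds the key, after 2 hops
```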

Why do this?

The last things I want to briefly discuss are the benefits this whole complicated system confers. The first thing we get is relatively fast lookup time. We can actually prove that (with high probability) lookups take a logarithmic number of hops. The key fact behind this is that on every hop we take using the finger tables, we traverse at least half of the remaining distance between the current node and the node containing the desired key (the key’s successor). A sketch of the proof goes like this:

Suppose we are at node $n$ and the node immediately to the left of a key $k$ is some node $p$ (so $p$’s successor holds the key). According to the algorithm outlined above, node $n$ will search its table and elect to move to the largest node it knows about that is to the left of $k$; call this node $m$. Since $p$ is the last node before $k$, $m$ is necessarily at or to the left of $p$, and both $m$ and $p$ lie between $2^{i-1}$ and $2^i$ away from $n$ for some $i$ (if $p$ were $2^i$ or more away, the finger $n$ keeps for $n + 2^i$ would be a node to the left of $k$ that is larger than $m$). This, in turn, means that $m$ and $p$ are at most $2^{i-1}$ away from one another (because $2^i - 2^{i-1} = 2^{i-1}$). Thus, the distance remaining after the hop, from $m$ to $p$ (at most $2^{i-1}$), is no more than the distance the hop covered, from $n$ to $m$ (at least $2^{i-1}$), and is therefore at most half the total distance from $n$ to $p$, which is what we wanted.
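Writing $d(x, y)$ for the clockwise distance from $x$ to $y$ on the ring (notation I’m introducing just for this argument), the chain of inequalities is

\begin{align*} 2^{i-1} \le d(n, m) \le d(n, p) < 2^{i} &\implies d(m, p) = d(n, p) - d(n, m) < 2^{i} - 2^{i-1} = 2^{i-1} \le d(n, m) \\\\ &\implies 2\, d(m, p) < d(n, m) + d(m, p) = d(n, p), \end{align*}

so the distance remaining after the hop, $d(m, p)$, is less than half of the distance $d(n, p)$ we started the hop with.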

Using this halving result, we can note further that after $t$ hops, the number of slots we haven’t yet searched is at most $(\text{total number of slots}) \cdot (1/2)^t = 2^s/2^t$ (because we at least halved the distance between where we started and our destination $t$ times). If we make $\log n$ hops (where $n$ is the number of nodes in the system), we have $2^s/2^{\log n} = 2^s/n$ slots left to search. Because we assumed that the hash function we chose distributes nodes uniformly about the ring, we only expect there to be 1 node in this window, and the probability that there are more than $\log n$ of them is vanishingly small. This means that after $\log n$ finger table hops, we will (with high probability) have to make at most $\log n$ more hops to get to our destination. If you’re familiar with big-O notation, the total number of hops for a lookup in the worst case is, with high probability, $O(\log n + \log n) = O(2\log n) = O(\log n)$, as we wanted.

As a final note, insertions, deletions, and updates are all bottlenecked by the lookup operation. Once we know we can expect logarithmic lookups, we automatically know that those other operations are logarithmic too.

Conclusion

For those not familiar with what logarithmic runtime means, let’s just say it’s pretty fast. The other cool thing we get out of this logical ring, which is potentially even more important when you’re talking about systems that frequently gain and lose machines, is that you don’t have to move many keys around when a node joins or leaves. For example, if a node leaves, only the keys it was responsible for need to move (to its successor), and you only have to update the finger tables of nodes that used to point to the node that left. (Can you figure out what you would have to do if a node joined?)
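As a rough illustration of how much less data has to move, here is a counterpart to the earlier mod-$N$ experiment, now placing both servers and keys on a $2^{32}$-slot ring (server names, key count, and hash choice are all made up for the sketch). Only the keys that land on the arc claimed by the new server change owners, which is about a quarter of them in expectation when going from 3 servers to 4, versus roughly three quarters under the mod-$N$ scheme.

```python
import hashlib

RING = 2 ** 32

def pos(label):
    # Both servers and keys get a slot on the same ring.
    return int(hashlib.sha256(label.encode()).hexdigest(), 16) % RING

def owner(key, servers):
    # A key belongs to the first server at or after it on the ring (wrapping).
    positions = sorted((pos(s), s) for s in servers)
    p = pos(key)
    for server_pos, server in positions:
        if server_pos >= p:
            return server
    return positions[0][1]

keys = [f"file-{i}" for i in range(10_000)]
before = {k: owner(k, ["srv-0", "srv-1", "srv-2"]) for k in keys}
after = {k: owner(k, ["srv-0", "srv-1", "srv-2", "srv-3"]) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved to the new server")
```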

I think this is a great example of the way that algorithms and mathematical reasoning are a huge part of the push toward more scalable system architectures. If they aren’t using this exact algorithm, engineers at Google, Facebook, Amazon, Netflix and others are using similar ideas to push the boundaries of what it means for distributed systems to be available, scalable, efficient and maintainable.

This post is licensed under CC BY 4.0 by the author.