Introduction

Optimization problems are everywhere!

Whether it’s finding the most efficient way to deliver packages to customers, determining the best next move in a game of chess, or figuring out how to adjust the parameters of a gigantic machine learning model, many important practical problems are, at their cores, optimization problems. In this post, we will learn an optimization meta-algorithm called simulated annealing, a general approach to (approximately) finding global solutions to optimization problems… which is, interestingly, inspired by a physical process from material science.

Simulated annealing: overview

Annealing

Before discussing its algorithmic analog, we should sketch out what annealing is and how it works. Annealing is a process that alters the physical and chemical properties of a metal so that it can be worked more easily. It begins by heating the metal above its recrystallization point in order for it to enter a state in which its chemical states can change more freely. We then slowly cool the meatal to allow it to settle into a chemically superior state.

Relationship to optimization

One way to solve an optimization problem is to carry out the the following iterative process:

Start in some state (e.g., a chessboard just after it has been set up).
Transition from the the current state to a new state that most decreases the value of your an objective function you want to minimize (e.g., make a move that decreases your probability of losing).
Repeat step 2 until a stopping condition is met (e.g., the game ends).

Simulated annealing modifies this process. It would instead look something like this:

Start in an initial state.
Sample a random candidate state to transition to.
With some probability that depends on the current and candidate state, accept the candidate transition. Otherwise, stay put.
Repeat steps 2 and 3 until a stopping condition is met.

The analogy to physical annealing comes from the fact that step 3 depends on a parameter called the temperature. In physics, the higher the temperature of a system, the more jittery the system is – that is, the more random motion there is among its constituent particles. Early in the optimization process, we set a high temperature; this allows the algorithm explore by accepting riskier transitions, i.e., those that result in a higher (worse) objective value than that of our current state. As the annealing progresses, we lower the temperature; this causes the optimization process to become more conservative. Eventually, the space of acceptable next states will contain only those that are better than the current state (in terms of objective value).

Global vs local optimization

One question you might ask is: Why bother with the high temperature phase at all? If low temperatures will allow the algorithm to only move toward better solutions, why not always make those kinds of moves? The key to answering this question lies in the difference between globally and locally optimal solutions to a problem. Suppose you are on a quest to see the view from the highest point in San Francisco (which has lots of hills). If you only ever step in the direction of steepest ascent from where you are, you will reach the top of some hill, but it’s possible that in order to reach the top of the highest hill, you should have walked downhill for a while in another direction first and only then started to ascend. The hill whose acme you reached by steepest ascent – a local optimum – is not very difficult to find. By contrast, finding the true tallest peak in San Francisco – the global optimum – is trickier.

For many problems – like finding the best set of parameters for a machine learning model – local optima work very well, and in many cases we have powerful algorithms for efficiently finding them. Finding global optima, on the other hand, is far more challenging in general, and good algorithms are scarcer, if they exist at all. Simulated annealing is a probabilistic strategy for searching for global optima by exploring aggresively enough early to find the base of the right hill.

In the remainder of this post, we will more explicitly discuss how we carry out steps 2 and 3, and show how we might apply this meta-algorithm to the traveling salesman problem, one of the most difficult discrete optimization problems we have.

Acceptance probability

The key detail that I want to make more precise is how we formulate the probability of accepting a transition from one state to another. In terms of notation, $\mathcal S$ is the entire state space, $s \in \mathcal S$ refers to the current state, $s’ \in \mathcal S$ refers to the candidate state, $E:\mathcal S \to \mathbb R$ is the objective function (lower is better) that we want to minimize, and $T_k$ is the temperature parameter value on the $k$th step (a real number). Our objective is to find the state $s^\star$ that minimizes $E$. In other words, we want to find

$$ s^\star := \underset{s \in \mathcal S}{\text{argmin}} ~ E(s). $$

(The “$:=$” symbol means that the right hand side is the definition of $s^\star$, rather than some equation to prove or solve.)

For simplicity, define $e = E(s)$ and $e’ = E(s’)$. If we are currently in state $s_k$, and $e’ < e$, we automatically transition to $s’$. If $e’ > e$, then we transition with probability

$$ P_{\rm acc}(e, e'; T_k) = \exp(-(e' - e) / T_k). $$

(Note: This is not a probability distribution over states. Instead, here, $P_{\rm acc}$ is used to make a decision about whether to transition to a particular successor state. For this purpose, after we compute $p = P_{\rm acc}(e, e’; T_k)$, we can use a random number generator to generate a random number $r$. If $r < p$, we transition. To sample a state from the entire state space, we would need the transition probabilities for each possible transition to sum to 1.)

Let’s take a minute to think through why this acceptance probability works the way we want it to:

Since we would have accepted if $e’ < e$, we can assume that $e’ - e > 0$. If this difference is very positive, the negative sign and the exponential around it makes $P_{\rm acc}$ very small. This means that the probability of accepting a transition exponentially decays for less desirable candidates.
Decreasing the value of $T_k$ (as $k$ increases) causes the exponent to become large and negative, producing probabilities close to 0. This means that as we run more steps and decrease the temperature, the same differences in objective value will become less and less acceptable. This aligns with the intuition that as the temperature decreases, the optimization process becomes more conservative.

Application: The Traveling Salesman Problem (TSP)

If you’ve never heard of the traveling salesman problem, check out this wikipedia article before continuing. To summarize:

There are $n$ cities to visit.
There are roads connecting every pair of cities.
Each road has a (nonnegative) toll associated with it.
Goal: Find the minimum cost path that ends where you start and visits each city exactly once.

The first thing to remember when using simulated annealing is that for most problems we would apply it to, we should expect to not obtain the globally optimal solution at the end; instead, we hope for a result that is just good enough. In the case of TSP, simulated annealing can give us a reasonable approximation, but we cannot really guarantee anything more than that.

Another practical consideration that arises is how to define the state space for the problem at hand. Before continuing, think about how you might define it for TSP.

A sensible way to define it is to consider any ordering of the $n$ cities to be a state. Defining states this way, there are $n!$ states – for $n \geq 20$, this number is absolutely massive. With such large state spaces, one typically also has to narrow the space of transitions under consideration at each step. In the case of TSP, we might do this by only allowing transitions that swap a pair of cities in the order. Can you think of other methods?

Finally, we define our objective function $E$ to just be the total cost of a particular route, and we stop when we’ve gone some number of iterations without making progress over the best solution we’ve obtained so far. With this setup, we can implement our algorithm following the pseudo-python below:

def simulated_annealing_TSP(G, max_iters_with_no_improvement):
   """
   G: the initial problem structure (tolls for each road)
   max_iters_with_no_improvement: The maximum number of iterations allowed without
      surpassing the best seen so far before termination.
   """

   # initialize the temperature and pick an initial state
   T = initialize_temperature()
   s = pick_random_city_order(G)

   no_improvement_counter, best_state, lowest_so_far = 0, None, inf
   # while we're making progress...
   while no_improvement_counter < max_iters_with_no_improvement:
      # select a candidate next state
      candidate = randomly_swap_pair(s)  # using any restriction
      # compute the costs of the current state and the candidate
      e_s, e_cand = total_cost(s, G), total_cost(candidate, G)
      # decide whether to accept by comparing a uniform random number
      # to the acceptance probability described earlier
      if uniform(0, 1) < p_acc(e_s, e_cand, T):
         s = candidate
         # if we see a new best, reset the progress counter and
         # save the best state
         if e_cand < lowest_so_far:
            best_state = s
            no_improvement_counter = 0
         else:
            no_improvement_counter += 1
      # reduce the temperature
      T = reduce_temperature(T)

   # return the best state when the iteration completes
   return best_state

There are some details and optimizations left out, but hopefully the code feels straightforward enough for you to try to implement this on your own!

(Note: One detail we left out in the above is the schedule to use to reduce $T_k$ over time. This is a subtle problem, since if we lower it too quickly, our optimization process will not sufficiently explore, whereas if we lower it too slowly, we may not make forward progress fast enough.)

Conclusion

In this post, we briefly described a meta-algorithm called simulated annealing that can help approximate global optima for properly formulated optimization problems, many of which are extremely computationally difficult. It is often most useful when there are not other more direct, problem-specific algorithms we can bring to bear. In addition to describing the general setup, we also looked at how we could apply this approach to the TSP, which gave us a flavor for some of the practical considerations that arise when trying to fit a problem into the SA framework.

Simulated annealing is a powerful tool that is employed to solve a variety of thorny optimization problems across the sciences. It’s a good tool to have in your toolkit – I hope it comes in handy!

Simulated annealing

Introduction

Simulated annealing: overview

Annealing

Relationship to optimization

Global vs local optimization

Acceptance probability

Application: The Traveling Salesman Problem (TSP)

Conclusion

Further Reading

How does OpenAI's DALL-E work?

Faster language model inference

RANSAC for robust data fitting

Trending Tags