Many programmers, upon hearing a problem is NP-Hard, assume that optimizations for exact solutions are impossible or pointless. However, it is important to keep in mind that while NP-Hard problems are non-polynomial in the worst case, the average case can be improved beyond that. Additionally, ``non-polynomial'' encompasses a wide class of running times, some of which are much more feasible than others. This article looks at a well-known NP-hard problem and iteratively optimizes a solution, presenting a final implementation that is thousands of times faster than a naive approach for some small input sets, and makes it possible to process data sets that would otherwise be intractible.
A well-known problem in Computer Science is finding the maximum independent set of a finite, undirected, acyclical graph. This problem has been proved to be NP-Complete. This makes it tempting to use a simple brute-force algorithm. A few optimizations can vastly improve the run-time of the algorithm though, allowing it to scale far beyond what is possible with a brute-force approach.
The algorithm described here is based on the approache described in A New Exact Algorithm for the Maximum-Weight Clique Problem Based on a Heuristic Vertex-Coloring and a Backtrack Search (Though I don't recommend working directly from this paper, as it has some errors which can provide implementation difficulties), which is itself based on a combination of the approach described by R. Carraghan and P. Pardalos and that of P.R.J. Östergård
A graph \(G\) is made up of a set of \(n\) vertices \(V = (v_1...v_n)\). These vertices are connected by \(k\) edges \(E = (e_1...e_k)\). Each edge \(e\) consists of a pair \((v_1, v_2)\) of vertices that are connected by the edge. Because the graph is undirected, \(e = (v_1, v_2)\) and \(e = (v_2, v_1)\) are equivalent. The complement of \(G\), \(G'\), is created by connecting all vertices not connected in \(G\), and removing all existing connections
The graph \(G\) may be either weighted or unweighted. If it is weighted, each vertex has associated with it a weight \(v_w\). In the unweighted case, weights are assumed to be 1.
Two vertices are considered independent if there is no edge connecting them. Or more formally, vertices \(v_1\) and \(v_2\) are dependent if there exists an edge \(e = (v_1, v_2) \in E\). Otherwise, \(v_1\) and \(v_2\) are independent.
A set of independent vertices \(K\) is maximal if no more vertices can be added to it. That is, \(\forall v \in V, v \notin K, \exists e = (u_1, u_2) \in E\) such that \(u_1 \in V\) and \(u_2 \in K\). A maximal set is the maximum set if it has the highest weight of all maximal sets.
Related to the maximum independent set problem is the maximum clique problem. A clique is a set of vertices where each vertex is connected to the other. An independent set in \(G\) is a clique in \(G'\).
Rather than searching for independent sets directly, we will find \(G'\) and search for maximal cliques.
The simplest and most obvious way of solving the problem is a brute-force approach. The pseudocode is as follows:
Here is a C99 implementation, written to be efficient at the expense of good design:
This approach quickly becomes intractible as the number of vertices increases, as shown in Figure 1 (For each graph, the number of edges is set to \(k = 0.75 * n * (n - 1)\), where \(n\) is the number of vertices). Benchmarks were done in Fedora 11 x64 on a dual-core 2.1GHz Core 2 Duo with 2GB Ram
The time required quickly grows out of control: while a 25-vertex graph can be processed in less than a second, a 35-vertex graph requires over an hour. Anything larger than that is most likely infeasible.
A simple but substantial improvement can be made by observing that the order that a clique is discovered does not matter. Therefore, when building a subgraph of adjacent vertices, only those with an index greater than the current one being expanded need be considered. The modified code looks like this:
The original formula first examined all \(n\) vertices, then for each one examined the next \(n - 1\) vertices, for \(n!\) total vertices examined.
The new version still looks at \(n - 1\) then \(n - 2\) vertices, and so on for vertex \(0\). The next vertex, though, looks at \(n - 2\) vertices, then \(n - 3\), and so forth, meaning it takes \((n - 1)! + (n - 2!) + ...\).
This is the only method that improves the worst-case time of the algorithm. However, it is still of the same order as the previous version. Both of these methods are \(O(n!)\), as the shape of Figure 2 shows. However, while the previous method takes over an hour for a 35-vertex graph, this method can process a 100-vertex graph in under a minute.
Another refinement is the addition of a backtracking element to prevent recalculating values more than is needed. First, we order by weight descending. Each time we fully process vertices \(1...n\), we cache the weight of the best clique found as \(c_n\).
When processing later vertices, a check is done to see if a cached value exists on the vertex being expanded. If so, expansion only continues if the weight of the current clique plus the \(c_n\) is better than the current best. Otherwise, expansion of the vertex is skipped.
Again, this improvement does not reduce the worst-case complexity of the algorithm. Experimental results with random data, however, show it runs approximately three times faster than the previous version. See Figure 3.
The algorithm can be further improved by first coloring the graph. Graph coloring is done by assigning a color to each vertex, with the restriction that adjacent vertices cannot have the same color.
Because all vertices on a clique are connected to each other, no two of them can share the same color. So, for a colored graph, the largest possible weight of the maximum clique is the sum of the largest weight in each color. This is called the degree of the graph.
There are many colorings of a graph possible. The more accurate the coloring is, the more optimized the algorithm will be. The coloring algorithm shown is a simple greedy algorithm. For each vertex, a check is done to see which colors are in use by its neighboring vertices. The first free color is used.
The degree of each subgraph is created as vertices are added to the graph. When the graph is expanded, a check is made to see if it is possible for a full expansion to be better than the best clique. If not, the entire subgraph is skipped.
Again this algorithm does not necessarily improve the worst-case running time, but for a random graph it can run over 100x faster than the previous version, as Figure 4 shows.
One interesting thing about the optimization based on coloring is that the performance of the algorithm can be improved simply by improving the coloring method. In the greedy color algorithm above, the first vertices tend to get a 'better' coloring because they have less restrictions on them.
One simple improvement is to first sort the vertices by the number of neighbors, decreasing. Ones with more neighbors are more likely to be in a maximal clique, and so it is worth the effor to give them a better coloring.
As Figure 5 shows, this approximately doubles the speed of the algorithm.
Figure 6 shows a comparison of timings from all the different timings. What was a nearly unusable algorithm has been improved to the point where it is reasonable for small graphs.
This is not intended to necessarily present the fastest solution possible; if anything, it has shown that further improvements are always possible. It does show, however, that it can be worth the effort to investigate improved solutions, even for ``impossible'' problems.
Here is the final working code: