Scaling hierarchical agglomerative clustering to trillion-edge graphs

In a line of work beginning in 2021, we carefully investigated the complexity and parallel scalability of HAC from a sparse, graph-theoretic perspective. We observed that in practice, only m << n² of the most similar entries in the similarity matrix are actually relevant for computing high-quality clusters. Focusing on this sparse setting, we set out to determine whether we could design HAC algorithms that exploit sparsity (see the sketch after the list below) to achieve two key goals:

  1. a number of computation steps proportional to the number of similarities (m), rather than n², and
  2. high parallelism and scalability.
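To make the sparse setting concrete, here is a minimal sketch, assuming a generic setup rather than the pipeline used in our papers, of building a sparse similarity graph with roughly k·n edges by keeping only each point's k nearest neighbors. The use of scikit-learn's `kneighbors_graph` and the distance-to-similarity kernel are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the papers' pipeline): keep only each
# point's k nearest neighbors, so the graph has m ~ k*n edges instead of the
# full n^2 similarity matrix.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def sparse_similarity_graph(points, k=10):
    # Distance-weighted k-NN graph in sparse CSR format (n x n, ~k*n nonzeros).
    dist = kneighbors_graph(points, n_neighbors=k, mode="distance")
    # Convert distances to similarities in (0, 1]; this kernel is an
    # illustrative choice, not the one used in the papers.
    sim = dist.copy()
    sim.data = 1.0 / (1.0 + sim.data)
    # Symmetrize so the similarity graph is undirected.
    return sim.maximum(sim.T)

points = np.random.rand(1000, 16)
graph = sparse_similarity_graph(points, k=10)
print(graph.nnz, "stored edges instead of", points.shape[0] ** 2)
```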

We started by carefully exploring the first goal, since achieving it would mean that HAC could be solved much faster on sparse graphs, presenting a path toward scaling to extremely large datasets. We established this by presenting an efficient algorithm for the problem on sparse graphs that runs in sub-quadratic work (specifically, O(n√m) work, up to poly-logarithmic factors). We obtained the algorithm by designing a careful accounting scheme that combines the classic nearest-neighbor chain algorithm for HAC with a dynamic edge-orientation algorithm.
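For intuition, here is a minimal sketch of the classic nearest-neighbor chain component in a sparse average-linkage setting, assuming an adjacency-map representation ({cluster: {neighbor: similarity}}) with missing edges treated as similarity 0. The dynamic edge-orientation structure and the accounting scheme that make the full algorithm sub-quadratic are not shown.

```python
# Sketch of nearest-neighbor chain HAC on a sparse similarity graph.

def merge_clusters(adj, sizes, a, b):
    """Merge cluster b into a using the average-linkage (Lance-Williams) update."""
    sa, sb = sizes[a], sizes[b]
    nbrs_b = adj.pop(b)
    nbrs_b.pop(a, None)
    adj[a].pop(b, None)
    for c in set(adj[a]) | set(nbrs_b):
        # Missing edges contribute similarity 0 to the weighted average.
        new_s = (sa * adj[a].get(c, 0.0) + sb * nbrs_b.get(c, 0.0)) / (sa + sb)
        adj[a][c] = adj[c][a] = new_s
        adj[c].pop(b, None)
    sizes[a] = sa + sb

def nn_chain_hac(adj, sizes):
    """Return the merge sequence (a, b, similarity) produced by NN-chain HAC."""
    merges, chain, active = [], [], set(adj)
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))
        top = chain[-1]
        if not adj[top]:                      # no remaining edges: set it aside
            active.discard(top)
            chain.pop()
            continue
        nn = max(adj[top], key=adj[top].get)  # most similar neighboring cluster
        if len(chain) >= 2 and nn == chain[-2]:
            # Mutual nearest neighbors: merge them.
            chain.pop(); chain.pop()
            merges.append((top, nn, adj[top][nn]))
            merge_clusters(adj, sizes, top, nn)
            active.discard(nn)
        else:
            chain.append(nn)
    return merges

# Example: a 4-node path graph with unit-size clusters.
adj = {0: {1: 0.9}, 1: {0: 0.9, 2: 0.8}, 2: {1: 0.8, 3: 0.2}, 3: {2: 0.2}}
sizes = {0: 1, 1: 1, 2: 1, 3: 1}
print(nn_chain_hac(adj, sizes))
```

Because average linkage is a reducible linkage, merging mutual nearest neighbors in this way produces the same dendrogram as repeatedly taking the globally best merge.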

However, the sub-quadratic work algorithm requires maintaining a complicated dynamic edge-orientation data structure and is not easy to implement. We complemented our exact algorithm with a simple algorithm for approximate average-linkage HAC, which runs in nearly-linear work, i.e., O(m + n) up to poly-logarithmic factors, and is a natural relaxation of the greedy algorithm. Let ε be an accuracy parameter. Then, more formally, a (1+ε)-approximation of HAC is a sequence of cluster merges, where the similarity of each merge is within a factor of (1+ε) of the highest-similarity edge in the graph at that time (i.e., the merge that the greedy exact algorithm would perform).
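The relaxation can be illustrated with a short sketch (not the paper's implementation): in each round, compute the current best similarity and perform every available merge whose similarity is at least best/(1+ε). The sketch reuses the adjacency maps and the merge_clusters helper from the sketch above; the nearly-linear-work algorithm realizes the same idea lazily with bucketing rather than rescanning the graph each round.

```python
# Sketch of the (1+eps)-approximate greedy rule: any merge within a (1+eps)
# factor of the current best edge is a valid step.

def approximate_greedy_hac(adj, sizes, eps=0.1):
    merges = []
    while True:
        # Highest-similarity edge currently in the graph.
        best = max((s for nbrs in adj.values() for s in nbrs.values()), default=0.0)
        if best == 0.0:
            break
        threshold = best / (1.0 + eps)
        # Perform every merge whose similarity is at least best / (1 + eps);
        # similarities never increase under these merges, so each such merge
        # is a valid (1+eps)-approximate step at the time it is performed.
        for a in list(adj):
            if a not in adj or not adj[a]:
                continue
            b, s_ab = max(adj[a].items(), key=lambda kv: kv[1])
            if s_ab >= threshold:
                merges.append((a, b, s_ab))
                merge_clusters(adj, sizes, a, b)
    return merges

adj = {0: {1: 0.9}, 1: {0: 0.9, 2: 0.8}, 2: {1: 0.8, 3: 0.2}, 3: {2: 0.2}}
sizes = {0: 1, 1: 1, 2: 1, 3: 1}
print(approximate_greedy_hac(adj, sizes, eps=0.1))
```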

Experimentally, this notion of approximation (say, for ε = 0.1) incurs minimal quality loss compared to the exact algorithm on the same graph. Moreover, the approximate algorithm yielded large speedups of over 75× over the quadratic-work algorithms, and could scale to inputs with tens of millions of points. However, our implementations were slow for inputs with several hundred million edges, as they were entirely sequential.

The next step was to attempt to design a parallel algorithm with the same work and a provably low number of sequential dependencies (formally, low depth, i.e., a short longest critical path). In a pair of papers from NeurIPS 2022 and ICALP 2024, we studied how to obtain good parallel algorithms for HAC. First, we confirmed the common wisdom that exact average-linkage HAC is hard to parallelize due to the sequential dependencies between successive greedy merges. Formally, we showed that the problem is as hard to parallelize as any other problem solvable in polynomial time. Thus, barring a breakthrough in complexity theory, average-linkage HAC is unlikely to admit fast parallel algorithms.

On the algorithmic side, we developed a parallel approximate HAC algorithm, called ParHAC, which we show is highly scalable and runs in near-linear work and poly-logarithmic depth. ParHAC works by grouping the edges into O(log n) weight classes and processing each class using a carefully designed low-depth symmetry-breaking algorithm. ParHAC enabled the clustering of the WebDataCommons hyperlink graph, one of the largest publicly available graph datasets with over 100 billion edges, in just a few hours using an inexpensive multicore machine.
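As an illustration of the first ingredient, here is a minimal sketch, under assumed simplifications, of how edges might be grouped into geometric weight classes so that all edges within a class are within a (1+ε) factor of each other. The low-depth symmetry-breaking step that processes each class is not shown.

```python
# Sketch of geometric weight classes: class i holds edges with similarity in
# (w_max/(1+eps)^(i+1), w_max/(1+eps)^i].
import math

def weight_classes(edges, eps=0.1):
    """edges: iterable of (u, v, similarity). Returns {class index: edge list}."""
    edges = list(edges)
    w_max = max(s for _, _, s in edges)
    buckets = {}
    for u, v, s in edges:
        i = int(math.floor(math.log(w_max / s, 1.0 + eps)))
        buckets.setdefault(i, []).append((u, v, s))
    return buckets

edges = [(0, 1, 0.9), (1, 2, 0.85), (2, 3, 0.4), (3, 4, 0.05)]
for i, bucket in sorted(weight_classes(edges, eps=0.1).items()):
    print("weight class", i, "->", bucket)
```

When the edge weights are polynomially bounded, the number of nonempty classes is O(log n / ε), which for constant ε matches the O(log n) bound mentioned above.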
