Nobody ever implements sort

In discussions on Hacker News, I repeatedly see comments about "nobody ever implements sorting algorithms" -- or, generalized, that it's a waste of time studying up on basic data structures and algorithms, because they're all already implemented.

While it's rare, it's not been my experience to never re-implement the basics.  I decided to data-mine my and my research group's past and current projects to identify the number of places we've implemented elementary data structures for good reasons.

Sorting

In the Cuckoo Filter, we implement a custom sort to sort the four items that can go in a bucket.  Because this is a high-performance context, and the size of the set to be sorted is fixed and small, it's much faster to do this inline.  This sort only cares about the least-significant byte of the key.

    inline void SortPair(uint32_t& a, uint32_t& b) {
         if ((a & 0x0f) > (b & 0x0f)) {
             std::swap(a, b);
Four-element sorting network
         }
    }

    inline void SortTags(uint32_t *tags) {
         SortPair(tags[0], tags[2]);
         SortPair(tags[1], tags[3]);
         SortPair(tags[0], tags[1]);
         SortPair(tags[2], tags[3]);
         SortPair(tags[1], tags[2]);
    }

It looks obvious, but there are a few things to note about doing this:  First, it implements a basic sorting network, as shown at right.  Key to this choice is that it has a "depth" of three -- the first two compare/swap operations can be performed in parallel;  then the next two;  then the final one.  This is an optimal sorting network for size 4.

Is this actually faster?  Definitely.  Comparing this vs. the naive use of std::sort:

inline int32_t STLSortTags(uint32_t (&tags)[4]) {
  std::sort(std::begin(tags), std::end(tags), 
            [](uint32_t a, uint32_t b) {
      return (a & 0xf) < (b & 0xf);
    });
}

Indicates -- with some overhead, because my benchmark program also initializes the array with random data each time -- that it's at least 5.6x faster:


fully inlined : 0.76 seconds
STL sort : 4.26 seconds

The place where this is used is performance-critical -- the "semi-sorted" (a type of compression) cuckoo filter has to sort the contents of a 4-element bucket any time an element is inserted.  The sorting network works well on our target x86 platform because it exploits the inherent parallelism of modern processors (they can issue multiple instructions per cycle, if those instructions are independent).  The entirely inlined implementation avoids a lot of unnecessary function call and setup overhead from a more general-purpose sorting algorithm.

Breadth-First Search

In libcuckoo, our high-performance Cuckoo hash library for C++, we implement a custom BFS queue, for both size and memory efficiency reasons.  The size of our search is bounded, and this queue is on the speed-critical path for inserting items into the hash table as it becomes full, so we wanted it to be as fast as possible.  Xiaozhou Li and then Manu Goyal implemented and reimplemented a custom queue and BFS for this, which is part of why libcuckoo is so fast.  (The use of BFS for insertion was one of the core contributions in our Eurosys14 paper, "Algorithmic Improvements for Fast Concurrent Cuckoo Hashing".)

I implemented a similar version of this for the Cuckoo hash table in TensorFlow.

Hash Tables

Ok, I put that up there for humor, if you know what my research group has been working on for the last several years.  Don't go implement your own.  Use libcuckoo. :)

Circular Buffers and Lists

In MICA, a high-performance network-based key-value store, we provide a custom implementation of a list using a circular buffer.  Using circular buffers is one of the more frequent reasons we end up reimplementing data structures, when we know something in advance about their expected and maximum size and want to ruthlessly eliminate allocator overhead.  There's one in the queue in libcuckoo as well.  These come up a lot in systems - in fact, the Linux kernel documentation provides a specific explanation about them.  Anuj Kalia just finished modifying the Mellanox infiniband device driver to reuse its descriptors in a circular buffer fashion (page 8).

So you're saying I should go implement my own data structures?

Of course not.  Don't.  99% of the time, don't.  I left out from this article the thousands of times we use std::sort() and every other predefined bit of wonderful code we didn't have to write ourselves.  The STL and other libraries are great.  In many cases, they're faster than what you'd write the first two or three times on your own (though Daniel Lemire might disagree).  From a software engineering perspective, you shouldn't reinvent the wheel.

But every now and then, after you've made sure you're using the right types of algorithms and data structures, and have written repeatable benchmarks, and have made sure that improving the speed of this component actually matters, and have broken out perf and pprof or Intel VTune and identified that there's a clear bottleneck you need to overcome... then you may find that there's a nice spot for customizing the implementation of a standard data structure or algorithm.

Discuss on Hacker News.

Popular posts from this blog

22 Months of Voice-Based Assistants

Minting Money with Monero ... and CPU vector intrinsics

Reflecting on CS Graduate Admissions