## Sunday 3 June 2007

### Highly Efficient 4 Way Merging

Acknowledgements: This article is based on the work of Peter Sanders and Roman Dementiev. You can find the full source code presented here in their STXXL library (STL library for large data structures), Peter Sanders' paper and the accompanying source code. Warning: If you are of sensitive nature either regarding
• computer science theory
• highly practical program optimization
then I recommend you to stop reading this article at once. You might faint or something. Maybe, however, you can get a “bling” moment out of this article. Do you remember Mergesort from your introductory course in programming or computer science? An essential part of it is merging two already sorted lists into one as demonstrated in the animation below.
```/* Merge two already sorted arrays where both are terminated
* by 0x0fff ffff which may not appear *inside* arrays.
*/
void merge2(int a[], int b[], int result[]) {
int i = 0, j = 0, k = 0;

while (true) {
if (a[i] <= b[j]) {
result[k] = a[i];
i += 1; k += 1;
} else {
result[k] = b[j];
j += 1; k += 1;
}

if (result[k] == 0x0fffffff) return;
}
}
```
If you are like me you do not think about Mergesort any more after understanding it and writing the exam it is relevant for. After all - people tell you that Quicksort has the same average run time and is normally faster and in place. This is not true in the general case but I'll keep that for another article. So, what is the deal about merging? If you complete your initial courses on algorithms and do not pursue algorithmics any further then chances are pretty good that you go ahead to things with more quickly visible and “tangible” results such as “enterprise” programs. But you shut your eyes to a pretty interesting part of computer science: Algorithmics. Merging gets interesting if you deal with I/O efficient algorithms: If you process data sets that do not fit your main memory any more then you will propably want to minimize I/O requests lest your shiny new Core 2 Duo processor will wait for your slow hard disk most of the time. Similar things have to be considered for the performance difference of your slow DRAM and the fast on CPU caches. Sequence Heaps are a highly I/O efficient (and cache efficient for data sets that fit into main memory) data structure for maintaining priority queues. Applications for priority queues include graph algorithms where you want to find the next largest node in your graph as fast as possible and graphs can get huge (think of all streets in Europe or North America, for example). The data structure described by Sanders requires an efficient R-way merger (where R is less than or equal to 4 for practicable sizes). Building a two way merger was pretty simple and we could extend this to a 4 way merger:
```/* Merge four already sorted arrays where both are terminated
* by 0x0fff ffff which may not appear *inside* arrays.
*/
void naive_merge4(int a[], int b[], int c[], int d[], int result[]) {
int i = 0, j = 0, k = 0, l = 0, m = 0;

while(true) {
if (a[i] <= b[j] && a[i] <= c[k] && a[i] <= d[l]) {
result[m] = a[i];
m += 1; i += 1;
} else if (b[j] <= a[i] && b[j] <= c[k] && b[j] <= d[l]) {
result[m] = b[j];
m += 1; j += 1;
} else if (c[k] <= a[i] && c[k] <= b[j] && c[k] <= d[l]) {
result[m] = c[k];
m += 1; k += 1;
} else { /* (d[l] <= a[i] && d[l] <= b[j] && d[l] <= c[k]) */
result[m] = d[l];
m += 1; l += 1;
}

if (result[m] == 0x0fffffff) return;
}
}
```
OK, so we are all fluffy and set now, aren't we? If all arrays have the same length n then our naive_merge4 will run in O(n) and we cannot expect a better asymptotic runtime, right? We will not be able to improve on the asymptotic runtime since we have to look at least once at each element but each of our if statements has three comparisons! We have to do nine comparisons each time d contains the smallest element! Can we do better? We could use the comparison demonstrated in the following figure and reduce the comparisons to three in every case. Can we do even better? The answer is: Yes, we can - using Sander's code for four way merging:
```template
void merge4_iterator(
InputIterator & from0,
InputIterator & from1,
InputIterator & from2,
InputIterator & from3,
OutputIterator to, int_type sz, Cmp_ cmp)
{
OutputIterator done    = to + sz;

#define StartMerge4(a, b, c, d)\
if ( (!cmp(*from##a ,*from##b )) && (!cmp(*from##b ,*from##c )) \
&& (!cmp(*from##c ,*from##d )) )\
goto s ## a ## b ## c ## d ;

// b>a c>b d>c
// a<b b<c c<d
// a<=b b<=c c<=d
// !(a>b) !(b>c) !(c>d)

StartMerge4(0, 1, 2, 3);
// [...] call StartMerge4 for every permutation of 0, 1, 2, 3

#define Merge4Case(a, b, c, d)\
s ## a ## b ## c ## d:\
if (to == done) goto finish;\
*to = *from ## a ;\
++to;\
++from ## a ;\
if (cmp(*from ## c , *from ## a))\
{\
if (cmp(*from ## b, *from ## a )) \
goto s ## a ## b ## c ## d; \
else \
goto s ## b ## a ## c ## d; \
}\
else \
{\
if (cmp(*from ## d, *from ## a))\
goto s ## b ## c ## a ## d; \
else \
goto s ## b ## c ## d ## a; \
}

Merge4Case(0, 1, 2, 3);
// [...] call Merge4Case for every permutation of 0, 1, 2, 3

finish:
return;
}
```
So, what's happening here? The macro Merge4Case generates the code for a specific ordering of the input streams from0, ..., from3. For each ordering, the next case can be found using only two comparisons (see the figure below). The macro StartMerge4 merely gets us started by jumping to the appropriate case. We have to do some work for initialization but after that we only have to do only two comparisons for each merge step! The instruction pointer (i.e. the current point of execution in the code) encodes the ordering of the input streams' ordering. Of course, there is a tradeoff: While naive_merge4 is only 20 lines long, the code merge4_iterator expands to much more than 200 lines. Of course, the native code does not correlate directly with the number of lines but the optimized variant will make the compiler create much more code than the naive merge method. The code for the optimized 4 way merger fits into the instruction cache and the tradeoff makes sense. However, we have to keep in mind that the code generated is dependent on the number of permutation of input streams. Thus, it is in O(R!). While 4! is only 24 a calculator or pen and paper give us 40'320 for 8!. I guess the code generated for an 8 way merger might not quite fit into the L1 instruction cache of the common CPUs out there (Wikipedia tells us that the Itanium 2 has a 16KB L1 instruction cache and a 256KB L2 cache that is used for both instructions and data). So, what can we learn from this?
• There are very elegant and powerful uses of the C preprocessor and the otherwise harmful GOTO.
• People at universities can write very pragmatic and efficient code.
• Algorithmics can be interesting (hopefully).