The Mystery of the Sort Benchmark Outliers

Niek Sanders, May 2018

This is another post on weird bugs and mysteries. A few days ago I looked at contributing to Timely Dataflow. I dug into the repo's radix sort implementation and saw weird behavior when running the benchmarks:

test rsort_lsb_u32_20 ... bench:  13,115,873 ns/iter (+/- 861,527) = 319 MB/s
test rsort_lsb_u32_20 ... bench:  22,135,310 ns/iter (+/- 772,498) = 189 MB/s   <==
test rsort_lsb_u32_20 ... bench:  13,009,435 ns/iter (+/- 364,713) = 322 MB/s
test rsort_lsb_u32_20 ... bench:  13,008,352 ns/iter (+/- 713,115) = 322 MB/s
test rsort_lsb_u32_20 ... bench:  13,083,947 ns/iter (+/- 769,034) = 320 MB/s
test rsort_lsb_u32_20 ... bench:  21,942,109 ns/iter (+/- 978,852) = 191 MB/s   <==
test rsort_lsb_u32_20 ... bench:  13,088,477 ns/iter (+/- 785,538) = 320 MB/s
test rsort_lsb_u32_20 ... bench:  13,064,438 ns/iter (+/- 762,363) = 321 MB/s

On my desktop, an Intel Core i5-7600, about a fifth of the runs slowed from 320 MB/s to 190 MB/s. This was particularly confounding because the benchmark is deterministic.

I got a little obsessed with finding the root cause. After wrapping my head around the radix sort code, I profiled with perf_events using "perf record" and "perf report". But the good and bad runs produced proportionally similar profiles.
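For context, an LSB radix sort of u32 keys is four stable counting-sort passes, one per byte, starting from the least significant. The sketch below is a minimal rendition of that idea for illustration only; it is not the Timely Dataflow implementation:

// Minimal LSB radix sort on u32 keys. Illustrative sketch, not the Timely Dataflow code.
fn radix_sort_lsb_u32(data: &mut Vec<u32>) {
    let mut buffer = vec![0u32; data.len()];
    for pass in 0..4 {
        let shift = pass * 8;

        // Histogram the current byte of every key.
        let mut counts = [0usize; 256];
        for &x in data.iter() {
            counts[((x >> shift) & 0xFF) as usize] += 1;
        }

        // Exclusive prefix sum turns the histogram into bucket start offsets.
        let mut offsets = [0usize; 256];
        let mut running = 0;
        for i in 0..256 {
            offsets[i] = running;
            running += counts[i];
        }

        // Scatter each key into its bucket, preserving relative order.
        for &x in data.iter() {
            let bucket = ((x >> shift) & 0xFF) as usize;
            buffer[offsets[bucket]] = x;
            offsets[bucket] += 1;
        }

        // The output of this pass becomes the input of the next.
        std::mem::swap(data, &mut buffer);
    }
}

The scatter loop is the memory-hungry part: every key gets written into one of 256 buckets spread across a buffer the size of the input.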

Next I inspected performance counters with "perf stat -d". The mystery deepened when comparing a good run against the bad:

"good run" test rsort_lsb_u32_20 ... bench: 13,082,895 ns/iter (+/- 802,286) = 320 MB/s Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20': 3966,057471 task-clock (msec) # 0,990 CPUs utilized 39 context-switches # 0,010 K/sec 2 cpu-migrations # 0,001 K/sec 189.429 page-faults # 0,048 M/sec 16.254.795.809 cycles # 4,098 GHz 29.488.536.562 instructions # 1,81 insn per cycle 2.799.011.967 branches # 705,742 M/sec 6.735.243 branch-misses # 0,24% of all branches 8.004.471.223 L1-dcache-loads # 2018,244 M/sec 1.168.268.824 L1-dcache-load-misses # 14,60% of all L1-dcache hits 9.072.863 LLC-loads # 2,288 M/sec 1.280.500 LLC-load-misses # 14,11% of all LL-cache hits 4,006246558 seconds time elapsed "bad run" test rsort_lsb_u32_20 ... bench: 22,093,918 ns/iter (+/- 9,656,989) = 189 MB/s Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20': 5055,207656 task-clock (msec) # 0,937 CPUs utilized 94 context-switches # 0,019 K/sec 1 cpu-migrations # 0,000 K/sec 189.418 page-faults # 0,037 M/sec 21.962.826.744 cycles # 4,345 GHz 29.493.221.138 instructions # 1,34 insn per cycle 2.799.928.144 branches # 553,870 M/sec 6.728.506 branch-misses # 0,24% of all branches 8.005.459.112 L1-dcache-loads # 1583,606 M/sec 1.130.853.381 L1-dcache-load-misses # 14,13% of all L1-dcache hits 8.311.800 LLC-loads # 1,644 M/sec 1.269.525 LLC-load-misses # 15,27% of all LL-cache hits 5,397361827 seconds time elapsed

Most counts are very close. But somehow the good runs manage ~1.8 instructions/cycle whereas the bad runs putter at ~1.3.
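Those instructions-per-cycle figures are just the raw counters divided out; as a quick sanity check on the numbers above:

// IPC = retired instructions / cycles, using the counters from the perf stat output.
fn main() {
    let good_ipc = 29_488_536_562.0_f64 / 16_254_795_809.0; // ~1.81
    let bad_ipc = 29_493_221_138.0_f64 / 21_962_826_744.0;  // ~1.34
    println!("good: {:.2} insn/cycle, bad: {:.2} insn/cycle", good_ipc, bad_ipc);
}

The instruction counts are essentially identical, so the bad runs burn roughly 5.7 billion extra cycles doing the same work. The core is stalling on something rather than executing more code.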

Despite the L1 load/miss ratios being similar, I guessed that alignment issues were coming into play. I hacked the Rust benchmark to dump the memory address of each vector's buffer (roughly the sketch below), and they were all at least 64-byte aligned. So even vectorized AVX-512 instructions should have been fine. That idea was a bust.
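In miniature, the check looked something like this (report_alignment and the buffer names are illustrative, not code from the benchmark):

// Print a buffer's starting address and its offset within a 64-byte cache line.
fn report_alignment(label: &str, v: &[u32]) {
    let addr = v.as_ptr() as usize;
    println!("{}: {:p} (addr % 64 = {})", label, v.as_ptr(), addr % 64);
}

fn main() {
    // Hypothetical buffers standing in for the benchmark's vectors (2^20 u32 keys).
    let src: Vec<u32> = (0u32..(1 << 20)).collect();
    let dst: Vec<u32> = vec![0u32; 1 << 20];
    report_alignment("src", &src);
    report_alignment("dst", &dst);
}

An offset of 0 means the allocation starts exactly on a 64-byte cache line boundary.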

Then I tried reproducing the behavior on different machines. I used three AWS instance types: c3.xlarge, c4.2xlarge, and c5.large. All of these use Intel Xeon CPUs and showed steady throughput across runs. (I did 20 runs per machine, but for brevity show only 3).

"c3.2xlarge" test rsort_lsb_u32_20 ... bench: 24,280,929 ns/iter (+/- 2,554,741) = 172 MB/s test rsort_lsb_u32_20 ... bench: 24,193,735 ns/iter (+/- 2,120,372) = 173 MB/s test rsort_lsb_u32_20 ... bench: 24,119,255 ns/iter (+/- 2,040,252) = 173 MB/s ... "c5.large" test rsort_lsb_u32_20 ... bench: 14,928,288 ns/iter (+/- 555,767) = 280 MB/s test rsort_lsb_u32_20 ... bench: 14,213,352 ns/iter (+/- 699,836) = 295 MB/s test rsort_lsb_u32_20 ... bench: 14,908,216 ns/iter (+/- 425,255) = 281 MB/s ...

The c3 and c4 throughputs were about the same, so only one is shown. The c5 family shows a significant performance boost over the older generations. Interestingly, on single-core performance my consumer Intel i5-7600 significantly outperforms the c5's Intel Xeon Platinum 8124M: 320 MB/s vs 280 MB/s.

So the AWS instances didn't have performance outliers. The problem was specific to my home setup. I spent a few hours wading through the immense number of different counters exposed by perf_events, trying to understand why the instructions/cycle would sometimes vary so drastically. None of my endless perf runs showed significant differences.

I was about to give up.

There was one last counter I decided to try... hw_interrupts.received.

"good run" test rsort_lsb_u32_20 ... bench: 13,041,410 ns/iter (+/- 760,375) = 321 MB/s Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20': 1.161 hw_interrupts.received "bad run" test rsort_lsb_u32_20 ... bench: 21,963,625 ns/iter (+/- 878,996) = 190 MB/s Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20': 1.880 hw_interrupts.received

Success! I finally had a real difference between the two scenarios, and one I could reproduce easily.

I'd love to say that I used some advanced technique to definitively track down the culprit behind the extra hardware interrupts. In reality I just took a guess that it was this guy:

Photo of TP-Link TL-WN722N Wireless NIC

Sure enough, after disabling my little USB wireless NIC the latency outliers went away:

test rsort_lsb_u32_20 ... bench:  13,078,868 ns/iter (+/- 805,109) = 320 MB/s
test rsort_lsb_u32_20 ... bench:  13,122,640 ns/iter (+/- 737,939) = 319 MB/s
test rsort_lsb_u32_20 ... bench:  13,057,573 ns/iter (+/- 736,634) = 321 MB/s
test rsort_lsb_u32_20 ... bench:  13,052,186 ns/iter (+/- 822,351) = 321 MB/s
test rsort_lsb_u32_20 ... bench:  13,107,309 ns/iter (+/- 772,251) = 319 MB/s
... (snipped 55 more runs) ...

So I'm calling the mystery solved. I mentioned my obsession with tracking this down: when I started it was 10pm; somehow it was 4:30am when I wrapped up.
