Arm outline atomics

One of the features introduced in the Armv8.1-A version of the architecture is a set of new atomic instructions, also known as LSE (Large System Extensions). They provide a number of atomic operations performed by a single instruction (CAS (compare-and-swap), atomic arithmetic, and so on), which gives a significant performance improvement over the “classic” implementation: an exclusive load, the operation itself, an exclusive store, a check whether the sequence was interrupted, and a branch back if it was (a pattern known as LL/SC, load-linked/store-conditional). The performance improvement can be roughly estimated at 20%, and could benefit lockless algorithms and containers in Unity, as well as the job system (especially with tiny jobs) and other subsystems.
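To make the difference concrete, here is a minimal sketch (my own illustration, not Unity code) of a single atomic increment, with the instruction sequences a compiler typically emits for it on AArch64 in the comments:

#include <atomic>

// One atomic increment; the interesting part is what the compiler emits.
int increment(std::atomic<int>& counter)
{
    return counter.fetch_add(1, std::memory_order_seq_cst);
}

// Typical codegen without LSE (-march=armv8-a): an LL/SC retry loop.
//   .retry: ldaxr w8, [x0]        // exclusive load
//           add   w9, w8, #1      // the operation itself
//           stlxr w10, w9, [x0]   // exclusive store; w10 = failure flag
//           cbnz  w10, .retry     // branch back if we were interrupted
//
// Typical codegen with LSE (-march=armv8-a+lse): a single instruction.
//           mov     w8, #1
//           ldaddal w8, w9, [x0]  // atomic add, returns the old value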

An obvious question which follows is device support. Currently, when targeting Arm64 in Unity, we default to the vanilla Armv8-A architecture version, which is supported by every 64-bit Arm device in the world. Support for the Armv8.1-A extensions became mandatory with Cortex-A75, which is roughly equivalent to the Samsung Galaxy S9 (released in 2018). This means that a significant share of our end users are still on devices which don’t support LSE. If such a device stumbles upon an LSE instruction, the process immediately dies with an illegal-instruction exception (SIGILL with code ILLOPC). So we cannot just go and enable LSE everywhere (which could be done by passing -march=armv8-a+lse to the compiler).

A potential solution would be to have both options (LSE and LL/SC) in the binary, with runtime dispatch between them. Luckily, Arm has done a fantastic job of building exactly that into the compiler itself; the feature is called outline atomics, and the option is present in both GCC and Clang: https://reviews.llvm.org/D91157 and https://github.com/llvm/llvm-project/blob/main/llvm/docs/Atomics.rst. The idea is as follows:

  • You add the -moutline-atomics flag to your compiler command line
  • The compiler generates code which performs runtime detection of your CPU capabilities (does it support LSE or not?) – see the sketch after this list
  • The compiler generates intrinsic functions (one per kind of atomic operation) which have two branches, LSE and LL/SC; which one is taken depends on the detected CPU capabilities (see the previous item)
  • Every time your code performs an atomic operation, the compiler inserts a call to the corresponding intrinsic function. This means that on old devices the LL/SC path is selected, while LSE-capable devices run LSE code – all with zero changes to your code, nice and backward compatible! You only have to add the compiler flag.
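Here is roughly what that runtime detection amounts to, sketched in C++ (Linux flavor; the flag name __aarch64_have_lse_atomics and the HWCAP check are what compiler-rt and libgcc actually use – the rest is my simplification):

#include <sys/auxv.h>

#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1 << 8) // AArch64 hwcap bit for the LSE atomics
#endif

// Read once at startup; every outlined intrinsic branches on this flag.
bool __aarch64_have_lse_atomics;

__attribute__((constructor))
static void init_have_lse_atomics()
{
    __aarch64_have_lse_atomics =
        (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
}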

Magic!

But is it free performance in the end? For every atomic operation, you pay the cost of a function call and a branch on the CPU capabilities (likely an easy target for the branch predictor). Compare this to inlined atomic instructions – be it LL/SC, or pure LSE if you’re targeting only modern CPUs. Our friends at Arm (including the developer of the outline atomics feature in LLVM) say that the benefit of LSE is much higher than the cost associated with the function call, so it’s a net win. If so, it sounds like we could provide a performance improvement for our users at a very low cost! Let’s give it a try.

Before we start, I must mention a well-known study of LSE in MySQL which, TL;DR, resulted in “no benefit”. Interesting – let’s keep it in mind.

Implementation

It’s a one-liner, as is usual for complicated but well-designed features: just add -moutline-atomics to the clang flags.
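To verify the flag actually took effect, a quick probe (my example, not part of our codebase) is to compile a trivial atomic operation and look for calls into the outlined intrinsics in the disassembly:

// Build twice, e.g. with the NDK's clang++, and compare the assembly:
//   clang++ --target=aarch64-linux-android21 -O2 -S probe.cpp
//   clang++ --target=aarch64-linux-android21 -O2 -moutline-atomics -S probe.cpp
// Without the flag you should see an inline ldaxr/stlxr loop; with it,
// a call like "bl __aarch64_ldadd4_acq_rel" instead.
#include <atomic>

int probe(std::atomic<int>& counter)
{
    return counter.fetch_add(1, std::memory_order_acq_rel);
}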

Performance assessment strategy

This is the most important thing. Here’s the plan:

  • A non-LSE-capable device (an older device)
    • Baseline: no outline atomics, basically inlined LL/SC code, as it exists currently in our codebase.
    • Outline atomics: enable outline atomics. In this case, no improvement is expected, just the overhead of having to call the outlined functions. A perfect result would be no regression, or a regression within measurement error (up to 5%). A significant regression (20% or more) would be a red flag.
  • An LSE-capable-device (a newer device)
    • Baseline: no outline atomics, basically inlined LL/SC code, as it exists currently in our codebase.
    • Outline atomics: enable outline atomics. In this case, an improvement is expected thanks to the new LSE instructions, minus the function call overhead. A perfect result would be an improvement of 10% or more. No improvement would be a red flag.
    • Pure LSE: enable LSE as the target in the compiler command line. This option exists to compare against outline atomics and to assess the overhead of the function call and the CPU capabilities branch. It is not shippable, because it’s not compatible with older devices, but it helps understand the nature of the improvement/regression with outline atomics. It must perform better than the previous option.

The following devices were used:

  • non-LSE: NVIDIA Shield devices in our CI build farm, Cortex-A57
  • non-LSE: Huawei Honor 9, Cortex-A53 (tested locally)
  • LSE: Samsung Galaxy S22, Cortex-X2+A710+A510, the latest Armv9 device on the market at the time of testing (tested locally)

Round 1. Main Unity and CI

We have remarkable Native Performance tests which run on CI and report to a huge database. Let’s run the whole suite with and without outline atomics and compare the results.

On the Shield, there’s a single change worth mentioning: a slight regression in global_no_contention_Atomic_Add.

unity_perftest_shield_global_no_contention_Atomic_Add

A few other regressions are most likely unrelated.

Let’s try an LSE-capable device then. I ran the tests locally and reported the data to the same database.

The results are unfortunately quite unstable – partly because they were collected on a consumer device lying on my table. However, a few results are curious.

unity_perftest_s22_1 unity_perftest_s22_2 unity_perftest_s22_3 unity_perftest_s22_4

The global_no_contention_Atomic_Add test, which regressed on the Shield, shows an improvement on the S22 (0.12 => 0.10 ms). Some of the MemoryManagerPerformance tests improved significantly. Looks good so far!

Okay, let’s now go to the code and try to understand what’s going on.

The Code

Of course, we have more than one atomics implementation in the Unity codebase. :(

  • We have ExtendedAtomicOps-arm64.h. It uses explicit inline assembly which always generates old-school LDXR/STXR instructions. Outline atomics have no effect on these, and they are used in quite a few places, which explains why there’s less effect to be measured across the board (a sketch contrasting the two styles follows this list).
  • We have baselib’s atomics, a much more recent, platform-agnostic implementation using compiler intrinsics (GCC builtins, also implemented in LLVM). It is used in MemoryManager and other places, but currently has smaller adoption than the old ExtendedAtomicOps. This implementation benefits from LSE and outline atomics, which explains the gains in the MemoryManager performance tests.
  • Some places are still using std::atomic, which benefits from LSE/outline atomics too.
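To illustrate why only some call sites benefit, here is a sketch of the two styles (my illustration, not the actual Unity sources). The compiler cannot rewrite hand-written inline assembly, so -moutline-atomics only affects the builtin-based version:

// Style of ExtendedAtomicOps-arm64.h: explicit inline assembly.
// The compiler emits these instructions verbatim, so -moutline-atomics
// cannot redirect them to the outlined intrinsics.
static inline int atomic_add_asm_style(volatile int* p, int v)
{
    int newval, status;
    __asm__ __volatile__(
        "1: ldxr  %w0, [%2]\n"
        "   add   %w0, %w0, %w3\n"
        "   stxr  %w1, %w0, [%2]\n"
        "   cbnz  %w1, 1b\n"
        : "=&r"(newval), "=&r"(status)
        : "r"(p), "r"(v)
        : "memory");
    return newval;
}

// Style of baselib: compiler builtins. The compiler owns the code
// generation here, so it can outline the operation into a call to
// __aarch64_ldadd4_relaxed and friends.
static inline int atomic_add_builtin_style(volatile int* p, int v)
{
    return __atomic_add_fetch(p, v, __ATOMIC_RELAXED);
}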

The overall strategy is to move as many consumers as possible to baselib, or to rewrite ExtendedAtomicOps as a shim which redirects to baselib. LSE/outline atomics would then benefit all of them.

However, if no regressions are found, it may be a good idea to adopt outline atomics now – and patiently wait for the future to come. It would also add motivation to move the whole codebase to baselib’s atomics.

Luckily, baselib has a set of benchmarks! Let’s run them and look at the results.

Round 2. Baselib Benchmarks

Baselib possesses a great collection of benchmarks which cover all the atomics use cases, and it collects the same results for std::atomic to use as a baseline.

The results of running the benchmarks on Galaxy S22 (LSE-capable) are available in the table below.

Testcase Benchmark default (ns) outline (ns) Change
Baselib atomics (int64_t) load(relaxed) 0.000385595 0.000386961 0.35%
Baselib atomics (int64_t) load(acquire) 0.000385233 0.00039 1.24%
Baselib atomics (int64_t) load(seq_cst) 0.00039 0.00039125 0.32%
Baselib atomics (int64_t) store(relaxed) 0.000393934 0.000385104 -2.24%
Baselib atomics (int64_t) store(release) 0.00039 0.000393064 0.79%
Baselib atomics (int64_t) store(seq_cst) 0.00039125 0.00039 -0.32%
Baselib atomics (int64_t) fetch_add(relaxed) 5.72956 4.99676 -12.79%
Baselib atomics (int64_t) fetch_add(acquire) 5.67628 4.95283 -12.75%
Baselib atomics (int64_t) fetch_add(release) 5.68544 4.98981 -12.24%
Baselib atomics (int64_t) fetch_add(acq_rel) 5.67653 4.9972 -11.97%
Baselib atomics (int64_t) fetch_add(seq_cst) 7.27749 7.52526 3.40%
Baselib atomics (int64_t) fetch_and(relaxed) 5.67146 4.99698 -11.89%
Baselib atomics (int64_t) fetch_and(acquire) 5.673 4.8902 -13.80%
Baselib atomics (int64_t) fetch_and(release) 5.67463 4.92497 -13.21%
Baselib atomics (int64_t) fetch_and(acq_rel) 5.72999 5.04349 -11.98%
Baselib atomics (int64_t) fetch_and(seq_cst) 7.27855 7.50369 3.09%
Baselib atomics (int64_t) fetch_or(relaxed) 5.67244 4.99649 -11.92%
Baselib atomics (int64_t) fetch_or(acquire) 5.67435 5.04275 -11.13%
Baselib atomics (int64_t) fetch_or(release) 5.67056 4.95184 -12.67%
Baselib atomics (int64_t) fetch_or(acq_rel) 5.69323 4.95234 -13.01%
Baselib atomics (int64_t) fetch_or(seq_cst) 7.39012 7.53271 1.93%
Baselib atomics (int64_t) fetch_xor(relaxed) 5.72943 4.99723 -12.78%
Baselib atomics (int64_t) fetch_xor(acquire) 5.72798 4.92503 -14.02%
Baselib atomics (int64_t) fetch_xor(release) 5.67386 4.96399 -12.51%
Baselib atomics (int64_t) fetch_xor(acq_rel) 5.67432 4.99725 -11.93%
Baselib atomics (int64_t) fetch_xor(seq_cst) 7.27855 7.45118 2.37%
Baselib atomics (int64_t) exchange(relaxed) 0.000393206 0.00039125 -0.50%
Baselib atomics (int64_t) exchange(acquire) 5.43095 5.04421 -7.12%
Baselib atomics (int64_t) exchange(release) 0.00039125 0.000389176 -0.53%
Baselib atomics (int64_t) exchange(acq_rel) 5.50492 5.05285 -8.21%
Baselib atomics (int64_t) exchange(seq_cst) 7.23937 7.53526 4.09%
Baselib atomics (int64_t) cmp_xchg_weak fail(relaxed, relaxed) 5.2336 5.45062 4.15%
Baselib atomics (int64_t) cmp_xchg_weak success(relaxed, relaxed) 5.88586 5.419 -7.93%
Baselib atomics (int64_t) cmp_xchg_weak fail(acquire, relaxed) 5.23867 5.39911 3.06%
Baselib atomics (int64_t) cmp_xchg_weak success(acquire, relaxed) 5.95429 5.46143 -8.28%
Baselib atomics (int64_t) cmp_xchg_weak fail(acquire, acquire) 5.28305 5.39779 2.17%
Baselib atomics (int64_t) cmp_xchg_weak success(acquire, acquire) 5.95301 5.38517 -9.54%
Baselib atomics (int64_t) cmp_xchg_weak fail(release, relaxed) 5.30522 5.45177 2.76%
Baselib atomics (int64_t) cmp_xchg_weak success(release, relaxed) 5.88906 5.41026 -8.13%
Baselib atomics (int64_t) cmp_xchg_weak fail(acq_rel, relaxed) 5.23937 5.39805 3.03%
Baselib atomics (int64_t) cmp_xchg_weak success(acq_rel, relaxed) 5.9027 5.36478 -9.11%
Baselib atomics (int64_t) cmp_xchg_weak fail(acq_rel, acquire) 5.2675 5.44287 3.33%
Baselib atomics (int64_t) cmp_xchg_weak success(acq_rel, acquire) 5.86618 5.39944 -7.96%
Baselib atomics (int64_t) cmp_xchg_weak fail(seq_cst, relaxed) 5.16631 6.84211 32.44%
Baselib atomics (int64_t) cmp_xchg_weak success(seq_cst, relaxed) 7.54723 8.33851 10.48%
Baselib atomics (int64_t) cmp_xchg_weak fail(seq_cst, acquire) 5.34307 6.81893 27.62%
Baselib atomics (int64_t) cmp_xchg_weak success(seq_cst, acquire) 7.56281 8.30719 9.84%
Baselib atomics (int64_t) cmp_xchg_weak fail(seq_cst, seq_cst) 5.17506 7.59047 46.67%
Baselib atomics (int64_t) cmp_xchg_weak success(seq_cst, seq_cst) 5.5682 7.65737 37.52%
Baselib atomics (int64_t) cmp_xchg_strong fail(relaxed, relaxed) 5.21179 5.39832 3.58%
Baselib atomics (int64_t) cmp_xchg_strong success(relaxed, relaxed) 5.69832 5.42264 -4.84%
Baselib atomics (int64_t) cmp_xchg_strong fail(acquire, relaxed) 5.23045 5.39779 3.20%
Baselib atomics (int64_t) cmp_xchg_strong success(acquire, relaxed) 5.83734 5.42521 -7.06%
Baselib atomics (int64_t) cmp_xchg_strong fail(acquire, acquire) 5.21508 5.45208 4.54%
Baselib atomics (int64_t) cmp_xchg_strong success(acquire, acquire) 5.87539 5.40666 -7.98%
Baselib atomics (int64_t) cmp_xchg_strong fail(release, relaxed) 5.24858 5.45319 3.90%
Baselib atomics (int64_t) cmp_xchg_strong success(release, relaxed) 5.78798 5.41572 -6.43%
Baselib atomics (int64_t) cmp_xchg_strong fail(acq_rel, relaxed) 5.24169 5.39858 2.99%
Baselib atomics (int64_t) cmp_xchg_strong success(acq_rel, relaxed) 5.77166 5.46338 -5.34%
Baselib atomics (int64_t) cmp_xchg_strong fail(acq_rel, acquire) 5.22574 5.39832 3.30%
Baselib atomics (int64_t) cmp_xchg_strong success(acq_rel, acquire) 5.66156 5.38044 -4.97%
Baselib atomics (int64_t) cmp_xchg_strong fail(seq_cst, relaxed) 5.31198 6.82313 28.45%
Baselib atomics (int64_t) cmp_xchg_strong success(seq_cst, relaxed) 7.60092 8.2542 8.59%
Baselib atomics (int64_t) cmp_xchg_strong fail(seq_cst, acquire) 5.3248 6.84395 28.53%
Baselib atomics (int64_t) cmp_xchg_strong success(seq_cst, acquire) 7.48102 8.25419 10.34%
Baselib atomics (int64_t) cmp_xchg_strong fail(seq_cst, seq_cst) 5.24991 7.65791 45.87%
Baselib atomics (int64_t) cmp_xchg_strong success(seq_cst, seq_cst) 7.41788 7.63039 2.86%
std::atomic (int64_t) load(relaxed) 0.00039 0.00039 0.00%
std::atomic (int64_t) load(acquire) 0.00039 0.00039 0.00%
std::atomic (int64_t) load(seq_cst) 0.0003925 0.000395796 0.84%
std::atomic (int64_t) store(relaxed) 0.00039 0.0003925 0.64%
std::atomic (int64_t) store(release) 0.00039125 0.00039 -0.32%
std::atomic (int64_t) store(seq_cst) 0.00039125 0.00039 -0.32%
std::atomic (int64_t) fetch_add(relaxed) 5.76841 5.04124 -12.61%
std::atomic (int64_t) fetch_add(acquire) 5.67638 5.23763 -7.73%
std::atomic (int64_t) fetch_add(release) 5.71074 5.21996 -8.59%
std::atomic (int64_t) fetch_add(acq_rel) 5.68082 5.17513 -8.90%
std::atomic (int64_t) fetch_add(seq_cst) 5.71148 5.25163 -8.05%
std::atomic (int64_t) fetch_and(relaxed) 5.76754 5.06334 -12.21%
std::atomic (int64_t) fetch_and(acquire) 5.68766 5.17591 -9.00%
std::atomic (int64_t) fetch_and(release) 5.70922 5.23549 -8.30%
std::atomic (int64_t) fetch_and(acq_rel) 5.76538 5.23106 -9.27%
std::atomic (int64_t) fetch_and(seq_cst) 5.71144 5.20844 -8.81%
std::atomic (int64_t) fetch_or(relaxed) 5.71144 5.06334 -11.35%
std::atomic (int64_t) fetch_or(acquire) 5.67575 5.17685 -8.79%
std::atomic (int64_t) fetch_or(release) 5.70869 5.17661 -9.32%
std::atomic (int64_t) fetch_or(acq_rel) 5.71081 5.21962 -8.60%
std::atomic (int64_t) fetch_or(seq_cst) 5.71097 5.2925 -7.33%
std::atomic (int64_t) fetch_xor(relaxed) 5.76597 5.04114 -12.57%
std::atomic (int64_t) fetch_xor(acquire) 5.71112 5.1749 -9.39%
std::atomic (int64_t) fetch_xor(release) 5.71097 5.28538 -7.45%
std::atomic (int64_t) fetch_xor(acq_rel) 5.76538 5.25347 -8.88%
std::atomic (int64_t) fetch_xor(seq_cst) 5.71125 5.209 -8.79%
std::atomic (int64_t) exchange(relaxed) 0.713731 0.720574 0.96%
std::atomic (int64_t) exchange(acquire) 5.52406 5.2086 -5.71%
std::atomic (int64_t) exchange(release) 1.42757 1.78423 24.98%
std::atomic (int64_t) exchange(acq_rel) 5.48457 5.17565 -5.63%
std::atomic (int64_t) exchange(seq_cst) 5.52522 5.17559 -6.33%
std::atomic (int64_t) cmp_xchg_weak(relaxed, relaxed) fail 5.28103 6.86618 30.02%
std::atomic (int64_t) cmp_xchg_weak(relaxed, relaxed) success 5.57233 6.65872 19.50%
std::atomic (int64_t) cmp_xchg_weak(acquire, relaxed) fail 5.32297 7.65681 43.84%
std::atomic (int64_t) cmp_xchg_weak(acquire, relaxed) success 5.55278 7.04572 26.89%
std::atomic (int64_t) cmp_xchg_weak(acquire, acquire) fail 5.19573 7.5397 45.11%
std::atomic (int64_t) cmp_xchg_weak(acquire, acquire) success 5.65387 7.06106 24.89%
std::atomic (int64_t) cmp_xchg_weak(release, relaxed) fail 5.16921 8.64542 67.25%
std::atomic (int64_t) cmp_xchg_weak(release, relaxed) success 5.8789 8.09059 37.62%
std::atomic (int64_t) cmp_xchg_weak(acq_rel, relaxed) fail 5.14874 8.4178 63.49%
std::atomic (int64_t) cmp_xchg_weak(acq_rel, relaxed) success 5.62648 8.15887 45.01%
std::atomic (int64_t) cmp_xchg_weak(acq_rel, acquire) fail 5.19702 8.41399 61.90%
std::atomic (int64_t) cmp_xchg_weak(acq_rel, acquire) success 5.61241 8.0948 44.23%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, relaxed) fail 5.6107 8.45259 50.65%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, relaxed) success 5.74546 8.06755 40.42%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, acquire) fail 5.19561 8.4248 62.15%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, acquire) success 5.83114 8.08243 38.61%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, seq_cst) fail 5.44959 8.5268 56.47%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, seq_cst) success 5.68065 8.06806 42.03%
std::atomic (int64_t) cmp_xchg_strong(relaxed, relaxed) fail 5.37833 6.92512 28.76%
std::atomic (int64_t) cmp_xchg_strong(relaxed, relaxed) success 5.59084 6.70653 19.96%
std::atomic (int64_t) cmp_xchg_strong(acquire, relaxed) fail 5.22344 7.39602 41.59%
std::atomic (int64_t) cmp_xchg_strong(acquire, relaxed) success 5.53385 7.11215 28.52%
std::atomic (int64_t) cmp_xchg_strong(acquire, acquire) fail 5.19899 7.37137 41.78%
std::atomic (int64_t) cmp_xchg_strong(acquire, acquire) success 5.71289 7.13459 24.89%
std::atomic (int64_t) cmp_xchg_strong(release, relaxed) fail 5.18419 8.51836 64.31%
std::atomic (int64_t) cmp_xchg_strong(release, relaxed) success 5.62459 8.22757 46.28%
std::atomic (int64_t) cmp_xchg_strong(acq_rel, relaxed) fail 5.17071 8.19045 58.40%
std::atomic (int64_t) cmp_xchg_strong(acq_rel, relaxed) success 5.7873 8.10029 39.97%
std::atomic (int64_t) cmp_xchg_strong(acq_rel, acquire) fail 5.16332 8.22449 59.29%
std::atomic (int64_t) cmp_xchg_strong(acq_rel, acquire) success 5.87045 8.11597 38.25%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, relaxed) fail 5.57339 8.15235 46.27%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, relaxed) success 5.67978 8.15304 43.54%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, acquire) fail 5.19931 8.14173 56.59%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, acquire) success 5.75014 8.08624 40.63%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, seq_cst) fail 5.51236 8.15659 47.97%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, seq_cst) success 5.63752 8.13753 44.35%

The most important observations:

  • baselib’s fetch_[op], exchange, and compare-and-exchange show a clear improvement of some 10-20% – which is expected – and is a great result!
  • Operations with SEQ_CST memory ordering show a much smaller improvement, and even regressions in some cases! This needs investigation.
  • std::atomic has significantly regressed in the compare_exchange_weak and compare_exchange_strong operations.

Round 3. Mandatory assembly check!

Okay, now let’s find out what’s happening behind the scenes and try to explain the performance results of the SEQ_CST operations.

Here’s an example outline-atomics function:

00000000007a3d60 <__aarch64_cas4_acq_rel>:
  7a3d60: 5f 24 03 d5  	hint	#34
  7a3d64: 10 01 00 90  	adrp	x16, 0x7c3000 <__aarch64_cas8_acq+0x4>
  7a3d68: 10 42 43 39  	ldrb	w16, [x16, #208]
  7a3d6c: 70 00 00 34  	cbz	w16, 0x7a3d78 <__aarch64_cas4_acq_rel+0x18>
  7a3d70: 41 fc e0 88  	casal	w0, w1, [x2]
  7a3d74: c0 03 5f d6  	ret
  7a3d78: f0 03 00 2a  	mov	w16, w0
  7a3d7c: 40 fc 5f 88  	ldaxr	w0, [x2]
  7a3d80: 1f 00 10 6b  	cmp	w0, w16
  7a3d84: 61 00 00 54  	b.ne	0x7a3d90 <__aarch64_cas4_acq_rel+0x30>
  7a3d88: 41 fc 11 88  	stlxr	w17, w1, [x2]
  7a3d8c: 91 ff ff 35  	cbnz	w17, 0x7a3d7c <__aarch64_cas4_acq_rel+0x1c>
  7a3d90: c0 03 5f d6  	ret 

It has the branch on CPU capabilities (cbz), the LSE implementation (casal), and the LL/SC implementation (ldaxr/stlxr).

The call site looks something like this:

  4f4170: c0 fd 9f 52  	mov	w0, #65518
  4f4174: c1 fd 9f 52  	mov	w1, #65518
  4f4178: e2 b3 04 91  	add	x2, sp, #300
  4f417c: e0 ff af 72  	movk	w0, #32767, lsl #16
  4f4180: e1 ff af 72  	movk	w1, #32767, lsl #16
  4f4184: f7 be 0a 94  	bl	0x7a3d60 <__aarch64_cas4_acq_rel>
>>4f4188: bf 3b 03 d5  	dmb	ish 

Highlighted is the full memory barrier generated by baselib when SEQ_CST is used. The corresponding code in baselib has this curious comment:

// Patch gcc and clang intrinsics to achieve a sequentially consistent barrier.
// As of writing Clang 9, GCC 9 none of them produce a seq cst barrier for load-store operations.
// To fix this we switch load store to be acquire release with a full final barrier.

However, after a lengthy discussion with the folks who wrote the code and our friends at Arm, and after reading up on the research (https://plv.mpi-sws.org/scfix/paper.pdf, https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html, https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/armv8-sequential-consistency), we came to the conclusion that the full memory barrier is too conservative in this case. (The full barrier is still needed for the legacy __sync GCC builtins, though.) We’ll be fixing the SEQ_CST atomics in baselib by removing the barrier and re-measuring the performance of outline atomics, if time permits.
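At the source level, the pattern the comment describes amounts to roughly this (a simplified sketch, not the literal baselib code):

#include <atomic>

// seq_cst compare-exchange expressed as acq_rel plus a trailing full
// barrier; the fence is the "dmb ish" highlighted in the disassembly above.
template <typename T>
bool cmp_xchg_seq_cst_patched(std::atomic<T>& obj, T& expected, T desired)
{
    bool ok = obj.compare_exchange_strong(expected, desired,
                                          std::memory_order_acq_rel,
                                          std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier
    return ok;
}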

Here’s the assembly of the LL/SC atomic version for comparison:

  4fbb28: 09 fd 5f 88  	ldaxr	w9, [x8]
  4fbb2c: 3f 01 19 6b  	cmp	w9, w25
  4fbb30: 61 00 00 54  	b.ne	0x4fbb3c 
  4fbb34: 17 fd 09 88  	stlxr	w9, w23, [x8]
  4fbb38: 02 00 00 14  	b	0x4fbb40
  4fbb3c: 5f 3f 03 d5  	clrex
  4fbb40: bf 3b 03 d5  	dmb	ish 

Looks pretty logical and neat (mind the memory barrier).

Why is the outline-atomics code slower than pure LL/SC, given that both have the barrier? It can’t just be the function call (branch-and-link) overhead, can it? Let’s dig deeper with a sampling profiler…

Round 4. Profiling benchmarks

There are two goals to achieve with a sampling profiler: first, confirm that the LSE path is taken; second, find out where most of the time is spent.

I usually stick to Arm Mobile Studio (Arm Streamline) as a sampling profiler, but since the baselib benchmark is an ELF executable and not an APK, I didn’t find a way to profile it out of the box in Streamline.

UPDATE: I got a follow-up from Pete Harris @ Arm:

“It works – a little manual setup, but it’s not too painful.

  • Push the binary you want to test and gatord to /data/local/tmp
  • From an Android device shell in /data/local/tmp:
    • Run setprop security.perf_harden 0
    • Run ./gatord --app <app name> <app command line args>. This will pause and wait for the host tool to connect before starting the app

From the host tool:

  • In the “Start” tab, select “TCP (advanced)”
  • Select the Android device in the device table
  • Configure counters and “Start” when ready (the “Configure application” part of this doesn’t work for Android)
  • Once captured, right-click on the capture in the “Streamline Data” tab and select “Analyze”. Add the ELF images / symbol files you need and reanalyze.”

So it’s doable in Streamline too, but this time I used Google’s simpleperf, bundled with the Android NDK, instead. Its “native program” option -np is designed specifically for such cases.

Here’s the answer to the first question:

profile_simpleperf_cas4_acq_rel

Yes, runtime detection is working correctly. Yes, LSE instructions are being used (in this particular case, <unknown> is cas*, the compare-and-swap instruction – simpleperf didn’t enable the newer extensions when dumping the disassembly – and hint #34 is the PAC instruction paciasp). Yes, they are quite fast: there is no clear bottleneck, even if the sample attribution is not 100% precise because the microarchitecture prefers not to stop at this or that instruction. No issues found here.

Next, let’s check the profile of the call site.

profile_simpleperf_outline_callsite

It is unlikely that subs really takes that much CPU time. Maybe the samples are one instruction off here, but it could also be a side effect of the full memory barrier: the next instruction has to wait longer than expected. In any case, most of the time is spent in and around the memory barrier, which doesn’t really answer the question – why is it performing slower than the non-LSE version?

And here’s the profile of the LL/SC version:

profile_simpleperf_ll_sc

(please disregard the yellow markup)

This one seems to spend most of its time in subs and very little in the memory barrier – which we could once again attribute to microarchitectural preferences, but that assumption is speculative.

Anyway, there is no clear answer as to where the time goes in the LSE version compared to the LL/SC version when a full memory barrier is in place. I can only state that the outline-atomics version is slower – maybe it’s related to the sequence of function call => memory barrier => flag-setting arithmetic.

Given that we later decided to remove the full memory barrier from baselib’s SEQ_CST implementations, this regression becomes much less significant.

Round 5. STL atomics: weak or strong?

The STL’s std::atomic is used in baselib’s benchmarks only as a baseline, but it’s still curious to find out why it became slower with outline atomics than without them.

When using LL/SC, the benchmark calls into __cxx_atomic_compare_exchange_weak<int>(), but with -moutline-atomics, for some reason it becomes the strong exchange, __cxx_atomic_compare_exchange_strong<int>(), which is significantly slower (almost by half).

At first I thought there was a bug in the benchmark, but no, it’s correct.

When I remove the strong compare-and-exchange from the benchmark, the assembly shows calls into the weak version in the STL, which is correct! But when there are calls to both the weak and strong versions of the STL atomics, only the strong one seems to be used. This hints at identical code folding (ICF) – but why is it happening?

Since our code uses Catch benchmarking, I first thought it was to blame. Looking at the resulting code after all the preprocessing, I see lambdas:

holder = [&]()
{
    T prev = value2;
    memory.compare_exchange_strong(prev, value2, order1, order2);
};
holder = [&]()
{
    T prev = value2;
    memory.compare_exchange_weak(prev, value2, order1, order2);
};

so I thought maybe it’s the lambdas that fold. However, by tuning the input and tests here and there, I was able to dump the assembly of the STL’s weak and strong versions with outline atomics enabled:

bool std::__ndk1::__cxx_atomic_compare_exchange_strong<int>(std::__ndk1::__cxx_atomic_base_impl<int>*, int*, int, std::__ndk1::memory_order, std::__ndk1::memory_order)
bool std::__ndk1::__cxx_atomic_compare_exchange_weak<int>(std::__ndk1::__cxx_atomic_base_impl<int>*, int*, int, std::__ndk1::memory_order, std::__ndk1::memory_order)

They are identical (apart from branch addresses)! This answers the question of why LLVM folded the two functions into one. I thought, okay, maybe it’s an NDK bug – so I tried the latest (at the time) r25 beta 4. Same result. The generated code is different from r23, but the weak and strong versions in r25 are still identical to each other.

Trying to isolate it, I created a repro in godbolt; a sketch of it follows below. The assembly with outline intrinsics is identical for the strong and weak versions.
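The repro was along these lines (a reconstruction, not the exact godbolt code): compile for AArch64 with -moutline-atomics and diff the disassembly of the two functions.

#include <atomic>

// With outline atomics, both of these lower to the same call into
// __aarch64_cas4_acq_rel, making their bodies identical.
bool cas_weak(std::atomic<int>& a, int& expected, int desired)
{
    return a.compare_exchange_weak(expected, desired,
                                   std::memory_order_acq_rel,
                                   std::memory_order_acquire);
}

bool cas_strong(std::atomic<int>& a, int& expected, int desired)
{
    return a.compare_exchange_strong(expected, desired,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire);
}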

With LL/SC, the assembly does differ, having stxr vs. stlxr instruction.

I raised the question with the Arm compiler folks, and we found out that the pure-LSE versions of the weak and strong exchange are exactly the same (confirmed in godbolt). The outline-atomics version simply builds upon the LSE version, so the two are still the same. This doesn’t explain the regression, though: the LSE instructions should be faster than LL/SC, weak or strong.

The profile confirms that LSE is actually used:

profile_simpleperf_std_atomic_cas4_acq_rel

Overall, some of baselib’s std::atomic benchmarks show performance regressions with outline atomics vs. LL/SC on LSE-capable devices. This shouldn’t happen. We spent some time with our friends at Arm investigating this, and I’ll share the results in a separate post.

Round 6. Pure LSE

Let’s find out the cost of the function call and the CPU capabilities branch introduced by the outline atomics implementation. To remove the function call overhead, we build a binary whose atomics are implemented as pure inlined LSE instructions.

At first, I tried acquiring some skill at writing inline assembly in C, but after a few hours I realized that I could just compile with -march=armv8-a+lse, and the compiler would do the job for me automatically! Of course, such a binary can only run on an LSE-capable device, and it generates an illegal-instruction exception on an older device (verified locally).

Now, on to performance. Selected benchmark results are available in the table below.

perf_pure_lse

Results:

  • As expected, pure LSE is always faster than outline atomics
  • The absolute overhead of calling the intrinsic function and checking the boolean containing the CPU capabilities lies in the range of 0.2-0.3 ns
  • The relative overhead varies from 3 to 7%.

Overall, the result is expected: the overhead is there, but it’s not very high. The performance benefits LSE brings are definitely higher than that.

Round 7. Benchmarking outline atomics on old, non-LSE devices

The test device for these local tests was a Huawei Honor 9 (Cortex-A53).

Testcase Benchmark default (ns) outline (ns) Change
Baselib atomics (int64_t) fetch_add(relaxed) 8.7645 13.472 53.71%
Baselib atomics (int64_t) fetch_add(acquire) 8.7609 13.509 54.20%
Baselib atomics (int64_t) fetch_add(release) 9.434 14.158 50.07%
Baselib atomics (int64_t) fetch_add(acq_rel) 9.4301 14.139 49.93%
Baselib atomics (int64_t) fetch_add(seq_cst) 12.126 14.821 22.22%
Baselib atomics (int64_t) fetch_and(relaxed) 8.7579 13.479 53.91%
Baselib atomics (int64_t) fetch_and(acquire) 8.7507 13.483 54.08%
Baselib atomics (int64_t) fetch_and(release) 9.4248 14.155 50.19%
Baselib atomics (int64_t) fetch_and(acq_rel) 9.4281 14.25 51.14%
Baselib atomics (int64_t) fetch_and(seq_cst) 12.12 14.827 22.33%
Baselib atomics (int64_t) fetch_or(relaxed) 8.7657 13.474 53.71%
Baselib atomics (int64_t) fetch_or(acquire) 8.7623 13.477 53.81%
Baselib atomics (int64_t) fetch_or(release) 9.4364 14.152 49.97%
Baselib atomics (int64_t) fetch_or(acq_rel) 9.4307 14.164 50.19%
Baselib atomics (int64_t) fetch_or(seq_cst) 12.133 14.829 22.22%
Baselib atomics (int64_t) fetch_xor(relaxed) 8.7602 13.472 53.79%
Baselib atomics (int64_t) fetch_xor(acquire) 8.7708 13.49 53.81%
Baselib atomics (int64_t) fetch_xor(release) 9.475 14.254 50.44%
Baselib atomics (int64_t) fetch_xor(acq_rel) 9.4521 14.145 49.65%
Baselib atomics (int64_t) fetch_xor(seq_cst) 12.134 14.84 22.30%
Baselib atomics (int64_t) exchange(relaxed) 0.00017724 0.00017163 -3.17%
Baselib atomics (int64_t) exchange(acquire) 8.7589 13.469 53.78%
Baselib atomics (int64_t) exchange(release) 0.00017292 0.00017162 -0.75%
Baselib atomics (int64_t) exchange(acq_rel) 10.784 15.503 43.76%
Baselib atomics (int64_t) exchange(seq_cst) 12.142 16.176 33.22%
Baselib atomics (int64_t) cmp_xchg_weak fail(relaxed, relaxed) 4.7 12.808 172.51%
Baselib atomics (int64_t) cmp_xchg_weak success(relaxed, relaxed) 8.094 14.168 75.04%
Baselib atomics (int64_t) cmp_xchg_weak fail(acquire, relaxed) 4.723 11.465 142.75%
Baselib atomics (int64_t) cmp_xchg_weak success(acquire, relaxed) 8.0849 14.251 76.27%
Baselib atomics (int64_t) cmp_xchg_weak fail(acquire, acquire) 4.716 12.806 171.54%
Baselib atomics (int64_t) cmp_xchg_weak success(acquire, acquire) 8.0929 14.159 74.96%
Baselib atomics (int64_t) cmp_xchg_weak fail(release, relaxed) 4.7166 11.454 142.84%
Baselib atomics (int64_t) cmp_xchg_weak success(release, relaxed) 8.7658 14.822 69.09%
Baselib atomics (int64_t) cmp_xchg_weak fail(acq_rel, relaxed) 4.7171 12.797 171.29%
Baselib atomics (int64_t) cmp_xchg_weak success(acq_rel, relaxed) 8.7552 14.828 69.36%
Baselib atomics (int64_t) cmp_xchg_weak fail(acq_rel, acquire) 4.7161 11.442 142.62%
Baselib atomics (int64_t) cmp_xchg_weak success(acq_rel, acquire) 8.7553 14.824 69.31%
Baselib atomics (int64_t) cmp_xchg_weak fail(seq_cst, relaxed) 4.7327 12.9 172.57%
Baselib atomics (int64_t) cmp_xchg_weak success(seq_cst, relaxed) 13.471 16.173 20.06%
Baselib atomics (int64_t) cmp_xchg_weak fail(seq_cst, acquire) 4.7116 14.16 200.53%
Baselib atomics (int64_t) cmp_xchg_weak success(seq_cst, acquire) 13.479 16.189 20.11%
Baselib atomics (int64_t) cmp_xchg_weak fail(seq_cst, seq_cst) 7.4291 12.812 72.46%
Baselib atomics (int64_t) cmp_xchg_weak success(seq_cst, seq_cst) 12.132 15.492 27.70%
Baselib atomics (int64_t) cmp_xchg_strong fail(relaxed, relaxed) 4.7175 11.465 143.03%
Baselib atomics (int64_t) cmp_xchg_strong success(relaxed, relaxed) 10.103 14.152 40.08%
Baselib atomics (int64_t) cmp_xchg_strong fail(acquire, relaxed) 4.7216 12.803 171.16%
Baselib atomics (int64_t) cmp_xchg_strong success(acquire, relaxed) 10.11 14.241 40.86%
Baselib atomics (int64_t) cmp_xchg_strong fail(acquire, acquire) 4.7157 11.468 143.19%
Baselib atomics (int64_t) cmp_xchg_strong success(acquire, acquire) 10.113 14.161 40.03%
Baselib atomics (int64_t) cmp_xchg_strong fail(release, relaxed) 4.7336 12.828 171.00%
Baselib atomics (int64_t) cmp_xchg_strong success(release, relaxed) 10.774 14.813 37.49%
Baselib atomics (int64_t) cmp_xchg_strong fail(acq_rel, relaxed) 4.7194 11.46 142.83%
Baselib atomics (int64_t) cmp_xchg_strong success(acq_rel, relaxed) 10.8 14.842 37.43%
Baselib atomics (int64_t) cmp_xchg_strong fail(acq_rel, acquire) 4.715 12.8 171.47%
Baselib atomics (int64_t) cmp_xchg_strong success(acq_rel, acquire) 10.776 14.821 37.54%
Baselib atomics (int64_t) cmp_xchg_strong fail(seq_cst, relaxed) 4.7202 14.24 201.68%
Baselib atomics (int64_t) cmp_xchg_strong success(seq_cst, relaxed) 13.486 16.179 19.97%
Baselib atomics (int64_t) cmp_xchg_strong fail(seq_cst, acquire) 4.7181 12.824 171.80%
Baselib atomics (int64_t) cmp_xchg_strong success(seq_cst, acquire) 13.485 16.189 20.05%
Baselib atomics (int64_t) cmp_xchg_strong fail(seq_cst, seq_cst) 7.4121 12.154 63.98%
Baselib atomics (int64_t) cmp_xchg_strong success(seq_cst, seq_cst) 12.801 15.5 21.08%
std::atomic (int64_t) fetch_add(relaxed) 10.111 14.226 40.70%
std::atomic (int64_t) fetch_add(acquire) 12.807 17.533 36.90%
std::atomic (int64_t) fetch_add(release) 13.485 18.209 35.03%
std::atomic (int64_t) fetch_add(acq_rel) 13.464 17.523 30.15%
std::atomic (int64_t) fetch_add(seq_cst) 13.469 17.535 30.19%
std::atomic (int64_t) fetch_and(relaxed) 10.121 14.14 39.71%
std::atomic (int64_t) fetch_and(acquire) 12.819 17.543 36.85%
std::atomic (int64_t) fetch_and(release) 13.464 18.29 35.84%
std::atomic (int64_t) fetch_and(acq_rel) 13.506 17.521 29.73%
std::atomic (int64_t) fetch_and(seq_cst) 13.48 17.535 30.08%
std::atomic (int64_t) fetch_or(relaxed) 10.116 14.17 40.08%
std::atomic (int64_t) fetch_or(acquire) 12.802 17.514 36.81%
std::atomic (int64_t) fetch_or(release) 13.483 18.199 34.98%
std::atomic (int64_t) fetch_or(acq_rel) 13.481 17.525 30.00%
std::atomic (int64_t) fetch_or(seq_cst) 13.48 17.5 29.82%
std::atomic (int64_t) fetch_xor(relaxed) 10.135 14.27 40.80%
std::atomic (int64_t) fetch_xor(acquire) 12.798 17.517 36.87%
std::atomic (int64_t) fetch_xor(release) 13.471 18.187 35.01%
std::atomic (int64_t) fetch_xor(acq_rel) 13.482 17.516 29.92%
std::atomic (int64_t) fetch_xor(seq_cst) 13.62 17.516 28.60%
std::atomic (int64_t) exchange(relaxed) 4.2759 4.0396 -5.53%
std::atomic (int64_t) exchange(acquire) 12.796 17.529 36.99%
std::atomic (int64_t) exchange(release) 9.437 9.446 0.10%
std::atomic (int64_t) exchange(acq_rel) 14.826 18.97 27.95%
std::atomic (int64_t) exchange(seq_cst) 14.821 18.872 27.33%
std::atomic (int64_t) cmp_xchg_weak(relaxed, relaxed) fail 16.841 28.313 68.12%
std::atomic (int64_t) cmp_xchg_weak(relaxed, relaxed) success 14.151 27.637 95.30%
std::atomic (int64_t) cmp_xchg_weak(acquire, relaxed) fail 22.244 31.011 39.41%
std::atomic (int64_t) cmp_xchg_weak(acquire, relaxed) success 19.546 29.074 48.75%
std::atomic (int64_t) cmp_xchg_weak(acquire, acquire) fail 19.545 31 58.61%
std::atomic (int64_t) cmp_xchg_weak(acquire, acquire) success 18.863 28.982 53.64%
std::atomic (int64_t) cmp_xchg_weak(release, relaxed) fail 20.208 30.334 50.11%
std::atomic (int64_t) cmp_xchg_weak(release, relaxed) success 19.544 30.014 53.57%
std::atomic (int64_t) cmp_xchg_weak(acq_rel, relaxed) fail 22.266 29.694 33.36%
std::atomic (int64_t) cmp_xchg_weak(acq_rel, relaxed) success 20.232 29.756 47.07%
std::atomic (int64_t) cmp_xchg_weak(acq_rel, acquire) fail 19.547 29.766 52.28%
std::atomic (int64_t) cmp_xchg_weak(acq_rel, acquire) success 20.216 29.784 47.33%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, relaxed) fail 24.274 32.373 33.36%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, relaxed) success 22.851 32.993 44.38%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, acquire) fail 20.89 33.678 61.22%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, acquire) success 20.88 35.128 68.24%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, seq_cst) fail 22.91 32.35 41.20%
std::atomic (int64_t) cmp_xchg_weak(seq_cst, seq_cst) success 21.559 33.07 53.39%
std::atomic (int64_t) cmp_xchg_strong(relaxed, relaxed) fail 16.856 28.309 67.95%
std::atomic (int64_t) cmp_xchg_strong(relaxed, relaxed) success 14.151 27.72 95.89%
std::atomic (int64_t) cmp_xchg_strong(acquire, relaxed) fail 22.247 30.994 39.32%
std::atomic (int64_t) cmp_xchg_strong(acquire, relaxed) success 19.531 28.984 48.40%
std::atomic (int64_t) cmp_xchg_strong(acquire, acquire) fail 20.884 30.979 48.34%
std::atomic (int64_t) cmp_xchg_strong(acquire, acquire) success 19.552 29.102 48.84%
std::atomic (int64_t) cmp_xchg_strong(release, relaxed) fail 20.21 30.34 50.12%
std::atomic (int64_t) cmp_xchg_strong(release, relaxed) success 19.537 29.982 53.46%
std::atomic (int64_t) cmp_xchg_strong(acq_rel, relaxed) fail 22.227 29.674 33.50%
std::atomic (int64_t) cmp_xchg_strong(acq_rel, relaxed) success 20.196 29.732 47.22%
std::atomic (int64_t) cmp_xchg_strong(acq_rel, acquire) fail 20.86 29.6 41.90%
std::atomic (int64_t) cmp_xchg_strong(acq_rel, acquire) success 20.205 29.645 46.72%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, relaxed) fail 24.917 32.326 29.73%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, relaxed) success 22.35 33.308 49.03%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, acquire) fail 20.898 33.711 61.31%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, acquire) success 20.219 35.049 73.35%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, seq_cst) fail 22.916 32.376 41.28%
std::atomic (int64_t) cmp_xchg_strong(seq_cst, seq_cst) success 20.902 33.571 60.61%

The absolute timings are significantly worse than on the S22 (more than 2x), which is not surprising.

Comparing the outline version to the “normal” one (inline LL/SC atomics), we can draw the following conclusions:

  • Outline atomics are always slower than “normal” builds
  • The absolute overhead of outline atomics lies in the range of 5 to 10 ns (!!), which is much higher than on LSE-capable devices. This applies to both the STL’s and baselib’s atomics.
  • The relative overhead is very high in this case – higher than the 20% threshold suggested at the beginning, and in some cases more than 100%.

Our friends at Arm did their own measurements and confirmed that the overhead of outline atomics on Cortex-A53 is higher than that on LSE-capable devices, likely caused by less efficient branch prediction.  

Conclusion

Applying outline atomics is really easy – it’s just a compiler flag. The feature’s implementation is very appealing because of the backward compatibility it provides.

The results on modern, LSE-capable devices are positive. The function call overhead of outline atomics is much smaller than the benefit brought by the new atomic instructions.

However, the performance results on older devices are quite disappointing: the overhead is significantly higher than the 20% threshold we set at the beginning.

One of Unity’s core values is Users First, and unfortunately, we cannot sacrifice a large portion of our users (Cortex-A53 is still very, very popular) for the benefit of another group. We can wait until the share of non-LSE-capable devices becomes negligible, or try some other tactics, such as our own dispatch at a higher level – for example, in lockless collections or the job system (a sketch follows below).
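By “dispatch at a higher level” I mean something like the following sketch (hypothetical, not existing Unity code): select a whole specialized code path once at startup, so the per-atomic-operation branch disappears.

#include <sys/auxv.h>

#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1 << 8)
#endif

// Two builds of the same hot loop (hypothetical names): one compiled
// with -march=armv8-a+lse, one without.
void RunJobsLLSC();
void RunJobsLSE();

using RunJobsFn = void (*)();

// Resolved once at startup; afterwards every call is a plain indirect
// call with no per-operation capability check.
static const RunJobsFn g_runJobs =
    (getauxval(AT_HWCAP) & HWCAP_ATOMICS) ? RunJobsLSE : RunJobsLLSC;

void RunJobs() { g_runJobs(); }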

As a result, we are not adopting outline atomics at the moment. This conclusion will be reassessed in a few years.

Your mileage may vary depending on your target CPUs: if the share of old in-order cores is low for you, adopting outline atomics may be a valid trade-off.

Function Multi-Versioning

Since the approach where the compiler automatically generates the necessary runtime dispatch for you is nice and elegant, and it helps optimize your code for the latest Arm architectures, Arm started developing a generic version of this feature, called Function Multi-Versioning, which is being added to Clang and GCC. Here’s the description: https://github.com/ARM-software/acle/blob/main/main/acle.md#function-multi-versioning – check it out if you want to target newer Arm architecture extensions while maintaining backward compatibility!
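For a taste of it, here is a minimal sketch based on the ACLE document linked above (an assumption on my part that your toolchain already supports FMV; the attribute spelling comes from the spec):

#include <atomic>

// Two versions of the same function; the compiler emits a resolver that
// picks the "lse" version at load time on capable CPUs and the default
// one elsewhere.
__attribute__((target_version("lse")))
int bump(std::atomic<int>& v) { return v.fetch_add(1); }

__attribute__((target_version("default")))
int bump(std::atomic<int>& v) { return v.fetch_add(1); }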