Reading the RISC-V specification is very fun and a totally normal hobby that’s not weird at all. Anyway, there is an interesting remark in the specification of the newly ratified Zalasr extension for load-acquire and store-release:
The versions without the aq bit set are RESERVED.
LD.{AQ, AQRL} is RV64-only. The aq bit is mandatory because the two encodings that would be produced are not seen as useful at this time. … The version with only the rl bit would correspond to load-release. Load-release has theoretical applications in seqlocks, but is not supported in language-level memory models and so is not included.
There is a similar remark on S*.RL about store-acquire.
At first glance, load-release and store-acquire don’t even seem to make sense. However, it should be remembered that coherence has four flavors. The most obvious one is write-read coherence, where a write is observed by a subsequent read. But we also have read-write coherence, where a read must not observe a subsequent write. Ignoring some troublesome details arising from mixed-size accesses, every memory address has a total order of all reads and writes to it, and these flavors are statements about that order. So it’s pretty natural to consider a load-release and store-acquire pair.
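As a minimal sketch (the variable x and the labels R and W are made up for exposition): when a relaxed read returns the old value of a location, read-write coherence is what places that read before the write in the location’s per-address order.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0};

int main() {
    int r = -1;
    std::thread reader([&] { r = x.load(std::memory_order_relaxed); });  // R
    std::thread writer([] { x.store(1, std::memory_order_relaxed); });   // W
    reader.join();
    writer.join();

    if (r == 0) {
        // R did not observe W, so read-write coherence places R before W in
        // x's per-address order. This is the edge that a load-release on R
        // paired with a store-acquire on W would turn into synchronization,
        // which is exactly what the seqlock below relies on.
        std::printf("R is ordered before W\n");
    }
}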
Seqlocks
To get a better understanding of the semantics of load-release and store-acquire, let’s take a look at a seqlock. Seqlocks are a variant of reader-writer locks that lean a bit more toward writers. Traditional reader-writer locks, which an older post took a look at, may suffer from writer starvation. Seqlocks flip the priority: a write can always proceed at any time, and readers may live-lock instead.
The structure of a seqlock is very similar to an optimistic lock on the reader side, plus a normal mutex on the writer side:
#include <atomic>
#include <cstdint>

// The sequence number
// - If odd, a write is in progress
// - If even, readers are safe to read
std::atomic<uint64_t> seq = 0;

void do_read() {
    while (true) {
        uint64_t begin = seq.load(std::memory_order_acquire);
        if (begin & 1) {
            // Write in progress
            continue;
        }

        ////////
        // Read data
        ////////

        // "Release" the optimistic lock: validate that no write overlapped our reads
        std::atomic_thread_fence(std::memory_order_acquire);
        uint64_t end = seq.load(std::memory_order_relaxed);
        if (end == begin) {
            // No write happened during the read
            break;
        }
    }
}

void do_write() {
    // Acquire the lock. This is done by treating the low bit of seq as a mutex.
    uint64_t orig;
    while (true) {
        orig = seq.load(std::memory_order_relaxed);
        if (orig & 1) {
            // Another writer in progress
            continue;
        }
        bool updated = seq.compare_exchange_weak(
            orig, orig + 1,
            std::memory_order_relaxed,
            std::memory_order_relaxed
        );
        if (updated) {
            std::atomic_thread_fence(std::memory_order_release);
            break;
        }
    }

    ////////
    // Update data
    ////////

    // Release the lock
    seq.store(orig + 2, std::memory_order_release);
}
Notably, as in any optimistic lock, whether the lock was successfully held is (also) checked at the end of the critical section. This results in a load-release / store-acquire pair.[1]
However, since we don’t have load-release and store-acquire in the C++ memory model, we have to use two fences to emulate them.[2] This is really unfortunate, because from a hardware perspective there is an opportunity to implement load-release and store-acquire more cheaply than fences, i.e. by adding a load-wait-for-store control path for individual load-queue entries. Hardware can also simply fall back to a fence. In fact, RISC-V already has some cursed stuff such as fence w,r.
It’s quite counterintuitive that for a load-release we need an acquire fence, and vice versa. This is because in C++ a release fence is only meaningful when followed by a store, not a load. Since what we’re trying to prevent here is load-load reordering, we need an acquire fence, which takes effect together with all the loads sequenced before it. On RISC-V, the release fence will compile to fence rw, w, and the acquire fence will compile to fence r, rw. Ideally we would want fence r, r and fence w, w, but there is no way to express that in C++ right now.
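To make the mapping concrete, here is a sketch of the two fence idioms in isolation. The function and variable names are made up, the exact instruction sequences depend on the compiler, and the CAS in do_write plays the role of the plain relaxed store here.

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> word{0};

// Emulated load-release: the acquire fence keeps the loads before it from
// being reordered past the relaxed load that follows. Expected lowering of
// the fence on RISC-V: fence r, rw (fence r, r would already suffice).
uint64_t load_release_like() {
    std::atomic_thread_fence(std::memory_order_acquire);
    return word.load(std::memory_order_relaxed);
}

// Emulated store-acquire: the release fence keeps the stores after it from
// being reordered before the relaxed store. Expected lowering of the fence
// on RISC-V: fence rw, w (fence w, w would already suffice).
void store_acquire_like(uint64_t v) {
    word.store(v, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
}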
Additional resources
The initial ideas for Zalasr seem to stem from the discussion in the RISC-V mailing list thread RISC-V memory model topics. The seqlocks remark originally comes from the C++ paper N4455, No Sane Compiler Would Optimize Atomics.
For seqlocks themselves, there is an amazing explanation on StackOverflow by Peter Cordes. Every day I dream of being as knowledgeable as him. Please check it out.
- [1] As for the other combinations: in any acquire/release pair, no matter which side is the read and which is the write, only a read can produce a value that can be used as a condition, so a store-acquire / store-release pair does not seem useful. For a load-release / load-acquire pair, the read-read coherence would need to be mediated by at least one write in between. There might be niche use cases where a dedicated thread maintains a counter to schedule other threads whose work has a deadline but can be retried (re-grab a sequence number from a ticket generator).
- [2] Stronger orders like std::memory_order_seq_cst alone do not increase the strength of a single load or store here. Per the C++ standard, a seq-cst load may still be reordered with the preceding relaxed accesses, so the fence is necessary.