Reading the RISC-V specification is very fun and a totally normal hobby that’s not weird at all. Anyway, there is an interesting remark in the specification of the newly ratified Zalasr extension for load-acquire and store-release:

The versions without the aq bit set are RESERVED. LD.{AQ, AQRL} is RV64-only.

The aq bit is mandatory because the two encodings that would be produced are not seen as useful at this time. … The version with only the rl bit would correspond to load-release. Load-release has theoretical applications in seqlocks, but is not supported in language-level memory models and so is not included.

There is a similar remark on S*.RL about store-acquire.

At first glance, load-release and store-acquire don’t even seem to make sense. However, it should be remembered that coherence has four flavors. The most obvious one is write-read coherence, where a write is observed by a subsequent read. But there is also read-write coherence, where a read must not observe a write that comes after it. Ignoring some troublesome details arising from mixed-size accesses, every memory address has a total order of all reads and writes to it. So it’s pretty natural to consider a load-release and store-acquire pair.
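For a concrete, if trivial, illustration of that per-address total order, consider two relaxed accesses to the same atomic (a minimal sketch; the names below are made up for this example):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> x{0};

// Thread 1: the only write to x, fully relaxed.
void writer() {
  x.store(1, std::memory_order_relaxed);
}

// Thread 2: two relaxed reads of x.
bool reader() {
  uint64_t a = x.load(std::memory_order_relaxed);
  uint64_t b = x.load(std::memory_order_relaxed);
  // Read-read coherence: the second load can never observe an older value
  // than the first, because all accesses to x sit in a single total order.
  // Since x only ever goes from 0 to 1, this predicate always holds.
  return a <= b;
}

Read-write coherence is the analogous per-address guarantee for a read followed by a write, and it is the flavor a load-release / store-acquire pair builds on.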

Seqlocks

To better understand the semantics of load-release and store-acquire, let’s take a look at a seqlock. Seqlocks are a variant of reader-writer locks that tilt the priority toward writers. Traditional reader-writer locks, which an older post took a look at, may suffer from writer starvation. Seqlocks flip the priority: a writer can always proceed at any time, while readers may live-lock.

The structure of a seqlock is very similar to an optimistic lock on the reader side, plus a normal mutex on the writer side:

#include <atomic>
#include <cstdint>

// The sequence number
// - If odd, a write is in progress
// - If even, readers are safe to read
std::atomic<uint64_t> seq{0};

void do_read() {
  while (true) {
    uint64_t begin = seq.load(std::memory_order_acquire);
    if (begin & 1) {
      // Write in progress
      continue;
    }
    
    ////////
    // Read data
    ////////

    // Validate the optimistic read. Conceptually this is a load-release of
    // seq, emulated with an acquire fence: the data reads above cannot be
    // reordered past the second load of seq.
    std::atomic_thread_fence(std::memory_order_acquire);
    uint64_t end = seq.load(std::memory_order_relaxed);

    if (end == begin) {
      // No write happened during read
      break;
    }
  }
}

void do_write() {
  // Acquire lock. This is done by treating the low bit of seq as a mutex.
  uint64_t orig;
  while (true) {
    orig = seq.load(std::memory_order_relaxed);
    if (orig & 1) {
      // Another writer in progress
      continue;
    }

    bool updated = seq.compare_exchange_weak(
      orig, orig + 1,
      std::memory_order_relaxed,
      std::memory_order_relaxed
    );
    if (updated) {
      // Conceptually the CAS is a store-acquire on seq: the release fence
      // keeps the odd sequence number ordered before the data writes below.
      std::atomic_thread_fence(std::memory_order_release);
      break;
    }
  }

  ////////
  // Update data
  ////////

  // Release the lock
  seq.store(orig + 2, std::memory_order_release);
}

Notably, like any optimistic lock, whether the critical section actually succeeded is (also) checked at its end. This is what gives rise to a load-release / store-acquire pair.[1]

However, since we don’t have load-release and store-acquire in the C++ memory model, we have to emulate them with two fences.[2] This is really unfortunate, because from a hardware perspective there is an opportunity to implement load-release and store-acquire more cheaply than fences, e.g. by adding a load-wait-for-store control path for individual load-queue elements. Hardware can also simply fall back to a fence. In fact, RISC-V already has some cursed stuff such as fence w,r.

It’s quite counterintuitive that a load-release needs an acquire fence, and vice versa. This is because in C++ a release fence is only meaningful when followed by a store, not a load. Since we’re trying to prevent a load-load reordering, we need an acquire fence, which is coupled with all preceding loads. On RISC-V, the release fence will compile to fence rw, w, and the acquire fence will compile to fence r, rw. Ideally we would want fence r, r and fence w, w, but there is no way to express that in C++ right now.
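To make that mapping concrete, here is a minimal sketch of just the two emulated operations, annotated with the RISC-V code they lower to under the mapping above (the function names are only for illustration; the rest of the seqlock is elided):

#include <atomic>
#include <cstdint>

extern std::atomic<uint64_t> seq;  // the same sequence counter as above

uint64_t reader_validate() {
  // ... the data reads happen before this point ...
  // Emulated load-release: acquire fence + relaxed load.
  std::atomic_thread_fence(std::memory_order_acquire);  // fence r, rw (ideal: fence r, r)
  return seq.load(std::memory_order_relaxed);           // plain ld
}

void writer_after_lock() {
  // ... the successful CAS that made seq odd happens before this point ...
  // Emulated store-acquire: relaxed CAS above + release fence.
  std::atomic_thread_fence(std::memory_order_release);  // fence rw, w (ideal: fence w, w)
  // ... the data writes happen after this point ...
}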

Additional resources

The initial ideas for Zalasr seem to stem from the discussion in the RISC-V mailing list thread “RISC-V memory model topics”. The seqlock remark originally comes from the C++ paper N4455, “No Sane Compiler Would Optimize Atomics”.

For seqlocks themselves, there is an amazing explanation on StackOverflow by Peter Cordes. Every day I dream of being as knowledgeable as him. Please check it out.

  • [1] As for the other combinations: in any acquire/release pair, no matter which side is the read and which is the write, only a read produces a value that can be used as a condition, so a store-acquire / store-release pair does not seem useful. For a load-release / load-acquire pair, read-read coherence would need to be mediated by at least one write in between. There might be niche use cases, e.g. a dedicated thread that maintains a counter to schedule other threads whose work has a deadline but can be retried (by re-grabbing a sequence number from a ticket generator).
  • [2] Stronger orders like std::memory_order_seq_cst alone do not increase the strength of a single load or store. Per the C++ standard, a seq_cst load still only provides acquire ordering with respect to surrounding accesses, so the preceding data reads could still be reordered after it. The fence is necessary here.