Search This Blog

Monday, January 08, 2007

Tools of the Trade - Atomic Primitives

In the wake of the recent memory barriers series, today we continue looking at topics related to multiprogramming, by taking a survey of the various atomic primitives. This particular topic is more for informational value than practical use (at least as far as LibQ is concerned, which does not support all of theseso you do not need to know the differences).

Atomic primitives are arguably the most important tool in programming that requires coordination between multiple threads and/or processors/cores. There are four basic types of atomic primitives: swap, fetch and phi, compare and swap, and load linked/store conditional. Usually these operations are performed on values the same size as or smaller than a word (the size of the processor's registers), though sometimes consecutive (single address) multiword versions are provided.

The most primitive (pun not intended) is the swap operation. This operation exchanges a value in memory with a value in a register, atomically setting the value in memory and returning the original value in memory. This is not actually very useful, in terms of multiprogramming, with only a single practical use: the construction of test and set (TAS; or test and test and set - TATAS) spinlocks. In these spinlocks, each thread attempting to enter the spinlock spins, exchanging 1 into the spinlock value in each iteration. If 0 is returned, the spinlock was free, and the thread now owns the spinlock; if 1 is returned, the spinlock was already owned by another thread, and the thread attempting to enter the spinlock must keep spinning. Ultimately, the owning thread leaves the spinlock by setting the spinlock value to 0.
C/C++ - Code
  1. ATOMIC word swap(volatile word &value, word newValue)
  2. {
  3. word oldValue = value;
  4. value = newValue;
  5. return oldValue;
  6. }
  7. void EnterSpinlock(volatile word &value)
  8. {
  9. // Test and test and set spinlock
  10. while (value == 1 || swap(value, 1) == 1) {}
  11. }
  12. void LeaveSpinlock(volatile word &value)
  13. {
  14. value = 0;
  15. }
This makes for an extremely crude spinlock. The fact that there is only a single value all threads share means that spinning can create a lot of cache coherency traffic, as all spinning processors will be writing to the same address, continually invalidating each other's caches. Furthermore, this mechanism precludes any kind of order preservation, as it's impossible to distinguish when a given thread began waiting; threads may enter the spinlock in any order, regardless of how long any thread has been waiting to enter the spinlock.

Next up the power scale is the fetch and phi family. All members of this family follow the same basic process: a value is atomically read from memory, modified by the processor, and written back, with the original value returned to the program. The modification performed can be almost anything; one of the most useful modifications is the add operation (in this case it's called fetch and add). The fetch and add operation is notably more useful than the swap operation, but is still less than ideal; in addition to test and set spinlocks, fetch and add can be used to create thread-safe counters, and spinlocks that both preserve order and (potentially) greatly reduce cache coherency traffic.
  1. ATOMIC word fetch_and_add(volatile word &value, word addend)
  2. {
  3. word oldValue = value;
  4. value += addend;
  5. return oldValue;
  6. }
  7. void EnterSpinlock (volatile word &seq, volatile word &cur)
  8. {
  9. word mySeq = fetch_and_add(seq, 1);
  10. while (cur != mySeq) {}
  11. }
  12. void LeaveSpinlock(volatile word &cur)
  13. {
  14. cur++;
  15. }
Next is the almost ubiquitous compare and swap (CAS) operation. In this operation, a value is read from memory, and if it matches a comparison value in the processor, a third value in the processor is atomically written in the memory location, and ultimately the original value in memory is returned to the program. This operation is very popular because it allows you to perform almost any operation on a single memory location atomically, by reading the value, modifying it, and then compare-and-swapping it back to memory. For this reason, it is considered to be a universal atomic primitive.
  1. ATOMIC word compare_and_swap(volatile word &value, word newValue, word comparand)
  2. {
  3. word oldValue = value;
  4. if (oldValue == comparand)
  5. value = newValue;
  6. return oldValue;
  7. }
Some architectures (such as the x86) support double-width compare and swap (atomic compare and swap of two consecutive words). While this is convenient, it should not be relied upon in portable software, as many architectures do not support it. Note that double-width compare and swap is NOT the same as double compare and swap (which we'll look at soon).

Another universal atomic primitive is the load linked/store conditional (LL/SC) pair of instructions. In this model, when a value is loaded linked into a register, a reservation is placed on that memory address. If that reservation still exists when the store conditional operation is performed, the store will be performed. If another thread writes to the memory location between a load linked and store conditional, the reservation is cleared, and any other threads will fail their store conditional (resulting in skipping the store altogether). If the store conditional fails, the program must loop back and try the operation again, beginning with the load linked.

In theory, the load linked/store conditional primitive is superior to the compare and swap operation. As any access to the memory address will clear the reservations of other processors, the total number of reads is 1, in contrast to the 2 reads with compare and swap (1 read to get the value initially, and a second read to verify the value is unchanged). Furthermore, the LL/SC operation may distinguish whether a write has occurred, regardless of whether the value was changed by the write (we'll come back to why this is a good thing when we discuss hazard pointers). Unfortunately, my research indicates that most implementations of LL/SC do not guarantee that a write will be distinguished if the written value is the same as the value at the time of the reserve.

Finally, the wild card: the double compare and swap (DCAS). In a double compare and swap, the compare and swap operation is performed simultaneously on the values in two memory addresses. Obviously this provides dramatically more power than any previous operation, which only operate on single addresses. Unfortunately, support for this primitive is extremely rare in real-world processors, and it is typically only used by lock-free algorithm designers that are unable to reduce their algorithm to single-address compare and swap operations.
  1. ATOMIC void double_compare_and_swap(volatile word &value1, word newValue1, word comparand1, word &oldValue1, volatile word &value2, word newValue2, word comparand2, word &oldValue2)
  2. {
  3. oldValue1 = value1;
  4. oldValue2 = value2;
  5. if (oldValue1 == comparand1 && oldValue2 == comparand2)
  6. {
  7. value1 = newValue1;
  8. value2 = newValue2;
  9. }
  10. }
Lastly, note that I'm using rather loose classification of these primitives. Technically, you could place the swap (fetch and store) operations in the fetch and phi family, as well, but it seemed more intuitive to me to separate them.

2 comments:

Alfons Laarman said...

interesting post, thanks. I would still like to learn more about the effects of these operations on the cache coherency protocol. Do you think that compare and swap, for example, enforces a cache writeback or will the location just be marked dirty?

Anonymous said...

Who knows where to download XRumer 5.0 Palladium?
Help, please. All recommend this program to effectively advertise on the Internet, this is the best program!