Thursday, April 13, 2006

What Is It Good For?

So, a couple of people have been asking what the advantage of atomic structures is over simple traditional structures (like STL) and a mutex. Speed is the primary reason. Atomic structures are slightly slower than their traditional counterparts when only a single thread is using them (the best-case scenario), but quickly make up for this in multithreaded programs.

Mutexes, by definition, are mutually exclusive. Only a single thread may hold one, and any additional threads trying to get it have to get in line and wait their turn. Now, doing something with a data structure inside a mutex doesn't take that many cycles (I'd estimate about 500, if a memory allocation is required), but that's still completely dead time.

The real killer, however, is the possibility of going into wait. Suppose A holds a mutex and B wants that mutex. If B is lucky, it's running on a different CPU, and spinning a bit (up to 500 cycles) will succeed in waiting A out. But if there's only a single processor, spinning is out of the question. Worse yet, A could have its time slice expire inside the mutex, and now B would have to spin for tens or hundreds of thousands of cycles (way beyond any reasonable spin limit), forcing it to block on the mutex instead.

Once B has to wait on the mutex, any hope of a speedy trial is over. Once you give up the CPU, Gord knows when you'll get it back (even in the best case it will likely take more than 10,000 cycles). Even worse, if A gets preempted in the mutex, that causes two wait equivalents: one for A to get the CPU again, and one for B to get the CPU after A leaves the mutex. These problems only get worse if there's a thread C or D also waiting for the mutex.

So, what can lock-free structures do? Lock-free means, by definition, that at any given time, at least one of the threads attempting an operation will complete it within a bounded number of cycles: one iteration of the retry loop (the atomic structures are implemented as AtomicCompareExchange loops). For the atomic stack, the simplest of the structures (it has only a single point where an iteration can fail), each thread that didn't succeed on the first try will have to retry, in the worst case, x iterations, where x is the number of threads on other CPUs attempting an operation on the same structure at the same time. If my logic is correct, this works out to a worst-case total of O(x^3) dead iterations summed across all threads, where x is the number of CPUs in the system. In the average case, only 1 or 2 CPUs will be attempting operations at the same time, which means 0 and 1 dead iterations, respectively. The other atomic structures are more complicated, and I'm not certain how to calculate their worst-case complexity (math never was my strong suit).

Besides dramatically reducing dead time, lock-free structures cover cases where threads absolutely cannot wait. These are mainly hardware-linked real-time threads. For example, one day BZ asked me for help with a problem he was having, where he needed to dequeue something from a shared structure in a hardware feeder thread (the thread fetches data for the hardware, and so has a very limited amount of time it can take to get that data). In this case, going into wait means death. That leaves only the possibility of lock-free (or, even better, wait-free) structures.

There is a stronger version of lock-free algorithms called wait-free algorithms. Wait-free algorithms provide an additional guarantee: every thread completes its operation in a bounded number of its own steps, no matter what the other threads are doing (lock-free only guarantees that some thread always makes progress, so an unlucky individual thread can keep retrying). However, wait-free structures are far more difficult to design than lock-free structures, which are already far more difficult to design than thread-unsafe structures. I only have an algorithm for a wait-free implementation of load/store with reserve.
