A fast lock-free queue for C++ (repost)

Sharing data between threads is annoying. Really annoying. If you do it wrong, the data is likely to become corrupted. Even more error-prone is ensuring the shared data is perceived in the right order (relative to other data reads/writes).

There are, as always, two ways to deal with this: The easy way, and the hard way.

The easy way

Use the locks (e.g. mutexes, critical sections, or equivalent) that come with your platform. They’re not too hard to understand conceptually, and even easier to use. You don’t have to worry about ordering problems, because the libraries/OS take care of that for you. The only general problem with locks is that they tend to be slow (“slow” being relative here; for general use, they’re plenty speedy enough).

The catch

I was looking into doing some audio programming, where the audio callback (which gets called when the device needs more samples to fill its buffer) was on another thread. The goal with audio programming is to never let the audio “glitch” — which means your callback code must be as close to real-time programming as you can get on a conventional OS. So, the faster things compute, the better, but what really matters with real-time programming is how deterministic the computation time will be — will your code reliably take 5ms or could there be an occasional 300ms spike?

What does this mean? Well, for starters you can’t allocate heap memory — because that might involve an OS call, which might take a non-deterministic amount of time to complete (e.g. if you have to wait for a page to be swapped to disk). In fact, it’s best to avoid calling into the kernel at all. But, this actually isn’t the main problem.

The main problem is synchronizing the audio thread with the main thread. We can’t use locks, because aside from non-negligible overhead (considering the frequency of callbacks), they can cause priority inversion — you don’t want a high-priority audio thread waiting to enter a critical section that some background thread is obliviously taking its time releasing, causing the audio to glitch.

The hard way

Enter lock-free programming.

Lock-free programming is a way of writing thread-safe code such that in the case of contention, the system is guaranteed to advance as a whole. “Wait-free” programming takes this a step further: the code is set up such that each thread can always advance regardless of what the other is doing. This also has the benefit of avoiding expensive locks.

Why doesn’t everyone use lock-free programming? Well, it’s not easy to get right. You have to be very careful never to corrupt data under any circumstances, regardless of what each of the threads is doing. But even harder to get right are the guarantees related to ordering. Consider this example (a and b start at 0):

Thread A    Thread B
b = 42      print(a)
a = 1       print(b)
What are the possible sets of values that thread B could output? It could be any one of {0 0}, {1 42}, {0 42}, and {1 0}. {0 0}, {1 42}, and {0 42} make sense (which one actually results depends on timing), but how can {1 0} happen? This is because the compiler is allowed to re-order the loads and stores (or even remove some or invent others) in order to improve efficiency, as long as it appears to do the same thing on that thread. But when another thread starts interacting with the first, this re-ordering can become apparent.

It gets worse. You can force the compiler to not reorder specific reads or writes, but the CPU is allowed to (and does, all the time!) re-order instructions too, again as long as it appears to do the same thing on the core that the instructions are running on.

Memory barriers

Fortunately, you can force certain ordering guarantees all the way down to the CPU level using memory barriers. Unfortunately, they are extremely tricky to use properly. In fact, a major problem with lock-free code is that it’s prone to bugs that can only be reproduced by specific non-deterministic interleavings of thread operations. This means a lock-free algorithm with a bug may work 95% of the time and only fail the other 5%. Or it may work all of the time on a developer’s CPU, but fail occasionally on another CPU.

This is why lock-free programming is generally considered hard: It’s easy to get something that looks right, but it requires much more effort and thought to create something which is guaranteed to work under all circumstances. Luckily, there are a few tools to help verify implementations. One such tool is Relacy: It works by running unit tests (that you write) under every possible permutation of thread interleaving, which is really cool.

Back to memory barriers: At the lowest level of the CPU, different cores each have their own cache, and these caches need to be kept in sync with each other (a sort of eventual consistency). This is accomplished with a sort of internal messaging system between the various CPU cores (sounds slow? It is. So it was highly optimized with, you guessed it, more caching!). Since both cores are running code at the same time, this can lead to some interesting orderings of memory loads/stores (like in the example above).

Stronger ordering guarantees can be built on top of this messaging system; that’s what memory barriers do. The types of memory barriers that I found most useful were the acquire and release barriers. A release (write) barrier tells the CPU that if any write made after the barrier becomes visible to another core, then all the writes made before the barrier must become visible too, provided the other core performs an acquire (read) barrier after reading the data. In other words, if thread B can see a new value that thread A wrote after its release barrier, then once thread B performs an acquire barrier, all of the writes thread A made before its release barrier are guaranteed to be visible on thread B.

One nice feature of acquire and release semantics is that they boil down to no-ops on x86 — every write implicitly has release semantics, and every read implicitly has acquire semantics! You still need to add the barriers, though, because they prevent compiler re-ordering (and they’ll generate the right assembly code on other processor architectures which don’t have as strong a memory-ordering model).

If this sounds confusing, it is! But don’t give up hope, because there are many well-explained resources; a good starting point is Jeff Preshing’s very helpful articles on the subject. There’s also a short list of links in this answer on Stack Overflow.

A wait-free single-producer, single-consumer queue

Since I apparently find tangents irresistible, I, of course, set out to build my own lock-free data structure. I went for a wait-free queue for a single producer, single consumer architecture (which means there’s only ever two threads involved). If you need a data structure which is safe to use from several threads at once, you’ll need to find another implementation (MSVC++ comes with one). I chose this restriction because it’s much easier to work with than one where there’s a free-for-all of concurrent threads, and because it was all I needed.

I had looked at using some existing libraries (particularly liblfds), but I wasn’t satisfied with the memory management (I wanted a queue that wouldn’t allocate any memory at all on the critical thread, making it suitable for real-time programming). So, I ignored the copious advice about only doing lock-free programming if you’re already an expert in it (how does anyone become an expert like that?), and was able to successfully implement a queue!

The design is as follows:

A contiguous circular buffer is used to store elements in the queue. This allows memory to be allocated up front, and may provide better cache usage. I call this buffer a “block”.

To allow the queue to grow dynamically without needing to copy all the existing elements into a new block when the block becomes too small (which isn’t lock-free friendly anyway), multiple blocks (of independent size) are chained together in a circular linked list. This makes for a queue of queues, essentially.

The block at which elements are currently being inserted is called the “tail block”. The block at which elements are currently being consumed is called the “front block”. Similarly, within each block there is a “front” index and a “tail” index. The front index indicates the next full slot to read from; the tail index indicates the next empty slot to insert into. If these two indices are equal, the block is empty (exactly one slot is left empty when the block is full, in order to avoid the ambiguity of a full block and an empty block both having equal front and tail indices).

In order to keep the queue consistent while two threads are operating on the same data structure, we can take advantage of the fact that the producing thread always goes in one direction in the queue, and the same can be said for the consuming thread. This means that even if a variable’s value is out of date on a given thread, we know what range it can possibly be in. For example, when we dequeue from a block, we can check the value of the tail index (owned by the other thread) in order to compare it against the front index (which the dequeue thread owns, and is hence always up-to-date). We may get a stale value for the tail index from the CPU cache; the enqueue thread could have added more elements since then. But, we know that the tail will never go backwards — more elements could have been added, but as long as the tail was not equal to the front at the time we checked it, there’s guaranteed to be at least one element to dequeue.

Using these sort of guarantees, and some memory barriers to prevent things like the tail from being incremented before (or perceived as being incremented before) the element is actually added to the queue, it’s possible to design a simple algorithm to safely enqueue and dequeue elements under all possible thread interleavings. Here’s the pseudocode:

# Enqueue
If room in tail block, add to tail
Else check next block
    If next block is not the front block, enqueue on next block
    Else create a new block and enqueue there
    Advance tail to the block we just enqueued to

# Dequeue
Remember where the tail block is
If the front block has an element in it, dequeue it
Else
    If front block was the tail block when we entered the function, return false
    Else advance to next block and dequeue the item there

The algorithms for enqueueing and dequeueing within a single block are even simpler since the block is of fixed size:

# Enqueue within a block (assuming we've already checked that there's room in the block)
Copy/move the element into the block's contiguous memory
Increment tail index (wrapping as needed)

# Dequeue from a block (assuming we've already checked that it's not empty)
Copy/move the element from the block's contiguous memory into the output parameter
Increment the front index (wrapping as needed)

Obviously, this pseudocode glosses over some stuff (like where memory barriers are placed), but the actual code is not much more complex if you want to check out the lower-level details.

When thinking about this data structure, it’s helpful to make the distinction between variables that are “owned” by (i.e. written to exclusively by) the consumer or producer threads. A variable that is owned by a given thread is never out of date on that thread. A variable that is owned by a thread may have a stale value when it is read on a different thread, but by using memory barriers carefully we can guarantee that the rest of the data is at least as up to date as the variable’s value indicates when reading it from the non-owning thread.

To allow the queue to potentially be created and/or destroyed by any thread (independent from the two doing the producing and consuming), a full memory barrier (memory_order_seq_cst) is used at the end of the constructor and at the beginning of the destructor; this effectively forces all the CPU cores to synchronize outstanding changes. Obviously, the producer and consumer must have stopped using the queue before the destructor can be safely called.

Give me teh codes

What’s the use of a design without a solid (tested) implementation?

I’ve released my implementation on GitHub. Feel free to fork it! It consists of two headers, one for the queue and a header it depends on which contains some helpers.

It has several nice characteristics:

  • Compatible with C++11 (supports moving objects instead of making copies)
  • Fully generic (templated container of any type) — just like std::queue, you never need to allocate memory for elements yourself (which saves you the hassle of writing a lock-free memory manager to hold the elements you’re queueing)
  • Allocates memory up front, in contiguous blocks
  • Provides a try_enqueue method which is guaranteed never to allocate memory (the queue starts with an initial capacity)
  • Also provides an enqueue method which can dynamically grow the size of the queue as needed
  • Doesn’t use a compare-and-swap loop; this means enqueue and dequeue are O(1) (not counting memory allocation)
  • On x86, the memory barriers compile down to no-ops, meaning enqueue and dequeue are just a simple series of loads and stores (and branches)
  • Compiles under MSVC2010+ and GCC 4.7+ (and should work on any C++11 compliant compiler)

It should be noted that this code will only work on CPUs that perform aligned integer and native-pointer-sized loads/stores atomically; fortunately, this includes every modern processor (including ARM, x86/x86-64, and PowerPC). It is not designed to work on the DEC Alpha (which seems to have the weakest memory-ordering guarantees of all time).

I’m releasing the code and algorithm under the terms of the simplified BSD license. Use it at your own risk; in particular, lock-free programming is a patent minefield, and this code may very well violate a pending patent (I haven’t looked). It’s worth noting that I came up with the algorithm and implementation from scratch, independent of any existing lock-free queues.

Performance and correctness

In addition to agonizing over the design for quite some time, I tested the algorithm using several billion randomized operations in a simple stability test (on x86). This, of course, helps inspire confidence, but proves nothing about the correctness. In order to ensure it was correct, I also tested using Relacy, which ran all the possible interleavings for a simple test which turned up no errors; it turns out this simple test wasn’t comprehensive, however, since I eventually did find a bug later using a different set of randomized runs (which I then fixed).

I’ve only tested this queue on x86-64, which is rather forgiving as memory ordering goes. If somebody is willing to test this code on another architecture, let me know! The quick stability test I whipped up is available here.

In terms of performance, it’s fast. Really fast. In my tests, I was able to get up to about 12+ million concurrent enqueue/dequeue pairs per second! (The dequeue thread had to wait for the enqueue thread to catch up if there was nothing in the queue.) After I had implemented my queue, though, I found another single-consumer, single-producer templated queue (written by the author of Relacy) published on Intel’s website; his queue is roughly twice as fast, though it doesn’t have all the features that mine does, and his only works on x86 (and, at this scale, “twice as fast” means the difference in enqueue/dequeue time is in the nanosecond range).

Update (8 months later)

I spent some time properly benchmarking, profiling, and optimizing the code, using Dmitry’s single-producer, single-consumer lock-free queue (published on Intel’s website) as a baseline for comparison. Mine’s now faster in general, particularly when it comes to enqueueing many elements (mine uses a contiguous block instead of separate linked elements). Note that different compilers give different results, and even the same compiler on different hardware yields significant speed variations. The 64-bit version is generally faster than the 32-bit one, and for some reason my queue is much faster under GCC on a Linode. Here are the benchmark results in full:

32-bit, MSVC2010, on AMD C-50 @ 1GHz
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
Raw add           | 0.0039s | 0.0268s | 0.0040s | 0.0271s | 0.0040s | 0.0270s | 6.8x
Raw remove        | 0.0015s | 0.0017s | 0.0015s | 0.0018s | 0.0015s | 0.0017s | 1.2x
Raw empty remove  | 0.0048s | 0.0027s | 0.0049s | 0.0027s | 0.0048s | 0.0027s | 0.6x
Single-threaded   | 0.0181s | 0.0172s | 0.0183s | 0.0173s | 0.0182s | 0.0173s | 0.9x
Mostly add        | 0.0243s | 0.0326s | 0.0245s | 0.0329s | 0.0244s | 0.0327s | 1.3x
Mostly remove     | 0.0240s | 0.0274s | 0.0242s | 0.0277s | 0.0241s | 0.0276s | 1.1x
Heavy concurrent  | 0.0164s | 0.0309s | 0.0349s | 0.0352s | 0.0236s | 0.0334s | 1.4x
Random concurrent | 0.1488s | 0.1509s | 0.1500s | 0.1522s | 0.1496s | 0.1517s | 1.0x

Average ops/s:
    ReaderWriterQueue: 23.45 million
    SPSC queue:        28.10 million

64-bit, MSVC2010, on AMD C-50 @ 1GHz
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
Raw add           | 0.0022s | 0.0210s | 0.0022s | 0.0211s | 0.0022s | 0.0211s | 9.6x
Raw remove        | 0.0011s | 0.0022s | 0.0011s | 0.0023s | 0.0011s | 0.0022s | 2.0x
Raw empty remove  | 0.0039s | 0.0024s | 0.0039s | 0.0024s | 0.0039s | 0.0024s | 0.6x
Single-threaded   | 0.0060s | 0.0054s | 0.0061s | 0.0054s | 0.0061s | 0.0054s | 0.9x
Mostly add        | 0.0080s | 0.0259s | 0.0081s | 0.0263s | 0.0080s | 0.0261s | 3.3x
Mostly remove     | 0.0092s | 0.0109s | 0.0093s | 0.0110s | 0.0093s | 0.0109s | 1.2x
Heavy concurrent  | 0.0150s | 0.0175s | 0.0181s | 0.0200s | 0.0165s | 0.0190s | 1.2x
Random concurrent | 0.0367s | 0.0349s | 0.0369s | 0.0352s | 0.0368s | 0.0350s | 1.0x

Average ops/s:
    ReaderWriterQueue: 34.90 million
    SPSC queue:        32.50 million

32-bit, MSVC2010, on Intel Core 2 Duo T6500 @ 2.1GHz
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
Raw add           | 0.0011s | 0.0097s | 0.0011s | 0.0099s | 0.0011s | 0.0098s | 9.2x
Raw remove        | 0.0005s | 0.0006s | 0.0005s | 0.0006s | 0.0005s | 0.0006s | 1.1x
Raw empty remove  | 0.0018s | 0.0011s | 0.0019s | 0.0011s | 0.0018s | 0.0011s | 0.6x
Single-threaded   | 0.0047s | 0.0040s | 0.0047s | 0.0040s | 0.0047s | 0.0040s | 0.9x
Mostly add        | 0.0052s | 0.0114s | 0.0053s | 0.0116s | 0.0053s | 0.0115s | 2.2x
Mostly remove     | 0.0055s | 0.0067s | 0.0056s | 0.0068s | 0.0055s | 0.0068s | 1.2x
Heavy concurrent  | 0.0044s | 0.0089s | 0.0075s | 0.0128s | 0.0066s | 0.0107s | 1.6x
Random concurrent | 0.0294s | 0.0306s | 0.0295s | 0.0312s | 0.0294s | 0.0310s | 1.1x

Average ops/s:
    ReaderWriterQueue: 71.18 million
    SPSC queue:        61.02 million

64-bit, MSVC2010, on Intel Core 2 Duo T6500 @ 2.1GHz
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
Raw add           | 0.0007s | 0.0097s | 0.0007s | 0.0100s | 0.0007s | 0.0099s | 13.6x
Raw remove        | 0.0004s | 0.0015s | 0.0004s | 0.0020s | 0.0004s | 0.0018s | 4.6x
Raw empty remove  | 0.0014s | 0.0010s | 0.0014s | 0.0010s | 0.0014s | 0.0010s | 0.7x
Single-threaded   | 0.0024s | 0.0022s | 0.0024s | 0.0022s | 0.0024s | 0.0022s | 0.9x
Mostly add        | 0.0031s | 0.0112s | 0.0031s | 0.0115s | 0.0031s | 0.0114s | 3.7x
Mostly remove     | 0.0033s | 0.0041s | 0.0033s | 0.0041s | 0.0033s | 0.0041s | 1.2x
Heavy concurrent  | 0.0042s | 0.0035s | 0.0067s | 0.0039s | 0.0054s | 0.0038s | 0.7x
Random concurrent | 0.0142s | 0.0141s | 0.0145s | 0.0144s | 0.0143s | 0.0142s | 1.0x

Average ops/s:
    ReaderWriterQueue: 101.21 million
    SPSC queue:        71.42 million

32-bit, Intel ICC 13, on Intel Core 2 Duo T6500 @ 2.1GHz
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
Raw add           | 0.0014s | 0.0095s | 0.0014s | 0.0097s | 0.0014s | 0.0096s | 6.8x
Raw remove        | 0.0007s | 0.0006s | 0.0007s | 0.0007s | 0.0007s | 0.0006s | 0.9x
Raw empty remove  | 0.0028s | 0.0013s | 0.0028s | 0.0018s | 0.0028s | 0.0015s | 0.5x
Single-threaded   | 0.0039s | 0.0033s | 0.0039s | 0.0033s | 0.0039s | 0.0033s | 0.8x
Mostly add        | 0.0049s | 0.0113s | 0.0050s | 0.0116s | 0.0050s | 0.0115s | 2.3x
Mostly remove     | 0.0051s | 0.0061s | 0.0051s | 0.0062s | 0.0051s | 0.0061s | 1.2x
Heavy concurrent  | 0.0066s | 0.0036s | 0.0084s | 0.0039s | 0.0076s | 0.0038s | 0.5x
Random concurrent | 0.0291s | 0.0282s | 0.0294s | 0.0287s | 0.0292s | 0.0286s | 1.0x

Average ops/s:
    ReaderWriterQueue: 55.65 million
    SPSC queue:        63.72 million

64-bit, Intel ICC 13, on Intel Core 2 Duo T6500 @ 2.1GHz
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
Raw add           | 0.0010s | 0.0099s | 0.0010s | 0.0100s | 0.0010s | 0.0099s | 9.8x
Raw remove        | 0.0006s | 0.0015s | 0.0006s | 0.0018s | 0.0006s | 0.0017s | 2.7x
Raw empty remove  | 0.0024s | 0.0016s | 0.0024s | 0.0016s | 0.0024s | 0.0016s | 0.7x
Single-threaded   | 0.0026s | 0.0023s | 0.0026s | 0.0023s | 0.0026s | 0.0023s | 0.9x
Mostly add        | 0.0032s | 0.0114s | 0.0032s | 0.0118s | 0.0032s | 0.0116s | 3.6x
Mostly remove     | 0.0037s | 0.0042s | 0.0037s | 0.0044s | 0.0037s | 0.0044s | 1.2x
Heavy concurrent  | 0.0060s | 0.0092s | 0.0088s | 0.0096s | 0.0077s | 0.0095s | 1.2x
Random concurrent | 0.0168s | 0.0166s | 0.0168s | 0.0168s | 0.0168s | 0.0167s | 1.0x

Average ops/s:
    ReaderWriterQueue: 68.45 million
    SPSC queue:        50.75 million

64-bit, GCC 4.7.2, on Linode 1GB virtual machine (Intel Xeon L5520 @ 2.27GHz)
                  |        Min        |        Max        |        Avg
Benchmark         |   RWQ   |  SPSC   |   RWQ   |  SPSC   |   RWQ   |  SPSC   | Mult
Raw add           | 0.0004s | 0.0055s | 0.0005s | 0.0055s | 0.0005s | 0.0055s | 12.1x
Raw remove        | 0.0004s | 0.0030s | 0.0004s | 0.0030s | 0.0004s | 0.0030s | 8.4x
Raw empty remove  | 0.0009s | 0.0060s | 0.0010s | 0.0061s | 0.0009s | 0.0060s | 6.4x
Single-threaded   | 0.0034s | 0.0052s | 0.0034s | 0.0052s | 0.0034s | 0.0052s | 1.5x
Mostly add        | 0.0042s | 0.0096s | 0.0042s | 0.0106s | 0.0042s | 0.0103s | 2.5x
Mostly remove     | 0.0042s | 0.0057s | 0.0042s | 0.0058s | 0.0042s | 0.0058s | 1.4x
Heavy concurrent  | 0.0030s | 0.0164s | 0.0036s | 0.0216s | 0.0032s | 0.0188s | 5.8x
Random concurrent | 0.0256s | 0.0282s | 0.0257s | 0.0290s | 0.0257s | 0.0287s | 1.1x

Average ops/s:
    ReaderWriterQueue: 137.88 million
    SPSC queue:        24.34 million

In short, my queue is blazingly fast, and actually doing anything with it will eclipse the overhead of the data structure itself.

The benchmarking code is available here (compile and run with full optimizations).

Update update (May 20, ’13): I’ve since changed the benchmark to preallocate the queues’ memory, and added a comparison against Facebook’s folly::ProducerConsumerQueue (which is a smidgeon faster but cannot grow as needed). Here are the results:

64-bit, GCC 4.7.2, on Linode 1GB virtual machine (Intel Xeon L5520 @ 2.27GHz)
                  |-----------  Min ------------|------------ Max ------------|------------ Avg ------------|
Benchmark         |   RWQ   |  SPSC   |  Folly  |   RWQ   |  SPSC   |  Folly  |   RWQ   |  SPSC   |  Folly  | xSPSC | xFolly
Raw add           | 0.0003s | 0.0018s | 0.0003s | 0.0003s | 0.0018s | 0.0003s | 0.0003s | 0.0018s | 0.0003s | 5.52x | 0.96x
Raw remove        | 0.0003s | 0.0030s | 0.0004s | 0.0003s | 0.0030s | 0.0004s | 0.0003s | 0.0030s | 0.0004s | 8.58x | 1.28x
Raw empty remove  | 0.0009s | 0.0061s | 0.0006s | 0.0010s | 0.0061s | 0.0006s | 0.0010s | 0.0061s | 0.0006s | 6.40x | 0.61x
Single-threaded   | 0.0034s | 0.0053s | 0.0033s | 0.0034s | 0.0053s | 0.0033s | 0.0034s | 0.0053s | 0.0033s | 1.54x | 0.97x
Mostly add        | 0.0042s | 0.0046s | 0.0042s | 0.0043s | 0.0046s | 0.0042s | 0.0042s | 0.0046s | 0.0042s | 1.09x | 0.99x
Mostly remove     | 0.0042s | 0.0049s | 0.0043s | 0.0042s | 0.0051s | 0.0043s | 0.0042s | 0.0050s | 0.0043s | 1.20x | 1.02x
Heavy concurrent  | 0.0025s | 0.0100s | 0.0024s | 0.0028s | 0.0101s | 0.0026s | 0.0026s | 0.0100s | 0.0025s | 3.88x | 0.97x
Random concurrent | 0.0265s | 0.0268s | 0.0262s | 0.0273s | 0.0287s | 0.0284s | 0.0269s | 0.0280s | 0.0271s | 1.04x | 1.01x

Average ops/s:
    ReaderWriterQueue: 142.60 million
    SPSC queue:        33.86 million
    Folly queue:       181.16 million

The future of lock-free programming

I think that as multi-threaded code becomes more prevalent, more and more libraries will become available that leverage lock-free programming for speed (while abstracting the gory, error-prone guts from the application developer). I wonder, though, how far this multi-core-with-shared-memory architecture will go — the cache coherence protocols and memory barrier instructions don’t seem to scale very well (performance-wise) to highly parallel CPUs. Immutable shared memory with thread-local writes may be the future. Sounds like a job for functional programming!

ref: http://moodycamel.com/blog/2013/a-fast-lock-free-queue-for-c++

C array parameters

Knowledge is funny that way: the more you know, the more you realize you don’t know. C/C++ is a deep pond indeed, haha. Consider the following three snippets:

void fun(char a[100]) {
        fprintf(stderr, "%s\n", a);
}

int main(void) {
        char aa[200] = "abcdef";
        fun(aa);
        return 0;
}

void fun(char a[]) {
        fprintf(stderr, "%s\n", a);
}

int main(void) {
        char aa[200] = "abcdef";
        fun(aa);
        return 0;
}

void fun(char* a) {
        fprintf(stderr, "%s\n", a);
}

int main(void) {
        char aa[200] = "abcdef";
        fun(aa);
        return 0;
}


  • fun(char a[100]): the array length 100 is simply ignored by the compiler. At best it hints to callers that an array of 100 elements is expected, but even that hint carries no enforcement whatsoever.
  • fun(char a[]): the a[] form tells callers that the function handles an array rather than a single char, but the compiler still treats a as a char*. That is, if you test sizeof(a) inside fun(), you get the size of a pointer (typically 4 on a 32-bit machine).
  • fun(char *a): this is probably the form most ordinary folks use… ^_^, and it is usually accompanied by an extra length parameter len.

Mathematics in Movies

http://www.math.harvard.edu/~knill/mathmovies/

Baby’s First Garbage Collector (repost)

When I get stressed out and have too much to do, I have this paradoxical reaction where I escape from that by coming up with another thing to do. Usually it’s a tiny self-contained program that I can write and finish.

The other morning, I was freaking myself out about the book I’m working on and the stuff I have to do at work and a talk I’m preparing for Strange Loop, and all of the sudden, I thought, “I should write a garbage collector.”

Yes, I realize how crazy that paragraph makes me seem. But my faulty wiring is your free tutorial on a fundamental piece of programming language implementation! In about a hundred lines of vanilla C, I managed to whip up a basic mark-and-sweep collector that actually, you know, collects. Garbage collection is considered one of the more shark-infested waters of programming, but in this post, I’ll give you a nice kiddie pool to paddle around in. (There may still be sharks in it, but at least it will be shallower.)

Reduce, reuse, recycle

The basic idea behind garbage collection is that the language (for the most part) appears to have access to infinite memory. The developer can just keep allocating and allocating and allocating and, as if by magic, it never fails.

Of course, machines don’t have infinite memory. So the way the implementation does this is that when it needs to allocate a bit of memory and it realizes it’s running low, it collects garbage. “Garbage” in this context means memory it previously allocated that is no longer being used. For the illusion of infinite memory to work, the language needs to be very safe about “no longer being used”. It would be no fun if random objects just started getting reclaimed while your program was trying to access them.

In order to be collectible, the language has to ensure there’s no way for the program to use that object again. If it can’t get a reference to the object, then it obviously can’t use it again. So the definition of “in use” is actually pretty simple:

  • Any object that’s being referenced by a variable that’s still in scope is in use.
  • Any object that’s referenced by another object that’s in use is in use.

The second rule is the recursive one. If object A is referenced by a variable, and it has some field that references object B, then B is in use since you can get to it through A.

The end result is a graph of reachable objects—all of the objects in the world that you can get to by starting at a variable and traversing through objects. Any object not in that graph of reachable objects is dead to the program and its memory is ripe for a reaping.

Marking and sweeping

There’s a bunch of different ways you can implement the process of finding and reclaiming all of the unused objects, but the simplest and first algorithm ever invented for it is called “mark-sweep”. It was invented by John McCarthy, the man who invented Lisp and beards, so you implementing it now is like communing with one of the Elder Gods, but hopefully not in some Lovecraftian way that ends with you having your mind and retinas blasted clean.

It works almost exactly like our definition of reachability: Starting at the roots, traverse the entire object graph. Every time you reach an object, set a “mark” bit on it to true.

Once that’s done, find all of the objects whose mark bits are not set and delete them.

That’s it. I know, you could have come up with that, right? If you had, you’d be the author of a paper cited hundreds of times. The lesson here is that to be famous in CS, you don’t have to come up with really smart stuff, you just have to come up with dumb stuff first.

A pair of objects

Before we can get to implementing those two steps, let’s get a couple of preliminaries out of the way. We won’t be actually implementing an interpreter for a language—no parser, bytecode, or any of that foolishness—but we do need some minimal amount of code to create some garbage to collect.

Let’s play pretend that we’re writing an interpreter for a little language. It’s dynamically typed, and has two types of objects: ints and pairs. Here’s an enum to identify an object’s type:

typedef enum {
  OBJ_INT,
  OBJ_PAIR
} ObjectType;

A pair can be a pair of anything, two ints, an int and another pair, whatever. You can go surprisingly far with just that. Since an object in the VM can be either of these, the typical way in C to implement it is with a tagged union.

We’ll define it thusly:

typedef struct sObject {
  ObjectType type;

  union {
    /* OBJ_INT */
    int value;

    /* OBJ_PAIR */
    struct {
      struct sObject* head;
      struct sObject* tail;
    };
  };
} Object;

The main Object struct has a type field that identifies what kind of value it is— either an int or a pair. Then it has a union to hold the data for the int or pair. If your C is rusty, a union is a struct where the fields overlap in memory. Since a given object can only be an int or a pair, there’s no reason to have memory in a single object for all three fields at the same time. A union does that. Groovy.

A minimal virtual machine

Now we can wrap that in a little virtual machine structure. Its role in this story is to have a stack that stores the variables that are currently in scope. Most language VMs are either stack-based (like the JVM and CLR) or register-based (like Lua). In both cases, there is actually still a stack. It’s used to store local variables and temporary variables needed in the middle of an expression.

We’ll model that explicitly and simply like so:

#define STACK_MAX 256

typedef struct {
  Object* stack[STACK_MAX];
  int stackSize;
} VM;

Now that we’ve got our basic data structures in place, let’s slap together a bit of code to create some stuff. First, let’s write a function that creates and initializes a VM:

VM* newVM() {
  VM* vm = malloc(sizeof(VM));
  vm->stackSize = 0;
  return vm;
}

Once we’ve got a VM, we need to be able to manipulate its stack:

void push(VM* vm, Object* value) {
  assert(vm->stackSize < STACK_MAX && "Stack overflow!");
  vm->stack[vm->stackSize++] = value;
}

Object* pop(VM* vm) {
  assert(vm->stackSize > 0 && "Stack underflow!");
  return vm->stack[--vm->stackSize];
}

OK, now that we can stick stuff in “variables”, we need to be able to actually create objects. First a little helper function:

Object* newObject(VM* vm, ObjectType type) {
  Object* object = malloc(sizeof(Object));
  object->type = type;
  return object;
}

That does the actual memory allocation and sets the type tag. We’ll be revisiting this in a bit. Using that, we can write functions to push each kind of object onto the VM’s stack:

void pushInt(VM* vm, int intValue) {
  Object* object = newObject(vm, OBJ_INT);
  object->value = intValue;
  push(vm, object);
}

Object* pushPair(VM* vm) {
  Object* object = newObject(vm, OBJ_PAIR);
  object->tail = pop(vm);
  object->head = pop(vm);

  push(vm, object);
  return object;
}

And that’s it for our little VM. If we had a parser and an interpreter that called those functions, we’d have an honest to God language on our hands. And, if we had infinite memory, it would even be able to run real programs. Since we don’t, let’s start collecting some garbage.

Marky mark

The first phase is marking. We need to walk all of the reachable objects and set their mark bit. The first thing we need then is to add a mark bit to Object:

typedef struct sObject {
  unsigned char marked;
  /* Previous stuff... */
} Object;

When we create a new object, we’ll modify newObject() to initialize marked to zero. To mark all of the reachable objects, we start with the variables that are in memory, so that means walking the stack. That looks like this:

void markAll(VM* vm) {
  for (int i = 0; i < vm->stackSize; i++) {
    mark(vm->stack[i]);
  }
}

That in turn calls mark. We’ll build that in phases. First:

void mark(Object* object) {
  object->marked = 1;
}

This is the most important bit, literally. We’ve marked the object itself as reachable, but remember we also need to handle references in objects: reachability is recursive. If the object is a pair, its two fields are reachable too. Handling that is simple:

void mark(Object* object) {
  object->marked = 1;

  if (object->type == OBJ_PAIR) {
    mark(object->head);
    mark(object->tail);
  }
}

But there’s a bug here. Do you see it? We’re recursing now, but we aren’t checking for cycles. If you have a bunch of pairs that point to each other in a loop, this will overflow the stack and crash. To handle that, we just need to bail out if we get to an object that we’ve already processed. So the complete mark() function is:

void mark(Object* object) {
  /* If already marked, we're done. Check this first
     to avoid recursing on cycles in the object graph. */
  if (object->marked) return;

  object->marked = 1;

  if (object->type == OBJ_PAIR) {
    mark(object->head);
    mark(object->tail);
  }
}

Now we can call markAll() and it will correctly mark every reachable object in memory. We’re halfway done!

Sweepy sweep

The next phase is to sweep through all of the objects we’ve allocated and free any of them that aren’t marked. But there’s a problem here: all of the unmarked objects are, by definition, unreachable! We can’t get to them!

The VM has implemented the language’s semantics for object references: so we’re only storing pointers to objects in variables and the pair elements. As soon as an object is no longer pointed to by one of those, we’ve lost it entirely and actually leaked memory.

The trick to solve this is that the VM can have its own references to objects that are distinct from the semantics that are visible to the language user. In other words, we can keep track of them ourselves. The simplest way to do this is to just maintain a linked list of every object we’ve ever allocated. We’ll extend Object itself to be a node in that list:

typedef struct sObject {
  /* The next object in the list of all objects. */
  struct sObject* next;

  /* Previous stuff... */
} Object;

The VM will keep track of the head of that list:

typedef struct {
  /* The first object in the list of all objects. */
  Object* firstObject;

  /* Previous stuff... */
} VM;

In newVM() we’ll make sure to initialize firstObject to NULL. Whenever we create an object, we add it to the list:

Object* newObject(VM* vm, ObjectType type) {
  Object* object = malloc(sizeof(Object));
  object->type = type;
  object->marked = 0;

  /* Insert it into the list of allocated objects. */
  object->next = vm->firstObject;
  vm->firstObject = object;

  return object;
}

This way, even if the language can’t find an object, the language implementation still can. To sweep through and delete the unmarked objects, we just need to traverse the list:

void sweep(VM* vm) {
  Object** object = &vm->firstObject;
  while (*object) {
    if (!(*object)->marked) {
      /* This object wasn't reached, so remove it from the list
         and free it. */
      Object* unreached = *object;

      *object = unreached->next;
      free(unreached);
    } else {
      /* This object was reached, so unmark it (for the next GC)
         and move on to the next. */
      (*object)->marked = 0;
      object = &(*object)->next;
    }
  }
}
That code is a bit tricky to read because of that pointer to a pointer, but if you work through it, you can see it’s pretty straightforward. It just walks the entire linked list. Whenever it hits an object that isn’t marked, it frees its memory and removes it from the list. When this is done, we will have deleted every unreachable object.

Congratulations! We have a garbage collector! There’s just one missing piece: actually calling it. First let’s wrap the two phases together:

void gc(VM* vm) {
  markAll(vm);
  sweep(vm);
}

You couldn’t ask for a more obvious mark-sweep implementation. The trickiest part is figuring out when to actually call this. What does “low on memory” even mean, especially on modern computers with near-infinite virtual memory?

It turns out there’s no precise right or wrong answer here. It really depends on what you’re using your VM for and what kind of hardware it runs on. To keep this example simple, we’ll just collect after a certain number of allocations. That’s actually how some language implementations work, and it’s easy to implement. We’ll extend VM to track how many we’ve created:

typedef struct {
  /* The total number of currently allocated objects. */
  int numObjects;

  /* The number of objects required to trigger a GC. */
  int maxObjects;

  /* Previous stuff... */
} VM;

And then initialize them:

VM* newVM() {
  /* Previous stuff... */

  vm->numObjects = 0;
  vm->maxObjects = INITIAL_GC_THRESHOLD;
  return vm;
}

The INITIAL_GC_THRESHOLD will be the number of objects at which you kick off the first GC. A smaller number is more conservative with memory, a larger number spends less time on garbage collection. Adjust to taste.

Whenever we create an object, we increment numObjects and run a collection if it reaches the max:

Object* newObject(VM* vm, ObjectType type) {
  if (vm->numObjects == vm->maxObjects) gc(vm);

  /* Create object... */

  vm->numObjects++;
  return object;
}

I won’t bother showing it, but we’ll also tweak sweep() to decrement numObjects every time it frees one. Finally, we modify gc() to update the max:

void gc(VM* vm) {
  int numObjects = vm->numObjects;

  markAll(vm);
  sweep(vm);

  vm->maxObjects = vm->numObjects * 2;
}

After every collection, we update maxObjects based on the number of live objects left after the collection. The multiplier there lets our heap grow as the number of living objects increases. Likewise, it will shrink automatically if a bunch of objects end up being freed.


You made it! If you followed all of this, you’ve now got a handle on a simple garbage collection algorithm. If you want to see it all together, here’s the full code. Let me stress here that while this collector is simple, it isn’t a toy.

There are a ton of optimizations you can build on top of this (and in things like GC and programming languages, optimization is 90% of the effort), but the core code here is a legitimate real GC. It’s very similar to the collectors that were in Ruby and Lua until recently. You can ship production code that uses something exactly like this. Now go build something awesome!

ref: http://journal.stuffwithstuff.com/ github: https://github.com/munificent/mark-sweep

Boost Serialization with Non-Default Constructors

#include <fstream>
#include <iostream>
#include <cstring>
#include <cstdlib>
#include <new>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>

class gps_position {
public:
	gps_position(int size) {
		std::cout << "constructor: one" << std::endl;
		m_size = size;
		buf = (unsigned char *)malloc(m_size);
		memset(buf, 0, m_size);
	}

	gps_position(int d, int m, float s, int bufSZ) : degrees(d), minutes(m), seconds(s) {
		std::cout << "constructor: two" << std::endl;
		m_size = bufSZ;
		buf = (unsigned char *)malloc(m_size);
		memset(buf, 0, m_size);

		buf[0] = 'a';
		buf[1] = 'b';
		buf[2] = 'c';
		buf[3] = '\0';
	}

	~gps_position() {
		free(buf);
	}

	void show() {
		std::cout << "degree = " << degrees << " minutes = " << minutes << " seconds = " << seconds
			<< " buf = " << buf << std::endl;
	}

private:
	friend class boost::serialization::access;
	template<class Archive>
		friend inline void load_construct_data(Archive &ar, gps_position *t, const unsigned int file_version);
	template<class Archive>
		friend inline void save_construct_data(Archive &ar, const gps_position *t, const unsigned int file_version);

	template<class Archive>
	void serialize(Archive &ar, const unsigned int version) {
		ar & degrees;
		ar & minutes;
		ar & seconds;

		for(int i = 0; i < m_size; i++) {
			ar & buf[i];
		}
	}

	int degrees;
	int minutes;
	float seconds;
	unsigned char *buf;
	int m_size;
};

template<class Archive>
inline void save_construct_data(Archive &ar, const gps_position *t, const unsigned int file_version) {
	std::cout << "save_construct_data: size = " << t->m_size << std::endl;
	ar << t->m_size;
}

template<class Archive>
inline void load_construct_data(Archive &ar, gps_position *t, const unsigned int file_version) {
	int size;

	ar >> size;
	std::cout << "load_construct_data: size = " << size << std::endl;

	/* Invoke the non-default constructor in place on the
	   storage Boost has already allocated. */
	::new(t) gps_position(size);
}

int main() {
	std::ofstream ofs("file");
	gps_position* g = new gps_position(35, 59, 24.567f, 1024);

	std::cout << "before serialization: ";
	g->show();

	{
		boost::archive::text_oarchive oa(ofs);
		oa << g;
	}

	gps_position *newg = NULL;
	std::ifstream ifs("file");
	{
		boost::archive::text_iarchive ia(ifs);
		ia >> newg;
	}

	std::cout << "after deserialization: ";
	newg->show();

	delete g;
	delete newg;
	return 0;
}

Compile and run:

g++ -Wall -g -o test test.cpp -lboost_serialization


constructor: two
before serialization: degree = 35 minutes = 59 seconds = 24.567 buf = abc

save_construct_data: size = 1024
load_construct_data: size = 1024
constructor: one
after deserialization: degree = 35 minutes = 59 seconds = 24.567 buf = abc

ref: http://www.boost.org/doc/libs/1_47_0/libs/serialization/test/test_non_default_ctor.cpp