Diving in multiprocessor programming is never easy, I know. Unfortunately for you, you should understand much about computer architecture to do it effectively.
This post will go deep into computer architecture, trying to explain one of the most insidious problems that we may encounter when we do multiprocessor programming: false sharing.
False Sharing: an example in Go
Let’s implement a simple counter interface.
Quite easy, isn’t it?
Let’s create two structures that implement the
NotPaddedCounter has 3
uint64 and its implementation of the
Increment the method just increments them using
PaddedCounter is very similar, but it has these
uint64 arrays between every field. The
Increment method is exactly the same.
Now let’s run a benchmark to test their performances.
Since my computer has 4 CPUs, we are creating 4 goroutines, each one increments the counter 10000 times. Here are the results:
BenchmarkNoPad-4 1000000000 0.00287 ns/op
BenchmarkPad-4 1000000000 0.00137 ns/op
It looks like the padded version overperforms the non-padded one. But why?
To better understand what happened, let’s explore computer components. The next section will explain SMP computer architecture without going too deep.
From now, the components we are interested in are the following:
- The processors: hardware devices that execute threads. Usually, there are more threads than processors;
- The interconnect: communication medium that links processors to processors and processors to memory;
- The memory: a hierarchy of components that store data, ranging from one or more levels of small, fast caches to a large and relatively slow main memory.
It takes a long time for a processor to read or write a value from memory, and longer still for the processor to be sure that value has actually been written in memory.
A multiprocessor consists of multiple hardware processors, each of which executes a thread.
A thread is a sequential program. The difference between a thread and a processor is that, while a processor is a hardware device, a thread is a software construct. A processor can run a thread for a while and then set it to sleep and run another thread.
The interconnect is the medium by which processors communicate with the memory and with other processors. Both processors and the main memory have bus controller units in charge of sending and listening for messages on the bus.
Processors share a main memory, which is a large array of words, indexed by address. A word is typically either 32 or 64 bits, depending on the platform.
My laptop, for example, has a 64 bits architecture, as you can see by running the following command:
➜ ~ uname -m
A processor reads a value from memory by sending a message containing the desired address to memory. The response message contains the associated data. A processor writes a value by sending the address and the new data to memory, and the memory sends back an acknowledgement when the new data has been successfully written.
On modern architectures, a main memory access takes hundreds of cycles, so it is a waste of time for a processor to spend much of its time just waiting for the memory to respond to requests. This problem can be alleviated by introducing one or more caches: small memories that are situated closer to the processors and are much faster than main memory. Caches are logically situated between the processor and the memory: when a processor attempts to read a value from a given memory address, it first looks to see if the value is already in the cache, and if so, it does not need to perform the slower memory access.
If the desired address’s value was found, we say the processor hits in the cache, and otherwise, it misses.
Caches are effective because most programs display a high degree of locality: if a processor reads or writes a memory address, then it is likely to read or write the same location again soon as well as nearby locations soon. So, caches typically operate at a granularity larger than a single word: a cache holds a group of neighbouring words called cache lines.
However, caches are expensive to build and therefore significantly smaller than the memory: only a fraction of the memory locations will fit in the cache at the same time, so, when a location needs to be cached and the cache is full, it is necessary to evict a line, discarding it if it has not been modified, and writing it back to main memory if it has.
Sharing occurs when one processor reads or writes a memory address that is cached by another processor. If both processors are reading the data without modifying it, the data can be cached at both processors, but if one processor tries to update the shared cache line, then the other’s copy must be invalidated to ensure that it does not read an out-of-date value.
Let’s see a simple example to better understand this concept.
Suppose now to have two processors A and B. A reads data from address v and stores the data in its cache. Then B reads v from the same address, so A detects this conflict and responds with v. Now v is cached at both A and B. If B writes to the shared address v, it changes its state to modified and broadcasts a message warning A to set its cache line state to invalid. If A then reads from v, it broadcasts a request and B replies by sending the modified version of v to A and to the main memory.
False sharing occurs when processors that are accessing logically distinct data nevertheless conflict because the locations they are accessing lie on the same cache line.
Objects or fields that are accessed independently should be aligned and padded so that they end up on different cache lines.
That’s why our padded version performs better. In the non-padded version, each processor repeatedly invalidates the cache of the other processors, even if they are not actually operating concurrently, while the pad we added is preventing this situation.
Take it to the extreme
Now let’s take this concept to the extreme!
These 4 methods receive a slice of integer of different sizes (i.e. 8, 16, 32 and 64) and an index, and increment the integer in that position. Easy, isn't?
Let’s run a benchmark to see how they perform.
As you may imagine, the one using integers of 64 bits performs better than the other:
BenchmarkFalseSharing8-4 2 800431872 ns/op
BenchmarkFalseSharing16-4 2 787015829 ns/op
BenchmarkFalseSharing32-4 2 823191693 ns/op
BenchmarkFalseSharing64-4 13681 76860 ns/op
And it does but now we know why.