A survey of concurrency bugs

Writing multithreaded code or even traditional super-loop style firmware involving interrupt service routines (ISRs) is hard work, error prone, and subject to a great deal of confusion. Quoting Koopman, Chapter 20: “Most embedded systems have data concurrency issues in one form or another… It’s important to be thorough and proactive in eliminating concurrency problems. If concurrency bugs make it into code you are trying to test or deploy, it is very difficult and expensive to locate and correct them.” (emphasis mine)

How do we know if our software has concurrency or multithreading bugs? Take a look at the product’s bug database and scan for cases with phrases such as:

  • The device occasionally does not respond…
  • The product outputs, rarely, a garbage value…
  • The system sometimes behaves in an erratic manner…
  • During long term testing, typically after processing a million or so messages, the device will sometimes drop a <insert communication mechanism here> message…
  • The user interface randomly locks-up during heavy use…
  • I could not reproduce, but it just now glitched out…
  • I am never able to reproduce this issue when using the debugger…
  • This bug only reproduces when logging is disabled…

If the bug database contains one or more cases with phrases similar to the above examples, then the product likely has concurrency or related multithreading bugs. Don’t feel bad: there are many ways to get this wrong, thereby creating hard-to-find bugs and erratic or unpredictable behavior in our embedded projects. This post surveys some of the ways we get it wrong, each of which I have observed in real products, and concludes with recommendations on how to get it right.

Need help?

This post was inspired by several recent code review projects where I helped hunt down difficult-to-find concurrency bugs. If your project could use similar help, please check out our code review services.

Background

Skip this section if you are already familiar with the basics of threading and concurrency and their impact on shared data.

Terminology

  • Interrupt Service Routine (ISR): A special software function executed directly by the microcontroller or hardware when a hardware interrupt signal or event occurs. When active, an ISR will interrupt ongoing application code or threads. ISRs typically have special requirements for interacting with the host RTOS.
  • Thread: a sequence of instructions, defined by the application but managed by an operating system or RTOS. Each thread will have its own stack and operating system context and may be considered to be operating concurrently with other threads. Threads start or block (pause) based on application logic and the associated overall state of all other threads in the application. A thread (generally) may have its operations interrupted at any time by the underlying OS or an ISR. All threads in a single application are normally able to access all the memory and resources of the containing application, potentially resulting in “The Bad and the Ugly” results noted later in this post.
  • Atomic: “An indivisible sequence of primitive operations that must complete without interruption…” or so says the Oxford Dictionary of Computer Science. For most firmware developers, this is a single assembly instruction operating on a single variable, such as an integer. Any logical operation that requires more than a single assembly instruction to complete is not atomic.
  • Shared memory: one or more variables or logical data shared across two or more threads or ISRs.
  • volatile (keyword): A C/C++ variable qualifier that informs the compiler that the variable may be modified outside the control or scope of the compiler’s generated code, and therefore the compiler must not optimize away accesses to it. Typically used when declaring variables for hardware register access. Historically, especially in the embedded space, volatile was also used to help ensure atomic access to shared data. More recently, code should use the various atomic data types for shared data access instead.
  • Critical section: a section of code, preferably very small, that explicitly disables interrupts, ensuring no interruptions during the execution of the code in question.
  • Mutex: Short for “mutual exclusion”. A mutex is an operating system provided object enabling locking of a shared resource.

Overview

For a quick overview of the basic challenge, let us consider an imaginary system with two threads of execution. Both threads access shared data, which is not atomic. The following sequence diagram illustrates what happens when both threads write to this shared data.

Two threads accessing Shared Data.

For a different view of the above, consider the state of our Shared Data at each key moment:

The state of Shared Data as it is modified by two threads without appropriate protections.

As we can see in the above figures, multiple threads or ISRs accessing the same non-atomic shared data may result in an undefined or corrupted state of that data. What makes this problem so challenging is that a system may operate for hours, days, or months before the exact sequence necessary to cause the corruption occurs. Hence, we must always be aware of thread and ISR context when developing any software involving threads or ISRs. For concrete examples, continue reading “The Bad and the Ugly” below.

The Bad and the Ugly

Note: The examples here assume a standard 32-bit CPU architecture unless noted otherwise.

Modifying a variable with word width larger than the CPU architecture

Embedded systems may be built on 8-bit, 16-bit, 32-bit, or even 64-bit CPU architectures. This CPU “word width” generally defines the size of our largest atomic variable. For example, on a 32-bit CPU architecture we may generally assume that assigning a value to a 32-bit integer is thread safe and atomic. The following example demonstrates functions accessing a static variable where the access is atomic and thread safe:

//Example using volatile, as this was the typical
//guidance for pre-C11 or C++11 compilers
static volatile uint32_t m_async_signal_A = 0; 

void SetSignalA(uint32_t value)
{
    m_async_signal_A = value;
}

uint32_t GetSignalA()
{
    return m_async_signal_A;
}

We can verify this by using our toolchain to inspect the assembly output (or just hit godbolt.org):

//ARM, none, gcc 5.4.1
SetSignalA(unsigned int):
        ldr     r3, .L2
        str     r0, [r3]        //single instruction. atomic.
        bx      lr
.L2:
        .word   .LANCHOR0
GetSignalA():
        ldr     r3, .L5
        ldr     r0, [r3]        //single instruction. atomic.
        bx      lr
.L5:
        .word   .LANCHOR0

While the above is a positive example showing atomic and thread safe reads and writes, writing to an integer wider than the native CPU word will likely create concurrency risks. For example, if I take the exact same code shown above and build it for the MSP430 (a 16-bit microcontroller), then suddenly our previously atomic code is no longer atomic and no longer thread safe. See below.

SetSignalA(unsigned long):
        PUSHM.W #1, R4
        MOV.W   R1, R4
        MOV.W   R12, &_ZL16m_async_signal_A       //16 bits, instruction 1
        MOV.W   R13, &_ZL16m_async_signal_A+2     //16 bits, instruction 2
        POPM.W  #1, r4
        RET
GetSignalA():
        PUSHM.W #1, R4
        MOV.W   R1, R4
        MOV.W   &_ZL16m_async_signal_A, R12      //16 bits, instruction 1
        MOV.W   &_ZL16m_async_signal_A+2, R13    //16 bits, instruction 2
        POPM.W  #1, r4
        RET

Due to the CPU architecture change, the same code now requires two instructions to access the 32-bit variable. This is not thread safe. This is not ISR safe. If the CPU is interrupted in the middle of those two instructions and a higher priority thread or ISR modifies the same variable, then when the lower priority code executes again, the shared data may be corrupted. The temptation is to look at the code and think: “that will never happen.” However, inevitable feature creep drives modifications and new features, and then one day a new “rare” or “hard to reproduce” bug report will be filed. As Koopman noted in our opening quote: “It’s important to be thorough and proactive in eliminating concurrency problems.”
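
One possible repair is to wrap the multi-instruction access in a critical section. Here is a minimal sketch, assuming the EnterCritical()/ExitCritical() helpers used later in this post (every RTOS or vendor SDK provides an equivalent):

#include <cstdint>

//Assumed helpers: disable/restore interrupts (RTOS or vendor specific)
extern void EnterCritical(void);
extern void ExitCritical(void);

static volatile uint32_t m_async_signal_A = 0;

void SetSignalA(uint32_t value)
{
    EnterCritical();            //no thread or ISR may interrupt us now
    m_async_signal_A = value;   //two MOV.W instructions on the MSP430
    ExitCritical();
}

uint32_t GetSignalA(void)
{
    EnterCritical();
    const uint32_t value = m_async_signal_A;
    ExitCritical();
    return value;
}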

Math or bit operations on an ‘atomic’ variable

In many cases our so-called atomic variables may only be atomic in the sense of a trivial read or write operation. They are rarely atomic for setting bits, clearing bits, or performing common math operations such as addition or subtraction. This is because these operations typically read the value into a register, perform the operation, and then write the results back out to the shared data’s memory location. Due to the need for multiple CPU instructions, these operations are not atomic and likely require a critical section or mutex. This is especially true with older toolchains, which remain prevalent in the embedded space. Let us look at an example.

//Example using volatile, as this was the typical
//guidance for pre-C11 or C++11 compilers
static volatile uint32_t m_shared = 0; 

void IncrementShared()
{
    ++m_shared;
}

void SetSharedBit(uint8_t bit)
{
    m_shared |= (1 << bit);
}

For the above two trivial functions the compiler generates the following assembly:

IncrementShared():
        ldr     r2, .L3
        ldr     r3, [r2]
        add     r3, r3, #1          //+1. But, the result is in a register
        str     r3, [r2]            //results written to the shared data
        bx      lr
.L3:
        .word   .LANCHOR0
SetSharedBit(unsigned char):
        mov     r1, #1
        ldr     r2, .L6
        ldr     r3, [r2]
        orr     r0, r3, r1, lsl r0  //OR. But, the result is in a register
        str     r0, [r2]            //Results now written to the shared data
        bx      lr
.L6:
        .word   .LANCHOR0

The above examples, demonstrating that a read-modify-write sequence requires multiple CPU instructions, apply to the older recommended approach of using ‘volatile’ for our shared data variables. However, with modern compilers we now have the formally defined ‘atomic’ data types. If we change our example to use an atomic data type:

#include <cstdint>
#include <atomic>

static std::atomic_uint32_t m_shared {0}; 

void IncrementShared()
{
    ++m_shared;
}

void SetSharedBit(uint8_t bit)
{
    m_shared |= (1 << bit);
}

Our compiler now generates (ARM gcc 8.3.1 none):

IncrementShared():
        push    {r4, lr}
        mov     r2, #5
        mov     r1, #1
        ldr     r0, .L4
        bl      __atomic_fetch_add_4    //branch to a mysterious function
        pop     {r4, lr}
        bx      lr
.L4:
        .word   .LANCHOR0
SetSharedBit(unsigned char):
        mov     r1, #1
        push    {r4, lr}
        lsl     r1, r1, r0
        mov     r2, #5
        ldr     r0, .L8
        bl      __atomic_fetch_or_4   //branch to a mysterious function
        pop     {r4, lr}
        bx      lr
.L8:
        .word   .LANCHOR0

What? Our previously easy-to-understand assembly now jumps to obscured library functions. Given that the functions are named “atomic”, we should trust our toolchain and assume we now have truly atomic and therefore thread safe operations. Note: your mileage may vary as not all CPU architectures support all expected atomic operations. In that event, the compiler and/or linker should issue an error.
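
If in doubt, C++17 lets us confirm at compile time that a given atomic type is lock free on the target (C++11 offers the runtime is_lock_free() query). A quick sanity check worth considering:

#include <atomic>
#include <cstdint>

//Fails the build if 32-bit atomic operations would fall back to library locks
static_assert(std::atomic<uint32_t>::is_always_lock_free,
              "32-bit atomics are not lock-free on this target");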

Hardware registers

I would be remiss if I did not point out that all of the above applies to memory mapped hardware registers as well. The first time I encountered this was on a Toshiba SoC many years ago. The GPIO lines were all controlled via 32-bit registers, with a single GPIO line controlled per bit, so 32 separate GPIO lines were controlled through a single register. From an SoC developer’s point of view, that was easy and made sense. However, this was annoying and problematic on the software side. We now needed a mutex to protect access to these registers, thereby ensuring atomic access. With the mutex in place, separate threads could safely control different GPIO lines, all while writing to the same shared hardware register.
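
A sketch of that repair, assuming a hypothetical RTOS mutex API (Mutex, MutexLock(), MutexUnlock()) and an illustrative register address:

#include <cstdint>

//Hypothetical memory mapped 32-bit GPIO output register (address illustrative)
static volatile uint32_t* const GPIO_OUT =
    reinterpret_cast<volatile uint32_t*>(0x40001000);

static Mutex m_gpio_mutex;  //hypothetical RTOS mutex guarding GPIO_OUT

void SetGpioLine(uint8_t line)
{
    MutexLock(&m_gpio_mutex);    //serialize the read-modify-write sequence
    *GPIO_OUT |= (1u << line);   //load, OR, store: multiple instructions
    MutexUnlock(&m_gpio_mutex);
}

void ClearGpioLine(uint8_t line)
{
    MutexLock(&m_gpio_mutex);
    *GPIO_OUT &= ~(1u << line);
    MutexUnlock(&m_gpio_mutex);
}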

Lately, semiconductor vendors have realized they can play a part in solving issues like this. For example, the new RP2040 microcontroller on the Raspberry Pi Pico explicitly notes this concern (Section 2.3.1.2. GPIO Control): “The OUT and OE registers also have atomic SET, CLR, and XOR aliases, which allows software to update a subset of the pins in one operation. This is vital not only for safe parallel GPIO access between the two cores, but also safe concurrent GPIO access in an interrupt handler and foreground code running on one core.” Nice! I certainly hope that more microcontroller and SoC vendors follow this example.

Using multiple shared state variables in logic

A single variable may be atomic and thread safe for trivial operations such as reading the variable or writing a value to the variable. However, logic that evaluates multiple shared variables is not atomic as a whole, so the group must be collectively protected with a critical section or mutex. For example, the following code reads two static variables to determine a true or false result. This example is, strictly speaking, not thread safe.

#include <stdatomic.h>

//various 'signal' values written to in another thread or ISR context
static atomic_uint m_async_signal_A = 0; //C11 atomic type used
static atomic_uint m_async_signal_B = 0;

bool IsModuleReady()
{
   return (m_async_signal_A > 0) && (m_async_signal_B > 0);
}

Why is the above example not thread safe? Because the state of either of the two “async_signal” variables could change during the evaluation of the logic shown. Here is where this gets difficult: it may not matter. Maybe the caller does not care if the result is, strictly speaking, ‘wrong’ at that exact moment. Maybe the caller will check again in 100ms when all is stable. Or, maybe being incorrect at that moment results in the product missing a critical real-time deadline. Or, maybe given that a human wrote the logic, the same human assumed the values were stable in subsequent logic. In any case, it is strictly speaking not thread safe.

To repair the above we could consider a critical section or a mutex. The below example has been repaired and is now thread safe.

bool IsModuleReady()
{
    bool isReady = false;
    
    EnterCritical();
    isReady = (m_async_signal_A > 0) && (m_async_signal_B > 0);
    ExitCritical();
    
    return isReady;
}

Note the new calls to EnterCritical() and ExitCritical(). Every RTOS or microcontroller provides a method to ensure a block of code is not interrupted by other threads or interrupt service routines. These blocks are typically referred to as “critical sections”.

Of course, now we must worry about abusing critical sections. Too much code inside a critical section? This may cause the system to miss real-time deadlines. If that is the case, consider a mutex to isolate the impact to only the threads interacting with the shared data in question. Of course, as soon as we start using mutexes, we increase the likelihood of other forms of concurrency failures (see “Threading Behavior Issues” below). Every approach involves risks and tradeoffs.

How else might this issue emerge in our code? Consider a typical ‘struct’ with multiple data members. Although each individual data member may be atomic, the collection is not atomic, and logic using multiple members is not thread safe. Copying a data structure, although it may represent a single line of C or C++ code, will typically involve multiple CPU instructions, each of which may be interrupted. ‘structs’ are rarely thread safe without critical section or mutex protections.
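
Here is a minimal sketch of a thread safe struct copy, again assuming the EnterCritical()/ExitCritical() helpers:

#include <cstdint>

struct Telemetry               //example struct: members are individually atomic
{
    uint32_t timestamp;
    int32_t  temperature;
    uint32_t error_count;
};

static Telemetry m_shared_telemetry;   //written by another thread or ISR

Telemetry GetTelemetrySnapshot(void)
{
    EnterCritical();                            //freeze all members together
    const Telemetry copy = m_shared_telemetry;  //multi-instruction struct copy
    ExitCritical();
    return copy;   //the caller now owns a consistent private copy
}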

Using a critical section for some, but not all

Another common mistake: forgetting to apply a critical section (or mutex). For example, the following code properly applies critical sections to a shared data array:

#include <cstdint>
#include <cstring>

static uint8_t m_shared_data[16];

void Init()
{
    EnterCritical();
    memset(m_shared_data, 0, sizeof(m_shared_data));
    ExitCritical();
}

void FillIncrement(uint8_t starting_value)
{
    EnterCritical();
    for (size_t i = 0; i < sizeof(m_shared_data); ++i)
    {
        m_shared_data[i] = starting_value++;
    }
    ExitCritical();
}

//hundreds of lines of other code...

However, months or maybe years later, a developer adds a ‘Reset’ function to the module. Now we see the following:

//hundreds of lines of other code...

void Reset()
{
    //other code

    memset(m_shared_data, 0, sizeof(m_shared_data));

    //other code
}

With that simple change, the module is now subject to concurrency bugs. These mistakes are perhaps understandable in maintenance, but they can happen at any point during development. For example, if the shared data is a global variable (Koopman chapter 19: “Global Variables Are Evil”), then a developer accessing the variable in one module may not notice the critical section or mutex use in another module developed by yet-another-engineer.

C or C++ standard library usage

Assuming that standard library functions are thread safe is another common mistake. In fact, the first assumption should be that any standard library function or method is not thread safe.

For example, one mistake might be to feed events to a thread through a queue, reaching for the tempting standard library std::queue instead of an operating system provided queue. Example:

#include <cstdint>
#include <queue>

static std::queue<uint8_t> m_shared_queue;

//external API: PushToThread()
void PushToThread(uint8_t item)
{
    m_shared_queue.push(item);
}

//internal thread (simplified example)
void MyThread()
{
    for(;;)
    {
        if (!m_shared_queue.empty())
        {
            ProcessItem(m_shared_queue.front());
            m_shared_queue.pop();
        }
    }
}

This is not thread safe, as the C++ std::queue container is nothing more than a queue data structure, with no consideration for threads or even the underlying operating system, if any. Neither is std::vector, std::list, or any other standard library container class. If the code requires multiple threads to access a container object, then it is your responsibility to ensure thread safety. Wrap the calls in a critical section or mutex, as shown in the sketch below.
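
Here is a sketch of the queue example repaired with the critical section helpers used earlier (a mutex would work equally well):

#include <cstdint>
#include <queue>

static std::queue<uint8_t> m_shared_queue;

void PushToThread(uint8_t item)
{
    EnterCritical();
    m_shared_queue.push(item);   //may allocate memory: definitely not atomic
    ExitCritical();
}

void MyThread()
{
    for (;;)
    {
        bool haveItem = false;
        uint8_t item = 0;

        EnterCritical();
        if (!m_shared_queue.empty())
        {
            item = m_shared_queue.front();
            m_shared_queue.pop();
            haveItem = true;
        }
        ExitCritical();

        if (haveItem)
        {
            ProcessItem(item);   //process outside the critical section
        }
    }
}

Note that ProcessItem() executes outside the critical section, keeping the locked region as small as possible.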

It gets worse. We cannot even assume the heap (malloc(), free(), new, delete) is thread safe. On a modern desktop operating system (Windows, Linux, macOS), we can generally assume heap access is thread safe. However, for a bare-metal firmware project or a project using an RTOS, we must not assume thread safety. Read the associated documentation and confirm. Never assume a standard library function is thread safe.

Threading Behavior Issues

Now that we have added critical sections or mutex locking to our software in order to solve our “bad and ugly” concurrency problems, the problem is solved, right? I am afraid not. Although we have likely prevented concurrency bugs involving shared data corruption, we now need to be aware of a new class of bugs and error prone behavior that may be introduced by our changes.

Deadlock

Thread deadlock may occur when two or more threads require locks on two or more separate resources. Deadlock may occur if the threads do not all follow the same order when locking the various mutexes. I am clearly not the first to enumerate this risk, so rather than re-inventing the wheel, see the Resources and References section at the end of this post for deeper treatments.
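
A minimal sketch of the hazard, reusing the hypothetical MutexLock()/MutexUnlock() helpers from the GPIO example above:

static Mutex m_mutex_A;   //protects resource A
static Mutex m_mutex_B;   //protects resource B

void Thread1(void)
{
    MutexLock(&m_mutex_A);   //Thread1 holds A...
    MutexLock(&m_mutex_B);   //...and waits for B
    //...use both resources...
    MutexUnlock(&m_mutex_B);
    MutexUnlock(&m_mutex_A);
}

void Thread2(void)
{
    MutexLock(&m_mutex_B);   //Thread2 holds B...
    MutexLock(&m_mutex_A);   //...and waits for A: deadlock when interleaved with Thread1
    //...use both resources...
    MutexUnlock(&m_mutex_A);
    MutexUnlock(&m_mutex_B);
}

The classic prevention: define a single global lock ordering and acquire mutexes in that order in every thread.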

Priority Inversion

Priority inversion is a special condition where a higher priority thread is blocked by a combination of lower priority threads and their related locking behavior, resulting in a missed real-time deadline. Thankfully, nearly all modern operating systems offer options to help prevent this condition. However, we must remember to enable these options when creating a mutex. Again, rather than describing an already well documented issue, I will point to the Resources and References section at the end of this post.
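
For example, on a POSIX system priority inheritance is an opt-in mutex attribute; many RTOSes expose a similar flag at mutex creation time. A sketch:

#include <pthread.h>

static pthread_mutex_t m_mutex;

void CreateMutexWithPriorityInheritance(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);

    //A low priority thread holding this mutex temporarily inherits
    //the priority of any higher priority thread blocked on it
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);

    pthread_mutex_init(&m_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}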

Race Conditions

Strictly speaking, a race condition is possible regardless of the use of threads, interrupts, and concurrent behavior. It simply requires that code erroneously assume a condition has taken place which in fact has yet to occur. However, the likelihood of race conditions increases when we add extensive asynchronous behavior through the use of multiple threads and interrupt service routines. I have written on race conditions before, and so have many others; see the Resources and References section below.

The Good

As we see in our “The Bad and the Ugly” examples, there are many ways to create concurrency bugs. Thankfully, there are also many ways to write solid and thread-safe code for our complex embedded systems. The most reliable methods involve using a consistent and strict pattern of thread behavior. Options to consider include the following:

Use only fully event driven threads (aka Active Objects)

Inspired by Samek’s book many years ago, I consider a software architecture using exclusively event driven active objects to be my preferred and go-to approach for delivering modern, reliable firmware and embedded software. In the strictest embodiment of this approach, we restrict every thread (active object) in the system to the following characteristics:

  • Blocks exclusively on a single event queue, awaiting an event to process
  • Every event contains a copy of all data associated with the event. i.e. no shared data access
  • All data accessed by a thread (active object) is owned exclusively by the active object. There is no shared data access, either directly or indirectly
  • All inter-thread interactions are fully asynchronous
  • Every thread avoids (ideally eliminates) any internal wait states or other forms of blocking, i.e. every event is processed in a strict “run to completion” manner.

If we strictly follow the above guidance, then common concurrency bugs involving shared data are eliminated. We might still accidentally create a race condition, but we have fundamentally eliminated the other forms of concurrency bugs.
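
A minimal sketch of such a thread, assuming a hypothetical RTOS queue API where QueueReceive() blocks until an event arrives and DispatchEvent() is the active object’s state machine handler:

#include <cstdint>

struct Event      //each event carries a copy of all of its data
{
    uint32_t signal;
    uint32_t payload;
};

static Queue m_event_queue;   //hypothetical RTOS queue, owned by this thread

void ActiveObjectThread(void)
{
    for (;;)
    {
        Event event;
        QueueReceive(&m_event_queue, &event);  //block only here, on the queue
        DispatchEvent(&event);                 //run to completion: no internal blocking
    }
}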

All that being said, it is not necessarily easy to implement modern firmware with such strict constraints. The temptation to use third-party libraries that do not adhere to these rules is often too great, or such libraries may be required due to the complexity of a particular SoC or communications stack. When you find yourself mixing design models, tread carefully, and do your best to isolate the offending code to a single thread context.

Use only a super loop with no interrupt service routines

The simplest choice (if anything is simple): create the firmware with a single super loop using no threads and no interrupt service routines. Extreme? Perhaps, but this will guarantee no concurrency or threading bugs. I am not the only person to recommend this, as I recently found that Koopman makes the same recommendation in his book Better Embedded System Software – Chapter 14, where he discusses the concept of “Pure static scheduling.” This approach is also the easiest to analyze for real-time deadlines and general ‘safety.’ Admittedly, I have yet to personally implement firmware using this approach in a commercial setting (to-date, only university projects). However, I would recommend this approach for the most critical “must not fail” firmware.
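
A minimal sketch of the pattern, with hypothetical polled task functions and no ISRs:

int main(void)
{
    HardwareInit();   //hypothetical board bring-up; interrupts stay disabled

    for (;;)          //the super loop: the only thread of execution
    {
        PollUart();           //each task polls its hardware...
        PollTimers();         //...and runs to completion before returning
        UpdateStateMachine();
        UpdateOutputs();
    }
}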

Resources and References

I hope this post helps you prevent concurrency and multithreading bugs in your embedded software! For further reading on this topic, please see the references below.

  • Phil Koopman, Better Embedded System Software.
  • Miro Samek, Practical UML Statecharts in C/C++: Event-Driven Programming for Embedded Systems, 2nd Edition.
  • Raspberry Pi RP2040 Datasheet, Section 2.3.1.2, GPIO Control.

Need help?

If expert help or advice is needed on issues like this, please check out our related services.

3 comments

  1. I’m a big fan of the superloop architecture for embedded projects, precisely to avoid concurrency bugs.
    Yes, I do use interrupts, so concurrency bugs can still be introduced, but sharing data between the superloop and an ISR is much simpler than sharing data between threads in a preemptive RTOS.

    ISRs are generally used for serial communications: the RX ISR simply moves the received char into a FIFO buffer in RAM, and the TX ISR pulls the next char to transmit from another FIFO buffer in RAM. This is a well-known problem with well-established solutions that avoid concurrency bugs.

    I use a hardware timer to control timed tasks and actions, but I don’t use its ISR at all. After configuring and starting the timer, the superloop only reads the current counter value. This read must be atomic, but if we run on a 32-bit CPU with a 32-bit timer, the read is atomic. Starting from the hardware ticks, I created a software timer library that is called by superloop code only, so again there is no risk of concurrency bugs.

    In the past I was often tempted to switch to a preemptive RTOS, but ultimately I never found enough benefit to outweigh the risk of introducing concurrency bugs. Yes, this reflects my own knowledge and experience, but at the moment I can live without any RTOS at all.

    Just one example: I was able to run an Ethernet network board featuring a full TCP/IP stack (lwIP), TLS (mbedTLS), and an MQTT client (also from lwIP), with a simple superloop and a few interrupts.

    1. Thank you for your comment. No real argument here, although I still prefer RTOS approaches. That being said, for a truly critical real-time or safety-oriented design, I think I would strongly consider a superloop, or a near equivalent RTOS, an example being Samek’s QV kernel: https://www.state-machine.com/qpcpp/arm-cm_qv.html.

      Ultimately we prevent these types of defects by being aware of how they emerge in our code, and then making design decisions to outright avoid them in the first place.

      Thank you again!
