The Anatomy of a Race Condition

Matthew Eshleman

4 years ago

Race Condition: When one portion of a system acts upon an event before another portion of the system has delivered the necessary data to correctly process the event in question.
i.e. an assumption in the system that some other event has already occurred when processing a new event.
And we all know what happens when we assume, right?
Matthew Eshleman

Background

We recently purchased our first “new-ish” car since 2005. In our immediate family we owned a 2005 Honda Accord, a 2004 Honda Odyssey, a 2004 Ford Crown Victoria, and a 2004 Honda Accord Coupe. i.e. we were, for the most part, a Honda family. With all of our primary vehicles approaching 200,000 miles, my wife and I decided to spoil ourselves with a little luxury. After an extensive search we purchased a used 2018 Acura MDX. Yes, we couldn’t escape the Honda experience.

What does this have to do with race conditions? Although I generally enjoy the new car, I’m sad to report that it frequently reminds me of this common design flaw.

The Bugs

Like most modern automobiles, our new vehicle contains a plethora of complex electronics implementing both convenience features and safety critical behavior. For the most part they work as desired: the car warns us when we are about to rear-end the vehicle in front of us, happily keeps the vehicle centered in the lane when cruising the interstate, and shuts off half the cylinders when cruising to reduce fuel consumption. These core (and complex) features appear to work correctly and reliably.

But the car also:

Fails to apply my desired seat and steering wheel position settings. (~10% of the time)
Refuses to connect with Apple CarPlay, especially if we connect the phone before powering on the car. (~10% of the time)
- This issue almost never happens if we wait for the vehicle to be fully powered on before connecting the phone.
Fails to turn off the car’s rear view camera view when switching to Drive, preventing us from accessing other functions shared with this particular screen. Thankfully this flaw is very rare, but it does require a full power off of the car to recover.

Each of these issues smell and feel like race conditions. So, what attributes might help us realize we are dealing with a potential race condition?

Attributes of a Race Condition

In my experience an issue smells like a race condition when the bug and associated system exhibit one or more of these attributes:

The issue is rare or unpredictable in nature.
We do not have clear 100% reproducible steps for repeating the flaw.
The issue typically happens when the system is busy.
The bug never occurs if the desired event is delayed in some manner.
The issue does not reproduce when using a debugger.
The issue does not reproduce when we modify logging output levels.
For a microcontroller: we know the firmware is implemented with threads and/or multiple ISRs.
For a system: it consists of multiple independent sub-systems communicating in an asynchronous manner.

Code Review

When we are performing a software code review, how do we spot a potential race condition?

Arbitrary Delay or Sleep

The following code snippet is an example that I would flag in a code review for further examination.

void Functionality()
{
    ActivateAsyncBehavior();
    
    msleep(20); //must wait for external behavior to complete
    
    ReadDataAndActUponIt();
}

Why would I flag this code? There are several reasons:

The sleep is arbitrary and uses a magic number. There is no obvious documented reason why it must wait 20 milliseconds.
What if the asynchronous behavior takes 22 milliseconds to complete? This would be a race condition if the code following the sleep relies on valid data to have been delivered by the asynchronous behavior.
The sleep will prevent the thread of execution from handling other events.

As a general rule, I look at any arbitrary sleep or delay as a code smell. That being said, the following pattern found in many libraries, such as the STM32 CMSIS and HAL libraries, is not necessarily an example of a race condition:

//https://github.com/STMicroelectronics/STM32CubeF7/blob/master/Drivers/STM32F7xx_HAL_Driver/Src/stm32f7xx_hal_uart.c

HAL_StatusTypeDef UART_WaitOnFlagUntilTimeout(UART_HandleTypeDef *huart, uint32_t Flag, FlagStatus Status, uint32_t Tickstart, uint32_t Timeout)
{
  /* Wait until flag is set */
  while ((__HAL_UART_GET_FLAG(huart, Flag) ? SET : RESET) == Status)
  {
    /* Check for the Timeout */
    
    // code removed for this example, see link above for for details
  }
  return HAL_OK;
}

The above code is generally fine by itself. However, what happens if the caller sets the timeout too small and fails to check for the timeout error condition? In that situation we are likely creating a race condition bug.

Order of Events

If I had to guess, I believe this example is a likely candidate for why our Acura sometimes fails to automatically set the seat position. First some background.

When we power off the vehicle, it retracts the steering wheel and moves the seat to a position making it easier for the driver to exit the car. A side effect of this feature is the seat and steering wheel are not in the preferred position when you later re-enter the car. The vehicle attempts to detect which driver is present by detecting which of the two key fobs entered the driver side of the vehicle. Then the car selects the appropriate seat and steering wheel position based on the key fob ID. When it works correctly, the seat and steering wheel automatically return to the driver’s preferred position. Very nice. When it fails to work correctly, the driver may not realize the system’s error until they are driving, at which point the system does not allow you to tap the preferred settings button. The vehicle must be in park. It is important that simple convenience features work reliably because such failures distract (and annoy) the driver.

Given the background, I would imagine the driver seat control software contains some code to receive and process a message indicating the closest key fob, perhaps like this:

enum class KeyFobId 
{ 
  UNKNOWN, 
  KEY_1, 
  KEY_2 
};

static KeyFobId s_lastKey = KeyFobId::UNKNOWN;

void ProcessKeyFobIdentityMsg(void* msg)
{
    s_lastKey = ExtractKeyFobIdFromMsg(msg);
}

Then I would imagine the same processor is capable of processing another message indicating that the seat position should be adjusted:

void ProcessInitiateSeatAdjustMsg()
{
    auto settings = GetSeatSettings(s_lastKey); //notice, uses the 'global' static state
    SetSeatPosition(settings);
}

With the above code, what is the issue? There is a clear race condition. If the key fob identifying message has not arrived, then the ‘seat adjust’ message will be processed with ‘unknown’ as the last key fob state. It seems likely that this is what the Acura is doing because the seat and the steering wheel simply stay in their retracted “exit-the-car” position. These types of issues are difficult to spot in a code review as the reviewer must consider the overall order of events. Additionally, the use of a static state variable may be buried in other utility functions. When we review code for race conditions, we must track the full set of variables and data being used to create the desired behavior as well as when each variable may be updated relative to the code in question. It is sometimes an overwhelming amount of information to track. This is why race conditions are frequently not found during code reviews.

A possible fix

As a side note, how could embedded software developers address an issue like this? One method is to take a hint from our web development peers where RESTful APIs dominate the design landscape. One key constraint of a RESTful service is: “Each request from any client contains all the information necessary to service the request.” With that in mind, our example could be modified as follows:

void ProcessInitialSeatAdjustMsg(void* msg)
{
    auto keyfobId = GetKeyFobIdFromMsg(msg);
    if (keyfobId == KeyFobId::UNKNOWN)
    {
        keyfobId = GetLastValidDriverKeyFobId();
    }
    auto settings = GetSeatSettings(keyfobId);
    SetSeatPosition(settings);
}

There are two important changes:

The message delivers the keyfob ID as part of the payload.
The system/code retains in non-volatile memory the last driver. If the message delivered an ‘unknown’ key, then we at least set the seat position to the last known driver. Of course perhaps we should push that requirement further up the chain and require only valid key fobs be delivered with this message.

Of course, requiring the key fob ID to be part of the incoming message pushes the race condition possibility to another portion of the system. Hopefully, this design decision isolates this concern to a more limited portion of the system.

Announcing an event before changing the data or state

I recently stumbled across the following code in a project I have been helping on:

void OnEnterState()
{
    SetupCode();
    EmitStateChanged();
    mCurrentState = THIS_STATE;
}

Notice the issue? The code was setting the “mCurrentState” variable after emitting a state change event to observers of this class. Since I was already refactoring this code, I went ahead and “fixed” it to the following:

void OnEnterState()
{
    SetupCode();
    mCurrentState = THIS_STATE;
    EmitStateChanged();
}

I retested a few related use cases and all seemed good. However, I later received an email indicating this change had created a bug in another use case (no unit tests in this legacy project). Apparently someone had written code like this:

void OnStateChanged()
{
    if (observed->GetState() == SOME_STATE)  {
        //do this
    } 
    else  {
        //do that
    }
}

The code in question accidentally worked because it was effectively processing on the PREVIOUS state. A race condition was created because the event was received synchronously with the emit and the object in question had not yet actually modified its internal variable to reflect the correct new state. Of course, if the event had been queued (it was queued in some of the use cases, but not all) then the member variable would have been properly set before any queued events were processed and then this code would have exhibited the newly reported bug regardless of my change.

In summary: finish modifying directly related data before announcing the event to any observers.

Conclusion

Software race conditions may create merely irritating bugs or they may create serious life destroying accidents. As engineers we owe it to our end-users to avoid all of the above. Hopefully, these notes will help prevent race conditions in your project.

What additional examples would you add?

References and additional reading

Related Services

Is your embedded software or firmware project troubled by random hard to reproduce issues? If so, please check out our related services: