
I'm using Cygwin GCC and run this code:

#include <iostream>
#include <thread>
#include <vector>
using namespace std;

unsigned u = 0;

void foo()
{
    u++;
}

int main()
{
    vector<thread> threads;
    for(int i = 0; i < 1000; i++) {
        threads.push_back (thread (foo));
    }
    for (auto& t : threads) t.join();

    cout << u << endl;
    return 0;
}

Compiled with the line: g++ -Wall -fexceptions -g -std=c++14 -c main.cpp -o main.o.

It prints 1000, which is correct. However, I expected a smaller number, with threads overwriting a previously incremented value. Why does this code not suffer from a race condition?

My test machine has 4 cores, and I put no restrictions on the program that I know of.

The behavior persists when I replace the body of the shared foo with something more complex, e.g.

if (u % 3 == 0) {
    u += 4;
} else {
    u -= 1;
}
Intel CPUs have some amazing internal "shoot down" logic to preserve compatibility with very early x86 CPUs used in SMP systems (like dual Pentium Pro machines). Lots of failure conditions that we are taught are possible almost never actually happen on x86 machines. So say a core goes to write u back to memory. The CPU will actually do amazing things like notice that the memory line for u isn't in the CPU's cache and it will restart the increment operation. This is why going from x86 to other architectures can be an eye opening experience! – David Schwartz 2 days ago
Why does this code NOT suffer from race condition? -- No code suffers from a race condition, until it suffers from a race condition. – PaulMcKenzie 2 days ago
Testing for undefined behavior is never conclusive, and the absence of its guaranteed manifestation is why it is called undefined. A race condition is undefined behavior, and if it manifested itself in a clear and observable way all the time, it wouldn't be called undefined behavior, it would be called something else. – SergeyA 2 days ago
Absence of evidence is not evidence of absence. – Roger Rowland 2 days ago
Just as with undefined behavior, "has a race condition" does not mean "something bad will happen". Not until you demo your program for your most important client... – Pete Becker yesterday

foo() is so short that each thread probably finishes before the next one even gets spawned. If you add a sleep for a random time in foo() before the u++, you may start seeing what you expect.

This indeed changed the output in the expected way. – mafu 2 days ago
I would note that this is in general a rather good strategy for exhibiting race conditions. You should be able to inject a pause between any two operations; if not, there's a race condition. – Matthieu M. yesterday
We had just this issue with C# recently. Code almost never failed usually, but the recent addition of an API call in between introduced enough delay to make it consistently change. – Obsidian Phoenix 17 hours ago
@MatthieuM. Doesn't Microsoft have an automated tool that does exactly that, as a method of both detecting race conditions and making them reliably reproducible? – Mason Wheeler 12 hours ago
@MasonWheeler: I work nigh exclusively on Linux, so... dunno :( – Matthieu M. 12 hours ago

It is important to understand that a race condition does not guarantee the code will run incorrectly, merely that it could do anything, since it is undefined behavior. That includes running as expected.

Particularly on x86 and AMD64 machines, race conditions in some cases rarely cause issues, as many of the instructions are atomic and the coherency guarantees are very high. These guarantees are somewhat reduced on multiprocessor systems, where the lock prefix is needed for many instructions to be atomic.

If, on your machine, the increment is an atomic operation, this will likely run correctly, even though according to the language standard it is undefined behavior.

Specifically, I expect that in this case the code may be compiled to an atomic fetch-and-add instruction (ADD or XADD in x86 assembly), which is indeed atomic on single-processor systems; however, on multiprocessor systems this is not guaranteed to be atomic, and a lock prefix would be required to make it so. If you are running on a multiprocessor system, there will be a window in which threads could interfere and produce incorrect results.

Specifically I compiled your code to assembly using https://godbolt.org/ and foo() compiles to:

foo():
        add     DWORD PTR u[rip], 1
        ret

This means it solely performs an add instruction, which for a single processor will be atomic (though, as mentioned above, not so for a multiprocessor system).

It's important to remember that "running as intended" is a permissible outcome of undefined behavior. – Mark 2 days ago
As you indicated, this instruction is not atomic on an SMP machine (which all modern systems are). Even inc [u] is not atomic. The LOCK prefix is required to make an instruction truly atomic. The OP is simply getting lucky. Recall that even though you're telling the CPU "add 1 to the word at this address", the CPU still has to fetch, increment, store that value and another CPU can do the same thing simultaneously, causing the result to be incorrect. – Jonathon Reinhart 2 days ago
I down-voted, but then I re-read your question and realized that your atomicity statements were assuming a single CPU. If you edit your question to make this more clear (when you say "atomic", be clear that this is only the case on a single CPU), then I will be able to remove my down-vote. – Jonathon Reinhart 2 days ago
Downvoted, I find this claim a bit meh: "Particularly on X86 and AMD64 machines race conditions in some cases rarely cause issues as many of the instructions are atomic and the coherency guarantees are very high." The paragraph should start by making the explicit assumption that you're focusing on a single core. Even so, multi-core architectures are the de facto standard nowadays in consumer devices, so I would consider this a corner case to explain last, rather than first. – Patrick Trentin 2 days ago
Hmm, no. This is wrong. add DWORD PTR u[rip], 1 or any instruction with a memory operand is not atomic on x86. add rax, 1 (or any instruction with a register operand) would be atomic, but when you have a memory operand, the instruction is decomposed into ~3 micro-ops (a load, an add, and a store in this case), which means it is not atomic. I guess you kind of try and weasel out of this by saying it is only atomic on a single-CPU system, but that doesn't make much sense. It's still not atomic, by definition, and with multiple threads, you're still hosed without a lock prefix. – Cody Gray 2 days ago

I think it is not so much a question of whether you put a sleep before or after the u++. Rather, the operation u++ translates to code that is, compared to the overhead of spawning threads that call foo, performed very quickly, so it is unlikely to be interleaved. However, if you "prolong" the operation u++, the race condition becomes much more likely:

void foo()
{
    unsigned i = u;
    for (int s=0;s<10000;s++);
    u = i+1;
}

result: 694

BTW: I also tried

if (u % 2) {
    u += 2;
} else {
    u -= 1;
}

and it gave me 1997 most of the time, but sometimes 1995.

I would expect on any vaguely sane compiler that whole function would be optimized to the same thing. I am surprised it wasn't. Thank you for the interesting result. – Vality 2 days ago
This is exactly correct. Many thousands of instructions need to run before the next thread starts executing the tiny function in question. When you make the execution time in the function closer to the thread creation overhead, you see the impact of the race condition. – Jonathon Reinhart 2 days ago
@Vality: I also expected it to delete the spurious for-loop under O3 optimization. It doesn't? – user21820 yesterday
How could else u -= 1 ever be executed? Even in a parallel environment the value should never fail the % 2 test, should it? – mafu yesterday
from the output, it looks like else u -= 1 is executed once, the first time foo() is called, when u == 0. The remaining 999 times u is odd and u += 2 is executed resulting in u = -1 + 999 * 2 = 1997; i.e. the correct output. A race condition sometimes causes one of the +=2 to be overwritten by a parallel thread and you get 1995. – Luke yesterday

It does suffer from a race condition. Put usleep(1000); before u++; in foo and I see different output (< 1000) each time.

  1. The likely answer to why the race condition didn't manifest for you, though it does exist, is that foo() is so fast, compared to the time it takes to start a thread, that each thread finishes before the next can even start. But...

  2. Even with your original version, the result varies by system: I tried it your way on a (quad-core) Macbook, and in ten runs, I got 1000 three times, 999 six times, and 998 once. So the race is somewhat rare, but clearly present.

  3. You compiled with '-g', which has a way of making bugs disappear. I recompiled your code, still unchanged but without the '-g', and the race became much more pronounced: I got 1000 once, 999 three times, 998 twice, 997 twice, 996 once, and 992 once.

  4. Regarding the suggestion of adding a sleep -- that helps, but (a) a fixed sleep time leaves the threads still skewed by start time (subject to timer resolution), and (b) a random sleep spreads them out when what we want is to pull them closer together. Instead, I'd code them to wait for a start signal, so I can create them all before letting them get to work. With this version (with or without '-g'), I get results all over the place, as low as 974, and no higher than 998:

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>
    using namespace std;
    
    unsigned u = 0;
    atomic<bool> start{false};  // atomic, so the spin-wait on start is not itself a data race
    
    void foo()
    {
        while (!start) {
            std::this_thread::yield();
        }
        u++;
    }
    
    int main()
    {
        vector<thread> threads;
        for(int i = 0; i < 1000; i++) {
            threads.push_back (thread (foo));
        }
        start = true;
        for (auto& t : threads) t.join();
    
        cout << u << endl;
        return 0;
    }
    
Just a note. The -g flag does not in any way "make bugs disappear." The -g flag on both GNU and Clang compilers simply adds debug symbols to the compiled binary. This allows you to run diagnostic tools like GDB and Memcheck on your programs with some human readable output. For example when Memcheck is run over a program with a memory leak it will not tell you the line number unless the program was built using the -g flag. – MS-DDOS 6 hours ago
