Valgrind II — helgrind

2013-02-25

Last time we looked at valgrind's memory leak detector, memcheck. This time we'll be looking at an even more difficult-to-catch class of bug: thread race conditions.

First let's be clear what a thread race is: a race happens when two threads access the same piece of memory with no locking, at least one of them writing to it, and the outcome differs depending on which thread gets there first. I've discussed race conditions before in the context of interrupt handlers on embedded microcontrollers. Exactly the same set of problems exists for threaded programs, since each thread can effectively interrupt the other at any point.
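To see why something as small as incrementing a shared counter can go wrong, remember that an increment is really a read-modify-write sequence. Written out long-hand (purely an illustration; the compiler emits the equivalent machine instructions, and counter is just a stand-in name):

int tmp = counter;  // 1. load the shared value
tmp = tmp + 1;      // 2. increment a private copy
counter = tmp;      // 3. store it back

If a second thread runs the same three steps in between this thread's load and store, both threads write back the same value and one increment is lost.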

This time we're thinking about programs on full desktop systems, and we've got access to the helgrind and drd tools in valgrind to help us find races.

Let’s write a C++ program using pthreads with some runtime faults in it.

// C++
#include <iostream>
// C
#include <stdlib.h>
// OS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>

using namespace std;

// cerr lock, statically initialized
pthread_mutex_t cerrLock = PTHREAD_MUTEX_INITIALIZER;

// misappropriated lock: locked in one thread, unlocked in another
pthread_mutex_t misappropriatedLock = PTHREAD_MUTEX_INITIALIZER;

// shared variable both threads increment without holding a lock
int racer = 0;


void *thread1_main( void * )
{
    // This is good locking, and won't trigger helgrind.
    pthread_mutex_lock( &cerrLock );
    cerr << "[THREAD1] Begin" << endl;
    pthread_mutex_unlock( &cerrLock );

    // Sleep briefly to encourage interleaving, then touch the shared
    // variable without taking any lock.
    usleep(100000);
    racer++;

    pthread_mutex_lock( &cerrLock );
    cerr << "[THREAD1] End" << endl;
    pthread_mutex_unlock( &cerrLock );

    // Lock in this thread
    pthread_mutex_lock( &misappropriatedLock );

    // Exiting with lock held
    return NULL;
}

void *thread2_main( void * )
{
    pthread_mutex_lock( &cerrLock );
    cerr << "[THREAD2] Begin" << endl;
    pthread_mutex_unlock( &cerrLock );

    usleep(200000);
    racer++;

    pthread_mutex_lock( &cerrLock );
    cerr << "[THREAD2] End" << endl;
    pthread_mutex_unlock( &cerrLock );

    // unlock in this thread
    pthread_mutex_unlock( &misappropriatedLock );

    return NULL;
}

int main(void)
{
    pthread_t thread1, thread2;

    // Start threads
    pthread_create( &thread1, NULL, thread1_main, NULL );
    pthread_create( &thread2, NULL, thread2_main, NULL );

    // Wait for them to finish
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    // Whether or not a race condition happens, corrupting the racer
    // variable, we expect helgrind to detect our bad accesses of it in
    // the two threads.
    if( racer != 2 ) {
        cerr << "race detected" << endl;
    } else {
        cerr << "race not detected "<< endl;
    }

    // API misuse
    pthread_mutex_t testLock;
    // Unlocking an invalid lock
    pthread_mutex_unlock(&testLock);
    // Destroying an invalid lock
    pthread_mutex_destroy(&testLock);


    pthread_mutex_init(&testLock, NULL);
    // Not destroying an allocated mutex

    return 0;
}

// Command line
//  $ g++ -ggdb3 -O2 -Wall -rdynamic -Wextra -c -o threadracer.o threadracer.cc
//  $ g++ -pthread -o threadracer threadracer.o

Running it under helgrind:

valgrind --tool=helgrind \
                --log-file=vg-threadracer.log \
                ./threadracer
[THREAD1] Begin 
[THREAD2] Begin
[THREAD1] End
[THREAD2] End
race not detected

Now let’s look at the log.

==26230== Helgrind, a thread error detector
==26230== Copyright (C) 2007-2012, and GNU GPL'd, by OpenWorks LLP et al.
==26230== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==26230== Command: ./threadracer
==26230== Parent PID: 26229
==26230== 
==26230== ---Thread-Announcement------------------------------------------
==26230== 
==26230== Thread #2 was created
==26230==    at 0x4271778: clone (clone.S:111)
==26230==    by 0x41893BB: pthread_create@@GLIBC_2.1 (createthread.c:256)
==26230==    by 0x402A790: pthread_create_WRK (hg_intercepts.c:255)
==26230==    by 0x8048A4D: main (threadracer.cc:69)
==26230== 
==26230== ----------------------------------------------------------------
==26230== 
==26230== Thread #2: Exiting thread still holds 1 lock
==26230==    at 0x4188CCC: start_thread (pthread_create.c:439)
==26230==    by 0x427178D: clone (clone.S:130)
==26230== 

Locks should be acquired and released in the same thread, so leaving one locked when a thread exits will probably lead to another thread deadlocking on that lock.
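To make the consequence concrete, here's a hypothetical extra thread that is not in the program above. If it ran after thread 1 had exited still holding the lock, and nothing else ever released it, it would simply never get past the lock call:

// Hypothetical extra thread, for illustration only.
void *thread3_main( void * )
{
    // thread1 exited while still holding misappropriatedLock; if
    // nothing else ever unlocks it, this call blocks forever.
    pthread_mutex_lock( &misappropriatedLock );

    return NULL;
}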

==26230== ---Thread-Announcement------------------------------------------
==26230== 
==26230== Thread #3 was created
==26230==    at 0x4271778: clone (clone.S:111)
==26230==    by 0x41893BB: pthread_create@@GLIBC_2.1 (createthread.c:256)
==26230==    by 0x402A790: pthread_create_WRK (hg_intercepts.c:255)
==26230==    by 0x8048A71: main (threadracer.cc:70)
==26230== 
==26230== ----------------------------------------------------------------
==26230== 
==26230== Possible data race during read of size 4 at 0x804A330 by thread #3
==26230== Locks held: none
==26230==    at 0x8048CDA: thread2_main(void*) (threadracer.cc:52)
==26230==    by 0x4188C38: start_thread (pthread_create.c:304)
==26230==    by 0x427178D: clone (clone.S:130)
==26230== 
==26230== This conflicts with a previous write of size 4 by thread #2
==26230== Locks held: none
==26230==    at 0x8048E2A: thread1_main(void*) (threadracer.cc:32)
==26230==    by 0x4188C38: start_thread (pthread_create.c:304)
==26230==    by 0x427178D: clone (clone.S:130)
==26230== 

Magical! helgrind has cleverly found us a race. Quite correctly, it's warning us that the same memory is accessed on line 52 and line 32 without a protective lock.
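The fix is the same medicine we've already given cerr: funnel every access to the shared variable through a lock. A minimal sketch, where racerLock and increment_racer are hypothetical additions rather than part of the listing above:

// Hypothetical mutex guarding the shared counter.
pthread_mutex_t racerLock = PTHREAD_MUTEX_INITIALIZER;

// Both threads would call this instead of touching racer directly.
void increment_racer( void )
{
    pthread_mutex_lock( &racerLock );
    racer++;
    pthread_mutex_unlock( &racerLock );
}

With every write to racer made under the same lock (and the final read in main() already ordered after both pthread_join calls), helgrind has no unordered pair of accesses left to complain about.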

==26230== ----------------------------------------------------------------
==26230== 
==26230== Thread #3 unlocked lock at 0x804A334 currently held by thread #2
==26230==    at 0x402AFB6: pthread_mutex_unlock (hg_intercepts.c:609)
==26230==    by 0x8048D50: thread2_main(void*) (threadracer.cc:59)
==26230==    by 0x4188C38: start_thread (pthread_create.c:304)
==26230==    by 0x427178D: clone (clone.S:130)
==26230==   Lock at 0x804A334 was first observed
==26230==    at 0x402AC84: pthread_mutex_lock (hg_intercepts.c:495)
==26230==    by 0x8048EA0: thread1_main(void*) (threadracer.cc:39)
==26230==    by 0x4188C38: start_thread (pthread_create.c:304)
==26230==    by 0x427178D: clone (clone.S:130)
==26230== 

This fault is because we've released a lock in thread #3 on line 59 that was acquired in thread #2 on line 39.

==26230== ---Thread-Announcement------------------------------------------
==26230== 
==26230== Thread #1 is the program's root thread
==26230== 
==26230== ----------------------------------------------------------------
==26230== 
==26230== Thread #1 unlocked an invalid lock at 0xBEB69BC8
==26230==    at 0x402AFB6: pthread_mutex_unlock (hg_intercepts.c:609)
==26230==    by 0x8048ACA: main (threadracer.cc:88)
==26230== 

We've released a lock on line 88 that was never initialized. That's an API error at the very least, and it can easily result in a crash.
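For a mutex that isn't statically initialized, the lifecycle should look roughly like this. It's only a sketch of the correct pattern, not code from the program above:

pthread_mutex_t testLock;

// Initialize before first use...
pthread_mutex_init( &testLock, NULL );

// ...only unlock what has actually been locked, in the same thread...
pthread_mutex_lock( &testLock );
pthread_mutex_unlock( &testLock );

// ...and destroy it once nothing can still be using it.
pthread_mutex_destroy( &testLock );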

==26230== ----------------------------------------------------------------
==26230== 
==26230== Thread #1's call to pthread_mutex_unlock failed
==26230==    with error code 1 (EPERM: Operation not permitted)
==26230==    at 0x402B033: pthread_mutex_unlock (hg_intercepts.c:618)
==26230==    by 0x8048ACA: main (threadracer.cc:88)
==26230== 

Same fault really: using an uninitialized lock has resulted in an EPERM (operation not permitted) error from the pthreads library.

==26230== ----------------------------------------------------------------
==26230== 
==26230== Thread #1: pthread_mutex_destroy with invalid argument
==26230==    at 0x402AABC: pthread_mutex_destroy (hg_intercepts.c:455)
==26230==    by 0x8048AD2: main (threadracer.cc:90)
==26230== 
==26230== 
==26230== For counts of detected and suppressed errors, rerun with: -v
==26230== Use --history-level=approx or =none to gain increased speed, at
==26230== the cost of reduced accuracy of conflicting-access information
==26230== ERROR SUMMARY: 7 errors from 7 contexts (suppressed: 33 from 33)

Unlocking the invalid lock is bad, and so is destroying it; helgrind has caught that too.

valgrind has another thread-checking tool, drd, which serves a similar purpose to helgrind but is (reportedly) faster. It's not quite as thorough, though.
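The invocation is the same shape as for helgrind; only the tool name changes (the log file name below is just my pick):

valgrind --tool=drd \
                --log-file=vg-threadracer-drd.log \
                ./threadracer

And here's its output on the same executable.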

==30573== drd, a thread error detector
==30573== Copyright (C) 2006-2012, and GNU GPL'd, by Bart Van Assche.
==30573== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==30573== Command: ./threadracer
==30573== Parent PID: 30572
==30573== 
==30573== Mutex still locked at thread exit: mutex 0x804a334, recursion count 1, owner 2.
==30573==    at 0x402B6FC: pthread_join (drd_pthread_intercepts.c:509)
==30573==    by 0x8048A85: main (threadracer.cc:73)
==30573== mutex 0x804a334 was first observed at:
==30573==    at 0x402CA7F: pthread_mutex_lock (drd_pthread_intercepts.c:650)
==30573==    by 0x8048EA0: thread1_main(void*) (threadracer.cc:39)
==30573==    by 0x418FC38: start_thread (pthread_create.c:304)
==30573==    by 0x427878D: clone (clone.S:130)
==30573== 
==30573== Thread 3:
==30573== Conflicting load by thread 3 at 0x0804a330 size 4
==30573==    at 0x8048CDA: thread2_main(void*) (threadracer.cc:52)
==30573==    by 0x418FC38: start_thread (pthread_create.c:304)
==30573==    by 0x427878D: clone (clone.S:130)
==30573== Allocation context: BSS section of /home/andyp/dev/projects/valgrind/threadracer
==30573== Other segment start (thread 2)
==30573==    (thread finished, call stack no longer available)
==30573== Other segment end (thread 2)
==30573==    (thread finished, call stack no longer available)
==30573== 
==30573== Conflicting store by thread 3 at 0x0804a330 size 4
==30573==    at 0x8048CDA: thread2_main(void*) (threadracer.cc:52)
==30573==    by 0x418FC38: start_thread (pthread_create.c:304)
==30573==    by 0x427878D: clone (clone.S:130)
==30573== Allocation context: BSS section of /home/andyp/dev/projects/valgrind/threadracer
==30573== Other segment start (thread 2)
==30573==    (thread finished, call stack no longer available)
==30573== Other segment end (thread 2)
==30573==    (thread finished, call stack no longer available)
==30573== 
==30573== Mutex not locked: mutex 0x804a334, recursion count 1, owner 0.
==30573==    at 0x402D2EF: pthread_mutex_unlock (drd_pthread_intercepts.c:703)
==30573==    by 0x8048D50: thread2_main(void*) (threadracer.cc:59)
==30573==    by 0x418FC38: start_thread (pthread_create.c:304)
==30573==    by 0x427878D: clone (clone.S:130)
==30573== mutex 0x804a334 was first observed at:
==30573==    at 0x402CA7F: pthread_mutex_lock (drd_pthread_intercepts.c:650)
==30573==    by 0x8048EA0: thread1_main(void*) (threadracer.cc:39)
==30573==    by 0x418FC38: start_thread (pthread_create.c:304)
==30573==    by 0x427878D: clone (clone.S:130)
==30573== 
==30573== Thread 1:
==30573== The object at address 0xbed05bc8 is not a mutex.
==30573==    at 0x402D2EF: pthread_mutex_unlock (drd_pthread_intercepts.c:703)
==30573==    by 0x8048ACA: main (threadracer.cc:88)
==30573== 
==30573== The object at address 0xbed05bc8 is not a mutex.
==30573==    at 0x402C850: pthread_mutex_destroy (drd_pthread_intercepts.c:636)
==30573==    by 0x8048AD2: main (threadracer.cc:90)
==30573== 
==30573== 
==30573== For counts of detected and suppressed errors, rerun with: -v
==30573== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 50 from 18)

I won't go through these individually; they're the same set of faults, just reported by a different tool.
