UNIX’s Achilles’ Heel I

2013-06-21

UNIX has one (to my mind) big hole in its API. Windows, surprisingly, doesn’t share it. This article will discuss that gap, and how it can be worked around.

File Descriptor Wakes

There are a few multiplexing system calls (select(), poll(), and epoll_wait()) that put a thread to sleep, with the wakeup triggered by activity on particular file descriptors. The modern choice is epoll, but classic examples always use select(). They are the same at their core.

As well as monitoring an open file or socket, there are handy ways of making a descriptor that can be waited on in one thread and triggered from another; eventfd() and pipe(), for example.

UNIX signals can be coaxed into fitting this model with signalfd(), which removes some of the async signal handling problems that can plague systems software.

Timers can be created to notify via a file descriptor using the timerfd_create() family of functions.

We can monitor for filesystem changes with the inotify API (inotify_init() and inotify_add_watch()).

All these combine to mean we can, by use of the multiplexing API, listen simultaneously for any mixture of these events, and handle each appropriately as it happens.
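To make that concrete, here’s a minimal sketch (Linux-specific, error handling omitted) of a single epoll_wait() call that sleeps until either another thread signals an eventfd or a timerfd expires:

#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/timerfd.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    int efd = eventfd( 0, 0 );                       // cross-thread wake handle
    int tfd = timerfd_create( CLOCK_MONOTONIC, 0 );  // timer as a descriptor

    itimerspec ts = {};
    ts.it_value.tv_sec = 5;                          // fire once, five seconds from now
    timerfd_settime( tfd, 0, &ts, NULL );

    int ep = epoll_create1( 0 );
    epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = efd;
    epoll_ctl( ep, EPOLL_CTL_ADD, efd, &ev );
    ev.data.fd = tfd;
    epoll_ctl( ep, EPOLL_CTL_ADD, tfd, &ev );

    epoll_event ready;
    epoll_wait( ep, &ready, 1, -1 );                 // sleeps until either descriptor fires
    printf( "woken by fd %d\n", ready.data.fd );

    close( ep ); close( tfd ); close( efd );
    return 0;
}

A listening socket, a signalfd, or an inotify descriptor could be added to the same epoll set in exactly the same way.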

Thread Wakes

Many of the pthread library functions can put a thread to sleep; the core mutex, for example, will block until the thread holding that lock releases it. In theory, that could be used as a wake mechanism. Better, though, is to use the mechanism intended for event notification across threads: the wait condition (pthread_cond_t).

A wait condition is really just a handle that one or more threads can reference and sleep on, and that any other thread can use to trigger a wake.

There is no native API for multiplexing wait conditions. You cannot wait for conditionA or conditionB in one call.

Never The Twain

The hole in UNIX is exemplified by a requirement like this:

Run a thread that wakes up when a network connection is received, or when, say, a new command is pushed onto a command queue.

The problem is that there is no way to bridge wait conditions into the multiplexed APIs. You can’t sleep on both a file descriptor and a wait condition; you can only sleep on one or the other. Worse, as mentioned above, you cannot even wait on multiple wait conditions at once.

Solution #1

The common solution is to abandon wait conditions and use eventfd() or pipe() as the event signalling mechanism.

This will work but has the distinct drawback of using up a file descriptor – a limited resource. It also means you lose a lot of flexibility, because you have to centralise that file descriptor so that it’s known and monitored by the thread that wakes and the thread that sleeps. Your design becomes incapable of abstracting the supply of events – you have to tie the event generator and the event receiver closely together.

This makes it hard to write a library module that supplies events. It ends up having to know who it is supplying events to.
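For concreteness, here’s a sketch of what Solution #1 tends to look like (the names are hypothetical): the command queue and its eventfd are shared globals, which is precisely the coupling described above.

#include <sys/eventfd.h>
#include <unistd.h>
#include <pthread.h>
#include <stdint.h>
#include <deque>
#include <string>

// Both sides must agree on this descriptor (and the queue it signals) up front
std::deque<std::string> commandQueue;
pthread_mutex_t commandQueueLock = PTHREAD_MUTEX_INITIALIZER;
int commandEventFd = eventfd( 0, 0 );

// Producer side: push a command, then poke the descriptor
void pushCommand( const std::string &cmd ) {
    pthread_mutex_lock( &commandQueueLock );
    commandQueue.push_back( cmd );
    pthread_mutex_unlock( &commandQueueLock );
    uint64_t one = 1;
    write( commandEventFd, &one, sizeof one );  // wakes whoever has commandEventFd in their epoll/select set
}

The consumer adds commandEventFd to the same epoll set as, say, its listening socket; when it becomes readable it read()s the counter and drains the queue. Neither side can be written without knowledge of the shared descriptor, which is exactly the isolation problem the rest of this article is about.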

Solution #2

My solution is based on an obscure article by Sun which, now that Sun has gone, I can’t find online. It talked about implementing a Windows-style WaitFor call. The advantage that Windows has in this case is that file descriptors are perfectly acceptable event handles for its WaitFor family. However, Sun’s example was only about implementing the ability to wait on multiple conditions in parallel, not wait conditions and file descriptors. We’ll discuss just that simple case first, and move on to mixing in file descriptors in a later article.
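For comparison, here’s roughly what the Windows side of that looks like (a sketch using the Win32 event API, not code from the Sun article): a single WaitForMultipleObjects() call sleeps until any of the handles it is given becomes signalled.

#include <windows.h>
#include <stdio.h>

int main() {
    // Two independent events; any thread with a handle can signal either one
    HANDLE events[2];
    events[0] = CreateEvent( NULL, FALSE, FALSE, NULL );  // auto-reset, initially unsignalled
    events[1] = CreateEvent( NULL, FALSE, FALSE, NULL );

    SetEvent( events[1] );  // stand-in for another thread signalling the second event

    // Sleeps until *any* handle in the array is signalled
    DWORD r = WaitForMultipleObjects( 2, events, FALSE, INFINITE );
    printf( "woken by event %lu\n", (unsigned long)( r - WAIT_OBJECT_0 ) );

    CloseHandle( events[0] );
    CloseHandle( events[1] );
    return 0;
}

That, for a mixture of wait conditions, is the behaviour we want to recreate on top of pthreads.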

Multiplexed Wait Conditions

The key observation we need to make is that we really can only wait for one wait condition at a time on UNIX. Therefore we are never going to be able to call a single function that takes multiple handles created by pthread_cond_init() and wakes on any of them.

The typical form of use of one of these wait conditions would be something like this:

void waitingThread() {
    // --- waiting thread
    pthread_cond_init( &conditionXGreaterThan10, NULL );
    // ...
    pthread_mutex_lock( &lockX );
    while( !(x > 10) )   // loop rather than if: guards against spurious wakeups
        pthread_cond_wait( &conditionXGreaterThan10, &lockX );
    pthread_mutex_unlock( &lockX );
}

void threadThatModifiesX() {
    // --- thread that modifies X
    pthread_mutex_lock( &lockX );
    x = newValueOfX;
    if( x > 10 )
        pthread_cond_broadcast( &conditionXGreaterThan10 );
    pthread_mutex_unlock( &lockX );
}

Then of course, having been led down that path, we create another wait condition for monitoring an event on, say, y == 0. That all then collapses when we want one thread to also wake on either condition.

We have to focus instead on what we want: the ability to wake on any of multiple events. The trick is the realisation that we can represent those multiple events with a single wait condition.

void waitingThread() {
    pthread_cond_init( &conditionXGreaterThan10, NULL );
    pthread_cond_init( &conditionYEquals0, NULL );
    pthread_cond_init( &conditionXorYEvent, NULL );
    // ...
    // conditionXGreaterThan10 code as above
    // ...
    pthread_mutex_lock( &lockXY );
    while( !(x > 10 || y == 0) )   // loop rather than if: guards against spurious wakeups
        pthread_cond_wait( &conditionXorYEvent, &lockXY );
    pthread_mutex_unlock( &lockXY );
}

void threadThatModifiesX() {
    pthread_mutex_lock( &lockX );
    pthread_mutex_lock( &lockXY );
    x = newValueOfX;
    if( x > 10 ) {
        pthread_cond_broadcast( &conditionXGreaterThan10 );
        pthread_cond_broadcast( &conditionXorYEvent );
    }
    pthread_mutex_unlock( &lockXY );
    pthread_mutex_unlock( &lockX );
}

void threadThatModifiesY() {
    // --- thread that modifies y
    pthread_mutex_lock( &lockY );
    pthread_mutex_lock( &lockXY );
    y = newValueOfY;
    if( y == 0 ) {
        pthread_cond_broadcast( &conditionYEquals0 );
        pthread_cond_broadcast( &conditionXorYEvent );
    }
    pthread_mutex_unlock( &lockXY );
    pthread_mutex_unlock( &lockY );
}

I’ve left the individual events in as well, since other threads might monitor them alone, but the important thing to note is that when we want to wait on multiple events we still wait on only one condition. The difference is that it’s a composite condition, representing all of the possible wake conditions.

This sort of code is obviously unwieldy: we would have to alter every bit of code that modifies a variable to tell it about every bit of code that monitors that variable. Worse, the thread that modifies x has to know at least a little about y, thanks to lockXY. Very bad for code isolation.

The solution is to abstract away the multiple broadcasts and multiple locks so that neither side needs to know anything about the other. Clearly then there are two sides: the notifier and the waiter. Here’s what we’ll aim for (I’m going to flip to C++ now, as we’d get needlessly messy code without some object-oriented facilities):

// --- global
int x = 0;
int y = 100;
TNotifier notifyXGreaterThan10;
TNotifier notifyYEquals0;

void waitingThread() {
    TEventWaiter EW;
    EW.waitfor( &notifyXGreaterThan10 );
    EW.waitfor( &notifyYEquals0 );
    EW.sleep();

    // ... interrogate EW for waker ...
}

void notifyingThread1() {
    // --- notifying thread #1
    notifyXGreaterThan10.lock();
    x = newValueOfX;
    if( x > 10 )
        notifyXGreaterThan10.broadcast();
    notifyXGreaterThan10.unlock();
}

void notifyingThread2() {
    // --- notifying thread #2
    notifyYEquals0.lock();
    y = newValueOfY;
    if( y == 0 )
        notifyYEquals0.broadcast();
    notifyYEquals0.unlock();
}

Notice that the x-modifying thread no longer needs to know anything about y; notice also that we’re back to having the original simple appearance that the first single-condition example had.

It’s all very well to show the end point, but how do we get there? We invert the control: instead of having a wait condition per condition, we have a wait condition per waiter, and attach as many events to it as we want.

The following is nowhere near complete, nor even thread-safe, but it captures the core idea.

#include <pthread.h>
#include <list>

using std::list;

class TNotifier;   // forward declaration; defined below

class TEventWaiter {
  public:
    TEventWaiter() {
        pthread_cond_init( &mEvent, NULL );
        pthread_mutex_init( &mConditionListLock, NULL );
    }

    ~TEventWaiter() {
        // ... iterate through condition list and unsubscribe
    }

    // Sleep until any notifier we have subscribed to broadcasts
    void sleep() {
        pthread_mutex_lock( &mConditionListLock );
        pthread_cond_wait( &mEvent, &mConditionListLock );
        pthread_mutex_unlock( &mConditionListLock );
    }

    // Add a notifier to the set of events that can wake us
    void waitfor( TNotifier *notifier );   // defined below, once TNotifier is complete

    // Called by a notifier when its event occurs
    void broadcast( const TNotifier * ) {
        pthread_cond_broadcast( &mEvent );
    }

  protected:
    pthread_cond_t mEvent;
    mutable pthread_mutex_t mConditionListLock;
    list<TNotifier*> mConditions;
};

class TNotifier {
  public:
    TNotifier() {
        pthread_mutex_init( &mDataLock, NULL );
        pthread_mutex_init( &mSubscriberListLock, NULL );
    }

    // Guard the data this notifier reports on (the lock()/unlock() used earlier)
    void lock()   { pthread_mutex_lock( &mDataLock ); }
    void unlock() { pthread_mutex_unlock( &mDataLock ); }

    // Tell every subscriber the event has occurred
    void broadcast() const {
        pthread_mutex_lock( &mSubscriberListLock );
        list<TEventWaiter*>::const_iterator it;
        for( it = mSubscribers.begin(); it != mSubscribers.end(); ++it ) {
            (*it)->broadcast( this );
        }
        pthread_mutex_unlock( &mSubscriberListLock );
    }

    // Allow a waiter to subscribe to us
    void subscribe( TEventWaiter *subscriber ) {
        pthread_mutex_lock( &mSubscriberListLock );
        mSubscribers.push_back( subscriber );
        pthread_mutex_unlock( &mSubscriberListLock );
    }

  protected:
    pthread_mutex_t mDataLock;
    mutable pthread_mutex_t mSubscriberListLock;
    list<TEventWaiter*> mSubscribers;
};

// Needs the full TNotifier type, so it lives outside the class body
inline void TEventWaiter::waitfor( TNotifier *notifier ) {
    pthread_mutex_lock( &mConditionListLock );
    mConditions.push_back( notifier );
    notifier->subscribe( this );
    pthread_mutex_unlock( &mConditionListLock );
}

This is an extraordinarily sparse implementation, as I’m already well past my quota for code in a blog post. I’ll get around to a fuller, more general implementation and will post it on my github page eventually (emails to prod me when I forget are welcome).
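One gap worth sketching, since the destructor comment above refers to it, is the unsubscribe path. Something like the following would do it, assuming we add an unsubscribe() method to TNotifier and move the TEventWaiter destructor body out of the class (neither appears in the sparse version above):

// Hypothetical addition to TNotifier: remove a waiter from the wake list
void TNotifier::unsubscribe( TEventWaiter *subscriber ) {
    pthread_mutex_lock( &mSubscriberListLock );
    mSubscribers.remove( subscriber );
    pthread_mutex_unlock( &mSubscriberListLock );
}

// Hypothetical destructor body: stop every notifier broadcasting to a dead waiter
TEventWaiter::~TEventWaiter() {
    pthread_mutex_lock( &mConditionListLock );
    for( list<TNotifier*>::iterator it = mConditions.begin(); it != mConditions.end(); ++it )
        (*it)->unsubscribe( this );
    pthread_mutex_unlock( &mConditionListLock );
}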

The important point to note is that these two objects provide, effectively, the two sides of a many-to-many join (although there is no database, of course). One maintains the list of events that can wake it, and the other maintains the list of listeners that should be woken. That is what gives us our strong isolation. When a waiter subscribes to an event it neither knows nor cares whether there are fifty others waiting on the same condition. When a supplier of a notification triggers a wake it does not care if one of its waiters is waiting for fifty other events too.

I’m conscious now that I’ve already (as is my habit) written too much for a single post. I’ll return to this next time when I’ll discuss using this technique to blend wait conditions and file descriptor multiplexing. We’ll also ditch the pthreads dependence, and use the nice new C++11 concurrency libraries instead.
