Panic! At The Async Runtime Shutdown

While working on Crucible, I've been repeatedly bitten by seemingly "impossible" panics happening at a frustratingly low rate. We tracked the issue down to an edge case in how the Tokio runtime manages spawned tasks.

This post presents a toy example which makes the problem obvious.

Consider a system with two tasks:

use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use tokio::sync::mpsc;

/// Spawns a worker task and a producer task
///
/// Returns a flag which stops the tasks when set
fn spawn_tasks() -> Arc<AtomicBool> {
    let (message_tx, mut message_rx) = mpsc::unbounded_channel();
    let (response_tx, mut response_rx) = mpsc::unbounded_channel();
    let stop = Arc::new(AtomicBool::new(false));

    // Worker task, adds 1 and responds
    tokio::task::spawn(async move {
        while let Some(m) = message_rx.recv().await {
            let _ = response_tx.send(m + 1);
        }
    });

    // Producer task: sends values to the worker and checks each reply
    let stop_ = stop.clone();
    tokio::task::spawn(async move {
        for i in 0.. {
            // Check to see if we should stop
            if stop_.load(Ordering::Acquire) {
                break;
            }
            message_tx.send(i).unwrap();
            let r = response_rx.recv().await.unwrap();
            assert_eq!(r, i + 1);
        }
    });

    // Return the stop handle for the caller to use
    stop
}

Dataflow between the tasks is simple:

[Figure: dataflow in this system — the producer sends values to the worker, and the worker sends responses back]

This system runs while the stop flag is false. Once that flag is set to true, two things will happen in order:

1. The producer sees the flag, breaks out of its loop, and exits, dropping message_tx (and response_rx)
2. The worker's message_rx.recv() then returns None, so the worker exits as well

The producer is safe to unwrap its calls to send and recv, because the worker will always be running while the producer is running...

...or will it?


The root cause of my troubles sounds obvious when put plainly:

The Tokio runtime will drop tasks in an arbitrary order during shutdown.

This means that seemingly "impossible" orderings can happen: during shutdown, the worker task may be dropped (taking response_tx with it) while the producer task is still running. The producer's response_rx.recv() then returns None, and its unwrap() panics.

We can test this out by running in a tight loop:

fn main() {
    for _ in 0..10_000 {
        let rt = tokio::runtime::Runtime::new().unwrap();
        rt.block_on(async {
            let stop = spawn_tasks();
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            stop.store(true, Ordering::Release); // shut everything down
        });

        // `rt` is dropped here, at the end of the loop iteration,
        // tearing down any tasks that are still running
    }
}

Over 10,000 loops, I see 26 panics at response_rx.recv().await.unwrap(). Even in this toy model designed to highlight the issue, it's an infuriatingly rare occurrence!

The default runtime is multi-threaded, which seems like a necessary condition to trigger the issue: you need to have some tasks continuing to run while the runtime drops others. Sure enough, the issue doesn't happen with the single-threaded runtime.
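
To double-check, here's the loop from above with the runtime swapped for Tokio's current-thread builder (a minimal sketch, assuming the same spawn_tasks as before; enable_all is only there so that tokio::time::sleep has a timer to run on):

fn main() {
    for _ in 0..10_000 {
        // Current-thread runtime: every task is polled on this one thread,
        // so no task can keep running while another is being dropped
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .unwrap();
        rt.block_on(async {
            let stop = spawn_tasks();
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            stop.store(true, Ordering::Release);
        });
    }
}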

tokio::test uses the single-threaded runtime by default, so this may not even be visible in unit tests! In our system, we saw these panics both during integration tests and at program exit.
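
For instance, here's what that difference looks like in a test module (a sketch, not from the original code; it assumes spawn_tasks and the imports from above are in scope):

// Default flavor: a current-thread runtime, so the shutdown race
// shouldn't appear here no matter how often the test runs
#[tokio::test]
async fn shutdown_looks_fine() {
    let stop = spawn_tasks();
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    stop.store(true, Ordering::Release);
}

// Opting into the multi-threaded flavor gives the race a chance to
// appear, although (as above) only very rarely per run
#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn shutdown_may_panic() {
    let stop = spawn_tasks();
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    stop.store(true, Ordering::Release);
}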


By stopping tasks in an arbitrary order, Tokio isn't doing anything wrong; indeed, it has no way of knowing which tasks are expected to stop in what order.

Quoth the docs:

Tasks spawned through Runtime::spawn keep running until they yield. Then they are dropped. They are not guaranteed to run to completion, but might do so if they do not yield until completion.

However, I wonder whether the runtime should wait for all tasks to yield before dropping the tasks' data. Having tasks stop in an arbitrary order is fine (and unavoidable), but only dropping their data once all tasks have stopped would prevent this kind of issue.

The feasibility of such a "stop, collect, and drop" strategy would depend on Tokio's internals, which I haven't had a chance to investigate.
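
In the meantime, one workaround in application code is to keep the JoinHandles around and await them after setting the stop flag, so that every task has finished before the runtime is dropped. Here's a sketch along those lines (spawn_tasks_with_handles is a hypothetical variant of the spawn_tasks function above, not something from Crucible):

use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use tokio::{sync::mpsc, task::JoinHandle};

/// Same tasks as before, but their JoinHandles are returned so that the
/// caller can wait for both tasks to finish
fn spawn_tasks_with_handles() -> (Arc<AtomicBool>, [JoinHandle<()>; 2]) {
    let (message_tx, mut message_rx) = mpsc::unbounded_channel();
    let (response_tx, mut response_rx) = mpsc::unbounded_channel();
    let stop = Arc::new(AtomicBool::new(false));

    // Worker task, adds 1 and responds
    let worker = tokio::task::spawn(async move {
        while let Some(m) = message_rx.recv().await {
            let _ = response_tx.send(m + 1);
        }
    });

    // Producer task: sends values to the worker and checks each reply
    let stop_ = stop.clone();
    let producer = tokio::task::spawn(async move {
        for i in 0.. {
            if stop_.load(Ordering::Acquire) {
                break;
            }
            message_tx.send(i).unwrap();
            let r = response_rx.recv().await.unwrap();
            assert_eq!(r, i + 1);
        }
    });

    (stop, [worker, producer])
}

fn main() {
    for _ in 0..10_000 {
        let rt = tokio::runtime::Runtime::new().unwrap();
        rt.block_on(async {
            let (stop, handles) = spawn_tasks_with_handles();
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            stop.store(true, Ordering::Release);

            // Wait for both tasks to exit before block_on returns (and the
            // runtime is dropped), so neither task can be torn down while
            // the other still depends on it
            for h in handles {
                h.await.unwrap();
            }
        });
    }
}

With the awaits in place, shutdown follows the ordering described earlier: the producer exits first, dropping message_tx, then the worker exits once its channel is closed, so the runtime never drops a task out from under another one that is still running.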