# Panic! At The Async Runtime Shutdown
While working on Crucible, I've been repeatedly bitten by seemingly "impossible" panics happening at a frustratingly low rate. We tracked the issue down to an edge case in how the Tokio runtime manages spawned tasks.
This post presents a toy example which makes the problem obvious.
Consider a system with two tasks:
- The worker receives a number, adds 1, and sends it back
- The producer sends a stream of numbers to the worker, and checks its results
```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use tokio::sync::mpsc;

/// Spawns a worker and producer task
///
/// Returns a flag which stops the tasks when set
fn spawn_tasks() -> Arc<AtomicBool> {
    let (message_tx, mut message_rx) = mpsc::unbounded_channel();
    let (response_tx, mut response_rx) = mpsc::unbounded_channel();
    let stop = Arc::new(AtomicBool::new(false));

    // Worker task, adds 1 and responds
    tokio::task::spawn(async move {
        while let Some(m) = message_rx.recv().await {
            let _ = response_tx.send(m + 1);
        }
    });

    // Producer task, sends stuff to the worker and checks its reply
    let stop_ = stop.clone();
    tokio::task::spawn(async move {
        for i in 0.. {
            // Check to see if we should stop
            if stop_.load(Ordering::Acquire) {
                break;
            }
            message_tx.send(i).unwrap();
            let r = response_rx.recv().await.unwrap();
            assert_eq!(r, i + 1);
        }
    });

    // Return the stop handle for the caller to use
    stop
}
```
Dataflow between the tasks is simple: the producer sends numbers to the worker on one channel, and the worker sends replies back on the other.
This system runs while the `stop` flag is `false`. Once that flag is set to `true`, two things will happen in order:

- The producer sees the flag and stops, dropping `message_tx`
- The worker reads `None` from `message_rx` and stops looping
The producer is safe to unwrap its calls to `send` and `recv`, because the worker should always be running while the producer is running:

- The worker only stops once `message_tx` is dropped
- `message_tx` is held by the producer
- Therefore, the worker will never stop before the producer
...or will it?
The root cause for my troubles sounds obvious when put plainly:
The Tokio runtime will drop tasks in an arbitrary order during shutdown.
This means that seemingly "impossible" orderings can happen:
- The producer is waiting for a reply at `response_rx.recv().await`
- We begin shutting down the Tokio runtime
- The runtime destroys the worker task, which drops `response_tx`
- The producer reads `None` from `response_rx` and panics when unwrapping it!
We can test this out by running in a tight loop:
```rust
fn main() {
    for _ in 0..10_000 {
        let rt = tokio::runtime::Runtime::new().unwrap();
        rt.block_on(async {
            let stop = spawn_tasks();
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            stop.store(true, Ordering::Release); // shut everything down
        });
        // the runtime is dropped here, at the end of each iteration!
    }
}
```
Over 10,000 loops, I see 26 panics at `response_rx.recv().await.unwrap()`.
Even in this toy model designed to highlight the issue, it's an infuriatingly
rare occurrence!
The default runtime is multi-threaded, which seems like a necessary condition to trigger the issue: you need to have some tasks continuing to run while the runtime drops others. Sure enough, the issue doesn't happen with the single-threaded runtime.
`#[tokio::test]` uses the single-threaded runtime by default, so this may not even be visible in unit tests! In our system, we saw these panics during both integration tests and at program exit.
By stopping tasks in an arbitrary order, Tokio isn't doing anything wrong; indeed, it has no way of knowing which tasks are expected to stop in what order.
Quoth the docs:
> Tasks spawned through `Runtime::spawn` keep running until they yield. Then they are dropped. They are not guaranteed to run to completion, but might do so if they do not yield until completion.
However, I wonder whether the runtime should wait for all tasks to yield before dropping the tasks' data. Having tasks stop in an arbitrary order is fine (and unavoidable), but only dropping their data once all tasks have stopped would prevent this kind of issue.
The feasibility of such a "stop, collect, and drop" strategy would depend on Tokio's internals, which I haven't had a chance to investigate.