Guided by the beauty of our test suite
This is a story about making software better.
Last summer, I published Raven, an independent implementation of the Uxn CPU and Varvara ordinator. In short, this is a fictional computer designed for cross-platform portability: programs targeting it can be run on any platform that implements the specification.
Here's my implementation running the catclock demo:
Earlier this week, a kind contributor opened a few PRs to make the system work better with their image viewer. They also complimented the performance of the system, saying
I can happily say that with this fix, svitlyna runs almost 50% faster than with uxn11, and that's on AMD64, meaning this is running without optimized assembly. I'm really impressed!
Given the appreciative words, I was well motivated to review their PRs!
I hadn't touched the code since July of 2024, so it took me a little while to get up to speed. I had previously done ad-hoc testing while deeply immersed in the system; coming back months later, it was hard to touch the code and remain confident that I hadn't introduced any new issues.
Over the long weekend, I set up infrastructure to alleviate this lack of confidence, and I'd like to tell you about it. Some of it is table stakes (CI testing on PRs), but there are also a few more exotic tricks (fuzzing and statically preventing panics).
Github CI
This one is pretty basic: I set up a Github Actions job to run on pull requests. I won't bore you with the entire YAML file, but it does a typical set of checks:
- Build and test across all native platforms (macOS, Linux, Windows)
- Confirm that the WebAssembly build completes without errors
- Run Clippy on all platforms
- Make sure that the code is `rustfmt`-clean
There were two surprises here. The first surprise was the sad state of Github's Windows runners, which were nearly 10x slower than the Mac and Linux runners:
Complaining on Bluesky revealed this to be a common issue, so I don't think I'm doing anything wrong here.
The second surprise was that the `ubuntu-24.04-arm` runner was extremely unreliable! For example, here's Clippy segfaulting (?!). Rerunning the job succeeds – and `aarch64-unknown-linux-gnu` is a Tier 1 target – so I'd put my money on the job runners being at fault.
The ARM runners are very new, having entered "public preview" only two weeks ago, so I disabled that target and moved on.
Actual snapshot testing
The Varvara specification includes a Screen device, which lets you draw pixels into the window. It's a surprisingly complex device, because it supports automatically drawing entire sprites (this makes more sense once you learn that the platform is inspired by the NES).
I added automatic tests which run the system for some number of cycles, then capture the state of the screen and compare it with a known-good image saved in the repository. The test passes if the image is identical, and fails with a visual comparison otherwise.
To exercise more of the system, the test also performs mouse and keyboard interactions while running. When I first added these interactions, a few of the tests failed, because the mouse and keyboard changed the visual state of the system! This turned out to be a good demonstration of the failure rendering (click for full resolution):
If the new image is actually correct – as it was in this situation – then deleting the previous known-good image will cause it to be regenerated on the next run.
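For flavor, here's a minimal sketch of what one of these snapshot tests can look like. The harness function `run_rom_for_cycles`, the ROM path, and the snapshot location are hypothetical stand-ins for the real test infrastructure (and the real tests also inject mouse and keyboard events, and render a visual comparison on failure); the compare-or-regenerate logic is the point.

```rust
use image::RgbaImage;
use std::path::Path;

/// Hypothetical harness: boot the ROM, run it for `cycles`, and return the
/// screen as (width, height, RGBA bytes).
fn run_rom_for_cycles(rom: &str, cycles: u64) -> (u32, u32, Vec<u8>) {
    unimplemented!("stand-in for the real Varvara test harness: {rom} for {cycles} cycles")
}

#[test]
fn catclock_snapshot() {
    let (w, h, pixels) = run_rom_for_cycles("roms/catclock.rom", 100_000);
    let actual = RgbaImage::from_raw(w, h, pixels).unwrap();

    let snapshot = Path::new("tests/snapshots/catclock.png");
    if !snapshot.exists() {
        // No known-good image: save the current frame so future runs can compare
        actual.save(snapshot).unwrap();
        return;
    }
    let expected = image::open(snapshot).unwrap().to_rgba8();
    assert!(
        expected.as_raw() == actual.as_raw(),
        "screen contents changed; delete {snapshot:?} to regenerate",
    );
}
```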
Statically proving the absence of panics
In my previous writeup, I bragged that the VM implementation is completely free of panics. Unfortunately, I had to check this property by hand, by disassembling the executable and scanning through the implementation of `Uxn::run`.

As part of the robustification process, I borrowed a trick from `no-panic` to prove the absence of panics at compile time. This strategy was new to me (though well-known in the Rust community), so let me show you what it looks like:
#[inline(never)]
pub fn div_no_panic(data: &[u8]) {
    // Define a struct which calls a non-existent function when dropped
    struct NoPanic;
    extern "C" {
        #[link_name = "div_may_panic"]
        fn trigger() -> !;
    }
    impl ::core::ops::Drop for NoPanic {
        fn drop(&mut self) {
            unsafe {
                trigger();
            }
        }
    }

    // Build an instance of that struct
    let guard = NoPanic;

    // Perform the possibly-panicking operations
    let mut ram = UxnRam::new();
    let mut vm = Uxn::new(&mut ram, Backend::Interpreter);
    let _ = vm.reset(data);
    let pc = 0x100;
    for d in data {
        vm.stack.push(Value::Byte(*d));
        vm.ret.push(Value::Byte(*d));
    }
    vm.div(pc);

    // If we didn't panic, then forget the guard (skipping its drop function)
    core::mem::forget(guard);
}
Panics in Rust perform "unwinding": when a panic occurs, destructors are called for all live objects. In this code, the destructor for `NoPanic` calls an `extern "C"` function named `div_may_panic`. Unfortunately, this function doesn't exist!
Normally, this would fail at link time with an error, because the linker can't resolve the external function. However, if there are no panic paths, a set of optimizations becomes possible:
- If there are no panics, then `core::mem::forget(guard)` is always called
- If we always `forget` the guard, then `NoPanic::drop` is never called
- If `NoPanic::drop` is never called, then it doesn't need to be in the binary
- If `NoPanic::drop` is not included, then we never link against `div_may_panic`
In other words, this will fail with a linker error if the compiler believes that the code may panic. The test may have false positives – building without optimization will fail, because the compiler is less aggressive – so I only build these tests in release mode.
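I won't claim this is exactly how the repository gates it, but one simple way to keep these checks out of unoptimized builds is to compile the whole module only when debug assertions are off, which is a reasonable proxy for release mode:

```rust
// Only compile the no-panic checks in optimized builds, where the
// linker trick is meaningful (a sketch; the real gating may differ).
#[cfg(all(test, not(debug_assertions)))]
mod no_panic {
    // div_no_panic and friends live here
}
```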
The `div_no_panic` function needs to be instantiated, which we do with a small test:
#[test]
fn div() {
    let data = std::hint::black_box(&[]);
    div_no_panic(data);
}
This test also stops the compiler from getting too clever with optimizations: passing a `black_box` object into the function means that the VM may be initialized with any values for its data and stacks. Without the `black_box`, the compiler is smart enough to notice that the data and stacks are all zeros; combined with aggressive inlining, we end up checking the much weaker condition of "this operation can't panic with a zero-initialized VM", instead of "this operation can't panic with a VM in any state".
As an example, division is defined as `a.checked_div(b).unwrap_or(0)`. If I instead write `a / b`, it can panic when `b == 0`. Sure enough, this leads to a linker error:
= note: ld: Undefined symbols:
_div_may_panic, referenced from:
raven_uxn::test::no_panic::div::no_panic::h04b990e75bc12e0c in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
raven_uxn::test::no_panic::div::no_panic::h686d8f2cdb0db704 in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
raven_uxn::test::no_panic::div::no_panic::h8383a3b65857da3c in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
raven_uxn::test::no_panic::div::no_panic::h87538c337282e2a2 in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
raven_uxn::test::no_panic::div::no_panic::hcf7507c29ba59298 in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)
It's not exactly beautiful, but it works well enough. The custom link name tells me which operation is responsible for the failure, making it easier to track down.
I ended up writing a bunch of macros to check that every single opcode function is panic-free. Theoretically, I could have just tested `Uxn::op`, which dispatches to opcode functions based on a `u8` argument; unfortunately, even though each individual opcode function is panic-free, the optimizer couldn't prove the same of `Uxn::op`. Being subject to the whims of the compiler is a definite downside to this trick!
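As a rough sketch of the shape of those macros (the metavariable names and the exact setup are my own invention, patterned on the `div_no_panic` example above):

```rust
// Generates a `*_no_panic` function for one opcode, following the pattern
// from div_no_panic above. Invoked once per opcode.
macro_rules! op_no_panic {
    ($fn_name:ident, $op:ident, $link_name:literal) => {
        #[inline(never)]
        pub fn $fn_name(data: &[u8]) {
            struct NoPanic;
            extern "C" {
                #[link_name = $link_name]
                fn trigger() -> !;
            }
            impl Drop for NoPanic {
                fn drop(&mut self) {
                    unsafe { trigger() }
                }
            }
            let guard = NoPanic;

            // Same setup as div_no_panic, but calling the opcode under test
            let mut ram = UxnRam::new();
            let mut vm = Uxn::new(&mut ram, Backend::Interpreter);
            let _ = vm.reset(data);
            for d in data {
                vm.stack.push(Value::Byte(*d));
                vm.ret.push(Value::Byte(*d));
            }
            vm.$op(0x100);

            core::mem::forget(guard);
        }
    };
}

op_no_panic!(add_no_panic, add, "add_may_panic");
op_no_panic!(sft_no_panic, sft, "sft_may_panic");
// ...and so on, for every opcode
```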
Having panic-free code checked at compile time is a somewhat exotic tool, but is helpful for high-assurance systems. At work, my colleagues and I use a similar strategy to assert that no panics are present in critical parts of our embedded system, e.g. the supervisor task.
Fuzzing
The Uxn CPU implementation in Raven comes in two flavors:
- The baseline interpreter is written in 100% safe Rust and is guaranteed panic-free (as discussed above)
- The "native" interpreter uses hand-written assembly to run about 30% faster, see in "Beating the compiler" for an explanation of how this works
These two implementations should both be faithful to the Uxn specification, and behave identically. However, when testing svitlyna (an image viewer), I found that they diverged: the baseline interpreter worked fine, but the native interpreter failed to load any images!
This failure meant that I could not admire the sample image of a cat, which is obviously an intolerable situation:
Divergence between the baseline and native interpreters had always been a fear of mine, and I already had hundreds of lines of unit tests to compare between them. These unit tests ran simple instruction sequences and confirmed that both interpreters ended up in identical states.
Obviously, they weren't sufficient, so I pulled out a bigger hammer: `cargo-fuzz`.
In fuzz testing, we provide a random input to the system and check whether it behaves correctly. Sufficiently advanced fuzzing libraries will actually monitor which code paths are exercised (using coverage instrumentation), and are smart about mutating the input to fully explore the program's state space.
For my interpreter, the random input is a ROM, which is just a slice of bytes. The correctness check compares the machine state between the baseline and native interpreters after the program terminates. "Machine state" comprises RAM (64 KiB), the data and return stacks (a `u8` index plus 256 bytes each), and device memory (256 bytes); all 66306 bytes must be identical!
It's easy for the fuzzer to discover (and get stuck in) infinite loops, so the test harness also bails out after hitting some max instruction count.
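Put together, the fuzz target is roughly the following sketch. The `step` and `state_bytes` helpers (and the `Backend::Native` variant name) are assumptions on my part, standing in for whatever the crate actually exposes; the differential structure is what matters.

```rust
// fuzz/fuzz_targets/cpu.rs (sketch)
#![no_main]
use libfuzzer_sys::fuzz_target;
use raven_uxn::{Backend, Uxn, UxnRam};

/// Bail out of fuzzer-discovered infinite loops after this many instructions
const MAX_INSTRUCTIONS: u64 = 1 << 20;

/// Run a ROM on the given backend and return its flattened machine state:
/// RAM + both stacks + device memory, 66306 bytes in total.
/// (`step` and `state_bytes` are hypothetical helpers.)
fn run(rom: &[u8], backend: Backend) -> Vec<u8> {
    let mut ram = UxnRam::new();
    let mut vm = Uxn::new(&mut ram, backend);
    let _ = vm.reset(rom);
    let mut pc = 0x100;
    for _ in 0..MAX_INSTRUCTIONS {
        match vm.step(pc) {
            Some(next) => pc = next, // keep going
            None => break,           // BRK: the program terminated
        }
    }
    vm.state_bytes()
}

fuzz_target!(|rom: &[u8]| {
    let a = run(rom, Backend::Interpreter);
    let b = run(rom, Backend::Native);
    assert!(a == b, "baseline and native interpreters diverged");
});
```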
When the fuzzer finds a case that doesn't match, the test harness saves the failing input and pretty-prints the opcode sequence. For example, it may tell me that `ADDk SFT2kr SWP2r EQU2r SWP2r LDR2kr SFT2kr` fails; `cargo fuzz tmin` then reduces the test case to `EQU2r SWP2r`.
The fuzzer found three opcodes where the interpreters produced different results. The first two were straightforward typos in my assembly implementations:
--- a/raven-uxn/src/native/aarch64.s
+++ b/raven-uxn/src/native/aarch64.s
@@ -874,9 +874,9 @@
 _NIP2r:
     ldrb w9, [x2, x3]
     rpop
     ldrb w10, [x2, x3]
     rpop
     strb w9, [x2, x3]
-    sub x11, x1, #1
+    sub x11, x3, #1
     and x11, x11, #0xff
-    strb w10, [x2, x31]
+    strb w10, [x2, x11]
     next

@@ -885,7 +885,7 @@
 _SWP2r:
     ldrb w11, [x2, x3]      ; get the top byte
     rpeek w12, x9, 2        ; get the second-from-top byte
     strb w11, [x2, x9]      ; do the swap!
     strb w12, [x2, x3]
     rpeek w11, x9, 1
-    rpeek w12, x10, 2
+    rpeek w12, x10, 3
     strb w11, [x2, x10]
     strb w12, [x2, x9]
The third discrepancy, however, was interesting, because it was correct in assembly and incorrect in Rust!
The `SFT` opcode reads a byte from the stack and splits it into two nibbles (4 bits each). It then reads another byte from the stack and shifts it right (controlled by the first nibble), then left (controlled by the second nibble). It's a one-size-fits-all instruction for power-of-two division, multiplication, and masking of low bits.
In Rust, I implemented it as follows:
#[inline]
pub fn sft(&mut self, pc: u16) {
    let shift = self.stack.pop_byte();
    let shr = u32::from(shift & 0xF);
    let shl = u32::from(shift >> 4);
    let v = self.stack.pop_byte();
    self.stack.push(v.wrapping_shr(shr).wrapping_shl(shl));
}
Your eyes may be drawn to `wrapping_shr` and `wrapping_shl`; what's wrong with `<<` and `>>`? This was part of my strategy for eliminating panics, because it's illegal to shift an integer by more than its width!
For example, want to calculate `1u8 << 8`? Right to jail, right away. `15u16 >> 18`? Believe it or not, jail. We have the best numerical operators in the world, because of jail.
(To be pedantic, this panics in debug builds; in release builds, with `overflow-checks` disabled, the out-of-range shift amount is silently masked to the type's width rather than causing Undefined Behavior. You should turn on `overflow-checks` in release builds to make it panic there as well.)
To suss out the intended behavior, I looked at the reference implementation:
case 0x1f: /* SFT */
t=T;n=N;
SET(2,-1) T = n >> (t & 0xf) << (t >> 4);
break;
In this case, `n` is a `Uint16`, and the maximum shift is 15, so this is all legal (hooray!). More importantly, it defines our semantics: we should shift in zeros if the right-shift exceeds 8 bits, and discard bits that overflow when shifting left.
This is what I thought I was doing with `wrapping_shr` and `wrapping_shl`. Unfortunately, the "wrapping" in those function names applies to the shift amount, not the shifted value! Specifically, if the shift amount is larger than the maximum valid shift, the invalid bits are masked off of the shift amount. (I'm not the only person confused about this.)
It turns out that `unbounded_shl` and `unbounded_shr` have the desired semantics, but they're still experimental and nightly-only. In the meantime, I can write the expression using checked shifts:

    v.checked_shr(shr).unwrap_or(0).checked_shl(shl).unwrap_or(0)
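To make the difference concrete, here's a small standalone example (not from the repository) for a shift amount that exceeds the width of a `u8`:

```rust
fn main() {
    let v: u8 = 0xFF;
    // `wrapping_shr` wraps the *amount*: 9 is masked to 9 & 0x7 == 1,
    // so bits survive the shift...
    assert_eq!(v.wrapping_shr(9), 0x7F);
    // ...while `checked_shr` reports the out-of-range amount as `None`,
    // and `unwrap_or(0)` gives the "shift in zeros" semantics that the
    // reference implementation defines.
    assert_eq!(v.checked_shr(9).unwrap_or(0), 0);
}
```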
In assembly, I was using `w`-sized registers (32 bits) for all operands, so it was already doing the right thing:
_SFT:
    ldrb w10, [x0, x1]   ; load the shift amount
    pop                  ; pop the shift amount from the stack
    ldrb w11, [x0, x1]   ; load the value to be shifted
    lsr w12, w10, 4      ; calculate left shift amount
    and w10, w10, #0xf   ; calculate right shift amount
    lsr w11, w11, w10    ; do the right shift
    lsl w11, w11, w12    ; do the left shift
    strb w11, [x0, x1]   ; write the value back
    next
With these three opcodes fixed, I ran the fuzzer with 8 threads for 15-20 minutes. My laptop got very warm, but it failed to find any further issues!
Wrapping up
I don't have any immediate plans for Raven, but next time I'm inclined to touch it, I'll be starting from a more solid foundation. For example, fuzz testing means it's possible for me (or someone else!) to contribute an x86-64 assembly backend with relative confidence!
Software is never perfect, and there are only so many hours in the day, but I found it rewarding to refine this small project into a microcosm of engineering rigor. I appreciate the tools and infrastructure that let me create robust and correct systems: Rust, `cargo fuzz` / `libFuzzer`, even (sigh) Github Actions.
My baseline interpreter is now (1) written in 100% safe Rust, (2) guaranteed not to panic, no matter the inputs, and (3) up to 50% faster than the reference implementation in C. Even better, 2/3 of those properties are guaranteed at compile time and checked in CI, so they shouldn't regress!
Finally, the title of this post is a reference to a song by Leonard Cohen, which is well worth your time.