Guided by the beauty of our test suite
This is a story about making software better.
Last summer, I published Raven, an independent implementation of the Uxn CPU and Varvara ordinator. In short, this is a fictional computer designed for cross-platform portability: programs targeting it can be run on any platform that implements the specification.
Here's my implementation running the catclock demo:
Earlier this week, a kind contributor opened a few PRs to make the system work better with their image viewer. They also complimented the performance of the system, saying
I can happily say that with this fix, svitlyna runs almost 50% faster than with uxn11, and that's on AMD64, meaning this is running without optimized assembly. I'm really impressed!
Given the appreciative words, I was well motivated to review their PRs!
I hadn't touched the code since July of 2024, so it took me a little while to get up to speed. I had previously done ad-hoc testing while deeply immersed in the system; coming back months later, it was hard to touch the code and remain confident that I hadn't introduced any new issues.
Over the long weekend, I set up infrastructure to alleviate this lack of confidence, and I'd like to tell you about it. Some of it is table stakes (CI testing on PRs), but there are also a few more exotic tricks (fuzzing and statically preventing panics).
Github CI
This one is pretty basic: I set up a Github Actions job to run on pull requests. I won't bore you with the entire YAML file, but it does a typical set of checks:
- Build and test across all native platforms (macOS, Linux, Windows)
- Confirm that the WebAssembly build completes without errors
- Run Clippy on all platforms
- Make sure that the code is `rustfmt`-clean
There were two surprises here. The first surprise was the sad state of Github's Windows runners, which were nearly 10x slower than the Mac and Linux runners:
Complaining on Bluesky revealed this to be a common issue, so I don't think I'm doing anything wrong here.
The second surprise was that the `ubuntu-24.04-arm` runner was extremely unreliable! For example, here's Clippy segfaulting (?!). Rerunning the job succeeds – and `aarch64-unknown-linux-gnu` is a Tier 1 target – so I'd put my money on the job runners being at fault.
The ARM runners are very new, having entered "public preview" only two weeks ago, so I disabled that target and moved on.
Actual snapshot testing
The Varvara specification includes a Screen device, which lets you draw pixels into the window. It's a surprisingly complex device, because it supports automatically drawing entire sprites (this makes more sense once you learn that the platform is inspired by the NES).
I added automatic tests which run the system for some number of cycles, then capture the state of the screen and compare it with a known-good image saved in the repository. The test passes if the image is identical, and fails with a visual comparison otherwise.
To exercise more of the system, the test also performs mouse and keyboard interactions while running. When I first added these interactions, a few of the tests failed, because the mouse and keyboard changed the visual state of the system! This turned out to be a good demonstration of the failure rendering (click for full resolution):
If the new image is actually correct – as it was in this situation – then deleting the previous known-good image will cause it to be regenerated on the next run.
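For flavor, here's a minimal sketch of what one of these snapshot tests can look like. The harness function `run_rom_for_cycles`, the ROM path, and the snapshot location are hypothetical stand-ins for the real test infrastructure (and the real tests also inject mouse and keyboard events, and render a visual comparison on failure); the compare-or-regenerate logic is the point.

```rust
use image::RgbaImage;
use std::path::Path;

/// Hypothetical harness: boot the ROM, run it for `cycles`, and return the
/// screen as (width, height, RGBA bytes).
fn run_rom_for_cycles(rom: &str, cycles: u64) -> (u32, u32, Vec<u8>) {
    unimplemented!("stand-in for the real Varvara test harness: {rom} for {cycles} cycles")
}

#[test]
fn catclock_snapshot() {
    let (w, h, pixels) = run_rom_for_cycles("roms/catclock.rom", 100_000);
    let actual = RgbaImage::from_raw(w, h, pixels).unwrap();

    let snapshot = Path::new("tests/snapshots/catclock.png");
    if !snapshot.exists() {
        // No known-good image: save the current frame so future runs can compare
        actual.save(snapshot).unwrap();
        return;
    }
    let expected = image::open(snapshot).unwrap().to_rgba8();
    assert!(
        expected.as_raw() == actual.as_raw(),
        "screen contents changed; delete {snapshot:?} to regenerate",
    );
}
```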
Statically proving the absence of panics
In my previous writeup, I bragged that the VM implementation is completely free of panics. Unfortunately, I had to check this property by hand, by disassembling the executable and scanning through the implementation of `Uxn::run`.

As part of the robustification process, I borrowed a trick from `no-panic` to prove the absence of panics at compile time. This strategy was new to me (though well-known in the Rust community), so let me show you what it looks like:
#[inline(never)]
pub fn div_no_panic(data: &[u8]) {
    // Define a struct which calls a non-existent function when dropped
    struct NoPanic;
    extern "C" {
        #[link_name = "div_may_panic"]
        fn trigger() -> !;
    }
    impl ::core::ops::Drop for NoPanic {
        fn drop(&mut self) {
            unsafe {
                trigger();
            }
        }
    }

    // Build an instance of that struct
    let guard = NoPanic;

    // Perform the possibly-panicking operations
    let mut ram = UxnRam::new();
    let mut vm = Uxn::new(&mut ram, Backend::Interpreter);
    let _ = vm.reset(data);
    let pc = 0x100;
    for d in data {
        vm.stack.push(Value::Byte(*d));
        vm.ret.push(Value::Byte(*d));
    }
    vm.div(pc);

    // If we didn't panic, then forget the guard (skipping its drop function)
    core::mem::forget(guard);
}
Panics in Rust perform "unwinding": when a panic occurs, destructors are called for all live objects. In this code, the destructor for `NoPanic` calls an `extern "C"` function named `div_may_panic`. Unfortunately, this function doesn't exist!
Normally, this would fail at link time with an error, because the linker can't resolve the external function. However, if there are no panic paths, a set of optimizations becomes possible:
- If there are no panics, then `core::mem::forget(guard)` is always called
- If we always `forget` the guard, then `NoPanic::drop` is never called
- If `NoPanic::drop` is never called, then it doesn't need to be in the binary
- If `NoPanic::drop` is not included, then we never link against `div_may_panic`
In other words, this will fail with a linker error if the compiler believes that the code may panic. The test may have false positives – building without optimization will fail, because the compiler is less aggressive – so I only build these tests in release mode.
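I won't claim this is exactly how the repository gates it, but one simple way to keep these checks out of unoptimized builds is to compile the whole module only when debug assertions are off, which is a reasonable proxy for release mode:

```rust
// Only compile the no-panic checks in optimized builds, where the
// linker trick is meaningful (a sketch; the real gating may differ).
#[cfg(all(test, not(debug_assertions)))]
mod no_panic {
    // div_no_panic and friends live here
}
```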
The `div_no_panic` function needs to be instantiated, which we do with a small test:
#[test]
fn div() {
    let data = std::hint::black_box(&[]);
    div_no_panic(data);
}
This test also stops the compiler from getting too clever with optimizations: passing a `black_box` object into the function means that the VM may be initialized with any values for its data and stacks. Without the `black_box`, the compiler is smart enough to notice that the data and stacks are all zeros; combined with aggressive inlining, we end up checking the much weaker condition of "this operation can't panic with a zero-initialized VM", instead of "this operation can't panic with a VM in any state".
As an example, division is defined as `a.checked_div(b).unwrap_or(0)`. If I instead write `a / b`, it can panic when `b == 0`. Sure enough, this leads to a linker error:
= note: ld: Undefined symbols:
_div_may_panic, referenced from:
raven_uxn::test::no_panic::div::no_panic::h04b990e75bc12e0c in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
raven_uxn::test::no_panic::div::no_panic::h686d8f2cdb0db704 in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
raven_uxn::test::no_panic::div::no_panic::h8383a3b65857da3c in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
raven_uxn::test::no_panic::div::no_panic::h87538c337282e2a2 in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
raven_uxn::test::no_panic::div::no_panic::hcf7507c29ba59298 in raven_uxn-f210ab979edece52.raven_uxn.e5df65eeb70bd9b5-cgu.03.rcgu.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)
It's not exactly beautiful, but it works well enough. The custom link name tells me which operation is responsible for the failure, making it easier to track down.
I ended up writing a bunch of macros to check that every single opcode function is panic-free. Theoretically, I could have just tested `Uxn::op`, which dispatches to opcode functions based on a `u8` argument; unfortunately, even though each individual opcode function is panic-free, the optimizer couldn't prove the same of `Uxn::op`. Being subject to the whims of the compiler is a definite downside to this trick!
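As a rough sketch of the shape of those macros (the metavariable names and the exact setup are my own invention, patterned on the `div_no_panic` example above):

```rust
// Generates a `*_no_panic` function for one opcode, following the pattern
// from div_no_panic above. Invoked once per opcode.
macro_rules! op_no_panic {
    ($fn_name:ident, $op:ident, $link_name:literal) => {
        #[inline(never)]
        pub fn $fn_name(data: &[u8]) {
            struct NoPanic;
            extern "C" {
                #[link_name = $link_name]
                fn trigger() -> !;
            }
            impl Drop for NoPanic {
                fn drop(&mut self) {
                    unsafe { trigger() }
                }
            }
            let guard = NoPanic;

            // Same setup as div_no_panic, but calling the opcode under test
            let mut ram = UxnRam::new();
            let mut vm = Uxn::new(&mut ram, Backend::Interpreter);
            let _ = vm.reset(data);
            for d in data {
                vm.stack.push(Value::Byte(*d));
                vm.ret.push(Value::Byte(*d));
            }
            vm.$op(0x100);

            core::mem::forget(guard);
        }
    };
}

op_no_panic!(add_no_panic, add, "add_may_panic");
op_no_panic!(sft_no_panic, sft, "sft_may_panic");
// ...and so on, for every opcode
```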
Having panic-free code checked at compile time is a somewhat exotic tool, but is helpful for high-assurance systems. At work, my colleagues and I use a similar strategy to assert that no panics are present in critical parts of our embedded system, e.g. the supervisor task.
Fuzzing
The Uxn CPU implementation in Raven comes in two flavors:
- The baseline interpreter is written in 100% safe Rust and is guaranteed panic-free (as discussed above)
- The "native" interpreter uses hand-written assembly to run about 30% faster, see in "Beating the compiler" for an explanation of how this works
These two implementations should both be faithful to the Uxn specification, and behave identically. However, when testing svitlyna (an image viewer), I found that they diverged: the baseline interpreter worked fine, but the native interpreter failed to load any images!
This failure meant that I could not admire the sample image of a cat, which is obviously an intolerable situation:
Divergence between the baseline and native interpreters had always been a fear of mine, and I already had hundreds of lines of unit tests to compare between them. These unit tests ran simple instruction sequences and confirmed that both interpreters ended up in identical states.
Obviously, they weren't sufficient, so I pulled out a bigger hammer: `cargo-fuzz`.
In fuzz testing, we provide a random input to the system and check whether it behaves correctly. Sufficiently advanced fuzzing libraries will actually monitor which code paths are exercised (using coverage instrumentation), and are smart about mutating the input to fully explore the program's state space.
For my interpreter, the random input is a ROM, which is just a slice of bytes. The correctness check compares the machine state between the baseline and native interpreters after the program terminates. "Machine state" comprises RAM (64 KiB), the data and return stacks (a `u8` index plus 256 bytes each), and device memory (256 bytes); all 66306 bytes must be identical!
It's easy for the fuzzer to discover (and get stuck in) infinite loops, so the test harness also bails out after hitting some max instruction count.
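Put together, the fuzz target is roughly the following sketch. The `step` and `state_bytes` helpers (and the `Backend::Native` variant name) are assumptions on my part, standing in for whatever the crate actually exposes; the differential structure is what matters.

```rust
// fuzz/fuzz_targets/cpu.rs (sketch)
#![no_main]
use libfuzzer_sys::fuzz_target;
use raven_uxn::{Backend, Uxn, UxnRam};

/// Bail out of fuzzer-discovered infinite loops after this many instructions
const MAX_INSTRUCTIONS: u64 = 1 << 20;

/// Run a ROM on the given backend and return its flattened machine state:
/// RAM + both stacks + device memory, 66306 bytes in total.
/// (`step` and `state_bytes` are hypothetical helpers.)
fn run(rom: &[u8], backend: Backend) -> Vec<u8> {
    let mut ram = UxnRam::new();
    let mut vm = Uxn::new(&mut ram, backend);
    let _ = vm.reset(rom);
    let mut pc = 0x100;
    for _ in 0..MAX_INSTRUCTIONS {
        match vm.step(pc) {
            Some(next) => pc = next, // keep going
            None => break,           // BRK: the program terminated
        }
    }
    vm.state_bytes()
}

fuzz_target!(|rom: &[u8]| {
    let a = run(rom, Backend::Interpreter);
    let b = run(rom, Backend::Native);
    assert!(a == b, "baseline and native interpreters diverged");
});
```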
When the fuzzer finds a case that doesn't match, the test harness saves the failing input and pretty-prints the opcode sequence. For example, it may tell me that `ADDk SFT2kr SWP2r EQU2r SWP2r LDR2kr SFT2kr` fails; `cargo fuzz tmin` then reduces the test case to `EQU2r SWP2r`.
The fuzzer found three opcodes where the interpreters produced different results. The first two were straightforward typos in my assembly implementations:
--- a/raven-uxn/src/native/aarch64.s
+++ b/raven-uxn/src/native/aarch64.s
@@ -874,9 +874,9 @@
 _NIP2r:
     ldrb w9, [x2, x3]
     rpop
     ldrb w10, [x2, x3]
     rpop
     strb w9, [x2, x3]
-    sub x11, x1, #1
+    sub x11, x3, #1
     and x11, x11, #0xff
-    strb w10, [x2, x31]
+    strb w10, [x2, x11]
     next

@@ -885,7 +885,7 @@
 _SWP2r:
     ldrb w11, [x2, x3]      ; get the top byte
     rpeek w12, x9, 2        ; get the second-from-top byte
     strb w11, [x2, x9]      ; do the swap!
     strb w12, [x2, x3]
     rpeek w11, x9, 1
-    rpeek w12, x10, 2
+    rpeek w12, x10, 3
     strb w11, [x2, x10]
     strb w12, [x2, x9]
The third discrepancy, however, was interesting, because it was correct in assembly and incorrect in Rust!
The `SFT` opcode reads a byte from the stack and splits it into two nibbles (4 bits each). It then reads another byte from the stack and shifts it right (controlled by the first nibble), then left (controlled by the second nibble). It's a one-size-fits-all instruction for power-of-two division, multiplication, and masking of low bits.
In Rust, I implemented it as follows:
#[inline]
pub fn sft(&mut self, pc: u16) {
    let shift = self.stack.pop_byte();
    let shr = u32::from(shift & 0xF);
    let shl = u32::from(shift >> 4);
    let v = self.stack.pop_byte();
    self.stack.push(v.wrapping_shr(shr).wrapping_shl(shl));
}
Your eyes may be drawn to `wrapping_shr` and `wrapping_shl`; what's wrong with `<<` and `>>`? This was part of my strategy for eliminating panics, because it's illegal to shift an integer by more than its width!
For example, want to calculate `1u8 << 8`? Right to jail, right away. `15u16 >> 18`? Believe it or not, jail. We have the best numerical operators in the world, because of jail.
(To be pedantic, this panics in debug builds; in release builds, with `overflow-checks` disabled, the out-of-range shift amount is silently masked to the type's width rather than causing Undefined Behavior. You should turn on `overflow-checks` in release builds to make it panic there as well.)
To suss out the intended behavior, I looked at the reference implementation:
case 0x1f: /* SFT */
t=T;n=N;
SET(2,-1) T = n >> (t & 0xf) << (t >> 4);
break;
In this case, `n` is a `Uint16`, and the maximum shift is 15, so this is all legal (hooray!). More importantly, it defines our semantics: we should shift in zeros if the right-shift exceeds 8 bits, and discard bits that overflow when shifting left.
This is what I thought I was doing with `wrapping_shr` and `wrapping_shl`. Unfortunately, the "wrapping" in those function names applies to the shift amount, not the shifted value! Specifically, if the shift amount is larger than the maximum valid shift, the invalid bits are masked off of the shift amount. (I'm not the only person confused about this.)
It turns out that `unbounded_shl` and `unbounded_shr` have the desired semantics, but they're still experimental and nightly-only. In the meantime, I can write the expression using checked shifts:

    v.checked_shr(shr).unwrap_or(0).checked_shl(shl).unwrap_or(0)
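To make the difference concrete, here's a small standalone example (not from the repository) for a shift amount that exceeds the width of a `u8`:

```rust
fn main() {
    let v: u8 = 0xFF;
    // `wrapping_shr` wraps the *amount*: 9 is masked to 9 & 0x7 == 1,
    // so bits survive the shift...
    assert_eq!(v.wrapping_shr(9), 0x7F);
    // ...while `checked_shr` reports the out-of-range amount as `None`,
    // and `unwrap_or(0)` gives the "shift in zeros" semantics that the
    // reference implementation defines.
    assert_eq!(v.checked_shr(9).unwrap_or(0), 0);
}
```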
In assembly, I was using `w`-sized registers (32 bits) for all operands, so it was already doing the right thing:
_SFT:
    ldrb w10, [x0, x1]   ; load the shift amount
    pop                  ; pop the shift amount from the stack
    ldrb w11, [x0, x1]   ; load the value to be shifted
    lsr w12, w10, 4      ; calculate left shift amount
    and w10, w10, #0xf   ; calculate right shift amount
    lsr w11, w11, w10    ; do the right shift
    lsl w11, w11, w12    ; do the left shift
    strb w11, [x0, x1]   ; write the value back
    next
With these three opcodes fixed, I ran the fuzzer with 8 threads for 15-20 minutes. My laptop got very warm, but it failed to find any further issues!
Wrapping up
I don't have any immediate plans for Raven, but next time I'm inclined to touch it, I'll be starting from a more solid foundation. For example, fuzz testing means it's possible for me (or someone else!) to contribute an x86-64 assembly backend with relative confidence!
Software is never perfect, and there are only so many hours in the day, but I found it rewarding to refine this small project into a microcosm of engineering rigor. I appreciate the tools and infrastructure that let me create robust and correct systems: Rust, `cargo fuzz` / `libFuzzer`, even (sigh) Github Actions.
My baseline interpreter is now (1) written in 100% safe Rust, (2) guaranteed not to panic, no matter the inputs, and (3) up to 50% faster than the reference implementation in C. Even better, 2/3 of those properties are guaranteed at compile time and checked in CI, so they shouldn't regress!
Finally, the title of this post is a reference to a song by Leonard Cohen, which is well worth your time.