The Prospero Challenge

Matt Keeter, March 2025

prospero.vm is a plain-text file containing 7866 math expressions.

It begins as follows:

# Text of a monologue from The Tempest
_0 const 2.95
_1 var-x
_2 const 8.13008
_3 mul _1 _2
_4 add _0 _3

...and continues in that vein for many, many lines.

If you evaluate the expression with (x,y) values in the ±1 square, then color pixels black or white based on their sign, you get the following image:

text of a monologue from The Tempest, in black-on-white text

The "challenge" is simple: render the image as quickly as possible.

A basic renderer is 28 lines of Python, using Numpy for pixel parallelism:

import numpy as np

with open('prospero.vm') as f:
    text = f.read().strip()

image_size = 1024
space = np.linspace(-1, 1, image_size)
(x, y) = np.meshgrid(space, -space)
v = {}

for line in text.split('\n'):
    if line.startswith('#'):
        continue
    [out, op, *args] = line.split()
    match op:
        case "var-x": v[out] = x
        case "var-y": v[out] = y
        case "const": v[out] = float(args[0])
        case "add": v[out] = v[args[0]] + v[args[1]]
        case "sub": v[out] = v[args[0]] - v[args[1]]
        case "mul": v[out] = v[args[0]] * v[args[1]]
        case "max": v[out] = np.maximum(v[args[0]], v[args[1]])
        case "min": v[out] = np.minimum(v[args[0]], v[args[1]])
        case "neg": v[out] = -v[args[0]]
        case "square": v[out] = v[args[0]] * v[args[0]]
        case "sqrt": v[out] = np.sqrt(v[args[0]])
        case _: raise Exception(f"unknown opcode '{op}'")
out = v[out]

with open('out.ppm', 'wb') as f: # write the image out
    f.write(f'P5\n{image_size} {image_size}\n255\n'.encode())
    f.write(((out < 0) * 255).astype(np.uint8).tobytes())

On my machine, this takes about 15 seconds to produce a 1024×1024 image.

(Spoilers: it also allocates 60+ GB of RAM for intermediate results, so consider reducing the image size before running it on your own machine)

"But wait!", you say. "There's so much obvious room for improvement! You could pre-parse the expression, or use numba, or run it on the GPU, or use LLVM ..."

Yes. I would like you to do those things, and then tell me about them!

A few ideas to think about:

The Python implementation above is a simple interpreter. What would an optimized interpreter look like? How about a compiler?
How can we take advantage of the expression's mathematical structure?
Start-up time versus steady-state performance. What kinds of precomputation help, and how long does it take? (Obviously, you could precompute the entire image, but that's against the spirit of the challenge)
Along those lines, what happens if the user is making changes to the model? Strategies which rely on expensive precomputation may not be suitable for interactive use.
How does rendering performance change with image size? A 1024×1024 image has 4× as many pixels as 512×512; does it necessarily take 4× longer to render?
Panning and zooming is a simple transform of the incoming (x,y) coordinates. How does your system handle interactive viewing of the model?

Submitting your experiments

Feel free to submit both successful and unsuccessful experiments via email to
matt.j.keeter@gmail.com, and I will add them to this page.

The ideal submission would be a blog post or code repository, along with a few-sentence description that you'd like included on this page. I'd also be happy to include hero images or video embeds.

The description should call out any particularly promising results, clever ideas, or interesting tools! Don't get too fixated on absolute speed; the goal is to explore ideas, and that can absolutely be done without breaking any performance records!

Is there a prize?

Besides eternal fame and glory?

If you submit an interesting write-up, I will buy you a coffee or drink if we happen to be in the same city. It will be particularly easy for Cambridge / Boston / Somerville residents to receive their prize; I make no guarantees for submissions from farther afield.

Background material

Interval Arithmetic and Recursive Subdivision for Implicit Functions and Constructive Solid Geometry (Duff '92) is a great introduction to common techniques. In particular, the combination of interval arithmetic, spatial subdivision, and expression simplification remains the basis for modern renderers.

These ideas are also presented in §3.7 of my thesis, Hierarchical Volumetric Object Representations for Digital Fabrication Workflows (2013), although my personal implementations have evolved since then (see below for examples).

Finally, I gave a talk titled Implicit Surfaces & Independent Research (2025), which covers all of this material and hews closely to my current implementations. The audience was CS undergraduates in their senior year, so it's intended to be approachable!

Gallery

Massively Parallel Rendering of Complex Closed-Form Implicit Surfaces

Matt Keeter, 2020

Paper homepage

The MPR implementation compiles the expression down to a register-based bytecode, then executes it with a small interpreter running in a CUDA kernel. The interpreter evaluates the expression using interval arithmetic on many spatial regions in parallel, then simplifies the expression, subdivides the active regions, and recurses. Below a certain size, we swap over to per-pixel evaluation, also in parallel on the GPU.

Expression simplification is particularly important for performance: by the time we evaluate individual pixels, the expression is 200× smaller!

Most of the hard work went into making a deeply recursive algorithm (with heterogeneous workloads in each branch) into something more GPU-friendly.

Rendering a 1024×1024 image takes 3.9 milliseconds using a Tesla V100, which was a high-end GPU back in 2020 (and is still absurdly powerful). Performance is basically flat through 4096×4096, indicating that we're far from saturating the GPU for this kind of 2D rendering (the paper also describes a similar strategy for 3D rendering).

Unfortunately, I don't have timing numbers for setup time, but the kernels are generic interpreters, i.e. not specialized to a particular expression.

Fidget

Matt Keeter, March 2025

Project homepage

Fidget uses the same broad strategy of interval evaluation and expression simplification, but runs on the CPU and compiles expressions down to machine code with a JIT compiler.

Running on an M1 Max CPU (2021), it renders a 1024×1024 image in 6.3 milliseconds, with about 11 ms of setup time:

$ cargo run -pfidget-cli --release -- render2d -i models/prospero.vm \
    -s1024 --eval=jit -N500 --mode mono
[2025-03-23T21:26:44Z INFO  fidget_cli] Loaded file in 4.994333ms
[2025-03-23T21:26:44Z INFO  fidget_cli] Built shape in 6.04525ms
[2025-03-23T21:26:47Z INFO  fidget_cli] Rendered 500x at 6.331644 ms/frame

Scaling up to 4096×4096, it takes 57 milliseconds, which is about 9× slower; this is sublinear scaling, because it's 16× more pixels.

There are a few interesting aspects to this project:

The JIT compiler's architecture is specialized for very fast compilation, because it's running many times to evaluate a single frame!
There's also a VM interpreter, which is almost as fast: 7.5 ms for a 1024×1024 image
Everything except the JIT compiler can be run in WebAssembly, albeit with a performance penalty

Prospero challenge, now with more garbage collection

Max Bernstein, March 2025

Blog post

Max addresses the "60 GB of intermediate results" issue by adding garbage collection to the Python interpreter, seeing a quick 4× speedup (and a dramatic reduction in RAM usage, down to about 1 GB).

The post also shows off CuPy, a drop-in replacement for Numpy which offloads computation to the GPU. Simply replacing the import statement yields a further 6.6× speedup!

Generating a native CUDA kernel

Kevin Wang, March 2025

Github repo

This demo shows the raw power of a modern GPU, rendering a 4096×4096 frame in about 0.5 milliseconds. The expression is translated directly into a CUDA kernel then compiled and run, evaluating all of the pixels in parallel without the overhead of an interpreter loop.

The downside to this approach is significant startup time – about 2 seconds to compile the kernel – which makes it less suitable for interactive design work. Still, for workflows that focus on viewing static expressions, it's hard to beat.

As a side note, Kevin and I are both puzzled by the specific performance numbers. His GPU has a theoretical max performance of 83 TFLOPS; however, 0.5 milliseconds for 4096×4096 pixels (with 6461 non-const expressions) implies about 200 TFLOPS! If you have theories about the discrepancy, feel free to reach out.

Cranelift JIT and SIMD

Aviva Ruben, March 2025

Blog post

This writeup optimizes brute-force evaluation with a bunch of interesting tools and techniques:

We went from around 5 seconds to ~270 ms. We started with using cranelift for JIT compilation, added rayon to process pixels in parallel, tried switching to f32, and finally added vectorization to get 4 pixels processed per function invocation. Compared to the original python's 15 seconds, I'd say that's pretty good.

I'm also left with a deeper appreciation for the Cranelift compiler: it's both fast (34 milliseconds to compile the whole expression) and easy to embed (coming in under 200 lines of code).

JIT compilation with `ejit`

Kim Kuparinen, March 2025

Code repository

Kim uses the challenge as a test for their own JIT compiler, ejit (which is built on lightening). ejit is even further down the "compilation speed versus performance" tradeoff curve: it compiles the expression about 20× faster than the Cranelift implementation, but renders the image about 6× slower (albeit without SIMD). Numerically heavy parallel computation isn't the intended use-case – ejit is intended to speed up toy scripting languages – but it's interesting to see how it performs!

Fast interpreter with SIMD and parallelism

Lobo, March 2025

Code repo

This submission presents an optimized brute-force interpreter using Rust, portable SIMD, and rayon for multithreading. There is minimal startup overhead; the text is parsed into bytecode, but nothing fancier. The other main optimization is switching from a hashmap to a flat memory array for intermediate results. Even so, it's about 12× faster than Python implementation with garbage collection above (and I'm a fan of any code that includes a struct TrustMeBro).

Julia metaprogramming

Yolhan Mannes et al, March 2025

Github repo | Forum thread

This implementation is a great demonstration of Julia's flexibility and metaprogramming capabilities. Using relatively straight-forward code, the expression is parsed and compiled at runtime, then evaluated either on the GPU or CPU depending on selected backend. GPU evaluation is about 10 ms/frame on a NVIDIA GeForce 4060, while the CPU backend takes around 492 ms. Julia uses LLVM for its JIT compiler, so the resulting functions are fast to evaluate but costly to compute; the authors see about 13 seconds of compilation time.

Optimized evaluation and OpenMP

Janos Meny, April 2025

Github repo

This submission uses algorithmic improvements, a high-performance implementation language, and parallelism to achieve particularly fast rendering: 2 ms for 1024×1024, or 4.3 ms for 4096×4096 (on an Macbook Air M2). The strategy is broadly similar to Fidget: interval evaluation proves entire regions to be inside or outside the shape, and expression simplification reduces the workload while recursing down a quadtree.

The implementation is very approachable: it's a single 600-line file of C++, and simply running make worked out of the box on my machine. For bonus points, OpenMP is used for parallelism: while recursing, #pragma omp annotations cause regions above 64×64 to be distributed among worker threads.

All in all, a very impressive submission!

Doing the Prospero Challenge in RPython

CF Bolz-Tereick, April 2025

Blog post

This post – by one of the PyPy authors – brings an interesting perspective to the problem. Interval arithmetic evaluation is reframed as abstract interpretation; min and max simplification becomes a special case of peephole optimization. The author implements optimized evaluation using a similar recursive strategy as Fidget, in both RPython and C.

There's one new technique that I haven't seen before: "demanded information optimization". In short, we only care about the sign bit, so certain operations can be simplified if only one side can generate a varying sign bit (check out the blog post for a more detailed explanation). This ends up helping a lot, and I may purloin it for my own tools!

The author takes care to make their implementation robust: their interval arithmetic library is tested with hypothesis-based property testing, and they built a random program generator to check that their RPython and C implementations are equivalent.

I appreciate seeing the challenge reframed from a compiler engineer's perspective!

Optimized evaluation in C

Sim Veldstra, April 2025

Github repo

This implementation ports the example code from Python to C, then walks through a set of optimizations, showing improved performance at each step. Optimizations include autovectorization, removing constants from the tape, and multithreading. I particularly appreciate the conclusion:

So there you have it, you really can beat Python by rewriting in C. And it only took 700 extra lines of code.

Artisinal AVX2 JIT compiler

Ken Micklas, April 2025

Blog post

Ken presents a bare-bones JIT compiler: each math operation is mapped directly into AVX2 instructions using the iced_x86 library and a custom register allocator. As you'd expect, this is a blazing fast compiler, generating code in mere milliseconds.

Rendering is similarly speedy, generating a 1024×1024 image in 48 ms; not bad for brute force evaluation!

The writeup shows a few particularly interesting parameter sweeps:

Scaling is roughly linear with physical cores, as you'd expect
Performance is nearly uncorrelated with "number of physical registers used", which is unexpected!
This may be because the number of loads and stores isn't particularly sensitive to register count (which is also surprising!)

AVX JIT compiler and function specialization

Carlo Refice, April 2025

Blog post

Carlo combines a homebrew JIT compiler with tiled rendering, generating specialized native functions for spatial regions of the image. The techniques are broadly similar to other submissions, but they're particularly well presented and built up from first principles – to the point of explaining x86_64 instruction encoding! This is a particular compelling write-up if you're interested in JIT code generation, register allocation, and trace-based function specialization.

Gradients are the new intervals

Matt Keeter, May 2025

Blog post

Hello, it's me again!

I read an interesting paper and wanted to ~~steal~~ borrow some of its ideas.
Here's the short version:

For functions with a bounded gradient, you can produce interval-like results using single-point evaluations (which are cheaper and maintain tighter bounds)
If the function doesn't have a bounded gradient, you can evaluate it with automatic differentiation, then normalize by dividing by the gradient's magnitude

These two ideas let you make an evaluator that follows the same "interval arithmetic + subdivision" strategy, but without any interval arithmetic! The blog post goes into various advantages and disadvantages; I found it to be slightly slower, but it's highly dependent on the shape of your function.

SPIR-V JIT compiler

Joe Neeman, June 2025

Blog post

Joe presents a JIT compiler that translates the program into SPIR-V bytecode, then sends that bytecode to the GPU for fast brute-force evaluation.

This is a similar pipeline as Kevin's CUDA transpiler, but the code generation is orders-of-magnitude faster: loading the shader takes milliseconds, instead of seconds! CUDA runs a full optimizing compiler during lowering, so I suspect Joe's implementation trades (some) runtime performance for much better compilation speed.

The writeup shows how to generate SPIR-V from Rust, presents performance results, and discusses potential improvements. Writing SPIR-V manually is uncommon, so I appreciated the frank discussion of SPIR-V libraries and tools, e.g. using validators and disassemblers to figure out how things work under the hood.

Very fast CPU interpreter

Aaron Pribadi, June 2025

Github repo

Aaron submits our fastest CPU renderer so far, with a bunch of notable new tricks!

The original instruction tape is preprocessed into bytecode, with a twist: the implicit <= 0.0 is reified, then propagated through the expression. This turns operations like min(a, b) into (a <= 0) | (b <= 0), which can then be further simplified.

Expression simplification uses these booleans instead of distance values, which produces shorter final expressions; this is philosophically similar to CF Bolz-Tereick's "demanded information optimization". In addition, operations track their relevant inputs in the forward pass, so the backwards simplification pass is sparse: it depends on the number of expressions in the simplified tape, not the input tape!

The bytecode format has dedicated instructions for lines and ellipses, instead of operating at the level of individual math expressions. Finally, almost everything is vectorized, using NEON intrinsics!

It renders a 1024x1024 image in 0.29 ms on an M1 Macbook Air, which is about 8× faster than our previous optimized record-holder for CPU-based evaluation – an incredible speed-up! The only caveat is that bytecode generation is offline, so it's not included in those numbers.

Simple Rust interpreter

Sam Leonard, July 2025

Github repo

Sam submits a Rust interpreter written in a mere 123 lines of code; it's simple and highly readable. The code parses the program into an Op data type, then evaluates pixels in parallel. It uses Rayon for multithreaded parallelism, with unsafe indexing into a pixel buffer.

Rust codegen with a procedural macro

John Graham, July 2025

Github repo

John submits a fork of Sam's solution above, which uses a proc macro to generate Rust code at compile time. This is an interesting middle ground between runtime interpreters and explicit codegen (e.g. in a build script).

That's all for now; but you can help by submitting your own experiments!

The Prospero Challenge

Submitting your experiments

Is there a prize?

Background material

Gallery

Massively Parallel Rendering of Complex Closed-Form Implicit Surfaces

Fidget

Prospero challenge, now with more garbage collection

Generating a native CUDA kernel

Cranelift JIT and SIMD

JIT compilation with ejit

Fast interpreter with SIMD and parallelism

Julia metaprogramming

Optimized evaluation and OpenMP

Doing the Prospero Challenge in RPython

Optimized evaluation in C

Artisinal AVX2 JIT compiler

AVX JIT compiler and function specialization

Gradients are the new intervals

SPIR-V JIT compiler

Very fast CPU interpreter

Simple Rust interpreter

Rust codegen with a procedural macro

JIT compilation with `ejit`