A simple path tracer on the GPU
There comes a time in any graphics programmer's life when the siren song becomes too loud to endure:
"You should write a raytracer"
For me, the moment came after writing a fragment-shader raytracer in Futureproof: it was a terrible raytracer, but a lot of fun to write.
After a few weeks down the rabbit hole, I'm proud to present rayray: a tiny, slightly less-terrible GPU raytracer.
To start, here are renderings of two classic images: a Cornell Box and the test scene from Ray Tracing in One Weekend.
(Click for high resolution, and keep scrolling for more pretty pictures)
Architecture
rayray is a traditional forward path tracer. It casts rays from each pixel in the image, which scatter until they hit a light or exceed some max number of bounces. Averaging over thousands of samples creates the final image.
Drag the slider to compare 100 vs. 1000 samples while observing the orb:
The architecture is inspired by Ray Tracing in One Weekend, but unlike the reference implementation, it runs completely on the GPU!
This requires a data-driven design (rather than using inheritance), and leads to dramatically faster rendering.
All you need is trace(...)
The core of my raytracer is a function with the signature
bool trace(inout uint seed, inout vec3 pos, inout vec3 dir, inout vec4 color);
This function casts a ray to the next (nearest) object in the scene, starting at pos and traveling in direction dir. After finding this object, the ray scatters depending on its material, modifying pos, dir, and color. It returns a flag indicating whether to terminate (if the ray hit a light or escaped the world).
This makes the actual raytracing function incredibly clean:
#define BOUNCES 6
vec3 bounce(vec3 pos, vec3 dir, inout uint seed, vec4 color) {
    for (int i=0; i < BOUNCES; ++i) {
        // Walk to the next object in the scene, updating the system state
        // using a set of inout variables
        if (trace(seed, pos, dir, color)) {
            return color.xyz;
        }
    }
    return vec3(0);
}
The grid of images below shows pos, dir, and color after one call to trace(...), then the result of a single bounce(...) sample.
Pixels which were lucky enough to hit the light are colored based on their paths, and everything else is black. Over a few thousand samples, things add up to the correct image!
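For instance, the per-pixel accumulation can be as simple as a running sum in an image; here's a minimal sketch of the idea (the accum binding and its layout are illustrative, not rayray's actual code):

layout(set=0, binding=0, rgba32f) uniform image2D accum;  // running per-pixel sum

void accumulate(ivec2 pixel, vec3 sample_color) {
    // Add this sample to the running sum; the alpha channel counts samples,
    // so the display pass can divide by it to recover the average
    vec4 prev = imageLoad(accum, pixel);
    imageStore(accum, pixel, prev + vec4(sample_color, 1.0));
}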
Rendering pipelines
Of course, we haven't yet specified how trace(...) is actually implemented!
While a scene is being edited, we use a preview implementation of trace(...), which is compiled once (at startup). This implementation renders an encoded scene from a storage buffer. This means that updates to the scene are cheap: they're just a buffer write operation.
The scene is packed into a vec4 array using a custom binary format:
// Header
[0] number of shapes | 0 | 0 | 0
// Shapes
[1] shape type | shape data offset | material data offset | material type
[2] shape type | shape data offset | material data offset | material type
[3] shape type | shape data offset | material data offset | material type
...
// Data section
[.] material data
...
[.] shape data
...
For example, encoding a scene with two spheres – one a diffuse red material, and the other a white light – takes 7 slots in the array, or 112 bytes:
// Header
[0] 2 | 0 | 0 | 0 // There are 2 shapes
// Shapes
[1] SHAPE_SPHERE | 5 | 3 | MAT_DIFFUSE
[2] SHAPE_SPHERE | 6 | 4 | MAT_LIGHT
// Data section
[3] 1.0 | 0.5 | 0.5 | 0.0 // Red (for diffuse material)
[4] 1.0 | 1.0 | 1.0 | 0.0 // White (for light)
[5] 0.0 | 0.0 | 0.0 | 0.5 // Sphere [x, y, z], r
[6] 0.5 | 0.5 | 0.0 | 0.3 // Sphere [x, y, z], r
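On the GPU side, the encoded scene could be exposed to the preview shader along these lines (the binding numbers here are illustrative, not rayray's actual layout):

// The encoded scene, bound as a storage buffer of vec4 slots
layout(set=0, binding=1, std430) readonly buffer Scene {
    vec4 scene_data[];
};

uint num_shapes() {
    return uint(scene_data[0].x);  // header: the shape count lives in the first slot
}

With the two-sphere example above, num_shapes() returns 2, and scene_data[1].y points at the first sphere's [x, y, z], r data in slot 5.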
The preview shader implements trace(...) as a tiny scene interpreter. It iterates over each shape in the scene, finding the first hit along the current ray and updating the ray's pos. Then, that shape's material is used to update the ray's dir and color.
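In outline, the interpreted trace(...) looks something like the sketch below, where hit_shape() and apply_material() are stand-ins for the real dispatch code (which switches on the shape and material type fields):

bool trace(inout uint seed, inout vec3 pos, inout vec3 dir, inout vec4 color) {
    float best = 1e30;     // distance to the nearest hit so far
    uint best_shape = 0u;  // index of the nearest shape (0 means no hit)

    for (uint i = 1u; i <= num_shapes(); i++) {
        float d = hit_shape(scene_data[i], pos, dir);
        if (d > 0.0 && d < best) {
            best = d;
            best_shape = i;
        }
    }
    if (best_shape == 0u) {
        return true;       // the ray escaped the scene
    }
    pos += best * dir;     // walk to the hit point

    // Update dir and color based on the material; returns true for lights
    return apply_material(scene_data[best_shape], seed, pos, dir, color);
}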
After the scene has stabilized (defined as no user interaction for one second), the application builds an optimized kernel, which encodes the scene data directly. This takes longer to build, because it's building a full compute pipeline, but is much faster to evaluate.
The generated shader unrolls the shape-hit loop, then packs normal and material calculations directly into a series of switch statements. Compare the interpreter against one example of the generated trace(...).
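To give a rough impression, a scene-specialized trace(...) for the two-sphere example above could end up looking something like this sketch (hit_sphere() and scatter_diffuse() are hypothetical helpers, and the real generated code is longer):

bool trace(inout uint seed, inout vec3 pos, inout vec3 dir, inout vec4 color) {
    float best = 1e30;
    uint hit = 0u;

    // The shape loop is unrolled, with centers and radii baked in as constants
    float d = hit_sphere(pos, dir, vec3(0.0, 0.0, 0.0), 0.5);
    if (d > 0.0 && d < best) { best = d; hit = 1u; }
    d = hit_sphere(pos, dir, vec3(0.5, 0.5, 0.0), 0.3);
    if (d > 0.0 && d < best) { best = d; hit = 2u; }

    if (hit == 0u) {
        return true;  // escaped the scene
    }
    pos += best * dir;

    // Normal and material calculations are baked into a switch on the shape index
    switch (hit) {
        case 1u:  // red diffuse sphere
            dir = scatter_diffuse(seed, normalize(pos - vec3(0.0, 0.0, 0.0)));
            color.xyz *= vec3(1.0, 0.5, 0.5);
            return false;
        default:  // white light
            color.xyz *= vec3(1.0, 1.0, 1.0);
            return true;
    }
}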
Building a scene-specific shader makes a big difference! Testing across various scenes, I see a 2-6x speedup between the interpreter and the scene-optimized pipeline.
Focal blur
Focal blur is accomplished by jittering the ray at the camera plane, while maintaining focus at a particular distance:
This is another case where accumulating random samples just works: by picking a random jitter for each sample, we can get a lovely depth-of-field effect.
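A minimal version of that jitter could look like this (a sketch of the idea, not rayray's actual camera code, assuming the camera looks along -Z so the aperture lies in the XY plane):

// Jitter the ray origin within the aperture while keeping it aimed at a
// fixed point on the focal plane, so objects at that distance stay sharp
void lens_jitter(float focal_dist, float aperture,
                 inout uint seed, inout vec3 pos, inout vec3 dir) {
    vec3 focus = pos + focal_dist * dir;            // point that stays in focus
    pos.xy += aperture * vec2(rand(seed), rand(seed));
    dir = normalize(focus - pos);                   // re-aim at the focal point
}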
Antialiasing works the same way: every pixel is jittered within a 1-pixel radius from its nominal position, which smooths out edges.
Spectral rendering
The final chapter of Ray Tracing: The Rest of Your Life suggests
If you want to do hard-core physically based renderers, convert your renderer from RGB to spectral. I am a big fan of each ray having a random wavelength and almost all the RGBs in your program turning into floats. It sounds inefficient, but it isn't!
This sounded like fun, so I built a spectral mode into rayray (invoked by the -s command-line argument).
On each frame, the GPU uniformly selects a wavelength between 400 and 700 nm, converts it to an RGB value, then renders as before, with the wavelength stored in the w coordinate of the vec4 color. (This is why bounce(...) takes a vec4 color but only returns a vec3.)
In spectral mode, glass materials use the w coordinate to tweak their index of refraction, to imitate real-world dispersion.
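As a sketch of how these pieces could fit together (wavelength_to_rgb() and the dispersion curve below are stand-ins, not rayray's actual implementation):

// Pick a wavelength for this sample and carry it in the w channel
vec4 spectral_color(inout uint seed) {
    float wavelength = 550.0 + 150.0 * rand(seed);  // uniform in 400-700 nm
    return vec4(wavelength_to_rgb(wavelength), wavelength);
}

// A dispersive glass material then nudges its index of refraction
// based on the wavelength it reads back out of color.w
float glass_ior(vec4 color) {
    return 1.5 + 0.02 * (550.0 - color.w) / 150.0;  // made-up dispersion curve
}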
This means we can make a prism:
(Aside: Rendering a 2D scene with a 3D raytracer was challenging, as rays scattered randomly off the rear plane were unlikely to hit the narrow light. I fixed this using a special material which scatters light perpendicular to its normal; this means that after the first bounce, all of the rays are scattered in XY and are more likely to hit the prism and light.)
One expected-but-neat side effect is chromatic aberration. Compare the glass sphere with normal and spectral rendering:
In the spectral image, the refracted blue sphere has blue and red distortions around the edges, and the caustic shows a faint rainbow.
Editor GUI
Like all good graphics tools, rayray integrates Dear ImGui for a debug UI. The UI is integrated into the raytracer's scene representation, so changes are instantly reflected in the image:
At the time of implementation, there was no canonical WebGPU backend for Dear ImGui, so I wrote one (there's now an experimental implementation in the main repository).
The editor is extremely rough – it's mostly useful for tweaking parameters of scenes defined in code, then copying those parameters back into the original source. Still, being able to drag colors around in real-time is entertaining!
Importing .mol files
In search of more interesting models, I discovered MolView, an online database of chemicals. Each chemical can be downloaded as a .mol file, so I wrote a small importer which converts it into a rayray scene.
For example, caffeine:
Aside: RNG on the GPU
A raytracer needs a good random number generator (RNG) to model ray scatter off a diffuse material (among other things).
As it turns out, RNGs on the GPU are a fun and contentious topic.
You'll often see the one-liner
float rand(vec2 co) {
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * 43758.5453);
}
I have philosophical objections to this function, because it only works once, and only in a 2D grid. For raytracing, each ray requires a stream of random values as it bounces around the scene. You'd need some way to adjust co.xy each time rand(...) is called, and it would be easy to accidentally add a pattern to the noise.
Ideally, we'd have an RNG which takes a seed and generates a continuous stream of random values, ad infinitum.
With that in mind, I decided to use a hash-based approach, where the seed is a uint transformed repeatedly by some hash function.
These two blog posts informed my final strategy:
- Generate an initial uint seed with a slow-but-good hash function
- Use a fast hash function to modify the seed and get the next random value
- Bithack the value from a uint directly into a float
Here's my slow hash function, used to generate the initial seed:
// Jenkins hash function, specialized for a uint key
uint hash(uint key) {
    uint h = 0;
    for (int i=0; i < 4; ++i) {
        h += (key >> (i * 8)) & 0xFF;
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}
(Note that this is slightly different from the "Jenkins' OAT algorithm" in the second blog, which doesn't handle each byte; mine is faithful to the canonical definition)
Applying this hash function to values 0 through 65535, we see the following:
This seems suitably random as a per-pixel seed value!
My fast hash is based on a linear congruential generator using a particular magic number from the literature: it's simply seed = 0xadb4a92d * seed + 1.
Here's what repeated iterations look like, starting from seed = 1 and mapping the top 24 bits to RGB values:
As you can see, this is pretty good: there aren't any obvious visual patterns or repetition.
It's important to use the high bits, because the low bits show more obvious patterns. Here's what the bottom 24 bits look like:
There's a clear cycle in the lowest 8 bits, visible in the blue channel!
Here's the complete rand(...) function, including bithacking to convert the highest 23 bits of the hashed seed directly into a float mantissa:
// Returns a pseudorandom value between -1 and 1
float rand(inout uint seed) {
    // 32-bit LCG Multiplier from
    // "Computationally Easy, Spectrally Good Multipliers for
    //  Congruential Pseudorandom Number Generators" [Steele + Vigna]
    seed = 0xadb4a92d * seed + 1;

    // Low bits have less randomness [L'ECUYER '99], so we'll shift the high
    // bits into the mantissa position of an IEEE float32, then mask with
    // the bit-pattern for 2.0
    uint m = (seed >> 9) | 0x40000000u;
    float f = uintBitsToFloat(m);   // Range [2:4]
    return f - 3.0;                 // Range [-1:1]
}
This function is tuned to return values in the range -1 to 1, to save an extra scaling step.
I came up with one neat trick myself: by specifying seed as an inout value, the function returns random floats while mutating the seed automatically, so you can call rand(seed) repeatedly to get your stream of values.
For example, generating a random vec3 is simply
vec3 rand3(inout uint seed) {
    return vec3(rand(seed), rand(seed), rand(seed));
}
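One way that stream gets used for diffuse scattering (a sketch, not necessarily rayray's exact sampling strategy) is to rejection-sample a direction in the unit sphere:

vec3 rand_sphere(inout uint seed) {
    // Keep drawing points in the [-1, 1] cube until one lands inside the unit sphere
    vec3 v;
    do {
        v = rand3(seed);
    } while (dot(v, v) > 1.0);
    return normalize(v);
}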
Finally, we generate our initial seed based on the sample count (which changes every frame) and the invocation ID (which changes for each pixel):
void main() {
    // Set up our random seed based on the frame and pixel position
    uint seed = hash(hash(u.samples) ^ hash(gl_GlobalInvocationID.x));
    ...
}
Performance
Performance data was captured on a 2017 MacBook Pro, with a Radeon Pro 560 GPU and a 2.9 GHz Intel Core i7 CPU.
Two figures are presented: steady-state absolute performance (in mega-rays per second), and equivalent speed when rendering a 1200 x 1200 image.
Scene | Absolute speed (Mray/sec) | 1200² image (frames/sec)
---|---|---
Cornell Box | 304 | 211
Ray Tracing in One Weekend | 19.1 | 13.3
the orb | 33 | 22
Golden sphere grid | 86 | 60
Prism | 955 | 663
Caffeine molecule | 193 | 134
To compare, the reference implementation for Ray Tracing in One Weekend takes about 577 seconds to render a 1200 x 1200 image at 500 samples per pixel (max depth of 6 bounces, compiled with -O3, output disabled). rayray takes 43.4 seconds to render the same image, which is a 13x speedup (even including startup and shader compilation time).
Of course, the reference implementation isn't optimized for performance – but then again, neither is rayray!
Infrastructure
Like Futureproof, rayray is built on a stack of (unnecessarily) modern tools and technologies:
- Written in Zig
- Using WebGPU for graphics, via wgpu-native
- Shaders compiled from GLSL to SPIR-V with shaderc
- Minimal GUI using Dear ImGui, with a custom Zig + WebGPU backend
While building the scene-specific pipeline, I ran into a pathological complexity explosion in SPIRV-Cross. Coincidentally, it had been fixed a few days earlier, so I worked to push that fix all the way upstream to wgpu-native (and also wgpu-rs, to be polite). This was a very positive experience: the gfx-rs and wgpu folks were consistently responsive and helpful, so props to that community.
(I also found a bug in cbindgen along the way, which was promptly fixed!)
Future plans
Like Futureproof, this was a toy project / proof of concept, and I'm not planning to maintain it into the future.
As always, the code is on Github, and I'd be happy to link any fork which achieves critical momentum.
Now that naga is released, it may be time to transition back to Rust for graphics work – my motivation for using Zig was seamless interoperability with C libraries (mainly shaderc), and that's no longer needed in wgpu 0.7.0.