When it comes to shadows in games, almost everyone uses some form of shadow mapping. The core question is simple: you're shading a pixel, you know where the light is, and you want to know if something is in the way. Shadow mapping solves this by rendering your scene from the light's point of view into a depth texture. Then, for each pixel the camera sees, you project it into the light's space and compare depths. If the shadow map recorded something closer to the light at that position, your pixel is in shadow. It's cheap, it's general, and it works for anything you can rasterize.

The trouble is that the shadow map is a regular grid. Its texels are evenly spaced, and each one stores a single depth value. But the camera's pixels, when projected into light space, don't land neatly on that grid. They land at arbitrary positions between texels. A shadow map can tell you the depth at texel (137, 892), but when your camera pixel projects to (137.4, 891.7), the best you can do is sample the nearest texel, or interpolate between neighbors. You're querying a regular grid at irregular positions, and the grid can only give you an approximate answer.
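To make the classic test concrete, here's a tiny Python sketch of it (hypothetical names, nearest-texel sampling, conventional depth where larger means farther from the light):

```python
def project(m, p):
    """Apply a 4x4 row-major matrix to a point, return NDC x, y, z."""
    x, y, z = p
    o = [m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(4)]
    w = o[3] if o[3] != 0.0 else 1.0
    return o[0] / w, o[1] / w, o[2] / w

def shadow_test(world_pos, light_vp, shadow_map, resolution, bias=0.0):
    """True if world_pos is in shadow, per the basic shadow-map test."""
    x, y, z = project(light_vp, world_pos)
    # Snap the projected position onto the map's regular grid. This is
    # the nearest-texel query, and the source of the grid mismatch: the
    # pixel lands between texels, but we can only read whole texels.
    u = min(int((x * 0.5 + 0.5) * resolution), resolution - 1)
    v = min(int((y * 0.5 + 0.5) * resolution), resolution - 1)
    # Shadowed if the light recorded something closer at this texel.
    return z > shadow_map[v][u] + bias
```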

This mismatch is what causes most of the visual artifacts you'll run into with shadow maps.

Depending on the camera's position and the light's angle, a single texel in the shadow map might correspond to dozens of camera pixels. All of those pixels look up the same texel and get the same answer: shadow or lit. This is what produces the staircase edges you see along shadow boundaries, blocky steps where there should be a smooth transition. You can increase the shadow map resolution to make the steps smaller, but the fundamental problem is that many camera pixels are sharing one shadow map sample. As long as that's true, you'll see the staircase.

Drag the camera view to rotate the light. Use the slider to change the shadow map resolution. The staircase gets finer but never disappears. Toggle Edges to highlight the shadow boundary directly. Click a cell in the light view to see which camera pixels sample from it.

Shadow acne comes from a surface trying to look up its own depth. The shadow map stores one depth value per texel, but the surface isn't flat relative to the light; it's sloped. The depth at the texel center is different from the depth at the point you're actually shading, so the surface compares against the wrong depth and concludes that it's shadowing itself, producing a pattern of incorrect self-shadowing across the surface.

The standard fix for acne is to add a depth bias, which nudges the comparison threshold so the surface stops self-shadowing. This works, but if the bias is too large, shadows start to visibly separate from the geometry that casts them. This is called peter-panning, and once you start tweaking the bias to balance acne against detachment, you're really just trading one artifact for another.

There are a lot of techniques that address these issues (percentage-closer filtering, cascaded shadow maps, variance shadow maps) and they do help, but they're all working around the same underlying problem. The shadow map is a regular grid, but the camera's pixels project into light space at arbitrary positions that don't align with that grid. The artifacts get smaller, or softer, or less frequent, but they don't go away.

The Irregular Z-Buffer, first described by Johnson et al. in 2004 and later developed by Wyman et al., takes a different approach. Instead of rasterizing a depth grid from the light and sampling it from the camera, you start from the camera side. For each camera-visible pixel, you project it into light space and test it directly against the scene's triangles. If a triangle lies between the pixel and the light at that exact position, the pixel is in shadow. There's no depth texture to interpolate, no bias to tune. The test is exact.

Notice what changed: shadow mapping starts from the light and asks “what depth do I see here?”, then the camera pixels have to make do with whatever the grid recorded. The IZB starts from the camera and asks “is anything blocking my light?” Each pixel gets an answer for its own exact position, not the nearest texel.

Toggle between Shadow Map and IZB. The highlighted edges trace the shadow boundary: a jagged staircase on the shadow map, a smooth curve on the IZB. Switch to the torus to see how the staircase struggles with more complex geometry.

Of course, you can't test every pixel against every triangle. At 1080p you have about two million pixels, and a scene might have a million triangles. That's some two trillion tests per frame! You need some way to narrow down which triangles could possibly shadow a given pixel. If you divide light space into a grid of cells, each pixel only needs to test the triangles that overlap the cell it projects into.

But how do you get the pixels and the triangles into the same grid? Wyman's IZB starts by putting the pixels there. You render the scene from the camera, and for each visible pixel, project it into light space and insert it into the grid cell it lands in. Since many camera pixels can land in the same cell, each cell holds a linked list of pixel references. Then you rasterize the scene again, this time from the light's perspective. For each triangle fragment, you look up the grid cell it covers, walk the linked list of pixels stored there, and test each one against the triangle. Pixels that fail the depth test are marked as shadowed.
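The per-cell linked lists have a simple shape. Here's a single-threaded Python sketch of the data layout (hypothetical names; on the GPU the head update is an atomic exchange so thousands of fragment threads can insert concurrently):

```python
NIL = -1

def build_izb(pixel_cells, num_cells):
    """pixel_cells[i] = grid cell that camera pixel i projects into.
    Returns (head, nxt): head[c] is the most recently inserted pixel in
    cell c's list, and nxt[i] chains back to the previous insertion."""
    head = [NIL] * num_cells
    nxt = [NIL] * len(pixel_cells)
    for pixel, cell in enumerate(pixel_cells):
        nxt[pixel] = head[cell]   # new node points at the old head
        head[cell] = pixel        # atomicExchange(head[cell], pixel) on GPU
    return head, nxt

def pixels_in_cell(head, nxt, cell):
    """Walk a cell's list, as the light-side pass does per fragment."""
    p, out = head[cell], []
    while p != NIL:
        out.append(p)
        p = nxt[p]
    return out
```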

This works, and it produces pixel-perfect shadows. But Wyman's implementation (and in practice, every IZB implementation since) relies on some particular hardware and pipeline features that either aren't universally available, or come with performance baggage. The linked lists need to be built from thousands of parallel fragment shader threads, which means careful atomic operations to avoid race conditions. The light-side rasterization needs to be conservative, i.e. each triangle has to shade every grid cell it overlaps, even partially, or you'll miss shadows along triangle edges. Standard GPU rasterization doesn't guarantee this. Vulkan has an extension for it (VK_EXT_conservative_rasterization), and there are older vendor-specific ones like NV_conservative_raster and INTEL_conservative_rasterization, but it's not part of core Vulkan and it's possible it won't be available on all hardware. And the fragment shader doing the shadow testing needs access to the triangle's three vertex positions, which the graphics pipeline doesn't normally provide. The typical workaround is a geometry shader, which passes vertex data through to the fragment stage but breaks vertex sharing and comes with significant overhead.

I'm building a game on a custom Vulkan engine, and I wanted simple, clean shadows that just worked. I wasn't particularly keen on spending a ton of time learning about the zoo of shadow mapping artifacts (though it seems like that happened regardless) and then going and implementing the fixes for each one. I have a soft spot for esoteric, forgotten techniques that punch above their weight, and coming across IZB during my research fit the bill perfectly: pixel-perfect results from an algorithm that the hardware was just barely not ready for when it was first published, and that was largely forgotten in the wake of ray-tracing. But my pipeline isn't set up for ray-tracing, and I didn't want to introduce it just for shadows, both for the implementation cost (setting up a proper BVH and such) and to keep hardware compatibility broad. I wasn't sure how widely supported the conservative rasterization extensions were either, and I wasn't willing to pay the geometry shader tax.

What my renderer does have is a heavily GPU-driven, bindless pipeline. The scene is rendered through multi-draw indirect with the GPU computing draw commands, frustum culling, draw counts, and even the light's projection matrix, all with no CPU readback. All geometry, transforms, and materials are accessible via buffer device addresses, so a compute shader can reach into any mesh's data without binding changes. So the question became: can you get the same result using only compute shaders?

Every prior IZB implementation put the pixels in the grid and brought the triangles to them through the rasterization pipeline. But if the rasterization pipeline is what we're trying to avoid, why not do it the other way around? Put the triangles in the grid, and let the pixels come to them. With this, we entirely avoid the downsides of the traditional approach. No geometry shaders, no conservative rasterization, no linked lists. The entire shadow pass reduces to two compute dispatches: one that bins triangles into a light-space grid, and one that tests each camera pixel against the triangles in its cell.

The first of the two dispatches is the bin pass. It's responsible for getting every shadow-casting triangle into the right grid cells. The renderer already has a list of indirect draw commands on the GPU. Each draw command describes a mesh to draw, with an index count, a material, and a transform. The bin pass walks these draw commands in compute and processes every triangle.

We dispatch one workgroup per draw command, with 256 threads per workgroup. Each thread strides over the draw's triangles: thread 0 processes triangles 0, 256, 512, and so on. Since the draw count is computed on the GPU, we dispatch a fixed maximum number of workgroups and let the excess ones early-out.
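The striding pattern is worth spelling out; a Python sketch (hypothetical names, assuming 256 threads per workgroup as above):

```python
WG_SIZE = 256  # threads per workgroup

def triangles_for_thread(thread_id, tri_count):
    """Which of a draw's triangles a given thread processes: a
    grid-stride loop, so neighboring threads touch neighboring
    triangles and the workload stays balanced."""
    return list(range(thread_id, tri_count, WG_SIZE))
```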

For each triangle, the shader does the following:

  1. Fetch the vertices. Decode the triangle ID into a draw command index and a local triangle index. Each draw command references a mesh in a global mesh table, and all meshes share a single global index buffer. Read three indices from the index buffer and use them to look up vertex positions from the mesh's vertex data.
  2. Skin if needed. For skinned meshes, blend the joint matrices. Static geometry skips this.
  3. Project to light NDC. Multiply by the light's view-projection matrix (computed earlier by the shadow setup pass from the depth prepass bounds). Reject triangles that fall entirely outside the light frustum.
  4. Compute the grid AABB. Convert the NDC positions to grid coordinates and compute the bounding box in cells.
  5. Count. For each overlapping grid cell, atomically increment that cell's triangle count.
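Steps 3 through 5 can be sketched in single-threaded Python (hypothetical names and grid size; the real shader does the increments with atomics, and skinning/culling are omitted):

```python
GRID = 128  # light-space grid resolution, cells per axis (assumed)

def cell_of(ndc):
    """Map one NDC coordinate in [-1, 1] to a grid cell index."""
    c = int((ndc * 0.5 + 0.5) * GRID)
    return max(0, min(GRID - 1, c))

def count_triangles(tris_ndc, counts):
    """tris_ndc: three (x, y) light-NDC vertices per triangle.
    counts: flat per-cell counters, length GRID*GRID."""
    for v0, v1, v2 in tris_ndc:
        # Grid-space AABB of the projected triangle.
        x0 = min(cell_of(v[0]) for v in (v0, v1, v2))
        x1 = max(cell_of(v[0]) for v in (v0, v1, v2))
        y0 = min(cell_of(v[1]) for v in (v0, v1, v2))
        y1 = max(cell_of(v[1]) for v in (v0, v1, v2))
        # Bump every overlapped cell (InterlockedAdd on the GPU).
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                counts[y * GRID + x] += 1
```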

After the count pass, a prefix sum computes each cell's exact offset into a compacted storage buffer. Then a scatter pass runs: we dispatch one thread per triangle, and each thread writes its triangle's ID into the correct positions. The result is a flat buffer where each cell's triangles are stored contiguously, indexed by the prefix sum offsets.
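A minimal Python sketch of the compaction (hypothetical names; the per-cell write cursor is an atomic add on the GPU):

```python
def exclusive_prefix_sum(counts):
    """Turn per-cell counts into per-cell start offsets."""
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total

def scatter(tri_cells, offsets, total):
    """tri_cells[t] = the cells triangle t overlaps. Returns the flat
    buffer of triangle IDs, grouped contiguously by cell."""
    cursor = list(offsets)            # per-cell write position
    out = [None] * total
    for tri, cells in enumerate(tri_cells):
        for cell in cells:
            out[cursor[cell]] = tri   # slot from InterlockedAdd on GPU
            cursor[cell] += 1
    return out
```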

The second dispatch is the evaluation pass, which produces the final shadow mask: a screen-sized buffer with the same dimensions as the main render target, storing one shadow value per pixel. It runs one thread per camera pixel.

For each pixel, the shader does the following:

  1. Read the depth buffer. Reconstruct the pixel's world-space position from the camera's depth prepass.
  2. Project into light space. Transform the position twice: once into light NDC to determine which grid cell the pixel falls into, and once into light-view space for the actual shadow test. (Why two projections? We'll come back to this, it turned out to be critical.)
  3. Look up the grid cell. Using the light NDC position, find which cell this pixel maps to and retrieve the list of triangles binned there.
  4. Test each candidate triangle. For each triangle in the cell, fetch its three vertices, project them into light-view space, and run a point-in-triangle test: compute barycentric coordinates using edge functions, check that the point is inside the triangle, and if so, interpolate the triangle's depth at that position. If the triangle is closer to the light than the pixel, the pixel is in shadow.
  5. Write the result. The first triangle that passes the test is enough: mark the pixel as shadowed and move on. If no triangles in the cell occlude the pixel, it's lit.
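The per-triangle test in step 4 can be sketched in Python (a hypothetical stand-in for the shader; vertices carry light-view x, y, and depth, with depth increasing away from the light):

```python
def triangle_occludes(px, py, v0, v1, v2, pixel_depth):
    """True if the triangle covers (px, py) and lies closer to the
    light than the pixel being shaded."""
    # Edge functions: twice the signed areas of the sub-triangles
    # formed by the sample point and each triangle edge.
    def edge(ax, ay, bx, by):
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    w0 = edge(v1[0], v1[1], v2[0], v2[1])
    w1 = edge(v2[0], v2[1], v0[0], v0[1])
    w2 = edge(v0[0], v0[1], v1[0], v1[1])
    area = w0 + w1 + w2
    if area == 0.0:
        return False                      # degenerate triangle
    # Inside only if all three weights share the triangle's winding.
    if not (min(w0, w1, w2) >= 0.0 or max(w0, w1, w2) <= 0.0):
        return False
    # Interpolate the occluder's depth at the sample position.
    depth = (w0 * v0[2] + w1 * v1[2] + w2 * v2[2]) / area
    return depth < pixel_depth            # occluder closer to the light
```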

The main lighting pass reads this shadow mask directly instead of doing any shadow map sampling.

The camera view (left) and the light's depth view (right) with the IZB grid overlaid. Click a grid cell in the light view to see which triangles are binned there. Each colored triangle is a candidate that pixels in that cell get tested against. This is the spatial acceleration structure that makes the IZB practical.

And that's it! Pixel-perfect, alias-free shadows. No staircase edges, no shadow acne, no bias to tune. Every pixel gets a geometrically exact answer.

Unfortunately, that wasn't actually it; there was more work to do. The initial implementation was running at about 15ms per frame on my 3070 Ti, on a scene with functionally no geometry. Ouch! Wyman's approach ran at 5–10ms on hardware from over ten years ago, so I knew we could do better. And on scenes with small triangles, like a character model up close, the shadows would flicker and shimmer, with individual triangles popping in and out of the shadow mask between frames. There were two problems to solve: a correctness issue causing the flickering, and a performance issue that needed about a 30x improvement. The correctness issue came first.

Remember the two projections in the evaluation pass? My initial implementation only did one: we projected into light NDC and did all of our calculations and comparisons there. On large, static geometry, the shadows were correct.

Then I loaded a character model, and the shadows started flickering. Small triangles on the character would pop in and out of the shadow mask between frames, while the shadows from larger geometry stayed stable. The artifacts got worse when I zoomed the camera out and better when I zoomed in. Since the light's orthographic projection is recomputed every frame from the depth buffer's min and max values, zooming out means a larger frustum, and a larger frustum means the NDC space gets more compressed. The flickering was correlating with how tightly the frustum fit the scene, not with the geometry itself.

The problem comes from how the orthographic projection compresses world-space distances. The frustum might span hundreds of world units, and the projection maps that entire range down to [−1, 1] in NDC. A small triangle that's 0.1 world units across becomes about 0.00025 in NDC. The barycentric test computes edge functions, which are products of differences between these compressed vertex positions:

edge = (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x)
     ≈ 0.00025 × 0.00025 = 6.25e-8

That's smaller than the spacing between adjacent float32 values near 1.0, which is where the vertex coordinates live after projection. The barycentric coordinates become unreliable, and the point-in-triangle test starts giving wrong answers for small triangles.
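The scale gap is easy to check numerically. A rough sketch, computing float32 ulp spacing from first principles (the `f32_ulp` helper and the coordinate magnitudes are illustrative assumptions, not engine values):

```python
import math

def f32_ulp(x):
    """Spacing between adjacent float32 values at magnitude x.
    Near 1.0 this is 2^-23; it scales with the exponent of x."""
    if x == 0.0:
        return 2.0 ** -149                  # smallest subnormal
    return 2.0 ** (math.floor(math.log2(abs(x))) - 23)

# In light NDC, a 0.1-unit triangle shrinks to ~0.00025 across, and its
# edge function is a product of such differences:
edge_ndc = 0.00025 * 0.00025                # ≈ 6.25e-8

# That value is smaller than the representational spacing of the
# coordinates it was computed from, so it's dominated by rounding.
assert edge_ndc < f32_ulp(1.0)

# In light-view space the triangle keeps its 0.1-unit size, and even at
# coordinates of magnitude ~10 world units the edge value dwarfs the
# coordinate spacing by several orders of magnitude:
edge_view = 0.1 * 0.1                       # = 0.01
assert edge_view > 10_000 * f32_ulp(10.0)
```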

Once I understood the problem, the fix was straightforward. The grid cell lookup only needs coarse position information, so NDC is fine for that. But the barycentric test needs precision, and that means working with numbers that haven't been crushed by the projection. The light-view matrix is just a rotation and a translation; it doesn't scale anything. A 0.1 unit triangle stays 0.1 units in light-view space, regardless of how large the frustum is:

edge ≈ 0.1 × 0.1 = 0.01  // perfectly healthy float32

So the evaluation pass projects each pixel twice: into light NDC for the grid lookup, and into light-view space for the triangle test. Since the orthographic projection is linear, depth ordering is the same in both spaces, so the depth comparison works without any changes.

With the precision issue fixed, the shadows were correct. But correct and fast are different problems. The evaluation pass on its own was taking just about the entirety of those 15ms. The bin pass was a rounding error in comparison. Almost all of that time was in the inner loop where each pixel tests candidate triangles. To understand why, consider what that loop was actually doing for each triangle: decode the triangle ID back into a draw command and a local index, look up the mesh in the mesh table, read three indices from the global index buffer, fetch three vertex positions, apply skeletal skinning if needed, multiply by the light-view matrix, and then finally run the barycentric test. That's roughly eight scattered global memory reads before a single floating-point comparison happens, repeated for every candidate triangle in the cell, for every pixel on screen.

The obvious realization is that the bin pass already does almost all of this work. It fetches the vertices, applies skinning, and projects them into light space for every triangle in the scene, that's how it knows which grid cells each triangle overlaps. The evaluation pass was then repeating the entire chain for the same triangles, potentially many times over as different pixels test the same geometry.

The fix is to have the bin pass store the results. After projecting a triangle into light-view space, write the three transformed vertex positions into a flat GPU buffer, a vertex cache, indexed by an atomic counter. The grid cells now store indices into this cache instead of raw triangle IDs. The evaluation pass's inner loop goes from eight scattered global reads per triangle to three contiguous reads from the cache. The bin pass does a tiny amount of extra work (one additional projection per triangle, into light-view space alongside the NDC projection it already does), and the evaluation pass drops from ~15ms to ~3ms.

At 3ms the evaluation pass is no longer bottlenecked on scattered memory reads, but there's still a lot of redundant work happening. At 1080p with a 128×128 grid, each grid cell covers roughly 15×8 pixels. An 8×8 workgroup maps to approximately one cell, which means all 64 threads are testing the same set of triangles, but each one is independently reading those triangles from global memory. 63 out of every 64 reads are redundant.

The fix is to have the workgroup cooperatively load the cell's triangles into shared memory. First, the threads elect a representative cell using an atomic min across active threads. In practice, about 90–95% of threads in a workgroup share the same cell:

// Elect the lowest cell index among active threads.
if (local_id == 0) sm_cell_idx = 0xFFFFFFFF;
GroupMemoryBarrierWithGroupSync();
if (active) InterlockedMin(sm_cell_idx, cell_idx);
GroupMemoryBarrierWithGroupSync();

Then, in batches of 64, each thread loads one triangle from the vertex cache into workgroup-shared memory. After a barrier, all active threads with the matching cell test against the local copy:

for (uint batch = 0; batch < total_tris; batch += LDS_BATCH) {
    uint count = min(LDS_BATCH, total_tris - batch);

    // Cooperative load: each thread loads one triangle.
    if (local_id < count) {
        uint ci = compacted_ids[base + batch + local_id];
        sm_v0[local_id] = tri_cache[ci].v0;
        sm_v1[local_id] = tri_cache[ci].v1;
        sm_v2[local_id] = tri_cache[ci].v2;
    }
    GroupMemoryBarrierWithGroupSync();

    // Each active thread tests all loaded triangles.
    if (active && cell_idx == sm_cell_idx) {
        for (uint t = 0; t < count; t++) {
            if (triangle_occludes(sample.xy, sm_v0[t], sm_v1[t], sm_v2[t], sample.z)) {
                shadow_mask[pixel_idx] = 0.0;
                active = false;
                break;
            }
        }
    }
    GroupMemoryBarrierWithGroupSync();
}

Notice that the active flag is how we handle barrier safety. Every thread in the workgroup has to hit the same barriers, even if it has no work to do: sky pixels, pixels outside the light frustum, pixels that already found a shadow. These threads can't return early or the barriers will deadlock. Instead, inactive threads still participate in the cooperative loads and barriers; they just skip the triangle testing. The ~5–10% of threads at workgroup boundaries whose cell differs from the elected one fall back to reading directly from global memory after the batched loop completes. This brought the evaluation pass from ~3ms down to ~800μs.

At this point, with the total shadow pass under a millisecond, I was pretty satisfied with performance. Then I added a more representative game level: a large octagonal arena built in TrenchBroom, and the shadow pass spiked to 4ms. The arena's walls, floors, and ceilings have large triangles that span 50 to 85+ grid cells in light space. At the time, any triangle whose bounding box exceeded 16 cells on either axis was routed to a flat overflow list instead of being scattered across the grid. This overflow list had no spatial filtering; every pixel tests every triangle in it, from global memory. With ~200 large arena triangles in the list, that's 200 extra triangle tests per pixel with no LDS caching, which overwhelmed everything else.

The fix had two parts. First, I increased the cell span threshold from 16 to 32, which kept medium-sized triangles (spanning 17–32 cells) in the spatially-filtered grid where they benefit from LDS caching. Only the truly large triangles (full walls and floors) overflow. Second, I applied the same LDS batching pattern to the overflow list. Since the overflow list is the same for all workgroups, the cooperative loading works identically. Each thread loads one triangle per batch, barrier, then all threads test from shared memory. This brought the shadow pass back down to ~700μs.

The last optimization is a cheap early-out that skips the triangle testing entirely for pixels that are obviously in shadow. This is inspired by Tokuyoshi's 2018 paper “Conservative Z-Prepass for Frustum-Traced Irregular Z-Buffers,” which showed that a conservative depth prepass could roughly halve the cost of Wyman's IZB by skipping interior shadow pixels. Before the IZB pipeline runs, we rasterize a conventional shadow map from the light's perspective with an aggressive depth bias that pushes the stored depth away from the light. This makes the shadow map “conservative”: if a pixel is behind this biased depth, it's definitively in shadow and doesn't need any triangle testing.

The tricky part is avoiding false positives. A naive comparison against the conservative shadow map would incorrectly classify pixels at shadow boundaries and on dense character meshes, so we need two layers of protection.

The first layer of protection is a large depth margin. The early-out only triggers for pixels that are far behind a blocker, 0.05 NDC units, which corresponds to a substantial depth gap. Character self-shadow depth gaps are much smaller than this (around 0.001–0.01 NDC), so the early-out never fires on the character. The character's shadows are always evaluated by the precise IZB triangle test. But for large wall-to-floor depth gaps in enclosed environments (0.05–0.2 NDC), the margin is easily exceeded.

The second layer is a 2×2 bilinear erosion. Instead of reading a single texel from the conservative shadow map, we read a 2×2 footprint and take the minimum depth. At shadow boundaries, uncovered texels have a depth of zero (in reverse-Z), which pulls the minimum below the threshold and pushes those boundary pixels to the IZB test instead of the early-out.

Neither layer works alone: the margin alone leaves jagged texel-boundary artifacts at shadow edges, and the erosion alone eats into valid shadow coverage on dense meshes. Together they cover each other's blind spots, skipping about 60–70% of shadowed pixels in enclosed environments and bringing the total shadow pass to ~500μs.
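Putting the two layers together, the early-out check might look like this Python sketch (hypothetical names; reverse-Z, so larger depth means closer to the light, and uncovered texels read 0.0):

```python
MARGIN = 0.05  # NDC depth gap required before trusting the early-out

def conservative_in_shadow(cons_map, u, v, pixel_depth):
    """cons_map: 2D conservative shadow map, biased away from the light.
    (u, v): the pixel's texel coordinate. True only when the pixel is
    definitively in shadow and can skip the IZB triangle testing."""
    h, w = len(cons_map), len(cons_map[0])
    # Layer 2: 2x2 erosion. Take the minimum over the footprint, so a
    # single uncovered neighbor (depth 0) disables the early-out and
    # routes boundary pixels to the precise test.
    eroded = min(
        cons_map[min(v + dv, h - 1)][min(u + du, w - 1)]
        for dv in (0, 1) for du in (0, 1)
    )
    # Layer 1: large margin. Only fire when the pixel is far behind the
    # blocker, so small self-shadow gaps never trigger it.
    return pixel_depth + MARGIN < eroded
```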

After all four optimizations, the full IZB shadow pipeline runs at about 500μs on my 3070 Ti, comfortably within my 1ms goal. A note on these numbers: they're approximate, measured with Tracy profiler GPU zones during development rather than a rigorous benchmarking setup. I'd like to go back and profile against more standardized test scenes in the future. That said, they're representative enough to show the shape of the optimization journey:

Step                         Time     Change
Initial MVP                  ~15ms    Naive per-pixel vertex fetch chain
Vertex cache                 ~3ms     Bin pass pre-transforms, eval reads cache
LDS caching                  ~800μs   Cooperative workgroup load into shared memory
Arena regression             ~4ms     Large triangles flood overflow list
Cell span + LDS overflow     ~700μs   Span=32, LDS-cache overflow list
Conservative SM early-out    ~500μs   Large margin + 2×2 erosion

I'm really proud of how these turned out. The shadows are clean, with no artifacts and no fiddling, and they look the same at any camera distance and on any geometry, which was the whole point. I wanted something I could set up once and not think about again.

What's Next

These are hard shadows from a zero-size light source, and the shadow mask stores one sample per pixel. There are two natural extensions that the IZB architecture is well-suited for.

The first is soft shadows. While researching for this post, I came across Lukas Kalbertodt's work on tiled per-triangle soft shadow volumes, which uses a similar compute-based pipeline with per-tile triangle binning. His approach software-rasterizes each triangle's projection onto the light source into a per-pixel bitmask, giving you penumbra values based on how much of the light disk each pixel can see. This fits almost perfectly into our evaluation pass: the spatial acceleration, vertex cache, and LDS batching all stay the same, and only the innermost per-triangle test changes. This would give us physically-based soft shadow edges that vary with the distance between the occluder and the receiver, instead of the uniformly hard edges we have now.

The second is sub-pixel anti-aliasing. Wyman's original IZB paper describes an approach where instead of testing multiple samples per pixel, you replace the point-in-triangle test with a frustum-triangle intersection. Each pixel is treated as a small quad projected onto the surface, and each edge of the occluder triangle defines a shadow half-plane. Projecting these half-planes onto the pixel's footprint and looking up 32-sample visibility bitmasks from a precomputed table gives you analytic sub-pixel coverage. Three lookups and a binary AND per triangle, rather than 32 individual sample tests. This would smooth out the remaining one-pixel staircase at shadow boundaries where our current one-sample-per-pixel mask transitions between lit and shadowed.

If I end up implementing either of these, I'll write them up as follow-up posts.

Conclusion

Thanks for reading! For the full interactive comparison with all controls, check out the standalone demo. If you have questions or want to discuss any of this, feel free to reach out. And if you found this interesting and are looking for a graphics programmer: please hire me, I don't have a job.

References