Tile-Based Forward Rendering

Update, June 2012: I've made some inline revisions, but you can see a summary of my changes here.
Forward+

There's been a bit of buzz about AMD's Leo Demo recently, which uses a tile-based culling technique to generate a per-tile list of lights that might affect each tile. According to the website,

this demo uses DirectCompute to cull and manage lights in a scene. The end result is a per-pixel or per-tile list of lights that forward-render based shaders use for lighting each pixel.

I thought I'd give this a go, and while my approach differs somewhat in the technicalities to AMD's implementation, the benefits are theoretically the same. For those who haven't discovered this algorithm yet, a quick summary of the advantages are:

Aras P explains in more detail, so instead I'll document my direct findings here, as well as my thoughts on solutions to some of the problems this algorithm might cause. I assume readers have some familiarity with the algorithm.

I've uploaded some screenshots to my webzone.

I had no paper to work off of, so my implementation is basically made up based on the few hints from the quoted paragraph above from AMD.

Depth Prepass

To start with, I do a depth prepass over the scene. There's nothing I do that's unusual about this step; it's just a boring old Zpass. However, then I down-sample it so that I have one conservative depth value per tile. By default my application accumulates a linked list of lights in 8x8 pixel tiles, so I have a shader that finds the maximum depth of all the pixels in that tile and writes it to a smaller depth buffer.

Conservative depth

If you're using 1x1 pixel tiles without MSAA you can skip the depth down-sample stage. If you are using MSAA you will always need to do this step even if you have 1x1 tiles, as you'll need to resolve your MSAA depth buffer to a 1x depth buffer for the next step[Note]. Don't forget to read each depth sample in each pixel of your MSAA depth buffer.

Ensure you're progressively down-sampling. The reason for this is subject of a whole post in itself, but essentially it's to reduce the number of pixels in flight at any one time.

My approach is this:

  1. Resolve my MSAA depth buffer to a full-res non-MSAA depth buffer
  2. Down-sample and transpose the output of (1) to a H x (W/T) buffer (where W and H are the full-res buffer's dimensions, and T is the tile size)
  3. Down-sample and transpose the output of (2) to the final (W/T)x(H/T) depth buffer.

Make sure you unroll your HLSL loops and pass the tile size data in via compile-time macros.

Building the Per-Tile Linked Lists

In order to accumulate what lights might affect each tile I render a volume for each light, just like you would with a deferred renderer. Instead of shading, I merely add the light's ID to a linked list for each tile. I use the down-sampled depth buffer from the previous step to quickly reject pixels that definitely have no lights affecting them.

Where with a deferred renderer you might use Carmack's Reverse, in this case that won't work. Rejecting pixels that pass the Z-test means you're rejecting pixels that might include a transparent object. You won't have information about transparency in your down-sampled depth buffer, so you have to be conservative here.

I still do stencil-based culling; I only discard the pixels that fail the Z-test. The stencil buffer still comes in handy in the case that the camera intersects the volume, but in all other cases the depth buffer would have probably sufficed.

The stencil mode I used is:

// Depth test parameters
dsDesc.DepthEnable = true;
dsDesc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ZERO;
dsDesc.DepthFunc = D3D11_COMPARISON_LESS_EQUAL;
// Stencil test parameters
dsDesc.StencilEnable = true;
dsDesc.StencilReadMask = D3D11_DEFAULT_STENCIL_READ_MASK;
dsDesc.StencilWriteMask = D3D11_DEFAULT_STENCIL_WRITE_MASK;
// Stencil operations if pixel is front-facing
dsDesc.FrontFace.StencilFailOp = D3D11_STENCIL_OP_KEEP;
dsDesc.FrontFace.StencilDepthFailOp = D3D11_STENCIL_OP_DECR;
dsDesc.FrontFace.StencilPassOp = D3D11_STENCIL_OP_KEEP;
dsDesc.FrontFace.StencilFunc = D3D11_COMPARISON_ALWAYS;
// Stencil operations if pixel is back-facing
dsDesc.BackFace.StencilFailOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilDepthFailOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilPassOp = D3D11_STENCIL_OP_INCR;
dsDesc.BackFace.StencilFunc = D3D11_COMPARISON_ALWAYS;

In short, I always allow back faces to increment the buffer, but front faces will only decrement when their fragments fail the depth test. I use no back-face culling, so this prevents any overdraw in my light volume pixel shader.

Update: Since writing this I've switched to a Compute Shader for the list generation. I used a line-sphere test in the Compute Shader to determine the maximum and minimum depth values at each corner of each tile. I added lights to either the opaque or transparent buffers based on whether the light intersected this frustum section.

The speed increase by using the CS is easily by an order of magnitude. Beforehand I was primitive assembly bound due to the number of spheres that I had to render, whereas running a compute shader (clipped to the bounds of the light in screen space) was much more efficient. Using the line test also meant I didn't have to grow my spheres by a tile, which previously meant I was getting conservative light addition in some cases.

Tiles being affected by lights

On the right is a render of all pixels that could potentially be shaded by lights. You can see the light volume in the floor will only shade inside the corridor.

The linked list generation is based on AMD's Order Independent Transparency presentation. I have two Unordered Access Views: one stores the linked list elements (the LinkBuffer), and the other stores the offset into the LinkBuffer that marks the start of the list for that tile (the HeadBuffer).

The LinkBuffer could potentially use unbounded amounts of storage because we don't know how many lights will overlap (thus, we can't determine how large each tile's linked list will be), so when you allocate your LinkBuffer make sure you estimate conservatively. I give mine 16 links for each tile, so I have a buffer of (X/T)*(Y/T)*16 link elements, assuming the back buffer dimensions are (X, Y) and a tile size of T. 16 might be a bit excessive for typical game scenes though.

Because this is also going to be a big buffer, I don't store any lighting data in it; I only store the ID of the light and the index of the Next link item. In HLSL:

struct LightLink
{
    uint LightID;
    uint Next;
};

The HeadBuffer is a constant size: one uint per tile. I clear each pixel in this buffer to -1 at the start of the linked list generation process.

I draw each light volume with no culling and depth-read, in two passes. The first pass writes to the stencil buffer, so I turn off colour writes, set the magic stencil mode and draw a sphere. The second pass is drawn with the LinkedList generation pixel shader and only executes if the stencil == 1. This pass also resets the stencil to 0 for each pixel that passes.

The volume pixel shader looks like this:

[earlydepthstencil]
void Main(float4 pos : SV_POSITION)
{
    uint x = pos.x; // X coordinate in pixels
    uint y = pos.y; // Y coordinate in pixels
    LightLink link;
    link.LightID = LightID;
    // (1) Increment the link counter and get the current link ID
    uint linkID = LinkBuffer.IncrementCounter();
    // (2)
    // Read any existing linkID in the head buffer
    // Replace it with the new link ID
    HeadBuffer.InterlockedExchange(
        (y * BufferWidth + x) * 4,
        linkID,
        link.Next);
    // Write the link into the link buffer
    LinkBuffer[linkID] = link;
}

In stage (1), we use the UAV's IncrementCounter to give us the index of the next LightLink element we should write. We exchange this index with the value currently in the HeadBuffer (2), which becomes the new LightLink's Next index. If the value in HeadBuffer is -1, link.Next will be -1 and that means this element is the end of the linked list.

Hopefully you can see how this works, but if not I highly recommend reading the AMD OIT presentation.

Update: A neat optimisation I've recently added is to generate two separate lists for opaque-affecting lights and transparent-affecting lights. Because transparent objects could have any depth less than the value in the depth buffer, all lights that pass the depth test need to be recorded. However, for opaque objects, we can reject lights that we know don't intersect anything.

For each tile I accumulate a minimum and maximum depth value. If a light doesn't fall inside [0, MaxZ] then it's culled completely. If it does, it gets added to the TransparentLightBuffer. If it also falls between [MinZ, MaxZ], I add it to an OpaqueLightBuffer. Obviously this breaks down if a tile has a large depth range, but it's still a significant saving in the general case.

Finally, because we want the opaque materials to be as fast as possible, I serialise the lighted linked lists for each tile into a flat array using another CS. This has the obvious drawback of having to allocate a fixed maximum number of lights per tile, but it nets a huge speed win.

Rendering the Scene

Once you've generated your linked lists, you can bind the LinkBuffer and HeadBuffer UAVs to your pixel shaders that need lighting information. You'll also need to bind a buffer containing all your lighting data. I do this by storing my point light data in a StructuredBuffer for random access by the pixel shader:

struct PointLight
{
    float3 Position;
    float Radius;
    float3 Colour;
};
StructuredBuffer<PointLight> PointLightBuffer;
StructuredBuffer<LightLink> LinkBuffer;
Buffer<uint> HeadBuffer;

The lighting part is easy. First, index into the HeadBuffer based on what tile this pixel touches using SV_Position:

uint x = vpos.x / BufferScale.x;
uint y = vpos.y / BufferScale.y;
uint head_index = y * BufferWidth + x;

BufferScale and BufferWidth in this case are:

BufferWidth = HeadBufferWidth;
BufferScale = { BackBufferWidth / HeadBufferWidth, BackBufferHeight / HeadBufferHeight };

To iterate over the lights, read the index of the first LightLink from the HeadBuffer:

uint next = HeadBuffer[head_index];

Then iterate until you hit the end of the list:

while (next != 0xFFFFFFFF)
{
    LightLink link = LinkBuffer[next];
    PointLight pointLight = PointLightBuffer[link.LightID];
    lighting += Shade(worldPos, worldNormal, eyeVec, specPower, pointLight);
    next = link.Next;
}

Shadows

One problem that people have flagged up is shadows: if you wanted to mimic current algorithms you'd have to bind every possible shadow map to your pixelshader, which obviously isn't viable.

Orc

Instead I was considering rendering all the shadow maps into a virtual texture. This wouldn't be a complicated change, and it would still be possible render all the point light shadows in one pass using a geometry shader and six different viewports. All I'd have to do is upload the UVs in the shadow map that each light uses.

People have mentioned texture arrays, but that means each shadow map would have to be the same resolution, which doesn't sound like a great solution to me.

A Quick Note on Driver Stability

I've been queried about my use of InterlockedExchange in the linked list generation. It's true that only one fragment should be written to that tile at any one time, but for some reason not doing InterlockedExchange causes my GPU to hang. I'm updating my drivers as I type this to hopefully fix this, as I can't think of any reason a normal Load/Store wouldn't suffice instead.

Note that the GPU can and will hang if your linked list is corrupt. It will spin through garbage link values until it sees a -1, and given that you have millions of lists, you're unlikely to avoid a total GPU meltdown. Windows takes it down most of the time. The times that it doesn't usually mean a BSOD is heading your way.

Overall

The Sponza demo runs at 1080p with 8x MSAA at approximately 14ms on my Radeon HD 5700 with 213 lights.

I'd really like to see this side-by-side with a deferred renderer. I would assume this technique is slower than deferred due to the cache performance of per-pixel linked list traversal, but it does offer significant advantages. I wonder if it might actually be faster than deferred if you're bandwidth bound on your G-buffer. Either way, it's still a huge improvement over a multi-pass forward renderer with many dynamic lights.

BRDFs