When Draw calls stop being the point
I thought rendering performance was about optimizing draw calls. It turns out that assumption is 10+ years out of date.
The initial issue
This topic is actually wild, and I’m glad I came across it the way I did. Background: I’m building my own engine for my game 004 (name pending), and one of the first frictions I hit was the way I had to submit draw calls. Now, I’m a beginner to the whole rendering thing, so I’ve been using OpenGL, with LearnOpenGL as my introduction. The issue I was facing with draw calls was twofold: 1. I had to include GL everywhere I wanted rendering logic, and 2. at the end, everything had to be funneled through one central place, along with its relevant data, which made the individual draw calls. Obviously this smelled like I was doing something wrong, so I wanted to improve it. Now, I’ve been using ImGui for quick debug UI and for my editor, and I’m really a big fan of its API model. The fact that in my main loop I just need to call ImGui::NewFrame() at the start and ImGui::EndFrame() at the end, while any subsystem or class in between can define its own nodes, seemed like the perfect API. But my one gripe with this whole API is that it relies on a global context with a lot of state. And while I do see the benefits (I love it), that comes with a few restrictions: it has to run on the main thread, and thread safety is not guaranteed (I did see a comment explaining how to get thread safety, but didn’t follow it through, to be fair).
Which is fair play, but my current goal is to use OpenGL to learn the basics and then move on to Vulkan for the sweet promise of multiple queues and parallel submission. So I didn’t want this sort of API for my rendering pipeline. But while writing more of my engine against my naive approach, I realized I was doing the same thing over and over: fetch the resource handle -> write data -> in the render loop, use the data to make the draw call. I was writing all of this manually, and there are a lot of resources that are used by multiple draw calls, very often. By default this looked like a DAG problem to me, from my earlier [[Penny]] work (TODO: shill it with a link here). If I could define the unit of work as a node and the dependencies as edges, I could build out a graph and, right before making the draw calls, sort it for optimization! This felt like a revelation, and lo and behold: this idea (albeit way better and way more thought out) has a name. The render graph. I was ecstatic that I’d arrived at the idea (sort of) on my own, and that real engines validate it as an orchestration layer.
So, naturally I wanted to look at how the great engines do it, so I went to The Forge’s GitHub for some insight, and that’s where I came across this: https://github.com/ConfettiFX/The-Forge/issues/171. The great wolfgangfengel (loved his book series btw, can’t recommend it enough; it’s how I learnt about SVOs and the wonderful world of Morton encoding) has a comment towards the bottom that said “With a GPU-driven renderer, you do not make many draw calls.” and “It was a good idea 10+ years ago.” (This one stung, ngl.) With a GPU-driven renderer, you do not make many draw calls? I’m not joking when I say this revelation actually shook me. All the resources I’d been reading just have draw calls, and maybe deferred vs forward rendering, and I thought that was advanced, and that AAA studios just have really tight optimizations on top. Man, was I wrong.
The paradigm shift
So, based on my research, GPU-driven rendering leans heavily on SSBOs and frustum culling to reduce CPU sync overhead. Uniform blocks are small: the GL spec only guarantees 16 KB per UBO (most desktop drivers give you 64 KB), while SSBOs are guaranteed at least 128 MB and in practice are limited mostly by VRAM. AND they’re read-write. So modern renderers upload what they need to the GPU, and a compute shader/kernel generates the draw commands. That’s a completely different game, one that I didn’t remotely see coming. Since the visibility check is embarrassingly parallel, it’s the perfect fit for frustum culling on the GPU. My chunking logic to avoid O(n) over all of my tiles can’t realistically compare in terms of parallelism or bandwidth, even with 32 threads. This pipeline makes the CPU the policy setter, while the GPU becomes solely responsible for all computation regarding geometry, with minimal data sync between the two. Post initial upload, the CPU side only needs to tell the GPU about the data that changed.
So, with good map design (breaking line of sight, tall structures, broader geometry, and so on), you can render beautiful scenes while offloading the massive compute to the GPU.
Now, I imagine this isn’t a silver bullet. The CPU still needs to define barriers and orchestrate data uploads, so I don’t imagine a render graph on the CPU side is useless. But I can definitely see now how AAA games achieve their visual fidelity, something I had a massive misconception about.
The mental model revision
What finally clicked for me through this intense journey is that this isn’t just an optimization. It’s a role reversal. The CPU stops being the thing that decides what gets drawn; it sets policy, and is free to focus on game logic, state machines, and gameplay mechanics.
The GPU becomes responsible for what is shown in this frame.