1. Culling: Only process the blocks that aren't completely surrounded by other blocks with cube models. The other blocks we can't see are "culled" away
By "cube models" I mean that they have the default block model that takes up the whole space. For example, if a torch is on a wall the block it is attached to should still display, but if I replace the torch with a dirt block I no longer need to render the block the torch was attached to (assuming it isn't touching any other air blocks)
By the way, for all non-culled blocks, only rendering the block faces that are visible (and hiding faces that are against a wall) can end up almost halfing your triangle count sometimes, which really helps! I highly recommend trying it.
2. For all the non-culled blocks, convert their raw data into triangles (or some other intermediate data of some kind)
3. Move that data to the GPU
4. Tell the GPU to render it
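To make that neighbor check concrete, here's a minimal C# sketch of steps 1 and 2 combined. The chunk layout, the block IDs, and the IsOpaqueCube helper are just placeholders for illustration, not my actual code:

```csharp
using System.Collections.Generic;
using UnityEngine;

public static class ChunkMesher
{
    // Assumption for this sketch: block id 0 = air, anything else is a full opaque cube.
    static bool IsOpaqueCube(int id) => id != 0;

    static readonly Vector3Int[] Directions =
    {
        new Vector3Int( 1, 0, 0), new Vector3Int(-1, 0, 0),
        new Vector3Int( 0, 1, 0), new Vector3Int( 0,-1, 0),
        new Vector3Int( 0, 0, 1), new Vector3Int( 0, 0,-1),
    };

    // Returns (block position, face direction) pairs for every face that should be drawn.
    public static List<(Vector3Int pos, Vector3Int dir)> VisibleFaces(int[,,] blocks, int size)
    {
        var faces = new List<(Vector3Int, Vector3Int)>();
        for (int x = 0; x < size; x++)
        for (int y = 0; y < size; y++)
        for (int z = 0; z < size; z++)
        {
            if (!IsOpaqueCube(blocks[x, y, z])) continue; // air: nothing to draw

            foreach (var d in Directions)
            {
                int nx = x + d.x, ny = y + d.y, nz = z + d.z;
                bool covered = nx >= 0 && nx < size && ny >= 0 && ny < size &&
                               nz >= 0 && nz < size && IsOpaqueCube(blocks[nx, ny, nz]);
                // A face is only visible if its neighbor is not a full opaque cube.
                // If all six neighbors are opaque cubes, the whole block gets culled.
                if (!covered)
                    faces.Add((new Vector3Int(x, y, z), d));
            }
        }
        return faces;
    }
}
```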
I had initially been doing step 1 in a compute shader (running one GPU thread per block, then appending the coordinates of all blocks that aren't culled to an AppendBuffer and reading it out later). If this seems like overkill, you're right. The reason I was doing it that way was because I borrowed that code from my VR Sand project. In VR Sand, blocks change very frequently, so I needed this whole process to be very fast. And it was, but when I tried to scale up my world sizes, my ComputeBuffers suddenly needed to be very large and the cost of managing them all became slow. I also couldn't easily increase the chunk sizes to something like 64x64x64 before my compute buffers started complaining about that being too many threads.
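For reference, the C# side of that GPU version looked roughly like the sketch below. The shader, kernel, and buffer names are just illustrative, and the compute shader itself isn't shown:

```csharp
using UnityEngine;

public class GpuCullingSketch : MonoBehaviour
{
    public ComputeShader cullShader;
    ComputeBuffer visibleBlocks;   // append buffer the shader writes surviving block coords into
    ComputeBuffer countBuffer;     // tiny buffer used to read back the append count

    void CullChunk(int chunkSize)
    {
        int kernel = cullShader.FindKernel("CullBlocks");

        visibleBlocks = new ComputeBuffer(chunkSize * chunkSize * chunkSize,
                                          sizeof(int) * 3, ComputeBufferType.Append);
        visibleBlocks.SetCounterValue(0);
        countBuffer = new ComputeBuffer(1, sizeof(int), ComputeBufferType.Raw);

        cullShader.SetBuffer(kernel, "VisibleBlocks", visibleBlocks);
        // One thread per block; assumes [numthreads(8,8,8)] in the shader.
        cullShader.Dispatch(kernel, chunkSize / 8, chunkSize / 8, chunkSize / 8);

        // Read back how many blocks survived culling, then their coordinates.
        ComputeBuffer.CopyCount(visibleBlocks, countBuffer, 0);
        int[] count = new int[1];
        countBuffer.GetData(count);
        Vector3Int[] coords = new Vector3Int[count[0]];
        visibleBlocks.GetData(coords, 0, 0, count[0]);
        // ... build triangles from coords ...
        // (both buffers need to be Released when you're done with them)
    }
}
```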
Also, I wanted to multithread things, and Dispatch can only be run on the main thread.
So I figured I'd try moving my code to the CPU and see how the performance fares.
It turns out that it worked very well! The reason is this: my main overhead was draw calls, and moving everything to the CPU allowed me to increase the chunk size and decrease the number of draw calls.
I also decided to upgrade to Unity 2019, whose profiler is much more efficient. This was very nice, because before, if I let Unity's deep profiler record for more than 200 frames (or at the very start when the world was generating), my computer would hang for 10-20 minutes and not even let me bring up the task manager, so I'd have to restart my computer. With the 2019 improvements, this is no longer a problem, making this whole optimization process much less frustrating :)
Anyway, so after moving steps 1 and 2 to the CPU, I needed to move the data to the GPU. I had three options:
Option A: Use the Unity builtin Mesh object
Option B: Use DrawProcedural
Option C: Use a native graphics plugin and manage the VBOs myself
I figured Option A made the most sense, so I made one Mesh object per chunk and tried it out. The performance was decent, but I kept getting these unpredictable spikes occasionally. I looked at the profiler, and it had to do with Unity "deciding" when the right time was to make VBOs, Unity trying to cull things, and related graphics work.
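For the curious, a stripped-down version of that per-chunk Mesh setup looks something like this (names are placeholders, and a real version would also set normals, UVs, and so on):

```csharp
using System.Collections.Generic;
using UnityEngine;

// Option A sketch: one Mesh (via MeshFilter/MeshRenderer) per chunk.
[RequireComponent(typeof(MeshFilter), typeof(MeshRenderer))]
public class ChunkMeshHolder : MonoBehaviour
{
    Mesh mesh;

    public void UploadChunk(List<Vector3> vertices, List<int> triangles)
    {
        if (mesh == null)
        {
            mesh = new Mesh();
            // 16-bit indices cap out at ~65k vertices, which large chunks can exceed.
            mesh.indexFormat = UnityEngine.Rendering.IndexFormat.UInt32;
            GetComponent<MeshFilter>().sharedMesh = mesh;
        }
        mesh.Clear();
        mesh.SetVertices(vertices);
        mesh.SetTriangles(triangles, 0);
        mesh.RecalculateNormals();
        // Unity decides internally when to create/update the underlying VBO,
        // which is where the unpredictable spikes showed up for me.
    }
}
```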
For those that don't know, a VBO (vertex buffer object) is the object you make when you want to pass triangle data from the CPU to the GPU. Unity has a builtin Mesh object that tries to abstract that away from you. For most purposes, this abstraction works pretty well.
However, I needed more fine-grained control over exactly when the VBOs were being created and moved to the graphics engine, because these lag spikes were a big pain in the neck. I kept trying to tweak settings and fiddle with stuff to convince Unity to do things the way I wanted (spread things out over multiple frames), but Unity kept being annoying.
I looked into the details, and you can get a little lower level control here (Option C) if you manage the graphics objects yourself. However, that would lose me a lot of platform portability, and I find native plugins generally a pain to use (I've had to write a few in the past for other related projects), so I decided to try Option B first.
To explain Option B: Unity has a DrawProcedural method. How this works is that you do the following:
- Create a ComputeBuffer that has your graphics data
Then, in the drawing loop, you do the following:
- Give your drawing material that buffer using drawingMaterial.SetBuffer("DrawingThings", graphicsDataBuffer);
- Call drawingMaterial.SetPass(0) on a material that has a shader you wrote attached to it
- Call Graphics.DrawProcedural(MeshTopology.Triangles, numVertices);
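Put together, a minimal version of that flow looks something like the sketch below. To keep it short, the buffer only holds positions here; a real shader would want normals, UVs, etc. packed into the struct, and the shader itself (which reads "DrawingThings" by SV_VertexID) isn't shown:

```csharp
using UnityEngine;

public class ProceduralChunkDrawer : MonoBehaviour
{
    public Material drawingMaterial;     // material with your custom shader attached
    ComputeBuffer graphicsDataBuffer;    // filled once, whenever the chunk is (re)meshed
    int numVertices;

    public void UploadChunk(Vector3[] vertices)
    {
        numVertices = vertices.Length;
        graphicsDataBuffer?.Release();
        graphicsDataBuffer = new ComputeBuffer(numVertices, sizeof(float) * 3);
        graphicsDataBuffer.SetData(vertices);
        drawingMaterial.SetBuffer("DrawingThings", graphicsDataBuffer);
    }

    void OnRenderObject()
    {
        if (graphicsDataBuffer == null) return;
        drawingMaterial.SetPass(0);
        // In newer Unity versions this immediate-mode call is named Graphics.DrawProceduralNow.
        Graphics.DrawProcedural(MeshTopology.Triangles, numVertices);
    }

    void OnDestroy()
    {
        graphicsDataBuffer?.Release();
    }
}
```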
This works well, but each DrawProcedural call has some overhead. To address this, I tried making my chunk sizes 64x128x64 (they used to be 16x16x16). Once I did this, I was surprised to find that I could render tons of chunks while keeping a very good frame rate!
The key here is that in a blocky type world, once you do world generation and compute lighting, blocks don't change that often. This means it is okay for us to spend the few additional milliseconds recreating the triangles of an entire 64x128x64 chunk every time a single block changes. And once you are done doing that, you just need to pass that data to the GPU, and then the overhead per frame is very low (since there are very few draw calls)!
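In practice that just means marking a chunk dirty when a block changes and remeshing it once. A rough sketch of that flow, with placeholder names (the Chunk type and RebuildTrianglesAndUpload stand in for the real per-chunk meshing/upload code):

```csharp
using System.Collections.Generic;
using UnityEngine;

public class Chunk
{
    public void RebuildTrianglesAndUpload() { /* steps 1-3 from above, for this chunk only */ }
}

public class ChunkRemeshQueue : MonoBehaviour
{
    readonly HashSet<Chunk> dirtyChunks = new HashSet<Chunk>();

    public void OnBlockChanged(Chunk chunk)
    {
        // Cheap: just remember that this chunk needs remeshing.
        dirtyChunks.Add(chunk);
    }

    void LateUpdate()
    {
        // Pay the few-millisecond remesh cost only on frames where blocks changed,
        // and only for the chunks that actually changed.
        foreach (var chunk in dirtyChunks)
            chunk.RebuildTrianglesAndUpload();
        dirtyChunks.Clear();
    }
}
```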
To prove this, here is a screenshot of the world. My world generation is really simple right now, and I made the variability in height much higher than normal just to see what would happen. But yeah, that's my solution to rendering lots of blocks efficiently in Unity :)