Meshlet LOD-compatible two-pass occlusion culling#12898
Meshlet LOD-compatible two-pass occlusion culling#12898superdump merged 588 commits intobevyengine:mainfrom
Conversation
Co-authored-by: François <mockersf@gmail.com>
Fix occlusion culling (partially) and add meshlet bounding sphere debug viewer
Fix occlusion culling in orthographic views
Co-authored-by: Robert Swain <robert.swain@gmail.com>
|
Something is up with the bounding spheres being wonky. Occlusion culling isn't working correctly. |
|
Fixed bugs, ready to go again. |
atlv24
left a comment
There was a problem hiding this comment.
I still think that there's many literals in the shaders that should be either extracted into constants or conversion functions (like the << 6u/>> 6u cluster id conversions can be functions, 32u can be BITS_PER_ELEM or similar) but this can be addressed later once those numbers actually settle
| let sphere_depth = -view.projection[3][2] / (culling_bounding_sphere_center_view_space.z + culling_bounding_sphere_radius); | ||
| meshlet_visible &= sphere_depth >= occluder_depth; | ||
| let meshlet_triangle_count = meshlets[meshlet_id].triangle_count; | ||
| let buffer_start = atomicAdd(&draw_indirect_args.vertex_count, meshlet_triangle_count * 3u) / 3u; |
There was a problem hiding this comment.
is it not possible to skip the muls and divs by 3 here and just multiply afterwards at the last possible moment?
There was a problem hiding this comment.
We'd need an extra pass to just multiply by 3, I doubt it's worth it just to save like 2 instructions.
Co-authored-by: vero <email@atlasdostal.com>
pcwalton
left a comment
There was a problem hiding this comment.
This looks fine. Since you mentioned performance, I left a few microoptimization suggestions. Feel free to take them or leave them :)
| // Project the culling bounding sphere to view-space for occlusion culling | ||
| #ifdef MESHLET_FIRST_CULLING_PASS | ||
| let previous_model = affine3_to_square(instance_uniform.previous_model); | ||
| let previous_model_scale = max(length(previous_model[0]), max(length(previous_model[1]), length(previous_model[2]))); |
There was a problem hiding this comment.
| let previous_model_scale = max(length(previous_model[0]), max(length(previous_model[1]), length(previous_model[2]))); | |
| let previous_model_scale = sqrt(max(dot(previous_model[0], previous_model[0]), max(dot(previous_model[1], previous_model[1]), dot(previous_model[2], previous_model[2])))); |
(feel free to split up into multiple lines if you want)
This saves 2 sqrt instructions.
There was a problem hiding this comment.
It's hard to tell. These microoptimizations don't seem to show up on profiles, as they're not the bottleneck. The bottleneck for culling is mostly thread divergence, and a teeny bit lack of registers.
| let depth_pyramid_size_mip_0 = vec2<f32>(textureDimensions(depth_pyramid, 0)) * 0.5; | ||
| let width = (aabb.z - aabb.x) * depth_pyramid_size_mip_0.x; | ||
| let height = (aabb.w - aabb.y) * depth_pyramid_size_mip_0.y; | ||
| let depth_level = max(0, i32(ceil(log2(max(width, height))))); // TODO: Naga doesn't like this being a u32 |
There was a problem hiding this comment.
Moving the * 0.5 after the max would save 1 multiply.
| fn get_meshlet_occlusion(cluster_id: u32) -> bool { | ||
| let packed_occlusion = meshlet_occlusion[cluster_id / 32u]; | ||
| fn meshlet_is_second_pass_candidate(cluster_id: u32) -> bool { | ||
| // TODO: Does this read need to be an atomicLoad? |
There was a problem hiding this comment.
No, not if all the writing happens in a previous compute dispatch. wgpu should have automatically inserted the proper barriers.
There was a problem hiding this comment.
Oh whoops, this is left over from the previous version of this PR. I can remove it.
| ); | ||
| cull_pass( | ||
| "meshlet_culling_first_pass", | ||
| "culling_first", |
There was a problem hiding this comment.
Why change this name? We'll have non-meshlet GPU culling too, and so it'd be nice to see the meshlet code clearly separated out in RenderDoc.
There was a problem hiding this comment.
It is separated out. It's already nested under a larger meshlet debug span. The repeated use of meshlet for individual passes made it harder to read the NSight profiles.
| let instance_uniform = meshlet_instance_uniforms[instance_id]; | ||
| let meshlet_id = meshlet_thread_meshlet_ids[cluster_id]; | ||
| let model = affine3_to_square(instance_uniform.model); | ||
| let model_scale = max(length(model[0]), max(length(model[1]), length(model[2]))); |
There was a problem hiding this comment.
| let model_scale = max(length(model[0]), max(length(model[1]), length(model[2]))); | |
| let model_scale = sqrt(max(dot(model[0], model[0]), max(dot(model[1], model[1]), dot(model[2], model[2])))); |
Saves 2 sqrt instructions.
Co-authored-by: Robert Swain <robert.swain@gmail.com>
Keeping track of explicit visibility per cluster between frames does not work with LODs, and leads to worse culling (using the final depth buffer from the previous frame is more accurate).
Instead, we need to generate a second depth pyramid after the second raster pass, and then use that in the first culling pass in the next frame to test if a cluster would have been visible last frame or not.
As part of these changes, the write_index_buffer pass has been folded into the culling pass for a large performance gain, and to avoid tracking a lot of extra state that would be needed between passes.
Prepass previous model/view stuff was adapted to work with meshlets as well.
Also fixed a bug with materials, and other misc improvements.