Skip to content

Meshlet LOD-compatible two-pass occlusion culling#12898

Merged
superdump merged 588 commits intobevyengine:mainfrom
JMS55:meshlet-last-frame-depth-pyramid
Apr 28, 2024
Merged

Meshlet LOD-compatible two-pass occlusion culling#12898
superdump merged 588 commits intobevyengine:mainfrom
JMS55:meshlet-last-frame-depth-pyramid

Conversation

@JMS55
Copy link
Contributor

@JMS55 JMS55 commented Apr 6, 2024

Keeping track of explicit visibility per cluster between frames does not work with LODs, and leads to worse culling (using the final depth buffer from the previous frame is more accurate).

Instead, we need to generate a second depth pyramid after the second raster pass, and then use that in the first culling pass in the next frame to test if a cluster would have been visible last frame or not.

As part of these changes, the write_index_buffer pass has been folded into the culling pass for a large performance gain, and to avoid tracking a lot of extra state that would be needed between passes.

Prepass previous model/view stuff was adapted to work with meshlets as well.

Also fixed a bug with materials, and other misc improvements.

JMS55 and others added 30 commits January 14, 2024 21:19
Co-authored-by: François <mockersf@gmail.com>
Fix occlusion culling (partially) and add meshlet bounding sphere debug viewer
Fix occlusion culling in orthographic views
Co-authored-by: Robert Swain <robert.swain@gmail.com>
@JMS55 JMS55 marked this pull request as draft April 24, 2024 01:12
@JMS55
Copy link
Contributor Author

JMS55 commented Apr 24, 2024

Something is up with the bounding spheres being wonky. Occlusion culling isn't working correctly.

@JMS55 JMS55 marked this pull request as ready for review April 24, 2024 06:38
@JMS55
Copy link
Contributor Author

JMS55 commented Apr 24, 2024

Fixed bugs, ready to go again.

@JMS55 JMS55 requested a review from pcwalton April 25, 2024 03:20
Copy link
Contributor

@atlv24 atlv24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think that there's many literals in the shaders that should be either extracted into constants or conversion functions (like the << 6u/>> 6u cluster id conversions can be functions, 32u can be BITS_PER_ELEM or similar) but this can be addressed later once those numbers actually settle

let sphere_depth = -view.projection[3][2] / (culling_bounding_sphere_center_view_space.z + culling_bounding_sphere_radius);
meshlet_visible &= sphere_depth >= occluder_depth;
let meshlet_triangle_count = meshlets[meshlet_id].triangle_count;
let buffer_start = atomicAdd(&draw_indirect_args.vertex_count, meshlet_triangle_count * 3u) / 3u;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it not possible to skip the muls and divs by 3 here and just multiply afterwards at the last possible moment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd need an extra pass to just multiply by 3, I doubt it's worth it just to save like 2 instructions.

Co-authored-by: vero <email@atlasdostal.com>
Copy link
Contributor

@pcwalton pcwalton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine. Since you mentioned performance, I left a few microoptimization suggestions. Feel free to take them or leave them :)

// Project the culling bounding sphere to view-space for occlusion culling
#ifdef MESHLET_FIRST_CULLING_PASS
let previous_model = affine3_to_square(instance_uniform.previous_model);
let previous_model_scale = max(length(previous_model[0]), max(length(previous_model[1]), length(previous_model[2])));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let previous_model_scale = max(length(previous_model[0]), max(length(previous_model[1]), length(previous_model[2])));
let previous_model_scale = sqrt(max(dot(previous_model[0], previous_model[0]), max(dot(previous_model[1], previous_model[1]), dot(previous_model[2], previous_model[2]))));

(feel free to split up into multiple lines if you want)

This saves 2 sqrt instructions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to tell. These microoptimizations don't seem to show up on profiles, as they're not the bottleneck. The bottleneck for culling is mostly thread divergence, and a teeny bit lack of registers.

Comment on lines +89 to +92
let depth_pyramid_size_mip_0 = vec2<f32>(textureDimensions(depth_pyramid, 0)) * 0.5;
let width = (aabb.z - aabb.x) * depth_pyramid_size_mip_0.x;
let height = (aabb.w - aabb.y) * depth_pyramid_size_mip_0.y;
let depth_level = max(0, i32(ceil(log2(max(width, height))))); // TODO: Naga doesn't like this being a u32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the * 0.5 after the max would save 1 multiply.

fn get_meshlet_occlusion(cluster_id: u32) -> bool {
let packed_occlusion = meshlet_occlusion[cluster_id / 32u];
fn meshlet_is_second_pass_candidate(cluster_id: u32) -> bool {
// TODO: Does this read need to be an atomicLoad?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not if all the writing happens in a previous compute dispatch. wgpu should have automatically inserted the proper barriers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh whoops, this is left over from the previous version of this PR. I can remove it.

);
cull_pass(
"meshlet_culling_first_pass",
"culling_first",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this name? We'll have non-meshlet GPU culling too, and so it'd be nice to see the meshlet code clearly separated out in RenderDoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is separated out. It's already nested under a larger meshlet debug span. The repeated use of meshlet for individual passes made it harder to read the NSight profiles.

let instance_uniform = meshlet_instance_uniforms[instance_id];
let meshlet_id = meshlet_thread_meshlet_ids[cluster_id];
let model = affine3_to_square(instance_uniform.model);
let model_scale = max(length(model[0]), max(length(model[1]), length(model[2])));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let model_scale = max(length(model[0]), max(length(model[1]), length(model[2])));
let model_scale = sqrt(max(dot(model[0], model[0]), max(dot(model[1], model[1]), dot(model[2], model[2]))));

Saves 2 sqrt instructions.

JMS55 and others added 2 commits April 27, 2024 18:43
Co-authored-by: Robert Swain <robert.swain@gmail.com>
@JMS55 JMS55 requested review from IceSentry and robtfm April 28, 2024 02:51
@superdump superdump added this pull request to the merge queue Apr 28, 2024
Merged via the queue into bevyengine:main with commit e1a0da0 Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Rendering Drawing game state to the screen C-Feature A new feature, making something new possible

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants