Render Subpasses

Overview

Vulkan introduces the concept of subpasses to subdivide a single render pass into separate logical phases. The benefit of using subpasses over multiple render passes is that the GPU can perform additional optimizations. Tile-based renderers, for example, can take advantage of tile memory, which, being on-chip, is significantly faster than external memory, potentially saving a considerable amount of bandwidth.

Deferred rendering

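The Render Subpasses sample implements a deferred renderer, which splits rendering into two passes: a geometry pass, which generates the G-buffer, and a lighting pass, which reads the G-buffer to compute lighting.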

The G-buffer layout used by the sample stays within the limit of 128 bits per pixel of tile buffer color storage (see the Merging section below).

Because the normal buffer uses the RGB10A2_UNORM format, normal components, which lie in [-1, 1], must be remapped to [0, 1]. The shader uses the following formula:

out_normal = 0.5 * in_normal + 0.5;
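As a rough CPU-side illustration of the remapping above, the following C++ sketch encodes a normal component, quantizes it to a 10-bit UNORM channel, and decodes it again. The decode step is simply the algebraic inverse of the shader formula; the function names and the round-to-nearest quantization are assumptions made for this example, not code taken from the sample.

```cpp
#include <cmath>
#include <cstdint>

// Map a normal component from [-1, 1] to [0, 1] (same formula as the shader).
constexpr float encode_normal(float n) { return 0.5f * n + 0.5f; }

// Algebraic inverse of encode_normal: map [0, 1] back to [-1, 1].
constexpr float decode_normal(float e) { return 2.0f * e - 1.0f; }

// Quantize a [0, 1] value to a 10-bit UNORM channel (0..1023).
// Round-to-nearest is assumed here for illustration.
inline std::uint32_t to_unorm10(float v)
{
    const float clamped = std::fmin(std::fmax(v, 0.0f), 1.0f);
    return static_cast<std::uint32_t>(std::lround(clamped * 1023.0f));
}

// Convert a 10-bit UNORM channel back to a float in [0, 1].
inline float from_unorm10(std::uint32_t q)
{
    return static_cast<float>(q) / 1023.0f;
}

// Full round trip: encode, store in 10 bits, load, decode.
inline float roundtrip_normal(float n)
{
    return decode_normal(from_unorm10(to_unorm10(encode_normal(n))));
}
```

With 10 bits per channel, the round trip should reproduce each component to within roughly 1/1023, and the endpoints -1 and 1 are preserved exactly.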

Position can be reconstructed in the lighting pass from the depth attachment, using the following technique [1]:

mat4 inv_view_proj  = inverse(projection * view);
vec2 inv_resolution = vec2(1.0f / width, 1.0f / height);

// Load depth from tile buffer and reconstruct world position
vec4 clip    = vec4(gl_FragCoord.xy * inv_resolution * 2.0 - 1.0,
                    subpassLoad(depth).x, 1.0);
vec4 world_w = inv_view_proj * clip;
vec3 world   = world_w.xyz / world_w.w;

To highlight the benefit of subpasses over multiple render passes, the sample lets the user switch between two techniques at run time:

The first technique uses two render passes, running one after another. The first generates the G-buffer and the second uses it in the lighting stage. The following picture shows counters collected by HWCPipe when using two render passes: a high number of physical tiles (PTILES) and a considerable amount of bandwidth (external reads/writes).

[Figure: Render passes]

The second technique uses a single render pass with two subpasses. The first subpass generates the G-buffer, possibly keeping it in tile memory, and the second performs the lighting calculations.

The first thing to notice in the Streamline screenshot below is the difference in bandwidth between the two techniques:

[Figure: Sub-passes vs render-passes trace]

Merging

As stated by the Vulkan reference, "Subpasses with simple framebuffer-space dependencies may be merged into a single tile rendering pass, keeping the attachment data on-chip for the duration of a renderpass." [2]

Since subpass information is known ahead of time, the driver is able to detect whether two or more subpasses can be merged together. When they are merged, vkCmdNextSubpass becomes a no-op.

In other words, a GPU driver can optimize further by merging two or more subpasses together, provided that certain requirements are met.

Furthermore, in order to be merged, subpasses are required to use at most 128 bits per pixel of tile buffer color storage, although some of the more recent GPUs, such as Mali-G72, increase this to up to 256 bits per pixel. G-buffers requiring more color storage than this can still be used, at the expense of needing smaller tiles during fragment shading, which can reduce overall throughput and increase the bandwidth spent reading tile lists.
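The budget check described above can be sketched as a simple sum over the color attachments. The layouts below are hypothetical and exist only to illustrate the arithmetic; only the 32 bits of the RGB10A2 normal buffer and the 128-bit and 256-bit limits come from the text. The sketch also assumes that the depth attachment does not count toward color storage.

```cpp
#include <cstdint>
#include <vector>

// One color attachment and the size of its format in bits per pixel.
struct ColorAttachment
{
    const char*   name;
    std::uint32_t bits_per_pixel;
};

// Sum the per-pixel color storage needed by all color attachments.
inline std::uint32_t color_bits_per_pixel(const std::vector<ColorAttachment>& attachments)
{
    std::uint32_t total = 0;
    for (const ColorAttachment& a : attachments)
    {
        total += a.bits_per_pixel;
    }
    return total;
}

// True if the layout fits the tile buffer color budget (128 bits by default).
inline bool fits_tile_budget(const std::vector<ColorAttachment>& attachments,
                             std::uint32_t budget_bits = 128)
{
    return color_bits_per_pixel(attachments) <= budget_bits;
}

// Hypothetical compact layout: three 32-bit color attachments (96 bits).
inline std::vector<ColorAttachment> compact_layout()
{
    return {{"lighting", 32}, {"albedo", 32}, {"normal (RGB10A2)", 32}};
}

// Hypothetical wide layout: two 64-bit attachments plus one 32-bit (160 bits).
inline std::vector<ColorAttachment> wide_layout()
{
    return {{"lighting", 64}, {"albedo", 64}, {"normal (RGB10A2)", 32}};
}
```

Under these assumptions, the compact layout fits the 128-bit budget, while the wide layout would only fit on a GPU offering 256 bits per pixel.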

From the sample's perspective, the best way to verify whether two subpasses are merged is to compare the physical tiles (PTILES) counter while switching between the subpasses technique and the render passes technique. Two merged subpasses need half the number of PTILES needed by two render passes. Indeed, comparing the following screenshot with the previous one, roughly half the number of tiles are used (0.5 vs 1.1 million per second) and about 70% of bandwidth is saved.

[Figure: Good practice]

If the VkFormat of the G-buffer images is changed to formats requiring more bits, the G-buffer will most likely no longer fit the budget, preventing the driver from merging the subpasses. The following picture shows how the number of physical tiles used goes back to over one million per second, meaning that the subpasses are not merged.

To understand what these numbers mean, consider that, on the device where the screenshots were taken, the resolution is 2220x1080 and a tile is 16x16 pixels. Every frame needs (2220 * 1080) / (16 * 16) = ~9k tiles. Since the sample runs at 60 frames per second, we end up with ~9k * 60 = ~0.5M tiles per second for one render pass. Of course, two render passes need twice this amount.
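The estimate above can be reproduced with a few lines of C++. This is only a back-of-the-envelope sketch: the function names are made up for the example, and it assumes that partially covered tiles at the screen edges count as whole tiles, which gives a slightly higher figure than the plain division in the text.

```cpp
#include <cstdint>

// Integer division rounding up, so partially covered tiles are counted.
constexpr std::uint64_t ceil_div(std::uint64_t a, std::uint64_t b)
{
    return (a + b - 1) / b;
}

// Number of square tiles of side `tile` needed to cover a width x height frame.
constexpr std::uint64_t tiles_per_frame(std::uint64_t width, std::uint64_t height,
                                        std::uint64_t tile)
{
    return ceil_div(width, tile) * ceil_div(height, tile);
}

// Tiles processed per second for a given frame rate and number of
// unmerged passes (each unmerged pass touches every tile once).
constexpr std::uint64_t tiles_per_second(std::uint64_t width, std::uint64_t height,
                                         std::uint64_t tile, std::uint64_t fps,
                                         std::uint64_t passes)
{
    return tiles_per_frame(width, height, tile) * fps * passes;
}

// 2220x1080 with 16x16 tiles: 139 * 68 = 9452 tiles per frame.
static_assert(tiles_per_frame(2220, 1080, 16) == 9452, "tiles per frame");
```

At 60 frames per second this gives about 0.57 million tiles per second for one pass and about 1.13 million for two unmerged passes, in line with the 0.5 and 1.1 readings quoted from the screenshots.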

[Figure: G-buffer size]

Transient attachments

Some framebuffer attachments, like depth, albedo, and normal in the sample, are cleared at the beginning of the render pass, written by the geometry subpass, read by the lighting subpass, and discarded at the end of the render pass. If the GPU has enough tile memory to hold them, there is no need to write them back to external memory. In fact, there is no need to allocate them at all.

In practice, their image usage needs to include VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT and their memory needs to be allocated from a memory type with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT. Failing to set these flags properly will lead to an increase in fragment jobs, as the GPU will need to write the attachments back to external memory.
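The flag selection described above might look roughly like the following sketch. To keep it compilable without the Vulkan SDK, it uses local stand-in constants that mirror the corresponding values in vulkan.h, and a plain vector of property flags in place of VkPhysicalDeviceMemoryProperties. The helper names and the fallback to device-local memory when no lazily allocated type exists are assumptions made for this example, not the sample's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins mirroring the values of the VK_IMAGE_USAGE_* and
// VK_MEMORY_PROPERTY_* flag bits in vulkan.h (real code would include that header).
constexpr std::uint32_t IMAGE_USAGE_COLOR_ATTACHMENT_BIT     = 0x00000010;
constexpr std::uint32_t IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT = 0x00000040;
constexpr std::uint32_t IMAGE_USAGE_INPUT_ATTACHMENT_BIT     = 0x00000080;
constexpr std::uint32_t MEMORY_PROPERTY_DEVICE_LOCAL_BIT     = 0x00000001;
constexpr std::uint32_t MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT = 0x00000010;

// Usage for a G-buffer color attachment that is written by one subpass,
// read as an input attachment by the next, and never leaves the render pass.
constexpr std::uint32_t transient_color_usage()
{
    return IMAGE_USAGE_COLOR_ATTACHMENT_BIT | IMAGE_USAGE_INPUT_ATTACHMENT_BIT |
           IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT;
}

// Return the index of the first memory type that is allowed by `type_bits`
// (as in VkMemoryRequirements::memoryTypeBits) and has all `wanted` flags,
// or -1 if none matches.
inline int find_memory_type(const std::vector<std::uint32_t>& type_flags,
                            std::uint32_t type_bits, std::uint32_t wanted)
{
    for (std::size_t i = 0; i < type_flags.size() && i < 32; ++i)
    {
        const bool allowed = (type_bits & (1u << i)) != 0;
        if (allowed && (type_flags[i] & wanted) == wanted)
        {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Prefer lazily allocated memory; fall back to device-local memory
// on implementations that expose no lazily allocated type (assumed policy).
inline int pick_transient_memory_type(const std::vector<std::uint32_t>& type_flags,
                                      std::uint32_t type_bits)
{
    const int lazy = find_memory_type(type_flags, type_bits,
                                      MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT);
    return lazy >= 0 ? lazy
                     : find_memory_type(type_flags, type_bits,
                                        MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
}
```

In a real application, the chosen index would be passed as memoryTypeIndex in VkMemoryAllocateInfo, and the usage flags would go into VkImageCreateInfo::usage.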

[Figure: Non-transient attachments]

Further reading

References

  1. Getting World Position from Depth Buffer Value - stackoverflow.com
  2. Render Pass - vulkan.lunarg.com

Best-practice summary

Do

Don’t

Impact

Debugging