Vulkan introduces the concept of sub-passes to subdivide a single render pass into separate logical phases. The benefit of using sub-passes over multiple render passes is that the GPU can perform various optimizations. Tile-based renderers, for example, can keep intermediate data in tile memory, which, being on-chip, is significantly faster than external memory, potentially saving a considerable amount of bandwidth.
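The Render Subpasses sample implements a deferred renderer, which splits rendering into two passes: a geometry pass, which generates the G-buffer, and a lighting pass, which reads it to perform the lighting calculations.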
The G-buffer layout used by the sample stays below the limit of 128 bits per pixel of tile buffer color storage (more about that in the next section):
- Lighting (RGBA8_SRGB), as attachment #0 will take advantage of transaction elimination.
- Depth (D32_SFLOAT), which does not count towards the 128-bit limit.
- Albedo (RGBA8_UNORM)
- Normal (RGB10A2_UNORM)
Because the normal buffer uses the format RGB10A2_UNORM, normal values, which lie within [-1,1], need to be transformed into [0,1]. The formula used in the shader is:
out_normal = 0.5 * in_normal + 0.5;
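The inverse mapping, applied when the normal is read back, is 2.0 * stored - 1.0. The C++ sketch below mirrors that arithmetic on the CPU and adds the 10-bit quantization implied by an RGB10A2_UNORM channel, to give a feel for the round-trip error. All function names are illustrative and are not taken from the sample.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map one normal component from [-1, 1] to [0, 1], as in the shader formula.
float encode_normal(float n) {
    assert(n >= -1.0f && n <= 1.0f);
    return 0.5f * n + 0.5f;
}

// Inverse mapping, from [0, 1] back to [-1, 1].
float decode_normal(float stored) {
    return 2.0f * stored - 1.0f;
}

// Quantize a [0, 1] value to a 10-bit UNORM channel (0..1023).
std::uint32_t to_unorm10(float v) {
    return static_cast<std::uint32_t>(std::lround(v * 1023.0f));
}

// Convert a 10-bit UNORM channel back to a float in [0, 1].
float from_unorm10(std::uint32_t q) {
    return static_cast<float>(q) / 1023.0f;
}

// Full round trip of one component through a 10-bit channel.
float roundtrip_normal(float n) {
    return decode_normal(from_unorm10(to_unorm10(encode_normal(n))));
}
```

Since the quantization step in [0,1] is 1/1023 and decoding doubles it, a component should come back within roughly 1/1023 of its original value.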
Position can be reconstructed in the lighting pass from the depth attachment, using the following technique [1]:
mat4 inv_view_proj = inverse(projection * view);
vec2 inv_resolution = vec2(1.0f / width, 1.0f / height);
// Load depth from tile buffer and reconstruct world position
vec4 clip = vec4(gl_FragCoord.xy * inv_resolution * 2.0 - 1.0,
subpassLoad(depth).x, 1.0);
vec4 world_w = inv_view_proj * clip;
vec3 world = world_w.xyz / world_w.w;
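To convince yourself that the reconstruction above is exact, the same steps can be replayed on the CPU. The following C++ sketch projects a world-space point to window coordinates and depth, then applies the shader's steps to recover it. It assumes an identity view matrix and a hypothetical perspective matrix with depth mapped to [0,1] and no y-flip; the small matrix helpers are written for this example only and are not part of the sample.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<Vec4, 4>;  // row-major: m[row][col]

Vec4 mul(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) r[i] += m[i][j] * v[j];
    return r;
}

// Gauss-Jordan inverse with partial pivoting; assumes m is invertible.
Mat4 inverse(Mat4 m) {
    Mat4 inv{};
    for (int i = 0; i < 4; ++i) inv[i][i] = 1.0;
    for (int c = 0; c < 4; ++c) {
        int p = c;
        for (int r = c + 1; r < 4; ++r)
            if (std::fabs(m[r][c]) > std::fabs(m[p][c])) p = r;
        std::swap(m[c], m[p]);
        std::swap(inv[c], inv[p]);
        const double d = m[c][c];
        assert(d != 0.0);
        for (int j = 0; j < 4; ++j) { m[c][j] /= d; inv[c][j] /= d; }
        for (int r = 0; r < 4; ++r) {
            if (r == c) continue;
            const double f = m[r][c];
            for (int j = 0; j < 4; ++j) {
                m[r][j] -= f * m[c][j];
                inv[r][j] -= f * inv[c][j];
            }
        }
    }
    return inv;
}

// Hypothetical projection: camera looks down -z, depth range [0, 1].
Mat4 perspective(double fovy, double aspect, double z_near, double z_far) {
    const double f = 1.0 / std::tan(fovy / 2.0);
    Mat4 m{};
    m[0][0] = f / aspect;
    m[1][1] = f;
    m[2][2] = z_far / (z_near - z_far);
    m[2][3] = z_near * z_far / (z_near - z_far);
    m[3][2] = -1.0;
    return m;
}

struct Fragment { double x, y, depth; };

// Forward path: world position -> window coordinates and depth.
Fragment project(const Mat4& view_proj, const Vec4& world, double width, double height) {
    const Vec4 clip = mul(view_proj, world);
    return {(clip[0] / clip[3] * 0.5 + 0.5) * width,
            (clip[1] / clip[3] * 0.5 + 0.5) * height,
            clip[2] / clip[3]};
}

// Same steps as the shader snippet: window coordinates + depth -> world position.
Vec4 reconstruct(const Mat4& inv_view_proj, const Fragment& f, double width, double height) {
    const Vec4 clip{f.x / width * 2.0 - 1.0, f.y / height * 2.0 - 1.0, f.depth, 1.0};
    const Vec4 world_w = mul(inv_view_proj, clip);
    return {world_w[0] / world_w[3], world_w[1] / world_w[3], world_w[2] / world_w[3], 1.0};
}
```

Calling project on a point and passing the result to reconstruct, together with the inverse of the same matrix, should return the original point up to floating-point error.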
In order to highlight the benefit of sub-passes over multiple render passes, the sample allows the user to switch between two different techniques at run-time:
The first technique uses two render passes, running one after another. The first generates the G-buffer and the second uses it in the lighting stage. The following picture shows some numbers collected by HWCPipe when using two render passes, with a high number of physical tiles (PTILES) used and a considerable amount of bandwidth (external reads/writes).
The second technique uses a single render pass with two sub-passes. The first sub-pass generates the G-buffer, possibly keeping it on tile memory, and the second performs lighting calculations.
The first thing that you may notice from the Streamline screenshot below is the difference in bandwidth between the two techniques:
- From 0s to 3.6s, the benefit of the sub-passes technique is clear, as it is able to keep the G-buffer in tile memory.
- From 3.7s, the two render passes technique is shown, which writes lots of data back to external memory, as the first render pass needs to store the G-buffer so that the second render pass can read it.
As stated by the Vulkan reference, subpasses with simple framebuffer-space dependencies may be merged into a single tile rendering pass, keeping the attachment data on-chip for the duration of a render pass [2].
Since sub-pass information is known ahead of time, the driver is able to detect whether two or more subpasses can be merged together. When they are merged, vkCmdNextSubpass becomes a NOP.
In other words, a GPU driver can optimize even further by merging two or more subpasses together, as long as certain requirements are met:
- The number of VkAttachments used for input and color attachments in all considered subpasses is <= 8. Note that depth/stencil does not count towards this limit.
Furthermore, in order to be merged, sub-passes are required to use at most 128 bits per pixel of tile buffer color storage, although some of the more recent GPUs, such as Mali-G72, increase this to up to 256 bits per pixel. G-buffers requiring more color storage than this can be used at the expense of needing smaller tiles during fragment shading, which can reduce overall throughput and increase the bandwidth spent reading tile lists.
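The two limits quoted above can be checked on paper before profiling. The C++ sketch below sums the color bits per pixel of a list of attachments and counts them, skipping depth/stencil. The struct, the function names, and the attachment labels are assumptions made for illustration; the real merge decision is taken by the driver and may depend on further conditions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one attachment, for illustration only.
struct AttachmentInfo {
    const char*   name;
    std::uint32_t bits_per_pixel;
    bool          is_depth_stencil;
};

struct BudgetReport {
    std::uint32_t color_bits;         // total color storage per pixel
    std::uint32_t color_attachments;  // attachments counted towards the limit
};

// Accumulate color storage and attachment count, ignoring depth/stencil.
BudgetReport summarize(const std::vector<AttachmentInfo>& attachments) {
    BudgetReport report{0, 0};
    for (const AttachmentInfo& a : attachments) {
        if (a.is_depth_stencil) continue;
        report.color_bits += a.bits_per_pixel;
        ++report.color_attachments;
    }
    return report;
}

// True if the attachments respect both limits described in the text.
bool fits_merge_budget(const std::vector<AttachmentInfo>& attachments,
                       std::uint32_t color_bit_budget = 128,
                       std::uint32_t max_color_attachments = 8) {
    assert(color_bit_budget > 0);
    const BudgetReport report = summarize(attachments);
    return report.color_bits <= color_bit_budget &&
           report.color_attachments <= max_color_attachments;
}

// The sample's G-buffer formats as listed earlier in the text.
std::vector<AttachmentInfo> sample_gbuffer() {
    return {
        {"lighting RGBA8_SRGB", 32, false},
        {"depth D32_SFLOAT", 32, true},
        {"albedo RGBA8_UNORM", 32, false},
        {"normal RGB10A2_UNORM", 32, false},
    };
}
```

With the sample's formats the color total is 96 bits per pixel, which leaves room under the 128-bit budget; switching albedo and normal to 64-bit formats would raise it to 160 bits and exceed that budget.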
From the sample's perspective, the best way to verify whether two subpasses are merged is to compare the physical tiles (PTILES) counter while switching between the sub-passes and the render passes techniques. Two fused sub-passes need half the number of PTILES needed by two render passes. Indeed, comparing the following screenshot with the previous one, roughly half the number of tiles are used (0.5 vs 1.1 million per second) and about 70% of bandwidth is saved.
If the format (VkFormat) of these images is changed to one requiring more bits per pixel, the G-buffer will most likely no longer fit the budget, preventing the driver from merging the sub-passes. The following picture shows how the number of physical tiles used goes back to over one million per second, meaning that the sub-passes are not merged.
To understand what these numbers mean, consider that, on the device where the screenshots were taken, the resolution is 2220x1080 and a tile is 16x16 pixels. Every frame needs (2220 * 1080) / (16 * 16) = ~9.4k tiles. Since the sample runs at 60 frames per second, we end up with ~9.4k * 60 = ~0.56M tiles per second for one render pass. Of course, two render passes will need twice this amount, about 1.1M, which matches the counter observed with the render passes technique.
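The estimate above can be wrapped in a few helper functions for other resolutions and tile sizes. This C++ sketch rounds partial tiles up along each axis, which is an assumption about how edge tiles are counted and gives a slightly higher figure than the plain area division used in the text. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Tiles needed to cover one dimension, rounding up for a partial tile.
std::uint64_t tiles_along(std::uint64_t pixels, std::uint64_t tile_size) {
    assert(tile_size > 0);
    return (pixels + tile_size - 1) / tile_size;
}

// Tiles needed to cover one full frame.
std::uint64_t tiles_per_frame(std::uint64_t width, std::uint64_t height,
                              std::uint64_t tile_size) {
    return tiles_along(width, tile_size) * tiles_along(height, tile_size);
}

// Physical tiles processed per second for a given number of
// (non-merged) render passes.
std::uint64_t tiles_per_second(std::uint64_t width, std::uint64_t height,
                               std::uint64_t tile_size, std::uint64_t fps,
                               std::uint64_t passes) {
    return tiles_per_frame(width, height, tile_size) * fps * passes;
}
```

For 2220x1080 with 16x16 tiles this gives 139 * 68 = 9452 tiles per frame, so about 0.57M tiles per second for one pass at 60 frames per second and about 1.13M for two passes.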
Some framebuffer attachments, like depth, albedo, and normal in the sample, are cleared at the beginning of the render pass, written by the geometry subpass, read by the lighting subpass, and discarded at the end of the render pass. If the GPU has enough tile memory available to hold them, there is no need to write them back to external memory. In fact, there is no need to allocate them at all.
In practice, their image usage needs to be specified as TRANSIENT (VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT) and their memory needs to be LAZILY_ALLOCATED (VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT). Failing to set these flags properly will lead to an increase in fragment jobs, as the GPU will need to write the attachments back to external memory.
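Choosing the memory type for such an attachment usually comes down to a small search over the device's memory types. The C++ sketch below shows one possible selection policy: prefer a lazily allocated type and fall back to a device-local one if none is available. To stay compilable without the Vulkan SDK it uses stand-in constants and a plain vector of flags; in real code the flags would come from VkPhysicalDeviceMemoryProperties::memoryTypes[i].propertyFlags and the mask from VkMemoryRequirements::memoryTypeBits. The function names and the fallback policy are assumptions, not the sample's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins so the sketch compiles without the Vulkan headers. The values
// mirror VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT and
// VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT.
constexpr std::uint32_t kDeviceLocalBit     = 0x00000001;
constexpr std::uint32_t kLazilyAllocatedBit = 0x00000010;

// Returns the index of the first memory type that is allowed by `type_bits`
// and has all `required` property flags, or -1 if none matches.
int find_memory_type(const std::vector<std::uint32_t>& type_flags,
                     std::uint32_t type_bits, std::uint32_t required) {
    assert(type_flags.size() <= 32);
    for (std::size_t i = 0; i < type_flags.size(); ++i) {
        const bool allowed = (type_bits & (1u << i)) != 0;
        if (allowed && (type_flags[i] & required) == required) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Prefer lazily allocated memory for a transient attachment; if the device
// offers none for this image, fall back to device-local memory.
int pick_transient_memory_type(const std::vector<std::uint32_t>& type_flags,
                               std::uint32_t type_bits) {
    int index = find_memory_type(type_flags, type_bits, kLazilyAllocatedBit);
    if (index < 0) {
        index = find_memory_type(type_flags, type_bits, kDeviceLocalBit);
    }
    return index;
}
```

The fallback keeps the application working on devices that expose no lazily allocated memory type, at the cost of actually allocating the attachment.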
Do
- Use the DEPTH_STENCIL_READ_ONLY image layout for depth after the geometry pass is done.
- Use LAZILY_ALLOCATED memory to back images for every attachment except for the light buffer, which is the only texture written out to memory.
- Use LOAD_OP_CLEAR or LOAD_OP_DONT_CARE for attachment loads and STORE_OP_DONT_CARE for transient stores.
Don’t
Impact
Debugging
DEPTH_STENCIL_READ_ONLY correctly.