Arm Mali G1
Introduction
This guide explains the performance counters for the Mali G1, which is a member of the Valhall second generation architecture family.
This introduction explains the high-level goals to consider when profiling this GPU. Later sections explain the available counters for each part of the GPU design.
Profiling GPU scheduling
The GPU runs workloads that have been submitted by the graphics driver, using scheduling barriers between workloads to ensure they run in the correct order. Workloads are scheduled to run by adding them to the appropriate hardware queue, which will run enqueued workloads in a pipelined FIFO processing order.
Tile-based rendering
Arm GPUs are tile-based GPUs, meaning that they process graphics render passes in two distinct phases. The first phase processes geometry to determine which primitives contribute to which screen-space tiles. The second phase renders the output framebuffer tile-by-tile.
In this design, tiles are small enough to be kept in on-chip tile memory, which makes fragment processing more efficient. This generation of GPUs introduces deferred vertex shading, which means that the first phase only computes the primitive binning metadata. Full vertex shading is deferred to the second main phase. This generation of GPUs is far more bandwidth efficient than earlier Arm GPUs for geometry heavy scenes.
GPU queues
The GPU front-end in this generation of hardware has three hardware queues:
- Compute queue
- Binning phase queue
- Main phase queue
The Compute queue is used for all compute-like workloads, including compute shaders, buffer transfers, geometry shaders, and tessellation shaders. The Binning phase queue is used for computing vertex positions and binning. The Main phase queue is used for the main render pass processing, including deferred vertex shading and fragment shading, and most image transfers.
Monitoring your application's queue usage is the first stage of profiling an Arm GPU, as the queue costs give the overall processing cost of each type of workload. In addition, you can see whether your application is using barriers efficiently, allowing the queues to run their workloads in parallel.
Profiling GPU memory bandwidth
GPUs are data-plane processors, so memory access efficiency is an important factor for overall performance.
Memory system performance outside of the GPU cannot be directly observed via GPU performance counters, but the counters can show the performance observed by the GPU on its memory interface.
Reducing bandwidth
Accessing external DRAM is a very energy-intensive operation, which makes reducing external bandwidth an important optimization goal for mobile devices. Sustained high bandwidth can cause poor performance in mainstream devices, and thermal issues in high-end devices.
Shader core performance counters can give you more breakdown about which functional units are generating memory traffic, guiding your optimization efforts.
Reducing stalls
The memory system outside of the GPU is implemented by the chip manufacturer, and designs can vary and have different performance characteristics. Workloads that generate a significant number of memory stall cycles, or that see a large percentage of high latency reads, might be stressing the external memory system beyond its capabilities. Reducing memory bandwidth often gives measurable performance gains in these scenarios.
Profiling shader core usage
If the GPU queues are scheduling work well, the next step in determining the processing bottleneck of a workload is to profile your application's use of the shader core.
The Mali G1 shader cores use a massively multi-threaded architecture, supporting thousands of concurrently running threads. A large pool of available threads allows the hardware to fill parallel functional units by switching to any of the available threads if the current thread becomes blocked for any reason.
In this type of architecture, the utilization of the functional units reflects the overall demand of the running shader programs. This is relatively independent of localized hot-spots in shaders that stress a single functional unit, because other threads will be running other parts of the program and will load-balance the hardware. This is quite different to profiling a CPU, where the serial instruction stream means that performance can be very sensitive to both latency and localized hot-spots.
Improve speed-of-light utilization
For functional unit profiling, we therefore aim for at least 75% utilization of the most heavily used functional unit, relative to its best case 'speed-of-light' performance. This shows that the application has done a good job getting its workload running without problematic stalls.
In this situation, reducing demand on the most heavily used functional unit, either by improving efficiency or by reducing the workload size, should improve application performance.
Reduce shader core stalls
If no functional unit is heavily utilized, the shader core is running out of work to do. This can occur for multiple reasons, and should be avoided if possible.
The first reason is that the workload is literally running out of threads to run, leaving the shader core with low thread occupancy. GPUs rely on workloads having many threads to fill the capacity of the shader core. Avoid running small workloads with few threads on the GPU, preferring the CPU where possible. Note that some workloads, such as depth shadow maps, may not generate many fragment threads due to their algorithmic design. This is usually unavoidable, but is something to remember when profiling.
The second reason is that the running shader programs are causing operations to stall by missing in descriptor caches or data caches. GPUs use their high thread count to hide the impact and latency of cache misses, but there are limits to the density of misses that can be hidden. In this situation, try to identify which workload is causing the stalls and minimize them. There are no specific performance counters for every stall reason, so it can take some investigation and experimentation to determine which resource is causing the problem.
Profiling workload
In addition to profiling use of the hardware, measuring cycles and bytes, Arm GPUs provide many performance counters that can help you to understand the size and characteristics of your workload. These counters give feedback in the context of API constructs, such as vertices, triangles, and pixels, making the feedback easier for developers to understand.
Supplementing the workload size counters, Arm GPUs also provide counters that indicate areas where content is not following best practice guidelines. Improving these best practice metrics will nearly always improve your application's performance or energy efficiency.
GPU Front-end
The GPU front-end is the interface between the GPU hardware and the driver. The front-end schedules command streams submitted by the driver on to multiple hardware work queues. Each work queue handles a specific type of workload and is responsible for breaking a workload into smaller tasks that can be dispatched to the shader cores. Work stays at the head of the queue while being processed, so queue activity is a direct way of measuring that the GPU is busy handling a workload.
In this generation of hardware there are three work queues:
- Compute queue for compute shaders and advanced geometry shaders.
- Binning phase queue for the first phase of a render pass, handling vertex position calculation, and primitive culling and binning.
- Main phase queue for the second phase of a render pass, handling any deferred vertex shading and fragment shading.
It is beneficial to schedule work on multiple queues in parallel, as this can more evenly load balance the hardware. In this generation of hardware the Compute and Binning phase queues can run in parallel to the Main phase queue, but serially with respect to each other. Parallel processing will increase the latency of individual tasks, but usually significantly improves overall throughput.
Performance counters in this section can show activity on each of the queues, which indicates the complexity and scheduling patterns of submitted workloads.
GPU Cycles
This counter group shows the workload processing activity level of the GPU, showing the overall use and when work was running for each of the hardware scheduling queues.
GPU active
This counter increments every clock cycle when the GPU has any pending workload present in one of its processing queues. It shows the overall GPU processing load requested by the application.
This counter increments when any workload is present in any processing queue, even if the GPU is stalled waiting for external memory. These cycles are counted as active time even though no progress is being made.
MaliGPUActiveCy
$MaliGPUCyclesGPUActive
GPU_ACTIVE
Any queue active
This counter increments every clock cycle when any GPU command queue is active with work for the tiler or shader cores.
MaliGPUAnyQueueActiveCy
$MaliGPUCyclesAnyQueueActive
GPU_ITER_ACTIVE
Compute queue active
This expression increments every clock cycle when the command stream compute queue has at least one task issued for processing.
MaliCompQueueActiveCy
libGPUCounters derivation:
MaliCompQueuedCy - MaliCompQueueAssignStallCy
Streamline derivation:
$MaliGPUQueuedCyclesComputeQueued - $MaliGPUWaitCyclesComputeQueueEndpointStalls
Hardware derivation:
ITER_COMP_ACTIVE - ITER_COMP_READY_BLOCKED
Binning phase queue active
This expression increments every clock cycle when the command stream binning phase queue has at least one task issued for processing. The binning phase includes vertex position shading and primitive binning.
MaliBinningQueueActiveCy
libGPUCounters derivation:
MaliBinningQueuedCy - MaliBinningQueueAssignStallCy
Streamline derivation:
$MaliGPUQueuedCyclesBinningPhaseQueued - $MaliGPUWaitCyclesBinningPhaseQueueEndpointStalls
Hardware derivation:
ITER_TILER_ACTIVE - ITER_TILER_READY_BLOCKED
Main phase queue active
This expression increments every clock cycle when the command stream main phase queue has at least one task issued for processing. The main phase includes any deferred vertex processing and all fragment shading.
MaliMainQueueActiveCy
libGPUCounters derivation:
MaliMainQueuedCy - MaliMainQueueAssignStallCy
Streamline derivation:
$MaliGPUQueuedCyclesMainPhaseQueued - $MaliGPUWaitCyclesMainPhaseQueueEndpointStalls
Hardware derivation:
ITER_FRAG_ACTIVE - ITER_FRAG_READY_BLOCKED
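As an illustration, the three queue-active expressions above can be evaluated from raw hardware counter samples as in the following Python sketch. The sample values and the helper function are assumptions for the example, not part of any Arm tool API.

# Hypothetical raw counter values captured over one sample period.
sample = {
    "ITER_COMP_ACTIVE": 120_000,  "ITER_COMP_READY_BLOCKED": 5_000,
    "ITER_TILER_ACTIVE": 90_000,  "ITER_TILER_READY_BLOCKED": 2_000,
    "ITER_FRAG_ACTIVE": 400_000,  "ITER_FRAG_READY_BLOCKED": 1_000,
}

def queue_active_cycles(counters, queued, blocked):
    # Active cycles are queued cycles minus cycles stalled waiting for an endpoint.
    return counters[queued] - counters[blocked]

compute_active = queue_active_cycles(sample, "ITER_COMP_ACTIVE", "ITER_COMP_READY_BLOCKED")
binning_active = queue_active_cycles(sample, "ITER_TILER_ACTIVE", "ITER_TILER_READY_BLOCKED")
main_active = queue_active_cycles(sample, "ITER_FRAG_ACTIVE", "ITER_FRAG_READY_BLOCKED")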
Tiler active
This counter increments every clock cycle the tiler has a workload in its processing queue. The tiler is responsible for coordinating geometry processing and providing the fixed-function tiling needed for the Mali tile-based rendering pipeline. It can run in parallel to vertex shading and fragment shading.
A high cycle count here does not necessarily imply a bottleneck, unless the Compute or binning phase active cycles counter in the shader core is comparatively low.
MaliTilerActiveCy
$MaliGPUCyclesTilerActive
TILER_ACTIVE
GPU interrupt active
This counter increments every clock cycle when the GPU has an interrupt pending and is waiting for the CPU to process it.
Cycles with a pending interrupt do not necessarily indicate lost performance because the GPU can process other queued work in parallel. However, if GPU interrupt pending cycles are a high percentage of GPU active cycles, an underlying problem might be preventing the CPU from efficiently handling interrupts. This problem is normally a system integration issue, which an application developer cannot work around.
MaliGPUIRQActiveCy
$MaliGPUCyclesGPUInterruptActive
GPU_IRQ_ACTIVE
GPU Queued Cycles
This counter group shows the workload scheduling behavior of the GPU queues, showing when queues contained work, including cycles where a queue was stalled and could not start an enqueued workload.
Compute queued
This counter increments every clock cycle when the command stream compute queue has work queued. The count includes cycles when the queue is stalled because of endpoint contention.
MaliCompQueuedCy
$MaliGPUQueuedCyclesComputeQueued
ITER_COMP_ACTIVE
Binning phase queued
This counter increments every clock cycle when the command stream binning phase queue has work queued. The binning phase includes vertex position shading, culling, and primitive binning, and includes cycles when the queue is stalled because of endpoint contention.
MaliBinningQueuedCy
$MaliGPUQueuedCyclesBinningPhaseQueued
ITER_TILER_ACTIVE
Main phase queued
This counter increments every clock cycle when the command stream main phase queue has work queued. The main phase includes any deferred vertex processing and all fragment shading, and can include cycles when the queue is stalled because of endpoint contention.
MaliMainQueuedCy
$MaliGPUQueuedCyclesMainPhaseQueued
ITER_FRAG_ACTIVE
GPU Wait Cycles
This counter group shows the workload scheduling behavior of the GPU queues, showing reasons for any scheduling stalls for each queue.
Compute queue endpoint drain stalls
This counter increments every clock cycle when compute work is queued but cannot start because IDVS work is still active on the shared endpoints.
MaliCompQueueDrainStallCy
$MaliGPUWaitCyclesComputeQueueEndpointDrainStalls
ITER_COMP_EP_DRAIN
Compute queue endpoint stalls
This counter increments every clock cycle when compute work is queued but cannot start because no endpoints have been assigned.
MaliCompQueueAssignStallCy
$MaliGPUWaitCyclesComputeQueueEndpointStalls
ITER_COMP_READY_BLOCKED
Binning phase queue endpoint drain stalls
This counter increments every clock cycle when binning phase work is queued but cannot start because compute work is still active on the shared endpoints.
MaliTilerQueueDrainStallCy
$MaliGPUWaitCyclesBinningPhaseQueueEndpointDrainStalls
ITER_TILER_EP_DRAIN
Binning phase queue endpoint stalls
This counter increments every clock cycle when binning phase work is queued but cannot start because no endpoints have been assigned. The binning phase includes vertex position shading and primitive binning.
MaliBinningQueueAssignStallCy
$MaliGPUWaitCyclesBinningPhaseQueueEndpointStalls
ITER_TILER_READY_BLOCKED
Main phase queue endpoint stalls
This counter increments every clock cycle when main phase work is queued but cannot start because no endpoints have been assigned. The main phase includes any deferred vertex processing and all fragment shading.
MaliMainQueueAssignStallCy
$MaliGPUWaitCyclesMainPhaseQueueEndpointStalls
ITER_FRAG_READY_BLOCKED
GPU Jobs
This counter group shows the total number of workload jobs issued to the GPU front-end for each queue. Most jobs will correspond to an API workload, for example a compute dispatch generates a compute job. However, the driver can also generate small house-keeping jobs for each queue, so job counts do not directly correlate with API behavior.
Compute jobs
This counter increments for every job processed by the compute queue.
MaliCompQueueJob
$MaliGPUJobsComputeJobs
ITER_COMP_JOB_COMPLETED
GPU Tasks
This counter group shows the total number of workload tasks issued by the GPU front-end to the processing end-points inside the GPU.
Compute tasks
This counter increments for every compute task processed by the GPU.
MaliCompQueueTask
$MaliGPUTasksComputeTasks
ITER_COMP_TASK_COMPLETED
Binning phase tasks
This counter increments for every binning phase task processed by the GPU.
MaliBinningQueueTask
$MaliGPUTasksBinningPhaseTasks
ITER_TILER_IDVS_TASK_COMPLETED
Main phase tasks
This counter increments for every 64 x 64 pixel region of a render pass that is processed by the GPU. The processed region of a render pass can be smaller than the full size of the attached surfaces if the application's viewport and scissor settings prevent the whole image being rendered.
MaliMainQueueTask
$MaliGPUTasksMainPhaseTasks
ITER_FRAG_TASK_COMPLETED
GPU Utilization
This counter group shows the workload processing activity level of the GPU queues, normalized as a percentage of overall GPU activity.
Compute queue utilization
This expression defines the compute queue utilization compared against the GPU active cycles.
For GPU bound content, it is expected that the GPU queues process work in parallel. The dominant queue must be close to 100% utilized to get the best performance. If no queue is dominant, but the GPU is fully utilized, then a serialization or dependency problem might be preventing queue overlap.
MaliCompQueueUtil
libGPUCounters derivation:
max(min(((MaliCompQueuedCy - MaliCompQueueAssignStallCy) / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min((($MaliGPUQueuedCyclesComputeQueued - $MaliGPUWaitCyclesComputeQueueEndpointStalls) / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min(((ITER_COMP_ACTIVE - ITER_COMP_READY_BLOCKED) / GPU_ACTIVE) * 100, 100), 0)
Binning phase queue utilization
This expression defines the binning phase queue utilization compared against the GPU active cycles. The binning phase includes vertex position shading, culling, and primitive binning.
For GPU bound content, it is expected that the GPU queues process work in parallel. The dominant queue must be close to 100% utilized to get the best performance. If no queue is dominant, but the GPU is fully utilized, then a serialization or dependency problem might be preventing queue overlap.
MaliBinningQueueUtil
libGPUCounters derivation:
max(min(((MaliBinningQueuedCy - MaliBinningQueueAssignStallCy) / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min((($MaliGPUQueuedCyclesBinningPhaseQueued - $MaliGPUWaitCyclesBinningPhaseQueueEndpointStalls) / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min(((ITER_TILER_ACTIVE - ITER_TILER_READY_BLOCKED) / GPU_ACTIVE) * 100, 100), 0)
Main phase queue utilization
This expression defines the main phase queue utilization compared against the GPU active cycles. The main phase includes any deferred vertex processing and all fragment shading.
For GPU bound content, it is expected that the GPU queues process work in parallel. The dominant queue must be close to 100% utilized to get the best performance. If no queue is dominant, but the GPU is fully utilized, then a serialization or dependency problem might be preventing queue overlap.
MaliMainQueueUtil
libGPUCounters derivation:
max(min(((MaliMainQueuedCy - MaliMainQueueAssignStallCy) / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min((($MaliGPUQueuedCyclesMainPhaseQueued - $MaliGPUWaitCyclesMainPhaseQueueEndpointStalls) / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min(((ITER_FRAG_ACTIVE - ITER_FRAG_READY_BLOCKED) / GPU_ACTIVE) * 100, 100), 0)
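For reference, the clamped utilization calculation shared by all three queue utilization expressions can be written as the following Python sketch. The counter values are placeholders and the helper name is illustrative only.

def utilization_pct(active_cycles, gpu_active_cycles):
    # Percentage of GPU active time, clamped to the 0-100 range as in the derivations above.
    if gpu_active_cycles == 0:
        return 0.0
    return max(min((active_cycles / gpu_active_cycles) * 100.0, 100.0), 0.0)

gpu_active = 500_000          # GPU_ACTIVE for the sample period (example value)
main_queue_active = 399_000   # ITER_FRAG_ACTIVE - ITER_FRAG_READY_BLOCKED (example value)
main_queue_util = utilization_pct(main_queue_active, gpu_active)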
Tiler utilization
This expression defines the tiler utilization compared to the total GPU active cycles.
Note that this metric measures the overall processing time for the tiler geometry pipeline. The metric includes aspects of vertex shading, in addition to the fixed-function tiling process.
MaliTilerUtil
libGPUCounters derivation:
max(min((MaliTilerActiveCy / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliGPUCyclesTilerActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min((TILER_ACTIVE / GPU_ACTIVE) * 100, 100), 0)
Interrupt utilization
This expression defines the IRQ pending utilization compared against the GPU active cycles. In a well-functioning system, this expression should be less than 3% of the total cycles. If the value is much higher than this, a system issue might be preventing the CPU from efficiently handling interrupts.
MaliGPUIRQUtil
libGPUCounters derivation:
max(min((MaliGPUIRQActiveCy / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliGPUCyclesGPUInterruptActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min((GPU_IRQ_ACTIVE / GPU_ACTIVE) * 100, 100), 0)
GPU Messages
This counter group shows the total number of control-plane messages issued by the GPU front-end to the processing end-points inside the GPU.
GPU Cache Flushes
This counter group shows the total number of L2 cache and MMU operations performed by the GPU top-level.
GPU Cache Flush Cycles
This counter group shows the total number of cycles spent performing L2 cache and MMU operations by GPU top-level.
CSF Cycles
This counter group shows the total number of cycles that each of the sub-units inside the command stream front-end was active.
CEU active
This counter increments every clock cycle when the GPU command execution unit is active.
MaliCSFCEUActiveCy
$MaliCSFCyclesCEUActive
CEU_ACTIVE
CSF Utilization
This counter group shows the use of each of the functional units inside the command stream front-end, relative to their speed-of-light capability.
CEU utilization
This expression defines the front-end command execution unit utilization compared against the GPU active cycles.
MaliCSFCEUUtil
libGPUCounters derivation:
max(min((MaliCSFCEUActiveCy / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliCSFCyclesCEUActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min((CEU_ACTIVE / GPU_ACTIVE) * 100, 100), 0)
LSU utilization
This expression defines the front-end load/store unit utilization compared against the GPU active cycles.
MaliCSFLSUUtil
libGPUCounters derivation:
max(min((MaliCSFLSUActiveCy / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliCSFCyclesLSUActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min((LSU_ACTIVE / GPU_ACTIVE) * 100, 100), 0)
MCU utilization
This expression defines the microcontroller utilization compared against the GPU active cycles.
High microcontroller load can be indicative of content using many emulated commands, such as command stream scheduling and synchronization operations.
MaliCSFMCUUtil
libGPUCounters derivation:
max(min((MaliCSFMCUActiveCy / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliCSFCyclesMCUActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min((MCU_ACTIVE / GPU_ACTIVE) * 100, 100), 0)
CSF Stream Cycles
This counter group shows the total number of cycles that each of the command stream interfaces was active.
CS0 active
This counter increments every clock cycle when command stream interface 0 contained a command stream. This does not necessarily indicate that the command stream was actively being processed by the main GPU.
MaliCSFCS0ActiveCy
$MaliCSFStreamCyclesCS0Active
CSHWIF0_ENABLED
CS1 active
This counter increments every clock cycle when command stream interface 1 contained a command stream. This does not necessarily indicate that the command stream was actively being processed by the main GPU.
MaliCSFCS1ActiveCy
$MaliCSFStreamCyclesCS1Active
CSHWIF1_ENABLED
CS2 active
This counter increments every clock cycle when command stream interface 2 contained a command stream. This does not necessarily indicate that the command stream was actively being processed by the main GPU.
MaliCSFCS2ActiveCy
$MaliCSFStreamCyclesCS2Active
CSHWIF2_ENABLED
CS3 active
This counter increments every clock cycle when command stream interface 3 contained a command stream. This does not necessarily indicate that the command stream was actively being processed by the main GPU.
MaliCSFCS3ActiveCy
$MaliCSFStreamCyclesCS3Active
CSHWIF3_ENABLED
CS4 active
This counter increments every clock cycle when command stream interface 4 contained a command stream. This does not necessarily indicate that the command stream was actively being processed by the main GPU.
MaliCSFCS4ActiveCy
$MaliCSFStreamCyclesCS4Active
CSHWIF4_ENABLED
CS5 active
This counter increments every clock cycle when command stream interface 5 contained a command stream. This does not necessarily indicate that the command stream was actively being processed by the main GPU.
MaliCSFCS5ActiveCy
$MaliCSFStreamCyclesCS5Active
CSHWIF5_ENABLED
CSF Stream Stall Cycles
This counter group shows the total number of cycles that each of the command stream interfaces stalled for any reason.
CS0 wait stalls
This counter increments every clock cycle when command stream interface 0 was blocked due to an outstanding scheduling dependency.
MaliCS0WaitStallCy
$MaliCSFStreamStallCyclesCS0WaitStalls
CSHWIF0_WAIT_BLOCKED
CS1 wait stalls
This counter increments every clock cycle when command stream interface 1 was blocked due to an outstanding scheduling dependency.
MaliCS1WaitStallCy
$MaliCSFStreamStallCyclesCS1WaitStalls
CSHWIF1_WAIT_BLOCKED
CS2 wait stalls
This counter increments every clock cycle when command stream interface 2 was blocked due to an outstanding scheduling dependency.
MaliCS2WaitStallCy
$MaliCSFStreamStallCyclesCS2WaitStalls
CSHWIF2_WAIT_BLOCKED
CS3 wait stalls
This counter increments every clock cycle when command stream interface 3 was blocked due to an outstanding scheduling dependency.
MaliCS3WaitStallCy
$MaliCSFStreamStallCyclesCS3WaitStalls
CSHWIF3_WAIT_BLOCKED
External Memory System
The GPU external memory interface connects the GPU to the system DRAM, via an on-chip memory bus. The exact configuration of the memory system outside of the GPU varies from device to device and might include additional levels of system cache before reaching the off-chip memory.
GPUs are data-plane processors, with workloads that are too large to keep in system cache and that therefore make heavy use of main memory. GPUs are designed to be tolerant of high latency, when compared to a CPU, but poor memory system performance can still reduce GPU efficiency.
Accessing external DRAM is one of the most energy-intensive operations that the GPU can perform. Reducing memory bandwidth is a key optimization goal for mobile applications, even those that are not bandwidth limited, to ensure that users get long battery life and thermally stable performance.
Performance counters in this section measure how much memory bandwidth your application uses, as well as stall and latency counters to show how well the memory system is coping with the generated traffic.
External Bus Accesses
This counter group shows the absolute number of external memory transactions generated by the GPU.
Read transactions
This counter increments for every external read transaction made on the memory bus. These transactions typically result in an external DRAM access, but some designs include a system cache which can provide some buffering.
The longest memory transaction possible is 64 bytes in length, but shorter transactions are generated in some circumstances.
MaliExtBusRd
$MaliExternalBusAccessesReadTransactions
L2_EXT_READ
Write transactions
This counter increments for every external write transaction made on the memory bus. These transactions typically result in an external DRAM access, but some chips include a system cache which can provide some buffering.
The longest memory transaction possible is 64 bytes in length, but shorter transactions are generated in some circumstances.
MaliExtBusWr
$MaliExternalBusAccessesWriteTransactions
L2_EXT_WRITE
ReadNoSnoop transactions
This counter increments for every non-coherent (ReadNoSnp) transaction.
MaliExtBusRdNoSnoop
$MaliExternalBusAccessesReadNoSnoopTransactions
L2_EXT_READ_NOSNP
ReadUnique transactions
This counter increments for every coherent exclusive read (ReadUnique) transaction.
MaliExtBusRdUnique
$MaliExternalBusAccessesReadUniqueTransactions
L2_EXT_READ_UNIQUE
WriteNoSnoopFull transactions
This counter increments for every external non-coherent full write (WriteNoSnpFull) transaction.
MaliExtBusWrNoSnoopFull
$MaliExternalBusAccessesWriteNoSnoopFullTransactions
L2_EXT_WRITE_NOSNP_FULL
WriteNoSnoopPartial transactions
This counter increments for every external non-coherent partial write (WriteNoSnpPtl) transaction.
MaliExtBusWrNoSnoopPart
$MaliExternalBusAccessesWriteNoSnoopPartialTransactions
L2_EXT_WRITE_NOSNP_PTL
External Bus Beats
This counter group shows the absolute amount of external memory data transfer cycles used by the GPU.
Read beats
This counter increments for every clock cycle when a data beat was read from the external memory bus.
Most implementations use a 128-bit (16-byte) data bus, enabling a single 64-byte read transaction to be read using 4 bus cycles.
MaliExtBusRdBt
$MaliExternalBusBeatsReadBeats
L2_EXT_READ_BEATS
Write beats
This counter increments for every clock cycle when a data beat was written to the external memory bus.
Most implementations use a 128-bit (16-byte) data bus, enabling a single 64-byte write transaction to be written using 4 bus cycles.
MaliExtBusWrBt
$MaliExternalBusBeatsWriteBeats
L2_EXT_WRITE_BEATS
External Bus Bytes
This counter group shows the absolute amount of external memory traffic generated by the GPU. Absolute measures are the most useful way to check actual bandwidth against a per-frame bandwidth budget.
Read bytes
This expression defines the total output read bandwidth for the GPU.
MaliExtBusRdBy
libGPUCounters derivation:
MaliExtBusRdBt * MALI_CONFIG_EXT_BUS_BYTE_SIZE
Streamline derivation:
$MaliExternalBusBeatsReadBeats * ($MaliConstantsBusWidthBits / 8)
Hardware derivation:
L2_EXT_READ_BEATS * MALI_CONFIG_EXT_BUS_BYTE_SIZE
Write bytes
This expression defines the total output write bandwidth for the GPU.
MaliExtBusWrBy
libGPUCounters derivation:
MaliExtBusWrBt * MALI_CONFIG_EXT_BUS_BYTE_SIZE
Streamline derivation:
$MaliExternalBusBeatsWriteBeats * ($MaliConstantsBusWidthBits / 8)
Hardware derivation:
L2_EXT_WRITE_BEATS * MALI_CONFIG_EXT_BUS_BYTE_SIZE
External Bus Bandwidth
This counter group shows the external memory traffic generated by the GPU, presented as a bytes/second rate. Rates are the most useful way to check actual bandwidth against the design limits of a chip, which will usually be specified in bytes/second.
Read bandwidth
This expression defines the total output read bandwidth for the GPU, measured in bytes per second.
MaliExtBusRdBPS
libGPUCounters derivation:
(MaliExtBusRdBt * MALI_CONFIG_EXT_BUS_BYTE_SIZE) / MALI_CONFIG_TIME_SPAN
Streamline derivation:
($MaliExternalBusBeatsReadBeats * ($MaliConstantsBusWidthBits / 8)) / $ZOOM
Hardware derivation:
(L2_EXT_READ_BEATS * MALI_CONFIG_EXT_BUS_BYTE_SIZE) / MALI_CONFIG_TIME_SPAN
Write bandwidth
This expression defines the total output write bandwidth for the GPU, measured in bytes per second.
MaliExtBusWrBPS
libGPUCounters derivation:
(MaliExtBusWrBt * MALI_CONFIG_EXT_BUS_BYTE_SIZE) / MALI_CONFIG_TIME_SPAN
Streamline derivation:
($MaliExternalBusBeatsWriteBeats * ($MaliConstantsBusWidthBits / 8)) / $ZOOM
Hardware derivation:
(L2_EXT_WRITE_BEATS * MALI_CONFIG_EXT_BUS_BYTE_SIZE) / MALI_CONFIG_TIME_SPAN
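As a worked example, the beats-to-bytes and bytes-per-second conversions can be expressed as the Python sketch below, assuming a 128-bit (16-byte) external bus and a one-second sample window; all values are placeholders.

BUS_BYTES = 16              # bus width in bytes (MALI_CONFIG_EXT_BUS_BYTE_SIZE)
SAMPLE_SECONDS = 1.0        # duration of the counter sample window

read_beats = 12_000_000     # L2_EXT_READ_BEATS (example value)
write_beats = 4_000_000     # L2_EXT_WRITE_BEATS (example value)

read_bytes = read_beats * BUS_BYTES
write_bytes = write_beats * BUS_BYTES
read_bandwidth = read_bytes / SAMPLE_SECONDS     # bytes per second
write_bandwidth = write_bytes / SAMPLE_SECONDS   # bytes per second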
External Bus Stall Cycles
This counter group shows the absolute number of external memory interface stalls, which is the number of cycles that the GPU was trying to send data but the external bus could not accept it.
Read stalls
This counter increments for every stall cycle on the AXI bus where the GPU has a valid read transaction to send, but is awaiting a ready signal from the bus.
MaliExtBusRdStallCy
$MaliExternalBusStallCyclesReadStalls
L2_EXT_AR_STALL
Write stalls
This counter increments for every stall cycle on the external bus where the GPU has a valid write transaction to send, but is awaiting a ready signal from the external bus.
MaliExtBusWrStallCy
$MaliExternalBusStallCyclesWriteStalls
L2_EXT_W_STALL
External Bus Stall Rate
This counter group shows the percentage of cycles that the GPU was trying to send data, but the external bus could not accept it.
A small number of stalls is expected, but sustained periods with stall rates above 10% might indicate that the GPU is generating more traffic than the downstream memory system can handle efficiently.
Read stall rate
This expression defines the percentage of GPU cycles with a memory stall on an external read transaction.
Stall rates can be reduced by reducing the size of data resources, such as buffers or textures.
MaliExtBusRdStallRate
libGPUCounters derivation:
max(min((MaliExtBusRdStallCy / MALI_CONFIG_L2_CACHE_COUNT / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliExternalBusStallCyclesReadStalls / $MaliConstantsL2SliceCount / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min((L2_EXT_AR_STALL / MALI_CONFIG_L2_CACHE_COUNT / GPU_ACTIVE) * 100, 100), 0)
Write stall rate
This expression defines the percentage of GPU cycles with a memory stall on an external write transaction.
Stall rates can be reduced by reducing geometry complexity, or the size of framebuffers in memory.
MaliExtBusWrStallRate
libGPUCounters derivation:
max(min((MaliExtBusWrStallCy / MALI_CONFIG_L2_CACHE_COUNT / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliExternalBusStallCyclesWriteStalls / $MaliConstantsL2SliceCount / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min((L2_EXT_W_STALL / MALI_CONFIG_L2_CACHE_COUNT / GPU_ACTIVE) * 100, 100), 0)
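To show how these expressions can be used in practice, the following Python sketch computes the read stall rate and flags sustained values above the 10% guideline mentioned earlier; the counter values and slice count are assumptions for the example.

L2_SLICES = 1               # MALI_CONFIG_L2_CACHE_COUNT for this configuration (assumed)
gpu_active = 10_000_000     # GPU_ACTIVE (example value)
read_stalls = 1_500_000     # L2_EXT_AR_STALL (example value)

read_stall_rate = max(min((read_stalls / L2_SLICES / gpu_active) * 100, 100), 0)
if read_stall_rate > 10:
    print(f"Read stall rate {read_stall_rate:.1f}%: traffic may exceed memory system capacity")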
External Bus Read Latency
This counter group shows the histogram distribution of memory latency for GPU reads.
GPUs are more tolerant to latency than a CPU, but sustained periods of high latency might indicate that the GPU is generating more traffic than the downstream memory system can handle efficiently.
0-127 cycles
This counter increments for every data beat that is returned between 0 and 127 cycles after the read transaction started. This latency is considered a fast access response speed.
MaliExtBusRdLat0
$MaliExternalBusReadLatency0127Cycles
L2_EXT_RRESP_0_127
128-191 cycles
This counter increments for every data beat that is returned between 128 and 191 cycles after the read transaction started. This latency is considered a normal access response speed.
MaliExtBusRdLat128
$MaliExternalBusReadLatency128191Cycles
L2_EXT_RRESP_128_191
192-255 cycles
This counter increments for every data beat that is returned between 192 and 255 cycles after the read transaction started. This latency is considered a normal access response speed.
MaliExtBusRdLat192
$MaliExternalBusReadLatency192255Cycles
L2_EXT_RRESP_192_255
256-319 cycles
This counter increments for every data beat that is returned between 256 and 319 cycles after the read transaction started. This latency is considered a slow access response speed.
MaliExtBusRdLat256
$MaliExternalBusReadLatency256319Cycles
L2_EXT_RRESP_256_319
320-383 cycles
This counter increments for every data beat that is returned between 320 and 383 cycles after the read transaction started. This latency is considered a slow access response speed.
MaliExtBusRdLat320
$MaliExternalBusReadLatency320383Cycles
L2_EXT_RRESP_320_383
384+ cycles
This expression increments for every read beat that is returned at least 384 cycles after the transaction started. This latency is considered a very slow access response speed.
MaliExtBusRdLat384
libGPUCounters derivation:
MaliExtBusRdBt - MaliExtBusRdLat0 - MaliExtBusRdLat128 - MaliExtBusRdLat192 - MaliExtBusRdLat256 - MaliExtBusRdLat320
Streamline derivation:
$MaliExternalBusBeatsReadBeats - $MaliExternalBusReadLatency0127Cycles - $MaliExternalBusReadLatency128191Cycles - $MaliExternalBusReadLatency192255Cycles - $MaliExternalBusReadLatency256319Cycles - $MaliExternalBusReadLatency320383Cycles
Hardware derivation:
L2_EXT_READ_BEATS - L2_EXT_RRESP_0_127 - L2_EXT_RRESP_128_191 - L2_EXT_RRESP_192_255 - L2_EXT_RRESP_256_319 - L2_EXT_RRESP_320_383
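Because there is no dedicated hardware counter for the slowest bucket, it is derived as the residual of the total read beats, as the following Python sketch illustrates with placeholder values.

read_beats = 1_005_000      # L2_EXT_READ_BEATS for the sample window (example value)
latency_buckets = {         # explicit latency histogram counters (example values)
    "L2_EXT_RRESP_0_127": 900_000,
    "L2_EXT_RRESP_128_191": 60_000,
    "L2_EXT_RRESP_192_255": 25_000,
    "L2_EXT_RRESP_256_319": 10_000,
    "L2_EXT_RRESP_320_383": 4_000,
}

# Any beat not counted by an explicit bucket took 384 or more cycles.
latency_384_plus = read_beats - sum(latency_buckets.values())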
External Bus Outstanding Reads
This counter group shows the histogram distribution of the use of the available pool of outstanding memory read transactions.
Sustained periods with most read transactions outstanding may indicate that the GPU hardware configuration is running out of outstanding read capacity.
0-25% outstanding
This counter increments for every read transaction initiated when 0-25% of the available transaction IDs are in use.
MaliExtBusRdOTQ1
$MaliExternalBusOutstandingReads025Outstanding
L2_EXT_AR_CNT_Q1
25-50% outstanding
This counter increments for every read transaction initiated when 25-50% of the available transaction IDs are in use.
MaliExtBusRdOTQ2
$MaliExternalBusOutstandingReads2550Outstanding
L2_EXT_AR_CNT_Q2
50-75% outstanding
This counter increments for every read transaction initiated when 50-75% of the available transaction IDs are in use.
MaliExtBusRdOTQ3
$MaliExternalBusOutstandingReads5075Outstanding
L2_EXT_AR_CNT_Q3
75-100% outstanding
This expression increments for every read transaction initiated when 75-100% of transaction IDs are in use.
MaliExtBusRdOTQ4
libGPUCounters derivation:
MaliExtBusRd - MaliExtBusRdOTQ1 - MaliExtBusRdOTQ2 - MaliExtBusRdOTQ3
Streamline derivation:
$MaliExternalBusAccessesReadTransactions - $MaliExternalBusOutstandingReads025Outstanding - $MaliExternalBusOutstandingReads2550Outstanding - $MaliExternalBusOutstandingReads5075Outstanding
Hardware derivation:
L2_EXT_READ - L2_EXT_AR_CNT_Q1 - L2_EXT_AR_CNT_Q2 - L2_EXT_AR_CNT_Q3
External Bus Outstanding Writes
This counter group shows the histogram distribution of the use of the available pool of outstanding memory write transactions.
Sustained periods with most write transactions outstanding may indicate that the GPU hardware configuration is running out of outstanding write capacity.
0-25% outstanding
This counter increments for every write transaction initiated when 0-25% of the available transaction IDs are in use.
MaliExtBusWrOTQ1
$MaliExternalBusOutstandingWrites025Outstanding
L2_EXT_AW_CNT_Q1
25-50% outstanding
This counter increments for every write transaction initiated when 25-50% of the available transaction IDs are in use.
MaliExtBusWrOTQ2
$MaliExternalBusOutstandingWrites2550Outstanding
L2_EXT_AW_CNT_Q2
50-75% outstanding
This counter increments for every write transaction initiated when 50-75% of the available transaction IDs are in use.
MaliExtBusWrOTQ3
$MaliExternalBusOutstandingWrites5075Outstanding
L2_EXT_AW_CNT_Q3
75-100% outstanding
This expression increments for every write transaction initiated when 75-100% of transaction IDs are in use.
MaliExtBusWrOTQ4
libGPUCounters derivation:
MaliExtBusWr - MaliExtBusWrOTQ1 - MaliExtBusWrOTQ2 - MaliExtBusWrOTQ3
Streamline derivation:
$MaliExternalBusAccessesWriteTransactions - $MaliExternalBusOutstandingWrites025Outstanding - $MaliExternalBusOutstandingWrites2550Outstanding - $MaliExternalBusOutstandingWrites5075Outstanding
Hardware derivation:
L2_EXT_WRITE - L2_EXT_AW_CNT_Q1 - L2_EXT_AW_CNT_Q2 - L2_EXT_AW_CNT_Q3
Graphics Geometry Workload
Graphics workloads using the rasterization pipeline pass inputs to the GPU as a geometry stream. Vertices in this stream are position shaded, assembled into primitives, and then passed through a culling pipeline before being passed to the Arm GPU binning unit.
Performance counters in this section show how the input geometry is processed, indicating the overall complexity of the geometry workload and how it is processed by the primitive culling stages.
Input Primitives
This counter group shows the number of input primitives to the GPU, before any culling is applied.
Input primitives
This expression defines the total number of input primitives to the rendering process.
High complexity geometry is one of the most expensive inputs to the GPU, because vertices are much larger than compressed texels. Optimize your geometry to minimize mesh complexity, using dynamic level-of-detail and normal maps to reduce the number of primitives required.
MaliGeomTotalPrim
libGPUCounters derivation:
MaliGeomFaceCullPrim + MaliGeomPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomScissorCullPrim + MaliGeomVisiblePrim
Streamline derivation:
$MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingScissorTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives
Hardware derivation:
PRIM_FACE_CULLED + PRIM_FRUSTUM_CULLED + PRIM_SAMPLE_CULLED + PRIM_SCISSOR_CULLED + PRIM_VISIBLE
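As a small illustration, the total can be reconstructed in Python by summing the per-stage culling counters and the visible primitive count; the values below are placeholders.

culled = {                  # per-stage culling counters (example values)
    "PRIM_FACE_CULLED": 450_000,
    "PRIM_FRUSTUM_CULLED": 200_000,
    "PRIM_SAMPLE_CULLED": 30_000,
    "PRIM_SCISSOR_CULLED": 20_000,
}
visible = 400_000           # PRIM_VISIBLE (example value)

# Every input primitive is either culled by exactly one stage or survives as visible.
total_input_primitives = sum(culled.values()) + visible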
Triangle primitives
This counter increments for every input triangle primitive. The count is made before any culling or clipping.
MaliGeomTrianglePrim
$MaliInputPrimitivesTrianglePrimitives
TRIANGLES
Visible Primitives
This counter group shows the properties of any visible primitives, after any culling is applied.
Primitive Culling
This counter group shows the absolute number of primitives that are culled by each of the culling stages in the geometry pipeline, and the number of visible primitives that are not culled by any stage.
Visible primitives
This counter increments for every visible primitive that survives all culling stages.
All fragments of the primitive might be occluded by other primitives closer to the camera, and so produce no visible output.
MaliGeomVisiblePrim
$MaliPrimitiveCullingVisiblePrimitives
PRIM_VISIBLE
Culled primitives
This expression defines the number of primitives that were culled during the rendering process, for any reason.
For efficient 3D content, it is expected that only 50% of primitives are visible because back-face culling is used to remove half of each model.
MaliGeomTotalCullPrim
libGPUCounters derivation:
MaliGeomFaceCullPrim + MaliGeomPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomScissorCullPrim
Streamline derivation:
$MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingScissorTestCulledPrimitives
Hardware derivation:
PRIM_FACE_CULLED + PRIM_FRUSTUM_CULLED + PRIM_SAMPLE_CULLED + PRIM_SCISSOR_CULLED
Facing test culled primitives
This counter increments for every primitive culled by the facing test.
For an arbitrary 3D scene we would expect approximately half of the triangles to be back-facing. If you see a significantly lower percentage than this, check that the facing test is properly enabled.
MaliGeomFaceCullPrim
$MaliPrimitiveCullingFacingTestCulledPrimitives
PRIM_FACE_CULLED
Frustum test culled primitives
This counter increments for every primitive culled by testing against the view frustum clip planes.
If significant numbers of triangles are culled by this test, Arm recommends reviewing application culling and batching. Test draw call bounding boxes against the frustum to cull draws that are completely out-of-frustum. Reduce the size of static batches to reduce the bounding volume of each batch, enabling better culling.
MaliGeomPlaneCullPrim
$MaliPrimitiveCullingFrustumTestCulledPrimitives
PRIM_FRUSTUM_CULLED
Scissor test culled primitives
This counter increments for every primitive culled by the scissor test.
MaliGeomScissorCullPrim
$MaliPrimitiveCullingScissorTestCulledPrimitives
PRIM_SCISSOR_CULLED
Sample test culled primitives
This counter increments for every primitive culled by the sample coverage test. It is expected that a few primitives are small and fail the sample coverage test, as application mesh level-of-detail selection can never be perfect. If the number of primitives counted is more than 5-10% of the total number, this might indicate that the application has a large number of very small triangles, which are very expensive for a GPU to process.
Aim to keep triangle screen area above 10 pixels. Use schemes such as mesh level-of-detail to select simplified meshes as objects move further away from the camera.
MaliGeomSampleCullPrim
$MaliPrimitiveCullingSampleTestCulledPrimitives
PRIM_SAMPLE_CULLED
Primitive Culling Rate
This counter group shows the percentage of the primitives that use each culling stage that are culled by it, and the percentage of primitives that are visible and not culled by any stage.
Visible primitive rate
This expression defines the percentage of primitives that are visible after culling.
For efficient 3D content, it is expected that only 50% of primitives are visible because back-face culling is used to remove half of each model.
- A significantly higher visibility rate indicates that the facing test might not be enabled.
- A significantly lower visibility rate indicates that geometry is being culled for other reasons, which is often possible to optimize. Use the individual culling counters for a more detailed breakdown.
MaliGeomVisibleRate
libGPUCounters derivation:
max(min((MaliGeomVisiblePrim / (MaliGeomFaceCullPrim + MaliGeomPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomScissorCullPrim + MaliGeomVisiblePrim)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliPrimitiveCullingVisiblePrimitives / ($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingScissorTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives)) * 100, 100), 0)
Hardware derivation:
max(min((PRIM_VISIBLE / (PRIM_FACE_CULLED + PRIM_FRUSTUM_CULLED + PRIM_SAMPLE_CULLED + PRIM_SCISSOR_CULLED + PRIM_VISIBLE)) * 100, 100), 0)
Facing culled primitive rate
This expression defines the percentage of primitives entering the facing test that are culled by it. Back-facing triangles that are inside the frustum are culled by this stage.
For efficient 3D content, it is expected that 50% of primitives are culled by the facing test. If you see a significantly lower percentage, check that the facing test is properly enabled.
MaliGeomFaceCullRate
libGPUCounters derivation:
max(min((MaliGeomFaceCullPrim / ((MaliGeomFaceCullPrim + MaliGeomPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomScissorCullPrim + MaliGeomVisiblePrim) - MaliGeomPlaneCullPrim - MaliGeomScissorCullPrim)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliPrimitiveCullingFacingTestCulledPrimitives / (($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingScissorTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives) - $MaliPrimitiveCullingFrustumTestCulledPrimitives - $MaliPrimitiveCullingScissorTestCulledPrimitives)) * 100, 100), 0)
Hardware derivation:
max(min((PRIM_FACE_CULLED / ((PRIM_FACE_CULLED + PRIM_FRUSTUM_CULLED + PRIM_SAMPLE_CULLED + PRIM_SCISSOR_CULLED + PRIM_VISIBLE) - PRIM_FRUSTUM_CULLED - PRIM_SCISSOR_CULLED)) * 100, 100), 0)
Frustum culled primitive rate
This expression defines the percentage of primitives entering the frustum test that are culled by it. Primitives that are outside of the view frustum are culled by this stage.
If a significant percentage of triangles are culled by this test we recommend reviewing application culling and batching. Test draw call bounding boxes against the frustum to cull draws that are completely out-of-frustum. Reduce the size of static batches to reduce the bounding volume of each batch, enabling better culling.
MaliGeomPlaneCullRate
libGPUCounters derivation:
max(min((MaliGeomPlaneCullPrim / (MaliGeomFaceCullPrim + MaliGeomPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomScissorCullPrim + MaliGeomVisiblePrim)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliPrimitiveCullingFrustumTestCulledPrimitives / ($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingScissorTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives)) * 100, 100), 0)
Hardware derivation:
max(min((PRIM_FRUSTUM_CULLED / (PRIM_FACE_CULLED + PRIM_FRUSTUM_CULLED + PRIM_SAMPLE_CULLED + PRIM_SCISSOR_CULLED + PRIM_VISIBLE)) * 100, 100), 0)
Scissor culled primitive rate
This expression defines the percentage of primitives entering the scissor test that are culled by it. Primitives outside of the active scissor region are killed by this stage.
MaliGeomScissorCullRate
libGPUCounters derivation:
max(min((MaliGeomScissorCullPrim / ((MaliGeomFaceCullPrim + MaliGeomPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomScissorCullPrim + MaliGeomVisiblePrim) - MaliGeomPlaneCullPrim)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliPrimitiveCullingScissorTestCulledPrimitives / (($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingScissorTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives) - $MaliPrimitiveCullingFrustumTestCulledPrimitives)) * 100, 100), 0)
Hardware derivation:
max(min((PRIM_SCISSOR_CULLED / ((PRIM_FACE_CULLED + PRIM_FRUSTUM_CULLED + PRIM_SAMPLE_CULLED + PRIM_SCISSOR_CULLED + PRIM_VISIBLE) - PRIM_FRUSTUM_CULLED)) * 100, 100), 0)
Sample culled primitive rate
This expression defines the percentage of primitives entering the sample coverage test that are culled by it. This stage culls primitives that are so small that they hit no rasterizer sample points.
If a significant number of triangles are culled at this stage, the application is using geometry meshes that are too complex for their screen coverage. Use schemes such as mesh level-of-detail to select simplified meshes as objects move further away from the camera.
MaliGeomSampleCullRate
libGPUCounters derivation:
max(min((MaliGeomSampleCullPrim / ((MaliGeomFaceCullPrim + MaliGeomPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomScissorCullPrim + MaliGeomVisiblePrim) - MaliGeomPlaneCullPrim - MaliGeomScissorCullPrim - MaliGeomFaceCullPrim)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliPrimitiveCullingSampleTestCulledPrimitives / (($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingScissorTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives) - $MaliPrimitiveCullingFrustumTestCulledPrimitives - $MaliPrimitiveCullingScissorTestCulledPrimitives - $MaliPrimitiveCullingFacingTestCulledPrimitives)) * 100, 100), 0)
Hardware derivation:
max(min((PRIM_SAMPLE_CULLED / ((PRIM_FACE_CULLED + PRIM_FRUSTUM_CULLED + PRIM_SAMPLE_CULLED + PRIM_SCISSOR_CULLED + PRIM_VISIBLE) - PRIM_FRUSTUM_CULLED - PRIM_SCISSOR_CULLED - PRIM_FACE_CULLED)) * 100, 100), 0)
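The denominators in these expressions differ because each test only counts the primitives that survived the earlier culling stages. The Python sketch below makes that ordering explicit, using placeholder counter values; it is an illustration of the formulas above, not tool code.

def cull_rate(culled_count, tested_count):
    # Percentage of primitives entering a test that are culled by it, clamped to 0-100.
    return max(min((culled_count / tested_count) * 100, 100), 0) if tested_count else 0.0

culled = {"PRIM_FACE_CULLED": 450_000, "PRIM_FRUSTUM_CULLED": 200_000,
          "PRIM_SAMPLE_CULLED": 30_000, "PRIM_SCISSOR_CULLED": 20_000}
visible = 400_000                                     # PRIM_VISIBLE (example value)

total = sum(culled.values()) + visible                # all input primitives
after_frustum = total - culled["PRIM_FRUSTUM_CULLED"]
after_scissor = after_frustum - culled["PRIM_SCISSOR_CULLED"]
after_facing = after_scissor - culled["PRIM_FACE_CULLED"]

frustum_cull_rate = cull_rate(culled["PRIM_FRUSTUM_CULLED"], total)
scissor_cull_rate = cull_rate(culled["PRIM_SCISSOR_CULLED"], after_frustum)
facing_cull_rate = cull_rate(culled["PRIM_FACE_CULLED"], after_scissor)
sample_cull_rate = cull_rate(culled["PRIM_SAMPLE_CULLED"], after_facing)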
Geometry Threads
This counter group shows the number of vertex shader threads of each type that are generated during the binning phase processing.
All vertices must be position shaded, but only visible vertices of draw calls that are incompatible with deferred vertex shading will be varying shaded.
Position shading threads
This expression defines the number of position shader thread invocations.
MaliGeomPosShadThread
libGPUCounters derivation:
MaliGeomPosShadTask * 16
Streamline derivation:
$MaliTilerShadingRequestsPositionShadingRequests * 16
Hardware derivation:
POS_SHADER_WARPS * 16
Varying shading threads
This expression defines the number of varying shader thread invocations triggered during the binning phase.
This GPU can defer varying shading to the main pass, which is not visible in this counter.
MaliGeomVarShadThread
libGPUCounters derivation:
MaliGeomVarShadTask * 16
Streamline derivation:
$MaliTilerShadingRequestsVaryingShadingRequests * 16
Hardware derivation:
VAR_SHADER_WARPS * 16
Geometry Efficiency
This counter group shows the number of vertex shader threads of each type that are generated per primitive during vertex processing. Efficient geometry aims to keep these metrics as low as possible.
This GPU has deferred vertex shading, which means that most triangles defer varying shading until the main phase processing, so the number of varying threads per primitive has a different meaning than on earlier GPUs without deferred vertex shading.
Position threads/input primitive
This expression defines the number of position shader threads per input primitive.
Efficient meshes with good vertex reuse average fewer than 1.5 vertices shaded per triangle, because vertex computation is shared by multiple primitives. Minimize this number by reusing vertices for nearby primitives, improving temporal locality of index reuse, and avoiding unused values in the active index range.
MaliGeomPosShadThreadPerPrim
libGPUCounters derivation:
(MaliGeomPosShadTask * 16) / (MaliGeomFaceCullPrim + MaliGeomPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomScissorCullPrim + MaliGeomVisiblePrim)
Streamline derivation:
($MaliTilerShadingRequestsPositionShadingRequests * 16) / ($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingScissorTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives)
Hardware derivation:
(POS_SHADER_WARPS * 16) / (PRIM_FACE_CULLED + PRIM_FRUSTUM_CULLED + PRIM_SAMPLE_CULLED + PRIM_SCISSOR_CULLED + PRIM_VISIBLE)
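As an illustration, this metric can be reconstructed in Python from the position shading warp count (16 threads per warp) and the total input primitive count; the values are placeholders.

position_warps = 80_000              # POS_SHADER_WARPS (example value)
total_input_primitives = 1_100_000   # sum of culled and visible primitives (example value)

position_threads = position_warps * 16
position_threads_per_primitive = position_threads / total_input_primitives
# Values well above ~1.5 suggest poor vertex reuse or an inefficient index layout.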
Graphics Fragment Workload
Graphics workloads using the rasterization pipeline are rendered into the framebuffer to create output images.
Performance counters in this section show the workload complexity of your fragment rendering.
Output pixels
This counter group shows the total number of output pixels rendered.
Pixels
This expression defines the total number of pixels that are shaded by the GPU, including on-screen and off-screen render passes.
This measure can be a slight overestimate because it assumes all pixels in each active 64 x 64 pixel region are shaded. If the rendered region does not align with 64 pixel aligned boundaries, then this metric includes pixels that are not actually shaded.
MaliGPUPix
libGPUCounters derivation:
MaliMainQueueTask * 4096
Streamline derivation:
$MaliGPUTasksMainPhaseTasks * 4096
Hardware derivation:
ITER_FRAG_TASK_COMPLETED * 4096
Overdraw
This counter group shows the number of fragments rendered per pixel.
Fragments/pixel
This expression computes the number of fragments shaded per output pixel.
GPU processing cost per pixel accumulates with the layer count. High overdraw can build up to a significant processing cost, especially when rendering to a high-resolution framebuffer. Minimize overdraw by rendering opaque objects front-to-back and minimizing use of blended transparent layers.
MaliFragOverdraw
libGPUCounters derivation:
MaliFragThread / (MaliMainQueueTask * 4096)
Streamline derivation:
$MaliShaderThreadsAllFragmentThreads / ($MaliGPUTasksMainPhaseTasks * 4096)
Hardware derivation:
FRAG_SHADER_THREADS / (ITER_FRAG_TASK_COMPLETED * 4096)
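As an illustration, pixel count and overdraw can be derived in Python from the main phase task count (each task covers a 64 x 64 pixel region, so 4096 pixels) and the fragment thread count; the values are placeholders.

main_phase_tasks = 510          # ITER_FRAG_TASK_COMPLETED (example value)
fragment_threads = 4_500_000    # total fragment shader threads (example value)

pixels = main_phase_tasks * 4096            # 64 x 64 pixels per main phase task
fragments_per_pixel = fragment_threads / pixels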
Workload Cost
Workload cost metrics give an average throughput per item of work processed by the GPU.
Performance counters in this section can be used to track average performance against budget, and to monitor the impact of application changes over time.
Average Workload Cost
This counter group gives the average cycle throughput for the different kinds of workloads the GPU is running.
When running workloads in parallel the shader core is shared, and these throughput metrics will be impacted by cross-talk across the queues. However, they are still a useful tool for managing performance budgets.
GPU cycles/pixel
This expression defines the average number of GPU cycles being spent per pixel rendered. This includes the cost of all shader stages.
It is a useful exercise to set a cycle budget for each render pass in your application, based on your target resolution and frame rate. Rendering 1080p60 is possible with an entry-level device, but you have a small number of cycles per pixel to work with, so you must use them efficiently.
MaliGPUCyPerPix
libGPUCounters derivation:
MaliGPUActiveCy / (MaliMainQueueTask * 4096)
Streamline derivation:
$MaliGPUCyclesGPUActive / ($MaliGPUTasksMainPhaseTasks * 4096)
Hardware derivation:
GPU_ACTIVE / (ITER_FRAG_TASK_COMPLETED * 4096)
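As a worked example of setting such a budget, the following Python sketch estimates the cycles available per pixel for a 1080p60 target; the GPU clock frequency is an assumption and will vary between devices.

gpu_clock_hz = 850_000_000      # assumed shader core clock frequency
target_fps = 60
frame_pixels = 1920 * 1080      # 1080p output resolution

cycles_per_frame = gpu_clock_hz / target_fps
budget_cycles_per_pixel = cycles_per_frame / frame_pixels   # roughly 6.8 in this example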
Shader cycles/non-fragment thread
This expression defines the average number of shader core cycles per non-fragment thread.
This measurement captures the overall shader core throughput, not the shader processing cost. It will be impacted by cycles lost to stalls that could not be hidden by other processing. In addition, it will be impacted by any fragment workloads that are running concurrently in the shader core.
MaliNonFragThroughputCy
libGPUCounters derivation:
MaliCompOrBinningActiveCy / (MaliNonFragWarp * 16)
Streamline derivation:
$MaliShaderCoreCyclesComputeOrBinningPhaseActive / ($MaliShaderWarpsNonFragmentWarps * 16)
Hardware derivation:
COMPUTE_ACTIVE / (COMPUTE_WARPS * 16)
Shader cycles/fragment thread
This expression defines the average number of shader core cycles per fragment thread.
This measurement captures the overall shader core throughput, not the shader processing cost. It will be impacted by cycles lost to stalls that could not be hidden by other processing. In addition, it will be impacted by any non-fragment workloads that are running concurrently in the shader core.
MaliFragThroughputCy
libGPUCounters derivation:
MaliMainActiveCy / ((MaliFragWarp - MaliFragPrepassWarp) * 16)
Streamline derivation:
$MaliShaderCoreCyclesMainPhaseActive / (($MaliShaderWarpsFragmentWarps - $MaliShaderWarpsFragmentPrepassWarps) * 16)
Hardware derivation:
FRAG_ACTIVE / ((FRAG_WARPS - FRAG_WARPS_PRE_PASS) * 16)
Shader Core Front-end
The shader core front-ends are the internal interfaces inside the GPU that accept tasks from other parts of the GPU and turn them into shader threads running in the programmable core.
Each shader core has two front-ends:
- Compute and Binning phase front-end for tasks including compute, binning-time vertex shading, and advanced geometry.
- Main phase front-end for all main phase tasks, including deferred vertex shading, and fragment shading.
The front-ends show as active until task processing is complete, so front-end activity is a direct way of measuring that the shader core is busy handling a workload.
The Execution core is the programmable core at the heart of the shader core hardware. The Execution core shows as active if there is at least one thread running, and monitoring its activity is an indirect way of checking that the front-ends are managing to keep the GPU busy.
Performance counters in this section measure the overall workload scheduling for the shader core, showing how busy the shader core is. Note that front-end counters can tell you that a task was scheduled but cannot tell you how heavily the programmable core is being used.
Shader Core Cycles
This counter group shows the scheduling load on the shader core, indicating which of the shader core front-ends have work scheduled and whether they are running threads on the programmable core.
Any workload active
This counter increments every clock cycle when the shader core is processing any type of workload, irrespective of which queue the workload came from.
This counter is particularly useful in high-end GPU configurations where it can indicate the shader core clock rate. This rate can be lower than the GPU top-level clock rate.
MaliAnyActiveCy
$MaliShaderCoreCyclesAnyWorkloadActive
SHADER_CORE_ACTIVE
Compute or binning phase active
This counter increments every clock cycle when the shader core is processing some compute or binning phase workload. Active processing includes any cycle that compute or binning work is queued in the fixed-function front-end or programmable core.
MaliCompOrBinningActiveCy
$MaliShaderCoreCyclesComputeOrBinningPhaseActive
COMPUTE_ACTIVE
Main phase active
This counter increments every clock cycle when the shader core is processing some main phase workload. Active processing includes any cycle that fragment work is running anywhere in the fixed-function front-end, fixed-function back-end, or programmable core.
MaliMainActiveCy
$MaliShaderCoreCyclesMainPhaseActive
FRAG_ACTIVE
Fragment pre-pipe buffer active
This counter increments every clock cycle when the pre-pipe quad queue contains at least one quad waiting to run. If this queue completely drains, a fragment warp cannot be spawned when space for new threads becomes available in the shader core. You can experience reduced performance when low thread occupancy starves the functional units of work to process.
Possible causes for this include:
- Tiles which contain no geometry, which are commonly encountered when creating shadow maps, where many tiles contain no shadow casters.
- Tiles which contain a lot of geometry that is killed by early ZS or hidden surface removal.
MaliFragFPKActiveCy
$MaliShaderCoreCyclesFragmentPrePipeBufferActive
FRAG_FPK_ACTIVE
Execution core active
This counter increments every clock cycle when the shader core is processing at least one warp. Note that this counter does not provide detailed information about how the functional units are utilized inside the shader core, but simply gives an indication that something was running.
MaliCoreActiveCy
$MaliShaderCoreCyclesExecutionCoreActive
EXEC_CORE_ACTIVE
Shader Core Utilization
This counter group shows the scheduling load on the shader core, normalized against the overall shader core activity.
Compute or binning phase utilization
This expression defines the percentage utilization of the shader core compute or binning phase path. This counter measures any cycle that a compute or binning phase workload is active in the fixed-function front-end or programmable core.
MaliCompOrBinningUtil
libGPUCounters derivation:
max(min((MaliCompOrBinningActiveCy / MaliAnyActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreCyclesComputeOrBinningPhaseActive / $MaliShaderCoreCyclesAnyWorkloadActive) * 100, 100), 0)
Hardware derivation:
max(min((COMPUTE_ACTIVE / SHADER_CORE_ACTIVE) * 100, 100), 0)
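All of the utilization expressions in this section follow the same clamped-percentage pattern. The helper below is a minimal Python sketch of how that pattern might be evaluated in your own tooling; it is not part of any Arm library:

# Illustrative helper implementing the max(min((n / d) * 100, 100), 0)
# pattern used by the utilization derivations in this section.
def utilization_pct(active_cycles: int, reference_cycles: int) -> float:
    if reference_cycles == 0:
        return 0.0
    return max(min((active_cycles / reference_cycles) * 100.0, 100.0), 0.0)

# Example: compute or binning phase utilization
#   utilization_pct(COMPUTE_ACTIVE, SHADER_CORE_ACTIVE)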
Main phase utilization
This expression defines the percentage utilization of the shader core main phase path. This counter measures any cycle that a main phase workload is active in the fixed-function front-end, fixed-function back-end, or programmable core.
MaliMainUtil
libGPUCounters derivation:
max(min((MaliMainActiveCy / MaliAnyActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreCyclesMainPhaseActive / $MaliShaderCoreCyclesAnyWorkloadActive) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_ACTIVE / SHADER_CORE_ACTIVE) * 100, 100), 0)
Fragment pre-pipe buffer utilization
This expression defines the percentage of cycles when the pre-pipe quad buffer contains at least one fragment quad. This buffer is located after early ZS but before the programmable core.
During fragment shading this counter must be close to 100%. This indicates that the fragment front-end is able to keep up with the shader core shading rate. This counter commonly drops below 100% for three reasons:
- The running workload has many empty tiles with no geometry to render. Empty tiles are common in shadow maps, for any screen region with no shadow casters.
- The application consists of simple shaders but a high percentage of microtriangles. This combination causes the shader core to complete fragments faster than they are rasterized, so the quad buffer starts to drain.
- The application consists of layers which stall at early ZS because of a dependency on an earlier fragment layer which is still in flight. Stalled layers prevent new fragments entering the quad buffer, so the quad buffer starts to drain.
MaliFragFPKBUtil
libGPUCounters derivation:
max(min((MaliFragFPKActiveCy / MaliMainActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreCyclesFragmentPrePipeBufferActive / $MaliShaderCoreCyclesMainPhaseActive) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_FPK_ACTIVE / FRAG_ACTIVE) * 100, 100), 0)
Execution core utilization
This expression defines the percentage utilization of the programmable core, measuring cycles when the shader core contains at least one warp. A low utilization here indicates lost performance, because there are spare shader core cycles that are unused.
In some use cases an idle core is unavoidable. For example, a clear color tile that contains no shaded geometry, or a shadow map that is resolved entirely using early ZS depth updates.
Improve programmable core utilization by parallel processing of the GPU work queues, running overlapping workloads from multiple render passes. Also aim to keep the FPK buffer utilization as high as possible, ensuring constant forward-pressure on fragment shading.
MaliCoreUtil
libGPUCounters derivation:
max(min((MaliCoreActiveCy / MaliAnyActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreCyclesExecutionCoreActive / $MaliShaderCoreCyclesAnyWorkloadActive) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_CORE_ACTIVE / SHADER_CORE_ACTIVE) * 100, 100), 0)
Shader Clock Ratio
This counter group gives an estimate of the clock ratio between the shader core and the GPU top-level. In large systems the shader cores will typically be clocked more slowly than the top-level to improve energy efficiency.
Shader core clock ratio
This expression defines the percentage usage of the shader core, relative to the top-level GPU clock.
To improve energy efficiency, some systems clock the shader cores at a lower frequency than the GPU top-level components. In these systems, the maximum achievable usage value is the clock ratio between the GPU top-level clock and the shader clock. For example, a GPU with an 800MHz top-level clock and a 400MHz shader clock can achieve a maximum usage of 50%.
MaliAnyUtil
libGPUCounters derivation:
max(min((MaliAnyActiveCy / MALI_CONFIG_SHADER_CORE_COUNT / MaliGPUActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreCyclesAnyWorkloadActive / $MaliConstantsShaderCoreCount / $MaliGPUCyclesGPUActive) * 100, 100), 0)
Hardware derivation:
max(min((SHADER_CORE_ACTIVE / MALI_CONFIG_SHADER_CORE_COUNT / GPU_ACTIVE) * 100, 100), 0)
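The ratio described above can be evaluated from counter values. The following Python sketch is illustrative only, assuming the SHADER_CORE_ACTIVE values from each core have already been summed for the sample window, as the division by the core count in the derivation above suggests:

# Illustrative only: shader-to-top-level clock ratio, following the hardware
# derivation above. A fully busy shader core clocked at half the top-level
# rate reports approximately 50%.
def shader_clock_ratio_pct(shader_core_active_sum: int,
                           shader_core_count: int,
                           gpu_active: int) -> float:
    if shader_core_count == 0 or gpu_active == 0:
        return 0.0
    per_core_active = shader_core_active_sum / shader_core_count
    return max(min((per_core_active / gpu_active) * 100.0, 100.0), 0.0)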
Shader Core Fragment Front-end
The shader core fragment front-end is a complex multi-stage pipeline that converts an incoming primitive stream for a screen-space tile into fragment threads that need to be shaded. The fragment front-end handles rasterization, early depth (Z) and stencil (S) testing, and hidden surface removal (HSR).
Performance counters in this section measure how the incoming stream was turned into quads, and how efficiently those quads interacted with ZS testing and HSR.
Fragment Tiles
This counter group shows the number of fragment tiles processed by the shader cores.
Tiles
This counter increments for every tile processed by the shader core. Note that tiles are normally 32 x 32 pixels but can vary depending on per-pixel storage requirements and the tile buffer size of the current GPU.
This GPU supports full size tiles when using up to and including 256 bits per pixel of color storage. Pixel storage requirements depend on the number of color attachments, their data format, and the number of multi-sampling samples per pixel.
The most accurate way to get the total pixel count rendered by the application is to use the Main phase tasks counter, because it always counts 64 x 64 pixel regions.
MaliFragTile
$MaliFragmentTilesTiles
FRAG_PTILES
Killed unchanged tiles
This counter increments for every 16x16 pixel tile or tile sub-region killed by a transaction elimination CRC check, where the data is the same as the content already stored in memory.
MaliFragTileKill
$MaliFragmentTilesKilledUnchangedTiles
FRAG_TRANS_ELIM
Fragment Primitives
This counter group shows how the fragment front-end handles the incoming primitive stream from the tile list built during the binning phase.
Large primitives will be read in multiple tiles and will therefore cause multiple increments to these counter values. These counters will not match the input primitive counts passed in by the application.
Input primitives
This expression defines the number of unique primitives loaded by the fragment front-end for each tile.
MaliFragInputPrim
libGPUCounters derivation:
(MaliFragPrim + MaliFragPrepassCullPrim) - MaliFragPrepassPrim
Streamline derivation:
($MaliFragmentPrimitivesLoadedPrimitives + $MaliFragmentPrimitivesPrepassCulledPrimitives) - $MaliFragmentPrimitivesLoadedPrepassPrimitives
Hardware derivation:
(FRAG_PRIMITIVES_OUT + FRAG_PRIMITIVES_HSR_CULLED) - FRAG_PRIMITIVES_OUT_PRE_PASS
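A minimal Python sketch of the reconstruction above, assuming raw counter deltas over the same sample window. One reading of the derivation is that the subtraction avoids double counting primitives that are loaded for both the prepass and the main pass, consistent with the note below that primitives might be loaded up to two times per tile:

# Illustrative only: unique primitives loaded per tile, reconstructed from
# the three raw counters named in the hardware derivation above.
def input_primitives(frag_primitives_out: int,
                     frag_primitives_hsr_culled: int,
                     frag_primitives_out_pre_pass: int) -> int:
    # Subtract the prepass loads so primitives seen in both passes count once.
    return (frag_primitives_out + frag_primitives_hsr_culled) - frag_primitives_out_pre_pass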
Loaded primitives
This counter increments for every primitive loaded by the fragment front-end.
Primitives might be loaded up to two times per tile, depending on interaction with Fragment Prepass hidden surface removal.
MaliFragPrim
$MaliFragmentPrimitivesLoadedPrimitives
FRAG_PRIMITIVES_OUT
Loaded prepass primitives
This counter increments for every primitive loaded by the fragment front-end that is used in the fragment prepass.
MaliFragPrepassPrim
$MaliFragmentPrimitivesLoadedPrepassPrimitives
FRAG_PRIMITIVES_OUT_PRE_PASS
Prepass culled primitives
This counter increments for every primitive loaded by the fragment front-end that is optimized out by the fragment prepass hidden surface removal.
MaliFragPrepassCullPrim
$MaliFragmentPrimitivesPrepassCulledPrimitives
FRAG_PRIMITIVES_HSR_CULLED
Prepass skipped primitives
This counter increments for every primitive that is not tested by fragment prepass hidden surface removal because an earlier primitive was incompatible and terminated the prepass.
MaliFragPrepassSkippedPrim
$MaliFragmentPrimitivesPrepassSkippedPrimitives
FRAG_PRIMITIVES_HSR_DISABLED
Rasterized primitives
This counter increments for every primitive entering the rasterization unit for each tile shaded. This increments per tile, which means that a single primitive that spans multiple tiles is counted multiple times. If you want to know the total number of primitives in the scene, refer to the Total input primitives expression.
Input primitives might be rasterized up to two times per tile, depending on interaction with Fragment Prepass hidden surface removal.
MaliFragRastPrim
$MaliFragmentPrimitivesRasterizedPrimitives
FRAG_PRIM_RAST
Fragment Prepass Properties
This counter group shows how the fragment prepass hidden surface removal processed the incoming primitive stream.
Prepass primitive rate
This expression defines the percentage of primitives that are processed by fragment prepass hidden surface removal.
A low percentage indicates that many primitives are using a render state that is ineligible for the prepass, or a primitive used a render state that caused the prepass to terminate early. Review application draw call settings to ensure compatibility with the fragment prepass requirements.
MaliFragPrepassPrimRate
libGPUCounters derivation:
max(min((MaliFragPrepassPrim / ((MaliFragPrim + MaliFragPrepassCullPrim) - MaliFragPrepassPrim)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentPrimitivesLoadedPrepassPrimitives / (($MaliFragmentPrimitivesLoadedPrimitives + $MaliFragmentPrimitivesPrepassCulledPrimitives) - $MaliFragmentPrimitivesLoadedPrepassPrimitives)) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_PRIMITIVES_OUT_PRE_PASS / ((FRAG_PRIMITIVES_OUT + FRAG_PRIMITIVES_HSR_CULLED) - FRAG_PRIMITIVES_OUT_PRE_PASS)) * 100, 100), 0)
Prepass warp rate
This expression defines the percentage of warps that are processed by the fragment prepass relative to the main pass.
A high percentage here indicates a potential inefficiency. It can indicate that a high percentage of draw calls require prepass shaders due to use of shader-based alpha-testing or alpha-to-coverage. It can also indicate that a high percentage of geometry is being culled by hidden surface removal.
MaliFragPrepassWarpRate
libGPUCounters derivation:
max(min((MaliFragPrepassWarp / (MaliFragWarp - MaliFragPrepassWarp)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderWarpsFragmentPrepassWarps / ($MaliShaderWarpsFragmentWarps - $MaliShaderWarpsFragmentPrepassWarps)) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_WARPS_PRE_PASS / (FRAG_WARPS - FRAG_WARPS_PRE_PASS)) * 100, 100), 0)
Culled primitive rate
This expression defines the percentage of primitives in the main pass that are culled by the fragment prepass.
A high percentage indicates that a lot of geometry is being occluded by opaque primitives. If objects are completely occluded by geometry closer to the camera, consider applying higher level culling algorithms that can completely optimize away the occluded geometry.
MaliFragPrepassCullPrimRate
libGPUCounters derivation:
max(min((MaliFragPrepassCullPrim / ((MaliFragPrim + MaliFragPrepassCullPrim) - MaliFragPrepassPrim)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentPrimitivesPrepassCulledPrimitives / (($MaliFragmentPrimitivesLoadedPrimitives + $MaliFragmentPrimitivesPrepassCulledPrimitives) - $MaliFragmentPrimitivesLoadedPrepassPrimitives)) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_PRIMITIVES_HSR_CULLED / ((FRAG_PRIMITIVES_OUT + FRAG_PRIMITIVES_HSR_CULLED) - FRAG_PRIMITIVES_OUT_PRE_PASS)) * 100, 100), 0)
Skipped primitive rate
This expression defines the percentage of primitives that are not tested by fragment prepass hidden surface removal.
A high percentage indicates that many primitives are submitted after a primitive that used a render state that caused the prepass to terminate. Review application draw call settings to ensure compatibility with the fragment prepass requirements. If an incompatible render state must be used, move all draw calls using that state after all prepass compatible draw calls.
MaliFragPrepassSkipPrimRate
libGPUCounters derivation:
max(min((MaliFragPrepassSkippedPrim / ((MaliFragPrim + MaliFragPrepassCullPrim) - MaliFragPrepassPrim)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentPrimitivesPrepassSkippedPrimitives / (($MaliFragmentPrimitivesLoadedPrimitives + $MaliFragmentPrimitivesPrepassCulledPrimitives) - $MaliFragmentPrimitivesLoadedPrepassPrimitives)) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_PRIMITIVES_HSR_DISABLED / ((FRAG_PRIMITIVES_OUT + FRAG_PRIMITIVES_HSR_CULLED) - FRAG_PRIMITIVES_OUT_PRE_PASS)) * 100, 100), 0)
Culled quad rate
This expression defines the percentage of rasterized quads that are killed by the fragment prepass hidden surface removal scheme.
Quads killed at this stage are killed before shading, so a high percentage here is not generally a performance problem. However, performance could be improved if occluded objects were removed using software culling techniques.
MaliFragPrepassKillRate
libGPUCounters derivation:
max(min((MaliFragPrepassKillQd / MaliFragPrepassTestQd) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentZSQuadsPrepassKilledQuads / $MaliFragmentZSQuadsPrepassTestedQuads) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_QUADS_HSR_BUF_KILLED / FRAG_QUADS_HSR_BUF_TEST) * 100, 100), 0)
Main pass stall rate
This expression defines the percentage of cycles when the fragment main pass is stalled waiting for the fragment prepass to complete.
A high percentage here indicates that the fragment prepass is a bottleneck. This can be caused by a high amount of geometry or a high number of primitives needing prepass shading.
MaliFragMainPassStallRate
libGPUCounters derivation:
max(min((MaliFragMainPassStallCy / MaliMainActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreStallCyclesFragmentMainPassStalls / $MaliShaderCoreCyclesMainPhaseActive) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_MAIN_PASS_STALLED_BY_PRE_PASS / FRAG_ACTIVE) * 100, 100), 0)
Fragment Quads
This counter group shows how the rasterizer turns the incoming primitive stream into 2x2 sample quads for shading.
Rasterized fine quads
This counter increments for every fine quad generated by the rasterization phase. A fine quad covers a 2x2 pixel screen region. The quads generated have at least some coverage based on the current sample pattern, but can subsequently be killed by early ZS testing or hidden surface removal before they are shaded.
Input quads might be rasterized up to two times, depending on interaction with Fragment Prepass hidden surface removal.
MaliFragRastQd
$MaliFragmentQuadsRasterizedFineQuads
FRAG_QUADS_RAST
Partial rasterized fine quads
This counter increments for every rasterized fine quad containing pixels that have no active sample points. Partial coverage occurs when any of the sample points span the edge of a triangle.
Note that a non-partial fine quad can become partial before shading if some samples fail early ZS testing. This change is not visible in this counter.
Input quads might be rasterized up to two times, depending on interaction with Fragment Prepass hidden surface removal.
MaliFragRastPartQd
$MaliFragmentQuadsPartialRasterizedFineQuads
FRAG_PARTIAL_QUADS_RAST
Rasterized coarse quads
This counter increments for every coarse quad generated by the rasterization phase. A coarse quad covers a 2x2 block of fragment threads. The quads generated have at least some coverage based on the current sample pattern, but can subsequently be killed by early ZS testing or hidden surface removal before they are shaded.
There are more coarse quads than fine quads if the application is using sample-rate shading when rendering to multi-sampled framebuffers.
There are fewer coarse quads than fine quads if the application is using variable rate shading to reduce the fragment density and shade multiple pixels per fragment.
Input quads might be rasterized up to two times, depending on interaction with Fragment Prepass hidden surface removal.
MaliFragRastCoarseQd
$MaliFragmentQuadsRasterizedCoarseQuads
FRAG_QUADS_COARSE
Shaded coarse quads
This expression defines the number of 2x2 fragment quads that are spawned as executing threads in the shader core.
This expression is an approximation assuming that all spawned fragment warps contain a full set of quads. Comparing the total number of warps against the Full warps counter can indicate how close this approximation is.
MaliFragShadedQd
libGPUCounters derivation:
(MaliFragWarp * 16) / 4
Streamline derivation:
($MaliShaderWarpsFragmentWarps * 16) / 4
Hardware derivation:
(FRAG_WARPS * 16) / 4
Fragment ZS Quads
This counter group shows how the depth (Z) and stencil (S) test unit handles quads for early and late ZS test and update.
Prepass tested quads
This counter increments for every quad that is tested by the fragment prepass hidden surface removal.
MaliFragPrepassTestQd
$MaliFragmentZSQuadsPrepassTestedQuads
FRAG_QUADS_HSR_BUF_TEST
Prepass killed quads
This counter increments for every quad that is killed by the fragment prepass hidden surface removal.
MaliFragPrepassKillQd
$MaliFragmentZSQuadsPrepassKilledQuads
FRAG_QUADS_HSR_BUF_KILLED
Prepass early ZS updated quads
This counter increments for every quad that updates the fragment prepass hidden surface removal buffer during early depth and stencil testing.
MaliFragPrepassEZSUpdateQd
$MaliFragmentZSQuadsPrepassEarlyZSUpdatedQuads
FRAG_QUADS_HSR_BUF_EZS_UPDATE
Early ZS tested quads
This counter increments for every quad undergoing early depth and stencil testing.
For maximum performance, this number must be close to the total number of input quads. We want as many of the input quads as possible to be subject to early ZS testing because early ZS testing is significantly more efficient than late ZS testing, which only kills threads after they have been shaded.
MaliFragEZSTestQd
$MaliFragmentZSQuadsEarlyZSTestedQuads
FRAG_QUADS_EZS_TEST
Early ZS killed quads
This counter increments for every quad killed by early depth and stencil testing.
Quads killed at this stage are killed before shading, so a high percentage here is not generally a performance problem. However, it can indicate an opportunity to use software culling techniques such as portal culling to avoid sending occluded geometry to the GPU.
MaliFragEZSKillQd
$MaliFragmentZSQuadsEarlyZSKilledQuads
FRAG_QUADS_EZS_KILL
Early ZS updated quads
This counter increments for every quad undergoing early depth and stencil testing that can update the framebuffer. Quads that have a depth value that depends on shader behavior, or those that have indeterminate coverage because of use of alpha-to-coverage or discard statements in the shader, might be early ZS tested but can not do an early ZS update.
For maximum performance, this number must be close to the total number of input quads. Aim to maximize the number of quads that are capable of doing an early ZS update.
MaliFragEZSUpdateQd
$MaliFragmentZSQuadsEarlyZSUpdatedQuads
FRAG_QUADS_EZS_UPDATE
ZS Unit Test Rate
This counter group shows the relative numbers of quads doing early and late depth (Z) and stencil (S) testing.
Late ZS kill rate
This expression defines the percentage of rasterized quads that are killed by late depth and stencil testing. Quads killed by late ZS testing run at least some of their fragment program before being killed. A significant number of quads being killed at late ZS testing indicates a potential overhead. Aim to minimize the number of quads using and being killed by late ZS testing.
Shaders with mutable coverage, mutable depth, or side-effects on shared resources in memory, use late ZS testing.
The driver also generates late ZS updates to preload a depth or stencil attachment at the start of a render pass, which is needed if the render pass does not start from a cleared depth value. These fragments show as a late ZS kill, as no shader is needed after the depth or stencil value has been set.
MaliFragLZSKillRate
libGPUCounters derivation:
max(min((MaliFragLZSKillQd / (4 * MaliFragWarp)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentZSQuadsLateZSKilledQuads / (4 * $MaliShaderWarpsFragmentWarps)) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_LZS_KILL / (4 * FRAG_WARPS)) * 100, 100), 0)
Late ZS test rate
This expression defines the percentage of rasterized quads that are tested by late depth and stencil testing.
A high percentage of fragments performing a late ZS update can cause slow performance, even if fragments are not killed. Younger fragments cannot complete early ZS until all older fragments at the same coordinate have completed their late ZS operations, which can cause stalls.
You achieve the lowest late test rates by avoiding draw calls with modifiable coverage, or with shader programs that write to their depth value or that have memory-visible side-effects.
MaliFragLZSTestRate
libGPUCounters derivation:
max(min((MaliFragLZSTestQd / (4 * MaliFragWarp)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentZSQuadsLateZSTestedQuads / (4 * $MaliShaderWarpsFragmentWarps)) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_LZS_TEST / (4 * FRAG_WARPS)) * 100, 100), 0)
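Both late ZS rates above use 4 * FRAG_WARPS as the denominator, which approximates the total number of main pass quads because each 16-thread fragment warp holds four 2x2 quads. A minimal Python sketch under that assumption:

# Illustrative only: late ZS kill and test rates, following the derivations
# above. Four quads per 16-thread warp is used as the approximate quad count.
def late_zs_rates(lzs_kill_quads: int, lzs_test_quads: int, frag_warps: int):
    quads = 4 * frag_warps
    if quads == 0:
        return 0.0, 0.0
    kill_rate = max(min(lzs_kill_quads / quads * 100.0, 100.0), 0.0)
    test_rate = max(min(lzs_test_quads / quads * 100.0, 100.0), 0.0)
    return kill_rate, test_rate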
Fragment FPK HSR Quads
This counter group shows how many of the generated quads are eligible to be occluders for the Forward Pixel Kill (FPK) hidden surface removal scheme.
Occluding quads
This counter increments for every quad that is a valid occluder for hidden surface removal. To be a candidate occluder, a quad must be guaranteed to be opaque and be fully resolved at early ZS.
Draw calls that use blending, shader discard, alpha-to-coverage, programmable depth, or programmable tile buffer access can not be occluders.
MaliFragOpaqueQd
$MaliFragmentFPKHSRQuadsOccludingQuads
QUAD_FPK_KILLER
Fragment Shading Rate
This counter group shows the rate of fragment generation relative to the number of covered pixels.
The fragment shading rate will be lower than 100% if the application is using variable-rate shading to reduce shading rate.
The fragment shading rate will be higher than 100% if the application is using sample-rate shading to increase shading rate for a multi-sampled render.
Shading rate
This expression defines the percentage of coarse quads generated relative to the number of fine quads that were rasterized. Coarse quads cover a 2x2 fragment region. Fine quads cover a 2x2 pixel region.
The fragment shading rate is lower than 100% if the application uses variable-rate shading to reduce shading rate.
The fragment shading rate is higher than 100% if the application uses sample-rate shading to increase shading rate for a multi-sampled render.
MaliFragShadRate
libGPUCounters derivation:
max(min((MaliFragRastCoarseQd / MaliFragRastQd) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentQuadsRasterizedCoarseQuads / $MaliFragmentQuadsRasterizedFineQuads) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_QUADS_COARSE / FRAG_QUADS_RAST) * 100, 100), 0)
Fragment Workload Properties
This counter group shows properties of the fragment front-end workload that can identify specific application optimization opportunities.
Partial coverage rate
This expression defines the percentage of fragment quads that contain samples with no coverage. A high percentage can indicate that the content has a high density of small triangles, which are expensive to process. To avoid this, use mesh level-of-detail algorithms to select simpler meshes as objects move further from the camera.
MaliFragRastPartQdRate
libGPUCounters derivation:
max(min((MaliFragRastPartQd / MaliFragRastQd) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentQuadsPartialRasterizedFineQuads / $MaliFragmentQuadsRasterizedFineQuads) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_PARTIAL_QUADS_RAST / FRAG_QUADS_RAST) * 100, 100), 0)
Unchanged tile kill rate
This expression defines the percentage of tiles that are killed by the transaction elimination CRC check because the content of a tile matches the content already stored in memory.
A high percentage of tile writes being killed indicates that a significant part of the framebuffer is static from frame to frame. Consider using scissor rectangles to reduce the area that is redrawn. To help manage partial frame updates for window surfaces, consider using EGL extensions such as:
- EGL_KHR_partial_update
- EGL_EXT_swap_buffers_with_damage
MaliFragTileKillRate
libGPUCounters derivation:
max(min((MaliFragTileKill / (4 * MaliFragTile)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliFragmentTilesKilledUnchangedTiles / (4 * $MaliFragmentTilesTiles)) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_TRANS_ELIM / (4 * FRAG_PTILES)) * 100, 100), 0)
Shader Core Programmable Core
The programmable core is responsible for executing shader programs. This generation of Arm GPUs are warp-based, scheduling multiple threads from the same program in lockstep to improve energy efficiency.
The programmable core is a massively multi-threaded core, allowing many concurrently resident warps, which provides a level of tolerance to cache misses and data fetch latency. For most applications having more threads resident improves performance, as it increases the number of threads available for latency hiding, but it might decrease performance if the additional threads cause cache thrashing.
The core is built from multiple independent hardware units, which can simultaneously process workloads from any of the resident threads. The most heavily loaded unit will set the upper bound on performance, with the other units running in parallel to it.
Performance counters in this section show the overall utilization of the different hardware units, as well as any backpressure from overloaded units, making it easier to identify the units that are on the critical path.
Shader Core Unit Utilization
This counter group shows the use of each of the functional units inside the shader core, relative to their speed-of-light capability.
These units can run in parallel, and well performing content can expect peak load to be above 80% utilization on the most heavily used units. In this scenario reducing use of those units is likely to improve application performance.
If no unit is heavily loaded, it implies that the shader core is starving for work. This can be because not enough threads are getting spawned by the front-end, or because threads in the core are blocked on memory access. Other counters can help determine which of these situations is occurring.
Arithmetic unit utilization
This expression defines the percentage utilization of the arithmetic unit in the programmable core.
The most effective technique for reducing arithmetic load is reducing the complexity of your shader programs. Using narrower 8 and 16-bit data types can also help, as it allows multiple operations to be processed in parallel.
MaliALUUtil
libGPUCounters derivation:
max(min((max((MaliEngFMAInstr + MaliEngCVTInstr + MaliEngSFUInstr) - MaliEngSlot1IssueCy, MaliEngSlot1IssueCy, MaliEngSFUInstr * 4) / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min((max(($MaliALUInstructionsFMAPipeInstructions + $MaliALUInstructionsCVTPipeInstructions + $MaliALUInstructionsSFUPipeInstructions) - $MaliALUIssuesSlot1Issues, $MaliALUIssuesSlot1Issues, $MaliALUInstructionsSFUPipeInstructions * 4) / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((max((EXEC_INSTR_FMA + EXEC_INSTR_CVT + EXEC_INSTR_SFU) - EXEC_INSTR_SLOT_1, EXEC_INSTR_SLOT_1, EXEC_INSTR_SFU * 4) / EXEC_CORE_ACTIVE) * 100, 100), 0)
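The arithmetic utilization above takes the largest of three lower bounds on busy ALU cycles: issue slot 0 (total arithmetic issues minus slot 1 issues), issue slot 1, and the SFU instruction count scaled by 4. A minimal Python sketch of the same calculation, using the raw counter names from the hardware derivation:

# Illustrative only: arithmetic unit utilization. Each term is a lower bound
# on busy ALU cycles; the largest one dominates.
def alu_utilization_pct(fma: int, cvt: int, sfu: int,
                        slot1_issues: int, exec_core_active: int) -> float:
    if exec_core_active == 0:
        return 0.0
    busy = max((fma + cvt + sfu) - slot1_issues,  # slot 0 issue cycles
               slot1_issues,                      # slot 1 issue cycles
               sfu * 4)                           # SFU throughput bound
    return max(min((busy / exec_core_active) * 100.0, 100.0), 0.0)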
Load/store unit utilization
This expression defines the percentage utilization of the load/store unit. The load/store unit is used for general-purpose memory accesses, including vertex attribute access, buffer access, work group shared memory access, and stack access. This unit also implements imageLoad/Store and atomic access functionality.
For traditional graphics content the most significant contributor to load/store usage is vertex data. Arm recommends simplifying mesh complexity, using fewer triangles, fewer vertices, and fewer bytes per vertex.
Shaders that spill to stack are also expensive, as any spilling is multiplied by the large number of parallel threads that are running. You can use the Mali Offline Compiler to check your shaders for spilling.
MaliLSUtil
libGPUCounters derivation:
max(min(((MaliLSFullRd + MaliLSPartRd + MaliLSFullWr + MaliLSPartWr + MaliLSAtomic) / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min((($MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads + $MaliLoadStoreUnitCyclesFullWrites + $MaliLoadStoreUnitCyclesPartialWrites + $MaliLoadStoreUnitCyclesAtomicAccesses) / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min(((LS_MEM_READ_FULL + LS_MEM_READ_SHORT + LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT + LS_MEM_ATOMIC) / EXEC_CORE_ACTIVE) * 100, 100), 0)
Varying unit utilization
This expression defines the percentage utilization of the varying unit.
The most effective technique for reducing varying load is reducing the number of interpolated values read by the fragment shading. Increasing shader usage of 16-bit input variables also helps, as they can be interpolated at twice the speed of 32-bit variables.
MaliVarUtil
libGPUCounters derivation:
max(min((((MaliVar32IssueSlot / 4) + (MaliVar16IssueSlot / 4)) / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(((($MaliVaryingUnitRequests32BitInterpolationSlots / 4) + ($MaliVaryingUnitRequests16BitInterpolationSlots / 4)) / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((((VARY_SLOT_32 / 4) + (VARY_SLOT_16 / 4)) / EXEC_CORE_ACTIVE) * 100, 100), 0)
Texture unit utilization
This expression defines the percentage utilization of the texturing unit.
The most effective technique for reducing texturing unit load is reducing the number of texture samples read by your shaders. Using 32bpp color formats, and the ASTC decode mode extensions to select a 32bpp intermediate precision, can reduce cache access cost. Using simpler texture filters can reduce filtering cost. Using a 16bit per component sampler result can reduce data return cost.
MaliTexUtil
libGPUCounters derivation:
max(min((max(MaliTexFiltIssueCy, MaliTexCacheLookupCy, MaliTexCacheSimpleLoadCy, MaliTexCacheComplexLoadCy, MaliTexInBt, MaliTexOutBt, MaliTexL1CacheOutputCy, MaliTexL1CacheLookupCy, MaliTexIndexCy) / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min((max($MaliTextureUnitCyclesFilteringActive, $MaliTextureUnitCacheCyclesCacheLookupActive, $MaliTextureUnitCacheCyclesSimpleLoadActive, $MaliTextureUnitCacheCyclesComplexLoadActive, $MaliTextureUnitBusInputBeats, $MaliTextureUnitBusOutputBeats, $MaliTextureUnitCacheCyclesL1OutputActive, $MaliTextureUnitCacheCyclesL1LookupActive, $MaliTextureUnitCyclesIndexCalculationActive) / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((max(TEX_FILT_NUM_OPERATIONS, TEX_TFCH_NUM_TCL_OPERATIONS, TEX_CFCH_NUM_DIRECT_PATH_OPERATIONS, TEX_CFCH_NUM_RP_OPERATIONS, TEX_MSGI_NUM_FLITS, TEX_RSPS_NUM_OPERATIONS, TEX_CFCH_NUM_L1_CL_OPERATIONS, TEX_CFCH_NUM_L1_CT_OPERATIONS, TEX_TIDX_NUM_OPERATIONS) / EXEC_CORE_ACTIVE) * 100, 100), 0)
Ray tracing unit utilization
This expression defines the percentage utilization of the ray tracing unit.
The most effective technique for reducing ray tracing load is reducing the amount of geometry in the acceleration structure, and ensuring that rays issued in each warp are spatially coherent.
MaliRTUUtil
libGPUCounters derivation:
max(min((max(MaliRTUBoxIssueCy, MaliRTUTriIssueCy) / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min((max($MaliRayTracingUnitCyclesBoxTesterIssues, $MaliRayTracingUnitCyclesTriangleTesterIssues) / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((max(RT_BOX_ISSUE_CYCLES, RT_TRI_ISSUE_CYCLES) / EXEC_CORE_ACTIVE) * 100, 100), 0)
Attribute unit utilization
This expression defines the percentage utilization of the attribute unit.
MaliAttrUtil
libGPUCounters derivation:
max(min((MaliAttrIssueCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliAttributeUnitCyclesAttributeUnitIssues / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((ATTR_ISSUE / EXEC_CORE_ACTIVE) * 100, 100), 0)
Blend unit utilization
This expression defines the percentage utilization of the blend unit.
MaliBlendUtil
libGPUCounters derivation:
max(min((MaliBlendIssueCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliBlendUnitCyclesBlendUnitIssues / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((BLEND_ISSUE / EXEC_CORE_ACTIVE) * 100, 100), 0)
Shader Core Backpressure Cycles
This counter group shows the absolute amount of backpressure being generated by functional units that are overloaded and unable to accept more work.
Backpressure is a strong indicator that a unit is unable to meet requested demand, either due to workload complexity or slow processing inside the unit due to cache misses. Reducing the size, or improving the efficiency, of the workload for the impacted unit will improve application performance.
Load/store unit backpressure
This counter increments for every clock cycle new work could not be sent to the load/store unit. This indicates that the unit is overloaded and might be a bottleneck.
MaliEngLSBackpressureCy
$MaliShaderCoreBackpressureCyclesLoadStoreUnitBackpressure
EXEC_MSG_STALLED_LSC
Varying unit backpressure
This counter increments for every clock cycle new work could not be sent to the varying unit. This indicates that the unit is overloaded and might be a bottleneck.
MaliEngVarBackpressureCy
$MaliShaderCoreBackpressureCyclesVaryingUnitBackpressure
EXEC_MSG_STALLED_VARY
Texture unit backpressure
This counter increments for every clock cycle new work could not be sent to the texture unit. This indicates that the unit is overloaded and might be a bottleneck.
MaliEngTexBackpressureCy
$MaliShaderCoreBackpressureCyclesTextureUnitBackpressure
EXEC_MSG_STALLED_TEX
Ray tracing unit backpressure
This counter increments for every clock cycle new work could not be sent to the ray tracing unit. This indicates that the unit is overloaded and might be a bottleneck.
MaliEngRTUBackpressureCy
$MaliShaderCoreBackpressureCyclesRayTracingUnitBackpressure
EXEC_MSG_STALLED_RTU
Attribute unit backpressure
This counter increments for every clock cycle new work could not be sent to the attribute unit. This indicates that the unit is overloaded and might be a bottleneck.
MaliEngAttrBackpressureCy
$MaliShaderCoreBackpressureCyclesAttributeUnitBackpressure
EXEC_MSG_STALLED_ATTR
ZS unit backpressure
This counter increments for every clock cycle new work could not be sent to the depth/stencil test unit. This indicates that the unit is overloaded and might be a bottleneck.
MaliEngZSBackpressureCy
$MaliShaderCoreBackpressureCyclesZSUnitBackpressure
EXEC_MSG_STALLED_ZS
Blend unit backpressure
This counter increments for every clock cycle new work could not be sent to the blend unit. This indicates that the unit is overloaded and might be a bottleneck.
MaliEngBlendBackpressureCy
$MaliShaderCoreBackpressureCyclesBlendUnitBackpressure
EXEC_MSG_STALLED_BLEND
Shader Core Backpressure Rate
This counter group shows the relative amount of backpressure being generated by functional units that are overloaded and unable to accept more work.
Backpressure is a strong indicator that a unit is unable to meet requested demand, either due to workload complexity or slow processing inside the unit due to cache misses. Reducing the size, or improving the efficiency, of the workload for the impacted unit will improve application performance.
Load/store unit rate
This expression defines the percentage of shader core cycles when new work could not be sent to the load/store unit. A high percentage indicates that the unit is overloaded and might be a bottleneck.
MaliEngLSBackpressureRate
libGPUCounters derivation:
max(min((MaliEngLSBackpressureCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreBackpressureCyclesLoadStoreUnitBackpressure / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_MSG_STALLED_LSC / EXEC_CORE_ACTIVE) * 100, 100), 0)
Varying unit rate
This expression defines the percentage of shader core cycles when new work could not be sent to the varying unit. A high percentage indicates that the unit is overloaded and might be a bottleneck.
MaliEngVarBackpressureRate
libGPUCounters derivation:
max(min((MaliEngVarBackpressureCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreBackpressureCyclesVaryingUnitBackpressure / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_MSG_STALLED_VARY / EXEC_CORE_ACTIVE) * 100, 100), 0)
Texture unit rate
This expression defines the percentage of shader core cycles when new work could not be sent to the texture unit. A high percentage indicates that the unit is overloaded and might be a bottleneck.
MaliEngTexBackpressureRate
libGPUCounters derivation:
max(min((MaliEngTexBackpressureCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreBackpressureCyclesTextureUnitBackpressure / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_MSG_STALLED_TEX / EXEC_CORE_ACTIVE) * 100, 100), 0)
Ray tracing unit rate
This expression defines the percentage of shader core cycles when new work could not be sent to the ray tracing test unit. A high percentage indicates that the unit is overloaded and might be a bottleneck.
MaliEngRTUBackpressureRate
libGPUCounters derivation:
max(min((MaliEngRTUBackpressureCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreBackpressureCyclesRayTracingUnitBackpressure / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_MSG_STALLED_RTU / EXEC_CORE_ACTIVE) * 100, 100), 0)
Attribute unit rate
This expression defines the percentage of shader core cycles when new work could not be sent to the attribute unit. A high percentage indicates that the unit is overloaded and might be a bottleneck.
MaliEngAttrBackpressureRate
libGPUCounters derivation:
max(min((MaliEngAttrBackpressureCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreBackpressureCyclesAttributeUnitBackpressure / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_MSG_STALLED_ATTR / EXEC_CORE_ACTIVE) * 100, 100), 0)
ZS unit rate
This expression defines the percentage of shader core cycles when new work could not be sent to the depth/stencil test unit. A high percentage indicates that the unit is overloaded and might be a bottleneck.
MaliEngZSBackpressureRate
libGPUCounters derivation:
max(min((MaliEngZSBackpressureCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreBackpressureCyclesZSUnitBackpressure / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_MSG_STALLED_ZS / EXEC_CORE_ACTIVE) * 100, 100), 0)
Blend unit rate
This expression defines the percentage of shader core cycles when new work could not be sent to the blend unit. A high percentage indicates that the unit is overloaded and might be a bottleneck.
MaliEngBlendBackpressureRate
libGPUCounters derivation:
max(min((MaliEngBlendBackpressureCy / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderCoreBackpressureCyclesBlendUnitBackpressure / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_MSG_STALLED_BLEND / EXEC_CORE_ACTIVE) * 100, 100), 0)
Shader Core Stall Cycles
This counter group shows the number of cycles when the shader core is able to accept new warps, but the front-end has no new warp ready to run. This might be because the front-end is a bottleneck, or because the workload requires no warps to be spawned.
Fragment main pass stalls
This counter increments for every clock cycle when the fragment main pass cannot start because it is waiting for a prepass result.
MaliFragMainPassStallCy
$MaliShaderCoreStallCyclesFragmentMainPassStalls
FRAG_MAIN_PASS_STALLED_BY_PRE_PASS
Execution engine starvation
This counter increments every clock cycle when the processing unit is starved of work because all warps are blocked on message dependencies or instruction cache misses.
This counter increments per fetch unit, and so can increase by up to 4 in a clock cycle.
MaliEngStarveCy
$MaliShaderCoreStallCyclesExecutionEngineStarvation
EXEC_STARVE_ARITH
Execution engine I-cache starvation
This counter increments every clock cycle when a fetch unit is starved of work because of instruction cache misses.
This counter increments per fetch unit, and so can increase by up to 4 in a clock cycle.
MaliEngStarveICacheCy
$MaliShaderCoreStallCyclesExecutionEngineICacheStarvation
EXEC_STARVE_ICACHE
Shader Core Workload
The programmable core runs the shader program threads that generate the desired application output.
Performance counters in this section show how the programmable core converts incoming work into the threads and warps running in the shader core, as well as other important properties of the running workload such as warp divergence.
Shader Warps
This counter group shows the number of warps created, split by type. This can help you to understand the running workload mix.
Non-fragment warps
This counter increments for every created non-fragment warp. For this GPU, a warp contains 16 threads.
For compute shaders, to ensure full utilization of the warp capacity, work groups must be a multiple of warp size.
MaliNonFragWarp
$MaliShaderWarpsNonFragmentWarps
COMPUTE_WARPS
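As a rough illustration of the work-group sizing advice above, the Python sketch below computes how many 16-thread warps a compute work group occupies and the resulting thread slot occupancy. The work group sizes are assumptions for illustration only:

# Illustrative only: warp slot occupancy for a compute work group, assuming
# a 16-thread warp. Sizes that are a multiple of 16 fill every warp; other
# sizes leave idle thread slots in the final warp.
import math

WARP_SIZE = 16

def workgroup_warp_occupancy(workgroup_threads: int) -> float:
    warps = math.ceil(workgroup_threads / WARP_SIZE)
    return workgroup_threads / (warps * WARP_SIZE)

print(workgroup_warp_occupancy(64))  # 1.00 - multiple of warp size
print(workgroup_warp_occupancy(24))  # 0.75 - final warp is half empty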
Deferred vertex warps
This counter increments for every created deferred vertex warp. For this GPU, a warp contains 16 threads.
MaliDefVertWarp
$MaliShaderWarpsDeferredVertexWarps
DVS_WARPS
Fragment warps
This counter increments for every created fragment warp. For this GPU, a warp contains 16 threads.
Fragment warps are populated with fragment quads, where each quad corresponds to a 2x2 fragment region from a single triangle. Threads in a quad which correspond to a sample point outside of the triangle still consume shader resource, which makes small triangles disproportionately expensive.
MaliFragWarp
$MaliShaderWarpsFragmentWarps
FRAG_WARPS
Fragment prepass warps
This counter increments for every created fragment prepass warp. For this GPU, a warp contains 16 threads.
MaliFragPrepassWarp
$MaliShaderWarpsFragmentPrepassWarps
FRAG_WARPS_PRE_PASS
Full warps
This counter increments for every warp that has a full thread slot allocation. Note that allocated thread slots might not contain a running thread if the workload cannot fill the whole allocation.
If many warps are not fully allocated then performance is reduced. Fully allocated warps are more likely if:
- Draw calls avoid late ZS dependency hazards.
- Draw calls use meshes with a low percentage of tiny primitives.
- Compute dispatches use work groups that are a multiple of warp size.
MaliCoreFullWarp
$MaliShaderWarpsFullWarps
FULL_WARPS
All register warps
This counter increments for every warp that requires more than 32 registers. Threads which require more than 32 registers consume two thread slots in the register file, halving the number of threads that can be concurrently active in the shader core.
Reduction in thread count can impact the ability of the shader core to keep functional units busy, and means that performance is more likely to be impacted by stalls caused by cache misses.
Aim to minimize the number of threads requiring more than 32 registers, by using simpler shader programs and lower precision data types.
MaliCoreAllRegsWarp
$MaliShaderWarpsAllRegisterWarps
WARP_REG_SIZE_64
Shader Threads
This counter group shows the number of threads created, split by type. This can help you to understand the running workload mix.
Counters in this group are derived by scaling quad or warp counters, and their counts will include unused thread slots in the coarser granule.
Non-fragment threads
This expression defines the number of non-fragment threads started.
The expression is an approximation, based on the assumption that all warps are fully populated with threads. The Full warps counter can give some indication of warp occupancy.
MaliNonFragThread
libGPUCounters derivation:
MaliNonFragWarp * 16
Streamline derivation:
$MaliShaderWarpsNonFragmentWarps * 16
Hardware derivation:
COMPUTE_WARPS * 16
All fragment threads
This counter defines the total number of fragment threads started, including prepass and main pass threads. This counter assumes all 4 lanes in a coarse quad are active, so this counter includes helper threads and idle thread slots if a coarse quad has partial coverage.
MaliFragThread
$MaliShaderThreadsAllFragmentThreads
FRAG_SHADER_THREADS
Fragment prepass threads
This expression defines the number of fragment threads started in the prepass. This expression assumes all lanes in a warp are active.
MaliFragPrepassThread
libGPUCounters derivation:
MaliFragPrepassWarp * 16
Streamline derivation:
$MaliShaderWarpsFragmentPrepassWarps * 16
Hardware derivation:
FRAG_WARPS_PRE_PASS * 16
Fragment main pass threads
This expression defines the number of fragment threads started in the main pass. This expression assumes all lanes in a warp are active.
MaliFragMainThread
libGPUCounters derivation:
(MaliFragWarp - MaliFragPrepassWarp) * 16
Streamline derivation:
($MaliShaderWarpsFragmentWarps - $MaliShaderWarpsFragmentPrepassWarps) * 16
Hardware derivation:
(FRAG_WARPS - FRAG_WARPS_PRE_PASS) * 16
Shader Workload Properties
This counter group shows interesting properties of the running shader code, most of which highlight an interesting optimization opportunity.
Fragment warp occupancy
This expression measures the thread occupancy of the fragment warps as a percentage. Threads are counted as active if they are part of a coarse quad, even if they have no sample coverage.
MaliCoreFragWarpOcc
libGPUCounters derivation:
max(min((MaliFragThread / (MaliFragWarp * 16)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderThreadsAllFragmentThreads / ($MaliShaderWarpsFragmentWarps * 16)) * 100, 100), 0)
Hardware derivation:
max(min((FRAG_SHADER_THREADS / (FRAG_WARPS * 16)) * 100, 100), 0)
Full warp rate
This expression defines the percentage of warps that have a full thread slot allocation. Note that allocated thread slots might not contain a running thread if the workload cannot fill the whole allocation.
If a high percentage of warps are not fully allocated then performance is reduced. Fully allocated warps are more likely if:
- Draw calls avoid late ZS dependency hazards.
- Draw calls use meshes with a low percentage of tiny primitives.
- Compute dispatches use work groups that are a multiple of warp size.
MaliCoreFullWarpRate
libGPUCounters derivation:
max(min((MaliCoreFullWarp / (MaliNonFragWarp + MaliFragWarp)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderWarpsFullWarps / ($MaliShaderWarpsNonFragmentWarps + $MaliShaderWarpsFragmentWarps)) * 100, 100), 0)
Hardware derivation:
max(min((FULL_WARPS / (COMPUTE_WARPS + FRAG_WARPS)) * 100, 100), 0)
All registers warp rate
This expression defines the percentage of warps that use more than 32 registers, requiring the full register allocation of 64 registers. Warps that require more than 32 registers halve the peak thread occupancy of the shader core, and can make shader performance more sensitive to cache misses and memory stalls.
MaliCoreAllRegsWarpRate
libGPUCounters derivation:
max(min((MaliCoreAllRegsWarp / (MaliNonFragWarp + MaliFragWarp)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliShaderWarpsAllRegisterWarps / ($MaliShaderWarpsNonFragmentWarps + $MaliShaderWarpsFragmentWarps)) * 100, 100), 0)
Hardware derivation:
max(min((WARP_REG_SIZE_64 / (COMPUTE_WARPS + FRAG_WARPS)) * 100, 100), 0)
Warp divergence rate
This expression defines the percentage of instructions that have control flow divergence across the warp.
MaliEngDivergedInstrRate
libGPUCounters derivation:
max(min((MaliEngDivergedInstr / (MaliEngFMAInstr + MaliEngCVTInstr + MaliEngSFUInstr)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliALUInstructionsDivergedInstructions / ($MaliALUInstructionsFMAPipeInstructions + $MaliALUInstructionsCVTPipeInstructions + $MaliALUInstructionsSFUPipeInstructions)) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_INSTR_DIVERGED / (EXEC_INSTR_FMA + EXEC_INSTR_CVT + EXEC_INSTR_SFU)) * 100, 100), 0)
Narrow arithmetic rate
This expression defines the percentage of arithmetic instructions that operate on 8/16-bit types. These are more energy efficient, and require fewer registers for variable storage, than 32-bit operations.
MaliEngNarrowInstrRate
libGPUCounters derivation:
max(min((MaliEngNarrowInstr / (MaliEngFMAInstr + MaliEngCVTInstr + MaliEngSFUInstr)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliALUInstructionsNarrowInstructions / ($MaliALUInstructionsFMAPipeInstructions + $MaliALUInstructionsCVTPipeInstructions + $MaliALUInstructionsSFUPipeInstructions)) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_INSTR_NARROW / (EXEC_INSTR_FMA + EXEC_INSTR_CVT + EXEC_INSTR_SFU)) * 100, 100), 0)
Shader blend rate
This expression defines the percentage of fragments that use shader-based blending, rather than the fixed-function blend path. These fragments are caused by the application using color formats, or advanced blend equations, which the fixed-function blend path does not support.
Vulkan shaders that use software blending do not show up in this data, because the blend is inlined into the main body of the shader program.
MaliEngSWBlendRate
libGPUCounters derivation:
max(min(((MaliEngSWBlendInstr * 4) / MaliFragWarp) * 100, 100), 0)
Streamline derivation:
max(min((($MaliALUInstructionsBlendShaderInstructions * 4) / $MaliShaderWarpsFragmentWarps) * 100, 100), 0)
Hardware derivation:
max(min(((CALL_BLEND_SHADER * 4) / FRAG_WARPS) * 100, 100), 0)
Shader Core Arithmetic Unit
The arithmetic unit in the shader core processes all the arithmetic and logic operations in the running shader programs.
Performance counters in this section show how the running programs used the arithmetic units, which may indicate the type of operations that are consuming the most performance.
ALU Cycles
This counter group shows the number of cycles when work was issued to the arithmetic and logic unit.
Arithmetic unit issues
This expression defines the number of cycles that the arithmetic unit was busy processing work.
MaliALUIssueCy
libGPUCounters derivation:
max((MaliEngFMAInstr + MaliEngCVTInstr + MaliEngSFUInstr) - MaliEngSlot1IssueCy, MaliEngSlot1IssueCy, MaliEngSFUInstr * 4)
Streamline derivation:
max(($MaliALUInstructionsFMAPipeInstructions + $MaliALUInstructionsCVTPipeInstructions + $MaliALUInstructionsSFUPipeInstructions) - $MaliALUIssuesSlot1Issues, $MaliALUIssuesSlot1Issues, $MaliALUInstructionsSFUPipeInstructions * 4)
Hardware derivation:
max((EXEC_INSTR_FMA + EXEC_INSTR_CVT + EXEC_INSTR_SFU) - EXEC_INSTR_SLOT_1, EXEC_INSTR_SLOT_1, EXEC_INSTR_SFU * 4)
ALU Instructions
This counter group gives a breakdown of the types of arithmetic instructions being used by the shader program.
Executed instructions
This expression defines the total number of instructions issued to any of the arithmetic pipe types.
MaliEngArithInstr
libGPUCounters derivation:
MaliEngFMAInstr + MaliEngCVTInstr + MaliEngSFUInstr
Streamline derivation:
$MaliALUInstructionsFMAPipeInstructions + $MaliALUInstructionsCVTPipeInstructions + $MaliALUInstructionsSFUPipeInstructions
Hardware derivation:
EXEC_INSTR_FMA + EXEC_INSTR_CVT + EXEC_INSTR_SFU
FMA pipe instructions
This counter increments for every instruction issued to the fused multiply-accumulate pipe.
MaliEngFMAInstr
$MaliALUInstructionsFMAPipeInstructions
EXEC_INSTR_FMA
CVT pipe instructions
This counter increments for every instruction issued to the convert pipe.
MaliEngCVTInstr
$MaliALUInstructionsCVTPipeInstructions
EXEC_INSTR_CVT
SFU pipe instructions
This counter increments for every instruction issued to the special functions unit pipe.
MaliEngSFUInstr
$MaliALUInstructionsSFUPipeInstructions
EXEC_INSTR_SFU
Diverged instructions
This counter increments for every instruction the programmable core processes per warp where there is control flow divergence across the warp. Control flow divergence erodes arithmetic processing efficiency because some threads in the warp are idle, having not taken the current control path through the code. Aim to minimize control flow divergence when designing shader effects.
MaliEngDivergedInstr
$MaliALUInstructionsDivergedInstructions
EXEC_INSTR_DIVERGED
Narrow instructions
This counter increments for every instruction that does 16-bit or narrower calculations.
MaliEngNarrowInstr
$MaliALUInstructionsNarrowInstructions
EXEC_INSTR_NARROW
Blend shader instructions
This counter increments for every blend shader invocation run.
This counter increments per fetch unit, and so can increase by up to 4 in a clock cycle.
MaliEngSWBlendInstr
$MaliALUInstructionsBlendShaderInstructions
CALL_BLEND_SHADER
ALU Utilization
This counter group gives a breakdown of the usage of the different arithmetic sub-units, relative to their speed-of-light performance.
Due to shared issue data paths, it might not be possible for individual ALU units to reach their speed-of-light if the other ALU hardware units are also in use.
FMA pipe utilization
This expression defines the fused multiply-accumulate pipeline utilization.
This pipeline shares instruction issue slots with CVT and SFU instructions, so it is not possible to achieve 100% utilization unless the other pipelines are idle.
MaliEngFMAPipeUtil
libGPUCounters derivation:
max(min((MaliEngFMAInstr / (2 * MaliCoreActiveCy)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliALUInstructionsFMAPipeInstructions / (2 * $MaliShaderCoreCyclesExecutionCoreActive)) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_INSTR_FMA / (2 * EXEC_CORE_ACTIVE)) * 100, 100), 0)
CVT pipe utilization
This expression defines the convert pipeline utilization.
This pipeline shares instruction issue slots with FMA and SFU instructions, so it is not possible to achieve 100% utilization unless the other pipelines are idle.
MaliEngCVTPipeUtil
libGPUCounters derivation:
max(min((MaliEngCVTInstr / (2 * MaliCoreActiveCy)) * 100, 100), 0)
Streamline derivation:
max(min(($MaliALUInstructionsCVTPipeInstructions / (2 * $MaliShaderCoreCyclesExecutionCoreActive)) * 100, 100), 0)
Hardware derivation:
max(min((EXEC_INSTR_CVT / (2 * EXEC_CORE_ACTIVE)) * 100, 100), 0)
SFU pipe utilization
This expression defines the special functions unit pipeline utilization.
This pipeline shares instruction issue slots with FMA and CVT instructions, so it is not possible to achieve 100% utilization unless the other pipelines are idle.
MaliEngSFUPipeUtil
libGPUCounters derivation:
max(min(((MaliEngSFUInstr * 4) / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min((($MaliALUInstructionsSFUPipeInstructions * 4) / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min(((EXEC_INSTR_SFU * 4) / EXEC_CORE_ACTIVE) * 100, 100), 0)
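The three utilization expressions above differ only in their normalization: FMA and CVT instructions are compared against two issue slots per active cycle, while SFU instructions are scaled by four. A minimal Python sketch, using invented raw counter values, computes all three side by side.

# Hypothetical raw counter samples (invented values, for illustration only).
EXEC_CORE_ACTIVE = 1_000_000
EXEC_INSTR_FMA = 900_000
EXEC_INSTR_CVT = 300_000
EXEC_INSTR_SFU = 50_000

def clamp_pct(value):
    # All utilization expressions in this document clamp to the 0-100 range.
    return max(min(value, 100), 0)

fma_util = clamp_pct(EXEC_INSTR_FMA / (2 * EXEC_CORE_ACTIVE) * 100)
cvt_util = clamp_pct(EXEC_INSTR_CVT / (2 * EXEC_CORE_ACTIVE) * 100)
sfu_util = clamp_pct((EXEC_INSTR_SFU * 4) / EXEC_CORE_ACTIVE * 100)
print(f"FMA {fma_util:.1f}%, CVT {cvt_util:.1f}%, SFU {sfu_util:.1f}%")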
ALU Issues
This counter group gives a breakdown of the usage of the arithmetic instruction issue ports. Issue port contention will usually become a bottleneck before individual functional pipelines, as pipelines often share issue port bandwidth.
Any slot issues
This counter increments for every clock cycle when an instruction is issued to either arithmetic issue slot.
MaliEngSlotAnyIssueCy
$MaliALUIssuesAnySlotIssues
EXEC_ISSUE_SLOT_ANY
Slot 0 issues
This counter increments for every clock cycle when an instruction is issued to issue slot 0.
MaliEngSlot0IssueCy
libGPUCounters derivation:
(MaliEngFMAInstr + MaliEngCVTInstr + MaliEngSFUInstr) - MaliEngSlot1IssueCy
Streamline derivation:
($MaliALUInstructionsFMAPipeInstructions + $MaliALUInstructionsCVTPipeInstructions + $MaliALUInstructionsSFUPipeInstructions) - $MaliALUIssuesSlot1Issues
Hardware derivation:
(EXEC_INSTR_FMA + EXEC_INSTR_CVT + EXEC_INSTR_SFU) - EXEC_INSTR_SLOT_1
Shader Core Load/store Unit
The load/store unit in the shader core handles all generic read/write data access, including access to vertex attributes, buffers, images, workgroup local storage, and program stack.
Performance counters in this section show the breakdown of performed load/store cache accesses, showing whether accesses are using an entire cache line or just using part of one.
Load/Store Unit Cycles
This counter group shows the number of cycles when work was issued to the load/store unit.
Load/store unit issues
This expression defines the total number of load/store cache access cycles. This counter ignores secondary effects such as cache misses, so provides the minimum possible cycle usage.
MaliLSIssueCy
libGPUCounters derivation:
MaliLSFullRd + MaliLSPartRd + MaliLSFullWr + MaliLSPartWr + MaliLSAtomic
Streamline derivation:
$MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads + $MaliLoadStoreUnitCyclesFullWrites + $MaliLoadStoreUnitCyclesPartialWrites + $MaliLoadStoreUnitCyclesAtomicAccesses
Hardware derivation:
LS_MEM_READ_FULL + LS_MEM_READ_SHORT + LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT + LS_MEM_ATOMIC
Reads
This expression defines the total number of load/store read cycles.
MaliLSRdCy
libGPUCounters derivation:
MaliLSFullRd + MaliLSPartRd
Streamline derivation:
$MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads
Hardware derivation:
LS_MEM_READ_FULL + LS_MEM_READ_SHORT
Full reads
This counter increments for every full-width load/store cache read.
MaliLSFullRd
$MaliLoadStoreUnitCyclesFullReads
LS_MEM_READ_FULL
Partial reads
This counter increments for every partial-width load/store cache read. Partial data accesses do not make full use of the load/store cache capability. Merging short accesses together to make fewer larger requests improves efficiency. To do this in shader code:
- Use vector data loads.
- Avoid padding in strided data accesses.
- Write compute shaders so that adjacent threads in a warp access adjacent addresses in memory.
MaliLSPartRd
$MaliLoadStoreUnitCyclesPartialReads
LS_MEM_READ_SHORT
Writes
This expression defines the total number of load/store write cycles.
MaliLSWrCy
libGPUCounters derivation:
MaliLSFullWr + MaliLSPartWr
Streamline derivation:
$MaliLoadStoreUnitCyclesFullWrites + $MaliLoadStoreUnitCyclesPartialWrites
Hardware derivation:
LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT
Full writes
This counter increments for every full-width load/store cache write.
MaliLSFullWr
$MaliLoadStoreUnitCyclesFullWrites
LS_MEM_WRITE_FULL
Partial writes
This counter increments for every partial-width load/store cache write. Partial data accesses do not make full use of the load/store cache capability. Merging short accesses together to make fewer larger requests improves efficiency (see the worked sketch after this counter entry). To do this in shader code:
- Use vector data stores.
- Avoid padding in strided data accesses.
- Write compute shaders so that adjacent threads in a warp access adjacent addresses in memory.
MaliLSPartWr
$MaliLoadStoreUnitCyclesPartialWrites
LS_MEM_WRITE_SHORT
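To judge how much of the load/store traffic is affected by the advice above, it can help to express partial accesses as a fraction of all accesses. This is not a metric defined by this document, just an illustrative ratio built from the full and partial read and write counters; the following Python sketch uses invented values.

# Hypothetical raw counter samples (invented values, for illustration only).
LS_MEM_READ_FULL = 400_000
LS_MEM_READ_SHORT = 100_000
LS_MEM_WRITE_FULL = 150_000
LS_MEM_WRITE_SHORT = 50_000

# Illustrative ratios: the share of load/store cycles that only partially use a cache line.
partial_read_rate = LS_MEM_READ_SHORT / (LS_MEM_READ_FULL + LS_MEM_READ_SHORT) * 100
partial_write_rate = LS_MEM_WRITE_SHORT / (LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT) * 100
print(f"Partial reads {partial_read_rate:.1f}%, partial writes {partial_write_rate:.1f}%")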
Atomic accesses
This counter increments for every atomic access.
Atomic memory accesses are typically multicycle operations per thread in the warp, so they are exceptionally expensive. Minimize the use of atomics in performance critical code. For some types of atomic operation, it can be beneficial to perform a warp-wide reduction using subgroup operations and then use a single thread to update the atomic value.
MaliLSAtomic
$MaliLoadStoreUnitCyclesAtomicAccesses
LS_MEM_ATOMIC
Shader Core Varying Unit
The varying unit in the shader core handles all vertex data interpolation in fragment shaders.
Performance counters in this section show the breakdown of performed interpolation operations.
Varying Unit Requests
This counter group shows the number of requests made to the varying interpolation unit.
Interpolation requests
This counter increments for every warp-width interpolation operation processed by the varying unit.
MaliVarInstr
$MaliVaryingUnitRequestsInterpolationRequests
VARY_INSTR
16-bit interpolation slots
This counter increments for every 16-bit interpolation slot processed by the varying unit.
The width of each slot and the number of slots is GPU dependent.
MaliVar16IssueSlot
$MaliVaryingUnitRequests16BitInterpolationSlots
VARY_SLOT_16
32-bit interpolation slots
This counter increments for every 32-bit interpolation slot processed by the varying unit. 32-bit interpolation is half the performance of 16-bit interpolation, so if content is varying bound, consider reducing the precision of varying inputs to fragment shaders.
The width of each slot and the number of slots is GPU dependent.
MaliVar32IssueSlot
$MaliVaryingUnitRequests32BitInterpolationSlots
VARY_SLOT_32
Varying Unit Cycles
This counter group shows the number of cycles when work was issued to the varying interpolation unit.
Varying unit issues
This expression defines the total number of cycles when the varying interpolator is issuing operations.
MaliVarIssueCy
libGPUCounters derivation:
(MaliVar32IssueSlot / 4) + (MaliVar16IssueSlot / 4)
Streamline derivation:
($MaliVaryingUnitRequests32BitInterpolationSlots / 4) + ($MaliVaryingUnitRequests16BitInterpolationSlots / 4)
Hardware derivation:
(VARY_SLOT_32 / 4) + (VARY_SLOT_16 / 4)
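The division by 4 in the derivations above reflects the number of interpolation slots the varying unit can process per cycle, matching the Varying slot count constant listed later in this document. A minimal Python sketch with invented values:

# Hypothetical raw counter samples (invented values, for illustration only).
VARY_SLOT_16 = 200_000
VARY_SLOT_32 = 600_000

SLOTS_PER_CYCLE = 4   # see the "Varying slot count" constant later in this document

vary_issue_cycles = (VARY_SLOT_32 / SLOTS_PER_CYCLE) + (VARY_SLOT_16 / SLOTS_PER_CYCLE)
print(vary_issue_cycles)   # 200000.0 issue cycles for these values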
16-bit interpolation issues
This expression defines the number of 16-bit interpolation cycles processed by the varying unit.
MaliVar16IssueCy
libGPUCounters derivation:
MaliVar16IssueSlot / 4
Streamline derivation:
$MaliVaryingUnitRequests16BitInterpolationSlots / 4
Hardware derivation:
VARY_SLOT_16 / 4
32-bit interpolation issues
This expression defines the number of 32-bit interpolation cycles processed by the varying unit. 32-bit interpolation is half the performance of 16-bit interpolation, so if content is varying bound, consider reducing the precision of varying inputs to fragment shaders.
MaliVar32IssueCy
libGPUCounters derivation:
MaliVar32IssueSlot / 4
Streamline derivation:
$MaliVaryingUnitRequests32BitInterpolationSlots / 4
Hardware derivation:
VARY_SLOT_32 / 4
Shader Core Texture Unit
The texture unit in the shader core handles all read-only texture access and filtering.
Performance counters in this section show the breakdown of performed texturing operations, and use of sub-units inside the texturing hardware.
Texture Unit Requests
This counter group shows the number of requests made to the texture unit.
Texture samples
This expression defines the number of texture samples made.
MaliTexSample
libGPUCounters derivation:
((MaliTexOutMsg * 2) - MaliTexOutSingleMsg) * 4
Streamline derivation:
(($MaliTextureUnitQuadsTextureMessages * 2) - $MaliTextureUnitQuadsTextureMessagesWithSingleQuad) * 4
Hardware derivation:
((TEX_MSGO_NUM_MSG * 2) - TEX_MSGO_NUM_SINGLE_QUAD_MSG) * 4
Texture Unit Quads
This counter group shows the number of fragment quads submitted to the texture unit for sampling.
Texture requests
This counter increments for every quad-width texture operation processed by the texture unit.
MaliTexQuads
libGPUCounters derivation:
(MaliTexOutMsg * 2) - MaliTexOutSingleMsg
Streamline derivation:
($MaliTextureUnitQuadsTextureMessages * 2) - $MaliTextureUnitQuadsTextureMessagesWithSingleQuad
Hardware derivation:
(TEX_MSGO_NUM_MSG * 2) - TEX_MSGO_NUM_SINGLE_QUAD_MSG
Texture Unit Cycles
This counter group shows the number of cycles when work was issued to the sub-units inside the texture unit.
Texture unit clock active
This counter increments for every clock cycle when the texture unit is active, and contains an active sample message.
A high active time does not imply a high utilization. The unit counts as active even if only a single pipeline stage is busy.
MaliTexClkActiveCy
$MaliTextureUnitCyclesTextureUnitClockActive
TEX_TEXP_CLK_ACTIVE
Texture unit issues
This expression measures the number of cycles the texture unit was busy processing work.
MaliTexIssueCy
libGPUCounters derivation:
max(MaliTexFiltIssueCy, MaliTexCacheLookupCy, MaliTexCacheSimpleLoadCy, MaliTexCacheComplexLoadCy, MaliTexInBt, MaliTexOutBt, MaliTexL1CacheOutputCy, MaliTexL1CacheLookupCy, MaliTexIndexCy)
Streamline derivation:
max($MaliTextureUnitCyclesFilteringActive, $MaliTextureUnitCacheCyclesCacheLookupActive, $MaliTextureUnitCacheCyclesSimpleLoadActive, $MaliTextureUnitCacheCyclesComplexLoadActive, $MaliTextureUnitBusInputBeats, $MaliTextureUnitBusOutputBeats, $MaliTextureUnitCacheCyclesL1OutputActive, $MaliTextureUnitCacheCyclesL1LookupActive, $MaliTextureUnitCyclesIndexCalculationActive)
Hardware derivation:
max(TEX_FILT_NUM_OPERATIONS, TEX_TFCH_NUM_TCL_OPERATIONS, TEX_CFCH_NUM_DIRECT_PATH_OPERATIONS, TEX_CFCH_NUM_RP_OPERATIONS, TEX_MSGI_NUM_FLITS, TEX_RSPS_NUM_OPERATIONS, TEX_CFCH_NUM_L1_CL_OPERATIONS, TEX_CFCH_NUM_L1_CT_OPERATIONS, TEX_TIDX_NUM_OPERATIONS)
Index calculation active
This counter increments for every clock cycle when the texture unit is computing a texel index value.
MaliTexIndexCy
$MaliTextureUnitCyclesIndexCalculationActive
TEX_TIDX_NUM_OPERATIONS
Filtering active
This counter increments for every texture filtering issue cycle. This GPU can do 8x 2D bilinear texture samples per clock. More complex filtering operations are composed of multiple 2D bilinear samples, and take proportionally more filtering time to complete. The scaling factors for more expensive operations are:
- 2D trilinear filtering runs at half speed.
- 3D bilinear filtering runs at half speed.
- 3D trilinear filtering runs at quarter speed.
Anisotropic filtering makes up to MAX_ANISOTROPY filtered subsamples of the current base filter type. For example, using trilinear filtering with a MAX_ANISOTROPY of 3 will require up to 6 bilinear filters.
MaliTexFiltIssueCy
$MaliTextureUnitCyclesFilteringActive
TEX_FILT_NUM_OPERATIONS
Texture Unit Stall Cycles
This counter group shows the number of stall cycles when work could not be issued to the sub-units inside the texture unit.
Texture causing starvation
This counter increments for every clock cycle when the texture unit is active and a response message could be accepted, but no response was generated.
A high value here can indicate inefficient texturing operations that are failing to sustain full throughput. This can be caused by a high miss rate in either the descriptor cache or the data cache, or by complex multicycle filtering operations.
MaliTexClkStarvedCy
$MaliTextureUnitStallCyclesTextureCausingStarvation
TEX_MSGI_CLK_STARVED
Descriptor stalls
This counter increments for every clock cycle a quad is stalled on texture descriptor fetch. This might not correspond to a stall cycle in the filtering unit if there is enough work already buffered after the descriptor fetcher to hide the stall.
MaliTexDescStallCy
$MaliTextureUnitStallCyclesDescriptorStalls
TEX_DFCH_CLK_STALLED
Fetch queue stalls
This counter increments for every clock cycle a quad is stalled on entering texture fetch because the fetch queue is full. This might not correspond to a stall cycle in the filtering unit if there is enough work already buffered to hide the stall.
MaliTexDataFetchStallCy
$MaliTextureUnitStallCyclesFetchQueueStalls
TEX_TFCH_CLK_STALLED
Filtering unit stalls
This counter increments for every clock cycle the filtering unit is idle and there is at least one quad present in the texture data fetch queue. A high stall rate here can be indicative of content which is failing to make good use of the texture cache. For example, under-sampling from a high resolution texture.
MaliTexFiltStallCy
$MaliTextureUnitStallCyclesFilteringUnitStalls
TEX_TFCH_STARVED_PENDING_DATA_FETCH
Texture Unit CPI
This counter group shows the average cost of texture samples.
Texture CPI
This expression defines the average number of texture filtering cycles per instruction. For texture-limited content that has a CPI higher than the optimal value for this core, which can filter 8 samples per cycle, consider using simpler texture filters. See Texture unit issue cycles for details of the expected performance for different types of operation.
MaliTexCPI
libGPUCounters derivation:
max(MaliTexFiltIssueCy, MaliTexCacheLookupCy, MaliTexCacheSimpleLoadCy, MaliTexCacheComplexLoadCy, MaliTexInBt, MaliTexOutBt, MaliTexL1CacheOutputCy, MaliTexL1CacheLookupCy, MaliTexIndexCy) / (((MaliTexOutMsg * 2) - MaliTexOutSingleMsg) * 4)
Streamline derivation:
max($MaliTextureUnitCyclesFilteringActive, $MaliTextureUnitCacheCyclesCacheLookupActive, $MaliTextureUnitCacheCyclesSimpleLoadActive, $MaliTextureUnitCacheCyclesComplexLoadActive, $MaliTextureUnitBusInputBeats, $MaliTextureUnitBusOutputBeats, $MaliTextureUnitCacheCyclesL1OutputActive, $MaliTextureUnitCacheCyclesL1LookupActive, $MaliTextureUnitCyclesIndexCalculationActive) / ((($MaliTextureUnitQuadsTextureMessages * 2) - $MaliTextureUnitQuadsTextureMessagesWithSingleQuad) * 4)
Hardware derivation:
max(TEX_FILT_NUM_OPERATIONS, TEX_TFCH_NUM_TCL_OPERATIONS, TEX_CFCH_NUM_DIRECT_PATH_OPERATIONS, TEX_CFCH_NUM_RP_OPERATIONS, TEX_MSGI_NUM_FLITS, TEX_RSPS_NUM_OPERATIONS, TEX_CFCH_NUM_L1_CL_OPERATIONS, TEX_CFCH_NUM_L1_CT_OPERATIONS, TEX_TIDX_NUM_OPERATIONS) / (((TEX_MSGO_NUM_MSG * 2) - TEX_MSGO_NUM_SINGLE_QUAD_MSG) * 4)
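A minimal Python sketch of the CPI expression, using invented raw counter values. For brevity it assumes that texture filtering is the busiest sub-unit, whereas the full derivation takes the maximum across all of the texture sub-unit cycle counters listed above.

# Hypothetical raw counter samples (invented values, for illustration only).
TEX_MSGO_NUM_MSG = 60_000
TEX_MSGO_NUM_SINGLE_QUAD_MSG = 10_000
TEX_FILT_NUM_OPERATIONS = 500_000

# A texture message carries up to two quads; single-quad messages carry only one.
quads = (TEX_MSGO_NUM_MSG * 2) - TEX_MSGO_NUM_SINGLE_QUAD_MSG
samples = quads * 4   # four samples per quad

# Simplification for the example: treat filtering as the busiest sub-unit.
texture_cpi = TEX_FILT_NUM_OPERATIONS / samples
print(f"{samples} samples, CPI = {texture_cpi:.3f}")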
Texture Unit Utilization
This counter group shows the use of some of the functional units and data paths inside the texture unit, relative to their speed-of-light capability.
Input bus utilization
This expression defines the percentage load on the texture message input bus.
If bus utilization is higher than the filtering unit utilization, your content might be limited by texture operation parameter passing. Requests that require more input parameters, such as 3D accesses, array accesses, and accesses using an explicit level-of-detail, place a higher load on the bus than basic 2D texture operations.
MaliTexInBusUtil
libGPUCounters derivation:
max(min((MaliTexInBt / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliTextureUnitBusInputBeats / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((TEX_MSGI_NUM_FLITS / EXEC_CORE_ACTIVE) * 100, 100), 0)
Output bus utilization
This expression defines the percentage load on the texture message output bus.
If bus utilization is higher than the filtering unit utilization, your content might be limited by texture result return. Requests that require a higher precision sampler return type place a higher load on the bus, so use 16-bit sampler precision whenever possible.
MaliTexOutBusUtil
libGPUCounters derivation:
max(min((MaliTexOutBt / MaliCoreActiveCy) * 100, 100), 0)
Streamline derivation:
max(min(($MaliTextureUnitBusOutputBeats / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)
Hardware derivation:
max(min((TEX_RSPS_NUM_OPERATIONS / EXEC_CORE_ACTIVE) * 100, 100), 0)
Texture Unit Cache Cycles
This counter group shows the number of cache access cycles for the data caches inside the texture unit.
Cache lookup active
This counter increments for every clock cycle when the texture cache is returning data.
A high value here can indicate an inefficient post-decompression texture format. For example, a 64-bpp format takes twice as long as a 32-bpp format.
MaliTexCacheLookupCy
$MaliTextureUnitCacheCyclesCacheLookupActive
TEX_TFCH_NUM_TCL_OPERATIONS
L1 load active
This counter increments for every clock cycle when data is being transferred into the L1 texture cache.
A high value here can be indicative of a high cache miss rate in the texture data cache.
MaliTexL1CacheLoadCy
$MaliTextureUnitCacheCyclesL1LoadActive
TEX_CFCH_NUM_OUTPUT_OPERATIONS
L1 lookup active
This counter increments for every clock cycle when the L1 texture cache is being accessed.
A high value here can be indicative of a high cache miss rate in the texture data cache.
MaliTexL1CacheLookupCy
$MaliTextureUnitCacheCyclesL1LookupActive
TEX_CFCH_NUM_L1_CT_OPERATIONS
L1 output active
This counter increments for every clock cycle when the L1 texture cache is returning data for a sampling operation.
A high value here can be indicative of a high cache miss rate in the texture data cache.
MaliTexL1CacheOutputCy
$MaliTextureUnitCacheCyclesL1OutputActive
TEX_CFCH_NUM_L1_CL_OPERATIONS
Simple load active
This counter increments for every clock cycle when the texture cache is fetching formats with simple data layouts using the direct path.
A high value here can be indicative of a high cache miss rate in the texture data cache.
MaliTexCacheSimpleLoadCy
$MaliTextureUnitCacheCyclesSimpleLoadActive
TEX_CFCH_NUM_DIRECT_PATH_OPERATIONS
Complex load active
This counter increments for every clock cycle when the texture cache is fetching formats with complex data layouts using the decompressor path.
A high value here can be indicative of a high cache miss rate in the texture data cache.
MaliTexCacheComplexLoadCy
$MaliTextureUnitCacheCyclesComplexLoadActive
TEX_CFCH_NUM_RP_OPERATIONS
Texture Unit Bus
This counter group shows the number of bus cycles used on the texture unit memory bus connecting the texture unit to the rest of the shader core.
Shader Core Ray Tracing Unit
The ray tracing unit in the shader core handles acceleration structure traversal, as well as bounding box and triangle intersection testing.
Performance counters in this section show the breakdown of performed ray tracing operations, and use of sub-units inside the ray tracing unit.
Ray Tracing Unit Cycles
This counter group shows the number of cycles when work was issued to the ray tracing unit and the various sub-units inside it.
Ray tracing unit active
This counter defines the total number of cycles when the ray tracing unit is active with at least one operation.
MaliRTUActiveCy
$MaliRayTracingUnitCyclesRayTracingUnitActive
RT_ACTIVE
Ray tracing issues
This expression defines the total number of cycles when the ray tracing unit was issuing work to a functional unit.
MaliRTUIssueCy
libGPUCounters derivation:
max(MaliRTUBoxIssueCy, MaliRTUTriIssueCy)
Streamline derivation:
max($MaliRayTracingUnitCyclesBoxTesterIssues, $MaliRayTracingUnitCyclesTriangleTesterIssues)
Hardware derivation:
max(RT_BOX_ISSUE_CYCLES, RT_TRI_ISSUE_CYCLES)
Box tester issues
This counter increments for every clock cycle the ray tracing unit issues a box intersection operation. If this counter is a high percentage of shader core active, then shader performance might be limited by acceleration structure traversal.
The main workload for ray tracing is traversing the acceleration structure so this counter is expected to be high. If the counter is not high, and a significant number of rays are being used, it indicates that a bottleneck exists elsewhere.
MaliRTUBoxIssueCy
$MaliRayTracingUnitCyclesBoxTesterIssues
RT_BOX_ISSUE_CYCLES
Triangle tester issues
This counter increments for every clock cycle the ray tracing unit issues a triangle intersection test. If this counter is a high percentage of shader core active, then shader performance might be limited by triangle testing.
A good acceleration structure culls most triangles using box tests higher up the tree, so that rays do not need to be tested against them. If this counter is high it might indicate an issue with either geometry complexity or acceleration structure efficiency.
MaliRTUTriIssueCy
$MaliRayTracingUnitCyclesTriangleTesterIssues
RT_TRI_ISSUE_CYCLES
Ray Tracing Unit Rays
This counter group shows the number of rays processed by the ray tracing unit, including a breakdown of any interesting ray properties.
Started rays
This counter increments for every ray that is started and tested against the root node in the acceleration structure.
MaliRTURay
$MaliRayTracingUnitRaysStartedRays
RT_RAYS_STARTED
Resumed rays
This counter defines the number of rays that are resumed after an initial hit or intersection shader.
MaliRTUResumeTraceRays
$MaliRayTracingUnitRaysResumedRays
RT_TRACE_RESUME
First hit terminated rays
This counter increments for every ray that terminates on its first triangle hit. Rays that terminate on first hit are more efficient to process, as they do not need to keep testing to find the closest hit.
First-hit tests are well suited to techniques that determine occlusion, such as shadow rays. In these use cases you don't need to know which object is hit, just that an object was hit between the ray source and destination.
MaliRTUFirstHitTerm
$MaliRayTracingUnitRaysFirstHitTerminatedRays
RT_TERM_FIRST_HIT
Deep traversal rays
This counter defines the number of rays that use a deep traversal stack.
MaliRTUStackOverflows
$MaliRayTracingUnitRaysDeepTraversalRays
RT_TRAVERSAL_STACK_OVERFLOW
Triangle misses
This counter increments for every ray-triangle intersection test that does not intersect the triangle.
Most triangles that a ray misses are expected to be culled by box tests during acceleration structure traversal, so if rays are triggering a high number of triangle intersection tests, try improving the acceleration structure quality.
A high number for this counter might also indicate a programming error, such as using opaque triangles and requesting that opaque hits be culled.
MaliRTUMiss
$MaliRayTracingUnitRaysTriangleMisses
RT_MISS
Ray Tracing Unit Workload
This counter group shows the number of messages processed by the ray tracing unit.
New ray trace messages
This counter defines the number of ray tracing messages starting new rays.
MaliRTUNewTraceInstr
$MaliRayTracingUnitWorkloadNewRayTraceMessages
RT_TRACE_MSG_NEW
Ray Tracing Unit Instance Workload
This counter group shows the number of bottom-level acceleration structures processed by the ray tracing unit.
BLAS instances
This counter defines the number of BLAS instances processed by the ray tracing unit. Some instances might be culled without traversing the BLAS itself.
MaliRTUBLASIssue
$MaliRayTracingUnitInstanceWorkloadBLASInstances
RT_RAY_INSTANCE
Ray Tracing Unit Box Workload
This counter group shows the number of bounding box tests processed by the box test unit, including the split between TLAS and BLAS box tests.
Box tests
This counter increments for every bounding box tested by the ray tracing unit box intersection operation.
The main workload for ray tracing is traversing the acceleration structure so this counter is expected to be high. If the counter is not high, and a significant number of rays are being used, it indicates that a bottleneck exists elsewhere.
MaliRTUBoxIssue
$MaliRayTracingUnitBoxWorkloadBoxTests
RT_BOX_ISSUE_COUNT
TLAS box tests
This counter increments for every top level acceleration structure bounding box tested by the ray tracing unit box intersection operation.
The main workload for ray tracing is traversing the acceleration structure so this counter is expected to be high. If the counter is not high, and a significant number of rays are being used, it indicates that a bottleneck exists elsewhere.
MaliRTUTLASBoxIssue
$MaliRayTracingUnitBoxWorkloadTLASBoxTests
RT_RAY_BOX_TLAS
Ray Tracing Unit Triangle Workload
This counter group shows the number of triangles processed by the triangle test unit, including key properties of the triangles.
Triangle tests
This counter increments for every triangle tested by the ray tracing unit triangle intersection operation.
A good acceleration structure culls most triangles using box tests higher up the tree, so that rays do not need to be tested against them. If this counter is high it might indicate an issue with either geometry complexity or acceleration structure efficiency.
MaliRTUTriCull
$MaliRayTracingUnitTriangleWorkloadTriangleTests
RT_TRI_ISSUE_COUNT
Opaque triangle hits
This counter increments for every ray intersection with an opaque triangle.
MaliRTUOpaqueHit
$MaliRayTracingUnitTriangleWorkloadOpaqueTriangleHits
RT_OPAQUE_HIT
Non-opaque triangle hits
This counter increments for every ray intersection with a non-opaque triangle.
Non-opaque triangles are more expensive to process than opaque triangles, so Arm recommends using opaque triangles in acceleration structures.
MaliRTUNonOpaqueHit
$MaliRayTracingUnitTriangleWorkloadNonOpaqueTriangleHits
RT_NON_OPAQUE_HIT
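The ray and test counters above can be combined into simple per-ray ratios to judge acceleration structure quality. These ratios are not defined by this document; the Python sketch below is illustrative only and uses invented values.

# Hypothetical raw counter samples (invented values, for illustration only).
RT_RAYS_STARTED = 100_000
RT_BOX_ISSUE_COUNT = 2_500_000
RT_TRI_ISSUE_COUNT = 400_000
RT_OPAQUE_HIT = 60_000
RT_NON_OPAQUE_HIT = 5_000

# Illustrative per-ray ratios (not counters defined by this document).
box_tests_per_ray = RT_BOX_ISSUE_COUNT / RT_RAYS_STARTED
tri_tests_per_ray = RT_TRI_ISSUE_COUNT / RT_RAYS_STARTED
tri_hit_rate = (RT_OPAQUE_HIT + RT_NON_OPAQUE_HIT) / RT_TRI_ISSUE_COUNT * 100
print(f"{box_tests_per_ray:.1f} box tests/ray, "
      f"{tri_tests_per_ray:.1f} triangle tests/ray, "
      f"{tri_hit_rate:.1f}% of triangle tests hit")

A high number of triangle tests per ray combined with a low hit rate is a sign that box tests higher up the tree are not culling effectively.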
Ray Tracing Unit Cache
This counter group shows the number of acceleration structure node cache hits.
Shader Core Other Units
In addition to the main units, covered in earlier sections, the shader core has several other units that can be measured.
Performance counters in this section show the workload on these other units.
Attribute Unit Cycles
This counter group shows the number of cycles when work was issued to the attribute unit.
Attribute Unit Requests
This counter group shows the number of requests made to the attribute unit.
Attribute requests
This counter increments for every instruction run by the attribute unit.
Each instruction converts a logical attribute access into a pointer-based access, which is then processed by the load/store unit.
MaliAttrInstr
$MaliAttributeUnitRequestsAttributeRequests
ATTR_INSTR
Shader Core Memory Access
GPUs are data-plane processors, so understanding your memory bandwidth and where it is coming from is a critical piece of knowledge when trying to improve performance.
Performance counters in this section show the breakdown of memory accesses by shader core hardware unit, showing the total amount of read and write bandwidth being generated by the shader core.
Read bandwidth is split to show how much was provided by the GPU L2 cache and how much was provided by the external memory system. Write bandwidth does not have an equivalent split, and it is not possible to tell from the counters if a write went to L2 or directly to external memory.
Shader Core L2 Reads
This counter group shows the number of shader core read transactions served from the L2 cache, broken down by hardware unit inside the shader core.
Fragment front-end beats
This counter increments for every read beat received by the fixed-function fragment front-end.
MaliSCBusFFEL2RdBt
$MaliShaderCoreL2ReadsFragmentFrontEndBeats
BEATS_RD_FTC
Load/store unit beats
This counter increments for every read beat received by the load/store unit.
MaliSCBusLSL2RdBt
$MaliShaderCoreL2ReadsLoadStoreUnitBeats
BEATS_RD_LSC
Texture unit beats
This counter increments for every read beat received by the texture unit.
MaliSCBusTexL2RdBt
$MaliShaderCoreL2ReadsTextureUnitBeats
BEATS_RD_TEX
Shader Core External Reads
This counter group shows the number of shader core read transactions served from external memory, broken down by hardware unit inside the shader core.
Fragment front-end beats
This counter increments for every read beat received by the fixed-function fragment front-end that required an external memory access because of an L2 cache miss.
MaliSCBusFFEExtRdBt
$MaliShaderCoreExternalReadsFragmentFrontEndBeats
BEATS_RD_FTC_EXT
Load/store unit beats
This counter increments for every read beat received by the load/store unit that required an external memory access because of an L2 cache miss.
MaliSCBusLSExtRdBt
$MaliShaderCoreExternalReadsLoadStoreUnitBeats
BEATS_RD_LSC_EXT
Texture unit beats
This counter increments for every read beat received by the texture unit that required an external memory access because of an L2 cache miss.
MaliSCBusTexExtRdBt
$MaliShaderCoreExternalReadsTextureUnitBeats
BEATS_RD_TEX_EXT
Ray tracing unit beats
This counter increments for every read beat received by the ray tracing unit that required an external memory access because of an L2 cache miss.
MaliSCBusRTUExtRdBt
$MaliShaderCoreExternalReadsRayTracingUnitBeats
BEATS_RD_RTU_EXT
Shader Core L2 Writes
This counter group shows the number of shader core write transactions, broken down by hardware unit inside the shader core.
Load/store unit beats
This counter increments for every write beat sent by the load/store unit.
MaliSCBusLSWrBt
$MaliShaderCoreL2WritesLoadStoreUnitBeats
BEATS_WR_LSC
Shader Core L2 Read Bytes
This counter group shows the number of bytes read from the L2 cache by the shader core, broken down by hardware unit inside the shader core.
Fragment front-end bytes
This expression defines the total number of bytes read from the L2 memory system by the fragment front-end.
MaliSCBusFFEL2RdBy
libGPUCounters derivation:
MaliSCBusFFEL2RdBt * 16
Streamline derivation:
$MaliShaderCoreL2ReadsFragmentFrontEndBeats * 16
Hardware derivation:
BEATS_RD_FTC * 16
Load/store unit bytes
This expression defines the total number of bytes read from the L2 memory system by the load/store unit.
MaliSCBusLSL2RdBy
libGPUCounters derivation:
MaliSCBusLSL2RdBt * 16
Streamline derivation:
$MaliShaderCoreL2ReadsLoadStoreUnitBeats * 16
Hardware derivation:
BEATS_RD_LSC * 16
Texture unit bytes
This expression defines the total number of bytes read from the L2 memory system by the texture unit.
MaliSCBusTexL2RdBy
libGPUCounters derivation:
MaliSCBusTexL2RdBt * 16
Streamline derivation:
$MaliShaderCoreL2ReadsTextureUnitBeats * 16
Hardware derivation:
BEATS_RD_TEX * 16
Ray tracing unit bytes
This expression defines the total number of bytes read from the L2 memory system by the ray tracing unit.
MaliSCBusRTUL2RdBy
libGPUCounters derivation:
MaliSCBusRTUL2RdBt * 16
Streamline derivation:
$MaliShaderCoreL2ReadsRayTracingUnitBeats * 16
Hardware derivation:
BEATS_RD_RTU * 16
Shader Core External Read Bytes
This counter group shows the number of bytes read from external memory by the shader core, broken down by hardware unit inside the shader core.
Fragment front-end bytes
This expression defines the total number of bytes read from the external memory system by the fragment front-end.
MaliSCBusFFEExtRdBy
libGPUCounters derivation:
MaliSCBusFFEExtRdBt * 16
Streamline derivation:
$MaliShaderCoreExternalReadsFragmentFrontEndBeats * 16
Hardware derivation:
BEATS_RD_FTC_EXT * 16
Load/store unit bytes
This expression defines the total number of bytes read from the external memory system by the load/store unit.
MaliSCBusLSExtRdBy
libGPUCounters derivation:
MaliSCBusLSExtRdBt * 16
Streamline derivation:
$MaliShaderCoreExternalReadsLoadStoreUnitBeats * 16
Hardware derivation:
BEATS_RD_LSC_EXT * 16
Texture unit bytes
This expression defines the total number of bytes read from the external memory system by the texture unit.
MaliSCBusTexExtRdBy
libGPUCounters derivation:
MaliSCBusTexExtRdBt * 16
Streamline derivation:
$MaliShaderCoreExternalReadsTextureUnitBeats * 16
Hardware derivation:
BEATS_RD_TEX_EXT * 16
Ray tracing unit bytes
This expression defines the total number of bytes read from the external memory system by the ray tracing unit.
MaliSCBusRTUExtRdBy
libGPUCounters derivation:
MaliSCBusRTUExtRdBt * 16
Streamline derivation:
$MaliShaderCoreExternalReadsRayTracingUnitBeats * 16
Hardware derivation:
BEATS_RD_RTU_EXT * 16
Shader Core L2 Write Bytes
This counter group shows the number of bytes written by the shader core, broken down by hardware unit inside the shader core.
These writes are written to the L2 memory system, but counters cannot determine if the write was written to the L2 cache or directly to external memory.
Load/store unit bytes
This expression defines the total number of bytes written to the L2 memory system by the load/store unit.
MaliSCBusLSWrBy
libGPUCounters derivation:
MaliSCBusLSWrBt * 16
Streamline derivation:
$MaliShaderCoreL2WritesLoadStoreUnitBeats * 16
Hardware derivation:
BEATS_WR_LSC * 16
Tile unit bytes
This expression defines the total number of bytes written to the L2 memory system by the tile write-back unit.
MaliSCBusTileWrBy
libGPUCounters derivation:
MaliSCBusTileWrBt * 16
Streamline derivation:
$MaliShaderCoreL2WritesTileUnitBeats * 16
Hardware derivation:
BEATS_WR_TIB * 16
Other unit bytes
This expression defines the total number of bytes written to the L2 memory system by any unit that is not identified as a specific data source.
MaliSCBusOtherWrBy
libGPUCounters derivation:
MaliSCBusOtherWrBt * 16
Streamline derivation:
$MaliShaderCoreL2WritesOtherUnitBeats * 16
Hardware derivation:
BEATS_WR_OTHER * 16
Load/Store Unit Bytes/Cycle
This counter group shows the number of bytes accessed in the L2 cache and external memory per load/store cache access cycle. This gives some measure of how effectively the GPU is caching load/store data.
L2 read bytes/cy
This expression defines the average number of bytes read from the L2 memory system by the load/store unit per read cycle. This metric gives some idea of how effectively data is being cached in the L1 load/store cache.
If more bytes are being requested per access than you would expect for the data layout you are using, review your data layout and access patterns.
MaliSCBusLSL2RdByPerRd
libGPUCounters derivation:
(MaliSCBusLSL2RdBt * 16) / (MaliLSFullRd + MaliLSPartRd)
Streamline derivation:
($MaliShaderCoreL2ReadsLoadStoreUnitBeats * 16) / ($MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads)
Hardware derivation:
(BEATS_RD_LSC * 16) / (LS_MEM_READ_FULL + LS_MEM_READ_SHORT)
L2 write bytes/cy
This expression defines the average number of bytes written to the L2 memory system by the load/store unit per write cycle.
If more bytes are being written per access than you would expect for the data layout you are using, review your data layout and access patterns to improve cache locality.
MaliSCBusLSWrByPerWr
libGPUCounters derivation:
(MaliSCBusLSWrBt * 16) / (MaliLSFullWr + MaliLSPartWr)
Streamline derivation:
($MaliShaderCoreL2WritesLoadStoreUnitBeats * 16) / ($MaliLoadStoreUnitCyclesFullWrites + $MaliLoadStoreUnitCyclesPartialWrites)
Hardware derivation:
(BEATS_WR_LSC * 16) / (LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT)
External read bytes/cy
This expression defines the average number of bytes read from the external memory system by the load/store unit per read cycle. This metric indicates how effectively data is being cached in the L2 cache.
If more bytes are being requested per access than you would expect for the data layout you are using, review your data layout and access patterns.
MaliSCBusLSExtRdByPerRd
libGPUCounters derivation:
(MaliSCBusLSExtRdBt * 16) / (MaliLSFullRd + MaliLSPartRd)
Streamline derivation:
($MaliShaderCoreExternalReadsLoadStoreUnitBeats * 16) / ($MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads)
Hardware derivation:
(BEATS_RD_LSC_EXT * 16) / (LS_MEM_READ_FULL + LS_MEM_READ_SHORT)
Texture Unit Bytes/Cycle
This counter group shows the number of bytes accessed in the L2 cache and external memory per texture sample. This gives some measure of how effectively the GPU is caching texture data.
L2 read bytes/cy
This expression defines the average number of bytes read from the L2 memory system by the texture unit per filtering cycle. This metric indicates how effectively textures are being cached in the L1 texture cache.
If more bytes are being requested per access than you would expect for the format you are using, review your texture settings. Arm recommends:
- Using mipmaps for offline generated textures.
- Using ASTC or ETC compression for offline generated textures.
- Replacing runtime framebuffer formats with narrower formats.
- Reducing use of imageLoad/Store to allow framebuffer compression.
- Reducing use of negative LOD bias used for texture sharpening.
- Reducing use of anisotropic filtering, or reducing the level of MAX_ANISOTROPY used.
MaliSCBusTexL2RdByPerRd
libGPUCounters derivation:
(MaliSCBusTexL2RdBt * 16) / MaliTexFiltIssueCy
Streamline derivation:
($MaliShaderCoreL2ReadsTextureUnitBeats * 16) / $MaliTextureUnitCyclesFilteringActive
Hardware derivation:
(BEATS_RD_TEX * 16) / TEX_FILT_NUM_OPERATIONS
External read bytes/cy
This expression defines the average number of bytes read from the external memory system by the texture unit per filtering cycle. This metric indicates how effectively textures are being cached in the L2 cache.
If more bytes are being requested per access than you would expect for the format you are using, review your texture settings. Arm recommends:
- Using mipmaps for offline generated textures.
- Using ASTC or ETC compression for offline generated textures.
- Replacing runtime framebuffer formats with narrower formats.
- Reducing use of imageLoad/Store to allow framebuffer compression.
- Reducing use of negative LOD bias used for texture sharpening.
- Reducing use of anisotropic filtering, or reducing the level of MAX_ANISOTROPY used.
MaliSCBusTexExtRdByPerRd
libGPUCounters derivation:
(MaliSCBusTexExtRdBt * 16) / MaliTexFiltIssueCy
Streamline derivation:
($MaliShaderCoreExternalReadsTextureUnitBeats * 16) / $MaliTextureUnitCyclesFilteringActive
Hardware derivation:
(BEATS_RD_TEX_EXT * 16) / TEX_FILT_NUM_OPERATIONS
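A minimal Python sketch computing both the L2 and the external read bytes per filtering cycle from invented raw counter values, using the 16 bytes-per-beat conversion from the expressions above.

# Hypothetical raw counter samples (invented values, for illustration only).
BEATS_RD_TEX = 1_200_000       # texture read beats served by the L2 memory system
BEATS_RD_TEX_EXT = 250_000     # texture read beats that required an external access
TEX_FILT_NUM_OPERATIONS = 900_000

BYTES_PER_BEAT = 16   # beat-to-byte conversion used by the expressions above

l2_bytes_per_cycle = (BEATS_RD_TEX * BYTES_PER_BEAT) / TEX_FILT_NUM_OPERATIONS
ext_bytes_per_cycle = (BEATS_RD_TEX_EXT * BYTES_PER_BEAT) / TEX_FILT_NUM_OPERATIONS
print(f"L2 {l2_bytes_per_cycle:.2f} B/cycle, external {ext_bytes_per_cycle:.2f} B/cycle")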
Tile Unit Bytes/Pixel
This counter group shows the number of bytes written by the tile unit per output pixel. This can be used to determine the efficiency of application render pass store configuration.
Applications can minimize the number of bytes stored by following best practices:
- Use the smallest pixel color format that meets your requirements.
- Discard transient attachments that are no longer required at the end of each render pass (Vulkan storeOp=DONT_CARE or storeOp=NONE).
- Use resolve attachments to resolve multi-sampled data into a single value as part of tile write-back and discard the multi-sampled data so that it is not written back to memory.
External write bytes/px
This expression defines the average number of bytes written to the L2 memory system by the tile unit per output pixel.
If more bytes are being written per pixel than expected, Arm recommends:
- Using narrower attachment color formats with fewer bytes per pixel.
- Configuring attachments so that they can use framebuffer compression.
- Invalidating transient attachments to skip writing to memory.
- Using inline multi-sample resolve to skip writing the multi-sampled data to memory.
MaliSCBusTileWrBPerPx
libGPUCounters derivation:
(MaliSCBusTileWrBt * 16) / (MaliMainQueueTask * 4096)
Streamline derivation:
($MaliShaderCoreL2WritesTileUnitBeats * 16) / ($MaliGPUTasksMainPhaseTasks * 4096)
Hardware derivation:
(BEATS_WR_TIB * 16) / (ITER_FRAG_TASK_COMPLETED * 4096)
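The divisor of 4096 in the derivations above is the pixel count of one main phase queue task, because each task covers a 64 x 64 pixel region (see the Main phase queue task size constant later in this document). A minimal Python sketch with invented values:

# Hypothetical raw counter samples (invented values, for illustration only).
BEATS_WR_TIB = 3_000_000           # tile write-back beats
ITER_FRAG_TASK_COMPLETED = 510     # completed main phase tasks

BYTES_PER_BEAT = 16
PIXELS_PER_TASK = 64 * 64          # main phase tasks cover a 64 x 64 pixel region

bytes_per_pixel = (BEATS_WR_TIB * BYTES_PER_BEAT) / (ITER_FRAG_TASK_COMPLETED * PIXELS_PER_TASK)
print(f"{bytes_per_pixel:.2f} bytes written per pixel")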
Tiling
The tiler hardware orchestrates vertex shading, and binning primitives into the tile lists read during fragment shading.
Performance counters in this section show how the tiler processed the binning phase vertex and primitive workload.
This GPU uses deferred vertex shading, so the binning phase vertex shading tracked by the tiler performance counters includes only the position shader. Varying shading is deferred until the main phase of render pass processing for most draw calls.
Tiler Stall Cycles
This counter group shows the number of cycles that individual sub-units inside the tiler were stalled.
Position FIFO full stalls
This counter increments every clock cycle when the tiler has a position shading request that it cannot send to a shader core because the position buffer is full.
MaliTilerPosShadFIFOFullCy
$MaliTilerStallCyclesPositionFIFOFullStalls
POS_FIFO_FULL
Position shading stalls
This counter increments every clock cycle when the tiler has a position shading request that it cannot send to a shader core because the shading request queue is full.
MaliTilerPosShadStallCy
$MaliTilerStallCyclesPositionShadingStalls
POS_SHADER_STALL
Primitive assembly position shading stalls
This counter increments every clock cycle when the late primitive assembly stage is waiting for a position shading request to complete.
MaliTilerPrimAsPosShadStallCy
$MaliTilerStallCyclesPrimitiveAssemblyPositionShadingStalls
PRIMASSY_POS_SHADER_WAIT
Varying shading stalls
This counter increments every clock cycle when the tiler has a varying shading request that it cannot send to a shader core because the shading request queue is full.
MaliTilerVarShadStallCy
$MaliTilerStallCyclesVaryingShadingStalls
VAR_SHADER_STALL
Tiler Vertex Cache
This counter group shows the number of accesses made into the vertex position and varying post-transform caches.
Tiler L2 Accesses
This counter group shows the number of tiler memory transactions into the L2 memory system.
Tiler Shading Requests
This counter group tracks the number of shading requests that are made by the tiler when processing vertex shaders during binning.
Application vertex shaders are split into two pieces, a position shader that computes the vertex position, and a varying shader that computes the remaining vertex shader outputs. The varying shader is only run if a group contains visible vertices that survive primitive culling.
This GPU uses deferred vertex shading, and does not run the varying shader for all primitives in the binning phase.
Position shading requests
This counter increments for every position shading request in the tiler geometry flow. Position shading runs the first part of the vertex shader, computing the position required to perform clipping and culling. A vertex that has been evicted from the post-transform cache must be reshaded if used again, so your index buffers must have good spatial locality of index reuse.
Each request contains 16 vertices.
Note that not all types of draw call use this tiler workflow, so this counter might not account for all submitted geometry.
MaliGeomPosShadTask
$MaliTilerShadingRequestsPositionShadingRequests
POS_SHADER_WARPS
Varying shading requests
This counter increments for every varying shading request in the tiler geometry flow. Varying shading runs the second part of the vertex shader, for any primitive that survives clipping and culling. The same vertex is shaded multiple times if it has been evicted from the post-transform cache before reuse occurs. Keep good spatial locality of index reuse in your index buffers.
Each request contains 16 vertices.
Note that not all types of draw call use this tiler workflow, so this counter might not account for all submitted geometry.
MaliGeomVarShadTask
$MaliTilerShadingRequestsVaryingShadingRequests
VAR_SHADER_WARPS
Partial position shading requests
This counter increments for every partial position shading request from the tiler geometry flow. Partial tasks cannot fill a shader core warp, and can result in lost efficiency.
Each request contains fewer than 16 vertices.
MaliGeomPosShadPartTask
$MaliTilerShadingRequestsPartialPositionShadingRequests
POS_SHADER_PARTIAL_WARPS
Partial varying shading requests
This counter increments for every partial varying shading request from the tiler geometry flow. Partial tasks cannot fill a shader core warp, and can result in lost efficiency.
Each request contains fewer than 16 vertices.
MaliGeomVarShadPartTask
$MaliTilerShadingRequestsPartialVaryingShadingRequests
VAR_SHADER_PARTIAL_WARPS
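Because each request covers up to 16 vertices (the Tiler shader task thread count constant), the request counters can be converted into approximate shaded vertex counts, and the partial request counters into a ratio showing how often warps could not be filled. These derived values are illustrative only; the Python sketch below uses invented counter values and assumes the partial request counts are included in the request totals.

# Hypothetical raw counter samples (invented values, for illustration only).
POS_SHADER_WARPS = 50_000
VAR_SHADER_WARPS = 20_000
POS_SHADER_PARTIAL_WARPS = 2_000
VAR_SHADER_PARTIAL_WARPS = 1_500

VERTICES_PER_REQUEST = 16   # the "Tiler shader task thread count" constant

# Upper bounds on shaded vertices; partial requests contain fewer than 16 vertices.
position_shaded_max = POS_SHADER_WARPS * VERTICES_PER_REQUEST
varying_shaded_max = VAR_SHADER_WARPS * VERTICES_PER_REQUEST

# Illustrative ratios, assuming partial requests are a subset of all requests.
partial_pos_rate = POS_SHADER_PARTIAL_WARPS / POS_SHADER_WARPS * 100
partial_var_rate = VAR_SHADER_PARTIAL_WARPS / VAR_SHADER_WARPS * 100
print(position_shaded_max, varying_shaded_max)
print(f"Partial requests: position {partial_pos_rate:.1f}%, varying {partial_var_rate:.1f}%")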
Tiler Workload Properties
This counter group shows workload properties of the primitive stream being processed by the tiler.
DVS primitives
This counter increments for every visible primitive that is using deferred vertex shading that survives all culling stages.
All fragments of the primitive might be occluded by other primitives closer to the camera, and so produce no visible output.
MaliGeomVisibleDVSPrim
$MaliTilerWorkloadPropertiesDVSPrimitives
PRIM_VISIBLE_DVS
Internal Memory System
The GPU internal memory interface connects the processing units, such as the shader cores and the tiler, to the GPU L2 cache.
Performance counters in this section show reads and writes into the L2 cache and how the cache responded to them.
L2 Cache Requests
This counter group shows the total number of requests made into the L2 cache from any source. This includes requests made by any of the internal requesters, such as the shader cores and the tiler.
Read requests
This counter increments for every read request received by the L2 cache from an internal requester.
MaliL2CacheRd
$MaliL2CacheRequestsReadRequests
L2_RD_MSG_IN
Write requests
This counter increments for every write request received by the L2 cache from an internal requester.
MaliL2CacheWr
$MaliL2CacheRequestsWriteRequests
L2_WR_MSG_IN
Snoop requests
This counter increments for every coherency snoop request received by the L2 cache from internal requesters.
MaliL2CacheSnp
$MaliL2CacheRequestsSnoopRequests
L2_SNP_MSG_IN
Clean unique requests
This counter increments for every line clean unique request received by the L2 cache from an internal requester.
MaliL2CacheCleanUnique
$MaliL2CacheRequestsCleanUniqueRequests
L2_RD_MSG_IN_CU
Evict requests
This counter increments for every line evict request received by the L2 cache from an internal requester.
MaliL2CacheEvict
$MaliL2CacheRequestsEvictRequests
L2_RD_MSG_IN_EVICT
L1 read requests
This counter increments for every L1 cache read request or read response sent by the L2 cache to an internal requester.
Read requests are triggered by a snoop request from one requester that needs data from another requester's L1 to resolve.
Read responses are standard responses back to a requester in response to its own read requests.
MaliL2CacheL1Rd
$MaliL2CacheRequestsL1ReadRequests
L2_RD_MSG_OUT
L1 write requests
This counter increments for every L1 cache write response sent by the L2 cache to an internal requester.
Write responses are standard responses back to a requester in response to its own write requests.
MaliL2CacheL1Wr
$MaliL2CacheRequestsL1WriteRequests
L2_WR_MSG_OUT
L2 Cache Lookups
This counter group shows the total number of lookups made into the L2 cache from any source.
Any lookups
This counter increments for every L2 cache lookup made, including all reads, writes, coherency snoops, and cache flush operations.
MaliL2CacheLookup
$MaliL2CacheLookupsAnyLookups
L2_ANY_LOOKUP
L2 Cache Stall Cycles
This counter group shows the total number of stall cycles that impact L2 cache lookups.
Read stalls
This counter increments for every clock cycle an L2 cache read request from an internal requester is stalled.
MaliL2CacheRdStallCy
$MaliL2CacheStallCyclesReadStalls
L2_RD_MSG_IN_STALL
Write stalls
This counter increments for every clock cycle when an L2 cache write request from an internal requester is stalled.
MaliL2CacheWrStallCy
$MaliL2CacheStallCyclesWriteStalls
L2_WR_MSG_IN_STALL
L2 Cache Miss Rate
This counter group shows the miss rate in the L2 cache.
Read miss rate
This expression defines the percentage of internal L2 cache reads that result in an external read.
MaliL2CacheRdMissRate
libGPUCounters derivation:
max(min((MaliExtBusRd / MaliL2CacheRdLookup) * 100, 100), 0)
Streamline derivation:
max(min(($MaliExternalBusAccessesReadTransactions / $MaliL2CacheLookupsReadLookups) * 100, 100), 0)
Hardware derivation:
max(min((L2_EXT_READ / L2_READ_LOOKUP) * 100, 100), 0)
Write miss rate
This expression defines the percentage of internal L2 cache writes that result in an external write.
MaliL2CacheWrMissRate
libGPUCounters derivation:
max(min((MaliExtBusWr / MaliL2CacheWrLookup) * 100, 100), 0)
Streamline derivation:
max(min(($MaliExternalBusAccessesWriteTransactions / $MaliL2CacheLookupsWriteLookups) * 100, 100), 0)
Hardware derivation:
max(min((L2_EXT_WRITE / L2_WRITE_LOOKUP) * 100, 100), 0)
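A minimal Python sketch evaluating both miss rate expressions from invented raw counter values; the lookup and external transaction counters used are those referenced by the derivations above.

# Hypothetical raw counter samples (invented values, for illustration only).
L2_READ_LOOKUP = 2_000_000
L2_WRITE_LOOKUP = 600_000
L2_EXT_READ = 300_000
L2_EXT_WRITE = 120_000

def clamp_pct(value):
    return max(min(value, 100), 0)

read_miss_rate = clamp_pct(L2_EXT_READ / L2_READ_LOOKUP * 100)
write_miss_rate = clamp_pct(L2_EXT_WRITE / L2_WRITE_LOOKUP * 100)
print(f"L2 read miss rate {read_miss_rate:.1f}%, write miss rate {write_miss_rate:.1f}%")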
Stage 1 MMU Translations
This counter group shows the number of stage 1 page table lookups handled by the GPU MMU.
MMU lookups
This counter increments for every address lookup made by the main GPU MMU. The counter only increments if the lookup misses in all of the local TLBs.
MaliMMULookup
$MaliStage1MMUTranslationsMMULookups
MMU_REQUESTS
L2 table reads
This counter increments for every read of a level 2 MMU translation table entry. Each address translation at this level covers a 2MB section, which is typically broken down further into 4KB pages using a subsequent level 3 translation table lookup.
MaliMMUL2Rd
$MaliStage1MMUTranslationsL2TableReads
MMU_TABLE_READS_L2
L2 table read hits
This counter increments for every read of a level 2 MMU translation table entry that results in a successful hit in the main MMU's TLB.
MaliMMUL2Hit
$MaliStage1MMUTranslationsL2TableReadHits
MMU_HIT_L2
Constants
Arm GPUs are configurable, with variable performance across products, and variable configurations across devices.
This section lists useful symbolic configuration and constant values that can be used in expressions to compute derived counters. Note that configuration values must be provided by a run-time tool that can query the actual implementation configuration of the target device.
Implementation Configuration
This constants group contains symbolic constants that define the configuration of a particular device. These must be populated by the counter sampling runtime tooling.
Shader core count
This configuration constant defines the number of shader cores in the design.
MaliConfigCoreCount
libGPUCounters derivation:
MALI_CONFIG_SHADER_CORE_COUNT
Streamline derivation:
$MaliConstantsShaderCoreCount
Hardware derivation:
MALI_CONFIG_SHADER_CORE_COUNT
L2 cache slice count
This configuration constant defines the number of L2 cache slices in the design.
MaliConfigL2CacheCount
libGPUCounters derivation:
MALI_CONFIG_L2_CACHE_COUNT
Streamline derivation:
$MaliConstantsL2SliceCount
Hardware derivation:
MALI_CONFIG_L2_CACHE_COUNT
External bus beat size
This configuration constant defines the number of bytes transferred per external bus beat.
MaliConfigExtBusBeatSize
libGPUCounters derivation:
MALI_CONFIG_EXT_BUS_BYTE_SIZE
Streamline derivation:
($MaliConstantsBusWidthBits / 8)
Hardware derivation:
MALI_CONFIG_EXT_BUS_BYTE_SIZE
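The Streamline derivation above converts the reported bus width in bits into bytes per beat. A minimal Python sketch, assuming a hypothetical 128-bit external bus:

# Hypothetical configuration value (invented for illustration only).
bus_width_bits = 128

# Bytes transferred per external bus beat, as in the Streamline derivation above.
ext_bus_beat_size = bus_width_bits // 8
print(ext_bus_beat_size)   # 16 bytes per beat for a 128-bit bus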
Static Configuration
This constants group contains literal constants that define the static configuration and performance characteristics of this product.
Main phase queue task size
This constant defines the number of pixels in each axis per main phase queue task.
MaliMainQueueTaskSize
libGPUCounters derivation:
64
Streamline derivation:
64
Hardware derivation:
64
Tiler shader task thread count
This constant defines the number of threads per vertex shading task issued by the tiler, to perform position shading or varying shading concurrently, for multiple sequential vertices.
MaliGPUGeomTaskSize
libGPUCounters derivation:
16
Streamline derivation:
16
Hardware derivation:
16
Tile size
This constant defines the size of a tile, in pixels per axis.
MaliGPUTileSize
libGPUCounters derivation:
32
Streamline derivation:
32
Hardware derivation:
32
Tile storage/pixel
This constant defines the number of bits of color storage per pixel available when using a 32 x 32 tile size. If multi-sampling, wide color formats, or multiple render targets need more than the available storage, the driver dynamically reduces the tile size until sufficient storage is available.
MaliGPUMaxPixelStorage
libGPUCounters derivation:
256
Streamline derivation:
256
Hardware derivation:
256
Warp size
This constant defines the number of threads in a single warp.
MaliGPUWarpSize
libGPUCounters derivation:
16
Streamline derivation:
16
Hardware derivation:
16
Varying slot count
This constant defines the number of varying unit slots.
The width of a slot is GPU-dependent.
MaliVarSlotPerCy
libGPUCounters derivation:
4
Streamline derivation:
4
Hardware derivation:
4
Texture samples/cycle
This constant defines the maximum number of texture samples that can be made per cycle.
MaliTexSamplePerCy
libGPUCounters derivation:
8
Streamline derivation:
8
Hardware derivation:
8