Arm Mali-G72

Introduction

This guide explains the performance counters for the Mali-G72, which is a member of the Bifrost architecture family.

This introduction explains the high-level goals to consider when profiling this GPU. Later sections describe the available counters for each part of the GPU design.

Profiling GPU scheduling

The GPU runs workloads that have been submitted by the graphics driver, using scheduling barriers between workloads to ensure they run in the correct order. Workloads are scheduled to run by adding them to the appropriate hardware queue, which will run enqueued workloads in a pipelined FIFO processing order.

Bifrost top-level

Tile-based rendering

Arm GPUs are tile-based GPUs, meaning that they process graphics render passes in two distinct phases. The first phase processes geometry to determine which primitives contribute to which screen-space tiles. The second phase renders the output framebuffer tile-by-tile.

In this design, tiles are small enough to be kept in on-chip tile memory, which makes fragment processing more efficient. However, in this generation of GPUs vertex shaders are processed in their entirety in the first phase with their outputs written back to main memory, and then re-read during the second phase. This makes geometry processing less efficient.

GPU queues

The GPU front-end in this generation of hardware has two hardware queues:

  • Non-fragment queue
  • Fragment queue

The Non-fragment queue is used for all compute-like workloads, including vertex shading and buffer transfers. The Fragment queue is used for all fragment-like workloads, including fragment shading and most image transfers.

Monitoring your application's queue usage is the first stage of profiling an Arm GPU, as the queue costs give the overall processing cost of each type of workload. In addition you can see if your application is using barriers efficiently, allowing the queues to run their workloads in parallel.

Profiling GPU memory bandwidth

GPUs are data-plane processors, so memory access efficiency is an important factor for overall performance.

Bifrost memory system

Memory system performance outside of the GPU cannot be directly observed via GPU performance counters, but the counters can show the performance observed by the GPU on its memory interface.

Reducing bandwidth

Accessing external DRAM is a very energy-intensive operation, which makes reducing external bandwidth an important optimization goal for mobile devices. Sustained high bandwidth can cause poor performance in mainstream devices, and thermal issues in high-end devices.

Shader core performance counters can give you a more detailed breakdown of which functional units are generating memory traffic, guiding your optimization efforts.

Reducing stalls

The memory system outside of the GPU is implemented by the chip manufacturer, and designs can vary and have different performance characteristics. Workloads that generate a significant number of memory stall cycles, or that see a large percentage of high latency reads, might be stressing the external memory system beyond its capabilities. Reducing memory bandwidth often gives measurable performance gains in these scenarios.

Profiling shader core usage

If the GPU queues are scheduling work well, the next step in determining the processing bottleneck of a workload is to profile your application's use of the shader core.

The Mali-G72 shader cores use a massively multi-threaded architecture, supporting hundreds of concurrently running threads. A large pool of available threads allows the hardware to fill parallel functional units by switching to any of the available threads if the current thread becomes blocked for any reason.

Bifrost shader core

In this type of architecture, the utilization of the functional units reflects the overall demand of the running shader programs. This is relatively independent of localized hot-spots in shaders that stress a single functional unit, because other threads will be running other parts of the program and will load-balance the hardware. This is quite different to profiling a CPU, where the serial instruction stream means that performance can be very sensitive to both latency and localized hot-spots.

Improve speed-of-light utilization

For functional unit profiling, we therefore aim for at least 75% utilization of the most heavily used functional unit, relative to its best case 'speed-of-light' performance. This shows that the application has done a good job getting its workload running without problematic stalls.

In this situation, reducing demand on the most heavily used functional units, either by improving efficiency or reducing size, should improve application performance.
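
As a worked illustration of the speed-of-light target, the following Python sketch picks the most heavily loaded functional unit and compares it against the 75% threshold. The unit names and utilization values are hypothetical placeholders, not counters defined in this guide.

SPEED_OF_LIGHT_TARGET = 75.0  # percent of the unit's best-case throughput

def find_bottleneck(unit_utilization):
    # Return the busiest unit and whether it meets the speed-of-light target.
    unit, util = max(unit_utilization.items(), key=lambda kv: kv[1])
    return unit, util, util >= SPEED_OF_LIGHT_TARGET

# Hypothetical per-unit utilization percentages for one capture.
units = {"arithmetic": 82.0, "load_store": 41.0, "texture": 23.0}
name, util, ok = find_bottleneck(units)
print(f"Busiest unit: {name} at {util:.0f}% ({'OK' if ok else 'below target'})")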

Reduce shader core stalls

If no functional unit is heavily utilized, the shader core is running out of work to do. This can occur for multiple reasons, and should be avoided if possible.

The first reason is that the shader is literally running out of threads to run, and the shader core is running with low thread occupancy. GPUs rely on workloads having a lot of threads to fill the capacity of the shader core. You should avoid running small workloads with few threads on the GPU, preferring to use the CPU if possible. Note that some workloads, such as depth shadow maps, may not generate many fragment threads due to their algorithmic design. This is usually unavoidable, but is something to remember when profiling.

The second reason is that the running shader programs are causing operations to stall by missing in descriptor caches or data caches. GPUs use their thread count to hide the impact and latency of cache misses, but there are limits to the density of misses that can be hidden. In this situation, identify which workload is causing the stalls and try to minimize them. There are no specific performance counters for every stall reason, so determining which resource is causing the problem can take some investigation and experimentation.

Profiling workload

In addition to profiling use of the hardware, measuring cycles and bytes, Arm GPUs provide many performance counters that can help you to understand the size and characteristics of your workload. These counters give feedback in the context of API constructs, such as vertices, triangles, and pixels, making it easier for developers to interpret the data.

Bifrost shader core

Supplementing the workload size counters, Arm GPUs also provide counters that indicate areas where content is not following best practice guidelines. Improving these best practice metrics will nearly always improve your application's performance or energy efficiency.

GPU Front-end

The GPU front-end is the interface between the GPU hardware and the driver. The front-end schedules workloads submitted by the driver on to multiple hardware work queues. Each work queue handles a specific type of workload and is responsible for breaking a workload into smaller tasks that can be dispatched to the shader cores. Work stays at the head of the queue while being processed, so queue activity is a direct way of measuring that the GPU is busy handling a workload.

In this generation of hardware there are two work queues:

  • Non-fragment queue for compute shaders, vertex shaders, and primitive culling and binning.
  • Fragment queue for render pass fragment shading.

It is beneficial to schedule work on multiple queues in parallel, as this can more evenly load balance the hardware. Parallel processing will increase the latency of individual tasks, but usually significantly improves overall throughput.

Performance counters in this section can show activity on each of the queues, which indicates the complexity and scheduling patterns of submitted workloads.

GPU Cycles

This counter group shows the workload processing activity level of the GPU, showing the overall use and when work was running for each of the hardware scheduling queues.

GPU active

This counter increments every clock cycle when the GPU has any pending workload present in one of its processing queues. It shows the overall GPU processing load requested by the application.

This counter increments when any workload is present in any processing queue, even if the GPU is stalled waiting for external memory. These cycles are counted as active time even though no progress is being made.

libGPUCounters name: MaliGPUActiveCy
Streamline name: $MaliGPUCyclesGPUActive
Hardware name: GPU_ACTIVE

Non-fragment queue active

This counter increments every clock cycle when the GPU has any workload present in the non-fragment queue. This queue is used for vertex shaders, tessellation shaders, geometry shaders, fixed-function tiling, and compute shaders. This counter cannot disambiguate between these workload types.

In content achieving good parallelism, which is important for overall rendering efficiency, the highest queue active cycle counter should be similar to the GPU active counter.

This counter increments when any workload is present in the non-fragment processing queue, even if the GPU is stalled waiting for external memory. These cycles are counted as active time even though no progress is being made.

libGPUCounters name: MaliNonFragQueueActiveCy
Streamline name: $MaliGPUCyclesNonFragmentQueueActive
Hardware name: JS1_ACTIVE

Fragment queue active

This counter increments every clock cycle when the GPU has any workload present in the fragment queue.

In content achieving good parallelism, which is important for overall rendering efficiency, the highest queue active cycle counter should be similar to the GPU active counter.

This counter increments when any workload is present in the fragment queue, even if the GPU is stalled waiting for external memory. These cycles are counted as active time even though no progress is being made.

libGPUCounters name: MaliFragQueueActiveCy
Streamline name: $MaliGPUCyclesFragmentQueueActive
Hardware name: JS0_ACTIVE
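
As a sketch of this comparison, the following Python snippet checks how closely the dominant queue tracks total GPU activity, using the counter names from this section. The sample values are hypothetical.

def queue_parallelism(gpu_active, nonfrag_active, frag_active):
    # Per-queue utilization; the dominant queue should be close to 100%.
    return {
        "non_fragment_pct": 100.0 * nonfrag_active / gpu_active,
        "fragment_pct": 100.0 * frag_active / gpu_active,
        "dominant_pct": 100.0 * max(nonfrag_active, frag_active) / gpu_active,
    }

# Hypothetical values for MaliGPUActiveCy, MaliNonFragQueueActiveCy, MaliFragQueueActiveCy.
stats = queue_parallelism(1_000_000, 350_000, 920_000)
print(stats)  # a dominant_pct well below 100% hints at serialization between queues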

Reserved queue active

This counter increments any clock cycle that the GPU has any workload present in the reserved processing queue.

libGPUCounters name: MaliResQueueActiveCy
Streamline name: $MaliGPUCyclesReservedQueueActive
Hardware name: JS2_ACTIVE

Tiler active

This counter increments every clock cycle the tiler has a workload in its processing queue. The tiler is responsible for coordinating geometry processing and providing the fixed-function tiling needed for the Mali tile-based rendering pipeline. It can run in parallel to vertex shading and fragment shading.

A high cycle count here does not necessarily imply a bottleneck, unless the Non-fragment active cycles counter in the shader core is comparatively low.

libGPUCounters name: MaliTilerActiveCy
Streamline name: $MaliGPUCyclesTilerActive
Hardware name: TILER_ACTIVE

GPU interrupt active

This counter increments every clock cycle when the GPU has an interrupt pending and is waiting for the CPU to process it.

Cycles with a pending interrupt do not necessarily indicate lost performance because the GPU can process other queued work in parallel. However, if GPU interrupt pending cycles are a high percentage of GPU active cycles, an underlying problem might be preventing the CPU from efficiently handling interrupts. This problem is normally a system integration issue, which an application developer cannot work around.

libGPUCounters name: MaliGPUIRQActiveCy
Streamline name: $MaliGPUCyclesGPUInterruptActive
Hardware name: IRQ_ACTIVE

GPU Wait Cycles

This counter group shows the workload scheduling behavior of the GPU queues, showing reasons for any scheduling stalls for each queue.

Non-fragment queue cache flush stalls

This counter increments any clock cycle that the GPU has non-fragment work queued that can not run or retire because of a pending L2 cache flush.

libGPUCounters name: MaliNonFragQueueWaitFlushCy
Streamline name: $MaliGPUWaitCyclesNonFragmentQueueCacheFlushStalls
Hardware name: JS1_WAIT_FLUSH

Non-fragment queue descriptor read stalls

This counter increments any clock cycle that the GPU has non-fragment work queued that can not run because of a pending descriptor load from memory.

libGPUCounters name: MaliNonFragQueueWaitRdCy
Streamline name: $MaliGPUWaitCyclesNonFragmentQueueDescriptorReadStalls
Hardware name: JS1_WAIT_READ

Non-fragment queue job dependency stalls

This counter increments any clock cycle that the GPU has non-fragment work queued that can not run until dependent work has completed.

libGPUCounters name: MaliNonFragQueueWaitDepCy
Streamline name: $MaliGPUWaitCyclesNonFragmentQueueJobDependencyStalls
Hardware name: JS1_WAIT_DEPEND

Non-fragment queue job finish stalls

This counter increments any clock cycle that the GPU has run out of new non-fragment work to issue, and is waiting for remaining work to complete.

libGPUCounters name: MaliNonFragQueueWaitFinishCy
Streamline name: $MaliGPUWaitCyclesNonFragmentQueueJobFinishStalls
Hardware name: JS1_WAIT_FINISH

Non-fragment queue job issue stalls

This counter increments any clock cycle that the GPU has non-fragment work queued that can not run because all processor resources are busy.

libGPUCounters name: MaliNonFragQueueWaitIssueCy
Streamline name: $MaliGPUWaitCyclesNonFragmentQueueJobIssueStalls
Hardware name: JS1_WAIT_ISSUE

Fragment queue cache flush stalls

This counter increments any clock cycle that the GPU has fragment work queued that can not run or retire because of a pending L2 cache flush.

libGPUCounters name: MaliFragQueueWaitFlushCy
Streamline name: $MaliGPUWaitCyclesFragmentQueueCacheFlushStalls
Hardware name: JS0_WAIT_FLUSH

Fragment queue descriptor read stalls

This counter increments any clock cycle that the GPU has fragment work queued that can not run because of a pending descriptor load from memory.

libGPUCounters name: MaliFragQueueWaitRdCy
Streamline name: $MaliGPUWaitCyclesFragmentQueueDescriptorReadStalls
Hardware name: JS0_WAIT_READ

Fragment queue job dependency stalls

This counter increments any clock cycle that the GPU has fragment work queued that can not run until dependent work has completed.

libGPUCounters name: MaliFragQueueWaitDepCy
Streamline name: $MaliGPUWaitCyclesFragmentQueueJobDependencyStalls
Hardware name: JS0_WAIT_DEPEND

Fragment queue job finish stalls

This counter increments any clock cycle that the GPU has run out of new fragment work to issue, and is waiting for remaining work to complete.

libGPUCounters name: MaliFragQueueWaitFinishCy
Streamline name: $MaliGPUWaitCyclesFragmentQueueJobFinishStalls
Hardware name: JS0_WAIT_FINISH

Fragment queue job issue stalls

This counter increments any clock cycle that the GPU has fragment work queued that can not run because all processor resources are busy.

libGPUCounters name: MaliFragQueueWaitIssueCy
Streamline name: $MaliGPUWaitCyclesFragmentQueueJobIssueStalls
Hardware name: JS0_WAIT_ISSUE

Reserved queue cache flush stalls

This counter increments any clock cycle that the GPU has reserved work queued that can not run or retire because of a pending L2 cache flush.

libGPUCounters name: MaliResQueueWaitFlushCy
Streamline name: $MaliGPUWaitCyclesReservedQueueCacheFlushStalls
Hardware name: JS2_WAIT_FLUSH

Reserved queue descriptor read stalls

This counter increments any clock cycle that the GPU has reserved work queued that can not run because of a pending descriptor load from memory.

libGPUCounters name: MaliResQueueWaitRdCy
Streamline name: $MaliGPUWaitCyclesReservedQueueDescriptorReadStalls
Hardware name: JS2_WAIT_READ

Reserved queue job dependency stalls

This counter increments any clock cycle that the GPU has reserved work queued that can not run until dependent work is completed.

libGPUCounters name: MaliResQueueWaitDepCy
Streamline name: $MaliGPUWaitCyclesReservedQueueJobDependencyStalls
Hardware name: JS2_WAIT_DEPEND

Reserved queue job finish stalls

This counter increments any clock cycle that the GPU has run out of new reserved work to issue, and is waiting for remaining work to complete.

libGPUCounters name: MaliResQueueWaitFinishCy
Streamline name: $MaliGPUWaitCyclesReservedQueueJobFinishStalls
Hardware name: JS2_WAIT_FINISH

Reserved queue job issue stalls

This counter increments any clock cycle that the GPU has reserved work queued that can not run because all processor resources are busy.

libGPUCounters name: MaliResQueueWaitIssueCy
Streamline name: $MaliGPUWaitCyclesReservedQueueJobIssueStalls
Hardware name: JS2_WAIT_ISSUE

GPU Jobs

This counter group shows the total number of workload jobs issued to the GPU front-end for each queue. Most jobs will correspond to an API workload, for example a compute dispatch generates a compute job. However, the driver can also generate small house-keeping jobs for each queue, so job counts do not directly correlate with API behavior.

Non-fragment jobs

This counter increments for every job processed by the GPU non-fragment queue.

libGPUCounters name: MaliNonFragQueueJob
Streamline name: $MaliGPUJobsNonFragmentJobs
Hardware name: JS1_JOBS

Fragment jobs

This counter increments for every job processed by the GPU fragment queue.

libGPUCounters name: MaliFragQueueJob
Streamline name: $MaliGPUJobsFragmentJobs
Hardware name: JS0_JOBS

Reserved jobs

This counter increments for every job processed by the GPU reserved queue.

libGPUCounters name: MaliResQueueJob
Streamline name: $MaliGPUJobsReservedJobs
Hardware name: JS2_JOBS

GPU Tasks

This counter group shows the total number of workload tasks issued by the GPU front-end to the processing end-points inside the GPU.

Non-fragment tasks

This counter increments for every non-fragment task processed by the GPU.

libGPUCounters name: MaliNonFragQueueTask
Streamline name: $MaliGPUTasksNonFragmentTasks
Hardware name: JS1_TASKS

Fragment tasks

This counter increments for every 32 x 32 pixel region of a render pass that is processed by the GPU. The processed region of a render pass can be smaller than the full size of the attached surfaces if the application's viewport and scissor settings prevent the whole image being rendered.

libGPUCounters name: MaliFragQueueTask
Streamline name: $MaliGPUTasksFragmentTasks
Hardware name: JS0_TASKS

Reserved tasks

This counter increments for every reserved task processed by the GPU.

libGPUCounters name: MaliResQueueTask
Streamline name: $MaliGPUTasksReservedTasks
Hardware name: JS2_TASKS

GPU Utilization

This counter group shows the workload processing activity level of the GPU queues, normalized as a percentage of overall GPU activity.

Non-fragment queue utilization

This expression defines the non-fragment queue utilization compared against the GPU active cycles. For GPU bound content, it is expected that the GPU queues process work in parallel. The dominant queue must be close to 100% utilized. If no queue is dominant, but the GPU is close to 100% utilized, then there might be a serialization or dependency problem preventing better overlap across the queues.

libGPUCounters name: MaliNonFragQueueUtil

libGPUCounters derivation:

max(min((MaliNonFragQueueActiveCy / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliGPUCyclesNonFragmentQueueActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((JS1_ACTIVE / GPU_ACTIVE) * 100, 100), 0)
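
All of the utilization expressions in this group follow the same clamped percentage pattern. A minimal Python sketch of that pattern, using hypothetical counter values, is shown below.

def clamped_utilization(active_cycles, gpu_active_cycles):
    # Equivalent of max(min((active / gpu_active) * 100, 100), 0).
    if gpu_active_cycles == 0:
        return 0.0
    return max(min((active_cycles / gpu_active_cycles) * 100.0, 100.0), 0.0)

print(clamped_utilization(880_000, 1_000_000))  # 88.0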

Fragment queue utilization

This expression defines the fragment queue utilization compared against the GPU active cycles. For GPU bound content, the GPU queues are expected to process work in parallel. Aim to keep the dominant queue close to 100% utilized. If no queue is dominant, but the GPU is close to 100% utilized, then there might be a serialization or dependency problem preventing better queue overlap.

libGPUCounters name: MaliFragQueueUtil

libGPUCounters derivation:

max(min((MaliFragQueueActiveCy / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliGPUCyclesFragmentQueueActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((JS0_ACTIVE / GPU_ACTIVE) * 100, 100), 0)

Tiler utilization

This expression defines the tiler utilization compared to the total GPU active cycles.

Note that this metric measures the overall processing time for the tiler geometry pipeline. The metric includes aspects of vertex shading, in addition to the fixed-function tiling process.

libGPUCounters name: MaliTilerUtil

libGPUCounters derivation:

max(min((MaliTilerActiveCy / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliGPUCyclesTilerActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((TILER_ACTIVE / GPU_ACTIVE) * 100, 100), 0)

Interrupt utilization

This expression defines the IRQ pending utilization compared against the GPU active cycles. In a well-functioning system, this expression should be less than 3% of the total cycles. If the value is much higher than this, a system issue might be preventing the CPU from efficiently handling interrupts.

libGPUCounters name: MaliGPUIRQUtil

libGPUCounters derivation:

max(min((MaliGPUIRQActiveCy / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliGPUCyclesGPUInterruptActive / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((IRQ_ACTIVE / GPU_ACTIVE) * 100, 100), 0)

GPU Cache Flushes

This counter group shows the total number of L2 cache and MMU operations performed by the GPU top-level.

L2 cache flushes

This counter increments for every L2 cache flush that is performed.

libGPUCounters name: MaliL2CacheFlush
Streamline name: $MaliGPUCacheFlushesL2CacheFlushes
Hardware name: CACHE_FLUSH

External Memory System

The GPU external memory interface connects the GPU to the system DRAM, via an on-chip memory bus. The exact configuration of the memory system outside of the GPU varies from device to device and might include additional levels of system cache before reaching the off-chip memory.

GPUs are data-plane processors, with workloads that are too large to keep in system cache and that therefore make heavy use of main memory. GPUs are designed to be tolerant of high latency, when compared to a CPU, but poor memory system performance can still reduce GPU efficiency.

Accessing external DRAM is one of the most energy-intensive operations that the GPU can perform. Reducing memory bandwidth is a key optimization goal for mobile applications, even when they are not bandwidth limited, because it ensures users get long battery life and thermally stable performance.

Performance counters in this section measure how much memory bandwidth your application uses, as well as stall and latency counters to show how well the memory system is coping with the generated traffic.

External Bus Accesses

This counter group shows the absolute number of external memory transactions generated by the GPU.

Read transactions

This counter increments for every external read transaction made on the memory bus. These transactions typically result in an external DRAM access, but some designs include a system cache which can provide some buffering.

The longest memory transaction possible is 64 bytes in length, but shorter transactions are generated in some circumstances.

libGPUCounters name: MaliExtBusRd
Streamline name: $MaliExternalBusAccessesReadTransactions
Hardware name: L2_EXT_READ

Write transactions

This counter increments for every external write transaction made on the memory bus. These transactions typically result in an external DRAM access, but some chips include a system cache which can provide some buffering.

The longest memory transaction possible is 64 bytes in length, but shorter transactions are generated in some circumstances.

libGPUCounters name: MaliExtBusWr
Streamline name: $MaliExternalBusAccessesWriteTransactions
Hardware name: L2_EXT_WRITE

ReadNoSnoop transactions

This counter increments for every non-coherent (ReadNoSnp) transaction.

libGPUCounters name: MaliExtBusRdNoSnoop
Streamline name: $MaliExternalBusAccessesReadNoSnoopTransactions
Hardware name: L2_EXT_READ_NOSNP

ReadUnique transactions

This counter increments for every coherent exclusive read (ReadUnique) transaction.

libGPUCounters name: MaliExtBusRdUnique
Streamline name: $MaliExternalBusAccessesReadUniqueTransactions
Hardware name: L2_EXT_READ_UNIQUE

Snoop transactions

This counter increments for every coherency snoop transaction received from an external requester.

libGPUCounters name: MaliL2CacheIncSnp
Streamline name: $MaliExternalBusAccessesSnoopTransactions
Hardware name: L2_EXT_SNOOP

WriteNoSnoopFull transactions

This counter increments for every external non-coherent full write (WriteNoSnpFull) transaction.

libGPUCounters name: MaliExtBusWrNoSnoopFull
Streamline name: $MaliExternalBusAccessesWriteNoSnoopFullTransactions
Hardware name: L2_EXT_WRITE_NOSNP_FULL

WriteNoSnoopPartial transactions

This counter increments for every external non-coherent partial write (WriteNoSnpPtl) transaction.

libGPUCounters name: MaliExtBusWrNoSnoopPart
Streamline name: $MaliExternalBusAccessesWriteNoSnoopPartialTransactions
Hardware name: L2_EXT_WRITE_NOSNP_PTL

WriteSnoopFull transactions

This counter increments for every external coherent full write (WriteBackFull or WriteUniqueFull) transaction.

libGPUCounters name: MaliExtBusWrSnoopFull
Streamline name: $MaliExternalBusAccessesWriteSnoopFullTransactions
Hardware name: L2_EXT_WRITE_SNP_FULL

WriteSnoopPartial transactions

This counter increments for every external coherent partial write (WriteBackPtl or WriteUniquePtl) transaction.

libGPUCounters name: MaliExtBusWrSnoopPart
Streamline name: $MaliExternalBusAccessesWriteSnoopPartialTransactions
Hardware name: L2_EXT_WRITE_SNP_PTL

External Bus Beats

This counter group shows the absolute amount of external memory data transfer cycles used by the GPU.

Read beats

This counter increments for every clock cycle when a data beat was read from the external memory bus.

Most implementations use a 128-bit (16-byte) data bus, enabling a single 64-byte read transaction to be read using 4 bus cycles.

libGPUCounters name: MaliExtBusRdBt
Streamline name: $MaliExternalBusBeatsReadBeats
Hardware name: L2_EXT_READ_BEATS

Write beats

This counter increments for every clock cycle when a data beat was written to the external memory bus.

Most implementations use a 128-bit (16-byte) data bus, enabling a single 64-byte write transaction to be written using 4 bus cycles.

libGPUCounters name: MaliExtBusWrBt
Streamline name: $MaliExternalBusBeatsWriteBeats
Hardware name: L2_EXT_WRITE_BEATS

External Bus Bytes

This counter group shows the absolute amount of external memory traffic generated by the GPU. Absolute measures are the most useful way to check actual bandwidth against a per-frame bandwidth budget.

Read bytes

This expression defines the total output read bandwidth for the GPU.

libGPUCounters name: MaliExtBusRdBy

libGPUCounters derivation:

MaliExtBusRdBt * MALI_CONFIG_EXT_BUS_BYTE_SIZE

Streamline derivation:

$MaliExternalBusBeatsReadBeats * ($MaliConstantsBusWidthBits / 8)

Hardware derivation:

L2_EXT_READ_BEATS * MALI_CONFIG_EXT_BUS_BYTE_SIZE

Write bytes

This expression defines the total output write bandwidth for the GPU.

libGPUCounters name: MaliExtBusWrBy

libGPUCounters derivation:

MaliExtBusWrBt * MALI_CONFIG_EXT_BUS_BYTE_SIZE

Streamline derivation:

$MaliExternalBusBeatsWriteBeats * ($MaliConstantsBusWidthBits / 8)

Hardware derivation:

L2_EXT_WRITE_BEATS * MALI_CONFIG_EXT_BUS_BYTE_SIZE
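
To relate these expressions to a per-frame budget, the following Python sketch converts beat counts into bytes, assuming the common 128-bit (16-byte) bus described earlier. The per-frame beat counts and the budget value are hypothetical.

BUS_BYTES_PER_BEAT = 16  # assumes a 128-bit external data bus

def frame_bytes(read_beats, write_beats):
    return (read_beats + write_beats) * BUS_BYTES_PER_BEAT

frame_budget_bytes = 150 * 1024 * 1024  # hypothetical 150 MB per-frame budget
used = frame_bytes(read_beats=4_500_000, write_beats=1_200_000)
print(f"{used / (1024 * 1024):.1f} MB used, "
      f"{'within' if used <= frame_budget_bytes else 'over'} budget")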

External Bus Bandwidth

This counter group shows the external memory traffic generated by the GPU, presented as a bytes/second rate. Rates are the most useful way to check actual bandwidth against the design limits of a chip, which will usually be specified in bytes/second.

Read bandwidth

This expression defines the total output read bandwidth for the GPU, measured in bytes per second.

libGPUCounters name: MaliExtBusRdBPS

libGPUCounters derivation:

(MaliExtBusRdBt * MALI_CONFIG_EXT_BUS_BYTE_SIZE) / MALI_CONFIG_TIME_SPAN

Streamline derivation:

($MaliExternalBusBeatsReadBeats * ($MaliConstantsBusWidthBits / 8)) / $ZOOM

Hardware derivation:

(L2_EXT_READ_BEATS * MALI_CONFIG_EXT_BUS_BYTE_SIZE) / MALI_CONFIG_TIME_SPAN

Write bandwidth

This expression defines the total output write bandwidth for the GPU, measured in bytes per second.

libGPUCounters name: MaliExtBusWrBPS

libGPUCounters derivation:

(MaliExtBusWrBt * MALI_CONFIG_EXT_BUS_BYTE_SIZE) / MALI_CONFIG_TIME_SPAN

Streamline derivation:

($MaliExternalBusBeatsWriteBeats * ($MaliConstantsBusWidthBits / 8)) / $ZOOM

Hardware derivation:

(L2_EXT_WRITE_BEATS * MALI_CONFIG_EXT_BUS_BYTE_SIZE) / MALI_CONFIG_TIME_SPAN
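
The rate expressions divide the byte totals by the capture period. The Python sketch below mirrors that normalization; the sample duration, beat count, and design limit are hypothetical values.

BUS_BYTES_PER_BEAT = 16  # assumes a 128-bit external data bus

def bandwidth_bps(beats, sample_seconds):
    return (beats * BUS_BYTES_PER_BEAT) / sample_seconds

read_bps = bandwidth_bps(beats=270_000_000, sample_seconds=1.0)
design_limit_bps = 14.9e9  # hypothetical memory system design limit in bytes/second
print(f"Read bandwidth: {read_bps / 1e9:.2f} GB/s "
      f"({100 * read_bps / design_limit_bps:.0f}% of the assumed design limit)")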

External Bus Stall Cycles

This counter group shows the absolute number of external memory interface stalls, which is the number of cycles that the GPU was trying to send data but the external bus could not accept it.

Read stalls

This counter increments for every stall cycle on the AXI bus where the GPU has a valid read transaction to send, but is awaiting a ready signal from the bus.

libGPUCounters name: MaliExtBusRdStallCy
Streamline name: $MaliExternalBusStallCyclesReadStalls
Hardware name: L2_EXT_AR_STALL

Write stalls

This counter increments for every stall cycle on the external bus where the GPU has a valid write transaction to send, but is awaiting a ready signal from the external bus.

libGPUCounters name: MaliExtBusWrStallCy
Streamline name: $MaliExternalBusStallCyclesWriteStalls
Hardware name: L2_EXT_W_STALL

Snoop stalls

This counter increments for every clock cycle when a coherency snoop transaction received from an external requester is stalled by the L2 cache.

libGPUCounters name: MaliL2CacheIncSnpStallCy
Streamline name: $MaliExternalBusStallCyclesSnoopStalls
Hardware name: L2_EXT_SNOOP_STALL

External Bus Stall Rate

This counter group shows the percentage of cycles that the GPU was trying to send data, but the external bus could not accept it.

A small number of stalls is expected, but sustained periods with stall rates above 10% might indicate that the GPU is generating more traffic than the downstream memory system can handle efficiently.

Read stall rate

This expression defines the percentage of GPU cycles with a memory stall on an external read transaction.

Stall rates can be reduced by reducing the size of data resources, such as buffers or textures.

libGPUCounters name: MaliExtBusRdStallRate

libGPUCounters derivation:

max(min((MaliExtBusRdStallCy / MALI_CONFIG_L2_CACHE_COUNT / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliExternalBusStallCyclesReadStalls / $MaliConstantsL2SliceCount / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((L2_EXT_AR_STALL / MALI_CONFIG_L2_CACHE_COUNT / GPU_ACTIVE) * 100, 100), 0)

Write stall rate

This expression defines the percentage of GPU cycles with a memory stall on an external write transaction.

Stall rates can be reduced by reducing geometry complexity, or the size of framebuffers in memory.

libGPUCounters name: MaliExtBusWrStallRate

libGPUCounters derivation:

max(min((MaliExtBusWrStallCy / MALI_CONFIG_L2_CACHE_COUNT / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliExternalBusStallCyclesWriteStalls / $MaliConstantsL2SliceCount / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((L2_EXT_W_STALL / MALI_CONFIG_L2_CACHE_COUNT / GPU_ACTIVE) * 100, 100), 0)
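
Both stall rate expressions divide the stall count by the L2 cache slice count before normalizing against GPU active cycles. The Python sketch below shows this, with the 10% threshold taken from the guidance above; all input values are hypothetical.

def stall_rate(stall_cycles, l2_slice_count, gpu_active_cycles):
    rate = (stall_cycles / l2_slice_count / gpu_active_cycles) * 100.0
    return max(min(rate, 100.0), 0.0)

read_rate = stall_rate(stall_cycles=240_000, l2_slice_count=1, gpu_active_cycles=1_000_000)
if read_rate > 10.0:
    print(f"Read stall rate {read_rate:.1f}% - memory system may be saturating")
else:
    print(f"Read stall rate {read_rate:.1f}% - within the expected range")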

External Bus Read Latency

This counter group shows the histogram distribution of memory latency for GPU reads.

GPUs are more tolerant of latency than a CPU, but sustained periods of high latency might indicate that the GPU is generating more traffic than the downstream memory system can handle efficiently.

0-127 cycles

This counter increments for every data beat that is returned between 0 and 127 cycles after the read transaction started. This latency is considered a fast access response speed.

libGPUCounters name: MaliExtBusRdLat0
Streamline name: $MaliExternalBusReadLatency0127Cycles
Hardware name: L2_EXT_RRESP_0_127

128-191 cycles

This counter increments for every data beat that is returned between 128 and 191 cycles after the read transaction started. This latency is considered a normal access response speed.

libGPUCounters name: MaliExtBusRdLat128
Streamline name: $MaliExternalBusReadLatency128191Cycles
Hardware name: L2_EXT_RRESP_128_191

192-255 cycles

This counter increments for every data beat that is returned between 192 and 255 cycles after the read transaction started. This latency is considered a normal access response speed.

libGPUCounters name: MaliExtBusRdLat192
Streamline name: $MaliExternalBusReadLatency192255Cycles
Hardware name: L2_EXT_RRESP_192_255

256-319 cycles

This counter increments for every data beat that is returned between 256 and 319 cycles after the read transaction started. This latency is considered a slow access response speed.

libGPUCounters name: MaliExtBusRdLat256
Streamline name: $MaliExternalBusReadLatency256319Cycles
Hardware name: L2_EXT_RRESP_256_319

320-383 cycles

This counter increments for every data beat that is returned between 320 and 383 cycles after the read transaction started. This latency is considered a slow access response speed.

libGPUCounters name: MaliExtBusRdLat320
Streamline name: $MaliExternalBusReadLatency320383Cycles
Hardware name: L2_EXT_RRESP_320_383

384+ cycles

This expression increments for every read beat that is returned at least 384 cycles after the transaction started. This latency is considered a very slow access response speed.

libGPUCounters name: MaliExtBusRdLat384

libGPUCounters derivation:

MaliExtBusRdBt - MaliExtBusRdLat0 - MaliExtBusRdLat128 - MaliExtBusRdLat192 - MaliExtBusRdLat256 - MaliExtBusRdLat320

Streamline derivation:

$MaliExternalBusBeatsReadBeats - $MaliExternalBusReadLatency0127Cycles - $MaliExternalBusReadLatency128191Cycles - $MaliExternalBusReadLatency192255Cycles - $MaliExternalBusReadLatency256319Cycles - $MaliExternalBusReadLatency320383Cycles

Hardware derivation:

L2_EXT_READ_BEATS - L2_EXT_RRESP_0_127 - L2_EXT_RRESP_128_191 - L2_EXT_RRESP_192_255 - L2_EXT_RRESP_256_319 - L2_EXT_RRESP_320_383
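
The derivation above computes the slowest bucket as a remainder: total read beats minus the beats already counted in the binned latency buckets. A Python sketch with hypothetical beat counts:

def latency_histogram(total_read_beats, binned_buckets):
    # binned_buckets holds the 0-127, 128-191, 192-255, 256-319, 320-383 counts.
    slowest = max(total_read_beats - sum(binned_buckets), 0)
    return binned_buckets + [slowest]

hist = latency_histogram(1_000_000, [620_000, 210_000, 90_000, 40_000, 25_000])
print(hist)  # the final entry, 15000, is the 384+ cycle (very slow) bucket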

External Bus Outstanding Reads

This counter group shows the histogram distribution of the use of the available pool of outstanding memory read transactions.

Sustained periods with most read transactions outstanding may indicate that the GPU hardware configuration is running out of outstanding read capacity.

0-25% outstanding

This counter increments for every read transaction initiated when 0-25% of the available transaction IDs are in use.

libGPUCounters name: MaliExtBusRdOTQ1
Streamline name: $MaliExternalBusOutstandingReads025Outstanding
Hardware name: L2_EXT_AR_CNT_Q1

25-50% outstanding

This counter increments for every read transaction initiated when 25-50% of the available transaction IDs are in use.

libGPUCounters name: MaliExtBusRdOTQ2
Streamline name: $MaliExternalBusOutstandingReads2550Outstanding
Hardware name: L2_EXT_AR_CNT_Q2

50-75% outstanding

This counter increments for every read transaction initiated when 50-75% of the available transaction IDs are in use.

libGPUCounters name: MaliExtBusRdOTQ3
Streamline name: $MaliExternalBusOutstandingReads5075Outstanding
Hardware name: L2_EXT_AR_CNT_Q3

75-100% outstanding

This expression increments for every read transaction initiated when 75-100% of transaction IDs are in use.

libGPUCounters name: MaliExtBusRdOTQ4

libGPUCounters derivation:

MaliExtBusRd - MaliExtBusRdOTQ1 - MaliExtBusRdOTQ2 - MaliExtBusRdOTQ3

Streamline derivation:

$MaliExternalBusAccessesReadTransactions - $MaliExternalBusOutstandingReads025Outstanding - $MaliExternalBusOutstandingReads2550Outstanding - $MaliExternalBusOutstandingReads5075Outstanding

Hardware derivation:

L2_EXT_READ - L2_EXT_AR_CNT_Q1 - L2_EXT_AR_CNT_Q2 - L2_EXT_AR_CNT_Q3

External Bus Outstanding Writes

This counter group shows the histogram distribution of the use of the available pool of outstanding memory write transactions.

Sustained periods with most write transactions outstanding may indicate that the GPU hardware configuration is running out of outstanding write capacity.

0-25% outstanding

This counter increments for every write transaction initiated when 0-25% of the available transaction IDs are in use.

libGPUCounters name: MaliExtBusWrOTQ1
Streamline name: $MaliExternalBusOutstandingWrites025Outstanding
Hardware name: L2_EXT_AW_CNT_Q1

25-50% outstanding

This counter increments for every write transaction initiated when 25-50% of the available transaction IDs are in use.

libGPUCounters name: MaliExtBusWrOTQ2
Streamline name: $MaliExternalBusOutstandingWrites2550Outstanding
Hardware name: L2_EXT_AW_CNT_Q2

50-75% outstanding

This counter increments for every write transaction initiated when 50-75% of the available transaction IDs are in use.

libGPUCounters name: MaliExtBusWrOTQ3
Streamline name: $MaliExternalBusOutstandingWrites5075Outstanding
Hardware name: L2_EXT_AW_CNT_Q3

75-100% outstanding

This expression increments for every write transaction initiated when 75-100% of transaction IDs are in use.

libGPUCounters name: MaliExtBusWrOTQ4

libGPUCounters derivation:

MaliExtBusWr - MaliExtBusWrOTQ1 - MaliExtBusWrOTQ2 - MaliExtBusWrOTQ3

Streamline derivation:

$MaliExternalBusAccessesWriteTransactions - $MaliExternalBusOutstandingWrites025Outstanding - $MaliExternalBusOutstandingWrites2550Outstanding - $MaliExternalBusOutstandingWrites5075Outstanding

Hardware derivation:

L2_EXT_WRITE - L2_EXT_AW_CNT_Q1 - L2_EXT_AW_CNT_Q2 - L2_EXT_AW_CNT_Q3

Graphics Geometry Workload

Graphics workloads using the rasterization pipeline pass inputs to the GPU as a geometry stream. Vertices in this stream are position shaded, assembled into primitives, and then passed through a culling pipeline before being passed to the Arm GPU binning unit.

Performance counters in this section show how the input geometry is processed, indicating the overall complexity of the geometry workload and how it is processed by the primitive culling stages.

Input Primitives

This counter group shows the number of input primitives to the GPU, before any culling is applied.

Input primitives

This expression defines the total number of input primitives to the rendering process.

High complexity geometry is one of the most expensive inputs to the GPU, because vertices are much larger than compressed texels. Optimize your geometry to minimize mesh complexity, using dynamic level-of-detail and normal maps to reduce the number of primitives required.

libGPUCounters name: MaliGeomTotalPrim

libGPUCounters derivation:

MaliGeomFaceXYPlaneCullPrim + MaliGeomZPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomVisiblePrim

Streamline derivation:

$MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives + $MaliPrimitiveCullingZPlaneTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives

Hardware derivation:

PRIM_CULLED + PRIM_CLIPPED + PRIM_SAT_CULLED + PRIM_VISIBLE

Triangle primitives

This counter increments for every input triangle primitive. The count is made before any culling or clipping.

libGPUCounters name: MaliGeomTrianglePrim
Streamline name: $MaliInputPrimitivesTrianglePrimitives
Hardware name: TRIANGLES

Line primitives

This counter increments for every input line primitive. The count is made before any culling or clipping.

libGPUCounters name: MaliGeomLinePrim
Streamline name: $MaliInputPrimitivesLinePrimitives
Hardware name: LINES

Point primitives

This counter increments for every input point primitive. The count is made before any culling or clipping.

libGPUCounters name: MaliGeomPointPrim
Streamline name: $MaliInputPrimitivesPointPrimitives
Hardware name: POINTS

Visible Primitives

This counter group shows the properties of any visible primitives, after any culling is applied.

Front-facing primitives

This counter increments for every visible front-facing triangle that survives culling.

libGPUCounters name: MaliGeomFrontFacePrim
Streamline name: $MaliVisiblePrimitivesFrontFacingPrimitives
Hardware name: FRONT_FACING

Back-facing primitives

This counter increments for every visible back-facing triangle that survives culling.

libGPUCounters name: MaliGeomBackFacePrim
Streamline name: $MaliVisiblePrimitivesBackFacingPrimitives
Hardware name: BACK_FACING

Primitive Culling

This counter group shows the absolute number of primitives that are culled by each of the culling stages in the geometry pipeline, and the number of visible primitives that are not culled by any stage.

Visible primitives

This counter increments for every visible primitive that survives all culling stages.

All fragments of the primitive might be occluded by other primitives closer to the camera, and so produce no visible output.

libGPUCounters name: MaliGeomVisiblePrim
Streamline name: $MaliPrimitiveCullingVisiblePrimitives
Hardware name: PRIM_VISIBLE

Culled primitives

This expression defines the number of primitives that were culled during the rendering process, for any reason.

For efficient 3D content, it is expected that only 50% of primitives are visible because back-face culling is used to remove half of each model.

libGPUCounters name: MaliGeomTotalCullPrim

libGPUCounters derivation:

MaliGeomFaceXYPlaneCullPrim + MaliGeomZPlaneCullPrim + MaliGeomSampleCullPrim

Streamline derivation:

$MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives + $MaliPrimitiveCullingZPlaneTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives

Hardware derivation:

PRIM_CULLED + PRIM_CLIPPED + PRIM_SAT_CULLED

Facing or XY plane test culled primitives

This counter increments for every primitive culled by the facing test, or culled by testing against the view frustum X and Y clip planes.

For an arbitrary 3D scene we would expect approximately half of the triangles to be back-facing. If you see a significantly lower percentage than this, check that the facing test is properly enabled.

It is expected that a small number of primitives are outside of the frustum extents, as application culling is never perfect and some models might intersect a frustum clip plane. If this counter is significantly higher than half of the triangles, use draw call bounding box checks to cull draws that are completely out-of-frustum.

If batched draw calls are complex and have a large bounding volume, consider using smaller batches to reduce the bounding volume to enable better culling.

libGPUCounters name: MaliGeomFaceXYPlaneCullPrim
Streamline name: $MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives
Hardware name: PRIM_CULLED

Z plane test culled primitives

This counter increments for every primitive culled by testing against the view frustum near and far clip planes.

It is expected that a small number of primitives are outside of the frustum extents, as application culling is never perfect and some models might intersect a frustum clip plane.

Use draw call bounding box checks to cull draws that are completely out-of-frustum. If batched draw calls are complex and have a large bounding volume consider using smaller batches to reduce the bounding volume to enable better culling.

libGPUCounters name: MaliGeomZPlaneCullPrim
Streamline name: $MaliPrimitiveCullingZPlaneTestCulledPrimitives
Hardware name: PRIM_CLIPPED

Sample test culled primitives

This counter increments for every primitive culled by the sample coverage test. It is expected that a few primitives are small and fail the sample coverage test, as application mesh level-of-detail selection can never be perfect. If the number of primitives counted is more than 5-10% of the total number, this might indicate that the application has a large number of very small triangles, which are very expensive for a GPU to process.

Aim to keep triangle screen area above 10 pixels. Use schemes such as mesh level-of-detail to select simplified meshes as objects move further away from the camera.

libGPUCounters name: MaliGeomSampleCullPrim
Streamline name: $MaliPrimitiveCullingSampleTestCulledPrimitives
Hardware name: PRIM_SAT_CULLED

Primitive Culling Rate

This counter group shows the percentage of primitives entering each culling stage that are culled by it, and the percentage of primitives that are visible after all culling stages.

Visible primitive rate

This expression defines the percentage of primitives that are visible after culling.

For efficient 3D content, it is expected that only 50% of primitives are visible because back-face culling is used to remove half of each model.

  • A significantly higher visibility rate indicates that the facing test might not be enabled.
  • A significantly lower visibility rate indicates that geometry is being culled for other reasons, which is often possible to optimize. Use the individual culling counters for a more detailed breakdown.
libGPUCounters name: MaliGeomVisibleRate

libGPUCounters derivation:

max(min((MaliGeomVisiblePrim / (MaliGeomFaceXYPlaneCullPrim + MaliGeomZPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomVisiblePrim)) * 100, 100), 0)

Streamline derivation:

max(min(($MaliPrimitiveCullingVisiblePrimitives / ($MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives + $MaliPrimitiveCullingZPlaneTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives)) * 100, 100), 0)

Hardware derivation:

max(min((PRIM_VISIBLE / (PRIM_CULLED + PRIM_CLIPPED + PRIM_SAT_CULLED + PRIM_VISIBLE)) * 100, 100), 0)

Facing or XY plane culled primitive rate

This expression defines the percentage of primitives entering the facing and XY plane test that are culled by it. Primitives that are outside of the view frustum in the XY axis, or that are back-facing inside the frustum, are culled by this stage.

For efficient 3D content, it is expected that 50% of primitives are culled by the facing test. If more than 50% of primitives are culled it might be because they are out-of-frustum, which can often be optimized with better software culling or batching granularity.

libGPUCounters name: MaliGeomFaceXYPlaneCullRate

libGPUCounters derivation:

max(min((MaliGeomFaceXYPlaneCullPrim / (MaliGeomFaceXYPlaneCullPrim + MaliGeomZPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomVisiblePrim)) * 100, 100), 0)

Streamline derivation:

max(min(($MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives / ($MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives + $MaliPrimitiveCullingZPlaneTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives)) * 100, 100), 0)

Hardware derivation:

max(min((PRIM_CULLED / (PRIM_CULLED + PRIM_CLIPPED + PRIM_SAT_CULLED + PRIM_VISIBLE)) * 100, 100), 0)

Z plane culled primitive rate

This expression defines the percentage of primitives entering the Z plane culling test that are culled by it. Primitives that are closer than the frustum near clip plane, or further away than the frustum far clip plane, are culled by this stage.

Seeing a significant proportion of triangles culled at this stage can be indicative of insufficient application software culling.

libGPUCounters name: MaliGeomZPlaneCullRate

libGPUCounters derivation:

max(min((MaliGeomZPlaneCullPrim / ((MaliGeomFaceXYPlaneCullPrim + MaliGeomZPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomVisiblePrim) - MaliGeomFaceXYPlaneCullPrim)) * 100, 100), 0)

Streamline derivation:

max(min(($MaliPrimitiveCullingZPlaneTestCulledPrimitives / (($MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives + $MaliPrimitiveCullingZPlaneTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives) - $MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives)) * 100, 100), 0)

Hardware derivation:

max(min((PRIM_CLIPPED / ((PRIM_CULLED + PRIM_CLIPPED + PRIM_SAT_CULLED + PRIM_VISIBLE) - PRIM_CULLED)) * 100, 100), 0)

Sample culled primitive rate

This expression defines the percentage of primitives entering the sample coverage test that are culled by it. This stage culls primitives that are so small that they hit no rasterizer sample points.

If a significant number of triangles are culled at this stage, the application is using geometry meshes that are too complex for their screen coverage. Use schemes such as mesh level-of-detail to select simplified meshes as objects move further away from the camera.

libGPUCounters name: MaliGeomSampleCullRate

libGPUCounters derivation:

max(min((MaliGeomSampleCullPrim / ((MaliGeomFaceXYPlaneCullPrim + MaliGeomZPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomVisiblePrim) - MaliGeomFaceXYPlaneCullPrim - MaliGeomZPlaneCullPrim)) * 100, 100), 0)

Streamline derivation:

max(min(($MaliPrimitiveCullingSampleTestCulledPrimitives / (($MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives + $MaliPrimitiveCullingZPlaneTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives) - $MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives - $MaliPrimitiveCullingZPlaneTestCulledPrimitives)) * 100, 100), 0)

Hardware derivation:

max(min((PRIM_SAT_CULLED / ((PRIM_CULLED + PRIM_CLIPPED + PRIM_SAT_CULLED + PRIM_VISIBLE) - PRIM_CULLED - PRIM_CLIPPED)) * 100, 100), 0)
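
The culling rate expressions form a funnel: each stage's percentage is computed against the primitives that reach it, so the denominator shrinks as primitives are removed by earlier stages. A Python sketch of the funnel, using hypothetical primitive counts:

def culling_funnel(face_xy_cull, z_cull, sample_cull, visible):
    total = face_xy_cull + z_cull + sample_cull + visible
    return {
        "facing_xy_rate": 100.0 * face_xy_cull / total,
        "z_plane_rate": 100.0 * z_cull / (total - face_xy_cull),
        "sample_rate": 100.0 * sample_cull / (total - face_xy_cull - z_cull),
        "visible_rate": 100.0 * visible / total,
    }

print(culling_funnel(face_xy_cull=520_000, z_cull=30_000, sample_cull=15_000, visible=435_000))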

Geometry Threads

This counter group shows the number of vertex shader threads of each type that are generated during vertex processing.

All vertices must be position shaded, but only visible vertices will be varying shaded.

Position shading threads

This expression defines the number of position shader thread invocations.

libGPUCounters name: MaliGeomPosShadThread

libGPUCounters derivation:

MaliGeomPosShadTask * 4

Streamline derivation:

$MaliTilerShadingRequestsPositionShadingRequests * 4

Hardware derivation:

IDVS_POS_SHAD_REQ * 4

Varying shading threads

This expression defines the number of varying shader thread invocations.

libGPUCounters name: MaliGeomVarShadThread

libGPUCounters derivation:

MaliGeomVarShadTask * 4

Streamline derivation:

$MaliTilerShadingRequestsVaryingShadingRequests * 4

Hardware derivation:

IDVS_VAR_SHAD_REQ * 4

Geometry Efficiency

This counter group shows the number of vertex shader threads of each type that are generated per primitive during vertex processing. Efficient geometry aims to keep these metrics as low as possible.

Position threads/input primitive

This expression defines the number of position shader threads per input primitive.

Efficient meshes with good vertex reuse average fewer than 1.5 vertices shaded per triangle, as vertex computation is shared by multiple primitives. Minimize this number by reusing vertices for nearby primitives, improving temporal locality of index reuse, and avoiding unused values in the active index range.

libGPUCounters name: MaliGeomPosShadThreadPerPrim

libGPUCounters derivation:

(MaliGeomPosShadTask * 4) / (MaliGeomFaceXYPlaneCullPrim + MaliGeomZPlaneCullPrim + MaliGeomSampleCullPrim + MaliGeomVisiblePrim)

Streamline derivation:

($MaliTilerShadingRequestsPositionShadingRequests * 4) / ($MaliPrimitiveCullingFacingOrXYPlaneTestCulledPrimitives + $MaliPrimitiveCullingZPlaneTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives)

Hardware derivation:

(IDVS_POS_SHAD_REQ * 4) / (PRIM_CULLED + PRIM_CLIPPED + PRIM_SAT_CULLED + PRIM_VISIBLE)

Varying threads/visible primitive

This expression defines the number of varying shader invocations per visible primitive.

Efficient meshes with good vertex reuse average fewer than 1.5 vertices shaded per triangle, as vertex computation is shared by multiple primitives. Minimize this number by reusing vertices for nearby primitives, improving temporal locality of index reuse, and avoiding unused values in the active index range.

libGPUCounters name: MaliGeomVarShadThreadPerPrim

libGPUCounters derivation:

(MaliGeomVarShadTask * 4) / MaliGeomVisiblePrim

Streamline derivation:

($MaliTilerShadingRequestsVaryingShadingRequests * 4) / $MaliPrimitiveCullingVisiblePrimitives

Hardware derivation:

(IDVS_VAR_SHAD_REQ * 4) / PRIM_VISIBLE
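
The following Python sketch combines the two derivations above: shading requests are multiplied by four to give thread counts, then divided by the primitive counts to produce the efficiency metrics, which can be checked against the 1.5 vertices-per-triangle guideline. All input values are hypothetical.

def geometry_efficiency(pos_requests, var_requests, input_prims, visible_prims):
    return {
        "pos_threads_per_input_prim": (pos_requests * 4) / input_prims,
        "var_threads_per_visible_prim": (var_requests * 4) / visible_prims,
    }

stats = geometry_efficiency(pos_requests=400_000, var_requests=180_000,
                            input_prims=1_000_000, visible_prims=435_000)
print(stats)  # well-optimized meshes average below ~1.5 threads per primitive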

Graphics Fragment Workload

Graphics workloads using the rasterization pipeline are rendered into the framebuffer to create output images.

Performance counters in this section show the workload complexity of your fragment rendering.

Output pixels

This counter group shows the total number of output pixels rendered.

Pixels

This expression defines the total number of pixels that are shaded by the GPU, including on-screen and off-screen render passes.

This measure can be a slight overestimate because it assumes that all pixels in each active 32 x 32 pixel region are shaded. If the rendered region is not aligned to 32-pixel boundaries, this metric includes pixels that are not actually shaded.

libGPUCounters name: MaliGPUPix

libGPUCounters derivation:

MaliFragQueueTask * 1024

Streamline derivation:

$MaliGPUTasksFragmentTasks * 1024

Hardware derivation:

JS0_TASKS * 1024
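
Because each fragment task covers a 32 x 32 pixel region, this estimate effectively rounds the rendered area up to 32-pixel boundaries. A Python sketch of that behavior, showing the slight overestimate for a 1920 x 1080 surface:

import math

def estimated_pixels(width, height):
    tasks = math.ceil(width / 32) * math.ceil(height / 32)
    return tasks * 1024

# 1080 rounds up to 1088 lines of coverage, so the estimate slightly exceeds
# the true pixel count, as noted above.
print(estimated_pixels(1920, 1080), 1920 * 1080)  # 2088960 vs 2073600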

Overdraw

This counter group shows the number of fragments rendered per pixel.

Fragments/pixel

This expression computes the number of fragments shaded per output pixel.

GPU processing cost per pixel accumulates with the layer count. High overdraw can build up to a significant processing cost, especially when rendering to a high-resolution framebuffer. Minimize overdraw by rendering opaque objects front-to-back and minimizing use of blended transparent layers.

libGPUCounters name: MaliFragOverdraw

libGPUCounters derivation:

(MaliFragWarp * 4) / (MaliFragQueueTask * 1024)

Streamline derivation:

($MaliShaderWarpsFragmentWarps * 4) / ($MaliGPUTasksFragmentTasks * 1024)

Hardware derivation:

(FRAG_WARPS * 4) / (JS0_TASKS * 1024)

Workload Cost

Workload cost metrics give an average throughput per item of work processed by the GPU.

Performance counters in this section can be used to track average performance against budget, and to monitor the impact of application changes over time.

Average Workload Cost

This counter group gives the average cycle throughput for the different kinds of workloads the GPU is running.

When running workloads in parallel, the shader core is shared, so these throughput metrics will be impacted by cross-talk between the queues. However, they are still a useful tool for managing performance budgets.

GPU cycles/pixel

This expression defines the average number of GPU cycles being spent per pixel rendered. This includes the cost of all shader stages.

It is a useful exercise to set a cycle budget for each render pass in your application, based on your target resolution and frame rate. Rendering 1080p60 is possible on an entry-level device, but you have a small number of cycles per pixel to work with, so you must use them efficiently.

libGPUCounters name: MaliGPUCyPerPix

libGPUCounters derivation:

MaliGPUActiveCy / (MaliFragQueueTask * 1024)

Streamline derivation:

$MaliGPUCyclesGPUActive / ($MaliGPUTasksFragmentTasks * 1024)

Hardware derivation:

GPU_ACTIVE / (JS0_TASKS * 1024)
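
As a sketch of the budgeting exercise described above, the following Python snippet derives a cycles-per-pixel budget from an assumed GPU clock, target resolution, and frame rate, and compares it against a measured value. The clock frequency and measured value are hypothetical.

def cycles_per_pixel_budget(gpu_hz, width, height, fps):
    return gpu_hz / (width * height * fps)

budget = cycles_per_pixel_budget(gpu_hz=850_000_000, width=1920, height=1080, fps=60)
measured = 9.4  # hypothetical MaliGPUCyPerPix value from a capture
print(f"Budget: {budget:.1f} cycles/pixel, measured: {measured:.1f}")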

Shader cycles/non-fragment thread

This expression defines the average number of shader core cycles per non-fragment thread.

This measurement captures the overall shader core throughput, not the shader processing cost. It will be impacted by cycles lost to stalls that could not be hidden by other processing. In addition, it will be impacted by any fragment workloads that are running concurrently in the shader core.

libGPUCounters name: MaliNonFragThroughputCy

libGPUCounters derivation:

MaliNonFragActiveCy / (MaliNonFragWarp * 4)

Streamline derivation:

$MaliShaderCoreCyclesNonFragmentActive / ($MaliShaderWarpsNonFragmentWarps * 4)

Hardware derivation:

COMPUTE_ACTIVE / (COMPUTE_WARPS * 4)

Shader cycles/fragment thread

This expression defines the average number of shader core cycles per fragment thread.

This measurement captures the overall shader core throughput, not the shader processing cost. It will be impacted by cycles lost to stalls that could not be hidden by other processing. In addition, it will be impacted by any non-fragment workloads that are running concurrently in the shader core.

libGPUCounters name: MaliFragThroughputCy

libGPUCounters derivation:

MaliFragActiveCy / (MaliFragWarp * 4)

Streamline derivation:

$MaliShaderCoreCyclesFragmentActive / ($MaliShaderWarpsFragmentWarps * 4)

Hardware derivation:

FRAG_ACTIVE / (FRAG_WARPS * 4)

Shader Core Front-end

The shader core front-ends are the internal interfaces inside the GPU that accept tasks from other parts of the GPU and turn them into shader threads running in the programmable core.

Each shader core has two front-ends:

  • Non-fragment front-end for all non-fragment tasks, including compute, vertex shading, and advanced geometry.
  • Fragment front-end for all fragment tasks.

The front-ends show as active until task processing is complete, so front-end activity is a direct way of measuring that the shader core is busy handling a workload.

The Execution engine is the programmable core at the heart of the shader core hardware. The Execution engine shows as active if there is at least one thread running, and monitoring its activity is an indirect way of checking that the front-ends are managing to keep the GPU busy.

Performance counters in this section measure the overall workload scheduling for the shader core, showing how busy the shader core is. Note that front-end counters can tell you that a task was scheduled but cannot tell you how heavily the programmable core is being used.

Shader Core Cycles

This counter group shows the scheduling load on the shader core, indicating which of the shader core front-ends have work scheduled and whether they are running threads on the programmable core.

Non-fragment active

This counter increments every clock cycle when the shader core is processing some non-fragment workload. Active processing includes any cycle that non-fragment work is queued in the fixed-function front-end or programmable core.

libGPUCounters name: MaliNonFragActiveCy
Streamline name: $MaliShaderCoreCyclesNonFragmentActive
Hardware name: COMPUTE_ACTIVE

Fragment active

This counter increments every clock cycle when the shader core is processing some fragment workload. Active processing includes any cycle that fragment work is running anywhere in the fixed-function front-end, fixed-function back-end, or programmable core.

libGPUCounters name: MaliFragActiveCy
Streamline name: $MaliShaderCoreCyclesFragmentActive
Hardware name: FRAG_ACTIVE

Fragment pre-pipe buffer active

This counter increments every clock cycle when the pre-pipe quad queue contains at least one quad waiting to run. If this queue completely drains, a fragment warp cannot be spawned when space for new threads becomes available in the shader core. You can experience reduced performance when low thread occupancy starves the functional units of work to process.

Possible causes for this include:

  • Tiles which contain no geometry, which are commonly encountered when creating shadow maps, where many tiles contain no shadow casters.
  • Tiles that contain a lot of geometry, which is then killed by early ZS or hidden surface removal.

libGPUCounters name: MaliFragFPKActiveCy
Streamline name: $MaliShaderCoreCyclesFragmentPrePipeBufferActive
Hardware name: FRAG_FPK_ACTIVE

Execution core active

This counter increments every clock cycle when the shader core is processing at least one warp. Note that this counter does not provide detailed information about how the functional units are utilized inside the shader core, but simply gives an indication that something was running.

libGPUCounters name: MaliCoreActiveCy
Streamline name: $MaliShaderCoreCyclesExecutionCoreActive
Hardware name: EXEC_CORE_ACTIVE

Shader Core Utilization

This counter group shows the scheduling load on the shader core, normalized against the overall shader core activity.

Non-fragment utilization

This expression defines the percentage utilization of the shader core non-fragment path. This counter measures any cycle that a non-fragment workload is active in the fixed-function front-end or programmable core.

libGPUCounters name: MaliNonFragUtil

libGPUCounters derivation:

max(min((MaliNonFragActiveCy / MALI_CONFIG_SHADER_CORE_COUNT / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliShaderCoreCyclesNonFragmentActive / $MaliConstantsShaderCoreCount / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((COMPUTE_ACTIVE / MALI_CONFIG_SHADER_CORE_COUNT / GPU_ACTIVE) * 100, 100), 0)
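
All of the expressions in this group share the same clamped-ratio pattern, normalizing per-core activity against GPU active cycles. A minimal sketch of that pattern is shown below, using hypothetical counter deltas and a hypothetical shader core count.

  /* Sketch of the clamped utilization pattern used throughout this group.
   * Counter deltas and the shader core count are hypothetical examples. */
  #include <stdio.h>

  static double clamp_percent(double x)
  {
      if (x < 0.0)   return 0.0;
      if (x > 100.0) return 100.0;
      return x;
  }

  int main(void)
  {
      double compute_active = 3000000.0; /* COMPUTE_ACTIVE, summed over all cores */
      double core_count     = 12.0;      /* MALI_CONFIG_SHADER_CORE_COUNT         */
      double gpu_active     = 1000000.0; /* GPU_ACTIVE                            */

      double util = clamp_percent((compute_active / core_count / gpu_active) * 100.0);
      printf("Non-fragment utilization: %.1f%%\n", util);
      return 0;
  }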

Fragment utilization

This expression defines the percentage utilization of the shader core fragment path. This counter measures any cycle that a fragment workload is active in the fixed-function front-end, fixed-function back-end, or programmable core.

libGPUCounters name: MaliFragUtil

libGPUCounters derivation:

max(min((MaliFragActiveCy / MALI_CONFIG_SHADER_CORE_COUNT / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliShaderCoreCyclesFragmentActive / $MaliConstantsShaderCoreCount / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_ACTIVE / MALI_CONFIG_SHADER_CORE_COUNT / GPU_ACTIVE) * 100, 100), 0)

Fragment pre-pipe buffer utilization

This expression defines the percentage of cycles when the pre-pipe quad buffer contains at least one fragment quad. This buffer is located after early ZS but before the programmable core.

During fragment shading this counter must be close to 100%. This indicates that the fragment front-end is able to keep up with the shader core shading performance. This counter commonly drops below 100% for three reasons:

  • The running workload has many empty tiles with no geometry to render. Empty tiles are common in shadow maps, corresponding to a screen region with no shadow casters, so this might not be avoidable.
  • The application consists of simple shaders but a high percentage of microtriangles. This combination causes the shader core to shade fragments faster than they are rasterized, so the quad buffer drains.
  • The application consists of geometry which stalls at early ZS because of a dependency on an earlier fragment layer which is still in flight. Stalled layers prevent new fragments entering the quad buffer, so the quad buffer drains.

libGPUCounters name: MaliFragFPKBUtil

libGPUCounters derivation:

max(min((MaliFragFPKActiveCy / MaliFragActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliShaderCoreCyclesFragmentPrePipeBufferActive / $MaliShaderCoreCyclesFragmentActive) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_FPK_ACTIVE / FRAG_ACTIVE) * 100, 100), 0)

Execution core utilization

This expression defines the percentage utilization of the programmable core, measuring cycles when the shader core contains at least one warp. A low utilization here indicates lost performance, because there are spare shader core cycles that are unused.

In some use cases an idle core is unavoidable. For example, a clear color tile that contains no shaded geometry, or a shadow map that is resolved entirely using early ZS depth updates.

Improve programmable core utilization by parallel processing of the non-fragment and fragment queues, running overlapping workloads from multiple render passes. Also aim to keep the FPK buffer utilization as high as possible, ensuring constant forward-pressure on fragment shading.

libGPUCounters name: MaliCoreUtil

libGPUCounters derivation:

max(min((MaliCoreActiveCy / MALI_CONFIG_SHADER_CORE_COUNT / MaliGPUActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliShaderCoreCyclesExecutionCoreActive / $MaliConstantsShaderCoreCount / $MaliGPUCyclesGPUActive) * 100, 100), 0)

Hardware derivation:

max(min((EXEC_CORE_ACTIVE / MALI_CONFIG_SHADER_CORE_COUNT / GPU_ACTIVE) * 100, 100), 0)

Shader Core Tasks

This counter group shows the number of tasks processed by the shader cores. Task sizes for compute tasks are variable, so this is not expected to be a useful measure of workload.

Non-fragment tasks

This counter increments for every non-fragment task issued to the shader core. The size of these tasks is variable.

libGPUCounters name: MaliNonFragTask
Streamline name: $MaliShaderCoreTasksNonFragmentTasks
Hardware name: COMPUTE_TASKS

Shader Core Fragment Front-end

The shader core fragment front-end is a complex multi-stage pipeline that converts an incoming primitive stream for a screen-space tile into fragment threads that need to be shaded. The fragment front-end handles rasterization, early depth (Z) and stencil (S) testing, and hidden surface removal (HSR).

Performance counters in this section measure how the incoming stream was turned into quads, and how efficiently those quads interacted with ZS testing and HSR.

Fragment Tiles

This counter group shows the number of fragment tiles processed by the shader cores.

Tiles

This counter increments for every tile processed by the shader core. Note that tiles are normally 16 x 16 pixels but can vary depending on per-pixel storage requirements and the tile buffer size of the current GPU.

This GPU supports full size tiles when using up to and including 256 bits per pixel of color storage. Pixel storage requirements depend on the number of color attachments, their data format, and the number of multi-sampling samples per pixel.

The most accurate way to get the total pixel count rendered by the application is to use the Fragment tasks counter, because it always counts 32 x 32 pixel regions.

libGPUCounters name: MaliFragTile
Streamline name: $MaliFragmentTilesTiles
Hardware name: FRAG_PTILES

Killed unchanged tiles

This counter increments for every 16x16 pixel tile or tile sub-region killed by a transaction elimination CRC check, where the data is the same as the content already stored in memory.

libGPUCounters name: MaliFragTileKill
Streamline name: $MaliFragmentTilesKilledUnchangedTiles
Hardware name: FRAG_TRANS_ELIM

Fragment Primitives

This counter group shows how the fragment front-end handles the incoming primitive stream from the tile list built during the binning phase.

Large primitives will be read in multiple tiles and will therefore cause multiple increments to these counter values. These counters will not match the input primitive counts passed in by the application.

Loaded primitives

This counter increments for every primitive loaded from the tile list by the fragment front-end. This increments per tile, which means that a single primitive that spans multiple tiles is counted multiple times. If you want to know the total number of primitives in the scene, refer to the Total input primitives expression.

libGPUCounters name: MaliFragRdPrim
Streamline name: $MaliFragmentPrimitivesLoadedPrimitives
Hardware name: FRAG_PRIMITIVES

Rasterized primitives

This counter increments for every primitive entering the rasterization unit for each tile shaded. This increments per tile, which means that a single primitive that spans multiple tiles is counted multiple times. If you want to know the total number of primitives in the scene refer to the Total input primitives expression.

libGPUCounters name: MaliFragRastPrim
Streamline name: $MaliFragmentPrimitivesRasterizedPrimitives
Hardware name: FRAG_PRIM_RAST

Fragment Quads

This counter group shows how the rasterizer turns the incoming primitive stream into 2x2 sample quads for shading.

Rasterized fine quads

This counter increments for every fine quad generated by the rasterization phase. A fine quad covers a 2x2 pixel screen region. The quads generated have at least some coverage based on the current sample pattern, but can subsequently be killed by early ZS testing or hidden surface removal before they are shaded.

In this GPU, this counter has an erratum that causes it to overcount by 2x when using 8x MSAA, and by 4x when using 16x MSAA.

libGPUCounters name: MaliFragRastQd
Streamline name: $MaliFragmentQuadsRasterizedFineQuads
Hardware name: FRAG_QUADS_RAST

Shaded coarse quads

This expression defines the number of 2x2 fragment quads that are spawned as executing threads in the shader core.

libGPUCounters name: MaliFragShadedQd

libGPUCounters derivation:

MaliFragWarp

Streamline derivation:

$MaliShaderWarpsFragmentWarps

Hardware derivation:

FRAG_WARPS

Fragment ZS Quads

This counter group shows how the depth (Z) and stencil (S) test unit handles quads for early and late ZS test and update.

Early ZS tested quads

This counter increments for every quad undergoing early depth and stencil testing.

For maximum performance, this number must be close to the total number of input quads. We want as many of the input quads as possible to be subject to early ZS testing because early ZS testing is significantly more efficient than late ZS testing, which only kills threads after they have been shaded.

libGPUCounters name: MaliFragEZSTestQd
Streamline name: $MaliFragmentZSQuadsEarlyZSTestedQuads
Hardware name: FRAG_QUADS_EZS_TEST

Early ZS killed quads

This counter increments for every quad killed by early depth and stencil testing.

Quads killed at this stage are killed before shading, so a high percentage here is not generally a performance problem. However, it can indicate an opportunity to use software culling techniques such as portal culling to avoid sending occluded geometry to the GPU.

libGPUCounters name: MaliFragEZSKillQd
Streamline name: $MaliFragmentZSQuadsEarlyZSKilledQuads
Hardware name: FRAG_QUADS_EZS_KILL

Early ZS updated quads

This counter increments for every quad undergoing early depth and stencil testing that can update the framebuffer. Quads that have a depth value that depends on shader behavior, or those that have indeterminate coverage because of use of alpha-to-coverage or discard statements in the shader, might be early ZS tested but can not do an early ZS update.

For maximum performance, this number must be close to the total number of input quads. Aim to maximize the number of quads that are capable of doing an early ZS update.

libGPUCounters name: MaliFragEZSUpdateQd
Streamline name: $MaliFragmentZSQuadsEarlyZSUpdatedQuads
Hardware name: FRAG_QUADS_EZS_UPDATE

FPK HSR killed quads

This expression defines the number of quads that are killed by the Forward Pixel Kill (FPK) hidden surface removal scheme.

It is good practice to sort opaque geometry so that the geometry is rendered front-to-back with depth testing enabled. This enables more geometry to be killed by early ZS testing instead of FPK, which removes the work earlier in the pipeline.

Quads killed at this stage are killed before shading, so a high percentage here is not generally a performance problem. However, it can indicate an opportunity to use software culling techniques such as portal culling to avoid sending occluded geometry to the GPU.

libGPUCounters name: MaliFragFPKKillQd

libGPUCounters derivation:

MaliFragRastQd - MaliFragEZSKillQd - MaliFragWarp

Streamline derivation:

$MaliFragmentQuadsRasterizedFineQuads - $MaliFragmentZSQuadsEarlyZSKilledQuads - $MaliShaderWarpsFragmentWarps

Hardware derivation:

FRAG_QUADS_RAST - FRAG_QUADS_EZS_KILL - FRAG_WARPS
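
Taken together, these quad counters form a simple funnel: every rasterized quad is either killed at early ZS, killed by FPK, or shaded as a warp. The minimal sketch below shows that accounting, using hypothetical counter deltas.

  /* Sketch of the fragment quad funnel: rasterized quads are either killed by
   * early ZS, killed by FPK hidden surface removal, or shaded as warps.
   * Counter deltas are hypothetical example values. */
  #include <stdio.h>

  int main(void)
  {
      double rast_quads = 4000000.0; /* FRAG_QUADS_RAST           */
      double ezs_kills  = 1200000.0; /* FRAG_QUADS_EZS_KILL       */
      double frag_warps = 2000000.0; /* FRAG_WARPS (shaded quads) */

      double fpk_kills = rast_quads - ezs_kills - frag_warps;

      printf("Early ZS kill rate: %.1f%%\n", 100.0 * ezs_kills / rast_quads);
      printf("FPK kill rate:      %.1f%%\n", 100.0 * fpk_kills / rast_quads);
      printf("Shaded quad rate:   %.1f%%\n", 100.0 * frag_warps / rast_quads);
      return 0;
  }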

Late ZS killed quads

This counter increments for every quad killed by late depth and stencil testing.

libGPUCounters name: MaliFragLZSKillQd
Streamline name: $MaliFragmentZSQuadsLateZSKilledQuads
Hardware name: FRAG_LZS_KILL

Late ZS tested quads

This counter increments for every quad undergoing late depth and stencil testing.

libGPUCounters name: MaliFragLZSTestQd
Streamline name: $MaliFragmentZSQuadsLateZSTestedQuads
Hardware name: FRAG_LZS_TEST

ZS Unit Test Rate

This counter group shows the relative numbers of quads doing early and late depth (Z) and stencil (S) testing.

Early ZS kill rate

This expression defines the percentage of rasterized quads that are killed by early depth and stencil testing.

Quads killed at this stage are killed before shading, so a high percentage here is not generally a performance problem. However, it can indicate an opportunity to use software culling techniques such as portal culling to avoid sending occluded geometry to the GPU.

libGPUCounters name: MaliFragEZSKillRate

libGPUCounters derivation:

max(min((MaliFragEZSKillQd / MaliFragRastQd) * 100, 100), 0)

Streamline derivation:

max(min(($MaliFragmentZSQuadsEarlyZSKilledQuads / $MaliFragmentQuadsRasterizedFineQuads) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_QUADS_EZS_KILL / FRAG_QUADS_RAST) * 100, 100), 0)

Early ZS test rate

This expression defines the percentage of rasterized quads that were subjected to early depth and stencil testing.

To achieve the best early test rates, enable depth testing, and avoid draw calls with modifiable coverage or draw calls with fragment shader programs that write to their depth value.

libGPUCounters name: MaliFragEZSTestRate

libGPUCounters derivation:

max(min((MaliFragEZSTestQd / MaliFragRastQd) * 100, 100), 0)

Streamline derivation:

max(min(($MaliFragmentZSQuadsEarlyZSTestedQuads / $MaliFragmentQuadsRasterizedFineQuads) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_QUADS_EZS_TEST / FRAG_QUADS_RAST) * 100, 100), 0)

Early ZS update rate

This expression defines the percentage of rasterized quads that update the framebuffer during early depth and stencil testing.

To achieve the best early test rates, enable depth testing, and avoid draw calls with modifiable coverage or draw calls with fragment shader programs that write to their depth value.

libGPUCounters name: MaliFragEZSUpdateRate

libGPUCounters derivation:

max(min((MaliFragEZSUpdateQd / MaliFragRastQd) * 100, 100), 0)

Streamline derivation:

max(min(($MaliFragmentZSQuadsEarlyZSUpdatedQuads / $MaliFragmentQuadsRasterizedFineQuads) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_QUADS_EZS_UPDATE / FRAG_QUADS_RAST) * 100, 100), 0)

Occluding quad rate

This expression defines the percentage of rasterized quads surviving early depth and stencil testing that are valid hidden surface removal occluders.

libGPUCounters name: MaliFragOpaqueQdRate

libGPUCounters derivation:

max(min((MaliFragOpaqueQd / (MaliFragRastQd - MaliFragEZSKillQd)) * 100, 100), 0)

Streamline derivation:

max(min(($MaliFragmentFPKHSRQuadsOccludingQuads / ($MaliFragmentQuadsRasterizedFineQuads - $MaliFragmentZSQuadsEarlyZSKilledQuads)) * 100, 100), 0)

Hardware derivation:

max(min((QUAD_FPK_KILLER / (FRAG_QUADS_RAST - FRAG_QUADS_EZS_KILL)) * 100, 100), 0)

FPK HSR kill rate

This expression defines the percentage of rasterized quads that are killed by the Forward Pixel Kill (FPK) hidden surface removal scheme.

Quads killed at this stage are killed before shading, so a high percentage here is not generally a performance problem. However, it can indicate an opportunity to use software culling techniques such as portal culling to avoid sending occluded geometry to the GPU.

libGPUCounters name: MaliFragFPKKillRate

libGPUCounters derivation:

max(min(((MaliFragRastQd - MaliFragEZSKillQd - MaliFragWarp) / MaliFragRastQd) * 100, 100), 0)

Streamline derivation:

max(min((($MaliFragmentQuadsRasterizedFineQuads - $MaliFragmentZSQuadsEarlyZSKilledQuads - $MaliShaderWarpsFragmentWarps) / $MaliFragmentQuadsRasterizedFineQuads) * 100, 100), 0)

Hardware derivation:

max(min(((FRAG_QUADS_RAST - FRAG_QUADS_EZS_KILL - FRAG_WARPS) / FRAG_QUADS_RAST) * 100, 100), 0)

Late ZS kill rate

This expression defines the percentage of rasterized quads that are killed by late depth and stencil testing. Quads killed by late ZS testing run at least some of their fragment program before being killed.

A high percentage of fragments being killed by ZS can be a source of redundant processing. You achieve the lowest late test rates by avoiding draw calls with modifiable coverage, or with shader programs that write to their depth value or that have memory-visible side-effects.

The driver uses a late ZS update and kill sequence to preload a depth or stencil attachment at the start of a render pass, which is needed if the render pass does not start from a cleared value. Always start from a cleared value whenever possible.

libGPUCounters name: MaliFragLZSKillRate

libGPUCounters derivation:

max(min((MaliFragLZSKillQd / MaliFragRastQd) * 100, 100), 0)

Streamline derivation:

max(min(($MaliFragmentZSQuadsLateZSKilledQuads / $MaliFragmentQuadsRasterizedFineQuads) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_LZS_KILL / FRAG_QUADS_RAST) * 100, 100), 0)

Late ZS test rate

This expression defines the percentage of rasterized quads that are tested by late depth and stencil testing.

A high percentage of fragments performing a late ZS update can cause slow performance, even if fragments are not killed. Younger fragments cannot complete early ZS until all older fragments at the same coordinate have completed their late ZS operations, which can cause stalls.

You achieve the lowest late test rates by avoiding draw calls with modifiable coverage, or with shader programs that write to their depth value or that have memory-visible side-effects.

libGPUCounters name: MaliFragLZSTestRate

libGPUCounters derivation:

max(min((MaliFragLZSTestQd / MaliFragRastQd) * 100, 100), 0)

Streamline derivation:

max(min(($MaliFragmentZSQuadsLateZSTestedQuads / $MaliFragmentQuadsRasterizedFineQuads) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_LZS_TEST / FRAG_QUADS_RAST) * 100, 100), 0)

Fragment FPK HSR Quads

This counter group shows how many of the generated quads are eligible to be occluders for the Forward Pixel Kill (FPK) hidden surface removal scheme.

Non-occluding quads

This expression defines the number of quads that are not candidates for being hidden surface removal occluders. To be eligible, a quad must be guaranteed to be opaque and resolvable at early ZS.

Draw calls that use blending, shader discard, alpha-to-coverage, programmable depth, or programmable tile buffer access can not be occluders. Aim to minimize the number of transparent quads by disabling blending when it is not required.

libGPUCounters name: MaliFragTransparentQd

libGPUCounters derivation:

MaliFragRastQd - MaliFragEZSKillQd - MaliFragOpaqueQd

Streamline derivation:

$MaliFragmentQuadsRasterizedFineQuads - $MaliFragmentZSQuadsEarlyZSKilledQuads - $MaliFragmentFPKHSRQuadsOccludingQuads

Hardware derivation:

FRAG_QUADS_RAST - FRAG_QUADS_EZS_KILL - QUAD_FPK_KILLER

Occluding quads

This counter increments for every quad that is a valid occluder for hidden surface removal. To be a candidate occluder, a quad must be guaranteed to be opaque and to have fully resolved at early ZS.

Draw calls that use blending, shader discard, alpha-to-coverage, programmable depth, or programmable tile buffer access can not be occluders.

libGPUCounters name: MaliFragOpaqueQd
Streamline name: $MaliFragmentFPKHSRQuadsOccludingQuads
Hardware name: QUAD_FPK_KILLER

Fragment Workload Properties

This counter group shows properties of the fragment front-end workload that can identify specific application optimization opportunities.

Partial coverage rate

This expression defines the percentage of fragment warps that contain samples with no coverage. A high percentage can indicate that the content has a high density of small triangles, which are expensive to process. To avoid this, use mesh level-of-detail algorithms to select simpler meshes as objects move further from the camera.

libGPUCounters name: MaliFragPartWarpRate

libGPUCounters derivation:

max(min((MaliFragPartWarp / MaliFragWarp) * 100, 100), 0)

Streamline derivation:

max(min(($MaliShaderWarpsPartialFragmentWarps / $MaliShaderWarpsFragmentWarps) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_PARTIAL_WARPS / FRAG_WARPS) * 100, 100), 0)

Unchanged tile kill rate

This expression defines the percentage of tiles that are killed by the transaction elimination CRC check because the content of a tile matches the content already stored in memory.

A high percentage of tile writes being killed indicates that a significant part of the framebuffer is static from frame to frame. Consider using scissor rectangles to reduce the area that is redrawn. To help manage partial frame updates for window surfaces, consider using EGL extensions such as the following (a usage sketch follows this counter's derivations below):

  • EGL_KHR_partial_update
  • EGL_EXT_swap_buffers_with_damage

libGPUCounters name: MaliFragTileKillRate

libGPUCounters derivation:

max(min((MaliFragTileKill / MaliFragTile) * 100, 100), 0)

Streamline derivation:

max(min(($MaliFragmentTilesKilledUnchangedTiles / $MaliFragmentTilesTiles) * 100, 100), 0)

Hardware derivation:

max(min((FRAG_TRANS_ELIM / FRAG_PTILES) * 100, 100), 0)
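
The following is a minimal, hedged sketch of declaring a frame's damage region through EGL_EXT_swap_buffers_with_damage. Extension-presence checks and error handling are omitted, and the damage rectangle is a hypothetical example region.

  /* Minimal sketch: declaring a damage region with
   * EGL_EXT_swap_buffers_with_damage. Error handling is omitted and the
   * damage rectangle is a hypothetical example region. */
  #include <EGL/egl.h>
  #include <EGL/eglext.h>

  void swap_with_damage(EGLDisplay dpy, EGLSurface surface)
  {
      PFNEGLSWAPBUFFERSWITHDAMAGEEXTPROC swap_damage =
          (PFNEGLSWAPBUFFERSWITHDAMAGEEXTPROC)
              eglGetProcAddress("eglSwapBuffersWithDamageEXT");

      /* One damage rectangle: x, y, width, height. */
      EGLint rects[4] = { 0, 0, 256, 256 };

      if (swap_damage != NULL) {
          swap_damage(dpy, surface, rects, 1);
      } else {
          eglSwapBuffers(dpy, surface); /* extension unavailable: full swap */
      }
  }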

Shader Core Programmable Core

The programmable core is responsible for executing shader programs. This generation of Arm GPUs is warp-based, scheduling multiple threads from the same program in lockstep to improve energy efficiency.

The programmable core is a massively multi-threaded core, allowing many concurrently resident warps, which provides a level of tolerance to cache misses and data fetch latency. For most applications having more threads resident improves performance, as it increases the number of threads available for latency hiding, but it might decrease performance if the additional threads cause cache thrashing.

The core is built from multiple independent hardware units, which can simultaneously process workloads from any of the resident threads. The most heavily loaded unit sets the upper bound on performance, with the other units running in parallel to it.

Performance counters in this section show the overall utilization of the different hardware units, making it easier to identify the units that are likely to be on the critical path.

Shader Core Unit Utilization

This counter group shows the use of each of the functional units inside the shader core, relative to their speed-of-light capability.

These units can run in parallel, and well-performing content can expect peak load to be above 80% utilization on the most heavily used units. In this scenario, reducing use of those units is likely to improve application performance.

If no unit is heavily loaded, it implies that the shader core is starving for work. This can be because not enough threads are getting spawned by the front-end, or because threads in the core are blocked on memory access. Other counters can help determine which of these situations is occurring.

Arithmetic unit utilization

This expression defines the percentage utilization of the arithmetic unit in the programmable core.

The most effective technique for reducing arithmetic load is reducing the complexity of your shader programs. Using narrower 8 and 16-bit data types can also help, as it allows multiple operations to be processed in parallel.

libGPUCounters name: MaliALUUtil

libGPUCounters derivation:

max(min((MaliEngInstr / MaliCoreActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliALUInstructionsExecutedInstructions / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)

Hardware derivation:

max(min((EXEC_INSTR_COUNT / EXEC_CORE_ACTIVE) * 100, 100), 0)

Load/store unit utilization

This expression defines the percentage utilization of the load/store unit. The load/store unit is used for general-purpose memory accesses, including vertex attribute access, buffer access, work group shared memory access, and stack access. This unit also implements imageLoad/Store and atomic access functionality.

For traditional graphics content the most significant contributor to load/store usage is vertex data. Arm recommends simplifying mesh complexity, using fewer triangles, fewer vertices, and fewer bytes per vertex.

Shaders that spill to stack are also expensive, as any spilling is multiplied by the large number of parallel threads that are running. You can use the Mali Offline Compiler to check your shaders for spilling.

libGPUCounters name: MaliLSUtil

libGPUCounters derivation:

max(min(((MaliLSFullRd + MaliLSPartRd + MaliLSFullWr + MaliLSPartWr + MaliLSAtomic) / MaliCoreActiveCy) * 100, 100), 0)

Streamline derivation:

max(min((($MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads + $MaliLoadStoreUnitCyclesFullWrites + $MaliLoadStoreUnitCyclesPartialWrites + $MaliLoadStoreUnitCyclesAtomicAccesses) / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)

Hardware derivation:

max(min(((LS_MEM_READ_FULL + LS_MEM_READ_SHORT + LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT + LS_MEM_ATOMIC) / EXEC_CORE_ACTIVE) * 100, 100), 0)

Varying unit utilization

This expression defines the percentage utilization of the varying unit.

The most effective technique for reducing varying load is reducing the number of interpolated values read by the fragment shading. Increasing shader usage of 16-bit input variables also helps, as they can be interpolated at twice the speed of 32-bit variables.

libGPUCounters name: MaliVarUtil

libGPUCounters derivation:

max(min(((MaliVar32IssueSlot + MaliVar16IssueSlot) / MaliCoreActiveCy) * 100, 100), 0)

Streamline derivation:

max(min((($MaliVaryingUnitRequests32BitInterpolationSlots + $MaliVaryingUnitRequests16BitInterpolationSlots) / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)

Hardware derivation:

max(min(((VARY_SLOT_32 + VARY_SLOT_16) / EXEC_CORE_ACTIVE) * 100, 100), 0)

Texture unit utilization

This expression defines the percentage utilization of the texturing unit.

The most effective technique for reducing texturing unit load is reducing the number of texture samples read by your shaders. Using 32bpp color formats, and the ASTC decode mode extensions to select a 32bpp intermediate precision, can reduce cache access cost. Using simpler texture filters can reduce filtering cost. Using a 16-bit per component sampler result can reduce data return cost.

libGPUCounters name: MaliTexUtil

libGPUCounters derivation:

max(min((MaliTexFiltIssueCy / MaliCoreActiveCy) * 100, 100), 0)

Streamline derivation:

max(min(($MaliTextureUnitCyclesFilteringActive / $MaliShaderCoreCyclesExecutionCoreActive) * 100, 100), 0)

Hardware derivation:

max(min((TEX_COORD_ISSUE / EXEC_CORE_ACTIVE) * 100, 100), 0)

Shader Core Stall Cycles

This counter group shows the number of cycles that the shader core is able to accept new warps, but the front-end has no new warp ready to run. This might be because the front-end is a bottleneck, or because the workload requires no warps to be spawned.

Execution engine starvation

This counter increments every clock cycle when the programmable core contains resident threads and is ready to accept a new instruction, but no thread is ready to issue one. This typically occurs when all threads are blocked waiting for the result from an asynchronous processing operation, such as a texture filtering operation or a data fetch from memory.

libGPUCounters name: MaliEngStarveCy
Streamline name: $MaliShaderCoreStallCyclesExecutionEngineStarvation
Hardware name: EXEC_INSTR_STARVING

Shader Core Workload

The programmable core runs the shader program threads that generate the desired application output.

Performance counters in this section show how the programmable core converts incoming work into the threads and warps running in the shader core, as well as other important properties of the running workload such as warp divergence.

Shader Warps

This counter group shows the number of warps created, split by type. This can help you to understand the running workload mix.

Non-fragment warps

This counter increments for every created non-fragment warp. For this GPU, a warp contains 4 threads.

For compute shaders, make the work group size a multiple of the warp size to ensure full utilization of the warp capacity.

libGPUCounters name: MaliNonFragWarp
Streamline name: $MaliShaderWarpsNonFragmentWarps
Hardware name: COMPUTE_WARPS
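
As a hedged illustration of this rule, the sketch below checks whether a hypothetical compute work group size fills warps of 4 threads exactly.

  /* Sketch: checking that a compute work group fully populates 4-thread warps.
   * The work group dimensions are hypothetical example values. */
  #include <stdio.h>

  #define WARP_SIZE 4

  int main(void)
  {
      int local_x = 8, local_y = 8, local_z = 1; /* hypothetical work group size */
      int threads = local_x * local_y * local_z;

      if (threads % WARP_SIZE != 0)
          printf("%d threads: last warp is partially filled\n", threads);
      else
          printf("%d threads: fills %d warps exactly\n", threads, threads / WARP_SIZE);

      return 0;
  }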

Fragment warps

This counter increments for every created fragment warp. For this GPU, a warp contains 4 threads.

Fragment warps are populated with fragment quads, where each quad corresponds to a 2x2 fragment region from a single triangle. Threads in a quad which correspond to a sample point outside of the triangle still consume shader resource, which makes small triangles disproportionately expensive.

libGPUCounters name: MaliFragWarp
Streamline name: $MaliShaderWarpsFragmentWarps
Hardware name: FRAG_WARPS

Partial fragment warps

This counter increments for every created fragment warp containing helper threads that do not correspond to a hit sample point. Partial coverage in a fragment quad occurs if any of its sample points span the edge of a triangle, or if one or more covered sample points fail an early ZS test. Partial coverage in a warp occurs if any quads it contains have partial coverage.

libGPUCounters name: MaliFragPartWarp
Streamline name: $MaliShaderWarpsPartialFragmentWarps
Hardware name: FRAG_PARTIAL_WARPS

Shader Threads

This counter group shows the number of threads created, split by type. This can help you to understand the running workload mix.

Counters in this group are derived by scaling quad or warp counters, and their counts will include unused thread slots in the coarser granule.

Non-fragment threads

This expression defines the number of non-fragment threads started.

libGPUCounters name: MaliNonFragThread

libGPUCounters derivation:

MaliNonFragWarp * 4

Streamline derivation:

$MaliShaderWarpsNonFragmentWarps * 4

Hardware derivation:

COMPUTE_WARPS * 4

Fragment threads

This expression defines the number of fragment threads started. This expression is an approximation, based on the assumption that all warps are fully populated with threads. The Partial fragment warps counter can indicate how close this approximation is.

libGPUCounters name: MaliFragThread

libGPUCounters derivation:

MaliFragWarp * 4

Streamline derivation:

$MaliShaderWarpsFragmentWarps * 4

Hardware derivation:

FRAG_WARPS * 4

Shader Workload Properties

This counter group shows properties of the running shader code, most of which highlight a specific optimization opportunity.

Warp divergence rate

This expression defines the percentage of instructions that have control flow divergence across the warp.

Control flow divergence can reduce performance, because only some lanes of the warp are active while diverged. Minimize divergence by using warp-uniform branch decisions for conditional checks and loop limits.

libGPUCounters name: MaliEngDivergedInstrRate

libGPUCounters derivation:

max(min((MaliEngDivergedInstr / MaliEngInstr) * 100, 100), 0)

Streamline derivation:

max(min(($MaliALUInstructionsDivergedInstructions / $MaliALUInstructionsExecutedInstructions) * 100, 100), 0)

Hardware derivation:

max(min((EXEC_INSTR_DIVERGED / EXEC_INSTR_COUNT) * 100, 100), 0)
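
As a hedged illustration, the toy model below (not hardware-accurate) shows why divergence costs cycles: when lanes of a lockstep warp disagree on a branch, the warp issues both paths with some lanes masked off.

  /* Toy model, not hardware-accurate: a lockstep warp that diverges on a
   * branch must issue both paths, with inactive lanes masked off.
   * Lane inputs and per-path instruction counts are hypothetical. */
  #include <stdio.h>

  #define WARP_SIZE 4

  int main(void)
  {
      int branch_taken[WARP_SIZE] = { 1, 1, 0, 1 }; /* per-thread branch input */
      int cost_then = 10, cost_else = 10;           /* instructions per path   */

      int taken = 0;
      for (int i = 0; i < WARP_SIZE; i++)
          taken += branch_taken[i];

      int issued;
      if (taken == 0 || taken == WARP_SIZE)
          issued = taken ? cost_then : cost_else;   /* warp-uniform: one path */
      else
          issued = cost_then + cost_else;           /* diverged: both paths   */

      printf("Instructions issued for this warp: %d\n", issued);
      return 0;
  }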

Shader Core Arithmetic Unit

The arithmetic unit in the shader core processes all the arithmetic and logic operations in the running shader programs.

Performance counters in this section show how the running programs used the arithmetic units, which may indicate the type of operations that are consuming the most performance.

ALU Cycles

This counter group shows the number of cycles when work was issued to the arithmetic and logic unit.

Arithmetic unit issues

This expression defines the number of cycles that the arithmetic unit was busy processing work.

libGPUCounters name: MaliALUIssueCy

libGPUCounters derivation:

MaliEngInstr

Streamline derivation:

$MaliALUInstructionsExecutedInstructions

Hardware derivation:

EXEC_INSTR_COUNT

ALU Instructions

This counter group gives a breakdown of the types of arithmetic instructions being used by the shader program.

Executed instructions

This counter increments for every instruction that the shader core processes per warp. All instructions are single cycle issue.

libGPUCounters name: MaliEngInstr
Streamline name: $MaliALUInstructionsExecutedInstructions
Hardware name: EXEC_INSTR_COUNT

Diverged instructions

This counter increments for every instruction the programmable core processes per warp where there is control flow divergence across the warp. Control flow divergence erodes arithmetic processing efficiency, because some threads in the warp are idle, having not taken the current control path through the code. Aim to minimize control flow divergence when designing shader effects.

libGPUCounters name: MaliEngDivergedInstr
Streamline name: $MaliALUInstructionsDivergedInstructions
Hardware name: EXEC_INSTR_DIVERGED

Shader Core Load/store Unit

The load/store unit in the shader core handles all generic read/write data access, including access to vertex attributes, buffers, images, workgroup local storage, and program stack.

Performance counters in this section show the breakdown of performed load/store cache accesses, showing whether accesses are using an entire cache line or just using part of one.

Load/Store Unit Cycles

This counter group shows the number of cycles when work was issued to the load/store unit.

Load/store unit issues

This expression defines the total number of load/store cache access cycles. This counter ignores secondary effects such as cache misses, so it provides the minimum possible cycle usage.

libGPUCounters name: MaliLSIssueCy

libGPUCounters derivation:

MaliLSFullRd + MaliLSPartRd + MaliLSFullWr + MaliLSPartWr + MaliLSAtomic

Streamline derivation:

$MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads + $MaliLoadStoreUnitCyclesFullWrites + $MaliLoadStoreUnitCyclesPartialWrites + $MaliLoadStoreUnitCyclesAtomicAccesses

Hardware derivation:

LS_MEM_READ_FULL + LS_MEM_READ_SHORT + LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT + LS_MEM_ATOMIC

Reads

This expression defines the total number of load/store read cycles.

libGPUCounters name: MaliLSRdCy

libGPUCounters derivation:

MaliLSFullRd + MaliLSPartRd

Streamline derivation:

$MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads

Hardware derivation:

LS_MEM_READ_FULL + LS_MEM_READ_SHORT

Full reads

This counter increments for every full-width load/store cache read.

libGPUCounters name: MaliLSFullRd
Streamline name: $MaliLoadStoreUnitCyclesFullReads
Hardware name: LS_MEM_READ_FULL

Partial reads

This counter increments for every partial-width load/store cache read. Partial data accesses do not make full use of the load/store cache capability. Merging short accesses together to make fewer larger requests improves efficiency. To do this in shader code:

  • Use vector data loads.
  • Avoid padding in strided data accesses.
  • Write compute shaders so that adjacent threads in a warp access adjacent addresses in memory.

libGPUCounters name: MaliLSPartRd
Streamline name: $MaliLoadStoreUnitCyclesPartialReads
Hardware name: LS_MEM_READ_SHORT

Writes

This expression defines the total number of load/store write cycles.

libGPUCounters name: MaliLSWrCy

libGPUCounters derivation:

MaliLSFullWr + MaliLSPartWr

Streamline derivation:

$MaliLoadStoreUnitCyclesFullWrites + $MaliLoadStoreUnitCyclesPartialWrites

Hardware derivation:

LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT

Full writes

This counter increments for every full-width load/store cache write.

libGPUCounters name: MaliLSFullWr
Streamline name: $MaliLoadStoreUnitCyclesFullWrites
Hardware name: LS_MEM_WRITE_FULL

Partial writes

This counter increments for every partial-width load/store cache write. Partial data accesses do not make full use of the load/store cache capability. Merging short accesses together to make fewer larger requests improves efficiency. To do this in shader code:

  • Use vector data loads.
  • Avoid padding in strided data accesses.
  • Write compute shaders so that adjacent threads in a warp access adjacent addresses in memory.

libGPUCounters name: MaliLSPartWr
Streamline name: $MaliLoadStoreUnitCyclesPartialWrites
Hardware name: LS_MEM_WRITE_SHORT

Atomic accesses

This counter increments for every atomic access.

Atomic memory accesses are typically multicycle operations per thread in a warp, so they are exceptionally expensive. Minimize the use of atomics in performance critical code.

libGPUCounters name: MaliLSAtomic
Streamline name: $MaliLoadStoreUnitCyclesAtomicAccesses
Hardware name: LS_MEM_ATOMIC

Shader Core Varying Unit

The varying unit in the shader core handles all vertex data interpolation in fragment shaders.

Performance counters in this section show the breakdown of performed interpolation operations.

Varying Unit Requests

This counter group shows the number of requests made to the varying interpolation unit.

Interpolation requests

This counter increments for every warp-width interpolation operation processed by the varying unit.

libGPUCounters name: MaliVarInstr
Streamline name: $MaliVaryingUnitRequestsInterpolationRequests
Hardware name: VARY_INSTR

16-bit interpolation slots

This counter increments for every 16-bit interpolation slot processed by the varying unit.

The width of each slot and the number of slots are GPU dependent.

libGPUCounters name: MaliVar16IssueSlot
Streamline name: $MaliVaryingUnitRequests16BitInterpolationSlots
Hardware name: VARY_SLOT_16

32-bit interpolation slots

This counter increments for every 32-bit interpolation slot processed by the varying unit. 32-bit interpolation is half the performance of 16-bit interpolation, so if content is varying bound, consider reducing the precision of varying inputs to fragment shaders.

The width of each slot and the number of slots are GPU dependent.

libGPUCounters name: MaliVar32IssueSlot
Streamline name: $MaliVaryingUnitRequests32BitInterpolationSlots
Hardware name: VARY_SLOT_32

Varying Unit Cycles

This counter group shows the number of cycles when work was issued to the varying interpolation unit.

Varying unit issues

This expression defines the total number of cycles when the varying interpolator is issuing operations.

libGPUCounters name: MaliVarIssueCy

libGPUCounters derivation:

MaliVar32IssueSlot + MaliVar16IssueSlot

Streamline derivation:

$MaliVaryingUnitRequests32BitInterpolationSlots + $MaliVaryingUnitRequests16BitInterpolationSlots

Hardware derivation:

VARY_SLOT_32 + VARY_SLOT_16

16-bit interpolation issues

This counter increments for every 16-bit interpolation cycle processed by the varying unit.

libGPUCounters name: MaliVar16IssueCy

libGPUCounters derivation:

MaliVar16IssueSlot

Streamline derivation:

$MaliVaryingUnitRequests16BitInterpolationSlots

Hardware derivation:

VARY_SLOT_16

32-bit interpolation issues

This counter increments for every 32-bit interpolation cycle processed by the varying unit. 32-bit interpolation is half the performance of 16-bit interpolation, so if content is varying bound, consider reducing the precision of varying inputs to fragment shaders.

libGPUCounters name: MaliVar32IssueCy

libGPUCounters derivation:

MaliVar32IssueSlot

Streamline derivation:

$MaliVaryingUnitRequests32BitInterpolationSlots

Hardware derivation:

VARY_SLOT_32

Shader Core Texture Unit

The texture unit in the shader core handles all read-only texture access and filtering.

Performance counters in this section show the breakdown of performed texturing operations, and use of sub-units inside the texturing hardware.

Texture Unit Requests

This counter group shows the number of requests made to the texture unit.

Texture requests

This counter increments for every thread-width texture operation processed by the texture unit.

libGPUCounters name: MaliTexInstr
Streamline name: $MaliTextureUnitRequestsTextureRequests
Hardware name: TEX_INSTR

Texture samples

This expression defines the number of texture samples made.

libGPUCounters name: MaliTexSample

libGPUCounters derivation:

MaliTexInstr

Streamline derivation:

$MaliTextureUnitRequestsTextureRequests

Hardware derivation:

TEX_INSTR

3D texture requests

This counter increments for every texture operation acting on a 3D texture. 3D filtering is half the performance of 2D filtering.

libGPUCounters name: MaliTex3DInstr
Streamline name: $MaliTextureUnitRequests3DTextureRequests
Hardware name: TEX_INSTR_3D

Compressed texture requests

This counter increments for every texture operation acting on a compressed texture. Lossy texture compression, such as ASTC and ETC, provides a significant reduction in texture size and bandwidth, improving performance and reducing memory power consumption.

Note that this counter excludes textures compressed using AFBC lossless framebuffer compression.

libGPUCounters name: MaliTexCompressInstr
Streamline name: $MaliTextureUnitRequestsCompressedTextureRequests
Hardware name: TEX_INSTR_COMPRESSED

Mipmapped texture requests

This counter increments for every texture operation that acts on a mipmapped texture. Mipmapping improves texturing quality for 3D scenes by providing some pre-filtering for minified texture samples. It also improves performance because it reduces pressure on texture caches. Aim to use mipmapping for all texturing operations in a 3D scene that read from static input textures.

libGPUCounters name: MaliTexMipInstr
Streamline name: $MaliTextureUnitRequestsMipmappedTextureRequests
Hardware name: TEX_INSTR_MIPMAP

Trilinear filtered requests

This counter increments for every texture operation that uses a trilinear texture filter. Trilinear filtering is half the performance of bilinear filtering.

libGPUCounters name: MaliTexTriInstr
Streamline name: $MaliTextureUnitRequestsTrilinearFilteredRequests
Hardware name: TEX_INSTR_TRILINEAR

Texture Unit Cycles

This counter group shows the number of cycles when work was issued to the sub-units inside the texture unit.

Texture unit issues

This expression measures the number of cycles the texture unit was busy processing work.

libGPUCounters name: MaliTexIssueCy

libGPUCounters derivation:

MaliTexFiltIssueCy

Streamline derivation:

$MaliTextureUnitCyclesFilteringActive

Hardware derivation:

TEX_COORD_ISSUE

Filtering active

This counter increments for every texture filtering issue cycle. This GPU can perform one 2D bilinear texture sample per clock. More complex filtering operations are composed of multiple 2D bilinear samples, and take proportionally more filtering time to complete. The costs per sample are:

  • 2D bilinear filtering takes one cycle.
  • 2D trilinear filtering takes two cycles.
  • 3D bilinear filtering takes two cycles.
  • 3D trilinear filtering takes four cycles.
  • Sampling from multi-plane YUV takes one cycle per plane.

Anisotropic filtering makes multiple filtered subsamples, which are combined to make the final output sample color. For a filter with MAX_ANISOTROPY of N, up to N times the cycles of the base filter are required. A rough cost estimator is sketched after this counter's names below.

libGPUCounters name: MaliTexFiltIssueCy
Streamline name: $MaliTextureUnitCyclesFilteringActive
Hardware name: TEX_COORD_ISSUE
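
As referenced above, a rough, hedged estimator for these per-sample costs is sketched below. It is an upper-bound model under the assumptions stated in the comments, not a description of the hardware implementation.

  /* Rough cost model for filtering cycles per texture sample, following the
   * cost list above. Anisotropic filtering is modeled as the worst case of
   * max_anisotropy subsamples; real costs can be lower. */
  #include <stdio.h>

  static int filter_cycles(int is_3d, int is_trilinear, int yuv_planes,
                           int max_anisotropy)
  {
      int cycles = 1;                                   /* 2D bilinear baseline */
      if (is_3d)              cycles *= 2;              /* 3D doubles the cost  */
      if (is_trilinear)       cycles *= 2;              /* trilinear doubles it */
      if (yuv_planes > 1)     cycles *= yuv_planes;     /* one cycle per plane  */
      if (max_anisotropy > 1) cycles *= max_anisotropy; /* up to N subsamples   */
      return cycles;
  }

  int main(void)
  {
      /* Hypothetical sampler: 2D trilinear, single plane, no anisotropy. */
      printf("Estimated cycles per sample: %d\n", filter_cycles(0, 1, 1, 1));
      return 0;
  }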

Texture Unit Stall Cycles

This counter group shows the number of stall cycles when work could not be issued to the sub-units inside the texture unit.

Coordinate stalls

This counter increments every clock cycle when threads are stalled at the texel coordinate calculation stage.

These stalled threads can occur when the texture cache is full of threads waiting for data and the cache is unable to accept new threads.

libGPUCounters name: MaliTexCoordStallCy
Streamline name: $MaliTextureUnitStallCyclesCoordinateStalls
Hardware name: TEX_COORD_STALL

Line fill stalls

This counter increments every clock cycle when at least one thread is waiting for data from the texture cache, but no lookup is completed.

This event occurs if no new threads enter the data fetch stage from texel address calculation, and all threads that are already in the data fetch stage are still waiting for their data.

libGPUCounters name: MaliTexDataStallCy
Streamline name: $MaliTextureUnitStallCyclesLineFillStalls
Hardware name: TEX_STARVE_CACHE

Partial data stalls

This counter increments every clock cycle when at least one thread fetched some data from the texture cache, but no filtering operation is started because no thread has all the data that it requires.

libGPUCounters name: MaliTexPartDataStallCy
Streamline name: $MaliTextureUnitStallCyclesPartialDataStalls
Hardware name: TEX_STARVE_FILTER

Texture Unit Usage Rate

This counter group shows the properties of texturing workloads that are being performed.

3D sample rate

This expression defines the percentage of texture operations accessing 3D textures.

libGPUCounters name: MaliTex3DInstrRate

libGPUCounters derivation:

max(min((MaliTex3DInstr / MaliTexInstr) * 100, 100), 0)

Streamline derivation:

max(min(($MaliTextureUnitRequests3DTextureRequests / $MaliTextureUnitRequestsTextureRequests) * 100, 100), 0)

Hardware derivation:

max(min((TEX_INSTR_3D / TEX_INSTR) * 100, 100), 0)

Compressed sample rate

This expression defines the percentage of texture operations accessing compressed textures. Note that compressed textures in this instance means API-level compression, such as ETC and ASTC. The AFBC lossless framebuffer compression is not included.

libGPUCounters name: MaliTexCompressInstrRate

libGPUCounters derivation:

max(min((MaliTexCompressInstr / MaliTexInstr) * 100, 100), 0)

Streamline derivation:

max(min(($MaliTextureUnitRequestsCompressedTextureRequests / $MaliTextureUnitRequestsTextureRequests) * 100, 100), 0)

Hardware derivation:

max(min((TEX_INSTR_COMPRESSED / TEX_INSTR) * 100, 100), 0)

Mipmapped sample rate

This expression defines the percentage of texture operations accessing mipmapped textures.

Mipmapping significantly improves image quality and memory bandwidth in 3D scenes with variable object view distance. Aim to enable mipmapping for all offline-authored texture assets in 3D content.

libGPUCounters name: MaliTexMipInstrRate

libGPUCounters derivation:

max(min((MaliTexMipInstr / MaliTexInstr) * 100, 100), 0)

Streamline derivation:

max(min(($MaliTextureUnitRequestsMipmappedTextureRequests / $MaliTextureUnitRequestsTextureRequests) * 100, 100), 0)

Hardware derivation:

max(min((TEX_INSTR_MIPMAP / TEX_INSTR) * 100, 100), 0)

Trilinear sample rate

This expression defines the percentage of texture operations using trilinear filtering. Trilinear samples are twice the cost of simple bilinear samples.

libGPUCounters name: MaliTexTriInstrRate

libGPUCounters derivation:

max(min((MaliTexTriInstr / MaliTexInstr) * 100, 100), 0)

Streamline derivation:

max(min(($MaliTextureUnitRequestsTrilinearFilteredRequests / $MaliTextureUnitRequestsTextureRequests) * 100, 100), 0)

Hardware derivation:

max(min((TEX_INSTR_TRILINEAR / TEX_INSTR) * 100, 100), 0)

Texture Unit CPI

This counter group shows the average cost of texture samples.

Texture CPI

This expression defines the average number of texture filtering cycles per instruction. For texture-limited content that has a CPI higher than the optimal value for this core (1 cycle per sample), consider using simpler texture filters. See Texture unit issue cycles for details of the expected performance for different types of operation.

libGPUCounters name: MaliTexCPI

libGPUCounters derivation:

MaliTexFiltIssueCy / MaliTexInstr

Streamline derivation:

$MaliTextureUnitCyclesFilteringActive / $MaliTextureUnitRequestsTextureRequests

Hardware derivation:

TEX_COORD_ISSUE / TEX_INSTR

Shader Core Other Units

In addition to the main units, covered in earlier sections, the shader core has several other units that can be measured.

Performance counters in this section show the workload on these other units.

Attribute Unit Requests

This counter group shows the number of requests made to the attribute unit.

Attribute requests

This counter increments for every instruction run by the attribute unit.

Each instruction converts a logical attribute access into a pointer-based access, which is then processed by the load/store unit.

libGPUCounters name: MaliAttrInstr
Streamline name: $MaliAttributeUnitRequestsAttributeRequests
Hardware name: ATTR_INSTR

Shader Core Memory Access

GPUs are data-plane processors, so understanding your memory bandwidth and where it is coming from is a critical piece of knowledge when trying to improve performance.

Performance counters in this section show the breakdown of memory accesses by shader core hardware unit, showing the total amount of read and write bandwidth being generated by the shader core.

Read bandwidth is split to show how much was provided by the GPU L2 cache and how much was provided by the external memory system. Write bandwidth does not have an equivalent split, and it is not possible to tell from the counters if a write went to L2 or directly to external memory.
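
Beat counters in this section can be converted into approximate bandwidth figures: each beat carries 16 bytes on this GPU, so bytes = beats * 16, and dividing by the sample window duration gives a rate. The sketch below uses hypothetical values.

  /* Sketch: converting a read beat counter delta into a bandwidth figure.
   * The beat count and the sample window duration are hypothetical values. */
  #include <stdio.h>

  int main(void)
  {
      double tex_l2_read_beats = 5000000.0; /* BEATS_RD_TEX over the window */
      double sample_seconds    = 0.0167;    /* roughly one frame at 60 FPS  */

      double bytes            = tex_l2_read_beats * 16.0; /* 16 bytes per beat */
      double gigabytes_per_s  = bytes / sample_seconds / 1.0e9;

      printf("Texture unit L2 read bandwidth: %.2f GB/s\n", gigabytes_per_s);
      return 0;
  }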

Shader Core L2 Reads

This counter group shows the number of shader core read transactions served from the L2 cache, broken down by hardware unit inside the shader core.

Fragment front-end beats

This counter increments for every read beat received by the fixed-function fragment front-end.

libGPUCounters name: MaliSCBusFFEL2RdBt
Streamline name: $MaliShaderCoreL2ReadsFragmentFrontEndBeats
Hardware name: BEATS_RD_FTC

Load/store unit beats

This counter increments for every read beat received by the load/store unit.

libGPUCounters name: MaliSCBusLSL2RdBt
Streamline name: $MaliShaderCoreL2ReadsLoadStoreUnitBeats
Hardware name: BEATS_RD_LSC

Texture unit beats

This counter increments for every read beat received by the texture unit.

libGPUCounters name: MaliSCBusTexL2RdBt
Streamline name: $MaliShaderCoreL2ReadsTextureUnitBeats
Hardware name: BEATS_RD_TEX

Other unit beats

This counter increments for every read beat received by any unit that is not identified as a specific data destination.

libGPUCounters name: MaliSCBusOtherL2RdBt
Streamline name: $MaliShaderCoreL2ReadsOtherUnitBeats
Hardware name: BEATS_RD_OTHER

Shader Core External Reads

This counter group shows the number of shader core read transactions served from external memory, broken down by hardware unit inside the shader core.

Fragment front-end beats

This counter increments for every read beat received by the fixed-function fragment front-end that required an external memory access because of an L2 cache miss.

libGPUCounters name: MaliSCBusFFEExtRdBt
Streamline name: $MaliShaderCoreExternalReadsFragmentFrontEndBeats
Hardware name: BEATS_RD_FTC_EXT

Load/store unit beats

This counter increments for every read beat received by the load/store unit that required an external memory access because of an L2 cache miss.

libGPUCounters name: MaliSCBusLSExtRdBt
Streamline name: $MaliShaderCoreExternalReadsLoadStoreUnitBeats
Hardware name: BEATS_RD_LSC_EXT

Texture unit beats

This counter increments for every read beat received by the texture unit that required an external memory access because of an L2 cache miss.

libGPUCounters name: MaliSCBusTexExtRdBt
Streamline name: $MaliShaderCoreExternalReadsTextureUnitBeats
Hardware name: BEATS_RD_TEX_EXT

Shader Core L2 Writes

This counter group shows the number of shader core write transactions, broken down by hardware unit inside the shader core.

Load/store unit beats

This counter increments for every write beat sent by the load/store unit.

libGPUCounters name: MaliSCBusLSWrBt
Streamline name: $MaliShaderCoreL2WritesLoadStoreUnitBeats
Hardware name: BEATS_WR_LSC

Tile unit beats

This counter increments for every write beat sent by the tile write-back unit.

libGPUCounters name: MaliSCBusTileWrBt
Streamline name: $MaliShaderCoreL2WritesTileUnitBeats
Hardware name: BEATS_WR_TIB

Other unit beats

This counter increments for every write beat sent by any unit that is not identified as a specific data source.

libGPUCounters name: MaliSCBusOtherWrBt
Streamline name: $MaliShaderCoreL2WritesOtherUnitBeats
Hardware name: BEATS_WR_OTHER

Shader Core L2 Read Bytes

This counter group shows the number of bytes read from the L2 cache by the shader core, broken down by hardware unit inside the shader core.

Fragment front-end bytes

This expression defines the total number of bytes read from the L2 memory system by the fragment front-end.

libGPUCounters name: MaliSCBusFFEL2RdBy

libGPUCounters derivation:

MaliSCBusFFEL2RdBt * 16

Streamline derivation:

$MaliShaderCoreL2ReadsFragmentFrontEndBeats * 16

Hardware derivation:

BEATS_RD_FTC * 16

Load/store unit bytes

This expression defines the total number of bytes read from the L2 memory system by the load/store unit.

libGPUCounters name: MaliSCBusLSL2RdBy

libGPUCounters derivation:

MaliSCBusLSL2RdBt * 16

Streamline derivation:

$MaliShaderCoreL2ReadsLoadStoreUnitBeats * 16

Hardware derivation:

BEATS_RD_LSC * 16

Texture unit bytes

This expression defines the total number of bytes read from the L2 memory system by the texture unit.

libGPUCounters name: MaliSCBusTexL2RdBy

libGPUCounters derivation:

MaliSCBusTexL2RdBt * 16

Streamline derivation:

$MaliShaderCoreL2ReadsTextureUnitBeats * 16

Hardware derivation:

BEATS_RD_TEX * 16

Shader Core External Read Bytes

This counter group shows the number of bytes read from external memory by the shader core, broken down by hardware unit inside the shader core.

Fragment front-end bytes

This expression defines the total number of bytes read from the external memory system by the fragment front-end.

libGPUCounters name: MaliSCBusFFEExtRdBy

libGPUCounters derivation:

MaliSCBusFFEExtRdBt * 16

Streamline derivation:

$MaliShaderCoreExternalReadsFragmentFrontEndBeats * 16

Hardware derivation:

BEATS_RD_FTC_EXT * 16

Load/store unit bytes

This expression defines the total number of bytes read from the external memory system by the load/store unit.

libGPUCounters name: MaliSCBusLSExtRdBy

libGPUCounters derivation:

MaliSCBusLSExtRdBt * 16

Streamline derivation:

$MaliShaderCoreExternalReadsLoadStoreUnitBeats * 16

Hardware derivation:

BEATS_RD_LSC_EXT * 16

Texture unit bytes

This expression defines the total number of bytes read from the external memory system by the texture unit.

libGPUCounters name: MaliSCBusTexExtRdBy

libGPUCounters derivation:

MaliSCBusTexExtRdBt * 16

Streamline derivation:

$MaliShaderCoreExternalReadsTextureUnitBeats * 16

Hardware derivation:

BEATS_RD_TEX_EXT * 16

Shader Core L2 Write Bytes

This counter group shows the number of bytes written by the shader core, broken down by hardware unit inside the shader core.

These writes are written to the L2 memory system, but counters cannot determine if the write was written to the L2 cache or directly to external memory.

Load/store unit bytes

This expression defines the total number of bytes written to the L2 memory system by the load/store unit.

libGPUCounters name: MaliSCBusLSWrBy

libGPUCounters derivation:

MaliSCBusLSWrBt * 16

Streamline derivation:

$MaliShaderCoreL2WritesLoadStoreUnitBeats * 16

Hardware derivation:

BEATS_WR_LSC * 16

Tile unit bytes

This expression defines the total number of bytes written to the L2 memory system by the tile write-back unit.

libGPUCounters name: MaliSCBusTileWrBy

libGPUCounters derivation:

MaliSCBusTileWrBt * 16

Streamline derivation:

$MaliShaderCoreL2WritesTileUnitBeats * 16

Hardware derivation:

BEATS_WR_TIB * 16

Other unit bytes

This expression defines the total number of bytes written to the L2 memory system by any unit that is not identified as a specific data source.

libGPUCounters name: MaliSCBusOtherWrBy

libGPUCounters derivation:

MaliSCBusOtherWrBt * 16

Streamline derivation:

$MaliShaderCoreL2WritesOtherUnitBeats * 16

Hardware derivation:

BEATS_WR_OTHER * 16

Load/Store Unit Bytes/Cycle

This counter group shows the number of bytes accessed in the L2 cache and external memory per load/store cache access cycle. This gives some measure of how effectively the GPU is caching load/store data.

L2 read bytes/cy

This expression defines the average number of bytes read from the L2 memory system by the load/store unit per read cycle. This metric indicates how effectively data is being cached in the L1 load/store cache.

If more bytes are being requested per access than you would expect for the data layout you are using, review your data layout and access patterns.

libGPUCounters name: MaliSCBusLSL2RdByPerRd

libGPUCounters derivation:

(MaliSCBusLSL2RdBt * 16) / (MaliLSFullRd + MaliLSPartRd)

Streamline derivation:

($MaliShaderCoreL2ReadsLoadStoreUnitBeats * 16) / ($MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads)

Hardware derivation:

(BEATS_RD_LSC * 16) / (LS_MEM_READ_FULL + LS_MEM_READ_SHORT)

L2 write bytes/cy

This expression defines the average number of bytes written to the L2 memory system by the load/store unit per write cycle.

If more bytes are being written per access than you would expect for the data layout you are using, review your data layout and access patterns to improve cache locality.

libGPUCounters name: MaliSCBusLSWrByPerWr

libGPUCounters derivation:

(MaliSCBusLSWrBt * 16) / (MaliLSFullWr + MaliLSPartWr)

Streamline derivation:

($MaliShaderCoreL2WritesLoadStoreUnitBeats * 16) / ($MaliLoadStoreUnitCyclesFullWrites + $MaliLoadStoreUnitCyclesPartialWrites)

Hardware derivation:

(BEATS_WR_LSC * 16) / (LS_MEM_WRITE_FULL + LS_MEM_WRITE_SHORT)

External read bytes/cy

This expression defines the average number of bytes read from the external memory system by the load/store unit per read cycle. This metric indicates how effectively data is being cached in the L2 cache.

If more bytes are being requested per access than you would expect for the data layout you are using, review your data layout and access patterns.

libGPUCounters name: MaliSCBusLSExtRdByPerRd

libGPUCounters derivation:

(MaliSCBusLSExtRdBt * 16) / (MaliLSFullRd + MaliLSPartRd)

Streamline derivation:

($MaliShaderCoreExternalReadsLoadStoreUnitBeats * 16) / ($MaliLoadStoreUnitCyclesFullReads + $MaliLoadStoreUnitCyclesPartialReads)

Hardware derivation:

(BEATS_RD_LSC_EXT * 16) / (LS_MEM_READ_FULL + LS_MEM_READ_SHORT)
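
The three expressions in this group share the same read or write cycle denominators, so they are conveniently computed together. The Python sketch below uses hypothetical counter values; real values would come from a counter sampling tool, which should also guard against a zero cycle count.

    # Hypothetical raw counter values for one sample period.
    sample = {
        "BEATS_RD_LSC": 450_000,      # L2 read beats, load/store unit
        "BEATS_RD_LSC_EXT": 90_000,   # External read beats, load/store unit
        "BEATS_WR_LSC": 200_000,      # L2 write beats, load/store unit
        "LS_MEM_READ_FULL": 60_000,   # Full-width read cycles
        "LS_MEM_READ_SHORT": 20_000,  # Partial-width read cycles
        "LS_MEM_WRITE_FULL": 25_000,  # Full-width write cycles
        "LS_MEM_WRITE_SHORT": 5_000,  # Partial-width write cycles
    }

    BEAT_BYTES = 16
    read_cycles = sample["LS_MEM_READ_FULL"] + sample["LS_MEM_READ_SHORT"]
    write_cycles = sample["LS_MEM_WRITE_FULL"] + sample["LS_MEM_WRITE_SHORT"]

    l2_read_bytes_per_cycle = (sample["BEATS_RD_LSC"] * BEAT_BYTES) / read_cycles
    l2_write_bytes_per_cycle = (sample["BEATS_WR_LSC"] * BEAT_BYTES) / write_cycles
    ext_read_bytes_per_cycle = (sample["BEATS_RD_LSC_EXT"] * BEAT_BYTES) / read_cycles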

Texture Unit Bytes/Cycle

This counter group shows the number of bytes accessed in the L2 cache and external memory per texture filtering cycle. This gives some measure of how effectively the GPU is caching texture data.

L2 read bytes/cy

This expression defines the average number of bytes read from the L2 memory system by the texture unit per filtering cycle. This metric indicates how effectively textures are being cached in the L1 texture cache.

If more bytes are being requested per access than you would expect for the format you are using, review your texture settings. Arm recommends:

  • Using mipmaps for offline generated textures.
  • Using ASTC or ETC compression for offline generated textures.
  • Replacing runtime framebuffer formats with narrower formats.
  • Reducing use of imageLoad/Store to allow framebuffer compression.
  • Reducing use of negative LOD bias used for texture sharpening.
  • Reducing use of anisotropic filtering, or reducing the level of MAX_ANISOTROPY used.

libGPUCounters name: MaliSCBusTexL2RdByPerRd

libGPUCounters derivation:

(MaliSCBusTexL2RdBt * 16) / MaliTexFiltIssueCy

Streamline derivation:

($MaliShaderCoreL2ReadsTextureUnitBeats * 16) / $MaliTextureUnitCyclesFilteringActive

Hardware derivation:

(BEATS_RD_TEX * 16) / TEX_COORD_ISSUE

External read bytes/cy

This expression defines the average number of bytes read from the external memory system by the texture unit per filtering cycle. This metric indicates how effectively textures are being cached in the L2 cache.

If more bytes are being requested per access than you would expect for the format you are using, review your texture settings. Arm recommends:

  • Using mipmaps for offline generated textures.
  • Using ASTC or ETC compression for offline generated textures.
  • Replacing runtime framebuffer formats with narrower formats.
  • Reducing use of imageLoad/Store to allow framebuffer compression.
  • Reducing use of negative LOD bias used for texture sharpening.
  • Reducing use of anisotropic filtering, or reducing the level of MAX_ANISOTROPY used.

libGPUCounters name: MaliSCBusTexExtRdByPerRd

libGPUCounters derivation:

(MaliSCBusTexExtRdBt * 16) / MaliTexFiltIssueCy

Streamline derivation:

($MaliShaderCoreExternalReadsTextureUnitBeats * 16) / $MaliTextureUnitCyclesFilteringActive

Hardware derivation:

(BEATS_RD_TEX_EXT * 16) / TEX_COORD_ISSUE
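
Computing both texture metrics side by side makes it easier to separate L1 texture cache behavior from L2 cache behavior. The sketch below uses hypothetical counter values.

    sample = {
        "BEATS_RD_TEX": 300_000,      # L2 read beats, texture unit
        "BEATS_RD_TEX_EXT": 40_000,   # External read beats, texture unit
        "TEX_COORD_ISSUE": 500_000,   # Texture filtering issue cycles
    }

    BEAT_BYTES = 16
    l2_bytes_per_cycle = (sample["BEATS_RD_TEX"] * BEAT_BYTES) / sample["TEX_COORD_ISSUE"]
    ext_bytes_per_cycle = (sample["BEATS_RD_TEX_EXT"] * BEAT_BYTES) / sample["TEX_COORD_ISSUE"]
    # A high L2 value suggests poor hit rates in the L1 texture cache; a high
    # external value suggests texture data is not staying resident in the L2 cache.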

Tile Unit Bytes/Pixel

This counter group shows the number of bytes written by the tile unit per output pixel. This can be used to determine the efficiency of the application's render pass store configuration.

Applications can minimize the number of bytes stored by following best practices:

  • Use the smallest pixel color format that meets your requirements.
  • Discard transient attachments that are no longer required at the end of each render pass (Vulkan storeOp=DONT_CARE or storeOp=NONE).
  • Use resolve attachments to resolve multi-sampled data into a single value as part of tile write-back and discard the multi-sampled data so that it is not written back to memory.

External write bytes/px

This expression defines the average number of bytes written to the L2 memory system by the tile unit per output pixel.

If more bytes are being written per pixel than expected, Arm recommends:

  • Using narrower attachment color formats with fewer bytes per pixel.
  • Configuring attachments so that they can use framebuffer compression.
  • Invalidating transient attachments to skip writing to memory.
  • Using inline multi-sample resolve to skip writing the multi-sampled data to memory.

libGPUCounters name: MaliSCBusTileWrBPerPx

libGPUCounters derivation:

(MaliSCBusTileWrBt * 16) / (MaliFragQueueTask * 1024)

Streamline derivation:

($MaliShaderCoreL2WritesTileUnitBeats * 16) / ($MaliGPUTasksFragmentTasks * 1024)

Hardware derivation:

(BEATS_WR_TIB * 16) / (JS0_TASKS * 1024)
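
The 1024 in this derivation is the pixel count of one fragment task: the fragment queue task size constant is 32 pixels per axis, so each task covers 32 x 32 = 1024 pixels. The Python sketch below uses hypothetical counter values.

    sample = {
        "BEATS_WR_TIB": 2_000_000,  # Tile write-back beats
        "JS0_TASKS": 8_100,         # Fragment tasks, each covering 32 x 32 pixels
    }

    BEAT_BYTES = 16
    PIXELS_PER_FRAGMENT_TASK = 32 * 32  # 1024

    write_bytes_per_pixel = (sample["BEATS_WR_TIB"] * BEAT_BYTES) / (
        sample["JS0_TASKS"] * PIXELS_PER_FRAGMENT_TASK)
    # About 3.9 bytes/pixel here, close to the 4 bytes/pixel expected for an
    # uncompressed 32-bit color attachment written once per pixel.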

Tiling

The tiler hardware orchestrates vertex shading and bins primitives into the tile lists that are read during fragment shading.

Performance counters in this section show how the tiler processed the binning-time vertex and primitive workload.

Tiler Stall Cycles

This counter group shows the number of cycles that individual sub-units inside the tiler were stalled.

Position FIFO full stalls

This counter increments every clock cycle when the tiler has a position shading request that it cannot send to a shader core because the position buffer is full.

libGPUCounters name: MaliTilerPosShadFIFOFullCy
Streamline name: $MaliTilerStallCyclesPositionFIFOFullStalls
Hardware name: IDVS_POS_FIFO_FULL

Position shading stalls

This counter increments every clock cycle when the tiler has a position shading request that it cannot send to a shader core because the shading request queue is full.

libGPUCounters name: MaliTilerPosShadStallCy
Streamline name: $MaliTilerStallCyclesPositionShadingStalls
Hardware name: IDVS_POS_SHAD_STALL

Varying shading stalls

This counter increments every clock cycle when the tiler has a varying shading request that it cannot send to a shader core because the shading request queue is full.

libGPUCounters name: MaliTilerVarShadStallCy
Streamline name: $MaliTilerStallCyclesVaryingShadingStalls
Hardware name: IDVS_VAR_SHAD_STALL

Tiler Vertex Cache

This counter group shows the number of accesses made into the vertex position and varying post-transform caches.

Position cache hits

This counter increments every time a vertex position lookup hits in the vertex cache.

libGPUCounters name: MaliTilerPosCacheHit
Streamline name: $MaliTilerVertexCachePositionCacheHits
Hardware name: VCACHE_HIT

Position cache misses

This counter increments every time a vertex position lookup misses in the vertex cache. Cache misses at this stage result in a position shading request, although a single request can produce data to handle multiple cache misses.

libGPUCounters name: MaliTilerPosCacheMiss
Streamline name: $MaliTilerVertexCachePositionCacheMisses
Hardware name: VCACHE_MISS

Varying cache hits

This counter increments every time a vertex varying lookup results in a successful hit in the vertex cache.

libGPUCounters name: MaliTilerVarCacheHit
Streamline name: $MaliTilerVertexCacheVaryingCacheHits
Hardware name: IDVS_VBU_HIT

Varying cache misses

This counter increments every time a vertex varying lookup misses in the vertex cache. Cache misses at this stage result in a varying shading request, although a single request can produce data to handle multiple cache misses.

libGPUCounters name: MaliTilerVarCacheMiss
Streamline name: $MaliTilerVertexCacheVaryingCacheMisses
Hardware name: IDVS_VBU_MISS

Tiler L2 Accesses

This counter group shows the number of tiler memory transactions into the L2 memory system.

Read beats

This counter increments for every data read cycle the tiler uses on the internal bus.

libGPUCounters name: MaliTilerRdBt
Streamline name: $MaliTilerL2AccessesReadBeats
Hardware name: BUS_READ

Write beats

This counter increments for every data write cycle the tiler uses on the internal bus.

libGPUCounters name: MaliTilerWrBt
Streamline name: $MaliTilerL2AccessesWriteBeats
Hardware name: BUS_WRITE

Tiler Shading Requests

This counter group tracks the number of shading requests that are made by the tiler when processing vertex shaders.

Application vertex shaders are split into two pieces: a position shader that computes the vertex position, and a varying shader that computes the remaining vertex shader outputs. The varying shader is only run if a group contains visible vertices that survive primitive culling.

Position shading requests

This counter increments for every position shading request in the tiler geometry flow. Position shading runs the first part of the vertex shader, computing the position required to perform clipping and culling. A vertex that has been evicted from the post-transform cache must be reshaded if used again, so your index buffers must have good spatial locality of index reuse.

Each request contains 4 vertices.

Note that not all types of draw call use this tiler workflow, so this counter might not account for all submitted geometry.

libGPUCounters name: MaliGeomPosShadTask
Streamline name: $MaliTilerShadingRequestsPositionShadingRequests
Hardware name: IDVS_POS_SHAD_REQ

Varying shading requests

This counter increments for every varying shading request in the tiler geometry flow. Varying shading runs the second part of the vertex shader, for any primitive that survives clipping and culling. The same vertex is shaded multiple times if it has been evicted from the post-transform cache before reuse occurs. Keep good spatial locality of index reuse in your index buffers.

Each request contains 4 vertices.

Note that not all types of draw call use this tiler workflow, so this counter might not account for all submitted geometry.

libGPUCounters name: MaliGeomVarShadTask
Streamline name: $MaliTilerShadingRequestsVaryingShadingRequests
Hardware name: IDVS_VAR_SHAD_REQ
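
Because each request contains 4 vertices, the two request counters can be scaled to estimate shaded vertex counts. The sketch below uses hypothetical counter values.

    sample = {
        "IDVS_POS_SHAD_REQ": 250_000,  # Position shading requests
        "IDVS_VAR_SHAD_REQ": 100_000,  # Varying shading requests
    }

    VERTICES_PER_REQUEST = 4  # Tiler shader task thread count (see the Constants section)

    position_shaded_vertices = sample["IDVS_POS_SHAD_REQ"] * VERTICES_PER_REQUEST
    varying_shaded_vertices = sample["IDVS_VAR_SHAD_REQ"] * VERTICES_PER_REQUEST
    # Varying shading normally runs on fewer vertices than position shading,
    # because it is only issued for vertices in primitives that survive
    # clipping and culling.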

Vertex Cache Hit Rate

This counter group shows the hit rate in the tiler post-transform caches.

Position read hit rate

This expression defines the percentage hit rate of the tiler position cache used for the index-driven vertex shading pipeline.

libGPUCounters name: MaliTilerPosCacheHitRate

libGPUCounters derivation:

max(min((MaliTilerPosCacheHit / (MaliTilerPosCacheHit + MaliTilerPosCacheMiss)) * 100, 100), 0)

Streamline derivation:

max(min(($MaliTilerVertexCachePositionCacheHits / ($MaliTilerVertexCachePositionCacheHits + $MaliTilerVertexCachePositionCacheMisses)) * 100, 100), 0)

Hardware derivation:

max(min((VCACHE_HIT / (VCACHE_HIT + VCACHE_MISS)) * 100, 100), 0)

Varying read hit rate

This expression defines the percentage hit rate of the tiler varying cache used for the index-driven vertex shading pipeline.

libGPUCounters name: MaliTilerVarCacheHitRate

libGPUCounters derivation:

max(min((MaliTilerVarCacheHit / (MaliTilerVarCacheHit + MaliTilerVarCacheMiss)) * 100, 100), 0)

Streamline derivation:

max(min(($MaliTilerVertexCacheVaryingCacheHits / ($MaliTilerVertexCacheVaryingCacheHits + $MaliTilerVertexCacheVaryingCacheMisses)) * 100, 100), 0)

Hardware derivation:

max(min((IDVS_VBU_HIT / (IDVS_VBU_HIT + IDVS_VBU_MISS)) * 100, 100), 0)
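
Both hit rate expressions follow the same clamped percentage pattern, sketched below in Python with hypothetical counter values.

    def clamped_hit_rate(hits, misses):
        # Percentage hit rate, clamped to the 0-100 range as in the derivations above.
        total = hits + misses
        if total == 0:
            return 0.0
        return max(min((hits / total) * 100.0, 100.0), 0.0)

    sample = {"VCACHE_HIT": 180_000, "VCACHE_MISS": 20_000,
              "IDVS_VBU_HIT": 150_000, "IDVS_VBU_MISS": 50_000}

    position_hit_rate = clamped_hit_rate(sample["VCACHE_HIT"], sample["VCACHE_MISS"])     # 90.0
    varying_hit_rate = clamped_hit_rate(sample["IDVS_VBU_HIT"], sample["IDVS_VBU_MISS"])  # 75.0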

Internal Memory System

The GPU internal memory interface connects the processing units, such as the shader cores and the tiler, to the GPU L2 cache.

Performance counters in this section show reads and writes into the L2 cache and how the cache responded to them.

L2 Cache Requests

This counter group shows the total number of requests made into the L2 cache from any internal source, such as the shader cores and the tiler.

Read requests

This counter increments for every read request received by the L2 cache from an internal requester.

libGPUCounters name: MaliL2CacheRd
Streamline name: $MaliL2CacheRequestsReadRequests
Hardware name: L2_RD_MSG_IN

Write requests

This counter increments for every write request received by the L2 cache from an internal requester.

libGPUCounters name: MaliL2CacheWr
Streamline name: $MaliL2CacheRequestsWriteRequests
Hardware name: L2_WR_MSG_IN

Snoop requests

This counter increments for every coherency snoop request received by the L2 cache from internal requesters.

libGPUCounters name: MaliL2CacheSnp
Streamline name: $MaliL2CacheRequestsSnoopRequests
Hardware name: L2_SNP_MSG_IN

L1 read requests

This counter increments for every L1 cache read request or read response sent by the L2 cache to an internal requester.

Read requests are triggered by a snoop request from one requester that needs data from another requester's L1 to resolve.

Read responses are standard responses back to a requester in response to its own read requests.

libGPUCounters name: MaliL2CacheL1Rd
Streamline name: $MaliL2CacheRequestsL1ReadRequests
Hardware name: L2_RD_MSG_OUT

L1 write requests

This counter increments for every L1 cache write response sent by the L2 cache to an internal requester.

Write responses are standard responses back to a requester in response to its own write requests.

libGPUCounters name: MaliL2CacheL1Wr
Streamline name: $MaliL2CacheRequestsL1WriteRequests
Hardware name: L2_WR_MSG_OUT

L2 Cache Lookups

This counter group shows the total number of lookups made into the L2 cache from any source.

Any lookups

This counter increments for every L2 cache lookup made, including all reads, writes, coherency snoops, and cache flush operations.

libGPUCounters name: MaliL2CacheLookup
Streamline name: $MaliL2CacheLookupsAnyLookups
Hardware name: L2_ANY_LOOKUP

Read lookups

This counter increments for every L2 cache read lookup made.

libGPUCounters name: MaliL2CacheRdLookup
Streamline name: $MaliL2CacheLookupsReadLookups
Hardware name: L2_READ_LOOKUP

Write lookups

This counter increments for every L2 cache write lookup made.

libGPUCounters name: MaliL2CacheWrLookup
Streamline name: $MaliL2CacheLookupsWriteLookups
Hardware name: L2_WRITE_LOOKUP

External snoop lookups

This counter increments for every coherency snoop lookup performed that is triggered by a requester outside of the GPU.

libGPUCounters name: MaliL2CacheSnpLookup
Streamline name: $MaliL2CacheLookupsExternalSnoopLookups
Hardware name: L2_EXT_SNOOP_LOOKUP

L2 Cache Stall Cycles

This counter group shows the total number of stall cycles that impact L2 cache lookups.

Read stalls

This counter increments for every clock cycle an L2 cache read request from an internal requester is stalled.

libGPUCounters name: MaliL2CacheRdStallCy
Streamline name: $MaliL2CacheStallCyclesReadStalls
Hardware name: L2_RD_MSG_IN_STALL

Write stalls

This counter increments for every clock cycle when an L2 cache write request from an internal requester is stalled.

libGPUCounters name: MaliL2CacheWrStallCy
Streamline name: $MaliL2CacheStallCyclesWriteStalls
Hardware name: L2_WR_MSG_IN_STALL

Snoop stalls

This counter increments for every clock cycle when an L2 cache coherency snoop request from an internal requester is stalled.

libGPUCounters name: MaliL2CacheSnpStallCy
Streamline name: $MaliL2CacheStallCyclesSnoopStalls
Hardware name: L2_SNP_MSG_IN_STALL

L1 read stalls

This counter increments for every clock cycle when L1 cache read requests and responses sent by the L2 cache to an internal requester are stalled.

libGPUCounters name: MaliL2CacheL1RdStallCy
Streamline name: $MaliL2CacheStallCyclesL1ReadStalls
Hardware name: L2_RD_MSG_OUT_STALL

L2 Cache Miss Rate

This counter group shows the miss rate in the L2 cache.

Read miss rate

This expression defines the percentage of internal L2 cache reads that result in an external read.

libGPUCounters name: MaliL2CacheRdMissRate

libGPUCounters derivation:

max(min((MaliExtBusRd / MaliL2CacheRdLookup) * 100, 100), 0)

Streamline derivation:

max(min(($MaliExternalBusAccessesReadTransactions / $MaliL2CacheLookupsReadLookups) * 100, 100), 0)

Hardware derivation:

max(min((L2_EXT_READ / L2_READ_LOOKUP) * 100, 100), 0)

Write miss rate

This expression defines the percentage of internal L2 cache writes that result in an external write.

libGPUCounters name: MaliL2CacheWrMissRate

libGPUCounters derivation:

max(min((MaliExtBusWr / MaliL2CacheWrLookup) * 100, 100), 0)

Streamline derivation:

max(min(($MaliExternalBusAccessesWriteTransactions / $MaliL2CacheLookupsWriteLookups) * 100, 100), 0)

Hardware derivation:

max(min((L2_EXT_WRITE / L2_WRITE_LOOKUP) * 100, 100), 0)
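
Both miss rate expressions can be evaluated with the same clamped percentage helper; the counter values below are hypothetical.

    def clamped_pct(numerator, denominator):
        # Percentage value clamped to the 0-100 range, as in the derivations above.
        if denominator == 0:
            return 0.0
        return max(min((numerator / denominator) * 100.0, 100.0), 0.0)

    sample = {"L2_READ_LOOKUP": 900_000, "L2_EXT_READ": 90_000,
              "L2_WRITE_LOOKUP": 400_000, "L2_EXT_WRITE": 120_000}

    read_miss_rate = clamped_pct(sample["L2_EXT_READ"], sample["L2_READ_LOOKUP"])     # 10.0
    write_miss_rate = clamped_pct(sample["L2_EXT_WRITE"], sample["L2_WRITE_LOOKUP"])  # 30.0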

Stage 1 MMU Translations

This counter group shows the number of stage 1 page table lookups handled by the GPU MMU.

MMU lookups

This counter increments for every address lookup made by the main GPU MMU. Increments only occur if all lookups into a local TLB miss.

libGPUCounters name: MaliMMULookup
Streamline name: $MaliStage1MMUTranslationsMMULookups
Hardware name: MMU_REQUESTS

L2 table reads

This counter increments for every read of a level 2 MMU translation table entry. Each address translation at this level covers a 2MB section, which is typically broken down further into 4KB pages using a subsequent level 3 translation table lookup.

libGPUCounters name: MaliMMUL2Rd
Streamline name: $MaliStage1MMUTranslationsL2TableReads
Hardware name: MMU_TABLE_READS_L2

L2 table read hits

This counter increments for every read of a level 2 MMU translation table entry that results in a successful hit in the main MMU's TLB.

libGPUCounters name: MaliMMUL2Hit
Streamline name: $MaliStage1MMUTranslationsL2TableReadHits
Hardware name: MMU_HIT_L2

L3 table reads

This counter increments for every read of a level 3 MMU translation table entry. Each address translation at this level covers a single 4KB page.

libGPUCounters name: MaliMMUL3Rd
Streamline name: $MaliStage1MMUTranslationsL3TableReads
Hardware name: MMU_TABLE_READS_L3

L3 table read hits

This counter increments for every read of a level 3 MMU translation table entry that results in a successful hit in the main MMU's TLB.

libGPUCounters name: MaliMMUL3Hit
Streamline name: $MaliStage1MMUTranslationsL3TableReadHits
Hardware name: MMU_HIT_L3
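
This reference does not define a derived counter for the table read hit rate, but the read and hit counters in this group can be combined as a simple illustrative ratio. The sketch below uses hypothetical counter values.

    sample = {"MMU_TABLE_READS_L2": 10_000, "MMU_HIT_L2": 8_500,
              "MMU_TABLE_READS_L3": 40_000, "MMU_HIT_L3": 30_000}

    l2_table_hit_rate = 100.0 * sample["MMU_HIT_L2"] / max(sample["MMU_TABLE_READS_L2"], 1)  # 85.0
    l3_table_hit_rate = 100.0 * sample["MMU_HIT_L3"] / max(sample["MMU_TABLE_READS_L3"], 1)  # 75.0
    # Low hit rates indicate that translations frequently walk the page tables in
    # memory rather than being served from cached table entries.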

Stage 2 MMU Translations

This counter group shows the number of stage 2 page table lookups handled by the GPU MMU.

MMU lookups

This counter increments for every stage 2 lookup made by the main GPU MMU. Increments only occur if all lookups into a local TLB miss.

Stage 2 address translation is used when the operating system using the GPU is a guest in a virtualized environment. The guest operating system controls the stage 1 MMU, translating virtual addresses into intermediate physical addresses. The hypervisor controls the stage 2 MMU, translating intermediate physical addresses into physical addresses.

libGPUCounters name: MaliMMUS2Lookup
Streamline name: $MaliStage2MMUTranslationsMMULookups
Hardware name: MMU_S2_REQUESTS

L2 table reads

This counter increments for every read of a stage 2 level 2 MMU translation table entry. Each address translation at this level covers a 2MB section.

libGPUCounters name: MaliMMUS2L2Rd
Streamline name: $MaliStage2MMUTranslationsL2TableReads
Hardware name: MMU_S2_TABLE_READS_L2

L2 table read hits

This counter increments for every read of a stage 2 level 2 MMU translation table entry that results in a successful hit in the main MMU's TLB.

libGPUCounters name: MaliMMUS2L2Hit
Streamline name: $MaliStage2MMUTranslationsL2TableReadHits
Hardware name: MMU_S2_HIT_L2

L3 table reads

This counter increments for every read of a stage 2 level 3 MMU translation table entry. Each address translation at this level covers a single 4KB page.

libGPUCounters name: MaliMMUS2L3Rd
Streamline name: $MaliStage2MMUTranslationsL3TableReads
Hardware name: MMU_S2_TABLE_READS_L3

L3 table read hits

This counter increments for every read of a stage 2 level 3 MMU translation table entry that results in a successful hit in the main MMU's TLB.

libGPUCounters name: MaliMMUS2L3Hit
Streamline name: $MaliStage2MMUTranslationsL3TableReadHits
Hardware name: MMU_S2_HIT_L3

Constants

Arm GPUs are configurable, with variable performance across products and variable configurations across devices.

This section lists useful symbolic configuration and constant values that can be used in expressions to compute derived counters. Note that configuration values must be provided by a run-time tool that can query the actual implementation configuration of the target device.

Implementation Configuration

This constants group contains symbolic constants that define the configuration of a particular device. These must be populated by the counter sampling runtime tooling.

Shader core count

This configuration constant defines the number of shader cores in the design.

libGPUCounters name: MaliConfigCoreCount

libGPUCounters derivation:

MALI_CONFIG_SHADER_CORE_COUNT

Streamline derivation:

$MaliConstantsShaderCoreCount

Hardware derivation:

MALI_CONFIG_SHADER_CORE_COUNT

L2 cache slice count

This configuration constant defines the number of L2 cache slices in the design.

libGPUCounters name: MaliConfigL2CacheCount

libGPUCounters derivation:

MALI_CONFIG_L2_CACHE_COUNT

Streamline derivation:

$MaliConstantsL2SliceCount

Hardware derivation:

MALI_CONFIG_L2_CACHE_COUNT

External bus beat size

This configuration constant defines the number of bytes transferred per external bus beat.

libGPUCounters name: MaliConfigExtBusBeatSize

libGPUCounters derivation:

MALI_CONFIG_EXT_BUS_BYTE_SIZE

Streamline derivation:

($MaliConstantsBusWidthBits / 8)

Hardware derivation:

MALI_CONFIG_EXT_BUS_BYTE_SIZE
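
As a hypothetical example of how this constant is used, an external bus read-beat count can be converted into bytes by multiplying by the queried beat size. Both values below are assumed for illustration.

    # Hypothetical values: the beat count would come from an external bus read
    # beats counter, and the beat size from the runtime-queried configuration.
    external_read_beats = 5_000_000
    ext_bus_beat_bytes = 16  # MALI_CONFIG_EXT_BUS_BYTE_SIZE, provided by tooling

    external_read_bytes = external_read_beats * ext_bus_beat_bytes  # 80,000,000 bytes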

Static Configuration

This constants group contains literal constants that define the static configuration and performance characteristics of this product.

Fragment queue task size

This constant defines the number of pixels in each axis per fragment task.

libGPUCounters name: MaliFragQueueTaskSize

libGPUCounters derivation:

32

Streamline derivation:

32

Hardware derivation:

32

Tiler shader task thread count

This constant defines the number of threads per vertex shading task issued by the tiler. Each task performs position shading or varying shading for multiple sequential vertices concurrently.

libGPUCounters name: MaliGPUGeomTaskSize

libGPUCounters derivation:

4

Streamline derivation:

4

Hardware derivation:

4

Tile size

This constant defines the size of a tile, in pixels per axis.

libGPUCounters name: MaliGPUTileSize

libGPUCounters derivation:

16

Streamline derivation:

16

Hardware derivation:

16

Tile storage/pixel

This constant defines the number of bits of color storage per pixel available when using a 16 x 16 tile size. If multi-sampling, wide color formats, or multiple render targets require more storage than is available, the driver dynamically reduces the tile size until sufficient storage is available.

libGPUCounters name: MaliGPUMaxPixelStorage

libGPUCounters derivation:

256

Streamline derivation:

256

Hardware derivation:

256

Warp size

This constant defines the number of threads in a single warp.

libGPUCounters name: MaliGPUWarpSize

libGPUCounters derivation:

4

Streamline derivation:

4

Hardware derivation:

4

Varying slot count

This constant defines the number of varying unit slots.

The width of a slot is GPU-dependent.

libGPUCounters name: MaliVarSlotPerCy

libGPUCounters derivation:

1

Streamline derivation:

1

Hardware derivation:

1

Texture samples/cycle

This constant defines the maximum number of texture samples that can be made per cycle.

libGPUCounters name: MaliTexSamplePerCy

libGPUCounters derivation:

1

Streamline derivation:

1

Hardware derivation:

1

Texture cycles/sample

This constant defines the minimum number of cycles needed to make a texture sample.

libGPUCounters name: MaliTexCyPerSample

libGPUCounters derivation:

1

Streamline derivation:

1

Hardware derivation:

1

Internal bus beat size

This constant defines the number of bytes transferred per internal bus beat.

libGPUCounters name: MaliSCBusBeatSize

libGPUCounters derivation:

16

Streamline derivation:

16

Hardware derivation:

16

Copyright © Arm 2025