Compute Library
 21.11
Advanced

Table of Contents

OpenCL Tuner

The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels tuning the Local-Workgroup-Size (LWS). The optimal LWS for each unique OpenCL kernel configuration is stored in a table. This table can be either imported or exported from/to a file. The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run to use in subsequent calls to the kernel. It supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good or better than the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file. In order for the performance numbers to be meaningful you must disable the GPU power management and set it to a fixed frequency for the entire duration of the tuning phase.

If you wish to know more about LWS and the important role on improving the GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms available at the following link:

https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice

Tuning a network from scratch can be long and affect considerably the execution time for the first run of your network. It is recommended for this reason to store the CLTuner's result in a file to amortize this time when you either re-use the same network or the functions with the same configurations. The tuning is performed only once for each OpenCL kernel.

CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (i.e. Convolution Layer, Pooling Layer, Fully Connected Layer ...) can be called multiple times but with different parameters, we associate an "id" (called "config_id") to each kernel to distinguish the unique configurations.

#Example: 2 unique Matrix Multiply configurations
TensorShape a0 = TensorShape(32,32);
TensorShape b0 = TensorShape(32,32);
TensorShape c0 = TensorShape(32,32);
TensorShape a1 = TensorShape(64,64);
TensorShape b1 = TensorShape(64,64);
TensorShape c1 = TensorShape(64,64);
Tensor a0_tensor;
Tensor b0_tensor;
Tensor c0_tensor;
Tensor a1_tensor;
Tensor b1_tensor;
Tensor c1_tensor;
a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
c1_tensor.allocator()->init(TensorInfo(c1 1, DataType::F32));
CLGEMM gemm0;
CLGEMM gemm1;
// Configuration 0
gemm0.configure(&a0, &b0, nullptr, &c0, 1.0f, 0.0f);
// Configuration 1
gemm1.configure(&a1, &b1, nullptr, &c1, 1.0f, 0.0f);

How to use it

All the graph examples in the Compute Library's folder "examples" and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file

#Enable CL tuner
./graph_mobilenet --enable-tuner –-target=CL
./arm_compute_benchmark --enable-tuner

#Export/Import to/from a file
./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv

If you are importing the CLTuner'results from a file, the new tuned LWS values will be appended to it.

Either you are benchmarking the graph examples or the test cases in the arm_compute_benchmark remember to:

-# Disable the power management
-# Keep the GPU frequency constant
-# Run multiple times the network (i.e. 10).

If you are not using the graph API or the benchmark infrastructure you will need to manually pass a CLTuner object to CLScheduler before configuring any function.

CLTuner tuner;
// Setup Scheduler

After the first run, the CLTuner's results can be exported to a file using the method "save_to_file()".

  • tuner.save_to_file("results.csv");

This file can be also imported using the method "load_from_file("results.csv")".

  • tuner.load_from_file("results.csv");