# Introduction

The Computer Vision and Machine Learning library is a set of functions optimised for both ARM CPUs and GPUs using SIMD technologies. Several builds of the library are available using various configurations:

• OS: Linux, Android or bare metal.
• Architecture: armv7a (32bit) or arm64-v8a (64bit)
• Technology: NEON / OpenCL / GLES_COMPUTE / NEON and OpenCL and GLES_COMPUTE
• Debug / Asserts / Release: Use a build with asserts enabled to debug your application and enable extra validation. Once you are sure your application works as expected you can switch to a release build of the library for maximum performance.
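
For instance, a release build with NEON support for 64-bit Linux could be configured as follows (a sketch combining scons options that appear in the Android examples below; adjust os and arch to your target):

scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=arm64-v8a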

# Contact / Support

In order to facilitate the work of the support team, please provide the build information of the library you are using. The version is embedded in the library binary itself.
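
For example, assuming an Android armv7a OpenCL asserts build (the library path below is an assumption; adjust it to your own build output), the version string can be extracted with:

strings android-armv7a-cl-asserts/libarm_compute.so | grep arm_compute_version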



### How to build the library?

To cross-compile the library in debug mode, with NEON only support, for Android 32bit:

CXX=clang++ CC=clang scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=android arch=armv7a


To cross-compile the library in asserts mode, with OpenCL only support, for Android 64bit:

CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=arm64-v8a


To cross-compile the library in asserts mode, with GLES_COMPUTE only support, for Android 64bit:

CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=0 gles_compute=1 embed_kernels=1 os=android arch=arm64-v8a


### How to manually build the examples?

The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.

Note
The following command lines assume the arm_compute binaries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built library with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
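
For example, if the prebuilt libraries sit in build/android-arm64-v8a (a hypothetical location; use your actual build output folder), the 64-bit NEON example below could be linked as:

aarch64-linux-android-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -Lbuild/android-arm64-v8a -o neon_convolution_aarch64 -static-libstdc++ -pie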

Once you've got your Android standalone toolchain built and added to your path you can do the following:

To cross compile a NEON example:

#32 bit:
arm-linux-androideabi-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_arm -static-libstdc++ -pie
#64 bit:
aarch64-linux-android-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_aarch64 -static-libstdc++ -pie


To cross compile an OpenCL example:

#32 bit:
arm-linux-androideabi-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
#64 bit:
aarch64-linux-android-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL


To cross compile a GLES example:

#32 bit:
arm-linux-androideabi-clang++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o gc_absdiff_arm -static-libstdc++ -pie -DARM_COMPUTE_GC
#64 bit:
aarch64-linux-android-clang++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o gc_absdiff_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_GC


To cross compile the examples with the Graph API, such as graph_lenet.cpp, you also need to link against the arm_compute_graph library.

#32 bit:
arm-linux-androideabi-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
#64 bit:
aarch64-linux-android-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL

Note
Due to some issues in older versions of the Mali OpenCL DDK (<= r13p0), we recommend linking arm_compute statically on Android.
When linked statically, the arm_compute_graph library currently needs the --whole-archive linker flag in order to work properly.

Then all you need to do is upload the executable and the shared library to the device using ADB:

adb push neon_convolution_arm /data/local/tmp/
adb shell chmod 777 -R /data/local/tmp/


And finally to run the example:

adb shell /data/local/tmp/neon_convolution_arm


For 64bit:

adb push neon_convolution_aarch64 /data/local/tmp/
adb shell chmod 777 -R /data/local/tmp/


And finally to run the example:

adb shell /data/local/tmp/neon_convolution_aarch64

Note
Examples accept different types of arguments. To find out what they are, run the example with --help as an argument. If no arguments are specified, random values will be used to execute the graph.

For example: adb shell /data/local/tmp/graph_lenet --help

In this case the first argument of LeNet (like all the graph examples) is the target (i.e. 0 to run on NEON, 1 to run on OpenCL if available, 2 to run on OpenCL using the CLTuner), the second argument is the path to the folder containing the npy files for the weights, and finally the third argument is the number of batches to run.
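
As an illustration of these positional arguments (the assets path here is hypothetical), the following would run LeNet on NEON with weights read from /data/local/tmp/assets for one batch:

adb shell /data/local/tmp/graph_lenet 0 /data/local/tmp/assets 1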

## Building for bare metal

For bare metal, the library was successfully built using Linaro's latest (gcc-linaro-6.3.1-2017.05) bare metal toolchains:

• arm-eabi for armv7a
• aarch64-elf for arm64-v8a

Note
Make sure to add the toolchains to your PATH: export PATH=$PATH:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-elf/bin:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_arm-eabi/bin

### How to build the library?

To cross-compile the library with NEON support for bare metal arm64-v8a:

scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=bare_metal arch=arm64-v8a build=cross_compile cppthreads=0 openmp=0 standalone=1


### How to manually build the examples?

Examples are disabled when building for bare metal. If you want to build the examples you need to provide a custom bootcode, depending on the target architecture, and link against the Compute Library. More information about bare metal bootcode can be found here.

## Building on a Windows host system

Using scons directly from the Windows command line is known to cause problems. The reason seems to be that if scons is set up for cross-compilation it gets confused about Windows-style paths (which use backslashes). It is therefore recommended to follow one of the options outlined below.

### Bash on Ubuntu on Windows

The best and easiest option is to use Bash on Ubuntu on Windows. This feature is still marked as beta and thus might not be available. However, if it is, building the library is as simple as opening a Bash on Ubuntu on Windows shell and following the general guidelines given above.

### Cygwin

If the Windows Subsystem for Linux is not available, Cygwin can be used to install and run scons; Cygwin version 3.0.7 or later is required. In addition to the default packages installed by Cygwin, scons has to be selected in the installer. (git might also be useful but is not strictly required if you already have the source code of the library.) Linaro provides pre-built versions of GCC cross-compilers that can be used from the Cygwin terminal. When building for Android the compiler is included in the Android standalone toolchain. After everything has been set up in the Cygwin terminal, the general guide on building the library can be followed.
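
As a sketch, assuming setup-x86_64.exe is the Cygwin installer you downloaded, the extra packages can also be selected non-interactively:

# Quiet mode, adding the scons and git packages on top of the defaults
setup-x86_64.exe -q -P scons,git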

## OpenCL DDK Requirements

### Hard Requirements

Compute Library requires OpenCL 1.1 and above, with support for non-uniform work-group sizes, which is officially supported in the Mali OpenCL DDK r8p0 and above as an extension (the corresponding compiler flag is -cl-arm-non-uniform-work-group-size).

Enabling 16-bit floating point calculations requires the cl_khr_fp16 extension to be supported. All Mali GPUs with compute capabilities have native support for half-precision floating point.

Use of the CLMeanStdDev function requires 64-bit atomics support, thus the cl_khr_int64_base_atomics extension should be supported in order to use it.
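
A quick way to check which of these extensions the target device reports, assuming the clinfo utility is available on it, is:

clinfo | grep -E "cl_khr_fp16|cl_khr_int64_base_atomics"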

### Performance improvements

Integer dot product built-in function extensions (and therefore optimized kernels) are available with Mali OpenCL DDK r22p0 and above for the following GPUs: G71, G76. The relevant extensions are cl_arm_integer_dot_product_int8, cl_arm_integer_dot_product_accumulate_int8 and cl_arm_integer_dot_product_accumulate_int16.

OpenCL kernel level debugging can be simplified with the use of printf; this requires the cl_arm_printf extension to be supported.

SVM allocations are supported for all the underlying allocations in Compute Library. To enable this, OpenCL 2.0 or above is required.

## OpenCL Tuner

The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS). The optimal LWS for each unique OpenCL kernel configuration is stored in a table, which can be imported from or exported to a file. The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run, to be used in subsequent calls to the kernel. It supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found:

• Exhaustive mode searches all the supported values of LWS. It takes the longest time to tune and is the most likely to find the optimal LWS.
• Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode.
• Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as or better than the default LWS value.

The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file. In order for the performance numbers to be meaningful you must disable GPU power management and set the GPU to a fixed frequency for the entire duration of the tuning phase.
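
For instance, assuming your build of the graph examples exposes a --tuner-mode option (an assumption; check --help on your build), the mode could be selected on the command line:

./graph_mobilenet --enable-tuner --tuner-mode=rapid --target=CL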

If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms", available at the following link:

https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice

Tuning a network from scratch can take a long time and can considerably affect the execution time of the first run of your network. For this reason it is recommended to store the CLTuner's results in a file, which amortizes this cost when you re-use the same network or functions with the same configurations. The tuning is performed only once for each OpenCL kernel.

CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (i.e. Convolution Layer, Pooling Layer, Fully Connected Layer ...) can be called multiple times but with different parameters, we associate an "id" (called "config_id") to each kernel to distinguish the unique configurations.

// Example: 2 unique Matrix Multiply configurations

#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/CL/CLTensor.h"
#include "arm_compute/runtime/CL/functions/CLGEMM.h"

using namespace arm_compute;

TensorShape a0 = TensorShape(32,32);
TensorShape b0 = TensorShape(32,32);
TensorShape c0 = TensorShape(32,32);
TensorShape a1 = TensorShape(64,64);
TensorShape b1 = TensorShape(64,64);
TensorShape c1 = TensorShape(64,64);
// CLGEMM operates on OpenCL tensors, so CLTensor (not the CPU-side Tensor) is used here
CLTensor a0_tensor;
CLTensor b0_tensor;
CLTensor c0_tensor;
CLTensor a1_tensor;
CLTensor b1_tensor;
CLTensor c1_tensor;
a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
c1_tensor.allocator()->init(TensorInfo(c1, 1, DataType::F32));
CLGEMM gemm0;
CLGEMM gemm1;
// Configuration 0: 32x32 GEMM, producing one config_id
gemm0.configure(&a0_tensor, &b0_tensor, nullptr, &c0_tensor, 1.0f, 0.0f);
// Configuration 1: 64x64 GEMM, producing a second, distinct config_id
gemm1.configure(&a1_tensor, &b1_tensor, nullptr, &c1_tensor, 1.0f, 0.0f);

### How to use it

All the graph examples in ACL's examples folder and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file:

#Enable CL tuner
./graph_mobilenet --enable-tuner --target=CL
./arm_compute_benchmark --enable-tuner

#Export/Import to/from a file
./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv


If you are importing the CLTuner's results from a file, the new tuned LWS values will be appended to it.

Whether you are benchmarking the graph examples or the test cases in the arm_compute_benchmark, remember to:

• Disable power management
• Keep the GPU frequency constant
• Run the network multiple times (e.g. 10).


If you are not using the graph API or the benchmark infrastructure you will need to manually pass a CLTuner object to CLScheduler before configuring any function.

CLTuner tuner;
// Set up the scheduler, passing the tuner so it is used by every CL function configured afterwards
CLScheduler::get().default_init(&tuner);

After the first run, the CLTuner's results can be exported to a file using the method save_to_file():

tuner.save_to_file("results.csv");

This file can also be imported later using the method load_from_file("results.csv").