Release versions

All releases are numbered vYY.MM, where YY is the last two digits of the year and MM is the month number. If there is more than one release in a month, an extra sequential number is appended at the end:

v17.03 (First release of March 2017)
v17.03.1 (Second release of March 2017)
v17.04 (First release of April 2017)
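
The numbering scheme above can be handled mechanically. The following is a minimal sketch, not part of the library: the names ReleaseVersion and parse_release are hypothetical, and the sketch assumes all releases fall in the years 2000 to 2099.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    year: int      # four-digit year, assuming releases fall in 2000-2099
    month: int     # 1-12
    sequence: int  # 0 for the first release of the month, 1 for the second, ...


_VERSION_RE = re.compile(r"v(\d{2})\.(\d{2})(?:\.(\d+))?")


def parse_release(tag: str) -> ReleaseVersion:
    """Parse a vYY.MM or vYY.MM.N tag into its components."""
    match = _VERSION_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a vYY.MM[.N] release tag: {tag!r}")
    year, month, seq = match.groups()
    if not 1 <= int(month) <= 12:
        raise ValueError(f"month out of range in {tag!r}")
    # A missing trailing number means the first release of that month.
    return ReleaseVersion(2000 + int(year), int(month), int(seq or 0))


tags = ["v17.04", "v17.03.1", "v17.03"]
print(sorted(tags, key=parse_release))
# ['v17.03', 'v17.03.1', 'v17.04']
```

Sorting on the parsed tuple rather than on the raw string keeps the order chronological even when the sequential number reaches two digits.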

Note: We aim to publish one major public release with new features per quarter. Releases in between contain bug fixes only.

Note: Starting from release 22.05, the 'master' branch is no longer used; it has been replaced by 'main'. Please update your clone jobs accordingly.

Changelog

v22.11 Public major release

New features:
- Add new experimental dynamic fusion API.
- Add CPU batch matrix multiplication with adj_x = false and adj_y = false for FP32.
- Add CPU MeanStdDevNorm for QASYMM8.
- Add CPU and GPU GELU activation function for FP32 and FP16.
- Add CPU swish activation function for FP32 and FP16.
Performance optimizations:
- Optimize CPU bilinear scale for FP32, FP16, QASYMM8, QASYMM8_SIGNED, U8 and S8.
- Optimize CPU activation functions using LUT-based implementation:
  - Sigmoid function for QASYMM8 and QASYMM8_SIGNED.
  - Hard swish function for QASYMM8_SIGNED.
- Optimize CPU addition for QASYMM8 and QASYMM8_SIGNED using fixed-point arithmetic.
- Optimize CPU multiplication, subtraction and activation layers by considering tensors as 1D.
- Optimize GPU depthwise convolution kernel and heuristic.
- Optimize GPU Conv2d heuristic.
- Optimize CPU MeanStdDevNorm for FP16.
- Optimize CPU tanh activation function for FP16 using rational approximation.
Improve GPU GeMMLowp start-up time.
Various optimizations and bug fixes.

v22.08 Public major release

Various bug fixes.
Disable unsafe FP optimizations causing accuracy issues in:
Add Dynamic Fusion of Elementwise Operators: Div, Floor, Add.
Optimize the gemm_reshaped_rhs_nly_nt OpenCL kernel using the arm_matrix_multiply extension available for Arm® Mali™-G715 and Arm® Mali™-G615.
Add support for the arm_matrix_multiply extension in the gemmlowp_mm_reshaped_only_rhs_t OpenCL kernel.
Expand GPUTarget list with missing Mali™ GPUs product names: G57, G68, G78AE, G610, G510, G310.
Extend the direct convolution 2d interface to configure the block size.
Update ClConv2D heuristic to use direct convolution.
Use official Khronos® OpenCL extensions:
- Add cl_khr_integer_dot_product extension support.
- Add support for OpenCL 3.0 non-uniform workgroups.
CPU performance optimizations:
- Add LUT-based implementation of Hard Swish and Leaky ReLU activation function for aarch64 build.
- Optimize Add layer by considering the input tensors as 1D array.
Add fixed-format BF16, FP16 and FP32 Neon™ GEMM kernels to support variable weights.
Add new winograd convolution kernels implementation and update the ACL CpuWinogradConv2d operator.
Add experimental support for native builds for Windows on Arm®.
Build flag interpretation change: arch=armv8.6-a now translates to the -march=armv8.6-a CXX flag instead of -march=armv8.2-a plus explicit selection of feature extensions.
Build flag change: toolchain_prefix, compiler_prefix:
- Use empty string "" to suppress any prefixes.
- Use "auto" to use default (auto) prefixes chosen by the build script. This is the default behavior when unspecified.
- Any other string will be used as custom prefixes to the compiler and the rest of toolchain tools.
- The default behaviour when prefix is unspecified does not change, but its signifier has been changed from empty string "" to "auto".
armv7a with Android build will no longer be tested or maintained.
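
The toolchain_prefix / compiler_prefix rules listed above for this release can be summarised in a few lines. This is a hedged sketch of the selection logic only, not the build script itself: resolve_prefix, tool_command and the "example-target-" default are hypothetical names chosen for illustration.

```python
def resolve_prefix(value: str, default_prefix: str) -> str:
    """Return the prefix to prepend to a tool name.

    `default_prefix` stands in for whatever the build script would choose
    for the selected target; it is a placeholder, not a real value.
    """
    if value == "auto":   # also the behaviour when the flag is unspecified
        return default_prefix
    return value          # "" suppresses the prefix; anything else is custom


def tool_command(tool: str, value: str = "auto",
                 default_prefix: str = "example-target-") -> str:
    """Build the command name for `tool` under the given prefix setting."""
    return resolve_prefix(value, default_prefix) + tool


print(tool_command("g++"))                # example-target-g++
print(tool_command("g++", ""))            # g++
print(tool_command("ar", "my-cross-"))    # my-cross-ar
```

The practical consequence is that an empty string no longer means "use the default": scripts that passed "" to get the default prefix need to pass "auto" or drop the flag.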

v22.05 Public major release

Various bug fixes.
Various optimizations.
Add support for NDK r23b.
Inclusive language adjustment. Please refer to Inclusive language guideline for details.
New Arm® Neon™ kernels / functions:
- CpuPool3dKernel
New OpenCL kernels / functions:
- ClPool3dKernel
Improve the start-up times for the following OpenCL kernels:
Decouple the implementation of the following Cpu kernels into various data types (fp32, fp16, int):

v22.02 Public major release

Various bug fixes.
Various optimizations.
Update A510 arm_gemm cpu Kernels.
Inclusive language adjustment. Please refer to Inclusive language guideline for details.
Improve the start-up time for the following OpenCL kernels:
Remove functions:
- CLRemap
- NERemap
Remove padding from OpenCL kernels:
- ClDirectConv2dKernel
Remove padding from Cpu kernels:
- CpuDirectConv2dKernel
Decouple the implementation of the following Cpu kernels into various data types (fp32, fp16, int):
- CpuActivationKernel
- CpuAddKernel
- CpuElementwiseKernel
- CpuSoftmaxKernel
- NEBoundingBoxTransformKernel
- NECropKernel
- NEComputeAllAnchorsKernel
- NEInstanceNormalizationLayerKernel
- NEMaxUnpoolingLayerKernel
- NEMeanStdDevNormalizationKernel
- NERangeKernel
- NEROIAlignLayerKernel
- NESelectKernel

v21.11 Public major release

Various bug fixes.
Various optimizations:
- Improve performance of bilinear and nearest neighbor Scale on both CPU and GPU for FP32, FP16, Int8, Uint8 data types
- Improve performance of Softmax on GPU for Uint8/Int8
New OpenCL kernels / functions:
- CLConv3D
New Arm® Neon™ kernels / functions:
- NEConv3D
Support configurable build by a selected subset of operator list
Support MobileBert on Neon™ backend
Improve operator/function logging
Remove padding from OpenCL kernels:
- ClPool2dKernel
- ClScaleKernel
- ClGemmMatrixMultiplyReshapedKernel
Remove padding from Cpu kernels:
- CpuPool2dKernel
Remove Y padding from OpenCL kernels:
- ClGemmMatrixMultiplyKernel
- ClGemmReshapedRHSMatrixKernel
Remove legacy GeMM kernels in gemm_v1.cl

v21.08 Public major release

Various bug fixes.
Various optimizations:
- Improve LWS (Local-Workgroup-Size) heuristic in OpenCL for GeMM, Direct Convolution and Winograd Transformations when OpenCL tuner is not used
- Improve QASYMM8/QSYMM8 performance on OpenCL for various Arm® Mali™ GPU architectures
- Add dynamic weights support in Fully connected layer (CPU/GPU)
- Various performance optimizations for floating-point data types (CPU/GPU)
Add a reduced core library build arm_compute_core_v2
Expose Operator API
Support fat binary build for armv8.2-a via fat_binary build flag
Add CPU discovery capabilities
Add data type f16 support for:
- CLRemapKernel
Port the following functions to stateless API:
Remove the following functions:
- CLWinogradInputTransform
Remove CLCoreRuntimeContext
Remove ICPPSimpleKernel
Rename file arm_compute/runtime/CL/functions/CLElementWiseUnaryLayer.h to arm_compute/runtime/CL/functions/CLElementwiseUnaryLayer.h

v21.05 Public major release

Various bug fixes.
Various optimisations.
Various documentation updates:
- Add supported operators and corresponding Android NNAPI operators.
- Documentation reorg into user guide and contributor guide.
Add support for a global allocator for OpenCL tensors
Add experimental support for CLVK.
Add data type S32 support for:
- opencl::kernels::ClArithmeticKernel
Add data type QASYMM8 support for:
Add per-channel quantization support for:
Remove padding from OpenCL kernels:
- CLL2NormalizeLayerKernel
- CLDepthwiseConvolutionLayer3x3NHWCKernel
- CLNormalizationLayerKernel
- CLNormalizePlanarYUVLayerKernel
- opencl::kernels::ClMulKernel
- CLReductionOperationKernel
- CLROIPoolingLayerKernel
Remove computer vision support from Arm® Neon™ backend
Remove the following functions:
- NEAbsoluteDifference
- NEAccumulate
- NEBox3x3
- NECannyEdge
- NEChannelCombine
- NEChannelExtract
- NEColorConvert
- NEConvolution
- NEDerivative
- NEDilate
- NEEqualizeHistogram
- NEErode
- NEFastCorners
- NEGaussian3x3
- NEGaussian5x5
- NEGaussianPyramid
- NEHOGDescriptor
- NEHOGDetector
- NEHOGGradient
- NEHOGMultiDetection
- NEHarrisCorners
- NEHistogram
- NEIntegralImage
- NELaplacianPyramid
- NELaplacianReconstruct
- NEMagnitude
- NEMeanStdDev
- NEMedian3x3
- NEMinMaxLocation
- NENonLinearFilter
- NEOpticalFlow
- NEPhase
- NEScharr3x3
- NESobel3x3
- NESobel5x5
- NESobel7x7
- NETableLookup
- NEThreshold
- NEWarpAffine
- NEWarpPerspective
Remove all GLES kernels / functions / tests / examples
Remove computer vision support from CL backend
Remove the following functions:
- CLAbsoluteDifference
- CLAccumulate
- CLBox3x3
- CLCannyEdge
- CLChannelCombine
- CLChannelExtract
- CLColorConvert
- CLConvolution
- CLDerivative
- CLDilate
- CLEqualizeHistogram
- CLErode
- CLFastCorners
- CLGaussian3x3
- CLGaussian5x5
- CLGaussianPyramid
- CLHOGDescriptor
- CLHOGDetector
- CLHOGGradient
- CLHOGMultiDetection
- CLHarrisCorners
- CLHistogram
- CLIntegralImage
- CLLaplacianPyramid
- CLLaplacianReconstruct
- CLMagnitude
- CLMeanStdDev
- CLMedian3x3
- CLMinMaxLocation
- CLNonLinearFilter
- CLOpticalFlow
- CLPhase
- CLScharr3x3
- CLSobel3x3
- CLSobel5x5
- CLSobel7x7
- CLTableLookup
- CLThreshold
- CLWarpAffine
- CLWarpPerspective

v21.02 Public major release

Various bug fixes.
Various optimisations.
Upgrade C++ standard to C++14
Add macOS support
Add Armv8-R AArch64 architecture support
Add SVE/SVE2 support for:
Remove padding from OpenCL kernels:
- CLDirectConvolutionLayerKernel
- CLArgMinMaxLayerKernel
- CLPadLayerKernel
- CLROIAlignLayerKernel
- CLRangeKernel
- CLScaleKernel
- CLSelectKernel
- CLBitwiseKernel
- opencl::kernels::ClFloorKernel
- CLTransposeKernel
Deprecate functions in CLTuner:
- add_lws_to_table
- import_lws_table
- lws_table
Remove functions:
- NELocallyConnectedLayer / CLLocallyConnectedLayer
- NEIm2Col
- NECol2Im
- NEGEMMInterleave4x4
- NEGEMMTranspose1xW
- NEComputeAllAnchors / CLComputeAllAnchors
- NEGEMMAssemblyDispatch
- NEUpsampleLayer / CLUpsampleLayer
Remove kernels:
- NEGEMMMatrixVectorMultiplyKernel
- NELocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedMatrixMultiplyKernel
- NEUpsampleLayerKernel / CLUpsampleLayerKernel
Extend OpenCL tuner with workgroup batch size support
- Experimental extension for the OpenCL tuner to tune the batches of work groups distributed to compute units
Add functionality to load the OpenCL GEMM heuristics at runtime
- The GEMM heuristic file (MLGO) can be used to update the default GEMM heuristics available for OpenCL
Note: there might be performance regressions against v20.08 in Inception v3 using int8 data types on Arm Mali-G77 GPUs. Currently under investigation
Note: data-type decoupling is in progress and experimental. Warnings about unused symbols might be raised

v20.11 Public major release

Various bug fixes.
Various optimisations.
Performance regressions can be noted when executing Depthwise Convolution on Arm® Neon™ with a depth multiplier > 1 for quantized data type. This is planned to be resolved in 21.02 release.
Added new data type QASYMM8_SIGNED support for NEROIAlignLayer.
Added new data type S32 support for:
- NEArithmeticSubtraction
- NEArithmeticSubtractionKernel
- NEPixelWiseMultiplication
- NEPixelWiseMultiplicationKernel
- NEElementwiseDivision
- NEDivisionOperationKernel
Interface change
- Properly support softmax axis to have the same meaning as other major frameworks. That is, axis now defines the dimension on which Softmax/Logsoftmax is performed. E.g. for input of shape 4x5x6 and axis=1, softmax will be applied to 4x6=24 vectors of size 5. The supported value range of axis is [-rank, rank). This change applies to the following functions:
  - NESoftmaxLayer
  - NELogSoftmaxLayer
  - CLSoftmaxLayer
  - CLLogSoftmaxLayer
  - GCSoftmaxLayer
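
The axis semantics described above can be checked with a small, library-independent sketch. It does not use or model the functions listed; softmax_along_axis and normalize_axis are hypothetical helpers, and the flat row-major layout is chosen purely for illustration and is not a statement about the library's tensor layout.

```python
import math
from itertools import product


def normalize_axis(axis: int, rank: int) -> int:
    """Map an axis in [-rank, rank) to its non-negative equivalent."""
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} outside [-{rank}, {rank})")
    return axis % rank


def softmax_along_axis(data, shape, axis):
    """Softmax over `axis` of a flat row-major buffer with the given shape.

    Returns the result buffer and the number of vectors processed.
    """
    rank = len(shape)
    axis = normalize_axis(axis, rank)
    # Row-major strides: the last dimension is contiguous.
    strides = [1] * rank
    for d in range(rank - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    out = [0.0] * len(data)
    vectors = 0
    other_dims = [range(n) for d, n in enumerate(shape) if d != axis]
    for idx in product(*other_dims):
        full = list(idx)
        full.insert(axis, 0)  # start of the vector along `axis`
        base = sum(i * s for i, s in zip(full, strides))
        offsets = [base + k * strides[axis] for k in range(shape[axis])]
        peak = max(data[o] for o in offsets)  # subtract max for stability
        exps = [math.exp(data[o] - peak) for o in offsets]
        total = sum(exps)
        for o, e in zip(offsets, exps):
            out[o] = e / total
        vectors += 1
    return out, vectors


data = [float(i % 7) for i in range(4 * 5 * 6)]
out, count = softmax_along_axis(data, (4, 5, 6), axis=1)
print(count)  # 24
```

With shape 4x5x6 and axis=1 the loop visits 4x6=24 vectors of 5 elements each, and axis=-2 selects the same dimension, which matches the [-rank, rank) range stated above.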
New OpenCL kernels / functions:
- CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
- CLLogicalNot
- CLLogicalAnd
- CLLogicalOr
New Arm® Neon™ kernels / functions:
Removed padding from Arm® Neon™ kernels:
- NEComplexPixelWiseMultiplicationKernel
- NENonMaximaSuppression3x3Kernel
- NERemapKernel
- NEGEMMInterleave4x4Kernel
- NEDirectConvolutionLayerKernel
- NEScaleKernel
- NELocallyConnectedMatrixMultiplyKernel
- NEGEMMLowpOffsetContributionKernel
- NEGEMMTranspose1xWKernel
- NEPoolingLayerKernel
- NEConvolutionKernel
- NEDepthwiseConvolutionLayerNativeKernel
- NEGEMMLowpMatrixMultiplyKernel
- NEGEMMMatrixMultiplyKernel
- NEDirectConvolutionLayerOutputStageKernel
- NEReductionOperationKernel
- NEGEMMLowpMatrixAReductionKernel
- NEGEMMLowpMatrixBReductionKernel
Removed padding from OpenCL kernels:
- CLBatchConcatenateLayerKernel
- CLElementwiseOperationKernel
- CLBatchNormalizationLayerKernel
- CLPoolingLayerKernel
- CLWinogradInputTransformKernel
- CLGEMMLowpMatrixMultiplyNativeKernel
- CLGEMMLowpMatrixAReductionKernel
- CLGEMMLowpMatrixBReductionKernel
- CLGEMMLowpOffsetContributionOutputStageKernel
- CLGEMMLowpOffsetContributionKernel
- CLWinogradOutputTransformKernel
- CLGEMMLowpMatrixMultiplyReshapedKernel
- CLFuseBatchNormalizationKernel
- CLDepthwiseConvolutionLayerNativeKernel
- CLDepthConvertLayerKernel
- CLCopyKernel
- CLDepthwiseConvolutionLayer3x3NHWCKernel
- CLActivationLayerKernel
- CLWinogradFilterTransformKernel
- CLWidthConcatenateLayerKernel
- CLWidthConcatenate4TensorsKernel
- CLWidthConcatenate2TensorsKernel
- CLLogits1DMaxShiftExpSumKernel
- CLLogits1DNormKernel
- CLHeightConcatenateLayerKernel
- CLGEMMMatrixMultiplyKernel
- CLGEMMLowpQuantizeDownInt32ScaleKernel
- CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
- CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
- CLDepthConcatenateLayerKernel
- CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
Removed OpenCL kernels / functions:
- CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
- CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel
- CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel
Deprecated OpenCL kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- CLLocallyConnectedLayer
- CLLocallyConnectedMatrixMultiplyKernel
- CLAbsoluteDifference
- CLAbsoluteDifferenceKernel
- CLAccumulate
- CLAccumulateKernel
- CLAccumulateSquared
- CLAccumulateSquaredKernel
- CLAccumulateWeighted
- CLAccumulateWeightedKernel
- CLAccumulateWeightedFP16Kernel
- CLBox3x3
- CLBox3x3Kernel
- CLBox3x3FP16Kernel
- CLCannyEdge
- CLChannelCombine
- CLChannelCombineKernel
- CLChannelExtract
- CLChannelExtractKernel
- CLColorConvert
- CLColorConvertKernel
- CLConvolution3x3
- CLConvolutionRectangle
- CLConvolutionRectangleKernel
- CLConvolutionSquare
- CLConvolutionKernel
- CLDerivative
- CLDerivativeKernel
- CLDilate
- CLDilateKernel
- CLEqualizeHistogram
- CLErode
- CLErodeKernel
- CLFastCorners
- CLFastCornersKernel
- CLGaussian3x3
- CLGaussian3x3Kernel
- CLGaussian5x5
- CLGaussian5x5HorKernel
- CLGaussian5x5VertKernel
- CLGaussianPyramid
- CLGaussianPyramidHalf
- CLGaussianPyramidOrb
- CLHarrisCorners
- CLHarrisScoreKernel
- CLHarrisScoreFP16Kernel
- CLHistogram
- CLHistogramKernel
- CLHOGOrientationBinningKernel
- CLHOGBlockNormalizationKernel
- CLHOGDetectorKernel
- CLHOGNonMaximaSuppressionKernel
- CLHOGDescriptor
- CLHOGDetector
- CLHOGGradient
- CLHOGMultiDetection
- CLIntegralImage
- CLIntegralImageKernel
- CLLaplacianReconstruct
- CLLaplacianPyramid
- CLMagnitude
- CLMagnitudePhaseKernel
- CLMedian3x3
- CLMedian3x3Kernel
- CLMinMaxLocation
- CLMinMaxLocationKernel
- CLNonLinearFilter
- CLNonLinearFilterKernel
- CLNonMaximaSuppression3x3
- CLNonMaximaSuppression3x3FP16Kernel
- CLNonMaximaSuppression3x3Kernel
- CLOpticalFlow
- CLPhase
- CLRemap
- CLRemapKernel
- CLScharr3x3
- CLScharr3x3Kernel
- CLSobel3x3
- CLSobel3x3Kernel
- CLSobel5x5
- CLSobel5x5HorKernel
- CLSobel5x5VertKernel
- CLSobel7x7
- CLSobel7x7HorKernel
- CLSobel7x7VertKernel
- CLThreshold
- CLThresholdKernel
- CLWarpAffine
- CLWarpAffineKernel
- CLWarpPerspective
- CLWarpPerspectiveKernel
Deprecated Arm® Neon™ kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- NELocallyConnectedLayer
- NELocallyConnectedMatrixMultiplyKernel
- NEAbsoluteDifference
- NEAbsoluteDifferenceKernel
- NEAccumulate
- NEAccumulateKernel
- NEAccumulateSquared
- NEAccumulateSquaredKernel
- NEAccumulateWeighted
- NEAccumulateWeightedKernel
- NEAccumulateWeightedFP16Kernel
- NEBox3x3
- NEBox3x3Kernel
- NEBox3x3FP16Kernel
- NECannyEdge
- NEChannelCombine
- NEChannelCombineKernel
- NEChannelExtract
- NEChannelExtractKernel
- NEColorConvert
- NEColorConvertKernel
- NEConvolution3x3
- NEConvolutionRectangle
- NEConvolutionRectangleKernel
- NEConvolutionSquare
- NEConvolutionKernel
- NEDerivative
- NEDerivativeKernel
- NEDilate
- NEDilateKernel
- NEEqualizeHistogram
- NEErode
- NEErodeKernel
- NEFastCorners
- NEFastCornersKernel
- NEGaussian3x3
- NEGaussian3x3Kernel
- NEGaussian5x5
- NEGaussian5x5HorKernel
- NEGaussian5x5VertKernel
- NEGaussianPyramid
- NEGaussianPyramidHalf
- NEGaussianPyramidOrb
- NEHarrisCorners
- NEHarrisScoreKernel
- NEHarrisScoreFP16Kernel
- NEHistogram
- NEHistogramKernel
- NEHOGOrientationBinningKernel
- NEHOGBlockNormalizationKernel
- NEHOGDetectorKernel
- NEHOGNonMaximaSuppressionKernel
- NEHOGDescriptor
- NEHOGDetector
- NEHOGGradient
- NEHOGMultiDetection
- NEIntegralImage
- NEIntegralImageKernel
- NELaplacianReconstruct
- NELaplacianPyramid
- NEMagnitude
- NEMagnitudePhaseKernel
- NEMedian3x3
- NEMedian3x3Kernel
- NEMinMaxLocation
- NEMinMaxLocationKernel
- NENonLinearFilter
- NENonLinearFilterKernel
- NENonMaximaSuppression3x3
- NENonMaximaSuppression3x3FP16Kernel
- NENonMaximaSuppression3x3Kernel
- NEOpticalFlow
- NEPhase
- NERemap
- NERemapKernel
- NEScharr3x3
- NEScharr3x3Kernel
- NESobel3x3
- NESobel3x3Kernel
- NESobel5x5
- NESobel5x5HorKernel
- NESobel5x5VertKernel
- NESobel7x7
- NESobel7x7HorKernel
- NESobel7x7VertKernel
- NEThreshold
- NEThresholdKernel
- NEWarpAffine
- NEWarpAffineKernel
- NEWarpPerspective
- NEWarpPerspectiveKernel
Deprecated GLES kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- GCAbsoluteDifference
- GCActivationLayer
- GCArithmeticAddition
- GCBatchNormalizationLayer
- GCConcatenateLayer
- GCConvolutionLayer
- GCDepthwiseConvolutionLayer
- GCDirectConvolutionLayer
- GCDropoutLayer
- GCFillBorder
- GCFullyConnectedLayer
- GCGEMM
- GCGEMMInterleave4x4
- GCGEMMTranspose1xW
- GCNormalizationLayer
- GCNormalizePlanarYUVLayer
- GCPixelWiseMultiplication
- GCPoolingLayer
- GCScale
- GCSoftmaxLayer
- GCTensorShift
- GCTranspose

v20.08 Public major release

Various bug fixes.
Various optimisations.
Added new data type QASYMM8_SIGNED support for:
- CLArgMinMaxLayer
- CLArgMinMaxLayerKernel
Added new data type U8 support for:
- NECropKernel
- CLCropKernel
Added align_corner support for nearest neighbor interpolation in:
- NEScaleKernel
- CLScaleKernel
New OpenCL kernels / functions:
- CLMaxUnpoolingLayerKernel
New Arm® Neon™ kernels / functions:
- NEMaxUnpoolingLayerKernel
New graph example:
- graph_yolov3_output_detector
GEMMTuner improvements:
- Added fp16 support
- Output json files for easier integration
- Enabled tuning for export_to_cl_image_rhs option for RHS tensors
- More robust script for running benchmarks
Removed padding from:
- NEPixelWiseMultiplicationKernel
- NEHeightConcatenateLayerKernel
- NEThresholdKernel
- NEBatchConcatenateLayerKernel
- NETransposeKernel
- NEBatchNormalizationLayerKernel
- NEArithmeticSubtractionKernel
- NEBoundingBoxTransformKernel
- NELogits1DMaxKernel
- NELogits1DSoftmaxKernel
- NEROIPoolingLayerKernel
- NEROIAlignLayerKernel
- NEYOLOLayerKernel
- NEUpsampleLayerKernel
- NEFloorKernel
- NEWidthConcatenateLayerKernel
- NEDepthConcatenateLayerKernel
- NENormalizationLayerKernel
- NEL2NormalizeLayerKernel
- NEFillArrayKernel
- NEDepthConvertLayerKernel
- NERangeKernel
- NEPriorBoxLayer
Removed OpenCL kernels / functions:
- CLGEMMLowpQuantizeDownInt32ToUint8Scale
- CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
Removed Arm® Neon™ kernels / functions:
- NEGEMMLowpQuantizeDownInt32ToUint8Scale
- NEGEMMMatrixAccumulateBiasesKernel
Deprecated functions / interfaces:
- Non-descriptor based interfaces for NEThreshold, CLThreshold
- Non-descriptor based interfaces for NEScale, CLScale and GCScale
- In NESoftmaxLayer, NELogSoftmaxLayer, CLSoftmaxLayer, CLLogSoftmaxLayer and GCSoftmaxLayer, the default "axis" value is changed from 1 to 0. Only axis 0 is supported.
The support for quantized data types has been removed from CLLogSoftmaxLayer due to implementation complexity.
Removed padding requirement for the input (e.g. LHS of GEMM) and output in CLGEMMMatrixMultiplyNativeKernel, CLGEMMMatrixMultiplyReshapedKernel, CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and CLIm2ColKernel (NHWC only)
- This change allows CLGEMMConvolutionLayer to be used without extra padding for the input and output.
- Only the weights/bias of CLGEMMConvolutionLayer could require padding for the computation.
- Only on Arm® Mali™ Midgard GPUs, CLGEMMConvolutionLayer could require padding since CLGEMMMatrixMultiplyKernel is called and currently requires padding.
Added support for exporting the OpenCL buffer object to the OpenCL image object in CLGEMMMatrixMultiplyReshapedKernel and CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
- This support allows the OpenCL buffer used for the reshaped RHS matrix to be exported to the OpenCL image object.
- The padding requirement for the OpenCL image object is taken into account in CLGEMMReshapeRHSMatrixKernel.
- The reshaped RHS matrix stores the weights when GEMM is used to accelerate CLGEMMConvolutionLayer.

v20.05 Public major release

Various bug fixes.
Various optimisations.
Updated recommended NDK version to r18b.
Updated recommended gcc version to Linaro 6.3.1.
Added Bfloat16 type support
Added Bfloat16 support in:
- NEWeightsReshapeKernel
- NEConvolutionLayerReshapeWeights
- NEIm2ColKernel
- NEIm2Col
- NEDepthConvertLayerKernel
- NEDepthConvertLayer
- NEGEMMConvolutionLayer
- NEGEMMAssemblyDispatch
Added new data type QASYMM8_SIGNED support for:
- CLDirectConvolutionLayer
- CLDeconvolutionLayer
- CLDirectDeconvolutionLayer
- CLGEMMDeconvolutionLayer
- CLGEMMLowpMatrixMultiplyReshapedKernel
- CLGEMMLowpQuantizeDownInt32ScaleKernel
- CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
- CLReductionOperation
- CLReduceMean
- NEScale
- NEScaleKernel
- NEUpsampleLayer
- NECast
- NEReductionOperation
- NEReduceMean
- NEArgMinMaxLayer
- NEDeconvolutionLayer
- NEGEMMLowpQuantizeDownInt32ScaleKernel
- CPPBoxWithNonMaximaSuppressionLimit
- CPPDetectionPostProcessLayer
- CPPPermuteKernel
- CPPPermute
- CPPTopKVKernel
- CPPTopKV
- CPPUpsample
- CPPUpsampleKernel
New OpenCL kernels / functions:
- CLQLSTMLayer
- CLQLSTMLayerNormalizationKernel
New Arm® Neon™ kernels / functions:
- NEQLSTMLayer
- NEQLSTMLayerNormalizationKernel
Added HARD_SWISH support in:
- CLActivationLayerKernel
- NEActivationLayerKernel
Deprecated OpenCL kernels / functions:
- CLGEMMLowpQuantizeDownInt32ToUint8Scale
- CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
Deprecated Arm® Neon™ kernels / functions:
- NEGEMMLowpQuantizeDownInt32ToUint8Scale
Removed CPP kernels / functions:
- CPPFlipWeightsKernel
Removed PoolingLayerInfo constructors without Data Layout.
Removed CLDepthwiseConvolutionLayer3x3
Removed NEDepthwiseConvolutionLayerOptimized
Added support for Winograd 3x3,4x4 on Arm® Neon™ FP16:
- NEWinogradConvolutionLayer
- CpuWinogradConv2dTransformInputKernel
- CpuWinogradConv2dTransformOutputKernel
- CpuWinogradConv2dTransformWeightsKernel
Added CLCompileContext
Added Arm® Neon™ GEMM kernel with 2D window support

v20.02.1 Maintenance release

Added Android-NN build script.

v20.02 Public major release

Various bug fixes.
Various optimisations.
Added new data type QASYMM8_SIGNED support for:
- CLDepthwiseConvolutionLayer
- CLDepthwiseConvolutionLayer3x3
- CLGEMMConvolutionLayer
- CLGEMMLowpMatrixMultiplyCore
- CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
- CLGEMMLowpMatrixMultiplyNativeKernel
- NEActivationLayer
- NEComparisonOperationKernel
- NEConvolutionLayer
- NEDepthwiseConvolutionLayer
- NEDepthwiseConvolutionLayer3x3Kernel
- NEDirectConvolutionLayerOutputStageKernel
- NEElementwiseComparison
- NEElementwiseMax
- NEElementwiseMin
- NEElementwiseSquaredDiff
- NEFullyConnectedLayer
- NEGEMMMatrixVectorMultiplyKernel
- NEPixelWiseMultiplication
- NEPoolingLayer
- NEPReluLayer
Added support for QSYMM8_PER_CHANNEL in:
- NEDepthwiseConvolutionLayer3x3Kernel
Added support for split sizes in:
- CLSplit
- NESplit
New OpenCL kernels / functions:
- CLFill
- CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
New Arm® Neon™ kernels / functions:
- NEFill
- NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
Deprecated functions / interfaces:
- CLDepthwiseConvolutionLayer3x3
- NEDepthwiseConvolutionLayerOptimized
- PoolingLayerInfo constructors without Data Layout.
Added support for quantization with multiplier greater than 1 on Arm® Neon™ and CL.
Added support for quantized inputs of type QASYMM8_SIGNED and QASYMM8 to CLQuantizationLayer.
Added the ability to build bootcode for bare metal.
Added support for generating synthetic QASYMM8 graphs.
Added support for F16 datatype in VGG16.
Removed pre-built binaries for GLES.

v19.11.1 Public maintenance release

Fix offset calculation in NEReductionOperationKernel.
Fix data layout in NEScaleKernel for nhwc.
Retain configuration step data layout to avoid side-effects.
Perform sqrt in double domain for L2 pooling.
Fix output shape calculation for Reduce Mean
Restrict cases where optimized NEPadLayer runs.

v19.11 Public major release

Various bug fixes.
Various optimisations.
Updated recommended NDK version to r17c.
Deprecated OpenCL kernels / functions:
- CLDepthwiseConvolutionLayerReshapeWeightsGenericKernel
- CLDepthwiseIm2ColKernel
- CLDepthwiseSeparableConvolutionLayer
- CLDepthwiseVectorToTensorKernel
- CLDirectConvolutionLayerOutputStageKernel
Deprecated Arm® Neon™ kernels / functions:
- NEDepthwiseWeightsReshapeKernel
- NEDepthwiseIm2ColKernel
- NEDepthwiseSeparableConvolutionLayer
- NEDepthwiseVectorToTensorKernel
- NEDepthwiseConvolutionLayer3x3
New OpenCL kernels / functions:
- CLInstanceNormalizationLayerKernel / CLInstanceNormalizationLayer
- CLDepthwiseConvolutionLayerNativeKernel to replace the old generic depthwise convolution (see Deprecated OpenCL kernels / functions)
- CLLogSoftmaxLayer
New Arm® Neon™ kernels / functions:
- NEBoundingBoxTransformKernel / NEBoundingBoxTransform
- NEComputeAllAnchorsKernel / NEComputeAllAnchors
- NEDetectionPostProcessLayer
- NEGenerateProposalsLayer
- NEInstanceNormalizationLayerKernel / NEInstanceNormalizationLayer
- NELogSoftmaxLayer
- NEROIAlignLayerKernel / NEROIAlignLayer
Added QASYMM8 support for:
- CLGenerateProposalsLayer
- CLROIAlignLayer
- CPPBoxWithNonMaximaSuppressionLimit
Added QASYMM16 support for:
- CLBoundingBoxTransform
Added FP16 support for:
- CLGEMMMatrixMultiplyReshapedKernel
Added new data type QASYMM8_PER_CHANNEL support for:
- CLDequantizationLayer
- NEDequantizationLayer
Added new data type QSYMM8_PER_CHANNEL support for:
- CLConvolutionLayer
- NEConvolutionLayer
- CLDepthwiseConvolutionLayer
- NEDepthwiseConvolutionLayer
Added FP16 mixed-precision support for:
- CLGEMMMatrixMultiplyReshapedKernel
- CLPoolingLayerKernel
Added FP32 and FP16 ELU activation for:
- CLActivationLayer
- NEActivationLayer
Added asymmetric padding support for:
- CLDirectDeconvolutionLayer
- CLGEMMDeconvolutionLayer
- NEDeconvolutionLayer
Added SYMMETRIC and REFLECT modes for CLPadLayerKernel / CLPadLayer.
Replaced the calls to NECopyKernel and NEMemsetKernel with NEPadLayer in NEGenerateProposalsLayer.
Replaced the calls to CLCopyKernel and CLMemsetKernel with CLPadLayer in CLGenerateProposalsLayer.
Improved performance for CL Inception V3 - FP16.
Improved accuracy for CL Inception V3 - FP16 by enabling FP32 accumulator (mixed-precision).
Improved Arm® Neon™ performance by enabling fusing batch normalization with convolution and depth-wise convolution layer.
Improved Arm® Neon™ performance for MobileNet-SSD by improving the output detection performance.
Optimized CLPadLayer.
Optimized CL generic depthwise convolution layer by introducing CLDepthwiseConvolutionLayerNativeKernel.
Reduced memory consumption by implementing weights sharing.

v19.08.1 Public maintenance release

Fix offset calculation in NEReductionOperationKernel.
Fix data layout in NEScaleKernel for nhwc.
Retain configuration step data layout to avoid side-effects.
Perform sqrt in double domain for L2 pooling.
Fix output shape calculation for Reduce Mean
Fix broadcast CLPixelwiseMultiplication with 5D tensors

v19.08 Public major release

Various bug fixes.
Various optimisations.
Deprecated Arm® Neon™ functions:
- NEDepthConcatenateLayer
- NEWidthConcatenateLayer
Deprecated OpenCL kernels / functions:
- CLDepthConcatenateLayer
- CLGEMMInterleave4x4Kernel / CLGEMMInterleave4x4
- CLGEMMTranspose1xWKernel / CLGEMMTranspose1xW
- CLWidthConcatenateLayer
New Arm® Neon™ kernels / functions:
- NEAbsLayer
- NECast
- NEElementwisePower
- NELogLayer
- NELSTMLayerQuantized
- NENegLayer
- NEPReluLayer
- NESinLayer
- NEBatchConcatenateLayerKernel
- NEDepthToSpaceLayerKernel / NEDepthToSpaceLayer
- NEDepthwiseConvolutionLayerNativeKernel
- NEGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
- NEMeanStdDevNormalizationKernel / NEMeanStdDevNormalizationLayer
- NESpaceToDepthLayerKernel / NESpaceToDepthLayer
New OpenCL kernels / functions:
- CLAbsLayer
- CLElementwisePower
- CLLogLayer
- CLLSTMLayerQuantized
- CLNegLayer
- CLPReluLayer
- CLSinLayer
- CLBatchConcatenateLayerKernel
- CLDepthToSpaceLayerKernel / CLDepthToSpaceLayer
- CLGEMMLowpMatrixMultiplyNativeKernel
- CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
- CLGEMMMatrixMultiplyNativeKernel
- CLMeanStdDevNormalizationKernel / CLMeanStdDevNormalizationLayer
- CLSpaceToDepthLayerKernel / CLSpaceToDepthLayer
New examples:
- neon_opticalflow
- cl_cache
- neon_permute
Added support for FP16 in NEDeconvolutionLayer
Added support for FP16 in CLDeconvolutionLayer
Added support for REDUCE_MIN and REDUCE_MAX in ReductionOperation
Enable the fusion of batch normalization with convolution and depthwise convolution layer for FP32 in the graph API (OpenCL only)
Added support for fusing activation function and broadcast addition with the matrix multiplication for FP32 (OpenCL only)
Re-factored the depthwise convolution layer kernel on Arm® Neon™ for generic cases
Added an optimized depthwise convolution layer kernel for 5x5 filters (Neon™ only)
Added support to enable OpenCL kernel cache. Added example showing how to load the prebuilt OpenCL kernels from a binary cache file
Altered QuantizationInfo interface to support per-channel quantization.
The CLDepthwiseConvolutionLayer3x3 will be included by CLDepthwiseConvolutionLayer to accommodate for future optimizations.
The NEDepthwiseConvolutionLayerOptimized will be included by NEDepthwiseConvolutionLayer to accommodate for future optimizations.
Removed inner_border_right and inner_border_top parameters from CLDeconvolutionLayer interface
Removed inner_border_right and inner_border_top parameters from NEDeconvolutionLayer interface
Optimized the Arm® Neon™ assembly kernel for GEMMLowp. The new implementation fuses the output stage and quantization with the matrix multiplication kernel

v19.05 Public major release

Various bug fixes.
Various optimisations.
New Arm® Neon™ kernels / functions:
- NEBatchToSpaceLayerKernel / NEBatchToSpaceLayer
- NEComplexPixelWiseMultiplicationKernel / NEComplexPixelWiseMultiplication
- NECropKernel / NECropResize
- NEDepthwiseConvolutionAssemblyDispatch
- NEFFTDigitReverseKernel
- NEFFTRadixStageKernel
- NEFFTScaleKernel
- NEGEMMLowpOffsetContributionOutputStageKernel
- NEHeightConcatenateLayerKernel
- NESpaceToBatchLayerKernel / NESpaceToBatchLayer
- NEFFT1D
- NEFFT2D
- NEFFTConvolutionLayer
New OpenCL kernels / functions:
- CLComplexPixelWiseMultiplicationKernel / CLComplexPixelWiseMultiplication
- CLCropKernel / CLCropResize
- CLDeconvolutionReshapeOutputKernel
- CLFFTDigitReverseKernel
- CLFFTRadixStageKernel
- CLFFTScaleKernel
- CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
- CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
- CLHeightConcatenateLayerKernel
- CLDirectDeconvolutionLayer
- CLFFT1D
- CLFFT2D
- CLFFTConvolutionLayer
- CLGEMMDeconvolutionLayer
New OpenGLES kernels / functions:
- GCConcatenateLayer
Deprecated functions/interfaces:
- GCDepthConcatenateLayer
- NEWidthConcatenateLayer
- NEDepthConcatenateLayer
- CLWidthConcatenateLayer
- CLDepthConcatenateLayer
- CLGEMMInterleave4x4
- CLGEMMTranspose1xW
Support different quantization info in CLConcatLayer.
Add checks for cases where different input/output quantization info is not supported, i.e. where tensors have different quantization information.
Add FP16 support checks.
Fix output quantization in CLDepthwiseConv3x3 when activation is fused.
New graph examples:
- graph_convolution
- graph_fully_connected
- graph_depthwise_convolution
- Deepspeech v0.4.1
Add support for QASYMM8 in NEArithmeticSubtractionKernel.
Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
Add support for QASYMM8 in NEDeconvolution.
Add support for DequantizationLayer for Neon/CL.
Add support for dilation in CLDepthwiseConvolution.
Fuse offset contribution with the output stage when we use NEGEMMLowpMatrixMultiplyCore.
Optimize CLDeconvolution.
Add StackLayer to the graph API.
Add support for "reflect" padding mode in NEPad.
Winograd 7x7 NHWC on OpenCL.
Rework CL ML layers to run exclusively on CL.
Support different quantization info in PoolingLayer.
Implement and test import memory interfaces.
Added new tests and removed old ones.
Various clang-tidy fixes.
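As an illustration of the "reflect" padding mode listed above for NEPad, here is a minimal 1D sketch. It assumes the common convention in which values are mirrored around the edge element without repeating it. Whether NEPad matches this in every detail is an assumption, and reflect_pad_1d is a hypothetical helper, not a library function.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: pad a 1D sequence in "reflect" mode.
// The edge element acts as the mirror axis and is not duplicated,
// so each padding amount must be smaller than the input length.
std::vector<int> reflect_pad_1d(const std::vector<int> &in, std::size_t pad_before, std::size_t pad_after)
{
    const std::size_t n = in.size();
    if(n == 0 || pad_before >= n || pad_after >= n)
    {
        throw std::invalid_argument("reflect padding must be smaller than the input length");
    }
    std::vector<int> out;
    out.reserve(pad_before + n + pad_after);
    // Leading padding: in[pad_before], ..., in[1]
    for(std::size_t i = pad_before; i > 0; --i)
    {
        out.push_back(in[i]);
    }
    // Original data
    out.insert(out.end(), in.begin(), in.end());
    // Trailing padding: in[n - 2], in[n - 3], ...
    for(std::size_t i = 0; i < pad_after; ++i)
    {
        out.push_back(in[n - 2 - i]);
    }
    return out;
}
```

For example, padding {1, 2, 3, 4} by two elements on each side gives {3, 2, 1, 2, 3, 4, 3, 2}.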

v19.02 Public major release

Various bug fixes.
Various optimisations.
New Arm® Neon™ kernels / functions:
- NETileKernel / NETile
- NEFuseBatchNormalizationKernel / NEFuseBatchNormalization
- NEElementwiseOperationKernel
- NEElementwiseMax
- NEElementwiseMin
- NEElementwiseSquaredDiff
- NESelectKernel / NESelect
- NESplit
- NESlice
- NEUnstack
- NEStridedSliceKernel / NEStridedSlice
- NEElementwiseUnaryKernel
- NERsqrtLayer
- NEExpLayer
- NEReverseKernel / NEReverse
- NEArgMinMaxLayer
- NEStackLayerKernel / NEStackLayer
- NERangeKernel / NERange
- NEPadLayer
- NEMemsetKernel
- NEGatherKernel / NEGather
- NEElementwiseComparison
- NEElementwiseComparisonStatic
- NEComparisonOperationKernel
- NEElementwiseDivision
New OpenCL kernels / functions:
- CLSelectKernel / CLSelect
- CLTileKernel / CLTile
- CLComparisonKernel / CLComparison
- CLArgMinMaxLayer
- CLElementwiseMax
- CLElementwiseMin
- CLElementwiseSquaredDiff
- CLStackLayerKernel / CLStackLayer
- CLReverse / CLReverseKernel
- CLRsqrtLayer
- CLExpLayer
- CLElementWiseUnaryLayerKernel
- CLGEMMReshapeLHSMatrixKernel
- CLGEMMReshapeRHSMatrixKernel
- CLGEMMMatrixMultiplyReshapedKernel
- CLRangeKernel / CLRange
- CLUnstack
- CLGatherKernel / CLGather
- CLGEMMLowpMatrixMultiplyReshapedKernel
New CPP kernels / functions:
- CPPDetectionOutputLayer
- CPPTopKV / CPPTopKVKernel
Added new examples:
Add 4D tensor support to:
- NESoftmaxLayer
Fused activation in CLWinogradConvolutionLayer
Extended NEPermute to support more cases
Added Neon™/SVE GEMM Hybrid kernels
Added u8 and s8 hybrid assembly kernels
Introduced GEMM strategy name in NEGEMMAssemblyWrapper
Improved CLTuner
Fused the bias addition within CLGEMM
Added support for QASYMM8 LOGISTIC activation in NEActivationLayer
Added NHWC data layout support to:
- NEScale for F16
- CLNormalizationLayer IN_MAP_2D for FP32/FP16
- NEL2NormalizeLayer for FP32/FP16
- NENormalizationLayer IN_MAP_2D for FP32/FP16
- CLROIAlignLayer
- CLGenerateProposalsLayer
Added QASYMM8 support to the following kernels:
- NEArithmeticAdditionKernel
- NEScale
Added new tests and improved validation and benchmarking suites.
Deprecated functions/interfaces:
- Usage of inner_border_right and inner_border_top has been deprecated in CLDeconvolutionLayer and NEDeconvolutionLayer

v18.11 Public major release

Various bug fixes.
Various optimisations.
New Arm® Neon™ kernels / functions:
- NEChannelShuffleLayer / NEChannelShuffleLayerKernel
- NEReduceMean
- NEReorgLayer / NEReorgLayerKernel
- NEPriorBoxLayer / NEPriorBoxLayerKernel
- NEUpsampleLayer / NEUpsampleLayerKernel
- NEYOLOLayer / NEYOLOLayerKernel
New OpenCL kernels / functions:
- CLBatchToSpaceLayer / CLBatchToSpaceLayerKernel
- CLBoundingBoxTransform / CLBoundingBoxTransformKernel
- CLComputeAllAnchorsKernel
- CLGenerateProposalsLayer
- CLNormalizePlanarYUVLayer / CLNormalizePlanarYUVLayerKernel
- CLReorgLayer / CLReorgLayerKernel
- CLSpaceToBatchLayer / CLSpaceToBatchLayerKernel
- CLPadLayer
- CLReduceMean
- CLPriorBoxLayer / CLPriorBoxLayerKernel
- CLROIAlignLayer / CLROIAlignLayerKernel
- CLSlice
- CLSplit
- CLStridedSlice / CLStridedSliceKernel
- CLUpsampleLayer / CLUpsampleLayerKernel
- CLYOLOLayer / CLYOLOLayerKernel
New CPP kernels / functions:
- CPPBoxWithNonMaximaSuppressionLimit / CPPBoxWithNonMaximaSuppressionLimitKernel
Added the validate method in:
- NEDepthConvertLayer
- NEFloor / CLFloor
- NEGEMMMatrixAdditionKernel
- NEReshapeLayer / CLReshapeLayer
- CLScale
Added new examples:
- graph_shufflenet.cpp
- graph_yolov3.cpp
Added documentation on how to add a new function or kernel.
Improved doxygen documentation by adding a list of the existing functions.
Add 4D tensor support to:
- CLWidthConcatenateLayer
- CLFlattenLayer
- CLSoftmaxLayer
Add dot product support for CLDepthwiseConvolutionLayer3x3NHWCKernel non-unit stride
Add SVE support
Fused batch normalization into convolution layer weights in CLFuseBatchNormalization
Fused activation in CLDepthwiseConvolutionLayer3x3NCHWKernel, CLDepthwiseConvolutionLayer3x3NHWCKernel and NEGEMMConvolutionLayer
Added NHWC data layout support to:
Added QASYMM8 support to the following kernels:
- CLScaleKernel
- NEDepthwiseConvolutionLayer3x3Kernel
- CLPixelWiseMultiplicationKernel
Added FP16 support to the following kernels:
- CLDepthwiseConvolutionLayer3x3NHWCKernel
- NEDepthwiseConvolutionLayer3x3Kernel
- CLNormalizePlanarYUVLayerKernel
- CLWinogradConvolutionLayer (5x5 kernel)
More tests added to both validation and benchmarking suites.

v18.08 Public major release

Various bug fixes.
Various optimisations.
Updated recommended NDK version to r17b.
Removed support for QS8/QS16 data types.
Added support for grouped convolution in CLConvolutionLayer.
Added NHWC data layout support to:
- NEDepthConcatenateLayer / CLDepthConcatenateLayer
- NEWinogradConvolutionLayer / CLWinogradConvolutionLayer
- CLDepthwiseConvolutionLayer
- CLDirectConvolutionLayer
- CLConvolutionLayer
- CLScale
- CLIm2ColKernel
New Arm® Neon™ kernels / functions:
- NERNNLayer
New OpenCL kernels / functions:
- CLArithmeticDivision
Introduced prepare() stage support in the graph API for GLES.
Added support for memory reuse when trying to allocate smaller CLTensors.
Enabled NHWC execution on graph examples.
Added JPEG accessor for validation purposes.
Added validate methods to some kernels / functions.

v18.05 Public major release

Various bug fixes.
Various optimisations.
Major redesign in the interface for the Neon™ kernels implemented in assembly.
Removed arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore / arm_compute::NEHGEMMAArch64FP16Kernel
Added NEGEMMAssemblyWrapper and AssemblyKernelGlue which are used to execute assembly kernels in Neon™ functions.
Minor changes to the CPUInfo type to make it compatible with the new assembly gemm interface.
Moved Neon™ assembly kernels to the folder src/core/Neon/kernels/arm_gemm.
Improved doxygen documentation.
Improved memory management for layer transitions.
Added support for NHWC data layout in tensors.
Added NHWC data layout support to:
- NEGEMMConvolutionLayer
- NEDirectConvolutionLayer
- NEPoolingLayer / CLPoolingLayer
- NEBatchNormalizationLayer / CLBatchNormalizationLayer
- NEDepthwiseConvolutionLayer
- NEScale
- NEIm2Col
Added support for dilated convolutions in NEConvolutionLayer and CLConvolutionLayer.
New OpenCL kernels / functions:
- CLChannelShuffleLayer / CLChannelShuffleLayerKernel
- CLConvertFullyConnectedWeightsKernel / CLConvertFullyConnectedWeights
- CLCopy / CLCopyKernel
- CLLSTMLayer
- CLRNNLayer
- CLWidthConcatenateLayer / CLWidthConcatenateLayerKernel
- CLWinogradFilterTransformKernel / CLWinogradConvolutionLayer
- CLWinogradInputTransformKernel / CLWinogradInputTransform
New Arm® Neon™ kernels / functions:
- NEConvertFullyConnectedWeightsKernel / NEConvertFullyConnectedWeights.
Created the validate method in CLDepthwiseConvolutionLayer.
Beta and gamma are no longer mandatory arguments in NEBatchNormalizationLayer and CLBatchNormalizationLayer.
Added depth multiplier support in NEDepthwiseConvolutionLayer and CLDepthwiseConvolutionLayer.
Added broadcast multiply support in NEPixelWiseMultiplication / NEPixelWiseMultiplicationKernel.
Ported mobilenet example to NHWC data layout.
Enabled Winograd method in CLConvolutionLayer.
Renamed NEWinogradLayer to NEWinogradConvolutionLayer.
Updated NEWinogradConvolutionLayer to use highly optimised assembly kernels in src/core/Neon/kernels/arm_gemm.
Added memory manager support in GLES functions.
Major refactoring of the graph API.
Added GLES backend in the graph API.
Added support for the memory manager in the graph API.
Enabled Winograd Convolution method in the graph API.
Added support for grouped convolutions in the graph API.
Replaced NEDeconvolutionLayerUpsampleKernel with NEScaleKernel in NEDeconvolutionLayer.
Added fast maths flag in CLConvolutionLayer.
Added new tests and benchmarks in validation and benchmark frameworks
Merged Activation layer with Convolution Layer (Neon™, CL, GLES)
Added support for OpenCL 2.0 SVM
Added support to import memory in OpenCL tensors.
Added the prepare() method to perform any one-off pre-processing before running the function.
Added new examples:
- graph_inception_v4.cpp
- graph_resnext50.cpp
Added memory measurement instrument for CL.
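To make the dilated convolution entry above concrete, the following sketch computes the output size along one dimension using the standard formula, where the effective kernel extent is dilation * (kernel - 1) + 1. It assumes symmetric padding and floor rounding. The function dilated_conv_output_size is hypothetical and does not reproduce the library's own shape calculation.

```cpp
#include <stdexcept>

// Hypothetical helper: output size along one dimension of a dilated convolution.
// Assumes the same padding on both sides and floor rounding of the division.
int dilated_conv_output_size(int input, int kernel, int stride, int pad, int dilation)
{
    if(input <= 0 || kernel <= 0 || stride <= 0 || pad < 0 || dilation <= 0)
    {
        throw std::invalid_argument("invalid convolution parameters");
    }
    // A dilated kernel spans more input elements than its tap count.
    const int effective_kernel = dilation * (kernel - 1) + 1;
    const int padded_input     = input + 2 * pad;
    if(padded_input < effective_kernel)
    {
        throw std::invalid_argument("effective kernel does not fit in the padded input");
    }
    return (padded_input - effective_kernel) / stride + 1;
}
```

For instance, a 3-tap kernel with dilation 2 covers 5 input elements, so an unpadded input of size 7 with stride 1 yields 3 outputs.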

v18.03 Public maintenance release

Various bug fixes.
Fixed bug in NEActivationLayer
Fix in CLTuner when using batches.
Updated recommended NDK version to r16b (And fixed warnings).
Fixed bug in validation code.
Added Inception v4 graph example.
Renamed NEWinogradLayer.cpp to NEWinogradConvolutionLayer.cpp

v18.02 Public major release

Various Arm® Neon™ / OpenCL / GLES optimisations.
Various bug fixes.
Changed default number of threads on big.LITTLE systems.
Refactored examples and added:
- graph_mobilenet_qasymm8
- graph_resnet
- graph_squeezenet_v1_1
Renamed CLConvolutionLayer into CLGEMMConvolutionLayer and created a new CLConvolutionLayer to select the fastest convolution method.
Renamed NEConvolutionLayer into NEGEMMConvolutionLayer and created a new NEConvolutionLayer to select the fastest convolution method.
Added in place support to:
- CLActivationLayer
- CLBatchNormalizationLayer
Added QASYMM8 support to:
Added FP16 support to:
- CLDepthwiseConvolutionLayer3x3
- CLDepthwiseConvolutionLayer
Added broadcasting support to NEArithmeticAddition / CLArithmeticAddition / CLPixelWiseMultiplication
Added fused batch normalization and activation to CLBatchNormalizationLayer and NEBatchNormalizationLayer
Added support for non-square pooling to NEPoolingLayer and CLPoolingLayer
New OpenCL kernels / functions:
- CLDirectConvolutionLayerOutputStageKernel
New Arm® Neon™ kernels / functions:
- Added name() method to all kernels.
- Added support for Winograd 5x5.
- NEPermuteKernel / NEPermute
- CpuWinogradConv2dTransformInputKernel / NEWinogradLayer
- CpuWinogradConv2dTransformOutputKernel / NEWinogradLayer
- CpuWinogradConv2dTransformWeightsKernel / NEWinogradLayer
- Renamed NEWinogradLayerKernel into NEWinogradLayerBatchedGEMMKernel
New GLES kernels / functions:
- GCTensorShiftKernel / GCTensorShift

v18.01 Public maintenance release

Various bug fixes
Added some of the missing validate() methods
Added CLDeconvolutionLayerUpsampleKernel / CLDeconvolutionLayer CLDeconvolutionLayerUpsample
Added CLPermuteKernel / CLPermute
Added method to clean the programs cache in the CL Kernel library.
Added GCArithmeticAdditionKernel / GCArithmeticAddition
Added GCDepthwiseConvolutionLayer3x3Kernel / GCDepthwiseConvolutionLayer3x3
Added GCNormalizePlanarYUVLayerKernel / GCNormalizePlanarYUVLayer
Added GCScaleKernel / GCScale
Added GCWeightsReshapeKernel / GCConvolutionLayer
Added FP16 support to the following GLES compute kernels:
- GCCol2ImKernel
- GCGEMMInterleave4x4Kernel
- GCGEMMTranspose1xWKernel
- GCIm2ColKernel
Refactored Arm® Neon™ Winograd (NEWinogradLayerKernel)
Added NEDirectConvolutionLayerOutputStageKernel
Added QASYMM8 support to the following Arm® Neon™ kernels:
- NEDepthwiseConvolutionLayer3x3Kernel
- NEFillBorderKernel
- NEPoolingLayerKernel
Added new examples:
- graph_cl_mobilenet_qasymm8.cpp
- graph_inception_v3.cpp
- gc_dc.cpp
More tests added to both validation and benchmarking suites.

v17.12 Public major release

Most machine learning functions on OpenCL support the new data type QASYMM8
Introduced logging interface
Introduced OpenCL timer
Reworked GEMMLowp interface
Added new Arm® Neon™ assembly kernels for GEMMLowp, SGEMM and HGEMM
Added validation method for most Machine Learning kernels / functions
Added new graph examples such as googlenet, mobilenet, squeezenet, vgg16 and vgg19
Added sgemm example for OpenCL
Added absolute difference example for GLES compute
Added new tests and benchmarks in validation and benchmark frameworks
Added new kernels / functions for GLES compute
New OpenGL ES kernels / functions:
- GCAbsoluteDifferenceKernel / GCAbsoluteDifference
- GCActivationLayerKernel / GCActivationLayer
- GCBatchNormalizationLayerKernel / GCBatchNormalizationLayer
- GCCol2ImKernel
- GCDepthConcatenateLayerKernel / GCDepthConcatenateLayer
- GCDirectConvolutionLayerKernel / GCDirectConvolutionLayer
- GCDropoutLayerKernel / GCDropoutLayer
- GCFillBorderKernel / GCFillBorder
- GCGEMMInterleave4x4Kernel / GCGEMMInterleave4x4
- GCGEMMMatrixAccumulateBiasesKernel / GCGEMMMatrixAdditionKernel / GCGEMMMatrixMultiplyKernel / GCGEMM
- GCGEMMTranspose1xWKernel / GCGEMMTranspose1xW
- GCIm2ColKernel
- GCNormalizationLayerKernel / GCNormalizationLayer
- GCPixelWiseMultiplicationKernel / GCPixelWiseMultiplication
- GCPoolingLayerKernel / GCPoolingLayer
- GCLogits1DMaxKernel / GCLogits1DShiftExpSumKernel / GCLogits1DNormKernel / GCSoftmaxLayer
- GCTransposeKernel / GCTranspose
New Arm® Neon™ kernels / functions:
- arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore
- arm_compute::NEHGEMMAArch64FP16Kernel
- NEDepthwiseConvolutionLayer3x3Kernel / NEDepthwiseIm2ColKernel / NEGEMMMatrixVectorMultiplyKernel / NEDepthwiseVectorToTensorKernel / NEDepthwiseConvolutionLayer
- NEGEMMLowpOffsetContributionKernel / NEGEMMLowpMatrixAReductionKernel / NEGEMMLowpMatrixBReductionKernel / NEGEMMLowpMatrixMultiplyCore
- NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
- NEWinogradLayer / NEWinogradLayerKernel
New OpenCL kernels / functions:
- CLGEMMLowpOffsetContributionKernel / CLGEMMLowpMatrixAReductionKernel / CLGEMMLowpMatrixBReductionKernel / CLGEMMLowpMatrixMultiplyCore
- CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
New graph nodes for Arm® Neon™ and OpenCL:
- graph::BranchLayer
- graph::DepthConvertLayer
- graph::DepthwiseConvolutionLayer
- graph::DequantizationLayer
- graph::FlattenLayer
- graph::QuantizationLayer
- graph::ReshapeLayer

v17.10 Public maintenance release

Bug fixes:
- Check the maximum local workgroup size supported by OpenCL devices
- Minor documentation updates (Fixed instructions to build the examples)
- Introduced a graph::GraphContext
- Added a few new Graph nodes, support for branches and grouping.
- Automatically enable cl_printf in debug builds
- Fixed bare metal builds for armv7a
- Added AlexNet and cartoon effect examples
- Fixed library builds: libraries are no longer built as supersets of each other. (This means applications using the runtime part of the library now need to link against both libarm_compute_core and libarm_compute.)

v17.09 Public major release

Experimental Graph support: initial implementation of a simple stream API to easily chain machine learning layers.
Memory Manager (BlobLifetimeManager, BlobMemoryPool, ILifetimeManager, IMemoryGroup, IMemoryManager, IMemoryPool, IPoolManager, MemoryManagerOnDemand, PoolManager)
New validation and benchmark frameworks (Boost and Google frameworks replaced by homemade framework).
Most machine learning functions support both fixed point 8 and 16 bit (QS8, QS16) for both Arm® Neon™ and OpenCL.
New Arm® Neon™ kernels / functions:
- arm_compute::NEGEMMAssemblyBaseKernel arm_compute::NEGEMMAArch64Kernel
- NEDequantizationLayerKernel / NEDequantizationLayer
- NEFloorKernel / NEFloor
- NEL2NormalizeLayerKernel / NEL2NormalizeLayer
- NEQuantizationLayerKernel NEMinMaxLayerKernel / NEQuantizationLayer
- NEROIPoolingLayerKernel / NEROIPoolingLayer
- NEReductionOperationKernel / NEReductionOperation
- NEReshapeLayerKernel / NEReshapeLayer
New OpenCL kernels / functions:
- CLDepthwiseConvolutionLayer3x3NCHWKernel CLDepthwiseConvolutionLayer3x3NHWCKernel CLDepthwiseIm2ColKernel CLDepthwiseVectorToTensorKernel CLDepthwiseWeightsReshapeKernel / CLDepthwiseConvolutionLayer3x3 CLDepthwiseConvolutionLayer CLDepthwiseSeparableConvolutionLayer
- CLDequantizationLayerKernel / CLDequantizationLayer
- CLDirectConvolutionLayerKernel / CLDirectConvolutionLayer
- CLFlattenLayer
- CLFloorKernel / CLFloor
- CLGEMMTranspose1xW
- CLGEMMMatrixVectorMultiplyKernel
- CLL2NormalizeLayerKernel / CLL2NormalizeLayer
- CLQuantizationLayerKernel CLMinMaxLayerKernel / CLQuantizationLayer
- CLROIPoolingLayerKernel / CLROIPoolingLayer
- CLReductionOperationKernel / CLReductionOperation
- CLReshapeLayerKernel / CLReshapeLayer

v17.06 Public major release

Various bug fixes
Added support for fixed point 8 bit (QS8) to the various Arm® Neon™ machine learning kernels.
Added unit tests and benchmarks (AlexNet, LeNet)
Added support for sub tensors.
Added infrastructure to provide GPU specific optimisation for some OpenCL kernels.
Added OMPScheduler (OpenMP) scheduler for Arm® Neon™
Added SingleThreadScheduler scheduler for Arm® Neon™ (for bare metal)
Users can specify their own scheduler by implementing the IScheduler interface.
New OpenCL kernels / functions:
- CLBatchNormalizationLayerKernel / CLBatchNormalizationLayer
- CLDepthConcatenateLayerKernel / CLDepthConcatenateLayer
- CLHOGOrientationBinningKernel CLHOGBlockNormalizationKernel, CLHOGDetectorKernel / CLHOGDescriptor CLHOGDetector CLHOGGradient CLHOGMultiDetection
- CLLocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedLayer
- CLWeightsReshapeKernel / CLConvolutionLayerReshapeWeights
New C++ kernels:
- CPPDetectionWindowNonMaximaSuppressionKernel
New Arm® Neon™ kernels / functions:
- NEBatchNormalizationLayerKernel / NEBatchNormalizationLayer
- NEDepthConcatenateLayerKernel / NEDepthConcatenateLayer
- NEDirectConvolutionLayerKernel / NEDirectConvolutionLayer
- NELocallyConnectedMatrixMultiplyKernel / NELocallyConnectedLayer
- NEWeightsReshapeKernel / NEConvolutionLayerReshapeWeights

v17.05 Public bug fixes release

Various bug fixes
Ported the remaining functions to use accurate padding.
The library no longer links against OpenCL (it uses dlopen / dlsym at runtime instead to determine whether or not OpenCL is available).
Added "free" method to allocator.
Minimum version of g++ required for armv7 Linux changed from 4.8 to 4.9

v17.04 Public bug fixes release

The following functions have been ported to use the new accurate padding:

- CLColorConvertKernel
- CLEdgeNonMaxSuppressionKernel
- CLEdgeTraceKernel
- CLGaussianPyramidHorKernel
- CLGaussianPyramidVertKernel
- CLGradientKernel
- NEChannelCombineKernel
- NEFillArrayKernel
- NEGaussianPyramidHorKernel
- NEGaussianPyramidVertKernel
- NEHarrisScoreFP16Kernel
- NEHarrisScoreKernel
- NEHOGDetectorKernel
- NELogits1DMaxKernel
- NELogits1DShiftExpSumKernel
- NELogits1DNormKernel
- NENonMaximaSuppression3x3FP16Kernel
- NENonMaximaSuppression3x3Kernel

v17.03.1 First major public release of the sources

Renamed the library to arm_compute
New CPP target introduced for C++ kernels shared between Arm® Neon™ and CL functions.
New padding calculation interface introduced and ported most kernels / functions to use it.
New OpenCL kernels / functions:
- CLGEMMLowpMatrixMultiplyKernel / CLGEMMLowp
New Arm® Neon™ kernels / functions:
- NENormalizationLayerKernel / NENormalizationLayer
- NETransposeKernel / NETranspose
- NELogits1DMaxKernel, NELogits1DShiftExpSumKernel, NELogits1DNormKernel / NESoftmaxLayer
- NEIm2ColKernel, NECol2ImKernel, NEConvolutionLayerWeightsReshapeKernel / NEConvolutionLayer
- NEGEMMMatrixAccumulateBiasesKernel / NEFullyConnectedLayer
- NEGEMMLowpMatrixMultiplyKernel / NEGEMMLowp

v17.03 Sources preview

New OpenCL kernels / functions:
- CLGradientKernel, CLEdgeNonMaxSuppressionKernel, CLEdgeTraceKernel / CLCannyEdge
- GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / CLGEMM
- CLGEMMMatrixAccumulateBiasesKernel / CLFullyConnectedLayer
- CLTransposeKernel / CLTranspose
- CLLKTrackerInitKernel, CLLKTrackerStage0Kernel, CLLKTrackerStage1Kernel, CLLKTrackerFinalizeKernel / CLOpticalFlow
- CLNormalizationLayerKernel / CLNormalizationLayer
- CLLaplacianPyramid, CLLaplacianReconstruct
New Arm® Neon™ kernels / functions:
- NEActivationLayerKernel / NEActivationLayer
- GEMM refactoring + FP16 support (Requires armv8.2 CPU): NEGEMMInterleave4x4Kernel, NEGEMMTranspose1xWKernel, NEGEMMMatrixMultiplyKernel, NEGEMMMatrixAdditionKernel / NEGEMM
- NEPoolingLayerKernel / NEPoolingLayer

v17.02.1 Sources preview

New OpenCL kernels / functions:
- CLLogits1DMaxKernel, CLLogits1DShiftExpSumKernel, CLLogits1DNormKernel / CLSoftmaxLayer
- CLPoolingLayerKernel / CLPoolingLayer
- CLIm2ColKernel, CLCol2ImKernel, CLConvolutionLayerWeightsReshapeKernel / CLConvolutionLayer
- CLRemapKernel / CLRemap
- CLGaussianPyramidHorKernel, CLGaussianPyramidVertKernel / CLGaussianPyramid, CLGaussianPyramidHalf, CLGaussianPyramidOrb
- CLMinMaxKernel, CLMinMaxLocationKernel / CLMinMaxLocation
- CLNonLinearFilterKernel / CLNonLinearFilter
New Arm® Neon™ FP16 kernels (Requires armv8.2 CPU):
- NEAccumulateWeightedFP16Kernel
- NEBox3x3FP16Kernel
- NENonMaximaSuppression3x3FP16Kernel

v17.02 Sources preview

New OpenCL kernels / functions:
- CLActivationLayerKernel / CLActivationLayer
- CLChannelCombineKernel / CLChannelCombine
- CLDerivativeKernel / CLChannelExtract
- CLFastCornersKernel / CLFastCorners
- CLMeanStdDevKernel / CLMeanStdDev
New Arm® Neon™ kernels / functions:
- HOG / SVM: NEHOGOrientationBinningKernel, NEHOGBlockNormalizationKernel, NEHOGDetectorKernel, NEHOGNonMaximaSuppressionKernel / NEHOGDescriptor, NEHOGDetector, NEHOGGradient, NEHOGMultiDetection
- NENonLinearFilterKernel / NENonLinearFilter
Introduced a CLScheduler to manage the default context and command queue used by the runtime library and create synchronisation events.
Switched all the kernels / functions to use tensors instead of images.
Updated documentation to include instructions to build the library from sources.

v16.12 Binary preview release

Original release
