Release versions
All releases are numbered vYY.MM, where YY is the last two digits of the year and MM is the month number. If there is more than one release in a month, an extra sequential number is appended at the end:
v17.03 (First release of March 2017)
v17.03.1 (Second release of March 2017)
v17.04 (First release of April 2017)
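The numbering scheme above can be checked mechanically. The following is a minimal sketch, not part of the library: the names `ReleaseVersion` and `parse_release` are hypothetical, and it assumes every release falls in the years 2000 to 2099.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    year: int      # four-digit year (assumption: releases are in 20YY)
    month: int     # month number, 1..12
    sequence: int  # 0 for the first release of the month, 1 for the second, ...


# vYY.MM with an optional trailing ".N" sequence number
_VERSION_RE = re.compile(r"v(\d{2})\.(\d{2})(?:\.(\d+))?")


def parse_release(tag: str) -> ReleaseVersion:
    """Parse a vYY.MM[.N] tag into a tuple that sorts chronologically."""
    match = _VERSION_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a vYY.MM[.N] release tag: {tag!r}")
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {tag!r}")
    sequence = int(match.group(3)) if match.group(3) is not None else 0
    return ReleaseVersion(2000 + int(match.group(1)), month, sequence)


# Sorting by the parsed tuple orders tags by year, month, then sequence.
tags = ["v17.04", "v17.03.1", "v17.03"]
ordered = sorted(tags, key=parse_release)
```

Because the parsed value is a plain tuple, comparing two releases needs no extra code.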
- Note
- We're aiming at releasing one major public release with new features per quarter. All releases in between will only contain bug fixes.
Starting from release 22.05, the 'master' branch is no longer used; it has been replaced by 'main'. Please update your clone jobs accordingly.
Changelog
v22.11 Public major release
- New features:
- Add new experimental dynamic fusion API.
- Add CPU batch matrix multiplication with adj_x = false and adj_y = false for FP32.
- Add CPU MeanStdDevNorm for QASYMM8.
- Add CPU and GPU GELU activation function for FP32 and FP16.
- Add CPU swish activation function for FP32 and FP16.
- Performance optimizations:
- Optimize CPU bilinear scale for FP32, FP16, QASYMM8, QASYMM8_SIGNED, U8 and S8.
- Optimize CPU activation functions using LUT-based implementation:
- Sigmoid function for QASYMM8 and QASYMM8_SIGNED.
- Hard swish function for QASYMM8_SIGNED.
- Optimize CPU addition for QASYMM8 and QASYMM8_SIGNED using fixed-point arithmetic.
- Optimize CPU multiplication, subtraction and activation layers by considering tensors as 1D.
- Optimize GPU depthwise convolution kernel and heuristic.
- Optimize GPU Conv2d heuristic.
- Optimize CPU MeanStdDevNorm for FP16.
- Optimize CPU tanh activation function for FP16 using rational approximation.
- Improve GPU GeMMLowp start-up time.
- Various optimizations and bug fixes.
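As an illustration of the LUT-based activation items above: an 8-bit quantized input has only 256 possible values, so the activation can be precomputed once and then applied with a single table lookup per element. The sketch below is a generic illustration of that idea, not the library's kernel; the function names and the scale/offset values are assumptions chosen for the example.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable real-valued sigmoid."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def build_sigmoid_lut(in_scale, in_offset, out_scale, out_offset):
    """Precompute the quantized sigmoid for all 256 unsigned 8-bit inputs."""
    lut = []
    for q in range(256):
        x = in_scale * (q - in_offset)               # dequantize the input
        y = _sigmoid(x)                              # evaluate in real arithmetic
        q_out = round(y / out_scale) + out_offset    # requantize the result
        lut.append(min(255, max(0, q_out)))          # saturate to [0, 255]
    return lut


def apply_lut(lut, values):
    """Apply the activation: one table lookup per element."""
    return [lut[v] for v in values]


# Hypothetical quantization parameters, chosen only for this example.
lut = build_sigmoid_lut(in_scale=0.1, in_offset=128, out_scale=1.0 / 256, out_offset=0)
result = apply_lut(lut, [0, 128, 255])
```

The cost of the transcendental function is paid 256 times at setup, regardless of tensor size.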
v22.08 Public major release
- Various bug fixes.
- Disable unsafe FP optimizations causing accuracy issues in:
- Add Dynamic Fusion of Elementwise Operators: Div, Floor, Add.
- Optimize the gemm_reshaped_rhs_nly_nt OpenCL kernel using the arm_matrix_multiply extension available for Arm® Mali™-G715 and Arm® Mali™-G615.
- Add support for the arm_matrix_multiply extension in the gemmlowp_mm_reshaped_only_rhs_t OpenCL kernel.
- Expand GPUTarget list with missing Mali™ GPUs product names: G57, G68, G78AE, G610, G510, G310.
- Extend the direct convolution 2d interface to configure the block size.
- Update ClConv2D heuristic to use direct convolution.
- Use official Khronos® OpenCL extensions:
- Add cl_khr_integer_dot_product extension support.
- Add support for OpenCL 3.0 non-uniform workgroups.
- CPU performance optimizations:
- Add LUT-based implementation of Hard Swish and Leaky ReLU activation functions for aarch64 builds.
- Optimize Add layer by considering the input tensors as 1D array.
- Add fixed-format BF16, FP16 and FP32 Neon™ GEMM kernels to support variable weights.
- Add new winograd convolution kernels implementation and update the ACL CpuWinogradConv2d operator.
- Add experimental support for native builds for Windows on Arm®.
- Build flag interpretation change: arch=armv8.6-a now translates to the -march=armv8.6-a CXX flag instead of -march=armv8.2-a plus explicit selection of feature extensions.
- Build flag change: toolchain_prefix, compiler_prefix:
- Use empty string "" to suppress any prefixes.
- Use "auto" to use default (auto) prefixes chosen by the build script. This is the default behavior when unspecified.
- Any other string will be used as custom prefixes to the compiler and the rest of toolchain tools.
- The default behaviour when prefix is unspecified does not change, but its signifier has been changed from empty string "" to "auto".
- armv7a with Android build will no longer be tested or maintained.
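The three cases listed for toolchain_prefix and compiler_prefix above can be summarised in a few lines. This is a sketch of the documented rules only, not the build script's code: `resolve_prefix` and `tool_name` are hypothetical helpers, and the default prefix is passed in by the caller because the notes do not say which prefix "auto" selects.

```python
def resolve_prefix(option_value: str, default_prefix: str) -> str:
    """Map a prefix option value to the prefix that is actually applied.

    "auto"       -> the default prefix chosen by the build script
    ""           -> no prefix at all
    other string -> used verbatim as a custom prefix
    """
    if option_value == "auto":
        return default_prefix
    return option_value


def tool_name(tool: str, default_prefix: str, option_value: str = "auto") -> str:
    """Build the tool name; leaving the option unspecified behaves like "auto"."""
    return resolve_prefix(option_value, default_prefix) + tool


# "aarch64-linux-gnu-" stands in for whatever default the build script picks;
# it is an illustrative value, not taken from the release notes.
default = "aarch64-linux-gnu-"
unspecified = tool_name("g++", default)
suppressed = tool_name("g++", default, "")
custom = tool_name("ar", default, "my-toolchain-")
```

Note that an empty string and an unspecified option are no longer equivalent: only "auto" (or omitting the flag) selects the default.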
v22.05 Public major release
- Various bug fixes.
- Various optimizations.
- Add support for NDK r23b.
- Inclusive language adjustment. Please refer to Inclusive language guideline for details.
- New Arm® Neon™ kernels / functions :
- New OpenCL kernels / functions :
- Improve the start-up times for the following OpenCL kernels:
- Decouple the implementation of the following Cpu kernels into various data types (fp32, fp16, int):
v22.02 Public major release
- Various bug fixes.
- Various optimizations.
- Update A510 arm_gemm CPU kernels.
- Inclusive language adjustment. Please refer to Inclusive language guideline for details.
- Improve the start-up time for the following OpenCL kernels:
- Remove functions:
- Remove padding from OpenCL kernels:
- Remove padding from Cpu kernels:
- Decouple the implementation of the following Cpu kernels into various data types (fp32, fp16, int):
v21.11 Public major release
- Various bug fixes.
- Various optimizations:
- Improve performance of bilinear and nearest neighbor Scale on both CPU and GPU for FP32, FP16, Int8, Uint8 data types
- Improve performance of Softmax on GPU for Uint8/Int8
- New OpenCL kernels / functions:
- New Arm® Neon™ kernels / functions:
- Support configurable builds from a selected subset of the operator list
- Support MobileBert on Neon™ backend
- Improve operator/function logging
- Remove padding from OpenCL kernels:
- ClPool2dKernel
- ClScaleKernel
- ClGemmMatrixMultiplyReshapedKernel
- Remove padding from Cpu kernels:
- Remove Y padding from OpenCL kernels:
- ClGemmMatrixMultiplyKernel
- ClGemmReshapedRHSMatrixKernel
- Remove legacy GeMM kernels in gemm_v1.cl
v21.08 Public major release
- Various bug fixes.
- Various optimizations:
- Improve LWS (Local-Workgroup-Size) heuristic in OpenCL for GeMM, Direct Convolution and Winograd Transformations when OpenCL tuner is not used
- Improve QASYMM8/QSYMM8 performance on OpenCL for various Arm® Mali™ GPU architectures
- Add dynamic weights support in Fully connected layer (CPU/GPU)
- Various performance optimizations for floating-point data types (CPU/GPU)
- Add a reduced core library build arm_compute_core_v2
- Expose Operator API
- Support fat binary build for armv8.2-a via fat_binary build flag
- Add CPU discovery capabilities
- Add data type f16 support for:
- Port the following functions to stateless API:
- Remove the following functions:
- Remove CLCoreRuntimeContext
- Remove ICPPSimpleKernel
- Rename file arm_compute/runtime/CL/functions/CLElementWiseUnaryLayer.h to arm_compute/runtime/CL/functions/CLElementwiseUnaryLayer.h
v21.05 Public major release
- Various bug fixes.
- Various optimisations.
- Various documentation updates:
- Add supported operators and corresponding Android NNAPI operators.
- Documentation reorg into user guide and contributor guide.
- Add support for a global allocator for OpenCL tensors
- Add experimental support for CLVK.
- Add data type S32 support for:
- Add data type QASYMM8 support for:
- Add per-channel quantization support for:
- Remove padding from OpenCL kernels:
- Remove computer vision support from Arm® Neon™ backend
- Remove the following functions:
- NEAbsoluteDifference
- NEAccumulate
- NEBox3x3
- NECannyEdge
- NEChannelCombine
- NEChannelExtract
- NEColorConvert
- NEConvolution
- NEDerivative
- NEDilate
- NEEqualizeHistogram
- NEErode
- NEFastCorners
- NEGaussian3x3
- NEGaussian5x5
- NEGaussianPyramid
- NEHOGDescriptor
- NEHOGDetector
- NEHOGGradient
- NEHOGMultiDetection
- NEHarrisCorners
- NEHistogram
- NEIntegralImage
- NELaplacianPyramid
- NELaplacianReconstruct
- NEMagnitude
- NEMeanStdDev
- NEMedian3x3
- NEMinMaxLocation
- NENonLinearFilter
- NEOpticalFlow
- NEPhase
- NEScharr3x3
- NESobel3x3
- NESobel5x5
- NESobel7x7
- NETableLookup
- NEThreshold
- NEWarpAffine
- NEWarpPerspectiveKernel
- Remove all GLES kernels / functions / tests / examples
- Remove computer vision support from CL backend
- Remove the following functions:
- CLAbsoluteDifference
- CLAccumulate
- CLBox3x3
- CLCannyEdge
- CLChannelCombine
- CLChannelExtract
- CLColorConvert
- CLConvolution
- CLDerivative
- CLDilate
- CLEqualizeHistogram
- CLErode
- CLFastCorners
- CLGaussian3x3
- CLGaussian5x5
- CLGaussianPyramid
- CLHOGDescriptor
- CLHOGDetector
- CLHOGGradient
- CLHOGMultiDetection
- CLHarrisCorners
- CLHistogram
- CLIntegralImage
- CLLaplacianPyramid
- CLLaplacianReconstruct
- CLMagnitude
- CLMeanStdDev
- CLMedian3x3
- CLMinMaxLocation
- CLNonLinearFilter
- CLOpticalFlow
- CLPhase
- CLScharr3x3
- CLSobel3x3
- CLSobel5x5
- CLSobel7x7
- CLTableLookup
- CLThreshold
- CLWarpAffine
- CLWarpPerspective
v21.02 Public major release
- Various bug fixes.
- Various optimisations.
- Upgrade C++ standard to C++14
- Add macOS support
- Add Armv8-R AArch64 architecture support
- Add SVE/SVE2 support for:
- Remove padding from OpenCL kernels:
- Deprecate functions in CLTuner:
- add_lws_to_table
- import_lws_table
- lws_table
- Remove functions:
- NELocallyConnectedLayer / CLLocallyConnectedLayer
- NEIm2Col
- NECol2Im
- NEGEMMInterleave4x4
- NEGEMMTranspose1xW
- NEComputeAllAnchors / CLComputeAllAnchors
- NEGEMMAssemblyDispatch
- NEUpsampleLayer / CLUpsampleLayer
- Remove kernels:
- NEGEMMMatrixVectorMultiplyKernel
- NELocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedMatrixMultiplyKernel
- NEUpsampleLayerKernel / CLUpsampleLayerKernel
- Extend OpenCL tuner with workgroup batch size support
- Experimental extension for the OpenCL tuner to tune the batches of work groups distributed to compute units
- Add functionality to load the OpenCL GEMM heuristics at runtime
- The GEMM heuristic file (MLGO) can be used to update the default GEMM heuristics available for OpenCL
- Note: there might be performance regressions against v20.08 in Inception v3 using int8 data types on Arm Mali-G77 GPUs. Currently under investigation
- Note: data-type decoupling is in progress and experimental. Warnings about unused symbols might be raised
v20.11 Public major release
- Various bug fixes.
- Various optimisations.
- Performance regressions can be noted when executing Depthwise Convolution on Arm® Neon™ with a depth multiplier > 1 for quantized data types. This is planned to be resolved in the 21.02 release.
- Added new data type QASYMM8_SIGNED support for NEROIAlignLayer.
- Added new data type S32 support for:
- Interface change
- Properly support softmax axis to have the same meaning as other major frameworks. That is, axis now defines the dimension on which Softmax/Logsoftmax is performed. E.g. for input of shape 4x5x6 and axis=1, softmax will be applied to 4x6=24 vectors of size 5. The supported value range of axis is [-rank, rank). This change applies to the following functions:
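The axis semantics described above can be illustrated with a small reference computation. This is a pure-Python sketch for checking the vector count and the accepted axis range, not the library's implementation; the function names and the row-major flat-list layout are assumptions of the example.

```python
import math
from itertools import product


def normalize_axis(axis: int, rank: int) -> int:
    """Accept axis in [-rank, rank) and map negative values to 0..rank-1."""
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} outside [-{rank}, {rank})")
    return axis + rank if axis < 0 else axis


def softmax_along_axis(values, shape, axis):
    """Softmax over one dimension of a row-major flat list.

    Returns the result and the number of vectors that were normalised.
    """
    rank = len(shape)
    axis = normalize_axis(axis, rank)
    strides = [1] * rank
    for d in range(rank - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    out = list(values)
    vectors = 0
    # Iterate over every index combination of the dimensions other than axis.
    other_dims = [range(n) for d, n in enumerate(shape) if d != axis]
    for idx in product(*other_dims):
        full = list(idx[:axis]) + [0] + list(idx[axis:])
        base = sum(i * s for i, s in zip(full, strides))
        offsets = [base + k * strides[axis] for k in range(shape[axis])]
        peak = max(values[o] for o in offsets)  # subtract the max for stability
        exps = [math.exp(values[o] - peak) for o in offsets]
        total = sum(exps)
        for o, e in zip(offsets, exps):
            out[o] = e / total
        vectors += 1
    return out, vectors


shape = (4, 5, 6)
data = [float(i % 7) for i in range(4 * 5 * 6)]
out_pos, count = softmax_along_axis(data, shape, 1)   # 24 vectors of size 5
out_neg, _ = softmax_along_axis(data, shape, -2)      # -2 names the same dimension
```

For rank 3, axis=-2 and axis=1 select the same dimension, which is why both calls give identical results.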
- New OpenCL kernels / functions:
- New Arm® Neon™ kernels / functions:
- Removed padding from Arm® Neon™ kernels:
- NEComplexPixelWiseMultiplicationKernel
- NENonMaximaSuppression3x3Kernel
- NERemapKernel
- NEGEMMInterleave4x4Kernel
- NEDirectConvolutionLayerKernel
- NEScaleKernel
- NELocallyConnectedMatrixMultiplyKernel
- NEGEMMLowpOffsetContributionKernel
- NEGEMMTranspose1xWKernel
- NEPoolingLayerKernel
- NEConvolutionKernel
- NEDepthwiseConvolutionLayerNativeKernel
- NEGEMMLowpMatrixMultiplyKernel
- NEGEMMMatrixMultiplyKernel
- NEDirectConvolutionLayerOutputStageKernel
- NEReductionOperationKernel
- NEGEMMLowpMatrixAReductionKernel
- NEGEMMLowpMatrixBReductionKernel
- Removed padding from OpenCL kernels:
- CLBatchConcatenateLayerKernel
- CLElementwiseOperationKernel
- CLBatchNormalizationLayerKernel
- CLPoolingLayerKernel
- CLWinogradInputTransformKernel
- CLGEMMLowpMatrixMultiplyNativeKernel
- CLGEMMLowpMatrixAReductionKernel
- CLGEMMLowpMatrixBReductionKernel
- CLGEMMLowpOffsetContributionOutputStageKernel
- CLGEMMLowpOffsetContributionKernel
- CLWinogradOutputTransformKernel
- CLGEMMLowpMatrixMultiplyReshapedKernel
- CLFuseBatchNormalizationKernel
- CLDepthwiseConvolutionLayerNativeKernel
- CLDepthConvertLayerKernel
- CLCopyKernel
- CLDepthwiseConvolutionLayer3x3NHWCKernel
- CLActivationLayerKernel
- CLWinogradFilterTransformKernel
- CLWidthConcatenateLayerKernel
- CLWidthConcatenate4TensorsKernel
- CLWidthConcatenate2TensorsKernel
- CLLogits1DMaxShiftExpSumKernel
- CLLogits1DNormKernel
- CLHeightConcatenateLayerKernel
- CLGEMMMatrixMultiplyKernel
- CLGEMMLowpQuantizeDownInt32ScaleKernel
- CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
- CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
- CLDepthConcatenateLayerKernel
- CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
- Removed OpenCL kernels / functions:
- CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
- CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel
- CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel
- Deprecated OpenCL kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- CLLocallyConnectedLayer
- CLLocallyConnectedMatrixMultiplyKernel
- CLAbsoluteDifference
- CLAbsoluteDifferenceKernel
- CLAccumulate
- CLAccumulateKernel
- CLAccumulateSquared
- CLAccumulateSquaredKernel
- CLAccumulateWeighted
- CLAccumulateWeightedKernel
- CLAccumulateWeightedFP16Kernel
- CLBox3x3
- CLBox3x3Kernel
- CLBox3x3FP16Kernel
- CLCannyEdge
- CLChannelCombine
- CLChannelCombineKernel
- CLChannelExtract
- CLChannelExtractKernel
- CLColorConvert
- CLColorConvertKernel
- CLConvolution3x3
- CLConvolutionRectangle
- CLConvolutionRectangleKernel
- CLConvolutionSquare
- CLConvolutionKernel
- CLDerivative
- CLDerivativeKernel
- CLDilate
- CLDilateKernel
- CLEqualizeHistogram
- CLErode
- CLErodeKernel
- CLFastCorners
- CLFastCornersKernel
- CLGaussian3x3
- CLGaussian3x3Kernel
- CLGaussian5x5
- CLGaussian5x5HorKernel
- CLGaussian5x5VertKernel
- CLGaussianPyramid
- CLGaussianPyramidHalf
- CLGaussianPyramidOrb
- CLHarrisCorners
- CLHarrisScoreKernel
- CLHarrisScoreFP16Kernel
- CLHistogram
- CLHistogramKernel
- CLHOGOrientationBinningKernel
- CLHOGBlockNormalizationKernel
- CLHOGDetectorKernel
- CLHOGNonMaximaSuppressionKernel
- CLHOGDescriptor
- CLHOGDetector
- CLHOGGradient
- CLHOGMultiDetection
- CLIntegralImage
- CLIntegralImageKernel
- CLLaplacianReconstruct
- CLLaplacianPyramid
- CLMagnitude
- CLMagnitudePhaseKernel
- CLMedian3x3
- CLMedian3x3Kernel
- CLMinMaxLocation
- CLMinMaxLocationKernel
- CLNonLinearFilter
- CLNonLinearFilterKernel
- CLNonMaximaSuppression3x3
- CLNonMaximaSuppression3x3FP16Kernel
- CLNonMaximaSuppression3x3Kernel
- CLOpticalFlow
- CLPhase
- CLRemap
- CLRemapKernel
- CLScharr3x3
- CLScharr3x3Kernel
- CLSobel3x3
- CLSobel3x3Kernel
- CLSobel5x5
- CLSobel5x5HorKernel
- CLSobel5x5VertKernel
- CLSobel7x7
- CLSobel7x7HorKernel
- CLSobel7x7VertKernel
- CLThreshold
- CLThresholdKernel
- CLWarpAffine
- CLWarpAffineKernel
- CLWarpPerspective
- CLWarpPerspectiveKernel
- Deprecated Arm® Neon™ kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- NELocallyConnectedLayer
- NELocallyConnectedMatrixMultiplyKernel
- NEAbsoluteDifference
- NEAbsoluteDifferenceKernel
- NEAccumulate
- NEAccumulateKernel
- NEAccumulateSquared
- NEAccumulateSquaredKernel
- NEAccumulateWeighted
- NEAccumulateWeightedKernel
- NEAccumulateWeightedFP16Kernel
- NEBox3x3
- NEBox3x3Kernel
- NEBox3x3FP16Kernel
- NECannyEdge
- NEChannelCombine
- NEChannelCombineKernel
- NEChannelExtract
- NEChannelExtractKernel
- NEColorConvert
- NEColorConvertKernel
- NEConvolution3x3
- NEConvolutionRectangle
- NEConvolutionRectangleKernel
- NEConvolutionSquare
- NEConvolutionKernel
- NEDerivative
- NEDerivativeKernel
- NEDilate
- NEDilateKernel
- NEEqualizeHistogram
- NEErode
- NEErodeKernel
- NEFastCorners
- NEFastCornersKernel
- NEGaussian3x3
- NEGaussian3x3Kernel
- NEGaussian5x5
- NEGaussian5x5HorKernel
- NEGaussian5x5VertKernel
- NEGaussianPyramid
- NEGaussianPyramidHalf
- NEGaussianPyramidOrb
- NEHarrisCorners
- NEHarrisScoreKernel
- NEHarrisScoreFP16Kernel
- NEHistogram
- NEHistogramKernel
- NEHOGOrientationBinningKernel
- NEHOGBlockNormalizationKernel
- NEHOGDetectorKernel
- NEHOGNonMaximaSuppressionKernel
- NEHOGDescriptor
- NEHOGDetector
- NEHOGGradient
- NEHOGMultiDetection
- NEIntegralImage
- NEIntegralImageKernel
- NELaplacianReconstruct
- NELaplacianPyramid
- NEMagnitude
- NEMagnitudePhaseKernel
- NEMedian3x3
- NEMedian3x3Kernel
- NEMinMaxLocation
- NEMinMaxLocationKernel
- NENonLinearFilter
- NENonLinearFilterKernel
- NENonMaximaSuppression3x3
- NENonMaximaSuppression3x3FP16Kernel
- NENonMaximaSuppression3x3Kernel
- NEOpticalFlow
- NEPhase
- NERemap
- NERemapKernel
- NEScharr3x3
- NEScharr3x3Kernel
- NESobel3x3
- NESobel3x3Kernel
- NESobel5x5
- NESobel5x5HorKernel
- NESobel5x5VertKernel
- NESobel7x7
- NESobel7x7HorKernel
- NESobel7x7VertKernel
- NEThreshold
- NEThresholdKernel
- NEWarpAffine
- NEWarpAffineKernel
- NEWarpPerspective
- NEWarpPerspectiveKernel
- Deprecated GLES kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- GCAbsoluteDifference
- GCActivationLayer
- GCArithmeticAddition
- GCBatchNormalizationLayer
- GCConcatenateLayer
- GCConvolutionLayer
- GCDepthwiseConvolutionLayer
- GCDirectConvolutionLayer
- GCDropoutLayer
- GCFillBorder
- GCFullyConnectedLayer
- GCGEMM
- GCGEMMInterleave4x4
- GCGEMMTranspose1xW
- GCNormalizationLayer
- GCNormalizePlanarYUVLayer
- GCPixelWiseMultiplication
- GCPoolingLayer
- GCScale
- GCSoftmaxLayer
- GCTensorShift
- GCTranspose
v20.08 Public major release
- Various bug fixes.
- Various optimisations.
- Added new data type QASYMM8_SIGNED support for:
- Added new data type U8 support for:
- Added align_corner support for nearest neighbor interpolation in:
- NEScaleKernel
- CLScaleKernel
- New OpenCL kernels / functions:
- New Arm® Neon™ kernels / functions:
- NEMaxUnpoolingLayerKernel
- New graph example:
- graph_yolov3_output_detector
- GEMMTuner improvements:
- Added fp16 support
- Output json files for easier integration
- Enabled tuning for export_to_cl_image_rhs option for RHS tensors
- More robust script for running benchmarks
- Removed padding from:
- Removed OpenCL kernels / functions:
- CLGEMMLowpQuantizeDownInt32ToUint8Scale
- CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
- Removed Arm® Neon™ kernels / functions:
- NEGEMMLowpQuantizeDownInt32ToUint8Scale
- NEGEMMMatrixAccumulateBiasesKernel
- Deprecated functions / interfaces:
- The support for quantized data types has been removed from CLLogSoftmaxLayer due to implementation complexity.
- Removed padding requirement for the input (e.g. LHS of GEMM) and output in CLGEMMMatrixMultiplyNativeKernel, CLGEMMMatrixMultiplyReshapedKernel, CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and CLIm2ColKernel (NHWC only)
- This change allows CLGEMMConvolutionLayer to be used without extra padding for the input and output.
- Only the weights/bias of CLGEMMConvolutionLayer could require padding for the computation.
- Only on Arm® Mali™ Midgard GPUs, CLGEMMConvolutionLayer could require padding since CLGEMMMatrixMultiplyKernel is called and currently requires padding.
- Added support for exporting the OpenCL buffer object to the OpenCL image object in CLGEMMMatrixMultiplyReshapedKernel and CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
- This support allows the OpenCL buffer used for the reshaped RHS matrix to be exported to an OpenCL image object.
- The padding requirement for the OpenCL image object is taken into account in CLGEMMReshapeRHSMatrixKernel.
- The reshaped RHS matrix stores the weights when GEMM is used to accelerate CLGEMMConvolutionLayer.
v20.05 Public major release
- Various bug fixes.
- Various optimisations.
- Updated recommended NDK version to r18b.
- Updated recommended gcc version to Linaro 6.3.1.
- Added Bfloat16 type support
- Added Bfloat16 support in:
- Added new data type QASYMM8_SIGNED support for:
- New OpenCL kernels / functions:
- New Arm® Neon™ kernels / functions:
- Added HARD_SWISH support in:
- CLActivationLayerKernel
- NEActivationLayerKernel
- Deprecated OpenCL kernels / functions:
- CLGEMMLowpQuantizeDownInt32ToUint8Scale
- CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
- Deprecated Arm® Neon™ kernels / functions:
- NEGEMMLowpQuantizeDownInt32ToUint8Scale
- Removed CPP kernels / functions:
- Removed PoolingLayerInfo constructors without Data Layout.
- Removed CLDepthwiseConvolutionLayer3x3
- Removed NEDepthwiseConvolutionLayerOptimized
- Added support for Winograd 3x3,4x4 on Arm® Neon™ FP16:
- NEWinogradConvolutionLayer
- CpuWinogradConv2dTransformInputKernel
- CpuWinogradConv2dTransformOutputKernel
- CpuWinogradConv2dTransformWeightsKernel
- Added CLCompileContext
- Added Arm® Neon™ GEMM kernel with 2D window support
v20.02.1 Maintenance release
- Added Android-NN build script.
v20.02 Public major release
- Various bug fixes.
- Various optimisations.
- Added new data type QASYMM8_SIGNED support for:
- Added support for QSYMM8_PER_CHANNEL in:
- NEDepthwiseConvolutionLayer3x3Kernel
- Added support for split sizes in:
- New OpenCL kernels / functions:
- CLFill
- CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
- New Arm® Neon™ kernels / functions:
- NEFill
- NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
- Deprecated functions / interfaces:
- CLDepthwiseConvolutionLayer3x3
- NEDepthwiseConvolutionLayerOptimized
- PoolingLayerInfo constructors without Data Layout.
- Added support for quantization with multiplier greater than 1 on Arm® Neon™ and CL.
- Added support for quantized inputs of type QASYMM8_SIGNED and QASYMM8 to CLQuantizationLayer.
- Added the ability to build bootcode for bare metal.
- Added support for generating synthetic QASYMM8 graphs.
- Added support for F16 datatype in VGG16.
- Removed pre-built binaries for GLES.
v19.11.1 Public maintenance release
- Fix offset calculation in NEReductionOperationKernel.
- Fix data layout in NEScaleKernel for nhwc.
- Retain configuration step data layout to avoid side-effects.
- Perform sqrt in double domain for L2 pooling.
- Fix output shape calculation for Reduce Mean
- Restrict cases where optimized NEPadLayer runs.
v19.11 Public major release
- Various bug fixes.
- Various optimisations.
- Updated recommended NDK version to r17c.
- Deprecated OpenCL kernels / functions:
- CLDepthwiseConvolutionLayerReshapeWeightsGenericKernel
- CLDepthwiseIm2ColKernel
- CLDepthwiseSeparableConvolutionLayer
- CLDepthwiseVectorToTensorKernel
- CLDirectConvolutionLayerOutputStageKernel
- Deprecated Arm® Neon™ kernels / functions:
- NEDepthwiseWeightsReshapeKernel
- NEDepthwiseIm2ColKernel
- NEDepthwiseSeparableConvolutionLayer
- NEDepthwiseVectorToTensorKernel
- NEDepthwiseConvolutionLayer3x3
- New OpenCL kernels / functions:
- New Arm® Neon™ kernels / functions:
- Added QASYMM8 support for:
- Added QASYMM16 support for:
- Added FP16 support for:
- CLGEMMMatrixMultiplyReshapedKernel
- Added new data type QASYMM8_PER_CHANNEL support for:
- Added new data type QSYMM8_PER_CHANNEL support for:
- Added FP16 mixed-precision support for:
- CLGEMMMatrixMultiplyReshapedKernel
- CLPoolingLayerKernel
- Added FP32 and FP16 ELU activation for:
- Added asymmetric padding support for:
- Added SYMMETRIC and REFLECT modes for CLPadLayerKernel / CLPadLayer.
- Replaced the calls to NECopyKernel and NEMemsetKernel with NEPadLayer in NEGenerateProposalsLayer.
- Replaced the calls to CLCopyKernel and CLMemsetKernel with CLPadLayer in CLGenerateProposalsLayer.
- Improved performance for CL Inception V3 - FP16.
- Improved accuracy for CL Inception V3 - FP16 by enabling FP32 accumulator (mixed-precision).
- Improved Arm® Neon™ performance by enabling fusing batch normalization with convolution and depth-wise convolution layer.
- Improved Arm® Neon™ performance for MobileNet-SSD by improving the output detection performance.
- Optimized CLPadLayer.
- Optimized CL generic depthwise convolution layer by introducing CLDepthwiseConvolutionLayerNativeKernel.
- Reduced memory consumption by implementing weights sharing.
v19.08.1 Public maintenance release
- Fix offset calculation in NEReductionOperationKernel.
- Fix data layout in NEScaleKernel for nhwc.
- Retain configuration step data layout to avoid side-effects.
- Perform sqrt in double domain for L2 pooling.
- Fix output shape calculation for Reduce Mean
- Fix broadcast CLPixelwiseMultiplication with 5D tensors
v19.08 Public major release
- Various bug fixes.
- Various optimisations.
- Deprecated Arm® Neon™ functions
- NEDepthConcatenateLayer
- NEWidthConcatenateLayer
- Deprecated OpenCL kernels / functions
- CLDepthConcatenateLayer
- CLGEMMInterleave4x4Kernel / CLGEMMInterleave4x4
- CLGEMMTranspose1xWKernel / CLGEMMTranspose1xW
- CLWidthConcatenateLayer
- New Arm® Neon™ kernels / functions:
- New OpenCL kernels / functions:
- New examples:
- neon_opticalflow
- cl_cache
- neon_permute
- Added support for FP16 in NEDeconvolutionLayer
- Added support for FP16 in CLDeconvolutionLayer
- Added support for REDUCE_MIN and REDUCE_MAX in ReductionOperation
- Enable the fusion of batch normalization with convolution and depthwise convolution layer for FP32 in the graph API (OpenCL only)
- Added support for fusing activation function and broadcast addition with the matrix multiplication for FP32 (OpenCL only)
- Re-factored the depthwise convolution layer kernel on Arm® Neon™ for generic cases
- Added an optimized depthwise convolution layer kernel for 5x5 filters (Neon™ only)
- Added support to enable OpenCL kernel cache. Added example showing how to load the prebuilt OpenCL kernels from a binary cache file
- Altered QuantizationInfo interface to support per-channel quantization.
- CLDepthwiseConvolutionLayer3x3 will be included by CLDepthwiseConvolutionLayer to accommodate future optimizations.
- NEDepthwiseConvolutionLayerOptimized will be included by NEDepthwiseConvolutionLayer to accommodate future optimizations.
- Removed inner_border_right and inner_border_top parameters from CLDeconvolutionLayer interface
- Removed inner_border_right and inner_border_top parameters from NEDeconvolutionLayer interface
- Optimized the Arm® Neon™ assembly kernel for GEMMLowp. The new implementation fuses the output stage and quantization with the matrix multiplication kernel
v19.05 Public major release
- Various bug fixes.
- Various optimisations.
- New Arm® Neon™ kernels / functions:
- New OpenCL kernels / functions:
- New OpenGLES kernels / functions:
- Deprecated functions/interfaces
- GCDepthConcatenateLayer
- NEWidthConcatenateLayer
- NEDepthConcatenateLayer
- CLWidthConcatenateLayer
- CLDepthConcatenateLayer
- CLGEMMInterleave4x4
- CLGEMMTranspose1xW
- Support different quantization info in CLConcatLayer.
- Add checks for cases where different input/output quantization info is not supported.
- Tensors have different quantization information.
- Add FP16 support checks.
- Fix output quantization in CLDepthwiseConv3x3 when activation is fused.
- New graph examples:
- graph_convolution
- graph_fully_connected
- graph_depthwise_convolution
- Deepspeech v0.4.1
- Add support for QASYMM8 in NEArithmeticSubtractionKernel.
- Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
- Add support for QASYMM8 in NEDeconvolution.
- Add support for DequantizationLayer for Neon/CL.
- Add support for dilation in CLDepthwiseConvolution.
- Fuse offset contribution with the output stage when we use NEGEMMLowpMatrixMultiplyCore.
- Optimize CLDeconvolution.
- Add StackLayer to the graph API.
- Add support for "reflect" padding mode in NEPad.
- Winograd 7x7 NHWC on OpenCL.
- Rework CL ML layers to run exclusively on CL.
- Support different quantization info in PoolingLayer.
- Implement and test import memory interfaces.
- Added new tests and removed old ones.
- Various clang-tidy fixes.
v19.02 Public major release
- Various bug fixes.
- Various optimisations.
- New Arm® Neon™ kernels / functions:
- New OpenCL kernels / functions:
- New CPP kernels / functions:
- Added new examples:
- Add 4D tensors support to
- Fused activation in CLWinogradConvolutionLayer
- Extended NEPermute to support more cases
- Added Neon™/SVE GEMM Hybrid kernels
- Added u8 and s8 hybrid assembly kernels
- Introduced GEMM strategy name in NEGEMMAssemblyWrapper
- Improved CLTuner
- Fused the bias addition within CLGEMM
- Added support for QASYMM8 LOGISTIC activation in NEActivationLayer
- Added NHWC data layout support to:
- Added QASYMM8 support to the following kernels:
- Added new tests and improved validation and benchmarking suites.
- Deprecated functions/interfaces
v18.11 Public major release
- Various bug fixes.
- Various optimisations.
- New Arm® Neon™ kernels / functions:
- New OpenCL kernels / functions:
- New CPP kernels / functions:
- Added the validate method in:
- Added new examples:
- Added documentation on how to add a new function or kernel.
- Improved doxygen documentation adding a list of the existing functions.
- Add 4D tensors support to
- Add dot product support for CLDepthwiseConvolutionLayer3x3NHWCKernel non-unit stride
- Add SVE support
- Fused batch normalization into convolution layer weights in CLFuseBatchNormalization
- Fused activation in CLDepthwiseConvolutionLayer3x3NCHWKernel, CLDepthwiseConvolutionLayer3x3NHWCKernel and NEGEMMConvolutionLayer
- Added NHWC data layout support to:
- Added QASYMM8 support to the following kernels:
- CLScaleKernel
- NEDepthwiseConvolutionLayer3x3Kernel
- CLPixelWiseMultiplicationKernel
- Added FP16 support to the following kernels:
- More tests added to both validation and benchmarking suites.
v18.08 Public major release
- Various bug fixes.
- Various optimisations.
- Updated recommended NDK version to r17b.
- Removed support for QS8/QS16 data types.
- Added support for grouped convolution in CLConvolutionLayer.
- Added NHWC data layout support to:
- New Arm® Neon™ kernels / functions:
- New OpenCL kernels / functions:
- Introduced prepare() stage support in the graph API for GLES.
- Added support for memory reuse when trying to allocate smaller CLTensors.
- Enabled NHWC execution on graph examples.
- Added JPEG accessor for validation purposes.
- Added validate methods to some kernels / functions.
v18.05 Public major release
- Various bug fixes.
- Various optimisations.
- Major redesign in the interface for the Neon™ kernels implemented in assembly.
- Removed arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore / arm_compute::NEHGEMMAArch64FP16Kernel
- Added NEGEMMAssemblyWrapper and AssemblyKernelGlue which are used to execute assembly kernels in Neon™ functions.
- Minor changes to the CPUInfo type to make it compatible with the new assembly gemm interface.
- Moved Neon™ assembly kernels to the folder src/core/Neon/kernels/arm_gemm.
- Improved doxygen documentation.
- Improved memory management for layer transitions.
- Added support for NHWC data layout in tensors.
- Added NHWC data layout support to:
- Added support for dilated convolutions in NEConvolutionLayer and CLConvolutionLayer.
- New OpenCL kernels / functions:
- New Arm® Neon™ kernels / functions:
- Created the validate method in CLDepthwiseConvolutionLayer.
- Beta and gamma are no longer mandatory arguments in NEBatchNormalizationLayer and CLBatchNormalizationLayer.
- Added depth multiplier support in NEDepthwiseConvolutionLayer and CLDepthwiseConvolutionLayer.
- Added broadcast multiply support in NEPixelWiseMultiplication / NEPixelWiseMultiplicationKernel.
- Port mobilenet example to NHWC data layout.
- Enabled Winograd method in CLConvolutionLayer.
- Renamed NEWinogradLayer to NEWinogradConvolutionLayer.
- Updated NEWinogradConvolutionLayer to use highly optimised assembly kernels in src/core/Neon/kernels/arm_gemm.
- Added memory manager support in GLES functions.
- Major refactoring of the graph API.
- Added GLES backend in the graph API.
- Added support for the memory manager in the graph API.
- Enabled Winograd Convolution method in the graph API.
- Added support for grouped convolutions in the graph API.
- Replaced NEDeconvolutionLayerUpsampleKernel with NEScaleKernel in NEDeconvolutionLayer.
- Added fast maths flag in CLConvolutionLayer.
- Added new tests and benchmarks in validation and benchmark frameworks.
- Merged the Activation layer with the Convolution layer (Neon™, CL, GLES).
- Added support for OpenCL 2.0 SVM.
- Added support for importing memory into OpenCL tensors.
- Added the prepare() method to perform any one-off pre-processing before running the function.
- Added new examples:
- Added memory measurement instrument for CL.
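The prepare() entry in the list above describes a one-off pre-processing step that happens before a function runs. The sketch below illustrates that pattern only. ToyFullyConnected, its members and prepare_count() are hypothetical names invented for this example and are not part of the library's API; the assumption is simply that prepare() reshapes constant data once and that run() triggers it lazily on the first call.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy class (not part of the library) showing the prepare()/run() split.
class ToyFullyConnected
{
public:
    // weights is a row-major rows x cols matrix.
    ToyFullyConnected(std::vector<float> weights, std::size_t rows, std::size_t cols)
        : _weights(std::move(weights)), _rows(rows), _cols(cols)
    {
        assert(_weights.size() == _rows * _cols);
    }

    // One-off pre-processing: transpose the weights so run() reads them contiguously.
    // Calling it again is a no-op.
    void prepare()
    {
        if(_is_prepared)
        {
            return;
        }
        _transposed.resize(_weights.size());
        for(std::size_t r = 0; r < _rows; ++r)
        {
            for(std::size_t c = 0; c < _cols; ++c)
            {
                _transposed[c * _rows + r] = _weights[r * _cols + c];
            }
        }
        _is_prepared = true;
        ++_prepare_count;
    }

    // Computes output[c] = sum over r of input[r] * weights[r][c].
    std::vector<float> run(const std::vector<float> &input)
    {
        assert(input.size() == _rows);
        prepare(); // lazily prepares on the first call only
        std::vector<float> output(_cols, 0.0f);
        for(std::size_t c = 0; c < _cols; ++c)
        {
            for(std::size_t r = 0; r < _rows; ++r)
            {
                output[c] += input[r] * _transposed[c * _rows + r];
            }
        }
        return output;
    }

    // Number of times the one-off step actually executed.
    int prepare_count() const { return _prepare_count; }

private:
    std::vector<float> _weights;
    std::vector<float> _transposed;
    std::size_t        _rows;
    std::size_t        _cols;
    bool               _is_prepared{ false };
    int                _prepare_count{ 0 };
};
```

A caller can invoke prepare() explicitly to move the one-off cost out of the first run(), or rely on run() to do it. Either way the transposition in this sketch executes once, however many times run() is called.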
v18.03 Public maintenance release
- Various bug fixes.
- Fixed bug in NEActivationLayer
- Fixed CLTuner when using batches.
- Updated recommended NDK version to r16b (and fixed warnings).
- Fixed bug in validation code.
- Added Inception v4 graph example.
- Renamed NEWinogradLayer.cpp to NEWinogradConvolutionLayer
v18.02 Public major release
v18.01 Public maintenance release
- Various bug fixes
- Added some of the missing validate() methods
- Added CLDeconvolutionLayerUpsampleKernel / CLDeconvolutionLayer, CLDeconvolutionLayerUpsample
- Added CLPermuteKernel / CLPermute
- Added method to clean the programs cache in the CL Kernel library.
- Added GCArithmeticAdditionKernel / GCArithmeticAddition
- Added GCDepthwiseConvolutionLayer3x3Kernel / GCDepthwiseConvolutionLayer3x3
- Added GCNormalizePlanarYUVLayerKernel / GCNormalizePlanarYUVLayer
- Added GCScaleKernel / GCScale
- Added GCWeightsReshapeKernel / GCConvolutionLayer
- Added FP16 support to the following GLES compute kernels:
- GCCol2ImKernel
- GCGEMMInterleave4x4Kernel
- GCGEMMTranspose1xWKernel
- GCIm2ColKernel
- Refactored Arm® Neon™ Winograd (NEWinogradLayerKernel)
- Added NEDirectConvolutionLayerOutputStageKernel
- Added QASYMM8 support to the following Arm® Neon™ kernels:
- Added new examples:
- More tests added to both validation and benchmarking suites.
v17.12 Public major release
- Most machine learning functions on OpenCL support the new data type QASYMM8
- Introduced a logging interface
- Introduced an OpenCL timer
- Reworked GEMMLowp interface
- Added new Arm® Neon™ assembly kernels for GEMMLowp, SGEMM and HGEMM
- Added validation method for most Machine Learning kernels / functions
- Added new graph examples such as googlenet, mobilenet, squeezenet, vgg16 and vgg19
- Added sgemm example for OpenCL
- Added absolute difference example for GLES compute
- Added new tests and benchmarks in validation and benchmark frameworks
- Added new kernels / functions for GLES compute
- New OpenGL ES kernels / functions
- GCAbsoluteDifferenceKernel / GCAbsoluteDifference
- GCActivationLayerKernel / GCActivationLayer
- GCBatchNormalizationLayerKernel / GCBatchNormalizationLayer
- GCCol2ImKernel
- GCDepthConcatenateLayerKernel / GCDepthConcatenateLayer
- GCDirectConvolutionLayerKernel / GCDirectConvolutionLayer
- GCDropoutLayerKernel / GCDropoutLayer
- GCFillBorderKernel / GCFillBorder
- GCGEMMInterleave4x4Kernel / GCGEMMInterleave4x4
- GCGEMMMatrixAccumulateBiasesKernel / GCGEMMMatrixAdditionKernel / GCGEMMMatrixMultiplyKernel / GCGEMM
- GCGEMMTranspose1xWKernel / GCGEMMTranspose1xW
- GCIm2ColKernel
- GCNormalizationLayerKernel / GCNormalizationLayer
- GCPixelWiseMultiplicationKernel / GCPixelWiseMultiplication
- GCPoolingLayerKernel / GCPoolingLayer
- GCLogits1DMaxKernel / GCLogits1DShiftExpSumKernel / GCLogits1DNormKernel / GCSoftmaxLayer
- GCTransposeKernel / GCTranspose
- New Arm® Neon™ kernels / functions
- arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore
- arm_compute::NEHGEMMAArch64FP16Kernel
- NEDepthwiseConvolutionLayer3x3Kernel / NEDepthwiseIm2ColKernel / NEGEMMMatrixVectorMultiplyKernel / NEDepthwiseVectorToTensorKernel / NEDepthwiseConvolutionLayer
- NEGEMMLowpOffsetContributionKernel / NEGEMMLowpMatrixAReductionKernel / NEGEMMLowpMatrixBReductionKernel / NEGEMMLowpMatrixMultiplyCore
- NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
- NEWinogradLayer / NEWinogradLayerKernel
- New OpenCL kernels / functions
- CLGEMMLowpOffsetContributionKernel / CLGEMMLowpMatrixAReductionKernel / CLGEMMLowpMatrixBReductionKernel / CLGEMMLowpMatrixMultiplyCore
- CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
- New graph nodes for Arm® Neon™ and OpenCL
- graph::BranchLayer
- graph::DepthConvertLayer
- graph::DepthwiseConvolutionLayer
- graph::DequantizationLayer
- graph::FlattenLayer
- graph::QuantizationLayer
- graph::ReshapeLayer
v17.10 Public maintenance release
- Bug fixes:
- Check the maximum local workgroup size supported by OpenCL devices
- Minor documentation updates (Fixed instructions to build the examples)
- Introduced a graph::GraphContext
- Added a few new Graph nodes, support for branches and grouping.
- Automatically enable cl_printf in debug builds
- Fixed bare metal builds for armv7a
- Added AlexNet and cartoon effect examples
- Fixed library builds: libraries are no longer built as supersets of each other. (Applications using the runtime part of the library now need to link against both libarm_compute_core and libarm_compute.)
v17.09 Public major release
v17.06 Public major release
- Various bug fixes
- Added support for 8-bit fixed point (QS8) to the various Arm® Neon™ machine learning kernels.
- Added unit tests and benchmarks (AlexNet, LeNet)
- Added support for sub tensors.
- Added infrastructure to provide GPU specific optimisation for some OpenCL kernels.
- Added OMPScheduler (OpenMP) scheduler for Arm® Neon™
- Added SingleThreadScheduler scheduler for Arm® Neon™ (for bare metal)
- Users can specify their own scheduler by implementing the IScheduler interface.
- New OpenCL kernels / functions:
- CLBatchNormalizationLayerKernel / CLBatchNormalizationLayer
- CLDepthConcatenateLayerKernel / CLDepthConcatenateLayer
- CLHOGOrientationBinningKernel, CLHOGBlockNormalizationKernel, CLHOGDetectorKernel / CLHOGDescriptor, CLHOGDetector, CLHOGGradient, CLHOGMultiDetection
- CLLocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedLayer
- CLWeightsReshapeKernel / CLConvolutionLayerReshapeWeights
- New C++ kernels:
- CPPDetectionWindowNonMaximaSuppressionKernel
- New Arm® Neon™ kernels / functions:
v17.05 Public bug fixes release
- Various bug fixes
- Ported the remaining functions to use accurate padding.
- The library no longer links against OpenCL (it uses dlopen / dlsym at runtime instead to determine whether OpenCL is available).
- Added "free" method to allocator.
- Minimum version of g++ required for armv7 Linux changed from 4.8 to 4.9
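The dlopen / dlsym entry in the list above can be illustrated with a small, generic sketch of runtime library detection on a POSIX system. This is not the library's own loader. The helper names library_has_symbol and opencl_is_available are hypothetical, and the file name "libOpenCL.so" is an assumption (the actual name and search order vary by platform). clGetPlatformIDs is a standard OpenCL entry point, used here only as a probe symbol. Older toolchains may need -ldl at link time.

```cpp
#include <dlfcn.h>
#include <string>

// Hypothetical helper: returns true if the shared library can be opened
// and exports the requested symbol. Nothing is linked at build time.
bool library_has_symbol(const std::string &library, const std::string &symbol)
{
    void *handle = dlopen(library.c_str(), RTLD_LAZY | RTLD_LOCAL);
    if(handle == nullptr)
    {
        return false; // library not present or not loadable
    }
    dlerror(); // clear any stale error state before the lookup
    void      *address = dlsym(handle, symbol.c_str());
    const bool found   = (address != nullptr);
    dlclose(handle);
    return found;
}

// Assumption: the OpenCL runtime is installed under the name "libOpenCL.so".
bool opencl_is_available()
{
    return library_has_symbol("libOpenCL.so", "clGetPlatformIDs");
}
```

With this approach an application can start on a device without an OpenCL driver and fall back to a CPU path when opencl_is_available() returns false.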
v17.04 Public bug fixes release
The following functions have been ported to use the new accurate padding:
- CLColorConvertKernel
- CLEdgeNonMaxSuppressionKernel
- CLEdgeTraceKernel
- CLGaussianPyramidHorKernel
- CLGaussianPyramidVertKernel
- CLGradientKernel
- NEChannelCombineKernel
- NEFillArrayKernel
- NEGaussianPyramidHorKernel
- NEGaussianPyramidVertKernel
- NEHarrisScoreFP16Kernel
- NEHarrisScoreKernel
- NEHOGDetectorKernel
- NELogits1DMaxKernel
- NELogits1DShiftExpSumKernel
- NELogits1DNormKernel
- NENonMaximaSuppression3x3FP16Kernel
- NENonMaximaSuppression3x3Kernel
v17.03.1 First major public release of the sources
- Renamed the library to arm_compute
- New CPP target introduced for C++ kernels shared between Arm® Neon™ and CL functions.
- New padding calculation interface introduced and ported most kernels / functions to use it.
- New OpenCL kernels / functions:
- CLGEMMLowpMatrixMultiplyKernel / CLGEMMLowp
- New Arm® Neon™ kernels / functions:
v17.03 Sources preview
- New OpenCL kernels / functions:
- CLGradientKernel, CLEdgeNonMaxSuppressionKernel, CLEdgeTraceKernel / CLCannyEdge
- GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / CLGEMM
- CLGEMMMatrixAccumulateBiasesKernel / CLFullyConnectedLayer
- CLTransposeKernel / CLTranspose
- CLLKTrackerInitKernel, CLLKTrackerStage0Kernel, CLLKTrackerStage1Kernel, CLLKTrackerFinalizeKernel / CLOpticalFlow
- CLNormalizationLayerKernel / CLNormalizationLayer
- CLLaplacianPyramid, CLLaplacianReconstruct
- New Arm® Neon™ kernels / functions:
- NEActivationLayerKernel / NEActivationLayer
- GEMM refactoring + FP16 support (Requires armv8.2 CPU): NEGEMMInterleave4x4Kernel, NEGEMMTranspose1xWKernel, NEGEMMMatrixMultiplyKernel, NEGEMMMatrixAdditionKernel / NEGEMM
- NEPoolingLayerKernel / NEPoolingLayer
v17.02.1 Sources preview
- New OpenCL kernels / functions:
- CLLogits1DMaxKernel, CLLogits1DShiftExpSumKernel, CLLogits1DNormKernel / CLSoftmaxLayer
- CLPoolingLayerKernel / CLPoolingLayer
- CLIm2ColKernel, CLCol2ImKernel, CLConvolutionLayerWeightsReshapeKernel / CLConvolutionLayer
- CLRemapKernel / CLRemap
- CLGaussianPyramidHorKernel, CLGaussianPyramidVertKernel / CLGaussianPyramid, CLGaussianPyramidHalf, CLGaussianPyramidOrb
- CLMinMaxKernel, CLMinMaxLocationKernel / CLMinMaxLocation
- CLNonLinearFilterKernel / CLNonLinearFilter
- New Arm® Neon™ FP16 kernels (Requires armv8.2 CPU)
- NEAccumulateWeightedFP16Kernel
- NEBox3x3FP16Kernel
- NENonMaximaSuppression3x3FP16Kernel
v17.02 Sources preview
- New OpenCL kernels / functions:
- CLActivationLayerKernel / CLActivationLayer
- CLChannelCombineKernel / CLChannelCombine
- CLDerivativeKernel / CLChannelExtract
- CLFastCornersKernel / CLFastCorners
- CLMeanStdDevKernel / CLMeanStdDev
- New Arm® Neon™ kernels / functions:
- HOG / SVM: NEHOGOrientationBinningKernel, NEHOGBlockNormalizationKernel, NEHOGDetectorKernel, NEHOGNonMaximaSuppressionKernel / NEHOGDescriptor, NEHOGDetector, NEHOGGradient, NEHOGMultiDetection
- NENonLinearFilterKernel / NENonLinearFilter
- Introduced a CLScheduler to manage the default context and command queue used by the runtime library and create synchronisation events.
- Switched all the kernels / functions to use tensors instead of images.
- Updated documentation to include instructions to build the library from sources.
v16.12 Binary preview release