Introduction

This user manual describes the CMSIS DSP software library, a suite of common compute processing functions for use on Cortex-M and Cortex-A processor-based devices.

The library is divided into a number of function groups, each covering a specific category such as basic math, filtering, matrix, transform, or statistics functions.

The library generally provides separate functions for operating on 8-bit integers, 16-bit integers, 32-bit integers, 32-bit floating-point values, and 64-bit floating-point values.

The library provides vectorized versions of most algorithms for Helium and of most f32 algorithms for Neon.

When using a vectorized version, allow for a small amount of padding (3 words) after the end of each buffer, because the vectorized code may read slightly past the end. You do not have to modify your buffers; just ensure that the end of the buffer plus the padding does not fall outside a valid memory region.
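One way to satisfy this requirement is to over-allocate the buffer itself. The following is a minimal sketch, assuming a statically allocated f32 buffer. The names BLOCK_SIZE, VECTOR_PADDING_WORDS, samples, and the three helper functions are hypothetical and are not defined by the library; only the 3-word figure comes from the text above.

```c
#include <stddef.h>

/* Hypothetical names: none of these are defined by CMSIS-DSP. */
#define BLOCK_SIZE            256u  /* samples actually processed */
#define VECTOR_PADDING_WORDS  3u    /* 3 words of slack, as stated above */

/* Storage = payload + padding, so a vector load that runs slightly past
   the last sample still lands inside memory owned by this array. */
static float samples[BLOCK_SIZE + VECTOR_PADDING_WORDS];

/* Pointer to hand to a library function as its source or destination. */
float *payload(void)
{
    return samples;
}

/* Number of samples to hand to a library function as its block size. */
size_t payload_length(void)
{
    return BLOCK_SIZE;
}

/* Bytes of slack that follow the payload. */
size_t padding_bytes(void)
{
    return sizeof(samples) - BLOCK_SIZE * sizeof(samples[0]);
}
```

Only payload_length() samples are ever processed; the extra words exist solely so that an over-read stays inside the array. A buffer that is already followed by other valid memory needs no such change.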

Related projects

Python wrapper

A Python wrapper is also available, with a Python API as close as possible to the C one. It can be used to start developing and testing an algorithm with NumPy and SciPy before writing the C version. It is available on PyPI.org and can be installed with: pip install cmsisdsp.

Experimental C++ template extension

This extension is a set of C++ headers. They only need to be included to start using the features.

Those headers are not yet part of the pack; you need to get them from the GitHub repository.

More documentation about the DSP++ extension.

Using the CMSIS-DSP Library

The library is released in source form. It is strongly advised to compile the library with -Ofast optimization for the best performance.

The following options should be avoided:

-fno-builtin
-ffreestanding, because it enables the previous option

The library uses type punning to process a 32-bit word from memory as a pair of q15 values or a quadruple of q7 values. These type manipulations are done through memcpy calls. Most compilers can optimize out those calls when the length to copy is small (4 bytes).

This optimization does not occur when -fno-builtin is used, which severely degrades performance.
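The technique can be sketched as follows: 4 bytes are copied with memcpy into a 32-bit integer, which is then split into two 16-bit halves. The function names are hypothetical and are not the library's own helpers, and the mapping of halves to array elements assumes a little-endian target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the library's API: load two adjacent q15
   samples as one 32-bit word. Going through memcpy avoids the undefined
   behaviour of casting an int16_t pointer to a uint32_t pointer. */
uint32_t load_q15_pair(const int16_t *src)
{
    uint32_t word;
    memcpy(&word, src, sizeof(word));  /* 4 bytes: normally compiled to one load */
    return word;
}

/* Lower 16 bits of the packed word (src[0] on a little-endian target). */
int16_t low_q15(uint32_t word)
{
    return (int16_t)(uint16_t)(word & 0xFFFFu);
}

/* Upper 16 bits of the packed word (src[1] on a little-endian target). */
int16_t high_q15(uint32_t word)
{
    return (int16_t)(uint16_t)(word >> 16);
}
```

With builtins enabled, the compiler knows what memcpy does and can replace the 4-byte copy with a plain load. With -fno-builtin it has to emit a real function call for every pair of samples.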

The library functions are declared in the public file Include/arm_math.h. Simply include this file to use the CMSIS-DSP library. If you don't want to include everything, you can also rely on individual header files from the Include/dsp/ folder and include only those that are needed in the project.

Examples

The library ships with a number of examples which demonstrate how to use the library functions. Please refer to Examples.

Toolchain Support

The library is currently tested on Fast Models, building with CMake. The Cortex-M0, M4, M7, M33, and M55 cores are tested.

Access to CMSIS-DSP

CMSIS-DSP is actively maintained in the CMSIS-DSP GitHub repository and is released as a standalone CMSIS-DSP pack in the CMSIS-Pack format.

The table below explains the content of the ARM::CMSIS-DSP pack.

Directory	Description
📂 ComputeLibrary	Small Neon kernels when building on Cortex-A
📂 Documentation	Folder with this CMSIS-DSP documentation
📂 Example	Example projects demonstrating the usage of the library functions
📂 Include	Include files for using and building the lib
📂 PrivateInclude	Private include files for building the lib
📂 Source	Source files
📄 ARM.CMSIS-DSP.pdsc	CMSIS-Pack description file
📄 LICENSE	License Agreement (Apache 2.0)

See CMSIS Documentation for an overview of CMSIS software components, tools and specifications.

Preprocessor Macros

Each library project has different preprocessor macros.

ARM_MATH_BIG_ENDIAN:
- Define macro ARM_MATH_BIG_ENDIAN to build the library for big endian targets. By default, the library builds for little endian targets.
ARM_MATH_MATRIX_CHECK:
- Define macro ARM_MATH_MATRIX_CHECK for checking on the input and output sizes of matrices
ARM_MATH_ROUNDING:
- Define macro ARM_MATH_ROUNDING for rounding on support functions
ARM_MATH_LOOPUNROLL:
- Define macro ARM_MATH_LOOPUNROLL to enable manual loop unrolling in DSP functions
ARM_MATH_NEON:
- Define macro ARM_MATH_NEON to enable Neon versions of the DSP functions. It is not enabled by default when Neon is available because performance depends on the compiler and target architecture.
ARM_MATH_NEON_EXPERIMENTAL:
- Define macro ARM_MATH_NEON_EXPERIMENTAL to enable experimental Neon versions of some DSP functions. Experimental Neon versions currently do not perform better than the scalar versions.
ARM_MATH_HELIUM:
- It implies the flags ARM_MATH_MVEF and ARM_MATH_MVEI and ARM_MATH_MVE_FLOAT16.
ARM_MATH_HELIUM_EXPERIMENTAL:
- Only taken into account when ARM_MATH_MVEF, ARM_MATH_MVEI or ARM_MATH_MVE_FLOAT16 are defined. Enable some vector versions which may have worse performance than scalar depending on the core / compiler configuration.
ARM_MATH_MVEF:
- Select Helium versions of the f32 algorithms. It implies ARM_MATH_FLOAT16 and ARM_MATH_MVEI.
ARM_MATH_MVEI:
- Select Helium versions of the int and fixed point algorithms.
ARM_MATH_MVE_FLOAT16:
- MVE Float16 implementations of some algorithms (Requires MVE extension).
DISABLEFLOAT16:
- Disable float16 algorithms when __fp16 is not supported for a specific compiler / core configuration. This is only valid for scalar code. When the vector architecture supports f16, it cannot be disabled.
ARM_MATH_AUTOVECTORIZE:
- With Helium or Neon, disable the use of vectorized code with C intrinsics and use pure C instead. The vectorization is then done by the compiler.
ARM_DSP_ATTRIBUTE:
- Can be set to define CMSIS-DSP functions as weak functions. This can be set either on the command line when building or in a new arm_dsp_config.h header (see below).
ARM_DSP_TABLE_ATTRIBUTE:
- Can be set to define in which section constant tables must be mapped. This can be set either on the command line when building or in a new arm_dsp_config.h header (see below). Another way to set those sections is to modify the linker scripts, since the constant tables are defined only in a restricted set of source files.
ARM_DSP_CUSTOM_CONFIG:
- When set, the file arm_dsp_config.h is included by the arm_math_types.h header. You can use this file to define any of the above compilation symbols.
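As an illustration, a project-specific arm_dsp_config.h might look like the sketch below. The choice of symbols, the GCC/Clang attribute syntax, and the section name .dsp_tables are assumptions made for this example only; adapt them to your toolchain and linker script.

```c
/* arm_dsp_config.h: sketch of a project-specific configuration header.
   It is only read when ARM_DSP_CUSTOM_CONFIG is defined for the build.
   All values below are illustrative assumptions. */
#ifndef ARM_DSP_CONFIG_H
#define ARM_DSP_CONFIG_H

/* Check the input and output sizes of matrices. */
#define ARM_MATH_MATRIX_CHECK

/* Enable manual loop unrolling in DSP functions. */
#define ARM_MATH_LOOPUNROLL

/* Assumption: GCC/Clang attribute syntax. Marks the library functions
   as weak so that an application can provide its own versions. */
#define ARM_DSP_ATTRIBUTE __attribute__((weak))

/* Assumption: ".dsp_tables" is a hypothetical section name that the
   linker script would have to place explicitly. */
#define ARM_DSP_TABLE_ATTRIBUTE __attribute__((section(".dsp_tables")))

#endif /* ARM_DSP_CONFIG_H */
```

The header takes effect only when ARM_DSP_CUSTOM_CONFIG is defined for the build and the file is reachable on the include path.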

Code size

Previous versions used many compilation flags to control code size, enabled through ARM_DSP_CONFIG_TABLES. This became too complex and has been removed. Code size optimization now relies on the linker.

You no longer need compilation flags such as ARM_TABLE_TWIDDLECOEF_F32_2048 or ARM_FFT_ALLOW_TABLES. They have been removed.

Constant tables can use a lot of read-only memory, but the linker can remove unused functions and constant tables if it can deduce that they are not used.

For this, you need to use the right initialization functions in the library and the right linker options (which are toolchain dependent).

For all transform functions (CFFT, RFFT, ...), instead of using a generic initialization function that works for all lengths (like arm_cfft_init_f32), use a dedicated initialization function for a specific size (like arm_cfft_init_1024_f32).

By using the right initialization function, you're telling the linker what is really used.

If you use a generic function, the linker cannot deduce the used lengths and thus will keep all the constant tables required for each length.

You then need the right compiler and linker options so that the unused tables and functions are removed. These are toolchain dependent, but the options are generally named along the lines of -ffunction-sections, -fdata-sections, and --gc-sections.
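The difference between the two kinds of initialization function can be illustrated with a self-contained analogue. Every name below is hypothetical and only mimics the shape of the library's initialization functions; the tables are tiny stand-ins for real constant tables.

```c
#include <stddef.h>

/* Hypothetical stand-ins for per-length constant tables. */
static const float table_16[4] = { 1.0f, 0.5f, 0.25f, 0.125f };
static const float table_32[4] = { 2.0f, 1.0f, 0.5f, 0.25f };

typedef struct {
    size_t       length;
    const float *table;
} demo_instance;

/* Generic init: the length is only known at run time, so this function
   references every table and the linker has to keep all of them. */
int demo_init(demo_instance *s, size_t length)
{
    switch (length) {
    case 16: s->table = table_16; break;
    case 32: s->table = table_32; break;
    default: return -1;            /* unsupported length */
    }
    s->length = length;
    return 0;
}

/* Dedicated init: references a single table. If the application calls
   only this function, demo_init and table_32 become unreferenced and
   can be discarded by the linker. */
int demo_init_16(demo_instance *s)
{
    s->length = 16;
    s->table  = table_16;
    return 0;
}
```

With a GCC-style toolchain, this removal typically requires compiling with -ffunction-sections and -fdata-sections and linking with --gc-sections (passed as -Wl,--gc-sections through the compiler driver). Other toolchains use different option names.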

Variations between the architectures

Some algorithms may give slightly different results on different architectures (like M0 or M4/M7 or M55). It is a tradeoff made for speed reasons and to make best use of the different instruction sets.

All algorithms are compared with a double-precision reference, and the different versions (for different architectures) have the same characteristics when compared to that reference (SNR bound, maximum bound for sample error, ...).

As a consequence, the small differences that may exist between the different architecture implementations should be too small to have any practical consequences.
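A comparison against a double-precision reference can be sketched with two metrics: the largest per-sample error and a signal-to-noise bound expressed as a linear energy ratio. The function names and the threshold are hypothetical; this is not the library's own test framework.

```c
#include <stddef.h>

/* Largest absolute per-sample difference between reference and result. */
double max_abs_error(const double *ref, const float *out, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double err = ref[i] - (double)out[i];
        if (err < 0.0) {
            err = -err;
        }
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}

/* Returns 1 when (signal energy / error energy) >= min_ratio, else 0.
   A ratio of 1e6 corresponds to an SNR of 60 dB. */
int snr_at_least(const double *ref, const float *out, size_t n,
                 double min_ratio)
{
    double signal = 0.0;
    double noise  = 0.0;
    for (size_t i = 0; i < n; i++) {
        double err = ref[i] - (double)out[i];
        signal += ref[i] * ref[i];
        noise  += err * err;
    }
    return signal >= min_ratio * noise;
}
```

Running both checks with the same thresholds on the output of each architecture-specific build is one way to confirm that the versions share the same characteristics, even when their outputs are not bit-identical.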

License

CMSIS-DSP is provided free of charge under the Apache 2.0 License.