Arm C Language Extensions

Table of Contents

Preface ^

Abstract ^

This document specifies the Arm C Language Extensions to enable C/C++ programmers to exploit the Arm architecture with minimal restrictions on source code portability.

Keywords ^

ACLE, ABI, C, C++, compiler, armcc, gcc, intrinsic, macro, attribute, Neon, SIMD, SVE, SVE2, atomic

Latest release and defects report ^

For the latest release of this document, see the ACLE project on GitHub.

Please report defects in this specification to the issue tracker page on GitHub.

License ^

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Grant of Patent License. Subject to the terms and conditions of this license (both the Public License and this Patent License), each Licensor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Licensed Material, where such license applies only to those patent claims licensable by such Licensor that are necessarily infringed by their contribution(s) alone or by combination of their contribution(s) with the Licensed Material to which such contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Licensed Material or a contribution incorporated within the Licensed Material constitutes direct or contributory patent infringement, then any licenses granted to You under this license for that Licensed Material shall terminate as of the date such litigation is filed.

About the license ^

As identified more fully in the License section, this project is licensed under CC-BY-SA-4.0 along with an additional patent license. The language in the additional patent license is largely identical to that in Apache-2.0 (specifically, Section 3 of Apache-2.0 as reflected at https://www.apache.org/licenses/LICENSE-2.0) with two exceptions.

First, several changes were made related to the defined terms so as to reflect the fact that such defined terms need to align with the terminology in CC-BY-SA-4.0 rather than Apache-2.0 (for example, changing “Work” to “Licensed Material”).

Second, the defensive termination clause was changed such that the scope of defensive termination applies to “any licenses granted to You” (rather than “any patent licenses granted to You”). This change is intended to help maintain a healthy ecosystem by providing additional protection to the community against patent litigation claims.

Contributions ^

Contributions to this project are licensed under an inbound=outbound model such that any such contributions are licensed by the contributor under the same terms as those in the LICENSE file.

We do not require copyright assignment. The original contributor will retain the copyright.

Trademark notice ^

The text of and illustrations in this document are licensed by Arm under a Creative Commons Attribution–Share Alike 4.0 International license (“CC-BY-SA-4.0”), with an additional clause on patents. The Arm trademarks featured here are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Please visit https://www.arm.com/company/policies/trademarks for more information about Arm’s trademarks.

About this document ^

Change control ^

Current Status and Anticipated Changes ^

The following support level definitions are used by the ACLE specifications:

Release

Arm considers this specification to have enough implementations, which have received sufficient testing, to verify that it is correct. The details of these criteria are dependent on the scale and complexity of the change over previous versions: small, simple changes might only require one implementation, but more complex changes require multiple independent implementations, which have been rigorously tested for cross-compatibility. Arm anticipates that future changes to this specification will be limited to typographical corrections, clarifications and compatible extensions.

Beta

Arm considers this specification to be complete, but existing implementations do not meet the requirements for confidence in its release quality. Arm might need to make incompatible changes if issues emerge from its implementation.

Alpha

The content of this specification is a draft, and Arm considers the likelihood of future incompatible changes to be significant.

All content in this document is at the Release quality level, unless a different support level is specified in the text.

Change history ^

Issue Date By Change
A 11/11/11 AG First release
B 13/11/13 AG Version 1.1. Editorial changes. Corrections and completions to intrinsics as detailed in 3.3. Updated for C11/C++11.
C 09/05/14 TB Version 2.0. Updated for Armv8 AArch32 and AArch64.
D 24/03/16 TB Version 2.1. Updated for Armv8.1 AArch32 and AArch64.
E 02/06/17 Arm Version ACLE Q2 2017. Updated for Armv8.2-A and Armv8.3-A.
F 30/04/18 Arm Version ACLE Q2 2018. Updated for Armv8.4-A.
G 30/03/19 Arm Version ACLE Q1 2019. Updated for Armv8.5-A and MVE. Various bugfixes.
H 30/06/19 Arm Version ACLE Q2 2019. Updated for TME and more Armv8.5-A intrinsics. Various bugfixes.
ACLE Q3 2019 30/09/19 Arm Version ACLE Q3 2019.
ACLE Q4 2019 31/12/19 Arm Version ACLE Q4 2019.
ACLE Q2 2020 31/05/20 Arm Version ACLE Q2 2020.
ACLE Q3 2020 31/10/20 Arm Version ACLE Q3 2020.
2021Q2 02 July 2021 Arm Version ACLE Q2 2021. Open source version. NFCI.
2021Q3 30 September 2021 Arm Minor re-wording. NFCI.
2021Q4 11 January 2022 Arm See Changes between ACLE Q3 2021 and ACLE Q4 2021
2022Q1 06 April 2022 Arm See Changes between ACLE Q4 2021 and ACLE Q1 2022
2022Q2 01 July 2022 Arm See Changes between ACLE Q1 2022 and ACLE Q2 2022
2022Q4 23 November 2022 Arm See Changes between ACLE Q2 2022 and ACLE Q4 2022
2023Q2 04 August 2023 Arm See Changes between ACLE Q4 2022 and ACLE Q2 2023
2024Q1 11 April 2024 Arm See Changes between ACLE Q2 2023 and ACLE Q1 2024
2024Q2 21 June 2024 Arm See Changes between ACLE Q1 2024 and ACLE Q2 2024
2024Q3 30 September 2024 Arm See Changes between ACLE Q2 2024 and ACLE Q3 2024

Changes between ACLE Q2 2017 and ACLE Q2 2018 ^

Most changes in ACLE Q2 2018 are updates to support features introduced in Armv8.3-A [ARMARMv83] and Armv8.4-A [ARMARMv84]. Support is added for the Complex addition and Complex MLA intrinsics (Armv8.3-A) and for the Dot Product intrinsics (Armv8.4-A).

Changes between ACLE Q2 2018 and ACLE Q1 2019 ^

Changes between ACLE Q1 2019 and ACLE Q2 2019 ^

Changes between ACLE Q2 2019 and ACLE Q3 2019 ^

Changes between ACLE Q3 2019 and ACLE Q4 2019 ^

Changes between ACLE Q4 2019 and ACLE Q2 2020 ^

Changes between ACLE Q2 2020 and ACLE Q3 2020 ^

Changes between ACLE Q2 2021 and ACLE Q3 2021 ^

Changes between ACLE Q3 2021 and ACLE Q4 2021 ^

Changes between ACLE Q4 2021 and ACLE Q1 2022 ^

Changes between ACLE Q1 2022 and ACLE Q2 2022 ^

Changes between ACLE Q2 2022 and ACLE Q4 2022 ^

Changes between ACLE Q4 2022 and ACLE Q2 2023 ^

Changes between ACLE Q2 2023 and ACLE Q1 2024 ^

Changes between ACLE Q1 2024 and ACLE Q2 2024 ^

Changes between ACLE Q2 2024 and ACLE Q3 2024 ^

References ^

This document refers to the following documents.

Terms and abbreviations ^

This document uses the following terms and abbreviations.

Term Meaning
AAPCS Arm Procedure Call Standard, part of the ABI, defined in [AAPCS].
ABI Arm Application Binary Interface.
ACLE Arm C Language Extensions, as defined in this document.
Advanced SIMD A 64-bit/128-bit SIMD instruction set defined as part of the Arm architecture.
FFR The SVE first-fault register.
FFRT The SVE “first-fault register token”. This is a conceptual construct that forms part of the ACLE model of first-faulting and non-faulting loads; see First-faulting and non-faulting loads for details.
ILP32 A 32-bit address mode where long is a 32-bit type.
LLP64 A 64-bit address mode where long is a 32-bit type.
LP64 A 64-bit address mode where long is a 64-bit type.
Neon An implementation of the Arm Advanced SIMD extensions.
SIMD Any instruction set that operates simultaneously on multiple elements of a vector data type.
sizeless type A C and C++ type that can be used to create objects, but that has no measurable size; see Sizeless types for details.
SVE The Armv8-A Scalable Vector Extension. Also used more generally to include SVE2 and other SVE extensions.
SVE2 An Armv9-A extension of SVE.
SVL Streaming Vector Length; that is, the number of bits in an SVE vector type like svint32_t when the processor is in streaming mode.
SVL.B As for SVL, but measured in bytes rather than bits.
Thumb The Thumb instruction set extension to Arm.
VG The number of 64-bit elements (“vector granules”) in an SVE vector.
VFP The original Arm non-SIMD floating-point instruction set.
build attributes Object build attributes indicating configuration, as defined in [BA].
word A 32-bit quantity, in memory or a register.

Terms used to specify C and C++ semantics ^

The following terms are used to specify C and C++ semantics:

abstract machine

The conceptual machine that the C and C++ language standards use to define the behavior of programs.

evaluated call

A call that does not occur in an “unevaluated operand”; see section [expr.context] in the C++ standard for details.

For example, any calls that occur in the operand of a sizeof expression are not evaluated.

external linkage

A function has “external linkage” if there is a single definition that can be referenced by name from more than one translation unit. See [basic.link] in the C++ standard for more details.

As noted in Intrinsics, it is unspecified whether ACLE intrinsics are functions and, if so, what linkage they have. However, certain ACLE support functions are defined to have external linkage.

ill-formed programs or pieces of programs

Programs or pieces of programs that violate a rule due to their static construction rather than due to their runtime behavior.

Ill-formed programs should usually be rejected with at least one diagnostic. However, there are some ill-formed C++ constructs for which “no diagnostic is required”; see the [intro] section of the C++ standard for details. Many of these constructs could in principle use ACLE features.

In order to cope with such cases, ACLE does not say when ill-formed programs should be rejected. However, from a quality-of-implementation perspective, it is better to reject ill-formed programs wherever possible.

unprototyped functions

In early versions of C, it was possible to call a function without declaring it first. The function was then assumed to return an int. For example, this was a valid complete translation unit:

  int x() { return some_func(1, 2.0, "apples"); }

It was also possible to declare a function’s return type without specifying its argument types. For example:

  double another_func();
  double f() { return another_func(1.0, 2, "oranges"); }

Functions such as some_func and another_func are referred to as (K&R-style) “unprototyped” functions. The first C standard categorized these functions as an obsolescent feature and C23 removed all remaining support for them.

Conventions ^

Most SVE ACLE intrinsics have two names: a longer unique name and a shorter overloaded alias. The convention adopted in this document is to enclose characters in square brackets if they are only present in the longer name. For example:

  svclz[_u16]_m

refers to an intrinsic whose full name is svclz_u16_m and whose overloaded alias is svclz_m.

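For example, the following two calls are equivalent (a minimal sketch, assuming the usual SVE merging form in which the first argument supplies the values of inactive elements):

  #include <arm_sve.h>

  svuint16_t clz_full(svbool_t pg, svuint16_t x) {
    return svclz_u16_m(x, pg, x);   /* full, unique name */
  }

  svuint16_t clz_alias(svbool_t pg, svuint16_t x) {
    return svclz_m(x, pg, x);       /* shorter overloaded alias */
  }
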
Scope ^

The Arm C Language Extensions (ACLE) specification describes source language extensions and implementation choices that C/C++ compilers can implement in order to allow programmers to better exploit the Arm architecture.

The extensions include:

This specification does not standardize command-line options, diagnostics or other external behavior of compilers.

The intended users of this specification are:

ACLE is not a hardware abstraction layer (HAL) and does not specify a library component, but it may make it easier to write a HAL or other low-level library in C rather than assembler.

Cortex-M Security Extension (CMSE) ^

ACLE support for the Cortex-M Security Extension (CMSE) is defined in the Armv8-M Security Extensions: Requirements on Development Tools document [CMSE-ACLE].

Introduction ^

The Arm architecture includes features that go beyond the set of operations available to C/C++ programmers. The intention of the Arm C Language Extensions (ACLE) is to allow the creation of applications and middleware code that is portable across compilers, and across Arm architecture variants, while exploiting the advanced features of the Arm architecture.

The design principles for ACLE can be summarized as:

ACLE incorporates some language extensions introduced in the GCC C compiler. Current GCC documentation [GCC] can be found at http://gcc.gnu.org/onlinedocs/gcc. Formally it should be assumed that ACLE refers to the documentation for GCC 4.5.1: http://gcc.gnu.org/onlinedocs/gcc-4.5.1/gcc/.

Some of the ACLE extensions are not specific to the Arm architecture but have proven to be of particular benefit in low-level and systems programming; examples include features for controlling the alignment and packing of data, and some common operations such as word rotation and reversal. As and when features become available in international standards (and implementations), Arm recommends that you use these in preference to ACLE. When implementations are widely available, any ACLE-specific features can be expected to be deprecated.

Portable binary objects ^

In AArch32, the ABI for the Arm Architecture defines a set of build attributes [BA]. These attributes are intended to facilitate generating cross-platform portable binary object files by providing a mechanism to determine the compatibility of object files. In AArch64, the ABI does not define a standard set of build attributes and takes the approach that binaries are, in general, not portable across platforms. References to build attributes in this document should be interpreted as applying only to AArch32.

C language extensions ^

Data types ^

This section overlaps with the specification of the Arm Procedure Call Standard, particularly [AAPCS] (4.1). ACLE extends some of the guarantees of C, allowing assumptions to be made in source code beyond those permitted by Standard C.

ACLE extends C by providing some types not present in Standard C and defining how they are dealt with by the AAPCS. These types fall into two groups:

The former group includes things like vector data types, which are defined by the header file <arm_neon.h>.

The predefined types are:

Implementation-defined type properties ^

ACLE and the Arm ABI allow implementations some freedom in order to conform to long-standing conventions in various environments. It is suggested that implementations set suitable defaults for their environment but allow the default to be overridden.

The signedness of a plain int bit-field is implementation-defined.

Whether the underlying type of an enumeration is minimal or at least 32-bit is implementation-defined. The predefined macro __ARM_SIZEOF_MINIMAL_ENUM should be defined as 1 or 4 according to the size of a minimal enumeration type such as enum { X=0 }. An implementation that conforms to the Arm ABI must reflect its choice in the Tag_ABI_enum_size build attribute.

wchar_t may be 2 or 4 bytes. The predefined macro __ARM_SIZEOF_WCHAR_T should be defined as the same number. An implementation that conforms to the Arm ABI must reflect its choice in the Tag_ABI_PCS_wchar_t build attribute.

Predefined macros ^

Several predefined macros are defined. Generally, these define features of the Arm architecture being targeted, or how the C/C++ implementation uses the architecture. These macros are detailed in Feature test macros. All ACLE predefined macros start with the prefix __ARM_.

Keyword attributes ^

This section is in Beta state and may change or be extended in the future.

ACLE adds several non-standard keywords to C and C++. These keywords all start with the prefix __arm_. (However, not everything that starts with __arm_ is a keyword.)

The keywords are modeled after standard [[…]] attributes and fulfill a similar role to them. However, the C++ standard says that “any [non-standard attribute] that is not recognized by the implementation is ignored” [dcl.attr.grammar]. It is therefore not appropriate to use standard attributes for properties that affect the type system, that affect the binary interface of the generated code, or that must be honored for other reasons. ACLE uses keywords for some of these properties instead. The keywords are therefore known as “keyword attributes”.

A keyword attribute can appear wherever a standard [[…]] attribute can appear. Following the terminology of the C++ standard, a keyword attribute “appertains” to whatever a standard attribute would appertain to. For example, consider the placeholders A, B, C and D in the following function declaration:

  /*A*/ void /*B*/ f /*C*/ (void) /*D*/;

Here:

Position A is preferred over position C for ACLE code. This is because position A can also be used for GNU-style attributes, whereas position C cannot.

The table below lists all the ACLE keyword attributes. The “Target” column says what the keyword is allowed to appertain to. For example, a function type keyword can appear in position D above, but it cannot appear in positions A, B or C.

Keyword attributes are only guaranteed to be supported by compilers that predefine the associated macro to a nonzero value.

Name Target Predefined macro
__arm_locally_streaming function declaration __ARM_FEATURE_LOCALLY_STREAMING
__arm_in function type Argument-dependent
__arm_inout function type Argument-dependent
__arm_new function declaration Argument-dependent
__arm_out function type Argument-dependent
__arm_preserves function type Argument-dependent
__arm_streaming function type __ARM_FEATURE_SME
__arm_streaming_compatible function type __ARM_FEATURE_SME

Using a keyword attribute multiple times is equivalent to using it once.

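For example, based on the table above, a function type keyword such as __arm_streaming is written after the declarator (position D in the earlier example), while a function declaration keyword such as __arm_locally_streaming can be written at position A. This is a sketch only, assuming an implementation that defines the corresponding feature macros:

  /* Function type keyword (position D): part of the type of callee.  */
  void callee(void) __arm_streaming;

  /* Function declaration keyword (position A): applies to the definition of worker.  */
  __arm_locally_streaming void worker(void) {}
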
Intrinsics ^

ACLE standardizes intrinsics to access various features of the Arm® architecture. It also standardizes a set of header files that provide access to these intrinsics.

Whether intrinsics are macros, functions or built-in operators is unspecified. For example:

However, each argument must be evaluated at most once. So this definition is acceptable:

  #define __rev(x) __builtin_bswap32(x)

but this is not:

  #define __rev(x) ((((x) & 0xff) << 24) | (((x) & 0xff00) << 8) | \
    (((x) & 0xff0000) >> 8) | ((x) >> 24))

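The distinction matters when an argument has side effects. In the following sketch, the argument *p++ must advance the pointer exactly once per call, which only the first definition above guarantees:

  #include <arm_acle.h>
  #include <stdint.h>

  uint32_t sum_reversed(const uint32_t *p, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; ++i)
      sum += __rev(*p++);   /* *p++ is evaluated exactly once per iteration */
    return sum;
  }
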
Constant arguments to intrinsics ^

Some intrinsics may require arguments that are constant at compile-time, to supply data that is encoded into the immediate fields of an instruction. Typically, these intrinsics require an integral-constant-expression in a specified range, or sometimes a string literal. An implementation should produce a diagnostic if the argument does not meet the requirements.

Header files ^

ACLE standardizes various header files that provide access to intrinsics and their associated data types. It also standardizes feature test macros that indicate which header files are available.

Some architecture features have a dedicated header file; for example, <arm_neon.h> provides access to the Advanced SIMD (Neon) intrinsics. <arm_acle.h> provides a catch-all for intrinsics that do not belong to a more specific header file.

These headers behave as standard library headers; repeated inclusion has no effect beyond the first include.

Except where noted otherwise, it is unspecified whether the ACLE headers include the standard headers <assert.h>, <stdint.h> or <inttypes.h>. However, the ACLE headers will not define the standard type names (for example uint32_t) except by inclusion of the standard headers. Arm recommends that you include the standard headers explicitly if the associated types and macros are needed.

In C++, the following source code fragments are expected to work correctly:

  #include <stdint.h>
  // UINT64_C not defined here since we did not set __STDC_FORMAT_MACROS
  ...
  #include <arm_neon.h>

and:

  #include <arm_neon.h>
  ...
  #define __STDC_FORMAT_MACROS
  #include <stdint.h>
  // ... UINT64_C is now defined

<arm_acle.h> ^

<arm_acle.h> provides access to intrinsics that do not belong to the more specific header files below. These intrinsics are in the C implementation namespace and begin with double underscores. It is unspecified whether they are available without the header being included. When __ARM_ACLE is defined to 1, the header file is guaranteed to be available.

  #ifdef __ARM_ACLE
  #include <arm_acle.h>
  #endif /* __ARM_ACLE */

<arm_fp16.h> ^

<arm_fp16.h> is provided to define the scalar 16-bit floating point arithmetic intrinsics. As these intrinsics are in the user namespace, an implementation would not normally define them until the header is included. When __ARM_FEATURE_FP16_SCALAR_ARITHMETIC is defined to 1, the header file is available regardless of the context in which the macro is evaluated.

  #ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
  #include <arm_fp16.h>
  #endif /* __ARM_FEATURE_FP16_SCALAR_ARITHMETIC */

<arm_bf16.h> ^

<arm_bf16.h> is provided to define the 16-bit brain floating point arithmetic intrinsics. As these intrinsics are in the user namespace, an implementation would not normally define them until the header is included. When __ARM_FEATURE_BF16 is defined to 1, the header file is guaranteed to be available regardless of the context in which the macro is evaluated.

  #ifdef __ARM_FEATURE_BF16
  #include <arm_bf16.h>
  #endif /* __ARM_FEATURE_BF16 */

When __ARM_BF16_FORMAT_ALTERNATIVE is defined to 1, the only scalar intrinsics available are conversions between bfloat16_t and float32_t. These intrinsics are:

<arm_neon.h> ^

<arm_neon.h> is provided to define the Advanced SIMD (Neon) intrinsics and associated data types. As these intrinsics and data types are in the user namespace, an implementation would not normally define them until the header is included. When __ARM_NEON is defined to 1, the header file is available regardless of the context in which the macro is evaluated.

  #ifdef __ARM_NEON
  #include <arm_neon.h>
  #endif /* __ARM_NEON */

Including <arm_neon.h> will also cause the following header files to be included, if the header files are available:

<arm_sve.h> ^

<arm_sve.h> defines data types and intrinsics for SVE and its extensions; see SVE language extensions and intrinsics for details. When __ARM_FEATURE_SVE is defined to 1, the header file is available regardless of the context in which the macro is evaluated.

  #ifdef __ARM_FEATURE_SVE
  #include <arm_sve.h>
  #endif /* __ARM_FEATURE_SVE */

Including <arm_sve.h> also includes the following header files:

<arm_neon_sve_bridge.h> ^

<arm_neon_sve_bridge.h> defines intrinsics for moving data between Neon and SVE vector types; see NEON-SVE Bridge for details. Before including the header, you should test the __ARM_NEON_SVE_BRIDGE macro:

  #ifdef __ARM_NEON_SVE_BRIDGE
  #include <arm_neon_sve_bridge.h>
  #endif /* __ARM_NEON_SVE_BRIDGE */

Including <arm_neon_sve_bridge.h> will also include <arm_neon.h> and <arm_sve.h>.

<arm_mve.h> ^

<arm_mve.h> is provided to define the M-profile Vector Extension (MVE) intrinsics and associated data types. The data types occupy the user namespace. By default the intrinsics occupy both the user namespace and the __arm_ namespace; defining __ARM_MVE_PRESERVE_USER_NAMESPACE will hide the definition of the user namespace variants.

The __ARM_FEATURE_MVE macro should be tested before including the header:

  #if (__ARM_FEATURE_MVE & 3) == 3
  #include <arm_mve.h>
  /* MVE integer and floating point intrinsics are now available to use.  */
  #elif __ARM_FEATURE_MVE & 1
  #include <arm_mve.h>
  /* MVE integer intrinsics are now available to use.  */
  #endif

<arm_sme.h> ^

The specification for SME is in Beta state and may change or be extended in the future.

<arm_sme.h> declares functions and defines intrinsics for SME and its extensions; see SME language extensions and intrinsics for details. When __ARM_FEATURE_SME is defined to 1, the header file is available regardless of the context in which the macro is evaluated.

  #ifdef __ARM_FEATURE_SME
  #include <arm_sme.h>
  #endif

Including <arm_sme.h> also includes <arm_sve.h>.

Predefined feature macros and header files ^

Evaluating a feature macro indicates the availability of intrinsics and inline assembly for that feature, but no assumptions should be made about the order or context in which the preprocessor macros are evaluated. For example:

    __attribute__((target("+sve")))
    void foo() {
    #ifdef __ARM_FEATURE_SVE
      // The user should make no assumptions that the target attribute
      // enables the __ARM_FEATURE_SVE macro.
    #endif
    }

The compiler might add additional restrictions to the intrinsics beyond what is captured by the ACLE macros depending on the context in which the intrinsics are used. For example:

    #include <arm_sme.h>
    void foo(svbool_t pg, void *ptr, uint32_t slice_base) {
    #ifdef __ARM_FEATURE_SME
      svst1_hor_za8(0, slice_base, pg, ptr);
    #endif
    }

If __ARM_FEATURE_SME evaluates to true, the SME intrinsic svst1_hor_za8 is available, but foo might still fail to compile because the call does not occur in a streaming statement.

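By way of illustration, one way to satisfy that requirement is to give foo a streaming interface. This is a sketch only, assuming the SME keyword attributes listed in Keyword attributes and the __arm_inout("za") state-sharing attribute from the SME language extensions; the exact attributes needed depend on the intrinsic:

    #include <arm_sme.h>
    void foo(svbool_t pg, void *ptr, uint32_t slice_base)
        __arm_streaming __arm_inout("za") {
    #ifdef __ARM_FEATURE_SME
      svst1_hor_za8(0, slice_base, pg, ptr);
    #endif
    }
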
Attributes ^

GCC-style attributes are provided to annotate types, objects and functions with extra information, such as alignment. These attributes are defined in Attributes and pragmas.

Implementation strategies ^

An implementation may choose to define all the ACLE non-Neon intrinsics as true compiler intrinsics, that is, built-in functions. The <arm_acle.h> header would then have no effect.

Alternatively, <arm_acle.h> could define the ACLE intrinsics in terms of already supported features of the implementation, for example compiler intrinsics with other names, or inline functions using inline assembler.

Half-precision floating-point ^

ACLE defines the __fp16 type, which can be used for half-precision (16-bit) floating-point in one of two formats: the binary16 format defined in [IEEE-FP], referred to as the IEEE format, and an alternative format defined by Arm, referred to as the alternative format, which extends the range by removing support for infinities and NaNs. Both formats are described in [ARMARM] (A2.7.4), [ARMARMv8] (A1.4.2).

Toolchains are not required to support the alternative format, and use of the alternative format precludes use of the ISO/IEC TS 18661-3 [CFP15] _Float16 type and the Armv8.2-A 16-bit floating-point extensions. For these reasons, Arm deprecates the use of the alternative format for half precision in ACLE.

The format in use can be selected at runtime but ACLE assumes it is fixed for the life of a program. If the __fp16 type is available, one of __ARM_FP16_FORMAT_IEEE and __ARM_FP16_FORMAT_ALTERNATIVE will be defined to indicate the format in use. An implementation conforming to the Arm ABI will set the Tag_ABI_FP_16bit_format build attribute.

The __fp16 type can be used in two ways: using the intrinsics ACLE defines when the Armv8.2-A 16-bit floating-point extensions are available, and using the standard C operators. When using standard C operators, values of __fp16 type promote to (at least) float when used in arithmetic operations, in the same way that values of char or short types promote to int. There is no support for arithmetic directly on __fp16 values using standard C operators.

  void add(__fp16 a, __fp16 b) {
    a + b; /* a and b are promoted to (at least) float.
              Operation takes place with (at least) 32-bit precision.  */
    vaddh_f16 (a, b); /* a and b are not promoted.
                         Operation takes place with 16-bit precision.  */
  }

Armv8 introduces floating point instructions to convert 64-bit to 16-bit, that is, from double to __fp16. These instructions are not available in earlier architectures, which therefore have to rely on emulation libraries or a sequence of instructions to achieve the conversion.

Providing emulation libraries for half-precision floating point conversions when not implemented in hardware is implementation-defined. For this reason, it is preferable to write the conversion explicitly by way of single precision:

  double xd;
  __fp16 xs = (float)xd;

rather than:

  double xd;
  __fp16 xs = xd;

In some older implementations, __fp16 cannot be used as an argument or result type, though it can be used as a field in a structure passed as an argument or result, or passed via a pointer. The predefined macro __ARM_FP16_ARGS should be defined if __fp16 can be used as an argument and result. C++ name mangling is Dh as defined in [cxxabi], and is the same for both the IEEE and alternative formats.

In this example, the floating-point addition is done in single (32-bit) precision:

  void add(__fp16 *z, __fp16 const *x, __fp16 const *y, int n) {
    int i;
    for (i = 0; i < n; ++i) z[i] = x[i] + y[i];
  }

Relationship between __fp16 and ISO/IEC TS 18661 ^

ISO/IEC TS 18661-3 [CFP15] is a published extension to [C11] which describes a language binding for the [IEEE-FP] standard for floating point arithmetic. This language binding introduces a mapping to an unlimited number of interchange and extended floating-point types, on which binary arithmetic is well defined. These types are of the form _FloatN, where N gives the size in bits of the type.

One instantiation of the interchange types introduced by [CFP15] is the _Float16 type. ACLE defines the __fp16 type as a storage and interchange only format, on which arithmetic operations are defined to first promote to a type with at least the range and precision of float.

This has implications for the result of operations which would result in rounding had the operation taken place in a native 16-bit type. As software may rely on this behavior for correctness, arithmetic operations on __fp16 are defined to promote even when the Armv8.2-A fp16 extension is available.

Arm recommends that portable software is written to use the _Float16 type defined in [CFP15].

Type conversion between a value of type __fp16 and a value of type _Float16 leaves the object representation of the converted value unchanged.

When __ARM_FP16_FORMAT_IEEE == 1, this has no effect on the value of the object. However, as the representation of certain values has a different meaning when using the Arm alternative format for 16-bit floating point values [ARMARM] (A2.7.4) [ARMARMv8] (A1.4.2), when __ARM_FP16_FORMAT_ALTERNATIVE == 1 the type conversion may introduce or remove infinity or NaN representations.

Arm recommends that software implementations warn on type conversions between __fp16 and _Float16 if __ARM_FP16_FORMAT_ALTERNATIVE == 1.

In an arithmetic operation where one operand is of __fp16 type and the other is of _Float16 type, the _Float16 type is first converted to __fp16 type following the rules above, and then the operation is completed as if both operands were of __fp16 type.

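For example (a minimal sketch of the rule above):

  float mixed_add(__fp16 x, _Float16 y) {
    /* y is first converted to __fp16; both operands then promote to
       (at least) float and the addition is performed in float.  */
    return x + y;
  }
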
[CFP15] and [C11] do not define vector types; however, many C implementations do provide these extensions. Where they exist, type conversion between a value of type vector of __fp16 and a value of type vector of _Float16 leaves the object representation of the converted value unchanged.

ACLE does not define vector of _Float16 types.

Half-precision brain floating-point ^

ACLE defines the __bf16 type, which can be used for half-precision (16-bit) brain floating-point values in an alternative format, defined by Arm, that closely resembles the IEEE 754 single-precision floating-point format.

The __bf16 type is only available when the __ARM_BF16_FORMAT_ALTERNATIVE feature macro is defined. When it is available, it can only be used by the ACLE intrinsics; it cannot be used with standard C operators. It is expected that arithmetic using standard C operators is performed in a single-precision floating-point format, with the value converted to __bf16 when required using ACLE intrinsics.

Armv8.2-A introduces floating point instructions to convert 32-bit to brain 16-bit, that is, from float to __bf16. These instructions are not available in earlier architectures, which therefore have to rely on emulation libraries or a sequence of instructions to achieve the conversion.

Providing emulation libraries for half-precision floating point conversions when not implemented in hardware is implementation-defined.

ACLE defines the __mfp8 type, which can be used for the E5M2 and E4M3 8-bit floating-point formats (“FP8”). It is a storage and interchange only type with no arithmetic operations other than intrinsic calls.

Architecture and CPU names ^

Introduction ^

The intention of this section is to standardize architecture names, for example for use in compiler command lines. Toolchains should accept these names case-insensitively where possible, or use all lowercase where not possible. Tools may apply local conventions such as using hyphens instead of underscores.

(Note: processor names, including from the Arm Cortex® processor family, are used as illustrative examples. This specification is applicable to any processors implementing the Arm architecture.)

Architecture names ^

CPU architecture ^

The recommended CPU architecture names are as specified under Tag_CPU_arch in [BA]. For details of how to use predefined macros to test architecture in source code, see Instruction set architecture and features.

The following table lists the architectures and the A32 and T32 instruction set versions.

Name Features A32 T32 Example processor
Armv4 Armv4 4   DEC/Intel StrongARM
Armv4T Armv4 with Thumb instruction set 4 2 Arm7TDMI
Armv5T Armv5 with Thumb instruction set 5 2 Arm10TDMI
Armv5TE Armv5T with DSP extensions 5 2 Arm9E, Intel XScale
Armv5TEJ Armv5TE with Jazelle® extensions 5 2 Arm926EJ
Armv6 Armv6 (includes TEJ) 6 2 Arm1136J r0
Armv6K Armv6 with kernel extensions 6 2 Arm1136J r1
Armv6T2 Armv6 with Thumb-2 architecture 6 3 Arm1156T2
Armv6Z Armv6K with Security Extensions (includes K) 6 2 Arm1176JZ-S
Armv6-M T32 (M-profile)   2 Cortex-M0, Cortex-M1
Armv7-A Armv7 application profile 7 4 Cortex-A8, Cortex-A9
Armv7-R Armv7 realtime profile 7 4 Cortex-R4
Armv7-M Armv7 microcontroller profile: Thumb-2 instructions only   4 Cortex-M3
Armv7E-M Armv7-M with DSP extensions   4 Cortex-M4
Armv8-A AArch32 Armv8 application profile 8 4 Cortex-A57, Cortex-A53
Armv8-A AArch64 Armv8 application profile 8   Cortex-A57, Cortex-A53

Note that there is some architectural variation that is not visible through ACLE; either because it is only relevant at the system level (for example the Large Physical Address Extension) or because it would be handled by the compiler (for example hardware divide might or might not be present in the Armv7-A architecture).

FPU architecture ^

For details of how to test FPU features in source code, see Floating-point and vector hardware. In particular, for testing which precisions are supported in hardware, see Hardware floating point.

Name Features Example processor
VFPv2 VFPv2 Arm1136JF-S
VFPv3 VFPv3 Cortex-A8
VFPv3_FP16 VFPv3 with FP16 Cortex-A9 (with Neon)
VFPv3_D16 VFPv3 with 16 D-registers Cortex-R4F
VFPv3_D16_FP16 VFPv3 with 16 D-registers and FP16 Cortex-A9 (without Neon), Cortex-R7
VFPv3_SP_D16 VFPv3 with 16 D-registers, single-precision only Cortex-R5 with SP-only
VFPv4 VFPv4 (including FMA and FP16) Cortex-A15
VFPv4_D16 VFPv4 (including FMA and FP16) with 16 D-registers Cortex-A5 (VFP option)
FPv4_SP FPv4 with single-precision only Cortex-M4.fp

CPU names ^

ACLE does not standardize CPU names for use in command-line options and similar contexts. Standard vendor product names should be used.

Object producers should place the CPU name in the Tag_CPU_name build attribute.

Feature test macros ^

Introduction ^

The feature test macros allow programmers to determine the availability of ACLE or subsets of it, or of target architectural features. This may indicate the availability of some source language extensions (for example intrinsics) or the likely level of performance of some standard C features, such as integer division and floating-point.

Several macros are defined as numeric values to indicate the level of support for particular features. These macros are undefined if the feature is not present. (Aside: in Standard C/C++, references to undefined macros expand to 0 in preprocessor expressions, so a comparison such as:

  #if __ARM_ARCH >= 7

will have the expected effect of evaluating to false if the macro is not defined.)

All ACLE macros begin with the prefix __ARM_. All ACLE macros expand to integral constant expressions suitable for use in an #if directive, unless otherwise specified. Syntactically, they must be primary-expressions; generally, this means an implementation should enclose them in parentheses if they are not simple constants.

Testing for Arm C Language Extensions ^

__ARM_ACLE is defined as the version of this specification that is implemented, formatted as {YEAR}{QUARTER}{PATCH}. The YEAR segment is composed of 4 digits, the QUARTER segment is composed of 1 digit, and the PATCH segment is also composed of 1 digit.

For example, the 2024 Q1 release with patch level 0 is represented as 202410.

NOTE: Previously, the macro was defined as 100 * major_version + minor_version, which was the version of this specification implemented. For instance, an implementation of version 2.1 of the ACLE specification defined __ARM_ACLE as 201.

__ARM_ACLE_VERSION(year, quarter, patch) is defined to express a given ACLE version. It returns the version number in the same format as __ARM_ACLE. A check for a minimum required ACLE version could be written as:

  #if __ARM_ACLE >= __ARM_ACLE_VERSION(2024, 1, 0)

Endianness ^

__ARM_BIG_ENDIAN is defined as 1 if data is stored by default in big-endian format. If the macro is not set, data is stored in little-endian format. (Aside: the “mixed-endian” format for double-precision numbers, used on some very old Arm FPU implementations, is not supported by ACLE or the Arm ABI.)

Instruction set architecture and features ^

References to the target architecture refer to the target as configured in the tools, for example by appropriate command-line options. This may be a subset or intersection of actual targets, in order to produce a binary that runs on more than one real architecture. For example, use of specific features may be disabled.

In the 32-bit architecture, some hardware features may be accessible from only one or other of A32 or T32 state. For example, in the v5TE and v6 architectures, DSP instructions and (where available) VFP instructions are only accessible in A32 state, while in the v7-R architecture, hardware divide is only accessible from T32 state. Where both states are available, the implementation should set feature test macros indicating that the hardware feature is accessible. To provide access to the hardware feature, an implementation might override the programmer’s preference for target instruction set, or generate an interworking call to a helper function. This mechanism is outside the scope of ACLE. In cases where the implementation is given a hard requirement to use only one state (for example to support validation, or post-processing), it should set feature test macros only for the hardware features available in that state as if compiling for a core where the other instruction set was not present.

An implementation that allows a user to indicate which functions go into which state (either as a hard requirement or a preference) is not required to change the settings of architectural feature test macros.

Arm architecture level ^

__ARM_ARCH is defined as an integer value indicating the current Arm instruction set architecture (for example 7 for the Armv7-A architecture implemented by Cortex-A8 or the Armv7-M architecture implemented by Cortex-M3, or 8 for the Armv8-A architecture implemented by Cortex-A57). From Armv8.1-A [ARMARMv81] onwards, the value of __ARM_ARCH is scaled up to include minor versions, and is calculated as follows:

For an Arm architecture ArmvX.Y, __ARM_ARCH = X * 100 + Y. For example, for Armv8.1 __ARM_ARCH = 801.

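For example, under this scheme a test for Armv8.1-A or later can be written as follows; architectures before Armv8.1-A report small, unscaled values (such as 7 or 8) and so fail the same comparison:

  #if defined(__ARM_ARCH) && __ARM_ARCH >= 801
    /* Armv8.1-A or later.  */
  #endif
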
Since ACLE only supports the Arm architecture, this macro is always defined in an ACLE implementation.

Note that the __ARM_ARCH macro is defined even for cores which only support the T32 instruction set.

Instruction set architecture (A32/T32/A64) ^

__ARM_ARCH_ISA_ARM is defined to 1 if the core supports the Arm instruction set. It is not defined for M-profile cores.

__ARM_ARCH_ISA_THUMB is defined to 1 if the core supports the original T32 instruction set (including the v6-M architecture) and 2 if it supports the T32 instruction set as found in the v6T2 architecture and all v7 architectures.

__ARM_ARCH_ISA_A64 is defined to 1 if the core supports AArch64’s A64 instruction set.

__ARM_32BIT_STATE is defined to 1 if code is being generated for AArch32.

__ARM_64BIT_STATE is defined to 1 if code is being generated for AArch64.

Architectural profile (A, R, M or pre-Cortex) ^

__ARM_ARCH_PROFILE is defined to be one of the char literals 'A', 'R', 'M' or 'S', or unset, according to the architectural profile of the target. 'S' indicates the common subset of the A and R profiles. The common subset of the A, R and M profiles is indicated by

  __ARM_ARCH == 7 && !defined (__ARM_ARCH_PROFILE)

This macro corresponds to the Tag_CPU_arch_profile object build attribute. It may be useful to writers of system code. It is expected that in most cases programmers will use more feature-specific tests.

The macro is undefined for architectural targets which predate the use of architectural profiles.

Unaligned access supported in hardware ^

__ARM_FEATURE_UNALIGNED is defined if the target supports unaligned access in hardware, at least to the extent of being able to load or store an integer word at any alignment with a single instruction. (There may be restrictions on load-multiple and floating-point accesses.) Note that whether a code generation target permits unaligned access will in general depend on the settings of system register bits, so an implementation should define this macro to match the user’s expectations and intentions. For example, a command-line option might be provided to disable the use of unaligned access, in which case this macro would not be defined.

LDREX/STREX ^

This feature was deprecated in ACLE 2.0. It is strongly recommended that C11/C++11 atomics be used instead. (See also Synchronization, barrier, and hint intrinsics.)

__ARM_FEATURE_LDREX is defined if the load/store-exclusive instructions (LDREX/STREX) are supported. Its value is a set of bits indicating available widths of the access, as powers of 2. The following bits are used:

Bit Value Access width Instruction
0 0x01 byte LDREXB/STREXB
1 0x02 halfword LDREXH/STREXH
2 0x04 word LDREX/STREX
3 0x08 doubleword LDREXD/STREXD

Other bits are reserved.

The following values of __ARM_FEATURE_LDREX may occur:

Macro value Access widths Example architecture
(undefined) none Armv5, Armv6-M
0x04 word Armv6
0x07 word, halfword, byte Armv7-M
0x0F doubleword, word, halfword, byte Armv6K, Armv7-A/R

Other values are reserved.

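For example, based on the bit assignments above, the availability of the doubleword exclusives can be tested as follows:

  #if defined(__ARM_FEATURE_LDREX) && (__ARM_FEATURE_LDREX & 0x08)
    /* LDREXD/STREXD are available.  */
  #endif
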
The LDREX/STREX instructions were introduced in later versions of the Arm architecture and supersede the SWP instruction. Where both are available, Arm strongly recommends that programmers use LDREX/STREX rather than SWP. Note that platforms may choose to make SWP unavailable in user mode and emulate it through a trap to a platform routine, or fault it.

CLZ ^

__ARM_FEATURE_CLZ is defined to 1 if the CLZ (count leading zeroes) instruction is supported in hardware. Note that ACLE provides the __clz() family of intrinsics (see Miscellaneous data-processing intrinsics) even when __ARM_FEATURE_CLZ is not defined.

Q (saturation) flag ^

__ARM_FEATURE_QBIT is defined to 1 if the Q (saturation) global flag exists and the intrinsics defined in The Q (saturation) flag are available. This flag is used with the DSP saturating-arithmetic instructions (such as QADD) and the width-specified saturating instructions (SSAT and USAT). Note that either of these classes of instructions may exist without the other: for example, v5E has only QADD while v7-M has only SSAT.

Intrinsics associated with the Q-bit and their feature macro __ARM_FEATURE_QBIT are deprecated in ACLE 2.0 for A-profile. They are fully supported for M-profile and R-profile. This macro is defined for AArch32 only.

DSP instructions ^

__ARM_FEATURE_DSP is defined to 1 if the DSP (v5E) instructions are supported and the intrinsics defined in Saturating intrinsics are available. These instructions include QADD, SMULBB and others. This feature also implies support for the Q flag.

__ARM_FEATURE_DSP and its associated intrinsics are deprecated in ACLE 2.0 for A-profile. They are fully supported for M and R-profiles. This macro is defined for AArch32 only.

Saturation instructions ^

__ARM_FEATURE_SAT is defined to 1 if the SSAT and USAT instructions are supported and the intrinsics defined in Width-specified saturation intrinsics are available. This feature also implies support for the Q flag.

__ARM_FEATURE_SAT and its associated intrinsics are deprecated in ACLE 2.0 for A-profile. They are fully supported for M and R-profiles. This macro is defined for AArch32 only.

32-bit SIMD instructions ^

__ARM_FEATURE_SIMD32 is defined to 1 if the 32-bit SIMD instructions are supported and the intrinsics defined in 32-bit SIMD Operations are available. This also implies support for the GE global flags which indicate byte-by-byte comparison results.

__ARM_FEATURE_SIMD32 is deprecated in ACLE 2.0 for A-profile. Users are encouraged to use Neon Intrinsics as an equivalent for the 32-bit SIMD intrinsics functionality. However, they are fully supported for M and R-profiles. This macro is defined for AArch32 only.

Hardware integer divide ^

__ARM_FEATURE_IDIV is defined to 1 if the target has hardware support for 32-bit integer division in all available instruction sets. Signed and unsigned versions are both assumed to be available. The intention is to allow programmers to choose alternative algorithm implementations depending on the likely speed of integer division.

Some older R-profile targets have hardware divide available in the T32 instruction set only. This can be checked using the following test:

    #if __ARM_FEATURE_IDIV || (__ARM_ARCH_PROFILE == 'R')

CRC32 extension ^

__ARM_FEATURE_CRC32 is defined to 1 if the CRC32 instructions are supported and the intrinsics defined in CRC32 intrinsics are available. These instructions include CRC32B, CRC32H and others. This is only available when __ARM_ARCH >= 8.

Random Number Generation Extension ^

__ARM_FEATURE_RNG is defined to 1 if the Random Number Generation instructions are supported and the intrinsics defined in Random number generation intrinsics are available.

Branch Target Identification ^

__ARM_FEATURE_BTI_DEFAULT is defined to 1 if the Branch Target Identification extension is used to protect branch destinations by default. The protection applied to any particular function may be overridden by mechanisms such as function attributes.

__ARM_FEATURE_BTI is defined to 1 if the Branch Target Identification extension is available on the target. It is undefined otherwise.

Pointer Authentication ^

__ARM_FEATURE_PAC_DEFAULT is defined as a bitmap to indicate the use of the Pointer Authentication extension to protect code against code reuse attacks by default. The bits are defined as follows:

Bit Meaning
0 Protection using the A key
1 Protection using the B key
2 Protection including leaf functions
3 Protection using PC as a diversifier

For example, a value of 0x5 indicates that the Pointer Authentication extension is used to protect function entry points, including leaf functions, using the A key for signing. The protection applied to any particular function may be overridden by mechanisms such as function attributes.

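For example, based on the bit assignments above, code can test whether the A key is used for signing by default:

  #if defined(__ARM_FEATURE_PAC_DEFAULT) && (__ARM_FEATURE_PAC_DEFAULT & 0x1)
    /* Pointer Authentication with the A key is enabled by default.  */
  #endif
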
__ARM_FEATURE_PAUTH is defined to 1 if the Pointer Authentication extension (FEAT_PAuth) is available on the target. It is undefined otherwise.

__ARM_FEATURE_PAUTH_LR is defined to 1 if the Armv9.5-A enhancements to the Pointer Authentication extension (FEAT_PAuth_LR) are available on the target. It is undefined otherwise.

Large System Extensions ^

__ARM_FEATURE_ATOMICS is defined if the Large System Extensions introduced in the Armv8.1-A [ARMARMv81] architecture are supported on this target. Note: It is strongly recommended that standardized C11/C++11 atomics are used to implement atomic operations in user code.

Transactional Memory Extension ^

__ARM_FEATURE_TME is defined to 1 if the Transactional Memory Extension instructions are supported in hardware and intrinsics defined in Transactional Memory Extension (TME) intrinsics are available.

Armv8.7-A Load/Store 64 Byte extension ^

__ARM_FEATURE_LS64 is defined to 1 if the Armv8.7-A LD64B, ST64B, ST64BV and ST64BV0 instructions for atomic 64-byte access to device memory are supported. This macro may only ever be defined in the AArch64 execution state. Intrinsics for using these instructions are specified in Load/store 64 Byte intrinsics.

memcpy family of memory operations standardization instructions - MOPS ^

If the CPYF*, CPY*, SET* and SETG* instructions are supported, __ARM_FEATURE_MOPS is defined to 1. These instructions were introduced in the Armv8.8-A and Armv9.3-A architecture updates for standardization of the memcpy, memset, and memmove family of memory operations (MOPS).

The __ARM_FEATURE_MOPS macro can only be implemented in the AArch64 execution state. Intrinsics for the use of these instructions are specified in memcpy family of operations intrinsics - MOPS.

RCPC ^

__ARM_FEATURE_RCPC is set if the weaker RCpc (Release Consistent processor consistent) model is supported. It is undefined otherwise. The value indicates the set of Load-Acquire and Store-Release instructions available. The intention is to allow programmers to guard the usage of these instructions in inline assembly.

If defined, the value of __ARM_FEATURE_RCPC remains consistent with the decimal value of the LRCPC field (bits [23:20]) in the ID_AA64ISAR1_EL1 register. For convenience, these are shown below:

Value Feature Instructions Availability
1 FEAT_LRCPC LDAPR* instructions Armv8.3, optional Armv8.2
2 FEAT_LRCPC2 LDAPUR* and STLUR* instructions Armv8.4, optional Armv8.2
3 FEAT_LRCPC3 See FEAT_LRCPC3 documentation Armv8.9, optional Armv8.2

The __ARM_FEATURE_RCPC macro can only be implemented in the AArch64 execution state.

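For example, inline assembly that uses the LDAPUR* and STLUR* forms could be guarded as follows, based on the table above:

  #if defined(__ARM_FEATURE_RCPC) && __ARM_FEATURE_RCPC >= 2
    /* FEAT_LRCPC2 or later: LDAPUR* and STLUR* may be used in inline assembly.  */
  #endif
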
128-bit system registers ^

If the MRRS and MSRR instructions are supported, __ARM_FEATURE_SYSREG128 is defined to 1. These instructions were introduced in the Armv9.4-A architecture updates to support 128-bit system register accesses.

The __ARM_FEATURE_SYSREG128 macro can only be implemented in the AArch64 execution state. Intrinsics for the use of these instructions are specified in Special register intrinsics.

Floating-point and vector hardware ^

Hardware floating point ^

__ARM_FP is set if hardware floating-point is available. The value is a set of bits indicating the floating-point precisions supported. The following bits are used:

Bit Value Precision
1 0x02 half (16-bit) data type only
2 0x04 single (32-bit)
3 0x08 double (64-bit)

Bits 0 and 4..31 are reserved.

Currently, the following values of __ARM_FP may occur (assuming the processor configuration option for hardware floating-point support is selected where available):

Value Precisions Example processor
(undefined) none any processor without hardware floating-point support
0x04 single Cortex-R5 when configured with SP only
0x06 single, half Cortex-M4.fp
0x0C double, single Arm9, Arm11, Cortex-A8, Cortex-R4
0x0E double, single, half Cortex-A9, Cortex-A15, Cortex-R7

Other values are reserved.

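For example, based on the bit assignments above, a test for double-precision hardware support can be written as:

  #if defined(__ARM_FP) && (__ARM_FP & 0x08)
    /* Double-precision floating point is supported in hardware.  */
  #endif
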
Standard C implementations support single and double precision floating-point irrespective of whether floating-point hardware is available. However, an implementation might choose to offer a mode to diagnose or fault use of floating-point arithmetic at a precision not supported in hardware.

Support for 16-bit floating-point language or 16-bit brain floating-point language extensions (see Half-precision (16-bit) floating-point format and Brain 16-bit floating-point support) is only required if supported in hardware.

Half-precision (16-bit) floating-point format ^

__ARM_FP16_FORMAT_IEEE is defined to 1 if the IEEE 754-2008 [IEEE-FP] 16-bit floating-point format is used.

__ARM_FP16_FORMAT_ALTERNATIVE is defined to 1 if the Arm alternative [ARMARM] 16-bit floating-point format is used. This format removes support for infinities and NaNs in order to provide an additional binade.

At most one of these macros will be defined. See Half-precision floating-point for details of half-precision floating-point types.

Half-precision argument and result ^

__ARM_FP16_ARGS is defined to 1 if __fp16 can be used as an argument and result.

Vector extensions ^

Advanced SIMD architecture extension (Neon) ^

__ARM_NEON is defined to a value indicating the Advanced SIMD (Neon) architecture supported. The only current value is 1.

In principle, for AArch32, the Neon architecture can exist in an integer-only version. To test for the presence of Neon floating-point vector instructions, test __ARM_NEON_FP. When Neon does occur in an integer-only version, the VFP scalar instruction set is also not present. See [ARMARM] (table A2-4) for architecturally permitted combinations.

__ARM_NEON is always set to 1 for AArch64.

Neon floating-point ^

__ARM_NEON_FP is defined as a bitmap to indicate floating-point support in the Neon architecture. The meaning of the values is the same as for __ARM_FP. This macro is undefined when the Neon extension is not present or does not support floating-point.

Current AArch32 Neon implementations do not support double-precision floating-point even when it is present in VFP. 16-bit floating-point format is supported in Neon if and only if it is supported in VFP. Consequently, the definition of __ARM_NEON_FP is the same as __ARM_FP except that the bit to indicate double-precision is not set for AArch32. Double-precision is always set for AArch64.

If __ARM_FEATURE_FMA and __ARM_NEON_FP are both defined, fused-multiply instructions are available in Neon also.

Scalable Vector Extension (SVE) ^

__ARM_FEATURE_SVE is defined to 1 if there is hardware support for the FEAT_SVE instructions and if the associated ACLE features are available. This implies that __ARM_NEON and __ARM_NEON_FP are both nonzero.

The following macros indicate the presence of various optional SVE language extensions:

__ARM_FEATURE_SVE_BITS==N

When N is nonzero, this indicates that the implementation is generating code for an N-bit SVE target and that the implementation supports the arm_sve_vector_bits(N) attribute. N may also be zero, but this carries the same meaning as not defining the macro. See The __ARM_FEATURE_SVE_BITS macro for details.

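For example, a translation unit built for a 512-bit SVE target could define a fixed-length vector type as follows (a sketch using the arm_sve_vector_bits attribute referenced above):

  #include <arm_sve.h>

  #if defined(__ARM_FEATURE_SVE_BITS) && __ARM_FEATURE_SVE_BITS == 512
  typedef svint32_t fixed_int32_t __attribute__((arm_sve_vector_bits(512)));
  #endif
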
__ARM_FEATURE_SVE_VECTOR_OPERATORS==N

N >= 1 indicates that applying the arm_sve_vector_bits attribute to an SVE vector type creates a type that supports the GNU vector extensions. This condition is only meaningful when __ARM_FEATURE_SVE_BITS is nonzero. See arm_sve_vector_bits behavior specific to vectors for details.

N >= 2 indicates that the operators outlined in the GNU vector extensions additionally work on sizeless SVE vector types like svint32_t. The availability of operators on sizeless types is independent of __ARM_FEATURE_SVE_BITS.

__ARM_FEATURE_SVE_PREDICATE_OPERATORS==N

N >= 1 indicates that applying the arm_sve_vector_bits attribute to svbool_t creates a type that supports basic built-in vector operations. The state of this macro is only meaningful when __ARM_FEATURE_SVE_BITS is nonzero. See arm_sve_vector_bits behavior specific to predicates for details.

N >= 2 indicates that the built-in vector operations described above additionally work on svbool_t. The availability of operators on svbool_t is independent of __ARM_FEATURE_SVE_BITS.

SVE2 ^

__ARM_FEATURE_SVE2 is defined to 1 if there is hardware support for the Armv9-A SVE2 extension (FEAT_SVE2) and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_SVE is nonzero.

__ARM_FEATURE_SVE2p1 is defined to 1 if the FEAT_SVE2p1 instructions are available and if the associated ACLE features are supported.

NEON-SVE Bridge macros ^

__ARM_NEON_SVE_BRIDGE is defined to 1 if NEON-SVE Bridge intrinsics are available. This implies that the following macros are nonzero:

Scalable Matrix Extension (SME) ^

The specification for SME2.1 is in Alpha state and the specification for the rest of SME is in Beta state. The specifications may change or be extended in the future.

ACLE provides features for accessing the Scalable Matrix Extension (SME). Each revision of SME has an associated preprocessor macro, given in the table below:

Feature Macro
FEAT_SME __ARM_FEATURE_SME
FEAT_SME2 __ARM_FEATURE_SME2
FEAT_SME2p1 __ARM_FEATURE_SME2p1

Each macro is defined if there is hardware support for the associated architecture feature and if all of the ACLE features that are conditional on the macro are supported.

In addition, __ARM_FEATURE_LOCALLY_STREAMING is defined to 1 if the arm_locally_streaming attribute is available.

M-profile Vector Extension ^

__ARM_FEATURE_MVE is defined as a bitmap to indicate M-profile Vector Extension (MVE) support.

Bit Value Support
0 0x01 Integer MVE
1 0x02 Floating-point MVE

Wireless MMX ^

If Wireless MMX operations are available on the target, __ARM_WMMX is defined to a value that indicates the level of support, corresponding to the Tag_WMMX_arch build attribute.

This specification does not further define source-language features to support Wireless MMX.

16-bit floating-point extensions ^

16-bit floating-point data processing operations ^

__ARM_FEATURE_FP16_SCALAR_ARITHMETIC is defined to 1 if the 16-bit floating-point arithmetic instructions are supported in hardware and the associated scalar intrinsics defined by ACLE are available. Note that this implies:

__ARM_FEATURE_FP16_VECTOR_ARITHMETIC is defined to 1 if the 16-bit floating-point arithmetic instructions are supported in hardware and the associated vector intrinsics defined by ACLE are available. Note that this implies:

FP16 FML extension ^

__ARM_FEATURE_FP16_FML is defined to 1 if the FP16 multiplication variant instructions from Armv8.2-A are supported and intrinsics targeting them are available. This implies that __ARM_FEATURE_FP16_SCALAR_ARITHMETIC is defined to a nonzero value.

Half-precision floating-point SME intrinsics ^

The specification for SME2.1 is in Alpha state and may change or be extended in the future.

__ARM_FEATURE_SME_F16F16 is defined to 1 if there is hardware support for the SME2 half-precision (FEAT_SME_F16F16) instructions and if their associated intrinsics are available.

Brain 16-bit floating-point support ^

__ARM_BF16_FORMAT_ALTERNATIVE is defined to 1 if the Arm alternative [ARMARM] 16-bit brain floating-point format is used. This format closely resembles the IEEE 754 single-precision format. As such, a brain half-precision floating-point value can be converted to the IEEE 754 single-precision format by appending 16 zero bits at the end.
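
For illustration, this widening conversion can be written in portable C. The helper below is a sketch only (the function name is hypothetical and it operates on the raw 16-bit pattern rather than on an ACLE bfloat16 type):

  #include <stdint.h>
  #include <string.h>

  /* Widen a brain half-precision bit pattern to an IEEE 754 single-precision
     value by appending 16 zero bits. */
  static inline float bf16_bits_to_float(uint16_t bits) {
      uint32_t wide = (uint32_t)bits << 16;
      float result;
      memcpy(&result, &wide, sizeof result);
      return result;
  }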

__ARM_FEATURE_BF16_VECTOR_ARITHMETIC is defined to 1 if there is hardware support for the Advanced SIMD brain 16-bit floating-point arithmetic instructions and if the associated ACLE vector intrinsics are available. This implies:

Similarly, __ARM_FEATURE_SVE_BF16 is defined to 1 if there is hardware support for the SVE BF16 extensions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_BF16_VECTOR_ARITHMETIC and __ARM_FEATURE_SVE are both nonzero.

See Half-precision brain floating-point for details of half-precision brain floating-point types.

Non-widening brain 16-bit floating-point support ^

The specification for B16B16 is in Alpha state and may change or be extended in the future.

__ARM_FEATURE_SVE_B16B16 is defined to 1 if there is hardware support for the FEAT_SVE_B16B16 instructions and if their associated intrinsics are available. Specifically, if this macro is defined to 1, then:

__ARM_FEATURE_SME_B16B16 is defined to 1 if there is hardware support for the FEAT_SME_B16B16 instructions and if their associated intrinsics are available.

Cryptographic extensions ^

“Crypto” extension ^

NOTE: The __ARM_FEATURE_CRYPTO macro is deprecated in favor of the finer grained feature macros described below.

__ARM_FEATURE_CRYPTO is defined to 1 if the Armv8-A Crypto instructions are supported and intrinsics targeting them are available. These instructions include AES{E, D}, SHA1{C, P, M} and others. This also implies __ARM_FEATURE_AES and __ARM_FEATURE_SHA2.

AES extension ^

__ARM_FEATURE_AES is defined to 1 if there is hardware support for the Advanced SIMD AES Crypto instructions from Armv8-A and if the associated ACLE intrinsics are available. These instructions are identified by FEAT_AES and FEAT_PMULL in [ARMARMv8], and they include AES{E, D}, AESMC, AESIMC and others.

In addition, __ARM_FEATURE_SVE2_AES is defined to 1 if there is hardware support for the SVE2 AES (FEAT_SVE_AES) instructions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_AES and __ARM_FEATURE_SVE2 are both nonzero.

SHA2 extension ^

__ARM_FEATURE_SHA2 is defined to 1 if the SHA1 & SHA2-256 Crypto instructions from Armv8-A are supported and intrinsics targeting them are available. These instructions are identified by FEAT_SHA1 and FEAT_SHA256 in [ARMARMv8], and they include SHA1{C, P, M}, SHA256H, SHA256H2, and others.

SHA512 extension ^

__ARM_FEATURE_SHA512 is defined to 1 if the SHA2-512 Crypto instructions from Armv8.2-A are supported and intrinsics targeting them are available. These instructions are identified by FEAT_SHA512 in [ARMARMv82], and they include SHA1{C, P, M}, SHA256H, SHA256H2, SHA512H, SHA512H2, SHA512SU0, and others. Note: FEAT_SHA512 requires both FEAT_SHA1 and FEAT_SHA256.

SHA3 extension ^

__ARM_FEATURE_SHA3 is defined to 1 if there is hardware support for the Advanced SIMD SHA1 & SHA2 Crypto instructions from Armv8-A and the SHA2 and SHA3 instructions from Armv8.2-A and newer, and if the associated ACLE intrinsics are available. These instructions include AES{E, D}, SHA1{C, P, M}, RAX, and others.

In addition, __ARM_FEATURE_SVE2_SHA3 is defined to 1 if there is hardware support for the SVE2 SHA3 (FEAT_SVE_SHA3) instructions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_SHA3 and __ARM_FEATURE_SVE2 are both nonzero.

SM3 extension ^

__ARM_FEATURE_SM3 is defined to 1 if there is hardware support for Advanced SIMD SM3 Crypto instructions from Armv8.2-A and if the associated ACLE intrinsics are available. These instructions include SM3{TT1A, TT1B}, and others.

In addition, __ARM_FEATURE_SVE2_SM3 is defined to 1 if there is hardware support for the SVE2 SM3 (FEAT_SVE_SM3) instructions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_SM3 and __ARM_FEATURE_SVE2 are both nonzero.

SM4 extension ^

__ARM_FEATURE_SM4 is defined to 1 if there is hardware support for the Advanced SIMD SM4 Crypto instructions from Armv8.2-A and if the associated ACLE intrinsics are available. These instructions include SM4{E, EKEY} and others.

In addition, __ARM_FEATURE_SVE2_SM4 is defined to 1 if there is hardware support for the SVE2 SM4 (FEAT_SVE_SM4) instructions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_SM4 and __ARM_FEATURE_SVE2 are both nonzero.

Floating-point absolute minimum and maximum extension ^

__ARM_FEATURE_FAMINMAX is defined to 1 if there is hardware support for floating-point absolute minimum and maximum instructions (FEAT_FAMINMAX) and if the associated ACLE intrinsics are available.

Lookup table extensions ^

__ARM_FEATURE_LUT is defined to 1 if there is hardware support for lookup table instructions with 2-bit and 4-bit indices (FEAT_LUT) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_SME_LUTv2 is defined to 1 if there is hardware support for lookup table instructions with 4-bit indices and 8-bit elements (FEAT_SME_LUTv2) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_FP8 is defined to 1 if there is hardware support for FP8 conversion instructions (FEAT_FP8) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_FP8FMA is defined to 1 if there is hardware support for FP8 multiply-accumulate to half-precision and single-precision instructions (FEAT_FP8FMA) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_FP8DOT2 is defined to 1 if there is hardware support for FP8 2-way dot product to half-precision instructions (FEAT_FP8DOT2) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_FP8DOT4 is defined to 1 if there is hardware support for FP8 4-way dot product to single-precision instructions (FEAT_FP8DOT4) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_SSVE_FP8DOT4 is defined to 1 if there is hardware support for SVE2 FP8 4-way dot product to single-precision instructions in Streaming SVE mode (FEAT_SSVE_FP8DOT4) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_SSVE_FP8DOT2 is defined to 1 if there is hardware support for SVE2 FP8 2-way dot product to half-precision instructions in Streaming SVE mode (FEAT_SSVE_FP8DOT2) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_SSVE_FP8FMA is defined to 1 if there is hardware support for SVE2 FP8 multiply-accumulate to half-precision and single-precision instructions in Streaming SVE mode (FEAT_SSVE_FP8FMA) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_SME_F8F32 is defined to 1 if there is hardware support for SME2 FP8 dot product, multiply-accumulate, and outer product to single-precision instructions (FEAT_SME_F8F32) and if the associated ACLE intrinsics are available.

__ARM_FEATURE_SME_F8F16 is defined to 1 if there is hardware support for SME2 FP8 dot product, multiply-accumulate, and outer product to half-precision instructions (FEAT_SME_F8F16) and if the associated ACLE intrinsics are available.

Other floating-point and vector extensions ^

Fused multiply-accumulate (FMA) ^

__ARM_FEATURE_FMA is defined to 1 if the hardware floating-point architecture supports fused floating-point multiply-accumulate, that is, without intermediate rounding. Note that C implementations are encouraged [C99] (7.12) to ensure that <math.h> defines FP_FAST_FMAF or FP_FAST_FMA, which can be tested by portable C code. A C implementation on Arm might define these macros by testing __ARM_FEATURE_FMA and __ARM_FP.
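
A minimal sketch of portable code that tests these macros, assuming only the standard fma() function and the FP_FAST_FMA macro from <math.h>:

  #include <math.h>

  double mac(double acc, double a, double b) {
  #if defined(FP_FAST_FMA)
      return fma(a, b, acc);   /* expected to map to a single fused multiply-accumulate */
  #else
      return acc + a * b;      /* the product may be rounded before the addition */
  #endif
  }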

This macro implies support for floating-point instructions but it does not in itself imply support for vector instructions. See Neon floating-point for the conditions under which vector fused multiply-accumulate operations are available.

Directed rounding ^

__ARM_FEATURE_DIRECTED_ROUNDING is defined to 1 if the directed rounding and conversion vector instructions are supported and rounding and conversion intrinsics are available. This is only available when __ARM_ARCH >= 8.

Armv8.5-A Floating-point rounding extension ^

__ARM_FEATURE_FRINT is defined to 1 if the Armv8.5-A rounding number instructions are supported and the scalar and vector intrinsics are available. This macro may only ever be defined in the AArch64 execution state. The scalar intrinsics are specified in Floating-point data-processing intrinsics and are not intended for general use. They are defined for uses that require the specialist rounding behavior of the relevant instructions. The vector intrinsics are specified in the Arm Neon Intrinsics Reference Architecture Specification [Neon].

Javascript floating-point conversion ^

__ARM_FEATURE_JCVT is defined to 1 if the FJCVTZS (AArch64) or VJCVT (AArch32) instruction and the associated intrinsic are available.

Numeric maximum and minimum ^

__ARM_FEATURE_NUMERIC_MAXMIN is defined to 1 if the IEEE 754-2008 compliant floating point maximum and minimum vector instructions are supported and intrinsics targeting these instructions are available. This is only available when __ARM_ARCH >= 8.

Rounding doubling multiplies ^

__ARM_FEATURE_QRDMX is defined to 1 if SQRDMLAH and SQRDMLSH instructions and their associated intrinsics are available.

Dot Product extension ^

__ARM_FEATURE_DOTPROD is defined if the dot product data manipulation instructions are supported and the vector intrinsics are available. Note that this implies:

Complex number intrinsics ^

__ARM_FEATURE_COMPLEX is defined if the complex addition and complex multiply-accumulate vector instructions are supported. Note that this implies:

These instructions require that the input vectors are organized such that the real and imaginary parts of the complex number are stored in alternating sequences: real, imag, real, imag, and so on.
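
For example, two single-precision complex numbers 1+2i and 3+4i would be laid out as follows (the array name is illustrative):

  /* Interleaved layout expected by the complex instructions:
     { re0, im0, re1, im1 } */
  float cplx[4] = { 1.0f, 2.0f, 3.0f, 4.0f };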

Matrix Multiply Intrinsics ^

Multiplication of 8-bit integer matrices ^

__ARM_FEATURE_MATMUL_INT8 is defined to 1 if there is hardware support for the Advanced SIMD integer matrix multiply instructions and if the associated ACLE intrinsics are available. This implies that __ARM_NEON is nonzero.

In addition, __ARM_FEATURE_SVE_MATMUL_INT8 is defined to 1 if there is hardware support for the SVE forms of these instructions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_MATMUL_INT8 and __ARM_FEATURE_SVE are both nonzero.

Multiplication of 32-bit floating-point matrices ^

__ARM_FEATURE_SVE_MATMUL_FP32 is defined to 1 if there is hardware support for the SVE 32-bit floating-point matrix multiply (FEAT_F32MM) instructions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_SVE is nonzero.

Multiplication of 64-bit floating-point matrices ^

__ARM_FEATURE_SVE_MATMUL_FP64 is defined to 1 if there is hardware support for the SVE 64-bit floating-point matrix multiply (FEAT_F64MM) instructions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_SVE is nonzero.

Bit permute extension ^

__ARM_FEATURE_SVE2_BITPERM is defined to 1 if there is hardware support for the SVE2 bit permute (FEAT_SVE_BitPerm) instructions and if the associated ACLE intrinsics are available. This implies that __ARM_FEATURE_SVE2 is nonzero.

16-bit to 64-bit integer widening outer product intrinsics ^

The specification for SME is in Beta state and may change or be extended in the future.

__ARM_FEATURE_SME_I16I64 is defined to 1 if there is hardware support for the SME 16-bit to 64-bit integer widening outer product (FEAT_SME_I16I64) instructions and if their associated intrinsics are available. This implies that __ARM_FEATURE_SME is nonzero.

Double precision floating-point outer product intrinsics ^

The specification for SME is in Beta state and may change or be extended in the future.

__ARM_FEATURE_SME_F64F64 is defined to 1 if there is hardware support for the SME double precision floating-point outer product (FEAT_SME_F64F64) instructions and if their associated intrinsics are available. This implies that __ARM_FEATURE_SME is nonzero.

Floating-point model ^

These macros test the floating-point model implemented by the compiler and libraries. The model determines the guarantees on arithmetic and exceptions.

__ARM_FP_FAST is defined to 1 if floating-point optimizations may occur such that the computed results are different from those prescribed by the order of operations according to the C standard. Examples of such optimizations would be reassociation of expressions to reduce depth, and replacement of a division by constant with multiplication by its reciprocal.

__ARM_FP_FENV_ROUNDING is defined to 1 if the implementation allows the rounding to be configured at runtime using the standard C fesetround() function and will apply this rounding to future floating-point operations. The rounding mode applies to both scalar floating-point and Neon.
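
For example, when this macro is defined, a program can select a rounding mode through <fenv.h>. The following is a sketch using only standard C facilities:

  #include <fenv.h>

  #pragma STDC FENV_ACCESS ON

  /* Add two values with round-toward-zero, then restore the previous mode. */
  double truncating_add(double a, double b) {
      int old = fegetround();
      fesetround(FE_TOWARDZERO);
      double r = a + b;
      fesetround(old);
      return r;
  }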

The floating-point implementation might or might not support denormal values. If denormal values are not supported then they are flushed to zero.

Implementations may also define the following macros in appropriate floating-point modes:

__STDC_IEC_559__ is defined if the implementation conforms to IEC 60559. This implies support for floating-point exception status flags, including the inexact exception. This macro is specified by [C99] (6.10.8).

__SUPPORT_SNAN__ is defined if the implementation supports signalling NaNs. This macro is specified by the C standards proposal WG14 N965 Optional support for Signaling NaNs. (Note: this was not adopted into C11.)

Procedure call standard ^

__ARM_PCS is defined to 1 if the default procedure calling standard for the translation unit conforms to the base PCS defined in [AAPCS]. This is supported on AArch32 only.

__ARM_PCS_VFP is defined to 1 if the default is to pass floating-point parameters in hardware floating-point registers using the VFP variant PCS defined in [AAPCS]. This is supported on AArch32 only.

__ARM_PCS_AAPCS64 is defined to 1 if the default procedure calling standard for the translation unit conforms to the [AAPCS64].

Note that this should reflect the implementation default for the translation unit. Implementations which allow the PCS to be set for a function, class or namespace are not expected to redefine the macro within that scope.

Position-independent code ^

__ARM_ROPI is defined to 1 if the translation unit is being compiled in read-only position independent mode. In this mode, all read-only data and functions are at a link-time constant offset from the program counter.

__ARM_RWPI is defined to 1 if the translation unit is being compiled in read-write position independent mode. In this mode, all writable data is at a link-time constant offset from the static base register defined in [AAPCS].

The ROPI and RWPI position independence modes are compatible with each other, so the __ARM_ROPI and __ARM_RWPI macros may be defined at the same time.

Coprocessor intrinsics ^

__ARM_FEATURE_COPROC is defined as a bitmap to indicate the presence of coprocessor intrinsics for the target architecture. If __ARM_FEATURE_COPROC is undefined or zero, that means there is no support for coprocessor intrinsics on the target architecture. The following bits are used:

Bit Value Intrinsics Available
0 0x1 __arm_cdp __arm_ldc, __arm_ldcl, __arm_stc, __arm_stcl, __arm_mcr and __arm_mrc
1 0x2 __arm_cdp2, __arm_ldc2, __arm_stc2, __arm_ldc2l, __arm_stc2l, __arm_mcr2 and __arm_mrc2
2 0x4 __arm_mcrr and __arm_mrrc
3 0x8 __arm_mcrr2 and __arm_mrrc2

Custom Datapath Extension ^

__ARM_FEATURE_CDE is defined to 1 if the Arm Custom Datapath Extension (CDE) is supported.

__ARM_FEATURE_CDE_COPROC is a bitmap indicating the CDE coprocessors available. The following bits are used:

Bit Value CDE Coprocessor available
0 0x01 p0
1 0x02 p1
2 0x04 p2
3 0x08 p3
4 0x10 p4
5 0x20 p5
6 0x40 p6
7 0x80 p7

Mapping of object build attributes to predefines ^

This section is provided for guidance. Details of build attributes can be found in [BA].

Tag no. Tag Predefined macro
6 Tag_CPU_arch __ARM_ARCH, __ARM_FEATURE_DSP
7 Tag_CPU_arch_profile __ARM_PROFILE
8 Tag_ARM_ISA_use __ARM_ISA_ARM
9 Tag_THUMB_ISA_use __ARM_ISA_THUMB
11 Tag_WMMX_arch __ARM_WMMX
18 Tag_ABI_PCS_wchar_t __ARM_SIZEOF_WCHAR_T
20 Tag_ABI_FP_denormal  
21 Tag_ABI_FP_exceptions  
22 Tag_ABI_FP_user_exceptions  
23 Tag_ABI_FP_number_model  
26 Tag_ABI_enum_size __ARM_SIZEOF_MINIMAL_ENUM
34 Tag_CPU_unaligned_access __ARM_FEATURE_UNALIGNED
36 Tag_FP_HP_extension __ARM_FP16_FORMAT_IEEE
    __ARM_FP16_FORMAT_ALTERNATIVE
38 Tag_ABI_FP_16bit_format __ARM_FP16_FORMAT_IEEE
    __ARM_FP16_FORMAT_ALTERNATIVE

Summary of predefined macros ^

Macro name Meaning Example
__ARM_32BIT_STATE Code is for AArch32 state 1
__ARM_64BIT_STATE Code is for AArch64 state 1
__ARM_ACLE Indicates ACLE implemented 101
__ARM_ALIGN_MAX_PWR Log of maximum alignment of static object 20
__ARM_ALIGN_MAX_STACK_PWR Log of maximum alignment of stack object 3
__ARM_ARCH Arm architecture level 7
__ARM_ARCH_ISA_A64 AArch64 ISA present 1
__ARM_ARCH_ISA_ARM Arm instruction set present 1
__ARM_ARCH_ISA_THUMB T32 instruction set present 2
__ARM_ARCH_PROFILE Architecture profile 'A'
__ARM_BF16_FORMAT_ALTERNATIVE 16-bit brain floating-point, alternative format 1
__ARM_BIG_ENDIAN Memory is big-endian 1
__ARM_FEATURE_AES AES Crypto extension (Arm v8-A) 1
__ARM_FEATURE_ATOMICS Large System Extensions 1
__ARM_FEATURE_BF16 16-bit brain floating-point, vector instruction 1
__ARM_FEATURE_BTI_DEFAULT Branch Target Identification 1
__ARM_FEATURE_CDE Custom Datapath Extension 0x01
__ARM_FEATURE_CDE_COPROC Custom Datapath Extension 0xf
__ARM_FEATURE_CLZ CLZ instruction 1
__ARM_FEATURE_COMPLEX Armv8.3-A extension 1
__ARM_FEATURE_COPROC Coprocessor Intrinsics 1
__ARM_FEATURE_CRC32 CRC32 extension 1
__ARM_FEATURE_CRYPTO Crypto extension 1
__ARM_FEATURE_DIRECTED_ROUNDING Directed Rounding 1
__ARM_FEATURE_DOTPROD Dot product extension (ARM v8.2-A) 1
__ARM_FEATURE_DSP DSP instructions (Arm v5E) (32-bit-only) 1
__ARM_FEATURE_FAMINMAX Floating-point absolute minimum and maximum extension 1
__ARM_FEATURE_FMA Floating-point fused multiply-accumulate 1
__ARM_FEATURE_FP16_FML FP16 FML extension (Arm v8.4-A, optional Armv8.2-A, Armv8.3-A) 1
__ARM_FEATURE_FP8 Modal 8-bit floating-point extensions 1
__ARM_FEATURE_FP8DOT2 Modal 8-bit floating-point extensions 1
__ARM_FEATURE_FP8DOT4 Modal 8-bit floating-point extensions 1
__ARM_FEATURE_FP8FMA Modal 8-bit floating-point extensions 1
__ARM_FEATURE_FRINT Floating-point rounding extension (Arm v8.5-A) 1
__ARM_FEATURE_IDIV Hardware Integer Divide 1
__ARM_FEATURE_JCVT Javascript conversion (ARMv8.3-A) 1
__ARM_FEATURE_LDREX (Deprecated) Load/store exclusive instructions 0x0F
__ARM_FEATURE_LUT Lookup table extensions (FEAT_LUT) 1
__ARM_FEATURE_MATMUL_INT8 Integer Matrix Multiply extension (Armv8.6-A, optional Armv8.2-A, Armv8.3-A, Armv8.4-A, Armv8.5-A) 1
__ARM_FEATURE_MEMORY_TAGGING Memory Tagging (Armv8.5-A) 1
__ARM_FEATURE_MOPS memcpy, memset, and memmove family of operations standardization instructions 1
__ARM_FEATURE_MVE M-profile Vector Extension 1
__ARM_FEATURE_NUMERIC_MAXMIN Numeric Maximum and Minimum 1
__ARM_FEATURE_PAC_DEFAULT Pointer authentication protection 0x5
__ARM_FEATURE_PAUTH Pointer Authentication Extension (FEAT_PAuth) 1
__ARM_FEATURE_PAUTH_LR Armv9.5-A Enhancements to Pointer Authentication Extension (FEAT_PAuth_LR) 1
__ARM_FEATURE_QBIT Q (saturation) flag (32-bit-only) 1
__ARM_FEATURE_QRDMX SQRDMLxH instructions and associated intrinsics availability 1
__ARM_FEATURE_RCPC Release Consistent processor consistent Model (64-bit-only) 1
__ARM_FEATURE_RNG Random Number Generation Extension (Armv8.5-A) 1
__ARM_FEATURE_SAT Width-specified saturation instructions (32-bit-only) 1
__ARM_FEATURE_SHA2 SHA2 Crypto extension (Arm v8-A) 1
__ARM_FEATURE_SHA3 SHA3 Crypto extension (Arm v8.4-A) 1
__ARM_FEATURE_SHA512 SHA2 Crypto ext. (Arm v8.4-A, optional Armv8.2-A, Armv8.3-A) 1
__ARM_FEATURE_SIMD32 32-bit SIMD instructions (Armv6) (32-bit-only) 1
__ARM_FEATURE_SM3 SM3 Crypto extension (Arm v8.4-A, optional Armv8.2-A, Armv8.3-A) 1
__ARM_FEATURE_SM4 SM4 Crypto extension (Arm v8.4-A, optional Armv8.2-A, Armv8.3-A) 1
__ARM_FEATURE_SME Scalable Matrix Extension (FEAT_SME) 1
__ARM_FEATURE_SME2 Scalable Matrix Extension (FEAT_SME2) 1
__ARM_FEATURE_SME_B16B16 Non-widening brain 16-bit floating-point SME intrinsics (FEAT_SME_B16B16) 1
__ARM_FEATURE_SME_F16F16 Half-precision floating-point SME intrinsics (FEAT_SME_F16F16) 1
__ARM_FEATURE_SME_F64F64 Double precision floating-point outer product intrinsics (FEAT_SME_F64F64) 1
__ARM_FEATURE_SME_F8F16 Modal 8-bit floating-point extensions 1
__ARM_FEATURE_SME_F8F32 Modal 8-bit floating-point extensions 1
__ARM_FEATURE_SME_I16I64 16-bit to 64-bit integer widening outer product intrinsics (FEAT_SME_I16I64) 1
__ARM_FEATURE_SME_LOCALLY_STREAMING Support for the arm_locally_streaming attribute 1
__ARM_FEATURE_SME_LUTv2 Lookup table extensions (FEAT_SME_LUTv2) 1
__ARM_FEATURE_SSVE_FP8DOT2 Modal 8-bit floating-point extensions 1
__ARM_FEATURE_SSVE_FP8DOT4 Modal 8-bit floating-point extensions 1
__ARM_FEATURE_SSVE_FP8FMA Modal 8-bit floating-point extensions 1
__ARM_FEATURE_SVE Scalable Vector Extension (FEAT_SVE) 1
__ARM_FEATURE_SVE_B16B16 Non-widening brain 16-bit floating-point intrinsics (FEAT_SVE_B16B16) 1
__ARM_FEATURE_SVE_BF16 SVE support for the 16-bit brain floating-point extension (FEAT_BF16) 1
__ARM_FEATURE_SVE_BITS The number of bits in an SVE vector, when known in advance 256
__ARM_FEATURE_SVE_MATMUL_FP32 32-bit floating-point matrix multiply extension (FEAT_F32MM) 1
__ARM_FEATURE_SVE_MATMUL_FP64 64-bit floating-point matrix multiply extension (FEAT_F64MM) 1
__ARM_FEATURE_SVE_MATMUL_INT8 SVE support for the integer matrix multiply extension (FEAT_I8MM) 1
__ARM_FEATURE_SVE_PREDICATE_OPERATORS Level of support for C and C++ operators on SVE predicate types 1
__ARM_FEATURE_SVE_VECTOR_OPERATORS Level of support for C and C++ operators on SVE vector types 1
__ARM_FEATURE_SVE2 SVE version 2 (FEAT_SVE2) 1
__ARM_FEATURE_SVE2_AES SVE2 support for the AES cryptographic extension (FEAT_SVE_AES) 1
__ARM_FEATURE_SVE2_BITPERM SVE2 bit permute extension (FEAT_SVE_BitPerm) 1
__ARM_FEATURE_SVE2_SHA3 SVE2 support for the SHA3 cryptographic extension (FEAT_SVE_SHA3) 1
__ARM_FEATURE_SVE2_SM3 SVE2 support for the SM3 cryptographic extension (FEAT_SVE_SM3) 1
__ARM_FEATURE_SVE2_SM4 SVE2 support for the SM4 cryptographic extension (FEAT_SVE_SM4) 1
__ARM_FEATURE_SVE2p1 SVE version 2.1 (FEAT_SVE2p1)  
__ARM_FEATURE_SYSREG128 Support for 128-bit system registers (FEAT_SYSREG128) 1
__ARM_FEATURE_UNALIGNED Hardware support for unaligned access 1
__ARM_FP Hardware floating-point 0x0C
__ARM_FP16_ARGS __fp16 argument and result 1
__ARM_FP16_FORMAT_ALTERNATIVE 16-bit floating-point, alternative format 1
__ARM_FP16_FORMAT_IEEE 16-bit floating-point, IEEE format 1
__ARM_FP_FAST Accuracy-losing optimizations 1
__ARM_FP_FENV_ROUNDING Rounding is configurable at runtime 1
__ARM_NEON Advanced SIMD (Neon) extension 1
__ARM_NEON_FP Advanced SIMD (Neon) floating-point 0x04
__ARM_NEON_SVE_BRIDGE Moving data between Neon and SVE data types 1
__ARM_PCS Arm procedure call standard (32-bit-only) 0x01
__ARM_PCS_AAPCS64 Arm PCS for AArch64. 1
__ARM_PCS_VFP Arm PCS hardware FP variant in use (32-bit-only) 1
__ARM_ROPI Read-only PIC in use 1
__ARM_RWPI Read-write PIC in use 1
__ARM_SIZEOF_MINIMAL_ENUM Size of minimal enumeration type: 1 or 4 1
__ARM_SIZEOF_WCHAR_T Size of wchar_t: 2 or 4 2
__ARM_WMMX Wireless MMX extension (32-bit-only) 1

Attributes and pragmas ^

Attribute syntax ^

This section describes ACLE features that use GNU-style attributes. The general rules for attribute syntax are described in the GCC documentation https://gcc.gnu.org/onlinedocs/gcc/extensions-to-the-c-language-family/attribute-syntax.html. Briefly, for this declaration:

  A int B x C, D y E;

attribute A applies to both x and y; B and C apply to x only, and D and E apply to y only. Programmers are recommended to keep declarations simple if attributes are used.
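
A concrete instance of this pattern, using attributes described elsewhere in this specification (weak, in position A, applies to both x and y, while aligned(8), in position C, applies to x only):

  __attribute__((weak)) int x __attribute__((aligned(8))), y;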

Unless otherwise stated, all attribute arguments must be compile-time constants.

In addition to the GNU-style attributes described in this section, ACLE also uses keyword attributes.

Hardware/software floating-point calling convention ^

The AArch32 PCS defines a base standard, as well as several variants.

On targets with hardware FP the AAPCS provides for procedure calls to use either integer or floating-point argument and result registers. ACLE allows this to be selectable per function.

    __attribute__((pcs("aapcs")))

applied to a function, selects software (integer) FP calling convention.

    __attribute__((pcs("aapcs-vfp")))

applied to a function, selects hardware FP calling convention.

The AArch64 PCS standard variants do not change how parameters are passed, so no PCS attributes are supported.

The pcs attribute applies to functions and function types. Implementations are allowed to treat the procedure call specification as part of the type, that is, as a language linkage in the sense of [C++ #1].
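
For example, the following declaration (the function itself is illustrative) requests the VFP variant for a single function in a translation unit that otherwise uses the base standard:

  double dot3(const float *a, const float *b) __attribute__((pcs("aapcs-vfp")));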

Target selection ^

The following target selection attributes are supported:

    __attribute__((target("arm")))

when applied to a function, forces A32 state code generation.

    __attribute__((target("thumb")))

when applied to a function, forces T32 state code generation.

The implementation must generate code in the required state unless it is impossible to do so. For example, on an Armv5 or Armv6 target with VFP (and without the T32 instruction set), if a function is forced to T32 state, any floating-point operations or intrinsics that are only available in A32 state must be generated as calls to library functions or compiler-generated functions.

This attribute does not apply to AArch64.
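
For example, a single function can be forced to T32 state (the function body is illustrative):

  /* Compiled as T32 code even if the rest of the file is compiled as A32. */
  __attribute__((target("thumb")))
  int checksum(const unsigned char *p, int n) {
      int s = 0;
      for (int i = 0; i < n; ++i)
          s += p[i];
      return s;
  }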

Function Multi Versioning ^

The specification for Function Multi Versioning is in Beta state and might change or be extended in the future.

Function Multi Versioning provides a convenient way to select the most appropriate version of a function at runtime. All versions of the function may be in the final binary. The compiler generates all supported versions and the runtime makes the selection at load time.

The following attributes trigger the multi version code generation: __attribute__((target_version("name"))) and __attribute__((target_clones("name",...))).

but it is enough to say target_version("sve_bf16") in the code.

The attribute __attribute__((target_version("name"))) expresses the following:

The attribute __attribute__((target_clones("name",...))) expresses the following:

compiled with an older compiler. The compiler may provide diagnostic messages and could block the compilation (for instance, if the -pedantic flag is present).
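
As an illustrative sketch (the function name is hypothetical), a translation unit might declare two versions of the same function using the "simd" and "default" names from the AArch64 mapping table below; the runtime selects one of them at load time:

  __attribute__((target_version("simd")))
  int sum(const int *p, int n);      /* Advanced SIMD version */

  __attribute__((target_version("default")))
  int sum(const int *p, int n);      /* portable fallback */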

__HAVE_FUNCTION_MULTI_VERSIONING is defined to 1 if the versioning mechanism described in this section is supported by the compiler and it is enabled.

__FUNCTION_MULTI_VERSIONING_SUPPORT_LEVEL is defined to the currently supported version of the ACLE. The value and the format are the same as those of __ARM_ACLE.

For example, it can be implemented as:

#define __FUNCTION_MULTI_VERSIONING_SUPPORT_LEVEL __ARM_ACLE_VERSION(2024, 3, 0)

Name mangling ^

The "default" version is mangled with ".default" on top of the language-specific name mangling. All versioned functions with their mangled names are always resolvable. A function is expected to be resolvable with the original mangled name of the function.

The mangling function is compatible with the mangling for version information of the [cxxabi], and it is defined as follows:

<variant name> := <c/c++ mangling> `.` <vendor specific suffix>
<c/c++ mangling> := function name mangling for c/c++
<vendor specific suffix> := `_` followed by token obtained from the tables below and prefixed with `M`

If multiple features are requested then those shall be appended in lexicographic order and prefixed with M. The mangled name shall contain a unique set of features (duplication of features is not allowed).

For example:

__attribute__((target_clones("crc32", "aes+sha1")))
int foo() { /* ... */ }

will produce these mangled names for the C language: foo.default, foo._Mcrc32, foo._MaesMsha1, while foo is a callable external symbol that leads to one of the versioned functions.

Mapping ^

The following table lists the architecture feature mapping for AArch32.

Priority Architecture name Name Dependent feature registers
0 N/A default N/A
90 CRC32 instructions crc ID_ISAR5.CRC32 == 0b0001
100 SHA1 instructions sha1 ID_ISAR5.SHA1 == 0b0001
110 SHA2 instructions sha2 ID_ISAR5.SHA2 == 0b0001
120 AES instructions aes ID_ISAR5.AES >= 0b0001
130 VMULL (polynomial) instructions vmull ID_ISAR5.AES == 0b0010

The following table lists the architecture feature mapping for AArch64.

Priority Architecture name Name Dependent feature registers
0 N/A default N/A
10 FEAT_RNG rng ID_AA64ISAR0_EL1.RNDR >= 0b0001
20 FEAT_FlagM flagm ID_AA64ISAR0_EL1.TS >= 0b0001
30 FEAT_FlagM2 flagm2 ID_AA64ISAR0_EL1.TS >= 0b0010
80 FEAT_LSE lse ID_AA64ISAR0_EL1.Atomic >= 0b0010
90 Floating-point fp ID_AA64PFR0_EL1.FP != 0b1111
100 FEAT_AdvSIMD simd ID_AA64PFR0_EL1.AdvSIMD != 0b1111
104 FEAT_DotProd dotprod ID_AA64ISAR0_EL1.DP >= 0b0001
106 FEAT_SM3, FEAT_SM4 sm4 ID_AA64ISAR0_EL1.SM4 >= 0b0001
108 FEAT_RDM rdm, rdma ID_AA64ISAR0_EL1.RDM >= 0b0001
110 FEAT_CRC32 crc ID_AA64ISAR0_EL1.CRC32 >= 0b0001
130 FEAT_SHA1, FEAT_SHA256 sha2 ID_AA64ISAR0_EL1.SHA2 >= 0b0001
140 FEAT_SHA512, FEAT_SHA3 sha3 ID_AA64ISAR0_EL1.SHA3 >= 0b0001
150 FEAT_AES aes ID_AA64ISAR0_EL1.AES >= 0b0001
160 FEAT_PMULL pmull ID_AA64ISAR0_EL1.AES >= 0b0010
170 FEAT_FP16 fp16 ID_AA64PFR0_EL1.FP == 0b0001
175 FEAT_FHM fp16fml ID_AA64ISAR0_EL1.FHM >= 0b0001
180 FEAT_DIT dit ID_AA64PFR0_EL1.DIT >= 0b0001
190 FEAT_DPB dpb ID_AA64ISAR1_EL1.DPB >= 0b0001
200 FEAT_DPB2 dpb2 ID_AA64ISAR1_EL1.DPB >= 0b0010
210 FEAT_JSCVT jscvt ID_AA64ISAR1_EL1.JSCVT >= 0b0001
220 FEAT_FCMA fcma ID_AA64ISAR1_EL1.FCMA >= 0b0001
230 FEAT_LRCPC rcpc ID_AA64ISAR1_EL1.LRCPC >= 0b0001
240 FEAT_LRCPC2 rcpc2 ID_AA64ISAR1_EL1.LRCPC >= 0b0010
241 FEAT_LRCPC3 rcpc3 ID_AA64ISAR1_EL1.LRCPC >= 0b0011
250 FEAT_FRINTTS frintts ID_AA64ISAR1_EL1.FRINTTS >= 0b0001
260 FEAT_DGH dgh ID_AA64ISAR1_EL1.DGH >= 0b0001
270 FEAT_I8MM i8mm ID_AA64ISAR1_EL1.I8MM >= 0b0001
280 FEAT_BF16 bf16 ID_AA64ISAR1_EL1.BF16 >= 0b0001
290 FEAT_EBF16 ebf16 ID_AA64ISAR1_EL1.BF16 >= 0b0010
300 FEAT_RPRES rpres ID_AA64ISAR2_EL1.RPRES >= 0b0001
310 FEAT_SVE sve ID_AA64PFR0_EL1.SVE >= 0b0001
320 FEAT_BF16 sve-bf16 ID_AA64ZFR0_EL1.BF16 >= 0b0001
330 FEAT_EBF16 sve-ebf16 ID_AA64ZFR0_EL1.BF16 >= 0b0010
340 FEAT_I8MM sve-i8mm ID_AA64ZFR0_EL1.I8MM >= 0b0001
350 FEAT_F32MM f32mm ID_AA64ZFR0_EL1.F32MM >= 0b0001
360 FEAT_F64MM f64mm ID_AA64ZFR0_EL1.F64MM >= 0b0001
370 FEAT_SVE2 sve2 ID_AA64ZFR0_EL1.SVEver >= 0b0001
380 FEAT_SVE_AES sve2-aes ID_AA64ZFR0_EL1.AES >= 0b0001
390 FEAT_SVE_PMULL128 sve2-pmull128 ID_AA64ZFR0_EL1.AES >= 0b0010
400 FEAT_SVE_BitPerm sve2-bitperm ID_AA64ZFR0_EL1.BitPerm >= 0b0001
410 FEAT_SVE_SHA3 sve2-sha3 ID_AA64ZFR0_EL1.SHA3 >= 0b0001
420 FEAT_SM3, FEAT_SVE_SM4 sve2-sm4 ID_AA64ZFR0_EL1.SM4 >= 0b0001
430 FEAT_SME sme ID_AA64PFR1_EL1.SME >= 0b0001
440 FEAT_MTE memtag ID_AA64PFR1_EL1.MTE >= 0b0001
450 FEAT_MTE2 memtag2 ID_AA64PFR1_EL1.MTE >= 0b0010
460 FEAT_MTE3 memtag3 ID_AA64PFR1_EL1.MTE >= 0b0011
470 FEAT_SB sb ID_AA64ISAR1_EL1.SB >= 0b0001
480 FEAT_SPECRES predres ID_AA64ISAR1_EL1.SPECRES >= 0b0001
490 FEAT_SSBS ssbs ID_AA64PFR1_EL1.SSBS >= 0b0001
500 FEAT_SSBS2 ssbs2 ID_AA64PFR1_EL1.SSBS >= 0b0010
510 FEAT_BTI bti ID_AA64PFR1_EL1.BT >= 0b0001
520 FEAT_LS64, FEAT_LS64_V, FEAT_LS64_ACCDATA ls64 ID_AA64ISAR1_EL1.LS64 >= 0b0011
550 FEAT_WFxT wfxt ID_AA64ISAR2_EL1.WFxT >= 0b0010
560 FEAT_SME_F64F64 sme-f64f64 ID_AA64SMFR0_EL1.F64F64 == 0b1
570 FEAT_SME_I16I64 sme-i16i64 ID_AA64SMFR0_EL1.I16I64 == 0b1111
580 FEAT_SME2 sme2 ID_AA64PFR1_EL1.SMEver >= 0b0001
650 FEAT_MOPS mops ID_AA64ISAR2_EL1.MOPS >= 0b0001

Selection ^

The following rules shall be followed by all implementations:

  1. Implementation of the selection algorithm is platform dependent, where platform means CPU/Vendor/OS as in the target triplet.
  2. The selection is permanent for the lifetime of the process.
  3. Only those versions can be considered for which all dependent features are available.

Rules of version selection are in order:

  1. Select the most specific version, else
  2. select the version with the highest priority, else
  3. select "default" if no other version is suitable.

Weak linkage ^

__attribute__((weak)) can be attached to declarations and definitions to indicate that they have weak static linkage (STB_WEAK in ELF objects). As definitions, they can be overridden by other definitions of the same symbol. As references, they do not need to be satisfied and will be resolved to zero if a definition is not present.

Patchable constants ^

In addition, this specification requires that weakly defined initialized constants are not used for constant propagation, allowing the value to be safely changed by patching after the object is produced.

Alignment ^

The new standards for C [C11] (6.7.5) and C++ [CPP11] (7.6.2) add syntax for aligning objects and types. ACLE provides an alternative syntax described in this section.

Alignment attribute ^

__attribute__((aligned(N))) can be associated with data, functions, types and fields. N must be an integral constant expression and must be a power of 2, for example 1, 2, 4, 8. The maximum alignment depends on the storage class of the object being aligned. The size of a data type is always a multiple of its alignment. This is a consequence of the rule in C that the spacing between array elements is equal to the element size.

The aligned attribute does not act as a type qualifier. For example, given

  char x __attribute__((aligned(8)));
  int y __attribute__((aligned(1)));

the type of &x is char * and the type of &y is int *. The following declarations are equivalent:

  struct S x __attribute__((aligned(16))); /* ACLE */

  struct S _Alignas(16) x; /* C11 */

  #include <stdalign.h> /* C11 (alternative) */
  struct S alignas(16) x;

  struct S alignas(16) x; /* C++11 */

Alignment of static objects ^

The macro __ARM_ALIGN_MAX_PWR indicates (as the exponent of a power of 2) the maximum available alignment of static data – for example 4 for 16-byte alignment. So the following is always valid:

  int x __attribute__((aligned(1 << __ARM_ALIGN_MAX_PWR)));

or, using the C11/C++11 syntax:

  alignas(1 << __ARM_ALIGN_MAX_PWR) int x;

Since an alignment request on an object does not change its type or size, x in this example would have type int and size 4.

There is in principle no limit on the alignment of static objects, within the constraints of available memory. In the Arm ABI an object with a requested alignment would go into an ELF section with at least as strict an alignment requirement. However, an implementation supporting position-independent dynamic objects or overlays may need to place restrictions on their alignment demands.

Alignment of stack objects ^

It must be possible to align any local object up to the stack alignment as specified in the AAPCS for AArch32 (that is, 8 bytes) or as specified in AAPCS64 for AArch64 (that is, 16 bytes), this being also the maximal alignment of any native type.

An implementation may, but is not required to, permit the allocation of local objects with greater alignment, for example 16 or 32 bytes for AArch32. (This would involve some runtime adjustment such that the object address was not a fixed offset from the stack pointer on entry.)

If a program requests alignment greater than the implementation supports, it is recommended that the compiler warn but not fault this. Programmers should expect over-alignment of local objects to be treated as a hint.

The macro __ARM_ALIGN_MAX_STACK_PWR indicates (as the exponent of a power of 2) the maximum available stack alignment. For example, a value of 3 indicates 8-byte alignment.
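
For example, using the C11 syntax, a local buffer can request the maximum supported stack alignment:

  #include <stdalign.h>

  void f(void) {
      /* Aligned to the strictest alignment the stack supports. */
      alignas(1 << __ARM_ALIGN_MAX_STACK_PWR) char buf[64];
      (void)buf; /* placeholder use */
  }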

Procedure calls ^

For procedure calls, where a parameter has aligned type, data should be passed as if it were a basic type of the given size and alignment. For example, given the aligned type:

  struct S { int a[2]; } __attribute__((aligned(8)));

the second argument of:

  f(int, struct S);

should be passed as if it were:

  f(int, long long);

which means that in AArch32 AAPCS the second parameter is in R2/R3 rather than R1/R2.

Alignment of C heap storage ^

The standard C allocation functions [C99] (7.20.3), such as malloc(), return storage aligned to the normal maximal alignment, that is, the largest alignment of any (standard) type.

Implementations may, but are not required to, provide a function to return heap storage of greater alignment. Suitable functions are:

  int posix_memalign(void **memptr, size_t alignment, size_t size);

as defined in [POSIX], or:

  void *aligned_alloc(size_t alignment, size_t size);

as defined in [C11] (7.22.3.1).
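
A short usage sketch of the C11 form (in C11 the requested size should be a multiple of the alignment):

  #include <stdlib.h>

  void example(void) {
      void *p = aligned_alloc(64, 1024); /* 1024 bytes, 64-byte aligned */
      /* ... */
      free(p);
  }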

Alignment of C++ heap allocation ^

In C++, an allocation (with new) knows the object’s type. If the type is aligned, the allocation should also be aligned. There are two cases to consider depending on whether the user has provided an allocation function.

If the user has provided an allocation function for an object or array of over-aligned type, it is that function’s responsibility to return suitably aligned storage. The size requested by the runtime library will be a multiple of the alignment (trivially so, for the non-array case).

(The AArch32 C++ ABI does not explicitly deal with the runtime behavior when dealing with arrays of alignment greater than 8. In this situation, any cookie will be 8 bytes as usual, immediately preceding the array; this means that the cookie is not necessarily at the address seen by the allocation and deallocation functions. Implementations will need to make some adjustments before and after calls to the ABI-defined C++ runtime, or may provide additional non-standard runtime helper functions.) Example:

  struct float4 {
    void *operator new[](size_t s) {
      void *p;
      posix_memalign(&p, 16, s);
      return p;
    }
    float data[4];
  } __attribute__((aligned(16)));

If the user has not provided their own allocation function, the behavior is implementation-defined.

The generic Itanium C++ ABI, which is used for AArch64, already handles arrays with arbitrarily aligned elements.

Scalable Vector Extension procedure call standard attribute ^

On SVE-enabled AArch64 targets, the [AAPCS64] allows procedure calls to use the SVE calling convention. If a subroutine takes at least one argument in scalable vector registers or scalable predicate registers, or if the subroutine is a function that returns results in such registers, the subroutine must ensure that the entire contents of z8-z23 and p4-p15 are preserved across the call. This calling convention is described in sections 6.1.3 and 6.1.4 of AAPCS64 (see release 2022Q1).

The ACLE allows this to be enforced per function and adds the following function attribute to a function declaration or definition:

    __attribute__(("aarch64_sve_pcs"))

Other attributes ^

The following attributes should be supported and their definitions follow [GCC]. These attributes are not specific to Arm or the Arm ABI.

alias, common, nocommon, noinline, packed, section, visibility, weak

Some specific requirements on the weak attribute are detailed in Weak linkage.

Synchronization, barrier, and hint intrinsics ^

Introduction ^

This section provides intrinsics for managing data that may be accessed concurrently between processors, or between a processor and a device. Some intrinsics atomically update data, while others place barriers around accesses to data to ensure that accesses are visible in the correct order.

Memory prefetch intrinsics are also described in this section.

Atomic update primitives ^

C/C++ standard atomic primitives ^

The new C and C++ standards [C11] (7.17), [CPP11] (clause 29) provide a comprehensive library of atomic operations and barriers, including operations to read and write data with particular ordering requirements. Programmers are recommended to use this where available.

IA-64/GCC atomic update primitives ^

The __sync family of intrinsics (introduced in [IA-64] (section 7.4), and as documented in the GCC documentation) may be provided, especially if the C/C++ atomics are not available, and are recommended as being portable and widely understood. These may be expanded inline, or call library functions. Note that, unusually, these intrinsics are polymorphic: they will specialize to instructions suitable for the size of their arguments.
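
For example, an atomic post-increment can be written with one of these primitives (the counter and function names are illustrative):

  static unsigned int counter;

  /* Atomically add 1 to counter and return its previous value. */
  unsigned int next_ticket(void) {
      return __sync_fetch_and_add(&counter, 1u);
  }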

Memory barriers ^

Memory barriers ensure specific ordering properties between memory accesses. For more details on memory barriers, see [ARMARM] (A3.8.3). The intrinsics in this section are available for all targets. They may be no-ops (that is, generate no code, but possibly act as a code motion barrier in compilers) on targets where the relevant instructions do not exist, but only if the property they guarantee would have held anyway. On targets where the relevant instructions exist but are implemented as no-ops, these intrinsics generate the instructions.

The memory barrier intrinsics take a numeric argument indicating the scope and access type of the barrier, as shown in the following table. (The assembler mnemonics for these numbers, as shown in the table, are not available in the intrinsics.) The argument should be an integral constant expression within the required range; see Constant arguments to intrinsics.

Argument Mnemonic Domain Ordered Accesses (before-after)
15 SY Full system Any-Any
14 ST Full system Store-Store
13 LD Full system Load-Load, Load-Store
11 ISH Inner shareable Any-Any
10 ISHST Inner shareable Store-Store
9 ISHLD Inner shareable Load-Load, Load-Store
7 NSH or UN Non-shareable Any-Any
6 NSHST Non-shareable Store-Store
5 NSHLD Non-shareable Load-Load, Load-Store
3 OSH Outer shareable Any-Any
2 OSHST Outer shareable Store-Store
1 OSHLD Outer shareable Load-Load, Load-Store

The following memory barrier intrinsics are available:

    void __dmb(/*constant*/ unsigned int);

Generates a DMB (data memory barrier) instruction or equivalent CP15 instruction. DMB ensures the observed ordering of memory accesses. Memory accesses of the specified type issued before the DMB are guaranteed to be observed (in the specified scope) before memory accesses issued after the DMB. For example, DMB should be used between storing data, and updating a flag variable that makes that data available to another core.

The __dmb() intrinsic also acts as a compiler memory barrier of the appropriate type.

  void __dsb(/*constant*/ unsigned int);

Generates a DSB (data synchronization barrier) instruction or equivalent CP15 instruction. DSB ensures the completion of memory accesses. A DSB behaves as the equivalent DMB and has additional properties. After a DSB instruction completes, all memory accesses of the specified type issued before the DSB are guaranteed to have completed.

The __dsb() intrinsic also acts as a compiler memory barrier of the appropriate type.

  void __isb(/*constant*/ unsigned int);

Generates an ISB (instruction synchronization barrier) instruction or equivalent CP15 instruction. This instruction flushes the processor pipeline fetch buffers, so that following instructions are fetched from cache or memory. An ISB is needed after some system maintenance operations.

An ISB is also needed before transferring control to code that has been loaded or modified in memory, for example by an overlay mechanism or just-in-time code generator. (Note that if instruction and data caches are separate, privileged cache maintenance operations would be needed in order to unify the caches.)

The only supported argument for the __isb() intrinsic is 15, corresponding to the SY (full system) scope of the ISB instruction.

Examples ^

In this example, process P1 makes some data available to process P2 and sets a flag to indicate this.

  P1:

    value = x;
    /* issue full-system memory barrier for previous store:
       setting of flag is guaranteed not to be observed before
       write to value */
    __dmb(14);
    flag = true;

  P2:

    /* busy-wait until the data is available */
    while (!flag) {}
    /* issue full-system memory barrier: read of value is guaranteed
       not to be observed by memory system before read of flag */
    __dmb(15);
    /* use value */;

In this example, process P1 makes data available to P2 by putting it on a queue.

  P1:

    work = new WorkItem;
    work->payload = x;
    /* issue full-system memory barrier for previous store:
       consumer cannot observe work item on queue before write to
       work item's payload */
    __dmb(14);
    queue_head = work;

  P2:

    /* busy-wait until work item appears */
    while (!(work = queue_head)) {}
    /* no barrier needed: load of payload is data-dependent */
    /* use work->payload */

Hints ^

The intrinsics in this section are available for all targets. They may be no-ops (that is, generate no code, but possibly act as a code motion barrier in compilers) on targets where the relevant instructions do not exist. On targets where the relevant instructions exist but are implemented as no-ops, these intrinsics generate the instructions.

  void __wfi(void);

Generates a WFI (wait for interrupt) hint instruction, or nothing. The WFI instruction allows (but does not require) the processor to enter a low-power state until one of a number of asynchronous events occurs.

  void __wfe(void);

Generates a WFE (wait for event) hint instruction, or nothing. The WFE instruction allows (but does not require) the processor to enter a low-power state until some event occurs such as a SEV being issued by another processor.

  void __sev(void);

Generates a SEV (send a global event) hint instruction. This causes an event to be signaled to all processors in a multiprocessor system. It is a NOP on a uniprocessor system.

  void __sevl(void);

Generates a send a local event hint instruction. This causes an event to be signaled to only the processor executing this instruction. In a multiprocessor system, it is not required to affect the other processors.

  void __yield(void);

Generates a YIELD hint instruction. This enables multithreading software to indicate to the hardware that it is performing a task, for example a spin-lock, that could be swapped out to improve overall system performance.

  void __dbg(/*constant*/ unsigned int);

Generates a DBG instruction. This provides a hint to debugging and related systems. The argument must be a constant integer from 0 to 15 inclusive. See implementation documentation for the effect (if any) of this instruction and the meaning of the argument. This is available only when compiling for AArch32.

Swap ^

__swp is available for all targets. This intrinsic expands to a sequence equivalent to the deprecated (and possibly unavailable) SWP instruction.

  uint32_t __swp(uint32_t, volatile void *);

Unconditionally stores a new value at the given address, and returns the old value.

As with the IA-64/GCC primitives described in IA-64/GCC atomic update primitives, the __swp intrinsic is polymorphic. The second argument must provide the address of a byte-sized object or an aligned word-sized object and it must be possible to determine the size of this object from the argument expression.

This intrinsic is implemented by LDREX/STREX (or LDREXB/STREXB) where available, as if by

  uint32_t __swp(uint32_t x, volatile uint32_t *p) {
    uint32_t v;
    /* use LDREX/STREX intrinsics not specified by ACLE */
    do v = __ldrex(p); while (__strex(x, p));
    return v;
  }

or alternatively,

  uint32_t __swp(uint32_t x, uint32_t *p) {
    uint32_t v;
    /* use IA-64/GCC atomic builtins */
    do v = *p; while (!__sync_bool_compare_and_swap(p, v, x));
    return v;
  }

It is recommended that compilers should produce a downgradeable/upgradeable warning on encountering the __swp intrinsic.

Only if load-store exclusive instructions are not available will the intrinsic use the SWP/SWPB instructions.

It is strongly recommended to use standard and flexible atomic primitives such as those available in the C++ <atomic> header. `__swp` is provided solely to allow straightforward (and possibly automated) replacement of explicit use of SWP in inline assembler. SWP is obsolete in the Arm architecture, and in recent versions of the architecture, may be configured to be unavailable in user-mode. (Aside: unconditional atomic swap is also less powerful as a synchronization primitive than load-exclusive/store-conditional.)

Memory prefetch intrinsics ^

Intrinsics are provided to prefetch data or instructions. The size of the data or function is ignored. Note that the intrinsics may be implemented as no-ops (that is, not generate a prefetch instruction, if none is available). Also, even where the architecture does provide a prefetch instruction, a particular implementation may implement the instruction as a no-op (that is, the instruction has no effect).

Data prefetch ^

  void __pld(void const volatile *addr);

Generates a data prefetch instruction, if available. The argument should be any expression that may designate a data address. The data is prefetched to the innermost level of cache, for reading.

  void __pldx(/*constant*/ unsigned int /*access_kind*/,
              /*constant*/ unsigned int /*cache_level*/,
              /*constant*/ unsigned int /*retention_policy*/,
              void const volatile *addr);

Generates a data prefetch instruction. This intrinsic allows the specification of the expected access kind (read or write), the cache level to load the data to, and the data retention policy (temporal or streaming). The relevant arguments can only be one of the following values.

Access Kind Value Summary
PLD 0 Fetch the addressed location for reading
PST 1 Fetch the addressed location for writing
Cache Level Value Summary
L1 0 Fetch the addressed location to L1 cache
L2 1 Fetch the addressed location to L2 cache
L3 2 Fetch the addressed location to L3 cache
SLC 3 Fetch the addressed location to the System-Level Cache
Retention Policy Value Summary
KEEP 0 Temporal fetch of the addressed location (that is, allocate in cache normally)
STRM 1 Streaming fetch of the addressed location (that is, memory used only once)
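
For example, the following sketch prefetches the next cache line of a buffer for writing, into the L1 cache, with temporal (KEEP) retention; it assumes a 64-byte line and that <arm_acle.h> declares the intrinsic:

  #include <arm_acle.h>

  void touch_next_line(char *buf) {
      __pldx(1 /* PST: for writing */,
             0 /* L1 */,
             0 /* KEEP */,
             buf + 64);
  }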

Instruction prefetch ^

  void __pli(T addr);

Generates a code prefetch instruction, if available. If a specific code prefetch instruction is not available, this intrinsic may generate a data-prefetch instruction to fetch the addressed code to the innermost level of unified cache. It will not fetch code to data-cache in a split cache level.

  void __plix(/*constant*/ unsigned int /*cache_level*/,
              /*constant*/ unsigned int /*retention_policy*/,
              T addr);

Generates a code prefetch instruction. This intrinsic allows the specification of the cache level to load the code to and the retention policy (temporal or streaming). The relevant arguments can have the same values as in __pldx.

The cache level and retention policy arguments to __pldx and __plix are ignored on targets that do not support them.

NOP ^

  void __nop(void);

Generates an unspecified no-op instruction. Note that not all architectures provide a distinguished NOP instruction. On those that do, it is unspecified whether this intrinsic generates it or another instruction. It is not guaranteed that inserting this instruction will increase execution time.

Data-processing intrinsics ^

The intrinsics in this section are provided for algorithm optimization.

The <arm_acle.h> header should be included before using these intrinsics.

Implementations are not required to introduce precisely the instructions whose names match the intrinsics. However, implementations should aim to ensure that a computation expressed compactly with intrinsics will generate a similarly compact sequence of machine code. In general, C’s as-if rule [C99] (5.1.2.3) applies, meaning that the compiled code must behave as if the instruction had been generated.

In general, these intrinsics are aimed at DSP algorithm optimization on M-profile and R-profile. Use on A-profile is deprecated. However, the miscellaneous intrinsics and CRC32 intrinsics described in Miscellaneous data-processing intrinsics and CRC32 intrinsics respectively are suitable for all profiles.

Programmer’s model of global state ^

The Q (saturation) flag ^

The Q flag is a cumulative (sticky) saturation bit in the APSR (Application Program Status Register) indicating that an operation saturated, or in some cases, overflowed. It is set on saturation by most intrinsics in the DSP and SIMD intrinsic sets, though some SIMD intrinsics feature saturating operations which do not set the Q flag.

[AAPCS] (5.1.1) states:

The N, Z, C, V and Q flags (bits 27-31) and the GE[3:0] bits (bits 16-19) are undefined on entry to or return from a public interface.

Note that this does not state that these bits (in particular the Q flag) are undefined across any C/C++ function call boundary, only across a public interface. The Q and GE bits could be manipulated in well-defined ways by local functions, for example when constructing functions to be used in DSP algorithms.

Implementations must avoid introducing instructions (such as SSAT/USAT, or SMLABB) which affect the Q flag, if the programmer is testing whether the Q flag was set by explicit use of intrinsics and if the implementation’s introduction of an instruction may affect the value seen. The implementation might choose to model the definition and use (liveness) of the Q flag in the way that it models the liveness of any visible variable, or it might suppress introduction of Q-affecting instructions in any routine in which the Q flag is tested.

ACLE does not define how or whether the Q flag is preserved across function call boundaries. (This is seen as an area for future specification.)

In general, the Q flag should appear to C/C++ code in a similar way to the standard floating-point cumulative exception flags, as global (or thread-local) state that can be tested, set or reset through an API.

The following intrinsics are available when __ARM_FEATURE_QBIT is defined:

int __saturation_occurred(void);

Returns 1 if the Q flag is set, 0 if not.

void __set_saturation_occurred(int);

Sets or resets the Q flag according to the LSB of the value. __set_saturation_occurred(0) might be used before performing a sequence of operations after which the Q flag is tested. (In general, the Q flag cannot be assumed to be unset at the start of a function.)

void __ignore_saturation(void);

This intrinsic is a hint and may be ignored. It indicates to the compiler that the value of the Q flag is not live (needed) at or subsequent to the program point at which the intrinsic occurs. It may allow the compiler to remove preceding instructions, or to change the instruction sequence in such a way as to result in a different value of the Q flag. (A specific example is that it may recognize clipping idioms in C code and implement them with an instruction such as SSAT that may set the Q flag.)
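
As an illustration, the following is a minimal sketch of testing the Q flag around a sequence of saturating operations. It assumes that __ARM_FEATURE_QBIT and __ARM_FEATURE_DSP are defined and that the implementation preserves the Q flag across these operations; the function name is illustrative, not part of ACLE:

  #include <arm_acle.h>

  /* Sums an array with saturating additions and reports whether any
     addition saturated. Sketch only. */
  int sum_saturated(const int32_t *x, int n, int32_t *result) {
    int32_t acc = 0;
    __set_saturation_occurred(0);     /* start with the Q flag clear */
    for (int i = 0; i < n; i++)
      acc = __qadd(acc, x[i]);        /* sets Q if the addition saturates */
    *result = acc;
    return __saturation_occurred();   /* 1 if any addition saturated */
  }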

The GE flags ^

The GE (Greater than or Equal to) flags are four bits in the APSR. They are used with the 32-bit SIMD intrinsics described in 32-bit SIMD Operations.

There are four GE flags, one for each 8-bit lane of a 32-bit SIMD operation. Certain non-saturating 32-bit SIMD intrinsics set the GE bits to indicate overflow of addition or subtraction. For 4x8-bit operations, one GE bit is set for each byte. For 2x16-bit operations, the GE bits are paired: one pair for the high halfword and the other pair for the low halfword. The only supported way to read or use the GE bits (in this specification) is the __sel intrinsic; see Parallel selection.

Floating-point environment ^

An implementation should implement the features of <fenv.h> for accessing the floating-point runtime environment. Programmers should use this rather than accessing the VFP FPSCR directly. For example, on a target supporting VFP the cumulative exception flags (for example IXC, OFC) can be read from the FPSCR by using the fetestexcept() function, and the rounding mode (RMode) bits can be read using the fegetround() function.

ACLE does not support changing the DN, FZ or AHP bits at runtime.

VFP short vector mode (enabled by setting the Stride and Len bits) is deprecated, and is unavailable on later VFP implementations. ACLE provides no support for this mode.
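
For example, the following sketch uses only standard <fenv.h> functionality to read the cumulative overflow flag and the current rounding mode without touching the FPSCR directly; the function names are illustrative:

  #include <fenv.h>

  #pragma STDC FENV_ACCESS ON

  /* Returns nonzero if the cumulative overflow (OFC) flag is set. */
  int overflow_occurred(void) { return fetestexcept(FE_OVERFLOW) != 0; }

  /* Returns the current rounding mode, for example FE_TONEAREST. */
  int current_rounding(void) { return fegetround(); }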

Miscellaneous data-processing intrinsics ^

The following intrinsics perform general data-processing operations. They have no effect on global state.

[Note: documentation of the __nop intrinsic has moved to NOP]

For completeness, and to aid portability between LP64 and LLP64 models, ACLE also defines intrinsics with an l suffix.

  uint32_t __ror(uint32_t x, uint32_t y);
  unsigned long __rorl(unsigned long x, uint32_t y);
  uint64_t __rorll(uint64_t x, uint32_t y);

Rotates the argument x right by y bits. y can take any value. These intrinsics are available on all targets.

  unsigned int __clz(uint32_t x);
  unsigned int __clzl(unsigned long x);
  unsigned int __clzll(uint64_t x);

Returns the number of leading zero bits in x. When x is zero it returns the argument width, that is, 32 or 64. These intrinsics are available on all targets. On targets without the CLZ instruction it should be implemented as an instruction sequence or a call to such a sequence. A suitable sequence can be found in Warren (fig. 5-7). Hardware support for these intrinsics is indicated by __ARM_FEATURE_CLZ.

  unsigned int __cls(uint32_t x);
  unsigned int __clsl(unsigned long x);
  unsigned int __clsll(uint64_t x);

Returns the number of leading sign bits in x. When x is zero it returns the argument width - 1, that is, 31 or 63. These intrinsics are available on all targets. On targets without the CLZ instruction it should be implemented as an instruction sequence or a call to such a sequence. Fast hardware implementation (using a CLS instruction or a short code sequence involving the CLZ instruction) is indicated by __ARM_FEATURE_CLZ.

  uint32_t __rev(uint32_t);
  unsigned long __revl(unsigned long);
  uint64_t __revll(uint64_t);

Reverses the byte order within a word or doubleword. These intrinsics are available on all targets and should be expanded to an efficient straight-line code sequence on targets without byte reversal instructions.

  uint32_t __rev16(uint32_t);
  unsigned long __rev16l(unsigned long);
  uint64_t __rev16ll(uint64_t);

Reverses the byte order within each halfword of a word. For example, 0x12345678 becomes 0x34127856. These intrinsics are available on all targets and should be expanded to an efficient straight-line code sequence on targets without byte reversal instructions.

  int16_t __revsh(int16_t);

Reverses the byte order in a 16-bit value and returns the signed 16-bit result. For example, 0x0080 becomes 0x8000. This intrinsic is available on all targets and should be expanded to an efficient straight-line code sequence on targets without byte reversal instructions.

  uint32_t __rbit(uint32_t x);
  unsigned long __rbitl(unsigned long x);
  uint64_t __rbitll(uint64_t x);

Reverses the bits in x. These intrinsics are only available on targets with the RBIT instruction.

Examples ^

  #ifdef __ARM_BIG_ENDIAN
  #define htonl(x) (uint32_t)(x)
  #define htons(x) (uint16_t)(x)
  #else /* little-endian */
  #define htonl(x) __rev(x)
  #define htons(x) (uint16_t)__revsh(x)
  #endif /* endianness */
  #define ntohl(x) htonl(x)
  #define ntohs(x) htons(x)

  /* Count leading sign bits */
  inline unsigned int count_sign(int32_t x) { return __clz(x ^ (x << 1)); }

  /* Count trailing zeroes */
  inline unsigned int count_trail(uint32_t x) {
  #if (__ARM_ARCH >= 6 && __ARM_ISA_THUMB >= 2) || __ARM_ARCH >= 7
  /* RBIT is available */
    return __clz(__rbit(x));
  #else
    unsigned int n = __clz(x & -x);   /* isolate the lowest set bit; n = 31 - its position, or 32 if x == 0 */
    return n == 32 ? n : (31 - n);
  #endif
  }

16-bit multiplications ^

The intrinsics in this section provide direct access to the 16x16 and 16x32 bit multiplies introduced in Armv5E. Compilers are also encouraged to exploit these instructions from C code. These intrinsics are available when __ARM_FEATURE_DSP is defined, and are not available on non-5E targets. These multiplies cannot overflow.

  int32_t __smulbb(int32_t, int32_t);

Multiplies two 16-bit signed integers, that is, the low halfwords of the operands.

  int32_t __smulbt(int32_t, int32_t);

Multiplies the low halfword of the first operand and the high halfword of the second operand.

  int32_t __smultb(int32_t, int32_t);

Multiplies the high halfword of the first operand and the low halfword of the second operand.

  int32_t __smultt(int32_t, int32_t);

Multiplies the high halfwords of the operands.

  int32_t __smulwb(int32_t, int32_t);

Multiplies the 32-bit signed first operand with the low halfword (as a 16-bit signed integer) of the second operand. Returns the top 32 bits of the 48-bit product.

  int32_t __smulwt(int32_t, int32_t);

Multiplies the 32-bit signed first operand with the high halfword (as a 16-bit signed integer) of the second operand. Returns the top 32 bits of the 48-bit product.
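
As an illustrative sketch, these intrinsics can compute the real part of a complex multiplication of two 16-bit complex values. The packing convention (real part in the low halfword, imaginary part in the high halfword) and the function name are assumptions made for this example:

  /* re(a*b) = re(a)*re(b) - im(a)*im(b). The multiplications cannot
     overflow, and the difference also fits in 32 bits. */
  int32_t cmplx_mul_real(int32_t a, int32_t b) {
    return __smulbb(a, b) - __smultt(a, b);
  }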

Saturating intrinsics ^

Width-specified saturation intrinsics ^

These intrinsics are available when __ARM_FEATURE_SAT is defined. They saturate a 32-bit value at a given bit position. The saturation width must be an integral constant expression; see Constant arguments to intrinsics.

  int32_t __ssat(int32_t, /*constant*/ unsigned int);

Saturates a signed integer to the given bit width in the range 1 to 32. For example, the result of saturation to 8-bit width will be in the range -128 to 127. The Q flag is set if the operation saturates.

  uint32_t __usat(int32_t, /*constant*/ unsigned int);

Saturates a signed integer to an unsigned (non-negative) integer of a bit width in the range 0 to 31. For example, the result of saturation to 8-bit width is in the range 0 to 255, with all negative inputs going to zero. The Q flag is set if the operation saturates.
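
For example, a minimal sketch of clipping a 32-bit value to narrower ranges; the helper names are illustrative:

  /* Clip to the signed 16-bit range -32768..32767. */
  int16_t clip_s16(int32_t x) { return (int16_t)__ssat(x, 16); }

  /* Clip to the unsigned 8-bit range 0..255. */
  uint8_t clip_u8(int32_t x)  { return (uint8_t)__usat(x, 8); }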

Saturating addition and subtraction intrinsics ^

These intrinsics are available when __ARM_FEATURE_DSP is defined.

The saturating intrinsics operate on 32-bit signed integer data. There are no special saturated or fixed point types.

  int32_t __qadd(int32_t, int32_t);

Adds two 32-bit signed integers, with saturation. Sets the Q flag if the addition saturates.

  int32_t __qsub(int32_t, int32_t);

Subtracts two 32-bit signed integers, with saturation. Sets the Q flag if the subtraction saturates.

  int32_t __qdbl(int32_t);

Doubles a signed 32-bit number, with saturation. __qdbl(x) is equal to __qadd(x,x) except that the argument x is evaluated only once. Sets the Q flag if the addition saturates.

Accumulating multiplications ^

These intrinsics are available when __ARM_FEATURE_DSP is defined.

  int32_t __smlabb(int32_t, int32_t, int32_t);

Multiplies two 16-bit signed integers, the low halfwords of the first two operands, and adds to the third operand. Sets the Q flag if the addition overflows. (Note that the addition is the usual 32-bit modulo addition which wraps on overflow, not a saturating addition. The multiplication cannot overflow.)

  int32_t __smlabt(int32_t, int32_t, int32_t);

Multiplies the low halfword of the first operand and the high halfword of the second operand, and adds to the third operand, as for __smlabb.

  int32_t __smlatb(int32_t, int32_t, int32_t);

Multiplies the high halfword of the first operand and the low halfword of the second operand, and adds to the third operand, as for __smlabb.

  int32_t __smlatt(int32_t, int32_t, int32_t);

Multiplies the high halfwords of the first two operands and adds to the third operand, as for __smlabb.

  int32_t __smlawb(int32_t, int32_t, int32_t);

Multiplies the 32-bit signed first operand with the low halfword (as a 16-bit signed integer) of the second operand. Adds the top 32 bits of the 48-bit product to the third operand. Sets the Q flag if the addition overflows. (See note for __smlabb).

  int32_t __smlawt(int32_t, int32_t, int32_t);

Multiplies the 32-bit signed first operand with the high halfword (as a 16-bit signed integer) of the second operand and adds the top 32 bits of the 48-bit result to the third operand as for __smlawb.

Examples ^

The ACLE DSP intrinsics can be used to define ETSI/ITU-T basic operations [G.191]:

    #include <arm_acle.h>
    inline int32_t L_add(int32_t x, int32_t y) { return __qadd(x, y); }
    inline int32_t L_negate(int32_t x) { return __qsub(0, x); }
    inline int32_t L_mult(int16_t x, int16_t y) { return __qdbl(x*y); }
    inline int16_t add(int16_t x, int16_t y) { return (int16_t)(__qadd(x<<16, y<<16) >> 16); }
    inline int16_t norm_l(int32_t x) { return __clz(x ^ (x<<1)) & 31; }
    ...

This example assumes the implementation preserves the Q flag on return from an inline function.

32-bit SIMD Operations ^

Availability ^

Armv6 introduced instructions to perform 32-bit SIMD operations (that is, two 16-bit operations or four 8-bit operations) on the Arm general-purpose registers. These instructions are not related to the much more versatile Advanced SIMD (Neon) extension, whose support is described in Advanced SIMD (Neon) intrinsics.

The 32-bit SIMD intrinsics are available on targets featuring Armv6 and upwards, including the A and R profiles. In the M profile they are available in the Armv7E-M architecture. Availability of the 32-bit SIMD intrinsics implies availability of the saturating intrinsics.

Availability of the SIMD intrinsics is indicated by the __ARM_FEATURE_SIMD32 predefine.

To access the intrinsics, the <arm_acle.h> header should be included.

Data types for 32-bit SIMD intrinsics ^

The header <arm_acle.h> should be included before using these intrinsics.

The SIMD intrinsics generally operate on and return 32-bit words consisting of two 16-bit or four 8-bit values. These are represented as int16x2_t and int8x4_t below for illustration. Some intrinsics also feature scalar accumulator operands and/or results.

When defining the intrinsics, implementations can define SIMD operands using a 32-bit integral type (such as unsigned int).

The header <arm_acle.h> defines typedefs int16x2_t, uint16x2_t, int8x4_t, and uint8x4_t. These should be defined as 32-bit integral types of the appropriate sign. There are no intrinsics provided to pack or unpack values of these types. This can be done with shifting and masking operations.
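
For example, a minimal sketch of packing and unpacking int16x2_t values using shifts and masks; the helper names are illustrative:

  #include <arm_acle.h>

  /* Pack two signed halfwords into an int16x2_t (low halfword first). */
  int16x2_t pack16x2(int16_t lo, int16_t hi) {
    return (int16x2_t)(((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo);
  }

  /* Extract the low and high halfwords of an int16x2_t. */
  int16_t low16(int16x2_t x)  { return (int16_t)(uint16_t)x; }
  int16_t high16(int16x2_t x) { return (int16_t)(uint16_t)((uint32_t)x >> 16); }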

Use of the Q flag by 32-bit SIMD intrinsics ^

Some 32-bit SIMD instructions may set the Q flag described in The Q (saturation) flag. The behavior of the intrinsics matches that of the instructions.

Generally, instructions that perform lane-by-lane saturating operations do not set the Q flag. For example, __qadd16 does not set the Q flag, even if saturation occurs in one or more lanes.

The explicit saturation operations __ssat and __usat set the Q flag if saturation occurs. Similarly, __ssat16 and __usat16 set the Q flag if saturation occurs in either lane.

Some instructions, such as __smlad, set the Q flag if overflow occurs on an accumulation, even though the accumulation is not a saturating operation (that is, does not clip its result to the limits of the type).

Parallel 16-bit saturation ^

These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. They saturate two 16-bit values to a given bit width as for the __ssat and __usat intrinsics defined in Width-specified saturation intrinsics.

  int16x2_t __ssat16(int16x2_t, /*constant*/ unsigned int);

Saturates two 16-bit signed values to a width in the range 1 to 16. The Q flag is set if either operation saturates.

  int16x2_t __usat16(int16x2_t, /*constant */ unsigned int);

Saturates two 16-bit signed values to a bit width in the range 0 to 15. The input values are signed and the output values are non-negative, with all negative inputs going to zero. The Q flag is set if either operation saturates.

Packing and unpacking ^

These intrinsics are available when __ARM_FEATURE_SIMD32 is defined.

  int16x2_t __sxtab16(int16x2_t, int8x4_t);

Two values (at bit positions 0..7 and 16..23) are extracted from the second operand, sign-extended to 16 bits, and added to the first operand.

  int16x2_t __sxtb16(int8x4_t);

Two values (at bit positions 0..7 and 16..23) are extracted from the first operand, sign-extended to 16 bits, and returned as the result.

  uint16x2_t __uxtab16(uint16x2_t, uint8x4_t);

Two values (at bit positions 0..7 and 16..23) are extracted from the second operand, zero-extended to 16 bits, and added to the first operand.

  uint16x2_t __uxtb16(uint8x4_t);

Two values (at bit positions 0..7 and 16..23) are extracted from the first operand, zero-extended to 16 bits, and returned as the result.

Parallel selection ^

This intrinsic is available when __ARM_FEATURE_SIMD32 is defined.

  uint8x4_t __sel(uint8x4_t, uint8x4_t);

Selects each byte of the result from either the first operand or the second operand, according to the values of the GE bits. For each result byte, if the corresponding GE bit is set then the byte from the first operand is used, otherwise the byte from the second operand is used. Because of the way that int16x2_t operations set two (duplicate) GE bits per value, the __sel intrinsic works equally well on (u)int16x2_t and (u)int8x4_t data.

Parallel 8-bit addition and subtraction ^

These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. Each intrinsic performs 8-bit parallel addition or subtraction. In some cases the result may be halved or saturated.

  int8x4_t __qadd8(int8x4_t, int8x4_t);

4x8-bit addition, saturated to the range -2**7 to 2**7-1.

  int8x4_t __qsub8(int8x4_t, int8x4_t);

4x8-bit subtraction, with saturation.

  int8x4_t __sadd8(int8x4_t, int8x4_t);

4x8-bit signed addition. The GE bits are set according to the results.

  int8x4_t __shadd8(int8x4_t, int8x4_t);

4x8-bit signed addition, halving the results.

  int8x4_t __shsub8(int8x4_t, int8x4_t);

4x8-bit signed subtraction, halving the results.

  int8x4_t __ssub8(int8x4_t, int8x4_t);

4x8-bit signed subtraction. The GE bits are set according to the results.

  uint8x4_t __uadd8(uint8x4_t, uint8x4_t);

4x8-bit unsigned addition. The GE bits are set according to the results.

  uint8x4_t __uhadd8(uint8x4_t, uint8x4_t);

4x8-bit unsigned addition, halving the results.

  uint8x4_t __uhsub8(uint8x4_t, uint8x4_t);

4x8-bit unsigned subtraction, halving the results.

  uint8x4_t __uqadd8(uint8x4_t, uint8x4_t);

4x8-bit unsigned addition, saturating to the range 0 to 2**8-1.

  uint8x4_t __uqsub8(uint8x4_t, uint8x4_t);

4x8-bit unsigned subtraction, saturating to the range 0 to 2**8-1.

  uint8x4_t __usub8(uint8x4_t, uint8x4_t);

4x8-bit unsigned subtraction. The GE bits are set according to the results.

Sum of 8-bit absolute differences ^

These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. They perform an 8-bit sum-of-absolute differences operation, typically used in motion estimation.

  uint32_t __usad8(uint8x4_t, uint8x4_t);

Performs 4x8-bit unsigned subtraction, and adds the absolute values of the differences together, returning the result as a single unsigned integer.

  uint32_t __usada8(uint8x4_t, uint8x4_t, uint32_t);

Performs 4x8-bit unsigned subtraction, adds the absolute values of the differences together, and adds the result to the third operand.
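
For example, a minimal sketch of a sum of absolute differences over a row of 16 pixels. The function name is illustrative, and the rows are assumed to be 4-byte aligned so that they can be accessed as uint8x4_t:

  uint32_t sad_row16(const uint8x4_t *a, const uint8x4_t *b) {
    uint32_t acc = __usad8(a[0], b[0]);    /* first group of 4 pixels */
    for (int i = 1; i < 4; i++)
      acc = __usada8(a[i], b[i], acc);     /* accumulate the remaining groups */
    return acc;
  }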

Parallel 16-bit addition and subtraction ^

These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. Each intrinsic performs 16-bit parallel addition and/or subtraction. In some cases the result may be halved or saturated.

  int16x2_t __qadd16(int16x2_t, int16x2_t);

2x16-bit addition, saturated to the range -2**15 to 2**15-1.

  int16x2_t __qasx(int16x2_t, int16x2_t);

Exchanges halfwords of second operand, adds high halfwords and subtracts low halfwords, saturating in each case.

  int16x2_t __qsax(int16x2_t, int16x2_t);

Exchanges halfwords of second operand, subtracts high halfwords and adds low halfwords, saturating in each case.

  int16x2_t __qsub16(int16x2_t, int16x2_t);

2x16-bit subtraction, with saturation.

  int16x2_t __sadd16(int16x2_t, int16x2_t);

2x16-bit signed addition. The GE bits are set according to the results.

  int16x2_t __sasx(int16x2_t, int16x2_t);

Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords. The GE bits are set according to the results.

  int16x2_t __shadd16(int16x2_t, int16x2_t);

2x16-bit signed addition, halving the results.

  int16x2_t __shasx(int16x2_t, int16x2_t);

Exchanges halfwords of the second operand, adds high halfwords and subtract low halfwords, halving the results.

  int16x2_t __shsax(int16x2_t, int16x2_t);

Exchanges halfwords of the second operand, subtracts high halfwords and add low halfwords, halving the results.

  int16x2_t __shsub16(int16x2_t, int16x2_t);

2x16-bit signed subtraction, halving the results.

  int16x2_t __ssax(int16x2_t, int16x2_t);

Exchanges halfwords of the second operand, subtracts high halfwords and adds low halfwords. The GE bits are set according to the results.

  int16x2_t __ssub16(int16x2_t, int16x2_t);

2x16-bit signed subtraction. The GE bits are set according to the results.

  uint16x2_t __uadd16(uint16x2_t, uint16x2_t);

2x16-bit unsigned addition. The GE bits are set according to the results.

  uint16x2_t __uasx(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords. The GE bits are set according to the results of unsigned addition.

  uint16x2_t __uhadd16(uint16x2_t, uint16x2_t);

2x16-bit unsigned addition, halving the results.

  uint16x2_t __uhasx(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords, halving the results.

  uint16x2_t __uhsax(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, subtracts high halfwords and adds low halfwords, halving the results.

  uint16x2_t __uhsub16(uint16x2_t, uint16x2_t);

2x16-bit unsigned subtraction, halving the results.

  uint16x2_t __uqadd16(uint16x2_t, uint16x2_t);

2x16-bit unsigned addition, saturating to the range 0 to 2**16-1.

  uint16x2_t __uqasx(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, and performs saturating unsigned addition on the high halfwords and saturating unsigned subtraction on the low halfwords.

  uint16x2_t __uqsax(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, and performs saturating unsigned subtraction on the high halfwords and saturating unsigned addition on the low halfwords.

  uint16x2_t __uqsub16(uint16x2_t, uint16x2_t);

2x16-bit unsigned subtraction, saturating to the range 0 to 2**16-1.

  uint16x2_t __usax(uint16x2_t, uint16x2_t);

Exchanges the halfwords of the second operand, subtracts the high halfwords and adds the low halfwords. Sets the GE bits according to the results of unsigned addition.

  uint16x2_t __usub16(uint16x2_t, uint16x2_t);

2x16-bit unsigned subtraction. The GE bits are set according to the results.

Parallel 16-bit multiplication ^

These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. Each intrinsic performs two 16-bit multiplications.

  int32_t __smlad(int16x2_t, int16x2_t, int32_t);

Performs 2x16-bit multiplication and adds both results to the third operand. Sets the Q flag if the addition overflows. (Overflow cannot occur during the multiplications.)

  int32_t __smladx(int16x2_t, int16x2_t, int32_t);

Exchanges the halfwords of the second operand, performs 2x16-bit multiplication, and adds both results to the third operand. Sets the Q flag if the addition overflows. (Overflow cannot occur during the multiplications.)

  int64_t __smlald(int16x2_t, int16x2_t, int64_t);

Performs 2x16-bit multiplication and adds both results to the 64-bit third operand. Overflow in the addition is not detected.

  int64_t __smlaldx(int16x2_t, int16x2_t, int64_t);

Exchanges the halfwords of the second operand, performs 2x16-bit multiplication and adds both results to the 64-bit third operand. Overflow in the addition is not detected.

  int32_t __smlsd(int16x2_t, int16x2_t, int32_t);

Performs two 16-bit signed multiplications. Takes the difference of the products, subtracting the high-halfword product from the low-halfword product, and adds the difference to the third operand. Sets the Q flag if the addition overflows. (Overflow cannot occur during the multiplications or the subtraction.)

  int32_t __smlsdx(int16x2_t, int16x2_t, int32_t);

Performs two 16-bit signed multiplications. The product of the high halfword of the first operand and the low halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high halfword of the second operand, and the difference is added to the third operand. Sets the Q flag if the addition overflows. (Overflow cannot occur during the multiplications or the subtraction.)

  int64_t __smlsld(int16x2_t, int16x2_t, int64_t);

Perform two 16-bit signed multiplications. Take the difference of the products, subtracting the high-halfword product from the low-halfword product, and add the difference to the third operand. Overflow in the 64-bit addition is not detected. (Overflow cannot occur during the multiplications or the subtraction.)

  int64_t __smlsldx(int16x2_t, int16x2_t, int64_t);

Perform two 16-bit signed multiplications. The product of the high halfword of the first operand and the low halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high halfword of the second operand, and the difference is added to the third operand. Overflow in the 64-bit addition is not detected. (Overflow cannot occur during the multiplications or the subtraction.)

  int32_t __smuad(int16x2_t, int16x2_t);

Perform 2x16-bit signed multiplications, adding the products together. Set the Q flag if the addition overflows.

  int32_t __smuadx(int16x2_t, int16x2_t);

Exchange the halfwords of the second operand (or equivalently, the first operand), perform 2x16-bit signed multiplications, and add the products together. Set the Q flag if the addition overflows.

  int32_t __smusd(int16x2_t, int16x2_t);

Perform two 16-bit signed multiplications. Take the difference of the products, subtracting the high-halfword product from the low-halfword product.

  int32_t __smusdx(int16x2_t, int16x2_t);

Perform two 16-bit signed multiplications. The product of the high halfword of the first operand and the low halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high halfword of the second operand.

Examples ^

Taking the elementwise maximum of two SIMD values each of which consists of four 8-bit signed numbers:

  int8x4_t max8x4(int8x4_t x, int8x4_t y) { __ssub8(x, y); return __sel(x, y); }

As described in Parallel selection, the same approach works where the SIMD values consist of two 16-bit unsigned numbers:

  uint16x2_t max16x2(uint16x2_t x, uint16x2_t y) { __usub16(x, y); return __sel(x, y); }

Note that even though the result of the subtraction is not used, the compiler must still generate the instruction, because of its side-effect on the GE bits which are tested by the __sel() intrinsic.
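
A further sketch, this time of the intrinsics described in Parallel 16-bit multiplication: a dot product of two sequences of 16-bit values processed two elements at a time. The function name is illustrative, n2 is the number of int16x2_t pairs, and the data is assumed to be 4-byte aligned:

  int32_t dot16(const int16x2_t *a, const int16x2_t *b, int n2) {
    int32_t acc = 0;
    for (int i = 0; i < n2; i++)
      acc = __smlad(a[i], b[i], acc);   /* acc += lo*lo' + hi*hi'; sets Q on overflow */
    return acc;
  }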

Floating-point data-processing intrinsics ^

The intrinsics in this section provide direct access to selected floating-point instructions. They are defined only if the appropriate precision is available in hardware, as indicated by __ARM_FP (see Hardware floating point).

  double __sqrt(double x);
  float __sqrtf(float x);

The __sqrt intrinsics compute the square root of their operand. They have no effect on errno. Negative values produce a default NaN result and possible floating-point exception as described in [ARMARM] (A2.7.7).

  double __fma(double x, double y, double z);
  float __fmaf(float x, float y, float z);

The __fma intrinsics compute (x*y)+z, without intermediate rounding. These intrinsics are available only if __ARM_FEATURE_FMA is defined. On a Standard C implementation it should not normally be necessary to use these intrinsics, because the fma functions defined in [C99] (7.12.13) should expand directly to the instructions if available.

  float __rintnf (float);
  double __rintn (double);

The __rintn intrinsics perform a floating-point round to integral, to nearest with ties to even. These intrinsics are available when __ARM_FEATURE_DIRECTED_ROUNDING is defined to 1. For other rounding modes, such as ‘to nearest with ties to away’, it is strongly recommended that the C99 standard functions be used. To achieve a floating-point convert-to-integer operation that rounds to ‘nearest with ties to even’, use these rounding functions with a type-cast to integral values. For example:

  (int) __rintnf (a);

maps to a floating point convert to signed integer, rounding to nearest with ties to even operation.

  int32_t __jcvt (double);

Converts a double-precision floating-point number to a 32-bit signed integer following the Javascript Convert instruction semantics [ARMARMv83]. The __jcvt intrinsic is available if __ARM_FEATURE_JCVT is defined.

  float __rint32zf (float);
  double __rint32z (double);
  float __rint64zf (float);
  double __rint64z (double);
  float __rint32xf (float);
  double __rint32x (double);
  float __rint64xf (float);
  double __rint64x (double);

These intrinsics round their floating-point argument to a floating-point value that would be representable in a 32-bit or 64-bit signed integer type. Out-of-range values are forced to the most negative integer representable in the target size, and an Invalid Operation floating-point exception is generated. The rounding mode can be either the ambient rounding mode (for example __rint32xf) or towards zero (for example __rint32zf).

These instructions are introduced in the Armv8.5-A extensions [ARMARMv85] and are available only in the AArch64 execution state. The intrinsics are available when __ARM_FEATURE_FRINT is defined.

Random number generation intrinsics ^

The Random number generation intrinsics provide access to the Random Number instructions introduced in Armv8.5-A. These intrinsics are only defined for the AArch64 execution state and are available when __ARM_FEATURE_RNG is defined.

  int __rndr (uint64_t *);

Stores a 64-bit random number into the object pointed to by the argument and returns zero. If the implementation could not generate a random number within a reasonable period of time, the object pointed to by the argument is set to zero and a non-zero value is returned.

  int __rndrrs (uint64_t *);

Reseeds the random number generator and then stores a 64-bit random number into the object pointed to by the argument, returning zero. If the implementation could not generate a random number within a reasonable period of time, the object pointed to by the argument is set to zero and a non-zero value is returned.

These intrinsics have side-effects on the system beyond their results. Implementations must preserve them even if the results of the intrinsics are unused.

To access these intrinsics, <arm_acle.h> should be included.
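
For example, a minimal sketch of a bounded retry loop; the function name and retry limit are illustrative:

  #include <arm_acle.h>

  int get_random64(uint64_t *out) {
    for (int tries = 0; tries < 100; tries++)
      if (__rndr(out) == 0)
        return 0;       /* success: *out holds a 64-bit random value */
    return -1;          /* the generator did not produce a value in time */
  }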

CRC32 intrinsics ^

CRC32 intrinsics provide direct access to CRC32 instructions CRC32{C}{B, H, W, X} in both Armv8 AArch32 and AArch64 execution states. These intrinsics are available when __ARM_FEATURE_CRC32 is defined.

  uint32_t __crc32b (uint32_t a, uint8_t b);

Performs CRC-32 checksum from bytes.

  uint32_t __crc32h (uint32_t a, uint16_t b);

Performs CRC-32 checksum from half-words.

  uint32_t __crc32w (uint32_t a, uint32_t b);

Performs CRC-32 checksum from words.

  uint32_t __crc32d (uint32_t a, uint64_t b);

Performs CRC-32 checksum from double words.

  uint32_t __crc32cb (uint32_t a, uint8_t b);

Performs CRC-32C checksum from bytes.

  uint32_t __crc32ch (uint32_t a, uint16_t b);

Performs CRC-32C checksum from half-words.

  uint32_t __crc32cw (uint32_t a, uint32_t b);

Performs CRC-32C checksum from words.

  uint32_t __crc32cd (uint32_t a, uint64_t b);

Performs CRC-32C checksum from double words.

To access these intrinsics, <arm_acle.h> should be included.
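
For example, a minimal sketch of computing a CRC-32 checksum over a byte buffer. The function name is illustrative, and the initial value and final inversion follow the common CRC-32 convention rather than anything mandated by ACLE:

  #include <arm_acle.h>
  #include <stddef.h>
  #include <stdint.h>

  uint32_t crc32_buffer(const uint8_t *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++)
      crc = __crc32b(crc, p[i]);   /* fold in one byte at a time */
    return ~crc;
  }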

Load/store 64 Byte intrinsics ^

These intrinsics provide direct access to the Armv8.7-A LD64B, ST64B, ST64BV and ST64BV0 instructions for atomic 64-byte access to device memory. These intrinsics are available when __ARM_FEATURE_LS64 is defined.

The header <arm_acle.h> defines these intrinsics, and also the data type data512_t that they use.

The type data512_t is a 64-byte structure type containing a single member val which is an array of 8 uint64_t, as if declared like this:

  typedef struct {
      uint64_t val[8];
  } data512_t;

The following intrinsics are defined on this data type. In all cases, the address addr must be aligned to a multiple of 64 bytes.

  data512_t __arm_ld64b(const void *addr);

Loads 64 bytes of data atomically from the address addr. The address must be in a memory region that supports 64-byte load/store operations.

  void __arm_st64b(void *addr, data512_t value);

Stores the 64 bytes in value atomically to the address addr. The address must be in a memory region that supports 64-byte load/store operations.

  uint64_t __arm_st64bv(void *addr, data512_t value);

Attempts to store the 64 bytes in value atomically to the address addr. It returns a 64-bit value taken from the response of the device that was written to.

  uint64_t __arm_st64bv0(void *addr, data512_t value);

Performs the same operation as __arm_st64bv, except that the data stored to memory is modified by replacing the low 32 bits of value.val[0] with the contents of the ACCDATA_EL1 system register. The returned value is the same as for __arm_st64bv.
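
For example, a minimal sketch of copying one 64-byte granule between two locations that support 64-byte accesses. The function name is illustrative; both pointers are assumed to be 64-byte aligned and to refer to suitable device memory:

  void copy_granule(void *dst, const void *src) {
    data512_t v = __arm_ld64b(src);   /* atomic 64-byte load */
    __arm_st64b(dst, v);              /* atomic 64-byte store */
  }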

Custom Datapath Extension ^

The intrinsics in this section provide access to instructions in the Custom Datapath Extension.

The <arm_cde.h> header should be included before using these intrinsics. The header is available when the __ARM_FEATURE_CDE feature macro is defined.

The intrinsics are stateless and pure, meaning an implementation is permitted to discard an invocation of an intrinsic whose result is unused without considering side-effects.

CDE intrinsics ^

The following intrinsics are available when __ARM_FEATURE_CDE is defined. These intrinsics use the coproc and imm compile-time constants to generate the corresponding CDE instructions. The coproc argument indicates the CDE coprocessor to use. The range of available coprocessors is indicated by the bitmap __ARM_FEATURE_CDE_COPROC, described in Custom Datapath Extension. The imm argument must fit within the immediate range of the corresponding CDE instruction. Values for these arguments outside these ranges must be rejected.

  uint32_t __arm_cx1(int coproc, uint32_t imm);
  uint32_t __arm_cx1a(int coproc, uint32_t acc, uint32_t imm);
  uint32_t __arm_cx2(int coproc, uint32_t n, uint32_t imm);
  uint32_t __arm_cx2a(int coproc, uint32_t acc, uint32_t n, uint32_t imm);
  uint32_t __arm_cx3(int coproc, uint32_t n, uint32_t m, uint32_t imm);
  uint32_t __arm_cx3a(int coproc, uint32_t acc, uint32_t n, uint32_t m, uint32_t imm);

  uint64_t __arm_cx1d(int coproc, uint32_t imm);
  uint64_t __arm_cx1da(int coproc, uint64_t acc, uint32_t imm);
  uint64_t __arm_cx2d(int coproc, uint32_t n, uint32_t imm);
  uint64_t __arm_cx2da(int coproc, uint64_t acc, uint32_t n, uint32_t imm);
  uint64_t __arm_cx3d(int coproc, uint32_t n, uint32_t m, uint32_t imm);
  uint64_t __arm_cx3da(int coproc, uint64_t acc, uint32_t n, uint32_t m, uint32_t imm);

The following intrinsics are also available when __ARM_FEATURE_CDE is defined, providing access to the CDE instructions that read and write the floating-point registers:

  uint32_t __arm_vcx1_u32(int coproc, uint32_t imm);
  uint32_t __arm_vcx1a_u32(int coproc, uint32_t acc, uint32_t imm);
  uint32_t __arm_vcx2_u32(int coproc, uint32_t n, uint32_t imm);
  uint32_t __arm_vcx2a_u32(int coproc, uint32_t acc, uint32_t n, uint32_t imm);
  uint32_t __arm_vcx3_u32(int coproc, uint32_t n, uint32_t m, uint32_t imm);
  uint32_t __arm_vcx3a_u32(int coproc, uint32_t acc, uint32_t n, uint32_t m, uint32_t imm);

In addition, the following intrinsics can be used to generate the D-register forms of the instructions:

  uint64_t __arm_vcx1d_u64(int coproc, uint32_t imm);
  uint64_t __arm_vcx1da_u64(int coproc, uint64_t acc, uint32_t imm);
  uint64_t __arm_vcx2d_u64(int coproc, uint64_t m, uint32_t imm);
  uint64_t __arm_vcx2da_u64(int coproc, uint64_t acc, uint64_t m, uint32_t imm);
  uint64_t __arm_vcx3d_u64(int coproc, uint64_t n, uint64_t m, uint32_t imm);
  uint64_t __arm_vcx3da_u64(int coproc, uint64_t acc, uint64_t n, uint64_t m, uint32_t imm);

The above intrinsics use the uint32_t and uint64_t types as general container types.

The following intrinsics can be used to generate CDE instructions that use the MVE Q registers.

  uint8x16_t __arm_vcx1q_u8 (int coproc, uint32_t imm);
  T __arm_vcx1qa(int coproc, T acc, uint32_t imm);
  T __arm_vcx2q(int coproc, T n, uint32_t imm);
  uint8x16_t __arm_vcx2q_u8(int coproc, T n, uint32_t imm);
  T __arm_vcx2qa(int coproc, T acc, U n, uint32_t imm);
  T __arm_vcx3q(int coproc, T n, U m, uint32_t imm);
  uint8x16_t __arm_vcx3q_u8(int coproc, T n, U m, uint32_t imm);
  T __arm_vcx3qa(int coproc, T acc, U n, V m, uint32_t imm);

  T __arm_vcx1q_m(int coproc, T inactive, uint32_t imm, mve_pred16_t p);
  T __arm_vcx2q_m(int coproc, T inactive, U n, uint32_t imm, mve_pred16_t p);
  T __arm_vcx3q_m(int coproc, T inactive, U n, V m, uint32_t imm, mve_pred16_t p);

  T __arm_vcx1qa_m(int coproc, T acc, uint32_t imm, mve_pred16_t p);
  T __arm_vcx2qa_m(int coproc, T acc, U n, uint32_t imm, mve_pred16_t p);
  T __arm_vcx3qa_m(int coproc, T acc, U n, V m, uint32_t imm, mve_pred16_t p);

These intrinsics are polymorphic in the T, U and V types, which must be of size 128 bits. The __arm_vcx1q_u8, __arm_vcx2q_u8 and __arm_vcx3q_u8 intrinsics return a container vector of 16 bytes that can be reinterpreted to other vector types as needed using the intrinsics below:

  uint16x8_t __arm_vreinterpretq_u16_u8 (uint8x16_t in);
  int16x8_t __arm_vreinterpretq_s16_u8 (uint8x16_t in);
  uint32x4_t __arm_vreinterpretq_u32_u8 (uint8x16_t in);
  int32x4_t __arm_vreinterpretq_s32_u8 (uint8x16_t in);
  uint64x2_t __arm_vreinterpretq_u64_u8 (uint8x16_t in);
  int64x2_t __arm_vreinterpretq_s64_u8 (uint8x16_t in);
  float16x8_t __arm_vreinterpretq_f16_u8 (uint8x16_t in);
  float32x4_t __arm_vreinterpretq_f32_u8 (uint8x16_t in);
  float64x2_t __arm_vreinterpretq_f64_u8 (uint8x16_t in);

The parameter inactive can be set to an uninitialized (don’t care) value using the MVE vuninitializedq family of intrinsics.
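
As an illustrative sketch only: coprocessor 0 and the immediate value 3 below are assumptions, a real use would consult __ARM_FEATURE_CDE_COPROC for an implemented coprocessor, the meaning of the result is defined entirely by the custom hardware, and <arm_mve.h> is assumed to provide the vector container types:

  #include <arm_cde.h>
  #include <arm_mve.h>   /* vector container types (assumption) */

  uint16x8_t cde_example(void) {
    uint8x16_t raw = __arm_vcx1q_u8(0, 3);     /* CDE instruction on coprocessor 0 */
    return __arm_vreinterpretq_u16_u8(raw);    /* view the 16 bytes as 8 halfwords */
  }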

Memory tagging intrinsics ^

The intrinsics in this section provide access to the Memory Tagging Extension (MTE) introduced with the Armv8.5-A [ARMARMv85] architecture.

The <arm_acle.h> header should be included before using these intrinsics.

These intrinsics are expected to be used in system code, including freestanding environments. As such, implementations must guarantee that no new linking dependencies to runtime support libraries will occur when these intrinsics are used.

Memory tagging ^

Memory tagging is a lightweight, probabilistic version of a lock and key system where one of a limited set of lock values can be associated with the memory locations forming part of an allocation, and the equivalent key is stored in unused high bits of addresses used as references to that allocation. On each use of a reference the key is checked to make sure that it matches with the lock before an access is made.

When allocating memory, programmers must assign a lock to that section of memory. When freeing an allocation, programmers must change the lock value so that further referencing using the previous key has a reasonable probability of failure.

The intrinsics specified below support creation, storage, and retrieval of the lock values, leaving software to select and set the values on allocation and deallocation. The intrinsics are expected to help protect heap allocations.

The lock is referred in the text below as allocation tag and the key as logical address tag (or in short logical tag).

Terms and implementation details ^

The memory system is extended with a new physical address space containing an allocation tag for each 16-byte granule of memory in the existing data physical address space. All loads and stores to memory must pass a valid logical address tag as part of the reference. However, SP- and PC-relative addresses are not checked. The logical tag is held in the upper bits of the reference. There are 16 available logical tags that can be used.

MTE intrinsics ^

These intrinsics are available when __ARM_FEATURE_MEMORY_TAGGING is defined. Type T below can be any type. Where the function return type is specified as T, the return type is determined from the input argument, which must also be of type T. If the input argument of type T has a const or volatile qualifier, the return type T will also have that qualifier.

  T* __arm_mte_create_random_tag(T* src, uint64_t mask);

This intrinsic returns a pointer containing a randomly created logical address tag. The first argument is a pointer src containing an address. The second argument is a mask, where the lower 16 bits specify logical tags which must be excluded from consideration. The intrinsic returns a pointer which is a copy of the input address but also contains a randomly created logical tag (in the upper bits), that excludes any logical tags specified by the mask. A mask of zero excludes no tags.

  T* __arm_mte_increment_tag(T* src, unsigned offset);

This intrinsic returns a pointer which is a copy of the input pointer src but with the logical address tag part offset by a specified offset value. The first argument is a pointer src containing an address and a logical tag. The second argument is an offset which must be a compile time constant value in the range [0,15]. The intrinsic adds offset to the logical tag part of src returning a pointer with the incremented logical tag. If adding the offset increments the logical tag beyond the valid 16 tags, the value is wrapped around.

  uint64_t __arm_mte_exclude_tag(T* src, uint64_t excluded);

This intrinsic adds a logical tag to the set of excluded logical tags. The first argument is a pointer src containing an address and a logical tag. The second argument excluded is a mask where the lower 16 bits specify logical tags which are in the current excluded set. The intrinsic adds the logical tag of src to the set specified by excluded and returns the new excluded tag set.

  void __arm_mte_set_tag(T* tag_address);

This intrinsic stores an allocation tag, computed from the logical tag, to the tag memory, thereby setting the allocation tag for the 16-byte granule of memory. The argument is a pointer tag_address containing a logical tag and an address. The address must be 16-byte aligned. The type of the pointer is ignored (that is, the allocation tag is set only for a single granule even if the pointer points to a type that is greater than 16 bytes). This intrinsic generates an unchecked access to memory.

  T* __arm_mte_get_tag(T* address);

This intrinsic loads the allocation tag from tag memory and returns the corresponding logical tag as part of the returned pointer value. The argument is a pointer address containing an address from which allocation tag memory is read. The pointer address need not be 16-byte aligned as it applies to the 16-byte naturally aligned granule containing the un-aligned pointer. The return value is a pointer whose address part comes from address and the logical tag value is the value computed from the allocation tag that was read from tag memory.

  ptrdiff_t __arm_mte_ptrdiff(T* a, T* b);

The intrinsic calculates the difference between the address parts of the two pointers, ignoring the tags. The return value is the sign-extended result of the computation. The tag bits in the input pointers are ignored for this operation.
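
For example, a minimal sketch of tagging a freshly allocated block. The function name is illustrative; p is assumed to be 16-byte aligned and size a multiple of 16, as required for __arm_mte_set_tag:

  #include <arm_acle.h>
  #include <stddef.h>

  void *tag_allocation(void *p, size_t size) {
    /* Choose a random logical tag, excluding none. */
    char *tagged = __arm_mte_create_random_tag((char *)p, 0);
    /* Set the matching allocation tag for each 16-byte granule. */
    for (size_t off = 0; off < size; off += 16)
      __arm_mte_set_tag(tagged + off);
    /* All further accesses must use the returned (tagged) pointer. */
    return tagged;
  }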

State management ^

The specification for SME is in Beta state and may change or be extended in the future.

Introduction ^

ACLE often uses normal C and C++ objects to represent architectural state. These objects are passed to intrinsics and returned from intrinsics in the same way as they would be for a normal function.

For example, ACLE defines an svbool_t type to represent the contents of an SVE predicate register. SVE intrinsics that read a predicate register have an svbool_t argument for that register. SVE intrinsics that write to a predicate register have an svbool_t return type for that register.

SVE provides 16 predicate registers, but this number is not exposed to C and C++ code. Instead, the program can have as many svbool_t objects as it needs, and it is the compiler’s responsibility to manage the register allocation.

However, there are some pieces of architectural state for which this approach is not suitable. For example, SME’s ZA is a single piece of storage: there are not multiple ZAs, and so it does not make sense for a C or C++ function to have multiple ZA objects. It would also be inefficient to copy state from one ZA object to another.

ACLE models this kind of architectural state using keyword attributes rather than types and objects. The main purpose of these keyword attributes is to establish the following binary choices for a given piece of architectural state S:

  1. A function with a given type either shares S with its caller, or it does not.

  2. A given function definition either uses S internally, or it does not.

A function definition that shares S with its caller also (implicitly) uses S internally.

A function definition that uses S internally but does not share S with its caller is said to create a “new scope” for S.

ACLE uses strings to identify each piece of state that is managed in this way. The strings are listed in the table below.

String   State              State macro       Feature macro
"za"     SME’s ZA storage   __ARM_STATE_ZA    __ARM_FEATURE_SME
"zt0"    SME2’s ZT0         __ARM_STATE_ZT0   __ARM_FEATURE_SME2

For each string, there are two associated macros: a state macro, which indicates that function types can be declared as sharing that state, and a feature macro, which indicates that function definitions can use that state directly.

A compiler that predefines the feature macro must also predefine the state macro.

For example, __ARM_STATE_ZA allows declarations of functions that share ZA. __ARM_FEATURE_SME allows function definitions to use ZA internally.

The strings are case-sensitive. For example, "za" cannot be written "ZA".

Ways of sharing state ^

ACLE identifies several different ways in which a callee can share state with its caller. Each way has an associated keyword attribute: __arm_in, __arm_out, __arm_inout, and __arm_preserves.

Each keyword attribute takes a comma-separated list of state strings as an argument. The list must be non-empty.

For example:

   void f() __arm_inout("za") __arm_preserves("zt0");

says that f uses ZA to receive data from callers and to pass data back to callers. f also promises to preserve the contents of ZT0.

A program is ill-formed if a function type has two attributes that specify different behavior for the same piece of state. For example, a program is ill-formed if a function type has both __arm_in("za") and __arm_out("za"), or both __arm_in("za") and __arm_inout("za").

The same information may be specified multiple times. For example:

  void f() __arm_in("za", "za") __arm_in("za");

is well-formed.

Two function types are incompatible if they handle a piece of architectural state differently. For example:

   void f1() __arm_in("za");
   void f2() __arm_out("za");
   void f3() __arm_inout("za");
   void f4() __arm_preserves("za");
   void f5();

   void (*ptr1)() __arm_in("za");
   void (*ptr2)() __arm_out("za");
   void (*ptr3)() __arm_inout("za");
   void (*ptr4)() __arm_preserves("za");
   void (*ptr5)();

   void test() {
     ptr1 = f1;    // OK
     ptr2 = f1;    // Invalid
     ptr3 = f1;    // Invalid
     ptr4 = f1;    // Invalid
     ptr5 = f1;    // Invalid

     ptr1 = f2;    // Invalid
     ptr2 = f2;    // OK
     ptr3 = f2;    // Invalid
     ptr4 = f2;    // Invalid
     ptr5 = f2;    // Invalid

     ptr1 = f3;    // Invalid
     ptr2 = f3;    // Invalid
     ptr3 = f3;    // OK
     ptr4 = f3;    // Invalid
     ptr5 = f3;    // Invalid

     ptr1 = f4;    // Invalid
     ptr2 = f4;    // Invalid
     ptr3 = f4;    // Invalid
     ptr4 = f4;    // OK
     ptr5 = f4;    // Invalid

     ptr1 = f5;    // Invalid
     ptr2 = f5;    // Invalid
     ptr3 = f5;    // Invalid
     ptr4 = f5;    // Invalid
     ptr5 = f5;    // OK
   }

__arm_in and __arm_preserves both guarantee that a function leaves the contents of the state unchanged on return. Functions are not ill-formed if they write to such state, but they must ensure that the cumulative effect of such writes is to leave the state unchanged.

A program is ill-formed if a function that does not use state S contains a call to a function that shares S with its caller. This is conceptually similar to an undefined variable in C. For example:

   void callee() __arm_inout("za");
   void caller() {
     callee(); // Ill-formed, caller does not use ZA
   }

If a C or C++ function F uses state S internally and calls a function F2 that does not share S with its callers, the call to F2 has no effect on F’s S state. The compiler must ensure this by restoring the old contents of S after the call to F2, if such a restoration is necessary. For example:

   void za_callee() __arm_inout("za");
   void za_caller() __arm_inout("za") {
     za_callee();
     printf("Here\n"); // Does not change za_caller's ZA state
     za_callee();
   }

Mapping to the Procedure Call Standard ^

[AAPCS64] classifies functions as having one of the following interfaces: a shared-ZA interface or a private-ZA interface.

If a C or C++ function F forms part of the object code’s ABI, that object code function has a shared-ZA interface if and only if at least one of the following is true:

All other functions have a private-ZA interface.

Function definitions ^

The keyword attribute __arm_new(...) applies to function definitions. It specifies that the function creates a new scope for the given state. ... is a comma-separated list of state strings. The list must be non-empty.

Each piece of state specified by __arm_new is zero-initialized.

This attribute does not change a function’s binary interface. If the function forms part of the object code’s ABI, that object code function has the same interface as it would have had without __arm_new.

A program is ill-formed if a function definition F uses __arm_new for some state that F also shares with its caller. For example:

  __arm_new("za") void f1() __arm_in("za") { ... }     // Invalid
  __arm_new("za") void f2() __arm_out("za") { ... }    // Invalid
  __arm_new("za") void f3() __arm_inout("za") { ... }  // Invalid
  __arm_new("za") void f4() __arm_inout("zt0") { ... } // OK
  __arm_new("zt0") void f5() __arm_inout("za") { ... } // OK

Inline assembly ^

Some ACLE implementations might support the GNU inline asm extension. For implementations that do, suppose that an inline asm occurs in a function F that uses state S. There are then two possibilities:

If instead the inline asm occurs in a function that does not use state S, the two possibilities are:

[AAPCS64] defines three possible states for ZA: “off”, “dormant” or “active”. These states describe the values of two things:

ZA is guaranteed to be active on entry to an inline asm A in a function F if at least one of the following is true:

Otherwise, ZA can be in any state on entry to A if at least one of the following is true:

Otherwise, ZA can be off or dormant on entry to A, as for what AAPCS64 calls “private-ZA” functions.

If ZA is active on entry to A then A’s instructions must ensure that ZA is also active when the asm finishes.

Similarly, if ZA is off on entry to A then A’s instructions must ensure that ZA is off when the asm finishes.

If ZA is dormant on entry to A and A’s clobber list does not include "za", A’s instructions must ensure that ZA is dormant when the asm finishes.

Otherwise, if ZA is dormant on entry to A and A’s clobber list includes "za", A’s instructions can leave ZA unchanged or in the off state (A’s choice). In the latter case, A’s instructions must commit the lazy save that was pending on entry to A.

The table below summarizes the possible ZA states on entry to an inline asm A in a function F. It also specifies what the state is required to be when A finishes; it is the asm’s responsibility to ensure this. Since PSTATE.ZA controls both ZA and ZT0, the rules depend on ZT0 as well as ZA.

ZA state before A   ZA state after A   Possible if…
off                 off                F’s uses and A’s clobbers are disjoint
dormant             dormant            F’s uses and A’s clobbers are disjoint
dormant             off                F’s uses and A’s clobbers are disjoint, and A clobbers "za"
active              active             F uses "za" and/or "zt0"

The __ARM_STATE macros indicate whether a compiler is guaranteed to support a particular clobber string. For example, the following code is a safe way to indicate that an asm might commit a lazy ZA save:

  // Function foo doesn't use ZA or ZT0.
  void foo() {
    ...
    asm volatile("bl something"
                 ::: ...call-clobbered registers...
#ifdef __ARM_STATE_ZA
                     , "za"
#endif
                );
    ...
  }

Implementation notes ^

A piece of state S can be said to be “live” at a point P during the execution of a function F if:

If S is live before a call from F to a function F2 that does not share S with its caller, the compiler must arrange for S to be preserved around the call to F2. One way to do this is to save S before the call and restore it afterwards. However, if S is ZA and F2 is a private-ZA function, the compiler can instead make use of the lazy-save scheme described in [AAPCS64].

For example, the code below shows a function that uses ZA and ZT0 internally. The comments describe when ZA should be saved and restored:

  void f_shares_zt0() __arm_inout("zt0");
  void f_shares_za() __arm_inout("za");
  void f_private();

  void setup_za() __arm_out("za");
  void use_za() __arm_in("za");

  __arm_new("za", "zt0") void test() {
    f_private();     // ZA is not live, no save necessary

    setup_za();      // ZA is live after this call
    f_shares_zt0();  // The compiler should save and restore ZA
                     // around the call ("caller-save")
    f_shares_za();   // ZA is live before and after this call
    f_private();     // The compiler should preserve ZA across the call
                     // It can use the lazy-save mechanism
    use_za();        // ZA is no longer live after this call

    f_private();     // ZA is not live, no save necessary
  }

ZT0 cannot be saved lazily, so if ZT0 is live before a call to a function that does not share ZT0, the compiler must save and restore ZT0 around the call. For example:

  void setup_zt0() __arm_out("zt0");
  void use_zt0() __arm_in("zt0");
  void f_private();

  __arm_new("zt0") void test() {
    f_private();     // ZT0 is not live, no save necessary

    setup_zt0();     // ZT0 is live after this call
    f_private();     // The compiler should save and restore ZT0
                     // around the call ("caller-save")
    use_zt0();       // ZT0 is no longer live after this call

    f_private();     // ZT0 is not live, no save necessary
  }

System register access ^

Special register intrinsics ^

Intrinsics are provided to read and write system and coprocessor registers, collectively referred to as special registers.

  uint32_t __arm_rsr(const char *special_register);

Reads a 32-bit system register.

  uint64_t __arm_rsr64(const char *special_register);

Reads a 64-bit system register.

  __uint128_t __arm_rsr128(const char *special_register);

Reads a 128-bit system register. Only available if __ARM_FEATURE_SYSREG128 is defined to 1.

  void* __arm_rsrp(const char *special_register);

Reads a system register containing an address.

  float __arm_rsrf(const char *special_register);

Reads a 32-bit coprocessor register containing a floating point value.

  double __arm_rsrf64(const char *special_register);

Reads a 64-bit coprocessor register containing a floating point value.

  void __arm_wsr(const char *special_register, uint32_t value);

Writes a 32-bit system register.

  void __arm_wsr64(const char *special_register, uint64_t value);

Writes a 64-bit system register.

  void __arm_wsr128(const char *special_register, __uint128_t value);

Writes a 128-bit system register. Only available if __ARM_FEATURE_SYSREG128 is defined to 1.

  void __arm_wsrp(const char *special_register, const void *value);

Writes a system register containing an address.

  void __arm_wsrf(const char *special_register, float value);

Writes a floating point value to a 32-bit coprocessor register.

  void __arm_wsrf64(const char *special_register, double value);

Writes a floating point value to a 64-bit coprocessor register.

Special register designations ^

The special_register parameter must be a compile time string literal. This means that the implementation can determine the register being accessed at compile-time and produce the correct instruction without having to resort to self-modifying code. All register specifiers are case-insensitive (so “apsr” is equivalent to “APSR”). The string literal should have one of the forms described below.

AArch32 32-bit coprocessor register ^

When specifying a 32-bit coprocessor register to __arm_rsr, __arm_rsrp, __arm_rsrf, __arm_wsr, __arm_wsrp, or __arm_wsrf:

  cp<coprocessor>:<opc1>:c<CRn>:c<CRm>:<opc2>

Or (equivalently):

  p<coprocessor>:<opc1>:c<CRn>:c<CRm>:<opc2>

Where:

The values of the register specifiers will be as described in [ARMARM] or the Technical Reference Manual (TRM) for the specific processor.

So to read MIDR:

  unsigned int midr = __arm_rsr("cp15:0:c0:c0:0");

ACLE does not specify predefined strings for the system coprocessor register names documented in the Arm Architecture Reference Manual (for example “MIDR”).

AArch32 32-bit system register ^

When specifying a 32-bit system register to __arm_rsr, __arm_rsrp, __arm_wsr, or __arm_wsrp, one of:

AArch32 64-bit coprocessor register ^

When specifying a 64-bit coprocessor register to __arm_rsr64, __arm_rsrf64, __arm_wsr64, or __arm_wsrf64

  cp<coprocessor>:<opc1>:c<CRm>

Or (equivalently):

  p<coprocessor>:<opc1>:c<CRm>

Where:

AArch64 system register ^

When specifying a system register to __arm_rsr, __arm_rsr64, __arm_rsr128, __arm_rsrp, __arm_wsr, __arm_wsr64,__arm_wsr128 or __arm_wsrp:

  "o0:op1:CRn:CRm:op2"

Where:

AArch64 processor state field ^

When specifying a processor state field to __arm_rsr, __arm_rsrp, __arm_wsr, or __arm_wsrp, one of the values accepted in the pstatefield of the MSR (immediate) instruction [ARMARMv8] (C5.6.130).

Coprocessor Intrinsics ^

AArch32 coprocessor intrinsics ^

In the intrinsics below coproc, opc1, opc2, CRd, CRn and CRm are all compile-time integer constants with appropriate values as defined by the coprocessor for the intended architecture.

The argument order for all intrinsics is the same as the operand order for the instruction as described in the Arm Architecture Reference Manual, with the exception of MRC/MRC2/MRRC/MRRC2, which omit the Arm register arguments and instead return a value, and MCRR/MCRR2, which accept a single 64-bit unsigned integer instead of two 32-bit unsigned integers.

AArch32 Data-processing coprocessor intrinsics ^

Intrinsics are provided to create coprocessor data-processing instructions as follows:

Intrinsics Equivalent Instruction
void __arm_cdp(coproc, opc1, CRd, CRn, CRm, opc2) CDP coproc, opc1, CRd, CRn, CRm, opc2
void __arm_cdp2(coproc, opc1, CRd, CRn, CRm, opc2) CDP2 coproc, opc1, CRd, CRn, CRm, opc2

AArch32 Memory coprocessor transfer intrinsics ^

Intrinsics are provided to create coprocessor memory transfer instructions as follows:

Intrinsics Equivalent Instruction
void __arm_ldc(coproc, CRd, const void* p) LDC coproc, CRd, […]
void __arm_ldcl(coproc, CRd, const void* p) LDCL coproc, CRd, […]
void __arm_ldc2(coproc, CRd, const void* p) LDC2 coproc, CRd, […]
void __arm_ldc2l(coproc, CRd, const void* p) LDC2L coproc, CRd, […]
void __arm_stc(coproc, CRd, void* p) STC coproc, CRd, […]
void __arm_stcl(coproc, CRd, void* p) STCL coproc, CRd, […]
void __arm_stc2(coproc, CRd, void* p) STC2 coproc, CRd, […]
void __arm_stc2l(coproc, CRd, void* p) STC2L coproc, CRd, […]

AArch32 Integer to coprocessor transfer intrinsics ^

Intrinsics are provided to map to the coprocessor/core register transfer instructions as follows:

Intrinsics Equivalent Instruction
void __arm_mcr(coproc, opc1, uint32_t value, CRn, CRm, opc2) MCR coproc, opc1, Rt, CRn, CRm, opc2
void __arm_mcr2(coproc, opc1, uint32_t value, CRn, CRm, opc2) MCR2 coproc, opc1, Rt, CRn, CRm, opc2
uint32_t __arm_mrc(coproc, opc1, CRn, CRm, opc2) MRC coproc, opc1, Rt, CRn, CRm, opc2
uint32_t __arm_mrc2(coproc, opc1, CRn, CRm, opc2) MRC2 coproc, opc1, Rt, CRn, CRm, opc2
void __arm_mcrr(coproc, opc1, uint64_t value, CRm) MCRR coproc, opc1, Rt, Rt2, CRm
void __arm_mcrr2(coproc, opc1, uint64_t value, CRm) MCRR2 coproc, opc1, Rt, Rt2, CRm
uint64_t __arm_mrrc(coproc, opc1, CRm) MRRC coproc, opc1, Rt, Rt2, CRm
uint64_t __arm_mrrc2(coproc, opc1, CRm) MRRC2 coproc, opc1, Rt, Rt2, CRm

The intrinsics __arm_mcrr/__arm_mcrr2 accept a single unsigned 64-bit integer value instead of two 32-bit integers. The low half of the value goes in register Rt and the high half goes in Rt2. Likewise for __arm_mrrc/__arm_mrrc2 which return an unsigned 64-bit integer.
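For illustration, assuming the CP15 encoding of the physical counter CNTPCT given in [ARMARM] (coproc=15, opc1=0, CRm=14), a 64-bit read using __arm_mrrc might look like the following sketch:

  /* Rt supplies the low 32 bits and Rt2 the high 32 bits of the result.  */
  uint64_t cntpct = __arm_mrrc(15, 0, 14);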

Unspecified behavior ^

ACLE does not specify how the implementation should behave in the following cases:

Instruction generation ^

Instruction generation, arranged by instruction ^

The following table indicates how instructions may be generated by intrinsics, and/or C code. The table includes integer data processing and certain system instructions.

Compilers are encouraged to use opportunities to combine instructions, or to use shifted/rotated operands where available. In general, intrinsics are not provided for accumulating variants of instructions in cases where the accumulation is a simple addition (or subtraction) following the instruction.

The table indicates which architectures the instruction is supported on, as follows:

Architecture 8 means Armv8-A AArch32 and AArch64, 8-32 means Armv8-A AArch32 only, and 8-64 means Armv8-A AArch64 only.

Architecture 7 means Armv7-A and Armv7-R.

In the sequence of Arm architectures { 5, 5TE, 6, 6T2, 7 } each architecture includes its predecessor instruction set.

In the sequence of Thumb-only architectures { 6-M, 7-M, 7E-M } each architecture includes its predecessor instruction set.

7MP are the Armv7 architectures that implement the Multiprocessing Extensions.

Instruction Flags Arch. Intrinsic or C code
BKPT   5 none
BFC   6T2, 7-M C
BFI   6T2, 7-M C
CLZ   5 __clz, __builtin_clz
DBG   7, 7-M __dbg
DMB   8, 7, 6-M __dmb
DSB   8, 7, 6-M __dsb
FRINT32Z   8-64 __rint32zf, __rint32z
FRINT64Z   8-64 __rint64zf, __rint64z
FRINT32X   8-64 __rint32xf, __rint32x
FRINT64X   8-64 __rint64xf, __rint64x
ISB   8, 7, 6-M __isb
LDREX   6, 7-M __sync_xxx
LDRT   all none
MCR/MRC   all see System register access
MSR/MRS   6-M see System register access
MSRR/MRRS   8-64 see System register access
PKHBT   6 C
PKHTB   6 C
PLD   8-32, 5TE, 7-M __pld
PLDW   7-MP __pldx
PLI   8-32, 7 __pli
QADD Q 5E, 7E-M __qadd
QADD16   6, 7E-M __qadd16
QADD8   6, 7E-M __qadd8
QASX   6, 7E-M __qasx
QDADD Q 5E, 7E-M __qadd(__qdbl)
QDSUB Q 5E, 7E-M __qsub(__qdbl)
QSAX   6, 7E-M __qsax
QSUB Q 5E, 7E-M __qsub
QSUB16   6, 7E-M __qsub16
QSUB8   6, 7E-M __qsub8
RBIT   8, 6T2, 7-M __rbit, __builtin_rbit
REV   8, 6, 6-M __rev, __builtin_bswap32
REV16   8, 6, 6-M __rev16
REVSH   6, 6-M __revsh
ROR   all __ror
SADD16 GE 6, 7E-M __sadd16
SADD8 GE 6, 7E-M __sadd8
SASX GE 6, 7E-M __sasx
SBFX   8, 6T2, 7-M C
SDIV   7-M+ C
SEL (GE) 6, 7E-M __sel
SETEND   6 n/a
SEV   8, 6K, 6-M, 7-M __sev
SHADD16   6, 7E-M __shadd16
SHADD8   6, 7E-M __shadd8
SHASX   6, 7E-M __shasx
SHSAX   6, 7E-M __shsax
SHSUB16   6, 7E-M __shsub16
SHSUB8   6, 7E-M __shsub8
SMC   8, 6Z, T2 none
SMI   6Z, T2 none
SMLABB Q 5E, 7E-M __smlabb
SMLABT Q 5E, 7E-M __smlabt
SMLAD Q 6, 7E-M __smlad
SMLADX Q 6, 7E-M __smladx
SMLAL   all, 7-M C
SMLALBB   5E, 7E-M __smulbb and C
SMLALBT   5E, 7E-M __smulbt and C
SMLALTB   5E, 7E-M __smultb and C
SMLALTT   5E, 7E-M __smultt and C
SMLALD   6, 7E-M __smlald
SMLALDX   6, 7E-M __smlaldx
SMLATB Q 5E, 7E-M __smlatb
SMLATT Q 5E, 7E-M __smlatt
SMLAWB Q 5E, 7E-M __smlawb
SMLAWT Q 5E, 7E-M __smlawt
SMLSD Q 6, 7E-M __smlsd
SMLSDX Q 6, 7E-M __smlsdx
SMLSLD   6, 7E-M __smlsld
SMLSLDX   6, 7E-M __smlsldx
SMMLA   6, 7E-M C
SMMLAR   6, 7E-M C
SMMLS   6, 7E-M C
SMMLSR   6, 7E-M C
SMMUL   6, 7E-M C
SMMULR   6, 7E-M C
SMUAD Q 6, 7E-M __smuad
SMUADX Q 6, 7E-M __smuadx
SMULBB   5E, 7E-M __smulbb; C
SMULBT   5E, 7E-M __smulbt; C
SMULTB   5E, 7E-M __smultb; C
SMULTT   5E, 7E-M __smultt; C
SMULL   all, 7-M C
SMULWB   5E, 7E-M __smulwb; C
SMULWT   5E, 7E-M __smulwt; C
SMUSD   6, 7E-M __smusd
SMUSDX   6, 7E-M __smusdx
SSAT Q 6, 7-M __ssat
SSAT16 Q 6, 7E-M __ssat16
SSAX GE 6, 7E-M __ssax
SSUB16 GE 6, 7E-M __ssub16
SSUB8 GE 6, 7E-M __ssub8
STREX   6, 7-M __sync_xxx
STRT   all none
SVC   all none
SWP   A32 only __swp [deprecated; see Swap]
SXTAB   6, 7E-M (int8_t)x + a
SXTAB16   6, 7E-M __sxtab16
SXTAH   6, 7E-M (int16_t)x + a
SXTB   8, 6, 6-M (int8_t)x
SXTB16   6, 7E-M __sxtb16
SXTH   8, 6, 6-M (int16_t)x
UADD16 GE 6, 7E-M __uadd16
UADD8 GE 6, 7E-M __uadd8
UASX GE 6, 7E-M __uasx
UBFX   8, 6T2, 7-M C
UDIV   7-M+ C
UHADD16   6, 7E-M __uhadd16
UHADD8   6, 7E-M __uhadd8
UHASX   6, 7E-M __uhasx
UHSAX   6, 7E-M __uhsax
UHSUB16   6, 7E-M __uhsub16
UHSUB8   6, 7E-M __uhsub8
UMAAL   6, 7E-M C
UMLAL   all, 7-M acc += (uint64_t)x * y
UMULL   all, 7-M C
UQADD16   6, 7E-M __uqadd16
UQADD8   6, 7E-M __uqadd8
UQASX   6, 7E-M __uqasx
UQSAX   6, 7E-M __uqsax
UQSUB16   6, 7E-M __uqsub16
UQSUB8   6, 7E-M __uqsub8
USAD8   6, 7E-M __usad8
USADA8   6, 7E-M __usad8 + acc
USAT Q 6, 7-M __usat
USAT16 Q 6, 7E-M __usat16
USAX   6, 7E-M __usax
USUB16   6, 7E-M __usub16
USUB8   6, 7E-M __usub8
UXTAB   6, 7E-M (uint8_t)x + i
UXTAB16   6, 7E-M __uxtab16
UXTAH   6, 7E-M (uint16_t)x + i
UXTB16   6, 7E-M __uxtb16
UXTH   8, 6, 6-M (uint16_t)x
VFMA   VFPv4 fma, __fma
VSQRT   VFP sqrt, __sqrt
WFE   8, 6K, 6-M __wfe
WFI   8, 6K, 6-M __wfi
YIELD   8, 6K, 6-M __yield

About FP8 intrinsics ^

The specification for FP8 intrinsics is in Alpha state.

Each 8-bit floating point intrinsic call has a parameter to define the format and scale of the operands, and the overflow behavior, as applicable to each operation. This parameter is typically declared as fpm_t fpm.

  typedef uint64_t fpm_t;

The bits of an argument to an fpm parameter are interpreted as follows:

Bit range Name Meaning
0-2 src1_format first source operand format: 0 - FP8 E5M2, 1 - FP8 E4M3
3-5 src2_format second source operand format: 0 - FP8 E5M2, 1 - FP8 E4M3
6-8 dst_format destination format: 0 - FP8 E5M2, 1 - FP8 E4M3
9-13   must be zero
14 overflow_mul overflow behavior for multiplication instructions:
    0 - generate infinity, 1 - generate maximum normal number
15 overflow_cvt overflow behavior for conversion instructions:
    0 - generate infinity or NaN, 1 - generate maximum normal number
16-22 lscale downscaling value
23   must be zero
24-31 nscale scaling value for conversions
32-37 lscale2 downscaling value for conversions of the second input stream
38-63   must be zero

Bit patterns other than as described above are invalid. Passing an invalid value as an argument to an FP8 intrinsic results in undefined behavior.

The ACLE declares several helper types and intrinsics to facilitate construction of fpm arguments. The helper intrinsics do not have side effects and their return values depend only on their parameters.

Passing an out of range argument to a helper intrinsic results in the intrinsic returning an indeterminate value. Passing such an indeterminate value as an argument to an FP8 intrinsic results in undefined behavior.

The helper types and intrinsics are available after including any of <arm_neon.h>, <arm_sve.h>, or <arm_sme.h>.

Note: where a helper intrinsic description refers to “updating the FP8 mode” it means the intrinsic only modifies the bits of the input fpm_t parameter that correspond to the new mode and returns the resulting value. No side effects (such as changing processor state) occur.

Individual FP8 intrinsics are described in their respective Advanced SIMD (NEON), SVE, and SME sections.

Support enumerations ^

enum __ARM_FPM_FORMAT {
  __ARM_FPM_E5M2,
  __ARM_FPM_E4M3,
};

enum __ARM_FPM_OVERFLOW {
  __ARM_FPM_INFNAN,
  __ARM_FPM_SATURATE,
};

Helper intrinsics ^

  fpm_t __arm_fpm_init();

Initializes a value, suitable for use as an fpm argument (“FP8 mode”). The value corresponds to a mode of operation where:

  fpm_t __arm_set_fpm_src1_format(fpm_t fpm, enum __ARM_FPM_FORMAT format);
  fpm_t __arm_set_fpm_src2_format(fpm_t fpm, enum __ARM_FPM_FORMAT format);

Updates the FP8 mode to set the first or the second source operand format, respectively.

  fpm_t __arm_set_fpm_dst_format(fpm_t fpm,  enum __ARM_FPM_FORMAT format);

Updates the FP8 mode to set the destination format.

  fpm_t __arm_set_fpm_overflow_cvt(fpm_t fpm, enum __ARM_FPM_OVERFLOW behavior);

Updates the FP8 mode to set the overflow behavior for conversion operations.

  fpm_t __arm_set_fpm_overflow_mul(fpm_t fpm, enum __ARM_FPM_OVERFLOW behavior);

Updates the FP8 mode to set the overflow behavior for multiplicative operations.

  fpm_t __arm_set_fpm_lscale(fpm_t fpm, uint64_t scale);

Updates the FP8 mode to set the downscaling value subtracted from:

The valid range for the scale parameter is [0, 127], inclusive.

  fpm_t __arm_set_fpm_lscale2(fpm_t fpm, uint64_t scale);

Updates the FP8 mode to set the downscaling value subtracted from the result exponent for instructions converting the second FP8 input data stream to other floating-point formats.

The valid range for the scale parameter is [0, 63], inclusive.

  fpm_t __arm_set_fpm_nscale(fpm_t fpm,  int64_t scale);

Updates the FP8 mode to set the scaling value added to the operand’s exponent for instructions converting other floating-point formats to an FP8 format.

The valid range for the scale parameter is [-128, 127], inclusive.
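For illustration, a typical fpm argument can be built by chaining the helper intrinsics declared above; the formats, overflow behavior, and scale below are arbitrary example choices rather than recommended settings:

  fpm_t fpm = __arm_fpm_init();
  fpm = __arm_set_fpm_src1_format(fpm, __ARM_FPM_E4M3);
  fpm = __arm_set_fpm_src2_format(fpm, __ARM_FPM_E4M3);
  fpm = __arm_set_fpm_dst_format(fpm, __ARM_FPM_E5M2);
  fpm = __arm_set_fpm_overflow_mul(fpm, __ARM_FPM_SATURATE);
  fpm = __arm_set_fpm_lscale(fpm, 4);

The resulting value can then be passed as the fpm argument of an FP8 intrinsic.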

Advanced SIMD (Neon) intrinsics ^

Introduction ^

The Advanced SIMD instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations on a range of integer and floating-point types.

Neon is an implementation of the Advanced SIMD instructions which is provided as an extension for some Cortex-A Series processors. Where this document refers to Neon instructions, such instructions refer to the Advanced SIMD instructions as described by the Arm Architecture Reference Manual [ARMARMv8].

The Advanced SIMD extension provides for arithmetic, logical and saturated arithmetic operations on 8-bit, 16-bit and 32-bit integers (and sometimes on 64-bit integers) and on 32-bit and 64-bit floating-point data, arranged in 64-bit and 128-bit vectors.

The intrinsics in this section provide C and C++ programmers with a simple programming model allowing easy access to code-generation of the Advanced SIMD instructions for both AArch64 and AArch32 execution states.

Concepts ^

The Advanced SIMD instructions are designed to improve the performance of multimedia and signal processing algorithms by operating on 64-bit or 128-bit vectors of elements of the same scalar data type.

For example, uint16x4_t is a 64-bit vector type consisting of four elements of the scalar uint16_t data type. Likewise, uint16x8_t is a 128-bit vector type consisting of eight uint16_t elements.

In a vector programming model, operations are performed in parallel across the elements of the vector. For example, vmul_u16(a, b) is a vector intrinsic which takes two uint16x4_t vector arguments a and b, and returns the result of multiplying corresponding elements from each vector together.

The Advanced SIMD extension also provides support for vector-by-lane and vector-by-scalar operations. In these operations, a scalar value is extracted from one element of a vector input, or provided directly, duplicated to create a new vector with the same number of elements as an input vector, and an operation is performed in parallel between this new vector and other input vectors.

For example, vmul_lane_u16(a, b, 1) is a vector-by-lane intrinsic which takes two uint16x4_t vector arguments. From b, element 1 is extracted, a new vector is formed which consists of four copies of element 1 of b, and this new vector is multiplied by a.

Reduction, cross-lane, and pairwise vector operations work on pairs of elements within a vector, or across the whole of a single vector performing the same operation between elements of that vector. For example, vaddv_u16(a) is a reduction intrinsic which takes a uint16x4_t vector, adds each of the four uint16_t elements together, and returns a uint16_t result containing the sum.
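The following sketch puts the intrinsics mentioned above together; note that vaddv_u16 is only available in the AArch64 execution state:

  uint16x4_t a, b;
  uint16x4_t prod = vmul_u16(a, b);           // element-wise multiply
  uint16x4_t lane = vmul_lane_u16(a, b, 1);   // multiply a by four copies of element 1 of b
  uint16_t   sum  = vaddv_u16(a);             // reduction: sum of the four elements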

Vector data types ^

Vector data types are named as a lane type and a multiple. Lane type names are based on the types defined in <stdint.h>. For example, int16x4_t is a vector of four int16_t values. The base types are int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float16_t, float32_t, poly8_t, poly16_t, poly64_t, poly128_t, and bfloat16_t. The multiples are such that the resulting vector types are 64-bit and 128-bit. In AArch64, float64_t and mfloat8_t are also base types.

Not all types can be used in all operations. Generally, the operations available on a type correspond to the operations available on the corresponding scalar type.

ACLE does not define whether int64x1_t is the same type as int64_t, whether uint64x1_t is the same type as uint64_t, or whether poly64x1_t is the same as poly64_t, for example for C++ overloading purposes.

float16 types are only available when the __fp16 type is defined, that is, when supported by the hardware.

bfloat types are only available when the __bf16 type is defined, that is, when supported by the hardware. The bfloat types are all opaque types. That is, they can only be used by intrinsics.

The FP8 types are all opaque types. That is, they can only be used by intrinsics.

Advanced SIMD Scalar data types ^

AArch64 supports Advanced SIMD scalar operations that work on the standard scalar data types: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float32_t, and float64_t.

Vector array data types ^

Array types are defined for multiples of 2, 3 or 4 of all the vector types, for use in load and store operations, in table-lookup operations, and as the result type of operations that return a pair of vectors. For a vector type <type>_t the corresponding array type is <type>x<length>_t. Concretely, an array type is a structure containing a single array element called val.

For example an array of two int16x4_t types is int16x4x2_t, and is represented as

  struct int16x4x2_t { int16x4_t val[2]; };

Note that this array of two 64-bit vector types is distinct from the 128-bit vector type int16x8_t.
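As a short sketch, array types typically appear as the result type of structured load intrinsics such as vld2_s16, and their vectors are accessed through the val member:

  int16_t buf[8];
  int16x4x2_t pair = vld2_s16(buf);   // de-interleaving load of two int16x4_t vectors
  int16x4_t even = pair.val[0];
  int16x4_t odd  = pair.val[1];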

Scalar data types ^

For consistency, <arm_neon.h> defines some additional scalar data types to match the vector types.

float32_t is defined as an alias for float.

If the __fp16 type is defined, float16_t is defined as an alias for it.

If the __bf16 type is defined, bfloat16_t is defined as an alias for it.

If the __mfp8 type is defined, mfloat8_t is defined as an alias for it.

poly8_t, poly16_t, poly64_t and poly128_t are defined as unsigned integer types. It is unspecified whether these are the same type as uint8_t, uint16_t, uint64_t and uint128_t for overloading and mangling purposes.

float64_t is defined as an alias for double.

16-bit floating-point arithmetic scalar intrinsics ^

The architecture extensions introduced by Armv8.2-A [ARMARMv82] provide a set of data processing instructions which operate on 16-bit floating-point quantities. These instructions are available in both AArch64 and AArch32 execution states, for both Advanced SIMD and scalar floating-point values.

ACLE defines two sets of intrinsics which correspond to these data processing instructions; a set of scalar intrinsics, and a set of vector intrinsics.

The intrinsics introduced in this section use the data types defined by ACLE. In particular, scalar intrinsics use the float16_t type defined by ACLE as an alias for the __fp16 type, and vector intrinsics use the float16x4_t and float16x8_t vector types.

Where the scalar 16-bit floating point intrinsics are available, an implementation is required to ensure that including <arm_neon.h> has the effect of also including <arm_fp16.h>.

To only enable support for the scalar 16-bit floating-point intrinsics, the header <arm_fp16.h> may be included directly.

16-bit brain floating-point arithmetic scalar intrinsics ^

The architecture extensions introduced by Armv8.6-A [Bfloat16] provide a set of data processing instructions which operate on brain 16-bit floating-point quantities. These instructions are available in both AArch64 and AArch32 execution states, for both Advanced SIMD and scalar floating-point values.

The brain 16-bit floating-point format (bfloat) differs from the older 16-bit floating-point format (float16) in that the former has an 8-bit exponent similar to a single-precision floating-point format but has a 7-bit fraction.

ACLE defines two sets of intrinsics which correspond to these data processing instructions; a set of scalar intrinsics, and a set of vector intrinsics.

The intrinsics introduced in this section use the data types defined by ACLE. In particular, scalar intrinsics use the bfloat16_t type defined by ACLE as an alias for the __bf16 type, and vector intrinsics use the bfloat16x4_t and bfloat16x8_t vector types.

Where the 16-bit brain floating point intrinsics are available, an implementation is required to ensure that including <arm_neon.h> has the effect of also including <arm_bf16.h>.

To only enable support for the 16-bit brain floating-point intrinsics, the header <arm_bf16.h> may be included directly.

When __ARM_BF16_FORMAT_ALTERNATIVE is defined to 1 then these types are storage only and cannot be used with anything other than ACLE intrinsics. The underlying type for them is uint16_t.

Operations on data types ^

ACLE does not define implicit conversion between different data types. For example:

  int32x4_t x;
  uint32x4_t y = x; // No representation change
  float32x4_t z = x; // Conversion of integer to floating type

Is not portable. Use the vreinterpret intrinsics to convert from one vector type to another without changing representation, and use the vcvt intrinsics to convert between integer and floating types; for example:

  int32x4_t x;
  uint32x4_t y = vreinterpretq_u32_s32(x);
  float32x4_t z = vcvtq_f32_s32(x);

ACLE does not define static construction of vector types. For instance:

  int32x4_t x = { 1, 2, 3, 4 };

Is not portable. Use the vcreate or vdup intrinsics to construct values from scalars.
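For example (a sketch using the construction intrinsics named above):

  int32x4_t x = vdupq_n_s32(1);                      // all four lanes set to 1
  int32x2_t y = vcreate_s32(0x0000000200000001ULL);  // lanes taken from a 64-bit pattern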

In C++, ACLE does not define whether Advanced SIMD data types are POD types or whether they can be inherited from.

Compatibility with other vector programming models ^

ACLE does not specify how the Advanced SIMD Intrinsics interoperate with alternative vector programming models. Consequently, programmers should take particular care when combining the Advanced SIMD Intrinsics programming model with such programming models.

For example, the GCC vector extensions permit initialising a variable using array syntax, as so

  #include "arm_neon.h"
  ...
  uint32x2_t x = {0, 1}; // GCC extension.
  uint32_t y = vget_lane_u32 (x, 0); // ACLE Neon Intrinsic.

But the definition of the GCC vector extensions is such that the value stored in y will depend on both the target architecture (AArch32 or AArch64) and whether the program is running in big- or little-endian mode.

It is recommended that Advanced SIMD Intrinsics be used consistently:

  #include "arm_neon.h"
  ...
  const uint32_t temp[2] = {0, 1};
  uint32x2_t x = vld1_u32 (temp);
  uint32_t y = vget_lane_u32 (x, 0);

Availability of Advanced SIMD intrinsics and Extensions ^

Availability of Advanced SIMD intrinsics ^

Advanced SIMD support is available if the __ARM_NEON macro is predefined (see Advanced SIMD architecture extension (Neon)). In order to access the Advanced SIMD intrinsics, it is necessary to include the <arm_neon.h> header.

  #if __ARM_NEON
  #include <arm_neon.h>
    /* Advanced SIMD intrinsics are now available to use.  */
  #endif

Some intrinsics are only available when compiling for the AArch64 execution state. This can be determined using the __ARM_64BIT_STATE predefined macro (see Instruction set architecture (A32/T32/A64)).

Availability of 16-bit floating-point vector interchange types ^

When the 16-bit floating-point data type __fp16 is available as an interchange type for scalar values, it is also available in the vector interchange types float16x4_t and float16x8_t. When the vector interchange types are available, conversion intrinsics between vector of __fp16 and vector of float types are provided.

This is indicated by the setting of bit 1 in __ARM_NEON_FP (see Neon floating-point).

  #if __ARM_NEON_FP & 0x2
    /* 16-bit floating point vector types are available.  */
    float16x8_t storage;
  #endif

Availability of fused multiply-accumulate intrinsics ^

Whenever fused multiply-accumulate is available for scalar operations, it is also available as a vector operation in the Advanced SIMD extension. When a vector fused multiply-accumulate is available, intrinsics are defined to access it.

This is indicated by __ARM_FEATURE_FMA (see Fused multiply-accumulate (FMA)).

  #if __ARM_FEATURE_FMA
    /* Fused multiply-accumulate intrinsics are available.  */
    float32x4_t a, b, c;
    vfmaq_f32 (a, b, c);
  #endif

Availability of Armv8.1-A Advanced SIMD intrinsics ^

The Armv8.1-A [ARMARMv81] architecture introduces two new instructions: SQRDMLAH and SQRDMLSH. ACLE specifies vector and vector-by-lane intrinsics to access these instructions where they are available in hardware.

This is indicated by __ARM_FEATURE_QRDMX (see Rounding doubling multiplies).

  #if __ARM_FEATURE_QRDMX
    /* Armv8.1-A RDMA extensions are available.  */
    int16x4_t a, b, c;
    vqrdmlah_s16 (a, b, c);
  #endif

Availability of 16-bit floating-point arithmetic intrinsics ^

Armv8.2-A [ARMARMv82] introduces new data processing instructions which operate on 16-bit floating point data in the IEEE754-2008 [IEEE-FP] format. ACLE specifies intrinsics which map to the vector forms of these instructions where they are available in hardware.

This is indicated by __ARM_FEATURE_FP16_VECTOR_ARITHMETIC (see 16-bit floating-point data processing operations).

  #if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
    float16x8_t a, b;
    vaddq_f16 (a, b);
  #endif

ACLE also specifies intrinsics which map to the scalar forms of these instructions, see 16-bit floating-point arithmetic scalar intrinsics. Availability of the scalar intrinsics is indicated by __ARM_FEATURE_FP16_SCALAR_ARITHMETIC.

  #if __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
    float16_t a, b;
    vaddh_f16 (a, b);
  #endif

Availability of 16-bit brain floating-point arithmetic intrinsics ^

Armv8.2-A [ARMARMv82] introduces new data processing instructions which operate on 16-bit brain floating point data as described in the Arm Architecture Reference Manual. ACLE specifies intrinsics which map to the vector forms of these instructions where they are available in hardware.

This is indicated by __ARM_FEATURE_BF16_VECTOR_ARITHMETIC (see Brain 16-bit floating-point support).

  #if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC
    const bfloat16_t *ptr_a, *ptr_b;  /* pointers to bfloat16 input data */
    float32x2_t res = {0};
    bfloat16x4_t a = vld1_bf16 (ptr_a);
    bfloat16x4_t b = vld1_bf16 (ptr_b);
    res = vbfdot_f32 (res, a, b);
  #endif

ACLE also specifies intrinsics which map to the scalar forms of these instructions, see 16-bit brain floating-point arithmetic scalar intrinsics. Availability of the scalar intrinsics is indicated by __ARM_FEATURE_BF16_SCALAR_ARITHMETIC.

  #if __ARM_FEATURE_BF16_SCALAR_ARITHMETIC
    float32_t b;
    bfloat16_t a = vcvth_bf16_f32 (b);  /* convert float32_t to bfloat16_t */
  #endif

Availability of Armv8.4-A Advanced SIMD intrinsics ^

New Crypto and FP16 Floating Point Multiplication Variant instructions in Armv8.4-A:

These instructions have been backported as optional instructions to Armv8.2-A and Armv8.3-A.

Availability of Dot Product intrinsics ^

The architecture extensions introduced by Armv8.2-A provide a set of dot product instructions which operate on 8-bit sub-element quantities. These instructions are available in both AArch64 and AArch32 execution states using Advanced SIMD instructions. These intrinsics are available when __ARM_FEATURE_DOTPROD is defined (see Dot Product extension).

  #if __ARM_FEATURE_DOTPROD
    uint32x2_t r;
    uint8x8_t a, b;
    r = vdot_u32 (r, a, b);
  #endif

Availability of Armv8.5-A floating-point rounding intrinsics ^

The architecture extensions introduced by Armv8.5-A provide a set of floating-point rounding instructions that round a floating-point number to a floating-point value that would be representable in a 32-bit or 64-bit signed integer type. NaNs, Infinities and Out-of-Range values are forced to the Most Negative Integer representable in the target size, and an Invalid Operation Floating-Point Exception is generated. These instructions are available only in the AArch64 execution state. The intrinsics for these are available when __ARM_FEATURE_FRINT is defined. The Advanced SIMD intrinsics are specified in the Arm Neon Intrinsics Reference Architecture Specification [Neon].
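A minimal sketch of one of the scalar forms listed in the instruction table above (d is an assumed double input):

  #if __ARM_FEATURE_FRINT
    double d;
    double r = __rint32z (d);   /* round d to a value representable in int32 */
  #endif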

Availability of Armv8.6-A Integer Matrix Multiply intrinsics ^

The architecture extensions introduced by Armv8.6-A provide a set of integer matrix multiplication and mixed sign dot product instructions. These instructions are optional from Armv8.2-A to Armv8.5-A.

These intrinsics are available when __ARM_FEATURE_MATMUL_INT8 is defined (see Matrix Multiply Intrinsics).

Specification of Advanced SIMD intrinsics ^

The Advanced SIMD intrinsics are specified in the Arm Neon Intrinsics Reference Architecture Specification [Neon].

The behavior of an intrinsic is specified to be equivalent to the AArch64 instruction it is mapped to in [Neon]. Intrinsics are specified as a mapping between their name, arguments and return values and the AArch64 instruction and assembler operands which they are equivalent to.

A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

Undefined behavior ^

Care should be taken by compiler implementers not to introduce the concept of undefined behavior to the semantics of an intrinsic. For example, the vabsd_s64 intrinsic has well defined behaviour for all input values, while the C99 llabs has undefined behaviour if the result would not be representable in a long long type. It would thus be incorrect to implement vabsd_s64 as a wrapper function or macro around llabs.
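For example (a sketch assuming <stdint.h> and <stdlib.h>):

  int64_t a = vabsd_s64 (INT64_MIN);   /* well defined for every input value */
  long long b = llabs (INT64_MIN);     /* undefined behavior in C */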

Alignment assertions ^

The AArch32 Neon load and store instructions provide for alignment assertions, which may speed up access to aligned data (and will fault access to unaligned data). The Advanced SIMD intrinsics do not directly provide a means for asserting alignment.

SVE language extensions and intrinsics ^

SVE introduction ^

The Scalable Vector Extension (SVE) was introduced as an optional extension to the Armv8.2-A architecture. Its associated feature flag is FEAT_SVE. SVE has since been extended further by features such as Armv8.6-A’s matrix multiplication extensions and Armv9-A’s SVE2.

For brevity, this document generally uses “SVE” to refer to the original FEAT_SVE version of SVE and all subsequent extensions to it. For example, references to “SVE” include “SVE2” except where otherwise noted.

ACLE defines a set of C and C++ types and accessors for SVE vectors and predicates. It also defines intrinsics for almost all SVE instructions. Mapping of SVE instructions to intrinsics lists each SVE instruction and links to the corresponding ACLE intrinsic (if one exists).

However, the intrinsic interface is more general than the underlying architecture, so not every intrinsic maps directly to an architectural instruction. The intention is to provide a regular interface and leave the ACLE implementation to pick the best mapping to SVE instructions.

SVE types ^

Overview of SVE types ^

SVE is a “vector-length agnostic” architecture. Instead of mandating a particular vector length, it allows implementations to choose any multiple of 128 bits up to the architectural maximum of 2048 bits. It also allows higher exception levels to constrain the vector lengths of lower exception levels. The effective vector length is therefore not a compile-time constant and can in principle change during execution.

Although the architectural increment is 128 bits, it is often more natural to think in terms of 64-bit quantities. The term “VG” (“vector granules”) refers to the current number of 64-bit elements in an SVE vector, so that the number of usable bits in a vector is VG×64. Predicates have one bit for each vector byte, so the number of usable bits in a predicate is VG×8.

Therefore, code that makes direct use of SVE instructions requires access to vector types that have VG×64 bits and predicate types that have VG×8 bits, but without the value of VG being fixed. Ideally it should be possible to pass and return these types by value, like it is for other (fixed-length) vector extensions. However, standard C and C++ do not provide variable-length types that are both self-contained (rather than dependent on separate storage) and suitable for passing and returning by value. (For example, C’s variable-length arrays are self-contained but are not valid return types. C++’s std::string can store variable-length data and is suitable for passing and returning by value, but it relies on separately-allocated storage.) This means that there is no existing mechanism that maps directly to the concept of an SVE vector or predicate.

Any approach to defining the types would fall into one of three categories:

  1. Limit the types in such a way that there is no concept of size.

  2. Define the size of the types to be variable.

  3. Define the size of the types to be constant, with the constant being large enough for all possible VG or with the types pointing to separate memory (as for classes like std::string).

ACLE takes the first approach and classifies SVE vectors and predicates as belonging to a new category of type called “sizeless types”. The next section describes these types in more detail.

Sizeless types ^

For the reasons explained in the previous section, ACLE defines a new category of type called “sizeless types”. They are less restrictive than the standard “incomplete types” but more restrictive than “complete types”. SVE vectors and predicates have sizeless type.

Informal definition of sizeless types ^

Informally, sizeless types can be used in the following situations:

Sizeless types may not be used in the following situations:

In all other respects, sizeless types have the same restrictions as the standard-defined “incomplete types”. This specifically includes (but is not limited to) the following:

Formal definition of sizeless types ^

Sizeless types in C gives a more formal definition of sizeless types for C, in the form of an edit to the standard. Sizeless types in C++ gives the corresponding changes for C++. Implementations should correctly compile any code that follows these rules, but they do not need to report a diagnostic for invalid uses of sizeless types. Whether they report a diagnostic or not is a quality of implementation issue. (Note: future versions of ACLE might have stricter diagnostic requirements.)

Sizeless types in C ^

This section specifies the behavior of sizeless types as an edit to the N1570 version of the C standard.

6.2.5 Types

In 6.2.5/1, replace:

At various points within a translation unit an object type may be incomplete

onwards with:

Object types are further partitioned into sized and sizeless; all basic and derived types defined in this standard are sized, but an implementation may provide additional sizeless types.

and add two additional clauses:

  • At various points within a translation unit an object type may be indefinite (lacking sufficient information to construct an object of that type) or definite (having sufficient information). An object type is said to be complete if it is both sized and definite; all other object types are said to be incomplete. Complete types have sufficient information to determine the size of an object of that type while incomplete types do not.

  • Arrays, structures, unions and enumerated types are always sized, so for them the term incomplete is equivalent to (and used interchangeably with) the term indefinite.

Change 6.2.5/19 to:

The void type comprises an empty set of values; it is a sized indefinite object type that cannot be completed (made definite).

Replace incomplete with indefinite and complete with definite in 6.2.5/37, which describes how a type’s state can change throughout a translation unit.

6.3.2.1 Lvalues, arrays, and function designators

Replace incomplete with indefinite in 6.3.2.1/1, so that sizeless definite types are modifiable lvalues.

Make the same replacement in 6.3.2.1/2, to prevent undefined behavior when lvalues have sizeless definite type.

6.5.1.1 Generic selection

Replace complete object type with definite object type in 6.5.1.1/2, so that the type name in a generic association can be a sizeless definite type.

6.5.2.2 Function calls

Replace complete object type with definite object type in 6.5.2.2/1, so that functions can return sizeless definite types.

Make the same change in 6.5.2.2/4, so that arguments can also have sizeless definite type.

6.5.2.5 Compound literals

Replace complete object type with definite object type in 6.5.2.5/1, so that compound literals can have sizeless definite type.

6.7 Declarations

Insert the following new clause after 6.7/4:

  • If an identifier for an object does not have automatic storage duration, its type must be sized rather than sizeless.

Replace complete with definite in 6.7/7, which describes when the type of an object becomes definite.

6.7.6.3 Function declarators (including prototypes)

Replace incomplete type with indefinite type in 6.7.6.3/4, so that parameters can also have sizeless definite type.

Make the same change in 6.7.6.3/12, which allows even indefinite types to be function parameters if no function definition is present.

6.7.9 Initialization

Replace complete object type with definite object type in 6.7.9/3, to allow initialization of identifiers with sizeless definite type.

6.9.1 Function definitions

Replace complete object type with definite object type in 6.9.1/3, so that functions can return sizeless definite types.

Make the same change in 6.9.1/7, so that adjusted parameter types can be sizeless definite types.

J.2 Undefined behavior

Update the entries that refer to the clauses above.

Sizeless types in C++ ^

This section specifies the behavior of sizeless types as an edit to the N3797 version of the C++ standard.

3.1 Declarations and definitions [basic.def]

Replace incomplete type with indefinite type in 3.1/5, so that objects can have sizeless definite type in some situations. Add a new clause immediately afterwards:

  • A program is ill-formed if any declaration of an object gives it both a sizeless type and either static or thread-local storage duration.

3.9 Types [basic.types]

Replace 3.9/5 with these clauses:

  • A class that has been declared but not defined, an enumeration type in certain contexts, or an array of unknown size or of indefinite element type, is an indefinite object type. Indefinite object types and the void types are indefinite types. Objects shall not be defined to have an indefinite type.

  • Object and void types are further partitioned into sized and sizeless; all basic and derived types defined in this standard are sized, but an implementation may provide additional sizeless types.

  • An object or void type is said to be complete if it is both sized and definite; all other object and void types are said to be incomplete. The term completely-defined object type is synonymous with complete object type.

  • Arrays, class types and enumeration types are always sized, so for them the term incomplete is equivalent to (and used interchangeably with) the term indefinite.

Replace incomplete types with indefinite types in 3.9/7, which is simply a cross-reference note.

3.9.1 Fundamental Types [basic.fundamental]

Replace the second sentence of 3.9.1/9 with:

The void type is a sized indefinite type that cannot be completed (made definite).

3.9.2. Compound Types [basic.compound]

Add “(including indefinite types)” after “Pointers to incomplete types” in 3.9.2/3, to emphasize that pointers to incomplete types are more restricted than pointers to complete types, even if the types are definite.

3.10 Lvalues and rvalues [basic.lval]

Replace complete types with definite types and incomplete types with indefinite types in 3.10/4, so that prvalues can be sizeless definite types.

Replace incomplete with indefinite and complete with definite in 3.10/7, which describes the modifiability of a pointer and the object to which it points.

4.1 Lvalue-to-rvalue conversion [conv.lval]

Replace incomplete type with indefinite type in 4.1/1, so that glvalues of sizeless definite type can be converted to prvalues.

5.2.2 Function call [expr.call]

Replace completely-defined object type with definite object type and incomplete class type with indefinite class type, so that parameters can have sizeless definite type.

Replace incomplete with indefinite and complete with definite in 5.2.2/11, so that functions can return sizeless definite types.

5.2.3 Explicit type conversion (function notation) [expr.type.conv]

Replace complete object type with definite object type in 5.2.3/2, so that the T() notation extends to sizeless definite types.

5.3.1 Unary operators [expr.unary.op]

Replace incomplete type with indefinite type in 5.3.1/1. This simply updates a cross-reference to the 4.1 change above.

5.3.5 Delete [expr.delete]

Insert the following after the first sentence of 5.3.5/2 (which describes the implicit conversion of an object type to a pointer type):

The type of the operand must now be a pointer to a sized type, otherwise the program is ill-formed.

8.3.4 Arrays [dcl.array]

In 8.3.4/1, add a sizeless type to the list of types that the element type T cannot have.

9.4.2 Static data members [class.static.data]

Replace an incomplete type with a sized indefinite type in 9.4.2/2, so that static data members cannot be declared with sizeless type.

Add a new final clause:

  • A static data member shall not have sizeless type.

14.3.1 Template type parameters [temp.arg.type]

Add “(including an indefinite type)” after “an incomplete type” in the note describing template type arguments, to emphasize that the arguments can be sizeless.

20.10.4.3 Type properties [meta.unary.prop]

Replace complete with definite in 20.10.4.3/3, so that the situations in which an implementation may implicitly instantiate template arguments remain the same as before.

20.10.6 Relationships between types [meta.rel]

Replace complete with definite in 20.10.6/2, so that these metaprogramming facilities extend to sized definite types.

20.10.7.5 Pointer modifications [meta.trans.ptr]

Likewise replace complete with definite in the initial table, so that these metaprogramming facilities also extend to sized definite types.

Interaction between sizeless types and other extensions ^

It is not feasible to reconcile the definition of sizeless types with each individual current and future extension to C and C++. Therefore, the default position is that extensions to C and C++ should treat sizeless types as incomplete types unless otherwise specified.

Scalar types defined by <arm_sve.h> ^

<arm_sve.h> includes <stdint.h> and so provides standard types such as uint32_t. When included from C code, the header also includes <stdbool.h> and so provides the bool type.

In addition, the header file defines the following scalar data types:

Scalar type Definition
float16_t equivalent to __fp16
float32_t equivalent to float
float64_t equivalent to double
mfloat8_t equivalent to __mfp8

If the feature macro __ARM_FEATURE_BF16_SCALAR_ARITHMETIC is defined, <arm_sve.h> also includes <arm_bf16.h>, which defines the scalar 16-bit brain floating-point type __bf16. Then, under the control of the same feature macro, <arm_sve.h> defines an alias of __bf16 called bfloat16_t.

SVE vector types ^

<arm_sve.h> defines the following sizeless types for single vectors:

Signed integer Unsigned integer Floating-point  
svint8_t svuint8_t   svmfloat8_t
svint16_t svuint16_t svfloat16_t svbfloat16_t
svint32_t svuint32_t svfloat32_t  
svint64_t svuint64_t svfloat64_t  

Each type svBASE_t is an opaque length-agnostic vector of BASE_t elements. The types in each row have the same number of elements and have twice as many elements as the types in the row below them.

svbfloat16_t is only available if the header file also provides a definition of bfloat16_t (see Scalar types defined by <arm_sve.h>). The other types are available unconditionally.

<arm_sve.h> also defines tuples of two, three, and four vectors, as follows:

Signed integer Unsigned integer Floating-point  
svint8x2_t svuint8x2_t   svmfloat8x2_t
svint16x2_t svuint16x2_t svfloat16x2_t svbfloat16x2_t
svint32x2_t svuint32x2_t svfloat32x2_t  
svint64x2_t svuint64x2_t svfloat64x2_t  
       
svint8x3_t svuint8x3_t   svmfloat8x3_t
svint16x3_t svuint16x3_t svfloat16x3_t svbfloat16x3_t
svint32x3_t svuint32x3_t svfloat32x3_t  
svint64x3_t svuint64x3_t svfloat64x3_t  
       
svint8x4_t svuint8x4_t   svmfloat8x4_t
svint16x4_t svuint16x4_t svfloat16x4_t svbfloat16x4_t
svint32x4_t svuint32x4_t svfloat32x4_t  
svint64x4_t svuint64x4_t svfloat64x4_t  

Each svBASExN_t is an opaque sizeless type that conceptually contains a sequence of N svBASE_ts, indexed starting at 0. It is available whenever the corresponding single-vector type is.

ACLE provides the following intrinsics for creating and accessing these tuple types:

svcreate2(svBASE_t x0, svBASE_t x1)
svcreate3(svBASE_t x0, svBASE_t x1, svBASE_t x2)
svcreate4(svBASE_t x0, svBASE_t x1, svBASE_t x2, svBASE_t x3)

svcreateN returns a svBASExN_t in which the vector at index M has the value given by xM.

svset2(svBASEx2_t tuple, uint64_t imm_index, svBASE_t x)
svset3(svBASEx3_t tuple, uint64_t imm_index, svBASE_t x)
svset4(svBASEx4_t tuple, uint64_t imm_index, svBASE_t x)

svsetN returns a copy of tuple in which the vector at index imm_index has the value given by x. imm_index must be an integer constant expression in the range [0, N-1].

svget2(svBASEx2_t tuple, uint64_t imm_index)
svget3(svBASEx3_t tuple, uint64_t imm_index)
svget4(svBASEx4_t tuple, uint64_t imm_index)

svgetN returns the value of vector imm_index in tuple. imm_index must be an integer constant expression in the range [0, N-1].

It is also possible to create completely-undefined vectors and tuples of vectors using svundef, svundef2, svundef3 and svundef4.
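For illustration, a sketch using the overloaded forms of these intrinsics (svdup_f32 is the suffixed duplication intrinsic discussed under the SVE naming convention later in this document):

  svfloat32_t lo = svdup_f32 (1.0f);
  svfloat32_t hi = svdup_f32 (2.0f);
  svfloat32x2_t pair = svcreate2 (lo, hi);     // tuple containing {lo, hi}
  pair = svset2 (pair, 1, svdup_f32 (3.0f));   // replace the vector at index 1
  svfloat32_t first = svget2 (pair, 0);        // extract the vector at index 0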

ACLE provides two sets of intrinsics for converting between vector types and between vector tuple types:

To avoid any ambiguity between the two operations, ACLE does not allow C-style casting from one vector type to another.

SVE predicate types ^

<arm_sve.h> defines a single sizeless predicate type svbool_t, which has enough bits to control an operation on a vector of bytes.

The main use of predicates is to select elements in a vector. When the elements in the vector have N bytes, only the low bit in each sequence of N predicate bits is significant. For example, with bits of a predicate numbered from the least significant bit upwards, the elements selected by each bit are as follows:

  bit 0 bit 1 bit 2 bit 3 bit 4 bit 5 bit 6 bit 7 bit 8
svuint8_t element 0 1 2 3 4 5 6 7 8
svuint16_t element 0 - 1 - 2 - 3 - 4
svuint32_t element 0 - - - 1 - - - 2
svuint64_t element 0 - - - - - - - 1
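For illustration, a sketch using the predicate intrinsics svptrue_b32 and svcntp_b32 (the naming scheme for these intrinsics is described later in this document):

  svbool_t pg = svptrue_b32 ();        // selects every 32-bit element
  uint64_t n  = svcntp_b32 (pg, pg);   // counts the active 32-bit elements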

Fixed-length SVE types ^

Introduction to fixed-length SVE types ^

As described in Overview of SVE types, SVE is a “length-agnostic” architecture that supports a range of different vector register sizes. Some SVE ACLE code might be written to work with all of these register sizes; such code is fully “length-agnostic”. Other SVE ACLE code might instead require the register size to belong to a restricted set. Possible restrictions include imposing a minimum register size, a maximum register size, a power-of-two register size, or a single exact register size. This is a choice for the programmer to make. For example, the program snippet:

  svfloat32_t data = svld1(svptrue_pat_b32(SV_VL16), ptr);

loads no data when the vector registers are smaller than 512 bits, so in practice the code is likely to require a register size of 512 bits or more.

Similarly, for any given piece of source code (whether it uses ACLE or not), a code generator might produce SVE object code that only correctly implements the source code for certain runtime register sizes. The code is likely to abort or malfunction if the registers have a different size. For example, if the code generator assumes that the vector register size is exactly 256 bits, it might use:

  ADD X1, X2, #32

instead of:

  ADDVL X1, X2, #1

Let SourceVL be the set of vector register sizes supported by a piece of SVE ACLE source code and ObjectVL be the set of vector register sizes for which the object code correctly implements the source code. In both cases, the set might be the set of all possible register sizes (referred to as “unrestricted” below).

In general, object code created from SVE ACLE source code only works correctly when the runtime register size belongs to the intersection of SourceVL and ObjectVL. The behavior in other cases is undefined.

The compiler might impose a fixed ObjectVL set for all code or provide a level of user control. It might also allow ObjectVL to vary within the program, either within the same translation unit or between different translation units. Whether and how this happens is currently outside the scope of ACLE.

The combination of a singleton SourceVL and an unrestricted ObjectVL makes sense: ACLE source code written for a single register size can still be compiled correctly without knowing what that size is, just like an assembler can assemble SVE code without knowing what register sizes the assembly code supports. Similarly, the combination of a singleton ObjectVL and an unrestricted SourceVL makes sense: ACLE source code written for all register sizes can still be compiled for only a single register size, perhaps to try to improve the performance of surrounding code. Other combinations are also possible.

This section describes ACLE features related to singleton ObjectVLs. They are intended to help cases in which SourceVL is the same singleton.

All the features are optional: a conforming ACLE implementation does not need to provide them.

Note: the types defined in SVE vector types and SVE predicate types are always sizeless types and are subject to the same restrictions for all SourceVL and for all ObjectVL.

The __ARM_FEATURE_SVE_BITS macro ^

As an optional extension, an ACLE implementation may set the preprocessor macro:

  __ARM_FEATURE_SVE_BITS

to a nonzero value N provided that:

If the implementation allows ObjectVL to be changed within a translation unit, it must ensure that the value of the macro remains accurate after each change. __ARM_FEATURE_SVE_BITS would then have different values at different points in a translation unit.

Defining the macro to zero has the same meaning as not defining it at all.

The arm_sve_vector_bits attribute ^

Syntax and requirements ^

An ACLE implementation can choose to provide the following GNU-style type attribute:

  __attribute__((arm_sve_vector_bits(...arguments...)))

ACLE only defines the effect of the attribute if all of the following are true:

  1. the attribute is attached to a single SVE vector type (such as svint32_t) or to the SVE predicate type svbool_t;

  2. the arguments consist of a single nonzero integer constant expression (referred to as N below); and

  3. N==__ARM_FEATURE_SVE_BITS.

In other cases the implementation must do one of the following:

General arm_sve_vector_bits behavior ^

Let VLST be a valid type:

  VLAT __attribute__((arm_sve_vector_bits(N)))

for some SVE vector type or SVE predicate type VLAT. VLST is then a fixed-length version of VLAT that is specialized for an N-bit SVE target. It is a normal sized (rather than sizeless) type, so unlike VLAT, it can be used as the type of a global variable, class member, constexpr, and so on. For example:

  #if __ARM_FEATURE_SVE_BITS==512
  typedef svint32_t vec __attribute__((arm_sve_vector_bits(512)));
  typedef svbool_t pred __attribute__((arm_sve_vector_bits(512)));
  #endif

creates a type called vec that acts as a fixed-length version of svint32_t and a type called pred that acts as a fixed-length version of svbool_t. Both types are normal sized types; vec contains exactly 512 bits (64 bytes) and pred contains exactly 64 bits (8 bytes):

  #if __ARM_FEATURE_SVE_BITS==512
  typedef svint32_t vec __attribute__((arm_sve_vector_bits(512)));
  typedef svbool_t pred __attribute__((arm_sve_vector_bits(512)));

  svint32_t g1;  // Invalid, svint32_t is sizeless
  vec g2;        // OK
  svbool_t g3;   // Invalid, svbool_t is sizeless
  pred g4;       // OK

  struct wrap1 { svint32_t x; };  // Invalid, svint32_t is sizeless
  struct wrap2 { vec x; };        // OK
  struct wrap3 { svbool_t x; };   // Invalid, svbool_t is sizeless
  struct wrap4 { pred x; };       // OK

  size_t size1 = sizeof(svint32_t);  // Invalid, svint32_t is sizeless
  size_t size2 = sizeof(vec);        // OK, equals 64
  size_t size3 = sizeof(svbool_t);   // Invalid, svbool_t is sizeless
  size_t size4 = sizeof(pred);       // OK, equals 8
  #endif

The following rules apply to both vector and predicate VLST types:

See arm_sve_vector_bits behavior specific to vectors for additional rules that apply specifically to vectors and arm_sve_vector_bits behavior specific to predicates for additional rules that apply specifically to predicates.

arm_sve_vector_bits behavior specific to vectors ^

If VLAT is an SVE vector type svBASE_t and VLST is a valid type:

  VLAT __attribute__((arm_sve_vector_bits(N)))

then VLST is a sized type that has the following properties:

  sizeof(VLST) == N/8
  alignof(VLST) == 16

If in addition the implementation defines the preprocessor macro

  __ARM_FEATURE_SVE_VECTOR_OPERATORS

and if VLAT is not svbfloat16_t, then:

arm_sve_vector_bits behavior specific to predicates ^

Let VLST be a valid type:

  svbool_t __attribute__((arm_sve_vector_bits(N)))

VLST is then a sized type that has the following properties:

  sizeof(VLST) == N/64
  alignof(VLST) == 2

If in addition the implementation defines the preprocessor macro

  __ARM_FEATURE_SVE_PREDICATE_OPERATORS

then:

An implementation can choose to provide other operators too, but it does not need to do so.

Note that, at the time of writing, the GNU vector_size extension is not defined for bool elements, nor for packed vectors of single bits. The definition above is intended to be a conservative subset of what such an extension might provide.

C++ mangling of fixed-length SVE types ^

Let VLST be a valid C++ type:

  VLAT __attribute__((arm_sve_vector_bits(N)))

for some SVE vector type or SVE predicate type VLAT. VLST is mangled in the same way as a template:

  template<typename, unsigned> struct __SVE_VLS;

with the arguments:

  __SVE_VLS<VLAT, N>

For example:

  #if __ARM_FEATURE_SVE_BITS==512
  // Mangled as 9__SVE_VLSIu11__SVInt32_tLj512EE
  typedef svint32_t vec __attribute__((arm_sve_vector_bits(512)));
  // Mangled as 9__SVE_VLSIu10__SVBool_tLj512EE
  typedef svbool_t pred __attribute__((arm_sve_vector_bits(512)));
  #endif

SVE enum declarations ^

The following enum enumerates all the possible pattern operands to a PTRUE:

  enum svpattern
  {
    SV_POW2 = 0,
    SV_VL1 = 1,
    SV_VL2 = 2,
    SV_VL3 = 3,
    SV_VL4 = 4,
    SV_VL5 = 5,
    SV_VL6 = 6,
    SV_VL7 = 7,
    SV_VL8 = 8,
    SV_VL16 = 9,
    SV_VL32 = 10,
    SV_VL64 = 11,
    SV_VL128 = 12,
    SV_VL256 = 13,
    SV_MUL4 = 29,
    SV_MUL3 = 30,
    SV_ALL = 31
  };

The following enum lists the possible prefetch types:

  enum svprfop
  {
    SV_PLDL1KEEP = 0,
    SV_PLDL1STRM = 1,
    SV_PLDL2KEEP = 2,
    SV_PLDL2STRM = 3,
    SV_PLDL3KEEP = 4,
    SV_PLDL3STRM = 5,
    SV_PSTL1KEEP = 8,
    SV_PSTL1STRM = 9,
    SV_PSTL2KEEP = 10,
    SV_PSTL2STRM = 11,
    SV_PSTL3KEEP = 12,
    SV_PSTL3STRM = 13
  };

Both enums are defined by <arm_sve.h>.
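For example, a sketch combining the two enums with the svptrue_pat_b32 and svprfw intrinsics (ptr is an assumed pointer to prefetchable data):

  const float *ptr;
  svbool_t pg = svptrue_pat_b32 (SV_VL8);   // first eight 32-bit elements
  svprfw (pg, ptr, SV_PLDL1KEEP);           // prefetch for load, keep in L1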

SVE intrinsics ^

SVE naming convention ^

The SVE ACLE intrinsics have the form:

  svbase[_disambiguator][_type0][_type1]...[_predication]

where the individual parts are as follows:

base

For most intrinsics this is the lower-case name of an SVE instruction, but with some adjustments:

disambiguator

This field distinguishes between different forms of an intrinsic. There are several common uses:

type0
type1

These fields list the types of vectors and predicates, starting with the return type and continuing with the argument types. They do not include the types of vector bases and displacements, which form part of the addressing mode disambiguator instead. They also do not include argument types that are uniquely determined by the previous argument types and return type.

For vectors the field is a type category followed by an element size in bits:

Type category 8-bit 16-bit 32-bit 64-bit
signed integers s8 s16 s32 s64
unsigned integers u8 u16 u32 u64
floating-point numbers   f16 f32 f64
brain floating-point numbers   bf16    

For predicates the suffix is b followed by the size of the associated data elements in bits, or simply b if the operation does not assume a particular element size. For example, the intrinsic for

  PTRUE Pd.B

is svptrue_b8 while the intrinsic for

  PTRUE Pd.H

is svptrue_b16. The intrinsic for:

  PFALSE Pd.B

is svpfalse_b rather than svpfalse_b8 since the result is suitable for all element sizes.

predication

This suffix describes the inactive elements in the result of a predicated operation. It can be one of the following:

z

Zero predication: set all inactive elements of the result to zero.

m

Merge predication: copy all inactive elements from the first vector argument.

Unary operations have a separate vector argument that comes before the general predicate; for example:

  CLZ Zd.H, Pg/M, Zs.H

corresponds to:

  Zd = svclz_u16_m(Zd, Pg, Zs);

The argument does not need to be syntactically related to the result; calls such as:

  Zd = svclz_u16_m(svadd_u16_z(...), Pg, Zs);

are also valid.

Binary and ternary operations reuse the first argument to the operation as the merge input; for example:

  ADD Zd.S, Pg/M, Zd.S, Zs2.S

corresponds to:

  Zd = svadd_u32_m(Pg, Zd, Zs2);

where the Zd argument supplies both the values of inactive elements and the first operand to the addition. Again, the merge input does not need to be syntactically related to the result.

x

“Don’t-care” predication: set the inactive elements to unknown values.

Warning: these values could be arbitrary register contents left around from a previous operation, so accidentally using the inactive elements could lead to data leakage.

This form of predication removes the need to choose between zeroing and merging in cases where the inactive elements are unimportant. The code generator can then pick whichever form of instruction seems to give the best code. This includes using unpredicated instructions, where available and suitable.

Predicated loads, predicated comparisons and predicated string match operations are always zeroing operations, so for brevity, the corresponding intrinsics have no predication suffix.
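
For example, the following sketch contrasts the three predication suffixes for a unary operation (the declarations are assumed):

  svuint16_t inactive, x;
  svbool_t pg;
  ...
  svuint16_t z = svclz_u16_z(pg, x);             // inactive elements are zero
  svuint16_t m = svclz_u16_m(inactive, pg, x);   // inactive elements copied from 'inactive'
  svuint16_t d = svclz_u16_x(pg, x);             // inactive elements have unknown values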

Overloaded aliases of SVE intrinsics ^

The intrinsic names that are described in SVE naming convention often carry information that is obvious from the types of the arguments. For example, svclz_u16_z always takes a svuint16_t and no other form of svclz_type_z does. ACLE therefore defines shorter, overloaded, aliases that are compatible with both C++ overloading and C _Generic associations. At a minimum, this means that all overloaded forms of an intrinsic have the same number of arguments and that it is possible to select the correct overload by considering each argument from left to right.

(Although the overloading scheme is compatible with _Generic, a C implementation can use other implementation-specific ways of achieving the same effect.)

The usual integer promotions mean that overloading based on scalar integer types can be non-obvious. For example, in the C++ code:

  int f(unsigned char) { return 1; }
  int f(int) { return 2; }
  int g1(unsigned char c) { return f(c); }
  int g2(unsigned char c) { return f(c + 1); }

g1 returns 1 but g2 returns 2. A cast would be required to make g2 use the unsigned char version of f instead of the int version. For this reason, ACLE does not use overloading of scalar arguments alone to determine the type of a vector result. Intrinsics like svdup and svindex (whose only arguments are scalar) always require a suffix indicating the return type.

Overloaded aliases are always a full intrinsic name with parts of the suffix removed. The rest of this document refers to both the full intrinsic name and its overloaded alias by enclosing the elided suffix characters in square brackets. For example:

  svclz[_u16]_m

says that the full name is svclz_u16_m and that its overloaded alias is svclz_m.
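
For example, given suitable declarations, the full name and its overloaded alias are interchangeable (a short sketch):

  svuint16_t inactive, x;
  svbool_t pg;
  ...
  svuint16_t r1 = svclz_u16_m(inactive, pg, x);   // full name
  svuint16_t r2 = svclz_m(inactive, pg, x);       // overloaded alias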

SVE addressing modes ^

Load, store, prefetch, and ADR intrinsics have different forms for different addressing modes. The exact set of addressing modes depends on the particular operation, but they always have a base component and may also have a displacement component.

The base may be either a single C pointer or a vector of address values. In the latter case, the address values may be 32 or 64 bits in size. Addressing modes with vector bases use the disambiguator [_xNNbase], where xNN is the type of the address vector elements.

The displacement, if present, may be a 64-bit scalar or a vector of 32-bit or 64-bit elements. There are five forms in total, with the following disambiguators:

_offset

The displacement is a 64-bit scalar byte count.

_index

The displacement is a 64-bit scalar element count. For example, if an intrinsic loads 16-bit elements from memory, an _index displacement of 1 is equivalent to an _offset displacement of 2.

_vnum

The displacement is a 64-bit scalar that counts a single vector’s worth of elements. For example, if an intrinsic loads one or more full vectors from memory, a _vnum displacement of 1 is equivalent to an _offset displacement of VG×8 (the number of bytes in an SVE vector). If an intrinsic loads 16-bit elements and extends them to 32 bits, a _vnum displacement of 1 is equivalent to an _offset displacement of VG×4 (half the number of bytes in an SVE vector).

This form corresponds to the MUL VL addressing mode. However, the displacement argument can be any scalar value; it does not need to be a constant in a particular range.

_[xNN]offset

The displacement is a vector of type xNN and each element specifies a separate byte count. In other words, the address of element i is formed by adding offsets[i] bytes to the base address.

_[xNN]index

Similar to _[xNN]offset, but each element specifies an element count rather than a byte count, in the same way as for _index.

These displacements do not need to be constant and they do not need to be within a specific range.

For example, the simplest load addressing mode is:

  svint16_t svld1[_s16](svbool_t pg, const int16_t *base)

which loads N elements from base[0] to base[N-1] inclusive.

It is possible to apply a displacement measured in whole vectors using:

  svint16_t svld1_vnum[_s16](svbool_t pg, const int16_t *base, int64_t vnum)

which loads N elements from base[N*vnum] to base[N*vnum+N-1] inclusive.

The following intrinsic instead takes a vector of offsets, measured in bytes:

  svuint32_t svld1_gather_[s32]offset[_u32](svbool_t pg, const uint32_t *base,
                                            svint32_t offsets)

In this case the address of element i is:

  (uint32_t *)((uintptr_t)base + offsets[i])

For:

  svuint32_t svld1_gather_[s32]index[_u32](svbool_t pg, const uint32_t *base,
                                           svint32_t indices)

the address of element i is simply &base[indices[i]].

The following intrinsic is an example of one that combines a vector base with a scalar index:

  svint32_t svld1_gather[_u32base]_index_s32(svbool_t pg, svuint32_t bases,
                                             int64_t index)

The address of element i is:

  ((int32_t *)(uintptr_t)bases[i]) + index
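
Putting these addressing modes to work, a simple strip-mined loop over the basic contiguous form might look like this (a sketch only: the function and variable names are illustrative and <arm_sve.h> is assumed):

  void add_one(int16_t *dst, const int16_t *src, int64_t n) {
    for (int64_t i = 0; i < n; i += svcnth()) {
      svbool_t pg = svwhilelt_b16(i, n);             // active for the elements that remain
      svint16_t v = svld1(pg, src + i);              // overloaded alias of svld1[_s16]
      svst1(pg, dst + i, svadd_n_s16_x(pg, v, 1));   // store the incremented elements
    }
  }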

SVE operations involving vectors and scalars ^

Some of the SVE instructions have immediate forms; for example:

  ADD Zd.S, Zd.S, #1

adds 1 to every element of Zd.S. ACLE extends this approach to all arithmetic operations and to all scalar inputs (not just immediates). The implementation can then use immediate forms where possible or duplicate the scalar into a vector otherwise.

For example:

  svint32_t x;
  ...
  x = svadd[_n_s32]_x(pg, x, 1);

adds 1 to every active element of x while:

  svfloat64_t x;
  double *ptr;
  ...
  x = svadd[_n_f64]_x(pg, x, *ptr);

adds *ptr to every active element of x. The first example is likely to use the immediate form of ADD while the second is likely to use LD1RD.

In a vector operation, the disambiguator _n indicates that the final operand is a scalar rather than a vector.

Immediate arguments to SVE intrinsics ^

Some intrinsics take enumeration values as arguments. These arguments must always be integer constant expressions that specify a valid enumeration value.

A few intrinsics take general integer immediates as arguments. In the intrinsics documentation, these arguments always start with the prefix imm. Immediate arguments must be integer constant expressions within a certain range (which is the same as the range of the underlying SVE instruction).
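
For example, the shift amount of svasrd must be an integer constant expression (a sketch; the declarations are assumed):

  svbool_t pg;
  svint32_t x;
  ...
  svint32_t q = svasrd_n_s32_x(pg, x, 2);   // OK: 2 is a constant expression in the
                                            // range accepted by the instruction (1-32)
  int sh = 2;
  // svasrd_n_s32_x(pg, x, sh);             // Ill-formed: 'sh' is not a constant expression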

Faults and exceptions in SVE intrinsics ^

Loads and stores can trigger faults in the same way as normal pointer dereferences. Exactly what happens then depends on the host environment and is out of scope for this document. However, these faults should be symptoms of a program going wrong rather than something that the programmer deliberately planted. The implementation can therefore remove or reorder potentially faulting operations as long as:

Also, many floating-point operations can raise IEEE exceptions. A similar set of rules applies there: the implementation can remove or reorder operations that might raise an IEEE exception if:

For example, the compiler can remove a floating-point addition whose result is not used. It can also reorder independent floating-point operations, even if that would change the order that exceptions occur. It cannot however move floating-point addition across a direct or indirect call to feclearexcept unless it can prove that the addition would not raise an exception.

Many code generators have a mode that ignores IEEE exceptions. The floating-point restrictions above would not apply when such a mode is in effect.

First-faulting and non-faulting loads ^

SVE provides load instructions in which only the first active element can fault, and others in which no elements can fault. A special register called the first-fault register (FFR) records which elements were loaded successfully. The FFR describes all loads that have executed since the last write to the register.

If a first-faulting or non-faulting load does not load an active element due to a potential fault, it clears the FFR from that element onwards. Those elements of the returned vector then have an unknown value. Elements also have an unknown value if the associated FFR element was already clear before the instruction started.

For example, in:

  SETFFR
  LDFF1D Z0.D, P0/Z, [X0]
  LDFF1D Z1.D, P1/Z, [X1]
  LDFF1D Z2.D, P2/Z, [X2]
  RDFFR P3.B

the load of Z0.D might suppress the load of [X0, #16] due to a potential fault. It would then clear bit 16 onwards of the FFR and leave elements 2 onwards of Z0.D in an unknown state. The same elements of Z1.D and Z2.D would then also be unknown, regardless of whether the loads based on X1 or X2 might fault. At the end of the sequence, P3.B indicates which elements of Z0.D, Z1.D and Z2.D are valid.

In this sequence, the leading inactive elements of Z0.D are guaranteed to be zero and the first active element is guaranteed to be valid (assuming that the first element did not trigger a fault). The same guarantees apply to Z1.D only if the first active element of P1 is guaranteed to be earlier than the second active element of P0. In this sense, the contents of Z0.D, Z1.D and Z2.D do depend on the order of the instructions, since the guarantees would be different if the loads had a different order. However, the point of the loads is to try to access more than the first active element, so these relationships are not useful in practice.

ACLE therefore divides first-faulting and non-faulting loads (but not normal loads) into “FFR groups”. Each group begins and ends with an intrinsic that explicitly reads from or writes to the FFR. More precisely, ACLE introduces a global “first-fault register token” (FFRT) that identifies the current FFR group. This FFRT is a purely conceptual construct and contains three pieces of information:

Identifier Contents
nwrite the number of explicit writes to the FFR
lastwrite the last value written to the FFR
nread the number of explicit reads from the FFR since the last write

There are only two possible ways of modifying this FFRT:

  1. Increment nwrite by one, set lastwrite to a given value, and set nread to zero.

  2. Increment nread by one.

Intrinsics that explicitly write to the FFR do the first operation. Intrinsics that explicitly read from the FFR do the second operation. First-faulting and non-faulting loads read from the FFRT and depend on its value, but they do not write to it.

One consequence of this arrangement is that an intrinsic that writes to the FFR is only dead if there are no further references to the FFRT in the program. However, the normal “as if” rules apply, so the implementation can produce an empty code sequence for the write if doing so would not affect the behavior of the program. Similarly, if a first-faulting or non-faulting load follows an intrinsic that explicitly reads from the FFR, without an intervening write, the load keeps the read intrinsic alive even if the result of that read is unused. Again, the normal “as if” rules mean that the implementation can produce an empty code sequence for the read if doing so would not affect the behavior of the program.

These FFRT dependencies are the only FFR-based ones that the implementation needs to consider when optimizing first-faulting and non-faulting loads; in all other respects the implementation can treat the loads like normal loads. This includes removing loads whose results are not used, suppressing loads of individual elements if their values do not matter, or reordering loads within a group (subject to the usual rules for normal loads). In practice the value of the FFR before a load does still affect which elements of the load result are defined, and in practice the loads do still write to the FFR, but the input program does not control these effects directly.

Assuming float64_t data, the C version of the code above would be:

  svfloat64_t z0, z1, z2;
  svbool_t p0, p1, p2, p3;
  double *x0, *x1, *x2;
  ...
  svsetffr();
  z0 = svldff1[_f64](p0, x0);
  z1 = svldff1[_f64](p1, x1);
  z2 = svldff1[_f64](p2, x2);
  p3 = svrdffr();

SVE reinterpret intrinsics ^

The svreinterpret intrinsics for vector types and vector tuple types take the form

  svreinterpret_type0[_type1_xN]

where N refers to the number of elements in a tuple type (2, 3, or 4).

For example:

  svuint16_t svreinterpret_u16_s32(svint32_t op);
  svuint16_t svreinterpret_u16(svint32_t op);
  svuint16x2_t svreinterpret_u16_s32_x2(svint32x2_t op);
  svuint16x2_t svreinterpret_u16(svint32x2_t op);

A list of SVE reinterpret intrinsics can be found on developer.arm.com.

List of SVE intrinsics ^

The list of SVE intrinsics can be found on developer.arm.com. This list includes SVE2 and other SVE extensions (such as the 32-bit matrix multiply extensions).

The tick boxes on the left can be used to narrow the displayed intrinsics to particular data types and element sizes. Below that is an instruction group hierarchy that groups the intrinsics based on the operation that they perform.

Mapping of SVE instructions to intrinsics ^

List of instructions ^

This section contains a list of all SVE instructions. For each one it gives a reference to the associated ACLE intrinsic or explains why no such intrinsic exists. The list includes SVE2 and other SVE extensions (such as the 32-bit matrix multiply extensions).

Instruction Intrinsic
ABS svabs
ADCLB svadclb
ADCLT svadclt
ADD (immediate) svadd
ADD (vectors, predicated) svadd
ADD (vectors, unpredicated) svadd
ADDHNB svaddhnb
ADDHNT svaddhnt
ADDP svaddp
ADDPL See ADDPL, ADDVL, INC and DEC
ADDVL See ADDPL, ADDVL, INC and DEC
ADR svadrb, svadrh, svadrw, svadrd
AESD svaesd
AESE svaese
AESIMC svaesimc
AESMC svaesmc
AND (immediate) svand
AND (predicates) svand
AND (vectors, predicated) svand
AND (vectors, unpredicated) svand
ANDS svand
ANDV svandv
ASR (immediate, predicated) svasr
ASR (immediate, unpredicated) svasr
ASR (vectors) svasr
ASR (wide elements, predicated) svasr_wide
ASR (wide elements, unpredicated) svasr_wide
ASRD svasrd
ASRR svasr (optimization of _x forms)
BCAX svbcax
BDEP svbdep
BEXT svbext
BFCVT svcvt
BFCVTNT svcvtnt
BFDOT svbfdot
BFMLALB svbfmlalb
BFMLALT svbfmlalt
BFMMLA svbfmmla
BGRP svbgrp
BIC (immediate) svbic
BIC (predicates) svbic
BIC (vectors, predicated) svbic
BIC (vectors, unpredicated) svbic
BICS svbic
BRKA svbrka
BRKAS svbrka
BRKB svbrkb
BRKBS svbrkb
BRKN svbrkn
BRKNS svbrkn
BRKPA svbrkpa
BRKPAS svbrkpa
BRKPB svbrkpb
BRKPBS svbrkpb
BSL svbsl
BSL1N svbsl1n
BSL2N svbsl2n
CADD svcadd
CDOT (indexed) svcdot_lane
CDOT (vectors) svcdot
CLASTA (SIMD & FP scalar) svclasta
CLASTA (scalar) svclasta
CLASTA (vectors) svclasta
CLASTB (SIMD & FP scalar) svclastb
CLASTB (scalar) svclastb
CLASTB (vectors) svclastb
CLS svcls
CLZ svclz
CMLA (indexed) svcmla_lane
CMLA (vectors) svcmla
CMPEQ (immediate) svcmpeq
CMPEQ (vectors) svcmpeq
CMPEQ (wide elements) svcmpeq_wide
CMPGE (immediate) svcmpge
CMPGE (vectors) svcmpge
CMPGE (wide elements) svcmpge_wide
CMPGT (immediate) svcmpgt
CMPGT (vectors) svcmpgt
CMPGT (wide elements) svcmpgt_wide
CMPHI (immediate) svcmpgt
CMPHI (vectors) svcmpgt
CMPHI (wide elements) svcmpgt_wide
CMPHS (immediate) svcmpge
CMPHS (vectors) svcmpge
CMPHS (wide elements) svcmpge_wide
CMPLE (immediate) svcmple
CMPLE (vectors) svcmple
CMPLE (wide elements) svcmple_wide
CMPLO (immediate) svcmplt
CMPLO (vectors) svcmplt
CMPLO (wide elements) svcmplt_wide
CMPLS (immediate) svcmple
CMPLS (vectors) svcmple
CMPLS (wide elements) svcmple_wide
CMPLT (immediate) svcmplt
CMPLT (vectors) svcmplt
CMPLT (wide elements) svcmplt_wide
CMPNE (immediate) svcmpne
CMPNE (vectors) svcmpne
CMPNE (wide elements) svcmpne_wide
CNOT svcnot
CNT svcnt
CNTB svcntb
CNTD svcntd
CNTH svcnth
CNTP svcntp
CNTW svcntw
COMPACT svcompact
CPY (SIMD & FP scalar) svdup (predicated forms)
CPY (immediate) svdup (predicated forms)
CPY (scalar) svdup (predicated forms)
CTERMEQ See CTERMEQ and CTERMNE
CTERMNE See CTERMEQ and CTERMNE
DECB See ADDPL, ADDVL, INC and DEC
DECD (scalar) See ADDPL, ADDVL, INC and DEC
DECD (vector) See ADDPL, ADDVL, INC and DEC
DECH (scalar) See ADDPL, ADDVL, INC and DEC
DECH (vector) See ADDPL, ADDVL, INC and DEC
DECP (scalar) See ADDPL, ADDVL, INC and DEC
DECP (vector) See ADDPL, ADDVL, INC and DEC
DECW (scalar) See ADDPL, ADDVL, INC and DEC
DECW (vector) See ADDPL, ADDVL, INC and DEC
DUP (immediate) svdup, svdupq
DUP (indexed) svdup_lane, svdupq_lane
DUP (scalar) svdup, svdupq
DUPM svdup, svdupq
EON sveor
EOR (immediate) sveor
EOR (predicates) sveor
EOR (vectors, predicated) sveor
EOR (vectors, unpredicated) sveor
EOR3 sveor3
EORBT sveorbt
EORS sveor
EORTB sveortb
EORV sveorv
EXT svext
FABD svabd
FABS svabs
FACGE svacge
FACGT svacgt
FACLE svacle
FACLT svaclt
FADD (immediate) svadd
FADD (vectors, predicated) svadd
FADD (vectors, unpredicated) svadd
FADDA svadda
FADDP svaddp
FADDV svaddv
FCADD svcadd
FCMEQ (vectors) svcmpeq
FCMEQ (zero) svcmpeq
FCMGE (vectors) svcmpge
FCMGE (zero) svcmpge
FCMGT (vectors) svcmpgt
FCMGT (zero) svcmpgt
FCMLA (indexed) svcmla
FCMLA (vectors) svcmla
FCMLE (vectors) svcmple
FCMLE (zero) svcmple
FCMLT (vectors) svcmplt
FCMLT (zero) svcmplt
FCMNE (vectors) svcmpne
FCMNE (zero) svcmpne
FCMUO svcmpuo
FCPY svdup (predicated forms)
FCVT svcvt
FCVTLT svcvtlt
FCVTNT svcvtnt
FCVTX svcvtx
FCVTXNT svcvtxnt
FCVTZS svcvt
FCVTZU svcvt
FDIV svdiv
FDIVR svdivr
FDUP svdup
FEXPA svexpa
FLOGB svlogb
FMAD svmad
FMAX (immediate) svmax
FMAX (vectors) svmax
FMAXNM (immediate) svmaxnm
FMAXNM (vectors) svmaxnm
FMAXNMP svmaxnmp
FMAXNMV svmaxnmv
FMAXP svmaxp
FMAXV svmaxv
FMIN (immediate) svmin
FMIN (vectors) svmin
FMINNM (immediate) svminnm
FMINNM (vectors) svminnm
FMINNMP svminnmp
FMINNMV svminnmv
FMINP svminp
FMINV svminv
FMLA (indexed) svmla
FMLA (vectors) svmla
FMLALB (indexed) svmlalb_lane
FMLALB (vectors) svmlalb
FMLALT (indexed) svmlalt_lane
FMLALT (vectors) svmlalt
FMLS (indexed) svmls_lane
FMLS (vectors) svmls
FMLSLB (indexed) svmlslb_lane
FMLSLB (vectors) svmlslb
FMLSLT (indexed) svmlslt_lane
FMLSLT (vectors) svmlslt
FMMLA svmmla
FMOV (immediate, predicated) svdup
FMOV (immediate, unpredicated) svdup
FMOV (zero, predicated) svdup
FMOV (zero, unpredicated) svdup
FMSB svmsb
FMUL (immediate) svmul
FMUL (indexed) svmul
FMUL (vectors, predicated) svmul
FMUL (vectors, unpredicated) svmul
FMULX svmulx
FNEG svneg
FNMAD svnmad
FNMLA svnmla
FNMLS svnmls
FNMSB svnmsb
FRECPE svrecpe
FRECPS svrecps
FRECPX svrecpx
FRINTA svrinta
FRINTI svrinti
FRINTM svrintm
FRINTN svrintn
FRINTP svrintp
FRINTX svrintx
FRINTZ svrintz
FRSQRTE svrsqrte
FRSQRTS svrsqrts
FSCALE svscale
FSQRT svsqrt
FSUB (immediate) svsub
FSUB (vectors, predicated) svsub
FSUB (vectors, unpredicated) svsub
FSUBR (immediate) svsubr
FSUBR (vectors) svsubr
FTMAD svtmad
FTSMUL svtsmul
FTSSEL svtssel
HISTCNT svhistcnt
HISTSEG svhistseg
INCB See ADDPL, ADDVL, INC and DEC
INCD (scalar) See ADDPL, ADDVL, INC and DEC
INCD (vector) See ADDPL, ADDVL, INC and DEC
INCH (scalar) See ADDPL, ADDVL, INC and DEC
INCH (vector) See ADDPL, ADDVL, INC and DEC
INCP (scalar) See ADDPL, ADDVL, INC and DEC
INCP (vector) See ADDPL, ADDVL, INC and DEC
INCW (scalar) See ADDPL, ADDVL, INC and DEC
INCW (vector) See ADDPL, ADDVL, INC and DEC
INDEX (immediate, scalar) svindex
INDEX (immediates) svindex
INDEX (scalar, immediate) svindex
INDEX (scalars) svindex
INSR (SIMD & FP scalar) svinsr
INSR (scalar) svinsr
LASTA (SIMD & FP scalar) svlasta
LASTA (scalar) svlasta
LASTB (SIMD & FP scalar) svlastb
LASTB (scalar) svlastb
LD1B (scalar plus immediate) svld1_vnum, svld1ub_vnum
LD1B (scalar plus scalar) svld1, svld1ub
LD1B (scalar plus vector) svld1ub_gather
LD1B (vector plus immediate) svld1ub_gather
LD1D (scalar plus immediate) svld1_vnum
LD1D (scalar plus scalar) svld1
LD1D (scalar plus vector) svld1_gather
LD1D (vector plus immediate) svld1_gather
LD1H (scalar plus immediate) svld1_vnum, svld1uh_vnum
LD1H (scalar plus scalar) svld1, svld1uh
LD1H (scalar plus vector) svld1uh_gather
LD1H (vector plus immediate) svld1uh_gather
LD1RB See SVE LD1R instructions
LD1RD See SVE LD1R instructions
LD1RH See SVE LD1R instructions
LD1ROB (scalar plus immediate) svld1ro
LD1ROB (scalar plus scalar) svld1ro
LD1ROD (scalar plus immediate) svld1ro
LD1ROD (scalar plus scalar) svld1ro
LD1ROH (scalar plus immediate) svld1ro
LD1ROH (scalar plus scalar) svld1ro
LD1ROW (scalar plus immediate) svld1ro
LD1ROW (scalar plus scalar) svld1ro
LD1RQB (scalar plus immediate) svld1rq, svdupq
LD1RQB (scalar plus scalar) svld1rq, svdupq
LD1RQD (scalar plus immediate) svld1rq, svdupq
LD1RQD (scalar plus scalar) svld1rq, svdupq
LD1RQH (scalar plus immediate) svld1rq, svdupq
LD1RQH (scalar plus scalar) svld1rq, svdupq
LD1RQW (scalar plus immediate) svld1rq, svdupq
LD1RQW (scalar plus scalar) svld1rq, svdupq
LD1RSB See SVE LD1R instructions
LD1RSH See SVE LD1R instructions
LD1RSW See SVE LD1R instructions
LD1RW See SVE LD1R instructions
LD1SB (scalar plus immediate) svld1sb_vnum
LD1SB (scalar plus scalar) svld1sb
LD1SB (scalar plus vector) svld1sb_gather
LD1SB (vector plus immediate) svld1sb_gather
LD1SH (scalar plus immediate) svld1sh_vnum
LD1SH (scalar plus scalar) svld1sh
LD1SH (scalar plus vector) svld1sh_gather
LD1SH (vector plus immediate) svld1sh_gather
LD1SW (scalar plus immediate) svld1sw_vnum
LD1SW (scalar plus scalar) svld1sw
LD1SW (scalar plus vector) svld1sw_gather
LD1SW (vector plus immediate) svld1sw_gather
LD1W (scalar plus immediate) svld1_vnum, svld1uw_vnum
LD1W (scalar plus scalar) svld1, svld1uw
LD1W (scalar plus vector) svld1_gather, svld1uw_gather
LD1W (vector plus immediate) svld1_gather, svld1uw_gather
LD2B (scalar plus immediate) svld2_vnum
LD2B (scalar plus scalar) svld2
LD2D (scalar plus immediate) svld2_vnum
LD2D (scalar plus scalar) svld2
LD2H (scalar plus immediate) svld2_vnum
LD2H (scalar plus scalar) svld2
LD2W (scalar plus immediate) svld2_vnum
LD2W (scalar plus scalar) svld2
LD3B (scalar plus immediate) svld3_vnum
LD3B (scalar plus scalar) svld3
LD3D (scalar plus immediate) svld3_vnum
LD3D (scalar plus scalar) svld3
LD3H (scalar plus immediate) svld3_vnum
LD3H (scalar plus scalar) svld3
LD3W (scalar plus immediate) svld3_vnum
LD3W (scalar plus scalar) svld3
LD4B (scalar plus immediate) svld4_vnum
LD4B (scalar plus scalar) svld4
LD4D (scalar plus immediate) svld4_vnum
LD4D (scalar plus scalar) svld4
LD4H (scalar plus immediate) svld4_vnum
LD4H (scalar plus scalar) svld4
LD4W (scalar plus immediate) svld4_vnum
LD4W (scalar plus scalar) svld4
LDFF1B (scalar plus scalar) svldff1, svldff1ub
LDFF1B (scalar plus vector) svldff1ub_gather
LDFF1B (vector plus immediate) svldff1ub_gather
LDFF1D (scalar plus scalar) svldff1
LDFF1D (scalar plus vector) svldff1_gather
LDFF1D (vector plus immediate) svldff1_gather
LDFF1H (scalar plus scalar) svldff1, svldff1uh
LDFF1H (scalar plus vector) svldff1uh_gather
LDFF1H (vector plus immediate) svldff1uh_gather
LDFF1SB (scalar plus scalar) svldff1sb
LDFF1SB (scalar plus vector) svldff1sb_gather
LDFF1SB (vector plus immediate) svldff1sb_gather
LDFF1SH (scalar plus scalar) svldff1sh
LDFF1SH (scalar plus vector) svldff1sh_gather
LDFF1SH (vector plus immediate) svldff1sh_gather
LDFF1SW (scalar plus scalar) svldff1sw
LDFF1SW (scalar plus vector) svldff1sw_gather
LDFF1SW (vector plus immediate) svldff1sw_gather
LDFF1W (scalar plus scalar) svldff1, svldff1uw
LDFF1W (scalar plus vector) svldff1_gather, svldff1uw_gather
LDFF1W (vector plus immediate) svldff1_gather, svldff1uw_gather
LDNF1B svldnf1, svldnf1ub
LDNF1D svldnf1
LDNF1H svldnf1, svldnf1uh
LDNF1SB svldnf1sb
LDNF1SH svldnf1sh
LDNF1SW svldnf1sw
LDNF1W svldnf1, svldnf1uw
LDNT1B (scalar plus immediate) svldnt1_vnum
LDNT1B (scalar plus scalar) svldnt1
LDNT1B (vector plus scalar) svldnt1, svldnt1ub
LDNT1D (scalar plus immediate) svldnt1_vnum
LDNT1D (scalar plus scalar) svldnt1
LDNT1D (vector plus scalar) svldnt1
LDNT1H (scalar plus immediate) svldnt1_vnum
LDNT1H (scalar plus scalar) svldnt1
LDNT1H (vector plus scalar) svldnt1, svldnt1uh
LDNT1SB svldnt1sb
LDNT1SH svldnt1sh
LDNT1SW svldnt1sw
LDNT1W (scalar plus immediate) svldnt1_vnum
LDNT1W (scalar plus scalar) svldnt1
LDNT1W (vector plus scalar) svldnt1, svldnt1uw
LDR (predicate) See SVE LDR and STR
LDR (vector) See SVE LDR and STR
LSL (immediate, predicated) svlsl
LSL (immediate, unpredicated) svlsl
LSL (vectors) svlsl
LSL (wide elements, predicated) svlsl_wide
LSL (wide elements, unpredicated) svlsl_wide
LSLR svlsl (optimization of _x forms)
LSR (immediate, predicated) svlsr
LSR (immediate, unpredicated) svlsr
LSR (vectors) svlsr
LSR (wide elements, predicated) svlsr_wide
LSR (wide elements, unpredicated) svlsr_wide
LSRR svlsr (optimization of _x forms)
MAD svmad
MATCH svmatch
MLA (indexed) svmla_lane
MLA (vectors) svmla
MLS (indexed) svmls_lane
MLS (vectors) svmls
MOV (SIMD & FP scalar, predicated) svdup
MOV (SIMD & FP scalar, unpredicated) svdup
MOV (bitmask immediate) svdup
MOV (immediate, predicated) svdup
MOV (immediate, unpredicated) svdup
MOV (predicate, predicated, merging) svsel
MOV (predicate, predicated, zeroing) svmov
MOV (predicate, unpredicated) See SVE MOV
MOV (scalar, predicated) svdup
MOV (scalar, unpredicated) svdup
MOV (vector, predicated) svsel
MOV (vector, unpredicated) See SVE MOV
MOVPRFX (predicated) See MOVPRFX
MOVPRFX (unpredicated) See MOVPRFX
MOVS (predicated) svmov
MOVS (unpredicated) See SVE MOV
MSB svmsb
MUL (immediate) svmul
MUL (indexed) svmul_lane
MUL (vectors, predicated) svmul
MUL (vectors, unpredicated) svmul
NAND svnand
NANDS svnand
NBSL svnbsl
NEG svneg
NMATCH svnmatch
NOR svnor
NORS svnor
NOT (predicate) svnot
NOT (vector) svnot
NOTS svnot
ORN (immediate) svorr
ORN (predicates) svorn
ORNS svorn
ORR (immediate) svorr
ORR (predicates) svorr
ORR (vectors, predicated) svorr
ORR (vectors, unpredicated) svorr
ORRS svorr
ORV svorv
PFALSE svpfalse
PFIRST svpfirst
PMUL svpmul
PMULLB svpmullb, svpmullb_pair
PMULLT svpmullt, svpmullt_pair
PNEXT svpnext
PRFB (scalar plus immediate) svprfb_vnum
PRFB (scalar plus scalar) svprfb
PRFB (scalar plus vector) svprfb_gather
PRFB (vector plus immediate) svprfb_gather
PRFD (scalar plus immediate) svprfd_vnum
PRFD (scalar plus scalar) svprfd
PRFD (scalar plus vector) svprfd_gather
PRFD (vector plus immediate) svprfd_gather
PRFH (scalar plus immediate) svprfh_vnum
PRFH (scalar plus scalar) svprfh
PRFH (scalar plus vector) svprfh_gather
PRFH (vector plus immediate) svprfh_gather
PRFW (scalar plus immediate) svprfw_vnum
PRFW (scalar plus scalar) svprfw
PRFW (scalar plus vector) svprfw_gather
PRFW (vector plus immediate) svprfw_gather
PTEST svptest
PTRUE svptrue
PTRUES svptrue
PUNPKHI svunpkhi
PUNPKLO svunpklo
RADDHNB svraddhnb
RADDHNT svraddhnt
RAX1 svrax1
RBIT svrbit
RDFFR (predicated) svrdffr
RDFFR (unpredicated) svrdffr
RDFFRS svrdffr
RDVL See RDVL
REV (predicate) svrev
REV (vector) svrev
REVB svrevb
REVH svrevh
REVW svrevw
RSHRNB svrshrnb
RSHRNT svrshrnt
RSUBHNB svrsubhnb
RSUBHNT svrsubhnt
SABA svaba
SABALB svabalb
SABALT svabalt
SABD svabd
SABDLB svabdlb
SABDLT svabdlt
SADALP svadalp
SADDLB svaddlb
SADDLBT svaddlbt
SADDLT svaddlt
SADDV svaddv
SADDWB svaddwb
SADDWT svaddwt
SBCLB svsbclb
SBCLT svsbclt
SCVTF svcvt
SDIV svdiv
SDIVR svdivr
SDOT (indexed) svdot_lane
SDOT (vectors) svdot
SEL (predicates) svsel
SEL (vectors) svsel
SETFFR svsetffr
SHADD svhadd
SHRNB svshrnb
SHRNT svshrnt
SHSUB svhsub
SHSUBR svhsubr
SLI svsli
SM4E svsm4e
SM4EKEY svsm4ekey
SMAX (immediate) svmax
SMAX (vectors) svmax
SMAXP svmaxp
SMAXV svmaxv
SMIN (immediate) svmin
SMIN (vectors) svmin
SMINP svminp
SMINV svminv
SMLALB (indexed) svmlalb_lane
SMLALB (vectors) svmlalb
SMLALT (indexed) svmlalt_lane
SMLALT (vectors) svmlalt
SMLSLB (indexed) svmlslb_lane
SMLSLB (vectors) svmlslb
SMLSLT (indexed) svmlslt_lane
SMLSLT (vectors) svmlslt
SMMLA svmmla
SMULH (predicated) svmulh
SMULH (unpredicated) svmulh
SMULLB (indexed) svmullb_lane
SMULLB (vectors) svmullb
SMULLT (indexed) svmullt_lane
SMULLT (vectors) svmullt
SPLICE svsplice
SQABS svqabs
SQADD (immediate) svqadd
SQADD (vectors, predicated) svqadd
SQADD (vectors, unpredicated) svqadd
SQCADD svqcadd
SQDECB svqdecb
SQDECD (scalar) svqdecd
SQDECD (vector) svqdecd
SQDECH (scalar) svqdech
SQDECH (vector) svqdech
SQDECP (scalar) svqdecp
SQDECP (vector) svqdecp
SQDECW (scalar) svqdecw
SQDECW (vector) svqdecw
SQDMLALB (indexed) svqdmlalb_lane
SQDMLALB (vectors) svqdmlalb
SQDMLALBT svqdmlalbt
SQDMLALT (indexed) svqdmlalt_lane
SQDMLALT (vectors) svqdmlalt
SQDMLSLB (indexed) svqdmlslb_lane
SQDMLSLB (vectors) svqdmlslb
SQDMLSLBT svqdmlslbt
SQDMLSLT (indexed) svqdmlslt_lane
SQDMLSLT (vectors) svqdmlslt
SQDMULH (indexed) svqdmulh_lane
SQDMULH (vectors) svqdmulh
SQDMULLB (indexed) svqdmullb_lane
SQDMULLB (vectors) svqdmullb
SQDMULLT (indexed) svqdmullt_lane
SQDMULLT (vectors) svqdmullt
SQINCB svqincb
SQINCD (scalar) svqincd
SQINCD (vector) svqincd
SQINCH (scalar) svqinch
SQINCH (vector) svqinch
SQINCP (scalar) svqincp
SQINCP (vector) svqincp
SQINCW (scalar) svqincw
SQINCW (vector) svqincw
SQNEG svqneg
SQRDCMLAH (indexed) svqrdcmlah_lane
SQRDCMLAH (vectors) svqrdcmlah
SQRDMLAH (indexed) svqrdmlah_lane
SQRDMLAH (vectors) svqrdmlah
SQRDMLSH (indexed) svqrdmlsh_lane
SQRDMLSH (vectors) svqrdmlsh
SQRDMULH (indexed) svqrdmulh_lane
SQRDMULH (vectors) svqrdmulh
SQRSHL svqrshl
SQRSHLR svqrshl
SQRSHRNB svqrshrnb
SQRSHRNT svqrshrnt
SQRSHRUNB svqrshrunb
SQRSHRUNT svqrshrunt
SQSHL (immediate) svqshl
SQSHL (vectors) svqshl
SQSHLR svqshl
SQSHLU svqshlu
SQSHRNB svqshrnb
SQSHRNT svqshrnt
SQSHRUNB svqshrunb
SQSHRUNT svqshrunt
SQSUB (immediate) svqsub
SQSUB (vectors, predicated) svqsub
SQSUB (vectors, unpredicated) svqsub
SQSUBR svqsubr
SQXTNB svqxtnb
SQXTNT svqxtnt
SQXTUNB svqxtunb
SQXTUNT svqxtunt
SRHADD svrhadd
SRI svsri
SRSHL svrshl
SRSHLR svrshl
SRSHR svrshr
SRSRA svrsra
SSHLLB svshllb, svmovlb
SSHLLT svshllt, svmovlt
SSRA svsra
SSUBLB svsublb
SSUBLBT svsublbt
SSUBLT svsublt
SSUBLTB svsubltb
SSUBWB svsubwb
SSUBWT svsubwt
ST1B (scalar plus immediate) svst1_vnum, svst1b_vnum
ST1B (scalar plus scalar) svst1, svst1b
ST1B (scalar plus vector) svst1b_scatter
ST1B (vector plus immediate) svst1b_scatter
ST1D (scalar plus immediate) svst1_vnum
ST1D (scalar plus scalar) svst1
ST1D (scalar plus vector) svst1_scatter
ST1D (vector plus immediate) svst1_scatter
ST1H (scalar plus immediate) svst1_vnum, svst1h_vnum
ST1H (scalar plus scalar) svst1, svst1h
ST1H (scalar plus vector) svst1h_scatter
ST1H (vector plus immediate) svst1h_scatter
ST1W (scalar plus immediate) svst1_vnum, svst1w_vnum
ST1W (scalar plus scalar) svst1, svst1w
ST1W (scalar plus vector) svst1_scatter, svst1w_scatter
ST1W (vector plus immediate) svst1_scatter, svst1w_scatter
ST2B (scalar plus immediate) svst2_vnum
ST2B (scalar plus scalar) svst2
ST2D (scalar plus immediate) svst2_vnum
ST2D (scalar plus scalar) svst2
ST2H (scalar plus immediate) svst2_vnum
ST2H (scalar plus scalar) svst2
ST2W (scalar plus immediate) svst2_vnum
ST2W (scalar plus scalar) svst2
ST3B (scalar plus immediate) svst3_vnum
ST3B (scalar plus scalar) svst3
ST3D (scalar plus immediate) svst3_vnum
ST3D (scalar plus scalar) svst3
ST3H (scalar plus immediate) svst3_vnum
ST3H (scalar plus scalar) svst3
ST3W (scalar plus immediate) svst3_vnum
ST3W (scalar plus scalar) svst3
ST4B (scalar plus immediate) svst4_vnum
ST4B (scalar plus scalar) svst4
ST4D (scalar plus immediate) svst4_vnum
ST4D (scalar plus scalar) svst4
ST4H (scalar plus immediate) svst4_vnum
ST4H (scalar plus scalar) svst4
ST4W (scalar plus immediate) svst4_vnum
ST4W (scalar plus scalar) svst4
STNT1B (scalar plus immediate) svstnt1_vnum
STNT1B (scalar plus scalar) svstnt1
STNT1B (vector plus scalar) svstnt1_scatter, svstnt1b
STNT1D (scalar plus immediate) svstnt1_vnum
STNT1D (scalar plus scalar) svstnt1
STNT1D (vector plus scalar) svstnt1_scatter
STNT1H (scalar plus immediate) svstnt1_vnum
STNT1H (scalar plus scalar) svstnt1
STNT1H (vector plus scalar) svstnt1_scatter, svstnt1h
STNT1W (scalar plus immediate) svstnt1_vnum
STNT1W (scalar plus scalar) svstnt1
STNT1W (vector plus scalar) svstnt1_scatter, svstnt1w
STR (predicate) See SVE LDR and STR
STR (vector) See SVE LDR and STR
SUB (immediate) svsub
SUB (vectors, predicated) svsub
SUB (vectors, unpredicated) svsub
SUBHNB svsubhnb
SUBHNT svsubhnt
SUBR (immediate) svsubr
SUBR (vectors) svsubr
SUDOT svsudot_lane
SUNPKHI svunpkhi
SUNPKLO svunpklo
SUQADD svuqadd
SXTB svextb
SXTH svexth
SXTW svextw
TBL svtbl, svdup_lane
TBX svtbx
TRN1 (predicates) svtrn1
TRN1 (vectors) svtrn1
TRN2 (predicates) svtrn2
TRN2 (vectors) svtrn2
UABA svaba
UABALB svabalb
UABALT svabalt
UABD svabd
UABDLB svabdlb
UABDLT svabdlt
UADALP svadalp
UADDLB svaddlb
UADDLT svaddlt
UADDV svaddv
UADDWB svaddwb
UADDWT svaddwt
UCVTF svcvt
UDIV svdiv
UDIVR svdivr
UDOT (indexed) svdot_lane
UDOT (vectors) svdot
UHADD svhadd
UHSUB svhsub
UHSUBR svhsubr
UMAX (immediate) svmax
UMAX (vectors) svmax
UMAXP svmaxp
UMAXV svmaxv
UMIN (immediate) svmin
UMIN (vectors) svmin
UMINP svminp
UMINV svminv
UMLALB (indexed) svmlalb_lane
UMLALB (vectors) svmlalb
UMLALT (indexed) svmlalt_lane
UMLALT (vectors) svmlalt
UMLSLB (indexed) svmlslb_lane
UMLSLB (vectors) svmlslb
UMLSLT (indexed) svmlslt_lane
UMLSLT (vectors) svmlslt
UMMLA svmmla
UMULH (predicated) svmulh
UMULH (unpredicated) svmulh
UMULLB (indexed) svmullb_lane
UMULLB (vectors) svmullb
UMULLT (indexed) svmullt_lane
UMULLT (vectors) svmullt
UQADD (immediate) svqadd
UQADD (vectors, predicated) svqadd
UQADD (vectors, unpredicated) svqadd
UQDECB svqdecb
UQDECD (scalar) svqdecd
UQDECD (vector) svqdecd
UQDECH (scalar) svqdech
UQDECH (vector) svqdech
UQDECP (scalar) svqdecp
UQDECP (vector) svqdecp
UQDECW (scalar) svqdecw
UQDECW (vector) svqdecw
UQINCB svqincb
UQINCD (scalar) svqincd
UQINCD (vector) svqincd
UQINCH (scalar) svqinch
UQINCH (vector) svqinch
UQINCP (scalar) svqincp
UQINCP (vector) svqincp
UQINCW (scalar) svqincw
UQINCW (vector) svqincw
UQRSHL svqrshl
UQRSHLR svqrshl
UQRSHRNB svqrshrnb
UQRSHRNT svqrshrnt
UQSHL (immediate) svqshl
UQSHL (vectors) svqshl
UQSHLR svqshl
UQSHRNB svqshrnb
UQSHRNT svqshrnt
UQSUB (immediate) svqsub
UQSUB (vectors, predicated) svqsub
UQSUB (vectors, unpredicated) svqsub
UQSUBR svqsubr
UQXTNB svqxtnb
UQXTNT svqxtnt
URECPE svrecpe
URHADD svrhadd
URSHL svrshl
URSHLR svrshl
URSHR svrshr
URSQRTE svrsqrte
URSRA svrsra
USDOT (indexed) svusdot_lane
USDOT (vectors) svusdot, svsudot
USHLLB svshllb, svmovlb
USHLLT svshllt, svmovlt
USMMLA svusmmla
USQADD svsqadd
USRA svsra
USUBLB svsublb
USUBLT svsublt
USUBWB svsubwb
USUBWT svsubwt
UUNPKHI svunpkhi
UUNPKLO svunpklo
UXTB svextb
UXTH svexth
UXTW svextw
UZP1 (predicates) svuzp1
UZP1 (vectors) svuzp1
UZP2 (predicates) svuzp2
UZP2 (vectors) svuzp2
WHILEGE svwhilege
WHILEGT svwhilegt
WHILEHI svwhilegt
WHILEHS svwhilege
WHILELE svwhilele
WHILELO svwhilelt
WHILELS svwhilele
WHILELT svwhilelt
WHILERW svwhilerw
WHILEWR svwhilewr
WRFFR svwrffr
XAR svxar
ZIP1 (predicates) svzip1
ZIP1 (vectors) svzip1
ZIP2 (predicates) svzip2
ZIP2 (vectors) svzip2

ADDPL, ADDVL, INC and DEC ^

The architecture provides instructions for adding or subtracting an element count, where the count might come from a pattern (such as svcntb_pat) or from a predicate (svcntp). For example, INCH adds the number of halfwords in a pattern to a scalar register or to every element of a vector register.

At the C and C++ level it is usually simpler to write the addition out normally. Also, having intrinsics that map directly to architectural instructions like INCH could cause confusion for pointers, since the instructions would operate on the raw integer value of the pointer. For example, applying INCH to a pointer to halfwords would effectively convert the pointer to an integer, add the number of halfwords, and then convert the result back to a pointer. The pointer would then only advance by half a vector.

For these reasons there are no dedicated ACLE intrinsics for:

Instead the implementation should use these instructions to optimize things like:

  svbool_t p1;
  ...
  x += svcntp_b16(p1, p1);
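
Likewise, a pointer is best advanced with ordinary pointer arithmetic, which the implementation can optimize using instructions such as INCB or ADDVL (a minimal sketch):

  int16_t *ptr;
  ...
  ptr += svcnth();   // advance by one full vector of halfwords (VG×8 bytes)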

The same concerns do not apply to saturating scalar arithmetic, since using saturating arithmetic only makes sense for displacements rather than pointers. There is also no standard way of doing a saturating addition or subtraction of arbitrary integers. ACLE does therefore provide intrinsics for:

CTERMEQ and CTERMNE ^

CTERMEQ and CTERMNE provide a way of optimizing conditions such as:

  svbool_t p1, p2;
  uint64_t x, y;
  ...
  if (svptest_last(p1, p2) && x == y)
    ...stmts...

Here the implementation could use PTEST to test whether the last active element of p2 is active and follow it by CTERMNE to test whether x is not equal to y. The code should then execute stmts if the LT condition holds (that is, if the last element is active and the “termination condition” x != y does not hold).

There are no ACLE intrinsics that perform a combined predicate and scalar test since the separate tests should be more readable.

MOVPRFX ^

There are no separate ACLE intrinsics for MOVPRFX, but its use is implicit in many of the zeroing intrinsics. For example, svadd[_s32]_z uses a MOVPRFX before the ADD instruction.

The code generator can also use MOVPRFX to optimize calls to many _x and _m intrinsics. For example, svdiv[_s32]_x can use DIV to tie the result to the first operand, DIVR to tie the result to the second operand, or a combination of MOVPRFX and DIV to store the result to a separate register.

RDVL ^

There is no separate ACLE intrinsic for RDVL, but the same functionality is available through the svcntb, svcnth, svcntw, and svcntd intrinsics.

SVE LD1R instructions ^

There are no separate ACLE intrinsics for LD1RB, LD1RH, LD1RW, LD1RD, LD1RSB, LD1RSH, and LD1RSW. However, the code generator can use these instructions to optimize calls to svdup in which the scalar argument comes from memory.
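
For example (a sketch; the variable names are illustrative):

  const int32_t *ptr;
  ...
  svint32_t v = svdup_n_s32(*ptr);   // the code generator may use LD1RW here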

SVE LDR and STR ^

There are no separate ACLE intrinsics for SVE LDR and STR instructions, but the code generator can use these instructions to save and restore SVE registers, or to optimize calls to intrinsics like svld1 and svst1.

SVE MOV ^

There are no ACLE intrinsics for the unpredicated SVE MOV instructions. This is because the ACLE intrinsic calls do not imply a particular register allocation and so the code generator must decide for itself when move instructions are required.

SVE2 BFloat16 data-processing instructions ^

The specification for B16B16 is in Alpha state and may change or be extended in the future.

The instructions in this section are available when __ARM_FEATURE_SVE_B16B16 is non-zero.

BFADD, BFSUB ^

BFloat16 floating-point add and subtract (vectors).

  svbfloat16_t svadd[_bf16]_m (svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
  svbfloat16_t svadd[_bf16]_x (svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
  svbfloat16_t svadd[_bf16]_z (svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
  svbfloat16_t svadd[_n_bf16]_m (svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
  svbfloat16_t svadd[_n_bf16]_x (svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
  svbfloat16_t svadd[_n_bf16]_z (svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);

  svbfloat16_t svsub[_bf16]_m (svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
  svbfloat16_t svsub[_bf16]_x (svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
  svbfloat16_t svsub[_bf16]_z (svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
  svbfloat16_t svsub[_n_bf16]_m (svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
  svbfloat16_t svsub[_n_bf16]_x (svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
  svbfloat16_t svsub[_n_bf16]_z (svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);

BFCLAMP ^

BFloat16 Clamp to minimum/maximum vector.

  svbfloat16_t svclamp[_bf16](svbfloat16_t op, svbfloat16_t min, svbfloat16_t max);

BFMAX, BFMIN ^

BFloat16 floating-point maximum/minimum (predicated).

   svbfloat16_t svmax[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmax[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmax[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmax[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svmax[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svmax[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);

   svbfloat16_t svmin[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmin[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmin[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmin[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svmin[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svmin[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);

BFMAXNM, BFMINNM ^

BFloat16 floating-point maximum/minimum number (predicated).

   svbfloat16_t svmaxnm[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmaxnm[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmaxnm[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmaxnm[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svmaxnm[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svmaxnm[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);

   svbfloat16_t svminnm[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svminnm[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svminnm[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svminnm[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svminnm[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svminnm[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);

BFMLA, BFMLS ^

BFloat16 floating-point fused multiply-add and multiply-subtract (vectors).

   svbfloat16_t svmla[_bf16]_m(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                               svbfloat16_t zm);
   svbfloat16_t svmla[_bf16]_z(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                               svbfloat16_t zm);
   svbfloat16_t svmla[_bf16]_x(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                               svbfloat16_t zm);
   svbfloat16_t svmla[_n_bf16]_m(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                                 bfloat16_t zm);
   svbfloat16_t svmla[_n_bf16]_z(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                                 bfloat16_t zm);
   svbfloat16_t svmla[_n_bf16]_x(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                                 bfloat16_t zm);

   svbfloat16_t svmla_lane[_bf16](svbfloat16_t zda, svbfloat16_t zn,
                                  svbfloat16_t zm, uint64_t imm_idx);

   svbfloat16_t svmls[_bf16]_m(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                               svbfloat16_t zm);
   svbfloat16_t svmls[_bf16]_z(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                               svbfloat16_t zm);
   svbfloat16_t svmls[_bf16]_x(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                               svbfloat16_t zm);
   svbfloat16_t svmls[_n_bf16]_m(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                                 bfloat16_t zm);
   svbfloat16_t svmls[_n_bf16]_z(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                                 bfloat16_t zm);
   svbfloat16_t svmls[_n_bf16]_x(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
                                 bfloat16_t zm);

   svbfloat16_t svmls_lane[_bf16](svbfloat16_t zda, svbfloat16_t zn,
                                  svbfloat16_t zm, uint64_t imm_idx);

BFMUL ^

BFloat16 floating-point multiply vectors.

   svbfloat16_t svmul[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmul[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmul[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
   svbfloat16_t svmul[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svmul[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
   svbfloat16_t svmul[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);

   svbfloat16_t svmul_lane[_bf16](svbfloat16_t zn, svbfloat16_t zm,
                                  uint64_t imm_idx);

SVE2.1 instruction intrinsics ^

The specification for SVE2.1 is in Alpha state and may change or be extended in the future.

The functions in this section are defined by the header file <arm_sve.h> when __ARM_FEATURE_SVE2p1 is defined.

Some instructions overlap with the SME and SME2 architecture extensions and are additionally available in Streaming SVE mode when __ARM_FEATURE_SME or __ARM_FEATURE_SME2 is non-zero. For convenience, the intrinsics for these instructions are listed in the following section.

Multi-vector predicates ^

When __ARM_FEATURE_SVE2p1 is defined, <arm_sve.h> defines the tuple types svboolx2_t and svboolx4_t.

These are opaque tuple types that can be accessed using the SVE intrinsics svsetN, svgetN and svcreateN. svundef2 and svundef4 are also extended to work with svboolx2_t and svboolx4_t. For example:

    svbool_t svget2[_b](svboolx2_t tuple, uint64_t imm_index);
    svboolx2_t svset2[_b](svboolx2_t tuple, uint64_t imm_index, svbool_t x);
    svboolx2_t svcreate2[_b](svbool_t x, svbool_t y);
    svboolx2_t svundef2_b();
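
For example, the following sketch (which assumes __ARM_FEATURE_SVE2p1) builds and accesses a two-predicate tuple:

    svboolx2_t pair = svcreate2(svptrue_b8(), svpfalse_b());
    svbool_t first = svget2(pair, 0);        // the all-true predicate
    pair = svset2(pair, 1, svptrue_b16());   // replace the second predicate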

ADDQV, FADDQV ^

Unsigned/FP add reduction of quadword vector segments.

  // Variants are also available for:
  // _s8, _s16, _u16, _s32, _u32, _s64, _u64,
  // _f16, _f32, _f64
  uint8x16_t svaddqv[_u8](svbool_t pg, svuint8_t zn);

ANDQV, EORQV, ORQV ^

Reduction of quadword vector segments.

  // Variants are also available for:
  // _s8, _u16, _s16, _u32, _s32, _u64, _s64
  uint8x16_t svandqv[_u8](svbool_t pg, svuint8_t zn);
  uint8x16_t sveorqv[_u8](svbool_t pg, svuint8_t zn);
  uint8x16_t svorqv[_u8](svbool_t pg, svuint8_t zn);

DUPQ ^

Broadcast indexed element within each quadword vector segment.

   // Variants are also available for:
   // _s8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   svuint8_t svdup_laneq[_u8](svuint8_t zn, uint64_t imm_idx);

EXTQ ^

Extract vector segment from each pair of quadword segments.

   // Variants are also available for:
   // _s8, _s16, _u16, _s32, _u32, _s64, _u64
   // _mf8, _bf16, _f16, _f32, _f64
   svuint8_t svextq[_u8](svuint8_t zdn, svuint8_t zm, uint64_t imm);

LD1D, LD1W ^

Contiguous zero-extending load to quadword (single vector).

   // Variants are also available for:
   // _u32, _s32
   svfloat32_t svld1uwq[_f32](svbool_t, const float32_t *ptr);
   svfloat32_t svld1uwq_vnum[_f32](svbool_t, const float32_t *ptr, int64_t vnum);


   // Variants are also available for:
   // _u64, _s64
   svfloat64_t svld1udq[_f64](svbool_t, const float64_t *ptr);
   svfloat64_t svld1udq_vnum[_f64](svbool_t, const float64_t *ptr, int64_t vnum);

LD1Q ^

Gather Load Quadword.

   // Variants are also available for:
   // _u8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   svint8_t svld1q_gather[_u64base]_s8(svbool_t pg, svuint64_t zn);
   svint8_t svld1q_gather[_u64base]_offset_s8(svbool_t pg, svuint64_t zn, int64_t offset);
   svint8_t svld1q_gather_[u64]offset[_s8](svbool_t pg, const int8_t *base, svuint64_t offset);

   // Variants are also available for:
   // _u16, _u32, _s32, _u64, _s64
   // _bf16, _f16, _f32, _f64
   svint16_t svld1q_gather_[u64]index[_s16](svbool_t pg, const int16_t *base, svuint64_t index);
   svint8_t svld1q_gather[_u64base]_index_s8(svbool_t pg, svuint64_t zn, int64_t index);

LD2Q, LD3Q, LD4Q ^

Contiguous load two, three, or four quadword structures.

   // Variants are also available for:
   // _u8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   svint8x2_t svld2q[_s8](svbool_t pg, const int8_t *rn);
   svint8x2_t svld2q_vnum[_s8](svbool_t pg, const int8_t *rn, uint64_t vnum);
   svint8x3_t svld3q[_s8](svbool_t pg, const int8_t *rn);
   svint8x3_t svld3q_vnum[_s8](svbool_t pg, const int8_t *rn, uint64_t vnum);
   svint8x4_t svld4q[_s8](svbool_t pg, const int8_t *rn);
   svint8x4_t svld4q_vnum[_s8](svbool_t pg, const int8_t *rn, uint64_t vnum);

UMAXQV, SMAXQV, FMAXQV, UMINQV, SMINQV, FMINQV ^

Max/Min reduction of quadword vector segments.

   // Variants are also available for:
   // _s8, _u16, _s16, _u32, _s32, _u64, _s64
   // _f16, _f32, _f64
   uint8x16_t svmaxqv[_u8](svbool_t pg, svuint8_t zn);
   uint8x16_t svminqv[_u8](svbool_t pg, svuint8_t zn);

FMAXNMQV, FMINNMQV ^

Max/Min recursive reduction of quadword vector segments.

   // Variants are also available for  _f32, _f64
   float16x8_t svmaxnmqv[_f16](svbool_t pg, svfloat16_t zn);
   float16x8_t svminnmqv[_f16](svbool_t pg, svfloat16_t zn);

PMOV ^

   // Variants are available for:
   // _s8, _u16, _s16, _s32, _u32, _s64, _u64
   svbool_t svpmov_lane[_u8](svuint8_t zn, uint64_t imm);

   // Variants are available for:
   // _s8, _s16, _u16, _s32, _u32, _s64, _u64
   svbool_t svpmov[_u8](svuint8_t zn);

   // Variants are available for:
   // _s16, _s32, _u32, _s64, _u64
   svuint16_t svpmov_lane[_u16]_m(svuint16_t zd, svbool_t pn, uint64_t imm);

   // Variants are also available for:
   // _s8, _u16, _s16, _u32, _s32, _u64, _s64
   svuint8_t svpmov_u8_z(svbool_t pn);

ST1D, ST1W ^

Contiguous store of single vector operand, truncating from quadword.

   // Variants are also available for:
   // _u32, _s32
   void svst1wq[_f32](svbool_t, float32_t *ptr, svfloat32_t data);
   void svst1wq_vnum[_f32](svbool_t, float32_t *ptr, int64_t vnum, svfloat32_t data);
 

   // Variants are also available for:
   // _u64, _s64
   void svst1dq[_f64](svbool_t, float64_t *ptr, svfloat64_t data);
   void svst1dq_vnum[_f64](svbool_t, float64_t *ptr, int64_t vnum, svfloat64_t data);

ST1Q ^

Scatter store quadwords.

   // Variants are also available for:
   // _u8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   void svst1q_scatter[_u64base][_s8](svbool_t pg, svuint64_t zn, svint8_t data);
   void svst1q_scatter[_u64base]_offset[_s8](svbool_t pg, svuint64_t zn, int64_t offset, svint8_t data);
   void svst1q_scatter_[u64]offset[_s8](svbool_t pg, int8_t *base, svuint64_t offset, svint8_t data);

   // Variants are also available for:
   // _u16, _u32, _s32, _u64, _s64
   // _bf16, _f16, _f32, _f64
   void svst1q_scatter[_u64base]_index[_s8](svbool_t pg, svuint64_t zn, int64_t index, svint8_t data);
   void svst1q_scatter_[u64]index[_s16](svbool_t pg, int16_t *base, svuint64_t index, svint16_t data);

ST2Q, ST3Q, ST4Q ^

Contiguous store of two, three, or four quadword structures.

   // Variants are also available for:
   // _s8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   void svst2q[_u8](svbool_t pg, uint8_t *rn, svuint8x2_t zt);
   void svst2q_vnum[_u8](svbool_t pg, uint8_t *rn, int64_t vnum, svuint8x2_t zt);
   void svst3q[_u8](svbool_t pg, uint8_t *rn, svuint8x3_t zt);
   void svst3q_vnum[_u8](svbool_t pg, uint8_t *rn, int64_t vnum, svuint8x3_t zt);
   void svst4q[_u8](svbool_t pg, uint8_t *rn, svuint8x4_t zt);
   void svst4q_vnum[_u8](svbool_t pg, uint8_t *rn, int64_t vnum, svuint8x4_t zt);

TBLQ ^

Programmable table lookup within each quadword vector segment (zeroing).

   // Variants are also available for:
   // _u8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   svint8_t svtblq[_s8](svint8_t zn, svuint8_t zm);

TBXQ ^

Programmable table lookup within each quadword vector segment (merging).

   // Variants are also available for:
   // _u8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   svint8_t svtbxq[_s8](svint8_t fallback, svint8_t zn, svuint8_t zm);

UZPQ1, UZPQ2 ^

Concatenate elements within each pair of quadword vector segments.

   // Variants are also available for:
   // _s8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   svuint8_t svuzpq1[_u8](svuint8_t zn, svuint8_t zm);
   svuint8_t svuzpq2[_u8](svuint8_t zn, svuint8_t zm);

ZIPQ1, ZIPQ2 ^

Interleave elements from halves of each pair of quadword vector segments.

   // Variants are also available for:
   // _s8, _u16, _s16, _u32, _s32, _u64, _s64
   // _mf8, _bf16, _f16, _f32, _f64
   svuint8_t svzipq1[_u8](svuint8_t zn, svuint8_t zm);
   svuint8_t svzipq2[_u8](svuint8_t zn, svuint8_t zm);

SVE2 maximum and minimum absolute value ^

The intrinsics in this section are defined by the header file <arm_sve.h> when either __ARM_FEATURE_SVE2 or __ARM_FEATURE_SME2 is defined to 1, and __ARM_FEATURE_FAMINMAX is defined to 1.

FAMAX ^

Floating-point absolute maximum (predicated).

  // Variants are also available for: _f32 and _f64
  svfloat16_t svamax[_f16]_m(svbool_t pg, svfloat16_t zn, svfloat16_t zm);
  svfloat16_t svamax[_f16]_x(svbool_t pg, svfloat16_t zn, svfloat16_t zm);
  svfloat16_t svamax[_f16]_z(svbool_t pg, svfloat16_t zn, svfloat16_t zm);

  // Variants are also available for: _f32 and _f64
  svfloat16_t svamax[_n_f16]_m(svbool_t pg, svfloat16_t zn, float16_t zm);
  svfloat16_t svamax[_n_f16]_x(svbool_t pg, svfloat16_t zn, float16_t zm);
  svfloat16_t svamax[_n_f16]_z(svbool_t pg, svfloat16_t zn, float16_t zm);

FAMIN ^

Floating-point absolute minimum (predicated).

  // Variants are also available for: _f32 and _f64
  svfloat16_t svamin[_f16]_m(svbool_t pg, svfloat16_t zn, svfloat16_t zm);
  svfloat16_t svamin[_f16]_x(svbool_t pg, svfloat16_t zn, svfloat16_t zm);
  svfloat16_t svamin[_f16]_z(svbool_t pg, svfloat16_t zn, svfloat16_t zm);

  // Variants are also available for: _f32 and _f64
  svfloat16_t svamin[_n_f16]_m(svbool_t pg, svfloat16_t zn, float16_t zm);
  svfloat16_t svamin[_n_f16]_x(svbool_t pg, svfloat16_t zn, float16_t zm);
  svfloat16_t svamin[_n_f16]_z(svbool_t pg, svfloat16_t zn, float16_t zm);

SVE2 lookup table ^

The intrinsics in this section are defined by the header file <arm_sve.h> when either __ARM_FEATURE_SVE2 or __ARM_FEATURE_SME2 is defined to 1, and __ARM_FEATURE_LUT is defined to 1.

LUTI2 ^

Lookup table read with 2-bit indices.

  // Variant is also available for: _u8
  svint8_t svluti2_lane[_s8](svint8_t table, svuint8_t indices, uint64_t imm_idx);

  // Variants are also available for: _u16, _f16 and _bf16
  svint16_t svluti2_lane[_s16](svint16_t table, svuint8_t indices, uint64_t imm_idx);

LUTI4 ^

Lookup table read with 4-bit indices.

  // Variant is also available for: _u8
  svint8_t svluti4_lane[_s8](svint8_t table, svuint8_t indices, uint64_t imm_idx);

  // Variants are also available for: _u16, _f16, _bf16
  svint16_t svluti4_lane[_s16](svint16_t table, svuint8_t indices, uint64_t imm_idx);
  svint16_t svluti4_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx);

SME language extensions and intrinsics ^

The specification for SME is in Beta state and may change or be extended in the future.

Controlling the use of streaming mode ^

Introduction to streaming and non-streaming mode ^

The AArch64 architecture defines a concept called “streaming mode”, controlled by a processor state bit called PSTATE.SM. At any given point in time, the processor is either in streaming mode (PSTATE.SM==1) or in non-streaming mode (PSTATE.SM==0). There is an instruction called SMSTART to enter streaming mode and an instruction called SMSTOP to return to non-streaming mode.

Streaming mode has three main effects on C and C++ code:

The C and C++ standards define the behavior of programs in terms of an “abstract machine”. As an extension, the ACLE specification applies the distinction between streaming mode and non-streaming mode to this abstract machine: at any given point in time, the abstract machine is either in streaming mode or in non-streaming mode.

This distinction between processor mode and abstract machine mode is mostly just a dry specification detail. However, the usual “as if” rule applies: the processor’s actual mode at runtime can be different from the abstract machine’s mode, provided that this does not alter the behavior of the program. One practical consequence of this is that C and C++ code does not specify the exact placement of SMSTART and SMSTOP instructions; the source code simply places limits on where such instructions go. For example, when stepping through a program in a debugger, the processor mode might sometimes be different from the one implied by the source code.

ACLE provides attributes that specify whether the abstract machine executes statements:

At present, the classification can only be controlled at function granularity: the statements in a function are all non-streaming, all streaming, or all streaming-compatible. When no attribute specifies otherwise, all statements in a function are non-streaming.

As noted above, this classification affects which intrinsics the statements can use. It also affects the length of SVE vectors and predicates.

A program is ill-formed if:

The current mode of the abstract machine can be queried using __arm_in_streaming_mode.

Changing streaming mode locally ^

Adding an __arm_locally_streaming keyword attribute to a function specifies that all the statements in the function are streaming statements. The program automatically puts the abstract machine into streaming mode before executing the statements and automatically restores the previous mode afterwards.

This choice is internal to the function definition. It is not visible to callers and so it can be changed without affecting the function’s binary interface. (In other words, it can be changed without requiring all callers to be recompiled.)

For example:

  int nonstreaming_fn(void) {
    return __arm_in_streaming_mode();  // Returns 0
  }

  __arm_locally_streaming int streaming_fn(void)
  { // Function automatically switches into streaming mode on entry
    svsetffr();  // Ill-formed, calls a non-streaming intrinsic
    return __arm_in_streaming_mode();  // Returns 1
  } // Function automatically switches out of streaming mode on return

  int (*ptr1)(void) = nonstreaming_fn; // OK
  int (*ptr2)(void) = streaming_fn;    // OK

This approach can be useful when implementing existing APIs, including when overriding virtual functions. It allows the use of SME to be an internal implementation detail.

The __arm_locally_streaming keyword attribute is an optional feature; it is only guaranteed to be present if the implementation predefines the __ARM_FEATURE_LOCALLY_STREAMING macro to a nonzero value.

Managing streaming mode across function boundaries ^

In addition to changing streaming mode locally, ACLE provides keyword attributes for managing streaming mode across function boundaries. This can be useful in the following example situations:

For this reason, the “streaming”, “non-streaming” and “streaming-compatible” classification extends to function types:

The function type classification decides which mode the abstract machine is in on entry to the function and which mode the abstract machine is in on return from the function. The program automatically switches mode as necessary before calling a function and restores the previous mode on return; see Changes in streaming mode for details.

If the function forms part of the object code’s ABI, the function type classification also determines whether the function has a “non-streaming interface”, a “streaming interface” or a “streaming-compatible interface”; see [AAPCS64] for details.

By default, the classification of a function type carries over to the classification of the statements in the function’s definition, if any. However, this can be overridden by the __arm_locally_streaming keyword attribute; see Changing streaming mode locally for details.

For example:

  // "n" stands for "non-streaming"
  // "s" stands for "streaming"
  // "sc" stands for "streaming-compatible"

  void n_callee(void);
  void s_callee(void) __arm_streaming;
  void sc_callee(void) __arm_streaming_compatible;

  void (*n_callback)(void);
  void (*s_callback)(void) __arm_streaming;
  void (*sc_callback)(void) __arm_streaming_compatible;

  int n_caller(void)
  {
    n_callee();        // No mode switch
    (*n_callback)();   // No mode switch
    s_callee();        // Temporarily switches to streaming mode
    (*s_callback)();   // Temporarily switches to streaming mode
    sc_callee();       // No mode switch
    (*sc_callback)();  // No mode switch

    return __arm_in_streaming_mode();  // Returns 0
  }

  int s_caller(void) __arm_streaming
  {
    n_callee();        // Temporarily switches to non-streaming mode
    (*n_callback)();   // Temporarily switches to non-streaming mode
    s_callee();        // No mode switch
    (*s_callback)();   // No mode switch
    sc_callee();       // No mode switch
    (*sc_callback)();  // No mode switch

    return __arm_in_streaming_mode();  // Returns 1
  }

  int sc_caller(void) __arm_streaming_compatible
  {
    n_callee();        // Temporarily switches to non-streaming mode
    (*n_callback)();   // Temporarily switches to non-streaming mode
    s_callee();        // Temporarily switches to streaming mode
    (*s_callback)();   // Temporarily switches to streaming mode
    sc_callee();       // No mode switch
    (*sc_callback)();  // No mode switch

    return __arm_in_streaming_mode();  // Might return 0 or 1
  }

A function type in one category is incompatible with a function type in another category, even if the types are otherwise identical. For example:

  // "n" stands for "non-streaming"
  // "s" stands for "streaming"
  // "sc" stands for "streaming-compatible"

  void n_callee(void);
  void s_callee(void) __arm_streaming;
  void sc_callee(void) __arm_streaming_compatible;

  void (*n_callback)(void);
  void (*s_callback)(void) __arm_streaming;
  void (*sc_callback)(void) __arm_streaming_compatible;

  void code() {
    n_callback = n_callee;    // OK
    n_callback = s_callee;    // Ill-formed
    n_callback = sc_callee;   // Ill-formed

    s_callback = n_callee;    // Ill-formed
    s_callback = s_callee;    // OK
    s_callback = sc_callee;   // Ill-formed

    sc_callback = n_callee;   // Ill-formed
    sc_callback = s_callee;   // Ill-formed
    sc_callback = sc_callee;  // OK
  }

A type that has both an __arm_streaming keyword attribute and an __arm_streaming_compatible keyword attribute is ill-formed.
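
For example, the following declaration (with an illustrative function name) is ill-formed:

  // Ill-formed: a function type cannot be both streaming and
  // streaming-compatible.
  void bad_fn(void) __arm_streaming __arm_streaming_compatible;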

Effect of streaming mode on VL ^

The value returned by an SVE intrinsic like svcntb depends on whether or not the abstract machine is in streaming mode: if the abstract machine is in streaming mode then the returned value depends on the “streaming vector length” (SVL), otherwise the returned value depends on the non-streaming vector length.

The runtime size of SVE vector and predicate types varies in the same way. For example, an svint8_t object created while the abstract machine is in streaming mode will have SVL bits, whereas an svint8_t object created while the abstract machine is in non-streaming mode will have the size specified by the non-streaming vector length.
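
As an illustrative sketch (function names are hypothetical, and the streaming and non-streaming vector lengths are assumed to differ), the value returned by svcntb() therefore depends on the mode of the function that calls it:

  uint64_t nonstreaming_vl_bytes(void) {
    return svcntb();  // Non-streaming vector length, in bytes
  }

  uint64_t streaming_vl_bytes(void) __arm_streaming {
    return svcntb();  // Streaming vector length (SVL), in bytes
  }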

The following definitions are useful when describing the consequences of this behavior on the vector length:

If any of the following events occurs during the execution of a program, then the behavior is undefined:

Streaming callers and streaming callees ^

If, during the execution of a program, a function F1 calls a function F2, then:

See Calling restrictions related to streaming mode for some examples.

Changes in streaming mode ^

Calls from a non-streaming caller to a streaming callee involve a temporary change into streaming mode. The abstract machine enters streaming mode for the duration of the call and returns to non-streaming mode afterwards.

Similarly, calls from a streaming caller to a non-streaming callee involve a temporary change out of streaming mode. The abstract machine leaves streaming mode for the duration of the call and reenters streaming mode afterwards.

In both cases, these changes of mode are automatic and it is the compiler’s responsibility to insert the necessary instructions. There are no intrinsics that map directly to SMSTART and SMSTOP.

Semantically, the sequencing of such calls is as follows:

  1. Evaluate the function arguments according to the normal C/C++ rules and create any temporary objects necessary for performing the call.

  2. Switch the abstract machine to the callee’s mode.

  3. Call the callee function.

  4. Switch the abstract machine back to the original mode.

  5. Process the value returned by the call (if any).

The intention is that the change in mode binds as closely to the call as possible.
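
As a small sketch (function names are illustrative), this sequencing means that argument expressions are evaluated in the caller's mode, before any switch into the callee's mode:

  void s_use(uint64_t n) __arm_streaming;

  void n_fn(void) {
    // svcntb() is evaluated in non-streaming mode, so it passes the
    // non-streaming vector length; the switch into streaming mode
    // happens only for the call itself.
    s_use(svcntb());
  }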

If any of the following events occurs during the execution of a program, then the behavior is undefined:

The following code gives some examples. In each case, the assumption is that the non-streaming vector length is different from the streaming vector length:

  // "n" stands for "non-streaming"
  // "s" stands for "streaming"
  // "sc" stands for "streaming-compatible"
  // "ls" stands for "locally streaming"

  void n_callee(svint8_t);
  void s_callee(svbool_t) __arm_streaming;
  void sc_callee(svint8_t) __arm_streaming_compatible;

  __arm_locally_streaming void ls_callee(svbool_t pg) {
    // Invokes undefined behavior if called.
  }

  void n_caller(void)
  {
    svint8_t i = ...;
    svbool_t b = ...;

    n_callee(i);   // OK: non-streaming callee
    s_callee(b);   // Undefined behavior: streaming callee
    sc_callee(i);  // OK: non-streaming callee
    ls_callee(b);  // call is OK: non-streaming callee. However,
                   //   as noted above, the callee invokes undefined
                   //   behavior internally
  }

  void s_caller(void) __arm_streaming
  {
    svint8_t i = ...;
    svbool_t b = ...;

    n_callee(i);   // Undefined behavior: non-streaming callee
    s_callee(b);   // OK: streaming callee
    sc_callee(i);  // OK: streaming callee
    ls_callee(b);  // Undefined behavior: non-streaming callee
  }

  void sc_caller(void) __arm_streaming_compatible
  {
    svint8_t i = ...;
    svbool_t b = ...;

    n_callee(i);   // Undefined behavior if sc_caller was called
                   //   in streaming mode (but not otherwise)
    s_callee(b);   // Undefined behavior if sc_caller was called
                   //   in non-streaming mode (but not otherwise)
    sc_callee(i);  // OK: keeps current streaming/non-streaming mode
    ls_callee(b);  // Undefined behavior if sc_caller was called
                   //   in streaming mode, but the call itself is
                   //   OK otherwise. As noted above, the callee
                   //   invokes undefined behavior internally
  }

If the streaming and non-streaming vector lengths are equal then there is no undefined behavior in the code above.

Some ACLE implementations might support the GNU “inline asm” extension. If so, the value of PSTATE.SM at the start of such inline asms is guaranteed to match the mode of the abstract machine. For example, PSTATE.SM is guaranteed to be 1 at the start of inline asms in streaming statements and 0 at the start of inline asms in non-streaming statements.

All inline asms must preserve the value of PSTATE.SM. Therefore:

The behavior in other cases is undefined.

An inline asm can temporarily switch mode internally, for example by using an SMSTART/SMSTOP pair. However, such mode switches invalidate all Z and P register state. The asm’s “clobber list” must mention any registers whose values might be changed by the asm.

ZA storage ^

SME provides an area of storage called ZA, of size SVL.B×SVL.B bytes. It also provides a processor state bit called PSTATE.ZA to control whether ZA is enabled.

In C and C++ code, access to ZA is controlled at function granularity: a function either uses ZA or it does not. Another way to say this is that a function either “has ZA state” or it does not.

If a function does have ZA state, the function can either share that ZA state with the function’s caller or create new ZA state “from scratch”. In the latter case, it is the compiler’s responsibility to free up ZA so that the function can use it; see the description of the lazy saving scheme in [AAPCS64] for details about how the compiler does this.

These possibilities give a three-way choice for how a function handles ZA:

  1. The function has no ZA state. This is the default.

  2. The function has ZA state that it shares with its caller. This is indicated by adding a state-sharing attribute to the function type, such as __arm_inout("za").

  3. The function has ZA state that it creates “from scratch” and that it does not share with its caller. This is indicated by adding __arm_new("za") to the function definition.

Functions that have ZA state can use the SME instruction intrinsics to manipulate that state.
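
The following sketch (with illustrative function names) shows the last two options together: a helper that shares ZA state with its caller, and a function that creates ZA state from scratch:

  // Shares ZA with its caller and accumulates into 32-bit tile 0.
  void accumulate_tile0(svbool_t pg, svfloat32_t zn, svfloat32_t zm)
    __arm_streaming __arm_inout("za")
  {
    svmopa_za32_f32_m(0, pg, pg, zn, zm);
  }

  // Creates fresh ZA state; the state is not visible to callers.
  __arm_new("za") void compute(void) __arm_streaming
  {
    svzero_za();
    // ... repeated calls to accumulate_tile0() ...
  }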

ZT0 Lookup Table ^

When ZA storage is enabled, SME2 additionally provides access to a 64-byte lookup table called ZT0, which can be accessed through specialized instructions. ZT0 is architecturally linked to ZA such that changing PSTATE.ZA enables or disables both ZA and ZT0 simultaneously.

This means that when the hardware supports SME2, a function that has ZA state also has ZT state.

SME keyword attributes ^

ACLE adds several SME-related keyword attributes. See Keyword attributes for general remarks about these keywords.

Except where noted otherwise, function types that have an attribute are incompatible with function types that do not. For example:

  // "n" stands for "non-streaming"
  // "s" stands for "streaming"

  typedef void (*n_callback_type)(void);
  n_callback_type n_callback_ptr;
  void n_extern_function(void);
  void n_local_function(void) { ... }

  typedef void (*s_callback_type)(void) __arm_streaming;
  s_callback_type s_callback_ptr;
  void s_extern_function(void) __arm_streaming;
  void s_local_function(void) __arm_streaming { ... }

  void foo() {
    n_callback_ptr = n_callback_ptr;     // OK
    n_callback_ptr = n_extern_function;  // OK
    n_callback_ptr = n_local_function;   // OK
    n_callback_ptr = s_callback_ptr;     // Ill-formed
    n_callback_ptr = s_extern_function;  // Ill-formed
    n_callback_ptr = s_local_function;   // Ill-formed

    s_callback_ptr = n_callback_ptr;     // Ill-formed
    s_callback_ptr = n_extern_function;  // Ill-formed
    s_callback_ptr = n_local_function;   // Ill-formed
    s_callback_ptr = s_callback_ptr;     // OK
    s_callback_ptr = s_extern_function;  // OK
    s_callback_ptr = s_local_function;   // OK
  }

The function type attributes cannot be used with K&R-style unprototyped function types. For example:

  #define ATTR __arm_streaming

  typedef int ft1() ATTR;      // Ill-formed in C17/C18 and earlier, OK in
                               //   later versions of C and in C++
  int f1() ATTR { ... }        // Likewise
  typedef int ft2(void) ATTR;  // OK
  int f2(void) ATTR { ... }    // OK

__arm_streaming ^

This keyword attribute applies to function types and specifies the following:

Using this attribute does not place any restriction on the function’s argument and return types. For example, an __arm_streaming function can take arguments of type int32x4_t even though that type is generally associated with non-streaming Advanced SIMD code.
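
As a minimal sketch (assuming <arm_neon.h> is included for int32x4_t, and with an illustrative function name), such a declaration is well-formed:

  // The Advanced SIMD argument and return types are allowed even though
  // the function type itself is a streaming function type.
  int32x4_t scale_streaming(int32x4_t x, int32_t factor) __arm_streaming;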

See Managing streaming mode across function boundaries for more information.

__arm_streaming_compatible ^

This keyword attribute applies to function types and specifies the following:

Using this attribute does not place any restriction on the function’s argument and return types. For example, an __arm_streaming_compatible function can take arguments of type int32x4_t even though that type is generally associated only with non-streaming Advanced SIMD code.

See Managing streaming mode across function boundaries for more information.

__arm_locally_streaming ^

This keyword attribute is only guaranteed to be supported by ACLE implementations that predefine the macro __ARM_FEATURE_LOCALLY_STREAMING to a nonzero value.

The attribute applies to function definitions and specifies that all statements in the function definition are streaming statements. The attribute is redundant (but still valid) for functions that have an __arm_streaming type.

See Changing streaming mode locally for more information.

SME types ^

Predicate-as-counter ^

SME2 adds a new kind of predicate, named predicate-as-counter, which is used for multi-vector predication. It describes a predicate mask that can span multiple predicate registers, with K true values followed by all false values, or K false values followed by all true values, for a given element type.

When __ARM_FEATURE_SME2 is defined, <arm_sme.h> defines a single sizeless predicate-as-counter type named svcount_t.

svcount_t and svbool_t are both used to represent predicate masks, but they cannot be used interchangeably.

The ACLE allows these types to be cast from one to another using the svcount_t svreinterpret_c(svbool_t) and svbool_t svreinterpret_b(svcount_t) intrinsics, although the reinterpreted values may not be sensible in the other format. To safely extract a sensible mask from an svcount_t, the svpext functions should be used.
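
As a small sketch (function names are illustrative), both directions of the conversion look like this:

  svcount_t as_counter(svbool_t pg) __arm_streaming {
    return svreinterpret_c(pg);  // Bits reused as-is; not a usable mask
  }

  svbool_t as_bool(svcount_t pn) __arm_streaming {
    return svreinterpret_b(pn);  // Likewise; prefer svpext for real masks
  }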

Multi-vector predicates ^

When __ARM_FEATURE_SME2 is defined, <arm_sme.h> defines the tuple types svboolx2_t and svboolx4_t.

These are opaque tuple types that can be accessed using the SVE intrinsics svsetN, svgetN and svcreateN. svundef2 and svundef4 are also extended to work with svboolx2_t and svboolx4_t. For example:

    svbool_t svget2[_b](svboolx2_t tuple, uint64_t imm_index);
    svboolx2_t svset2[_b](svboolx2_t tuple, uint64_t imm_index, svbool_t x);
    svboolx2_t svcreate2[_b](svbool_t x, svbool_t y);
    svboolx2_t svundef2_b();

SME functions and intrinsics ^

<arm_sme.h> declares various support functions and defines various intrinsics. The support functions have external linkage, like standard C functions such as memcpy do. However, as noted in Intrinsics, it is unspecified whether the intrinsics are functions and, if so, what linkage they have.

There are many more intrinsics than support functions, so everything described in this section is an intrinsic unless its individual description says otherwise.

If an intrinsic is implemented as a macro rather than a function, the macro must behave “as if” it had the prototype and attributes specified in this section.

SME PSTATE functions ^

Prototypes ^

  bool __arm_has_sme(void) __arm_streaming_compatible;

  bool __arm_in_streaming_mode(void) __arm_streaming_compatible;

  // Function with external linkage.
  void __arm_za_disable(void) __arm_streaming_compatible;

Semantics ^

__arm_has_sme()

This call returns true if the current thread “has access to SME”; see [AAPCS64] for a more detailed definition of this term.

One way of implementing this function is to call __arm_sme_state and then return the top bit of X0. See [AAPCS64] for more details about __arm_sme_state.
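
As a usage sketch (function names are illustrative), __arm_has_sme() can be used to select an SME-based implementation at run time:

  void compute_with_sme(void) __arm_streaming;
  void compute_fallback(void);

  void compute(void)
  {
    if (__arm_has_sme())
      compute_with_sme();   // The mode switch around the call is automatic
    else
      compute_fallback();
  }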

__arm_in_streaming_mode()

This call returns true if the abstract machine is in streaming mode. It always returns true when called from streaming statements and it always returns false when called from non-streaming statements. However, the call is not semantically a constant expression even in those cases.

__arm_za_disable()

This call commits any pending lazy save and turns ZA off; see [AAPCS64] for details.

Note: since this is a private-ZA function, it is valid (though pointless) to call it from a function that has ZA state. The compiler would simply reenable ZA after the call and reload the saved ZA state.

SME ZA state assertions ^

Prototypes ^

  void svundef_za() __arm_streaming_compatible;

Semantics ^

svundef_za()

This call asserts that ZA does not currently have any useful data. Semantically, it fills ZA with unpredictable data, although from a quality of implementation perspective, it should not generate any object code.

The call can be used as an optimization hint to the compiler, so that the compiler does not insert unnecessary code to save and restore the current ZA contents. The call might also be useful for static analysis.

SME instruction intrinsics ^

The intrinsics in this section have the following properties in common:

SME2 adds operations that work on groups of SVE vectors, ZA tile slices or ZA array vectors. The intrinsics model this in the following way:

    // Reads 2 consecutive horizontal tile slices from ZA into multi-vector.
    svint8x2_t svread_hor_za8_s8_vg2(uint64_t tile, uint32_t slice)
      __arm_streaming __arm_in("za");

    // SMLAL intrinsic for 2 quad-vector groups.
    void svmla_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8x2_t zm)
      __arm_streaming __arm_inout("za");

    void svmla[_single]_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn,
                                        svint8_t zm)
      __arm_streaming __arm_inout("za");

    void svmla_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn,
                                    svint8_t zm, uint64_t imm_idx)
      __arm_streaming __arm_inout("za");

LD1B, LD1H, LD1W, LD1D, LD1Q ^

  // Also for _za16, _za32, _za64 and _za128 (with the same prototype).
  void svld1_hor_za8(uint64_t tile, uint32_t slice, svbool_t pg,
                     const void *ptr)
    __arm_streaming __arm_inout("za");

  // Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the
  // address given by ptr.
  //
  // Also for _za16, _za32, _za64 and _za128 (with the same prototype).
  void svld1_hor_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg,
                          const void *ptr, int64_t vnum)
    __arm_streaming __arm_inout("za");

  // Also for _za16, _za32, _za64 and _za128 (with the same prototype).
  void svld1_ver_za8(uint64_t tile, uint32_t slice, svbool_t pg,
                     const void *ptr)
    __arm_streaming __arm_inout("za");

  // Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the
  // address given by ptr.
  //
  // Also for _za16, _za32, _za64 and _za128 (with the same prototype).
  void svld1_ver_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg,
                          const void *ptr, int64_t vnum)
    __arm_streaming __arm_inout("za");
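
As a usage sketch (the function name is illustrative), the vnum forms make it easy to fill a whole 8-bit tile, one horizontal slice per iteration:

  // Loads all svcntsb() slices of tile 0 from src, which is assumed to
  // hold at least svcntsb() * svcntsb() bytes.
  void load_tile0(const uint8_t *src) __arm_streaming __arm_inout("za")
  {
    svbool_t pg = svptrue_b8();
    uint64_t rows = svcntsb();
    for (uint32_t i = 0; i < rows; ++i)
      svld1_hor_vnum_za8(0, 0, pg, src, (int64_t)i);
  }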

LDR ^

  void svldr_za(uint32_t slice, const void *ptr)
    __arm_streaming_compatible __arm_inout("za");

  // Adds vnum to slice and vnum * svcntsb() to the address given by ptr.
  // This can be done in a single instruction if vnum is a constant in the
  // range [0, 15].  The intrinsic is synthetic for other vnum parameters.
  void svldr_vnum_za(uint32_t slice, const void *ptr, int64_t vnum)
     __arm_streaming_compatible __arm_inout("za");

ST1B, ST1H, ST1W, ST1D, ST1Q ^

  // Also for _za16, _za32, _za64 and _za128 (with the same prototype).
  void svst1_hor_za8(uint64_t tile, uint32_t slice, svbool_t pg,
                     void *ptr)
    __arm_streaming __arm_in("za");

  // Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the
  // address given by ptr.
  //
  // Also for _za16, _za32, _za64 and _za128 (with the same prototype).
  void svst1_hor_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg,
                          void *ptr, int64_t vnum)
    __arm_streaming __arm_in("za");

  // Also for _za16, _za32, _za64 and _za128 (with the same prototype).
  void svst1_ver_za8(uint64_t tile, uint32_t slice, svbool_t pg,
                     void *ptr)
    __arm_streaming __arm_in("za");

  // Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the
  // address given by ptr.
  //
  // Also for _za16, _za32, _za64 and _za128 (with the same prototype).
  void svst1_ver_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg,
                          void *ptr, int64_t vnum)
    __arm_streaming __arm_in("za");

STR ^

  void svstr_za(uint32_t slice, void *ptr)
    __arm_streaming_compatible __arm_in("za");

  // Adds vnum to slice and vnum * svcntsb() to the address given by ptr.
  // This can be done in a single instruction if vnum is a constant in the
  // range [0, 15].  The intrinsic is synthetic for other vnum parameters.
  void svstr_vnum_za(uint32_t slice, void *ptr, int64_t vnum)
     __arm_streaming_compatible __arm_in("za");
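
As a usage sketch (the function name is illustrative), the vnum form can spill every row of ZA to memory:

  // ZA has svcntsb() rows of svcntsb() bytes each, so buf is assumed to
  // hold at least svcntsb() * svcntsb() bytes.
  void save_za(void *buf) __arm_streaming_compatible __arm_in("za")
  {
    uint64_t rows = svcntsb();
    for (uint64_t i = 0; i < rows; ++i)
      svstr_vnum_za(0, buf, (int64_t)i);
  }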

MOVA ^

The intrinsics below read horizontal ZA slices. In each case, the type of the return value and the zd parameter vary with the type suffix. For example, in the _u8 intrinsic, the return value and the zd parameter both have type svuint8_t.

  // And similarly for u8, mf8.
  svint8_t svread_hor_za8[_s8]_m(svint8_t zd, svbool_t pg,
                                 uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");

  // And similarly for u16, bf16 and f16.
  svint16_t svread_hor_za16[_s16]_m(svint16_t zd, svbool_t pg,
                                    uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");

  // And similarly for u32 and f32.
  svint32_t svread_hor_za32[_s32]_m(svint32_t zd, svbool_t pg,
                                    uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");

  // And similarly for u64 and f64.
  svint64_t svread_hor_za64[_s64]_m(svint64_t zd, svbool_t pg,
                                    uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");

  // And similarly for s16, s32, s64, u8, u16, u32, u64, mf8, bf16, f16, f32, f64
  svint8_t svread_hor_za128[_s8]_m(svint8_t zd, svbool_t pg,
                                   uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");

Replacing _hor with _ver gives the associated vertical forms.

The intrinsics below write to horizontal ZA slices. In each case, the type of the zn parameter varies with the type suffix. For example, the zn parameter to the _u8 intrinsic has type svuint8_t.

  // And similarly for u8, mf8.
  void svwrite_hor_za8[_s8]_m(uint64_t tile, uint32_t slice, svbool_t pg,
                              svint8_t zn)
    __arm_streaming __arm_inout("za");

  // And similarly for u16, bf16 and f16.
  void svwrite_hor_za16[_s16]_m(uint64_t tile, uint32_t slice, svbool_t pg,
                                svint16_t zn)
    __arm_streaming __arm_inout("za");

  // And similarly for u32 and f32.
  void svwrite_hor_za32[_s32]_m(uint64_t tile, uint32_t slice, svbool_t pg,
                                svint32_t zn)
    __arm_streaming __arm_inout("za");

  // And similarly for u64 and f64.
  void svwrite_hor_za64[_s64]_m(uint64_t tile, uint32_t slice, svbool_t pg,
                                svint64_t zn)
    __arm_streaming __arm_inout("za");

  // And similarly for s16, s32, s64, u8, u16, u32, u64, mf8, bf16, f16, f32, f64
  void svwrite_hor_za128[_s8]_m(uint64_t tile, uint32_t slice, svbool_t pg,
                                svint8_t zn)
    __arm_streaming __arm_inout("za");

Replacing _hor with _ver gives the associated vertical forms.
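
As a combined usage sketch (the function name is illustrative), a horizontal slice can be read, modified and written back:

  // Negates the active elements of horizontal slice `slice` of 32-bit
  // tile 0.
  void negate_row(uint32_t slice, svbool_t pg)
    __arm_streaming __arm_inout("za")
  {
    svint32_t row = svread_hor_za32_s32_m(svdup_s32(0), pg, 0, slice);
    row = svneg_s32_x(pg, row);
    svwrite_hor_za32_s32_m(0, slice, pg, row);
  }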

ADDHA ^

  void svaddha_za32[_s32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svint32_t zn)
    __arm_streaming __arm_inout("za");

  void svaddha_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svuint32_t zn)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svaddha_za64[_s64]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svint64_t zn)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svaddha_za64[_u64]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svuint64_t zn)
    __arm_streaming __arm_inout("za");

ADDVA ^

  void svaddva_za32[_s32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svint32_t zn)
    __arm_streaming __arm_inout("za");

  void svaddva_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svuint32_t zn)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svaddva_za64[_s64]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svint64_t zn)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svaddva_za64[_u64]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svuint64_t zn)
    __arm_streaming __arm_inout("za");

BFMOPA, FMOPA (widening), SMOPA, UMOPA ^

  void svmopa_za32[_bf16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svbfloat16_t zn, svbfloat16_t zm)
    __arm_streaming __arm_inout("za");

  void svmopa_za32[_f16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                           svfloat16_t zn, svfloat16_t zm)
    __arm_streaming __arm_inout("za");

  void svmopa_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                          svint8_t zn, svint8_t zm)
    __arm_streaming __arm_inout("za");

  void svmopa_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                          svuint8_t zn, svuint8_t zm)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svmopa_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                           svint16_t zn, svint16_t zm)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svmopa_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                           svuint16_t zn, svuint16_t zm)
    __arm_streaming __arm_inout("za");

BFMOPA, FMOPA (non-widening) ^

  // Variants are also available for:
  //   _za16[_bf16]_m (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16]_m (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za64[_f64]_m (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmopa_za32[_f32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                           svfloat32_t zn, svfloat32_t zm)
    __arm_streaming __arm_inout("za");

BFMOPS, FMOPS (widening), SMOPS, UMOPS ^

  void svmops_za32[_bf16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svbfloat16_t zn, svbfloat16_t zm)
    __arm_streaming __arm_inout("za");

  void svmops_za32[_f16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                           svfloat16_t zn, svfloat16_t zm)
    __arm_streaming __arm_inout("za");

  void svmops_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                          svint8_t zn, svint8_t zm)
    __arm_streaming __arm_inout("za");

  void svmops_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                          svuint8_t zn, svuint8_t zm)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svmops_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                           svint16_t zn, svint16_t zm)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svmops_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                           svuint16_t zn, svuint16_t zm)
    __arm_streaming __arm_inout("za");

BFMOPS, FMOPS (non-widening) ^

  // Variants are also available for:
  //   _za16[_bf16]_m (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16]_m (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za64[_f64]_m (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmops_za32[_f32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                           svfloat32_t zn, svfloat32_t zm)
    __arm_streaming __arm_inout("za");

RDSVL ^

The following intrinsics read the length of a streaming vector:

  // Return the number of bytes in a streaming vector.
  // Equivalent to svcntb() when called in streaming mode.
  uint64_t svcntsb() __arm_streaming_compatible;

  // Return the number of halfwords in a streaming vector.
  // Equivalent to svcnth() when called in streaming mode.
  uint64_t svcntsh() __arm_streaming_compatible;

  // Return the number of words in a streaming vector.
  // Equivalent to svcntw() when called in streaming mode.
  uint64_t svcntsw() __arm_streaming_compatible;

  // Return the number of doublewords in a streaming vector.
  // Equivalent to svcntd() when called in streaming mode.
  uint64_t svcntsd() __arm_streaming_compatible;

svcntsb() is equivalent to an RDSVL instruction with an immediate operand of 1. There are no intrinsics that map directly to other immediate operands, or to the ADDSVL and ADDSPL instructions, but it is possible to write these operations using normal C arithmetic. For example:

  x0 = svcntsb();           // corresponds to RDSVL x0, #1
  x0 = svcntsb() * 5;       // can be optimized to RDSVL x0, #5
  x0 = x1 + svcntsb();      // can be optimized to ADDSVL x0, x1, #1
  x0 = x1 - svcntsb() * 3;  // can be optimized to ADDSVL x0, x1, #-3
  x0 = x1 + svcntsd();      // can be optimized to ADDSPL x0, x1, #1
  x0 = x1 - svcntsw();      // can be optimized to ADDSPL x0, x1, #-2

SUMOPA ^

  void svsumopa_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svint8_t zn, svuint8_t zm)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svsumopa_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                             svint16_t zn, svuint16_t zm)
    __arm_streaming __arm_inout("za");

SUMOPS ^

  void svsumops_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svint8_t zn, svuint8_t zm)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svsumops_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                             svint16_t zn, svuint16_t zm)
    __arm_streaming __arm_inout("za");

USMOPA ^

  void svusmopa_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svuint8_t zn, svint8_t zm)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svusmopa_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                             svuint16_t zn, svint16_t zm)
    __arm_streaming __arm_inout("za");

USMOPS ^

  void svusmops_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svuint8_t zn, svint8_t zm)
    __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_I16I64 != 0
  void svusmops_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                             svuint16_t zn, svint16_t zm)
    __arm_streaming __arm_inout("za");

ZERO ^

  void svzero_mask_za(uint64_t tile_mask)
    __arm_streaming_compatible __arm_inout("za");

  void svzero_za() __arm_streaming_compatible __arm_out("za");

SME2 instruction intrinsics ^

The intrinsics in this section are defined by the header file <arm_sme.h> when __ARM_FEATURE_SME2 is defined.

ADD, SUB (store into ZA, single) ^

Multi-vector add/sub, storing into ZA

The additional ‘_write’ suffix indicates that the operation is not accumulating; the result is written directly into ZA.

  // Variants are available for:
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svadd_write[_single]_za32[_s32]_vg1x2(uint32_t slice, svint32x2_t zn,
                                             svint32_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svadd_write[_single]_za32[_s32]_vg1x4(uint32_t slice, svint32x4_t zn,
                                             svint32_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svsub_write[_single]_za32[_u32]_vg1x2(uint32_t slice, svuint32x2_t zn,
                                             svuint32_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svsub_write[_single]_za32[_u32]_vg1x4(uint32_t slice, svuint32x4_t zn,
                                             svuint32_t zm)
    __arm_streaming __arm_inout("za");

ADD, SUB (store into ZA, multi) ^

Multi-vector add/sub, storing into ZA

The additional ‘_write’ suffix indicates that the operation is not accumulating; the result is written directly into ZA.

  // Variants are available for:
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svadd_write_za32[_s32]_vg1x2(uint32_t slice,
                                    svint32x2_t zn, svint32x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svadd_write_za32[_s32]_vg1x4(uint32_t slice,
                                    svint32x4_t zn, svint32x4_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svsub_write_za32[_u32]_vg1x2(uint32_t slice,
                                    svuint32x2_t zn, svuint32x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svsub_write_za32[_u32]_vg1x4(uint32_t slice,
                                    svuint32x4_t zn, svuint32x4_t zm)
    __arm_streaming __arm_inout("za");

ADD (vectors) ^

Multi-vector add

  // Variants are also available for _single_u8_x2, _single_s16_x2,
  // _single_u16_x2, _single_s32_x2, _single_u32_x2, _single_s64_x2 and
  // _single_u64_x2
  svint8x2_t svadd[_single_s8_x2](svint8x2_t zdn, svint8_t zm) __arm_streaming;


  // Variants are also available for _single_u8_x4, _single_s16_x4,
  // _single_u16_x4, _single_s32_x4, _single_u32_x4, _single_s64_x4 and
  // _single_u64_x4
  svint8x4_t svadd[_single_s8_x4](svint8x4_t zdn, svint8_t zm) __arm_streaming;

ADD, SUB, BFADD, BFSUB, FADD, FSUB (accumulate into ZA) ^

Multi-vector add/sub and accumulate into ZA

  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svadd_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svadd_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svsub_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za32[_s32]
  //   _za32[_u32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  //   _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svsub_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zm)
    __arm_streaming __arm_inout("za");

BFCVTN, FCVTN ^

Multi-vector floating-point convert from single-precision to interleaved half-precision/BFloat16

  // Variants are also available for _f16[_f32_x2]
  svbfloat16_t svcvtn_bf16[_f32_x2](svfloat32x2_t zn) __arm_streaming;

FCVTL ^

Multi-vector floating-point convert from half-precision to deinterleaved single-precision.

  // Only if __ARM_FEATURE_SME_F16F16 != 0
  svfloat32x2_t svcvtl_f32[_f16_x2](svfloat16_t zn) __arm_streaming;

FCVT, BFCVT, FCVTZS, FCVTZU, SCVTF, UCVTF ^

Multi-vector convert to/from floating-point.

  // Variants are also available for _f16[_f32_x2]
  svbfloat16_t svcvt_bf16[_f32_x2](svfloat32x2_t zn) __arm_streaming;


  // Variants are also available for _f32[_u32_x2], _s32[_f32_x2] and _u32[_f32_x2]
  svfloat32x2_t svcvt_f32[_s32_x2](svint32x2_t zn) __arm_streaming;


  // Variants are also available for _f32[_u32_x4], _s32[_f32_x4] and _u32[_f32_x4]
  svfloat32x4_t svcvt_f32[_s32_x4](svint32x4_t zn) __arm_streaming;

  // Only if __ARM_FEATURE_SME_F16F16 != 0
  svfloat32x2_t svcvt_f32[_f16_x2](svfloat16_t zn) __arm_streaming;

SQCVT, SQCVTU, UQCVT ^

Multi-vector saturating extract narrow

  // Variants are also available for _u16[_s32_x2] and _u16[_u32_x2]
  svint16_t svqcvt_s16[_s32_x2](svint32x2_t zn) __arm_streaming;


  // Variants are also available for _u8[_s32_x4], _u8[_u32_x4], _s16[_s64_x4],
  // _u16[_s64_x4] and _u16[_u64_x4]
  svint8_t svqcvt_s8[_s32_x4](svint32x4_t zn) __arm_streaming;

SQCVTN, SQCVTUN, UQCVTN ^

Multi-vector saturating extract narrow and interleave

  // Variants are also available for _u8[_s32_x4], _u8[_u32_x4], _s16[_s64_x4],
  // _u16[_s64_x4] and _u16[_u64_x4]
  svint8_t svqcvtn_s8[_s32_x4](svint32x4_t zn) __arm_streaming;

FDOT, BFDOT, SUDOT, USDOT, SDOT, UDOT (store into ZA, single) ^

Multi-vector dot-product (2-way and 4-way)

  // Variants are available for:
  //   _za32[_bf16]
  //   _za32[_f16]
  //   _za32[_s8]
  //   _za32[_s16]
  //   _za32[_u8]
  //   _za32[_u16]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svdot[_single]_za32[_bf16]_vg1x2(uint32_t slice, svbfloat16x2_t zn,
                                        svbfloat16_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_bf16]
  //   _za32[_f16]
  //   _za32[_s8]
  //   _za32[_s16]
  //   _za32[_u8]
  //   _za32[_u16]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svdot[_single]_za32[_bf16]_vg1x4(uint32_t slice,
                                        svbfloat16x4_t zn, svbfloat16_t zm)
    __arm_streaming __arm_inout("za");


  void svsudot[_single]_za32[_s8]_vg1x2(uint32_t slice, svint8x2_t zn,
                                        svuint8_t zm)
    __arm_streaming __arm_inout("za");


  void svsudot[_single]_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn,
                                        svuint8_t zm)
    __arm_streaming __arm_inout("za");


  void svusdot[_single]_za32[_u8]_vg1x2(uint32_t slice, svuint8x2_t zn,
                                        svint8_t zm)
    __arm_streaming __arm_inout("za");


  void svusdot[_single]_za32[_u8]_vg1x4(uint32_t slice, svuint8x4_t zn,
                                        svint8_t zm)
    __arm_streaming __arm_inout("za");

FDOT, BFDOT, SUDOT, USDOT, SDOT, UDOT (store into ZA, multi) ^

Multi-vector dot-product (2-way and 4-way)

  // Variants are available for:
  //   _za32[_bf16]
  //   _za32[_f16]
  //   _za32[_s8]
  //   _za32[_s16]
  //   _za32[_u8]
  //   _za32[_u16]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svdot_za32[_bf16]_vg1x2(uint32_t slice, svbfloat16x2_t zn,
                               svbfloat16x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_bf16]
  //   _za32[_f16]
  //   _za32[_s8]
  //   _za32[_s16]
  //   _za32[_u8]
  //   _za32[_u16]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svdot_za32[_bf16]_vg1x4(uint32_t slice, svbfloat16x4_t zn,
                               svbfloat16x4_t zm)
    __arm_streaming __arm_inout("za");


  void svsudot_za32[_s8]_vg1x2(uint32_t slice, svint8x2_t zn, svuint8x2_t zm)
    __arm_streaming __arm_inout("za");


  void svsudot_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn, svuint8x4_t zm)
    __arm_streaming __arm_inout("za");


  void svusdot_za32[_u8]_vg1x2(uint32_t slice, svuint8x2_t zn, svint8x2_t zm)
    __arm_streaming __arm_inout("za");


  void svusdot_za32[_u8]_vg1x4(uint32_t slice, svuint8x4_t zn, svint8x4_t zm)
    __arm_streaming __arm_inout("za");

FDOT, BFDOT, SUDOT, USDOT, SDOT, UDOT (store into ZA, indexed) ^

Multi-vector dot-product (2-way and 4-way)

  // Variants are available for:
  //   _za32[_bf16]
  //   _za32[_f16]
  //   _za32[_s8]
  //   _za32[_s16]
  //   _za32[_u8]
  //   _za32[_u16]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svdot_lane_za32[_bf16]_vg1x2(uint32_t slice, svbfloat16x2_t zn,
                                    svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_bf16]
  //   _za32[_f16]
  //   _za32[_s8]
  //   _za32[_s16]
  //   _za32[_u8]
  //   _za32[_u16]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svdot_lane_za32[_bf16]_vg1x4(uint32_t slice, svbfloat16x4_t zn,
                                    svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svsudot_lane_za32[_s8]_vg1x2(uint32_t slice, svint8x2_t zn, svuint8_t zm,
                                    uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svsudot_lane_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn, svuint8_t zm,
                                    uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svusdot_lane_za32[_u8]_vg1x2(uint32_t slice, svuint8x2_t zn, svint8_t zm,
                                    uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svusdot_lane_za32[_u8]_vg1x4(uint32_t slice, svuint8x4_t zn, svint8_t zm,
                                    uint64_t imm_idx)
    __arm_streaming __arm_inout("za");

FVDOT, BFVDOT, SUVDOT, USVDOT, SVDOT, UVDOT ^

Multi-vector vertical dot-product by indexed element.

  void svsuvdot_lane_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn,
                                     svuint8_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svusvdot_lane_za32[_u8]_vg1x4(uint32_t slice, svuint8x4_t zn,
                                     svint8_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svvdot_lane_za32[_bf16]_vg1x2(uint32_t slice, svbfloat16x2_t zn,
                                     svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svvdot_lane_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn, svint8_t zm,
                                   uint64_t imm_idx)
    __arm_streaming __arm_inout("za");

UMOPA, SMOPA, UMOPS, SMOPS ^

Integer sum of outer products and accumulate/subtract (2-way)

  // Variants are also available for _za32[_u16]
  void svmopa_za32[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint16_t zn,
                           svint16_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_u16]
  void svmops_za32[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint16_t zn,
                           svint16_t zm)
    __arm_streaming __arm_inout("za");

BMOPA, BMOPS ^

Bitwise exclusive NOR population count outer product and accumulate/subtract

  // Variants are also available for _za32[_s32]
  void svbmopa_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svuint32_t zn, svuint32_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_s32]
  void svbmops_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
                            svuint32_t zn, svuint32_t zm)
    __arm_streaming __arm_inout("za");

BFMLA, BFMLS, FMLA, FMLS (single) ^

Multi-vector floating-point fused multiply-add/subtract

  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmla[_single]_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn,
                                       svfloat32_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmla[_single]_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn,
                                       svfloat32_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmls[_single]_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn,
                                       svfloat32_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmls[_single]_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn,
                                       svfloat32_t zm)
    __arm_streaming __arm_inout("za");

BFMLA, BFMLS, FMLA, FMLS (multi) ^

Multi-vector floating-point fused multiply-add/subtract

  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmla_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn,
                              svfloat32x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmla_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn,
                              svfloat32x4_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmls_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn,
                              svfloat32x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmls_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn,
                              svfloat32x4_t zm)
    __arm_streaming __arm_inout("za");

BFMLA, BFMLS, FMLA, FMLS (indexed) ^

Multi-vector floating-point fused multiply-add/subtract

  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmla_lane_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn,
                                   svfloat32_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmla_lane_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn,
                                   svfloat32_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmls_lane_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn,
                                   svfloat32_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za16[_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
  //   _za16[_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
  //   _za32[_f32]
  //   _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
  void svmls_lane_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn,
                                   svfloat32_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");

FMLAL, BFMLAL, SMLAL, UMLAL (single) ^

Multi-vector multiply-add long (widening)

  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmla_za32[_bf16]_vg2x1(uint32_t slice, svbfloat16_t zn,
                               svbfloat16_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmla[_single]_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn,
                                        svbfloat16_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmla[_single]_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn,
                                        svbfloat16_t zm)
    __arm_streaming __arm_inout("za");

FMLAL, BFMLAL, SMLAL, UMLAL (multi) ^

Multi-vector multiply-add long (widening)

  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmla_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn,
                               svbfloat16x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmla_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn,
                               svbfloat16x4_t zm)
    __arm_streaming __arm_inout("za");

FMLAL, BFMLAL, SMLAL, UMLAL (indexed) ^

Multi-vector multiply-add long (widening)

  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmla_lane_za32[_bf16]_vg2x1(uint32_t slice, svbfloat16_t zn,
                                    svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmla_lane_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn,
                                    svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmla_lane_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn,
                                    svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");

BFMLSL, FMLSL, UMLSL, SMLSL (single) ^

Multi-vector multiply-subtract long (widening)

  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmls_za32[_bf16]_vg2x1(uint32_t slice, svbfloat16_t zn,
                               svbfloat16_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmls[_single]_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn,
                                        svbfloat16_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmls[_single]_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn,
                                        svbfloat16_t zm)
    __arm_streaming __arm_inout("za");

BFMLSL, FMLSL, UMLSL, SMLSL (multi) ^

Multi-vector multiply-subtract long (widening)

  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmls_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn,
                               svbfloat16x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmls_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn,
                               svbfloat16x4_t zm)
    __arm_streaming __arm_inout("za");

BFMLSL, FMLSL, UMLSL, SMLSL (indexed) ^

Multi-vector multiply-subtract long (widening)

  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmls_lane_za32[_bf16]_vg2x1(uint32_t slice, svbfloat16_t zn,
                                    svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmls_lane_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn,
                                    svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16]
  void svmls_lane_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn,
                                    svbfloat16_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");

UMLALL, SMLALL, USMLALL, SUMLALL (single) ^

Multi-vector multiply-add long long (widening)

  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmla_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svint8_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmla[_single]_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn,
                                      svint8_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmla[_single]_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn,
                                      svint8_t zm)
    __arm_streaming __arm_inout("za");


  void svsumla_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svuint8_t zm)
    __arm_streaming __arm_inout("za");


  void svsumla[_single]_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn,
                                        svuint8_t zm)
    __arm_streaming __arm_inout("za");


  void svsumla[_single]_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn,
                                        svuint8_t zm)
    __arm_streaming __arm_inout("za");


  void svusmla_za32[_u8]_vg4x1(uint32_t slice, svuint8_t zn, svint8_t zm)
    __arm_streaming __arm_inout("za");


  void svusmla[_single]_za32[_u8]_vg4x2(uint32_t slice, svuint8x2_t zn,
                                        svint8_t zm)
    __arm_streaming __arm_inout("za");


  void svusmla[_single]_za32[_u8]_vg4x4(uint32_t slice, svuint8x4_t zn,
                                        svint8_t zm)
    __arm_streaming __arm_inout("za");

UMLALL, SMLALL, USMLALL, SUMLALL (multi) ^

Multi-vector multiply-add long long (widening)

  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmla_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmla_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8x4_t zm)
    __arm_streaming __arm_inout("za");


  void svsumla_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svuint8x2_t zm)
    __arm_streaming __arm_inout("za");


  void svsumla_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svuint8x4_t zm)
    __arm_streaming __arm_inout("za");


  void svusmla_za32[_u8]_vg4x2(uint32_t slice, svuint8x2_t zn, svint8x2_t zm)
    __arm_streaming __arm_inout("za");


  void svusmla_za32[_u8]_vg4x4(uint32_t slice, svuint8x4_t zn, svint8x4_t zm)
    __arm_streaming __arm_inout("za");

UMLALL, SMLALL, USMLALL, SUMLALL (indexed) ^

Multi-vector multiply-add long long (widening)

  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmla_lane_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svint8_t zm,
                                  uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmla_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm,
                                  uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmla_lane_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8_t zm,
                                  uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svsumla_lane_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn,
                                    svuint8_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svsumla_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn,
                                    svuint8_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svsumla_lane_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn,
                                    svuint8_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svusmla_lane_za32[_u8]_vg4x1(uint32_t slice, svuint8_t zn,
                                    svint8_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svusmla_lane_za32[_u8]_vg4x2(uint32_t slice, svuint8x2_t zn,
                                    svint8_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  void svusmla_lane_za32[_u8]_vg4x4(uint32_t slice, svuint8x4_t zn,
                                    svint8_t zm, uint64_t imm_idx)
    __arm_streaming __arm_inout("za");

SMLSLL, UMLSLL (single) ^

Multi-vector multiply-subtract long long (widening)

  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmls_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svint8_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmls[_single]_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn,
                                      svint8_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmls[_single]_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn,
                                      svint8_t zm)
    __arm_streaming __arm_inout("za");

SMLSLL, UMLSLL (multi) ^

Multi-vector multiply-subtract long long (widening)

  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmls_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8x2_t zm)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmls_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8x4_t zm)
    __arm_streaming __arm_inout("za");

SMLSLL, UMLSLL (indexed) ^

Multi-vector multiply-subtract long long (widening)

  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmls_lane_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svint8_t zm,
                                  uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmls_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm,
                                  uint64_t imm_idx)
    __arm_streaming __arm_inout("za");


  // Variants are available for:
  //   _za32[_s8]
  //   _za32[_u8]
  //   _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  //   _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
  void svmls_lane_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8_t zm,
                                  uint64_t imm_idx)
    __arm_streaming __arm_inout("za");

SMAX, SMIN, UMAX, UMIN, BFMAX, BFMIN, FMAX, FMIN (single) ^

Multi-vector min/max

  // Variants are also available for:
  //   _single_s8_x2
  //   _single_u8_x2
  //   _single_bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_s16_x2
  //   _single_u16_x2
  //   _single_s32_x2
  //   _single_u32_x2
  //   _single_f32_x2
  //   _single_s64_x2
  //   _single_u64_x2
  //   _single_f64_x2
  svfloat16x2_t svmax[_single_f16_x2](svfloat16x2_t zdn, svfloat16_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _single_s8_x4
  //   _single_u8_x4
  //   _single_bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_s16_x4
  //   _single_u16_x4
  //   _single_s32_x4
  //   _single_u32_x4
  //   _single_f32_x4
  //   _single_s64_x4
  //   _single_u64_x4
  //   _single_f64_x4
  svfloat16x4_t svmax[_single_f16_x4](svfloat16x4_t zdn, svfloat16_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _single_s8_x2
  //   _single_u8_x2
  //   _single_bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_s16_x2
  //   _single_u16_x2
  //   _single_s32_x2
  //   _single_u32_x2
  //   _single_f32_x2
  //   _single_s64_x2
  //   _single_u64_x2
  //   _single_f64_x2
  svfloat16x2_t svmin[_single_f16_x2](svfloat16x2_t zdn, svfloat16_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _single_s8_x4
  //   _single_u8_x4
  //   _single_bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_s16_x4
  //   _single_u16_x4
  //   _single_s32_x4
  //   _single_u32_x4
  //   _single_f32_x4
  //   _single_s64_x4
  //   _single_u64_x4
  //   _single_f64_x4
  svfloat16x4_t svmin[_single_f16_x4](svfloat16x4_t zdn, svfloat16_t zm)
    __arm_streaming;

SMAX, SMIN, UMAX, UMIN, BFMAX, BFMIN, FMAX, FMIN (multi) ^

Multi-vector min/max

  // Variants are also available for:
  //   _s8_x2
  //   _u8_x2
  //   _bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _s16_x2
  //   _u16_x2
  //   _s32_x2
  //   _u32_x2
  //   _f32_x2
  //   _s64_x2
  //   _u64_x2
  //   _f64_x2
  svfloat16x2_t svmax[_f16_x2](svfloat16x2_t zdn, svfloat16x2_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _s8_x4
  //   _u8_x4
  //   _bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _s16_x4
  //   _u16_x4
  //   _s32_x4
  //   _u32_x4
  //   _f32_x4
  //   _s64_x4
  //   _u64_x4
  //   _f64_x4
  svfloat16x4_t svmax[_f16_x4](svfloat16x4_t zdn, svfloat16x4_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _s8_x2
  //   _u8_x2
  //   _bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _s16_x2
  //   _u16_x2
  //   _s32_x2
  //   _u32_x2
  //   _f32_x2
  //   _s64_x2
  //   _u64_x2
  //   _f64_x2
  svfloat16x2_t svmin[_f16_x2](svfloat16x2_t zdn, svfloat16x2_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _s8_x4
  //   _u8_x4
  //   _bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _s16_x4
  //   _u16_x4
  //   _s32_x4
  //   _u32_x4
  //   _f32_x4
  //   _s64_x4
  //   _u64_x4
  //   _f64_x4
  svfloat16x4_t svmin[_f16_x4](svfloat16x4_t zdn, svfloat16x4_t zm)
    __arm_streaming;

BFMAXNM, BFMINNM, FMAXNM, FMINNM (single) ^

Multi-vector floating point min/max number

  // Variants are also available for:
  //   _single_bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_f32_x2
  //   _single_f64_x2
  svfloat16x2_t svmaxnm[_single_f16_x2](svfloat16x2_t zdn, svfloat16_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _single_bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_f32_x4
  //   _single_f64_x4
  svfloat16x4_t svmaxnm[_single_f16_x4](svfloat16x4_t zdn, svfloat16_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _single_bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_f32_x2
  //   _single_f64_x2
  svfloat16x2_t svminnm[_single_f16_x2](svfloat16x2_t zdn, svfloat16_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _single_bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_f32_x4
  //   _single_f64_x4
  svfloat16x4_t svminnm[_single_f16_x4](svfloat16x4_t zdn, svfloat16_t zm)
    __arm_streaming;

BFMAXNM, BFMINNM, FMAXNM, FMINNM (multi) ^

Multi-vector floating point min/max number

  // Variants are also available for:
  //   _bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _f32_x2
  //   _f64_x2
  svfloat16x2_t svmaxnm[_f16_x2](svfloat16x2_t zdn, svfloat16x2_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _f32_x4
  //   _f64_x4
  svfloat16x4_t svmaxnm[_f16_x4](svfloat16x4_t zdn, svfloat16x4_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _f32_x2
  //   _f64_x2
  svfloat16x2_t svminnm[_f16_x2](svfloat16x2_t zdn, svfloat16x2_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _f32_x4
  //   _f64_x4
  svfloat16x4_t svminnm[_f16_x4](svfloat16x4_t zdn, svfloat16x4_t zm)
    __arm_streaming;

FRINTA, FRINTM, FRINTN, FRINTP ^

Multi-vector floating-point round to integral value

  svfloat32x2_t svrinta[_f32_x2](svfloat32x2_t zn) __arm_streaming;


  svfloat32x4_t svrinta[_f32_x4](svfloat32x4_t zn) __arm_streaming;


  svfloat32x2_t svrintm[_f32_x2](svfloat32x2_t zn) __arm_streaming;


  svfloat32x4_t svrintm[_f32_x4](svfloat32x4_t zn) __arm_streaming;


  svfloat32x2_t svrintn[_f32_x2](svfloat32x2_t zn) __arm_streaming;


  svfloat32x4_t svrintn[_f32_x4](svfloat32x4_t zn) __arm_streaming;


  svfloat32x2_t svrintp[_f32_x2](svfloat32x2_t zn) __arm_streaming;


  svfloat32x4_t svrintp[_f32_x4](svfloat32x4_t zn) __arm_streaming;

LDR, STR ^

Spill and fill of ZT0

  void svldr_zt(uint64_t zt, const void *rn)
    __arm_streaming_compatible __arm_inout("zt0");


  void svstr_zt(uint64_t zt, void *rn)
    __arm_streaming_compatible __arm_in("zt0");

ZERO ^

Zero ZT0

  void svzero_zt(uint64_t zt)
    __arm_streaming_compatible __arm_out("zt0");

LUTI2, LUTI4 ^

Lookup table read with 2-bit and 4-bit indexes

  // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16,
  // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32
  svint8_t svluti2_lane_zt_s8(uint64_t zt, svuint8_t zn, uint64_t imm_idx)
    __arm_streaming __arm_in("zt0");


  // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16,
  // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32
  svint8x2_t svluti2_lane_zt_s8_x2(uint64_t zt, svuint8_t zn,
                                   uint64_t imm_idx)
    __arm_streaming __arm_in("zt0");


  // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16,
  // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32
  svint8x4_t svluti2_lane_zt_s8_x4(uint64_t zt, svuint8_t zn,
                                   uint64_t imm_idx)
    __arm_streaming __arm_in("zt0");


  // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16,
  // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32
  svint8_t svluti4_lane_zt_s8(uint64_t zt, svuint8_t zn, uint64_t imm_idx)
    __arm_streaming __arm_in("zt0");


  // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16,
  // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32
  svint8x2_t svluti4_lane_zt_s8_x2(uint64_t zt, svuint8_t zn,
                                   uint64_t imm_idx)
    __arm_streaming __arm_in("zt0");


  // Variants are also available for _zt_u16, _zt_f16, _zt_bf16, _zt_s32,
  // _zt_u32 and _zt_f32
  svint16x4_t svluti4_lane_zt_s16_x4(uint64_t zt, svuint8_t zn,
                                     uint64_t imm_idx)
    __arm_streaming __arm_in("zt0");

MOVA ^

Move multi-vectors to/from ZA

  // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x2_t svread_hor_za8_s8_vg2(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");


  // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x4_t svread_hor_za8_s8_vg4(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");


  // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x2_t svread_ver_za8_s8_vg2(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");


  // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x4_t svread_ver_za8_s8_vg4(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_in("za");


  // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x2_t svread_za8_s8_vg1x2(uint32_t slice)
    __arm_streaming __arm_in("za");


  // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x4_t svread_za8_s8_vg1x4(uint32_t slice)
    __arm_streaming __arm_in("za");


  // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16],
  // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32],
  // _za64[_s64], _za64[_u64] and _za64[_f64]
  void svwrite_hor_za8[_s8]_vg2(uint64_t tile, uint32_t slice, svint8x2_t zn)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16],
  // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32],
  // _za64[_s64], _za64[_u64] and _za64[_f64]
  void svwrite_hor_za8[_s8]_vg4(uint64_t tile, uint32_t slice, svint8x4_t zn)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16],
  // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32],
  // _za64[_s64], _za64[_u64] and _za64[_f64]
  void svwrite_ver_za8[_s8]_vg2(uint64_t tile, uint32_t slice, svint8x2_t zn)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16],
  // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32],
  // _za64[_s64], _za64[_u64] and _za64[_f64]
  void svwrite_ver_za8[_s8]_vg4(uint64_t tile, uint32_t slice, svint8x4_t zn)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16],
  // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32],
  // _za64[_s64], _za64[_u64] and _za64[_f64]
  void svwrite_za8[_s8]_vg1x2(uint32_t slice, svint8x2_t zn)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16],
  // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32],
  // _za64[_s64], _za64[_u64] and _za64[_f64]
  void svwrite_za8[_s8]_vg1x4(uint32_t slice, svint8x4_t zn)
    __arm_streaming __arm_inout("za");

UCLAMP, SCLAMP, BFCLAMP, FCLAMP ^

Multi-vector clamp to minimum/maximum vector

  // Variants are also available for:
  //   _single_s8_x2
  //   _single_u8_x2
  //   _single_bf16_x2 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_s16_x2
  //   _single_u16_x2
  //   _single_s32_x2
  //   _single_u32_x2
  //   _single_f32_x2
  //   _single_s64_x2
  //   _single_u64_x2
  //   _single_f64_x2
  svfloat16x2_t svclamp[_single_f16_x2](svfloat16x2_t zd, svfloat16_t zn,
                                        svfloat16_t zm)
    __arm_streaming;


  // Variants are also available for:
  //   _single_s8_x4
  //   _single_u8_x4
  //   _single_bf16_x4 (only if __ARM_FEATURE_SVE_B16B16 != 0)
  //   _single_s16_x4
  //   _single_u16_x4
  //   _single_s32_x4
  //   _single_u32_x4
  //   _single_f32_x4
  //   _single_s64_x4
  //   _single_u64_x4
  //   _single_f64_x4
  svfloat16x4_t svclamp[_single_f16_x4](svfloat16x4_t zd, svfloat16_t zn,
                                        svfloat16_t zm)
    __arm_streaming;

SEL ^

Multi-vector conditionally select elements from two vectors

  // Variants are also available for _s8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
  // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
  svuint8x2_t svsel[_u8_x2](svcount_t png, svuint8x2_t zn, svuint8x2_t zm)
    __arm_streaming;


  // Variants are also available for _s8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
  // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
  svuint8x4_t svsel[_u8_x4](svcount_t png, svuint8x4_t zn, svuint8x4_t zm)
    __arm_streaming;

URSHL, SRSHL (single) ^

Multi-vector rounding shift left

  // Variants are also available for _single_u8_x2, _single_u16_x2,
  // _single_s16_x2, _single_u32_x2, _single_s32_x2, _single_u64_x2
  // and _single_s64_x2
  svint8x2_t svrshl[_single_s8_x2](svint8x2_t zdn, svint8_t zm) __arm_streaming;


  // Variants are also available for _single_u8_x4, _single_u16_x4,
  // _single_s16_x4, _single_u32_x4, _single_s32_x4, _single_u64_x4
  // and _single_s64_x4
  svint8x4_t svrshl[_single_s8_x4](svint8x4_t zdn, svint8_t zm) __arm_streaming;

URSHL, SRSHL (multi) ^

Multi-vector rounding shift left

  // Variants are also available for _u8_x2, _u16_x2, _s16_x2, _u32_x2, _s32_x2,
  // _u64_x2 and _s64_x2
  svint8x2_t svrshl[_s8_x2](svint8x2_t zdn, svint8x2_t zm) __arm_streaming;


  // Variants are also available for _u8_x4, _u16_x4, _s16_x4, _u32_x4, _s32_x4,
  // _u64_x4 and _s64_x4
  svint8x4_t svrshl[_s8_x4](svint8x4_t zdn, svint8x4_t zm) __arm_streaming;

SQRSHR, UQRSHR ^

Multi-vector saturating rounding shift right narrow

  // Variants are also available for _u8[_u32_x4]
  svint8_t svqrshr[_n]_s8[_s32_x4](svint32x4_t zn, uint64_t imm)
    __arm_streaming;


  // Variants are also available for _u16[_u32_x2]
  svint16_t svqrshr[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm)
    __arm_streaming;


  // Variants are also available for _u16[_u64_x4]
  svint16_t svqrshr[_n]_s16[_s64_x4](svint64x4_t zn, uint64_t imm)
    __arm_streaming;

SQRSHRN, UQRSHRN ^

Multi-vector saturating rounding shift right narrow and interleave

  // Variants are also available for _u8[_u32_x4]
  svint8_t svqrshrn[_n]_s8[_s32_x4](svint32x4_t zn, uint64_t imm)
    __arm_streaming;


  // Variants are also available for _u16[_u64_x4]
  svint16_t svqrshrn[_n]_s16[_s64_x4](svint64x4_t zn, uint64_t imm)
    __arm_streaming;

SQRSHRU ^

Multi-vector saturating rounding shift right unsigned narrow

  svuint8_t svqrshru[_n]_u8[_s32_x4](svint32x4_t zn, uint64_t imm)
    __arm_streaming;


  svuint16_t svqrshru[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm)
    __arm_streaming;


  svuint16_t svqrshru[_n]_u16[_s64_x4](svint64x4_t zn, uint64_t imm)
    __arm_streaming;

SQRSHRUN ^

Multi-vector saturating rounding shift right unsigned narrow and interleave

  // Variants are also available for _u16[_s64_x4]
  svuint8_t svqrshrun[_n]_u8[_s32_x4](svint32x4_t zn, uint64_t imm)
    __arm_streaming;

SQDMULH (single) ^

Multi-vector signed saturating doubling multiply high

  // Variants are also available for _single_s16_x2, _single_s32_x2
  // and _single_s64_x2
  svint8x2_t svqdmulh[_single_s8_x2](svint8x2_t zdn, svint8_t zm)
    __arm_streaming;


  // Variants are also available for _single_s16_x4, _single_s32_x4
  // and _single_s64_x4
  svint8x4_t svqdmulh[_single_s8_x4](svint8x4_t zdn, svint8_t zm)
    __arm_streaming;

SQDMULH (multi) ^

Multi-vector signed saturating doubling multiply high

  // Variants are also available for _s16_x2, _s32_x2 and _s64_x2
  svint8x2_t svqdmulh[_s8_x2](svint8x2_t zdn, svint8x2_t zm) __arm_streaming;


  // Variants are also available for _s16_x4, _s32_x4 and _s64_x4
  svint8x4_t svqdmulh[_s8_x4](svint8x4_t zdn, svint8x4_t zm) __arm_streaming;

SUNPK, UUNPK ^

Multi-vector pack/unpack

  // Variants are also available for _u16[_u8_x2], _u32[_u16_x2], _s32[_s16_x2],
  // _u64[_u32_x2] and _s64[_s32_x2]
  svint16x2_t svunpk_s16[_s8_x2](svint8_t zn) __arm_streaming;


  // Variants are also available for _u16[_u8_x4], _u32[_u16_x4], _s32[_s16_x4],
  // _u64[_u32_x4] and _s64[_s32_x4]
  svint16x4_t svunpk_s16[_s8_x4](svint8x2_t zn) __arm_streaming;

ZIP ^

Multi-vector zip.

  // Variants are also available for _u8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
  // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
  svint8x2_t svzip[_s8_x2](svint8x2_t zn) __arm_streaming;


  // Variants are also available for _u8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
  // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
  svint8x4_t svzip[_s8_x4](svint8x4_t zn) __arm_streaming;

The svzipq intrinsics operate on quad-words, but for convenience accept all element types.

  // Variants are also available for _u8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
  // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
  svint8x2_t svzipq[_s8_x2](svint8x2_t zn) __arm_streaming;


  // Variants are also available for _u8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
  // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
  svint8x4_t svzipq[_s8_x4](svint8x4_t zn) __arm_streaming;

UZP ^

Multi-vector unzip.

  // Variants are also available for _u8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
  // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
  svint8x2_t svuzp[_s8_x2](svint8x2_t zn) __arm_streaming;


  // Variants are also available for _u8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
  // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
  svint8x4_t svuzp[_s8_x4](svint8x4_t zn) __arm_streaming;

The svuzpq intrinsics operate on quad-words, but for convenience accept all element types.

  // Variants are also available for _u8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
  // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
  svint8x2_t svuzpq[_s8_x2](svint8x2_t zn) __arm_streaming;


  // Variants are also available for _u8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
  // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
  svint8x4_t svuzpq[_s8_x4](svint8x4_t zn) __arm_streaming;

SME2.1 instruction intrinsics ^

The specification for SME2.1 is in Alpha state and might change or be extended in the future.

The intrinsics in this section are defined by the header file <arm_sme.h> when __ARM_FEATURE_SME2p1 is defined.

MOVAZ (tile to vector, single) ^

Move and zero ZA tile slice to vector register.

  // And similarly for u8.
  svint8_t svreadz_hor_za8_s8(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for u16, bf16 and f16.
  svint16_t svreadz_hor_za16_s16(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for u32 and f32.
  svint32_t svreadz_hor_za32_s32(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for u64 and f64.
  svint64_t svreadz_hor_za64_s64(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for s16, s32, s64, u8, u16, u32, u64, bf16, f16, f32, f64
  svint8_t svreadz_hor_za128_s8(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for u8.
  svint8_t svreadz_ver_za8_s8(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for u16, bf16 and f16.
  svint16_t svreadz_ver_za16_s16(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for u32 and f32.
  svint32_t svreadz_ver_za32_s32(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for u64 and f64.
  svint64_t svreadz_ver_za64_s64(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

  // And similarly for s16, s32, s64, u8, u16, u32, u64, bf16, f16, f32, f64
  svint8_t svreadz_ver_za128_s8(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

MOVAZ (tile to vector, multiple) ^

Move and zero multiple ZA tile slices to vector registers

  // Variants are also available for _za8_u8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x2_t svreadz_hor_za8_s8_vg2(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8_u8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x4_t svreadz_hor_za8_s8_vg4(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8_u8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x2_t svreadz_ver_za8_s8_vg2(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8_u8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x4_t svreadz_ver_za8_s8_vg4(uint64_t tile, uint32_t slice)
    __arm_streaming __arm_inout("za");

MOVAZ (array to vector) ^

Move and zero multiple ZA single-vector groups to vector registers

  // Variants are also available for _za8_u8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x2_t svreadz_za8_s8_vg1x2(uint32_t slice)
    __arm_streaming __arm_inout("za");


  // Variants are also available for _za8_u8, _za16_s16, _za16_u16,
  // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
  // _za64_s64, _za64_u64 and _za64_f64
  svint8x4_t svreadz_za8_s8_vg1x4(uint32_t slice)
    __arm_streaming __arm_inout("za");

ZERO (vector groups) ^

Zero ZA vector groups

  void svzero_za64_vg1x2(uint32_t slice)
    __arm_streaming __arm_inout("za");

  void svzero_za64_vg1x4(uint32_t slice)
    __arm_streaming __arm_inout("za");

  void svzero_za64_vg2x1(uint32_t slice)
    __arm_streaming __arm_inout("za");

  void svzero_za64_vg2x2(uint32_t slice)
    __arm_streaming __arm_inout("za");

  void svzero_za64_vg2x4(uint32_t slice)
    __arm_streaming __arm_inout("za");

  void svzero_za64_vg4x1(uint32_t slice)
    __arm_streaming __arm_inout("za");

  void svzero_za64_vg4x2(uint32_t slice)
    __arm_streaming __arm_inout("za");

  void svzero_za64_vg4x4(uint32_t slice)
    __arm_streaming __arm_inout("za");

Streaming-compatible versions of standard routines ^

ACLE provides the following streaming-compatible functions, with the same behavior as the standard C functions that they are named after. All of the functions have external linkage.

  void *__arm_sc_memcpy(void *dest, const void *src, size_t n)
    __arm_streaming_compatible;

  void *__arm_sc_memmove(void *dest, const void *src, size_t n)
    __arm_streaming_compatible;

  void *__arm_sc_memset(void *s, int c, size_t n)
    __arm_streaming_compatible;

  void *__arm_sc_memchr(void *s, int c, size_t n)
    __arm_streaming_compatible;
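
For example, a streaming function cannot in general call the standard memcpy without leaving streaming mode; the streaming-compatible variant avoids this. The following sketch is illustrative only (the function name copy_in_streaming_mode is hypothetical, and the declarations are assumed to come from <arm_sme.h>):

  #include <arm_sme.h>
  #include <stddef.h>

  // Copy a buffer from a streaming function without leaving streaming mode.
  void copy_in_streaming_mode(void *dst, const void *src, size_t n)
      __arm_streaming {
    // Behaves like memcpy, but is safe to call while in streaming SVE mode.
    __arm_sc_memcpy(dst, src, n);
  }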

SVE2.1 and SME2 instruction intrinsics ^

The specification for SVE2.1 is in Alpha state and may change or be extended in the future.

The functions in this section are defined by the header file <arm_sve.h> when __ARM_FEATURE_SVE2p1 is defined, or by the header file <arm_sme.h> when __ARM_FEATURE_SME2 is defined.

These intrinsics can only be called from non-streaming code if __ARM_FEATURE_SVE2p1 is defined. They can only be called from streaming code if the appropriate SME feature macro is defined (see the previous paragraph). They can only be called from streaming-compatible code if they could be called from both non-streaming code and streaming code.

Most functions in this section are SME2 or SVE2.1. However, some are also available in SME. For convenience, the functions that are available in SME are tagged with [SME].
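
As an illustration of these rules, the following sketch (not part of the specification; the function names are hypothetical) guards a use of svclamp, declared below, so that it is only compiled where the corresponding feature macro allows it to be called:

  #if defined(__ARM_FEATURE_SVE2p1)
  #include <arm_sve.h>
  // Non-streaming code may use the intrinsic when SVE2.1 is available.
  svfloat16_t clamp_nonstreaming(svfloat16_t op, svfloat16_t lo, svfloat16_t hi) {
    return svclamp_f16(op, lo, hi);
  }
  #endif

  #if defined(__ARM_FEATURE_SME2)
  #include <arm_sme.h>
  // Streaming code may use the same intrinsic when SME2 is available.
  svfloat16_t clamp_streaming(svfloat16_t op, svfloat16_t lo, svfloat16_t hi)
      __arm_streaming {
    return svclamp_f16(op, lo, hi);
  }
  #endif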

UCLAMP, SCLAMP, FCLAMP ^

Clamp to minimum/maximum vector.

  // Variants are also available for:
  //   _s8, _u8, _s16, _u16, _s32, _u32 [SME]
  //   _f32, _s64, _u64 and _f64
  svfloat16_t svclamp[_f16](svfloat16_t op, svfloat16_t min, svfloat16_t max);

CNTP ^

Set scalar to count from predicate-as-counter. vl is expected to be 2 or 4.

  // Variants are also available for _c16, _c32 and _c64
  uint64_t svcntp_c8(svcount_t pnn, uint64_t vl);
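
For example, svcntp_c8 is typically used to advance the loop counter of a multi-vector loop that is driven by a predicate-as-counter. The sketch below is illustrative only (the function name copy_bytes is hypothetical) and assumes __ARM_FEATURE_SVE2p1 so that the intrinsics, including the svwhilelt, svld1 and svst1 forms declared later in this section, may be called from non-streaming code:

  #include <arm_sve.h>
  #include <stddef.h>
  #include <stdint.h>

  void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n) {
    int64_t i = 0;
    while (i < (int64_t)n) {
      // Predicate-as-counter covering a two-vector group of byte elements.
      svcount_t pn = svwhilelt_c8_s64(i, (int64_t)n, 2);
      svuint8x2_t data = svld1_u8_x2(pn, src + i);
      svst1_u8_x2(pn, dst + i, data);
      // Advance by the number of active elements in the two-vector group.
      i += (int64_t)svcntp_c8(pn, 2);
    }
  }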

UDOT, SDOT, FDOT (vectors) ^

Multi-vector dot-product (2-way)

  // Variants are also available for _s32_s16 and _u32_u16
  svfloat32_t svdot[_f32_f16](svfloat32_t zda, svfloat16_t zn,
                              svfloat16_t zm);
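
As a non-normative illustration, the 2-way form accumulates pairs of adjacent half-precision products into each single-precision lane. The sketch below (the function name dot_f16 is hypothetical) assumes __ARM_FEATURE_SVE2p1 and combines the intrinsic with base SVE loads and a final horizontal reduction:

  #include <arm_sve.h>
  #include <stdint.h>

  float dot_f16(const float16_t *a, const float16_t *b, int64_t n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    for (int64_t i = 0; i < n; i += (int64_t)svcnth()) {
      svbool_t pg = svwhilelt_b16_s64(i, n);
      svfloat16_t va = svld1_f16(pg, a + i);   // inactive lanes load as zero
      svfloat16_t vb = svld1_f16(pg, b + i);
      acc = svdot_f32_f16(acc, va, vb);        // widening 2-way dot product
    }
    return svaddv_f32(svptrue_b32(), acc);     // horizontal add of the accumulator
  }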

UDOT, SDOT, FDOT (indexed) ^

Multi-vector dot-product (2-way)

  // Variants are also available for _s32_s16 and _u32_u16
  svfloat32_t svdot_lane[_f32_f16](svfloat32_t zda, svfloat16_t zn,
                                   svfloat16_t zm, uint64_t imm_idx);

LD1B, LD1D, LD1H, LD1W ^

Contiguous load to multi-vector

  // Variants are also available for _s8, _mf8
  svuint8x2_t svld1[_u8]_x2(svcount_t png, const uint8_t *rn);


  // Variants are also available for _s8, _mf8
  svuint8x4_t svld1[_u8]_x4(svcount_t png, const uint8_t *rn);


  // Variants are also available for _s8, _mf8
  svuint8x2_t svld1_vnum[_u8]_x2(svcount_t png, const uint8_t *rn,
                                 int64_t vnum);


  // Variants are also available for _s8, _mf8
  svuint8x4_t svld1_vnum[_u8]_x4(svcount_t png, const uint8_t *rn,
                                 int64_t vnum);


  // Variants are also available for _s16, _f16 and _bf16
  svuint16x2_t svld1[_u16]_x2(svcount_t png, const uint16_t *rn);


  // Variants are also available for _s16, _f16 and _bf16
  svuint16x4_t svld1[_u16]_x4(svcount_t png, const uint16_t *rn);


  // Variants are also available for _s16, _f16 and _bf16
  svuint16x2_t svld1_vnum[_u16]_x2(svcount_t png, const uint16_t *rn,
                                   int64_t vnum);


  // Variants are also available for _s16, _f16 and _bf16
  svuint16x4_t svld1_vnum[_u16]_x4(svcount_t png, const uint16_t *rn,
                                   int64_t vnum);


  // Variants are also available for _s32 and _f32
  svuint32x2_t svld1[_u32]_x2(svcount_t png, const uint32_t *rn);


  // Variants are also available for _s32 and _f32
  svuint32x4_t svld1[_u32]_x4(svcount_t png, const uint32_t *rn);


  // Variants are also available for _s32 and _f32
  svuint32x2_t svld1_vnum[_u32]_x2(svcount_t png, const uint32_t *rn,
                                   int64_t vnum);


  // Variants are also available for _s32 and _f32
  svuint32x4_t svld1_vnum[_u32]_x4(svcount_t png, const uint32_t *rn,
                                   int64_t vnum);


  // Variants are also available for _s64 and _f64
  svuint64x2_t svld1[_u64]_x2(svcount_t png, const uint64_t *rn);


  // Variants are also available for _s64 and _f64
  svuint64x4_t svld1[_u64]_x4(svcount_t png, const uint64_t *rn);


  // Variants are also available for _s64 and _f64
  svuint64x2_t svld1_vnum[_u64]_x2(svcount_t png, const uint64_t *rn,
                                   int64_t vnum);


  // Variants are also available for _s64 and _f64
  svuint64x4_t svld1_vnum[_u64]_x4(svcount_t png, const uint64_t *rn,
                                   int64_t vnum);

LDNT1B, LDNT1D, LDNT1H, LDNT1W ^

Contiguous non-temporal load to multi-vector

  // Variants are also available for _s8, _mf8
  svuint8x2_t svldnt1[_u8]_x2(svcount_t png, const uint8_t *rn);


  // Variants are also available for _s8, _mf8
  svuint8x4_t svldnt1[_u8]_x4(svcount_t png, const uint8_t *rn);


  // Variants are also available for _s8, _mf8
  svuint8x2_t svldnt1_vnum[_u8]_x2(svcount_t png, const uint8_t *rn,
                                   int64_t vnum);


  // Variants are also available for _s8, _mf8
  svuint8x4_t svldnt1_vnum[_u8]_x4(svcount_t png, const uint8_t *rn,
                                   int64_t vnum);


  // Variants are also available for _s16, _f16 and _bf16
  svuint16x2_t svldnt1[_u16]_x2(svcount_t png, const uint16_t *rn);


  // Variants are also available for _s16, _f16 and _bf16
  svuint16x4_t svldnt1[_u16]_x4(svcount_t png, const uint16_t *rn);


  // Variants are also available for _s16, _f16 and _bf16
  svuint16x2_t svldnt1_vnum[_u16]_x2(svcount_t png, const uint16_t *rn,
                                     int64_t vnum);


  // Variants are also available for _s16, _f16 and _bf16
  svuint16x4_t svldnt1_vnum[_u16]_x4(svcount_t png, const uint16_t *rn,
                                     int64_t vnum);


  // Variants are also available for _s32 and _f32
  svuint32x2_t svldnt1[_u32]_x2(svcount_t png, const uint32_t *rn);


  // Variants are also available for _s32 and _f32
  svuint32x4_t svldnt1[_u32]_x4(svcount_t png, const uint32_t *rn);


  // Variants are also available for _s32 and _f32
  svuint32x2_t svldnt1_vnum[_u32]_x2(svcount_t png, const uint32_t *rn,
                                     int64_t vnum);


  // Variants are also available for _s32 and _f32
  svuint32x4_t svldnt1_vnum[_u32]_x4(svcount_t png, const uint32_t *rn,
                                     int64_t vnum);


  // Variants are also available for _s64 and _f64
  svuint64x2_t svldnt1[_u64]_x2(svcount_t png, const uint64_t *rn);


  // Variants are also available for _s64 and _f64
  svuint64x4_t svldnt1[_u64]_x4(svcount_t png, const uint64_t *rn);


  // Variants are also available for _s64 and _f64
  svuint64x2_t svldnt1_vnum[_u64]_x2(svcount_t png, const uint64_t *rn,
                                     int64_t vnum);


  // Variants are also available for _s64 and _f64
  svuint64x4_t svldnt1_vnum[_u64]_x4(svcount_t png, const uint64_t *rn,
                                     int64_t vnum);

BFMLSLB, BFMLSLT ^

BFloat16 floating-point multiply-subtract long from single-precision (top/bottom)

  svfloat32_t svbfmlslb[_f32](svfloat32_t zda, svbfloat16_t zn, svbfloat16_t zm);


  svfloat32_t svbfmlslb_lane[_f32](svfloat32_t zda, svbfloat16_t zn,
                                   svbfloat16_t zm, uint64_t imm_idx);


  svfloat32_t svbfmlslt[_f32](svfloat32_t zda, svbfloat16_t zn, svbfloat16_t zm);


  svfloat32_t svbfmlslt_lane[_f32](svfloat32_t zda, svbfloat16_t zn,
                                   svbfloat16_t zm, uint64_t imm_idx);

PEXT ^

Transform a predicate-as-counter to a predicate (pair).

  // Variants are also available for _c16, _c32 and _c64
  svbool_t svpext_lane_c8(svcount_t pnn, uint64_t imm);


  // Variants are also available for _c16, _c32 and _c64
  svboolx2_t svpext_lane_c8_x2(svcount_t pnn, uint64_t imm);
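
As an illustrative sketch (not normative; the function name add_one is hypothetical), svpext_lane can bridge a predicate-as-counter loop to base SVE intrinsics that expect an svbool_t governing predicate, assuming __ARM_FEATURE_SVE2p1 for non-streaming use:

  #include <arm_sve.h>
  #include <stdint.h>

  // Add 1 to every byte of data[0..n), processing two vectors per iteration.
  void add_one(uint8_t *data, int64_t n) {
    int64_t i = 0;
    while (i < n) {
      svcount_t pn = svwhilelt_c8_s64(i, n, 2);
      svbool_t pg0 = svpext_lane_c8(pn, 0);   // predicate for the first vector
      svbool_t pg1 = svpext_lane_c8(pn, 1);   // predicate for the second vector
      svuint8_t v0 = svld1_u8(pg0, data + i);
      svuint8_t v1 = svld1_u8(pg1, data + i + svcntb());
      svst1_u8(pg0, data + i, svadd_n_u8_x(pg0, v0, 1));
      svst1_u8(pg1, data + i + svcntb(), svadd_n_u8_x(pg1, v1, 1));
      i += (int64_t)svcntp_c8(pn, 2);
    }
  }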

PSEL ^

Predicate select between predicate value or all-false

  // Variants are also available for _b16, _b32 and _b64 [SME]
  svbool_t svpsel_lane_b8(svbool_t pn, svbool_t pm, uint32_t idx);


  // Variants are also available for _c16, _c32 and _c64
  svcount_t svpsel_lane_c8(svcount_t pn, svbool_t pm, uint32_t idx);

PTRUE, PFALSE ^

Initialise predicate-as-counter to all active or all inactive.

  // Variants are also available for _c16, _c32 and _c64
  svcount_t svptrue_c8(void);


  svcount_t svpfalse_c(void);

REVD ^

Reverse doublewords in elements.

  // All the intrinsics below are [SME]
  // Variants are available for:
  // _s8, _s16, _u16, _s32, _u32, _s64, _u64
  // _mf8, _bf16, _f16, _f32, _f64
  svuint8_t svrevd[_u8]_m(svuint8_t zd, svbool_t pg, svuint8_t zn);


  // Variants are available for:
  // _s8, _s16, _u16, _s32, _u32, _s64, _u64
  // _mf8, _bf16, _f16, _f32, _f64
  svuint8_t svrevd[_u8]_z(svbool_t pg, svuint8_t zn);


  // Variants are available for:
  // _s8, _s16, _u16, _s32, _u32, _s64, _u64
  // _mf8, _bf16, _f16, _f32, _f64
  svuint8_t svrevd[_u8]_x(svbool_t pg, svuint8_t zn);

SQCVTN, SQCVTUN, UQCVTN ^

Multi-vector saturating extract narrow and interleave

  // Variants are also available for _u16[_s32_x2] and _u16[_u32_x2]
  svint16_t svqcvtn_s16[_s32_x2](svint32x2_t zn);

SQRSHRN, UQRSHRN ^

Multi-vector saturating rounding shift right narrow and interleave

  // Variants are also available for _u16[_u32_x2]
  svint16_t svqrshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm);

SQRSHRUN ^

Multi-vector saturating rounding shift right unsigned narrow and interleave

  svuint16_t svqrshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm);

ST1B, ST1D, ST1H, ST1W ^

Contiguous store of multi-vector operand

  // Variants are also available for _s8_x2, _mf8_x2
  void svst1[_u8_x2](svcount_t png, uint8_t *rn, svuint8x2_t zt);


  // Variants are also available for _s8_x4, _mf8_x4
  void svst1[_u8_x4](svcount_t png, uint8_t *rn, svuint8x4_t zt);


  // Variants are also available for _s8_x2, _mf8_x2
  void svst1_vnum[_u8_x2](svcount_t png, uint8_t *rn, int64_t vnum,
                          svuint8x2_t zt);


  // Variants are also available for _s8_x4, _mf8_x4
  void svst1_vnum[_u8_x4](svcount_t png, uint8_t *rn, int64_t vnum,
                          svuint8x4_t zt);


  // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2
  void svst1[_u16_x2](svcount_t png, uint16_t *rn, svuint16x2_t zt);


  // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4
  void svst1[_u16_x4](svcount_t png, uint16_t *rn, svuint16x4_t zt);


  // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2
  void svst1_vnum[_u16_x2](svcount_t png, uint16_t *rn, int64_t vnum,
                           svuint16x2_t zt);


  // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4
  void svst1_vnum[_u16_x4](svcount_t png, uint16_t *rn, int64_t vnum,
                           svuint16x4_t zt);


  // Variants are also available for _s32_x2 and _f32_x2
  void svst1[_u32_x2](svcount_t png, uint32_t *rn, svuint32x2_t zt);


  // Variants are also available for _s32_x4 and _f32_x4
  void svst1[_u32_x4](svcount_t png, uint32_t *rn, svuint32x4_t zt);


  // Variants are also available for _s32_x2 and _f32_x2
  void svst1_vnum[_u32_x2](svcount_t png, uint32_t *rn, int64_t vnum,
                           svuint32x2_t zt);


  // Variants are also available for _s32_x4 and _f32_x4
  void svst1_vnum[_u32_x4](svcount_t png, uint32_t *rn, int64_t vnum,
                           svuint32x4_t zt);


  // Variants are also available for _s64_x2 and _f64_x2
  void svst1[_u64_x2](svcount_t png, uint64_t *rn, svuint64x2_t zt);


  // Variants are also available for _s64_x4 and _f64_x4
  void svst1[_u64_x4](svcount_t png, uint64_t *rn, svuint64x4_t zt);


  // Variants are also available for _s64_x2 and _f64_x2
  void svst1_vnum[_u64_x2](svcount_t png, uint64_t *rn, int64_t vnum,
                           svuint64x2_t zt);


  // Variants are also available for _s64_x4 and _f64_x4
  void svst1_vnum[_u64_x4](svcount_t png, uint64_t *rn, int64_t vnum,
                           svuint64x4_t zt);

STNT1B, STNT1D, STNT1H, STNT1W ^

Contiguous non-temporal store of multi-vector operand

  // Variants are also available for _s8_x2, _mf8_x2
  void svstnt1[_u8_x2](svcount_t png, uint8_t *rn, svuint8x2_t zt);


  // Variants are also available for _s8_x4, _mf8_x4
  void svstnt1[_u8_x4](svcount_t png, uint8_t *rn, svuint8x4_t zt);


  // Variants are also available for _s8_x2, _mf8_x2
  void svstnt1_vnum[_u8_x2](svcount_t png, uint8_t *rn, int64_t vnum,
                            svuint8x2_t zt);


  // Variants are also available for _s8_x4, _mf8_x4
  void svstnt1_vnum[_u8_x4](svcount_t png, uint8_t *rn, int64_t vnum,
                            svuint8x4_t zt);


  // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2
  void svstnt1[_u16_x2](svcount_t png, uint16_t *rn, svuint16x2_t zt);


  // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4
  void svstnt1[_u16_x4](svcount_t png, uint16_t *rn, svuint16x4_t zt);


  // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2
  void svstnt1_vnum[_u16_x2](svcount_t png, uint16_t *rn, int64_t vnum,
                             svuint16x2_t zt);


  // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4
  void svstnt1_vnum[_u16_x4](svcount_t png, uint16_t *rn, int64_t vnum,
                             svuint16x4_t zt);


  // Variants are also available for _s32_x2 and _f32_x2
  void svstnt1[_u32_x2](svcount_t png, uint32_t *rn, svuint32x2_t zt);


  // Variants are also available for _s32_x4 and _f32_x4
  void svstnt1[_u32_x4](svcount_t png, uint32_t *rn, svuint32x4_t zt);


  // Variants are also available for _s32_x2 and _f32_x2
  void svstnt1_vnum[_u32_x2](svcount_t png, uint32_t *rn, int64_t vnum,
                             svuint32x2_t zt);


  // Variants are also available for _s32_x4 and _f32_x4
  void svstnt1_vnum[_u32_x4](svcount_t png, uint32_t *rn, int64_t vnum,
                             svuint32x4_t zt);


  // Variants are also available for _s64_x2 and _f64_x2
  void svstnt1[_u64_x2](svcount_t png, uint64_t *rn, svuint64x2_t zt);


  // Variants are also available for _s64_x4 and _f64_x4
  void svstnt1[_u64_x4](svcount_t png, uint64_t *rn, svuint64x4_t zt);


  // Variants are also available for _s64_x2 and _f64_x2
  void svstnt1_vnum[_u64_x2](svcount_t png, uint64_t *rn, int64_t vnum,
                             svuint64x2_t zt);


  // Variants are also available for _s64_x4 and _f64_x4
  void svstnt1_vnum[_u64_x4](svcount_t png, uint64_t *rn, int64_t vnum,
                             svuint64x4_t zt);

WHILEGE, WHILEGT, WHILEHI, WHILEHS, WHILELE, WHILELO, WHILELS, WHILELT ^

While (resulting in predicate-as-counter). vl is expected to be 2 or 4.

  // Variants are also available for _c16[_s64], _c32[_s64], _c64[_s64],
  // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64]
  svcount_t svwhilege_c8[_s64](int64_t rn, int64_t rm, uint64_t vl);


  // Variants are also available for _c16[_s64], _c32[_s64], _c64[_s64],
  // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64]
  svcount_t svwhilegt_c8[_s64](int64_t rn, int64_t rm, uint64_t vl);


  // Variants are also available for _c16[_s64], _c32[_s64], _c64[_s64],
  // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64]
  svcount_t svwhilele_c8[_s64](int64_t rn, int64_t rm, uint64_t vl);


  // Variants are also available for _c16[_s64], _c32[_s64], _c64[_s64],
  // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64]
  svcount_t svwhilelt_c8[_s64](int64_t rn, int64_t rm, uint64_t vl);

While (resulting in predicate tuple)

  // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2,
  // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and
  // _b64[_u64]_x2
  svboolx2_t svwhilege_b8[_s64]_x2(int64_t rn, int64_t rm);


  // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2,
  // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and
  // _b64[_u64]_x2
  svboolx2_t svwhilegt_b8[_s64]_x2(int64_t rn, int64_t rm);


  // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2,
  // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and
  // _b64[_u64]_x2
  svboolx2_t svwhilele_b8[_s64]_x2(int64_t rn, int64_t rm);


  // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2,
  // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and
  // _b64[_u64]_x2
  svboolx2_t svwhilelt_b8[_s64]_x2(int64_t rn, int64_t rm);

SME2 maximum and minimum absolute value ^

The intrinsics in this section are defined by the header file <arm_sme.h> when __ARM_FEATURE_SME2 is defined to 1 and __ARM_FEATURE_FAMINMAX is defined to 1.

FAMAX ^

Absolute maximum.

  // Variants are also available for:
  //  [_f32_x2], [_f64_x2],
  //  [_f16_x4], [_f32_x4] and [_f64_x4]
  svfloat16x2_t svamax[_f16_x2](svfloat16x2_t zd, svfloat16x2_t zm) __arm_streaming;

FAMIN ^

Absolute minimum.

  // Variants are also available for:
  //  [_f32_x2], [_f64_x2],
  //  [_f16_x4], [_f32_x4] and [_f64_x4]
  svfloat16x2_t svamin[_f16_x2](svfloat16x2_t zd, svfloat16x2_t zm) __arm_streaming;

SME2 lookup table ^

The intrinsics in this section are defined by the header file <arm_sme.h> when __ARM_FEATURE_SME_LUTv2 is defined to 1.

MOVT ^

Move vector register to ZT0.

  // Variants are also available for:
  // [_s8], [_u16], [_s16], [_u32], [_s32], [_u64], [_s64]
  // [_bf16], [_f16], [_f32], [_f64]
  void svwrite_zt[_u8](uint64_t zt0, svuint8_t zt) __arm_streaming __arm_out("zt0");

  // Variants are also available for:
  // [_s8], [_u16], [_s16], [_u32], [_s32], [_u64], [_s64]
  // [_bf16], [_f16], [_f32], [_f64]
  void svwrite_lane_zt[_u8](uint64_t zt0, svuint8_t zt, uint64_t idx) __arm_streaming __arm_inout("zt0");

LUTI4 ^

Lookup table read with 4-bit indexes and 8-bit elements.

  // Variants are also available for: _u8
  svint8x4_t svluti4_zt_s8_x4(uint64_t zt0, svuint8x2_t zn) __arm_streaming __arm_in("zt0");

SVE2 and SME2 modal 8-bit floating-point intrinsics ^

The intrinsics in this section are defined by the header file <arm_sve.h> when __ARM_FEATURE_FP8 is defined, and __ARM_FEATURE_SVE2 or __ARM_FEATURE_SME2 is defined. Individual intrinsics may have additional target feature requirements.

BF1CVT, BF2CVT, F1CVT, F2CVT ^

8-bit floating-point convert to half-precision and BFloat16.

  // Variants are also available for: _bf16
  svfloat16_t svcvt1_f16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
  svfloat16_t svcvt2_f16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);

BF1CVTLT, BF2CVTLT, F1CVTLT, F2CVTLT ^

8-bit floating-point convert to half-precision and BFloat16 (top).

  // Variants are also available for: _bf16
  svfloat16_t svcvtlt1_f16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
  svfloat16_t svcvtlt2_f16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);

BFCVTN, FCVTN ^

Half-precision and BFloat16 convert, narrow, and interleave to 8-bit floating-point.

  // Variant is also available for: _bf16_x2
  svmfloat8_t svcvtn_mf8[_f16_x2]_fpm(svfloat16x2_t zn, fpm_t fpm);

FCVTNT, FCVTNB ^

Single-precision convert, narrow, and interleave to 8-bit floating-point (top and bottom).

  svmfloat8_t svcvtnt_mf8[_f32_x2]_fpm(svmfloat8_t zd, svfloat32x2_t zn, fpm_t fpm);
  svmfloat8_t svcvtnb_mf8[_f32_x2]_fpm(svmfloat8_t zd, svfloat32x2_t zn, fpm_t fpm);

FDOT (4-way, vectors) ^

8-bit floating-point dot product to single-precision.

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8DOT4) || __ARM_FEATURE_SSVE_FP8DOT4
  svfloat32_t svdot[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);

FDOT (4-way, indexed) ^

8-bit floating-point indexed dot product to single-precision.

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8DOT4) || __ARM_FEATURE_SSVE_FP8DOT4
  svfloat32_t svdot_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                       uint64_t imm0_3, fpm_t fpm);

FDOT (2-way, vectors, FP8 to FP16) ^

8-bit floating-point dot product to half-precision.

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8DOT2) || __ARM_FEATURE_SSVE_FP8DOT2
  svfloat16_t svdot[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);

FDOT (2-way, indexed, FP8 to FP16) ^

8-bit floating-point dot product to half-precision.

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8DOT2) || __ARM_FEATURE_SSVE_FP8DOT2
  svfloat16_t svdot_lane[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                       uint64_t imm0_7, fpm_t fpm);

FMLALB (vectors, FP8 to FP16) ^

8-bit floating-point multiply-add long to half-precision (bottom).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat16_t svmlalb[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
  svfloat16_t svmlalb[_n_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm);

FMLALB (indexed, FP8 to FP16) ^

8-bit floating-point multiply-add long to half-precision (bottom, indexed).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat16_t svmlalb_lane[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                         uint64_t imm0_15, fpm_t fpm);

FMLALLBB (vectors) ^

8-bit floating-point multiply-add long long to single-precision (bottom bottom).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat32_t svmlallbb[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
  svfloat32_t svmlallbb[_n_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm);

FMLALLBB (indexed) ^

8-bit floating-point multiply-add long long to single-precision (bottom bottom, indexed).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat32_t svmlallbb_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                           uint64_t imm0_15, fpm_t fpm);

FMLALLBT (vectors) ^

8-bit floating-point multiply-add long long to single-precision (bottom top).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat32_t svmlallbt[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
  svfloat32_t svmlallbt[_n_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm);

FMLALLBT (indexed) ^

8-bit floating-point multiply-add long long to single-precision (bottom top, indexed).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat32_t svmlallbt_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                           uint64_t imm0_15, fpm_t fpm);

FMLALLTB (vectors) ^

8-bit floating-point multiply-add long long to single-precision (top bottom).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat32_t svmlalltb[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
  svfloat32_t svmlalltb[_n_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm);

FMLALLTB (indexed) ^

8-bit floating-point multiply-add long long to single-precision (top bottom, indexed).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat32_t svmlalltb_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                           uint64_t imm0_15, fpm_t fpm);

FMLALLTT (vectors) ^

8-bit floating-point multiply-add long long to single-precision (top top).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat32_t svmlalltt[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
  svfloat32_t svmlalltt[_n_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm);

FMLALLTT (indexed) ^

8-bit floating-point multiply-add long long to single-precision (top top, indexed).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat32_t svmlalltt_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                           uint64_t imm0_15, fpm_t fpm);

FMLALT (vectors, FP8 to FP16) ^

8-bit floating-point multiply-add long to half-precision (top).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat16_t svmlalt[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
  svfloat16_t svmlalt[_n_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm);

FMLALT (indexed, FP8 to FP16) ^

8-bit floating-point multiply-add long to half-precision (top, indexed).

  // Only if (__ARM_FEATURE_SVE2  && __ARM_FEATURE_FP8FMA) || __ARM_FEATURE_SSVE_FP8FMA
  svfloat16_t svmlalt_lane[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                         uint64_t imm0_15, fpm_t fpm);

SME2 modal 8-bit floating-point intrinsics ^

The intrinsics in this section are defined by the header file <arm_sme.h> when __ARM_FEATURE_SME2 and __ARM_FEATURE_FP8 are defined. Individual intrinsics may have additional target feature requirements.

BF1CVT, BF2CVT, F1CVT, F2CVT ^

8-bit floating-point convert to half-precision or BFloat16.

  // Variant is also available for: _bf16[_mf8]_x2
  svfloat16x2_t svcvt1_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming;
  svfloat16x2_t svcvt2_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming;

BF1CVTL, BF2CVTL, F1CVTL, F2CVTL ^

8-bit floating-point convert to deinterleaved half-precision or BFloat16.

  // Variant is also available for: _bf16[_mf8]_x2
  svfloat16x2_t svcvtl1_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming;
  svfloat16x2_t svcvtl2_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming;

BFCVT, FCVT ^

Convert to packed 8-bit floating-point format.

  // Variants are also available for: _mf8[_bf16_x2] and _mf8[_f32_x4]
  svmfloat8_t svcvt_mf8[_f16_x2]_fpm(svfloat16x2_t zn, fpm_t fpm) __arm_streaming;

FCVTN ^

Convert to interleaved 8-bit floating-point format.

  svmfloat8_t svcvtn_mf8[_f32_x4]_fpm(svfloat32x4_t zn, fpm_t fpm) __arm_streaming;

FSCALE ^

  // Variants are also available for:
  // [_single_f32_x2], [_single_f64_x2],
  // [_single_f16_x4], [_single_f32_x4] and [_single_f64_x4]
  svfloat16x2_t svscale[_single_f16_x2](svfloat16x2_t zd, svint16_t zm) __arm_streaming;

  // Variants are also available for:
  //  [_f32_x2], [_f64_x2],
  //  [_f16_x4], [_f32_x4] and [_f64_x4]
  svfloat16x2_t svscale[_f16_x2](svfloat16x2_t zd, svint16x2_t zm) __arm_streaming;

FDOT ^

Multi-vector 8-bit floating-point dot-product.

  // Available variants are: _za16 if __ARM_FEATURE_SME_F8F16 != 0
  //                         _za32 if __ARM_FEATURE_SME_F8F32 != 0
  void svdot_lane_za16[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn,
                                       svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

  void svdot_lane_za16[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn,
                                       svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

  void svdot[_single]_za16[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn,
                                           svmfloat8_t zm, fpm_t fpm)
                                           __arm_streaming __arm_inout("za");

  void svdot[_single]_za16[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn,
                                           svmfloat8_t zm, fpm_t fpm)
                                           __arm_streaming __arm_inout("za");

  void svdot_za16[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");

  void svdot_za16[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm,
                                  fpm_t fpm)  __arm_streaming __arm_inout("za");
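
As a usage sketch only (assuming __ARM_FEATURE_SME_F8F16 is nonzero; the helper name is illustrative), the two-vector form above might be used from streaming code as:

  #include <arm_sme.h>

  // Accumulate an FP8 dot-product into ZA vector group `slice`.
  void fp8_dot_step(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm, fpm_t fpm)
      __arm_streaming __arm_inout("za")
  {
    svdot_za16_mf8_vg1x2_fpm(slice, zn, zm, fpm);
  }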

FVDOT ^

Multi-vector 8-bit floating-point vertical dot-product by indexed element to half-precision.

  // Only if __ARM_FEATURE_SME_F8F16 != 0
  void svvdot_lane_za16[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn,
                                        svmfloat8_t zm, uint64_t imm_idx,
                                        fpm_t fpm)  __arm_streaming __arm_inout("za");

FVDOTB, FVDOTT ^

Multi-vector 8-bit floating-point vertical dot-product.

  // Only if __ARM_FEATURE_SME_F8F32 != 0
  void svvdott_lane_za32[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x2_t zn,
                                         svmfloat8_t zm, uint64_t imm_idx,
                                         fpm_t fpm) __arm_streaming __arm_inout("za");

  void svvdotb_lane_za32[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x2_t zn,
                                         svmfloat8_t zm, uint64_t imm_idx,
                                         fpm_t fpm) __arm_streaming __arm_inout("za");

FMLAL ^

Multi-vector 8-bit floating-point multiply-add long.

  // Only if __ARM_FEATURE_SME_F8F16 != 0
  void svmla_lane_za16[_mf8]_vg2x1_fpm(uint32_t slice, svmfloat8_t zn,
                                       svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm)  __arm_streaming __arm_inout("za");

  void svmla_lane_za16[_mf8]_vg2x2_fpm(uint32_t slice, svmfloat8x2_t zn,
                                       svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm)  __arm_streaming __arm_inout("za");

  void svmla_lane_za16[_mf8]_vg2x4_fpm(uint32_t slice, svmfloat8x4_t zn,
                                       svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

  void svmla[_single]_za16[_mf8]_vg2x1_fpm(uint32_t slice, svmfloat8_t zn,
                                           svmfloat8_t zm, fpm_t fpm)
                                           __arm_streaming __arm_inout("za");

  void svmla[_single]_za16[_mf8]_vg2x2_fpm(uint32_t slice, svmfloat8x2_t zn,
                                           svmfloat8_t zm, fpm_t fpm)
                                           __arm_streaming __arm_inout("za");

  void svmla[_single]_za16[_mf8]_vg2x4_fpm(uint32_t slice, svmfloat8x4_t zn,
                                           svmfloat8_t zm, fpm_t fpm)
                                           __arm_streaming __arm_inout("za");

  void svmla_za16[_mf8]_vg2x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");

  void svmla_za16[_mf8]_vg2x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");

FMLALL ^

Multi-vector 8-bit floating-point multiply-add long long.

  // Only if __ARM_FEATURE_SME_F8F32 != 0
  void svmla_lane_za32[_mf8]_vg4x1_fpm(uint32_t slice, svmfloat8_t zn,
                                       svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

  void svmla_lane_za32[_mf8]_vg4x2_fpm(uint32_t slice, svmfloat8x2_t zn,
                                       svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

  void svmla_lane_za32[_mf8]_vg4x4_fpm(uint32_t slice, svmfloat8x4_t zn,
                                       svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

  void svmla[_single]_za32[_mf8]_vg4x1_fpm(uint32_t slice, svmfloat8_t zn,
                                           svmfloat8_t zm, fpm_t fpm)
                                           __arm_streaming __arm_inout("za");

  void svmla[_single]_za32[_mf8]_vg4x2_fpm(uint32_t slice, svmfloat8x2_t zn,
                                           svmfloat8_t zm, fpm_t fpm)
                                           __arm_streaming __arm_inout("za");

  void svmla[_single]_za32[_mf8]_vg4x4_fpm(uint32_t slice, svmfloat8x4_t zn,
                                           svmfloat8_t zm, fpm_t fpm)
                                           __arm_streaming __arm_inout("za");

  void svmla_za32[_mf8]_vg4x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");

  void svmla_za32[_mf8]_vg4x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");

FMOPA ^

8-bit floating-point sum of outer products and accumulate.

  // Only if __ARM_FEATURE_SME_F8F16 != 0
  void svmopa_za16[_mf8]_m_fpm(uint64_t tile, svbool_t pn, svbool_t pm,
                               svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm)
                               __arm_streaming __arm_inout("za");

  // Only if __ARM_FEATURE_SME_F8F32 != 0
  void svmopa_za32[_mf8]_m_fpm(uint64_t tile, svbool_t pn, svbool_t pm,
                               svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm)
                               __arm_streaming __arm_inout("za");

M-profile Vector Extension (MVE) intrinsics ^

The M-profile Vector Extension (MVE) [MVE-spec] instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations on a range of integer and floating-point types. MVE can also be referred to as Helium.

The M-profile Vector Extension provides for arithmetic, logical and saturated arithmetic operations on 8-bit, 16-bit and 32-bit integers (and sometimes on 64-bit integers) and on 16-bit and 32-bit floating-point data, arranged in 128-bit vectors.

The intrinsics in this section provide C and C++ programmers with a simple programming model allowing easy access to the code generation of the MVE instructions for the Armv8.1-M Mainline architecture.

Concepts ^

The MVE instructions are designed to improve the performance of SIMD operations by operating on 128-bit vectors of elements of the same scalar data type.

For example, uint16x8_t is a 128-bit vector type consisting of eight elements of the scalar uint16_t data type. Likewise, uint8x16_t is a 128-bit vector type consisting of sixteen uint8_t elements.

In a vector programming model, operations are performed in parallel across the elements of the vector. For example, vmulq_u16(a, b) is a vector intrinsic which takes two uint16x8_t vector arguments a and b, and returns the result of multiplying corresponding elements from each vector together.

The M-profile Vector Extension also provides support for vector-by-scalar operations. In these operations, a scalar value is provided directly, duplicated to create a new vector with the same number of elements as an input vector, and an operation is performed in parallel between this new vector and other input vectors.

For example, vaddq_n_u16(a, s) is a vector-by-scalar intrinsic which takes one uint16x8_t vector argument and one uint16_t scalar argument. A new vector is formed which consists of eight copies of s, and this new vector is added to a.

Reductions work across the whole of a single vector, performing the same operation between elements of that vector. For example, vaddvq_u16(a) is a reduction intrinsic which takes a uint16x8_t vector, adds each of the eight uint16_t elements together, and returns a uint32_t result containing the sum. Note the difference in return type between MVE’s vaddvq_u16 and the Advanced SIMD intrinsic of the same name: MVE returns the uint32_t type, whereas Advanced SIMD returns the element type uint16_t.
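
A minimal sketch of the reduction just described (the values are arbitrary):

  uint16x8_t v   = vdupq_n_u16 (10);
  uint32_t   sum = vaddvq_u16 (v);   // adds all eight lanes: 8 * 10 = 80, held in a uint32_t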

Cross-lane and pairwise vector operations work on pairs of elements within a vector. Sometimes the same operation is performed on each pair, as in the vector saturating doubling multiply subtract dual returning high half with exchange intrinsic vqdmlsdhxq_s8; sometimes a different one is, as in the vector complex addition intrinsic vcaddq_rot90_s8.

Some intrinsics may only read part of the input vectors, whereas others may only write part of the results. For example, the vector multiply long intrinsics will, depending on whether you use vmullbq_int_s16 or vmulltq_int_s16, read the even (bottom) or odd (top) elements of each int16x8_t input vector, multiply them, and write the products to a double-width int32x4_t vector. In contrast, the vector shift right and narrow intrinsics read a double-width input vector and, depending on whether you pick the bottom or top variant, write to the even or odd elements of the single-width result vector. For example, vshrnbq_n_s16(a, b, 2) takes each of the eight int16_t elements of argument b, shifts them right by two, narrows them to eight bits, and writes them to the even elements of the int8x16_t result vector, while the odd elements are taken from the equally typed int8x16_t argument a.
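
For illustration, a short sketch of the bottom/top multiply long variants described above (the values are arbitrary):

  int16x8_t a = vdupq_n_s16 (3);
  int16x8_t b = vdupq_n_s16 (-5);
  int32x4_t bottom = vmullbq_int_s16 (a, b);   // products of the even (bottom) lanes
  int32x4_t top    = vmulltq_int_s16 (a, b);   // products of the odd (top) lanes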

Predication: the M-profile Vector Extension uses vector predication to allow SIMD operations on selected lanes. The MVE intrinsics expose vector predication by providing predicated intrinsic variants for instructions that support it. These intrinsics can be recognized by one of the four suffixes: _m (merging), where false-predicated lanes keep the value they had in the first parameter of the intrinsic; _p (predicated), where false-predicated lanes do not take part in the operation; _z (zeroing), where false-predicated lanes are filled with zeroes (used only for load instructions); and _x (don’t-care), where false-predicated lanes have undefined values.

These predicated intrinsics can also be recognized by their last parameter being of type mve_pred16_t. This is an alias for the uint16_t type. Some predicated intrinsics may have a dedicated first parameter to specify the value in the result vector for the false-predicated lanes; this argument will be of the same type as the result type. For example, v = veorq_m_s8(inactive, a, b, p) will write to each of the sixteen lanes of the result vector v either the exclusive OR of the corresponding lanes of vectors a and b, or the corresponding lane of vector inactive, depending on whether that lane is true- or false-predicated in p. The types of inactive, a, b and v are all int8x16_t in this case, and p has type mve_pred16_t.

When calling a predicated intrinsic, the predicate mask value should contain the same value in all bits corresponding to the same element of an input or output vector. For example, an instruction operating on 32-bit vector elements should have a predicate mask in which each block of 4 bits is either all 0 or all 1.

  mve_pred16_t mask8 = vcmpeqq_u8 (a, b);
  uint8x16_t r8  = vaddq_m_u8  (inactive, a, b, mask8); // OK
  uint16x8_t r16 = vaddq_m_u16 (inactive, c, d, mask8); // UNDEFINED BEHAVIOR

  mve_pred16_t mask8 = 0x5555;        // Predicate every other byte.
  uint8x16_t r8  = vaddq_m_u8  (inactive, a, b, mask8); // OK
  uint16x8_t r16 = vaddq_m_u16 (inactive, c, d, mask8); // UNDEFINED BEHAVIOR

In cases where the input and output vectors have different sizes (a widening or narrowing operation), the mask should be consistent with the largest element size used by the intrinsic. For example, vcvtbq_m_f16_f32 and vcvtbq_m_f32_f16 should both be passed a predicate mask consistent with 32-bit vector lanes.

Users wishing to exploit the MVE architecture’s predication behavior in finer detail than this constraint permits are encouraged to use inline assembly.

Scalar shift intrinsics ^

The M-profile Vector Extension (MVE) also provides a set of scalar shift instructions that operate on signed and unsigned double-words and single-words. These shifts can perform additional saturation, rounding, or both. The ACLE for MVE defines intrinsics for these instructions.

Namespace ^

By default all M-profile Vector Extension intrinsics are available with and without the __arm_ prefix. If the __ARM_MVE_PRESERVE_USER_NAMESPACE macro is defined, the __arm_ prefix is mandatory. This is available to hide the user-namespace-polluting variants of the intrinsics.

Intrinsic polymorphism ^

The ACLE for the M-profile Vector Extension intrinsics was designed to support a polymorphic implementation of most intrinsics. The polymorphic name of an intrinsic is indicated by leaving out the type suffix enclosed in square brackets; for example, the vector addition intrinsic vaddq[_s32] can be called using the function name vaddq. Note that the polymorphism is only possible on input parameter types, and intrinsics with the same name must still have the same number of parameters. This is expected to aid implementation of the polymorphism using C11’s _Generic selection.
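
As a minimal sketch (not the actual <arm_mve.h> definition), an implementation might map a polymorphic name onto its typed variants roughly as follows:

  #define vaddq(a, b)                         \
    _Generic((a),                             \
             int32x4_t:   vaddq_s32,          \
             uint16x8_t:  vaddq_u16,          \
             float32x4_t: vaddq_f32)(a, b)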

Vector data types ^

Vector data types are named as a lane type and a multiple. Lane type names are based on the types defined in <stdint.h>. For example, int16x8_t is a vector of eight int16_t values. The base types are int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float16_t and float32_t. The multiples are such that the resulting vector types are 128-bit.

Vector array data types ^

Array types are defined for multiples of 2 and 4 of all the vector types, for use in load and store operations. For a vector type <type>_t the corresponding array type is <type>x<length>_t. Concretely, an array type is a structure containing a single array element called val.

For example, an array of two int16x8_t types is int16x8x2_t, and is represented as

  struct int16x8x2_t { int16x8_t val[2]; };

Scalar data types ^

For consistency, <arm_mve.h> defines some additional scalar data types to match the vector types.

float32_t is defined as an alias for float, float16_t is defined as an alias for __fp16 and mve_pred16_t is defined as an alias for uint16_t.

Operations on data types ^

ACLE does not define implicit conversion between different data types. For example:

  int32x4_t x;
  uint32x4_t y = x; // No representation change
  float32x4_t z = x; // Conversion of integer to floating type

Is not portable. Use the vreinterpretq intrinsics to convert from one vector type to another without changing representation, and use the vcvtq intrinsics to convert between integer and floating types; for example:

  int32x4_t x;
  uint32x4_t y = vreinterpretq_u32_s32(x);
  float32x4_t z = vcvtq_f32_s32(x);

ACLE does not define static construction of vector types. For example:

  int32x4_t x = { 1, 2, 3, 4 };

Is not portable. Use the vcreateq or vdupq intrinsics to construct values from scalars.
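
For example, a vector with all lanes equal can be constructed portably with vdupq (a sketch; vcreateq or a vld1q load can be used when the lanes need different values):

  int32x4_t x = vdupq_n_s32 (42);   // all four lanes hold 42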

In C++, ACLE does not define whether MVE data types are POD types or whether they can be inherited from.

Compatibility with other vector programming models ^

ACLE does not specify how the MVE Intrinsics interoperate with alternative vector programming models. Consequently, programmers should take particular care when combining the MVE programming model with such programming models.

For example, the GCC vector extensions permit initialising a variable using array syntax, as follows:

  #include "arm_mve.h"
  ...
  uint32x4_t x = {0, 1, 2, 3}; // GCC extension.
  uint32_t y = vgetq_lane_u32 (x, 0); // ACLE MVE Intrinsic.

But the definition of the GCC vector extensions is such that the value stored in y will depend on whether the program is running in big- or little-endian mode.

It is recommended that MVE Intrinsics be used consistently:

  #include "arm_mve.h"
  ...
  const uint32_t temp[4] = {0, 1, 2, 3};
  uint32x4_t x = vld1q_u32 (temp);
  uint32_t y = vgetq_lane_u32 (x, 0);

Availability of M-profile Vector Extension intrinsics ^

M-profile Vector Extension support is available if the __ARM_FEATURE_MVE macro has a value other than 0 (see M-profile Vector Extension). The availability of the MVE floating-point data types and intrinsics is predicated on this macro having the bit with value 2 set. In order to access the MVE intrinsics, it is necessary to include the <arm_mve.h> header.

  #if (__ARM_FEATURE_MVE & 3) == 3
  #include <arm_mve.h>
    /* MVE integer and floating point intrinsics are now available to use.  */
  #elif __ARM_FEATURE_MVE & 1
  #include <arm_mve.h>
    /* MVE integer intrinsics are now available to use.  */
  #endif

Specification of M-profile Vector Extension intrinsics ^

The M-profile Vector Extension intrinsics are specified in the Arm MVE Intrinsics Reference Architecture Specification [MVE].

The behavior of an intrinsic is specified to be equivalent to the MVE instruction it is mapped to in [MVE]. Intrinsics are specified as a mapping between their name, arguments and return values and the MVE instruction and assembler operands which they are equivalent to.

A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

Undefined behavior ^

Care should be taken by compiler implementers not to introduce the concept of undefined behavior to the semantics of an intrinsic.

Alignment assertions ^

The MVE load and store instructions provide for alignment assertions, which may speed up access to aligned data (and will fault access to unaligned data). The MVE intrinsics do not directly provide a means for asserting alignment.

Transactional Memory Extension (TME) intrinsics ^

Introduction ^

This section describes the intrinsics for the instructions of the Transactional Memory Extension (TME). TME adds support for transactional execution where transactions are started and committed by a set of new instructions. The TME instructions are present in the AArch64 execution state only.

TME is designed to improve performance in cases where larger system scaling requires atomic and isolated access to data structures whose composition is dynamic in nature and therefore not readily amenable to fine-grained locking or lock-free approaches.

TME transactions are isolated. This means that transactional stores are hidden from other observers, and transactional loads cannot see stores from other observers until the transaction commits. Also, if the transaction fails then stores to memory and writes to registers by the transaction are discarded and the processor returns to the state it had when the transaction started.

TME transactions are best-effort. This means that the architecture does not guarantee success for any transaction. The architecture requires that all transactions specify a failure handler, allowing the software to fall back to a non-transactional alternative to provide guarantees of forward progress.

TME defines flattened nesting of transactions, where nested transactions are subsumed by the outer transaction. This means that the effects of a nested transaction do not become visible to other observers until the outer transaction commits. When a nested transaction fails it causes the outer transaction, and all nested transactions within, to fail.

The TME intrinsics are available when __ARM_FEATURE_TME is defined.

Failure definitions ^

Transactions can fail due to various causes. The following macros are defined to help use or detect these causes.

  #define _TMFAILURE_REASON 0x00007fffu
  #define _TMFAILURE_RTRY   0x00008000u
  #define _TMFAILURE_CNCL   0x00010000u
  #define _TMFAILURE_MEM    0x00020000u
  #define _TMFAILURE_IMP    0x00040000u
  #define _TMFAILURE_ERR    0x00080000u
  #define _TMFAILURE_SIZE   0x00100000u
  #define _TMFAILURE_NEST   0x00200000u
  #define _TMFAILURE_DBG    0x00400000u
  #define _TMFAILURE_INT    0x00800000u
  #define _TMFAILURE_TRIVIAL    0x01000000u

Intrinsics ^

  uint64_t __tstart (void);

Starts a new transaction. When the transaction starts successfully the return value is 0. If the transaction fails, all state modifications are discarded and a cause of the failure is encoded in the return value. The macros defined in Failure definitions can be used to detect the cause of the failure.

  void __tcommit (void);

Commits the current transaction. For a nested transaction, the only effect is that the transactional nesting depth is decreased. For an outer transaction, the state modifications performed transactionally are committed to the architectural state.

  void __tcancel (/*constant*/ uint64_t);

Cancels the current transaction and discards all state modifications that were performed transactionally. The intrinsic takes a 16-bit immediate input that encodes the cancellation reason. This input could be given as

  __tcancel (_TMFAILURE_RTRY | (failure_reason & _TMFAILURE_REASON));

if retry is true or

  __tcancel (failure_reason & _TMFAILURE_REASON);

if retry is false.

  uint64_t __ttest (void);

Tests if executing inside a transaction. If no transaction is currently executing, the return value is 0. Otherwise, this intrinsic returns the depth of the transaction.
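
The following sketch shows the transaction-plus-fallback pattern these intrinsics are intended to support; do_update_speculatively and fallback_with_lock are hypothetical application functions.

  #include <arm_acle.h>
  #include <stdint.h>

  extern void do_update_speculatively (void);   // hypothetical transactional body
  extern void fallback_with_lock (void);        // hypothetical non-transactional fallback

  void update (void)
  {
    uint64_t status = __tstart ();
    if (status == 0)
      {
        do_update_speculatively ();   // stores are discarded if the transaction fails
        __tcommit ();
      }
    else
      {
        // The failure cause is encoded in status; _TMFAILURE_RTRY indicates that a
        // retry may succeed. For simplicity this sketch always takes the fallback.
        fallback_with_lock ();
      }
  }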

Instructions ^

Intrinsics                                      Argument      Result         Instruction
uint64_t __tstart (void)                        -             Xt -> result   tstart
void __tcommit (void)                           -             -              tcommit
void __tcancel (/*constant*/ uint64_t reason)   reason -> #   -              tcancel #
uint64_t __ttest (void)                         -             Xt -> result   ttest

These intrinsics are available when <arm_acle.h> is included.

memcpy family of operations intrinsics - MOPS ^

Introduction ^

This section describes the intrinsic for the new instructions introduced in the Armv8.8-A and Armv9.3-A architecture updates for the memcpy, memmove and memset family of memory operations (MOPS).

These instructions are designed to enable the standardization of the software implementation of those operations. Therefore, most of the use cases for the new instructions are covered by the compiler’s code generation or by library implementations.

An exception to that is the set of instructions covering the memset operation with memory tagging. An intrinsic is available to provide access to this operation. See Memory tagging for more information on memory tagging.

Note: the <arm_acle.h> header should be included before using this intrinsic.

This intrinsic is available when both __ARM_FEATURE_MOPS and __ARM_FEATURE_MEMORY_TAGGING are defined.

  void* __arm_mops_memset_tag(void* tagged_address, int value, size_t size)

This intrinsic performs a memset operation with tag setting on a memory block.

The parameters of __arm_mops_memset_tag are as follows: tagged_address is the destination of the memset and also provides the Memory Tagging Extension allocation tag that is set on the memory block; value is the fill value, as for C’s memset; and size is the number of bytes to set.

Similarly to C’s memset, this intrinsic returns the tagged_address pointer.
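
As a usage sketch (the wrapper name is illustrative, and buf is assumed to already carry the desired allocation tag in its address bits):

  #include <arm_acle.h>
  #include <stddef.h>

  void *zero_and_retag (void *buf, size_t size)
  {
    // Fill the block with zero bytes and set its allocation tags from buf's tag.
    return __arm_mops_memset_tag (buf, 0, size);
  }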

Architectural Extension Bridges ^

NEON-SVE Bridge ^

The NEON-SVE Bridge adds intrinsics that allow conversions between NEON and SVE vectors.

The <arm_neon_sve_bridge.h> header should be included before using these intrinsics.

svset_neonq ^

These intrinsics set the first 128 bits of SVE vector vec to subvec.

That is, bit i of the result is equal to bit i of subvec if i < 128, and to bit i of vec otherwise.

On big-endian targets, this leaves lanes in a different order from the “native” SVE order. For example, if subvec is int32x4_t, then on big-endian targets, the first memory element is in lane 3 of subvec and is therefore in lane 3 of the returned SVE vector. Using svld1 to load elements would instead put the first memory element in lane 0 of the returned SVE vector.

When svundef is passed as the vec parameter, compilers are able to reuse the SVE register overlapping the NEON input without generating additional instructions.

Instances
svint8_t svset_neonq[_s8](svint8_t vec, int8x16_t subvec)
svint16_t svset_neonq[_s16](svint16_t vec, int16x8_t subvec)
svint32_t svset_neonq[_s32](svint32_t vec, int32x4_t subvec)
svint64_t svset_neonq[_s64](svint64_t vec, int64x2_t subvec)
svuint8_t svset_neonq[_u8](svuint8_t vec, uint8x16_t subvec)
svuint16_t svset_neonq[_u16](svuint16_t vec, uint16x8_t subvec)
svuint32_t svset_neonq[_u32](svuint32_t vec, uint32x4_t subvec)
svuint64_t svset_neonq[_u64](svuint64_t vec, uint64x2_t subvec)
svfloat16_t svset_neonq[_f16](svfloat16_t vec, float16x8_t subvec)
svfloat32_t svset_neonq[_f32](svfloat32_t vec, float32x4_t subvec)
svfloat64_t svset_neonq[_f64](svfloat64_t vec, float64x2_t subvec)
svbfloat16_t svset_neonq[_bf16](svbfloat16_t vec, bfloat16x8_t subvec)
svmfloat8_t svset_neonq[_mf8](svmfloat8_t vec, mfloat8x16_t subvec)
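
For illustration, a sketch of carrying a NEON value into SVE code through the bridge; passing svundef_f32() as vec permits the register reuse noted above, and the helper name is illustrative.

  #include <arm_neon.h>
  #include <arm_sve.h>
  #include <arm_neon_sve_bridge.h>

  svfloat32_t neon_to_sve (float32x4_t q)
  {
    // The low 128 bits of the result hold q; the remaining bits are undefined.
    return svset_neonq_f32 (svundef_f32 (), q);
  }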

svget_neonq ^

These intrinsics get the first 128-bit subvector of SVE vector vec as a NEON vector.

Instances
int8x16_t svget_neonq[_s8](svint8_t vec)
int16x8_t svget_neonq[_s16](svint16_t vec)
int32x4_t svget_neonq[_s32](svint32_t vec)
int64x2_t svget_neonq[_s64](svint64_t vec)
uint8x16_t svget_neonq[_u8](svuint8_t vec)
uint16x8_t svget_neonq[_u16](svuint16_t vec)
uint32x4_t svget_neonq[_u32](svuint32_t vec)
uint64x2_t svget_neonq[_u64](svuint64_t vec)
float16x8_t svget_neonq[_f16](svfloat16_t vec)
float32x4_t svget_neonq[_f32](svfloat32_t vec)
float64x2_t svget_neonq[_f64](svfloat64_t vec)
bfloat16x8_t svget_neonq[_bf16](svbfloat16_t vec)
mfloat8x16_t svget_neonq[_mf8](svmfloat8_t vec)

svdup_neonq ^

These intrinsics return an SVE vector in which the NEON vector vec is duplicated across every 128-bit subvector.

Instances
svint8_t svdup_neonq[_s8](int8x16_t vec)
svint16_t svdup_neonq[_s16](int16x8_t vec)
svint32_t svdup_neonq[_s32](int32x4_t vec)
svint64_t svdup_neonq[_s64](int64x2_t vec)
svuint8_t svdup_neonq[_u8](uint8x16_t vec)
svuint16_t svdup_neonq[_u16](uint16x8_t vec)
svuint32_t svdup_neonq[_u32](uint32x4_t vec)
svuint64_t svdup_neonq[_u64](uint64x2_t vec)
svfloat16_t svdup_neonq[_f16](float16x8_t vec)
svfloat32_t svdup_neonq[_f32](float32x4_t vec)
svfloat64_t svdup_neonq[_f64](float64x2_t vec)
svbfloat16_t svdup_neonq[_bf16](bfloat16x8_t vec)
svmfloat8_t svdup_neonq[_mf8](mfloat8x16_t vec)

Future directions ^

Extensions under consideration ^

Procedure calls and the Q / GE bits ^

The Arm procedure call standard [AAPCS] states that the Q and GE bits are undefined across public interfaces, but in practice it is desirable to return saturation status from functions. There are at least two common use cases:

To define small (inline) functions (defined in terms of expressions involving intrinsics, which provide abstractions or emulate other intrinsic families), it is desirable for such functions to have the same well-defined effects on the Q/GE bits as the corresponding intrinsics.

DSP library functions ^

Options being considered are to define an extension to the pcs attribute to indicate that Q is meaningful on the return, and possibly also to infer this in the case of functions marked as inline.

Returning a value in registers ^

As a type attribute this would allow things like

  struct __attribute__((value_in_regs)) Point { int x[2]; };

This would indicate that the result registers should be used as if the type had been passed as the first argument. The implementation should not complain if the attribute is applied inappropriately (that is, where insufficient registers are available), since it might be a template instance.

Custom calling conventions ^

Some interfaces may use calling conventions that depart from the AAPCS. Examples include:

Traps: system calls, breakpoints, … ^

This release of ACLE does not define how to invoke an SVC (supervisor call), BKPT (breakpoint), and other related functionality.

One option would be to mark a function prototype with an attribute, for example

  int __attribute__((svc(0xAB))) system_call(int code, void const *params);

When calling the function, arguments and results would be marshalled according to the AAPCS, the only difference being that the call would be invoked as a trap instruction rather than a branch-and-link.

One issue is that some calls may have non-standard calling conventions. (For example, Arm Linux system calls expect the code number to be passed in R7.)

Another issue is that the code may vary between A32 and T32 state. This issue could be addressed by allowing two numeric parameters in the attribute.

Mixed-endian data ^

Extensions for accessing data in different endianness have been considered. However, this is not an issue specific to the Arm architecture, and it seems better to wait for a lead from language standards.

Memory access with non-temporal hints ^

Supporting memory access with cacheability hints through language extensions is being investigated. For example:

  int *__attribute__((nontemporal)) p;

As a type attribute, this will allow indirection of p with a non-temporal cacheability hint.

Features not considered for support ^

VFP vector mode ^

The short vector mode of the original VFP architecture is now deprecated, and unsupported in recent implementations of the Arm floating-point instruction set. There is no plan to support it through C extensions.

Bit-banded memory access ^

The bit-banded memory feature of certain Cortex-M cores is now regarded as being outside the architecture, and there is no plan to standardize its support.

  1. For example, the sve_bf16 feature depends on sve.

  2. The intention is to support the usecase of newer code if