Compute Library
 22.08
CpuGemmAssemblyDispatch Class Reference

Assembly kernel glue. More...

#include <CpuGemmAssemblyDispatch.h>

Collaboration diagram for CpuGemmAssemblyDispatch:

Data Structures

class  IFallback
 

Public Member Functions

 CpuGemmAssemblyDispatch ()
 Constructor. More...
 
 ~CpuGemmAssemblyDispatch ()=default
 Default destructor. More...
 
 ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE (CpuGemmAssemblyDispatch)
 
void configure (const ITensorInfo *a, const ITensorInfo *b, const ITensorInfo *c, ITensorInfo *d, const AsmGemmInfo &info)
 If supported, create a Compute Library function; else fall back to the arm_gemm function. More...
 
bool is_configured () const
 Was the function successfully configured? More...
 
bool isVarWeightsKernel () const
 Indicates if the convolution executes in variable weights mode. More...
 
void prepare (ITensorPack &tensors) override
 Prepare the function for executing. More...
 
void run (ITensorPack &tensors) override
 Run the kernels contained in the function. More...
 
experimental::MemoryRequirements workspace () const override
 Return the memory requirements required by the workspace. More...
 
- Public Member Functions inherited from INEOperator
 INEOperator (IRuntimeContext *ctx=nullptr)
 Constructor. More...
 
 INEOperator (const INEOperator &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
 INEOperator (INEOperator &&)=default
 Default move constructor. More...
 
INEOperator &operator= (const INEOperator &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
INEOperator &operator= (INEOperator &&)=default
 Default move assignment operator. More...
 
 ~INEOperator ()
 Default destructor. More...
 
- Public Member Functions inherited from IOperator
virtual ~IOperator ()=default
 Destructor. More...
 

Static Public Member Functions

static Status validate (const ITensorInfo *a, const ITensorInfo *b, const ITensorInfo *c, const ITensorInfo *d, const AsmGemmInfo &info)
 Indicates whether or not this function can be used to process the given parameters. More...
 
static Status has_opt_impl (arm_compute::WeightFormat &weight_format, const ITensorInfo *a, const ITensorInfo *b, const ITensorInfo *c, const ITensorInfo *d, const AsmGemmInfo &info)
 Indicates whether or not there is an optimal assembly implementation that can be used to process the given parameters. More...
 
static bool is_activation_supported (const ActivationLayerInfo &activation)
 Checks if activation is supported by the gemm assembly dispatcher. More...
 

Detailed Description

Assembly kernel glue.

Definition at line 60 of file CpuGemmAssemblyDispatch.h.

Constructor & Destructor Documentation

◆ CpuGemmAssemblyDispatch()

Constructor.

Definition at line 689 of file CpuGemmAssemblyDispatch.cpp.

    : _arm_gemm(nullptr)
{
}

◆ ~CpuGemmAssemblyDispatch()

Default destructor.

Member Function Documentation

◆ ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE()

ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE ( CpuGemmAssemblyDispatch  )

◆ configure()

void configure ( const ITensorInfo *a,
const ITensorInfo *b,
const ITensorInfo *c,
ITensorInfo *d,
const AsmGemmInfo &info 
)

If supported, create a Compute Library function; else fall back to the arm_gemm function.

Parameters
[in]  a     Input tensor (Matrix A)
[in]  b     Input tensor (Matrix B)
[in]  c     Input tensor (Matrix C) used to pass the bias for quantized calculations
[out] d     Output tensor to store the result of matrix multiplication. Data type supported: same as input0.
[in]  info  GEMM meta-data

Definition at line 811 of file CpuGemmAssemblyDispatch.cpp.

References AsmGemmInfo::activation_info, ARM_COMPUTE_ERROR_ON_NULLPTR, arm_compute::test::validation::b, arm_compute::BFLOAT16, ITensorInfo::data_type(), arm_compute::F16, arm_compute::F32, arm_compute::test::validation::info, arm_compute::assembly_utils::map_to_arm_gemm_activation(), arm_compute::QASYMM8, arm_compute::QASYMM8_SIGNED, arm_compute::S32, arm_compute::S8, arm_compute::U8, and CpuGemmAssemblyDispatch::validate().

{
    ARM_COMPUTE_ERROR_ON_NULLPTR(a, b, d);
    arm_gemm::Activation act = assembly_utils::map_to_arm_gemm_activation(info.activation_info);

    //If we don't support a combination of data types, silently return: it is the caller's responsibility to check if configure() was successful via is_configured()
    if(!CpuGemmAssemblyDispatch::validate(a, b, c, d, info))
    {
        return;
    }

    switch(a->data_type())
    {
        case DataType::F32:
            create_arm_gemm<float, float>(_arm_gemm, a, b, c, d, act, info);
            break;
#ifdef __aarch64__
        case DataType::U8:
        case DataType::QASYMM8:
            if(d->data_type() == DataType::S32)
            {
                create_arm_gemm<uint8_t, uint32_t>(_arm_gemm, a, b, c, d, act, info);
            }
            else
            {
                create_arm_gemm_quant<uint8_t, uint8_t>(_arm_gemm, a, b, c, d, act, info);
            }
            break;
        case DataType::S8:
        case DataType::QASYMM8_SIGNED:
            if(d->data_type() == DataType::S32)
            {
                create_arm_gemm<int8_t, int32_t>(_arm_gemm, a, b, c, d, act, info);
            }
            else
            {
                create_arm_gemm_quant<int8_t, int8_t>(_arm_gemm, a, b, c, d, act, info);
            }
            break;
#endif /* __aarch64__ */
#if defined(ARM_COMPUTE_ENABLE_BF16)
        case DataType::BFLOAT16:
            create_arm_gemm<bfloat16, float>(_arm_gemm, a, b, c, d, act, info);
            break;
#endif /* defined(ARM_COMPUTE_ENABLE_BF16) */
#ifdef __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
        case DataType::F16:
            create_arm_gemm<float16_t, float16_t>(_arm_gemm, a, b, c, d, act, info);
            break;
#endif /* __ARM_FEATURE_FP16_VECTOR_ARITHMETIC */
        default:
            break;
    }
}
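The configure() contract described above — validate first, and on an unsupported data-type combination return silently so the caller must check is_configured() — can be sketched in isolation. The types below (Dispatch, Fallback, DataType) are hypothetical stand-ins for illustration only, not the actual arm_compute API:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the library types; illustrates the
// "silently leave unconfigured on failure" contract only.
enum class DataType { F32, F16, U8 };

struct Fallback
{
    DataType dt;
};

class Dispatch
{
public:
    // Mirrors the configure() style above: if validation fails,
    // return without creating a backend, leaving the object unconfigured.
    void configure(DataType dt)
    {
        if(dt != DataType::F32 && dt != DataType::U8)
        {
            return; // unsupported combination: caller checks is_configured()
        }
        _impl = std::make_unique<Fallback>(Fallback{dt});
    }

    bool is_configured() const
    {
        return _impl != nullptr;
    }

private:
    std::unique_ptr<Fallback> _impl;
};
```

The caller therefore never sees an exception for an unsupported type; it probes success afterwards, exactly as the comment in the real configure() body instructs.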

◆ has_opt_impl()

Status has_opt_impl ( arm_compute::WeightFormat &weight_format,
const ITensorInfo *a,
const ITensorInfo *b,
const ITensorInfo *c,
const ITensorInfo *d,
const AsmGemmInfo &info 
)
static

Indicates whether or not there is an optimal assembly implementation that can be used to process the given parameters.

This method serves the same purpose as NEGEMMConvolutionLayer::has_opt_impl, with the only caveat that the value of arm_compute::WeightFormat needs to be passed via the parameter info.

Returns
a status.

Definition at line 694 of file CpuGemmAssemblyDispatch.cpp.

References AsmGemmInfo::activation_info, GemmTuner::args, ARM_COMPUTE_ERROR_ON_NULLPTR, ARM_COMPUTE_RETURN_ERROR_ON_MSG, ARM_COMPUTE_UNUSED, arm_compute::BFLOAT16, ci, IScheduler::cpu_info(), ITensorInfo::data_type(), arm_compute::F16, arm_compute::F32, AsmGemmInfo::fast_mode, AsmGemmInfo::fixed_format, Scheduler::get(), arm_compute::assembly_utils::map_to_arm_compute_weight_format(), arm_compute::assembly_utils::map_to_arm_gemm_activation(), arm_compute::assembly_utils::map_to_arm_gemm_weight_format(), IScheduler::num_threads(), arm_compute::QASYMM8, arm_compute::QASYMM8_SIGNED, arm_compute::S32, arm_compute::S8, arm_compute::U8, AsmGemmInfo::weight_format, and GemmConfig::weight_format.

Referenced by CpuGemmAssemblyDispatch::validate().

{
    ARM_COMPUTE_ERROR_ON_NULLPTR(a, b, d);
    ARM_COMPUTE_UNUSED(c);
    arm_gemm::Activation act = assembly_utils::map_to_arm_gemm_activation(info.activation_info);
    Params p = extract_parameters(a, b, d, info);
    const CPUInfo &ci          = NEScheduler::get().cpu_info();
    unsigned int   num_threads = NEScheduler::get().num_threads();
    arm_gemm::GemmConfig cfg;
    cfg.weight_format                           = assembly_utils::map_to_arm_gemm_weight_format(info.weight_format);
    arm_gemm::WeightFormat arm_gemm_expected_wf = assembly_utils::map_to_arm_gemm_weight_format(expected_weight_format);
    arm_gemm::GemmArgs args(&ci, p.M, p.N, p.K, p.sections, p.batches, p.multis, p.indirect, act, num_threads, info.fixed_format, info.fast_mode, &cfg);
    switch(a->data_type())
    {
        case DataType::F32:
            ARM_COMPUTE_RETURN_ERROR_ON_MSG(!(arm_gemm::has_opt_gemm<float, float, arm_gemm::Nothing>(arm_gemm_expected_wf, args, {})),
                                            "We could not find an optimized kernel for F32 input");
            break;
#ifdef __aarch64__
        case DataType::U8:
        case DataType::QASYMM8:
            if(d->data_type() == DataType::S32)
            {
                ARM_COMPUTE_RETURN_ERROR_ON_MSG(!(arm_gemm::has_opt_gemm<uint8_t, uint32_t, arm_gemm::Nothing>(arm_gemm_expected_wf, args, {})),
                                                "We could not find an optimized kernel for U8/QASYMM8 input and S32 output");
            }
            else
            {
                ARM_COMPUTE_RETURN_ERROR_ON_MSG(!(arm_gemm::has_opt_gemm<uint8_t, uint8_t, arm_gemm::Requantize32>(arm_gemm_expected_wf, args, {})),
                                                "We could not find an optimized kernel for U8 input and U8 output");
            }
            break;
        case DataType::S8:
        case DataType::QASYMM8_SIGNED:
            if(d->data_type() == DataType::S32)
            {
                ARM_COMPUTE_RETURN_ERROR_ON_MSG(!(arm_gemm::has_opt_gemm<int8_t, int32_t, arm_gemm::Nothing>(arm_gemm_expected_wf, args, {})),
                                                "We could not find an optimized kernel for S8/QASYMM8_SIGNED input and S32 output");
            }
            else
            {
                ARM_COMPUTE_RETURN_ERROR_ON_MSG(!(arm_gemm::has_opt_gemm<int8_t, int8_t, arm_gemm::Requantize32>(arm_gemm_expected_wf, args, {})),
                                                "We could not find an optimized kernel for S8 input and S8 output");
            }
            break;
#endif /* __aarch64__ */
#if defined(ARM_COMPUTE_ENABLE_BF16)
        case DataType::BFLOAT16:
        {
            ARM_COMPUTE_RETURN_ERROR_ON_MSG(!(arm_gemm::has_opt_gemm<bfloat16, float, arm_gemm::Nothing>(arm_gemm_expected_wf, args, {})),
                                            "We could not find an optimized kernel for BFLOAT16 input and F32 output");
            break;
        }
#endif /* defined(ARM_COMPUTE_ENABLE_BF16) */
#ifdef __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
        case DataType::F16:
            ARM_COMPUTE_RETURN_ERROR_ON_MSG(!(arm_gemm::has_opt_gemm<float16_t, float16_t, arm_gemm::Nothing>(arm_gemm_expected_wf, args, {})),
                                            "We could not find an optimized kernel for F16 input and F16 output");
            break;
#endif /* __ARM_FEATURE_FP16_VECTOR_ARITHMETIC */
        default:
            ARM_COMPUTE_RETURN_ERROR_ON_MSG(true, "Unsupported type. Could not find a kernel");
            break;
    }
    expected_weight_format = assembly_utils::map_to_arm_compute_weight_format(arm_gemm_expected_wf);

    return Status{};
}

◆ is_activation_supported()

bool is_activation_supported ( const ActivationLayerInfo &activation )
static

Checks if activation is supported by the gemm assembly dispatcher.

Parameters
[in]activationActivation to check
Returns
True if activation is supported else false

Definition at line 805 of file CpuGemmAssemblyDispatch.cpp.

References arm_compute::assembly_utils::map_to_arm_gemm_activation(), Activation::None, and Activation::type.

Referenced by CpuGemmLowpMatrixMultiplyCore::configure().

{
    arm_gemm::Activation act = assembly_utils::map_to_arm_gemm_activation(activation);
    return act.type != arm_gemm::Activation::Type::None;
}

◆ is_configured()

bool is_configured ( ) const

Was the function successfully configured?

Returns
True if the function is configured and ready to run

Definition at line 872 of file CpuGemmAssemblyDispatch.cpp.

{
    return _arm_gemm && _arm_gemm->is_configured();
}

◆ isVarWeightsKernel()

bool isVarWeightsKernel ( ) const
inline

Indicates if the convolution executes in variable weights mode.

Similar to CpuGemm::isVarWeightsKernel

Definition at line 130 of file CpuGemmAssemblyDispatch.h.

References arm_compute::test::validation::run().

{
    return _arm_gemm && _arm_gemm->isVarWeightsKernel();
}

◆ prepare()

void prepare ( ITensorPack &constants )
overridevirtual

Prepare the function for executing.

Any one-off pre-processing step required by the function is handled here.

Parameters
[in] constants Vector that contains the constant tensors.
Note
Prepare stage might not need all the function's buffers' backing memory to be available in order to execute

Reimplemented from INEOperator.

Definition at line 866 of file CpuGemmAssemblyDispatch.cpp.

References ARM_COMPUTE_ERROR_ON.

{
    ARM_COMPUTE_ERROR_ON(_arm_gemm == nullptr);
    _arm_gemm->prepare(tensors);
}

◆ run()

void run ( ITensorPack &tensors )
overridevirtual

Run the kernels contained in the function.

Parameters
[in] tensors Vector that contains the tensors to operate on.

Reimplemented from INEOperator.

Definition at line 877 of file CpuGemmAssemblyDispatch.cpp.

References ARM_COMPUTE_ERROR_ON.

{
    ARM_COMPUTE_ERROR_ON(_arm_gemm == nullptr);
    _arm_gemm->run(tensors);
}
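The prepare()/run() pair above follows the usual two-phase operator lifecycle: prepare() performs one-off work (such as packing or pretransposing weights) exactly once, and run() executes per call. A self-contained sketch of that split, with a hypothetical Operator class standing in for the INEOperator hierarchy:

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration of the prepare-once / run-many lifecycle
// used by INEOperator-derived classes such as CpuGemmAssemblyDispatch.
class Operator
{
public:
    // One-off preparation: repeated calls are no-ops, so the (possibly
    // expensive) weight transformation happens exactly once.
    void prepare(std::vector<float> &weights)
    {
        if(_prepared)
        {
            return;
        }
        for(float &w : weights)
        {
            w *= 2.0f; // stand-in for packing/pretransposing weights
        }
        _prepared = true;
    }

    // Per-call execution; returns how many times it has run.
    int run()
    {
        return ++_runs;
    }

private:
    bool _prepared{ false };
    int  _runs{ 0 };
};
```

As the note on prepare() says, the prepare stage may not need all of the function's buffers to be backed by memory yet; only the constants it transforms must be available.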

◆ validate()

Status validate ( const ITensorInfo *a,
const ITensorInfo *b,
const ITensorInfo *c,
const ITensorInfo *d,
const AsmGemmInfo &info 
)
static

Indicates whether or not this function can be used to process the given parameters.

Parameters
[in] a    Input tensor info (Matrix A)
[in] b    Input tensor info (Matrix B)
[in] c    Input tensor info (Matrix C) used to pass the bias for quantized calculations
[in] d    Output tensor info to store the result of matrix multiplication. Data type supported: same as input0.
[in] info GEMM meta-data
Returns
a status.

Definition at line 764 of file CpuGemmAssemblyDispatch.cpp.

References arm_compute::ANY, ARM_COMPUTE_RETURN_ERROR_ON_CPU_BF16_UNSUPPORTED, ARM_COMPUTE_RETURN_ERROR_ON_CPU_F16_UNSUPPORTED, ARM_COMPUTE_RETURN_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN, ARM_COMPUTE_RETURN_ERROR_ON_MISMATCHING_DATA_TYPES, ARM_COMPUTE_RETURN_ERROR_ON_MSG, ARM_COMPUTE_RETURN_ERROR_ON_NULLPTR, ARM_COMPUTE_UNUSED, arm_compute::BFLOAT16, ITensorInfo::data_type(), ITensorInfo::element_size(), arm_compute::F16, arm_compute::F32, CpuGemmAssemblyDispatch::has_opt_impl(), arm_compute::is_data_type_quantized_per_channel(), arm_compute::QASYMM8, arm_compute::QASYMM8_SIGNED, arm_compute::QSYMM8_PER_CHANNEL, arm_compute::S32, arm_compute::S8, arm_compute::U32, arm_compute::U8, and AsmGemmInfo::weight_format.

Referenced by CpuGemmAssemblyDispatch::configure(), CpuGemm::configure(), CpuGemmDirectConv2d::validate(), CpuGemm::validate(), and CpuGemmLowpMatrixMultiplyCore::validate().

{
    ARM_COMPUTE_UNUSED(c, info);
    ARM_COMPUTE_RETURN_ERROR_ON_NULLPTR(a, b, d);
    ARM_COMPUTE_RETURN_ERROR_ON_CPU_F16_UNSUPPORTED(a);
    ARM_COMPUTE_RETURN_ERROR_ON_CPU_BF16_UNSUPPORTED(a);

#ifndef __aarch64__
    ARM_COMPUTE_RETURN_ERROR_ON_MSG(a->element_size() == 1, "8bit integer types only supported for aarch64");
#endif /* __aarch64__ */
    ARM_COMPUTE_RETURN_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN(a, 1, DataType::U8, DataType::QASYMM8, DataType::QASYMM8_SIGNED, DataType::S8, DataType::BFLOAT16, DataType::F16, DataType::F32);
    ARM_COMPUTE_RETURN_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN(b, 1, DataType::U8, DataType::QASYMM8, DataType::QASYMM8_SIGNED, DataType::QSYMM8_PER_CHANNEL, DataType::S8, DataType::BFLOAT16, DataType::F16, DataType::F32);
    if(is_data_type_quantized_per_channel(b->data_type()))
    {
        ARM_COMPUTE_RETURN_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN(a, 1, DataType::QASYMM8_SIGNED, DataType::S8);
    }
    else
    {
        ARM_COMPUTE_RETURN_ERROR_ON_MISMATCHING_DATA_TYPES(a, b);
    }
    ARM_COMPUTE_RETURN_ERROR_ON_MSG(a->data_type() == DataType::F32 && d->data_type() != DataType::F32, "Only F32 output supported for F32 input");
    ARM_COMPUTE_RETURN_ERROR_ON_MSG(a->data_type() == DataType::F16 && d->data_type() != DataType::F16, "Only F16 output supported for F16 input");
    ARM_COMPUTE_RETURN_ERROR_ON_MSG(a->data_type() == DataType::BFLOAT16 && d->data_type() != DataType::F32, "Only F32 output supported for BFLOAT16 input");
    ARM_COMPUTE_RETURN_ERROR_ON_MSG(a->data_type() == DataType::U8 && d->data_type() != DataType::U32, "Only U32 output supported for U8 input");
    ARM_COMPUTE_RETURN_ERROR_ON_MSG(a->data_type() == DataType::S8 && d->data_type() != DataType::S32, "Only S32 output supported for S8 input");
    ARM_COMPUTE_RETURN_ERROR_ON_MSG(a->data_type() == DataType::QASYMM8 && d->data_type() != DataType::QASYMM8, "Only QASYMM8 output supported for QASYMM8 input");
    arm_compute::WeightFormat expected_weight_format;
    const Status ret = CpuGemmAssemblyDispatch::has_opt_impl(expected_weight_format, a, b, c, d, info);
    if((bool)ret && expected_weight_format != arm_compute::WeightFormat::ANY)
    {
        // Correctness check: if the format expected by the kernel is
        // not "any", make sure that the one found matches the format
        // intended by the caller.
        ARM_COMPUTE_RETURN_ERROR_ON_MSG((expected_weight_format != info.weight_format),
                                        "The format expected by the kernel does not correspond with the one requested by the user.");
    }
    return ret;
}
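validate() above is built entirely from early-return macros: each ARM_COMPUTE_RETURN_ERROR_ON_MSG-style check bails out with a descriptive Status, so falling through to the end means every constraint held. The style can be reproduced in a self-contained sketch (Status, the RETURN_ERROR_ON_MSG macro and the DataType enum here are simplified stand-ins, not the arm_compute definitions):

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for arm_compute::Status.
struct Status
{
    bool        ok;
    std::string msg;
};

// Simplified stand-in for ARM_COMPUTE_RETURN_ERROR_ON_MSG: if the
// condition holds, return an error Status carrying the message.
#define RETURN_ERROR_ON_MSG(cond, m) \
    do                               \
    {                                \
        if(cond)                     \
        {                            \
            return Status{false, (m)}; \
        }                            \
    } while(0)

enum class DataType { F32, F16, U8 };

// Mirrors two of the input/output pairing rules from validate().
Status validate(DataType a, DataType d)
{
    RETURN_ERROR_ON_MSG(a == DataType::F32 && d != DataType::F32,
                        "Only F32 output supported for F32 input");
    RETURN_ERROR_ON_MSG(a == DataType::F16 && d != DataType::F16,
                        "Only F16 output supported for F16 input");
    return Status{true, ""};
}
```

Because every check returns immediately, the first violated constraint determines the error message the caller sees, which is why the real validate() orders its cheapest checks (null pointers, data types) first.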

◆ workspace()

experimental::MemoryRequirements workspace ( ) const
overridevirtual

Return the workspace memory requirements.

Reimplemented from INEOperator.

Definition at line 883 of file CpuGemmAssemblyDispatch.cpp.

References ARM_COMPUTE_ERROR_ON.

{
    ARM_COMPUTE_ERROR_ON(_arm_gemm == nullptr);
    return _arm_gemm->workspace();
}

The documentation for this class was generated from the following files:

CpuGemmAssemblyDispatch.h
CpuGemmAssemblyDispatch.cpp