Compute Library
 21.02
CLGEMMMatrixMultiplyKernel Class Reference

OpenCL kernel to multiply two input matrices "A" and "B" and add a matrix "C" if provided. More...

#include <CLGEMMMatrixMultiplyKernel.h>

Collaboration diagram for CLGEMMMatrixMultiplyKernel

Public Member Functions

 CLGEMMMatrixMultiplyKernel ()
 Default constructor. More...
 
 CLGEMMMatrixMultiplyKernel (const CLGEMMMatrixMultiplyKernel &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
CLGEMMMatrixMultiplyKernel & operator= (const CLGEMMMatrixMultiplyKernel &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
 CLGEMMMatrixMultiplyKernel (CLGEMMMatrixMultiplyKernel &&)=default
 Allow instances of this class to be moved. More...
 
CLGEMMMatrixMultiplyKernel & operator= (CLGEMMMatrixMultiplyKernel &&)=default
 Allow instances of this class to be moved. More...
 
void configure (const ICLTensor *input0, const ICLTensor *input1, const ICLTensor *input2, ICLTensor *output, float alpha, float beta=0.f, bool is_interleaved_transposed=true, const GEMMReshapeInfo &reshape_info=GEMMReshapeInfo(), bool fp_mixed_precision=false, const ActivationLayerInfo &activation_info=ActivationLayerInfo())
 Initialise the kernel's input, output and alpha. More...
 
void configure (const CLCompileContext &compile_context, const ICLTensor *input0, const ICLTensor *input1, const ICLTensor *input2, ICLTensor *output, float alpha, float beta=0.f, bool is_interleaved_transposed=true, const GEMMReshapeInfo &reshape_info=GEMMReshapeInfo(), bool fp_mixed_precision=false, const ActivationLayerInfo &activation_info=ActivationLayerInfo())
 Initialise the kernel's input, output and alpha. More...
 
void run (const Window &window, cl::CommandQueue &queue) override
 Enqueue the OpenCL kernel to process the given window on the passed OpenCL command queue. More...
 
- Public Member Functions inherited from ICLKernel
 ICLKernel ()
 Constructor. More...
 
cl::Kernel & kernel ()
 Returns a reference to the OpenCL kernel of this object. More...
 
template<typename T >
void add_1D_array_argument (unsigned int &idx, const ICLArray< T > *array, const Strides &strides, unsigned int num_dimensions, const Window &window)
 Add the passed 1D array's parameters to the object's kernel's arguments starting from the index idx. More...
 
void add_1D_tensor_argument (unsigned int &idx, const ICLTensor *tensor, const Window &window)
 Add the passed 1D tensor's parameters to the object's kernel's arguments starting from the index idx. More...
 
void add_1D_tensor_argument_if (bool cond, unsigned int &idx, const ICLTensor *tensor, const Window &window)
 Add the passed 1D tensor's parameters to the object's kernel's arguments starting from the index idx if the condition is true. More...
 
void add_2D_tensor_argument (unsigned int &idx, const ICLTensor *tensor, const Window &window)
 Add the passed 2D tensor's parameters to the object's kernel's arguments starting from the index idx. More...
 
void add_2D_tensor_argument_if (bool cond, unsigned int &idx, const ICLTensor *tensor, const Window &window)
 Add the passed 2D tensor's parameters to the object's kernel's arguments starting from the index idx if the condition is true. More...
 
void add_3D_tensor_argument (unsigned int &idx, const ICLTensor *tensor, const Window &window)
 Add the passed 3D tensor's parameters to the object's kernel's arguments starting from the index idx. More...
 
void add_4D_tensor_argument (unsigned int &idx, const ICLTensor *tensor, const Window &window)
 Add the passed 4D tensor's parameters to the object's kernel's arguments starting from the index idx. More...
 
virtual void run_op (ITensorPack &tensors, const Window &window, cl::CommandQueue &queue)
 Enqueue the OpenCL kernel to process the given window on the passed OpenCL command queue. More...
 
template<typename T >
void add_argument (unsigned int &idx, T value)
 Add the passed parameters to the object's kernel's arguments starting from the index idx. More...
 
void set_lws_hint (const cl::NDRange &lws_hint)
 Set the Local-Workgroup-Size hint. More...
 
cl::NDRange lws_hint () const
 Return the Local-Workgroup-Size hint. More...
 
void set_wbsm_hint (const cl_int &wbsm_hint)
 Set the workgroup batch size modifier hint. More...
 
cl_int wbsm_hint () const
 Return the workgroup batch size modifier hint. More...
 
const std::string & config_id () const
 Get the configuration ID. More...
 
void set_target (GPUTarget target)
 Set the targeted GPU architecture. More...
 
void set_target (cl::Device &device)
 Set the targeted GPU architecture according to the CL device. More...
 
GPUTarget get_target () const
 Get the targeted GPU architecture. More...
 
size_t get_max_workgroup_size ()
 Get the maximum workgroup size for the device the CLKernelLibrary uses. More...
 
template<unsigned int dimension_size>
void add_tensor_argument (unsigned &idx, const ICLTensor *tensor, const Window &window)
 
template<typename T , unsigned int dimension_size>
void add_array_argument (unsigned &idx, const ICLArray< T > *array, const Strides &strides, unsigned int num_dimensions, const Window &window)
 Add the passed array's parameters to the object's kernel's arguments starting from the index idx. More...
 
- Public Member Functions inherited from IKernel
 IKernel ()
 Constructor. More...
 
virtual ~IKernel ()=default
 Destructor. More...
 
virtual bool is_parallelisable () const
 Indicates whether or not the kernel is parallelisable. More...
 
virtual BorderSize border_size () const
 The size of the border for that kernel. More...
 
const Window & window () const
 The maximum window the kernel can be executed on. More...
 

Static Public Member Functions

static Status validate (const ITensorInfo *input0, const ITensorInfo *input1, const ITensorInfo *input2, const ITensorInfo *output, float alpha, float beta, bool is_interleaved_transposed, const GEMMReshapeInfo &reshape_info, GPUTarget gpu_target, bool fp_mixed_precision=false, const ActivationLayerInfo &activation_info=ActivationLayerInfo())
 Static function to check if given info will lead to a valid configuration of CLGEMMMatrixMultiplyKernel. More...
 
- Static Public Member Functions inherited from ICLKernel
static constexpr unsigned int num_arguments_per_1D_array ()
 Returns the number of arguments enqueued per 1D array object. More...
 
static constexpr unsigned int num_arguments_per_1D_tensor ()
 Returns the number of arguments enqueued per 1D tensor object. More...
 
static constexpr unsigned int num_arguments_per_2D_tensor ()
 Returns the number of arguments enqueued per 2D tensor object. More...
 
static constexpr unsigned int num_arguments_per_3D_tensor ()
 Returns the number of arguments enqueued per 3D tensor object. More...
 
static constexpr unsigned int num_arguments_per_4D_tensor ()
 Returns the number of arguments enqueued per 4D tensor object. More...
 
static cl::NDRange gws_from_window (const Window &window)
 Get the global work size given an execution window. More...
 

Data Fields

const ICLTensor * _input0
 
const ICLTensor * _input1
 
const ICLTensor * _input2
 
ICLTensor * _output
 
bool _slide_matrix_b
 
bool _reinterpret_input_as_3d
 
bool _reinterpret_output_as_3d
 
bool _add_bias
 
bool _broadcast_bias
 

Detailed Description

OpenCL kernel to multiply two input matrices "A" and "B" and add a matrix "C" if provided.

All elements of the output matrix will be multiplied by alpha. If matrix C is passed, it will be added to the previous result. For matrix C, broadcast addition is supported if the flag "broadcast_bias" is set in the GEMMReshapeInfo object.

Note
If the input tensors input0 and input1 have been reshaped respectively with CLGEMMReshapeLHSMatrixKernel and CLGEMMReshapeRHSMatrixKernel, the flag is_interleaved_transposed must be set to true.
Attention
input1 tensor must have at least 2 dimensions (matrix)

Definition at line 42 of file CLGEMMMatrixMultiplyKernel.h.
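In formula terms, the kernel computes output = alpha * (A x B), optionally adding beta * C, where C may be broadcast across rows. The following is a minimal standalone reference model of that arithmetic in plain C++ (an illustration only, not the OpenCL implementation; `gemm_reference` is a hypothetical name):

```cpp
#include <cassert>
#include <vector>

// Reference semantics: out = alpha * (A x B) [+ beta * C].
// A is M x K, B is K x N, C (optional bias) is M x N, or a 1 x N row
// broadcast to every output row when broadcast_bias is set.
std::vector<float> gemm_reference(const std::vector<float> &a, const std::vector<float> &b,
                                  const std::vector<float> *c, // may be nullptr
                                  int m, int n, int k,
                                  float alpha, float beta, bool broadcast_bias)
{
    std::vector<float> out(m * n, 0.f);
    for(int i = 0; i < m; ++i)
    {
        for(int j = 0; j < n; ++j)
        {
            float acc = 0.f;
            for(int p = 0; p < k; ++p)
            {
                acc += a[i * k + p] * b[p * n + j]; // dot product of row i of A and column j of B
            }
            out[i * n + j] = alpha * acc;
            if(c != nullptr)
            {
                // With broadcast_bias, the same 1 x N row of C is added to every output row
                const float bias = broadcast_bias ? (*c)[j] : (*c)[i * n + j];
                out[i * n + j] += beta * bias;
            }
        }
    }
    return out;
}
```

Note the kernel itself only supports beta = 1 when C is provided (see configure() below); the reference model accepts any beta purely to make the roles of the parameters explicit.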

Constructor & Destructor Documentation

◆ CLGEMMMatrixMultiplyKernel() [1/3]

Default constructor.

◆ CLGEMMMatrixMultiplyKernel() [2/3]

Prevent instances of this class from being copied (As this class contains pointers)

◆ CLGEMMMatrixMultiplyKernel() [3/3]

Allow instances of this class to be moved.

Member Function Documentation

◆ configure() [1/2]

void configure ( const ICLTensor *  input0,
const ICLTensor *  input1,
const ICLTensor *  input2,
ICLTensor *  output,
float  alpha,
float  beta = 0.f,
bool  is_interleaved_transposed = true,
const GEMMReshapeInfo &  reshape_info = GEMMReshapeInfo(),
bool  fp_mixed_precision = false,
const ActivationLayerInfo &  activation_info = ActivationLayerInfo() 
)

Initialise the kernel's input, output and alpha.

Parameters
[in]  input0  Input tensor containing the Matrix A. Data types supported: F16/F32
[in]  input1  Input tensor containing the Matrix B. Data type supported: same as input0
[in]  input2  Input tensor containing the Matrix C (bias). Can be nullptr. Data type supported: same as input0
[out]  output  Output tensor to store the result of matrix multiplication. Data type supported: same as input0
[in]  alpha  Weight of the matrix product
[in]  beta  (Optional) Weight of matrix C. Default value is 0. Only beta = 1 is currently supported.
[in]  is_interleaved_transposed  (Optional) True if input0 and input1 have been reshaped respectively using CLGEMMReshapeLHSMatrixKernel and CLGEMMReshapeRHSMatrixKernel
[in]  reshape_info  (Optional) GEMM reshape info. If is_interleaved_transposed = true, this object must contain the information to understand how the matrix A and matrix B have been reshaped
[in]  fp_mixed_precision  (Optional) Use wider accumulators (32 bit instead of 16 for FP16) to improve accuracy
[in]  activation_info  (Optional) Activation to apply after the matrix multiplication

Definition at line 275 of file CLGEMMMatrixMultiplyKernel.cpp.

References CLKernelLibrary::get().

277 {
278  configure(CLKernelLibrary::get().get_compile_context(), input0, input1, input2, output, alpha, beta, is_interleaved_transposed, reshape_info, fp_mixed_precision, activation_info);
279 }

◆ configure() [2/2]

void configure ( const CLCompileContext &  compile_context,
const ICLTensor *  input0,
const ICLTensor *  input1,
const ICLTensor *  input2,
ICLTensor *  output,
float  alpha,
float  beta = 0.f,
bool  is_interleaved_transposed = true,
const GEMMReshapeInfo &  reshape_info = GEMMReshapeInfo(),
bool  fp_mixed_precision = false,
const ActivationLayerInfo &  activation_info = ActivationLayerInfo() 
)

Initialise the kernel's input, output and alpha.

Parameters
[in]  compile_context  The compile context to be used.
[in]  input0  Input tensor containing the Matrix A. Data types supported: F16/F32
[in]  input1  Input tensor containing the Matrix B. Data type supported: same as input0
[in]  input2  Input tensor containing the Matrix C (bias). Can be nullptr. Data type supported: same as input0
[out]  output  Output tensor to store the result of matrix multiplication. Data type supported: same as input0
[in]  alpha  Weight of the matrix product
[in]  beta  (Optional) Weight of matrix C. Default value is 0. Only beta = 1 is currently supported.
[in]  is_interleaved_transposed  (Optional) True if input0 and input1 have been reshaped respectively using CLGEMMReshapeLHSMatrixKernel and CLGEMMReshapeRHSMatrixKernel
[in]  reshape_info  (Optional) GEMM reshape info. If is_interleaved_transposed = true, this object must contain the information to understand how the matrix A and matrix B have been reshaped
[in]  fp_mixed_precision  (Optional) Use wider accumulators (32 bit instead of 16 for FP16) to improve accuracy
[in]  activation_info  (Optional) Activation to apply after the matrix multiplication

Definition at line 281 of file CLGEMMMatrixMultiplyKernel.cpp.

References CLGEMMMatrixMultiplyKernel::_add_bias, CLGEMMMatrixMultiplyKernel::_broadcast_bias, CLGEMMMatrixMultiplyKernel::_input0, CLGEMMMatrixMultiplyKernel::_input1, CLGEMMMatrixMultiplyKernel::_input2, CLGEMMMatrixMultiplyKernel::_output, CLGEMMMatrixMultiplyKernel::_reinterpret_input_as_3d, CLGEMMMatrixMultiplyKernel::_reinterpret_output_as_3d, CLGEMMMatrixMultiplyKernel::_slide_matrix_b, ActivationLayerInfo::a(), ActivationLayerInfo::activation(), CLBuildOptions::add_option(), CLBuildOptions::add_option_if(), ARM_COMPUTE_ERROR_ON, ARM_COMPUTE_ERROR_ON_NULLPTR, ARM_COMPUTE_ERROR_THROW_ON, ActivationLayerInfo::b(), arm_compute::BIFROST, arm_compute::create_kernel(), ITensorInfo::data_type(), ITensorInfo::dimension(), ActivationLayerInfo::enabled(), arm_compute::F16, arm_compute::F32, arm_compute::float_to_string_with_full_precision(), arm_compute::get_arch_from_target(), arm_compute::get_cl_type_from_data_type(), arm_compute::get_padding_info(), ICLKernel::get_target(), arm_compute::has_padding_changed(), ITensor::info(), arm_compute::is_data_type_float(), arm_compute::helpers::float_ops::is_one(), arm_compute::helpers::float_ops::is_zero(), kernel_name, arm_compute::lower_string(), ITensorInfo::num_dimensions(), CLBuildOptions::options(), ICLKernel::set_lws_hint(), arm_compute::string_from_activation_func(), arm_compute::string_from_data_type(), arm_compute::support::cpp11::to_string(), and arm_compute::validate_arguments().

284 {
285  ARM_COMPUTE_ERROR_ON_NULLPTR(input0, input1, output);
286 
287  // Perform validate step
288  ARM_COMPUTE_ERROR_THROW_ON(validate_arguments(input0->info(), input1->info(), (input2 != nullptr) ? input2->info() : nullptr, output->info(), beta,
289  is_interleaved_transposed, reshape_info, fp_mixed_precision));
290 
291  auto padding_info = is_interleaved_transposed ? get_padding_info({ input0, input1, output }) : get_padding_info({ input0, output });
292 
293  _input0 = input0;
294  _input1 = input1;
295  _input2 = helpers::float_ops::is_zero(beta) ? nullptr : input2;
296  _output = output;
297  _reinterpret_input_as_3d = reshape_info.reinterpret_input_as_3d();
298  _reinterpret_output_as_3d = (reshape_info.depth_output_gemm3d() != 0);
299  _add_bias = _input2 != nullptr;
300  _broadcast_bias = reshape_info.broadcast_bias();
301 
302  // In case both input and output have to be reinterpreted as 3D tensors,
303  // force reinterpret_input_as_3d and reinterpret_output_as_3d to be false.
304  if(_reinterpret_input_as_3d == _reinterpret_output_as_3d)
305  {
306  _reinterpret_input_as_3d = false;
307  _reinterpret_output_as_3d = false;
308  }
309 
310  // Check if we need to slide the matrix B
311  const unsigned int num_dimensions_input0 = _reinterpret_input_as_3d ? _input0->info()->num_dimensions() - 1 : _input0->info()->num_dimensions();
312 
313  _slide_matrix_b = (_input1->info()->num_dimensions() >= num_dimensions_input0);
314 
315  const DataType data_type = input0->info()->data_type();
316 
317  // Get target architecture
318  GPUTarget gpu_target = get_target();
319 
320  ElementsProcessed num_elements_processed{};
321 
322  // Configure kernel window
323  auto win_config = validate_and_configure_window(input0->info(), input1->info(), (input2 != nullptr) ? input2->info() : nullptr, output->info(), beta, is_interleaved_transposed, reshape_info,
324  gpu_target, num_elements_processed);
325  ARM_COMPUTE_ERROR_THROW_ON(win_config.first);
326  ICLKernel::configure_internal(win_config.second);
327 
328  // If _reinterpret_input_as_3d = _reinterpret_output_as_3d = true, both will be turned off (false)
329  // in which case we will dispatch a batched-GEMM to reduce the complexity of the address calculation within the OpenCL kernel.
330  // This means that the actual m used by the kernel is given by output->info()->dimension(1)
331  const unsigned int internal_m = _reinterpret_output_as_3d ? output->info()->dimension(1) * output->info()->dimension(2) : output->info()->dimension(1);
332  const unsigned int n = output->info()->dimension(0);
333 
334  const unsigned int h_gemm_3d = _reinterpret_output_as_3d ? output->info()->dimension(1) : input0->info()->dimension(1);
335  const unsigned int d_gemm_3d = _reinterpret_output_as_3d ? output->info()->dimension(2) : input0->info()->dimension(2);
336 
337  const unsigned int m0 = num_elements_processed.y();
338  const unsigned int n0 = num_elements_processed.x();
339 
340  // Calculate partial (store instead of load) M0 and partial N0 for the partial blocks at the end of a row/column if any. This is to avoid padding.
341  const unsigned int partial_store_m0 = internal_m % m0;
342  const unsigned int partial_store_n0 = n % n0;
343 
344  // Create build options
345  CLBuildOptions build_opts;
346 
347  build_opts.add_option_if(!(helpers::float_ops::is_one(alpha)), "-DALPHA=" + float_to_string_with_full_precision(alpha));
348  build_opts.add_option_if(_input2 != nullptr, "-DBETA=" + float_to_string_with_full_precision(beta));
349  build_opts.add_option_if(helpers::float_ops::is_one(beta), "-DUNIT_BETA");
350  build_opts.add_option_if(reshape_info.broadcast_bias(), "-DBROADCAST_BIAS");
351  build_opts.add_option_if(_reinterpret_input_as_3d, "-DREINTERPRET_INPUT_AS_3D");
352  build_opts.add_option_if(_reinterpret_output_as_3d, "-DREINTERPRET_OUTPUT_AS_3D");
353  build_opts.add_option_if(_reinterpret_input_as_3d || _reinterpret_output_as_3d, "-DHEIGHT_GEMM3D=" + support::cpp11::to_string(h_gemm_3d));
354  build_opts.add_option_if(_reinterpret_input_as_3d || _reinterpret_output_as_3d, "-DDEPTH_GEMM3D=" + support::cpp11::to_string(d_gemm_3d));
355  build_opts.add_option_if(!_slide_matrix_b, "-DMATRIX_B_DEPTH=" + support::cpp11::to_string(input1->info()->dimension(2)));
356  build_opts.add_option_if(activation_info.enabled(), "-DACTIVATION_TYPE=" + lower_string(string_from_activation_func(activation_info.activation())));
357  build_opts.add_option_if(activation_info.enabled(), "-DA_VAL=" + float_to_string_with_full_precision(activation_info.a()));
358  build_opts.add_option_if(activation_info.enabled(), "-DB_VAL=" + float_to_string_with_full_precision(activation_info.b()));
359 
360  const bool is_bifrost = get_arch_from_target(gpu_target) == GPUTarget::BIFROST;
361 
362  std::string kernel_name;
363  if(is_interleaved_transposed)
364  {
365  const int mult_transpose1xW_width = reshape_info.mult_transpose1xW_width();
366  const int mult_interleave4x4_height = reshape_info.mult_interleave4x4_height();
367 
368  build_opts.add_option("-DM=" + support::cpp11::to_string(internal_m));
369  build_opts.add_option("-DN=" + support::cpp11::to_string(n));
370  build_opts.add_option("-DK=" + support::cpp11::to_string(input1->info()->dimension(0) / (n0 * mult_transpose1xW_width)));
371  build_opts.add_option("-DH0=" + support::cpp11::to_string(mult_transpose1xW_width));
372  build_opts.add_option("-DV0=" + support::cpp11::to_string(mult_interleave4x4_height));
373  build_opts.add_option("-DPARTIAL_STORE_M0=" + support::cpp11::to_string(partial_store_m0));
374  build_opts.add_option("-DPARTIAL_STORE_N0=" + support::cpp11::to_string(partial_store_n0));
375 
376  if(is_data_type_float(data_type) && is_bifrost)
377  {
378  kernel_name = "gemm_mm_interleaved_transposed_" + lower_string(string_from_data_type(data_type)) + "_bifrost";
379  }
380  else
381  {
382  kernel_name = "gemm_mm_interleaved_transposed_" + lower_string(string_from_data_type(data_type));
383  if(fp_mixed_precision && data_type == DataType::F16)
384  {
385  // currently wider accumulator is only supported for fp16 kernels.
386  kernel_name += "_acc32";
387  }
388  }
389  }
390  else // The input tensors have not been reshaped
391  {
392  build_opts.add_option("-DN=" + support::cpp11::to_string(n));
393  build_opts.add_option("-DK=" + support::cpp11::to_string(input0->info()->dimension(0)));
394  build_opts.add_option("-DDATA_TYPE=" + get_cl_type_from_data_type(data_type));
395  build_opts.add_option("-DM0=" + support::cpp11::to_string(m0));
396  build_opts.add_option("-DN0=" + support::cpp11::to_string(n0));
397  build_opts.add_option("-DPARTIAL_STORE_M0=" + support::cpp11::to_string(partial_store_m0));
398  build_opts.add_option("-DPARTIAL_STORE_N0=" + support::cpp11::to_string(partial_store_n0));
399 
400  // Create kernels according to the architecture, data type and input size.
401  if(is_data_type_float(data_type) && is_bifrost)
402  {
403  kernel_name = "gemm_mm_floating_point";
404 
405  if(input0->info()->num_dimensions() != 1)
406  {
407  kernel_name += "_" + lower_string(string_from_data_type(data_type)) + "_bifrost";
408  if(fp_mixed_precision && data_type == DataType::F16)
409  {
410  // currently wider accumulator is only supported for fp16 kernels.
411  kernel_name += "_acc32";
412  }
413  }
414  else if(input1->info()->dimension(0) <= 1000 && data_type == DataType::F32)
415  {
416  // The first kernel is optimized for the case of 1000 or less output elements (e.g. FC8 of AlexNet and VGG-16, and
417  // FC1 of Inception v3). The second kernel is optimized for the case of greater than 1000 output elements (e.g.
418  // FC6 and FC7 of AlexNet and VGG-16).
419  kernel_name += "_" + lower_string(string_from_data_type(data_type)) + "_bifrost_1000";
420  }
421 
422  // The work-group size equal to the Bifrost quad size has been proved to be optimal for these kernels
423  // via exhaustive autotuning over a range of representative layer configurations.
424  set_lws_hint(cl::NDRange(4));
425  }
426  else // (MIDGARD and F32) or (F16)
427  {
428  kernel_name = "gemm_mm_floating_point";
429  }
430  }
431 
432  // Create kernel
433  _kernel = create_kernel(compile_context, kernel_name, build_opts.options());
434 
435  // Set config_id for enabling LWS tuning
436  _config_id = "gemm_";
437  _config_id += (is_interleaved_transposed ? "reshaped_" : "");
438  _config_id += (_add_bias ? "add_bias_" : "");
439  _config_id += (_broadcast_bias ? "broadcast_bias_" : "");
440  _config_id += (fp_mixed_precision ? "fp_mixed_" : "");
441  _config_id += (_reinterpret_input_as_3d ? "3di_" : "");
442  _config_id += (_reinterpret_output_as_3d ? "3do_" : "");
443  _config_id += lower_string(string_from_data_type(input0->info()->data_type()));
444  _config_id += "_";
445  _config_id += support::cpp11::to_string(output->info()->dimension(1));
446  _config_id += "_";
447  _config_id += support::cpp11::to_string(output->info()->dimension(0));
448  _config_id += "_";
449  _config_id += support::cpp11::to_string(output->info()->dimension(2));
450  _config_id += "_";
451  _config_id += support::cpp11::to_string(output->info()->dimension(3));
452  _config_id += "_";
453  _config_id += (is_interleaved_transposed ? support::cpp11::to_string(input1->info()->dimension(0)) : support::cpp11::to_string(input1->info()->dimension(1)));
454 
454 
455  ARM_COMPUTE_ERROR_ON(has_padding_changed(padding_info));
456 }
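The listing above derives PARTIAL_STORE_M0 and PARTIAL_STORE_N0 as the remainders internal_m % m0 and n % n0, so the trailing block at the end of each row/column can be stored with a smaller tile instead of requiring the output to be padded. A small standalone sketch of that tiling arithmetic (hypothetical helper names, not part of the library):

```cpp
#include <cassert>

// Tiling of an M x N output processed in m0 x n0 blocks, mirroring the
// PARTIAL_STORE_M0 / PARTIAL_STORE_N0 build options set by configure().
struct TileInfo
{
    unsigned int full_tiles_m; // number of complete tile rows
    unsigned int full_tiles_n; // number of complete tile columns
    unsigned int partial_m0;   // rows in the trailing partial tile (0 if m0 divides M)
    unsigned int partial_n0;   // columns in the trailing partial tile (0 if n0 divides N)
};

TileInfo compute_tiles(unsigned int m, unsigned int n, unsigned int m0, unsigned int n0)
{
    return TileInfo{ m / m0, n / n0, m % m0, n % n0 };
}
```

For example, a 7 x 10 output with 4 x 4 tiles has one full tile row, two full tile columns, and a 3-row / 2-column partial fringe, which the kernel stores directly rather than padding the tensor up to 8 x 12.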

◆ operator=() [1/2]

CLGEMMMatrixMultiplyKernel& operator= ( const CLGEMMMatrixMultiplyKernel & )
delete

Prevent instances of this class from being copied (As this class contains pointers)

◆ operator=() [2/2]

CLGEMMMatrixMultiplyKernel& operator= ( CLGEMMMatrixMultiplyKernel && )
default

Allow instances of this class to be moved.

◆ run()

void run ( const Window &  window,
cl::CommandQueue &  queue 
)
overridevirtual

Enqueue the OpenCL kernel to process the given window on the passed OpenCL command queue.

Note
The queue is not flushed by this method, and therefore the kernel will not have been executed by the time this method returns.
Parameters
[in]  window  Region on which to execute the kernel. (Must be a valid region of the window returned by window()).
[in,out]  queue  Command queue on which to enqueue the kernel.

Reimplemented from ICLKernel.

Definition at line 480 of file CLGEMMMatrixMultiplyKernel.cpp.

References CLGEMMMatrixMultiplyKernel::_add_bias, CLGEMMMatrixMultiplyKernel::_input0, CLGEMMMatrixMultiplyKernel::_input1, CLGEMMMatrixMultiplyKernel::_input2, CLGEMMMatrixMultiplyKernel::_output, CLGEMMMatrixMultiplyKernel::_reinterpret_input_as_3d, CLGEMMMatrixMultiplyKernel::_reinterpret_output_as_3d, CLGEMMMatrixMultiplyKernel::_slide_matrix_b, ICLKernel::add_2D_tensor_argument(), ARM_COMPUTE_ERROR_ON, ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW, ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL, BorderSize::bottom, Window::DimX, Window::DimY, arm_compute::enqueue(), Window::first_slice_window_3D(), ITensor::info(), ICLKernel::lws_hint(), ICLKernel::num_arguments_per_2D_tensor(), ITensorInfo::num_dimensions(), ITensorInfo::padding(), Window::set(), arm_compute::test::validation::reference::slice(), Window::slide_window_slice_3D(), ITensorInfo::strides_in_bytes(), BorderSize::top, and IKernel::window().

481 {
482  ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL(*this);
483  ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW(IKernel::window(), window);
484 
485  if(_input1->info()->num_dimensions() < 3)
486  {
487  // The stride_z for matrix B must be zero if we do not slice
488  ARM_COMPUTE_ERROR_ON(_input1->info()->strides_in_bytes()[3] != 0);
489  }
490 
491  Window slice = window.first_slice_window_3D();
492  Window slice_matrix_b = slice;
493 
494  slice_matrix_b.set(Window::DimX, Window::Dimension(0, 1, 1));
495  slice_matrix_b.set(Window::DimY, Window::Dimension(0, 1, 1));
496 
497  const unsigned int num_arguments_bias = _add_bias ? num_arguments_per_2D_tensor() + 1 : 0;
498 
499  if(_reinterpret_input_as_3d)
500  {
501  // Pass bottom paddings to the kernel if the input has to be reinterpreted as 3D tensor
502  const unsigned int idx0 = 3 * num_arguments_per_2D_tensor() + 3 + num_arguments_bias;
503  const unsigned int total_cross_plane_pad = _input0->info()->padding().top + _input0->info()->padding().bottom;
504  _kernel.setArg<cl_uint>(idx0, static_cast<unsigned int>(total_cross_plane_pad));
505  }
506 
507  if(_reinterpret_output_as_3d)
508  {
509  // Pass bottom paddings to the kernel if the output has to be reinterpreted as 3D tensor
510  const unsigned int idx0 = 3 * num_arguments_per_2D_tensor() + 3 + (_reinterpret_input_as_3d ? 1 : 0) + num_arguments_bias;
511  const unsigned int total_cross_plane_pad = _output->info()->padding().top + _output->info()->padding().bottom;
512  _kernel.setArg<cl_uint>(idx0, static_cast<unsigned int>(total_cross_plane_pad));
513  }
514 
515  do
516  {
517  Window slice_b = slice;
518  // Don't slice matrix B along the z dimension if matrix B has just 2 dimensions and matrix A more than 2
519  // This scenario can happen when the matrix multiplication is used to perform a convolution operation
520  if(!_slide_matrix_b)
521  {
522  slice_b = slice_matrix_b;
523  }
524 
525  unsigned int idx = 0;
526  add_2D_tensor_argument(idx, _input0, slice);
527  add_2D_tensor_argument(idx, _input1, slice_b);
528  if(_add_bias)
529  {
530  add_2D_tensor_argument(idx, _input2, slice);
531  }
532  add_2D_tensor_argument(idx, _output, slice);
533  _kernel.setArg<cl_uint>(idx++, static_cast<unsigned int>(_input0->info()->strides_in_bytes()[2]));
534  _kernel.setArg<cl_uint>(idx++, static_cast<unsigned int>(_input1->info()->strides_in_bytes()[2]));
535  if(_add_bias)
536  {
537  _kernel.setArg<cl_uint>(idx++, static_cast<unsigned int>(_input2->info()->strides_in_bytes()[2]));
538  }
539  _kernel.setArg<cl_uint>(idx++, static_cast<unsigned int>(_output->info()->strides_in_bytes()[2]));
540  enqueue(queue, *this, slice, lws_hint());
541  }
542  while(window.slide_window_slice_3D(slice));
543 }
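The loop above enqueues the kernel once per 3D slice of the execution window, but reuses the first (and only) z-slice of matrix B when B is 2D while A is batched. A standalone sketch of that slice-selection rule (hypothetical helper names, not library code):

```cpp
#include <cassert>
#include <vector>

// Mirrors the _slide_matrix_b decision made in configure(): B slides along
// the batch (z) dimension only if it has at least as many dimensions as A.
// A plain 2D B shared across a batched A (e.g. GEMM used to implement a
// convolution) is reused for every batch.
bool should_slide_matrix_b(unsigned int num_dims_a, unsigned int num_dims_b)
{
    return num_dims_b >= num_dims_a;
}

// For each batch index of A, the z-slice of B the kernel reads.
std::vector<unsigned int> matrix_b_slices(unsigned int num_batches, bool slide_b)
{
    std::vector<unsigned int> slices(num_batches);
    for(unsigned int z = 0; z < num_batches; ++z)
    {
        slices[z] = slide_b ? z : 0u; // clamp to slice 0 when B does not slide
    }
    return slices;
}
```

This is why run() substitutes slice_matrix_b (a window collapsed to a single slice) for slice_b whenever _slide_matrix_b is false.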

◆ validate()

Status validate ( const ITensorInfo *  input0,
const ITensorInfo *  input1,
const ITensorInfo *  input2,
const ITensorInfo *  output,
float  alpha,
float  beta,
bool  is_interleaved_transposed,
const GEMMReshapeInfo &  reshape_info,
GPUTarget  gpu_target,
bool  fp_mixed_precision = false,
const ActivationLayerInfo &  activation_info = ActivationLayerInfo() 
)
)
static

Static function to check if given info will lead to a valid configuration of CLGEMMMatrixMultiplyKernel.

Parameters
[in]  input0  Input tensor containing the Matrix A info. Data types supported: F16/F32
[in]  input1  Input tensor containing the Matrix B info. Data type supported: same as input0
[in]  input2  Input tensor containing the Matrix C (bias) info. Can be nullptr. Data type supported: same as input0
[in]  output  Output tensor to store the result of matrix multiplication. Data type supported: same as input0
[in]  alpha  Weight of the matrix product
[in]  beta  Weight of matrix C. Default value is 0. Only beta = 1 is currently supported.
[in]  is_interleaved_transposed  True if input0 and input1 have been reshaped respectively using CLGEMMReshapeLHSMatrixKernel and CLGEMMReshapeRHSMatrixKernel
[in]  reshape_info  GEMM reshape info. If is_interleaved_transposed = true, this object must contain the information to understand how the matrix A and matrix B have been reshaped
[in]  gpu_target  GPU Target
[in]  fp_mixed_precision  (Optional) Use wider accumulators (32 bit instead of 16 for FP16) to improve accuracy
[in]  activation_info  (Optional) Activation to apply after the matrix multiplication
Returns
a status

Definition at line 458 of file CLGEMMMatrixMultiplyKernel.cpp.

References ARM_COMPUTE_RETURN_ON_ERROR, ARM_COMPUTE_UNUSED, ICloneable< T >::clone(), and arm_compute::validate_arguments().

460 {
461  // Note: num_elements_processed will be set in validate_and_configure_window()
462  ElementsProcessed num_elements_processed{};
463  ARM_COMPUTE_UNUSED(alpha);
464  ARM_COMPUTE_UNUSED(activation_info);
465  ARM_COMPUTE_RETURN_ON_ERROR(validate_arguments(input0, input1, input2, output, beta, is_interleaved_transposed, reshape_info, fp_mixed_precision));
466  ARM_COMPUTE_RETURN_ON_ERROR(validate_and_configure_window(input0->clone().get(),
467  input1->clone().get(),
468  (input2 != nullptr) ? input2->clone().get() : nullptr,
469  output->clone().get(),
470  beta,
471  is_interleaved_transposed,
472  reshape_info,
473  gpu_target,
474  num_elements_processed)
475  .first);
476 
477  return Status{};
478 }

Field Documentation

◆ _add_bias

◆ _broadcast_bias

bool _broadcast_bias

Definition at line 119 of file CLGEMMMatrixMultiplyKernel.h.

Referenced by CLGEMMMatrixMultiplyKernel::configure().

◆ _input0

◆ _input1

◆ _input2

◆ _output

◆ _reinterpret_input_as_3d

bool _reinterpret_input_as_3d

◆ _reinterpret_output_as_3d

bool _reinterpret_output_as_3d

◆ _slide_matrix_b

bool _slide_matrix_b

The documentation for this class was generated from the following files:

CLGEMMMatrixMultiplyKernel.h
CLGEMMMatrixMultiplyKernel.cpp