Compute Library
 21.08
CpuGemmLowpMatrixMultiplyKernel Class Reference

Kernel to multiply matrices.

#include <CpuGemmLowpMatrixMultiplyKernel.h>

Public Member Functions

 CpuGemmLowpMatrixMultiplyKernel ()=default
 Default constructor.

 ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE (CpuGemmLowpMatrixMultiplyKernel)

void configure (const ITensorInfo *src0, const ITensorInfo *src1, ITensorInfo *dst)
 Initialise the kernel's input and output.

void run_op (ITensorPack &tensors, const Window &window, const ThreadInfo &info) override
 Execute the kernel on the passed window.

const char * name () const override
 Name of the kernel.
 
Public Member Functions inherited from ICPPKernel
virtual ~ICPPKernel ()=default
 Default destructor.

virtual void run (const Window &window, const ThreadInfo &info)
 Execute the kernel on the passed window.

virtual void run_nd (const Window &window, const ThreadInfo &info, const Window &thread_locator)
 Legacy compatibility layer for implementations which do not support thread_locator. In these cases we simply narrow the interface down to the legacy version.

Public Member Functions inherited from IKernel
 IKernel ()
 Constructor.

virtual ~IKernel ()=default
 Destructor.

virtual bool is_parallelisable () const
 Indicates whether or not the kernel is parallelisable.

virtual BorderSize border_size () const
 The size of the border for that kernel.

const Window & window () const
 The maximum window the kernel can be executed on.

bool is_window_configured () const
 Function to check if the embedded window of this kernel has been configured.
 

Static Public Member Functions

static Status validate (const ITensorInfo *src0, const ITensorInfo *src1, const ITensorInfo *dst)
 Static function to check if given info will lead to a valid configuration.
 

Detailed Description

Kernel to multiply matrices.

Note
CpuGemmLowpMatrixMultiplyKernel is a low-precision matrix product kernel. It performs the following computation:
  1. Convert a values from int8 to int32
  2. Convert b values from int8 to int32
  3. Compute the int32 matrix product of the resulting a * b and store the result as int32
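
The three steps above can be illustrated with a plain scalar reference. This is a minimal sketch, not the kernel's implementation: it assumes row-major inputs that have not been reshaped, and the name gemmlowp_reference is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference: C (m x n, int32) = A (m x k, int8) * B (k x n, int8).
// All matrices are row-major here; the real kernel works on reshaped inputs.
std::vector<int32_t> gemmlowp_reference(const std::vector<int8_t> &a,
                                        const std::vector<int8_t> &b,
                                        std::size_t m, std::size_t k, std::size_t n)
{
    std::vector<int32_t> c(m * n, 0);
    for(std::size_t row = 0; row < m; ++row)
    {
        for(std::size_t col = 0; col < n; ++col)
        {
            int32_t acc = 0;
            for(std::size_t i = 0; i < k; ++i)
            {
                const int32_t a_val = static_cast<int32_t>(a[row * k + i]); // step 1: widen a
                const int32_t b_val = static_cast<int32_t>(b[i * n + col]); // step 2: widen b
                acc += a_val * b_val;                                       // step 3: accumulate in int32
            }
            c[row * n + col] = acc;
        }
    }
    return c;
}
```

Widening before the multiply matters: two int8 values of -128 already give a product of 16384, and a dot product of length 2 exceeds the int16 range, so the accumulator is kept in int32.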

Definition at line 46 of file CpuGemmLowpMatrixMultiplyKernel.h.

Constructor & Destructor Documentation

◆ CpuGemmLowpMatrixMultiplyKernel()

Default constructor.

Member Function Documentation

◆ ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE()

ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE ( CpuGemmLowpMatrixMultiplyKernel  )

◆ configure()

void configure ( const ITensorInfo * src0,
const ITensorInfo * src1,
ITensorInfo * dst 
)

Initialise the kernel's input and output.

The input matrices src0 and src1 must be the output of the kernels: CpuGemmInterleave4x4Kernel and CpuGemmTranspose1xWKernel. These two kernels change the layout of the original matrices to be more cache-friendly.
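
To illustrate the kind of layout change involved, the sketch below interleaves a row-major matrix in blocks of four rows, so that the four values of one column within a block become contiguous. It is an assumption-laden illustration rather than the CpuGemmInterleave4x4Kernel implementation: it assumes the row count is a multiple of 4, and interleave4x4 is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: input is rows x cols, row-major, with rows % 4 == 0 (assumed).
// Output holds rows / 4 rows of 4 * cols values each, ordered as
//   a00 a10 a20 a30 a01 a11 a21 a31 ...
std::vector<int8_t> interleave4x4(const std::vector<int8_t> &in, std::size_t rows, std::size_t cols)
{
    std::vector<int8_t> out(rows * cols);
    std::size_t         o = 0;
    for(std::size_t block = 0; block < rows; block += 4) // one block of four input rows
    {
        for(std::size_t col = 0; col < cols; ++col) // walk the block column by column
        {
            for(std::size_t r = 0; r < 4; ++r) // four vertically adjacent values
            {
                out[o++] = in[(block + r) * cols + col];
            }
        }
    }
    return out;
}
```

With this layout, a 4-row slice of the output needs a single contiguous read of matrix A per column, which is one plausible reading of "more cache-friendly" here.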

Parameters
[in] src0 Input tensor info containing the interleaved Matrix A. Data types supported: U8/QASYMM8/S8/QASYMM8_SIGNED
[in] src1 Input tensor info containing the transposed1xW Matrix B. Data types supported: U8/QASYMM8/S8/QASYMM8_SIGNED/QSYMM8/QSYMM8_PER_CHANNEL
[out] dst Output tensor info to store the result of matrix multiplication. Data type supported: S32

Definition at line 896 of file CpuGemmLowpMatrixMultiplyKernel.cpp.

References ARM_COMPUTE_ERROR_ON_NULLPTR, ARM_COMPUTE_ERROR_THROW_ON, ARM_COMPUTE_UNUSED, arm_compute::calculate_max_window(), TensorShape::collapse(), ITensorInfo::dimension(), and ITensorInfo::tensor_shape().

{
    ARM_COMPUTE_UNUSED(src0);
    ARM_COMPUTE_ERROR_ON_NULLPTR(src0, src1, dst);
    ARM_COMPUTE_ERROR_THROW_ON(validate_arguments(src0, src1, dst));

    TensorShape in1_shape = src1->tensor_shape();
    in1_shape.collapse(2);

    _slide_matrix_b = in1_shape[2] != 1;

    constexpr unsigned int num_elems_processed_per_iteration_x = 16;
    constexpr unsigned int num_elems_processed_per_iteration_y = 4;

    Window win;
    // Check if the output tensor is a vector. If so, the kernel runs the vector-matrix multiplication
    if((dst->dimension(1) == 1))
    {
        // Configure kernel window
        win = calculate_max_window(*dst, Steps(num_elems_processed_per_iteration_x));
    }
    else
    {
        win = calculate_max_window(*dst, Steps(num_elems_processed_per_iteration_x, num_elems_processed_per_iteration_y));
    }

    ICpuKernel::configure(win);
}

◆ name()

const char * name ( ) const
override virtual

Name of the kernel.

Returns
Kernel name

Implements ICPPKernel.

Definition at line 1047 of file CpuGemmLowpMatrixMultiplyKernel.cpp.

{
    return "CpuGemmLowpMatrixMultiplyKernel";
}

◆ run_op()

void run_op ( ITensorPack & tensors,
const Window & window,
const ThreadInfo & info 
)
override virtual

Execute the kernel on the passed window.

Warning
If is_parallelisable() returns false then the passed window must be equal to window()
Note
The window has to be a region within the window returned by the window() method
The width of the window has to be a multiple of num_elems_processed_per_iteration().
Parameters
[in] tensors A vector containing the tensors to operate on.
[in] window Region on which to execute the kernel. (Must be a region of the window returned by window())
[in] info Info about executing thread and CPU.

Reimplemented from ICPPKernel.

Definition at line 931 of file CpuGemmLowpMatrixMultiplyKernel.cpp.

References arm_compute::ACL_DST, arm_compute::ACL_SRC_0, arm_compute::ACL_SRC_1, ARM_COMPUTE_ERROR, ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW, ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL, ARM_COMPUTE_UNUSED, arm_compute::ceil_to_multiple(), arm_compute::data_size_from_type(), Window::DimX, Window::DimY, Window::Dimension::end(), ITensorPack::get_const_tensor(), ITensorPack::get_tensor(), ThreadInfo::num_threads, arm_compute::QASYMM8, arm_compute::QASYMM8_SIGNED, arm_compute::S8, Window::set(), Window::Dimension::start(), ThreadInfo::thread_id, arm_compute::U8, IKernel::window(), Window::x(), and Window::y().

{

    auto src0 = tensors.get_const_tensor(TensorType::ACL_SRC_0);
    auto src1 = tensors.get_const_tensor(TensorType::ACL_SRC_1);
    auto dst = tensors.get_tensor(TensorType::ACL_DST);

    // Check if the output tensor is a vector. If so, the kernel runs the vector-matrix multiplication path
    if((dst->info()->dimension(1) == 1))
    {
        const auto width_matrix_a = static_cast<int>(src0->info()->dimension(0));
        const auto width_matrix_b = static_cast<int>(src1->info()->dimension(0));
        const auto width_out = static_cast<int>(dst->info()->dimension(0));
        const auto in_b_stride = static_cast<int>(src1->info()->strides_in_bytes()[1] / data_size_from_type(src1->info()->data_type()));

        // The implementation computes 16 elements per iteration
        const int window_start_x = 16 * info.thread_id;
        const int window_step_x = 16 * info.num_threads;
        // Make sure (window_end_x - window_start_x) is a multiple of window_step_x
        const int window_end_x = ceil_to_multiple(width_matrix_b - window_start_x, window_step_x) + window_start_x;

        Window win_out(window);
        win_out.set(Window::DimX, Window::Dimension(window_start_x, window_end_x, window_step_x));
        win_out.set(Window::DimY, Window::Dimension(0, 1, 1));

        Window win_a(window);
        win_a.set(Window::DimX, Window::Dimension(0, 0, 0));
        win_a.set(Window::DimY, Window::Dimension(0, 0, 0));

        Window win_b;
        // Don't slice matrix B along the z dimension if matrix B has just 2 dimensions and matrix A more than 2
        // This scenario can happen when the matrix multiplication is used to perform a convolution operation
        if(src1->info()->num_dimensions() >= 3)
        {
            win_b = window;
        }
        win_b.set(Window::DimX, Window::Dimension(window_start_x, window_end_x, window_step_x));
        win_b.set(Window::DimY, Window::Dimension(0, 1, 1));

        Iterator ina(src0, win_a);
        Iterator inb(src1, win_b);
        Iterator out(dst, win_out);

        switch(src0->info()->data_type())
        {
            case DataType::S8:
            {
                vector_matrix_multiply_s8(ina, inb, out, width_matrix_a, width_matrix_b, width_out, in_b_stride, window);
                break;
            }
            case DataType::U8:
            case DataType::QASYMM8:
            {
                vector_matrix_multiply_u8(ina, inb, out, width_matrix_a, width_matrix_b, width_out, in_b_stride, window);
                break;
            }
            default:
            {
                ARM_COMPUTE_ERROR("Not supported");
                break;
            }
        }
    }
    else
    {
        const size_t in_b_stride = src1->info()->strides_in_bytes()[1];
        const int width_b = src1->info()->dimension(0);

        // Set step_x and step_y for matrix A. Scale the Y range by a factor of 4, as the interleaved input matrix A has 4 times fewer rows than the output matrix
        Window win_a(window);
        win_a.set(Window::DimX, Window::Dimension(0, 0, 0));
        win_a.set(Window::DimY, Window::Dimension(window.y().start() / 4, window.y().end() / 4, 1));

        // Set step_x and step_y for matrix B. Scale the X range by a factor of 16, as the transposed input matrix B has 16 times fewer columns than the output matrix
        Window win_b;
        // Don't slice matrix B along the z dimension if matrix B has just 2 dimensions and matrix A more than 2
        // This scenario can happen when the matrix multiplication is used to perform a convolution operation
        if(_slide_matrix_b)
        {
            win_b = window;
        }
        win_b.set(Window::DimX, Window::Dimension(window.x().start() / 16, window.x().end() / 16, in_b_stride));
        win_b.set(Window::DimY, Window::Dimension(0, 0, 0));

        // The step x and step y for the output matrix have already been set in configure()
        Iterator ina(src0, win_a);
        Iterator inb(src1, win_b);
        Iterator out(dst, window);

        switch(src0->info()->data_type())
        {
            case DataType::S8:
            {
                matrix_multiply_s8(ina, inb, out, width_b, *dst->info(), window);
                break;
            }
            case DataType::U8:
            case DataType::QASYMM8:
            {
                matrix_multiply_u8(ina, inb, out, width_b, *dst->info(), window);
                break;
            }
            default:
            {
                ARM_COMPUTE_ERROR("Not supported");
                break;
            }
        }
    }
}
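
In the vector-matrix path of the listing, each thread derives its own X range from thread_id and num_threads, so threads take 16-column blocks in round-robin order. The sketch below reproduces only that arithmetic in isolation. It is a simplified illustration: columns_for_thread is a hypothetical helper, the local ceil_to_multiple is a stand-in written from the usual round-up definition, and the loop assumes the range is walked from start to end in steps of window_step_x.

```cpp
#include <vector>

// Local stand-in for arm_compute::ceil_to_multiple (assumed round-up-to-multiple behaviour).
int ceil_to_multiple(int value, int divisor)
{
    return ((value + divisor - 1) / divisor) * divisor;
}

// Hypothetical helper: start columns of the 16-wide blocks visited by one thread.
std::vector<int> columns_for_thread(int width_matrix_b, int thread_id, int num_threads)
{
    const int window_start_x = 16 * thread_id;   // first block owned by this thread
    const int window_step_x  = 16 * num_threads; // skip the blocks owned by other threads
    // Round the remaining width up so that (end - start) is a multiple of the step
    const int window_end_x = ceil_to_multiple(width_matrix_b - window_start_x, window_step_x) + window_start_x;

    std::vector<int> columns;
    for(int x = window_start_x; x < window_end_x; x += window_step_x)
    {
        columns.push_back(x);
    }
    return columns;
}
```

For a matrix B that is 100 columns wide and two threads, thread 0 visits the blocks starting at columns 0, 32, 64 and 96, while thread 1 visits 16, 48 and 80. Together they cover columns 0 to 111, which includes the padded tail of the last block.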

◆ validate()

Status validate ( const ITensorInfo * src0,
const ITensorInfo * src1,
const ITensorInfo * dst 
)
static

Static function to check if given info will lead to a valid configuration.

Similar to CpuGemmLowpMatrixMultiplyKernel::configure()

Returns
a status

Definition at line 925 of file CpuGemmLowpMatrixMultiplyKernel.cpp.

References ARM_COMPUTE_RETURN_ON_ERROR.

Referenced by CpuGemmLowpMatrixMultiplyCore::validate().

{
    ARM_COMPUTE_RETURN_ON_ERROR(validate_arguments(src0, src1, dst));
    return Status{};
}

The documentation for this class was generated from the following files: