Compute Library
 21.02
NEGEMMLowpMatrixMultiplyKernel Class Reference

Neon kernel to multiply matrices.

#include <NEGEMMLowpMatrixMultiplyKernel.h>

Public Member Functions

const char * name () const override
 Name of the kernel.

 NEGEMMLowpMatrixMultiplyKernel ()
 Constructor.

 NEGEMMLowpMatrixMultiplyKernel (const NEGEMMLowpMatrixMultiplyKernel &)=delete
 Prevent instances of this class from being copied (as this class contains pointers).

NEGEMMLowpMatrixMultiplyKernel & operator= (const NEGEMMLowpMatrixMultiplyKernel &)=delete
 Prevent instances of this class from being copied (as this class contains pointers).

 NEGEMMLowpMatrixMultiplyKernel (NEGEMMLowpMatrixMultiplyKernel &&)=default
 Allow instances of this class to be moved.

NEGEMMLowpMatrixMultiplyKernel & operator= (NEGEMMLowpMatrixMultiplyKernel &&)=default
 Allow instances of this class to be moved.

 ~NEGEMMLowpMatrixMultiplyKernel ()=default
 Default destructor.

void configure (const ITensor *input0, const ITensor *input1, ITensor *output)
 Initialise the kernel's input and output.

void run (const Window &window, const ThreadInfo &info) override
 Execute the kernel on the passed window.
 
- Public Member Functions inherited from ICPPKernel
virtual ~ICPPKernel ()=default
 Default destructor.

virtual void run_nd (const Window &window, const ThreadInfo &info, const Window &thread_locator)
 Legacy compatibility layer for implementations which do not support thread_locator. In these cases the interface is simply narrowed down to the legacy version.

virtual void run_op (ITensorPack &tensors, const Window &window, const ThreadInfo &info)
 Execute the kernel on the passed window.

- Public Member Functions inherited from IKernel
 IKernel ()
 Constructor.

virtual ~IKernel ()=default
 Destructor.

virtual bool is_parallelisable () const
 Indicates whether or not the kernel is parallelisable.

virtual BorderSize border_size () const
 The size of the border for that kernel.

const Window & window () const
 The maximum window the kernel can be executed on.
 

Static Public Member Functions

static Status validate (const ITensorInfo *input0, const ITensorInfo *input1, const ITensorInfo *output)
 Static function to check if given info will lead to a valid configuration of NEGEMMLowpMatrixMultiplyKernel.
 

Detailed Description

Neon kernel to multiply matrices.

Note
NEGEMMLowpMatrixMultiplyKernel is a low-precision matrix product kernel. It performs the following computation:
  1. Convert a values from int8 to int32
  2. Convert b values from int8 to int32
  3. Compute the int32 matrix product of the resulting a * b and store the result as int32
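
The three steps above can be sketched as a scalar reference. This is a minimal illustration, not the library's Neon implementation: the function name gemm_lowp_reference and the plain row-major layout are assumptions made for this example, and the real kernel works on the reshaped inputs described under configure().

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference for the low-precision product C = A * B.
// A is m x k and B is k x n, both row-major int8; C is m x n, row-major int32.
std::vector<int32_t> gemm_lowp_reference(const std::vector<int8_t> &a,
                                         const std::vector<int8_t> &b,
                                         std::size_t m, std::size_t n, std::size_t k)
{
    std::vector<int32_t> c(m * n, 0);
    for(std::size_t row = 0; row < m; ++row)
    {
        for(std::size_t col = 0; col < n; ++col)
        {
            int32_t acc = 0;
            for(std::size_t i = 0; i < k; ++i)
            {
                // Steps 1 and 2: widen both operands from int8 to int32.
                const int32_t a_val = static_cast<int32_t>(a[row * k + i]);
                const int32_t b_val = static_cast<int32_t>(b[i * n + col]);
                // Step 3: multiply and accumulate in int32.
                acc += a_val * b_val;
            }
            c[row * n + col] = acc;
        }
    }
    return c;
}
```

Widening before the multiply matters because a single int8 product such as -128 * -128 already exceeds the int8 range.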

Definition at line 43 of file NEGEMMLowpMatrixMultiplyKernel.h.

Constructor & Destructor Documentation

◆ NEGEMMLowpMatrixMultiplyKernel() [1/3]

Constructor.

Definition at line 898 of file NEGEMMLowpMatrixMultiplyKernel.cpp.

Referenced by NEGEMMLowpMatrixMultiplyKernel::name().

899     : _input0(nullptr), _input1(nullptr), _output(nullptr), _slide_matrix_b(true)
900 {
901 }

◆ NEGEMMLowpMatrixMultiplyKernel() [2/3]

Prevent instances of this class from being copied (as this class contains pointers).

◆ NEGEMMLowpMatrixMultiplyKernel() [3/3]

Allow instances of this class to be moved.

◆ ~NEGEMMLowpMatrixMultiplyKernel()

Default destructor.

Referenced by NEGEMMLowpMatrixMultiplyKernel::name().

Member Function Documentation

◆ configure()

void configure (const ITensor *input0, const ITensor *input1, ITensor *output)

Initialise the kernel's input and output.

The input matrices input0 and input1 must be the output of the kernels: NEGEMMInterleave4x4Kernel and NEGEMMTranspose1xWKernel. These two kernels change the layout of the original matrices to be more cache-friendly.
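
To give an idea of what such a layout change looks like, the sketch below interleaves a row-major matrix in groups of four rows. It is an illustration only: the helper name interleave_4x4 is hypothetical, the layout is assumed to place the four values of one column from four consecutive rows next to each other, and the number of rows is assumed to be a multiple of 4. The actual NEGEMMInterleave4x4Kernel may handle further cases not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: interleave a row-major (rows x cols) matrix in blocks of four rows.
// Assumes rows is a multiple of 4. The result has rows / 4 rows of cols * 4 values each.
std::vector<uint8_t> interleave_4x4(const std::vector<uint8_t> &src, std::size_t rows, std::size_t cols)
{
    std::vector<uint8_t> dst(rows * cols);
    std::size_t          out = 0;
    for(std::size_t r = 0; r < rows; r += 4)
    {
        for(std::size_t c = 0; c < cols; ++c)
        {
            // Emit column c of the four rows r, r + 1, r + 2, r + 3 back to back.
            for(std::size_t i = 0; i < 4; ++i)
            {
                dst[out++] = src[(r + i) * cols + c];
            }
        }
    }
    return dst;
}
```

With this layout, the four values a kernel needs to produce four output rows for one step along the shared dimension sit in adjacent memory, which is what makes the reshaped matrix more cache-friendly.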

Parameters
[in]  input0  Input tensor containing the interleaved Matrix A. Data type supported: U8/QASYMM8/S8/QASYMM8_SIGNED
[in]  input1  Input tensor containing the transposed 1xW Matrix B. Data type supported: U8/QASYMM8/S8/QASYMM8_SIGNED/QSYMM8/QSYMM8_PER_CHANNEL
[out] output  Output tensor to store the result of matrix multiplication. Data type supported: S32

Definition at line 903 of file NEGEMMLowpMatrixMultiplyKernel.cpp.

References ARM_COMPUTE_ERROR_ON_NULLPTR, ARM_COMPUTE_ERROR_THROW_ON, arm_compute::calculate_max_window(), TensorShape::collapse(), ITensorInfo::dimension(), ITensor::info(), ITensorInfo::num_dimensions(), Dimensions< T >::set_num_dimensions(), ITensorInfo::set_valid_region(), ITensorInfo::tensor_shape(), and arm_compute::validate_arguments().

Referenced by NEGEMMLowpMatrixMultiplyKernel::name().

904 {
905     ARM_COMPUTE_ERROR_ON_NULLPTR(input0, input1, output);
906     ARM_COMPUTE_ERROR_THROW_ON(validate_arguments(input0->info(), input1->info(), output->info()));
907
908     TensorShape in1_shape = input1->info()->tensor_shape();
909     in1_shape.collapse(2);
910
911     _input0 = input0;
912     _input1 = input1;
913     _output = output;
914     _slide_matrix_b = in1_shape[2] != 1;
915
916     constexpr unsigned int num_elems_processed_per_iteration_x = 16;
917     constexpr unsigned int num_elems_processed_per_iteration_y = 4;
918
919     Window win;
920
921     // Check if the output tensor is a vector. If so, the kernel runs the vector-matrix multiplication
922     if((output->info()->dimension(1) == 1))
923     {
924         // Configure kernel window
925         win = calculate_max_window(*output->info(), Steps(num_elems_processed_per_iteration_x));
926
927         Coordinates coord;
928         coord.set_num_dimensions(output->info()->num_dimensions());
929         output->info()->set_valid_region(ValidRegion(coord, output->info()->tensor_shape()));
930     }
931     else
932     {
933         win = calculate_max_window(*output->info(), Steps(num_elems_processed_per_iteration_x, num_elems_processed_per_iteration_y));
934         output->info()->set_valid_region(ValidRegion(Coordinates(), output->info()->tensor_shape()));
935     }
936
937     INEKernel::configure(win);
938 }
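
The step selection in the listing above can be sketched in isolation. The names StepChoice, choose_steps and num_iterations are hypothetical and exist only for this example. The sketch assumes that a step left unspecified in Steps defaults to 1, and it does not model calculate_max_window() itself.

```cpp
#include <cstddef>

// Hypothetical stand-in for the X and Y step sizes chosen in configure().
struct StepChoice
{
    unsigned int x;
    unsigned int y;
};

// Pick the steps from the output height (dimension 1 of the output tensor).
StepChoice choose_steps(std::size_t output_height)
{
    constexpr unsigned int num_elems_processed_per_iteration_x = 16;
    constexpr unsigned int num_elems_processed_per_iteration_y = 4;

    if(output_height == 1)
    {
        // Vector-matrix path: only the X step is set; Y is assumed to default to 1.
        return StepChoice{ num_elems_processed_per_iteration_x, 1 };
    }
    // Matrix-matrix path: each iteration covers a 16 x 4 block of the output.
    return StepChoice{ num_elems_processed_per_iteration_x, num_elems_processed_per_iteration_y };
}

// Number of iterations needed to cover `size` elements when advancing by `step`.
std::size_t num_iterations(std::size_t size, unsigned int step)
{
    return (size + step - 1) / step;
}
```

For example, an output row of 100 elements needs 7 iterations along X at a step of 16, so the last iteration extends past the end of the row.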

◆ name()

◆ operator=() [1/2]

Prevent instances of this class from being copied (as this class contains pointers).

Referenced by NEGEMMLowpMatrixMultiplyKernel::name().

◆ operator=() [2/2]

Allow instances of this class to be moved.

◆ run()

void run (const Window &window, const ThreadInfo &info)
override virtual

Execute the kernel on the passed window.

Warning
If is_parallelisable() returns false then the passed window must be equal to window()
Note
The window has to be a region within the window returned by the window() method
The width of the window has to be a multiple of num_elems_processed_per_iteration().
Parameters
[in]  window  Region on which to execute the kernel. (Must be a region of the window returned by window())
[in]  info    Info about executing thread and CPU.

Reimplemented from ICPPKernel.

Definition at line 947 of file NEGEMMLowpMatrixMultiplyKernel.cpp.

References ARM_COMPUTE_ERROR, ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW, ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL, ARM_COMPUTE_UNUSED, arm_compute::ceil_to_multiple(), arm_compute::data_size_from_type(), ITensorInfo::data_type(), ITensorInfo::dimension(), Window::DimX, Window::DimY, Window::Dimension::end(), ITensor::info(), ITensorInfo::num_dimensions(), ThreadInfo::num_threads, arm_compute::QASYMM8, arm_compute::QASYMM8_SIGNED, arm_compute::S8, Window::set(), Window::Dimension::start(), ITensorInfo::strides_in_bytes(), ThreadInfo::thread_id, arm_compute::U8, IKernel::window(), Window::x(), and Window::y().

Referenced by NEGEMMLowpMatrixMultiplyKernel::name().

948 {
949     ARM_COMPUTE_UNUSED(info);
950     ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL(this);
951     ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW(INEKernel::window(), window);
952
953     // Check if the output tensor is a vector. If so, the kernel runs the vector-matrix multiplication path
954     if((_output->info()->dimension(1) == 1))
955     {
956         const auto width_matrix_a = static_cast<int>(_input0->info()->dimension(0));
957         const auto width_matrix_b = static_cast<int>(_input1->info()->dimension(0));
958         const auto width_out = static_cast<int>(_output->info()->dimension(0));
959         const auto in_b_stride = static_cast<int>(_input1->info()->strides_in_bytes()[1] / data_size_from_type(_input1->info()->data_type()));
960
961         // The implementation computes 16 elements per iteration
962         const int window_start_x = 16 * info.thread_id;
963         const int window_step_x = 16 * info.num_threads;
964         // Make sure (window_end_x - window_start_x) is a multiple of window_step_x
965         const int window_end_x = ceil_to_multiple(width_matrix_b - window_start_x, window_step_x) + window_start_x;
966
967         Window win_out(window);
968         win_out.set(Window::DimX, Window::Dimension(window_start_x, window_end_x, window_step_x));
969         win_out.set(Window::DimY, Window::Dimension(0, 1, 1));
970
971         Window win_a(window);
972         win_a.set(Window::DimX, Window::Dimension(0, 0, 0));
973         win_a.set(Window::DimY, Window::Dimension(0, 0, 0));
974
975         Window win_b;
976         // Don't slice matrix B along the z dimension if matrix B has just 2 dimensions and matrix A more than 2
977         // This scenario can happen when the matrix multiplication is used to perform a convolution operation
978         if(_input1->info()->num_dimensions() >= 3)
979         {
980             win_b = window;
981         }
982         win_b.set(Window::DimX, Window::Dimension(window_start_x, window_end_x, window_step_x));
983         win_b.set(Window::DimY, Window::Dimension(0, 1, 1));
984
985         Iterator ina(_input0, win_a);
986         Iterator inb(_input1, win_b);
987         Iterator out(_output, win_out);
988
989         switch(_input0->info()->data_type())
990         {
991             case DataType::S8:
992             case DataType::QASYMM8_SIGNED:
993             {
994                 vector_matrix_multiply_s8(ina, inb, out, width_matrix_a, width_matrix_b, width_out, in_b_stride, window);
995                 break;
996             }
997             case DataType::U8:
998             case DataType::QASYMM8:
999             {
1000                 vector_matrix_multiply_u8(ina, inb, out, width_matrix_a, width_matrix_b, width_out, in_b_stride, window);
1001                 break;
1002             }
1003             default:
1004             {
1005                 ARM_COMPUTE_ERROR("Not supported");
1006                 break;
1007             }
1008         }
1009     }
1010     else
1011     {
1012         const size_t in_b_stride = _input1->info()->strides_in_bytes()[1];
1013         const int width_b = _input1->info()->dimension(0);
1014
1015         // Set step_x and step_y for matrix A. Scale the Y range by a factor of 4, as the interleaved input matrix A has 4 times fewer rows than the output matrix
1016         Window win_a(window);
1017         win_a.set(Window::DimX, Window::Dimension(0, 0, 0));
1018         win_a.set(Window::DimY, Window::Dimension(window.y().start() / 4, window.y().end() / 4, 1));
1019
1020         // Set step_x and step_y for matrix B. Scale the X range by a factor of 16, as the transposed input matrix B has 16 times fewer columns than the output matrix
1021         Window win_b;
1022         // Don't slice matrix B along the z dimension if matrix B has just 2 dimensions and matrix A more than 2
1023         // This scenario can happen when the matrix multiplication is used to perform a convolution operation
1024         if(_slide_matrix_b)
1025         {
1026             win_b = window;
1027         }
1028         win_b.set(Window::DimX, Window::Dimension(window.x().start() / 16, window.x().end() / 16, in_b_stride));
1029         win_b.set(Window::DimY, Window::Dimension(0, 0, 0));
1030
1031         // The step x and step y for the output matrix have already been set in configure()
1032         Iterator ina(_input0, win_a);
1033         Iterator inb(_input1, win_b);
1034         Iterator out(_output, window);
1035
1036         switch(_input0->info()->data_type())
1037         {
1038             case DataType::S8:
1039             case DataType::QASYMM8_SIGNED:
1040             {
1041                 matrix_multiply_s8(ina, inb, out, width_b, *_output->info(), window);
1042                 break;
1043             }
1044             case DataType::U8:
1045             case DataType::QASYMM8:
1046             {
1047                 matrix_multiply_u8(ina, inb, out, width_b, *_output->info(), window);
1048                 break;
1049             }
1050             default:
1051             {
1052                 ARM_COMPUTE_ERROR("Not supported");
1053                 break;
1054             }
1055         }
1056     }
1057 }
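
On the vector-matrix path, each thread derives its own X range from thread_id and num_threads. The sketch below reproduces only that arithmetic. The helper thread_x_offsets is hypothetical, ceil_to_multiple is re-implemented locally with the same rounding rule as the library helper referenced in the listing, and the window is assumed to be walked from start to end (exclusive) in increments of the step.

```cpp
#include <vector>

// Local re-implementation: smallest multiple of divisor that is >= value (for positive inputs).
int ceil_to_multiple(int value, int divisor)
{
    return ((value + divisor - 1) / divisor) * divisor;
}

// Hypothetical helper: X offsets visited by one thread on the vector-matrix path.
std::vector<int> thread_x_offsets(int width_matrix_b, int thread_id, int num_threads)
{
    // Each iteration computes 16 output elements; threads are interleaved in blocks of 16.
    const int window_start_x = 16 * thread_id;
    const int window_step_x  = 16 * num_threads;
    // Make (window_end_x - window_start_x) a multiple of window_step_x.
    const int window_end_x = ceil_to_multiple(width_matrix_b - window_start_x, window_step_x) + window_start_x;

    std::vector<int> offsets;
    for(int x = window_start_x; x < window_end_x; x += window_step_x)
    {
        offsets.push_back(x);
    }
    return offsets;
}
```

With a width of 100 and two threads, thread 0 visits offsets 0, 32, 64, 96 and thread 1 visits 16, 48, 80, so together they cover every block of 16 elements exactly once.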

◆ validate()

static Status validate (const ITensorInfo *input0, const ITensorInfo *input1, const ITensorInfo *output)

Static function to check if given info will lead to a valid configuration of NEGEMMLowpMatrixMultiplyKernel.

Parameters
[in]  input0  Input tensor info containing the interleaved Matrix A. Data type supported: U8/QASYMM8/S8/QASYMM8_SIGNED
[in]  input1  Input tensor info containing the transposed Matrix B. Data type supported: U8/QASYMM8/S8/QASYMM8_SIGNED/QSYMM8/QSYMM8_PER_CHANNEL
[in]  output  Output tensor info to store the result of matrix multiplication. Data type supported: S32
Returns
a status

Definition at line 940 of file NEGEMMLowpMatrixMultiplyKernel.cpp.

References ARM_COMPUTE_RETURN_ON_ERROR, and arm_compute::validate_arguments().

Referenced by NEGEMMLowpMatrixMultiplyKernel::name(), and NEGEMMLowpMatrixMultiplyCore::validate().

941 {
942     ARM_COMPUTE_RETURN_ON_ERROR(validate_arguments(input0, input1, output));
943
944     return Status{};
945 }

The documentation for this class was generated from the following files: