Kernel to multiply matrices.

#include <CpuGemmLowpMatrixMultiplyKernel.h>

Public Member Functions
	CpuGemmLowpMatrixMultiplyKernel ()=default
	Default constructor.

	ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE (CpuGemmLowpMatrixMultiplyKernel)

void	configure (const ITensorInfo *src0, const ITensorInfo *src1, ITensorInfo *dst)
	Initialise the kernel's input and output.

void	run_op (ITensorPack &tensors, const Window &window, const ThreadInfo &info) override
	Execute the kernel on the passed window.

const char *	name () const override
	Name of the kernel.

Public Member Functions inherited from ICPPKernel
virtual	~ICPPKernel ()=default
	Default destructor.

virtual void	run (const Window &window, const ThreadInfo &info)
	Execute the kernel on the passed window.

virtual void	run_nd (const Window &window, const ThreadInfo &info, const Window &thread_locator)
	Legacy compatibility layer for implementations which do not support thread_locator. In these cases we simply narrow the interface down to the legacy version.

virtual size_t	get_mws (const CPUInfo &platform, size_t thread_count) const
	Return minimum workload size of the relevant kernel.

Public Member Functions inherited from IKernel
	IKernel ()
	Constructor.

virtual	~IKernel ()=default
	Destructor.

virtual bool	is_parallelisable () const
	Indicates whether or not the kernel is parallelisable.

virtual BorderSize	border_size () const
	The size of the border for that kernel.

const Window &	window () const
	The maximum window the kernel can be executed on.

bool	is_window_configured () const
	Function to check if the embedded window of this kernel has been configured.

Static Public Member Functions
static Status	validate (const ITensorInfo *src0, const ITensorInfo *src1, const ITensorInfo *dst)
	Static function to check if given info will lead to a valid configuration.

Static Public Member Functions inherited from ICpuKernel< CpuGemmLowpMatrixMultiplyKernel >
static const auto *	get_implementation (const SelectorType &selector, KernelSelectionType selection_type=KernelSelectionType::Supported)
	Micro-kernel selector.

Additional Inherited Members
Static Public Attributes inherited from ICPPKernel
static constexpr size_t	default_mws = 1

Detailed Description

Kernel to multiply matrices.

Note: CpuGemmLowpMatrixMultiplyKernel is a low-precision matrix product kernel. It performs the following computation:

Convert a values from int8 to int32
Convert b values from int8 to int32
Compute the int32 matrix product of the resulting a * b and store the result as int32
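
The three steps above can be sketched as a plain scalar reference. This is only an illustration under stated assumptions: row-major, non-reshaped inputs and a hypothetical function name gemmlowp_reference. The kernel itself works on reshaped inputs and is vectorised.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for an int8 x int8 -> int32 matrix product.
// a is M x K and b is K x N, both row-major; the result is M x N, row-major.
// gemmlowp_reference is a hypothetical name, not part of the library.
std::vector<std::int32_t> gemmlowp_reference(const std::vector<std::int8_t> &a,
                                             const std::vector<std::int8_t> &b,
                                             std::size_t m, std::size_t k, std::size_t n)
{
    assert(a.size() == m * k);
    assert(b.size() == k * n);

    std::vector<std::int32_t> dst(m * n, 0);
    for (std::size_t row = 0; row < m; ++row)
    {
        for (std::size_t col = 0; col < n; ++col)
        {
            std::int32_t acc = 0;
            for (std::size_t i = 0; i < k; ++i)
            {
                // Steps 1 and 2: widen both operands from int8 to int32.
                const auto a_val = static_cast<std::int32_t>(a[row * k + i]);
                const auto b_val = static_cast<std::int32_t>(b[i * n + col]);
                // Step 3: multiply and accumulate in int32.
                acc += a_val * b_val;
            }
            dst[row * n + col] = acc;
        }
    }
    return dst;
}
```

Widening before the multiply matters: 127 * 127 already exceeds the int8 range, so the product and the running sum are both kept in int32.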

Definition at line 46 of file CpuGemmLowpMatrixMultiplyKernel.h.

Constructor & Destructor Documentation

◆ CpuGemmLowpMatrixMultiplyKernel()

CpuGemmLowpMatrixMultiplyKernel ( )

default

Default constructor.

Member Function Documentation

◆ ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE()

ARM_COMPUTE_DISALLOW_COPY_ALLOW_MOVE ( CpuGemmLowpMatrixMultiplyKernel )

◆ configure()

void configure	(	const ITensorInfo *	src0,
		const ITensorInfo *	src1,
		ITensorInfo *	dst
	)

Initialise the kernel's input and output.

The input matrices src0 and src1 must be the outputs of CpuGemmInterleave4x4Kernel and CpuGemmTranspose1xWKernel, respectively. These two kernels change the layout of the original matrices to be more cache-friendly.

Parameters

[in]	src0	Input tensor info containing the interleaved Matrix A. Data type supported: U8/QASYMM8/S8/QASYMM8_SIGNED
[in]	src1	Input tensor info containing the transposed1xW Matrix B. Data type supported: U8/QASYMM8/S8/QASYMM8_SIGNED/QSYMM8/QSYMM8_PER_CHANNEL
[out]	dst	Output tensor info to store the result of matrix multiplication. Data type supported: S32
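
To make the "interleaved Matrix A" layout more concrete, here is a minimal sketch of a 4x4 interleave. It is an assumption about the reshaping, not code taken from CpuGemmInterleave4x4Kernel: it assumes a row-major input whose row count is a multiple of 4, and interleave4x4_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative 4x4 interleave of a row-major matrix (rows x cols).
// Assumes rows is a multiple of 4. Every group of 4 input rows becomes one
// output row in which the 4 values of each column are stored contiguously:
//   a00 a10 a20 a30 a01 a11 a21 a31 ...
// interleave4x4_sketch is a hypothetical name, not a library function.
std::vector<std::uint8_t> interleave4x4_sketch(const std::vector<std::uint8_t> &src,
                                               std::size_t rows, std::size_t cols)
{
    assert(rows % 4 == 0);
    assert(src.size() == rows * cols);

    std::vector<std::uint8_t> dst(rows * cols);
    std::size_t out = 0;
    for (std::size_t block = 0; block < rows; block += 4)
    {
        for (std::size_t col = 0; col < cols; ++col)
        {
            for (std::size_t r = 0; r < 4; ++r)
            {
                dst[out++] = src[(block + r) * cols + col];
            }
        }
    }
    return dst;
}
```

With this layout the reshaped matrix has a quarter of the original rows, so the four values a multiply step needs from one column of A sit next to each other in memory.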

Definition at line 715 of file CpuGemmLowpMatrixMultiplyKernel.cpp.

 {
     ARM_COMPUTE_UNUSED(src0);
     ARM_COMPUTE_ERROR_ON_NULLPTR(src0, src1, dst);
     ARM_COMPUTE_ERROR_THROW_ON(validate_arguments(src0, src1, dst));
  
     TensorShape in1_shape = src1->tensor_shape();
     in1_shape.collapse(2);
  
     _slide_matrix_b = in1_shape[2] != 1;
  
     constexpr unsigned int num_elems_processed_per_iteration_x = 16;
     constexpr unsigned int num_elems_processed_per_iteration_y = 4;
  
     Window win;
     // Check if the output tensor is a vector. If so, the kernel runs the vector-matrix multiplication path
     if ((dst->dimension(1) == 1))
     {
         // Configure kernel window
         win = calculate_max_window(*dst, Steps(num_elems_processed_per_iteration_x));
     }
     else
     {
         win =
             calculate_max_window(*dst, Steps(num_elems_processed_per_iteration_x, num_elems_processed_per_iteration_y));
     }
  
     ICpuKernel::configure(win);
 }
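
As a rough illustration of what the two branches mean for the execution window, the sketch below counts how many steps are needed to cover the output. It assumes the window end is rounded up so that partial tiles still count as one step; TileCount, steps_to_cover and count_tiles are hypothetical names, not library API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of steps of size `step` needed to cover `extent`.
std::size_t steps_to_cover(std::size_t extent, std::size_t step)
{
    assert(step > 0);
    return (extent + step - 1) / step;
}

// Hypothetical result type: steps along X and Y of the output.
struct TileCount
{
    std::size_t x;
    std::size_t y;
};

// Mirrors the branch in configure(): 16 elements per step along X,
// and 4 rows per step along Y unless the output is a single row.
TileCount count_tiles(std::size_t dst_width, std::size_t dst_height)
{
    constexpr std::size_t num_elems_processed_per_iteration_x = 16;
    constexpr std::size_t num_elems_processed_per_iteration_y = 4;

    if (dst_height == 1)
    {
        // Vector-matrix path: the window only steps along X.
        return {steps_to_cover(dst_width, num_elems_processed_per_iteration_x), 1};
    }
    return {steps_to_cover(dst_width, num_elems_processed_per_iteration_x),
            steps_to_cover(dst_height, num_elems_processed_per_iteration_y)};
}
```

For example, a 40 x 10 output would be covered by 3 steps along X and 3 along Y under these assumptions, while a 40 x 1 output needs 3 steps along X only.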

References ARM_COMPUTE_ERROR_ON_NULLPTR, ARM_COMPUTE_ERROR_THROW_ON, ARM_COMPUTE_UNUSED, arm_compute::calculate_max_window(), TensorShape::collapse(), arm_compute::test::validation::dst, ITensorInfo::tensor_shape(), and arm_compute::cpu::kernels::validate_arguments().

◆ name()

const char * name ( ) const

override virtual

Name of the kernel.

Returns: Kernel name

Implements ICPPKernel.

Definition at line 871 of file CpuGemmLowpMatrixMultiplyKernel.cpp.

 {
     return "CpuGemmLowpMatrixMultiplyKernel";
 }

◆ run_op()

void run_op	(	ITensorPack &	tensors,
		const Window &	window,
		const ThreadInfo &	info
	)

override virtual

Execute the kernel on the passed window.

Warning: If is_parallelisable() returns false then the passed window must be equal to window()

Note: The window has to be a region within the window returned by the window() method. The width of the window has to be a multiple of num_elems_processed_per_iteration().

Parameters

[in]	tensors	A vector containing the tensors to operate on.
[in]	window	Region on which to execute the kernel. (Must be a region of the window returned by window())
[in]	info	Info about executing thread and CPU.

Reimplemented from ICPPKernel.

Definition at line 752 of file CpuGemmLowpMatrixMultiplyKernel.cpp.

 {
     ARM_COMPUTE_UNUSED(info);
     ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL(this);
     ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW(ICpuKernel::window(), window);
  
     auto src0 = tensors.get_const_tensor(TensorType::ACL_SRC_0);
     auto src1 = tensors.get_const_tensor(TensorType::ACL_SRC_1);
     auto dst  = tensors.get_tensor(TensorType::ACL_DST);
  
     // Check if the output tensor is a vector. If so, the kernel runs the vector-matrix multiplication path
     if ((dst->info()->dimension(1) == 1))
     {
         const auto width_matrix_a = static_cast<int>(src0->info()->dimension(0));
         const auto width_matrix_b = static_cast<int>(src1->info()->dimension(0));
         const auto width_out      = static_cast<int>(dst->info()->dimension(0));
         const auto in_b_stride =
             static_cast<int>(src1->info()->strides_in_bytes()[1] / data_size_from_type(src1->info()->data_type()));
  
         // The implementation computes 16 elements per iteration
         const int window_start_x = 16 * info.thread_id;
         const int window_step_x  = 16 * info.num_threads;
         // Make sure (window_end_x - window_start_x) is a multiple of window_step_x
         const int window_end_x = ceil_to_multiple(width_matrix_b - window_start_x, window_step_x) + window_start_x;
  
         Window win_out(window);
         win_out.set(Window::DimX, Window::Dimension(window_start_x, window_end_x, window_step_x));
         win_out.set(Window::DimY, Window::Dimension(0, 1, 1));
  
         Window win_a(window);
         win_a.set(Window::DimX, Window::Dimension(0, 0, 0));
         win_a.set(Window::DimY, Window::Dimension(0, 0, 0));
  
         Window win_b;
         // Don't slice matrix B along the z dimension if matrix B has just 2 dimensions and matrix A more than 2
         // This scenario can happen when the matrix multiplication is used to perform a convolution operation
         if (src1->info()->num_dimensions() >= 3)
         {
             win_b = window;
         }
         win_b.set(Window::DimX, Window::Dimension(window_start_x, window_end_x, window_step_x));
         win_b.set(Window::DimY, Window::Dimension(0, 1, 1));
  
         Iterator ina(src0, win_a);
         Iterator inb(src1, win_b);
         Iterator out(dst, win_out);
  
         switch (src0->info()->data_type())
         {
             case DataType::S8:
             case DataType::QASYMM8_SIGNED:
             {
                 vector_matrix_multiply_s8(ina, inb, out, width_matrix_a, width_matrix_b, width_out, in_b_stride,
                                           window);
                 break;
             }
             case DataType::U8:
             case DataType::QASYMM8:
             {
                 vector_matrix_multiply_u8(ina, inb, out, width_matrix_a, width_matrix_b, width_out, in_b_stride,
                                           window);
                 break;
             }
             default:
             {
                 ARM_COMPUTE_ERROR("Not supported");
                 break;
             }
         }
     }
     else
     {
         const size_t in_b_stride = src1->info()->strides_in_bytes()[1];
         const int    width_b     = src1->info()->dimension(0);
  
         // Set step_x and step_y for matrix A. Scale the Y range by a factor of 4, as the interleaved input matrix A has 4 times fewer rows than the output matrix
         Window win_a(window);
         win_a.set(Window::DimX, Window::Dimension(0, 0, 0));
         win_a.set(Window::DimY, Window::Dimension(window.y().start() / 4, window.y().end() / 4, 1));
  
         // Set step_x and step_y for matrix B. Scale the X range by a factor of 16, as the transposed input matrix B has 16 times fewer columns than the output matrix
         Window win_b;
         // Don't slice matrix B along the z dimension if matrix B has just 2 dimensions and matrix A more than 2
         // This scenario can happen when the matrix multiplication is used to perform a convolution operation
         if (_slide_matrix_b)
         {
             win_b = window;
         }
         win_b.set(Window::DimX, Window::Dimension(window.x().start() / 16, window.x().end() / 16, in_b_stride));
         win_b.set(Window::DimY, Window::Dimension(0, 0, 0));
  
         // The step x and step y for the output matrix have already been set in configure()
         Iterator ina(src0, win_a);
         Iterator inb(src1, win_b);
         Iterator out(dst, window);
  
         switch (src0->info()->data_type())
         {
             case DataType::S8:
             case DataType::QASYMM8_SIGNED:
             {
                 matrix_multiply_s8(ina, inb, out, width_b, *dst->info(), window);
                 break;
             }
             case DataType::U8:
             case DataType::QASYMM8:
             {
                 matrix_multiply_u8(ina, inb, out, width_b, *dst->info(), window);
                 break;
             }
             default:
             {
                 ARM_COMPUTE_ERROR("Not supported");
                 break;
             }
         }
     }
 }
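
In the vector-matrix path above, each thread derives its own X range from info.thread_id and info.num_threads. The sketch below reproduces only that arithmetic to show which 16-wide column blocks a given thread would visit. It is an illustration: ceil_to_multiple_sketch stands in for arm_compute::ceil_to_multiple (assumed to round up to the next multiple of the divisor), and blocks_for_thread is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Stand-in for arm_compute::ceil_to_multiple (assumed behaviour:
// round value up to the nearest multiple of divisor).
int ceil_to_multiple_sketch(int value, int divisor)
{
    assert(divisor > 0);
    return ((value + divisor - 1) / divisor) * divisor;
}

// Start offsets of the 16-wide column blocks visited by one thread,
// using the same start/step/end arithmetic as the listing above.
std::vector<int> blocks_for_thread(int width_matrix_b, int thread_id, int num_threads)
{
    assert(num_threads > 0);

    const int window_start_x = 16 * thread_id;
    const int window_step_x  = 16 * num_threads;
    const int window_end_x =
        ceil_to_multiple_sketch(width_matrix_b - window_start_x, window_step_x) + window_start_x;

    std::vector<int> starts;
    for (int x = window_start_x; x < window_end_x; x += window_step_x)
    {
        starts.push_back(x);
    }
    return starts;
}
```

With a width of 64 and two threads, thread 0 would visit the blocks starting at 0 and 32, and thread 1 those starting at 16 and 48, so the blocks are dealt out to threads in an interleaved fashion. A block may extend past the matrix width when the width is not a multiple of 16; this sketch does not model how the kernel handles that tail.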

References arm_compute::ACL_DST, arm_compute::ACL_SRC_0, arm_compute::ACL_SRC_1, ARM_COMPUTE_ERROR, ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW, ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL, ARM_COMPUTE_UNUSED, arm_compute::ceil_to_multiple(), arm_compute::data_size_from_type(), Window::DimX, Window::DimY, arm_compute::test::validation::dst, Window::Dimension::end(), ITensorPack::get_const_tensor(), ITensorPack::get_tensor(), arm_compute::test::validation::info, arm_compute::QASYMM8, arm_compute::QASYMM8_SIGNED, arm_compute::S8, Window::set(), Window::Dimension::start(), arm_compute::U8, IKernel::window(), Window::x(), and Window::y().

◆ validate()

Status validate	(	const ITensorInfo *	src0,
		const ITensorInfo *	src1,
		const ITensorInfo *	dst
	)

static

Static function to check if given info will lead to a valid configuration.

Returns: a status

Definition at line 746 of file CpuGemmLowpMatrixMultiplyKernel.cpp.

 {
     ARM_COMPUTE_RETURN_ON_ERROR(validate_arguments(src0, src1, dst));
     return Status{};
 }

References ARM_COMPUTE_RETURN_ON_ERROR, arm_compute::test::validation::dst, and arm_compute::cpu::kernels::validate_arguments().

Referenced by CpuGemmLowpMatrixMultiplyCore::validate().

The documentation for this class was generated from the following files:

src/cpu/kernels/CpuGemmLowpMatrixMultiplyKernel.h
src/cpu/kernels/CpuGemmLowpMatrixMultiplyKernel.cpp
