Compute Library
 19.08
NEGEMMLowpMatrixBReductionKernel Class Reference

NEON kernel used to compute the row-vectors of sums of all the entries in each column of Matrix B. More...

#include <NEGEMMLowpReductionKernel.h>


Public Member Functions

const char * name () const override
 Name of the kernel. More...
 
void configure (const ITensor *mtx_b, ITensor *vector_sum_col, int32_t num_mtx_b_rows, bool is_transposed1xW) override
 Initialise the kernel's input and output. More...
 
void run (const Window &window, const ThreadInfo &info) override
 Execute the kernel on the passed window. More...
 
- Public Member Functions inherited from INEGEMMLowpReductionKernel
 INEGEMMLowpReductionKernel ()
 Constructor. More...
 
 INEGEMMLowpReductionKernel (const INEGEMMLowpReductionKernel &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
INEGEMMLowpReductionKernel & operator= (const INEGEMMLowpReductionKernel &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
 INEGEMMLowpReductionKernel (INEGEMMLowpReductionKernel &&)=default
 Allow instances of this class to be moved. More...
 
INEGEMMLowpReductionKernel & operator= (INEGEMMLowpReductionKernel &&)=default
 Allow instances of this class to be moved. More...
 
- Public Member Functions inherited from ICPPKernel
virtual ~ICPPKernel ()=default
 Default destructor. More...
 
- Public Member Functions inherited from IKernel
 IKernel ()
 Constructor. More...
 
virtual ~IKernel ()=default
 Destructor. More...
 
virtual bool is_parallelisable () const
 Indicates whether or not the kernel is parallelisable. More...
 
virtual BorderSize border_size () const
 The size of the border for that kernel. More...
 
const Window & window () const
 The maximum window the kernel can be executed on. More...
 

Static Public Member Functions

static Status validate (const ITensorInfo *mtx_b, const ITensorInfo *vector_sum_col, int32_t num_mtx_b_rows, bool is_transposed1xW)
 Static function to check if given info will lead to a valid configuration of NEGEMMLowpMatrixBReductionKernel. More...
 

Detailed Description

NEON kernel used to compute the row-vectors of sums of all the entries in each column of Matrix B.

Note
This stage is needed to handle the offset of matrix product https://github.com/google/gemmlowp/blob/master/doc/low-precision.md

Definition at line 104 of file NEGEMMLowpReductionKernel.h.
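The gemmlowp low-precision scheme referenced in the note above explains why this reduction exists: expanding the offset-corrected dot product shows that the int32 result needs the column sums of B (scaled by A's offset). A minimal, library-independent C++ sketch of that identity (function names here are illustrative, not part of the Compute Library API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Direct form: subtract the quantization offsets before multiplying.
int32_t offset_dot_direct(const std::vector<uint8_t> &a, const std::vector<uint8_t> &b,
                          int32_t a_off, int32_t b_off)
{
    int32_t acc = 0;
    for(std::size_t k = 0; k < a.size(); ++k)
    {
        acc += (static_cast<int32_t>(a[k]) - a_off) * (static_cast<int32_t>(b[k]) - b_off);
    }
    return acc;
}

// Expanded form: dot(a,b) - a_off*sum(b) - b_off*sum(a) + K*a_off*b_off.
// The sum(b) term is what NEGEMMLowpMatrixBReductionKernel precomputes per column.
int32_t offset_dot_expanded(const std::vector<uint8_t> &a, const std::vector<uint8_t> &b,
                            int32_t a_off, int32_t b_off)
{
    int32_t dot = 0, sum_a = 0, sum_b = 0;
    const int32_t K = static_cast<int32_t>(a.size());
    for(std::size_t k = 0; k < a.size(); ++k)
    {
        dot   += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
        sum_a += a[k];
        sum_b += b[k];
    }
    return dot - a_off * sum_b - b_off * sum_a + K * a_off * b_off;
}
```

Both forms agree for any inputs, which is why the offset contribution can be folded in after the raw uint8 matrix product.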

Member Function Documentation

◆ configure()

void configure ( const ITensor * mtx_b,
ITensor * vector_sum_col,
int32_t  num_mtx_b_rows,
bool  is_transposed1xW 
)
overridevirtual

Initialise the kernel's input and output.

Parameters
[in]  mtx_b             Input tensor. Data type supported: QASYMM8
[out] vector_sum_col    Output row-vector of sums of all the entries in each column of mtx_b. Data type supported: S32
[in]  num_mtx_b_rows    Number of matrix B rows
[in]  is_transposed1xW  True if the input tensor is transposed 1xW

Implements INEGEMMLowpReductionKernel.

Definition at line 253 of file NEGEMMLowpReductionKernel.cpp.

254 {
255  ARM_COMPUTE_ERROR_ON_NULLPTR(mtx_b, vector_sum_col);
256  ARM_COMPUTE_ERROR_THROW_ON(validate_arguments_matrix_b_reduction(mtx_b->info(), vector_sum_col->info()));
257 
258  _input = mtx_b;
259  _output = vector_sum_col;
260  _k = num_mtx_b_rows;
261  _is_reshaped = is_transposed1xW;
262 
263  // Configure kernel window
264  auto win_config = validate_and_configure_window_matrix_b_reduction(_input->info(), _output->info());
265  ARM_COMPUTE_ERROR_THROW_ON(win_config.first);
266  INEKernel::configure(win_config.second);
267 }
#define ARM_COMPUTE_ERROR_THROW_ON(status)
Definition: Error.h:327
virtual ITensorInfo * info() const =0
Interface to be implemented by the child class to return the tensor's metadata.
#define ARM_COMPUTE_ERROR_ON_NULLPTR(...)
Definition: Validate.h:161

References ARM_COMPUTE_ERROR_ON_NULLPTR, ARM_COMPUTE_ERROR_THROW_ON, and ITensor::info().

Referenced by NEGEMMLowpMatrixMultiplyCore::configure().

◆ name()

const char* name ( ) const
inlineoverridevirtual

Name of the kernel.

Returns
Kernel name

Implements ICPPKernel.

Definition at line 107 of file NEGEMMLowpReductionKernel.h.

108  {
109  return "NEGEMMLowpMatrixBReductionKernel";
110  }

◆ run()

void run ( const Window & window,
const ThreadInfo & info 
)
overridevirtual

Execute the kernel on the passed window.

Warning
If is_parallelisable() returns false then the passed window must be equal to window()
Note
The window has to be a region within the window returned by the window() method
The width of the window has to be a multiple of num_elems_processed_per_iteration().
Parameters
[in]windowRegion on which to execute the kernel. (Must be a region of the window returned by window())
[in]infoInfo about executing thread and CPU.

Implements ICPPKernel.

Definition at line 279 of file NEGEMMLowpReductionKernel.cpp.

280 {
281  ARM_COMPUTE_UNUSED(info);
282  ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL(this);
283  ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW(INEKernel::window(), window);
284 
285  Window collapsed_window = window.collapse_if_possible(IKernel::window(), Window::DimY);
286 
287  if(_is_reshaped)
288  {
289  Window win_input(collapsed_window);
290  win_input.set(Window::DimX, Window::Dimension(0, 0, 0));
291  win_input.set(Window::DimY, Window::Dimension(0, 0, 0));
292  win_input.set(Window::DimZ, Window::Dimension(0, 0, 0));
293 
294  Iterator in(_input, win_input);
295  Iterator out(_output, collapsed_window);
296 
297  execute_window_loop(collapsed_window, [&](const Coordinates & id)
298  {
299  // Note: Since the input is unsigned char, we can safely use unsigned int for the accumulation
300  uint32x4x4_t sum_col =
301  {
302  {
303  vdupq_n_u32(0),
304  vdupq_n_u32(0),
305  vdupq_n_u32(0),
306  vdupq_n_u32(0)
307  }
308  };
309 
310  const uint8_t *matrix_b = in.ptr() + (id.x() / 16) * _input->info()->strides_in_bytes()[1] + id.y() * _input->info()->strides_in_bytes()[2];
311 
312 #if __arm__
313  asm volatile("PLD [%0, #128*4]" ::"r"(matrix_b));
314 #endif /* __arm__ */
315 
316  int i = 0;
317  for(; i < _k; ++i)
318  {
319  const uint8x16_t b0_u8 = vld1q_u8(matrix_b + i * 16);
320 
321  // Convert U8 to U16
322  const uint16x8x2_t b0_u16 =
323  {
324  {
325  vmovl_u8(vget_low_u8(b0_u8)),
326  vmovl_u8(vget_high_u8(b0_u8))
327  }
328  };
329 
330  // Accumulate to U32
331  sum_col =
332  {
333  {
334  vaddw_u16(sum_col.val[0], vget_low_u16(b0_u16.val[0])),
335  vaddw_u16(sum_col.val[1], vget_high_u16(b0_u16.val[0])),
336  vaddw_u16(sum_col.val[2], vget_low_u16(b0_u16.val[1])),
337  vaddw_u16(sum_col.val[3], vget_high_u16(b0_u16.val[1]))
338  }
339  };
340  }
341 
342  auto vector_sum_col = reinterpret_cast<int32_t *>(out.ptr());
343 
344  vst1q_s32(vector_sum_col + 0, vreinterpretq_s32_u32(sum_col.val[0]));
345  vst1q_s32(vector_sum_col + 4, vreinterpretq_s32_u32(sum_col.val[1]));
346  vst1q_s32(vector_sum_col + 8, vreinterpretq_s32_u32(sum_col.val[2]));
347  vst1q_s32(vector_sum_col + 12, vreinterpretq_s32_u32(sum_col.val[3]));
348  },
349  in, out);
350  }
351  else // it is not reshaped
352  {
353  const auto width_matrix_b = static_cast<int>(_input->info()->dimension(0));
354  const auto in_b_stride = static_cast<int>(_input->info()->strides_in_bytes()[1]);
355 
356  // The implementation computes 16 elements per iteration
357  const int window_start_x = 16 * info.thread_id;
358  const int window_step_x = 16 * info.num_threads;
359  // Make sure (window_end_x - window_start_x) is a multiple of window_step_x
360  const int window_end_x = ceil_to_multiple(width_matrix_b - window_start_x, window_step_x) + window_start_x;
361 
362  Window win_out(collapsed_window);
363  win_out.set(Window::DimX, Window::Dimension(window_start_x, window_end_x, window_step_x));
364 
365  Window win_in(win_out);
366  win_in.set(Window::DimY, Window::Dimension(0, 0, 0));
367  win_in.set(Window::DimZ, Window::Dimension(0, 0, 0));
368 
369  Iterator inb(_input, win_in);
370  Iterator out(_output, win_out);
371 
372  execute_window_loop(win_out, [&](const Coordinates & id)
373  {
374  if(id.x() > width_matrix_b)
375  {
376  return;
377  }
378 
379  // Note: Since the input is unsigned char, we can safely use unsigned int for the accumulation
380  uint32x4x4_t sum_col =
381  {
382  {
383  vdupq_n_u32(0),
384  vdupq_n_u32(0),
385  vdupq_n_u32(0),
386  vdupq_n_u32(0)
387  }
388  };
389 
390  const uint8_t *matrix_b = inb.ptr() + id.y() * _input->info()->strides_in_bytes()[2];
391 
392 #if __arm__
393  asm volatile("PLD [%0, #128*4]" ::"r"(matrix_b));
394  asm volatile("PLD [%0, #128*4]" ::"r"(matrix_b + in_b_stride));
395 #endif /* __arm__ */
396 
397  int i = 0;
398  // This for loop performs 4 accumulations
399  for(; i <= (_k - 4); i += 4)
400  {
401  const uint8x16_t b0_u8 = vld1q_u8(matrix_b + 0 * in_b_stride);
402  const uint8x16_t b1_u8 = vld1q_u8(matrix_b + 1 * in_b_stride);
403  const uint8x16_t b2_u8 = vld1q_u8(matrix_b + 2 * in_b_stride);
404  const uint8x16_t b3_u8 = vld1q_u8(matrix_b + 3 * in_b_stride);
405 
406 #if __arm__
407  asm volatile("PLD [%0, #128*1]" ::"r"(matrix_b + 1 * in_b_stride));
408  asm volatile("PLD [%0, #128*1]" ::"r"(matrix_b + 2 * in_b_stride));
409  asm volatile("PLD [%0, #128*1]" ::"r"(matrix_b + 3 * in_b_stride));
410  asm volatile("PLD [%0, #128*1]" ::"r"(matrix_b + 4 * in_b_stride));
411 #endif /* __arm__ */
412 
413  // Partial accumulation in u16
414  uint16x8x2_t tmp_sum =
415  {
416  {
417  vdupq_n_u16(0),
418  vdupq_n_u16(0)
419  }
420  };
421 
422  tmp_sum.val[0] = vaddw_u8(tmp_sum.val[0], vget_low_u8(b0_u8));
423  tmp_sum.val[0] = vaddw_u8(tmp_sum.val[0], vget_low_u8(b1_u8));
424  tmp_sum.val[0] = vaddw_u8(tmp_sum.val[0], vget_low_u8(b2_u8));
425  tmp_sum.val[0] = vaddw_u8(tmp_sum.val[0], vget_low_u8(b3_u8));
426  tmp_sum.val[1] = vaddw_u8(tmp_sum.val[1], vget_high_u8(b0_u8));
427  tmp_sum.val[1] = vaddw_u8(tmp_sum.val[1], vget_high_u8(b1_u8));
428  tmp_sum.val[1] = vaddw_u8(tmp_sum.val[1], vget_high_u8(b2_u8));
429  tmp_sum.val[1] = vaddw_u8(tmp_sum.val[1], vget_high_u8(b3_u8));
430 
431  // Accumulate to U32
432  sum_col =
433  {
434  {
435  vaddw_u16(sum_col.val[0], vget_low_u16(tmp_sum.val[0])),
436  vaddw_u16(sum_col.val[1], vget_high_u16(tmp_sum.val[0])),
437  vaddw_u16(sum_col.val[2], vget_low_u16(tmp_sum.val[1])),
438  vaddw_u16(sum_col.val[3], vget_high_u16(tmp_sum.val[1]))
439  }
440  };
441 
442  matrix_b += 4 * in_b_stride;
443  }
444 
445  // This for loop performs the leftover accumulations
446  for(; i < _k; ++i)
447  {
448  const uint8x16_t b0_u8 = vld1q_u8(matrix_b + 0 * in_b_stride);
449 
450  // Convert U8 to U16
451  const uint16x8x2_t b0_u16 =
452  {
453  {
454  vmovl_u8(vget_low_u8(b0_u8)),
455  vmovl_u8(vget_high_u8(b0_u8))
456  }
457  };
458 
459  // Accumulate to U32
460  sum_col =
461  {
462  {
463  vaddw_u16(sum_col.val[0], vget_low_u16(b0_u16.val[0])),
464  vaddw_u16(sum_col.val[1], vget_high_u16(b0_u16.val[0])),
465  vaddw_u16(sum_col.val[2], vget_low_u16(b0_u16.val[1])),
466  vaddw_u16(sum_col.val[3], vget_high_u16(b0_u16.val[1]))
467  }
468  };
469 
470  matrix_b += in_b_stride;
471  }
472 
473  auto vector_sum_col = reinterpret_cast<int32_t *>(out.ptr());
474 
475  vst1q_s32(vector_sum_col + 0, vreinterpretq_s32_u32(sum_col.val[0]));
476  vst1q_s32(vector_sum_col + 4, vreinterpretq_s32_u32(sum_col.val[1]));
477  vst1q_s32(vector_sum_col + 8, vreinterpretq_s32_u32(sum_col.val[2]));
478  vst1q_s32(vector_sum_col + 12, vreinterpretq_s32_u32(sum_col.val[3]));
479  },
480  inb, out);
481  }
482 }
const Window & window() const
The maximum window the kernel can be executed on.
Definition: IKernel.cpp:28
Describe one of the image's dimensions with a start, end and step.
Definition: Window.h:75
static constexpr size_t DimX
Alias for dimension 0 also known as X dimension.
Definition: Window.h:43
#define ARM_COMPUTE_UNUSED(...)
To avoid unused variables warnings.
Definition: Error.h:160
Window collapse_if_possible(const Window &full_window, size_t first, size_t last, bool *has_collapsed=nullptr) const
Collapse the dimensions between first and last if possible.
Definition: Window.inl:54
auto ceil_to_multiple(S value, T divisor) -> decltype(((value+divisor - 1)/divisor) *divisor)
Computes the smallest number larger or equal to value that is a multiple of divisor.
Definition: Utils.h:66
Coordinates of an item.
Definition: Coordinates.h:37
static constexpr size_t DimY
Alias for dimension 1 also known as Y dimension.
Definition: Window.h:45
static constexpr size_t DimZ
Alias for dimension 2 also known as Z dimension.
Definition: Window.h:47
void execute_window_loop(const Window &w, L &&lambda_function, Ts &&... iterators)
Iterate through the passed window, automatically adjusting the iterators and calling the lambda_funct...
Definition: Helpers.inl:122
Iterator updated by execute_window_loop for each window element.
Definition: Helpers.h:318
#define ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW(f, s)
Definition: Validate.h:205
Describe a multidimensional execution window.
Definition: Window.h:39
#define ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL(k)
Definition: Validate.h:940

References ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW, ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL, ARM_COMPUTE_UNUSED, arm_compute::ceil_to_multiple(), Window::collapse_if_possible(), Window::DimX, Window::DimY, Window::DimZ, arm_compute::execute_window_loop(), arm_compute::test::validation::info, Iterator::ptr(), Window::set(), and IKernel::window().
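Stripped of NEON intrinsics and windowing, the non-reshaped path above reduces a K x N uint8 matrix to N int32 column sums, using a main loop unrolled by 4 rows and a scalar leftover loop. A plain C++ sketch of the same computation (illustrative only, not the library code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Column-sum reduction: sums[c] = sum over all rows r of m[r][c], in int32.
// Mirrors the kernel's loop structure: 4 accumulations per main-loop
// iteration, then a leftover loop for the remaining rows.
std::vector<int32_t> column_sums(const std::vector<std::vector<uint8_t>> &m)
{
    const int K = static_cast<int>(m.size());    // rows of B (num_mtx_b_rows)
    const int N = static_cast<int>(m[0].size()); // columns of B
    std::vector<int32_t> sums(N, 0);

    int r = 0;
    for(; r <= K - 4; r += 4) // main loop: 4 rows per iteration
    {
        for(int c = 0; c < N; ++c)
        {
            sums[c] += m[r][c] + m[r + 1][c] + m[r + 2][c] + m[r + 3][c];
        }
    }
    for(; r < K; ++r)         // leftover rows
    {
        for(int c = 0; c < N; ++c)
        {
            sums[c] += m[r][c];
        }
    }
    return sums;
}
```

The kernel's u16 partial accumulator in the unrolled loop is safe because at most 4 values of at most 255 are added before widening to u32; the scalar sketch simply accumulates in int32 throughout.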

◆ validate()

Status validate ( const ITensorInfo * mtx_b,
const ITensorInfo * vector_sum_col,
int32_t  num_mtx_b_rows,
bool  is_transposed1xW 
)
static

Static function to check if given info will lead to a valid configuration of NEGEMMLowpMatrixBReductionKernel.

Parameters
[in]  mtx_b             Input tensor. Data type supported: QASYMM8
[in]  vector_sum_col    Output row-vector of sums of all the entries in each column of mtx_b. Data type supported: S32
[in]  num_mtx_b_rows    Number of matrix B rows
[in]  is_transposed1xW  True if the input tensor is transposed 1xW
Returns
a status

Definition at line 269 of file NEGEMMLowpReductionKernel.cpp.

270 {
271  ARM_COMPUTE_UNUSED(num_mtx_b_rows);
272  ARM_COMPUTE_UNUSED(is_transposed1xW);
273  ARM_COMPUTE_RETURN_ON_ERROR(validate_arguments_matrix_b_reduction(mtx_b, vector_sum_col));
274  ARM_COMPUTE_RETURN_ON_ERROR(validate_and_configure_window_matrix_b_reduction(mtx_b->clone().get(), vector_sum_col->clone().get()).first);
275 
276  return Status{};
277 }
#define ARM_COMPUTE_RETURN_ON_ERROR(status)
Checks if a status contains an error and returns it.
Definition: Error.h:193
Status class.
Definition: Error.h:52
#define ARM_COMPUTE_UNUSED(...)
To avoid unused variables warnings.
Definition: Error.h:160
virtual std::unique_ptr< T > clone() const =0
Provide a clone of the current object of class T.

References ARM_COMPUTE_RETURN_ON_ERROR, ARM_COMPUTE_UNUSED, and ICloneable< T >::clone().

Referenced by NEGEMMLowpMatrixMultiplyCore::validate().
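validate() checks a configuration using only tensor metadata (ITensorInfo clones), so callers can reject invalid shapes before allocating anything. The pattern, reduced to a library-independent sketch (the Status stand-in and the specific checks here are illustrative assumptions, not the library's actual rules):

```cpp
#include <string>

// Minimal stand-in for the library's Status: empty error string means OK.
struct Status
{
    std::string error;
    bool ok() const { return error.empty(); }
};

// Hypothetical metadata-only check, analogous in shape to validate():
// no tensor memory is touched, so it can run before any allocation.
Status validate_b_reduction(int b_width, int b_height, int sum_col_width)
{
    if(b_width <= 0 || b_height <= 0)
    {
        return Status{ "matrix B dimensions must be positive" };
    }
    if(sum_col_width != b_width)
    {
        return Status{ "vector_sum_col must have one entry per column of B" };
    }
    return Status{}; // valid configuration
}
```

In the library, NEGEMMLowpMatrixMultiplyCore::validate() composes such per-kernel checks so a whole GEMM pipeline can be vetted up front.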


The documentation for this class was generated from the following files:
NEGEMMLowpReductionKernel.h
NEGEMMLowpReductionKernel.cpp