Compute Library
 19.08
NEGEMMLowpMatrixAReductionKernel Class Reference

NEON kernel used to compute the row-vectors of sums of all the entries in each row of Matrix A. More...

#include <NEGEMMLowpReductionKernel.h>

Collaboration diagram for NEGEMMLowpMatrixAReductionKernel

Public Member Functions

const char * name () const override
 Name of the kernel. More...
 
void configure (const ITensor *mtx_a, ITensor *vector_sum_row, int32_t num_mtx_a_cols, bool is_interleaved4x4) override
 Initialise the kernel's input and output. More...
 
void run (const Window &window, const ThreadInfo &info) override
 Execute the kernel on the passed window. More...
 
- Public Member Functions inherited from INEGEMMLowpReductionKernel
 INEGEMMLowpReductionKernel ()
 Constructor. More...
 
 INEGEMMLowpReductionKernel (const INEGEMMLowpReductionKernel &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
INEGEMMLowpReductionKernel & operator= (const INEGEMMLowpReductionKernel &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
 INEGEMMLowpReductionKernel (INEGEMMLowpReductionKernel &&)=default
 Allow instances of this class to be moved. More...
 
INEGEMMLowpReductionKernel & operator= (INEGEMMLowpReductionKernel &&)=default
 Allow instances of this class to be moved. More...
 
- Public Member Functions inherited from ICPPKernel
virtual ~ICPPKernel ()=default
 Default destructor. More...
 
- Public Member Functions inherited from IKernel
 IKernel ()
 Constructor. More...
 
virtual ~IKernel ()=default
 Destructor. More...
 
virtual bool is_parallelisable () const
 Indicates whether or not the kernel is parallelisable. More...
 
virtual BorderSize border_size () const
 The size of the border for that kernel. More...
 
const Window & window () const
 The maximum window the kernel can be executed on. More...
 

Static Public Member Functions

static Status validate (const ITensorInfo *mtx_a, const ITensorInfo *vector_sum_row, int32_t num_mtx_a_cols, bool is_interleaved4x4)
 Static function to check if given info will lead to a valid configuration of NEGEMMLowpMatrixAReductionKernel. More...
 

Detailed Description

NEON kernel used to compute the row-vectors of sums of all the entries in each row of Matrix A.

Note
This stage is needed to handle the offset contribution of the matrix product; see https://github.com/google/gemmlowp/blob/master/doc/low-precision.md

Definition at line 69 of file NEGEMMLowpReductionKernel.h.

Member Function Documentation

◆ configure()

void configure (const ITensor *mtx_a, ITensor *vector_sum_row, int32_t num_mtx_a_cols, bool is_interleaved4x4)
overridevirtual

Initialise the kernel's input and output.

Parameters
[in]  mtx_a              Input tensor. Data type supported: QASYMM8
[out] vector_sum_row     Output row-vector of sums of all the entries in each row of mtx_a. Data type supported: S32
[in]  num_mtx_a_cols     Number of matrix A columns
[in]  is_interleaved4x4  True if the matrix A has been interleaved4x4

Implements INEGEMMLowpReductionKernel.

Definition at line 105 of file NEGEMMLowpReductionKernel.cpp.

{
    // Perform validate step
    ARM_COMPUTE_ERROR_ON_NULLPTR(mtx_a, vector_sum_row);
    ARM_COMPUTE_ERROR_THROW_ON(validate_arguments_matrix_a_reduction(mtx_a->info(), vector_sum_row->info()));

    _input       = mtx_a;
    _output      = vector_sum_row;
    _k           = num_mtx_a_cols;
    _is_reshaped = is_interleaved4x4;

    // Configure kernel window
    auto win_config = validate_and_configure_window_matrix_a_reduction(_input->info(), _output->info(), _is_reshaped);
    ARM_COMPUTE_ERROR_THROW_ON(win_config.first);
    INEKernel::configure(win_config.second);
}

References ARM_COMPUTE_ERROR_ON_NULLPTR, ARM_COMPUTE_ERROR_THROW_ON, and ITensor::info().

Referenced by NEGEMMLowpMatrixMultiplyCore::configure().

◆ name()

const char* name ( ) const
inlineoverridevirtual

Name of the kernel.

Returns
Kernel name

Implements ICPPKernel.

Definition at line 72 of file NEGEMMLowpReductionKernel.h.

{
    return "NEGEMMLowpMatrixAReductionKernel";
}

◆ run()

void run (const Window &window, const ThreadInfo &info)
overridevirtual

Execute the kernel on the passed window.

Warning
If is_parallelisable() returns false then the passed window must be equal to window()
Note
The window has to be a region within the window returned by the window() method
The width of the window has to be a multiple of num_elems_processed_per_iteration().
Parameters
[in]  window  Region on which to execute the kernel. (Must be a region of the window returned by window())
[in]  info    Info about executing thread and CPU.

Implements ICPPKernel.

Definition at line 131 of file NEGEMMLowpReductionKernel.cpp.

{
    ARM_COMPUTE_UNUSED(info);
    ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL(this);
    ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW(INEKernel::window(), window);

    Window collapsed_window = window.collapse_if_possible(IKernel::window(), Window::DimY);

    Window win_input(collapsed_window);
    win_input.set(Window::DimX, Window::Dimension(0, 0, 0));
    win_input.set(Window::DimY, Window::Dimension(0, 0, 0));
    win_input.set(Window::DimZ, Window::Dimension(0, 0, 0));

    Iterator in(_input, win_input);
    Iterator out(_output, collapsed_window);

    if(_is_reshaped)
    {
        execute_window_loop(collapsed_window, [&](const Coordinates & id)
        {
            // Note: Since the input is unsigned char, we can safely use unsigned int for the accumulation
            uint32x4_t sum_row = vdupq_n_u32(0);

            const uint8_t *matrix_a = (in.ptr() + (id.x() / 4) * _input->info()->strides_in_bytes()[1] + id.y() * _input->info()->strides_in_bytes()[2]);

#if __arm__
            asm volatile("PLD [%0, #128*4]" ::"r"(matrix_a));
#endif /* __arm__ */

            int i = 0;
            // This for loop performs 4 accumulations
            for(; i <= (_k - 4); i += 4)
            {
                const uint8x16_t a0_u8 = vld1q_u8(matrix_a + i * 4);

                // Convert U8 to U16
                uint16x4x4_t a0_u16 =
                {
                    {
                        vget_low_u16(vmovl_u8(vget_low_u8(a0_u8))),
                        vget_high_u16(vmovl_u8(vget_low_u8(a0_u8))),
                        vget_low_u16(vmovl_u8(vget_high_u8(a0_u8))),
                        vget_high_u16(vmovl_u8(vget_high_u8(a0_u8)))
                    }
                };

                // Accumulate to U16
                a0_u16.val[0] = vadd_u16(a0_u16.val[0], a0_u16.val[1]);
                a0_u16.val[0] = vadd_u16(a0_u16.val[0], a0_u16.val[2]);
                a0_u16.val[0] = vadd_u16(a0_u16.val[0], a0_u16.val[3]);

                // Accumulate to U32
                sum_row = vaddw_u16(sum_row, a0_u16.val[0]);
            }

            // This for loop performs the leftover accumulations
            for(; i < _k; ++i)
            {
                const uint8x8_t a0_u8 = vld1_u8(matrix_a + i * 4);

                // Convert U8 to U16
                const uint16x4_t a0_u16 = vget_low_u16(vmovl_u8(a0_u8));

                // Accumulate to U32
                sum_row = vaddw_u16(sum_row, a0_u16);
            }

            auto vector_sum_row = reinterpret_cast<int32_t *>(out.ptr());

            vst1q_s32(vector_sum_row, vreinterpretq_s32_u32(sum_row));
        },
        in, out);
    }
    else // it is not reshaped
    {
        execute_window_loop(collapsed_window, [&](const Coordinates & id)
        {
            // Note: Since the input is unsigned char, we can safely use unsigned int for the accumulation
            uint32x4_t sum_row_u32 = vdupq_n_u32(0);
            uint32_t   sum_row     = 0;

            const uint8_t *matrix_a = (in.ptr() + id.x() * _input->info()->strides_in_bytes()[1] + id.y() * _input->info()->strides_in_bytes()[2]);

#if __arm__
            asm volatile("PLD [%0, #128*4]" ::"r"(matrix_a));
#endif /* __arm__ */

            int i = 0;
            // This for loop performs 16 accumulations
            for(; i <= (_k - 16); i += 16)
            {
                const uint8x16_t a0_u8 = vld1q_u8(matrix_a + i);

                // Partial accumulations in U16
                const uint16x8_t tmp_sum0 = vaddl_u8(vget_low_u8(a0_u8), vget_high_u8(a0_u8));

                // Accumulate to U32
                sum_row_u32 = vaddq_u32(sum_row_u32, vpaddlq_u16(tmp_sum0));
            }

            // This for loop performs the leftover accumulations
            for(; i < _k; ++i)
            {
                sum_row += static_cast<uint32_t>(matrix_a[i]);
            }

#if defined(__aarch64__)
            // Reduction operation available on 64 bit architectures only
            sum_row += vaddvq_u32(sum_row_u32);
#else  // __aarch64__
            uint32x2_t tmp = vpadd_u32(vget_high_u32(sum_row_u32), vget_low_u32(sum_row_u32));
            tmp            = vpadd_u32(tmp, tmp);

            sum_row += vget_lane_u32(tmp, 0);
#endif // __aarch64__

            *(reinterpret_cast<int *>(out.ptr())) = static_cast<int>(sum_row);
        },
        in, out);
    }
}

References ARM_COMPUTE_ERROR_ON_INVALID_SUBWINDOW, ARM_COMPUTE_ERROR_ON_UNCONFIGURED_KERNEL, ARM_COMPUTE_UNUSED, Window::collapse_if_possible(), Window::DimX, Window::DimY, Window::DimZ, arm_compute::execute_window_loop(), arm_compute::test::validation::info, Iterator::ptr(), Window::set(), and IKernel::window().

◆ validate()

Status validate (const ITensorInfo *mtx_a, const ITensorInfo *vector_sum_row, int32_t num_mtx_a_cols, bool is_interleaved4x4)
static

Static function to check if given info will lead to a valid configuration of NEGEMMLowpMatrixAReductionKernel.

Parameters
[in]  mtx_a              Input tensor. Data type supported: QASYMM8
[in]  vector_sum_row     Output row-vector of sums of all the entries in each row of mtx_a. Data type supported: S32
[in]  num_mtx_a_cols     Number of matrix A columns
[in]  is_interleaved4x4  True if the matrix A has been interleaved4x4
Returns
a status

Definition at line 122 of file NEGEMMLowpReductionKernel.cpp.

{
    ARM_COMPUTE_UNUSED(num_mtx_a_cols);
    ARM_COMPUTE_RETURN_ON_ERROR(validate_arguments_matrix_a_reduction(mtx_a, vector_sum_row));
    ARM_COMPUTE_RETURN_ON_ERROR(validate_and_configure_window_matrix_a_reduction(mtx_a->clone().get(), vector_sum_row->clone().get(), is_interleaved4x4).first);

    return Status{};
}

References ARM_COMPUTE_RETURN_ON_ERROR, ARM_COMPUTE_UNUSED, and ICloneable< T >::clone().

Referenced by NEGEMMLowpMatrixMultiplyCore::validate().


The documentation for this class was generated from the following files:

NEGEMMLowpReductionKernel.h
NEGEMMLowpReductionKernel.cpp