Compute Library 21.02
NEGEMMLowpMatrixMultiplyCore Class Reference

Basic function to execute GEMMLowpMatrixMultiplyCore on Neon. More...

#include <NEGEMMLowpMatrixMultiplyCore.h>

Collaboration diagram for NEGEMMLowpMatrixMultiplyCore:

Public Member Functions

 NEGEMMLowpMatrixMultiplyCore (std::shared_ptr< IMemoryManager > memory_manager=nullptr, IWeightsManager *weights_manager=nullptr)
 Constructor. More...
 
 NEGEMMLowpMatrixMultiplyCore (const NEGEMMLowpMatrixMultiplyCore &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
 NEGEMMLowpMatrixMultiplyCore (NEGEMMLowpMatrixMultiplyCore &&)=default
 Default move constructor. More...
 
NEGEMMLowpMatrixMultiplyCore & operator= (const NEGEMMLowpMatrixMultiplyCore &)=delete
 Prevent instances of this class from being copied (As this class contains pointers) More...
 
NEGEMMLowpMatrixMultiplyCore & operator= (NEGEMMLowpMatrixMultiplyCore &&)=default
 Default move assignment operator. More...
 
 ~NEGEMMLowpMatrixMultiplyCore ()
 Default destructor. More...
 
void configure (const ITensor *a, const ITensor *b, const ITensor *c, ITensor *output, const GEMMInfo &gemm_info=GEMMInfo())
 Initialise the kernel's inputs, output. More...
 
void run () override
 Run the kernels contained in the function. More...
 
void prepare () override
 Prepare the function for executing. More...
 
- Public Member Functions inherited from IFunction
virtual ~IFunction ()=default
 Destructor. More...
 

Static Public Member Functions

static Status validate (const ITensorInfo *a, const ITensorInfo *b, const ITensorInfo *c, const ITensorInfo *output, const GEMMInfo &gemm_info=GEMMInfo())
 Static function to check if given info will lead to a valid configuration of NEGEMMLowpMatrixMultiplyCore. More...
 

Detailed Description

Basic function to execute GEMMLowpMatrixMultiplyCore on Neon.

This function calls the following Neon kernels if the DOT product instruction is not available:

  1. NEGEMMInterleave4x4Kernel
  2. NEGEMMTranspose1xWKernel
  3. NEGEMMLowpMatrixMultiplyKernel
  4. NEGEMMLowpOffsetContributionKernel
  5. NEActivationLayer

otherwise if the DOT product instruction is available:

  1. NEGEMMLowpOffsetContributionKernel

Definition at line 63 of file NEGEMMLowpMatrixMultiplyCore.h.
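
A minimal usage sketch, assuming QASYMM8 inputs and an S32 output with the default GEMMInfo (no output stage); the shapes, quantization parameters and main() scaffolding below are illustrative, not taken from this page:

#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    // Illustrative shapes: A is 16x32, B is 32x8, so the output is 16x8
    Tensor a, b, dst;
    a.allocator()->init(TensorInfo(TensorShape(32U, 16U), 1, DataType::QASYMM8, QuantizationInfo(0.5f, 10)));
    b.allocator()->init(TensorInfo(TensorShape(8U, 32U), 1, DataType::QASYMM8, QuantizationInfo(0.25f, 3)));
    dst.allocator()->init(TensorInfo(TensorShape(8U, 16U), 1, DataType::S32));

    // The default GEMMInfo means GEMMLowpOutputStageType::NONE, so the result stays S32
    NEGEMMLowpMatrixMultiplyCore gemmlowp;
    gemmlowp.configure(&a, &b, nullptr, &dst);

    a.allocator()->allocate();
    b.allocator()->allocate();
    dst.allocator()->allocate();

    // ... fill a and b with quantized data here ...

    gemmlowp.run();
    return 0;
}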

Constructor & Destructor Documentation

◆ NEGEMMLowpMatrixMultiplyCore() [1/3]

NEGEMMLowpMatrixMultiplyCore ( std::shared_ptr< IMemoryManager > memory_manager = nullptr,
IWeightsManager * weights_manager = nullptr 
)

Constructor.

Definition at line 68 of file NEGEMMLowpMatrixMultiplyCore.cpp.

69  : _memory_group(memory_manager), _weights_manager(weights_manager), _asm_glue(std::make_unique<NEGEMMAssemblyDispatch>(memory_manager, weights_manager)), _mm_kernel(), _mtx_a_reshape_kernel(),
70  _mtx_b_reshape_kernel(), _mtx_a_reduction_kernel(), _mtx_b_reduction_kernel(), _offset_contribution_kernel(), _offset_contribution_output_stage_kernel(), _activation_func(),
71  _convert_to_signed_asymm(), _convert_from_signed_asymm(), _vector_sum_col(), _vector_sum_row(), _tmp_a(), _tmp_b(), _mm_result_s32(), _signed_a(), _signed_output(), _original_b(nullptr), _a_offset(0),
72  _b_offset(0), _run_vector_matrix_multiplication(false), _assembly_path(false), _fused_assembly_path(false), _reshape_b_only_on_first_run(false), _is_prepared(false), _fuse_output_stage(false),
73  _run_activation(false), _flip_signedness(false)
74 {
75 }
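
A sketch of constructing the function with a shared memory manager so its internal tensors can be pooled with other functions; the choice of BlobLifetimeManager/PoolManager here is an assumption for illustration, not mandated by this class:

#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include "arm_compute/runtime/PoolManager.h"
#include "arm_compute/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.h"

#include <memory>

void build_with_memory_manager()
{
    // Pool the function's internal tensors through an on-demand memory manager
    auto lifetime_mgr = std::make_shared<arm_compute::BlobLifetimeManager>();
    auto pool_mgr     = std::make_shared<arm_compute::PoolManager>();
    auto memory_mgr   = std::make_shared<arm_compute::MemoryManagerOnDemand>(lifetime_mgr, pool_mgr);

    arm_compute::NEGEMMLowpMatrixMultiplyCore gemmlowp(memory_mgr);
    // ... configure/allocate as usual, populate memory_mgr with pools,
    //     then run(); memory_mgr can be shared with other functions ...
}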

◆ NEGEMMLowpMatrixMultiplyCore() [2/3]

Prevent instances of this class from being copied (As this class contains pointers)

◆ NEGEMMLowpMatrixMultiplyCore() [3/3]

Default move constructor.

◆ ~NEGEMMLowpMatrixMultiplyCore()

Default destructor.

Member Function Documentation

◆ configure()

void configure ( const ITensor * a,
const ITensor * b,
const ITensor * c,
ITensor * output,
const GEMMInfo & gemm_info = GEMMInfo() 
)

Initialise the kernel's inputs, output.

Note
GEMM_LOWP: low precision GEMM kernel. This kernel performs the following computations:
  1. Convert a values from QASYMM8 to int32 and add a_offset to each of them.
  2. Convert b values from QASYMM8 to int32 and add b_offset to each of them.
  3. Compute the matrix product of the resulting a * b in int32.
Note
The output type is S32 if gemm_info.gemmlowp_output_stage().type == GEMMLowpOutputStageType::NONE. It is QASYMM8/QASYMM8_SIGNED otherwise.
Parameters
[in]  a          First input tensor (Matrix A). Data type supported: QASYMM8/QASYMM8_SIGNED.
[in]  b          Second input tensor (Matrix B). Data type supported: QASYMM8/QASYMM8_SIGNED/QSYMM8/QSYMM8_PER_CHANNEL.
[in]  c          Third input tensor (Matrix C). It can be a nullptr. Data type supported: S32
[out] output     Output tensor. Data type supported: S32/QASYMM8/QASYMM8_SIGNED
[in]  gemm_info  (Optional) Specifies if the matrix A and/or matrix B have been reshaped and if the reshape of matrix B should be executed only for the first run
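
The three steps in the note above amount to a plain int32 GEMM over offset-corrected inputs; a naive scalar reference (a hypothetical helper, not part of the library) might look like this:

#include <cstdint>
#include <vector>

// Naive int32 reference for the three steps in the note above: widen the
// QASYMM8 inputs to int32, add the respective offsets, then accumulate the
// matrix product in int32.
std::vector<int32_t> gemmlowp_reference(const std::vector<uint8_t> &a, // M x K, row-major
                                        const std::vector<uint8_t> &b, // K x N, row-major
                                        int M, int N, int K,
                                        int32_t a_offset, int32_t b_offset)
{
    std::vector<int32_t> dst(static_cast<size_t>(M) * N, 0);
    for(int m = 0; m < M; ++m)
    {
        for(int n = 0; n < N; ++n)
        {
            int32_t acc = 0;
            for(int k = 0; k < K; ++k)
            {
                const int32_t a_val = static_cast<int32_t>(a[m * K + k]) + a_offset;
                const int32_t b_val = static_cast<int32_t>(b[k * N + n]) + b_offset;
                acc += a_val * b_val;
            }
            dst[m * N + n] = acc;
        }
    }
    return dst;
}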

Definition at line 77 of file NEGEMMLowpMatrixMultiplyCore.cpp.

References GEMMInfo::activation_info(), TensorAllocator::allocate(), Tensor::allocator(), ARM_COMPUTE_ERROR, ARM_COMPUTE_ERROR_ON_NULLPTR, ARM_COMPUTE_ERROR_THROW_ON, ARM_COMPUTE_UNUSED, arm_compute::test::validation::b, ICloneable< T >::clone(), arm_compute::misc::shape_calculator::compute_interleaved_shape(), arm_compute::misc::shape_calculator::compute_reductionA_shape(), arm_compute::misc::shape_calculator::compute_reductionB_shape(), arm_compute::misc::shape_calculator::compute_transpose1xW_shape(), NEActivationLayer::configure(), ITensorInfo::data_type(), ITensorInfo::dimension(), dt, ActivationLayerInfo::enabled(), GEMMLowpOutputStageInfo::gemmlowp_max_bound, GEMMLowpOutputStageInfo::gemmlowp_min_bound, GEMMLowpOutputStageInfo::gemmlowp_offset, GEMMInfo::gemmlowp_output_stage(), ITensor::info(), Tensor::info(), TensorAllocator::init(), NEGEMMAssemblyDispatch::is_activation_supported(), arm_compute::is_data_type_quantized_asymmetric(), arm_compute::is_data_type_quantized_per_channel(), MemoryGroup::manage(), arm_compute::NONE, UniformQuantizationInfo::offset, arm_compute::QASYMM8, arm_compute::QASYMM8_SIGNED, ITensorInfo::quantization_info(), arm_compute::QUANTIZE_DOWN_FIXEDPOINT, GEMMInfo::reshape_b_only_on_first_run(), arm_compute::S32, arm_compute::S8, UniformQuantizationInfo::scale, GEMMInfo::set_gemmlowp_output_stage(), ITensorInfo::tensor_shape(), GEMMLowpOutputStageInfo::type, arm_compute::U8, QuantizationInfo::uniform(), and NEGEMMLowpMatrixMultiplyCore::validate().

Referenced by NELSTMLayerQuantized::configure(), main(), and NEQLSTMLayer::NEQLSTMLayer().

78 {
79  ARM_COMPUTE_ERROR_ON_NULLPTR(a, b, output);
80  ARM_COMPUTE_UNUSED(c);
81  ARM_COMPUTE_ERROR_THROW_ON(NEGEMMLowpMatrixMultiplyCore::validate(a->info(), b->info(), c != nullptr ? c->info() : nullptr, output->info(), gemm_info));
82 
83  const ITensor *matrix_a = a;
84  const ITensor *matrix_b = b;
85  GEMMInfo info = gemm_info;
86 
87  // Set internal variables
88  _a_offset = a->info()->quantization_info().uniform().offset;
89  _b_offset = b->info()->quantization_info().uniform().offset;
90  _run_vector_matrix_multiplication = a->info()->dimension(1) < 2;
91  _reshape_b_only_on_first_run = info.reshape_b_only_on_first_run();
92  _is_prepared = false;
93  _fused_assembly_path = false;
94  _flip_signedness = is_data_type_quantized_per_channel(b->info()->data_type()) && (a->info()->data_type() == DataType::QASYMM8) && _reshape_b_only_on_first_run;
95  _original_b = b;
96 
97  const ITensor *a_to_use = a;
98 
99  // Convert to QASYMM8 -> QASYMM8_SIGNED and back
100  if(_flip_signedness)
101  {
102  const int32_t offset_correction = 128;
103  const DataType dt = DataType::QASYMM8_SIGNED;
104  const UniformQuantizationInfo iqinfo = a_to_use->info()->quantization_info().uniform();
105 
106  _signed_a.allocator()->init(a_to_use->info()->clone()->set_data_type(dt).set_quantization_info(QuantizationInfo(iqinfo.scale, iqinfo.offset + offset_correction)));
107  _memory_group.manage(&_signed_a);
108  _convert_to_signed_asymm = std::make_unique<NEConvertQuantizedSignednessKernel>();
109  _convert_to_signed_asymm->configure(a_to_use, &_signed_a);
110  a_to_use = &_signed_a;
111  _a_offset = _signed_a.info()->quantization_info().uniform().offset;
112 
113  const UniformQuantizationInfo oqinfo = output->info()->quantization_info().uniform();
114  _memory_group.manage(&_signed_output);
115  _signed_output.allocator()->init(output->info()->clone()->set_data_type(dt).set_quantization_info(QuantizationInfo(oqinfo.scale, oqinfo.offset - offset_correction)));
116 
117  // Output stage correction
118  GEMMLowpOutputStageInfo output_stage_corr = info.gemmlowp_output_stage();
119  output_stage_corr.gemmlowp_offset = _signed_output.info()->quantization_info().uniform().offset;
120  output_stage_corr.gemmlowp_min_bound -= offset_correction;
121  output_stage_corr.gemmlowp_max_bound -= offset_correction;
122  info.set_gemmlowp_output_stage(output_stage_corr);
123 
124  // Update matrix a
125  matrix_a = &_signed_a;
126  }
127 
128  // If GEMMLowpOutputStage != NONE, fuse the offset contribution with the output stage
129  if(info.gemmlowp_output_stage().type != GEMMLowpOutputStageType::NONE)
130  {
131  _fuse_output_stage = true;
132  _memory_group.manage(&_mm_result_s32);
133  TensorInfo info_mm_result_s32(output->info()->tensor_shape(), 1, DataType::S32);
134  _mm_result_s32.allocator()->init(info_mm_result_s32);
135  }
136 
137  // Initialize assembly kernel meta-data
138  const AsmGemmInfo asm_info = init_assembly_metadata(gemm_info);
139 #ifdef __aarch64__
140  switch(a->info()->data_type())
141  {
142  case DataType::QASYMM8:
143  case DataType::QASYMM8_SIGNED:
144  case DataType::U8:
145  case DataType::S8:
146  {
147  if(is_data_type_quantized_asymmetric(a_to_use->info()->data_type()) && info.gemmlowp_output_stage().type == GEMMLowpOutputStageType::QUANTIZE_DOWN_FIXEDPOINT)
148  {
149  _asm_glue->configure(a_to_use, b, c, output, asm_info);
150  _fused_assembly_path = _asm_glue->is_configured();
151  }
152  else
153  {
154  _asm_glue->configure(a_to_use, b, nullptr, _fuse_output_stage ? &_mm_result_s32 : output, asm_info);
155  }
156  _assembly_path = _asm_glue->is_configured();
157  break;
158  }
159  default:
160  {
161  ARM_COMPUTE_ERROR("Datatype not supported");
162  break;
163  }
164  }
165 #endif /* __aarch64__ */
166  if(!(_assembly_path || _run_vector_matrix_multiplication))
167  {
168  matrix_a = &_tmp_a;
169  matrix_b = &_tmp_b;
170 
171  // The interleaved output matrix will have the following shape: [ a_height * 4, ceil(a_width / 4.0f) ]
172  TensorInfo a_info(compute_interleaved_shape(*a_to_use->info()), 1, a_to_use->info()->data_type(), a_to_use->info()->quantization_info());
173  // The transpose1xW output matrix will have the following shape: [ b_height * 16, ceil(b_width / 16.0f) ]
174  TensorInfo b_info(compute_transpose1xW_shape(*b->info()), 1, b->info()->data_type(), b->info()->quantization_info());
175  _tmp_a.allocator()->init(a_info);
176  _tmp_b.allocator()->init(b_info);
177  _memory_group.manage(&_tmp_a);
178  if(!_reshape_b_only_on_first_run)
179  {
180  _memory_group.manage(&_tmp_b);
181  }
182 
183  // Configure interleave kernel
184  _mtx_a_reshape_kernel = std::make_unique<NEGEMMInterleave4x4Kernel>();
185  _mtx_a_reshape_kernel->configure(a_to_use, &_tmp_a);
186 
187  // Configure transpose kernel
188  _mtx_b_reshape_kernel = std::make_unique<NEGEMMTranspose1xWKernel>();
189  _mtx_b_reshape_kernel->configure(b, &_tmp_b);
190  }
191 
192  if(!_fused_assembly_path)
193  {
194  // Build reduction info
195  const GEMMLowpReductionKernelInfo reduction_info(a_to_use->info()->dimension(0), false, 0, false);
196 
197  // Initialize matrix B reduction kernel only if _a_offset is not equal to 0
198  if(_a_offset != 0)
199  {
200  TensorInfo info_vector_sum_col(compute_reductionA_shape(*b->info()), 1, DataType::S32);
201 
202  _vector_sum_col.allocator()->init(info_vector_sum_col);
203  if(!_reshape_b_only_on_first_run)
204  {
205  _memory_group.manage(&_vector_sum_col);
206  }
207 
208  // Configure Matrix B reduction kernel
209  _mtx_b_reduction_kernel = std::make_unique<NEGEMMLowpMatrixBReductionKernel>();
210  _mtx_b_reduction_kernel->configure(b, &_vector_sum_col, reduction_info);
211  }
212 
213  // Initialize Matrix A reduction kernel only if _b_offset is not equal to 0
214  if(_b_offset != 0)
215  {
216  TensorInfo info_vector_sum_row(compute_reductionB_shape(*a_to_use->info()), 1, DataType::S32);
217 
218  _vector_sum_row.allocator()->init(info_vector_sum_row);
219  _memory_group.manage(&_vector_sum_row);
220 
221  // Configure matrix A reduction kernel
222  _mtx_a_reduction_kernel = std::make_unique<NEGEMMLowpMatrixAReductionKernel>();
223  _mtx_a_reduction_kernel->configure(a_to_use, &_vector_sum_row, reduction_info);
224  }
225 
226  if(_fuse_output_stage)
227  {
228  // Configure matrix multiply kernel
229  if(!_assembly_path)
230  {
231  _mm_kernel = std::make_unique<NEGEMMLowpMatrixMultiplyKernel>();
232  _mm_kernel->configure(matrix_a, matrix_b, &_mm_result_s32);
233  }
234 
235  _offset_contribution_output_stage_kernel = std::make_unique<NEGEMMLowpOffsetContributionOutputStageKernel>();
236  _offset_contribution_output_stage_kernel->configure(&_mm_result_s32,
237  _a_offset == 0 ? nullptr : &_vector_sum_col,
238  _b_offset == 0 ? nullptr : &_vector_sum_row, c,
239  _flip_signedness ? &_signed_output : output,
240  a->info()->dimension(0),
241  _a_offset, _b_offset, info.gemmlowp_output_stage());
242 
243  if(_flip_signedness)
244  {
245  _convert_from_signed_asymm = std::make_unique<NEConvertQuantizedSignednessKernel>();
246  _convert_from_signed_asymm->configure(&_signed_output, output);
247  }
248  }
249  else
250  {
251  // Configure matrix multiply kernel
252  if(!_assembly_path)
253  {
254  _mm_kernel = std::make_unique<NEGEMMLowpMatrixMultiplyKernel>();
255  _mm_kernel->configure(matrix_a, matrix_b, output);
256  }
257  // Configure offset contribution kernel
258  _offset_contribution_kernel = std::make_unique<NEGEMMLowpOffsetContributionKernel>();
259  _offset_contribution_kernel->configure(output, _a_offset == 0 ? nullptr : &_vector_sum_col, _b_offset == 0 ? nullptr : &_vector_sum_row, a_to_use->info()->dimension(0), _a_offset, _b_offset);
260  }
261  }
262  // Configure activation
263  const ActivationLayerInfo &activation = gemm_info.activation_info();
264  _run_activation = activation.enabled() && (!_assembly_path || !NEGEMMAssemblyDispatch::is_activation_supported(activation));
265  if(_run_activation)
266  {
267  _activation_func.configure(output, nullptr, activation);
268  }
269 
270  // Allocate tensors
271  if(!_assembly_path && !_run_vector_matrix_multiplication)
272  {
273  _tmp_a.allocator()->allocate();
274  if(!_reshape_b_only_on_first_run)
275  {
276  _tmp_b.allocator()->allocate();
277  }
278  }
279 
280  if(!_fused_assembly_path)
281  {
282  if(_a_offset != 0 && !_reshape_b_only_on_first_run)
283  {
284  _vector_sum_col.allocator()->allocate();
285  }
286 
287  if(_b_offset != 0)
288  {
289  _vector_sum_row.allocator()->allocate();
290  }
291  }
292 
293  if(_fuse_output_stage)
294  {
295  _mm_result_s32.allocator()->allocate();
296  }
297 
298  if(_flip_signedness)
299  {
300  _signed_a.allocator()->allocate();
301  _signed_output.allocator()->allocate();
302  }
303 }

◆ operator=() [1/2]

Prevent instances of this class from being copied (As this class contains pointers)

◆ operator=() [2/2]

Default move assignment operator.

◆ prepare()

void prepare ( )
override virtual

Prepare the function for executing.

Any one-off pre-processing step required by the function is handled here.

Note
Prepare stage might not need all the function's buffers' backing memory to be available in order to execute
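
A sketch of paying the one-off cost up front (illustrative helper; run() would otherwise trigger the same preparation on its first call):

#include "arm_compute/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.h"

// Hypothetical helper: the function is assumed to be already configured and
// its tensors allocated.
void run_many(arm_compute::NEGEMMLowpMatrixMultiplyCore &gemmlowp, int iterations)
{
    // One-off pre-processing, e.g. the matrix B reshape/reduction when
    // reshape_b_only_on_first_run is set, happens here instead of inside the first run()
    gemmlowp.prepare();

    for(int i = 0; i < iterations; ++i)
    {
        // ... refresh the input tensors' contents here ...
        gemmlowp.run();
    }
}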

Reimplemented from IFunction.

Definition at line 573 of file NEGEMMLowpMatrixMultiplyCore.cpp.

References TensorAllocator::allocate(), Tensor::allocator(), IWeightsManager::are_weights_managed(), ARM_COMPUTE_ERROR_ON, Window::DimX, Window::DimY, Scheduler::get(), ITensor::is_used(), ITensor::mark_as_unused(), and IScheduler::schedule().

Referenced by NEGEMMConvolutionLayer::prepare(), and NEGEMMLowpMatrixMultiplyCore::run().

574 {
575  if(!_is_prepared)
576  {
577  const bool original_b_managed_by_weights_manager = _weights_manager && _weights_manager->are_weights_managed(_original_b);
578  // Run assembly reshape
579  if(_asm_glue->is_configured())
580  {
581  if(!original_b_managed_by_weights_manager)
582  {
583  ARM_COMPUTE_ERROR_ON(!_original_b->is_used());
584  }
585 
586  _asm_glue->prepare();
587  if(!original_b_managed_by_weights_manager)
588  {
589  _original_b->mark_as_unused();
590  }
591  }
592  // Run non-assembly reshape
593  else if(_reshape_b_only_on_first_run && !_run_vector_matrix_multiplication && !_asm_glue->is_configured())
594  {
595  if(!original_b_managed_by_weights_manager)
596  {
597  ARM_COMPUTE_ERROR_ON(!_original_b->is_used());
598  }
599 
600  // Run reshape kernel and mark original weights tensor as unused
601  _tmp_b.allocator()->allocate();
602  NEScheduler::get().schedule(_mtx_b_reshape_kernel.get(), Window::DimY);
603  if(!original_b_managed_by_weights_manager)
604  {
605  _original_b->mark_as_unused();
606  }
607  }
608 
609  // Run matrix B reduction kernel only if _a_offset is not equal to 0
610  if(!_fused_assembly_path && _a_offset != 0 && _reshape_b_only_on_first_run)
611  {
612  _vector_sum_col.allocator()->allocate();
613  NEScheduler::get().schedule(_mtx_b_reduction_kernel.get(), Window::DimX);
614  }
615 
616  _is_prepared = true;
617  }
618 }

◆ run()

void run ( )
override virtual

Run the kernels contained in the function.

For Neon kernels:

  • Multi-threading is used for the kernels which are parallelisable.
  • By default std::thread::hardware_concurrency() threads are used.
Note
CPPScheduler::set_num_threads() can be used to manually set the number of threads

For OpenCL kernels:

  • All the kernels are enqueued on the queue associated with CLScheduler.
  • The queue is then flushed.
Note
The function will not block until the kernels are executed. It is the user's responsibility to wait.
Will call prepare() on the first run if it has not been done already.
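
For the Neon path, the thread-count note above can be exercised through the scheduler singleton (a sketch; four threads is an arbitrary choice, and Scheduler::get() is used here rather than naming CPPScheduler directly):

#include "arm_compute/runtime/Scheduler.h"

void limit_cpu_threads()
{
    // By default std::thread::hardware_concurrency() threads are used;
    // cap the kernels scheduled by run() at four worker threads instead
    arm_compute::Scheduler::get().set_num_threads(4);
}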

Implements IFunction.

Definition at line 501 of file NEGEMMLowpMatrixMultiplyCore.cpp.

References Window::DimX, Window::DimY, Scheduler::get(), NEGEMMLowpMatrixMultiplyCore::prepare(), NEActivationLayer::run(), and IScheduler::schedule().

Referenced by main(), NELSTMLayerQuantized::run(), NEFullyConnectedLayer::run(), NEQLSTMLayer::run(), and NEGEMMConvolutionLayer::run().

502 {
503  prepare();
504 
505  MemoryGroupResourceScope scope_mg(_memory_group);
506 
507  // Convert QASYMM8->QASYMM8_SIGNED
508  if(_flip_signedness)
509  {
510  NEScheduler::get().schedule(_convert_to_signed_asymm.get(), Window::DimY);
511  }
512 
513  // Run GEMM
514  if(_asm_glue->is_configured())
515  {
516  _asm_glue->run();
517  }
518  else
519  {
520  if(!_run_vector_matrix_multiplication)
521  {
522  // Run interleave kernel
523  NEScheduler::get().schedule(_mtx_a_reshape_kernel.get(), Window::DimY);
524 
525  if(!_reshape_b_only_on_first_run)
526  {
527  // Run transpose kernel
528  NEScheduler::get().schedule(_mtx_b_reshape_kernel.get(), Window::DimY);
529  }
530  }
531  NEScheduler::get().schedule(_mm_kernel.get(), Window::DimY);
532  }
533 
534  if(!_fused_assembly_path)
535  {
536  // Run matrix A reduction kernel only if _b_offset is not equal to 0
537  if(_b_offset != 0)
538  {
539  NEScheduler::get().schedule(_mtx_a_reduction_kernel.get(), Window::DimX);
540  }
541 
542  // Run matrix B reduction kernel only if _a_offset is not equal to 0
543  if(_a_offset != 0 && !_reshape_b_only_on_first_run)
544  {
545  NEScheduler::get().schedule(_mtx_b_reduction_kernel.get(), Window::DimX);
546  }
547 
548  if(_fuse_output_stage)
549  {
550  // Run offset contribution kernel
551  NEScheduler::get().schedule(_offset_contribution_output_stage_kernel.get(), Window::DimY);
552  }
553  else
554  {
555  // Run offset contribution kernel
556  NEScheduler::get().schedule(_offset_contribution_kernel.get(), Window::DimY);
557  }
558  }
559 
560  // Convert QASYMM8_SIGNED->QASYMM8
561  if(!_fused_assembly_path && _fuse_output_stage && _flip_signedness)
562  {
563  NEScheduler::get().schedule(_convert_from_signed_asymm.get(), Window::DimY);
564  }
565 
566  // Run fused activation unless already run in the fused assembly
567  if(_run_activation)
568  {
569  _activation_func.run();
570  }
571 }

◆ validate()

Status validate ( const ITensorInfo * a,
const ITensorInfo * b,
const ITensorInfo * c,
const ITensorInfo * output,
const GEMMInfo & gemm_info = GEMMInfo() 
)
static

Static function to check if given info will lead to a valid configuration of NEGEMMLowpMatrixMultiplyCore.

Note
The output type is S32 if gemm_info.gemmlowp_output_stage().type == GEMMLowpOutputStageType::NONE. It is QASYMM8/QASYMM8_SIGNED otherwise.
Parameters
[in]  a          First input tensor info (Matrix A). Data type supported: QASYMM8/QASYMM8_SIGNED.
[in]  b          Second input tensor info (Matrix B). Data type supported: QASYMM8/QASYMM8_SIGNED/QSYMM8/QSYMM8_PER_CHANNEL.
[in]  c          Third input tensor info (Matrix C). It can be a nullptr. Data type supported: S32
[in]  output     Output tensor info. Data type supported: S32/QASYMM8/QASYMM8_SIGNED
[in]  gemm_info  (Optional) Specifies if the matrix A and/or matrix B have been reshaped and if the reshape of matrix B should be executed only for the first run
Returns
a status
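
A sketch of the validate-before-configure pattern using only tensor metadata (the shapes, data types and quantization parameters below are illustrative):

#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.h"

#include <iostream>

using namespace arm_compute;

bool can_run_gemmlowp()
{
    // A is 16x32 QASYMM8, B is 32x8 QASYMM8, output is 16x8 S32 (no output stage requested)
    const TensorInfo a(TensorShape(32U, 16U), 1, DataType::QASYMM8, QuantizationInfo(0.5f, 10));
    const TensorInfo b(TensorShape(8U, 32U), 1, DataType::QASYMM8, QuantizationInfo(0.25f, 3));
    const TensorInfo dst(TensorShape(8U, 16U), 1, DataType::S32);

    const Status status = NEGEMMLowpMatrixMultiplyCore::validate(&a, &b, nullptr, &dst);
    if(!bool(status))
    {
        std::cerr << status.error_description() << std::endl;
        return false;
    }
    return true;
}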

Definition at line 305 of file NEGEMMLowpMatrixMultiplyCore.cpp.

References GEMMInfo::activation_info(), ARM_COMPUTE_RETURN_ERROR_ON, ARM_COMPUTE_RETURN_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN, ARM_COMPUTE_RETURN_ERROR_ON_MSG, ARM_COMPUTE_RETURN_ON_ERROR, arm_compute::auto_init_if_empty(), arm_compute::test::validation::b, ICloneable< T >::clone(), arm_compute::misc::shape_calculator::compute_reductionA_shape(), arm_compute::misc::shape_calculator::compute_reductionB_shape(), ITensorInfo::data_type(), ITensorInfo::dimension(), dt, ActivationLayerInfo::enabled(), GEMMLowpOutputStageInfo::gemmlowp_max_bound, GEMMLowpOutputStageInfo::gemmlowp_min_bound, GEMMLowpOutputStageInfo::gemmlowp_offset, GEMMInfo::gemmlowp_output_stage(), GEMMInfo::is_a_reshaped(), GEMMInfo::is_b_reshaped(), arm_compute::is_data_type_quantized_asymmetric(), arm_compute::is_data_type_quantized_per_channel(), arm_compute::NONE, UniformQuantizationInfo::offset, arm_compute::QASYMM8, arm_compute::QASYMM8_SIGNED, arm_compute::QSYMM8, arm_compute::QSYMM8_PER_CHANNEL, ITensorInfo::quantization_info(), arm_compute::QUANTIZE_DOWN_FIXEDPOINT, arm_compute::S32, UniformQuantizationInfo::scale, TensorShape::set(), ITensorInfo::tensor_shape(), GEMMLowpOutputStageInfo::type, QuantizationInfo::uniform(), NEConvertQuantizedSignednessKernel::validate(), NEGEMMLowpMatrixMultiplyKernel::validate(), NEActivationLayer::validate(), NEGEMMInterleave4x4Kernel::validate(), NEGEMMLowpOffsetContributionKernel::validate(), NEGEMMTranspose1xWKernel::validate(), NEGEMMAssemblyDispatch::validate(), NEGEMMLowpOffsetContributionOutputStageKernel::validate(), NEGEMMLowpMatrixAReductionKernel::validate(), and NEGEMMLowpMatrixBReductionKernel::validate().

Referenced by NEGEMMLowpMatrixMultiplyCore::configure(), arm_compute::test::validation::DATA_TEST_CASE(), and NELSTMLayerQuantized::validate().

306 {
310  ARM_COMPUTE_RETURN_ERROR_ON_MSG(c != nullptr && gemm_info.gemmlowp_output_stage().type == GEMMLowpOutputStageType::NONE, "Bias addition not supported in NEGEMMLowpMatrixMultiplyCore for output S32");
311  ARM_COMPUTE_RETURN_ERROR_ON_MSG((a)->dimension(0) != (b)->dimension(1),
312  "The product AB is defined only if the number of columns in A is equal to the number of rows in B");
313  ARM_COMPUTE_RETURN_ERROR_ON_MSG(gemm_info.is_a_reshaped(), "Matrix A already reshaped is not supported");
314  ARM_COMPUTE_RETURN_ERROR_ON_MSG(gemm_info.is_b_reshaped(), "Matrix B already reshaped is not supported");
315 
316  GEMMInfo info = gemm_info;
317  const ITensorInfo *matrix_a_info = a;
318  const ITensorInfo *matrix_b_info = b;
319 
320  const ITensorInfo *a_to_use = a;
321 
322  TensorInfo tmp_a_info{};
323  TensorInfo tmp_b_info{};
324  TensorInfo mm_result_s32_info{};
325 
326  int32_t a_offset = a->quantization_info().uniform().offset;
327  int32_t b_offset = b->quantization_info().uniform().offset;
328 
329  bool fuse_output_stage = info.gemmlowp_output_stage().type != GEMMLowpOutputStageType::NONE;
330  if(fuse_output_stage)
331  {
332  auto_init_if_empty(mm_result_s32_info, a->clone()->set_tensor_shape(output->tensor_shape()).set_data_type(DataType::S32));
333  }
334 
335  // Convert QASYMM8->QASYMM8_SIGNED
336  TensorInfo signed_a{};
337  TensorInfo signed_output{};
338  bool flip_signedness = is_data_type_quantized_per_channel(b->data_type()) && (a->data_type() == DataType::QASYMM8) && info.reshape_b_only_on_first_run();
339  if(flip_signedness)
340  {
341  const int32_t offset_correction = 128;
342  const DataType dt = DataType::QASYMM8_SIGNED;
343  const UniformQuantizationInfo iqinfo = a_to_use->quantization_info().uniform();
344 
345  signed_a = a_to_use->clone()->set_data_type(dt).set_quantization_info(QuantizationInfo(iqinfo.scale, iqinfo.offset + offset_correction));
346  ARM_COMPUTE_RETURN_ON_ERROR(NEConvertQuantizedSignednessKernel::validate(a_to_use, &signed_a));
347  a_to_use = &signed_a;
348  a_offset = signed_a.quantization_info().uniform().offset;
349 
350  const UniformQuantizationInfo oqinfo = output->quantization_info().uniform();
351  signed_output = output->clone()->set_data_type(dt).set_quantization_info(QuantizationInfo(oqinfo.scale, oqinfo.offset - offset_correction));
352 
353  // Output stage correction
354  GEMMLowpOutputStageInfo output_stage_corr = info.gemmlowp_output_stage();
355  output_stage_corr.gemmlowp_offset = signed_output.quantization_info().uniform().offset;
356  output_stage_corr.gemmlowp_min_bound -= offset_correction;
357  output_stage_corr.gemmlowp_max_bound -= offset_correction;
358  info.set_gemmlowp_output_stage(output_stage_corr);
359 
360  // Update matrix a
361  matrix_a_info = &signed_a;
362  }
363 
364  // Initialize assembly kernel meta-data
365  const AsmGemmInfo asm_info = init_assembly_metadata(info);
366 
367  // Check if we need to run the optimized assembly kernel
368  bool run_optimised = false;
369  bool run_optimised_requantized = false;
370  if(is_data_type_quantized_asymmetric(a_to_use->data_type()) && info.gemmlowp_output_stage().type == GEMMLowpOutputStageType::QUANTIZE_DOWN_FIXEDPOINT)
371  {
372  run_optimised = bool(NEGEMMAssemblyDispatch::validate(a_to_use, b, c, output, asm_info));
373  run_optimised_requantized = run_optimised;
374  }
375  else
376  {
377  run_optimised = bool(NEGEMMAssemblyDispatch::validate(a_to_use, b, nullptr, fuse_output_stage ? &mm_result_s32_info : output, asm_info));
378  }
379 
380  if(run_optimised)
381  {
382  ARM_COMPUTE_RETURN_ERROR_ON(b->dimension(0) != output->dimension(0));
383  if(info.depth_output_gemm3d() != 0)
384  {
385  if(info.reinterpret_input_as_3d())
386  {
387  ARM_COMPUTE_RETURN_ERROR_ON(a->dimension(1) != output->dimension(1));
388  ARM_COMPUTE_RETURN_ERROR_ON(a->dimension(2) != output->dimension(2));
389  }
390  else
391  {
392  ARM_COMPUTE_RETURN_ERROR_ON(a->dimension(1) != output->dimension(1) * output->dimension(2));
393  }
394  }
395  else
396  {
397  ARM_COMPUTE_RETURN_ERROR_ON(a->dimension(1) != output->dimension(1));
398  }
399  }
400  else
401  {
402  ARM_COMPUTE_RETURN_ERROR_ON_MSG(info.reinterpret_input_as_3d(), "NEGEMM cannot reinterpret the input tensor as 3D");
403  ARM_COMPUTE_RETURN_ERROR_ON_MSG(info.depth_output_gemm3d() != 0, "NEGEMM cannot reinterpret the output tensor as 3D");
404 
405  const bool run_vector_matrix_multiplication = a->dimension(1) < 2;
406  if(!run_vector_matrix_multiplication)
407  {
408  matrix_a_info = &tmp_a_info;
409  matrix_b_info = &tmp_b_info;
410 
411  // The interleaved output matrix will have the following shape: [ a_height * 4, ceil(a_width / 4.0f) ]
412  TensorShape shape_tmp_a = a->tensor_shape();
413  shape_tmp_a.set(0, a->dimension(0) * 4);
414  shape_tmp_a.set(1, std::ceil(a->dimension(1) / 4.f));
415 
416  // The transpose1xW output matrix will have the following shape: [ b_height * 16, ceil(b_width / 16.0f) ]
417  TensorShape shape_tmp_b = b->tensor_shape();
418  shape_tmp_b.set(0, b->dimension(1) * 16);
419  shape_tmp_b.set(1, std::ceil(b->dimension(0) / 16.f));
420 
421  // Validate interleave kernel
422  auto_init_if_empty(tmp_a_info, a_to_use->clone()->set_tensor_shape(shape_tmp_a));
423  auto_init_if_empty(tmp_b_info, b->clone()->set_tensor_shape(shape_tmp_b));
424 
425  ARM_COMPUTE_RETURN_ON_ERROR(NEGEMMInterleave4x4Kernel::validate(a_to_use, &tmp_a_info));
426  ARM_COMPUTE_RETURN_ON_ERROR(NEGEMMTranspose1xWKernel::validate(b, &tmp_b_info));
427  }
428  }
429 
430  if(!run_optimised_requantized)
431  {
432  TensorInfo info_vector_sum_col{};
433  TensorInfo info_vector_sum_row{};
434 
435  const GEMMLowpReductionKernelInfo reduction_info(a_to_use->dimension(0), false, 0, false);
436 
437  // Validate matrix B reduction kernel only if _a_offset is not equal to 0
438  if(a_offset != 0)
439  {
440  info_vector_sum_col = TensorInfo(compute_reductionA_shape(*b), 1, DataType::S32);
441 
442  // Configure Matrix B reduction kernel
443  ARM_COMPUTE_RETURN_ON_ERROR(NEGEMMLowpMatrixBReductionKernel::validate(b, &info_vector_sum_col, reduction_info));
444  }
445 
446  // Validate Matrix A reduction kernel only if _b_offset is not equal to 0
447  if(b_offset != 0)
448  {
449  info_vector_sum_row = TensorInfo(compute_reductionB_shape(*a), 1, DataType::S32);
450 
451  // Configure matrix A reduction kernel
452  ARM_COMPUTE_RETURN_ON_ERROR(NEGEMMLowpMatrixAReductionKernel::validate(a_to_use, &info_vector_sum_row, reduction_info));
453  }
454 
455  if(fuse_output_stage)
456  {
457  if(!run_optimised)
458  {
459  ARM_COMPUTE_RETURN_ERROR_ON_MSG(info.reinterpret_input_as_3d(), "NEGEMMLowpMatrixMultiplyKernel cannot reinterpret the input tensor as 3D");
460  ARM_COMPUTE_RETURN_ERROR_ON_MSG(info.depth_output_gemm3d() != 0, "NEGEMMLowpMatrixMultiplyKernel cannot reinterpret the output tensor as 3D");
461 
462  ARM_COMPUTE_RETURN_ON_ERROR(NEGEMMLowpMatrixMultiplyKernel::validate(matrix_a_info, matrix_b_info, &mm_result_s32_info));
463  }
464 
465  // Validate offset contribution kernel
466  ARM_COMPUTE_RETURN_ON_ERROR(NEGEMMLowpOffsetContributionOutputStageKernel::validate(&mm_result_s32_info,
467  a_offset == 0 ? nullptr : &info_vector_sum_col,
468  b_offset == 0 ? nullptr : &info_vector_sum_row,
469  c,
470  flip_signedness ? &signed_output : output,
471  a_offset, b_offset,
472  info.gemmlowp_output_stage()));
473  }
474  else
475  {
476  if(!run_optimised)
477  {
478  ARM_COMPUTE_RETURN_ERROR_ON_MSG(info.reinterpret_input_as_3d(), "NEGEMMLowpMatrixMultiplyKernel cannot reinterpret the input tensor as 3D");
479  ARM_COMPUTE_RETURN_ERROR_ON_MSG(info.depth_output_gemm3d() != 0, "NEGEMMLowpMatrixMultiplyKernel cannot reinterpret the output tensor as 3D");
480 
481  ARM_COMPUTE_RETURN_ON_ERROR(NEGEMMLowpMatrixMultiplyKernel::validate(matrix_a_info, matrix_b_info, output));
482  }
483  // Validate offset contribution kernel
484  ARM_COMPUTE_RETURN_ON_ERROR(NEGEMMLowpOffsetContributionKernel::validate(output,
485  a_offset == 0 ? nullptr : &info_vector_sum_col,
486  b_offset == 0 ? nullptr : &info_vector_sum_row,
487  a_offset, b_offset));
488  }
489  }
490 
491  // Validate activation
492  const ActivationLayerInfo &activation = gemm_info.activation_info();
493  if(activation.enabled())
494  {
495  ARM_COMPUTE_RETURN_ON_ERROR(NEActivationLayer::validate(output, nullptr, activation));
496  }
497 
498  return Status{};
499 }

The documentation for this class was generated from the following files:

  • NEGEMMLowpMatrixMultiplyCore.h
  • NEGEMMLowpMatrixMultiplyCore.cpp