
[op compatibility] matMul #27

Closed
nsthorat opened this issue Sep 16, 2019 · 18 comments

Comments

@nsthorat

nsthorat commented Sep 16, 2019

This issue will track op compatibility resolution for matMul.

Signature:
matmul(a, b)

Arguments:
a: n-dim tensor
b: n-dim tensor

Docstring:
If both a and b are 2-D, they are multiplied like conventional matrices. If either a or b is 1-D, it is treated as a matrix-vector dot product.

If either argument is N-dimensional, N > 2, it is treated as a stack of matrices residing in the last two indices, and the matrix multiplication is broadcast accordingly.

Example:
If a has shape [2, 3, 4, 5] and b has shape [5, 4], the resulting tensor will have shape [2, 3, 4, 4]: a is treated as a stack of 2 * 3 = 6 matrices of shape [4, 5], each of which is multiplied by the [5, 4] matrix, producing six [4, 4] matrices. The outer dimensions of a are preserved, giving the final shape [2, 3, 4, 4].
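
These shapes can be checked against numpy's matmul, which this definition follows (a minimal sketch for illustration):

```python
import numpy as np

a = np.ones((2, 3, 4, 5))
b = np.ones((5, 4))
print(np.matmul(a, b).shape)  # (2, 3, 4, 4)
```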

Notes:

  • Does not support fused bias (this will be handled by the graph optimizer)
  • Does not support transpose arguments (this is handled by the graph optimizer)

To be discussed:

  • Compatibility with underlying APIs / hardware
@BenjaminPoulain

If either argument is N dimensional, N>2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.

Can you please refine this into a precise definition?

@BenjaminPoulain

CoreML & MPS expose the concept of accumulator precision. That is useful to strike a balance between precision and performance.

By default, the accumulator precision could be the same as the type (e.g. a float16 accumulator for float16 matmul). A high precision accumulator makes a best effort to increase precision (e.g. float32 accumulator for float16 inputs).

What do you think about having an optional 3rd argument to provide the precision?

matmul(a, b, options)

e.g.:

matmul(a, b, {'precision': 'high'})
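
To illustrate why the accumulator precision matters, here is a minimal numpy sketch (not the proposed API; the 4096-element size is arbitrary):

```python
import numpy as np

a = np.random.rand(1, 4096).astype(np.float16)
b = np.random.rand(4096, 1).astype(np.float16)

# Accumulate the same dot product in float16 vs. float32.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for i in range(4096):
    acc16 = np.float16(acc16 + a[0, i] * b[i, 0])        # 'default' float16 accumulator
    acc32 += np.float32(a[0, i]) * np.float32(b[i, 0])    # 'high' precision float32 accumulator

print(acc16, acc32)  # the float16 running sum accumulates visible rounding error
```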

@nsthorat
Author

nsthorat commented Sep 20, 2019

If either argument is N dimensional, N>2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.

Can you please refine this into a precise definition?

Updated the description a bit but this is the definition from numpy -- I added an example that hopefully makes this clearer.

CoreML & MPS expose the concept of accumulator precision. That is useful to strike a balance between precision and performance.

By default, the accumulator precision could be the same as the type (e.g. a float16 accumulator for float16 matmul). A high precision accumulator makes a best effort to increase precision (e.g. float32 accumulator for float16 inputs).

What do you think about having an optional 3rd argument to provide the precision?

matmul(a, b, options)

e.g.:

matmul(a, b, {'precision': 'high'})

Great point! I think to resolve this we should try to understand how other accelerators do this -- if some accelerators do not have this option then we can only resolve this by defaulting to one or the other.

@BenjaminPoulain

If either argument is N dimensional, N>2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.

Can you please refine this into a precise definition?

Updated the description a bit but this is the definition from numpy -- I added an example that hopefully makes this clearer.

Thank you for the update. It's very clear with the example.

@wchao1115
Collaborator

@BenjaminPoulain the accumulator precision is a great point, especially when one considers supporting lower-precision computation, e.g. float16 or bfloat16. But this is not a property specific to this one operator. For example, there are implementations of conv2d that result in matrix multiplication; should a consistent accumulator precision be specified for them as well? I suppose this should be a property at a higher scope, maybe at an inference session scope, in order to ensure consistency throughout the graph.

Note that DirectML supports float16 with an option for the caller to specify a preferred accumulation precision as a hint. It is a hint because the underlying GPU may not be able to satisfy the requirement fully. However, by making this option a device setting, it ensures consistent results throughout the entire graph.

@BenjaminPoulain

For example, there are implementations of conv2d that results in matrix multiplication, should a consistent accumulator precision be specified for it as well?

I believe it should. I raised this issue for MatMul because it was making more progress than Conv2d at the time, but I agree the precision should be defined whenever the size of the accumulator and/or the order of operations matters.

I suppose this should be a property at a higher scope, may be at an inference session scope, in order to ensure consistency through the graph.

There is a legitimate use case to specify the precision per operation.

I have nothing against your proposal to make it global at first. Fine-grained optimizations are less portable and may not be suitable across browsers.

@huningxin
Contributor

  • Compatibility with underlying APIs / hardware

| API | op | a | b | output |
|---|---|---|---|---|
| WebNN | matmul | a | b | output |
| NNAPI | ANEURALNETWORKS_FULLY_CONNECTED | If N > 3, reshape (ANEURALNETWORKS_RESHAPE) to a 3-D tensor. If N == 3, slice (ANEURALNETWORKS_RESHAPE) to 2-D tensors along axis 0. Use inputs[0] of ANEURALNETWORKS_FULLY_CONNECTED. | If N > 3, reshape (ANEURALNETWORKS_RESHAPE) to a 3-D tensor. If N == 3, slice (ANEURALNETWORKS_RESHAPE) to 2-D tensors along axis 0. Transpose the tensor (ANEURALNETWORKS_TRANSPOSE). Use inputs[1] of ANEURALNETWORKS_FULLY_CONNECTED. | As the ANEURALNETWORKS_FULLY_CONNECTED output is 2-D, reshape (ANEURALNETWORKS_RESHAPE) to 3-D tensors, concat (ANEURALNETWORKS_CONCATENATION) along axis 0, and reshape (ANEURALNETWORKS_RESHAPE) to N-D. |
| DML | DML_GEMM_OPERATOR_DESC | ATensor of DML_GEMM_OPERATOR_DESC | BTensor of DML_GEMM_OPERATOR_DESC | OutputTensor of DML_GEMM_OPERATOR_DESC |
| MPS | MPSNDArrayMatrixMultiplication or MPSMatrixMultiplication or MPSCNNFullyConnected? | | | |
| BNNS | BNNSFilterCreateFullyConnectedLayer or vDSP_mmul? | | | |
| DNNL | matmul | If N > 3, reorder to a 3-D tensor. Use DNNL_ARG_SRC of the matmul primitive. | If N > 3, reorder to a 3-D tensor. Use DNNL_ARG_WEIGHTS of the matmul primitive. | Use DNNL_ARG_DST of the matmul primitive. If N > 3, reorder to N-D. |
| ONNX | MatMul | A | B | Y |
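
To make the reshape/slice strategy in the NNAPI and DNNL rows concrete, here is a small numpy sketch (illustrative only, not tied to any of the backends above) that lowers an N-D a times 2-D b matmul to a loop of 2-D GEMMs:

```python
import numpy as np

def matmul_nd_via_2d(a, b):
    # a: N-D tensor (N > 2), b: 2-D tensor, as in the example from the description.
    batch_shape = a.shape[:-2]          # e.g. (2, 3)
    m, k = a.shape[-2:]                 # e.g. (4, 5)
    k2, n = b.shape                     # e.g. (5, 4)
    assert k == k2
    a3d = a.reshape(-1, m, k)           # collapse the outer dims into one stack dim
    out = np.stack([a3d[i] @ b for i in range(a3d.shape[0])])  # one 2-D GEMM per slice
    return out.reshape(*batch_shape, m, n)  # restore the original outer dims

a = np.random.rand(2, 3, 4, 5)
b = np.random.rand(5, 4)
np.testing.assert_allclose(matmul_nd_via_2d(a, b), np.matmul(a, b))
```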

Opens:

  • NNAPI: for N-D output when N > 2, is it correct to map the TF-Lite Pack op to NNAPI ANEURALNETWORKS_RESHAPE and ANEURALNETWORKS_CONCATENATION ops?
  • DML: for N-D inputs when N > 2, how does it map to DML_GEMM_OPERATOR_DESC's 2-D inputs?
  • MPS: which is the right op to map?
  • BNNS: which is the right op to map?

@anssiko
Member

anssiko commented Mar 2, 2020

@huningxin thank you! Feel free to submit a PR to add the matMul compatibility table to https://github.com/webmachinelearning/webnn/tree/master/op_compatibility so it can be collaboratively edited.

@wchao1115
Collaborator

@huningxin Please go ahead and submit your PR. I'll fix it up for DML and ONNX.

For example, DML_GEMM_OPERATOR_DESC already supports N-D; the current documentation is just not up to date. ONNX's Gemm operator only supports 2-D while MatMul supports N-D. The fact that both are defined is a bit redundant and confusing.

@huningxin
Contributor

@wchao1115, the table is merged: https://github.com/webmachinelearning/webnn/blob/master/op_compatibility/matmul.md. Feel free to create a PR to fix it up. Thanks.

@huningxin
Contributor

  • MPS: which is the right op to map?
  • BNNS: which is the right op to map?

@BenjaminPoulain , any suggestions? Thanks!

@huningxin
Contributor

CoreML has BatchedMatMul that is

A layer that computes the matrix multiplication of two tensors with numpy-like broadcasting
where the matrices reside in the last two indices of the tensor.

I suppose it is compatible with the current proposal.

With that, I propose we start crafting the PR for the matmul op definition. Meanwhile, the mapping details of MPS/BNNS in matmul.md can be filled in with a separate PR.

@anssiko @wchao1115 @nsthorat @BenjaminPoulain what do you think?

@anssiko
Member

anssiko commented Mar 10, 2020

Sounds good to me. PR review to be requested from folks tagged.

@wchao1115
Collaborator

@huningxin the 'dot' operation in XLA-HLO does the same, except that it only supports N <= 2, supposedly because broadcasting is factored out as a separate 'broadcast' operation. This is a good example of how XLA-HLO is built from sub-operator-level constructs.

@huningxin
Contributor

Thanks @wchao1115. The dotGeneral operation of XLA-HLO supports N > 2 batch matmul semantics.
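
For concreteness, a minimal sketch of those batch matmul semantics via jax.lax.dot_general (shapes borrowed from the example earlier in this issue; note that the batch dimensions must match, since broadcasting is a separate step in XLA-HLO):

```python
import jax.numpy as jnp
from jax import lax

a = jnp.ones((2, 3, 4, 5))
b = jnp.ones((2, 3, 5, 4))  # batch dims must match; a [5, 4] b would need an explicit broadcast first

# dimension_numbers = ((lhs_contracting, rhs_contracting), (lhs_batch, rhs_batch))
c = lax.dot_general(a, b, dimension_numbers=(((3,), (2,)), ((0, 1), (0, 1))))
print(c.shape)  # (2, 3, 4, 4)
```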

@wchao1115
Collaborator

Thanks @wchao1115. The dotGeneral operation of XLA-HLO supports N > 2 batch matmul semantics.

Correct. The point is they cut things up into smaller pieces.

huningxin mentioned this issue Mar 11, 2020
@huningxin
Contributor

Sounds good to me. PR review to be requested from folks tagged.

Done. #49 is open for review.

@dontcallmedom
Contributor

This was solved by #49.
