
[op compatibility] matMul #27

Closed
nsthorat opened this issue Sep 16, 2019 · 18 comments

Comments

@nsthorat

nsthorat commented Sep 16, 2019

This issue will track op compatibility resolution for matMul.

Signature:
matmul(a, b)

Arguments:
a: n-dim tensor
b: n-dim tensor

Docstring:
If both a and b are 2-D, they are multiplied like conventional matrices. If either a or b is 1-D, it is treated as a matrix-vector dot product.

If either argument is N-dimensional, N > 2, it is treated as a stack of matrices residing in the last two indices, and the matrix multiplication is broadcast accordingly.

Example:
If a has shape [2, 3, 4, 5] and b has shape [5, 4], the resulting tensor will have shape [2, 3, 4, 4]: a is treated as a stack of 2 * 3 = 6 matrices of shape [4, 5], each of which is multiplied by the [5, 4] matrix, producing six [4, 4] matrices. The outer dimensions of a are preserved, giving the final shape [2, 3, 4, 4].
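
These shapes can be checked against numpy's matmul, which this definition follows (a minimal sketch for illustration):

```python
import numpy as np

a = np.ones((2, 3, 4, 5))
b = np.ones((5, 4))
print(np.matmul(a, b).shape)  # (2, 3, 4, 4)
```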

Notes:

  • Does not support fused bias (this will be handled by the graph optimizer)
  • Does not support transpose arguments (this is handled by the graph optimizer)

To be discussed:

  • Compatibility with underlying APIs / hardware
@BenjaminPoulain

If either argument is N dimensional, N>2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.

Can you please refine this into a precise definition?

@BenjaminPoulain

CoreML & MPS expose the concept of accumulator precision. That is useful to strike a balance between precision and performance.

By default, the accumulator precision could be the same as the type (e.g. a float16 accumulator for float16 matmul). A high precision accumulator makes a best effort to increase precision (e.g. float32 accumulator for float16 inputs).

What do you think about having an optional 3rd argument to provide the precision?

matmul(a, b, options)

e.g.:

matmul(a, b, {'precision': 'high'})
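
To illustrate why the accumulator precision matters, here is a minimal numpy sketch (not the proposed API; the 4096-element size is arbitrary):

```python
import numpy as np

a = np.random.rand(1, 4096).astype(np.float16)
b = np.random.rand(4096, 1).astype(np.float16)

# Accumulate the same dot product in float16 vs. float32.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for i in range(4096):
    acc16 = np.float16(acc16 + a[0, i] * b[i, 0])        # 'default' float16 accumulator
    acc32 += np.float32(a[0, i]) * np.float32(b[i, 0])    # 'high' precision float32 accumulator

print(acc16, acc32)  # the float16 running sum accumulates visible rounding error
```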

@nsthorat
Author

nsthorat commented Sep 20, 2019

If either argument is N dimensional, N>2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.

Can you please refine this into a precise definition?

Updated the description a bit but this is the definition from numpy -- I added an example that hopefully makes this clearer.

CoreML & MPS expose the concept of accumulator precision. That is useful to strike a balance between precision and performance.

By default, the accumulator precision could be the same as the type (e.g. a float16 accumulator for float16 matmul). A high precision accumulator makes a best effort to increase precision (e.g. float32 accumulator for float16 inputs).

What do you think about having an optional 3rd argument to provide the precision?

matmul(a, b, options)

e.g.:

matmul(a, b, {'precision': 'high'})

Great point! I think to resolve this we should try to understand how other accelerators do this -- if some accelerators do not have this option then we can only resolve this by defaulting to one or the other.

@BenjaminPoulain

If either argument is N dimensional, N>2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.

Can you please refine this into a precise definition?

Updated the description a bit but this is the definition from numpy -- I added an example that hopefully makes this clearer.

Thank you for the update. It's very clear with the example.

@wchao1115
Collaborator

@BenjaminPoulain the accumulator precision is a great point, especially when one considers supporting lower-precision computation, e.g. float16 or bfloat16. But this is not a property specific to this one operator. For example, there are implementations of conv2d that result in matrix multiplication; should a consistent accumulator precision be specified for them as well? I suppose this should be a property at a higher scope, maybe at an inference session scope, in order to ensure consistency throughout the graph.

Note that DirectML supports float16 with an option for the caller to specify a preferred accumulation precision as a hint. It is a hint because the underlying GPU may not be able to satisfy the requirement fully. However, by making this option a device setting, it ensures consistent results throughout the entire graph.

@BenjaminPoulain

For example, there are implementations of conv2d that results in matrix multiplication, should a consistent accumulator precision be specified for it as well?

I believe it should. I raised this issue for MatMul because it was making more progress than Conv2d at the time, but I agree the precision should be defined whenever the size of the accumulator and/or the order of operations matters.

I suppose this should be a property at a higher scope, may be at an inference session scope, in order to ensure consistency through the graph.

There is a legitimate use case to specify the precision per operation.

I have nothing against your proposal to make it global at first. Fine-grained optimizations are less portable and may not be suitable across browsers.

@huningxin
Contributor

  • Compatibility with underlying APIs / hardware

| API | op | a | b | output |
|---|---|---|---|---|
| WebNN | matmul | a | b | output |
| NNAPI | ANEURALNETWORKS_FULLY_CONNECTED | If N > 3, reshape (ANEURALNETWORKS_RESHAPE) to a 3-D tensor. If N == 3, slice (ANEURALNETWORKS_RESHAPE) to 2-D tensors along axis 0. Use inputs[0] of ANEURALNETWORKS_FULLY_CONNECTED. | If N > 3, reshape (ANEURALNETWORKS_RESHAPE) to a 3-D tensor. If N == 3, slice (ANEURALNETWORKS_RESHAPE) to 2-D tensors along axis 0. Transpose the tensor (ANEURALNETWORKS_TRANSPOSE). Use inputs[1] of ANEURALNETWORKS_FULLY_CONNECTED. | As the ANEURALNETWORKS_FULLY_CONNECTED output is 2-D, reshape (ANEURALNETWORKS_RESHAPE) to 3-D tensors, concat (ANEURALNETWORKS_CONCATENATION) along axis 0, and reshape (ANEURALNETWORKS_RESHAPE) to N-D. |
| DML | DML_GEMM_OPERATOR_DESC | ATensor of DML_GEMM_OPERATOR_DESC | BTensor of DML_GEMM_OPERATOR_DESC | OutputTensor of DML_GEMM_OPERATOR_DESC |
| MPS | MPSNDArrayMatrixMultiplication or MPSMatrixMultiplication or MPSCNNFullyConnected? | | | |
| BNNS | BNNSFilterCreateFullyConnectedLayer or vDSP_mmul? | | | |
| DNNL | matmul | If N > 3, reorder to a 3-D tensor. Use DNNL_ARG_SRC of the matmul primitive. | If N > 3, reorder to a 3-D tensor. Use DNNL_ARG_WEIGHTS of the matmul primitive. | Use DNNL_ARG_DST of the matmul primitive. If N > 3, reorder to N-D. |
| ONNX | MatMul | A | B | Y |
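
To make the reshape/slice strategy in the NNAPI and DNNL rows concrete, here is a small numpy sketch (illustrative only, not tied to any of the backends above) that lowers an N-D a times 2-D b matmul to a loop of 2-D GEMMs:

```python
import numpy as np

def matmul_nd_via_2d(a, b):
    # a: N-D tensor (N > 2), b: 2-D tensor, as in the example from the description.
    batch_shape = a.shape[:-2]          # e.g. (2, 3)
    m, k = a.shape[-2:]                 # e.g. (4, 5)
    k2, n = b.shape                     # e.g. (5, 4)
    assert k == k2
    a3d = a.reshape(-1, m, k)           # collapse the outer dims into one stack dim
    out = np.stack([a3d[i] @ b for i in range(a3d.shape[0])])  # one 2-D GEMM per slice
    return out.reshape(*batch_shape, m, n)  # restore the original outer dims

a = np.random.rand(2, 3, 4, 5)
b = np.random.rand(5, 4)
np.testing.assert_allclose(matmul_nd_via_2d(a, b), np.matmul(a, b))
```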

Opens:

  • NNAPI: for N-D output when N > 2, is it correct to map the TF-Lite Pack op to NNAPI ANEURALNETWORKS_RESHAPE and ANEURALNETWORKS_CONCATENATION ops?
  • DML: for N-D inputs when N > 2, how does it map to DML_GEMM_OPERATOR_DESC's 2-D inputs?
  • MPS: which is the right op to map?
  • BNNS: which is the right op to map?

@anssiko
Member

anssiko commented Mar 2, 2020

@huningxin thank you! Feel free to submit a PR to add the matMul compatibility table to https://github.com/webmachinelearning/webnn/tree/master/op_compatibility so it can be collaboratively edited.

@wchao1115
Collaborator

@huningxin Please go ahead and submit your PR. I'll fix it up for DML and ONNX.

For example, DML_GEMM_OPERATOR_DESC already supports N-D; the current documentation is just not up to date. ONNX's Gemm operator only supports 2-D while MatMul supports N-D. The fact that both are defined is a bit redundant and confusing.

@huningxin
Contributor

@wchao1115, the table is merged: https://github.com/webmachinelearning/webnn/blob/master/op_compatibility/matmul.md. Feel free to create a PR to fix it up. Thanks.

@huningxin
Contributor

  • MPS: which is the right op to map?
  • BNNS: which is the right op to map?

@BenjaminPoulain , any suggestions? Thanks!

@huningxin
Contributor

CoreML has BatchedMatMul that is

A layer that computes the matrix multiplication of two tensors with numpy-like broadcasting
where the matrices reside in the last two indices of the tensor.

I suppose it is compatible with the current proposal.

With that, I propose we start crafting the PR for the matmul op definition. Meanwhile, the mapping details of MPS/BNNS in matmul.md can be filled in with a separate PR.

@anssiko @wchao1115 @nsthorat @BenjaminPoulain what do you think?

@anssiko
Member

anssiko commented Mar 10, 2020

Sounds good to me. PR review to be requested from folks tagged.

@wchao1115
Collaborator

@huningxin the 'dot' operation in XLA-HLO does the same, except that it only supports N <= 2, supposedly because broadcasting is factored out as a separate 'broadcast' operation. This is a good example of how XLA-HLO is built from sub-operator-level constructs.

@huningxin
Contributor

Thanks @wchao1115. The dotGeneral operation of XLA-HLO supports N > 2 batch matmul semantics.
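
For concreteness, a minimal sketch of those batch matmul semantics via jax.lax.dot_general (shapes borrowed from the example earlier in this issue; note that the batch dimensions must match, since broadcasting is a separate step in XLA-HLO):

```python
import jax.numpy as jnp
from jax import lax

a = jnp.ones((2, 3, 4, 5))
b = jnp.ones((2, 3, 5, 4))  # batch dims must match; a [5, 4] b would need an explicit broadcast first

# dimension_numbers = ((lhs_contracting, rhs_contracting), (lhs_batch, rhs_batch))
c = lax.dot_general(a, b, dimension_numbers=(((3,), (2,)), ((0, 1), (0, 1))))
print(c.shape)  # (2, 3, 4, 4)
```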

@wchao1115
Collaborator

Thanks @wchao1115. The dotGeneral operation of XLA-HLO supports N > 2 batch matmul semantics.

Correct. The point is they cut things up into smaller pieces.

huningxin mentioned this issue Mar 11, 2020
@huningxin
Contributor

Sounds good to me. PR review to be requested from folks tagged.

Done. #49 is open for review.

@dontcallmedom
Contributor

This was solved by #49.
