
Define the set of operations and their specification #17

Closed
huningxin opened this issue Apr 30, 2019 · 37 comments

Comments

@huningxin
Contributor

As raised in CG meeting, the first foundation spec only lists 32 operation types without information about how to use them.

We need to define the set of operations and their specification/semantics.

The set of operations could be derived from the use cases and corresponding models. For the WebNN POC, there is a spreadsheet that lists supported models and their required operations. It can be used as a starting point.

By following the spirit of the WebML CG charter, the specification will be implementable on top of existing major platform APIs, such as Android NNAPI, Windows DirectML, and macOS/iOS MPS/BNNS. So when specifying operations, the platform API mapping/support needs to be looked into. For the WebNN POC, there is another spreadsheet that captures the native API mapping of supported operations. It can also be leveraged.

We can file an individual issue for each operation specification and use this one as the meta issue.

@gregwhitworth

Thanks for bringing this issue up @huningxin – this is something we’ve been pondering as well. We would like to propose that this CG not duplicate the efforts of the open ONNX community. This community has produced an MIT-licensed operator specification, based on years of investigation into the top ML models, to standardize around. The community brings together platform, software, web, and hardware companies to help provide a standardized approach, which allows any team across the stack to leverage the same set of operators and makes development in ML more efficient for everyone involved. A few companies involved in this open community are Amazon, Facebook, Microsoft, Intel, IBM, Unity, Nvidia, Qualcomm, AMD, and more.

We recommend that we follow the same pattern used by other specifications within the W3C & other standards organizations by referencing the work done by others rather than duplicating efforts and creating numerous standards. A few examples of this are:

  • CSS Text references the Unicode specification
  • Service Workers does not redefine what IETF has defined for the fetch API
  • All specifications that reference floating-point numbers (probably including WebNN at some point) point to IEEE

Additionally, this is an open community, so if there are any changes we desire to this specification we can make them (and they also welcome them). Now, this doesn’t mean that this API needs to support ALL of the operators listed by default; we can create a list of operations from within that specification that must be supported by a UA in order to fully support the WebNN API, but we shouldn’t redefine the operators and their signatures, inputs & outputs.

@gregwhitworth

PROPOSED RESOLUTION: The specification will reference the ONNX operations and if there are any improvements desired for ONNX the work should be there.

We will be resolving this on the next telecon; let us know if you have any objections to this.

@anssiko
Member

anssiko commented Jun 27, 2019

As discussed on today's call, please provide your input in this issue on the proposed resolution documented below prior to our next call on 8 August 2019. During that call we'll resolve this issue as proposed unless objections are recorded in this issue.

PROPOSED RESOLUTION: The specification will reference the ONNX operations and if there are any improvements desired for ONNX the work should be there.

(That's 6 weeks of review time to provide feedback, due to the holiday season in some parts of the world.)

@nsthorat

nsthorat commented Jul 2, 2019

An important part of this specification will be ensuring this set of ops is compatible with the major ML JavaScript frameworks (e.g. conv2d paddings are compatible, memory layouts of tensors are compatible, etc). It's not possible for us to move forward with this resolution without understanding compatibility.
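For illustration, here is a minimal sketch (plain TypeScript, not any proposed API) of one such compatibility pitfall: TFLite's "SAME" auto-padding and ONNX's explicit pads attribute express the same convolution differently, so a converter has to reconcile them.

```ts
// Illustrative sketch only (not any proposed API): TF/TFLite "SAME"
// auto-padding vs. ONNX explicit pads for a 1-D slice of a conv.

// Output length with TF-style "SAME" padding: ceil(input / stride).
function outSizeSamePadding(input: number, stride: number): number {
  return Math.ceil(input / stride);
}

// Output length with ONNX-style explicit pads.
function outSizeExplicitPads(input: number, kernel: number, stride: number,
                             padBegin: number, padEnd: number): number {
  return Math.floor((input + padBegin + padEnd - kernel) / stride) + 1;
}

// input = 224, kernel = 3, stride = 2: "SAME" yields 112 and implicitly pads
// [0, 1] (extra padding at the end), so a converter must emit pads = [0, 1]
// explicitly; symmetric pads = [1, 1] also yields 112 but shifts the result.
console.log(outSizeSamePadding(224, 2));           // 112
console.log(outSizeExplicitPads(224, 3, 2, 0, 1)); // 112
```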

Would somebody be willing to take on the work of detailing each op and whether it could be used by major frameworks in a doc that can be shared with the group?

Thanks!

@jbingham

jbingham commented Jul 2, 2019

A related question: what's the plan for dealing with versioning? I'd expect ONNX to evolve and add ops over time. Also, the ops would be versioned, and details may change over time. Is the idea to reference an explicit list of ONNX ops with versions, and include them by reference in the web standard?

If yes, then the next step would be to agree on which specific ops + versions to reference, ensuring that they're compatible with the major ML JavaScript frameworks we want to support, as well as the backend providers. Maybe the answer is easy, and it's all of the ONNX ops, in their latest versions. I agree that we (the community group) need to do the homework to determine if that's the case.

@RafaelCintron
Collaborator

@nsthorat and @jbingham, below are links to existing documents and code that should help answer your questions.

  • Versioning. It is up to us, as a community group, to decide which version of the operator set we use for the web spec.

On the compatibility front, there already exist conversion tools to and from other frameworks and model types.

@jbingham

jbingham commented Jul 3, 2019

Thanks, @RafaelCintron ! We'll take a closer look.

I see the compatibility matrix. Great! It would be good for us to understand what compatibility really means here.

Some questions that you may already have great answers for:

Versioning:
I see that not all TF ops are convertible to a consistent ONNX version. Eg, most convert to 1, some convert to 1 or 6, others only convert to 7 or 10. And then there's the compatibility matrix for Keras, CoreML, LightGBM, and Scikit-Learn. How would we choose op versions that work for all of these libraries? Is it tractable?

Converting to ONNX from js:
Is the idea that, for each JavaScript ML framework (like TensorFlow.js), there would be some client-side JS to convert from the framework's native graph representation (like a SavedModel) to a WebML graph with ONNX representations of all operators, and then the WebML API would operate on that?

Custom ops:
What if a desired operator isn't available? How are custom ops defined and included in the graph?

How many ops?
Do we need all of the operators in the spec? Or would we get 80% of the performance benefit with 2 or 3 ops, like matmul and conv2d? Does that simplify things at all, as a starting point to build consensus?

Thanks in advance for explaining!

@gramalingam
Contributor

@jbingham : re. your versioning question: ONNX has the notion of ops and opsets. An opset consists of a collection of ops with a given specification. Typically an op has the same specification in a number of consecutive opsets until it is updated. Thus, if we consider the Equal op, there is a version introduced in opset 1, which had the same spec in opsets 1 through 6, and was updated in opset 7 (to add broadcasting support).

I think the notation used in the compatibility matrix reflects this. It lists versions 1 and 7 for the Equal op because opset versions 1 through 6 have the same spec for Equal, and versions 7 onward have the same spec (up to the current opset).
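As a data-only illustration (hypothetical names, not the ONNX protobuf schema), the resolution rule can be sketched like this: a model imports an opset version, and each op resolves to its most recent spec at or below that version.

```ts
// Per the comment above: Equal was introduced in opset 1 and updated in
// opset 7 (broadcasting added).
const equalSpecSince: number[] = [1, 7];

// Resolve which spec version of an op applies for a model's imported opset.
function resolveOpVersion(specSince: number[], modelOpset: number): number {
  const applicable = specSince.filter((v) => v <= modelOpset);
  if (applicable.length === 0) {
    throw new Error(`Op not available in opset ${modelOpset}`);
  }
  return Math.max(...applicable);
}

console.log(resolveOpVersion(equalSpecSince, 6));  // 1  (pre-broadcasting spec)
console.log(resolveOpVersion(equalSpecSince, 10)); // 7  (broadcasting spec)
```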

@walrusmcd

To also add some more history and context for @jbingham questions:

Versioning:
Some history: the ONNX journey was all about interchange between frameworks. Our goal was to support training in one framework and exporting to do inference in another. So our first opsets were optimized for inference and a set of canonical models + training frameworks. The first toolkit (here) focused on all the ones you mention (CoreML, LightGBM, Keras, and more). It was a solid opset. We learned a ton working with all the different behaviors of the frameworks, including key differences like how broadcasting works. We iterated rapidly and got to production quality by opset 7. This was June 2018. It is very tractable. We have deployed to client devices at a very large scale with a vibrant cross-framework converter ecosystem. Microsoft even shipped an ONNX ML runtime in the box with a public developer API starting in Windows 1809 RTM :)

For TensorFlow the journey has been a bit longer. ONNX adopted the LCD set of ops needed for inference, while TF has a much larger set of ops (there are 500+ in the tensorflow::ops C++ namespace), plus contrib ops, experimental variations, training, etc. Just like TOCO went through its journey to convert from TF to TFLite, we went through a similar journey with tf2onnx. LSTM is a great example of another issue we hit: Keras has LSTM primitives, whereas TF is more loosely structured (we solved the LSTM with our converters, btw!).

Regarding the compatibility matrix link you posted, @gramalingam's response is spot on. All of the opsets roll forward and include the previous versions. You target a single opset. Opset 7 is where we supported most TF models. And we just keep getting better every day. We focus on the data science community and the TF operators they need most for their production-deployed models. An example is opset 9, when we added NonMaxSuppression. If your model absolutely required NMS, you could use opset 7 plus a custom operator, or roll forward to opset 9 and the model would just work. Again, very tractable.

Using ONNX in js:
The first idea would be to enable the browser platforms to include ML execution providers. Basically, a provider knows how to run a kernel and manage resources (memory). JavaScript ML frameworks could then build on top of these providers. It’s TBD what the provider IDL would look like. We have the other github issues tracking this one (graphs and/or op kernels).

This particular github issue would track the idea that the op schema, namespaces, versioning, and semantic definitions could all be driven from ONNX. The idea being that no matter which IDL we land on, we still need a common currency for describing op kernels, schemas, and behavior.
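To make the layering concrete, here is a purely hypothetical TypeScript sketch of what an execution provider interface might look like; none of these names are proposed IDL, and the tensor representation is deliberately left vague.

```ts
// Hypothetical sketch only; not proposed IDL.
interface OpKernelSpec {
  name: string;          // e.g. "Conv", matching the ONNX schema name
  opsetVersion: number;  // e.g. 7
}

interface ExecutionProvider {
  // The op schemas this provider can execute; a framework can use this
  // list to decide what to delegate.
  supportedOps(): OpKernelSpec[];
  // Run one kernel; attributes follow the op's ONNX schema (pads, strides, ...).
  run(op: OpKernelSpec, inputs: ArrayBuffer[],
      attributes: Record<string, unknown>): Promise<ArrayBuffer[]>;
}
```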

Custom ops:
I would imagine we cover this in the IDL conversation for the execution provider. From an ONNX point of view, these custom ops do not need to be governed by the standardized schema, right?

How many ops?
This is a great conversation!! Would love to dig in. We found a couple of things:

  • Having too many made it hard to get cross-framework consensus (ONNX landed on ~100 vs. the 500+ TF had).
  • Having too few means we often hit operators we really need in a model; custom ops become the norm rather than the escape valve.
  • Having just a couple is really BLAS/MLAS. Already covered, right? No need for a new ML operator set ;)
  • Crossing resource boundaries is expensive. For our DirectML implementations, we strive to have 100% coverage inside our GPU execution provider to maximize perf.

I would imagine for WebML we would want the same: execution providers that provide enough ops to run most models without leaving the provider resource boundary. Right? What training frameworks are you thinking about? With ONNX opset 7 we were able to run almost all of the models we had targeted (TF, PyTorch, scikit-learn, LightGBM, CoreML). That was our goal, to be able to fully represent the models across ML frameworks.

I'm not sure what the right answer is, but I like your idea of starting with a straw man. What if we took opset 10 (the latest opset) as a starting point to reason around?

@nsthorat totally agree, compatibility is a huge deal.

ONNX has 2 things it drives here:

  1. This is the master operator schema. It is driven by def files (code), so the doc should not drift out of date. It defines things like Conv and states exactly what the tensor shapes, inputs, outputs, behavior-defining attributes, and type constraints are.
  2. They also provide conformance tests. This is also code, so there can be little ambiguity. The code all lives here, and is largely based on numpy to provide a reference implementation. We then target a set of canonical models (AlexNet, Inception, SqueezeNet, VGG).

I imagine the community would totally jump in and help fill any testing and documentation gaps as we find them. The entire goal here is to enable framework developers to work with it. We have a couple of frameworks supporting ONNX natively (ORT, Caffe2, PyTorch, CNTK) and would love to keep adding more!

@jbingham

jbingham commented Jul 4, 2019

@gramalingam, thanks for clarifying. So if we wanted to cover a large number of common operators across many real-world models in multiple frameworks, we'd want to select a high enough opset version, such as opset 10. That makes sense.

@walrusmcd Lots of great info there. We will likely want to go deeper on some of these topics, possibly in the other github issues you mention, or new ones.

Here's a summary of some of what you describe: imagine a future web platform that lets you train an ML model in any ML framework, and as long as you can convert it to a compatible graph representation, with a standardized set of ops, you can perform inference on any device, taking advantage of hardware acceleration, from any web app, running in any web browser.

While I'm digesting, I have another thought, which I'll riff on in a separate comment.

@jbingham

jbingham commented Jul 4, 2019

Here's that other thought:

Given how complex this all is, and how many ops and ML frameworks, is it realistic to expect that there will ever be more than one code base that fully implements ONNX, much less keeps up with the evolution of opsets? From what I see on github, well over a hundred people -- and not just random people, but knowledgeable, highly technical people with deep domain knowledge -- have contributed code to get ONNX to where it is today. That would be difficult to reproduce, especially in a compatible way.

And if there is likely to only ever be one ONNX implementation, does that mean that to achieve standardization across web browsers, ML frameworks, and opsets, the only viable path, realistically, is for the browser vendors to agree to ship the exact same binary? This has been done before, so it's not unprecedented or impossible. But it would take some real effort to make happen, assuming the web should have a standard at this level of abstraction, as opposed to a few low level shaders, which would be less controversial.

Putting it all together: would the decision to standardize on ONNX opsets mean that the web standards community would effectively be agreeing to adopt the first and, in practice, only implementation of ONNX, and that the browser vendors would be agreeing to ship it?

If that's what's at stake, webnn issue 17 seems kind of important. We might want to put together a detailed case for why this is really where the community should go, and why the ML frameworks and browser makers should all be onboard.

It's also possible that I just took an accidental turn down a rabbit hole, in which case, please help :)

@huningxin
Contributor Author

Hi @jbingham ,

Custom ops:
What if a desired operator isn't available? How are custom ops defined and included in the graph?

Issue #6 is for the custom ops discussion. So far, the proposal is that JS ML frameworks offload supported sub-graph execution to WebNN and execute their own (custom) WASM/WebGL/GPU kernels for ops that are not supported. It would require WebNN to support high-performance data/tensor exchange between WebNN execution and WASM/WebGL/GPU kernel execution.

There are two early prototypes for TensorFlow.js/WebNN integration and ONNX.js/WebNN integration based on this proposal and the WebNN POC.
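As a rough sketch (hypothetical names, not the proposal's IDL) of the partitioning idea: consecutive WebNN-supported ops are grouped into sub-graphs that get delegated, and everything else stays on the framework's own WASM/WebGL kernels.

```ts
// Rough sketch only: group consecutive WebNN-supported ops into sub-graphs.
type Segment = { onWebNN: boolean; ops: string[] };

function partition(ops: string[], supported: Set<string>): Segment[] {
  const segments: Segment[] = [];
  for (const op of ops) {
    const onWebNN = supported.has(op);
    const last = segments[segments.length - 1];
    if (last && last.onWebNN === onWebNN) {
      last.ops.push(op);
    } else {
      segments.push({ onWebNN, ops: [op] });
    }
  }
  return segments;
}

// Conv and Relu are delegated as one WebNN sub-graph; the custom op is not.
console.log(partition(["Conv", "Relu", "CustomPostProcess"],
                      new Set(["Conv", "Relu"])));
// -> [ { onWebNN: true, ops: ["Conv", "Relu"] },
//      { onWebNN: false, ops: ["CustomPostProcess"] } ]
```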

How many ops?
Do we need all of the operators in the spec?

It would depend on what models WebNN would support, and the models are chosen according to the use cases. (I'd like to provide some input in the next comment.)

Or would we get 80% of the performance benefit with 2 or 3 ops, like matmul and conv2d? Does that simplify things at all, as a starting point to build consensus?

From our investigation of custom ops, we have two key observations:

  • Offloading expensive JS framework ops to WebNN can give a good speedup
    In our experiment with MobileNet, offloading Convolution (Conv2D + DepthwiseConv2D) + Relu6 to WebNN (MKL-DNN backend) achieves a 30X speedup over WASM and 80% of native performance.
  • Interop between JS framework ops and the WebNN graph has overhead
    When offloading MobileNet to WebNN, per-op execution is 3.5X slower than whole-graph execution. So it may also be necessary to offload the less expensive ops (like Relu, Concat, Add, etc.) to avoid interop overhead.

@huningxin
Contributor Author

huningxin commented Jul 4, 2019

For the starting op set, here is some input from a use-case perspective. The idea is to derive the op set from the models that support the defined use cases.

This table maps the WebNN use cases to related networks.

| Use case | Network |
| --- | --- |
| Image classification | MobileNet, SqueezeNet, ResNet, Inception |
| Object/Person detection | TinyYOLO, SSD |
| Semantic Segmentation | DeepLab |
| Skeleton Detection | PoseNet |

There are pre-trained models in TFLite or ONNX for the above networks. This table lists the required ops of each model.

ONNX Op MobileNetV2 (ONNX) MobileNetV2 (TFLite) SqueezeNet1.1 (ONNX) [4] SqueezeNet (TFLite) ResNet50V2 (ONNX) InceptionV4 (TFLite) TinyYOLOV2 (ONNX) [6] SSD MobileNetV1 (TFLite) [7] PoseNet (TFLite) DeepLabV3 (TFLite)
Add ✔️ ✔️ ✔️ ✔️
AveragePool ✔️ [1] ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
BatchNormalization ✔️ ✔️ ✔️
Clip ✔️ [2] ✔️ [2] ✔️ [2] ✔️ [2]
Concat ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Conv ✔️ ✔️ [3] ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ [3] ✔️ [3] ✔️ [3]
Gemm ✔️ ✔️ [5]
LeakyRelu ✔️
MaxPool ✔️ ✔️ ✔️ ✔️ ✔️
Relu ✔️ ✔️ ✔️ ✔️ ✔️
Reshape ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Resize ✔️ [8]
Sigmoid ✔️
Softmax ✔️ ✔️ ✔️ ✔️ ✔️ ✔️

Notes:

  1. GlobalAveragePool op of ONNX
  2. RELU6 op of TFLite
  3. CONV_2D and DEPTHWISE_CONV_2D op of TFLite
  4. Ignore Dropout op of ONNX
  5. FULLY_CONNECTED op of TFLite
  6. Ignore ImageScaler op of ONNX
  7. Ignore TFLite_Detection_PostProcess op of TFLite
  8. RESIZE_BILINEAR op of TFLite

The following table maps the TFLite ops in the above TFLite models to ONNX ops.

| TFLite Op | ONNX Op |
| --- | --- |
| ADD | Add |
| AVERAGE_POOL_2D | AveragePool |
| CONCATENATION | Concat |
| CONV_2D | Conv |
| DEPTHWISE_CONV_2D | Conv |
| FULLY_CONNECTED | Gemm |
| LEAKY_RELU | LeakyRelu |
| LOGISTIC | Sigmoid |
| MAX_POOL_2D | MaxPool |
| RELU | Relu |
| RELU6 | Clip |
| RESHAPE | Reshape |
| RESIZE_BILINEAR | Resize |
| SOFTMAX | Softmax |
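The same mapping expressed as data (a sketch only; a real converter would also translate attributes, e.g. RELU6 becomes Clip with min = 0 and max = 6, and DEPTHWISE_CONV_2D becomes Conv with the group attribute set to the number of channels).

```ts
// Sketch of the TFLite -> ONNX op-name mapping from the table above.
const tfliteToOnnx: Record<string, string> = {
  ADD: "Add",
  AVERAGE_POOL_2D: "AveragePool",
  CONCATENATION: "Concat",
  CONV_2D: "Conv",
  DEPTHWISE_CONV_2D: "Conv",
  FULLY_CONNECTED: "Gemm",
  LEAKY_RELU: "LeakyRelu",
  LOGISTIC: "Sigmoid",
  MAX_POOL_2D: "MaxPool",
  RELU: "Relu",
  RELU6: "Clip",
  RESHAPE: "Reshape",
  RESIZE_BILINEAR: "Resize",
  SOFTMAX: "Softmax",
};

console.log(tfliteToOnnx["RELU6"]); // "Clip"
```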

@jbingham

Great info, Ningxin. This is a ton of great work digging into these details.

Summarizing on a bunch of the comments above:

  • We could start with just a couple of ops that give a big performance boost, and we would see real performance benefits
  • BUT there are big costs to moving memory around and switching contexts, which argues for moving all ops in the graph to a single execution provider, ideally, even if many of the ops by themselves aren't accelerated
  • There's separate discussion around how to make custom ops performant.

@jbingham

jbingham commented Jul 10, 2019

Related to my previous comment about there being only one ONNX implementation, a similar case can be made for providers. Realistically, maybe there will only be one provider implementation per hardware platform. Eg, for a given Android, iOS, or Windows device and chipset, there would be only one provider implementation that takes advantage of the hardware. The reason is that it's unlikely that anyone other than the device and chip manufacturers would be able to write a performant provider and ship it with the device. Pure-play browser vendors wouldn't be in a position to do it, would they?

@gramalingam
Contributor

@jbingham: I am not sure I understand the comment about there being only one ONNX implementation (or the related comment about providers above). TensorRT, OpenVINO, nGraph, Menoh, NCNN, and others all support ONNX models to various degrees. Most convert ONNX models to their own formats, either via scripts outside the runtime or as a preprocessing step in their runtime. It seems to me that the question you are asking is an essential aspect of this WebML standardization effort, and it was an essential aspect of the ONNX standardization effort as well. I am unclear whether you are suggesting that standardizing the spec for ops is too difficult and not worth doing?

I agree that we can separate the questions of (a) What ops are important enough to include in the WebML spec, and (b) The standardization of the ops selected in (a). But it seems to me that (b) is essential.

@jbingham

@gramalingam Yes, agree that we need to standardize the ops that we choose to include in the spec. And agree that ONNX looks like a reasonable proposal.

I'm still trying to understand the layering here, of what's ML framework vs WebNN vs ONNX vs provider vs platform vs hardware, assuming those are the right categories. In my earlier comment, I think I may have misunderstood what ONNX is. IIUC, you're saying that ONNX is just the model interchange format, not a library or provider or ML framework. There are partial conversions implemented for multiple ML frameworks already. So there's only one ONNX model format, but translation layers aren't that painful to write, since several have been created already. Is that right?

The only complication is around the "to various degrees" caveat you added. As we add more ops, the likelihood of them all being implemented correctly by every ML framework and provider decreases. At a certain point, it could become quite difficult and impractical to standardize. If there's a way to minimize the number of ops that are standardized, that's one potential solution. Another solution is to share code, to ease the burden of implementing a huge set of ops and versions.

But if ONNX is an interchange format only, maybe there's no opportunity for code sharing at that level. Maybe at the provider level, as I was suggesting above?

@jbingham

Popping back up a level: we on the Google side are coming up to speed on the overall proposal, and have dumped a bunch of thoughts into this issue, many only tangentially related. In the interests of keeping this specific issue scoped narrowly enough to be able to close it, let's split other topics into separate issues.

IIUC, there are at least 2 fundamental premises:

  1. We need a web standard at the level of a graph API (the Web NN API)
  2. The Web NN API requires a set of operations and their specification for use in the graph.

TBH, we on Chrome and the TensorFlow team are not yet fully convinced of the first premise, though it's certainly plausible. Let's move that discussion to another thread (eg, Issue: "Decide if a graph API is the right thing to standardize on"), and ideally spend some time f2f or separately coming up to speed on the thinking that led the group here. One thing that might help is to write out a use case or user story for what we want to enable on the web. Eg, Issue: "Agree on a user story for ML on the web"

Assuming a graph API, premise 2 seems pretty reasonable. The options seem to be:
A. have black box operations without any web standard
B. everything is a custom op, and there's a spec for how to write custom ops
C. there's some number of operations with specs, but no custom op
D. have some number of operations with specs, and a custom op as an escape hatch

TBH, I can't rule out A, because I'm not yet persuaded that a graph API is the right level of abstraction for a web standard. If the operations are a black box, the whole graph is. Let's move that discussion to a separate thread as well. (Eg, Issue: "Determine if a black box ML model is useful".)

Assuming a graph API, B seems like it wouldn't provide any good way to benefit from hardware acceleration, and would make a graph API pretty useless. I feel like we can safely rule it out.

C and D are in bounds for the proposal on this thread. The proposed solution is to use ONNX for the specifications, and defer their details to the ONNX project. If we have a set of versioned operations with specifications, I agree that ONNX looks like the best interoperable standard currently out there. Sure, later, we might encounter some reason why it's not ideal, but if so, we can deal with it in a separate issue then.

It seems like next up for this github issue is to choose a specific set of operations, within ONNX, that will be included in the graph API, so that we can confirm that there's adequate compatibility in the major ML frameworks. Then we can close out this issue, with the caveat that it relies on some assumptions that we haven't yet agreed to, and should address in separate issues.

Does anyone have a concrete proposal? Eg, something like, "Let's use the complete ONNX Opset 9." Or: "Let's use ops A, B, C... of Opset 7 as a proof-of-concept". Totally made up, but you get the idea.

@gramalingam @huningxin @walrusmcd wdyt?

@gramalingam
Contributor

Hi @jbingham : a couple of quick comments.

Re. your point (1): in terms of background, there has been a discussion about this question. My thoughts are as summarized here: #11 (comment)

I think (2) (at least the "The WebNN API requires a set of operations and their specification" part) is orthogonal to (1). I think we mostly require this, even if we decide to go with an "operation API" (just for executing operations, no building graphs) … the one exception might be something like option (B) you mention (though I am not sure I understand this).

I don't know what option (A) is: is it like the "load model" API discussed previously (in another issue/thread)? I am not sure what (B) means either. I agree that for the graph API (or even otherwise), (C) and/or (D) make sense and seem most relevant.

@huningxin
Contributor Author

@jbingham , thanks for your comments. The next step you proposed sounds good.

Based on my #17 (comment), I propose supporting the following 14 operators of Opset 10 as the initial proof-of-concept: Add, AveragePool, BatchNormalization, Clip, Concat, Conv, Gemm, LeakyRelu, MaxPool, Relu, Reshape, Resize, Sigmoid, and Softmax.

let's split other topics into separate issues

+1. There are some existing issues related to what you mentioned.

Eg, Issue: "Agree on a user story for ML on the web"

The use cases were defined as a start; they include both application-level and framework-level ones. Feel free to propose new ones.

Eg, Issue: "Determine if a black box ML model is useful"
Eg, Issue: "Decide if a graph API is the right thing to standardize on"

There was a discussion about a high-level vs. low-level API. It then split into Executing models and Executing operations (eager or graph).

@jbingham

jbingham commented Jul 18, 2019 via email

@gramalingam
Contributor

One follow-up thought on @jbingham's point about "understanding the layering (frameworks, providers, platforms, hardware, etc.)", as well as the related issue of "Is a graph-builder API the right one?". One way of looking at it is: where do compilers, especially those performing optimizations across an entire model (or across multiple operations), and especially hardware-specific optimizations, fit in the WebML picture/stack? Do we want WebML to enable the implementation of such compilers/optimizers behind the WebML API (in the browser layer or beneath it), or do we want to enable their implementation outside it (in which case the compilers/optimizers would emit JavaScript code containing WebML)? In the first case, WebML serves as the source for the compiler/optimizer, while in the second case WebML serves as the target. These are two different scenarios.

@RafaelCintron
Collaborator

I agree with @jbingham that the API should be more than just "custom ops", option B in his list. The point of the API is for web developers to access hardware-accelerated capabilities that are not available in other APIs such as WebGL and WebGPU. Just doing custom operations means WebML would essentially be a "helper function" and not be very compelling over what TF.js and ONNX.js provide today.

One approach is to do Option C in @jbingham's list and structure the API in a similar manner to the DirectML workflow, but with WebGPU concepts instead of D3D12 concepts. The developer would put weights into input resources, bind the resources into input and output tensors, and record the operations into a command list. Executing the command list would dump the result into output buffers. In this model, to do "custom ops", the developer would interleave compute shaders (an already existing concept in WebGPU) in between the operations defined by WebNN. We may be able to do this as an extension to WebGPU.
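A purely hypothetical sketch of that workflow shape; none of these methods exist in WebGPU or WebNN today, and the names are invented only to illustrate the "record operations into a command list, interleave compute shaders" idea.

```ts
// Purely hypothetical (invented names, WebGPU-flavored).
declare const device: any;           // stand-in for a GPUDevice with an ML extension
declare const customOpPipeline: any; // an ordinary WebGPU compute pipeline

function recordInference(weights: unknown, input: unknown, output: unknown): void {
  const encoder = device.createCommandEncoder();
  // Record a built-in ML operation (hypothetical extension method).
  encoder.mlConv2d({ input, filter: weights, output });
  // Interleave a "custom op" as a plain compute shader on the same buffers.
  encoder.dispatchCompute(customOpPipeline, [output]);
  // Executing the command list dumps the result into the output buffer.
  device.queue.submit([encoder.finish()]);
}
```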

@kainino0x

I'm in support of something along those lines, but by itself it doesn't solve every use-case: it doesn't open us up for interop with CPU accelerators (e.g. tuned SIMD kernels) or standalone accelerators (e.g. Edge TPU), both of which seem important.

At the same time, I think if we can extend that model to cover those too (CPU: WASM+SIMD, TPU: not sure), it would be very nice.

@gregwhitworth

gregwhitworth commented Aug 6, 2019

I personally think we should explore Option C as long as what Kai stated is possible. The primary thing webdevs want from the solution here is to be able to access the hardware for perf benefits if the device has it available. How we get there (graph vs. lower-level commands) doesn't really matter as much to me, as long as we can agree on the commands and the set of ops and ensure that hardware access is possible in an interoperable manner. I also think we need to come to a conclusion on this before we spend too much more time on a graph API, in case we decide to go with the lower-level commands. cc: @huningxin @anssiko

@RafaelCintron
Collaborator

@kainino0x if WASM will be adding support for SIMD kernels in the near future, then ML framework authors can choose whether to use those or a WebGPU extension to implement their inference engine.

On the other hand, if we have the WebML API decide between "SIMD vs. compute shader", then WebML needs to be more high level than what I described above. "Custom ops" will need to be a first class citizen and framework authors will need to be prepared to provide either compute shaders or Javascript/WASM for WebML to do its job.

@kainino0x

WASM is indeed adding SIMD, but just like on GPU, it may not be able to be tuned for every chip and take advantage of every feature (due to having to be abstractable). (OTOH, SIMD is much simpler than CUDA, and maybe WASM SIMD can keep up with hardware in the long run, since it's smaller and easily emulated.)

re: the WebML API deciding, I was imagining explicitly exposing the "providers" (e.g. "CPU" or "GPU") to the app so it can choose between them.

@RafaelCintron
Collaborator

@kainino0x, if the SIMD aspect is not exposed to developers in a way that can be specialized for all hardware, then I agree it would be good to expose it as a WebML provider.

@walrusmcd

Thanks @jbingham for a bunch of great writeups and ideas. Sorry I was offline for a bit. I'll go through them all and digest and reply. First comment: I agree with you 100% that a key part here is how we layer everything together. Having a solid drawing and concept around the layers will give us a ton of clarity.

@huningxin
Contributor Author

@walrusmcd, I happen to have a concept diagram of the existing proposal. We may use it as a starting point.

[Image: webnn_stack_diagram]

Remarks on the numbered labels in the diagram:

  1. JS ML Framework loads a ML model.
  2. JS ML Framework executes the ML model with its own kernels implemented in WebGPU compute shaders or WebAssembly with SIMD.
  3. When WebNN is available, JS ML Framework identifies sub-graphs that are supported by WebNN and delegates their execution to WebNN.
  4. WebNN offloads the sub-graph execution to native API that accesses the hardware acceleration of CPU/GPU/Accelerator.
  5. WebNN sub-graph execution exchanges input/output tensors with the WebGPU/WebAssembly kernels through a highly efficient interface (a rough sketch follows).
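A hypothetical sketch of step 5 (names invented for illustration): the WebNN sub-graph consumes and produces GPU-resident tensors so that a following custom WebGPU kernel can use them without a CPU round trip.

```ts
// Hypothetical sketch only; not proposed IDL.
type GPUBufferLike = object; // stand-in for a WebGPU buffer handle

interface WebNNSubgraphExecution {
  run(input: GPUBufferLike): Promise<GPUBufferLike>;
}

async function runWithCustomOp(
  subgraph: WebNNSubgraphExecution,
  input: GPUBufferLike,
  customKernel: (t: GPUBufferLike) => Promise<GPUBufferLike>
): Promise<GPUBufferLike> {
  const intermediate = await subgraph.run(input); // stays on the GPU
  return customKernel(intermediate);              // custom WebGPU op consumes it directly
}
```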

@huningxin
Contributor Author

Re @RafaelCintron

One approach is to do Option C in @jbingham list and structure the API in a similar manner to the DirectML Workflow but with WebGPU concepts instead of D3D12 concepts.

As @gramalingam mentioned, the WebNN API may enable hardware-specific optimizations by a graph compiler/optimizer. It does not seem straightforward to me to integrate that capability into the "executing command list" approach.

Some standalone accelerators, e.g. Edge TPU, may require graph compilation before execution (more details in my next comment). This usage may not fit well into the "executing command list" approach.

IMO, and as you mentioned, the "executing command list" approach could be an ML extension of WebGPU. Similar to DirectML on native, WebGPU with this extension would allow webdevs to interleave ML and rendering workloads for GPU-efficient AI+gaming usage on the web.

@huningxin
Contributor Author

Re @kainino0x

CPU accelerators (e.g. tuned SIMD kernels) or standalone accelerators (e.g. Edge TPU), both of which seem important.

I agree that the programming model should support devices beyond GPU.

I was imagining explicitly exposing the "providers" (e.g. "CPU" or "GPU) to the app so it can choose between them.

The current proposal only supports implicitly setting an execution preference via "fast-answer", "sustained-speed", and "low-power". I think it makes sense to extend the support to cover device provider enumeration and selection with an appropriate permission mechanism. It may also require querying the capabilities of a device provider, since different providers may support different ops, data types, and architectures. A JS ML framework needs this info to identify the WebNN-supported sub-graph.

For example, WebNN may support the Edge TPU via a device provider backed by the Edge TPU compiler and runtime. According to the doc, the Edge TPU compiler could partition a model and compile the sub-graph with supported ops for Edge TPU runtime execution. The unsupported ops are still executed by framework kernels, as in TF-Lite. For the web usage, a JS ML framework could partition the graph based on the capabilities of the WebNN Edge TPU provider and delegate the supported sub-graph to WebNN for Edge TPU compilation and execution. The unsupported operations could still be executed by framework kernels written in WebGPU compute shaders or WebAssembly. This usage would require the following functionalities of WebNN ("+" means supported in the current proposal, "-" means a gap); a rough sketch follows the list:

  1. device provider enumeration (-)
  2. device provider capabilities querying (-)
  3. graph building (+)
  4. graph compilation (+)
    4.1 set implicit preference (+)
    4.2 choose device provider (-)
  5. graph execution (+)
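A rough sketch of the missing pieces (1), (2), and (4.2) above, using invented names; navigator.ml.getProviders() is an assumption for illustration, not part of the current proposal.

```ts
// Hypothetical sketch only; not proposed IDL.
interface ProviderCapabilities {
  ops: string[];        // e.g. ["Conv", "Relu", ...]
  dataTypes: string[];  // e.g. ["float32", "uint8"]
}

interface DeviceProvider {
  name: string;         // e.g. "edge-tpu", "gpu", "cpu"
  capabilities(): Promise<ProviderCapabilities>;
}

// Pick the first provider that supports every op in the sub-graph; otherwise
// the framework falls back to its own WASM/WebGPU kernels.
async function pickProvider(requiredOps: string[]): Promise<DeviceProvider | null> {
  const providers: DeviceProvider[] =
    await (navigator as any).ml.getProviders(); // hypothetical entry point
  for (const provider of providers) {
    const caps = await provider.capabilities();
    if (requiredOps.every((op) => caps.ops.includes(op))) {
      return provider;
    }
  }
  return null;
}
```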

@kainino0x

It makes sense to select a provider via a hint like "low-power" if the caller has no custom ops. But if they have custom ops written e.g. in WebGPU, they have to be guaranteed to get a WebGPU capable provider (a GPU).

@RafaelCintron
Collaborator

If we have multiple providers, CPU and GPU, I would expect the provider selection would be done via a hint (as @kainino0x describes) combined with what is available on the system.

If the GPU provider is available, then the web developer would need to describe their inputs and outputs in terms of WebGPUBuffers and do custom ops with compute shaders. Custom ops are done with WebGPU directly.

If the CPU provider gets used, then the web developer would need to describe their inputs and outputs in terms of ArrayBuffers and do custom ops with WASM or Javascript. Custom ops are done with Javascript/WASM directly.

I think there will exist a set of graph optimizations that can be done in a device-neutral manner. The ML framework libraries should be able to handle these. If there exist graph optimizations that need device-specific information, then I think we have no choice but to add a graph API to the spec that sits on top of the operator API.

I'm of two minds when it comes to how much we unify the two providers. If we provide a more unified API to developers, then the API speaks in terms of generic WebMLBuffers that are filled from ArrayBuffers. The developer is none the wiser as to whether these buffers are CPU or GPU, UNLESS they choose to do custom operators, which now need to be first-class citizens. With custom ops, the web developer will need to be prepared to supply either Javascript/WASM (in the case of the CPU provider) or a compute shader (in the case of the GPU provider), depending on the provider chosen.

@anssiko
Member

anssiko commented Aug 8, 2019

RESOLVED: The specification will reference a subset of the ONNX operations, starting small, adding more ops when compatibility with major ML JavaScript frameworks has been validated

@dontcallmedom
Contributor

this issue has a resolution - any reason to keep it open?

@anssiko
Member

anssiko commented Mar 3, 2023

Thanks for noting this has indeed been addressed and we have a resolution on it. Closing.

@anssiko anssiko closed this as completed Mar 3, 2023