Face Detection: How metadata should be tied to MediaStreamTrack video frames #70

Open
youennf opened this issue Aug 19, 2022 · 5 comments

Comments


youennf commented Aug 19, 2022

Following on #69 and media capture transform, face detection metadata could be made available to MediaStreamTrack transforms.
There are a few possibilities we could envision. The following come to mind:

  1. Attach FaceDetection metadata to VideoFrame with dedicated face detection metadata getter/setter (new VideoFrame slot that can be cloned/postMessaged).
  2. Attach FaceDetection metadata to VideoFrame using a generic metadata mechanism (mechanism to be defined, see VideoTrackGenerator).
  3. Make MediaStreamTrack transforms expose objects that have a VideoFrame and a metadata object.
  4. Extend MediaStreamTrackProcessor.readable with a face detection metadata getter (related to the last read video frame), and VideoTrackGenerator.writable with a metadata setter.

youennf commented Aug 19, 2022

Option 1 is probably easier than 2 and more natural than 3 and 4.


sandersdan commented Sep 13, 2022

The current situation with generic metadata in WebCodecs VideoFrame is that there is support for the idea, but no adequate technical solution has been proposed. I'm interested in any proposal that:

  • Can be serialized to bytes. (I assume this excludes Symbol)
    • Necessary because VideoFrames can be sent between workers, and also the model assumes that all properties of a VideoFrame are carried by the underlying video resource.
  • Supports namespacing in some form.
    • Important to avoid conflicts between future specifications and current application developers.
  • Has defined semantics for construction and inheritance (e.g. what happens with new VideoFrame(existingVideoFrame, {updatedMetadata: ..., visibleRect: ...})).
    • Presumably some metadata should be carried across this (e.g. extra timestamps), but other metadata is invalidated (e.g. face positions when the visible rect changes).
    • The basic rule of 'drop everything' may be good enough, and could be used like: new VideoFrame(oldFrame, {metadata: {...oldFrame.metadata, foo: 123}}).
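
As a rough illustration of the 'drop everything' rule, here is a minimal sketch. SketchFrame is a stand-in class, not the WebCodecs VideoFrame, and the metadata init member and the captureTime/faces keys are hypothetical names used only for this example.

```javascript
// Stand-in for VideoFrame, used only to illustrate the rule. It is NOT the
// WebCodecs class, and `metadata` is a hypothetical init member.
class SketchFrame {
  constructor(source, init = {}) {
    // In this sketch the pixel data is simply shared with the source frame.
    this.pixels = source instanceof SketchFrame ? source.pixels : source;
    // 'Drop everything': metadata is never inherited from `source`.
    // Only what the caller passes in `init.metadata` is kept.
    this.metadata = { ...(init.metadata ?? {}) };
  }
}

const oldFrame = new SketchFrame("pixels", {
  metadata: { captureTime: 10, faces: 2 },
});

// Nothing is carried over by default.
const cropped = new SketchFrame(oldFrame, {});

// The application opts in to carrying metadata and can add or override entries.
const annotated = new SketchFrame(oldFrame, {
  metadata: { ...oldFrame.metadata, foo: 123 },
});

console.log(cropped.metadata);   // {}
console.log(annotated.metadata); // { captureTime: 10, faces: 2, foo: 123 }
```

With this rule the application, not the platform, decides which entries remain valid after a change such as a new visible rect.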

Absent such a proposal, we are still recommending (3) or (4), passing the metadata out-of-band.

I don't think there is strong support for handling face metadata specially, but doing so would be the shortest path to in-band metadata.


youennf commented Sep 13, 2022

> • Can be serialized to bytes. (I assume this excludes Symbol)

Agreed we need support to clone/postMessage metadata.
I was thinking we could use structure cloning (https://html.spec.whatwg.org/multipage/structured-data.html#safe-passing-of-structured-data), which is what is being used when postMessaging a value, say to workers.

For instance, we could add steps in the constructor to structure clone the metadata input parameter and the result would be stored in a VideoFrame object slot.
The metadata accessor should either provide a copy of the metadata or the metadata object itself (maybe we should freeze it?).
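
A minimal sketch of what that could look like, assuming a stand-in class rather than the real VideoFrame. The metadata init member and both accessor names are hypothetical; structuredClone is the standard global available in current browsers and in Node 17+.

```javascript
// Stand-in class; the `metadata` init member and the accessors are hypothetical.
class SketchFrame {
  #metadata; // models the internal slot

  constructor(init = {}) {
    // Clone at construction so later changes to the caller's object do not leak in.
    this.#metadata = structuredClone(init.metadata ?? {});
  }

  // Variant A: hand out a fresh copy on every access.
  get metadataCopy() {
    return structuredClone(this.#metadata);
  }

  // Variant B: hand out the stored object itself, frozen.
  // Object.freeze is shallow, so nested objects would still be mutable.
  get metadataFrozen() {
    return Object.freeze(this.#metadata);
  }
}

const input = { faces: [{ x: 1, y: 2 }] };
const frame = new SketchFrame({ metadata: input });

// Mutating the caller's object after construction does not affect the frame.
input.faces.push({ x: 3, y: 4 });
console.log(frame.metadataCopy.faces.length); // 1
```

Variant A is safe but allocates on every read; variant B is cheap but would need a deep freeze to be fully immutable.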

> • Supports namespacing in some form.

Good point.
I am fine either going with UA defined metadata initially or adding support for web app specific metadata.
In any case, both kinds of data should probably follow the same principles (e.g. the data being structured-cloneable).

In terms of spec editing, WebCodecs could define a WebCodecMetadata dictionary, either without any members or containing something like an any userDefinedMetadata member.
The WebRTC spec would then define a partial WebCodecMetadata dictionary listing the face detection dictionary members.

> • The basic rule of 'drop everything' may be good enough

+1

@sandersdan, how does this look to you?
Is it precise enough to think about writing a PR?


sandersdan commented Sep 13, 2022

> I was thinking we could use structure cloning

Structured clone by itself doesn't work because it assumes there can be side data (such as ports) in addition to the raw bytes. The forStorage variant might work, but I'm not familiar enough with it to say for sure.

It might actually make sense to just drop down to JSON here. I don't think metadata should need to be self-referential, for example.

> In terms of spec editing, web codec could define a WebCodecMetadata dictionary, either without any member or containing something like an any userDefinedMetadata member.

Yes, this is about the best I was able to come up with as well, and I think it meets the requirements. I like that unlike a partial for VideoFrame, a partial for VideoFrameMetadata would be straightforward to splat.

{metadata: { user: { ... } } } is a bit cumbersome, but the only alternative I have is { metadata: ..., userMetadata: ... }, which just trades cumbersomeness for complexity. One surprise could be that { metadata: { myMetadata: 123 } } would simply be dropped by the IDL binding, but good documentation can overcome that.
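
To make that surprise concrete, here is a rough imitation of how a WebIDL dictionary conversion ignores members it does not declare. The toMetadataDictionary helper and the KNOWN_MEMBERS list are hypothetical and only mirror the shape discussed above; this is not how any binding is actually implemented.

```javascript
// Hypothetical list of declared dictionary members, mirroring the shape above.
const KNOWN_MEMBERS = ["user"];

// Rough imitation of WebIDL dictionary conversion: only declared members are
// read from the input; everything else is ignored without raising an error.
function toMetadataDictionary(input) {
  const result = {};
  for (const name of KNOWN_MEMBERS) {
    if (input[name] !== undefined) {
      result[name] = input[name];
    }
  }
  return result;
}

// An undeclared top-level key is silently lost.
const dropped = toMetadataDictionary({ myMetadata: 123 });

// The same value survives when nested under the declared `user` member.
const kept = toMetadataDictionary({ user: { myMetadata: 123 } });

console.log(dropped); // {}
console.log(kept);    // { user: { myMetadata: 123 } }
```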

> Is it precise enough to think about writing a PR?

I think the serialization part needs work before becoming a PR, but it could be at least proposed in the existing bug.

Edit: The existing bug is w3c/webcodecs#189. There is a separate bug for EncodedChunk metadata, w3c/webcodecs#245, but that also adds the complexity of possibly having to copy metadata from frames to chunks or the reverse.


youennf commented Sep 13, 2022

> It might actually make sense to just drop down to JSON here

I could see metadata being an ArrayBuffer, in which case JSON is not great.
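
A small sketch of the difference, using a hypothetical depthMap entry as the binary metadata. JSON.stringify and structuredClone are standard globals (structuredClone needs a current browser or Node 17+).

```javascript
// `depthMap` is a hypothetical metadata entry holding binary data.
const metadata = { depthMap: new Uint8Array([1, 2, 3]).buffer };

// JSON.stringify finds no enumerable properties on an ArrayBuffer,
// so the bytes are lost in a JSON round trip.
const viaJson = JSON.parse(JSON.stringify(metadata));
console.log(viaJson.depthMap.byteLength); // undefined

// structuredClone copies the buffer contents.
const viaClone = structuredClone(metadata);
console.log(viaClone.depthMap.byteLength);         // 3
console.log(new Uint8Array(viaClone.depthMap)[2]); // 3
```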

> I think the serialization part needs work before becoming a PR, but it could be at least proposed in the existing bug.

I think https://html.spec.whatwg.org/multipage/structured-data.html#structuredserialize is what we want.
This is roughly what structuredClone is using under the hood (we do not want any transfer parameters since we want to ensure we can clone frames). forStorage=false is good here.
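
The effect of leaving out the transfer list can be seen from script with the standard structuredClone global. This is only an illustration of copy versus transfer; the payload key is a made-up name, and spec text would call StructuredSerialize directly rather than structuredClone.

```javascript
const buffer = new Uint8Array([1, 2, 3]).buffer;
const metadata = { payload: buffer }; // `payload` is a hypothetical key

// No transfer list: the value is copied and the original stays usable.
const copy = structuredClone(metadata);
console.log(buffer.byteLength);       // 3
console.log(copy.payload.byteLength); // 3
console.log(copy.payload === buffer); // false

// With a transfer list the original buffer is detached, so a second clone of
// the same metadata would no longer see the data.
const moved = structuredClone(metadata, { transfer: [buffer] });
console.log(buffer.byteLength);        // 0
console.log(moved.payload.byteLength); // 3
```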

> that also adds the complexity of possibly having to copy metadata from frames to chunks or the reverse.

I do not think we need to expose this to web pages, at least initially. It should be reasonably simple for the web app to copy metadata from a VideoFrame to its corresponding chunk.
This is something we might want in WebRTC (passing metadata from the track to the encoded transform), but the WebRTC spec could handle this metadata passthrough on its own.
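
One way a web app could do this on its own is to key the metadata by timestamp, assuming each encoded chunk carries the same timestamp as the frame it came from and that timestamps are unique within the stream. The helper names below are hypothetical, and plain objects stand in for a VideoFrame and an EncodedVideoChunk.

```javascript
// Metadata waiting for its encoded chunk, keyed by presentation timestamp.
const pending = new Map();

// Call before handing the frame to the encoder.
function rememberMetadata(frame, metadata) {
  pending.set(frame.timestamp, metadata);
}

// Call from the encoder's output callback; removes the entry once used.
function takeMetadata(chunk) {
  const metadata = pending.get(chunk.timestamp);
  pending.delete(chunk.timestamp);
  return metadata;
}

// Plain objects stand in for a VideoFrame and an EncodedVideoChunk here.
const frameStandIn = { timestamp: 33366 };
rememberMetadata(frameStandIn, { faces: 1 });

const chunkStandIn = { timestamp: 33366 };
const chunkMetadata = takeMetadata(chunkStandIn);
console.log(chunkMetadata); // { faces: 1 }
console.log(pending.size);  // 0
```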

aboba changed the title from "How FaceDetection metadata should be tied to MediaStreamTrack video frames" to "Face Detection: How metadata should be tied to MediaStreamTrack video frames" on Jan 29, 2023