Identification of Captured Application By Capturer #52

Open
eladalon1983 opened this issue Apr 13, 2021 · 36 comments
Labels
Discussion Ongoing enhancement New feature or request

Comments

@eladalon1983
Member

eladalon1983 commented Apr 13, 2021

This is the culmination of discussions in w3c/mediacapture-screen-share#159. I am re-summarizing so as to avoid misunderstandings stemming from that other issue's history (it started focused on label, but evolved in a different direction).

Problem

When the user chooses a tab using getDisplayMedia, the capturing application has no good way of discovering which application it is capturing.

Use cases

1. Establishing Cross-Tab Communication

The stress here is on "establishing," by which I mean identification. Communication itself is a solved issue once identification takes place - we can use BroadcastChannel, a shared back-end, sometimes a service worker. This issue is concerned with how the capturing application can identify the captured application, so that messages could be addressed specifically to it.

For example, assume two collaborating apps from ACME corporation - a capturing VC application called ACME VC-Max, and a productivity-suite app called ACME Presentron. (The marketing department took a day off.) The user has many open tabs, of Presentron as well as of other applications. When VC-Max asks to capture a tab, and the user chooses Presentron, we want VC-Max to be able to identify that this selection took place. Moreover, we want Presentron to be able to declare an ID; VC-Max can then use this ID to address messages solely to the specific captured Presentron session. (Note - reliably and ergonomically passing this ID is in-scope; use of the ID for communication is beyond scope; it's enough for us that it's possible.)

Once this ID is passed and communication is initiated, VC-Max can display controls that allow the user to flip through slides on the captured Presentron slide deck from within the VC-Max session.

2. Analytics

Capturing applications can gather statistics on which applications their users tend to capture. This can be used to improve service for users by introducing collaborations. One such possible collaboration was described in the use-case above.

3. Conditional Tab-Focus Change

In w3c/mediacapture-screen-share#165 I proposed an API for one-way hand-off of tab-focus from capturer to captured. Consider the intended case for this - shortly after capture starts. With the capture-handle defined by the current issue, a capturing application could make an informed decision about whether it wants to hand off tab-focus to the captured application, depending on what the captured application is.

4. Rejecting Undesired Captures

Issue w3c/mediacapture-screen-share#143 introduced a web-developer who wanted to discard MediaStreams that resulted from the capture of either blocklisted or non-allowlisted origins. If we can enable this use-case along the way, that would be great.

5. Detecting Self-Capture

This is a sub-case of use-case 4, but deserves elaboration due to its ubiquity. It is common for VC applications to experience a "hall of mirrors" effect when the user unintentionally self-captures. If the app can detect self-capture, it can also avoid using the stream until the user chooses a new source.

Solution

Define MediaDevices.captureHandle. If set, the application can use it to expose information to potential capturing applications.

  1. Origin.
  2. An arbitrary "handle" string.

The "handle" string can be an ID which is meaningful given the origin (each origin defines and publishes its own ID schema).
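As a rough illustration of how a capturer might consume these two fields for use-cases 4 and 5, the sketch below models the exposed information as a plain object. The `CaptureHandle` interface, `classifyCapture`, and every origin and handle string are hypothetical names chosen for this example; none of this is specified API.

```typescript
// Hypothetical shape of what a capturer might observe. Both fields are
// optional because exposure is opt-in on the captured side.
interface CaptureHandle {
  origin?: string;
  handle?: string;
}

type CaptureVerdict = "self-capture" | "blocked" | "accepted";

// Decide what to do with a freshly captured tab (use-cases 4 and 5).
// `own` is the origin and handle the capturing page itself exposes.
function classifyCapture(
  observed: CaptureHandle | null,
  own: Required<CaptureHandle>,
  blockedOrigins: ReadonlySet<string>,
): CaptureVerdict {
  if (observed === null) {
    // The captured page did not opt in, so nothing can be concluded.
    return "accepted";
  }
  if (observed.origin === own.origin && observed.handle === own.handle) {
    // Same origin and same handle: the capturer is looking at itself.
    return "self-capture";
  }
  if (observed.origin !== undefined && blockedOrigins.has(observed.origin)) {
    return "blocked";
  }
  return "accepted";
}
```

An allowlist policy would invert the last check and reject anything whose origin is missing or not listed.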

Noteworthy

  1. Opt-in: This mechanism is opt-in.
  2. Application discovering it is being captured: This mechanism does NOT allow the captured application to discover it is being captured. However, if the capturing application wishes to send the captured application a message to that effect, it can. But that's neither new, nor is it problematic.
  3. "Please choose again": One possible concern is that the capturing application could use the capture-handle in order to repeatedly ask the user to choose again until the user chooses the capturing app's intended target. I think this concern is invalid because these attacks were equally possible previously:
    • Detecting capture of the current tab is trivial, e.g. by embedding some specific pixels in the current tab or by flashing the entire tab black for a single frame, then searching for that in the capture.
    • Detecting a specific other tab - this implicitly assumes that information can be harvested from that specific other tab, which also implicitly assumes that it can be parsed by visual inspection, which means its capture can also be distinguished from an arbitrary other capture.
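To make the first bullet concrete, here is a minimal sketch of searching a captured frame for a known run of marker pixels. The function name `frameContainsMarker` and the RGBA row-major buffer layout are assumptions of this example. A real capture passes through scaling and encoding, so an actual detector would need a colour tolerance rather than the exact match used here.

```typescript
type Rgb = readonly [number, number, number];

// Returns true if `marker` appears as a horizontal run of pixels anywhere
// in an RGBA frame (4 bytes per pixel, row-major). Alpha is ignored.
function frameContainsMarker(
  rgba: Uint8ClampedArray,
  width: number,
  height: number,
  marker: readonly Rgb[],
): boolean {
  if (marker.length === 0 || marker.length > width) return false;
  for (let y = 0; y < height; y++) {
    // Stop early enough that the marker never wraps onto the next row.
    for (let x = 0; x + marker.length <= width; x++) {
      const matched = marker.every(([r, g, b], i) => {
        const p = (y * width + x + i) * 4;
        return rgba[p] === r && rgba[p + 1] === g && rgba[p + 2] === b;
      });
      if (matched) return true;
    }
  }
  return false;
}
```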
@eladalon1983
Member Author

I've edited the PR (#163) to:

  1. Allow an application to expose its origin, its handle, or both. (Or neither, of course.)
  2. I've renamed back to CaptureHandle, since DisplayMediaCaptureHandleConfigType and the like were getting too long. I welcome suggestions for better names.

@jan-ivar
Member

Use cases

  1. Establishing Cross-Tab Communication
    ... display controls for the user that will allow the user to flip through slides on the captured ... slides deck
  3. Conditional Tab-Focus Change
    ... a capturing application [decides] whether ... to hand off tab-focus to the captured application, depending ...

These two sound like the actual use case is:

  1. Remote-control a locally captured presentation webpage in a meeting

I think it's important to look at the original use case, to not bake in assumptions or implementation decisions already taken.

With that in hand, I'd take a step back and ask whether getDisplayMedia or getViewportMedia is the right tool.

Attempting to build this new integrated experience over getDisplayMedia seems:

  1. Unsafe: capture resumes past navigation, an oversharing surprise (because we're capturing the tab container)
  2. Unexpected: action-at-a-distance magic ("what have I given permission to? I thought this was a read-only feature?")
  3. Inconsistent: Works one way for some tabs and another for others.
  4. Poor user experience: Users may not choose the right tab.
  5. Limited: any further integration with the target hits a wall (how do I make it bigger/smaller? interact with elements?)
  6. Stagnant: apps settling in a house of straw, not bricks. HTML capture remains as dangerous as ever

Of course, integrating this with getViewportMedia is not without challenges either, but seems more future proof (none of the above problems). Challenges that remain would be:

  • Could tabs be signaled to start capture in the background?
  • How'd users choose what to present? In-content pickers? If so, how could we open it up beyond ACME-only choices?
  • Do we need to programmatically move between tabs? Is that a good idea?

I think there's a lot to be worked out here to be able to support this use case. I think we should do that before we attempt to standardize pieces of the puzzle.

@jan-ivar
Member

I also sense a larger question here: how do webpages cooperate on the web, to create an integrated experience?

Are we sure we want getDisplayMedia at the core of it?

@eladalon1983
Member Author

eladalon1983 commented Apr 16, 2021

I also sense a larger question here: how do webpages cooperate on the web, to create an integrated experience?
Are we sure we want getDisplayMedia at the core of it?

I think this recurring topic deserves separate discussion. If the WG decides to abandon gDM, then we can close all threads relating to incremental improvement of gDM. Until such a decision is made, I think we should judge proposals to improve gDM on their own merits.

Looking at this specific proposal (#166) and its associated PR (#163), I do not see any criticism of the proposal's merits. The usefulness seems well-established. Your comment lists some ways in which gVM could eventually deal with it better - in that case, let's hope that web-developers migrate to using gVM when it's specified and implemented. But please note:

  1. gDM is widely used and there are no plans to deprecate it.
  2. No browser yet implements gVM. It's not even specified.
  3. It is unclear how many web-developers will accept the cost of retooling their applications to use gVM, if any, or what the timeline for that will be.

IMHO, my third point is significant, and worth repeating - it is unclear how much enthusiasm web-developers will have for investing time+money to completely overhaul their applications to make exclusive use of gVM. More likely we'll see gradual adoption and slow replacement of gDM in some spots, and plenty of applications intentionally maintaining gDM-use for many, many years to come.

If you have suggestions for making this incremental improvement to gDM even better - I'd be very happy to incorporate it, as I have incorporated @youennf's suggestion wrt opt-in origin exposure. I would also be happy to discuss any adjacent improvements that you think could complement this proposal, such as the idea of adding a navigationBreaksCapture or navigationPausesCapture constraint. I am eager to improve gDM.

@eladalon1983
Member Author

(My latest proposal is outlined in this explainer.)

Following the WebRTC WG meeting we've just had, I'd like to gauge where we currently stand, and what the path forward is:

  • @jan-ivar and @youennf wanted to remove exposeOrigin from the CaptureHandleConfig and always expose the origin (assuming the captured-app opts into exposing anything).
  • @jan-ivar wanted a browser-assigned ID rather than an app-assigned handle.

Where do we stand, then? If these two changes were made to my latest suggestion, would we have consensus? Or do we have additional points of disagreement?

@youennf

youennf commented May 27, 2021

I think these two suggestions are improvements.
If we go with a generated ID, it could represent the tab and be immutable on navigation.
Then comes the issue of what to do in case of navigation. There may be an interest in other APIs to bootstrap tighter communication between two tabs. Maybe it deserves a specific API that could be reused outside of getDisplayMedia.

In that case, maybe the generated ID should actually be a JS object that would be a proxy to the captured tab.
If the captured page does not opt-in, this object does not expose any information/does not allow any interaction with the captured page.
If the captured page opts in, some information is exposed (its origin for instance) and interaction with the captured page can be done through this object.

@eladalon1983
Member Author

In that case, maybe the generated ID should actually be a JS object that would be a proxy to the captured tab.

This would be a much taller order with significant implications for security. At the moment we're having trouble reaching consensus over something far more modest, despite kicking off the effort 2 months ago and despite all previous proposals from Apple being adopted. I think it's better to think of this as a potential extension. (Please recall that while discussing this feature, Apple has argued for reducing initial scope. For example by removing exposeOrigin and always exposing.)

If we go with a generated ID, it could represent the tab and be immutable on navigation.

We could discuss either a generated ID that's mutable or immutable on navigation. But before we delve into that topic, I think it would be useful to understand if Mozilla and Apple would accept such a proposal. I would like to have firm goalposts.

Full disclosure - I am very unhappy with a browser-assigned ID for several reasons. We can go into that soon. But I'd like to at least establish this as our last remaining point of disagreement.

@youennf

youennf commented Jun 1, 2021

At the moment we're having trouble reaching consensus over something far more modest

As I said in the past, I am fine with having something very modest like exposing a static piece of information which always contains an origin (or exposing nothing).
But it seems your usecase is much more advanced than that and requires tracking navigation, change of origins and so on.

@eladalon1983
Member Author

At the moment we're having trouble reaching consensus over something far more modest

As I said in the past, I am fine with having something very modest like exposing a static piece of information which always contains an origin (or exposing nothing).
But it seems your usecase is much more advanced than that and requires tracking navigation, change of origins and so on.

On the one hand: You suggested exposing the origin back on the original thread. It seems like you're still fine with it, and even ask for it to be done by default, with no way for the captured app to opt-out of that part.

On the other hand: You list change of origins (e.g. when the captured tab experiences navigation) as an unexpected (?) complicating factor.

Please help me understand. Has new information come to light, or have the goalposts moved?

@youennf

youennf commented Jun 1, 2021

with no way for the captured app to opt-out of that part.
Captured app could always opt-out of exposing its origin.

One known use-case is for a web-app to ensure it is capturing itself or one of its tabs. This could be supported by this 'modest' proposal.

Another use-case is for a web-app to capture another tab and start driving this tab, which requires tight synchronization between the two. My understanding is that this is similar to the previous case (user needs to select the right google doc tab) except that the two tabs will start interacting. This can be supported by this 'modest' proposal.

The third use-case is the same as previous one, except that now the main captured document that is part of the synchronization might disappear and synchronization might be recreated by the next top level document of the same tab.
I haven't really thought about the corner cases and potential issues, but this seems more complex (although worth investigating). I also wonder whether there will be other cases where user will relate two pages for improved processing, in which case it would be nice to factor out the bootstrapping mechanism from the actual communication channel.

@eladalon1983
Member Author

with no way for the captured app to opt-out of that part.
Captured app could always opt-out of exposing its origin.

One known use-case is for a web-app to ensure it is capturing itself or one of its tabs. This could be supported by this 'modest' proposal.

Another use-case is for a web-app to capture another tab and start driving this tab, which requires tight synchronization between the two. My understanding is that this is similar to the previous case (user needs to select the right google doc tab) except that the two tabs will start interacting. This can be supported by this 'modest' proposal.

The third use-case is the same as previous one, except that now the main captured document that is part of the synchronization might disappear and synchronization might be recreated by the next top level document of the same tab.
I haven't really thought about the corner cases and potential issues, but this seems more complex (although worth investigating). I also wonder whether there will be other cases where user will relate two pages for improved processing, in which case it would be nice to factor out the bootstrapping mechanism from the actual communication channel.

The context of this message is completely lost on me. I don't understand how it correlates to our discussion thus far. I can understand each paragraph in isolation, but what is the general thrust of this message? What does it say?

@youennf

youennf commented Jun 1, 2021

Sorry if I was not clear, I was trying to express the granularity of use case complexity.
These are thoughts that I was hoping would help but this is not the case apparently :(
Overall, I do not have a solid view on a proposal, hence why I am trying to start with a small-but-still-useful proposal.

Here is another idea, not thought out at all, but it at least helps illustrate the diversity of APIs that could be used.
If the end-goal is to support complex synchronisation use cases, why not directly expose a MessagePort on the capturing context entangled to another MessagePort exposed on the captured context, instead of an ID?

Captured page needs to opt-in for the port to be exposed on the capturing page through an opt-in API.
It is then up to the capturing context, knowing the captured context origin, to decide whether it wants the entangled port to be exposed to the captured context (exposed for instance through a callback given to the opt-in API).

To support navigation usecases, a new set of ports would be created when the opt-in API is called in the post-navigation captured context.

@eladalon1983
Member Author

why not directly expose a MessagePort on the capturing context entangled to another MessagePort exposed on the captured context, instead of an ID?

  1. Because it does not address all use-cases as well as exposing a capture-handle. (Note use-cases 2, 3 and 4 in the original message.)
  2. Because it's an effort of greater complexity, larger scope, and much more significant security and privacy ramifications.
  3. Because it's good to have that as a later extension. You have been advocating small initial scope, after all.

I think it would be good, in the future, if our discussions of possible approaches could follow the standard progression of paring down possible solutions. Late proposals for radical deviations from the agreed-upon course are not conducive to reasonably-paced progress. We've been discussing exposure of origin-and-handle for two months now. What's changed?

@eladalon1983
Member Author

Added a fifth use-case - avoiding the "hall of mirrors" effect when self-capturing.

@jan-ivar
Member

jan-ivar commented Jun 3, 2021

@jan-ivar wanted a browser-assigned ID rather than an app-assigned handle.

I also wanted APIs like this under site-isolation and capture opt-in. So I'd say we've not reached our last point of disagreement.

Another use-case is for a web-app to capture another tab and start driving this tab, which requires tight synchronization between the two.

I feel @youennf understands the complexity that's not apparent in the OP as it conflates the capture of an "app" with capture of a tab:

When VC-Max asks to capture a tab, and the user chooses Presentron, we want VC-Max to be able to identify that this selection took place. Moreover, we want Presentron to be able to declare an ID; VC-Max can then use this ID to address messages solely to the specific captured Presentron session.

To poke holes: what if the user instead chooses a non-Presentron tab, but later navigates to Presentron in it?

I'm not ready to concede that to capture a web-based presentation program requires indiscriminate capture of its browsing context and all its navigation. That's an unsafe foundation to build on IMHO. We have a mandate to make web capture safe. And this isn't.

I'm also not ready to concede that to solve basic "next/previous slide" controls, requires building the ability to remotely browse ("drive") a tab.

This proposal also presents a stark contrast to getViewportMedia which by design isn't able to follow links easily. That's a potential shortcoming of getViewportMedia, since some modern web presentations may contain links that a presenter plans to follow during their presentation.

I'd prefer to take a step back and have a higher-level discussion around that. I think there's a way to solve this that is both safe and solves the web presentation use case, but I'll open a new issue on that.

@eladalon1983
Member Author

eladalon1983 commented Jun 3, 2021

I feel @youennf understands the complexity that's not apparent in the OP as it conflates the capture of an "app" with capture of a tab:

That tabs can be navigated after capture begins was discussed explicitly two months ago in my very first message on this topic. In this linked message, search for the section titled "What if the user navigates the captured tab?"

To poke holes: what if the user instead chooses a non-Presentron tab, but later navigates to Presentron in it?

We have discussed this. If you refer back to my slides, you will see that there is an event fired.

I'm not ready to concede that to capture a web-based presentation program requires indiscriminate capture of its browsing context and all its navigation.

  1. The capturing app is already capturing the entire browsing context. It's called getDisplayMedia.
  2. The capturing app is already observing navigation because it's getting all of the pixels from the captured tab.
  3. Capture Handle increases the ergonomics of observing navigation, but in a limited fashion. Please see this section and also this one of my public explainer.

I'm also not ready to concede that to solve basic "next/previous slide" controls, requires building the ability to remotely browse ("drive") a tab.

I am afraid that our memories of this feature's fundamental nature are out of alignment. I'd like to encourage you to re-read (a) the original thread, (b) OP and (c) the explainer. Capture-Handle does not "build" the ability to remotely drive the tab. That's something you can build on top of Capture-Handle.

@jan-ivar
Member

jan-ivar commented Jun 3, 2021

@eladalon1983 I'm not saying it wasn't discussed, but that I find it overly complex for the use case at hand, and would like to explore better web integration built on safer tech than today's unsafe non-isolated model.

@eladalon1983
Member Author

eladalon1983 commented Jun 3, 2021

@eladalon1983 I'm not saying it wasn't discussed, but that I find it overly complex for the use case at hand, and would like to explore better web integration built on safer tech than today's unsafe non-isolated model.

Halting all (relevant) progress on screen-capture until getViewportMedia is specified, implemented and adopted, is not an acceptable solution for us. Not without good reason, and no such reason has been presented.

@jan-ivar
Member

jan-ivar commented Jun 3, 2021

I'm not proposing we block on the getViewportMedia API. I've opened w3c/mediacapture-screen-share-extensions#9 to clarify the direction I propose.

@eladalon1983
Member Author

eladalon1983 commented Jun 3, 2021

Proposal w3c/mediacapture-screen-share-extensions#9 packages two separate issues in unnecessary union.

  • "isolated-browser" is a stab at increasing security. A noble cause, good luck.
  • The ID allows bootstrapping communication. It seems to me like a neutered version of Capture Handle, and I don't see why it would be beneficial to gate it behind such adoption-hindering measures.

@youennf

youennf commented Jun 14, 2021

  • Because it does not address all use-cases as well as exposing a capture-handle. (Note use-cases 2, 3 and 4 in the original message.)

I don't really see why it does not solve 2, 3 and 4, given MessagePort would be complemented by Origin.

  • Because it's an effort of greater complexity, larger scope, and much more significant security and privacy ramifications.

Well, that is the end goal of the main use case you are bringing to the table.
If MessagePort does not work for security reasons, I wonder whether the same security reasons might not impact the capture handle proposal.

  • Because it's good to have that as a later extension. You have been advocating small initial scope, after all.

Small initial scope would be origin, maybe origin plus pathname or something like that.
If we expose MessagePort, the handle does seem somehow redundant.

@eladalon1983
Member Author

I don't really see why it does not solve 2, 3 and 4, given MessagePort would be complemented by Origin.

Assuming it IS complemented by origin, then some of these use-cases would also be partially addressed¹, but to an inferior extent. Motivating a change from our current approach to a new approach requires a set of compelling reasons. I have not yet heard such reasons. This new approach offers more complexity (where previously less complexity was requested by Apple) and inferior handling of the use-cases I have cited. Capture Handle is the superior solution here.

--

  1. To name just one reason, consider top-level navigation. How would the capturer be notified now? Shall we add an event for that? If so, we'd be inching towards the Capture Handle proposal. We now have origin exposure, events - the only things we're missing are permittedOrigins² and the handle!
  2. Please consider why permittedOrigins was included in the Capture Handle proposal - protecting against capturer dis-service if capturing a competitor. I trust you will see that you'd soon want to replicate that, too, in the new proposal.
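To show how exposeOrigin and permittedOrigins could interact, here is a minimal sketch of what a capturer would be allowed to observe. The names CaptureHandleConfig, exposeOrigin and permittedOrigins come from this thread. The exact field set, the `"*"` wildcard, and the helper `observeHandle` are assumptions of this example, and it models the variant in which exposeOrigin is still part of the config.

```typescript
// Modeled on the CaptureHandleConfig discussed in this thread; the exact
// field set is an assumption of this sketch.
interface CaptureHandleConfig {
  exposeOrigin: boolean;
  handle: string;
  permittedOrigins: readonly string[];
}

interface ObservedCaptureHandle {
  origin?: string;
  handle: string;
}

// What a capturer at `capturerOrigin` would observe when capturing a page
// at `captureeOrigin` that has set `config`. Returns null when the capturer
// is not permitted to observe anything.
function observeHandle(
  config: CaptureHandleConfig,
  captureeOrigin: string,
  capturerOrigin: string,
): ObservedCaptureHandle | null {
  // Assumption: "*" means every capturer is permitted.
  const permitted =
    config.permittedOrigins.includes("*") ||
    config.permittedOrigins.includes(capturerOrigin);
  if (!permitted) return null;
  return config.exposeOrigin
    ? { origin: captureeOrigin, handle: config.handle }
    : { handle: config.handle };
}
```

With this shape, a captured app can cooperate with a partner capturer while exposing nothing to a competing one.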

@eladalon1983
Member Author

eladalon1983 commented Jul 23, 2021

Shameless plug: https://webrtchacks.com/capture-handle/

@dontcallmedom dontcallmedom added the enhancement New feature or request label Nov 30, 2021
@eladalon1983
Member Author

I've been discussing this with @jan-ivar, and my understanding of his current position is that he would support this proposal[*] if we add the ability for some basic messages to be sent capturer->capturee. This sounds like a good idea to me, and I am happy to resume the discussion accordingly.

The proposal would then comprise two parts:

  1. We keep setCaptureHandleConfig, getCaptureHandle, etc. This allows sophisticated cooperation between closely collaborating applications.
  2. We add a mechanism for applications to declare which basic messages (prev/next/first/last) they support, a mechanism for applications to register a handler for such messages, and a mechanism for a capturing app to send these messages to the captured application.

Overview of the added basic-messaging capabilities:

Shared capturer/capturee:

  • Define a small set of strings, representing actionable messages, that can be sent capturer->capturee.
    • Possibly {"first", "prev", "next", "last"}.
    • Pre-bikeshedding name: CaptureActions.

Capturee-side:

  • Add a control allowing declaring supported actions.
    • Possibly MediaDevices.setSupportedCaptureActions().
  • Add a control for setting a handler for incoming messages.
    • Incoming messages are received as an event that exposes the single action sent.
    • For multiple concurrent capturers, we could later add a source field in the event. At the moment, I don't think this is an important edge case, so I'd rather leave this initially unaddressed.

Capturer-side:

  • Add a control, exposed on the track, for reading the actions supported by the application running in the captured surface.
  • Add a control, exposed on the track, for sending a message to the captured application's handler.
    • Rate-limit messages.
    • Only allow valid CaptureActions messages, but do not verify that the message was declared as supported, since there's no harm in letting them through and no reasonable attack enabled through them if they are just ignored. (Devil's advocate - what if an application only declares one action as supported and just assumes that's the only message it's ever getting? Counter - then that application is defective.)
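The capturer-side rules above (fixed action set, rate limiting, no check against the declared actions) could be sketched roughly as follows. The action strings come from the proposal; `CaptureActionSender`, its `deliver` callback and the `minIntervalMs` policy are hypothetical and stand in for whatever the user agent would actually do.

```typescript
// Candidate action set from the proposal above (names pre-bikeshedding).
const CAPTURE_ACTIONS = ["first", "prev", "next", "last"] as const;
type CaptureAction = (typeof CAPTURE_ACTIONS)[number];

function isCaptureAction(value: string): value is CaptureAction {
  return (CAPTURE_ACTIONS as readonly string[]).includes(value);
}

// Hypothetical capturer-side sender: rejects anything outside the fixed
// action set and drops messages sent less than `minIntervalMs` apart.
class CaptureActionSender {
  private readonly deliver: (action: CaptureAction) => void;
  private readonly minIntervalMs: number;
  private lastSentAt = Number.NEGATIVE_INFINITY;

  constructor(deliver: (action: CaptureAction) => void, minIntervalMs: number) {
    this.deliver = deliver;
    this.minIntervalMs = minIntervalMs;
  }

  // `now` is injectable so the rate limit can be exercised deterministically.
  // Returns true if the action was delivered, false if it was rate-limited.
  send(action: string, now: number = Date.now()): boolean {
    if (!isCaptureAction(action)) {
      throw new TypeError(`Not a valid capture action: ${action}`);
    }
    if (now - this.lastSentAt < this.minIntervalMs) {
      return false;
    }
    this.lastSentAt = now;
    // Deliberately no check against the capturee's declared actions:
    // valid-but-undeclared actions are let through and may be ignored.
    this.deliver(action);
    return true;
  }
}
```

Dropping rate-limited messages is only one possible policy; queueing them would be an equally plausible reading of "rate-limit messages".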

Noteworthy

  • The capturee is still unaware that it's captured, unless the capturer intentionally alerts it to the fact.
  • Declaring supported actions allows the capturer to expose only relevant user-facing controls, and do so without really knowing what it's capturing. For example, a video-conferencing application authored tomorrow and never updated later, could still expose only prev/next buttons when capturing Hypothetical-Slides-Deck, but expose first/prev/next/last when capturing Hypothetical-Video-App, even if both of these captured apps were written long after the VC app, and have never set up collaboration specifically with it.
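The second point amounts to intersecting two lists. A minimal sketch, assuming the candidate action names above; `ActionName` and `controlsToShow` are hypothetical names for this example:

```typescript
// Action names follow the candidate set proposed above.
type ActionName = "first" | "prev" | "next" | "last";

// Controls the capturer should render: those it has UI for, filtered to
// the ones the captured application declared as supported. The result
// keeps the capturer's own UI order.
function controlsToShow(
  capturerUi: readonly ActionName[],
  declaredByCapturee: readonly ActionName[],
): ActionName[] {
  const declared = new Set<ActionName>(declaredByCapturee);
  return capturerUi.filter((action) => declared.has(action));
}
```

A capturer with UI for all four actions would show only prev/next for a capturee that declares `["next", "prev"]`, and all four for one that declares the full set, without knowing anything else about either application.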

@jan-ivar: Have I accurately represented your position? Do you agree with the general approach I suggest here for basic messaging?
@youennf, @aboba and everybody else: What do you think? Would you support the adoption of Capture Handle if we add these mechanisms?

@youennf

youennf commented Jan 14, 2022

In general, I like the idea of actions.
In terms of scope and role sharing between specs/WGs, I think we should try to not build our own actions.
Instead I would try to piggy back on https://w3c.github.io/mediasession/, which might move some of the corresponding API from MediaDevices to MediaSession.
If feasible, getDisplayMedia would be a bootstrapping mechanism to potentially get access to MediaSession proxies.
Other APIs in the future could well expose the same proxies.

I am also interested in the trust model, which probably applies to both CaptureHandle and actions.
In general, it seems desirable to know the origin of data you are processing, which would translate into something like:

  • capturer knows capturee origin if processing some information from capturee.
  • capturee knows capturer origin if processing some information from capturer.
    Depending on the API shape, interaction with transferring MediaStreamTrack might be useful to study.
    Rate limiting could be enforced by the API (only one action being processed at a time, capturee can tell when the action is done to capturer).

Another approach would be to state that the UA is the entity sanitising capturer/capturee relationship.
In which case, the UA would be the one triggering action event handlers on behalf of the capturer and potentially doing rate limiting. Capturee would not even need to do anything other than register MediaSession callbacks.

@youennf, @aboba and everybody else: What do you think? Would you support the adoption of Capture Handle if we add these mechanisms?

Can you clarify what the proposal is?
Is it that we would work/adopt one spec that would define both capture handle and actions?
Or that we would work/adopt two specs roughly at the same time, one for capture handle and one for actions?

I would personally be inclined to split the work in two different specs given the scopes seem to be different enough.

@eladalon1983
Member Author

Can you clarify what the proposal is?
Is it that we would work/adopt one spec that would define both capture handle and actions?
Or that we would work/adopt two specs roughly at the same time, one for capture handle and one for actions?

I would personally be inclined to split the work in two different specs given the scopes seem to be different enough.

There are two core issues addressed in this thread, which I'll call "identity" (original proposal) and "actions" (additional mechanisms). I also prefer splitting the work, but I tentatively propose addressing both issues together as an attempt to reach a compromise with @jan-ivar. To avoid risking misrepresenting his position, I'd like to ask @jan-ivar to explain why he thinks the two should be combined.

Clarifying my question about support - it's an open ended question. Would you support either part (identity/actions)? Both? Only a certain mix? I hope that you'll be amenable to the identity part at least, @youennf, as the design was much influenced by your earlier input. :-)

Instead I would try to piggy back on https://w3c.github.io/mediasession/, which might move some of the corresponding API from MediaDevices to MediaSession.

Could you please clarify your proposal here?
Please note that:

  • Some applications will want to treat actions differently if they come from the user (MediaSession IIUC), compared to if they come from a capturing application (might be piping the user, might not).
  • The MediaSessionAction enum includes some actions (e.g. seekto) which require more information than we can assume the capturer/capturee have exchanged.
  • The MediaSessionAction enum includes some actions (e.g. togglemicrophone) which IMHO appear inappropriate in our context.

In general, it seems desirable to know the origin of data you are processing

It's an interesting issue, and might not have the same answer for both directions.

  1. The original part of Capture Handle (identity) allows the capturee to select whether it wishes to expose its origin. I think this is simple and flexible enough, because capturers that intend to trust only a given set of origins, will find that undefined does not match any origin on their allowlist. So conditional exposure here is both desirable (discussed earlier) and ergonomic (likely no code changes needed in the app compared to when origin-exposure is mandatory).
  2. For actions the capturee receives from the capturer, I think it's reasonable to either expose, not expose, or conditionally expose the origin of the sender. I have no strong conviction here, and would be happy to align with whatever others agree on.
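To illustrate the first point, here is a minimal sketch of a capturer-side allowlist check. The `ObservedCapturee` shape and its field names are assumptions made for this example, not the proposed API surface; the point is only that an unexposed (undefined) origin fails the same comparison that rejects untrusted origins, so no extra branch is needed.

```typescript
// Hypothetical shape of what the capturer observes; field names are assumptions for this sketch.
interface ObservedCapturee {
  exposedOrigin: string | undefined; // undefined when the capturee chose not to expose its origin
  handle: string;
}

// One comparison covers both cases: undefined never equals an allowlisted origin.
function isAllowlisted(
  observed: ObservedCapturee,
  allowlist: ReadonlyArray<string>
): boolean {
  return allowlist.some((allowed) => allowed === observed.exposedOrigin);
}
```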

Capturee would not even need to do anything other than register MediaSession callbacks.

I believe that's equally true for all proposals currently under discussion (modulo declaring capabilities, discussed below).

In which case, the UA would be the one triggering action event handlers on behalf of capturer

The idea behind the current actions-proposal is that the capturing application can expose its own custom, in-content controls for the intersection of controls supported by the capturer and the capturee. So, for example, if Zoom captures Slides, it could expose next/prev, but if it captures YouTube, maybe it also exposes mute (mute-shared-tab, that is). I think this is preferable to UA-based controls. But possibly I have misunderstood your point?
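The "intersection of controls" idea can be sketched as below. The action names and the `controlsToExpose` helper are hypothetical and chosen only to mirror the next/prev/mute example; how each side would actually learn the other's supported set is not specified here.

```typescript
// Hypothetical action names, mirroring the next/prev/mute example above.
type SharedControl = "next" | "previous" | "mute";

// The capturer shows only the controls that both sides support,
// keeping the capturer's own ordering.
function controlsToExpose(
  capturerSupports: ReadonlyArray<SharedControl>,
  captureeSupports: ReadonlyArray<SharedControl>
): SharedControl[] {
  return capturerSupports.filter(
    (control) => captureeSupports.indexOf(control) !== -1
  );
}
```

With a capturer that supports all three controls, a slides-like capturee declaring only next/previous would yield next/previous, while a video-like capturee that also declares mute would yield all three.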

@youennf

youennf commented Jan 19, 2022

@eladalon1983, @jan-ivar, following up on yesterday's meeting, here are examples that I hope clarify what I have in mind.
First example is targeting a fixed protocol, second example is for custom web-application protocols.

  1. MediaSession-based communication.
// MediaSession related API
partial interface MediaSession {
    readonly attribute MediaSessionProxyRules proxyRules;
};
interface MediaSessionProxyRules {
    maplike<USVString, sequence<MediaSessionAction>>;
};

partial dictionary MediaSessionActionDetails {
    USVString origin;
};

interface MediaSessionProxy {
    readonly attribute USVString origin;
    readonly attribute sequence<MediaSessionAction> actions;
    attribute EventHandler onproxychange;

    Promise<undefined> triggerAction(MediaSessionAction action);
};

// getDisplayMedia bootstrapping API, based on https://github.com/w3c/mediacapture-screen-share/issues/190.
partial interface GetDisplayMediaResultEvent {
    readonly attribute MediaSessionProxy mediaSession;
};
partial interface MediaDevices {
    attribute EventHandler ongetdisplaymediaresult;
};

Capturee is exposing its origin through MediaSessionProxy.origin to capturers whose origin is granted by capturee's MediaSessionProxyRules.
Capturer exposes its origin to capturee when sending an action to capturee (through actionDetails.origin, which remains empty for UA triggered actions).
Handling of capturee navigation is done through onproxychange, similarly to change of MediaSession or MediaSession rules.
triggerAction would reject if, at the time of firing the corresponding event, the MediaSession is no longer active or the action no longer matches the MediaSession proxy rules.
previousslide/nextslide could be handled as previoustrack/nexttrack.
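The rejection rule for triggerAction described above can be modelled as a plain check. This is a sketch under assumptions, not the proposed implementation: `ProxyRulesSketch` stands in for the maplike proxy rules (capturer origin to granted actions), and `mayTrigger` is a hypothetical helper that returns false in the cases where triggerAction would reject.

```typescript
// Hypothetical subset of actions, for illustration only.
type ProxiedAction = "play" | "pause" | "previoustrack" | "nexttrack";

// Stand-in for the proxy rules: capturer origin -> actions granted to that origin.
interface ProxyRulesSketch {
  [capturerOrigin: string]: ProxiedAction[];
}

// Returns false where triggerAction would reject: the session is no longer
// active, or the action is not (or no longer) granted to this capturer origin.
function mayTrigger(
  rules: ProxyRulesSketch,
  sessionActive: boolean,
  capturerOrigin: string,
  action: ProxiedAction
): boolean {
  if (!sessionActive) {
    return false;
  }
  const granted: ProxiedAction[] = Object.prototype.hasOwnProperty.call(
    rules,
    capturerOrigin
  )
    ? rules[capturerOrigin]
    : [];
  return granted.indexOf(action) !== -1;
}
```

The check is meant to run at the time the event would fire, so a rule change or navigation between the call and the event still leads to rejection.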

  2. MessageChannel-based communication
dictionary GetDisplayMediaCapturer {
    USVString origin;
    MessagePort port;
};
callback GetDisplayMediaChannelCallback = undefined (GetDisplayMediaCapturer capturer);
partial interface MediaDevices {
    undefined setGetDisplayMediaChannelCallback((USVString or sequence<USVString>) origins, GetDisplayMediaChannelCallback callback);
};

interface CaptureeProxy {
    readonly attribute USVString origin;
    attribute EventHandler onproxychange;

    Promise<MessagePort> openChannel();
};

// getDisplayMedia bootstrapping API, based on https://github.com/w3c/mediacapture-screen-share/issues/190.
partial interface GetDisplayMediaResultEvent {
    readonly attribute CaptureeProxy capturee;
};
partial interface MediaDevices {
    attribute EventHandler ongetdisplaymediaresult;
};

Capturee is exposing its origin through CaptureeProxy.origin to capturers whose origin is granted by capturee through setGetDisplayMediaChannelCallback.
setGetDisplayMediaChannelCallback is limited to first-party iframes.
Capturer exposes its origin to capturee when calling openChannel (which may fail in some cases, like navigation or change of channel callback origins, or if CaptureeProxy.origin is empty).
After that, capturee and capturer use the MessageChannel ports to directly communicate with each other.
Handling of capturee navigation is done through onproxychange and calling openChannel again.
CaptureeProxy is neutered when the getDisplayMedia track source is ended.
previousslide/nextslide would be handled through capturee-specific messages that capturer needs to understand.

@eladalon1983
Member Author

Thank you for these proposals, @youennf.


  1. MediaSession-based communication.

This is an interesting proposal. I have saved some nits for later, so as to focus first on the general thrust. The proposal takes great pains to latch onto an existing mechanism (MediaSession). It's not immediately clear to me what is gained by making that design decision. Could you please explain?

As for the drawbacks I can see:

  • It creates a dependency on MediaSession.
  • It seems more complex than the alternative of a simple bespoke API.
  • It brings into the mix MediaSessionActions which are less relevant given the context (e.g. togglecamera, hangup).
  • Even actions which are relevant for both contexts, are awkwardly phrased for the new one ("nexttrack" vs. a generic "next").
  • It makes it hard to extend with actions that are relevant for our context, but are not relevant for the MediaSession context.

Would love to hear more of your thoughts, as well as those of @jan-ivar.


  2. MessageChannel-based communication

I'd rather steer clear of this.

On the one hand, it requires tight cooperation, or else how would the capturer/capturee understand each other? So we can take it as a given that users of this API would be tightly integrated.

On the other hand, it forces a communications method, and I think this should be left out of scope. Tightly integrated applications have their own various means.

@jan-ivar
Member

  1. MediaSession-based communication.

@youennf Reusing mediaSession, while intriguing, seems risky design-wise and scope-wise: repurposing a well-known API with a known trust model to now also be something else. I don’t see a lot of user benefit frankly.

I'd prefer keeping what's exposed in this WG. We have to be careful not to add ways for malicious sites to remotely operate arbitrary captured pages in ways that may deceive users. E.g. triggerAction() needs transient activation to start.

I’m also not sure “advance slide” is the same thing as “next track”: e.g. a presenter may have background music playing, and expect the latter to skip to the next audio track, not advance to the next slide.

While "play", "pause" and "resume" may be reasonable, what if capturee has two video elements, which one plays? I sense a slippery slope here toward users asking to click on buttons in the capture preview and have it affect buttons in the capturee page. While this might be useful, if done wrong (letting JS control coordinates) it might let scammers remotely operate a user's browser.

So my instinct is we want to tightly control this separate from mediaSession.

  2. MessageChannel-based communication

... capturee and capturer use the MessageChannel ports to directly communicate...

I'd rather avoid adding another messaging channel to the platform. Also, as soon as such a channel exists, the parties can exchange IDs anyway, so this seems like a superset of @eladalon1983's approach.

partial interface MediaDevices {
    attribute EventHandler ongetdisplaymediaresult;
}

I have to say I like how this separates the control surface from the MediaStreamTrack. Tracks can be cloned and transferred, and it's not clear to me this surface should follow it, e.g. to a worker (would track.triggerAction("next slide") work there? Do workers have transient activation?) — But if getDisplayMedia is called twice in a row, how does JS know which one it is?

@youennf

youennf commented Jan 25, 2022

@youennf Reusing mediaSession, while intriguing, seems risky design-wise and scope-wise: repurposing a well-known API with a known trust model to now also be something else. I don’t see a lot of user benefit frankly.

In the WG meeting, YouTube was given as an example where such API could be potentially useful.
If this API gets useful for slides, I can see it useful for other cases as well, so we might end up with more actions.
Also, if it is useful to control slides from another page, it might also be useful to control slides from a PiP window, so useful to expose specific slides control from a MediaSession.

If we take the approach to add specific actions, I do not want to end up in a place where we duplicate the work with MediaSession.

I'd rather avoid adding another messaging channel to the platform. Also, as soon as such a channel exists, the parties can exchange IDs anyway, so this seems like a superset of @eladalon1983's approach.

The CaptureHandle's proposal is adding a one way communication from capturer to capturee.
As Elad has mentioned several times, this one-way communication channel can be used to create a two-way communication channel between capturer and capturee.

The action's proposal is most probably also creating a two way communication channel between capturer and capturee (we would need to have a clear API to validate this).
The same privacy/security risks (exchange IDs) apply to all these proposals and we should review all of them accordingly. Or did I miss something? If the issue is not privacy/security, what is it?

I'd rather steer clear of this.
On the one hand, it requires tight cooperation, or else how would the capturer/capturee understand each other? So we can take it as a given that users of this API would be tightly integrated.

I do not think it requires the same tight cooperation.
Capturee can define its own protocol without knowing capturer. It is up to capturer to be able to understand and implement this protocol. Google Slides, as a capturee, does not have to understand capturer at all.
Capturee may of course still want to restrict the origin of capturer it is talking to.

On the other hand, it forces a communications method, and I think this should be left out of scope. Tightly integrated applications have their own various means.

This communication method is the traditional way of doing cross-context communications on the web (postMessage).
If this way of communication is not good, can you explain why and according to which criteria?

But if getDisplayMedia is called twice in a row, how does JS know which one it is?

Take the current eventing model and two getDisplayMedia calls in a row, say A and B. As long as both A and B trigger the prompt, the assumption is that the order is preserved as:

  • event A is fired.
  • promise A is resolved.
  • event B is fired.
  • promise B is resolved.

While this is not strictly guaranteed by the spec, because of the 'run in parallel' step, we could make that clear.

The case where ordering is not guaranteed is if B fails, in which case promise B may reject sooner (but there will be no event B), though we could fix this as well.
It is true that the technique of registering a one-time event handler will not work in all cases, so the API ergonomics are not optimal. This would also be true of the focus event handler, so maybe we need a better API there, such as an additional getDisplayMedia callback.

@eladalon1983
Member Author

If this API gets useful for slides, I can see it useful for other cases as well, so we might end up with more actions.
...
If we take the approach to add specific actions, I do not want to end up in a place where we duplicate the work with MediaSession.

MediaSession is a rich spec that offers much more than sending simple actions. I think the duplication of work is quite minimal. I have not yet heard what entanglement with MediaSession would improve.

The CaptureHandle's proposal is adding a one way communication from capturer to capturee.

It's the other way around.
Identity: Message from capturee to capturer. Normally one-off, with a change (new message) when the capturee navigates.
Actions: Message from capturer to capturee. Often repeating.

The action's proposal is most probably also creating a two way communication channel between capturer and capturee

Looks pretty one-way to me. See explanation above. Each part (Identity, Actions) produces a distinct one-way communication channel. These APIs (Identity, Actions) are useful both in isolation as well as together. One session can involve one, the other, neither, or both.

I'd rather avoid adding another messaging channel to the platform

The same privacy/security risks...

I'd like to join @jan-ivar's objection to adding more generic message channels. Before we dive too deep into the discussion of whether it's secure, I think the onus is on you to show it's desirable to add a generic messaging channel where a limited one would do.

I do not think it requires the same tight cooperation.
Capturee can define its own protocol without knowing capturer. It is up to capturer to be able to understand and implement this protocol.

@youennf, I believe these two lines contradict each other. The second line explains the necessity of a mini-protocol for both approaches (my Identity, your MessageChannel-based communication approach). Namely, that even if the mini-protocol is as simple as a stringified JSON with a single key-value pair, e.g. {name: "Wikipedia"}, this is something that the capturer must understand.
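As a concrete illustration of how small such a mini-protocol can be, here is a sketch of a capturer reading the `{name: ...}` example from a handle string. The `readCaptureeName` helper is hypothetical; the only thing taken from the discussion is that the handle is a string which may carry stringified JSON with a single key-value pair.

```typescript
// Hypothetical capturer-side helper: extracts `name` from a stringified-JSON handle.
// Returns undefined when the handle does not follow the agreed mini-protocol.
function readCaptureeName(handle: string): string | undefined {
  try {
    const parsed: unknown = JSON.parse(handle);
    if (typeof parsed === "object" && parsed !== null) {
      const candidate = (parsed as { name?: unknown }).name;
      if (typeof candidate === "string") {
        return candidate;
      }
    }
  } catch (_err) {
    // Not valid JSON: the capturee is not speaking this mini-protocol.
  }
  return undefined;
}
```

A capturee that set its handle to `JSON.stringify({ name: "Wikipedia" })` would be read back as "Wikipedia"; any other string is treated as an unknown capturee.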

If this way of communication is not good, can you explain why and according to which criteria?

When applications share a cloud infrastructure, it might be preferable for the developers to go through some pre-existing RESTful API than to add code in the capturee to handle messages from the capturer. Especially if the captured application does not wish to assume that the local user delegates their permissions in captured-application to capturing-application just by allowing it to display-capture. Since the captured-application will still treat messages from the capturing-application as suspicious (the local user might only have partial understanding of what display-capture allows here), it's easier to use pre-existing mechanisms for access-control, than to replicate them for yet another communications channel.

@youennf

youennf commented Feb 2, 2022

MediaSession is a rich spec that offers much more than sending simple actions. I think the duplication of work is quite minimal. I have not yet heard what entanglement with MediaSession would improve.

If we go with actions like next slide, previous slide, capturer might want to understand whether:

  • these actions are supported or not (say we are at the end of the slide show)
  • when calling an action, understand when the action is actually executed.

We could try to be very restrictive in terms of actions, but my gut feeling is telling me people will want more than that.
And the "more than that" will be more and more redundant with MediaSession.

Looks pretty one-way to me. See explanation above. Each part (Identity, Actions) produces a distinct one-way communication channel. These APIs (Identity, Actions) are useful both in isolation as well as together. One session can involve one, the other, neither, or both.

Identity provides a de facto one-way communication channel.
As you explained during last meeting IIRC, it can be used to create a two-way communication channel using server-side logic.

Actions also provide a two-way communication channel:

  • capturee exposes the actions it supports
  • capturer triggers some actions

I think the onus is on you to show it's desirable to add a generic messaging channel where a limited one would do.

IIRC, in a past WG meeting, you stated that, in some cases, the goal is to create such a messaging channel, either through a network intermediary or through something like RTCPeerConnection.

If we believe this is something desirable, direct support through postMessaging seems a better option to me.
It is simpler, more reliable, more efficient, more powerful and probably more secure.
As an example, a capturee could postMessage a CropTarget to capturer which would use cropTo.
Another example: if RTCPeerConnection is used, who says local networking will always work?

@youennf, I believe these two lines contradict each other. The second line explains the necessity of a mini-protocol for both approaches (my Identity, your MessageChannel-based communication approach). Namely, that even if the mini-protocol is as simple as a stringified JSON with a single key-value pair, e.g. {name: "Wikipedia"}, this is something that the capturer must understand.

The capturer/capturee approach is like a client/server approach: server/capturee defines the protocol, client/capturer has to abide by it. Server/capturee may or may not restrict client to specific origins.
What I mean by less tight cooperation is that, with your proposal, server-side collaboration is required.
With server-side collaboration, capturer will send some identity to capturer service, which has to generate some form of identity proof to the capturee service to do pairing. This identity proof is not needed in case UA internal communication is used (there is a single user using the UA).

When applications share a cloud infrastructure, it might be preferable for the developers to go through some pre-existing RESTful API

This does not seem contradictory to me: the same RESTful API can be used to validate or not the request to open a channel with a capturer.

@eladalon1983
Member Author

The capturer/capturee approach is like a client/server approach: server/capturee defines the protocol, client/capturer has to abide by it. Server/capturee may or may not restrict client to specific origins.
What I mean by less tight cooperation is that, with your proposal, server-side collaboration is required.
With server-side collaboration, capturer will send some identity to capturer service, which has to generate some form of identity proof to the capturee service to do pairing. This identity proof is not needed in case UA internal communication is used (there is a single user using the UA).

I have not understood this message.
In both cases one side sends the first message.

  • If we expose a string from the capturee on the capture-handle the capturer sees (my proposal), the capturee effectively sends the first message.
  • If we expose a MessagePort on the capture-handle, the capturer sends the first message.

In both cases, it is necessary for both sides to agree on how messages are structured. It's equally true in both cases, and therefore the tightness of collaboration assumed is the same.

This does not seem contradictory to me: the same RESTful API can be used to validate or not the request to open a channel with a capturer.

I do not believe it makes sense to force two applications that are already communicating using tried-and-true mechanisms that were produced by expensive-to-employ engineers, to now support a new method of communication. Enable - great, let's bookmark the idea of adding a MessagePort to the capture-handle API, and circle back to it when it's time for improvements. But for the MVP, a simple string is enough.

@jan-ivar
Member

jan-ivar commented Feb 3, 2022

If we end up duplicating some of the mediaSession API surface, so what? I see benefit in doing so, as it gives JS full control over whose control they wish to enable.

If things are almost the same, they are not the same. We can follow patterns without sharing WebIDL.

@youennf

youennf commented Feb 4, 2022

The actions proposal is based on interest from capturers; I haven't heard any interest from capturees. That puts this API at risk. To be successful, this API should be as good as, if not better than, the out-of-band approach that capturees are apparently planning to use.
Piggybacking on MediaSession is a way to reduce the adoption risk/burden, as this is an already-adopted API, provided we can find the right security model.

I do not believe it makes sense to force two applications

As I said, this gives the choice, existing communication channels can continue to be used.

to now support a new method of communication.

It is not a new method; it is reusing a well-known web pattern between two entities that do not deeply trust each other (cross-origin iframes or opener/openee).

But for the MVP, a simple string is enough.

AIUI, it is not a simple string; it is a string + an origin + an event. This makes it very close to postMessage, albeit without transfer.
If we think we will add postMessage communication level, I do not see why we should support the string API.

@eladalon1983
Member Author

If we think we will add postMessage communication level, I do not see why we should support the string API.

We will still want the string even if we add the channel:

  1. Because this allows the capturee to send a message that basically says "here is what you need to know, and you don't even have to let me know you're capturing me to find this out."
  2. Some companies have multiple independent products with the same origin. For example, Google has Docs, Slides and Sheets all under the umbrella of docs.google.com. If capturing one of these, it is useful to know immediately which it is. They might use different protocols when communicating over the MessageChannel that you propose, produced at different times by different teams.
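The second point can be sketched as a capturer that uses the exposed origin together with the handle string to decide which protocol to speak before sending anything. Everything named here is an assumption for illustration: the `https://docs.example` origin, the handle values, the protocol labels and the `pickProtocol` helper.

```typescript
// Hypothetical view of what the capturer learns from the capture handle.
interface SeenCapturee {
  exposedOrigin: string;
  handle: string;
}

// Hypothetical table: one origin hosting several products, each with its own protocol.
const protocolByProduct: {
  [exposedOrigin: string]: { [handle: string]: string };
} = {
  "https://docs.example": {
    slides: "slides-control-v1",
    sheets: "sheets-control-v2",
  },
};

// Returns the protocol to use, or undefined if the capturee is not recognised.
function pickProtocol(seen: SeenCapturee): string | undefined {
  const products = Object.prototype.hasOwnProperty.call(
    protocolByProduct,
    seen.exposedOrigin
  )
    ? protocolByProduct[seen.exposedOrigin]
    : undefined;
  if (products === undefined) {
    return undefined;
  }
  return Object.prototype.hasOwnProperty.call(products, seen.handle)
    ? products[seen.handle]
    : undefined;
}
```

The origin alone cannot distinguish the two products in this table; the string is what lets the capturer pick the right protocol immediately.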

Can we agree to label the channel as an improvement?
