
Conditional Focus (When Display-Capture Starts) #190

Closed
eladalon1983 opened this issue Sep 6, 2021 · 61 comments · Fixed by #240

Comments

@eladalon1983
Member

eladalon1983 commented Sep 6, 2021

Problem

When an application starts capturing a display-surface, the user agent faces a decision: should the captured display-surface be brought to the forefront, or should the capturing application retain focus?

The user agent is mostly agnostic of the nature of the capturing and captured applications, and is therefore ill-positioned to make an informed decision.

In contrast, the capturing application is familiar with its own properties, and is therefore better suited to make this decision. Moreover, by reading displaySurface and/or using Capture Handle, the capturing application can learn about the captured display-surface, driving an even more informed decision.

Sample Use Case

For example, a video conferencing application may wish to:

  • Focus a captured application that users typically interact with during the call, like a text editor.
  • Retain focus for itself when the captured display-surface is non-interactive content, like a video.
    • (Using Capture Handle, the capturing application may even allow the user to remotely start/pause the video.)

Suggested Solution and Demo

  • I have produced a spec-draft of my suggested solution.
  • I have also implemented it behind a flag in Chrome and produced a demo.
    • Note that this demo works on Chrome Canary.
    • Further instructions are provided in the demo itself.

Sample Code

const stream = await navigator.mediaDevices.getDisplayMedia();
const [track] = stream.getVideoTracks();
track.focus(ShouldFocus(stream) ? "focus-captured-surface" : "no-focus-change");

function ShouldFocus(stream) {
  const [track] = stream.getVideoTracks();
  if (sampleUsesCaptureHandle) {
    // Assume logic discriminating focusability by origin,
    // for instance focusing anything except https://collaborator.com.
    const captureHandle = track.getCaptureHandle();
    return ShouldFocusOrigin(captureHandle && captureHandle.origin);
  } else {  // Assume Capture Handle is not a thing.
    // Assume the application is only interested in focusing tabs, not windows.
    return track.getSettings().displaySurface == 'browser';
  }
}

Security Concerns

One noteworthy security concern is that allowing focus to be switched at an arbitrary moment could enable clickjacking attacks. The suggested spec addresses this concern by limiting the time during which focus-switching may be triggered/suppressed - the application may only decide about focus immediately[*] upon the resolution of the Promise<MediaStream>. (See the spec-draft for more details about what "immediately" means and how I suggest various edge-cases be handled.)

@happylinks

This looks amazing; we've been waiting for this for a long time! In tella.tv we have the following flow:

  1. User goes to recorder
  2. User selects a screen/window/tab they want to record
  3. When picking a window or tab, they get jumped to that window or tab, but need to go back to Tella first to start the recording.
  4. Start recording in tella
  5. Countdown
  6. Manually go to the tab/window they were recording.

If I'm reading this proposal right, our new flow could be:

  1. User goes to recorder
  2. User selects a screen/window/tab they want to record (and does not get jumped to the window or tab)
  3. User clicks start recording
  4. Countdown
  5. We fire .focus on the stream to bring them where they need to be automatically.

A much nicer flow. Would love to be in the Origin Trial when it's available :)

@happylinks

happylinks commented Sep 7, 2021

Ah, reading the spec again, I missed this point:
If this call to focus() happens more than one second after the start of the capture, the user agent SHOULD have already made its own decision. The user agent SHOULD silently ignore this call to focus().

That means we would not be able to do "We fire .focus on the stream to bring them where they need to be automatically" after the countdown.
Still better to keep them in Tella though! What is the reason for the 1 second, if I may ask?

@eladalon1983
Member Author

eladalon1983 commented Sep 7, 2021

What is the reason for the 1 second, if I may ask?

Please note that 1s is the hard-limit, and normally the window of opportunity for calling focus() would close even earlier - as soon as the microtask on which the Promise was resolved is completed. (The 1s hard-limit prevents the application from keeping that microtask running for N seconds and then switching focus.)

The rationale for these limitations is preventing attacks where the capturing application induces the user to click at a location of its choosing in the captured application by switching focus at just the right time. As a somewhat contrived example, imagine the capturing application presenting, at just the right location, a button labelled "double-click here to confirm" and then switching focus after it registers the first click.

Are these concerns realistic? I'm not yet sure, but it should be easier to reach consensus if we err on the side of caution.
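
To make the window of opportunity concrete, here is a minimal sketch (assuming the per-track focus() API from the spec-draft):

const stream = await navigator.mediaDevices.getDisplayMedia();
const [track] = stream.getVideoTracks();

// In time: we are still on the microtask on which the Promise was resolved.
track.focus("no-focus-change");

// By contrast, this variant would be too late, because the extra await yields
// the microtask and the user agent proceeds with its own decision:
//
//   const stream = await navigator.mediaDevices.getDisplayMedia();
//   await somethingElse();                      // window of opportunity closes
//   stream.getVideoTracks()[0].focus("focus-captured-surface");  // rejected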

@happylinks

Thanks for explaining! Preventing the focus will already help us a lot.

@eladalon1983
Member Author

@jan-ivar and @youennf, I'd love to carry on from where we stopped at the end of the WebRTC WG Interim meeting yesterday.

@youennf
IIUC, your main concern was around the 1s hard-limit, but I think that you later mentioned this was due to a misunderstanding? Could you PTAL at my proposed spec and let me know what you think? Please see this part:

  1. If focus() was not called on MT, raise an InvalidStateError. Otherwise, proceed.
  2. If this call to focus() happens more than one second after the start of the capture, the user agent SHOULD have already made its own decision. The user agent SHOULD silently ignore this call to focus().

Please note that the intention here is that if focus() is not called by the time MT terminates, then the browser immediately switches focus - without a 1s delay. The 1s delay is a backup in case MT takes an inordinately long time to finish.

@jan-ivar

  1. IIUC, your concerns were over subclassing MediaStreamTrack vs. exposing an API as navigator.mediaDevices.focus. Could you please elaborate on why you think one is superior to the other?
  2. I believe you also suggested scheduling a task to close the focus-window-of-opportunity. My concern here is that tasks are more likely than microtasks to experience long delays, and the hard-limit of 1s would be encountered more often than we'd want. (I think we'd want it as a security restriction that's hardly ever encountered by a non-malicious app.) Could you please elaborate on your thoughts on this topic?

Please let me know if I've forgotten any other open issues.

@eladalon1983
Member Author

Btw, following the interim it became clearer to me which parts of my proposed spec were in need of clarification. I have made adjustments, mostly around (a) explaining the window of opportunity and (b) providing examples. PTAL.

@jan-ivar
Member

jan-ivar commented Sep 22, 2021

  1. IIUC, your concerns were over subclassing MediaStreamTrack vs. exposing an API as navigator.mediaDevices.focus. Could you please elaborate on why you think one is superior to the other?

A user's focus is a global property, so a per-track API simply isn't necessary, and we can avoid the subclassing headache.

I believe you also suggested scheduling a task to close the focus-window-of-opportunity. My concern here is that tasks are more likely than microtasks to experience long delays, and the hard-limit of 1s would be encountered more often than we'd want. (I think we'd want it as a security restriction that's hardly ever encountered by a non-malicious app.) Could you please elaborate on your thoughts on this topic?

I think you're looking at this the wrong way. How does a tighter window of opportunity benefit slow apps?

What matters is the amount of code that runs up to the point of invoking the focus method. In your model, JS gets a whole second as long as it stays synchronous, but a single microtask switch (e.g. await undefined) and it's out.

That seems too harsh to me, making things harder on apps, not easier. It also seems biased against apps that use promises.

Microtask-switching is sometimes unavoidable when writing promise code (and arguably why microtasks exist — their whole point is that they're not tasks, and operate within a task). For example:

async function getUserMedia(constraints) {
  const stream = await navigator.mediaDevices.getUserMedia(constraints);
  // Store track ids in localStorage (synchronous, yet the wrapper still
  // introduces a microtask switch for its callers).
  localStorage.setItem("trackIds", JSON.stringify(stream.getTracks().map(t => t.id)));
  return stream;
}

This introduces a microtask switch on code calling this function as a drop-in for the real thing — but it's on the same task.🤔

From a security perspective, tasks and microtasks are equally vulnerable to busy-looping.

@jan-ivar
Member

@alvestrand asked me to outline the API shape I propose. I'd prefer:

console.log(navigator.mediaDevices.displayMediaFocusMode); // "focus"

const stream = await navigator.mediaDevices.getDisplayMedia();
navigator.mediaDevices.displayMediaFocusMode = "no-focus"; // stream's target won't be focused

It's a simple read-write enum attribute with two values. It never changes value except when JS changes it.

UA reads it in a task queued from getDisplayMedia's success task, to determine whether to focus what was just resolved.

The app can set it ahead of time as well, if it doesn't care what the user picks:

navigator.mediaDevices.displayMediaFocusMode = "no-focus"; // stream's target won't be focused
const stream = await navigator.mediaDevices.getDisplayMedia();

TL;DR: it affects subsequent calls, but if you set it immediately, it affects what just resolved as well.

@eladalon1983
Member Author

eladalon1983 commented Sep 23, 2021

...navigator.mediaDevices.displayMediaFocusMode = "no-focus";

I don't think this is going to "play nice" if the document is engaged in multiple concurrent captures.

it affects subsequent calls, but if you set it immediately, it affects what just resolved as well

I don't think that's simple, because the association between what gets captured and what gets focused is not there, and concurrent captures become clouded by questions of timing.

A user's focus is a global property, so a per-track API simply isn't necessary, and we can avoid the subclassing headache.

How is navigator.mediaDevices.someFocusApi different than FMST.focus() with respect to "globality"?
FMST.focus() allows feature discovery; it'd be impossible for an application to attempt to call it on a non-supporting track. So that's a benefit of this approach. What benefits would the other approach confer?

What matters is the amount of code that runs up to the point of invoking the focus method.

One does not need a lot of code to place a "Click here to win a million dollars" button and then guess how long to wait before switching focus to the other tab and getting the user to click something there. (Moreover, if you use tasks rather than microtasks, I think it's easier to wait for a hover event, making guesses about when the click is coming easier. But I could be wrong there. Let's call this a secondary argument.)

I don't think we can measure and base a decision off of how much code runs, regardless of whether the window-of-opportunity is closed by a task or a microtask. Either could run an arbitrary amount of code. The user's reaction time, though...

It also seems biased against apps that use promises.

Some of my best friends are apps that use Promises. I have found them to be capable of switching between using Promises and doing things synchronously, as the need arises.

@jan-ivar
Member

jan-ivar commented Sep 23, 2021

I don't think this is going to "play nice" if the document is engaged in multiple concurrent captures.

It should: There's only one user, who can only accept one prompt at a time. The association is with the prompt — the user interaction — not the track.

Example: Imagine a hypothetical navigator.mediaDevices.displayMediaFocus(track) method that must be called between a pair of tasks queued sequentially by the UA, where the first resolves with a new screen-sharing track: Context tells us what to focus, and the track argument is entirely redundant.

FMST.focus() allows feature discovery

This should suffice:

if ('displayMediaFocusMode' in navigator.mediaDevices) { /* UA feature detected */ }

All browsers today automatically focus all focusable surfaces ("monitor" isn't focusable), so the default is "focus".

I'd recommend we strongly encourage UAs to allow apps to override this automatic focus for all focusable surfaces.

This removes any need for per-surface-type detection of policies.

One does not need a lot of code to place a "Click here to win a million dollars" button and then guess how long to wait before switching focus to the other tab and getting the user to click something there. (Moreover, if you use tasks rather than microtasks, I think it's easier to wait for a hover event, making guesses about when the click is coming easier.

We discussed offline that gDM can queue two tasks A and B, guaranteeing no other task runs in between, so events aren't an issue. The promise is resolved in A, creating microtask A₁ where JS runs. Your proposal is to limit calls to between A₁-A₂, whereas I'm saying A₁-B is fine and superior as it avoids tripping up promise-use. The 1 second timeout is orthogonal.
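
To illustrate the difference with a minimal sketch (using the proposed displayMediaFocusMode attribute):

// Task A: the getDisplayMedia() promise resolves here (microtask A₁).
const stream = await navigator.mediaDevices.getDisplayMedia();

// A single extra microtask switch, e.g. introduced by a wrapper:
await undefined;

// Under an A₁-only (same-microtask) rule this arrives too late;
// under the A₁-B (before task B) rule it is still honored.
navigator.mediaDevices.displayMediaFocusMode = "no-focus";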

@youennf
Collaborator

youennf commented Sep 24, 2021

I also prefer an attribute over an explicit call for each track.
This allows an application to set its default value easily (never focus for instance) while providing enough flexibility for web applications that want per-capture decisions.

As for tasks vs. microtasks, there are preexisting examples.
For instance, the fetch event is fired and some properties are read synchronously after the fetch event listeners, thus before the microtasks that would have been scheduled during the fetch event listeners.

The main advantage I see is that it is a clear web developer contract: the boundary to call focus is the next await.
I believe this approach also allows shimming, since the read-focus callback would be executed after the already registered promise resolution callback.

From a pure spec/developer point of view, it should also be easier to specify and implement: queue a task in which the promise is resolved and a promise resolution callback that reads the value and applies focus is registered.
If we would like to give the whole task for changing the focus value, we might actually not want to read the value in the next task, but at the point in the current task where the microtask queue gets emptied after the task steps are executed.
I am not sure how doable that is right now in terms of spec editing.

@jan-ivar
Member

jan-ivar commented Sep 24, 2021

I also prefer an attribute over an explicit call for each track.

👍

As for tasks vs. microtasks, there are preexisting examples.
For instance, the fetch event is fired and some properties are read synchronously after the fetch event listeners, thus before the microtasks that would have been scheduled during the fetch event listeners.

Event listeners must respond synchronously to stop propagation or prevent the default action, for historical reasons.

Those "properties" appear to be fetch trying to support promise code with event.respondWith(async () => {}). So my takeaway from fetch is we should make affordances for promise use. ("[In fetch] we use promises heavily").

I believe this approach also allows shiming since the read-focus callback would be executed after the already registered promise resolution callback.

I'll try to explain the shimming problem in more detail: Say a JS library (e.g. adapter.js) needs to shim gDM for some reason:

const nativeGDM =  window.navigator.mediaDevices.getDisplayMedia;

window.navigator.mediaDevices.getDisplayMedia = async function getDisplayMedia(constraints) {
  // Here we're on Task A
  const stream = await nativeGDM.apply(this, arguments);
  // Here we're on Task B, Microtask B₁
  stream.newFeatureX = 3;
  return stream; // the implicit promise returned by this async function is resolved with stream
}

The shim is careful not to await anything else, yet a microtask checkpoint is unavoidable, because the promise returned by the shim is not the same as the one from nativeGDM. Thus the JS library will unintentionally break all existing code attempting to suppress focus. E.g.:

// Here we're on Task A
const stream = await navigator.mediaDevices.getDisplayMedia({video: true});
// Here we'd be on Task B, Microtask B₁ without the shim, but with the shim we're on Microtask B₂ instead
navigator.mediaDevices.displayMediaFocusMode = "no-focus"; // focuses with the shim but not without!

@eladalon1983
Member Author

eladalon1983 commented Sep 27, 2021

Global or per-surface controls

(@youennf:) I also prefer an attribute over an explicit call for each track.

These are two distinct preferences:

  1. Global or per-surface. (Discussed in this section.)
  2. Attribute or method. (Discussed in the next section.)

(@jan-ivar:) There's only one user, who can only accept one prompt at a time.

The browser can operate in modes which skip the prompt. Mechanisms to trigger these include extensions, enterprise policies and command-line arguments. The spec may be agnostic of these, but the fact remains that the user can accept multiple different captures within a very short span of time.

At any rate, if the application fires off two calls to getDisplayMedia and wants to focus exactly one of these, then it's a lot more ergonomic to call focus() on the right track than to manipulate a global attribute at just the right time, ensuring it holds the intended value when the UA reads it for the one display-surface and the other value when the UA reads it for the other display-surface. The latter requires much more in-depth understanding from the Web-developer.
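
A minimal sketch of the concurrent case under the per-track proposal (assuming one of the prompt-skipping modes above, so two captures can be accepted back-to-back):

const focusMe = navigator.mediaDevices.getDisplayMedia().then(stream => {
  stream.getVideoTracks()[0].focus("focus-captured-surface");
  return stream;
});
const doNotFocusMe = navigator.mediaDevices.getDisplayMedia().then(stream => {
  stream.getVideoTracks()[0].focus("no-focus-change");
  return stream;
});
// Each decision is unambiguously tied to its own capture; no global state
// needs to hold the "right" value at the "right" time.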

Method vs. Attribute

Assume, for the sake of argument, that my previous section convinced you to use per-surface controls. Do we want a method or attribute then?

An application that can read the value may just as well set its own preferred value. An attribute for focus would make sense if global, but not if per-surface.

However, before even calling getDisplayMedia, the application might already wish to know whether it can influence the decision. I would not object to adding a global attribute that reads (and potentially writes?) the default behavior - that which applies if the per-surface API is not invoked. The presence of this global attribute also informs the application that the per-surface control will be exposed if the user chooses to share a focusable surface.

Btw, one challenge with writable attributes is that Web-developers would not readily expect that setting an attribute can raise an exception.

Subclassing MediaStreamTrack

I think we have seen multiple cases where subclassing MediaStreamTrack would have conferred benefits, but each time a discussion arose over whether it's enough to sub-class just for that. The result of having everything on MediaStreamTrack is sub-optimal. Some immediate beneficiary APIs of a decision to sub-class would be:

  • The focus API - only focusable surfaces (browser-surfaces and windows; in the future potentially applications and/or isolated-browser-surfaces).
  • Capture Handle - only browser-surfaces (maybe more in the future).
  • Cropping - only self-capture (SelfCaptureMST -> BrowserSurfaceMST -> FocusableMST -> DisplayCaptureMST -> MST).

IMHO, this list is sufficiently long and the benefits are sufficient. When someone calls getViewportMedia(), the result is inherently different than when someone calls getUserMedia(), and it makes sense for the APIs exposed to reflect that.

Tasks vs. Microtasks

IIANM, the only argument for tasks is that they are shim-friendly. (Please correct me if I'm wrong.)

An argument against tasks is that, in addition to shimming, it allows an application to await somePromise. This is an anti-pattern, as the results would be flaky. If somePromise is already resolved, it would work; if it will only be resolved by a later task, it would not.

This trade-off is easy to reason about (IMHO) because we can have both. If we use microtasks, shimming is possible with an adapter:

const nativeGDM =  window.navigator.mediaDevices.getDisplayMedia;

// Example of the callback an application would supply:
function focusCallback(stream) {
  return "no-focus-change";  // or "focus-captured-surface"
}

// The adapter accepts the app's focus callback as an extra parameter
// (extra arguments are ignored by the native method).
window.navigator.mediaDevices.getDisplayMedia = async function getDisplayMedia(constraints, focusCallback) {
  const stream = await nativeGDM.apply(this, arguments);
  const [track] = stream.getVideoTracks();
  if (!!focusCallback && !!track.focus) {  // Feature-detect the proposed focus() method.
    const shouldFocus = focusCallback(stream);
    track.focus(shouldFocus);
  }
  return stream;
}

The code outside the shim just plugs their callback. Note that there are natural limits on what the app can do anyway until the window-of-opportunity closes, so I expect the code would easily and naturally fit inside of a synchronous callback.
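
Calling code would then look something like this (a sketch, assuming the adapter above):

const stream = await navigator.mediaDevices.getDisplayMedia(
    {video: true},
    capturedStream => "no-focus-change");  // the app's synchronous focus decision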

@youennf
Collaborator

youennf commented Sep 27, 2021

The shim is careful not to await anything else, yet a microtask checkpoint is unavoidable

It is avoidable by having the shim return the native promise returned by getDisplayMedia.
The only downside is that you cannot change the object being resolved, but you can update the object itself: add tracks, remove tracks...

Looking at the HTML event loop, we can also see that after the microtask checkpoint there are additional steps that need to run, some of which might trigger firing events.

These are two distinct preferences:

  1. Global or per-surface. (Discussed in this section.)
  2. Attribute or method. (Discussed in the next section.)

I meant a global attribute, somewhere on navigator.
With the current focus API design, in particular the restriction that the focus-or-not value is read synchronously in the task resolving the promise, I do not see a need for a per-track method.

Let's say the application wants one behavior for all getDisplayMedia calls. It will set the attribute once.
Let's say the application wants to select the behavior for each getDisplayMedia call; it will then update the attribute value each time within the promise resolution callback, as sketched below.
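
A sketch of both patterns, using the hypothetical displayMediaFocusMode attribute:

// Context-wide default for all future captures: never focus.
navigator.mediaDevices.displayMediaFocusMode = "no-focus";

// Per-capture decision: set the value synchronously upon resolution,
// before the user agent reads it.
const stream = await navigator.mediaDevices.getDisplayMedia();
navigator.mediaDevices.displayMediaFocusMode = "focus";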

A method is less intuitive to me.

one challenge with writable attributes is that Web-developers would not as readily expect setting of attributes to potentially raise an exception.

I do not think the plan for setting this attribute is to raise an exception.

@eladalon1983
Member Author

eladalon1983 commented Sep 27, 2021

Let's say the application wants to select the behavior for each getDisplayMedia call, it will update the attribute value each time within the promise resolution callback.

A Web-developer would need a non-trivial understanding of the feature in order to become convinced that there'd be no carry-over effect of focusing the wrong surface. They might explain it to themselves as "only the last capture gets focus or not" - but then, why does it sometimes not seem to work? There is no exception raised when they set the new value too late (e.g. after awaiting a yet-unresolved Promise). With a method, there would be, and that exception would not be unexpected.

@jan-ivar
Member

It is avoidable by having the shim return the native promise returned by getDisplayMedia.
The only downside is that you cannot change the object being resolved, but you can update the object itself: add tracks, remove tracks...

@youennf How? Please be specific.

one challenge with writable attributes is that Web-developers would not as readily expect setting of attributes to potentially raise an exception.

I do not think the plan for setting this attribute is to raise an exception.

Agreed (not beyond WebIDL's TypeError for invalid enum values).

@eladalon1983
Member Author

eladalon1983 commented Sep 27, 2021

I do not think the plan for setting this attribute is to raise an exception.

Agreed (not beyond WebIDL's TypeError for invalid enum values).

Contrast:

1.

const stream = await navigator.mediaDevices.getDisplayMedia();
const [track] = stream.getVideoTracks();
await someOtherPromisesThatResolvesMuchLater;
track.focus("no-focus-change");

2.

const stream = await navigator.mediaDevices.getDisplayMedia();
const [track] = stream.getVideoTracks();
await someOtherPromisesThatResolvesMuchLater;
navigator.mediaDevices.focusPolicy = "no-focus-change";

3.

const stream = await navigator.mediaDevices.getDisplayMedia();
const [track] = stream.getVideoTracks();
await immediatelyResolvedPromise;
navigator.mediaDevices.focusPolicy = "no-focus-change";
const otherStream = await navigator.mediaDevices.getDisplayMedia();
  • The first can throw an exception that explains to Web-developers why what they're doing is not having the desired effect.
  • The second has to accommodate the option that the Web-developer actually intended to set the new policy, going forward, completely independently of the call to gDM that has already resolved. The UA cannot raise an exception, because no error was made - but we human beings can tell there's a good chance it's a bug in the Web-application.
  • The third affects the gDM call that just resolved. Maybe it's what the developer wanted; or maybe they actually intended to only affect otherStream, and this is actually the inverse bug of the second example. Who knows?!

Only the first option produces consistent results (and raises a clear exception when used incorrectly).

@jan-ivar
Member

The browser can operate in modes which skip the prompt.

That is out of scope for this working group.

The spec may be agnostic of these, but the fact remains that the user can accept multiple different captures within a very short span of time.

@eladalon1983 Not a problem. JS is single-threaded, and getDisplayMedia implicitly queues a task to resolve a promise from in parallel steps. My proposal would be to queue two tasks, guaranteed to happen in succession. There is no "span of time" short enough for two getDisplayMedia calls resolving to be a problem.

it's a lot more ergonomic to call focus() on the right track than to manipulate a global attribute at just the right time,

You have to call focus() at just the right time as well. Also, if ergonomics is the issue, why isn't it stream.focus()?

@eladalon1983
Member Author

eladalon1983 commented Sep 27, 2021

The browser can operate in modes which skip the prompt.

That is out of scope for this working group.

I foresaw this response, and added the text which you quoted immediately below it (agnosticism etc.). The question is not whether we can handle it spec-wise, but rather whether it produces ergonomic results for the application. Consider:

const stream1 = await navigator.mediaDevices.getDisplayMedia();
doSomething();  // Maybe ends the task, maybe doesn't.
navigator.mediaDevices.focusPolicy = "no-focus"; // Who is affected? stream1? stream2? Both? Neither?
doSomethingElse();  // Maybe ends the task, maybe doesn't.
const stream2 = await navigator.mediaDevices.getDisplayMedia();

The global-attribute API produces code which is hard to reason about. Does it affect stream1? Does it affect stream2? Both? Neither? If flaky for either - what is the bug? It's definitely not simple.

Not a problem. JS is single-threaded,

My assertion is not that it's impossible to spec this properly. I argue that the result is not ergonomic and not simple.

Also, if ergonomics is the issue, why isn't it stream.focus()?

Well, if you think ergonomics would be improved by moving focus() to MediaStream, we can discuss. My motivation for not putting it there was that MediaStreams can get tracks added/removed, and then focus() becomes... un-simple.

@youennf
Collaborator

youennf commented Sep 27, 2021

@youennf How? Please be specific.

const nativeGDM =  window.navigator.mediaDevices.getDisplayMedia;
window.navigator.mediaDevices.getDisplayMedia = function getDisplayMedia(constraints) {
  const promise = nativeGDM.apply(this, arguments);
  promise.then(stream => {
    stream.newFeatureX = 3;
  });
  return promise;
}

@jan-ivar
Member

jan-ivar commented Sep 27, 2021

Only the first option produces consistent results (and raises a clear exception when used incorrectly).

Of the three listed, perhaps; but those three are a false choice. We don't need to subclass anything to throw. See below:

I would not object to adding a global attribute that reads (and potentially writes?) the default behavior ...

So you'd introduce an attribute and a method?

Contrast:

1.

navigator.mediaDevices.defaultFocusPolicy = "focus-change";
const stream = await navigator.mediaDevices.getDisplayMedia();
await someOtherPromisesThatResolvesMuchLater;
navigator.mediaDevices.focus("no-focus-change"); // throws InvalidStateError, focus already happened

vs.

2.

navigator.mediaDevices.focusPolicy = "focus-change";
const stream = await navigator.mediaDevices.getDisplayMedia();
await someOtherPromisesThatResolvesMuchLater;
navigator.mediaDevices.focusPolicy = "no-focus-change"; // Doesn't throw, focus already happened

@jan-ivar
Member

const nativeGDM =  window.navigator.mediaDevices.getDisplayMedia;
window.navigator.mediaDevices.getDisplayMedia = function getDisplayMedia(constraints) {
  const promise = nativeGDM.apply(this, arguments);
  promise.then(stream => {
    stream.newFeatureX = 3;
  });
  return promise;
}

@youennf Sorry, but each then on the same promise introduces a microtask checkpoint.
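
A minimal illustration of that ordering, with a plain Promise:

const p = Promise.resolve();
p.then(() => console.log("microtask 1: the shim's then() callback"));
(async () => {
  await p;  // the caller's continuation is a separate, later reaction
  console.log("microtask 2: the caller's code after await");
})();
// Logs "microtask 1" before "microtask 2" - the caller always runs one
// microtask after the shim, even though both are within the same task.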

@eladalon1983
Member Author

So you'd introduce an attribute and a method?

No. We're not on the same page. Maybe a quick summary of my position would help. I am only interested in a per-track method, but I am analyzing all options in my attempt to convince you that the option I favor is superior.

When I discuss inability to raise an exception, I do not mean it's because the attribute is an attribute. Rather, the problem arises because that global-attribute API is always valid to manipulate. Namely, an application could always be setting the value for the next gDM to follow.

I claim that inability to reliably raise an exception is a drawback. It leaves Web-developers in the dark when their applications behave inconsistently. It fails silently. That's a problem.

@jan-ivar
Member

I am only interested in a per-track method,

Why, when it is less ergonomic than stream.focus()?

Rather, the problem arises because that global-attribute API is always valid to manipulate.

I showed an option 1 above that would throw when used outside the permitted window, to show there's no need to subclass MST here.

From my end, the conversation has led to the options I show (1 and 2) which I think capture the remaining open question:

Would we rather offer JS a default value, or a clear exception when JS misses an (obvious) time window, or both (at the cost of double the API surface)?

@eladalon1983
Member Author

eladalon1983 commented Sep 27, 2021

I am only interested in a per-track method,

Why, when it is less ergonomic than stream.focus()?

  1. I have answered that question. Please see the bottom of my comment there.
  2. Citation needed for one being more ergonomic than the other. (One being less code is not such a significant change in ergonomics. The ergonomics I refer to when I propose MST.focus are clarity, not brevity.)
  3. I remain (cf. comment) open to discussing a per-stream method. The context in which I mentioned "only per-track" was when you asked if I was suggesting "an attribute and a method". It was not a blanket "only" for anything one could conceive of. It was a clarification that I was not at the time proposing two APIs, as you seemed to think.

I showed an option 1 above that would throw outside the same envelope, to show there's no need to subclass MST here.

From my end, the conversation has led to the options I show (1 and 2) which I think capture the remaining open question:

The original motivation I heard for using a global attribute was that it allows the application to:

  1. Probe the UA's capability before calling gDM.
  2. Set a global default that would apply when gDM is called in the future.

So:

  • I agree that (1) is achieved by the global attribute.
  • I claimed that (2) loses the ability to raise instructive exceptions.

Do you now propose a version of the global attribute that does raise exceptions? If so, that loses benefit (2). The UA cannot simultaneously, with a single API, allow the Web-developer to influence the future as well as warn them when they're too late to influence the past.

So my question is - which API are you now proposing, and what are its benefits over MST.focus?

@annevk
Member

annevk commented Oct 1, 2021

It would be more like this:

  1. Queue a task to:
    1. Call the callback.
    2. Lock focus.
    3. Resolve the promise.

(Where you place step 1.3 doesn't really matter as it'll only be observable after 1.1 and 1.2)

@eladalon1983
Member Author

I am not sure I understand your proposal. It seems like the one I classified as "before" in an earlier comment, but I could be misunderstanding.

function callback(mediaStream) {
  console.log('b');
  mediaStream.removeTrack(mediaStream.getVideoTracks()[0]);
  console.log('c');
}

console.log('a');
const mediaStream = navigator.mediaDevices.getDisplayMedia({video: true}, callback);
console.log('d');
const [track] = (await mediaStream).getVideoTracks();
console.log('e');

What's going to happen with the code above?

@annevk
Member

annevk commented Oct 1, 2021

Order would be a, d, b, c, e if I'm not mistaken. You can indeed remove tracks, but this can also happen if you hand out the promise in multiple places. That doesn't really seem problematic to me.

@eladalon1983
Member Author

eladalon1983 commented Oct 1, 2021

So you were indeed referring to the before option from this comment, which means I was right to edit away my misunderstanding in this comment. Glad it's clarified. :-)

I think giving access to the MediaStream to something that executes before the Promise resolves violates POLA. Are you aware of precedents, perhaps?

@annevk
Member

annevk commented Oct 1, 2021

Yeah, e.g., https://notifications.spec.whatwg.org/#dom-notification-requestpermission. I'm pretty sure we also have places that dispatch an event and resolve a promise in the same task. Stems from how promises work. (And to be pedantic, the promise can be resolved before, but it being resolved cannot be observed until after.)

@jan-ivar
Member

jan-ivar commented Oct 1, 2021

How about this:

const stream = await navigator.mediaDevices.getDisplayMedia();
navigator.mediaDevices.addEventListener("focus", e => decide(stream) || e.preventDefault(), {once: true});

This would be the entire API surface. No need for exceptions. I'm open to bikeshedding the event name.

@eladalon1983
Member Author

eladalon1983 commented Oct 7, 2021

How about this:

const stream = await navigator.mediaDevices.getDisplayMedia();
navigator.mediaDevices.addEventListener("focus", e => decide(stream) || e.preventDefault(), {once: true});

This would be the entire API surface. No need for exceptions. I'm open to bikeshedding the event name.

The use of preventDefault suggests that the default behavior, hitherto unspecified, would now be specified as focus-captured-surface. Is that right?

@jan-ivar
Member

jan-ivar commented Oct 7, 2021

Yes.

@annevk
Member

annevk commented Oct 8, 2021

FWIW, as I understand it, the task responsible for resolving getDisplayMedia() has traditionally also been the focus task. It seems to me that you would therefore have to expose the stream on the focus event. Delaying the focus by a task, in the hope that the caller of getDisplayMedia() holds a stream at that point, still seems rather unclean.

@eladalon1983
Member Author

I'm wondering if we could in the future run into an issue where some operating systems require a separate permission before they allow the UA to switch focus between applications. We might want to get around that by specifying the default behavior as "the user agent MUST try to focus the captured surface." Wdyt?

@youennf
Collaborator

youennf commented Oct 8, 2021

I would also prefer just one task.
My understanding of the algorithm would be something like: queue a task that runs the following steps:

  • fire a focus event (taking stream, a subset of it, stream's clone...)
  • execute focus algorithm based on focus policy computed from focus event handling.
  • resolve getDisplayMedia promise with stream

I do not see why we should delay the focus algorithm.
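
Under that single-task ordering, the app would register its listener before calling getDisplayMedia; a sketch (the event name, e.stream and wantFocus are the hypothetical shapes discussed above, wantFocus being an app-defined predicate):

navigator.mediaDevices.addEventListener("focus", e => {
  // Fired inside the queued task, before the getDisplayMedia promise resolves.
  if (!wantFocus(e.stream)) {
    e.preventDefault();
  }
}, {once: true});

const stream = await navigator.mediaDevices.getDisplayMedia();
// By this point the focus decision has already been read and applied.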

@jan-ivar
Member

jan-ivar commented Oct 8, 2021

FWIW, as I understand it, the task responsible for resolving getDisplayMedia() has traditionally also been the focus task.

I ran a test to disprove this https://jsfiddle.net/jib1/q75yb8pf/ but I was also surprised to learn that Firefox blurs the capturer window 15 ms before it resolves the promise (I thought it happened much later):

4823: blurred
4837: resolved

But if we're worried about web compat here we have a much bigger problem: Chrome's prompt steals focus.

1827: blurred
16078: resolved

We've never specified exactly when the (non-browser) window is focused before, and I'm not sure we need to here. What we're specifying here is a JS control opportunity, by requiring the browser to fire a warning ping - "hey, I plan to focus another window, you ok with that?" - which to me doesn't inherently need to match up with any user-visible behavior timing-wise.

@jan-ivar
Member

jan-ivar commented Oct 8, 2021

I would also prefer just one task. My understanding of the algorithm would be something like: queue a task that runs the following steps:

  • fire a focus event (taking stream, a subset of it, stream's clone...)

I don't see a problem with returning the original. Clones are trouble.

  • execute focus algorithm based on focus policy computed from focus event handling.

I'd call it the "focus decision algorithm", since it doesn't seem necessary to specify when exactly actual focusing takes place.

  • resolve getDisplayMedia promise with stream

Let's say I did this:

async function getDisplayMedia() {
  const stream = await navigator.mediaDevices.getDisplayMedia();
  console.log("A");
  return stream;
}

const p = getDisplayMedia();
await new Promise(r => navigator.mediaDevices.onfocus = e => r(decide(e.stream) || e.preventDefault()));
console.log("B");
const stream = await p;

Would this produce A B or B A? I'm not sure microtask order here is even standardized.

I do not see why we should delay the focus algorithm.

Tasks have cleaner semantics than messing with microtask order.

@annevk
Member

annevk commented Oct 11, 2021

It would produce B, A. As there's no JavaScript on the stack, the microtask queue is drained after event callback invocation. I think that indeed makes the order observable, and therefore it's good to use the order @youennf proposed.

@youennf
Collaborator

youennf commented Mar 8, 2022

Another approach to consider is a callback given to getDisplayMedia as an input parameter.
It does not allow setting a context-wide behavior, but it makes it very easy to understand which getDisplayMedia call the decision applies to.
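
A sketch of that shape (the callback parameter is hypothetical):

const stream = await navigator.mediaDevices.getDisplayMedia(
    {video: true},
    capturedStream => {
      // Invoked by the user agent before the promise resolves, in the same task.
      const [track] = capturedStream.getVideoTracks();
      return track.getSettings().displaySurface == 'browser'
          ? "focus-captured-surface"
          : "no-focus-change";
    });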

@happylinks

Just wanted to give my vote for having this API again.
Today yet another user of our application told us they thought they were already recording their screen after they shared their tab and it jumped to the tab they shared.
We already try to hint that they should come back to our tab through the title and favicon, like this:
[screenshot: the tab's title and favicon changed to hint the user back]
And we try to explain in our UI that they should always come back to our tab, but users still get confused sometimes.

Because we record two streams at the same time, and we need to sync those up, starting recording automatically on a screen share is not good enough. MediaRecorders take time to start, sometimes even multiple seconds. We need to let the user know exactly when the recorder started so they don't start talking before we are actually recording.
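
(For context, the synchronization looks roughly like this; MediaRecorder's onstart is standard, the surrounding names are illustrative:)

function startRecorder(stream) {
  return new Promise(resolve => {
    const recorder = new MediaRecorder(stream);
    recorder.onstart = () => resolve(recorder);  // may fire seconds after start()
    recorder.start();
  });
}

// Only tell the user we are live once both recorders have actually started.
await Promise.all([startRecorder(screenStream), startRecorder(cameraStream)]);
showRecordingIndicator();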

In our application specifically we never want the auto-focus behavior to happen and would love to be able to disable it.
