
truncation to 64-byte upper limit doesn't mention character boundaries #973

Closed
aphillips opened this issue Jun 27, 2018 · 21 comments · Fixed by #1205
Labels
i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. stat:pr-open type:editorial type:technical

@aphillips

https://w3c.github.io/webauthn/#dictionary-pkcredentialentity

When referring to the name the spec says:

Authenticators MUST accept and store a 64-byte minimum length for a name member’s value. Authenticators MAY truncate a name member’s value to a length equal to or greater than 64 bytes.

Note that the specification does not require truncation on a Unicode character boundary. Arbitrary truncation at a 64-byte limit on a multibyte encoding such as UTF-8 can corrupt the last character in the string. The spec should require that the truncation occur on a character boundary (is there a reason you didn't use character count instead of byte count in the first place?)

@aphillips
Author

aphillips commented Jun 27, 2018

Hmm... also I think it doesn't mean to say 64-byte *minimum* length. I suspect it means to say "maximum" there. P.S. Please add the i18n-comment label.

@equalsJeffH equalsJeffH added i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. type:technical labels Jun 27, 2018
@equalsJeffH
Contributor

equalsJeffH commented Jun 27, 2018

[this issue is related to issue #593 and PR #951]

@aphillips wrote:

Note that the specification does not require truncation on a Unicode character boundary

I was wondering whether/when you'd bring this up.

I've done some modest research on this topic of "unicode string truncation" (due to the text you cite above) and apparently it is more complex than simply performing truncation on a Unicode character boundary -- it apparently ought to properly be done on extended grapheme cluster (EGC) boundaries.

I found detailed analysis here: https://hoytech.github.io/truncate-presentation/
..and a library: https://github.com/hoytech/Unicode-Truncate, but nothing regarding "unicode string truncation" in IETF, W3C, or Unicode specs :-/

In brief discussion a little while back, @stpeter suggested that specifying how to do proper "unicode string truncation" perhaps ought to be addressed by the unicode consortium. I'd imagine as an addition to TR29 "Unicode Text Segmentation" (but who knows). Charmod may want to say something about it.

Rather than properly & thoroughly spec how to do "unicode string truncation" in webauthn, perhaps we should simply state something like (in addition to the above-quoted spec text):
"Such truncation SHOULD be performed on extended grapheme cluster boundaries [[!UAX29]]."

..though, "who" does such "EGC-aware truncation" is a question. Presently the webauthn spec says it is the authenticator who may perform such truncation, but requiring authenticators to be able to perform EGC-aware truncation will be controversial I suspect. Rather, perhaps the RP and/or Client ought to do that? Though, they do not know the capabilities of the authenticator, i.e., what string length it can accommodate, and it seems the present truncation language in the spec is attempting to allow for authenticators that are able to handle strings longer than 64 bytes? (further discussion at end, below)

is there a reason you didn't use character count instead of byte count in the first place?)

Yes, because the impetus for this length restriction is in the context of a narrow-bandwidth channel (e.g., BLE or NFC) between the webauthn client and an authenticator (illustration here), and also that these strings may be stored by the authenticator, which may have limited resources. At that level of abstraction, we're dealing in byte counts, not char counts (which for Unicode in UTF-8 might be several bytes long -- e.g., apparently there's a Tibetan character with 8 combining marks (I dunno offhand how many bytes in UTF-8 encoding that'd end up being)).
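To make the byte-vs-character gap concrete, here is a small illustration (Python; the sample strings are my own, not from the thread):

```python
# UTF-8 byte cost per "character" varies widely. The ZWJ emoji family is a
# single user-perceived character (one EGC) built from seven code points.
samples = {
    "ASCII name": "Andrew",
    "CJK name": "王小明",
    "ZWJ emoji family": "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
}
for label, s in samples.items():
    print(f"{label}: {len(s)} code points, {len(s.encode('utf-8'))} UTF-8 bytes")
```

The emoji family alone consumes 25 of the 64 bytes, which is why a byte budget and a character budget feel very different to users of different scripts.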

also I think it doesn't mean to say 64-byte minimum length. I suspect it means to say "maximum" there

the text you're referring to is:
"Authenticators MUST accept and store a 64-byte minimum length for a name member’s value."

Yeah, I think the perspective that was written from is: one MUST accommodate at least a 64-byte length for this value. I.e., some authenticators may, if presented with an 80-byte string, simply accommodate it. Alternatively, truncate it no shorter than 64 bytes (if doing truncation on arbitrary byte boundaries, which as we note, is not i18n-kosher). I.e., if the authenticator supports 70-byte name strings, it would ostensibly truncate an 80-byte string to 70 bytes.

I'm thinking we need to add a Note: to the spec explaining this rationale (if I have it correct, or whatever the rationale is if I do not).

@aphillips
Author

@equalsJeffH Thanks for this. I think it is useful to separate some concerns here.

Counted storage limits generally are required by low-level protocols or data structures. These are generally implemented in code doing serialization/deserialization or as part of e.g. network protocols. At this level, you need to define a length limit in something--I18N folks generally prefer characters (by which we mean Unicode code points) as the length limit and not bytes, because limits defined in characters do not disadvantage languages/scripts that use 2-, 3-, or 4-byte forms of UTF-8 the way that byte counts do. The 64-byte limit probably goes unnoticed by English speakers, while Chinese users (with an effectively 21-character limit) notice it more often. Users don't understand why they can type a lot of text but only about one (complex) emoji.

If you must have a byte limit boundary (usually due to a protocol requirement), then at a very minimum I expect to see code point based truncation (because creating extra U+FFFD characters is a Bad Thing), hence my comment above. I'm usually fine with low-level specs that define length limited fields and specify character truncation.
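The U+FFFD corruption described here is easy to demonstrate (Python sketch; the sample string is illustrative):

```python
raw = "日本語".encode("utf-8")   # 9 bytes, 3 per code point
broken = raw[:4]                 # a naive 4-byte cut lands mid-character
# The dangling lead byte of 本 becomes U+FFFD REPLACEMENT CHARACTER.
print(broken.decode("utf-8", errors="replace"))
```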

So... regarding EGCs, yes, I agree that this is ideal. If you truncate some character sequences, you change the meaning of the user-perceived character (grapheme), such as when you remove a vowel from an Indic character sequence. So ideally truncation would be on grapheme boundaries. UTR#29 talks about this as do some other specs. But, to be honest, I don't think a low level implementation needs to require this or define it in amazing detail. It's a health warning. An example of a W3C spec that deals with the higher level problem is CSS3-Text (at the given location). Note that even there they allow for vagaries with notes like:

Authors are forewarned that dividing grapheme clusters by element boundaries may give inconsistent or undesired results.

So:

  • I would prefer you define the minimum limit in Unicode code points ("characters")
  • I would prefer you require (MUST) truncation only on character boundaries, regardless of whether you are counting code units ("bytes") or code points ("characters")
  • I would like it if you encouraged truncation only on grapheme boundaries

@emlun
Member

emlun commented Jun 28, 2018

Asking constrained hardware authenticators to include full UTF-8 parsing logic is not really feasible - that's a lot of really complicated logic which, if history is any indication, also comes with a lot of security vulnerabilities. The proper way to solve this would be, as @equalsJeffH alludes to, to provide a way for the client to query the authenticator for a maximum size in bytes, so that the client can do the appropriate truncations (respecting character boundaries) before sending the data to the authenticator. It doesn't look like CTAP currently provides that, though, so I think we're stuck with the current (admittedly brittle) approach for the Level 1 spec.

What we could do to prevent truncation issues, without needing changes to CTAP, is to specify that clients MUST NOT allow input that would result in byte strings longer than 64 bytes. But I think that would have to wait until Level 2, since it would be a breaking normative change.

@aphillips
Author

@emlun Parsing UTF-8 character boundaries is dead simple. There is no complicated logic involved in that. Computing extended grapheme clusters is hard, which is why I don't recommend it.
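For illustration, code-point-boundary truncation really is only a few lines (Python sketch; `truncate_utf8` is a hypothetical helper, not part of any spec):

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Truncate UTF-8 bytes to at most `limit` bytes on a code point boundary."""
    if len(data) <= limit:
        return data
    end = limit
    # Back up past continuation bytes (0b10xxxxxx) so we never cut
    # a multi-byte sequence in the middle.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]
```

The same byte-mask test works in C on a constrained device; no decoding tables or Unicode property data are needed.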

@nadalin nadalin added this to the L2-WD-00 milestone Jul 5, 2018
@nadalin
Contributor

nadalin commented Jul 5, 2018

@aphillips This seems to be beyond scope for current version, moving to next level

@aphillips
Author

aphillips commented Jul 6, 2018

@nadalin I don't see how you can say that it is "beyond scope" since you already have text implementing a somewhat complex and arbitrary set of limits.

Truncating on character (code point) boundaries is not hard. The current text should be fixed because what it specifies will lead to broken data---and in the next version you'll claim backward compatibility :-)

@asmusf

asmusf commented Jul 7, 2018

Strongly second @aphillips here.

For what you want, characters are the natural unit, and determining character boundaries is not going to challenge embedded processors.

Determining character boundaries is also something that's absolutely stable, unlike extended grapheme clusters. Their design premise is strongly biased towards user interfaces, so, for example, if you were to present a truncated name, you might want to add an ellipsis at the last EGC boundary inside the 64-character (or longer) window to which the name has been truncated. But that would be done at the client level.

I am wondering about the fact that the usage seems to be in the context of authentication. Is the username (truncated) used for authentication purposes? If so, does it make a difference if different authenticators implement different limits?

@asmusf

asmusf commented Jul 7, 2018

To add to this: 64 bytes will give you 16 characters in the worst case, as UTF-8 characters can be up to 4 bytes long. For Indic writing systems most characters would use 3 bytes, so you would get 21 code points. Assume that the average syllable takes 3 characters to write, and you get 7 syllables (or grapheme clusters) as the minimal guaranteed length of your name. That's not generous, if you expect personal names to be used as the basis for these names.
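The arithmetic here can be checked directly (Python; the 3-code-points-per-syllable figure is the comment's own assumption):

```python
BUDGET = 64                      # byte limit from the spec text
worst_case_chars = BUDGET // 4   # 4-byte UTF-8 code points (e.g., many emoji)
indic_chars = BUDGET // 3        # 3-byte code points typical of Indic scripts
syllables = indic_chars // 3     # assumed ~3 code points per written syllable
print(worst_case_chars, indic_chars, syllables)
```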

@emlun
Member

emlun commented Jul 9, 2018

Is the username (truncated) used for authentication purposes?

Mostly no - the user.name and user.displayName fields are used only by the authenticator to display to the user when picking a credential to use (which happens in only a subset of the use cases), and never returned to the RP after the credential is created. The user.id is returned to the RP and used as an identifier for authentication, but unlike the other two it's defined as an opaque byte array and not a text type.

@asmusf

asmusf commented Jul 9, 2018 via email

@emlun
Member

emlun commented Jul 10, 2018

One issue with truncating like this is that it's not clear to the user agent that a string has been truncated; how would that be handled?

It looks like it won't be handled for the L1 version of the spec. For L2 I think we should collaborate with CTAP to come up with a suitable way for the authenticator to signal to the client what it's able to store, so that the client can do the proper input validation.

@akshayku
Contributor

Authenticators don't need to deal with Unicode character boundaries and should do only byte length checks. Platforms can probably do the Unicode checks before sending the data to the authenticator. As a general principle for the next level, IMO, we should be very careful about increasing complexity on the authenticator side.

@equalsJeffH
Contributor

fyi/fwiw, here's "counting [unicode] characters in utf-8 strings", where they note (at end):

  • the penalty for counting UTF-8 characters, or indexing into or iterating over the characters of a UTF-8 string, is very small.

  • People probably shouldn't worry about the efficiency of counting and iterating over characters in UTF-8 strings, at least not if they were using null-terminated C strings before.

..and they provide several code examples.
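In the same spirit, counting the code points in a UTF-8 byte string needs no decoding at all (Python sketch; the helper name is illustrative):

```python
def count_utf8_codepoints(data: bytes) -> int:
    # Each code point contributes exactly one byte that is NOT a
    # continuation byte (continuation bytes match 0b10xxxxxx).
    return sum((b & 0xC0) != 0x80 for b in data)
```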

@asmusf

asmusf commented Jul 12, 2018

Truncating UTF-8 strings on "random" byte boundaries creates buffers containing corrupt data.

Truncating character strings on (some) boundaries that are not EGC boundaries may create data that cannot be properly rendered (and or may look like a different character string).

As to the former, the goal of any specification should be a robust architecture that avoids the creation, or worse, interchange of corrupt data. It looks like the L1 level fails that, despite the fact that enforcing validity on UTF-8 strings is not a complex task.

From a user interaction, the goal should be to prevent display of "broken" or misleading data. That includes feedback on whether a string has been truncated.

Admittedly, getting that part correct drags in something like extended grapheme clusters. But also the need to communicate whether truncation has taken place.

@agl
Contributor

agl commented Jul 12, 2018

As a browser, we're not going to trust the data coming from an authenticator. So even if the spec says that the authenticator must handle UTF-8 correctly and truncate only whole code points, we're still going to UTF-8 validate the data and handle abrupt truncation. So the authenticator might as well not bother.

On the other hand, knowing that the string was truncated would be useful.

@equalsJeffH
Contributor

@agl, AIUI, are you saying that if the RP script invokes display of these strings, the browser will "UTF-8 validate the data and handle abrupt truncation", before displaying the strings?

this is just chrome's standard handling of displayed strings ?

what do you mean by "handle abrupt truncation"?

if a utf-8-encoded string, containing multi-byte encoded chars, was truncated at an arbitrary byte boundary, does the validation process catch that?

what is the behavior if the string is "corrupted" from a UTF-8 correctness perspective?

thx.

@equalsJeffH
Contributor

I chatted with @agl about this recently.

Given that we merged PR #951, we already have appropriate entities enforcing PRECIS on the name-ish strings.

A thing to note about the "authnrs MAY truncate strings to at least 64 bytes" statement in the spec is that authnrs CAN support/handle/return (UTF-8) strings longer than 64 bytes; it is authnr-specific. Thus the webauthn client cannot really be given the responsibility for truncating these strings; it needs to be left up to the authnrs.

Given that strictly byte-level string truncation can mangle UTF-8 strings, see #973 (comment), any truncation really SHOULD be done on at least the code point level, and if possible, on the EGC (extended grapheme cluster) level. The latter is what @asmusf related, the former @aphillips in #973 (comment).

As @agl implies in #973 (comment), and clarified in our chat, Chrome will reject CBOR-encoded objects containing "text" strings that are not UTF-8 valid (which is detectable). So if an authenticator truncates a string at an arbitrary byte boundary and yields invalid UTF-8, the client will reject it.

Below is a two-part proposal for what to add to the spec to address this issue. If they are both nominally acceptable, then perhaps we select one based on whether we are willing to add additional normative language or not:

OLD:

Authenticators MUST accept and store a 64-byte minimum length for a name member’s value. Authenticators MAY truncate a name member’s value to a length equal to or greater than 64 bytes.

  1. NEW, if we are willing to add additional normative language:

Authenticators MUST accept and store at least a 64-byte length for a name member’s value. Authenticators MAY truncate a name member’s value to a length equal to or greater than 64 bytes. Authenticators SHOULD perform any UTF-8 encoded string truncation on a code point boundary, and MAY perform such a truncation on an extended grapheme cluster (EGC) boundary [[!UAX29]]. Truncated strings SHOULD include an indication of truncation, such as appending an ellipsis.

Note: Truncation of a UTF-8 encoded string at an arbitrary byte boundary, or even in some cases on an arbitrary code point boundary, may result in a string that cannot be properly rendered, or may look like a different character string if rendered. Truncation on code point boundaries is preferred over arbitrary byte boundaries. Truncation on EGC boundaries is the safest approach.

  2. NEW, no new normative language:

Authenticators MUST accept and store at least a 64-byte length for a name member’s value. Authenticators MAY truncate a name member’s value to a length equal to or greater than 64 bytes.

Note: Authenticators should perform any UTF-8 encoded string truncation on a code point boundary, and may perform such a truncation on an extended grapheme cluster (EGC) boundary [[!UAX29]]. Truncated strings should include an indication of truncation, such as appending an ellipsis. Truncation of a UTF-8 encoded string at an arbitrary byte boundary, or even in some cases on an arbitrary code point boundary, may result in a string that cannot be properly rendered, or may look like a different character string if rendered. Truncation on code point boundaries is preferred over arbitrary byte boundaries. Truncation on EGC boundaries is the safest approach.
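As a minimal sketch of the code-point-boundary behavior both proposals describe (Python; `truncate_name` and the hard-coded 64-byte default are illustrative, not spec text):

```python
def truncate_name(name: str, limit: int = 64) -> str:
    """Fit `name` into `limit` UTF-8 bytes, cutting only on a code
    point boundary (not necessarily an EGC boundary)."""
    encoded = name.encode("utf-8")
    if len(encoded) <= limit:
        return name
    # errors="ignore" silently drops the trailing partial sequence,
    # so the cut always lands on a whole code point.
    return encoded[:limit].decode("utf-8", errors="ignore")
```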

@aphillips
Author

If you go with normative language, I would prefer this:

Authenticators MUST perform any UTF-8 encoded string truncation on a code point boundary, and are encouraged to perform such truncation on an extended grapheme cluster (EGC) boundary [[!UAX29]].

... in place of SHOULD/MAY as in the proposal. I used non-normative language about EGC because SHOULD is too strong a recommendation. MAY might be appropriate instead.

Note that a mid-code-point truncated string makes file formats such as JSON invalid (unless a transfer encoding such as base64 is applied to the name--which I think is beside the point??)

@stpeter

stpeter commented Jul 18, 2018

+1 to the suggestion from @aphillips

@nadalin
Contributor

nadalin commented Apr 24, 2019

@agl to write implementation guidance PR

agl added a commit to agl/webauthn that referenced this issue Apr 27, 2019
@agl agl closed this as completed in #1205 May 15, 2019

8 participants