truncation to 64-byte upper limit doesn't mention character boundaries #973
Hmm... also I think it doesn't mean to say
[this issue is related to issue #593 and PR #951] @aphillips wrote:
I was wondering whether/when you'd bring this up. I've done some modest research on this topic of "unicode string truncation" (due to the text you cite above) and apparently it is more complex than simply performing truncation on a Unicode character boundary -- it apparently ought to properly be done on extended grapheme cluster (EGC) boundaries. I found detailed analysis here: https://hoytech.github.io/truncate-presentation/

In brief discussion a little while back, @stpeter suggested that specifying how to do proper "unicode string truncation" perhaps ought to be addressed by the Unicode Consortium. I'd imagine as an addition to TR29 "Unicode Text Segmentation" (but who knows). Charmod may want to say something about it.

Rather than properly & thoroughly spec how to do "unicode string truncation" in webauthn, perhaps we should simply state something like (in addition to the above-quoted spec text): ...though, "who" does such "EGC-aware truncation" is a question. Presently the webauthn spec says it is the authenticator who may perform such truncation, but requiring authenticators to be able to perform EGC-aware truncation will be controversial, I suspect.

Rather, perhaps the RP and/or client ought to do that? Though, they do not know the capabilities of the authenticator, i.e., what string length it can accommodate, and it seems the present truncation language in the spec is attempting to allow for authenticators that are able to handle strings longer than 64 bytes? (further discussion at end, below)
Yes, because the impetus for this length restriction is the narrow-bandwidth channel (e.g., BLE or NFC) between the webauthn client and an authenticator (illustration here), and also that these strings may be stored by the authenticator, which may have limited resources. At that level of abstraction, we're dealing in byte counts, not character counts (a single Unicode character in UTF-8 might be several bytes long -- e.g., apparently there's a Tibetan character with 8 combining marks; I dunno offhand how many bytes that'd end up being in UTF-8).
The text you're referring to is: Yeah, I think the perspective that was written from is: one MUST accommodate at least a 64-byte length for this value. I.e., some authenticators may, if presented with an 80-byte string, simply accommodate it. Alternatively, they may truncate it, though to no shorter than 64 bytes (if doing truncation on arbitrary byte boundaries, which, as we note, is not i18n-kosher). I.e., if the authenticator supports 70-byte name strings, it would ostensibly truncate an 80-byte string to 70 bytes.

I'm thinking we need to add a Note: to the spec explaining this rationale (if I have it correct, or whatever the rationale is if I do not).
@equalsJeffH Thanks for this. I think it is useful to separate some concerns here.

Counted storage limits generally are required by low-level protocols or data structures. These are generally implemented in code doing serialization/deserialization or as part of e.g. network protocols. At this level, you need to define a length limit in something -- I18N folks generally prefer characters (by which we mean Unicode code points) as the unit rather than bytes, because limits defined in characters do not disadvantage languages/scripts that use 2-, 3-, or 4-byte forms of UTF-8 the way that byte counts do. The 64-byte limit probably goes unnoticed by English speakers, while Chinese users (with an effective 21-character limit) notice it more often. Users don't understand why they can type a lot of text but only, like, 1 (complex) emoji.

If you must have a byte limit (usually due to a protocol requirement), then at a very minimum I expect to see code-point-based truncation (because creating extra U+FFFD characters is a Bad Thing), hence my comment above. I'm usually fine with low-level specs that define length-limited fields and specify character truncation.

So... regarding EGCs, yes, I agree that this is ideal. If you truncate some character sequences, you change the meaning of the user-perceived character (grapheme), such as when you remove a vowel from an Indic character sequence. So ideally truncation would be on grapheme boundaries. UTR#29 talks about this, as do some other specs. But, to be honest, I don't think a low-level implementation needs to require this or define it in amazing detail. It's a health warning.

An example of a W3C spec that deals with the higher-level problem is CSS3-Text (at the given location). Note that even there they allow for vagaries with notes like:
So:
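To make the U+FFFD concern above concrete, here is a small illustration (mine, not from the thread) of what happens when a UTF-8 byte string is cut at an arbitrary byte boundary. The sample string is arbitrary, chosen only because each character occupies 3 bytes in UTF-8:

```python
# A 3-character, 9-byte name: each CJK character is 3 bytes in UTF-8.
name = "日本語"
raw = name.encode("utf-8")
assert len(raw) == 9

# Cutting at byte 4 splits the second character mid-sequence.
truncated = raw[:4]

# Lenient decoding replaces the dangling lead byte with U+FFFD:
print(truncated.decode("utf-8", errors="replace"))  # -> 日�

# Strict decoding rejects the corrupt buffer outright:
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```

Either outcome is bad for a stored display name: the user sees a replacement character, or a strict consumer rejects the value entirely.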
Asking constrained hardware authenticators to include full UTF-8 parsing logic is not really feasible - that's a lot of really complicated logic which, if history is any indication, also comes with a lot of security vulnerabilities. The proper way to solve this would be, as @equalsJeffH alludes to, to provide a way for the client to query the authenticator for a maximum size in bytes, so that the client can do the appropriate truncation (respecting character boundaries) before sending the data to the authenticator. It doesn't look like CTAP currently provides that, though, so I think we're stuck with the current (admittedly brittle) approach for the Level 1 spec.

What we could do to prevent truncation issues, without needing changes to CTAP, is to specify that clients MUST NOT allow input that would result in byte strings longer than 64 bytes. But I think that would have to wait until Level 2, since it would be a breaking normative change.
@emlun Parsing UTF-8 character boundaries is dead simple; there is no complicated logic involved in that. Computing extended grapheme clusters is hard, which is why I don't recommend it.
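As a sketch of how simple code-point-boundary truncation can be (my illustration, not code from the thread): every UTF-8 continuation byte has the bit pattern 0b10xxxxxx, so a truncator only has to back up past continuation bytes until it reaches a boundary:

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Truncate UTF-8 bytes to at most `limit` bytes without splitting
    a code point. Continuation bytes match 0b10xxxxxx (i.e. the top
    two bits are 10), so if the cut lands on one, back up to the
    nearest boundary. Assumes `data` is valid UTF-8 to begin with."""
    if len(data) <= limit:
        return data
    end = limit
    # Back up while the byte at the cut point is a continuation byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

raw = "日本語".encode("utf-8")   # 9 bytes, 3 bytes per character
print(truncate_utf8(raw, 4))     # backs up to a boundary: b'\xe6\x97\xa5'
```

The loop body is a single masked comparison per backed-up byte, which is the kind of logic even a constrained authenticator could carry.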
@aphillips This seems to be beyond the scope of the current version; moving to the next level.
@nadalin I don't see how you can say that it is "beyond scope" since you already have text implementing a somewhat complex and arbitrary set of limits. Truncating on character (code point) boundaries is not hard. The current text should be fixed, because what it specifies will lead to broken data -- and in the next version you'll claim backward compatibility :-)
Strongly second @aphillips here. For what you want, characters are the natural unit, and determining character boundaries is not going to challenge embedded processors. Determining character boundaries is also something that's absolutely stable, unlike extended grapheme clusters. Their design premise is strongly biased towards user interfaces, so, for example, if you were to present a truncated name, you might want to add an ellipsis at the last EGC boundary inside the 64-character (or longer) window to which the name has been truncated. But that would be done at the client level.

I am wondering about the fact that the usage seems to be in the context of authentication. Is the (truncated) username used for authentication purposes? If so, does it make a difference if different authenticators implement different limits?
To add to this: 64 bytes will give you 16 characters in the worst case, as UTF-8 characters can be up to 4 bytes long. For Indic writing systems most characters would use 3 bytes, so you would get 21 code points. Assume that the average syllable takes 3 characters to write, and you get 7 syllables (or grapheme clusters) as the minimal guaranteed length of your name. That's not generous, if you expect personal names to be used as the basis for these names.
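The arithmetic above can be checked directly. The sample characters below are my arbitrary choices, picked only for their UTF-8 byte widths:

```python
# How many code points a 64-byte limit guarantees, per UTF-8 width.
# The sample characters are arbitrary representatives of each width.
samples = {
    "Latin 'a' (1 byte/char)": "a",
    "Devanagari 'क' (3 bytes/char)": "क",
    "CJK '本' (3 bytes/char)": "本",
    "Emoji '😀' (4 bytes/char)": "😀",
}

for label, ch in samples.items():
    per_char = len(ch.encode("utf-8"))
    print(f"{label}: {64 // per_char} code points fit in 64 bytes")
```

This reproduces the figures in the comment: 64 code points for pure ASCII, 21 for 3-byte scripts, and 16 in the 4-byte worst case.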
On 7/9/2018 1:05 AM, Emil Lundberg wrote:

> Is the username (truncated) used for authentication purposes?

Mostly no - the `user.name` (https://www.w3.org/TR/webauthn/#dom-publickeycredentialentity-name) and `user.displayName` (https://www.w3.org/TR/webauthn/#dom-publickeycredentialuserentity-displayname) fields are used only by the authenticator, to display to the user when picking a credential to use (which happens in only a subset of the use cases), and are never returned to the RP after the credential is created. The `user.id` (https://www.w3.org/TR/webauthn/#dom-publickeycredentialuserentity-id) *is* returned to the RP and used as an identifier for authentication, but unlike the other two it's defined as an opaque byte array and not a text type.

If `name` / `displayName` are truncated, then truncation on a *character* boundary makes sense - a client could further truncate at an EGC boundary before placing an ellipsis.

One issue with truncating like this is that it's not clear to the user agent that a string has been truncated; how would that be handled?
It looks like it won't be handled for the L1 version of the spec. For L2 I think we should collaborate with CTAP to come up with a suitable way for the authenticator to signal to the client what it's able to store, so that the client can do the proper input validation.
Authenticators don't need to deal with Unicode character boundaries and should do only byte-length checks. Platforms can probably do the Unicode checks before sending the string to the authenticator. As a general principle for the next level, IMO, we should be very careful about increasing complexity on the authenticator side.
fyi/fwiw, here's "counting [unicode] characters in utf-8 strings", where they note (at end):
..and provide several code examples.
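One such counting idiom, sketched here from memory rather than copied from the linked article: code points in valid UTF-8 can be counted by skipping continuation bytes, without decoding anything:

```python
def utf8_length(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting the bytes that are
    NOT continuation bytes. A continuation byte matches 0b10xxxxxx,
    so every other byte begins exactly one code point."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

print(utf8_length("héllo".encode("utf-8")))  # 5 code points in 6 bytes
```

Like the boundary check, this is a single masked comparison per byte, and it shows why counting (or truncating on) code points is cheap even where full Unicode processing is not.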
Truncating UTF-8 strings on "random" byte boundaries creates buffers containing corrupt data. Truncating character strings on (some) boundaries that are not EGC boundaries may create data that cannot be properly rendered (and/or may look like a different character string).

As to the former: the goal of any specification should be a robust architecture that avoids the creation, or worse, interchange of corrupt data. It looks like the L1 level fails that, despite the fact that enforcing validity on UTF-8 strings is not a complex task.

From the user-interaction side, the goal should be to prevent display of "broken" or misleading data. That includes feedback on whether a string has been truncated. Admittedly, getting that part correct drags in something like extended grapheme clusters. But there is also the need to communicate whether truncation has taken place.
As a browser, we're not going to trust the data coming from an authenticator. So even if the spec says that the authenticator must handle UTF-8 correctly and truncate only whole code points, we're still going to UTF-8 validate the data and handle abrupt truncation. So the authenticator might as well not bother. On the other hand, knowing that the string was truncated would be useful.
@agl, AIUI, you are saying that if the RP script invokes display of these strings, the browser will "UTF-8 validate the data and handle abrupt truncation" before displaying the strings? Is this just Chrome's standard handling of displayed strings? What do you mean by "handle abrupt truncation"? If a UTF-8-encoded string containing multi-byte encoded chars was truncated at an arbitrary byte boundary, does the validation process catch that? What is the behavior if the string is "corrupted" from a UTF-8 correctness perspective? thx.
I chatted with @agl about this recently. Given that we merged PR #951, we already have appropriate entities enforcing PRECIS on the name-ish strings.

A thing to note about the "authnrs MAY truncate strings to at least 64 bytes" statement in the spec is that authnrs CAN support/handle/return (UTF-8) strings longer than 64 bytes; it is authnr-specific. Thus the webauthn client cannot really be given the responsibility for truncating these strings; it needs to be left up to the authnrs.

Given that strictly byte-level string truncation can mangle UTF-8 strings, see #973 (comment), any truncation really SHOULD be done on at least the code point level, and if possible, on the EGC (extended grapheme cluster) level. The latter is what @asmusf related, the former @aphillips in #973 (comment). As @agl implies in #973 (comment), and clarified in our chat, Chrome will reject CBOR-encoded objects containing "text" strings that are not UTF-8 valid (which is detectable). So if an

Below is a two-part proposal for what to add to the spec to address this issue. If they are both nominally acceptable, then perhaps we select one based on whether we are willing to add additional normative language or not:

OLD: Authenticators MUST accept and store a 64-byte minimum length for a
Authenticators MUST accept and store at least a 64-byte length for a

Note: Truncation of a UTF-8 encoded string at an arbitrary byte boundary, or even in some cases on an arbitrary code point boundary, may result in a string that cannot be properly rendered, or may look like a different character string if rendered. Truncation on code point boundaries is preferred over arbitrary byte boundaries. Truncation on EGC boundaries is the safest approach.
Authenticators MUST accept and store at least a 64-byte length for a

Note: Authenticators should perform any UTF-8 encoded string truncation on a code point boundary, and may perform such truncation on an extended grapheme cluster (EGC) boundary [[!UAX29]]. Truncated strings should include an indication of truncation, such as appending an ellipsis. Truncation of a UTF-8 encoded string at an arbitrary byte boundary, or even in some cases on an arbitrary code point boundary, may result in a string that cannot be properly rendered, or may look like a different character string if rendered. Truncation on code point boundaries is preferred over arbitrary byte boundaries. Truncation on EGC boundaries is the safest approach.
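To illustrate why the Note above still prefers EGC boundaries over plain code point boundaries (my example, not part of the proposal): even a cut at a perfectly valid code point boundary can drop a combining mark and silently change what the user sees:

```python
# "café" spelled with a combining acute accent: 5 code points,
# but only 4 user-perceived characters (graphemes).
word = "cafe\u0301"            # 'e' + U+0301 COMBINING ACUTE ACCENT

truncated = word[:4]           # a clean code point boundary...
print(truncated)               # ...yet it now renders as "cafe"

# The bytes are valid UTF-8 either way; only the meaning changed.
truncated.encode("utf-8").decode("utf-8")  # no error raised
```

No decoder will flag this: the truncated string is well-formed. Only grapheme-aware truncation (UAX #29 segmentation) keeps the base character and its combining marks together.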
If you go with normative language, I would prefer this:
... in place of SHOULD/MAY as in the proposal. I used non-normative language about EGC because SHOULD is too strong a recommendation; MAY might be appropriate instead. Note that a mid-code-point-truncated string makes file formats such as JSON invalid (unless a transfer encoding such as base64 is applied to the name -- which I think is beside the point??)
+1 to the suggestion from @aphillips
@agl to write implementation guidance PR |
https://w3c.github.io/webauthn/#dictionary-pkcredentialentity

When referring to the `name` member, the spec says:

Note that the specification does not require truncation on a Unicode character boundary. Arbitrary truncation at a 64-byte limit in a multibyte encoding such as UTF-8 can corrupt the last character in the string. The spec should require that truncation occur on a character boundary. (Is there a reason you didn't use character count instead of byte count in the first place?)