truncation to 64-byte upper limit doesn't mention character boundaries #973
Hmm... also I think it doesn't mean to say
[this issue is related to issue #593 and PR #951] @aphillips wrote:
I was wondering whether/when you'd bring this up. I've done some modest research on this topic of "unicode string truncation" (due to the text you cite above) and apparently it is more complex than simply performing truncation on a Unicode character boundary -- it apparently ought to properly be done on extended grapheme cluster (EGC) boundaries. I found detailed analysis here: https://hoytech.github.io/truncate-presentation/

In brief discussion a little while back, @stpeter suggested that specifying how to do proper "unicode string truncation" perhaps ought to be addressed by the Unicode Consortium. I'd imagine as an addition to TR29 "Unicode Text Segmentation" (but who knows). Charmod may want to say something about it.

Rather than properly & thoroughly spec how to do "unicode string truncation" in webauthn, perhaps we should simply state something like (in addition to the above-quoted spec text): ...though, "who" does such "EGC-aware truncation" is a question. Presently the webauthn spec says it is the authenticator who may perform such truncation, but requiring authenticators to be able to perform EGC-aware truncation will be controversial, I suspect.

Rather, perhaps the RP and/or client ought to do that? Though, they do not know the capabilities of the authenticator, i.e., what string length it can accommodate, and it seems the present truncation language in the spec is attempting to allow for authenticators that are able to handle strings longer than 64 bytes? (further discussion at end, below)
Yes, because the impetus for this length restriction is the narrow-bandwidth channel (e.g., BLE or NFC) between the webauthn client and an authenticator (illustration here), and also that these strings may be stored by the authenticator, which may have limited resources. At that level of abstraction, we're dealing in byte counts, not character counts (a single Unicode character in UTF-8 might be several bytes long -- e.g., apparently there's a Tibetan character with 8 combining marks; I dunno offhand how many bytes that'd end up being in UTF-8).
The text you're referring to is: Yeah, I think the perspective that was written from is: one MUST accommodate at least a 64-byte length for this value. I.e., some authenticators may, if presented with an 80-byte string, simply accommodate it. Alternatively, they may truncate it, though to no shorter than 64 bytes (if doing truncation on arbitrary byte boundaries, which, as we note, is not i18n-kosher). I.e., if the authenticator supports 70-byte name strings, it would ostensibly truncate an 80-byte string to 70 bytes.

I'm thinking we need to add a Note: to the spec explaining this rationale (if I have it correct, or whatever the rationale is if I do not).
@equalsJeffH Thanks for this. I think it is useful to separate some concerns here.

Counted storage limits generally are required by low-level protocols or data structures. These are generally implemented in code doing serialization/deserialization or as part of e.g. network protocols. At this level, you need to define a length limit in something -- I18N folks generally prefer characters (by which we mean Unicode code points) as the unit rather than bytes, because limits defined in characters do not disadvantage languages/scripts that use 2-, 3-, or 4-byte forms of UTF-8 the way that byte counts do. The 64-byte limit probably goes unnoticed by English speakers, while Chinese users (with an effective 21-character limit) notice it more often. Users don't understand why they can type a lot of text but only, like, 1 (complex) emoji.

If you must have a byte limit (usually due to a protocol requirement), then at a very minimum I expect to see code-point-based truncation (because creating extra U+FFFD characters is a Bad Thing), hence my comment above. I'm usually fine with low-level specs that define length-limited fields and specify character truncation.

So... regarding EGCs, yes, I agree that this is ideal. If you truncate some character sequences, you change the meaning of the user-perceived character (grapheme), such as when you remove a vowel from an Indic character sequence. So ideally truncation would be on grapheme boundaries. UTR#29 talks about this, as do some other specs. But, to be honest, I don't think a low-level implementation needs to require this or define it in amazing detail. It's a health warning.

An example of a W3C spec that deals with the higher-level problem is CSS3-Text (at the given location). Note that even there they allow for vagaries with notes like:
So:
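To make the U+FFFD concern above concrete, here is a small illustration (mine, not from the thread) of what happens when a UTF-8 byte string is cut at an arbitrary byte boundary. The sample string is arbitrary, chosen only because each character occupies 3 bytes in UTF-8:

```python
# A 3-character, 9-byte name: each CJK character is 3 bytes in UTF-8.
name = "日本語"
raw = name.encode("utf-8")
assert len(raw) == 9

# Cutting at byte 4 splits the second character mid-sequence.
truncated = raw[:4]

# Lenient decoding replaces the dangling lead byte with U+FFFD:
print(truncated.decode("utf-8", errors="replace"))  # -> 日�

# Strict decoding rejects the corrupt buffer outright:
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```

Either outcome is bad for a stored display name: the user sees a replacement character, or a strict consumer rejects the value entirely.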
Asking constrained hardware authenticators to include full UTF-8 parsing logic is not really feasible - that's a lot of really complicated logic which, if history is any indication, also comes with a lot of security vulnerabilities. The proper way to solve this would be, as @equalsJeffH alludes to, to provide a way for the client to query the authenticator for a maximum size in bytes, so that the client can do the appropriate truncation (respecting character boundaries) before sending the data to the authenticator. It doesn't look like CTAP currently provides that, though, so I think we're stuck with the current (admittedly brittle) approach for the Level 1 spec.

What we could do to prevent truncation issues, without needing changes to CTAP, is to specify that clients MUST NOT allow input that would result in byte strings longer than 64 bytes. But I think that would have to wait until Level 2, since it would be a breaking normative change.
@emlun Parsing UTF-8 character boundaries is dead simple; there is no complicated logic involved in that. Computing extended grapheme clusters is hard, which is why I don't recommend it.
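As a sketch of how simple code-point-boundary truncation can be (my illustration, not code from the thread): every UTF-8 continuation byte has the bit pattern 0b10xxxxxx, so a truncator only has to back up past continuation bytes until it reaches a boundary:

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Truncate UTF-8 bytes to at most `limit` bytes without splitting
    a code point. Continuation bytes match 0b10xxxxxx (i.e. the top
    two bits are 10), so if the cut lands on one, back up to the
    nearest boundary. Assumes `data` is valid UTF-8 to begin with."""
    if len(data) <= limit:
        return data
    end = limit
    # Back up while the byte at the cut point is a continuation byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

raw = "日本語".encode("utf-8")   # 9 bytes, 3 bytes per character
print(truncate_utf8(raw, 4))     # backs up to a boundary: b'\xe6\x97\xa5'
```

The loop body is a single masked comparison per backed-up byte, which is the kind of logic even a constrained authenticator could carry.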
@aphillips This seems to be beyond the scope of the current version; moving to the next level.
@nadalin I don't see how you can say that it is "beyond scope" since you already have text implementing a somewhat complex and arbitrary set of limits. Truncating on character (code point) boundaries is not hard. The current text should be fixed, because what it specifies will lead to broken data -- and in the next version you'll claim backward compatibility :-)
Strongly second @aphillips here. For what you want, characters are the natural unit, and determining character boundaries is not going to challenge embedded processors. Determining character boundaries is also something that's absolutely stable, unlike extended grapheme clusters. Their design premise is strongly biased towards user interfaces, so, for example, if you were to present a truncated name, you might want to add an ellipsis at the last EGC boundary inside the 64-character (or longer) window to which the name has been truncated. But that would be done at the client level.

I am wondering about the fact that the usage seems to be in the context of authentication. Is the (truncated) username used for authentication purposes? If so, does it make a difference if different authenticators implement different limits?
To add to this: 64 bytes will give you 16 characters in the worst case, as UTF-8 characters can be up to 4 bytes long. For Indic writing systems most characters would use 3 bytes, so you would get 21 code points. Assume that the average syllable takes 3 characters to write, and you get 7 syllables (or grapheme clusters) as the minimal guaranteed length of your name. That's not generous, if you expect personal names to be used as the basis for these names.
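The arithmetic above can be checked directly. The sample characters below are my arbitrary choices, picked only for their UTF-8 byte widths:

```python
# How many code points a 64-byte limit guarantees, per UTF-8 width.
# The sample characters are arbitrary representatives of each width.
samples = {
    "Latin 'a' (1 byte/char)": "a",
    "Devanagari 'क' (3 bytes/char)": "क",
    "CJK '本' (3 bytes/char)": "本",
    "Emoji '😀' (4 bytes/char)": "😀",
}

for label, ch in samples.items():
    per_char = len(ch.encode("utf-8"))
    print(f"{label}: {64 // per_char} code points fit in 64 bytes")
```

This reproduces the figures in the comment: 64 code points for pure ASCII, 21 for 3-byte scripts, and 16 in the 4-byte worst case.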
On 7/9/2018 1:05 AM, Emil Lundberg wrote:

> Is the username (truncated) used for authentication purposes?

Mostly no - the `user.name` (https://www.w3.org/TR/webauthn/#dom-publickeycredentialentity-name) and `user.displayName` (https://www.w3.org/TR/webauthn/#dom-publickeycredentialuserentity-displayname) fields are used only by the authenticator, to display to the user when picking a credential to use (which happens in only a subset of the use cases), and are never returned to the RP after the credential is created. The `user.id` (https://www.w3.org/TR/webauthn/#dom-publickeycredentialuserentity-id) *is* returned to the RP and used as an identifier for authentication, but unlike the other two it's defined as an opaque byte array and not a text type.

If `name` / `displayName` are truncated, then truncation on a *character* boundary makes sense - a client could further truncate at an EGC boundary before placing an ellipsis.

One issue with truncating like this is that it's not clear to the user agent that a string has been truncated; how would that be handled?
It looks like it won't be handled for the L1 version of the spec. For L2 I think we should collaborate with CTAP to come up with a suitable way for the authenticator to signal to the client what it's able to store, so that the client can do the proper input validation.
Authenticators don't need to deal with Unicode character boundaries and should do only byte-length checks. Platforms can probably do the Unicode checks before sending the string to the authenticator. As a general principle for the next level, IMO, we should be very careful about increasing complexity on the authenticator side.
fyi/fwiw, here's "counting [unicode] characters in utf-8 strings", where they note (at end):
..and provide several code examples.
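One such counting idiom, sketched here from memory rather than copied from the linked article: code points in valid UTF-8 can be counted by skipping continuation bytes, without decoding anything:

```python
def utf8_length(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting the bytes that are
    NOT continuation bytes. A continuation byte matches 0b10xxxxxx,
    so every other byte begins exactly one code point."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

print(utf8_length("héllo".encode("utf-8")))  # 5 code points in 6 bytes
```

Like the boundary check, this is a single masked comparison per byte, and it shows why counting (or truncating on) code points is cheap even where full Unicode processing is not.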
Truncating UTF-8 strings on "random" byte boundaries creates buffers containing corrupt data. Truncating character strings on (some) boundaries that are not EGC boundaries may create data that cannot be properly rendered (and/or may look like a different character string).

As to the former: the goal of any specification should be a robust architecture that avoids the creation, or worse, interchange of corrupt data. It looks like the L1 level fails that, despite the fact that enforcing validity on UTF-8 strings is not a complex task.

From the user-interaction side, the goal should be to prevent display of "broken" or misleading data. That includes feedback on whether a string has been truncated. Admittedly, getting that part correct drags in something like extended grapheme clusters. But there is also the need to communicate whether truncation has taken place.
As a browser, we're not going to trust the data coming from an authenticator. So even if the spec says that the authenticator must handle UTF-8 correctly and truncate only whole code points, we're still going to UTF-8 validate the data and handle abrupt truncation. So the authenticator might as well not bother. On the other hand, knowing that the string was truncated would be useful.
@agl, AIUI, you are saying that if the RP script invokes display of these strings, the browser will "UTF-8 validate the data and handle abrupt truncation" before displaying the strings? Is this just Chrome's standard handling of displayed strings? What do you mean by "handle abrupt truncation"? If a UTF-8-encoded string containing multi-byte encoded chars was truncated at an arbitrary byte boundary, does the validation process catch that? What is the behavior if the string is "corrupted" from a UTF-8 correctness perspective? thx.
I chatted with @agl about this recently. Given that we merged PR #951, we already have appropriate entities enforcing PRECIS on the name-ish strings.

A thing to note about the "authnrs MAY truncate strings to at least 64 bytes" statement in the spec is that authnrs CAN support/handle/return (UTF-8) strings longer than 64 bytes; it is authnr-specific. Thus the webauthn client cannot really be given the responsibility for truncating these strings; it needs to be left up to the authnrs.

Given that strictly byte-level string truncation can mangle UTF-8 strings, see #973 (comment), any truncation really SHOULD be done on at least the code point level, and if possible, on the EGC (extended grapheme cluster) level. The latter is what @asmusf related, the former @aphillips in #973 (comment). As @agl implies in #973 (comment), and clarified in our chat, Chrome will reject CBOR-encoded objects containing "text" strings that are not UTF-8 valid (which is detectable). So if an

Below is a two-part proposal for what to add to the spec to address this issue. If they are both nominally acceptable, then perhaps we select one based on whether we are willing to add additional normative language or not:

OLD: Authenticators MUST accept and store a 64-byte minimum length for a
Authenticators MUST accept and store at least a 64-byte length for a

Note: Truncation of a UTF-8 encoded string at an arbitrary byte boundary, or even in some cases on an arbitrary code point boundary, may result in a string that cannot be properly rendered, or may look like a different character string if rendered. Truncation on code point boundaries is preferred over arbitrary byte boundaries. Truncation on EGC boundaries is the safest approach.
Authenticators MUST accept and store at least a 64-byte length for a

Note: Authenticators should perform any UTF-8 encoded string truncation on a code point boundary, and may perform such truncation on an extended grapheme cluster (EGC) boundary [[!UAX29]]. Truncated strings should include an indication of truncation, such as appending an ellipsis. Truncation of a UTF-8 encoded string at an arbitrary byte boundary, or even in some cases on an arbitrary code point boundary, may result in a string that cannot be properly rendered, or may look like a different character string if rendered. Truncation on code point boundaries is preferred over arbitrary byte boundaries. Truncation on EGC boundaries is the safest approach.
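To illustrate why the Note above still prefers EGC boundaries over plain code point boundaries (my example, not part of the proposal): even a cut at a perfectly valid code point boundary can drop a combining mark and silently change what the user sees:

```python
# "café" spelled with a combining acute accent: 5 code points,
# but only 4 user-perceived characters (graphemes).
word = "cafe\u0301"            # 'e' + U+0301 COMBINING ACUTE ACCENT

truncated = word[:4]           # a clean code point boundary...
print(truncated)               # ...yet it now renders as "cafe"

# The bytes are valid UTF-8 either way; only the meaning changed.
truncated.encode("utf-8").decode("utf-8")  # no error raised
```

No decoder will flag this: the truncated string is well-formed. Only grapheme-aware truncation (UAX #29 segmentation) keeps the base character and its combining marks together.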
If you go with normative language, I would prefer this:
... in place of SHOULD/MAY as in the proposal. I used non-normative language about EGC because SHOULD is too strong a recommendation; MAY might be appropriate instead. Note that a mid-code-point-truncated string makes file formats such as JSON invalid (unless a transfer encoding such as base64 is applied to the name -- which I think is beside the point??)
+1 to the suggestion from @aphillips
@agl to write implementation guidance PR |
https://w3c.github.io/webauthn/#dictionary-pkcredentialentity

When referring to the `name` member, the spec says:

Note that the specification does not require truncation on a Unicode character boundary. Arbitrary truncation at a 64-byte limit in a multibyte encoding such as UTF-8 can corrupt the last character in the string. The spec should require that truncation occur on a character boundary. (Is there a reason you didn't use character count instead of byte count in the first place?)