Cleanup lexical structure of numbers and identifiers #403

phadej · 2021-02-12T21:36:57Z

The proposal has been accepted; the following discussion is mostly of historic interest.

This proposal cleans up and clarifies the lexical structure of numbers and identifiers. (Contains Unicode inside.)

Rendered

kindaro · 2021-02-13T10:07:06Z

I am not sure why we should disallow decimal numbers as a notation for decimal numbers.

This is nonsense on the conceptual level, as evidenced by the sentence above.
There are places in the world where unusual numbers are usual. Example.

phadej · 2021-02-13T11:28:51Z

I am not sure why we should disallow decimal numbers as a notation for decimal numbers.

What I propose for numeric literals is what GHC already does now. I hope that is not controversial; otherwise you should file a bug report in the GHC issue tracker.

EDIT: I mention in the alternatives that other scripts can be recognized for decimal numbers, but I argue that this should happen only with UnicodeSyntax. I leave that for the committee to comment on.

kindaro · 2021-02-13T11:49:01Z

Yes, I am aware of both your points, and I understand that it may not be worth the effort for you or me to actually write a parser for unusual numbers. But there is a difference between being too lazy to do something and disallowing that thing altogether. I do not argue that we should immediately start parsing unusual numbers; I argue that we should leave the possibility open and inviting. An error message like «unusual numbers are not yet implemented» would be suitable.

phadej · 2021-02-13T12:05:00Z

This is not the case with other Unicode syntax either today

Prelude> :t (\x -> x) :: Int →  Int

<interactive>:1:18: error:
    Not in scope: type constructor or class ‘→’

Prelude> :set -XUnicodeSyntax 
Prelude> :t (\x -> x) :: Int →  Int
(\x -> x) :: Int →  Int :: Int -> Int

Improving error messages doesn't need a proposal, IMHO. As my proposal accidentally shows, lexer errors are awful. I'm not against improving errors like:

Prelude> ٦

<interactive>:4:1: error: lexical error at character '\1638'

They could be clearer, and might even invite people to vote for a feature.

EDIT: the goal of this proposal is to document the status quo and fix one clear wart (not treating letter numbers as any kind of "letter").
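
To make the lexical error above a little more concrete, here is a minimal sketch of how `Data.Char` from `base` classifies the character in question. It only illustrates the Unicode categories involved and is not GHC's lexer; the names `describeChar` and `arabicIndicSix` are made up for this illustration.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isDigit, isNumber)

-- | Hypothetical helper: report whether a character is an ASCII digit
-- (isDigit), any Unicode number (isNumber), and its general category.
describeChar :: Char -> (Bool, Bool, GeneralCategory)
describeChar c = (isDigit c, isNumber c, generalCategory c)

-- | U+0666 ARABIC-INDIC DIGIT SIX, i.e. '\1638' from the error message.
arabicIndicSix :: Char
arabicIndicSix = '\1638'
```

In GHCi, `describeChar arabicIndicSix` shows that the character is a Unicode decimal number but not an ASCII digit, which is the distinction the discussion is about.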

kindaro · 2021-02-13T12:17:35Z

I am not sure I follow. What is not the case with other Unicode syntax?

Note also that I am not giving any comment as to whether UnicodeSyntax should affect the parsing of unusual numbers. My argument is that there is no reason to specifically disallow unusual numbers.

phadej · 2021-02-13T12:32:34Z

There is. I argue that having GHC accept

evil :: Int
evil = ٦٦٦

without any extension enabled would confuse most Haskell programmers (and tools).

kindaro · 2021-02-13T12:56:24Z

This is not an argument specific to unusual numbers — it is a general argument about Unicode and UnicodeSyntax. I am not saying anything about that — it would be a whole other conversation. I believe that the behaviour of UnicodeSyntax will be decided upon in the best way possible without my participation. In the end, I do not think we have any disagreement here.

AntC2 · 2021-02-14T00:07:46Z

What's the status of underscore (_)? I think it can start identifiers, so it counts as lower-case?

But there's also the extension -XNumericUnderscores, which allows _ to appear in numbers as a separator for readability: 123_456. For that string, recent GHCs without the extension warn that this is probably what you meant; older GHCs parse it as two lexemes, the second of which is an identifier. GHC 7.10 says

Found hole `_456' with type: t0
Where: `t0' is an ambiguous type variable
Relevant bindings include ...
In the first argument of `123', namely `_456'

123_456.7890_1234 is also allowed with numeric underscores. And dubious things like 123_456.78_90_123_4

phadej · 2021-02-14T05:01:36Z

@AntC2, yes, underscore is a small character

small       →   ascSmall | uniSmall | _

Re, NumericUnderscores, that is a good question. See their proposal:

In particular, that proposal changes decimal

-decimal     →  digit{digit}
+decimal     →  digit{numSpacer digit}

However the implementation was and is in terms of decDigit = ascDigit

-decimal     →  ascDigit{ascDigit}
+decimal     →  ascDigit{numSpacer ascDigit}

This proposal is compatible. I will mention NumericUnderscores explicitly.

Does this answer your concerns?
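
As a rough illustration of the `decimal → ascDigit{numSpacer ascDigit}` rule quoted above, the following sketch recognises such literals over plain strings. It assumes that `numSpacer` stands for a run of zero or more underscores, which the thread does not spell out; `isDecimalLit` is a hypothetical name and this is not how GHC's Alex-based lexer is written.

```haskell
import Data.Char (isDigit)

-- | Sketch of: decimal → ascDigit{numSpacer ascDigit}
-- Assumption: numSpacer is zero or more '_' characters.
-- Data.Char.isDigit accepts only the ASCII digits '0'..'9'.
isDecimalLit :: String -> Bool
isDecimalLit (c : cs) | isDigit c = go cs
  where
    -- After a digit: either the input ends, or an optional run of
    -- underscores must be followed by another ASCII digit.
    go [] = True
    go s = case dropWhile (== '_') s of
      (d : rest) | isDigit d -> go rest
      _ -> False
isDecimalLit _ = False
```

Under that assumption, `isDecimalLit "123_456"` holds, while `"123_"`, `"_456"` and `"٦٦٦"` are rejected: a trailing or leading underscore never sits between two digits, and non-ASCII digits are not `ascDigit`.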

AntC2 · 2021-02-14T09:48:20Z

@phadej This proposal is compatible. I will mention NumericUnderscores explicitly.

Does this answer your concerns?

Thanks. (No "concerns", just dotting i's, crossing t's, and scoring under's.)

blamario · 2021-02-14T19:39:39Z

It's worth noting that GHC already treats Other Letter as lowercase, ever since https://gitlab.haskell.org/ghc/ghc/-/issues/3741, and this without any language pragma.

What's missing, both from GHC and from the proposal, is a way to name a type or constructor using an uncased script. I had written a very preliminary proposal for Haskell 2020, but it never went anywhere.
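
A small sketch of what this means in practice, assuming a GHC that treats Other Letter characters as lowercase (as described above). The identifier below is a CJK ideograph chosen only for illustration.

```haskell
-- An uncased (Other Letter) character used as a variable name.
-- Because it lexes as a "small" character, it can start a varid.
数 :: Int
数 = 42

-- By contrast, a declaration such as `data 型 = 型` is expected to be
-- rejected, because an uncased character cannot start a conid.
```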

phadej · 2021-02-14T20:19:45Z

Proposal text says

More precisely:

Extend the small character class to allow scripts without small/large
character distinction (see Other Letter)
uniSmall    →   any Unicode lowercase letter or Other Letter

And notes that

The two truly new changes are abandoning an idea of "decimal digit"
commented with a ToDo in GHC's Lexer.x (there would be just ASCII digits and all other number characters)
and adding of Letter Number category to the uniDigit class (Other Number is already there).

meaning that everything else is already in GHC, but undocumented.
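
The character classes quoted above can be approximated with `Data.Char.generalCategory`. This is a sketch under assumptions, not the lexer's actual definition: the names `isSmallSketch` and `isUniDigitSketch` are hypothetical, and the reading of `uniDigit` as "every non-ASCII number character" is an interpretation of the quoted note.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAscii)

-- | small → ascSmall | uniSmall | _
-- where uniSmall is any Unicode lowercase letter or Other Letter.
-- ASCII lowercase letters are LowercaseLetter too, so one check covers both.
isSmallSketch :: Char -> Bool
isSmallSketch c =
  c == '_' || generalCategory c `elem` [LowercaseLetter, OtherLetter]

-- | Assumed reading of uniDigit: non-ASCII characters in any of the
-- number categories (Decimal Number, Letter Number, Other Number).
isUniDigitSketch :: Char -> Bool
isUniDigitSketch c =
  not (isAscii c)
    && generalCategory c `elem` [DecimalNumber, LetterNumber, OtherNumber]
```

For example, `isSmallSketch '\x0E01'` (a Thai letter, Other Letter) and `isUniDigitSketch '\x2167'` (ROMAN NUMERAL EIGHT, Letter Number) both hold, while `isUniDigitSketch '7'` does not, since ASCII digits are handled separately.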

I'd welcome suggestions on how to make the proposal text clearer.

phadej · 2021-02-14T20:37:08Z

@blamario, I read through your proposal. It's much more ambitious, and it also contains a breaking change. Therefore I won't incorporate any parts of it here. I'm afraid to open Pandora's box.

blamario · 2021-02-14T22:30:01Z

That's all right, I didn't expect you to adopt it. I think that Haskell really needs to support the native scripts of approximately half of humanity, but I won't pretend my proposal solves the problem. Nowadays I think that a unified namespace as in Idris would be ideal, but I don't know how to get Haskell there.

meaning that everything else is already in GHC, but undocumented.
I'd welcome suggestions on how to make the proposal text clearer.

Perhaps say something like

the proposed small character class is already the state of affairs in GHC, even though this divergence from the standard has never been documented.

You should also point to ticket #3741.

phadej · 2021-02-14T22:34:55Z

I'm sorry, how is #3741 related?

  The problem in #3741 was that we had confused column numbers with byte
  offsets, which fails in the case of UTF-8 (amongst other things).
  Fortunately we're tracking correct column offsets now, so we didn't
  ....

Did you mean to mention some other issue?

blamario · 2021-02-14T23:18:35Z

I got to that issue by following the history of the test file T3741.hs, which is the only positive parser test file with an uncased identifier. If the feature was implemented even before that, there's no test for it. Probably no ticket either.

phadej · 2021-02-14T23:22:25Z

Probably https://gitlab.haskell.org/ghc/ghc/-/issues/1103, I'll add it to the list of issues.

phadej · 2021-02-19T23:06:21Z

@nomeata (are you still acting as secretary?) I'd like to submit this proposal to the committee.

maralorn · 2021-02-19T23:26:10Z

I think you meant @nomeata

aspiwack · 2021-02-20T13:03:18Z

As the shepherd, I'll review right after the ICFP deadline (in a little over a week).

nomeata · 2021-02-20T20:32:01Z

/remind @aspiwack that the deadline is over in two weeks :-)

reminders-prs · 2021-02-20T20:32:15Z

@nomeata set a reminder for Mar 6th 2021

aspiwack

Hello, I am back as promised. And here is a bit of review before sending the proposal to the committee.

The one discussion that I find a bit lacking, and that should probably find its way into the Alternatives section, is why the Other Letter group is considered to be small characters.

It would be natural to consider them as just idChar, since these don't care about case. Ah, I believe you say, but then there is no way to write an identifier in Thai script, and that is kind of mean.

Which is fair, but should probably figure in the Alternatives section nonetheless. But maybe more interestingly: with your proposal, I still can't write a constructor name in Thai script, and that's kind of mean too. So why do you choose to favour varid over conid for scripts without case?

proposals/0000-cleanup-lexical-structure.md

Co-authored-by: Arnaud Spiwack <arnaud@spiwack.net>

phadej · 2021-03-04T14:28:02Z

@aspiwack

It would be natural to consider them as just idChar, since these don't care about case. Ah, I believe you say, but then there is no way to write an identifier in Thai script, and that is kind of mean.

Yes. But that would be a change that breaks how GHC works now, so I don't propose it. I added a note to the Alternatives section.

proposals/0000-cleanup-lexical-structure.md

nomeata · 2021-03-05T07:45:04Z

When implemented, the difference to Haskell98 should probably be documented in this part of the user’s guide: https://ghc.gitlab.haskell.org/ghc/doc/users_guide/bugs.html

aspiwack · 2021-03-05T08:08:59Z

I forgot to get back to here, but I did recommend acceptance to the committee.

phadej · 2021-03-05T08:37:01Z

@nomeata

When implemented, the difference to Haskell98 should probably be documented in this part of the user’s guide: https://ghc.gitlab.haskell.org/ghc/doc/users_guide/bugs.html

Yes. The lack of documentation is one of the reasons for writing this proposal. (It is also why I waited quite long before writing it: the current state wasn't documented.)

Co-authored-by: Joachim Breitner <mail@joachim-breitner.de>

reminders-prs · 2021-03-06T09:33:13Z

👋 @aspiwack, the deadline is over :-)

aspiwack · 2021-03-17T09:57:40Z

Hi @phadej , I'm happy to report that this proposal has now been officially accepted by the committee. I'll leave it to @nomeata to merge the branch.

phadej added 3 commits February 12, 2021 23:36

Cleanup ltexical structure

b10b8d3

Fix renderding, add PR link

b9b968d

Mention change in uniSmall

b0d9c31

phadej changed the title ~~Cleanup ltexical structure~~ Cleanup lexical structure of numbers and identifiers Feb 12, 2021

More links, less mistakes

6e0b051

Add StackOverflow question as a curiosity

60b9385

Mention numeric underscores

a7e1b9d

Add 1103 issue to the list

29fd17e

nomeata added the Pending shepherd recommendation The shepherd needs to evaluate the proposal and make a recommendataion label Feb 20, 2021

nomeata changed the title ~~Cleanup lexical structure of numbers and identifiers~~ Cleanup lexical structure of numbers and identifiers (under review) Feb 20, 2021

nomeata requested a review from aspiwack February 20, 2021 11:46

reminders-prs bot added the reminder label Feb 20, 2021

aspiwack reviewed Mar 4, 2021

phadej and others added 2 commits March 4, 2021 16:14

Apply suggestions from code review

a56de86

Co-authored-by: Arnaud Spiwack <arnaud@spiwack.net>

mention other letter

b001f03

nomeata reviewed Mar 5, 2021

aspiwack added Pending committee review The committee needs to evaluate the proposal and make a decision and removed Pending shepherd recommendation The shepherd needs to evaluate the proposal and make a recommendataion labels Mar 5, 2021

Update proposals/0000-cleanup-lexical-structure.md

1fa3168

Co-authored-by: Joachim Breitner <mail@joachim-breitner.de>

reminders-prs bot removed the reminder label Mar 6, 2021

int-index assigned aspiwack Mar 11, 2021

aspiwack added Accepted The committee has decided to accept the proposal and removed Pending committee review The committee needs to evaluate the proposal and make a decision labels Mar 17, 2021

nomeata merged commit d337e89 into ghc-proposals:master Mar 17, 2021

nomeata changed the title ~~Cleanup lexical structure of numbers and identifiers (under review)~~ Cleanup lexical structure of numbers and identifiers Mar 17, 2021

phadej deleted the cleanup-lexical-structure branch March 21, 2021 18:05
