Cleanup lexical structure of numbers and identifiers #403

Merged
merged 10 commits into from Mar 17, 2021

Conversation

phadej
Contributor

@phadej phadej commented Feb 12, 2021

The proposal has been accepted; the following discussion is mostly of historic interest.


This proposal cleans up and clarifies the lexical structure of numbers and identifiers. (Contains Unicode inside.)

Rendered

@phadej phadej changed the title Cleanup ltexical structure Cleanup lexical structure of numbers and identifiers Feb 12, 2021
@kindaro
Contributor

kindaro commented Feb 13, 2021

I am not sure why we should disallow decimal numbers as a notation for decimal numbers.

  • This is nonsense on the conceptual level, as evidenced by the sentence above.
  • There are places in the world where unusual numbers are usual. Example.

@phadej
Contributor Author

phadej commented Feb 13, 2021

I am not sure why we should disallow decimal numbers as a notation for decimal numbers.

What I propose for numeric literals is what GHC already does now. I hope that is not controversial; otherwise you should file a bug report on the GHC issue tracker.

EDIT: In the alternatives I mention that digits from other scripts could be recognized in decimal numbers, but I argue that this should happen only with UnicodeSyntax. I leave that for the committee to comment on.

@kindaro
Contributor

kindaro commented Feb 13, 2021

Yes, I am aware of both your points, and I understand that it may not be worth the effort for you or me to actually write a parser for unusual numbers. But there is a difference between being lazy to do something and disallowing that thing altogether. I do not argue that we should immediately start parsing unusual numbers — I argue that we should leave the possibility open and inviting. An error message like «unusual numbers are not yet implemented» would be suitable.

@phadej
Contributor Author

phadej commented Feb 13, 2021

This is not the case with other Unicode syntax either today

Prelude> :t (\x -> x) :: Int →  Int

<interactive>:1:18: error:
    Not in scope: type constructor or class ‘→’
Prelude> :set -XUnicodeSyntax 
Prelude> :t (\x -> x) :: Int →  Int
(\x -> x) :: Int →  Int :: Int -> Int
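
The same behaviour can be reproduced in a source file. This is only a minimal sketch: the name `identity` is made up for illustration, and the file is assumed to be saved as UTF-8. Without the pragma, GHC should reject the Unicode arrow in the same way as the GHCi session above.

```haskell
{-# LANGUAGE UnicodeSyntax #-}

-- Hypothetical example module; the file is assumed to be UTF-8 encoded.
-- With UnicodeSyntax enabled, ∷ and → are accepted in place of :: and ->.
identity ∷ Int → Int
identity x = x
```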

Improving error messages doesn't need a proposal, IMHO. As my proposal accidentally shows, lexer errors are awful. I'm not against improving errors like:

Prelude> ٦

<interactive>:4:1: error: lexical error at character '\1638'

They could be clearer, and could invite people to vote for the feature.


EDIT: the goal of this proposal is to document the status quo and fix one clear wart (not treating letter numbers as any kind of "letter").
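
For context, the character '\1638' in that error is U+0666, ARABIC-INDIC DIGIT SIX. The sketch below uses `Data.Char` from base to show the distinction the discussion turns on: the character is a Unicode decimal number, but it is not an ASCII digit. The names `arabicSix`, `isAscDigit` and `isUnicodeDecimal` are made up for illustration and do not come from GHC's lexer.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isDigit)

-- | ARABIC-INDIC DIGIT SIX (U+0666), the character from the lexical error.
arabicSix :: Char
arabicSix = '\1638'

-- | True only for the ASCII digits '0'..'9', which is what
-- Data.Char.isDigit selects.
isAscDigit :: Char -> Bool
isAscDigit = isDigit

-- | True for any character in the Unicode category Decimal Number (Nd).
isUnicodeDecimal :: Char -> Bool
isUnicodeDecimal c = generalCategory c == DecimalNumber
```

Loaded into GHCi, `isAscDigit arabicSix` evaluates to `False` while `isUnicodeDecimal arabicSix` evaluates to `True`.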

@kindaro
Contributor

kindaro commented Feb 13, 2021

I am not sure I follow. What is not the case with other Unicode syntax? 

Note also that I am not giving any comment as to whether UnicodeSyntax should affect the parsing of unusual numbers. My argument is that there is no reason to specifically disallow unusual numbers.

@phadej
Contributor Author

phadej commented Feb 13, 2021

There is. I argue that having GHC accept

evil :: Int
evil = ٦٦٦

without any extension enabled would confuse most Haskell programmers (and tools).

@kindaro
Contributor

kindaro commented Feb 13, 2021

This is not an argument specific to unusual numbers — it is a general argument about Unicode and UnicodeSyntax. I am not saying anything about that — it would be a whole other conversation. I believe that the behaviour of UnicodeSyntax will be decided upon in the best way possible without my participation. In the end, I do not think we have any disagreement here.

@AntC2
Contributor

AntC2 commented Feb 14, 2021

What's the status of underscore (_)? I think it can start identifiers, so it counts as lower-case?

But there's also the extension -XNumericUnderscores, which allows _ to appear in numbers as a separator for readability: 123_456. For that string without the extension, recent GHCs warn that this is probably what you meant; older GHCs parse it as two lexemes, of which the second is an identifier. GHC 7.10 says

Found hole `_456' with type: t0
Where: `t0' is an ambiguous type variable
Relevant bindings include ...
In the first argument of `123', namely `_456'

123_456.7890_1234 is also allowed with numeric underscores. And dubious things like 123_456.78_90_123_4
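
A small sketch of the literals mentioned above, assuming a GHC recent enough to support NumericUnderscores. The binding names `grouped` and `groupedFraction` are made up for illustration.

```haskell
{-# LANGUAGE NumericUnderscores #-}

-- | An integer literal with a separator; it denotes the same value as 123456.
grouped :: Int
grouped = 123_456

-- | Underscores may also appear in the fractional part.
groupedFraction :: Double
groupedFraction = 123_456.7890_1234
```

The underscores only affect how the literal is written, so `grouped == 123456` holds.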

@phadej
Contributor Author

phadej commented Feb 14, 2021

@AntC2, yes, underscore is a small character

small       →   ascSmall | uniSmall | _

Re NumericUnderscores: that is a good question. See their proposal:

In particular, that proposal changes decimal

-decimal     →  digit{digit}
+decimal     →  digit{numSpacer digit}

However the implementation was and is in terms of decDigit = ascDigit

-decimal     →  ascDigit{ascDigit}
+decimal     →  ascDigit{numSpacer ascDigit}

This proposal is compatible. I will mention NumericUnderscores explicitly.

Does this answer your concerns?
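
The production `decimal → ascDigit{numSpacer ascDigit}` can be sketched as a small recognizer. This is an illustration, not GHC's lexer: the function name `isDecimal` is made up, and it assumes `numSpacer` stands for zero or more underscores, as in the NumericUnderscores grammar.

```haskell
import Data.Char (isDigit)

-- | Recognise  decimal → ascDigit{numSpacer ascDigit},
-- assuming  numSpacer → {_}  (zero or more underscores).
-- Data.Char.isDigit accepts only ASCII digits, matching ascDigit.
isDecimal :: String -> Bool
isDecimal []       = False
isDecimal (c : cs) = isDigit c && afterDigit cs
  where
    -- After a digit: end of input, another digit, or a run of underscores.
    afterDigit [] = True
    afterDigit (x : xs)
      | isDigit x = afterDigit xs
      | x == '_'  = inSpacer xs
      | otherwise = False

    -- Inside a run of underscores: a digit must eventually follow.
    inSpacer [] = False
    inSpacer (x : xs)
      | x == '_'  = inSpacer xs
      | isDigit x = afterDigit xs
      | otherwise = False
```

Under these assumptions `isDecimal "123_456"` is `True`, while `isDecimal "_456"` and `isDecimal "123_"` are `False`, because a spacer may only appear between two digits.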

@AntC2
Contributor

AntC2 commented Feb 14, 2021

@phadej This proposal is compatible. I will mention NumericUnderscores explicitly.

Does this answer your concerns?

Thanks. (No "concerns", just dotting i's, crossing t's, and scoring under's.)

@blamario
Contributor

It's worth noting that GHC already treats Other Letter as lowercase, ever since https://gitlab.haskell.org/ghc/ghc/-/issues/3741, and this without any language pragma.

What's missing, both from GHC and from the proposal, is a way to name a type or constructor using an uncased script. I had written a very preliminary proposal for Haskell 2020, but it never went anywhere.

@phadej
Contributor Author

phadej commented Feb 14, 2021

The proposal text says

More precisely:

Extend the small character class to allow scripts without small/large
character distinction (see Other Letter)

uniSmall    →   any Unicode lowercase letter or Other Letter

And notes that

The two truly new changes are abandoning the idea of a "decimal digit",
commented with a ToDo in GHC's Lexer.x (there would be just ASCII digits and all other number characters),
and adding the Letter Number category to the uniDigit class (Other Number is already there).

meaning that everything else is already in GHC, but undocumented.
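
The character classes described in this comment can be approximated with `generalCategory` from `Data.Char`. This is a sketch of one reading of the quoted text, not the proposal's grammar or GHC's Lexer.x. The `CharClass` type and `classify` function are made up for illustration, the large class follows the Haskell 2010 report (uppercase or titlecase letters), and placing non-ASCII Decimal Number characters in the digit class is an assumption based on the "all other number characters" remark.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Hypothetical coarse character classes, for illustration only.
data CharClass = Small | Large | Digit | Other
  deriving (Eq, Show)

-- | Classify a character along the lines quoted above.
classify :: Char -> CharClass
classify '_' = Small                 -- underscore counts as a small character
classify c = case generalCategory c of
  LowercaseLetter -> Small
  OtherLetter     -> Small           -- scripts without a case distinction
  UppercaseLetter -> Large
  TitlecaseLetter -> Large
  DecimalNumber   -> Digit           -- assumption: includes non-ASCII digits
  LetterNumber    -> Digit           -- the category newly added by the proposal
  OtherNumber     -> Digit
  _               -> Other
```

Under this reading, a Thai letter such as '\3585' (U+0E01) classifies as `Small`, so it can start a variable name but not a constructor name, and a Roman numeral character such as '\8547' (U+2163) classifies as `Digit`.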

I'd welcome suggestions on how to make the proposal text clearer.

@phadej
Contributor Author

phadej commented Feb 14, 2021

@blamario, I read through your proposal. It's much more ambitious, and it also contains a breaking change. Therefore I won't incorporate any parts of it here. I'm afraid to open Pandora's box.

@blamario
Contributor

That's all right, I didn't expect you to adopt it. I think that Haskell really needs to support the native scripts of approximately half of humanity, but I won't pretend my proposal solves the problem. Nowadays I think that a unified namespace as in Idris would be ideal, but I don't know how to get Haskell there.

meaning that everything else is already in GHC, but undocumented.
I'd welcome suggestions on how to make the proposal text clearer.

Perhaps say something like

the proposed small character class is already the state of affairs in GHC, even though this divergence from the standard has never been documented.

You should also point to ticket #3741.

@phadej
Contributor Author

phadej commented Feb 14, 2021

I'm sorry, how is #3741 related?

  The problem in #3741 was that we had confused column numbers with byte
  offsets, which fails in the case of UTF-8 (amongst other things).
  Fortunately we're tracking correct column offsets now, so we didn't
  ....

Did you mean to mention some other issue?

@blamario
Contributor

I got to that issue by following the history of the test file T3741.hs, which is the only positive parser test file with an uncased identifier. If the feature was implemented even before that, there's no test for it. Probably no ticket either.

@phadej
Contributor Author

phadej commented Feb 14, 2021

Probably https://gitlab.haskell.org/ghc/ghc/-/issues/1103, I'll add it to the list of issues.

@phadej
Contributor Author

phadej commented Feb 19, 2021

@nomeata (are you still acting as secretary?) I'd like to submit this proposal to the committee.

@maralorn
Contributor

I think you meant @nomeata

@nomeata nomeata added the Pending shepherd recommendation The shepherd needs to evaluate the proposal and make a recommendataion label Feb 20, 2021
@nomeata nomeata changed the title Cleanup lexical structure of numbers and identifiers Cleanup lexical structure of numbers and identifiers (under review) Feb 20, 2021
@aspiwack
Contributor

As the shepherd, I'll review right after the ICFP deadline (in a little over a week).

@nomeata
Contributor

nomeata commented Feb 20, 2021

/remind @aspiwack that the deadline is over in two weeks :-)

@reminders-prs

reminders-prs bot commented Feb 20, 2021

@nomeata set a reminder for Mar 6th 2021

Contributor

@aspiwack aspiwack left a comment


Hello, I am back as promised. And here is a bit of review before sending the proposal to the committee.

The one discussion that I find a bit lacking, and that should probably find its way into the Alternatives section, is why the Other Letter group is treated as small characters.

It would be natural to consider them as just idChar, since these don't care about case. Ah, I believe you say, but then there is no way to write an identifier in Thai script, and that is kind of mean.

Which is fair, but should probably figure in the Alternatives section nonetheless. But maybe more interestingly: with your proposal, I still can't write a constructor name in Thai script, and that's kind of mean too. So why do you choose to favour varid over conid for scripts without case?

phadej and others added 2 commits March 4, 2021 16:14
Co-authored-by: Arnaud Spiwack <arnaud@spiwack.net>
@phadej
Contributor Author

phadej commented Mar 4, 2021

@aspiwack

It would be natural to consider them as just idChar, since these don't care about case. Ah, I believe you say, but then there is no way to write an identifier in Thai script, and that is kind of mean.

Yes. But that would be a change that breaks how GHC works now, so I don't propose it. I added a note to the Alternatives section.

@nomeata
Contributor

nomeata commented Mar 5, 2021

When implemented, the difference to Haskell98 should probably be documented in this part of the user’s guide: https://ghc.gitlab.haskell.org/ghc/doc/users_guide/bugs.html

@aspiwack aspiwack added Pending committee review The committee needs to evaluate the proposal and make a decision and removed Pending shepherd recommendation The shepherd needs to evaluate the proposal and make a recommendataion labels Mar 5, 2021
@aspiwack
Contributor

aspiwack commented Mar 5, 2021

I forgot to get back to here, but I did recommend acceptance to the committee.

@phadej
Contributor Author

phadej commented Mar 5, 2021

@nomeata

When implemented, the difference to Haskell98 should probably be documented in this part of the user’s guide: https://ghc.gitlab.haskell.org/ghc/doc/users_guide/bugs.html

Yes. The lack of documentation is one of the reasons for writing this proposal. (And for waiting quite long before writing it, as the current state wasn't documented.)

Co-authored-by: Joachim Breitner <mail@joachim-breitner.de>
@reminders-prs reminders-prs bot removed the reminder label Mar 6, 2021
@reminders-prs

reminders-prs bot commented Mar 6, 2021

👋 @aspiwack, the deadline is over :-)

@aspiwack
Contributor

Hi @phadej , I'm happy to report that this proposal has now been officially accepted by the committee. I'll leave it to @nomeata to merge the branch.

@aspiwack aspiwack added Accepted The committee has decided to accept the proposal and removed Pending committee review The committee needs to evaluate the proposal and make a decision labels Mar 17, 2021
@nomeata nomeata merged commit d337e89 into ghc-proposals:master Mar 17, 2021
@nomeata nomeata changed the title Cleanup lexical structure of numbers and identifiers (under review) Cleanup lexical structure of numbers and identifiers Mar 17, 2021
@phadej phadej deleted the cleanup-lexical-structure branch March 21, 2021 18:05
Labels
Accepted The committee has decided to accept the proposal

7 participants