
Conversation

@laenas
Contributor

@laenas laenas commented Dec 1, 2020

  • Went with a three-case union for the setting, since we want a default case and decomposing a two-case option was awkward.
  • Changed the name of the option since there's a little bit more going on with it now
  • Stacked it as a unique flag rather than an additive flag to simplify the CompilerOption
  • I am slightly less confident in the first-pass approach to the new handling in the parser:

We have a disjointed Venn diagram: three states, and two relevant outcomes of those states - two of the states share a tokenizer, and a different two share the consumption function. It seemed we would end up duplicating either the match deconstruction or the consumption; I chose the latter.
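A minimal sketch of the shape being described, with hypothetical names rather than the PR's actual types:

```fsharp
// A sketch only - the case names here are hypothetical, not the PR's code.
type TokenizeOption =
    | Off        // default: tokens flow on to the parser as usual
    | Filtered   // dump post-LexFilter tokens, then stop
    | Unfiltered // dump raw lexer tokens, then stop

// Off and Filtered share the filtered tokenizer, while Filtered and
// Unfiltered share the dump loop, so one match arm always repeats
// either the deconstruction or the consumption.
let run option (rawTokenizer: unit -> string) (filteredTokenizer: unit -> string) =
    let dump tokenizer = printfn "tokenize - got %s" (tokenizer ())
    match option with
    | Off -> ()                          // normal parse path, elided here
    | Filtered -> dump filteredTokenizer
    | Unfiltered -> dump rawTokenizer
```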

Also added a note about something that threw me for a brief moment at first glance: how the flag's while loop exits, and to what effect. A guidepost for the future.

@cartermp cartermp requested a review from dsyme December 5, 2020 02:07
@laenas laenas force-pushed the unfiltered-tokenize branch from 4c7f088 to 4ff375e December 18, 2020 22:00
@laenas
Contributor Author

laenas commented Dec 18, 2020

Rebased due to awkward habits - but since there was a semi-direct merge conflict, I think it mostly works out better to have a single, cleaner commit adding this behavior. The only change from the initial PR commit is that I've now added the Tokenizer alias, since it gets used in that pair of functions owing to the preceding change to LexFilter - and it seems clearer to do this than to use the raw function type signatures there. It requires wrapping the unfiltered lexer in a lambda, but that seems reasonable in this circumstance.
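For illustration, the alias-plus-lambda arrangement could look something like this - the names and signatures below are stand-ins, not the PR's actual Tokenizer definition:

```fsharp
// Hypothetical stand-ins; the compiler's real token and buffer types differ.
type Token = string
type Tokenizer = unit -> Token

// A raw lexing function takes a buffer on each call, so satisfying the
// Tokenizer alias means closing over the buffer with a lambda:
let wrapUnfiltered (lexer: 'buf -> Token) (lexbuf: 'buf) : Tokenizer =
    fun () -> lexer lexbuf
```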

Contributor

@KevinRansom KevinRansom left a comment


It would probably be nice to add a test case.
Miserably, these types of tests are still in our very worst test harness and haven't been ported to our newer, less (but still quite) horrific test harness.

For now the test belongs here:

The tests are really quite yukky!!

The command line goes in an env.lst file, like the following; this example validates the output of fsc /?:

ReqENU	SOURCE=dummy.fsx COMPILE_ONLY=1  PRECMD="\$FSC_PIPE >help40.txt /? 2>&1"     POSTCMD="\$FSI_PIPE --nologo --quiet --exec ..\\..\\..\\comparer.fsx help40.txt help40.437.1033.bsl"	# /?

The PRECMD here executes fsc and gathers the output.

The POSTCMD compares the output against a baseline (.bsl) file.

To verify the tests still succeed, use:

build -c release -testfsharpqa
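For illustration, an entry for this flag could follow the same pattern - the source file, output file, baseline name, and exact spelling of the new switch below are all assumptions, not existing artifacts:

ReqENU	SOURCE=dummy.fsx COMPILE_ONLY=1  PRECMD="\$FSC_PIPE >tokens.txt --tokenize:unfiltered dummy.fsx 2>&1"     POSTCMD="\$FSI_PIPE --nologo --quiet --exec ..\\..\\..\\comparer.fsx tokens.txt tokens.bsl"	# --tokenize:unfiltered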

@laenas
Contributor Author

laenas commented Dec 23, 2020

I'll hack it in and get a commit up over the holidays, thanks for the feedback!

@KevinRansom
Contributor

Other than that, the code looks good, mate. Thanks for this.

@laenas
Contributor Author

laenas commented Dec 28, 2020

Other than that, the code looks good, mate. Thanks for this.

Thanks!

I have two things to follow up on, before I've got a commit ready:

  1. I'm not sure how best to tackle actually doing a side-by-side comparison against a known-good baseline. The major issue is that when fsc operates with tokenize, it prints a fully qualified file path along with the range of the token - so the output of fsc is immediately very machine-specific (and possibly run-specific, in CI server settings). It feels very wrong to change how fsc prints this just to make tests work; updating the current comparer.fsx to have special cases for these tests also feels a little heavy-handed. So I'm left wondering whether the route we'd want is to add another special-case comparison script just for these tests that handles sanitizing the output - my concern with that approach being that we lose 'known good' output to some degree, as we'd likely need to trim lines that look something like tokenize - got IDENT @ C:\Users\ryan\Source\fsharp\artifacts\bin\fsc\Debug\net472\dummy.fsx(3,0)-(3,4) down to tokenize - got IDENT (3,0)-(3,4). Not difficult, but it suddenly adds the fragility of a special-case script (a sketch of that sanitizing step follows this list).
  2. Along with that, I wonder somewhat about the completeness we'd be looking for in these tests. Is it desirable just to have a quick smoke test - 'these two options produce specific known outputs, distinct from each other' - against a minimal dummy file? Or do we want a comprehensive sample script that covers every token?
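As flagged in point 1, here is a minimal sketch of the kind of sanitizing such a special-case script might do - the function and regex are hypothetical illustrations, not anything that exists in comparer.fsx:

```fsharp
open System.Text.RegularExpressions

// Hypothetical sanitizer: strip the machine-specific path between the "@"
// and the range, turning
//   tokenize - got IDENT @ C:\...\dummy.fsx(3,0)-(3,4)
// into
//   tokenize - got IDENT (3,0)-(3,4)
let sanitize (line: string) =
    Regex.Replace(line, @" @ .*?(\(\d+,\d+\)-\(\d+,\d+\))\s*$", " $1")
```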

@laenas
Contributor Author

laenas commented Mar 26, 2021

@dsyme - Context around this is starting to blur really badly in my memory. I summon thee to determine whether we should:

  1. Expand this out to tweak the tokenization output to use relative rather than absolute paths (which allows us to write clean tests)
  2. Write some extra (untested) logic into the tests to handle matching of paths
  3. Ignore both for a feature that's going to get used once a year 😅

@Happypig375
Member

Bump @dsyme @KevinRansom

@KevinRansom
Contributor

@laenas Can we please have some tests added?

Thanks

Kevin

@laenas
Contributor Author

laenas commented Jun 28, 2021

@KevinRansom
See the questions above from the initial request. Have decisions been made on the problems with adding such tests?

@vzarytovskii
Member

@KevinRansom @laenas yeah, we will need to add some sort of framework for comparing the lexer tokens (with paths and, possibly, ranges erased).
I can look into it. If it's low effort, I can quickly make it work; if not, I think we can merge this as-is, since not much new functionality was added here.

Also, I think I messed up a bit with reverts; I'm gonna take a look at it.

@laenas
Contributor Author

laenas commented Jun 29, 2021

@vzarytovskii I'm happy to pitch in on bolting a framework on for that output comparison; I'm just trying to think through the edge cases of parsing the string, since file paths can look so wildly similar to the other tokens we'd be looking for. Rather than trying to 'skip over' what looks like a file path, I suppose we could just read from both ends: up to the first @ from the front, and backwards to the start of the range from the end?
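A sketch of that read-from-both-ends idea, assuming every dumped line has the shape shown earlier (tokenize - got NAME @ path(a,b)-(c,d)); the helper below is hypothetical:

```fsharp
// Hypothetical helper: recover the token text and the range without ever
// parsing the path between them.
let splitLine (line: string) =
    // Front: everything up to the first " @ " gives "tokenize - got IDENT".
    let front =
        match line.IndexOf " @ " with
        | -1 -> line
        | i -> line.Substring(0, i)
    // Back: the range is the last two parenthesised pairs, so walk backwards
    // to the second-to-last '(' and take everything from there.
    let lastOpen = line.LastIndexOf '('
    let rangeOpen = if lastOpen > 0 then line.LastIndexOf('(', lastOpen - 1) else -1
    let range = if rangeOpen >= 0 then line.Substring rangeOpen else ""
    front, range
```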

@vzarytovskii
Member

Yeah, that's a good question... in the IL comparer/verifier, we just try to skip the paths and normalize certain things. Not sure whether the same will work for the tokenizer.

@laenas
Contributor Author

laenas commented Jun 29, 2021

At a glance it should; I'm just thinking through how to minimize the parsing complexity, since so much overlap exists between allowed file paths and 'known' tokens in the output string.

But second to that, what do we want these sorts of tests to even... test? All tokens? Only tokens that get filtered? I'm a little fuzzy and don't have the source up right now, but I don't think even normal --tokenize has tests around it to model this after.

@vzarytovskii
Member

vzarytovskii commented Jun 29, 2021

At a glance it should; I'm just thinking through how to minimize the parsing complexity, since so much overlap exists between allowed file paths and 'known' tokens in the output string.

But second to that, what do we want these sorts of tests to even... test? All tokens? Only tokens that get filtered? I'm a little fuzzy and don't have the source up right now, but I don't think even normal --tokenize has tests around it to model this after.

Yeah, I think the only tokenizer tests are the service ones (TokenizerTests), which are slightly different.

We pretty much only want to test that the command-line switch works and outputs something that looks like tokens (both normal and unfiltered). I don't think we want to test tokenizer correctness here, since you didn't change anything about the tokenizer itself.

I don't think we should even check ranges in such a test.
I would argue that running fsc --tokenize on a very simple program and checking certain things in the output should be enough. @KevinRansom may disagree though :)

@vzarytovskii
Member

vzarytovskii commented Jun 29, 2021

@laenas I've added a few basic tests which check the tokens for a given source. Do they look right to you?
I've removed paths and ranges, so these tokens should be a 1:1 match to the source.

@vzarytovskii
Member

CI is having a nap. There are some issues with AzDo currently.

@laenas
Contributor Author

laenas commented Jun 29, 2021

@vzarytovskii at a glance that all looks correct from the diffs. Could you elaborate for me - when you say 'removed paths and ranges', is that because the test harness can now use the <Expects> element to effectively do the equivalent of a Contains, just checking for that known-good string?

@vzarytovskii
Member

vzarytovskii commented Jun 29, 2021

@laenas

@vzarytovskii at a glance that all looks correct from the diffs. Could you elaborate for me - when you say 'removed paths and ranges', is that because the test harness can now use the <Expects> element to effectively do the equivalent of a Contains, just checking for that known-good string?

Yeah, the value inside <Expects></Expects> is treated as a regex by the FSharpQA Perl runner, and gets applied to the output (after a run or compilation).

I didn't include ranges and filenames in it (although that certainly could be done); the original output looked like this:

tokenize - got EQUALS @ C:\Users\vlza\code\fsharp\tests\fsharpqa\source\CompilerOptions\fsc\tokenize\tokenize02.fs(5,12)-(5,13)

So the <Expects></Expects> pattern could've looked more like ^tokenize - got EQUALS @ .+tokenize02.fs\(5,12\)-\(5,13\)$, but I decided to keep it simple.

Do you think we should include ranges and filenames in the tests? They should be consistent between runs.
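For illustration, a test source carrying its expectations as described might look like the following - a hedged sketch of the pattern, with the exact attribute set assumed:

```fsharp
// Each <Expects> regex below gets applied to fsc's output by the harness.
// <Expects status="success">tokenize - got LET</Expects>
// <Expects status="success">tokenize - got IDENT</Expects>
// <Expects status="success">tokenize - got EQUALS</Expects>
let x = 1
```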

@laenas
Contributor Author

laenas commented Jun 29, 2021

@vzarytovskii

So the <Expects></Expects> pattern could've looked more like ^tokenize - got EQUALS @ .+tokenize02.fs\(5,12\)-\(5,13\)$, but I decided to keep it simple.

That makes a lot of sense. Out of curiosity, is that whole behavior something new this year? Because that's pretty much the right balance for this, and seems a lot better than my initial concerns, which were based on how a bunch of the other tests do full-compiler-output string matching - which seemed like it was going to be a pain to work with for this use case.

@vzarytovskii
Member

@vzarytovskii

So the <Expects></Expects> pattern could've looked more like ^tokenize - got EQUALS @ .+tokenize02.fs\(5,12\)-\(5,13\)$, but I decided to keep it simple.

That makes a lot of sense. Out of curiosity, is that whole behavior something new this year? Because that's pretty much the right balance for this, and seems a lot better than my initial concerns, which were based on how a bunch of the other tests do full-compiler-output string matching - which seemed like it was going to be a pain to work with for this use case.

I don't think it's new; probably only the FSharpQA tests have this (and not the CompilerAssert-based ones).

@laenas
Contributor Author

laenas commented Jun 29, 2021

Aha, that's probably it. Makes tons of sense, thanks for polishing that up and explaining!

@vzarytovskii
Member

vzarytovskii commented Jun 29, 2021

Aha, that's probably it. Makes tons of sense, thanks for polishing that up and explaining!

Np, sorry it took so long, and thanks for implementing it!

@vzarytovskii
Member

CI is having a really bad day today :(

@vzarytovskii vzarytovskii merged commit 4157807 into dotnet:main Jun 29, 2021
@laenas laenas deleted the unfiltered-tokenize branch June 29, 2021 17:16