-
Notifications
You must be signed in to change notification settings - Fork 563
docs(ref): Specify frontmatter #1974
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
src/SUMMARY.md
Outdated
| - [Input format](input-format.md) | ||
| - [Keywords](keywords.md) | ||
| - [Identifiers](identifiers.md) | ||
| - [Frontmatter](frontmatter.md) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not stable yet: rust-lang/rust#136889
| # Crates and source files | ||
|
|
||
| r[crate.syntax] | ||
| ```grammar,items |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: it took me a while before I found the appropriate documentation for this.
I quickly went to CONTRIBUTING.md but skipped over the link to the authoring page because it was at the end of the intro. I skimmed the sections until I found "Adding Documentation" which seemed to describe my situation but I found no link.
Eventually I found authoring.md and thankfully I read thoroughly enough to notice the "Grammar" section at the bottom. At that point, I was mostly able to interpret the results to figure out what I needed (e.g. the limitations of ~, what to search for to understand how to do footnotes).
As noted at https://github.com/rust-lang/reference/pull/1974/files#r2292171963, it doesn't really give a style guidance on casing.
161b0fa to
61afc85
Compare
I created productions for `END_OF_LINE`, `IGNORABLE_CODE_POINT`, and `HORIZONTAL_WHITESPACE` as that is how the unicode standard is written and in preparation for rust-lang#1974 which will make use of `HORIZONTAL_WHITESPACE`
I created productions for `END_OF_LINE`, `IGNORABLE_CODE_POINT`, and `HORIZONTAL_WHITESPACE` as that is how the unicode standard is written and in preparation for rust-lang#1974 which will make use of `HORIZONTAL_WHITESPACE`
This does not create any new productions, instead preferring comments. rust-lang#1974 will involve pulling out the horizontal whitespace into a separate production. Comment wording (and casing) is modeled off of https://www.unicode.org/reports/tr31/#R3a. I left off a "unicode" prefix for ASCII items as they are likely common enough in that context that specifying them as "unicode" could cause more confusion.
This does not create any new productions, instead preferring comments. rust-lang#1974 will involve pulling out the horizontal whitespace into a separate production. Comment wording (and casing) is modeled off of https://www.unicode.org/reports/tr31/#R3a. I left off a "unicode" prefix for ASCII items as they are likely common enough in that context that specifying them as "unicode" could cause more confusion.
traviscross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Before digging into the rest, let me ask some questions about changes here related to layout, organization, and style.
src/whitespace.md
Outdated
| U+0009 // Horizontal tab, `'\t'` | ||
| | U+000A // Line feed, `'\n'` | ||
| | U+000B // Vertical tab | ||
| | U+000C // Form feed | ||
| | U+000D // Carriage return, `'\r'` | ||
| | U+0020 // Space, `' '` | ||
| | U+0085 // Next line | ||
| | U+200E // Left-to-right mark | ||
| | U+200F // Right-to-left mark | ||
| | U+2028 // Line separator | ||
| | U+2029 // Paragraph separator | ||
| TAB -> U+0009 // Horizontal tab, `'\t'` | ||
| LF -> U+000A // Line feed, `'\n'` | ||
| CR -> U+000D // Carriage return, `'\r'` | ||
| END_OF_LINE | ||
| | IGNORABLE_CODE_POINT | ||
| | HORIZONTAL_WHITESPACE | ||
| END_OF_LINE -> | ||
| LF | ||
| | U+000B // vertical tabulation | ||
| | U+000C // form feed | ||
| | CR | ||
| | U+0085 // Unicode next line | ||
| | U+2028 // Unicode LINE SEPARATOR | ||
| | U+2029 // Unicode PARAGRAPH SEPARATOR | ||
| IGNORABLE_CODE_POINT -> | ||
| U+200E // Unicode LEFT-TO-RIGHT MARK | ||
| | U+200F // Unicode RIGHT-TO-LEFT MARK | ||
| HORIZONTAL_WHITESPACE -> | ||
| TAB | ||
| | U+0020 // space ' ' | ||
| TAB -> U+0009 // horizontal tab ('\t') | ||
| LF -> U+000A // line feed ('\n') | ||
| CR -> U+000D // carriage return ('\r') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In these sort of grammar productions that are essentially lists, we generally find it more clear to not mix terminals and non-terminals even if that leads to some duplication. As applied here, that would suggest inlining LF and CR in END_OF_LINE and TAB in HORIZONTAL_WHITESPACE. Does that make sense, or are there non-obvious problems with that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see it both ways.
Extra abstractions can make things less clear.
Lack of use of abstractions highlights a difference which can also be confusing or make searching more difficult.
Which gets me thinking, with how thin this abstraction is, should we even have a production for TAB, LF, or CR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fairly common in grammars to have separate productions for TAB, LF, and CR, even if duplicated. I'd suggest keeping these three but inlining the terminals into the other productions.
src/whitespace.md
Outdated
| | U+0085 // Unicode next line | ||
| | U+2028 // Unicode LINE SEPARATOR | ||
| | U+2029 // Unicode PARAGRAPH SEPARATOR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my copy of the Unicode Character Database, I see:
0085;<control>;Cc;0;B;;;;;N;NEXT LINE (NEL);;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
In a somewhat more pretty form:
- https://util.unicode.org/UnicodeJsps/character.jsp?a=0085
- https://util.unicode.org/UnicodeJsps/character.jsp?a=2028
- https://util.unicode.org/UnicodeJsps/character.jsp?a=2029
That is, in the database, U+0085 has a null name and a name alias of "NEXT LINE|NEL", U+2028 has a name of "LINE SEPARATOR" and a null name alias, and U+2029 has a name of "PARAGRAPH SEPARATOR" and a null name alias.
Given that, is there a reason that we need to capitalize these differently, or could we perhaps align the capitalization with Reference norms without losing something important here?
I do see, in The Unicode Standard, Version 17.0.0, §5.8.1 "Definitions", Table 5-1 that all of "next line", "line separator", and "paragraph separator" are presented in lowercase, but that doesn't seem to explain why we might want to treat these differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to match how they are specified where it talks about the different categories of whitespace, see https://www.unicode.org/reports/tr31/#R3a-1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me that the capitalization and other matters of presentation there are intended to be more normative than the capitalization in the Unicode Character Database (which is normatively part of the Unicode standard).
To my eyes, it looks like they made the editorial choice in tr31 to notate code point names as presented in the database, and then where the name is null, to present the primary code point alias in parentheses and in lowercase (and to elide secondary aliases). This presentation was, I'm guessing, intended to notate this difference between names and aliases.
Is there a reason that it's important to follow the presentation in tr31 exactly? If so, I'd expect that we'd want to follow the convention with the parentheses as well.
However, it's not clear to me that it's important, for our purposes, to make this distinction between code point names and code point aliases. If that's right, and there's not, then I'd prefer we render these according to our normal conventions for comments, i.e., in sentence case (especially as the authors of Unicode standards documents also seem to treat the capitalizations as malleable).
But, is that right, or is there an important reason we need to distinguish, for our purposes, code point names and aliases (where the name is null)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that it's important to follow the presentation in tr31 exactly? If so, I'd expect that we'd want to follow the convention with the parentheses as well.
That is the document we are referring back to on this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, understood. How would you then apply that to the points and questions raised above?
src/whitespace.md
Outdated
| HORIZONTAL_WHITESPACE -> | ||
| TAB | ||
| | U+0020 // space ' ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the grammar productions above for frontmatter, it seems that only HORIZONTAL_WHITESPACE is used (and LF, of course). If we were willing to accept some duplication, i.e. by defining HORIZONTAL_WHITESPACE directly in terms of terminals (that also would appear under WHITESPACE), does that work for defining frontmatter without these other changes, or is there a reason these other changes to the layout, organization, and styling of the WHITESPACE grammar are important for defining frontmatter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think duplicating an entry like HORIZONTAL_WHITESPACE is anti-user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think duplicating an entry like
HORIZONTAL_WHITESPACEis anti-user.
Obviously we strive to be pro-user. However, I'm not clear how this stylistic choice would be anti-user. If you could please elaborate, I'd appreciate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First off, I am not a DRY absolutist. Things having the appearance of duplication isn't a reason to de-duplicate but we should work towards understanding the underlying principles and requirements and de-duplicate those where it works well and reducing the burden of duplication where it doesn't work well (e.g. having two definitions next to each other). In this case, we are working off of an item (whitespace) that is defined in three categories that is then defined as literals that natural leads to a way to divide things up to not duplicate knowledge.
I see this is similar to but worse than #1974 (comment)
At least with TAB, the abstraction is thin (too thin as I suggested in that thread). Here, we'd be duplicating a lot of the definition. This puts a burden on the user to match up the entries to
- Understand that one is a superset of the other
- Understand what the role of the items mean
When seeing duplication like this, it is natural to expect there to a reason for it and that reason to be there are differences. For me, this makes me doubt myself and pore over something more, wasting my time and making me not confident that I understand what I'm working off of, and more annoyed in the process.
I suspect this will also be a honey trap for contributors who will see this and then submit PRs to do what I did.
d7203a4 to
f844924
Compare
ehuss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like items.mod.attributes.intro might need updating.
| This prevents an [inner attribute] at the start of a source file being removed. | ||
|
|
||
| > [!NOTE] | ||
| > The standard library [`include!`] macro applies byte order mark removal, CRLF normalization, and shebang removal to the file it reads. The [`include_str!`] and [`include_bytes!`] macros do not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the impact on these macros?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting on rust-lang/rust#146377 to update this
| r[input.frontmatter] | ||
| ## Frontmatter removal | ||
|
|
||
| After some whitespace, [frontmatter] may next appear in the input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be worded more explicitly how it relates to the items around it (particularly shebang)? This rule isn't quite standing on its own and it isn't quite clear how it fits.
For example, something like: "after the optional [shebang] and then optional [whitespace], [frontmatter] may appear next in the input".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rule isn't quite standing on its own and it isn't quite clear how it fits.
At least shebang is written in an additive manner:
If the remaining sequence begins with the characters
#!,
Seems like it would be good to be consistent. I could make the additive nature more explicit.
| | HORIZONTAL_WHITESPACE | ||
| END_OF_LINE -> | ||
| U+000A // line feed, `'\n'` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please keep the original comments and same comment style in these rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some existing discussion on the comment style at #1974 (comment)
The intent of the comment style is to match the linked to standard for easier comparison.
src/frontmatter.md
Outdated
| (FRONTMATTER_LINE LF )* | ||
| FRONTMATTER_FENCE[^matched-fence] HORIZONTAL_WHITESPACE* LF | ||
| FRONTMATTER_FENCE -> `---` `-`* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| FRONTMATTER_FENCE -> `---` `-`* | |
| FRONTMATTER_FENCE -> `-`{3..} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I think the existing content is easier for an end user to look at but whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted for a sec as I re-think how I want to handle some of this feedback
src/SUMMARY.md
Outdated
| - [Input format](input-format.md) | ||
| - [Keywords](keywords.md) | ||
| - [Identifiers](identifiers.md) | ||
| - [Frontmatter](frontmatter.md) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be sequenced just after "Input format" since it is tightly related to that? I realize the rest of these aren't in any particularly logical order, but I'm not quite seeing why it is placed in the middle here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember why I chose that spot. I've moved it.
| @root FRONTMATTER -> | ||
| FRONTMATTER_FENCE HORIZONTAL_WHITESPACE* INFOSTRING? HORIZONTAL_WHITESPACE* LF | ||
| (FRONTMATTER_LINE LF )* | ||
| FRONTMATTER_FENCE[^matched-fence] HORIZONTAL_WHITESPACE* LF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit unusual to have load-bearing footnotes in the grammar. Would it be possible to define this recursively so that isn't necessary? Or is there maybe some other way around it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, TC and I have been trying to come up with a grammar here, but we have not yet been able to land on something. I'm still working on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that being too clever in encoding this could have a negative impact on users understanding this.
Another route is to extend the grammar for inline context notes like how *? gets a (non-greedy) note and ! gets a warning icon with with the exception of.
| [^matched-fence]: The closing fence must have the same number of `-` as the opening fence | ||
| [^escaped-fence]: A `FRONTMATTER_FENCE` at the beginning of a `FRONTMATTER_LINE` is only invalid if it has the same or more `-` as the `FRONTMATTER_FENCE` | ||
|
|
||
| Frontmatter is an optional section for content intended for external tools without requiring these tools to have full knowledge of the Rust grammar. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always use "intro" for the introduction of something.
Also, perhaps this intro could be moved to the top of the file?
| Frontmatter is an optional section for content intended for external tools without requiring these tools to have full knowledge of the Rust grammar. | |
| r[frontmatter.intro] | |
| Frontmatter is an optional section for content intended for external tools without requiring these tools to have full knowledge of the Rust grammar. |
Can you also include an example here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, perhaps this intro could be moved to the top of the file?
All other Lexical structure pages with a grammar open with the grammar. Most (but not all) other pages I randomly sampled did the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The into reference and an example were added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI identified two problems with the example
- A feature flag must be present to build
- How do we handle the feature flag in the example when feature flags are not within the scope of the reference?
- Would we be removing it from the example when this is stabilized and released? Unfortunate that we have to wait until then
- It doesn't understand frontmatter and fails to build
- I think it is important for examples to be practical so I made a small, runnable cargo script
This reverts commit 60eb145. This re-formats our Whitespace to be centered on Unicode's defintion. This makes it easy to compare with the standard and helps with Frontmatter. Unlike regular Rust, Frontmatter cares about the type of Whitespace. Even if we want to duplicate the definition, having them formatted similarly makes them easy to compare.
I'm splitting out `HORIZONTAL_WHITESPACE` to make it easier to connect the concept in frontmatter with `WHITESPACE`. In doing this, I found it awkward to only pull out part when the comments are there. I either needed a redundant comment to start a section, re-arrange out of order from Unicode, or split out all classes. I did the latter.
e7299ae to
2b63fb9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two categories of feedback I am concerned about for this PR
- ones I disagree with as currently stated (e.g. docs(ref): Specify frontmatter #1974 (comment), docs(ref): Specify frontmatter #1974 (comment))
- ones that require additional structural change which I have not had the best experience contributing to (e.g. docs(ref): Specify frontmatter #1974 (comment))
Please keep in mind that I am not contributing to this for my own interest in the Reference (I was interested in that at one point but no longer) but to fulfill a requirement of another team.
How about an alternative route for this: this PR gets re-framed as a perma-draft that is a content dump without any editorial changes (changing the content to be as surgical as possible). Once we're agreed to the technical content, a T-reference person can then copy the commits, make any editorial changes they wish, and post their own PR.
| As an exception, if the `#!` characters are followed (ignoring intervening [comments] or [whitespace]) by a `[` token, nothing is removed. | ||
| This prevents an [inner attribute] at the start of a source file being removed. | ||
|
|
||
| > [!NOTE] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From @traviscross at rust-lang/rust#148051 (comment) related to r[input.shebang.inner-attribute] (just above this)
Let's please document the behavior of frontmatter removal, including how it interacts with the documented shebang removal exception
There is no change to the shebang vs attribute detection. I've been trying to think this through and so far I've not come up with a way to include frontmatter that doesn't feel artificial (saying what doesn't affect this is an open set).
Tracking issue: rust-lang/rust#136889