From dee80940b123ad7006e0497391d8c160ae15ba1b Mon Sep 17 00:00:00 2001 From: Julia Evans Date: Wed, 12 Nov 2025 19:53:20 +0000 Subject: [PATCH 01/29] doc: add an explanation of Git's data model Git very often uses the terms "object", "reference", or "index" in its documentation. However, it's hard to find a clear explanation of these terms and how they relate to each other in the documentation. The closest candidates currently are: 1. `gitglossary`. This makes a good effort, but it's an alphabetically ordered dictionary and a dictionary is not a good way to learn concepts. You have to jump around too much and it's not possible to present the concepts in the order that they should be explained. 2. `gitcore-tutorial`. This explains how to use the "core" Git commands. This is a nice document to have, but it's not necessary to learn how `update-index` works to understand Git's data model, and we should not be requiring users to learn how to use the "plumbing" commands if they want to learn what the term "index" or "object" means. 3. `gitrepository-layout`. This is a great resource, but it includes a lot of information about configuration and internal implementation details which are not related to the data model. It also does not explain how commits work. The result of this is that Git users (even users who have been using Git for 15+ years) struggle to read the documentation because they don't know what the core terms mean, and it's not possible to add links to help them learn more. Add an explanation of Git's data model. Some choices I've made in deciding what "core data model" means: 1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me if those are intended to be user facing or if they're more like internal implementation details. 2. Don't talk about submodules other than by mentioning how they relate to trees. This is because Git has a lot of special features, and explaining how they all work exhaustively could quickly go down a rabbit hole which would make this document less useful for understanding Git's core behaviour. 3. Don't discuss the structure of a commit message (first line, trailers etc). 4. Don't mention configuration. 5. Don't mention the `.git` directory, to avoid getting too much into implementation details Signed-off-by: Julia Evans Signed-off-by: Junio C Hamano --- Documentation/Makefile | 1 + Documentation/gitdatamodel.adoc | 307 ++++++++++++++++++++++++++++ Documentation/glossary-content.adoc | 4 +- Documentation/meson.build | 1 + 4 files changed, 311 insertions(+), 2 deletions(-) create mode 100644 Documentation/gitdatamodel.adoc diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fb83d0c6ebf22..5f4acfacbdb6f0 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc MAN7_TXT += gitcore-tutorial.adoc MAN7_TXT += gitcredentials.adoc MAN7_TXT += gitcvs-migration.adoc +MAN7_TXT += gitdatamodel.adoc MAN7_TXT += gitdiffcore.adoc MAN7_TXT += giteveryday.adoc MAN7_TXT += gitfaq.adoc diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc new file mode 100644 index 00000000000000..3614f5960ea143 --- /dev/null +++ b/Documentation/gitdatamodel.adoc @@ -0,0 +1,307 @@ +gitdatamodel(7) +=============== + +NAME +---- +gitdatamodel - Git's core data model + +SYNOPSIS +-------- +gitdatamodel + +DESCRIPTION +----------- + +It's not necessary to understand Git's data model to use Git, but it's +very helpful when reading Git's documentation so that you know what it +means when the documentation says "object", "reference" or "index". + +Git's core operations use 4 kinds of data: + +1. <>: commits, trees, blobs, and tag objects +2. <>: branches, tags, + remote-tracking branches, etc +3. <>, also known as the staging area +4. <>: logs of changes to references ("ref log") + +[[objects]] +OBJECTS +------- + +All of the commits and files in a Git repository are stored as "Git objects". +Git objects never change after they're created, and every object has an ID, +like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. + +This means that if you have an object's ID, you can always recover its +exact contents as long as the object hasn't been deleted. + +Every object has: + +[[object-id]] +1. an *ID* (aka "object name"), which is a cryptographic hash of its + type and contents. + It's fast to look up a Git object using its ID. + This is usually represented in hexadecimal, like + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. +2. a *type*. There are 4 types of objects: + <>, <>, <>, + and <>. +3. *contents*. The structure of the contents depends on the type. + +Here's how each type of object is structured: + +[[commit]] +commit:: + A commit contains these required fields + (though there are other optional fields): ++ +1. The full directory structure of all the files in that version of the + repository and each file's contents, stored as the *<>* ID + of the commit's top-level directory +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored +4. A *committer* and the time the commit was committed +5. A *commit message* ++ +Here's how an example commit is stored: ++ +---- +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647 +author Maya 1759173425 -0400 +committer Maya 1759173425 -0400 + +Add README +---- ++ +Like all other objects, commits can never be changed after they're created. +For example, "amending" a commit with `git commit --amend` creates a new +commit with the same parent. ++ +Git does not store the diff for a commit: when you ask Git to show +the commit with linkgit:git-show[1], it calculates the diff from its +parent on the fly. + +[[tree]] +tree:: + A tree is how Git represents a directory. + It can contain files or other trees (which are subdirectories). + It lists, for each item in the tree: ++ +1. The *filename*, for example `hello.py` +2. The *file type*, which must be one of these five types: + - *regular file* + - *executable file* + - *symbolic link* + - *directory* + - *gitlink* (for use with submodules) +3. The <> with the contents of the file, directory, + or gitlink. ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: ++ +---- +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- + +NOTE: In the output above, Git displays the file type of each tree entry +using a format that's loosely modelled on Unix file modes (`100644` is +"regular file", `100755` is "executable file", `120000` is "symbolic +link", `040000` is "directory", and `160000` is "gitlink"). It also +displays the object's type: `blob` for files and symlinks, `tree` for +directories, and `commit` for gitlinks. + +[[blob]] +blob:: + A blob object contains a file's contents. ++ +When you make a commit, Git stores the full contents of each file that +you changed as a blob. +For example, if you have a commit that changes 2 files in a repository +with 1000 files, that commit will create 2 new blobs, and use the +previous blob ID for the other 998 files. +This means that commits can use relatively little disk space even in a +very large repository. + +[[tag-object]] +tag object:: + Tag objects contain these required fields + (though there are other optional fields): ++ +1. The *ID* of the object it references +2. The *type* of the object it references +3. The *tagger* and tag date +4. A *tag message*, similar to a commit message + +Here's how an example tag object is stored: + +---- +object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 +type commit +tag v1.0.0 +tagger Maya 1759927359 -0400 + +Release version 1.0.0 +---- + +NOTE: All of the examples in this section were generated with +`git cat-file -p `. + +[[references]] +REFERENCES +---------- + +References are a way to give a name to a commit. +It's easier to remember "the changes I'm working on are on the `turtle` +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". + +References can either refer to: + +1. An object ID, usually a <> ID +2. Another reference. This is called a "symbolic reference" + +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. +Most references are under `refs/`. Here are the main types: + +[[branch]] +branches: `refs/heads/`:: + A branch refers to a commit ID. + That commit is the latest commit on the branch. ++ +To get the history of commits on a branch, Git will start at the commit +ID the branch references, and then look at the commit's parent(s), +the parent's parent, etc. + +[[tag]] +tags: `refs/tags/`:: + A tag refers to a commit ID, tag object ID, or other object ID. + There are two types of tags: + 1. "Annotated tags", which reference a <> ID + which contains a tag message + 2. "Lightweight tags", which reference a commit, blob, or tree ID + directly ++ +Even though branches and tags both refer to a commit ID, Git +treats them very differently. +Branches are expected to change over time: when you make a commit, Git +will update your <> to point to the new commit. +Tags are usually not changed after they're created. + +[[HEAD]] +HEAD: `HEAD`:: + `HEAD` is where Git stores your current <>, + if there is a current branch. `HEAD` can either be: ++ +1. A symbolic reference to your current branch, for example `ref: + refs/heads/main` if your current branch is `main`. +2. A direct reference to a commit ID. In this case there is no current branch. + This is called "detached HEAD state", see the DETACHED HEAD section + of linkgit:git-checkout[1] for more. + +[[remote-tracking-branch]] +remote-tracking branches: `refs/remotes//`:: + A remote-tracking branch refers to a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When + `git status` says "you're up to date with origin/main", it's looking at + this. ++ +`refs/remotes//HEAD` is a symbolic reference to the remote's +default branch. This is the branch that `git clone` checks out by default. + +[[other-refs]] +Other references:: + Git tools may create references anywhere under `refs/`. + For example, linkgit:git-stash[1], linkgit:git-bisect[1], + and linkgit:git-notes[1] all create their own references + in `refs/stash`, `refs/bisect`, etc. + Third-party Git tools may also create their own references. ++ +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. + +NOTE: Git may delete objects that aren't "reachable" from any reference +or <>. +An object is "reachable" if we can find it by following tags to whatever +they tag, commits to their parents or trees, and trees to the trees or +blobs that they contain. +For example, if you amend a commit with `git commit --amend`, +there will no longer be a branch that points at the old commit. +The old commit is recorded in the current branch's <>, +so it is still "reachable", but when the reflog entry expires it may +become unreachable and get deleted. + +the old commit will usually not be reachable, so it may be deleted eventually. +Reachable objects will never be deleted. + +[[index]] +THE INDEX +--------- +The index, also known as the "staging area", is a list of files and +the contents of each file, stored as a <>. +You can add files to the index or update the contents of a file in the +index with linkgit:git-add[1]. This is called "staging" the file for commit. + +Unlike a <>, the index is a flat list of files. +When you commit, Git converts the list of files in the index to a +directory <> and uses that tree in the new <>. + +Each index entry has 4 fields: + +1. The *file type*, which must be one of: + - *regular file* + - *executable file* + - *symbolic link* + - *gitlink* (for use with submodules) +2. The *<>* ID of the file, + or (rarely) the *<>* ID of the submodule +3. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if + there's a merge conflict there can be multiple versions of the same + filename in the index. +4. The *file path*, for example `src/hello.py` + +It's extremely uncommon to look at the index directly: normally you'd +run `git status` to see a list of changes between the index and <>. +But you can use `git ls-files --stage` to see the index. +Here's the output of `git ls-files --stage` in a repository with 2 files: + +---- +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py +---- + +[[reflogs]] +REFLOGS +------- + +Every time a branch, remote-tracking branch, or HEAD is updated, Git +updates a log called a "reflog" for that <>. +This means that if you make a mistake and "lose" a commit, you can +generally recover the commit ID by running `git reflog `. + +A reflog is a list of log entries. Each entry has: + +1. The *commit ID* +2. *Timestamp* when the change was made +3. *Log message*, for example `pull: Fast-forward` + +Reflogs only log changes made in your local repository. +They are not shared with remotes. + +You can view a reflog with `git reflog `. +For example, here's the reflog for a `main` branch which has changed twice: + +---- +$ git reflog main --date=iso --no-decorate +750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README +4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit +---- + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc index e423e4765b71b0..20ba121314b9a4 100644 --- a/Documentation/glossary-content.adoc +++ b/Documentation/glossary-content.adoc @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a identified by its <>. The objects usually live in `$GIT_DIR/objects/`. -[[def_object_identifier]]object identifier (oid):: - Synonym for <>. +[[def_object_identifier]]object identifier, object ID, oid:: + Synonyms for <>. [[def_object_name]]object name:: The unique identifier of an <>. The diff --git a/Documentation/meson.build b/Documentation/meson.build index 44f94cdb7ba672..f3fcf4bc9196aa 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -192,6 +192,7 @@ manpages = { 'gitcore-tutorial.adoc' : 7, 'gitcredentials.adoc' : 7, 'gitcvs-migration.adoc' : 7, + 'gitdatamodel.adoc' : 7, 'gitdiffcore.adoc' : 7, 'giteveryday.adoc' : 7, 'gitfaq.adoc' : 7, From 8d4725e48ef29bd857e21e689411878b6eb4df92 Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:47 -0800 Subject: [PATCH 02/29] whitespace: correct bit assignment comments A comment in diff.c claimed that bits up to 12th (counting from 0th) are whitespace rules, and 13th thru 15th are for new/old/context, but it turns out it was miscounting. Correct them, and clarify where the whitespace rule bits come from in the comment. Extend bit assignment comments to cover bits used for color-moved, which weren't described. Also update the way these bit constants are defined to use (1 << N) notation, instead of octal constants, as it tends to make it easier to notice a breakage like this. Sprinkle a few blank lines between logically distinct groups of CPP macro definitions to make them easier to read. Signed-off-by: Junio C Hamano --- diff.c | 7 +++++-- diff.h | 6 +++--- ws.h | 25 ++++++++++++++----------- 3 files changed, 22 insertions(+), 16 deletions(-) diff --git a/diff.c b/diff.c index a74e701806be52..74261b332af16c 100644 --- a/diff.c +++ b/diff.c @@ -801,16 +801,19 @@ enum diff_symbol { DIFF_SYMBOL_CONTEXT_MARKER, DIFF_SYMBOL_SEPARATOR }; + /* * Flags for content lines: - * 0..12 are whitespace rules - * 13-15 are WSEH_NEW | WSEH_OLD | WSEH_CONTEXT + * 0..11 are whitespace rules (see ws.h) + * 12..14 are WSEH_NEW | WSEH_CONTEXT | WSEH_OLD * 16 is marking if the line is blank at EOF + * 17..19 are used for color-moved. */ #define DIFF_SYMBOL_CONTENT_BLANK_LINE_EOF (1<<16) #define DIFF_SYMBOL_MOVED_LINE (1<<17) #define DIFF_SYMBOL_MOVED_LINE_ALT (1<<18) #define DIFF_SYMBOL_MOVED_LINE_UNINTERESTING (1<<19) + #define DIFF_SYMBOL_CONTENT_WS_MASK (WSEH_NEW | WSEH_OLD | WSEH_CONTEXT | WS_RULE_MASK) /* diff --git a/diff.h b/diff.h index 2fa256c3ef0079..cbd355cf50f68e 100644 --- a/diff.h +++ b/diff.h @@ -331,9 +331,9 @@ struct diff_options { int ita_invisible_in_index; /* white-space error highlighting */ -#define WSEH_NEW (1<<12) -#define WSEH_CONTEXT (1<<13) -#define WSEH_OLD (1<<14) +#define WSEH_NEW (1<<12) +#define WSEH_CONTEXT (1<<13) +#define WSEH_OLD (1<<14) unsigned ws_error_highlight; const char *prefix; int prefix_length; diff --git a/ws.h b/ws.h index 5ba676c5595db5..23708efb7322ed 100644 --- a/ws.h +++ b/ws.h @@ -7,19 +7,22 @@ struct strbuf; /* * whitespace rules. * used by both diff and apply - * last two digits are tab width + * last two octal-digits are tab width (we support only up to 63). */ -#define WS_BLANK_AT_EOL 0100 -#define WS_SPACE_BEFORE_TAB 0200 -#define WS_INDENT_WITH_NON_TAB 0400 -#define WS_CR_AT_EOL 01000 -#define WS_BLANK_AT_EOF 02000 -#define WS_TAB_IN_INDENT 04000 -#define WS_TRAILING_SPACE (WS_BLANK_AT_EOL|WS_BLANK_AT_EOF) +#define WS_BLANK_AT_EOL (1<<6) +#define WS_SPACE_BEFORE_TAB (1<<7) +#define WS_INDENT_WITH_NON_TAB (1<<8) +#define WS_CR_AT_EOL (1<<9) +#define WS_BLANK_AT_EOF (1<<10) +#define WS_TAB_IN_INDENT (1<<11) + +#define WS_TRAILING_SPACE (WS_BLANK_AT_EOL|WS_BLANK_AT_EOF) #define WS_DEFAULT_RULE (WS_TRAILING_SPACE|WS_SPACE_BEFORE_TAB|8) -#define WS_TAB_WIDTH_MASK 077 -/* All WS_* -- when extended, adapt diff.c emit_symbol */ -#define WS_RULE_MASK 07777 +#define WS_TAB_WIDTH_MASK ((1<<6)-1) + +/* All WS_* -- when extended, adapt constants defined after diff.c:diff_symbol */ +#define WS_RULE_MASK ((1<<12)-1) + extern unsigned whitespace_rule_cfg; unsigned whitespace_rule(struct index_state *, const char *); unsigned parse_whitespace_rule(const char *); From f83d1afafb8b772397aa3854184c42f7810fa0df Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:48 -0800 Subject: [PATCH 03/29] diff: emit_line_ws_markup() if/else style fix Apply the simple rule: if you need {} in one arm of the if/else if/else... cascade, have {} in all of them. Signed-off-by: Junio C Hamano --- diff.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/diff.c b/diff.c index 74261b332af16c..9a24a0791ceaca 100644 --- a/diff.c +++ b/diff.c @@ -1327,14 +1327,14 @@ static void emit_line_ws_markup(struct diff_options *o, ws = NULL; } - if (!ws && !set_sign) + if (!ws && !set_sign) { emit_line_0(o, set, NULL, 0, reset, sign, line, len); - else if (!ws) { + } else if (!ws) { emit_line_0(o, set_sign, set, !!set_sign, reset, sign, line, len); - } else if (blank_at_eof) + } else if (blank_at_eof) { /* Blank line at EOF - paint '+' as well */ emit_line_0(o, ws, NULL, 0, reset, sign, line, len); - else { + } else { /* Emit just the prefix, then the rest. */ emit_line_0(o, set_sign ? set_sign : set, NULL, !!set_sign, reset, sign, "", 0); From fc7abcd9d5460b381701ef43b7f6dafa73962950 Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:49 -0800 Subject: [PATCH 04/29] diff: correct suppress_blank_empty hack The suppress-blank-empty feature abused the CONTEXT_INCOMPLETE symbol that was meant to be used only for "\ No newline at the end of file" code path. The intent of the feature was to turn a context line we receive from xdiff machinery (which always uses ' ' for context lines, even an empty one) and spit it out as a truly empty line. Perform such a conversion very locally at where a line from xdiff that begins with ' ' is handled for output; there are many checks before the control reaches such place that checks the first letter of the diff output line to see if it is a context line, and having to check for '\n' and treat it as a special case is error prone. In order to catch similar hacks in the future, make sure the code path that is meant for "\ No newline" case checks the first byte is indeed a backslash. Signed-off-by: Junio C Hamano --- diff.c | 27 +++++++++++---------------- 1 file changed, 11 insertions(+), 16 deletions(-) diff --git a/diff.c b/diff.c index 9a24a0791ceaca..b9ef8550cc859a 100644 --- a/diff.c +++ b/diff.c @@ -1321,6 +1321,11 @@ static void emit_line_ws_markup(struct diff_options *o, const char *ws = NULL; int sign = o->output_indicators[sign_index]; + if (diff_suppress_blank_empty && + sign_index == OUTPUT_INDICATOR_CONTEXT && + len == 1 && line[0] == '\n') + sign = 0; + if (o->ws_error_highlight & ws_rule) { ws = diff_get_color_opt(o, DIFF_WHITESPACE); if (!*ws) @@ -1498,15 +1503,9 @@ static void emit_diff_symbol_from_struct(struct diff_options *o, case DIFF_SYMBOL_WORDS: context = diff_get_color_opt(o, DIFF_CONTEXT); reset = diff_get_color_opt(o, DIFF_RESET); - /* - * Skip the prefix character, if any. With - * diff_suppress_blank_empty, there may be - * none. - */ - if (line[0] != '\n') { - line++; - len--; - } + + /* Skip the prefix character */ + line++; len--; emit_line(o, context, reset, line, len); break; case DIFF_SYMBOL_FILEPAIR_PLUS: @@ -2375,12 +2374,6 @@ static int fn_out_consume(void *priv, char *line, unsigned long len) ecbdata->label_path[0] = ecbdata->label_path[1] = NULL; } - if (diff_suppress_blank_empty - && len == 2 && line[0] == ' ' && line[1] == '\n') { - line[0] = '\n'; - len = 1; - } - if (line[0] == '@') { if (ecbdata->diff_words) diff_words_flush(ecbdata); @@ -2431,12 +2424,14 @@ static int fn_out_consume(void *priv, char *line, unsigned long len) ecbdata->lno_in_preimage++; emit_context_line(ecbdata, line + 1, len - 1); break; - default: + case '\\': /* incomplete line at the end */ ecbdata->lno_in_preimage++; emit_diff_symbol(o, DIFF_SYMBOL_CONTEXT_INCOMPLETE, line, len, 0); break; + default: + BUG("fn_out_consume: unknown line '%s'", line); } return 0; } From ced0561828271e8fc3fa2699754c5925969111b5 Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:50 -0800 Subject: [PATCH 05/29] diff: keep track of the type of the last line seen The "\ No newline at the end of the file" can come after any of the "-" (deleted preimage line), " " (unchanged line), or "+" (added postimage line). In later steps in this series, we will start treating a change that makes a file to end in an incomplete line as a whitespace error, and we would need to know what the previous line was when we react to "\ No newline" in the diff output. If the previous line was a context (i.e., unchanged) line, the file lacked the final newline before the change, and the change did not touch that line and left it still incomplete, so we do not want to warn in such a case. Teach fn_out_consume() function to keep track of what the previous line was, and prepare an otherwise empty switch statement to let us react differently to "\ No newline" based on that. Note that there is an existing curiosity (read: likely to be a bug) in the code that increments line number in the preimage file every time it sees a line with "\ No newline" on it, regardless of what the previous line was. I left it as-is, because it does not affect the main theme of this series, and more importantly, I do not think it matters, as these numbers are used only to compare them with blank_at_eof_in_{pre,post}image to issue a warning when we see more empty line was added at the end, but by definition, after we see "\ No newline at the end of the file" for an added line, we will not see an added line for the file. An independent audit to ensure that this curious increment can be safely removed would make a good #leftoverbits clean-up (we may even find some code that decrements this counter or over-increments the other quantity this counter is compared with that compensates the effect of this curious increment that hides a bug, in which case we may also need to remove them). Signed-off-by: Junio C Hamano --- diff.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/diff.c b/diff.c index b9ef8550cc859a..ff8fc91f88d30e 100644 --- a/diff.c +++ b/diff.c @@ -601,6 +601,7 @@ struct emit_callback { int blank_at_eof_in_postimage; int lno_in_preimage; int lno_in_postimage; + int last_line_kind; const char **label_path; struct diff_words_data *diff_words; struct diff_options *opt; @@ -2426,6 +2427,15 @@ static int fn_out_consume(void *priv, char *line, unsigned long len) break; case '\\': /* incomplete line at the end */ + switch (ecbdata->last_line_kind) { + case '+': + case '-': + case ' ': + break; + default: + BUG("fn_out_consume: '\\No newline' after unknown line (%c)", + ecbdata->last_line_kind); + } ecbdata->lno_in_preimage++; emit_diff_symbol(o, DIFF_SYMBOL_CONTEXT_INCOMPLETE, line, len, 0); @@ -2433,6 +2443,7 @@ static int fn_out_consume(void *priv, char *line, unsigned long len) default: BUG("fn_out_consume: unknown line '%s'", line); } + ecbdata->last_line_kind = line[0]; return 0; } From 29228cbdc5f8a80b1f61c7cc209ba8e3714cc38e Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:51 -0800 Subject: [PATCH 06/29] diff: refactor output of incomplete line Create a helper function that reacts to "\ No newline at the end of file" in preparation for unifying the incomplete line handling in the code path that handles xdiff output and the code path that bypasses xdiff and produces a complete-rewrite patch. Currently the output from the DIFF_SYMBOL_CONTEXT_INCOMPLETE case still (ab)uses the same code as what is used for context lines, but that would change in a later step where we introduce support to treat an incomplete line as a whitespace error. Signed-off-by: Junio C Hamano --- diff.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/diff.c b/diff.c index ff8fc91f88d30e..7ee86204291161 100644 --- a/diff.c +++ b/diff.c @@ -1379,6 +1379,10 @@ static void emit_diff_symbol_from_struct(struct diff_options *o, emit_line(o, "", "", line, len); break; case DIFF_SYMBOL_CONTEXT_INCOMPLETE: + set = diff_get_color_opt(o, DIFF_CONTEXT); + reset = diff_get_color_opt(o, DIFF_RESET); + emit_line(o, set, reset, line, len); + break; case DIFF_SYMBOL_CONTEXT_MARKER: context = diff_get_color_opt(o, DIFF_CONTEXT); reset = diff_get_color_opt(o, DIFF_RESET); @@ -1668,6 +1672,13 @@ static void emit_context_line(struct emit_callback *ecbdata, emit_diff_symbol(ecbdata->opt, DIFF_SYMBOL_CONTEXT, line, len, flags); } +static void emit_incomplete_line_marker(struct emit_callback *ecbdata, + const char *line, int len) +{ + emit_diff_symbol(ecbdata->opt, DIFF_SYMBOL_CONTEXT_INCOMPLETE, + line, len, 0); +} + static void emit_hunk_header(struct emit_callback *ecbdata, const char *line, int len) { @@ -2437,8 +2448,7 @@ static int fn_out_consume(void *priv, char *line, unsigned long len) ecbdata->last_line_kind); } ecbdata->lno_in_preimage++; - emit_diff_symbol(o, DIFF_SYMBOL_CONTEXT_INCOMPLETE, - line, len, 0); + emit_incomplete_line_marker(ecbdata, line, len); break; default: BUG("fn_out_consume: unknown line '%s'", line); From 35925f1832ac00a4926623dff061a3b52123470b Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:52 -0800 Subject: [PATCH 07/29] diff: call emit_callback ecbdata everywhere Everybody else, except for emit_rewrite_lines(), calls the emit_callback data ecbdata. Make sure we call the same thing by the same name for consistency. Signed-off-by: Junio C Hamano --- diff.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/diff.c b/diff.c index 7ee86204291161..44b86544b75f97 100644 --- a/diff.c +++ b/diff.c @@ -1780,7 +1780,7 @@ static void add_line_count(struct strbuf *out, int count) } } -static void emit_rewrite_lines(struct emit_callback *ecb, +static void emit_rewrite_lines(struct emit_callback *ecbdata, int prefix, const char *data, int size) { const char *endp = NULL; @@ -1791,17 +1791,17 @@ static void emit_rewrite_lines(struct emit_callback *ecb, endp = memchr(data, '\n', size); len = endp ? (endp - data + 1) : size; if (prefix != '+') { - ecb->lno_in_preimage++; - emit_del_line(ecb, data, len); + ecbdata->lno_in_preimage++; + emit_del_line(ecbdata, data, len); } else { - ecb->lno_in_postimage++; - emit_add_line(ecb, data, len); + ecbdata->lno_in_postimage++; + emit_add_line(ecbdata, data, len); } size -= len; data += len; } if (!endp) - emit_diff_symbol(ecb->opt, DIFF_SYMBOL_NO_LF_EOF, NULL, 0, 0); + emit_diff_symbol(ecbdata->opt, DIFF_SYMBOL_NO_LF_EOF, NULL, 0, 0); } static void emit_rewrite_diff(const char *name_a, From 8d8e3c61874bbcf50d64aff34fb6c533458adf5e Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:53 -0800 Subject: [PATCH 08/29] diff: update the way rewrite diff handles incomplete lines The diff_symbol based output framework uses one DIFF_SYMBOL_* enum value per the kind of output lines of "git diff", which corresponds to one output line from the xdiff machinery used internally. Most notably, DIFF_SYMBOL_PLUS and DIFF_SYMBOL_MINUS that correspond to "+" and "-" lines are designed to always take a complete line, even if the output from xdiff machinery may produce "\ No newline at the end of file" immediately after them. But this is not true in the rewrite-diff codepath, which completely bypasses the xdiff machinery. Since the code path feeds the bytes directly from the payload to the output routines, the output layer has to deal with an incomplete line with DIFF_SYMBOL_PLUS and DIFF_SYMBOL_MINUS, which never would see an incomplete line in the normal code paths. This lack of final newline is compensated by an ugly hack for a fabricated DIFF_SYMBOL_NO_LF_EOF token to inject an extra newline to the output to simulate output coming from the xdiff machinery. Revamp the way the complete-rewrite code path feeds the lines to the output layer by treating the last line of the pre/post image when it is an incomplete line specially. This lets us remove the DIFF_SYMBOL_NO_LF_EOF hack and use the usual DIFF_SYMBOL_CONTEXT_INCOMPLETE code path, which will later learn how to handle whitespace errors. Signed-off-by: Junio C Hamano --- diff.c | 37 ++++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 15 deletions(-) diff --git a/diff.c b/diff.c index 44b86544b75f97..5c606409bb9cbe 100644 --- a/diff.c +++ b/diff.c @@ -797,7 +797,6 @@ enum diff_symbol { DIFF_SYMBOL_CONTEXT_INCOMPLETE, DIFF_SYMBOL_PLUS, DIFF_SYMBOL_MINUS, - DIFF_SYMBOL_NO_LF_EOF, DIFF_SYMBOL_CONTEXT_FRAGINFO, DIFF_SYMBOL_CONTEXT_MARKER, DIFF_SYMBOL_SEPARATOR @@ -1352,7 +1351,6 @@ static void emit_line_ws_markup(struct diff_options *o, static void emit_diff_symbol_from_struct(struct diff_options *o, struct emitted_diff_symbol *eds) { - static const char *nneof = " No newline at end of file\n"; const char *context, *reset, *set, *set_sign, *meta, *fraginfo; enum diff_symbol s = eds->s; @@ -1361,13 +1359,6 @@ static void emit_diff_symbol_from_struct(struct diff_options *o, unsigned flags = eds->flags; switch (s) { - case DIFF_SYMBOL_NO_LF_EOF: - context = diff_get_color_opt(o, DIFF_CONTEXT); - reset = diff_get_color_opt(o, DIFF_RESET); - putc('\n', o->file); - emit_line_0(o, context, NULL, 0, reset, '\\', - nneof, strlen(nneof)); - break; case DIFF_SYMBOL_SUBMODULE_HEADER: case DIFF_SYMBOL_SUBMODULE_ERROR: case DIFF_SYMBOL_SUBMODULE_PIPETHROUGH: @@ -1786,22 +1777,38 @@ static void emit_rewrite_lines(struct emit_callback *ecbdata, const char *endp = NULL; while (0 < size) { - int len; + int len, plen; + char *pdata = NULL; endp = memchr(data, '\n', size); - len = endp ? (endp - data + 1) : size; + + if (endp) { + len = endp - data + 1; + plen = len; + } else { + len = size; + plen = len + 1; + pdata = xmalloc(plen + 2); + memcpy(pdata, data, len); + pdata[len] = '\n'; + pdata[len + 1] = '\0'; + } if (prefix != '+') { ecbdata->lno_in_preimage++; - emit_del_line(ecbdata, data, len); + emit_del_line(ecbdata, pdata ? pdata : data, plen); } else { ecbdata->lno_in_postimage++; - emit_add_line(ecbdata, data, len); + emit_add_line(ecbdata, pdata ? pdata : data, plen); } + free(pdata); size -= len; data += len; } - if (!endp) - emit_diff_symbol(ecbdata->opt, DIFF_SYMBOL_NO_LF_EOF, NULL, 0, 0); + if (!endp) { + static const char nneof[] = "\\ No newline at end of file\n"; + ecbdata->last_line_kind = prefix; + emit_incomplete_line_marker(ecbdata, nneof, sizeof(nneof) - 1); + } } static void emit_rewrite_diff(const char *name_a, From 3a4eb5ad2e9166255d5921196470710523f24ec4 Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:54 -0800 Subject: [PATCH 09/29] apply: revamp the parsing of incomplete lines A patch file represents the incomplete line at the end of the file with two lines, one that is the usual "context" with " " as the first letter, "added" with "+" as the first letter, or "removed" with "-" as the first letter that shows the content of the line, plus an extra "\ No newline at the end of file" line that comes immediately after it. Ever since the apply machinery was written, the "git apply" machinery parses "\ No newline at the end of file" line independently, without even knowing what line the incomplete-ness applies to, simply because it does not even remember what the previous line was. This poses a problem if we want to check and warn on an incomplete line. Revamp the code that parses a fragment, to actually drop the '\n' at the end of the incoming patch file that terminates a line, so that check_whitespace() calls made from the code path actually sees an incomplete as incomplete. Note that the result of this parsing is not directly used by the code path that applies the patch. apply_one_fragment() function already checks if each of the patch text it handles is followed by a line that begins with a backslash to drop the newline at the end of the current line it is looking at. In a sense, this patch harmonizes the behaviour of the parsing side to what is already done in the application side. Signed-off-by: Junio C Hamano --- apply.c | 70 ++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 49 insertions(+), 21 deletions(-) diff --git a/apply.c b/apply.c index a2ceb3fb40d3b5..2b0f8bdab55463 100644 --- a/apply.c +++ b/apply.c @@ -1670,6 +1670,35 @@ static void check_old_for_crlf(struct patch *patch, const char *line, int len) } +/* + * Just saw a single line in a fragment. If it is a part of this hunk + * that is a context " ", an added "+", or a removed "-" line, it may + * be followed by "\\ No newline..." to signal that the last "\n" on + * this line needs to be dropped. Depending on locale settings when + * the patch was produced we don't know what this line would exactly + * say. The only thing we do know is that it begins with "\ ". + * Checking for 12 is just for sanity check; "\ No newline..." would + * be at least that long in any l10n. + * + * Return 0 if the line we saw is not followed by "\ No newline...", + * or length of that line. The caller will use it to skip over the + * "\ No newline..." line. + */ +static int adjust_incomplete(const char *line, int len, + unsigned long size) +{ + int nextlen; + + if (*line != '\n' && *line != ' ' && *line != '+' && *line != '-') + return 0; + if (size - len < 12 || memcmp(line + len, "\\ ", 2)) + return 0; + nextlen = linelen(line + len, size - len); + if (nextlen < 12) + return 0; + return nextlen; +} + /* * Parse a unified diff. Note that this really needs to parse each * fragment separately, since the only way to know the difference @@ -1684,6 +1713,7 @@ static int parse_fragment(struct apply_state *state, { int added, deleted; int len = linelen(line, size), offset; + int skip_len = 0; unsigned long oldlines, newlines; unsigned long leading, trailing; @@ -1710,6 +1740,22 @@ static int parse_fragment(struct apply_state *state, len = linelen(line, size); if (!len || line[len-1] != '\n') return -1; + + /* + * For an incomplete line, skip_len counts the bytes + * on "\\ No newline..." marker line that comes next + * to the current line. + * + * Reduce "len" to drop the newline at the end of + * line[], but add one to "skip_len", which will be + * added back to "len" for the next iteration, to + * compensate. + */ + skip_len = adjust_incomplete(line, len, size); + if (skip_len) { + len--; + skip_len++; + } switch (*line) { default: return -1; @@ -1745,20 +1791,10 @@ static int parse_fragment(struct apply_state *state, newlines--; trailing = 0; break; - - /* - * We allow "\ No newline at end of file". Depending - * on locale settings when the patch was produced we - * don't know what this line looks like. The only - * thing we do know is that it begins with "\ ". - * Checking for 12 is just for sanity check -- any - * l10n of "\ No newline..." is at least that long. - */ - case '\\': - if (len < 12 || memcmp(line, "\\ ", 2)) - return -1; - break; } + + /* eat the "\\ No newline..." as well, if exists */ + len += skip_len; } if (oldlines || newlines) return -1; @@ -1768,14 +1804,6 @@ static int parse_fragment(struct apply_state *state, fragment->leading = leading; fragment->trailing = trailing; - /* - * If a fragment ends with an incomplete line, we failed to include - * it in the above loop because we hit oldlines == newlines == 0 - * before seeing it. - */ - if (12 < size && !memcmp(line, "\\ ", 2)) - offset += linelen(line, size); - patch->lines_added += added; patch->lines_deleted += deleted; From a675104c399d242dd3ff5a0823fcd770563cf60f Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:55 -0800 Subject: [PATCH 10/29] whitespace: allocate a few more bits and define WS_INCOMPLETE_LINE Reserve a few more bits in the diff flags word to be used for future whitespace rules. Add WS_INCOMPLETE_LINE without implementing the behaviour (yet). Signed-off-by: Junio C Hamano --- Documentation/config/core.adoc | 2 ++ diff.c | 16 ++++++++-------- diff.h | 6 +++--- ws.c | 6 ++++++ ws.h | 3 ++- 5 files changed, 21 insertions(+), 12 deletions(-) diff --git a/Documentation/config/core.adoc b/Documentation/config/core.adoc index e2de270c869c77..682fb595fb096c 100644 --- a/Documentation/config/core.adoc +++ b/Documentation/config/core.adoc @@ -626,6 +626,8 @@ core.whitespace:: part of the line terminator, i.e. with it, `trailing-space` does not trigger if the character before such a carriage-return is not a whitespace (not enabled by default). +* `incomplete-line` treats the last line of a file that is missing the + newline at the end as an error (not enabled by default). * `tabwidth=` tells how many character positions a tab occupies; this is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent` errors. The default tab width is 8. Allowed values are 1 to 63. diff --git a/diff.c b/diff.c index 5c606409bb9cbe..1b27b15f846333 100644 --- a/diff.c +++ b/diff.c @@ -804,15 +804,15 @@ enum diff_symbol { /* * Flags for content lines: - * 0..11 are whitespace rules (see ws.h) - * 12..14 are WSEH_NEW | WSEH_CONTEXT | WSEH_OLD - * 16 is marking if the line is blank at EOF - * 17..19 are used for color-moved. + * 0..15 are whitespace rules (see ws.h) + * 16..18 are WSEH_NEW | WSEH_CONTEXT | WSEH_OLD + * 19 is marking if the line is blank at EOF + * 20..22 are used for color-moved. */ -#define DIFF_SYMBOL_CONTENT_BLANK_LINE_EOF (1<<16) -#define DIFF_SYMBOL_MOVED_LINE (1<<17) -#define DIFF_SYMBOL_MOVED_LINE_ALT (1<<18) -#define DIFF_SYMBOL_MOVED_LINE_UNINTERESTING (1<<19) +#define DIFF_SYMBOL_CONTENT_BLANK_LINE_EOF (1<<19) +#define DIFF_SYMBOL_MOVED_LINE (1<<20) +#define DIFF_SYMBOL_MOVED_LINE_ALT (1<<21) +#define DIFF_SYMBOL_MOVED_LINE_UNINTERESTING (1<<22) #define DIFF_SYMBOL_CONTENT_WS_MASK (WSEH_NEW | WSEH_OLD | WSEH_CONTEXT | WS_RULE_MASK) diff --git a/diff.h b/diff.h index cbd355cf50f68e..422658407d4e86 100644 --- a/diff.h +++ b/diff.h @@ -331,9 +331,9 @@ struct diff_options { int ita_invisible_in_index; /* white-space error highlighting */ -#define WSEH_NEW (1<<12) -#define WSEH_CONTEXT (1<<13) -#define WSEH_OLD (1<<14) +#define WSEH_NEW (1<<16) +#define WSEH_CONTEXT (1<<17) +#define WSEH_OLD (1<<18) unsigned ws_error_highlight; const char *prefix; int prefix_length; diff --git a/ws.c b/ws.c index 70acee3337f241..34a7b4fad2840a 100644 --- a/ws.c +++ b/ws.c @@ -26,6 +26,7 @@ static struct whitespace_rule { { "blank-at-eol", WS_BLANK_AT_EOL, 0 }, { "blank-at-eof", WS_BLANK_AT_EOF, 0 }, { "tab-in-indent", WS_TAB_IN_INDENT, 0, 1 }, + { "incomplete-line", WS_INCOMPLETE_LINE, 0, 0 }, }; unsigned parse_whitespace_rule(const char *string) @@ -139,6 +140,11 @@ char *whitespace_error_string(unsigned ws) strbuf_addstr(&err, ", "); strbuf_addstr(&err, "tab in indent"); } + if (ws & WS_INCOMPLETE_LINE) { + if (err.len) + strbuf_addstr(&err, ", "); + strbuf_addstr(&err, "no newline at the end of file"); + } return strbuf_detach(&err, NULL); } diff --git a/ws.h b/ws.h index 23708efb7322ed..06d5cb73f8f88f 100644 --- a/ws.h +++ b/ws.h @@ -15,13 +15,14 @@ struct strbuf; #define WS_CR_AT_EOL (1<<9) #define WS_BLANK_AT_EOF (1<<10) #define WS_TAB_IN_INDENT (1<<11) +#define WS_INCOMPLETE_LINE (1<<12) #define WS_TRAILING_SPACE (WS_BLANK_AT_EOL|WS_BLANK_AT_EOF) #define WS_DEFAULT_RULE (WS_TRAILING_SPACE|WS_SPACE_BEFORE_TAB|8) #define WS_TAB_WIDTH_MASK ((1<<6)-1) /* All WS_* -- when extended, adapt constants defined after diff.c:diff_symbol */ -#define WS_RULE_MASK ((1<<12)-1) +#define WS_RULE_MASK ((1<<16)-1) extern unsigned whitespace_rule_cfg; unsigned whitespace_rule(struct index_state *, const char *); From 9fb15a8e1430b77e2cc771e425ce4f0954ce4777 Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:56 -0800 Subject: [PATCH 11/29] apply: check and fix incomplete lines The final line of a file that lacks the terminating newline at its end is called an incomplete line. In general they are frowned upon for many reasons (imagine concatenating two files with "cat A B" and what happens when A ends in an incomplete line, for example), and text-oriented tools often mishandle such a line. Implement checks in "git apply" for incomplete lines, which is off by default for backward compatibility's sake, so that "git apply --whitespace={fix,warn,error}" can notice, warn against, and fix them. As one of the new test shows, if you modify contents on an incomplete line in the original and leave the resulting line incomplete, it is still considered a whitespace error, the reasoning being that "you'd better fix it while at it if you are making a change on an incomplete line anyway", which may controversial. Signed-off-by: Junio C Hamano --- apply.c | 13 ++- t/t4124-apply-ws-rule.sh | 187 +++++++++++++++++++++++++++++++++++++++ ws.c | 14 +++ 3 files changed, 213 insertions(+), 1 deletion(-) diff --git a/apply.c b/apply.c index 2b0f8bdab55463..c9fb45247d8cd2 100644 --- a/apply.c +++ b/apply.c @@ -1640,6 +1640,14 @@ static void record_ws_error(struct apply_state *state, state->squelch_whitespace_errors < state->whitespace_error) return; + /* + * line[len] for an incomplete line points at the "\n" at the end + * of patch input line, so "%.*s" would drop the last letter on line; + * compensate for it. + */ + if (result & WS_INCOMPLETE_LINE) + len++; + err = whitespace_error_string(result); if (state->apply_verbosity > verbosity_silent) fprintf(stderr, "%s:%d: %s.\n%.*s\n", @@ -1794,7 +1802,10 @@ static int parse_fragment(struct apply_state *state, } /* eat the "\\ No newline..." as well, if exists */ - len += skip_len; + if (skip_len) { + len += skip_len; + state->linenr++; + } } if (oldlines || newlines) return -1; diff --git a/t/t4124-apply-ws-rule.sh b/t/t4124-apply-ws-rule.sh index 485c7d2d124ade..115a0f857906d4 100755 --- a/t/t4124-apply-ws-rule.sh +++ b/t/t4124-apply-ws-rule.sh @@ -556,4 +556,191 @@ test_expect_success 'whitespace check skipped for excluded paths' ' git apply --include=used --stat --whitespace=error sample-i && + (test_write_lines 1 2 3 0 5 && printf 6) >sample2-i && + cat sample-i >target && + git add target && + cat sample2-i >target && + git diff-files -p target >patch && + + cat sample-i >target && + git apply --whitespace=error target && + git apply --whitespace=error --check error && + test_cmp sample-i target && + test_must_be_empty error && + + cat sample2-i >target && + git apply --whitespace=error -R target && + git apply -R --whitespace=error --check error && + test_cmp sample2-i target && + test_must_be_empty error +' + +test_expect_success 'last line made incomplete (error)' ' + test_write_lines 1 2 3 4 5 6 >sample && + (test_write_lines 1 2 3 4 5 && printf 6) >sample-i && + cat sample >target && + git add target && + cat sample-i >target && + git diff-files -p target >patch && + + cat sample >target && + test_must_fail git apply --whitespace=error error && + test_grep "no newline" error && + + cat sample >target && + test_must_fail git apply --whitespace=error --check actual && + test_cmp sample target && + cat >expect <<-\EOF && + :10: no newline at the end of file. + 6 + error: 1 line adds whitespace errors. + EOF + test_cmp expect actual && + + cat sample-i >target && + git apply --whitespace=error -R target && + git apply --whitespace=error --check -R error && + test_cmp sample-i target && + test_must_be_empty error && + + cat sample >target && + git apply --whitespace=fix sample-i && + test_write_lines 1 2 3 4 5 6 >sample && + cat sample-i >target && + git add target && + cat sample >target && + git diff-files -p target >patch && + + cat sample-i >target && + git apply --whitespace=error target && + git apply --whitespace=error --check error && + test_cmp sample-i target && + test_must_be_empty error && + + cat sample >target && + test_must_fail git apply --whitespace=error -R error && + test_grep "no newline" error && + + cat sample >target && + test_must_fail git apply --whitespace=error --check -R actual && + test_cmp sample target && + cat >expect <<-\EOF && + :9: no newline at the end of file. + 6 + error: 1 line adds whitespace errors. + EOF + test_cmp expect actual && + + cat sample >target && + git apply --whitespace=fix -R sample-i && + test_write_lines 1 2 3 4 5 7 >sample3 && + cat sample-i >target && + git add target && + cat sample3 >target && + git diff-files -p target >patch && + + cat sample-i >target && + git apply --whitespace=error target && + git apply --whitespace=error --check error && + test_cmp sample-i target && + test_must_be_empty error && + + cat sample3 >target && + test_must_fail git apply --whitespace=error -R error && + test_grep "no newline" error && + + cat sample3 >target && + test_must_fail git apply --whitespace=error -R --check actual && + test_cmp sample3 target && + cat >expect <<-\EOF && + :9: no newline at the end of file. + 6 + error: 1 line adds whitespace errors. + EOF + test_cmp expect actual && + + cat sample3 >target && + git apply --whitespace=fix -R sample-i && + (test_write_lines 1 2 3 4 5 && printf 7) >sample3-i && + test_write_lines 1 2 3 4 5 6 >sample && + test_write_lines 1 2 3 4 5 7 >sample3 && + cat sample-i >target && + git add target && + cat sample3-i >target && + git diff-files -p target >patch && + + cat sample-i >target && + test_must_fail git apply --whitespace=error error && + test_grep "no newline" error && + + cat sample-i >target && + test_must_fail git apply --whitespace=error --check actual && + test_cmp sample-i target && + cat >expect <<-\EOF && + :11: no newline at the end of file. + 7 + error: 1 line adds whitespace errors. + EOF + test_cmp expect actual && + + cat sample3-i >target && + test_must_fail git apply --whitespace=error -R error && + test_grep "no newline" error && + + cat sample3-i >target && + test_must_fail git apply --whitespace=error --check -R actual && + test_cmp sample3-i target && + cat >expect <<-\EOF && + :9: no newline at the end of file. + 6 + error: 1 line adds whitespace errors. + EOF + test_cmp expect actual && + + cat sample-i >target && + git apply --whitespace=fix target && + git apply --whitespace=fix -R Date: Wed, 12 Nov 2025 14:02:57 -0800 Subject: [PATCH 12/29] diff: highlight and error out on incomplete lines Teach "git diff" to highlight "\ No newline at end of file" message as a whitespace error when incomplete-line whitespace error class is in effect. Thanks to the previous refactoring of complete rewrite code path, we can do this at a single place. Unlike whitespace errors in the payload where we need to annotate in line, possibly using colors, the line that has whitespace problems, we have a dedicated line already that can serve as the error message, so paint it as a whitespace error message. Also teach "git diff --check" to notice incomplete lines as whitespace errors and report when incomplete-line whitespace error class is in effect. Signed-off-by: Junio C Hamano --- diff.c | 29 +++++++++++++++-- t/t4015-diff-whitespace.sh | 67 +++++++++++++++++++++++++++++++++++--- 2 files changed, 90 insertions(+), 6 deletions(-) diff --git a/diff.c b/diff.c index 1b27b15f846333..7b7cd50dc24351 100644 --- a/diff.c +++ b/diff.c @@ -1370,7 +1370,11 @@ static void emit_diff_symbol_from_struct(struct diff_options *o, emit_line(o, "", "", line, len); break; case DIFF_SYMBOL_CONTEXT_INCOMPLETE: - set = diff_get_color_opt(o, DIFF_CONTEXT); + if ((flags & WS_INCOMPLETE_LINE) && + (flags & o->ws_error_highlight)) + set = diff_get_color_opt(o, DIFF_WHITESPACE); + else + set = diff_get_color_opt(o, DIFF_CONTEXT); reset = diff_get_color_opt(o, DIFF_RESET); emit_line(o, set, reset, line, len); break; @@ -1666,8 +1670,14 @@ static void emit_context_line(struct emit_callback *ecbdata, static void emit_incomplete_line_marker(struct emit_callback *ecbdata, const char *line, int len) { + int last_line_kind = ecbdata->last_line_kind; + unsigned flags = (last_line_kind == '+' + ? WSEH_NEW + : last_line_kind == '-' + ? WSEH_OLD + : WSEH_CONTEXT) | ecbdata->ws_rule; emit_diff_symbol(ecbdata->opt, DIFF_SYMBOL_CONTEXT_INCOMPLETE, - line, len, 0); + line, len, flags); } static void emit_hunk_header(struct emit_callback *ecbdata, @@ -3254,6 +3264,7 @@ struct checkdiff_t { struct diff_options *o; unsigned ws_rule; unsigned status; + int last_line_kind; }; static int is_conflict_marker(const char *line, int marker_size, unsigned long len) @@ -3292,6 +3303,7 @@ static void checkdiff_consume_hunk(void *priv, static int checkdiff_consume(void *priv, char *line, unsigned long len) { struct checkdiff_t *data = priv; + int last_line_kind; int marker_size = data->conflict_marker_size; const char *ws = diff_get_color(data->o->use_color, DIFF_WHITESPACE); const char *reset = diff_get_color(data->o->use_color, DIFF_RESET); @@ -3302,6 +3314,8 @@ static int checkdiff_consume(void *priv, char *line, unsigned long len) assert(data->o); line_prefix = diff_line_prefix(data->o); + last_line_kind = data->last_line_kind; + data->last_line_kind = line[0]; if (line[0] == '+') { unsigned bad; data->lineno++; @@ -3324,6 +3338,17 @@ static int checkdiff_consume(void *priv, char *line, unsigned long len) data->o->file, set, reset, ws); } else if (line[0] == ' ') { data->lineno++; + } else if (line[0] == '\\') { + /* no newline at the end of the line */ + if ((data->ws_rule & WS_INCOMPLETE_LINE) && + (last_line_kind == '+')) { + unsigned bad = WS_INCOMPLETE_LINE; + data->status |= bad; + err = whitespace_error_string(bad); + fprintf(data->o->file, "%s%s:%d: %s.\n", + line_prefix, data->filename, data->lineno, err); + free(err); + } } return 0; } diff --git a/t/t4015-diff-whitespace.sh b/t/t4015-diff-whitespace.sh index 9de7f73f42b534..3c8eb02e4f3e64 100755 --- a/t/t4015-diff-whitespace.sh +++ b/t/t4015-diff-whitespace.sh @@ -43,6 +43,53 @@ do ' done +test_expect_success "incomplete line in both pre- and post-image context" ' + (echo foo && echo baz | tr -d "\012") >x && + git add x && + (echo bar && echo baz | tr -d "\012") >x && + git diff x && + git -c core.whitespace=incomplete diff --check x && + git diff -R x && + git -c core.whitespace=incomplete diff -R --check x +' + +test_expect_success "incomplete lines on both pre- and post-image" ' + # The interpretation taken here is "since you are touching + # the line anyway, you would better fix the incomplete line + # while you are at it." but this is debatable. + echo foo | tr -d "\012" >x && + git add x && + echo bar | tr -d "\012" >x && + git diff x && + test_must_fail git -c core.whitespace=incomplete diff --check x >error && + test_grep "no newline at the end of file" error && + git diff -R x && + test_must_fail git -c core.whitespace=incomplete diff -R --check x >error && + test_grep "no newline at the end of file" error +' + +test_expect_success "fix incomplete line in pre-image" ' + echo foo | tr -d "\012" >x && + git add x && + echo bar >x && + git diff x && + git -c core.whitespace=incomplete diff --check x && + git diff -R x && + test_must_fail git -c core.whitespace=incomplete diff -R --check x >error && + test_grep "no newline at the end of file" error +' + +test_expect_success "new incomplete line in post-image" ' + echo foo >x && + git add x && + echo bar | tr -d "\012" >x && + git diff x && + test_must_fail git -c core.whitespace=incomplete diff --check x >error && + test_grep "no newline at the end of file" error && + git diff -R x && + git -c core.whitespace=incomplete diff -R --check x +' + test_expect_success "Ray Lehtiniemi's example" ' cat <<-\EOF >x && do { @@ -1040,7 +1087,8 @@ test_expect_success 'ws-error-highlight test setup' ' { echo "0. blank-at-eol " && echo "1. still-blank-at-eol " && - echo "2. and a new line " + echo "2. and a new line " && + printf "3. and more" } >x && new_hash_x=$(git hash-object x) && after=$(git rev-parse --short "$new_hash_x") && @@ -1050,11 +1098,13 @@ test_expect_success 'ws-error-highlight test setup' ' index $before..$after 100644 --- a/x +++ b/x - @@ -1,2 +1,3 @@ + @@ -1,2 +1,4 @@ 0. blank-at-eol -1. blank-at-eol +1. still-blank-at-eol +2. and a new line + +3. and more + \ No newline at end of file EOF cat >expect.all <<-EOF && @@ -1062,11 +1112,13 @@ test_expect_success 'ws-error-highlight test setup' ' index $before..$after 100644 --- a/x +++ b/x - @@ -1,2 +1,3 @@ + @@ -1,2 +1,4 @@ 0. blank-at-eol -1. blank-at-eol +1. still-blank-at-eol +2. and a new line + +3. and more + \ No newline at end of file EOF cat >expect.none <<-EOF @@ -1074,16 +1126,19 @@ test_expect_success 'ws-error-highlight test setup' ' index $before..$after 100644 --- a/x +++ b/x - @@ -1,2 +1,3 @@ + @@ -1,2 +1,4 @@ 0. blank-at-eol -1. blank-at-eol +1. still-blank-at-eol +2. and a new line + +3. and more + \ No newline at end of file EOF ' test_expect_success 'test --ws-error-highlight option' ' + git config core.whitespace blank-at-eol,incomplete-line && git diff --color --ws-error-highlight=default,old >current.raw && test_decode_color current && @@ -1100,6 +1155,7 @@ test_expect_success 'test --ws-error-highlight option' ' ' test_expect_success 'test diff.wsErrorHighlight config' ' + git config core.whitespace blank-at-eol,incomplete-line && git -c diff.wsErrorHighlight=default,old diff --color >current.raw && test_decode_color current && @@ -1116,6 +1172,7 @@ test_expect_success 'test diff.wsErrorHighlight config' ' ' test_expect_success 'option overrides diff.wsErrorHighlight' ' + git config core.whitespace blank-at-eol,incomplete-line && git -c diff.wsErrorHighlight=none \ diff --color --ws-error-highlight=default,old >current.raw && @@ -1135,6 +1192,8 @@ test_expect_success 'option overrides diff.wsErrorHighlight' ' ' test_expect_success 'detect moved code, complete file' ' + git config core.whitespace blank-at-eol && + git reset --hard && cat <<-\EOF >test.c && #include From 51358a1ede7f4b6b50e4e5a86558af5204691fe0 Mon Sep 17 00:00:00 2001 From: Junio C Hamano Date: Wed, 12 Nov 2025 14:02:58 -0800 Subject: [PATCH 13/29] attr: enable incomplete-line whitespace error for this project Now "git diff --check" and "git apply --whitespace=warn/fix" learned incomplete line is a whitespace error, enable them for this project to prevent patches to add new incomplete lines to our source to both code and documentation files. Signed-off-by: Junio C Hamano --- .gitattributes | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/.gitattributes b/.gitattributes index 32583149c2f927..a8e2950a735677 100644 --- a/.gitattributes +++ b/.gitattributes @@ -1,13 +1,13 @@ * whitespace=!indent,trail,space -*.[ch] whitespace=indent,trail,space diff=cpp -*.sh whitespace=indent,trail,space text eol=lf +*.[ch] whitespace=indent,trail,space,incomplete diff=cpp +*.sh whitespace=indent,trail,space,incomplete text eol=lf *.perl text eol=lf diff=perl *.pl text eof=lf diff=perl *.pm text eol=lf diff=perl *.py text eol=lf diff=python *.bat text eol=crlf CODE_OF_CONDUCT.md -whitespace -/Documentation/**/*.adoc text eol=lf +/Documentation/**/*.adoc text eol=lf whitespace=trail,space,incomplete /command-list.txt text eol=lf /GIT-VERSION-GEN text eol=lf /mergetools/* text eol=lf From 65e8141f051401a101567b95860424d94f42ef39 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:11:56 -0500 Subject: [PATCH 14/29] compat/mmap: mark unused argument in git_munmap() Our mmap compat code emulates mapping by using malloc/free. Our git_munmap() must take a "length" parameter to match the interface of munmap(), but we don't use it (it is up to the allocator to know how big the block is in free()). Let's mark it as UNUSED to avoid complaints from -Wunused-parameter. Otherwise you cannot build with "make DEVELOPER=1 NO_MMAP=1". Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- compat/mmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/compat/mmap.c b/compat/mmap.c index 2fe1c7732eea94..1a118711f7244a 100644 --- a/compat/mmap.c +++ b/compat/mmap.c @@ -38,7 +38,7 @@ void *git_mmap(void *start, size_t length, int prot, int flags, int fd, off_t of return start; } -int git_munmap(void *start, size_t length) +int git_munmap(void *start, size_t length UNUSED) { free(start); return 0; From 4deb882e54152c31bef23f8b33ad38b7bc26d398 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:12:06 -0500 Subject: [PATCH 15/29] pack-bitmap: handle name-hash lookups in incremental bitmaps If a bitmap has a name-hash cache, it is an array of 32-bit integers, one per entry in the bitmap, which we've mmap'd from the .bitmap file. We access it directly like this: if (bitmap_git->hashes) hash = get_be32(bitmap_git->hashes + index_pos); That works for both regular pack bitmaps and for non-incremental midx bitmaps. There is one bitmap_index with one "hashes" array, and index_pos is within its bounds (we do the bounds-checking when we load the bitmap). But for an incremental midx bitmap, we have a linked list of bitmap_index structs, and each one has only its own small slice of the name-hash array. If index_pos refers to an object that is not in the first bitmap_git of the chain, then we'll access memory outside of the bounds of its "hashes" array, and often outside of the mmap. Instead, we should walk through the list until we find the bitmap_index which serves our index_pos, and use its hash (after adjusting index_pos to make it relative to the slice we found). This is exactly what we do elsewhere for incremental midx lookups (like the pack_pos_to_midx() call a few lines above). But we can't use existing helpers like midx_for_object() here, because we're walking through the chain of bitmap_index structs (each of which refers to a midx), not the chain of incremental multi_pack_index structs themselves. The problem is triggered in the test suite, but we don't get a segfault because the out-of-bounds index is too small. The OS typically rounds our mmap up to the nearest page size, so we just end up accessing some extra zero'd memory. Nor do we catch it with ASan, since it doesn't seem to instrument mmaps at all. But if we build with NO_MMAP, then our maps are replaced with heap allocations, which ASan does check. And so: make NO_MMAP=1 SANITIZE=address cd t ./t5334-incremental-multi-pack-index.sh does show the problem (and this patch makes it go away). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- pack-bitmap.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/pack-bitmap.c b/pack-bitmap.c index ac6d62b980c5a8..9c9aa7b67c3197 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -212,6 +212,28 @@ static uint32_t bitmap_num_objects(struct bitmap_index *index) return index->pack->num_objects; } +static uint32_t bitmap_name_hash(struct bitmap_index *index, uint32_t pos) +{ + if (bitmap_is_midx(index)) { + while (index && pos < index->midx->num_objects_in_base) { + ASSERT(bitmap_is_midx(index)); + index = index->base; + } + + if (!index) + BUG("NULL base bitmap for object position: %"PRIu32, pos); + + pos -= index->midx->num_objects_in_base; + if (pos >= index->midx->num_objects) + BUG("out-of-bounds midx bitmap object at %"PRIu32, pos); + } + + if (!index->hashes) + return 0; + + return get_be32(index->hashes + pos); +} + static struct repository *bitmap_repo(struct bitmap_index *bitmap_git) { if (bitmap_is_midx(bitmap_git)) @@ -1726,8 +1748,7 @@ static void show_objects_for_type( pack = bitmap_git->pack; } - if (bitmap_git->hashes) - hash = get_be32(bitmap_git->hashes + index_pos); + hash = bitmap_name_hash(bitmap_git, index_pos); show_reach(&oid, object_type, 0, hash, pack, ofs, payload); } @@ -3083,8 +3104,8 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git, if (oe) { reposition[i] = oe_in_pack_pos(mapping, oe) + 1; - if (bitmap_git->hashes && !oe->hash) - oe->hash = get_be32(bitmap_git->hashes + index_pos); + if (!oe->hash) + oe->hash = bitmap_name_hash(bitmap_git, index_pos); } } From a9990f8ec0e750a68802f7d79d2aa2293c4811b4 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:12:13 -0500 Subject: [PATCH 16/29] Makefile: turn on NO_MMAP when building with ASan Git often uses mmap() to access on-disk files. This leaves a blind spot in our SANITIZE=address builds, since ASan does not seem to handle mmap at all. Nor does the OS notice most out-of-bounds access, since it tends to round up to the nearest page size (so depending on how big the map is, you might have to overrun it by up to 4095 bytes to trigger a segfault). The previous commit demonstrates a memory bug that we missed. We could have made a new test where the out-of-bounds access was much larger, or where the mapped file ended closer to a page boundary. But the point of running the test suite with sanitizers is to catch these problems without having to construct specific tests. Let's enable NO_MMAP for our ASan builds by default, which should give us better coverage. This does increase the memory usage of Git, since we're copying from the filesystem into heap. But the repositories in the test suite tend to be small, so the overhead isn't really noticeable (and ASan already has quite a performance penalty). There are a few other known bugs that this patch will help flush out. However, they aren't directly triggered in the test suite (yet). So it's safe to turn this on now without breaking the test suite, which will help us add new tests to demonstrate those other bugs as we fix them. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- Makefile | 1 + meson.build | 8 +++++++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 70d1543b6b8688..c2d327838a64cb 100644 --- a/Makefile +++ b/Makefile @@ -1518,6 +1518,7 @@ SANITIZE_LEAK = YesCompiledWithIt endif ifneq ($(filter address,$(SANITIZERS)),) NO_REGEX = NeededForASAN +NO_MMAP = NeededForASAN SANITIZE_ADDRESS = YesCompiledWithIt endif endif diff --git a/meson.build b/meson.build index 596f5ac7110ebf..269769b166a8c7 100644 --- a/meson.build +++ b/meson.build @@ -1369,12 +1369,18 @@ if host_machine.system() == 'windows' libgit_c_args += '-DUSE_WIN32_MMAP' else checkfuncs += { - 'mmap' : ['mmap.c'], # provided by compat/mingw.c. 'unsetenv' : ['unsetenv.c'], # provided by compat/mingw.c. 'getpagesize' : [], } + + if get_option('b_sanitize').contains('address') + libgit_c_args += '-DNO_MMAP' + libgit_sources += 'compat/mmap.c' + else + checkfuncs += { 'mmap': ['mmap.c'] } + endif endif foreach func, impls : checkfuncs From c4c9089584d0ed04978e8d0945b2ba2985e67bd3 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:12:18 -0500 Subject: [PATCH 17/29] cache-tree: avoid strtol() on non-string buffer A cache-tree extension entry in the index looks like this: NUL SPACE NEWLINE where the "_nr" items are human-readable base-10 ASCII. We parse them with strtol(), even though we do not have a NUL-terminated string (we'd generally have an mmap() of the on-disk index file). For a well-formed entry, this is not a problem; strtol() will stop when it sees the newline. But there are two problems: 1. A corrupted entry could omit the newline, causing us to read further. You'd mostly get stopped by seeing non-digits in the oid field (and if it is likewise truncated, there will still be 20 or more bytes of the index checksum). So it's possible, though unlikely, to read off the end of the mmap'd buffer. Of course a malicious index file can fake the oid and the index checksum to all (ASCII) 0's. This is further complicated by the fact that mmap'd buffers tend to be zero-padded up to the page boundary. So to run off the end, the index size also has to be a multiple of the page size. This is also unlikely, though you can construct a malicious index file that matches this. The security implications aren't too interesting. The index file is a local file anyway (so you can't attack somebody by cloning, but only if you convince them to operate in a .git directory you made, at which point attacking .git/config is much easier). And it's just a read overflow via strtol(), which is unlikely to buy you much beyond a crash. 2. ASan has a strict_string_checks option, which tells it to make sure that options to string functions (like strtol) have some eventual NUL, without regard to what the function would actually do (like stopping at a newline here). This option sometimes has false positives, but it can point to sketchy areas (like this one) where the input we use doesn't exhibit a problem, but different input _could_ cause us to misbehave. Let's fix it by just parsing the values ourselves with a helper function that is careful not to go past the end of the buffer. There are a few behavior changes here that should not matter: - We do not consider overflow, as strtol() would. But nor did the original code. However, we don't trust the value we get from the on-disk file, and if it says to read 2^30 entries, we would notice that we do not have that many and bail before reading off the end of the buffer. - Our helper does not skip past extra leading whitespace as strtol() would, but according to gitformat-index(5) there should not be any. - The original quit parsing at a newline or a NUL byte, but now we insist on a newline (which is what the documentation says, and what Git has always produced). Since we are providing our own helper function, we can tweak the interface a bit to make our lives easier. The original code does not use strtol's "end" pointer to find the end of the parsed data, but rather uses a separate loop to advance our "buf" pointer to the trailing newline. We can instead provide a helper that advances "buf" as it parses, letting us read strictly left-to-right through the buffer. I didn't add a new test here. It's surprisingly difficult to construct an index of exactly the right size due to the way we pad entries. But it is easy to trigger the problem in existing tests when using ASan's strict string checking, coupled with a recent change to use NO_MMAP with ASan builds. So: make SANITIZE=address cd t ASAN_OPTIONS=strict_string_checks=1 ./t0090-cache-tree.sh triggers it reliably. Technically it is not deterministic because there is ~8% chance (it's 1-(255/256)^20, or ^32 for sha256) that the trailing checksum hash has a NUL byte in it. But we compute enough cache-trees in the course of that script that we are very likely to hit the problem in one of them. We can look at making strict_string_checks the default for ASan builds, but there are some other cases we'd want to fix first. Reported-by: correctmost Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- cache-tree.c | 50 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 37 insertions(+), 13 deletions(-) diff --git a/cache-tree.c b/cache-tree.c index fa3858e2829aa8..2309911dfa0309 100644 --- a/cache-tree.c +++ b/cache-tree.c @@ -548,12 +548,41 @@ void cache_tree_write(struct strbuf *sb, struct cache_tree *root) trace2_region_leave("cache_tree", "write", the_repository); } +static int parse_int(const char **ptr, unsigned long *len_p, int *out) +{ + const char *s = *ptr; + unsigned long len = *len_p; + int ret = 0; + int sign = 1; + + while (len && *s == '-') { + sign *= -1; + s++; + len--; + } + + while (len) { + if (!isdigit(*s)) + break; + ret *= 10; + ret += *s - '0'; + s++; + len--; + } + + if (s == *ptr) + return -1; + + *ptr = s; + *len_p = len; + *out = sign * ret; + return 0; +} + static struct cache_tree *read_one(const char **buffer, unsigned long *size_p) { const char *buf = *buffer; unsigned long size = *size_p; - const char *cp; - char *ep; struct cache_tree *it; int i, subtree_nr; const unsigned rawsz = the_hash_algo->rawsz; @@ -569,19 +598,14 @@ static struct cache_tree *read_one(const char **buffer, unsigned long *size_p) buf++; size--; it = cache_tree(); - cp = buf; - it->entry_count = strtol(cp, &ep, 10); - if (cp == ep) + if (parse_int(&buf, &size, &it->entry_count) < 0) goto free_return; - cp = ep; - subtree_nr = strtol(cp, &ep, 10); - if (cp == ep) + if (!size || *buf != ' ') goto free_return; - while (size && *buf && *buf != '\n') { - size--; - buf++; - } - if (!size) + buf++; size--; + if (parse_int(&buf, &size, &subtree_nr) < 0) + goto free_return; + if (!size || *buf != '\n') goto free_return; buf++; size--; if (0 <= it->entry_count) { From 0b6ec075df2ac77a4792b8b1a7290a36b636012b Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:12:20 -0500 Subject: [PATCH 18/29] fsck: assert newline presence in fsck_ident() The fsck code purports to handle buffers that are not NUL-terminated, but fsck_ident() uses some string functions. This works OK in practice, as explained in 8e4309038f (fsck: do not assume NUL-termination of buffers, 2023-01-19). Before calling fsck_ident() we'll have called verify_headers(), which makes sure we have at least a trailing newline. And none of our string-like functions will walk past that newline. However, that makes this code at the top of fsck_ident() very confusing: *ident = strchrnul(*ident, '\n'); if (**ident == '\n') (*ident)++; We should always see that newline, or our memory safety assumptions have been violated! Further, using strchrnul() is weird, since the whole point is that if the newline is not there, we don't necessarily have a NUL at all, and might read off the end of the buffer. So let's have callers pass in the boundary of our buffer, which lets us safely find the newline with memchr(). And if it is not there, this is a BUG(), because it means our caller did not validate the input with verify_headers() as it was supposed to (and we are better off bailing rather than having memory-safety problems). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- fsck.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/fsck.c b/fsck.c index 8dc8472ceb3781..3d1027dc877424 100644 --- a/fsck.c +++ b/fsck.c @@ -859,16 +859,18 @@ static int verify_headers(const void *data, unsigned long size, FSCK_MSG_UNTERMINATED_HEADER, "unterminated header"); } -static int fsck_ident(const char **ident, +static int fsck_ident(const char **ident, const char *ident_end, const struct object_id *oid, enum object_type type, struct fsck_options *options) { const char *p = *ident; + const char *nl; char *end; - *ident = strchrnul(*ident, '\n'); - if (**ident == '\n') - (*ident)++; + nl = memchr(p, '\n', ident_end - p); + if (!nl) + BUG("verify_headers() should have made sure we have a newline"); + *ident = nl + 1; if (*p == '<') return report(options, oid, type, FSCK_MSG_MISSING_NAME_BEFORE_EMAIL, "invalid author/committer line - missing space before email"); @@ -957,7 +959,7 @@ static int fsck_commit(const struct object_id *oid, author_count = 0; while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) { author_count++; - err = fsck_ident(&buffer, oid, OBJ_COMMIT, options); + err = fsck_ident(&buffer, buffer_end, oid, OBJ_COMMIT, options); if (err) return err; } @@ -969,7 +971,7 @@ static int fsck_commit(const struct object_id *oid, return err; if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer)) return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line"); - err = fsck_ident(&buffer, oid, OBJ_COMMIT, options); + err = fsck_ident(&buffer, buffer_end, oid, OBJ_COMMIT, options); if (err) return err; if (memchr(buffer_begin, '\0', size)) { @@ -1064,7 +1066,7 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer, goto done; } else - ret = fsck_ident(&buffer, oid, OBJ_TAG, options); + ret = fsck_ident(&buffer, buffer_end, oid, OBJ_TAG, options); if (buffer < buffer_end && !starts_with(buffer, "\n")) { /* From 830424def4896dbf041d41dad873f5b86fdf9bfa Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:12:23 -0500 Subject: [PATCH 19/29] fsck: avoid strcspn() in fsck_ident() We may be operating on a buffer that is not NUL-terminated, but we use strcspn() to parse it. This is OK in practice, as discussed in 8e4309038f (fsck: do not assume NUL-termination of buffers, 2023-01-19), because we know there is at least a trailing newline in our buffer, and we always pass "\n" to strcspn(). So we know it will stop before running off the end of the buffer. But this is a subtle point to hang our memory safety hat on. And it confuses ASan's strict_string_checks mode, even though it is technically a false positive (that mode complains that we have no NUL, which is true, but it does not know that we have verified the presence of the newline already). Let's instead open-code the loop. As a bonus, this makes the logic more obvious (to my mind, anyway). The current code skips forward with strcspn until it hits "<", ">", or "\n". But then it must check which it saw to decide if that was what we expected or not, duplicating some logic between what's in the strcspn() and what's in the domain logic. Instead, we can just check each character as we loop and act on it immediately. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- fsck.c | 32 ++++++++++++++++++++++---------- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/fsck.c b/fsck.c index 3d1027dc877424..3696f1b849ad9f 100644 --- a/fsck.c +++ b/fsck.c @@ -874,18 +874,30 @@ static int fsck_ident(const char **ident, const char *ident_end, if (*p == '<') return report(options, oid, type, FSCK_MSG_MISSING_NAME_BEFORE_EMAIL, "invalid author/committer line - missing space before email"); - p += strcspn(p, "<>\n"); - if (*p == '>') - return report(options, oid, type, FSCK_MSG_BAD_NAME, "invalid author/committer line - bad name"); - if (*p != '<') - return report(options, oid, type, FSCK_MSG_MISSING_EMAIL, "invalid author/committer line - missing email"); + for (;;) { + if (p >= ident_end || *p == '\n') + return report(options, oid, type, FSCK_MSG_MISSING_EMAIL, "invalid author/committer line - missing email"); + if (*p == '>') + return report(options, oid, type, FSCK_MSG_BAD_NAME, "invalid author/committer line - bad name"); + if (*p == '<') + break; /* end of name, beginning of email */ + + /* otherwise, skip past arbitrary name char */ + p++; + } if (p[-1] != ' ') return report(options, oid, type, FSCK_MSG_MISSING_SPACE_BEFORE_EMAIL, "invalid author/committer line - missing space before email"); - p++; - p += strcspn(p, "<>\n"); - if (*p != '>') - return report(options, oid, type, FSCK_MSG_BAD_EMAIL, "invalid author/committer line - bad email"); - p++; + p++; /* skip past '<' we found */ + for (;;) { + if (p >= ident_end || *p == '<' || *p == '\n') + return report(options, oid, type, FSCK_MSG_BAD_EMAIL, "invalid author/committer line - bad email"); + if (*p == '>') + break; /* end of email */ + + /* otherwise, skip past arbitrary email char */ + p++; + } + p++; /* skip past '>' we found */ if (*p != ' ') return report(options, oid, type, FSCK_MSG_MISSING_SPACE_BEFORE_DATE, "invalid author/committer line - missing space before date"); p++; From f05df7ffca492b37d604ad6beed788055eb56ebd Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:12:25 -0500 Subject: [PATCH 20/29] fsck: remove redundant date timestamp check After calling "parse_timestamp(p, &end, 10)", we complain if "p == end", which would imply that we did not see any digits at all. But we know this cannot be the case, since we would have bailed already if we did not see any digits, courtesy of extra checks added by 8e4309038f (fsck: do not assume NUL-termination of buffers, 2023-01-19). Since then, checking "p == end" is redundant and we can drop it. This will make our lives a little easier as we refactor further. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- fsck.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fsck.c b/fsck.c index 3696f1b849ad9f..98b16a9e584060 100644 --- a/fsck.c +++ b/fsck.c @@ -919,7 +919,7 @@ static int fsck_ident(const char **ident, const char *ident_end, return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date"); if (date_overflows(parse_timestamp(p, &end, 10))) return report(options, oid, type, FSCK_MSG_BAD_DATE_OVERFLOW, "invalid author/committer line - date causes integer overflow"); - if ((end == p || *end != ' ')) + if (*end != ' ') return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date"); p = end + 1; if ((*p != '+' && *p != '-') || From 5a993593b24df699f60841296795f9a6ca60d399 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:12:28 -0500 Subject: [PATCH 21/29] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated In fsck_ident(), we parse the timestamp with parse_timestamp(), which is really an alias for strtoumax(). But since our buffer may not be NUL-terminated, this can trigger a complaint from ASan's strict_string_checks mode. This is a false positive, since we know that the buffer contains a trailing newline (which we checked earlier in the function), and that strtoumax() would stop there. But it is worth working around ASan's complaint. One is because that will let us turn on strict_string_checks by default, which has helped catch other real problems. And two is that the safety of the current code is very hard to reason about (it subtly depends on distant code which could change). One option here is to just parse the number left-to-right ourselves. But we care about the size of a timestamp_t and detecting overflow, since that's part of the point of these checks. And doing that correctly is tricky. So we'll instead just pull the digits into a separate, NUL-terminated buffer, and use that to call parse_timestamp(). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- fsck.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/fsck.c b/fsck.c index 98b16a9e584060..ec45f786d6ed74 100644 --- a/fsck.c +++ b/fsck.c @@ -859,13 +859,28 @@ static int verify_headers(const void *data, unsigned long size, FSCK_MSG_UNTERMINATED_HEADER, "unterminated header"); } +static timestamp_t parse_timestamp_from_buf(const char **start, const char *end) +{ + const char *p = *start; + char buf[24]; /* big enough for 2^64 */ + size_t i = 0; + + while (p < end && isdigit(*p)) { + if (i >= ARRAY_SIZE(buf) - 1) + return TIME_MAX; + buf[i++] = *p++; + } + buf[i] = '\0'; + *start = p; + return parse_timestamp(buf, NULL, 10); +} + static int fsck_ident(const char **ident, const char *ident_end, const struct object_id *oid, enum object_type type, struct fsck_options *options) { const char *p = *ident; const char *nl; - char *end; nl = memchr(p, '\n', ident_end - p); if (!nl) @@ -917,11 +932,11 @@ static int fsck_ident(const char **ident, const char *ident_end, "invalid author/committer line - bad date"); if (*p == '0' && p[1] != ' ') return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date"); - if (date_overflows(parse_timestamp(p, &end, 10))) + if (date_overflows(parse_timestamp_from_buf(&p, ident_end))) return report(options, oid, type, FSCK_MSG_BAD_DATE_OVERFLOW, "invalid author/committer line - date causes integer overflow"); - if (*end != ' ') + if (*p != ' ') return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date"); - p = end + 1; + p++; if ((*p != '+' && *p != '-') || !isdigit(p[1]) || !isdigit(p[2]) || From a031b6181a1e1ee6768d19d6a03b031b6e9004e9 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 18 Nov 2025 04:12:30 -0500 Subject: [PATCH 22/29] t: enable ASan's strict_string_checks option ASan has an option to enable strict string checking, where any pointer passed to a function that expects a NUL-terminated string will be checked for that NUL termination. This can sometimes produce false positives. E.g., it is not wrong to pass a buffer with { '1', '2', '\n' } into strtoul(). Even though it is not NUL-terminated, it will stop at the newline. But in trying it out, it identified two problematic spots in our test suite (which have now been adjusted): 1. The strtol() parsing in cache-tree.c was a real potential problem, which would have been very hard to find otherwise (since it required constructing a very specific broken index file). 2. The use of string functions in fsck_ident() were false positives, because we knew that there was always a trailing newline which would stop the functions from reading off the end of the buffer. But the reasoning behind that is somewhat fragile, and silencing those complaints made the code easier to reason about. So even though this did not find any earth-shattering bugs, and even had a few false positives, I'm sufficiently convinced that its complaints are more helpful than hurtful. Let's turn it on by default (since the test suite now runs cleanly with it) and see if it ever turns up any other instances. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- t/test-lib.sh | 1 + 1 file changed, 1 insertion(+) diff --git a/t/test-lib.sh b/t/test-lib.sh index 92d0db13d7429d..bbda7abb16c1c4 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -77,6 +77,7 @@ prepend_var GIT_SAN_OPTIONS : strip_path_prefix="$GIT_BUILD_DIR/" # want that one to complain to stderr). prepend_var ASAN_OPTIONS : $GIT_SAN_OPTIONS prepend_var ASAN_OPTIONS : detect_leaks=0 +prepend_var ASAN_OPTIONS : strict_string_checks=1 export ASAN_OPTIONS prepend_var LSAN_OPTIONS : $GIT_SAN_OPTIONS From fd7d79d068dd14a4d7a4a93f7bfd31cf24020aec Mon Sep 17 00:00:00 2001 From: Lucas Seiki Oshiro Date: Tue, 18 Nov 2025 17:37:03 -0300 Subject: [PATCH 23/29] repo: factor out field printing to dedicated function Move the field printing in git-repo-info to a new function called `print_field`, allowing it to be called by functions other than `print_fields`. Also change its use of quote_c_style() helper to output directly to the standard output stream, instead of taking a result in a strbuf and then printing it outselves. Signed-off-by: Lucas Seiki Oshiro Signed-off-by: Junio C Hamano --- builtin/repo.c | 34 ++++++++++++++++++---------------- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/builtin/repo.c b/builtin/repo.c index 9d4749f79befa8..f9fb4184940e2e 100644 --- a/builtin/repo.c +++ b/builtin/repo.c @@ -85,13 +85,29 @@ static get_value_fn *get_value_fn_for_key(const char *key) return found ? found->get_value : NULL; } +static void print_field(enum output_format format, const char *key, + const char *value) +{ + switch (format) { + case FORMAT_KEYVALUE: + printf("%s=", key); + quote_c_style(value, NULL, stdout, 0); + putchar('\n'); + break; + case FORMAT_NUL_TERMINATED: + printf("%s\n%s%c", key, value, '\0'); + break; + default: + BUG("not a valid output format: %d", format); + } +} + static int print_fields(int argc, const char **argv, struct repository *repo, enum output_format format) { int ret = 0; struct strbuf valbuf = STRBUF_INIT; - struct strbuf quotbuf = STRBUF_INIT; for (int i = 0; i < argc; i++) { get_value_fn *get_value; @@ -105,25 +121,11 @@ static int print_fields(int argc, const char **argv, } strbuf_reset(&valbuf); - strbuf_reset("buf); - get_value(repo, &valbuf); - - switch (format) { - case FORMAT_KEYVALUE: - quote_c_style(valbuf.buf, "buf, NULL, 0); - printf("%s=%s\n", key, quotbuf.buf); - break; - case FORMAT_NUL_TERMINATED: - printf("%s\n%s%c", key, valbuf.buf, '\0'); - break; - default: - BUG("not a valid output format: %d", format); - } + print_field(format, key, valbuf.buf); } strbuf_release(&valbuf); - strbuf_release("buf); return ret; } From 155caac7d1fa981b21192c598cf9bbffdb5aea12 Mon Sep 17 00:00:00 2001 From: Lucas Seiki Oshiro Date: Tue, 18 Nov 2025 17:37:04 -0300 Subject: [PATCH 24/29] repo: add --all to git-repo-info Add a new flag `--all` to git-repo-info for requesting values for all the available keys. By using this flag, the user can retrieve all the values instead of searching what are the desired keys for what they wants. Helped-by: Karthik Nayak Helped-by: Patrick Steinhardt Signed-off-by: Lucas Seiki Oshiro Signed-off-by: Junio C Hamano --- Documentation/git-repo.adoc | 6 +++--- builtin/repo.c | 29 +++++++++++++++++++++++++++-- t/t1900-repo.sh | 21 +++++++++++++++++++++ 3 files changed, 51 insertions(+), 5 deletions(-) diff --git a/Documentation/git-repo.adoc b/Documentation/git-repo.adoc index ce43cb19c8b03c..70f0a6d2e47291 100644 --- a/Documentation/git-repo.adoc +++ b/Documentation/git-repo.adoc @@ -8,7 +8,7 @@ git-repo - Retrieve information about the repository SYNOPSIS -------- [synopsis] -git repo info [--format=(keyvalue|nul)] [-z] [...] +git repo info [--format=(keyvalue|nul)] [-z] [--all | ...] git repo structure [--format=(table|keyvalue|nul)] DESCRIPTION @@ -19,13 +19,13 @@ THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE. COMMANDS -------- -`info [--format=(keyvalue|nul)] [-z] [...]`:: +`info [--format=(keyvalue|nul)] [-z] [--all | ...]`:: Retrieve metadata-related information about the current repository. Only the requested data will be returned based on their keys (see "INFO KEYS" section below). + The values are returned in the same order in which their respective keys were -requested. +requested. The `--all` flag requests the values for all the available keys. + The output format can be chosen through the flag `--format`. Two formats are supported: diff --git a/builtin/repo.c b/builtin/repo.c index f9fb4184940e2e..e30e2416d4f59d 100644 --- a/builtin/repo.c +++ b/builtin/repo.c @@ -15,7 +15,7 @@ #include "utf8.h" static const char *const repo_usage[] = { - "git repo info [--format=(keyvalue|nul)] [-z] [...]", + "git repo info [--format=(keyvalue|nul)] [-z] [--all | ...]", "git repo structure [--format=(table|keyvalue|nul)]", NULL }; @@ -129,6 +129,23 @@ static int print_fields(int argc, const char **argv, return ret; } +static int print_all_fields(struct repository *repo, + enum output_format format) +{ + struct strbuf valbuf = STRBUF_INIT; + + for (size_t i = 0; i < ARRAY_SIZE(repo_info_fields); i++) { + const struct field *field = &repo_info_fields[i]; + + strbuf_reset(&valbuf); + field->get_value(repo, &valbuf); + print_field(format, field->key, valbuf.buf); + } + + strbuf_release(&valbuf); + return 0; +} + static int parse_format_cb(const struct option *opt, const char *arg, int unset UNUSED) { @@ -152,6 +169,7 @@ static int cmd_repo_info(int argc, const char **argv, const char *prefix, struct repository *repo) { enum output_format format = FORMAT_KEYVALUE; + int all_keys = 0; struct option options[] = { OPT_CALLBACK_F(0, "format", &format, N_("format"), N_("output format"), @@ -160,6 +178,7 @@ static int cmd_repo_info(int argc, const char **argv, const char *prefix, N_("synonym for --format=nul"), PARSE_OPT_NONEG | PARSE_OPT_NOARG, parse_format_cb), + OPT_BOOL(0, "all", &all_keys, N_("print all keys/values")), OPT_END() }; @@ -167,7 +186,13 @@ static int cmd_repo_info(int argc, const char **argv, const char *prefix, if (format != FORMAT_KEYVALUE && format != FORMAT_NUL_TERMINATED) die(_("unsupported output format")); - return print_fields(argc, argv, repo, format); + if (all_keys && argc) + die(_("--all and cannot be used together")); + + if (all_keys) + return print_all_fields(repo, format); + else + return print_fields(argc, argv, repo, format); } struct ref_stats { diff --git a/t/t1900-repo.sh b/t/t1900-repo.sh index 2beba67889af25..51d55f11a5ed66 100755 --- a/t/t1900-repo.sh +++ b/t/t1900-repo.sh @@ -4,6 +4,15 @@ test_description='test git repo-info' . ./test-lib.sh +# git-repo-info keys. It must contain the same keys listed in the const +# repo_info_fields, in lexicographical order. +REPO_INFO_KEYS=' + layout.bare + layout.shallow + object.format + references.format +' + # Test whether a key-value pair is correctly returned # # Usage: test_repo_info