Skip to content

improve performance by only looking up git information once#191

Merged
CreatorHead merged 11 commits into
mainfrom
fivetanley/performance-updates
Feb 12, 2026
Merged

improve performance by only looking up git information once#191
CreatorHead merged 11 commits into
mainfrom
fivetanley/performance-updates

Conversation

@fivetanley
Copy link
Copy Markdown
Collaborator

@fivetanley fivetanley commented Feb 5, 2026

🛠️ Description

  • use filepath.walkDir instead of filepath.Walk. filepath.walkDir can be faster than Walk due to not calling os.Lstat on every file. This allows us to call os.Lstat from the goroutines in parallel instead of always blocking in the main thread. We also avoid calling os.Lstat on files in the ignore pattern
  • Only calculate the repository's first commit year once per run, instead of on every file lookup
  • Pass repoRoot to functions instead of calculating it every single time
  • Use git log across the repository to grab the first commit year and build an index of [fileName] = yearOfLastCommit. This prevents us from calling git log on every single file. Since git log --reverse will traverse the whole git tree anyway, this prevents us from having to iterate the entire git log for every single file.
  • ignore .git/**/*.pack files by default

Improvements:

On https://github.com/hashicorp/cloud-ui, this:

  • takes the runtime down from 11.6 minutes (701.80 seconds) to about 5s first run , 4s on subsequent runs (it runs faster on subsequent runs due to OS filesystem caching)
  • uses a lot less memory. Before: ~5.30gb, After: ~50mb

🔗 External Links

👍 Definition of Done

  • New functionality works?
  • Tests added? - no new tests, but existing tests pass

🤔 Can be merged upon approval?

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

@fivetanley fivetanley requested a review from a team as a code owner February 5, 2026 17:29
Comment thread cmd/headers.go Outdated

if !dryRun {
updated, err := licensecheck.UpdateCopyrightHeader(path, targetHolder, configYear, false)
updated, err := licensecheck.UpdateCopyrightHeader(path, targetHolder, configYear, false, repoFirstYear, repoRoot)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any external code using these exported functions will break. This is a major version bump. Is it possible to mitigate it by adding wrapper functions for backward compatibility.

Comment thread licensecheck/update.go Outdated
Comment thread cmd/headers.go
Comment thread licensecheck/update.go
once.Do(func() {
cache, firstYear, err := buildRepositoryCache(repoRoot)
if err != nil {
lastCommitYearsCache = make(map[string]int)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know your thoughts: When git cache fails, the tool still works but uses lastCommitYear=0 for all files, falling back to config-based years. This is probably fine, but:

  • We may need to log it for user (debugging becomes easier)
  • Might mask real problems like (git installed but failing, etc)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the logic that was present (see 376 in the pervious version of this file) so I was just maintaining what was there. Happy to add a new log line if that's helpful

Comment thread cmd/headers.go
Comment on lines +192 to +195
repoFirstYear, _ := licensecheck.GetRepoFirstCommitYear(".")

// Open git repository once for all file operations
repoRoot, _ := licensecheck.GetRepoRoot(".")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we run this from a subdirectory, or across multiple repos, will it still work correctly? Since the cache would only apply to the parent repo. Please let me know if my understanding is correct.

Copy link
Copy Markdown
Collaborator Author

@fivetanley fivetanley Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, GetRepoRoot remains unchanged. So if run within a subdirectory, it's shelling out to git to find the root anyway. This function is not cached.

Comment thread cmd/headers.go
".github/dependabot.yml",
"**/node_modules/**",
".copywrite.hcl",
".git/**/*.pack",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we shouldn't skip the entire .git folder? I cannot think of a good reason not to.

I'm surprised this was not ignored already to be honest and we accidentally didn't cause some random files there to get updated. 😅

@fivetanley fivetanley force-pushed the fivetanley/performance-updates branch from af510fb to 80c40af Compare February 11, 2026 22:03
@CreatorHead CreatorHead merged commit 42a6ce3 into main Feb 12, 2026
5 checks passed
@CreatorHead CreatorHead deleted the fivetanley/performance-updates branch February 12, 2026 04:51
CreatorHead added a commit that referenced this pull request Feb 24, 2026
Two bugs were introduced in #191 (42a6ce3) that caused copyright year
updates to never be applied:

1. buildRepositoryCache used '--format=format:%ad' which outputs bare
   years (e.g. '2026'), but the parser expected lines prefixed with
   '__CW_YEAR__=' to distinguish years from filenames. This meant
   lastCommitYear was always 0, so the condition 'lastCommitYear >
   info.EndYear' was never true and end years were never bumped.
   Fix: change format to '--format=format:__CW_YEAR__=%ad'.

2. getFileLastCommitYear computed the relative path via
   filepath.Rel(repoRoot, absPath), but on macOS /tmp is a symlink to
   /private/tmp. git resolves the real path for repoRoot while
   filepath.Abs preserves the symlink, causing a path mismatch and a
   cache miss on every file.
   Fix: call filepath.EvalSymlinks on absPath before computing the
   relative path. This is a no-op on Linux and Windows where symlinks
   are not involved.
CreatorHead added a commit that referenced this pull request Feb 26, 2026
…#195)

Two bugs were introduced in #191 (42a6ce3) that caused copyright year
updates to never be applied:

1. buildRepositoryCache used '--format=format:%ad' which outputs bare
   years (e.g. '2026'), but the parser expected lines prefixed with
   '__CW_YEAR__=' to distinguish years from filenames. This meant
   lastCommitYear was always 0, so the condition 'lastCommitYear >
   info.EndYear' was never true and end years were never bumped.
   Fix: change format to '--format=format:__CW_YEAR__=%ad'.

2. getFileLastCommitYear computed the relative path via
   filepath.Rel(repoRoot, absPath), but on macOS /tmp is a symlink to
   /private/tmp. git resolves the real path for repoRoot while
   filepath.Abs preserves the symlink, causing a path mismatch and a
   cache miss on every file.
   Fix: call filepath.EvalSymlinks on absPath before computing the
   relative path. This is a no-op on Linux and Windows where symlinks
   are not involved.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants