COMP: Free disk on Linux.Python CI before post-job ccache tar #6199

Merged
dzenanz merged 1 commit into InsightSoftwareConsortium:main from hjmjohnson:fix/itk-linux-python-ccache-disk-pressure
May 4, 2026

Conversation

@hjmjohnson
Member

Fixes a recurring ITK.Linux.Python post-job failure, `tar: Error is not recoverable` (ENOSPC), by inserting a disk-cleanup step between `Build and test` and the implicit `Cache@2` post-job uploader. Builds 14748 and 14751 both hit this; their build/test phases were green and only the cache upload failed.

Root cause (deep dive on build 14751)

The ccache stats step before the post-job prints:

Primary storage:
  Cache size (GB): 4.94 / 5.00 (98.87 %)
  Cleanups:          15

Local ccache is at the soft ceiling and has self-evicted 15 times during one build. The post-job Cache@2 task then tries to tar the entire 4.94 GB ccache for upload on a runner whose / is already at 96% used:

2026-05-03T04:11:31.6036279Z tar: fce56f0f2f94438cabba19e4f240f17a_archive.tar: Wrote only 4096 of 10240 bytes
2026-05-03T04:11:31.6042095Z tar: Error is not recoverable: exiting now
2026-05-03T04:11:32.1885773Z ##[error]Process returned non-zero exit code: 2

Every run also reports "There is a cache miss." for the ccache-v4|Linux|LinuxPython|<sha> fingerprint, so the next run is cold too. Because the upload never succeeds, the pipeline is stuck in a self-perpetuating broken state.
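
The failure mode above comes down to whether the staging archive fits in the space left on the runner. A minimal sketch of a preflight check follows. It is not part of this PR: the helper names (`archive_fits`, `dir_kb`, `avail_kb`) are hypothetical, and it assumes the archive is roughly the size of the cache directory and is staged on the same filesystem.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers, not part of the PR.

# Succeeds (exit 0) only if an archive of CACHE_KB KiB is smaller than
# the FREE_KB KiB still available.
archive_fits() {
  local cache_kb=$1 free_kb=$2
  [ "$cache_kb" -lt "$free_kb" ]
}

# Size of a directory in KiB, as reported by du.
dir_kb() {
  du -sk "$1" | awk '{print $1}'
}

# Available KiB on the filesystem holding a path (POSIX df output format).
avail_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}
```

A step could call `archive_fits "$(dir_kb "$CCACHE_DIR")" "$(avail_kb "$CCACHE_DIR")"` and log a warning when it fails, which would flag the ENOSPC risk before `tar` starts.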

What this PR adds

A new step in Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml, inserted after Build and test and before ccache stats:

- bash: |
    df -h /
    ccache --show-stats
    ccache --evict-older-than 5d
    ccache --max-size 6.5G
    ccache -c
    ccache --show-stats
    ninja -C $(Build.SourcesDirectory)-build clean || true
    df -h /
  condition: always()
  displayName: 'Free disk before post-job ccache upload'
  • ccache --evict-older-than 5d drops entries that have not been used in the last five days and are unlikely to match recent compiler/source SHAs.
  • ccache --max-size 6.5G sets the local ceiling below the build-time CCACHE_MAXSIZE of 8G. Combined with ccache -c (cleanup), this forces the local store under the new ceiling so the tar fits in the runner's available disk.
  • ninja -C <build> clean removes the .o files that are no longer needed once the build/test phase has reported. || true lets the step succeed if the tree is missing or already clean.
  • Bracketing df -h / calls give visibility on both sides of the cleanup so future regressions are diagnosable from logs alone.
  • condition: always() runs the cleanup even when the build/test phase fails, so the post-job upload still gets free disk to work with.
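
The bracketing `df -h /` calls are meant for human readers of the log. As a sketch only, the same before/after readings could be turned into a single "reclaimed" figure; the helper names `avail_kb` and `reclaimed_mib` are hypothetical and do not appear in the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the PR.

# Available KiB on the filesystem holding a path (POSIX df output format).
avail_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# reclaimed_mib BEFORE_KB AFTER_KB
# Converts two "available KiB" readings into whole MiB reclaimed.
reclaimed_mib() {
  local before_kb=$1 after_kb=$2
  echo $(( (after_kb - before_kb) / 1024 ))
}
```

Usage would be to capture `before=$(avail_kb /)` ahead of the cleanup commands, `after=$(avail_kb /)` afterwards, and then `echo "reclaimed $(reclaimed_mib "$before" "$after") MiB on /"`.
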
Why 6.5G specifically

The build-time env still has CCACHE_MAXSIZE: 8G so ccache has room to operate during the build. The post-job ceiling of 6.5G is below the runner's typical free-disk headroom at the end of a Python wrapping build (the build itself consumes ~30-35 GB across /home/vsts/work for sources + build tree + ExternalData + ccache combined). Lowering further would unnecessarily evict warm entries; leaving at 8G would not actually fix the immediate failure. 6.5G is the sweet spot.
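
One way to reason about a ceiling like this is a simple budget, sketched here under stated assumptions: the archive is about as large as the cache, it is staged on the same filesystem, and space freed by `ninja clean` is ignored. Shrinking the cache from C to C' frees C - C', so the archive fits when C' + margin <= free + (C - C'), which gives C' <= (free + C - margin) / 2. The function name and the numbers in the comment are illustrative, not measurements from the runner.

```shell
#!/usr/bin/env bash
# Hypothetical budget helper, not part of the PR.

# max_ceiling_kb FREE_KB CACHE_KB MARGIN_KB
# Largest cache size (KiB) that still leaves room for an archive of the
# same size plus a safety margin, given the current free space and the
# current cache size.
max_ceiling_kb() {
  local free_kb=$1 cache_kb=$2 margin_kb=$3
  echo $(( (free_kb + cache_kb - margin_kb) / 2 ))
}

# Illustrative numbers only: 2000 KiB free, 10000 KiB cache, no margin.
# Shrinking the cache to 6000 KiB frees 4000 KiB, leaving 6000 KiB free,
# which is exactly enough for a 6000 KiB archive.
```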

Doesn't address (and shouldn't, in this PR)
  • Whether the ccache-v4|...|<Build.SourceVersion> cache fingerprint is too coarse / too fine. Separate optimization.
  • Whether the CCACHE_MAXSIZE: 8G env at the Build and test step actually takes effect (logs show ccache reporting a 5G ceiling there, possibly because a persistent ccache config overrides it). Worth a follow-up but not blocking this fix.
  • The same disk-pressure pattern on ITK.macOS.Python (build 14750 currently green, but the same growth trajectory could hit it).
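
For the CCACHE_MAXSIZE follow-up, `ccache --show-config` reports where each setting comes from. The sketch below extracts that origin for `max_size`; it assumes the `(origin) key = value` line format printed by recent ccache releases, and the function name `max_size_origin` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper, not part of the PR.

# Reads `ccache --show-config` output on stdin and prints the origin of
# the max_size setting, e.g. "environment", "default", or a config path.
# Assumes lines of the form "(origin) key = value".
max_size_origin() {
  sed -n 's/^(\([^)]*\)) max_size = .*$/\1/p'
}
```

Running `ccache --show-config | max_size_origin` inside the Build and test step would show whether the 8G value comes from the environment or is shadowed by another source.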

@github-actions (bot) added the labels type:Compiler (Compiler support or related warnings), type:Infrastructure (Infrastructure/ecosystem related changes, such as CMake or buildbots), and type:Testing (Ensure that the purpose of a class is met/the results on a wide set of test cases are correct) on May 3, 2026
@hjmjohnson hjmjohnson marked this pull request as ready for review May 3, 2026 16:23
@hjmjohnson hjmjohnson requested review from dzenanz May 3, 2026 16:24
@greptile-apps

This comment was marked as resolved.

ITK.Linux.Python builds 14748 and 14751 (and presumably any
subsequent run) fail at the implicit `Cache@2` post-job step that
tars `CCACHE_DIR` for upload, with

    tar: <archive>: Wrote only 4096 of 10240 bytes
    tar: Error is not recoverable: exiting now
    ##[error]Process returned non-zero exit code: 2

after 10+ `Free disk space on / is lower than 5%` warnings.  The
build/test phase itself is green; only the cache upload fails.

The local ccache fills its 5G default ceiling during the build:

    Cache size (GB): 4.94 / 5.00 (98.87 %)
    Cleanups:          15

and the runner's writable disk is already 96% full when `tar` starts,
leaving no room for the staging archive.  Worse, because the upload
fails every run, the pipeline cache key reports a miss every time, so
the next run is also cold and refills the local store from scratch.

Insert a between-step that runs after `Build and test` and before the
implicit post-job `Restore ccache` upload:

  - `df -h /` for visibility on both sides of the cleanup;
  - `ccache --evict-older-than 5d` to drop stale entries;
  - `ccache --max-size 6.5G` to lower the soft ceiling;
  - `ccache -c` to force the store under that ceiling;
  - `ninja -C <build> clean` to free the .o files that are no longer
    needed once the build/test phase has reported (`|| true` so a
    missing/already-clean tree does not fail the step).

Setting max-size to 6.5G (below the in-build CCACHE_MAXSIZE=8G env
and below the runner's free-disk headroom) leaves ccache room to
operate during the build while keeping the post-job tar inside the
runner's disk budget.
@hjmjohnson hjmjohnson marked this pull request as draft May 4, 2026 00:40
@hjmjohnson hjmjohnson force-pushed the fix/itk-linux-python-ccache-disk-pressure branch from 817fd27 to ea4ba99 Compare May 4, 2026 00:40
@hjmjohnson hjmjohnson marked this pull request as ready for review May 4, 2026 00:43
Member

@dzenanz dzenanz left a comment

I have noticed the "low disk space" warning. This is at least worth trying.

@dzenanz dzenanz merged commit 7c28551 into InsightSoftwareConsortium:main May 4, 2026
16 checks passed

2 participants