@jgfouca
Member

@jgfouca jgfouca commented Aug 20, 2025

[BFB]

@jgfouca jgfouca requested a review from mahf708 August 20, 2025 16:46
@jgfouca jgfouca self-assigned this Aug 20, 2025
@jgfouca jgfouca added BFB PR leaves answers BFB CI: workflow change approved Allow gh action PR testing on ghci-snl-* machines for PRs that alter a worfklow file CI: approved Allow gh actions PR testing on ghci-snl-* machines labels Aug 20, 2025
@jgfouca jgfouca changed the title Add and A case and an ELM case to CI Add an A case and an ELM case to CI Aug 20, 2025
@mahf708
Contributor

mahf708 commented Aug 20, 2025

@jgfouca these machines have like 4 procs (or at most 8, but it's safer to assume only 2 or 4). Would it be ok to add _P4 to the names? The elm-betr test is failing with an exception about requesting 64 procs

Test ERS.f19_g16.I1850ELM.ghci-oci_gnu.elm-betr phase RUN requested more (64) than entire pool (self._proc_pool)
Waiting for tests to finish

@jgfouca
Member Author

jgfouca commented Aug 20, 2025

@mahf708, yes, I didn't notice that the other tests were doing that too (_P4).
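
For illustration, here is a minimal sketch of the rename being agreed on: adding a `_P4` modifier to the test-type field of a test name. The helper `add_pe_modifier` is hypothetical (it is not part of CIME), and the dot-separated layout it assumes is inferred only from the test names quoted in this thread.

```python
def add_pe_modifier(test_name: str, ntasks: int) -> str:
    """Return test_name with a _P<ntasks> modifier on its first field.

    Assumption: the first dot-separated field is the test type followed by
    optional underscore-separated modifiers, e.g. "ERS" or "ERS_P4".
    """
    head, sep, rest = test_name.partition(".")
    parts = head.split("_")
    # Drop an existing plain _P<digits> modifier so the new one replaces it.
    kept = [parts[0]] + [
        m for m in parts[1:] if not (m.startswith("P") and m[1:].isdigit())
    ]
    kept.append(f"P{ntasks}")
    return "_".join(kept) + sep + rest


if __name__ == "__main__":
    print(add_pe_modifier("ERS.f19_g16.I1850ELM.ghci-oci_gnu.elm-betr", 4))
```

Applied to the failing test above, this yields the `ERS_P4...` form of the name.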

@mahf708
Contributor

mahf708 commented Aug 20, 2025

I will need to issue a PR in E3SM-Project/containers and add the new files needed, otherwise Rob (and ANL server people) will kill me. ... 👀

@bartgol
Contributor

bartgol commented Aug 20, 2025

@jgfouca these machines have like 4 procs (or at most 8, but it's safer to assume only 2 or 4). Would it be ok to add _P4 to the names? The elm-betr test is failing with an exception about requesting 64 procs

Test ERS.f19_g16.I1850ELM.ghci-oci_gnu.elm-betr phase RUN requested more (64) than entire pool (self._proc_pool)
Waiting for tests to finish

Why are these tests requesting 64 ranks? Isn't CIME supposed to limit the number of procs with whatever is in config_machines.xml?

Edit: I see that ghci-oci specifies max_tasks_per_node=16. First, I would expect CIME to use 16, no? Second, @mahf708 how about changing that to

<MAX_TASKS_PER_NODE>$SHELL{cat /proc/cpuinfo | grep processor | wc -l}</MAX_TASKS_PER_NODE>

so that it works with whatever github throws at us? @jgfouca I'm assuming the above will work as intended (meaning, CIME will expand it with the shell output)?
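
As a rough illustration of what the shell pipeline in that snippet computes, the sketch below counts the lines of `/proc/cpuinfo` that contain `processor`, falling back to `os.cpu_count()` when the file is unreadable. The function names are hypothetical, and the sketch says nothing about whether CIME expands the `$SHELL{...}` form in that field; that question is still open in the thread.

```python
import os


def count_processor_lines(cpuinfo_text: str) -> int:
    """Count lines containing 'processor', like `grep processor | wc -l`."""
    return sum(1 for line in cpuinfo_text.splitlines() if "processor" in line)


def detected_cores(cpuinfo_path: str = "/proc/cpuinfo") -> int:
    """Best-effort core count; never returns less than 1."""
    try:
        with open(cpuinfo_path) as f:
            n = count_processor_lines(f.read())
    except OSError:
        n = 0  # file missing or unreadable (e.g. not on Linux)
    return n or os.cpu_count() or 1


if __name__ == "__main__":
    print(detected_cores())
```

Note that the match is case-sensitive and counts any line containing the word, so the result depends on the layout of `/proc/cpuinfo` on the runner's architecture.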

@jgfouca
Member Author

jgfouca commented Aug 20, 2025

@bartgol , config_pes.xml is what determines ranks per case. We could have configured it such that ghci-oci just uses 4 ranks for everything, but P4 is fine too I think.

@bartgol
Contributor

bartgol commented Aug 20, 2025

@bartgol , config_pes.xml is what determines ranks per case. We could have configured it such that ghci-oci just uses 4 ranks for everything, but P4 is fine too I think.

I think some gh hosted runners may only have 2 cores. Can config_pes.xml use the SHELL{..} syntax to programmatically retrieve it (or maybe use $MAX_TASKS_PER_NODE as a default)? Not that important though.
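
The idea of capping the rank count at whatever the runner provides can be sketched as follows. This is only an illustration of the clamping logic: `ranks_for_runner` is a hypothetical helper, not something CIME or `config_pes.xml` provides.

```python
import os
from typing import Optional


def ranks_for_runner(requested: int, available: Optional[int] = None) -> int:
    """Clamp a requested rank count to the number of available cores."""
    if requested < 1:
        raise ValueError("requested rank count must be at least 1")
    if available is None:
        # Assumption: one rank per core reported by the OS.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


if __name__ == "__main__":
    # A 64-rank request on a 4-core runner is reduced to 4.
    print(ranks_for_runner(64, 4))
```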

@mahf708
Contributor

mahf708 commented Aug 20, 2025

Jim, not sure what's getting this over the disk limit. Maybe domain files? But the other test passed. Should we try to use ne4pg2_oQU480 for all these tests? What do you think?

@jgfouca
Member Author

jgfouca commented Aug 20, 2025

@mahf708 , I see one check still running. Where are you seeing that we are going over the limit?

@mahf708
Contributor

mahf708 commented Aug 20, 2025

@mahf708 , I see one check still running. Where are you seeing that we are going over the limit?

I made it run again; here's the first (failed) run: https://github.com/E3SM-Project/E3SM/actions/runs/17105290100/attempts/1; you can find the attempts at the top right, in a small box

it's a simple disk error

error

[ci (ERS_P4.f19_g16.I1850ELM.ghci-oci_gnu.elm-betr)](https://github.com/E3SM-Project/E3SM/actions/runs/17105290100/job/48512511062)
System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20250820-170805-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.OSFileStreamStrategy.Write(ReadOnlySpan`1 buffer)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20250820-170805-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.OSFileStreamStrategy.Write(ReadOnlySpan`1 buffer)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Common.Tracing.Error(Exception exception)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/_diag/Worker_20250820-170805-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.OSFileStreamStrategy.Write(ReadOnlySpan`1 buffer)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.Tracing.Dispose(Boolean disposing)
   at GitHub.Runner.Common.Tracing.Dispose()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)
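
A failure like the one above could in principle be caught earlier by checking free space before a test downloads its input data. The sketch below uses the standard-library `shutil.disk_usage`; the helper names and any threshold passed in are assumptions for illustration, not values taken from this PR.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str, needed_gib: float) -> bool:
    """True if at least `needed_gib` GiB are free under `path`."""
    return free_gib(path) >= needed_gib


if __name__ == "__main__":
    print(f"{free_gib('.'):.1f} GiB free in the current directory")
```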

@jgfouca
Member Author

jgfouca commented Aug 20, 2025

@mahf708 , OK, I switched them all to the same grid.

@mahf708 mahf708 requested a review from bishtgautam August 21, 2025 02:24
@mahf708
Contributor

mahf708 commented Aug 21, 2025

@bishtgautam @rljacob I'm struggling to figure out the sweet combo that will make this test not try to download lots of data... but also, I'm hearing elm-betr is likely not the best test to add here. Any suggestions for other tests that will cover common cases?

Also, tagging @jonbob in case you'd like us to add a test or two for ocean/ice :) thanks.

The goal of these tests is to basically run simple ones (no baselines for now) that can uncover low-order/basic build/run errors

@rljacob rljacob added the Testing Anything related to unit/system tests label Aug 21, 2025
@bishtgautam
Contributor

You can try adding a low-res version of SMS.r05_r05.I1850ELMCN.elm-qian_1948, something like SMS.ne4pg2_oQU480.I1850ELMCN.elm-qian_1948
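
The grid swap suggested here amounts to replacing the second dot-separated field of the test name. A minimal sketch, assuming that field is always the grid (as it is in the names quoted in this thread); `with_grid` is a hypothetical helper, not a CIME function.

```python
def with_grid(test_name: str, grid: str) -> str:
    """Return test_name with its second dot-separated field set to `grid`."""
    fields = test_name.split(".")
    if len(fields) < 3:
        raise ValueError(f"unexpected test name layout: {test_name!r}")
    fields[1] = grid  # assumption: field 1 is the grid alias
    return ".".join(fields)


if __name__ == "__main__":
    print(with_grid("SMS.r05_r05.I1850ELMCN.elm-qian_1948", "ne4pg2_oQU480"))
```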

@rljacob rljacob added the github_actions Pull requests that update GitHub Actions code label Aug 21, 2025
jgfouca added a commit that referenced this pull request Aug 21, 2025
Add an A case and an ELM case to CI

[BFB]
jgfouca added a commit that referenced this pull request Aug 21, 2025
Add an A case and an ELM case to CI

[BFB]
jgfouca added a commit that referenced this pull request Aug 21, 2025
Add an A case and an ELM case to CI

Merge 2 for this PR, forgot to update before merging.

[BFB]
@jgfouca jgfouca merged commit 9d49d95 into master Aug 21, 2025
8 checks passed
@jgfouca jgfouca deleted the jgfouca/add_ci_tests branch August 21, 2025 15:27
6 participants