[wip] feat: add static config for search space #111

Closed
cquil11 wants to merge 17 commits into main from refactor-less-jobs

Conversation

@cquil11
Collaborator

@cquil11 cquil11 commented Oct 17, 2025

This PR refactors how the parallelism-concurrency search space is specified for a given model-GPU-seqlen scenario, with the goal of making that specification more intuitive.
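To illustrate the general idea of a static search space, here is a minimal sketch that expands one scenario's parallelism and concurrency lists into individual benchmark points. Everything in it is an assumption made for illustration: the `SEARCH_SPACE` layout, the `tp` and `concurrency` keys, the scenario names, and the `expand` helper are hypothetical and are not taken from `.github/configs/search-space.yml`.

```python
from itertools import product

# Hypothetical static search space. The keys and values are illustrative
# placeholders, not the schema used by this PR.
SEARCH_SPACE = {
    ("example-model", "example-gpu", "1k-in/1k-out"): {
        "tp": [4, 8],                 # tensor-parallel sizes to try
        "concurrency": [4, 16, 64],   # request concurrencies to try
    },
}


def expand(search_space):
    """Return one benchmark point per (tp, concurrency) pair for each scenario."""
    points = []
    for (model, gpu, seqlen), axes in search_space.items():
        for tp, conc in product(axes["tp"], axes["concurrency"]):
            points.append(
                {
                    "model": model,
                    "gpu": gpu,
                    "seqlen": seqlen,
                    "tp": tp,
                    "concurrency": conc,
                }
            )
    return points


points = expand(SEARCH_SPACE)
```

Keeping the search space in one declarative structure like this means the set of points can be reviewed in a single place, and the expansion step stays a plain Cartesian product per scenario.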

Test Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/18601277517

Made necessary output changes

https://github.com/InferenceMAX/InferenceMAX/actions/runs/18600393148

@cquil11 cquil11 force-pushed the refactor-less-jobs branch from f6d4ac7 to 35bc387 Compare October 17, 2025 00:31
Comment thread .github/configs/search-space.yml Outdated
Comment thread benchmarks/dsr1_fp4_b200_trt_slurm.sh Outdated
Comment thread benchmarks/dsr1_fp4_b200_trt_slurm.sh Outdated
@cquil11 cquil11 closed this Oct 29, 2025
@functionstackx functionstackx deleted the refactor-less-jobs branch November 5, 2025 02:38
cquil11 added a commit that referenced this pull request Apr 28, 2026
Both fixes we wanted are now on origin/main:
  * #110 nginx-rework-ulimit (Ishan): gates the 1M nofile bump behind opt-in
    frontend.nginx_raise_ulimit. Default off, fixes clusters whose container
    RLIMIT_NOFILE hard cap < 1M.
  * #111 (cam): demotes the per-srun command logger.info to logger.debug so
    the 5KB fingerprint heredoc stops dominating orchestrator logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants