
fix: test in mesh_utils #1898

Merged
akoumpa merged 6 commits into main from akoumparouli/fix_mesh_test
Apr 17, 2026


akoumpa commented Apr 17, 2026

The 4 failing tests in tests/unit_tests/distributed/test_mesh_utils.py over-specified implementation details. get_fsdp_dp_mesh (nemo_automodel/components/distributed/mesh_utils.py:414) correctly probes axis sizes via device_mesh[cp_name].size() and device_mesh[dp_replicate_name].size() before slicing, so it makes 2–3 __getitem__ calls in total. The tests asserted assert_called_once_with(...) / assert_not_called(), which conflicts with those size probes.

Fix in tests/unit_tests/distributed/test_mesh_utils.py:
  - Lines 364, 379, 393: assert_called_once_with → assert_any_call. The existing assertions on result._key and result._mesh is mesh already verify the real guarantees (correct slice key + shared root mesh).
  - Lines 414–415: replaced __getitem__.assert_not_called() with a filter that forbids only direct tuple-key slices (the actual guarantee) and allows bare-name size probes.
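
For illustration, the conflict between the strict assertions and the size probes can be reproduced with a minimal unittest.mock sketch. pick_dp_mesh, make_mesh and tuple_key_calls are hypothetical names, and the axis names "dp" and "cp" are assumptions; this is not the real get_fsdp_dp_mesh or the actual test code, only the same probe-then-slice pattern.

```python
from unittest.mock import MagicMock


def pick_dp_mesh(device_mesh, dp_name="dp", cp_name="cp"):
    # Hypothetical stand-in, NOT the real get_fsdp_dp_mesh:
    # probe one axis size by bare name, then slice.
    if device_mesh[cp_name].size() > 1:
        return device_mesh[(dp_name, cp_name)]  # tuple-key slice
    return device_mesh[dp_name]  # bare-name slice


def make_mesh(axis_size):
    # Mocked mesh whose every sub-mesh reports the same size.
    mesh = MagicMock()
    mesh.__getitem__.return_value.size.return_value = axis_size
    return mesh


def tuple_key_calls(mesh):
    # Keep only the __getitem__ calls that used a tuple key.
    return [
        c.args[0]
        for c in mesh.__getitem__.call_args_list
        if isinstance(c.args[0], tuple)
    ]


mesh = make_mesh(axis_size=2)
pick_dp_mesh(mesh)

# The strict assertion fails: the size probe is a second __getitem__ call.
try:
    mesh.__getitem__.assert_called_once_with(("dp", "cp"))
    strict_ok = True
except AssertionError:
    strict_ok = False

# The relaxed assertion only requires that the slice happened at some point.
mesh.__getitem__.assert_any_call(("dp", "cp"))

# With a size-1 axis the sketch never slices by tuple, but it still probes,
# so assert_not_called() would fail while the tuple-key filter passes.
flat = make_mesh(axis_size=1)
pick_dp_mesh(flat)
assert tuple_key_calls(flat) == []
```

The filter checks the guarantee itself (no direct tuple-key slice) without depending on how many bare-name lookups the implementation performs.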



Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

copy-pr-bot Bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


akoumpa commented Apr 17, 2026

/ok to test efa7d4a

akoumpa added 2 commits April 17, 2026 10:29
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa commented Apr 17, 2026

/ok to test 7e8807d

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa commented Apr 17, 2026

/ok to test c921559


akoumpa commented Apr 17, 2026

akoumparouli@1604ab7-lcedt:/mnt/4tb/auto/26_04/Automodel_fix_test$ egrep -irIno 'examples[^ ]+.yaml' tests/ | cut -d':' -f3 | sort -u | grep -iv examples_dir| xargs ls -lah
-rw-r--r-- 1 akoumparouli domain-users 2.4K Apr 17 12:26 examples/llm_benchmark/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark.yaml
-rw-r--r-- 1 akoumparouli domain-users 2.4K Apr 17 12:26 examples/llm_benchmark/qwen/custom_qwen2_5_32b_peft_benchmark.yaml
-rw-r--r-- 1 akoumparouli domain-users 2.7K Apr 17 12:26 examples/llm_benchmark/qwen/qwen3_moe_30b_te_deepep.yaml
-rw-r--r-- 1 akoumparouli domain-users 3.2K Apr 11 19:19 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
-rw-r--r-- 1 akoumparouli domain-users 2.4K Apr 11 19:19 examples/llm_finetune/llama3_2/llama3_2_1b_squad_flashoptim.yaml
-rw-r--r-- 1 akoumparouli domain-users 2.5K Apr 11 19:19 examples/llm_finetune/llama3_2/llama3_2_1b_squad_megatron_fsdp.yaml
-rw-r--r-- 1 akoumparouli domain-users 3.9K Apr 11 19:19 examples/llm_finetune/moonlight/moonlight_16b_te.yaml
-rw-r--r-- 1 akoumparouli domain-users 3.7K Apr 11 19:19 examples/llm_finetune/qwen/qwen3_moe_2layer_proxy_lora.yaml
-rw-r--r-- 1 akoumparouli domain-users 3.7K Apr 11 19:19 examples/llm_finetune/qwen/qwen3_moe_2layer_proxy_torch_sdpa.yaml
-rw-r--r-- 1 akoumparouli domain-users 2.4K Apr 11 19:19 examples/llm_kd/llama3_2/llama3_2_1b_kd.yaml
-rw-r--r-- 1 akoumparouli domain-users 3.6K Apr 11 19:19 examples/llm_pretrain/megatron_pretrain_gpt2.yaml
-rw-r--r-- 1 akoumparouli domain-users 2.7K Apr 11 19:19 examples/vlm_finetune/gemma3/gemma3_vl_4b_cord_v2.yaml
-rw-r--r-- 1 akoumparouli domain-users 2.7K Apr 11 19:19 examples/vlm_finetune/gemma3/gemma3_vl_4b_cord_v2_megatron_fsdp.yaml
-rw-r--r-- 1 akoumparouli domain-users 3.0K Apr 11 19:19 examples/vlm_finetune/gemma3/gemma3_vl_4b_cord_v2_peft.yaml

All YAML files referenced under tests/ should exist now.
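
The same check could be scripted so it reports only missing files. The following is a rough Python sketch of the shell pipeline above, not part of this PR; missing_example_yamls is a hypothetical helper, and it assumes the referenced paths are relative to the repository root.

```python
import re
from pathlib import Path

# Same idea as the egrep pattern above, with the dot escaped and
# newlines excluded because the text is scanned per file, not per line.
YAML_REF = re.compile(r"examples[^ \n]+\.yaml", re.IGNORECASE)


def missing_example_yamls(repo_root):
    """Return example YAML paths mentioned under tests/ that do not exist."""
    root = Path(repo_root)
    tests_dir = root / "tests"
    if not tests_dir.is_dir():
        return []
    refs = set()
    for path in tests_dir.rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files, like grep -I
        for match in YAML_REF.findall(text):
            # Mirror the `grep -iv examples_dir` filter.
            if "examples_dir" not in match.lower():
                refs.add(match)
    return sorted(ref for ref in refs if not (root / ref).exists())
```

Calling missing_example_yamls(".") from the repository root would return an empty list when every referenced YAML is present.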
