update ep cookbook #40

Merged

kevssim merged 1 commit into dev from dev-wkw on Feb 6, 2026

Conversation

kevssim (Collaborator) commented on Feb 6, 2026

No description provided.

kevssim changed the base branch from main to dev on February 6, 2026 at 07:32
kevssim merged commit 396b5b5 into dev on Feb 6, 2026
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @kevssim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a major overhaul and rebranding of the project, transitioning from 'SWIFT' to 'Twinkle'. It establishes a comprehensive, modular, and distributed training framework for large language models, with a strong focus on multi-adapter LoRA, deep integration with Ray for distributed execution, and broad support for diverse model architectures like Transformers and Megatron-Core, including MoE and VLM. The changes encompass a complete restructuring of the project's core components, new data handling mechanisms, a client-server architecture for remote interaction, and extensive documentation to guide users and developers.

Highlights

  • Project Rebranding: The project has undergone a significant rebranding from 'SWIFT' to 'Twinkle', with extensive updates across documentation, configuration files, and code to reflect the new name and identity.
  • Distributed Training Infrastructure: A robust distributed training infrastructure has been introduced, leveraging Ray for orchestration, remote execution, and efficient resource management across various hardware types (GPU, NPU).
  • Modular Component Design: The framework is built with a modular architecture, featuring dedicated components for data loading (dataloader, dataset), model management (model), loss calculation (loss), and performance metrics (metric), promoting flexibility and extensibility.
  • Multi-Backend Model Support: Comprehensive support for both HuggingFace Transformers and NVIDIA Megatron-Core backends is integrated, enabling training of diverse LLM architectures, including Mixture-of-Experts (MoE) and Vision-Language Models (VLM).
  • Multi-Adapter LoRA Training: The framework now supports multi-adapter LoRA training, allowing efficient fine-tuning of multiple LoRA adapters on a single base model, which is crucial for multi-tenant scenarios (see the sketch after this list).
  • Automated Client Generation & HTTP Services: A new client generation tool and HTTP server components are introduced, facilitating easy interaction with remote models and samplers for training and inference.
  • Extensive Cookbooks and Documentation: A wide array of new cookbooks and detailed documentation (including Chinese and English versions) are added, covering installation, quick-starts, component usage, and advanced training examples like GRPO and SFT.

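The multi-adapter LoRA highlight is easiest to picture with code. The sketch below is a hypothetical illustration of the idea using HuggingFace PEFT, not Twinkle's actual API (which this PR page does not show); the model ID and adapter names are placeholders.

```python
# Illustrative only: multi-adapter LoRA via HuggingFace PEFT, standing in
# for Twinkle's own multi-adapter support. The model ID and adapter names
# are placeholders, not taken from this PR.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# Attach two independent LoRA adapters to one frozen base model.
model = get_peft_model(
    base,
    LoraConfig(r=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
    adapter_name="tenant_a",
)
model.add_adapter("tenant_b", LoraConfig(r=16, target_modules=["q_proj", "v_proj"]))

# Route work through one adapter at a time; the base weights are shared,
# so two tenants cost two small adapter sets, not two full model copies.
model.set_adapter("tenant_a")
# ... run tenant_a's training step ...
model.set_adapter("tenant_b")
# ... run tenant_b's training step ...
```

This is the economy the highlight refers to: the expensive base weights are loaded once, and each tenant adds only a small pair of low-rank matrices per targeted layer.
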
Changelog
  • .github/copilot-instructions.md
    • Added new guidelines for AI agents interacting with the repository.
  • .gitignore
    • Added new ignore patterns for client-side generated files, lock files, and test cookbooks.
  • .pre-commit-config.yaml
    • Updated versions of code quality tools (flake8, isort, yapf, pre-commit-hooks).
    • Removed the 'fix-encoding-pragma' hook.
  • CONTRIBUTING_CN.md
    • Updated project name from 'SWIFT' to 'twinkle' in the Chinese contributing guide.
    • Revised sections on contribution needs and removed resource support information.
  • README.md
    • Added a new English README with 'Twinkle' branding, project slogan, badges, and documentation links.
  • README_ZH.md
    • Added a new Chinese README with 'Twinkle' branding, project slogan, badges, and a section comparing 'Twinkle' with 'ms-swift'.
  • ROADMAP.md
    • Added a new roadmap document outlining future development plans for version 0.1.
  • client_tools/client_generator.py
    • Added a new script to automatically generate client wrappers for various 'twinkle' components.
  • cookbook/client/tinker/megatron/lora.py
    • Added a Tinker-compatible Megatron LoRA training example.
  • cookbook/client/tinker/megatron/server.py
    • Added a Tinker-compatible Megatron server example.
  • cookbook/client/tinker/megatron/server_config.yaml
    • Added configuration for the Tinker-compatible Megatron server.
  • cookbook/client/tinker/transformer/lora.py
    • Added a Tinker-compatible Transformer LoRA training example.
  • cookbook/client/tinker/transformer/sample.py
    • Added a Tinker-compatible Transformer sampling example.
  • cookbook/client/tinker/transformer/self_congnition.py
    • Added a Tinker-compatible Transformer self-cognition example.
  • cookbook/client/tinker/transformer/server.py
    • Added a Tinker-compatible Transformer server example.
  • cookbook/client/tinker/transformer/server_config.yaml
    • Added configuration for the Tinker-compatible Transformer server.
  • cookbook/client/twinkle/megatron/lora.py
    • Added a native 'Twinkle' Megatron LoRA training example.
  • cookbook/client/twinkle/megatron/server.py
    • Added a native 'Twinkle' Megatron server example.
  • cookbook/client/twinkle/megatron/server_config.yaml
    • Added configuration for the native 'Twinkle' Megatron server.
  • cookbook/client/twinkle/transformer/grpo_lora.py
    • Added a native 'Twinkle' Transformer GRPO LoRA training example.
  • cookbook/client/twinkle/transformer/lora.py
    • Added a native 'Twinkle' Transformer LoRA training example.
  • cookbook/client/twinkle/transformer/server.py
    • Added a native 'Twinkle' Transformer server example.
  • cookbook/client/twinkle/transformer/server_config.yaml
    • Added configuration for the native 'Twinkle' Transformer server.
  • cookbook/grpo/lora.py
    • Added a GRPO LoRA training example for hybrid mode.
  • cookbook/grpo/lora_gpu.py
    • Added a GRPO LoRA training example optimized for GPU environments.
  • cookbook/grpo/lora_npu.py
    • Added a GRPO LoRA training example optimized for NPU environments.
  • cookbook/megatron/lora.py
    • Added a Megatron-Core LoRA training example.
  • cookbook/megatron/moe_lora.py
    • Added a Megatron-Core Mixture-of-Experts (MoE) LoRA training example.
  • cookbook/megatron/vlm_lora.py
    • Added a Megatron-Core Vision-Language Model (VLM) LoRA training example.
  • cookbook/remote/tinker/ascend/lora.py
    • Added a remote Tinker-compatible LoRA training example for Ascend NPU.
  • cookbook/remote/tinker/ascend/server.py
    • Added a remote Tinker-compatible server example for Ascend NPU.
  • cookbook/remote/tinker/ascend/server_config.yaml
    • Added configuration for the remote Tinker-compatible Ascend server.
  • cookbook/remote/tinker/lora.py
    • Added a remote Tinker-compatible LoRA training example.
  • cookbook/remote/tinker/server.py
    • Added a remote Tinker-compatible server example.
  • cookbook/remote/tinker/server_config.yaml
    • Added configuration for the remote Tinker-compatible server.
  • cookbook/remote/twinkle/lora.py
    • Added a remote native 'Twinkle' LoRA training example.
  • cookbook/remote/twinkle/server.py
    • Added a remote native 'Twinkle' server example.
  • cookbook/remote/twinkle/server_config.yaml
    • Added configuration for the remote native 'Twinkle' server.
  • cookbook/sampler_demo.py
    • Added a demonstration script for the sampler component.
  • cookbook/sft/ep_fsdp_qwen3_moe.py
    • Added a Supervised Fine-Tuning (SFT) example for Qwen3 MoE with Expert Parallel (EP) and FSDP.
  • cookbook/sft/full_sft.py
    • Added a full SFT example.
  • cookbook/sft/local_dataset.py
    • Added an SFT example demonstrating usage with local datasets.
  • cookbook/sft/lora_npu.py
    • Added an SFT LoRA example optimized for NPU environments.
  • cookbook/sft/multi_lora.py
    • Added a multi-LoRA SFT example.
  • cookbook/sft/single_controller.py
    • Added a single controller SFT example.
  • cookbook/sft/single_controller_sp.py
    • Added a single controller SFT example with sequence parallelism.
  • cookbook/sft/single_program.py
    • Added a single program SFT example.
  • cookbook/sft/single_program_full.py
    • Added a full single program SFT example.
  • cookbook/sft/single_program_megatron.py
    • Added a single program Megatron SFT example.
  • cookbook/sft/single_program_moe.py
    • Added a single program MoE SFT example.
  • cookbook/sft/streaming_dataset.py
    • Added an SFT example demonstrating usage with streaming datasets.
  • cookbook/sft/vlm_lora.py
    • Added an SFT VLM LoRA example.
  • docs/Makefile
    • Added Makefile for building Sphinx documentation.
  • docs/README.md
    • Added README for documentation maintenance guidelines.
  • docs/make.bat
    • Added make.bat for building Sphinx documentation on Windows.
  • docs/source/.readthedocs.yaml
    • Added Read the Docs configuration for Chinese documentation.
  • docs/source/Components/index.rst
    • Added index for documentation components.
  • docs/source/Components/数据格式/InputFeature.md
    • Added documentation for the InputFeature data format.
  • docs/source/Components/数据格式/Message.md
    • Added documentation for the Message data format.
  • docs/source/Components/数据集/Dataset.md
    • Added documentation for the Dataset component.
  • docs/source/Components/数据集/IterableDataset.md
    • Added documentation for the IterableDataset component.
  • docs/source/Components/数据集/IterablePackingDataset.md
    • Added documentation for the IterablePackingDataset component.
  • docs/source/Components/数据集/LazyDataset.md
    • Added documentation for the LazyDataset component.
  • docs/source/Components/数据集/PackingDataset.md
    • Added documentation for the PackingDataset component.
  • docs/source/GetStarted/Installation.md
    • Added installation guide for 'Twinkle'.
  • docs/source/GetStarted/Quick-start.md
    • Added a quick-start guide for 'Twinkle', including its purpose and comparison with 'ms-swift'.
  • docs/source/_templates/autosummary/class.rst
    • Added Sphinx autosummary template for classes.
  • docs/source/_templates/classtemplate.rst
    • Added Sphinx class template.
  • docs/source/_templates/sobolengine.rst
    • Added Sphinx template for SobolEngine.
  • docs/source/conf.py
    • Added Sphinx configuration for Chinese documentation.
  • docs/source/index.rst
    • Added main index for Chinese documentation.
  • docs/source_en/.readthedocs.yaml
    • Added Read the Docs configuration for English documentation.
  • docs/source_en/_templates/autosummary/class.rst
    • Added Sphinx autosummary template for classes (English).
  • docs/source_en/_templates/classtemplate.rst
    • Added Sphinx class template (English).
  • docs/source_en/_templates/sobolengine.rst
    • Added Sphinx template for SobolEngine (English).
  • docs/source_en/conf.py
    • Added Sphinx configuration for English documentation.
  • docs/source_en/index.rst
    • Added main index for English documentation.
  • examples/expert_parallel/train_qwen3_30b_ep_fsdp_demo.py
    • Added an example for Qwen3-30B Expert Parallel (EP) and FSDP2 training.
  • pyproject.toml
    • Added 'twinkle' project metadata, dependencies, and optional dependencies.
  • src/twinkle/__init__.py
    • Modified to use lazy loading for core modules, improving startup performance.
  • src/twinkle/data_format/__init__.py
    • Initialized the data format module.
  • src/twinkle/data_format/input_feature.py
    • Defined the InputFeature TypedDict for model inputs.
  • src/twinkle/data_format/message.py
    • Defined Message, ToolCall, and Tool TypedDicts for conversational data.
  • src/twinkle/data_format/trajectory.py
    • Defined the Trajectory TypedDict for RL training data.
  • src/twinkle/dataloader/__init__.py
    • Initialized the dataloader module.
  • src/twinkle/dataloader/dataloader.py
    • Implemented DataLoader with retry mechanisms and device mesh sharding.
  • src/twinkle/dataloader/device_mesh_fetcher.py
    • Implemented DeviceMeshIterableFetcher for sharding iterable datasets across devices.
  • src/twinkle/dataloader/device_mesh_sampler.py
    • Implemented DeviceMeshSampler for batch sharding across data parallel ranks.
  • src/twinkle/dataloader/retry_sampler.py
    • Implemented RetrySampler for robust data loading with retries on failed samples.
  • src/twinkle/dataset/__init__.py
    • Initialized the dataset module.
  • src/twinkle/dataset/base.py
    • Implemented the Dataset base class and DatasetMeta for flexible data loading and preprocessing.
  • src/twinkle/dataset/iterable_dataset.py
    • Implemented IterableDataset for streaming data processing.
  • src/twinkle/dataset/iterable_packing_dataset.py
    • Implemented IterablePackingDataset for streaming bin-packing of data.
  • src/twinkle/dataset/lazy_dataset.py
    • Implemented LazyDataset for lazy encoding of data, useful for multimodal scenarios.
  • src/twinkle/dataset/packing_dataset.py
    • Implemented PackingDataset for efficient bin-packing of variable-length sequences.
  • src/twinkle/gym/__init__.py
    • Initialized the gym module.
  • src/twinkle/gym/base.py
    • Defined the Gym base class.
  • src/twinkle/hub/__init__.py
    • Initialized the hub module.
  • src/twinkle/hub/hub.py
    • Implemented HubOperation, MSHub, and HFHub for standardized model and dataset management across different platforms.
  • src/twinkle/infra/__init__.py
    • Refactored infrastructure initialization, including global settings for mode, seed, and device groups.
    • Introduced remote_class and remote_function decorators for distributed execution.
  • src/twinkle/infra/_ray/__init__.py
    • Initialized the Ray-specific infrastructure module.
  • src/twinkle/infra/_ray/ray_helper.py
    • Implemented RayHelper for managing Ray actors, executing remote calls, and handling distributed results.
  • src/twinkle/infra/_ray/resource_manager.py
    • Implemented ResourceManager for allocating and managing Ray placement groups and resources.
  • src/twinkle/kernel/README.md
    • Added a README for the kernel module, explaining layer-level and function-level kernelization.
  • src/twinkle/kernel/__init__.py
    • Initialized the kernel module with kernelize_model and registration functions for optimized operations.
  • src/twinkle/kernel/base.py
    • Defined base types and utility functions for the kernel module, including mode and device detection.
  • src/twinkle/kernel/function.py
    • Implemented function-level kernel registration and application for monkey-patching specific functions.
  • src/twinkle/kernel/layer.py
    • Implemented layer-level kernel registration and application for replacing entire nn.Module implementations.
  • src/twinkle/kernel/registry.py
    • Implemented kernel registries for managing layer and function kernel specifications.
  • src/twinkle/loss/__init__.py
    • Initialized the loss module with various loss functions and a mapping for easy access.
  • src/twinkle/loss/base.py
    • Defined the Loss abstract base class.
  • src/twinkle/loss/chunked_cross_entropy.py
    • Implemented ChunkedCrossEntropyLoss for memory-efficient loss calculation.
  • src/twinkle/loss/cross_entropy.py
    • Implemented CrossEntropyLoss.
  • src/twinkle/loss/grpo.py
    • Implemented GRPOLoss (Group Relative Policy Optimization) and its variants (GSPO, SAPO, CISPO, BNPO, DRGRPO).
  • src/twinkle/loss/mse.py
    • Implemented MSELoss.
  • src/twinkle/loss/vocab_parallel_cross_entropy.py
    • Implemented VocabParallelCrossEntropyLoss for Megatron training with tensor parallelism.
  • src/twinkle/loss_scale/__init__.py
    • Initialized the loss scale module.
  • src/twinkle/loss_scale/base.py
    • Defined the LossScale base class.
  • src/twinkle/metric/__init__.py
    • Initialized the metric module.
  • src/twinkle/metric/accuracy.py
    • Implemented the Accuracy metric.
  • src/twinkle/metric/base.py
    • Defined the Metric abstract base class.
  • src/twinkle/metric/loss.py
    • Implemented the LossMetric for tracking training loss.
  • src/twinkle/metric/train_metric.py
    • Implemented the TrainMetric for tracking training progress and speed.
  • src/twinkle/model/__init__.py
    • Initialized the model module, including Megatron-Core models.
  • src/twinkle/model/base.py
    • Defined the TwinkleModel abstract base class, outlining the common interface for all models.
  • src/twinkle/model/megatron/__init__.py
    • Initialized the Megatron model module for lazy loading of Megatron-Core components.
  • src/twinkle/model/megatron/args.py
    • Defined TwinkleMegatronArgs for managing Megatron-Core specific configurations.
  • src/twinkle/model/megatron/megatron.py
    • Implemented MegatronModel for integrating with Megatron-Core, including DDP wrapping and optimizer management.
  • src/twinkle/model/megatron/model/__init__.py
    • Initialized the Megatron model sub-module.
  • src/twinkle/model/megatron/model/constant.py
    • Defined constants for LLM and MLLM model types.
  • src/twinkle/model/megatron/model/gpt_bridge.py
    • Implemented GPTBridge for converting weights between HuggingFace and Megatron-Core formats.
  • src/twinkle/model/megatron/model/gpt_model.py
    • Implemented GPTModel as a wrapper for Megatron-Core's GPT model, including custom forward logic.
  • src/twinkle/model/megatron/model/gpts/__init__.py
    • Registered GPT models for Megatron-Core integration.
  • src/twinkle/model/megatron/model/mm_gpt_model.py
    • Implemented MultimodalGPTModel for handling multimodal inputs in Megatron-Core.
  • src/twinkle/model/megatron/model/mm_gpts/__init__.py
    • Initialized multimodal GPT models.
  • src/twinkle/model/megatron/model/mm_gpts/qwen.py
    • Implemented Qwen2/2.5-VL models for Megatron-Core.
  • src/twinkle/model/megatron/model/mm_gpts/qwen3_vl.py
    • Implemented Qwen3-VL models for Megatron-Core, including deepstack visual feature injection.
  • src/twinkle/model/megatron/model/mm_gpts/utils.py
    • Provided utility functions for multimodal models in Megatron-Core.
  • src/twinkle/model/megatron/model/register.py
    • Implemented a registry for Megatron-Core models.
  • src/twinkle/model/megatron/model/rope.py
    • Implemented RoPE (Rotary Positional Embedding) utilities for Megatron-Core.
  • src/twinkle/model/megatron/multi_lora_megatron.py
    • Implemented MultiLoraMegatronModel for managing multiple LoRA adapters within Megatron-Core.
  • src/twinkle/model/megatron/strategy/__init__.py
    • Initialized the Megatron strategy module.
  • src/twinkle/model/megatron/strategy/megatron.py
    • Implemented MegatronStrategy for handling distributed training specifics in Megatron-Core.
  • src/twinkle/model/megatron/tuners/__init__.py
    • Initialized the tuners module.
  • src/twinkle/model/megatron/tuners/lora.py
    • Implemented LoraParallelLinear for Megatron-compatible LoRA layers with tensor parallel support.
  • src/twinkle/model/megatron/tuners/utils.py
    • Provided utility functions for Megatron-Core tuners, including layer finding and deepcopy patching.
  • src/twinkle/model/megatron/utils/__init__.py
    • Initialized the Megatron utilities module.
  • src/twinkle/model/megatron/utils/config.py
    • Implemented configuration conversion from HuggingFace to Megatron-Core format.
  • src/twinkle/model/megatron/utils/utils.py
    • Provided general utility functions for Megatron-Core.
  • src/twinkle/model/multi_lora.py
    • Added the MultiLora class for managing multiple LoRA adapters on a single base model.
  • src/twinkle/model/transformers/__init__.py
    • Initialized the Transformers model module.
  • src/twinkle/model/transformers/multi_lora_transformers.py
    • Implemented MultiLoraTransformersModel for managing multiple LoRA adapters within the Transformers backend.
  • src/twinkle/model/transformers/strategy.py
    • Implemented TransformersStrategy for handling distributed training specifics in the Transformers backend.
  • src/twinkle/model/transformers/transformers.py
    • Implemented TransformersModel for integrating with HuggingFace Transformers.
  • src/twinkle/patch/__init__.py
    • Initialized the patch module.
  • src/twinkle/patch/megatron_peft.py
    • Implemented Megatron PEFT patching for compatibility.
  • src/twinkle/patch/vllm_lora_weights.py
    • Implemented vLLM LoRA weights patching for synchronization.
  • src/twinkle/processor/__init__.py
    • Initialized the processor module.
  • src/twinkle/processor/base.py
    • Defined Preprocessor and DataFilter abstract base classes.
  • src/twinkle/processor/competition_math.py
    • Implemented CompetitionMathProcessor.
  • src/twinkle/processor/competition_math_grpo.py
    • Implemented CompetitionMathGRPOProcessor.
  • src/twinkle/processor/input_processor.py
    • Implemented InputProcessor for handling model input collation.
  • src/twinkle/processor/self_cognition.py
    • Implemented SelfCognitionProcessor.
  • src/twinkle/rl/__init__.py
    • Initialized the RL module.
  • src/twinkle/rl/grpo_advantage.py
    • Implemented GRPOAdvantage for calculating advantages in GRPO.
  • src/twinkle/sampler/__init__.py
    • Initialized the sampler module.
  • src/twinkle/sampler/base.py
    • Defined the Sampler abstract base class.
  • src/twinkle/sampler/torch_sampler.py
    • Implemented TorchSampler.
  • src/twinkle/sampler/types.py
    • Defined types for sampler configurations and responses.
  • src/twinkle/sampler/vllm_engine.py
    • Implemented VLLMEngine for vLLM integration.
  • src/twinkle/sampler/vllm_sampler.py
    • Implemented VLLMSampler for efficient inference with vLLM.
  • src/twinkle/server/__init__.py
    • Initialized the server module.
  • src/twinkle/server/tinker/__init__.py
    • Initialized the Tinker server module.
  • src/twinkle/server/tinker/common.py
    • Provided common utilities for the Tinker server.
  • src/twinkle/server/tinker/model.py
    • Implemented the Tinker model server.
  • src/twinkle/server/tinker/sampler.py
    • Implemented the Tinker sampler server.
  • src/twinkle/server/tinker/server.py
    • Implemented the main Tinker server.
  • src/twinkle/server/twinkle/__init__.py
    • Initialized the native 'Twinkle' server module.
  • src/twinkle/server/twinkle/model.py
    • Implemented the native 'Twinkle' model server.
  • src/twinkle/server/twinkle/twinkle_server.py
    • Implemented the main native 'Twinkle' server.
  • src/twinkle/template/__init__.py
    • Initialized the template module.
  • src/twinkle/template/base.py
    • Defined the Template abstract base class for input formatting.
  • src/twinkle/template/qwen3.py
    • Implemented Qwen3Template.
  • src/twinkle/template/qwen3_vl.py
    • Implemented Qwen3VLTemplate for Vision-Language models.
  • src/twinkle/utils/__init__.py
    • Initialized the utilities module.
  • src/twinkle/utils/framework.py
    • Provided framework-specific utility functions.
  • src/twinkle/utils/import_utils.py
    • Provided import utility functions, including lazy loading.
  • src/twinkle/utils/platform.py
    • Provided platform detection and device management utilities.
  • src/twinkle/utils/plugin.py
    • Provided plugin management utilities.
  • src/twinkle/utils/processing_lock.py
    • Implemented a processing lock for concurrent operations.
  • src/twinkle/utils/torch_util.py
    • Provided PyTorch-specific utility functions.
  • src/twinkle/utils/utils.py
    • Provided general utility functions.
  • src/twinkle/version.py
    • Updated project version information.
  • src/twinkle/weight_loader/__init__.py
    • Initialized the weight loader module.
  • src/twinkle/weight_loader/base.py
    • Defined the WeightLoader abstract base class.
  • src/twinkle/weight_loader/ipc_loader.py
    • Implemented IPCWeightLoader for inter-process communication of weights.
  • src/twinkle/weight_loader/native_loader.py
    • Implemented NativeLoader for direct weight loading.
  • tests/infra/__init__.py
    • Renamed from src/twinkle/infra/ray/__init__.py.
  • tests/kernel/__init__.py
    • Renamed from src/twinkle/plugin/__init__.py.
Activity
  • The pull request was created by kevssim. No further human activity (comments, reviews) is recorded in the provided context.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request refactors the expert parallelism (EP) cookbook example for Qwen3 MoE models. The changes simplify the device mesh setup by introducing ep_size support into the DeviceMesh.from_sizes factory method, which is a nice API improvement. Consequently, the example script is updated to use this new, more declarative approach, and dependencies on DeviceGroup and remote_group are removed, streamlining the code. The script's execution mode is also switched from 'ray' to 'local', enhancing its accessibility for users without a Ray cluster. Additionally, a minor bug is fixed to prevent saving the model checkpoint at the very beginning of the training loop. My review includes a suggestion to add input validation to the new ep_size parameter for better robustness. Overall, these are solid improvements to the example.
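
To make the API change concrete: based only on the description above and the quoted lines +89 to +93, usage presumably looks something like the sketch below. The import path and the sibling parameter name (dp_size) are assumptions, not confirmed by this page.

```python
# Hypothetical usage of the updated factory method. Only ep_size is named
# in the review above; the import path and the dp_size parameter are
# assumptions, not taken from this PR's visible diff.
from twinkle import DeviceMesh

# 2-way data parallel x 4-way expert parallel on 8 ranks: per the quoted
# implementation, the factory appends an "ep" dimension to the mesh
# whenever ep_size is given.
mesh = DeviceMesh.from_sizes(dp_size=2, ep_size=4)
```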

Comment on lines +89 to +93:

```python
if ep_size is not None:
    mesh_dim_sizes.append(ep_size)
    mesh_dim_names.append("ep")
    if origin_world_size == 1:
        world_size *= ep_size
```

Severity: medium

To improve robustness, it's a good practice to validate input parameters. The ep_size should be a positive integer. Consider adding a check to ensure ep_size > 0 to prevent potential runtime errors during mesh creation if an invalid value is passed.

Suggested change:

```diff
 if ep_size is not None:
+    if ep_size <= 0:
+        raise ValueError(f'ep_size must be positive, but got {ep_size}')
     mesh_dim_sizes.append(ep_size)
     mesh_dim_names.append("ep")
     if origin_world_size == 1:
         world_size *= ep_size
```
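
Failing fast at this boundary is the point of the suggestion: a zero or negative ep_size would otherwise flow into mesh_dim_sizes and only surface later as an opaque shape or world-size error during mesh construction, far from the call site that passed the bad value.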
