Nemotron SSH training and Kaggle-compatible submission workflow for the NVIDIA Nemotron reasoning challenge.
- `src/nemotron_model/data_bridge.py` - prepares balanced SFT JSONL files from Kaggle `train.csv`
  - keeps the safer final-answer formatting we already validated locally
  - still exposes the earlier Tinker subcommands if we want that path again later
- `scripts/train_trl_kaggle_sim.py` - SSH/HPC-side LoRA or warm-start adapter training with transformers + peft + trl, designed for a Kaggle-like offline setup with local model paths
- `scripts/build_improved_notebook.py` - converts the original Kaggle notebook into the improved notebook version
- `scripts/sync_repo_to_hpc.py` - uploads this repo to the HFUT SSH machine over SFTP
- `docs/classify_divide_trace_sft.md` - task-by-task notes for the chosen classify-divide + stable trace + SFT approach
- `slurm/train_nemotron_sft.sbatch` - Slurm training body
- `slurm/submit_train.sh` - helper wrapper so partition/account/qos can be passed at submit time
The HFUT SSH endpoint is reachable and login works:

- host: `210.45.253.131`
- port: `20003`
- user format: `u` + student id
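The login-name convention above can be captured in a one-line helper; this is a sketch of the convention only, and the student id shown is the one used in this repo's sync example, not a new value:

```python
def hpc_username(student_id: str) -> str:
    """HFUT login name: the letter 'u' followed by the student id."""
    return f"u{student_id}"

# Student id taken from the sync_repo_to_hpc.py example later in this README.
print(hpc_username("2025171971"))  # u2025171971
```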
Important findings from the live probe:
- the SSH endpoint is a Slurm entry machine, not a ready-to-train GPU shell
- `srun --partition=8-card --gres=gpu:1 ...` currently fails for this account with: `invalid account or account/partition combination specified`
- login node Python is only 3.6.8
- outbound container pulls from Docker Hub timed out
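The 3.6.8 interpreter is the concrete blocker for the training stack: recent transformers/peft/trl releases require a newer Python (the 3.9 floor below is an assumption based on current release requirements, so treat it as a guard, not a spec). A minimal fail-fast check:

```python
import sys

# Assumption: recent transformers/peft/trl releases need Python >= 3.9,
# so the login-node interpreter (3.6.8) cannot run the training stack.
MIN_PYTHON = (3, 9)

def interpreter_ok(version=sys.version_info) -> bool:
    """True when the interpreter is new enough for the training dependencies."""
    return tuple(version[:2]) >= MIN_PYTHON

print(interpreter_ok((3, 6, 8)))    # False: the login-node Python
print(interpreter_ok((3, 10, 12)))  # True
```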
So this repo is ready for the SSH route, but the actual full Nemotron training still needs:
- GPU partition access for the account
- either a newer Python environment on the cluster, or a working container/image path
Do not commit Kaggle competition data into Git.
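One way to enforce this locally is a `.gitignore` guard; the paths below are assumptions about where the data and artifacts would sit, not the repo's actual ignore file:

```gitignore
# Kaggle competition data (never committed)
data/
*.csv

# adapters and checkpoints
*.safetensors
checkpoint-*/
```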
Prepare the balanced SFT data from the Kaggle `train.csv`:

```bash
python -m nemotron_model.data_bridge prepare \
  --train-csv /path/to/train.csv \
  --output-dir /path/to/prepared \
  --target-samples-per-task 1200 \
  --val-fraction 0.05 \
  --val-min-size-per-task 2
```

This writes:

- `train_sft.jsonl`
- `val_sft.jsonl`
- `dataset_summary.json`
- `task_strategy_report.json`
- `trace_preview.md`
Run the SSH/HPC-side training (LoRA, optionally warm-started from an existing adapter):

```bash
python scripts/train_trl_kaggle_sim.py \
  --model-path /path/to/base-model \
  --train-jsonl /path/to/prepared/train_sft.jsonl \
  --val-jsonl /path/to/prepared/val_sft.jsonl \
  --output-dir /path/to/outputs/nemotron-sft \
  --warm-start-adapter /path/to/tong-adapter \
  --max-length 4096 \
  --per-device-train-batch-size 1 \
  --gradient-accumulation-steps 16 \
  --num-train-epochs 1 \
  --learning-rate 5e-5
```

Edit nothing inside the sbatch body unless you have to. Pass cluster-specific values at submit time:
```bash
REPO_DIR=$HOME/Nemotron-Model \
PARTITION=8-card \
QOS=duzhan \
GPU_COUNT=1 \
CPUS_PER_TASK=16 \
MEMORY=120G \
TIME_LIMIT=24:00:00 \
BASE_MODEL_PATH=/path/to/base-model \
TRAIN_JSONL=/path/to/prepared/train_sft.jsonl \
VAL_JSONL=/path/to/prepared/val_sft.jsonl \
OUTPUT_DIR=$HOME/nemotron-runs/run1 \
WARM_START_ADAPTER=/path/to/tong-adapter \
bash slurm/submit_train.sh
```

If your cluster admin later gives you an account name, also pass:

```bash
ACCOUNT=<your_gpu_account>
```

Upload the repo to the cluster:

```bash
python scripts/sync_repo_to_hpc.py \
  --host 210.45.253.131 \
  --port 20003 \
  --user u2025171971 \
  --remote-dir /home/u2025171971/Nemotron-Model
```

The password can be supplied with `--password` or the `HPC_PASSWORD` environment variable.
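Since the repo deliberately excludes raw data, adapters, and checkpoints, the sync script needs an exclusion filter so none of that is pushed by accident. A sketch of that idea (the patterns are assumptions for illustration; the real list lives in `scripts/sync_repo_to_hpc.py`):

```python
import fnmatch

# Hypothetical exclusion patterns mirroring the repo's "no data, no checkpoints" policy.
EXCLUDE_PATTERNS = [
    "*.csv",            # Kaggle competition data stays local
    "*.safetensors",    # adapters / checkpoints
    "checkpoint-*/*",
    ".git/*",
    "__pycache__/*",
]

def should_upload(relpath: str) -> bool:
    """Return True if the file is safe to push to the cluster."""
    return not any(fnmatch.fnmatch(relpath, pat) for pat in EXCLUDE_PATTERNS)

print(should_upload("src/nemotron_model/data_bridge.py"))  # True
print(should_upload("data/train.csv"))                     # False
```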
- This repo intentionally excludes Kaggle raw data, adapters, and checkpoints.
- The current SSH route is prepared and tested up to remote login, file sync, and cluster probing.
- The blocking issue is cluster authorization for the GPU partition, not the training code layout.