The script naming convention follows this format:

`r{x}s{x}{description}`

- `r{x}`: Round number (iteration stage)
- `s{x}`: Step number (specific process within the round)
- `{description}`: Brief explanation of the step
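As a small illustration of the convention, the sketch below parses a script name into its round, step, and description parts. The underscore separators are an assumption inferred from the step references used later in this document (r0_s1, r1_s2, and so on); adjust the pattern if the actual file names differ.

```python
import re

# Hypothetical parser for the naming convention above. Underscore
# separators are assumed, matching step references such as r0_s1.
NAME_RE = re.compile(r"^r(?P<round>\d+)_s(?P<step>\d+)_(?P<description>.+)$")

def parse_script_name(name: str) -> dict:
    """Split a script name into round number, step number, and description."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid script name: {name!r}")
    return {
        "round": int(match.group("round")),
        "step": int(match.group("step")),
        "description": match.group("description"),
    }

print(parse_script_name("r1_s2_convert_raw_to_csv"))
# {'round': 1, 'step': 2, 'description': 'convert_raw_to_csv'}
```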
- Calls `src.data.sft_generator`, which generates the initial synthetic fine-tuning dataset.
- Uses datasets A, B, and C to generate training data, while dataset D is used to generate prompts.
- The output is stored in `syn_datasets/`.
- The training framework used is OpenRLHF, with training scripts available in `openrlhf_train_scripts/train_sft_all.sh`.
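The division of roles above (A/B/C supply training rows, D supplies prompts) can be sketched as follows. All function and field names here are illustrative; this is not the actual `src.data.sft_generator` interface.

```python
# Hypothetical sketch: rows from datasets A, B, and C become SFT
# responses, while rows from dataset D drive the prompt text.

def build_sft_dataset(train_sources: dict, prompt_rows: list) -> list:
    """Pair each training row with a prompt derived from dataset D."""
    samples = []
    for source_name, rows in train_sources.items():
        for i, row in enumerate(rows):
            # Cycle through dataset-D rows to produce prompt text.
            prompt_row = prompt_rows[i % len(prompt_rows)]
            samples.append({
                "prompt": f"Generate a record similar to: {prompt_row}",
                "response": str(row),
                "source": source_name,
            })
    return samples

sft = build_sft_dataset(
    {"A": [{"x": 1}], "B": [{"x": 2}], "C": [{"x": 3}]},
    [{"y": 9}],
)
print(len(sft))  # one sample per training row
```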
- Calls `src.data.sft_federated`, which creates datasets for downstream tasks.
- The output is stored in `syned_datasets/original_0_sft_fed/`.
- Calls `src.data.synscore_anchors`, which clusters the test set and extracts category centers to reduce the size of the scoring dataset.
- The output is stored in `./score_anchors/`.
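The anchor-extraction idea can be sketched with a minimal k-means: cluster the test set and keep only the cluster centers, so the scoring dataset shrinks from many rows to `num_clusters` representatives. This stands in for `src.data.synscore_anchors`, whose actual interface may differ.

```python
import numpy as np

# Minimal k-means sketch of anchor extraction (illustrative, not the
# real src.data.synscore_anchors implementation).
def extract_anchors(points: np.ndarray, num_clusters: int, iters: int = 20,
                    seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen distinct points.
    centers = points[rng.choice(len(points), size=num_clusters, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for k in range(num_clusters):
            members = points[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers

test_set = np.concatenate([np.zeros((50, 2)), np.ones((50, 2)) * 10])
anchors = extract_anchors(test_set, num_clusters=2)
print(anchors.shape)  # (2, 2): two category centers instead of 100 rows
```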
- Calls `syn_addprompt.py`, which performs three main tasks:
  - Synthetic Answer Generation (`syn`): Uses the test set from r0_s1 to generate synthetic answers.
  - Answer Extraction (`extract`): Converts answers into a standardized format based on `src.data.schema`, which varies by dataset.
  - Prompt Augmentation (`addprompt`): Integrates scoring datasets from r0_s3 into each synthetic data sample, storing them in the `ds0` and `ds1` fields.
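The prompt-augmentation task can be sketched as a simple per-sample enrichment: each synthetic sample receives a copy of the two scoring datasets in its `ds0` and `ds1` fields. The field names follow the description above; everything else is illustrative, not the actual `syn_addprompt.py` logic.

```python
import copy

# Illustrative sketch of the addprompt task: attach the r0_s3 scoring
# datasets to every synthetic sample under ds0 and ds1.
def add_scoring_prompts(samples: list, ds0: list, ds1: list) -> list:
    augmented = []
    for sample in samples:
        enriched = copy.deepcopy(sample)
        enriched["ds0"] = ds0  # scoring dataset from r0_s3
        enriched["ds1"] = ds1
        augmented.append(enriched)
    return augmented

out = add_scoring_prompts(
    [{"answer": "42"}], ds0=[{"anchor": 1}], ds1=[{"anchor": 2}]
)
print(sorted(out[0].keys()))  # ['answer', 'ds0', 'ds1']
```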
- Calls `convert_raw_to_csv.py`, which:
  - Converts the raw dataset from r1_s1 into CSV format.
  - Removes unnecessary fields.
- Similar to r0_s2, but generates downstream task datasets only for training.
- Uses the CSV data generated in r1_s2.
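The conversion step above amounts to writing the r1_s1 records to CSV while dropping fields the later steps do not need. Which fields count as "unnecessary" is an assumption here; the sketch uses `ds0`/`ds1` as placeholders.

```python
import csv
import io

# Assumed set of "unnecessary" fields to drop; the real script's list
# may differ.
DROP_FIELDS = {"ds0", "ds1"}

def records_to_csv(records: list) -> str:
    """Serialize records to CSV text, dropping unneeded fields."""
    cleaned = [{k: v for k, v in r.items() if k not in DROP_FIELDS}
               for r in records]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(cleaned[0]))
    writer.writeheader()
    writer.writerows(cleaned)
    return buf.getvalue()

csv_text = records_to_csv([
    {"age": 39, "income": ">50K", "ds0": [1], "ds1": [2]},
    {"age": 25, "income": "<=50K", "ds0": [3], "ds1": [4]},
])
print(csv_text.splitlines()[0])  # age,income
```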
- Calls `syn_addscore.py`, which:
  - Uses the `ds0` and `ds1` scoring datasets from r1_s1.
  - Assigns a score to each data sample and stores it in the `score` field.
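The scoring step can be sketched as below. A toy heuristic stands in for the evaluation model configured under `client_config`; the real `syn_addscore.py` queries a served reward model instead.

```python
# Sketch of the scoring step: every sample gains a numeric "score"
# field. The reward function here is a placeholder heuristic, not the
# actual model-based scorer.

def toy_reward(sample: dict) -> float:
    # Hypothetical stand-in for the reward model: reward longer answers.
    return float(len(sample.get("answer", "")))

def add_scores(samples: list, reward_fn=toy_reward) -> list:
    return [{**s, "score": reward_fn(s)} for s in samples]

scored = add_scores([{"answer": "short"}, {"answer": "much longer answer"}])
print([s["score"] for s in scored])  # [5.0, 18.0]
```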
- Calls `convert_score_to_prefer.py`, which:
  - Ranks synthetic data based on the `score` field.
  - Generates preference pairs for DPO (Direct Preference Optimization) training.
  - Outputs data in OpenRLHF format, with training scripts available in `openrlhf_train_scripts/train_dpo.sh`.
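The ranking-to-pairs conversion can be sketched as follows: group candidates by prompt, sort by score, and emit `prompt`/`chosen`/`rejected` records. The best-vs-worst pairing strategy is an assumption, not the exact script logic.

```python
from collections import defaultdict

# Sketch of score-to-preference conversion for DPO training data.
def to_preference_pairs(samples: list) -> list:
    by_prompt = defaultdict(list)
    for s in samples:
        by_prompt[s["prompt"]].append(s)
    pairs = []
    for prompt, group in by_prompt.items():
        ranked = sorted(group, key=lambda s: s["score"], reverse=True)
        # Pair the best-scored answer against the worst-scored one.
        if len(ranked) >= 2:
            pairs.append({
                "prompt": prompt,
                "chosen": ranked[0]["answer"],
                "rejected": ranked[-1]["answer"],
            })
    return pairs

pairs = to_preference_pairs([
    {"prompt": "p", "answer": "good", "score": 0.9},
    {"prompt": "p", "answer": "bad", "score": 0.1},
])
print(pairs[0]["chosen"])  # good
```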
- `./src/data/synscore_anchors.yaml`: The `num_clusters` parameter determines the size of the reward dataset. A larger value ensures better representation but increases computational cost.
- `./syn_addprompt.yaml`: The `num_repeat` parameter controls the number of synthetic prompts generated per data point. Increasing this value yields a larger synthetic dataset. If set high, adjust `syn_config.sampling_params.temperature` and `top_p` accordingly.
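For orientation, a hypothetical excerpt of `./syn_addprompt.yaml` might look like the following. Only `num_repeat`, `syn_config.sampling_params.temperature`, and `top_p` are documented above; the surrounding structure and values are illustrative assumptions.

```yaml
# Illustrative excerpt only; key layout and values are assumptions.
num_repeat: 4            # synthetic prompts generated per data point
syn_config:
  sampling_params:
    temperature: 1.0     # raise for more diversity when num_repeat is high
    top_p: 0.95
```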
- Synthetic model training set (`syn_datasets`): Used to train the cold-start model. Ideally, each node should provide its own cold-start model; for consistency, however, a unified synthetic cold-start model is used, trained on the `abalone`, `adult`, `buddy`, and `california` datasets to ensure it has not seen the `med` and `fin` datasets.
- Synthetic data model configuration (`./syn_addprompt.yaml`): Each dataset has its own configuration, set via `syn_model: "syn_checkpoints/llama32-3b-sft_syn_diabetes"`.
- Reward model: The evaluation model used for reward scoring is defined in `syn_addscore.py` under `client_config`. It should be a merged model, trained on r1_s3 data to produce the `scalebio` merge model.