This repo contains the utilities K2V uses to synthesize checklists.
-
After synthesizing fill-in-the-blank QA pairs with graphgen-mask, synthesize a question-specific checklist for each QA pair.
bash data_postprocess/run.sh \
    --input_file \
    --output_dir \
    --model_path Qwen/Qwen2.5-72B-Instruct \
    --inference_mode offline \
    --batch_size 2000 \
    --tensor_parallel_size 8 \
    --domain EN_AGRI
-
data_postprocess/run.sh automatically runs a data filtering pipeline. To retain all the data, skip the pipeline and run the synthesis script directly:
python data_postprocess/synthesize_checklist/synthesize_checklist.py
-
Convert the data from JSON format to Parquet format.
python verl/convert_json_to_parquet.py
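
For readers who want to see what a JSON-to-Parquet conversion involves, here is a minimal sketch. It is not the repo's `verl/convert_json_to_parquet.py`; the function names and file layout are assumptions made for illustration. It assumes the input is either a top-level JSON array of records or a JSON Lines file, and that pandas plus a Parquet engine such as pyarrow are installed.

```python
import json
from pathlib import Path


def load_records(json_path):
    """Load dict records from a JSON array file or a JSON Lines file."""
    text = Path(json_path).read_text(encoding="utf-8").strip()
    if text.startswith("["):
        # Whole file is one JSON array of records.
        return json.loads(text)
    # Otherwise treat each non-empty line as one JSON record.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def json_to_parquet(json_path, parquet_path):
    """Write the records in json_path to parquet_path; return the row count."""
    # Imported here so load_records can be used without pandas installed.
    # DataFrame.to_parquet also needs a Parquet engine (pyarrow or fastparquet).
    import pandas as pd

    records = load_records(json_path)
    pd.DataFrame(records).to_parquet(parquet_path, index=False)
    return len(records)
```

Calling `json_to_parquet("train.json", "train.parquet")` (hypothetical file names) would write one Parquet row per JSON record. The column layout that training expects is defined by the repo's own script, not by this sketch.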
-
Verl requires a validation set to start training. Here, we randomly sample a subset of the training set and use it as the validation set.
python verl/get_val_dataset.py
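
The sampling step described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `verl/get_val_dataset.py`: the function name, the `val_size` parameter, and the fixed seed are all hypothetical.

```python
import random


def sample_validation(records, val_size, seed=0):
    """Return up to val_size records drawn without replacement from records.

    The training list itself is left unchanged, so the validation set
    is a subset of the training set.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(val_size, len(records))  # never ask for more rows than exist
    return rng.sample(records, k)
```

Because the validation rows are also in the training set, validation metrics from such a split satisfy the requirement for a validation set but are not a held-out estimate of generalization.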