🧠 Finetuning
SATQuest provides a reinforcement learning workflow powered by GRPO that encourages models to produce well-structured reasoning traces (`<think>...</think>`) and exact binary answers (`<answer>...</answer>`). This page explains how to launch the reference setup and how to adapt it to your infrastructure.
📝 Requirements
- 8 GPUs (the reference run uses two groups of four for VLLM serving and GRPO training). Adjust device mappings if you have fewer cards.
- CUDA-ready environment with the `uv` dependencies from the `rft` group installed (`uv sync --group rft`).
- Access to the `sdpkjc/SATQuest-RFT-3k` dataset or your own regeneration.
- A base instruction-tuned model. We test with `Qwen/Qwen2.5-7B-Instruct`, but any chat model with `<think>`/`<answer>`-style outputs works if you update the reward functions.
🚀 Launch the VLLM Server
Start an inference server that the trainer will query for rollout generation:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup uv run --group rft trl vllm-serve \
    --model Qwen/Qwen2.5-7B-Instruct \
    --tensor_parallel_size 4 \
    --max_model_len 16384 \
    --gpu_memory_utilization 0.9 \
    --enable_prefix_caching True &
```
- `tensor_parallel_size` should match the number of GPUs allocated to serving.
- `max_model_len` must accommodate the longest prompt + completion you expect (GRPO defaults to 2048 + 8192 tokens).
- Run the command under `nohup` or a process manager so the server remains available when you start training.
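Before launching the trainer, it can help to confirm that the serving process is reachable. This is an optional sanity check, not part of the SATQuest scripts; the `localhost:8000` address assumes trl's default `vllm-serve` port on the same machine, so adjust it to your setup.

```python
import socket

# Optional sanity check: confirm the rollout server accepts TCP connections.
# Host and port are assumptions (trl's vllm-serve defaults to port 8000);
# change them if your server listens elsewhere.
with socket.create_connection(("localhost", 8000), timeout=5):
    print("vLLM server is reachable")
```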
🧠 Run the GRPO Trainer
In a separate shell, launch the trainer against the server:
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 uv run --group rft accelerate launch \
    --num-processes 4 --config-file zero3.yaml \
    rft.py --model-id Qwen/Qwen2.5-7B-Instruct \
    --p-list SATSP --q-list math \
    --exp-name qwen2p5_satsp_math \
    --server-ip 0.0.0.0
```
`rft.py` loads the RFT dataset, expands it across the specified problem/question types, shuffles the examples, and feeds them into a `GRPOTrainer`. The configuration in code sets the following (see the sketch below):

- `num_generations=16` rollouts per prompt,
- `per_device_train_batch_size=2` with `gradient_accumulation_steps=16`,
- a cosine LR schedule with `learning_rate=2e-6`, and
- reward weights `[1.0, 0.05, 0.05]` for correctness, tag coverage, and format matching.
Modify `zero3.yaml` if you need a different parallelism strategy.
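For orientation, here is a minimal sketch of what those settings look like as a trl `GRPOConfig`. It is not the literal code in `rft.py`; the argument names follow trl's public API, and anything not listed above (such as the output directory) is an assumption.

```python
from trl import GRPOConfig

# Sketch of the trainer settings described above; not the literal rft.py source.
training_args = GRPOConfig(
    output_dir="qwen2p5_satsp_math",    # assumed to mirror --exp-name
    num_generations=16,                 # rollouts per prompt
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-6,
    lr_scheduler_type="cosine",
    max_prompt_length=2048,             # prompt budget noted in the server section
    max_completion_length=8192,         # completion budget noted in the server section
    reward_weights=[1.0, 0.05, 0.05],   # correctness, tag coverage, format
    use_vllm=True,                      # rollouts come from the vLLM server
    save_only_model=True,
)
```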
💰 Reward Functions
Three reward components are applied to each completion:
- `score_reward`: extracts `<answer>...</answer>`, normalises it with `re_matcher`, and checks correctness via the SATQuest verifier.
- `tag_count_reward`: encourages exactly one `<think>` and `<answer>` pair to avoid degenerate outputs.
- `format_reward`: measures how much of the completion falls inside the desired tag structure.
Adjust the weights or implement new reward callables if you want to include latency penalties, reasoning-length bonuses, or other task-specific signals.
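As an illustration, a hypothetical reasoning-length bonus could look like the sketch below. The `(completions, **kwargs) -> list[float]` shape follows trl's convention for custom `GRPOTrainer` reward functions; the name `think_length_bonus`, the 1,000-word budget, and the string handling are assumptions you would tune for your task.

```python
import re

def think_length_bonus(completions, **kwargs):
    """Hypothetical extra reward: favour concise <think> blocks."""
    rewards = []
    for completion in completions:
        # Conversational datasets yield a list of messages per completion;
        # plain-text datasets yield a string. Handle both.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
        n_words = len(match.group(1).split()) if match else 0
        # Full bonus under the (assumed) 1,000-word budget, decaying linearly beyond it.
        rewards.append(max(0.0, 1.0 - n_words / 1000))
    return rewards
```

If you add a callable like this, append a matching entry to the reward weights so its contribution stays small relative to correctness.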
🎨 Customising the Training Dataset
- Use `--p-list`/`--q-list` to control which problem classes appear in the rollout queue. Mixing `SATSP` with `SATDP_UNSAT`, for example, teaches the model to output both satisfying assignments and UNSAT certificates.
- Regenerate data with `gen_cnf_rft_dataset.py` to explore other clause densities. The trainer only reads `cnf_dimacs`, `prompt`, and `p_type`, so you can augment records with extra metadata without code changes.
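Because extra columns are ignored, you can enrich records freely. The sketch below, which assumes a `train` split and invents a `num_clauses` field, shows one way to do this with the Hugging Face `datasets` API.

```python
from datasets import load_dataset

# "train" is an assumed split name; num_clauses is a hypothetical extra column
# that the trainer will simply ignore.
ds = load_dataset("sdpkjc/SATQuest-RFT-3k", split="train")

def add_clause_count(record):
    # Standard DIMACS header: "p cnf <num_vars> <num_clauses>"
    header = next(line for line in record["cnf_dimacs"].splitlines() if line.startswith("p cnf"))
    record["num_clauses"] = int(header.split()[3])
    return record

ds = ds.map(add_clause_count)
```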
📺 Monitoring and Checkpoints
Training logs are written to the directory named after `exp_name`. Because `save_only_model=True`, checkpoints contain only the model weights, so use the same tokenizer when you resume. Keep an eye on rollout rewards to catch API failures from the VLLM server; flat reward curves usually indicate parsing errors or truncated completions, so consider lowering `max_completion_length` or increasing the server timeout.
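When you reload a checkpoint, something like the following works; the checkpoint path is hypothetical, and the tokenizer comes from the base model because only weights are saved.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint directory produced under the exp_name folder.
model = AutoModelForCausalLM.from_pretrained("qwen2p5_satsp_math/checkpoint-100")
# save_only_model=True means no tokenizer files in the checkpoint, so load it from the base model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```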
🔜 Next Steps
- Plug the finetuned model into the evaluation script for a side-by-side comparison with the base model.
- Extend `make_process_fn` in `rft.py` if you want to insert chain-of-thought exemplars or additional instructions before reinforcement learning.