SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs


🧰 What's Inside
- CNF-based problems spanning three orthogonal dimensions: instance scale, problem type, and question format.
- A PySAT-backed verifier that scores binary answers and exposes solver metadata for reproducible diagnostics.
- Ready-to-use datasets on Hugging Face, plus evaluation and RFT scripts.
⚡ Quickstart
from datasets import load_dataset
from satquest import CNF, create_problem, create_question

# Load one SATQuest item and rebuild its CNF from the stored DIMACS string.
item = load_dataset("sdpkjc/SATQuest", split="test")[0]
cnf = CNF(dimacs=item["sat_dimacs"])

# Pair a problem type (here SATSP, SAT solution prediction) with a question format.
problem = create_problem("SATSP", cnf)
question = create_question("math")

prompt = problem.accept(question)  # render the natural-language prompt
print(prompt)
print(problem.check(problem.solution))  # verify the reference solution -> True
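
The same two factory calls cover every pairing along the problem-type and question-format axes. Continuing from the snippet above, here is a minimal sketch that sweeps the pairings; the identifier strings are assumptions inferred from this README ("SATSP", "MCS", "MUS" for problem types; "math", "story", "dualstory" for formats), so consult the Examples page for the authoritative lists.

# Sweep problem types and question formats for one CNF instance.
# The identifier lists below are assumptions, not the full sets shipped by satquest.
for p_type in ["SATSP", "MCS", "MUS"]:
    for q_format in ["math", "story", "dualstory"]:
        problem = create_problem(p_type, cnf)
        question = create_question(q_format)
        print(f"--- {p_type} / {q_format} ---")
        print(problem.accept(question)[:200])  # preview the rendered prompt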
🗺️ Explore the Workflow
- Datasets: CNF schema, regeneration flags, and Hugging Face links.
- Evaluate: batch LLM scoring with Weave -> W&B logging.
- Finetuning: GRPO setup using verifier-driven reward functions (see the sketch after this list).
- Examples: prompts and API snippets for every problem/question pairing.
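
As a sketch of how the Finetuning step can wire the verifier into a reward: the function below returns 1.0 only when problem.check accepts the extracted answer. The \boxed{...} extraction pattern and the closure-based plumbing are hypothetical illustrations under assumed conventions, not the repo's actual GRPO code.

import re
from satquest import CNF, create_problem

def make_reward_fn(item):
    """Build a verifier-backed binary reward for one dataset item (hypothetical helper)."""
    problem = create_problem("SATSP", CNF(dimacs=item["sat_dimacs"]))

    def reward_fn(completion: str) -> float:
        # Hypothetical extraction: take the last \boxed{...} span as the answer;
        # the expected answer encoding is an assumption here.
        matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
        if not matches:
            return 0.0  # unparseable completions earn no reward
        # problem.check returns True iff the answer satisfies the verifier.
        return 1.0 if problem.check(matches[-1]) else 0.0

    return reward_fn

Because the reward is exactly the verifier's verdict, it needs no learned reward model and cannot be gamed by fluent-but-wrong completions.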
📊 Highlights from the Paper
- Reasoning-tuned models lead the leaderboard: o3-mini (0.56), DeepSeek-R1 (0.42), QwQ-32B (0.40), and DeepSeek-R1-Distill-Qwen-32B (0.39) outperform vanilla LLMs such as GPT-4.1 (0.38) and DeepSeek-V3-0324 (0.36), while Qwen2.5-7B-Instruct stays below 0.10 accuracy.
- Accuracy drops as solver effort grows; hallucinated "reasoning shortcuts" appear on hard MCS/MUS instances, underscoring the need for verifier feedback.
- Question presentation matters: math prompts are easiest, whereas story/dualstory formats expose sizable regressions for open-weight models.
- RFT experiments show verifier rewards extend reasoning traces and transfer to harder instances, but cross-format generalisation remains challenging.
📜 Citing SATQuest
@misc{zhao2025satquestverifierlogicalreasoning,
  title={SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs},
  author={Yanxiao Zhao and Yaqian Li and Zihao Bo and Rinyoichi Takezoe and Haojia Hui and Mo Guang and Lei Ren and Xiaolin Qin and Kaiwen Long},
  year={2025},
  eprint={2509.00930},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.00930},
}