StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion
ThinkingF pipeline
- Formal knowledge distillation: 183K Lean4-aligned pairs via majority voting BEq filtering
- Template-guided reasoning: 5.8K trajectories (Claude 3.7)
- Training: 2-stage SFT (formal <think> reasoning) RLVR (BEq-based reward, GRPO DAPO)
- Backbone: DeepSeek-R1-Distill-Qwen (7B / 32B)
Data sourcing
Source: 256K NL math problems (NuminaMath-1.5)
- Kimina-Autoformalizer generates 16 formalizations per prompt
- Filtered via syntax check → BEq clustering → majority vote → DeepSeek-V3 validation → 183K pairs
Benchmarks
Main Results (BEq@1)
FormalMATH-Lite (in-domain)
- Kimina-7B: 35.1
- StepFun-7B: 38.3
- StepFun-32B: 40.5
ProverBench (OOD)
- Kimina-7B: 13.3
- StepFun-7B: 25.1
- StepFun-32B: 26.7
CombiBench (real-world)
- Kimina-7B: 2.6
- StepFun-7B: 5.2
- StepFun-32B: 6.9
Ablation (ProverBench BEq@16)
- Full pipeline: 37.9
- w/o reasoning: 25.3
- w/o knowledge: 37.4
End-to-End Theorem Proving (10K informal problems)
- StepFun-7B: 4940 provable
- Kimina-7B: 4549 provable
StepFun-Formalizer surpasses both general and specialized models, scales to OOD, and improves real-world verifiability.