CODEI/O: Condensing Reasoning Patterns in Code into Training Data for LLMs
Leveraging Code for Advanced Reasoning
While abundant training data exists for tasks such as math problem solving or code generation, many other domains—especially those that require broad logical, scientific, symbolic, and commonsense reasoning—suffer from sparse and fragmented supervision. Without sufficient rich and diverse signals, models struggle to develop robust, generalizable reasoning skills.
The key insight of the paper is that everyday code is a treasure trove of diverse reasoning patterns. Real-world code inherently encodes logical flow planning, state-space searching, recursive decomposition, and decision-making processes. However, the raw code as available in repositories can be too noisy or tangled with syntax-specific details to be directly useful for training general reasoning models. To address this, the authors introduce CODEI/O, an approach that transforms raw code into a structured training signal by converting it into an input–output prediction task.
Transforming Code into Reasoning Data
The CODEI/O framework begins by collecting raw code files from multiple sources, such as CodeMix and specialized subsets like PyEdu-R, ensuring a balance of algorithmic, mathematical, and logic-intensive content. This code is then preprocessed into a unified and executable format where non-essential elements (like visualization commands) are removed. The cleaned code includes a main entrypoint and is refactored to clearly present the core logic.
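As a rough illustration of the unified format described above, the sketch below shows what a cleaned sample might look like. The function names (`count_paths`, `main_solution`) and the plain-data input and output convention are assumptions made for this example; the text only states that each cleaned file has a main entrypoint and that non-essential elements are removed.

```python
# Hypothetical cleaned sample. Plotting and printing calls from the raw file are
# assumed to have been stripped, leaving only the core logic and one entrypoint.

def count_paths(rows, cols):
    """Core logic: count right/down paths across a rows x cols grid of cells."""
    # Every cell in the first row and first column has exactly one path to it.
    table = [[1] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            # A cell is reached either from the cell above or from the cell to the left.
            table[r][c] = table[r - 1][c] + table[r][c - 1]
    return table[-1][-1]


def main_solution(rows, cols):
    """Main entrypoint (assumed name): plain-data arguments in, plain-data result out."""
    return {"paths": count_paths(rows, cols)}
```

Keeping a single entrypoint with plain-data arguments makes the function easy to execute repeatedly on sampled inputs, which is what the next step relies on.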
Next, for each function extracted from this code, multiple input–output pairs are generated. These pairs stem from controlled sampling of input values and executing the code to obtain deterministic outputs. Crucially, the training tasks go further by incorporating Chain-of-Thought (CoT) rationales. Rather than merely predicting the output, the model is trained to express the reasoning behind the prediction entirely in natural language. This decouples the inherent logic from language-specific syntax, allowing the models to internalize universal reasoning primitives.
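The pair-generation step can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: `main_solution`, `input_generator`, `sample_io_pairs`, `build_tasks`, and the prompt wording are all hypothetical. It samples inputs with a seeded generator, executes the reference function to obtain deterministic outputs, and turns each pair into an output-prediction task and an input-prediction task.

```python
import json
import random


def main_solution(values, target):
    """Hypothetical reference function: indices of two values that sum to target."""
    seen = {}
    for i, v in enumerate(values):
        if target - v in seen:
            return [seen[target - v], i]
        seen[v] = i
    return None


def input_generator(rng):
    """Controlled sampling (assumed design): small lists that always contain a solution."""
    values = [rng.randint(0, 20) for _ in range(rng.randint(2, 6))]
    i, j = rng.sample(range(len(values)), 2)
    return {"values": values, "target": values[i] + values[j]}


def sample_io_pairs(func, gen, n, seed=0):
    """Execute func on n sampled inputs; a fixed seed makes the pairs reproducible."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        inputs = gen(rng)
        pairs.append({"input": inputs, "output": func(**inputs)})
    return pairs


def build_tasks(code_text, pair):
    """Turn one pair into two prediction tasks; the prompt wording is illustrative only."""
    given_in = json.dumps(pair["input"])
    given_out = json.dumps(pair["output"])
    return [
        {"task": "output_prediction",
         "prompt": f"{code_text}\n\nInput: {given_in}\n"
                   "Predict the output, reasoning in natural language.",
         "answer": pair["output"]},
        {"task": "input_prediction",
         "prompt": f"{code_text}\n\nOutput: {given_out}\n"
                   "Predict a feasible input, reasoning in natural language.",
         "answer": pair["input"]},
    ]
```

In this sketch the ground-truth answer comes from execution, so no human labeling is needed; only the natural-language rationale has to be produced by a model.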
Incorporating Multi-turn Revision: CODEI/O++
Generating coherent reasoning chains is challenging, so some predictions are initially incorrect. The authors address this with a feedback-based, multi-turn revision process, yielding an enhanced variant termed CODEI/O++. In this setup, each prediction is verified by re-executing the code, and the model is prompted to revise any prediction that fails the check. The final training sample is the concatenation of the initial response and one or more revision rounds. Although the gains plateau after the first revision turn, this approach still outperforms training on the uncorrected responses alone.
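A minimal sketch of the verify-and-revise loop is given below, under several assumptions: the model is represented by a stand-in callable `respond(inputs, feedback)`, the feedback message format is invented for illustration, and all function names are hypothetical. The sketch checks an output prediction by executing the reference code, feeds the result back, and concatenates every turn into one sample.

```python
import json


def main_solution(n):
    """Hypothetical reference function: sum of the decimal digits of n."""
    return sum(int(d) for d in str(n))


def verify_output_prediction(func, inputs, predicted):
    """Re-execute the code and compare against the prediction (feedback text is assumed)."""
    if func(**inputs) == predicted:
        return True, "Correct."
    return False, (f"Incorrect: executing the code on {json.dumps(inputs)} "
                   f"does not return {json.dumps(predicted)}.")


def build_revision_sample(func, inputs, respond, max_revisions=1):
    """Collect the initial response plus up to max_revisions revised responses."""
    turns = []
    feedback = None
    for _ in range(max_revisions + 1):
        rationale, predicted = respond(inputs, feedback)
        ok, feedback = verify_output_prediction(func, inputs, predicted)
        turns.append({"response": rationale, "prediction": predicted, "feedback": feedback})
        if ok:
            break  # no revision needed once the prediction passes verification
    # Concatenate all turns into a single training text.
    text = "\n\n".join(f"{t['response']}\n[Feedback] {t['feedback']}" for t in turns)
    return {"turns": turns, "text": text, "final_correct": ok}


def fake_model(inputs, feedback):
    """Stand-in for an LLM: wrong on the first try, corrected after feedback."""
    if feedback is None:
        return "The digits are 1, 2, 3, so the sum is 5.", 5
    return "Rechecking the addition: 1 + 2 + 3 = 6.", 6
```

With `max_revisions=1`, calling `build_revision_sample(main_solution, {"n": 123}, fake_model)` produces a two-turn sample whose first turn is kept even though it is wrong, mirroring the idea that the final sample contains the initial response followed by its revision.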
The training of models using CODEI/O is organized in two distinct stages:
Stage 1 – Pre-training with CODEI/O:
Models are exposed to a large dataset (with over 3.5 million samples) constructed with the method described above. This pre-training reinforces broad reasoning ability by repeatedly exposing the model to diverse logical, symbolic, and procedural reasoning patterns distilled from code.
Stage 2 – General Instruction Tuning:
Following the reasoning-focused pre-training, models are then fine-tuned using a more general instruction-tuning dataset. This second stage helps models adapt their newly acquired reasoning capabilities to a wide range of downstream tasks, ensuring versatility.
The experiments are conducted on multiple advanced base models—including Qwen 2.5 Coder (7B parameters), LLaMA 3.1 (8B), DeepSeek Coder v2 Lite (16B), and Gemma 2 (27B)—and evaluated across a rich selection of benchmarks. These benchmarks span many domains: commonsense reasoning (e.g., WinoGrande, BBH), numerical and symbolic reasoning (e.g., DROP, GSM8K, MATH, MMLU-STEM), logical problem solving (e.g., GPQA, CruxEval, ZebraGrid, KorBench), and even code output prediction tasks (e.g., LeetCode-O, LiveBench).
Noteworthy Experimental Results
The experimental results underscore several key findings:
Balanced Improvements Across Domains:
Despite being based solely on code-derived data, CODEI/O shows consistent gains across a wide range of reasoning tasks. Unlike baselines that excel in only specific domains, CODEI/O improves performance in symbolic, logical, mathematical, and commonsense reasoning, demonstrating its generalizability.
Effectiveness of Multi-turn Revisions:
Incorporating execution feedback and a single revision turn (forming CODEI/O++) yields further performance gains. Although additional revision turns bring diminishing returns, the first revision clearly boosts accuracy without compromising balance across tasks.
Scalability Benefits:
The authors show that increasing the number of training samples and the number of input–output pairs per code sample leads to more robust reasoning abilities. This scaling effect reinforces that the benefits come not just from a larger dataset but from a carefully constructed one that contains diverse, repeatable reasoning patterns.
Ablation Studies Validate Design Choices:
Several ablation experiments demonstrate that separating the prediction of inputs from outputs, retaining even erroneous responses for diversity, and carefully structuring prompt-response formats all contribute to the observed performance improvements.
Conclusion
In summary, the paper presents CODEI/O, a method that condenses diverse reasoning patterns from code into training data for LLMs. By transforming raw code into an input–output prediction task augmented with natural language rationales and multi-turn revision, the approach bridges the gap between code-specific execution and general reasoning. Evaluations across multiple base models and benchmarks indicate that the method improves reasoning in a balanced and scalable way rather than in a single domain.