2/n Comparison with other work
1. Similarity-based Retrieval (RAG for Code)
Prior Work: Approaches like DrCode (Zhang et al., 2023), CodeT5 (Wang et al., 2023), and CoCoNut (Yu et al., 2023) embed code snippets and retrieve them by semantic similarity.
Contrast with CODEXGRAPH: CODEXGRAPH argues that similarity-based retrieval struggles with complex tasks that require multi-hop reasoning and an understanding of code structure. Finding similar snippets is often insufficient for tasks such as code completion in a large repository, where the dependencies and relationships between code elements are crucial.
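To make the multi-hop point concrete, here is a minimal Python sketch. The call graph, the function names in it, and the helper `transitive_dependencies` are all hypothetical and are not taken from CODEXGRAPH. The sketch only illustrates how following dependency edges can surface code that a lookup by textual similarity would likely miss.

```python
from collections import deque

# Hypothetical call graph: each key calls the functions in its list.
CALLS = {
    "train_model": ["load_dataset", "build_optimizer"],
    "load_dataset": ["parse_config"],
    "build_optimizer": ["parse_config"],
    "parse_config": ["read_yaml"],
    "read_yaml": [],
}


def transitive_dependencies(graph, start):
    """Return every function reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for callee in graph.get(current, []):
            if callee not in seen:
                seen.add(callee)
                order.append(callee)
                queue.append(callee)
    return order


if __name__ == "__main__":
    print(transitive_dependencies(CALLS, "train_model"))
```

In this toy graph, `read_yaml` sits three hops away from `train_model` and shares no obvious vocabulary with it, so it is reached only by walking the edges.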
2. Tool/API-based Interfaces
Prior Work: These approaches use LLMs that interact with APIs for specific tasks, such as code summarization (Deshpande et al., 2024) or bug fixing (Arora et al., 2024).
Contrast with CODEXGRAPH: CODEXGRAPH argues that these approaches are often task-specific and take significant manual effort to design and implement, so building a separate tool or API for every code-related task does not scale. CODEXGRAPH instead aims for a more generalizable solution in which the same code graph database serves many tasks.
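As a rough illustration of "one store, many tasks", the Python sketch below answers two unrelated questions with a single generic matcher over typed edges. The edge list, the relation names (`HAS_METHOD`, `CALLS`, `INHERITS`), and the `match` helper are assumptions made for this example; they are not CODEXGRAPH's schema or query interface.

```python
# Hypothetical typed edges: (source, relation, target).
EDGES = [
    ("Trainer", "HAS_METHOD", "Trainer.fit"),
    ("Trainer", "HAS_METHOD", "Trainer.evaluate"),
    ("Trainer.fit", "CALLS", "compute_loss"),
    ("Trainer.evaluate", "CALLS", "compute_loss"),
    ("Trainer", "INHERITS", "BaseRunner"),
]


def match(edges, source=None, relation=None, target=None):
    """Return the edges that agree with every field that is not None."""
    return [
        (s, r, t)
        for (s, r, t) in edges
        if (source is None or s == source)
        and (relation is None or r == relation)
        and (target is None or t == target)
    ]


# Task A (e.g. summarizing a class): which methods does Trainer define?
methods = [t for (_, _, t) in match(EDGES, source="Trainer", relation="HAS_METHOD")]

# Task B (e.g. gauging the impact of a bug fix): who calls compute_loss?
callers = [s for (s, _, _) in match(EDGES, relation="CALLS", target="compute_loss")]
```

Neither task needed its own tool; only the pattern passed to `match` changed.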
3. Hybrid Approaches
Prior Work: Some approaches (Orwall, 2024) combine similarity-based retrieval with a limited set of manually designed tools or APIs.
Contrast with CODEXGRAPH: While hybrid approaches acknowledge the need for both retrieval and structured reasoning, they still rely on manually crafted tools for specific tasks. CODEXGRAPH aims to provide a more unified and flexible framework through the code graph database.
4. Agentless (Xia et al., 2024)
Prior Work: Agentless preprocesses the code repository's structure and file skeleton to make it easier for LLMs to understand.
Contrast with CODEXGRAPH: While Agentless improves an LLM's understanding of code structure, the preprocessed information must still fit in the LLM's limited context window. CODEXGRAPH instead lets the LLM interact with the codebase through a queryable database, which gives it more efficient and targeted access to the relevant information.
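The difference between handing over a whole skeleton and querying for one slice of it can be sketched in Python as follows. Everything here is an assumption for illustration: the `SOURCE` snippet, the `methods` table, and the helpers `build_index` and `methods_of` are hypothetical, and SQLite is used only because it ships with Python. It stands in for a graph database and is not meant to reflect CODEXGRAPH's actual storage or query language.

```python
import ast
import sqlite3

# Hypothetical source file standing in for a repository.
SOURCE = '''
class Cache:
    def get(self, key):
        return self.store.get(key)

    def put(self, key, value):
        self.store[key] = value

def helper():
    pass
'''


def build_index(source):
    """Store one (class, method, line) row for every method found in `source`."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE methods (class_name TEXT, method_name TEXT, line INTEGER)"
    )
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    db.execute(
                        "INSERT INTO methods VALUES (?, ?, ?)",
                        (node.name, item.name, item.lineno),
                    )
    return db


def methods_of(db, class_name):
    """Return only the rows for one class, ordered by line number."""
    rows = db.execute(
        "SELECT method_name, line FROM methods WHERE class_name = ? ORDER BY line",
        (class_name,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    index = build_index(SOURCE)
    print(methods_of(index, "Cache"))
```

A caller asking about `Cache` gets back only that class's methods and their locations, however large the indexed repository is, rather than a full skeleton that has to fit in the prompt.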
In summary: CODEXGRAPH distinguishes itself from prior work by proposing a code graph database as a central, task-agnostic interface between LLMs and code repositories. The approach aims to overcome three limitations: purely similarity-based retrieval, manual tool design, and reliance on the LLM's limited context window. The stated goal is more effective and generalizable LLM-based code understanding and manipulation.