If you are trying to get LLMs to follow workflows reliably, the paper “FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents” introduces FlowBench, the first benchmark designed to evaluate LLM agents on planning tasks guided by workflow knowledge. Skip to Appendix section C.2 for the relevant prompts and data input formats.
🔹 Key Challenges Addressed:
1. Planning Hallucinations: LLM agents often generate actions that conflict with task knowledge, especially in expertise-intensive tasks.
2. Workflow Knowledge Integration: The paper formalizes and tests different formats of workflow knowledge (text, code, flowchart) for improving planning accuracy.
3. Comprehensiveness of Evaluation: FlowBench covers 51 scenarios across 6 domains, providing a multi-tiered evaluation framework to assess how well LLM agents use workflow knowledge in planning.
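To make the three workflow formats concrete, here is a minimal sketch of one workflow written as text, code, and a flowchart, then embedded in a prompt. Everything in it is an assumption for illustration: the refund scenario, the Mermaid-style flowchart notation, and the `build_prompt` helper are hypothetical and are not taken from the paper or its Appendix C.2 prompts.

```python
# Hypothetical refund workflow expressed three ways (illustrative only,
# not copied from FlowBench).
TEXT_WORKFLOW = (
    "First ask for the order ID. If the order was delivered within 30 days, "
    "issue a refund; otherwise escalate to a human agent."
)

# The workflow as pseudo-code; kept as a string because it is prompt content,
# not something this script executes.
CODE_WORKFLOW = '''
def handle_refund(order):
    if order.days_since_delivery <= 30:
        return issue_refund(order)
    return escalate_to_human(order)
'''

# The workflow as a Mermaid-style flowchart (assumed notation).
FLOWCHART_WORKFLOW = '''
flowchart TD
    A[Ask for order ID] --> B{Delivered within 30 days?}
    B -- yes --> C[Issue refund]
    B -- no --> D[Escalate to human agent]
'''

WORKFLOWS = {
    "text": TEXT_WORKFLOW,
    "code": CODE_WORKFLOW,
    "flowchart": FLOWCHART_WORKFLOW,
}


def build_prompt(fmt: str, user_message: str) -> str:
    """Embed the chosen workflow representation in a simple agent prompt."""
    workflow = WORKFLOWS[fmt].strip()
    return (
        f"Follow this workflow ({fmt} format) when planning your next action:\n"
        f"{workflow}\n\n"
        f"User: {user_message}\n"
        "Agent:"
    )


if __name__ == "__main__":
    print(build_prompt("flowchart", "I want a refund for my order."))
```

Swapping the `fmt` argument is enough to compare how the same agent behaves under each representation, which is the kind of comparison the benchmark is built around.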
🔹 Main Contributions:
1. Workflow Formalization: The paper revisits different workflow formats, including natural language, symbolic code, and flowchart schema, offering insights into their efficacy in guiding LLM agents.
2. FlowBench: A benchmark that includes diverse domains like customer service, personal assistants, and robotic process automation. It provides structured tasks for LLM agents, assessing their planning reliability across various real-world scenarios.
3. Evaluation Framework: The study evaluates agents at two levels, static turn-level and dynamic session-level, to measure how well they use workflow knowledge.
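The difference between the two evaluation levels can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it assumes the turn-level setting scores each predicted action against a reference given the reference history, and that the session-level setting lets the agent act on its own earlier outputs so that mistakes compound. The exact-match scoring, the function names, and the toy agent are all hypothetical, and the simulated user and metrics used in the paper are not reproduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical agent interface: takes the action history, returns the next action.
Agent = Callable[[List[str]], str]


def turn_level_accuracy(
    agent: Agent, sessions: List[List[Tuple[List[str], str]]]
) -> float:
    """Static view: each turn is scored on its own, given the reference history."""
    correct = 0
    total = 0
    for session in sessions:
        for history, gold_action in session:
            correct += int(agent(history) == gold_action)
            total += 1
    return correct / total if total else 0.0


def session_level_success(
    agent: Agent, gold_actions: List[str], max_turns: int = 10
) -> bool:
    """Dynamic view: the agent builds on its own history; one wrong step fails the session."""
    history: List[str] = []
    for _ in range(max_turns):
        history.append(agent(history))
        if history == gold_actions:
            return True
        if history != gold_actions[: len(history)]:
            return False
    return False


if __name__ == "__main__":
    script = ["ask_id", "check_date", "refund"]

    def toy_agent(history: List[str]) -> str:
        # Replays a fixed script, ignoring the content of the history.
        return script[len(history)] if len(history) < len(script) else "done"

    sessions = [[([], "ask_id"), (["ask_id"], "check_date"),
                 (["ask_id", "check_date"], "escalate")]]
    print(turn_level_accuracy(toy_agent, sessions))
    print(session_level_success(toy_agent, ["ask_id", "check_date", "escalate"]))
```

The toy run shows why both views are useful: an agent can get most individual turns right and still fail the whole session because of a single wrong step.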
🔹 Key Findings:
1. Flowchart Superiority: Flowcharts strike the best balance among performance, adaptability, and user-friendliness, outperforming text and code formats in both single-scenario and cross-scenario evaluations.
2. Need for Improvement: Even the best-performing LLM (GPT-4o) struggles with planning in certain tasks, highlighting the need for further research.
Results: The study shows that structured workflow knowledge, especially in flowchart format, significantly improves LLM agents’ planning, but many real-world tasks remain out of reach.
Paper: arXiv:2406.14884v1