This is quite a valuable resource from @GoogleA for evaluating the complex reasoning and numerical calculation capabilities of large language models. A few key takeaways:
'TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools'
📌 The TACT (Text And Calculations through Tables) dataset challenges LLMs' reasoning and computational abilities with complex instructions that require aggregating information scattered across a text and then integrating that information to generate the answer. Each TACT instance consists of the original text, a written instruction, and a gold answer, and solving it requires advanced text comprehension and reasoning.
📌 TACT was constructed by leveraging the InstructIE dataset, which contains texts and their associated tables. For each table, experts formulated new queries and gathered their respective answers. The dataset creation process involved 1) Initial review and relevance vetting, 2) Numerical aspect identification, 3) Natural language instruction formulation, 4) Natural language query over the table, 5) Translation to Pandas commands and gold response extraction, and 6) Command execution and validation.
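To make steps 4 to 6 concrete, here is a minimal sketch of turning a natural language query over a table into a Pandas command and executing it to obtain a gold answer. The table, the query, and the variable names are all made up for illustration; none of them come from TACT or InstructIE.

```python
import pandas as pd

# Hypothetical table standing in for one paired with a text (not real TACT data).
table = pd.DataFrame(
    {
        "player": ["Ann", "Ben", "Cara", "Dev"],
        "team": ["Red", "Blue", "Red", "Blue"],
        "points": [12, 7, 9, 15],
    }
)

# Step 4 (assumed example): natural language query over the table:
#   "How many points did the Red team score in total?"
# Step 5: the same query written as a Pandas command, kept as a string.
command = 'table.loc[table["team"] == "Red", "points"].sum()'

# Step 6: execute the command to extract the gold answer.
# eval is used only because the command here is a trusted, hand-written string.
gold_answer = int(eval(command, {"table": table}))

print(gold_answer)
```

Keeping the command as a string mirrors the idea that it is an artifact that can be stored, executed, and checked separately from the table it runs on.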
📌 Experiments show that all contemporary LLMs perform poorly on TACT, achieving an accuracy below 38%. To pinpoint the difficulties, the authors analyze model performance across three components: table-generation, Pandas command-generation, and execution.
📌 To address these challenges, the authors propose the "IE as a tool" framework. The key idea is to solve TACT instructions by invoking two tools in sequence: one that generates a table from the text and instruction, and one that generates a corresponding Pandas command. The model then uses the command, together with the original instruction and text, to produce the final answer. The authors implement each tool with few-shot prompting.
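The two-tool flow described above can be sketched roughly as follows. This is not the authors' implementation: the function names, the signatures, and the choice to run the command with `eval` are assumptions, and the stub tools return hard-coded values where the real tools would be few-shot-prompted LLM calls.

```python
import pandas as pd


def solve_with_ie_tools(text, instruction, table_tool, command_tool, answer_tool):
    """Hypothetical orchestration: table tool, then command tool, then final answer."""
    table = table_tool(text, instruction)        # tool 1: text + instruction -> table
    command = command_tool(table, instruction)   # tool 2: table + instruction -> Pandas command
    # Assumption: the command is run on the generated table before answering.
    result = eval(command, {"table": table, "pd": pd})
    return answer_tool(text, instruction, table, command, result)


# Stub tools (hard-coded for illustration only).
def stub_table_tool(text, instruction):
    return pd.DataFrame({"team": ["Red", "Blue", "Red"], "points": [12, 7, 9]})


def stub_command_tool(table, instruction):
    return 'table.loc[table["team"] == "Red", "points"].sum()'


def stub_answer_tool(text, instruction, table, command, result):
    # A real model would reason over all inputs; the stub just reports the result.
    return str(int(result))


text = "Ann scored 12 points for Red. Ben scored 7 for Blue. Cara added 9 for Red."
instruction = "What is the total number of points scored by Red?"
answer = solve_with_ie_tools(
    text, instruction, stub_table_tool, stub_command_tool, stub_answer_tool
)
print(answer)
```

Passing the tools in as functions keeps the sketch independent of any particular model API, so each stage can be swapped or evaluated on its own.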
📌 The "IE as a tool" approach shows a 12% improvement over existing prompting techniques on TACT. Analyzing performance on the individual table-generation and Pandas command-generation tasks reveals significant headroom in each, which suggests that focused few-shot prompting can considerably enhance performance and is consistent with the authors' finding that each dissected component of the TACT task has untapped potential for improvement.