https://arxiv.org/pdf/2407.18069
Abdolmahdi Bagheri, School of Electrical and Computer Engineering, University of Tehran
Matin Alinejad, Electrical Engineering Department, Sharif University of Technology
Kevin Bello, Machine Learning Department, Carnegie Mellon University
Alireza Akhondi-Asl, Department of Anaesthesia, Harvard Medical School
Abstract
Causal reasoning is the primary bottleneck that Large Language Models (LLMs) must overcome to attain human-level intelligence. To address this, we introduce the Causal Chain of Prompting (C2P) as the first reasoning framework that equips current LLMs with causal reasoning capabilities. C2P operates autonomously, avoiding reliance on external tools or modules during both the causal learning and reasoning phases, and can be seamlessly implemented during the training or fine-tuning of LLMs. Experimental results across various benchmark datasets demonstrate a significant improvement in the causal learning and subsequent reasoning accuracy of LLMs. We illustrate how C2P enhances LLMs’ ability to reason causally in real-world scenarios, addressing complex problems in fields such as healthcare, medicine, economics, education, social sciences, environmental science, and marketing. With few-shot learning on as few as six examples, GPT-4 Turbo using C2P achieves more than a 33% increase in reasoning accuracy over state-of-the-art LLMs, which perform nearly at random in similar circumstances. This demonstrates the transformative potential of integrating C2P into LLM training or fine-tuning processes, thereby empowering these models with advanced causal reasoning capabilities.
1 Introduction
Recent advancements in Large Language Models (LLMs) have impacted existing AI paradigms and heightened expectations regarding AI’s capabilities (Achiam et al., 2023; Brown et al., 2020). Despite significant architectural differences among LLMs, they generally produce outputs based on the most likely results learned from vast amounts of training data (Vaswani et al., 2017). This enables them to acquire extensive knowledge ranging from common sense to specialized domains such as mathematics and science (Jiralerspong et al., 2024). Nevertheless, the inefficiency of LLMs in addressing causal reasoning questions remains their primary bottleneck, and even simple tasks can completely break down reasoning in state-of-the-art LLMs (Nezhurina et al., 2024). Additionally, studies such as Kalai & Vempala (2023); Xu et al. (2024) have demonstrated that, despite the training data containing numerous examples of interventions, outcomes, and explanations, as well as similar tasks, hallucinatory responses persist and causal reasoning capability is lacking.
Figure 1: Example of standard prompting vs. few-shot-learned GPT-4 with C2P on an open problem in astrophysics (Pasquato et al., 2023).
As a result, while they may talk causality, they are not causal (Zečević et al., 2023). This deficiency represents a fundamental drawback of LLMs as AI systems compared to human intelligence, which goes beyond mere correlations and depends on causal relationships for decision-making (Penn & Povinelli, 2007; Anwar et al., 2024).
Recently, answering cause-and-effect questions with LLMs has gained extensive interest (Shin et al., 2020; Jin et al., 2023b; Ashwani et al., 2024). It is important to note that LLMs have been used in conjunction with external tools to extract causal structures, as demonstrated in (Jiralerspong et al., 2024). However, their architectures lack specialized modules specifically designed to enhance the understanding of cause-and-effect relationships within their outputs (Wang et al., 2023; Imani et al., 2023). Aside from studies that reason causally based on the knowledge already contained in their training data (Petroni et al., 2019; Jiang et al., 2020) and those that apply LLMs to causality (Kıcıman et al., 2023; Zhang et al., 2023; Feder et al., 2024; Khatibi et al., 2024), chain-of-thought prompting was presented in (Wei et al., 2022) as one of the initial attempts at enhancing reasoning in LLMs, showing improvement based on the data of the given query. However, LLMs still struggle with rigorous numerical and abstract reasoning, among many other tasks (Xu et al., 2023). For example, a recent work, the Causal Reasoning Assessment Benchmark (CRAB, Romanou et al., 2023), is designed to evaluate the causal understanding of events in real-world narratives; that study demonstrated that most systems perform poorly on cause-and-effect identification tasks. Similarly, in (Jin et al., 2023b), the CORR2CAUSE dataset is introduced, demonstrating that current models often perform no better than random chance when tasked with causal questions. Following that, in (Jin et al., 2023a), the CLADDER dataset is introduced to assess Average Treatment Effects with LLMs, and it is demonstrated that these models struggle with causal tasks. In that study, the proposed CAUSALCoT framework makes progress in evaluating the average treatment effect when LLMs are provided with a collection of causal graphs and various types of queries (associational, interventional, and counterfactual), such as those included in the CLADDER dataset. More recently, in (Ashwani et al., 2024), a novel architecture called the Context-Aware Reasoning Enhancement with Counterfactual Analysis (CARE-CA) framework is presented to enhance causal reasoning and explainability. Their proposed framework incorporates an external explicit causal detection module based on ConceptNet (Speer et al., 2017) and counterfactual statements, as well as implicit causal detection through LLMs, showing progress in causal reasoning for short and simple queries. Several other works at the intersection of causal inference and LLMs are discussed in an extensive survey by Liu et al. (2024).
As argued by Pearl (1995), causal Directed Acyclic Graphs (DAGs), together with d-separation, allow the investigation of cause-and-effect relationships without relying on structural equation models in computational studies. Inspired by Pearl’s foundational work, we propose a novel framework, named Causal Chain of Prompting (C2P), to address the inefficiencies of LLMs in handling causal queries (see Figure 1 for an example). We show that by identifying the adjacency matrix, which serves as an equivalent of the causal DAG in Pearl’s work, the cause-and-effect relationships among the variables in the premise can be effectively reasoned about within the context of language models. In contrast to existing research on the weaknesses of LLMs in causal reasoning, C2P operates autonomously, avoiding reliance on external tools or modules during both the learning and reasoning phases when answering causal questions. Additionally, C2P can easily be incorporated into the training or fine-tuning process of LLMs. C2P consists of five simple main sub-tasks, as follows: (1) prompting to extract the random variables from the provided data; (2) prompting to extract all conditional and unconditional relations, as well as any cause-and-effect relations explicitly mentioned among the random variables; (3) prompting to create the initial adjacency matrix, in which every element is set to 1 except the diagonal elements and the elements corresponding to effect-to-cause directions of the stated relations (the corresponding cause-to-effect elements remain 1); (4) prompting over the conditional and unconditional independencies and identifying colliders, step by step, to extract the causal adjacency matrix (a sketch of steps (3) and (4) is given below); and (5) prompting for the reasoning questions or hypotheses. To evaluate the accuracy and reliability of implementing C2P on LLMs, we initially assess it using publicly available synthetic datasets such as (Jin et al., 2023b). Subsequently, we evaluate it in more realistic and complex real-world scenarios presented in (Pearl & Mackenzie, 2018) and (Pasquato et al., 2023). Moreover, we present results on few-shot learning with C2P in both synthetic and realistic scenarios.
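To make the matrix-construction sub-tasks concrete, the following minimal sketch illustrates how steps (3) and (4) could operate on the quantities the LLM is prompted to extract in steps (1)-(2): the variable names, any explicitly stated cause-effect pairs, and a list of (conditional or unconditional) independencies. This is our own illustration under those assumptions, not code from the paper; the function names (initial_adjacency, prune_with_independencies) and data layout are hypothetical, and the collider-orientation part of step (4) is omitted.

    # Hypothetical illustration of C2P steps (3)-(4): building the initial
    # adjacency matrix and pruning it with independencies reported by the LLM.
    import numpy as np

    def initial_adjacency(variables, stated_causes):
        """Step (3): set every element to 1, zero the diagonal, and zero the
        effect-to-cause direction for each explicitly stated cause-effect pair
        (the cause-to-effect entry stays 1)."""
        n = len(variables)
        idx = {v: i for i, v in enumerate(variables)}
        A = np.ones((n, n), dtype=int)
        np.fill_diagonal(A, 0)
        for cause, effect in stated_causes:
            A[idx[effect], idx[cause]] = 0  # remove effect -> cause direction
        return A, idx

    def prune_with_independencies(A, idx, independencies):
        """Part of step (4): remove both directions of an edge whenever X is
        reported independent of Y (possibly given a conditioning set)."""
        for x, y, _given in independencies:
            A[idx[x], idx[y]] = 0
            A[idx[y], idx[x]] = 0
        return A

    # Toy premise: "A causes C; A and B are independent."
    variables = ["A", "B", "C"]
    stated_causes = [("A", "C")]
    independencies = [("A", "B", [])]  # (X, Y, conditioning set)
    A, idx = initial_adjacency(variables, stated_causes)
    A = prune_with_independencies(A, idx, independencies)
    print(A)

In C2P itself, each of these operations is carried out by the LLM through prompting rather than by external code; the sketch only makes explicit what the intermediate adjacency-matrix representation looks like at each step.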
Contributions. In this work, we present several important contributions to facilitate causal reasoning in language models. Concretely,