
Large Language Models Can Self-Correct
with Key Condition Verification


1Xi'an Jiaotong University, 2University of Notre Dame

*Equal contribution
†Correspondence to: chaoshen@xjtu.edu.cn
ProCo

ProCo performs three steps. (1) Initialization: use the CoT method to generate an initial answer. (2) Verification: mask the key condition in the question, add the previously generated answer as a new condition, and construct a verification question. Solve the verification question to obtain a verified answer, and check whether the verified answer and the key condition are equivalent. If they are equivalent, the previously generated answer is adopted as the final answer; otherwise, it is added to the set of potentially incorrect answers. (3) Correction: use the set of potentially incorrect answers as feedback to correct the previously generated answer. By iteratively executing steps (2) and (3), the performance of LLMs on various complex reasoning tasks is progressively enhanced.
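Below is a minimal Python sketch of this verify-then-correct loop, assuming the model is exposed as a generic `llm` callable (prompt string in, response string out). The prompt wording, the number-masking heuristic, and the equivalence check are illustrative assumptions, not the released implementation.

```python
import re

def proco(question: str, llm, max_iters: int = 3, initial_answer=None) -> str:
    """Iterative verify-then-correct loop (illustrative sketch).

    `llm` is any callable mapping a prompt string to a response string,
    e.g. a thin wrapper around a chat-completion API.
    """
    wrong_answers = []  # set of potentially incorrect answers

    # (1) Initialization: generate an initial answer with chain-of-thought prompting.
    answer = initial_answer or llm(f"{question}\nLet's think step by step.")

    for _ in range(max_iters):
        # (2) Verification: mask a key condition (here, simply the first number
        # in an arithmetic question) and use the current answer as a new condition.
        key_condition = re.search(r"\d+(?:\.\d+)?", question).group()
        masked_question = question.replace(key_condition, "X", 1)
        verification_question = (
            f"{masked_question}\nSuppose the answer to this question is {answer}. "
            f"What is the value of X?"
        )
        verified = llm(verification_question)

        # Crude check that the verified answer recovers the key condition.
        if key_condition in verified:
            return answer  # verified: adopt the current answer as the final answer

        # (3) Correction: use the potentially incorrect answers as feedback.
        wrong_answers.append(answer)
        answer = llm(
            f"{question}\nThe answer is likely not one of {wrong_answers}. "
            f"Please re-solve the question and give a corrected answer."
        )

    return answer
```

Any backend can be plugged in by passing a different `llm` callable, for example a wrapper around GPT-3.5-Turbo-1106 or Mixtral-8x7B.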

Introduction

Intrinsic self-correction is a method that instructs large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, a prior study concluded that LLMs cannot yet self-correct their reasoning. We find that a simple yet effective prompting method enables LLMs to identify and correct inaccurate answers without external feedback: mask a key condition in the question, add the current response as a new condition to construct a verification question, and ask the model to predict the masked condition in order to verify the response. The key condition can be an entity in an open-domain question or a numerical value in an arithmetic question, and it requires minimal effort (via prompting) to identify. We propose ProCo, an iterative verify-then-correct framework that progressively identifies and corrects (probably) false responses. We conduct experiments on three reasoning tasks. On average, ProCo with GPT-3.5-Turbo-1106 as the backend LLM yields +6.8 exact match on four open-domain question answering datasets, +14.1 accuracy on three arithmetic reasoning datasets, and +9.6 accuracy on a commonsense reasoning dataset, compared to Self-Correct.
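For intuition, here is a made-up arithmetic example (not taken from any benchmark) of how a verification question can be constructed by masking the key numeric condition and adding the current answer as a new condition:

```python
question = "A baker made 24 cookies and sold 9 of them. How many cookies are left?"
initial_answer = "15"  # answer produced by the initial CoT pass

# Mask the key condition (the number 24) and add the current answer as a new condition.
verification_question = (
    "A baker made X cookies and sold 9 of them. "
    "Suppose the number of cookies left is 15. What is the value of X?"
)

# If the model recovers X = 24, the masked key condition is verified and the
# initial answer 15 is adopted; otherwise 15 is added to the set of
# potentially incorrect answers and passed to the correction step.
```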

Experiment Results

Comparison with Baseline Methods

We evaluate ProCo on three complex reasoning tasks: arithmetic reasoning (GSM8K, AQuA, and MATH), open-domain question answering (NQ, TriviaQA, WebQ, and HotpotQA), and commonsense reasoning (CSQA). We compare ProCo with three types of baselines: (1) methods that use LLM-generated documents: GenRead; (2) methods that use search-engine-retrieved documents: RAG; and (3) methods without external documents: CoT, CoVe, and Self-Correct. All of these methods serve as baselines for the open-domain question answering and commonsense reasoning tasks. For arithmetic reasoning, where external documents are unnecessary, only CoT and Self-Correct are used. These baselines can also be combined with ProCo, for instance by using GenRead to generate an initial answer and ProCo to refine it (GenRead + ProCo); a sketch of this composition is given below the results table.

| Method | NQ (EM) | NQ (F1) | TriviaQA (EM) | TriviaQA (F1) | WebQ (EM) | WebQ (F1) | HotpotQA (EM) | HotpotQA (F1) | CSQA (Acc.) | GSM8K (Acc.) | AQuA (Acc.) | MATH (Acc.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Using LLMs to generate problem-related documents* | | | | | | | | | | | | |
| GenRead | 42.2 | 49.4 | 70.8 | 74.8 | 41.3 | 48.5 | 38.0 | 43.2 | 67.3 | ---- | ---- | ---- |
| GenRead + ProCo | 48.3 | 55.6 | 78.4 | **82.4** | 46.7 | 53.9 | **47.0** | **51.0** | **76.4** | ---- | ---- | ---- |
| *Using search engines to retrieve problem-related documents* | | | | | | | | | | | | |
| RAG | 45.3 | 52.4 | 72.7 | 76.4 | 40.1 | 46.9 | 37.0 | 41.1 | 65.9 | ---- | ---- | ---- |
| RAG + ProCo | **48.5** | **56.0** | 78.4 | 82.1 | 45.2 | 52.5 | 39.0 | 44.2 | 74.2 | ---- | ---- | ---- |
| *Direct question answering without external documents* | | | | | | | | | | | | |
| CoT | 40.3 | 46.4 | 69.2 | 72.2 | 38.2 | 44.6 | 28.0 | 31.2 | 72.9 | 78.6 | 51.3 | 37.9 |
| Self-Correct | 40.1 | 47.1 | 71.3 | 74.1 | 39.2 | 45.7 | 29.0 | 32.4 | 65.9 | 75.1 | 48.7 | 27.6 |
| CoVe | 43.4 | 48.9 | 76.4 | 79.4 | 43.1 | 49.0 | 31.0 | 35.2 | 73.1 | ---- | ---- | ---- |
| ProCo | 48.0 | 54.8 | **78.7** | 82.1 | **47.0** | **57.0** | 33.0 | 36.2 | 75.5 | **87.1** | **65.2** | **41.5** |

Performance on the NQ, TriviaQA, WebQ, HotpotQA, CSQA, GSM8K, AQuA, and MATH benchmarks with GPT-3.5-Turbo-1106 (black-box LLM) and Mixtral-8x7B (open-source LLM) as backend LLMs. The best performance for each dataset is highlighted in bold. ProCo improves baseline methods that use external documents across all benchmarks and outperforms all baselines that do not use external documents.
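As an illustration of the "baseline + ProCo" rows above, here is a hypothetical composition with a GenRead-style initializer, reusing the `proco` sketch from earlier; the prompt wording is again an assumption rather than the authors' exact prompts.

```python
def genread_plus_proco(question: str, llm, max_iters: int = 3) -> str:
    # GenRead-style initialization: first generate a problem-related document,
    # then answer the question conditioned on that document.
    document = llm(f"Generate a short background document that helps answer: {question}")
    initial_answer = llm(f"{document}\n\nQuestion: {question}\nAnswer:")

    # Hand the initial answer to ProCo's verify-then-correct loop instead of
    # letting it run its own CoT initialization (see the `proco` sketch above).
    return proco(question, llm, max_iters=max_iters, initial_answer=initial_answer)
```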


BibTeX


      @inproceedings{wu2024proco,
        title={Large Language Models Can Self-Correct with Key Condition Verification}, 
        author={Zhenyu Wu and Qingkai Zeng and Zhihan Zhang and Zhaoxuan Tan and Chao Shen and Meng Jiang},
        booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year={2024},
      }