We evaluate ProCo on three complex reasoning tasks: arithmetic reasoning (GSM8K, AQuA, and MATH), open-domain question answering (NQ, TriviaQA, WebQ, and HotpotQA), and commonsense reasoning (CSQA). We compare ProCo with three types of baselines: (1) methods that prompt an LLM to generate problem-related documents (GenRead); (2) methods that retrieve problem-related documents with a search engine (RAG); and (3) methods that answer directly without external documents (CoT, CoVe, and Self-Correct). All of these serve as baselines for open-domain question answering and commonsense reasoning. For arithmetic reasoning, where external documents are unnecessary, only CoT and Self-Correct are used. These baselines can also be combined with ProCo; for instance, GenRead produces an initial answer that ProCo then refines (GenRead + ProCo).
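To make the composition concrete, the sketch below illustrates how a GenRead-style baseline can be chained with a ProCo-style refinement loop. All names (`llm`, `genread_plus_proco`, the prompt wording, and the verify-then-correct loop) are hypothetical placeholders for illustration, not the authors' actual API or the exact ProCo procedure.

```python
# Minimal sketch of composing a document-augmented baseline with an
# answer-refinement step (GenRead + ProCo). `llm` is assumed to be any
# callable that maps a prompt string to a completion string.

def genread_plus_proco(question: str, llm, max_rounds: int = 3) -> str:
    # 1) GenRead-style step: ask the LLM to *generate* problem-related
    #    documents instead of retrieving them with a search engine.
    documents = llm(
        f"Generate a short background passage that helps answer:\n{question}"
    )

    # 2) Produce an initial answer conditioned on the generated documents.
    answer = llm(
        f"Context:\n{documents}\n\nQuestion: {question}\nAnswer concisely:"
    )

    # 3) ProCo-style step (schematic): verify the current answer and ask the
    #    LLM for a correction when verification fails, up to max_rounds times.
    for _ in range(max_rounds):
        verdict = llm(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Is this answer consistent with the question? "
            "Reply 'yes' or 'no' with a brief reason."
        )
        if verdict.strip().lower().startswith("yes"):
            break
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Verification feedback: {verdict}\nGive a corrected answer:"
        )
    return answer
```

The same wrapper applies to RAG + ProCo by swapping the document-generation step for retrieval; only the source of the context changes.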
| Method | NQ (EM) | NQ (F1) | TriviaQA (EM) | TriviaQA (F1) | WebQ (EM) | WebQ (F1) | HotpotQA (EM) | HotpotQA (F1) | CSQA (Acc.) | GSM8K (Acc.) | AQuA (Acc.) | MATH (Acc.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Using LLMs to generate problem-related documents* | | | | | | | | | | | | |
| GenRead | 42.2 | 49.4 | 70.8 | 74.8 | 41.3 | 48.5 | 38.0 | 43.2 | 67.3 | – | – | – |
| GenRead + ProCo | 48.3 | 55.6 | 78.4 | **82.4** | 46.7 | 53.9 | **47.0** | **51.0** | **76.4** | – | – | – |
| *Using search engines to retrieve problem-related documents* | | | | | | | | | | | | |
| RAG | 45.3 | 52.4 | 72.7 | 76.4 | 40.1 | 46.9 | 37.0 | 41.1 | 65.9 | – | – | – |
| RAG + ProCo | **48.5** | **56.0** | 78.4 | 82.1 | 45.2 | 52.5 | 39.0 | 44.2 | 74.2 | – | – | – |
| *Direct question answering without external documents* | | | | | | | | | | | | |
| CoT | 40.3 | 46.4 | 69.2 | 72.2 | 38.2 | 44.6 | 28.0 | 31.2 | 72.9 | 78.6 | 51.3 | 37.9 |
| Self-Correct | 40.1 | 47.1 | 71.3 | 74.1 | 39.2 | 45.7 | 29.0 | 32.4 | 65.9 | 75.1 | 48.7 | 27.6 |
| CoVe | 43.4 | 48.9 | 76.4 | 79.4 | 43.1 | 49.0 | 31.0 | 35.2 | 73.1 | – | – | – |
| ProCo | 48.0 | 54.8 | **78.7** | 82.1 | **47.0** | **57.0** | 33.0 | 36.2 | 75.5 | **87.1** | **65.2** | **41.5** |
Performance on the NQ, TriviaQA, WebQ, HotpotQA, CSQA, GSM8K, AQuA, and MATH benchmarks using GPT-3.5-Turbo-1106 (black-box LLM) and Mixtral-8x7B (open-source LLM). The best performance for each dataset is highlighted in bold. Combining ProCo with the document-augmented baselines improves their performance across all benchmarks, and ProCo alone outperforms the methods that do not use external documents.
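For reference, the EM and F1 columns for the open-domain QA datasets follow the standard convention of comparing a normalized prediction against the gold answer. The snippet below is a common SQuAD-style implementation of that scoring, shown as an assumed convention rather than the authors' exact evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation and articles, and collapse whitespace,
    # following the standard SQuAD-style answer normalization.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    # EM: 1.0 if the normalized strings are identical, else 0.0.
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    # Token-level F1 between the normalized prediction and gold answer.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```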