We use eight mathematical reasoning datasets (MATH500, SVAMP, AddSub, ASDiv, GSM8K, AQuA, GSM-IC2, and GSM-ICM) and two non-mathematical reasoning datasets (HotpotQA and CSQA) as testbeds. We compare StepCo against three categories of baselines in mathematical reasoning: (1) Direct Generation Baselines: Direct, Zero-Shot-CoT, Manual-CoT, Complex-CoT, Auto-CoT, PAL, and Least-to-Most; (2) Correction-Based Baselines: Self-Correct, Self-Refine, PHP-CoT, Self-Check, and CRITIC; (3) Sampling-Selection Baselines: Self-Consistency (SC) and Best-of-N.
Columns SVAMP through GSM-ICM are mathematical reasoning datasets (accuracy); HotpotQA is open-domain question answering (EM and F1); CSQA is commonsense reasoning (accuracy). A "----" entry marks a result that is not reported.

| Method | SVAMP | AddSub | GSM8K | AQuA | MATH500 | ASDiv | GSM-IC2 | GSM-ICM | HotpotQA (EM) | HotpotQA (F1) | CSQA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Direct Generation Baselines* | | | | | | | | | | | |
| Direct | 78.2 | 86.1 | 77.8 | 63.4 | 39.7 | 86.2 | 88.9 | 83.4 | ---- | ---- | ---- |
| Zero-Shot-CoT | 76.7 | 85.2 | 78.6 | 51.3 | 37.9 | 84.3 | 87.0 | 82.0 | 28.0 | 31.2 | 69.3 |
| Manual-CoT | 77.1 | 85.3 | 76.4 | 54.3 | 42.3 | 87.3 | 86.8 | 81.4 | ---- | ---- | ---- |
| Auto-CoT | 80.9 | 88.0 | 78.8 | 57.8 | 39.1 | 86.9 | 84.3 | 81.8 | ---- | ---- | ---- |
| Complex-CoT | 80.4 | 87.9 | 78.9 | 59.1 | 40.1 | 87.2 | 84.3 | 83.0 | ---- | ---- | ---- |
| Least-to-Most | 79.6 | 90.4 | 77.5 | 57.4 | 39.5 | 89.1 | 86.9 | 80.2 | ---- | ---- | ---- |
| PAL | 77.8 | 89.1 | 79.5 | 63.4 | 41.4 | 81.0 | 85.2 | 84.7 | ---- | ---- | ---- |
| *Correction-Based Baselines* | | | | | | | | | | | |
| Self-Refine | 82.5 | 87.6 | 75.1 | 58.6 | 40.2 | 88.3 | 86.1 | 81.3 | ---- | ---- | ---- |
| Self-Correct | 81.5 | 82.3 | 73.6 | 48.7 | 35.3 | 81.7 | 83.5 | 79.6 | 29.0 | 32.4 | 65.9 |
| Self-Check | 80.7 | 86.9 | 74.3 | 64.6 | 42.1 | 86.4 | 84.7 | 82.7 | ---- | ---- | ---- |
| PHP-CoT | 83.1 | 85.3 | 81.3 | 60.6 | 48.9 | 90.2 | 87.5 | 84.1 | ---- | ---- | ---- |
| CRITIC | 83.3 | 89.5 | 79.2 | 63.8 | 44.9 | 90.7 | 89.2 | 86.4 | ---- | ---- | ---- |
| *Sampling-Selection Baselines* | | | | | | | | | | | |
| Self-Consistency (10) | 85.8 | 92.2 | 84.6 | 65.0 | 39.5 | 92.5 | 89.7 | 88.9 | ---- | ---- | ---- |
| Best-of-10 | 85.5 | 91.3 | 85.3 | 66.1 | 42.1 | 93.3 | 88.9 | 88.5 | 32.9 | 44.1 | 73.0 |
| StepCo | 89.7 | 93.4 | 87.0 | 72.4 | 56.9 | 98.4 | 90.7 | 89.0 | 35.0 | 47.4 | 74.3 |
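The HotpotQA columns report two metrics, EM and F1, which can diverge noticeably because F1 gives partial credit for overlapping tokens. The sketch below illustrates how such scores are commonly computed, assuming the widely used SQuAD-style answer normalization (lowercasing, removing punctuation and English articles). The function names `normalize_answer`, `exact_match`, `token_f1`, and `corpus_scores` are illustrative choices, and reporting on a 0 to 100 scale is an assumption; the official HotpotQA evaluation script may apply additional handling (for example for yes/no answers) that is omitted here, and the paper's exact evaluation code may differ.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, credit is given only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_scores(predictions, golds):
    """Average EM and F1 over a dataset, scaled to 0-100 (assumed convention)."""
    n = len(golds)
    em = 100.0 * sum(exact_match(p, g) for p, g in zip(predictions, golds)) / n
    f1 = 100.0 * sum(token_f1(p, g) for p, g in zip(predictions, golds)) / n
    return em, f1
```

For instance, a prediction of "Paris, France" against the gold answer "Paris" scores 0 on EM but about 0.67 on F1, which is why the F1 column sits above the EM column for every method that reports both.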