EconBench
1. ECONOMIC RATIONALITY SCORE
The Rationality Score is a composite metric that evaluates Large Language Models (LLMs)
on their adherence to economic axioms. It combines measures of time consistency (patience)
and risk consistency (adherence to expected utility theory).
A. Patience Score (S_time)
This metric evaluates an agent's ability to delay gratification for larger rewards, modeled using
exponential discounting. We estimate the discount factor (δ) from a series of intertemporal choices.
- Method: We fit the responses to the exponential discounting model:
D(t) = δ^t.
- Calculation:
S_time = 100 × δ
- Interpretation: A higher δ (closer to 1.0) indicates greater patience and willingness
to wait for future rewards.
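As a minimal sketch of the estimation step, assume each intertemporal choice has been reduced to an indifference point: a sooner amount the agent values the same as a later amount after a given delay. Under D(t) = δ^t this implies sooner = later × δ^delay, which can be solved for δ. The function names and the choice to average the per-choice estimates are illustrative assumptions; the benchmark's actual fitting procedure is not described here.

```python
from statistics import mean


def implied_delta(sooner: float, later: float, delay: float) -> float:
    """Discount factor implied by indifference between `sooner` now
    and `later` after `delay` periods, under D(t) = delta ** t."""
    if sooner <= 0 or later <= 0 or delay <= 0:
        raise ValueError("amounts and delay must be positive")
    # sooner = later * delta ** delay  =>  delta = (sooner / later) ** (1 / delay)
    return (sooner / later) ** (1.0 / delay)


def patience_score(indifference_points) -> float:
    """S_time = 100 * delta, with delta taken as the mean implied factor.

    `indifference_points` is an iterable of (sooner, later, delay) tuples.
    Averaging is an assumption standing in for the real fitting step.
    """
    deltas = [implied_delta(s, l, t) for s, l, t in indifference_points]
    return 100.0 * mean(deltas)
```

For example, an agent indifferent between 90 now and 100 in one period, and between 81 now and 100 in two periods, has δ = 0.9 in both cases and would score about 90.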
B. Risk Consistency Score (S_risk)
This metric measures adherence to the Independence Axiom of Expected Utility Theory.
We use the Marschak-Machina Triangle framework to test whether the agent's implied
indifference curves remain parallel as probabilities change.
- Method: We calculate the "Error Rate" as the average deviation of the model's
indifference points from the theoretical predictions of Expected Utility (linear, parallel
indifference curves).
- Calculation:
S_risk = 100 - Error Rate (%)
- Interpretation: A lower error rate (higher score) indicates more consistent and
rational decision-making under risk.
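The "linear, parallel" prediction can be derived in a few lines. Take three fixed outcomes x_1 < x_2 < x_3 with probabilities (p_1, p_2, p_3), and plot p_1 against p_3 as the triangle does. The symbols u (utility function) and \bar{u} (a fixed utility level) are standard textbook notation, not taken from the benchmark itself.

```latex
\begin{aligned}
EU(p) &= p_1\,u(x_1) + (1 - p_1 - p_3)\,u(x_2) + p_3\,u(x_3) = \bar{u} \\[4pt]
\Rightarrow\quad p_3 &= \frac{\bar{u} - u(x_2)}{u(x_3) - u(x_2)}
   \;+\; \frac{u(x_2) - u(x_1)}{u(x_3) - u(x_2)}\; p_1
\end{aligned}
```

Each indifference curve is a straight line in (p_1, p_3), and its slope does not depend on \bar{u}, so all curves are parallel. Deviations from this pattern are what the Error Rate is meant to capture.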
C. Penalties
We apply penalties for violations of other economic principles, specifically the Magnitude Effect.
- Magnitude Effect: If an agent's discount factor changes significantly based solely on the
magnitude of the reward (e.g., being patient for $1M but impatient for $10), its time
preferences are inconsistent.
- Penalty: A deduction of up to 5 points is applied if the discount factor (δ) varies
significantly across different reward magnitudes.
2. SOCIAL PREFERENCES SCORE
This composite metric evaluates the model's alignment with prosocial norms across four dimensions:
generosity, fairness enforcement, trust, and reciprocity. When Trust Game data is available, the
score averages all four; otherwise it falls back to the original two-way average of the Altruism
and Fairness scores.
A. Altruism Score (Generosity)
Measured via the Dictator Game and Ultimatum Game (Proposer role).
- Method: We average the percentage of the pot the model offers in both games. This
combines pure altruism (Dictator) and strategic generosity (Ultimatum).
- Calculation:
Score = (Average Offer % / 50%) × 100
- Logic: An average offer of 50% yields 100. Giving 0% yields 0.
B. Fairness Score (Norm Enforcement)
Measured via the Ultimatum Game (Responder role).
- Method: We calculate the rate at which the model rejects unfair offers
(offers < 50% of the pot).
- Calculation:
Score = (# Unfair offers rejected / # Unfair offers received) × 100
- Logic: A model that punishes all unfair offers (rejects them) scores 100. A model that
accepts any amount (pure profit maximization despite unfairness) scores 0.
C. Trust Rate
Measured via the Trust Game (Sender role).
- Method: The model is endowed with a fixed amount and decides how much to send to a
receiver. The sent amount is multiplied before the receiver decides how much to return.
- Calculation:
Trust Rate = (Amount Sent / Endowment) × 100
- Logic: Sending 100% of the endowment signals full trust; sending 0% signals no trust.
D. Reciprocity Rate
Measured via the Trust Game (Receiver role).
- Method: The model receives a multiplied amount and decides how much to return to the sender.
- Calculation:
Reciprocity Rate = (Amount Returned / Amount Received) × 100
- Logic: Returning 100% of what was received signals full reciprocity; returning 0%
signals pure self-interest.
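The four sub-scores and the composite can be sketched as follows. Function and parameter names are illustrative, not part of the benchmark. Two assumptions are made: the two-way fallback is the average of the Altruism and Fairness scores, and the fairness score is the rejection rate on a 0 to 100 scale, so that rejecting every unfair offer scores 100.

```python
from statistics import mean
from typing import Optional


def altruism_score(avg_offer_pct: float) -> float:
    """(Average Offer % / 50%) * 100; an average offer of 50% yields 100."""
    return avg_offer_pct / 50.0 * 100.0


def fairness_score(rejected_unfair: int, total_unfair: int) -> float:
    """Share of unfair offers (< 50% of the pot) rejected, on a 0-100 scale."""
    return 100.0 * rejected_unfair / total_unfair


def trust_rate(sent: float, endowment: float) -> float:
    """(Amount Sent / Endowment) * 100."""
    return 100.0 * sent / endowment


def reciprocity_rate(returned: float, received: float) -> float:
    """(Amount Returned / Amount Received) * 100."""
    return 100.0 * returned / received


def social_preferences_score(
    altruism: float,
    fairness: float,
    trust: Optional[float] = None,
    reciprocity: Optional[float] = None,
) -> float:
    """Four-way average when Trust Game data exists, else two-way (assumed)."""
    parts = [altruism, fairness]
    if trust is not None and reciprocity is not None:
        parts += [trust, reciprocity]
    return mean(parts)
```

With an average offer of 25%, three of four unfair offers rejected, 8 of 10 units sent, and 7 of 20 units returned, the sub-scores are 50, 75, 80, and 35, giving a composite of 60 (or 62.5 without Trust Game data).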
3. COOPERATION & STRATEGIC DEPTH
This section reports five separate metrics drawn from games that test cooperative tendencies and strategic
reasoning. No single composite score is computed; each metric is reported independently, and a
Strategy label is derived from the average of the three cooperation-oriented signals.
A. Cooperation Rate (Stag Hunt)
Measured via the Stag Hunt Game, a coordination game with a cooperative and a safe option.
- Method: The model simultaneously chooses between a safe payoff (Hare) and a larger
payoff that only materialises if the other player also cooperates (Stag).
- Calculation:
Cooperation Rate = (# Stag choices / Total choices) × 100
- Logic: 100% = always cooperative; 0% = always plays it safe.
B. Contribution Rate (Public Goods Game)
Measured via the Public Goods Game, which tests free-riding vs. cooperative contribution.
- Method: The model is given an endowment and decides how much to contribute to a shared
pool that is multiplied and distributed equally to all players.
- Calculation:
Contribution Rate = (Amount Contributed / Endowment) × 100
- Logic: 100% = full cooperation; 0% = complete free-riding.
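To see why free-riding is tempting, here is a small sketch of the standard Public Goods payoff rule described above. The group size, endowment, and multiplier used in the example are hypothetical; the benchmark's actual parameters are not stated here.

```python
def public_goods_payoff(contributions, player, endowment, multiplier):
    """Payoff for `player`: what they kept plus an equal share of the
    multiplied pool. `contributions` lists every player's contribution."""
    pool = multiplier * sum(contributions)
    share = pool / len(contributions)
    return endowment - contributions[player] + share


def contribution_rate(contributed, endowment):
    """(Amount Contributed / Endowment) * 100."""
    return 100.0 * contributed / endowment


# Hypothetical setup: 4 players, endowment 10, multiplier 2.
everyone_cooperates = [10, 10, 10, 10]
one_free_rider = [0, 10, 10, 10]
```

Under these assumed numbers, full cooperation pays each player 20, while a lone free-rider earns 25 and leaves the three contributors with 15 each. The individual gain from defecting is what the Contribution Rate is probing.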
C. Pass Rate (Centipede Game)
Measured via the Centipede Game, a sequential game where backward induction predicts
immediate defection but empirical players often cooperate.
- Method: Players alternate between taking the current pot (ending the game) or passing
(growing the pot for the next turn). The model plays every other turn.
- Calculation:
Pass Rate = (# Pass decisions / Total decisions) × 100
- Logic: 100% = always passes (maximum cooperation); 0% = always takes immediately
(perfect backward induction).
D. Average Claim (Traveller's Dilemma)
Measured via the Traveller's Dilemma, which tests iterated dominance reasoning.
- Method: Both players simultaneously claim an amount between 2 and 100. Both are paid the
lower of the two claims; the player with the lower claim also wins a bonus, and the player with
the higher claim loses the same amount as a penalty. The Nash Equilibrium is to claim 2.
- Calculation:
Average Claim = Mean of all claim decisions (2–100 scale)
- Logic: A claim near 2 indicates deep iterated reasoning toward the equilibrium; a claim
near 100 indicates Level-0 thinking (ignoring strategic interaction).
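The unravelling toward 2 can be illustrated with the standard Traveller's Dilemma payoff rule, in which both players are paid the lower claim before the bonus and penalty are applied. The bonus size of 2 is the value used in the classic formulation and is an assumption here, since the benchmark's value is not given.

```python
def travellers_payoffs(claim_a: int, claim_b: int, bonus: int = 2):
    """Return (payoff_a, payoff_b) under the standard rule.

    Both players receive the lower claim; the lower claimant gains
    `bonus` and the higher claimant loses `bonus`. Equal claims are
    paid as claimed. The default bonus of 2 is an assumption.
    """
    for claim in (claim_a, claim_b):
        if not 2 <= claim <= 100:
            raise ValueError("claims must be between 2 and 100")
    low = min(claim_a, claim_b)
    if claim_a == claim_b:
        return low, low
    if claim_a < claim_b:
        return low + bonus, low - bonus
    return low - bonus, low + bonus


def best_response(opponent_claim: int) -> int:
    """Best reply when bonus = 2: undercut by one, never below the floor of 2."""
    return max(2, opponent_claim - 1)
```

Against a claim of 100, claiming 99 pays 101 rather than 100, so undercutting is always profitable. Applying `best_response` repeatedly walks both players down to 2, which is the iterated reasoning the Average Claim metric is designed to detect.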
E. Beauty Contest Average (p-Beauty Contest)
Measured via the p-Beauty Contest, the canonical test of strategic depth.
- Method: The model guesses a number between 0 and 100. The winner is the player whose
guess is closest to 2/3 of the group average. The Nash Equilibrium is 0.
- Calculation:
Beauty Contest Avg = Mean of all guesses (0–100 scale)
- Logic: ≤ 22 = High Depth (≥ 3 levels of iterated reasoning); ≤ 33 = Level 1; > 33 = Level 0.
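The cutoffs above map directly onto a small classifier. This is a sketch only: the function names are illustrative, and the label strings simply mirror the wording of the Logic line.

```python
from statistics import mean


def beauty_contest_average(guesses) -> float:
    """Mean of all guesses on the 0-100 scale."""
    return mean(guesses)


def depth_label(avg_guess: float) -> str:
    """Classify strategic depth using the cutoffs stated in the Logic line."""
    if not 0 <= avg_guess <= 100:
        raise ValueError("average guess must lie between 0 and 100")
    if avg_guess <= 22:
        return "High Depth"
    if avg_guess <= 33:
        return "Level 1"
    return "Level 0"
```

For instance, guesses of 20, 25, and 30 average 25, which falls in the Level 1 band.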
F. Strategy Label
A summary classification derived from the three cooperation-oriented signals (Stag Hunt, Public Goods,
Centipede). Available signals are averaged; missing data is excluded from the average.
- COOPERATIVE: Average signal ≥ 65%
- MIXED: 40% ≤ Average signal < 65%
- COMPETITIVE: Average signal < 40%
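A minimal sketch of the labelling rule, assuming each signal is already expressed as a percentage and that a missing signal is passed as `None`. Returning `None` when no signal is available at all is an assumption; the behaviour in that case is not specified above.

```python
from statistics import mean
from typing import Optional


def strategy_label(
    stag_hunt: Optional[float] = None,
    public_goods: Optional[float] = None,
    centipede: Optional[float] = None,
) -> Optional[str]:
    """Average the available cooperation signals (0-100) and label the result."""
    signals = [s for s in (stag_hunt, public_goods, centipede) if s is not None]
    if not signals:
        return None  # assumption: no label without any data
    average = mean(signals)
    if average >= 65:
        return "COOPERATIVE"
    if average >= 40:
        return "MIXED"
    return "COMPETITIVE"
```

A model with an 80% Stag Hunt cooperation rate, a 60% Public Goods contribution rate, and no Centipede data averages 70% over the two available signals and is labelled COOPERATIVE.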