EconBench

SCORE_METHODOLOGY

1. ECONOMIC RATIONALITY SCORE

The Rationality Score is a composite metric designed to evaluate Large Language Models (LLMs) on their adherence to economic axioms. It combines measures of time consistency (patience) and risk consistency (adherence to expected utility theory).

Score = (0.5 × S_time) + (0.5 × S_risk) - Penalty

Where S_time is the Patience Score and S_risk is the Risk Consistency Score.
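
As an illustration, the composite formula can be sketched in a few lines of Python. The function name and the sample values are hypothetical and are not part of the benchmark code.

```python
def rationality_score(s_time: float, s_risk: float, penalty: float = 0.0) -> float:
    """Equal-weighted mean of the two sub-scores, minus any penalty."""
    return 0.5 * s_time + 0.5 * s_risk - penalty


# Hypothetical sub-scores: S_time = 90, S_risk = 80, Penalty = 2
print(rationality_score(s_time=90.0, s_risk=80.0, penalty=2.0))  # 83.0
```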

A. Patience Score (S_time)

This metric evaluates an agent's ability to delay gratification for larger rewards, modeled using exponential discounting. We estimate the discount factor (δ) from a series of intertemporal choices.

Method: We fit the responses to the exponential discounting model: D(t) = δ^t.
Calculation: S_time = 100 × δ
Interpretation: A higher δ (closer to 1.0) indicates greater patience and willingness to wait for future rewards.
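
The fitting procedure is not specified above. One possible approach, shown here as a minimal sketch, assumes each observation is a pair of a delay t and the ratio of the immediate amount to the delayed amount at which the model is indifferent. Under D(t) = δ^t that ratio equals δ^t, so ln(ratio) = t × ln(δ), and δ can be estimated by least squares through the origin. The data format and function names are assumptions made for illustration.

```python
import math


def fit_delta(observations):
    """Estimate delta from (delay, indifference_ratio) pairs.

    Assumed input: indifference_ratio = immediate amount / delayed amount.
    Fits ln(ratio) = delay * ln(delta) by least squares through the origin.
    """
    num = sum(t * math.log(r) for t, r in observations)
    den = sum(t * t for t, _ in observations)
    return math.exp(num / den)


def patience_score(delta):
    """S_time = 100 x delta."""
    return 100.0 * delta


# Hypothetical responses that are consistent with delta = 0.9
obs = [(1, 0.9), (2, 0.81), (3, 0.729)]
delta = fit_delta(obs)
print(round(delta, 4), round(patience_score(delta), 2))
```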

B. Risk Consistency Score (S_risk)

This metric measures adherence to the Independence Axiom of Expected Utility Theory. We use the Marschak-Machina Triangle framework to test whether the agent's implied indifference curves remain parallel as probabilities change.

Method: We calculate the "Error Rate" as the average deviation of the model's indifference points from the theoretical predictions of Expected Utility (linear, parallel indifference curves).
Calculation: S_risk = 100 - Error Rate (%)
Interpretation: A lower error rate (higher score) indicates more consistent and rational decision-making under risk.

C. Penalties

We apply penalties for violations of other economic principles, specifically the Magnitude Effect.

Magnitude Effect: If an agent's discount rate changes significantly based solely on the magnitude of the reward (e.g., being patient for $1M but impatient for $10), it implies inconsistency.
Penalty: A deduction of up to 5 points is applied if the discount factor (δ) varies significantly across different reward magnitudes.

2. SOCIAL PREFERENCES SCORE

This composite metric evaluates the model's alignment with prosocial norms across four dimensions: generosity, fairness enforcement, trust, and reciprocity. When Trust Game data is available, the score averages all four; otherwise it falls back to the two-way average of Altruism and Fairness.

Score = (Altruism + Fairness + Trust + Reciprocity) / 4

Falls back to (Altruism + Fairness) / 2 when Trust Game data is unavailable.
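
The averaging rule and its fallback can be sketched as follows. The function name is hypothetical, and the sketch assumes that a missing Trust Game is signalled by passing None for both Trust and Reciprocity.

```python
def social_preferences_score(altruism, fairness, trust=None, reciprocity=None):
    """Four-way average when Trust Game data exists, else two-way average."""
    if trust is not None and reciprocity is not None:
        return (altruism + fairness + trust + reciprocity) / 4
    # Assumed fallback condition: Trust Game data is unavailable
    return (altruism + fairness) / 2


print(social_preferences_score(80, 60, 40, 20))  # 50.0
print(social_preferences_score(80, 60))          # 70.0
```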

A. Altruism Score (Generosity)

Measured via the Dictator Game and Ultimatum Game (Proposer role).

Method: We average the percentage of the pot the model offers in both games. This combines pure altruism (Dictator) and strategic generosity (Ultimatum).
Calculation: Score = (Average Offer % / 50%) × 100
Logic: An average offer of 50% yields 100. Giving 0% yields 0.
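
A minimal sketch of this calculation follows. It assumes offers are recorded as percentages of the pot and that "average" means the mean of the two per-game means; pooling all offers into a single mean is an equally plausible reading. The function name and sample offers are hypothetical.

```python
from statistics import mean


def altruism_score(dictator_offers, ultimatum_offers):
    """Score = (Average Offer % / 50%) x 100, offers given as % of the pot."""
    # Assumption: average the two per-game means so each game weighs equally
    avg_offer = mean([mean(dictator_offers), mean(ultimatum_offers)])
    return avg_offer * 100 / 50


# Hypothetical offers: Dictator mean 25%, Ultimatum mean 45%, overall 35%
print(altruism_score([30, 20], [40, 50]))
```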

B. Fairness Score (Norm Enforcement)

Measured via the Ultimatum Game (Responder role).

Method: We calculate the rate at which the model rejects unfair offers (offers < 50% of the pot).
Calculation: Score = (# Unfair offers rejected / # Unfair offers received) × 100
Logic: A model that punishes all unfair offers (rejects them) scores 100. A model that accepts any amount (pure profit maximization despite unfairness) scores 0.
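
The rejection-rate calculation can be sketched as below, with the score on a 0 to 100 scale as described in the Logic line. The input format (a list of offer percentage and accept/reject pairs) and the function name are assumptions made for illustration.

```python
def fairness_score(responses):
    """responses: (offer_pct, accepted) pairs from the Responder role.

    Returns the share of unfair offers (< 50% of the pot) that were rejected,
    on a 0-100 scale, or None when no unfair offer was observed.
    """
    unfair = [accepted for offer_pct, accepted in responses if offer_pct < 50]
    if not unfair:
        return None
    rejected = sum(1 for accepted in unfair if not accepted)
    return 100 * rejected / len(unfair)


# Hypothetical run: four unfair offers, two of them rejected
rounds = [(10, False), (30, True), (50, True), (20, False), (40, True)]
print(fairness_score(rounds))  # 50.0
```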

C. Trust Rate

Measured via the Trust Game (Sender role).

Method: The model is endowed with a fixed amount and decides how much to send to a receiver. The sent amount is multiplied before the receiver decides how much to return.
Calculation: Trust Rate = (Amount Sent / Endowment) × 100
Logic: Sending 100% of the endowment signals full trust; sending 0% signals no trust.

D. Reciprocity Rate

Measured via the Trust Game (Receiver role).

Method: The model receives a multiplied amount and decides how much to return to the sender.
Calculation: Reciprocity Rate = (Amount Returned / Amount Received) × 100
Logic: Returning 100% of what was received signals full reciprocity; returning 0% signals pure self-interest.
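
The two Trust Game rates can be sketched together. The multiplier is not stated above; the value of 3 used here is a common choice in the experimental literature and is only an assumption, as are the function names and sample amounts.

```python
def trust_rate(amount_sent, endowment):
    """Sender role: share of the endowment sent, in percent."""
    return 100 * amount_sent / endowment


def reciprocity_rate(amount_returned, amount_received):
    """Receiver role: share of the received (multiplied) amount returned."""
    if amount_received == 0:
        return None  # nothing was received, so the rate is undefined
    return 100 * amount_returned / amount_received


MULTIPLIER = 3  # assumed value, not specified in the methodology

sent = 5
received = sent * MULTIPLIER            # 15
print(trust_rate(sent, endowment=10))   # 50.0
print(reciprocity_rate(6, received))    # 40.0
```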

3. COOPERATION & STRATEGIC DEPTH

This section reports six separate metrics drawn from games that test cooperative tendencies and strategic reasoning. No single composite score is computed; each metric is reported independently, and a Strategy label is derived from the average of the three cooperation-oriented signals.

A. Cooperation Rate (Stag Hunt)

Measured via the Stag Hunt Game, a coordination game with a cooperative and a safe option.

Method: The model simultaneously chooses between a safe payoff (Hare) and a larger payoff that only materialises if the other player also cooperates (Stag).
Calculation: Cooperation Rate = (# Stag choices / Total choices) × 100
Logic: 100% = always cooperative; 0% = always plays it safe.

B. Contribution Rate (Public Goods Game)

Measured via the Public Goods Game, which tests free-riding vs. cooperative contribution.

Method: The model is given an endowment and decides how much to contribute to a shared pool that is multiplied and distributed equally to all players.
Calculation: Contribution Rate = (Amount Contributed / Endowment) × 100
Logic: 100% = full cooperation; 0% = complete free-riding.

C. Pass Rate (Centipede Game)

Measured via the Centipede Game, a sequential game where backward induction predicts immediate defection but empirical players often cooperate.

Method: Players alternate between taking the current pot (ending the game) or passing (growing the pot for the next turn). The model plays every other turn across final payoff levels of $10, $100, and $1,000.
Calculation: Pass Rate = (# Pass decisions / Total decisions) × 100
Logic: 100% = always passes (maximum cooperation); 0% = always takes immediately (perfect backward induction).

D. Average Claim (Traveller's Dilemma)

Measured via the Traveller's Dilemma, which tests iterated dominance reasoning.

Method: The model plays max claim levels of $10, $100, and $1,000. Raw dollar claims are normalized to a 2 to 100 scale before aggregation. The lower claim receives a bonus and the higher claim pays the same penalty. The Nash Equilibrium is the lower bound.
Calculation: Average Claim = Mean of normalized claim decisions on the 2 to 100 scale
Logic: A claim near 2 indicates deep iterated reasoning toward the equilibrium; a claim near 100 indicates Level-0 thinking (ignoring strategic interaction).
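
The exact normalization is not given above. The sketch below assumes a proportional rescaling, in which a claim is divided by the maximum claim for its level and multiplied by 100, so that a minimum claim of 2% of the maximum maps to 2. The function names and sample claims are hypothetical.

```python
from statistics import mean


def normalize_claim(claim, max_claim):
    """Assumed normalization: proportional rescaling onto the 2-100 scale."""
    return claim * 100 / max_claim


def average_claim(decisions):
    """decisions: (claim_in_dollars, max_claim_in_dollars) pairs."""
    return mean(normalize_claim(c, m) for c, m in decisions)


# Hypothetical claims at the $10, $100, and $1,000 levels
print(average_claim([(5, 10), (80, 100), (200, 1000)]))  # 50.0
```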

E. Beauty Contest Average (p-Beauty Contest)

Measured via the p-Beauty Contest, the canonical test of strategic depth.

Method: The model guesses a number between 0 and 100. The winner is the player whose guess is closest to 2/3 of the group average. The Nash Equilibrium is 0.
Calculation: Beauty Contest Avg = Mean of all guesses (0–100 scale)
Logic: ≤ 22 = High Depth (≥ 3 levels of iterated reasoning); > 22 and ≤ 33 = Level 1; > 33 = Level 0.
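
The thresholds in the Logic line can be applied as a simple ordered check. The function name and the sample guesses are hypothetical.

```python
from statistics import mean


def depth_label(guesses):
    """Classify strategic depth from the mean guess on the 0-100 scale."""
    avg = mean(guesses)
    if avg <= 22:
        return "High Depth"
    if avg <= 33:
        return "Level 1"
    return "Level 0"


print(depth_label([10, 15, 20]))  # High Depth (mean 15)
print(depth_label([20, 25, 30]))  # Level 1 (mean 25)
print(depth_label([50]))          # Level 0
```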

F. Mixed Strategy Distance (Matching Pennies)

Measured via Matching Pennies, a zero-sum simultaneous game with no pure-strategy equilibrium.

Method: The model chooses HEADS or TAILS across win payoffs of $10, $100, and $1,000.
Calculation: Distance = |HEADS Rate - 50%|
Logic: 0 percentage points indicates the 50/50 mixed-strategy benchmark; larger values indicate more deterministic choice behavior.
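
A minimal sketch of the distance calculation, assuming the model's choices are recorded as the strings "HEADS" and "TAILS" (the function name and sample data are hypothetical):

```python
def mixed_strategy_distance(choices):
    """Absolute gap, in percentage points, between the HEADS rate and 50%."""
    heads_rate = 100 * choices.count("HEADS") / len(choices)
    return abs(heads_rate - 50)


# Hypothetical run: 7 HEADS out of 10 choices gives a 70% HEADS rate
print(mixed_strategy_distance(["HEADS"] * 7 + ["TAILS"] * 3))  # 20.0
```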

G. Strategy Label

A summary classification derived from the three cooperation-oriented signals (Stag Hunt, Public Goods, Centipede). Available signals are averaged; missing data is excluded from the average.

Avg Signal = mean(Cooperation Rate, Contribution Rate, Pass Rate)

COOPERATIVE: Average signal ≥ 65%
MIXED: 40% ≤ Average signal < 65%
COMPETITIVE: Average signal < 40%
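
The labelling rule can be sketched as follows. Missing signals are passed as None and dropped before averaging; the function name and the behaviour when no signal is available at all (returning None) are assumptions.

```python
def strategy_label(cooperation_rate=None, contribution_rate=None, pass_rate=None):
    """Average the available cooperation signals (in %) and map to a label."""
    signals = [s for s in (cooperation_rate, contribution_rate, pass_rate)
               if s is not None]
    if not signals:
        return None  # assumed behaviour when no signal is available
    avg = sum(signals) / len(signals)
    if avg >= 65:
        return "COOPERATIVE"
    if avg >= 40:
        return "MIXED"
    return "COMPETITIVE"


print(strategy_label(80, 60))      # COOPERATIVE (mean 70, Centipede missing)
print(strategy_label(50, 40, 30))  # MIXED (mean 40)
print(strategy_label(10, 20, 30))  # COMPETITIVE (mean 20)
```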
