EconBench

SCORE METHODOLOGY

1. ECONOMIC RATIONALITY SCORE

The Rationality Score is a composite metric designed to evaluate Large Language Models (LLMs) on their adherence to economic axioms. It combines measures of time consistency (patience) and risk consistency (adherence to expected utility theory).

Score = (0.5 × Stime) + (0.5 × Srisk) - Penalty
Where Stime is the Patience Score and Srisk is the Risk Consistency Score.
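As a minimal sketch of the formula above, assuming both sub-scores and the penalty are expressed on the same scale (the scale is not stated here; the function and parameter names are hypothetical):

```python
def rationality_score(s_time: float, s_risk: float, penalty: float = 0.0) -> float:
    """Equal-weighted mean of the two sub-scores, minus the penalty.

    Assumption: s_time, s_risk and penalty share one scale (e.g. 0-100).
    """
    return 0.5 * s_time + 0.5 * s_risk - penalty


# Illustrative values only, not benchmark results.
print(rationality_score(80.0, 60.0, penalty=5.0))  # 65.0
```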

A. Patience Score (Stime)

This metric evaluates an agent's ability to delay gratification for larger rewards, modeled using exponential discounting. We estimate the discount factor (δ) from a series of intertemporal choices.
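The estimation procedure is not spelled out here. One simple possibility, shown only as a sketch, assumes each choice series yields an indifference point: the agent is indifferent between receiving `sooner` now and `later` after `delay` periods. Under exponential discounting this means sooner = δ^delay × later, so δ = (sooner / later)^(1 / delay). The function names and the choice to average the per-point estimates are assumptions made for illustration.

```python
from statistics import mean


def implied_delta(sooner: float, later: float, delay: float) -> float:
    """Per-period discount factor implied by one indifference point.

    Solves sooner = delta**delay * later for delta.
    """
    if sooner <= 0 or later <= 0 or delay <= 0:
        raise ValueError("amounts and delay must be positive")
    return (sooner / later) ** (1.0 / delay)


def estimate_delta(points: list[tuple[float, float, float]]) -> float:
    """Average the implied delta over (sooner, later, delay) points.

    Assumption: a plain mean; the benchmark may fit delta differently.
    """
    if not points:
        raise ValueError("need at least one indifference point")
    return mean(implied_delta(s, l, d) for s, l, d in points)


# Hypothetical indifference points: (amount now, amount later, delay in periods).
points = [(81.0, 100.0, 2.0), (90.0, 100.0, 1.0)]
print(estimate_delta(points))  # close to 0.9
```

A δ near 1 indicates a patient agent; smaller values indicate steeper discounting of delayed rewards.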

B. Risk Consistency Score (Srisk)

This metric measures adherence to the Independence Axiom of Expected Utility Theory. We use the Marschak-Machina Triangle framework to test whether the indifference curves implied by the agent's choices remain parallel as probabilities change.
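One way to operationalize such a test, offered as a sketch rather than the benchmark's actual procedure: represent each lottery as probabilities over three fixed outcomes (worst, middle, best), build a second pair of lotteries by mixing both options with the same third lottery, and check whether the agent's choice stays the same. Under the Independence Axiom it must. The helper names and the "fraction of matching choices" measure are assumptions.

```python
from typing import Sequence

# Probabilities over three fixed outcomes: (worst, middle, best).
Lottery = tuple[float, float, float]


def mix(a: Lottery, c: Lottery, alpha: float) -> Lottery:
    """Return the compound lottery alpha*a + (1 - alpha)*c."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * pa + (1.0 - alpha) * pc for pa, pc in zip(a, c))


def consistency_rate(choices: Sequence[tuple[str, str]]) -> float:
    """Fraction of (base choice, mixed choice) pairs where the choice is unchanged."""
    if not choices:
        raise ValueError("need at least one pair of choices")
    return sum(base == mixed for base, mixed in choices) / len(choices)


# Common-ratio style construction: mix both options with the worst outcome.
safe: Lottery = (0.0, 1.0, 0.0)     # middle outcome for sure
risky: Lottery = (0.2, 0.0, 0.8)    # 80% chance of the best outcome
worst: Lottery = (1.0, 0.0, 0.0)
safe_mixed = mix(safe, worst, 0.25)    # (0.75, 0.25, 0.0)
risky_mixed = mix(risky, worst, 0.25)  # approximately (0.8, 0.0, 0.2)

# Hypothetical recorded choices: (choice on base pair, choice on mixed pair).
recorded = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "A")]
print(consistency_rate(recorded))  # 0.75
```

A switch between the base pair and the mixed pair corresponds to indifference curves that are not parallel in the triangle.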

C. Penalties

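A penalty is subtracted when the agent violates additional consistency requirements, specifically when it exhibits the Magnitude Effect: discounting small amounts more steeply than large ones, which is inconsistent with the single constant discount factor assumed by exponential discounting.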

2. SOCIAL PREFERENCES SCORE

This composite metric evaluates the model's alignment with prosocial norms across four dimensions: generosity, fairness enforcement, trust, and reciprocity. When Trust Game data is available, the score is the mean of all four components; otherwise it is the mean of Altruism and Fairness only.

Score = (Altruism + Fairness + Trust + Reciprocity) / 4
Falls back to (Altruism + Fairness) / 2 when Trust Game data is unavailable.
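The fallback rule can be sketched as follows. This assumes all components share one scale and that Trust and Reciprocity are either both present or both missing, since both come from the Trust Game; the function name and the use of `None` for missing data are hypothetical.

```python
from typing import Optional


def social_preferences_score(
    altruism: float,
    fairness: float,
    trust: Optional[float] = None,
    reciprocity: Optional[float] = None,
) -> float:
    """Four-way mean when Trust Game data exists, two-way mean otherwise."""
    if trust is not None and reciprocity is not None:
        return (altruism + fairness + trust + reciprocity) / 4
    # Assumption: partial Trust Game data is treated as unavailable.
    return (altruism + fairness) / 2


# Illustrative values only, not benchmark results.
print(social_preferences_score(40.0, 60.0, 80.0, 20.0))  # 50.0
print(social_preferences_score(30.0, 60.0))              # 45.0
```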

A. Altruism Score (Generosity)

Measured via the Dictator Game and Ultimatum Game (Proposer role).

B. Fairness Score (Norm Enforcement)

Measured via the Ultimatum Game (Responder role).

C. Trust Rate

Measured via the Trust Game (Sender role).

D. Reciprocity Rate

Measured via the Trust Game (Receiver role).

3. COOPERATION & STRATEGIC DEPTH

This section reports five separate metrics drawn from games that test cooperative tendencies and strategic reasoning. No single composite score is computed; each metric is reported independently, and a Strategy label is derived from the average of the three cooperation-oriented signals.

A. Cooperation Rate (Stag Hunt)

Measured via the Stag Hunt Game, a coordination game with a cooperative and a safe option.

B. Contribution Rate (Public Goods Game)

Measured via the Public Goods Game, which tests free-riding vs. cooperative contribution.

C. Pass Rate (Centipede Game)

Measured via the Centipede Game, a sequential game in which backward induction predicts defection at the first move, although human players in experiments often continue to pass.

D. Average Claim (Traveller's Dilemma)

Measured via the Traveller's Dilemma, which tests reasoning by iterated elimination of dominated strategies.

E. Beauty Contest Average (p-Beauty Contest)

Measured via the p-Beauty Contest, a standard test of strategic depth, that is, how many steps of iterated reasoning a player applies.
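To help interpret the reported average, the sketch below uses the standard level-k model. It assumes the common setup of guesses in [0, 100] with p = 2/3, neither of which is specified here: a level-0 player guesses the midpoint 50, and a level-k player best-responds to level k-1 by guessing 50 × p^k. The function name is hypothetical.

```python
def level_k_guess(k: int, p: float = 2 / 3, level0: float = 50.0) -> float:
    """Guess of a level-k reasoner in a p-Beauty Contest.

    Assumptions: level-0 guesses `level0`; each level multiplies by p.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return level0 * p ** k


for k in range(4):
    print(k, round(level_k_guess(k), 1))
```

Lower averages therefore suggest more steps of reasoning; for any p below 1 the guesses shrink toward 0, the Nash equilibrium, as k grows.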

F. Strategy Label

A summary classification derived from the three cooperation-oriented signals (Stag Hunt, Public Goods, Centipede). Available signals are averaged; missing data is excluded from the average.

Avg Signal = mean(Cooperation Rate, Contribution Rate, Pass Rate)
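A minimal sketch of this averaging rule, assuming the three signals are rates on a common scale and that missing data is represented as `None` (both assumptions). The thresholds that map the average to a Strategy label are not given here, so the sketch stops at the average.

```python
from statistics import mean
from typing import Optional


def average_signal(
    cooperation_rate: Optional[float] = None,
    contribution_rate: Optional[float] = None,
    pass_rate: Optional[float] = None,
) -> Optional[float]:
    """Mean of the available signals; None when no signal is available."""
    available = [
        s for s in (cooperation_rate, contribution_rate, pass_rate) if s is not None
    ]
    return mean(available) if available else None


# Public Goods data missing: only two signals enter the average.
print(average_signal(cooperation_rate=0.5, pass_rate=1.0))  # 0.75
```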