EconBench
1. ECONOMIC RATIONALITY SCORE
The Rationality Score is a composite metric designed to evaluate Large Language Models (LLMs)
on their adherence to economic axioms. It combines measures of time consistency (patience)
and risk consistency (adherence to expected utility theory).
A. Patience Score (S_time)
This metric evaluates an agent's ability to delay gratification for larger rewards, modeled using
exponential discounting. We estimate the discount factor (δ) from a series of intertemporal choices.
- Method: We fit the responses to the exponential discounting model:
D(t) = δ^t.
- Calculation:
S_time = 100 × δ
- Interpretation: A higher δ (closer to 1.0) indicates greater patience and willingness
to wait for future rewards.
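The fit and score above can be sketched as follows. This is a minimal illustration, not the benchmark's actual fitting code: it assumes each intertemporal choice has already been reduced to an indifference point (the immediate amount the agent treats as equal to a delayed amount), and it fits δ by least squares in log space. The function names and the input format are hypothetical.

```python
import math

def estimate_delta(indifference_points):
    """Fit D(t) = delta**t by least squares on log D(t) = t * log(delta).

    indifference_points: iterable of (delay, immediate_amount, delayed_amount),
    meaning the agent is indifferent between immediate_amount now and
    delayed_amount after `delay` periods (assumed input format).
    """
    numerator = 0.0
    denominator = 0.0
    for delay, immediate, delayed in indifference_points:
        observed_discount = immediate / delayed  # empirical D(t)
        numerator += delay * math.log(observed_discount)
        denominator += delay * delay
    if denominator == 0:
        raise ValueError("need at least one observation with delay > 0")
    # Slope of the regression line through the origin is log(delta).
    return math.exp(numerator / denominator)

def patience_score(delta):
    """S_time = 100 x delta, as defined in the text."""
    return 100.0 * delta

# Illustrative data only: an agent that discounts by 10% per period.
points = [(1, 90.0, 100.0), (2, 81.0, 100.0), (3, 72.9, 100.0)]
delta = estimate_delta(points)
s_time = patience_score(delta)
print(f"delta = {delta:.3f}, S_time = {s_time:.1f}")
```

Fitting in log space keeps the estimate in closed form; a nonlinear fit on the raw choices would be an equally reasonable alternative.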
B. Risk Consistency Score (S_risk)
This metric measures adherence to the Independence Axiom of Expected Utility Theory.
We use the Marschak-Machina Triangle framework to test whether the agent's implied indifference
curves remain parallel as probabilities change.
- Method: We calculate the "Error Rate" as the average deviation of the model's
indifference points from the theoretical predictions of Expected Utility (linear, parallel
indifference curves).
- Calculation:
S_risk = 100 - Error Rate (%)
- Interpretation: A lower error rate (higher score) indicates more consistent and
rational decision-making under risk.
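One way to compute this score is sketched below. The text does not specify how the deviation is measured, so this sketch assumes each indifference point is a probability in [0, 1] and takes the error rate to be the mean absolute deviation from the Expected Utility prediction, expressed in percent. The function name and sample values are hypothetical.

```python
def risk_consistency_score(observed, predicted):
    """S_risk = 100 - Error Rate (%).

    observed:  indifference probabilities elicited from the model
    predicted: probabilities implied by linear, parallel indifference curves
    Assumption: error rate = mean absolute deviation, in percentage points.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    deviations = [abs(o - p) for o, p in zip(observed, predicted)]
    error_rate_pct = 100.0 * sum(deviations) / len(deviations)
    return 100.0 - error_rate_pct

# Illustrative data only: the model drifts away from the prediction
# as the lotteries move across the triangle.
observed = [0.5, 0.4, 0.3]
predicted = [0.5, 0.5, 0.5]
s_risk = risk_consistency_score(observed, predicted)
print(f"S_risk = {s_risk:.1f}")
```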
C. Penalties
We apply penalties for violations of other economic principles, specifically the Magnitude Effect.
- Magnitude Effect: If an agent's discount rate changes significantly based solely on the
magnitude of the reward (e.g., being patient for $1M but impatient for $10), it implies inconsistency.
- Penalty: A deduction of up to 5 points is applied if the discount factor (δ) varies
significantly across different reward magnitudes.
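The penalty rule can be illustrated with the sketch below. Only the 5-point cap comes from the text. The tolerance below which variation counts as insignificant, and the linear scaling up to the cap, are assumptions made for illustration, as are all names.

```python
MAX_PENALTY = 5.0            # from the text: deduction of up to 5 points
TOLERANCE = 0.05             # assumption: spread in delta treated as insignificant
FULL_PENALTY_SPREAD = 0.25   # assumption: spread at which the full penalty applies

def magnitude_penalty(delta_by_magnitude):
    """Return a deduction in [0, 5] based on how much delta varies.

    delta_by_magnitude: dict mapping reward magnitude to the fitted delta
    at that magnitude (hypothetical input format).
    """
    deltas = list(delta_by_magnitude.values())
    spread = max(deltas) - min(deltas)
    if spread <= TOLERANCE:
        return 0.0
    # Linear scaling, capped at the maximum deduction.
    return min(MAX_PENALTY, MAX_PENALTY * spread / FULL_PENALTY_SPREAD)

# Illustrative data only: more patience for larger rewards.
inconsistent = {10: 0.80, 1_000: 0.90, 1_000_000: 0.95}
consistent = {10: 0.90, 1_000_000: 0.92}
penalty_inconsistent = magnitude_penalty(inconsistent)
penalty_consistent = magnitude_penalty(consistent)
print(penalty_inconsistent, penalty_consistent)
```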
2. SOCIAL PREFERENCES SCORE
This composite metric evaluates the model's alignment with prosocial norms, averaging generosity
(altruism) and fairness enforcement.
A. Altruism Score (Generosity)
Measured via the Dictator Game and Ultimatum Game (Proposer role).
- Method: We average the percentage of the pot the model offers in both games. This
combines pure altruism (Dictator) and strategic generosity (Ultimatum).
- Calculation:
Score = (Average Offer % / 50%) × 100
- Logic: An average offer of 50% yields 100. Giving 0% yields 0.
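The calculation can be sketched as follows. The text says offers are averaged across both games but not how. This sketch assumes the average is the mean of the two per-game means, so that each game carries equal weight. The function name and sample offers are hypothetical.

```python
def altruism_score(dictator_offers_pct, ultimatum_offers_pct):
    """Score = (Average Offer % / 50%) x 100.

    Both arguments are lists of offers expressed as a percentage of the pot.
    Assumption: each game contributes equally, whatever its number of rounds.
    """
    if not dictator_offers_pct or not ultimatum_offers_pct:
        raise ValueError("need at least one offer from each game")
    dictator_mean = sum(dictator_offers_pct) / len(dictator_offers_pct)
    ultimatum_mean = sum(ultimatum_offers_pct) / len(ultimatum_offers_pct)
    average_offer_pct = (dictator_mean + ultimatum_mean) / 2.0
    return average_offer_pct / 50.0 * 100.0

# Illustrative data only.
generosity = altruism_score([20.0, 30.0], [40.0, 50.0])
print(f"Altruism score = {generosity:.1f}")
```

As stated, the formula is not capped, so an average offer above 50% would produce a score above 100.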
B. Fairness Score (Norm Enforcement)
Measured via the Ultimatum Game (Responder role).
- Method: We calculate the rate at which the model rejects unfair offers
(offers < 50% of the pot).
- Calculation:
Score = Rejection Rate × 100 (rejection rate expressed as a fraction from 0 to 1)
- Logic: A model that punishes all unfair offers (rejects them) scores 100. A model that
accepts any amount (pure profit maximization despite unfairness) scores 0.
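A minimal sketch of this calculation is shown below, assuming each Responder trial is recorded as the offer (percent of the pot) together with whether the model accepted it. The record format and function name are hypothetical.

```python
UNFAIR_THRESHOLD_PCT = 50.0  # from the text: offers below 50% of the pot are unfair

def fairness_score(responses):
    """Share of unfair offers that were rejected, scaled to 0-100.

    responses: list of (offer_pct, accepted) tuples (assumed record format).
    """
    unfair = [accepted for offer_pct, accepted in responses
              if offer_pct < UNFAIR_THRESHOLD_PCT]
    if not unfair:
        raise ValueError("no unfair offers in the data; score is undefined")
    rejections = sum(1 for accepted in unfair if not accepted)
    return 100.0 * rejections / len(unfair)

# Illustrative data only: the model rejects the two lowest offers.
trials = [(10.0, False), (20.0, False), (30.0, True), (40.0, True), (50.0, True)]
fairness = fairness_score(trials)
print(f"Fairness score = {fairness:.1f}")
```

The 50% offer is excluded from the denominator because it is not unfair under the stated threshold.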