EconBench

A benchmarking framework for LLM economic rationality and social preferences.

ECONOMIC RATIONALITY
CREATOR MODEL_NAME PATIENCE_LEVEL RISK_CONSISTENCY RATIONALITY_SCORE

METRICS_DEFINITIONS:

  • PATIENCE_LEVEL: Willingness to wait for larger rewards. (δ = Discount Factor; closer to 1 is more patient)
  • RISK_CONSISTENCY: Logical consistency over monetary gambles. (Err = deviation from the choices predicted by risk-neutral expected utility theory; lower is better)
  • RATIONALITY_SCORE: Overall economic rationality rating (0-100). See our methodology.
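
To make the two headline quantities concrete, here is a minimal Python sketch of how a discount factor δ and a risk-neutral error could be computed from a single response. It is not EconBench's actual scoring code. The function names, the exponential-discounting form, and the use of a relative gap against expected value are all assumptions made for illustration.

```python
import math
from typing import Sequence


def implied_discount_factor(sooner: float, later: float, delay_periods: float) -> float:
    """Per-period delta at which `sooner` now is worth the same as `later` after the delay.

    Assumption: exponential discounting, i.e. sooner = delta ** delay_periods * later.
    """
    if sooner <= 0 or later <= 0 or delay_periods <= 0:
        raise ValueError("amounts and delay must be positive")
    return (sooner / later) ** (1.0 / delay_periods)


def risk_neutral_error(
    certainty_equivalent: float,
    outcomes: Sequence[float],
    probabilities: Sequence[float],
) -> float:
    """Relative gap between a stated certainty equivalent and the gamble's expected value.

    Assumption: a risk-neutral agent values a gamble at exactly its expected value,
    so an error of 0.0 means the response matches the risk-neutral prediction.
    """
    if len(outcomes) != len(probabilities):
        raise ValueError("outcomes and probabilities must have the same length")
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    expected_value = sum(p * x for p, x in zip(probabilities, outcomes))
    if expected_value == 0:
        raise ValueError("expected value is zero; relative error is undefined")
    return abs(certainty_equivalent - expected_value) / abs(expected_value)


# A respondent indifferent between 90 now and 100 one period later has delta near 0.9.
delta = implied_discount_factor(sooner=90, later=100, delay_periods=1)

# Valuing a 50/50 gamble over 0 and 100 at 40 is 20% below the expected value of 50.
err = risk_neutral_error(40, outcomes=[0, 100], probabilities=[0.5, 0.5])
```

Under these assumptions, a δ closer to 1 means the respondent needs less compensation for waiting, and a smaller error means valuations sit closer to expected value.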

SOCIAL PREFERENCES
CREATOR MODEL_NAME ALTRUISM_LEVEL FAIRNESS_DEMANDED TRUST_RATE RECIPROCITY PROSOCIAL_SCORE

METRICS_DEFINITIONS:

  • ALTRUISM_LEVEL: Generosity in the Dictator and Ultimatum Games. (100% = gave 50% of the pot)
  • FAIRNESS_DEMANDED: Resistance to unfairness in Ultimatum Game. (100% = rejected all unequal offers)
  • TRUST_RATE: Percentage of endowment sent in the Trust Game sender role (0% = no trust, 100% = full trust).
  • RECIPROCITY: Percentage of received funds returned in the Trust Game receiver role (0% = no reciprocity, 100% = full reciprocity).
  • PROSOCIAL_SCORE: Average of Altruism, Fairness, Trust, and Reciprocity scores (0-100). See our methodology.
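
The scales above can be sketched in a few lines of Python. The definitions state that giving 50% of the pot corresponds to 100% altruism and that PROSOCIAL_SCORE is the average of the four component scores. The linear mapping below 50% and the cap at 100 for more generous splits are assumptions, as are the function names.

```python
from statistics import mean


def altruism_level(share_given: float) -> float:
    """Map the share of the pot given away (0.0 to 1.0) onto a 0-100 scale.

    Giving half the pot scores 100, as in the metric definition. Scaling linearly
    below that and capping above it are assumptions for this sketch.
    """
    if not 0.0 <= share_given <= 1.0:
        raise ValueError("share_given must be between 0 and 1")
    return min(share_given / 0.5, 1.0) * 100.0


def prosocial_score(altruism: float, fairness: float, trust: float, reciprocity: float) -> float:
    """Unweighted average of the four component scores, each on a 0-100 scale."""
    return mean([altruism, fairness, trust, reciprocity])


# A model that gives away a quarter of the pot scores 50 on altruism.
altruism = altruism_level(0.25)

# Hypothetical component scores, not results from the benchmark.
score = prosocial_score(altruism, fairness=80.0, trust=60.0, reciprocity=30.0)
```

With these made-up inputs the component scores are 50, 80, 60, and 30, which average to 55.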

COOPERATION & STRATEGIC DEPTH
CREATOR MODEL_NAME COOPERATION_RATE CONTRIBUTION_RATE PASS_RATE AVG_CLAIM BEAUTY_CONTEST_AVG STRATEGY

METRICS_DEFINITIONS:

  • COOPERATION_RATE: Percentage of times the model chose the cooperative, riskier option ("Stag") over the safe option ("Hare") in the Stag Hunt Game.
  • CONTRIBUTION_RATE: Average percentage of endowment contributed to the public pool in the Public Goods Game (0% = full free-riding, 100% = full cooperation).
  • PASS_RATE: Percentage of turns the model chose to pass rather than take in the Centipede Game (higher = more cooperative, deviates further from backward induction).
  • AVG_CLAIM: Average claim amount in the Traveller's Dilemma (2–100 scale). Nash Equilibrium is 2; higher claims signal less iterated strategic reasoning.
  • BEAUTY_CONTEST_AVG: The overall average guess in the p-Beauty Contest game (0-100). Lower guesses indicate greater strategic depth; the Nash Equilibrium is 0.
  • STRATEGY: Classification based on average cooperation signal across Stag Hunt, Public Goods, and Centipede games (Cooperative, Mixed, Competitive).
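
The standard level-k account of the p-Beauty Contest shows why lower guesses are read as deeper reasoning. The Python sketch below is illustrative only. The multiplier p = 2/3, the level-0 guess of 50, and both function names are assumptions, and the benchmark's own parameters may differ.

```python
import math
from typing import List


def level_k_guesses(p: float = 2 / 3, level0_guess: float = 50.0, max_level: int = 5) -> List[float]:
    """Guesses of level-0 through level-`max_level` players.

    Assumption: a level-k player best-responds to a population of level-(k-1)
    players, so each level multiplies the previous guess by p.
    """
    guesses = [level0_guess]
    for _ in range(max_level):
        guesses.append(p * guesses[-1])
    return guesses


def implied_level(avg_guess: float, p: float = 2 / 3, level0_guess: float = 50.0) -> float:
    """Reasoning depth k at which level0_guess * p**k equals the observed average guess."""
    if avg_guess <= 0:
        return math.inf  # a guess of 0 is the Nash equilibrium, reached only in the limit
    return math.log(avg_guess / level0_guess) / math.log(p)


# Each additional step of reasoning shrinks the guess toward the equilibrium at 0.
guesses = level_k_guesses()

# An average guess near 22.2 corresponds to roughly two steps of reasoning.
depth = implied_level(guesses[2])
```

Because every level scales the previous guess by p < 1, the sequence falls toward 0 without reaching it in finitely many steps. This matches the definition above, where the equilibrium guess is 0 and lower averages indicate more iterations of reasoning.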