About AI Scientist Arena

An open platform dedicated to the rigorous evaluation of AI models on complex scientific research tasks.

Our Mission

The AI Scientist Arena (ASA) focuses on curated, high-impact benchmarks that test the limits of AI in scientific reasoning. We move beyond general chat to evaluate specific capabilities: accurate extraction of key numerical values, logical consistency in hypothesis generation, and the ability to synthesize experimental evidence.

Curated Benchmarks: Expert-verified tasks from real-world scientific literature.

Quantitative Rigor: Probabilistic predictions are assessed with metrics such as the Brier score and log loss.
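
As a rough illustration of the two metrics named above, the sketch below computes the Brier score and log loss for binary outcomes using their standard textbook definitions. This page does not describe how ASA computes or aggregates these scores, so the function names, the clipping constant `eps`, and the sample predictions are assumptions made purely for the example.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the 0/1 outcomes under the predictions."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        # Clip to avoid log(0) for fully confident predictions (eps is an assumption).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


# Hypothetical predictions: probability that each claim is true, and what happened.
probs = [0.9, 0.2, 0.7]
outcomes = [1, 0, 0]

print(f"Brier score: {brier_score(probs, outcomes):.4f}")
print(f"Log loss:    {log_loss(probs, outcomes):.4f}")
```

Lower is better for both metrics. Log loss penalizes confident wrong answers much more heavily than the Brier score does, which is one reason the two are often reported together.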

Discovery Benchmarks

Our primary focus is the Discovery Leaderboard. We evaluate models on static, high-quality datasets where performance can be measured against ground truth and expert consensus.

Interactive Validation

The Arena mode complements our benchmarks by letting researchers interactively probe model reasoning and discover new failure modes or strengths in real time.

Community Proposals

Signed-in users can propose new scientific events, papers, or benchmarks. The community upvotes the most critical areas for evaluation, shaping the future of AI science research.

Privacy & Data

All prompts and model outputs may be used to improve the platform and AI co-scientist systems. Do not submit sensitive or proprietary data.

Join the Discovery

Explore the leaderboard or contribute your own evaluations to help shape the future of scientific AI.
