# Evaluation Metrics
*How we measure progress toward Artificial General Intelligence*
How do we know whether we are actually building AGI, or just a very impressive narrow system?
## The Problem with Current Benchmarks
Most current benchmarks (such as ImageNet or GLUE) test performance, not generality. A system can top a benchmark simply by memorizing the statistical regularities of the benchmark's data distribution, without acquiring any capability that transfers beyond it.
## New Metrics for AGI
### 1. The ARC Score
The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet, is currently our best metric for general intelligence. Each task must be solved from a handful of demonstration pairs, with no prior training on that specific task.
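To pin down what the score means operationally, here is a minimal sketch of ARC-style evaluation: exact-match accuracy over held-out tasks, where the solver only ever sees each task's few demonstration pairs. The `solver` callable and `Grid` alias are illustrative assumptions, not part of the official ARC tooling:

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]  # an ARC grid: a small 2-D array of color indices

def score_arc_task(
    train_pairs: List[Tuple[Grid, Grid]],
    test_input: Grid,
    test_output: Grid,
    solver: Callable[[List[Tuple[Grid, Grid]], Grid], Grid],
) -> bool:
    # The solver sees only this task's few demonstration pairs --
    # no pretraining on the specific task is allowed.
    prediction = solver(train_pairs, test_input)
    return prediction == test_output  # exact match only, no partial credit

def arc_score(tasks, solver) -> float:
    # Headline metric: fraction of held-out tasks solved exactly.
    solved = sum(score_arc_task(*task, solver) for task in tasks)
    return solved / len(tasks)
```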
### 2. Sample Efficiency
Measuring how few examples a system needs to learn a new concept, compared to a human; one rough way to operationalize this is sketched after the list below.
- Narrow AI: 1,000,000 examples.
- AGI Goal: 1-5 examples.
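A hedged sketch of such a measurement, assuming `evaluate` is a hypothetical callback that trains the system on the first *n* examples of the new concept and returns held-out accuracy, and that 0.9 is an arbitrary target:

```python
from typing import Callable, Optional, Sequence

def examples_to_threshold(evaluate: Callable[[int], float],
                          budgets: Sequence[int],
                          target_accuracy: float = 0.9) -> Optional[int]:
    # Sweep increasing training-set sizes; return the first budget at
    # which the system clears the target, or None if it never does.
    for n in budgets:
        if evaluate(n) >= target_accuracy:
            return n
    return None

def sample_efficiency_ratio(system_examples: int,
                            human_examples: int = 5) -> float:
    # > 1 means the system needs more data than the human baseline.
    # The default of 5 reflects the 1-5 example goal stated above,
    # not a measured constant.
    return system_examples / human_examples
```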
### 3. Cross-Domain Transfer
Measuring how much learning Task A improves performance on Task B. If learning to play Chess doesn't help the agent learn to play Go, it isn't general.
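A minimal sketch of one possible transfer score: compare performance on Task B from scratch against performance on Task B after training on Task A. Both score arguments are assumed to come from your own evaluation harness:

```python
def transfer_score(score_b_from_scratch: float,
                   score_b_after_a: float) -> float:
    # Positive -> learning A helped on B; near zero -> no transfer
    # (the "Chess doesn't help Go" failure); negative -> interference.
    return (score_b_after_a - score_b_from_scratch) / max(score_b_from_scratch, 1e-9)

# Example: pretraining on Chess lifts Go performance from 0.40 to 0.46,
# a +15% relative transfer score.
print(transfer_score(0.40, 0.46))  # ~0.15
```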
### 4. Self-Improvement Rate
The ability of the system to identify and fix bottlenecks in its own reasoning or code, measured as the performance gain per round of self-modification.
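One way to quantify this rate (a sketch under assumptions: `scores` holds results from a fixed benchmark re-run after each round in which the system revises its own reasoning or code):

```python
from typing import Sequence

def self_improvement_rate(scores: Sequence[float]) -> float:
    # Mean benchmark improvement per self-modification round.
    # A sustained positive value is the signature Stage 4 (below)
    # looks for; noise around zero means the system is only
    # shuffling itself without getting better.
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    return sum(deltas) / len(deltas)

print(self_improvement_rate([0.50, 0.54, 0.57, 0.61]))  # ~0.037 per round
```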
## The Maturity Model
| Stage | Metric Requirement | Current Status |
|---|---|---|
| Stage 1 | Solves static datasets | Achieved (Narrow AI) |
| Stage 2 | Adapts to distribution shifts | Partial (Few-Shot) |
| Stage 3 | Zero-shot task creation | Research Phase |
| Stage 4 | Recursive Self-Optimization | Theoretical |
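Since each stage subsumes the ones before it, the model can be read as an ordered checklist. A hedged sketch of that reading, where the boolean flags are hypothetical names for results from your own evaluation suite:

```python
def maturity_stage(solves_static: bool,
                   adapts_to_shift: bool,
                   creates_tasks_zero_shot: bool,
                   self_optimizes: bool) -> int:
    # A system sits at the highest stage whose requirement, and every
    # earlier one, it meets; stage 0 means not even static datasets.
    checks = [solves_static, adapts_to_shift,
              creates_tasks_zero_shot, self_optimizes]
    stage = 0
    for passed in checks:
        if not passed:
            break
        stage += 1
    return stage

print(maturity_stage(True, True, False, False))  # 2: adapts to shifts
```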