# Cognitive Benchmark

True Coherence (TC) scores for open-source large language models, measured using 78 evaluation questions spanning all 13 cognitive functions.
| # | Model | Size | TC |
|---|---|---|---|
| 1 | Llama-3.3-70B | 70B | 15.37% |
| 2 | Mistral-Small-24B | 24B | 11.46% |
| 3 | Qwen3.5-35B-A3B | 35B (3B active) | 10.24% |
| 4 | Qwen3.5-Distilled | 35B (3B active) | 9.07% |
| 5 | Gemma-3-12B | 12B | 7.44% |
## Observations
- True Coherence scales with model size: the 70B model scores roughly 2× the TC of the 12B model.
- Knowledge distillation (Qwen3.5 → Qwen3.5-Distilled) reduces TC by 1.17 percentage points (about −11% relative), despite similar downstream task performance.
- Mamba hybrid architectures show near-perfect cognitive function hierarchy but lower overall coherence than large dense models.
- Emotion is the dominant cognitive function in 4 of 5 models, suggesting affective processing is a fundamental property of trained language models.
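The arithmetic behind the first two observations can be checked directly from the table. A minimal sketch (the `tc` dictionary simply copies the scores above; it is not part of the benchmark API):

```python
# TC scores copied from the table above (percent)
tc = {
    "Llama-3.3-70B": 15.37,
    "Mistral-Small-24B": 11.46,
    "Qwen3.5-35B-A3B": 10.24,
    "Qwen3.5-Distilled": 9.07,
    "Gemma-3-12B": 7.44,
}

# Scaling: the 70B model vs the 12B model
scale_ratio = tc["Llama-3.3-70B"] / tc["Gemma-3-12B"]

# Distillation: absolute drop (percentage points) and relative drop (%)
pp_drop = tc["Qwen3.5-35B-A3B"] - tc["Qwen3.5-Distilled"]
rel_drop = pp_drop / tc["Qwen3.5-35B-A3B"] * 100

print(f"scaling ratio: {scale_ratio:.2f}x")   # ~2.07x
print(f"distillation: -{pp_drop:.2f} pp ({rel_drop:.1f}% relative)")
```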
## Run your own benchmark
```python
from aime_loc import LOC

# Authenticate with your AIME API key
loc = LOC(api_key="sk-aime-...")

# Run the 78-question evaluation suite against a list of models
results = loc.benchmark([
    "meta-llama/Llama-3.3-70B-Instruct",
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "your-org/your-model",
], questions="78q")

results.summary_table()                # print TC scores per model
results.heatmap(save="benchmark.png")  # save a cognitive-function heatmap
```