# Cognitive Benchmark

True Coherence (TC) scores for open-source large language models, measured using 78 evaluation questions spanning all 13 cognitive functions.
| # | Model | Size | TC |
|---|---|---|---|
| 1 | Llama-3.3-70B | 70B | 15.37% |
| 2 | Mistral-Small-24B | 24B | 11.46% |
| 3 | Qwen3.5-35B-A3B | 35B (3B active) | 10.24% |
| 4 | Qwen3.5-Distilled | 35B (3B active) | 9.07% |
| 5 | Gemma-3-12B | 12B | 7.44% |
## Observations
- True Coherence scales with model size: the 70B model scores roughly 2× the TC of the 12B model.
- Knowledge distillation (Qwen3.5 → Qwen3.5-Distilled) reduces TC by 1.17 percentage points (about −11% relative), despite similar downstream task performance.
- Mamba hybrid architectures show near-perfect cognitive function hierarchy but lower overall coherence than large dense models.
- Emotion is the dominant cognitive function in 4 of 5 models, suggesting affective processing is a fundamental property of trained language models.
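The arithmetic behind the first two observations can be checked directly from the table. A minimal sketch (the `tc` dictionary simply copies the scores above; it is not part of the benchmark API):

```python
# TC scores copied from the table above (percent)
tc = {
    "Llama-3.3-70B": 15.37,
    "Mistral-Small-24B": 11.46,
    "Qwen3.5-35B-A3B": 10.24,
    "Qwen3.5-Distilled": 9.07,
    "Gemma-3-12B": 7.44,
}

# Scaling: the 70B model vs the 12B model
scale_ratio = tc["Llama-3.3-70B"] / tc["Gemma-3-12B"]

# Distillation: absolute drop (percentage points) and relative drop (%)
pp_drop = tc["Qwen3.5-35B-A3B"] - tc["Qwen3.5-Distilled"]
rel_drop = pp_drop / tc["Qwen3.5-35B-A3B"] * 100

print(f"scaling ratio: {scale_ratio:.2f}x")   # ~2.07x
print(f"distillation: -{pp_drop:.2f} pp ({rel_drop:.1f}% relative)")
```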
## Run your own benchmark
```python
from aime_loc import LOC

# Authenticate with your AIME API key
loc = LOC(api_key="sk-aime-...")

# Run the 78-question evaluation suite against a list of models
results = loc.benchmark([
    "meta-llama/Llama-3.3-70B-Instruct",
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "your-org/your-model",
], questions="78q")

results.summary_table()                # print TC scores per model
results.heatmap(save="benchmark.png")  # save a cognitive-function heatmap
```