
Cognitive Benchmark

True Coherence (TC) scores for open-source large language models, measured with 78 evaluation questions spanning all 13 cognitive functions.

 #  Model              Params           TC
 1  Llama-3.3-70B      70B              15.37%
 2  Mistral-Small-24B  24B              11.46%
 3  Qwen3.5-35B-A3B    35B (3B active)  10.24%
 4  Qwen3.5-Distilled  35B (3B active)   9.07%
 5  Gemma-3-12B        12B               7.44%

Observations

  • True Coherence scales with model size: the 70B model scores roughly 2× the TC of the 12B model.
  • Knowledge distillation (Qwen3.5 → Qwen3.5-Distilled) reduces TC by 1.17 percentage points (−11%), despite similar downstream task performance.
  • Mamba hybrid architectures show near-perfect cognitive function hierarchy but lower overall coherence than large dense models.
  • Emotion is the dominant cognitive function in 4 of 5 models, suggesting affective processing may be a fundamental property of trained language models.
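
As a sanity check, the scaling and distillation observations can be reproduced with quick arithmetic over the TC values in the table above (a throwaway sketch, not part of the aime_loc API):

```python
# TC values copied from the leaderboard table on this page.
tc = {
    "Llama-3.3-70B": 15.37,
    "Mistral-Small-24B": 11.46,
    "Qwen3.5-35B-A3B": 10.24,
    "Qwen3.5-Distilled": 9.07,
    "Gemma-3-12B": 7.44,
}

# Scaling: 70B vs 12B TC ratio.
scale = tc["Llama-3.3-70B"] / tc["Gemma-3-12B"]
print(f"70B / 12B TC ratio: {scale:.2f}x")

# Distillation drop: percentage points and relative change.
drop_pp = tc["Qwen3.5-35B-A3B"] - tc["Qwen3.5-Distilled"]
drop_rel = drop_pp / tc["Qwen3.5-35B-A3B"] * 100
print(f"Distillation drop: {drop_pp:.2f} pp ({drop_rel:.0f}% relative)")
```

This confirms the ~2× scaling ratio and the −1.17 pp (−11%) distillation drop quoted above.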

Run your own benchmark

from aime_loc import LOC

# Authenticate with your AIME API key
loc = LOC(api_key="sk-aime-...")

# Benchmark a list of model IDs against the 78-question set
results = loc.benchmark([
    "meta-llama/Llama-3.3-70B-Instruct",
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "your-org/your-model",
], questions="78q")

results.summary_table()                # print the results table
results.heatmap(save="benchmark.png")  # save the heatmap as a PNG