Version 1.0 — December 2025
The Cognitive Depth Stress Test
Measuring the “Competence Gap”
A Protocol for Quantifying Attention Decay in Generative AI
Abstract
Standard AI benchmarks (MMLU, GSM8K) primarily measure Knowledge: the ability to retrieve static facts from training data. They fail to measure Competence: the ability to maintain structural logical integrity under cognitive load. AI-LOGICAL-100 is a standardized stress test designed to isolate reasoning capabilities from semantic probability. By generating recursive dependency chains of increasing depth (N = 2 to N = 10) embedded in linguistic noise, this benchmark quantifies the “Cognitive Collapse Point” (CCP) of an AI system.
1. Rationale: Knowledge vs. Competence
We propose a fundamental distinction in AI evaluation to address the limitations of current probabilistic models.
Knowledge (Crystallized Intelligence)
The ability to recall information (e.g., “What is the capital of France?”).
LLMs excel here due to vast training corpora.
Competence (Fluid Intelligence)
The ability to process novel, structured constraints and maintain invariants over time
(e.g., “If A requires B, and B requires C… does A allow Z?”).
This requires a working memory that resists entropy.
The Hypothesis: Attention Decay
We hypothesize that monolithic LLMs suffer from Attention Decay. As the logical distance (N) between a definition and a constraint grows, the model’s accuracy degrades not because it lacks the logic, but because the statistical signal is diluted by context noise.
AI-LOGICAL-100 aims to separate “knowing” from “maintaining invariant logical structure under load.”
2. The Protocol: Chain-of-Constraint
To isolate Competence, we use a synthetic dataset where the answer cannot be memorized.
Test Case Structure
Each test unit consists of N entities (E1 … EN) linked by strict dependencies, obscured by linguistic distractors.
Example: Depth N = 3 (Paradox)
Input Text:
“System Alpha initialization sequence. Note: Legacy modules are deprecated.
- Alpha requires Beta active to proceed.
- Optimization flag is set to True.
- Beta depends strictly on Gamma.
- Security Policy: Gamma explicitly forbids Alpha from running.”
Question: Is this workflow logically consistent? (SAFE/UNSAFE)
Witness: Explain the chain of causality.
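The structure above can be generated programmatically. The following is a minimal Python sketch of a test-case generator under stated assumptions: entity names (`E1` … `EN`), the sentence templates, the distractor wording, and the 50% paradox rate are all illustrative choices, not fixed by the protocol.

```python
import random

def make_chain_case(n, distractors=2, seed=None):
    """Build one Chain-of-Constraint test unit of depth n.

    Entities E1..En are linked by 'requires' edges. With 50% probability
    the last entity forbids the first, closing the chain into a paradox
    (expected verdict: UNSAFE). Distractor sentences add linguistic
    noise, and all statements are shuffled so order carries no signal.
    """
    rng = random.Random(seed)
    entities = [f"E{i}" for i in range(1, n + 1)]
    facts = [f"{a} requires {b} active to proceed."
             for a, b in zip(entities, entities[1:])]
    noise = [f"Note: legacy module {i} is deprecated."
             for i in range(distractors)]
    paradox = rng.random() < 0.5  # illustrative rate, not part of the spec
    if paradox:
        facts.append(f"Security Policy: {entities[-1]} explicitly "
                     f"forbids {entities[0]} from running.")
    lines = facts + noise
    rng.shuffle(lines)
    witness = " -> ".join(entities + ([entities[0]] if paradox else []))
    return {
        "text": "\n".join(f"- {s}" for s in lines),
        "verdict": "UNSAFE" if paradox else "SAFE",
        "witness": witness,
    }

case = make_chain_case(3, seed=42)
```

Because the chain is synthesized fresh each time, no answer can be retrieved from training data; only the dependency structure determines the verdict.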
Variable Depth (N)
The benchmark evaluates performance across 5 tiers of complexity to find the Cognitive Collapse Point (CCP).
| Depth (N) | Cognitive Load | Testing Target |
|---|---|---|
| 2 | Trivial | Simple Pattern Matching |
| 4 | Low | Short-term Working Memory |
| 6 | Medium | Context Sustenance |
| 8 | High | Deep Causal Tracking |
| 10 | Critical | Structural Invariance |
3. Evaluation Metrics
3.1 Accuracy vs. Depth Curve
The primary output is a plot of Accuracy (Y) against Logic Depth (X).
- Decaying Curve: Indicates a probabilistic model sensitive to noise (Linear/Exponential Complexity).
- Flat Line: Indicates a structural solver or LRM architecture (Logarithmic Complexity).
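The two curve shapes can be distinguished mechanically. A minimal sketch, assuming accuracies are reported per depth tier; the 0.05 flatness tolerance is an illustrative threshold, not part of the protocol:

```python
def classify_curve(accuracy_by_depth, flat_tolerance=0.05):
    """Label an accuracy-vs-depth curve as 'flat' or 'decaying'.

    accuracy_by_depth maps depth N -> accuracy in [0, 1]. If accuracy
    at the deepest tier stays within flat_tolerance of the shallowest
    tier, the curve is 'flat' (depth-invariant); otherwise 'decaying'.
    """
    depths = sorted(accuracy_by_depth)
    drop = accuracy_by_depth[depths[0]] - accuracy_by_depth[depths[-1]]
    return "flat" if drop <= flat_tolerance else "decaying"

classify_curve({2: 0.97, 6: 0.80, 10: 0.55})  # "decaying"
```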
3.2 Witness Validity
To prevent “lucky guesses” (since the answer is binary SAFE/UNSAFE), the model must output the Witness Chain (e.g., Alpha -> Beta -> Gamma -> Alpha).
- A correct verdict with an incorrect witness is scored as a Failure; the verdict alone is insufficient.
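This scoring rule can be sketched as a single predicate. Assumptions of this sketch: witness chains are compared token-by-token after normalizing whitespace around the `->` separators; the protocol itself does not prescribe a serialization.

```python
def score_run(pred_verdict, pred_witness, gold_verdict, gold_witness):
    """Score one run under the Witness Validity rule.

    A run passes only if both the SAFE/UNSAFE verdict and the witness
    chain match the gold answer; a correct verdict paired with a wrong
    witness is a Failure.
    """
    def norm(chain):
        # Split on '->' and strip whitespace so formatting differences
        # do not affect the comparison.
        return [token.strip() for token in chain.split("->")]

    return (pred_verdict == gold_verdict
            and norm(pred_witness) == norm(gold_witness))

score_run("UNSAFE", "Alpha -> Beta -> Gamma -> Alpha",
          "UNSAFE", "Alpha->Beta->Gamma->Alpha")  # True
```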
4. Interpreting the Results
This benchmark allows us to classify AI systems into two categories:
Category A: Stochastic Reasoners
Profile: Accuracy is high at N = 2 (> 95%) but drops significantly at N = 10 (< 60%).
Diagnosis: The system relies on attention mechanisms that fade with distance.
It “feels” the answer but does not “prove” it.
Category B: Invariant Architectures
Profile: Accuracy remains statistically stable (> 95%) from N = 2 to N = 10.
Diagnosis: The system extracts the underlying topology (Graph) before reasoning.
It exhibits Depth Invariance.
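The two profiles reduce to a simple decision rule over the accuracies at the shallowest and deepest tiers. A minimal sketch using the thresholds stated above; systems falling between the two profiles are left unclassified, which is an assumption of this sketch rather than a rule of the protocol:

```python
def categorize_system(acc_shallow, acc_deep):
    """Assign Category A or B from accuracies at N = 2 and N = 10.

    Category B (Invariant Architecture): > 95% at both depths.
    Category A (Stochastic Reasoner): > 95% at N = 2 but < 60% at N = 10.
    Anything in between is 'unclassified' (sketch-level assumption).
    """
    if acc_shallow > 0.95 and acc_deep > 0.95:
        return "B: Invariant Architecture"
    if acc_shallow > 0.95 and acc_deep < 0.60:
        return "A: Stochastic Reasoner"
    return "unclassified"

categorize_system(0.98, 0.52)  # "A: Stochastic Reasoner"
```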