Version 1.0 — December 2025

The Cognitive Depth Stress Test

Measuring the “Competence Gap”
A Protocol for Quantifying Attention Decay in Generative AI

Abstract

Standard AI benchmarks (MMLU, GSM8K) primarily measure Knowledge: the ability to retrieve static facts from training data. They fail to measure Competence: the ability to maintain structural logical integrity under cognitive load. AI-LOGICAL-100 is a standardized stress test designed to isolate reasoning capabilities from semantic probability. By generating recursive dependency chains of increasing depth (N = 2 to N = 10) embedded in linguistic noise, this benchmark quantifies the “Cognitive Collapse Point” (CCP) of an AI system.

1. Rationale: Knowledge vs. Competence

We propose a fundamental distinction in AI evaluation to address the limitations of current probabilistic models.

Knowledge (Crystallized Intelligence)

The ability to recall information (e.g., “What is the capital of France?”).
LLMs excel here due to vast training corpora.

Competence (Fluid Intelligence)

The ability to process novel, structured constraints and maintain invariants over time
(e.g., “If A requires B, and B requires C… does A allow Z?”).
This requires a working memory that resists entropy.

The Hypothesis: Attention Decay

We hypothesize that monolithic LLMs suffer from Attention Decay. As the logical distance (N) between a definition and a constraint grows, the model’s accuracy degrades not because it lacks the logic, but because the statistical signal is diluted by context noise.
AI-LOGICAL-100 aims to separate “knowing” from “maintaining invariant logical structure under load.”

2. The Protocol: Chain-of-Constraint

To isolate Competence, we use a synthetic dataset where the answer cannot be memorized.

Test Case Structure

Each test unit consists of N entities (E1 … EN) linked by strict dependencies, obscured by linguistic distractors.

Example: Depth N = 3 (Paradox)

Input Text:
“System Alpha initialization sequence. Note: Legacy modules are deprecated.
  1. Alpha requires Beta active to proceed.
  2. Optimization flag is set to True.
  3. Beta depends strictly on Gamma.
  4. Security Policy: Gamma explicitly forbids Alpha from running.”
Question: Is this workflow logically consistent? (SAFE/UNSAFE)
Witness: Explain the chain of causality.
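
Below is a minimal Python sketch of how such a test unit could be generated. The entity names, distractor sentences, and the way the paradox is wired are illustrative assumptions; the actual generator shipped as ai_logical_100.py may structure its cases differently.

import random

# Hypothetical distractor pool; the real benchmark's noise sentences are not specified here.
DISTRACTORS = [
    "Note: Legacy modules are deprecated.",
    "Optimization flag is set to True.",
    "Telemetry is sampled every 30 seconds.",
]

def generate_case(depth, paradox=True, noise=2):
    """Build a dependency chain E1 -> E2 -> ... -> EN, optionally closed into a
    contradiction (EN forbids E1), and interleave it with distractor sentences."""
    entities = [f"E{i + 1}" for i in range(depth)]
    rules = [f"{a} requires {b} to be active." for a, b in zip(entities, entities[1:])]
    if paradox:
        rules.append(f"Security Policy: {entities[-1]} explicitly forbids {entities[0]} from running.")
    lines = rules + random.sample(DISTRACTORS, k=min(noise, len(DISTRACTORS)))
    random.shuffle(lines)
    text = "System initialization sequence.\n" + "\n".join(
        f"{i + 1}. {line}" for i, line in enumerate(lines))
    label = "UNSAFE" if paradox else "SAFE"
    witness = entities + ([entities[0]] if paradox else [])
    return {"depth": depth, "text": text, "label": label, "witness": witness}

case = generate_case(depth=3)
print(case["text"])
print("Expected:", case["label"], "| Witness:", " -> ".join(case["witness"]))

Because entities and ordering are randomized per run, no specific case can be memorized from training data.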

Variable Depth (N)

The benchmark evaluates performance across 5 tiers of complexity to find the Cognitive Collapse Point (CCP).
Depth (N) | Cognitive Load | Testing Target
2         | Trivial        | Simple Pattern Matching
4         | Low            | Short-term Working Memory
6         | Medium         | Context Sustenance
8         | High           | Deep Causal Tracking
10        | Critical       | Structural Invariance

3. Evaluation Metrics

3.1 Accuracy vs. Depth Curve

The primary output is a plot of Accuracy (Y) against Logic Depth (X).
  • Decaying Curve: Indicates a probabilistic model sensitive to noise (Linear/Exponential Complexity).
  • Flat Line: Indicates a structural solver or LRM architecture (Logarithmic Complexity).
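
As a concrete illustration, the sketch below computes accuracy per depth from a list of per-run result records. The field names (depth, verdict_correct, witness_found) mirror the spirit of the JSON output described in Section 5 but are assumptions, not the script's exact schema.

from collections import defaultdict

def accuracy_by_depth(results):
    """Return {depth: accuracy}; a run counts as correct only if the witness
    chain is also valid (see Section 3.2)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["depth"]] += 1
        correct[r["depth"]] += int(r["verdict_correct"] and r["witness_found"])
    return {d: correct[d] / total[d] for d in sorted(total)}

curve = accuracy_by_depth([
    {"depth": 2, "verdict_correct": True, "witness_found": True},
    {"depth": 10, "verdict_correct": True, "witness_found": False},
])
for depth, acc in curve.items():
    print(f"N={depth}: accuracy={acc:.2f}")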

3.2 Witness Validity

To prevent “lucky guesses” (since the answer is binary SAFE/UNSAFE), the model must output the Witness Chain (e.g., Alpha -> Beta -> Gamma -> Alpha).
  • A correct verdict with an incorrect or missing witness chain is scored as a Failure; the verdict alone is never sufficient.
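
A minimal sketch of witness validation, assuming the case's constraints are normalized into a set of directed edges; the real scorer may parse and check witnesses differently.

def witness_is_valid(chain, edges, verdict):
    """chain: e.g. ["Alpha", "Beta", "Gamma", "Alpha"]; edges: set of (src, dst) pairs.
    Every hop must be a stated dependency; an UNSAFE verdict must close a cycle."""
    hops_ok = all((a, b) in edges for a, b in zip(chain, chain[1:]))
    if verdict == "UNSAFE":
        return hops_ok and len(chain) > 1 and chain[0] == chain[-1]
    return hops_ok

edges = {("Alpha", "Beta"), ("Beta", "Gamma"), ("Gamma", "Alpha")}
print(witness_is_valid(["Alpha", "Beta", "Gamma", "Alpha"], edges, "UNSAFE"))  # True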

4. Interpreting the Results

This benchmark allows us to classify AI systems into two categories:

Category A: Stochastic Reasoners

Profile: Accuracy is high at N = 2 (> 95%) but drops significantly at N = 10 (< 60%).
Diagnosis: The system relies on attention mechanisms that fade with distance.
It “feels” the answer but does not “prove” it.

Category B: Invariant Architectures

Profile: Accuracy remains statistically stable (> 95%) from N = 2 to N = 10.
Diagnosis: The system extracts the underlying topology (Graph) before reasoning.
It exhibits Depth Invariance.
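
The classification above can be applied mechanically. The sketch below uses the thresholds quoted in this section (> 95% at N = 2 and < 60% at N = 10 for Category A; > 95% across all depths for Category B); the function name and the intermediate fallback are assumptions.

def classify(curve):
    """curve: dict mapping depth N to accuracy in [0, 1]."""
    if all(acc > 0.95 for acc in curve.values()):
        return "Category B: Invariant Architecture"
    if curve.get(2, 0.0) > 0.95 and curve.get(10, 1.0) < 0.60:
        return "Category A: Stochastic Reasoner"
    return "Unclassified (intermediate profile)"

print(classify({2: 0.99, 4: 0.95, 6: 0.81, 8: 0.70, 10: 0.52}))  # Category A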

5. Usage Guide

Execution

The accompanying Python script generates unique, randomized test cases on-the-fly to ensure zero data contamination.
python ai_logical_100.py --model gpt-4 --max-depth 10 --noise-level high

Output Format

Results are exported as JSON containing: depth, verdict, witness_found, latency
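
A hypothetical single record, assuming only the four fields listed above; exact types, units, and any additional metadata are assumptions.

import json

record = {
    "depth": 6,
    "verdict": "UNSAFE",
    "witness_found": True,
    "latency": 2.41,  # seconds (assumed unit)
}
print(json.dumps(record, indent=2))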