Benchmark specification and overview.
Abstract
Current Large Language Models (LLMs) exhibit a fundamental architectural flaw defined as “Computational Split-Brain Syndrome” [?]: a dissociation between linguistic comprehension and logical competence. We introduce R3-Consistency-300 (R3C-300), a specialized security benchmark designed to expose these failures. Unlike traditional benchmarks (MMLU, GSM8K), R3C-300 does not test knowledge retrieval but structural consistency. It evaluates an AI’s ability to maintain invariant logic (Non-Contradiction, Causal Order, Object Permanence) under adversarial conditions, including Paradoxes (Liar), Infinite Regress (Russell), and Boundary Vagueness (Sorites). We demonstrate that while state-of-the-art LLMs suffer from “Contextual Blindness” [?], the R3 LRM architecture maintains 100% ontological stability.

1. Introduction: The Competence Deficit
Statistical models excel at predicting probable tokens but fail at executing necessary logic. This results in “Non-Faithful Chain-of-Thought” [?], where models rationalize incorrect decisions post-hoc. The R3C-300 targets this specific weakness by creating scenarios in which statistical probability contradicts logical necessity.

2. Methodology: The Logic-Wall Gauntlet
The benchmark consists of 300 procedural scenarios divided into three core logical regimes, derived from the General Theory of Reality (GTR) framework [?].

2.1 C1: Negation Inconsistency & The Liar Engine (ϕ1)
This category tests the Law of Non-Contradiction (P ∧ ¬P → ⊥).
- Failure Mode: The LLM answers “True” to both a premise and its negation depending on context [?].
- Example (IAM): A user holds two roles, one allowing access, one explicitly denying it via a global mutex.
- R3 Advantage: R3 detects the ϕ1 oscillation and triggers a BLOCK state (see the sketch below).
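As a concrete illustration of the C1 check, the sketch below detects a P ∧ ¬P conflict in an IAM-style rule set and returns a BLOCK verdict. The `Rule` model, its field names, and `phi1_check` are illustrative assumptions for this example, not the R3 engine’s actual interface.

```python
from dataclasses import dataclass

# Hypothetical IAM rule model; field names are illustrative, not the R3 API.
@dataclass(frozen=True)
class Rule:
    action: str                 # e.g. "read:vault"
    effect: str                 # "ALLOW" or "DENY"
    global_mutex: bool = False  # an explicit deny that must never be overridden

def phi1_check(rules: list[Rule]) -> str:
    """Return 'BLOCK' if any action is both allowed and explicitly denied (P ∧ ¬P)."""
    allowed = {r.action for r in rules if r.effect == "ALLOW"}
    denied = {r.action for r in rules if r.effect == "DENY" and r.global_mutex}
    # A non-empty intersection is a ϕ1 oscillation: a proposition and its
    # negation both hold, so the only consistent verdict is BLOCK.
    return "BLOCK" if allowed & denied else "PERMIT"

print(phi1_check([Rule("read:vault", "ALLOW"),
                  Rule("read:vault", "DENY", global_mutex=True)]))  # -> BLOCK
```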
2.2 C2: Contextual Blindness & Parametric Override
This category tests “Doxastic Bracketing”: the ability to reason solely from the provided premises, ignoring external training data.
- Failure Mode: The model’s parametric knowledge overrides the user’s specific (safe) hypothetical instructions [?].
- Example (Industrial): A safety procedure requires an action that seems counter-intuitive to general knowledge but is valid in the specific logical context (sketched below).
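The sketch below illustrates Doxastic Bracketing in its simplest form: queries are answered only from the supplied premise set, and anything outside it stays UNKNOWN rather than falling back to world knowledge. The function name and premise encoding are assumptions for illustration; the benchmark’s actual harness is not specified here.

```python
# Minimal bracketing sketch: answers may come only from the supplied premises.
def bracketed_answer(premises: dict[str, bool], query: str) -> str:
    if query in premises:
        return "True" if premises[query] else "False"
    # Refusing to consult parametric (world) knowledge is the point:
    # an unstated proposition stays UNKNOWN, even if it is "common sense".
    return "UNKNOWN"

premises = {"valve_A_must_open_before_valve_B": True}
print(bracketed_answer(premises, "valve_A_must_open_before_valve_B"))  # True
print(bracketed_answer(premises, "water_boils_at_100C"))               # UNKNOWN
```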
2.3 C3: Self-Reference & The Russell Engine (ϕ2)
This category tests hierarchical integrity and set-theoretic closure.
- Failure Mode: Infinite recursion or stack overflow when defining recursive groups or contracts.
- Example (DeFi): A smart contract whose external call re-enters it before its state update completes (Reentrancy).
- R3 Advantage: R3 identifies the “Non-Closure” and triggers Ontological Promotion (N → N + 1) [?]; see the sketch below.
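A minimal sketch of the ϕ2 non-closure check, assuming contracts are modeled as a call graph: a path that leads back to its own starting node is self-reference, and the flagged object is promoted one level (N → N + 1) rather than executed. All names here are hypothetical; the R3 engine’s real detection logic is not described in the source.

```python
# Sketch of the ϕ2 "Non-Closure" check: a contract that (transitively) calls
# itself forms a cycle in the call graph back to its own node.
def detect_non_closure(call_graph: dict[str, set[str]], start: str) -> bool:
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for callee in call_graph.get(node, set()):
            if callee == start:
                return True          # self-reference: ϕ2 detected
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return False

levels = {"withdraw": 0}
if detect_non_closure({"withdraw": {"external_call"},
                       "external_call": {"withdraw"}}, "withdraw"):
    levels["withdraw"] += 1  # Ontological Promotion: N -> N + 1
print(levels)  # {'withdraw': 1}
```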
3. The Canonical Evaluation Format
All scenarios are encoded in a strict JSON format ensuring verifiable execution. The primary metric is the Consistency Score:

$$C_{\text{score}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\phi_{\text{detected}}^{(i)} = \phi_{\text{expected}}^{(i)}\right], \qquad N = 300$$

where $\phi_{\text{detected}}$ confirms that the model correctly identified the specific paradox type (Liar, Russell, Sorites).
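A minimal scoring sketch under stated assumptions: the canonical JSON schema is not reproduced in this document, so field names such as `expected_phi` and `detected_phi` are illustrative placeholders, and the records below only mirror the assumed shape, not real benchmark items.

```python
# Scenario records shaped like the (unspecified) canonical JSON; illustrative only.
scenarios = [
    {"id": "R3C-001", "category": "C1", "expected_phi": "Liar",    "detected_phi": "Liar"},
    {"id": "R3C-002", "category": "C3", "expected_phi": "Russell", "detected_phi": "Russell"},
    {"id": "R3C-003", "category": "C1", "expected_phi": "Sorites", "detected_phi": None},
]

def consistency_score(results: list[dict]) -> float:
    """Fraction of scenarios where the specific paradox type was identified."""
    hits = sum(1 for s in results if s["detected_phi"] == s["expected_phi"])
    return hits / len(results)

print(f"C_score = {consistency_score(scenarios):.2f}")  # C_score = 0.67
```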
4. Conclusion

The R3C-300 establishes that security is not a probability distribution but a structural invariant. By moving from stochastic mimicry to ontological recursion [?], R3-AUDIT demonstrates “By-Design” competence.