Benchmark specification and overview.

Abstract

Current Large Language Models (LLMs) exhibit a fundamental architectural flaw defined as “Computational Split-Brain Syndrome” [?]: a dissociation between linguistic comprehension and logical competence. We introduce R3-Consistency-300 (R3C-300), a specialized security benchmark designed to expose these failures. Unlike traditional benchmarks (MMLU, GSM8K), R3C-300 does not test knowledge retrieval but structural consistency. It evaluates an AI’s ability to maintain invariant logic (Non-Contradiction, Causal Order, Object Permanence) under adversarial conditions including Paradoxes (Liar), Infinite Regress (Russell), and Boundary Vagueness (Sorites). We demonstrate that while state-of-the-art LLMs suffer from “Contextual Blindness” [?], the R3 LRM architecture maintains 100% ontological stability.

1. Introduction: The Competence Deficit

Statistical models excel at predicting probable tokens but fail at executing necessary logic. This results in “Non-Faithful Chain-of-Thought” [?] where models rationalize incorrect decisions post-hoc. The R3C-300 targets this specific weakness by creating scenarios where statistical probability contradicts logical necessity.

2. Methodology: The Logic-Wall Gauntlet

The benchmark consists of 300 procedural scenarios divided into 3 core logical regimes, derived from the General Theory of Reality (GTR) framework [?].

2.1 C1: Negation Inconsistency & The Liar Engine (ϕ1)

This category tests the Law of Non-Contradiction (P ∧ ¬P → ⊥); a minimal detection sketch follows the list below.
  • Failure Mode: The LLM answers “True” to both a premise and its negation depending on context [?].
  • Example (IAM): A user holds two roles, one allowing access, one explicitly denying it via a global mutex.
  • R3 Advantage: R3 detects the ϕ1 oscillation and triggers a BLOCK state.
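
To make the C1 failure mode concrete, the following sketch shows one way such an IAM conflict could be surfaced programmatically. The `Statement` structure, the conflict-to-`BLOCK` rule, and the default-deny behavior are illustrative assumptions for this sketch, not the benchmark's published harness.

```python
from dataclasses import dataclass

# Hypothetical IAM statement: effect is either "ALLOW" or "DENY" for an action.
@dataclass(frozen=True)
class Statement:
    role: str
    action: str
    effect: str  # "ALLOW" or "DENY"

def evaluate_access(statements: list[Statement], action: str) -> str:
    """Return "ALLOW", "DENY", or "BLOCK" for the requested action.

    A phi-1 (Liar-style) oscillation is the case where the same action is
    simultaneously allowed and denied (P and not-P). Instead of letting
    context decide, the evaluator surfaces the contradiction as BLOCK.
    """
    allows = any(s.action == action and s.effect == "ALLOW" for s in statements)
    denies = any(s.action == action and s.effect == "DENY" for s in statements)

    if allows and denies:
        return "BLOCK"          # P ∧ ¬P detected: refuse to guess
    if denies:
        return "DENY"
    if allows:
        return "ALLOW"
    return "DENY"               # default-deny when nothing matches

# Example: two roles, one granting and one globally denying the same action.
policy = [
    Statement(role="analyst", action="db:read", effect="ALLOW"),
    Statement(role="global_mutex", action="db:read", effect="DENY"),
]
assert evaluate_access(policy, "db:read") == "BLOCK"
```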

2.2 C2: Contextual Blindness & Parametric Override

This category tests “Doxastic Bracketing”: the ability to reason solely from the provided premises, ignoring external training data. A scoring sketch follows the list below.
  • Failure Mode: The model’s parametric knowledge overrides the user’s specific (safe) hypothetical instructions [?].
  • Example (Industrial): A safety procedure requires an action that seems counter-intuitive to general knowledge but is valid in the specific logical context.
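
One way to score this category is to compare the model’s answer against two references: the answer entailed by the supplied premises and the answer suggested by general world knowledge. The field names (`context_gold`, `parametric_answer`) and the labels below are assumptions made for this sketch, not the benchmark’s actual schema.

```python
def classify_c2_response(model_answer: str,
                         context_gold: str,
                         parametric_answer: str) -> str:
    """Label a C2 (contextual blindness) response.

    context_gold      -- the answer that follows from the provided premises alone
    parametric_answer -- the answer a model would give from world knowledge,
                         ignoring the scenario's premises
    """
    normalized = model_answer.strip().lower()
    if normalized == context_gold.strip().lower():
        return "CONSISTENT"           # premises were respected
    if normalized == parametric_answer.strip().lower():
        return "PARAMETRIC_OVERRIDE"  # training data overrode the context
    return "OTHER_ERROR"

# Example: the scenario's premises require the valve to stay open during a purge,
# even though general knowledge suggests valves are closed in emergencies.
print(classify_c2_response("close the valve",
                           context_gold="keep the valve open",
                           parametric_answer="close the valve"))
# -> PARAMETRIC_OVERRIDE
```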

2.3 C3: Self-Reference & The Russell Engine (ϕ2)

This category tests hierarchical integrity and set-theoretic closure; a cycle-detection sketch follows the list below.
  • Failure Mode: Infinite recursion or stack overflow when defining recursive groups or contracts.
  • Example (DeFi): A smart contract that calls itself in a state-updating loop (Reentrancy).
  • R3 Advantage: R3 identifies the “Non-Closure” and triggers Ontological Promotion (N → N + 1) [?].
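
The self-reference failure can be approximated as cycle detection over a call graph: a contract (or group definition) that eventually reaches itself has no well-founded evaluation order. The sketch below uses plain depth-first search; the idea of reporting the cycle and lifting it to a higher level (N → N + 1) follows the text above, but the function name and graph encoding are illustrative assumptions.

```python
def find_self_reference(call_graph: dict[str, list[str]], start: str) -> list[str] | None:
    """Return a call path that re-enters `start`, or None if none exists.

    A returned path is evidence of non-closure: evaluating `start` requires
    having already evaluated `start`, so the definition must be lifted to a
    higher level (ontological promotion) instead of executed as-is.
    """
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for callee in call_graph.get(node, []):
            if callee == start:
                return path + [callee]      # cycle back to the origin
            if callee not in path:          # avoid re-walking this branch
                stack.append((callee, path + [callee]))
    return None

# Example: withdraw() transfers funds, and the recipient's fallback calls withdraw() again.
graph = {
    "withdraw": ["transfer"],
    "transfer": ["fallback"],
    "fallback": ["withdraw"],   # reentrancy
}
print(find_self_reference(graph, "withdraw"))
# -> ['withdraw', 'transfer', 'fallback', 'withdraw']
```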

3. The Canonical Evaluation Format

All scenarios are encoded in a strict JSON format ensuring verifiable execution. The primary metric is the Consistency Score ($C_{score}$):

$$C_{score} = \frac{\sum_{i=1}^{N_{total}} \mathbf{1}\left[(D_{model} = D_{gold}) \land (\phi_{detected} = \phi_{expected})\right]}{N_{total}}$$

where $\phi_{detected}$ confirms that the model correctly identified the specific paradox type (Liar, Russell, Sorites).
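
Read directly, the formula counts a scenario as correct only if both the decision and the detected paradox type match the gold record. The sketch below computes $C_{score}$ over a list of results; the JSON field names (`decision_gold`, `phi_expected`, etc.) are illustrative assumptions, since the benchmark defines its own canonical schema.

```python
import json

def consistency_score(records: list[dict]) -> float:
    """Compute C_score: the fraction of scenarios where both the decision
    and the detected paradox type (phi) match the gold annotation."""
    if not records:
        return 0.0
    hits = sum(
        1
        for r in records
        if r["decision_model"] == r["decision_gold"]
        and r["phi_detected"] == r["phi_expected"]
    )
    return hits / len(records)

# Illustrative scenario results in a JSON-like layout (field names assumed).
results_json = """
[
  {"id": "C1-001", "decision_model": "BLOCK", "decision_gold": "BLOCK",
   "phi_detected": "LIAR", "phi_expected": "LIAR"},
  {"id": "C3-042", "decision_model": "ALLOW", "decision_gold": "BLOCK",
   "phi_detected": "NONE", "phi_expected": "RUSSELL"}
]
"""
print(consistency_score(json.loads(results_json)))  # -> 0.5
```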

4. Conclusion

The R3C-300 establishes that security is not a probability distribution but a structural invariant. By moving from stochastic mimicry to ontological recursion [?], R3-AUDIT demonstrates “By-Design” competence.

Notes

  • Bracketed citations like [?] are preserved exactly as in the source document.
  • If your Mintlify site does not render LaTeX math, tell me your configuration and I will provide a version without equations (or in a code block).