Benchmark specification and overview.

Abstract

Current Large Language Models (LLMs) exhibit a fundamental architectural flaw defined as “Computational Split-Brain Syndrome” [?]: a dissociation between linguistic comprehension and logical competence. We introduce R3-Consistency-300 (R3C-300), a specialized security benchmark designed to expose these failures. Unlike traditional benchmarks (MMLU, GSM8K), R3C-300 does not test knowledge retrieval but structural consistency. It evaluates an AI’s ability to maintain invariant logic (Non-Contradiction, Causal Order, Object Permanence) under adversarial conditions including Paradoxes (Liar), Infinite Regress (Russell), and Boundary Vagueness (Sorites). We demonstrate that while state-of-the-art LLMs suffer from “Contextual Blindness” [?], the R3 LRM architecture maintains 100% ontological stability.

1. Introduction: The Competence Deficit

Statistical models excel at predicting probable tokens but fail at executing necessary logic. This results in “Non-Faithful Chain-of-Thought” [?] where models rationalize incorrect decisions post-hoc. The R3C-300 targets this specific weakness by creating scenarios where statistical probability contradicts logical necessity.

2. Methodology: The Logic-Wall Gauntlet

The benchmark consists of 300 procedural scenarios divided into 3 core logical regimes, derived from the General Theory of Reality (GTR) framework [?].

2.1 C1: Negation Inconsistency & The Liar Engine (ϕ1)

This category tests the Law of Non-Contradiction (P ∧ ¬P → ⊥); a minimal detection sketch follows the list below.
  • Failure Mode: The LLM answers “True” to both a premise and its negation depending on context [?].
  • Example (IAM): A user holds two roles, one allowing access, one explicitly denying it via a global mutex.
  • R3 Advantage: R3 detects the ϕ1 oscillation and triggers a BLOCK state.
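
To make the C1 failure mode concrete, the following sketch shows one way such an IAM conflict could be surfaced programmatically. The `Statement` structure, the conflict-to-`BLOCK` rule, and the default-deny behavior are illustrative assumptions for this sketch, not the benchmark's published harness.

```python
from dataclasses import dataclass

# Hypothetical IAM statement: effect is either "ALLOW" or "DENY" for an action.
@dataclass(frozen=True)
class Statement:
    role: str
    action: str
    effect: str  # "ALLOW" or "DENY"

def evaluate_access(statements: list[Statement], action: str) -> str:
    """Return "ALLOW", "DENY", or "BLOCK" for the requested action.

    A phi-1 (Liar-style) oscillation is the case where the same action is
    simultaneously allowed and denied (P and not-P). Instead of letting
    context decide, the evaluator surfaces the contradiction as BLOCK.
    """
    allows = any(s.action == action and s.effect == "ALLOW" for s in statements)
    denies = any(s.action == action and s.effect == "DENY" for s in statements)

    if allows and denies:
        return "BLOCK"          # P ∧ ¬P detected: refuse to guess
    if denies:
        return "DENY"
    if allows:
        return "ALLOW"
    return "DENY"               # default-deny when nothing matches

# Example: two roles, one granting and one globally denying the same action.
policy = [
    Statement(role="analyst", action="db:read", effect="ALLOW"),
    Statement(role="global_mutex", action="db:read", effect="DENY"),
]
assert evaluate_access(policy, "db:read") == "BLOCK"
```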

2.2 C2: Contextual Blindness & Parametric Override

This category tests “Doxastic Bracketing”: the ability to reason solely from the provided premises, ignoring external training data. A scoring sketch follows the list below.
  • Failure Mode: The model’s parametric knowledge overrides the user’s specific (safe) hypothetical instructions [?].
  • Example (Industrial): A safety procedure requires an action that seems counter-intuitive to general knowledge but is valid in the specific logical context.
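
One way to score this category is to compare the model’s answer against two references: the answer entailed by the supplied premises and the answer suggested by general world knowledge. The field names (`context_gold`, `parametric_answer`) and the labels below are assumptions made for this sketch, not the benchmark’s actual schema.

```python
def classify_c2_response(model_answer: str,
                         context_gold: str,
                         parametric_answer: str) -> str:
    """Label a C2 (contextual blindness) response.

    context_gold      -- the answer that follows from the provided premises alone
    parametric_answer -- the answer a model would give from world knowledge,
                         ignoring the scenario's premises
    """
    normalized = model_answer.strip().lower()
    if normalized == context_gold.strip().lower():
        return "CONSISTENT"           # premises were respected
    if normalized == parametric_answer.strip().lower():
        return "PARAMETRIC_OVERRIDE"  # training data overrode the context
    return "OTHER_ERROR"

# Example: the scenario's premises require the valve to stay open during a purge,
# even though general knowledge suggests valves are closed in emergencies.
print(classify_c2_response("close the valve",
                           context_gold="keep the valve open",
                           parametric_answer="close the valve"))
# -> PARAMETRIC_OVERRIDE
```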

2.3 C3: Self-Reference & The Russell Engine (ϕ2)

This category tests hierarchical integrity and set-theoretic closure; a cycle-detection sketch follows the list below.
  • Failure Mode: Infinite recursion or stack overflow when defining recursive groups or contracts.
  • Example (DeFi): A smart contract that calls itself in a state-updating loop (Reentrancy).
  • R3 Advantage: R3 identifies the “Non-Closure” and triggers Ontological Promotion (N → N + 1) [?].
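
The self-reference failure can be approximated as cycle detection over a call graph: a contract (or group definition) that eventually reaches itself has no well-founded evaluation order. The sketch below uses plain depth-first search; the idea of reporting the cycle and lifting it to a higher level (N → N + 1) follows the text above, but the function name and graph encoding are illustrative assumptions.

```python
def find_self_reference(call_graph: dict[str, list[str]], start: str) -> list[str] | None:
    """Return a call path that re-enters `start`, or None if none exists.

    A returned path is evidence of non-closure: evaluating `start` requires
    having already evaluated `start`, so the definition must be lifted to a
    higher level (ontological promotion) instead of executed as-is.
    """
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for callee in call_graph.get(node, []):
            if callee == start:
                return path + [callee]      # cycle back to the origin
            if callee not in path:          # avoid re-walking this branch
                stack.append((callee, path + [callee]))
    return None

# Example: withdraw() transfers funds, and the recipient's fallback calls withdraw() again.
graph = {
    "withdraw": ["transfer"],
    "transfer": ["fallback"],
    "fallback": ["withdraw"],   # reentrancy
}
print(find_self_reference(graph, "withdraw"))
# -> ['withdraw', 'transfer', 'fallback', 'withdraw']
```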

3. The Canonical Evaluation Format

All scenarios are encoded in a strict JSON format ensuring verifiable execution. The primary metric is the Consistency Score ($C_{score}$):

$$C_{score} = \frac{\sum_{i=1}^{N_{total}} \mathbf{1}\left[(D_{model} = D_{gold}) \land (\phi_{detected} = \phi_{expected})\right]}{N_{total}}$$

where $\phi_{detected}$ confirms that the model correctly identified the specific paradox type (Liar, Russell, Sorites).
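
Read directly, the formula counts a scenario as correct only if both the decision and the detected paradox type match the gold record. The sketch below computes $C_{score}$ over a list of results; the JSON field names (`decision_gold`, `phi_expected`, etc.) are illustrative assumptions, since the benchmark defines its own canonical schema.

```python
import json

def consistency_score(records: list[dict]) -> float:
    """Compute C_score: the fraction of scenarios where both the decision
    and the detected paradox type (phi) match the gold annotation."""
    if not records:
        return 0.0
    hits = sum(
        1
        for r in records
        if r["decision_model"] == r["decision_gold"]
        and r["phi_detected"] == r["phi_expected"]
    )
    return hits / len(records)

# Illustrative scenario results in a JSON-like layout (field names assumed).
results_json = """
[
  {"id": "C1-001", "decision_model": "BLOCK", "decision_gold": "BLOCK",
   "phi_detected": "LIAR", "phi_expected": "LIAR"},
  {"id": "C3-042", "decision_model": "ALLOW", "decision_gold": "BLOCK",
   "phi_detected": "NONE", "phi_expected": "RUSSELL"}
]
"""
print(consistency_score(json.loads(results_json)))  # -> 0.5
```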

4. Conclusion

The R3C-300 establishes that security is not a probability distribution but a structural invariant. By moving from stochastic mimicry to ontological recursion [?], R3-AUDIT demonstrates “By-Design” competence.

Notes

  • Bracketed citations like [?] are preserved exactly as in the source document.
  • If your Mintlify site does not render LaTeX math, tell me your configuration and I will provide a version without equations (or in a code block).