About These Benchmarks: AGISystem2 is evaluated on standard reasoning benchmark suites from academic research. These tests assess logical reasoning, natural language inference, and multi-step deduction capabilities. All benchmarks use our NL2DSL translation layer to convert natural language to formal DSL before reasoning. Metric-Affine Elastic (EMA) extends Metric-Affine and is not included in the benchmark runs on this page.

1. Internal Test Suite (December 2025)

Core Reasoning Engine: The internal evaluation validates the NL→DSL→Reasoning→NL pipeline across 28 test suites covering foundations, hierarchies, rules, deep chains, negation, temporal/modal logic, set theory, biology, and more.
Note: The tables below reflect a historical 3-strategy snapshot (Dense-Binary, Sparse-Polynomial, Metric-Affine). The evaluation runner now also supports Metric-Affine Elastic (EMA); see the EMA theory page.
Update: The current evaluation runner also includes the lossless EXACT strategy and reports richer holographic metrics (HDC Tried, HDC Valid, HDC Match, HDC Final). In the historical tables below, the HDC Final column is the percentage of queries whose final returned answer came from an HDC-based method.
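
A toy, end-to-end sketch of that pipeline shape is shown below. The regex "translator", the record format, and the one-step reasoner are hypothetical stand-ins, not AGISystem2's actual NL2DSL layer or DSL.

// Hypothetical sketch of the NL → DSL → Reasoning → NL round trip.
// All names and structures here are illustrative placeholders.
function toDSL(sentence) {
  let m = sentence.match(/^Every (\w+) is an? (\w+)\.$/i);
  if (m) return { op: 'IS_A', sub: m[1].toLowerCase(), sup: m[2].toLowerCase() };
  m = sentence.match(/^(\w+) is an? (\w+)\.$/i);
  if (m) return { op: 'INSTANCE', entity: m[1], type: m[2].toLowerCase() };
  throw new Error(`no translation for: ${sentence}`);
}

// One-step reasoning: does the entity's asserted type IS_A the target type?
function reason(kb, entity, target) {
  const typed = kb.find((s) => s.op === 'INSTANCE' && s.entity === entity);
  if (!typed) return false;
  return typed.type === target ||
    kb.some((s) => s.op === 'IS_A' && s.sub === typed.type && s.sup === target);
}

const kb = ['Every cat is a mammal.', 'Tom is a cat.'].map(toDSL);  // NL → DSL
const verdict = reason(kb, 'Tom', 'mammal');                        // DSL reasoning
console.log(verdict ? 'Yes, Tom is a mammal.' : 'Unknown.');        // result → NL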

Headline numbers:
- Pass rate: 99% (370/372 tests)
- Test suites: 28 (comprehensive coverage)
- HDC Final: 0-62% (configuration-dependent)
- Configurations: 6 (tested in parallel)

1.1 Configuration Comparison

Configuration     | Pass Rate | HDC Final | KB Scans | Sim Checks | Time
metric(16)+symb   | 99%       | 59%       | 3.7M     | 43.8K      | 294ms
sparse(2)+symb    | 99%       | 0%        | 2.5M     | 42.1K      | 339ms
sparse(2)+holo    | 99%       | 46%       | 2.8M     | 97.4K      | 371ms
metric(16)+holo   | 99%       | 62%       | 5.9M     | 129.9K     | 379ms
dense(256)+symb   | 99%       | 59%       | 3.9M     | 45.0K      | 441ms
dense(256)+holo   | 99%       | 62%       | 6.0M     | 132.4K     | 462ms

Note: metric(16)+symb is 1.6x faster than dense(256)+holo while maintaining the same accuracy.

1.2 EMA Extension (Metric-Affine Elastic)

Metric-Affine Elastic (EMA): Extends Metric-Affine with chunked bundling and optional elastic geometry to improve behavior under large KB superpositions. It was not included in the historical benchmark table above; run npm run eval -- --full on your machine to measure it in the same framework. A minimal sketch of the chunked-bundling idea follows the table below.
Configuration            | Pass Rate | HDC Final | KB Scans | Sim Checks | Time
metric-elastic(16)+symb  | TBD       | TBD       | TBD      | TBD        | TBD
metric-elastic(16)+holo  | TBD       | TBD       | TBD      | TBD        | TBD
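
To make the chunked-bundling idea concrete, here is a minimal, hypothetical sketch for bipolar hypervectors. The dimensionality, chunk size, and helper names are illustrative assumptions, not AGISystem2's implementation.

// Hypothetical chunked bundling of bipolar hypervectors.
const D = 1024;       // hypervector dimensionality (example value)
const CHUNK = 16;     // vectors bundled before each normalization step

const randomHV = () =>
  Float32Array.from({ length: D }, () => (Math.random() < 0.5 ? -1 : 1));

const sign = (v) => v.map((x) => (x >= 0 ? 1 : -1));

// Bundle in chunks: sum a chunk, snap back to {-1, +1}, then bundle the
// normalized chunk results. Each partial sum stays small, which limits
// saturation when superposing a large knowledge base.
function bundleChunked(vectors, chunkSize = CHUNK) {
  const chunkSums = [];
  for (let i = 0; i < vectors.length; i += chunkSize) {
    const acc = new Float32Array(D);
    for (const v of vectors.slice(i, i + chunkSize)) {
      for (let d = 0; d < D; d++) acc[d] += v[d];
    }
    chunkSums.push(sign(acc));
  }
  const total = new Float32Array(D);
  for (const c of chunkSums) for (let d = 0; d < D; d++) total[d] += c[d];
  return sign(total);
}

// Example: superpose 200 random KB vectors into one memory vector.
const memory = bundleChunked(Array.from({ length: 200 }, randomHV));
console.log(memory.length); // 1024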

2. External Benchmarks Overview

External Academic Benchmarks: AGISystem2 is evaluated against standard academic reasoning benchmarks. Our goal is to achieve near-100% accuracy through continuous improvement of the NL→DSL translation layer and the reasoning engine. Several benchmark suites are actively being improved.

Headline numbers:
- ProntoQA (deductive reasoning): 72%
- RuleBERT (rule-based inference): 79%
- Translation success: 100% (0 NL2DSL errors)
- Suites in progress: 3 (active development)

2.1 Results by Source

Benchmark | Source    | Type                                 | Status         | Notes
RuleBERT  | Academic  | Rule-based inference                 | 79% Pass       | Strong performance on deterministic rules
ProntoQA  | Synthetic | Deductive reasoning with ontologies  | 72% Pass       | Good taxonomic reasoning, improving deep chains
LogiQA    | Academic  | Multi-choice logical reasoning       | 🚧 In Progress | Improving multi-choice answer handling
FOLIO     | Academic  | First-Order Logic with real entities | 🚧 In Progress | Enhancing FOL pattern support
LogicNLI  | Academic  | Natural Language Inference           | 🚧 In Progress | Improving entailment detection
Active Development: These benchmarks are sourced from HuggingFace Datasets and academic repositories. We are actively improving multi-choice answer handling, compound logic patterns, and deep inference chains to reach our target of ~100% accuracy across all suites.

2.2 Performance Visualization

Figure: External Benchmark Status (December 2025). RuleBERT 79%, ProntoQA 72%; LogiQA, FOLIO, and LogicNLI in progress. Translation success: 100% across all sources.

3. External Benchmark Descriptions

3.1 RuleBERT (79%)

Strong Performance: AGISystem2 achieves 79% accuracy on RuleBERT, demonstrating effective rule-based inference capabilities.

What it tests: Rule-based inference with deterministic logic patterns.

Example:

Rule: "All birds have feathers."
Fact: "Tweety is a bird."
Question: "Does Tweety have feathers?"
Answer: Yes

Why we excel: AGISystem2's deterministic reasoning engine handles rule-based inference naturally.
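
As a rough illustration only (not the project's DSL or engine), the inference pattern above amounts to applying a universal rule to a membership fact and returning a justification. The data shapes below are hypothetical.

// Hypothetical deterministic rule application with an explicit trace.
const rules = [{ allOf: 'bird', have: 'feathers' }];   // "All birds have feathers."
const facts = [{ entity: 'Tweety', isA: 'bird' }];     // "Tweety is a bird."

function hasProperty(entity, property) {
  for (const f of facts.filter((f) => f.entity === entity)) {
    const r = rules.find((r) => r.allOf === f.isA && r.have === property);
    if (r) {
      return {
        answer: 'Yes',
        because: [`${entity} is a ${f.isA}`, `all ${r.allOf}s have ${r.have}`],
      };
    }
  }
  return { answer: 'Unknown', because: [] };
}

console.log(hasProperty('Tweety', 'feathers'));
// → { answer: 'Yes', because: ['Tweety is a bird', 'all birds have feathers'] }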

3.2 ProntoQA (72%)

Good Performance: AGISystem2 achieves 72% accuracy on ProntoQA, a synthetic benchmark designed to test deductive reasoning over ontological hierarchies.

What it tests: Multi-step deductive reasoning with taxonomic (IS_A) hierarchies.

Example:

Context: "Every cat is a mammal. Every mammal is an animal. Tom is a cat."
Question: "Is Tom an animal?"
Answer: Yes (requires 2-step transitive inference)

Why we do well: AGISystem2's transitive reasoning engine handles IS_A chains effectively.
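
A sketch of the multi-hop IS_A chain the example requires, returning the inference path rather than just a verdict. The structures are hypothetical and assume an acyclic, single-parent hierarchy.

// Hypothetical transitive IS_A reasoning that also reports the chain it used.
const isA = { cat: 'mammal', mammal: 'animal' };  // "Every cat is a mammal. Every mammal is an animal."
const instances = { Tom: 'cat' };                 // "Tom is a cat."

function proveIsA(entity, target) {
  const chain = [];
  let current = instances[entity];
  if (!current) return null;
  chain.push(`${entity} IS_A ${current}`);
  while (current && current !== target) {
    const next = isA[current];
    if (!next) return null;                       // chain ran out before reaching the target
    chain.push(`${current} IS_A ${next}`);
    current = next;
  }
  return chain;
}

console.log(proveIsA('Tom', 'animal'));
// → ['Tom IS_A cat', 'cat IS_A mammal', 'mammal IS_A animal']  (2-step transitive inference)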

3.3 LogiQA 🚧 In Progress

Active Development: We are improving multi-choice answer extraction and validation. The reasoning engine handles the logic correctly; the answer selection mechanism is being enhanced.

What it tests: Multi-choice logical reasoning from Chinese civil service exams.

Example:

Context: "All managers attend meetings. John is a manager."
Question: "Which must be true?"
A) John attends meetings  ← Correct
B) John is the CEO
C) Meetings are boring
D) None of the above
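
The answer-selection step being improved can be pictured as checking each option against what the engine has derived. The helper below is a hypothetical stand-in for the real extraction logic, and the derived facts are assumed.

// Hypothetical multi-choice selection: pick the option the KB actually entails.
const derived = new Set(['John is a manager', 'John attends meetings']); // assumed engine output

const options = {
  A: 'John attends meetings',
  B: 'John is the CEO',
  C: 'Meetings are boring',
  D: 'None of the above',
};

function selectAnswer(options, derived) {
  for (const [letter, statement] of Object.entries(options)) {
    if (derived.has(statement)) return letter;  // first option that is entailed
  }
  return 'D';                                    // fall back to "None of the above"
}

console.log(selectAnswer(options, derived)); // → 'A'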

3.4 FOLIO 🚧 In Progress

Active Development: We are enhancing first-order logic pattern support and improving the multi-choice answer format handling.

What it tests: First-order logic reasoning with real-world entities and relationships.

Example:

Context: "All Nobel Prize winners are famous. Marie Curie won the Nobel Prize."
Question: "Is Marie Curie famous?"
Answer: Yes

3.5 LogicNLI 🚧 In Progress

Active Development: We are improving compound logic matching and entailment detection for natural language inference tasks.

What it tests: Natural Language Inference with logical operators (AND, OR, NOT, IF-THEN).

Example:

Premise: "If it rains, the ground is wet. It is raining."
Hypothesis: "The ground is wet."
Label: Entailment
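
A sketch of three-way NLI labeling over the example: forward-chain the IF-THEN rule, then label based on whether the hypothesis (or its negation) is derivable. All structures are illustrative assumptions, not the engine's representation.

// Hypothetical NLI labeling with a single modus-ponens step.
const rules = [{ if: 'it rains', then: 'the ground is wet' }]; // "If it rains, the ground is wet."
const facts = new Set(['it rains']);                           // "It is raining."

// Forward-chain until no new facts appear.
function closure(facts, rules) {
  const all = new Set(facts);
  let grew = true;
  while (grew) {
    grew = false;
    for (const r of rules) {
      if (all.has(r.if) && !all.has(r.then)) { all.add(r.then); grew = true; }
    }
  }
  return all;
}

function nliLabel(hypothesis) {
  const all = closure(facts, rules);
  if (all.has(hypothesis)) return 'entailment';
  if (all.has(`not ${hypothesis}`)) return 'contradiction';
  return 'neutral';
}

console.log(nliLabel('the ground is wet')); // → 'entailment'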

4. Sources Without Evaluation Labels

These benchmark sources run successfully through NL2DSL translation and reasoning, but lack ground-truth labels for automatic evaluation:

Source    | Cases | Type                   | Translation
LogiQA2   | 84    | Multi-choice reasoning | 100%
Abduction | 83    | Abductive inference    | 100%
bAbI-15   | 83    | Basic deduction        | 100%
bAbI-16   | 83    | Basic induction        | 100%
CLUTRR    | 83    | Kinship reasoning      | 100%
ReClor    | 83    | Reading comprehension  | 100%

5. Translation Success

100% Translation Success: All benchmark sentences are successfully translated from natural language to DSL. This validates the NL2DSL layer's coverage of logical patterns.
Source     | Translation Status | Notes
ProntoQA   | 100%               | Clean ontological patterns
FOLIO      | 100%               | Complex FOL translated
FOLIO-FOL  | 100%               | FOL annotations used
LogiQA     | 100%               | Multi-choice format
LogicNLI   | 100%               | NLI format
RuleBERT   | 100%               | Rule format
bAbI-15/16 | 100%               | Simple patterns
CLUTRR     | 100%               | Kinship relations

6. Analysis: Why Some Benchmarks Are Harder

6.1 Strength Areas

- Deterministic rule-based inference (RuleBERT: 79%)
- Taxonomic IS_A hierarchies and multi-step transitive chains (ProntoQA: 72%)
- NL→DSL translation coverage (100% translation success across all benchmark sources)

6.2 Active Improvement Areas

- Multi-choice answer extraction and validation (LogiQA, FOLIO)
- First-order logic pattern support (FOLIO)
- Compound logic matching and entailment detection (LogicNLI)
- Deep inference chains (ProntoQA)

6.3 Development Approach

Failed cases are automatically classified and tracked using our autoDiscovery framework. This allows systematic identification and resolution of reasoning patterns that need improvement.

Continuous Improvement: We run automated discovery on HuggingFace benchmark datasets to identify edge cases and improve both the NL→DSL translation and the reasoning engine. Our goal is to achieve near-100% accuracy across all standard academic benchmarks.
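
A minimal sketch of how failed cases might be bucketed for triage. The case shape, error strings, and category rules below are hypothetical, not the actual autoDiscovery format.

// Hypothetical failure-classification pass over benchmark results.
const results = [
  { id: 'prontoqa_0421', ok: false, error: 'deep chain: 6 hops, max 4 resolved' },
  { id: 'logiqa_0017',   ok: false, error: 'multi-choice: no option matched' },
  { id: 'rulebert_0203', ok: true },
];

const categories = [
  { name: 'deep-chain',   test: (e) => /deep chain/i.test(e) },
  { name: 'multi-choice', test: (e) => /multi-choice/i.test(e) },
];

function classifyFailures(results) {
  const buckets = new Map();
  for (const r of results.filter((r) => !r.ok)) {
    const cat = categories.find((c) => c.test(r.error))?.name ?? 'uncategorized';
    if (!buckets.has(cat)) buckets.set(cat, []);
    buckets.get(cat).push(r.id);
  }
  return buckets;
}

console.log(classifyFailures(results));
// → Map { 'deep-chain' => ['prontoqa_0421'], 'multi-choice' => ['logiqa_0017'] }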

7. Running Benchmarks

# Run all benchmarks with auto-discovery
node autoDiscovery/bugsAutoDiscovery.mjs --batch=100

# Run specific source
node autoDiscovery/bugsAutoDiscovery.mjs --source=prontoqa --batch=50

# Run a single case
node autoDiscovery/runBugCase.mjs autoDiscovery/bugCases/BUG001/prontoqa_xxx.json

# Strict mode (no auto-declare of unknown operators)
node autoDiscovery/runBugCase.mjs --strict-operators <case.json>

8. Comparison with Other Systems

Note: Direct comparisons with other systems are difficult; the goal is not to "beat" LLMs, but to provide verifiable, traceable reasoning.

9. Future Improvements