About These Benchmarks: AGISystem2 is evaluated on standard reasoning benchmark suites from academic research.
These tests assess logical reasoning, natural language inference, and multi-step deduction capabilities.
All benchmarks use our NL2DSL translation layer to convert natural language to formal DSL before reasoning.
Metric-Affine Elastic (EMA) extends Metric-Affine and is not included in the benchmark runs on this page.
1. Internal Test Suite (December 2025)
Core Reasoning Engine: The internal test suite validates the NL→DSL→Reasoning→NL pipeline
across 28 test suites covering foundations, hierarchies, rules, deep chains, negation, temporal/modal logic,
set theory, biology, and more.
Note: The tables below reflect a historical 3-strategy snapshot (Dense-Binary, Sparse-Polynomial, Metric-Affine). The evaluation runner now also supports Metric-Affine Elastic (EMA); see the EMA theory page.
Update: The current evaluation runner also includes the lossless EXACT strategy and reports richer holographic metrics (HDC Tried, HDC Valid, HDC Match, HDC Final). In the historical tables below, HDC% should be read as HDC Final (the % of queries where the final returned method was HDC-based).
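Illustration (not actual AGISystem2 code): a minimal sketch of how these four percentages could be tallied from per-query results, under the reading above (HDC Final = share of queries whose final returned method was HDC-based). The per-query field names (hdcTried, hdcValid, hdcMatch, finalMethod) are assumed for illustration and are not the runner's real output schema.
// Tally the holographic metrics over an array of per-query result records.
function summarizeHdc(results) {
  const pct = (k) => ((100 * k) / results.length).toFixed(1) + "%";
  return {
    // The exact semantics of Tried/Valid/Match are assumed from their names.
    hdcTried: pct(results.filter((r) => r.hdcTried).length),
    hdcValid: pct(results.filter((r) => r.hdcValid).length),
    hdcMatch: pct(results.filter((r) => r.hdcMatch).length),
    // HDC Final: the final returned method was HDC-based (the HDC% in the tables below).
    hdcFinal: pct(results.filter((r) => r.finalMethod === "hdc").length),
  };
}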
Pass rate: 99% (370/372 tests)
Test suites: 28 (comprehensive coverage)
HDC Final: 0-62% (configuration-dependent)
Configurations: 6 (tested in parallel)
1.1 Configuration Comparison
Configuration     | Pass Rate | HDC Final | KB Scans | Sim Checks | Time
------------------|-----------|-----------|----------|------------|------
metric(16)+symb   | 99%       | 59%       | 3.7M     | 43.8K      | 294ms
sparse(2)+symb    | 99%       | 0%        | 2.5M     | 42.1K      | 339ms
sparse(2)+holo    | 99%       | 46%       | 2.8M     | 97.4K      | 371ms
metric(16)+holo   | 99%       | 62%       | 5.9M     | 129.9K     | 379ms
dense(256)+symb   | 99%       | 59%       | 3.9M     | 45.0K      | 441ms
dense(256)+holo   | 99%       | 62%       | 6.0M     | 132.4K     | 462ms
Note: metric(16)+symb is 1.6x faster than dense(256)+holo while maintaining the same accuracy.
1.2 EMA Extension (Metric-Affine Elastic)
Metric-Affine Elastic (EMA): Extends Metric-Affine with chunked bundling and optional elastic geometry to improve behavior under large KB superpositions. It was not included in the historical benchmark table above; run npm run eval -- --full on your machine to measure it in the same framework.
Configuration           | Pass Rate | HDC Final | KB Scans | Sim Checks | Time
------------------------|-----------|-----------|----------|------------|-----
metric-elastic(16)+symb | TBD       | —         | —        | —          | —
metric-elastic(16)+holo | TBD       | —         | —        | —          | —
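Illustration (not the actual EMA implementation): the chunked-bundling idea can be sketched over bipolar (+1/-1) hypervectors. Each chunk is bundled by a majority-sign vote, and the partial results are then bundled again, so no single vote spans the whole KB. The chunk size, the tie-breaking rule, and the omission of elastic geometry below are simplifying assumptions.
// Bundle bipolar hypervectors in fixed-size chunks, then bundle the chunk results.
function bundleChunked(vectors, chunkSize = 32) {
  if (vectors.length === 1) return vectors[0];
  const dim = vectors[0].length;
  const partials = [];
  for (let i = 0; i < vectors.length; i += chunkSize) {
    const sums = new Array(dim).fill(0);
    for (const v of vectors.slice(i, i + chunkSize)) {
      for (let d = 0; d < dim; d++) sums[d] += v[d];
    }
    partials.push(sums.map((s) => (s >= 0 ? 1 : -1))); // ties break to +1 (assumption)
  }
  return bundleChunked(partials, chunkSize);
}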
2. External Benchmarks Overview
External Academic Benchmarks: AGISystem2 is evaluated against standard academic reasoning benchmarks.
Our goal is to achieve near-100% accuracy through continuous improvement of the NL→DSL translation layer
and the reasoning engine. Several benchmark suites are actively being improved.
ProntoQA (deductive reasoning): 72%
RuleBERT (rule-based inference): 79%
Translation success: 100% (0 NL2DSL errors)
Suites in progress: 3 (active development)
2.1 Results by Source
Benchmark            | Type                                  | Status         | Notes
---------------------|---------------------------------------|----------------|------------------------------------------------
RuleBERT (Academic)  | Rule-based inference                  | 79% Pass       | Strong performance on deterministic rules
ProntoQA (Synthetic) | Deductive reasoning with ontologies   | 72% Pass       | Good taxonomic reasoning, improving deep chains
LogiQA (Academic)    | Multi-choice logical reasoning        | 🚧 In Progress | Improving multi-choice answer handling
FOLIO (Academic)     | First-Order Logic with real entities  | 🚧 In Progress | Enhancing FOL pattern support
LogicNLI (Academic)  | Natural Language Inference            | 🚧 In Progress | Improving entailment detection
Active Development: These benchmarks are sourced from
HuggingFace Datasets and academic repositories.
We are actively improving multi-choice answer handling, compound logic patterns, and deep inference chains
to reach our target of ~100% accuracy across all suites.
3. Benchmark Details
3.1 ProntoQA (Deductive Reasoning)
Good Performance: AGISystem2 achieves 72% accuracy on ProntoQA, a synthetic benchmark designed to test deductive reasoning over ontological hierarchies.
What it tests: Multi-step deductive reasoning with taxonomic (IS_A) hierarchies.
Example:
Context: "Every cat is a mammal. Every mammal is an animal. Tom is a cat."
Question: "Is Tom an animal?"
Answer: Yes (requires 2-step transitive inference)
Why we do well: AGISystem2's transitive reasoning engine handles IS_A chains effectively.
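Illustration (not AGISystem2's DSL or engine): the inference required here amounts to a transitive closure over IS_A facts, as in the sketch below. The fact encoding and function names are illustrative assumptions.
// IS_A edges extracted from the example context above.
const isaFacts = [["cat", "mammal"], ["mammal", "animal"]];
const instanceOf = { Tom: "cat" };

// Collect every concept reachable from `concept` via IS_A edges.
function isaClosure(concept) {
  const reached = new Set([concept]);
  let grew = true;
  while (grew) {
    grew = false;
    for (const [sub, sup] of isaFacts) {
      if (reached.has(sub) && !reached.has(sup)) { reached.add(sup); grew = true; }
    }
  }
  return reached;
}

// "Is Tom an animal?" -> true, via cat -> mammal -> animal (two steps).
console.log(isaClosure(instanceOf.Tom).has("animal")); // true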
3.2 LogiQA (Multi-Choice Reasoning)
Active Development: We are improving multi-choice answer extraction and validation. The reasoning engine handles the logic correctly; the answer selection mechanism is being enhanced.
What it tests: Multi-choice logical reasoning from Chinese civil service exams.
Example:
Context: "All managers attend meetings. John is a manager."
Question: "Which must be true?"
A) John attends meetings ← Correct
B) John is the CEO
C) Meetings are boring
D) None of the above
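Illustration (not AGISystem2's actual API): one way to frame the answer-selection step is to ask the reasoner whether each option is entailed by the context and return the option that must be true. The reason(context, claim) function below is a placeholder for the NL→DSL→Reasoning pipeline; its name and return shape are assumptions.
// Pick the option that the context entails; leave ambiguous cases undecided.
async function selectAnswer(context, options, reason) {
  const entailed = [];
  for (const option of options) {
    const verdict = await reason(context, option.text); // placeholder reasoner call
    if (verdict.entailed) entailed.push(option);
  }
  // Exactly one entailed option -> that is the answer; otherwise flag for review.
  return entailed.length === 1 ? entailed[0].label : null;
}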
4. Sources Without Ground-Truth Labels
These benchmark sources run successfully through NL2DSL translation and reasoning, but lack ground-truth labels for automatic evaluation:
Source    | Cases | Type                   | Translation
----------|-------|------------------------|------------
LogiQA2   | 84    | Multi-choice reasoning | 100%
Abduction | 83    | Abductive inference    | 100%
bAbI-15   | 83    | Basic deduction        | 100%
bAbI-16   | 83    | Basic induction        | 100%
CLUTRR    | 83    | Kinship reasoning      | 100%
ReClor    | 83    | Reading comprehension  | 100%
5. Translation Success
100% Translation Success: All benchmark sentences are successfully translated
from natural language to DSL. This validates the NL2DSL layer's coverage of logical patterns.
Source     | Translation Status | Notes
-----------|--------------------|---------------------------
ProntoQA   | 100%               | Clean ontological patterns
FOLIO      | 100%               | Complex FOL translated
FOLIO-FOL  | 100%               | FOL annotations used
LogiQA     | 100%               | Multi-choice format
LogicNLI   | 100%               | NLI format
RuleBERT   | 100%               | Rule format
bAbI-15/16 | 100%               | Simple patterns
CLUTRR     | 100%               | Kinship relations
6. Analysis: Why Some Benchmarks Are Harder
6.1 Strength Areas
Rule-based inference: Deterministic rules with clear antecedents (RuleBERT: 79%)
Multi-step deduction: Chained inferences up to 3+ steps
Translation coverage: 100% NL→DSL success across all sources
6.2 Active Improvement Areas
Multi-choice format: Answer extraction and validation for LogiQA, FOLIO
Compound logic: Complex And/Or patterns for LogicNLI
Deep chains: Extended inference chains for ProntoQA
FOL patterns: First-order logic structures for FOLIO
6.3 Development Approach
Failed cases are automatically classified and tracked using our autoDiscovery framework.
This allows systematic identification and resolution of reasoning patterns that need improvement.
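Illustration (not the actual autoDiscovery schema): classification can be as simple as bucketing each failure record by the symptom it shows; the bucket names and record fields below are assumptions chosen to mirror the improvement areas listed above.
// Assign a failed case to a coarse bucket so recurring gaps surface in counts.
function classifyFailure(failure) {
  if (failure.translationError) return "nl2dsl-translation";
  if (failure.expectedChoice && failure.answeredChoice == null) return "multi-choice-extraction";
  if ((failure.inferenceDepth ?? 0) > 3) return "deep-chain";
  return "other";
}

function bucketFailures(failures) {
  const buckets = new Map();
  for (const f of failures) {
    const key = classifyFailure(f);
    buckets.set(key, (buckets.get(key) ?? 0) + 1);
  }
  return buckets; // Map of bucket name -> failure count
}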
Continuous Improvement: We run automated discovery on HuggingFace benchmark datasets
to identify edge cases and improve both the NL→DSL translation and the reasoning engine.
Our goal is to achieve near-100% accuracy across all standard academic benchmarks.
7. Running Benchmarks
# Run all benchmarks with auto-discovery
node autoDiscovery/bugsAutoDiscovery.mjs --batch=100
# Run specific source
node autoDiscovery/bugsAutoDiscovery.mjs --source=prontoqa --batch=50
# Run a single case
node autoDiscovery/runBugCase.mjs autoDiscovery/bugCases/BUG001/prontoqa_xxx.json
# Strict mode (no auto-declare of unknown operators)
node autoDiscovery/runBugCase.mjs --strict-operators <case.json>