About These Benchmarks: AGISystem2 is evaluated on standard reasoning benchmark suites from academic research.
These tests assess logical reasoning, natural language inference, and multi-step deduction capabilities.
All benchmarks use our NL2DSL translation layer to convert natural language to formal DSL before reasoning.
Metric-Affine Elastic (EMA) extends Metric-Affine and is not included in the benchmark runs on this page.
1. Internal Test Suite (December 2025)
Core Reasoning Engine: The internal test suite validates the NL→DSL→Reasoning→NL pipeline
across 28 test suites covering foundations, hierarchies, rules, deep chains, negation, temporal/modal logic,
set theory, biology, and more.
Note: The tables below reflect a historical 3-strategy snapshot (Dense-Binary, Sparse-Polynomial, Metric-Affine). The evaluation runner now also supports Metric-Affine Elastic (EMA); see the EMA theory page.
Update: The current evaluation runner also includes the lossless EXACT strategy and reports richer holographic metrics (HDC Tried, HDC Valid, HDC Match, HDC Final). In the historical tables below, HDC% should be read as HDC Final (the % of queries where the final returned method was HDC-based).
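Illustration (not actual AGISystem2 code): a minimal sketch of how these four percentages could be tallied from per-query results, under the reading above (HDC Final = share of queries whose final returned method was HDC-based). The per-query field names (hdcTried, hdcValid, hdcMatch, finalMethod) are assumed for illustration and are not the runner's real output schema.
// Tally the holographic metrics over an array of per-query result records.
function summarizeHdc(results) {
  const pct = (k) => ((100 * k) / results.length).toFixed(1) + "%";
  return {
    // The exact semantics of Tried/Valid/Match are assumed from their names.
    hdcTried: pct(results.filter((r) => r.hdcTried).length),
    hdcValid: pct(results.filter((r) => r.hdcValid).length),
    hdcMatch: pct(results.filter((r) => r.hdcMatch).length),
    // HDC Final: the final returned method was HDC-based (the HDC% in the tables below).
    hdcFinal: pct(results.filter((r) => r.finalMethod === "hdc").length),
  };
}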
Pass rate: 99% (370/372 tests)
Test suites: 28 (comprehensive coverage)
HDC Final: 0-62% (configuration-dependent)
Configurations: 6 (tested in parallel)
1.1 Configuration Comparison
Configuration     | Pass Rate | HDC Final | KB Scans | Sim Checks | Time
------------------|-----------|-----------|----------|------------|------
metric(16)+symb   | 99%       | 59%       | 3.7M     | 43.8K      | 294ms
sparse(2)+symb    | 99%       | 0%        | 2.5M     | 42.1K      | 339ms
sparse(2)+holo    | 99%       | 46%       | 2.8M     | 97.4K      | 371ms
metric(16)+holo   | 99%       | 62%       | 5.9M     | 129.9K     | 379ms
dense(256)+symb   | 99%       | 59%       | 3.9M     | 45.0K      | 441ms
dense(256)+holo   | 99%       | 62%       | 6.0M     | 132.4K     | 462ms
Note: metric(16)+symb is 1.6x faster than dense(256)+holo while maintaining the same accuracy.
1.2 EMA Extension (Metric-Affine Elastic)
Metric-Affine Elastic (EMA): Extends Metric-Affine with chunked bundling and optional elastic geometry to improve behavior under large KB superpositions. It was not included in the historical benchmark table above; run npm run eval -- --full on your machine to measure it in the same framework.
Configuration           | Pass Rate | HDC Final | KB Scans | Sim Checks | Time
------------------------|-----------|-----------|----------|------------|-----
metric-elastic(16)+symb | TBD       | —         | —        | —          | —
metric-elastic(16)+holo | TBD       | —         | —        | —          | —
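Illustration (not the actual EMA implementation): the chunked-bundling idea can be sketched over bipolar (+1/-1) hypervectors. Each chunk is bundled by a majority-sign vote, and the partial results are then bundled again, so no single vote spans the whole KB. The chunk size, the tie-breaking rule, and the omission of elastic geometry below are simplifying assumptions.
// Bundle bipolar hypervectors in fixed-size chunks, then bundle the chunk results.
function bundleChunked(vectors, chunkSize = 32) {
  if (vectors.length === 1) return vectors[0];
  const dim = vectors[0].length;
  const partials = [];
  for (let i = 0; i < vectors.length; i += chunkSize) {
    const sums = new Array(dim).fill(0);
    for (const v of vectors.slice(i, i + chunkSize)) {
      for (let d = 0; d < dim; d++) sums[d] += v[d];
    }
    partials.push(sums.map((s) => (s >= 0 ? 1 : -1))); // ties break to +1 (assumption)
  }
  return bundleChunked(partials, chunkSize);
}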
2. External Benchmarks Overview
External Academic Benchmarks: AGISystem2 is evaluated against standard academic reasoning benchmarks.
Our goal is to achieve near-100% accuracy through continuous improvement of the NL→DSL translation layer
and the reasoning engine. Several benchmark suites are actively being improved.
ProntoQA (deductive reasoning): 72%
RuleBERT (rule-based inference): 79%
Translation success: 100% (0 NL2DSL errors)
Suites in progress: 3 (active development)
2.1 Results by Source
Benchmark            | Type                                  | Status         | Notes
---------------------|---------------------------------------|----------------|------------------------------------------------
RuleBERT (Academic)  | Rule-based inference                  | 79% Pass       | Strong performance on deterministic rules
ProntoQA (Synthetic) | Deductive reasoning with ontologies   | 72% Pass       | Good taxonomic reasoning, improving deep chains
LogiQA (Academic)    | Multi-choice logical reasoning        | 🚧 In Progress | Improving multi-choice answer handling
FOLIO (Academic)     | First-Order Logic with real entities  | 🚧 In Progress | Enhancing FOL pattern support
LogicNLI (Academic)  | Natural Language Inference            | 🚧 In Progress | Improving entailment detection
Active Development: These benchmarks are sourced from
HuggingFace Datasets and academic repositories.
We are actively improving multi-choice answer handling, compound logic patterns, and deep inference chains
to reach our target of ~100% accuracy across all suites.
3. Benchmark Details
3.1 ProntoQA (Deductive Reasoning)
Good Performance: AGISystem2 achieves 72% accuracy on ProntoQA, a synthetic benchmark designed to test deductive reasoning over ontological hierarchies.
What it tests: Multi-step deductive reasoning with taxonomic (IS_A) hierarchies.
Example:
Context: "Every cat is a mammal. Every mammal is an animal. Tom is a cat."
Question: "Is Tom an animal?"
Answer: Yes (requires 2-step transitive inference)
Why we do well: AGISystem2's transitive reasoning engine handles IS_A chains effectively.
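Illustration (not AGISystem2's DSL or engine): the inference required here amounts to a transitive closure over IS_A facts, as in the sketch below. The fact encoding and function names are illustrative assumptions.
// IS_A edges extracted from the example context above.
const isaFacts = [["cat", "mammal"], ["mammal", "animal"]];
const instanceOf = { Tom: "cat" };

// Collect every concept reachable from `concept` via IS_A edges.
function isaClosure(concept) {
  const reached = new Set([concept]);
  let grew = true;
  while (grew) {
    grew = false;
    for (const [sub, sup] of isaFacts) {
      if (reached.has(sub) && !reached.has(sup)) { reached.add(sup); grew = true; }
    }
  }
  return reached;
}

// "Is Tom an animal?" -> true, via cat -> mammal -> animal (two steps).
console.log(isaClosure(instanceOf.Tom).has("animal")); // true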
3.2 LogiQA (Multi-Choice Reasoning)
Active Development: We are improving multi-choice answer extraction and validation. The reasoning engine handles the logic correctly; the answer selection mechanism is being enhanced.
What it tests: Multi-choice logical reasoning from Chinese civil service exams.
Example:
Context: "All managers attend meetings. John is a manager."
Question: "Which must be true?"
A) John attends meetings ← Correct
B) John is the CEO
C) Meetings are boring
D) None of the above
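Illustration (not AGISystem2's actual API): one way to frame the answer-selection step is to ask the reasoner whether each option is entailed by the context and return the option that must be true. The reason(context, claim) function below is a placeholder for the NL→DSL→Reasoning pipeline; its name and return shape are assumptions.
// Pick the option that the context entails; leave ambiguous cases undecided.
async function selectAnswer(context, options, reason) {
  const entailed = [];
  for (const option of options) {
    const verdict = await reason(context, option.text); // placeholder reasoner call
    if (verdict.entailed) entailed.push(option);
  }
  // Exactly one entailed option -> that is the answer; otherwise flag for review.
  return entailed.length === 1 ? entailed[0].label : null;
}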
4. Sources Without Ground-Truth Labels
These benchmark sources run successfully through NL2DSL translation and reasoning, but lack ground-truth labels for automatic evaluation:
Source    | Cases | Type                   | Translation
----------|-------|------------------------|------------
LogiQA2   | 84    | Multi-choice reasoning | 100%
Abduction | 83    | Abductive inference    | 100%
bAbI-15   | 83    | Basic deduction        | 100%
bAbI-16   | 83    | Basic induction        | 100%
CLUTRR    | 83    | Kinship reasoning      | 100%
ReClor    | 83    | Reading comprehension  | 100%
5. Translation Success
100% Translation Success: All benchmark sentences are successfully translated
from natural language to DSL. This validates the NL2DSL layer's coverage of logical patterns.
Source     | Translation Status | Notes
-----------|--------------------|---------------------------
ProntoQA   | 100%               | Clean ontological patterns
FOLIO      | 100%               | Complex FOL translated
FOLIO-FOL  | 100%               | FOL annotations used
LogiQA     | 100%               | Multi-choice format
LogicNLI   | 100%               | NLI format
RuleBERT   | 100%               | Rule format
bAbI-15/16 | 100%               | Simple patterns
CLUTRR     | 100%               | Kinship relations
6. Analysis: Why Some Benchmarks Are Harder
6.1 Strength Areas
Rule-based inference: Deterministic rules with clear antecedents (RuleBERT: 79%)
Multi-step deduction: Chained inferences up to 3+ steps
Translation coverage: 100% NL→DSL success across all sources
6.2 Active Improvement Areas
Multi-choice format: Answer extraction and validation for LogiQA, FOLIO
Compound logic: Complex And/Or patterns for LogicNLI
Deep chains: Extended inference chains for ProntoQA
FOL patterns: First-order logic structures for FOLIO
6.3 Development Approach
Failed cases are automatically classified and tracked using our autoDiscovery framework.
This allows systematic identification and resolution of reasoning patterns that need improvement.
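Illustration (not the actual autoDiscovery schema): classification can be as simple as bucketing each failure record by the symptom it shows; the bucket names and record fields below are assumptions chosen to mirror the improvement areas listed above.
// Assign a failed case to a coarse bucket so recurring gaps surface in counts.
function classifyFailure(failure) {
  if (failure.translationError) return "nl2dsl-translation";
  if (failure.expectedChoice && failure.answeredChoice == null) return "multi-choice-extraction";
  if ((failure.inferenceDepth ?? 0) > 3) return "deep-chain";
  return "other";
}

function bucketFailures(failures) {
  const buckets = new Map();
  for (const f of failures) {
    const key = classifyFailure(f);
    buckets.set(key, (buckets.get(key) ?? 0) + 1);
  }
  return buckets; // Map of bucket name -> failure count
}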
Continuous Improvement: We run automated discovery on HuggingFace benchmark datasets
to identify edge cases and improve both the NL→DSL translation and the reasoning engine.
Our goal is to achieve near-100% accuracy across all standard academic benchmarks.
7. Running Benchmarks
# Run all benchmarks with auto-discovery
node autoDiscovery/bugsAutoDiscovery.mjs --batch=100
# Run specific source
node autoDiscovery/bugsAutoDiscovery.mjs --source=prontoqa --batch=50
# Run a single case
node autoDiscovery/runBugCase.mjs autoDiscovery/bugCases/BUG001/prontoqa_xxx.json
# Strict mode (no auto-declare of unknown operators)
node autoDiscovery/runBugCase.mjs --strict-operators <case.json>