Theories as training data factories. AGISystem2's formal theories can be used to generate training examples for LLMs and other neural systems. Correctness is with respect to the theory and requires an external generation pipeline.

Status: Research pattern (DS08). Synthetic data generation pipelines are not shipped in the runtime.

1. The System 1 / System 2 Integration

In cognitive science terminology:

Aspect System 1 (LLMs) System 2 (AGISystem2)
Processing Fast, intuitive Slow, deliberate
Knowledge Implicit in weights Explicit in theories
Strength Language fluency Logical rigor
Weakness Hallucination, inconsistency Brittle to novel phrasing
Key Insight: Use AGISystem2's formal theories to generate training data that teaches LLMs to reason correctly within specific domains. The LLM learns to approximate the formal system's behavior.

2. The Synthetic Data Pipeline

Theory → Examples → Training Data

  1. Define Theory: Encode domain knowledge as formal theory
  2. Generate Queries: Systematically create all valid queries
  3. Execute Reasoning: Get answers with proof traces
  4. Verbalize: Convert to natural language Q&A pairs
  5. Train LLM: Fine-tune on generated examples

Example: Generating Animal Taxonomy Data

// Theory Definition Dog IS_A Animal Cat IS_A Animal Animal IS_A LivingThing Spot instanceOf Dog Whiskers instanceOf Cat // Generate all valid inferences → Dog IS_A LivingThing (transitive) → Spot IS_A Animal (instance + inheritance) → Spot IS_A LivingThing (transitive chain) → Whiskers IS_A Animal → Whiskers IS_A LivingThing // Generated Q&A pairs Q: Is Spot an animal? A: Yes, because Spot is a dog, and all dogs are animals. Q: Is Whiskers a living thing? A: Yes, because Whiskers is a cat, cats are animals, and animals are living things. Q: Is Spot a cat? A: No, there is no evidence that Spot is a cat. Spot is known to be a dog.

3. Types of Generated Data

3.1 Direct Fact Questions

Q: What type of thing is Spot? A: Spot is a dog. [Source: explicit fact in KB]

3.2 Inference Questions

Q: Is Spot a mammal? A: Yes, by the following reasoning: 1. Spot is a dog (known fact) 2. Dogs are mammals (taxonomic rule) Therefore, Spot is a mammal.

3.3 Negative Examples

Q: Is Spot a plant? A: No. Spot is a dog, and dogs are animals. Animals and plants are disjoint categories.

3.4 Explanation-with-Answer

Q: Explain why all dogs are living things. A: Dogs are living things because: - Dogs are animals (taxonomic classification) - Animals are living things (biological category) - By transitive inheritance, dogs are living things.

3.5 Counterfactual Reasoning

Q: If dogs were not animals, would Spot still be a living thing? A: We cannot conclude Spot is a living thing without the Dog→Animal→LivingThing inference chain. Additional facts about Spot would be needed.

4. Benefits of Theory-Generated Data

Theory-Relative Correctness

Generated examples are correct with respect to the formal theory (assuming the theory is valid). External generation pipelines should preserve this constraint.

Unlimited Quantity

Theories can generate combinatorially many examples. Add 10 new entities and get thousands of new training pairs.

Systematic Coverage

Generate examples that cover all inference paths, all edge cases, all negative examples. No gaps in training coverage.

Controllable Difficulty

Generate examples with 1-step reasoning, 2-step, n-step. Control the complexity of the training curriculum.

Domain Adaptation

Same pipeline works for any domain: medical, legal, scientific, financial. Just change the theory.

5. Architecture for Data Generation


+--------------------------------------------------+
|             Natural Language Generator            |
|    Proof traces → fluent Q&A pairs                |
+--------------------------------------------------+
|              Example Enumerator                   |
|    Systematically explore query space             |
+--------------------------------------------------+
|               Reasoning Engine                    |
|    prove() with full traces                       |
+--------------------------------------------------+
|                 Theory Store                      |
|    Domain knowledge in DSL format                 |
+--------------------------------------------------+
  

6. Verbalization Strategies

Converting formal proofs to natural language:

6.1 Template-Based

Rule: X IS_A Y Template: "All {X}s are {Y}s" / "{X} is a type of {Y}" Fact: entity instanceOf Class Template: "{entity} is a {Class}" Inference: X IS_A Y, Y IS_A Z ⟹ X IS_A Z Template: "{X} is a {Z} because {X} is a {Y}, and {Y}s are {Z}s"

6.2 LLM-Assisted Verbalization

Use an LLM to paraphrase template outputs for variety:

Template output: "Spot is an animal because Spot is a dog, and all dogs are animals." LLM paraphrases: - "Since Spot is a dog, and dogs belong to the animal kingdom, Spot is therefore an animal." - "We know Spot is an animal. Why? Because Spot is a dog, and by definition, dogs are animals." - "Spot, being a dog, falls under the category of animals."

7. Training Strategies

7.1 Curriculum Learning

Start with simple examples, progressively increase complexity:

  1. Level 1: Direct fact retrieval
  2. Level 2: Single-step inference
  3. Level 3: Multi-step chains
  4. Level 4: Counterfactuals and negation
  5. Level 5: Complex reasoning with multiple rules

7.2 Contrastive Examples

Generate pairs that highlight distinctions:

Positive: "Is Spot an animal?" → Yes (dog → animal) Negative: "Is Spot a plant?" → No (dog ⊥ plant) Similar: "Is Whiskers an animal?" → Yes (cat → animal) Contrast: "Is Spot a cat?" → No (Spot is dog, not cat)

7.3 Explanation Verification

Train LLM to generate explanations that can be verified:

Input: "Why is Spot a living thing?" Target output (verifiable): "Spot is a living thing because: 1. Spot is a dog [can verify: Dog(Spot) in KB] 2. Dogs are animals [can verify: Dog IS_A Animal] 3. Animals are living things [can verify: Animal IS_A LivingThing] Therefore Spot is a living thing [inference valid]"

8. Quality Assurance

Verification Loop: After training, test the LLM's outputs against the original formal system. Any divergence indicates either:

9. Research Directions

Open Questions:

10. Applications

Related Documentation