Theories as training data factories. AGISystem2's formal theories can be used to generate training examples for LLMs and other neural systems. Correctness is guaranteed only relative to the theory, and producing the examples requires an external generation pipeline.
Status: Research pattern (DS08). Synthetic data generation pipelines are not shipped in the runtime.
1. The System 1 / System 2 Integration
In cognitive science terminology:
| Aspect | System 1 (LLMs) | System 2 (AGISystem2) |
|--------|-----------------|------------------------|
| Processing | Fast, intuitive | Slow, deliberate |
| Knowledge | Implicit in weights | Explicit in theories |
| Strength | Language fluency | Logical rigor |
| Weakness | Hallucination, inconsistency | Brittle to novel phrasing |
Key Insight: Use AGISystem2's formal theories to generate training data that teaches LLMs to reason correctly within specific domains. The LLM learns to approximate the formal system's behavior.
2. The Synthetic Data Pipeline
Theory → Examples → Training Data
- Define Theory: Encode domain knowledge as a formal theory
- Generate Queries: Systematically create all valid queries
- Execute Reasoning: Get answers with proof traces
- Verbalize: Convert to natural language Q&A pairs
- Train LLM: Fine-tune on generated examples
Example: Generating Animal Taxonomy Data
// Theory Definition
Dog IS_A Animal
Cat IS_A Animal
Animal IS_A LivingThing
Spot instanceOf Dog
Whiskers instanceOf Cat
// Generate all valid inferences
→ Dog IS_A LivingThing (transitive)
→ Spot instanceOf Animal (instance + inheritance)
→ Spot instanceOf LivingThing (transitive chain)
→ Whiskers instanceOf Animal
→ Whiskers instanceOf LivingThing
// Generated Q&A pairs
Q: Is Spot an animal?
A: Yes, because Spot is a dog, and all dogs are animals.
Q: Is Whiskers a living thing?
A: Yes, because Whiskers is a cat, cats are animals,
and animals are living things.
Q: Is Spot a cat?
A: No, there is no evidence that Spot is a cat.
Spot is known to be a dog.
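This worked example maps directly onto code. Below is a minimal Python sketch of steps 1 through 4; the triple encoding, the `classes_of` helper, and the closed-world "No" answer are assumptions made for illustration, since no generation pipeline ships with the runtime (see Status above).

```python
# Minimal sketch of the theory -> training-data pipeline.
# The triple format and helpers are illustrative, not a shipped API.
from itertools import product

# Step 1: define the theory as (subject, relation, object) triples.
theory = [
    ("Dog", "IS_A", "Animal"),
    ("Cat", "IS_A", "Animal"),
    ("Animal", "IS_A", "LivingThing"),
    ("Spot", "instanceOf", "Dog"),
    ("Whiskers", "instanceOf", "Cat"),
]

def classes_of(entity, facts):
    """All classes reachable from an entity via instanceOf + IS_A chains."""
    closure = {o for s, r, o in facts if s == entity and r in ("IS_A", "instanceOf")}
    frontier = list(closure)
    while frontier:
        c = frontier.pop()
        for s, r, o in facts:
            if s == c and r == "IS_A" and o not in closure:
                closure.add(o)
                frontier.append(o)
    return closure

# Steps 2-3: enumerate (entity, class) queries and answer each one.
entities = {s for s, r, _ in theory if r == "instanceOf"}
classes = {x for s, r, o in theory if r == "IS_A" for x in (s, o)}

# Step 4: verbalize each answered query into a Q&A pair
# (closed-world assumption: unprovable means "No").
pairs = []
for e, c in product(entities, classes):
    yes = c in classes_of(e, theory)
    answer = "Yes." if yes else f"No, there is no evidence that {e} is a {c}."
    pairs.append((f"Is {e} a {c}?", answer))

# Step 5 (fine-tuning on `pairs`) is left to a standard training stack.
for q, a in pairs[:3]:
    print(q, a)
```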
3. Types of Generated Data
3.1 Direct Fact Questions
Q: What type of thing is Spot?
A: Spot is a dog.
[Source: explicit fact in KB]
3.2 Inference Questions
Q: Is Spot an animal?
A: Yes, by the following reasoning:
1. Spot is a dog (known fact)
2. Dogs are animals (taxonomic rule)
Therefore, Spot is an animal.
3.3 Negative Examples
Q: Is Spot a plant?
A: No. Spot is a dog, and dogs are animals.
Animals and plants are disjoint categories.
3.4 Explanation-with-Answer
Q: Explain why all dogs are living things.
A: Dogs are living things because:
- Dogs are animals (taxonomic classification)
- Animals are living things (biological category)
- By transitive inheritance, dogs are living things.
3.5 Counterfactual Reasoning
Q: If dogs were not animals, would Spot still be a living thing?
A: We cannot conclude Spot is a living thing without
the Dog→Animal→LivingThing inference chain.
Additional facts about Spot would be needed.
4. Benefits of Theory-Generated Data
Theory-Relative Correctness
Generated examples are correct with respect to the formal theory (assuming the theory itself is valid). An external generation pipeline must preserve this guarantee when verbalizing proofs.
Unlimited Quantity
Theories generate combinatorially many examples: adding ten new entities can yield thousands of new training pairs through the inference chains they open up.
Systematic Coverage
Generate examples covering every inference path, edge case, and negative example up to a chosen depth bound, leaving no systematic gaps in training coverage.
Controllable Difficulty
Generate examples requiring 1-step, 2-step, or n-step reasoning, and control the complexity of the training curriculum.
Domain Adaptation
Same pipeline works for any domain: medical, legal, scientific, financial. Just change the theory.
5. Architecture for Data Generation
+--------------------------------------------------+
|            Natural Language Generator            |
|         Proof traces → fluent Q&A pairs          |
+--------------------------------------------------+
|                Example Enumerator                |
|        Systematically explore query space        |
+--------------------------------------------------+
|                 Reasoning Engine                 |
|             prove() with full traces             |
+--------------------------------------------------+
|                   Theory Store                   |
|          Domain knowledge in DSL format          |
+--------------------------------------------------+
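If this stack were realized in code, each layer could be a narrow interface so implementations can be swapped independently. The protocol names and signatures below are hypothetical; only the layering mirrors the diagram.

```python
# Hypothetical interfaces mirroring the four layers above;
# names and signatures are illustrative, not a shipped API.
from typing import Protocol, Iterable

class TheoryStore(Protocol):
    def facts(self) -> Iterable[tuple[str, str, str]]:
        """Domain knowledge as DSL triples."""

class ReasoningEngine(Protocol):
    def prove(self, query: tuple[str, str, str]) -> list[str] | None:
        """Return a proof trace for the query, or None if unprovable."""

class ExampleEnumerator(Protocol):
    def queries(self, store: TheoryStore) -> Iterable[tuple[str, str, str]]:
        """Systematically walk the query space of a theory."""

class NaturalLanguageGenerator(Protocol):
    def verbalize(self, query, trace) -> tuple[str, str]:
        """Turn a query and its proof trace into a Q&A pair."""
```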
6. Verbalization Strategies
Converting formal proofs to natural language:
6.1 Template-Based
Rule: X IS_A Y
Template: "All {X}s are {Y}s" / "{X} is a type of {Y}"
Fact: entity instanceOf Class
Template: "{entity} is a {Class}"
Inference: X IS_A Y, Y IS_A Z ⟹ X IS_A Z
Template: "{X} is a {Z} because {X} is a {Y}, and {Y}s are {Z}s"
6.2 LLM-Assisted Verbalization
Use an LLM to paraphrase template outputs for variety:
Template output: "Spot is an animal because Spot is a dog,
and all dogs are animals."
LLM paraphrases:
- "Since Spot is a dog, and dogs belong to the animal kingdom,
Spot is therefore an animal."
- "We know Spot is an animal. Why? Because Spot is a dog,
and by definition, dogs are animals."
- "Spot, being a dog, falls under the category of animals."
7. Training Strategies
7.1 Curriculum Learning
Start with simple examples, then progressively increase complexity:
- Level 1: Direct fact retrieval
- Level 2: Single-step inference
- Level 3: Multi-step chains
- Level 4: Counterfactuals and negation
- Level 5: Complex reasoning with multiple rules
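Because every generated example carries a proof trace, its depth can serve as the difficulty label, so curriculum buckets fall out mechanically. A sketch, with the depth-annotated example tuples as an assumed input format:

```python
# Bucket generated examples by reasoning depth (proof-trace length):
# depth 0 = fact retrieval, 1 = single-step inference, and so on.
from collections import defaultdict

def build_curriculum(examples):
    """examples: list of (question, answer, depth) tuples."""
    levels = defaultdict(list)
    for question, answer, depth in examples:
        levels[depth].append((question, answer))
    return dict(sorted(levels.items()))  # train shallow levels first

curriculum = build_curriculum([
    ("What type of thing is Spot?", "Spot is a dog.", 0),
    ("Is Spot an animal?", "Yes: Spot is a dog; dogs are animals.", 1),
    ("Is Spot a living thing?", "Yes: dog -> animal -> living thing.", 2),
])
```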
7.2 Contrastive Examples
Generate pairs that highlight distinctions:
Positive: "Is Spot an animal?" → Yes (dog → animal)
Negative: "Is Spot a plant?" → No (dog ⊥ plant)
Similar: "Is Whiskers an animal?" → Yes (cat → animal)
Contrast: "Is Spot a cat?" → No (Spot is dog, not cat)
7.3 Explanation Verification
Train the LLM to generate explanations that can be mechanically verified:
Input: "Why is Spot a living thing?"
Target output (verifiable):
"Spot is a living thing because:
1. Spot is a dog [can verify: Spot instanceOf Dog in KB]
2. Dogs are animals [can verify: Dog IS_A Animal]
3. Animals are living things [can verify: Animal IS_A LivingThing]
Therefore Spot is a living thing [inference valid]"
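Step-level verification then reduces to membership checks against the KB. A sketch, assuming the explanation has already been parsed into triples:

```python
# Check each cited premise of an explanation against the KB.
KB = {("Spot", "instanceOf", "Dog"),
      ("Dog", "IS_A", "Animal"),
      ("Animal", "IS_A", "LivingThing")}

def verify_steps(steps):
    """Every premise cited by the explanation must be an explicit KB fact."""
    return all(step in KB for step in steps)

assert verify_steps([("Spot", "instanceOf", "Dog"),
                     ("Dog", "IS_A", "Animal"),
                     ("Animal", "IS_A", "LivingThing")])
```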
8. Quality Assurance
Verification Loop: After training, test the LLM's outputs against the original formal system. Any divergence indicates one of:
- Insufficient training (add more examples)
- Ambiguous verbalization (improve templates)
- Model capacity limits (use larger model or narrower domain)
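A sketch of that loop, with `formal_answer` and `llm_answer` as hypothetical hooks into the two systems:

```python
# Post-training check: compare LLM answers to the formal system on
# held-out queries; divergences are triaged via the three causes above.
def verification_loop(queries, formal_answer, llm_answer):
    divergences = []
    for q in queries:
        expected, actual = formal_answer(q), llm_answer(q)
        if expected != actual:
            divergences.append((q, expected, actual))
    rate = len(divergences) / max(len(queries), 1)
    return rate, divergences
```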
9. Research Directions
Open Questions:
- Transfer efficiency: How many examples are needed to teach reliable reasoning?
- Generalization: Does training on theory A help with similar theory B?
- Faithfulness: Can we guarantee that the LLM follows formal reasoning patterns?
- Hybrid inference: When should the LLM defer to the formal system?
- Active learning: Which examples are most informative?
10. Applications
- Domain-specific assistants: Legal, medical, financial AI trained on domain theories
- Educational AI: Tutors that explain reasoning correctly
- Compliance chatbots: Answer regulatory questions with verifiable reasoning
- Scientific assistants: Help researchers reason about theories
Related Documentation