Synthetic Data Generation - Trustworthy AI

Theories as training data factories. AGISystem2's formal theories can be used to generate training examples for LLMs and other neural systems. Correctness is with respect to the theory and requires an external generation pipeline.

Status: Research pattern (DS08). Synthetic data generation pipelines are not shipped in the runtime.

1. The System 1 / System 2 Integration

In cognitive science terminology:

Aspect	System 1 (LLMs)	System 2 (AGISystem2)
Processing	Fast, intuitive	Slow, deliberate
Knowledge	Implicit in weights	Explicit in theories
Strength	Language fluency	Logical rigor
Weakness	Hallucination, inconsistency	Brittle to novel phrasing

Key Insight: Use AGISystem2's formal theories to generate training data that teaches LLMs to reason correctly within specific domains. The LLM learns to approximate the formal system's behavior.

2. The Synthetic Data Pipeline

Theory → Examples → Training Data

Define Theory: Encode domain knowledge as formal theory
Generate Queries: Systematically create all valid queries
Execute Reasoning: Get answers with proof traces
Verbalize: Convert to natural language Q&A pairs
Train LLM: Fine-tune on generated examples

Example: Generating Animal Taxonomy Data

// Theory Definition
Dog IS_A Animal
Cat IS_A Animal
Animal IS_A LivingThing
Spot instanceOf Dog
Whiskers instanceOf Cat

// Generate all valid inferences
→ Dog IS_A LivingThing (transitive)
→ Spot IS_A Animal (instance + inheritance)
→ Spot IS_A LivingThing (transitive chain)
→ Whiskers IS_A Animal
→ Whiskers IS_A LivingThing

// Generated Q&A pairs
Q: Is Spot an animal?
A: Yes, because Spot is a dog, and all dogs are animals.

Q: Is Whiskers a living thing?
A: Yes, because Whiskers is a cat, cats are animals,
   and animals are living things.

Q: Is Spot a cat?
A: No, there is no evidence that Spot is a cat.
   Spot is known to be a dog.
  

3. Types of Generated Data

3.1 Direct Fact Questions

Q: What type of thing is Spot?
A: Spot is a dog.
[Source: explicit fact in KB]
  

3.2 Inference Questions

Q: Is Spot a mammal?
A: Yes, by the following reasoning:
   1. Spot is a dog (known fact)
   2. Dogs are mammals (taxonomic rule)
   Therefore, Spot is a mammal.
  

3.3 Negative Examples

Q: Is Spot a plant?
A: No. Spot is a dog, and dogs are animals.
   Animals and plants are disjoint categories.
  

3.4 Explanation-with-Answer

Q: Explain why all dogs are living things.
A: Dogs are living things because:
   - Dogs are animals (taxonomic classification)
   - Animals are living things (biological category)
   - By transitive inheritance, dogs are living things.
  

3.5 Counterfactual Reasoning

Q: If dogs were not animals, would Spot still be a living thing?
A: We cannot conclude Spot is a living thing without
   the Dog→Animal→LivingThing inference chain.
   Additional facts about Spot would be needed.
  

4. Benefits of Theory-Generated Data

Theory-Relative Correctness

Generated examples are correct with respect to the formal theory (assuming the theory is valid). External generation pipelines should preserve this constraint.

Unlimited Quantity

Theories can generate combinatorially many examples. Add 10 new entities and get thousands of new training pairs.

Systematic Coverage

Generate examples that cover all inference paths, all edge cases, all negative examples. No gaps in training coverage.

Controllable Difficulty

Generate examples with 1-step reasoning, 2-step, n-step. Control the complexity of the training curriculum.

Domain Adaptation

Same pipeline works for any domain: medical, legal, scientific, financial. Just change the theory.

5. Architecture for Data Generation


+--------------------------------------------------+
|             Natural Language Generator            |
|    Proof traces → fluent Q&A pairs                |
+--------------------------------------------------+
|              Example Enumerator                   |
|    Systematically explore query space             |
+--------------------------------------------------+
|               Reasoning Engine                    |
|    prove() with full traces                       |
+--------------------------------------------------+
|                 Theory Store                      |
|    Domain knowledge in DSL format                 |
+--------------------------------------------------+

6. Verbalization Strategies

Converting formal proofs to natural language:

6.1 Template-Based

Rule: X IS_A Y
Template: "All {X}s are {Y}s" / "{X} is a type of {Y}"

Fact: entity instanceOf Class
Template: "{entity} is a {Class}"

Inference: X IS_A Y, Y IS_A Z ⟹ X IS_A Z
Template: "{X} is a {Z} because {X} is a {Y}, and {Y}s are {Z}s"
  

6.2 LLM-Assisted Verbalization

Use an LLM to paraphrase template outputs for variety:

Template output: "Spot is an animal because Spot is a dog,
                  and all dogs are animals."

LLM paraphrases:
- "Since Spot is a dog, and dogs belong to the animal kingdom,
   Spot is therefore an animal."
- "We know Spot is an animal. Why? Because Spot is a dog,
   and by definition, dogs are animals."
- "Spot, being a dog, falls under the category of animals."
  

7. Training Strategies

7.1 Curriculum Learning

Start with simple examples, progressively increase complexity:

Level 1: Direct fact retrieval
Level 2: Single-step inference
Level 3: Multi-step chains
Level 4: Counterfactuals and negation
Level 5: Complex reasoning with multiple rules

7.2 Contrastive Examples

Generate pairs that highlight distinctions:

Positive: "Is Spot an animal?" → Yes (dog → animal)
Negative: "Is Spot a plant?" → No (dog ⊥ plant)

Similar: "Is Whiskers an animal?" → Yes (cat → animal)
Contrast: "Is Spot a cat?" → No (Spot is dog, not cat)
  

7.3 Explanation Verification

Train LLM to generate explanations that can be verified:

Input: "Why is Spot a living thing?"

Target output (verifiable):
"Spot is a living thing because:
 1. Spot is a dog [can verify: Dog(Spot) in KB]
 2. Dogs are animals [can verify: Dog IS_A Animal]
 3. Animals are living things [can verify: Animal IS_A LivingThing]
 Therefore Spot is a living thing [inference valid]"
  

8. Quality Assurance

Verification Loop: After training, test the LLM's outputs against the original formal system. Any divergence indicates either:

Insufficient training (add more examples)
Ambiguous verbalization (improve templates)
Model capacity limits (use larger model or narrower domain)

9. Research Directions

Open Questions:

Transfer efficiency: How many examples needed to teach reliable reasoning?
Generalization: Does training on theory A help with similar theory B?
Faithfulness: Can we guarantee LLM follows formal reasoning patterns?
Hybrid inference: When should LLM defer to formal system?
Active learning: Which examples are most informative?

10. Applications

Domain-specific assistants: Legal, medical, financial AI trained on domain theories
Educational AI: Tutors that explain reasoning correctly
Compliance chatbots: Answer regulatory questions with verifiable reasoning
Scientific assistants: Help researchers reason about theories

Synthetic Data Generation from Theories Research