Research note: Practical System‑2 infrastructure for AI‑accelerated science

Toward a Practical “System 2” for AI‑Assisted Research

Principles, Failure Modes, and Evaluation Signals for Rigorous Review

This short paper condenses observations from building AGISystem2 with an AI‑assisted, specification‑driven methodology. It treats “System 1 / System 2” as a workflow metaphor: fast, fluent generation (System 1) must be kept separate from deterministic validation and auditable traces (System 2).

Abstract

LLM‑based agents have shifted research practice from “assisted writing” toward workflow orchestration: drafting specifications, generating and refactoring multi‑file codebases, writing tests, and iterating via tool feedback. This creates a synthetic “System 1” that is fast and locally fluent—but fragile under long‑horizon constraint satisfaction, global coherence, and citation integrity.

Drawing on the experience of building AGISystem2 with an AI‑assisted, specification‑driven methodology, we summarize practical principles for rigorous AI‑assisted research, map recurring failure modes to mitigations, and propose evaluation signals that track not only throughput but also coherence and epistemic integrity.

Core idea: use AI for high‑throughput generation, but gate acceptance through independent validation built on explicit semantics, deterministic checks, and auditable traces. Prompting is never a substitute for review.

Principles for AI‑assisted research

AI assistance becomes reliably useful when it is placed inside governance that externalizes constraints and separates generation from acceptance.

Table 1. Observed principles (condensed) and what they operationally mean
Principle | Operational meaning | Why it matters for rigor
P1. Specifications are governance | Specs define stable intent, act as external memory, and bound the agent’s solution space; micro‑specs prevent drift at file/module granularity. | Without governance, synthetic System 1 reinterprets tasks over iterations, producing un‑auditable divergence.
P2. Separate generation from validation | LLMs propose; deterministic checks or independent review gates accept/reject; narrative rationales are never sufficient (a minimal gate sketch follows this table). | Reduces confirmation bias and “fluent but wrong” acceptance; enforces objective criteria.
P3. Traceability is first‑class | Requirements map to implementations and tests; changes are justified against explicit constraints; evidence is logged. | Enables audit, regression control, and reproducibility under rapid iteration.
P4. Epistemic redundancy | Different models/agents assume distinct roles; outputs are reviewed and challenged independently; disagreement is surfaced early. | Reduces correlated error and premature consensus; approximates adversarial review.
P5. Representational commitments | Use DSLs, typed interfaces, theory layers, structured IRs, and invariants so validation has defined semantics. | Eliminates ambiguity that would otherwise hide hallucinations and incoherence.
P6. Global coherence | Evaluate long‑horizon consistency, invariants, regression suites, and cross‑artifact coherence (spec↔code↔docs). | LLMs optimize local plausibility; research fails at interfaces and over time.
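
To make P2 and P3 concrete, the sketch below shows one way a generation/acceptance gate with an append‑only audit trace could look. It is a minimal illustration, not the AGISystem2 implementation: the Artifact and Check structures, the check functions, the requirement identifier, and the audit.jsonl trace format are all hypothetical placeholders.

```python
# Minimal sketch of a generation/acceptance gate (P2) with an audit trace (P3).
# All names (Artifact, Check, REQ-042, audit.jsonl) are hypothetical placeholders.
import json
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Artifact:
    requirement_id: str   # requirement this artifact claims to satisfy
    content: str          # LLM-proposed code, spec text, or doc fragment

@dataclass
class Check:
    name: str
    run: Callable[[Artifact], bool]   # deterministic: same artifact -> same verdict

def compiles(artifact: Artifact) -> bool:
    """Deterministic check: does the proposed code parse as Python?"""
    try:
        compile(artifact.content, "<proposal>", "exec")
        return True
    except SyntaxError:
        return False

def gate(artifact: Artifact, checks: List[Check], log_path: str = "audit.jsonl") -> bool:
    """Accept the artifact only if every deterministic check passes; log the evidence."""
    results = {c.name: bool(c.run(artifact)) for c in checks}
    accepted = all(results.values())
    record = {
        "ts": time.time(),
        "requirement": artifact.requirement_id,
        "checks": results,
        "accepted": accepted,
    }
    with open(log_path, "a") as fh:   # append-only trace for later audit
        fh.write(json.dumps(record) + "\n")
    return accepted

if __name__ == "__main__":
    proposal = Artifact(requirement_id="REQ-042",
                        content="def add(a, b):\n    return a + b\n")
    checks = [
        Check("non_empty", lambda a: bool(a.content.strip())),
        Check("compiles", compiles),
    ]
    print("accepted" if gate(proposal, checks) else "rejected")
```

The point of the design is that acceptance depends only on deterministic verdicts recorded in the trace; the model’s narrative justification never enters the decision.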

These principles jointly reframe the researcher’s role. In AI‑accelerated settings, comparative advantage shifts from producing raw text and code to designing constraint systems, specifying goals precisely, and auditing outputs.

Failure modes and mitigations

The workflow reveals characteristic failure modes that are likely to generalize across AI‑assisted research projects.

Table 2. Failure modes (CR1–CR5) and mitigations
Failure mode | What typically goes wrong | Mitigation that preserves acceleration
CR1. Engineering validation substitutes for explanation | Passing test suites become “evidence,” while assumptions and explanatory structure remain implicit. | Make assumptions explicit in specs; require negative tests not derived from the implementation; add “why this should work” notes and manually verify key claims.
CR2. Superficial understanding of imported theory | AI accelerates literature reconnaissance while hiding subtle constraints and caveats. | Add theory checkpoints (toy formalizations, sanity derivations); require primary‑source verification for load‑bearing claims.
CR3. Premature design lock‑in | Early functional prototypes create inertia and close the design space too soon. | Keep modularity for cheap variants; record alternatives in decision logs; enforce an exploration budget before hardening.
CR4. Hallucinated references and false authority | Plausible citations, misattribution, or invented bibliographies contaminate writing. | Treat citations as untrusted until verified; separate “candidate references” from the curated bibliography; adopt a citation verification protocol (a minimal sketch follows this table).
CR5. Hidden technical debt and security risks | Polished code includes brittle edges, weak error handling, or insecure patterns; tools amplify risk. | Run static analysis, dependency scanning, and secret scanning; sandbox tool execution; require human approval for sensitive operations.
Pattern: accelerated synthesis reduces incidental friction—so epistemic control must be reintroduced as explicit process and tooling.
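
One way to instantiate the CR4 mitigation is to resolve every candidate DOI against a public metadata registry before the entry may be promoted into the curated bibliography. The sketch below queries the Crossref REST API; the CandidateRef structure, the title‑similarity threshold, and the promotion policy are illustrative assumptions rather than a prescribed protocol.

```python
# Minimal sketch: verify candidate references before they enter the bibliography (CR4).
# CandidateRef and the similarity threshold are hypothetical illustrations.
from dataclasses import dataclass
from difflib import SequenceMatcher

import requests  # assumed available (pip install requests)

@dataclass
class CandidateRef:
    doi: str
    claimed_title: str

def title_similarity(a: str, b: str) -> float:
    """Crude normalized string similarity between claimed and registered titles."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def verify(ref: CandidateRef, threshold: float = 0.9) -> bool:
    """Return True only if the DOI resolves and the registered title matches the claim."""
    resp = requests.get(f"https://api.crossref.org/works/{ref.doi}", timeout=10)
    if resp.status_code != 200:
        return False   # DOI does not resolve: keep it out of the curated bibliography
    titles = resp.json().get("message", {}).get("title", [])
    return any(title_similarity(t, ref.claimed_title) >= threshold for t in titles)
```

References that fail the check are not discarded but quarantined for manual verification, preserving the separation between candidate references and the curated bibliography.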

Evaluation signals: what to measure

AI‑assisted research should be evaluated not only by throughput but by coherence and epistemic integrity. The categories below provide a minimal evaluation vocabulary aligned with the principles and failure modes above.

Table 3. Evaluation signals for AI‑assisted research (a minimal reporting sketch follows the table)
Category | What to measure | Why it is load‑bearing
Productivity | Time to working prototype; iteration count; human hours per module. | Quantifies acceleration while enabling honest comparisons to baseline workflows.
Quality & correctness | Regression rate; defect density; coverage/mutation scores where applicable; static analysis findings. | Prevents “fast but fragile” progress from being mistaken for durable correctness.
Coherence & consistency | Invariant violations; long‑horizon scenario consistency; cross‑artifact consistency (spec↔code↔docs). | Targets the known weakness of synthetic System 1: global coherence.
Epistemic integrity | Citation error rate; proportion of primary‑source verified claims; audits of key assertions. | Protects scientific legitimacy in the presence of hallucinated authority.
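
These signals only constrain a project if they are computed routinely rather than invoked rhetorically. The sketch below shows a minimal way to roll a few measurable signals from Table 3 into a single machine‑readable report; the field names and the example numbers are placeholders, and a real project would derive the values from CI results, the traceability matrix, and the citation audit log.

```python
# Minimal sketch of a per-milestone evaluation report over the Table 3 signals.
# Field names and the example values are placeholders, not measured results.
import json
from dataclasses import dataclass, asdict

@dataclass
class EvaluationReport:
    # Productivity
    hours_per_module: float
    iterations_to_prototype: int
    # Quality & correctness
    regression_rate: float            # regressions per 100 merged changes
    static_findings: int
    # Coherence & consistency
    invariant_violations: int
    spec_code_doc_mismatches: int
    # Epistemic integrity
    citation_error_rate: float        # unverifiable citations / total citations
    primary_source_verified: float    # fraction of load-bearing claims checked

def summarize(report: EvaluationReport) -> str:
    """Serialize the report so it can be logged alongside the audit trace."""
    return json.dumps(asdict(report), indent=2)

if __name__ == "__main__":
    # Illustrative numbers only.
    print(summarize(EvaluationReport(
        hours_per_module=6.5, iterations_to_prototype=12,
        regression_rate=2.0, static_findings=3,
        invariant_violations=0, spec_code_doc_mismatches=1,
        citation_error_rate=0.04, primary_source_verified=0.8,
    )))
```

Emitting such a report at each milestone makes regressions in coherence or epistemic integrity as visible as regressions in a test suite.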

Conclusion

The central observation is that AI assistance does not merely speed up research; it changes the dominant bottleneck. When synthesis is cheap and fluent, epistemic control becomes the scarce resource. The System‑1/System‑2 framing captures this shift: synthetic System 1 can propose artifacts at scale, but science remains dependent on System‑2 functions—explicit assumptions, adversarial scrutiny, and reproducible validation.

The AGISystem2 experiment indicates that System‑2‑like layers can be prototyped rapidly with current tools, but turning them into reviewer‑grade validators is still hard: it requires explicit assumptions, robust handling of adversarial and out‑of‑distribution cases, and reproducible traces that survive scrutiny.

The broader requirement is a discipline‑agnostic System 2 that can perform deterministic review and falsification procedures across sciences, integrate formal checks and validated instruments, and produce traces that support reproduction and audit.

In the medium term, this points toward a trajectory sometimes summarized as “turning science into code.” The phrase should be interpreted narrowly: not full formalization of all knowledge, but the construction of executable substrates for the parts of scientific reasoning that must remain stable under acceleration. If AI is to accelerate science without lowering standards, investment in reviewer‑grade System‑2 infrastructure is not optional; it is the enabling condition that prevents fluent synthesis from outpacing rigor.