
Toward the Automation of Scientific Research

Why the Time Has Come

The strongest version of the thesis is not that science will suddenly become autonomous. It is that a growing share of scientific work is becoming structured enough to be partially automated. That claim is plausible because several once-separate components are now maturing in parallel: multi-agent systems for hypothesis generation, autonomous experimental platforms, AI systems that remove concrete scientific bottlenecks, and governance frameworks for trustworthy deployment. Google’s 2025 AI co-scientist was explicitly introduced as a multi-agent system intended to help scientists generate hypotheses and research proposals [Gottweis-Natarajan-2025]. Self-driving laboratories are no longer speculative abstractions: recent overviews describe them as systems that automate both experimental tasks and the design and selection of experiments in chemistry and materials science [Canty-2025]. AlphaFold remains the clearest proof that AI can alter the pace of a real scientific subfield by solving a major bottleneck in protein structure prediction [Jumper-2021].

Taken together, these developments justify a shift in focus. The central question is no longer only whether AI can assist research at the margins through drafting, coding, or search. It is whether parts of the structured work through which science produces knowledge can be made explicit enough to be executed, checked, and improved by machine-mediated systems [Gottweis-Natarajan-2025] [Canty-2025] [Jumper-2021].

Research as a Structured Process

Scientific work is often described in terms of intuition, creativity, and discovery. That is true, but it is incomplete. Science also consists of recurrent operations: framing questions, formulating hypotheses, identifying alternatives, designing tests, interpreting evidence, and restricting conclusions to what the evidence actually warrants. These are not incidental features. They are part of what makes scientific practice cumulative.

This matters because such recurrent operations are precisely the parts most likely to become partially automatable once they are represented more explicitly. At present, much of the structure of research remains fragmented across papers, scripts, notebooks, spreadsheets, conversations, tacit laboratory routines, and human memory. In such a setting, AI can improve surface productivity, but it cannot reliably preserve the dependency structure of a research program. If research questions, assumptions, datasets, protocols, claims, and limitations are instead represented in more explicit and linkable form, machine systems can begin to operate on research as a structured process rather than as an unbounded stream of prose. The broader logic is aligned with the NIST AI Risk Management Framework, which emphasizes that trustworthy AI depends on governance, measurement, and process, not only on model capability [NIST-2023].
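To make the idea concrete, here is a minimal sketch of how such linkable research artefacts might be represented; the schema and names are illustrative assumptions, not an existing standard.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """A node in a research graph: a question, dataset, protocol, etc."""
    id: str
    kind: str                                  # e.g. "question", "dataset", "protocol"
    summary: str
    depends_on: list[str] = field(default_factory=list)

@dataclass
class Claim(Artifact):
    evidence: list[str] = field(default_factory=list)     # ids of supporting artefacts
    limitations: list[str] = field(default_factory=list)

def unsupported_claims(graph: dict[str, Artifact]) -> list[str]:
    """Ids of claims citing evidence that is absent from the graph."""
    return [a.id for a in graph.values()
            if isinstance(a, Claim) and any(e not in graph for e in a.evidence)]
```

Even a representation this crude lets a machine ask questions that prose cannot answer reliably, such as which claims cite evidence that no longer exists anywhere in the project.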

The First Wave of Automation

The first wave of automation is unlikely to be full scientific autonomy. It is more likely to involve the transfer of semi-formal research labor from loosely coordinated human workflows into more disciplined computational systems.

This includes literature mapping, extraction of claims from papers, comparison of experimental settings, generation and repair of analysis code, experiment bookkeeping, figure regeneration, statistical sanity checks, and consistency checks between text and results. These tasks may appear secondary, but they consume a substantial fraction of research effort in many fields.
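As a minimal illustration of one such check, the sketch below recomputes a reported p-value from a reported t statistic and degrees of freedom, in the spirit of existing tools such as statcheck; the function name and tolerance are illustrative choices, not a published method.

```python
from scipy import stats

def check_t_report(t_value: float, df: int, reported_p: float,
                   tol: float = 0.005) -> bool:
    """Recompute the two-sided p-value implied by a reported t statistic
    and flag the report if it disagrees with the stated p-value."""
    recomputed = 2 * stats.t.sf(abs(t_value), df)
    return abs(recomputed - reported_p) <= tol

# "t(28) = 2.10, p = .04": the recomputed p is about .0447, so this passes.
print(check_t_report(t_value=2.10, df=28, reported_p=0.04))
```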

The reason this claim is credible is that early instances already exist. Google’s AI co-scientist is positioned not as a universal scientist but as a multi-agent collaborator for generating and refining hypotheses and proposals [Gottweis-Natarajan-2025]. In materials science, A-Lab linked literature, computational screening, machine learning, active learning, and robotics into an autonomous loop for inorganic synthesis [Szymanski-2023]. In structural biology, AlphaFold did not automate biology as a whole, but it did automate a scientifically central subtask at unprecedented scale and accuracy [Jumper-2021]. The common lesson is that automation begins not with the wholesale replacement of scientists, but with the systematic capture of high-value scientific subtasks.

A Society of Scientific Agents

A further implication is that research is unlikely to be automated well by a single monolithic agent. The reason is not merely engineering convenience. Scientific work is epistemically heterogeneous. Exploration, criticism, validation, execution, synthesis, and experimental planning are different activities, and there is little reason to expect one computational regime to be equally good at all of them.

A more plausible architecture is therefore a society of specialized scientific agents. Some would map the literature and surface unresolved contradictions. Some would generate hypotheses. Some would propose controls and experiments. Some would run code, simulations, or instruments. Some would search for confounders, leakage, weak baselines, or overextended claims. Some would be oriented primarily toward validation rather than novelty. Google’s own description of the AI co-scientist emphasizes a coalition of specialized agents that iteratively generate, evaluate, and refine hypotheses [Gottweis-Natarajan-2025]. That design choice is important because it reflects something true about science itself: progress depends not only on producing candidate ideas, but also on filtering, criticizing, and revising them.
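A hypothetical sketch of that division of labor, with each role reduced to a pluggable function, might look as follows; nothing here reflects the internals of any existing system.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Hypothesis:
    text: str
    score: float = 0.0
    critiques: list[str] = field(default_factory=list)

# Each role is a plain function here; in a real system each could be
# a separate model, tool, or laboratory service.
Generator = Callable[[str], list[Hypothesis]]
Critic = Callable[[Hypothesis], list[str]]
Scorer = Callable[[Hypothesis], float]

def research_round(question: str, generate: Generator, critics: list[Critic],
                   score: Scorer, keep: int = 3) -> list[Hypothesis]:
    """One generate -> criticize -> rank iteration over candidate hypotheses."""
    candidates = generate(question)
    for h in candidates:
        for critic in critics:
            h.critiques.extend(critic(h))   # confounders, weak baselines, ...
        h.score = score(h)                  # validation-oriented scoring
    return sorted(candidates, key=lambda h: h.score, reverse=True)[:keep]
```

The point of the separation is that generation and criticism can be improved, audited, and replaced independently.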

Simulate Before You Build

One of the strongest drivers of research automation is the growing ability to explore candidate explanations, structures, or designs in silico before committing scarce physical resources. In many domains, simulation is the first practical bridge between reasoning and intervention.

Here again the evidence is concrete. AlphaFold moved part of structural biology away from slow experimental bottlenecks by making high-accuracy structure prediction computationally accessible [Jumper-2021]. A-Lab was motivated precisely by the need to connect computational selection with experimental realization in materials discovery [Szymanski-2023]. The self-driving lab literature now explicitly treats the automated design and selection of experiments as central to scientific acceleration, not as an optional add-on [Canty-2025].

The broader point is that science becomes more automatable when candidate worlds can be searched, ranked, and stress-tested before expensive laboratory action. Simulation is therefore not only a performance optimization. It is one of the main ways in which scientific inquiry becomes programmable.
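A toy sketch of that pattern: score every candidate with a cheap simulator, then commit a small laboratory budget only to the top performers. The objective function here is a stand-in, not a real simulator.

```python
import heapq
import random

def simulated_score(candidate: float) -> float:
    """Stand-in for an in-silico evaluator (docking, DFT, structure prediction)."""
    return -(candidate - 0.7) ** 2 + random.gauss(0.0, 0.01)

def select_for_lab(pool: list[float], budget: int) -> list[float]:
    """Screen every candidate computationally; send only the best to the lab."""
    return heapq.nlargest(budget, pool, key=simulated_score)

pool = [random.random() for _ in range(10_000)]   # cheap to enumerate in silico
shortlist = select_for_lab(pool, budget=8)        # expensive to test physically
```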

Why LLMs Are Not Enough

Large language models are a major advance, but they are not, by themselves, a sufficient substrate for reliable scientific automation. Their strength lies in flexible inference over language and semi-structured context. Their weakness is that they do not natively provide durable provenance, explicit constraint tracking, methodological discipline, or stable epistemic memory.

That is why a future based only on LLMs risks automating the appearance of science rather than its structure. Such systems can produce plausible summaries, explanations, and manuscripts while remaining fragile with respect to evidence tracing, hidden assumptions, and reproducible validation. This is also why the move toward trustworthy deployment cannot stop at model quality. The NIST framework treats trustworthiness as a property emerging from measurement, governance, documentation, and risk management [NIST-2023]. In scientific settings, that implies a broader architecture in which LLMs help formulate, translate, and synthesize, while statistical procedures, workflow runtimes, simulators, and more explicit validators handle a larger share of checking and constraint enforcement.
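One way to express that division in code is a propose-validate loop in which an LLM drafts and deterministic validators decide. The sketch below is an assumed pattern, with the model call stubbed out rather than tied to any real API.

```python
from typing import Callable, Optional

Validator = Callable[[str], Optional[str]]   # returns an error message or None

def constrained_generate(prompt: str, draft: Callable[[str], str],
                         validators: list[Validator], max_tries: int = 3) -> str:
    """The LLM proposes; deterministic validators dispose. A draft is only
    accepted once every validator returns None (no error)."""
    for _ in range(max_tries):
        text = draft(prompt)   # draft() stands in for the actual model call
        errors = [e for v in validators if (e := v(text)) is not None]
        if not errors:
            return text
        prompt += "\nFix the following issues: " + "; ".join(errors)
    raise ValueError("no draft satisfied the validators")
```

The crucial property is that acceptance is decided by the validators, which can be as strict, auditable, and domain-specific as the setting demands, not by the fluency of the draft.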

Toward More Formal Review

One of the most promising areas for partial formalization is scientific review. The claim here should be made carefully. Peer review is not reducible to a checklist, and novelty judgments remain partly irreducible to formal procedure. But a substantial part of review is structural enough to benefit from explicit computational support.

There are at least two reasons to take this seriously. First, AI is already entering peer review in practice. Nature reported in 2025 that AI was transforming peer review while raising concerns about inconsistent and poorly governed use [Naddaf-2025]. A follow-up Nature report in early 2026 stated that more than half of surveyed researchers had used AI tools while reviewing manuscripts, often despite restrictive guidance [Naddaf-2026]. Second, there is now direct evidence that software tools can assist with some review-relevant criteria. A 2026 comparative study in PLOS ONE found that combinations of automated tools could outperform individual tools on some rigor and transparency checks [Eckmann-2026].

The practical implication is not that reviewers disappear. It is that papers, protocols, and reports can increasingly be treated not only as prose, but as structured objects containing claims, evidence relations, assumptions, evaluation choices, and possible contradictions. Review then becomes partly formalizable: a system can ask whether a conclusion exceeds the evidence, whether causal language is justified by design, whether a comparison is fair, or whether a claimed robustness property is actually tested [Eckmann-2026]. Human judgment remains central, but it can be supported by a more explicit technical substrate.
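Under the assumption that claims are represented as structured objects, some of these questions reduce to simple rules; the schema below is hypothetical and deliberately crude.

```python
from dataclasses import dataclass

@dataclass
class ReviewClaim:
    text: str
    causal: bool              # does the wording assert a causal effect?
    design: str               # e.g. "rct", "observational", "simulation"
    evidence_ids: list[str]   # links into the paper's results

def review_flags(claim: ReviewClaim) -> list[str]:
    """Rule-based flags; a human reviewer decides what to do with them."""
    flags = []
    if not claim.evidence_ids:
        flags.append("claim has no linked evidence")
    if claim.causal and claim.design != "rct":
        flags.append("causal language without an experimental design")
    return flags
```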

Building Trustworthy Scientific AI

If AI is to participate in the internal mechanics of research, then trustworthy AI is not a cosmetic layer. It is part of the core design problem. The reason is simple: the more influence machine systems have over hypotheses, experiments, interpretations, and review, the more consequential failures of provenance, uncertainty handling, reproducibility, privacy, and accountability become.

This is precisely the logic of existing governance frameworks. NIST states that understanding and managing AI risks helps enhance trustworthiness [NIST-2023]. The OECD AI Principles describe trustworthy AI as AI that is innovative while also respecting human rights and democratic values [OECD-2019]. In a scientific setting, these general ideas become concrete requirements. Important claims should have inspectable support. Transformations of data or interpretation should leave an audit trail. Systems should distinguish exploratory signals from validated findings and should expose uncertainty when the evidential basis is weak. In high-stakes contexts, they should also know when to abstain and when to escalate to human oversight. Trustworthiness, in that sense, is not a final moderation step. It is a distributed guardrail layer over the entire research process.
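As a sketch of what such a guardrail layer could look like at the level of a single pipeline step, consider a wrapper that appends every call to an audit log and escalates rather than propagates low-confidence results; the threshold and log format are illustrative assumptions.

```python
import json
import time
from typing import Any, Callable

AUDIT_LOG = "audit.jsonl"

def guarded(step: Callable[..., tuple[Any, float]],
            abstain_below: float = 0.6) -> Callable[..., Any]:
    """Wrap a pipeline step that returns (result, confidence): record every
    call in an audit trail, and withhold low-confidence results."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result, confidence = step(*args, **kwargs)
        with open(AUDIT_LOG, "a") as f:
            f.write(json.dumps({"step": step.__name__, "time": time.time(),
                                "confidence": confidence}) + "\n")
        if confidence < abstain_below:
            raise RuntimeError(f"{step.__name__}: confidence {confidence:.2f} "
                               "below threshold; escalating to human review")
        return result
    return wrapper
```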

A Gradual Path Forward

The path toward automation is likely to be gradual because the current evidence is local and compositional, not universal. AlphaFold automated a major scientific bottleneck, but not biology as a whole [Jumper-2021]. A-Lab demonstrated an autonomous loop for a defined materials workflow, not for all of experimental science [Szymanski-2023]. Self-driving labs are presented in the literature as powerful but still technically and organizationally challenging research infrastructures [Canty-2025]. The Google AI co-scientist is framed as a collaborator for hypothesis generation and proposal support, not as a complete replacement for scientific practice [Gottweis-Natarajan-2025].

A plausible trajectory therefore consists of cumulative steps: more structured research artefacts, more explicit workflows, clearer separation between generation, execution, critique, and validation, partial formalization of review, and native integration of provenance, uncertainty handling, audit logs, and escalation policies. In some fields, these layers will connect directly to robotic laboratories; in others, they will remain primarily computational. The important point is that no single step needs to solve the whole problem for the trajectory to be real.

The Changing Role of Scientists

The role of scientists is unlikely to disappear. It is more likely to change in the direction of problem selection, judgment, interpretation, and governance. This is not merely a philosophical preference; it follows from the structure of the current evidence. The tasks that are easiest to automate first are those that are repetitive, semi-formal, and locally checkable. By contrast, the selection of worthwhile questions, the evaluation of trade-offs, the interpretation of anomalies, and the social and ethical framing of research remain less reducible to routine procedure.

That is also how leading examples present themselves. Google characterizes its system as a “virtual scientific collaborator,” not as a replacement for scientists [Gottweis-Natarajan-2025]. The self-driving lab literature likewise emphasizes human-machine and human-human collaboration, not the disappearance of human scientific agency [Canty-2025]. The likely outcome is therefore differentiation rather than elimination: less human effort spent on low-level coordination and more effort concentrated on direction, judgment, and responsibility.

A New Horizon for Science

The most defensible conclusion is not that science will be handed over to autonomous machines. It is that the structured work of science is becoming increasingly representable, executable, and checkable in ways that allow machine systems to participate much more deeply than before.

That conclusion is credible because early examples already exist for multiple parts of the puzzle: AI systems that solve concrete scientific bottlenecks [Jumper-2021], autonomous experimental loops [Szymanski-2023] [Canty-2025], multi-agent systems for hypothesis generation [Gottweis-Natarajan-2025], software support for rigor and transparency checking [Eckmann-2026], and governance frameworks that define trustworthy deployment as more than raw capability [NIST-2023] [OECD-2019].

The real choice, then, is not whether AI will enter science. It already has. The choice is whether it will remain largely at the level of fluent assistance, or whether it will be integrated into a more rigorous architecture for producing, checking, and revising knowledge. The first path mainly accelerates output. The second has the potential to improve the structure of inquiry itself.

References

  • [Canty-2025] Canty, J. et al. Self-driving laboratories. Nature Reviews Methods Primers. 2025.
  • [Eckmann-2026] Eckmann, P. et al. Use as directed? A comparison of software tools intended to check rigor and transparency of published work. PLOS ONE. 2026.
  • [Gottweis-Natarajan-2025] Gottweis, J.; Natarajan, V. Accelerating scientific breakthroughs with an AI co-scientist. Google Research Blog. 2025.
  • [Jumper-2021] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021.
  • [Naddaf-2025] Naddaf, M. AI is transforming peer review — and many scientists are worried. Nature. 2025.
  • [Naddaf-2026] Naddaf, M. More than half of researchers now use AI for peer review — often against guidance. Nature. 2026.
  • [NIST-2023] National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). 2023.
  • [OECD-2019] OECD. OECD AI Principles overview. 2019.
  • [Szymanski-2023] Szymanski, N. J. et al. An autonomous laboratory for the accelerated synthesis of novel materials. Nature. 2023.