From Protocol to Pipette: What Two New RCTs Tell Us About AI Biosecurity Risk

LLMs uplift novice performance on computational biology tasks, while their effect in physical laboratories remains modest and task-dependent.

Mar 11, 2026

Two preprints published in February 2026 represent the most rigorous attempts yet to move AI biosecurity risk assessment from the theoretical to the empirical. Read together, they tell a nuanced and useful story.

Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

The first paper:1

Hong SZ et al. Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology, arXiv:2602.16703. Preprint, arXiv, 18 Feb. 2026. https://doi.org/10.48550/arXiv.2602.16703.

Hong and colleagues at Active Site ran a pre-registered, investigator-blinded randomized controlled trial in a BSL-2 laboratory over eight weeks, enrolling 153 undergrads with minimal prior lab experience. Half had access to frontier LLMs from Anthropic, OpenAI, and Google DeepMind; the other half had the internet only. Participants worked independently to complete tasks modeling a viral reverse genetics workflow: cell culture, molecular cloning, virus production, and RNA quantification. The headline result was that LLM access did not significantly improve completion of the full workflow.

Figure 1 from Hong et al 2026: Trial Design. Schematic of the 8-week in-person study. Participants (n = 153) completed safety and LLM training prior to randomization and the start of laboratory work (Session 1). Participants completed a workflow consisting of a foundational skill assessment (Pre-task 1) followed by the core reverse genetics sequence (Tasks 2–4) and RNA quantification (Task 5). Baseline surveys were collected at Session 5; outcome measures (task completion) and tool utilization (chat logs, search history) were recorded continuously throughout the 39 sessions.

Only about 5% of either group made it through. The more nuanced picture suggested a roughly 1.4-fold improvement in step-by-step advancement, statistically uncertain but directionally consistent. Notably, participants with LLMs succeeded at cell culture six days faster than controls. Less encouragingly, both groups rated YouTube more helpful than any single LLM, suggesting that tacit embodied skills like aseptic technique are still better transmitted through demonstration than text.

Figure 3 from Hong et al 2026: Task Success Rates and Pooled Effect Estimates. Forest plot displaying success rates expressed as Risk Ratios (RR = LLM/INT). Black markers represent observed RRs with 95% confidence intervals (CI) calculated using the Koopman score method; P values are derived from one-sided Fisher’s exact tests. Purple markers represent posterior estimates from a hierarchical Bayesian logistic regression model, displaying posterior means, 95% credible intervals (CrI), and the posterior probability of a positive effect (Pr(RR) > 1). Shaded regions depict full posterior densities. Out-of-sample: Posterior distribution of the predicted RR for a hypothetical, out-of-sample reverse genetics task. The vertical dashed line at x = 1 indicates no effect.

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

The second paper:

Zhang CBC et al. LLM Novice Uplift on Dual-Use, In Silico Biology Tasks, arXiv:2602.23329. Preprint, arXiv, 26 Feb. 2026. https://doi.org/10.48550/arXiv.2602.23329.

Zhang and colleagues at Scale AI and SecureBio asked a related but distinct question: can LLMs uplift novices on in silico biology tasks with dual-use relevance? Unlike the study above, this one wasn’t a wet lab experiment but a human benchmark study, pitting novices with or without LLM access against biosecurity-relevant question sets including the Virology Capabilities Test, Human Pathogen Capabilities Test, and several others. Here the results were considerably more striking. LLM-assisted novices were roughly 4x more accurate than controls and outperformed human domain experts on three of four benchmarks where expert baselines were available. Nearly 90% of participants reported no difficulty obtaining dual-use relevant information despite model safety filters.

Figure 1 from Zhang 2026: Experimental design and resulting performance scores.

Considering both studies

These two studies are complementary, measuring different things. Hong et al. probe physical execution, where tacit knowledge, manual dexterity, and equipment fluency remain significant barriers. Zhang et al. probe knowledge retrieval, reasoning, and protocol design, domains where LLMs already operate near or above expert level. Together they paint a picture in which the cognitive bottlenecks to dangerous biology are breaking down faster than the physical ones.

What still remains relatively unexplored is the hybrid scenario that’s relevant to near-term risk: a novice using AI to design an experimental protocol, which is then executed by trained personnel or automated liquid-handling systems in a cloud laboratory.2 In that threat model, the wet lab barriers documented by Hong et al. largely disappear, and the in silico capabilities documented by Zhang et al. become directly actionable. Evaluating that design-to-execution pipeline, with rigorous controls and physical outcome measures, is the next important experiment.

For more on this paper, read (1) the blog post from Active Site: Does frontier AI enhance novices in molecular biology?; (2) the blog post from Forecasting Research Institute: How Well Did Superforecasters and Experts Predict Wet Lab Skill Uplift from LLMs?; and (3) the summary thread from @activesite.bio on Bluesky.

For example, see Ginkgo’s recently launched Cloud Lab: https://cloud.ginkgo.bio/.

Paired Ends

Comments

Ready for more?