Press the Red Button, Dave

A small study at LANL has real-world biosecurity relevance: evaluating whether AI can transfer tacit laboratory skill to novices.

Feb 23, 2026

Scientists at Los Alamos National Laboratory (LANL) published a preprint in December 2025 that I found really interesting:

Romero-Severson, E. O., Harvey, T., Generous, N., & Mach, P. M. (2025). Measuring skill-based uplift from AI in a real biological laboratory. arXiv 2512.10960. https://doi.org/10.48550/arXiv.2512.10960.

Most of the “AI in bio” conversation still treats uplift as a score on a benchmark. Did the model answer the question. Did it write a plausible protocol. Did it propose an experiment. That framing misses the part that matters for both scientific productivity and biosecurity: in biology, the real bottleneck is often skill, not knowledge. This paper is one of the rare public attempts to measure that distinction directly, in a real laboratory, with real novices, trying to complete a multi-day wet-lab workflow.

What they did

The authors use “uplift” in the most concrete sense: an AI system raises the real-world capability of a naive actor enough that they can competently perform a complex technical task. They also draw a useful boundary between knowledge-based uplift (can you describe or plan a protocol) and “real uplift” (can you actually execute it at the bench, including the tacit, procedural know-how that usually takes time and mentorship to acquire). That gap between description and execution is exactly where a lot of current risk analysis hand-waves. As someone who did my early training in a developmental biology (wet) lab, I can confirm that it is also where a lot of day-to-day scientific friction lives.

The study was straightforward. 10 LANL employees with no prior wet-lab experience were randomized into two groups: a control group with internet access only, and a “treatment” group with ChatGPT o1 plus the internet (participants were encouraged to use Deep Research). Everyone was asked to perform a three-day workflow: transform E. coli with a provided expression construct, plate and grow colonies under selection, pick colonies and induce expression with IPTG, then prepare a sample for downstream confirmation of expression by mass spectrometry. Outcomes were scored with a mix of quantitative pass/fail checkpoints (did colonies grow, did OD600 reach target, did the culture get pelleted) and qualitative observation of how participants interacted with the tools, the lab environment, and each other.

Refer to caption — Figure 1 from the paper: Uplift experimental workflow. Each participant attempted to complete the tasks indicated on each day either with only internet access or both internet access and ChatGPT access. Successes criteria for each day are indicated in the bottom half of the figure. Each participant was allowed to attempt the whole three day process again if they failed at any point in the first three days.

What they found

As a pilot, it is intentionally not over-claimed. With only five people per arm, the study is not powered for clean statistical inference, and the authors say so explicitly. Still, the direction is suggestive: on the first attempt, 3/5 in the ChatGPT group completed all three days, vs. 1/5 in the internet-only group. Even more important than those coarse rates are the failure modes and interface frictions they surfaced, because those are exactly the mechanisms that computational benchmarks almost never capture.

The “red button problem” 🔴

The Qualitative Observations section on p7 holds an interesting tidbit. It’s the best concrete example in the whole paper of why “multimodal lab assistant” demos can be misleading. Early in their study development, they experimented with a multimodal version of ChatGPT that could take video, audio, and text, expecting that novices might benefit from richer instruction. Instead, participants found the interface distracting, and the model struggled with equipment guidance in a very specific way: it repeatedly hallucinated that lab equipment had a red button that needed to be pressed. The authors dub this the “red-button problem,” and they note it created substantial confusion because the device simply did not have a red button. In their final setup, switching to ChatGPT o1 with deep research reduced those issues, but the point stands: in a physical lab, “confidently wrong” is not just a bad answer, it is a workflow derailment that can waste hours and increase the chance of unsafe improvisation.

“Press the red button, Dave. The laboratory workflow cannot advance until you do.”

What this all means

I think uplift studies with a real wet-lab component are not optional for AIxBio and biosecurity work. Text-only evaluations tend to reward plausible narration. Bench work punishes narration that is even slightly misaligned with reality. A protocol can be “correct” in an abstract sense and still fail when a novice cannot translate it into the sequence of micro-actions that the environment demands: which tube, which setting, which physical affordance, what to do when the expected colony is not there, how to recover when you pipetted the wrong volume, when to stop reading and start doing. Those are not edge cases. They are the core of practical competence.

The paper also has a second, underappreciated takeaway: when you look closely, uplift is not a single number. The authors argue that future studies should emphasize elucidating mechanisms of uplift rather than chasing formal hypothesis testing. They propose that outcome measures should be grounded in an explicit threat model and that studies should collect richer descriptive data, including “anthropological” observation of how people actually use the system, where they get stuck, and what kinds of errors they make. They even suggest a pragmatic hybrid: run a small number of rigorous, large-scale real uplift experiments, then distill what you learn into scalable benchmarks that can be simulated at scale. That seems directionally right. If we care about real-world risk and real-world benefit, we need at least some evaluation that is anchored to reality, even if we later abstract it.

What comes next is straightforward and hard. We need larger uplift studies that keep the laboratory component, measure more than binary success, and are designed to surface where AI helps, where it distracts, and where it introduces new classes of error. I am assembling a small team to start building that capacity here, with an emphasis on careful protocol design and high-resolution observational data. I hope to talk more about this later on this year.

Romero-Severson, E. O., Harvey, T., Generous, N., & Mach, P. M. (2025). Measuring skill-based uplift from AI in a real biological laboratory. arXiv 2512.10960. https://doi.org/10.48550/arXiv.2512.10960.

Paired Ends

Comments

Ready for more?