Weekly Recap (Jan 2025 part 2)

Autonomous microbial sensors for detecting TNT in soil, genome size estimation from long reads, indexing + compressing GWAS summary stats, Clair3-RNA for small variant calling on long-read RNA-seq...

Jan 10, 2025

I'm still catching up on papers from my late 2024 backlog. This week’s recap highlights autonomous microbial sensors for detecting TNT in soil, genome size estimation from long reads, STABIX for indexing and compressing GWAS summary statistics, and Clair3-RNA for deep learning-based small variant calling on long-read RNA-seq data.

Others that caught my attention include generating protein pockets with PocketGen, insights from the Human Cell Atlas, AI agents designing new SARS-CoV-2 nanobodies, a review of reprogramming the cytoplasm and nucleus during the maternal-to-zygotic transition, recommendations for bioinformatics in clinical practice, metagenomic coassembly to recover novel genomes, pangenome growth and size estimation with panacus, on-disk representation of genome graphs with extgfa, a web server for exploring scRNA-seq data, and directed evolution of engineered virus-like particles.

Deep dive

An autonomous microbial sensor enables long-term detection of TNT explosive in natural soil

Paper: Essington EA, et al., "An autonomous microbial sensor enables long-term detection of TNT explosive in natural soil" in Nature Communications, 2024. https://doi.org/10.1038/s41467-024-54866-y.

Before coming to Colossal I worked at a small scientific services and technical consulting firm, Signature Science (SigSci). Most of our clients and research funders were in the defense / national security space. A few years ago we won a $1.5M contract from DARPA under their Bio Reporters for Subterranean Surveillance program to modify B. subtilis spores to be able to sense chemical agents in soil and emit VOCs that can be detected by a drone. The bio team at SigSci doing this work is world class, and it’s great to see others in the field are making significant advances in engineering microbial detection capabilities for CBRNE-related threats.

TL;DR: This group engineered Bacillus subtilis bacteria to detect TNT in soil by creating a genetic circuit combining a TNT-sensing riboswitch with a genetic memory switch. The sensor achieved impressive 14-fold activation after 7 days at low TNT concentrations and maintained detection capability for over 3 weeks in natural soil conditions. This represents a major advance in creating viable microbial sensors that can function long-term in complex environments.

Summary: The researchers engineered a sophisticated genetic system in B. subtilis that uses a TNT-binding riboswitch to control the expression of a site-specific integrase, which then permanently flips a genetic switch to activate reporter gene expression. The system achieved 14-fold activation at environmentally relevant TNT concentrations (4.5 mg/kg soil) and maintained measurable activation for 21 days while competing with natural soil microbes. The work is particularly important because it demonstrates that engineered microbial sensors can survive and function long enough in complex natural environments to provide meaningful detection capabilities - answering a key question in the field. The applications extend beyond TNT detection, as the modular design allows for sensing other environmental contaminants by swapping in different riboswitch sensors, potentially enabling autonomous environmental monitoring across large geographic areas.

Methodological highlights:

Used predictive computational models (Riboswitch Calculator) to rationally design TNT-sensing riboswitches with optimized activation ratios and translation rates in B. subtilis.
Developed a novel genetic circuit architecture combining sense and antisense promoters to reduce leaky expression while maintaining robust TNT-dependent activation.
Created a comprehensive multi-modal longitudinal testing workflow combining flow cytometry, qPCR, colony counting, and next-generation sequencing to quantitatively measure sensor function and persistence in soil over 28 days.
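To make the "memory" part of the circuit concrete, here is a toy sketch of an irreversible switch at the population level. It is not the authors' model: the per-day flip probabilities (`p_induced`, `p_leak`) are hypothetical numbers I picked for illustration, and the function name is my own. The only idea it captures is that a flipped cell stays flipped, so the signal persists after TNT is gone.

```python
def flipped_fraction(daily_tnt, p_induced=0.10, p_leak=0.005):
    """Fraction of cells whose memory switch has flipped after each day.

    daily_tnt: sequence of booleans, True on days when TNT is present.
    p_induced / p_leak: hypothetical per-day flip probabilities with and
    without TNT (illustrative values, not taken from the paper).
    """
    fraction = 0.0
    history = []
    for tnt_present in daily_tnt:
        p_flip = p_induced if tnt_present else p_leak
        # Only unflipped cells can flip; flipped cells never revert.
        fraction += (1.0 - fraction) * p_flip
        history.append(fraction)
    return history


# Seven days of TNT exposure followed by two weeks without it,
# compared with a control that never sees TNT.
exposed = flipped_fraction([True] * 7 + [False] * 14)
control = flipped_fraction([False] * 21)

# Fold activation on day 7: flipped fraction with TNT over leak-only control.
fold_day7 = exposed[6] / control[6]
print(f"fold activation on day 7 (toy numbers): {fold_day7:.1f}")
```

Because flipping is one-way, the exposed population never loses signal after TNT is removed. The fold activation in this toy model is set almost entirely by how leaky the switch is in the absence of TNT, which is presumably why the reduction in leaky expression features among the methodological highlights.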

New tools, data, and resources:

All sequencing data are available from NCBI under BioProject PRJNA1187511.
The supplementary data include all genetic part and system sequences, model calculations, experimental measurements, and statistical and data analyses.

Figure 1 from Essington 2024: A synthetic genetic circuit for detecting TNT in soil. (A) Engineered *Bacillus subtilis* cells sense TNT inside the soil in competition with natural microbes, using optical or olfactory output signals to transmit response information for stand-off detection. (B) The engineered synthetic genetic circuit contains cell sensing, memory, and response modules.

Genome size estimation from long read overlaps

Paper: Hall MB and Coin LJM, "Genome size estimation from long read overlaps" bioRxiv, 2024. DOI: 10.1101/2024.11.27.625777.

My colleagues and I sequence, assemble, and work with the genomes of non-model organisms (like elephants) and ancient genomes (like the thylacine). A tool like LRGE could help estimate genome size easily from HiFi/ONT data when we have it.

TL;DR: The authors present LRGE, a new tool that estimates genome size by analyzing how long sequencing reads overlap with each other. Unlike existing methods that rely on k-mers or assemblies, LRGE uses a clever statistical approach based on read overlaps, making it both accurate and computationally efficient for long-read ONT/PacBio data.

Summary: The authors introduce LRGE (Long Read-based Genome size Estimation), which uses read-to-read overlap information to estimate genome size without requiring a reference genome or k-mer counting. The method analyzes how individual reads overlap with one another and uses statistical properties of these overlaps to infer the underlying genome size. This matters because existing tools are primarily optimized for short-read data and often struggle with the higher error rates in long reads. The authors demonstrate that LRGE performs comparably to assembly-based approaches while using significantly fewer computational resources, and they validate it on both bacterial and eukaryotic genomes. The tool is especially useful in automated pipelines and workflows where a genome size estimate is needed for tasks like coverage calculation or assembly.

Methodological highlights:

The method calculates per-read genome size estimates by analyzing the expected number of overlaps between reads, considering read lengths and a minimum overlap threshold, then takes the median of these estimates for robustness.
The implementation offers two strategies: a two-set approach where query and target reads are separate, and an all-vs-all approach, with the two-set strategy being more computationally efficient.
The authors developed a novel statistical framework for providing confidence ranges on the genome size estimates, showing that the true genome size falls within their predicted range with >87% confidence.
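The intuition behind the per-read estimate can be sketched with a small simulation. This is my own simplification, not LRGE's code: it assumes reads land uniformly at random on a circular toy genome, and it detects overlaps from the known simulated positions rather than from read sequences. Under those assumptions, a query read of length q overlaps a target of length t by at least o bases with probability of roughly (q + t - 2o) / G, so dividing the summed spans by the observed overlap count gives a per-read estimate of G. All function names and parameters here are hypothetical.

```python
import random
from statistics import median


def simulate_reads(genome_size, n_reads, min_len, max_len, rng):
    """Place reads (start, length) uniformly at random on a circular genome."""
    return [(rng.randrange(genome_size), rng.randint(min_len, max_len))
            for _ in range(n_reads)]


def overlap_length(read_a, read_b, genome_size):
    """Overlap in bases between two reads on a circular genome."""
    (start_a, len_a), (start_b, len_b) = read_a, read_b
    offset = (start_b - start_a) % genome_size
    # Case 1: b starts inside a.
    forward = max(0, min(len_a - offset, len_b))
    # Case 2: b starts before a (wrapping around) and ends inside a.
    wrapped = max(0, min(len_a, offset + len_b - genome_size))
    return max(forward, wrapped)


def estimate_genome_size(reads, sim_genome_size, min_overlap):
    """Median of per-read genome size estimates.

    sim_genome_size is used only to decide which simulated reads overlap;
    with real data that information would come from an overlapper.
    """
    n = len(reads)
    total_len = sum(length for _, length in reads)
    estimates = []
    for i, query in enumerate(reads):
        n_overlaps = sum(
            1 for j, target in enumerate(reads)
            if j != i
            and overlap_length(query, target, sim_genome_size) >= min_overlap
        )
        if n_overlaps == 0:
            continue  # this read carries no information
        q_len = query[1]
        # Sum over all other reads of (q_len + t_len - 2 * min_overlap).
        span = (n - 1) * (q_len - 2 * min_overlap) + (total_len - q_len)
        estimates.append(span / n_overlaps)
    return median(estimates)


TRUE_SIZE = 100_000
rng = random.Random(0)
reads = simulate_reads(TRUE_SIZE, n_reads=400, min_len=2_000, max_len=6_000, rng=rng)
estimate = estimate_genome_size(reads, TRUE_SIZE, min_overlap=500)
print(f"true size: {TRUE_SIZE}, estimated: {estimate:.0f}")
```

Each per-read estimate is noisy because it depends on a single overlap count, which is why taking the median across reads makes sense as a robustness step.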

Tools:

Code: LRGE is implemented in Rust and available at https://github.com/mbhall88/lrge under an MIT license.
LRGE is available as a standalone binary, via Bioconda, Cargo, and Docker.

Results — Figure 1 from Hall 2024: Absolute relative error (y-axis) for each method's (x-axis) genome size estimation on ONT (black) and PacBio (orange) data. The y-axis is scaled according to a symmetric logarithm, which is linear between -1 and 1 and logarithmic (base 10) thereafter. The statistical annotations are the result of a Tukey's range test and are coloured by the sequencing platform being compared. The dashed lines in the violins are the quartiles.

STABIX: Summary statistic-based GWAS indexing and compression

Paper: Schneider K, et al., "STABIX: Summary statistic-based GWAS indexing and compression" bioRxiv, 2024. https://doi.org/10.1101/2024.11.15.623812.

I wrote the qqman R package back in 2014. Even then, when a typical GWAS had around 500k variants, plotting all those points was painfully slow. Doing this for modern GWAS, which can have millions of imputed SNPs across thousands of traits, will require some kind of querying/filtering procedure. This new tool from Ryan Layer’s lab looks useful if you’re dealing with UKBB-scale GWAS data.

TL;DR: STABIX is a new tool that improves how we store and query GWAS summary statistics by using column-specific compression and smart indexing of p-values. It’s faster than current methods (up to 7x speedup) when searching for significant variants and reduces file sizes compared to standard compression. This matters because GWAS files are getting huge with biobanks like UK Biobank publishing thousands of traits.

Summary: STABIX introduces two key innovations to handle the growing size of GWAS summary statistics: column-specific compression that applies different compression methods based on data type (integers, floating point numbers, strings), and a statistical index that allows direct querying of significant p-values without scanning entire files. The tool builds upon existing approaches like bgzip/tabix but achieves better compression ratios and dramatically faster query times when searching for genome-wide significant variants. The authors demonstrate STABIX's performance using 10 phenotypes from the Pan-UK Biobank, showing it creates smaller compressed files than standard methods and achieves 7x faster decompression for genes without significant variants. This addresses a critical need in the field as biobanks generate increasingly large GWAS datasets — for example, the current PanUKBB includes over 7,000 traits which would require over 10TB of compressed storage using current methods.

Methodological highlights:

Novel column-based compression strategy allows different compression algorithms to be used for different data types, improving overall compression efficiency.
Two-step indexing approach combines genomic position indexing with p-value binning, enabling rapid identification of significant variants without full file decompression.
Block-based compression design with configurable block sizes allows optimization for different query patterns and usage scenarios.
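The three ideas above fit together in a way that is easy to sketch. The following is a toy illustration only, not STABIX's file format or API: the block layout, the choice of zlib for every column, and the function names are all my assumptions. Each block compresses its position and p-value columns separately and records its smallest p-value in an index, so a query can skip blocks that cannot contain a significant hit without decompressing them.

```python
import struct
import zlib


def build_blocks(positions, pvalues, block_size=4):
    """Split sorted rows into blocks; compress each column separately."""
    blocks = []
    for i in range(0, len(positions), block_size):
        pos = positions[i:i + block_size]
        pv = pvalues[i:i + block_size]
        # Delta-encode the integer positions before compressing them.
        deltas = [pos[0]] + [b - a for a, b in zip(pos, pos[1:])]
        blocks.append({
            # Index entries: genomic range plus the best p-value in the block.
            "start": pos[0], "end": pos[-1], "min_p": min(pv), "n": len(pos),
            "pos_blob": zlib.compress(struct.pack(f"<{len(deltas)}q", *deltas)),
            "p_blob": zlib.compress(struct.pack(f"<{len(pv)}d", *pv)),
        })
    return blocks


def query(blocks, start, end, p_max):
    """Return (position, p) hits in [start, end] with p <= p_max,
    plus the number of blocks that had to be decompressed."""
    hits, opened = [], 0
    for block in blocks:
        outside = block["end"] < start or block["start"] > end
        if outside or block["min_p"] > p_max:
            continue  # skipped using the index alone
        opened += 1
        n = block["n"]
        deltas = struct.unpack(f"<{n}q", zlib.decompress(block["pos_blob"]))
        pvals = struct.unpack(f"<{n}d", zlib.decompress(block["p_blob"]))
        pos = 0
        for delta, p in zip(deltas, pvals):
            pos += delta
            if start <= pos <= end and p <= p_max:
                hits.append((pos, p))
    return hits, opened


# Made-up data: 12 variants, one of which is genome-wide significant.
positions = list(range(100, 1300, 100))
pvalues = [0.5] * 12
pvalues[6] = 1e-9  # the variant at position 700

blocks = build_blocks(positions, pvalues, block_size=4)
hits, opened = query(blocks, start=0, end=2000, p_max=5e-8)
print(hits, f"decompressed {opened} of {len(blocks)} blocks")
```

In this sketch the block size is the main tuning knob: smaller blocks mean less wasted decompression per query but more index entries and typically worse compression per block.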

New tools, data, and resources:

Code: STABIX software available at https://github.com/kristen-schneider/stabix, written in C++.
Analysis scripts and figure generation code available at https://github.com/kristen-schneider/stabix-analysis for reproducibility.
Evaluations performed on publicly available PanUKBB GWAS data, with specific file identifiers and trait information provided in the paper.

Figure 2 from Schneider 2024: STABIX vs tabix performance for 10 PanUKBB GWAS per-phenotype files.

Clair3-RNA: A deep learning-based small variant caller for long-read RNA sequencing data

Paper: Zheng Z, et al., "Clair3-RNA: A deep learning-based small variant caller for long-read RNA sequencing data", bioRxiv, 2024. DOI: 10.1101/2024.11.17.624050.

Another tool in your toolbelt for variant calling, this time for RNA-seq data using long reads. The accuracy and computational efficiency benchmarks look good.

TL;DR: This paper introduces Clair3-RNA, the first deep learning-based variant caller specifically designed for long-read RNA sequencing data. It tackles unique challenges in RNA variant calling like uneven coverage, RNA editing events, and zygosity switching by incorporating novel techniques in its neural network architecture. The tool significantly outperforms existing methods, achieving >90% SNP F1-scores across PacBio and ONT platforms.

Summary: Clair3-RNA is built upon the successful Clair series but incorporates RNA-specific optimizations like coverage normalization, training data refinement, and editing site discovery. The importance of this work lies in enabling accurate variant calling from RNA-seq data, which offers advantages like cost-effectiveness, higher depth in expressed regions, and ability to detect post-transcriptional modifications. The authors demonstrate its broad applicability across different long-read RNA sequencing platforms including PacBio Iso-Seq, PacBio MAS-Seq, and Oxford Nanopore direct RNA sequencing, with particularly strong performance on protein-coding genes and the ability to phase variants across long reads, enabling haplotype-aware analysis.

Methodological highlights:

Uses a bidirectional LSTM neural network architecture with multi-task output incorporating both variant genotype prediction and RNA editing site detection.
Implements a novel coverage normalization technique to handle the uneven depth distribution characteristic of RNA-seq, improving variant calling in both high and low coverage regions.
Integrates phasing information into the neural network architecture using haplotagged reads, leading to significant performance improvements especially when combined with the latest sequencing platforms.

New tools, data, and resources:

Code: Available at https://github.com/HKU-BAL/Clair3-RNA (written in Python, BSD license). The tool supports variant calling for PacBio Iso-Seq, MAS-Seq, and Oxford Nanopore RNA sequencing data.
ONT sequencing data available at PRJNA1166426.

Figure 3(d-e) from Zheng 2024: PacBio and ONT benchmarking results. (d) Precision-recall curves of four variant callers on different datasets; the dot indicates the best F1-score. AUPRC is the area under the precision-recall curve. (e) Runtime and memory usage of four variant callers on different datasets.
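For readers less familiar with the metrics in these benchmarks, here is a minimal sketch of how an F1-score and an area under the precision-recall curve are computed. The counts and curve points are made up for illustration and are not taken from the paper, and the trapezoidal AUPRC is one common approximation rather than necessarily the one the authors used.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def auprc(points):
    """Trapezoidal area under a curve given as (recall, precision) pairs."""
    pts = sorted(points)
    return sum((r2 - r1) * (p1 + p2) / 2
               for (r1, p1), (r2, p2) in zip(pts, pts[1:]))


# Made-up example: 90 variants called correctly, 10 false calls, 10 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=10)

# Made-up curve from sweeping a variant quality cutoff.
curve = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
area = auprc(curve)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f} AUPRC={area:.3f}")
```

The dot on each curve in the figure corresponds to the cutoff where the F1 computed this way is highest.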

Other papers of note

The Human Cell Atlas from a cell census to a unified foundation model https://www.nature.com/articles/s41586-024-08338-4
The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1
Review: The maternal-to-zygotic transition: reprogramming of the cytoplasm and nucleus https://www.nature.com/articles/s41576-024-00792-0
Recommendations for Bioinformatics in Clinical Practice https://www.biorxiv.org/content/10.1101/2024.11.23.624993v1
Bin Chicken: targeted metagenomic coassembly for the efficient recovery of novel genomes https://www.biorxiv.org/content/10.1101/2024.11.24.625082v1
Panacus: fast and exact pangenome growth and core size estimation https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae720/7914008
extgfa: a low-memory on-disk representation of genome graphs https://www.biorxiv.org/content/10.1101/2024.11.29.626045v1
scExplorer: A Comprehensive Web Server for Single-Cell RNA Sequencing Data Analysis https://www.biorxiv.org/content/10.1101/2024.11.11.622710v1
Directed evolution of engineered virus-like particles with improved production and transduction efficiencies https://www.nature.com/articles/s41587-024-02467-x
Efficient generation of protein pockets with PocketGen https://www.nature.com/articles/s42256-024-00920-9

Figure 1 from Zhang 2024: Overview of PocketGen generative model for the design of full-atom ligand-binding protein pockets.

Paired Ends
