Weekly Recap (July 2025, part 3)
Long-read consensus assembly with Autocycler, ProGen3 for generation and functional understanding of proteins, CarpeDeam for ancient metagenome assembly, hifiasm ONT for T2T assembly of Simplex reads
This week’s recap highlights Autocycler for long-read consensus assembly for bacterial genomes (future post on this one alone coming soon), ProGen3 for broader generation and deeper functional understanding of proteins, the CarpeDeam de novo metagenome assembler for ancient datasets, and hifiasm ONT for efficient T2T assembly of Nanopore Simplex reads.
Others that caught my attention include a paper on how LLMs internalize scientific literature and work with citations, PanScan for tertiary analysis of human pangenome graphs, the py_ped_sim pedigree and genetic simulator for complex family pedigree analysis, Bonsai for visualization and exploratory analysis of single-cell omics data, alignment-free integration of single-nucleus ATAC-seq across species with sPYce, the Japan Omics Browser for integrative visualization of multi-omics data, evidence-based genome annotation with EviAnn, a review on predicting gene expression from DNA sequence using deep learning models, another review on cell-type deconvolution methods for spatial transcriptomics, and yet another review on methodological opportunities in genomic data analysis to advance health equity.
Deep dive
Autocycler: long-read consensus assembly for bacterial genomes
Paper: Wick, R. R., et al., “Autocycler: long-read consensus assembly for bacterial genomes”, bioRxiv, 2025, https://doi.org/10.1101/2025.05.12.653612.
Autocycler is the successor to Trycycler, and author Ryan Wick recommends Autocycler over Trycycler for most users (per the Autocycler README). I tried out Autocycler and I’m impressed by its speed and ease of use. I’ll write more about it soon.
TLDR: This paper introduces Autocycler, a new automated command-line tool that significantly boosts the accuracy of bacterial genome assemblies from long-read sequencing data. It tackles the common issue of errors in individual assemblers by intelligently combining multiple assembly attempts into a high-quality consensus, all without needing manual intervention.
Summary: Autocycler is a command-line tool designed to generate accurate, complete bacterial genome assemblies by creating a consensus from multiple alternative long-read assemblies of the same genome. The tool automates a multi-step process that includes building a compacted de Bruijn graph from the input assemblies, clustering related contigs, performing automated quality control, trimming overlaps or excess sequences (including circular and hairpin overlaps), and finally resolving a consensus sequence by selecting the most common variants at each locus. Autocycler addresses the limitations of individual long-read assemblers, which can produce imperfect assemblies with sequence-level or structural errors, and overcomes the scalability issues of manual consensus methods. By automating this process, Autocycler facilitates the generation of high-quality bacterial genomes, which is essential for large-scale genomics projects and downstream applications such as comparative genomics, accurate gene annotation, and the study of genomic features like plasmids and prophages. While fully automated, it also allows for optional manual curation for complex cases.
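To make the core "most common variants at each locus" idea concrete, here's a toy Python sketch of majority-vote consensus. This illustrates only the principle: Autocycler itself resolves consensus paths through a compacted de Bruijn graph and is written in Rust, so treat this as a sketch, not the tool's implementation.

```python
# Toy majority-vote consensus: at each locus where the input assemblies
# disagree, keep the variant supported by the most assemblies.
# NOT Autocycler's implementation, just the underlying idea.
from collections import Counter

def majority_consensus(variant_calls: list[list[str]]) -> list[str]:
    """variant_calls[i] holds the allele each input assembly reports at locus i."""
    consensus = []
    for alleles in variant_calls:
        most_common_allele, _count = Counter(alleles).most_common(1)[0]
        consensus.append(most_common_allele)
    return consensus

# Three loci, five input assemblies; one assembly errs at each locus.
calls = [["A", "A", "A", "T", "A"],
         ["C", "C", "G", "C", "C"],
         ["T", "T", "T", "T", "A"]]
print(majority_consensus(calls))  # ['A', 'C', 'T']
```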
Methodological highlights:
Autocycler automates the generation of a consensus assembly by first compressing multiple input assemblies into a compacted de Bruijn graph, then clustering contigs based on similarity, trimming these contigs to remove overlaps and inconsistencies, and finally resolving a consensus sequence for each replicon by identifying common anchor sequences and the most supported paths (bridges) between them.
The tool includes a utility called autocycler subsample, which divides a single long-read dataset into multiple, minimally overlapping subsets; these subsets can then be assembled independently to generate diverse input assemblies, which can improve the final consensus, especially for very high-depth datasets (sketched in code after this list).
Designed for automation, Autocycler produces detailed metrics in YAML format at each step and supports optional manual curation steps, allowing users to inspect intermediate files (like Newick trees for clustering or dotplots for trimming) and refine assemblies when necessary, particularly for challenging genomes.
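Here's the read-subsetting idea as a minimal Python sketch: distribute reads across k subsets, each capped at a target depth. The distribution strategy below (random shuffle plus fill-the-emptiest) is my assumption for illustration only; it is not autocycler subsample's actual algorithm, and real usage goes through the CLI.

```python
# Minimal sketch of the idea behind `autocycler subsample`: split one
# long-read set into k minimally overlapping subsets, each capped at a
# target depth. Hypothetical simplification, not the real implementation.
import random

def subsample_reads(read_lengths, k=4, genome_size=5_000_000, target_depth=50):
    target_bases = genome_size * target_depth   # per-subset base budget
    order = list(range(len(read_lengths)))
    random.shuffle(order)                        # decorrelate the subsets
    subsets = [[] for _ in range(k)]
    subset_bases = [0] * k
    for i in order:
        # Give the next read to the subset with the fewest bases so far.
        j = min(range(k), key=lambda s: subset_bases[s])
        if subset_bases[j] >= target_bases:
            break                                # every subset is at depth
        subsets[j].append(i)
        subset_bases[j] += read_lengths[i]
    return subsets                               # read indices per subset

lengths = [20_000] * 60_000                      # 1.2 Gbp of simulated reads
print([len(s) for s in subsample_reads(lengths)])  # 12,500 reads per subset
```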
New tools, data, and resources:
Autocycler code: https://github.com/rrwick/Autocycler (GPL, written in Rust). You can install it via conda or use the pre-built binaries on the release page.
Helper scripts: Autocycler provides helper scripts to run various common long-read assemblers (e.g., Canu, Flye, LJA, metaMDBG, miniasm, NECAT, NextDenovo, Raven, wtdbg2) which can be used to generate the initial set of diverse assemblies required by Autocycler. These are on the release page.

Scaling unlocks broader generation and deeper functional understanding of proteins
Paper: Bhatnagar, A., et al., “Scaling unlocks broader generation and deeper functional understanding of proteins”, bioRxiv, 2025, https://doi.org/10.1101/2025.04.15.649055.
This is another paper from Profluent Bio, who designed and published OpenCRISPR-1 (paper, sequence), demonstrating the first successful precision editing of the human genome with a programmable gene editor designed with AI. Profluent also developed the previously covered Protein2PAM (paper, repo), which uses deep learning trained on evolutionary data to customize CRISPR-Cas PAM recognition, enabling faster and more precise genome editing. See also the blog post introducing ProGen3.
TLDR: This paper introduces ProGen3, a new family of powerful, sparsely activated protein language models, and the authors systematically show that by significantly scaling up model size to 46 billion parameters (!), these AI models get much better at designing a wider variety of viable proteins. They tested these generated proteins in the wet lab, finding that bigger models not only produce more diverse functional candidates but also learn more effectively from experimental data to improve tasks like predicting protein fitness, which is a big deal for designing new proteins for all sorts of applications.
Summary: ProGen3 is a family of sparse generative protein language models (PLMs), with the largest instance reaching 46 billion parameters, pretrained on an extensive 1.5 trillion amino acid tokens derived from the newly curated Profluent Protein Atlas v1, which contains 3.4 billion full-length protein sequences. The authors establish compute-optimal scaling laws for these sparse PLMs and investigate how model scale and the distribution of training data influence the ability of these models to generate novel protein sequences. The significance of this work lies in its comprehensive demonstration, supported by both computational analysis and wet-lab experiments, that larger PLMs exhibit superior capabilities in generating viable proteins across a more diverse range of protein families and are more adept at aligning with laboratory data to enhance protein fitness prediction and sequence generation. These findings underscore that increased scale, coupled with well-curated large datasets, leads to more powerful and versatile foundation models, thereby advancing the frontiers of protein design for applications in medicine, agriculture, and various industrial processes.
Methodological highlights:
ProGen3 models employ a sparse mixture of experts (MoE) architecture, which means only about 27% of the model's parameters are active during a forward pass, enabling more efficient training and inference for very large models compared to traditional dense architectures (a toy sketch of this routing follows this list).
The research derived compute-optimal scaling laws specifically for sparse autoregressive PLMs (N_opt(D) ∝ D^1.479), providing a principled way to balance model size (N) and the number of training tokens (D) for optimal performance given a fixed computational budget.
ProGen3 supports generalized language modeling (GLM), allowing it to perform infilling of amino acid spans within a protein sequence, in addition to standard N-to-C or C-to-N terminal generation, which is crucial for redesigning specific protein domains.
A new, large-scale, high-quality dataset, the Profluent Protein Atlas v1 (PPA-1), consisting of 3.4 billion full-length proteins, was curated, and an "Inverse Log" data sampling strategy was used during training to enhance out-of-distribution generalization.
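To make the sparse MoE idea concrete, here's a toy routed forward pass in Python/numpy: a learned router scores the experts for each token, and only the top-k experts' weight matrices participate in the computation, so most parameters sit idle on any given pass. Everything here (dimensions, softmax gating over the top-k) is generic illustration, not ProGen3's architecture or code.

```python
# Toy sparse mixture-of-experts layer: the router picks top-k experts per
# token, so only those experts' weights are touched in the forward pass.
# Generic illustration, not ProGen3's architecture.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))            # routing weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]                        # top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized
    # Only top_k of n_experts weight matrices participate for this token:
    return sum(g * (x @ experts[e]) for g, e in zip(gates, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (16,) -- here only 2 of 8 experts active
```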
New tools, data, and resources:
Code: https://github.com/Profluent-AI/progen3. Apache licensed. The repo also includes links to the models served on Hugging Face.
Profluent Protein Atlas v1 (PPA-1): A large, curated dataset of 3.4 billion full-length protein sequences (1.1 trillion amino acid tokens) compiled from diverse genomic and metagenomic sources, used for pre-training ProGen3. The paper details its construction but does not explicitly state whether the dataset itself is downloadable.

CarpeDeam: A De Novo Metagenome Assembler for Heavily Damaged Ancient Datasets
Paper: Kraft, L., et al., “CarpeDeam: A De Novo Metagenome Assembler for Heavily Damaged Ancient Datasets”, bioRxiv, 2025, https://doi.org/10.1101/2024.08.09.607291.
This paper combines a few interests of mine: metagenomics and ancient DNA!
TLDR: This paper unveils CarpeDeam, a new assembler specifically designed to reconstruct genomes from challenging ancient metagenomic samples where the DNA is highly fragmented and damaged. It outperforms existing tools on these tough datasets, meaning we can learn more about ancient microbes and their evolution.
Summary: CarpeDeam is a new de novo metagenome assembler optimized for ancient DNA (aDNA) datasets, which are typically characterized by short DNA fragments, high rates of chemical damage (like deamination), and often low concentrations of endogenous DNA mixed with environmental contaminants. Its importance lies in addressing the significant bioinformatics challenge of accurately reconstructing microbial genomes from such degraded material, where standard assemblers often fail or produce highly fragmented results. By improving the quality and completeness of assembled ancient metagenomes, CarpeDeam allows for more reliable investigations into the composition, function, and evolution of past microbial communities and ancient pathogens, potentially unlocking new insights from paleontological and archaeological samples.
Methodological highlights:
CarpeDeam employs a co-assembly approach, combining reads from multiple related ancient samples to increase coverage depth and improve the contiguity of the assembly, which is crucial for low-biomass samples.
It integrates an aDNA-aware error model into its assembly algorithm, specifically designed to handle the characteristic C-to-T substitutions caused by cytosine deamination, a common form of ancient DNA damage, allowing for more accurate contig reconstruction (see the toy sketch after this list).
The assembler uses a path-centric extension algorithm and incorporates cleaning steps tailored to the error profiles of aDNA, helping to distinguish true genomic signal from noise caused by damage and fragmentation.
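Here's the damage-aware idea as a toy Python function: when comparing a read against a reference or candidate contig, C-to-T (and the complementary G-to-A) substitutions are excused rather than counted as ordinary mismatches. This sketches the principle only; CarpeDeam's actual error model is probabilistic and considerably more sophisticated.

```python
# Toy damage-aware comparison: ancient-DNA deamination converts C->T
# (and G->A on the opposite strand), so those substitutions should not be
# penalized like ordinary sequencing errors. A minimal sketch of the idea,
# not CarpeDeam's actual model.
DAMAGE_OK = {("C", "T"), ("G", "A")}  # (reference base, damaged read base)

def damage_aware_mismatches(ref: str, read: str) -> int:
    """Count mismatches, excusing plausible deamination-driven substitutions."""
    assert len(ref) == len(read)
    return sum(1 for r, q in zip(ref, read)
               if r != q and (r, q) not in DAMAGE_OK)

print(damage_aware_mismatches("ACGT", "ATAT"))  # C->T and G->A excused -> 0
```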
New tools, data, and resources:
Code: https://github.com/LouisPwr/CarpeDeam. GPL license, C and C++.
Simulated datasets: https://zenodo.org/records/13208898.

Efficient near telomere-to-telomere assembly of Nanopore Simplex reads
Paper: Cheng, H., et al., “Efficient near telomere-to-telomere assembly of Nanopore Simplex reads”, bioRxiv, 2025, https://doi.org/10.1101/2025.04.14.648685.
TLDR: This paper introduces hifiasm (ONT), the first assembler that can produce near telomere-to-telomere (T2T) assemblies from standard Oxford Nanopore Simplex reads, ditching the need for expensive and hard-to-get ultra-long reads.
Summary: Hifiasm (ONT) is a novel de novo assembly algorithm capable of producing near telomere-to-telomere (T2T) assemblies of human and other genomes using standard Oxford Nanopore Technologies (ONT) Simplex reads, thereby eliminating the dependency on ultra-long sequencing data. The core innovation is an efficient error correction method that leverages read phasing to distinguish true genomic variants from the non-random, recurrent sequencing errors prevalent in ONT Simplex reads. The importance of hifiasm (ONT) lies in its ability to significantly broaden the accessibility of T2T assembly, as ultra-long reads are costly and experimentally challenging to obtain, especially for samples without established cell lines or for large-scale projects. This new assembler not only reduces computational demands by an order of magnitude but also reconstructs more chromosomes end-to-end compared to existing methods using the same standard Simplex datasets, making high-quality diploid genome assembly feasible for a wider array of applications including clinical genomics and biodiversity research.
Methodological highlights:
Hifiasm (ONT) introduces a novel error correction algorithm tailored for ONT Simplex reads that utilizes long-range phasing information to differentiate true heterozygous sites from recurrent sequencing errors. It achieves this by identifying potential informative sites and then clustering them based on mutual compatibility using an efficient dynamic programming approach, effectively leveraging how true variants phase together along a read.
The error correction accuracy is further improved by filtering out potential informative sites that are likely errors, such as those in homopolymer regions, exhibiting strand bias, or having base quality scores below 10 (see the toy filter after this list).
For enhanced T2T assembly, hifiasm (ONT) implements strategies to preserve telomeric sequences by identifying reads containing telomere repeats prior to assembly and safeguarding the corresponding tips in the assembly graph during cleaning. Additionally, it employs a dual-scaffold approach for diploid genomes, using information from one assembled haplotype to guide the scaffolding and gap filling of the other.
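To make the informative-site filtering concrete, here's a toy Python filter applying the base-quality and strand-bias checks described above. Only the quality cutoff of 10 comes from the paper; the function, its signature, and the strand-bias threshold are my illustrative assumptions, not hifiasm (ONT)'s code.

```python
# Toy filter for candidate heterozygous sites: drop sites whose supporting
# reads have low base quality or come almost entirely from one strand.
# Illustrative only -- the quality cutoff of 10 is from the paper; the rest
# is assumed for this sketch.
def keep_site(alt_quals, alt_fwd, alt_rev, min_qual=10, max_strand_frac=0.9):
    """alt_quals: base qualities of reads supporting the alternate allele;
    alt_fwd/alt_rev: counts of supporting reads on each strand."""
    if not alt_quals or min(alt_quals) < min_qual:
        return False                      # low base quality: likely an error
    total = alt_fwd + alt_rev
    if total and max(alt_fwd, alt_rev) / total > max_strand_frac:
        return False                      # strand bias: likely an artifact
    return True

print(keep_site([15, 20, 18], alt_fwd=2, alt_rev=1))  # True
print(keep_site([15, 20, 18], alt_fwd=3, alt_rev=0))  # False (one strand only)
```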
New tools, data, and resources:
Code: https://github.com/chhylp123/hifiasm (C++, MIT license).
Data: The data availability section of the paper is over a page long, giving links to all of the publicly available data used in the paper. I’m not copying it here, so go to the full text and jump to the bottom to see it.

Other papers of note
How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices? https://arxiv.org/abs/2504.02767
PanScan: A Tool for Tertiary Analysis of Human Pangenome Graphs https://www.biorxiv.org/content/10.1101/2025.05.01.651685v1
py_ped_sim: a flexible forward pedigree and genetic simulator for complex family pedigree analysis https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06142-z
Bonsai: Tree representations for distortion-free visualization and exploratory analysis of single-cell omics data https://www.biorxiv.org/content/10.1101/2025.05.08.652944v1
Alignment-free integration of single-nucleus ATAC-seq across species with sPYce https://www.biorxiv.org/content/10.1101/2025.05.07.652648v1
JOB: Japan Omics Browser (https://japan-omics.jp/) provides integrative visualization of multi-omics data https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-025-11639-1
Efficient evidence-based genome annotation with EviAnn https://www.biorxiv.org/content/10.1101/2025.05.07.652745v1
Review: Predicting gene expression from DNA sequence using deep learning models https://www.nature.com/articles/s41576-025-00841-2 (read free: https://rdcu.be/elSrE)
Review: Methodological opportunities in genomic data analysis to advance health equity https://www.nature.com/articles/s41576-025-00839-w (read free: https://rdcu.be/el8MQ)
Review: Cell-type deconvolution methods for spatial transcriptomics https://www.nature.com/articles/s41576-025-00845-y (read free: https://rdcu.be/el1kC)

I'm curious why hifiasm (https://github.com/chhylp123/hifiasm/) wasn't compared with Autocycler, and how the two would compare.