Weekly Recap (June 2025, part 2)

Interactive visualization of tabular data, shared foundations of gene expression and chromatin structure, fast protein structure searching with graph embeddings, visualizing cis-regulatory rules, ...

Jun 27, 2025

This week’s recap highlights the new Datavzrd tool for interactive visualization and communication of tabular data (I’m genuinely really looking forward to trying this one), tracing the shared foundations of gene expression and chromatin structure, PISA for visualizing cis-regulatory rules in genomic data, fast protein structure searching using structure graph embeddings, and a review/perspective on intrinsically disordered regions as facilitators of the transcription factor target search.

Others that caught my attention include evaluation of strategies for evidence-driven genome annotation using long-read RNA-seq, enhancing nanopore adaptive sampling for PromethION using readfish, a unified analysis of atlas single-cell data, multimodal foundation models in molecular cell biology, broader generation and deeper functional understanding of proteins, datamap for visualizing high-dimensional data, low overlap of transcription factor DNA binding and regulatory targets, efficient near telomere-to-telomere assembly of Nanopore Simplex reads (hifiasm), and rewriting regulatory DNA to dissect and reprogram gene expression.

Buy me a coffee ☕️

Deep dive

Datavzrd: Rapid programming- and maintenance-free interactive visualization and communication of tabular data

Paper: Wiegand, F., et al. “Datavzrd: Rapid programming- and maintenance-free interactive visualization and communication of tabular data” bioRxiv (preprint), 2025. https://doi.org/10.1101/2025.04.03.647146.

Ever wished you could make interactive data reports like Shiny apps but share them easily like Excel files without server headaches?

TLDR: Datavzrd is a new tool that does exactly that 👆. It generates standalone, interactive HTML reports from tabular data using simple configuration files, making complex data exploration accessible and shareable for everyone, including non-programmers.

Summary: Datavzrd is a new command-line tool designed to create interactive, visually rich reports from tabular data across scientific disciplines, bridging the gap between simple spreadsheets and complex web applications. It generates portable, server-free, standalone HTML reports through declarative YAML configuration files, eliminating the need for imperative programming or server maintenance. This approach significantly lowers the barrier for creating sophisticated data visualizations and offers accessibility for non-computational scientists. Datavzrd reports support interactive filtering, sorting, various plot types (heatmaps, tick plots, custom Vega-Lite plots), linking between related tables, and scale efficiently to large datasets via pagination and compression, facilitating easier communication and exploration of scientific results.

Methodological highlights:

Uses a declarative YAML configuration system, avoiding imperative programming for specifying interactive report layouts and features.
Generates self-contained, static HTML reports, embedding compressed data (using JSONM and lz-string) within JavaScript files, eliminating the need for server backends or specialized viewing software.
Implements data partitioning into separate HTML pages and pre-built search indices to handle large datasets interactively within browser limitations and same-origin policy constraints.

New tools, data, and resources:

Datavzrd: A command-line tool written in Rust for creating interactive HTML reports from tabular data. Available at https://github.com/datavzrd/datavzrd and via package managers Cargo (https://crates.io/crates/datavzrd) and Conda (https://anaconda.org/conda-forge/datavzrd).
Datavzrd Spells: A central repository (https://github.com/datavzrd/datavzrd-spells) and catalog (https://datavzrd.github.io/docs/spells.html) for reusable YAML configuration snippets ('spells') for common data types and visualizations.
Snakemake wrapper: Allows integration into Snakemake workflows for reproducible analysis pipelines.

Figure 3 from Wiegand 2025: Example Datavzrd view of an astronomy spreadsheet. Boolean columns have been rendered with Datavzrd’s boolean spell, numeric columns have been rendered as heatmaps. The squarish link buttons at the far right of each row (generated by Datavzrd’s dataset linking functionality) jump to corresponding rows in other views.

Tracing the Shared Foundations of Gene Expression and Chromatin Structure

Paper: Liang, H., et al. “Tracing the Shared Foundations of Gene Expression and Chromatin Structure” bioRxiv, 2025. https://doi.org/10.1101/2025.03.31.646349.

TLDR: This paper uses AI foundation models (scGPT) and a novel bag-of-genes approach to define TADs across species, showing this combo better captures gene relationships than just co-expression. They introduce contextual transcriptional similarity (CTS) and consensus TAD Maps to explore how 3D genome structure influences function in development, aging, and cancer. It’s a cool new way to look at TAD function without needing Hi-C for every cell type.

Summary: This study introduces a new framework for understanding the relationship between 3D chromatin structure (specifically TADs) and gene regulation, moving beyond cell-type specific analyses. The authors develop consensus TAD maps for human and mouse using a novel bag-of-genes approach, abstracting TADs to their gene content, and leverage a generative AI foundation model (scGPT) to define contextual transcriptional similarity (CTS), a metric argued to capture functional gene relationships more effectively than traditional co-expression. The work highlights that TADs act as loci of enriched transcriptional context, enhancing CTS multiplicatively across genomic distances, potentially through interactions with transcriptional condensates. This framework is applied to investigate systematic changes in TAD-mediated regulation during cell differentiation, aging, and cancer progression, suggesting TAD organization plays a key role in cellular plasticity. The authors also propose TAD signatures derived from scRNA-seq data as a novel method for analyzing cellular heterogeneity.

Methodological highlights:

Development of species-level consensus TAD Maps using maximum likelihood estimation and dynamic programming on aggregated TAD calls, representing TADs as a bag-of-genes.
Introduction of Contextual Transcriptional Similarity (CTS), calculated as the Pearson correlation between gene embeddings derived from the scGPT single-cell foundation model, as a measure of functional gene relationships.
Proposal of TAD signatures, inferred using a probabilistic 2-component mixture model (Poisson-based) on scRNA-seq data via an EM algorithm, to represent TAD activation states for dimensionality reduction and analysis.

New tools, data, and resources:

Pre-computed TAD Maps and consensus TAD boundary estimates for human and mouse genomes are available at http://singhlab.net/tadmap.
Code: https://github.com/rohitsinghlab/TADMap.
Documentation: https://tadmap.readthedocs.io.

Figure 1 from Liang 2025: TADMap overview.

PISA: a versatile interpretation tool for visualizing cis-regulatory rules in genomic data

Paper: McAnany, C.E., et al. "PISA: a versatile interpretation tool for visualizing cis-regulatory rules in genomic data" bioRxiv, 2025. 10.1101/2025.04.07.647613.

TLDR: This paper describes a tool called PISA to really see how specific DNA bases affect specific positions in genomics readouts (like ChIP, ATAC, MNase) predicted by deep learning models. It helps uncover hidden motif effects (like mixed positive/negative contributions) and experimental bias, even letting them make better MNase-seq models and find new nucleosome positioning rules de novo. Solves a big interpretation headache for sequence-to-profile models.

Summary: Interpreting sequence-to-function deep learning models to understand cis-regulatory logic remains challenging, especially for complex genomic data confounded by experimental bias, such as MNase-seq. This paper introduces Pairwise Influence by Sequence Attribution (PISA), a new interpretation tool that decodes how each input base influences the prediction at each specific output coordinate in sequence-to-profile models. PISA visualizes these pairwise relationships using novel “squid plots” and heatmaps, revealing motif influence ranges, uncovering motifs with complex positive and negative contribution patterns missed by traditional methods, and identifying local experimental biases. The authors integrated PISA into a new deep learning suite, BPReveal, and demonstrated its utility by building improved, bias-corrected MNase-seq models for yeast. This enabled the de novo discovery and characterization of known and potentially novel motifs involved in nucleosome positioning, and facilitated the successful design and experimental validation of synthetic sequences with altered nucleosome occupancy.

Methodological highlights:

PISA calculates contribution scores (using deepSHAP) for each input base i relative to each specific output base j, generating a 2D matrix (Pi→j) representing pairwise influence.
Introduces “squid plots” and PISA heatmaps as novel visualization methods for the 2D attribution matrix, revealing spatial patterns of motif influence and local biases.
Developed a method to generate a synthetic MNase-seq bias track by aligning and subtracting PISA heatmaps from models predicting 5' and 3' fragment ends, enabling the training of bias-corrected nucleosome prediction models within the BPReveal framework.

New tools, data, and resources:

BPReveal: A new Python package extending BPNet/ChromBPNet capabilities, implementing PISA for model interpretation and supporting training of various genomics models (including bias-corrected ones). Code available at https://github.com/mmtrebuchet/bpreveal.
Analysis code for the manuscript is available at https://github.com/zeitlingerlab/bpreveal-manuscript.

Figure 1 from McAnany 2025: PISA and its applications. (a) A sequence-to-profile model uses DNA sequence as input and learns to predict experimental readouts from genomic assays. (b) Previous interpretation tools have assigned contribution scores to the input bases based on properties of the entire output profile. (c) Our PISA approach assigns contribution scores for each output individually, resulting in a 2D matrix ℙ_i_→_j. (d) Squid plots show high PISA values as colored lines that connect an input base to an output position. In this example, the TCA motif has a strong effect, and the bulk of this effect is to the left of the motif’s position. (e) A heatmap of the same region used in (d) shows PISA values as a colored grid. While less immediately readable than the squid plots, heatmaps are useful when multiple effects overlap or when experimental biases are present at the diagonal. (f) PISA is implemented in the BPReveal package and has multiple applications further described in this study.

Fast protein structure searching using structure graph embeddings

Paper: Greener, J.G. and Jamali, K. "Fast protein structure searching using structure graph embeddings" Bioinformatics Advances, 2025. 10.1093/bioadv/vbaf042.

TLDR: This paper introduces Progres, a tool that uses graph neural networks to create compact vector embeddings for protein structures. This makes searching massive structure databases (like AlphaFold's) incredibly fast (think a tenth of a second per query on a CPU) while keeping accuracy close to methods like Dali and Foldseek. It's basically turning structure comparison into a super-quick vector lookup.

Summary: This study presents Progres, a graph neural network (GNN) based method designed for rapid protein structure comparison and database searching. The approach learns low-dimensional (128D) vector embeddings for protein domains by applying supervised contrastive learning, utilizing SCOPe domain classifications to guide the embedding process such that structurally similar domains are mapped closely in the embedding space. Progres aims to address the challenge of efficiently searching rapidly growing structural databases (e.g., AlphaFold DB) by converting structure comparison into a fast cosine similarity calculation between embeddings. The GNN model takes Cα coordinates and associated geometric features as input and is trained to be SE(3)-invariant. This framework enables searching the entire AlphaFold TED domain database in under a second per query on standard CPUs, offering a significant speed advantage over traditional structure alignment methods while maintaining competitive accuracy for detecting homologous structures, particularly at the fold level.

Methodological highlights:

Uses a graph neural network (GNN) operating on Cα atoms, incorporating distance, torsion angle, and sinusoidal positional encoding features to learn structure representations.
Employs supervised contrastive learning, using SCOPe domain classifications (family, superfamily, fold) to train the network to embed structurally similar domains closely together.
Generates fixed-size (128D) embeddings for protein domains, allowing for rapid structure searching via cosine similarity comparison, compatible with fast indexing libraries like FAISS.

New tools, data, and resources:

Progres: A Python package implementing the GNN embedding and search functionality. Available at https://github.com/greener-group/progres.
Progres web server: Provides an online interface for searching structure databases using the Progres method. Available at https://progres.mrc-lmb.cam.ac.uk.
Pre-trained model and pre-embedded databases (e.g., for SCOPe, AlphaFold TED): Available on Zenodo at https://zenodo.org/record/7782088.

Table 1 from Greener 2025: Comparison of ability to retrieve homologous proteins.

Review/Perspective: Intrinsically disordered regions as facilitators of the transcription factor target search

Paper: Jonas, F., Navon, Y. & Barkai, N. "Intrinsically disordered regions as facilitators of the transcription factor target search" Nature Reviews Genetics, 2025. https://doi.org/10.1038/s41576-025-00816-3 (read free: https://rdcu.be/eaMBK).

TLDR: Turns out the “floppy bits” on transcription factors (those intrinsically disordered regions, or IDRs) are super important for how TFs actually find their DNA targets in the messy nucleus. This paper reviews how these flexible IDRs, which lack a fixed 3D structure, help guide TFs, influencing binding specificity and search speed, possibly by interacting with chromatin or forming condensates.

Summary: This perspective article explores the increasingly recognized role of intrinsically disordered regions (IDRs) in eukaryotic transcription factor (TF) function, specifically focusing on how they facilitate the crucial process of finding specific DNA target sites within the genome. IDRs are protein segments that lack a stable three-dimensional structure, instead existing as dynamic ensembles of conformations. While structured DNA-binding domains (DBDs) confer sequence specificity, IDRs are emerging as key modulators that influence both the specificity and the dynamics (speed and efficiency) of the TF target search. The authors review evidence from yeast and multicellular eukaryotes indicating that IDRs shape TF binding preferences, direct TF dynamics on chromatin (potentially influencing diffusion and residence times), and contribute to target selection in ways distinct from DBDs. Deciphering the sequence-function relationship of IDRs is challenging due to their low sequence complexity and rapid evolution, requiring specialized computational and experimental approaches beyond traditional sequence alignment. The paper discusses potential mechanisms by which IDRs function, including facilitating TF co-localization, mediating incorporation into biomolecular condensates, and possibly engaging in direct, specific interactions with chromatin or DNA, highlighting IDRs as central players in the complex orchestration of gene regulation.

Fig. 4 — Figure 4 from Jonas 2025: Challenges with computational sequence analysis of IDRs. a, Schematic illustrating how predicting sequence–function relationships for folded domains often relies on alignment-based methods that are less applicable to IDRs. b, Example phylogeny of transcription factor (TF) DNA-binding domains (DBDs) and intrinsically disordered regions (IDRs) in which both components have conserved functions (for example, guiding TF binding to the same genomic targets); however, the DBD but not the IDR has conserved sequence, as is often observed. c, Possible computational approaches to analysing IDR sequences. Without alignment-based sequence similarity, current computational approaches focus on short linear motif (SLiM) detection and bulk features, such as amino acid composition, biophysical property prediction and sequence patterning. These features are used to compare IDRs and define functional elements, but their associated mechanisms and functions often remain unclear. PPIs, protein–protein interactions.

Other papers of note

Evaluation of strategies for evidence-driven genome annotation using long-read RNA-seq https://genome.cshlp.org/content/35/4/1053.abstract
Enhancing nanopore adaptive sampling for PromethION using readfish at scale https://genome.cshlp.org/content/35/4/877.abstract
Towards multimodal foundation models in molecular cell biology https://www.nature.com/articles/s41586-025-08710-y (read free: https://rdcu.be/eh4li)
Scaling unlocks broader generation and deeper functional understanding of proteins https://www.biorxiv.org/content/10.1101/2025.04.15.649055v1
DataMap: A Portable Application for Visualizing High-Dimensional Data https://arxiv.org/abs/2504.08875
Low overlap of transcription factor DNA binding and regulatory targets https://www.nature.com/articles/s41586-025-08916-0 (read free: https://rdcu.be/eicJC)
Efficient near telomere-to-telomere assembly of Nanopore Simplex reads (hifiasm) https://www.biorxiv.org/content/10.1101/2025.04.14.648685v1
Rewriting regulatory DNA to dissect and reprogram gene expression https://www.sciencedirect.com/science/article/pii/S0092867425003526
A unified analysis of atlas single-cell data https://genome.cshlp.org/content/early/2025/04/10/gr.279631.124.short

Paired Ends

Comments