Weekly Recap (Sep 2024, part 4)

Regulatory element mapping, phylogenetic tree analysis+viz, mapping cellular interactions with spatial transcriptomics, bioprospecting, transformers in single-cell omics, Nextflow nf-core/omicsgenetraitassociation

Sep 20, 2024

This week’s recap highlights a new nf-core workflow for multi-omics trait association studies, a new tool for linking genotype to phenotype (G2P) by directly sequencing alleles from CRISPR base editing experiments, the SplitsTree app for interactive analysis and visualization using phylogenetic trees and networks, mapping cellular interactions from spatially resolved transcriptomics data, a study of marine microbial diversity and bioprospecting of new antibiotics and plastic-eating bacteria, and a comprehensive review of transformers in single-cell omics.

Others that caught my attention include a new Python package for gene set enrichment of single-cell RNA-seq data, variational autoencoders for biophysical modeling of scRNA-seq, pangene analysis, genomic language models, frameworks for sharing clinical and genetic data, genotype inference from ATAC-seq data, detecting base modification with nanopore sequencing data, polygenic risk score portability across diverse populations, assembly and binning of a 20-year metagenomic time series, and a commentary on pursuing genetic diversity in human genetic studies.

Deep dive

CRISPR-CLEAR: Nucleotide-Resolution Mapping of Regulatory Elements via Allelic Readout of Tiled Base Editing

Paper: Becerra, et al. “CRISPR-CLEAR: Nucleotide-Resolution Mapping of Regulatory Elements via Allelic Readout of Tiled Base Editing.” bioRxiv, 2024. DOI:10.1101/2024.09.09.612085.

This paper caught my attention because I saw Luca Pinello’s keynote at the Bioconductor 2024 conference back in July. The Pinello Lab has developed other tools I’ve used in the past, including CRISPResso2 for summarizing gene editing experiments.

TL;DR: CRISPR-CLEAR is a high-resolution mapping tool that links genotype to phenotype by directly sequencing alleles from CRISPR base-edited cells, offering unprecedented nucleotide-level insights into regulatory element functions.

Summary: The CRISPR-CLEAR approach introduces a novel framework for dissecting regulatory elements at nucleotide resolution. By leveraging dense base editing and targeted sequencing, this method allows for direct genotype-phenotype linkage, overcoming limitations of traditional guide RNA-based readouts. The tool was applied to identify key regulatory nucleotides within a CD19 enhancer, a crucial element in B-cell leukemia. Base editing using cytosine and adenine base editors was followed by allele-level sequencing, identifying several critical nucleotides that affect CD19 expression. The study demonstrated that CRISPR-CLEAR can reveal specific nucleotides crucial for gene regulation, with potential applications in understanding gene regulation and enhancing targeted therapies like CAR-T.

Methodological highlights:

Allele-Based Readout: Uses direct sequencing of edited alleles to link nucleotide variants to phenotype, offering higher resolution than traditional guide RNA-based methods.
Bayesian Linear Regression (CRISPR-Millipede): Implements a statistical framework to quantify the effects of nucleotide substitutions on regulatory function.
Base Editing Tools: Utilizes ABE8e-SpRY and evoCDA base editors for dense mutagenesis of regulatory regions.
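
To make the Bayesian linear regression idea above concrete, here is a minimal sketch, not the CRISPR-Millipede implementation. It assumes each sequenced allele is encoded as a 0/1 vector of edited positions, the phenotype is a single score per allele, and the per-position effects get a zero-mean Gaussian prior. The data, the function names (solve, posterior_mean), and the variance settings are all hypothetical.

```python
"""Toy Bayesian linear regression: per-position edit effects from allele data.

Illustrative only; the alleles, scores, and variances below are made up.
"""


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def posterior_mean(X, y, noise_var=1.0, prior_var=1.0):
    """Posterior mean of effects under a zero-mean Gaussian prior.

    Solves (X'X + (noise_var / prior_var) * I) beta = X'y.
    """
    p = len(X[0])
    lam = noise_var / prior_var
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical alleles: rows are alleles, columns are positions (1 = edited).
alleles = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
# Hypothetical phenotype score per allele (e.g. a log enrichment).
scores = [0.0, -0.1, -2.0, 0.1, -2.1, -1.9]

effects = posterior_mean(alleles, scores, noise_var=0.1, prior_var=1.0)
for position, effect in enumerate(effects):
    print(f"position {position}: estimated effect {effect:+.3f}")
```

In this toy data only the middle position carries a large negative effect, and the prior shrinks all estimates slightly toward zero. Because alleles with several edits enter the model as a whole, the effect of each position is estimated jointly rather than one guide at a time.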

New tools, data, and resources:

Code availability: The manuscript references source code for two tools. Both are licensed under the AGPL.
- CRISPR-Millipede: GitHub, PyPI.
- CRISPR-Correct: GitHub, PyPI.
Data availability: All data used in this study are on Zenodo at https://zenodo.org/doi/10.5281/zenodo.13737736.

Figure 1a from Becerra et al 2024: A base-editor tiling screen with allele-based readout, comparing CRISPR-CLEAR with standard sgRNA enrichment sequencing.

The SplitsTree App: interactive analysis and visualization using phylogenetic trees and networks

Paper: Huson, Daniel H., and David Bryant. “The SplitsTree App: interactive analysis and visualization using phylogenetic trees and networks.” Nature Methods, 2024. DOI:10.1038/s41592-024-02406-3 (read free: https://rdcu.be/dTsWr).

SplitsTree goes back to Daniel Huson’s original 1998 SplitsTree paper, and the later 2005 paper describing SplitsTree4. This new version of SplitsTree (SplitsTree6) is completely rewritten and now open source.

TL;DR: The SplitsTree App is an interactive software tool for analyzing phylogenetic data through both trees and networks. It incorporates over 100 algorithms for constructing distances, phylogenetic trees, split networks, haplotype networks, and rooted phylogenetic networks, offering improved computational efficiency and comprehensive analysis features for reticulate evolutionary events.

Summary: This paper introduces the SplitsTree App, designed to extend and improve upon earlier phylogenetic tools like SplitsTree4, Dendroscope, and PopArt. The application supports the analysis of phylogenetic trees and networks, enabling researchers to visualize and explore evolutionary data. It offers over 100 algorithms to compute various phylogenetic and network models, with a focus on reticulate evolutionary events such as horizontal gene transfer and hybridization. A faster version of neighbor-net and an implementation of median-joining networks are among the many advanced features. The app supports multiple input formats (sequences, alignments, distances, etc.), and helps users evaluate data quality, detect evolutionary conflicts, and visualize networks interactively. With an emphasis on reproducibility and workflow-driven design, SplitsTree allows users to save and restore analysis steps, while features like QR code embedding ensure easier data sharing and reuse.
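
Distance-based methods such as neighbor-net start from a pairwise distance matrix. As a rough illustration of that first step only, here is a sketch that computes uncorrected p-distances (the fraction of differing sites) from a small made-up alignment. This is not SplitsTree code; the taxon names, sequences, and function names are hypothetical.

```python
"""Toy pairwise p-distance matrix from an alignment (illustrative only)."""


def p_distance(a, b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(alignment):
    """Nested dict of pairwise p-distances for every pair of taxa."""
    return {s: {t: p_distance(alignment[s], alignment[t]) for t in alignment}
            for s in alignment}


# Hypothetical aligned sequences for four taxa.
alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACGA",
    "D": "TCGAACGA",
}

distances = distance_matrix(alignment)
for taxon, row in distances.items():
    print(taxon, " ".join(f"{row[other]:.3f}" for other in alignment))
```

A matrix like this is the kind of input a tree or split-network method would then work from.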

Methodological highlights:

100+ Algorithms: Supports comprehensive analyses, including distance matrices, phylogenetic trees, tanglegrams, and various network models. The supplemental data has a full manual with all these features detailed, but the documentation on GitHub is probably more up to date.

Workflow-Driven Approach: Ensures reproducibility by saving and restoring all intermediate steps, facilitating data sharing through QR code embedding in outputs.

SplitsTree Software: Available at https://github.com/husonlab/splitstree6, supporting Linux, MacOS, and Windows under the GPLv3 license.

Figure 1 from Huson & Bryant 2024: SplitsTree interface. Each dataset is represented by a single window and different analyses of the dataset appear as different tabs in that window.

Mapping cellular interactions from spatially resolved transcriptomics data

Paper: Zhu, James, et al. “Mapping cellular interactions from spatially resolved transcriptomics data.” Nature Methods, 2024. DOI:10.1038/s41592-024-02408-1 (read free: https://rdcu.be/dTwjd).

TL;DR: This paper introduces Spacia, a tool designed to detect cell–cell communications (CCCs) using spatially resolved transcriptomics (SRT) data, overcoming limitations of existing methods by considering spatial closeness and complex multi-sender-to-one-receiver paradigms.

Summary: The study presents Spacia, a machine-learning tool that enables the mapping of cellular interactions at single-cell resolution using SRT data. Spacia addresses limitations in current CCC inference tools that rely solely on ligand-receptor interactions, which often result in high false positive rates and loss of single-cell resolution. Spacia innovates by considering spatial proximity and using a multiple-instance learning (MIL) framework, allowing it to accurately model multi-sender-to-one-receiver communications. The tool was validated on both simulated data and real datasets from technologies like MERSCOPE, CosMx, and Xenium. Results from prostate cancer data demonstrated Spacia’s ability to detect context-specific cellular interactions, such as tumor microenvironment-induced epithelial–mesenchymal transitions (EMT). The tool significantly advances quantitative analysis of CCCs by incorporating spatial and transcriptional data, providing new insights into cellular dynamics in tissues.

Methodological highlights:

Multiple-Instance Learning (MIL) Framework: Models the interaction of multiple sending cells impacting one receiver cell, addressing spatial proximity in cell-cell communications.
High Sensitivity and Specificity: Demonstrates high accuracy in detecting interacting cell pairs with an area under the ROC curve exceeding 0.95 in simulations.
Application to Real Datasets: Validated across three single-cell resolution SRT technologies, providing detailed CCC analysis in prostate cancer and other complex tissue environments.
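
To illustrate the multi-sender-to-one-receiver idea, here is a generic multiple-instance learning sketch. It is not Spacia's actual model. It assumes each receiver cell has a "bag" of nearby sender cells, each sender gets a logistic interaction probability from its distance and a signal value, and the bag is combined with a noisy-OR. The weights, the distance cutoff, and the function names are hypothetical.

```python
"""Toy multiple-instance aggregation for one receiver cell (illustrative only)."""
import math


def sender_probability(distance, signal, w_dist=-0.05, w_signal=1.5, bias=-1.0):
    """Logistic probability that a single sender affects the receiver."""
    z = bias + w_dist * distance + w_signal * signal
    return 1.0 / (1.0 + math.exp(-z))


def bag_probability(senders, max_distance=50.0):
    """Noisy-OR over nearby senders: P(at least one sender acts on the receiver).

    `senders` is a list of (distance, signal) pairs; senders farther away
    than `max_distance` are excluded from the bag.
    """
    p_none = 1.0
    for distance, signal in senders:
        if distance <= max_distance:
            p_none *= 1.0 - sender_probability(distance, signal)
    return 1.0 - p_none


# Hypothetical receivers: one with close senders, one with only distant ones.
senders_close = [(5.0, 2.0), (10.0, 1.0), (200.0, 3.0)]
senders_far = [(200.0, 3.0), (300.0, 2.0)]

print(f"receiver with close senders: {bag_probability(senders_close):.3f}")
print(f"receiver with distant senders: {bag_probability(senders_far):.3f}")
```

The point of the bag formulation is that the label belongs to the receiver, while the per-sender probabilities indicate which neighbors are likely responsible.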

New tools, data, and resources:

Spacia Software: The Spacia software is available under a BSD license at https://github.com/yunguan-wang/Spacia.
Data availability: The paper has a detailed list of all available data.
- The MERSCOPE datasets were downloaded from https://vizgen.com/data-release-program/ (the ‘MERSCOPE FFPE Human Immuno-oncology’ datasets).
- The breast cancer Xenium dataset was downloaded from www.10xgenomics.com/resources/datasets (the ‘xenium-ffpe-human-breast-with-custom-add-on-panel-1-standard’ dataset).
- The TCGA data were downloaded from https://gdac.broadinstitute.org/ (cohorts: BRCA, COAD, LIHC, LUSC, OV, PRAD, SKCM and UCEC).
- The scRNA-seq datasets by Zhang et al. and Sade-Feldman et al. were accessed via the Gene Expression Omnibus under accession numbers GSE169246 and GSE120575, respectively.
- The scRNA-seq datasets by Bassez et al. were accessed from biokey.lambrechtslab.org.
- The CosMx datasets are available from https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/human-liver-rna-ffpe-dataset/.
- The prostate cancer scRNA-seq data that the authors generated are archived in Zenodo at https://doi.org/10.5281/zenodo.8270765.
- The breast cancer GeoMx data that the authors generated are archived at https://github.com/yunguan-wang/Spacia/tree/main/geomx/. Basic clinical characteristics of the individuals with prostate cancer and those with breast cancer, from whom the authors generated the scRNA-seq and GeoMx data, respectively, are provided in Supplementary Table 4.

Global marine microbial diversity and its potential in bioprospecting

Paper: Chen, Jianwei, et al. “Global marine microbial diversity and its potential in bioprospecting.” Nature, 2024. DOI:10.1038/s41586-024-07891-2.

There are so many amazing finds in this paper. 43k genomes recovered from metagenomes, a new CRISPR-Cas9 system, 10 new antimicrobial peptides, and 3 new enzymes that degrade polyethylene terephthalate (PET) plastic, experimentally validated. Read on.

TL;DR: This paper presents a comprehensive study of marine microbial diversity, recovering over 43,000 bacterial and archaeal genomes. The study reveals novel CRISPR–Cas9 systems, antimicrobial peptides, and plastic-degrading enzymes, highlighting their potential for biotechnological and biomedical applications.

Summary: This research expands the understanding of marine microbial diversity by analyzing 43,191 genomes from marine metagenomes. The study constructed a global ocean microbiome catalogue (GOMC) containing diverse microbial species across 138 phyla. Using in silico bioprospecting, the authors identified a novel CRISPR–Cas9 system, several antimicrobial peptides (AMPs), and enzymes that degrade polyethylene terephthalate (PET), which were experimentally validated. This work underscores the importance of large-scale genomic studies for bioprospecting and provides resources that could contribute to environmental and medical applications. The discovery of highly active plastic-degrading enzymes and novel CRISPR systems opens new avenues for biotechnology, from antibiotic development to plastic waste degradation.

Methodological highlights:

Genome-Resolved Metagenomics: Constructed a comprehensive genome catalogue using metagenomic assemblies from marine samples, spanning various ecosystems.
Novel CRISPR–Cas Systems: Identified and validated a new CRISPR–Cas9 system with in vitro activity, expanding the tools available for genome editing.
Plastic-Degrading Enzymes: Discovered enzymes with high efficiency in PET degradation, particularly in halophilic environments, and demonstrated their activity under experimental conditions.

New tools, data, and resources:

Data availability: All 43,191 genomes recovered in this study, the GOMC database containing 24,195 unique genomes, and other supporting data can be interactively accessed at the China National GeneBank DataBase (CNGBdb) (https://db.cngb.org/maya/datasets/MDB0000002). The previously available public marine bacterial and archaeal genomes in NCBI have also been collected and backed up in the China National GeneBank Sequence Archive (CNSA) under the accession DATAmic13.
Code: The code used throughout the manuscript is available on GitHub at https://github.com/BGI-Qingdao/GOMC.

Review: Transformers in single-cell omics: a review and new perspectives

Paper: Szałata, Artur, et al. “Transformers in single-cell omics: a review and new perspectives.” Nature Methods, 2024. DOI:10.1038/s41592-024-02353-z (read free: https://rdcu.be/dTANf).

I’ve been reading through Vicki Boykis’s What are embeddings? book and watching 3Blue1Brown’s YouTube videos on transformers and related topics, trying not to fall too far behind on how the tech underpinning modern AI works. This paper provided a great review of applications in single-cell omics. Make sure to check out the paper’s GitHub repo.

TL;DR: This paper reviews the application of transformer models to single-cell omics, highlighting their potential for analyzing large-scale heterogeneous datasets. It discusses various transformer-based models, their strengths in gene-level and cell-level tasks, and the challenges of adapting transformers to nonsequential single-cell data.

Summary: The authors review how transformers, originally developed for natural language processing, are now being adapted for single-cell omics. These models are adept at handling large and heterogeneous datasets, offering new opportunities for gene function prediction, cell type annotation, and gene regulatory network inference. The paper highlights the three major approaches to adapt single-cell data to transformers: gene ordering, value binning, and value projection. Despite their promise, transformers face challenges in dealing with the nonsequential nature of single-cell data, which lacks the inherent structure found in other domains like language or protein sequences. The paper suggests that future research should focus on developing better representations of single-cell data and scaling models to capture the complexity of multiomics data.
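
The three input representations named above (gene ordering, value binning, value projection) are easier to picture with a toy cell. The sketch below is my own simplification, not code from the review or from any specific model. The expression values, the two-dimensional embeddings, the equal-width binning rule, and the function names are all hypothetical.

```python
"""Toy versions of three ways to feed one cell's expression to a transformer."""
import math


def gene_ordering(expr):
    """Sequence of gene IDs sorted by decreasing expression (zeros dropped)."""
    expressed = [(gene, value) for gene, value in expr.items() if value > 0]
    return [gene for gene, _ in sorted(expressed, key=lambda gv: (-gv[1], gv[0]))]


def value_binning(expr, n_bins=3):
    """Discretize values into equal-width bins 1..n_bins; zeros go to bin 0."""
    top = max(expr.values(), default=0.0)
    if top <= 0:
        return {gene: 0 for gene in expr}
    return {gene: 0 if value <= 0 else math.ceil(n_bins * value / top)
            for gene, value in expr.items()}


def value_projection(expr, gene_embedding, weight):
    """Keep values continuous: gene embedding plus value times a weight vector."""
    return {gene: [e + value * w for e, w in zip(gene_embedding[gene], weight)]
            for gene, value in expr.items()}


# Hypothetical normalized expression for one cell.
cell = {"CD19": 9.0, "MS4A1": 3.0, "GAPDH": 6.0, "CD3E": 0.0}
# Hypothetical 2-dimensional gene embeddings and projection weights.
embeddings = {"CD19": [1.0, 0.0], "MS4A1": [0.0, 1.0],
              "GAPDH": [1.0, 1.0], "CD3E": [0.0, 0.0]}
weights = [0.5, -0.5]

ordered = gene_ordering(cell)
binned = value_binning(cell)
projected = value_projection(cell, embeddings, weights)

print("gene ordering:", ordered)
print("value binning:", binned)
print("value projection:", projected)
```

Ordering discards the magnitudes and keeps only ranks, binning keeps coarse magnitudes as discrete tokens, and projection keeps the continuous values at the cost of needing a learned mapping.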

Online supplement: The paper’s GitHub repo has a curated list of single-cell transformers and their evaluation results with links to papers, code, pretraining data, and lists out omic modalities and architecture: https://github.com/theislab/single-cell-transformer-papers.

Table 1 from Szałata et al 2024: Selected transformers for single-cell omics. See the full list at the paper’s GitHub repo.

A methodology for gene-level omics-WAS integration identifies genes influencing traits associated with cardiovascular risks [nf-core/omicsgenetraitassociation]

Paper: Acharya, Sandeep et al. “A methodology for gene-level omics-WAS integration identifies genes influencing traits associated with cardiovascular risks: the Long Life Family Study.” Human Genetics, 2024. DOI:10.1007/s00439-024-02701-1.

This paper describes a combined GWAS, rare-variant, and transcriptome-wide association study of cardiovascular traits. It’s a nice integrative analysis in its own right, but I’m including it here because it introduces the new nf-core/omicsgenetraitassociation workflow.

Workflow diagram for the new nf-core/omicsgenetraitassociation workflow.

TL;DR: This paper presents an integrative methodology for linking omics data and genome-wide association studies (GWAS) to identify genes that influence cardiovascular traits using the Long Life Family Study (LLFS). Notably, the pipeline is implemented in the new nf-core/omicsgenetraitassociation workflow.

Summary: The authors of this study employed an innovative omics-wide association study (omics-WAS) to investigate the genetic mechanisms underlying cardiovascular risk traits in the LLFS cohort. By aggregating gene-level statistics from GWAS, rare-variant analysis (RVA), and transcriptome-wide association studies (TWAS), they identified 64 genes significantly associated with cardiovascular traits, with 29 of these being replicated in the Framingham Heart Study. The findings are significant as they highlight genes and biological processes, such as sterol transport and immune response regulation, that may influence cardiovascular health. The integration of omics data and network module analysis allowed for deeper insights into potential causal genes and pathways. The pipeline is implemented in Nextflow and is being developed for nf-core.

Methodological highlights:

Multi-omics integration: Combines GWAS, TWAS, and RVA using a correlated meta-analysis approach to strengthen gene-trait association findings.
Module enrichment analysis: Identifies biological processes and protein interaction networks enriched in trait-associated genes.
nf-core omicsgenetraitassociation pipeline: A Nextflow/nf-core workflow is available at https://nf-co.re/omicsgenetraitassociation. Code for this workflow is available at https://github.com/nf-core/omicsgenetraitassociation/.
Data availability: The summary results from GWAS, TWAS, RVA, and CMA on all 11 traits for the LLFS cohort are available in the paper’s supplement.
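
To show why a meta-analysis of GWAS, TWAS, and RVA results on the same samples needs to account for correlation, here is a simplified sketch using a correlation-adjusted Stouffer combination of z-scores. This is an illustration of the general idea, not the correlated meta-analysis procedure used in the paper or the nf-core pipeline. The z-scores, the correlation matrix, and the function names are hypothetical.

```python
"""Toy correlation-adjusted combination of gene-level z-scores."""
import math


def combined_z(z_scores, corr):
    """Sum of z-scores divided by its standard deviation under the null.

    For unit-variance z-scores with correlation matrix `corr`, the variance
    of the sum is the sum of all entries of `corr`.
    """
    variance = sum(sum(row) for row in corr)
    return sum(z_scores) / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical gene-level z-scores from GWAS, TWAS, and RVA.
z_scores = [2.5, 2.0, 1.5]

independent = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]]
correlated = [[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.5, 1.0]]

z_indep = combined_z(z_scores, independent)
z_corr = combined_z(z_scores, correlated)

print(f"assuming independence: z = {z_indep:.3f}, p = {two_sided_p(z_indep):.2e}")
print(f"adjusting for correlation: z = {z_corr:.3f}, p = {two_sided_p(z_corr):.2e}")
```

Treating positively correlated tests as independent overstates the combined evidence, which is why the adjusted z-score here is smaller.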

Other papers of note

pyVIPER: A fast and scalable Python package for rank-based enrichment analysis of single-cell RNASeq data https://www.biorxiv.org/content/10.1101/2024.08.25.609585v1.full
Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data https://www.nature.com/articles/s41592-024-02365-9 (free: https://rdcu.be/dSsaH)
Pandagma: A tool for identifying pan-gene sets and gene families at desired evolutionary depths and accommodating whole genome duplications https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae526/7740678
Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae529/7745814
Variant graph craft (VGC): a comprehensive tool for analyzing genetic variation and identifying disease-causing variants https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05875-7
A framework for sharing of clinical and genetic data for precision medicine applications https://www.nature.com/articles/s41591-024-03239-5
Genotype inference from aggregated chromatin accessibility data reveals genetic regulatory mechanisms https://www.biorxiv.org/content/10.1101/2024.09.04.610850v1
Characterization and bioinformatic filtering of ambient gRNAs in single-cell CRISPR screens using CLEANSER https://www.biorxiv.org/content/10.1101/2024.09.04.611293v1
Adapting nanopore sequencing basecalling models for modification detection via incremental learning and anomaly detection https://www.nature.com/articles/s41467-024-51639-5
Polygenic risk score portability for common diseases across genetically diverse populations https://humgenomics.biomedcentral.com/articles/10.1186/s40246-024-00664-y
Coassembly and binning of a twenty-year metagenomic time-series from Lake Mendota https://www.nature.com/articles/s41597-024-03826-8
Commentary: Defining and pursuing diversity in human genetic studies https://www.nature.com/articles/s41588-024-01903-7 (read free: https://rdcu.be/dTqdm)
