Weekly Recap (Dec 2024, part 2)

Nextflow to web apps, variant analysis with DRAGEN, synthetic cis-regulatory elements, Nextflow for kinase ID+characterization, language model for single cell analysis, scalable protein design, ...

Dec 13, 2024

This week’s recap highlights a new way to turn Nextflow pipelines into web apps, DRAGEN for fast and accurate variant calling, machine-guided design of cell-type-targeting cis-regulatory elements, a Nextflow pipeline for identifying and classifying protein kinases, a new language model for single cell perturbations that integrates knowledge from literature, GeneCards, etc., and a new method for scalable protein design in a relaxed sequence space.

Others that caught my attention include commentary on improving bioinformatics software quality through teamwork, targeted nanopore sequencing for mitochondrial variant analysis, a review on plant conservation in the era of genome engineering, a de novo assembly tool for complex plant organelle genomes, learning to call copy number variants on low coverage ancient genomes, a near telomere-to-telomere phased reference assembly for the male mountain gorilla, a method for optimized germline and somatic variant detection across genome builds, a searchable large-scale web repository for bacterial genomes, and an integer programming framework for pangenome-based genome inference.

Audio generated with NotebookLM. (The hosts were very excited about this issue!)

Deep dive

Cloudgene 3: Transforming Nextflow Pipelines into Powerful Web Services

Paper: Lukas Forer and Sebastian Schönherr. Cloudgene 3: Transforming Nextflow Pipelines into Powerful Web Services. bioRxiv, 2024. DOI: 10.1101/2024.10.27.620456.

I got to meet both Lukas and Sebastian in person at the Nextflow Summit. Lukas gave a talk on nf-test, while Sebastian gave a talk on the Michigan Imputation Server (MIS). MIS is implemented in Nextflow and driven using Cloudgene, and has helped over 12,000 researchers worldwide impute over 100 million samples. This paper describes Cloudgene for turning a Nextflow pipeline into a web service.

TL;DR: Cloudgene 3 provides a user-friendly platform to convert Nextflow pipelines into scalable web services, allowing scientists to deploy and run complex bioinformatics workflows without requiring web development expertise.

Summary: Cloudgene 3 addresses the challenge of deploying Nextflow pipelines as scalable web services, allowing researchers to leverage computational workflows without the need for technical setup or coding. The platform simplifies the transformation of Nextflow pipelines into “Cloudgene apps,” which include user-friendly interfaces and allow for seamless dataset management, job monitoring, and data security. By supporting features like workflow chaining and dataset integration, Cloudgene 3 enables collaborative and flexible use of pipelines across various scientific domains, from genomics to proteomics. This tool expands accessibility to complex analyses, facilitating data sharing and enhancing reproducibility, and has already been implemented in large-scale services like the Michigan Imputation Server. Its open accessibility and adaptable deployment model (cloud or local infrastructure) highlight its utility for bioinformatics workflows.

Methodological highlights:

Converts Nextflow pipelines into web services with a few simple steps, creating portable “apps” that include metadata, input/output parameters, and multi-step workflows.
Integrates real-time status updates and error handling for Nextflow tasks, leveraging a unique secret URL for each task to monitor progress.
Supports cloud platforms and local installations, providing compatibility with engines like Slurm and AWS Batch and storage options like AWS S3.
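
To make the secret-URL idea concrete, here is a minimal sketch of how a server could hand each job its own unguessable callback address and accept status updates only through it. This is not Cloudgene's actual implementation: the `JobRegistry` class, its method names, and the `https://example.org/api/jobs` base URL are all hypothetical.

```python
import secrets


class JobRegistry:
    """Toy in-memory store mapping secret tokens to job status events.

    Hypothetical illustration only; not Cloudgene's real data model.
    """

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._jobs = {}  # token -> {"job_id": str, "events": list}

    def create_job(self, job_id):
        """Register a job and return its unique, unguessable callback URL."""
        token = secrets.token_urlsafe(32)
        self._jobs[token] = {"job_id": job_id, "events": []}
        return f"{self.base_url}/{token}"

    def record_event(self, callback_url, process, status):
        """Accept a status update only if the URL carries a known token."""
        token = callback_url.rsplit("/", 1)[-1]
        if token not in self._jobs:
            raise KeyError("unknown job URL")
        self._jobs[token]["events"].append((process, status))

    def events(self, callback_url):
        """Return a copy of the events recorded for this job's URL."""
        token = callback_url.rsplit("/", 1)[-1]
        return list(self._jobs[token]["events"])


registry = JobRegistry("https://example.org/api/jobs")
url_a = registry.create_job("job-a")
url_b = registry.create_job("job-b")

# A pipeline run would post updates like these to its own URL.
registry.record_event(url_a, "FASTQC", "COMPLETED")
registry.record_event(url_a, "ALIGN", "RUNNING")
print(registry.events(url_a))
```

Because the token is the only credential, a workflow run can report progress without holding any user login, and one job's URL reveals nothing about another's.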

New tools, data, and resources:

Cloudgene 3 platform: Free platform available at cloudgene.io.
Cloudgene 3 source code: https://github.com/genepi/cloudgene3.

Figure 1 from Forer 2024: Integration of Nextflow and its communication within Cloudgene 3.

Example of a Cloudgene web app built from the nf-core/fetchngs pipeline.

Comprehensive genome analysis and variant detection at scale using DRAGEN

Paper: Behera, S., et al. Comprehensive genome analysis and variant detection at scale using DRAGEN. Nature Biotechnology, 2024. DOI: 10.1038/s41587-024-02382-1.

DRAGEN was a godsend in a previous job. I needed a turnkey variant calling solution that was fast. I bought an on-prem DRAGEN FPGA server, which was capable of taking you from FASTQ files to VCF in ~30 minutes for a 30X human whole genome. Illumina has previously published white papers on DRAGEN’s speed and accuracy. The publication in Nature Biotechnology engendered some interesting discussion online. On one hand, the paper was a pleasure to read, and the benchmarks are compelling and well done. On the other, the method isn’t available to explore, reproduce, understand in detail, or build upon. Which raises the question — should this have been a peer-reviewed publication in the scientific record? Or should this just have been another white paper? At some point “papers” hawking some new and improved closed source method are thinly veiled advertisements stamped with the approval of peer review. I think there should be some place in the scientific literature for papers like this describing a closed-source method, but where benchmarks are independently evaluated by a team of peer reviewers. I just don’t know what that looks like in the current landscape of peer reviewed papers versus a vendor’s white paper.

TL;DR: DRAGEN is a high-speed, highly accurate genomic analysis platform for variant detection, leveraging hardware acceleration, pangenome references, and machine learning. It outperforms traditional tools across variant types (SNVs, indels, SVs, CNVs, STRs) and is designed for large-scale, clinical genomics applications.

Summary: This study presents DRAGEN, a platform that uses accelerated hardware and sophisticated algorithms to enable comprehensive variant detection at unprecedented speed and accuracy. By integrating pangenome references and optimizing for all major variant classes, DRAGEN achieves high concordance in identifying complex and diverse genomic variants, even in challenging regions. Benchmarking across 3,202 genomes from the 1000 Genomes Project highlights DRAGEN’s scalability and its advantages over traditional methods like GATK and DeepVariant, especially for clinically relevant genes. The platform’s robust performance across SNVs, SVs, CNVs, and STRs allows for large-cohort analyses critical for population-scale genomics and clinical diagnostics, facilitating variant discovery in diseases with both common and rare genetic underpinnings.

Methodological highlights:

Uses pangenome references to enhance alignment accuracy and variant detection across diverse populations.
Optimized for rapid, parallel processing of SNVs, indels, CNVs, and STRs with an average processing time of ~30 minutes per genome.
Employs machine learning-based filtering to reduce false positives and improve accuracy in variant calling.
Integration of ExpansionHunter for STR analysis and specialized callers for pharmacogenomic variants (e.g., CYP2D6, SMN) ensures reliable detection in medically significant genes.
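
As a back-of-the-envelope check on what ~30 minutes per genome means at cohort scale, the arithmetic below multiplies it out for the 3,202 genomes mentioned in the summary. The assumptions are mine, not the paper's: genomes run strictly one after another on each server, there is no queueing or I/O overhead, and the server counts are arbitrary.

```python
MINUTES_PER_GENOME = 30   # approximate per-genome figure quoted above
N_GENOMES = 3202          # 1000 Genomes Project cohort size from the summary


def cohort_days(n_genomes, minutes_per_genome, n_servers=1):
    """Wall-clock days for a cohort, assuming perfect parallelism across servers."""
    total_minutes = n_genomes * minutes_per_genome / n_servers
    return total_minutes / 60 / 24


for servers in (1, 4, 16):
    days = cohort_days(N_GENOMES, MINUTES_PER_GENOME, servers)
    print(f"{servers:>2} server(s): {days:.1f} days")
```

On a single server this works out to roughly 67 days, which is why per-genome speed matters so much for population-scale projects.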

Figure 1 from Behera 2024: Overview of the DRAGEN variant calling pipeline. a–g, DRAGEN improves variant identification from a single base pair to multiple megabase pairs of alleles. This is achieved by implementing multiple optimized concepts. a, Mapping uses a pangenome reference including 64 haplotypes. b, SV calling is substantially improved over local assemblies based on breakpoint graphs; Chr, chromosome; DEL, deletion; DUP, duplication; INS, insertion; INV, inversion; BND, breakend (or breakpoint). c, SNV calling is improved using multiple strategies, including machine learning-based scoring and filtering. d, CNV calling uses the multigenome mapping and the SV calling information to make informed decisions; CN, copy number. e, An additional nine tools targeting specific difficult regions of the genome are included, four of which have not been previously reported; Hap, haplotype; Prop., proportion. f, STR calling is integrated based on ExpansionHunter. g, A gVCF genotyper implementation to provide a population-level fully genotyped VCF file; msVCF, multisample VCF.

Machine-guided design of cell-type-targeting cis-regulatory elements

Paper: Gosai, S. J., et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature, 2024. DOI: 10.1038/s41586-024-08070-z.

TL;DR: This paper introduces a platform for designing synthetic cis-regulatory elements (CREs) with programmed cell-type specificity using a deep-learning-based model called Malinois, combined with a computational design tool, CODA, and massively parallel reporter assays (MPRAs) for validation.

Summary: This study presents a framework for designing synthetic CREs that drive gene expression specifically in desired cell types. Using Malinois, a deep convolutional neural network trained on MPRA data from human cells, the researchers predict CRE activity and design synthetic elements targeting specific cell lines. The CODA (Computational Optimization of DNA Activity) platform then iteratively refines these designs to achieve high specificity, which is validated in vitro across multiple cell types and in vivo in mice and zebrafish. By outperforming natural CREs in specificity and robustness, these synthetic elements could significantly enhance targeted gene therapy approaches, especially by providing tools for precise gene expression control in therapeutic and research applications. The framework expands our capacity to engineer regulatory DNA for complex tissue-specific requirements, advancing possibilities for both biomedical research and gene therapy.

Methodological highlights:

Malinois CNN model predicts cell-type-specific CRE activity directly from DNA sequences, validated with MPRA-based data in K562, HepG2, and SK-N-SH cells.
CODA optimization platform iteratively adjusts CRE sequences to increase cell-type specificity, employing algorithms such as Fast SeqProp for efficient sequence design.
High-throughput MPRA validates the activity of 77,157 synthetic and natural CRE sequences across cell types, showing superior specificity in synthetic CREs.

New tools, data, and resources:

Code availability: https://github.com/sjgosai/boda2 (yes, this is the CODA repo, which is named “boda2” for “legacy reasons”).
Data availability: All the data used in the study is described in the data availability section of the paper.

Figure 2a from Gosai 2024: CODA effectively designs cell-type-specific cis regulatory elements. CODA designs synthetic elements by iteratively updating sequences to improve predicted function. Malinois predicts CRE activity and an objective function directs sequence updates. After a stopping criteria is met, candidates are nominated for experimental validation.
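
The predict, score, update loop in that caption can be sketched in a few lines. This is a toy under loud assumptions: `toy_predict` is a fake motif counter standing in for Malinois, the motifs are invented and carry no biological meaning, the objective (target activity minus the strongest off-target activity) is my guess at a reasonable specificity score, and the optimizer is plain random-mutation hill climbing rather than Fast SeqProp.

```python
import random

CELL_TYPES = ("K562", "HepG2", "SK-N-SH")
# Invented motifs standing in for a trained model; NOT real biology.
TOY_MOTIFS = {"K562": "GATA", "HepG2": "TGCA", "SK-N-SH": "CCGG"}


def toy_predict(seq):
    """Stand-in for Malinois: fake activity per cell type (motif counts)."""
    return {cell: float(seq.count(TOY_MOTIFS[cell])) for cell in CELL_TYPES}


def specificity(seq, target):
    """Assumed objective: target activity minus the strongest off-target activity."""
    activity = toy_predict(seq)
    off_target = max(v for cell, v in activity.items() if cell != target)
    return activity[target] - off_target


def design(target, length=40, max_iters=2000, seed=0):
    """Hill climbing: propose one-base mutations, keep those that do not hurt."""
    rng = random.Random(seed)
    seq = [rng.choice("ACGT") for _ in range(length)]
    best = specificity("".join(seq), target)
    for _ in range(max_iters):  # stopping criterion: fixed iteration budget
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice("ACGT")
        score = specificity("".join(seq), target)
        if score >= best:
            best = score      # accept the mutation
        else:
            seq[pos] = old    # reject and revert
    return "".join(seq), best


candidate, score = design("HepG2")
print(candidate, score)
```

In the real pipeline, candidates nominated this way go on to MPRA validation; here the score only reflects the toy predictor.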

KiNext: a portable and scalable workflow for the identification and classification of protein kinases

Paper: Hellec, E., et al. KiNext: A Portable and Scalable Workflow for the Identification and Classification of Protein Kinases. BMC Bioinformatics, 2024. DOI: 10.1186/s12859-024-05953-w.

TL;DR: KiNext is a Nextflow-based pipeline for identifying and classifying protein kinases (kinome) from annotated genomes, enabling reproducible analysis and classification of kinase families across species.

Summary: Protein kinases are crucial for cellular signaling and adaptation, and identifying the full kinome of an organism can reveal insights into its physiological and adaptive capabilities. KiNext automates this process, applying Hidden Markov Models (HMMs) to detect both eukaryotic and atypical protein kinases from genomic data and classifying them into known kinase families. Validated against two model species, Crassostrea gigas and Ostreococcus tauri, KiNext identified previously unclassified kinases and achieved enhanced classification accuracy compared to earlier methods. The tool is particularly valuable for large-scale genome projects, such as the Earth BioGenome Project, as it ensures reproducible kinome analysis in alignment with FAIR data principles. By allowing user-provided HMMs, it can be adapted for diverse taxa, making it versatile for comparative kinome studies.

Methodological highlights:

Uses Nextflow and Singularity containers for reproducible, scalable analysis of kinomes across computing environments.
Combines HMM models for specific kinase groups with automatic phylogenetic analysis of ePKs and aPKs using IQTree, facilitating precise classification.
Outputs include detailed summary tables of identified kinases, with options for further validation using structural prediction tools like AlphaFold and Foldseek.
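
To illustrate the HMM-based classification step, here is a small sketch that assigns each protein to the kinase group whose profile scores best. It assumes a HMMER-style `--tblout` table (the source does not say which HMM software KiNext runs), the sample rows are made up and truncated to the first seven columns, and the E-value cutoff is arbitrary. The phylogenetic step with IQTree is not covered.

```python
# Made-up, truncated rows in the style of a HMMER --tblout table:
# target (protein), accession, query (HMM profile), accession, E-value, score, bias
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias
prot1  -  TK    -  1.2e-80  265.3  0.1
prot1  -  AGC   -  3.0e-20   70.1  0.0
prot2  -  CAMK  -  4.5e-60  200.7  0.2
prot3  -  AGC   -  0.5        8.0  0.0
"""


def best_family_per_protein(tblout_text, max_evalue=1e-5):
    """Return {protein: family} using the highest-scoring significant hit."""
    best = {}  # protein -> (family, score)
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        protein, family = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue > max_evalue:
            continue  # hit is not significant under the assumed cutoff
        if protein not in best or score > best[protein][1]:
            best[protein] = (family, score)
    return {protein: family for protein, (family, _) in best.items()}


print(best_family_per_protein(SAMPLE_TBLOUT))
# {'prot1': 'TK', 'prot2': 'CAMK'}
```

Here `prot3` is dropped because its only hit fails the cutoff, and `prot1` goes to the group with the higher bit score.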

New tools, data, and resources:

Code availability: On GitLab at https://gitlab.ifremer.fr/bioinfo/workflows/kinext. Note that this is licensed under the AGPL, not the regular GPL, meaning that the GPL's requirement to make source code available also applies if the pipeline is offered over a network as a service.

Figure 1 from Hellec 2024: Overview of the 7 processes from the KiNext pipeline.

scGenePT: Is language all you need for modeling single-cell perturbations?

Paper: Istrate, A.-M., et al. scGenePT: Is language all you need for modeling single-cell perturbations? bioRxiv, 2024. https://doi.org/10.1101/2024.10.23.619972.

This work from researchers at the Chan Zuckerberg Initiative was also covered in a recent issue of the Bits In Bio Weekly newsletter (if you haven’t subscribed to the newsletter or joined the BiB Slack, you should). Of note, the text embeddings use GPT-3.5, which is already outdated. Unfortunately the code and model checkpoints aren’t available, and “will be made available upon publication.”

TL;DR: This study introduces scGenePT, an enhanced single-cell gene perturbation model combining biological and language-based representations to predict gene expression outcomes. By incorporating gene knowledge from scientific literature, it improves over traditional models like scGPT in complex perturbation settings.

Summary: In advancing single-cell biology, predicting the effects of gene perturbations is essential. Traditional models rely on experimental data, such as single-cell RNA sequencing counts, but scGenePT integrates language embeddings from scientific resources (e.g., NCBI, UniProt, Gene Ontology) to enrich gene representations. The findings suggest that adding textual gene representations enhances the predictive power for single- and two-gene perturbations, especially where gene interactions yield non-additive effects. This approach is crucial for handling diverse perturbation scenarios, with significant implications for precision medicine, as it facilitates a more nuanced understanding of gene interactions under various conditions. The study validates the integration approach with superior performance over models relying solely on experimental data, demonstrating that language representations can serve as a valuable prior in biological modeling.

Methodological highlights:

scGenePT extends scGPT by adding gene-specific language embeddings sourced from scientific literature, providing both additive and complementary information for perturbation modeling.
Uses GPT-3.5 embeddings to align knowledge sources like NCBI gene summaries, UniProt protein data, and Gene Ontology annotations.
Employs a Transformer-based architecture where language and experimental data embeddings converge at the gene representation level, followed by a Transformer encoder-decoder setup for predictive modeling.

New tools, data, and resources:

Code availability: Unfortunately the code and model checkpoints aren’t available, and “will be made available upon publication.”

Figure 1 from Istrate 2024: Genes can have representations learned from different modalities: experimental data (e.g. scRNAseq counts) or language, through the scientific literature (e.g. NCBI gene/UniProt protein summaries, Gene Ontology annotations). Each modality can provide additive and complementary information when computing gene, and ultimately, cell representations.
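
One simple way to picture two modalities meeting at the gene level is to project the text embedding into the same space as the expression-derived embedding and add the two. The sketch below does exactly that with tiny hand-written vectors. The fusion rule (linear projection followed by element-wise addition), the dimensions, and all numbers are assumptions for illustration; the code for the actual model is not yet public, and in a real model the projection weights would be learned.

```python
def project(vec, weights):
    """Linear map; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def fuse(expr_emb, text_emb, weights):
    """Assumed fusion rule: project the text embedding, then add element-wise."""
    projected = project(text_emb, weights)
    return [e + p for e, p in zip(expr_emb, projected)]


# Hypothetical 2-d expression-derived gene embedding and 3-d text embedding.
expr_emb = [0.5, -1.0]
text_emb = [1.0, 0.0, 2.0]
# Hypothetical 2x3 projection matrix (would be learned in practice).
weights = [
    [0.1, 0.0, 0.2],
    [0.0, 0.3, -0.1],
]

gene_repr = fuse(expr_emb, text_emb, weights)
print(gene_repr)
```

The fused vector has the same dimension as the expression embedding, so it could be passed to a downstream Transformer in place of the original gene representation.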

Scalable protein design using optimization in a relaxed sequence space

Paper: Frank, C., et al. Scalable Protein Design Using Optimization in a Relaxed Sequence Space. Science, 2024. DOI: 10.1126/science.adq1741.

Protein design is important in medicine and biotechnology because it enables the creation of synthetic proteins tailored for specific functions, such as therapeutic agents that can target diseases more effectively or robust enzymes for diagnostics. In the area I work in (de-extinction), designed proteins can help reconstitute essential biological functions in revived or closely related species, potentially aiding in biodiversity restoration and ecological balance.

TL;DR: This paper presents a new pipeline for protein design using relaxed sequence optimization (RSO), which enables efficient gradient-based design of large and complex protein structures. This approach yields high-quality protein backbones and supports applications in synthetic protein-protein interactions.

Summary: The study introduces Relaxed Sequence Optimization (RSO), a gradient-descent method applied in a “relaxed” sequence space that enables rapid convergence in protein design without the constraints of one-hot encoding. By iteratively optimizing loss functions within AlphaFold2’s predictive space, RSO creates diverse protein backbones, later refined by ProteinMPNN to ensure foldability into intended structures. This approach was experimentally validated with over 100 designed proteins, including large monomers and heterodimers, highlighting its scalability and structural accuracy for de novo proteins up to 1000 amino acids. The RSO pipeline, which allows for custom loss functions tailored to specific design goals, represents a significant advance in both design speed and application scope, facilitating the development of stable, large-scale proteins relevant to fields like structural biology and synthetic biochemistry.

Methodological highlights:

RSO operates in a relaxed sequence space, optimizing backbones without forcing discrete amino acid representations, allowing for smooth gradient transitions.
Integrates ProteinMPNN for generating foldable protein sequences that match RSO-designed backbones, achieving stable high-resolution structures in experimental validation.
Customizable loss functions support specific design tasks, such as helical content reduction and binder-targeting configurations, without retraining.
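
The core trick, optimizing continuous logits instead of discrete residues and only discretizing at the end, can be shown on a toy problem. In this sketch each position holds unconstrained logits over a four-letter alphabet, a softmax turns them into a "relaxed" residue mixture, and plain gradient descent pushes the mixture's expected property toward a target profile. The loss is a stand-in for the AlphaFold2-based losses in the paper, the alphabet, per-residue values, and target are illustrative, and the final argmax replaces the ProteinMPNN sequence-design step.

```python
import math

ALPHABET = "ALKE"
# Illustrative per-residue scalar (hydropathy-like toy values).
VALUE = {"A": 1.8, "L": 3.8, "K": -3.9, "E": -3.5}
TARGET = [3.0, -3.0, 3.0, -3.0]  # hypothetical per-position target profile


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(logits):
    """Squared error between expected property and target, with analytic gradient."""
    loss, grad = 0.0, []
    for row, target in zip(logits, TARGET):
        probs = softmax(row)
        expected = sum(p * VALUE[a] for p, a in zip(probs, ALPHABET))
        diff = expected - target
        loss += diff ** 2
        # d(expected)/d(logit_a) = p_a * (value_a - expected) for a softmax mixture
        grad.append([2 * diff * p * (VALUE[a] - expected)
                     for p, a in zip(probs, ALPHABET)])
    return loss, grad


def optimize(steps=200, lr=0.05):
    logits = [[0.0] * len(ALPHABET) for _ in TARGET]  # start from a uniform mixture
    initial_loss, _ = loss_and_grad(logits)
    for _ in range(steps):
        _, grad = loss_and_grad(logits)
        logits = [[z - lr * g for z, g in zip(row, g_row)]
                  for row, g_row in zip(logits, grad)]
    final_loss, _ = loss_and_grad(logits)
    # Discretize only at the end by taking the most likely residue per position.
    decoded = "".join(ALPHABET[row.index(max(row))] for row in logits)
    return decoded, initial_loss, final_loss


decoded, initial_loss, final_loss = optimize()
print(decoded, initial_loss, final_loss)
```

Because the logits are never forced to be one-hot during optimization, every step has a well-defined gradient, which is the property the relaxed sequence space is meant to provide.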

New tools, data, and resources:

Code availability: Available on GitHub: github.com/sokrypton/ColabDesign
Data availability: Experimental data, including TEM images and crystal structures, are archived on Figshare at doi.org/10.6084/m9.figshare.27009724, with additional cryo-EM and crystal models in PDB and EMDB.

Figure 1 from Frank 2024: Schematics of the protein design pipeline. (A) Schematic representation how the free gradient descent in RSO enables an efficient search for minima of the loss function. (B) Exemplary design tasks that can be accomplished using the RSO method. (C) Schematic view of the design process consisting of backbone design, sequence generation with ProteinMPNN, and candidate design filtering with ESM-Fold/AF2.

Other papers of note

Improving bioinformatics software quality through teamwork https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae632/7831429
Targeted nanopore sequencing using the Flongle device to identify mitochondrial DNA variants https://www.nature.com/articles/s41598-024-75749-8
Review: Plant conservation in the age of genome editing: opportunities and challenges https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03399-0
Oatk: a de novo assembly tool for complex plant organelle genomes https://www.biorxiv.org/content/10.1101/2024.10.23.619857v1
LYCEUM: Learning to call copy number variants on low coverage ancient genomes https://www.biorxiv.org/content/10.1101/2024.10.28.620589v2
A near telomere-to-telomere phased reference assembly for the male mountain gorilla https://www.biorxiv.org/content/10.1101/2024.10.28.620258v1
StableLift: Optimized Germline and Somatic Variant Detection Across Genome Builds https://www.biorxiv.org/content/10.1101/2024.10.31.621401v1
BakRep – a searchable large-scale web repository for bacterial genomes, characterizations and metadata https://doi.org/10.1099/mgen.0.001305
Integer programming framework for pangenome-based genome inference https://www.biorxiv.org/content/10.1101/2024.10.27.620212v1.full

Paired Ends
