What I'm reading (July 2024)
What I’m reading this week in synthetic biology, population genetics, AI in genomics/bioinformatics, microbial genomics, and more
This post expands on a few of the papers I posted in this Twitter thread. I highlight a few in the deep dive at the top, then link to other papers of note later on.
Deep dive
Prevalence of errors in lab-made plasmids across the globe
Paper: Bai, Xingjian, et al. “Prevalence of errors in lab-made plasmids across the globe.” bioRxiv, 2024. DOI: https://doi.org/10.1101/2024.06.17.596931.
TL;DR: This study investigates the quality of lab-made plasmids globally, revealing significant design and sequence errors in nearly half of the samples analyzed. It underscores the need for community-wide standards to ensure the reliability of plasmids used in research and medicine. This research was highlighted in the recent Nature article, “Serious errors plague DNA tool that’s a workhorse of biology.”
Summary: The paper presents a comprehensive survey of plasmids sourced from hundreds of academic and industrial labs worldwide, highlighting that about 50% of these plasmids contain design and/or sequence errors. Among adeno-associated virus (AAV) transfer plasmids specifically, nearly 40% carried mutations in the inverted terminal repeat (ITR) regions, which are inherently unstable. The findings raise serious concerns about the reliability of lab-made plasmids, drawing parallels to previous issues like mycoplasma contamination and misidentified mammalian cell lines. The study calls for establishing community-wide standards to improve the quality of plasmids, which are essential tools in life sciences research and therapeutic development.
Methodological highlights:
High prevalence of design errors: Approximately 15% of the surveyed plasmids contained significant design errors affecting their functionality.
Sequence validation: Direct sequencing revealed that 35% of plasmids had sequence errors in functional regions.
AI-informed conservation genomics
Paper: van Oosterhout, Cock. “AI-informed conservation genomics.” Heredity, 2024. DOI: https://doi.org/10.1038/s41437-023-00666-x.
TL;DR: This paper discusses the role of genomic data and AI models in conservation biology, highlighting how these tools can improve extinction risk assessments and conservation efforts. The study underscores the importance of integrating genomic data with traditional conservation metrics to enhance species recovery strategies.
Summary: The paper explores the increasing importance of genomic data and AI models in conservation biology, using a study by Wilder et al. (2023) as a foundation. Despite a weak correlation between genomic data and Red List categories, the paper argues that genomic data provide critical insights into aspects of extinction risk not captured by traditional metrics. The use of AI models, such as forward-in-time simulations with tools like SLiM, can predict long-term extinction risks and recovery potentials, extending the scope of conservation assessments. The study calls for the integration of genomic data into the IUCN Red List and Green Status assessments to provide a more comprehensive evaluation of species’ long-term viability and conservation needs.
Methodological highlights:
The use of forward-in-time simulations with tools like SLiM to predict long-term extinction risks and recovery potentials.
Emphasis on integrating genomic data with traditional conservation metrics for comprehensive species assessments.
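To make "forward-in-time simulation" concrete, here is a minimal sketch in plain Python. It is not SLiM and does not use SLiM's scripting language; it assumes a haploid Wright-Fisher population with a single deleterious allele, and the function names and parameter values are hypothetical, chosen only for illustration. It shows why population size matters for long-term risk: in small populations drift can carry a harmful allele to fixation even though selection acts against it.

```python
import random


def simulate_allele_frequency(pop_size, init_freq, sel_coeff, generations, seed=0):
    """Return the per-generation frequency of a deleterious allele.

    Haploid Wright-Fisher model (an assumption of this sketch): selection
    reweights the allele frequency, then the next generation is drawn by
    binomial sampling, which is where genetic drift comes from.
    """
    if not 0.0 <= sel_coeff < 1.0:
        raise ValueError("sel_coeff must be in [0, 1)")
    rng = random.Random(seed)
    freq = init_freq
    trajectory = [freq]
    for _ in range(generations):
        # Selection: carriers have relative fitness 1 - sel_coeff.
        weighted = freq * (1.0 - sel_coeff)
        expected = weighted / (weighted + (1.0 - freq))
        # Drift: each of the pop_size offspring inherits the allele
        # independently with probability `expected`.
        carriers = sum(rng.random() < expected for _ in range(pop_size))
        freq = carriers / pop_size
        trajectory.append(freq)
    return trajectory


def fixation_fraction(pop_size, replicates=200, **kwargs):
    """Fraction of replicate runs in which the allele ends at frequency 1."""
    fixed = 0
    for rep in range(replicates):
        final = simulate_allele_frequency(pop_size, seed=rep, **kwargs)[-1]
        fixed += final == 1.0
    return fixed / replicates


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    settings = dict(init_freq=0.2, sel_coeff=0.02, generations=300)
    for n in (20, 500):
        print(n, fixation_fraction(n, **settings))
```

A real analysis in SLiM would add diploidy, recombination, many loci, and demographic history; the sketch only captures the selection-plus-drift loop that such simulators iterate.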
New tools, data, and resources:
SLiM: A forward genetic simulation tool used for predicting long-term extinction risks and recovery potentials of species.
Data availability: The paper emphasizes the need for extensive ecological and genomic data, which can be accessed from databases such as the Global Biodiversity Information Facility (https://www.gbif.org/) and the IUCN Red List (https://www.iucnredlist.org/resources/spatial-data-download).
scMaSigPro: differential expression analysis along single-cell trajectories
Paper: Srivastava, Priyansh, et al. “scMaSigPro: differential expression analysis along single-cell trajectories.” Bioinformatics, 2024. DOI: 10.1093/bioinformatics/btae443.
TL;DR: The paper introduces scMaSigPro, a method for identifying differential gene expression along single-cell trajectories, which outperforms existing methods in terms of false positive rate control and computational efficiency.
Summary: The paper presents scMaSigPro, an adaptation of the maSigPro method for single-cell RNA sequencing (scRNA-seq) trajectory data. This method is designed to identify genes with varying expression along pseudotime and branching paths, which are crucial for understanding cell differentiation mechanisms. The authors demonstrate that scMaSigPro outperforms other existing methods in controlling false positive rates and maintaining computational efficiency using both synthetic and public datasets. The method incorporates a binning strategy, equalization of heterogeneous cell distributions, and polynomial generalized linear models to address challenges such as high data dimensionality and noise. Applications include identifying key genes in cell fate decisions and improving our understanding of cellular processes.
Methodological highlights:
Utilizes Polynomial Generalized Linear Models (Poly-GLM) for modeling gene expression over pseudotime.
Implements binning and equalization strategies to handle noise and uneven cell distributions.
Enhances computational efficiency through parallel processing and specific algorithmic adaptations.
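The first two highlights can be sketched in a few lines of plain Python. This is not the scMaSigPro API (the package is written in R), and the helper names, the equal-width bins, and the choice to sum counts per bin are all assumptions made for illustration. The sketch pools cells into pseudotime bins, then fits a Poisson GLM with a log link and polynomial terms in pseudotime using iteratively reweighted least squares.

```python
import math


def bin_by_pseudotime(pseudotime, counts, n_bins):
    """Sum one gene's counts over equal-width pseudotime bins (pseudo-bulk)."""
    lo, hi = min(pseudotime), max(pseudotime)
    width = (hi - lo) / n_bins or 1.0
    sums = [0.0] * n_bins
    for t, c in zip(pseudotime, counts):
        idx = min(int((t - lo) / width), n_bins - 1)
        sums[idx] += c
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, sums


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_poisson_poly(x, y, degree=2, n_iter=25):
    """Fit log(E[y]) = b0 + b1*x + ... + bd*x^d by IRLS; returns [b0..bd]."""
    p = degree + 1
    design = [[xi ** d for d in range(p)] for xi in x]
    beta = [math.log(max(sum(y) / len(y), 1e-8))] + [0.0] * degree
    for _ in range(n_iter):
        eta = [sum(b * v for b, v in zip(beta, row)) for row in design]
        mu = [math.exp(e) for e in eta]
        # Working response for the log link; the IRLS weights equal mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx = [[sum(m * r[i] * r[j] for m, r in zip(mu, design))
                 for j in range(p)] for i in range(p)]
        xtwz = [sum(m * r[i] * zi for m, r, zi in zip(mu, design, z))
                for i in range(p)]
        beta = solve(xtwx, xtwz)
    return beta


if __name__ == "__main__":
    # Hypothetical toy data: expression rises along pseudotime.
    times = [i / 99 for i in range(100)]
    expr = [round(math.exp(0.2 + 1.5 * t)) for t in times]
    centers, sums = bin_by_pseudotime(times, expr, n_bins=10)
    print(fit_poisson_poly(centers, sums, degree=2))
```

A gene whose fitted linear or quadratic coefficient is clearly non-zero changes along the trajectory; the package's actual significance testing, equalization step, and handling of branches are not reproduced here.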
New tools, data, and resources:
scMaSigPro: R package available at https://github.com/BioBam/scMaSigPro.
Data availability: The method was tested using synthetic datasets generated with Splatter and public datasets from Setty et al. (2019), with raw data accessible through the European Nucleotide Archive (ENA), project accession PRJEB37166.
Genomic Language Models: Opportunities and Challenges
Paper: Benegas, Gonzalo, et al. “Genomic Language Models: Opportunities and Challenges.” arXiv preprint, 2024. https://arxiv.org/abs/2407.11435v1.
TL;DR: This paper reviews the potential and challenges of Genomic Language Models (gLMs), which are large language models trained on DNA sequences. It highlights key applications such as fitness prediction, sequence design, and transfer learning, and discusses the complexities of developing effective gLMs for species with large, complex genomes.
Summary: The paper discusses the transformative potential of Genomic Language Models (gLMs) for understanding and predicting the functions of genomic sequences. By drawing parallels to advances in Natural Language Processing, the authors illustrate how gLMs can be used for fitness prediction, sequence design, and transfer learning. They emphasize that despite recent progress, significant challenges remain in creating effective and efficient gLMs, particularly for species with large genomes that contain many non-functional regions. The authors review current methodologies, propose improvements, and underscore the need for better benchmarks and interpretative tools to evaluate gLM performance. Applications of gLMs include identifying deleterious genetic variants, designing novel genetic sequences, and improving functional genomics annotations through transfer learning.
Methodological highlights from the review:
Fitness Prediction: Uses log-likelihood ratios to estimate the deleteriousness of genetic variants without supervised labels.
Sequence Design: Employs causal language models to generate new DNA sequences with desired properties.
Transfer Learning: Utilizes pretrained models to improve performance on related genomic tasks, enabling better gene and regulatory element annotations.
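The log-likelihood ratio idea behind fitness prediction can be illustrated without a neural network. The sketch below swaps in a tiny fixed-order Markov model as a stand-in for a gLM; that substitution, the function names, and the toy training sequence are all assumptions for illustration, not anything taken from the review. The scoring step is the part that carries over: compare the model's log-likelihood of the alternate sequence with that of the reference, with no labels involved.

```python
import math

ALPHABET = "ACGT"


def train_markov_model(sequences, order=2):
    """Count which base follows each context of length `order`."""
    counts = {}
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            counts.setdefault(context, {}).setdefault(base, 0)
            counts[context][base] += 1
    return counts


def log_likelihood(seq, counts, order=2):
    """Log-probability of `seq` under the model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        context_counts = counts.get(seq[i - order:i], {})
        seen = context_counts.get(seq[i], 0)
        denom = sum(context_counts.values()) + len(ALPHABET)
        total += math.log((seen + 1) / denom)
    return total


def variant_llr(ref_seq, position, alt_base, counts, order=2):
    """log P(alt sequence) - log P(ref sequence); negative = less likely."""
    alt_seq = ref_seq[:position] + alt_base + ref_seq[position + 1:]
    return (log_likelihood(alt_seq, counts, order)
            - log_likelihood(ref_seq, counts, order))


if __name__ == "__main__":
    # Hypothetical training data: a perfectly periodic toy "genome".
    model = train_markov_model(["ACGT" * 10])
    reference = "ACGTACGTACGT"
    print(variant_llr(reference, 5, "T", model))  # disrupts the learned pattern
    print(variant_llr(reference, 5, "C", model))  # same base as the reference
```

With a real gLM, the two log-likelihoods would come from the trained network rather than from k-mer counts, but a strongly negative score would be read the same way: the model finds the variant surprising given what it has learned from genomes.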
Bayesian estimation of gene constraint from an evolutionary model with gene features
Paper: Zeng, Tony, et al. “Bayesian estimation of gene constraint from an evolutionary model with gene features.” Nature Genetics, 2024. DOI: 10.1038/s41588-024-01820-9.
TL;DR: The paper introduces GeneBayes, a framework combining a population genetics model with machine learning on gene features to estimate gene constraint, outperforming existing metrics, especially for short genes.
Summary: This study presents GeneBayes, a novel framework that combines a population genetics model with machine learning to accurately estimate gene constraint using thousands of gene features. Traditional metrics like pLI and LOEUF have limited power to detect constraint in short genes, where few loss-of-function (LOF) variants are expected, and can therefore overlook important pathogenic mutations. GeneBayes addresses these limitations by incorporating gene features such as expression patterns, protein structure, and evolutionary conservation, enabling more accurate estimates of constraint even for genes with few observed LOF variants. The framework demonstrates superior performance in prioritizing essential and disease-associated genes compared to existing metrics. Applications of GeneBayes extend to improving the estimation of other gene-level properties, enhancing our understanding of gene function and disease mechanisms.
Methodological highlights:
Combines Population Genetics and Machine Learning: Uses gene features to predict constraint, improving accuracy for genes with few observed LOFs.
Flexible and Interpretable: Provides posterior distributions of gene constraint, making it adaptable for various genomic analyses.
Superior Performance: Outperforms existing metrics in identifying essential and disease-associated genes, particularly for short genes.
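Why a feature-informed prior helps most for short genes can be seen in a toy calculation. The sketch below is not the GeneBayes model: it swaps in a conjugate Gamma-Poisson update on the observed/expected LOF ratio, and the class name, function name, and prior values are hypothetical. The prior plays the role that gene features play in the paper, and the LOF counts play the role of the population-genetic likelihood.

```python
from typing import NamedTuple


class GammaPosterior(NamedTuple):
    """Gamma(shape, rate) distribution over the observed/expected LOF ratio."""
    shape: float
    rate: float

    @property
    def mean(self) -> float:
        return self.shape / self.rate

    @property
    def variance(self) -> float:
        return self.shape / self.rate ** 2


def update_constraint(observed_lof, expected_lof, prior_shape, prior_rate):
    """Conjugate update, assuming observed ~ Poisson(expected * ratio).

    A low ratio means LOF variants are depleted, i.e. the gene is constrained.
    """
    return GammaPosterior(prior_shape + observed_lof, prior_rate + expected_lof)


if __name__ == "__main__":
    # Hypothetical priors: a vague one, and one where gene features
    # (e.g. expression, conservation) already suggest strong constraint.
    vague = (1.0, 1.0)
    constrained = (1.0, 9.0)
    for label, obs, exp in (("short gene", 0, 2.0), ("long gene", 5, 50.0)):
        for name, prior in (("vague", vague), ("constrained", constrained)):
            post = update_constraint(obs, exp, *prior)
            print(label, name, round(post.mean, 3))
```

For the short gene the posterior mean depends heavily on which prior is used, because two expected LOF variants carry little information; for the long gene the data dominate and the two priors give similar answers.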
New tools, data, and resources:
GeneBayes: Available at https://github.com/tkzeng/GeneBayes. The tool is written in R and Python, designed to estimate gene constraint using a combination of population genetics and machine learning models.
Data availability: Uses LOF data from the gnomAD consortium (v2.1) comprising exome sequences from approximately 125,000 individuals. Supplementary data and tables are available in the publication.
Other papers of note
Pre-Assembly NGS Correction of ONT Reads Achieves HiFi-Level Assembly Quality https://www.biorxiv.org/content/10.1101/2024.07.12.603260v1 🧬🖥️ https://github.com/MGI-EU/assembly_workflow
A New Era in Missense Variant Analysis: Statistical Insights and the Introduction of VAMPP-Score for Pathogenicity Assessment https://www.biorxiv.org/content/10.1101/2024.07.11.602867v1 🧬🖥️
Review: Evolution and regulation of animal sex chromosomes https://www.nature.com/articles/s41576-024-00757-3 https://rdcu.be/dOdPH
In silico methods for predicting functional synonymous variants https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02966-1
isolateR: an R package for generating microbial libraries from Sanger sequencing data https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae448/7712426?login=false 🧬🖥️ https://github.com/bdaisley/isolateR
CompareM2 is a genomes-to-report pipeline for comparing microbial genomes https://www.biorxiv.org/content/10.1101/2024.07.12.603264v2?rss=1 🧬🖥️ https://github.com/cmkobel/comparem2
HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae452/7714688 🧬🖥️ https://github.com/wh-xu/Hyper-Gen
Drawing mitochondrial genomes with circularMT https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae450/7713370 🧬🖥️ https://github.com/msjimc/circularMT