3 Comments
Johnbosco Tayebwa

Would be nice to put that Structural Variant challenge to GPT-5 and compare. Do you have access to GPT-5 yet?

Stephen Turner

Not yet! I'll post here when I do!

Stephen Turner

GPT-5 (formatting lost, unfortunately)

Quick prioritization rubric (high-level)

Tier 1 — highest priority

SVs that directly disrupt coding sequence of genes with high predicted impact (frameshift, exon deletion, fusion) OR SVs that delete/duplicate promoters or core regulatory elements of dosage-sensitive genes.

Tier 2 — high priority

SVs overlapping regulatory elements (enhancers, promoters) linked to genes with relevant function, SVs creating novel gene fusions, or large SVs that reposition TAD boundaries.

Tier 3 — medium

SVs in conserved noncoding regions, moderate-size CNVs in genes of unknown function, or SVs with expression/ASE evidence but unclear consequence.

Tier 4 — low

SVs in poorly conserved intergenic sequence with no expression or regulatory evidence and no overlap with annotated elements.
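
As a rough illustration, the four tiers above could be encoded as a first-match rule set. This is only a sketch: the field names on `SVFeatures` are hypothetical and would have to be filled from your own annotation step, and a fusion is treated as Tier 1 only when it also disrupts coding sequence, which is one possible reading of the rubric.

```python
from dataclasses import dataclass


@dataclass
class SVFeatures:
    # All field names are hypothetical placeholders for annotation results.
    disrupts_coding: bool = False               # frameshift, exon deletion, disruptive fusion
    hits_core_reg_of_dosage_gene: bool = False  # promoter/core element of a dosage-sensitive gene
    overlaps_linked_regulatory: bool = False    # enhancer/promoter linked to a relevant gene
    creates_fusion: bool = False                # novel gene fusion
    repositions_tad_boundary: bool = False
    in_conserved_noncoding: bool = False
    moderate_cnv_unknown_gene: bool = False
    has_expression_evidence: bool = False       # expression/ASE change, unclear consequence


def assign_tier(sv: SVFeatures) -> int:
    """Return 1 (highest priority) to 4 (lowest); the first matching tier wins."""
    if sv.disrupts_coding or sv.hits_core_reg_of_dosage_gene:
        return 1
    if sv.overlaps_linked_regulatory or sv.creates_fusion or sv.repositions_tad_boundary:
        return 2
    if sv.in_conserved_noncoding or sv.moderate_cnv_unknown_gene or sv.has_expression_evidence:
        return 3
    return 4
```

Checking tiers from highest to lowest means an SV with evidence in several tiers is always reported at its most urgent one.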

Pragmatic scoring (automatable)

Compute a composite priority score per SV from weighted features, capped to the 0–100 range (the example weights below can sum past 100). Example weights (tunable):

Overlaps coding exon and predicted LoF: +30

Frameshift / gene fusion predicted: +20

Gene is highly constrained / dosage sensitive (pLI/LOEUF analog): +15

Overlaps promoter / proximal regulatory element: +10

Overlaps enhancer predicted to contact gene (ABC model): +8

High sequence conservation (PhyloP/PhastCons): +6

RNA-seq evidence (differential expression or novel junction / fusion): +20

Allele status: heterozygous +5, homozygous +10 (depends on expected biology)

Breakpoint complexity (e.g., mobile element, insertion of sequence): +4

Population frequency: novel/rare +10; common: subtract points

Filter on a threshold (e.g., >40) for experimental follow-up.
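
A minimal sketch of how this scoring could be automated, using the example weights listed above. The flag names, the `COMMON_PENALTY` value (the list does not say how much to subtract for common variants), and the clamp to 0–100 are assumptions made for illustration, not part of the original answer.

```python
# Example weights copied from the list above; the flag names are hypothetical.
WEIGHTS = {
    "coding_lof": 30,            # overlaps coding exon and predicted LoF
    "frameshift_or_fusion": 20,  # frameshift / gene fusion predicted
    "constrained_gene": 15,      # pLI/LOEUF-like constraint
    "promoter_overlap": 10,
    "enhancer_contact": 8,       # enhancer predicted to contact the gene
    "high_conservation": 6,      # PhyloP/PhastCons
    "rna_evidence": 20,          # differential expression or novel junction/fusion
    "breakpoint_complexity": 4,
}
ZYGOSITY = {"het": 5, "hom": 10}
RARE_BONUS = 10
COMMON_PENALTY = 10  # assumption: the source only says "subtract"
THRESHOLD = 40       # example follow-up threshold from the text


def priority_score(flags, zygosity=None, frequency=None):
    """Sum the weights of the flags that are set, then clamp to 0-100."""
    score = sum(weight for name, weight in WEIGHTS.items() if name in flags)
    score += ZYGOSITY.get(zygosity, 0)
    if frequency == "rare":
        score += RARE_BONUS
    elif frequency == "common":
        score -= COMMON_PENALTY
    return max(0, min(100, score))


def needs_follow_up(score, threshold=THRESHOLD):
    """True when the score is above the follow-up threshold."""
    return score > threshold
```

For example, `priority_score({"coding_lof", "constrained_gene"}, "het", "rare")` adds 30 + 15 + 5 + 10 and returns 60, which clears the example threshold of 40.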

Data and annotations to use for ranking

Gene models / coding annotation: Ensembl/GENCODE or species-specific annotation. If no species annotation exists, lift over ortholog annotations from a close species and refine with RNA-seq and Iso-Seq.

Conservation: PhyloP / PhastCons across mammals (or closest clade).

Constraint / disease relevance: For orthologous human genes, use pLI / LOEUF / ClinVar annotations as proxy to infer dosage sensitivity or known phenotype relevance.

Regulatory features:

DNase/ATAC peaks, H3K27ac, H3K4me1 from close species (if available) or inferred from conserved enhancer maps.

Predicted enhancer–gene links: ABC model, GeneHancer, or correlation-based methods from expression/ATAC data.

Expression data:

Bulk RNA-seq from tissues relevant to the phenotype (or multiple tissues) — look for changes in expression, aberrant splice junctions, allele-specific expression (ASE), and fusion transcripts.

Long-read (Iso-Seq / ONT cDNA) to find novel isoforms and fusion transcripts.

SV annotation tools: AnnotSV, CADD-SV, SVScore, SnpEff / VEP (for predicted coding consequences), and custom BED overlays for regulatory elements.

Population context: Genotype additional individuals (if available) to estimate allele frequency. Rarity increases priority if the SV is suspected to be functional.

Computational tools & workflows (detection → annotation → prioritization)

SV detection refinement:

Use multiple callers and evidence types: long-read callers (Sniffles2, SVIM, cuteSV), short-read callers (Manta, LUMPY, Delly) and assembly-based calling (hifiasm/assembly + dipcall). Combine via SURVIVOR for consensus.

Local assembly to refine breakpoints (Flye / wtdbg2 / Racon / polished contig alignment).
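
To make the multi-caller consensus idea concrete, here is a simplified sketch that groups calls of the same type whose breakpoints agree within a tolerance and keeps groups supported by at least two distinct callers. It is not SURVIVOR's algorithm, only an illustration of the principle; the `Call` tuple and the `max_dist` and `min_callers` parameters are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal call representation: one row per caller per SV.
Call = namedtuple("Call", "caller chrom start end svtype")


def consensus_calls(calls, max_dist=500, min_callers=2):
    """Greedily cluster calls with matching chrom/type and nearby breakpoints."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start)):
        for cluster in clusters:
            ref = cluster[0]  # compare against the first call in the cluster
            if (ref.chrom == call.chrom and ref.svtype == call.svtype
                    and abs(ref.start - call.start) <= max_dist
                    and abs(ref.end - call.end) <= max_dist):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    # Keep only clusters supported by enough distinct callers.
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]


calls = [
    Call("sniffles2", "chr1", 1000, 5000, "DEL"),
    Call("manta", "chr1", 1040, 4980, "DEL"),
    Call("delly", "chr2", 200, 900, "DUP"),
]
supported = consensus_calls(calls)
```

In this toy input the two chr1 deletions merge into one supported event, while the single-caller chr2 duplication is dropped.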

Breakpoint resolution & visualization:

IGV, Ribbon, and dotplots (minimap2 + MUMmer) to inspect breakpoints.

Annotation & scoring:

SnpEff / VEP for coding impacts.

AnnotSV and CADD-SV for integrated pathogenicity-like scores.

BEDTools / pybedtools to intersect SVs with regulatory annotations, conserved elements, and RepeatMasker annotations.

Use custom scripts to compute the composite priority score above.
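
The intersection step behind these annotations boils down to an interval-overlap test. Below is a dependency-free sketch, assuming BED-style half-open coordinates and small inputs; for real data BEDTools is the better tool, and the tuple layouts here are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def annotate_overlaps(svs, features):
    """Map each SV id to the sorted labels of the features it overlaps.

    svs:      iterable of (sv_id, chrom, start, end)
    features: iterable of (chrom, start, end, label)
    """
    hits = {}
    for sv_id, chrom, start, end in svs:
        hits[sv_id] = sorted({
            label
            for f_chrom, f_start, f_end, label in features
            if f_chrom == chrom and overlaps(start, end, f_start, f_end)
        })
    return hits


svs = [("sv1", "chr1", 100, 500), ("sv2", "chr1", 1000, 1200)]
features = [
    ("chr1", 450, 600, "enhancer"),
    ("chr1", 500, 700, "promoter"),  # starts exactly where sv1 ends: no overlap
    ("chr2", 100, 500, "enhancer"),
]
annotated = annotate_overlaps(svs, features)
```

The resulting labels can feed directly into the flags used by a composite scoring script.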

Experimental validation methods (ordered by applicability and yield)

Orthogonal molecular confirmation of the SV and breakpoint

PCR + Sanger across breakpoint for small/medium SVs; design primers flanking predicted breakpoint for junction-specific amplification.

Long-range PCR for larger deletions/insertions, or targeted capture + long-read sequencing.

Targeted long-read sequencing (ONT adaptive sampling or amplicon-based PacBio HiFi) to get single-molecule evidence and exact sequence at breakpoints.

Optical mapping (Bionano) for large/complex rearrangements and copy-number changes across large genomic spans.

ddPCR / qPCR for CNV quantification.
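
For the ddPCR step, copy number is commonly estimated as the ratio of target to reference concentration, multiplied by the known copy number of the reference locus. A small sketch, assuming a diploid reference locus (2 copies) and concentrations in copies/µL taken from the instrument output:

```python
def copy_number(target_conc, reference_conc, reference_copies=2):
    """Estimate target copy number from ddPCR concentrations (copies/uL).

    Assumes the reference locus is present at `reference_copies` per genome.
    """
    if reference_conc <= 0:
        raise ValueError("reference concentration must be positive")
    return reference_copies * target_conc / reference_conc


# A target at half the reference concentration suggests a heterozygous deletion.
het_deletion = copy_number(500.0, 1000.0)
duplication = copy_number(1500.0, 1000.0)
```

Values near 1 are consistent with a heterozygous deletion and values near 3 with a single-copy gain, but replicate wells and confidence intervals are still needed before drawing conclusions.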

Assess effect on transcription / splicing

Short-read RNA-seq from the tissue where the gene is expressed: check for differential expression, exon skipping, novel junctions, fusion transcripts.

Long-read RNA (Iso-Seq or ONT cDNA) to detect novel isoforms or fusion transcripts produced by the SV.

RT-PCR + Sanger for specific fusions or aberrant splice products.

Functional assays — coding-impact SVs

If an exon is removed or a fusion created: express the altered cDNA in a heterologous cell line to test protein stability or function (western blot, activity assays where relevant).

CRISPR-based recapitulation: introduce the SV (or mimic it) in a model cell line using CRISPR/Cas (paired cuts, or HDR) to study cellular phenotype (proliferation, signaling, etc.). Use isogenic controls.

Proteomics (targeted mass spec) if you expect a truncated/new protein.

Functional assays — regulatory-impact SVs

Reporter assays (luciferase / GFP): clone the suspected regulatory element (wild-type vs SV-allele) upstream of a minimal promoter and test activity in relevant cell type.

MPRA (Massively Parallel Reporter Assays) if you have many regulatory SVs and want high-throughput testing.

CRISPRi / CRISPRa / CRISPR deletion in a relevant cell line or primary cells to silence/activate the region and monitor target gene expression.

3D genome assays (Hi-C / Capture-C / HiChIP) to test whether the SV reconfigures chromatin contacts (TAD disruption, enhancer-promoter rewiring).

In vivo / organismal (if warranted)

Introduce variant into a model organism (mouse, zebrafish) when the gene/function justifies the investment — good for developmental phenotypes or whole-organism effects.

Practical experimental order (low-cost → high-confidence)

Filter & score candidates computationally (use composite score).

For top candidates: orthogonal confirmation by PCR/Sanger and/or targeted long-read sequencing. This quickly removes false positives and refines breakpoints.

For confirmed SVs that overlap genes: check expression (RNA-seq / RT-PCR). For regulatory candidates: reporter assays and CRISPR perturbation.

For strong coding-impact SVs with a clear hypothesis: functional cellular assays or CRISPR recapitulation.

For large TAD-disrupting events: Hi-C and follow-up expression assays.

If several SVs cluster in a pathway, consider broader phenotyping and proteomics.

Sample & population considerations

If you sequenced a single individual, genotype the SVs in additional individuals (by PCR, ddPCR, or targeted sequencing) to determine allele frequency and segregation with traits / phenotypes. Rarity, conservation, and functional evidence together make a stronger case for follow-up.

Preserve high molecular weight DNA / RNA from the same specimen for orthogonal assays (optical mapping, HiFi, Iso-Seq).
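
Once additional individuals are genotyped, the allele frequency mentioned above is a simple count. A minimal sketch, assuming diploid genotypes coded as the number of SV alleles (0, 1, or 2) with `None` for failed assays; the encoding is an assumption for illustration.

```python
def alt_allele_frequency(genotypes, ploidy=2):
    """Fraction of chromosomes carrying the SV, ignoring missing genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # no usable genotypes
    return sum(called) / (ploidy * len(called))


# Five individuals: two reference, one het, one hom, one failed assay.
freq = alt_allele_frequency([0, 1, 2, None, 0])
```

Here three SV alleles are observed across eight called chromosomes, giving a frequency of 0.375. With only a handful of individuals the estimate is very coarse, so treat it as a rough rarity signal.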

Validation of complex cases

For insertions with novel sequence (e.g., mobile element or foreign sequence), targeted long-read sequencing and assembly of the inserted sequence is essential.

Structural variants that create fusion genes: confirm at DNA and RNA level, then test protein expression and function.

Useful tooling stack (summary)

SV calling / assembly: Sniffles2, SVIM, cuteSV, Manta, LUMPY, Delly, hifiasm, Flye, SURVIVOR.

Breakpoint refinement: local assembly, minimap2, pbmm2, dotplots, IGV/Ribbon.

Annotation / scoring: VEP / SnpEff, AnnotSV, CADD-SV, SVScore, BEDTools, custom scoring script.

Regulatory annotation: ABC model, ATAC/ChIP peaks, PhastCons/PhyloP.

Validation assays: PCR/Sanger, long-range PCR, targeted PacBio/ONT, Bionano optical mapping, ddPCR, RNA-seq, Iso-Seq, RT-PCR, reporter assays, CRISPR perturbations, Hi-C/Capture-C.

Visualization/Reporting: IGV screenshots, breakpoint schematics, and a reproducible table of SVs with priority scores + evidence per candidate.

Documentation & decision tracking

Keep a per-SV record with:

Caller evidence (which callers supported it), read-level evidence, breakpoint sequence, annotation overlaps, conservation, RNA evidence, composite score, PCR validation status, and recommended next assay. This helps you triage limited experimental resources.
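
One way to keep such a record is a small dataclass that can be dumped to a table. This is only a sketch: the field names mirror the list above but are otherwise hypothetical, as is the `triage_table` helper.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class SVRecord:
    sv_id: str
    callers: List[str] = field(default_factory=list)   # which callers supported it
    read_evidence: str = ""                            # e.g. supporting read counts
    breakpoint_sequence: str = ""
    annotation_overlaps: List[str] = field(default_factory=list)
    conservation: Optional[float] = None
    rna_evidence: str = ""
    composite_score: int = 0
    pcr_validated: Optional[bool] = None               # None means not yet tested
    next_assay: str = ""


def triage_table(records):
    """Return records as dicts, highest composite score first."""
    ordered = sorted(records, key=lambda r: r.composite_score, reverse=True)
    return [asdict(r) for r in ordered]


records = [
    SVRecord("sv1", composite_score=35),
    SVRecord("sv2", callers=["Sniffles2", "Manta"], composite_score=70,
             next_assay="PCR + Sanger"),
]
table = triage_table(records)
```

The list of dicts can be written out with `csv.DictWriter` or loaded into a dataframe to give the reproducible per-candidate table described earlier.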

Final practical notes / heuristics

Start with SVs that you can validate cheaply (PCR, RT-PCR) and that have the highest predicted functional consequence — you’ll get confirmatory wins early.

Use RNA evidence early — it’s very informative for both coding and regulatory effects.

Don’t ignore TAD boundary disruptions and large rearrangements: they can have large regulatory consequences that are not obvious from exon overlap. Use Hi-C when perturbation of chromatin structure is suspected.

When possible, leverage ortholog knowledge (human/mouse) for gene importance and known phenotypes, but validate in your species because regulatory landscapes can differ.
