fix: DBS across codon boundary correctly classified as nonsense (#84) #96

Open
jmg421 wants to merge 3 commits into Bioconductor:devel from jmg421:fix/84-dbs-nonsense

jmg421 commented Jun 30, 2026

Summary

A dinucleotide base substitution (DBS) that spans a codon boundary and produces a stop codon in one of the two resulting amino acids was misclassified as nonsynonymous instead of nonsense.

Example from the issue: GG→AA on the - strand produces VARAA='P*' (proline + stop), but was reported as nonsynonymous because the check only matched when VARAA was exactly '*'.

Root Cause

Line 184 of methods-predictCoding.R:

consequence[nonsynonymous & (as.character(varAA) %in% "*")] <- "nonsense"

%in% "*" requires the entire string to equal *. For a DBS spanning two codons, varAA is two characters (e.g. P*), so the check fails.
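
The whole-element matching described above can be reproduced in base R. This is a minimal sketch: the `var_aa` vector is illustrative sample data, not output taken from predictCoding().

```r
# Illustrative VARAA strings (assumed sample data, not real predictCoding() output).
var_aa <- c("*", "P*", "*L", "PQ")

# %in% is built on match(), which compares whole elements,
# so only the element that is exactly "*" matches.
exact_stop <- var_aa %in% "*"

print(exact_stop)  # TRUE FALSE FALSE FALSE
```

Under this check "P*" and "*L" never qualify as nonsense, even though both contain a stop.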

Fix

consequence[nonsynonymous & grepl("*", as.character(varAA), fixed=TRUE)] <- "nonsense"

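grepl with fixed=TRUE treats "*" as a literal character rather than a regular-expression quantifier, so it detects a stop codon anywhere in the translated variant sequence. Any premature stop truncates the protein, regardless of the residues that flank it.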
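
As a quick check of the literal matching, the sketch below compares `fixed = TRUE` with an escaped regular expression on illustrative strings. The `var_aa` values are assumptions chosen for the example.

```r
# Illustrative VARAA strings (assumed sample data).
var_aa <- c("P*", "*", "PQ", "*L")

# fixed = TRUE matches "*" as a plain character.
literal <- grepl("*", var_aa, fixed = TRUE)

# Escaping the asterisk in a regular expression gives the same answer.
escaped <- grepl("\\*", var_aa)

print(literal)  # TRUE TRUE FALSE TRUE
stopifnot(identical(literal, escaped))
```

Using `fixed = TRUE` avoids having to escape the asterisk and keeps the intent of the check readable.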

Testing

Added test_predictCoding_dbs_nonsense verifying that:

  • P* → nonsense (stop after one residue)
  • * → nonsense (classic single-codon stop)
  • PQ → nonsynonymous (no stop)
  • *L → nonsense (stop at first position)
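
The four cases above can be sketched as a standalone base-R check. `classify_consequence()` is a hypothetical helper written for this illustration, and the `ref_aa` values are assumed. Neither is taken from methods-predictCoding.R or from the test added in this PR.

```r
# Hypothetical helper mirroring the fixed classification line.
# Not the package's implementation.
classify_consequence <- function(ref_aa, var_aa) {
  nonsynonymous <- ref_aa != var_aa
  consequence <- ifelse(nonsynonymous, "nonsynonymous", "synonymous")
  # A stop anywhere in the variant amino acid string marks the change as nonsense.
  consequence[nonsynonymous & grepl("*", var_aa, fixed = TRUE)] <- "nonsense"
  consequence
}

# Assumed reference residues paired with the VARAA values from the list above.
ref_aa <- c("PW", "W", "PW", "WL")
var_aa <- c("P*", "*", "PQ", "*L")

result <- classify_consequence(ref_aa, var_aa)
print(result)  # "nonsense" "nonsense" "nonsynonymous" "nonsense"
```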

Fixes #84

jmg421 added 3 commits June 29, 2026 23:39
Bioconductor#81)

Large INDELs (e.g. 2542bp deletions) that span multiple exons were
silently dropped by locateVariants() and predictCoding() because the
internal call to mapToTranscripts() uses type='within', which requires
the variant to fit entirely within a single exon/CDS element.

Changes:
- locateVariants (.makeResult): after mapToTranscripts, detect variants
  that overlap the transcript (type='any') but were not mapped. These
  are now included in results with LOCATION='coding' and LOCSTART/LOCEND
  set to NA. A warning is emitted identifying the rescued variants.
- predictCoding (.localCoordinates): emit a warning when variants
  overlap CDS regions but cannot be mapped to transcript coordinates,
  directing users to locateVariants() for identification.

This is the minimal fix that prevents silent data loss. Full amino acid
consequence prediction for multi-exon spanning variants would require
changes to GenomicFeatures::mapToTranscripts itself.

Fixes Bioconductor#81

expand() on a CollapsedVCF was calling mcols(rdexp) <- NULL after
expanding rowRanges, which wiped all user-added metadata columns
(e.g. SNP_name, num_alts). The fast path (no multi-allelic sites)
already preserved these columns correctly.

Fix: simply expand rd[idx, ] without wiping mcols. The VCF()
constructor accepts rowRanges with extra mcols — they are stored
in the rowRanges slot alongside paramRangeID, separate from the
fixed fields (REF/ALT/QUAL/FILTER).

Fixes Bioconductor#85

…onductor#84)

A dinucleotide base substitution (DBS) spanning a codon boundary that
produces a stop codon in one of the two affected amino acids (e.g.
VARAA='P*') was misclassified as 'nonsynonymous' instead of 'nonsense'.

Root cause: the nonsense check used `as.character(varAA) %in% "*"`
which only matches when the entire VARAA string is '*'. For multi-codon
DBS, VARAA is multi-character (e.g. 'P*') and the check fails.

Fix: use `grepl("*", ..., fixed=TRUE)` to detect stop codons anywhere
in the translated variant amino acid sequence. Any premature stop
truncates the protein regardless of flanking residues.

Fixes Bioconductor#84
Development

Successfully merging this pull request may close these issues.

nonsense mutations from DBS across two codons misclassified as nonsynonymous?