fix: predictCoding uses CDS-mapped width for exon/intron boundary deletions (#83) #97

Open
jmg421 wants to merge 4 commits into Bioconductor:devel from jmg421:fix/83-exon-intron-boundary-codon

jmg421 commented Jun 30, 2026

Summary

A deletion that starts inside an exon and extends into the adjacent intron produced incorrect REFCODON/VARCODON values because the codon extraction logic used the genomic width of the variant instead of the transcript-space width (CDS overlap only).

Example from the issue

chr1:3826659-3826710 (52bp genomic deletion)
Only 37bp overlap exon 16 of ENST00000378230

predictCoding reported a 54-character REFCODON that ran into the next exon and a 3-character VARCODON that ran into intron sequence. Both are wrong.
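
To make the two widths in the example concrete, here is a minimal base-R sketch. The deletion coordinates are the ones quoted above; the exon bounds are hypothetical, chosen only so that the overlap equals the reported 37 bp, and the sketch assumes the exon precedes the intron in genomic coordinates.

```r
# Deletion coordinates from the issue (1-based, inclusive).
del_start <- 3826659L
del_end   <- 3826710L

# HYPOTHETICAL exon bounds (not taken from the real annotation):
# exon_end is picked so the overlap matches the 37 bp reported above.
exon_start <- 3826600L
exon_end   <- 3826695L

# Genomic width: every base the deletion covers, intronic or not.
genomic_width <- del_end - del_start + 1L

# CDS overlap: only the bases that fall inside the exon.
cds_overlap <- min(del_end, exon_end) - max(del_start, exon_start) + 1L

genomic_width  # 52
cds_overlap    # 37
```

The 15-base difference is the intronic part of the deletion, which has no counterpart in transcript coordinates.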

Root Cause

Two places in methods-predictCoding.R used width(txlocal) (the genomic width of the variant) when they should use width(mcols(txlocal)$CDSLOC) (the width of the variant as mapped to transcript/CDS coordinates):

  1. .getRefCodons(): the cend calculation extends past the actual CDS overlap
  2. Frameshift detection: refwidth is inflated by intronic bases, causing an incorrect mod-3 check
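
The first item can be illustrated with a simplified model of rounding a variant out to whole codons. The helper name `codon_window_width` is hypothetical and the formula is an assumption about the general shape of the calculation, not a copy of the code in .getRefCodons(). It is nevertheless consistent with the 54-character REFCODON reported in the issue.

```r
# Hypothetical helper: size (in bases) of the smallest whole-codon window
# covering a variant that starts at codon position `altpos` (1, 2 or 3)
# and spans `width` bases in transcript space.
codon_window_width <- function(altpos, width) {
  ((altpos + width - 2L) %/% 3L) * 3L + 3L
}

altpos <- 1:3  # all possible positions of the first deleted base in its codon

# Using the 52 bp genomic width (the buggy input):
codon_window_width(altpos, 52L)  # 54 54 54

# Using the 37 bp CDS-mapped width (the intended input):
codon_window_width(altpos, 37L)  # 39 39 39
```

Under this model the window is 15 bases too long whenever the genomic width is used, so the reference codon string runs past the end of the exon and into whatever follows it in the spliced CDS sequence.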

Fix

Replace width(txlocal) with width(mcols(txlocal)$CDSLOC) in both locations. The CDSLOC IRanges represents the variant's position and extent in transcript coordinates (exons only, introns excluded), which is what the codon arithmetic needs.
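
The effect on frameshift detection can be sketched the same way. `is_frameshift` is a hypothetical stand-in for the package's length-change check, and the 40/36 widths are made-up numbers chosen to show the result flipping. In the example from the issue both widths (52 and 37) leave remainder 1 modulo 3, so that particular variant is classified the same way before and after the fix.

```r
# Hypothetical stand-in for the length-change check: a variant shifts the
# reading frame when the difference between reference and alternate widths
# is not a multiple of 3.
is_frameshift <- function(ref_width, alt_width) {
  (ref_width - alt_width) %% 3L != 0L
}

# Made-up deletion: 40 genomic bases, of which 36 lie in the CDS.
# The alternate allele is assumed to contribute 0 bases.
is_frameshift(40L, 0L)  # TRUE:  genomic width suggests a frameshift
is_frameshift(36L, 0L)  # FALSE: CDS-mapped width shows an in-frame deletion

# Widths from the issue: both are 1 modulo 3, so the answer does not change.
is_frameshift(52L, 0L)  # TRUE
is_frameshift(37L, 0L)  # TRUE
```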

Fixes #83

jmg421 added 4 commits June 29, 2026 23:39
Bioconductor#81)

Large INDELs (e.g. 2542bp deletions) that span multiple exons were
silently dropped by locateVariants() and predictCoding() because the
internal call to mapToTranscripts() uses type='within', which requires
the variant to fit entirely within a single exon/CDS element.
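
A minimal base-R sketch of the difference between the two overlap types, using plain interval arithmetic rather than the Bioconductor API. The exon coordinates and the deletion start are hypothetical; only the 2542 bp length comes from the text above.

```r
# HYPOTHETICAL exons of one transcript (1-based, inclusive).
exon_start <- c(1000L, 2000L, 3500L)
exon_end   <- c(1200L, 2300L, 3700L)

# A 2542 bp deletion starting inside the first exon (start is made up).
del_start <- 1100L
del_end   <- del_start + 2542L - 1L

# 'within': the variant must lie entirely inside a single exon.
within_hits <- del_start >= exon_start & del_end <= exon_end

# 'any': the variant only needs to share at least one base with an exon.
any_hits <- del_start <= exon_end & del_end >= exon_start

within_hits  # FALSE FALSE FALSE
any_hits     # TRUE TRUE TRUE
```

With 'within' semantics the deletion matches no exon at all, which is how such a variant can disappear from the results without any message.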

Changes:
- locateVariants (.makeResult): after mapToTranscripts, detect variants
  that overlap the transcript (type='any') but were not mapped. These
  are now included in results with LOCATION='coding' and LOCSTART/LOCEND
  set to NA. A warning is emitted identifying the rescued variants.
- predictCoding (.localCoordinates): emit a warning when variants
  overlap CDS regions but cannot be mapped to transcript coordinates,
  directing users to locateVariants() for identification.

This is the minimal fix that prevents silent data loss. Full amino acid
consequence prediction for multi-exon spanning variants would require
changes to GenomicFeatures::mapToTranscripts itself.

Fixes Bioconductor#81

expand() on a CollapsedVCF was calling mcols(rdexp) <- NULL after
expanding rowRanges, which wiped all user-added metadata columns
(e.g. SNP_name, num_alts). The fast path (no multi-allelic sites)
already preserved these columns correctly.

Fix: simply expand rd[idx, ] without wiping mcols. The VCF()
constructor accepts rowRanges with extra mcols — they are stored
in the rowRanges slot alongside paramRangeID, separate from the
fixed fields (REF/ALT/QUAL/FILTER).

Fixes Bioconductor#85

…onductor#84)

A dinucleotide base substitution (DBS) spanning a codon boundary that
produces a stop codon in one of the two affected amino acids (e.g.
VARAA='P*') was misclassified as 'nonsynonymous' instead of 'nonsense'.

Root cause: the nonsense check used `as.character(varAA) %in% "*"`
which only matches when the entire VARAA string is '*'. For multi-codon
DBS, VARAA is multi-character (e.g. 'P*') and the check fails.

Fix: use `grepl("*", ..., fixed=TRUE)` to detect stop codons anywhere
in the translated variant amino acid sequence. Any premature stop
truncates the protein regardless of flanking residues.
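
The two checks can be compared directly in base R. 'P*' is the value quoted above; '*' and 'PL' are extra illustrative inputs.

```r
# Translated variant amino acid strings: a lone stop, a stop after a
# residue (the multi-codon DBS case), and a string with no stop.
var_aa <- c("*", "P*", "PL")

# Old check: TRUE only when the whole string is exactly "*".
exact_match <- as.character(var_aa) %in% "*"

# New check: TRUE when a literal "*" appears anywhere in the string.
# fixed = TRUE stops "*" from being read as a regex quantifier.
contains_stop <- grepl("*", var_aa, fixed = TRUE)

exact_match    # TRUE FALSE FALSE
contains_stop  # TRUE  TRUE FALSE
```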

Fixes Bioconductor#84

…etions (Bioconductor#83)

A deletion starting in an exon and extending into the intron produced
incorrect REFCODON/VARCODON because .getRefCodons() and the frameshift
calculation used the genomic width of the variant rather than the
transcript-space (CDSLOC) width.

For example, a 51bp genomic deletion with only 37bp overlapping the CDS
would compute cend using 51, causing the reference codon to extend
incorrectly into the next exon's sequence.

Fix: replace width(txlocal) (genomic width) with
width(mcols(txlocal)$CDSLOC) (CDS-mapped width) in both:
- .getRefCodons(): codon boundary calculation
- frameshift detection: refwidth for length-change check

Fixes Bioconductor#83
Development

Successfully merging this pull request may close these issues.

Incorrect annotation of exon/intron boundary deletion
