fix: locateVariants/predictCoding no longer silently drop large INDELs (#81) #94

Open
jmg421 wants to merge 1 commit into Bioconductor:devel from jmg421:fix/81-large-indel-warning

jmg421 commented Jun 30, 2026

Summary

Large INDELs that span multiple exons (e.g. 2542bp deletions from Sniffles2) were silently dropped by locateVariants() and predictCoding(). Both functions call GenomicFeatures::mapToTranscripts() internally, which runs findOverlaps() with type='within' and therefore only maps a variant that fits entirely within a single exon/CDS element.

Root Cause

mapToTranscripts(GenomicRanges, GRangesList), at lines 209-210 of GenomicFeatures, calls:

findOverlaps(x, unlist(transcripts), minoverlap=1L, type="within", ...)

A variant wider than any single exon produces zero hits and is silently excluded from all downstream results.
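
The difference between the two overlap types can be seen on toy data. The following is a minimal sketch, assuming the GenomicRanges package is installed; the exon and variant coordinates are made up for illustration and are not taken from the issue.

```r
library(GenomicRanges)

# Three toy exons on one chromosome (illustrative coordinates only)
exons <- GRanges("chr1", IRanges(start = c(100, 300, 500),
                                 end   = c(199, 399, 599)))

# A small deletion inside exon 1 and a large deletion spanning all three exons
variants <- GRanges("chr1", IRanges(start = c(120, 150), end = c(125, 550)))
names(variants) <- c("small_del", "large_del")

# type = "within": the query must lie entirely inside one subject range
within_hits <- findOverlaps(variants, exons, minoverlap = 1L, type = "within")

# type = "any": any shared base counts as an overlap
any_hits <- findOverlaps(variants, exons, minoverlap = 1L, type = "any")

# Variants with at least one hit under each overlap type
within_names <- names(variants)[unique(queryHits(within_hits))]
any_names    <- names(variants)[unique(queryHits(any_hits))]

within_names  # the large deletion is absent here
any_names     # both variants are present here
```

Under type = "within" the large deletion has no hits at all, which is the condition that leads to it being excluded downstream.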

Fix

locateVariants(): .makeResult()

After mapToTranscripts(), detect variants that overlap the transcript features (type='any') but were not mapped. These variants are now:

  • Included in results with LOCATION='coding'
  • Given LOCSTART/LOCEND of NA (transcript coordinates cannot be computed)
  • Reported in a warning that identifies the rescued variants
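
The detection step described above might look roughly like the sketch below. It is not the code in this PR: rescue_unmapped() is a hypothetical helper, the data are toy ranges, and type = "within" hits stand in for what mapToTranscripts() would map. It assumes the GenomicRanges package is installed.

```r
library(GenomicRanges)

# Hypothetical helper: find variants that overlap features but fit in none
rescue_unmapped <- function(variants, features) {
  mapped      <- unique(queryHits(findOverlaps(variants, features, type = "within")))
  overlapping <- unique(queryHits(findOverlaps(variants, features, type = "any")))
  rescued     <- setdiff(overlapping, mapped)

  if (length(rescued) > 0L) {
    warning(length(rescued), " variant(s) overlap coding features but span ",
            "more than one element; LOCSTART/LOCEND set to NA: ",
            paste(names(variants)[rescued], collapse = ", "),
            call. = FALSE)
  }

  # Rows for rescued variants; transcript coordinates are unknown, hence NA
  data.frame(
    QUERYID  = rescued,
    LOCATION = rep("coding", length(rescued)),
    LOCSTART = rep(NA_integer_, length(rescued)),
    LOCEND   = rep(NA_integer_, length(rescued))
  )
}

# Toy data (illustrative coordinates only)
exons <- GRanges("chr1", IRanges(start = c(100, 300, 500),
                                 end   = c(199, 399, 599)))
variants <- GRanges("chr1", IRanges(start = c(120, 150), end = c(125, 550)))
names(variants) <- c("small_del", "large_del")

rescued_rows <- rescue_unmapped(variants, exons)  # emits a warning
rescued_rows
```

Computing the rescued set as a set difference keeps the existing mapped results untouched, so variants that fit within a single exon follow the same path as before.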

predictCoding(): .localCoordinates()

Emit a warning when variants overlap CDS regions but cannot be mapped to transcript coordinates, directing users to locateVariants() for identification.

Design Decision

This is the minimal fix that prevents silent data loss, the most dangerous failure mode: users make decisions on incomplete output without knowing it.

Full amino acid consequence prediction for variants that span multiple exons would require changes to GenomicFeatures::mapToTranscripts() itself (changing the overlap type or adding a parameter), which is outside the scope of this package. The warning directs users to identify affected variants so they can be handled appropriately (e.g., via VEP/SnpEff for full structural consequence annotation).

Testing

Added inst/unitTests/test_large_indels.R with 4 test cases:

  1. Large INDEL spanning 3 exons is no longer dropped from locateVariants()
  2. Small INDELs within a single exon continue to work normally
  3. predictCoding() emits a warning for multi-exon spanning variants
  4. Mixed queries (small + large) both appear in results

Reproducer (from issue)

A 2542bp Sniffles2 deletion is read successfully by readVcf() but disappears from locateVariants() and predictCoding() results. After this fix, it appears with LOCATION='coding' and a warning.
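
On the user side, rescued variants could be pulled out of a result by their NA coordinates. The sketch below assumes the result layout described in this PR (LOCATION, LOCSTART, LOCEND and QUERYID metadata columns); the `loc` object is a hand-built stand-in with made-up values, not real locateVariants() output, and the GenomicRanges package is assumed to be installed.

```r
library(GenomicRanges)

# Stand-in for a locateVariants() result; all values are illustrative
loc <- GRanges("chr1",
               IRanges(start = c(120, 150), end = c(125, 2691)),
               LOCATION = factor(c("coding", "coding")),
               LOCSTART = c(21L, NA_integer_),
               LOCEND   = c(26L, NA_integer_),
               QUERYID  = 1:2)

# Rescued variants: flagged as coding but without transcript coordinates
unresolved <- loc[loc$LOCATION == "coding" & is.na(loc$LOCSTART)]

unresolved$QUERYID  # indices into the original query
width(unresolved)   # sizes of the affected variants
```

The QUERYID values can then be used to subset the original query and pass those variants to a structural annotation tool.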

Fixes #81

fix: locateVariants/predictCoding no longer silently drop large INDELs (Bioconductor#81)

Large INDELs (e.g. 2542bp deletions) that span multiple exons were
silently dropped by locateVariants() and predictCoding() because the
internal call to mapToTranscripts() uses type='within', which requires
the variant to fit entirely within a single exon/CDS element.

Changes:
- locateVariants (.makeResult): after mapToTranscripts, detect variants
  that overlap the transcript (type='any') but were not mapped. These
  are now included in results with LOCATION='coding' and LOCSTART/LOCEND
  set to NA. A warning is emitted identifying the rescued variants.
- predictCoding (.localCoordinates): emit a warning when variants
  overlap CDS regions but cannot be mapped to transcript coordinates,
  directing users to locateVariants() for identification.

This is the minimal fix that prevents silent data loss. Full amino acid
consequence prediction for multi-exon spanning variants would require
changes to GenomicFeatures::mapToTranscripts itself.

Fixes Bioconductor#81

Merging this pull request may close the linked issue: locateVariants() and predictCoding() silently drop large INDELs