fix: locateVariants/predictCoding no longer silently drop large INDELs (#81) #94
Open
jmg421 wants to merge 1 commit into
Large INDELs (e.g. 2542bp deletions) that span multiple exons were silently dropped by locateVariants() and predictCoding() because the internal call to mapToTranscripts() uses type='within', which requires the variant to fit entirely within a single exon/CDS element.
Changes:
- locateVariants (.makeResult): after mapToTranscripts, detect variants that overlap the transcript (type='any') but were not mapped. These are now included in results with LOCATION='coding' and LOCSTART/LOCEND set to NA. A warning is emitted identifying the rescued variants.
- predictCoding (.localCoordinates): emit a warning when variants overlap CDS regions but cannot be mapped to transcript coordinates, directing users to locateVariants() for identification.
This is the minimal fix that prevents silent data loss. Full amino acid consequence prediction for multi-exon spanning variants would require changes to GenomicFeatures::mapToTranscripts itself.
Fixes Bioconductor#81
Summary
Large INDELs (e.g. 2542bp deletions from Sniffles2) that span multiple exons were silently dropped by locateVariants() and predictCoding() because the internal call to GenomicFeatures::mapToTranscripts() uses type='within' in findOverlaps(), which requires the variant to fit entirely within a single exon/CDS element.
Root Cause
mapToTranscripts(GenomicRanges, GRangesList), at lines 209-210 of GenomicFeatures, calls findOverlaps() with type='within'. A variant wider than any single exon produces zero hits and is silently excluded from all downstream results.
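To make the root cause concrete, here is a minimal sketch of how type='within' and type='any' differ in findOverlaps(). The exon and variant coordinates are invented for illustration and are not taken from the issue; only GenomicRanges is assumed to be installed.

```r
library(GenomicRanges)

# Two hypothetical exons of one transcript (coordinates are illustrative only).
exons <- GRanges("chr1",
                 IRanges(start = c(1000, 3000), end = c(1500, 3400)),
                 strand = "+")

# A 1bp variant inside exon 1, and a 2542bp deletion that spans both exons.
variants <- GRanges("chr1",
                    IRanges(start = c(1200, 1100), end = c(1200, 3641)))
names(variants) <- c("snv", "large_del")

# type = "within": the query must lie entirely inside a single subject range.
within_hits <- findOverlaps(variants, exons, type = "within")

# type = "any": any shared base counts as an overlap.
any_hits <- findOverlaps(variants, exons, type = "any")

within_names <- names(variants)[unique(queryHits(within_hits))]
any_names    <- names(variants)[unique(queryHits(any_hits))]

print(within_names)  # "snv"
print(any_names)     # "snv" "large_del"
```

The deletion overlaps both exons but fits inside neither, so it produces no 'within' hit, which is the situation described above.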
Fix
locateVariants(): .makeResult()
After mapToTranscripts(), detect variants that overlap the transcript features (type='any') but were not mapped. These are now:
- included in results with LOCATION='coding'
- given LOCSTART/LOCEND set to NA (transcript coordinates cannot be computed)
- reported in a warning that identifies the rescued variants
predictCoding(): .localCoordinates()
Emit a warning when variants overlap CDS regions but cannot be mapped to transcript coordinates, directing users to locateVariants() for identification.
Design Decision
This is the minimal fix that prevents silent data loss — the most dangerous failure mode (users making decisions on incomplete output without knowing it).
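The rescue step can be sketched roughly as below. This is not the code in .makeResult(): findUnmappedOverlaps() is a hypothetical helper, it works on a flat GRanges of features rather than the GRangesList the package uses, and it stores LOCATION as plain character for brevity.

```r
library(GenomicRanges)

# Hypothetical helper (not part of VariantAnnotation): returns variants that
# overlap at least one feature but fit entirely within none of them.
findUnmappedOverlaps <- function(variants, features) {
  mapped   <- countOverlaps(variants, features, type = "within") > 0
  touching <- countOverlaps(variants, features, type = "any") > 0
  rescued  <- which(touching & !mapped)
  if (length(rescued) > 0) {
    warning(length(rescued), " variant(s) overlap coding features but could ",
            "not be mapped to transcript coordinates; LOCSTART/LOCEND set to NA")
  }
  out <- variants[rescued]
  mcols(out)$LOCATION <- rep("coding", length(rescued))
  mcols(out)$LOCSTART <- rep(NA_integer_, length(rescued))
  mcols(out)$LOCEND   <- rep(NA_integer_, length(rescued))
  out
}

# Illustrative coordinates only: one small variant and one 2542bp deletion.
exons <- GRanges("chr1", IRanges(start = c(1000, 3000), end = c(1500, 3400)))
variants <- GRanges("chr1", IRanges(start = c(1200, 1100), end = c(1200, 3641)))
names(variants) <- c("snv", "large_del")

# The warning is captured here so the example runs quietly.
rescued <- withCallingHandlers(
  findUnmappedOverlaps(variants, exons),
  warning = function(w) invokeRestart("muffleWarning")
)
print(names(rescued))
```

Keeping the rescued rows with NA coordinates, rather than guessing positions, matches the intent stated here: the variant stays visible without implying a transcript mapping that was never computed.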
Full amino acid consequence prediction for multi-exon spanning variants would require changes to GenomicFeatures::mapToTranscripts() itself (changing the overlap type or adding a parameter), which is outside the scope of this package. The warning directs users to identify affected variants so they can be handled appropriately (e.g., via VEP/SnpEff for full structural consequence annotation).
Testing
Added inst/unitTests/test_large_indels.R with 4 test cases covering locateVariants() and predictCoding(), including a check that predictCoding() emits a warning for multi-exon spanning variants.
Reproducer (from issue)
A 2542bp Sniffles2 deletion is read successfully by readVcf() but disappears from locateVariants() and predictCoding() results. After this fix, it appears with LOCATION='coding' and a warning.
Fixes #81