fix: expand() preserves non-standard rowRanges columns (#85)#95
Open
jmg421 wants to merge 2 commits into
Open
Conversation
Bioconductor#81) Large INDELs (e.g. 2542bp deletions) that span multiple exons were silently dropped by locateVariants() and predictCoding() because the internal call to mapToTranscripts() uses type='within', which requires the variant to fit entirely within a single exon/CDS element. Changes: - locateVariants (.makeResult): after mapToTranscripts, detect variants that overlap the transcript (type='any') but were not mapped. These are now included in results with LOCATION='coding' and LOCSTART/LOCEND set to NA. A warning is emitted identifying the rescued variants. - predictCoding (.localCoordinates): emit a warning when variants overlap CDS regions but cannot be mapped to transcript coordinates, directing users to locateVariants() for identification. This is the minimal fix that prevents silent data loss. Full amino acid consequence prediction for multi-exon spanning variants would require changes to GenomicFeatures::mapToTranscripts itself. Fixes Bioconductor#81
expand() on a CollapsedVCF was calling mcols(rdexp) <- NULL after expanding rowRanges, which wiped all user-added metadata columns (e.g. SNP_name, num_alts). The fast path (no multi-allelic sites) already preserved these columns correctly. Fix: simply expand rd[idx, ] without wiping mcols. The VCF() constructor accepts rowRanges with extra mcols — they are stored in the rowRanges slot alongside paramRangeID, separate from the fixed fields (REF/ALT/QUAL/FILTER). Fixes Bioconductor#85
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
expand()on aCollapsedVCFwith multi-allelic sites was dropping all user-addedmcolsfromrowRanges. For example, a column likeSNP_nameadded viamcols(rowRanges(vcf))$SNP_name <- ...would disappear afterexpand().Root Cause
In
methods-expand.R, the multi-allelic expansion path had:This wiped ALL metadata columns. The
rd[idx, ]subsetting already correctly repeats rows for multi-allelic sites — themcols <- NULLwas unnecessary and destructive.Note: the fast path (
all(elt == 1L), no multi-allelic sites) already preserved user columns correctly by passingrddirectly toVCF().Fix
Remove the
mcols(rdexp) <- NULLline. Simply userd[idx, ]which preserves all columns. TheVCF()constructor stores these in therowRangesslot separately from the fixed fields (REF/ALT/QUAL/FILTER are in thefixedslot).Testing
Added
test_expand_preserves_nonstandard_mcolsto existing test file using the exact reproducer from the issue.Reproducer (from issue)
Fixes #85