Skip to content

fix: expand() preserves non-standard rowRanges columns (#85)#95

Open
jmg421 wants to merge 2 commits into
Bioconductor:develfrom
jmg421:fix/85-expand-preserve-mcols
Open

fix: expand() preserves non-standard rowRanges columns (#85)#95
jmg421 wants to merge 2 commits into
Bioconductor:develfrom
jmg421:fix/85-expand-preserve-mcols

Conversation

@jmg421

@jmg421 jmg421 commented Jun 30, 2026

Copy link
Copy Markdown

Summary

expand() on a CollapsedVCF with multi-allelic sites was dropping all user-added mcols from rowRanges. For example, a column like SNP_name added via mcols(rowRanges(vcf))$SNP_name <- ... would disappear after expand().

Root Cause

In methods-expand.R, the multi-allelic expansion path had:

rdexp <- rd[idx, ]
mcols(rdexp) <- NULL

This wiped ALL metadata columns. The rd[idx, ] subsetting already correctly repeats rows for multi-allelic sites — the mcols <- NULL was unnecessary and destructive.

Note: the fast path (all(elt == 1L), no multi-allelic sites) already preserved user columns correctly by passing rd directly to VCF().

Fix

Remove the mcols(rdexp) <- NULL line. Simply use rd[idx, ] which preserves all columns. The VCF() constructor stores these in the rowRanges slot separately from the fixed fields (REF/ALT/QUAL/FILTER are in the fixed slot).

Testing

Added test_expand_preserves_nonstandard_mcols to existing test file using the exact reproducer from the issue.

Reproducer (from issue)

vcf <- VCF(rowRanges = GRanges("chr1", IRanges(1:4*3, width=c(1, 2, 1, 1))))
alt(vcf) <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
ref(vcf) <- DNAStringSet(c("G", "AA", "T", "G"))
mcols(rowRanges(vcf))$SNP_name <- paste0("SNP_", 1:4)

vcfLong <- expand(vcf)
# Before fix: SNP_name is gone
# After fix: SNP_name is preserved and correctly expanded
mcols(rowRanges(vcfLong, fixed=FALSE))$SNP_name
# [1] "SNP_1" "SNP_2" "SNP_3" "SNP_3" "SNP_4" "SNP_4"

Fixes #85

jmg421 added 2 commits June 29, 2026 23:39
Bioconductor#81)

Large INDELs (e.g. 2542bp deletions) that span multiple exons were
silently dropped by locateVariants() and predictCoding() because the
internal call to mapToTranscripts() uses type='within', which requires
the variant to fit entirely within a single exon/CDS element.

Changes:
- locateVariants (.makeResult): after mapToTranscripts, detect variants
  that overlap the transcript (type='any') but were not mapped. These
  are now included in results with LOCATION='coding' and LOCSTART/LOCEND
  set to NA. A warning is emitted identifying the rescued variants.
- predictCoding (.localCoordinates): emit a warning when variants
  overlap CDS regions but cannot be mapped to transcript coordinates,
  directing users to locateVariants() for identification.

This is the minimal fix that prevents silent data loss. Full amino acid
consequence prediction for multi-exon spanning variants would require
changes to GenomicFeatures::mapToTranscripts itself.

Fixes Bioconductor#81
expand() on a CollapsedVCF was calling mcols(rdexp) <- NULL after
expanding rowRanges, which wiped all user-added metadata columns
(e.g. SNP_name, num_alts). The fast path (no multi-allelic sites)
already preserved these columns correctly.

Fix: simply expand rd[idx, ] without wiping mcols. The VCF()
constructor accepts rowRanges with extra mcols — they are stored
in the rowRanges slot alongside paramRangeID, separate from the
fixed fields (REF/ALT/QUAL/FILTER).

Fixes Bioconductor#85
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

expand() and non-standard rowRanges columns

1 participant