fix: writeVcf restores '*' spanning deletion in multi-allele ALT (#65) #103
Closed
jmg421 wants to merge 1 commit into
…records (Bioconductor#65)

writeVcf() previously wrote empty alleles for spanning deletions in multi-allele ALT fields (e.g., 'CA,,TTA' instead of 'CA,*,TTA'). This produced malformed VCF that caused IGV/htsjdk to error with 'empty alleles are not permitted in VCF records'.

Root cause: readVcf() converts '*' to empty strings at parse time (in .formatALT), losing the distinction between '*' (spanning deletion) and '.' (no allele). At write time, the empty strings were either dropped or written as '.'.

Fix: at write time, restore '*' for empty strings that appear in multi-allele list elements (length > 1), since an empty string alongside other alleles is unambiguously a spanning deletion. For ExpandedVCF (scalar ALT per row), empty strings also become '*' since expand() doesn't produce monomorphic rows.

The ambiguous single-allele case (sole '*' vs '.') remains as '.' for backward compatibility; a full fix would require changes at the C parsing level, as discussed in the issue thread.

Partially addresses Bioconductor#65
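The write-time rule in the commit message can be sketched in base R. This is only an illustration, not the code in this PR: `restore_spanning_deletion` is a hypothetical helper, and a plain list of character vectors stands in for the CharacterList ALT column. The ExpandedVCF branch is left out.

```r
# Hypothetical helper (not part of VariantAnnotation): `alt` is a plain list
# of character vectors, one per record, standing in for the ALT CharacterList.
restore_spanning_deletion <- function(alt) {
  vapply(alt, function(alleles) {
    if (length(alleles) > 1L) {
      # An empty string next to other alleles can only have been '*'.
      alleles[alleles == ""] <- "*"
    } else if (length(alleles) == 0L || identical(alleles, "")) {
      # A sole empty allele is ambiguous ('*' vs '.'); keep writing '.'.
      alleles <- "."
    }
    # Collapse the alleles of one record into a single ALT field.
    paste(alleles, collapse = ",")
  }, character(1))
}

# Three records: multi-allele with a spanning deletion, a plain SNV,
# and the ambiguous single empty allele.
alt <- list(c("CA", "", "TTA"), "G", "")
alt_fields <- restore_spanning_deletion(alt)
print(alt_fields)
```

Only the first record is rewritten with '*'; the single empty allele falls back to '.', which is the backward-compatible behaviour the commit message describes.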
Author
Superseded by #104, which fixes the issue at read time (preserving '*' as a CharacterList element) rather than trying to reconstruct it at write time. The read-time fix is cleaner, handles all cases (including single '*' alleles), and aligns with how structural variants are already handled.
Summary
Partially fixes #65.
writeVcf() now correctly writes '*' (spanning deletion) alleles in multi-allele ALT fields instead of producing empty alleles that make the VCF malformed.
Problem
As reported in #65, writing a VCF with spanning deletions produced an ALT field like 'CA,,TTA' instead of 'CA,*,TTA'.
The empty allele caused IGV/htsjdk to reject the file with 'empty alleles are not permitted in VCF records'.
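The malformed field is easy to reproduce outside the package. The base-R sketch below is an illustration under assumptions: a plain character vector stands in for one record's parsed ALT alleles, and `has_empty_allele` is a hypothetical check, not the validation that htsjdk actually performs.

```r
# One record's ALT alleles after parsing; the spanning deletion '*' has
# already been replaced by an empty string.
alt_parsed <- c("CA", "", "TTA")

# Collapsing with a comma leaves an empty allele in the middle of the field.
alt_field <- paste(alt_parsed, collapse = ",")
print(alt_field)

# Hypothetical validity check: report whether any allele in the field is empty.
has_empty_allele <- function(field) {
  alleles <- strsplit(field, ",", fixed = TRUE)[[1]]
  # strsplit() drops a trailing empty piece, so test for a trailing comma too.
  any(alleles == "") || grepl(",$", field) || field == ""
}

print(has_empty_allele(alt_field))
print(has_empty_allele("CA,*,TTA"))
```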
Root Cause
readVcf() converts '*' to empty strings at parse time (in .formatALT()), losing the distinction between '*' (spanning deletion) and '.' (no allele). At write time, these empty strings were collapsed into empty fields.
Fix
At write time (.makeVcfMatrix()), before collapsing ALT alleles with unstrsplit, restore '*' for empty strings that appear in multi-allele list elements (length > 1). An empty string alongside other alleles is unambiguously a spanning deletion per the VCF spec. For ExpandedVCF objects (scalar ALT per row), empty strings also become '*' since expand() doesn't produce monomorphic rows.
Limitations
The ambiguous single-allele case (a sole '*' vs '.') cannot be fully resolved at write time because the C-level parser erases the distinction. This PR conservatively writes '.' for that case. A complete fix would require changes to the C parsing layer, as @hpages discussed in the issue thread.
Testing
'CA,*,TTA' is correctly round-tripped through read/write
Agentic tooling disclosure
This PR was produced with the assistance of Kiro CLI (Amazon's AI coding agent). All changes were reviewed and verified by the author.