fix: writeVcf restores '*' spanning deletion in multi-allele ALT (#65) #103

Closed
jmg421 wants to merge 1 commit into Bioconductor:devel from jmg421:fix/65-writeVcf-asterisk

jmg421 commented Jul 1, 2026

Summary

Partially fixes #65.

writeVcf() now correctly writes * (spanning deletion) alleles in multi-allele ALT fields instead of producing empty alleles that make the VCF malformed.

Problem

As reported in #65, writing a VCF with spanning deletions produced output like:

chr1  10150  .  CTA  CA,,TTA  ...

instead of:

chr1  10150  .  CTA  CA,*,TTA  ...

The empty allele caused IGV/htsjdk to reject the file with:

empty alleles are not permitted in VCF records

Root Cause

readVcf() converts * to empty strings at parse time (in .formatALT()), losing the distinction between * (spanning deletion) and . (no allele). At write time, these empty strings were collapsed into empty fields.

Fix

At write time (.makeVcfMatrix()), before collapsing ALT alleles with unstrsplit, restore * for empty strings that appear in multi-allele list elements (length > 1). An empty string alongside other alleles is unambiguously a spanning deletion per VCF spec. For ExpandedVCF objects (scalar ALT per row), empty strings also become * since expand() doesn't produce monomorphic rows.
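The write-time rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the code in this PR: `restore_spanning_deletion` and `collapse_alt` are hypothetical names, a plain list of character vectors stands in for the CharacterList held in ALT, and `paste(collapse = ",")` stands in for `unstrsplit`.

```r
# Hypothetical helper (not the package's code): put "*" back for empty
# strings, but only in elements that hold more than one allele.
restore_spanning_deletion <- function(alt) {
  lapply(alt, function(alleles) {
    # A lone "" is ambiguous ("*" vs "."), so it is left untouched here.
    if (length(alleles) > 1L) {
      alleles[!is.na(alleles) & alleles == ""] <- "*"
    }
    alleles
  })
}

# Stand-in for unstrsplit(): join each element's alleles with commas.
collapse_alt <- function(alt) {
  vapply(alt, paste, character(1), collapse = ",")
}

# Three records: a multi-allele site with a spanning deletion,
# an ordinary single-allele site, and the ambiguous lone "".
alt <- list(c("CA", "", "TTA"), "G", "")
collapsed <- collapse_alt(restore_spanning_deletion(alt))
print(collapsed)
```

In this sketch only the first record changes; the lone empty string in the third record passes through unchanged.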

Limitations

The ambiguous single-allele case (a sole * vs .) cannot be fully resolved at write time because the C-level parser erases the distinction. This PR conservatively writes . for that case. A complete fix would require changes to the C parsing layer as @hpages discussed in the issue thread.
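A small base R sketch of why the single-allele case cannot be recovered at write time. `parse_alt` is a hypothetical stand-in for the parse step, and the assumption that both `*` and `.` end up as `""` is taken from the root-cause description above rather than from the parser source.

```r
# Hypothetical stand-in for the parse step: split the ALT field on commas
# and map both "*" and "." to "" (assumed from the description above).
parse_alt <- function(field) {
  alleles <- strsplit(field, ",", fixed = TRUE)[[1]]
  alleles[alleles %in% c("*", ".")] <- ""
  alleles
}

sole_star <- parse_alt("*")
sole_dot  <- parse_alt(".")

# After parsing, the two inputs are indistinguishable.
ambiguous <- identical(sole_star, sole_dot)

# In a multi-allele field the empty slot can only have been "*".
multi <- parse_alt("CA,*,TTA")
print(list(ambiguous = ambiguous, multi = multi))
```

Because both inputs produce the same parsed value, any write-time rule has to pick one output for a lone empty string; this PR picks `.`.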

Testing

  • All existing writeVcf tests pass
  • Verified manually: CA,*,TTA is correctly round-tripped through read/write

Agentic tooling disclosure

This PR was produced with the assistance of Kiro CLI (Amazon's AI coding agent). All changes were reviewed and verified by the author.

…records (Bioconductor#65)

writeVcf() previously wrote empty alleles for spanning deletions in
multi-allele ALT fields (e.g., 'CA,,TTA' instead of 'CA,*,TTA').
This produced malformed VCF that caused IGV/htsjdk to error with
'empty alleles are not permitted in VCF records'.

Root cause: readVcf() converts '*' to empty strings at parse time
(in .formatALT), losing the distinction between '*' (spanning
deletion) and '.' (no allele). At write time, the empty strings
were either dropped or written as '.'.

Fix: at write time, restore '*' for empty strings that appear in
multi-allele list elements (length > 1), since an empty string
alongside other alleles is unambiguously a spanning deletion.
For ExpandedVCF (scalar ALT per row), empty strings also become '*'
since expand() doesn't produce monomorphic rows.

The ambiguous single-allele case (sole '*' vs '.') remains as '.'
for backward compat; a full fix would require changes at the C
parsing level as discussed in the issue thread.

Partially addresses Bioconductor#65

jmg421 commented Jul 1, 2026

Superseded by #104, which fixes the issue at read time (preserving '*' as a CharacterList element) rather than trying to reconstruct it at write time. The read-time fix is cleaner, handles all cases (including single '*' alleles), and aligns with how structural variants are already handled.

jmg421 closed this Jul 1, 2026

Development

Successfully merging this pull request may close these issues.

writeVcf Output Causes IGV Errors from htsjdk