
fix: preserve '*' spanning deletion alleles at read time (#65) #104

Open
jmg421 wants to merge 1 commit into Bioconductor:devel from jmg421:fix/65-preserve-asterisk-readtime

jmg421 commented Jul 1, 2026

Summary

Fully fixes #65. Supersedes #103, which only addressed the write path.

readVcf() now preserves * (spanning deletion) alleles instead of erasing them, enabling faithful VCF round-trips for files containing spanning deletions.

Problem

* in VCF ALT fields represents a spanning deletion allele (VCF spec §1.4.2). At read time, .formatALT() converted * to an empty string ("") to fit into DNAStringSet (since * is not a valid DNA character). At write time, this empty string produced malformed VCF output like CA,,TTA instead of CA,*,TTA, causing IGV/htsjdk to reject the file.
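The malformed output described above can be reproduced in base R without the package. This is only a sketch of the string-joining step: the vectors `alt_old` and `alt_new` are hypothetical stand-ins for one record's ALT alleles, not objects from the package.

```r
# ALT alleles for one record as they looked after the old read path,
# with the spanning deletion '*' already replaced by "".
alt_old <- c("CA", "", "TTA")

# The same alleles with '*' kept as plain character data.
alt_new <- c("CA", "*", "TTA")

# The VCF ALT column is a comma-separated list of alleles.
old_field <- paste(alt_old, collapse = ",")
new_field <- paste(alt_new, collapse = ",")

print(old_field)  # "CA,,TTA"  (contains an empty allele)
print(new_field)  # "CA,*,TTA"
```

Once `*` has been replaced by `""`, the writer has no way to tell a spanning deletion from a genuinely empty value, which is why the fix has to happen at read time.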

Root Cause (per @hpages' analysis in the issue thread)

* is not a DNA sequence — it's a special VCF allele that cannot be stored in DNAStringSet. The package's architecture stores ALT as either DNAStringSetList (normal alleles) or CharacterList (structural/special alleles). * should be in the latter category.

Fix

The core change is one line: grepl("*", x, fixed=TRUE) was added to .isStructural() in AllUtilities.R. As a result, a VCF containing * alleles is stored with a CharacterList ALT column (the same treatment as <DEL>, breakend notation such as ], etc.), which preserves * through read and write.

Also removed the now-unnecessary flat[grepl("*", flat, fixed=TRUE)] <- "" conversion in .formatALT().
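A minimal base R sketch of the kind of check described here. The function `is_special_alt()` is a hypothetical stand-in, not the package's actual .isStructural(), whose other conditions may differ; only the `grepl("*", x, fixed=TRUE)` term is taken from this PR.

```r
# Hypothetical stand-in for .isStructural(); the real function may test
# different patterns. Only the "*" check is taken from the PR description.
is_special_alt <- function(x) {
    grepl("<", x, fixed = TRUE) |   # symbolic alleles such as <DEL>
    grepl("[", x, fixed = TRUE) |   # breakend notation
    grepl("]", x, fixed = TRUE) |   # breakend notation
    grepl("*", x, fixed = TRUE)     # spanning deletion
}

alts <- c("CA", "*", "TTA", "<DEL>", "G]17:198982]")
flags <- is_special_alt(alts)

# If any allele is special, the ALT column cannot be stored as DNA
# and would be kept as character data instead.
use_character_list <- any(flags)

print(flags)               # FALSE TRUE FALSE TRUE TRUE
print(use_character_list)  # TRUE
```

Passing `fixed = TRUE` makes grepl() match `*` as a literal character rather than interpreting it as a regular-expression quantifier.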

Behavior Change

| Scenario | Before | After |
| --- | --- | --- |
| VCF without * | alt() returns DNAStringSetList | Same (unchanged) |
| VCF with * | alt() returns DNAStringSetList with "" for * | alt() returns CharacterList with "*" preserved |
| writeVcf with * | Empty alleles in output (malformed) | * correctly written |

This is consistent with how structural variants (<DEL>, breakends) are already handled — alt() is a CharacterList and the special alleles are preserved.

Testing

  • All existing readVcf and writeVcf tests pass
  • Verified round-trip: CA,*,TTA → read → write → CA,*,TTA
  • Verified single * allele round-trips correctly
  • VCFs without * still use DNAStringSetList (no change)

Agentic tooling disclosure

This PR was produced with the assistance of Kiro CLI (Amazon's AI coding agent). All changes were reviewed and verified by the author.

…r#65)

VCFs containing '*' (spanning deletion) alleles are now read with a
CharacterList ALT column, preserving the '*' character. Previously,
.formatALT() converted '*' to an empty string to fit into DNAStringSet,
but this made it impossible to write valid VCF output — IGV/htsjdk
would reject files with empty alleles.

The fix treats '*' as a structural/special allele (like '<DEL>'), which
causes .isStructural() to return TRUE and the ALT to be stored as a
CharacterList. This is semantically correct per the VCF spec: '*'
represents a spanning deletion and is not a DNA sequence.

For VCFs without '*' alleles, behavior is unchanged (DNAStringSetList).

This supersedes the write-path-only fix in PR Bioconductor#103, which could only
handle the multi-allele case. With this read-time fix, both single '*'
and multi-allele '*' are correctly round-tripped.

Closes Bioconductor#65
Development

Successfully merging this pull request may close these issues.

writeVcf Output Causes IGV Errors from htsjdk
