Skip to content

fix: quote-aware VCF/BCF header key=value parser#70

Open
jmg421 wants to merge 1 commit into
Bioconductor:develfrom
jmg421:devel
Open

fix: quote-aware VCF/BCF header key=value parser#70
jmg421 wants to merge 1 commit into
Bioconductor:develfrom
jmg421:devel

Conversation

@jmg421

@jmg421 jmg421 commented Jun 12, 2026

Copy link
Copy Markdown

Problem

scanBcfHeader() fails to parse structured VCF/BCF header lines whose quoted field values contain commas or = signs. Two cascading regex bugs in .bcfHeaderAsSimpleList():

  1. Comma split — the lookahead ',(?=Description=".*?")' uses lazy .*? which still breaks when the quoted value contains = (e.g. Description="IDs [VRS version=2.0.1]")
  2. Key=value split(?<=[ID|Number|Type|Description])= uses [...] (character class, matches single chars I,D,|,…) instead of alternation, so it splits on = signs inside quoted values too

Reproducing examples from the tracker:

# VariantAnnotation#89
INFO=<ID=VRS_Allele_IDs,Number=R,Type=String,Description="The computed identifiers [VRS version=2.0.1;VRS-Python version=2.1.1]">

# VariantAnnotation#80
GATKCommandLine=<ID=SelectVariants,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false">

Fix

Replace both strsplit calls with .splitKeyVals(), a character-by-character quote-aware scanner that:

  • Splits on commas only outside double-quoted strings
  • Splits each field on the first = only (key names never contain =)

The helper is hoisted to package scope (was previously inlined) so unit tests and downstream packages can access it via Rsamtools:::.splitKeyVals.

Changes

  • R/methods-BcfFile.R — replace regex split with .splitKeyVals()
  • inst/unitTests/test_BcfFile.R — 3 regression tests covering both issues + commas-in-quotes
  • NEWS — changelog entry under v2.18

References

The previous regex-based split in .bcfHeaderAsSimpleList used:
  strsplit(..., ',(?=Description=".*?")', perl=TRUE)
  strsplit(..., '(?<=[ID|Number|Type|Description])=', perl=TRUE)

Both regexes fail when quoted Description (or other) values contain
commas or '=' signs, e.g.:
  Description="IDs [VRS version=2.0.1;VRS-Python version=2.1.1]"
  CommandLineOptions="analysis_type=SelectVariants input_file=[]"

Replace with .splitKeyVals(), a character-by-character quote-aware
scanner that:
  1. Splits on commas only outside double-quoted strings
  2. Splits each field on the FIRST '=' only (keys never contain '=')

The helper is hoisted to package scope so unit tests can access it
via Rsamtools:::.splitKeyVals.

Fixes VariantAnnotation#89, VariantAnnotation#80
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant