fix: predictCoding uses CDS-mapped width for exon/intron boundary deletions (#83) #97
Open
jmg421 wants to merge 4 commits into
Bioconductor#81) Large INDELs (e.g. 2542bp deletions) that span multiple exons were silently dropped by locateVariants() and predictCoding() because the internal call to mapToTranscripts() uses type='within', which requires the variant to fit entirely within a single exon/CDS element.

Changes:
- locateVariants (.makeResult): after mapToTranscripts, detect variants that overlap the transcript (type='any') but were not mapped. These are now included in results with LOCATION='coding' and LOCSTART/LOCEND set to NA. A warning is emitted identifying the rescued variants.
- predictCoding (.localCoordinates): emit a warning when variants overlap CDS regions but cannot be mapped to transcript coordinates, directing users to locateVariants() for identification.

This is the minimal fix that prevents silent data loss. Full amino acid consequence prediction for multi-exon spanning variants would require changes to GenomicFeatures::mapToTranscripts itself.

Fixes Bioconductor#81
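The difference between the two overlap modes can be illustrated in plain base R, without the Bioconductor classes. This is only a sketch of the interval logic: the exon coordinates and the helper names `fits_within` and `overlaps_any` are hypothetical, and the real matching is done by mapToTranscripts().

```r
# Hypothetical exon coordinates for one transcript (1-based, inclusive).
exon_start <- c(1000L, 2000L, 3500L)
exon_end   <- c(1200L, 2300L, 3900L)

# A 2542 bp deletion that starts in exon 1 and ends in exon 3.
var_start <- 1100L
var_end   <- var_start + 2542L - 1L

# "within": the variant must fit entirely inside a single exon.
fits_within <- function(vs, ve, es, ee) any(vs >= es & ve <= ee)

# "any": the variant only needs to share at least one base with some exon.
overlaps_any <- function(vs, ve, es, ee) any(vs <= ee & ve >= es)

within_hit <- fits_within(var_start, var_end, exon_start, exon_end)
any_hit    <- overlaps_any(var_start, var_end, exon_start, exon_end)

# Variants that overlap but do not fit within one exon are the ones
# that were previously dropped without a warning.
rescued <- any_hit && !within_hit
print(c(within = within_hit, any = any_hit, rescued = rescued))
```

In this toy case the deletion overlaps the transcript but fits inside no single exon, which is the situation the rescue step is meant to catch.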
expand() on a CollapsedVCF was calling mcols(rdexp) <- NULL after expanding rowRanges, which wiped all user-added metadata columns (e.g. SNP_name, num_alts). The fast path (no multi-allelic sites) already preserved these columns correctly.

Fix: simply expand rd[idx, ] without wiping mcols. The VCF() constructor accepts rowRanges with extra mcols; they are stored in the rowRanges slot alongside paramRangeID, separate from the fixed fields (REF/ALT/QUAL/FILTER).

Fixes Bioconductor#85
…onductor#84) A dinucleotide base substitution (DBS) spanning a codon boundary that produces a stop codon in one of the two affected amino acids (e.g. VARAA='P*') was misclassified as 'nonsynonymous' instead of 'nonsense'.

Root cause: the nonsense check used `as.character(varAA) %in% "*"`, which only matches when the entire VARAA string is '*'. For multi-codon DBS, VARAA is multi-character (e.g. 'P*') and the check fails.

Fix: use `grepl("*", ..., fixed=TRUE)` to detect stop codons anywhere in the translated variant amino acid sequence. Any premature stop truncates the protein regardless of flanking residues.

Fixes Bioconductor#84
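The behavioural difference between the two checks can be seen on a few strings in base R. The `var_aa` values here are illustrative only and are not taken from the issue.

```r
# VARAA values as they might come out of translation (illustrative only).
var_aa <- c("P*", "*", "PL")

# Old check: TRUE only when the whole string is exactly "*".
old_nonsense <- var_aa %in% "*"

# New check: TRUE when a stop codon appears anywhere in the string.
# fixed = TRUE treats "*" as a literal character, not a regex quantifier.
new_nonsense <- grepl("*", var_aa, fixed = TRUE)

print(data.frame(var_aa, old_nonsense, new_nonsense))
```

Only the new check flags 'P*', while both agree on '*' and 'PL'.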
…etions (Bioconductor#83) A deletion starting in an exon and extending into the intron produced incorrect REFCODON/VARCODON because .getRefCodons() and the frameshift calculation used the genomic width of the variant rather than the transcript-space (CDSLOC) width. For example, a 51bp genomic deletion with only 37bp overlapping the CDS would compute cend using 51, causing the reference codon to extend incorrectly into the next exon's sequence.

Fix: replace width(txlocal) (genomic width) with width(mcols(txlocal)$CDSLOC) (CDS-mapped width) in both:
- .getRefCodons(): codon boundary calculation
- frameshift detection: refwidth for length-change check

Fixes Bioconductor#83
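The effect on the length-change check can be worked through with the 51bp/37bp numbers above. This is a simplified sketch: it assumes a pure deletion and a hypothetical helper `is_frameshift`, whereas the package compares reference and alternate widths.

```r
# Numbers from the example above: a 51 bp genomic deletion of which
# only 37 bp overlap the CDS.
genomic_width <- 51L
cds_width     <- 37L

# Simplified rule (assumption): a pure deletion shifts the reading frame
# when the number of coding bases removed is not a multiple of 3.
is_frameshift <- function(deleted_coding_bases) {
  deleted_coding_bases %% 3L != 0L
}

by_genomic_width <- is_frameshift(genomic_width)  # 51 %% 3 is 0
by_cds_width     <- is_frameshift(cds_width)      # 37 %% 3 is 1

print(c(genomic = by_genomic_width, cds = by_cds_width))
```

Under this simplified rule the genomic width makes the deletion look in-frame, while the CDS-mapped width shows it removes a number of coding bases that is not divisible by 3.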
Summary
A deletion that starts inside an exon and extends into the adjacent intron produced incorrect `REFCODON`/`VARCODON` values because the codon extraction logic used the genomic width of the variant instead of the transcript-space width (CDS overlap only).

Example from the issue
`predictCoding` reported a 54-character REFCODON spanning into the next exon and a 3-character VARCODON extending into intron sequence; both are wrong.

Root Cause
Two places in `methods-predictCoding.R` used `width(txlocal)` (the genomic width of the variant) when they should use `width(mcols(txlocal)$CDSLOC)` (the width of the variant as mapped to transcript/CDS coordinates):
- `.getRefCodons()`: the `cend` calculation extends past the actual CDS overlap
- frameshift detection: `refwidth` is inflated by intronic bases, causing an incorrect mod-3 check

Fix
Replace `width(txlocal)` with `width(mcols(txlocal)$CDSLOC)` in both locations. The CDSLOC IRanges represents the variant's position and extent in transcript coordinates (exons only, introns excluded), which is what the codon arithmetic needs.

Fixes #83
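To make the codon arithmetic concrete, here is a base R sketch of how the choice of width moves the codon end. The CDS start position of 100 and the helper `codon_span` are hypothetical, the widths 51 and 37 come from the commit message example, and the package's own `.getRefCodons()` arithmetic may differ in detail.

```r
# Hypothetical CDS-space start of the deletion (1-based); not from the issue.
cds_start <- 100L

# Expand a CDS interval to whole codons (simplified illustration).
codon_span <- function(start, width) {
  end    <- start + width - 1L
  cstart <- ((start - 1L) %/% 3L) * 3L + 1L   # first base of the first codon
  cend   <- ((end - 1L) %/% 3L) * 3L + 3L     # last base of the last codon
  c(cstart = cstart, cend = cend, width = cend - cstart + 1L)
}

with_genomic_width <- codon_span(cds_start, 51L)  # intronic bases counted
with_cds_width     <- codon_span(cds_start, 37L)  # CDS overlap only

print(rbind(with_genomic_width, with_cds_width))
```

With these assumed inputs, the genomic width pushes the codon end 12 bases past where the CDS-mapped width places it. In transcript coordinates those extra bases belong to the next exon, which matches the kind of overrun described above.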