feat: add eqtl-catalogue-region-fetch skill (cis-QTL region tabix fetcher) #157
Draft
madaraviv wants to merge 1 commit into
…ix fetcher for eQTL Catalogue v7+
Standalone skill that pulls per-variant cis-QTL summary statistics for a
chromosomal window from EBI's eQTL Catalogue, returning harmonised rows
(variant_id, chromosome, position, ref, alt, beta, SE, p-value, MAF,
molecular_trait_id, study_id) suitable for downstream colocalization,
fine-mapping, regional plotting, or Mendelian randomisation.
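
As an illustration of the harmonised row shape listed above, here is a minimal sketch of a row normaliser. It is not the skill's actual code: the raw column names (`variant`, `se`, `pvalue`, ...) are an assumption about the TSV header and should be checked against the real file, and the output key names and the `normalise_row` helper are hypothetical.

```python
from typing import Callable, Dict, Optional

# ASSUMPTION: raw header names of the harmonised TSV; verify against the file.
# Output names on the right are hypothetical, chosen to mirror the field list.
RAW_TO_OUT = {
    "variant": "variant_id",
    "chromosome": "chromosome",
    "position": "position",
    "ref": "ref",
    "alt": "alt",
    "beta": "beta",
    "se": "se",
    "pvalue": "pvalue",
    "maf": "maf",
    "molecular_trait_id": "molecular_trait_id",
}
CASTERS: Dict[str, Callable[[str], object]] = {
    "position": int,
    "beta": float,
    "se": float,
    "pvalue": float,
    "maf": float,
}


def _parse(value: str, caster: Callable[[str], object]) -> Optional[object]:
    """Cast a raw string, mapping empty / NA markers to None."""
    return None if value in ("", "NA", "nan") else caster(value)


def normalise_row(raw: Dict[str, str], study_id: str) -> Dict[str, object]:
    """Map one raw TSV row (as a dict of strings) to the harmonised shape."""
    out: Dict[str, object] = {}
    for raw_name, out_name in RAW_TO_OUT.items():
        out[out_name] = _parse(raw[raw_name], CASTERS.get(out_name, str))
    out["study_id"] = study_id  # carried alongside, not read from the row
    return out
```

Keeping the casting table separate from the renaming table makes it easy to adjust either one once the real header has been inspected.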
Why this exists:
The eQTL Catalogue REST API at /api/v2/datasets/{id}/associations silently
truncates region requests to one side of the TSS (verified May 2026: the
`pos_min` / `pos_max` filters are ignored, returning only genomic-lower
hits). This skill uses the supported tabix-on-FTP path against the
harmonised .all.tsv.gz files, which contain the full strand-aware
+/- 1 Mb cis-window per gene, as designed.
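
To make "strand-aware" concrete, here is a minimal sketch of how such a window can be computed. It assumes 1-based inclusive GRCh38 coordinates and that the TSS is the gene start on the `+` strand and the gene end on the `-` strand; the `cis_window` helper is hypothetical and is not taken from the skill.

```python
from typing import Tuple


def cis_window(gene_start: int, gene_end: int, strand: str,
               flank: int = 1_000_000) -> Tuple[int, int]:
    """Return (start, end) of the +/- flank window centred on the TSS.

    Coordinates are assumed 1-based inclusive. The TSS is the gene start
    for '+' strand genes and the gene end for '-' strand genes, so the
    window is symmetric around the TSS but not around the gene body.
    """
    if strand not in {"+", "-"}:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    tss = gene_start if strand == "+" else gene_end
    # Clamp at 1 so windows near the chromosome start stay valid.
    return max(1, tss - flank), tss + flank
```

A request that only returns hits on the genomic-lower side of the TSS would cover roughly the first half of this interval, which is the truncation described above.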
Source layout:
- Tabix range-fetch from
https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/<QTS>/<QTD>/<QTD>.all.tsv.gz
- REST API used only for dataset metadata (study_id, tissue_label,
quant_method) — never for region pulls.
- Coordinates: GRCh38 throughout; effect allele is `alt`, beta is per-copy
of `alt`.
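
The source layout above can be sketched roughly as follows. This is an illustrative outline, not the skill's implementation: the helper names are hypothetical, the IDs in the usage note are placeholders, and the chromosome naming used inside the files (`7` versus `chr7`) is an assumption to verify. The only pysam calls used are `pysam.TabixFile(...)`, `.fetch(region=...)` and `.close()`.

```python
from typing import Iterator, List

FTP_ROOT = "https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats"


def dataset_url(study_id: str, dataset_id: str) -> str:
    """Build the .all.tsv.gz URL from a QTS study ID and a QTD dataset ID."""
    return f"{FTP_ROOT}/{study_id}/{dataset_id}/{dataset_id}.all.tsv.gz"


def region_around(chrom: str, pos: int, flank: int) -> str:
    """Format a 1-based tabix region string of +/- flank around pos."""
    start = max(1, pos - flank)
    return f"{chrom}:{start}-{pos + flank}"


def fetch_region(url: str, region: str) -> Iterator[List[str]]:
    """Yield tab-split rows overlapping region from a remote tabix file.

    pysam is imported lazily so the pure helpers above work without it.
    Note that htslib may write a local copy of the .tbi index into the
    working directory when opening a remote file.
    """
    import pysam  # third-party; needs network access when called

    tbx = pysam.TabixFile(url)
    try:
        for line in tbx.fetch(region=region):
            yield line.split("\t")
    finally:
        tbx.close()
```

For example, `fetch_region(dataset_url("QTS000001", "QTD000001"), region_around("7", 128937250, 500_000))` would stream the rows in a 1 Mb window around the lead position, with `beta` read as the per-copy effect of `alt`.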
Covers all expression-related QTL flavors the catalogue hosts: gene-level
(`ge`), exon-level (`exon`), transcript-level (`tx`), transcript-event
usage (`txrev`, txrevise tool), splicing (`leafcutter`), microarray, and
allelic fold-change (`aFC`). pQTL data is NOT in this catalogue (UKB-PPP
and deCODE are the canonical pQTL sources).
Layout follows the K-Dense pattern (light SKILL.md frontmatter +
workflow-doc body + scripts/ + references/ + tests/), modelled after
pydeseq2 and exa-search:
scientific-skills/eqtl-catalogue-region-fetch/
  SKILL.md                                   # workflow doc (Overview, Trigger,
                                             # Scope, Workflow, CLI Reference,
                                             # Example Output, Gotchas, Safety,
                                             # Agent Boundary)
  references/
    dataset_resolution.md                    # how to resolve a human description
                                             # or OT studyId slug into the
                                             # canonical QTD###### dataset_id
    quant_methods.md                         # the 6 catalogue quant methods (ge,
                                             # tx, txrev, exon, leafcutter,
                                             # microarray) and per-method beta
                                             # interpretation + pitfalls
    cis_window_biology.md                    # strand-aware +/- 1 Mb of TSS;
                                             # genomic-coord asymmetry; REST API
                                             # truncation gotcha; tabix-on-FTP is
                                             # the only correct fetch path
  scripts/
    eqtl_catalogue_region_fetch.py           # standard CLI: --input <config>
                                             # --output <dir> --demo --list-demos
                                             # PEP 723 inline deps so it runs via
                                             # `uv run --with pysam,pandas,requests`
    examples/                                # 3 biology demos:
      sort1_gtex_minor_salivary_gland.json   # SORT1 1p13.3 LDL/CHD
      il6r_gtex_small_intestine.json         # IL6R Swerdlow 2012 MR
      irf5_gtex_adipose_visceral.json        # IRF5 Wang 2021 SLE
      default.json input.json                # convenience aliases
  tests/
    test_eqtl_catalogue_region_fetch.py      # 18 stdlib unittest cases:
                                             # pure helpers, row
                                             # normaliser, demo loader,
                                             # CLI plumbing. No network.
Cache directory: per-region JSON cache at
~/.cache/eqtl_catalogue_region_fetch/ (overridable via
$EQTL_CATALOGUE_CACHE_DIR), so repeated calls hit disk instead of FTP.
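
A per-region JSON cache of this kind could look roughly like the sketch below. Only the default directory and the `EQTL_CATALOGUE_CACHE_DIR` override come from the description above; the file-naming scheme and the `cache_path` / `cached_fetch` helpers are hypothetical.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import Callable, List


def cache_dir() -> Path:
    """Resolve the cache directory, honouring the env-var override."""
    override = os.environ.get("EQTL_CATALOGUE_CACHE_DIR")
    base = (Path(override) if override
            else Path.home() / ".cache" / "eqtl_catalogue_region_fetch")
    base.mkdir(parents=True, exist_ok=True)
    return base


def cache_path(dataset_id: str, region: str) -> Path:
    """One JSON file per (dataset, region); the naming is an assumption."""
    key = f"{dataset_id}|{region}".encode()
    digest = hashlib.sha256(key).hexdigest()[:16]
    return cache_dir() / f"{dataset_id}_{digest}.json"


def cached_fetch(dataset_id: str, region: str,
                 fetcher: Callable[[str, str], List[dict]]) -> List[dict]:
    """Return cached rows if present, else call fetcher and store the result."""
    path = cache_path(dataset_id, region)
    if path.exists():
        return json.loads(path.read_text())
    rows = fetcher(dataset_id, region)
    # Write to a temp file and rename, so an interrupted run cannot
    # leave a half-written cache entry behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(rows))
    tmp.replace(path)
    return rows
```

Hashing the region into the file name avoids characters such as `:` in paths while keeping the dataset ID readable when browsing the cache directory.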
Catalog: also adds an `eQTL Catalogue Region Fetch` entry under
`## Scientific Databases & Data Access` in `docs/scientific-skills.md`.
Sources Green-with-attribution:
- eQTL Catalogue v7+ (Kerimov 2021, Nat Genet 53:1290), CC-BY-4.0
- Per-study attribution: original publication of each constituent dataset
Verified:
- Cisco AI Defense Skill Scanner (skill-scanner scan --use-behavioral):
SAFE / 0 findings
- Unit tests: `python -m unittest discover -s tests -v` -> 18/18 passing
- Smoke test: 2833 SORT1 cis-eQTL variants in minor salivary gland from
+/- 500 kb of TSS; 3276 IRF5 variants in visceral adipose from +/- 500
kb of the OT coloc lead at chr7:128937250
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewed-by: Aviv Madar <madaraviv@gmail.com>
What this PR adds
A standalone `eqtl-catalogue-region-fetch` skill: a tabix-on-FTP region fetcher for EBI's eQTL Catalogue v7+. Returns harmonised per-variant cis-QTL summary stats (variant_id, chromosome, position, ref, alt, beta, SE, p-value, MAF, molecular_trait_id, study_id) for any one (study × tissue × quantification) dataset in the catalogue, suitable for downstream colocalization, fine-mapping, regional plotting, or Mendelian randomisation.

Coverage gap this fills
The current K-Dense bio/genomics coverage doesn't include a region-level cis-QTL fetcher. `database-lookup` covers point queries to GTEx and eQTL Catalogue (per-variant) but not the per-region pulls needed for coloc / fine-mapping / regional plots. This skill complements that: point lookups stay in `database-lookup`; window-level summary-stats slices come here.

Why a tabix-on-FTP path (not REST)
The eQTL Catalogue REST API at `/api/v2/datasets/{id}/associations` silently truncates region requests to one side of the TSS: the `pos_min` / `pos_max` filters are ignored, returning only the genomic-lower hits (verified May 2026). The tabix-indexed `.all.tsv.gz` files on EBI's FTP contain the full strand-aware ±1 Mb cis-window per gene, as designed; this skill uses that path. REST is kept only for dataset metadata.

Coverage of catalogue's quant methods
All expression-related QTL flavors the catalogue hosts: gene-level (`ge`), exon-level (`exon`), transcript-level (`tx`), transcript-event usage (`txrev`, txrevise tool), splicing (`leafcutter`), microarray, and allelic fold-change (`aFC`). pQTL data is not in this catalogue (UKB-PPP / deCODE are the canonical pQTL sources).

Layout
Follows the K-Dense convention (light SKILL.md frontmatter + workflow-doc body + `references/` + `scripts/` + `tests/`), modelled after `pydeseq2/` and `exa-search/`.

Cache writes to `~/.cache/eqtl_catalogue_region_fetch/` (overridable via `EQTL_CATALOGUE_CACHE_DIR`), so repeated calls skip the FTP round-trip.

Also adds an `eQTL Catalogue Region Fetch` entry under `## Scientific Databases & Data Access` in `docs/scientific-skills.md`, next to `Database Lookup`.

Try it
Verifications
- Cisco AI Defense Skill Scanner (`skill-scanner scan scientific-skills/eqtl-catalogue-region-fetch --use-behavioral`): SAFE / 0 findings
- `python -m unittest discover -s scientific-skills/eqtl-catalogue-region-fetch/tests -v` → 18/18 passing (no network, stdlib only)

Test plan
- CLI smoke test (`--help`, `--list-demos`, `--demo` against live FTP)
- Unit tests (`python -m unittest discover -s tests -v` → 18/18)
- Skill scanner (`skill-scanner scan ... --use-behavioral`)
- No `__pycache__/`, `.tbi`, `.tsv.gz`, or run-output dirs tracked (`.gitignore` in skill root)

Coming next: 3 more skills to enable coloc signal inspection
This is skill 1 of 4 in a suite enabling regional LocusCompare diagnostics: the visual sanity check that distinguishes a real shared causal variant from an LD artefact, and the gold standard for genetic target support in drug discovery and drug development. The other three (separate PRs):
- `gwas-catalog-region-fetch`: tabix on GWAS Catalog harmonised summary stats (disease/trait outcome side)
- `ld-1000g-region-compute`: plink2 r² against 1000G Phase 3 GRCh38 (LD coloring)
- `locuscompare-region-render`: orchestrator composing the three primitives + GENCODE gene track into the canonical 4-panel LocusCompare plot (Liu 2019 Nat Genet) + JSON manifest

This PR is self-contained; the orchestrator depends on the three primitives, but each primitive is independently useful for non-LocusCompare asks.
Sources & licensing
🤖 Generated with Claude Code