feat: add eqtl-catalogue-region-fetch skill (cis-QTL region tabix fetcher) #157

Draft
madaraviv wants to merge 1 commit into K-Dense-AI:main from madaraviv:add-eqtl-catalogue-region-fetch

What this PR adds

A standalone eqtl-catalogue-region-fetch skill: a tabix-on-FTP region fetcher for EBI's eQTL Catalogue v7+. It returns harmonised per-variant cis-QTL summary stats (variant_id, chromosome, position, ref, alt, beta, SE, p-value, MAF, molecular_trait_id, study_id) for any one (study × tissue × quantification) dataset in the catalogue, suitable for downstream colocalization, fine-mapping, regional plotting, or Mendelian randomisation.

Coverage gap this fills

The current K-Dense bio/genomics coverage doesn't include a region-level cis-QTL fetcher. database-lookup covers point queries to GTEx and eQTL Catalogue (per-variant) but not the per-region pulls needed for coloc / fine-mapping / regional plots. This skill complements that — point lookups stay in database-lookup; window-level summary-stats slices come here.

Why a tabix-on-FTP path (not REST)

The eQTL Catalogue REST API at /api/v2/datasets/{id}/associations silently truncates region requests to one side of the TSS: the pos_min / pos_max filters are ignored, and only the genomic-lower hits are returned (verified May 2026). The tabix-indexed .all.tsv.gz files on EBI's FTP contain the full strand-aware ±1 Mb cis-window per gene, as designed, so this skill uses that path. REST is kept only for dataset metadata.
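For reviewers unfamiliar with the tabix path, here is a minimal sketch of what a region pull can look like. It is not the skill's actual code: `sumstats_url` and `fetch_region` are hypothetical names, the URL pattern is the one quoted in the commit message, and the IDs in the usage note are placeholders. It also assumes the files name chromosomes without a `chr` prefix, which should be checked against the real data.

```python
from typing import Iterator

# URL pattern as quoted in the commit message:
#   .../sumstats/<QTS>/<QTD>/<QTD>.all.tsv.gz
FTP_ROOT = "https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats"


def sumstats_url(study_id: str, dataset_id: str) -> str:
    """Build the .all.tsv.gz URL for one dataset (hypothetical helper)."""
    return f"{FTP_ROOT}/{study_id}/{dataset_id}/{dataset_id}.all.tsv.gz"


def fetch_region(url: str, chrom: str, start: int, end: int) -> Iterator[list[str]]:
    """Yield tab-split rows for a 1-based inclusive window [start, end].

    pysam is imported lazily so the URL helper above works without it.
    """
    import pysam  # third-party; needs network access when given a remote URL

    with pysam.TabixFile(url) as tbx:
        # pysam's fetch() takes 0-based, half-open coordinates.
        for line in tbx.fetch(chrom, start - 1, end):
            yield line.split("\t")
```

Usage would be along the lines of `fetch_region(sumstats_url("QTS000000", "QTD000000"), "1", 1, 1_000_000)`, where both IDs are placeholders rather than real catalogue entries.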

Coverage of catalogue's quant methods

All expression-related QTL flavors the catalogue hosts: gene-level (ge), exon-level (exon), transcript-level (tx), transcript-event usage (txrev, txrevise tool), splicing (leafcutter), microarray, and allelic fold-change (aFC). pQTL data is not in this catalogue (UKB-PPP / deCODE are the canonical pQTL sources).

Layout

Follows the K-Dense convention (light SKILL.md frontmatter + workflow-doc body + references/ + scripts/ + tests/), modelled after pydeseq2/ and exa-search/:

scientific-skills/eqtl-catalogue-region-fetch/
├── SKILL.md                          # Overview, Trigger, Scope, Workflow,
│                                     # CLI Reference, Example Output,
│                                     # Gotchas, Safety, Agent Boundary
├── references/
│   ├── dataset_resolution.md         # how to resolve human description /
│   │                                 # OT studyId slug → QTD###### dataset_id
│   ├── quant_methods.md              # the 6 catalogue quant methods + per-
│   │                                 # method β interpretation + pitfalls
│   └── cis_window_biology.md         # strand-aware ±1 Mb of TSS, REST API
│                                     # truncation gotcha, FTP tabix path
├── scripts/
│   ├── eqtl_catalogue_region_fetch.py    # CLI: --input, --output, --demo,
│   │                                     # --list-demos. PEP 723 inline deps.
│   └── examples/
│       ├── sort1_gtex_minor_salivary_gland.json  # SORT1 1p13.3 LDL/CHD locus
│       ├── il6r_gtex_small_intestine.json        # IL6R Swerdlow 2012 MR-classic
│       └── irf5_gtex_adipose_visceral.json       # IRF5 Wang 2021 SLE eQTL
└── tests/
    └── test_eqtl_catalogue_region_fetch.py       # 18 stdlib unittest cases
                                                  # (pure helpers, row normaliser,
                                                  # demo loader, CLI plumbing) —
                                                  # no network, no extra deps

Cache writes to ~/.cache/eqtl_catalogue_region_fetch/ (overridable via EQTL_CATALOGUE_CACHE_DIR), so repeated calls skip the FTP round-trip.
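A rough sketch of how the cache location and a per-(dataset, region) cache key could be resolved. Only the default directory and the EQTL_CATALOGUE_CACHE_DIR override come from this PR; the function names and the file-naming scheme below are assumptions for illustration and may not match the script.

```python
import hashlib
import os
from pathlib import Path


def cache_dir() -> Path:
    """Return the cache directory, honouring EQTL_CATALOGUE_CACHE_DIR if set."""
    override = os.environ.get("EQTL_CATALOGUE_CACHE_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "eqtl_catalogue_region_fetch"


def cache_path(dataset_id: str, chrom: str, start: int, end: int) -> Path:
    """Map one (dataset, region) pair to a stable JSON file path.

    The naming scheme (dataset id plus a short hash) is hypothetical.
    """
    key = f"{dataset_id}:{chrom}:{start}-{end}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return cache_dir() / f"{dataset_id}_{digest}.json"
```

Hashing the full key keeps file names short and filesystem-safe while still giving distinct regions distinct files.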

Also adds an eQTL Catalogue Region Fetch entry under ## Scientific Databases & Data Access in docs/scientific-skills.md, next to Database Lookup.

Try it

# Run via uv (PEP 723 inline deps, no separate install)
uv run --with pysam,pandas,requests \
    scientific-skills/eqtl-catalogue-region-fetch/scripts/eqtl_catalogue_region_fetch.py \
    --demo sort1_gtex_minor_salivary_gland --output /tmp/sort1_run

# Or with an existing env that has pysam, pandas, requests
cd scientific-skills/eqtl-catalogue-region-fetch
python scripts/eqtl_catalogue_region_fetch.py --list-demos
python scripts/eqtl_catalogue_region_fetch.py \
    --demo sort1_gtex_minor_salivary_gland --output /tmp/sort1_run
ls /tmp/sort1_run/  # variants.tsv, manifest.yaml, report.md

Verifications

  • Cisco AI Defense Skill Scanner (skill-scanner scan scientific-skills/eqtl-catalogue-region-fetch --use-behavioral): SAFE / 0 findings
  • Unit tests: python -m unittest discover -s scientific-skills/eqtl-catalogue-region-fetch/tests -v → 18/18 passing (no network, stdlib only)
  • Smoke test: 2,833 SORT1 cis-eQTL variants in minor salivary gland (±500 kb of TSS); 3,276 IRF5 variants in visceral adipose (±500 kb of the OT coloc lead at chr7:128,937,250)
  • Network-only on first call per (dataset, region); subsequent calls hit the local cache
  • Python 3.10+ compatible
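The ±500 kb windows in the smoke tests can be reproduced with simple arithmetic. The sketch below uses hypothetical helper names (`tss`, `window_around`) and assumes 1-based inclusive coordinates. It clamps the lower bound at 1 but does not clamp to the chromosome end.

```python
def tss(gene_start: int, gene_end: int, strand: str) -> int:
    """Transcription start site: gene start on '+', gene end on '-'."""
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return gene_start if strand == "+" else gene_end


def window_around(position: int, flank: int = 500_000) -> tuple[int, int]:
    """Return a (start, end) window of +/- flank bp around a position."""
    if flank < 0:
        raise ValueError("flank must be non-negative")
    return max(1, position - flank), position + flank


# The IRF5 demo centres on the OT coloc lead at chr7:128,937,250.
start, end = window_around(128_937_250)
print(f"7:{start}-{end}")  # 7:128437250-129437250
```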

Test plan

  • Scripts run cleanly (--help, --list-demos, --demo against live FTP)
  • Unit tests pass locally (python -m unittest discover -s tests -v → 18/18)
  • Cisco scanner SAFE / 0 findings (skill-scanner scan ... --use-behavioral)
  • No __pycache__/, .tbi, .tsv.gz, or run-output dirs tracked (.gitignore in skill root)
  • Live smoke against the 3 bundled demos (reviewer step if desired)

Coming next: 3 more skills to enable coloc signal inspection

This is skill 1 of 4 in a suite enabling regional LocusCompare diagnostics — the visual sanity check that distinguishes a real shared causal variant from an LD artefact, and the gold standard for genetic target support in drug discovery and drug development. The other three (separate PRs):

  • gwas-catalog-region-fetch — tabix on GWAS Catalog harmonised summary stats (disease/trait outcome side)
  • ld-1000g-region-compute — plink2 r² against 1000G Phase 3 GRCh38 (LD coloring)
  • locuscompare-region-render — orchestrator composing the three primitives + GENCODE gene track into the canonical 4-panel LocusCompare plot (Liu 2019 Nat Methods) + JSON manifest

This PR is self-contained; the orchestrator depends on the three primitives, but each primitive is independently useful for non-LocusCompare asks.

Sources & licensing

  • eQTL Catalogue v7+ (Kerimov 2021 Nat Genet 53:1290), CC-BY-4.0
  • Per-study attribution: original publication of each constituent dataset
  • Skill code: MIT

🤖 Generated with Claude Code

…ix fetcher for eQTL Catalogue v7+

Standalone skill that pulls per-variant cis-QTL summary statistics for a
chromosomal window from EBI's eQTL Catalogue, returning harmonised rows
(variant_id, chromosome, position, ref, alt, beta, SE, p-value, MAF,
molecular_trait_id, study_id) suitable for downstream colocalization,
fine-mapping, regional plotting, or Mendelian randomisation.

Why this exists:
The eQTL Catalogue REST API at /api/v2/datasets/{id}/associations silently
truncates region requests to one side of the TSS (verified May 2026: the
`pos_min` / `pos_max` filters are ignored, returning only genomic-lower
hits). This skill uses the supported tabix-on-FTP path against the
harmonised .all.tsv.gz files, which contain the full +/- 1 Mb of
strand-aware cis-window per gene as designed.

Source layout:
- Tabix range-fetch from
  https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/<QTS>/<QTD>/<QTD>.all.tsv.gz
- REST API used only for dataset metadata (study_id, tissue_label,
  quant_method) — never for region pulls.
- Coordinates: GRCh38 throughout; effect allele is `alt`, beta is per-copy
  of `alt`.

Covers all expression-related QTL flavors the catalogue hosts: gene-level
(`ge`), exon-level (`exon`), transcript-level (`tx`), transcript-event
usage (`txrev`, txrevise tool), splicing (`leafcutter`), microarray, and
allelic fold-change (`aFC`). pQTL data is NOT in this catalogue (UKB-PPP
and deCODE are the canonical pQTL sources).

Layout follows the K-Dense pattern (light SKILL.md frontmatter +
workflow-doc body + scripts/ + references/ + tests/), modelled after
pydeseq2 and exa-search:

  scientific-skills/eqtl-catalogue-region-fetch/
    SKILL.md                          # workflow doc (Overview, Trigger,
                                      # Scope, Workflow, CLI Reference,
                                      # Example Output, Gotchas, Safety,
                                      # Agent Boundary)
    references/
      dataset_resolution.md           # how to resolve a human description
                                      # or OT studyId slug into the
                                      # canonical QTD###### dataset_id
      quant_methods.md                # the 6 catalogue quant methods (ge,
                                      # tx, txrev, exon, leafcutter,
                                      # microarray) and per-method beta
                                      # interpretation + pitfalls
      cis_window_biology.md           # strand-aware +/- 1 Mb of TSS;
                                      # genomic-coord asymmetry; REST API
                                      # truncation gotcha; tabix-on-FTP is
                                      # the only correct fetch path
    scripts/
      eqtl_catalogue_region_fetch.py  # standard CLI: --input <config>
                                      # --output <dir> --demo --list-demos
                                      # PEP 723 inline deps so it runs via
                                      # `uv run --with pysam,pandas,requests`
      examples/                       # 3 biology demos:
        sort1_gtex_minor_salivary_gland.json   # SORT1 1p13.3 LDL/CHD
        il6r_gtex_small_intestine.json         # IL6R Swerdlow 2012 MR
        irf5_gtex_adipose_visceral.json        # IRF5 Wang 2021 SLE
        default.json input.json                # convenience aliases
    tests/
      test_eqtl_catalogue_region_fetch.py      # 18 stdlib unittest cases:
                                               # pure helpers, row
                                               # normaliser, demo loader,
                                               # CLI plumbing. No network.

Cache directory: per-region JSON cache at
~/.cache/eqtl_catalogue_region_fetch/ (overridable via
$EQTL_CATALOGUE_CACHE_DIR), so repeated calls hit disk instead of FTP.

Catalog: also adds an `eQTL Catalogue Region Fetch` entry under
`## Scientific Databases & Data Access` in `docs/scientific-skills.md`.

Sources Green-with-attribution:
- eQTL Catalogue v7+ (Kerimov 2021, Nat Genet 53:1290), CC-BY-4.0
- Per-study attribution: original publication of each constituent dataset

Verified:
- Cisco AI Defense Skill Scanner (skill-scanner scan --use-behavioral):
  SAFE / 0 findings
- Unit tests: `python -m unittest discover -s tests -v` -> 18/18 passing
- Smoke test: 2833 SORT1 cis-eQTL variants in minor salivary gland from
  +/- 500 kb of TSS; 3276 IRF5 variants in visceral adipose from +/- 500
  kb of the OT coloc lead at chr7:128937250

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewed-by: Aviv Madar <madaraviv@gmail.com>