Skip to content

test: unify a shared flaky-rerun marker; fix nwis collection-time call#329

Merged
thodson-usgs merged 1 commit into
DOI-USGS:mainfrom
thodson-usgs:fix/nwis-flaky-rerun
Jun 23, 2026
Merged

test: unify a shared flaky-rerun marker; fix nwis collection-time call#329
thodson-usgs merged 1 commit into
DOI-USGS:mainfrom
thodson-usgs:fix/nwis-flaky-rerun

Conversation

@thodson-usgs

@thodson-usgs thodson-usgs commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Problem

The Python package CI on main failed (run 27969553989, on the #326 merge): lint/type-check and all Windows test jobs passed, but all three Ubuntu jobs failed with the same cause —

dataretrieval.exceptions.ServiceUnavailable: HTTP 503 (URL: https://waterservices.usgs.gov/nwis/site?...)

A transient 503 from the live legacy NWIS service hit the live tests in tests/nwis_test.py. Windows passed only by timing. Not a regression from #326 — the live tests and the offending call date to #62.

Two things made a transient outage fatal instead of self-healing:

  1. No rerun protection. waterdata_test.py / ngwmn_test.py retry transient 429/5xx via pytest.mark.flaky (test(waterdata): rerun flaky transient 5xx/429 from the chunked fan-out #325); nwis_test.py had none.
  2. A live call at collection time. TestTZ fetched its sites in the class body (sites, _ = what_sites(stateCd="MD")), so the 503 raised during collection (exit 2, Interrupted: 1 error during collection on 3.13). pytest-rerunfailures reruns failed tests, never collection errors.

What this does

Audited every test module for live-API calls (ran each behind a dead proxy: genuine calls fail, mocked ones pass). Then unified the rerun handling instead of adding a fourth inline copy:

  • One shared marker, conftest.flaky_api, with a single transient-pattern list. Applied to every live suite:
    • nwis_test.py, waterdata_test.py, ngwmn_test.py — module-level pytestmark = flaky_api.
    • utils_test.py::Test_query — class-level @flaky_api (the only other live suite the audit found; the rest of that module is mocked).
  • The status pattern now matches both message shapes — the OGC path's "<status>:" and the legacy query path's "HTTP <status>" — so one list serves all four.
  • Fixes a latent coverage gap: ngwmn's old marker omitted ServiceUnavailable and the chunked QuotaExhausted/ServiceInterrupted wrappers (test(waterdata): rerun flaky transient 5xx/429 from the chunked fan-out #325 added those to waterdata but missed ngwmn), so a 503 in an NGWMN live test would have failed CI too. It now inherits the full set.
  • Fixes the collection bug: TestTZ's class-body fetch moves into a class-scoped sites fixture, so it runs at test time where the marker can retry it.

Everything else is mocked, module-skipped (nadp), or — the chunking ..._on_real_transport test — a deterministic local http.server, not an external service.

Verification

  • nwis_test.py collection is now network-free (--collect-only makes no live call).
  • The shared marker attaches to all four suites with the unified config; ngwmn now covers ServiceUnavailable/ServiceInterrupted.
  • The pattern matches ServiceUnavailable: HTTP 503 and ... 503: but not a deterministic AssertionError.
  • Live Test_query + TestTZ pass; mocked suites unaffected.
  • The originally-failed CI run was re-run and is now green.

🤖 Generated with Claude Code

@thodson-usgs thodson-usgs changed the title test(nwis): rerun live tests on transient errors; fix collection-time call test: rerun live-API tests on transient errors (fix nwis collection-time call) Jun 23, 2026
The Ubuntu CI on main (run 27969553989, on the DOI-USGS#326 merge) failed: a transient
503 from the live legacy NWIS site service hit the live tests in nwis_test.py.
Windows passed by timing. Not a regression from DOI-USGS#326 — the live tests and the
offending call date to DOI-USGS#62. Two things made the transient fatal:

- nwis_test.py had no flaky-rerun marker (waterdata/ngwmn got one in DOI-USGS#325), and
- TestTZ fetched its sites in the class body (`sites, _ = what_sites(...)`), i.e.
  at collection time, where a 503 aborts the whole module and can never be reran
  (rerunfailures retries failed tests, not collection errors).

Changes:

- Add one shared marker, `conftest.flaky_api`, with a unified transient pattern
  list, and apply it to every live suite: nwis_test.py / waterdata_test.py /
  ngwmn_test.py (module-level) and utils_test.py::Test_query (class-level — the
  only other live suite, found by auditing each module behind a dead proxy).
  Replaces three drifted inline copies.
- The status pattern now matches both the OGC path's "<status>:" and the legacy
  query path's "HTTP <status>" message shapes, so one list serves all four. This
  closes a latent gap: ngwmn's old marker omitted ServiceUnavailable and the
  chunked QuotaExhausted/ServiceInterrupted wrappers (DOI-USGS#325 fixed waterdata but
  missed ngwmn), so a 503 in an ngwmn live test would have failed CI too.
- Move TestTZ's class-body fetch into a class-scoped `sites` fixture so it runs
  at test time, where the marker can retry a transient.

Everything else is mocked, module-skipped (nadp), or — the chunking
..._on_real_transport test — a deterministic local http.server, not external.

Verified: nwis collection is network-free; the markers attach with the unified
config; ngwmn now covers ServiceUnavailable/ServiceInterrupted; live Test_query
and TestTZ pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Sjb14HkwuCydKSKMsaXsgd
@thodson-usgs thodson-usgs force-pushed the fix/nwis-flaky-rerun branch from 095b964 to 704f734 Compare June 23, 2026 13:57
@thodson-usgs thodson-usgs changed the title test: rerun live-API tests on transient errors (fix nwis collection-time call) test: unify a shared flaky-rerun marker; fix nwis collection-time call Jun 23, 2026
@thodson-usgs thodson-usgs merged commit 5b4c1f0 into DOI-USGS:main Jun 23, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant