ci: add DBR LTS install check to catch ES-1960554-class regressions #843

Open
vikrantpuppala wants to merge 1 commit into main from ci/dbr-lts-install-check

@vikrantpuppala

Contributor

What

Adds a DBR LTS Install CI check that builds the connector wheel and installs it inside real DBR LTS clusters (via the PECO workspace Jobs API — no PyPI publish needed), then runs a SELECT 1 smoke test. Matrix = supported LTS {13.3, 14.3, 15.4, 16.4, 17.3} × install target {base, pyarrow, kernel}.
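To make the size of the matrix concrete, here is a minimal sketch of how the legs could be enumerated. The helper names are hypothetical, and the mapping from a non-base install target to a pip extra of the same name is an assumption, not something taken from the workflow itself.

```python
from itertools import product

# Values taken from the PR description.
LTS_VERSIONS = ["13.3", "14.3", "15.4", "16.4", "17.3"]
INSTALL_TARGETS = ["base", "pyarrow", "kernel"]


def matrix_legs():
    """Return one dict per (DBR LTS version, install target) combination."""
    return [
        {"dbr": dbr, "target": target}
        for dbr, target in product(LTS_VERSIONS, INSTALL_TARGETS)
    ]


def pip_requirement(wheel_path, target):
    """Build the pip argument for a leg (hypothetical helper).

    Assumption: "base" installs the bare wheel and any other target
    maps to an extra of the same name.
    """
    return wheel_path if target == "base" else f"{wheel_path}[{target}]"


if __name__ == "__main__":
    legs = matrix_legs()
    print(len(legs))  # 5 versions x 3 targets = 15 legs
```

Under these assumptions the full matrix is 15 cluster runs, which is why the gating described later matters.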

Also adds an incident-linked comment on the thrift pin in pyproject.toml so nobody re-widens it before the upstream fix ships.

Why

The thrift 0.23.0 bump (PR #796, shipped in 4.2.7) broke pip install on DBR LTS — SEV0 ES-1960554. thrift ships sdist-only, and 0.23.0's setup.py calls sys.exit(0) on the build-success path, killing the PEP 517 backend before pip writes output.json. On the old setuptools shipped by DBR 14.3/15.4 LTS this is a hard install failure. 4.2.7 was yanked and the bump reverted (PR #840).
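The failure mode can be reproduced in miniature. The sketch below is a simplified stand-in for a hook runner that calls a build hook in a subprocess and then records the result in output.json; it is not pip's actual code, and `build_hook` and `run_hook_in_subprocess` are hypothetical names. It only illustrates why an exit code of 0 can still leave the result file missing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Child process: call the hook, then write its result to output.json.
# This is an illustrative stand-in, not pip's real hook runner.
CHILD = """
import json, sys

def build_hook():
    # Mimics a setup.py that exits "successfully" instead of returning.
    sys.exit(0)

result = build_hook()               # SystemExit propagates from here...
with open(sys.argv[1], "w") as fh:  # ...so this write never runs.
    json.dump({"return_val": result}, fh)
"""


def run_hook_in_subprocess():
    """Return (exit code, whether output.json was written)."""
    with tempfile.TemporaryDirectory() as tmp:
        output = Path(tmp) / "output.json"
        proc = subprocess.run([sys.executable, "-c", CHILD, str(output)])
        return proc.returncode, output.exists()


if __name__ == "__main__":
    print(run_hook_in_subprocess())  # (0, False)
```

The child exits cleanly, yet the caller finds no output.json to read, which matches the shape of the incident error described above.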

Our CI never caught it because every job installs via poetry install on a modern runner — it never does a fresh pip install of the built wheel on an LTS toolchain, which is the real customer path that failed. This PR closes exactly that gap.

How it works

Per matrix leg, scripts/dbr_lts_install_check.py (driver, runs on the GH runner):

  1. builds the wheel and uploads it to a UC Volume on the PECO workspace,
  2. imports scripts/dbr_lts_smoke_notebook.py into the workspace,
  3. submits a one-off Job on an ephemeral single-node SINGLE_USER cluster pinned to the target spark_version; the notebook pip installs the wheel (+ extras) and runs SELECT 1,
  4. polls to completion (tolerating transient API errors) and exits non-zero on any non-SUCCESS,
  5. cleans up the per-run wheel dir + notebook in a finally (every exit path).
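Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the code in scripts/dbr_lts_install_check.py: `submit`, `get_state`, and `cleanup` are hypothetical callables standing in for the real API calls, the state names are illustrative, and treating `ConnectionError` as the transient error type is an assumption.

```python
import time

# Illustrative terminal states; the real driver may use a different set.
TERMINAL_STATES = {"SUCCESS", "FAILED", "TIMEDOUT", "CANCELED"}


def poll_until_done(get_state, interval_s=30.0, max_transient_errors=5,
                    sleep=time.sleep):
    """Poll get_state() until it returns a terminal state.

    Consecutive transient errors are tolerated up to
    max_transient_errors; one more than that re-raises the error.
    """
    consecutive_errors = 0
    while True:
        try:
            state = get_state()
            consecutive_errors = 0
        except ConnectionError:
            consecutive_errors += 1
            if consecutive_errors > max_transient_errors:
                raise
            state = None
        if state in TERMINAL_STATES:
            return state
        sleep(interval_s)


def run_check(submit, get_state, cleanup, **poll_kwargs):
    """Return a process exit code; cleanup runs on every exit path."""
    try:
        run_id = submit()
        state = poll_until_done(lambda: get_state(run_id), **poll_kwargs)
        return 0 if state == "SUCCESS" else 1
    finally:
        cleanup()
```

Putting the cleanup in `finally` means the wheel directory and notebook are removed whether the run succeeds, fails, or raises. The sketch has no overall timeout; a real driver would likely want one.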

Several non-obvious DBR-cluster constraints are baked in and commented (notebook_task not spark_python_task-from-Volume; SINGLE_USER access mode for UC/Volume access; dbutils.fs.cp the wheel off /Volumes; dbutils.library.restartPython() after install).

Gating

Runs on pull_request, but the cluster matrix runs only when dependency-affecting files change (pyproject.toml / poetry.lock / this workflow / the two scripts) — the only surface that can introduce this failure class. Informational check (not a required merge-queue gate). Uses only secrets already present in the azure-prod environment (DATABRICKS_HOST, DATABRICKS_TOKEN, TEST_PECO_WAREHOUSE_HTTP_PATH); the notebook-import dir is derived from the token's own identity.
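The path gate amounts to a membership test over the changed files. The sketch below is an illustration of that decision only, not the workflow's actual mechanism. The workflow file's path is not given in this description, so it is passed in as a parameter, and the function name is hypothetical.

```python
# Files named in the description as dependency-affecting.
DEPENDENCY_FILES = {
    "pyproject.toml",
    "poetry.lock",
    "scripts/dbr_lts_install_check.py",
    "scripts/dbr_lts_smoke_notebook.py",
}


def should_run_matrix(changed_files, workflow_path):
    """Return True if any changed file can affect the install path.

    workflow_path is the path of the workflow file itself; it is a
    parameter because the description does not name it.
    """
    watched = DEPENDENCY_FILES | {workflow_path}
    return any(path in watched for path in changed_files)
```

A PR that touches only source files under the package would skip the cluster matrix, while a change to poetry.lock would trigger it.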

Validation

Exercised end-to-end against the PECO workspace:

  • ✅ Green on the current pin (thrift 0.22.0) — install + SELECT 1 pass on 15.4 for both base and [pyarrow].
  • Guard proven: re-widening the pin to <0.24.0 fails on 14.3 and 15.4 with the exact incident error (Downloading thrift-0.23.0.tar.gz, then OSError ... output.json), so this is a true guard, not a check that always passes.
  • ✅ Cleanup verified: wheel dir + notebook confirmed removed after a run.

Follow-ups (external / not in this PR)

This pull request and its description were written by Isaac.

vikrantpuppala force-pushed the ci/dbr-lts-install-check branch from 5040862 to f4fd300 on July 2, 2026 10:30
vikrantpuppala force-pushed the ci/dbr-lts-install-check branch from 1e96f3a to 0b1ad22 on July 2, 2026 11:00
vikrantpuppala force-pushed the ci/dbr-lts-install-check branch from 0b1ad22 to f125819 on July 2, 2026 11:21
…554)

The thrift 0.23.0 bump (PR #796, shipped in 4.2.7) broke `pip install` on
DBR LTS: thrift ships sdist-only and 0.23.0's setup.py calls sys.exit(0) on
the build-success path, killing the PEP 517 backend before pip writes
output.json. On the old setuptools shipped by DBR 14.3/15.4 LTS this is a
hard install failure (SEV0 ES-1960554); 4.2.7 was yanked and the bump
reverted (#840).

Our CI never caught it because every job installs via `poetry install` on a
modern runner -- it never does a fresh `pip install` of the built wheel on
an LTS toolchain, the real customer path that failed.

CI check
--------
Adds a PR check (gated to dependency changes) that builds the wheel and
installs it INSIDE real DBR LTS clusters via the PECO workspace Jobs API
(no PyPI publish) then runs a SELECT 1 smoke test. Matrix = supported LTS
{13.3, 14.3, 15.4, 16.4, 17.3} x install target {base, pyarrow, kernel}.
Auth is OAuth M2M as the PECO service principal throughout (driver ->
workspace API and the notebook's connector -> warehouse smoke query); a PAT
is warehouse-scoped and rejected by the workspace REST API. Older LTS ship an
SDK too old for auth_type=oauth-m2m, so the smoke harness upgrades
databricks-sdk. Per-run artifacts are cleaned up in a finally block.

Connector fix (caught by the check)
-----------------------------------
The check surfaced a real latent bug: a base install (no [pyarrow] extra)
runs against a runtime's bundled pyarrow, and on DBR 13.3/14.3 that pyarrow
predates the `promote_options` kwarg, so concat_table_chunks raised
`TypeError: concat_tables() got an unexpected keyword argument
'promote_options'` on the Arrow result path. utils.py now falls back to the
legacy `promote=True` (equivalent to promote_options="default") when the
kwarg is unsupported, with a regression test.
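
A minimal sketch of the fallback pattern, assuming the shape described above.
This is not the actual utils.py change: `concat_tables_compat` and its
injectable `concat_tables` parameter are hypothetical, added here so the logic
can be exercised without a specific pyarrow version installed.

```python
def concat_tables_compat(tables, concat_tables=None):
    """Concatenate Arrow tables on both old and new pyarrow versions.

    Tries the newer promote_options keyword first and falls back to the
    legacy promote=True only when that keyword is rejected.
    """
    if concat_tables is None:
        import pyarrow  # imported lazily so the sketch loads without it

        concat_tables = pyarrow.concat_tables
    try:
        return concat_tables(tables, promote_options="default")
    except TypeError as exc:
        # Re-raise unrelated TypeErrors instead of masking them.
        if "promote_options" not in str(exc):
            raise
        return concat_tables(tables, promote=True)
```

Checking the error message keeps the fallback narrow: a TypeError caused by bad
input still surfaces instead of silently taking the legacy path.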

Validated end-to-end against the PECO workspace: green on thrift 0.22.0, and
re-widening the pin to <0.24.0 fails on 14.3+15.4 with the exact output.json
error -- a true guard, not a check that always passes.

Also adds an incident-linked comment on the thrift pin so nobody re-widens it
before the upstream fix (THRIFT-6067 / apache/thrift#3584) ships.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>