Context
The databricks-pack plugin in this marketplace is being rebuilt from the ground up. The v1 was thin documentation; v2 is a working plugin pack with live workspace access — six skills, one shared MCP server, real detection logic against real workloads.
Every design decision is on the table here before code lands. I have catalogued 34 production failure modes from secondary research (Reddit, GitHub issues, Databricks community forum, vendor docs) and ranked them into six skills. The research is solid for what gets formally reported. It is not solid for what gets cursed at silently — and that is the input I am asking this community to provide.
The honest stance: I can build any of this. I am not the one running production Databricks workloads. If you tell me which failure modes you would actually want this to catch, which detection thresholds will fire correctly at 2 AM versus on every healthy workspace, and which skill designs senior data engineers will actually reach for, I will build what you say — not what I would guess. Implementation is not the rate-limiting input; calibrated priority signal is.
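To make the threshold question concrete, here is a minimal sketch of how a candidate threshold could be checked against labeled samples before it ships. Everything in it is an assumption for illustration: the `firing_rates` helper, the metric (imagine cold-start minutes per cluster), and the sample values are made up, and nothing here comes from the pack's design records.

```python
def firing_rates(samples, threshold):
    """Return (false_positive_rate, recall) for one candidate threshold.

    samples: list of (metric_value, is_incident) pairs. Hypothetical shape.
    An alert "fires" when metric_value is strictly above the threshold.
    """
    healthy = [value for value, is_incident in samples if not is_incident]
    incidents = [value for value, is_incident in samples if is_incident]

    false_positives = sum(value > threshold for value in healthy)
    true_positives = sum(value > threshold for value in incidents)

    # Guard against empty groups so the sketch never divides by zero.
    fpr = false_positives / len(healthy) if healthy else 0.0
    recall = true_positives / len(incidents) if incidents else 0.0
    return fpr, recall


# Made-up samples: four healthy workspaces, two real incidents.
samples = [
    (2.0, False), (3.0, False), (4.0, False), (9.0, False),
    (8.0, True), (12.0, True),
]

for candidate in (3.5, 7.0, 10.0):
    fpr, recall = firing_rates(samples, candidate)
    print(f"threshold={candidate}: fires on {fpr:.0%} of healthy, catches {recall:.0%} of incidents")
```

With these invented numbers, a low threshold catches every incident but also fires on half the healthy workspaces, while a high one stays quiet and misses an incident. Real samples from production workspaces are exactly the input that secondary research cannot supply.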
The six skills
| # | Skill | What it catches | Issue | Status |
|---|---|---|---|---|
| 1 | databricks-workspace-mcp | Typed control-plane access (clusters.events, instance pools, pipeline event logs, external locations, storage credentials) | #789 | 🟡 design open |
| 2 | databricks-cost-leak-hunter | Ranked real-dollar cost leaks from system.billing.usage + live workspace state (pilot; start here) | #790 | 🟡 design open |
| 3 | databricks-cluster-forensics | Cold starts, launch failures, Photon fallback, DBR-upgrade triage from live cluster events | #791 | 🟡 design open |
| 4 | databricks-streaming-guardian | Delta + Liquid Clustering + Structured Streaming + DLT operations (the largest skill, twelve failure modes) | #792 | 🟡 design open |
| 5 | databricks-uc-migration-pilot | Unity Catalog readiness, IAM/SCIM diagnostics, system-table access tracing (Sept 30 2026 deadline driver) | #793 | 🟡 design open |
| 6 | databricks-bundle-medic | Asset Bundles deploy diagnostics, CMK rotation, PrivateLink endpoint audit | #794 | 🟡 design open |
Start here if you only have 10 minutes
Read #790 — databricks-cost-leak-hunter first. It is the pilot skill — the first thing in the rebuilt pack a user will actually run, designed to be FinOps-grokkable on the first read. The four cost categories it catches are the ones I would most want validated against your real workloads, and the design questions inside are the ones I am most uncertain about. If you only leave thumbs-up / thumbs-down on the question bullets there, you have already calibrated the most important skill in the pack.
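For readers who want a feel for what "ranked real-dollar cost leaks" could mean mechanically, the sketch below aggregates usage per cluster and sorts by dollars. It is an illustration under stated assumptions, not the skill's design: the `UsageRow` shape, the `rank_cost_by_cluster` helper, the SKU names, and the prices are all hypothetical, and the four cost categories from #790 are deliberately not modeled here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRow:
    """One simplified usage record (hypothetical shape, not the real table schema)."""
    cluster_id: str
    sku_name: str
    dbus: float  # usage quantity, assumed to be in DBUs


def rank_cost_by_cluster(rows, price_per_dbu):
    """Return (cluster_id, dollars) pairs, most expensive first."""
    totals = defaultdict(float)
    for row in rows:
        # Unknown SKUs are priced at 0.0 here; a real tool should surface them instead.
        totals[row.cluster_id] += row.dbus * price_per_dbu.get(row.sku_name, 0.0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up usage rows and made-up prices, for illustration only.
rows = [
    UsageRow("c-1", "JOBS", 10.0),
    UsageRow("c-2", "ALL_PURPOSE", 8.0),
    UsageRow("c-1", "JOBS", 5.0),
]
prices = {"JOBS": 0.20, "ALL_PURPOSE": 0.50}

ranked = rank_cost_by_cluster(rows, prices)
for cluster_id, dollars in ranked:
    print(f"{cluster_id}: ${dollars:.2f}")
```

Ranking by dollars rather than by DBUs is the point of the example: the cluster with fewer DBUs can still be the larger leak once SKU pricing is applied.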
How to engage
- Comment on any issue with thoughts, corrections, or alternative approaches.
- Thumbs-up / thumbs-down on individual design-question bullets is equally valuable — that is pure priority signal and takes 30 seconds.
- Voice memo via WhatsApp works for anyone in Luciano's community who would prefer to speak instead of type in English — I will transcribe back into the relevant issue with credit. Portuguese is fine.
What this is not
- Not a code review request. There is no code yet — by design.
- Not a sales pitch. The pack will ship under the same marketplace license as every other plugin here.
- Not a request for unpaid consulting hours. Any depth of engagement is fine; no engagement is also fine.
Why public
Every catalogued failure mode is already public (forum threads, GitHub issues, vendor docs). The implementation is the novelty, not the design — so the design can live in the open and the people who would benefit can shape it before code lands. Full transparency was deliberate; if any of you would have made a different call, please say so in this thread.
Reference material
All design records, pain research, and pressure tests live in plugins/saas-packs/databricks-pack/000-docs/ and are linked individually below. Index file: 000-INDEX.md.
Pain research (RL-RSRC)
The 34 catalogued failure modes, grouped by Databricks domain. These are the inputs that drove which skills were scoped and at what threshold.
| Doc | What it covers |
|---|---|
| 002-RL-RSRC | Domain 1: Compute / Photon / DBR versioning / cost |
| 003-RL-RSRC | Domain 2: Delta Lake, Liquid Clustering, Structured Streaming, DLT |
| 004-RL-RSRC | Domain 3: Unity Catalog, Asset Bundles, identity, workspace ops, secrets, networking |
| 005-RL-RSRC | Architectural patterns extracted from Anthropic-published Claude Code skills (2026) |
| 006-RL-RSRC | v2-rebuild synthesis: failure modes → six skills mapping |
| 010-RL-RSRC | Databricks MCP landscape (official, managed) |
| 011-RL-RSRC | Databricks MCP community landscape (2026) |
| 012-RL-RSRC | Claude Code / Cowork ↔ Databricks MCP integration architecture |
Architecture decisions (AT-ADEC)
Locked design decisions with reasoning. These are the load-bearing records — if you disagree, this is where to push back.
| Doc | What it locks |
|---|---|
| 007-AT-ADEC | CTO decision: Databricks Pack v2 rebuild (REVISED) |
| 013-AT-ADEC | Epic 1 MCP scope adjustment: auth flows, dual-MCP design, scope cut from 8 → 6 endpoints |
Pressure tests (RA-REVW)
Adversarial validation of the decisions above.
| Doc | What it stress-tests |
|---|---|
| 008-RA-REVW | Pack-handling pressure test: version bump in place vs. fresh pack |
| 009-RA-REVW | Pilot-timing pressure test: databricks-cost-leak-hunter as pilot, 3-5 day window |
Direct reading is welcome but not required. The issue bodies summarize what is needed for response.
- Jeremy Longshore
intentsolutions.io
Internal tracking (beads for agent context — not for external action):
claude-koxw, claude-02m1, claude-hsoc, claude-vjaw, claude-h53a, claude-jhnj