Your churn dashboard is broken. Someone's in your Slack about it.
Read the message below, then click "Trigger the agent" to watch what happens next. The agent walks through the same tools you would, one at a time, so you can see exactly what it's doing.
#data-platform-on-call
Slack · 14 members · 09:13 AM
hey @data-platform-oncall — our churn dashboard hasn't refreshed the last 3 nights and QBR is tomorrow. is etl_customer_360_nightly stuck?
This message would normally start a 60-minute context-switch across Workflows, Spark UI, cluster events, Unity Catalog, and the notebook. Instead it triggers Cursor's Background Agent (Slack adapter → SDK harness). Click below to play it back.
Also wired: Lakehouse Monitoring webhook · nightly DBU-over-baseline sweep
Three things a senior data engineer cares about
This is not a pitch about adopting another tool. It is about making your Spark triage loop — the one you already run a few times a week — faster, cheaper, and more reviewable.
Catch the jobs your alerts miss
Your alerting policy pages on hard failure and schedule slip. The quieter problem is the job running 2-5× over baseline that nobody notices until a downstream dashboard breaks. Cursor's nightly DBU-over-baseline sweep catches it, or you can start a triage on demand with a Slack /triage mention.
Get the proof with the fix
Every triage finishes with a Spark UI artifact, a planner remediation report, and a sample-run validation — the same evidence you'd collect over an hour of clicking. The PR body cites which executor, which stage, and which line of the notebook caused the bleed.
You still own merge
Cursor never self-merges, never edits production from inside the agent, and never opens a ticket on you. The output is a PR your on-call reviews, with a Codex behavioral-equivalence check and a 1% sample run already green. You stay in the review loop you already trust.
One job, one triage, the math
The fix is rarely complicated — usually a broadcast hint, a salt, an AQE flag, an instance-type bump. The cost is the engineer-hours spent finding it. The waste is the compute you keep paying for until someone has time to look.
| Line item | Before | After Cursor | Delta |
|---|---|---|---|
| Per-run wall time | 4h 12m | ~32m (projected) | ~8× faster |
| Per-run DBU cost | $1,840 | ~$230 | ~$1,610 saved / run |
| Monthly recurring DBU bleed on this one job | ~$55,200 | ~$6,900 | ~$48,300 / mo |
| On-call surface for this job | silent · no alert configured | timeout 5,400s · alert on drift > 1.5× | baseline-drift visibility added |
| Engineer time per triage cycle | 60-120 min · 4-6 surfaces | ~5 min review of PR + plan | ~95% of cycle reclaimed |
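The arithmetic behind the table can be checked in a few lines. This is only a sketch: the per-run figures are copied from the table, while the 30-runs-per-month cadence is an assumption inferred from $55,200 / $1,840.

```python
# Per-run figures copied from the table above.
before_cost, after_cost = 1840, 230          # DBU cost per run, USD
before_min, after_min = 4 * 60 + 12, 32      # wall time per run, minutes

# Assumption: one run per night, 30 nights per month.
runs_per_month = 30

saved_per_run = before_cost - after_cost
monthly_before = before_cost * runs_per_month
monthly_after = after_cost * runs_per_month
monthly_saved = monthly_before - monthly_after
speedup = before_min / after_min

print(f"saved per run:  ${saved_per_run:,}")
print(f"monthly before: ${monthly_before:,}")
print(f"monthly after:  ${monthly_after:,}")
print(f"monthly saved:  ${monthly_saved:,}")
print(f"speedup:        {speedup:.1f}x")
```

The speedup works out to slightly under 8, which is why the table figure is best read as approximate.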
One Background Agent · 5 MCPs · 8 skills · 3 model classes
Every piece below is a concrete primitive your team deploys. The skill files in .cursor/skills/ are real — you can read them, fork them, and adapt the prompts.
Architecture
3 trigger surfaces
Slack /triage · Lakehouse Monitoring webhook · nightly DBU sweep
Cursor Background Agent
SDK harness · loads skills · spawns subagents · streams to Slack
5 MCPs · 8 skills
planner · 5 specialists · editor · reviewer — each with its own model + scoped tool surface
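As a rough illustration of the fan-out described above (one planner, five specialists in parallel), the sketch below uses `asyncio.gather`. The skill names come from this page; `run_specialist` and `plan` are hypothetical stand-ins, not the actual Cursor SDK harness, and the findings are canned.

```python
import asyncio

# Skill names as listed on this page.
SPECIALISTS = [
    "databricks-skew-detective",
    "databricks-cluster-sizer",
    "databricks-code-surgeon",
    "databricks-lineage-cartographer",
    "databricks-cost-analyst",
]


async def run_specialist(name, job):
    """Hypothetical stand-in for one subagent call; returns a canned finding."""
    await asyncio.sleep(0)  # placeholder for real tool calls and model I/O
    return {"skill": name, "job": job, "finding": f"{name} report for {job}"}


async def plan(job):
    """Fan out to every specialist concurrently and collect the findings."""
    tasks = [run_specialist(name, job) for name in SPECIALISTS]
    # gather() returns results in the same order as the tasks passed in.
    return await asyncio.gather(*tasks)


findings = asyncio.run(plan("etl_customer_360_nightly"))
for f in findings:
    print(f["skill"], "->", f["finding"])
```

A real planner would then rank the findings into a remediation plan; that synthesis step is omitted here because the page does not describe how it is scored.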
databricks-job-triage-planner
planner: Read evidence, decompose the investigation, dispatch 5 specialist subagents, synthesize a ranked remediation plan. Does not edit code.
databricks-skew-detective
specialist: Sample input table distribution and rank join keys by skew. Suggest broadcast / salt / AQE skew-join.
databricks-cluster-sizer
specialist: Read executor JVM + GC + cluster events. Recommend instance type + count + AQE configuration as a cluster JSON diff.
databricks-code-surgeon
specialist: Static-analyze the notebook for missing broadcast hints, non-vectorized UDFs, missing AQE, and other anti-patterns.
databricks-lineage-cartographer
specialist: Walk Unity Catalog downstream lineage. Rank consumers by criticality. Return a safe cutover window.
databricks-cost-analyst
specialist: Roll up 30-day DBU usage vs baseline. Price the remediation. Return projected savings.
databricks-jobs-patch
editor: Apply the remediation plan as concrete edits to the notebook + databricks.yml + job config. Open a branch.
codex-review-databricks
reviewer: Review the patch for behavioral equivalence, row-count parity hooks, and rollback safety. Block on hard violations.
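To make the skew-detective step above concrete, here is one simple way to rank candidate join keys from a sample of rows. It is a sketch under stated assumptions: the skew ratio used (largest group size divided by mean group size) is one common heuristic rather than the metric the skill is confirmed to use, and `rank_keys_by_skew` and the sample data are hypothetical.

```python
from collections import Counter


def rank_keys_by_skew(rows, candidate_keys):
    """Rank columns by skew ratio: largest group size / mean group size.

    Assumes `rows` is a non-empty list of dicts containing every candidate key.
    A ratio near 1.0 means evenly distributed; larger means more skewed.
    """
    ranked = []
    for key in candidate_keys:
        counts = Counter(row[key] for row in rows)
        mean_size = sum(counts.values()) / len(counts)
        ranked.append((key, max(counts.values()) / mean_size))
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)


# Hypothetical sample: one customer_id dominates the sampled rows.
sample = (
    [{"customer_id": 1, "region": "us"}] * 8
    + [{"customer_id": 2, "region": "eu"}, {"customer_id": 3, "region": "us"}]
)

result = rank_keys_by_skew(sample, ["customer_id", "region"])
for key, ratio in result:
    print(f"{key}: skew ratio {ratio:.2f}")
```

A key that ranks high here is a candidate for salting or an AQE skew-join, while a small table on the other side of the join would point toward a broadcast hint instead.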
databricks-mcp
Workflows + Clusters + SQL: workflows.get · workflows.list · runs.list · runs.get · +3
backed by Databricks REST API 2.1 + 2.0
spark-history-mcp
Spark UI history server: applications.get · stages.get · stages.task_summary · executors.get
backed by Spark History Server REST
unity-catalog-mcp
Catalog + lineage + table stats: lineage.downstream · lineage.upstream · table.stats · table.grants
backed by Unity Catalog REST + system.access.audit
slack-mcp
Threaded collaboration: threads.reply · channels.history · mentions.search
backed by Slack Web API + Events API
github-mcp
Notebook + asset bundle repo: repo.get_file · branches.create · pulls.create · pulls.add_review
backed by GitHub REST + GraphQL
Slack /triage slash command
A stakeholder in #data-platform-on-call pings the team ("the churn dashboard hasn't refreshed in 3 nights"). The Slack adapter routes any @cursor mention that includes a job reference to the Background Agent.
POST /api/databricks-triage-webhook (Slack-signed)
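A "Slack-signed" endpoint typically verifies Slack's `v0` request signature before doing any work. The sketch below follows Slack's documented scheme (HMAC-SHA256 over `v0:{timestamp}:{body}`, compared against the `X-Slack-Signature` header). How this particular webhook implements the check is not shown on the page, so treat the function name and the five-minute replay window as assumptions.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_age_s=300):
    """Return True if `signature` matches Slack's v0 scheme for this request.

    timestamp: value of the X-Slack-Request-Timestamp header (str)
    body:      raw request body (bytes), exactly as received
    signature: value of the X-Slack-Signature header (str)
    """
    now = time.time() if now is None else now
    try:
        age = abs(now - int(timestamp))
    except ValueError:
        return False
    if age > max_age_s:
        return False  # stale request: possible replay

    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information.
    return hmac.compare_digest("v0=" + digest, signature)
```

The check has to run against the raw bytes of the body, before any JSON or form parsing, because re-serializing the payload changes the bytes being signed.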
Databricks Lakehouse Monitoring alert
A Lakehouse Monitoring metric (DBU-over-baseline or task duration p95) crosses its threshold and fires a webhook. The agent treats it the same as a Slack ping, but with a structured payload.
POST /api/databricks-triage-webhook (HMAC-signed)
Nightly DBU-over-baseline sweep
A Databricks Workflow runs `databricks-jobs-sweep` nightly, ranks every job by `dbu_burn / baseline_dbu`, and invokes the planner for any job above 1.5×. Surfaces silent waste the alerting policy never paged on.
Databricks Workflow → cursor.background.invoke()
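The ranking rule described above is small enough to sketch directly. The 1.5× threshold and the `dbu_burn / baseline_dbu` ratio come from the text; the function name, the input shape, and the usage numbers are illustrative assumptions, and the step that hands each job to the planner is left out.

```python
DRIFT_THRESHOLD = 1.5  # from the text: triage any job above 1.5x baseline


def jobs_to_triage(usage, threshold=DRIFT_THRESHOLD):
    """Rank jobs by dbu_burn / baseline_dbu and keep those above the threshold.

    usage: dict mapping job name -> (dbu_burn, baseline_dbu)
    Returns a list of (job, ratio) pairs, highest ratio first.
    """
    ratios = []
    for job, (dbu_burn, baseline_dbu) in usage.items():
        if baseline_dbu <= 0:
            continue  # no usable baseline yet; skip rather than divide by zero
        ratios.append((job, dbu_burn / baseline_dbu))
    ratios.sort(key=lambda pair: pair[1], reverse=True)
    return [(job, ratio) for job, ratio in ratios if ratio > threshold]


# Illustrative numbers only; the two hypothetical_* jobs are made up.
usage = {
    "etl_customer_360_nightly": (1840.0, 230.0),
    "hypothetical_job_a": (120.0, 100.0),
    "hypothetical_job_b": (330.0, 200.0),
    "hypothetical_new_job": (50.0, 0.0),
}

flagged = jobs_to_triage(usage)
for job, ratio in flagged:
    print(f"{job}: {ratio:.2f}x baseline")
```

Sorting before filtering means the worst offenders reach the planner first, which matters if the nightly sweep runs under a time or cost budget.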
Guardrails every triage run clears
One planner agent, five specialists in parallel, one PR your on-call can merge. Now imagine running this against every job in your workspace, every night.