Your churn dashboard is broken. Someone's in your Slack about it.

Read the message below, then click Trigger the agent to watch what happens next. The agent uses the same tools you would, one at a time, so you can see exactly what it is doing.

#data-platform-on-call

Slack · 14 members · 09:13 AM

Derek Tan · analytics-marketing · 09:13 AM

hey @data-platform-oncall — our churn dashboard hasn't refreshed the last 3 nights and QBR is tomorrow. is etl_customer_360_nightly stuck?

This message would normally start a 60-minute context-switch across Workflows, Spark UI, cluster events, Unity Catalog, and the notebook. Instead it triggers Cursor's Background Agent (Slack adapter → SDK harness). Click below to play it back.

Also wired: Lakehouse Monitoring webhook · nightly DBU-over-baseline sweep

For data platform teams

Three things a senior data engineer cares about

This is not a pitch about adopting another tool. It is about making your Spark triage loop — the one you already run a few times a week — faster, cheaper, and more reviewable.

Catch the jobs your alerts miss

Your alerting policy pages on hard failure and schedule slip. The quieter problem is the job running 2-5× over baseline that nobody notices until a downstream dashboard breaks. Cursor catches it in the nightly DBU-over-baseline sweep, or on demand from a Slack /triage mention.

Get the proof with the fix

Every triage finishes with a Spark UI artifact, a planner remediation report, and a sample-run validation — the same evidence you'd collect over an hour of clicking. The PR body cites which executor, which stage, and which line of the notebook caused the bleed.

You still own merge

Cursor never self-merges, never edits production from inside the agent, and never opens a ticket on you. The output is a PR your on-call reviews, with a Codex behavioral-equivalence check and a 1% sample run already green. You stay in the review loop you already trust.

Economics of one inefficient job

One job, one triage, the math

The fix is rarely complicated — usually a broadcast hint, a salt, an AQE flag, an instance-type bump. The cost is the engineer-hours spent finding it. The waste is the compute you keep paying for until someone has time to look.

| Line item | Before | After Cursor | Delta |
| --- | --- | --- | --- |
| Per-run wall time | 4h 12m | ~32m (projected) | 8× faster |
| Per-run DBU cost | $1,840 | ~$230 | ~$1,610 saved / run |
| Monthly recurring DBU bleed on this one job | ~$55,200 | ~$6,900 | ~$48,300 / mo |
| On-call surface for this job | silent · no alert configured | timeout 5,400s · alert on drift > 1.5× | baseline-drift visibility added |
| Engineer time per triage cycle | 60-120 min · 4-6 surfaces | ~5 min review of PR + plan | ~95% of cycle reclaimed |

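
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the `JobCost` type and the `RUNS_PER_MONTH` constant are assumptions, and the value of 30 runs per month is inferred from the table ($55,200 / $1,840), not stated in the text.

```python
from dataclasses import dataclass

# Assumption: one run per night over a 30-day month (inferred from the table).
RUNS_PER_MONTH = 30


@dataclass(frozen=True)
class JobCost:
    """Hypothetical container for one job's per-run figures."""
    wall_minutes: float
    dbu_cost_per_run: float


def monthly_cost(job: JobCost) -> float:
    """Recurring monthly DBU spend for a job that runs RUNS_PER_MONTH times."""
    return job.dbu_cost_per_run * RUNS_PER_MONTH


before = JobCost(wall_minutes=4 * 60 + 12, dbu_cost_per_run=1840.0)
after = JobCost(wall_minutes=32, dbu_cost_per_run=230.0)

speedup = before.wall_minutes / after.wall_minutes            # 252 / 32 = 7.875
saved_per_run = before.dbu_cost_per_run - after.dbu_cost_per_run
monthly_saved = monthly_cost(before) - monthly_cost(after)

print(f"speedup ~{speedup:.1f}x, saved per run ${saved_per_run:,.0f}, "
      f"saved per month ${monthly_saved:,.0f}")
```

The "8× faster" entry is the rounded value of 7.875; the after-fix numbers are projections in the source, so the savings are projections too.
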
How this is wired

One Background Agent · 5 MCPs · 8 skills · 3 model classes

Every piece below is a concrete primitive your team deploys. The skill files in .cursor/skills/ are real — you can read them, fork them, and adapt the prompts.

Architecture

3 trigger surfaces

Slack /triage · Lakehouse Monitoring webhook · nightly DBU sweep

Cursor Background Agent

SDK harness · loads skills · spawns subagents · streams to Slack

5 MCPs · 8 skills

planner · 5 specialists · editor · reviewer — each with its own model + scoped tool surface

Skills · subagents · models
.cursor/skills/

databricks-job-triage-planner

planner

Read evidence, decompose the investigation, dispatch 5 specialist subagents, synthesize a ranked remediation plan. Does not edit code.

Claude Opus · databricks-mcp.* + 3

databricks-skew-detective

specialist

Sample input table distribution and rank join keys by skew. Suggest broadcast / salt / AQE skew-join.

Composer 2 · databricks-mcp.sql.execute + 2

databricks-cluster-sizer

specialist

Read executor JVM + GC + cluster events. Recommend instance type + count + AQE configuration as a cluster JSON diff.

Composer 2 · spark-history-mcp.executors.* + 1

databricks-code-surgeon

specialist

Static-analyze the notebook for missing broadcast hints, non-vectorized UDFs, missing AQE, and other anti-patterns.

Composer 2 · github-mcp.repo.get_file + 1

databricks-lineage-cartographer

specialist

Walk Unity Catalog downstream lineage. Rank consumers by criticality. Return a safe cutover window.

Composer 2 · unity-catalog-mcp.lineage.*

databricks-cost-analyst

specialist

Roll up 30-day DBU usage vs baseline. Price the remediation. Return projected savings.

Composer 2 · databricks-mcp.account.usage.aggregate

databricks-jobs-patch

editor

Apply the remediation plan as concrete edits to the notebook + databricks.yml + job config. Open a branch.

Composer 2 · github-mcp.repo.* + 1

codex-review-databricks

reviewer

Review the patch for behavioral equivalence, row-count parity hooks, and rollback safety. Block on hard violations.

Codex · github-mcp.pulls.add_review + 1

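
To make the skew-detective step in the skill list above more concrete, here is a minimal sketch of ranking candidate join keys by skew on a sampled set of rows. The source does not say how the skill measures skew; the metric used here (largest key frequency divided by mean key frequency) and the names `skew_ratio` and `rank_join_keys` are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping, Sequence


def skew_ratio(values: Iterable[Hashable]) -> float:
    """Largest key frequency divided by the mean key frequency.

    1.0 means perfectly uniform keys; larger values mean one key dominates.
    """
    counts = Counter(values)
    if not counts:
        return 0.0
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


def rank_join_keys(rows: Sequence[Mapping[str, Hashable]],
                   candidate_keys: Sequence[str]) -> list[tuple[str, float]]:
    """Return (column, skew_ratio) pairs, most skewed column first."""
    scores = {key: skew_ratio(row[key] for row in rows) for key in candidate_keys}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample: customer_id is unique, region is dominated by "us".
sample = [{"customer_id": i, "region": "us" if i < 8 else "eu"} for i in range(10)]
ranking = rank_join_keys(sample, ["customer_id", "region"])
```

A column near the top of the ranking is a candidate for the broadcast, salt, or AQE skew-join suggestions the skill description mentions; which of those fits depends on table sizes that this sketch does not look at.
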
MCPs

databricks-mcp

Workflows + Clusters + SQL

workflows.get · workflows.list · runs.list · runs.get · +3

backed by Databricks REST API 2.1 + 2.0

spark-history-mcp

Spark UI history server

applications.get · stages.get · stages.task_summary · executors.get

backed by Spark History Server REST

unity-catalog-mcp

Catalog + lineage + table stats

lineage.downstream · lineage.upstream · table.stats · table.grants

backed by Unity Catalog REST + system.access.audit

slack-mcp

Threaded collaboration

threads.reply · channels.history · mentions.search

backed by Slack Web API + Events API

github-mcp

Notebook + asset bundle repo

repo.get_file · branches.create · pulls.create · pulls.add_review

backed by GitHub REST + GraphQL

Trigger surfaces

Slack /triage slash command

A stakeholder in #data-platform-on-call pings the team ("the churn dashboard hasn't refreshed in 3 nights"). The Slack adapter routes any mention of @cursor that includes a job reference to the Background Agent.

POST /api/databricks-triage-webhook (Slack-signed)

Databricks Lakehouse Monitoring alert

A Lakehouse Monitoring metric (DBU-over-baseline or task duration p95) crosses its threshold and fires a webhook. The agent treats it the same as a Slack ping, but with a structured payload.

POST /api/databricks-triage-webhook (HMAC-signed)

Nightly DBU-over-baseline sweep

A Databricks Workflow runs `databricks-jobs-sweep` nightly, ranks every job by `dbu_burn / baseline_dbu`, and invokes the planner for any job above 1.5×. This surfaces silent waste that the alerting policy never paged on.

Databricks Workflow → cursor.background.invoke()
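
The ranking step of the nightly sweep can be sketched as follows. This is a minimal illustration, not the contents of `databricks-jobs-sweep`: the `JobUsage` type, the `jobs_over_baseline` function, and the sample job names and numbers are all hypothetical. Only the `dbu_burn / baseline_dbu` ratio and the 1.5× threshold come from the text.

```python
from dataclasses import dataclass

DRIFT_THRESHOLD = 1.5  # from the text: invoke the planner above 1.5x baseline


@dataclass(frozen=True)
class JobUsage:
    """Hypothetical per-job usage record for one sweep."""
    job_name: str
    dbu_burn: float
    baseline_dbu: float


def jobs_over_baseline(jobs: list[JobUsage],
                       threshold: float = DRIFT_THRESHOLD) -> list[tuple[str, float]]:
    """Return (job_name, ratio) for jobs above the threshold, worst first."""
    flagged = []
    for job in jobs:
        if job.baseline_dbu <= 0:
            continue  # no usable baseline yet; skip rather than divide by zero
        ratio = job.dbu_burn / job.baseline_dbu
        if ratio > threshold:
            flagged.append((job.job_name, ratio))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Illustrative numbers only.
usage = [
    JobUsage("job_a", dbu_burn=400.0, baseline_dbu=100.0),
    JobUsage("job_b", dbu_burn=120.0, baseline_dbu=100.0),
    JobUsage("job_c", dbu_burn=300.0, baseline_dbu=150.0),
    JobUsage("job_new", dbu_burn=50.0, baseline_dbu=0.0),
]
flagged = jobs_over_baseline(usage)
```

Each flagged job would then be handed to the planner; how that hand-off is made is outside this sketch.
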

Guardrails every triage run clears

Opus decomposes and synthesizes; Composer specialists never edit code

Codex behavioral-equivalence review before any PR opens

1% sample run on staging must pass row-count parity before ship

Cursor never self-merges; a human owns the PR merge button

Cutover window taken from Unity Catalog lineage, not heuristics
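
The row-count parity guardrail can be pictured as a comparison of output table counts between the current job and the patched job, both run on the same 1% sample. The sketch below assumes exact equality is required and that counts have already been collected into dictionaries; the function name and table names are hypothetical, and the source does not describe how the check is actually implemented.

```python
from typing import Mapping, Optional


def row_count_mismatches(
    baseline_counts: Mapping[str, int],
    patched_counts: Mapping[str, int],
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return {table: (baseline_count, patched_count)} for every disagreement.

    A table missing from one side is reported with None for that side.
    An empty result means the patch passed row-count parity.
    """
    mismatches = {}
    for table in sorted(set(baseline_counts) | set(patched_counts)):
        before = baseline_counts.get(table)
        after = patched_counts.get(table)
        if before != after:
            mismatches[table] = (before, after)
    return mismatches


# Hypothetical counts from a sample run on staging.
baseline = {"customer_360": 10_000, "churn_features": 9_800}
patched_ok = {"customer_360": 10_000, "churn_features": 9_800}
patched_bad = {"customer_360": 10_000, "churn_features": 9_650}

assert row_count_mismatches(baseline, patched_ok) == {}
print(row_count_mismatches(baseline, patched_bad))
```

Row counts alone cannot prove behavioral equivalence, which is presumably why the guardrails pair this check with the separate Codex review.
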

One planner agent, five specialists in parallel, one PR your on-call can merge.

Now imagine running this against every job in your workspace, every night.