How LakeSentry Works for Databricks Cost Intelligence

LakeSentry turns Databricks system tables and API metadata into cost, workload, attribution, and optimization views.

LakeSentry runs as isolated single-tenant instances: each customer has separate application and database infrastructure. Because isolation happens at the infrastructure level, runtime tenant databases do not rely on tenant ID columns to separate customers.

LakeSentry ingests Databricks system data including:

  • Billing usage and list prices.
  • Compute clusters, node timeline, node types, warehouses, and warehouse events.
  • Lakeflow jobs, tasks, job runs, pipelines, and pipeline updates.
  • SQL query history.
  • Workspace, lineage, network, clean room, assistant, storage, serving, and information-schema metadata where available.
  • REST API enrichment for accounts, workspaces, jobs, clusters, and SQL warehouses.
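
As an illustration of how billing usage and list prices combine, the sketch below computes a list cost for one usage record. The field names `sku_name`, `usage_quantity`, `usage_start_time`, `price_start_time`, and `price_end_time` follow the Databricks billing system tables. The flattened `unit_price` field, the SKU names, and the matching rule are simplifying assumptions, not LakeSentry's actual logic.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Iterable, Optional


@dataclass(frozen=True)
class UsageRow:
    """Simplified stand-in for a billing usage record."""
    sku_name: str
    usage_start_time: datetime
    usage_quantity: Decimal


@dataclass(frozen=True)
class PriceRow:
    """Simplified stand-in for a list price record."""
    sku_name: str
    price_start_time: datetime
    price_end_time: Optional[datetime]  # None means the price is still in effect
    unit_price: Decimal  # assumption: flattened from the nested pricing struct


def list_cost(usage: UsageRow, prices: Iterable[PriceRow]) -> Optional[Decimal]:
    """Return quantity x list price for the price window covering the usage."""
    for price in prices:
        same_sku = price.sku_name == usage.sku_name
        started = price.price_start_time <= usage.usage_start_time
        not_ended = (
            price.price_end_time is None
            or usage.usage_start_time < price.price_end_time
        )
        if same_sku and started and not_ended:
            return usage.usage_quantity * price.unit_price
    return None  # no matching price window; the caller decides how to handle it
```

Using `Decimal` rather than `float` avoids rounding drift when many small line items are summed.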

MLflow source extraction and Databricks audit-log extraction are planned but not active in the current default direct-extraction registry. The Audit Log in LakeSentry is LakeSentry’s own internal audit trail, not an extract of Databricks audit logs.

| Mode | How it works | Best for |
| --- | --- | --- |
| Direct Connection | LakeSentry pulls data with Databricks APIs and SQL Statement API using configured credentials. | Current self-service setup path. |
| External Connector | A Python wheel runs as a Databricks workflow in the customer’s workspace and pushes extracted data to LakeSentry. | Controlled deployments for customer-managed or private-network extraction. |
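
To make the Direct Connection mode more concrete, here is a minimal sketch of a request to the Databricks SQL Statement Execution API (`POST /api/2.0/sql/statements`), which is presumably what "SQL Statement API" refers to. The function name and the chosen `wait_timeout` are illustrative assumptions; the sketch only builds the request and does not show how LakeSentry itself issues queries.

```python
import json
import urllib.request


def build_statement_request(
    host: str, token: str, warehouse_id: str, statement: str
) -> urllib.request.Request:
    """Build (but do not send) a SQL Statement Execution API request."""
    payload = {
        "warehouse_id": warehouse_id,  # SQL warehouse that runs the statement
        "statement": statement,
        "wait_timeout": "30s",  # wait up to 30 seconds for an inline result
    }
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/statements",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; a real extractor would also need to poll for statements that do not finish within the wait timeout and to page through large results.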

Connectors may use PAT credentials or OAuth M2M service-principal credentials. OAuth M2M is preferred when possible.
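
For reference, an OAuth M2M token exchange against a Databricks workspace can be sketched as below. The workspace-level endpoint `/oidc/v1/token`, the `client_credentials` grant, and the `all-apis` scope follow the Databricks OAuth M2M documentation. The function name is hypothetical, and the sketch says nothing about how LakeSentry stores or rotates credentials.

```python
import base64
import urllib.parse
import urllib.request


def build_m2m_token_request(
    host: str, client_id: str, client_secret: str
) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request."""
    # The service principal's client ID and secret go in HTTP Basic auth.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    )
    return urllib.request.Request(
        url=f"https://{host}/oidc/v1/token",
        data=body.encode(),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

The response carries a short-lived access token that is then sent as a bearer token. That short lifetime is one reason M2M is generally preferable to a long-lived PAT.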

LakeSentry uses a strict sequential pipeline:

Extraction → raw.* → ledger.* → attribution → metrics.* → insight.*

Rules:

  1. Raw tables are append-only source-of-truth records from extraction.
  2. Ledger tables are built from raw tables, never directly during ingestion.
  3. Metrics are pre-aggregated from ledger data for dashboard performance.
  4. Insights are generated from ledger and metrics outputs.
  5. Rebuilds use the same transform path as regular processing.
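
The rules above can be illustrated with a toy in-memory pipeline. Every name here (`raw_usage`, `record_id`, `build_ledger`, and so on) is hypothetical, the attribution and insight stages are omitted, and the deduplication rule is an assumption. The point is only the shape: ingestion appends to raw, and ledger and metrics are always derived by the same transform functions.

```python
from collections import defaultdict

raw_usage: list = []  # stand-in for a raw.* table; rows are only ever appended


def ingest(records) -> None:
    """Extraction step: append to raw only, never write ledger rows directly."""
    raw_usage.extend(dict(r) for r in records)


def build_ledger(raw_rows) -> list:
    """Derive ledger rows from raw; the latest row per record_id wins."""
    latest = {}
    for row in raw_rows:
        latest[row["record_id"]] = row
    return list(latest.values())


def build_metrics(ledger_rows) -> dict:
    """Pre-aggregate cost per workspace from ledger rows."""
    totals = defaultdict(float)
    for row in ledger_rows:
        totals[row["workspace"]] += row["cost"]
    return dict(totals)


def run_transforms():
    """Used for both regular processing and rebuilds: raw -> ledger -> metrics."""
    ledger = build_ledger(raw_usage)
    return ledger, build_metrics(ledger)
```

Because a rebuild simply calls `run_transforms()` again over the preserved raw rows, it cannot diverge from regular processing.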

raw.* tables store normalized copies of Databricks source records. They preserve source fields and provide an audit trail for backfills and rebuilds.

ledger.* tables are the canonical business model: workspaces, clusters, warehouses, work units, runs, usage line items, principals, ownership, and related cost entities.

The attribution engine assigns usage line items to teams, projects, shared buckets, or workspace/unattributed spend. Metrics workers then precompute cost, utilization, trend, and quality aggregates used by the UI.
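
A minimal sketch of rule-based attribution follows. The tag keys (`project`, `team`), the shared-workspace set, and the precedence order are all assumptions made for illustration; LakeSentry's actual attribution rules are not described here.

```python
from collections import defaultdict
from dataclasses import dataclass, field

SHARED_WORKSPACES = {"platform-shared"}  # hypothetical shared-cost bucket


@dataclass
class LineItem:
    workspace: str
    cost: float
    tags: dict = field(default_factory=dict)


def attribute(item: LineItem) -> tuple:
    """Return (bucket_type, bucket_name) for one usage line item."""
    if "project" in item.tags:  # assumed precedence: project tag first
        return ("project", item.tags["project"])
    if "team" in item.tags:
        return ("team", item.tags["team"])
    if item.workspace in SHARED_WORKSPACES:
        return ("shared", item.workspace)
    return ("unattributed", item.workspace)  # falls back to the workspace


def roll_up(items) -> dict:
    """Sum cost per attribution bucket."""
    totals = defaultdict(float)
    for item in items:
        totals[attribute(item)] += item.cost
    return dict(totals)
```

Every line item lands in exactly one bucket, so the bucket totals always sum to total spend, including the unattributed remainder.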

Insight workers detect anomalies, waste, and optimization opportunities. Some insights generate action plans for review and approval.
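
As one simple example of what cost anomaly detection can look like, the sketch below flags a day whose cost exceeds the trailing mean by more than `k` sample standard deviations. This is a generic technique chosen for illustration, not a description of LakeSentry's detectors; the function name and default threshold are assumptions.

```python
from statistics import mean, stdev


def is_cost_anomaly(history, today: float, k: float = 3.0) -> bool:
    """Flag today's cost if it sits more than k standard deviations above the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return today > baseline  # flat history: any increase stands out
    return today > baseline + k * spread
```

A production detector would likely also account for weekly seasonality and planned changes, which this sketch ignores.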

Freshness depends on Databricks system table lag, connector schedule, and transform queue health. Cost views are designed for investigation and FinOps workflows, not second-by-second operations monitoring.