How LakeSentry Works for Databricks Cost Intelligence

LakeSentry turns Databricks system tables and API metadata into cost, workload, attribution, and optimization views.

Deployment model

LakeSentry runs as isolated single-tenant instances: each customer has separate application and database infrastructure. Isolation comes from that physical separation, so runtime tenant databases do not rely on tenant ID columns.

Data sources

LakeSentry ingests Databricks system data including:

Billing usage and list prices.
Compute clusters, node timeline, node types, warehouses, and warehouse events.
Lakeflow jobs, tasks, job runs, task runs, pipelines, and pipeline updates.
SQL query history.
Workspace, lineage, network, clean room, assistant, storage, serving, and information-schema metadata where available.
REST API enrichment for accounts, workspaces, jobs, clusters, and SQL warehouses.

MLflow source extraction and Databricks audit-log extraction are planned but not active in the current default direct-extraction registry. The Audit Log feature in LakeSentry is LakeSentry’s own internal audit trail, not a copy of Databricks audit logs.
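To make the billing inputs above concrete, here is a minimal sketch of a query that joins Databricks billing usage to list prices. The table and column names follow the public Databricks system-table schemas (system.billing.usage and system.billing.list_prices). The function name build_list_cost_query is hypothetical, and this is not LakeSentry's actual extraction query.

```python
from datetime import date


def build_list_cost_query(start_date: date) -> str:
    """Return SQL estimating list-price cost per workspace, SKU, and day.

    Hypothetical helper for illustration only; not LakeSentry's extractor.
    """
    if not isinstance(start_date, date):
        raise TypeError("start_date must be a datetime.date")
    # Formatting a validated date object keeps the inlined literal safe.
    day = start_date.strftime("%Y-%m-%d")
    return f"""
SELECT
  u.workspace_id,
  u.sku_name,
  u.usage_date,
  SUM(u.usage_quantity * lp.pricing.default) AS list_cost
FROM system.billing.usage AS u
JOIN system.billing.list_prices AS lp
  ON u.cloud = lp.cloud
 AND u.sku_name = lp.sku_name
 AND u.usage_start_time >= lp.price_start_time
 AND (lp.price_end_time IS NULL OR u.usage_end_time <= lp.price_end_time)
WHERE u.usage_date >= DATE '{day}'
GROUP BY u.workspace_id, u.sku_name, u.usage_date
""".strip()
```

The result is list-price cost only. Negotiated discounts are not visible in these two tables.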

Extraction modes

Mode	How it works	Best for
Direct Connection	LakeSentry pulls data with Databricks APIs and the SQL Statement Execution API using configured credentials.	Current self-service setup path.
External Connector	A Python wheel runs as a Databricks workflow in the customer’s workspace and pushes extracted data to LakeSentry.	Controlled deployments for customer-managed or private-network extraction.

Connectors may authenticate with personal access token (PAT) credentials or OAuth machine-to-machine (M2M) service-principal credentials. OAuth M2M is preferred when possible.
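For readers unfamiliar with OAuth M2M, the sketch below builds the client-credentials token request that a Databricks service principal sends to a workspace's /oidc/v1/token endpoint. It only constructs the request and does not send it. The function name and the example host are assumptions, and nothing here describes how LakeSentry stores or exchanges credentials internally.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(workspace_host: str, client_id: str,
                        client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth M2M token request for a workspace.

    Hypothetical helper; workspace_host is a bare hostname without a scheme.
    """
    # Client credentials travel in an HTTP Basic Authorization header.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        url=f"https://{workspace_host}/oidc/v1/token",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
```

Passing the returned object to urllib.request.urlopen would perform the exchange. The JSON response carries a short-lived access_token, which is one reason this mode is generally preferred over a long-lived PAT.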

Sequential pipeline

LakeSentry uses a strict sequential pipeline:

Extraction → raw.* → ledger.* → attribution → metrics.* → insight.*

Rules:

Raw tables are append-only source-of-truth records from extraction.
Ledger tables are built from raw tables and are never written directly during ingestion.
Metrics are pre-aggregated from ledger data for dashboard performance.
Insights are generated from ledger and metrics outputs.
Rebuilds use the same transform path as regular processing.
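The ordering rules above can be sketched as a small runner that always walks the stages in the same order, so a rebuild is just the regular path started from a later stage. The stage names mirror the arrow diagram. STAGES, run_pipeline, and the handler mapping are hypothetical names for illustration, not LakeSentry internals.

```python
from typing import Callable, Dict, List

# Stage order taken from the pipeline diagram; names are illustrative.
STAGES = ["extraction", "raw", "ledger", "attribution", "metrics", "insight"]


def run_pipeline(handlers: Dict[str, Callable[[], None]],
                 start_at: str = "extraction") -> List[str]:
    """Run every stage from start_at onward, strictly in order.

    A stage runs only after all earlier stages in this call have finished,
    and no stage can be skipped or reordered.
    """
    if start_at not in STAGES:
        raise ValueError(f"unknown stage: {start_at}")
    completed: List[str] = []
    for stage in STAGES[STAGES.index(start_at):]:
        handlers[stage]()  # an exception here stops all downstream stages
        completed.append(stage)
    return completed
```

In this sketch, a rebuild of the business model is run_pipeline(handlers, start_at="ledger"). It reuses the same handlers as regular processing rather than a separate code path.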

Raw layer

raw.* tables store normalized copies of Databricks source records. They preserve source fields and provide an audit trail for backfills and rebuilds.

Ledger layer

ledger.* tables are the canonical business model: workspaces, clusters, warehouses, work units, runs, usage line items, principals, ownership, and related cost entities.

Attribution and metrics

The attribution engine assigns usage line items to teams, projects, shared buckets, or workspace/unattributed spend. Metrics workers then precompute cost, utilization, trend, and quality aggregates used by the UI.
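As a rough illustration of this step, the sketch below assigns each usage line item to a target and then pre-aggregates daily cost per target. The precedence used here (team tag, then project tag, then shared workspace, then unattributed) and all names are assumptions made for the example. The actual rules in LakeSentry's attribution engine are not described in this document.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Dict, Iterable, Set, Tuple


@dataclass
class UsageLineItem:
    usage_date: str            # ISO date, e.g. "2024-01-01"
    workspace_id: str
    cost: Decimal
    tags: Dict[str, str] = field(default_factory=dict)


def attribute(item: UsageLineItem, shared_workspaces: Set[str]) -> Tuple[str, str]:
    """Return (target_kind, target_name) using an assumed precedence order."""
    if "team" in item.tags:
        return ("team", item.tags["team"])
    if "project" in item.tags:
        return ("project", item.tags["project"])
    if item.workspace_id in shared_workspaces:
        return ("shared", item.workspace_id)
    return ("unattributed", item.workspace_id)


def daily_cost_by_target(items: Iterable[UsageLineItem],
                         shared_workspaces: Set[str]
                         ) -> Dict[Tuple[str, str, str], Decimal]:
    """Pre-aggregate cost per (date, target_kind, target_name)."""
    totals: Dict[Tuple[str, str, str], Decimal] = defaultdict(Decimal)
    for item in items:
        kind, name = attribute(item, shared_workspaces)
        totals[(item.usage_date, kind, name)] += item.cost
    return dict(totals)
```

Every item lands in exactly one bucket, so the aggregated totals always sum to the total ledger cost. Untagged spend stays visible as unattributed rather than disappearing.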

Insights

Insight workers detect anomalies, waste, and optimization opportunities. Some insights generate action plans for review and approval.
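This document does not specify the detection method, so the following is only a generic sketch of one common approach to cost anomaly detection: flag a day whose cost sits far above the trailing-window average. The function name, window size, threshold, and min_spread floor are all assumptions, not LakeSentry's algorithm.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def flag_cost_anomalies(daily_costs: Sequence[float], window: int = 7,
                        threshold: float = 3.0,
                        min_spread: float = 1.0) -> List[int]:
    """Return indexes of days whose cost spikes above the trailing baseline.

    A day is flagged when it exceeds the mean of the previous `window` days
    by more than `threshold` standard deviations. `min_spread` keeps a
    perfectly flat history from producing a zero divisor.
    """
    flagged: List[int] = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        baseline = mean(history)
        spread = max(pstdev(history), min_spread)
        if (daily_costs[i] - baseline) / spread > threshold:
            flagged.append(i)
    return flagged
```

Only upward spikes are flagged, and the first `window` days are never evaluated because they lack a full history.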

Freshness expectations

Freshness depends on Databricks system table lag, connector schedule, and transform queue health. Cost views are designed for investigation and FinOps workflows, not second-by-second operations monitoring.
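One way to reason about these three factors is to add them up as a rough upper bound on how long an event can take to appear in a cost view. The additive model and the function name below are assumptions for illustration. Actual lag values vary by system table and deployment.

```python
from datetime import timedelta


def worst_case_visibility_delay(source_lag: timedelta,
                                connector_interval: timedelta,
                                transform_delay: timedelta) -> timedelta:
    """Estimate an upper bound on event-to-dashboard delay.

    Assumed model: an event first waits for the Databricks system table to
    publish it, then up to one full connector interval for the next
    extraction, then for the transform queue to process it.
    """
    parts = {
        "source_lag": source_lag,
        "connector_interval": connector_interval,
        "transform_delay": transform_delay,
    }
    for name, value in parts.items():
        if value < timedelta(0):
            raise ValueError(f"{name} must not be negative")
    return source_lag + connector_interval + transform_delay
```

For example, a 2-hour source lag, a 6-hour connector schedule, and a 15-minute transform backlog give a bound of 8 hours 15 minutes. That is acceptable for FinOps review but not for real-time alerting.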