How LakeSentry Works for Databricks Cost Intelligence
LakeSentry turns Databricks system tables and API metadata into cost, workload, attribution, and optimization views.
Deployment model
LakeSentry runs as isolated single-tenant instances. Each customer has separate application and database infrastructure. Runtime tenant databases do not rely on tenant ID columns for isolation.
Data sources
LakeSentry ingests Databricks system data including:
- Billing usage and list prices.
- Compute clusters, node timeline, node types, warehouses, and warehouse events.
- Lakeflow jobs, tasks, job runs, pipelines, and pipeline updates.
- SQL query history.
- Workspace, lineage, network, clean room, assistant, storage, serving, and information-schema metadata where available.
- REST API enrichment for accounts, workspaces, jobs, clusters, and SQL warehouses.
MLflow source extraction and Databricks audit-log extraction are planned but are not active in the current default direct-extraction registry. The Audit Log feature in LakeSentry is LakeSentry’s own internal audit trail, not a copy of Databricks audit logs.
Extraction modes
| Mode | How it works | Best for |
|---|---|---|
| Direct Connection | LakeSentry pulls data with Databricks APIs and SQL Statement API using configured credentials. | Current self-service setup path. |
| External Connector | A Python wheel runs as a Databricks workflow in the customer’s workspace and pushes extracted data to LakeSentry. | Controlled deployments for customer-managed or private-network extraction. |
Connectors can authenticate with a personal access token (PAT) or with OAuth machine-to-machine (M2M) service-principal credentials. OAuth M2M is preferred when possible.
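As a rough illustration of the Direct Connection path, the sketch below builds (but does not send) a request to the Databricks SQL Statement Execution API that reads recent rows from `system.billing.usage`. The workspace URL, warehouse ID, token, and the helper name `build_usage_request` are placeholders chosen for this example; this is not LakeSentry's actual extraction code.

```python
import json
import urllib.request


# Hypothetical helper for illustration; LakeSentry's real extractor is not shown here.
def build_usage_request(
    host: str, token: str, warehouse_id: str, start_date: str
) -> urllib.request.Request:
    """Build a POST to the SQL Statement Execution API for billing usage rows."""
    payload = {
        "warehouse_id": warehouse_id,
        "statement": (
            "SELECT usage_date, workspace_id, sku_name, usage_quantity "
            "FROM system.billing.usage WHERE usage_date >= :start_date"
        ),
        # Named parameter, bound server-side instead of string-formatted into SQL.
        "parameters": [{"name": "start_date", "value": start_date, "type": "DATE"}],
        "wait_timeout": "30s",
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/sql/statements",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_usage_request(
    host="https://example.cloud.databricks.com",  # placeholder workspace URL
    token="<access-token>",  # a PAT or an OAuth M2M access token
    warehouse_id="<warehouse-id>",  # placeholder SQL warehouse ID
    start_date="2024-01-01",
)
# Sending is left out on purpose: urllib.request.urlopen(request) would execute it.
```

The same bearer header works for either credential type. With OAuth M2M the token is short-lived and obtained from the service principal's client credentials, which is one reason that option is usually the safer choice.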
Sequential pipeline
LakeSentry uses a strict sequential pipeline:
Extraction → raw.* → ledger.* → attribution → metrics.* → insight.*
Rules:
- Raw tables are append-only source-of-truth records from extraction.
- Ledger tables are built from raw tables, never directly during ingestion.
- Metrics are pre-aggregated from ledger data for dashboard performance.
- Insights are generated from ledger and metrics outputs.
- Rebuilds use the same transform path as regular processing.
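The rules above can be sketched as a toy in-memory pipeline. Everything here is an assumption made for illustration: the function names, the dictionary-based store, the field names, and the arbitrary 100-DBU insight threshold are not LakeSentry's real schema or code. The point is the shape: ingestion only appends to raw, every derived layer is built from the one before it, and a rebuild reuses the same transform list.

```python
from typing import Callable, Dict, List

Store = Dict[str, List[dict]]


def ingest(store: Store, records: List[dict]) -> None:
    # Ingestion only ever appends to the raw layer.
    store.setdefault("raw", []).extend(records)


def build_ledger(store: Store) -> None:
    # Ledger rows are derived from raw rows, never written during ingestion.
    store["ledger"] = [
        {"workspace": r["workspace_id"], "dbus": r["usage_quantity"]}
        for r in store["raw"]
    ]


def attribute(store: Store) -> None:
    # Placeholder attribution: everything lands in one bucket.
    store["attribution"] = [{**row, "team": "unattributed"} for row in store["ledger"]]


def build_metrics(store: Store) -> None:
    totals: Dict[str, float] = {}
    for row in store["attribution"]:
        totals[row["team"]] = totals.get(row["team"], 0.0) + row["dbus"]
    store["metrics"] = [{"team": team, "dbus": dbus} for team, dbus in totals.items()]


def build_insights(store: Store) -> None:
    # Arbitrary illustrative threshold, not a LakeSentry rule.
    store["insight"] = [m for m in store["metrics"] if m["dbus"] > 100]


TRANSFORMS: List[Callable[[Store], None]] = [
    build_ledger,
    attribute,
    build_metrics,
    build_insights,
]


def run_transforms(store: Store) -> None:
    for step in TRANSFORMS:  # strict order, no skipping ahead
        step(store)


def rebuild(store: Store) -> None:
    # Drop derived layers, keep raw, then replay the same transform path.
    for layer in ("ledger", "attribution", "metrics", "insight"):
        store.pop(layer, None)
    run_transforms(store)
```

Because `rebuild` calls `run_transforms` rather than a separate code path, a rebuild over unchanged raw data yields the same derived layers as regular processing.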
Raw layer
raw.* tables store normalized copies of Databricks source records. They preserve source fields and provide an audit trail for backfills and rebuilds.
Ledger layer
ledger.* tables are the canonical business model: workspaces, clusters, warehouses, work units, runs, usage line items, principals, ownership, and related cost entities.
Attribution and metrics
The attribution engine assigns usage line items to teams, projects, shared buckets, or workspace/unattributed spend. Metrics workers then precompute cost, utilization, trend, and quality aggregates used by the UI.
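A minimal sketch of that two-step flow follows. The precedence order (team tag, then project tag, then shared workspace, then unattributed), the tag keys, and the `UsageLineItem` shape are assumptions invented for this example; the actual attribution rules are configured in LakeSentry and are not described here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class UsageLineItem:
    workspace_id: str
    cost: float
    tags: Dict[str, str] = field(default_factory=dict)


def attribute(item: UsageLineItem, shared_workspaces: Set[str]) -> Tuple[str, str]:
    """Return (target_type, target_name) under a made-up precedence order."""
    if "team" in item.tags:
        return ("team", item.tags["team"])
    if "project" in item.tags:
        return ("project", item.tags["project"])
    if item.workspace_id in shared_workspaces:
        return ("shared", item.workspace_id)
    # Nothing matched: the spend stays at workspace level as unattributed.
    return ("unattributed", item.workspace_id)


def cost_by_target(
    items: Iterable[UsageLineItem], shared_workspaces: Set[str]
) -> Dict[Tuple[str, str], float]:
    """Pre-aggregate cost per attribution target, as a metrics worker might."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for item in items:
        totals[attribute(item, shared_workspaces)] += item.cost
    return dict(totals)


items = [
    UsageLineItem("ws-1", 10.0, {"team": "data"}),
    UsageLineItem("ws-1", 5.0, {"project": "churn"}),
    UsageLineItem("ws-2", 2.5),
    UsageLineItem("ws-3", 1.0),
    UsageLineItem("ws-1", 4.0, {"team": "data"}),
]
totals = cost_by_target(items, shared_workspaces={"ws-2"})
```

Keeping attribution as a pure function of the line item makes it easy to re-run over the ledger when rules change, and the aggregation step is what lets dashboards read small precomputed tables instead of scanning every line item.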
Insights
Insight workers detect anomalies, waste, and optimization opportunities. Some insights generate action plans for review and approval.
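To make "detect anomalies" concrete, here is one common approach: compare each day's cost with a trailing window and flag large upward deviations. This is a generic technique shown as a sketch. The document does not say which detection methods LakeSentry uses, and the function name, window size, threshold, and 5% spread floor are all illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def find_cost_anomalies(
    daily_costs: List[float], window: int = 7, threshold: float = 3.0
) -> List[int]:
    """Return indexes of days whose cost is far above the trailing window."""
    anomalies: List[int] = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        baseline = mean(history)
        # Floor the spread at 5% of the baseline so a perfectly flat history
        # does not flag every tiny increase.
        spread = max(pstdev(history), 0.05 * baseline)
        if daily_costs[i] > baseline + threshold * spread:
            anomalies.append(i)
    return anomalies


costs = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 100.0, 250.0, 101.0]
spikes = find_cost_anomalies(costs)
```

In this made-up series only the 250.0 day is flagged. The day after is not, because the spike itself widens the trailing window's spread.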
Freshness expectations
Freshness depends on Databricks system table lag, connector schedule, and transform queue health. Cost views are designed for investigation and FinOps workflows, not second-by-second operations monitoring.
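One simple way to reason about this is to treat the three factors as additive delays. The sketch below uses that assumption; the function name and the durations are invented for illustration and are not published Databricks or LakeSentry figures.

```python
from datetime import timedelta


def worst_case_staleness(
    system_table_lag: timedelta,
    connector_interval: timedelta,
    transform_delay: timedelta,
) -> timedelta:
    """Upper-bound estimate: a record waits out each stage in turn."""
    return system_table_lag + connector_interval + transform_delay


# Illustrative values only.
estimate = worst_case_staleness(
    system_table_lag=timedelta(hours=2),
    connector_interval=timedelta(hours=1),
    transform_delay=timedelta(minutes=15),
)
```

With these example inputs the estimate is 3 hours 15 minutes, which is consistent with using cost views for investigation rather than live monitoring.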