Exam Study Notebook

Databricks Certified Data Engineer Associate

45 questions
90 minutes
Pass: ~70%
Validity: 2 years
Fee: $200
01
Databricks Intelligence Platform
Value of the Data Intelligence Platform Core

The Databricks Data Intelligence Platform is a unified, cloud-native analytics platform that merges data lakes and data warehouses into what Databricks calls a lakehouse architecture. The key value proposition is that it eliminates the need to maintain separate systems for data engineering, data science, machine learning, and SQL analytics — everything runs on one platform with shared governance.

What is a Lakehouse?

A lakehouse combines the low-cost, scalable storage of a data lake (files on S3/ADLS/GCS in open formats like Parquet) with the reliability, performance, and governance features of a data warehouse (ACID transactions, schema enforcement, indexing). Delta Lake is the technology that makes this possible — it's an open-source storage layer that sits on top of your cloud object storage and adds transactional capabilities.

Core Components

  • Delta Lake — The default storage format. Open-source, ACID-compliant, built on Parquet. Enables time travel (query previous versions), schema enforcement, schema evolution, and optimized file management. Every table you create in Databricks is a Delta table by default.
  • Unity Catalog — Centralized governance layer for ALL data and AI assets across workspaces. Handles permissions, lineage, auditing, and data discovery. This is the single source of truth for who can access what.
  • Apache Spark — The distributed processing engine under the hood. Databricks provides optimized runtimes (Databricks Runtime) that bundle Spark with performance enhancements and compatible libraries.
  • Photon — Databricks' proprietary C++-based vectorized query engine. Replaces parts of the standard Spark execution engine for dramatically faster SQL and DataFrame operations. More on this below.
  • Lakeflow — Databricks' unified solution for data ingestion (Auto Loader, Lakeflow Connect), transformation (Declarative Pipelines / formerly DLT), and orchestration (Jobs / Workflows).

The platform's value is unification. Instead of Kafka + Airflow + Snowflake + Spark + MLflow as separate tools, Databricks provides equivalents of all of these on one governed platform. The exam tests whether you understand when each component is the right tool for a given use case.

Features That Optimize Query Performance & Data Layout Critical

The exam objective says: "Enable features that simplify data layout decisions and optimize query performance." This is about knowing which Databricks features automatically improve performance without manual tuning.

Liquid Clustering

Liquid Clustering is Databricks' modern replacement for both partitioning and Z-ordering. Instead of choosing partition columns upfront (which is hard to change later and often leads to small-file problems), you specify clustering columns and Databricks incrementally re-organizes data based on actual query patterns.

  • Enabled with CLUSTER BY (col1, col2) during table creation or via ALTER TABLE.
  • Incrementally re-clusters — no need to run manual OPTIMIZE commands with Z-ORDER.
  • Adapts over time as query patterns change.
  • Eliminates the small-file problem that partitioning creates with high-cardinality columns.
  • Works with both batch and streaming workloads.
CREATE TABLE sales (
  sale_id BIGINT,
  region STRING,
  sale_date DATE,
  amount DOUBLE
)
CLUSTER BY (region, sale_date);  -- Databricks handles the rest

Predictive I/O

A Databricks-managed optimization that automatically predicts which data files a query will need and pre-fetches them. You don't configure anything — it works transparently on Delta tables. It analyzes query patterns and statistics to skip irrelevant files entirely (data skipping) and pre-load relevant ones.

Photon Engine

Photon is a C++-based vectorized query engine that replaces parts of the Spark JVM execution engine. It processes data in columnar batches rather than row-by-row, which dramatically speeds up SQL and DataFrame operations.

  • Automatically enabled on Photon-capable clusters and all SQL warehouses.
  • Best for: aggregations, joins, filters, window functions — standard SQL workloads.
  • Works seamlessly — no code changes needed.
  • Typically 2-8x faster than standard Spark for compatible operations.

Photon only accelerates SQL and DataFrame operations. It does NOT speed up arbitrary Python code, UDFs, ML training, or RDD operations. If an exam question asks about optimizing a pipeline that's slow due to a Python UDF, Photon is the wrong answer — you'd need to rewrite the UDF as a native Spark function or use pandas UDFs.

Auto-Compaction & Optimized Writes

  • Auto-compaction — Automatically compacts small files into larger ones after writes. Prevents the "small file problem" that degrades read performance.
  • Optimized writes — Repartitions data before writing to reduce the number of output files. Particularly useful for streaming workloads that produce many tiny files.
Compute Types — When to Use Each Critical

The exam objective says: "Identify the applicable compute to use for a specific use case." This means you'll be given a scenario and need to pick the right compute type. Here's the full breakdown:

Compute TypeUse CaseKey CharacteristicsCost Model
All-Purpose ClusterInteractive development, notebook exploration, ad-hoc analysisLong-running, shared by multiple users, manually started/stopped or auto-terminatedHigher per-hour cost; stays running even when idle
Job ClusterProduction scheduled workloads, automated ETLAuto-created per job run, automatically terminated on completion, dedicated to one jobLower cost; only runs during job execution
SQL WarehouseSQL analytics, BI dashboards, SQL queries, GenieOptimized for SQL only, Photon-enabled, T-shirt sizing (S/M/L/XL), supports classic or serverlessPer-second billing; auto-stop on idle
Serverless ComputeAny workload where you want zero infrastructure managementNo cluster config, instant startup, auto-scaling, fully Databricks-managedPer-second billing, no idle costs

All-purpose clusters are NOT for production jobs. They're more expensive and stay running between uses. Production scheduled workloads should use Job Clusters. If the exam says "a data engineer needs to run a daily ETL job in production," the answer is Job Cluster (or serverless), never All-Purpose.

Serverless = zero cluster management. Any question mentioning "hands-off compute," "no infrastructure management," "auto-optimized compute managed by Databricks," or "instant startup with no configuration" = serverless. Available for SQL warehouses, notebooks, jobs, and Lakeflow pipelines.

Access Modes (formerly Cluster Modes)

  • Standard access mode (formerly Shared) — Multiple users can share the cluster. Unity Catalog enforced. Supports Python, SQL, Scala.
  • Dedicated access mode (formerly Single User) — One user assigned. Full feature support including file notification mode for Auto Loader and ML libraries.
  • No Isolation Shared — Legacy mode, minimal isolation. Not recommended.
02
Development & Ingestion
Databricks Connect Core

Databricks Connect is a client library that lets you run Spark code from your local IDE (VS Code, PyCharm, IntelliJ) against a remote Databricks cluster. Your code is written locally but executes on the remote cluster's compute resources — you get the power of distributed Spark without leaving your development environment.

Key Facts for the Exam

  • Code executes on the remote cluster, not locally. Only the driver-side operations run on your machine.
  • Supports Python (PySpark) and Scala.
  • Uses the same Spark APIs available in notebooks — DataFrames, Spark SQL, Structured Streaming.
  • Enables standard software engineering workflows: unit testing, IDE debugging, Git integration, CI/CD.
  • Requires a running cluster to connect to (all-purpose or assigned cluster).
  • Databricks Connect v2 is built on top of Spark Connect and is the current recommended version.

Databricks Connect does NOT run Spark locally. It's not a local Spark installation. It connects to a remote cluster and sends execution plans there. If the exam asks where code executes, the answer is always "on the remote Databricks cluster."

When to Use Databricks Connect vs Notebooks

Use Databricks Connect when:
  • You prefer your local IDE
  • You need local debugging/breakpoints
  • You're building CI/CD pipelines
  • You want unit testing frameworks (pytest)
Use Notebooks when:
  • Exploring data interactively
  • Collaborating in real-time
  • Quick ad-hoc analysis
  • Visualizations and dashboards
Notebook Capabilities Core

Databricks notebooks are collaborative, interactive documents that combine code, visualizations, and narrative text. They're the primary development tool for most Databricks users.

Core Capabilities

  • Multi-language support — A single notebook can mix Python, SQL, Scala, and R cells using magic commands: %python, %sql, %scala, %r, %md (markdown).
  • Built-in visualizations — Automatically generates charts from query results. Supports bar, line, scatter, pie, map, and more.
  • Real-time collaboration — Multiple users can edit the same notebook simultaneously, with commenting and presence indicators.
  • Version history — Every save creates a version. You can diff, revert, and track changes over time. Integrates with Git (Repos).
  • Widgets — Input parameters using dbutils.widgets. Types: text, dropdown, combobox, multiselect. Useful for parameterized notebooks.
  • %run — Execute another notebook inline. The called notebook shares the same scope (variables, imports) as the calling notebook. Useful for modular code organization.
  • dbutils — Utility module with sub-modules: dbutils.fs (file system), dbutils.secrets (secret management), dbutils.widgets (parameters), dbutils.notebook (workflow control).

%run vs dbutils.notebook.run()

Feature%rundbutils.notebook.run()
ScopeShares scope (variables accessible)Isolated scope (separate execution)
Return valueNo explicit returnReturns a string value
Use caseImporting shared code/configsOrchestrating notebook workflows
ParallelismSequential onlyCan run in parallel with threading
Auto Loader — Valid Sources & Use Cases Critical

Auto Loader is Databricks' recommended tool for incrementally ingesting files from cloud object storage into Delta Lake. It's built on Spark Structured Streaming and uses the cloudFiles format.

Valid Sources — CLOUD OBJECT STORAGE ONLY

  • AWS S3
  • Azure Data Lake Storage Gen2 (ADLS Gen2)
  • Azure Blob Storage (legacy WASB driver, deprecated — use ABFS)
  • Google Cloud Storage (GCS)

Auto Loader ONLY reads from cloud object storage. It does NOT read from Kafka, JDBC databases, REST APIs, or any other source. This is a common wrong answer on the exam. If you need to ingest from Kafka, you use Spark Structured Streaming with the Kafka connector directly — not Auto Loader.

Supported File Formats

JSON, CSV, XML, Parquet, Avro, ORC, TEXT, BINARYFILE — memorize these. The exam may ask whether a specific format is supported.

Key Benefits Over Basic File Source

  • Scalability — Can discover billions of files efficiently. File discovery cost scales with number of new files, not total directory size.
  • Exactly-once processing — Uses a RocksDB-backed checkpoint to track which files have been processed. Guarantees no duplicates, no missed files.
  • Schema inference and evolution — Automatically infers schema and handles new columns (see next topic).
  • Fault tolerance — Resumes from checkpoint on failure. No manual state management needed.
  • Cost optimization — Use Trigger.AvailableNow for batch-style runs instead of continuous streaming to save compute when low latency isn't required.

File Discovery Modes

ModeHow It WorksBest ForSetup
Directory Listing (default)Periodically lists files in the source directory to find new onesSmall-to-medium volumes, simple setup, no cloud permissions neededNo extra config
File NotificationUses cloud event notifications (SQS/EventGrid/Pub-Sub) to detect new filesLarge volumes with billions of files, lower latency, cost-efficient at scaleRequires cloud permissions; use dedicated access mode

File notification mode is only supported on compute with dedicated access mode (formerly single user). If you're on standard access mode, you must use directory listing.

Auto Loader — Syntax & Configuration Syntax

Basic Read Syntax

# Reading with Auto Loader
df = (spark.readStream
    .format("cloudFiles")                          # Always "cloudFiles"
    .option("cloudFiles.format", "json")            # Source file format
    .option("cloudFiles.schemaLocation", "/schema") # Where inferred schema is stored
    .load("/path/to/source/files")                  # Cloud storage path
)

Basic Write Syntax

# Writing to Delta table
(df.writeStream
    .option("checkpointLocation", "/checkpoint")  # REQUIRED for exactly-once
    .option("mergeSchema", "true")                # Allow schema evolution on write
    .trigger(availableNow=True)                  # Process all available, then stop
    .table("catalog.schema.target_table")          # Target Delta table
)

Common Options

OptionPurposeExample Value
cloudFiles.formatFile format to read"json", "csv", "parquet", etc.
cloudFiles.schemaLocationPath to store/read inferred schema"/mnt/schema/orders"
cloudFiles.schemaEvolutionModeHow to handle new columns"addNewColumns", "rescue", etc.
cloudFiles.inferColumnTypesInfer data types (vs all STRING)"true" or "false"
cloudFiles.useNotificationsEnable file notification mode"true"
checkpointLocationCheckpoint for exactly-once (on write)"/mnt/checkpoint/orders"

Auto Loader Inside Lakeflow Declarative Pipelines

When using Auto Loader inside a Lakeflow pipeline, you do NOT need to specify schemaLocation or checkpointLocation — the pipeline manages these automatically.

import dlt

@dlt.table
def bronze_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/orders/")
    )
Auto Loader — Schema Inference & Evolution Critical

Schema handling is one of the most exam-tested Auto Loader topics. There are two distinct concepts: schema inference (what is the schema?) and schema evolution (what happens when the schema changes?).

Schema Inference

When you don't provide a schema, Auto Loader samples files in the source directory and infers one. The inferred schema is saved to cloudFiles.schemaLocation so it doesn't re-infer on every restart. Key options:

  • cloudFiles.inferColumnTypes — When "true", infers actual types (INT, DOUBLE, TIMESTAMP). When "false", all columns come in as STRING. Default varies by format.
  • cloudFiles.schemaHints — Manually override inferred types for specific columns. Example: "date DATE, amount DOUBLE"

Schema Evolution Modes (cloudFiles.schemaEvolutionMode)

This controls what happens when Auto Loader encounters data with new columns that weren't in the original schema:

ModeBehaviorNew Column HandlingStream Restarts?
addNewColumnsAutomatically adds new columns to the schemaNew columns appear in the target tableYes — stream restarts to apply new schema
rescueRoutes unmatched fields to a _rescued_data columnNew fields stored as JSON string in _rescued_dataNo — schema stays the same
failOnNewColumnsThrows an error when new columns appearPipeline stops; manual schema update requiredN/A — pipeline fails
noneIgnores new columns completelyNew fields are silently droppedNo

inferSchema and schemaEvolutionMode are different things. inferColumnTypes controls whether types are inferred at all. schemaEvolutionMode controls what happens when NEW columns appear in the data. The exam deliberately confuses these. If a question asks "how to automatically add new columns as the schema changes," the answer is schemaEvolutionMode = "addNewColumns" — NOT inferSchema = true.

The _rescued_data Column

When using rescue mode, any data that doesn't match the current schema gets stored in a special column called _rescued_data as a JSON string. This is useful for debugging and data recovery — you don't lose data, but you also don't pollute your schema with every random new field. You can query this column later to inspect what was rescued and decide whether to add those columns officially.

Built-in Debugging Tools Core
  • Notebook debugger — Python notebooks support step-through debugging with breakpoints. You can inspect variables, view the call stack, and step over/into/out of functions. Available in the notebook toolbar.
  • Spark UI — Built-in web UI for diagnosing Spark job performance. Accessible from any cluster. Covered in depth in Section 4.
  • Driver logs — Access stdout/stderr from the cluster driver. Useful for print-statement debugging and reviewing error tracebacks.
  • %run for modular testing — Break code into separate notebooks and test each independently.
  • dbutils.notebook.run() — Run notebooks programmatically and capture return values for integration testing.
  • Databricks Assistant — AI-powered code completion and error explanation built into the notebook editor.
03
Data Processing & Transformations
Medallion Architecture (Bronze / Silver / Gold) Core

The Medallion Architecture is a data design pattern that organizes a lakehouse into three layers of increasing data quality. Think of it as a progressive refinement pipeline.

LayerPurposeData QualityTypical OperationsExample
Bronze (Raw)Land raw data exactly as-is from sourceUnvalidated, may contain duplicates, errors, nullsAppend-only ingestion, add metadata (load_timestamp, source_file)Raw JSON API responses, CSV dumps
Silver (Validated)Clean, filter, conform, deduplicateSchema enforced, types cast, nulls handled, dedupedType casting, deduplication, joins, filtering invalid recordsCleaned customer records with proper types
Gold (Business)Business-level aggregations and curated datasetsCurated for specific use cases, ready for BI/MLAggregations, KPIs, star-schema facts/dimensionsDaily revenue by region, customer lifetime value

The key principle: Each layer increases data quality and decreases data volume. Bronze is the widest (everything), Gold is the narrowest (only what's needed for business questions). You should be able to fully recompute Silver from Bronze, and Gold from Silver.

Bronze is NOT "bad data." It's raw, unprocessed data preserved as-is for auditability and reprocessing. Don't let an exam answer trick you into thinking Bronze requires validation — that's Silver's responsibility.

Cluster Types & Configuration for Scenarios Core

This objective asks you to pick the right cluster type and configuration based on a scenario. The decision tree is:

  • Multiple users running interactive queries? → Standard access mode (shared cluster) with autoscaling.
  • Single heavy ETL job? → Dedicated access mode with workers sized for the data volume.
  • Small dataset, single-machine ML library? → Single Node cluster (no workers, just a driver).
  • Production scheduled job? → Job cluster with autoscaling. Terminates after job completes.
  • SQL/BI workload? → SQL Warehouse (serverless preferred for instant start).
  • No infrastructure management desired? → Serverless compute.

Autoscaling

Set a min and max number of workers. Databricks automatically adjusts the cluster size based on workload. For cost-sensitive workloads, set min workers to 0 or 1. For performance-critical workloads, set min workers higher to avoid cold-start delays.

Lakeflow Spark Declarative Pipelines — Advantages Core

The exam objective says: "Emphasize the advantages of Lakeflow Spark Declarative Pipelines for ETL process in Databricks." This is a "why use this" question. Here's the pitch:

Key Advantages

  • Declarative — You define WHAT your pipeline should produce (target tables, transformations, quality rules). Databricks figures out HOW (infrastructure, dependencies, execution order, scaling).
  • Automatic dependency management — The framework builds a DAG of your tables automatically. If table B reads from table A, Databricks ensures A is processed first. No manual orchestration.
  • Built-in data quality (Expectations) — Declarative constraints that validate data inline. No separate testing framework needed.
  • Automatic infrastructure management — Clusters scale up/down based on workload. No manual cluster configuration within the pipeline.
  • Unified batch and streaming — Same API handles both. A streaming table processes incrementally; a materialized view recomputes fully. You pick the right abstraction and Databricks handles the rest.
  • Built-in monitoring — Event log, data quality metrics, pipeline health all available in the UI.
  • Schema evolution handled — Works with Auto Loader's schema evolution seamlessly.
  • Fault tolerance — Automatic retries, checkpointing, and exactly-once processing built in.

Pipeline Editions

EditionIncludesUse When
CoreStreaming ingest, basic transformationsSimple ETL without CDC or quality checks
ProCore + Change Data Capture (SCD Type 1/2)Need to track row-level changes from source
AdvancedPro + Data quality expectationsNeed built-in data quality validation

Expectations require the Advanced edition. If your pipeline uses @dlt.expect and you're on Core or Pro, you'll get an error message telling you to upgrade. The exam may test whether you know which features are in which edition.

Implementing Lakeflow Spark Declarative Pipelines Critical

Core Building Blocks

  • Streaming Table — Append-only table that processes data incrementally. Each pipeline update processes only new data. Best for event logs, CDC, and any append-heavy workload.
  • Materialized View — Fully recomputed on each pipeline update. Best for aggregations, dimensions, and slowly-changing reference data.
  • Flows — Define how data moves from source to target. Default flows are created automatically when you define a table. You can create additional flows for complex patterns (e.g., multiple sources feeding one table).

Reading Data Within a Pipeline

✅ dlt.read("table_name")

Use for reading from tables within the same pipeline. Maintains dependency graph. Table name is just the table name (no catalog/schema prefix).

❌ spark.read.table("table_name")

Bypasses the pipeline dependency graph. Use ONLY for reading from external tables outside the pipeline, with full 3-level name: catalog.schema.table.

Data Quality Expectations — ONLY THREE TYPES EXIST

ExpectationOn ViolationRecords Kept?Pipeline Stops?
@dlt.expect("name", "condition")Log violation in event log metricsYES — all records kept (pass + fail)NO
@dlt.expect_or_drop("name", "condition")Silently drop failing recordsNO — only passing records keptNO
@dlt.expect_or_fail("name", "condition")Halt entire pipeline immediatelyN/A — pipeline stopsYES

@dlt.expect_or_quarantine DOES NOT EXIST. This is the most common fabricated answer on the exam. If you want to quarantine bad records, the pattern is: use @dlt.expect (which keeps ALL records), then in a downstream table, filter where the expectation condition is false into a separate "quarantine" table. This is a two-step pattern, not a single decorator.

Complete Pipeline Example

import dlt
from pyspark.sql.functions import col, current_timestamp

# Bronze: Raw ingestion with Auto Loader
@dlt.table(
    comment="Raw order data from S3"
)
def bronze_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/orders/")
        .select("*", current_timestamp().alias("_load_ts"))
    )

# Silver: Cleaned + validated
@dlt.table
@dlt.expect("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders").select(
        col("order_id").cast("long"),
        col("customer_id").cast("long"),
        col("amount").cast("double"),
        col("order_date").cast("date")
    )

# Gold: Aggregated for business
@dlt.table
def gold_daily_revenue():
    return dlt.read("silver_orders").groupBy("order_date").agg(
        sum("amount").alias("total_revenue"),
        count_distinct("order_id").alias("unique_orders")
    )

Pipeline Modes

ModeBehaviorUse When
TriggeredRuns once, processes all available data, then stopsBatch/scheduled ETL, cost-sensitive, latency isn't critical
ContinuousRuns constantly, processes data as it arrivesReal-time streaming, low-latency requirements
DDL/DML Features Syntax

CREATE TABLE Variants

-- 1. CREATE OR REPLACE: Drops and recreates REGARDLESS of existence
CREATE OR REPLACE TABLE employees (
  employeeId STRING,
  startDate  DATE,
  avgRating  FLOAT
);

-- 2. CREATE IF NOT EXISTS: Only creates if table doesn't exist (skips otherwise)
CREATE TABLE IF NOT EXISTS employees (
  employeeId STRING,
  startDate  DATE,
  avgRating  FLOAT
);

-- 3. CTAS: Schema inferred from the SELECT query
CREATE TABLE employees AS SELECT * FROM source_table;

"Regardless of whether a table already exists with this name" = CREATE OR REPLACE. "Only if the table doesn't already exist" = CREATE TABLE IF NOT EXISTS. The exam tests exact wording. REPLACE always recreates; IF NOT EXISTS never overwrites.

DML Operations

-- INSERT: Append new rows
INSERT INTO my_table VALUES ('a1', 6, 9.4);

-- UPDATE: Modify existing rows
UPDATE my_table SET avgRating = 10.0 WHERE employeeId = 'a1';

-- DELETE: Remove rows
DELETE FROM my_table WHERE employeeId = 'a1';

-- MERGE (Upsert): Insert or update based on match
MERGE INTO target USING source
  ON target.id = source.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;

INSERT syntax word order: It's INSERT INTO table VALUES (...). NOT INSERT VALUES (...) INTO table. The exam's sample Q5 tests exactly this. Option C reverses the word order — that's the trap.

PySpark Aggregations & Complex Metrics Syntax

Common Aggregation Functions

from pyspark.sql.functions import (
    sum, count, count_distinct, avg, min, max,
    round, when, col
)

# Group by + multiple aggregations
daily_revenue = billing_df.groupBy("billing_date").agg(
    sum("amount_billed").alias("total_revenue"),
    count_distinct("billing_id").alias("unique_invoices"),
    count("*").alias("total_records"),
    avg("amount_billed").alias("avg_amount"),
    max("amount_billed").alias("max_amount")
)

# Window functions
from pyspark.sql.window import Window
window_spec = Window.partitionBy("department").orderBy("billing_date")
df = df.withColumn("running_total", sum("amount").over(window_spec))

count() vs count_distinct() vs sum() — The exam's sample Q1 tests this directly. sum("billing_id") sums the numeric ID values, which is meaningless (option A is wrong). count_distinct("patient_id") counts unique patients, not invoices (option C is wrong). count_distinct("billing_id") counts unique invoices — that's the correct answer (D). Read the question carefully to determine whether they want count, unique count, or sum.

04
Productionizing Data Pipelines
DABs vs Traditional Deployment Methods Core

Declarative Automation Bundles (DABs, formerly Databricks Asset Bundles / DAB) are an infrastructure-as-code (IaC) approach to managing Databricks projects. They let you define jobs, pipelines, notebooks, and other resources as YAML source files, version-control them in Git, and deploy them across environments using the Databricks CLI.

AspectDABs (Current)Legacy Deployment
Configuration formatYAML (databricks.yml)JSON (exported/imported via REST API or CLI)
Environment managementtargets with per-env overrides (dev/staging/prod in one config)Separate JSON files per environment, manual editing
Version controlYAML files in Git, fully source-controlled, diffableJSON exports, less structured, harder to diff
CI/CD integrationNative CLI: databricks bundle validate|deploy|runManual REST API calls, custom scripts
Deployment approachDeclarative — define desired state, CLI reconcilesImperative — manually create/update each resource
CollaborationStandard code review workflows (PRs, diffs)Harder to review JSON changes meaningfully

DABs use YAML, not JSON. Legacy deployment used JSON. If the exam asks about the configuration format for Asset Bundles, the answer is YAML (databricks.yml). This is a simple but testable distinction.

Asset Bundles — Structure & Configuration Syntax

Bundle Project Structure

my-etl-project/
├── databricks.yml          # Root config (required, one per project)
├── resources/
│   ├── etl_job.yml         # Job definitions
│   └── pipeline.yml        # Pipeline definitions
├── src/
│   ├── ingest.py           # Notebook/script source files
│   ├── transform.py
│   └── quality_checks.py
└── tests/
    └── test_transform.py   # Unit tests

Key YAML Structure

# databricks.yml
bundle:
  name: my-etl-bundle           # Required: unique bundle name

include:
  - resources/*.yml              # Include additional config files

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"   # 6 AM daily
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/ingest.py
        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./src/transform.py

targets:
  dev:                              # Development target
    default: true
    workspace:
      host: https://dev.cloud.databricks.com
  prod:                             # Production target
    workspace:
      host: https://prod.cloud.databricks.com
    run_as:
      service_principal_name: prod-sp

Key YAML Mappings

  • bundle — Required. Contains the bundle name and optional metadata.
  • resources — Defines Databricks objects: jobs, pipelines, schemas, experiments, models.
  • targets — Environment-specific overrides. Each target can override workspace, resources, and run_as settings.
  • workspace — Workspace connection details (host, artifact_path, etc.).
  • include — References to additional YAML config files.
  • run_as — Identity (user or service principal) under which jobs execute.

CLI Commands

databricks bundle validate   # Check config for errors
databricks bundle deploy     # Deploy to target workspace
databricks bundle run        # Run a specific resource
databricks bundle destroy    # Remove deployed resources
Deploy, Repair, and Rerun Workflows Critical

Databricks Workflows (Jobs) let you orchestrate multi-task pipelines as a DAG (Directed Acyclic Graph). Tasks can depend on other tasks, and the workflow engine handles execution order, retries, and failure recovery.

How Repair Run Works

When a task in a job fails and you fix the underlying issue, you trigger a repair run instead of restarting the entire job. Here's what happens:

  1. Upstream successful tasks DO NOT rerun. Their results are reused. This saves time and compute cost.
  2. The failed task reruns with the fix applied.
  3. All downstream tasks of the failed task rerun — but only if the failed task now succeeds.

Example: Job has tasks A → B → C → D. Task B fails.

TaskOriginal RunRepair Run
A✅ Succeeded⏭ Skipped (reuses result)
B❌ Failed🔄 Reruns (with fix)
C⏭ Never ran🔄 Runs if B succeeds
D⏭ Never ran🔄 Runs if C succeeds

Repair run does NOT restart the entire job from Task A. This is a very common exam question. Only the failed task and its downstream dependents are executed. Upstream successful tasks are skipped entirely.

Serverless Compute Core

Serverless compute is Databricks' fully managed compute option. You don't create, configure, or manage clusters — Databricks handles all of it.

Key Characteristics

  • No cluster configuration — No instance types, no worker counts, no autoscaling ranges. Just run.
  • Instant startup — No cold-start cluster provisioning. Compute is available in seconds.
  • Auto-scaling — Scales based on workload demand automatically.
  • Per-second billing — Pay only for compute you use. No idle costs.
  • Available for — SQL warehouses, notebooks, jobs/workflows, and Lakeflow pipelines.

The exam objective says "Use serverless for a hands-off, auto-optimized compute managed by Databricks." — that's basically the definition. Any question that describes compute with zero management overhead, instant availability, and Databricks-managed infrastructure = serverless.

Analyzing the Spark UI to Optimize Queries Critical

The Spark UI is your primary tool for diagnosing performance problems in Spark workloads. The exam tests whether you can look at symptoms and identify the root cause.

Key Tabs

  • Jobs tab — Shows all Spark jobs triggered by your code. Click into a job to see its stages.
  • Stages tab — Shows shuffle read/write, task count, task distribution, and input/output sizes per stage. This is where you spot skew and spill.
  • SQL/DataFrame tab — Shows the physical execution plan, including scan types, join strategies, and filter pushdown. Useful for understanding how Spark translates your query.
  • Storage tab — Shows cached DataFrames/tables and their memory/disk usage.
  • Executors tab — Shows resource usage per executor: memory, disk, GC time.

Diagnosing Problems — The Big Four

SymptomDiagnosisRoot CauseFix
ONE task takes 10x+ longer than peers in the same stageDATA SKEWOne partition has vastly more data than others (e.g., US has 90% of rows when partitioned by country)Salt the join/group key, repartition on higher-cardinality column, use AQE skew join optimization
MANY tasks show spill to disk in the stage detailsSHUFFLE SPILLNot enough executor memory to hold shuffle data in RAM; spills to diskIncrease executor memory, reduce shuffle partition count (spark.sql.shuffle.partitions), optimize join order
Query scans entire table despite WHERE clauseNO PREDICATE PUSHDOWNFilter applied after scan instead of before; or table has no clustering/Z-orderUse Delta column statistics, liquid clustering, or ensure filter columns are in the table's layout
Thousands of tiny output filesSMALL FILE PROBLEMToo many partitions on write, streaming micro-batches creating many filesRun OPTIMIZE, enable auto-compaction, use coalesce() before write

Skew vs Spill — the #1 Spark UI trap:
Skew = ONE outlier task. You'll see one task bar much longer than the rest in the stage detail view. Caused by uneven data distribution across partitions.
Spill = MANY tasks affected. You'll see "Spill (Memory)" and "Spill (Disk)" metrics across most tasks. Caused by insufficient memory.

If the question says "one task takes significantly longer than all others" — it's skew, not spill. This distinction is critical.

05
Data Governance & Quality
Managed vs External Tables Critical
PropertyManaged TableExternal Table
Data locationDatabricks-managed storage (inside the metastore's managed location)User-specified cloud storage path (you control where data lives)
DROP TABLEDeletes metadata AND data filesDeletes metadata ONLY — data files remain in your cloud storage
Storage controlDatabricks manages the storage lifecycleYou manage the storage lifecycle and permissions
PortabilityData is tightly coupled to the metastoreData can be accessed by other tools/systems
Best forMost use cases — simplest management, recommended defaultData shared with other systems, data you don't want Databricks to delete, regulatory requirements
-- Managed table (default - no LOCATION specified)
CREATE TABLE catalog.schema.managed_tbl (id INT, name STRING);

-- External table (LOCATION specified)
CREATE TABLE catalog.schema.external_tbl (id INT, name STRING)
  LOCATION 's3://my-bucket/external-data/';

Dropping a managed table deletes everything (metadata + data). Dropping an external table only deletes metadata — your data files remain untouched. This is tested on virtually every exam. The way to remember: managed = Databricks manages everything including deletion. External = you manage the storage, Databricks only manages the catalog entry.

UC Permissions — GRANT Syntax & Inheritance Critical

Unity Catalog uses a three-level namespace: catalog.schema.table. Privileges are granted on securable objects and flow downward through inheritance.

The #1 Exam Distinction: GRANT ON SCHEMA vs ON ALL TABLES

✅ GRANT SELECT ON SCHEMA sales TO analysts

Covers all current AND future tables in the sales schema. When new tables are created later, analysts automatically get SELECT on them. This is the future-proof option.

❌ GRANT SELECT ON ALL TABLES IN SCHEMA sales TO analysts

Covers only tables that exist right now. New tables created after this GRANT are NOT covered. The analyst would need a new GRANT for each new table.

This distinction is tested on almost every exam. The sample Q2 tests it directly. ON SCHEMA = future-proof. ON ALL TABLES IN SCHEMA = point-in-time only.

Privilege Inheritance

  • Granting on a catalog → inherited by all schemas and tables within (current AND future).
  • Granting on a schema → inherited by all tables within (current AND future).
  • Metastore-level grants do NOT inherit down to catalogs.

Example: GRANT SELECT ON CATALOG main TO analysts gives analysts SELECT on every table in every schema in the main catalog, including tables that don't exist yet.

Key Privileges Reference

PrivilegeGrantsApplied To
USE CATALOGAccess to the catalog (required for any object within)Catalog
USE SCHEMAAccess to the schema (required for any object within)Schema
SELECTRead data from tables/viewsTable, View, Schema, Catalog
MODIFYAdd, update, delete dataTable, Schema, Catalog
CREATE TABLECreate tables within a schemaSchema, Catalog
CREATE SCHEMACreate schemas within a catalogCatalog
ALL PRIVILEGESAll applicable privileges (expands at check time)Any securable
BROWSEDiscover objects, view metadata (no data access)Catalog, Metastore
MANAGEEdit, drop, manage privileges on an objectAny securable

USE CATALOG and USE SCHEMA are prerequisites. A user needs USE CATALOG on the catalog AND USE SCHEMA on the schema before they can do anything with a table inside. SELECT alone isn't enough without these parent permissions (unless they have BROWSE).

GRANT Syntax

-- Grant read access on a schema (includes future tables)
GRANT SELECT ON SCHEMA main.sales TO analysts;

-- Grant catalog-wide read access
GRANT SELECT ON CATALOG main TO analysts;

-- Grant specific table access
GRANT SELECT ON TABLE main.sales.orders TO `user@company.com`;

-- Revoke access
REVOKE SELECT ON SCHEMA main.sales FROM analysts;

-- Show grants
SHOW GRANTS ON SCHEMA main.sales;
Key UC Roles Core
RoleScopeKey Responsibilities
Account AdminEntire Databricks accountManage account settings, users, groups, workspace assignments. Can grant metastore-level privileges.
Metastore AdminUnity Catalog metastoreManage metastore, create/drop catalogs, manage top-level permissions, create storage credentials and external locations.
Workspace AdminIndividual workspaceManage workspace settings, compute, users. Default owner of the auto-created workspace catalog.
Object OwnerSpecific object (catalog, schema, table)Full privileges on the owned object. Can grant privileges to others. Can drop the object.

Important Ownership Rules

  • The principal that creates an object becomes its initial owner.
  • Ownership does NOT inherit downward — owning a catalog does NOT mean you own its schemas. But owners CAN grant themselves privileges on child objects.
  • Each object has exactly one owner (can be a user, service principal, or group).
  • Owners can transfer ownership to another principal.
  • The MANAGE privilege is similar to ownership but can be granted to multiple principals.
Audit Logs Core

Unity Catalog generates audit logs for all data access and administrative actions. These logs are essential for compliance, security monitoring, and forensic investigation.

  • Logs are delivered to your cloud storage (S3, ADLS Gen2, or GCS) as JSON files.
  • Captures: who accessed what data, when, from which IP, what operations were performed, which workspace was used.
  • Types of events logged: table reads/writes, permission changes, login events, cluster operations, notebook runs, job executions.
  • Can be analyzed using Databricks itself — read the logs as Delta tables and query with SQL.
  • Configurable at the account level via the Account Console.
  • Logs are immutable and append-only in the delivery location.
Data Lineage in Unity Catalog Core

Unity Catalog automatically tracks table-level and column-level lineage — the upstream-to-downstream relationships between data assets.

  • Table-level lineage: Shows which tables/views are read to produce a given table. Visible as a graph in Catalog Explorer.
  • Column-level lineage: Shows which source columns flow into which target columns. Useful for impact analysis and debugging transformations.
  • Captured automatically from: notebooks, jobs, Lakeflow pipelines, SQL queries, and Spark DataFrame operations.
  • No configuration needed — lineage tracking is built into Unity Catalog and happens automatically.

Use Cases

  • Impact analysis: "If I change column X in table A, which downstream tables and reports are affected?"
  • Root cause analysis: "Where did this bad data in my Gold table originate?"
  • Compliance: "Prove that PII from this source is properly masked in all downstream tables."
  • Data discovery: "What are all the sources feeding this dashboard?"
Delta Sharing Critical

Delta Sharing is an open protocol for securely sharing data from your Delta tables with external recipients — without copying the data. Recipients can be on Databricks or any other platform.

Core Concepts

  • Provider: The organization sharing data. They create shares, add tables and views to shares, and manage recipients.
  • Recipient: The organization receiving data. Gets access via an activation link (open sharing) or Unity Catalog identity (Databricks-to-Databricks).
  • Share: A logical container of tables and views being shared. One share can contain tables from multiple schemas/catalogs.

Two Types of Delta Sharing

TypeRecipient Uses Databricks?AuthenticationCapabilities
Databricks-to-DatabricksYesUnity Catalog identity federationRead access, governed by UC permissions on both sides
Open SharingNo — any platform (Spark, pandas, Power BI, etc.)Bearer token via activation linkRead access via open protocol connectors

Advantages

  • No data copying — recipients access data in-place at the provider's storage.
  • Real-time access to the latest data.
  • Open protocol — recipients don't need Databricks.
  • Cross-cloud sharing supported.
  • Governed by Unity Catalog — audit who accesses what.

Limitations

  • READ-ONLY for recipients. External partners can NEVER get write access through Delta Sharing.
  • Cross-cloud sharing incurs data egress charges (paid by the provider's cloud account).
  • Requires the provider to explicitly create shares and manage recipients.

External partners get READ-ONLY access through Delta Sharing. Period. They never get WRITE access via the share. If an exam answer says "grant READ/WRITE permissions to external partners through Delta Share," it's wrong. Internal teams can get WRITE access to the source tables through normal Unity Catalog grants — but that's UC, not Delta Sharing. The sample Q3 tests this directly.

Cross-Cloud Cost Considerations

  • Same region = minimal/no egress cost.
  • Cross-region or cross-cloud = data egress charges from the provider's cloud account.
  • To minimize cost: co-locate shares in the same region as your recipients.
Lakehouse Federation Core

Lakehouse Federation lets you query external databases directly from Databricks without copying or moving data. It creates a "foreign catalog" in Unity Catalog that mirrors the external database's structure.

Supported External Sources

PostgreSQL, MySQL, SQL Server, Snowflake, Amazon Redshift, Google BigQuery, and other JDBC-compatible databases.

How It Works

  1. Create a Connection object with the external database credentials.
  2. Create a Foreign Catalog that maps to the external database.
  3. Query the foreign catalog using standard SQL — SELECT * FROM foreign_catalog.schema.table.
  4. Queries are pushed down to the external source where possible (predicate pushdown, projection pushdown).

Key Characteristics

  • No data movement or replication — queries hit the external source directly.
  • Governed by Unity Catalog — same permissions, lineage, and audit logging.
  • Read-only — you can query but not write to the external source through Federation.
  • Good for: data exploration, ad-hoc cross-system queries, gradual migration planning.
Delta Sharing vs Lakehouse Federation — Know the Difference Trap

The exam loves testing whether you can distinguish these two. They both involve accessing data across boundaries, but the direction is completely different.

AspectDelta SharingLakehouse Federation
DirectionPUSH OUT — you share YOUR data with othersPULL IN — you query THEIR data from Databricks
Data locationYour Delta tables (on your cloud storage)External databases (Postgres, Snowflake, etc.)
Data movementNone — shared in-placeNone — queried in-place
Who initiates?Provider creates share, recipient consumesYou (the Databricks user) connect to external DB
Recipient needs Databricks?No (open sharing) or Yes (D2D)N/A — you're always on Databricks
ProtocolOpen Delta Sharing protocolJDBC connections via foreign catalogs
Use case"Give partner X read access to our sales table""Query the Postgres production DB without ETL"

The clean rule:
Sharing = you give access to YOUR data. Think outbound.
Federation = you query THEIR data. Think inbound.

If the question says "query an external PostgreSQL database without copying data" → Federation.
If the question says "give an external partner read access to your Delta table" → Sharing.
If the question says "replicate data across workspaces" → Neither. That's a different pattern entirely.