Exam Study Notebook

Databricks Certified Data Engineer Associate

45 questions

90 minutes

Pass: ~70%

Validity: 2 years

Fee: $200

⌕

Databricks Intelligence Platform

Value of the Data Intelligence Platform Core

The Databricks Data Intelligence Platform is a unified, cloud-native analytics platform that merges data lakes and data warehouses into what Databricks calls a lakehouse architecture. The key value proposition is that it eliminates the need to maintain separate systems for data engineering, data science, machine learning, and SQL analytics — everything runs on one platform with shared governance.

What is a Lakehouse?

A lakehouse combines the low-cost, scalable storage of a data lake (files on S3/ADLS/GCS in open formats like Parquet) with the reliability, performance, and governance features of a data warehouse (ACID transactions, schema enforcement, indexing). Delta Lake is the technology that makes this possible — it's an open-source storage layer that sits on top of your cloud object storage and adds transactional capabilities.

Core Components

Delta Lake — The default storage format. Open-source, ACID-compliant, built on Parquet. Enables time travel (query previous versions), schema enforcement, schema evolution, and optimized file management. Every table you create in Databricks is a Delta table by default.
Unity Catalog — Centralized governance layer for ALL data and AI assets across workspaces. Handles permissions, lineage, auditing, and data discovery. This is the single source of truth for who can access what.
Apache Spark — The distributed processing engine under the hood. Databricks provides optimized runtimes (Databricks Runtime) that bundle Spark with performance enhancements and compatible libraries.
Photon — Databricks' proprietary C++-based vectorized query engine. Replaces parts of the standard Spark execution engine for dramatically faster SQL and DataFrame operations. More on this below.
Lakeflow — Databricks' unified solution for data ingestion (Auto Loader, Lakeflow Connect), transformation (Declarative Pipelines / formerly DLT), and orchestration (Jobs / Workflows).

The platform's value is unification. Instead of Kafka + Airflow + Snowflake + Spark + MLflow as separate tools, Databricks provides equivalents of all of these on one governed platform. The exam tests whether you understand when each component is the right tool for a given use case.

docs.databricks.com/aws/en/introduction/

Features That Optimize Query Performance & Data Layout Critical

The exam objective says: "Enable features that simplify data layout decisions and optimize query performance." This is about knowing which Databricks features automatically improve performance without manual tuning.

Liquid Clustering

Liquid Clustering is Databricks' modern replacement for both partitioning and Z-ordering. Instead of choosing partition columns upfront (which is hard to change later and often leads to small-file problems), you specify clustering columns and Databricks incrementally re-organizes data based on actual query patterns.

Enabled with CLUSTER BY (col1, col2) during table creation or via ALTER TABLE.
Incrementally re-clusters — no need to run manual OPTIMIZE commands with Z-ORDER.
Adapts over time as query patterns change.
Eliminates the small-file problem that partitioning creates with high-cardinality columns.
Works with both batch and streaming workloads.

CREATE TABLE sales (
  sale_id BIGINT,
  region STRING,
  sale_date DATE,
  amount DOUBLE
)
CLUSTER BY (region, sale_date);  -- Databricks handles the rest

Predictive I/O

A Databricks-managed optimization that automatically predicts which data files a query will need and pre-fetches them. You don't configure anything — it works transparently on Delta tables. It analyzes query patterns and statistics to skip irrelevant files entirely (data skipping) and pre-load relevant ones.

Photon Engine

Photon is a C++-based vectorized query engine that replaces parts of the Spark JVM execution engine. It processes data in columnar batches rather than row-by-row, which dramatically speeds up SQL and DataFrame operations.

Automatically enabled on Photon-capable clusters and all SQL warehouses.
Best for: aggregations, joins, filters, window functions — standard SQL workloads.
Works seamlessly — no code changes needed.
Typically 2-8x faster than standard Spark for compatible operations.

Photon only accelerates SQL and DataFrame operations. It does NOT speed up arbitrary Python code, UDFs, ML training, or RDD operations. If an exam question asks about optimizing a pipeline that's slow due to a Python UDF, Photon is the wrong answer — you'd need to rewrite the UDF as a native Spark function or use pandas UDFs.

Auto-Compaction & Optimized Writes

Auto-compaction — Automatically compacts small files into larger ones after writes. Prevents the "small file problem" that degrades read performance.
Optimized writes — Repartitions data before writing to reduce the number of output files. Particularly useful for streaming workloads that produce many tiny files.

docs.databricks.com/aws/en/delta/clustering.html docs.databricks.com/aws/en/compute/photon.html

Compute Types — When to Use Each Critical

The exam objective says: "Identify the applicable compute to use for a specific use case." This means you'll be given a scenario and need to pick the right compute type. Here's the full breakdown:

Compute Type	Use Case	Key Characteristics	Cost Model
All-Purpose Cluster	Interactive development, notebook exploration, ad-hoc analysis	Long-running, shared by multiple users, manually started/stopped or auto-terminated	Higher per-hour cost; stays running even when idle
Job Cluster	Production scheduled workloads, automated ETL	Auto-created per job run, automatically terminated on completion, dedicated to one job	Lower cost; only runs during job execution
SQL Warehouse	SQL analytics, BI dashboards, SQL queries, Genie	Optimized for SQL only, Photon-enabled, T-shirt sizing (S/M/L/XL), supports classic or serverless	Per-second billing; auto-stop on idle
Serverless Compute	Any workload where you want zero infrastructure management	No cluster config, instant startup, auto-scaling, fully Databricks-managed	Per-second billing, no idle costs

All-purpose clusters are NOT for production jobs. They're more expensive and stay running between uses. Production scheduled workloads should use Job Clusters. If the exam says "a data engineer needs to run a daily ETL job in production," the answer is Job Cluster (or serverless), never All-Purpose.

Serverless = zero cluster management. Any question mentioning "hands-off compute," "no infrastructure management," "auto-optimized compute managed by Databricks," or "instant startup with no configuration" = serverless. Available for SQL warehouses, notebooks, jobs, and Lakeflow pipelines.

Access Modes (formerly Cluster Modes)

Standard access mode (formerly Shared) — Multiple users can share the cluster. Unity Catalog enforced. Supports Python, SQL, Scala.
Dedicated access mode (formerly Single User) — One user assigned. Full feature support including file notification mode for Auto Loader and ML libraries.
No Isolation Shared — Legacy mode, minimal isolation. Not recommended.

docs.databricks.com/aws/en/compute/ docs.databricks.com/aws/en/serverless-compute/index.html docs.databricks.com/aws/en/compute/sql-warehouse/

Development & Ingestion

Databricks Connect Core

Databricks Connect is a client library that lets you run Spark code from your local IDE (VS Code, PyCharm, IntelliJ) against a remote Databricks cluster. Your code is written locally but executes on the remote cluster's compute resources — you get the power of distributed Spark without leaving your development environment.

Key Facts for the Exam

Code executes on the remote cluster, not locally. Only the driver-side operations run on your machine.
Supports Python (PySpark) and Scala.
Uses the same Spark APIs available in notebooks — DataFrames, Spark SQL, Structured Streaming.
Enables standard software engineering workflows: unit testing, IDE debugging, Git integration, CI/CD.
Requires a running cluster to connect to (all-purpose or assigned cluster).
Databricks Connect v2 is built on top of Spark Connect and is the current recommended version.

Databricks Connect does NOT run Spark locally. It's not a local Spark installation. It connects to a remote cluster and sends execution plans there. If the exam asks where code executes, the answer is always "on the remote Databricks cluster."

When to Use Databricks Connect vs Notebooks

Use Databricks Connect when:

You prefer your local IDE
You need local debugging/breakpoints
You're building CI/CD pipelines
You want unit testing frameworks (pytest)

Use Notebooks when:

Exploring data interactively
Collaborating in real-time
Quick ad-hoc analysis
Visualizations and dashboards

docs.databricks.com/aws/en/dev-tools/databricks-connect/

Notebook Capabilities Core

Databricks notebooks are collaborative, interactive documents that combine code, visualizations, and narrative text. They're the primary development tool for most Databricks users.

Core Capabilities

Multi-language support — A single notebook can mix Python, SQL, Scala, and R cells using magic commands: %python, %sql, %scala, %r, %md (markdown).
Built-in visualizations — Automatically generates charts from query results. Supports bar, line, scatter, pie, map, and more.
Real-time collaboration — Multiple users can edit the same notebook simultaneously, with commenting and presence indicators.
Version history — Every save creates a version. You can diff, revert, and track changes over time. Integrates with Git (Repos).
Widgets — Input parameters using dbutils.widgets. Types: text, dropdown, combobox, multiselect. Useful for parameterized notebooks.
%run — Execute another notebook inline. The called notebook shares the same scope (variables, imports) as the calling notebook. Useful for modular code organization.
dbutils — Utility module with sub-modules: dbutils.fs (file system), dbutils.secrets (secret management), dbutils.widgets (parameters), dbutils.notebook (workflow control).

%run vs dbutils.notebook.run()

Feature	%run	dbutils.notebook.run()
Scope	Shares scope (variables accessible)	Isolated scope (separate execution)
Return value	No explicit return	Returns a string value
Use case	Importing shared code/configs	Orchestrating notebook workflows
Parallelism	Sequential only	Can run in parallel with threading

docs.databricks.com/aws/en/notebooks/

Auto Loader — Valid Sources & Use Cases Critical

Auto Loader is Databricks' recommended tool for incrementally ingesting files from cloud object storage into Delta Lake. It's built on Spark Structured Streaming and uses the cloudFiles format.

Valid Sources — CLOUD OBJECT STORAGE ONLY

AWS S3
Azure Data Lake Storage Gen2 (ADLS Gen2)
Azure Blob Storage (legacy WASB driver, deprecated — use ABFS)
Google Cloud Storage (GCS)

Auto Loader ONLY reads from cloud object storage. It does NOT read from Kafka, JDBC databases, REST APIs, or any other source. This is a common wrong answer on the exam. If you need to ingest from Kafka, you use Spark Structured Streaming with the Kafka connector directly — not Auto Loader.

Supported File Formats

JSON, CSV, XML, Parquet, Avro, ORC, TEXT, BINARYFILE — memorize these. The exam may ask whether a specific format is supported.

Key Benefits Over Basic File Source

Scalability — Can discover billions of files efficiently. File discovery cost scales with number of new files, not total directory size.
Exactly-once processing — Uses a RocksDB-backed checkpoint to track which files have been processed. Guarantees no duplicates, no missed files.
Schema inference and evolution — Automatically infers schema and handles new columns (see next topic).
Fault tolerance — Resumes from checkpoint on failure. No manual state management needed.
Cost optimization — Use Trigger.AvailableNow for batch-style runs instead of continuous streaming to save compute when low latency isn't required.

File Discovery Modes

Mode	How It Works	Best For	Setup
Directory Listing (default)	Periodically lists files in the source directory to find new ones	Small-to-medium volumes, simple setup, no cloud permissions needed	No extra config
File Notification	Uses cloud event notifications (SQS/EventGrid/Pub-Sub) to detect new files	Large volumes with billions of files, lower latency, cost-efficient at scale	Requires cloud permissions; use dedicated access mode

File notification mode is only supported on compute with dedicated access mode (formerly single user). If you're on standard access mode, you must use directory listing.

docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/

Auto Loader — Syntax & Configuration Syntax

Basic Read Syntax

# Reading with Auto Loader
df = (spark.readStream
    .format("cloudFiles")                          # Always "cloudFiles"
    .option("cloudFiles.format", "json")            # Source file format
    .option("cloudFiles.schemaLocation", "/schema") # Where inferred schema is stored
    .load("/path/to/source/files")                  # Cloud storage path
)

Basic Write Syntax

# Writing to Delta table
(df.writeStream
    .option("checkpointLocation", "/checkpoint")  # REQUIRED for exactly-once
    .option("mergeSchema", "true")                # Allow schema evolution on write
    .trigger(availableNow=True)                  # Process all available, then stop
    .table("catalog.schema.target_table")          # Target Delta table
)

Common Options

Option	Purpose	Example Value
`cloudFiles.format`	File format to read	"json", "csv", "parquet", etc.
`cloudFiles.schemaLocation`	Path to store/read inferred schema	"/mnt/schema/orders"
`cloudFiles.schemaEvolutionMode`	How to handle new columns	"addNewColumns", "rescue", etc.
`cloudFiles.inferColumnTypes`	Infer data types (vs all STRING)	"true" or "false"
`cloudFiles.useNotifications`	Enable file notification mode	"true"
`checkpointLocation`	Checkpoint for exactly-once (on write)	"/mnt/checkpoint/orders"

Auto Loader Inside Lakeflow Declarative Pipelines

When using Auto Loader inside a Lakeflow pipeline, you do NOT need to specify schemaLocation or checkpointLocation — the pipeline manages these automatically.

import dlt

@dlt.table
def bronze_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/orders/")
    )

docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options

Auto Loader — Schema Inference & Evolution Critical

Schema handling is one of the most exam-tested Auto Loader topics. There are two distinct concepts: schema inference (what is the schema?) and schema evolution (what happens when the schema changes?).

Schema Inference

When you don't provide a schema, Auto Loader samples files in the source directory and infers one. The inferred schema is saved to cloudFiles.schemaLocation so it doesn't re-infer on every restart. Key options:

cloudFiles.inferColumnTypes — When "true", infers actual types (INT, DOUBLE, TIMESTAMP). When "false", all columns come in as STRING. Default varies by format.
cloudFiles.schemaHints — Manually override inferred types for specific columns. Example: "date DATE, amount DOUBLE"

Schema Evolution Modes (cloudFiles.schemaEvolutionMode)

This controls what happens when Auto Loader encounters data with new columns that weren't in the original schema:

Mode	Behavior	New Column Handling	Stream Restarts?
addNewColumns	Automatically adds new columns to the schema	New columns appear in the target table	Yes — stream restarts to apply new schema
rescue	Routes unmatched fields to a `_rescued_data` column	New fields stored as JSON string in _rescued_data	No — schema stays the same
failOnNewColumns	Throws an error when new columns appear	Pipeline stops; manual schema update required	N/A — pipeline fails
none	Ignores new columns completely	New fields are silently dropped	No

inferSchema and schemaEvolutionMode are different things. inferColumnTypes controls whether types are inferred at all. schemaEvolutionMode controls what happens when NEW columns appear in the data. The exam deliberately confuses these. If a question asks "how to automatically add new columns as the schema changes," the answer is schemaEvolutionMode = "addNewColumns" — NOT inferSchema = true.

The _rescued_data Column

When using rescue mode, any data that doesn't match the current schema gets stored in a special column called _rescued_data as a JSON string. This is useful for debugging and data recovery — you don't lose data, but you also don't pollute your schema with every random new field. You can query this column later to inspect what was rescued and decide whether to add those columns officially.

docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/schema

Built-in Debugging Tools Core

Notebook debugger — Python notebooks support step-through debugging with breakpoints. You can inspect variables, view the call stack, and step over/into/out of functions. Available in the notebook toolbar.
Spark UI — Built-in web UI for diagnosing Spark job performance. Accessible from any cluster. Covered in depth in Section 4.
Driver logs — Access stdout/stderr from the cluster driver. Useful for print-statement debugging and reviewing error tracebacks.
%run for modular testing — Break code into separate notebooks and test each independently.
dbutils.notebook.run() — Run notebooks programmatically and capture return values for integration testing.
Databricks Assistant — AI-powered code completion and error explanation built into the notebook editor.

docs.databricks.com/aws/en/notebooks/

Data Processing & Transformations

Medallion Architecture (Bronze / Silver / Gold) Core

The Medallion Architecture is a data design pattern that organizes a lakehouse into three layers of increasing data quality. Think of it as a progressive refinement pipeline.

Layer	Purpose	Data Quality	Typical Operations	Example
Bronze (Raw)	Land raw data exactly as-is from source	Unvalidated, may contain duplicates, errors, nulls	Append-only ingestion, add metadata (load_timestamp, source_file)	Raw JSON API responses, CSV dumps
Silver (Validated)	Clean, filter, conform, deduplicate	Schema enforced, types cast, nulls handled, deduped	Type casting, deduplication, joins, filtering invalid records	Cleaned customer records with proper types
Gold (Business)	Business-level aggregations and curated datasets	Curated for specific use cases, ready for BI/ML	Aggregations, KPIs, star-schema facts/dimensions	Daily revenue by region, customer lifetime value

The key principle: Each layer increases data quality and decreases data volume. Bronze is the widest (everything), Gold is the narrowest (only what's needed for business questions). You should be able to fully recompute Silver from Bronze, and Gold from Silver.

Bronze is NOT "bad data." It's raw, unprocessed data preserved as-is for auditability and reprocessing. Don't let an exam answer trick you into thinking Bronze requires validation — that's Silver's responsibility.

docs.databricks.com/aws/en/lakehouse/medallion.html

Cluster Types & Configuration for Scenarios Core

This objective asks you to pick the right cluster type and configuration based on a scenario. The decision tree is:

Multiple users running interactive queries? → Standard access mode (shared cluster) with autoscaling.
Single heavy ETL job? → Dedicated access mode with workers sized for the data volume.
Small dataset, single-machine ML library? → Single Node cluster (no workers, just a driver).
Production scheduled job? → Job cluster with autoscaling. Terminates after job completes.
SQL/BI workload? → SQL Warehouse (serverless preferred for instant start).
No infrastructure management desired? → Serverless compute.

Autoscaling

Set a min and max number of workers. Databricks automatically adjusts the cluster size based on workload. For cost-sensitive workloads, set min workers to 0 or 1. For performance-critical workloads, set min workers higher to avoid cold-start delays.

docs.databricks.com/aws/en/compute/cluster-config-best-practices.html

Lakeflow Spark Declarative Pipelines — Advantages Core

The exam objective says: "Emphasize the advantages of Lakeflow Spark Declarative Pipelines for ETL process in Databricks." This is a "why use this" question. Here's the pitch:

Key Advantages

Declarative — You define WHAT your pipeline should produce (target tables, transformations, quality rules). Databricks figures out HOW (infrastructure, dependencies, execution order, scaling).
Automatic dependency management — The framework builds a DAG of your tables automatically. If table B reads from table A, Databricks ensures A is processed first. No manual orchestration.
Built-in data quality (Expectations) — Declarative constraints that validate data inline. No separate testing framework needed.
Automatic infrastructure management — Clusters scale up/down based on workload. No manual cluster configuration within the pipeline.
Unified batch and streaming — Same API handles both. A streaming table processes incrementally; a materialized view recomputes fully. You pick the right abstraction and Databricks handles the rest.
Built-in monitoring — Event log, data quality metrics, pipeline health all available in the UI.
Schema evolution handled — Works with Auto Loader's schema evolution seamlessly.
Fault tolerance — Automatic retries, checkpointing, and exactly-once processing built in.

Pipeline Editions

Edition	Includes	Use When
Core	Streaming ingest, basic transformations	Simple ETL without CDC or quality checks
Pro	Core + Change Data Capture (SCD Type 1/2)	Need to track row-level changes from source
Advanced	Pro + Data quality expectations	Need built-in data quality validation

Expectations require the Advanced edition. If your pipeline uses @dlt.expect and you're on Core or Pro, you'll get an error message telling you to upgrade. The exam may test whether you know which features are in which edition.

docs.databricks.com/aws/en/ldp/ docs.databricks.com/aws/en/ldp/configure-pipeline

Implementing Lakeflow Spark Declarative Pipelines Critical

Core Building Blocks

Streaming Table — Append-only table that processes data incrementally. Each pipeline update processes only new data. Best for event logs, CDC, and any append-heavy workload.
Materialized View — Fully recomputed on each pipeline update. Best for aggregations, dimensions, and slowly-changing reference data.
Flows — Define how data moves from source to target. Default flows are created automatically when you define a table. You can create additional flows for complex patterns (e.g., multiple sources feeding one table).

Reading Data Within a Pipeline

✅ dlt.read("table_name")

Use for reading from tables within the same pipeline. Maintains dependency graph. Table name is just the table name (no catalog/schema prefix).

❌ spark.read.table("table_name")

Bypasses the pipeline dependency graph. Use ONLY for reading from external tables outside the pipeline, with full 3-level name: catalog.schema.table.

Data Quality Expectations — ONLY THREE TYPES EXIST

Expectation	On Violation	Records Kept?	Pipeline Stops?
`@dlt.expect("name", "condition")`	Log violation in event log metrics	YES — all records kept (pass + fail)	NO
`@dlt.expect_or_drop("name", "condition")`	Silently drop failing records	NO — only passing records kept	NO
`@dlt.expect_or_fail("name", "condition")`	Halt entire pipeline immediately	N/A — pipeline stops	YES

@dlt.expect_or_quarantine DOES NOT EXIST. This is the most common fabricated answer on the exam. If you want to quarantine bad records, the pattern is: use @dlt.expect (which keeps ALL records), then in a downstream table, filter where the expectation condition is false into a separate "quarantine" table. This is a two-step pattern, not a single decorator.

Complete Pipeline Example

import dlt
from pyspark.sql.functions import col, current_timestamp

# Bronze: Raw ingestion with Auto Loader
@dlt.table(
    comment="Raw order data from S3"
)
def bronze_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/orders/")
        .select("*", current_timestamp().alias("_load_ts"))
    )

# Silver: Cleaned + validated
@dlt.table
@dlt.expect("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders").select(
        col("order_id").cast("long"),
        col("customer_id").cast("long"),
        col("amount").cast("double"),
        col("order_date").cast("date")
    )

# Gold: Aggregated for business
@dlt.table
def gold_daily_revenue():
    return dlt.read("silver_orders").groupBy("order_date").agg(
        sum("amount").alias("total_revenue"),
        count_distinct("order_id").alias("unique_orders")
    )

Pipeline Modes

Mode	Behavior	Use When
Triggered	Runs once, processes all available data, then stops	Batch/scheduled ETL, cost-sensitive, latency isn't critical
Continuous	Runs constantly, processes data as it arrives	Real-time streaming, low-latency requirements

docs.databricks.com/aws/en/ldp/data-quality docs.databricks.com/aws/en/ldp/flows

DDL/DML Features Syntax

CREATE TABLE Variants

-- 1. CREATE OR REPLACE: Drops and recreates REGARDLESS of existence
CREATE OR REPLACE TABLE employees (
  employeeId STRING,
  startDate  DATE,
  avgRating  FLOAT
);

-- 2. CREATE IF NOT EXISTS: Only creates if table doesn't exist (skips otherwise)
CREATE TABLE IF NOT EXISTS employees (
  employeeId STRING,
  startDate  DATE,
  avgRating  FLOAT
);

-- 3. CTAS: Schema inferred from the SELECT query
CREATE TABLE employees AS SELECT * FROM source_table;

"Regardless of whether a table already exists with this name" = CREATE OR REPLACE. "Only if the table doesn't already exist" = CREATE TABLE IF NOT EXISTS. The exam tests exact wording. REPLACE always recreates; IF NOT EXISTS never overwrites.

DML Operations

-- INSERT: Append new rows
INSERT INTO my_table VALUES ('a1', 6, 9.4);

-- UPDATE: Modify existing rows
UPDATE my_table SET avgRating = 10.0 WHERE employeeId = 'a1';

-- DELETE: Remove rows
DELETE FROM my_table WHERE employeeId = 'a1';

-- MERGE (Upsert): Insert or update based on match
MERGE INTO target USING source
  ON target.id = source.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;

INSERT syntax word order: It's INSERT INTO table VALUES (...). NOT INSERT VALUES (...) INTO table. The exam's sample Q5 tests exactly this. Option C reverses the word order — that's the trap.

docs.databricks.com/aws/en/delta/index.html

PySpark Aggregations & Complex Metrics Syntax

Common Aggregation Functions

from pyspark.sql.functions import (
    sum, count, count_distinct, avg, min, max,
    round, when, col
)

# Group by + multiple aggregations
daily_revenue = billing_df.groupBy("billing_date").agg(
    sum("amount_billed").alias("total_revenue"),
    count_distinct("billing_id").alias("unique_invoices"),
    count("*").alias("total_records"),
    avg("amount_billed").alias("avg_amount"),
    max("amount_billed").alias("max_amount")
)

# Window functions
from pyspark.sql.window import Window
window_spec = Window.partitionBy("department").orderBy("billing_date")
df = df.withColumn("running_total", sum("amount").over(window_spec))

count() vs count_distinct() vs sum() — The exam's sample Q1 tests this directly. sum("billing_id") sums the numeric ID values, which is meaningless (option A is wrong). count_distinct("patient_id") counts unique patients, not invoices (option C is wrong). count_distinct("billing_id") counts unique invoices — that's the correct answer (D). Read the question carefully to determine whether they want count, unique count, or sum.

docs.databricks.com/aws/en/pyspark/basics.html

Productionizing Data Pipelines

DABs vs Traditional Deployment Methods Core

Declarative Automation Bundles (DABs, formerly Databricks Asset Bundles / DAB) are an infrastructure-as-code (IaC) approach to managing Databricks projects. They let you define jobs, pipelines, notebooks, and other resources as YAML source files, version-control them in Git, and deploy them across environments using the Databricks CLI.

Aspect	DABs (Current)	Legacy Deployment
Configuration format	YAML (`databricks.yml`)	JSON (exported/imported via REST API or CLI)
Environment management	`targets` with per-env overrides (dev/staging/prod in one config)	Separate JSON files per environment, manual editing
Version control	YAML files in Git, fully source-controlled, diffable	JSON exports, less structured, harder to diff
CI/CD integration	Native CLI: `databricks bundle validate\|deploy\|run`	Manual REST API calls, custom scripts
Deployment approach	Declarative — define desired state, CLI reconciles	Imperative — manually create/update each resource
Collaboration	Standard code review workflows (PRs, diffs)	Harder to review JSON changes meaningfully

DABs use YAML, not JSON. Legacy deployment used JSON. If the exam asks about the configuration format for Asset Bundles, the answer is YAML (databricks.yml). This is a simple but testable distinction.

docs.databricks.com/aws/en/dev-tools/bundles/

Asset Bundles — Structure & Configuration Syntax

Bundle Project Structure

my-etl-project/
├── databricks.yml          # Root config (required, one per project)
├── resources/
│   ├── etl_job.yml         # Job definitions
│   └── pipeline.yml        # Pipeline definitions
├── src/
│   ├── ingest.py           # Notebook/script source files
│   ├── transform.py
│   └── quality_checks.py
└── tests/
    └── test_transform.py   # Unit tests

Key YAML Structure

# databricks.yml
bundle:
  name: my-etl-bundle           # Required: unique bundle name

include:
  - resources/*.yml              # Include additional config files

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"   # 6 AM daily
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/ingest.py
        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./src/transform.py

targets:
  dev:                              # Development target
    default: true
    workspace:
      host: https://dev.cloud.databricks.com
  prod:                             # Production target
    workspace:
      host: https://prod.cloud.databricks.com
    run_as:
      service_principal_name: prod-sp

Key YAML Mappings

bundle — Required. Contains the bundle name and optional metadata.
resources — Defines Databricks objects: jobs, pipelines, schemas, experiments, models.
targets — Environment-specific overrides. Each target can override workspace, resources, and run_as settings.
workspace — Workspace connection details (host, artifact_path, etc.).
include — References to additional YAML config files.
run_as — Identity (user or service principal) under which jobs execute.

CLI Commands

databricks bundle validate   # Check config for errors
databricks bundle deploy     # Deploy to target workspace
databricks bundle run        # Run a specific resource
databricks bundle destroy    # Remove deployed resources

docs.databricks.com/aws/en/dev-tools/bundles/settings docs.databricks.com/aws/en/dev-tools/bundles/resources

Deploy, Repair, and Rerun Workflows Critical

Databricks Workflows (Jobs) let you orchestrate multi-task pipelines as a DAG (Directed Acyclic Graph). Tasks can depend on other tasks, and the workflow engine handles execution order, retries, and failure recovery.

How Repair Run Works

When a task in a job fails and you fix the underlying issue, you trigger a repair run instead of restarting the entire job. Here's what happens:

Upstream successful tasks DO NOT rerun. Their results are reused. This saves time and compute cost.
The failed task reruns with the fix applied.
All downstream tasks of the failed task rerun — but only if the failed task now succeeds.

Example: Job has tasks A → B → C → D. Task B fails.

Task	Original Run	Repair Run
A	✅ Succeeded	⏭ Skipped (reuses result)
B	❌ Failed	🔄 Reruns (with fix)
C	⏭ Never ran	🔄 Runs if B succeeds
D	⏭ Never ran	🔄 Runs if C succeeds

Repair run does NOT restart the entire job from Task A. This is a very common exam question. Only the failed task and its downstream dependents are executed. Upstream successful tasks are skipped entirely.

docs.databricks.com/aws/en/jobs/repair-job-failures.html

Serverless Compute Core

Serverless compute is Databricks' fully managed compute option. You don't create, configure, or manage clusters — Databricks handles all of it.

Key Characteristics

No cluster configuration — No instance types, no worker counts, no autoscaling ranges. Just run.
Instant startup — No cold-start cluster provisioning. Compute is available in seconds.
Auto-scaling — Scales based on workload demand automatically.
Per-second billing — Pay only for compute you use. No idle costs.
Available for — SQL warehouses, notebooks, jobs/workflows, and Lakeflow pipelines.

The exam objective says "Use serverless for a hands-off, auto-optimized compute managed by Databricks." — that's basically the definition. Any question that describes compute with zero management overhead, instant availability, and Databricks-managed infrastructure = serverless.

docs.databricks.com/aws/en/serverless-compute/

Analyzing the Spark UI to Optimize Queries Critical

The Spark UI is your primary tool for diagnosing performance problems in Spark workloads. The exam tests whether you can look at symptoms and identify the root cause.

Key Tabs

Jobs tab — Shows all Spark jobs triggered by your code. Click into a job to see its stages.
Stages tab — Shows shuffle read/write, task count, task distribution, and input/output sizes per stage. This is where you spot skew and spill.
SQL/DataFrame tab — Shows the physical execution plan, including scan types, join strategies, and filter pushdown. Useful for understanding how Spark translates your query.
Storage tab — Shows cached DataFrames/tables and their memory/disk usage.
Executors tab — Shows resource usage per executor: memory, disk, GC time.

Diagnosing Problems — The Big Four

Symptom	Diagnosis	Root Cause	Fix
ONE task takes 10x+ longer than peers in the same stage	DATA SKEW	One partition has vastly more data than others (e.g., US has 90% of rows when partitioned by country)	Salt the join/group key, repartition on higher-cardinality column, use AQE skew join optimization
MANY tasks show spill to disk in the stage details	SHUFFLE SPILL	Not enough executor memory to hold shuffle data in RAM; spills to disk	Increase executor memory, reduce shuffle partition count (`spark.sql.shuffle.partitions`), optimize join order
Query scans entire table despite WHERE clause	NO PREDICATE PUSHDOWN	Filter applied after scan instead of before; or table has no clustering/Z-order	Use Delta column statistics, liquid clustering, or ensure filter columns are in the table's layout
Thousands of tiny output files	SMALL FILE PROBLEM	Too many partitions on write, streaming micro-batches creating many files	Run OPTIMIZE, enable auto-compaction, use coalesce() before write

Skew vs Spill — the #1 Spark UI trap:
Skew = ONE outlier task. You'll see one task bar much longer than the rest in the stage detail view. Caused by uneven data distribution across partitions.
Spill = MANY tasks affected. You'll see "Spill (Memory)" and "Spill (Disk)" metrics across most tasks. Caused by insufficient memory.

If the question says "one task takes significantly longer than all others" — it's skew, not spill. This distinction is critical.

docs.databricks.com/aws/en/compute/spark-ui.html

Data Governance & Quality

Managed vs External Tables Critical

Property	Managed Table	External Table
Data location	Databricks-managed storage (inside the metastore's managed location)	User-specified cloud storage path (you control where data lives)
DROP TABLE	Deletes metadata AND data files	Deletes metadata ONLY — data files remain in your cloud storage
Storage control	Databricks manages the storage lifecycle	You manage the storage lifecycle and permissions
Portability	Data is tightly coupled to the metastore	Data can be accessed by other tools/systems
Best for	Most use cases — simplest management, recommended default	Data shared with other systems, data you don't want Databricks to delete, regulatory requirements

-- Managed table (default - no LOCATION specified)
CREATE TABLE catalog.schema.managed_tbl (id INT, name STRING);

-- External table (LOCATION specified)
CREATE TABLE catalog.schema.external_tbl (id INT, name STRING)
  LOCATION 's3://my-bucket/external-data/';

Dropping a managed table deletes everything (metadata + data). Dropping an external table only deletes metadata — your data files remain untouched. This is tested on virtually every exam. The way to remember: managed = Databricks manages everything including deletion. External = you manage the storage, Databricks only manages the catalog entry.

docs.databricks.com/aws/en/tables/managed-external.html

UC Permissions — GRANT Syntax & Inheritance Critical

Unity Catalog uses a three-level namespace: catalog.schema.table. Privileges are granted on securable objects and flow downward through inheritance.

The #1 Exam Distinction: GRANT ON SCHEMA vs ON ALL TABLES

✅ GRANT SELECT ON SCHEMA sales TO analysts

Covers all current AND future tables in the sales schema. When new tables are created later, analysts automatically get SELECT on them. This is the future-proof option.

❌ GRANT SELECT ON ALL TABLES IN SCHEMA sales TO analysts

Covers only tables that exist right now. New tables created after this GRANT are NOT covered. The analyst would need a new GRANT for each new table.

This distinction is tested on almost every exam. The sample Q2 tests it directly. ON SCHEMA = future-proof. ON ALL TABLES IN SCHEMA = point-in-time only.

Privilege Inheritance

Granting on a catalog → inherited by all schemas and tables within (current AND future).
Granting on a schema → inherited by all tables within (current AND future).
Metastore-level grants do NOT inherit down to catalogs.

Example: GRANT SELECT ON CATALOG main TO analysts gives analysts SELECT on every table in every schema in the main catalog, including tables that don't exist yet.

Key Privileges Reference

Privilege	Grants	Applied To
`USE CATALOG`	Access to the catalog (required for any object within)	Catalog
`USE SCHEMA`	Access to the schema (required for any object within)	Schema
`SELECT`	Read data from tables/views	Table, View, Schema, Catalog
`MODIFY`	Add, update, delete data	Table, Schema, Catalog
`CREATE TABLE`	Create tables within a schema	Schema, Catalog
`CREATE SCHEMA`	Create schemas within a catalog	Catalog
`ALL PRIVILEGES`	All applicable privileges (expands at check time)	Any securable
`BROWSE`	Discover objects, view metadata (no data access)	Catalog, Metastore
`MANAGE`	Edit, drop, manage privileges on an object	Any securable

USE CATALOG and USE SCHEMA are prerequisites. A user needs USE CATALOG on the catalog AND USE SCHEMA on the schema before they can do anything with a table inside. SELECT alone isn't enough without these parent permissions (unless they have BROWSE).

GRANT Syntax

-- Grant read access on a schema (includes future tables)
GRANT SELECT ON SCHEMA main.sales TO analysts;

-- Grant catalog-wide read access
GRANT SELECT ON CATALOG main TO analysts;

-- Grant specific table access
GRANT SELECT ON TABLE main.sales.orders TO `user@company.com`;

-- Revoke access
REVOKE SELECT ON SCHEMA main.sales FROM analysts;

-- Show grants
SHOW GRANTS ON SCHEMA main.sales;

docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/privileges docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/

Key UC Roles Core

Role	Scope	Key Responsibilities
Account Admin	Entire Databricks account	Manage account settings, users, groups, workspace assignments. Can grant metastore-level privileges.
Metastore Admin	Unity Catalog metastore	Manage metastore, create/drop catalogs, manage top-level permissions, create storage credentials and external locations.
Workspace Admin	Individual workspace	Manage workspace settings, compute, users. Default owner of the auto-created workspace catalog.
Object Owner	Specific object (catalog, schema, table)	Full privileges on the owned object. Can grant privileges to others. Can drop the object.

Important Ownership Rules

The principal that creates an object becomes its initial owner.
Ownership does NOT inherit downward — owning a catalog does NOT mean you own its schemas. But owners CAN grant themselves privileges on child objects.
Each object has exactly one owner (can be a user, service principal, or group).
Owners can transfer ownership to another principal.
The MANAGE privilege is similar to ownership but can be granted to multiple principals.

docs.databricks.com/aws/en/data-governance/unity-catalog/access-control/permissions-concepts

Audit Logs Core

Unity Catalog generates audit logs for all data access and administrative actions. These logs are essential for compliance, security monitoring, and forensic investigation.

Logs are delivered to your cloud storage (S3, ADLS Gen2, or GCS) as JSON files.
Captures: who accessed what data, when, from which IP, what operations were performed, which workspace was used.
Types of events logged: table reads/writes, permission changes, login events, cluster operations, notebook runs, job executions.
Can be analyzed using Databricks itself — read the logs as Delta tables and query with SQL.
Configurable at the account level via the Account Console.
Logs are immutable and append-only in the delivery location.

docs.databricks.com/aws/en/administration-guide/account-settings/audit-logs.html

Data Lineage in Unity Catalog Core

Unity Catalog automatically tracks table-level and column-level lineage — the upstream-to-downstream relationships between data assets.

Table-level lineage: Shows which tables/views are read to produce a given table. Visible as a graph in Catalog Explorer.
Column-level lineage: Shows which source columns flow into which target columns. Useful for impact analysis and debugging transformations.
Captured automatically from: notebooks, jobs, Lakeflow pipelines, SQL queries, and Spark DataFrame operations.
No configuration needed — lineage tracking is built into Unity Catalog and happens automatically.

Use Cases

Impact analysis: "If I change column X in table A, which downstream tables and reports are affected?"
Root cause analysis: "Where did this bad data in my Gold table originate?"
Compliance: "Prove that PII from this source is properly masked in all downstream tables."
Data discovery: "What are all the sources feeding this dashboard?"

docs.databricks.com/aws/en/data-governance/unity-catalog/data-lineage.html

Delta Sharing Critical

Delta Sharing is an open protocol for securely sharing data from your Delta tables with external recipients — without copying the data. Recipients can be on Databricks or any other platform.

Core Concepts

Provider: The organization sharing data. They create shares, add tables and views to shares, and manage recipients.
Recipient: The organization receiving data. Gets access via an activation link (open sharing) or Unity Catalog identity (Databricks-to-Databricks).
Share: A logical container of tables and views being shared. One share can contain tables from multiple schemas/catalogs.

Two Types of Delta Sharing

Type	Recipient Uses Databricks?	Authentication	Capabilities
Databricks-to-Databricks	Yes	Unity Catalog identity federation	Read access, governed by UC permissions on both sides
Open Sharing	No — any platform (Spark, pandas, Power BI, etc.)	Bearer token via activation link	Read access via open protocol connectors

Advantages

No data copying — recipients access data in-place at the provider's storage.
Real-time access to the latest data.
Open protocol — recipients don't need Databricks.
Cross-cloud sharing supported.
Governed by Unity Catalog — audit who accesses what.

Limitations

READ-ONLY for recipients. External partners can NEVER get write access through Delta Sharing.
Cross-cloud sharing incurs data egress charges (paid by the provider's cloud account).
Requires the provider to explicitly create shares and manage recipients.

External partners get READ-ONLY access through Delta Sharing. Period. They never get WRITE access via the share. If an exam answer says "grant READ/WRITE permissions to external partners through Delta Share," it's wrong. Internal teams can get WRITE access to the source tables through normal Unity Catalog grants — but that's UC, not Delta Sharing. The sample Q3 tests this directly.

Cross-Cloud Cost Considerations

Same region = minimal/no egress cost.
Cross-region or cross-cloud = data egress charges from the provider's cloud account.
To minimize cost: co-locate shares in the same region as your recipients.

docs.databricks.com/aws/en/delta-sharing/ docs.databricks.com/aws/en/delta-sharing/share-data-databricks.html

Lakehouse Federation Core

Lakehouse Federation lets you query external databases directly from Databricks without copying or moving data. It creates a "foreign catalog" in Unity Catalog that mirrors the external database's structure.

Supported External Sources

PostgreSQL, MySQL, SQL Server, Snowflake, Amazon Redshift, Google BigQuery, and other JDBC-compatible databases.

How It Works

Create a Connection object with the external database credentials.
Create a Foreign Catalog that maps to the external database.
Query the foreign catalog using standard SQL — SELECT * FROM foreign_catalog.schema.table.
Queries are pushed down to the external source where possible (predicate pushdown, projection pushdown).

Key Characteristics

No data movement or replication — queries hit the external source directly.
Governed by Unity Catalog — same permissions, lineage, and audit logging.
Read-only — you can query but not write to the external source through Federation.
Good for: data exploration, ad-hoc cross-system queries, gradual migration planning.

docs.databricks.com/aws/en/query-federation/

Delta Sharing vs Lakehouse Federation — Know the Difference Trap

The exam loves testing whether you can distinguish these two. They both involve accessing data across boundaries, but the direction is completely different.

Aspect	Delta Sharing	Lakehouse Federation
Direction	PUSH OUT — you share YOUR data with others	PULL IN — you query THEIR data from Databricks
Data location	Your Delta tables (on your cloud storage)	External databases (Postgres, Snowflake, etc.)
Data movement	None — shared in-place	None — queried in-place
Who initiates?	Provider creates share, recipient consumes	You (the Databricks user) connect to external DB
Recipient needs Databricks?	No (open sharing) or Yes (D2D)	N/A — you're always on Databricks
Protocol	Open Delta Sharing protocol	JDBC connections via foreign catalogs
Use case	"Give partner X read access to our sales table"	"Query the Postgres production DB without ETL"

The clean rule:
Sharing = you give access to YOUR data. Think outbound.
Federation = you query THEIR data. Think inbound.

If the question says "query an external PostgreSQL database without copying data" → Federation.
If the question says "give an external partner read access to your Delta table" → Sharing.
If the question says "replicate data across workspaces" → Neither. That's a different pattern entirely.

docs.databricks.com/aws/en/delta-sharing/ docs.databricks.com/aws/en/query-federation/