Commit 06a56ee
feat: add Docker-based Spark integration test setup (#946)
* feat: add Docker-based Spark integration test setup
- Add docker-compose-spark.yml with Spark Thrift Server + Hive Metastore
- Add Docker config files (Dockerfile, entrypoint, hive-site.xml, spark-defaults.conf)
- Add Spark profile to profiles.yml.j2 (thrift method on port 10000)
- Add 'Start Spark' step in test-warehouse.yml CI workflow
- Add Spark to Docker targets matrix in test-all-warehouses.yml
- Fix python-dev -> python3-dev for Ubuntu 22.04+ compatibility
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
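The 'Start Spark' CI step has to wait for the Thrift Server to accept connections on port 10000 before tests run. A minimal sketch of such a bounded-timeout readiness check (function name and polling interval are illustrative, not the actual CI script):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 180.0) -> bool:
    """Poll until a TCP port accepts connections, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False
```

The bounded deadline is the important part: an unbounded wait loop hangs CI forever when the container fails to start.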
* fix: handle Spark database=Undefined in test source configs and address review feedback
- Fix database=Undefined error for Spark target by returning None from
get_database_and_schema_properties() and omitting database field from
YAML source configs when None
- Upgrade Hive Metastore from postgres:9-alpine to postgres:15-alpine
- Use HTTPS for Spark download in Dockerfile
- Add bounded timeout to entrypoint.sh wait loop (180s default)
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
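The database=Undefined fix works by dropping the database key entirely when the adapter has no catalog concept. A hedged sketch of the idea (function and field names are illustrative, not the actual test-framework API):

```python
from typing import Optional

def build_source_config(name: str, schema: str, database: Optional[str]) -> dict:
    """Build a dbt source config dict, omitting the 'database' field when
    the target (e.g. Spark) has no database/catalog concept."""
    config = {"name": name, "schema": schema}
    if database is not None:
        config["database"] = database
    return config
```

Rendering a `None` database into YAML is what produced the `database=Undefined` error; omitting the key lets dbt fall back to adapter defaults.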
* fix: add Spark-specific macros for database/schema resolution and string escaping
- Add spark__get_package_database_and_schema to handle Spark's lack of
catalog/database, ensuring is_elementary_enabled() returns True
- Add spark__escape_special_chars to replace newlines with spaces rather
  than escaping them as \n, which Spark's SQL parser treats as literal
  line breaks in INSERT VALUES statements, causing row corruption
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
* fix: add Delta Lake support, Spark dispatches for schema/artifact operations, and escape fixes
- Add Delta Lake JARs to Spark Docker image for v2 table support (DELETE/MERGE)
- Configure file_format=delta for Spark targets in dbt_project.yml and profiles
- Add spark__get_default_incremental_strategy returning 'merge' for Delta tables
- Fix spark__get_delete_and_insert_queries to always use MERGE (Delta doesn't support DELETE with subqueries)
- Add Spark dispatches for test helpers: edr_create_schema, edr_drop_schema, edr_schema_exists, edr_list_schemas
- Add spark__get_anomaly_config dispatch to handle Spark's database==schema requirement
- Fix is_elementary_enabled to check schema as fallback for adapters without database concept
- Fix spark__escape_special_chars to use C-style backslash escaping (Spark doesn't support SQL-standard '')
- Revert spark__get_package_database_and_schema (default dispatch returns [None, schema] correctly)
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
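spark__escape_special_chars ultimately uses C-style backslash escaping because Spark SQL string literals don't support the SQL-standard doubled single quote. A rough Python equivalent of the escaping rule (an illustration of the technique, not the actual Jinja macro):

```python
def escape_for_spark_literal(value: str) -> str:
    """Escape a string for a Spark SQL single-quoted literal using
    C-style backslashes: backslash first, then quote and newline."""
    return (
        value.replace("\\", "\\\\")
             .replace("'", "\\'")
             .replace("\n", "\\n")
    )
```

Escaping the backslash first matters: doing it last would double the backslashes introduced by the other replacements.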
* fix: address CodeRabbit review - escape backticks, fix MERGE aliases, add docs, slim Dockerfile
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
* feat: unskip exposure_schema_validity tests for Spark, tune Spark config for speed
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
* perf: add SparkDirectSeeder (bypass dbt seed) and tune Spark config for -n8 parallelism
- Add SparkDirectSeeder that executes CREATE TABLE + INSERT VALUES directly
via the dbt adapter, bypassing the ~4s dbt subprocess overhead per seed
- Add execute_sql() and schema_name property to AdapterQueryRunner
- DbtProject auto-selects SparkDirectSeeder when target is 'spark'
- Tune spark-defaults.conf: executor.cores=4, default.parallelism=4,
thriftServer.async=true for better concurrent session handling
- Restore -n8 parallelism for Spark in CI (was -n4)
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
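SparkDirectSeeder's speedup comes from emitting plain SQL through an existing connection instead of paying the dbt subprocess overhead per seed. A minimal sketch of batched INSERT VALUES generation (names and the batch size are illustrative):

```python
from typing import Any

def render_value(v: Any) -> str:
    """Render a Python value as a Spark SQL literal."""
    if v is None:
        return "NULL"
    if isinstance(v, bool):  # must precede the int check (bool subclasses int)
        return "TRUE" if v else "FALSE"
    if isinstance(v, (int, float)):
        return str(v)
    return "'" + str(v).replace("'", "\\'") + "'"

def insert_statements(table: str, rows: list, batch_size: int = 500) -> list:
    """Render INSERT INTO ... VALUES statements in batches of rows."""
    if not rows:
        return []
    cols = list(rows[0])
    stmts = []
    for i in range(0, len(rows), batch_size):
        values = ", ".join(
            "(" + ", ".join(render_value(r[c]) for c in cols) + ")"
            for r in rows[i:i + batch_size]
        )
        stmts.append(f"INSERT INTO {table} ({', '.join(cols)}) VALUES {values}")
    return stmts
```

Batching keeps statement size bounded while still amortizing the per-statement round trip.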
* perf: revert Spark parallelism to -n4 (keep direct seeder optimization)
The -n8 experiment may be causing resource contention on the 2-vCPU CI
runner. Reverting Spark to -n4 while keeping the SparkDirectSeeder and
Spark config tuning (executor.cores=4, async=true). The direct seeder
alone should provide meaningful speedup (~3.6x faster per seed).
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
* fix: add type inference to SparkDirectSeeder (BIGINT, DOUBLE, BOOLEAN, STRING)
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
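Type inference for the seeder maps each column to the widest Spark type its values fit. A hedged sketch of the BIGINT/DOUBLE/BOOLEAN/STRING inference (the exact rules are illustrative):

```python
def infer_column_type(values: list) -> str:
    """Infer a Spark SQL type for a column from its sample values.
    Falls back to STRING when values are mixed or non-numeric."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "STRING"
    if all(isinstance(v, bool) for v in non_null):
        return "BOOLEAN"
    # bool subclasses int in Python, so exclude it explicitly
    if all(isinstance(v, int) and not isinstance(v, bool) for v in non_null):
        return "BIGINT"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "DOUBLE"
    return "STRING"
```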
* refactor: extract BaseDirectSeeder base class for Spark and ClickHouse seeders
- Shared logic: type inference, CSV writing for ref(), batched inserts, cleanup
- SparkDirectSeeder and ClickHouseDirectSeeder are thin subclasses
- SparkDirectSeeder now writes CSV for ref() resolution (was missing)
- _create_seeder() handles both spark and clickhouse targets
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
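The refactor moves the shared seeding logic into a base class, leaving only the SQL dialect in the per-warehouse subclasses. A structural sketch (class and method names follow the commit; the method bodies are illustrative, not the actual test-framework code):

```python
from abc import ABC, abstractmethod

class BaseDirectSeeder(ABC):
    """Shared seeding logic: type inference, CSV writing, batching, cleanup."""

    @abstractmethod
    def create_table_sql(self, table: str, columns: dict) -> str:
        """Warehouse-specific CREATE TABLE dialect."""

    def seed(self, table: str, columns: dict) -> list:
        # Orchestration lives here; subclasses only supply the dialect.
        return [self.create_table_sql(table, columns)]

class SparkDirectSeeder(BaseDirectSeeder):
    def create_table_sql(self, table, columns):
        cols = ", ".join(f"{name} {typ}" for name, typ in columns.items())
        return f"CREATE TABLE {table} ({cols}) USING DELTA"

class ClickHouseDirectSeeder(BaseDirectSeeder):
    def create_table_sql(self, table, columns):
        cols = ", ".join(f"{name} {typ}" for name, typ in columns.items())
        return f"CREATE TABLE {table} ({cols}) ENGINE = MergeTree ORDER BY tuple()"
```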
* fix: revert Spark config tuning (perf regression), add Docker healthcheck, fix MERGE comment
- Revert executor.cores back to 1, default.parallelism back to 2, remove
thriftServer.async — these were introduced in 6a8ec2a and correlated with
a performance regression (tests taking longer than the original ~36 min)
- Add Docker healthcheck to spark-thrift container (nc -z localhost 10000)
- Use docker inspect healthcheck in CI instead of raw nc port polling
- Add explicit container_name to spark-thrift for reliable docker inspect
- Fix MERGE comment in delete_and_insert.sql to accurately describe why
we use MERGE unconditionally on Spark
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
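Switching CI from raw port polling to docker inspect means reading the container's health status from the inspect output. A small sketch of parsing that output (the JSON shape follows Docker's standard inspect format):

```python
import json

def container_health(inspect_json: str) -> str:
    """Extract .State.Health.Status from `docker inspect <name>` output."""
    data = json.loads(inspect_json)
    return data[0]["State"]["Health"]["Status"]
```

This is why the explicit `container_name` matters: `docker inspect` needs a stable name rather than a compose-generated one.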
* fix: restore non-Delta fallback in spark__get_delete_and_insert_queries
Restore the elif branch for non-Delta Spark tables (used by dbt-databricks).
The three branches are now:
1. relation.metadata and is_delta → MERGE (dbt-databricks, Delta)
2. not relation.metadata → MERGE (dbt-spark thrift, assumes Delta via config)
3. else → DELETE with subquery (dbt-databricks, non-Delta)
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
* refactor: unify MERGE branches in spark__get_delete_and_insert_queries
Combine the Delta and no-metadata MERGE conditions into a single branch:
(relation.metadata and relation.is_delta) or not relation.metadata
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
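The unified branch reduces to a single predicate deciding when Spark must use MERGE instead of DELETE + INSERT. A truth-table sketch of that condition (argument names are illustrative):

```python
def use_merge(has_metadata: bool, is_delta: bool) -> bool:
    """Spark uses MERGE for Delta tables, and also when no relation
    metadata is available (dbt-spark over Thrift, where Delta is assumed
    via config); only known non-Delta tables fall back to DELETE."""
    return (has_metadata and is_delta) or not has_metadata
```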
* perf: revert SparkDirectSeeder to DbtDataSeeder to isolate performance regression
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
* perf: fix SparkDirectSeeder to use PyHive directly, avoiding global dbt state corruption
Root cause: AdapterQueryRunner._create_adapter() calls set_from_args() and
reset_adapters() which corrupt global dbt state (GLOBAL_FLAGS, adapter
registry). Since tests use the in-process APIDbtRunner (dbtRunner().invoke()),
this corrupted state causes subsequent dbt test calls to run with wrong flags,
leading to 3-10x regressions on multi-call tests (e.g. volume_anomaly).
Fix: SparkDirectSeeder now uses PyHive/Thrift directly instead of going
through AdapterQueryRunner, completely avoiding the global state issue.
Connection details are read from profiles.yml.
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
* revert: remove SparkDirectSeeder, use DbtDataSeeder for Spark
The SparkDirectSeeder (both AdapterQueryRunner and PyHive variants) caused
a ~60% regression on Spark CI (58 min vs 36 min baseline). The regression
was concentrated in volume_anomaly tests (3-10x slower) which call
dbt_project.test() multiple times per pytest function.
Two approaches were tested:
1. SparkDirectSeeder via AdapterQueryRunner: 57:15 (hypothesis: global
dbt state corruption via set_from_args/reset_adapters)
2. SparkDirectSeeder via PyHive directly: 58:03 (bypasses dbt entirely)
Both showed the same regression, disproving the global state corruption
hypothesis. The root cause of the interaction between direct SQL seeding
and subsequent dbt test calls remains undetermined.
Reverting Spark to DbtDataSeeder restores the 36:47 baseline.
ClickHouseDirectSeeder (via BaseDirectSeeder) is kept as it works correctly.
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Itamar Hartstein <haritamar@gmail.com>
1 parent 8731abd · commit 06a56ee
23 files changed
Lines changed: 407 additions & 107 deletions
File tree
- .github/workflows
- integration_tests
- dbt_project
- macros
- ci_schemas_cleanup
- schema_utils
- docker/spark
- profiles
- tests
- test_dbt_artifacts
- macros
- edr/system/configuration
- utils
- cross_db_utils
- table_operations