
Commit 66ad06a

feat: add MinIO S3 CSV seeder for Spark integration tests (#950)
* feat: add MinIO S3 CSV seeder for Spark integration tests
  - Add MinIO service to docker-compose-spark.yml with bucket setup
  - Add hadoop-aws and aws-java-sdk-bundle jars to Spark Dockerfile
  - Configure S3A filesystem in spark-defaults.conf for MinIO endpoint
  - Implement SparkS3CsvSeeder that uploads CSVs to MinIO and creates external Spark tables via CREATE TABLE ... USING CSV
  - Uses PyHive directly (not AdapterQueryRunner) to avoid corrupting dbt global state
  - NULL values written as empty CSV cells; Spark reads them as SQL NULL
  - Bypasses dbt seed entirely, avoiding the _fix_binding NULL bug in dbt-spark's session adapter
  - Add boto3 to requirements.txt for S3 uploads
  - Add MinIO health check to CI workflow
* address CodeRabbit review: connection reuse, error handling, compose deps
  - Reuse single PyHive connection per seed operation (context manager)
  - Add empty data guard with clear ValueError
  - Harden _read_profile_schema with explicit error context
  - Validate MinIO setup exit code in CI workflow
  - Use completion-based depends_on for minio-setup in docker-compose
* add missing docstrings to improve coverage (70% -> 80%)
* address CodeRabbit nitpicks: set -e, env-configurable credentials, shared type inference
* address CodeRabbit: empty-seed guard in DbtDataSeeder, immutable class maps
* fix: clean up local CSV after seed to prevent dbt compilation errors
* fix: move try/finally to cover upload + table creation for CSV cleanup
* fix: use QUOTE_ALL in CSV writer to prevent Spark skipping blank lines for NULL rows
* refactor: rename BaseDirectSeeder to BaseSqlInsertSeeder for clarity

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Itamar Hartstein <haritamar@gmail.com>
1 parent f1b70aa commit 66ad06a

7 files changed: 331 additions & 48 deletions
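Two details from the commit message carry most of the design: NULLs are written as empty CSV cells, and the writer uses QUOTE_ALL so a row of all-NULL values serializes as `"",""` rather than a blank line, which Spark's CSV reader would skip. A minimal sketch of that idea — the function names here are illustrative, not the actual SparkS3CsvSeeder API:

```python
import csv
import io


def write_seed_csv(rows, columns):
    """Render seed rows as CSV text. QUOTE_ALL quotes every field, so a
    row whose values are all NULL becomes '"",""' instead of a blank
    line (which Spark's CSV reader skips). NULLs are empty cells."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(columns)
    for row in rows:
        writer.writerow(["" if row.get(c) is None else row[c] for c in columns])
    return buf.getvalue()


def build_create_table_sql(schema, table, columns_with_types, s3_path):
    """Build the external-table DDL in the CREATE TABLE ... USING CSV
    shape the commit describes; the real seeder issues this over PyHive."""
    cols = ", ".join(f"`{name}` {typ}" for name, typ in columns_with_types)
    return (
        f"CREATE TABLE {schema}.{table} ({cols}) "
        f"USING CSV OPTIONS (path '{s3_path}', header 'true')"
    )
```

In the actual seeder the DDL is executed through a single reused PyHive connection against the Spark Thrift server, bypassing dbt seed entirely.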


.github/workflows/test-warehouse.yml

Lines changed: 7 additions & 0 deletions
@@ -106,6 +106,13 @@ jobs:
       run: |
         docker compose -f docker-compose-spark.yml build
         docker compose -f docker-compose-spark.yml up -d
+        echo "Waiting for MinIO setup to complete..."
+        timeout 60 bash -c '
+          until [ "$(docker inspect -f "{{.State.Status}}" spark-minio-setup 2>/dev/null)" = "exited" ]; do sleep 2; done
+          EXIT_CODE=$(docker inspect -f "{{.State.ExitCode}}" spark-minio-setup 2>/dev/null)
+          if [ "$EXIT_CODE" != "0" ]; then echo "MinIO setup failed with exit code $EXIT_CODE"; exit 1; fi
+        '
+        echo "MinIO is ready."
         echo "Waiting for Spark Thrift Server to become healthy..."
         timeout 180 bash -c 'until [ "$(docker inspect -f {{.State.Health.Status}} spark-thrift 2>/dev/null)" = "healthy" ]; do sleep 5; done'
         echo "Spark Thrift Server is healthy."

integration_tests/docker-compose-spark.yml

Lines changed: 38 additions & 1 deletion
@@ -10,7 +10,10 @@ services:
       - "10000:10000"
       - "4040:4040"
     depends_on:
-      - spark-hive-metastore
+      spark-hive-metastore:
+        condition: service_started
+      minio-setup:
+        condition: service_completed_successfully
     command: >
       --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
       --name Thrift JDBC/ODBC Server
@@ -36,6 +39,40 @@ services:
       - POSTGRES_PASSWORD=dbt
       - POSTGRES_DB=metastore
 
+  minio:
+    image: minio/minio:latest
+    container_name: spark-minio
+    ports:
+      - "9000:9000"
+      - "9001:9001"
+    environment:
+      - MINIO_ROOT_USER=minioadmin
+      - MINIO_ROOT_PASSWORD=minioadmin
+    command: ["server", "/data", "--console-address", ":9001"]
+    healthcheck:
+      test: ["CMD-SHELL", "mc ready local || exit 1"]
+      interval: 5s
+      timeout: 5s
+      retries: 10
+      start_period: 5s
+    volumes:
+      - minio-data:/data
+
+  minio-setup:
+    image: minio/mc
+    container_name: spark-minio-setup
+    depends_on:
+      minio:
+        condition: service_healthy
+    entrypoint: >
+      /bin/sh -c "
+      set -e;
+      mc alias set myminio http://minio:9000 minioadmin minioadmin;
+      mc mb --ignore-existing myminio/spark-seeds;
+      echo 'MinIO bucket spark-seeds created.';
+      "
+
 volumes:
   spark-warehouse:
   hive-metastore:
+  minio-data:
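The minio-setup one-shot container only creates the bucket; per the commit message, the CSVs themselves are uploaded from the test harness with boto3 (added to requirements.txt). A hedged sketch of that upload step — the client is injected so it works against any S3-compatible endpoint, and `upload_seed_csv` is illustrative, not the repo's actual API:

```python
def upload_seed_csv(s3_client, bucket, key, csv_text):
    """Upload rendered CSV text to the seeds bucket and return the
    s3a:// path Spark will read it from. `s3_client` is expected to be
    a boto3 S3 client created with endpoint_url="http://minio:9000"
    and the minioadmin credentials from docker-compose-spark.yml."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=csv_text.encode("utf-8"))
    return f"s3a://{bucket}/{key}"
```

In practice the returned `s3a://spark-seeds/...` path is what goes into the `OPTIONS (path ...)` clause of the external `CREATE TABLE ... USING CSV` statement.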

integration_tests/docker/spark/Dockerfile

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,8 @@ FROM eclipse-temurin:${OPENJDK_VERSION}-jre
 ARG SPARK_VERSION=3.3.2
 ARG HADOOP_VERSION=3
 ARG DELTA_VERSION=2.2.0
+ARG HADOOP_FULL_VERSION=3.3.2
+ARG AWS_SDK_VERSION=1.11.1026
 
 ENV SPARK_HOME /usr/spark
 ENV PATH="/usr/spark/bin:/usr/spark/sbin:${PATH}"
@@ -19,6 +21,10 @@ RUN apt-get update && \
         -P /usr/spark/jars/ && \
     wget -q "https://repo1.maven.org/maven2/io/delta/delta-storage/${DELTA_VERSION}/delta-storage-${DELTA_VERSION}.jar" \
         -P /usr/spark/jars/ && \
+    wget -q "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_FULL_VERSION}/hadoop-aws-${HADOOP_FULL_VERSION}.jar" \
+        -P /usr/spark/jars/ && \
+    wget -q "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar" \
+        -P /usr/spark/jars/ && \
     apt-get remove -y wget && \
     apt-get autoremove -y && \
     apt-get clean
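Both new jars follow Maven Central's standard repository layout (groupId with dots replaced by slashes, then artifactId/version/artifactId-version.jar), which the wget URLs above spell out by hand. As an illustration of that layout — this helper is not part of the repo:

```python
def maven_jar_url(group_id, artifact_id, version):
    """Build a Maven Central download URL from standard coordinates:
    groupId dots become path separators, and the jar filename is
    artifactId-version.jar under artifactId/version/."""
    return (
        "https://repo1.maven.org/maven2/"
        f"{group_id.replace('.', '/')}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )
```

With `HADOOP_FULL_VERSION=3.3.2` this reproduces the hadoop-aws URL the Dockerfile fetches.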

integration_tests/docker/spark/spark-defaults.conf

Lines changed: 8 additions & 0 deletions
@@ -12,3 +12,11 @@ spark.sql.shuffle.partitions 2
 spark.default.parallelism 2
 spark.ui.enabled false
 spark.sql.adaptive.enabled true
+
+# S3A configuration for MinIO
+spark.hadoop.fs.s3a.endpoint http://minio:9000
+spark.hadoop.fs.s3a.access.key minioadmin
+spark.hadoop.fs.s3a.secret.key minioadmin
+spark.hadoop.fs.s3a.path.style.access true
+spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
+spark.hadoop.fs.s3a.connection.ssl.enabled false
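spark-defaults.conf is a plain key/value format: one property per line, key and value separated by whitespace, with `#` starting a comment. A small parser sketch (illustrative only, not something the repo ships) showing how the S3A settings above would be read:

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style text into a dict: blank lines
    and '#' comments are ignored; each remaining line splits on the
    first run of whitespace into a property key and its value."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # key, then the rest as the value
        if len(parts) == 2:
            conf[parts[0]] = parts[1]
    return conf
```

Note that `path.style.access true` and `connection.ssl.enabled false` are what let the S3A client talk to a plain-HTTP MinIO endpoint addressed by hostname rather than by bucket subdomain.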

integration_tests/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ pytest-parametrization
 pytest-html
 filelock
 tenacity
+boto3>=1.26.0
 # urllib3>=2.2.2 fixes CVE-2023-45803 and CVE-2024-37891
 # Upper bound <3.0.0 prevents breaking changes from future major versions
 urllib3>=2.2.2,<3.0.0
