These scripts measure the performance of two approaches for managing tags on Databricks tables and columns:
- Approach A (direct ALTER): call `ALTER TABLE ... SET TAGS` or `ALTER TABLE ... ALTER COLUMN ... SET TAGS` directly, overwriting any existing tags without reading them first. `SET TAGS` is idempotent: setting the same key again simply overwrites the value.
- Approach B (read then write): first query `system.information_schema.column_tags` or `system.information_schema.table_tags` to read existing tags, compute a diff, then issue ALTERs only for the changes.
The goal is to determine whether the information_schema read step is worth the cost, or whether direct ALTER is faster even though it may redundantly set unchanged tags.
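For concreteness, here is a minimal sketch of the two approaches against a connector cursor. It is illustrative only: table and tag names are placeholders, the `information_schema` filter columns follow the Databricks docs, and real code should parameterize or escape identifiers rather than interpolate them:

```python
def approach_a(cursor, table, tags):
    # Blind write: one ALTER upserts the listed keys; nothing is read first.
    clause = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())
    cursor.execute(f"ALTER TABLE {table} SET TAGS ({clause})")

def approach_b(cursor, catalog, schema, table, desired):
    # Read existing tags, then ALTER only the keys whose values differ.
    cursor.execute(
        "SELECT tag_name, tag_value FROM system.information_schema.table_tags "
        f"WHERE catalog_name = '{catalog}' AND schema_name = '{schema}' "
        f"AND table_name = '{table}'"
    )
    existing = {name: value for name, value in cursor.fetchall()}
    changed = {k: v for k, v in desired.items() if existing.get(k) != v}
    if changed:
        approach_a(cursor, f"{catalog}.{schema}.{table}", changed)
```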
- Python 3.x with `databricks-sql-connector` installed (`pip install -e .` from the repo root)
- A Databricks SQL warehouse
- 64 tables (`table1` through `table64`) with 128 STRING columns each. Create them by running `profile_column_tags.py` without `--skip-setup`.
- Credentials in `examples/credentials.env` (gitignored). Copy and edit:
```bash
cp examples/credentials.env.example examples/credentials.env
# Edit credentials.env with your workspace details
```

The file format is:

```
SERVER_HOSTNAME=your-workspace.cloud.databricks.com
HTTP_PATH=/sql/1.0/warehouses/your_warehouse_id
ACCESS_TOKEN=your_token
CATALOG=your_catalog
SCHEMA=your_schema
```
All scripts read from this file via `load_credentials.py`. To switch workspaces, just edit `credentials.env`.
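Consuming the file boils down to a few lines. The `load_env` helper below is an illustrative stand-in rather than the actual API of `load_credentials.py`, and the `catalog`/`schema` kwargs assume a reasonably recent connector version:

```python
from databricks import sql

def load_env(path="examples/credentials.env"):
    # Parse simple KEY=VALUE lines, ignoring blanks and comments.
    env = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip()
    return env

env = load_env()
with sql.connect(
    server_hostname=env["SERVER_HOSTNAME"],
    http_path=env["HTTP_PATH"],
    access_token=env["ACCESS_TOKEN"],
    catalog=env["CATALOG"],
    schema=env["SCHEMA"],
) as connection, connection.cursor() as cursor:
    cursor.execute("SELECT current_catalog(), current_schema()")
    print(cursor.fetchone())
```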
`profile_column_tags.py` sets tags on columns directly via ALTER statements, with no information_schema reads.
```bash
# Create tables + validate
python examples/profile_column_tags.py --columns 1 --tags 1 --threads 1 --iterations 1 --validate

# Full experiment: 100 columns, 9 tags each, 8 threads, 3 iterations
python examples/profile_column_tags.py --columns 100 --tags 9 --threads 8 --iterations 3 --skip-setup
```

Arguments:

- `--columns`: Number of columns to tag per table
- `--tags`: Number of tags per ALTER command
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--skip-setup`: Skip table creation
- `--validate`: Force 1 iteration for quick validation

Output: `examples/results/column_tags/`
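Each tagged column translates into one ALTER statement; the following sketch shows how `--columns` and `--tags` might map to SQL (the tag keys, values, and column names here are invented placeholders, not what the script actually generates):

```python
def column_alter(table, column, n_tags):
    # One statement per column; --tags controls how many key/value
    # pairs ride in a single SET TAGS clause.
    clause = ", ".join(f"'tag_{i}' = 'value_{i}'" for i in range(n_tags))
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET TAGS ({clause})"

# --columns 100 --tags 9: 100 statements per table, 9 tags per statement
statements = [column_alter("table1", f"col{i}", 9) for i in range(100)]
```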
`profile_table_tags.py` sets tags on tables directly via ALTER statements, one ALTER per table.
```bash
python examples/profile_table_tags.py --tags 1 --threads 8 --iterations 3
```

Arguments:

- `--tags`: Number of tags per ALTER command
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/table_tags/`
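`--threads` maps naturally onto a worker pool with one connection per task. A sketch of how the 64-table sweep might be parallelized (illustrative, not the script's actual structure, with credentials elided to the placeholder values from `credentials.env.example`):

```python
from concurrent.futures import ThreadPoolExecutor

from databricks import sql

def tag_table(name):
    # One connection per task for simplicity; a real harness would likely
    # reuse one connection per worker thread instead.
    with sql.connect(
        server_hostname="your-workspace.cloud.databricks.com",
        http_path="/sql/1.0/warehouses/your_warehouse_id",
        access_token="your_token",
    ) as conn, conn.cursor() as cursor:
        cursor.execute(f"ALTER TABLE {name} SET TAGS ('env' = 'bench')")

tables = [f"table{i}" for i in range(1, 65)]
with ThreadPoolExecutor(max_workers=8) as pool:  # --threads 8
    list(pool.map(tag_table, tables))
```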
`profile_read_then_write_tags.py` queries `system.information_schema.column_tags` for each table. No ALTER is issued; it measures the read cost only.
```bash
python examples/profile_read_then_write_tags.py --threads 1 --iterations 3
```

Arguments:

- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/read_then_write/`
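The read being measured is essentially one filtered scan of the system catalog per table, roughly like the sketch below (assumes a `cursor` and `catalog`/`schema`/`table` variables in scope; column names follow the `information_schema` docs, and real code should parameterize the filter values):

```python
cursor.execute(
    "SELECT column_name, tag_name, tag_value "
    "FROM system.information_schema.column_tags "
    f"WHERE catalog_name = '{catalog}' AND schema_name = '{schema}' "
    f"AND table_name = '{table}'"
)
existing = cursor.fetchall()
```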
`profile_read_then_write_table_tags.py` queries `system.information_schema.table_tags` for each table. No ALTER is issued; it measures the read cost only.
```bash
python examples/profile_read_then_write_table_tags.py --threads 1 --iterations 3
```

Arguments:

- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/read_then_write_table_tags/`
`cleanup_column_tags.py` removes all column tags and table tags from all 64 tables using 32 threads. Run it to reset state between experiments:
```bash
python examples/cleanup_column_tags.py
```
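Removal has its own statement form, `UNSET TAGS`. Below is one plausible per-table shape, assuming the benchmark's tag keys and column names are known up front (not necessarily the script's actual logic):

```python
def cleanup_table(cursor, table, tag_keys, columns):
    # UNSET is the removal counterpart of SET TAGS: one statement for the
    # table's own tags, plus one per column.
    clause = ", ".join(f"'{k}'" for k in tag_keys)
    cursor.execute(f"ALTER TABLE {table} UNSET TAGS ({clause})")
    for col in columns:
        cursor.execute(f"ALTER TABLE {table} ALTER COLUMN {col} UNSET TAGS ({clause})")
```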
`plot_comparison.py` reads all report files and generates comparison charts as PNGs.

```bash
pip install matplotlib
python examples/plot_comparison.py
```

Output:

- `examples/results/comparison_column_tags.png`: column tags, ALTER vs info_schema
- `examples/results/comparison_table_tags.png`: table tags, ALTER vs info_schema
Each PNG has four charts: wall-clock time, throughput, P50 latency, and P99 latency, all plotted against thread count.
```bash
# Step 1: Create tables (once)
python examples/profile_column_tags.py --columns 1 --tags 1 --threads 1 --iterations 1 --validate

# Step 2: Run info_schema reads across thread counts
# Stop early if latency is already unacceptable
for n in 1 2 4 8 16 32 64; do
  python examples/profile_read_then_write_tags.py --threads $n --iterations 3
done

# Step 3: Run direct ALTERs across thread counts
for n in 1 2 4 8 16 32 64; do
  python examples/profile_column_tags.py --columns 100 --tags 9 --threads $n --iterations 3 --skip-setup
done

# Step 4: Generate charts
python examples/plot_comparison.py
```

The scripts capture retry behavior via `[PROFILE]` log lines added to the connector:
- `src/databricks/sql/backend/thrift_backend.py`: logs per-attempt timing, success, statement IDs, and retry sleeps in `make_request()`
- `src/databricks/sql/auth/retry.py`: logs urllib3-level retry decisions (`should_retry`) and sleep durations, with HTTP status codes, Thrift method names, and SQL text
These are written to `*_retries.log` files alongside each report. Use `grep -F '[PROFILE]'` to filter (the `-F` stops grep from treating the brackets as a character class).
Each script run produces three files:
- `*_report.md`: Markdown report with latency percentiles, throughput, error analysis, and retry analysis
- `*_data.jsonl`: Raw per-operation data (one JSON line per ALTER or SELECT)
- `*_retries.log`: Full connector debug logs with `[PROFILE]` instrumentation
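The JSONL files load directly into Python for ad-hoc analysis beyond the generated report. In this sketch the `duration_s` field name is hypothetical; inspect a real record for the actual schema:

```python
import glob
import json

paths = glob.glob("examples/results/column_tags/*_data.jsonl")
durations = sorted(
    json.loads(line)["duration_s"]  # field name is a guess
    for path in paths
    for line in open(path)
)
if durations:
    p50 = durations[len(durations) // 2]
    p99 = durations[min(len(durations) - 1, int(len(durations) * 0.99))]
    print(f"n={len(durations)}  p50={p50:.3f}s  p99={p99:.3f}s")
```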
On Azure workspaces, `system.information_schema.column_tags` queries can take 60-110 seconds under concurrency due to server-side queuing (visible as repeated `GetOperationStatus` polling in the logs). Direct `ALTER ... SET TAGS` consistently completes in ~500 ms regardless of concurrency. The information_schema read alone is slower than performing all the writes it was meant to optimize.