# SET TAGS Profiling Scripts

## Context

These scripts measure the performance of two approaches for managing tags on Databricks tables and columns:

**Approach A (direct ALTER)**: Call `ALTER TABLE SET TAGS` or `ALTER TABLE ALTER COLUMN SET TAGS` directly, overwriting any existing tags without reading them first. SET TAGS is idempotent: setting the same key overwrites the value.

**Approach B (read then write)**: First query `system.information_schema.column_tags` or `system.information_schema.table_tags` to read existing tags, compute a diff, then issue ALTERs only for changes.

The goal is to determine whether the information_schema read step is worth the cost, or whether direct ALTER is faster even though it may redundantly set unchanged tags.
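
Approach B's diff step can be sketched as a small helper (hypothetical code, not taken from the scripts): compare the tags read back from information_schema with the desired tags and keep only the changes.

```python
def diff_tags(existing, desired):
    """Compute the minimal changes that turn `existing` into `desired`.

    Returns (to_set, to_unset): pairs that need ALTER ... SET TAGS,
    and keys that need ALTER ... UNSET TAGS.
    """
    to_set = {k: v for k, v in desired.items() if existing.get(k) != v}
    to_unset = sorted(k for k in existing if k not in desired)
    return to_set, to_unset
```

When both results are empty, Approach B can skip the table entirely; that is the only case where paying for the read can win.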

## Prerequisites

- Python 3.x with `databricks-sql-connector` installed (`pip install -e .` from repo root)
- A Databricks SQL warehouse
- 64 tables (`table1` through `table64`) with 128 STRING columns each. Create them by running `profile_column_tags.py` without `--skip-setup`.
- Credentials in `examples/credentials.env` (gitignored). Copy and edit:

```bash
cp examples/credentials.env.example examples/credentials.env
# Edit credentials.env with your workspace details
```

The file format is:
```
SERVER_HOSTNAME=your-workspace.cloud.databricks.com
HTTP_PATH=/sql/1.0/warehouses/your_warehouse_id
ACCESS_TOKEN=your_token
CATALOG=your_catalog
SCHEMA=your_schema
```

All scripts read from this file via `load_credentials.py`. To switch workspaces, just edit `credentials.env`.
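
A minimal sketch of what such a loader does (illustrative only; the actual `load_credentials.py` may differ): parse `KEY=VALUE` lines, skipping blanks and comments.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        creds[key.strip()] = value.strip()
    return creds
```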

## Scripts

### profile_column_tags.py — Direct ALTER column tags

Sets tags on columns directly via ALTER statements. No information_schema reads.

```bash
# Create tables + validate
python examples/profile_column_tags.py --columns 1 --tags 1 --threads 1 --iterations 1 --validate

# Full experiment: 100 columns, 9 tags each, 8 threads, 3 iterations
python examples/profile_column_tags.py --columns 100 --tags 9 --threads 8 --iterations 3 --skip-setup
```

Arguments:
- `--columns`: Number of columns to tag per table
- `--tags`: Number of tags per ALTER command
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--skip-setup`: Skip table creation
- `--validate`: Force 1 iteration for quick validation

Output: `examples/results/column_tags/`
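
The per-column statement can be sketched as a builder (`column_tag_sql` is a hypothetical helper, but the `ALTER TABLE ... ALTER COLUMN ... SET TAGS` form with multiple `'key' = 'value'` pairs follows the Databricks SQL grammar):

```python
def column_tag_sql(table, column, tags):
    """Build one ALTER ... SET TAGS statement for a single column."""
    pairs = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET TAGS ({pairs})"
```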

### profile_table_tags.py — Direct ALTER table tags

Sets tags on tables directly via ALTER statements. One ALTER per table.

```bash
python examples/profile_table_tags.py --tags 1 --threads 8 --iterations 3
```

Arguments:
- `--tags`: Number of tags per ALTER command
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/table_tags/`
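
The `--threads` fan-out that both direct-ALTER scripts share can be modeled with a thread pool (a simplified sketch; the real scripts also record statement IDs and retries, and `execute` here stands in for a per-connection cursor call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sweep(statements, execute, threads):
    """Run each statement through `execute`, `threads` at a time.

    Returns per-operation wall-clock latencies in seconds.
    """
    def timed(sql):
        start = time.perf_counter()
        execute(sql)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(timed, statements))
```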

### profile_read_then_write_tags.py — information_schema column_tags SELECT

Queries `system.information_schema.column_tags` for each table. No ALTER — measures the read cost only.

```bash
python examples/profile_read_then_write_tags.py --threads 1 --iterations 3
```

Arguments:
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/read_then_write/`
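
The per-table read is a filtered SELECT on the system table; a sketch of the query builder (the helper name is hypothetical; the column names follow the Unity Catalog information_schema):

```python
def column_tags_query(catalog, schema, table):
    """Build the information_schema read for one table's column tags."""
    return (
        "SELECT column_name, tag_name, tag_value "
        "FROM system.information_schema.column_tags "
        f"WHERE catalog_name = '{catalog}' "
        f"AND schema_name = '{schema}' AND table_name = '{table}'"
    )
```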

### profile_read_then_write_table_tags.py — information_schema table_tags SELECT

Queries `system.information_schema.table_tags` for each table. No ALTER — measures the read cost only.

```bash
python examples/profile_read_then_write_table_tags.py --threads 1 --iterations 3
```

Arguments:
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/read_then_write_table_tags/`

### cleanup_column_tags.py — Remove all tags

Removes all column tags and table tags from all 64 tables using 32 threads. Run this to reset state between experiments.

```bash
python examples/cleanup_column_tags.py
```
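
Removal relies on `UNSET TAGS`, which can be sketched as a statement builder (hypothetical helper; the `ALTER TABLE ... UNSET TAGS` form follows the Databricks SQL grammar):

```python
def unset_tags_sql(table, tag_keys, column=None):
    """Build ALTER ... UNSET TAGS for a table, or for one of its columns."""
    keys = ", ".join(f"'{k}'" for k in tag_keys)
    target = f"{table} ALTER COLUMN {column}" if column else table
    return f"ALTER TABLE {target} UNSET TAGS ({keys})"
```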

### plot_comparison.py — Generate charts

Reads all report files and generates comparison charts as PNGs.

```bash
pip install matplotlib
python examples/plot_comparison.py
```

Output:
- `examples/results/comparison_column_tags.png` — column tags: ALTER vs info_schema
- `examples/results/comparison_table_tags.png` — table tags: ALTER vs info_schema

Each PNG has 4 charts: wall-clock time, throughput, P50 latency, and P99 latency, all plotted against thread count.
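
The percentile aggregation behind those charts can be sketched as follows (assumes each `*_data.jsonl` line carries a `latency_s` field; the real field name may differ):

```python
import json

def latency_percentiles(jsonl_text):
    """Compute P50/P99 from per-operation latency records (floor-index rank)."""
    lat = sorted(
        json.loads(line)["latency_s"]
        for line in jsonl_text.splitlines() if line.strip()
    )
    rank = lambda p: lat[min(len(lat) - 1, int(p / 100 * len(lat)))]
    return {"p50": rank(50), "p99": rank(99)}
```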

## Running the definitive experiment

```bash
# Step 1: Create tables (once)
python examples/profile_column_tags.py --columns 1 --tags 1 --threads 1 --iterations 1 --validate

# Step 2: Run info_schema reads across thread counts
# Stop early if latency is already unacceptable
for n in 1 2 4 8 16 32 64; do
  python examples/profile_read_then_write_tags.py --threads $n --iterations 3
done

# Step 3: Run direct ALTERs across thread counts
for n in 1 2 4 8 16 32 64; do
  python examples/profile_column_tags.py --columns 100 --tags 9 --threads $n --iterations 3 --skip-setup
done

# Step 4: Generate charts
python examples/plot_comparison.py
```

## Connector instrumentation

The scripts capture retry behavior via `[PROFILE]` log lines added to the connector:
- `src/databricks/sql/backend/thrift_backend.py` — logs per-attempt timing, success, statement IDs, and retry sleeps in `make_request()`
- `src/databricks/sql/auth/retry.py` — logs urllib3-level retry decisions (`should_retry`) and sleep durations with HTTP status codes, Thrift method names, and SQL text

These are written to `*_retries.log` files alongside each report. Use `grep -F '[PROFILE]'` to filter (the `-F` flag matches the tag literally; without it, the brackets are parsed as a regex character class).

## Output structure

Each script run produces three files:
- `*_report.md` — Markdown report with latency percentiles, throughput, error analysis, and retry analysis
- `*_data.jsonl` — Raw per-operation data (one JSON line per ALTER or SELECT)
- `*_retries.log` — Full connector debug logs with `[PROFILE]` instrumentation

## Key finding

On Azure workspaces, `system.information_schema.column_tags` queries can take 60-110 seconds under concurrency due to server-side queuing (visible as repeated `GetOperationStatus` polling in logs). Direct ALTER SET TAGS consistently completes in ~500ms regardless of concurrency. The information_schema read alone is slower than performing all the writes it was meant to optimize.