
Commit 763bdf3

Add SET TAGS profiling scripts and connector instrumentation
Profiling scripts to compare direct ALTER SET TAGS vs reading from information_schema before writing. Includes scripts for column tags, table tags, information_schema reads, cleanup, and chart generation.

Credentials are loaded from examples/credentials.env (gitignored). Copy credentials.env.example and fill in workspace details.

Connector instrumentation adds [PROFILE] log lines in the Thrift backend retry loop and urllib3 retry policy to capture per-attempt timing, statement IDs, retry decisions, and SQL text.

Co-authored-by: Isaac
1 parent ca4d7bc commit 763bdf3

13 files changed

Lines changed: 2962 additions & 4 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1,4 +1,7 @@
 
+# Profiling credentials
+examples/credentials.env
+
 # Created by https://www.toptal.com/developers/gitignore/api/python,macos
 # Edit at https://www.toptal.com/developers/gitignore?templates=python,macos

examples/PROFILING_README.md

Lines changed: 167 additions & 0 deletions
# SET TAGS Profiling Scripts

## Context

These scripts measure the performance of two approaches for managing tags on Databricks tables and columns:

**Approach A (direct ALTER)**: Call `ALTER TABLE SET TAGS` or `ALTER TABLE ALTER COLUMN SET TAGS` directly, overwriting any existing tags without reading them first. SET TAGS is idempotent — setting the same key overwrites the value.

**Approach B (read then write)**: First query `system.information_schema.column_tags` or `system.information_schema.table_tags` to read existing tags, compute a diff, then issue ALTERs only for changes.

The goal is to determine whether the information_schema read step is worth the cost, or whether direct ALTER is faster even though it may redundantly set unchanged tags.
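In code, the two approaches reduce to something like the sketch below. This is an illustration, not the profiling scripts' actual implementation; the function name, its signature, and the assumption that `column_tags` exposes `tag_name`/`tag_value` columns are mine (the cleanup script further down only confirms `column_name` and `tag_name`).

```python
def set_column_tags(cursor, catalog, schema, table, column, tags, read_first=False):
    """Apply tags (dict of tag name -> value) to one column.

    read_first=False is Approach A (blind ALTER); read_first=True is Approach B
    (diff against information_schema, then ALTER only what changed).
    """
    if read_first:
        cursor.execute(
            "SELECT tag_name, tag_value FROM system.information_schema.column_tags "
            f"WHERE catalog_name = '{catalog}' AND schema_name = '{schema}' "
            f"AND table_name = '{table}' AND column_name = '{column}'"
        )
        existing = {row[0]: row[1] for row in cursor.fetchall()}
        tags = {k: v for k, v in tags.items() if existing.get(k) != v}
    if tags:
        # SET TAGS is idempotent: re-setting an existing key simply overwrites its value.
        clause = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())
        cursor.execute(
            f"ALTER TABLE `{catalog}`.`{schema}`.{table} "
            f"ALTER COLUMN {column} SET TAGS ({clause})"
        )
```

Approach A is the `read_first=False` path; Approach B pays for the extra SELECT before issuing a possibly smaller ALTER.
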
## Prerequisites
- Python 3.x with `databricks-sql-connector` installed (`pip install -e .` from repo root)
- A Databricks SQL warehouse
- 64 tables (`table1` through `table64`) with 128 STRING columns each. Create them by running `profile_column_tags.py` without `--skip-setup`.
- Credentials in `examples/credentials.env` (gitignored). Copy and edit:

```bash
cp examples/credentials.env.example examples/credentials.env
# Edit credentials.env with your workspace details
```

The file format is:

```
SERVER_HOSTNAME=your-workspace.cloud.databricks.com
HTTP_PATH=/sql/1.0/warehouses/your_warehouse_id
ACCESS_TOKEN=your_token
CATALOG=your_catalog
SCHEMA=your_schema
```

All scripts read from this file via `load_credentials.py`. To switch workspaces, just edit `credentials.env`.
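For reference, a script built on top of this loads the file and connects like so (a minimal sketch mirroring `cleanup_column_tags.py` below; the `DESCRIBE TABLE` query is just a placeholder and assumes the profiling tables already exist):

```python
from databricks import sql
from load_credentials import load_credentials

creds = load_credentials()
with sql.connect(
    server_hostname=creds["SERVER_HOSTNAME"],
    http_path=creds["HTTP_PATH"],
    access_token=creds["ACCESS_TOKEN"],
) as conn:
    with conn.cursor() as cursor:
        # Fully qualified names are built from CATALOG and SCHEMA, e.g.:
        table_fqn = f'`{creds["CATALOG"]}`.`{creds["SCHEMA"]}`.table1'
        cursor.execute(f"DESCRIBE TABLE {table_fqn}")
        print(cursor.fetchall())
```
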
## Scripts

### profile_column_tags.py — Direct ALTER column tags

Sets tags on columns directly via ALTER statements. No information_schema reads.

```bash
# Create tables + validate
python examples/profile_column_tags.py --columns 1 --tags 1 --threads 1 --iterations 1 --validate

# Full experiment: 100 columns, 9 tags each, 8 threads, 3 iterations
python examples/profile_column_tags.py --columns 100 --tags 9 --threads 8 --iterations 3 --skip-setup
```

Arguments:
- `--columns`: Number of columns to tag per table
- `--tags`: Number of tags per ALTER command
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--skip-setup`: Skip table creation
- `--validate`: Force 1 iteration for quick validation

Output: `examples/results/column_tags/`
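Each operation is a single ALTER statement; the sketch below shows the presumed shape for `--columns 100 --tags 9`. The tag names and values are placeholders, not what the script actually writes:

```python
# Sketch only: presumably one ALTER per column, with --tags key/value pairs per statement.
columns = [f"col{i}" for i in range(1, 101)]            # --columns 100
tags = {f"tag{i}": f"value{i}" for i in range(1, 10)}   # --tags 9
tag_clause = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())

for col in columns:
    print(
        f"ALTER TABLE `my_catalog`.`my_schema`.table1 "
        f"ALTER COLUMN {col} SET TAGS ({tag_clause})"
    )
```
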
### profile_table_tags.py — Direct ALTER table tags

Sets tags on tables directly via ALTER statements. One ALTER per table.

```bash
python examples/profile_table_tags.py --tags 1 --threads 8 --iterations 3
```

Arguments:
- `--tags`: Number of tags per ALTER command
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/table_tags/`

### profile_read_then_write_tags.py — information_schema column_tags SELECT

Queries `system.information_schema.column_tags` for each table. No ALTER — measures the read cost only.

```bash
python examples/profile_read_then_write_tags.py --threads 1 --iterations 3
```

Arguments:
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/read_then_write/`
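The read being timed presumably has the same shape as the query in `cleanup_column_tags.py` further down; a sketch with illustrative names:

```python
# Illustrative only: one SELECT per table, fanned out over --threads connections.
for table in (f"table{i}" for i in range(1, 65)):
    query = (
        "SELECT column_name, tag_name FROM system.information_schema.column_tags "
        "WHERE catalog_name = 'my_catalog' AND schema_name = 'my_schema' "
        f"AND table_name = '{table}'"
    )
    print(query)
```
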
### profile_read_then_write_table_tags.py — information_schema table_tags SELECT

Queries `system.information_schema.table_tags` for each table. No ALTER — measures the read cost only.

```bash
python examples/profile_read_then_write_table_tags.py --threads 1 --iterations 3
```

Arguments:
- `--threads`: Concurrent connections
- `--iterations`: Times to repeat the full 64-table sweep
- `--validate`: Force 1 iteration

Output: `examples/results/read_then_write_table_tags/`

### cleanup_column_tags.py — Remove all tags

Removes all column tags and table tags from all 64 tables using 32 threads. Run this to reset state between experiments.

```bash
python examples/cleanup_column_tags.py
```

### plot_comparison.py — Generate charts

Reads all report files and generates comparison charts as PNGs.

```bash
pip install matplotlib
python examples/plot_comparison.py
```

Output:
- `examples/results/comparison_column_tags.png` — column tags: ALTER vs info_schema
- `examples/results/comparison_table_tags.png` — table tags: ALTER vs info_schema

Each PNG has 4 charts: wall-clock time, throughput, P50 latency, P99 latency, all plotted against thread count.
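The layout is roughly a 2x2 grid, as in the sketch below. This is not the script's code, and the plotted series is a meaningless placeholder; the real values come from the report files:

```python
import matplotlib.pyplot as plt

threads = [1, 2, 4, 8, 16, 32, 64]
placeholder = [float(t) for t in threads]  # stand-in series, not real results
panels = ["Wall-clock time", "Throughput", "P50 latency", "P99 latency"]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, title in zip(axes.flat, panels):
    ax.plot(threads, placeholder, marker="o")
    ax.set_title(title)
    ax.set_xlabel("Threads")
fig.tight_layout()
fig.savefig("example_layout.png")
```
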
## Running the definitive experiment

```bash
# Step 1: Create tables (once)
python examples/profile_column_tags.py --columns 1 --tags 1 --threads 1 --iterations 1 --validate

# Step 2: Run info_schema reads across thread counts
# Stop early if latency is already unacceptable
for n in 1 2 4 8 16 32 64; do
  python examples/profile_read_then_write_tags.py --threads $n --iterations 3
done

# Step 3: Run direct ALTERs across thread counts
for n in 1 2 4 8 16 32 64; do
  python examples/profile_column_tags.py --columns 100 --tags 9 --threads $n --iterations 3 --skip-setup
done

# Step 4: Generate charts
python examples/plot_comparison.py
```

## Connector instrumentation

The scripts capture retry behavior via `[PROFILE]` log lines added to the connector:
- `src/databricks/sql/backend/thrift_backend.py` — logs per-attempt timing, success, statement IDs, and retry sleeps in `make_request()`
- `src/databricks/sql/auth/retry.py` — logs urllib3-level retry decisions (`should_retry`) and sleep durations with HTTP status codes, Thrift method names, and SQL text

These are written to `*_retries.log` files alongside each report. Use `grep -F "[PROFILE]"` to filter (the `-F` flag keeps the brackets from being treated as a regex character class).
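To capture the same lines from your own script, one option is to attach a file handler to the connector's logger hierarchy. This is a sketch and assumes the instrumented modules log under the usual `databricks.sql.*` logger names:

```python
import logging

# Route the connector's DEBUG output (including the [PROFILE] lines) to a file.
handler = logging.FileHandler("my_run_retries.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))

connector_logger = logging.getLogger("databricks.sql")
connector_logger.setLevel(logging.DEBUG)
connector_logger.addHandler(handler)
```
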
## Output structure

Each script run produces three files:
- `*_report.md` — Markdown report with latency percentiles, throughput, error analysis, retry analysis
- `*_data.jsonl` — Raw per-operation data (one JSON line per ALTER or SELECT); see the sketch after this list
- `*_retries.log` — Full connector debug logs with `[PROFILE]` instrumentation
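As an example of working with the JSONL output, the sketch below pulls P50/P99 from one run. The file name and the `duration_s` field are assumptions, not a documented schema:

```python
import json
from statistics import quantiles

durations = []
# Hypothetical path and field name; match them to the actual *_data.jsonl output.
with open("examples/results/column_tags/column_tags_data.jsonl") as f:
    for line in f:
        durations.append(json.loads(line)["duration_s"])

pcts = quantiles(durations, n=100)  # 99 cut points
print(f"n={len(durations)}  p50={pcts[49]:.3f}s  p99={pcts[98]:.3f}s")
```
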
## Key finding

On Azure workspaces, `system.information_schema.column_tags` queries can take 60-110 seconds under concurrency due to server-side queuing (visible as repeated `GetOperationStatus` polling in logs). Direct ALTER SET TAGS consistently completes in ~500ms regardless of concurrency. The information_schema read alone is slower than performing all the writes it was meant to optimize.

examples/cleanup_column_tags.py

Lines changed: 81 additions & 0 deletions
#!/usr/bin/env python3
"""Remove all column tags and table tags from all 64 tables using 32 threads."""

import sys
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed

sys.stdout.reconfigure(line_buffering=True)

import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

from databricks import sql
from load_credentials import load_credentials

_creds = load_credentials()
SERVER_HOSTNAME = _creds["SERVER_HOSTNAME"]
HTTP_PATH = _creds["HTTP_PATH"]
ACCESS_TOKEN = _creds["ACCESS_TOKEN"]
CATALOG = _creds["CATALOG"]
SCHEMA = _creds["SCHEMA"]

NUM_TABLES = 64
NUM_THREADS = 32


def cleanup_table(table_name):
    table_fqn = f"`{CATALOG}`.`{SCHEMA}`.{table_name}"
    total_removed = 0

    with sql.connect(
        server_hostname=SERVER_HOSTNAME,
        http_path=HTTP_PATH,
        access_token=ACCESS_TOKEN,
        _tls_no_verify=True,
    ) as conn:
        with conn.cursor() as cursor:
            # --- Clean up column tags ---
            cursor.execute(
                f"SELECT column_name, tag_name FROM system.information_schema.column_tags "
                f"WHERE catalog_name = '{CATALOG}' AND schema_name = '{SCHEMA}' AND table_name = '{table_name}'"
            )
            col_rows = cursor.fetchall()

            if col_rows:
                col_tags = defaultdict(list)
                for row in col_rows:
                    col_tags[row[0]].append(row[1])

                for col, tags in col_tags.items():
                    tag_list = ", ".join(f"'{tag}'" for tag in tags)
                    cursor.execute(f"ALTER TABLE {table_fqn} ALTER COLUMN {col} UNSET TAGS ({tag_list})")

                total_removed += len(col_rows)

            # --- Clean up table tags ---
            cursor.execute(
                f"SELECT tag_name FROM system.information_schema.table_tags "
                f"WHERE catalog_name = '{CATALOG}' AND schema_name = '{SCHEMA}' AND table_name = '{table_name}'"
            )
            tbl_rows = cursor.fetchall()

            if tbl_rows:
                tag_list = ", ".join(f"'{row[0]}'" for row in tbl_rows)
                cursor.execute(f"ALTER TABLE {table_fqn} UNSET TAGS ({tag_list})")
                total_removed += len(tbl_rows)

    print(f"{table_name}: removed {len(col_rows)} column tags, {len(tbl_rows)} table tags")
    return total_removed


total_removed = 0
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    futures = {
        executor.submit(cleanup_table, f"table{t}"): t
        for t in range(1, NUM_TABLES + 1)
    }
    for f in as_completed(futures):
        total_removed += f.result()

print(f"\nDone. Removed {total_removed} total tags (column + table).")

examples/credentials.env.example

Lines changed: 9 additions & 0 deletions
# SET TAGS Profiling — Workspace Credentials
# Copy this file to credentials.env and fill in your values.
# credentials.env is gitignored and will not be committed.

SERVER_HOSTNAME=your-workspace.cloud.databricks.com
HTTP_PATH=/sql/1.0/warehouses/your_warehouse_id
ACCESS_TOKEN=your_access_token
CATALOG=your_catalog
SCHEMA=your_schema

examples/load_credentials.py

Lines changed: 31 additions & 0 deletions
"""Load credentials from examples/credentials.env"""

import os


def load_credentials(env_path=None):
    """Read credentials.env and return a dict of key=value pairs."""
    if env_path is None:
        env_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "credentials.env")

    if not os.path.exists(env_path):
        raise FileNotFoundError(
            f"Credentials file not found: {env_path}\n"
            f"Copy examples/credentials.env.example to examples/credentials.env and fill in your values."
        )

    creds = {}
    with open(env_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            creds[key.strip()] = value.strip()

    required = ["SERVER_HOSTNAME", "HTTP_PATH", "ACCESS_TOKEN", "CATALOG", "SCHEMA"]
    missing = [k for k in required if k not in creds]
    if missing:
        raise ValueError(f"Missing required credentials: {', '.join(missing)}")

    return creds
