
Commit e2a85e6

feat: Add Notebook-scoped packages for command submits or notebook job run (#1321)
Resolves #1320

### Description

Add a feature to install packages in a Notebook-scoped environment for the `PythonCommandSubmitter` and `PythonNotebookUploader` classes.

#### Execution Test

I made the changes and tested them in our environment, verifying that the compiled code now has the prepended package installation.

##### All-purpose cluster

<img width="1240" height="740" alt="image" src="https://github.com/user-attachments/assets/4c75aae2-6740-49b5-9e6e-d7676a6aac02" />

##### Serverless cluster

<img width="1228" height="716" alt="image" src="https://github.com/user-attachments/assets/1d91cae9-6cdf-4836-a86f-18d6156c5f99" />

##### Job cluster

<img width="1241" height="353" alt="image" src="https://github.com/user-attachments/assets/546480f2-ee4c-47de-87af-9cb6adf9f851" />

### Checklist

- [x] I have run this code in development and it appears to resolve the stated issue
- [x] This PR includes tests, or tests are not required/relevant for this PR
- [x] I have updated the `CHANGELOG.md` and added information about my change to the "dbt-databricks next" section.

---

Signed-off-by: Federico Manuel Gomez Peter <federico.gomez@payclip.com>
Co-authored-by: tejassp-db <241722411+tejassp-db@users.noreply.github.com>
Co-authored-by: Shubham Dhal <shubham.dhal@databricks.com>
1 parent 54e8040 commit e2a85e6

File tree

9 files changed: +678 −17 lines

CHANGELOG.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -1,5 +1,9 @@
 ## dbt-databricks 1.11.7 (TBD)
 
+### Features
+
+- Enable Notebook scoped python packages installation
+
 ### Fixes
 
 - Fix column order mismatch in microbatch and replace_where incremental strategies by using INSERT BY NAME syntax ([#1338](https://github.com/databricks/dbt-databricks/issues/1338))
```

dbt/adapters/databricks/python_models/python_config.py

Lines changed: 17 additions & 0 deletions

```diff
@@ -24,6 +24,14 @@ class Config:
         extra = "allow"
 
 
+class PythonPackagesConfig(BaseModel):
+    """Pydantic model for python packages configuration."""
+
+    packages: list[str]
+    notebook_scoped: bool
+    index_url: Optional[str] = None
+
+
 class PythonModelConfig(BaseModel):
     """
     Pydantic model for a Python model configuration.
@@ -42,6 +50,7 @@ class PythonModelConfig(BaseModel):
     cluster_id: Optional[str] = None
     http_path: Optional[str] = None
     create_notebook: bool = False
+    notebook_scoped_libraries: bool = False
     environment_key: Optional[str] = None
     environment_dependencies: list[str] = Field(default_factory=list)
 
@@ -69,6 +78,14 @@ def validate_notebook_permissions(cls, v: list[dict[str, str]]) -> list[dict[str
             )
         return v
 
+    @property
+    def python_packages_config(self) -> PythonPackagesConfig:
+        return PythonPackagesConfig(
+            packages=self.packages,
+            index_url=self.index_url,
+            notebook_scoped=self.notebook_scoped_libraries,
+        )
+
 
 class ParsedPythonModel(BaseModel):
     """Pydantic model for a Python model parsed from a dbt manifest"""
```

dbt/adapters/databricks/python_models/python_submissions.py

Lines changed: 81 additions & 11 deletions

```diff
@@ -9,20 +9,55 @@
 from dbt.adapters.databricks.api_client import CommandExecution, DatabricksApiClient, WorkflowJobApi
 from dbt.adapters.databricks.credentials import DatabricksCredentials
 from dbt.adapters.databricks.logging import logger
-from dbt.adapters.databricks.python_models.python_config import ParsedPythonModel
+from dbt.adapters.databricks.python_models.python_config import (
+    ParsedPythonModel,
+    PythonPackagesConfig,
+)
 from dbt.adapters.databricks.python_models.run_tracking import PythonRunTracker
 
 
 DEFAULT_TIMEOUT = 60 * 60 * 24
+NOTEBOOK_SEPARATOR = "\n\n# COMMAND ----------\n\n"
 
 
 class PythonSubmitter(ABC):
     """Interface for submitting Python models to run on Databricks."""
 
+    def __init__(self, packages_config: PythonPackagesConfig) -> None:
+        self.packages_config = packages_config
+
     @abstractmethod
     def submit(self, compiled_code: str) -> None:
         """Submit the compiled code to Databricks."""
         pass
 
+    def _prepare_code_with_notebook_scoped_packages(
+        self, compiled_code: str, separator: str = NOTEBOOK_SEPARATOR
+    ) -> str:
+        """
+        Prepend notebook-scoped package installation commands to the compiled code.
+
+        If the notebook-scoped flag is not set, or if there are no packages to install,
+        returns the original compiled code.
+        """
+        if not self.packages_config.packages or not self.packages_config.notebook_scoped:
+            return compiled_code
+
+        index_url = (
+            f"--index-url {self.packages_config.index_url}"
+            if self.packages_config.index_url
+            else ""
+        )
+        # Build the %pip install command for notebook-scoped packages
+        packages = " ".join(self.packages_config.packages)
+        pip_install_cmd = f"%pip install {index_url} -q {packages}".replace("  ", " ")
+        logger.debug(f"Adding notebook-scoped package installation: {pip_install_cmd}")
+
+        # Add extra restart python command for Databricks runtimes 13.0 and above
+        restart_cmd = "dbutils.library.restartPython()"
+
+        # Prepend the pip install command to the compiled code
+        return f"{pip_install_cmd}{separator}{restart_cmd}{separator}{compiled_code}"
+
 
 class BaseDatabricksHelper(PythonJobHelper):
     """Base helper for python models on Databricks."""
@@ -63,16 +98,24 @@ class PythonCommandSubmitter(PythonSubmitter):
     """Submitter for Python models using the Command API."""
 
     def __init__(
-        self, api_client: DatabricksApiClient, tracker: PythonRunTracker, cluster_id: str
+        self,
+        api_client: DatabricksApiClient,
+        tracker: PythonRunTracker,
+        cluster_id: str,
+        parsed_model: ParsedPythonModel,
     ) -> None:
         self.api_client = api_client
         self.tracker = tracker
         self.cluster_id = cluster_id
+        super().__init__(parsed_model.config.python_packages_config)
 
     @override
     def submit(self, compiled_code: str) -> None:
         logger.debug("Submitting Python model using the Command API.")
 
+        # Prepare code with notebook-scoped package installation if needed
+        compiled_code = self._prepare_code_with_notebook_scoped_packages(compiled_code)
+
         context_id = self.api_client.command_contexts.create(self.cluster_id)
         command_exec: Optional[CommandExecution] = None
         try:
@@ -252,16 +295,24 @@ def get_library_config(
     packages: list[str],
     index_url: Optional[str],
     additional_libraries: list[dict[str, Any]],
+    notebook_scoped_libraries: bool = False,
 ) -> dict[str, Any]:
-    """Update the job configuration with the required libraries."""
+    """
+    Update the job configuration with the required libraries.
+
+    If notebook_scoped_libraries is True, packages are not included in the library config
+    as they will be installed via %pip install in the notebook itself.
+    """
 
     libraries = []
 
-    for package in packages:
-        if index_url:
-            libraries.append({"pypi": {"package": package, "repo": index_url}})
-        else:
-            libraries.append({"pypi": {"package": package}})
+    # Only add packages to cluster-level libraries if not using notebook-scoped
+    if not notebook_scoped_libraries:
+        for package in packages:
+            if index_url:
+                libraries.append({"pypi": {"package": package, "repo": index_url}})
+            else:
+                libraries.append({"pypi": {"package": package}})
 
     for library in additional_libraries:
         libraries.append(library)
@@ -286,7 +337,10 @@ def __init__(
         packages = parsed_model.config.packages
         index_url = parsed_model.config.index_url
         additional_libraries = parsed_model.config.additional_libs
-        library_config = get_library_config(packages, index_url, additional_libraries)
+        notebook_scoped_libraries = parsed_model.config.notebook_scoped_libraries
+        library_config = get_library_config(
+            packages, index_url, additional_libraries, notebook_scoped_libraries
+        )
         self.cluster_spec = {**cluster_spec, **library_config}
         self.job_grants = parsed_model.config.python_job_config.grants
         self.additional_job_settings = parsed_model.config.python_job_config.dict()
@@ -335,11 +389,13 @@ def __init__(
         tracker: PythonRunTracker,
         uploader: PythonNotebookUploader,
         config_compiler: PythonJobConfigCompiler,
+        parsed_model: ParsedPythonModel,
     ) -> None:
         self.api_client = api_client
         self.tracker = tracker
         self.uploader = uploader
         self.config_compiler = config_compiler
+        super().__init__(parsed_model.config.python_packages_config)
 
     @staticmethod
     def create(
@@ -356,12 +412,17 @@ def create(
             parsed_model,
             cluster_spec,
         )
-        return PythonNotebookSubmitter(api_client, tracker, notebook_uploader, config_compiler)
+        return PythonNotebookSubmitter(
+            api_client, tracker, notebook_uploader, config_compiler, parsed_model
+        )
 
     @override
     def submit(self, compiled_code: str) -> None:
         logger.debug("Submitting Python model using the Job Run API.")
 
+        # Prepare code with notebook-scoped package installation if needed
+        compiled_code = self._prepare_code_with_notebook_scoped_packages(compiled_code)
+
         file_path = self.uploader.upload(compiled_code)
         job_config = self.config_compiler.compile(file_path)
 
@@ -444,7 +505,12 @@ def build_submitter(self) -> PythonSubmitter:
                 {"existing_cluster_id": self.cluster_id},
             )
         else:
-            return PythonCommandSubmitter(self.api_client, self.tracker, self.cluster_id or "")
+            return PythonCommandSubmitter(
+                self.api_client,
+                self.tracker,
+                self.cluster_id or "",
+                self.parsed_model,
+            )
 
     @override
     def validate_config(self) -> None:
@@ -572,6 +638,7 @@ def __init__(
         workflow_creater: PythonWorkflowCreator,
         job_grants: dict[str, list[dict[str, str]]],
         acls: list[dict[str, str]],
+        parsed_model: ParsedPythonModel,
     ) -> None:
         self.api_client = api_client
         self.tracker = tracker
@@ -581,6 +648,7 @@ def __init__(
         self.workflow_creater = workflow_creater
         self.job_grants = job_grants
         self.acls = acls
+        super().__init__(parsed_model.config.python_packages_config)
 
     @staticmethod
     def create(
@@ -599,6 +667,7 @@ def create(
             workflow_creater,
             parsed_model.config.python_job_config.grants,
             parsed_model.config.access_control_list,
+            parsed_model,
        )
 
     @override
@@ -611,6 +680,7 @@ def submit(self, compiled_code: str) -> None:
         logger.debug(
             f"[Workflow Debug] Compiled code preview: {compiled_code[:preview_len]}..."
         )
+        compiled_code = self._prepare_code_with_notebook_scoped_packages(compiled_code)
 
         file_path = self.uploader.upload(compiled_code)
         logger.debug(f"[Workflow Debug] Uploaded notebook to: {file_path}")
```
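The prepend step at the heart of this change can be sketched as a standalone function. This mirrors `_prepare_code_with_notebook_scoped_packages` from the diff above, but with the config fields passed as plain arguments and the adapter's logger omitted, so treat it as an illustrative approximation rather than the shipped method:

```python
from typing import Optional

# Databricks notebook cell separator, as defined in the diff above
NOTEBOOK_SEPARATOR = "\n\n# COMMAND ----------\n\n"


def prepend_pip_install(
    compiled_code: str,
    packages: list,
    notebook_scoped: bool,
    index_url: Optional[str] = None,
) -> str:
    # If notebook-scoped is off, or there is nothing to install, pass through
    if not packages or not notebook_scoped:
        return compiled_code

    index = f"--index-url {index_url}" if index_url else ""
    # Collapse the double space left when index is empty
    cmd = f"%pip install {index} -q {' '.join(packages)}".replace("  ", " ")
    # Restart Python so freshly installed packages are importable
    restart = "dbutils.library.restartPython()"
    return f"{cmd}{NOTEBOOK_SEPARATOR}{restart}{NOTEBOOK_SEPARATOR}{compiled_code}"


result = prepend_pip_install("df = spark.range(1)", ["pandas", "numpy==1.24.0"], True)
print(result.splitlines()[0])  # %pip install -q pandas numpy==1.24.0
```

Note the `.replace("  ", " ")` trick: when no `index_url` is set, the empty interpolation would otherwise leave a double space inside the `%pip install` command.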

docs/workflow-job-submission.md

Lines changed: 64 additions & 0 deletions

````diff
@@ -70,6 +70,11 @@ models:
       runtime_engine: "{{ var('job_cluster_defaults.runtime_engine') }}"
       data_security_mode: "{{ var('job_cluster_defaults.data_security_mode') }}"
       autoscale: { "min_workers": 1, "max_workers": 4 }
+
+    # Python package configuration
+    packages: ["pandas", "numpy==1.24.0"]
+    index_url: "https://pypi.org/simple"  # Optional custom PyPI index
+    notebook_scoped_libraries: false  # Set to true for notebook-scoped installation
 ```
 
 ### Configuration
@@ -173,6 +178,65 @@ grants:
   manage: []
 ```
 
+#### Python Packages
+
+You can install Python packages for your models using the `packages` configuration. There are two ways to install packages:
+
+##### Cluster-level installation (default)
+
+By default, packages are installed at the cluster level using Databricks libraries. This is the traditional approach where packages are installed when the cluster starts.
+
+```yaml
+models:
+  - name: my_model
+    config:
+      packages: ["pandas", "numpy==1.24.0", "scikit-learn>=1.0"]
+      index_url: "https://pypi.org/simple"  # Optional: custom PyPI index
+      notebook_scoped_libraries: false  # Default behavior
+```
+
+**Benefits:**
+- Packages are available for the entire cluster lifecycle
+- Faster model execution (no installation overhead per run)
+
+**Limitations:**
+- Requires cluster restart to update packages
+- All tasks on the cluster share the same package versions
+
+##### Notebook-scoped installation
+
+When `notebook_scoped_libraries: true`, packages are installed at the notebook level using `%pip install` magic commands. This prepends installation commands to your compiled code.
+
+```yaml
+models:
+  - name: my_model
+    config:
+      packages: ["pandas", "numpy==1.24.0", "scikit-learn>=1.0"]
+      index_url: "https://pypi.org/simple"  # Optional: custom PyPI index
+      notebook_scoped_libraries: true  # Enable notebook-scoped installation
+```
+
+**Benefits:**
+- Packages are installed per model execution
+- No cluster restart required to change packages
+- Different models can use different package versions
+- Works with serverless compute and all-purpose clusters
+
+**How it works:**
+The adapter prepends the following commands to your model code:
+```python
+%pip install -q pandas numpy==1.24.0 scikit-learn>=1.0
+dbutils.library.restartPython()
+# Your model code follows...
+```
+
+**Supported submission methods:**
+- `all_purpose_cluster` (Command API)
+- `job_cluster` (Notebook Job Run)
+- `workflow_job` (Workflow Job)
+
+**Note:** For Databricks Runtime 13.0 and above, `dbutils.library.restartPython()` is automatically added after package installation to ensure packages are properly loaded.
+
 #### Post hooks
 
 It is possible to add in python hooks by using the `config.python_job_config.post_hook_tasks`
````
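The cluster-level vs. notebook-scoped split documented above can be sketched in isolation. This is a simplified approximation of the PR's `get_library_config` (returning just the library list, with a hypothetical name `build_libraries`): when the notebook-scoped flag is set, the package list is left out of the cluster spec because the packages arrive via the prepended `%pip install` instead.

```python
from typing import Any, Optional


def build_libraries(
    packages: list,
    index_url: Optional[str],
    notebook_scoped_libraries: bool = False,
) -> list:
    """Build the cluster-level library entries for a job spec."""
    libraries: list = []
    # Only add packages to cluster-level libraries if not using notebook-scoped;
    # notebook-scoped packages are installed by %pip inside the notebook itself.
    if not notebook_scoped_libraries:
        for package in packages:
            if index_url:
                libraries.append({"pypi": {"package": package, "repo": index_url}})
            else:
                libraries.append({"pypi": {"package": package}})
    return libraries


print(build_libraries(["pandas"], None))
# [{'pypi': {'package': 'pandas'}}]
print(build_libraries(["pandas"], None, notebook_scoped_libraries=True))
# []
```

The empty list in the notebook-scoped case is the key behavior: the same `packages` config drives either a cluster library entry or a `%pip install` line, never both.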

tests/functional/adapter/python_model/fixtures.py

Lines changed: 38 additions & 0 deletions

```diff
@@ -337,3 +337,41 @@ def model(dbt, spark):
   - name: test_table
     identifier: source
 """
+
+# Notebook-scoped packages via Command API (all_purpose_cluster, create_notebook=False)
+notebook_scoped_packages_cmd_api_model = """
+def model(dbt, spark):
+    dbt.config(
+        materialized='table',
+        submission_method='all_purpose_cluster',
+        create_notebook=False,
+        notebook_scoped_libraries=True,
+        packages=['chispa'],
+    )
+    # it will break if not installed
+    from chispa import assert_df_equality
+    df = spark.createDataFrame(
+        schema="id int, data string",
+        data=[(1, "a"), (2, "b")]
+    )
+    return df
+"""
+
+# Notebook-scoped packages via notebook job run (all_purpose_cluster, create_notebook=True)
+notebook_scoped_packages_notebook_run_model = """
+def model(dbt, spark):
+    dbt.config(
+        materialized='table',
+        submission_method='all_purpose_cluster',
+        create_notebook=True,
+        notebook_scoped_libraries=True,
+        packages=['chispa'],
+    )
+    # it will break if not installed
+    from chispa import assert_df_equality
+    df = spark.createDataFrame(
+        schema="id int, data string",
+        data=[(1, "a"), (2, "b")]
+    )
+    return df
+"""
```
