GUI4DE is a Python package for LLM-powered data engineering tasks. This package provides access to various data processing and analysis tasks through simple function calls.
pip install git+https://github.com/DataManagementLab/lab25-gui4de.git

import gui4de
from pathlib import Path
budget = 0.2  # USD

# Estimate task costs before running
estimated_cost = gui4de.column_type_annotation_costs(Path("data.csv"))
if estimated_cost <= budget:
    # Run a single task on the same file, capped at the same budget
    results, cost = gui4de.column_type_annotation_task(
        csv_file=Path("data.csv"),
        ontology_type="DBPedia",
        budget=budget
    )

Automatically annotates CSV columns with semantic types from ontologies.
results, cost = gui4de.column_type_annotation_task(
    csv_file=Path("data.csv"),  # Path to CSV file or file-like object
    ontology_type="DBPedia",  # Ontology type (e.g., "DBPedia", "SchemaOrg")
    budget=10.0,  # Maximum cost allowed (in USD)
    # Optional parameters:
    model="gpt-4",  # LLM model to use
    mode="local",  # Execution mode ("local" or "web")
    verify_generated_types=True,  # Verify types against ontology
    max_completion_tokens=150,  # Max tokens per column
    progress_callback=None  # Callback for progress updates
)
# Returns: (list[str], float) - Column type annotations and actual cost

Matches entities between two CSV files.
results, cost = gui4de.entity_matching_task(
    first_csv_file=Path("table1.csv"),  # First CSV file
    second_csv_file=Path("table2.csv"),  # Second CSV file
    budget=15.0,  # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",  # LLM model to use
    mode="local",  # Execution mode
    max_completion_tokens=200,  # Max tokens per comparison
    progress_callback=None  # Progress callback
)
# Returns: (list[list[str]], float) - Matching results and actual cost

Detects errors and inconsistencies in CSV data.
results, cost = gui4de.error_detection_task(
    csv_file=Path("data.csv"),  # CSV file to analyze
    budget=8.0,  # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",  # LLM model to use
    mode="local",  # Execution mode
    max_completion_tokens=150,  # Max tokens per analysis
    progress_callback=None  # Progress callback
)
# Returns: (list[list[str]], float) - Error detection results and cost

Intelligently fills missing values in CSV data.
results, cost = gui4de.missing_value_imputation_task(
    csv_file=Path("data.csv"),  # CSV file with missing values
    budget=12.0,  # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",  # LLM model to use
    mode="local",  # Execution mode
    max_completion_tokens=100,  # Max tokens per imputation
    progress_callback=None  # Progress callback
)
# Returns: (list[list[str]], float) - Imputed data and cost

Matches schemas between two CSV files.
results, cost = gui4de.schema_matching_task(
    first_csv_file=Path("schema1.csv"),  # First CSV file
    second_csv_file=Path("schema2.csv"),  # Second CSV file
    budget=10.0,  # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",  # LLM model to use
    mode="local",  # Execution mode
    max_completion_tokens=150,  # Max tokens per matching
    progress_callback=None  # Progress callback
)
# Returns: (list[dict[str, Union[str, list[str]]]], float) - Schema matching results and cost

Converts flat tables into relational format.
results, cost = gui4de.table_relationalization_task(
    csv_file=Path("flat_table.csv"),  # CSV file to relationalize
    budget=20.0,  # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",  # LLM model to use
    mode="local",  # Execution mode
    max_completion_tokens=200,  # Max tokens per operation
    progress_callback=None  # Progress callback
)
# Returns: (list[list[str]], float) - Relationalized data and cost

Provides data analysis recommendations and insights.
results, cost = gui4de.advisor_mode_task(
    csv_file=Path("data.csv"),  # CSV file to analyze
    budget=15.0,  # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",  # LLM model to use
    mode="local",  # Execution mode
    max_completion_tokens=300,  # Max tokens for advice
    progress_callback=None  # Progress callback
)
# Returns: (str, float) - Analysis advice and cost

Run multiple tasks sequentially on the same data:
result, total_cost = gui4de.multiple_task_execution(
    tasks=[
        "column_type_annotation",
        "missing_value_imputation"
    ],
    csv_file=Path("data.csv"),
    budget=25.0,
    # Optional parameters apply to all tasks:
    model="gpt-4",
    mode="local",
    max_completion_tokens=300,  # Max tokens for each task execution
    progress_callback=None  # Progress callback
)
# Returns: (str, float) - Final processed CSV content and total cost

Estimate costs before running tasks:
# task cost functions
cost = gui4de.column_type_annotation_costs(Path("data.csv"))
cost = gui4de.entity_matching_costs(Path("table1.csv"), Path("table2.csv"))
cost = gui4de.error_detection_costs(Path("data.csv"))
cost = gui4de.missing_value_imputation_costs(Path("data.csv"))
cost = gui4de.schema_matching_costs(Path("table1.csv"), Path("table2.csv"))
cost = gui4de.table_relationalization_costs(Path("data.csv"))
cost = gui4de.advisor_mode_costs(Path("data.csv"))

Monitor task progress with callback functions:
def progress_callback(current_step: int, total_steps: int, message: str):
    print(f"Progress: {current_step}/{total_steps} - {message}")

results, cost = gui4de.column_type_annotation_task(
    csv_file=Path("data.csv"),
    ontology_type="dbo",
    budget=10.0,
    progress_callback=progress_callback
)

All tasks accept CSV input in two formats:
- File Path: Use pathlib.Path for files on disk
  csv_file = Path("data.csv")
- File-like Object: Use StringIO or other file-like objects
  from io import StringIO
  csv_content = StringIO("column1,column2\nvalue1,value2")
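When the table already lives in memory, the file-like route avoids writing a temporary file. The following is a minimal sketch and not part of gui4de: `rows_to_csv_buffer` is a hypothetical helper that serializes rows with the standard `csv` module, so values containing commas or quotes are escaped correctly instead of being concatenated by hand.

```python
import csv
from io import StringIO


def rows_to_csv_buffer(header, rows):
    """Serialize a header and rows into an in-memory CSV buffer.

    Hypothetical helper for illustration; not part of gui4de.
    """
    buffer = StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    buffer.seek(0)  # rewind so a consumer reads from the start
    return buffer


csv_content = rows_to_csv_buffer(
    ["city", "country"],
    [["Darmstadt", "Germany"], ["Portland, OR", "USA"]],
)
print(csv_content.getvalue())
```

The returned buffer could then be passed wherever a task expects `csv_file`, assuming the task reads file-like objects from their current position.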
All task functions return a tuple of (results, cost):
- results: Task-specific output format (list of strings, list of lists, list of dict or string)
- cost: Actual cost incurred in USD (float)
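Because every task returns the same `(results, cost)` shape, spending across several calls can be tracked in one place. The sketch below is an illustration only: `CostLedger` is a hypothetical class, and the tuples fed into it are stand-ins shaped like the documented return values rather than real task output.

```python
from dataclasses import dataclass, field


@dataclass
class CostLedger:
    """Track spending across several task calls (hypothetical helper)."""

    budget: float
    entries: list = field(default_factory=list)

    @property
    def spent(self) -> float:
        return sum(cost for _, cost in self.entries)

    @property
    def remaining(self) -> float:
        return self.budget - self.spent

    def record(self, task_name, task_output):
        """Unpack a (results, cost) tuple, log the cost, return the results."""
        results, cost = task_output
        self.entries.append((task_name, cost))
        return results


ledger = CostLedger(budget=1.0)
# Stand-in tuples shaped like the documented return values.
annotations = ledger.record("column_type_annotation", (["City", "Country"], 0.25))
advice = ledger.record("advisor_mode", ("Check column 2 for duplicates.", 0.5))
print(f"spent={ledger.spent:.2f} remaining={ledger.remaining:.2f}")
```

In real use, the second argument to `record` would be the return value of a task call, and `ledger.remaining` could serve as the `budget` of the next call.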
The package uses OpenAI GPT models by default. Set your API key:
export OPENAI_API_KEY="your-api-key-here"

Tasks may raise exceptions for:
- Invalid CSV files
- Insufficient budget
- API errors
- Configuration issues
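Some of the failure modes listed above can be caught locally before any API call is made. The sketch below is an assumption about what is worth checking, not a description of gui4de's own validation: `preflight_problems` is a hypothetical helper, and the estimated cost is passed in as a plain number (for example, the value returned by one of the cost functions).

```python
import csv
import os
import tempfile
from pathlib import Path


def preflight_problems(csv_path, budget, estimated_cost, env=os.environ):
    """Return a list of human-readable problems; an empty list means go.

    Hypothetical helper; these checks are illustrative assumptions.
    """
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if estimated_cost > budget:
        problems.append(
            f"estimated cost {estimated_cost:.2f} exceeds budget {budget:.2f}"
        )
    path = Path(csv_path)
    if not path.is_file():
        problems.append(f"{path} does not exist")
        return problems
    with path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        problems.append(f"{path} is empty")
    elif any(len(row) != len(rows[0]) for row in rows):
        problems.append(f"{path} has rows with differing column counts")
    return problems


# Demo on a deliberately broken file, with an empty environment mapping.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.csv"
    sample.write_text("a,b\n1,2\n3\n", encoding="utf-8")
    found = preflight_problems(sample, budget=0.2, estimated_cost=0.5, env={})
print(found)
```

A check like this does not replace exception handling, since API errors can still occur at call time.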
Always wrap task calls in try-except blocks for production use:
try:
    results, cost = gui4de.column_type_annotation_task(
        csv_file=Path("data.csv"),
        ontology_type="dbo",
        budget=10.0
    )
    print(f"Task completed successfully. Cost: ${cost:.2f}")
except Exception as e:
    print(f"Task failed: {e}")

The package provides these main modules:
- Task functions for data engineering operations
- Cost estimation utilities
- Sequential task execution
- Progress monitoring capabilities
All functions are available directly from the gui4de package namespace for convenient usage.
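Since estimators and task functions share one namespace and a common naming pattern (`<name>_costs` and `<name>_task`), the two can be paired by name. The following is a hedged sketch: `run_if_affordable` is a hypothetical helper, it assumes the keyword names shown in this document (`csv_file`, `first_csv_file`, `second_csv_file`, `budget`), and it runs against a stand-in namespace so no package or API key is needed.

```python
from types import SimpleNamespace


def run_if_affordable(namespace, task_name, files, budget, **task_kwargs):
    """Estimate a task's cost and run it only when the estimate fits the budget.

    Assumes `<task_name>_costs` takes the input files positionally and
    `<task_name>_task` takes them as `csv_file` (one input) or
    `first_csv_file`/`second_csv_file` (two inputs).
    Returns (results, cost), or None when the estimate exceeds the budget.
    """
    cost_fn = getattr(namespace, f"{task_name}_costs")
    task_fn = getattr(namespace, f"{task_name}_task")
    if cost_fn(*files) > budget:
        return None
    if len(files) == 1:
        file_kwargs = {"csv_file": files[0]}
    else:
        file_kwargs = {"first_csv_file": files[0], "second_csv_file": files[1]}
    return task_fn(budget=budget, **file_kwargs, **task_kwargs)


# Stand-in namespace with fake functions, used only to exercise the helper.
fake = SimpleNamespace(
    error_detection_costs=lambda csv_file: 0.05,
    error_detection_task=lambda csv_file, budget: ([["row 1", "ok"]], 0.04),
)
outcome = run_if_affordable(fake, "error_detection", ["data.csv"], budget=0.2)
skipped = run_if_affordable(fake, "error_detection", ["data.csv"], budget=0.01)
print(outcome, skipped)
```

With the real package, `namespace` would be the imported `gui4de` module, and task-specific arguments such as `ontology_type` would be passed as extra keywords.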