GUI4DE Package

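GUI4DE is a Python package for LLM-powered data engineering tasks. It exposes column type annotation, entity matching, error detection, missing value imputation, schema matching, table relationalization, and an advisor mode as plain function calls, each with a cost estimate and a budget cap.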

Installation

pip install git+https://github.com/DataManagementLab/lab25-gui4de.git

Quick Start

import gui4de
from pathlib import Path

budget = 0.2  # USD
csv_file = Path("data.csv")

# Estimate the task cost before running
estimated_cost = gui4de.column_type_annotation_costs(csv_file)

if estimated_cost <= budget:
    # Run a single task on the same file, capped at the same budget
    results, cost = gui4de.column_type_annotation_task(
        csv_file=csv_file,
        ontology_type="DBPedia",
        budget=budget
    )

Available Tasks

Column Type Annotation

Automatically annotates CSV columns with semantic types from ontologies.

results, cost = gui4de.column_type_annotation_task(
    csv_file=Path("data.csv"),  # Path to CSV file or file-like object
    ontology_type="DBPedia",    # Ontology type (e.g., "DBPedia", "SchemaOrg")
    budget=10.0,                # Maximum cost allowed (in USD)
    # Optional parameters:
    model="gpt-4",              # LLM model to use
    mode="local",               # Execution mode ("local" or "web")
    verify_generated_types=True, # Verify types against ontology
    max_completion_tokens=150,   # Max tokens per column
    progress_callback=None       # Callback for progress updates
)
# Returns: (list[str], float) - Column type annotations and actual cost
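
The returned list is easier to read once each annotation is attached to its column name. The following is a minimal sketch that assumes the list holds one annotation per column in the same order as the CSV header; that ordering is an assumption, not something stated above. `label_columns` is a hypothetical helper rather than part of gui4de, and the table and annotations are made up.

```python
import csv
from io import StringIO


def label_columns(csv_text: str, annotations: list[str]) -> dict[str, str]:
    """Map each header name to its annotation, assuming matching column order."""
    header = next(csv.reader(StringIO(csv_text)))
    if len(header) != len(annotations):
        raise ValueError(
            f"Expected {len(header)} annotations, got {len(annotations)}"
        )
    return dict(zip(header, annotations))


# Made-up table and annotations standing in for a real `results` list.
table = "name,birthplace\nAda Lovelace,London\n"
labels = label_columns(table, ["Person", "City"])
print(labels)
```

The length check fails loudly if the assumption about one annotation per column does not hold for a given file.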

Entity Matching

Matches entities between two CSV files.

results, cost = gui4de.entity_matching_task(
    first_csv_file=Path("table1.csv"),   # First CSV file
    second_csv_file=Path("table2.csv"),  # Second CSV file
    budget=15.0,                         # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",                       # LLM model to use
    mode="local",                        # Execution mode
    max_completion_tokens=200,           # Max tokens per comparison
    progress_callback=None               # Progress callback
)
# Returns: (list[list[str]], float) - Matching results and actual cost

Error Detection

Detects errors and inconsistencies in CSV data.

results, cost = gui4de.error_detection_task(
    csv_file=Path("data.csv"),  # CSV file to analyze
    budget=8.0,                 # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",              # LLM model to use
    mode="local",               # Execution mode
    max_completion_tokens=150,  # Max tokens per analysis
    progress_callback=None      # Progress callback
)
# Returns: (list[list[str]], float) - Error detection results and cost

Missing Value Imputation

Intelligently fills missing values in CSV data.

results, cost = gui4de.missing_value_imputation_task(
    csv_file=Path("data.csv"),  # CSV file with missing values
    budget=12.0,                # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",              # LLM model to use
    mode="local",               # Execution mode
    max_completion_tokens=100,  # Max tokens per imputation
    progress_callback=None      # Progress callback
)
# Returns: (list[list[str]], float) - Imputed data and cost

Schema Matching

Matches schemas between two CSV files.

results, cost = gui4de.schema_matching_task(
    first_csv_file=Path("schema1.csv"),  # First CSV file
    second_csv_file=Path("schema2.csv"), # Second CSV file
    budget=10.0,                         # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",                       # LLM model to use
    mode="local",                        # Execution mode
    max_completion_tokens=150,           # Max tokens per matching
    progress_callback=None               # Progress callback
)
# Returns: (list[dict[str, Union[str, list[str]]]], float) - Schema matching results and cost

Table Relationalization

Converts flat tables into relational format.

results, cost = gui4de.table_relationalization_task(
    csv_file=Path("flat_table.csv"),  # CSV file to relationalize
    budget=20.0,                      # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",                    # LLM model to use
    mode="local",                     # Execution mode
    max_completion_tokens=200,        # Max tokens per operation
    progress_callback=None            # Progress callback
)
# Returns: (list[list[str]], float) - Relationalized data and cost

Advisor Mode

Provides data analysis recommendations and insights.

results, cost = gui4de.advisor_mode_task(
    csv_file=Path("data.csv"),  # CSV file to analyze
    budget=15.0,                # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",              # LLM model to use
    mode="local",               # Execution mode
    max_completion_tokens=300,  # Max tokens for advice
    progress_callback=None      # Progress callback
)
# Returns: (str, float) - Analysis advice and cost

Running Multiple Tasks

Run multiple tasks sequentially on the same data:

result, total_cost = gui4de.multiple_task_execution(
    tasks=[
        "column_type_annotation",
        "missing_value_imputation"
    ],
    csv_file=Path("data.csv"),
    budget=25.0,
    # Optional parameters apply to all tasks:
    model="gpt-4",
    mode="local",
    max_completion_tokens=300,  # Max tokens for each task execution
    progress_callback=None      # Progress callback
)
# Returns: (str, float) - Final processed CSV content and total cost
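
Because the combined run returns the processed table as CSV text rather than as rows, a caller will usually parse it before further use. A minimal sketch using only the standard library follows; `parse_csv_text` is a hypothetical helper, not part of gui4de, and the sample string is a made-up stand-in for a real `result`.

```python
import csv
from io import StringIO


def parse_csv_text(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per data row, keyed by header name."""
    return list(csv.DictReader(StringIO(csv_text)))


# Stand-in for the `result` string returned by a multi-task run (made-up data).
sample_result = "city,country\nBerlin,Germany\nParis,France\n"
rows = parse_csv_text(sample_result)
print(rows[0]["country"])
```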

Cost Estimation

Estimate costs before running tasks:

# task cost functions
cost = gui4de.column_type_annotation_costs(Path("data.csv"))
cost = gui4de.entity_matching_costs(Path("table1.csv"), Path("table2.csv"))
cost = gui4de.error_detection_costs(Path("data.csv"))
cost = gui4de.missing_value_imputation_costs(Path("data.csv"))
cost = gui4de.schema_matching_costs(Path("table1.csv"), Path("table2.csv"))
cost = gui4de.table_relationalization_costs(Path("data.csv"))
cost = gui4de.advisor_mode_costs(Path("data.csv"))
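
Once the estimates are collected, they can be used to decide which tasks fit a fixed budget. The sketch below is one possible greedy selection, not something the package provides: `tasks_within_budget` is a hypothetical helper, the dictionary keys are illustrative labels, and the numbers are made up in place of real estimator output.

```python
def tasks_within_budget(estimates: dict[str, float], budget: float) -> list[str]:
    """Greedily keep tasks, in the given order, while the running total fits."""
    selected: list[str] = []
    spent = 0.0
    for name, cost in estimates.items():
        if spent + cost <= budget:
            selected.append(name)
            spent += cost
    return selected


# Made-up estimates; in practice each value would come from a *_costs call.
estimates = {
    "column_type_annotation": 0.05,
    "error_detection": 0.25,
    "missing_value_imputation": 0.125,
}
affordable = tasks_within_budget(estimates, budget=0.2)
print(affordable)
```

A task that does not fit is skipped rather than ending the selection, so a cheaper task later in the list can still be chosen.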

Progress Callbacks

Monitor task progress with callback functions:

def progress_callback(current_step: int, total_steps: int, message: str):
    print(f"Progress: {current_step}/{total_steps} - {message}")

results, cost = gui4de.column_type_annotation_task(
    csv_file=Path("data.csv"),
    ontology_type="DBPedia",
    budget=10.0,
    progress_callback=progress_callback
)
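
A callback does not have to print. As a sketch, the factory below builds a callback with the same three-argument signature that records completion fractions in a list, which a GUI or test could then read. `make_progress_recorder` is a hypothetical name, and the calls at the end only simulate what a task might send.

```python
from typing import Callable


def make_progress_recorder() -> tuple[Callable[[int, int, str], None], list[float]]:
    """Return a callback plus the list it appends completion fractions to."""
    fractions: list[float] = []

    def callback(current_step: int, total_steps: int, message: str) -> None:
        # Guard against a zero total so the callback never raises mid-task.
        fraction = current_step / total_steps if total_steps else 0.0
        fractions.append(fraction)

    return callback, fractions


callback, fractions = make_progress_recorder()
# Simulated calls; in real use, pass `callback` as progress_callback instead.
callback(1, 4, "first column")
callback(2, 4, "second column")
print(fractions)
```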

Input Formats

All tasks accept CSV input in two formats:

File Path: Use pathlib.Path for files on disk
```
csv_file = Path("data.csv")
```

File-like Object: Use StringIO or another file-like object and pass it in place of the path (for example as csv_file)

from io import StringIO
csv_content = StringIO("column1,column2\nvalue1,value2")
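
When the table already lives in memory, writing it through the `csv` module avoids quoting mistakes that hand-built strings invite. A minimal sketch, assuming the rows are lists of strings; `rows_to_csv_buffer` is a hypothetical helper and the data is made up.

```python
import csv
from io import StringIO


def rows_to_csv_buffer(header: list[str], rows: list[list[str]]) -> StringIO:
    """Serialize rows to CSV and return a buffer rewound to the start."""
    buffer = StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    buffer.seek(0)  # Rewind so the consumer reads from the first line.
    return buffer


csv_content = rows_to_csv_buffer(["name", "note"], [["Ada", "likes, commas"]])
print(csv_content.getvalue())
```

The value containing a comma is quoted automatically, so the resulting text still parses as two columns.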

Return Values

All task functions return a tuple of (results, cost):

results: Task-specific output format (list of strings, list of lists, list of dict or string)
cost: Actual cost incurred in USD (float)
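
Since every call reports its actual cost, a script that runs several tasks can keep a running total against one overall budget. The class below is a sketch of that bookkeeping only; `CostLedger` is a hypothetical name, not part of gui4de, and the recorded amounts are made up.

```python
class CostLedger:
    """Accumulate actual task costs (USD) against one overall budget."""

    def __init__(self, budget: float) -> None:
        self.budget = budget
        self.entries: list[tuple[str, float]] = []

    def record(self, task_name: str, cost: float) -> None:
        """Store the cost returned by one task call."""
        self.entries.append((task_name, cost))

    @property
    def spent(self) -> float:
        return sum(cost for _, cost in self.entries)

    @property
    def remaining(self) -> float:
        return self.budget - self.spent


ledger = CostLedger(budget=1.0)
ledger.record("column_type_annotation", 0.25)  # made-up cost values
ledger.record("error_detection", 0.5)
print(f"spent={ledger.spent:.2f} remaining={ledger.remaining:.2f}")
```

The remaining amount could then be passed as the `budget` argument of the next task call.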

Configuration

The package uses OpenAI GPT models by default. Set your API key:

export OPENAI_API_KEY="your-api-key-here"
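
A script can check for the key up front so that a missing variable fails with a clear message before any task starts. This is a sketch under the assumption that the key is read from the `OPENAI_API_KEY` environment variable as shown above; `require_api_key` is a hypothetical helper, and the demo passes a plain dict instead of the real environment.

```python
import os
from collections.abc import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment or raise a clear error."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running a task."
        )
    return key


# Demo with a stand-in mapping; call require_api_key() to check the real environment.
demo_key = require_api_key({"OPENAI_API_KEY": "your-api-key-here"})
print(len(demo_key))
```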

Error Handling

Tasks may raise exceptions for:

Invalid CSV files
Insufficient budget
API errors
Configuration issues

Always wrap task calls in try-except blocks for production use:

try:
    results, cost = gui4de.column_type_annotation_task(
        csv_file=Path("data.csv"),
        ontology_type="DBPedia",
        budget=10.0
    )
    print(f"Task completed successfully. Cost: ${cost:.2f}")
except Exception as e:
    print(f"Task failed: {e}")
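
The same pattern can be factored into a small wrapper so that every task call is logged consistently. The sketch below catches the broad `Exception` class because the specific exception types raised by the package are not listed here; `run_task_safely` and `fake_task` are hypothetical, and `fake_task` only imitates the (results, cost) return shape so the example runs without an API key.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("gui4de_example")


def run_task_safely(
    task: Callable[..., tuple[Any, float]], **kwargs: Any
) -> Optional[tuple[Any, float]]:
    """Call a task and return (results, cost), or None if it raised."""
    try:
        return task(**kwargs)
    except Exception:
        # Log the traceback, then signal failure to the caller with None.
        logger.exception("Task %s failed", getattr(task, "__name__", task))
        return None


def fake_task(csv_file: str, budget: float) -> tuple[list[str], float]:
    """Stand-in for a real task function (hypothetical behavior)."""
    if budget <= 0:
        raise ValueError("Insufficient budget")
    return ["Person", "City"], 0.01


outcome = run_task_safely(fake_task, csv_file="data.csv", budget=1.0)
failed = run_task_safely(fake_task, csv_file="data.csv", budget=0.0)
print(outcome, failed)
```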

Package Structure

The package provides these main components:

Task functions for data engineering operations
Cost estimation utilities
Sequential task execution
Progress monitoring capabilities

All functions are available directly from the gui4de package namespace for convenient usage.
