GUI4DE Package

GUI4DE is a Python package for LLM-powered data engineering tasks. It exposes tasks such as column type annotation, entity matching, error detection, missing value imputation, schema matching, and table relationalization through simple function calls.

Installation

pip install git+https://github.com/DataManagementLab/lab25-gui4de.git

Quick Start

import gui4de
from pathlib import Path

budget = 0.2  # USD
csv_path = Path("data.csv")

# Estimate the task cost before running
estimated_cost = gui4de.column_type_annotation_costs(csv_path)

if estimated_cost <= budget:
    # Run a single task on the same file, capped at the same budget
    results, cost = gui4de.column_type_annotation_task(
        csv_file=csv_path,
        ontology_type="DBPedia",
        budget=budget
    )

Available Tasks

Column Type Annotation

Automatically annotates CSV columns with semantic types from ontologies.

results, cost = gui4de.column_type_annotation_task(
    csv_file=Path("data.csv"),  # Path to CSV file or file-like object
    ontology_type="DBPedia",    # Ontology type (e.g., "DBPedia", "SchemaOrg")
    budget=10.0,                # Maximum cost allowed (in USD)
    # Optional parameters:
    model="gpt-4",              # LLM model to use
    mode="local",               # Execution mode ("local" or "web")
    verify_generated_types=True, # Verify types against ontology
    max_completion_tokens=150,   # Max tokens per column
    progress_callback=None       # Callback for progress updates
)
# Returns: (list[str], float) - Column type annotations and actual cost

Entity Matching

Matches entities between two CSV files.

results, cost = gui4de.entity_matching_task(
    first_csv_file=Path("table1.csv"),   # First CSV file
    second_csv_file=Path("table2.csv"),  # Second CSV file
    budget=15.0,                         # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",                       # LLM model to use
    mode="local",                        # Execution mode
    max_completion_tokens=200,           # Max tokens per comparison
    progress_callback=None               # Progress callback
)
# Returns: (list[list[str]], float) - Matching results and actual cost

Error Detection

Detects errors and inconsistencies in CSV data.

results, cost = gui4de.error_detection_task(
    csv_file=Path("data.csv"),  # CSV file to analyze
    budget=8.0,                 # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",              # LLM model to use
    mode="local",               # Execution mode
    max_completion_tokens=150,  # Max tokens per analysis
    progress_callback=None      # Progress callback
)
# Returns: (list[list[str]], float) - Error detection results and cost

Missing Value Imputation

Intelligently fills missing values in CSV data.

results, cost = gui4de.missing_value_imputation_task(
    csv_file=Path("data.csv"),  # CSV file with missing values
    budget=12.0,                # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",              # LLM model to use
    mode="local",               # Execution mode
    max_completion_tokens=100,  # Max tokens per imputation
    progress_callback=None      # Progress callback
)
# Returns: (list[list[str]], float) - Imputed data and cost
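
A list-of-lists result such as the one above can be saved back to disk with the standard `csv` module. The following is a minimal sketch, assuming each inner list is one table row (the README does not state this explicitly); `write_rows` and the sample data are hypothetical and not part of gui4de.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(rows: list[list[str]], destination: Path) -> Path:
    """Write a list of rows to a CSV file and return the file path."""
    # newline="" lets the csv module control line endings itself.
    with destination.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return destination


# Sample data standing in for a task result (assumption: one inner list per row).
sample_rows = [["id", "city"], ["1", "Berlin"], ["2", "Darmstadt"]]

with tempfile.TemporaryDirectory() as tmp:
    out_path = write_rows(sample_rows, Path(tmp) / "imputed.csv")
    # Read the file back to confirm the rows survive the round trip.
    with out_path.open(newline="", encoding="utf-8") as handle:
        round_trip = list(csv.reader(handle))

print(round_trip)
```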

Schema Matching

Matches schemas between two CSV files.

results, cost = gui4de.schema_matching_task(
    first_csv_file=Path("schema1.csv"),  # First CSV file
    second_csv_file=Path("schema2.csv"), # Second CSV file
    budget=10.0,                         # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",                       # LLM model to use
    mode="local",                        # Execution mode
    max_completion_tokens=150,           # Max tokens per matching
    progress_callback=None               # Progress callback
)
# Returns: (list[dict[str, Union[str, list[str]]]], float) - Schema matching results and cost

Table Relationalization

Converts flat tables into relational format.

results, cost = gui4de.table_relationalization_task(
    csv_file=Path("flat_table.csv"),  # CSV file to relationalize
    budget=20.0,                      # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",                    # LLM model to use
    mode="local",                     # Execution mode
    max_completion_tokens=200,        # Max tokens per operation
    progress_callback=None            # Progress callback
)
# Returns: (list[list[str]], float) - Relationalized data and cost

Advisor Mode

Provides data analysis recommendations and insights.

results, cost = gui4de.advisor_mode_task(
    csv_file=Path("data.csv"),  # CSV file to analyze
    budget=15.0,                # Maximum cost allowed
    # Optional parameters:
    model="gpt-4",              # LLM model to use
    mode="local",               # Execution mode
    max_completion_tokens=300,  # Max tokens for advice
    progress_callback=None      # Progress callback
)
# Returns: (str, float) - Analysis advice and cost

Running Multiple Tasks

Run multiple tasks sequentially on the same data:

result, total_cost = gui4de.multiple_task_execution(
    tasks=[
        "column_type_annotation",
        "missing_value_imputation"
    ],
    csv_file=Path("data.csv"),
    budget=25.0,
    # Optional parameters apply to all tasks:
    model="gpt-4",
    mode="local",
    max_completion_tokens=300,  # Max tokens for each task execution
    progress_callback=None      # Progress callback
)
# Returns: (str, float) - Final processed CSV content and total cost
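
Because the result is returned as CSV text rather than as rows, it usually needs to be parsed before further use. A minimal sketch, assuming the returned string starts with a header row; `parse_csv_text` is a hypothetical helper and the sample string stands in for a real result.

```python
import csv
from io import StringIO


def parse_csv_text(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text with a header row into a list of row dictionaries."""
    return list(csv.DictReader(StringIO(csv_text)))


# Stand-in for the string returned by a multi-task run.
sample_result = "id,name\n1,Ada\n2,Grace\n"
rows = parse_csv_text(sample_result)
print(rows)
```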

Cost Estimation

Estimate costs before running tasks:

# task cost functions
cost = gui4de.column_type_annotation_costs(Path("data.csv"))
cost = gui4de.entity_matching_costs(Path("table1.csv"), Path("table2.csv"))
cost = gui4de.error_detection_costs(Path("data.csv"))
cost = gui4de.missing_value_imputation_costs(Path("data.csv"))
cost = gui4de.schema_matching_costs(Path("table1.csv"), Path("table2.csv"))
cost = gui4de.table_relationalization_costs(Path("data.csv"))
cost = gui4de.advisor_mode_costs(Path("data.csv"))
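
When several tasks are planned, the individual estimates can be summed and compared with a budget before anything is run. The sketch below takes the estimators as zero-argument callables so that it runs without gui4de installed; `estimate_total` and the stub values are hypothetical.

```python
from typing import Callable, Dict, Tuple


def estimate_total(
    estimators: Dict[str, Callable[[], float]], budget: float
) -> Tuple[Dict[str, float], float, bool]:
    """Call each estimator and report per-task costs, their sum, and whether it fits the budget."""
    per_task = {name: float(estimate()) for name, estimate in estimators.items()}
    total = sum(per_task.values())
    return per_task, total, total <= budget


# Stand-in estimators with made-up values. With the package installed, an entry
# could instead be: lambda: gui4de.error_detection_costs(Path("data.csv"))
stub_estimators = {
    "error_detection": lambda: 0.25,
    "missing_value_imputation": lambda: 0.5,
}

per_task, total, affordable = estimate_total(stub_estimators, budget=1.0)
print(per_task, total, affordable)
```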

Progress Callbacks

Monitor task progress with callback functions:

def progress_callback(current_step: int, total_steps: int, message: str):
    print(f"Progress: {current_step}/{total_steps} - {message}")

results, cost = gui4de.column_type_annotation_task(
    csv_file=Path("data.csv"),
    ontology_type="DBPedia",
    budget=10.0,
    progress_callback=progress_callback
)
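
A callback does not have to print; it can also record updates for later inspection. The sketch below assumes the callback is invoked with the three arguments shown above (`current_step`, `total_steps`, `message`); the `ProgressLog` class is hypothetical, and the calls at the bottom simulate what a task might send.

```python
from typing import List, Tuple


class ProgressLog:
    """Callable object that stores every progress update it receives."""

    def __init__(self) -> None:
        self.updates: List[Tuple[int, int, str]] = []

    def __call__(self, current_step: int, total_steps: int, message: str) -> None:
        self.updates.append((current_step, total_steps, message))

    def percent_complete(self) -> float:
        """Return the completion percentage of the latest update, or 0.0 if there is none."""
        if not self.updates:
            return 0.0
        current, total, _ = self.updates[-1]
        return 100.0 * current / total if total else 0.0


log = ProgressLog()
# Simulated updates; a real run would pass progress_callback=log to a task.
log(1, 4, "annotating column 1")
log(2, 4, "annotating column 2")
print(f"{log.percent_complete():.0f}% done")
```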

Input Formats

All tasks accept CSV input in two formats:

  1. File Path: Use pathlib.Path for files on disk

    csv_file = Path("data.csv")
  2. File-like Object: Use StringIO or other file-like objects

    from io import StringIO
    csv_content = StringIO("column1,column2\nvalue1,value2")
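
For data that already lives in memory, the file-like variant can be built with the standard `csv` module, which takes care of quoting values that contain commas. A minimal sketch; `rows_to_csv_buffer` is a hypothetical helper, not part of gui4de.

```python
import csv
from io import StringIO


def rows_to_csv_buffer(header: list[str], rows: list[list[str]]) -> StringIO:
    """Serialize a header and rows into an in-memory CSV buffer, rewound to the start."""
    buffer = StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    buffer.seek(0)  # Rewind so a consumer reads from the first line.
    return buffer


csv_buffer = rows_to_csv_buffer(
    ["name", "city"],
    [["Ada", "London"], ["Grace", "New York, NY"]],
)
print(csv_buffer.getvalue())
```

The resulting buffer could then be passed wherever a `csv_file` argument is expected.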

Return Values

All task functions return a tuple of (results, cost):

  • results: Task-specific output (a list of strings, a list of lists, a list of dicts, or a string, depending on the task)
  • cost: Actual cost incurred in USD (float)
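
As an illustration of working with a list-of-strings result, the sketch below pairs column type annotations with the CSV header. It assumes the annotations come back one per column, in column order, which the README does not state; `label_columns` and the sample annotations are hypothetical.

```python
import csv
from io import StringIO


def label_columns(csv_text: str, annotations: list[str]) -> dict[str, str]:
    """Map each header name to its annotation, assuming matching order and length."""
    header = next(csv.reader(StringIO(csv_text)))
    if len(header) != len(annotations):
        raise ValueError(
            f"expected {len(header)} annotations, got {len(annotations)}"
        )
    return dict(zip(header, annotations))


sample_csv = "name,birth_place\nAda Lovelace,London\n"
# Made-up annotations standing in for a real task result.
sample_annotations = ["Person", "City"]
labels = label_columns(sample_csv, sample_annotations)
print(labels)
```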

Configuration

The package uses OpenAI GPT models by default. Set your API key:

export OPENAI_API_KEY="your-api-key-here"
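
A script can check for the key up front and fail with a clear message rather than partway through a task. A minimal sketch using only the standard library; `require_api_key` is a hypothetical helper, and how gui4de itself reads the variable is not covered here.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment or raise if it is missing or blank."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running tasks")
    return key


try:
    require_api_key()
    print("API key found")
except RuntimeError as error:
    print(error)
```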

Error Handling

Tasks may raise exceptions for:

  • Invalid CSV files
  • Insufficient budget
  • API errors
  • Configuration issues

Always wrap task calls in try-except blocks for production use:

try:
    results, cost = gui4de.column_type_annotation_task(
        csv_file=Path("data.csv"),
        ontology_type="DBPedia",
        budget=10.0
    )
    print(f"Task completed successfully. Cost: ${cost:.2f}")
except Exception as e:
    print(f"Task failed: {e}")
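
The same pattern can be factored into a small wrapper that logs the failure and returns `None` instead of raising. This is a sketch only: `run_task_safely` is hypothetical, it catches the broad `Exception` because the README does not name specific exception types, and the two stub functions stand in for real task calls.

```python
import logging
from typing import Any, Callable, Optional, Tuple

logger = logging.getLogger("gui4de_example")


def run_task_safely(
    task: Callable[..., Tuple[Any, float]], **kwargs: Any
) -> Optional[Tuple[Any, float]]:
    """Run a task returning (results, cost); log and return None if it raises."""
    try:
        results, cost = task(**kwargs)
    except Exception:
        logger.exception("Task %s failed", getattr(task, "__name__", repr(task)))
        return None
    return results, cost


# Stub tasks so the sketch runs without gui4de or network access.
def ok_task(budget: float) -> Tuple[list, float]:
    return ["ok"], 0.1


def failing_task(budget: float) -> Tuple[list, float]:
    raise ValueError("insufficient budget")


print(run_task_safely(ok_task, budget=1.0))
print(run_task_safely(failing_task, budget=0.0))
```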

Package Structure

The package provides these main components:

  • Task functions for data engineering operations
  • Cost estimation utilities
  • Sequential task execution
  • Progress monitoring capabilities

All functions are available directly from the gui4de package namespace for convenient usage.