Overview · Datasets · Repository Structure · How to Run
This repository provides the datasets and processing pipeline for an empirical study, conducted at the School of Electrical and Computer Engineering (NTUA), on how Large Language Models (LLMs) are used across Software Development Life Cycle (SDLC) phases in project-based Software Engineering courses.
Students were encouraged to:
- use AI tools throughout their projects,
- document their interactions,
- submit one log entry per interaction via a custom logging service.
The dataset spans 5 semesters (Fall 2023 – Fall 2025), capturing the evolution of AI usage—from early adoption to its integration as a standard development partner. It includes AI usage logs from two Software Engineering courses:
- Software Engineering (SoftEng – Fall)
- Software as a Service Technologies (SaaS – Spring)
The collected datasets include both objective and subjective attributes describing each AI-assisted interaction:
- SDLC phase
- Action type
- Programming language
- AI tool used
- Perceived quality of AI assistance
- Knowledge gained from the interaction
- Time saved due to AI usage
- Perceived threat of AI to future software engineering roles
A unified schema (after normalization) is available at data_preparation/1-normalization/B-schema_normalization/normalized_schema.json
Raw datasets (per semester) are provided in datasets/raw/, in CSV format.
Data Preparation - data_preparation/
Pipeline for preprocessing the datasets used in the empirical analysis.
1. Normalization — data_preparation/1-normalization/
This stage ensures consistency across datasets collected from different semesters, which originally had variations in schema and value representations.
| Step | Path | What it does |
|---|---|---|
| SoftEng-23b normalization | data_preparation/1-normalization/A-23b_normalization/ | Normalizes the Fall 2023 dataset, where some subjective metrics were recorded as textual values (e.g., “low”, “medium”, “high”) instead of the unified 0–5 scale used in later datasets. These values are normalized to ensure comparability. See: softeng23b_normalization.md |
| Schema Normalization | data_preparation/1-normalization/B-schema_normalization/ | Harmonizes column naming across semesters and removes redundant columns, keeping only attributes relevant for analysis. |
| Actions Normalization | data_preparation/1-normalization/C-actions_normalization/ | Standardizes the action field across datasets. The action represents the specific type of task the student performed using AI assistance (e.g., code authoring, design decisions, use case specification, etc.). Since different semesters used slightly different labels for similar actions, they are grouped into a unified taxonomy defined in actions_normalization.json |
| Scopes Normalization | data_preparation/1-normalization/D-scopes_normalization/ | Standardizes the scope field, which defines the granularity or target of the AI interaction (e.g., frontend, backend, UML modeling, etc.). Similar to actions, scope values are normalized across semesters based on a unified taxonomy defined in scope_normalization.json |
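Both the actions and scopes steps boil down to mapping raw per-semester labels onto a unified taxonomy. The sketch below illustrates the idea; the mapping values are invented for illustration, and the real mappings live in actions_normalization.json and scope_normalization.json.

```python
# Illustrative excerpt of a label -> taxonomy mapping, in the spirit of
# actions_normalization.json (the keys and values here are hypothetical).
ACTIONS_TAXONOMY = {
    "code writing": "code authoring",
    "writing code": "code authoring",
    "use-case spec": "use case specification",
    "design decision": "design decisions",
}

def normalize_action(raw_label: str) -> str:
    """Map a raw action label to its unified taxonomy entry.

    Unknown labels are returned lower-cased so they can be flagged
    for manual review rather than silently dropped.
    """
    key = raw_label.strip().lower()
    return ACTIONS_TAXONOMY.get(key, key)

print(normalize_action("Writing Code"))  # -> code authoring
```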
2. Enrichment — data_preparation/2-enrichment/
This stage computes additional derived attributes per interaction to support analysis.
Examples include:
- avg_action_experience
- phase_experience
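The exact definitions of these derived attributes live in the enrichment scripts. Purely as an illustration, the sketch below assumes an "experience" attribute counts how many earlier logs the same student already submitted for the same action or SDLC phase; field names and semantics are assumptions, not the repository's actual formulas.

```python
from collections import defaultdict

def enrich(logs):
    """Annotate logs (dicts with 'student', 'action', 'phase') in submission order.

    Hypothetical derivation: *_experience = number of earlier logs by the
    same student for the same action / phase.
    """
    action_seen = defaultdict(int)
    phase_seen = defaultdict(int)
    for log in logs:
        log["action_experience"] = action_seen[(log["student"], log["action"])]
        log["phase_experience"] = phase_seen[(log["student"], log["phase"])]
        action_seen[(log["student"], log["action"])] += 1
        phase_seen[(log["student"], log["phase"])] += 1
    return logs
```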
3. Validation — data_preparation/3-validation/
This stage filters out semantically invalid AI usage logs based on predefined consistency rules between:
- SDLC phase
- action type
- scope
Each SDLC phase defines which combinations of actions and scopes are valid. This ensures that only meaningful and realistic phase–action–scope interactions are included in the final dataset.
Validation rules are defined in data_preparation/3-validation/rules/
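Conceptually, the validation step is a per-phase allow-list check. The sketch below shows the shape of such a check; the phase, action, and scope values are illustrative placeholders, and the authoritative rules are the files under data_preparation/3-validation/rules/.

```python
# Hypothetical rules: each SDLC phase lists the actions and scopes
# that form a valid combination (values are illustrative only).
RULES = {
    "implementation": {
        "actions": {"code authoring", "debugging"},
        "scopes": {"frontend", "backend"},
    },
    "design": {
        "actions": {"design decisions", "use case specification"},
        "scopes": {"UML modeling"},
    },
}

def is_valid(phase: str, action: str, scope: str) -> bool:
    """Keep a log only if its action and scope are allowed for its phase."""
    rule = RULES.get(phase)
    return bool(rule) and action in rule["actions"] and scope in rule["scopes"]

logs = [
    ("implementation", "code authoring", "backend"),  # kept
    ("design", "code authoring", "UML modeling"),     # filtered out
]
validated = [log for log in logs if is_valid(*log)]
```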
Run full preprocessing pipeline:
./data_preparation/run_all_data_preparation.sh

Datasets - datasets/
| Dataset | Description |
|---|---|
| datasets/raw/ | Original per-semester datasets |
| datasets/normalized/ | Schema-aligned datasets |
| datasets/enriched/ | Datasets with derived attributes |
| datasets/validated/ | Final validated datasets used for analysis |
All datasets are provided in CSV format.
Research Questions - RQs/
This directory contains the full analysis pipeline used to answer the research questions of this empirical study, organized as data processing scripts per research question.
Each RQ includes:
- input_processing.py: reads validated datasets as input and generates results in CSV format
- plot.py: reads the generated CSVs and produces visualization plots
Outputs include:
- CSV outputs of analysis
- generated plots and figures
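The two-stage handoff between input_processing.py and plot.py goes through intermediate CSV files. A minimal sketch of that pattern, with invented column names:

```python
import csv
import io

# Stage 1 (in the spirit of input_processing.py): write aggregated results as CSV.
def write_results(rows, fh):
    writer = csv.DictWriter(fh, fieldnames=["phase", "count"])
    writer.writeheader()
    writer.writerows(rows)

# Stage 2 (in the spirit of plot.py): read the CSV back for plotting.
def read_results(fh):
    return list(csv.DictReader(fh))

buf = io.StringIO()
write_results([{"phase": "implementation", "count": 42}], buf)
buf.seek(0)
rows = read_results(buf)  # note: csv round-trips values as strings
```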
| # | RQ Description |
|---|---|
| RQ1 | On which software development tasks do students focus their use of AI tools? |
| RQ2 | How do students rank the quality, knowledge gained, and time saved when using AI assistants? |
| RQ3 | Do students perceive AI as a supportive tool or as a potential threat, both currently and in the future? |
Run full analysis:
./RQs/run_all.sh

Utils — utils/
Shared utility functions used across preprocessing and analysis pipelines.
How to Run

python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

chmod +x run_all_pipeline.sh
./run_all_pipeline.sh

Or run steps separately:
cd data_preparation
chmod +x run_all_data_preparation.sh
./run_all_data_preparation.sh

cd RQs
chmod +x run_all.sh
./run_all.sh