Overview · Datasets · Repository Structure · How to Run
This repository provides the datasets and processing pipeline for an empirical study, conducted at the School of Electrical and Computer Engineering (NTUA), on how Large Language Models (LLMs) are used across Software Development Life Cycle (SDLC) phases in project-based Software Engineering courses.
Students were encouraged to:
- use AI tools throughout their projects,
- document their interactions,
- submit one log entry per interaction via a custom logging service.
The dataset spans 5 semesters (Fall 2023 – Fall 2025), capturing the evolution of AI usage—from early adoption to its integration as a standard development partner. It includes AI usage logs from two Software Engineering courses:
- Software Engineering (SoftEng – Fall)
- Software as a Service Technologies (SaaS – Spring)
The collected datasets include both objective and subjective attributes describing each AI-assisted interaction:
- SDLC phase
- Action type
- Programming language
- AI tool used
- Perceived quality of AI assistance
- Knowledge gained from the interaction
- Time saved due to AI usage
- Perceived threat of AI to future software engineering roles
A unified schema (after normalization) is available at data_preparation/1-normalization/B-schema_normalization/normalized_schema.json
Raw datasets (per semester) are provided in datasets/raw/, in CSV format.
Data Preparation - data_preparation/
Pipeline for preprocessing the datasets used in the empirical analysis.
1. Normalization — data_preparation/1-normalization/
This stage ensures consistency across datasets collected from different semesters, which originally had variations in schema and value representations.
| Step | Path | What it does |
|---|---|---|
| SoftEng-23b normalization | data_preparation/1-normalization/A-23b_normalization/ | Normalizes the Fall 2023 dataset, where some subjective metrics were recorded as textual values (e.g., “low”, “medium”, “high”) instead of the unified 0–5 scale used in later datasets. These values are normalized to ensure comparability. See: softeng23b_normalization.md |
| Schema Normalization | data_preparation/1-normalization/B-schema_normalization/ | Harmonizes column naming across semesters and removes redundant columns, keeping only attributes relevant for analysis. |
| Actions Normalization | data_preparation/1-normalization/C-actions_normalization/ | Standardizes the action field across datasets. The action represents the specific type of task the student performed using AI assistance (e.g., code authoring, design decisions, use case specification, etc.). Since different semesters used slightly different labels for similar actions, they are grouped into a unified taxonomy defined in actions_normalization.json |
| Scopes Normalization | data_preparation/1-normalization/D-scopes_normalization/ | Standardizes the scope field, which defines the granularity or target of the AI interaction (e.g., frontend, backend, UML modeling, etc.). Similar to actions, scope values are normalized across semesters based on a unified taxonomy defined in scope_normalization.json |
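Both the actions and scopes steps boil down to mapping raw per-semester labels onto a unified taxonomy. The sketch below illustrates the idea; the mapping values are invented for illustration, and the real mappings live in actions_normalization.json and scope_normalization.json.

```python
# Illustrative excerpt of a label -> taxonomy mapping, in the spirit of
# actions_normalization.json (the keys and values here are hypothetical).
ACTIONS_TAXONOMY = {
    "code writing": "code authoring",
    "writing code": "code authoring",
    "use-case spec": "use case specification",
    "design decision": "design decisions",
}

def normalize_action(raw_label: str) -> str:
    """Map a raw action label to its unified taxonomy entry.

    Unknown labels are returned lower-cased so they can be flagged
    for manual review rather than silently dropped.
    """
    key = raw_label.strip().lower()
    return ACTIONS_TAXONOMY.get(key, key)

print(normalize_action("Writing Code"))  # -> code authoring
```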
2. Enrichment — data_preparation/2-enrichment/
This stage computes additional derived attributes per interaction to support analysis.
Examples include:
- avg_action_experience
- phase_experience
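The exact definitions of these derived attributes live in the enrichment scripts. Purely as an illustration, the sketch below assumes an "experience" attribute counts how many earlier logs the same student already submitted for the same action or SDLC phase; field names and semantics are assumptions, not the repository's actual formulas.

```python
from collections import defaultdict

def enrich(logs):
    """Annotate logs (dicts with 'student', 'action', 'phase') in submission order.

    Hypothetical derivation: *_experience = number of earlier logs by the
    same student for the same action / phase.
    """
    action_seen = defaultdict(int)
    phase_seen = defaultdict(int)
    for log in logs:
        log["action_experience"] = action_seen[(log["student"], log["action"])]
        log["phase_experience"] = phase_seen[(log["student"], log["phase"])]
        action_seen[(log["student"], log["action"])] += 1
        phase_seen[(log["student"], log["phase"])] += 1
    return logs
```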
3. Validation — data_preparation/3-validation/
This stage filters out semantically invalid AI usage logs based on predefined consistency rules between:
- SDLC phase
- action type
- scope
Each SDLC phase defines which combinations of actions and scopes are valid. This ensures that only meaningful and realistic phase–action–scope interactions are included in the final dataset.
Validation rules are defined in data_preparation/3-validation/rules/
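Conceptually, the validation step is a per-phase allow-list check. The sketch below shows the shape of such a check; the phase, action, and scope values are illustrative placeholders, and the authoritative rules are the files under data_preparation/3-validation/rules/.

```python
# Hypothetical rules: each SDLC phase lists the actions and scopes
# that form a valid combination (values are illustrative only).
RULES = {
    "implementation": {
        "actions": {"code authoring", "debugging"},
        "scopes": {"frontend", "backend"},
    },
    "design": {
        "actions": {"design decisions", "use case specification"},
        "scopes": {"UML modeling"},
    },
}

def is_valid(phase: str, action: str, scope: str) -> bool:
    """Keep a log only if its action and scope are allowed for its phase."""
    rule = RULES.get(phase)
    return bool(rule) and action in rule["actions"] and scope in rule["scopes"]

logs = [
    ("implementation", "code authoring", "backend"),  # kept
    ("design", "code authoring", "UML modeling"),     # filtered out
]
validated = [log for log in logs if is_valid(*log)]
```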
Run full preprocessing pipeline:
./data_preparation/run_all_data_preparation.sh

Datasets - datasets/
| Dataset | Description |
|---|---|
| datasets/raw/ | Original per-semester datasets |
| datasets/normalized/ | Schema-aligned datasets |
| datasets/enriched/ | Datasets with derived attributes |
| datasets/validated/ | Final validated datasets used for analysis |
All datasets are provided in CSV format.
Research Questions - RQs/
This directory contains the full analysis pipeline used to answer the research questions of this empirical study, organized as data processing scripts per research question.
Each RQ includes:
- input_processing.py: reads validated datasets as input and generates results in CSV format
- plot.py: reads the generated CSVs and produces visualization plots
Outputs include:
- CSV outputs of analysis
- generated plots and figures
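The two-stage handoff between input_processing.py and plot.py goes through intermediate CSV files. A minimal sketch of that pattern, with invented column names:

```python
import csv
import io

# Stage 1 (in the spirit of input_processing.py): write aggregated results as CSV.
def write_results(rows, fh):
    writer = csv.DictWriter(fh, fieldnames=["phase", "count"])
    writer.writeheader()
    writer.writerows(rows)

# Stage 2 (in the spirit of plot.py): read the CSV back for plotting.
def read_results(fh):
    return list(csv.DictReader(fh))

buf = io.StringIO()
write_results([{"phase": "implementation", "count": 42}], buf)
buf.seek(0)
rows = read_results(buf)  # note: csv round-trips values as strings
```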
| # | RQ Description |
|---|---|
| RQ1 | On which software development tasks do students focus their use of AI tools? |
| RQ2 | How do students rank the quality, knowledge gained, and time saved when using AI assistants? |
| RQ3 | Do students perceive AI as a supportive tool or as a potential threat, both currently and in the future? |
Run full analysis:
./RQs/run_all.sh

Utils — utils/
Shared utility functions used across preprocessing and analysis pipelines.
How to Run

python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

chmod +x run_all_pipeline.sh
./run_all_pipeline.sh

Or run steps separately:
cd data_preparation
chmod +x run_all_data_preparation.sh
./run_all_data_preparation.sh

cd RQs
chmod +x run_all.sh
./run_all.sh