
LLMs across the SDLC in Project-Based SE Courses

Overview · Datasets · Repository Structure · How to Run


Overview

This repository provides the datasets and processing pipeline for an empirical study, conducted at the School of Electrical and Computer Engineering (NTUA), on how Large Language Models (LLMs) are used across Software Development Life Cycle (SDLC) phases in project-based Software Engineering courses.

Students were encouraged to:

  • use AI tools throughout their projects,
  • document their interactions,
  • submit one log entry per interaction via a custom logging service.

The dataset spans 5 semesters (Fall 2023 – Fall 2025), capturing the evolution of AI usage—from early adoption to its integration as a standard development partner. It includes AI usage logs from two Software Engineering courses:

  • Software Engineering (SoftEng – Fall)
  • Software as a Service Technologies (SaaS – Spring)

Datasets

The collected datasets include both objective and subjective attributes describing each AI-assisted interaction.

Objective attributes

  • SDLC phase
  • Action type
  • Programming language
  • AI tool used

Subjective attributes

  • Perceived quality of AI assistance
  • Knowledge gained from the interaction
  • Time saved due to AI usage
  • Perceived threat of AI to future software engineering roles

A unified schema (after normalization) is available at data_preparation/1-normalization/B-schema_normalization/normalized_schema.json.

Raw datasets (per semester) are provided in datasets/raw/, in CSV format.
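To give a feel for the data, a single log entry might look like the following. The column names here are illustrative assumptions based on the attribute lists above; the authoritative field list is in normalized_schema.json.

```python
import csv
import io

# Hypothetical raw CSV excerpt; real column names are defined in
# data_preparation/1-normalization/B-schema_normalization/normalized_schema.json
raw = io.StringIO(
    "phase,action,language,tool,quality,knowledge,time_saved,threat\n"
    "Implementation,code authoring,Python,ChatGPT,4,3,5,1\n"
)
rows = list(csv.DictReader(raw))
print(rows[0]["phase"])  # Implementation
```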


Repository Structure

Data Preparation - data_preparation/

Pipeline for preprocessing the datasets used in the empirical analysis.


Normalization

This stage ensures consistency across datasets collected from different semesters, which originally varied in schema and value representations.

  • SoftEng-23b normalization (data_preparation/1-normalization/A-23b_normalization/): Normalizes the Fall 2023 dataset, where some subjective metrics were recorded as textual values (e.g., “low”, “medium”, “high”) instead of the unified 0–5 scale used in later datasets. These values are mapped onto the unified scale to ensure comparability. See: softeng23b_normalization.md
  • Schema normalization (data_preparation/1-normalization/B-schema_normalization/): Harmonizes column naming across semesters and removes redundant columns, keeping only attributes relevant for analysis.
  • Actions normalization (data_preparation/1-normalization/C-actions_normalization/): Standardizes the action field, i.e., the specific type of task the student performed with AI assistance (e.g., code authoring, design decisions, use case specification). Because different semesters used slightly different labels for similar actions, labels are grouped into a unified taxonomy defined in actions_normalization.json
  • Scopes normalization (data_preparation/1-normalization/D-scopes_normalization/): Standardizes the scope field, which defines the granularity or target of the AI interaction (e.g., frontend, backend, UML modeling). As with actions, scope values are mapped to a unified taxonomy defined in scope_normalization.json
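The SoftEng-23b step can be sketched as a simple lookup from textual ratings to the 0–5 scale. The mapping below is an assumption for illustration only; the actual correspondence is documented in softeng23b_normalization.md.

```python
# Hypothetical mapping from Fall 2023 textual ratings to the unified 0-5 scale;
# the real mapping is defined in softeng23b_normalization.md.
TEXT_TO_SCALE = {"none": 0, "low": 1, "medium": 3, "high": 5}

def normalize_rating(value: str) -> int:
    """Map a textual rating onto the 0-5 scale; pass numeric values through."""
    v = value.strip().lower()
    if v in TEXT_TO_SCALE:
        return TEXT_TO_SCALE[v]
    return int(v)  # later semesters already record 0-5 integers

print(normalize_rating("medium"))  # 3
print(normalize_rating("4"))       # 4
```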

Enrichment

This stage computes additional derived attributes per interaction to support analysis.

Examples include:

  • avg_action_experience
  • phase_experience
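As an illustration of how such attributes could be derived (the exact definitions are assumptions; the real logic lives in the enrichment scripts), phase_experience might count a student's prior logged interactions in the same SDLC phase:

```python
from collections import defaultdict

def add_phase_experience(logs):
    """Annotate each log (assumed chronological) with the number of earlier
    interactions the same student logged in the same SDLC phase.
    Hypothetical sketch of the enrichment stage."""
    seen = defaultdict(int)
    for log in logs:
        key = (log["student"], log["phase"])
        log["phase_experience"] = seen[key]
        seen[key] += 1
    return logs

logs = add_phase_experience([
    {"student": "s1", "phase": "Design"},
    {"student": "s1", "phase": "Design"},
    {"student": "s2", "phase": "Design"},
])
print([l["phase_experience"] for l in logs])  # [0, 1, 0]
```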

Validation

This stage filters out semantically invalid AI usage logs based on predefined consistency rules between:

  • SDLC phase
  • action type
  • scope

Each SDLC phase defines which combinations of actions and scopes are valid. This ensures that only meaningful and realistic phase–action–scope interactions are included in the final dataset.

Validation rules are defined in data_preparation/3-validation/rules/
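Conceptually, the validation step keeps a log only if its action and scope are both allowed for its phase. A minimal sketch follows; the rule contents below are invented for illustration, while the real rules live in data_preparation/3-validation/rules/.

```python
# Hypothetical rule set: each SDLC phase lists its valid actions and scopes.
RULES = {
    "Design": {
        "actions": {"design decisions", "use case specification"},
        "scopes": {"UML modeling", "frontend"},
    },
    "Implementation": {
        "actions": {"code authoring"},
        "scopes": {"frontend", "backend"},
    },
}

def is_valid(log: dict) -> bool:
    """Return True only for phase-action-scope combinations the rules allow."""
    rule = RULES.get(log["phase"])
    return (rule is not None
            and log["action"] in rule["actions"]
            and log["scope"] in rule["scopes"])

# Code authoring is not a Design-phase action under these (invented) rules:
print(is_valid({"phase": "Design", "action": "code authoring", "scope": "backend"}))  # False
```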

Run full preprocessing pipeline:

./data_preparation/run_all_data_preparation.sh

Datasets - datasets/

  • datasets/raw/ → original per-semester datasets
  • datasets/normalized/ → schema-aligned datasets
  • datasets/enriched/ → datasets with derived attributes
  • datasets/validated/ → final validated datasets used for analysis

All datasets are provided in CSV format.


Research Questions - RQs/

This directory contains the full analysis pipeline used to answer the research questions of this empirical study, organized as data processing scripts per research question.

Each RQ includes:

  • input_processing.py: reads validated datasets as input and generates results in CSV format
  • plot.py: reads the generated CSVs and produces visualization plots
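For an RQ1-style question, input_processing.py presumably aggregates validated logs into per-task counts before plot.py visualizes them. A stdlib-only sketch of that handoff (column names and aggregation are assumptions):

```python
import csv
import io
from collections import Counter

def count_actions(validated_rows):
    """Aggregate validated logs into per-action counts and emit CSV text,
    mirroring the input_processing.py -> plot.py handoff. Hypothetical sketch."""
    counts = Counter(row["action"] for row in validated_rows)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["action", "count"])
    for action, n in counts.most_common():
        writer.writerow([action, n])
    return out.getvalue()

csv_text = count_actions([
    {"action": "code authoring"},
    {"action": "code authoring"},
    {"action": "debugging"},
])
print(csv_text.splitlines()[1])  # code authoring,2
```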

Analysis outputs include:

  • CSV outputs of analysis
  • generated plots and figures

Research Questions

  • RQ1: On which software development tasks do students focus their use of AI tools?
  • RQ2: How do students rank the quality, knowledge gained, and time saved when using AI assistants?
  • RQ3: Do students perceive AI as a supportive tool or as a potential threat, both currently and in the future?

Run full analysis:

./RQs/run_all.sh

Utils - utils/

Shared utility functions used across preprocessing and analysis pipelines.


How to Run

1. Setup environment

python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

2. Run full pipeline

chmod +x run_all_pipeline.sh
./run_all_pipeline.sh

Or run steps separately:

Preprocessing

cd data_preparation
chmod +x run_all_data_preparation.sh
./run_all_data_preparation.sh

Analysis (RQs)

cd RQs
chmod +x run_all.sh
./run_all.sh
