GenTextDetect

Repository for my MSc thesis, titled "AI-generated vs human-authored texts: comparative analysis of datasets and NLP methods".
Author: Michał Gromadzki
Supervisors: Anna Wróblewska (PhD in Computer Science) and Agnieszka Kaliska (PhD in Linguistics)
University: Warsaw University of Technology, Faculty of Mathematics and Information Science
Year: 2024/2025

Project Overview

This thesis highlights inconsistencies in data quality and methodology across current AI-generated text detection systems. It addresses these shortcomings through a comparative analysis on a carefully collected text corpus. It then presents a series of experiments designed to make AI-generated text detection more efficient. These experiments introduce two novel training paradigms aimed at improving detection accuracy and scalability across diverse textual domains. Finally, the thesis introduces a robust and universally applicable evaluation pipeline.

Results

The following plot summarizes the performance of the developed models in distinguishing AI-generated from human-authored texts.

[Figure: Results]

Repository Structure

  1. ./logs/ - Training history for all experiments
  2. ./notebooks/ - Jupyter Notebooks used for development
  3. ./plots/ - Plots used in the thesis
  4. ./predictions/ - Predictions of all evaluated models on the test datasets
  5. ./src/ - Source code

The Jupyter Notebooks were used only during development. They may contain errors, unused plots, or experimental code.

SLURM

All runner.sh scripts were used to submit jobs to the SLURM Queuing System.
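
As a rough illustration of how such a script might be submitted, the sketch below builds an `sbatch` command for a `runner.sh` file. It is an assumption-laden example, not code from this repository: the helper names `build_sbatch_command` and `submit`, the job name, and the script path are all hypothetical. Only `sbatch` and its `--job-name` option come from SLURM itself.

```python
import shlex
import subprocess
from typing import List, Optional


def build_sbatch_command(script: str, job_name: Optional[str] = None) -> List[str]:
    """Build the argument list for submitting `script` with sbatch."""
    cmd = ["sbatch"]
    if job_name is not None:
        # --job-name is a standard sbatch option.
        cmd.append(f"--job-name={job_name}")
    cmd.append(script)
    return cmd


def submit(script: str, job_name: Optional[str] = None, dry_run: bool = True) -> str:
    """Return the command line (dry run) or submit it and return sbatch's stdout."""
    cmd = build_sbatch_command(script, job_name)
    if dry_run:
        # No cluster needed: just show what would be executed.
        return shlex.join(cmd)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical path; substitute the runner.sh of the experiment to run.
    print(submit("runner.sh", job_name="demo"))
```

With `dry_run=True` (the default here) nothing is submitted, so the command can be checked on a machine without SLURM installed.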

Data sources

https://www.kaggle.com/datasets/kazanova/sentiment140
https://www.kaggle.com/datasets/smagnan/1-million-reddit-comments-from-40-subreddits
https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus
https://www.kaggle.com/datasets/benjaminawd/new-york-times-articles-comments-2020
https://www.kaggle.com/datasets/thedrcat/daigt-external-train-dataset
https://huggingface.co/datasets/liamdugan/raid
https://huggingface.co/datasets/EdinburghNLP/xsum
https://huggingface.co/datasets/euclaise/writingprompts
https://huggingface.co/datasets/google-research-datasets/natural_questions
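
The links above follow the `<host>/datasets/<owner>/<name>` pattern, so the dataset identifiers can be extracted mechanically. The following is a minimal sketch of that idea using only the standard library. The function names `dataset_id` and `group_by_host` are hypothetical, and nothing here reflects how the thesis code actually downloads the data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple
from urllib.parse import urlparse

DATA_SOURCES = [
    "https://www.kaggle.com/datasets/kazanova/sentiment140",
    "https://www.kaggle.com/datasets/smagnan/1-million-reddit-comments-from-40-subreddits",
    "https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus",
    "https://www.kaggle.com/datasets/benjaminawd/new-york-times-articles-comments-2020",
    "https://www.kaggle.com/datasets/thedrcat/daigt-external-train-dataset",
    "https://huggingface.co/datasets/liamdugan/raid",
    "https://huggingface.co/datasets/EdinburghNLP/xsum",
    "https://huggingface.co/datasets/euclaise/writingprompts",
    "https://huggingface.co/datasets/google-research-datasets/natural_questions",
]


def dataset_id(url: str) -> Tuple[str, str]:
    """Split a dataset URL into (host label, "owner/name")."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a dataset URL: {url}")
    if parsed.netloc.endswith("kaggle.com"):
        host = "kaggle"
    elif parsed.netloc.endswith("huggingface.co"):
        host = "huggingface"
    else:
        raise ValueError(f"Unknown host: {parsed.netloc}")
    return host, "/".join(parts[1:3])


def group_by_host(urls: List[str]) -> Dict[str, List[str]]:
    """Group "owner/name" identifiers by the site that hosts them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        host, ident = dataset_id(url)
        groups[host].append(ident)
    return dict(groups)


if __name__ == "__main__":
    for host, idents in group_by_host(DATA_SOURCES).items():
        print(host, idents)
```

The resulting `owner/name` strings are in the form accepted by `datasets.load_dataset` for the Hugging Face entries and by `kaggle datasets download -d` for the Kaggle entries. Some datasets may additionally require a configuration name or accepted terms of use.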
