Repository for my MSc thesis titled "AI-generated vs human-authored texts: comparative analysis of datasets and NLP methods".
Author: Michał Gromadzki
Supervisors: Anna Wróblewska, PhD in Computer Science, and Agnieszka Kaliska, PhD in Linguistics
University: Warsaw University of Technology, Faculty of Mathematics and Information Science
Year: 2024/2025
This thesis highlights inconsistencies in data quality and methodological approaches in current AI-generated text detection systems. It addresses these shortcomings by conducting a comparative analysis on a thoroughly collected textual corpus. It then presents a series of experiments designed to improve the efficiency of AI-generated text detection systems. These experiments introduce two novel training paradigms aimed at improving detection accuracy and scalability across diverse textual domains. Finally, the thesis introduces a robust and universally applicable evaluation pipeline.
The following plot summarizes the performance of the developed models in distinguishing AI-generated from human-authored texts.
./logs/ - Training history for all experiments
./notebooks/ - Jupyter Notebooks used for development
./plots/ - Plots used in the thesis
./predictions/ - Predictions of all evaluated models on the test datasets
./src/ - Source code
All Jupyter Notebooks were used for the development of the solutions. They may contain errors, unused plots, or experimental solutions.
All runner.sh scripts were used to submit jobs to the SLURM Queuing System.
https://www.kaggle.com/datasets/kazanova/sentiment140
https://www.kaggle.com/datasets/smagnan/1-million-reddit-comments-from-40-subreddits
https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus
https://www.kaggle.com/datasets/benjaminawd/new-york-times-articles-comments-2020
https://www.kaggle.com/datasets/thedrcat/daigt-external-train-dataset
https://huggingface.co/datasets/liamdugan/raid
https://huggingface.co/datasets/EdinburghNLP/xsum
https://huggingface.co/datasets/euclaise/writingprompts
https://huggingface.co/datasets/google-research-datasets/natural_questions
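The Hugging Face datasets listed above can in principle be fetched programmatically. The sketch below is not taken from this repository: the helper names `hub_repo_id` and `load_from_hub` are hypothetical, and it assumes the third-party `datasets` package is installed and that network access is available. Some of these datasets may also require a configuration name or other extra arguments, which can be passed through `**kwargs`.

```python
from urllib.parse import urlparse

# Hugging Face dataset pages listed in this README.
DATASET_URLS = [
    "https://huggingface.co/datasets/liamdugan/raid",
    "https://huggingface.co/datasets/EdinburghNLP/xsum",
    "https://huggingface.co/datasets/euclaise/writingprompts",
    "https://huggingface.co/datasets/google-research-datasets/natural_questions",
]


def hub_repo_id(url: str) -> str:
    """Turn a Hugging Face dataset URL into its 'owner/name' repository id.

    Hypothetical helper for illustration; not part of this repository.
    """
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Kaggle URLs share the /datasets/owner/name shape, so check the host too.
    if parsed.netloc != "huggingface.co" or len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_from_hub(url: str, **kwargs):
    """Download one dataset from the Hub (assumes `datasets` is installed)."""
    # Imported lazily so that hub_repo_id works without the package.
    from datasets import load_dataset

    return load_dataset(hub_repo_id(url), **kwargs)
```

For example, `load_from_hub(DATASET_URLS[1])` would attempt to download the XSum dataset. The Kaggle datasets are not covered by this sketch and would need to be downloaded separately.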
