Machine Generated Text Detector

We test how well RoBERTa generalizes when classifying machine-generated text across a range of settings: different LLMs producing the text, and different dataset domains the text belongs to, such as QA-style questions (followupQG), Wikipedia (wikitext), and general knowledge (SQuAD).

Machine Generated Datasets

We used human-written data to generate the corresponding LLM text and answers.

Question-Answer

We took human questions together with their human answers, and compare the human answers with LLM answers.
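As a rough illustration of how such a comparison could be framed as a classification problem, the sketch below turns one question into two labeled examples. The `Example` class, the `build_qa_examples` helper, the label convention (0 for human, 1 for machine), and the sample strings are all hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass

# Assumed label convention; the repository may use a different one.
HUMAN, MACHINE = 0, 1


@dataclass
class Example:
    question: str
    answer: str
    label: int


def build_qa_examples(question, human_answer, llm_answer):
    """Pair one question with a human answer and an LLM answer."""
    return [
        Example(question, human_answer, HUMAN),
        Example(question, llm_answer, MACHINE),
    ]


# Placeholder strings for illustration only, not real dataset rows.
examples = build_qa_examples(
    "Why is the sky blue?",
    "Because air scatters blue light more than red light.",
    "The sky appears blue due to Rayleigh scattering of sunlight.",
)
for example in examples:
    print(example.label, example.answer)
```

Keeping the question attached to both answers means a classifier input can include or omit the question, depending on the setting being tested.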

Text Completion

We took human-written text and used it as a prime (20 tokens) for an LLM, which then continued the generation for another 140 tokens.
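The prime/continuation split can be sketched as follows. This is a minimal sketch that assumes whitespace-separated words as a stand-in for tokens (the actual tokenizer is not stated here) and that the human continuation is truncated to the same 140-token budget; `split_prime` is a hypothetical helper, not a function from the repository.

```python
def split_prime(text, prime_len=20, continuation_len=140):
    """Split text into a prime and the human continuation that follows it.

    Whitespace splitting is an assumption made for illustration; a model
    tokenizer would produce different token boundaries.
    """
    tokens = text.split()
    prime = tokens[:prime_len]
    human_continuation = tokens[prime_len:prime_len + continuation_len]
    return " ".join(prime), " ".join(human_continuation)


# Synthetic 200-word input, used only to show the slicing.
sample = " ".join(f"w{i}" for i in range(200))
prime, reference = split_prime(sample)
print(len(prime.split()), len(reference.split()))
```

The prime would be passed to the LLM as the prompt, while the human continuation of the same length could serve as the human-written counterpart.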

How to Run

1. Download the data from our drive into the ./data folder, preserving the folder structure.
   a. The human data are sliced by the ./data_engineering/download_datasets.ipynb notebook.
   b. The scripts under ./scripts then run inference with all LLMs.
   c. These scripts are all named ./scripts/run_pre_processing_*
2. The process above builds our own datasets for finetuning RoBERTa.
   a. Run the scripts ./scripts/run_finetuning_* to finetune RoBERTa on all the settings described in our paper.
3. Evaluate by running ./scripts/run_eval.sh
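The order of the steps above can be sketched as a small dry-run planner. This is only an illustration: `plan_pipeline` is a hypothetical helper, and it assumes that every matching file is a shell script that can be run with `bash` (only `run_eval.sh` is confirmed to have that extension). It prints the commands rather than executing them.

```python
from pathlib import Path


def plan_pipeline(scripts_dir="./scripts"):
    """List the commands in the order described: pre-processing, finetuning, eval."""
    root = Path(scripts_dir)
    steps = []
    for pattern in ("run_pre_processing_*", "run_finetuning_*", "run_eval.sh"):
        # Sort within each stage so the order is deterministic.
        steps.extend(sorted(root.glob(pattern)))
    # Assumption: each script is invoked with bash.
    return [["bash", str(path)] for path in steps]


# Dry run: print the planned commands without executing anything.
for command in plan_pipeline():
    print(" ".join(command))
```

If the data were downloaded from the drive, the pre-processing stage may already be covered, so those commands could be skipped.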

Link to our drive: https://amsuni-my.sharepoint.com/:f:/g/personal/theofanis_aslanidis_student_uva_nl/EvMkEjWdMHhJjPuzfbQhRGQBswVjd1HuC0K22FXcbfb6pA

