We test how well RoBERTa generalizes when classifying machine-generated text across a range of settings: different LLMs generating the text, and different dataset domains the text belongs to, such as QA-style questions (followupQG), Wikipedia (wikitext), and general knowledge (SQuAD).
We used human-written data to generate the corresponding LLM text or answers.
We took human questions and human answers to compare against LLM answers.
We took human-written text and used its first 20 tokens as a prime for an LLM, which then continued the generation for another 140 tokens.
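
The priming step described above can be sketched as follows. This is a minimal illustration, not the repository's code: whitespace splitting stands in for the model tokenizer, and `split_prime`, `continue_from_prime`, and `dummy_generate` are hypothetical names. With a Hugging Face model, the continuation length would typically be set through the `max_new_tokens` argument of `generate`.

```python
from typing import Callable, List

PRIME_TOKENS = 20   # length of the human-written prime
NEW_TOKENS = 140    # number of tokens the LLM is asked to add


def split_prime(text: str, prime_tokens: int = PRIME_TOKENS) -> List[str]:
    """Return the first `prime_tokens` tokens of a human-written text.

    Assumption: whitespace tokenization stands in for the model tokenizer.
    """
    return text.split()[:prime_tokens]


def continue_from_prime(
    text: str,
    generate: Callable[[List[str], int], List[str]],
    prime_tokens: int = PRIME_TOKENS,
    new_tokens: int = NEW_TOKENS,
) -> str:
    """Prime a generator with the start of `text`; return prime + continuation."""
    prime = split_prime(text, prime_tokens)
    # Truncate defensively in case the generator returns too many tokens.
    continuation = generate(prime, new_tokens)[:new_tokens]
    return " ".join(prime + continuation)


def dummy_generate(prime: List[str], n: int) -> List[str]:
    """Hypothetical stand-in for an LLM: emits n placeholder tokens."""
    return ["<gen>"] * n


human_text = " ".join(f"w{i}" for i in range(200))
machine_text = continue_from_prime(human_text, dummy_generate)
print(len(machine_text.split()))  # 160 = 20 prime tokens + 140 generated tokens
```

Passing the generator as a callable keeps the slicing logic independent of any particular LLM, so the same prime can be reused across models.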
- Download the data from our drive into the `./data` folder and preserve the folder skeleton.
  a. The data are built with the `./data_engineering/download_datasets.ipynb` notebook, which slices the human data.
  b. The scripts under `./scripts` then run inference with all LLMs.
  c. These scripts are all named `./scripts/run_pre_processing_*`.
- With the process above we build our own datasets for finetuning RoBERTa.
  a. Run the `./scripts/run_finetuning_*` scripts to finetune RoBERTa on all settings described in our paper.
- Evaluate by running `./scripts/run_eval.sh`.
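
As a rough picture of what the resulting finetuning data might look like, the sketch below pairs human and LLM texts with binary labels and holds out a test split. The label convention (0 for human, 1 for machine), the 80/20 split, and the function names are assumptions made for illustration; the repository's scripts may organize the data differently.

```python
import random
from typing import Dict, List, Tuple

HUMAN, MACHINE = 0, 1  # assumed label convention, not taken from the repository

Row = Dict[str, object]


def build_detection_dataset(
    human_texts: List[str], machine_texts: List[str], seed: int = 0
) -> List[Row]:
    """Label human and LLM texts and shuffle them into one dataset."""
    rows: List[Row] = [{"text": t, "label": HUMAN} for t in human_texts]
    rows += [{"text": t, "label": MACHINE} for t in machine_texts]
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    return rows


def train_test_split(
    rows: List[Row], test_fraction: float = 0.2
) -> Tuple[List[Row], List[Row]]:
    """Hold out the last `test_fraction` of the already shuffled rows."""
    n_test = int(len(rows) * test_fraction)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


humans = [f"human text {i}" for i in range(5)]
machines = [f"llm text {i}" for i in range(5)]
dataset = build_detection_dataset(humans, machines)
train, test = train_test_split(dataset)
print(len(train), len(test))  # 8 2
```

To probe generalization, a dataset built from one LLM or domain would serve as the training set, and one built from another as the evaluation set.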
Link to our drive: https://amsuni-my.sharepoint.com/:f:/g/personal/theofanis_aslanidis_student_uva_nl/EvMkEjWdMHhJjPuzfbQhRGQBswVjd1HuC0K22FXcbfb6pA