This README documents the current end-to-end pipeline used in this project:
- Build raw Reddit thread trees with depth filtering
- Build seed-A-B-C-D chains with metadata
- Run agent simulation
- Score generated texts with Detoxify
- Analyze results in notebook

From project root:

    cd /u/yian3/toxic_agent

Install core dependencies if needed:

    pip install pandas networkx torch detoxify tqdm jupyter

Script: data/build_reddit.py
This script reconstructs nested trees from raw Reddit branch-style JSON and writes:
data/extracted/politics_depth_ge5.jsonl
Run:

    python /u/yian3/toxic_agent/data/build_reddit.py

Notes:
- In the current script, input/output paths are set inside __main__.
- Default min_depth=5.
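
The depth filter can be illustrated with a minimal sketch. It assumes each reconstructed node is a dict with a `children` list, which is a hypothetical shape; the actual structure produced by `build_reddit.py` may differ. The names `max_depth`, `filter_by_depth`, and `write_jsonl` are illustrative, and depth here counts a lone root as 1, which may not match the script's convention.

```python
import json


def max_depth(node):
    """Return the number of nodes on the longest root-to-leaf path."""
    children = node.get("children", [])  # "children" key is an assumption
    if not children:
        return 1
    return 1 + max(max_depth(child) for child in children)


def filter_by_depth(trees, min_depth=5):
    """Keep only trees whose deepest path has at least min_depth nodes."""
    return [tree for tree in trees if max_depth(tree) >= min_depth]


def write_jsonl(trees, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for tree in trees:
            f.write(json.dumps(tree) + "\n")
```

Under these assumptions, `write_jsonl(filter_by_depth(trees), "data/extracted/politics_depth_ge5.jsonl")` would mirror the output step described above.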
Script: data/build_threads.py
This extracts one deepest chain per thread and writes:
- seed, A, B, C, D text and metadata (created_utc, score, etc.)
- Detoxify scores for each role
- Output file: data/extracted/politics_seedA_BCD_chains_detoxify.jsonl
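
A rough sketch of the chain extraction, under stated assumptions: nodes are dicts with `children`, `body`, `created_utc`, and `score` keys (the `children` and `body` names are hypothetical), ties between equally deep branches go to the first branch found, and only the first five nodes of the chain are kept. The functions `deepest_chain` and `chain_record` are illustrative and not taken from `build_threads.py`.

```python
ROLES = ["seed", "A", "B", "C", "D"]


def deepest_chain(node):
    """Return the longest root-to-leaf path as a list of nodes."""
    best = []
    for child in node.get("children", []):
        path = deepest_chain(child)
        if len(path) > len(best):  # strict ">" keeps the first branch on ties
            best = path
    return [node] + best


def chain_record(tree):
    """Map the first five nodes of the deepest chain to roles.

    Returns None when the chain is shorter than five nodes.
    """
    chain = deepest_chain(tree)
    if len(chain) < len(ROLES):
        return None
    return {
        role: {
            "text": node.get("body", ""),  # "body" is an assumed field name
            "created_utc": node.get("created_utc"),
            "score": node.get("score"),
        }
        for role, node in zip(ROLES, chain)
    }
```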
Run:

    python /u/yian3/toxic_agent/data/build_threads.py

Script: src/run_influence_baseline.py
Run from src/ so -m run_influence_baseline resolves correctly:

    cd /u/yian3/toxic_agent/src
    python -m run_influence_baseline \
      --seed_source reddit_jsonl \
      --reddit_jsonl /projects/bcxt/agent/data/reddit/extracted/politics_seedA_BCD_chains_detoxify.jsonl \
      --seed_strategy root \
      --reddit_require_max_depth -1 \
      --out_jsonl ../data/influence_baseline_threads_reddit.jsonl \
      --out_summary ../data/influence_baseline_summary_reddit.json \
      --model gpt-4o-mini \
      --n_seeds 50 \
      --compute_sentiment

Position 1 (intervention at pos1, 200 seeds):

    python -m run_influence_baseline \
      --seed_source reddit_jsonl \
      --reddit_jsonl /projects/bcxt/agent/data/reddit/extracted/politics_seedA_BCD_chains_detoxify.jsonl \
      --seed_strategy root \
      --reddit_require_max_depth -1 \
      --out_jsonl ../data/influence_baseline_threads_detoxify_200.jsonl \
      --out_summary ../data/influence_baseline_summary_detoxify_200.json \
      --model gpt-4o-mini \
      --n_seeds 200 \
      --compute_sentiment \
      --intervention_position pos1

Current simulation behavior:
- Agents can vote and/or reply at each opportunity.
- Vote events immediately update the target message score.
- Message created_utc is tracked for simulated replies.
- Vote events are stored in the output JSONL under votes.
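
The vote handling described above could look roughly like the following. This is only a sketch: apart from `votes` and the message score, the field names (`messages`, `id`, `voter`, `target_id`, `value`, `turn`) and the function `apply_vote` are assumptions for illustration, not the schema used by `run_influence_baseline`.

```python
def apply_vote(thread, voter, target_id, value, turn):
    """Record a vote and update the target message score immediately."""
    if value not in (1, -1):
        raise ValueError("value must be +1 (upvote) or -1 (downvote)")
    for message in thread["messages"]:
        if message["id"] == target_id:
            # The score changes right away, so later agents see the new value.
            message["score"] = message.get("score", 0) + value
            # The event is also logged under the thread-level "votes" list.
            thread.setdefault("votes", []).append(
                {"voter": voter, "target_id": target_id, "value": value, "turn": turn}
            )
            return message["score"]
    raise KeyError(f"no message with id {target_id!r}")
```

Updating the score at vote time, rather than at the end of a turn, means any agent acting afterwards would observe the changed score.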
Script: data/read_jsonl.py
This script:
- loads influence_baseline_threads_reddit.jsonl
- scores each generated message with Detoxify
- writes scores back into each message under message["detoxify"]
Run:

    cd /u/yian3/toxic_agent/data
    python read_jsonl.py

Note:
- The path is currently set in the script (path = "influence_baseline_threads_reddit.jsonl").
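
As a minimal sketch of the scoring step, the functions below attach scores to each message under `message["detoxify"]`. The scorer is passed in as a callable so the logic can be exercised without downloading a model. The `messages` and `text` field names, the helper names, and the separate `out_path` are assumptions; the actual `read_jsonl.py` may be organized differently.

```python
import json


def score_messages(thread, predict):
    """Attach predict(text) to every message, in place, under "detoxify"."""
    for message in thread.get("messages", []):  # "messages" is an assumed key
        scores = predict(message.get("text", ""))  # "text" is an assumed key
        # Cast to plain floats so the result is JSON-serializable.
        message["detoxify"] = {label: float(v) for label, v in scores.items()}
    return thread


def score_file(in_path, out_path, predict):
    """Score every thread in a JSONL file and write the result as JSONL."""
    with open(in_path, encoding="utf-8") as f:
        threads = [json.loads(line) for line in f if line.strip()]
    with open(out_path, "w", encoding="utf-8") as f:
        for thread in threads:
            f.write(json.dumps(score_messages(thread, predict)) + "\n")
```

With the real model, `predict` could be `Detoxify("original").predict` from the `detoxify` package, which returns a dict of label scores for a single string.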
Notebook: data/read_jsonl.ipynb
Use this notebook for:
- per-mode comparisons (toxic, neutral, removed)
- turn-level sentiment/toxicity trends
- vote behavior and score changes over turns
- summary tables and plots
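
One way to compute the per-mode, turn-level toxicity trend listed above is sketched here using only the standard library. It assumes each thread carries a `mode` field and each message a `turn` field, both hypothetical names, alongside the `detoxify` scores; the notebook itself may use pandas and a different layout.

```python
from collections import defaultdict
from statistics import mean


def mean_toxicity_by_mode_and_turn(threads):
    """Return {(mode, turn): mean toxicity} over all scored messages."""
    buckets = defaultdict(list)
    for thread in threads:
        mode = thread.get("mode")  # e.g. "toxic", "neutral", "removed"
        for message in thread.get("messages", []):
            score = message.get("detoxify", {}).get("toxicity")
            if score is not None:  # skip messages that were never scored
                buckets[(mode, message.get("turn"))].append(score)
    return {key: mean(values) for key, values in buckets.items()}
```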
Run:

    cd /u/yian3/toxic_agent/data
    jupyter notebook read_jsonl.ipynb

To evaluate single-thread influence across the mild, medium, and strong runs (the relative paths suggest running this from src/):

    python -m evaluate_single_thread_influence \
      --mild_path ../data/reddit/influence_baseline_threads_detoxify_mild.jsonl \
      --medium_path ../data/reddit/influence_baseline_threads_detoxify_medium.jsonl \
      --strong_path ../data/reddit/influence_baseline_threads_detoxify_strong.jsonl \
      --analysis_dir ../analysis

Output files:
- data/extracted/politics_depth_ge5.jsonl
- data/extracted/politics_seedA_BCD_chains_detoxify.jsonl
- data/influence_baseline_threads_reddit.jsonl
- data/influence_baseline_summary_reddit.json