Reddit Thread Simulation Workflow

This README documents the current end-to-end pipeline used in this project:

Build raw Reddit thread trees with depth filtering
Build seed-A-B-C-D chains with metadata
Run agent simulation
Score generated texts with Detoxify
Analyze results in notebook

0) Environment

From project root:

cd /u/yian3/toxic_agent

Install core dependencies if needed:

pip install pandas networkx torch detoxify tqdm jupyter

1) Build Reddit trees (depth >= 5)

Script: data/build_reddit.py

This script reconstructs nested trees from raw Reddit branch-style JSON and writes:

data/extracted/politics_depth_ge5.jsonl

Run:

python /u/yian3/toxic_agent/data/build_reddit.py

Notes:

In the current script, input/output paths are set inside __main__.
Default min_depth=5.
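
As a rough illustration of the reconstruction and depth filter, here is a minimal sketch. It assumes each raw "branch" is a root-to-leaf list of comment dicts with an "id" field, and that depth is counted in nodes (a lone root has depth 1). The function names and the input shape are hypothetical and may not match build_reddit.py.

```python
from typing import Any


def merge_branches(branches: list[list[dict[str, Any]]]) -> dict[str, Any]:
    """Merge root-to-leaf branches that share a root into one nested tree."""
    root = {**branches[0][0], "children": []}
    for branch in branches:
        node = root
        for comment in branch[1:]:
            # Reuse the child if an earlier branch already added this comment.
            child = next(
                (c for c in node["children"] if c["id"] == comment["id"]), None
            )
            if child is None:
                child = {**comment, "children": []}
                node["children"].append(child)
            node = child
    return root


def max_depth(node: dict[str, Any]) -> int:
    """Depth counted in nodes: a root with no replies has depth 1."""
    return 1 + max((max_depth(c) for c in node["children"]), default=0)


def keep_deep_trees(trees: list[dict[str, Any]], min_depth: int = 5) -> list[dict[str, Any]]:
    """Keep only trees whose deepest path has at least min_depth nodes."""
    return [t for t in trees if max_depth(t) >= min_depth]
```

Whether min_depth counts nodes or edges is an assumption here; check the script before relying on the exact cutoff.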

2) Build chain records for simulation

Script: data/build_threads.py

This extracts one deepest chain per thread and writes:

seed, A, B, C, D text and metadata (created_utc, score, etc.)
Detoxify scores for each role
Output file: data/extracted/politics_seedA_BCD_chains_detoxify.jsonl

Run:

python /u/yian3/toxic_agent/data/build_threads.py
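
The "one deepest chain per thread" step can be sketched as follows. This assumes a nested tree where each node has a "children" list, and that the first five nodes on the deepest path map to seed, A, B, C, D in order. Both the helper names and that role mapping are assumptions, not taken from build_threads.py.

```python
from typing import Any, Optional

ROLES = ("seed", "A", "B", "C", "D")


def deepest_path(node: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the longest root-to-leaf list of nodes (first found on ties)."""
    best: list[dict[str, Any]] = []
    for child in node.get("children", []):
        path = deepest_path(child)
        if len(path) > len(best):
            best = path
    return [node] + best


def chain_record(tree: dict[str, Any]) -> Optional[dict[str, dict[str, Any]]]:
    """Map the first five nodes of the deepest path to seed, A, B, C, D."""
    path = deepest_path(tree)
    if len(path) < len(ROLES):
        return None  # too shallow to fill every role
    return {
        # Drop the nested replies; keep text and metadata fields as they are.
        role: {k: v for k, v in node.items() if k != "children"}
        for role, node in zip(ROLES, path)
    }
```

Returning None for shallow trees is a design choice for this sketch; with the depth >= 5 filter from step 1 (counted in nodes) it should not trigger.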

3) Run influence simulation

Script: src/run_influence_baseline.py

Run from src/ so -m run_influence_baseline resolves correctly:

cd /u/yian3/toxic_agent/src

python -m run_influence_baseline \
  --seed_source reddit_jsonl \
  --reddit_jsonl /projects/bcxt/agent/data/reddit/extracted/politics_seedA_BCD_chains_detoxify.jsonl \
  --seed_strategy root \
  --reddit_require_max_depth -1 \
  --out_jsonl ../data/influence_baseline_threads_reddit.jsonl \
  --out_summary ../data/influence_baseline_summary_reddit.json \
  --model gpt-4o-mini \
  --n_seeds 50 \
  --compute_sentiment

Position 1 intervention run (200 seeds):

python -m run_influence_baseline \
  --seed_source reddit_jsonl \
  --reddit_jsonl /projects/bcxt/agent/data/reddit/extracted/politics_seedA_BCD_chains_detoxify.jsonl \
  --seed_strategy root \
  --reddit_require_max_depth -1 \
  --out_jsonl ../data/influence_baseline_threads_detoxify_200.jsonl \
  --out_summary ../data/influence_baseline_summary_detoxify_200.json \
  --model gpt-4o-mini \
  --n_seeds 200 \
  --compute_sentiment \
  --intervention_position pos1

Current simulation behavior:

Agents can vote and/or reply at each opportunity.
Vote events immediately update the target message score.
Message created_utc is tracked for simulated replies.
Vote events are stored in the output JSONL under votes.
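
To make the vote bookkeeping concrete, here is a minimal sketch of the behavior described above. The class and field names (Message, Thread, delta, agent) are hypothetical and chosen for illustration; only created_utc, score, and the votes list come from this README.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    id: str
    text: str
    created_utc: float
    score: int = 0


@dataclass
class Thread:
    messages: dict[str, Message] = field(default_factory=dict)
    votes: list[dict] = field(default_factory=list)

    def add_reply(self, msg_id: str, text: str, created_utc: float) -> Message:
        """Record a simulated reply together with its creation time."""
        msg = Message(id=msg_id, text=text, created_utc=created_utc)
        self.messages[msg_id] = msg
        return msg

    def vote(self, agent: str, target_id: str, delta: int, created_utc: float) -> None:
        """Apply a vote to the target's score right away and log the event."""
        self.messages[target_id].score += delta
        self.votes.append(
            {"agent": agent, "target": target_id, "delta": delta, "created_utc": created_utc}
        )
```

Because the score changes as soon as the vote is cast, any agent acting later in the same thread sees the updated value.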

4) Add Detoxify scores to generated messages

Script: data/read_jsonl.py

This script:

loads influence_baseline_threads_reddit.jsonl
scores each generated message with Detoxify
writes scores back into each message under message["detoxify"]

Run:

cd /u/yian3/toxic_agent/data
python read_jsonl.py

Note:

The path is currently set in the script (path = "influence_baseline_threads_reddit.jsonl").
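
The scoring pass might look roughly like the sketch below. It assumes each JSONL record has a "messages" list whose items carry a "text" field; that layout and the function names are assumptions, not the actual contents of read_jsonl.py. The scorer is passed in as a function so the file handling can be tried without loading a model.

```python
import json
from typing import Callable

ScoreFn = Callable[[str], dict[str, float]]


def attach_scores(in_path: str, out_path: str, score_fn: ScoreFn) -> int:
    """Score every message and store the result under message["detoxify"]."""
    with open(in_path, encoding="utf-8") as f_in:
        records = [json.loads(line) for line in f_in if line.strip()]
    scored = 0
    for record in records:
        for message in record.get("messages", []):
            message["detoxify"] = score_fn(message.get("text", ""))
            scored += 1
    # All records are already in memory, so out_path may equal in_path.
    with open(out_path, "w", encoding="utf-8") as f_out:
        for record in records:
            f_out.write(json.dumps(record) + "\n")
    return scored


def detoxify_scorer(model_name: str = "original") -> ScoreFn:
    """Build a scorer backed by Detoxify (imported lazily; loads model weights)."""
    from detoxify import Detoxify

    model = Detoxify(model_name)

    def score(text: str) -> dict[str, float]:
        # predict() returns NumPy floats; cast them so json.dumps accepts them.
        return {name: float(value) for name, value in model.predict(text).items()}

    return score
```

A call such as attach_scores(path, path, detoxify_scorer()) would rewrite the file in place with the added scores.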

5) Analysis in notebook

Notebook: data/read_jsonl.ipynb

Use this notebook for:

per-mode comparisons (toxic, neutral, removed)
turn-level sentiment/toxicity trends
vote behavior and score changes over turns
summary tables and plots

Run:

cd /u/yian3/toxic_agent/data
jupyter notebook read_jsonl.ipynb
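
For a quick per-mode comparison outside the notebook, a minimal sketch using only the standard library is shown here. It assumes each thread record has a "mode" field (toxic, neutral, removed) and messages scored as in step 4; the "mode" key and the function name are assumptions.

```python
from collections import defaultdict
from statistics import mean


def mean_toxicity_by_mode(threads: list[dict]) -> dict[str, float]:
    """Average message["detoxify"]["toxicity"] over all messages in each mode."""
    by_mode: dict[str, list[float]] = defaultdict(list)
    for thread in threads:
        mode = thread.get("mode", "unknown")
        for message in thread.get("messages", []):
            scores = message.get("detoxify")
            # Skip messages that have not been scored yet.
            if scores and "toxicity" in scores:
                by_mode[mode].append(scores["toxicity"])
    return {mode: mean(values) for mode, values in by_mode.items()}
```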

6) Analysis in Python code

Run (the ../data and ../analysis paths assume the working directory is src/, as in step 3):

python -m evaluate_single_thread_influence \
  --mild_path ../data/reddit/influence_baseline_threads_detoxify_mild.jsonl \
  --medium_path ../data/reddit/influence_baseline_threads_detoxify_medium.jsonl \
  --strong_path ../data/reddit/influence_baseline_threads_detoxify_strong.jsonl \
  --analysis_dir ../analysis

Outputs You Should See

data/extracted/politics_depth_ge5.jsonl
data/extracted/politics_seedA_BCD_chains_detoxify.jsonl
data/influence_baseline_threads_reddit.jsonl
data/influence_baseline_summary_reddit.json
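
A small check like the following can confirm that the listed files exist after a run. This is a sketch: the helper name is hypothetical, and the paths are taken relative to the project root.

```python
from pathlib import Path

EXPECTED_OUTPUTS = (
    "data/extracted/politics_depth_ge5.jsonl",
    "data/extracted/politics_seedA_BCD_chains_detoxify.jsonl",
    "data/influence_baseline_threads_reddit.jsonl",
    "data/influence_baseline_summary_reddit.json",
)


def missing_outputs(project_root: str) -> list[str]:
    """Return the expected output paths that are not present under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty list from missing_outputs("/u/yian3/toxic_agent") means all four files are in place.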
