Add CPU video training and prototype generation to oxidize-train by Jackson57279 · Pull Request #18 · Zapdev-labs/oxidize

Jackson57279 · 2026-06-18T09:25:55Z

Summary

Adds a pure-Rust CPU video pipeline to oxidize-train for TikTok-style clip datasets (metadata JSON + mp4 files).
Introduces oxidize-train video for training a patch-embedding classifier with creator/virality/engagement labels and optional class balancing.
Adds oxidize-train prototype to render a smoothed base clip by averaging frames across selected creators (supports --exclude cellow111).

Test plan

cargo test -p oxidize-train
cargo clippy -p oxidize-train --all-targets -- -D warnings
Local smoke run: oxidize-train video --data ~/tt-downloader/videos --task creator --max-videos 60
Local prototype run: oxidize-train prototype --data ~/tt-downloader/videos --exclude cellow111 --out ~/tt-downloader/oxidize-base-video.mp4

Made with Cursor

Summary by cubic

Adds CPU short‑video training and prototype generation to oxidize-train, including new video and prototype subcommands, frame caching, and JSON model export. Trains a small patch‑embedding classifier on TikTok‑style datasets (mp4 + *_metadata.json).

New Features
- oxidize-train video trains a clip classifier on creator, virality, or engagement labels (quantile buckets).
- Deterministic split, optional class balancing, --max-videos cap, and frame caching in <data>/.oxidize-frames.
- Pure‑Rust data path with ffmpeg for frame extraction; patch‑embed + temporal mean‑pool + 2‑layer MLP head (CPU).
- Saves model and metadata to JSON (default: oxidize-video-<task>.json under the data root) and prints train/val metrics with a majority baseline.
- oxidize-train prototype renders a smoothed “base” clip by averaging frames across creators, supports --exclude, and writes mp4 + optional contact‑sheet PNG.
- New CLI structure: csv, video, and prototype subcommands.

Migration
- Install ffmpeg on PATH for frame extraction and prototype encoding.
- Example:
  - oxidize-train video --data ~/tt-downloader/videos --task creator --max-videos 60
  - oxidize-train prototype --data ~/tt-downloader/videos --exclude cellow111 --out ~/tt-downloader/oxidize-base-video.mp4
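
The "temporal mean-pool" step listed under New Features can be illustrated with a small sketch. This is not the PR's code: the function name `temporal_mean_pool` and the flat, frame-major buffer layout are assumptions made for illustration. It averages per-frame embeddings into one clip embedding and refuses inputs whose length does not match the declared shape.

```rust
/// Average `num_frames` per-frame embeddings (each of length `dim`, stored
/// back to back in `frames`) into a single clip embedding of length `dim`.
/// Hypothetical helper; the layout is an assumption, not taken from the PR.
pub fn temporal_mean_pool(frames: &[f32], num_frames: usize, dim: usize) -> Option<Vec<f32>> {
    // Reject shapes that do not match the buffer instead of guessing.
    if num_frames == 0 || dim == 0 || frames.len() != num_frames * dim {
        return None;
    }
    let mut pooled = vec![0.0f32; dim];
    // Sum each frame's embedding into the accumulator.
    for frame in frames.chunks_exact(dim) {
        for (acc, &value) in pooled.iter_mut().zip(frame) {
            *acc += value;
        }
    }
    // Divide by the frame count to turn the sum into a mean.
    let scale = 1.0 / num_frames as f32;
    for acc in &mut pooled {
        *acc *= scale;
    }
    Some(pooled)
}
```

Returning `Option` keeps a shape mismatch visible to the caller, which is the same concern the review raises about `reshape`.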

Written for commit de66d15. Summary will update on new commits.

Introduce a video pipeline for TikTok-style clip datasets with ffmpeg frame extraction, a trainable patch-embedding classifier, and a prototype subcommand that renders averaged base clips while excluding selected creators. Co-authored-by: Cursor <cursoragent@cursor.com>

cubic-dev-ai

9 issues found across 13 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="oxidize-train/src/video/dataset.rs">

<violation number="1" location="oxidize-train/src/video/dataset.rs:162">
P1: Balancing can silently produce an empty dataset when `bucketize` creates empty classes (e.g. buckets > samples). `balanced_indices` takes the minimum count across all classes, so any empty class causes `min_count == 0` and an empty selection. The code then builds and returns an empty `VideoDataset` without error.</violation>
</file>

<file name="oxidize-train/src/video/train.rs">

<violation number="1" location="oxidize-train/src/video/train.rs:65">
P1: Missing guard for `batch_size == 0` in training loop — panics at runtime if zero.</violation>
</file>

<file name="oxidize-train/src/video/frames.rs">

<violation number="1" location="oxidize-train/src/video/frames.rs:31">
P1: Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.</violation>
</file>

<file name="oxidize-train/src/video/model.rs">

<violation number="1" location="oxidize-train/src/video/model.rs:43">
P2: Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if `tokens_per_clip` is not evenly divisible by `num_frames`.</violation>

<violation number="2" location="oxidize-train/src/video/model.rs:203">
P1: `reshape` silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.</violation>
</file>

<file name="oxidize-train/src/video/prototype.rs">

<violation number="1" location="oxidize-train/src/video/prototype.rs:35">
P2: `upscale` and `fps` are not validated before use, despite similar parameters (`num_frames`, `frame_size`) being validated.</violation>

<violation number="2" location="oxidize-train/src/video/prototype.rs:69">
P1: Clip processing errors are silently discarded via `.ok()`, causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.</violation>
</file>

<file name="oxidize-train/src/video/manifest.rs">

<violation number="1" location="oxidize-train/src/video/manifest.rs:56">
P2: Manifest discovery is overly permissive and can include unintended videos from the data root.</violation>

<violation number="2" location="oxidize-train/src/video/manifest.rs:115">
P2: Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels.</violation>
</file>


cubic-dev-ai · 2026-06-18T09:31:10Z

+        (0..labels.len()).collect()
+    };
+
+    let mut data = Vec::with_capacity(selection.len() * span);


P1: Balancing can silently produce an empty dataset when bucketize creates empty classes (e.g. buckets > samples). balanced_indices takes the minimum count across all classes, so any empty class causes min_count == 0 and an empty selection. The code then builds and returns an empty VideoDataset without error.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/dataset.rs, line 162: <comment>Balancing can silently produce an empty dataset when `bucketize` creates empty classes (e.g. buckets > samples). `balanced_indices` takes the minimum count across all classes, so any empty class causes `min_count == 0` and an empty selection. The code then builds and returns an empty `VideoDataset` without error.</comment> <file context> @@ -0,0 +1,316 @@ + (0..labels.len()).collect() + }; + + let mut data = Vec::with_capacity(selection.len() * span); + let mut final_labels = Vec::with_capacity(selection.len()); + let mut final_samples = Vec::with_capacity(selection.len()); </file context>
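
One possible shape for the fix, as a sketch rather than the PR's implementation: fail before balancing when any class has no samples. The `BalanceError` type and the `num_classes` parameter are hypothetical; the real code uses `VideoError` and may carry class counts differently.

```rust
/// Hypothetical error type for this sketch; the PR's real type is `VideoError`.
#[derive(Debug, PartialEq)]
pub enum BalanceError {
    EmptyClass { class: usize },
    LabelOutOfRange { label: usize, num_classes: usize },
}

/// Select the same number of sample indices from every class.
/// Returns an error instead of an empty selection when a class has no samples.
pub fn balanced_indices(labels: &[usize], num_classes: usize) -> Result<Vec<usize>, BalanceError> {
    let mut per_class: Vec<Vec<usize>> = vec![Vec::new(); num_classes];
    for (index, &label) in labels.iter().enumerate() {
        if label >= num_classes {
            return Err(BalanceError::LabelOutOfRange { label, num_classes });
        }
        per_class[label].push(index);
    }
    // An empty class would force min_count to 0 and silently empty the dataset.
    if let Some(class) = per_class.iter().position(|indices| indices.is_empty()) {
        return Err(BalanceError::EmptyClass { class });
    }
    let min_count = per_class.iter().map(Vec::len).min().unwrap_or(0);
    // Keep the first `min_count` indices of each class, then restore dataset order.
    let mut selection: Vec<usize> = per_class
        .iter()
        .flat_map(|indices| indices.iter().copied().take(min_count))
        .collect();
    selection.sort_unstable();
    Ok(selection)
}
```

Taking the first `min_count` indices per class is a simplification; the actual selection may be shuffled or seeded.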

cubic-dev-ai · 2026-06-18T09:31:10Z

+        let mut weighted_loss = 0.0;
+        let mut seen = 0usize;
+
+        for batch in order.chunks(config.batch_size) {


P1: Missing guard for batch_size == 0 in training loop — panics at runtime if zero.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/train.rs, line 65: <comment>Missing guard for `batch_size == 0` in training loop — panics at runtime if zero.</comment> <file context> @@ -0,0 +1,227 @@ + let mut weighted_loss = 0.0; + let mut seen = 0usize; + + for batch in order.chunks(config.batch_size) { + let (input, labels) = gather(dataset, batch)?; + let loss = model.train_step(&input, &labels, &mut optimizer); </file context>

Suggested change

-        for batch in order.chunks(config.batch_size) {
+        for batch in order.chunks(config.batch_size.max(1)) {
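
An alternative to clamping with `.max(1)` is to reject a zero batch size explicitly, so a misconfiguration is reported rather than quietly turned into a batch size of 1. The sketch below assumes a trimmed-down `TrainConfig` and a `String` error; both are stand-ins, not the PR's types.

```rust
/// Hypothetical subset of the training config; only `batch_size` matters here.
pub struct TrainConfig {
    pub batch_size: usize,
}

/// Split `order` into batches, rejecting a zero batch size up front.
/// `slice::chunks` panics when the chunk size is 0, so the check comes first.
pub fn batches<'a>(order: &'a [usize], config: &TrainConfig) -> Result<Vec<&'a [usize]>, String> {
    if config.batch_size == 0 {
        return Err("batch_size must be > 0".to_string());
    }
    Ok(order.chunks(config.batch_size).collect())
}
```

Which behaviour is preferable depends on whether a zero batch size should be treated as a user error or tolerated.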

cubic-dev-ai · 2026-06-18T09:31:10Z

+    cfg: FrameConfig,
+    cache_dir: &Path,
+) -> Result<Vec<f32>, VideoError> {
+    let frame_dir = cache_dir.join(&sample.id);


P1: Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/frames.rs, line 31: <comment>Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.</comment> <file context> @@ -0,0 +1,226 @@ + cfg: FrameConfig, + cache_dir: &Path, +) -> Result<Vec<f32>, VideoError> { + let frame_dir = cache_dir.join(&sample.id); + let mut frames = existing_frames(&frame_dir); + if frames.is_empty() { </file context>

cubic-dev-ai · 2026-06-18T09:31:10Z

+
+/// Reinterpret a matrix's contiguous data under a new (rows, cols) shape.
+fn reshape(m: Matrix, rows: usize, cols: usize) -> Matrix {
+    Matrix::from_vec(rows, cols, m.data().to_vec()).unwrap_or_else(|_| Matrix::zeros(rows, cols))


P1: reshape silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/model.rs, line 203: <comment>`reshape` silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.</comment> <file context> @@ -0,0 +1,265 @@ + +/// Reinterpret a matrix's contiguous data under a new (rows, cols) shape. +fn reshape(m: Matrix, rows: usize, cols: usize) -> Matrix { + Matrix::from_vec(rows, cols, m.data().to_vec()).unwrap_or_else(|_| Matrix::zeros(rows, cols)) +} + </file context>
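
A sketch of a `reshape` that propagates the mismatch instead of substituting zeros. The `Matrix` type here is a minimal stand-in written for this example; the crate's real `Matrix` API and error type may differ.

```rust
/// Minimal stand-in for the crate's `Matrix`; only what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Matrix {
    /// Build a matrix, failing when `rows * cols` does not equal `data.len()`.
    pub fn from_vec(rows: usize, cols: usize, data: Vec<f32>) -> Result<Self, String> {
        if rows.checked_mul(cols) != Some(data.len()) {
            return Err(format!(
                "shape {}x{} does not match {} values",
                rows,
                cols,
                data.len()
            ));
        }
        Ok(Self { rows, cols, data })
    }

    pub fn into_data(self) -> Vec<f32> {
        self.data
    }

    pub fn shape(&self) -> (usize, usize) {
        (self.rows, self.cols)
    }
}

/// Reinterpret a matrix's contiguous data under a new (rows, cols) shape,
/// returning the shape error to the caller rather than an all-zero matrix.
pub fn reshape(m: Matrix, rows: usize, cols: usize) -> Result<Matrix, String> {
    Matrix::from_vec(rows, cols, m.into_data())
}
```

Callers then need `?` or an explicit match, which makes every shape assumption visible at the call site.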

cubic-dev-ai · 2026-06-18T09:31:10Z

+
+    let loaded: Vec<Vec<RgbImage>> = samples
+        .par_iter()
+        .filter_map(|sample| clip_to_frames(sample, frames_cfg, cache_dir).ok())


P1: Clip processing errors are silently discarded via .ok(), causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/prototype.rs, line 69: <comment>Clip processing errors are silently discarded via `.ok()`, causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.</comment> <file context> @@ -0,0 +1,177 @@ + + let loaded: Vec<Vec<RgbImage>> = samples + .par_iter() + .filter_map(|sample| clip_to_frames(sample, frames_cfg, cache_dir).ok()) + .collect(); + if loaded.is_empty() { </file context>
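
One way to keep failures visible, sketched with hypothetical helpers (`partition_clips`, `report_failures`) and generic types in place of `Vec<RgbImage>` and `VideoError`: collect each clip's `Result` alongside its id, split successes from failures, and report the failures before averaging.

```rust
/// Split per-clip results into loaded clips and (id, error) failures so the
/// caller can report or reject failures instead of dropping them.
pub fn partition_clips<T, E>(results: Vec<(String, Result<T, E>)>) -> (Vec<T>, Vec<(String, E)>) {
    let mut loaded = Vec::new();
    let mut failed = Vec::new();
    for (id, result) in results {
        match result {
            Ok(frames) => loaded.push(frames),
            Err(err) => failed.push((id, err)),
        }
    }
    (loaded, failed)
}

/// Print one warning per failed clip plus a summary line on stderr.
pub fn report_failures<E: std::fmt::Display>(failed: &[(String, E)], total: usize) {
    for (id, err) in failed {
        eprintln!("warning: skipping clip {}: {}", id, err);
    }
    if !failed.is_empty() {
        eprintln!(
            "warning: {} of {} clips failed and are excluded from the average",
            failed.len(),
            total
        );
    }
}
```

Whether to warn and continue or to abort when any clip fails is a policy choice the review leaves open.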

cubic-dev-ai · 2026-06-18T09:31:10Z

+        seed: u64,
+    ) -> Self {
+        let num_frames = num_frames.max(1);
+        let patches_per_frame = (tokens_per_clip / num_frames).max(1);


P2: Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if tokens_per_clip is not evenly divisible by num_frames.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/model.rs, line 43: <comment>Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if `tokens_per_clip` is not evenly divisible by `num_frames`.</comment> <file context> @@ -0,0 +1,265 @@ + seed: u64, + ) -> Self { + let num_frames = num_frames.max(1); + let patches_per_frame = (tokens_per_clip / num_frames).max(1); + let mut rng = StdRng::seed_from_u64(seed); + Self { </file context>
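
A minimal sketch of the divisibility check, assuming the constructor could return a `Result`. The free function `patches_per_frame` and its `String` error are illustrative only.

```rust
/// Derive patches per frame, rejecting shapes that floor division would hide.
pub fn patches_per_frame(tokens_per_clip: usize, num_frames: usize) -> Result<usize, String> {
    if num_frames == 0 {
        return Err("num_frames must be > 0".to_string());
    }
    // A remainder means the token buffer cannot be split evenly across frames.
    if tokens_per_clip == 0 || tokens_per_clip % num_frames != 0 {
        return Err(format!(
            "tokens_per_clip ({}) must be a positive multiple of num_frames ({})",
            tokens_per_clip, num_frames
        ));
    }
    Ok(tokens_per_clip / num_frames)
}
```

Unlike the `.max(1)` fallbacks in the quoted code, this surfaces an inconsistent shape at construction time.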

cubic-dev-ai · 2026-06-18T09:31:10Z

+    upscale: u32,
+    fps: u32,
+) -> Result<PrototypeReport, VideoError> {
+    if frames_cfg.num_frames == 0 || frames_cfg.frame_size == 0 {


P2: upscale and fps are not validated before use, despite similar parameters (num_frames, frame_size) being validated.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/prototype.rs, line 35: <comment>`upscale` and `fps` are not validated before use, despite similar parameters (`num_frames`, `frame_size`) being validated.</comment> <file context> @@ -0,0 +1,177 @@ + upscale: u32, + fps: u32, +) -> Result<PrototypeReport, VideoError> { + if frames_cfg.num_frames == 0 || frames_cfg.frame_size == 0 { + return Err(VideoError::Config( + "num_frames and frame_size must be > 0".into(), </file context>
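
The check could be extended along these lines. This sketch assumes that zero is the invalid value for `upscale` and `fps`, which the review does not state; the function name and `String` error are also hypothetical.

```rust
/// Check all four numeric settings in one place; returns the first violation.
/// Assumption: zero is the only invalid value for each setting.
pub fn validate_prototype_args(
    num_frames: usize,
    frame_size: u32,
    upscale: u32,
    fps: u32,
) -> Result<(), String> {
    let checks = [
        ("num_frames", num_frames as u64),
        ("frame_size", u64::from(frame_size)),
        ("upscale", u64::from(upscale)),
        ("fps", u64::from(fps)),
    ];
    for &(name, value) in checks.iter() {
        if value == 0 {
            return Err(format!("{} must be > 0", name));
        }
    }
    Ok(())
}
```

Upper bounds (for example a maximum sensible `fps`) may also be worth checking, but no limits are given in the review.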

cubic-dev-ai · 2026-06-18T09:31:10Z

+        if !is_video_file(path) {
+            continue;
+        }
+        let Some(id) = id_from_filename(path) else {


P2: Manifest discovery is overly permissive and can include unintended videos from the data root.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/manifest.rs, line 56: <comment>Manifest discovery is overly permissive and can include unintended videos from the data root.</comment> <file context> @@ -0,0 +1,221 @@ + if !is_video_file(path) { + continue; + } + let Some(id) = id_from_filename(path) else { + continue; + }; </file context>

cubic-dev-ai · 2026-06-18T09:31:10Z

+            let Some(id) = json_id(video.get("id")) else {
+                continue;
+            };
+            index.insert(


P2: Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/manifest.rs, line 115: <comment>Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels</comment> <file context> @@ -0,0 +1,221 @@ + let Some(id) = json_id(video.get("id")) else { + continue; + }; + index.insert( + id, + MetaEntry { </file context>
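
A sketch of duplicate detection using the standard library's `HashMap` entry API. The trimmed `MetaEntry`, the `build_index` function, and the choice to fail on the first duplicate are all assumptions; logging and keeping the first record would be another reasonable policy.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical, trimmed-down metadata record.
#[derive(Debug, Clone, PartialEq)]
pub struct MetaEntry {
    pub creator: String,
}

/// Build the id -> metadata index, failing on the first repeated id
/// instead of letting the later record overwrite the earlier one.
pub fn build_index(records: Vec<(String, MetaEntry)>) -> Result<HashMap<String, MetaEntry>, String> {
    let mut index = HashMap::with_capacity(records.len());
    for (id, entry) in records {
        match index.entry(id) {
            Entry::Vacant(slot) => {
                slot.insert(entry);
            }
            Entry::Occupied(existing) => {
                return Err(format!("duplicate metadata id: {}", existing.key()));
            }
        }
    }
    Ok(index)
}
```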

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de66d158bb


chatgpt-codex-connector · 2026-06-18T09:31:36Z

+    let frame_dir = cache_dir.join(&sample.id);
+    let mut frames = existing_frames(&frame_dir);
+    if frames.is_empty() {
+        frames = extract_frames(sample, cfg, &frame_dir)?;


Make frame caches config-specific

When a clip has any cached JPGs, extraction is skipped regardless of the requested FrameConfig. If a user first runs with fewer frames (for example --frames 8) and later reruns with a larger --frames value using the default cache, pick_indices repeats/selects only the stale cached frames instead of resampling the video at the requested temporal resolution, silently training or averaging on the wrong data. Include the sampling config in the cache key, or at least require the cache to satisfy the requested frame count before reusing it.
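
Including the sampling config in the cache key could look like the sketch below. The `FrameConfig` here is a stand-in with only the two fields named in the review (`num_frames`, `frame_size`); the real struct may have more, and every field that affects extraction would need to be part of the key. The directory naming scheme is invented for illustration.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical mirror of the PR's `FrameConfig`, reduced to two fields.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FrameConfig {
    pub num_frames: usize,
    pub frame_size: u32,
}

/// Build a per-clip cache directory whose path encodes the sampling config,
/// so frames extracted with different settings never share a directory.
pub fn frame_cache_dir(cache_dir: &Path, sample_id: &str, cfg: FrameConfig) -> PathBuf {
    cache_dir
        .join(format!("f{}_s{}", cfg.num_frames, cfg.frame_size))
        .join(sample_id)
}
```

The trade-off is extra disk usage, since each distinct config keeps its own copy of the extracted frames.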


chatgpt-codex-connector · 2026-06-18T09:31:36Z

+    let work = out
+        .parent()
+        .filter(|p| !p.as_os_str().is_empty())
+        .unwrap_or_else(|| Path::new("."))
+        .join(".oxidize-prototype");


Clear prototype work frames before encoding

Every prototype run writes into the same .oxidize-prototype directory without removing old mean_*.jpg files. If a later run in the same output directory uses fewer frames, higher-numbered images from the previous run remain and the mean_%03d.jpg ffmpeg sequence will include those stale frames, so the output mp4 can mix frames from a different dataset or config. Use a fresh temp directory or delete the old sequence before writing new frames.
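
Deleting the old sequence before writing new frames can be done with the standard library alone. The helper name `fresh_work_dir` is hypothetical, and this sketch assumes the work directory holds nothing worth keeping between runs.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove any previous contents of `work` and recreate it empty.
/// A missing directory is not an error: there is simply nothing to clear.
pub fn fresh_work_dir(work: &Path) -> io::Result<()> {
    match fs::remove_dir_all(work) {
        Ok(()) => {}
        Err(err) if err.kind() == io::ErrorKind::NotFound => {}
        Err(err) => return Err(err),
    }
    fs::create_dir_all(work)
}
```

A unique temporary directory per run would avoid the deletion entirely and also protect concurrent runs from each other.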


chatgpt-codex-connector · 2026-06-18T09:31:36Z

+    for (index, &label) in labels.iter().enumerate() {
+        per_class[label].push(index);
+    }
+    let min_count = per_class.iter().map(Vec::len).min().unwrap_or(0);


Reject empty buckets before balancing

For virality/engagement tasks, bucketize creates one class name per requested bucket even when fewer clips survive, for example with --max-videos 2 --task virality --balance and the default 3 buckets. One class then has zero labels, so including it in min_count makes the balanced selection empty, and the loader returns a zero-clip dataset that trains and saves a meaningless model instead of failing. Drop empty buckets or error before balancing.
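
The "drop empty buckets" option might be sketched as follows. The function `drop_empty_classes` is hypothetical and assumes every label is a valid index into `class_names`; it removes unused classes and renumbers the remaining labels densely so that balancing only sees populated classes.

```rust
/// Remove classes with no samples and renumber the remaining labels densely.
/// Returns the remapped labels and the surviving class names.
/// Assumption: every label is a valid index into `class_names`.
pub fn drop_empty_classes(labels: &[usize], class_names: &[String]) -> (Vec<usize>, Vec<String>) {
    // Count how many samples each class has.
    let mut counts = vec![0usize; class_names.len()];
    for &label in labels {
        counts[label] += 1;
    }
    // Assign a new dense index to every class that has at least one sample.
    let mut remap = vec![None; class_names.len()];
    let mut kept = Vec::new();
    for (old, &count) in counts.iter().enumerate() {
        if count > 0 {
            remap[old] = Some(kept.len());
            kept.push(class_names[old].clone());
        }
    }
    let new_labels = labels
        .iter()
        .map(|&label| remap[label].expect("every present label has count > 0"))
        .collect();
    (new_labels, kept)
}
```

Dropping buckets changes the number of output classes, so the model's head size and the saved class names would need to follow the returned list.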

