Add CPU video training and prototype generation to oxidize-train #18

Open: Jackson57279 wants to merge 1 commit into master from feat/oxidize-train-video

Jackson57279 (Contributor) commented Jun 18, 2026

Summary

  • Adds a pure-Rust CPU video pipeline to oxidize-train for TikTok-style clip datasets (metadata JSON + mp4 files).
  • Introduces oxidize-train video for training a patch-embedding classifier with creator/virality/engagement labels and optional class balancing.
  • Adds oxidize-train prototype to render a smoothed base clip by averaging frames across selected creators (supports --exclude cellow111).

Test plan

  • cargo test -p oxidize-train
  • cargo clippy -p oxidize-train --all-targets -- -D warnings
  • Local smoke run: oxidize-train video --data ~/tt-downloader/videos --task creator --max-videos 60
  • Local prototype run: oxidize-train prototype --data ~/tt-downloader/videos --exclude cellow111 --out ~/tt-downloader/oxidize-base-video.mp4

Made with Cursor


Summary by cubic

Adds CPU short‑video training and prototype generation to oxidize-train, including new video and prototype subcommands, frame caching, and JSON model export. Trains a small patch‑embedding classifier on TikTok‑style datasets (mp4 + *_metadata.json).

  • New Features

    • oxidize-train video trains a clip classifier on creator, virality, or engagement labels (quantile buckets).
    • Deterministic split, optional class balancing, --max-videos cap, and frame caching in <data>/.oxidize-frames.
    • Pure‑Rust data path with ffmpeg for frame extraction; patch‑embed + temporal mean‑pool + 2‑layer MLP head (CPU).
    • Saves model and metadata to JSON (default: oxidize-video-<task>.json under the data root) and prints train/val metrics with a majority baseline.
    • oxidize-train prototype renders a smoothed “base” clip by averaging frames across creators, supports --exclude, and writes mp4 + optional contact‑sheet PNG.
    • New CLI structure: csv, video, and prototype subcommands.
  • Migration

    • Install ffmpeg on PATH for frame extraction and prototype encoding.
    • Example:
      • oxidize-train video --data ~/tt-downloader/videos --task creator --max-videos 60
      • oxidize-train prototype --data ~/tt-downloader/videos --exclude cellow111 --out ~/tt-downloader/oxidize-base-video.mp4

Written for commit de66d15. Summary will update on new commits.

Introduce a video pipeline for TikTok-style clip datasets with ffmpeg frame extraction, a trainable patch-embedding classifier, and a prototype subcommand that renders averaged base clips while excluding selected creators.

Co-authored-by: Cursor <cursoragent@cursor.com>

cubic-dev-ai (bot) left a comment

9 issues found across 13 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="oxidize-train/src/video/dataset.rs">

<violation number="1" location="oxidize-train/src/video/dataset.rs:162">
P1: Balancing can silently produce an empty dataset when `bucketize` creates empty classes (e.g. buckets > samples). `balanced_indices` takes the minimum count across all classes, so any empty class causes `min_count == 0` and an empty selection. The code then builds and returns an empty `VideoDataset` without error.</violation>
</file>

<file name="oxidize-train/src/video/train.rs">

<violation number="1" location="oxidize-train/src/video/train.rs:65">
P1: Missing guard for `batch_size == 0` in training loop — panics at runtime if zero.</violation>
</file>

<file name="oxidize-train/src/video/frames.rs">

<violation number="1" location="oxidize-train/src/video/frames.rs:31">
P1: Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.</violation>
</file>

<file name="oxidize-train/src/video/model.rs">

<violation number="1" location="oxidize-train/src/video/model.rs:43">
P2: Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if `tokens_per_clip` is not evenly divisible by `num_frames`.</violation>

<violation number="2" location="oxidize-train/src/video/model.rs:203">
P1: `reshape` silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.</violation>
</file>

<file name="oxidize-train/src/video/prototype.rs">

<violation number="1" location="oxidize-train/src/video/prototype.rs:35">
P2: `upscale` and `fps` are not validated before use, despite similar parameters (`num_frames`, `frame_size`) being validated.</violation>

<violation number="2" location="oxidize-train/src/video/prototype.rs:69">
P1: Clip processing errors are silently discarded via `.ok()`, causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.</violation>
</file>

<file name="oxidize-train/src/video/manifest.rs">

<violation number="1" location="oxidize-train/src/video/manifest.rs:56">
P2: Manifest discovery is overly permissive and can include unintended videos from the data root.</violation>

<violation number="2" location="oxidize-train/src/video/manifest.rs:115">
P2: Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels.</violation>
</file>

(0..labels.len()).collect()
};

let mut data = Vec::with_capacity(selection.len() * span);

P1: Balancing can silently produce an empty dataset when bucketize creates empty classes (e.g. buckets > samples). balanced_indices takes the minimum count across all classes, so any empty class causes min_count == 0 and an empty selection. The code then builds and returns an empty VideoDataset without error.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/dataset.rs, line 162:

<comment>Balancing can silently produce an empty dataset when `bucketize` creates empty classes (e.g. buckets > samples). `balanced_indices` takes the minimum count across all classes, so any empty class causes `min_count == 0` and an empty selection. The code then builds and returns an empty `VideoDataset` without error.</comment>

<file context>
@@ -0,0 +1,316 @@
+        (0..labels.len()).collect()
+    };
+
+    let mut data = Vec::with_capacity(selection.len() * span);
+    let mut final_labels = Vec::with_capacity(selection.len());
+    let mut final_samples = Vec::with_capacity(selection.len());
</file context>
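
One way to address this finding is to make balancing fail loudly when any class is empty. The sketch below is a minimal illustration, not the PR's code: `BalanceError`, the `num_classes` parameter, and the exact signature of `balanced_indices` are assumptions, and the real crate would presumably return its own `VideoError`.

```rust
/// Hypothetical error type; the real crate has its own `VideoError`.
#[derive(Debug, PartialEq)]
pub enum BalanceError {
    /// A class in `0..num_classes` has no samples.
    EmptyClass(usize),
    /// A label falls outside `0..num_classes`.
    LabelOutOfRange { label: usize, num_classes: usize },
}

/// Pick the same number of sample indices from every class.
/// Takes the first `min_count` indices of each class, so the result is deterministic.
pub fn balanced_indices(
    labels: &[usize],
    num_classes: usize,
) -> Result<Vec<usize>, BalanceError> {
    let mut per_class: Vec<Vec<usize>> = vec![Vec::new(); num_classes];
    for (index, &label) in labels.iter().enumerate() {
        let bucket = per_class
            .get_mut(label)
            .ok_or(BalanceError::LabelOutOfRange { label, num_classes })?;
        bucket.push(index);
    }
    // Fail loudly instead of letting `min_count` collapse to zero.
    if let Some(empty) = per_class.iter().position(Vec::is_empty) {
        return Err(BalanceError::EmptyClass(empty));
    }
    let min_count = per_class.iter().map(Vec::len).min().unwrap_or(0);
    let mut selection: Vec<usize> = per_class
        .iter()
        .flat_map(|indices| indices.iter().copied().take(min_count))
        .collect();
    selection.sort_unstable();
    Ok(selection)
}
```

Dropping empty buckets before balancing would be an equally reasonable policy; returning an error simply makes the situation visible to the caller.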

let mut weighted_loss = 0.0;
let mut seen = 0usize;

for batch in order.chunks(config.batch_size) {

P1: Missing guard for batch_size == 0 in training loop — panics at runtime if zero.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/train.rs, line 65:

<comment>Missing guard for `batch_size == 0` in training loop — panics at runtime if zero.</comment>

<file context>
@@ -0,0 +1,227 @@
+        let mut weighted_loss = 0.0;
+        let mut seen = 0usize;
+
+        for batch in order.chunks(config.batch_size) {
+            let (input, labels) = gather(dataset, batch)?;
+            let loss = model.train_step(&input, &labels, &mut optimizer);
</file context>
Suggested change
- for batch in order.chunks(config.batch_size) {
+ for batch in order.chunks(config.batch_size.max(1)) {
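
The suggested `.max(1)` avoids the panic but quietly trains with a batch size of 1. An alternative, sketched here under assumptions, is to reject the value before the loop starts. `TrainConfig` and `ConfigError` are hypothetical stand-ins for the crate's own types; the only library fact relied on is that `slice::chunks` panics when the chunk size is 0.

```rust
/// Hypothetical error type standing in for the crate's own.
#[derive(Debug, PartialEq)]
pub struct ConfigError(pub String);

/// Hypothetical subset of the training configuration.
pub struct TrainConfig {
    pub batch_size: usize,
}

/// Split a shuffled index order into batches, rejecting a zero batch size
/// up front because `slice::chunks(0)` panics.
pub fn batches<'a>(
    order: &'a [usize],
    config: &TrainConfig,
) -> Result<Vec<&'a [usize]>, ConfigError> {
    if config.batch_size == 0 {
        return Err(ConfigError("batch_size must be > 0".to_string()));
    }
    Ok(order.chunks(config.batch_size).collect())
}
```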

cfg: FrameConfig,
cache_dir: &Path,
) -> Result<Vec<f32>, VideoError> {
let frame_dir = cache_dir.join(&sample.id);

P1: Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/frames.rs, line 31:

<comment>Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.</comment>

<file context>
@@ -0,0 +1,226 @@
+    cfg: FrameConfig,
+    cache_dir: &Path,
+) -> Result<Vec<f32>, VideoError> {
+    let frame_dir = cache_dir.join(&sample.id);
+    let mut frames = existing_frames(&frame_dir);
+    if frames.is_empty() {
</file context>


/// Reinterpret a matrix's contiguous data under a new (rows, cols) shape.
fn reshape(m: Matrix, rows: usize, cols: usize) -> Matrix {
Matrix::from_vec(rows, cols, m.data().to_vec()).unwrap_or_else(|_| Matrix::zeros(rows, cols))

P1: reshape silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/model.rs, line 203:

<comment>`reshape` silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.</comment>

<file context>
@@ -0,0 +1,265 @@
+
+/// Reinterpret a matrix's contiguous data under a new (rows, cols) shape.
+fn reshape(m: Matrix, rows: usize, cols: usize) -> Matrix {
+    Matrix::from_vec(rows, cols, m.data().to_vec()).unwrap_or_else(|_| Matrix::zeros(rows, cols))
+}
+
</file context>
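
A fallible `reshape` is one possible fix: propagate the shape mismatch to the caller instead of substituting zeros. The `Matrix` below is a minimal stand-in written for this sketch, and `ShapeError` is a hypothetical error type; the crate's real `Matrix::from_vec` is only assumed to return a `Result`, as the `unwrap_or_else` in the snippet above suggests.

```rust
/// Minimal stand-in for the crate's `Matrix`; only what the sketch needs.
#[derive(Debug, PartialEq)]
pub struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

/// Hypothetical error describing an element-count mismatch.
#[derive(Debug, PartialEq)]
pub struct ShapeError {
    pub expected: usize,
    pub actual: usize,
}

impl Matrix {
    pub fn from_vec(rows: usize, cols: usize, data: Vec<f32>) -> Result<Self, ShapeError> {
        let expected = rows * cols;
        if data.len() != expected {
            return Err(ShapeError { expected, actual: data.len() });
        }
        Ok(Self { rows, cols, data })
    }

    pub fn shape(&self) -> (usize, usize) {
        (self.rows, self.cols)
    }
}

/// Reinterpret a matrix's contiguous data under a new (rows, cols) shape,
/// returning the mismatch as an error rather than an all-zero matrix.
pub fn reshape(m: Matrix, rows: usize, cols: usize) -> Result<Matrix, ShapeError> {
    Matrix::from_vec(rows, cols, m.data)
}
```

Callers would then use `?` at each reshape site, so a wrong shape stops training at the first bad batch.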


let loaded: Vec<Vec<RgbImage>> = samples
.par_iter()
.filter_map(|sample| clip_to_frames(sample, frames_cfg, cache_dir).ok())

P1: Clip processing errors are silently discarded via .ok(), causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/prototype.rs, line 69:

<comment>Clip processing errors are silently discarded via `.ok()`, causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.</comment>

<file context>
@@ -0,0 +1,177 @@
+
+    let loaded: Vec<Vec<RgbImage>> = samples
+        .par_iter()
+        .filter_map(|sample| clip_to_frames(sample, frames_cfg, cache_dir).ok())
+        .collect();
+    if loaded.is_empty() {
</file context>
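
To keep failures visible, the loader could collect errors alongside successes and let the caller decide whether to warn or abort. The sketch below is sequential and generic so that it stays self-contained; `LoadReport` and `load_all` are hypothetical names, and `load` stands in for a function such as `clip_to_frames`. The PR's parallel iterator is not reproduced here.

```rust
/// Outcome of loading every selected clip: successes plus per-clip failures.
pub struct LoadReport<T, E> {
    pub loaded: Vec<T>,
    pub failed: Vec<(String, E)>,
}

/// Run `load` on each sample ID and keep the errors instead of dropping them.
pub fn load_all<T, E>(
    ids: &[&str],
    load: impl Fn(&str) -> Result<T, E>,
) -> LoadReport<T, E> {
    let mut report = LoadReport {
        loaded: Vec::new(),
        failed: Vec::new(),
    };
    for &id in ids {
        match load(id) {
            Ok(frames) => report.loaded.push(frames),
            Err(err) => report.failed.push((id.to_string(), err)),
        }
    }
    report
}
```

With a report like this, the prototype command could print how many clips were skipped and why, which keeps the "N clips averaged" message honest.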

seed: u64,
) -> Self {
let num_frames = num_frames.max(1);
let patches_per_frame = (tokens_per_clip / num_frames).max(1);

P2: Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if tokens_per_clip is not evenly divisible by num_frames.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/model.rs, line 43:

<comment>Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if `tokens_per_clip` is not evenly divisible by `num_frames`.</comment>

<file context>
@@ -0,0 +1,265 @@
+        seed: u64,
+    ) -> Self {
+        let num_frames = num_frames.max(1);
+        let patches_per_frame = (tokens_per_clip / num_frames).max(1);
+        let mut rng = StdRng::seed_from_u64(seed);
+        Self {
</file context>
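
The divisibility contract can be checked explicitly before the model is built. The following is a sketch under assumptions: `TokenShapeError` and the free function `patches_per_frame` are hypothetical, and the real constructor returns `Self`, so adopting this would mean changing it to return a `Result`.

```rust
/// Hypothetical error type for an invalid token layout.
#[derive(Debug, PartialEq)]
pub struct TokenShapeError(pub String);

/// Derive patches per frame, refusing inputs that floor division would hide.
pub fn patches_per_frame(
    tokens_per_clip: usize,
    num_frames: usize,
) -> Result<usize, TokenShapeError> {
    if num_frames == 0 || tokens_per_clip == 0 {
        return Err(TokenShapeError(
            "tokens_per_clip and num_frames must be > 0".to_string(),
        ));
    }
    if tokens_per_clip % num_frames != 0 {
        return Err(TokenShapeError(format!(
            "tokens_per_clip ({}) is not divisible by num_frames ({})",
            tokens_per_clip, num_frames
        )));
    }
    Ok(tokens_per_clip / num_frames)
}
```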

upscale: u32,
fps: u32,
) -> Result<PrototypeReport, VideoError> {
if frames_cfg.num_frames == 0 || frames_cfg.frame_size == 0 {

P2: upscale and fps are not validated before use, despite similar parameters (num_frames, frame_size) being validated.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/prototype.rs, line 35:

<comment>`upscale` and `fps` are not validated before use, despite similar parameters (`num_frames`, `frame_size`) being validated.</comment>

<file context>
@@ -0,0 +1,177 @@
+    upscale: u32,
+    fps: u32,
+) -> Result<PrototypeReport, VideoError> {
+    if frames_cfg.num_frames == 0 || frames_cfg.frame_size == 0 {
+        return Err(VideoError::Config(
+            "num_frames and frame_size must be > 0".into(),
</file context>
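
Extending the existing guard to cover `upscale` and `fps` might look like the sketch below. The `VideoError` enum here is a one-variant stand-in modeled on the `VideoError::Config` call in the snippet above, and `validate_prototype_args` is a hypothetical helper; whether zero is the only invalid value for these parameters is an assumption.

```rust
/// Stand-in for the crate's error type; only the `Config` variant is modeled.
#[derive(Debug, PartialEq)]
pub enum VideoError {
    Config(String),
}

/// Reject any zero-valued parameter, naming the offending one in the error.
pub fn validate_prototype_args(
    num_frames: usize,
    frame_size: u32,
    upscale: u32,
    fps: u32,
) -> Result<(), VideoError> {
    let checks = [
        ("num_frames", num_frames == 0),
        ("frame_size", frame_size == 0),
        ("upscale", upscale == 0),
        ("fps", fps == 0),
    ];
    for (name, is_zero) in checks {
        if is_zero {
            return Err(VideoError::Config(format!("{} must be > 0", name)));
        }
    }
    Ok(())
}
```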

if !is_video_file(path) {
continue;
}
let Some(id) = id_from_filename(path) else {

P2: Manifest discovery is overly permissive and can include unintended videos from the data root.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/manifest.rs, line 56:

<comment>Manifest discovery is overly permissive and can include unintended videos from the data root.</comment>

<file context>
@@ -0,0 +1,221 @@
+        if !is_video_file(path) {
+            continue;
+        }
+        let Some(id) = id_from_filename(path) else {
+            continue;
+        };
</file context>

let Some(id) = json_id(video.get("id")) else {
continue;
};
index.insert(

P2: Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/manifest.rs, line 115:

<comment>Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels.</comment>

<file context>
@@ -0,0 +1,221 @@
+            let Some(id) = json_id(video.get("id")) else {
+                continue;
+            };
+            index.insert(
+                id,
+                MetaEntry {
</file context>
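
Duplicate IDs can be detected with the standard `HashMap` entry API instead of a plain `insert`. This is a minimal sketch: `build_index` and `DuplicateId` are hypothetical names, and the value type is left generic because the fields of `MetaEntry` are not shown in full above.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical error carrying the repeated ID.
#[derive(Debug, PartialEq)]
pub struct DuplicateId(pub String);

/// Insert metadata entries, returning an error on the first repeated ID
/// rather than letting the later entry overwrite the earlier one.
pub fn build_index<V>(entries: Vec<(String, V)>) -> Result<HashMap<String, V>, DuplicateId> {
    let mut index = HashMap::with_capacity(entries.len());
    for (id, meta) in entries {
        match index.entry(id) {
            Entry::Occupied(slot) => return Err(DuplicateId(slot.key().clone())),
            Entry::Vacant(slot) => {
                slot.insert(meta);
            }
        }
    }
    Ok(index)
}
```

If duplicates are expected in real datasets, logging a warning and keeping the first entry would be a softer policy with the same determinism benefit.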

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de66d158bb

Comment on lines +31 to +34
let frame_dir = cache_dir.join(&sample.id);
let mut frames = existing_frames(&frame_dir);
if frames.is_empty() {
frames = extract_frames(sample, cfg, &frame_dir)?;

P2: Make frame caches config-specific

When a clip has any cached JPGs, extraction is skipped regardless of the requested FrameConfig. If a user first runs with fewer frames (for example --frames 8) and later reruns with a larger --frames value using the default cache, pick_indices repeats/selects only the stale cached frames instead of resampling the video at the requested temporal resolution, silently training or averaging on the wrong data. Include the sampling config in the cache key, or at least require the cache to satisfy the requested frame count before reusing it.
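
A minimal sketch of the first option, putting the sampling config into the cache key, could look like this. The two-field `FrameConfig` and the helper `frame_cache_dir` are assumptions made for illustration; the real `FrameConfig` may carry more fields, and every field that affects extraction would need to appear in the key.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical subset of the sampling settings.
#[derive(Clone, Copy)]
pub struct FrameConfig {
    pub num_frames: usize,
    pub frame_size: u32,
}

/// Build a per-clip cache directory whose path encodes the sampling config,
/// so frames extracted under different settings never share a directory.
pub fn frame_cache_dir(cache_dir: &Path, sample_id: &str, cfg: FrameConfig) -> PathBuf {
    cache_dir
        .join(format!("f{}_s{}", cfg.num_frames, cfg.frame_size))
        .join(sample_id)
}
```

The trade-off is disk usage: each distinct config gets its own copy of the frames, so old config directories may need occasional cleanup.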

Comment on lines +99 to +103
let work = out
.parent()
.filter(|p| !p.as_os_str().is_empty())
.unwrap_or_else(|| Path::new("."))
.join(".oxidize-prototype");

P2: Clear prototype work frames before encoding

Every prototype run writes into the same .oxidize-prototype directory without removing old mean_*.jpg files. If a later run in the same output directory uses fewer frames, higher-numbered images from the previous run remain and the mean_%03d.jpg ffmpeg sequence will include those stale frames, so the output mp4 can mix frames from a different dataset or config. Use a fresh temp directory or delete the old sequence before writing new frames.
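
One simple way to guarantee a clean sequence is to recreate the work directory at the start of each run. The helper name `fresh_work_dir` is hypothetical, and the sketch assumes the directory is dedicated to prototype frames (as `.oxidize-prototype` appears to be), since everything inside it is deleted.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recreate the work directory so no frames from an earlier run survive.
/// Deletes everything under `work`; only use on a dedicated scratch directory.
pub fn fresh_work_dir(work: &Path) -> io::Result<()> {
    match fs::remove_dir_all(work) {
        Ok(()) => {}
        // A missing directory is fine: there is nothing stale to remove.
        Err(err) if err.kind() == io::ErrorKind::NotFound => {}
        Err(err) => return Err(err),
    }
    fs::create_dir_all(work)
}
```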

for (index, &label) in labels.iter().enumerate() {
per_class[label].push(index);
}
let min_count = per_class.iter().map(Vec::len).min().unwrap_or(0);

P2: Reject empty buckets before balancing

For virality/engagement tasks, bucketize creates one class name per requested bucket even when fewer clips survive, for example --max-videos 2 --task virality --balance with the default 3 buckets. One class then has zero labels, so including it in min_count makes the balanced selection empty, and the loader returns a zero-clip dataset that trains and saves a meaningless model instead of failing. Drop empty buckets or return an error before balancing.
