Add CPU video training and prototype generation to oxidize-train #18
Jackson57279 wants to merge 1 commit into
Introduce a video pipeline for TikTok-style clip datasets with ffmpeg frame extraction, a trainable patch-embedding classifier, and a prototype subcommand that renders averaged base clips while excluding selected creators.

Co-authored-by: Cursor <cursoragent@cursor.com>
9 issues found across 13 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="oxidize-train/src/video/dataset.rs">
<violation number="1" location="oxidize-train/src/video/dataset.rs:162">
P1: Balancing can silently produce an empty dataset when `bucketize` creates empty classes (e.g. buckets > samples). `balanced_indices` takes the minimum count across all classes, so any empty class causes `min_count == 0` and an empty selection. The code then builds and returns an empty `VideoDataset` without error.</violation>
</file>
<file name="oxidize-train/src/video/train.rs">
<violation number="1" location="oxidize-train/src/video/train.rs:65">
P1: Missing guard for `batch_size == 0` in training loop — panics at runtime if zero.</violation>
</file>
<file name="oxidize-train/src/video/frames.rs">
<violation number="1" location="oxidize-train/src/video/frames.rs:31">
P1: Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.</violation>
</file>
<file name="oxidize-train/src/video/model.rs">
<violation number="1" location="oxidize-train/src/video/model.rs:43">
P2: Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if `tokens_per_clip` is not evenly divisible by `num_frames`.</violation>
<violation number="2" location="oxidize-train/src/video/model.rs:203">
P1: `reshape` silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.</violation>
</file>
<file name="oxidize-train/src/video/prototype.rs">
<violation number="1" location="oxidize-train/src/video/prototype.rs:35">
P2: `upscale` and `fps` are not validated before use, despite similar parameters (`num_frames`, `frame_size`) being validated.</violation>
<violation number="2" location="oxidize-train/src/video/prototype.rs:69">
P1: Clip processing errors are silently discarded via `.ok()`, causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.</violation>
</file>
<file name="oxidize-train/src/video/manifest.rs">
<violation number="1" location="oxidize-train/src/video/manifest.rs:56">
P2: Manifest discovery is overly permissive and can include unintended videos from the data root.</violation>
<violation number="2" location="oxidize-train/src/video/manifest.rs:115">
P2: Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels.</violation>
</file>
P1: Balancing can silently produce an empty dataset when bucketize creates empty classes (e.g. buckets > samples). balanced_indices takes the minimum count across all classes, so any empty class causes min_count == 0 and an empty selection. The code then builds and returns an empty VideoDataset without error.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/dataset.rs, line 162:
<comment>Balancing can silently produce an empty dataset when `bucketize` creates empty classes (e.g. buckets > samples). `balanced_indices` takes the minimum count across all classes, so any empty class causes `min_count == 0` and an empty selection. The code then builds and returns an empty `VideoDataset` without error.</comment>
<file context>
@@ -0,0 +1,316 @@
+ (0..labels.len()).collect()
+ };
+
+ let mut data = Vec::with_capacity(selection.len() * span);
+ let mut final_labels = Vec::with_capacity(selection.len());
+ let mut final_samples = Vec::with_capacity(selection.len());
</file context>
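One way to make the failure visible is to reject empty classes before computing the minimum count. The following is a minimal sketch, not the PR's code: the `balanced_indices` signature and the `BalanceError` type are assumptions made for illustration (the PR has its own `VideoError`), and labels are assumed to be valid indices below `num_classes`.

```rust
/// Hypothetical error type for this sketch; the PR uses its own `VideoError`.
#[derive(Debug, PartialEq)]
pub enum BalanceError {
    /// The class with this index received no samples.
    EmptyClass(usize),
}

/// Select the same number of sample indices from every class.
/// Fails loudly if any class is empty instead of returning an empty selection.
/// Assumes every label is smaller than `num_classes`.
pub fn balanced_indices(
    labels: &[usize],
    num_classes: usize,
) -> Result<Vec<usize>, BalanceError> {
    let mut per_class: Vec<Vec<usize>> = vec![Vec::new(); num_classes];
    for (index, &label) in labels.iter().enumerate() {
        per_class[label].push(index);
    }
    // Report the first empty class rather than letting min_count collapse to 0.
    if let Some(empty) = per_class.iter().position(Vec::is_empty) {
        return Err(BalanceError::EmptyClass(empty));
    }
    let min_count = per_class.iter().map(Vec::len).min().unwrap_or(0);
    let mut selection: Vec<usize> = per_class
        .iter()
        .flat_map(|indices| indices.iter().take(min_count).copied())
        .collect();
    selection.sort_unstable();
    Ok(selection)
}
```

With two samples and three buckets, this returns `Err(BalanceError::EmptyClass(2))` rather than an empty dataset. Dropping empty buckets before balancing would be an alternative if a smaller class set is acceptable.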
P1: Missing guard for batch_size == 0 in training loop — panics at runtime if zero.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/train.rs, line 65:
<comment>Missing guard for `batch_size == 0` in training loop — panics at runtime if zero.</comment>
<file context>
@@ -0,0 +1,227 @@
+ let mut weighted_loss = 0.0;
+ let mut seen = 0usize;
+
+ for batch in order.chunks(config.batch_size) {
+ let (input, labels) = gather(dataset, batch)?;
+ let loss = model.train_step(&input, &labels, &mut optimizer);
</file context>
Suggested change:
-        for batch in order.chunks(config.batch_size) {
+        for batch in order.chunks(config.batch_size.max(1)) {
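Clamping with `.max(1)` avoids the panic but silently changes what the user asked for. An alternative is to reject a zero batch size up front. The sketch below assumes a hypothetical helper `checked_batches` and uses `String` as a stand-in for the PR's error type; it relies only on the documented behavior that `slice::chunks` panics when the chunk size is 0.

```rust
/// Validate the batch size before chunking; `slice::chunks` panics when given 0.
/// The `String` error is a stand-in for the PR's own error type.
pub fn checked_batches(
    order: &[usize],
    batch_size: usize,
) -> Result<Vec<&[usize]>, String> {
    if batch_size == 0 {
        return Err("batch_size must be > 0".to_string());
    }
    Ok(order.chunks(batch_size).collect())
}
```

Validating once at config-parsing time, rather than inside the epoch loop, would keep the training loop itself unchanged.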
P1: Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/frames.rs, line 31:
<comment>Cache reuse ignores FrameConfig, allowing stale cached frames from prior runs with different settings to silently poison later datasets.</comment>
<file context>
@@ -0,0 +1,226 @@
+ cfg: FrameConfig,
+ cache_dir: &Path,
+) -> Result<Vec<f32>, VideoError> {
+ let frame_dir = cache_dir.join(&sample.id);
+ let mut frames = existing_frames(&frame_dir);
+ if frames.is_empty() {
</file context>
P1: reshape silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/model.rs, line 203:
<comment>`reshape` silently falls back to an all-zero matrix on shape mismatch, masking data/shape bugs and causing silent training corruption.</comment>
<file context>
@@ -0,0 +1,265 @@
+
+/// Reinterpret a matrix's contiguous data under a new (rows, cols) shape.
+fn reshape(m: Matrix, rows: usize, cols: usize) -> Matrix {
+ Matrix::from_vec(rows, cols, m.data().to_vec()).unwrap_or_else(|_| Matrix::zeros(rows, cols))
+}
+
</file context>
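A fallible `reshape` would let the caller see the mismatch instead of training on zeros. The sketch below uses a minimal stand-in `Matrix` (the real type lives elsewhere in the PR and its API is only partly visible here), with `String` as a placeholder error type.

```rust
/// Minimal stand-in for the PR's `Matrix`, just enough to show the idea.
#[derive(Debug, PartialEq)]
pub struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Matrix {
    /// Build a matrix, rejecting data whose length does not match the shape.
    pub fn from_vec(rows: usize, cols: usize, data: Vec<f32>) -> Result<Self, String> {
        if rows * cols != data.len() {
            return Err(format!(
                "shape ({}, {}) needs {} values, got {}",
                rows,
                cols,
                rows * cols,
                data.len()
            ));
        }
        Ok(Self { rows, cols, data })
    }

    pub fn shape(&self) -> (usize, usize) {
        (self.rows, self.cols)
    }
}

/// Reinterpret the data under a new shape, propagating a mismatch to the caller
/// instead of substituting an all-zero matrix.
pub fn reshape(m: Matrix, rows: usize, cols: usize) -> Result<Matrix, String> {
    Matrix::from_vec(rows, cols, m.data)
}
```

Call sites would then use `?` to surface a shape bug at the point where it occurs.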
P1: Clip processing errors are silently discarded via .ok(), causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/prototype.rs, line 69:
<comment>Clip processing errors are silently discarded via `.ok()`, causing silent data loss and potentially biased prototype output. Failed clips are excluded from averaging with no warning, while the user is told all selected clips will be averaged.</comment>
<file context>
@@ -0,0 +1,177 @@
+
+ let loaded: Vec<Vec<RgbImage>> = samples
+ .par_iter()
+ .filter_map(|sample| clip_to_frames(sample, frames_cfg, cache_dir).ok())
+ .collect();
+ if loaded.is_empty() {
</file context>
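Instead of dropping failures with `.ok()`, the per-clip results can be collected and then split, so the caller can warn about or abort on failed clips. This is a generic sketch with a hypothetical helper name, `split_results`; it does not use rayon or the PR's types.

```rust
/// Split per-clip results into successes and failures so that failures
/// can be reported instead of vanishing through `.ok()`.
pub fn split_results<T, E>(results: Vec<Result<T, E>>) -> (Vec<T>, Vec<E>) {
    let mut loaded = Vec::new();
    let mut failed = Vec::new();
    for result in results {
        match result {
            Ok(value) => loaded.push(value),
            Err(err) => failed.push(err),
        }
    }
    (loaded, failed)
}
```

The prototype command could then print how many clips failed (and why) before averaging, or return an error when `failed` is non-empty.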
P2: Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if tokens_per_clip is not evenly divisible by num_frames.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/model.rs, line 43:
<comment>Internal token-count derivation uses floor division without validating divisibility, creating a fragile cross-file shape contract that can silently corrupt data if `tokens_per_clip` is not evenly divisible by `num_frames`.</comment>
<file context>
@@ -0,0 +1,265 @@
+ seed: u64,
+ ) -> Self {
+ let num_frames = num_frames.max(1);
+ let patches_per_frame = (tokens_per_clip / num_frames).max(1);
+ let mut rng = StdRng::seed_from_u64(seed);
+ Self {
</file context>
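The divisibility contract can be checked explicitly rather than hidden behind floor division and `.max(1)`. A minimal sketch, assuming a hypothetical free function `patches_per_frame` and a `String` error in place of the PR's error type:

```rust
/// Derive the per-frame patch count, rejecting shapes that do not divide evenly.
pub fn patches_per_frame(tokens_per_clip: usize, num_frames: usize) -> Result<usize, String> {
    if num_frames == 0 {
        return Err("num_frames must be > 0".to_string());
    }
    if tokens_per_clip == 0 || tokens_per_clip % num_frames != 0 {
        return Err(format!(
            "tokens_per_clip ({}) must be a positive multiple of num_frames ({})",
            tokens_per_clip, num_frames
        ));
    }
    Ok(tokens_per_clip / num_frames)
}
```

Using this in the constructor would require it to return a `Result`, which is a design choice the PR author would need to weigh.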
P2: upscale and fps are not validated before use, despite similar parameters (num_frames, frame_size) being validated.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/prototype.rs, line 35:
<comment>`upscale` and `fps` are not validated before use, despite similar parameters (`num_frames`, `frame_size`) being validated.</comment>
<file context>
@@ -0,0 +1,177 @@
+ upscale: u32,
+ fps: u32,
+) -> Result<PrototypeReport, VideoError> {
+ if frames_cfg.num_frames == 0 || frames_cfg.frame_size == 0 {
+ return Err(VideoError::Config(
+ "num_frames and frame_size must be > 0".into(),
</file context>
P2: Manifest discovery is overly permissive and can include unintended videos from the data root.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/manifest.rs, line 56:
<comment>Manifest discovery is overly permissive and can include unintended videos from the data root.</comment>
<file context>
@@ -0,0 +1,221 @@
+ if !is_video_file(path) {
+ continue;
+ }
+ let Some(id) = id_from_filename(path) else {
+ continue;
+ };
</file context>
P2: Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At oxidize-train/src/video/manifest.rs, line 115:
<comment>Duplicate metadata IDs are silently overwritten, risking nondeterministic or incorrect labels.</comment>
<file context>
@@ -0,0 +1,221 @@
+ let Some(id) = json_id(video.get("id")) else {
+ continue;
+ };
+ index.insert(
+ id,
+ MetaEntry {
</file context>
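Duplicate IDs can be detected at insertion time with the standard `HashMap` entry API. The helper name `insert_unique` and the `String` error are assumptions for this sketch; the value type is left generic because `MetaEntry` is not shown in full.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Insert a metadata entry, refusing to overwrite an ID that is already present.
pub fn insert_unique<V>(
    index: &mut HashMap<String, V>,
    id: String,
    entry: V,
) -> Result<(), String> {
    match index.entry(id) {
        Entry::Occupied(existing) => {
            Err(format!("duplicate metadata id: {}", existing.key()))
        }
        Entry::Vacant(slot) => {
            slot.insert(entry);
            Ok(())
        }
    }
}
```

Whether a duplicate should be a hard error or a logged warning that keeps the first entry depends on how the metadata files are produced.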
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: de66d158bb
    let frame_dir = cache_dir.join(&sample.id);
    let mut frames = existing_frames(&frame_dir);
    if frames.is_empty() {
        frames = extract_frames(sample, cfg, &frame_dir)?;
Make frame caches config-specific
When a clip has any cached JPGs, extraction is skipped regardless of the requested `FrameConfig`. If a user first runs with fewer frames (for example `--frames 8`) and later reruns with a larger `--frames` value using the default cache, `pick_indices` repeats or selects only the stale cached frames instead of resampling the video at the requested temporal resolution, silently training or averaging on the wrong data. Include the sampling config in the cache key, or at least require the cache to satisfy the requested frame count before reusing it.
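A config-specific cache key could be as simple as encoding the sampling parameters in the directory name. In the sketch below, `FrameConfig` is a stand-in with only the two fields named in the review (`num_frames`, `frame_size`); the real struct may carry more, and the helper name `frame_cache_dir` and the `-f…-s…` naming scheme are assumptions.

```rust
use std::path::{Path, PathBuf};

/// Stand-in for the PR's `FrameConfig`; only the fields named in the review are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FrameConfig {
    pub num_frames: usize,
    pub frame_size: u32,
}

/// Build a per-clip cache directory whose name includes the sampling config,
/// so frames extracted under different settings never share a directory.
pub fn frame_cache_dir(cache_dir: &Path, sample_id: &str, cfg: FrameConfig) -> PathBuf {
    cache_dir.join(format!(
        "{}-f{}-s{}",
        sample_id, cfg.num_frames, cfg.frame_size
    ))
}
```

Embedding the fields directly keeps the key stable across compiler versions, which a hash from the standard library's default hasher would not guarantee.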
    let work = out
        .parent()
        .filter(|p| !p.as_os_str().is_empty())
        .unwrap_or_else(|| Path::new("."))
        .join(".oxidize-prototype");
Clear prototype work frames before encoding
Every prototype run writes into the same `.oxidize-prototype` directory without removing old `mean_*.jpg` files. If a later run in the same output directory uses fewer frames, higher-numbered images from the previous run remain and the `mean_%03d.jpg` ffmpeg sequence will include those stale frames, so the output mp4 can mix frames from a different dataset or config. Use a fresh temp directory or delete the old sequence before writing new frames.
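Deleting and recreating the work directory before writing frames is one way to guarantee a clean sequence. This is a standard-library-only sketch; the helper name `fresh_work_dir` is hypothetical, and it assumes the directory holds nothing other than generated frames.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove and recreate the work directory so that no `mean_*.jpg`
/// from an earlier run survives into the next encode.
pub fn fresh_work_dir(work: &Path) -> io::Result<()> {
    match fs::remove_dir_all(work) {
        Ok(()) => {}
        // A missing directory is fine: this is simply the first run.
        Err(err) if err.kind() == io::ErrorKind::NotFound => {}
        Err(err) => return Err(err),
    }
    fs::create_dir_all(work)
}
```

A uniquely named temporary directory per run would also work and would additionally avoid collisions between concurrent runs.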
    for (index, &label) in labels.iter().enumerate() {
        per_class[label].push(index);
    }
    let min_count = per_class.iter().map(Vec::len).min().unwrap_or(0);
Reject empty buckets before balancing
For virality/engagement tasks, `bucketize` creates `buckets` class names even when fewer clips survive, such as `--max-videos 2 --task virality --balance` with the default 3 buckets. One class then has zero labels, so including it in `min_count` makes the balanced selection empty, and the loader returns a zero-clip dataset that trains and saves a meaningless model instead of failing. Drop empty buckets or error before balancing.
Summary
- A video pipeline in `oxidize-train` for TikTok-style clip datasets (metadata JSON + mp4 files).
- `oxidize-train video` for training a patch-embedding classifier with creator/virality/engagement labels and optional class balancing.
- `oxidize-train prototype` to render a smoothed base clip by averaging frames across selected creators (supports `--exclude cellow111`).

Test plan
- `cargo test -p oxidize-train`
- `cargo clippy -p oxidize-train --all-targets -- -D warnings`
- `oxidize-train video --data ~/tt-downloader/videos --task creator --max-videos 60`
- `oxidize-train prototype --data ~/tt-downloader/videos --exclude cellow111 --out ~/tt-downloader/oxidize-base-video.mp4`

Made with Cursor

Summary by cubic
Adds CPU short-video training and prototype generation to `oxidize-train`, including new `video` and `prototype` subcommands, frame caching, and JSON model export. Trains a small patch-embedding classifier on TikTok-style datasets (mp4 + `*_metadata.json`).

New Features
- `oxidize-train video` trains a clip classifier on `creator`, `virality`, or `engagement` labels (quantile buckets).
- Optional class balancing, `--max-videos` cap, and frame caching in `<data>/.oxidize-frames`.
- Uses `ffmpeg` for frame extraction; patch-embed + temporal mean-pool + 2-layer MLP head (CPU).
- Exports the model as JSON (`oxidize-video-<task>.json` under the data root) and prints train/val metrics with a majority baseline.
- `oxidize-train prototype` renders a smoothed “base” clip by averaging frames across creators, supports `--exclude`, and writes mp4 + optional contact-sheet PNG.
- `csv`, `video`, and `prototype` subcommands.

Migration
- Requires `ffmpeg` on PATH for frame extraction and prototype encoding.
- `oxidize-train video --data ~/tt-downloader/videos --task creator --max-videos 60`
- `oxidize-train prototype --data ~/tt-downloader/videos --exclude cellow111 --out ~/tt-downloader/oxidize-base-video.mp4`

Written for commit de66d15. Summary will update on new commits.