
[FEATURE] Support diffusion model cost tracking (Flux, Stable Diffusion) #8

@Defilan

Feature Description

InferCost currently tracks cost for LLM inference workloads by scraping llama.cpp/vLLM token metrics. Diffusion model workloads (Flux, Stable Diffusion, ComfyUI) also consume significant GPU resources but are invisible to InferCost because they don't produce tokens.

Problem Statement

A Flux image generation server consuming a full GPU (150W+) incurs real hardware and electricity costs, but InferCost doesn't track it because:

  1. Pod discovery is hardcoded to the inference.llmkube.dev/model label
  2. The scraper only understands llama.cpp token metrics
  3. The cost model is token-specific (cost_per_token, tokens_per_hour)

Proposed Solution

Generalize the cost model from tokens to "work units":

Workload    Unit             Metric Source
LLM         Tokens           llamacpp:tokens_predicted_total
Diffusion   Images or Steps  images_generated_total, diffusion_steps_total
Embeddings  Requests         requests_total
Audio/TTS   Seconds          audio_seconds_total
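
The table above could be carried in code as a small registry. This is only a sketch: `WorkloadType`, `UnitSpec`, and `LookupUnitSpecs` are hypothetical names, and every metric name other than the llama.cpp one is the proposed name from the table, not something a given server is known to export.

```go
package main

import "fmt"

// WorkloadType names a class of workload (hypothetical identifiers).
type WorkloadType string

const (
	WorkloadLLM        WorkloadType = "llm"
	WorkloadDiffusion  WorkloadType = "diffusion"
	WorkloadEmbeddings WorkloadType = "embeddings"
	WorkloadAudio      WorkloadType = "audio"
)

// UnitSpec pairs a work unit with the counter metric that would report it.
type UnitSpec struct {
	Unit   string
	Metric string
}

// unitSpecs mirrors the table in the issue. A workload may offer more than
// one candidate unit (diffusion: images or steps).
var unitSpecs = map[WorkloadType][]UnitSpec{
	WorkloadLLM:        {{Unit: "tokens", Metric: "llamacpp:tokens_predicted_total"}},
	WorkloadDiffusion:  {{Unit: "images", Metric: "images_generated_total"}, {Unit: "steps", Metric: "diffusion_steps_total"}},
	WorkloadEmbeddings: {{Unit: "requests", Metric: "requests_total"}},
	WorkloadAudio:      {{Unit: "seconds", Metric: "audio_seconds_total"}},
}

// LookupUnitSpecs returns the candidate units for a workload type, or an
// error if the type is not registered.
func LookupUnitSpecs(w WorkloadType) ([]UnitSpec, error) {
	specs, ok := unitSpecs[w]
	if !ok {
		return nil, fmt.Errorf("unknown workload type %q", w)
	}
	return specs, nil
}

func main() {
	specs, err := LookupUnitSpecs(WorkloadDiffusion)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range specs {
		fmt.Printf("%s <- %s\n", s.Unit, s.Metric)
	}
}
```

Keeping several candidate units per workload leaves the "Images or Steps" choice to configuration rather than fixing it in the cost model.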

The core cost formula doesn't change: hourly_cost / units_per_hour = cost_per_unit
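
As a worked example of the formula, here is a minimal sketch. `CostPerUnit` and `ErrNoThroughput` are hypothetical names, and the $0.50/hour and 400 images/hour figures are made up for illustration. The zero-throughput guard is an addition the formula itself does not mention: an idle server still costs money per hour but has no meaningful per-unit cost.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoThroughput signals that no work units were observed, so a per-unit
// cost cannot be computed (division by zero).
var ErrNoThroughput = errors.New("no work units observed in the window")

// CostPerUnit applies hourly_cost / units_per_hour = cost_per_unit.
// The unit can be tokens, images, requests, or seconds of audio.
func CostPerUnit(hourlyCost, unitsPerHour float64) (float64, error) {
	if unitsPerHour <= 0 {
		return 0, ErrNoThroughput
	}
	return hourlyCost / unitsPerHour, nil
}

func main() {
	// Illustrative numbers only: 0.50 per hour, 400 images per hour.
	cost, err := CostPerUnit(0.50, 400)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.5f per image\n", cost) // prints "0.00125 per image"
}
```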

Implementation Approach

  1. Pluggable scraper interface: Define a WorkloadScraper interface with adapters for llama.cpp, vLLM, and diffusion frameworks
  2. Configurable pod discovery: Support custom label selectors or annotations beyond inference.llmkube.dev/model
  3. Generalize snapshots: Replace TokenSnapshot with WorkloadSnapshot carrying a UnitType field
  4. Skip cloud comparison for non-token workloads: Image generation APIs have no standard per-unit cloud pricing comparable to per-token LLM pricing

Files That Would Need Changes

  • internal/scraper/ - Add scraper interface + diffusion adapter
  • internal/controller/costprofile_controller.go - Dispatch to the matching scraper by workload type
  • internal/calculator/calculator.go - Generic rate/cost computation (minimal change)
  • internal/api/store.go - Generalize ModelData structure
  • internal/metrics/metrics.go - New metric families for non-token workloads

Why This Matters

Organizations running mixed AI workloads (LLMs + image generation + embeddings) on shared GPU infrastructure need cost visibility across all of them, not just LLMs. This is the difference between tracking 60% of GPU costs and tracking 100%.

Alternatives Considered

  • Tracking only GPU-hours for non-LLM workloads (loses per-unit granularity)
  • Requiring users to add LLMKube labels to non-LLMKube pods (hacky, breaks the "works with any stack" promise)
