Feature Description
InferCost currently tracks cost for LLM inference workloads by scraping llama.cpp/vLLM token metrics. Diffusion model workloads (Flux, Stable Diffusion, ComfyUI) also consume significant GPU resources but are invisible to InferCost because they don't produce tokens.
Problem Statement
A Flux image generation server consuming a full GPU (150W+) incurs real hardware and electricity costs, but InferCost doesn't track it because:
- Pod discovery is hardcoded to the inference.llmkube.dev/model label
- The scraper only understands llama.cpp token metrics
- The cost model is token-specific (cost_per_token, tokens_per_hour)
Proposed Solution
Generalize the cost model from tokens to "work units":
| Workload | Unit | Metric Source |
|----------|------|---------------|
| LLM | Tokens | llamacpp:tokens_predicted_total |
| Diffusion | Images or Steps | images_generated_total, diffusion_steps_total |
| Embeddings | Requests | requests_total |
| Audio/TTS | Seconds | audio_seconds_total |
The core cost formula doesn't change: hourly_cost / units_per_hour = cost_per_unit
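As a minimal sketch of the formula above, the Go snippet below computes a per-unit cost for any workload type. The function name CostPerUnit and the sample figures ($0.50 per hour, 200 images per hour) are illustrative assumptions, not identifiers or measurements from InferCost.

```go
package main

import (
	"errors"
	"fmt"
)

// CostPerUnit applies hourly_cost / units_per_hour = cost_per_unit.
// The name is hypothetical; it is not an existing InferCost function.
func CostPerUnit(hourlyCost, unitsPerHour float64) (float64, error) {
	if unitsPerHour <= 0 {
		// An idle workload produced no units, so a per-unit cost is undefined.
		return 0, errors.New("units_per_hour must be positive")
	}
	return hourlyCost / unitsPerHour, nil
}

func main() {
	// Illustrative numbers only: $0.50/hour of GPU cost, 200 images/hour.
	cost, err := CostPerUnit(0.50, 200)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("cost per image: $%.4f\n", cost)
}
```

The same function would serve tokens, requests, or audio seconds, since only the meaning of the unit changes. How an idle workload (zero units in the window) should be reported is a design decision the sketch leaves to the caller.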
Implementation Approach
- Pluggable scraper interface: Define a WorkloadScraper interface with adapters for llama.cpp, vLLM, and diffusion frameworks
- Configurable pod discovery: Support custom label selectors or annotations beyond inference.llmkube.dev/model
- Generalize snapshots: Replace TokenSnapshot with WorkloadSnapshot carrying a UnitType field
- Skip cloud comparison for non-token workloads: Image generation APIs have no standard cloud pricing comparable to the per-token pricing of LLM APIs
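One possible shape for the interface and snapshot described above is sketched below. Only the names WorkloadScraper, WorkloadSnapshot, and UnitType come from this proposal; the method signature, the struct fields, the unit constants, the UnitsPerHour helper, and the staticScraper stand-in are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// UnitType names the unit of work a workload produces (assumed values).
type UnitType string

const (
	UnitTokens   UnitType = "tokens"
	UnitImages   UnitType = "images"
	UnitRequests UnitType = "requests"
	UnitSeconds  UnitType = "seconds"
)

// WorkloadSnapshot generalizes TokenSnapshot: one cumulative counter
// reading tagged with its unit. Field names are hypothetical.
type WorkloadSnapshot struct {
	UnitType  UnitType
	Total     float64 // cumulative units reported by the workload
	Timestamp time.Time
}

// WorkloadScraper would be implemented once per framework
// (llama.cpp, vLLM, a diffusion server, ...).
type WorkloadScraper interface {
	Scrape(ctx context.Context, endpoint string) (WorkloadSnapshot, error)
}

// UnitsPerHour derives a rate from two snapshots of the same unit type.
func UnitsPerHour(prev, cur WorkloadSnapshot) (float64, error) {
	if prev.UnitType != cur.UnitType {
		return 0, errors.New("snapshots have different unit types")
	}
	hours := cur.Timestamp.Sub(prev.Timestamp).Hours()
	if hours <= 0 {
		return 0, errors.New("snapshots are not in chronological order")
	}
	delta := cur.Total - prev.Total
	if delta < 0 {
		// Counter went backwards: assume the workload restarted from zero.
		delta = cur.Total
	}
	return delta / hours, nil
}

// staticScraper is a stand-in adapter that returns a fixed reading.
type staticScraper struct {
	snap WorkloadSnapshot
}

func (s staticScraper) Scrape(_ context.Context, _ string) (WorkloadSnapshot, error) {
	return s.snap, nil
}

func main() {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	prev := WorkloadSnapshot{UnitType: UnitImages, Total: 100, Timestamp: start}

	var scraper WorkloadScraper = staticScraper{snap: WorkloadSnapshot{
		UnitType: UnitImages, Total: 160, Timestamp: start.Add(30 * time.Minute),
	}}
	cur, err := scraper.Scrape(context.Background(), "http://example.invalid/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	rate, err := UnitsPerHour(prev, cur)
	if err != nil {
		fmt.Println("rate failed:", err)
		return
	}
	fmt.Printf("%.0f %s/hour\n", rate, cur.UnitType)
}
```

Carrying the unit on the snapshot rather than on the scraper lets the calculator reject mixed-unit comparisons without knowing which adapter produced the data.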
Files That Would Need Changes
internal/scraper/ - Add scraper interface + diffusion adapter
internal/controller/costprofile_controller.go - Dispatcher on workload type
internal/calculator/calculator.go - Generic rate/cost computation (minimal change)
internal/api/store.go - Generalize ModelData structure
internal/metrics/metrics.go - New metric families for non-token workloads
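For the diffusion adapter listed under internal/scraper/, the core step would be reading a counter such as images_generated_total from a Prometheus-style text payload. The sketch below is a deliberately simplified parser using only the standard library. The function name SumCounter is hypothetical, the metric names are the ones proposed in the table above (whether a given diffusion server actually exposes them is an assumption), and a real adapter would more likely rely on a full exposition-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// SumCounter adds up every sample of the named metric across all label
// sets. It returns false when the metric is absent from the payload.
// Simplification: the value is taken as the first field after the last
// '}' on the line, so label values containing '}' are not handled.
func SumCounter(payload, metric string) (float64, bool) {
	var sum float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the label block or the first whitespace.
		end := strings.IndexAny(line, "{ \t")
		if end < 0 || line[:end] != metric {
			continue
		}
		rest := line[end:]
		if strings.HasPrefix(rest, "{") {
			closeIdx := strings.LastIndex(rest, "}")
			if closeIdx < 0 {
				continue // malformed label block
			}
			rest = rest[closeIdx+1:]
		}
		fields := strings.Fields(rest) // value, then an optional timestamp
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		sum += v
		found = true
	}
	return sum, found
}

func main() {
	// Made-up payload for illustration; not output from a real server.
	payload := `# TYPE images_generated_total counter
images_generated_total{model="flux"} 42
images_generated_total{model="sdxl"} 8
diffusion_steps_total{model="flux"} 1176
`
	if total, ok := SumCounter(payload, "images_generated_total"); ok {
		fmt.Println("images generated:", total)
	}
}
```

Summing across label sets gives one total per pod, which matches the per-workload granularity of the cost formula. Per-model breakdowns would need the labels preserved instead.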
Why This Matters
Organizations running mixed AI workloads (LLMs + image generation + embeddings) on shared GPU infrastructure need cost visibility across all of them, not just LLMs. This is the difference between tracking 60% of GPU costs and tracking 100%.
Alternatives Considered
- Tracking only GPU-hours for non-LLM workloads (loses per-unit granularity)
- Requiring users to add LLMKube labels to non-LLMKube pods (hacky, breaks the "works with any stack" promise)