[FEATURE] First-class compatibility with Observe Inc (board, docs, monitors) #65

@Defilan

Feature Description

First-class compatibility with Observe Inc so InferCost users running Observe as their monitoring backend can ingest GPU/token cost metrics, get a pre-built Observe board (the equivalent of the Grafana dashboard we ship), and use Observe's monitor primitives for budget alerting without writing OPAL from scratch.

Problem Statement

As an InferCost operator standardized on Observe (not Grafana/Datadog), I want the same per-namespace, per-model, per-GPU cost visibility I'd get from the Grafana dashboard, surfaced inside Observe's UI with sensible defaults so I don't have to reverse-engineer the metric semantics from scratch.

Today the README claims "Any monitoring tool (Grafana, Datadog, New Relic, etc.)" but the only concrete artifact we ship is the Grafana dashboard JSON. Anyone running Observe has to:

  1. Stand up the Prometheus → Observe scrape path themselves.
  2. Reverse-engineer the 12 InferCost metric names + label cardinality from the operator source.
  3. Write OPAL queries by hand for cost-per-token, namespace breakdown, GPU power, etc.
  4. Author monitors for budget breach alerts.

For a tool whose core promise is "AI cost attribution out of the box," pushing that work onto operators undercuts the value prop on the one dimension Observe customers care about most.

Proposed Solution

Three deliverables, each independently shippable:

1. config/observe/ — pre-built board (equivalent of the Grafana JSON)

A reproducible Observe board definition (whatever format Observe currently supports for board-as-code; check their terraform-provider-observe, observectl, or board export JSON) covering the same panels as our Grafana dashboard:

  • Cost-per-token (USD), faceted by model + namespace
  • GPU power draw (watts), faceted by node + GPU index
  • Hourly cost vs cloud equivalent
  • UsageReport CRD summaries
  • Cumulative spend per namespace

Ship as config/observe/infercost-board.json (or .tf if Terraform is the canonical path).

2. docs/integrations/observe.md — setup recipe

End-to-end recipe an Observe customer can follow:

  • Point Observe's Prometheus collector at the InferCost operator's /metrics endpoint (or use the PodMonitor we already ship, paired with the kube-prometheus-stack remote-write to Observe).
  • Import the board.
  • (Optional) Map our infercost_* metric names to Observe's "Dataset" abstraction so they appear under a recognizable namespace.

3. config/observe/monitors.yaml — pre-built Observe monitors

Three alerts that mirror our TokenBudget CRD semantics:

  • infercost_budget_breach: hourly cost exceeds the namespace budget for two consecutive samples.
  • infercost_gpu_idle_with_load: GPU power < 50W while requests/sec > 0 (signals a stuck llama.cpp pod, dual-use with LLMKube's own alerts).
  • infercost_anomaly_per_token_cost: per-token cost > 2× the 24h trailing mean (detects a misconfigured CostProfile or a stuck-loop model).

These are also useful as the source of truth for the Alert phase row in our Roadmap.
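
To pin down the intended semantics before anyone writes OPAL, here is a minimal Python sketch of the three conditions as plain predicates. It is not Observe monitor syntax. The function names, argument shapes, and the reading of "two consecutive samples" as "the two most recent samples" are assumptions for illustration; the 50W and 2× thresholds come from the list above.

```python
from statistics import mean
from typing import Sequence


def budget_breach(hourly_costs: Sequence[float], budget: float) -> bool:
    """True when the two most recent hourly-cost samples both exceed the budget.

    Assumption: "two consecutive samples" means the latest two samples.
    """
    if len(hourly_costs) < 2:
        return False
    return all(cost > budget for cost in hourly_costs[-2:])


def gpu_idle_with_load(gpu_watts: float, requests_per_sec: float,
                       idle_watts: float = 50.0) -> bool:
    """True when the GPU draws less than idle_watts while requests still arrive."""
    return gpu_watts < idle_watts and requests_per_sec > 0


def per_token_cost_anomaly(current: float, trailing_24h: Sequence[float],
                           factor: float = 2.0) -> bool:
    """True when the current per-token cost exceeds factor x the trailing mean.

    Assumption: trailing_24h holds the per-token cost samples from the last 24h.
    """
    if not trailing_24h:
        return False
    return current > factor * mean(trailing_24h)
```

Keeping the conditions as pure functions like this would also give the monitor definitions a reference to test against, whichever format Observe ends up requiring.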

Example operator config: no new CRD fields, just a new optional values block in the Helm chart, so that helm install ... --set integrations.observe.enabled=true materializes the relevant ConfigMap and (eventually) auto-provisions the board via Observe's API. Phase 1 ships the artifacts; phase 2 (later) automates installation.

Alternatives Considered

  • Do nothing; users figure it out. Rejected: this is the same argument we already rejected for Grafana, and Grafana support is one of the most-cited reasons people pick InferCost. Parity for Observe is just consistency with the value prop.
  • Generic OpenTelemetry-only path. Considered. We do plan to expose OTLP metrics in a future PR (it's the right long-term abstraction). But OTLP alone doesn't ship dashboards or alerts; those are the part operators actually need. The Observe artifacts on top of OTLP/Prometheus would still be the work item.
  • Wait for Observe to write the integration themselves. Rejected: most vendors do publish third-party content, but relying on it is fragile, and the value prop is sharper when the integration is first-party. Compare: we ship the Grafana dashboard ourselves rather than depending on grafana.com community boards.

Additional Context

  • Related issues: #
  • Similar features in other projects:
    • Datadog Marketplace integrations for cost tools (e.g. Vantage)
    • Grafana Cloud's first-party LLM cost dashboards
    • OpenCost / Kubecost — both publish vendor integration recipes for the major observability backends

Priority

  • Critical
  • High
  • Medium — Nice to have (high value for Observe-shop prospects; low value for everyone else)
  • Low

Willingness to Contribute

  • Yes, I can submit a PR
  • Yes, I can help test
  • No, but I can provide feedback

Notes for the implementer

  • Verify Observe's current board-as-code format first. Their product surface evolves; what was "JSON board export" two years ago may be Terraform-only or API-only today. The right starting point is observectl or the terraform-provider-observe README.
  • Cardinality check. Our 12 metrics have ~5-15 label combinations each. Observe's pricing is volume-based on event ingest; the recipe should call out the expected daily event volume so customers don't get a surprise bill.
  • Auth model. Observe uses workspace API tokens. The setup doc should be specific about least-privilege scope (read on Datasets, write on Boards and Monitors, no admin).
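
For the cardinality note, a rough back-of-the-envelope estimate could look like the sketch below. The 12 metrics and the 5-15 label combinations come from the note above. The 30-second scrape interval, the single operator instance, and the idea that each scraped sample lands as one ingested event are assumptions that need checking against the actual scrape config and Observe's ingest accounting.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_samples(metric_count: int, series_per_metric: int,
                  scrape_interval_s: int) -> int:
    """Samples produced per day by one scrape target.

    Assumption: every series is present in every scrape.
    """
    scrapes_per_day = SECONDS_PER_DAY // scrape_interval_s
    return metric_count * series_per_metric * scrapes_per_day


# 12 metrics, 5 to 15 label combinations each, hypothetical 30s scrape interval.
low = daily_samples(12, 5, 30)
high = daily_samples(12, 15, 30)
print(f"{low:,} to {high:,} samples/day")  # 172,800 to 518,400 samples/day
```

The setup doc could publish this range alongside the scrape interval it assumes, so readers can rescale it for their own clusters.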

Labels: enhancement (New feature or request)
