spike: verify InferCost cost metric compatibility with Observe monitoring platform #59

@Defilan

What this is

An investigation, not implementation work. Enterprise adopters running Observe as their observability backend will want to ingest InferCost's cost telemetry alongside the rest of their infrastructure data. We need to know what shape that ingestion takes, what (if anything) breaks, and whether any work is required on the InferCost side to be a first-class data source.

This issue may spawn follow-up issues if gaps are found.

Why this matters

Observe is increasingly adopted in regulated and large-enterprise environments because its event-streaming-on-Snowflake model fits compliance and retention requirements that Datadog/New Relic struggle with. If InferCost ships clean integration here, it removes a friction point for the exact buyer profile we care about for AI FinOps.

Scope of the investigation

The goal is a written assessment, not code. Specifically:

  1. Ingestion paths. Document the supported paths from InferCost's current metric surface (Prometheus on /metrics) to Observe:

    • OTel Collector with Prometheus receiver and OTLP exporter to Observe
    • Direct Prometheus remote_write (does Observe accept it?)
    • HTTP push with custom payload
    • Anything else Observe supports
  2. What InferCost emits today. Inventory the actual Prometheus metric names, labels, and types currently emitted by the cost controller and any REST API endpoints. Confirm we have a clear, documented schema.

  3. Compatibility check. For each ingestion path:

    • Does Observe preserve our labels (cluster, namespace, team, model, GPU type, accelerator)?
    • Does it preserve metric semantics (counters vs gauges vs histograms)?
    • Are there cardinality limits we'd blow through? Series counts grow multiplicatively when metrics are labelled per team, per model, and per GPU.
    • Cost-of-ingestion implications for the customer's Observe bill.
  4. Dashboard parity. Note whether our published Grafana dashboards translate cleanly to Observe's worksheet/dashboard model, or if we'd need to ship Observe-native artifacts.

  5. Auth & secret handling. Document the auth model (API tokens, workspace IDs) and how it would slot into InferCost's existing secret-reference patterns.
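
For the inventory in item 2 and the label checks in item 3, a rough starting point could be a script that parses a /metrics scrape and lists each metric's type, label keys, and current series count. This is a minimal sketch using only the standard library. The sample metric and label names are hypothetical placeholders, not InferCost's actual schema, and the parser does not map histogram or summary suffixes (`_bucket`, `_sum`, `_count`) back to their metric family.

```python
import re
from collections import defaultdict

# Hypothetical sample of a /metrics scrape. The metric and label names are
# illustrative placeholders, not InferCost's real schema.
SAMPLE = """\
# HELP infercost_example_cost_total Illustrative cumulative cost.
# TYPE infercost_example_cost_total counter
infercost_example_cost_total{namespace="ml",team="search",model="m1"} 12.5
infercost_example_cost_total{namespace="ml",team="ads",model="m1"} 3.0
# TYPE infercost_example_gpu_hourly_rate gauge
infercost_example_gpu_hourly_rate{gpu_type="a100"} 2.1
"""

# metric_name, optional {label block}, then the sample value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# key="value" pairs, allowing escaped characters inside the value
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def inventory(text):
    """Return {metric: {"type", "labels", "series"}} from exposition text."""
    types = {}
    labels = defaultdict(set)
    series = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# TYPE "):
            _, _, name, mtype = line.split(None, 3)
            types[name] = mtype
        elif line and not line.startswith("#"):
            m = SAMPLE_LINE.match(line)
            if not m:
                continue
            name, raw_labels = m.group(1), m.group(2) or ""
            pairs = tuple(sorted(LABEL_PAIR.findall(raw_labels)))
            labels[name].update(k for k, _ in pairs)
            series[name].add(pairs)  # one distinct label set = one series
    return {
        name: {
            "type": types.get(name, "untyped"),
            "labels": sorted(labels[name]),
            "series": len(series[name]),
        }
        for name in series
    }


if __name__ == "__main__":
    for name, info in inventory(SAMPLE).items():
        print(name, info)
```

Running the same script against the data as it appears after each ingestion path (if Observe allows exporting it in a comparable form) would give a quick diff of which labels and types survive.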

Deliverables

A markdown doc (target location: docs/observability/observe-platform.md, or wherever similar integration notes live) covering:

  • Recommended ingestion path with reasoning
  • Step-by-step setup instructions for that path
  • Known limitations and workarounds
  • Any required InferCost-side changes (if found, file as follow-up issues referencing this one)

Possible follow-up issues

Listing these speculatively. File them only if the investigation surfaces a real gap:

  • OTel Collector example config in examples/observability/observe/
  • Observe-native dashboard JSON in docs/observe/
  • OTLP exporter wired into the InferCost controller (if direct push beats collector-based scraping)
  • Label conformance pass on existing Prometheus metrics if cardinality is a problem
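
To judge whether the label conformance pass is worth filing, a back-of-envelope upper bound on series per metric may help: multiply the number of distinct values of each label. The counts below are made-up illustrations, not measurements from any InferCost deployment.

```python
from math import prod


def worst_case_series(distinct_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values.values())


# Hypothetical label cardinalities for a single cost metric.
example = {"cluster": 3, "namespace": 20, "team": 15, "model": 12, "gpu_type": 4}

print(worst_case_series(example))  # 43200
```

Real series counts are usually well below this bound, since not every label combination occurs, so the figure is best read as a ceiling to compare against whatever limits the investigation finds on the Observe side.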

Non-goals

  • Building the integration. This issue is the investigation only.
  • Comparison with other observability platforms (Datadog, New Relic, Honeycomb). Useful context but not required.

Success criteria

The doc lands and we can answer "does InferCost work with Observe today?" with a definitive yes / yes-with-caveats / not-without-this-work. The owner of the doc has verified its claims against the actual Observe product (free tier or trial), not just Observe's marketing pages.
