Prometheus

https://prometheus.io/docs

Open source metrics monitoring and alerting framework.

Scrapes text-based metrics over HTTP from remote agents.

Stores metrics in a local Time Series DB with querying capabilities.

https://demo.promlabs.com/

Key Points
Setup
HTTP API
PromQL
Comparisons
Collections Agents
Exporters
Alert Manager
Push Gateway
PromLens
Resources

Key Points

metrics + alerting by SoundCloud
written in Go
second project in the CNCF after Kubernetes
standalone, no dependencies, only local disk
1 static binary - prometheus
1 config file - prometheus.yml
pull based - HTTP /metrics scraping
TSDB - time-series DB
Queryable
tag based metrics
Push Gateway - port 9091 - for short lived jobs
Exporters - agents serving HTTP /metrics to be used as Scrape Targets by Prometheus server
- lots of different exporters, see Exporters section for list
Scrape Targets:
- Static config or
- Service Discovery - discover targets to scrape dynamically:
  - Generic - DNS, Consul, ZooKeeper
  - VMs - AWS EC2, Azure, GCP, Openstack
  - Cluster Managers - Kubernetes, Marathon (Mesos)
No Auth or TLS historically (newer releases can enable basic auth + TLS via --web.config.file) - otherwise use Nginx / HTTPd / HAProxy in front for SSL + Authentication
very efficient - unlikely to need sharding + federation until thousands of machines
Scale - single server can handle:
- millions of metrics
- hundreds of thousands of data points per sec
Federation - pulling metrics from other Prometheus servers:
- tree aggregates from other Prometheus
- set Prometheus as scrape targets metrics_path: /federate
- match[] section
High Availability - dual ingest to 2 identical servers + Load Balance
AlertManager email / pager:
- discovery similar to target discovery
- groups clients so mass outage appears as 1 alert not hundreds / thousands
- inhibitions - suppress similar alerts
PromQL query language:
- can make query predictions based on current rate, such as:
  - how full will disks be in 4 hours
not suitable for billing as not detailed / complete enough
backfill support on roadmap (OpenTSDB already has this)
Recording Rules:
- like InfluxDB continuous queries
- compute new time series at regular intervals
- pre-materialise expensive queries for faster dashboards
Alerting Rules - sends alerts
- similar config to Recording Rules
- based on expression results eg. >= 1
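The disk-fill prediction mentioned under PromQL above is what the `predict_linear()` function is for, eg. `predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0`. As a rough sketch of the idea only (not Prometheus's actual code), it fits a least-squares line through the samples in the range and evaluates it some seconds into the future. The sample data and variable names below are made up for illustration:

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def predict_linear(samples: List[Sample], now: float, seconds_ahead: float) -> float:
    """Fit a least-squares line through the samples and evaluate it at now + seconds_ahead."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    # Centre timestamps on `now` so the intercept is the fitted value at `now`.
    xs = [t - now for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("all samples share one timestamp")
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * seconds_ahead


# Hypothetical data: free bytes on a disk shrinking by 1 MB per minute over the last hour.
now = 1_700_000_000.0
free_bytes = [(now - 3600 + 60 * i, 500e6 - 1e6 * i) for i in range(61)]

# Predicted free bytes in 4 hours; a value below 0 would mean the disk fills up before then.
print(predict_linear(free_bytes, now, 4 * 3600))
```

With this data the disk has 440 MB free now and loses 60 MB an hour, so the prediction lands at roughly 200 MB free in 4 hours.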

Key Summary:

Multi-dimensional data model - stores metrics with labels (key-value pairs) for flexible querying
PromQL (Prometheus Query Language) - powerful language used to query, aggregate, and display metrics
Time-series storage - stores metrics as time-stamped values for historical analysis
Visualization - integrates with tools like Grafana to create dashboards and visualize metrics
Alerting - configurable alerts based on metric thresholds or conditions
Service discovery - automatically discovers services and targets based on configurations

Key Components

Prometheus Server
- Scrapes and stores time-series data from configured targets (like services, applications, and systems)
- it also hosts a basic web UI for querying
Exporters
- agents that gather metrics and present them as semi-structured text over HTTP on a non-standard port
Client Libraries
- instrument your own apps to expose /metrics in the Prometheus text format for Prometheus server to scrape
Alertmanager
- handles alerts generated by Prometheus based on predefined conditions
- routes them to appropriate channels (email, Slack, etc.)
Pushgateway
- long-running metrics sink that short-lived jobs push their metrics to, which the Prometheus server then scrapes (since it cannot reliably scrape ephemeral short-lived jobs that come and go)
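To make the Exporters and Client Libraries items above concrete, here is a minimal sketch of an app serving /metrics in the Prometheus text format using only the Python standard library. The metric name `myapp_http_requests_total`, the `REQUESTS` dict and port 8000 are hypothetical; in a real app you would more likely use the official client library for your language than hand-roll this:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical application counters, keyed by HTTP status code.
REQUESTS: Dict[str, int] = {"200": 0, "500": 0}


def render_metrics(requests: Dict[str, int]) -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_http_requests_total Total HTTP requests handled.",
        "# TYPE myapp_http_requests_total counter",
    ]
    for status, count in sorted(requests.items()):
        lines.append(f'myapp_http_requests_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        # Content type used by the Prometheus text format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever serving /metrics (port 8000 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and adding `localhost:8000` as a static scrape target would be enough for the Prometheus server to start collecting the counter.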

Ports

Port	Description
9090	Prometheus
9091	Push Gateway
9100	Node Exporter
9103	Collectd exporter
9273	Telegraf exporter

Architecture Diagram

UI

Prometheus has a native Web UI on port 9090 with nice metric name autocomplete.

But Grafana is generally considered the gold standard of metric UIs and integrates with most major open source metrics and time series databases like InfluxDB and OpenTSDB.

PromLens is a web-based query builder for PromQL by PromLabs.

Metrics

https://demo.promlabs.com/metrics

Scrapes metrics from HTTP /metrics endpoints called 'targets'.

Stored as:

Timestamp - 64-bit int in ms
Value - 64-bit float (in future will be histogram)
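As an illustration of how a scraped line maps onto that timestamp + value pair, the sketch below parses one line of the text format into a sample. It is deliberately simplified and is not how Prometheus itself parses: it assumes label values contain no escaped quotes, and the `Sample` type and `parse_line` function are names invented here:

```python
import re
import time
from typing import Dict, NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    timestamp_ms: int  # 64-bit integer, milliseconds since the Unix epoch
    value: float       # 64-bit float


# name, optional {labels}, value, optional timestamp in milliseconds
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_line(line: str, scrape_time_ms: Optional[int] = None) -> Optional[Sample]:
    """Parse one exposition line; returns None for blank lines and # comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unparseable metric line: {line!r}")
    name, raw_labels, raw_value, raw_ts = m.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    if raw_ts is not None:
        ts = int(raw_ts)                 # explicit timestamp on the line
    elif scrape_time_ms is not None:
        ts = scrape_time_ms              # otherwise stamp with the scrape time
    else:
        ts = int(time.time() * 1000)
    return Sample(name, labels, ts, float(raw_value))


print(parse_line('http_requests_total{method="post",code="200"} 1027 1395066363000'))
```

Most targets omit the trailing timestamp, in which case the sample is stamped with the time of the scrape.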

Types of Metrics

Gauge: current measurement, can go up or down
Counter: cumulative count over time
- usually need to wrap it in queries with a function like rate() / irate() / increase() to be meaningful eg. rate(my_counter[5m])
Summaries: percentiles / quantiles
- do not aggregate across summaries - not statistically valid
Histogram: bucketed stats, cumulative
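To show why a raw counter needs wrapping in rate() / increase(), here is a simplified sketch of the core calculation, including how a counter reset (process restart) is handled. It is an approximation for illustration: the real rate() also extrapolates to the edges of the range window, which this omits. Function names and sample data are made up:

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a reset to 0."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter reset: the process restarted and began counting from 0 again.
            total += value
        else:
            total += value - prev
        prev = value
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return counter_increase(samples) / elapsed


# Hypothetical counter scraped every 15s, with a restart between t=15 and t=30.
scrapes = [(0, 100), (15, 130), (30, 10), (45, 40), (60, 70)]
print(counter_increase(scrapes), simple_rate(scrapes))
```

The raw value drops from 130 to 10 at the restart, but the computed increase keeps growing, which is why graphs of rate() stay meaningful across restarts.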

For a metric that records the process start time as a Unix timestamp, get the uptime in seconds:

time() - process_start_time_seconds

Storage

No long term storage, but can forward to remote storage.

Remote Storage Adapters for:

Graphite
InfluxDB - write + read back
OpenTSDB - write only (still use this for history graphs)
- forwarder listens on 9201
Cortex - scalable long term storage for Prometheus

Configure remote_write to send to remote storage.

Read back supported on some adapters eg. InfluxDB
But no PromQL push down - all data must be read back and computed on Prometheus
Inefficient / Poor Performance / Non-Scalable

Scaling - Thanos

Thanos - federates multiple Prometheus for scaling

Setup

Install Locally

Download and install Prometheus binary and any Exporters you want to run.

You can download and extract the Prometheus tarball manually from Downloads page or just run the scripts from DevOps-Bash-tools repo to install the latest GitHub release to your $PATH (/usr/local/bin or $HOME/bin):

install_prometheus.sh

Download starter config from HariSekhon/Templates repo:

wget https://raw.githubusercontent.com/HariSekhon/Templates/refs/heads/master/prometheus.yml

Run prometheus:

prometheus

Creates $PWD/data/ directory full of data.

Defaults worth noting:

--config.file='prometheus.yml'
--storage.tsdb.path='data/'
--storage.tsdb.retention.time='15d'

Using Docker

Manually run:

docker run -ti -p 9090:9090 prom/prometheus

Using Docker-Compose tooling:

HariSekhon/DevOps-Bash-tools - docker-compose/prometheus.yml

docker-compose -f prometheus.yml up

Fully scripted using above docker-compose/prometheus.yml in DevOps-Bash-tools repo:

HariSekhon/DevOps-Bash-tools - kubernetes/prometheus.sh

prometheus.sh

Check Targets Scraping Status

List the targets and check that each shows UP in the State column, which means it is being scraped correctly.

http://localhost:9090/targets?search=
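The same check can be scripted against the HTTP API's `/api/v1/targets` endpoint. A minimal sketch, assuming Prometheus on localhost:9090; the helper names and the hand-written `SAMPLE` response (trimmed to the fields used) are illustrative only:

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_targets(base_url: str = "http://localhost:9090") -> Dict[str, Any]:
    """Fetch the targets listing from a running Prometheus (not called in this sketch)."""
    with urlopen(f"{base_url.rstrip('/')}/api/v1/targets") as resp:
        return json.load(resp)


def unhealthy_targets(api_response: Dict[str, Any]) -> Dict[str, str]:
    """Map scrape URL -> last error (or health state) for every target that is not up."""
    if api_response.get("status") != "success":
        raise RuntimeError(f"API call failed: {api_response}")
    return {
        t["scrapeUrl"]: t.get("lastError") or t["health"]
        for t in api_response["data"]["activeTargets"]
        if t["health"] != "up"
    }


# Hand-written, trimmed example of a response body.
SAMPLE = {
    "status": "success",
    "data": {"activeTargets": [
        {"scrapeUrl": "http://localhost:9090/metrics", "health": "up", "lastError": ""},
        {"scrapeUrl": "http://localhost:9100/metrics", "health": "down",
         "lastError": "connection refused"},
    ]},
}
print(unhealthy_targets(SAMPLE))
```

Against a live server, `unhealthy_targets(fetch_targets())` returning an empty dict means every target is UP.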

Graph Page / Expression Browser

http://localhost:9090/graph

Enter a metric name to query:

prometheus_tsdb_head_samples_appended_total

Switch from Table to Graph view tab.

Since this metric only goes up, calculate its rate of ingestion instead by changing the query to this:

rate(prometheus_tsdb_head_samples_appended_total[1m])

Graph the 90th percentile of request durations for the demo scrape target in last 5 mins split by URL /paths:

histogram_quantile(0.9, sum by(le, path) (rate(demo_api_request_duration_seconds_bucket[5m])) )
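For intuition about what the histogram_quantile() query above computes, here is a simplified sketch for classic cumulative buckets: find the bucket the requested rank falls into, then interpolate linearly inside it. It assumes positive bucket bounds and monotonic counts, and skips edge cases the real function handles; the bucket values are invented:

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound `le`, cumulative count or rate)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: the best estimate is the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    if count == below:
        return lower
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-duration buckets in seconds: le=0.1, 0.5, 1.0, +Inf.
demo = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.9, demo))
```

Because of the interpolation, the result is only as precise as the bucket boundaries, so choose buckets around the latencies you care about.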

HTTP API

/api/v1/query?query="..."
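A small sketch of calling this endpoint from Python: build the URL with the PromQL expression properly URL-encoded, then pull label sets and values out of an instant-vector response. The function names and the example response body are hypothetical, and the base URL assumes a local server on port 9090:

```python
import json
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str, at: Optional[float] = None) -> str:
    """Build an /api/v1/query URL; `at` is an optional evaluation time (Unix seconds)."""
    params = {"query": promql}
    if at is not None:
        params["time"] = str(at)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def parse_vector(body: str) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-vector JSON response."""
    doc = json.loads(body)
    if doc["status"] != "success":
        raise RuntimeError(doc.get("error", "query failed"))
    if doc["data"]["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {doc['data']['resultType']}")
    # Each result carries value = [timestamp, "value as a string"].
    return [(r["metric"], float(r["value"][1])) for r in doc["data"]["result"]]


url = instant_query_url("http://localhost:9090", 'up{job="prometheus"}')
print(url)

# Hand-written example of a response body.
body = ('{"status":"success","data":{"resultType":"vector","result":'
        '[{"metric":{"__name__":"up","job":"prometheus"},"value":[1700000000.0,"1"]}]}}')
print(parse_vector(body))
```

Fetching `url` with `urllib.request.urlopen` and passing the decoded body to `parse_vector` completes the round trip.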

PromQL

https://promlabs.com/promql-cheat-sheet/

mymetric{tag1="value1", tag2!="value2", tag3=~"match3", tag4!~"match4"}[interval]
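The four matcher operators in the selector above can be sketched as follows. Two details are worth knowing: regex matchers are fully anchored (they must match the whole label value), and a label that is absent behaves like an empty string. This uses Python's `re` for illustration, whereas Prometheus uses RE2 syntax, and the `matches` helper is a made-up name:

```python
import re
from typing import Dict, List, Tuple

Matcher = Tuple[str, str, str]  # (label name, operator, value or regex)


def matches(labels: Dict[str, str], matchers: List[Matcher]) -> bool:
    """Return True if a series' labels satisfy every matcher."""
    for name, op, pattern in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = actual == pattern
        elif op == "!=":
            ok = actual != pattern
        elif op == "=~":
            ok = re.fullmatch(pattern, actual) is not None  # anchored at both ends
        elif op == "!~":
            ok = re.fullmatch(pattern, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


series = {"tag1": "value1", "tag2": "other", "tag3": "match3-a"}
selector = [("tag1", "=", "value1"), ("tag2", "!=", "value2"),
            ("tag3", "=~", "match3.*"), ("tag4", "!~", "match4")]
print(matches(series, selector))
```

Note that `tag3=~"match3"` alone would not match the value `match3-a`; the trailing `.*` is needed because of the anchoring.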

http_requests_total{kubernetes_namespace="dev", _weave_service="podinfo"}

sum(http_requests_total) by (kubernetes_namespace, _weave_service)

The range interval must span at least two samples, so make it at least twice the scrape interval, otherwise you will get no results, eg. for a 10s range when the scrape interval is 1m (the default, set to 30s for prod).

rate(http_requests_total[1m])

Find requests per second (rate() already returns a per-second average over the range, so do not divide by 60):

sum(rate(http_requests_total{_weave_service="podinfo"}[1m]))

Ratio of unsuccessful requests (a value between 0 and 1) - put this in to a Prometheus alert with > 0.5 to fire above 50%

sum(rate(http_requests_total{_weave_service="podinfo",status!="200"}[1m])) / sum(rate(http_requests_total{_weave_service="podinfo"}[1m]))

sum(rate(http_requests_total{kubernetes_namespace="dev", _weave_service="podinfo"}[1m]))

Alerts

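# Prometheus 2.x+ rule file format (the 1.x "ALERT ... IF ..." syntax is no longer accepted)
groups:
  - name: podinfo  # group name is arbitrary
    rules:
      - alert: errorRate
        expr: sum(rate(http_requests_total{_weave_service="podinfo",status!="200"}[1m])) / sum(rate(http_requests_total{_weave_service="podinfo"}[1m])) > 0.5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "error rate > 50%"
          impact: "bad"
          detail: "blah"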
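The one-minute hold duration on the rule above means the alert does not fire the moment the expression becomes true: it goes pending first and only fires once the condition has held for the whole duration. A rough sketch of that state machine, with invented evaluation times and a hypothetical `alert_states` helper:

```python
from typing import List, Optional, Tuple


def alert_states(evaluations: List[Tuple[float, bool]], for_seconds: float) -> List[str]:
    """Return the alert state after each (timestamp, condition_is_true) evaluation."""
    states: List[str] = []
    active_since: Optional[float] = None
    for ts, condition_true in evaluations:
        if not condition_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_for = ts - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical rule evaluations: the condition flaps once, then stays true.
evals = [(0, False), (15, True), (30, True), (45, False), (60, True), (120, True), (135, False)]
print(alert_states(evals, for_seconds=60))
```

The brief blip between t=15 and t=30 never leaves pending, which is the point of the hold duration: short spikes do not page anyone.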

Comparisons

Prometheus vs Graphite

Prometheus	Graphite
proactive (scraping, alerting, rule processing)	passive Time Series db
irregular timeseries	fixed interval
label name{key1=val1,key2=val2}	name.key.key2
better for filtering via labels	has clustering
arbitrary precision	more complicated setup
Whisper-like RRD overwrites old data	Uses Whisper

Prometheus vs InfluxDB

Prometheus	InfluxDB
active, scrape, alert	passive Time Series db
metadata for ts stored once	stored for every event => 11x storage
indexes all columns	only indexes row timestamp (0.9 targeted for cols)
better for cumulative (downsampling feature)	better for storing individual events
single server (must shard manually or Thanos)	clustering (proprietary)
float only	int, float, bool, string
ms only	s, ms, microsecs, nanosecs
mem + 5 min flushes = data loss (1.x, the 2.x TSDB added a WAL)	durable (WAL)

Prometheus vs OpenTSDB

Prometheus	OpenTSDB
active, scrape, alert	passive Time Series db
full query language	lacks full query language
PromQL more complex querying possible	only simple aggregations via API
doesn't scale, must Thanos	scales much better using HBase for horizontal scaling and sharding

Collections Agents

Node Exporter is a static binary that exposes /metrics on port 9100 for scraping.

Collectd & Telegraf have Prometheus plugins that listen for /metrics scraping requests.

Systemd integration gives stats on processes:

./node_exporter --collector.systemd

Collectd port 9103 /metrics
Telegraf port 9273 /metrics
Docker 1.13+ port 9323 /metrics
- set docker daemon flags (the port is whatever you set in metrics-addr, 9323 here):
  - daemon.json: { "metrics-addr": "0.0.0.0:9323", "experimental": true }
    or
  - --experimental=true --metrics-addr=0.0.0.0:9323

Exporters

Download exporters from the Prometheus Downloads page.

Or quickly install the binaries for these exporters using the install/install_prometheus_*.sh scripts in the DevOps-Bash-tools repo.

Node Exporter

prometheus/node_exporter

Prometheus Node Exporter is a popular way to collect system level metrics from operating systems, such as:

CPU
Disk
Network
Process stats

install_node_exporter.sh

Run --help to see its options:

node_exporter --help

node_exporter

Listens on port 9100.

See prometheus/node_exporter GitHub homepage above for the list of different stats to enable / disable.

Import the ready-made Node Exporter dashboard into Grafana.

YouTube - Monitoring Linux Host Metrics with Prometheus | Node Exporter (Setup, Scrape, Query, Grafana)

Blackbox Exporter

prometheus/blackbox_exporter

Endpoint monitoring for uptime and availability metrics by probing endpoints over HTTP(S), DNS, TCP, ICMP and gRPC.

install_prometheus_blackbox_exporter.sh

TODO: revision control configuration

Consul Exporter

prometheus/consul_exporter

install_prometheus_consul_exporter.sh

Graphite Exporter

prometheus/graphite_exporter

install_prometheus_graphite_exporter.sh

Memcached Exporter

prometheus/memcached_exporter

install_prometheus_memcached_exporter.sh

MySQLd Exporter

prometheus/mysqld_exporter

install_prometheus_mysqld_exporter.sh

Statsd Exporter

prometheus/statsd_exporter

install_prometheus_statsd_exporter.sh

Kube State Metrics

Kube State Metrics is a service that talks to the Kubernetes API server to get the details of API objects like Deployments, Pods, DaemonSets, StatefulSets etc. and exposes them as metrics.

Ccloud Exporter

Dabz/ccloudexporter

Confluent Cloud Exporter - exports metrics from Confluent Cloud Metric API on port 2122

Confluent Connector

ArgoCD Metrics

https://argo-cd.readthedocs.io/en/stable/operator-manual/metrics/

Capture the consumer offset lag for each consumer in the different environments and alert when the lag is more than 100.

Alert Manager

prometheus/alertmanager

The Alert Manager handles alerts sent by client applications such as the Prometheus server.

It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.
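The deduplicating and grouping steps can be pictured like this: alerts with an identical label set collapse into one, and the remainder are bucketed by a chosen set of labels so that a mass outage produces one notification per group. This is an illustrative sketch rather than Alertmanager's implementation, and the alert labels are made up:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert is identified by its label set


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Drop duplicate alerts, then bucket the rest by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    seen = set()
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # same label set already received: deduplicate
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical incoming alerts during an outage.
incoming = [
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "a"},
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "b"},
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "a"},  # duplicate
    {"alertname": "DiskFull", "cluster": "prod", "instance": "a"},
]
for key, members in group_alerts(incoming, ["alertname", "cluster"]).items():
    print(key, len(members))
```

Here four incoming alerts become two notifications, one for InstanceDown covering two instances and one for DiskFull.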

Alert Manager can be run in HA mode to ensure alerts are not missed.

install_prometheus_alertmanager.sh

Push Gateway

prometheus/pushgateway

install_prometheus_push_gateway.sh
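A short-lived job pushes by sending its metrics in the text format to `/metrics/job/<job name>` on the Push Gateway. A minimal sketch that only builds the request, assuming a gateway on localhost:9091 and a job name without slashes; the metric and job names are hypothetical:

```python
from typing import Dict
from urllib.parse import quote
from urllib.request import Request


def build_push_request(gateway: str, job: str, metrics: Dict[str, float]) -> Request:
    """Build (but do not send) a request pushing simple unlabelled metrics for a job."""
    # One "name value" line per metric, each terminated by a newline.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job)}"
    return Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",  # PUT replaces all metrics previously pushed for this job group
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


req = build_push_request(
    "http://localhost:9091",
    "nightly_backup",
    {"backup_last_success_timestamp_seconds": 1700000000},
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` at the end of the job sends the push; the Prometheus server then scrapes the gateway on port 9091 like any other target.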

PromLens

https://promlens.com/

prometheus/promlens

Web-based query builder for PromQL.

Quickly install the binary using DevOps-Bash-tools:

install_promlens.sh