Prometheus is an open-source systems monitoring and alerting toolkit that follows a pull-based model. Unlike traditional monitoring tools that use a push model, Prometheus scrapes metrics from instrumented jobs at configured intervals.
Key differences:
- Pull vs. push: Prometheus initiates scrapes over HTTP, so targets only need to expose a /metrics endpoint
- Target health is visible by design: a failed scrape sets the automatically generated up metric to 0
- Short-lived batch jobs that cannot be scraped push their metrics to an intermediary (the Pushgateway) instead
Example of a Prometheus scrape configuration:
scrape_configs:
- job_name: 'node'
scrape_interval: 15s
static_configs:
- targets: ['node-exporter:9100']
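Once the target has been scraped at least once, you can confirm it in the expression browser with the automatically generated up metric; a minimal check, assuming the job name above:
# 1 when the last scrape of a target in the 'node' job succeeded, 0 when it failed
up{job="node"}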
Prometheus uses a multi-dimensional data model where time series are identified by a metric name and a set of key-value pairs called labels.
Structure: <metric_name>{<label_name>=<label_value>, ...}
Labels enable:
- Filtering and selecting specific time series in queries
- Aggregating and grouping series (for example by instance, job, or endpoint)
- Adding new dimensions without creating new metric names
Example:
http_requests_total{status="200", method="GET", handler="/api/users"}
This represents the total count of HTTP requests with status 200, method GET, to the /api/users endpoint.
In PromQL, you can query and filter using these labels:
http_requests_total{status="200", method="GET"} # All 200 GET requests
http_requests_total{handler=~"/api/.*"} # All requests to API endpoints
Exporters are applications that convert metrics from existing systems into the Prometheus format, making them available for scraping. They expose an HTTP endpoint (usually /metrics) with data in the Prometheus text-based format.
Common exporters:
- Node Exporter: hardware and OS metrics from Linux hosts
- Blackbox Exporter: HTTP, TCP, ICMP, and DNS probing of endpoints
- MySQL, PostgreSQL, and Redis exporters: database metrics
- cAdvisor: container resource metrics
Example of Node Exporter metrics:
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 8570.6
node_cpu_seconds_total{cpu="0",mode="system"} 1239.02
node_cpu_seconds_total{cpu="0",mode="user"} 2873.21
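Raw counters such as node_cpu_seconds_total are usually turned into rates before display; one common derived query, a sketch based on the idle-mode counter shown above:
# Per-instance CPU utilization (%) over the last 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)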
To configure Prometheus to scrape a new target, you add a new job to the scrape_configs section in the prometheus.yml file:
scrape_configs:
- job_name: 'api-service'
scrape_interval: 10s # Override the global default
metrics_path: '/metrics' # Default is /metrics
scheme: 'http' # Default is http
static_configs:
- targets: ['api-service:8080', 'api-service-replica:8080']
labels:
environment: 'production'
team: 'backend'
For dynamic environments, you can use service discovery:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
After updating the configuration, reload Prometheus:
curl -X POST http://localhost:9090/-/reload
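The /-/reload endpoint is only enabled when Prometheus is started with --web.enable-lifecycle; without that flag you can signal the process instead:
# Alternative: ask Prometheus to reload its configuration via SIGHUP
kill -HUP $(pidof prometheus)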
Counters: Monotonically increasing values that only go up (or reset to zero). Used for counts like requests, errors, or completed tasks.
# Example: Total HTTP requests
http_requests_total{method="GET"} 12345
Query rate: rate(http_requests_total[5m])
Gauges: Values that can go up and down, representing current state. Used for metrics like temperature, memory usage, or concurrent connections.
# Example: Current memory usage
node_memory_MemAvailable_bytes 2.1470634e+09
Query directly: node_memory_MemAvailable_bytes
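Because gauges move in both directions, counter functions like rate() do not apply; typical gauge queries use instant values or the *_over_time family, for example:
# Average and net change of available memory over the last hour
avg_over_time(node_memory_MemAvailable_bytes[1h])
delta(node_memory_MemAvailable_bytes[1h])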
Histograms: Sample observations and count them in configurable buckets, also providing a sum of all observed values. Used for request durations or response sizes.
# Example: HTTP request duration histogram
http_request_duration_seconds_bucket{le="0.1"} 12342 # requests under 100ms
http_request_duration_seconds_bucket{le="0.5"} 12951 # requests under 500ms
http_request_duration_seconds_bucket{le="1"} 13001 # requests under 1s
http_request_duration_seconds_bucket{le="+Inf"} 13005 # all requests
http_request_duration_seconds_sum 893.2 # sum of all durations
http_request_duration_seconds_count 13005 # count of all observations
Query 95th percentile: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
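The _sum and _count series of the same histogram also give you an average latency without touching the buckets; a common companion query:
# Average request duration over the last 5 minutes
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])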
Summaries: Similar to histograms but calculate quantiles on the client side. Used when exact quantiles are needed.
# Example: Request duration summary
http_request_duration_seconds{quantile="0.5"} 0.052 # 50th percentile
http_request_duration_seconds{quantile="0.9"} 0.564 # 90th percentile
http_request_duration_seconds{quantile="0.99"} 1.2 # 99th percentile
http_request_duration_seconds_sum 893.2 # sum of all durations
http_request_duration_seconds_count 13005 # count of all observations
PromQL (Prometheus Query Language) is Prometheus's functional query language that lets you select and aggregate time series data in real time.
To calculate an HTTP request error rate (percentage of 5xx responses):
# Error rate as a percentage over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Breaking this down:
- rate(http_requests_total{status=~"5.."}[5m]): calculate the per-second rate of 5xx errors over 5 minutes
- sum(...): sum these rates across all instances/services
- sum(rate(http_requests_total[5m])): calculate the total request rate
For a more specific example, to get the error rate for a specific API endpoint:
sum(rate(http_requests_total{handler="/api/users", status=~"5.."}[5m])) /
sum(rate(http_requests_total{handler="/api/users"}[5m])) * 100
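To see which endpoints contribute most to the errors, the same ratio can be grouped by a label; a sketch using the handler label from the earlier examples:
# Error rate (%) per handler over the last 5 minutes
sum by (handler) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (handler) (rate(http_requests_total[5m])) * 100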
Alerting in Prometheus is configured in two parts: alerting rules evaluated by Prometheus itself, and routing/notification handled by Alertmanager.
Step 1: Define alert rules in a file (e.g., alerts.yml):
groups:
- name: example
rules:
- alert: HighRequestLatency
expr: job:request_latency_seconds:mean5m{job="api-server"} > 0.5
for: 10m
labels:
severity: critical
team: backend
annotations:
summary: "High request latency on {{ $labels.job }}"
description: "{{ $labels.job }} has a request latency above 500ms (current value: {{ $value }}s)"
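Before reloading, it is worth validating the rule file with promtool, which ships with Prometheus:
promtool check rules alerts.yml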
Step 2: Include this file in Prometheus configuration:
rule_files:
- "alerts.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
Step 3: Configure Alertmanager (alertmanager.yml):
route:
group_by: ['alertname', 'job']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'team-emails'
routes:
- match:
severity: critical
receiver: 'pager'
receivers:
- name: 'team-emails'
email_configs:
- to: 'team@example.com'
- name: 'pager'
pagerduty_configs:
- service_key: '<pagerduty-service-key>'
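Alertmanager ships with its own validator, amtool, which catches routing and receiver mistakes before a reload:
amtool check-config alertmanager.yml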
This setup will:
- Group related alerts by alertname and job, waiting 30s to batch alerts in the same group
- Send alerts labeled severity: critical to PagerDuty
- Send everything else to the team email receiver
- Re-send notifications for unresolved alerts every 4 hours
Prometheus uses a custom time-series database (TSDB) optimized for storing metrics with high write and query performance.
Storage Mechanism:
- Incoming samples are first appended to a write-ahead log (WAL) for crash recovery
- Recent data lives in an in-memory head block and is flushed to disk as 2-hour blocks
- Older blocks are compacted into larger blocks over time
Retention Configuration:
Retention is controlled by command-line flags on the Prometheus server rather than in prometheus.yml:
# Flags passed to the prometheus binary
--storage.tsdb.path=/data
--storage.tsdb.retention.time=15d    # How long to keep data
--storage.tsdb.retention.size=50GB   # Optional max storage size
By default, Prometheus keeps data for 15 days. You can adjust this based on:
- Available disk space and ingestion rate
- How far back dashboards and ad-hoc queries typically look
- Compliance or auditing requirements
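Prometheus exposes its own TSDB metrics, which help judge whether retention or series cardinality needs attention; for example:
prometheus_tsdb_head_series            # number of active series in the in-memory head block
prometheus_tsdb_storage_blocks_bytes   # bytes used by persisted blocks on disk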
For long-term storage, you can use remote write/read to send data to systems like:
- Thanos
- Cortex / Grafana Mimir
- VictoriaMetrics
- InfluxDB
Example remote write configuration:
remote_write:
- url: "http://remote-storage:9201/write"
queue_config:
max_samples_per_send: 10000
capacity: 500000
max_shards: 30
There are several approaches to implement high availability for Prometheus:
1. Multiple Prometheus Instances with Thanos
Thanos extends Prometheus with high availability and long-term storage:
# prometheus.yml with Thanos sidecar
global:
external_labels:
replica: replica-1
region: us-west
# Optional: ship samples to Thanos Receive via remote write (an alternative to the sidecar upload model)
remote_write:
- url: "http://thanos-receive:19291/api/v1/receive"
Thanos components:
- Sidecar: runs next to each Prometheus, uploads blocks to object storage and serves recent data
- Querier: fans out queries across sidecars and stores, deduplicating replicas
- Store Gateway: serves historical data from object storage
- Compactor: compacts and downsamples blocks in object storage
- Receive: accepts remote-write traffic (used instead of the sidecar upload path)
- Ruler: evaluates recording and alerting rules against the global view
2. Prometheus Federation
Have a global Prometheus instance that scrapes metrics from multiple Prometheus instances:
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="node"}'
- '{__name__=~"job:.*"}'
static_configs:
- targets:
- 'prometheus-1:9090'
- 'prometheus-2:9090'
3. Cortex or Victoria Metrics
Use a distributed Prometheus-compatible system:
remote_write:
- url: "http://cortex:9009/api/v1/push"
queue_config:
max_samples_per_send: 10000
Key HA Considerations:
- Run at least two identical Prometheus replicas and distinguish them with external_labels (e.g., replica)
- Deduplicate replica data at the query layer (Thanos Querier, Cortex, etc.)
- Run Alertmanager as a cluster so duplicate alerts from replicas are deduplicated
- Keep long-term storage and global querying concerns separate from scraping
Service discovery allows Prometheus to automatically find and monitor targets in dynamic environments like cloud or container platforms, eliminating the need for manual configuration.
Common Service Discovery Mechanisms:
- Kubernetes (kubernetes_sd_configs)
- AWS EC2 (ec2_sd_configs)
- Consul (consul_sd_configs)
- File-based (file_sd_configs)
- DNS, Azure, GCE, and others
The examples below show Kubernetes, EC2, Consul, and file-based discovery in turn.
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_container_port_number]
action: replace
regex: (.+);(.+)
replacement: $1:$2
target_label: __address__
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
scrape_configs:
- job_name: 'ec2-instances'
ec2_sd_configs:
- region: us-west-2
access_key: ACCESS_KEY
secret_key: SECRET_KEY
port: 9100
relabel_configs:
- source_labels: [__meta_ec2_tag_Environment]
regex: production
action: keep
- source_labels: [__meta_ec2_tag_Name]
target_label: instance_name
scrape_configs:
- job_name: 'consul-services'
consul_sd_configs:
- server: 'consul:8500'
services: ['web', 'api']
relabel_configs:
- source_labels: [__meta_consul_service]
target_label: service
- source_labels: [__meta_consul_node]
target_label: node
scrape_configs:
- job_name: 'file-discovery'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.json'
refresh_interval: 5m
With a JSON file like:
[
{
"targets": ["service1:9100", "service2:9100"],
"labels": {
"env": "production",
"job": "node"
}
}
]
Service discovery is essential in environments like Kubernetes where pods come and go frequently. The relabel_configs section is crucial for transforming the discovered metadata into usable target configurations.
Monitoring a Java application with Prometheus involves several steps:
1. Instrument the Java application
Use a library like Micrometer with Prometheus registry:
<!-- Maven dependency -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
<version>1.10.2</version>
</dependency>
2. Configure the metrics in your Java code:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.prometheus.client.exporter.HTTPServer;
import java.io.IOException;
public class Application {
public static void main(String[] args) throws IOException {
// Create a Prometheus registry
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
// Create metrics
Counter requestCounter = registry.counter("app_requests_total", "status", "success");
Timer requestTimer = registry.timer("app_request_duration_seconds", "endpoint", "/api");
// Increment counter in your code
requestCounter.increment();
// Time operations
requestTimer.record(() -> {
// Your operation here
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
});
// Start an HTTP server to expose metrics
HTTPServer server = new HTTPServer.Builder()
.withPort(8080)
.withRegistry(registry.getPrometheusRegistry()) // HTTPServer expects the underlying simpleclient CollectorRegistry
.build();
}
}
For Spring Boot applications it's even simpler: add spring-boot-starter-actuator and micrometer-registry-prometheus to the classpath and set management.endpoints.web.exposure.include=prometheus; the /actuator/prometheus endpoint is then auto-configured without any special annotation:
@SpringBootApplication
public class Application {
public static void main(String[] args) {
SpringApplication.run(Application.class, args);
}
}
3. Configure Prometheus to scrape the Java application:
scrape_configs:
- job_name: 'java-app'
metrics_path: '/actuator/prometheus' # For Spring Boot
static_configs:
- targets: ['java-app:8080']
4. Key metrics to monitor:
JVM metrics:
- jvm_memory_used_bytes, jvm_memory_max_bytes
- jvm_gc_collection_seconds_count, jvm_gc_collection_seconds_sum
- jvm_threads_states_threads
Application metrics:
- http_server_requests_seconds_count
- http_server_requests_seconds_sum
- http_server_requests_seconds_count{status="5xx"}
Custom business metrics:
- Domain-specific counters, gauges, and timers you register yourself (e.g., orders processed, queue depth)
5. JMX Exporter (alternative approach)
For legacy applications, use the JMX Exporter:
java -javaagent:./jmx_prometheus_javaagent-0.16.1.jar=8080:config.yaml -jar your-application.jar
With a config like:
---
startDelaySeconds: 0
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
rules:
- pattern: ".*"
Recording rules in Prometheus allow you to precompute frequently used or computationally expensive expressions and save their results as new time series. This improves query performance and reduces the load on the Prometheus server.
Configuration Example:
# In prometheus.yml
rule_files:
- "recording_rules.yml"
# In recording_rules.yml
groups:
- name: example
interval: 5s
rules:
- record: job:http_requests_total:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
- record: instance:node_cpu_utilization:avg5m
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: job:request_latency_seconds:p95
expr: histogram_quantile(0.95, sum(rate(request_latency_bucket[5m])) by (job, le))
When to Use Recording Rules:
For expensive queries: Calculations like percentiles over large datasets
# Original expensive query
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
# Recording rule
job:http_request_duration_seconds:p95_5m
For frequently used queries: Dashboards that many users access
# Original query used in multiple dashboards
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Recording rule
job:http_requests_error_ratio:5m
For alert conditions: To ensure alerts evaluate quickly
# Alert based on recording rule
- alert: HighErrorRate
expr: job:http_requests_error_ratio:5m > 0.05
for: 10m
For aggregations across multiple instances:
# Recording rule aggregating across instances
record: job:node_memory_utilization:avg
expr: avg by (job) (node_memory_used_bytes / node_memory_total_bytes)
Best practices:
- Name recorded series using the level:metric_name:operations convention
- Monitor rule evaluation performance via prometheus_rule_evaluation_duration_seconds
When Prometheus isn't scraping a target, follow this systematic troubleshooting approach:
1. Check the Targets page in the Prometheus UI (Status > Targets, or http://prometheus:9090/targets) for the target's state and last scrape error
2. Verify the target is accessible
Test direct access to the metrics endpoint:
curl http://target-host:port/metrics
Check for network connectivity issues:
telnet target-host port
3. Check Prometheus configuration
Verify the target is correctly defined in prometheus.yml:
scrape_configs:
- job_name: 'my-service'
static_configs:
- targets: ['my-service:8080']
Validate the configuration:
promtool check config prometheus.yml
4. Check for relabeling issues
If using relabeling, check if it's dropping your target:
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
regex: my-app
action: keep
5. Check service discovery
For Kubernetes, check if the pod has the right annotations:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
6. Check for TLS/authentication issues
If using TLS, verify certificates:
scheme: https
tls_config:
ca_file: /path/to/ca.crt
cert_file: /path/to/client.crt
key_file: /path/to/client.key
insecure_skip_verify: false
7. Check Prometheus logs
Look for scrape-related errors:
grep "scrape" /var/log/prometheus/prometheus.log
8. Check resource constraints
Check if scrape timeouts are occurring due to slow target responses:
scrape_timeout: 15s # Increase if needed
9. Use debug metrics
Check Prometheus's own metrics about scrapes:
scrape_samples_scraped{job="my-service"} # Should be > 0
scrape_duration_seconds{job="my-service"} # Duration of scrapes
up{job="my-service"} # 1 if up, 0 if down
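A quick way to list everything Prometheus currently considers down across all jobs:
# Targets whose last scrape failed, grouped by job
count by (job) (up == 0)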
Example of a complete troubleshooting session:
# 1. Check if the target is accessible
curl http://api-service:8080/metrics
# Result: Connection refused
# 2. Check network connectivity
telnet api-service 8080
# Result: Unable to connect
# 3. Check if the service is running
kubectl get pods | grep api-service
# Result: api-service-5d4d7b9f68-2jlmn 0/1 CrashLoopBackOff 5 10m
# 4. Check pod logs
kubectl logs api-service-5d4d7b9f68-2jlmn
# Result: Error binding to port 8080: address already in use
# 5. Fix the port conflict and restart the pod
kubectl delete pod api-service-5d4d7b9f68-2jlmn
# 6. Verify Prometheus can now scrape the target
curl 'http://prometheus:9090/api/v1/query?query=up{job="api-service"}'
# Result: {"status":"success","data":{"resultType":"vector","result":[{"metric":{"job":"api-service","instance":"api-service:8080"},"value":[1619712345.123,"1"]}]}}
Blackbox monitoring tests external behavior of a system from the outside, like checking if a website is up or an API is responding correctly. Prometheus's Blackbox Exporter is designed for this purpose.
Step 1: Deploy the Blackbox Exporter
# docker-compose.yml example
services:
blackbox-exporter:
image: prom/blackbox-exporter:latest
ports:
- "9115:9115"
volumes:
- ./blackbox.yml:/etc/blackbox_exporter/config.yml
Step 2: Configure the Blackbox Exporter
# blackbox.yml
modules:
http_2xx: # Simple HTTP probe
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
no_follow_redirects: false
fail_if_ssl: false
fail_if_not_ssl: false
preferred_ip_protocol: "ip4"
http_post_2xx: # HTTP POST probe
prober: http
timeout: 5s
http:
method: POST
headers:
Content-Type: application/json
body: '{"test": "test"}'
tcp_connect: # TCP connection probe
prober: tcp
timeout: 5s
icmp: # Ping probe
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
ssl: # SSL/TLS probe
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
fail_if_not_ssl: true
tls_config:
insecure_skip_verify: false
Step 3: Configure Prometheus to use the Blackbox Exporter
# prometheus.yml
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx] # Use the http_2xx module
static_configs:
- targets:
- https://example.com # Target to probe
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115 # The blackbox exporter's address
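You can also exercise a module by hand before wiring it into Prometheus; the exporter's /probe endpoint takes the target and module as query parameters:
curl 'http://blackbox-exporter:9115/probe?target=https://example.com&module=http_2xx'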
Step 4: Create alerts for blackbox monitoring
# alerts.yml
groups:
- name: blackbox
rules:
- alert: EndpointDown
expr: probe_success == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Endpoint {{ $labels.instance }} down"
description: "Endpoint {{ $labels.instance }} has been down for more than 5 minutes."
- alert: SlowResponse
expr: probe_duration_seconds > 1
for: 10m
labels:
severity: warning
annotations:
summary: "Slow response from {{ $labels.instance }}"
description: "{{ $labels.instance }} has response time > 1s for more than 10 minutes (current value: {{ $value }}s)"
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
for: 5m
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon for {{ $labels.instance }}"
description: "SSL certificate for {{ $labels.instance }} expires in less than 30 days"
Step 5: Create a Grafana dashboard for blackbox monitoring
Key metrics to monitor:
- probe_success: 1 if the probe was successful, 0 if it failed
- probe_duration_seconds: duration of the probe
- probe_http_status_code: HTTP status code returned
- probe_ssl_earliest_cert_expiry: timestamp of certificate expiration
- probe_http_version: HTTP version used
Example PromQL queries for the dashboard:
# Uptime percentage over 24 hours
avg_over_time(probe_success[24h]) * 100
# Response time over time
probe_duration_seconds
# SSL certificate expiration in days
(probe_ssl_earliest_cert_expiry - time()) / 86400
This setup allows you to monitor:
- Endpoint availability and uptime
- Response latency from the outside
- Returned HTTP status codes
- SSL certificate expiry
Implementing custom metrics in a Node.js application involves using the prom-client library to expose metrics in the Prometheus format.
Step 1: Install the required packages
npm install prom-client express
Step 2: Set up the metrics in your Node.js application
const express = require('express');
const client = require('prom-client');
// Create a Registry to register the metrics
const register = new client.Registry();
// Collect default Node.js runtime metrics, prefixed and tagged with an app label
client.collectDefaultMetrics({
register,
prefix: 'node_',
labels: { app: 'node-application' }
});
// Create custom metrics
const httpRequestCounter = new client.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
registers: [register]
});
const httpRequestDuration = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
registers: [register]
});
const activeConnections = new client.Gauge({
name: 'http_active_connections',
help: 'Number of active HTTP connections',
registers: [register]
});
// Create an Express app
const app = express();
const PORT = process.env.PORT || 3000;
// Middleware to track request duration and count
app.use((req, res, next) => {
// Increment active connections
activeConnections.inc();
// Track request start time
const start = Date.now();
// Track when response is finished
res.on('finish', () => {
// Record request duration
const duration = (Date.now() - start) / 1000;
httpRequestDuration.observe(
{ method: req.method, route: req.path, status_code: res.statusCode },
duration
);
// Count the request
httpRequestCounter.inc({
method: req.method,
route: req.path,
status_code: res.statusCode
});
// Decrement active connections
activeConnections.dec();
});
next();
});
// Business logic routes
app.get('/', (req, res) => {
res.send('Hello World!');
});
app.get('/api/users', (req, res) => {
// Simulate database query
setTimeout(() => {
res.json({ users: ['user1', 'user2', 'user3'] });
}, 200);
});
// Expose metrics endpoint for Prometheus
app.get('/metrics', async (req, res) => {
res.setHeader('Content-Type', register.contentType);
res.send(await register.metrics());
});
// Start the server
app.listen(PORT, () => {
console.log(`Server listening on port ${PORT}`);
});
Step 3: Configure Prometheus to scrape your Node.js application
scrape_configs:
- job_name: 'nodejs'
scrape_interval: 5s
static_configs:
- targets: ['nodejs-app:3000']
Step 4: Create a custom dashboard in Grafana
Key metrics to monitor:
- Request rate: rate(http_requests_total[5m])
- Error ratio: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
- 95th percentile latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
- Active connections: http_active_connections
Example of adding business metrics:
// Create business metrics
const orderCounter = new client.Counter({
name: 'business_orders_total',
help: 'Total number of orders',
labelNames: ['status', 'payment_method'],
registers: [register]
});
const orderValue = new client.Histogram({
name: 'business_order_value_dollars',
help: 'Value of orders in dollars',
buckets: [10, 50, 100, 500, 1000, 5000],
registers: [register]
});
// Use in your order processing route
app.post('/api/orders', (req, res) => {
// Process order
const order = processOrder(req.body);
// Record metrics
orderCounter.inc({
status: order.status,
payment_method: order.paymentMethod
});
orderValue.observe(order.totalAmount);
res.status(201).json(order);
});
This implementation provides:
- Default Node.js runtime metrics (event loop, GC, memory)
- Request counts, durations, and active connections via the middleware
- A /metrics endpoint for Prometheus to scrape
- A pattern for adding domain-specific business metrics
Grafana is an open-source analytics and visualization platform that integrates with various data sources to create dashboards and alerts. It complements Prometheus in several key ways:
How Grafana complements Prometheus:
- Richer, shareable dashboards and visualizations than the built-in Prometheus UI
- Mixing Prometheus with other data sources (logs, traces, SQL) on one dashboard
- Template variables for interactive filtering
- Team- and organization-based access control
- Its own alerting and notification channels
Setting up Grafana with Prometheus:
docker run -d -p 3000:3000 --name grafana grafana/grafana
Add Prometheus as a data source (URL: http://prometheus:9090)
Example dashboard JSON with two panels:
{
"panels": [
{
"title": "CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Memory Usage",
"type": "gauge",
"datasource": "Prometheus",
"targets": [
{
"expr": "node_memory_MemUsed_bytes / node_memory_MemTotal_bytes * 100"
}
],
"options": {
"thresholds": [
{ "value": 75, "color": "yellow" },
{ "value": 90, "color": "red" }
]
}
}
]
}
Example of a complete monitoring stack:
# docker-compose.yml
version: '3'
services:
prometheus:
image: prom/prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
command:
- '--config.file=/etc/prometheus/prometheus.yml'
grafana:
image: grafana/grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana-storage:/var/lib/grafana
depends_on:
- prometheus
node-exporter:
image: prom/node-exporter
ports:
- "9100:9100"
volumes:
grafana-storage:
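A minimal prometheus.yml to pair with this compose file could look like the following (a sketch assuming the service names above):
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']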
With this setup, Grafana provides:
- Dashboards and visualizations on top of the Prometheus data
- Node Exporter host metrics once a dashboard is imported
- Alerting and notification channels
- User and team management for shared access
Grafana variables allow you to create dynamic, interactive dashboards where users can filter data without editing the dashboard. Here's how to implement them:
Step 1: Create a new dashboard
Step 2: Add dashboard variables
Types of variables:
- Query: values pulled from a data source query
- Interval: time spans such as 1m, 5m, 1h
- Custom: a hand-maintained list of values
- Text box: free-form user input
- Constant, Data source, and Ad hoc filters
# Configuration
Name: node
Label: Node
Data source: Prometheus
Query: label_values(node_cpu_seconds_total, instance)
Regex: /(.*):.*/
Sort: Alphabetical (asc)
Multi-value: Enabled
Include All option: Enabled
# Configuration
Name: interval
Label: Interval
Values: 1m,5m,10m,30m,1h,6h,12h,1d
# Configuration
Name: environment
Label: Environment
Values: production, staging, development
Multi-value: Enabled
Include All option: Enabled
# Configuration
Name: threshold
Label: Alert Threshold
Default value: 80
# First variable
Name: app
Label: Application
Query: label_values(application)
# Second variable (depends on app)
Name: service
Label: Service
Query: label_values(app_metrics{application="$app"}, service)
Step 3: Use variables in panel queries
# CPU usage by selected node
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",instance=~"$node"}[$interval])) * 100)
# Memory usage filtered by environment
node_memory_used_bytes{env="$environment"} / node_memory_total_bytes{env="$environment"} * 100
# Alert threshold from user input
node_memory_used_bytes / node_memory_total_bytes * 100 > $threshold
Step 4: Create a dashboard with multiple variable-based panels
{
"title": "Node Exporter Dashboard",
"templating": {
"list": [
{
"name": "node",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(node_cpu_seconds_total, instance)",
"regex": "/(.*?):.*/",
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"auto": false,
"values": ["1m", "5m", "10m", "30m", "1h", "6h", "12h", "1d"]
},
{
"name": "disk",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(node_filesystem_avail_bytes{instance=~\"$node\"}, mountpoint)",
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"title": "CPU Usage for $node",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$node\"}[$interval])) * 100)",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Memory Usage for $node",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes{instance=~\"$node\"} - node_memory_MemFree_bytes{instance=~\"$node\"} - node_memory_Cached_bytes{instance=~\"$node\"} - node_memory_Buffers_bytes{instance=~\"$node\"}) / node_memory_MemTotal_bytes{instance=~\"$node\"} * 100",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Disk Usage for $disk on $node",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$node\",mountpoint=~\"$disk\"} * 100) / node_filesystem_size_bytes{instance=~\"$node\",mountpoint=~\"$disk\"})",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
}
]
}
Advanced variable techniques:
# Configuration
Data source: Prometheus
Filter key field: label
Filter value field: value
Built-in variables:
- $__interval: calculated based on the time range
- $__from and $__to: dashboard time range
- $__name: panel name
Multi-value formatting options:
- glob: values with glob (*, ?) patterns
- regex: values with regex patterns
- pipe: values separated by |
- distributed: multiple separate queries
This approach creates highly interactive dashboards where users can:
- Switch between nodes, environments, and disks without editing panels
- Change the aggregation interval on the fly
- Adjust thresholds through simple inputs
Grafana provides a robust alerting system that can notify you when metrics meet certain conditions. Here's how to set it up:
Step 1: Configure Alerting
Navigate to Alerting in the Grafana UI:
Step 2: Create Alert Rules
For Grafana 8+ (New alerting system):
1. Click "New alert rule"
2. Define the query:
- Data source: Prometheus
- PromQL: rate(node_cpu_seconds_total{mode="idle"}[5m]) < 0.1
3. Set conditions:
- Condition: When last() OF query(A) IS BELOW 0.1
- For: 5m (duration before alerting)
4. Add labels:
- severity: critical
- category: system
5. Add annotations:
- summary: High CPU usage on {{$labels.instance}}
- description: CPU idle is below 10% for 5 minutes
6. Set evaluation interval: 1m
For Grafana 7 and earlier (Classic alerting):
1. Edit a Graph panel
2. Go to the Alert tab
3. Click "Create Alert"
4. Set conditions:
- WHEN avg() OF query(A,5m,now) IS BELOW 0.1
- For: 5m
5. Add notifications:
- Send to: CPU Alert
- Message: High CPU usage detected
Step 3: Configure Notification Channels
Grafana supports multiple notification channels:
Name: Email Alerts
Type: Email
Addresses: team@example.com, oncall@example.com
Name: Slack Alerts
Type: Slack
Webhook URL: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX
Channel: #alerts
Name: PagerDuty Critical
Type: PagerDuty
Integration Key: your_pagerduty_integration_key
Severity: critical
Name: Custom Webhook
Type: Webhook
URL: https://example.com/alert-webhook
HTTP Method: POST
Username: webhook_user
Password: webhook_password
Name: MS Teams
Type: Microsoft Teams
Webhook URL: https://outlook.office.com/webhook/...
Name: Google Chat
Type: Google Hangouts Chat
Webhook URL: https://chat.googleapis.com/v1/spaces/...
Name: Telegram Alerts
Type: Telegram
BOT API Token: your_telegram_bot_token
Chat ID: your_chat_id
Name: OpsGenie
Type: OpsGenie
API Key: your_opsgenie_api_key
Alert API URL: https://api.opsgenie.com/v2/alerts
Step 4: Create Alert Notification Policies (Grafana 8+)
1. Go to Alerting > Notification policies
2. Create a new policy:
- Name: High Priority
- Matching labels: severity=critical
- Contact points: PagerDuty Critical
- Group by: [alertname, instance]
- Timing options:
- Group wait: 30s
- Group interval: 5m
- Repeat interval: 4h
Step 5: Test Alerts
1. Go to Alerting > Alert rules
2. Find your alert
3. Click "Test" button
4. Select a notification channel
5. Click "Send test notification"
Example of a complete alert setup in Grafana API format:
{
"dashboard": {
"panels": [
{
"title": "CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
],
"alert": {
"name": "High CPU Usage",
"message": "CPU usage is above 90% for 5 minutes",
"handler": 1,
"frequency": "60s",
"conditions": [
{
"type": "query",
"query": { "params": ["A", "5m", "now"] },
"reducer": { "type": "avg", "params": [] },
"evaluator": { "type": "gt", "params": [90] },
"operator": { "type": "and" }
}
],
"notifications": [
{ "uid": "slack-notifications" },
{ "uid": "email-notifications" }
]
}
}
]
},
"alerting": {
"alertmanagers": [
{
"name": "Prometheus Alertmanager",
"type": "prometheus-alertmanager",
"uid": "alertmanager1",
"url": "http://alertmanager:9093"
}
],
"contactPoints": [
{
"name": "slack-notifications",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX",
"recipient": "#alerts"
}
},
{
"name": "email-notifications",
"type": "email",
"settings": {
"addresses": "team@example.com"
}
}
]
}
}
Best Practices for Grafana Alerting:
- Use template variables in alert messages and annotations (e.g., {{$labels.instance}}, {{$value}})
)Creating a Grafana dashboard for Kubernetes monitoring involves several steps, from setting up the data collection to building comprehensive visualizations.
Step 1: Set up Prometheus for Kubernetes monitoring
Deploy Prometheus using Helm or the Prometheus Operator:
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
This installs:
- The Prometheus Operator and a Prometheus instance configured for the cluster
- Alertmanager
- node-exporter on every node and kube-state-metrics
- Grafana with a set of prebuilt Kubernetes dashboards
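To reach the bundled Grafana locally you can port-forward its service; the service and secret names below assume the release name prometheus used above and may differ between chart versions:
kubectl port-forward svc/prometheus-grafana 3000:80
kubectl get secret prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d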
Step 2: Create a comprehensive Kubernetes dashboard
If you didn't use the kube-prometheus-stack (which includes dashboards), create your own:
Cluster Overview Dashboard:
{
"title": "Kubernetes Cluster Overview",
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_namespace_status_phase, namespace)",
"multi": true,
"includeAll": true
},
{
"name": "pod",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
"multi": true,
"includeAll": true
},
{
"name": "node",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_node_info, node)",
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"title": "Cluster CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\", image!=\"\"}[5m])) by (namespace)",
"legendFormat": "{{namespace}}"
}
]
},
{
"title": "Cluster Memory Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{container!=\"\", image!=\"\"}) by (namespace)",
"legendFormat": "{{namespace}}"
}
],
"yaxes": [
{
"format": "bytes"
}
]
},
{
"title": "Pod Status",
"type": "stat",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(kube_pod_status_phase{phase=\"Running\"})",
"legendFormat": "Running"
},
{
"expr": "sum(kube_pod_status_phase{phase=\"Pending\"})",
"legendFormat": "Pending"
},
{
"expr": "sum(kube_pod_status_phase{phase=\"Failed\"})",
"legendFormat": "Failed"
}
],
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto"
}
},
{
"title": "Node Status",
"type": "gauge",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"Ready\", status=\"true\"}) / count(kube_node_info) * 100",
"legendFormat": "Ready Nodes"
}
],
"options": {
"thresholds": [
{ "value": 50, "color": "red" },
{ "value": 80, "color": "yellow" },
{ "value": 90, "color": "green" }
],
"min": 0,
"max": 100,
"unit": "percent"
}
},
{
"title": "Pod CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=~\"$namespace\", pod=~\"$pod\", container!=\"\", image!=\"\"}[5m])) by (pod)",
"legendFormat": "{{pod}}"
}
]
},
{
"title": "Pod Memory Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{namespace=~\"$namespace\", pod=~\"$pod\", container!=\"\", image!=\"\"}) by (pod)",
"legendFormat": "{{pod}}"
}
],
"yaxes": [
{
"format": "bytes"
}
]
},
{
"title": "Pod Network I/O",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(container_network_receive_bytes_total{namespace=~\"$namespace\", pod=~\"$pod\"}[5m])) by (pod)",
"legendFormat": "{{pod}} Received"
},
{
"expr": "sum(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\", pod=~\"$pod\"}[5m])) by (pod)",
"legendFormat": "{{pod}} Transmitted"
}
],
"yaxes": [
{
"format": "bytes"
}
]
},
{
"title": "Node CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\", node=~\"$node\"}[5m])) by (node) / on(node) group_left count by (node) (node_cpu_seconds_total{mode=\"idle\"}) * 100",
"legendFormat": "{{node}}"
}
],
"yaxes": [
{
"format": "percent",
"max": 100
}
]
},
{
"title": "Node Memory Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes{node=~\"$node\"} - node_memory_MemAvailable_bytes{node=~\"$node\"}) / node_memory_MemTotal_bytes{node=~\"$node\"} * 100",
"legendFormat": "{{node}}"
}
],
"yaxes": [
{
"format": "percent",
"max": 100
}
]
},
{
"title": "Node Disk Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "(1 - node_filesystem_avail_bytes{node=~\"$node\", mountpoint=\"/\"} / node_filesystem_size_bytes{node=~\"$node\", mountpoint=\"/\"}) * 100",
"legendFormat": "{{node}}"
}
],
"yaxes": [
{
"format": "percent",
"max": 100
}
]
}
]
}
Step 3: Create specialized dashboards for different aspects
1. Pod Resource Dashboard:
Key metrics:
CPU usage vs. requests/limits:
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod"}[5m])) by (container) /
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="cpu"}) by (container)
Memory usage vs. requests/limits:
sum(container_memory_working_set_bytes{namespace="$namespace", pod="$pod"}) by (container) /
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="memory"}) by (container)
Restart count:
kube_pod_container_status_restarts_total{namespace="$namespace", pod="$pod"}
2. Node Resource Dashboard:
Key metrics:
Node capacity vs. allocatable:
kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}
Pods per node:
count(kube_pod_info) by (node)
Node conditions:
kube_node_status_condition{condition="DiskPressure", status="true"}
3. Namespace Resource Dashboard:
Key metrics:
Resource quotas:
kube_resourcequota{namespace="$namespace", type="used"} /
kube_resourcequota{namespace="$namespace", type="hard"}
Deployment status:
kube_deployment_status_replicas_available{namespace="$namespace"} /
kube_deployment_spec_replicas{namespace="$namespace"}
Step 4: Set up alerts for Kubernetes monitoring
groups:
- name: kubernetes-alerts
rules:
- alert: KubernetesPodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{$labels.namespace}}/{{$labels.pod}} is crash looping"
description: "Pod {{$labels.namespace}}/{{$labels.pod}} is restarting {{$value}} times / 15 minutes"
- alert: KubernetesNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{$labels.node}} not ready"
description: "Node {{$labels.node}} has been unready for more than 5 minutes"
- alert: KubernetesPodNotHealthy
expr: min_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[15m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Pod {{$labels.namespace}}/{{$labels.pod}} not healthy"
description: "Pod {{$labels.namespace}}/{{$labels.pod}} has been in a non-ready state for more than 15 minutes"
- alert: KubernetesHighCPUUsage
expr: sum(rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m])) by (namespace, pod, container) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace, pod, container) > 0.9
for: 15m
labels:
severity: warning
annotations:
summary: "High CPU usage in {{$labels.namespace}}/{{$labels.pod}}/{{$labels.container}}"
description: "Container {{$labels.container}} in pod {{$labels.namespace}}/{{$labels.pod}} is using more than 90% of its CPU limit for 15 minutes"
Step 5: Create a Kubernetes Events Dashboard
# Query to show Kubernetes events
label_replace(
kube_event_info,
"reason",
"$1",
"reason",
"(.+)"
)
This comprehensive approach provides:
- Cluster-wide and per-namespace resource views
- Pod-level CPU, memory, and network usage against requests/limits
- Node capacity and health visibility
- Alerts for crash loops, unready nodes, unhealthy pods, and CPU saturation
Annotations in Grafana allow you to mark points in time with rich events that add context to your metrics. They help correlate metrics with deployments, incidents, or other significant events.
Types of Annotations:
- Manual annotations added directly on a graph
- Query-based annotations driven by a data source (Prometheus, Loki, etc.)
- Annotations created programmatically through the HTTP API
- The built-in Annotations & Alerts query that records alert state changes
Step 1: Adding manual annotations
To add a manual annotation, Ctrl/Cmd+click (or click and drag for a time range) on a graph panel at the relevant point in time and fill in the details:
Description: Deployed v2.3.4
Tags: deployment, frontend
Step 2: Configure annotation queries
In dashboard settings, open the Annotations section and add a new annotation query:
For a Prometheus annotation query:
Name: Deployments
Data source: Prometheus
Query: changes(version_info{app="my-app"}[1m]) > 0
Step: 60s
Title: $tag_version
Text: Deployed $tag_version to $tag_environment
Tags: deployment, $tag_environment
For a Loki annotation query:
Name: Error Logs
Data source: Loki
Query: {app="my-app"} |= "error" | logfmt
Title: Error detected
Text: {{.message}}
Tags: error, {{.level}}
Step 3: Create annotations via API
Using curl to create an annotation:
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/annotations \
-d '{
"dashboardId": 1,
"panelId": 2,
"time": 1619712345000,
"timeEnd": 1619712400000,
"tags": ["deployment", "api"],
"text": "Deployed API v1.2.3"
}'
Using a script to automatically create annotations for deployments:
// deploy.js
const axios = require('axios');
async function createDeploymentAnnotation() {
const version = process.env.VERSION;
const environment = process.env.ENVIRONMENT;
await axios.post('http://grafana-host:3000/api/annotations', {
dashboardId: 1,
time: Date.now(),
tags: ['deployment', environment],
text: `Deployed version ${version} to ${environment}`
}, {
headers: {
'Authorization': `Bearer ${process.env.GRAFANA_API_KEY}`
}
});
console.log('Annotation created for deployment');
}
createDeploymentAnnotation().catch(console.error);
Step 4: Configure global annotations
To make annotations visible across all dashboards:
Name: System Maintenance
Data source: Prometheus
Query: changes(maintenance_mode[1m]) > 0
Show in all dashboards: Enabled
Step 5: Using annotations in dashboards
To display annotations in a panel, enable the annotation query in the dashboard's annotation controls; matching events then appear as vertical markers with hover text on time series panels.
Example use cases:
Deployment markers:
Query: changes(kube_deployment_status_observed_generation{deployment="app"}[1m]) > 0
Scaling events:
Query: changes(kube_deployment_spec_replicas{deployment="app"}[1m]) > 0
Configuration changes:
Query: changes(config_hash{app="my-app"}[1m]) > 0
Incident timeline:
Manual annotations with tags: incident, outage
Database maintenance:
Query: mysql_global_status_ongoing_maintenance > 0
Annotations provide crucial context to your metrics, helping teams:
- Correlate deployments and configuration changes with performance regressions
- Reconstruct incident timelines
- Share operational context directly on dashboards
Grafana's role-based access control (RBAC) allows you to manage what users can see and do within the platform. Here's how to implement it:
Step 1: Understanding Grafana's permission levels
Grafana has several built-in roles:
- Grafana Server Admin: manages the whole instance (users, organizations, settings)
- Organization Admin: manages data sources, teams, and permissions within an organization
- Editor: creates and edits dashboards
- Viewer: read-only access to dashboards
Step 2: Configure authentication
First, set up authentication in grafana.ini:
[auth]
# Disable anonymous access
disable_login_form = false
anonymous_enabled = false
# LDAP authentication example
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
# OAuth example (GitHub)
[auth.github]
enabled = true
allow_sign_up = true
client_id = YOUR_GITHUB_APP_CLIENT_ID
client_secret = YOUR_GITHUB_APP_CLIENT_SECRET
scopes = user:email,read:org
auth_url = https://github.com/login/oauth/authorize
token_url = https://github.com/login/oauth/access_token
api_url = https://api.github.com/user
team_ids =
allowed_organizations = your-github-org
Step 3: Configure LDAP integration for team mapping
Create an LDAP configuration file (ldap.toml):
[[servers]]
host = "ldap.example.com"
port = 389
use_ssl = false
bind_dn = "cn=admin,dc=example,dc=com"
bind_password = "admin_password"
search_filter = "(uid=%s)"
search_base_dns = ["ou=users,dc=example,dc=com"]
# Map LDAP groups to Grafana roles
[[servers.group_mappings]]
group_dn = "cn=admins,ou=groups,dc=example,dc=com"
org_role = "Admin"
[[servers.group_mappings]]
group_dn = "cn=editors,ou=groups,dc=example,dc=com"
org_role = "Editor"
[[servers.group_mappings]]
group_dn = "cn=viewers,ou=groups,dc=example,dc=com"
org_role = "Viewer"
Step 4: Create teams and assign permissions
Using the Grafana UI: go to Configuration > Teams, create a team, and add members to it.
Using the API:
# Create a team
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/teams \
-d '{"name": "DevOps Team"}'
# Add user to team
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/teams/1/members \
-d '{"userId": 2}'
Step 5: Set up folder permissions
Using the Grafana UI: open the folder, go to its Permissions tab, and add teams or users with the desired permission level.
Using the API:
# Set folder permissions
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/folders/1/permissions \
-d '{
"items": [
{
"teamId": 1,
"permission": 2
}
]
}'
Permission levels: 1 = View, 2 = Edit, 4 = Admin.
Step 6: Set up dashboard permissions
Using the Grafana UI: open the dashboard's settings, go to Permissions, and add users or teams with View, Edit, or Admin access.
Using the API:
# Set dashboard permissions
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/dashboards/id/1/permissions \
-d '{
"items": [
{
"teamId": 1,
"permission": 1
}
]
}'
Step 7: Configure data source permissions
In Grafana Enterprise, you can restrict access to data sources:
# Set data source permissions (Enterprise only)
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/datasources/1/permissions \
-d '{
"items": [
{
"teamId": 1,
"permission": 1
}
]
}'
Step 8: Implement fine-grained access control (Grafana Enterprise)
For Grafana Enterprise, enable fine-grained access control in grafana.ini:
[feature_toggles]
accesscontrol = true
Create custom roles:
# Create a custom role
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/access-control/roles \
-d '{
"name": "DashboardPublisher",
"description": "Can create and publish dashboards",
"permissions": [
{
"action": "dashboards:create"
},
{
"action": "dashboards:write",
"scope": "folders:*"
}
]
}'
Assign roles to users:
# Assign role to user
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/access-control/users/2/roles \
-d '{
"roleUid": "custom:dashboard-publisher"
}'
Step 9: Implement service accounts for automation
For automated processes:
# Create service account
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/serviceaccounts \
-d '{
"name": "CI/CD Pipeline",
"role": "Editor"
}'
Step 10: Audit user activity
Enable audit logs in grafana.ini:
[log]
filters = audit:debug
This comprehensive RBAC implementation allows you to:
- Map external identity groups (LDAP, OAuth) to Grafana roles
- Scope access with teams plus folder and dashboard permissions
- Restrict data source access (Enterprise)
- Automate safely with service accounts and audit user activity
Optimizing Grafana dashboards is crucial for maintaining responsiveness, especially with large datasets or many concurrent users. Here's a comprehensive approach:
1. Optimize Prometheus Queries
PromQL queries can significantly impact dashboard performance:
# This aggregation is re-evaluated on every refresh of every panel that uses it
sum(rate(http_requests_total[5m]))
# Note: do not "optimize" this to rate(sum(http_requests_total)[5m]) - aggregating counters
# before rate() hides counter resets and produces incorrect results
Use recording rules for expensive calculations:
# prometheus.yml
rule_files:
- "recording_rules.yml"
# recording_rules.yml
groups:
- name: http_requests
interval: 1m
rules:
- record: job:http_requests_total:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
Then use the pre-computed metric in Grafana:
job:http_requests_total:rate5m{job="api"}
2. Adjust Time Range and Resolution
Configure appropriate min/max data points:
Dashboard settings > Data source options:
- Min time interval: 1m
- Max data points: 500
For individual panels:
Panel options > Query options:
- Min interval: 1m
- Max data points: 100
3. Use Template Variables Efficiently
Limit the number of values in multi-value variables:
Variable settings:
- Include All option: Disabled
- Preview of values: 10
Use regex to filter values:
Variable query: label_values(node_cpu_seconds_total, instance)
Regex: /prod-.*/
Cache template variables:
Variable settings:
- Refresh: On dashboard load
4. Optimize Panel Rendering
Reduce the number of series per graph:
# Use aggregation to reduce series
sum by (job) (rate(http_requests_total[5m]))
# Use top/bottom functions
topk(5, sum by (handler) (rate(http_requests_total[5m])))
Use appropriate visualization types:
- Stat/gauge panels for single values instead of full graphs
- Time series graphs only for a limited number of series (use topk to cap them)
- Tables for detailed breakdowns rather than dozens of graph series
5. Implement Dashboard Caching
For Grafana Enterprise, enable dashboard caching:
# grafana.ini
[caching]
enabled = true
For open-source Grafana, use a reverse proxy with caching:
# nginx.conf
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=grafana_cache:10m max_size=1g;
server {
location / {
proxy_pass http://grafana:3000;
proxy_cache grafana_cache;
proxy_cache_valid 200 1m;
proxy_cache_key "$host$request_uri$cookie_grafana_session";
}
}
6. Optimize Data Source Settings
For Prometheus:
# Prometheus data source settings
timeInterval: "1m" # Minimum interval
queryTimeout: "60s"
httpMethod: "POST" # Better for large queries
For other data sources like MySQL:
# MySQL data source settings
maxOpenConns: 10
maxIdleConns: 5
connMaxLifetime: 14400 # 4 hours
7. Use Lazy Loading Panels
Enable lazy loading in dashboard settings:
Dashboard settings > General:
- Lazy loading: Enabled
8. Implement Dashboard Snapshots for Static Data
For historical data that doesn't change:
Dashboard > Share > Snapshot
9. Optimize Alert Rules
Move complex alert evaluations to Prometheus:
# In Prometheus rules
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
This keeps the expensive evaluation in Prometheus instead of in a Grafana alert with a complex query.
10. Implement Dashboard Provisioning
Use provisioned dashboards for consistent performance:
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: 'default'
folder: 'General'
type: file
options:
path: /var/lib/grafana/dashboards
11. Real-world Example: Optimizing a Node Exporter Dashboard
Before optimization:
# Inefficient query showing all CPU cores for all nodes
node_cpu_seconds_total{mode!="idle"}
After optimization:
# More efficient query showing CPU usage by node
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) /
count by (instance) (count by (cpu, instance) (node_cpu_seconds_total)) * 100
Before optimization:
# Showing all filesystem metrics
node_filesystem_avail_bytes
After optimization:
# Only showing filesystems that matter
node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
12. Performance Testing
Use Grafana's built-in rendering metrics:
# In Prometheus
grafana_rendering_request_duration_seconds_sum
grafana_rendering_request_duration_seconds_count
Or use external tools:
# Using Apache Bench to test dashboard load
ab -n 100 -c 10 "http://grafana:3000/d/dashboard-uid/dashboard-name?orgId=1"
By implementing these optimizations, you can achieve:
- Faster dashboard load times, even with many concurrent users
- Lower query load on Prometheus and other data sources
- More predictable alert evaluation
Grafana and Loki together provide a powerful solution for log visualization and analysis. Here's how to set it up and use it effectively:
Step 1: Set up Loki as a data source in Grafana
Add Loki as a data source in Grafana (URL: http://loki:3100).
# Example docker-compose.yml for Loki and Grafana
version: '3'
services:
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log
- ./promtail-config.yaml:/etc/promtail/config.yaml
command: -config.file=/etc/promtail/config.yaml
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana-storage:/var/lib/grafana
volumes:
grafana-storage:
Step 2: Configure Promtail to ship logs to Loki
# promtail-config.yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*log
- job_name: containers
static_configs:
- targets:
- localhost
labels:
job: containerlogs
__path__: /var/lib/docker/containers/*/*log
# Extract container metadata
pipeline_stages:
- json:
expressions:
stream: stream
attrs: attrs
tag: attrs.tag
- regex:
expression: (?P<container_name>(?:[^|]*[^|]))
- labels:
container_name:
Step 3: Create a Logs Dashboard in Grafana
Create a new dashboard with these panels:
1. Log Volume Panel:
# Query
sum(count_over_time({job="containerlogs"}[5m])) by (container_name)
2. Log Explorer Panel:
# Query
{job="containerlogs"} |= "$search"
3. Error Rate Panel:
# Query
sum(count_over_time({job="containerlogs"} |= "error" [5m])) by (container_name)
Step 4: Use LogQL for Advanced Queries
Basic filtering:
# Show all logs containing "error"
{job="containerlogs"} |= "error"
# Show all logs NOT containing "health"
{job="containerlogs"} != "health"
# Combine filters
{job="containerlogs"} |= "error" != "timeout"
Pattern matching:
# Using regex
{job="containerlogs"} |~ "error.*timeout"
# Case insensitive matching
{job="containerlogs"} |~ "(?i)error"
Extracting fields with JSON:
# Parse JSON logs
{job="containerlogs"} | json
# Extract specific fields
{job="containerlogs"} | json field="message"
# Filter on extracted fields
{job="containerlogs"} | json | status_code >= 400
Line formatting:
# Format output
{job="containerlogs"} | json | line_format "{{.status_code}} {{.path}} {{.duration}}ms"
Aggregations:
# Count by status code
sum(count_over_time({job="containerlogs"} | json | status_code>=400 [5m])) by (status_code)
# Average response time
avg(sum_over_time({job="containerlogs"} | json | unwrap duration [5m])) by (path)
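LogQL can also compute quantiles over unwrapped values, which pairs well with the duration field used above (a sketch assuming logs carry a numeric duration field):
# 95th percentile request duration per path over the last 5 minutes
quantile_over_time(0.95, {job="containerlogs"} | json | unwrap duration [5m]) by (path)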
Step 5: Create a comprehensive logging dashboard
{
"title": "Application Logs Dashboard",
"panels": [
{
"title": "Log Volume Over Time",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum(count_over_time({job=\"containerlogs\"}[5m])) by (container_name)",
"legendFormat": "{{container_name}}"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum(count_over_time({job=\"containerlogs\"} |= \"error\" [5m])) by (container_name)",
"legendFormat": "{{container_name}}"
}
]
},
{
"title": "HTTP Status Codes",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum(count_over_time({job=\"containerlogs\"} | json | status_code>=200 [5m])) by (status_code)",
"legendFormat": "{{status_code}}"
}
]
},
{
"title": "Slow Requests (>500ms)",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum(count_over_time({job=\"containerlogs\"} | json | duration>500 [5m])) by (path)",
"legendFormat": "{{path}}"
}
]
},
{
"title": "Log Explorer",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{job=\"containerlogs\"} |= \"$search\"",
"refId": "A"
}
]
}
],
"templating": {
"list": [
{
"name": "container",
"type": "query",
"datasource": "Loki",
"query": "label_values(container_name)",
"multi": true,
"includeAll": true
},
{
"name": "search",
"type": "textbox",
"label": "Search",
"description": "Text to search for in logs"
}
]
},
"time": {
"from": "now-6h",
"to": "now"
}
}
Step 6: Set up log-based alerts
In Grafana 8+:
1. Go to Alerting > New alert rule
2. Configure:
- Name: High Error Rate
- Data source: Loki
- Query: sum(count_over_time({job="containerlogs"} |= "error" [5m])) > 100
- For: 5m
- Notifications: Send to Slack
Step 7: Correlate logs with metrics
Create a dashboard with both metrics and logs:
1. Add a graph panel with Prometheus metrics
- Query: sum(rate(http_requests_total{status=~"5.."}[5m])) by (instance)
2. Add a logs panel below
- Query: {job="containerlogs"} |= "error"
3. Link them with annotations
- Add annotation query from Loki
- Query: {job="containerlogs"} |= "deployed"
Step 8: Implement log retention and aggregation
Configure Loki retention in loki-config.yaml:
limits_config:
retention_period: 168h # 7 days
schema_config:
configs:
- from: 2020-07-01
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
Advanced Techniques:
- Link logs to traces with derived fields
- Detect unusual log patterns with the pattern parser
- Alert on sudden changes in log volume
# In Loki data source settings
Derived fields:
- Name: trace_id
- Regex: "traceID":"(\w+)"
- URL: http://jaeger:16686/trace/${__value.raw}
# Find unusual log patterns
{job="containerlogs"} | pattern
# Alert on sudden log volume increase
abs(
sum(count_over_time({job="containerlogs"}[5m]))
/
sum(count_over_time({job="containerlogs"}[5m] offset 5m))
- 1
) > 0.5
This comprehensive approach provides:
- Centralized log collection with Promtail and Loki
- LogQL-based searching, filtering, and aggregation
- Dashboards that combine logs with Prometheus metrics
- Log-based alerting and correlation with deployments
Multi-tenancy in Grafana allows multiple teams or customers to use the same Grafana instance while keeping their data and dashboards separate. Here's how to implement it:
Step 1: Understand Grafana's multi-tenancy model
Grafana uses "Organizations" as its multi-tenancy unit:
- Each organization has its own dashboards, folders, data sources, and alert rules
- Users can belong to multiple organizations, with a different role in each
- Resources are not shared between organizations by default
Step 2: Configure Grafana for multi-tenancy
Edit grafana.ini:
[users]
# Allow users to sign up
allow_sign_up = true
# Default role for new users
auto_assign_org_role = Viewer
# Allow users to create organizations
allow_org_create = false
[auth]
# Disable anonymous access
disable_login_form = false
anonymous_enabled = false
# Enable multiple organizations
[auth.basic]
enabled = true
[server]
# Enable subpath support if needed
root_url = https://grafana.example.com
Step 3: Create and manage organizations
Using the Grafana UI: as a server admin, go to Server Admin > Organizations and create a new organization.
Using the API:
# Create a new organization
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/orgs \
-d '{"name": "Customer A"}'
# Response: {"orgId":2,"message":"Organization created"}
Step 4: Manage users within organizations
Add users to an organization via UI:
Using the API:
# Add user to organization
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/orgs/2/users \
-d '{"loginOrEmail": "user@example.com", "role": "Editor"}'
# Change user's role in organization
curl -X PATCH \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/orgs/2/users/3 \
-d '{"role": "Admin"}'
# Remove user from organization
curl -X DELETE \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/orgs/2/users/3
Step 5: Switch between organizations
Via UI:
Via API:
# Switch to a different organization
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/user/using/2
Step 6: Configure data source isolation
Each organization has its own data sources. For Prometheus:
# Organization 1 Prometheus
url: http://prometheus-org1:9090
basicAuth: true
basicAuthUser: org1
basicAuthPassword: password1
# Organization 2 Prometheus
url: http://prometheus-org2:9090
basicAuth: true
basicAuthUser: org2
basicAuthPassword: password2
For shared Prometheus with data isolation:
# Organization 1 Prometheus with tenant label
url: http://prometheus:9090
jsonData:
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
httpHeaderValue1: "org1"
# Organization 2 Prometheus with tenant label
url: http://prometheus:9090
jsonData:
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
httpHeaderValue1: "org2"
Step 7: Implement automated organization provisioning
Create a provisioning script:
#!/usr/bin/env python3
import requests
import json
GRAFANA_URL = "http://grafana:3000"
ADMIN_KEY = "admin_api_key"
def create_organization(name):
response = requests.post(
f"{GRAFANA_URL}/api/orgs",
headers={"Authorization": f"Bearer {ADMIN_KEY}"},
json={"name": name}
)
return response.json()["orgId"]
def create_api_key(org_id, name):
# Switch to organization
requests.post(
f"{GRAFANA_URL}/api/user/using/{org_id}",
headers={"Authorization": f"Bearer {ADMIN_KEY}"}
)
# Create API key