Prometheus is an open-source systems monitoring and alerting toolkit that follows a pull-based model. Unlike traditional monitoring tools that use a push model, Prometheus scrapes metrics from instrumented jobs at configured intervals.
Key differences:
- Pull vs. push: Prometheus initiates scrapes over HTTP, so targets only need to expose a /metrics endpoint
- Target health is visible by design: a failed scrape sets the automatically generated up metric to 0
- Short-lived batch jobs that cannot be scraped push their metrics to an intermediary (the Pushgateway) instead
Example of a Prometheus scrape configuration:
scrape_configs:
- job_name: 'node'
scrape_interval: 15s
static_configs:
- targets: ['node-exporter:9100']
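Once the target has been scraped at least once, you can confirm it in the expression browser with the automatically generated up metric; a minimal check, assuming the job name above:
# 1 when the last scrape of a target in the 'node' job succeeded, 0 when it failed
up{job="node"}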
Prometheus uses a multi-dimensional data model where time series are identified by a metric name and a set of key-value pairs called labels.
Structure: <metric_name>{<label_name>=<label_value>, ...}
Labels enable:
- Filtering and selecting specific time series in queries
- Aggregating and grouping series (for example by instance, job, or endpoint)
- Adding new dimensions without creating new metric names
Example:
http_requests_total{status="200", method="GET", handler="/api/users"}
This represents the total count of HTTP requests with status 200, method GET, to the /api/users endpoint.
In PromQL, you can query and filter using these labels:
http_requests_total{status="200", method="GET"} # All 200 GET requests
http_requests_total{handler=~"/api/.*"} # All requests to API endpoints
Exporters are applications that convert metrics from existing systems into the Prometheus format, making them available for scraping. They expose an HTTP endpoint (usually /metrics) with data in the Prometheus text-based format.
Common exporters:
- Node Exporter: hardware and OS metrics from Linux hosts
- Blackbox Exporter: HTTP, TCP, ICMP, and DNS probing of endpoints
- MySQL, PostgreSQL, and Redis exporters: database metrics
- cAdvisor: container resource metrics
Example of Node Exporter metrics:
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 8570.6
node_cpu_seconds_total{cpu="0",mode="system"} 1239.02
node_cpu_seconds_total{cpu="0",mode="user"} 2873.21
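Raw counters such as node_cpu_seconds_total are usually turned into rates before display; one common derived query, a sketch based on the idle-mode counter shown above:
# Per-instance CPU utilization (%) over the last 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)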
To configure Prometheus to scrape a new target, you add a new job to the scrape_configs section in the prometheus.yml file:
scrape_configs:
- job_name: 'api-service'
scrape_interval: 10s # Override the global default
metrics_path: '/metrics' # Default is /metrics
scheme: 'http' # Default is http
static_configs:
- targets: ['api-service:8080', 'api-service-replica:8080']
labels:
environment: 'production'
team: 'backend'
For dynamic environments, you can use service discovery:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
After updating the configuration, reload Prometheus:
curl -X POST http://localhost:9090/-/reload
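The /-/reload endpoint is only enabled when Prometheus is started with --web.enable-lifecycle; without that flag you can signal the process instead:
# Alternative: ask Prometheus to reload its configuration via SIGHUP
kill -HUP $(pidof prometheus)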
Counters: Monotonically increasing values that only go up (or reset to zero). Used for counts like requests, errors, or completed tasks.
# Example: Total HTTP requests
http_requests_total{method="GET"} 12345
Query rate: rate(http_requests_total[5m])
Gauges: Values that can go up and down, representing current state. Used for metrics like temperature, memory usage, or concurrent connections.
# Example: Current memory usage
node_memory_MemAvailable_bytes 2.1470634e+09
Query directly: node_memory_MemAvailable_bytes
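Because gauges move in both directions, counter functions like rate() do not apply; typical gauge queries use instant values or the *_over_time family, for example:
# Average and net change of available memory over the last hour
avg_over_time(node_memory_MemAvailable_bytes[1h])
delta(node_memory_MemAvailable_bytes[1h])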
Histograms: Sample observations and count them in configurable buckets, also providing a sum of all observed values. Used for request durations or response sizes.
# Example: HTTP request duration histogram
http_request_duration_seconds_bucket{le="0.1"} 12342 # requests under 100ms
http_request_duration_seconds_bucket{le="0.5"} 12951 # requests under 500ms
http_request_duration_seconds_bucket{le="1"} 13001 # requests under 1s
http_request_duration_seconds_bucket{le="+Inf"} 13005 # all requests
http_request_duration_seconds_sum 893.2 # sum of all durations
http_request_duration_seconds_count 13005 # count of all observations
Query 95th percentile: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
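The _sum and _count series of the same histogram also give you an average latency without touching the buckets; a common companion query:
# Average request duration over the last 5 minutes
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])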
Summaries: Similar to histograms but calculate quantiles on the client side. Used when exact quantiles are needed.
# Example: Request duration summary
http_request_duration_seconds{quantile="0.5"} 0.052 # 50th percentile
http_request_duration_seconds{quantile="0.9"} 0.564 # 90th percentile
http_request_duration_seconds{quantile="0.99"} 1.2 # 99th percentile
http_request_duration_seconds_sum 893.2 # sum of all durations
http_request_duration_seconds_count 13005 # count of all observations
PromQL (Prometheus Query Language) is Prometheus's functional query language that lets you select and aggregate time series data in real time.
To calculate an HTTP request error rate (percentage of 5xx responses):
# Error rate as a percentage over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Breaking this down:
- rate(http_requests_total{status=~"5.."}[5m]): calculate the per-second rate of 5xx errors over 5 minutes
- sum(...): sum these rates across all instances/services
- sum(rate(http_requests_total[5m])): calculate the total request rate
For a more specific example, to get the error rate for a specific API endpoint:
sum(rate(http_requests_total{handler="/api/users", status=~"5.."}[5m])) /
sum(rate(http_requests_total{handler="/api/users"}[5m])) * 100
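To see which endpoints contribute most to the errors, the same ratio can be grouped by a label; a sketch using the handler label from the earlier examples:
# Error rate (%) per handler over the last 5 minutes
sum by (handler) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (handler) (rate(http_requests_total[5m])) * 100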
Alerting in Prometheus is configured in two parts: alerting rules evaluated by Prometheus itself, and routing/notification handled by Alertmanager.
Step 1: Define alert rules in a file (e.g., alerts.yml):
groups:
- name: example
rules:
- alert: HighRequestLatency
expr: job:request_latency_seconds:mean5m{job="api-server"} > 0.5
for: 10m
labels:
severity: critical
team: backend
annotations:
summary: "High request latency on {{ $labels.job }}"
description: "{{ $labels.job }} has a request latency above 500ms (current value: {{ $value }}s)"
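Before reloading, it is worth validating the rule file with promtool, which ships with Prometheus:
promtool check rules alerts.yml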
Step 2: Include this file in Prometheus configuration:
rule_files:
- "alerts.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
Step 3: Configure Alertmanager (alertmanager.yml):
route:
group_by: ['alertname', 'job']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'team-emails'
routes:
- match:
severity: critical
receiver: 'pager'
receivers:
- name: 'team-emails'
email_configs:
- to: 'team@example.com'
- name: 'pager'
pagerduty_configs:
- service_key: '<pagerduty-service-key>'
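Alertmanager ships with its own validator, amtool, which catches routing and receiver mistakes before a reload:
amtool check-config alertmanager.yml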
This setup will:
- Group related alerts by alertname and job, waiting 30s to batch alerts in the same group
- Send alerts labeled severity: critical to PagerDuty
- Send everything else to the team email receiver
- Re-send notifications for unresolved alerts every 4 hours
Prometheus uses a custom time-series database (TSDB) optimized for storing metrics with high write and query performance.
Storage Mechanism:
- Incoming samples are first appended to a write-ahead log (WAL) for crash recovery
- Recent data lives in an in-memory head block and is flushed to disk as 2-hour blocks
- Older blocks are compacted into larger blocks over time
Retention Configuration:
Retention is controlled by command-line flags on the Prometheus server rather than in prometheus.yml:
# Flags passed to the prometheus binary
--storage.tsdb.path=/data
--storage.tsdb.retention.time=15d    # How long to keep data
--storage.tsdb.retention.size=50GB   # Optional max storage size
By default, Prometheus keeps data for 15 days. You can adjust this based on:
- Available disk space and ingestion rate
- How far back dashboards and ad-hoc queries typically look
- Compliance or auditing requirements
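Prometheus exposes its own TSDB metrics, which help judge whether retention or series cardinality needs attention; for example:
prometheus_tsdb_head_series            # number of active series in the in-memory head block
prometheus_tsdb_storage_blocks_bytes   # bytes used by persisted blocks on disk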
For long-term storage, you can use remote write/read to send data to systems like:
- Thanos
- Cortex / Grafana Mimir
- VictoriaMetrics
- InfluxDB
Example remote write configuration:
remote_write:
- url: "http://remote-storage:9201/write"
queue_config:
max_samples_per_send: 10000
capacity: 500000
max_shards: 30
There are several approaches to implement high availability for Prometheus:
1. Multiple Prometheus Instances with Thanos
Thanos extends Prometheus with high availability and long-term storage:
# prometheus.yml with Thanos sidecar
global:
external_labels:
replica: replica-1
region: us-west
# Optional: ship samples to Thanos Receive via remote write (an alternative to the sidecar upload model)
remote_write:
- url: "http://thanos-receive:19291/api/v1/receive"
Thanos components:
- Sidecar: runs next to each Prometheus, uploads blocks to object storage and serves recent data
- Querier: fans out queries across sidecars and stores, deduplicating replicas
- Store Gateway: serves historical data from object storage
- Compactor: compacts and downsamples blocks in object storage
- Receive: accepts remote-write traffic (used instead of the sidecar upload path)
- Ruler: evaluates recording and alerting rules against the global view
2. Prometheus Federation
Have a global Prometheus instance that scrapes metrics from multiple Prometheus instances:
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="node"}'
- '{__name__=~"job:.*"}'
static_configs:
- targets:
- 'prometheus-1:9090'
- 'prometheus-2:9090'
3. Cortex or Victoria Metrics
Use a distributed Prometheus-compatible system:
remote_write:
- url: "http://cortex:9009/api/v1/push"
queue_config:
max_samples_per_send: 10000
Key HA Considerations:
- Run at least two identical Prometheus replicas and distinguish them with external_labels (e.g., replica)
- Deduplicate replica data at the query layer (Thanos Querier, Cortex, etc.)
- Run Alertmanager as a cluster so duplicate alerts from replicas are deduplicated
- Keep long-term storage and global querying concerns separate from scraping
Service discovery allows Prometheus to automatically find and monitor targets in dynamic environments like cloud or container platforms, eliminating the need for manual configuration.
Common Service Discovery Mechanisms:
- Kubernetes (kubernetes_sd_configs)
- AWS EC2 (ec2_sd_configs)
- Consul (consul_sd_configs)
- File-based (file_sd_configs)
- DNS, Azure, GCE, and others
The examples below show Kubernetes, EC2, Consul, and file-based discovery in turn.
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_container_port_number]
action: replace
regex: (.+);(.+)
replacement: $1:$2
target_label: __address__
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
scrape_configs:
- job_name: 'ec2-instances'
ec2_sd_configs:
- region: us-west-2
access_key: ACCESS_KEY
secret_key: SECRET_KEY
port: 9100
relabel_configs:
- source_labels: [__meta_ec2_tag_Environment]
regex: production
action: keep
- source_labels: [__meta_ec2_tag_Name]
target_label: instance_name
scrape_configs:
- job_name: 'consul-services'
consul_sd_configs:
- server: 'consul:8500'
services: ['web', 'api']
relabel_configs:
- source_labels: [__meta_consul_service]
target_label: service
- source_labels: [__meta_consul_node]
target_label: node
scrape_configs:
- job_name: 'file-discovery'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.json'
refresh_interval: 5m
With a JSON file like:
[
{
"targets": ["service1:9100", "service2:9100"],
"labels": {
"env": "production",
"job": "node"
}
}
]
Service discovery is essential in environments like Kubernetes where pods come and go frequently. The relabel_configs section is crucial for transforming the discovered metadata into usable target configurations.
Monitoring a Java application with Prometheus involves several steps:
1. Instrument the Java application
Use a library like Micrometer with Prometheus registry:
<!-- Maven dependency -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
<version>1.10.2</version>
</dependency>
2. Configure the metrics in your Java code:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.prometheus.client.exporter.HTTPServer;
import java.io.IOException;
public class Application {
public static void main(String[] args) throws IOException {
// Create a Prometheus registry
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
// Create metrics
Counter requestCounter = registry.counter("app_requests_total", "status", "success");
Timer requestTimer = registry.timer("app_request_duration_seconds", "endpoint", "/api");
// Increment counter in your code
requestCounter.increment();
// Time operations
requestTimer.record(() -> {
// Your operation here
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
});
// Start an HTTP server to expose metrics
HTTPServer server = new HTTPServer.Builder()
.withPort(8080)
.withRegistry(registry.getPrometheusRegistry()) // HTTPServer expects the underlying simpleclient CollectorRegistry
.build();
}
}
For Spring Boot applications it's even simpler: add spring-boot-starter-actuator and micrometer-registry-prometheus to the classpath and set management.endpoints.web.exposure.include=prometheus; the /actuator/prometheus endpoint is then auto-configured without any special annotation:
@SpringBootApplication
public class Application {
public static void main(String[] args) {
SpringApplication.run(Application.class, args);
}
}
3. Configure Prometheus to scrape the Java application:
scrape_configs:
- job_name: 'java-app'
metrics_path: '/actuator/prometheus' # For Spring Boot
static_configs:
- targets: ['java-app:8080']
4. Key metrics to monitor:
JVM metrics:
- jvm_memory_used_bytes, jvm_memory_max_bytes
- jvm_gc_collection_seconds_count, jvm_gc_collection_seconds_sum
- jvm_threads_states_threads
Application metrics:
- http_server_requests_seconds_count
- http_server_requests_seconds_sum
- http_server_requests_seconds_count{status="5xx"}
Custom business metrics:
- Domain-specific counters, gauges, and timers you register yourself (e.g., orders processed, queue depth)
5. JMX Exporter (alternative approach)
For legacy applications, use the JMX Exporter:
java -javaagent:./jmx_prometheus_javaagent-0.16.1.jar=8080:config.yaml -jar your-application.jar
With a config like:
---
startDelaySeconds: 0
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
rules:
- pattern: ".*"
Recording rules in Prometheus allow you to precompute frequently used or computationally expensive expressions and save their results as new time series. This improves query performance and reduces the load on the Prometheus server.
Configuration Example:
# In prometheus.yml
rule_files:
- "recording_rules.yml"
# In recording_rules.yml
groups:
- name: example
interval: 5s
rules:
- record: job:http_requests_total:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
- record: instance:node_cpu_utilization:avg5m
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: job:request_latency_seconds:p95
expr: histogram_quantile(0.95, sum(rate(request_latency_bucket[5m])) by (job, le))
When to Use Recording Rules:
For expensive queries: Calculations like percentiles over large datasets
# Original expensive query
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
# Recording rule
job:http_request_duration_seconds:p95_5m
For frequently used queries: Dashboards that many users access
# Original query used in multiple dashboards
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Recording rule
job:http_requests_error_ratio:5m
For alert conditions: To ensure alerts evaluate quickly
# Alert based on recording rule
- alert: HighErrorRate
expr: job:http_requests_error_ratio:5m > 0.05
for: 10m
For aggregations across multiple instances:
# Recording rule aggregating across instances
record: job:node_memory_utilization:avg
expr: avg by (job) (node_memory_used_bytes / node_memory_total_bytes)
Best practices:
- Name recorded series using the level:metric_name:operations convention
- Monitor rule evaluation performance via prometheus_rule_evaluation_duration_seconds
When Prometheus isn't scraping a target, follow this systematic troubleshooting approach:
1. Check the Targets page in the Prometheus UI (Status > Targets, or http://prometheus:9090/targets) for the target's state and last scrape error
2. Verify the target is accessible
Test direct access to the metrics endpoint:
curl http://target-host:port/metrics
Check for network connectivity issues:
telnet target-host port
3. Check Prometheus configuration
Verify the target is correctly defined in prometheus.yml:
scrape_configs:
- job_name: 'my-service'
static_configs:
- targets: ['my-service:8080']
Validate the configuration:
promtool check config prometheus.yml
4. Check for relabeling issues
If using relabeling, check if it's dropping your target:
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
regex: my-app
action: keep
5. Check service discovery
For Kubernetes, check if the pod has the right annotations:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
6. Check for TLS/authentication issues
If using TLS, verify certificates:
scheme: https
tls_config:
ca_file: /path/to/ca.crt
cert_file: /path/to/client.crt
key_file: /path/to/client.key
insecure_skip_verify: false
7. Check Prometheus logs
Look for scrape-related errors:
grep "scrape" /var/log/prometheus/prometheus.log
8. Check resource constraints
Check if scrape timeouts are occurring due to slow target responses:
scrape_timeout: 15s # Increase if needed
9. Use debug metrics
Check Prometheus's own metrics about scrapes:
scrape_samples_scraped{job="my-service"} # Should be > 0
scrape_duration_seconds{job="my-service"} # Duration of scrapes
up{job="my-service"} # 1 if up, 0 if down
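A quick way to list everything Prometheus currently considers down across all jobs:
# Targets whose last scrape failed, grouped by job
count by (job) (up == 0)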
Example of a complete troubleshooting session:
# 1. Check if the target is accessible
curl http://api-service:8080/metrics
# Result: Connection refused
# 2. Check network connectivity
telnet api-service 8080
# Result: Unable to connect
# 3. Check if the service is running
kubectl get pods | grep api-service
# Result: api-service-5d4d7b9f68-2jlmn 0/1 CrashLoopBackOff 5 10m
# 4. Check pod logs
kubectl logs api-service-5d4d7b9f68-2jlmn
# Result: Error binding to port 8080: address already in use
# 5. Fix the port conflict and restart the pod
kubectl delete pod api-service-5d4d7b9f68-2jlmn
# 6. Verify Prometheus can now scrape the target
curl 'http://prometheus:9090/api/v1/query?query=up{job="api-service"}'
# Result: {"status":"success","data":{"resultType":"vector","result":[{"metric":{"job":"api-service","instance":"api-service:8080"},"value":[1619712345.123,"1"]}]}}
Blackbox monitoring tests external behavior of a system from the outside, like checking if a website is up or an API is responding correctly. Prometheus's Blackbox Exporter is designed for this purpose.
Step 1: Deploy the Blackbox Exporter
# docker-compose.yml example
services:
blackbox-exporter:
image: prom/blackbox-exporter:latest
ports:
- "9115:9115"
volumes:
- ./blackbox.yml:/etc/blackbox_exporter/config.yml
Step 2: Configure the Blackbox Exporter
# blackbox.yml
modules:
http_2xx: # Simple HTTP probe
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
no_follow_redirects: false
fail_if_ssl: false
fail_if_not_ssl: false
preferred_ip_protocol: "ip4"
http_post_2xx: # HTTP POST probe
prober: http
timeout: 5s
http:
method: POST
headers:
Content-Type: application/json
body: '{"test": "test"}'
tcp_connect: # TCP connection probe
prober: tcp
timeout: 5s
icmp: # Ping probe
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
ssl: # SSL/TLS probe
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
fail_if_not_ssl: true
tls_config:
insecure_skip_verify: false
Step 3: Configure Prometheus to use the Blackbox Exporter
# prometheus.yml
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx] # Use the http_2xx module
static_configs:
- targets:
- https://example.com # Target to probe
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115 # The blackbox exporter's address
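You can also exercise a module by hand before wiring it into Prometheus; the exporter's /probe endpoint takes the target and module as query parameters:
curl 'http://blackbox-exporter:9115/probe?target=https://example.com&module=http_2xx'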
Step 4: Create alerts for blackbox monitoring
# alerts.yml
groups:
- name: blackbox
rules:
- alert: EndpointDown
expr: probe_success == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Endpoint {{ $labels.instance }} down"
description: "Endpoint {{ $labels.instance }} has been down for more than 5 minutes."
- alert: SlowResponse
expr: probe_duration_seconds > 1
for: 10m
labels:
severity: warning
annotations:
summary: "Slow response from {{ $labels.instance }}"
description: "{{ $labels.instance }} has response time > 1s for more than 10 minutes (current value: {{ $value }}s)"
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
for: 5m
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon for {{ $labels.instance }}"
description: "SSL certificate for {{ $labels.instance }} expires in less than 30 days"
Step 5: Create a Grafana dashboard for blackbox monitoring
Key metrics to monitor:
- probe_success: 1 if the probe was successful, 0 if it failed
- probe_duration_seconds: duration of the probe
- probe_http_status_code: HTTP status code returned
- probe_ssl_earliest_cert_expiry: timestamp of certificate expiration
- probe_http_version: HTTP version used
Example PromQL queries for the dashboard:
# Uptime percentage over 24 hours
avg_over_time(probe_success[24h]) * 100
# Response time over time
probe_duration_seconds
# SSL certificate expiration in days
(probe_ssl_earliest_cert_expiry - time()) / 86400
This setup allows you to monitor:
- Endpoint availability and uptime
- Response latency from the outside
- Returned HTTP status codes
- SSL certificate expiry
Implementing custom metrics in a Node.js application involves using the prom-client library to expose metrics in the Prometheus format.
Step 1: Install the required packages
npm install prom-client express
Step 2: Set up the metrics in your Node.js application
const express = require('express');
const client = require('prom-client');
// Create a Registry to register the metrics
const register = new client.Registry();
// Collect default Node.js runtime metrics, prefixed and tagged with an app label
client.collectDefaultMetrics({
register,
prefix: 'node_',
labels: { app: 'node-application' }
});
// Create custom metrics
const httpRequestCounter = new client.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
registers: [register]
});
const httpRequestDuration = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
registers: [register]
});
const activeConnections = new client.Gauge({
name: 'http_active_connections',
help: 'Number of active HTTP connections',
registers: [register]
});
// Create an Express app
const app = express();
const PORT = process.env.PORT || 3000;
// Middleware to track request duration and count
app.use((req, res, next) => {
// Increment active connections
activeConnections.inc();
// Track request start time
const start = Date.now();
// Track when response is finished
res.on('finish', () => {
// Record request duration
const duration = (Date.now() - start) / 1000;
httpRequestDuration.observe(
{ method: req.method, route: req.path, status_code: res.statusCode },
duration
);
// Count the request
httpRequestCounter.inc({
method: req.method,
route: req.path,
status_code: res.statusCode
});
// Decrement active connections
activeConnections.dec();
});
next();
});
// Business logic routes
app.get('/', (req, res) => {
res.send('Hello World!');
});
app.get('/api/users', (req, res) => {
// Simulate database query
setTimeout(() => {
res.json({ users: ['user1', 'user2', 'user3'] });
}, 200);
});
// Expose metrics endpoint for Prometheus
app.get('/metrics', async (req, res) => {
res.setHeader('Content-Type', register.contentType);
res.send(await register.metrics());
});
// Start the server
app.listen(PORT, () => {
console.log(`Server listening on port ${PORT}`);
});
Step 3: Configure Prometheus to scrape your Node.js application
scrape_configs:
- job_name: 'nodejs'
scrape_interval: 5s
static_configs:
- targets: ['nodejs-app:3000']
Step 4: Create a custom dashboard in Grafana
Key metrics to monitor:
- Request rate: rate(http_requests_total[5m])
- Error ratio: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
- 95th percentile latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
- Active connections: http_active_connections
Example of adding business metrics:
// Create business metrics
const orderCounter = new client.Counter({
name: 'business_orders_total',
help: 'Total number of orders',
labelNames: ['status', 'payment_method'],
registers: [register]
});
const orderValue = new client.Histogram({
name: 'business_order_value_dollars',
help: 'Value of orders in dollars',
buckets: [10, 50, 100, 500, 1000, 5000],
registers: [register]
});
// Use in your order processing route
app.post('/api/orders', (req, res) => {
// Process order
const order = processOrder(req.body);
// Record metrics
orderCounter.inc({
status: order.status,
payment_method: order.paymentMethod
});
orderValue.observe(order.totalAmount);
res.status(201).json(order);
});
This implementation provides:
- Default Node.js runtime metrics (event loop, GC, memory)
- Request counts, durations, and active connections via the middleware
- A /metrics endpoint for Prometheus to scrape
- A pattern for adding domain-specific business metrics
Grafana is an open-source analytics and visualization platform that integrates with various data sources to create dashboards and alerts. It complements Prometheus in several key ways:
How Grafana complements Prometheus:
- Richer, shareable dashboards and visualizations than the built-in Prometheus UI
- Mixing Prometheus with other data sources (logs, traces, SQL) on one dashboard
- Template variables for interactive filtering
- Team- and organization-based access control
- Its own alerting and notification channels
Setting up Grafana with Prometheus:
docker run -d -p 3000:3000 --name grafana grafana/grafana
Add Prometheus as a data source (URL: http://prometheus:9090)
Example dashboard JSON with two panels:
{
"panels": [
{
"title": "CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Memory Usage",
"type": "gauge",
"datasource": "Prometheus",
"targets": [
{
"expr": "node_memory_MemUsed_bytes / node_memory_MemTotal_bytes * 100"
}
],
"options": {
"thresholds": [
{ "value": 75, "color": "yellow" },
{ "value": 90, "color": "red" }
]
}
}
]
}
Example of a complete monitoring stack:
# docker-compose.yml
version: '3'
services:
prometheus:
image: prom/prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
command:
- '--config.file=/etc/prometheus/prometheus.yml'
grafana:
image: grafana/grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana-storage:/var/lib/grafana
depends_on:
- prometheus
node-exporter:
image: prom/node-exporter
ports:
- "9100:9100"
volumes:
grafana-storage:
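A minimal prometheus.yml to pair with this compose file could look like the following (a sketch assuming the service names above):
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']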
With this setup, Grafana provides:
- Dashboards and visualizations on top of the Prometheus data
- Node Exporter host metrics once a dashboard is imported
- Alerting and notification channels
- User and team management for shared access
Grafana variables allow you to create dynamic, interactive dashboards where users can filter data without editing the dashboard. Here's how to implement them:
Step 1: Create a new dashboard
Step 2: Add dashboard variables
Types of variables:
- Query: values pulled from a data source query
- Interval: time spans such as 1m, 5m, 1h
- Custom: a hand-maintained list of values
- Text box: free-form user input
- Constant, Data source, and Ad hoc filters
# Configuration
Name: node
Label: Node
Data source: Prometheus
Query: label_values(node_cpu_seconds_total, instance)
Regex: /(.*):.*/
Sort: Alphabetical (asc)
Multi-value: Enabled
Include All option: Enabled
# Configuration
Name: interval
Label: Interval
Values: 1m,5m,10m,30m,1h,6h,12h,1d
# Configuration
Name: environment
Label: Environment
Values: production, staging, development
Multi-value: Enabled
Include All option: Enabled
# Configuration
Name: threshold
Label: Alert Threshold
Default value: 80
# First variable
Name: app
Label: Application
Query: label_values(application)
# Second variable (depends on app)
Name: service
Label: Service
Query: label_values(app_metrics{application="$app"}, service)
Step 3: Use variables in panel queries
# CPU usage by selected node
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",instance=~"$node"}[$interval])) * 100)
# Memory usage filtered by environment
node_memory_used_bytes{env="$environment"} / node_memory_total_bytes{env="$environment"} * 100
# Alert threshold from user input
node_memory_used_bytes / node_memory_total_bytes * 100 > $threshold
Step 4: Create a dashboard with multiple variable-based panels
{
"title": "Node Exporter Dashboard",
"templating": {
"list": [
{
"name": "node",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(node_cpu_seconds_total, instance)",
"regex": "/(.*?):.*/",
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"auto": false,
"values": ["1m", "5m", "10m", "30m", "1h", "6h", "12h", "1d"]
},
{
"name": "disk",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(node_filesystem_avail_bytes{instance=~\"$node\"}, mountpoint)",
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"title": "CPU Usage for $node",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$node\"}[$interval])) * 100)",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Memory Usage for $node",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes{instance=~\"$node\"} - node_memory_MemFree_bytes{instance=~\"$node\"} - node_memory_Cached_bytes{instance=~\"$node\"} - node_memory_Buffers_bytes{instance=~\"$node\"}) / node_memory_MemTotal_bytes{instance=~\"$node\"} * 100",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Disk Usage for $disk on $node",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - ((node_filesystem_avail_bytes{instance=~\"$node\",mountpoint=~\"$disk\"} * 100) / node_filesystem_size_bytes{instance=~\"$node\",mountpoint=~\"$disk\"})",
"legendFormat": "{{instance}} - {{mountpoint}}"
}
]
}
]
}
Advanced variable techniques:
# Configuration
Data source: Prometheus
Filter key field: label
Filter value field: value
Built-in variables:
- $__interval: calculated based on the time range
- $__from and $__to: dashboard time range
- $__name: panel name
Multi-value formatting options:
- glob: values with glob (*, ?) patterns
- regex: values with regex patterns
- pipe: values separated by |
- distributed: multiple separate queries
This approach creates highly interactive dashboards where users can:
- Switch between nodes, environments, and disks without editing panels
- Change the aggregation interval on the fly
- Adjust thresholds through simple inputs
Grafana provides a robust alerting system that can notify you when metrics meet certain conditions. Here's how to set it up:
Step 1: Configure Alerting
Navigate to Alerting in the Grafana UI:
Step 2: Create Alert Rules
For Grafana 8+ (New alerting system):
1. Click "New alert rule"
2. Define the query:
- Data source: Prometheus
- PromQL: rate(node_cpu_seconds_total{mode="idle"}[5m]) < 0.1
3. Set conditions:
- Condition: When last() OF query(A) IS BELOW 0.1
- For: 5m (duration before alerting)
4. Add labels:
- severity: critical
- category: system
5. Add annotations:
- summary: High CPU usage on {{$labels.instance}}
- description: CPU idle is below 10% for 5 minutes
6. Set evaluation interval: 1m
For Grafana 7 and earlier (Classic alerting):
1. Edit a Graph panel
2. Go to the Alert tab
3. Click "Create Alert"
4. Set conditions:
- WHEN avg() OF query(A,5m,now) IS BELOW 0.1
- For: 5m
5. Add notifications:
- Send to: CPU Alert
- Message: High CPU usage detected
Step 3: Configure Notification Channels
Grafana supports multiple notification channels:
Name: Email Alerts
Type: Email
Addresses: team@example.com, oncall@example.com
Name: Slack Alerts
Type: Slack
Webhook URL: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX
Channel: #alerts
Name: PagerDuty Critical
Type: PagerDuty
Integration Key: your_pagerduty_integration_key
Severity: critical
Name: Custom Webhook
Type: Webhook
URL: https://example.com/alert-webhook
HTTP Method: POST
Username: webhook_user
Password: webhook_password
Name: MS Teams
Type: Microsoft Teams
Webhook URL: https://outlook.office.com/webhook/...
Name: Google Chat
Type: Google Hangouts Chat
Webhook URL: https://chat.googleapis.com/v1/spaces/...
Name: Telegram Alerts
Type: Telegram
BOT API Token: your_telegram_bot_token
Chat ID: your_chat_id
Name: OpsGenie
Type: OpsGenie
API Key: your_opsgenie_api_key
Alert API URL: https://api.opsgenie.com/v2/alerts
Step 4: Create Alert Notification Policies (Grafana 8+)
1. Go to Alerting > Notification policies
2. Create a new policy:
- Name: High Priority
- Matching labels: severity=critical
- Contact points: PagerDuty Critical
- Group by: [alertname, instance]
- Timing options:
- Group wait: 30s
- Group interval: 5m
- Repeat interval: 4h
Step 5: Test Alerts
1. Go to Alerting > Alert rules
2. Find your alert
3. Click "Test" button
4. Select a notification channel
5. Click "Send test notification"
Example of a complete alert setup in Grafana API format:
{
"dashboard": {
"panels": [
{
"title": "CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
],
"alert": {
"name": "High CPU Usage",
"message": "CPU usage is above 90% for 5 minutes",
"handler": 1,
"frequency": "60s",
"conditions": [
{
"type": "query",
"query": { "params": ["A", "5m", "now"] },
"reducer": { "type": "avg", "params": [] },
"evaluator": { "type": "gt", "params": [90] },
"operator": { "type": "and" }
}
],
"notifications": [
{ "uid": "slack-notifications" },
{ "uid": "email-notifications" }
]
}
}
]
},
"alerting": {
"alertmanagers": [
{
"name": "Prometheus Alertmanager",
"type": "prometheus-alertmanager",
"uid": "alertmanager1",
"url": "http://alertmanager:9093"
}
],
"contactPoints": [
{
"name": "slack-notifications",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX",
"recipient": "#alerts"
}
},
{
"name": "email-notifications",
"type": "email",
"settings": {
"addresses": "team@example.com"
}
}
]
}
}
Best Practices for Grafana Alerting:
- Use template variables in alert messages and annotations (e.g., {{$labels.instance}}, {{$value}})
)Creating a Grafana dashboard for Kubernetes monitoring involves several steps, from setting up the data collection to building comprehensive visualizations.
Step 1: Set up Prometheus for Kubernetes monitoring
Deploy Prometheus using Helm or the Prometheus Operator:
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
This installs:
- The Prometheus Operator and a Prometheus instance configured for the cluster
- Alertmanager
- node-exporter on every node and kube-state-metrics
- Grafana with a set of prebuilt Kubernetes dashboards
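To reach the bundled Grafana locally you can port-forward its service; the service and secret names below assume the release name prometheus used above and may differ between chart versions:
kubectl port-forward svc/prometheus-grafana 3000:80
kubectl get secret prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d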
Step 2: Create a comprehensive Kubernetes dashboard
If you didn't use the kube-prometheus-stack (which includes dashboards), create your own:
Cluster Overview Dashboard:
{
"title": "Kubernetes Cluster Overview",
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_namespace_status_phase, namespace)",
"multi": true,
"includeAll": true
},
{
"name": "pod",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
"multi": true,
"includeAll": true
},
{
"name": "node",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_node_info, node)",
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"title": "Cluster CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\", image!=\"\"}[5m])) by (namespace)",
"legendFormat": "{{namespace}}"
}
]
},
{
"title": "Cluster Memory Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{container!=\"\", image!=\"\"}) by (namespace)",
"legendFormat": "{{namespace}}"
}
],
"yaxes": [
{
"format": "bytes"
}
]
},
{
"title": "Pod Status",
"type": "stat",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(kube_pod_status_phase{phase=\"Running\"})",
"legendFormat": "Running"
},
{
"expr": "sum(kube_pod_status_phase{phase=\"Pending\"})",
"legendFormat": "Pending"
},
{
"expr": "sum(kube_pod_status_phase{phase=\"Failed\"})",
"legendFormat": "Failed"
}
],
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto"
}
},
{
"title": "Node Status",
"type": "gauge",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"Ready\", status=\"true\"}) / count(kube_node_info) * 100",
"legendFormat": "Ready Nodes"
}
],
"options": {
"thresholds": [
{ "value": 50, "color": "red" },
{ "value": 80, "color": "yellow" },
{ "value": 90, "color": "green" }
],
"min": 0,
"max": 100,
"unit": "percent"
}
},
{
"title": "Pod CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=~\"$namespace\", pod=~\"$pod\", container!=\"\", image!=\"\"}[5m])) by (pod)",
"legendFormat": "{{pod}}"
}
]
},
{
"title": "Pod Memory Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{namespace=~\"$namespace\", pod=~\"$pod\", container!=\"\", image!=\"\"}) by (pod)",
"legendFormat": "{{pod}}"
}
],
"yaxes": [
{
"format": "bytes"
}
]
},
{
"title": "Pod Network I/O",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(container_network_receive_bytes_total{namespace=~\"$namespace\", pod=~\"$pod\"}[5m])) by (pod)",
"legendFormat": "{{pod}} Received"
},
{
"expr": "sum(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\", pod=~\"$pod\"}[5m])) by (pod)",
"legendFormat": "{{pod}} Transmitted"
}
],
"yaxes": [
{
"format": "bytes"
}
]
},
{
"title": "Node CPU Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\", node=~\"$node\"}[5m])) by (node) / on(node) group_left count by (node) (node_cpu_seconds_total{mode=\"idle\"}) * 100",
"legendFormat": "{{node}}"
}
],
"yaxes": [
{
"format": "percent",
"max": 100
}
]
},
{
"title": "Node Memory Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes{node=~\"$node\"} - node_memory_MemAvailable_bytes{node=~\"$node\"}) / node_memory_MemTotal_bytes{node=~\"$node\"} * 100",
"legendFormat": "{{node}}"
}
],
"yaxes": [
{
"format": "percent",
"max": 100
}
]
},
{
"title": "Node Disk Usage",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "(1 - node_filesystem_avail_bytes{node=~\"$node\", mountpoint=\"/\"} / node_filesystem_size_bytes{node=~\"$node\", mountpoint=\"/\"}) * 100",
"legendFormat": "{{node}}"
}
],
"yaxes": [
{
"format": "percent",
"max": 100
}
]
}
]
}
Step 3: Create specialized dashboards for different aspects
1. Pod Resource Dashboard:
Key metrics:
CPU usage vs. requests/limits:
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod"}[5m])) by (container) /
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="cpu"}) by (container)
Memory usage vs. requests/limits:
sum(container_memory_working_set_bytes{namespace="$namespace", pod="$pod"}) by (container) /
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="memory"}) by (container)
Restart count:
kube_pod_container_status_restarts_total{namespace="$namespace", pod="$pod"}
2. Node Resource Dashboard:
Key metrics:
Node capacity vs. allocatable:
kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}
Pods per node:
count(kube_pod_info) by (node)
Node conditions:
kube_node_status_condition{condition="DiskPressure", status="true"}
3. Namespace Resource Dashboard:
Key metrics:
Resource quotas:
kube_resourcequota{namespace="$namespace", type="used"} /
kube_resourcequota{namespace="$namespace", type="hard"}
Deployment status:
kube_deployment_status_replicas_available{namespace="$namespace"} /
kube_deployment_spec_replicas{namespace="$namespace"}
Step 4: Set up alerts for Kubernetes monitoring
groups:
- name: kubernetes-alerts
rules:
- alert: KubernetesPodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{$labels.namespace}}/{{$labels.pod}} is crash looping"
description: "Pod {{$labels.namespace}}/{{$labels.pod}} is restarting {{$value}} times / 15 minutes"
- alert: KubernetesNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{$labels.node}} not ready"
description: "Node {{$labels.node}} has been unready for more than 5 minutes"
- alert: KubernetesPodNotHealthy
expr: min_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[15m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Pod {{$labels.namespace}}/{{$labels.pod}} not healthy"
description: "Pod {{$labels.namespace}}/{{$labels.pod}} has been in a non-ready state for more than 15 minutes"
- alert: KubernetesHighCPUUsage
expr: sum(rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m])) by (namespace, pod, container) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace, pod, container) > 0.9
for: 15m
labels:
severity: warning
annotations:
summary: "High CPU usage in {{$labels.namespace}}/{{$labels.pod}}/{{$labels.container}}"
description: "Container {{$labels.container}} in pod {{$labels.namespace}}/{{$labels.pod}} is using more than 90% of its CPU limit for 15 minutes"
Step 5: Create a Kubernetes Events Dashboard
# Query to show Kubernetes events
label_replace(
kube_event_info,
"reason",
"$1",
"reason",
"(.+)"
)
This comprehensive approach provides:
- Cluster-wide and per-namespace resource views
- Pod-level CPU, memory, and network usage against requests/limits
- Node capacity and health visibility
- Alerts for crash loops, unready nodes, unhealthy pods, and CPU saturation
Annotations in Grafana allow you to mark points in time with rich events that add context to your metrics. They help correlate metrics with deployments, incidents, or other significant events.
Types of Annotations:
- Manual annotations added directly on a graph
- Query-based annotations driven by a data source (Prometheus, Loki, etc.)
- Annotations created programmatically through the HTTP API
- The built-in Annotations & Alerts query that records alert state changes
Step 1: Adding manual annotations
To add a manual annotation, Ctrl/Cmd+click (or click and drag for a time range) on a graph panel at the relevant point in time and fill in the details:
Description: Deployed v2.3.4
Tags: deployment, frontend
Step 2: Configure annotation queries
In dashboard settings, open the Annotations section and add a new annotation query:
For a Prometheus annotation query:
Name: Deployments
Data source: Prometheus
Query: changes(version_info{app="my-app"}[1m]) > 0
Step: 60s
Title: $tag_version
Text: Deployed $tag_version to $tag_environment
Tags: deployment, $tag_environment
For a Loki annotation query:
Name: Error Logs
Data source: Loki
Query: {app="my-app"} |= "error" | logfmt
Title: Error detected
Text: {{.message}}
Tags: error, {{.level}}
Step 3: Create annotations via API
Using curl to create an annotation:
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/annotations \
-d '{
"dashboardId": 1,
"panelId": 2,
"time": 1619712345000,
"timeEnd": 1619712400000,
"tags": ["deployment", "api"],
"text": "Deployed API v1.2.3"
}'
Using a script to automatically create annotations for deployments:
// deploy.js
const axios = require('axios');
async function createDeploymentAnnotation() {
const version = process.env.VERSION;
const environment = process.env.ENVIRONMENT;
await axios.post('http://grafana-host:3000/api/annotations', {
dashboardId: 1,
time: Date.now(),
tags: ['deployment', environment],
text: `Deployed version ${version} to ${environment}`
}, {
headers: {
'Authorization': `Bearer ${process.env.GRAFANA_API_KEY}`
}
});
console.log('Annotation created for deployment');
}
createDeploymentAnnotation().catch(console.error);
Step 4: Configure global annotations
To make annotations visible across all dashboards:
Name: System Maintenance
Data source: Prometheus
Query: changes(maintenance_mode[1m]) > 0
Show in all dashboards: Enabled
Step 5: Using annotations in dashboards
To display annotations in a panel, enable the annotation query in the dashboard's annotation controls; matching events then appear as vertical markers with hover text on time series panels.
Example use cases:
Deployment markers:
Query: changes(kube_deployment_status_observed_generation{deployment="app"}[1m]) > 0
Scaling events:
Query: changes(kube_deployment_spec_replicas{deployment="app"}[1m]) > 0
Configuration changes:
Query: changes(config_hash{app="my-app"}[1m]) > 0
Incident timeline:
Manual annotations with tags: incident, outage
Database maintenance:
Query: mysql_global_status_ongoing_maintenance > 0
Annotations provide crucial context to your metrics, helping teams:
- Correlate deployments and configuration changes with performance regressions
- Reconstruct incident timelines
- Share operational context directly on dashboards
Grafana's role-based access control (RBAC) allows you to manage what users can see and do within the platform. Here's how to implement it:
Step 1: Understanding Grafana's permission levels
Grafana has several built-in roles:
- Grafana Server Admin: manages the whole instance (users, organizations, settings)
- Organization Admin: manages data sources, teams, and permissions within an organization
- Editor: creates and edits dashboards
- Viewer: read-only access to dashboards
Step 2: Configure authentication
First, set up authentication in grafana.ini:
[auth]
# Disable anonymous access
disable_login_form = false
anonymous_enabled = false
# LDAP authentication example
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
# OAuth example (GitHub)
[auth.github]
enabled = true
allow_sign_up = true
client_id = YOUR_GITHUB_APP_CLIENT_ID
client_secret = YOUR_GITHUB_APP_CLIENT_SECRET
scopes = user:email,read:org
auth_url = https://github.com/login/oauth/authorize
token_url = https://github.com/login/oauth/access_token
api_url = https://api.github.com/user
team_ids =
allowed_organizations = your-github-org
Step 3: Configure LDAP integration for team mapping
Create an LDAP configuration file (ldap.toml):
[[servers]]
host = "ldap.example.com"
port = 389
use_ssl = false
bind_dn = "cn=admin,dc=example,dc=com"
bind_password = "admin_password"
search_filter = "(uid=%s)"
search_base_dns = ["ou=users,dc=example,dc=com"]
# Map LDAP groups to Grafana roles
[[servers.group_mappings]]
group_dn = "cn=admins,ou=groups,dc=example,dc=com"
org_role = "Admin"
[[servers.group_mappings]]
group_dn = "cn=editors,ou=groups,dc=example,dc=com"
org_role = "Editor"
[[servers.group_mappings]]
group_dn = "cn=viewers,ou=groups,dc=example,dc=com"
org_role = "Viewer"
Step 4: Create teams and assign permissions
Using the Grafana UI: go to Configuration > Teams, create a team, and add members to it.
Using the API:
# Create a team
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/teams \
-d '{"name": "DevOps Team"}'
# Add user to team
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/teams/1/members \
-d '{"userId": 2}'
Step 5: Set up folder permissions
Using the Grafana UI: open the folder, go to its Permissions tab, and add teams or users with the desired permission level.
Using the API:
# Set folder permissions
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/folders/1/permissions \
-d '{
"items": [
{
"teamId": 1,
"permission": 2
}
]
}'
Permission levels: 1 = View, 2 = Edit, 4 = Admin.
Step 6: Set up dashboard permissions
Using the Grafana UI: open the dashboard's settings, go to Permissions, and add users or teams with View, Edit, or Admin access.
Using the API:
# Set dashboard permissions
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/dashboards/id/1/permissions \
-d '{
"items": [
{
"teamId": 1,
"permission": 1
}
]
}'
Step 7: Configure data source permissions
In Grafana Enterprise, you can restrict access to data sources:
# Set data source permissions (Enterprise only)
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/datasources/1/permissions \
-d '{
"items": [
{
"teamId": 1,
"permission": 1
}
]
}'
Step 8: Implement fine-grained access control (Grafana Enterprise)
For Grafana Enterprise, enable fine-grained access control in grafana.ini:
[feature_toggles]
accesscontrol = true
Create custom roles:
# Create a custom role
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/access-control/roles \
-d '{
"name": "DashboardPublisher",
"description": "Can create and publish dashboards",
"permissions": [
{
"action": "dashboards:create"
},
{
"action": "dashboards:write",
"scope": "folders:*"
}
]
}'
Assign roles to users:
# Assign role to user
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/access-control/users/2/roles \
-d '{
"roleUid": "custom:dashboard-publisher"
}'
Step 9: Implement service accounts for automation
For automated processes:
# Create service account
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/serviceaccounts \
-d '{
"name": "CI/CD Pipeline",
"role": "Editor"
}'
Step 10: Audit user activity
Enable audit logs in grafana.ini:
[log]
filters = audit:debug
This comprehensive RBAC implementation allows you to:
- Map external identity groups (LDAP, OAuth) to Grafana roles
- Scope access with teams plus folder and dashboard permissions
- Restrict data source access (Enterprise)
- Automate safely with service accounts and audit user activity
Optimizing Grafana dashboards is crucial for maintaining responsiveness, especially with large datasets or many concurrent users. Here's a comprehensive approach:
1. Optimize Prometheus Queries
PromQL queries can significantly impact dashboard performance:
# This aggregation is re-evaluated on every refresh of every panel that uses it
sum(rate(http_requests_total[5m]))
# Note: do not "optimize" this to rate(sum(http_requests_total)[5m]) - aggregating counters
# before rate() hides counter resets and produces incorrect results
Use recording rules for expensive calculations:
# prometheus.yml
rule_files:
- "recording_rules.yml"
# recording_rules.yml
groups:
- name: http_requests
interval: 1m
rules:
- record: job:http_requests_total:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
Then use the pre-computed metric in Grafana:
job:http_requests_total:rate5m{job="api"}
2. Adjust Time Range and Resolution
Configure appropriate min/max data points:
Dashboard settings > Data source options:
- Min time interval: 1m
- Max data points: 500
For individual panels:
Panel options > Query options:
- Min interval: 1m
- Max data points: 100
3. Use Template Variables Efficiently
Limit the number of values in multi-value variables:
Variable settings:
- Include All option: Disabled
- Preview of values: 10
Use regex to filter values:
Variable query: label_values(node_cpu_seconds_total, instance)
Regex: /prod-.*/
Cache template variables:
Variable settings:
- Refresh: On dashboard load
4. Optimize Panel Rendering
Reduce the number of series per graph:
# Use aggregation to reduce series
sum by (job) (rate(http_requests_total[5m]))
# Use top/bottom functions
topk(5, sum by (handler) (rate(http_requests_total[5m])))
Use appropriate visualization types:
- Stat/gauge panels for single values instead of full graphs
- Time series graphs only for a limited number of series (use topk to cap them)
- Tables for detailed breakdowns rather than dozens of graph series
5. Implement Dashboard Caching
For Grafana Enterprise, enable dashboard caching:
# grafana.ini
[caching]
enabled = true
For open-source Grafana, use a reverse proxy with caching:
# nginx.conf
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=grafana_cache:10m max_size=1g;
server {
location / {
proxy_pass http://grafana:3000;
proxy_cache grafana_cache;
proxy_cache_valid 200 1m;
proxy_cache_key "$host$request_uri$cookie_grafana_session";
}
}
6. Optimize Data Source Settings
For Prometheus:
# Prometheus data source settings
timeInterval: "1m" # Minimum interval
queryTimeout: "60s"
httpMethod: "POST" # Better for large queries
For other data sources like MySQL:
# MySQL data source settings
maxOpenConns: 10
maxIdleConns: 5
connMaxLifetime: 14400 # 4 hours
7. Use Lazy Loading Panels
Enable lazy loading in dashboard settings:
Dashboard settings > General:
- Lazy loading: Enabled
8. Implement Dashboard Snapshots for Static Data
For historical data that doesn't change:
Dashboard > Share > Snapshot
9. Optimize Alert Rules
Move complex alert evaluations to Prometheus:
# In Prometheus rules
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
This keeps the expensive evaluation in Prometheus instead of in a Grafana alert with a complex query.
10. Implement Dashboard Provisioning
Use provisioned dashboards for consistent performance:
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: 'default'
folder: 'General'
type: file
options:
path: /var/lib/grafana/dashboards
11. Real-world Example: Optimizing a Node Exporter Dashboard
Before optimization:
# Inefficient query showing all CPU cores for all nodes
node_cpu_seconds_total{mode!="idle"}
After optimization:
# More efficient query showing CPU usage by node
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) /
count by (instance) (count by (cpu, instance) (node_cpu_seconds_total)) * 100
Before optimization:
# Showing all filesystem metrics
node_filesystem_avail_bytes
After optimization:
# Only showing filesystems that matter
node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
12. Performance Testing
Use Grafana's built-in rendering metrics:
# In Prometheus
grafana_rendering_request_duration_seconds_sum
grafana_rendering_request_duration_seconds_count
Or use external tools:
# Using Apache Bench to test dashboard load
ab -n 100 -c 10 "http://grafana:3000/d/dashboard-uid/dashboard-name?orgId=1"
By implementing these optimizations, you can achieve:
- Faster dashboard load times, even with many concurrent users
- Lower query load on Prometheus and other data sources
- More predictable alert evaluation
Grafana and Loki together provide a powerful solution for log visualization and analysis. Here's how to set it up and use it effectively:
Step 1: Set up Loki as a data source in Grafana
Add Loki as a data source in Grafana (URL: http://loki:3100).
# Example docker-compose.yml for Loki and Grafana
version: '3'
services:
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log
- ./promtail-config.yaml:/etc/promtail/config.yaml
command: -config.file=/etc/promtail/config.yaml
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana-storage:/var/lib/grafana
volumes:
grafana-storage:
Step 2: Configure Promtail to ship logs to Loki
# promtail-config.yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*log
- job_name: containers
static_configs:
- targets:
- localhost
labels:
job: containerlogs
__path__: /var/lib/docker/containers/*/*log
# Extract container metadata
pipeline_stages:
- json:
expressions:
stream: stream
attrs: attrs
tag: attrs.tag
- regex:
expression: (?P<container_name>(?:[^|]*[^|]))
- labels:
container_name:
Step 3: Create a Logs Dashboard in Grafana
Create a new dashboard with these panels:
1. Log Volume Panel:
# Query
sum(count_over_time({job="containerlogs"}[5m])) by (container_name)
2. Log Explorer Panel:
# Query
{job="containerlogs"} |= "$search"
3. Error Rate Panel:
# Query
sum(count_over_time({job="containerlogs"} |= "error" [5m])) by (container_name)
Step 4: Use LogQL for Advanced Queries
Basic filtering:
# Show all logs containing "error"
{job="containerlogs"} |= "error"
# Show all logs NOT containing "health"
{job="containerlogs"} != "health"
# Combine filters
{job="containerlogs"} |= "error" != "timeout"
Pattern matching:
# Using regex
{job="containerlogs"} |~ "error.*timeout"
# Case insensitive matching
{job="containerlogs"} |~ "(?i)error"
Extracting fields with JSON:
# Parse JSON logs
{job="containerlogs"} | json
# Extract specific fields
{job="containerlogs"} | json field="message"
# Filter on extracted fields
{job="containerlogs"} | json | status_code >= 400
Line formatting:
# Format output
{job="containerlogs"} | json | line_format "{{.status_code}} {{.path}} {{.duration}}ms"
Aggregations:
# Count by status code
sum(count_over_time({job="containerlogs"} | json | status_code>=400 [5m])) by (status_code)
# Average response time
avg(sum_over_time({job="containerlogs"} | json | unwrap duration [5m])) by (path)
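LogQL can also compute quantiles over unwrapped values, which pairs well with the duration field used above (a sketch assuming logs carry a numeric duration field):
# 95th percentile request duration per path over the last 5 minutes
quantile_over_time(0.95, {job="containerlogs"} | json | unwrap duration [5m]) by (path)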
Step 5: Create a comprehensive logging dashboard
{
"title": "Application Logs Dashboard",
"panels": [
{
"title": "Log Volume Over Time",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum(count_over_time({job=\"containerlogs\"}[5m])) by (container_name)",
"legendFormat": "{{container_name}}"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum(count_over_time({job=\"containerlogs\"} |= \"error\" [5m])) by (container_name)",
"legendFormat": "{{container_name}}"
}
]
},
{
"title": "HTTP Status Codes",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum(count_over_time({job=\"containerlogs\"} | json | status_code>=200 [5m])) by (status_code)",
"legendFormat": "{{status_code}}"
}
]
},
{
"title": "Slow Requests (>500ms)",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum(count_over_time({job=\"containerlogs\"} | json | duration>500 [5m])) by (path)",
"legendFormat": "{{path}}"
}
]
},
{
"title": "Log Explorer",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{job=\"containerlogs\"} |= \"$search\"",
"refId": "A"
}
]
}
],
"templating": {
"list": [
{
"name": "container",
"type": "query",
"datasource": "Loki",
"query": "label_values(container_name)",
"multi": true,
"includeAll": true
},
{
"name": "search",
"type": "textbox",
"label": "Search",
"description": "Text to search for in logs"
}
]
},
"time": {
"from": "now-6h",
"to": "now"
}
}
Step 6: Set up log-based alerts
In Grafana 8+:
1. Go to Alerting > New alert rule
2. Configure:
- Name: High Error Rate
- Data source: Loki
- Query: sum(count_over_time({job="containerlogs"} |= "error" [5m])) > 100
- For: 5m
- Notifications: Send to Slack
Step 7: Correlate logs with metrics
Create a dashboard with both metrics and logs:
1. Add a graph panel with Prometheus metrics
- Query: sum(rate(http_requests_total{status=~"5.."}[5m])) by (instance)
2. Add a logs panel below
- Query: {job="containerlogs"} |= "error"
3. Link them with annotations
- Add annotation query from Loki
- Query: {job="containerlogs"} |= "deployed"
Step 8: Implement log retention and aggregation
Configure Loki retention in loki-config.yaml:
limits_config:
retention_period: 168h # 7 days
schema_config:
configs:
- from: 2020-07-01
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
Advanced Techniques:
- Link logs to traces with derived fields
- Detect unusual log patterns with the pattern parser
- Alert on sudden changes in log volume
# In Loki data source settings
Derived fields:
- Name: trace_id
- Regex: "traceID":"(\w+)"
- URL: http://jaeger:16686/trace/${__value.raw}
# Find unusual log patterns
{job="containerlogs"} | pattern
# Alert on sudden log volume increase
abs(
sum(count_over_time({job="containerlogs"}[5m]))
/
sum(count_over_time({job="containerlogs"}[5m] offset 5m))
- 1
) > 0.5
This comprehensive approach provides:
- Centralized log collection with Promtail and Loki
- LogQL-based searching, filtering, and aggregation
- Dashboards that combine logs with Prometheus metrics
- Log-based alerting and correlation with deployments
Multi-tenancy in Grafana allows multiple teams or customers to use the same Grafana instance while keeping their data and dashboards separate. Here's how to implement it:
Step 1: Understand Grafana's multi-tenancy model
Grafana uses "Organizations" as its multi-tenancy unit:
- Each organization has its own dashboards, folders, data sources, and alert rules
- Users can belong to multiple organizations, with a different role in each
- Resources are not shared between organizations by default
Step 2: Configure Grafana for multi-tenancy
Edit grafana.ini:
[users]
# Allow users to sign up
allow_sign_up = true
# Default role for new users
auto_assign_org_role = Viewer
# Allow users to create organizations
allow_org_create = false
[auth]
# Disable anonymous access
disable_login_form = false
anonymous_enabled = false
# Enable multiple organizations
[auth.basic]
enabled = true
[server]
# Enable subpath support if needed
root_url = https://grafana.example.com
Step 3: Create and manage organizations
Using the Grafana UI: as a server admin, go to Server Admin > Organizations and create a new organization.
Using the API:
# Create a new organization
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/orgs \
-d '{"name": "Customer A"}'
# Response: {"orgId":2,"message":"Organization created"}
Step 4: Manage users within organizations
Add users to an organization via UI:
Using the API:
# Add user to organization
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/orgs/2/users \
-d '{"loginOrEmail": "user@example.com", "role": "Editor"}'
# Change user's role in organization
curl -X PATCH \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/orgs/2/users/3 \
-d '{"role": "Admin"}'
# Remove user from organization
curl -X DELETE \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/orgs/2/users/3
Step 5: Switch between organizations
Via UI:
Via API:
# Switch to a different organization
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
http://grafana-host:3000/api/user/using/2
Step 6: Configure data source isolation
Each organization has its own data sources. For Prometheus:
# Organization 1 Prometheus
url: http://prometheus-org1:9090
basicAuth: true
basicAuthUser: org1
basicAuthPassword: password1
# Organization 2 Prometheus
url: http://prometheus-org2:9090
basicAuth: true
basicAuthUser: org2
basicAuthPassword: password2
For shared Prometheus with data isolation:
# Organization 1 Prometheus with tenant label
url: http://prometheus:9090
jsonData:
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
httpHeaderValue1: "org1"
# Organization 2 Prometheus with tenant label
url: http://prometheus:9090
jsonData:
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
httpHeaderValue1: "org2"
Step 7: Implement automated organization provisioning
Create a provisioning script:
#!/usr/bin/env python3
import requests
import json
GRAFANA_URL = "http://grafana:3000"
ADMIN_KEY = "admin_api_key"
def create_organization(name):
response = requests.post(
f"{GRAFANA_URL}/api/orgs",
headers={"Authorization": f"Bearer {ADMIN_KEY}"},
json={"name": name}
)
return response.json()["orgId"]
def create_api_key(org_id, name):
# Switch to organization
requests.post(
f"{GRAFANA_URL}/api/user/using/{org_id}",
headers={"Authorization": f"Bearer {ADMIN_KEY}"}
)
# Create API key