You've built a great API, deployed it to production, and users are starting to rely on it. Now comes the hard part: keeping it running smoothly. Without proper monitoring and alerting, you're flying blind—you won't know when things break until users start complaining.
Let's fix that.
Why API Monitoring Matters
APIs are the backbone of modern applications. When your API goes down or slows down, everything that depends on it suffers. Monitoring helps you:
- Catch problems before users notice them
- Understand how your API performs under load
- Make data-driven decisions about scaling and optimization
- Meet your SLAs and keep customers happy
- Debug issues faster with historical data
Key Metrics to Track
1. Latency
Latency measures how long it takes your API to respond to requests. It's usually measured in milliseconds and is one of the most important metrics for user experience.
Here's how to track latency in your API:
// Express.js middleware to track latency
const express = require('express');
const prometheus = require('prom-client');

const app = express();

const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in milliseconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000]
});

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });
  next();
});
Track these latency percentiles:
- p50 (median): Half of requests are faster than this
- p95: 95% of requests are faster than this
- p99: 99% of requests are faster than this
- p99.9: 99.9% of requests are faster than this (the slowest 0.1% sit above it)
Why percentiles matter: Your average latency might be 100ms, but if your p99 is 5 seconds, 1% of your users are having a terrible experience.
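To see how the average hides trouble, here's a minimal standalone sketch (plain JavaScript with made-up sample data) that computes percentiles using the nearest-rank method:

// Nearest-rank percentile on an ascending pre-sorted array
function percentile(sorted, p) {
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, index)];
}

// Hypothetical sample: mostly fast requests, one terrible outlier
const latencies = [80, 85, 90, 95, 100, 105, 110, 120, 450, 5000]
  .sort((a, b) => a - b);

console.log(`p50: ${percentile(latencies, 50)}ms`); // 100ms
console.log(`p99: ${percentile(latencies, 99)}ms`); // 5000ms

The median looks perfectly healthy even though the tail is catastrophic, which is exactly what an average-only view would miss.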
2. Error Rate
Error rate tells you what percentage of requests are failing. Track both 4xx (client errors) and 5xx (server errors) separately.
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestErrors = new prometheus.Counter({
  name: 'http_request_errors_total',
  help: 'Total number of HTTP request errors',
  labelNames: ['method', 'route', 'status_code', 'error_type']
});

app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequestsTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
    if (res.statusCode >= 400) {
      const errorType = res.statusCode >= 500 ? 'server_error' : 'client_error';
      httpRequestErrors
        .labels(req.method, req.route?.path || req.path, res.statusCode, errorType)
        .inc();
    }
  });
  next();
});
Calculate error rate like this:
Error Rate = (Number of Failed Requests / Total Requests) × 100
A healthy API typically has an error rate below 0.1% for 5xx errors.
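If you're using the Prometheus counters from the middleware above, the same calculation can be expressed directly in PromQL; a sketch, assuming those metric names:

# 5xx error rate over the last 5 minutes, as a ratio (0.001 = 0.1%)
sum(rate(http_request_errors_total{error_type="server_error"}[5m]))
  / sum(rate(http_requests_total[5m]))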
3. Throughput
Throughput measures how many requests your API handles per second (RPS) or per minute (RPM). It helps you understand usage patterns and capacity.
// A counter only ever increases; Prometheus derives the per-second
// rate from it at query time with rate()
const httpRequestRate = new prometheus.Counter({
  name: 'http_throughput_requests_total',
  help: 'Total HTTP requests, used to derive throughput with rate()',
  labelNames: ['method', 'route']
});

app.use((req, res, next) => {
  httpRequestRate
    .labels(req.method, req.route?.path || req.path)
    .inc();
  next();
});
Track throughput by:
- Endpoint (which endpoints get the most traffic?)
- Time of day (when are peak hours?)
- Geographic region (where are users located?)
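For the per-endpoint breakdown, a PromQL sketch (assuming the counter name from the snippet above):

# Requests per second by endpoint, averaged over 5 minutes
sum by (route) (rate(http_throughput_requests_total[5m]))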
4. Availability and Uptime
Availability is the percentage of time your API is operational. It's usually expressed as "nines":
- 99% uptime = 7.2 hours of downtime per month
- 99.9% uptime = 43 minutes of downtime per month
- 99.99% uptime = 4.3 minutes of downtime per month
- 99.999% uptime = 26 seconds of downtime per month
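These figures are just arithmetic on the uptime target; a minimal sketch, assuming a 30-day month:

// Downtime allowed per 30-day month for a given uptime target
function monthlyDowntimeMinutes(uptimePercent) {
  const minutesPerMonth = 30 * 24 * 60; // 43,200 minutes
  return minutesPerMonth * (1 - uptimePercent / 100);
}

console.log(monthlyDowntimeMinutes(99.9));  // 43.2 minutes
console.log(monthlyDowntimeMinutes(99.99)); // ~4.3 minutes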
Here's a simple health check endpoint:
app.get('/health', async (req, res) => {
  const health = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    status: 'healthy',
    checks: {}
  };

  try {
    // Check database connection
    await db.ping();
    health.checks.database = 'healthy';
  } catch (error) {
    health.checks.database = 'unhealthy';
    health.status = 'degraded';
  }

  try {
    // Check Redis connection
    await redis.ping();
    health.checks.redis = 'healthy';
  } catch (error) {
    health.checks.redis = 'unhealthy';
    health.status = 'degraded';
  }

  const statusCode = health.status === 'healthy' ? 200 : 503;
  res.status(statusCode).json(health);
});
5. Saturation
Saturation measures how "full" your system is. Track:
- CPU usage
- Memory usage
- Disk I/O
- Network bandwidth
- Database connection pool usage
const systemMetrics = new prometheus.Gauge({
  name: 'process_memory_usage_bytes',
  help: 'Process memory usage in bytes',
  labelNames: ['resource']
});

setInterval(() => {
  const usage = process.memoryUsage();
  systemMetrics.labels('memory_heap_used').set(usage.heapUsed);
  systemMetrics.labels('memory_heap_total').set(usage.heapTotal);
  systemMetrics.labels('memory_rss').set(usage.rss);
}, 10000);
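The memory gauge above covers one resource from the list; connection pool usage works the same way. A sketch, assuming a node-postgres Pool instance named `pool` (a hypothetical name here) created elsewhere:

const prometheus = require('prom-client');

// Assumes `pool` is a node-postgres Pool created elsewhere
const poolUsage = new prometheus.Gauge({
  name: 'db_pool_connections',
  help: 'Database connection pool usage by state',
  labelNames: ['state']
});

setInterval(() => {
  poolUsage.labels('total').set(pool.totalCount);     // all open connections
  poolUsage.labels('idle').set(pool.idleCount);       // open but not in use
  poolUsage.labels('waiting').set(pool.waitingCount); // requests queued for a connection
}, 10000);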
Setting Up Uptime Monitoring
Uptime monitoring pings your API from external locations to ensure it's accessible.
Using a Simple Cron Job
#!/bin/bash
# check_api_health.sh
API_URL="https://api.petstore.com/health"
ALERT_EMAIL="ops@petstore.com"
response=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL")

if [ "$response" -ne 200 ]; then
  echo "API is down! Status code: $response" | mail -s "API Alert" "$ALERT_EMAIL"
fi
Add to crontab to run every minute:
* * * * * /path/to/check_api_health.sh
Using Prometheus Blackbox Exporter
# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.petstore.com/health
          - https://api.petstore.com/api/v1/pets
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
Alerting Strategies
Good alerts wake you up when something's wrong, not when everything's fine. Here's how to set them up right.
Alert on Symptoms, Not Causes
- Bad alert: "CPU usage is above 80%"
- Good alert: "API latency p99 is above 1 second"
Users don't care about CPU usage—they care about slow responses.
Use Multiple Severity Levels
# Prometheus alerting rules
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # Critical: Page someone immediately
      - alert: APIHighErrorRate
        expr: sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      # Warning: Send to Slack, investigate during business hours
      - alert: APISlowResponses
        expr: histogram_quantile(0.99, rate(http_request_duration_ms_bucket[5m])) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API responses are slow"
          description: "p99 latency is {{ $value }}ms"

      # Info: Log it, no immediate action needed
      - alert: APIHighTraffic
        expr: rate(http_requests_total[5m]) > 1000
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "High traffic detected"
          description: "Request rate is {{ $value }} requests/second"
Implement Alert Fatigue Prevention
Use these techniques to avoid alert fatigue:
- Require sustained issues: Use `for: 5m` to only alert if the problem persists
- Group related alerts: Don't send 50 alerts for the same incident
- Add context: Include runbooks and relevant metrics in alerts
- Route intelligently: Send critical alerts to PagerDuty, warnings to Slack
Here's an Alertmanager configuration:
# alertmanager.yml
route:
  receiver: 'email'  # fallback receiver when no route matches
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-pagerduty-key>'
        description: '{{ .GroupLabels.alertname }}'
  - name: 'slack'
    slack_configs:
      - api_url: '<your-slack-webhook>'
        channel: '#api-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'email'
    email_configs:
      - to: 'ops@petstore.com'
        from: 'alerts@petstore.com'
        smarthost: 'smtp.gmail.com:587'
Building Effective Dashboards
Dashboards should answer questions at a glance. Here's a Grafana dashboard configuration for API monitoring:
{
  "dashboard": {
    "title": "PetStore API Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_request_errors_total[5m]) / rate(http_requests_total[5m])",
            "legendFormat": "Error Rate"
          }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.01], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ]
        }
      },
      {
        "title": "Latency Percentiles",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_ms_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_ms_bucket[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_ms_bucket[5m]))",
            "legendFormat": "p99"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
Dashboard Best Practices
- Use the RED method (Rate, Errors, Duration) for request-driven services
- Show trends over time: Include 24-hour and 7-day views
- Add annotations: Mark deployments and incidents on graphs
- Keep it simple: Don't cram 50 metrics on one screen
- Use consistent colors: Red for errors, green for success, yellow for warnings
SLAs and SLOs
Service Level Agreements (SLAs)
SLAs are contracts with your users. They define what you promise to deliver and what happens if you don't.
Example SLA:
## PetStore API Service Level Agreement
### Availability
- **Target**: 99.9% uptime per month
- **Measurement**: Percentage of successful health check responses
- **Exclusions**: Scheduled maintenance windows (announced 7 days in advance)
### Performance
- **Target**: p95 latency < 200ms for all GET requests
- **Measurement**: 95th percentile response time measured at API gateway
### Support
- **Critical issues**: Response within 1 hour, resolution within 4 hours
- **Non-critical issues**: Response within 24 hours
### Remedies
- 99.0-99.9% uptime: 10% service credit
- 95.0-99.0% uptime: 25% service credit
- <95.0% uptime: 50% service credit
Service Level Objectives (SLOs)
SLOs are internal targets that are stricter than your SLAs. They give you a buffer to fix problems before breaking your SLA.
# SLO definitions
slos:
  - name: api_availability
    target: 99.95  # SLA is 99.9%, so we aim higher
    window: 30d
  - name: api_latency_p95
    target: 150ms  # SLA is 200ms
    window: 30d
  - name: api_error_rate
    target: 0.1%  # Less than 0.1% of requests should error
    window: 7d
Track your error budget:
// Calculate error budget
const SLO_TARGET = 0.9995; // 99.95%
const TOTAL_REQUESTS = 10000000; // 10M requests per month
const allowedFailures = TOTAL_REQUESTS * (1 - SLO_TARGET);
console.log(`Error budget: ${allowedFailures} failed requests per month`);
// If you've had 3000 failures so far this month:
const actualFailures = 3000;
const budgetRemaining = allowedFailures - actualFailures;
const budgetUsedPercent = (actualFailures / allowedFailures) * 100;
console.log(`Budget remaining: ${budgetRemaining} failures`);
console.log(`Budget used: ${budgetUsedPercent.toFixed(2)}%`);
When you're burning through your error budget too fast, it's time to stop shipping new features and focus on reliability.
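"Too fast" can be made concrete with a burn rate: the ratio of failures you've actually had to the share of the budget you've "earned" so far in the window. A sketch continuing the numbers above:

// Burn rate: actual failures vs. the budget earned so far this window.
// A sustained burn rate above 1.0 means the budget will run out
// before the 30-day window ends.
const allowedFailures = 5000; // error budget from the example above
const actualFailures = 3000;
const daysElapsed = 10;
const windowDays = 30;

const budgetEarnedSoFar = allowedFailures * (daysElapsed / windowDays);
const burnRate = actualFailures / budgetEarnedSoFar;

console.log(`Burn rate: ${burnRate.toFixed(2)}`); // ~1.80: overspending the budget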
Monitoring Tools Comparison
Prometheus + Grafana
Prometheus is an open-source monitoring system that scrapes metrics from your services. Grafana visualizes them.
Pros:
- Free and open source
- Powerful query language (PromQL)
- Great for infrastructure and application metrics
- Active community

Cons:
- Requires setup and maintenance
- No built-in alerting UI (need Alertmanager)
- Steep learning curve
Complete setup example:
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false

  alertmanager:
    image: prom/alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus-data:
  grafana-data:
Datadog
Datadog is a fully managed monitoring platform with APM, logs, and infrastructure monitoring.
Pros:
- Zero setup—just install the agent
- Beautiful dashboards out of the box
- Excellent APM with distributed tracing
- Great alerting and incident management

Cons:
- Expensive at scale
- Vendor lock-in
- Less flexible than Prometheus
Here's how to instrument your API with Datadog:
// dd-trace must be initialized before any other imports
const tracer = require('dd-trace').init({
  service: 'petstore-api',
  env: process.env.NODE_ENV,
  version: '1.0.0',
  logInjection: true
});

const express = require('express');
const app = express();

// Datadog automatically instruments Express
app.get('/api/v1/pets', async (req, res) => {
  const span = tracer.scope().active();
  span?.setTag('user.id', req.user.id);

  try {
    const pets = await db.query('SELECT * FROM pets');
    res.json(pets);
  } catch (error) {
    span?.setTag('error', true);
    span?.setTag('error.message', error.message);
    res.status(500).json({ error: 'Internal server error' });
  }
});
Create custom metrics:
const StatsD = require('hot-shots');
const dogstatsd = new StatsD();
// Increment a counter
dogstatsd.increment('api.pets.created', 1, ['environment:production']);
// Record a histogram
dogstatsd.histogram('api.database.query.time', 45, ['query:select_pets']);
// Set a gauge
dogstatsd.gauge('api.database.connections', 23);
Other Tools Worth Considering
- New Relic: Similar to Datadog, great APM and user experience monitoring
- Elastic Stack (ELK): Excellent for log aggregation and analysis
- Sentry: Specialized in error tracking and debugging (see the sketch below)
- Pingdom/UptimeRobot: Simple uptime monitoring from multiple locations
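As one concrete example, wiring Sentry into the same Express app takes only a few lines. A sketch assuming the v7-style @sentry/node middleware API (the setup has changed across major versions, so check the current docs):

const Sentry = require('@sentry/node');
const express = require('express');

const app = express();

Sentry.init({ dsn: process.env.SENTRY_DSN });

// The request handler must be the first middleware
app.use(Sentry.Handlers.requestHandler());

// ...your routes go here...

// The error handler must come after all routes
app.use(Sentry.Handlers.errorHandler());

app.listen(3000);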
Real-World Monitoring Setup
Let's put it all together with a complete monitoring setup for a PetStore API:
// monitoring.js
const prometheus = require('prom-client');

// Create a Registry
const register = new prometheus.Registry();

// Add default metrics (CPU, memory, etc.)
prometheus.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5]
});

const httpRequestTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new prometheus.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

const databaseQueryDuration = new prometheus.Histogram({
  name: 'database_query_duration_seconds',
  help: 'Duration of database queries',
  labelNames: ['query_type'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);
register.registerMetric(databaseQueryDuration);

// Middleware
function monitoringMiddleware(req, res, next) {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;

    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);

    httpRequestTotal
      .labels(req.method, route, res.statusCode)
      .inc();

    activeConnections.dec();
  });

  next();
}

// Metrics endpoint (register.metrics() returns a Promise in recent prom-client versions)
async function metricsEndpoint(req, res) {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
}

module.exports = {
  monitoringMiddleware,
  metricsEndpoint,
  metrics: {
    httpRequestDuration,
    httpRequestTotal,
    activeConnections,
    databaseQueryDuration
  }
};
Use it in your app:
const express = require('express');
const { monitoringMiddleware, metricsEndpoint } = require('./monitoring');

const app = express();

// Add monitoring middleware
app.use(monitoringMiddleware);

// Expose metrics endpoint
app.get('/metrics', metricsEndpoint);

// Your API routes
app.get('/api/v1/pets', async (req, res) => {
  // Your logic here
});

app.listen(3000);
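Finally, Prometheus has to be pointed at that /metrics endpoint. A sketch of the scrape config, assuming the API is reachable at `petstore-api:3000` (adjust the target to your environment):

# prometheus.yml
scrape_configs:
  - job_name: 'petstore-api'
    scrape_interval: 15s
    static_configs:
      - targets: ['petstore-api:3000']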
Wrapping Up
Monitoring and alerting aren't optional—they're essential for running reliable APIs. Start with the basics: track latency, error rate, and throughput. Set up simple alerts for critical issues. Build dashboards that answer your most important questions.
As you grow, add more sophisticated monitoring: distributed tracing, log aggregation, synthetic monitoring. But don't overcomplicate things early on. A simple setup that you actually use beats a complex one that you ignore.
The goal isn't to collect every possible metric—it's to know when something's wrong and have the data to fix it quickly. Focus on that, and you'll be ahead of most teams.