You've built a great API, deployed it to production, and users are starting to rely on it. Now comes the hard part: keeping it running smoothly. Without proper monitoring and alerting, you're flying blind—you won't know when things break until users start complaining.
Let's fix that.
Why API Monitoring Matters
APIs are the backbone of modern applications. When your API goes down or slows down, everything that depends on it suffers. Monitoring helps you:
- Catch problems before users notice them
- Understand how your API performs under load
- Make data-driven decisions about scaling and optimization
- Meet your SLAs and keep customers happy
- Debug issues faster with historical data
Key Metrics to Track
1. Latency
Latency measures how long it takes your API to respond to requests. It's usually measured in milliseconds and is one of the most important metrics for user experience.
Here's how to track latency in your API:
// Express.js middleware to track latency
const express = require('express');
const prometheus = require('prom-client');

const app = express();

const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in milliseconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000]
});

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });
  next();
});
Track these latency percentiles:
- p50 (median): Half of requests are faster than this
- p95: 95% of requests are faster than this
- p99: 99% of requests are faster than this
- p99.9: 99.9% of requests are faster than this (the slowest 0.1% sit above it)
Why percentiles matter: Your average latency might be 100ms, but if your p99 is 5 seconds, 1% of your users are having a terrible experience.
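To see how the average hides trouble, here's a minimal standalone sketch (plain JavaScript with made-up sample data) that computes percentiles using the nearest-rank method:

// Nearest-rank percentile on an ascending pre-sorted array
function percentile(sorted, p) {
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, index)];
}

// Hypothetical sample: mostly fast requests, one terrible outlier
const latencies = [80, 85, 90, 95, 100, 105, 110, 120, 450, 5000]
  .sort((a, b) => a - b);

console.log(`p50: ${percentile(latencies, 50)}ms`); // 100ms
console.log(`p99: ${percentile(latencies, 99)}ms`); // 5000ms

The median looks perfectly healthy even though the tail is catastrophic, which is exactly what an average-only view would miss.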
2. Error Rate
Error rate tells you what percentage of requests are failing. Track both 4xx (client errors) and 5xx (server errors) separately.
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestErrors = new prometheus.Counter({
  name: 'http_request_errors_total',
  help: 'Total number of HTTP request errors',
  labelNames: ['method', 'route', 'status_code', 'error_type']
});

app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequestsTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
    if (res.statusCode >= 400) {
      const errorType = res.statusCode >= 500 ? 'server_error' : 'client_error';
      httpRequestErrors
        .labels(req.method, req.route?.path || req.path, res.statusCode, errorType)
        .inc();
    }
  });
  next();
});
Calculate error rate like this:
Error Rate = (Number of Failed Requests / Total Requests) × 100
A healthy API typically has an error rate below 0.1% for 5xx errors.
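If you're using the Prometheus counters from the middleware above, the same calculation can be expressed directly in PromQL; a sketch, assuming those metric names:

# 5xx error rate over the last 5 minutes, as a ratio (0.001 = 0.1%)
sum(rate(http_request_errors_total{error_type="server_error"}[5m]))
  / sum(rate(http_requests_total[5m]))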
3. Throughput
Throughput measures how many requests your API handles per second (RPS) or per minute (RPM). It helps you understand usage patterns and capacity.
// A counter only ever increases; Prometheus derives the per-second
// rate from it at query time with rate()
const httpRequestRate = new prometheus.Counter({
  name: 'http_throughput_requests_total',
  help: 'Total HTTP requests, used to derive throughput with rate()',
  labelNames: ['method', 'route']
});

app.use((req, res, next) => {
  httpRequestRate
    .labels(req.method, req.route?.path || req.path)
    .inc();
  next();
});
Track throughput by:
- Endpoint (which endpoints get the most traffic?)
- Time of day (when are peak hours?)
- Geographic region (where are users located?)
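For the per-endpoint breakdown, a PromQL sketch (assuming the counter name from the snippet above):

# Requests per second by endpoint, averaged over 5 minutes
sum by (route) (rate(http_throughput_requests_total[5m]))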
4. Availability and Uptime
Availability is the percentage of time your API is operational. It's usually expressed as "nines":
- 99% uptime = 7.2 hours of downtime per month
- 99.9% uptime = 43 minutes of downtime per month
- 99.99% uptime = 4.3 minutes of downtime per month
- 99.999% uptime = 26 seconds of downtime per month
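These figures are just arithmetic on the uptime target; a minimal sketch, assuming a 30-day month:

// Downtime allowed per 30-day month for a given uptime target
function monthlyDowntimeMinutes(uptimePercent) {
  const minutesPerMonth = 30 * 24 * 60; // 43,200 minutes
  return minutesPerMonth * (1 - uptimePercent / 100);
}

console.log(monthlyDowntimeMinutes(99.9));  // 43.2 minutes
console.log(monthlyDowntimeMinutes(99.99)); // ~4.3 minutes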
Here's a simple health check endpoint:
app.get('/health', async (req, res) => {
  const health = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    status: 'healthy',
    checks: {}
  };

  try {
    // Check database connection
    await db.ping();
    health.checks.database = 'healthy';
  } catch (error) {
    health.checks.database = 'unhealthy';
    health.status = 'degraded';
  }

  try {
    // Check Redis connection
    await redis.ping();
    health.checks.redis = 'healthy';
  } catch (error) {
    health.checks.redis = 'unhealthy';
    health.status = 'degraded';
  }

  const statusCode = health.status === 'healthy' ? 200 : 503;
  res.status(statusCode).json(health);
});
5. Saturation
Saturation measures how "full" your system is. Track:
- CPU usage
- Memory usage
- Disk I/O
- Network bandwidth
- Database connection pool usage
const systemMetrics = new prometheus.Gauge({
  name: 'process_memory_usage_bytes',
  help: 'Process memory usage in bytes',
  labelNames: ['resource']
});

setInterval(() => {
  const usage = process.memoryUsage();
  systemMetrics.labels('memory_heap_used').set(usage.heapUsed);
  systemMetrics.labels('memory_heap_total').set(usage.heapTotal);
  systemMetrics.labels('memory_rss').set(usage.rss);
}, 10000);
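The memory gauge above covers one resource from the list; connection pool usage works the same way. A sketch, assuming a node-postgres Pool instance named `pool` (a hypothetical name here) created elsewhere:

const prometheus = require('prom-client');

// Assumes `pool` is a node-postgres Pool created elsewhere
const poolUsage = new prometheus.Gauge({
  name: 'db_pool_connections',
  help: 'Database connection pool usage by state',
  labelNames: ['state']
});

setInterval(() => {
  poolUsage.labels('total').set(pool.totalCount);     // all open connections
  poolUsage.labels('idle').set(pool.idleCount);       // open but not in use
  poolUsage.labels('waiting').set(pool.waitingCount); // requests queued for a connection
}, 10000);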
Setting Up Uptime Monitoring
Uptime monitoring pings your API from external locations to ensure it's accessible.
Using a Simple Cron Job
#!/bin/bash
# check_api_health.sh
API_URL="https://api.petstore.com/health"
ALERT_EMAIL="ops@petstore.com"
response=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL")

if [ "$response" -ne 200 ]; then
  echo "API is down! Status code: $response" | mail -s "API Alert" "$ALERT_EMAIL"
fi
Add to crontab to run every minute:
* * * * * /path/to/check_api_health.sh
Using Prometheus Blackbox Exporter
# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.petstore.com/health
          - https://api.petstore.com/api/v1/pets
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
Alerting Strategies
Good alerts wake you up when something's wrong, not when everything's fine. Here's how to set them up right.
Alert on Symptoms, Not Causes
- Bad alert: "CPU usage is above 80%"
- Good alert: "API latency p99 is above 1 second"
Users don't care about CPU usage—they care about slow responses.
Use Multiple Severity Levels
# Prometheus alerting rules
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # Critical: Page someone immediately
      - alert: APIHighErrorRate
        expr: sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      # Warning: Send to Slack, investigate during business hours
      - alert: APISlowResponses
        expr: histogram_quantile(0.99, rate(http_request_duration_ms_bucket[5m])) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API responses are slow"
          description: "p99 latency is {{ $value }}ms"

      # Info: Log it, no immediate action needed
      - alert: APIHighTraffic
        expr: rate(http_requests_total[5m]) > 1000
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "High traffic detected"
          description: "Request rate is {{ $value }} requests/second"
Implement Alert Fatigue Prevention
Use these techniques to avoid alert fatigue:
- Require sustained issues: Use `for: 5m` to only alert if the problem persists
- Group related alerts: Don't send 50 alerts for the same incident
- Add context: Include runbooks and relevant metrics in alerts
- Route intelligently: Send critical alerts to PagerDuty, warnings to Slack
Here's an Alertmanager configuration:
# alertmanager.yml
route:
  receiver: 'email'  # fallback receiver when no route matches
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-pagerduty-key>'
        description: '{{ .GroupLabels.alertname }}'
  - name: 'slack'
    slack_configs:
      - api_url: '<your-slack-webhook>'
        channel: '#api-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'email'
    email_configs:
      - to: 'ops@petstore.com'
        from: 'alerts@petstore.com'
        smarthost: 'smtp.gmail.com:587'
Building Effective Dashboards
Dashboards should answer questions at a glance. Here's a Grafana dashboard configuration for API monitoring:
{
  "dashboard": {
    "title": "PetStore API Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_request_errors_total[5m]) / rate(http_requests_total[5m])",
            "legendFormat": "Error Rate"
          }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.01], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ]
        }
      },
      {
        "title": "Latency Percentiles",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_ms_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_ms_bucket[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_ms_bucket[5m]))",
            "legendFormat": "p99"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
Dashboard Best Practices
- Use the RED method (Rate, Errors, Duration) for request-driven services
- Show trends over time: Include 24-hour and 7-day views
- Add annotations: Mark deployments and incidents on graphs
- Keep it simple: Don't cram 50 metrics on one screen
- Use consistent colors: Red for errors, green for success, yellow for warnings
SLAs and SLOs
Service Level Agreements (SLAs)
SLAs are contracts with your users. They define what you promise to deliver and what happens if you don't.
Example SLA:
## PetStore API Service Level Agreement
### Availability
- **Target**: 99.9% uptime per month
- **Measurement**: Percentage of successful health check responses
- **Exclusions**: Scheduled maintenance windows (announced 7 days in advance)
### Performance
- **Target**: p95 latency < 200ms for all GET requests
- **Measurement**: 95th percentile response time measured at API gateway
### Support
- **Critical issues**: Response within 1 hour, resolution within 4 hours
- **Non-critical issues**: Response within 24 hours
### Remedies
- 99.0-99.9% uptime: 10% service credit
- 95.0-99.0% uptime: 25% service credit
- <95.0% uptime: 50% service credit
Service Level Objectives (SLOs)
SLOs are internal targets that are stricter than your SLAs. They give you a buffer to fix problems before breaking your SLA.
# SLO definitions
slos:
  - name: api_availability
    target: 99.95  # SLA is 99.9%, so we aim higher
    window: 30d
  - name: api_latency_p95
    target: 150ms  # SLA is 200ms
    window: 30d
  - name: api_error_rate
    target: 0.1%  # Less than 0.1% of requests should error
    window: 7d
Track your error budget:
// Calculate error budget
const SLO_TARGET = 0.9995; // 99.95%
const TOTAL_REQUESTS = 10000000; // 10M requests per month
const allowedFailures = TOTAL_REQUESTS * (1 - SLO_TARGET);
console.log(`Error budget: ${allowedFailures} failed requests per month`);
// If you've had 3000 failures so far this month:
const actualFailures = 3000;
const budgetRemaining = allowedFailures - actualFailures;
const budgetUsedPercent = (actualFailures / allowedFailures) * 100;
console.log(`Budget remaining: ${budgetRemaining} failures`);
console.log(`Budget used: ${budgetUsedPercent.toFixed(2)}%`);
When you're burning through your error budget too fast, it's time to stop shipping new features and focus on reliability.
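"Too fast" can be made concrete with a burn rate: the ratio of failures you've actually had to the share of the budget you've "earned" so far in the window. A sketch continuing the numbers above:

// Burn rate: actual failures vs. the budget earned so far this window.
// A sustained burn rate above 1.0 means the budget will run out
// before the 30-day window ends.
const allowedFailures = 5000; // error budget from the example above
const actualFailures = 3000;
const daysElapsed = 10;
const windowDays = 30;

const budgetEarnedSoFar = allowedFailures * (daysElapsed / windowDays);
const burnRate = actualFailures / budgetEarnedSoFar;

console.log(`Burn rate: ${burnRate.toFixed(2)}`); // ~1.80: overspending the budget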
Monitoring Tools Comparison
Prometheus + Grafana
Prometheus is an open-source monitoring system that scrapes metrics from your services. Grafana visualizes them.
Pros:
- Free and open source
- Powerful query language (PromQL)
- Great for infrastructure and application metrics
- Active community

Cons:
- Requires setup and maintenance
- No built-in alerting UI (need Alertmanager)
- Steep learning curve
Complete setup example:
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false

  alertmanager:
    image: prom/alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus-data:
  grafana-data:
Datadog
Datadog is a fully managed monitoring platform with APM, logs, and infrastructure monitoring.
Pros:
- Zero setup—just install the agent
- Beautiful dashboards out of the box
- Excellent APM with distributed tracing
- Great alerting and incident management

Cons:
- Expensive at scale
- Vendor lock-in
- Less flexible than Prometheus
Here's how to instrument your API with Datadog:
// dd-trace must be initialized before any other imports
const tracer = require('dd-trace').init({
  service: 'petstore-api',
  env: process.env.NODE_ENV,
  version: '1.0.0',
  logInjection: true
});

const express = require('express');
const app = express();

// Datadog automatically instruments Express
app.get('/api/v1/pets', async (req, res) => {
  const span = tracer.scope().active();
  span?.setTag('user.id', req.user.id);

  try {
    const pets = await db.query('SELECT * FROM pets');
    res.json(pets);
  } catch (error) {
    span?.setTag('error', true);
    span?.setTag('error.message', error.message);
    res.status(500).json({ error: 'Internal server error' });
  }
});
Create custom metrics:
const StatsD = require('hot-shots');
const dogstatsd = new StatsD();
// Increment a counter
dogstatsd.increment('api.pets.created', 1, ['environment:production']);
// Record a histogram
dogstatsd.histogram('api.database.query.time', 45, ['query:select_pets']);
// Set a gauge
dogstatsd.gauge('api.database.connections', 23);
Other Tools Worth Considering
- New Relic: Similar to Datadog, great APM and user experience monitoring
- Elastic Stack (ELK): Excellent for log aggregation and analysis
- Sentry: Specialized in error tracking and debugging (see the sketch below)
- Pingdom/UptimeRobot: Simple uptime monitoring from multiple locations
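As one concrete example, wiring Sentry into the same Express app takes only a few lines. A sketch assuming the v7-style @sentry/node middleware API (the setup has changed across major versions, so check the current docs):

const Sentry = require('@sentry/node');
const express = require('express');

const app = express();

Sentry.init({ dsn: process.env.SENTRY_DSN });

// The request handler must be the first middleware
app.use(Sentry.Handlers.requestHandler());

// ...your routes go here...

// The error handler must come after all routes
app.use(Sentry.Handlers.errorHandler());

app.listen(3000);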
Real-World Monitoring Setup
Let's put it all together with a complete monitoring setup for a PetStore API:
// monitoring.js
const prometheus = require('prom-client');

// Create a Registry
const register = new prometheus.Registry();

// Add default metrics (CPU, memory, etc.)
prometheus.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5]
});

const httpRequestTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new prometheus.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

const databaseQueryDuration = new prometheus.Histogram({
  name: 'database_query_duration_seconds',
  help: 'Duration of database queries',
  labelNames: ['query_type'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);
register.registerMetric(databaseQueryDuration);

// Middleware
function monitoringMiddleware(req, res, next) {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;

    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);

    httpRequestTotal
      .labels(req.method, route, res.statusCode)
      .inc();

    activeConnections.dec();
  });

  next();
}

// Metrics endpoint (register.metrics() returns a Promise in recent prom-client versions)
async function metricsEndpoint(req, res) {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
}

module.exports = {
  monitoringMiddleware,
  metricsEndpoint,
  metrics: {
    httpRequestDuration,
    httpRequestTotal,
    activeConnections,
    databaseQueryDuration
  }
};
Use it in your app:
const express = require('express');
const { monitoringMiddleware, metricsEndpoint } = require('./monitoring');

const app = express();

// Add monitoring middleware
app.use(monitoringMiddleware);

// Expose metrics endpoint
app.get('/metrics', metricsEndpoint);

// Your API routes
app.get('/api/v1/pets', async (req, res) => {
  // Your logic here
});

app.listen(3000);
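Finally, Prometheus has to be pointed at that /metrics endpoint. A sketch of the scrape config, assuming the API is reachable at `petstore-api:3000` (adjust the target to your environment):

# prometheus.yml
scrape_configs:
  - job_name: 'petstore-api'
    scrape_interval: 15s
    static_configs:
      - targets: ['petstore-api:3000']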
Wrapping Up
Monitoring and alerting aren't optional—they're essential for running reliable APIs. Start with the basics: track latency, error rate, and throughput. Set up simple alerts for critical issues. Build dashboards that answer your most important questions.
As you grow, add more sophisticated monitoring: distributed tracing, log aggregation, synthetic monitoring. But don't overcomplicate things early on. A simple setup that you actually use beats a complex one that you ignore.
The goal isn't to collect every possible metric—it's to know when something's wrong and have the data to fix it quickly. Focus on that, and you'll be ahead of most teams.