Your API works in development. Tests pass. You're ready to ship.
But production-ready means more than "it works on my machine." It means your API stays up during deployments, recovers from failures automatically, provides clear signals when something's wrong, and gives your team the tools to fix issues fast.
This guide covers everything you need to take an API from working code to production-ready service.
## What "Production-Ready" Actually Means
Production-ready isn't a binary state. It's a spectrum. But at minimum, your API should:
- **Stay available during deployments**: No downtime for routine updates
- **Fail gracefully**: Degrade functionality instead of crashing
- **Signal health clearly**: Monitoring knows when something's wrong
- **Recover automatically**: Restart after crashes, reconnect to databases
- **Provide debugging context**: Logs and traces help diagnose issues
- **Have documented procedures**: Team knows how to respond to incidents
Let's build each piece.
## Health Checks: The Foundation
Health checks tell load balancers, orchestrators, and monitoring systems whether your API is ready to serve traffic.
### Basic Health Check
Start with a simple endpoint that checks critical dependencies:
```javascript
// routes/health.js
const express = require('express');
const router = express.Router();
const db = require('../db');
const redis = require('../redis');

router.get('/health', async (req, res) => {
  const checks = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {}
  };

  // Check database
  try {
    await db.query('SELECT 1');
    checks.checks.database = 'healthy';
  } catch (error) {
    checks.checks.database = 'unhealthy';
    checks.status = 'unhealthy';
  }

  // Check Redis
  try {
    await redis.ping();
    checks.checks.redis = 'healthy';
  } catch (error) {
    checks.checks.redis = 'unhealthy';
    checks.status = 'unhealthy';
  }

  const statusCode = checks.status === 'healthy' ? 200 : 503;
  res.status(statusCode).json(checks);
});

module.exports = router;
```
This returns 200 when healthy, 503 when not. Load balancers use this to route traffic.
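To make that contract concrete, here's a minimal sketch of the rotation logic a load balancer might apply to those status codes. The threshold values and the `createHealthTracker` helper are illustrative, not taken from any particular load balancer:

```javascript
// Hypothetical sketch of load-balancer rotation logic: a target leaves
// rotation after `failureThreshold` consecutive non-200 health responses
// and rejoins after `successThreshold` consecutive passes.
function createHealthTracker({ failureThreshold = 3, successThreshold = 2 } = {}) {
  let failures = 0;
  let successes = 0;
  let inRotation = true;

  return {
    record(statusCode) {
      if (statusCode === 200) {
        successes += 1;
        failures = 0;
        if (!inRotation && successes >= successThreshold) inRotation = true;
      } else {
        failures += 1;
        successes = 0;
        if (inRotation && failures >= failureThreshold) inRotation = false;
      }
      return inRotation;
    },
    inRotation: () => inRotation,
  };
}
```

Requiring several consecutive failures before removing a target avoids flapping on a single slow or dropped check.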
### Liveness vs Readiness
Kubernetes distinguishes between two types of health checks:
**Liveness**: Is the process alive? Should we restart it?

**Readiness**: Is it ready to serve traffic? Should we send requests?
```javascript
// Liveness: Just check if the process responds
router.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    timestamp: new Date().toISOString()
  });
});

// Readiness: Check if dependencies are available
router.get('/health/ready', async (req, res) => {
  const checks = {
    database: false,
    redis: false
  };

  try {
    await db.query('SELECT 1');
    checks.database = true;
  } catch (error) {
    // Database down
  }

  try {
    await redis.ping();
    checks.redis = true;
  } catch (error) {
    // Redis down
  }

  const ready = checks.database && checks.redis;
  const statusCode = ready ? 200 : 503;
  res.status(statusCode).json({
    status: ready ? 'ready' : 'not ready',
    checks,
    timestamp: new Date().toISOString()
  });
});
```
Configure Kubernetes to use both:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: petstore-api
  template:
    metadata:
      labels:
        app: petstore-api
    spec:
      containers:
        - name: api
          image: petstore-api:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
```
Liveness failures trigger restarts. Readiness failures stop traffic routing.
### Startup Probes for Slow Initialization
Some APIs take time to warm up. Use startup probes to give them space:
```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30  # 30 * 5 = 150 seconds to start
```
This gives your API 150 seconds to become ready before liveness checks begin.
## Graceful Shutdown
When you deploy new code, existing requests should complete before the old process dies.
### Handling SIGTERM
Kubernetes sends SIGTERM before killing pods. Catch it and shut down gracefully:
```javascript
// server.js
const express = require('express');
const db = require('./db');
const redis = require('./redis');

const app = express();

const server = app.listen(3000, () => {
  console.log('Server started on port 3000');
});

// Track active connections
const connections = new Set();

server.on('connection', (conn) => {
  connections.add(conn);
  conn.on('close', () => {
    connections.delete(conn);
  });
});

// Graceful shutdown handler
let shuttingDown = false;

function gracefulShutdown(signal) {
  if (shuttingDown) return; // Ignore repeated signals
  shuttingDown = true;
  console.log(`Received ${signal}, starting graceful shutdown`);

  // Stop accepting new connections
  server.close(() => {
    console.log('Server closed, no new connections accepted');
  });

  // Set a deadline for existing requests
  const shutdownTimeout = setTimeout(() => {
    console.error('Shutdown timeout, forcing exit');
    process.exit(1);
  }, 30000); // 30 seconds

  // Wait for active connections to finish, then close dependencies.
  // Closing the database or Redis before requests drain would fail
  // any request still in flight.
  const checkInterval = setInterval(async () => {
    if (connections.size === 0) {
      clearInterval(checkInterval);

      await db.end();
      console.log('Database connections closed');

      await redis.quit();
      console.log('Redis connections closed');

      clearTimeout(shutdownTimeout);
      console.log('All connections closed, exiting');
      process.exit(0);
    } else {
      console.log(`Waiting for ${connections.size} connections to close`);
    }
  }, 1000);
}

// Register shutdown handlers
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

// Handle uncaught errors
process.on('uncaughtException', (error) => {
  console.error('Uncaught exception:', error);
  gracefulShutdown('uncaughtException');
});

process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled rejection at:', promise, 'reason:', reason);
  gracefulShutdown('unhandledRejection');
});
```
This ensures:
- No new requests accepted after SIGTERM
- Existing requests complete
- Database connections close cleanly
- Process exits within 30 seconds maximum
### Kubernetes Termination Grace Period
Give Kubernetes enough time for graceful shutdown:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api
spec:
  selector:
    matchLabels:
      app: petstore-api
  template:
    metadata:
      labels:
        app: petstore-api
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: api
          image: petstore-api:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
```
The preStop hook gives load balancers time to remove the pod from rotation before shutdown begins.
## Zero-Downtime Deployments
Graceful shutdown isn't enough. You need a deployment strategy that maintains availability.
### Rolling Updates
Deploy new versions gradually, replacing old pods one at a time:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Can have 8 pods during update (6 + 2)
      maxUnavailable: 1  # At least 5 pods must be available
  selector:
    matchLabels:
      app: petstore-api
  template:
    metadata:
      labels:
        app: petstore-api
    spec:
      containers:
        - name: api
          image: petstore-api:v2.0.0
```
This ensures:
- At least 5 pods always available
- New pods start before old ones stop
- Rollout pauses if new pods fail health checks
### Blue-Green Deployments
Run two complete environments, switch traffic instantly:
```yaml
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: petstore-api
      version: blue
  template:
    metadata:
      labels:
        app: petstore-api
        version: blue
    spec:
      containers:
        - name: api
          image: petstore-api:v1.0.0
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api-green
  labels:
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: petstore-api
      version: green
  template:
    metadata:
      labels:
        app: petstore-api
        version: green
    spec:
      containers:
        - name: api
          image: petstore-api:v2.0.0
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: petstore-api
spec:
  selector:
    app: petstore-api
    version: blue  # Switch to 'green' to cut over
  ports:
    - port: 80
      targetPort: 3000
```
Deploy green, test it, then update the service selector to switch traffic. Instant rollback if needed.
### Canary Deployments
Send a small percentage of traffic to the new version:
```yaml
# Using Istio for traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: petstore-api
spec:
  hosts:
    - petstore-api
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: petstore-api
            subset: v2
    - route:
        - destination:
            host: petstore-api
            subset: v1
          weight: 95
        - destination:
            host: petstore-api
            subset: v2
          weight: 5
```
This sends 5% of traffic to v2, 95% to v1. Gradually increase the percentage if metrics look good.
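To see what that weighting means mechanically, here's a small illustrative sketch of weighted subset selection. This is not Istio's actual implementation; `pickSubset` and the `routes` shape are hypothetical:

```javascript
// Illustrative weighted routing: pick a subset in proportion to its weight,
// the way a 95/5 split distributes requests (not Istio's real code).
function pickSubset(routes, rand = Math.random()) {
  const total = routes.reduce((sum, r) => sum + r.weight, 0);
  let threshold = rand * total;
  for (const route of routes) {
    threshold -= route.weight;
    if (threshold < 0) return route.subset;
  }
  return routes[routes.length - 1].subset;
}

const routes = [
  { subset: 'v1', weight: 95 },
  { subset: 'v2', weight: 5 },
];
```

Each request draws a random number; roughly 5% of draws land in the v2 band, which is why canary metrics need enough traffic to be statistically meaningful.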
## Disaster Recovery
Things will break. Plan for it.
### Database Backups
Automate backups and test restores:
```bash
#!/bin/bash
# backup-database.sh
set -e

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="petstore_backup_${TIMESTAMP}.sql.gz"
S3_BUCKET="s3://petstore-backups"

echo "Starting backup at $(date)"

# Dump database
pg_dump -h "$DB_HOST" -U "$DB_USER" -d petstore \
  | gzip > "/tmp/$BACKUP_FILE"

# Upload to S3
aws s3 cp "/tmp/$BACKUP_FILE" "$S3_BUCKET/"

# Verify upload
aws s3 ls "$S3_BUCKET/$BACKUP_FILE"

# Clean up local file
rm "/tmp/$BACKUP_FILE"

# Delete backups older than 30 days
aws s3 ls "$S3_BUCKET/" | while read -r line; do
  createDate=$(echo "$line" | awk '{print $1" "$2}')
  createDate=$(date -d "$createDate" +%s)
  olderThan=$(date -d "30 days ago" +%s)
  if [[ $createDate -lt $olderThan ]]; then
    fileName=$(echo "$line" | awk '{print $4}')
    if [[ $fileName != "" ]]; then
      aws s3 rm "$S3_BUCKET/$fileName"
    fi
  fi
done

echo "Backup completed at $(date)"
```
Run this daily via cron:
```
0 2 * * * /usr/local/bin/backup-database.sh >> /var/log/backup.log 2>&1
```
### Testing Restores
Backups are worthless if you can't restore. Test monthly:
```bash
#!/bin/bash
# test-restore.sh
set -e

# Get latest backup
LATEST_BACKUP=$(aws s3 ls s3://petstore-backups/ | sort | tail -n 1 | awk '{print $4}')
echo "Testing restore of $LATEST_BACKUP"

# Download backup
aws s3 cp "s3://petstore-backups/$LATEST_BACKUP" /tmp/test_restore.sql.gz

# Create test database
psql -h "$DB_HOST" -U "$DB_USER" -c "DROP DATABASE IF EXISTS petstore_restore_test"
psql -h "$DB_HOST" -U "$DB_USER" -c "CREATE DATABASE petstore_restore_test"

# Restore
gunzip < /tmp/test_restore.sql.gz | psql -h "$DB_HOST" -U "$DB_USER" -d petstore_restore_test

# Verify data
RECORD_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d petstore_restore_test -t -c "SELECT COUNT(*) FROM pets")
echo "Restored $RECORD_COUNT records"

# Clean up
psql -h "$DB_HOST" -U "$DB_USER" -c "DROP DATABASE petstore_restore_test"
rm /tmp/test_restore.sql.gz

echo "Restore test successful"
```
### Multi-Region Failover
For critical APIs, run in multiple regions:
```yaml
# Route 53 health check and failover
Resources:
  HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        ResourcePath: /health
        FullyQualifiedDomainName: api-us-east-1.petstore.com
        Port: 443
        RequestInterval: 30
        FailureThreshold: 3
  DNSRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z1234567890ABC
      Name: api.petstore.com
      Type: A
      SetIdentifier: us-east-1
      Failover: PRIMARY
      HealthCheckId: !Ref HealthCheck
      AliasTarget:
        HostedZoneId: Z1234567890ABC
        DNSName: api-us-east-1.petstore.com
  DNSRecordFailover:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z1234567890ABC
      Name: api.petstore.com
      Type: A
      SetIdentifier: us-west-2
      Failover: SECONDARY
      AliasTarget:
        HostedZoneId: Z1234567890ABC
        DNSName: api-us-west-2.petstore.com
```
Traffic automatically fails over to us-west-2 if us-east-1 health checks fail.
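The failover rule itself is simple enough to sketch. This illustrative function is not Route 53's actual logic; `records` mirrors the PRIMARY/SECONDARY records above, and `isHealthy` stands in for the health check result:

```javascript
// Illustrative failover resolution: return the primary endpoint while its
// health check passes, otherwise the secondary.
function resolveEndpoint(records, isHealthy) {
  const primary = records.find((r) => r.failover === 'PRIMARY');
  const secondary = records.find((r) => r.failover === 'SECONDARY');
  return isHealthy(primary.dnsName) ? primary.dnsName : secondary.dnsName;
}

const records = [
  { failover: 'PRIMARY', dnsName: 'api-us-east-1.petstore.com' },
  { failover: 'SECONDARY', dnsName: 'api-us-west-2.petstore.com' },
];
```

Note that DNS failover is bounded by record TTLs and health-check intervals, so cutover takes minutes, not milliseconds.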
## Runbooks: Documentation That Matters
When your API breaks at 3 AM, your on-call engineer needs clear instructions.
### Runbook Template
Create a runbook for each common incident:
````markdown
# Runbook: High API Latency

## Symptoms
- P95 latency > 1000ms
- Alert: "API latency high" firing
- Users reporting slow responses

## Impact
- Degraded user experience
- Potential timeouts
- Increased error rates

## Diagnosis

### 1. Check current latency
```bash
# Query Prometheus
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
```

### 2. Identify slow endpoints
```bash
# Check APM dashboard
open https://apm.petstore.com/services/api/operations

# Or query logs
kubectl logs -l app=petstore-api --tail=1000 | grep "duration" | sort -k5 -n | tail -20
```

### 3. Check database performance
```sql
-- Find slow queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for locks
SELECT * FROM pg_locks WHERE NOT granted;
```

### 4. Check resource usage
```bash
# CPU and memory
kubectl top pods -l app=petstore-api

# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"
```

## Resolution

**If database is slow:**
- Check for missing indexes
- Kill long-running queries if safe
- Scale up database if needed
- Enable read replicas for read-heavy queries

**If API pods are overloaded:**
- Scale up replicas: `kubectl scale deployment petstore-api --replicas=10`
- Check for memory leaks in recent deployments
- Consider rolling back recent changes

**If external service is slow:**
- Enable circuit breakers
- Increase timeouts temporarily
- Contact external service provider

## Prevention
- Add database indexes for slow queries
- Implement caching for expensive operations
- Set up auto-scaling based on latency metrics
- Add circuit breakers for external dependencies

## Related Runbooks
````
### Runbook Checklist
Every runbook should have:
- **Symptoms**: How to recognize the problem
- **Impact**: What's affected and how severely
- **Diagnosis**: Step-by-step investigation
- **Resolution**: How to fix it
- **Prevention**: How to avoid it next time
- **Related runbooks**: Links to similar issues
## On-Call Procedures
Good on-call procedures reduce stress and improve response times.
### On-Call Rotation
Set up a fair rotation:
```yaml
# PagerDuty schedule example
schedules:
  - name: "API On-Call"
    time_zone: "America/New_York"
    layers:
      - name: "Primary"
        rotation_virtual_start: "2026-01-01T00:00:00"
        rotation_turn_length_seconds: 604800  # 1 week
        users:
          - user: "alice@petstore.com"
          - user: "bob@petstore.com"
          - user: "carol@petstore.com"
      - name: "Secondary"
        rotation_virtual_start: "2026-01-01T00:00:00"
        rotation_turn_length_seconds: 604800
        users:
          - user: "dave@petstore.com"
          - user: "eve@petstore.com"
```
### Escalation Policy
Define clear escalation paths:
```yaml
escalation_policies:
  - name: "API Escalation"
    escalation_rules:
      - escalation_delay_in_minutes: 5
        targets:
          - type: "schedule"
            id: "primary-oncall"
      - escalation_delay_in_minutes: 10
        targets:
          - type: "schedule"
            id: "secondary-oncall"
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "engineering-manager"
```
If primary doesn't respond in 5 minutes, page secondary. If no response in 10 more minutes, page the manager.
### Alert Severity Levels
Not all alerts need immediate response:
**P0 - Critical**
- API completely down
- Data loss occurring
- Security breach
- Response: Immediate, wake up on-call

**P1 - High**
- Partial outage
- High error rates
- Performance degradation
- Response: Within 15 minutes

**P2 - Medium**
- Single endpoint failing
- Elevated latency
- Non-critical feature broken
- Response: Within 1 hour

**P3 - Low**
- Minor issues
- Warnings
- Capacity concerns
- Response: Next business day
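If you encode these severities in your own alerting glue code, a small lookup keeps the policy in one place. This is an illustrative sketch; `SEVERITY_POLICY` and `routeAlert` are hypothetical names, and the response targets simply mirror the levels above:

```javascript
// Hypothetical mapping of severity level to paging behavior.
const SEVERITY_POLICY = {
  P0: { page: true,  responseTarget: 'immediate' },
  P1: { page: true,  responseTarget: '15 minutes' },
  P2: { page: false, responseTarget: '1 hour' },
  P3: { page: false, responseTarget: 'next business day' },
};

function routeAlert(severity) {
  const policy = SEVERITY_POLICY[severity];
  // Fail loudly on unknown severities rather than silently dropping alerts
  if (!policy) throw new Error(`Unknown severity: ${severity}`);
  return policy;
}
```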
Configure alert routing based on severity:
```yaml
# Prometheus AlertManager config
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
```
## Post-Incident Reviews
After every major incident, write a blameless postmortem:
```markdown
# Incident Review: API Outage 2026-03-13

## Summary
API was unavailable for 23 minutes due to database connection pool exhaustion.

## Timeline (all times UTC)
- 14:32: Alert fired: "API error rate high"
- 14:33: On-call engineer paged
- 14:35: Engineer acknowledged, began investigation
- 14:38: Identified database connection pool exhausted
- 14:40: Attempted to restart API pods
- 14:45: Restart failed, connections still exhausted
- 14:47: Restarted database connection pooler
- 14:50: API began recovering
- 14:55: All health checks passing, incident resolved

## Root Cause
A slow database query caused connections to pile up. The connection pool (max 20 connections) filled up, causing new requests to fail.

## Impact
- 23 minutes of downtime
- ~1,500 failed requests
- 12 customer support tickets

## What Went Well
- Alert fired quickly
- On-call engineer responded within 3 minutes
- Runbook helped identify issue

## What Went Wrong
- Connection pool too small for traffic volume
- No monitoring on connection pool usage
- Slow query not caught in development

## Action Items
- [ ] Increase connection pool to 50 (Alice, by 2026-03-15)
- [ ] Add connection pool metrics to dashboard (Bob, by 2026-03-17)
- [ ] Add slow query detection to CI (Carol, by 2026-03-20)
- [ ] Update runbook with connection pool troubleshooting (Dave, by 2026-03-14)

## Lessons Learned
- Monitor resource pools (connections, threads, memory)
- Load test with realistic query patterns
- Connection pool size should scale with traffic
```
Share postmortems with the whole team. Learning from incidents makes everyone better.
## Production Readiness Checklist
Use this checklist before launching:
### Observability
- [ ] Health check endpoints implemented
- [ ] Metrics exported (Prometheus format)
- [ ] Structured logging with correlation IDs
- [ ] Distributed tracing configured
- [ ] Dashboards created for key metrics
- [ ] Alerts configured for critical issues
### Reliability
- [ ] Graceful shutdown implemented
- [ ] Connection pooling configured
- [ ] Timeouts set on all external calls
- [ ] Circuit breakers for external dependencies
- [ ] Retry logic with exponential backoff
- [ ] Rate limiting implemented
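Of the items above, retry with exponential backoff is the one not shown elsewhere in this guide. Here's a minimal sketch; the delay values and the `retryWithBackoff` name are illustrative:

```javascript
// Minimal retry with exponential backoff and a delay cap.
async function retryWithBackoff(fn, { retries = 5, baseDelayMs = 100, maxDelayMs = 5000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries) throw error; // Out of retries: propagate
      // Exponential backoff: 100ms, 200ms, 400ms, ... capped at maxDelayMs
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

In production you'd usually add jitter to the delay so many clients retrying at once don't synchronize into thundering-herd spikes.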
### Deployment
- [ ] Rolling update strategy configured
- [ ] Health checks prevent bad deployments
- [ ] Rollback procedure documented
- [ ] Database migrations automated
- [ ] Feature flags for risky changes
### Disaster Recovery
- [ ] Automated database backups
- [ ] Backup restore tested
- [ ] Multi-region deployment (if needed)
- [ ] Failover procedure documented
- [ ] Data retention policy defined
### Security
- [ ] API authentication required
- [ ] Rate limiting per API key
- [ ] Input validation on all endpoints
- [ ] SQL injection prevention
- [ ] Secrets stored securely (not in code)
- [ ] TLS/HTTPS enforced
### Documentation
- [ ] API documentation published
- [ ] Runbooks for common incidents
- [ ] Architecture diagram created
- [ ] On-call procedures documented
- [ ] Escalation policy defined
### Testing
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Load tests completed
- [ ] Security scan completed
- [ ] Chaos engineering tests (optional)
## Conclusion
Production-ready isn't a destination; it's a practice. You'll never check every box perfectly. But each improvement makes your API more reliable, your team more confident, and your users happier.
Start with the basics: health checks, graceful shutdown, and good logging. Add monitoring and alerts. Document your procedures. Test your disaster recovery.
The goal isn't perfection. It's being ready when things go wrong, because they will. Production-ready means you can handle it.
Now go ship something reliable.