Your API works in development. Tests pass. You're ready to ship.
But production-ready means more than "it works on my machine." It means your API stays up during deployments, recovers from failures automatically, provides clear signals when something's wrong, and gives your team the tools to fix issues fast.
This guide covers everything you need to take an API from working code to production-ready service.
## What "Production-Ready" Actually Means
Production-ready isn't a binary state. It's a spectrum. But at minimum, your API should:
- **Stay available during deployments**: No downtime for routine updates
- **Fail gracefully**: Degrade functionality instead of crashing
- **Signal health clearly**: Monitoring knows when something's wrong
- **Recover automatically**: Restart after crashes, reconnect to databases
- **Provide debugging context**: Logs and traces help diagnose issues
- **Have documented procedures**: Team knows how to respond to incidents
Let's build each piece.
## Health Checks: The Foundation
Health checks tell load balancers, orchestrators, and monitoring systems whether your API is ready to serve traffic.
### Basic Health Check
Start with a simple endpoint that checks critical dependencies:
```javascript
// routes/health.js
const express = require('express');
const router = express.Router();
const db = require('../db');
const redis = require('../redis');

router.get('/health', async (req, res) => {
  const checks = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {}
  };

  // Check database
  try {
    await db.query('SELECT 1');
    checks.checks.database = 'healthy';
  } catch (error) {
    checks.checks.database = 'unhealthy';
    checks.status = 'unhealthy';
  }

  // Check Redis
  try {
    await redis.ping();
    checks.checks.redis = 'healthy';
  } catch (error) {
    checks.checks.redis = 'unhealthy';
    checks.status = 'unhealthy';
  }

  const statusCode = checks.status === 'healthy' ? 200 : 503;
  res.status(statusCode).json(checks);
});

module.exports = router;
```
This returns 200 when healthy, 503 when not. Load balancers use this to route traffic.
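To make that contract concrete, here's a minimal sketch of the rotation logic a load balancer might apply to those status codes. The threshold values and the `createHealthTracker` helper are illustrative, not taken from any particular load balancer:

```javascript
// Hypothetical sketch of load-balancer rotation logic: a target leaves
// rotation after `failureThreshold` consecutive non-200 health responses
// and rejoins after `successThreshold` consecutive passes.
function createHealthTracker({ failureThreshold = 3, successThreshold = 2 } = {}) {
  let failures = 0;
  let successes = 0;
  let inRotation = true;

  return {
    record(statusCode) {
      if (statusCode === 200) {
        successes += 1;
        failures = 0;
        if (!inRotation && successes >= successThreshold) inRotation = true;
      } else {
        failures += 1;
        successes = 0;
        if (inRotation && failures >= failureThreshold) inRotation = false;
      }
      return inRotation;
    },
    inRotation: () => inRotation,
  };
}
```

Requiring several consecutive failures before removing a target avoids flapping on a single slow or dropped check.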
### Liveness vs Readiness
Kubernetes distinguishes between two types of health checks:
**Liveness**: Is the process alive? Should we restart it?

**Readiness**: Is it ready to serve traffic? Should we send requests?
```javascript
// Liveness: Just check if the process responds
router.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    timestamp: new Date().toISOString()
  });
});

// Readiness: Check if dependencies are available
router.get('/health/ready', async (req, res) => {
  const checks = {
    database: false,
    redis: false
  };

  try {
    await db.query('SELECT 1');
    checks.database = true;
  } catch (error) {
    // Database down
  }

  try {
    await redis.ping();
    checks.redis = true;
  } catch (error) {
    // Redis down
  }

  const ready = checks.database && checks.redis;
  const statusCode = ready ? 200 : 503;
  res.status(statusCode).json({
    status: ready ? 'ready' : 'not ready',
    checks,
    timestamp: new Date().toISOString()
  });
});
```
Configure Kubernetes to use both:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: petstore-api
  template:
    metadata:
      labels:
        app: petstore-api
    spec:
      containers:
        - name: api
          image: petstore-api:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
```
Liveness failures trigger restarts. Readiness failures stop traffic routing.
### Startup Probes for Slow Initialization
Some APIs take time to warm up. Use startup probes to give them space:
```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30  # 30 * 5 = 150 seconds to start
```
This gives your API 150 seconds to become ready before liveness checks begin.
## Graceful Shutdown
When you deploy new code, existing requests should complete before the old process dies.
### Handling SIGTERM
Kubernetes sends SIGTERM before killing pods. Catch it and shut down gracefully:
```javascript
// server.js
const express = require('express');
const db = require('./db');
const redis = require('./redis');

const app = express();

const server = app.listen(3000, () => {
  console.log('Server started on port 3000');
});

// Track active connections
const connections = new Set();

server.on('connection', (conn) => {
  connections.add(conn);
  conn.on('close', () => {
    connections.delete(conn);
  });
});

// Graceful shutdown handler
let shuttingDown = false;

function gracefulShutdown(signal) {
  if (shuttingDown) return; // Ignore repeated signals
  shuttingDown = true;
  console.log(`Received ${signal}, starting graceful shutdown`);

  // Stop accepting new connections
  server.close(() => {
    console.log('Server closed, no new connections accepted');
  });

  // Set a deadline for existing requests
  const shutdownTimeout = setTimeout(() => {
    console.error('Shutdown timeout, forcing exit');
    process.exit(1);
  }, 30000); // 30 seconds

  // Wait for active connections to finish, then close dependencies.
  // Closing the database or Redis before requests drain would fail
  // any request still in flight.
  const checkInterval = setInterval(async () => {
    if (connections.size === 0) {
      clearInterval(checkInterval);

      await db.end();
      console.log('Database connections closed');

      await redis.quit();
      console.log('Redis connections closed');

      clearTimeout(shutdownTimeout);
      console.log('All connections closed, exiting');
      process.exit(0);
    } else {
      console.log(`Waiting for ${connections.size} connections to close`);
    }
  }, 1000);
}

// Register shutdown handlers
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

// Handle uncaught errors
process.on('uncaughtException', (error) => {
  console.error('Uncaught exception:', error);
  gracefulShutdown('uncaughtException');
});

process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled rejection at:', promise, 'reason:', reason);
  gracefulShutdown('unhandledRejection');
});
```
This ensures:
- No new requests accepted after SIGTERM
- Existing requests complete
- Database connections close cleanly
- Process exits within 30 seconds maximum
### Kubernetes Termination Grace Period
Give Kubernetes enough time for graceful shutdown:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api
spec:
  selector:
    matchLabels:
      app: petstore-api
  template:
    metadata:
      labels:
        app: petstore-api
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: api
          image: petstore-api:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
```
The preStop hook gives load balancers time to remove the pod from rotation before shutdown begins.
## Zero-Downtime Deployments
Graceful shutdown isn't enough. You need a deployment strategy that maintains availability.
### Rolling Updates
Deploy new versions gradually, replacing old pods one at a time:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Can have 8 pods during update (6 + 2)
      maxUnavailable: 1  # At least 5 pods must be available
  selector:
    matchLabels:
      app: petstore-api
  template:
    metadata:
      labels:
        app: petstore-api
    spec:
      containers:
        - name: api
          image: petstore-api:v2.0.0
```
This ensures:
- At least 5 pods always available
- New pods start before old ones stop
- Rollout pauses if new pods fail health checks
### Blue-Green Deployments
Run two complete environments, switch traffic instantly:
```yaml
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: petstore-api
      version: blue
  template:
    metadata:
      labels:
        app: petstore-api
        version: blue
    spec:
      containers:
        - name: api
          image: petstore-api:v1.0.0
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petstore-api-green
  labels:
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: petstore-api
      version: green
  template:
    metadata:
      labels:
        app: petstore-api
        version: green
    spec:
      containers:
        - name: api
          image: petstore-api:v2.0.0
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: petstore-api
spec:
  selector:
    app: petstore-api
    version: blue  # Switch to 'green' to cut over
  ports:
    - port: 80
      targetPort: 3000
```
Deploy green, test it, then update the service selector to switch traffic. Instant rollback if needed.
### Canary Deployments
Send a small percentage of traffic to the new version:
```yaml
# Using Istio for traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: petstore-api
spec:
  hosts:
    - petstore-api
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: petstore-api
            subset: v2
    - route:
        - destination:
            host: petstore-api
            subset: v1
          weight: 95
        - destination:
            host: petstore-api
            subset: v2
          weight: 5
```
This sends 5% of traffic to v2, 95% to v1. Gradually increase the percentage if metrics look good.
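To see what that weighting means mechanically, here's a small illustrative sketch of weighted subset selection. This is not Istio's actual implementation; `pickSubset` and the `routes` shape are hypothetical:

```javascript
// Illustrative weighted routing: pick a subset in proportion to its weight,
// the way a 95/5 split distributes requests (not Istio's real code).
function pickSubset(routes, rand = Math.random()) {
  const total = routes.reduce((sum, r) => sum + r.weight, 0);
  let threshold = rand * total;
  for (const route of routes) {
    threshold -= route.weight;
    if (threshold < 0) return route.subset;
  }
  return routes[routes.length - 1].subset;
}

const routes = [
  { subset: 'v1', weight: 95 },
  { subset: 'v2', weight: 5 },
];
```

Each request draws a random number; roughly 5% of draws land in the v2 band, which is why canary metrics need enough traffic to be statistically meaningful.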
## Disaster Recovery
Things will break. Plan for it.
### Database Backups
Automate backups and test restores:
```bash
#!/bin/bash
# backup-database.sh
set -e

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="petstore_backup_${TIMESTAMP}.sql.gz"
S3_BUCKET="s3://petstore-backups"

echo "Starting backup at $(date)"

# Dump database
pg_dump -h "$DB_HOST" -U "$DB_USER" -d petstore \
  | gzip > "/tmp/$BACKUP_FILE"

# Upload to S3
aws s3 cp "/tmp/$BACKUP_FILE" "$S3_BUCKET/"

# Verify upload
aws s3 ls "$S3_BUCKET/$BACKUP_FILE"

# Clean up local file
rm "/tmp/$BACKUP_FILE"

# Delete backups older than 30 days
aws s3 ls "$S3_BUCKET/" | while read -r line; do
  createDate=$(echo "$line" | awk '{print $1" "$2}')
  createDate=$(date -d "$createDate" +%s)
  olderThan=$(date -d "30 days ago" +%s)
  if [[ $createDate -lt $olderThan ]]; then
    fileName=$(echo "$line" | awk '{print $4}')
    if [[ $fileName != "" ]]; then
      aws s3 rm "$S3_BUCKET/$fileName"
    fi
  fi
done

echo "Backup completed at $(date)"
```
Run this daily via cron:
```
0 2 * * * /usr/local/bin/backup-database.sh >> /var/log/backup.log 2>&1
```
### Testing Restores
Backups are worthless if you can't restore. Test monthly:
```bash
#!/bin/bash
# test-restore.sh
set -e

# Get latest backup
LATEST_BACKUP=$(aws s3 ls s3://petstore-backups/ | sort | tail -n 1 | awk '{print $4}')
echo "Testing restore of $LATEST_BACKUP"

# Download backup
aws s3 cp "s3://petstore-backups/$LATEST_BACKUP" /tmp/test_restore.sql.gz

# Create test database
psql -h "$DB_HOST" -U "$DB_USER" -c "DROP DATABASE IF EXISTS petstore_restore_test"
psql -h "$DB_HOST" -U "$DB_USER" -c "CREATE DATABASE petstore_restore_test"

# Restore
gunzip < /tmp/test_restore.sql.gz | psql -h "$DB_HOST" -U "$DB_USER" -d petstore_restore_test

# Verify data
RECORD_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d petstore_restore_test -t -c "SELECT COUNT(*) FROM pets")
echo "Restored $RECORD_COUNT records"

# Clean up
psql -h "$DB_HOST" -U "$DB_USER" -c "DROP DATABASE petstore_restore_test"
rm /tmp/test_restore.sql.gz

echo "Restore test successful"
```
### Multi-Region Failover
For critical APIs, run in multiple regions:
```yaml
# Route 53 health check and failover
Resources:
  HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        ResourcePath: /health
        FullyQualifiedDomainName: api-us-east-1.petstore.com
        Port: 443
        RequestInterval: 30
        FailureThreshold: 3
  DNSRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z1234567890ABC
      Name: api.petstore.com
      Type: A
      SetIdentifier: us-east-1
      Failover: PRIMARY
      HealthCheckId: !Ref HealthCheck
      AliasTarget:
        HostedZoneId: Z1234567890ABC
        DNSName: api-us-east-1.petstore.com
  DNSRecordFailover:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z1234567890ABC
      Name: api.petstore.com
      Type: A
      SetIdentifier: us-west-2
      Failover: SECONDARY
      AliasTarget:
        HostedZoneId: Z1234567890ABC
        DNSName: api-us-west-2.petstore.com
```
Traffic automatically fails over to us-west-2 if us-east-1 health checks fail.
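The failover rule itself is simple enough to sketch. This illustrative function is not Route 53's actual logic; `records` mirrors the PRIMARY/SECONDARY records above, and `isHealthy` stands in for the health check result:

```javascript
// Illustrative failover resolution: return the primary endpoint while its
// health check passes, otherwise the secondary.
function resolveEndpoint(records, isHealthy) {
  const primary = records.find((r) => r.failover === 'PRIMARY');
  const secondary = records.find((r) => r.failover === 'SECONDARY');
  return isHealthy(primary.dnsName) ? primary.dnsName : secondary.dnsName;
}

const records = [
  { failover: 'PRIMARY', dnsName: 'api-us-east-1.petstore.com' },
  { failover: 'SECONDARY', dnsName: 'api-us-west-2.petstore.com' },
];
```

Note that DNS failover is bounded by record TTLs and health-check intervals, so cutover takes minutes, not milliseconds.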
## Runbooks: Documentation That Matters
When your API breaks at 3 AM, your on-call engineer needs clear instructions.
### Runbook Template
Create a runbook for each common incident:
````markdown
# Runbook: High API Latency

## Symptoms
- P95 latency > 1000ms
- Alert: "API latency high" firing
- Users reporting slow responses

## Impact
- Degraded user experience
- Potential timeouts
- Increased error rates

## Diagnosis

### 1. Check current latency
```bash
# Query Prometheus
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
```

### 2. Identify slow endpoints
```bash
# Check APM dashboard
open https://apm.petstore.com/services/api/operations

# Or query logs
kubectl logs -l app=petstore-api --tail=1000 | grep "duration" | sort -k5 -n | tail -20
```

### 3. Check database performance
```sql
-- Find slow queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for locks
SELECT * FROM pg_locks WHERE NOT granted;
```

### 4. Check resource usage
```bash
# CPU and memory
kubectl top pods -l app=petstore-api

# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"
```

## Resolution

**If database is slow:**
- Check for missing indexes
- Kill long-running queries if safe
- Scale up database if needed
- Enable read replicas for read-heavy queries

**If API pods are overloaded:**
- Scale up replicas: `kubectl scale deployment petstore-api --replicas=10`
- Check for memory leaks in recent deployments
- Consider rolling back recent changes

**If external service is slow:**
- Enable circuit breakers
- Increase timeouts temporarily
- Contact external service provider

## Prevention
- Add database indexes for slow queries
- Implement caching for expensive operations
- Set up auto-scaling based on latency metrics
- Add circuit breakers for external dependencies

## Related Runbooks
````
### Runbook Checklist
Every runbook should have:
- **Symptoms**: How to recognize the problem
- **Impact**: What's affected and how severely
- **Diagnosis**: Step-by-step investigation
- **Resolution**: How to fix it
- **Prevention**: How to avoid it next time
- **Related runbooks**: Links to similar issues
## On-Call Procedures
Good on-call procedures reduce stress and improve response times.
### On-Call Rotation
Set up a fair rotation:
```yaml
# PagerDuty schedule example
schedules:
  - name: "API On-Call"
    time_zone: "America/New_York"
    layers:
      - name: "Primary"
        rotation_virtual_start: "2026-01-01T00:00:00"
        rotation_turn_length_seconds: 604800  # 1 week
        users:
          - user: "alice@petstore.com"
          - user: "bob@petstore.com"
          - user: "carol@petstore.com"
      - name: "Secondary"
        rotation_virtual_start: "2026-01-01T00:00:00"
        rotation_turn_length_seconds: 604800
        users:
          - user: "dave@petstore.com"
          - user: "eve@petstore.com"
```
### Escalation Policy
Define clear escalation paths:
```yaml
escalation_policies:
  - name: "API Escalation"
    escalation_rules:
      - escalation_delay_in_minutes: 5
        targets:
          - type: "schedule"
            id: "primary-oncall"
      - escalation_delay_in_minutes: 10
        targets:
          - type: "schedule"
            id: "secondary-oncall"
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "engineering-manager"
```
If primary doesn't respond in 5 minutes, page secondary. If no response in 10 more minutes, page the manager.
### Alert Severity Levels
Not all alerts need immediate response:
**P0 - Critical**
- API completely down
- Data loss occurring
- Security breach
- Response: Immediate, wake up on-call

**P1 - High**
- Partial outage
- High error rates
- Performance degradation
- Response: Within 15 minutes

**P2 - Medium**
- Single endpoint failing
- Elevated latency
- Non-critical feature broken
- Response: Within 1 hour

**P3 - Low**
- Minor issues
- Warnings
- Capacity concerns
- Response: Next business day
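If you encode these severities in your own alerting glue code, a small lookup keeps the policy in one place. This is an illustrative sketch; `SEVERITY_POLICY` and `routeAlert` are hypothetical names, and the response targets simply mirror the levels above:

```javascript
// Hypothetical mapping of severity level to paging behavior.
const SEVERITY_POLICY = {
  P0: { page: true,  responseTarget: 'immediate' },
  P1: { page: true,  responseTarget: '15 minutes' },
  P2: { page: false, responseTarget: '1 hour' },
  P3: { page: false, responseTarget: 'next business day' },
};

function routeAlert(severity) {
  const policy = SEVERITY_POLICY[severity];
  // Fail loudly on unknown severities rather than silently dropping alerts
  if (!policy) throw new Error(`Unknown severity: ${severity}`);
  return policy;
}
```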
Configure alert routing based on severity:
```yaml
# Prometheus AlertManager config
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
```
## Post-Incident Reviews
After every major incident, write a blameless postmortem:
```markdown
# Incident Review: API Outage 2026-03-13

## Summary
API was unavailable for 23 minutes due to database connection pool exhaustion.

## Timeline (all times UTC)
- 14:32: Alert fired: "API error rate high"
- 14:33: On-call engineer paged
- 14:35: Engineer acknowledged, began investigation
- 14:38: Identified database connection pool exhausted
- 14:40: Attempted to restart API pods
- 14:45: Restart failed, connections still exhausted
- 14:47: Restarted database connection pooler
- 14:50: API began recovering
- 14:55: All health checks passing, incident resolved

## Root Cause
A slow database query caused connections to pile up. The connection pool (max 20 connections) filled up, causing new requests to fail.

## Impact
- 23 minutes of downtime
- ~1,500 failed requests
- 12 customer support tickets

## What Went Well
- Alert fired quickly
- On-call engineer responded within 3 minutes
- Runbook helped identify issue

## What Went Wrong
- Connection pool too small for traffic volume
- No monitoring on connection pool usage
- Slow query not caught in development

## Action Items
- [ ] Increase connection pool to 50 (Alice, by 2026-03-15)
- [ ] Add connection pool metrics to dashboard (Bob, by 2026-03-17)
- [ ] Add slow query detection to CI (Carol, by 2026-03-20)
- [ ] Update runbook with connection pool troubleshooting (Dave, by 2026-03-14)

## Lessons Learned
- Monitor resource pools (connections, threads, memory)
- Load test with realistic query patterns
- Connection pool size should scale with traffic
```
Share postmortems with the whole team. Learning from incidents makes everyone better.
## Production Readiness Checklist
Use this checklist before launching:
### Observability
- [ ] Health check endpoints implemented
- [ ] Metrics exported (Prometheus format)
- [ ] Structured logging with correlation IDs
- [ ] Distributed tracing configured
- [ ] Dashboards created for key metrics
- [ ] Alerts configured for critical issues
### Reliability
- [ ] Graceful shutdown implemented
- [ ] Connection pooling configured
- [ ] Timeouts set on all external calls
- [ ] Circuit breakers for external dependencies
- [ ] Retry logic with exponential backoff
- [ ] Rate limiting implemented
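Of the items above, retry with exponential backoff is the one not shown elsewhere in this guide. Here's a minimal sketch; the delay values and the `retryWithBackoff` name are illustrative:

```javascript
// Minimal retry with exponential backoff and a delay cap.
async function retryWithBackoff(fn, { retries = 5, baseDelayMs = 100, maxDelayMs = 5000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries) throw error; // Out of retries: propagate
      // Exponential backoff: 100ms, 200ms, 400ms, ... capped at maxDelayMs
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

In production you'd usually add jitter to the delay so many clients retrying at once don't synchronize into thundering-herd spikes.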
### Deployment
- [ ] Rolling update strategy configured
- [ ] Health checks prevent bad deployments
- [ ] Rollback procedure documented
- [ ] Database migrations automated
- [ ] Feature flags for risky changes
### Disaster Recovery
- [ ] Automated database backups
- [ ] Backup restore tested
- [ ] Multi-region deployment (if needed)
- [ ] Failover procedure documented
- [ ] Data retention policy defined
### Security
- [ ] API authentication required
- [ ] Rate limiting per API key
- [ ] Input validation on all endpoints
- [ ] SQL injection prevention
- [ ] Secrets stored securely (not in code)
- [ ] TLS/HTTPS enforced
### Documentation
- [ ] API documentation published
- [ ] Runbooks for common incidents
- [ ] Architecture diagram created
- [ ] On-call procedures documented
- [ ] Escalation policy defined
### Testing
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Load tests completed
- [ ] Security scan completed
- [ ] Chaos engineering tests (optional)
## Conclusion
Production-ready isn't a destination; it's a practice. You'll never check every box perfectly. But each improvement makes your API more reliable, your team more confident, and your users happier.
Start with the basics: health checks, graceful shutdown, and good logging. Add monitoring and alerts. Document your procedures. Test your disaster recovery.
The goal isn't perfection. It's being ready when things go wrong, because they will. Production-ready means you can handle it.
Now go ship something reliable.