Operations Plan

Deployment phases, operating model, and go-live execution.

AI Surveillance Platform — 24/7 Operations & Reliability Plan

Version: 1.0
Date: 2025-01-15
Classification: Internal — Operations & Engineering
System: 8-Channel AI Surveillance Platform (Cloud + Edge)
Target: Industrial-grade autonomous operations with minimal human intervention


Table of Contents

  1. Monitoring & Observability
  2. Logging Strategy
  3. Health Checks
  4. Service Restart & Recovery
  5. Backup Strategy
  6. Data Retention
  7. Storage Management
  8. Incident Response
  9. Upgrades & Maintenance
  10. Performance Optimization
  11. Disaster Recovery
  12. Capacity Planning

Document Control

Version Date Author Changes
1.0 2025-01-15 SRE Team Initial comprehensive operations plan

Approval

Role Name Date
Head of Engineering _____________ ____-__-__
Security Officer _____________ ____-__-__
Operations Lead _____________ ____-__-__

1. Monitoring & Observability

1.1 Overview

The monitoring stack provides real-time visibility into all platform components, enabling proactive issue detection and rapid incident response. All metrics are collected at 15-second intervals with 15-month retention.

Tooling Choice: Prometheus + Grafana (primary) with Alertmanager for notification routing.

Architecture:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Node      │     │  Prometheus │     │   Grafana   │
│  Exporter   │────▶│   Server    │────▶│  Dashboards │
│ (per host)  │     │  (TSDB)     │     │  (visualize)│
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────┴───────┐
                    │ Alertmanager │────▶ PagerDuty / OpsGenie / Slack
                    └──────────────┘

1.2 Metrics Collection

1.2.1 System Metrics (Node Exporter + cAdvisor)

Metric Category Specific Metrics Collection Interval Retention
CPU Usage % per core, load average (1m/5m/15m), steal time, iowait 15s 15 months
Memory Used/available/total, swap usage, OOM kills, page faults 15s 15 months
Disk Usage % per volume, IOPS, read/write latency, inode usage 15s 15 months
Network RX/TX bytes/packets/drops per interface, TCP connections, retransmits 15s 15 months
Containers CPU/memory per container, restart count, network IO per container 15s 15 months

Prometheus scrape configuration:

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s
    scrape_timeout: 10s

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  - job_name: 'surveillance-api'
    static_configs:
      - targets: ['surveillance-api:8080']
    scrape_interval: 15s
    metrics_path: /metrics

  - job_name: 'ai-inference'
    static_configs:
      - targets: ['ai-inference:8080']
    scrape_interval: 15s
    metrics_path: /metrics

  - job_name: 'video-processor'
    static_configs:
      - targets: ['video-processor:8080']
    scrape_interval: 15s
    metrics_path: /metrics

1.2.2 Application Metrics (Custom / OpenTelemetry)

Metric Name Type Description Labels
surveillance_fps_per_camera Gauge Current FPS being processed per camera camera_id, location
surveillance_detection_rate Gauge Detections per second per stream camera_id, model_version
surveillance_alert_rate Counter Total alerts generated severity, camera_id, alert_type
surveillance_pipeline_latency_ms Histogram End-to-end processing latency stage, camera_id
surveillance_frame_drop_rate Gauge Percentage of frames dropped camera_id, reason
surveillance_model_inference_ms Histogram AI model inference time model_name, batch_size
surveillance_stream_active Gauge Whether stream is active (1/0) camera_id, source
surveillance_face_recognition_matches Counter Face recognition hits/misses camera_id, match_type

Application instrumentation (Python example):

from prometheus_client import Counter, Histogram, Gauge, generate_latest
from functools import wraps
import time

# Define metrics
DETECTION_COUNTER = Counter(
    'surveillance_detections_total',
    'Total detections by type',
    ['camera_id', 'detection_type', 'model_version']
)

PIPELINE_LATENCY = Histogram(
    'surveillance_pipeline_latency_ms',
    'End-to-end pipeline latency in milliseconds',
    ['stage', 'camera_id'],
    buckets=[10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
)

CAMERA_FPS = Gauge(
    'surveillance_fps_per_camera',
    'Current FPS per camera stream',
    ['camera_id', 'location']
)

STREAM_ACTIVE = Gauge(
    'surveillance_stream_active',
    'Stream connectivity status',
    ['camera_id', 'source']
)

def track_latency(stage, camera_id):
    """Decorator to track function latency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000
                PIPELINE_LATENCY.labels(
                    stage=stage,
                    camera_id=camera_id
                ).observe(elapsed_ms)
        return wrapper
    return decorator

1.2.3 Business Metrics

Metric Name Type Business Purpose Alert Threshold
surveillance_persons_detected_daily Counter Daily person detection volume Anomaly detection
surveillance_unknown_persons Counter Unknown/alerted persons per period Trend analysis
surveillance_alerts_sent Counter Alerts successfully delivered Delivery health
surveillance_alerts_failed Counter Failed alert deliveries > 5 in 5 min = P2
surveillance_camera_uptime_pct Gauge Per-camera uptime percentage < 99% = P3
surveillance_detection_accuracy Gauge Model accuracy score < threshold = P2
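
The "Anomaly detection" threshold above can be made concrete as a PromQL expression. The following is a sketch assuming a trailing 7-day baseline and a 3-sigma band; both the window sizes and the band width are illustrative, not mandated by this plan:

```promql
# Sketch: flag when today's person-detection volume deviates more than
# 3 standard deviations from its trailing 7-day baseline
abs(
  sum(increase(surveillance_persons_detected_daily[1d]))
  -
  avg_over_time(sum(increase(surveillance_persons_detected_daily[1d]))[7d:1h])
)
> 3 * stddev_over_time(sum(increase(surveillance_persons_detected_daily[1d]))[7d:1h])
```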

1.2.4 Error Metrics

Metric Name Type Description Severity
surveillance_errors_total Counter Errors by type and service All
surveillance_stream_errors Counter Stream connection errors P2 if > 10/min
surveillance_model_errors Counter Model inference failures P1 if > 5/min
surveillance_db_errors Counter Database operation failures P1 if > 3/min
surveillance_storage_errors Counter Storage read/write failures P2 if > 5/min

1.3 Alerting Rules

1.3.1 Critical Alerts (P1) — Page Immediately

# /etc/prometheus/alerts/critical.yml
groups:
  - name: critical
    rules:
      - alert: AllStreamsDown
        expr: sum(surveillance_stream_active) == 0
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "ALL camera streams are down"
          description: "No active streams detected for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/all-streams-down"

      - alert: AIPipelineDown
        expr: rate(surveillance_detections_total[5m]) == 0
        for: 2m
        labels:
          severity: p1
        annotations:
          summary: "AI pipeline not producing detections"
          description: "Zero detections in the last 2 minutes across all streams"

      - alert: StorageFull
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "Storage critically low: {{ $labels.mountpoint }}"
          description: "Less than 5% storage remaining on {{ $labels.instance }}"

      - alert: DatabaseUnreachable
        expr: pg_up == 0
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "PostgreSQL database is unreachable"
          description: "Cannot connect to primary database"

      - alert: HighErrorRate
        expr: rate(surveillance_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: p1
        annotations:
          summary: "High error rate across services"
          description: "Error rate exceeds 10 errors per second"

1.3.2 High Severity Alerts (P2) — Page Within 1 Hour

# /etc/prometheus/alerts/high.yml
groups:
  - name: high
    rules:
      - alert: SingleCameraDown
        expr: surveillance_stream_active{camera_id=~"cam.*"} == 0
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Camera {{ $labels.camera_id }} is offline"
          description: "Camera stream has been down for more than 5 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.95,
          rate(surveillance_pipeline_latency_ms_bucket[5m])) > 2000
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Pipeline latency is high"
          description: "P95 latency exceeds 2000ms"

      - alert: ModelAccuracyDegraded
        expr: surveillance_detection_accuracy < 0.85
        for: 10m
        labels:
          severity: p2
        annotations:
          summary: "AI model accuracy degraded"
          description: "Detection accuracy below 85%"

      - alert: MemoryPressure
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: "Memory usage above 90%"

      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Disk space warning: {{ $labels.mountpoint }}"
          description: "Less than 15% disk space remaining"

1.3.3 Medium Severity Alerts (P3) — Respond Within 4 Hours

# /etc/prometheus/alerts/medium.yml
groups:
  - name: medium
    rules:
      - alert: CameraFPSLow
        expr: surveillance_fps_per_camera < 15
        for: 10m
        labels:
          severity: p3
        annotations:
          summary: "Camera {{ $labels.camera_id }} FPS below threshold"

      - alert: FrameDropsHigh
        expr: surveillance_frame_drop_rate > 0.10
        for: 10m
        labels:
          severity: p3
        annotations:
          summary: "High frame drop rate on {{ $labels.camera_id }}"

      - alert: CertificateExpiry
        expr: (ssl_certificate_expiry_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "TLS certificate expiring soon"

      - alert: BackupNotRun
        expr: time() - surveillance_last_backup_timestamp > 90000
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "Database backup has not run in 25+ hours"

1.3.4 Low Severity Alerts (P4) — Respond Within 24 Hours

# /etc/prometheus/alerts/low.yml
groups:
  - name: low
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 30m
        labels:
          severity: p4
        annotations:
          summary: "CPU usage high on {{ $labels.instance }}"

      - alert: ContainerRestartLoop
        expr: rate(container_restarts_total[15m]) > 3
        for: 15m
        labels:
          severity: p4
        annotations:
          summary: "Container restart loop detected"
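
These alert expressions can be unit-tested offline with `promtool test rules`. A minimal sketch for the AllStreamsDown rule follows; the test file name is an assumption, and the rule file path matches the layout above:

```yaml
# alerts_test.yml — run with: promtool test rules alerts_test.yml
rule_files:
  - /etc/prometheus/alerts/critical.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    # The only reporting stream stays at 0 for the whole window,
    # so sum(surveillance_stream_active) == 0
    input_series:
      - series: 'surveillance_stream_active{camera_id="cam_01"}'
        values: '0x20'
    alert_rule_test:
      - eval_time: 2m
        alertname: AllStreamsDown
        exp_alerts:
          - exp_labels:
              severity: p1
            exp_annotations:
              summary: "ALL camera streams are down"
              description: "No active streams detected for more than 1 minute"
              runbook_url: "https://wiki.internal/runbooks/all-streams-down"
```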

1.4 Alertmanager Configuration

# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@surveillance.company.com'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: '<SLACK_WEBHOOK_URL>'

# Inhibit alerts of lower severity when higher severity fires
inhibit_rules:
  - source_match:
      severity: 'p1'
    target_match:
      severity: 'p2'
    equal: ['alertname', 'instance']

route:
  receiver: 'default'
  group_by: ['alertname', 'severity', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # P1 alerts — page immediately, no grouping delay
    - match:
        severity: p1
      receiver: 'p1-critical'
      group_wait: 0s
      repeat_interval: 15m
      continue: true

    # P2 alerts — page within 1 hour
    - match:
        severity: p2
      receiver: 'p2-high'
      group_wait: 2m
      repeat_interval: 1h

    # P3 alerts — Slack + email only
    - match:
        severity: p3
      receiver: 'p3-medium'
      group_wait: 5m
      repeat_interval: 4h

    # P4 alerts — daily digest
    - match:
        severity: p4
      receiver: 'p4-low'
      group_wait: 10m
      repeat_interval: 24h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#surveillance-alerts'
        title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'p1-critical'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_ROUTING_KEY>'
        severity: critical
        description: '{{ .GroupLabels.alertname }}'
    slack_configs:
      - channel: '#surveillance-critical'
        send_resolved: true
        title: 'P1 CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    email_configs:
      - to: 'oncall@company.com'
        subject: '[P1 CRITICAL] Surveillance Platform Alert'

  - name: 'p2-high'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_ROUTING_KEY>'
        severity: error
    slack_configs:
      - channel: '#surveillance-alerts'
        send_resolved: true

  - name: 'p3-medium'
    slack_configs:
      - channel: '#surveillance-warnings'
        send_resolved: true

  - name: 'p4-low'
    email_configs:
      - to: 'ops-team@company.com'
        subject: '[P4 Low] Surveillance Platform — Daily Digest'

1.5 Grafana Dashboards

1.5.1 Dashboard: Infrastructure Overview (ID: infra-overview)

{
  "dashboard": {
    "title": "Infrastructure Overview",
    "tags": ["infrastructure", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage %",
        "type": "timeseries",
        "targets": [{
          "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{ instance }}"
        }],
        "alert": {
          "conditions": [{
            "evaluator": {"params": [85], "type": "gt"},
            "operator": {"type": "and"},
            "query": {"params": ["A", "5m", "now"]},
            "reducer": {"type": "avg"}
          }]
        }
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [{
          "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
          "legendFormat": "{{ instance }}"
        }]
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "targets": [{
          "expr": "100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)"
        }],
        "fieldConfig": {
          "max": 100,
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "yellow", "value": 70},
              {"color": "orange", "value": 85},
              {"color": "red", "value": 95}
            ]
          }
        }
      },
      {
        "title": "Network I/O",
        "type": "timeseries",
        "targets": [
          {"expr": "rate(node_network_receive_bytes_total[5m])", "legendFormat": "RX {{ device }}"},
          {"expr": "rate(node_network_transmit_bytes_total[5m])", "legendFormat": "TX {{ device }}"}
        ]
      },
      {
        "title": "Container Count",
        "type": "stat",
        "targets": [{
          "expr": "count(container_last_seen)"
        }]
      },
      {
        "title": "Container Restarts (15m)",
        "type": "stat",
        "targets": [{
          "expr": "increase(container_restarts_total[15m])"
        }],
        "fieldConfig": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "red", "value": 1}
            ]
          }
        }
      }
    ]
  }
}

1.5.2 Dashboard: Camera Health (ID: camera-health)

Panel Type Query / Data Source
Stream Status Grid Stat grid (8 panels) surveillance_stream_active{camera_id=~"cam.*"}
FPS per Camera Timeseries surveillance_fps_per_camera by camera_id
Frame Drop Rate Timeseries surveillance_frame_drop_rate by camera_id
Camera Uptime % Gauge per camera avg_over_time(surveillance_stream_active[24h]) * 100
Stream Error Count Bar chart increase(surveillance_stream_errors[1h]) by camera_id
Last Frame Timestamp Table Time since last frame per camera
Bitrate per Stream Timeseries surveillance_stream_bitrate_kbps

Camera Health Score Calculation:

# Overall camera health score (0-100); the weights sum to 100,
# and the FPS term assumes a 30 FPS target
(
  avg(surveillance_stream_active) * 50 +
  (1 - avg(surveillance_frame_drop_rate)) * 30 +
  (avg(surveillance_fps_per_camera) / 30) * 20
)
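
If the score is graphed frequently, it can be precomputed as a Prometheus recording rule rather than re-evaluated per dashboard refresh. A sketch follows; the rule file path and record name are assumptions:

```yaml
# /etc/prometheus/rules/camera-health.yml
groups:
  - name: camera-health
    rules:
      - record: surveillance:camera_health_score
        # Weights sum to 100, yielding a 0-100 score
        expr: >
          avg(surveillance_stream_active) * 50
          + (1 - avg(surveillance_frame_drop_rate)) * 30
          + (avg(surveillance_fps_per_camera) / 30) * 20
```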

1.5.3 Dashboard: AI Pipeline Performance (ID: ai-pipeline)

Panel Type Metric
Inference Latency (P50/P95/P99) Timeseries histogram_quantile(0.50/0.95/0.99, rate(surveillance_pipeline_latency_ms_bucket[5m]))
Detections per Second Timeseries rate(surveillance_detections_total[5m])
Model Accuracy Trend Timeseries surveillance_detection_accuracy
Pipeline Throughput Stat Total frames processed/minute
GPU Utilization (if applicable) Gauge nvidia_gpu_utilization_gpu
GPU Memory Usage Timeseries nvidia_gpu_memory_used_bytes
Model Load Status Table Current model version, load time, status
Batch Size Distribution Heatmap Inference batch sizes over time

1.5.4 Dashboard: Alert Delivery Stats (ID: alert-delivery)

Panel Type Query
Alerts Sent Today Stat increase(surveillance_alerts_sent[24h])
Alerts Failed Stat increase(surveillance_alerts_failed[24h])
Delivery Success Rate Gauge alerts_sent / (alerts_sent + alerts_failed)
Alerts by Severity Pie chart surveillance_alerts_sent by severity
Alerts by Camera Bar chart Top cameras by alert count
Notification Channel Status Table Channel health per delivery method
Alert Response Time Histogram Time from detection to notification

1.5.5 Dashboard: Storage Usage Trends (ID: storage-trends)

Panel Type Query
Total Storage Used Stat Sum of all storage volumes
Storage Growth Rate Timeseries Daily increase in bytes
Retention Policy Status Table Days remaining per retention tier
Media vs. Metadata Split Pie chart Storage breakdown by type
Projected Capacity Exhaustion Stat Days until full at current growth rate
Cleanup Job Status Table Last run, records cleaned, errors
Cross-Region Replication Lag Timeseries Replication delay in seconds

1.6 On-Call Rotation

Shift Time (UTC) Primary On-Call Secondary
APAC 00:00 — 08:00 APAC SRE Team EMEA Escalation
EMEA 08:00 — 16:00 EMEA SRE Team Americas Escalation
Americas 16:00 — 00:00 Americas SRE Team APAC Escalation

Escalation Policy (PagerDuty):

  1. Notification: Alert fires → Notify on-call engineer via PagerDuty push + SMS
  2. Acknowledge: 5-minute acknowledge window
  3. Escalation 1: No acknowledge → Escalate to team lead (15 min)
  4. Escalation 2: No response → Escalate to engineering manager (30 min)
  5. Escalation 3: No response → Escalate to VP Engineering (45 min)

2. Logging Strategy

2.1 Log Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│ Application │────▶│  Filebeat   │────▶│  Logstash   │────▶│ Elasticsearch│
│ (JSON logs) │     │  (shipper)  │     │ (processor) │     │   (store)    │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬───────┘
                                                                   │
                                                            ┌──────┴──────┐
                                                            │   Kibana    │
                                                            │ (visualize) │
                                                            └─────────────┘

2.2 Log Levels

Level Numeric Usage Retention Action
DEBUG 10 Detailed diagnostic info 7 days Development only
INFO 20 Normal operational events 90 days Standard operations
WARNING 30 Anomalous but non-critical conditions 90 days Monitor trends
ERROR 40 Operational failures, handled exceptions 1 year Alert if rate > threshold
CRITICAL 50 System-threatening failures 1 year Immediate P1 alert

Production default level: INFO (DEBUG only enabled per-request for troubleshooting)

2.3 Structured Logging Format

All application logs MUST be in JSON format:

{
  "timestamp": "2025-01-15T08:30:15.123456Z",
  "level": "ERROR",
  "logger": "surveillance.video_processor",
  "message": "Failed to connect to camera stream",
  "request_id": "req_abc123def456",
  "trace_id": "trace_789xyz",
  "service": "video-processor",
  "version": "2.3.1",
  "host": "edge-node-01",
  "environment": "production",
  "camera_id": "cam_03_entrance",
  "location": "main_entrance",
  "error": {
    "type": "ConnectionTimeout",
    "message": "Connection to rtsp://192.168.1.103:554/stream timed out after 10s",
    "retry_count": 3,
    "stack_trace": "..."
  },
  "context": {
    "stream_url": "rtsp://***.***.1.***:554/stream",
    "connection_duration_ms": 10000,
    "previous_disconnect": "2025-01-15T08:25:00Z"
  },
  "performance": {
    "processing_time_ms": 0.5,
    "memory_mb": 128.5
  }
}

Python logging configuration:

# logging_config.py
import logging
import os
from datetime import datetime
from pythonjsonlogger import jsonlogger

class StructuredLogFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super().add_fields(log_record, record, message_dict)
        log_record['timestamp'] = datetime.utcnow().isoformat() + 'Z'
        log_record['level'] = record.levelname
        log_record['logger'] = record.name
        log_record['service'] = os.environ.get('SERVICE_NAME', 'unknown')
        log_record['version'] = os.environ.get('SERVICE_VERSION', 'unknown')
        log_record['host'] = os.environ.get('HOSTNAME', 'unknown')
        log_record['environment'] = os.environ.get('ENV', 'production')

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'json': {
            '()': StructuredLogFormatter,
            'format': '%(timestamp)s %(level)s %(message)s'
        }
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'json',
            'stream': 'ext://sys.stdout'
        },
        'file': {
            'class': 'logging.handlers.RotatingFileHandler',
            'formatter': 'json',
            'filename': '/var/log/surveillance/app.log',
            'maxBytes': 104857600,  # 100 MB
            'backupCount': 10
        }
    },
    'loggers': {
        'surveillance': {
            'level': os.environ.get('LOG_LEVEL', 'INFO'),
            'handlers': ['console', 'file'],
            'propagate': False
        }
    }
}
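
Applying the config is a one-liner at service startup (`logging.config.dictConfig`). The sketch below uses a simplified console-only variant so it runs without the pythonjsonlogger dependency; `SimpleJsonFormatter` is an illustrative stand-in, not part of the production config:

```python
import json
import logging
import logging.config
from datetime import datetime, timezone

class SimpleJsonFormatter(logging.Formatter):
    """Stdlib-only stand-in for the JSON formatter above."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

DEMO_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": SimpleJsonFormatter}},
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "json",
            "stream": "ext://sys.stdout",
        }
    },
    "loggers": {
        "surveillance": {"level": "INFO", "handlers": ["console"], "propagate": False}
    },
}

# Apply the config once at startup; all "surveillance.*" loggers inherit it
logging.config.dictConfig(DEMO_CONFIG)
logging.getLogger("surveillance").info("pipeline started")
```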

2.4 Log Correlation

Every request receives a unique request_id and trace_id:

import uuid
import contextvars

# Context variable for request-scoped tracing
request_id_var = contextvars.ContextVar('request_id', default=None)
trace_id_var = contextvars.ContextVar('trace_id', default=None)

def get_current_request_id() -> str:
    req_id = request_id_var.get()
    if req_id is None:
        req_id = f"req_{uuid.uuid4().hex[:16]}"
        request_id_var.set(req_id)
    return req_id

def get_current_trace_id() -> str:
    trace_id = trace_id_var.get()
    if trace_id is None:
        trace_id = f"trace_{uuid.uuid4().hex[:16]}"
        trace_id_var.set(trace_id)
    return trace_id

Propagation across services:

  • HTTP: X-Request-ID and X-Trace-ID headers
  • Message queue: Metadata fields in message envelope
  • gRPC: Custom metadata
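
The HTTP propagation rule can be sketched as a pair of inject/extract helpers around the context variables defined above (reproduced here so the example is self-contained; the function names `inject_headers`/`extract_headers` are illustrative):

```python
import contextvars
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def get_current_request_id() -> str:
    req_id = request_id_var.get()
    if req_id is None:
        req_id = f"req_{uuid.uuid4().hex[:16]}"
        request_id_var.set(req_id)
    return req_id

def get_current_trace_id() -> str:
    trace_id = trace_id_var.get()
    if trace_id is None:
        trace_id = f"trace_{uuid.uuid4().hex[:16]}"
        trace_id_var.set(trace_id)
    return trace_id

def inject_headers(headers: dict) -> dict:
    """Attach correlation IDs to an outgoing HTTP request."""
    headers["X-Request-ID"] = get_current_request_id()
    headers["X-Trace-ID"] = get_current_trace_id()
    return headers

def extract_headers(headers: dict) -> None:
    """Adopt correlation IDs from an incoming request, or mint new ones."""
    request_id_var.set(headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex[:16]}")
    trace_id_var.set(headers.get("X-Trace-ID") or f"trace_{uuid.uuid4().hex[:16]}")
```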

2.5 Log Retention Policy

Log Category Retention Storage Class Compression
Application logs (INFO+) 90 days Hot (SSD) 30d → Warm 60d After 7 days
Error logs (ERROR+) 1 year Warm 90d → Cold 275d After 30 days
Audit logs 1 year Hot 90d → Warm 180d → Cold 95d After 90 days
Debug logs 7 days Hot only None
Access logs 90 days Warm 30d → Cold 60d After 30 days
System logs (syslog/journald) 90 days Warm After 7 days

Elasticsearch Index Lifecycle Management (ILM):

PUT _ilm/policy/surveillance-logs
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d",
            "max_docs": 100000000
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "allocate": {
            "require": { "data": "warm" }
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": { "data": "cold" }
          },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
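
The ILM policy only takes effect once it is attached to indices. A sketch of the matching index template follows; the index pattern and rollover alias names are assumptions:

```json
PUT _index_template/surveillance-logs
{
  "index_patterns": ["surveillance-logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "surveillance-logs",
      "index.lifecycle.rollover_alias": "surveillance-logs"
    }
  }
}
```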

2.6 Sensitive Data Handling

NEVER log:

  • Face embeddings or biometric data
  • Full-resolution images of detected persons
  • PII (names, employee IDs, phone numbers)
  • Credentials, API keys, tokens, passwords
  • Stream URLs with embedded credentials
  • Internal network topology
  • VPN configuration details

Sanitization rules:

import re

SENSITIVE_PATTERNS = [
    (r'rtsp://[^:]+:[^@]+@', 'rtsp://***:***@'),
    (r'password[=:]\s*\S+', 'password=***'),
    (r'api[_-]?key[=:]\s*\S+', 'api_key=***'),
    (r'token[=:]\s*\S+', 'token=***'),
    (r'embedding[=:]\s*\[.*?\]', 'embedding=[REDACTED]'),
    (r'face[_-]?vector[=:]\s*\[.*?\]', 'face_vector=[REDACTED]'),
]

def sanitize_log_message(message: str) -> str:
    for pattern, replacement in SENSITIVE_PATTERNS:
        message = re.sub(pattern, replacement, message, flags=re.IGNORECASE)
    return message
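
Exercising the rules above (patterns reproduced so the snippet stands alone):

```python
import re

SENSITIVE_PATTERNS = [
    (r'rtsp://[^:]+:[^@]+@', 'rtsp://***:***@'),
    (r'password[=:]\s*\S+', 'password=***'),
    (r'api[_-]?key[=:]\s*\S+', 'api_key=***'),
    (r'token[=:]\s*\S+', 'token=***'),
]

def sanitize_log_message(message: str) -> str:
    for pattern, replacement in SENSITIVE_PATTERNS:
        message = re.sub(pattern, replacement, message, flags=re.IGNORECASE)
    return message

# A stream URL with embedded credentials is masked before logging
print(sanitize_log_message("connect failed: rtsp://admin:hunter2@192.168.1.103:554/stream"))
# → connect failed: rtsp://***:***@192.168.1.103:554/stream
```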

3. Health Checks

3.1 Health Check Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Health Check Endpoints                    │
│                                                              │
│  /health        → Liveness probe (Kubernetes/Docker)         │
│  /health/ready  → Readiness probe (accepting traffic)        │
│  /health/deep   → Deep health (full pipeline validation)     │
└──────────────────────────────────────────────────────────────┘

3.2 Endpoint Specifications

3.2.1 Liveness Probe — GET /health

Purpose: Determine if the process is running and not deadlocked.

Response:

{
  "status": "alive",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-api",
  "version": "2.3.1",
  "uptime_seconds": 86400
}

Criteria:

  • Process is running
  • Main thread is not blocked
  • Returns HTTP 200 within 1 second

Failure action: Container orchestrator restarts the container.

Configuration:

# Kubernetes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

# Docker Compose
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 10s
  timeout: 3s
  retries: 3
  start_period: 30s

3.2.2 Readiness Probe — GET /health/ready

Purpose: Determine if the service is ready to accept traffic.

Response:

{
  "status": "ready",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-api",
  "version": "2.3.1",
  "checks": {
    "database": {
      "status": "pass",
      "response_time_ms": 12,
      "message": "Connected to PostgreSQL primary"
    },
    "object_storage": {
      "status": "pass",
      "response_time_ms": 45,
      "message": "S3 bucket accessible"
    },
    "cache": {
      "status": "pass",
      "response_time_ms": 2,
      "message": "Redis connection OK"
    }
  }
}

Criteria:

  • All required dependencies reachable
  • Database connection pool has available connections
  • Object storage accessible
  • Cache layer accessible
  • AI model loaded (for inference services)

Failure response: HTTP 503 with details

{
  "status": "not_ready",
  "timestamp": "2025-01-15T08:30:15Z",
  "checks": {
    "database": {
      "status": "fail",
      "response_time_ms": 5000,
      "message": "Connection timeout after 5000ms"
    },
    "object_storage": { "status": "pass" },
    "cache": { "status": "pass" }
  }
}

Configuration:

# Kubernetes
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 2

3.2.3 Deep Health Check — GET /health/deep

Purpose: Validate the entire processing pipeline end-to-end.

Response:

{
  "status": "healthy",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-platform",
  "version": "2.3.1",
  "checks": {
    "database": {
      "status": "pass",
      "response_time_ms": 8,
      "details": {
        "connection": "ok",
        "query_execution": "ok",
        "replication_lag_seconds": 0
      }
    },
    "object_storage": {
      "status": "pass",
      "response_time_ms": 67,
      "details": {
        "read_test": "ok",
        "write_test": "ok",
        "list_test": "ok"
      }
    },
    "ai_model": {
      "status": "pass",
      "response_time_ms": 145,
      "details": {
        "model_loaded": true,
        "model_version": "face-detection-v2.1",
        "gpu_available": true,
        "test_inference": "ok"
      }
    },
    "streams": {
      "status": "pass",
      "details": {
        "active_streams": 8,
        "expected_streams": 8,
        "streams": [
          {"camera_id": "cam_01", "fps": 30, "status": "active"},
          {"camera_id": "cam_02", "fps": 30, "status": "active"},
          {"camera_id": "cam_03", "fps": 25, "status": "active"},
          {"camera_id": "cam_04", "fps": 30, "status": "active"},
          {"camera_id": "cam_05", "fps": 30, "status": "active"},
          {"camera_id": "cam_06", "fps": 28, "status": "active"},
          {"camera_id": "cam_07", "fps": 30, "status": "active"},
          {"camera_id": "cam_08", "fps": 30, "status": "active"}
        ]
      }
    },
    "cache": {
      "status": "pass",
      "response_time_ms": 1,
      "details": {
        "set_test": "ok",
        "get_test": "ok",
        "memory_usage_pct": 45
      }
    },
    "alert_delivery": {
      "status": "pass",
      "details": {
        "channels_tested": 3,
        "success": 3
      }
    },
    "pipeline_e2e": {
      "status": "pass",
      "response_time_ms": 523,
      "details": {
        "capture": "ok",
        "inference": "ok",
        "alert_generation": "ok",
        "storage": "ok"
      }
    }
  }
}

Execution:

  • Triggered manually or by monitoring every 5 minutes
  • NOT used for Kubernetes probes (too slow)
  • Full pipeline validation takes 1-5 seconds

3.3 Dependency Health Check Matrix

Dependency Check Method Timeout Expected Result Failure Action
PostgreSQL SELECT 1 3s Row returned Return not_ready
Redis Cache PING 2s PONG received Degrade to DB only
S3 / Object Storage List + Put + Get test object 10s All operations succeed Queue for retry
AI Model Load model + test inference 30s Inference completes Report model error
Camera Streams RTSP describe/ping 10s Stream metadata received Mark stream offline
VPN Tunnel ICMP to edge gateway 5s Response received Mark edge offline
SMTP/Notification TCP connect + EHLO 5s SMTP greeting received Queue alerts
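
The SMTP/notification row, for instance, reduces to a timeout-bounded TCP connect. A sketch using asyncio follows; the function name is an assumption, and the default timeout matches the matrix:

```python
import asyncio
import time

async def check_tcp_dependency(host: str, port: int, timeout: float = 5.0) -> dict:
    """Timeout-bounded TCP connect, as used for the SMTP/notification check."""
    start = time.monotonic()
    try:
        _, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout
        )
        # Connection established — close immediately; we only probe reachability
        writer.close()
        await writer.wait_closed()
        return {
            "status": "pass",
            "response_time_ms": round((time.monotonic() - start) * 1000, 2),
        }
    except (asyncio.TimeoutError, OSError) as exc:
        return {"status": "fail", "message": str(exc)}
```

On failure the caller would mark the dependency degraded and queue outbound alerts for retry, per the matrix.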

3.4 Health Check Implementation

# health.py
from enum import Enum
from dataclasses import dataclass, field
from typing import Dict, List
import time
import asyncio

class HealthStatus(Enum):
    PASS = "pass"
    FAIL = "fail"
    WARN = "warn"

@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    response_time_ms: float
    message: str
    details: Dict = field(default_factory=dict)

class HealthChecker:
    def __init__(self):
        self.checks = {}
    
    def register(self, name: str, check_func):
        self.checks[name] = check_func
    
    async def run_all(self, timeout: float = 30.0) -> List[HealthCheckResult]:
        tasks = [
            self._run_check(name, func, timeout)
            for name, func in self.checks.items()
        ]
        return await asyncio.gather(*tasks)
    
    async def _run_check(self, name: str, func, timeout: float) -> HealthCheckResult:
        start = time.monotonic()
        try:
            result = await asyncio.wait_for(func(), timeout=timeout)
            elapsed = (time.monotonic() - start) * 1000
            result.response_time_ms = round(elapsed, 2)
            return result
        except asyncio.TimeoutError:
            return HealthCheckResult(
                name=name,
                status=HealthStatus.FAIL,
                response_time_ms=timeout * 1000,
                message=f"Health check timed out after {timeout}s"
            )
        except Exception as e:
            elapsed = (time.monotonic() - start) * 1000
            return HealthCheckResult(
                name=name,
                status=HealthStatus.FAIL,
                response_time_ms=round(elapsed, 2),
                message=str(e)
            )

# Usage
health_checker = HealthChecker()

# Register checks
health_checker.register("database", check_database)
health_checker.register("object_storage", check_object_storage)
health_checker.register("ai_model", check_ai_model)
health_checker.register("streams", check_all_streams)
health_checker.register("cache", check_cache)

# FastAPI endpoints
from datetime import datetime
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def liveness():
    return {"status": "alive", "timestamp": datetime.utcnow().isoformat()}

@app.get("/health/ready")
async def readiness():
    results = await health_checker.run_all(timeout=5.0)
    all_pass = all(r.status == HealthStatus.PASS for r in results)
    
    status_code = 200 if all_pass else 503
    status = "ready" if all_pass else "not_ready"
    
    return JSONResponse(
        status_code=status_code,
        content={
            "status": status,
            "timestamp": datetime.utcnow().isoformat(),
            "checks": {
                r.name: {
                    "status": r.status.value,
                    "response_time_ms": r.response_time_ms,
                    "message": r.message,
                    **r.details
                }
                for r in results
            }
        }
    )

@app.get("/health/deep")
async def deep_health():
    # Runs the full pipeline check with a longer budget (includes pipeline_e2e)
    results = await health_checker.run_all(timeout=30.0)
    # ... similar to readiness but with pipeline_e2e

4. Service Restart & Recovery

4.1 Service Startup Sequence

Services must start in strict dependency order; Docker Compose `depends_on` conditions or Kubernetes init containers enforce the ordering.

Phase 1: Infrastructure
  ├─ PostgreSQL (primary + replica)
  ├─ Redis Cache
  └─ MinIO / S3 Object Storage

Phase 2: Core Services
  ├─ Message Queue (RabbitMQ / NATS)
  ├─ Configuration Service
  └─ Identity/Auth Service

Phase 3: AI Pipeline
  ├─ Model Service (download & load models)
  ├─ Video Capture Service (connect to cameras)
  ├─ AI Inference Service
  └─ Post-Processing Service

Phase 4: Application Layer
  ├─ API Gateway
  ├─ Surveillance API Service
  ├─ Alert Service
  └─ WebSocket / Real-time Service

Phase 5: Frontend
  ├─ Nginx / Reverse Proxy
  └─ Web Dashboard

Docker Compose startup configuration:

# docker-compose.yml (relevant section)
services:
  postgres:
    image: postgres:15.4@sha256:abc123...
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U surveillance"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7.2@sha256:def456...
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    depends_on:
      postgres:
        condition: service_healthy

  model-service:
    image: surveillance/model-service:2.3.1@sha256:ghi789...
    restart: unless-stopped
    environment:
      - MODEL_PATH=/models
      - DOWNLOAD_IF_MISSING=true
    volumes:
      - model-cache:/models
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 30s
      retries: 10
      start_period: 60s

  video-capture:
    image: surveillance/capture:2.3.1@sha256:jkl012...
    restart: unless-stopped
    depends_on:
      model-service:
        condition: service_healthy
    environment:
      - STREAM_RETRY_MAX=10
      - STREAM_RETRY_DELAY=5
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  ai-inference:
    image: surveillance/inference:2.3.1@sha256:mno345...
    restart: unless-stopped
    depends_on:
      video-capture:
        condition: service_healthy
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 120s

  surveillance-api:
    image: surveillance/api:2.3.1@sha256:pqr678...
    restart: unless-stopped
    depends_on:
      ai-inference:
        condition: service_healthy
    environment:
      - DATABASE_URL=postgresql://...@postgres/surveillance
      - REDIS_URL=redis://redis:6379
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s

  nginx:
    image: nginx:alpine@sha256:stu901...
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      surveillance-api:
        condition: service_healthy

4.2 Graceful Shutdown Procedure

All services must handle SIGTERM for graceful shutdown:

# shutdown_handler.py
import asyncio
import signal
import logging

logger = logging.getLogger(__name__)

class GracefulShutdown:
    def __init__(self, shutdown_timeout: float = 30.0):
        self.shutdown_timeout = shutdown_timeout
        self._shutdown_event = asyncio.Event()
        self._tasks = []
    
    def register_task(self, task):
        self._tasks.append(task)
    
    async def wait_for_shutdown(self):
        await self._shutdown_event.wait()
    
    def trigger_shutdown(self):
        logger.info("Shutdown signal received, initiating graceful shutdown...")
        self._shutdown_event.set()
    
    async def shutdown(self):
        """Execute graceful shutdown sequence."""
        logger.info("Starting graceful shutdown sequence...")
        
        # 1. Stop accepting new requests/connections
        logger.info("1. Stopping request acceptance")
        await self._stop_accepting_requests()
        
        # 2. Wait for in-flight requests to complete
        logger.info("2. Waiting for in-flight requests (timeout: %.0fs)", 
                     self.shutdown_timeout)
        try:
            await asyncio.wait_for(
                self._wait_inflight_requests(),
                timeout=self.shutdown_timeout * 0.6
            )
        except asyncio.TimeoutError:
            logger.warning("In-flight requests did not complete in time")
        
        # 3. Flush buffers and complete pending writes
        logger.info("3. Flushing buffers")
        await self._flush_buffers()
        
        # 4. Close camera streams gracefully
        logger.info("4. Closing camera streams")
        await self._close_streams()
        
        # 5. Release resources
        logger.info("5. Releasing resources")
        await self._release_resources()
        
        # 6. Close database connections
        logger.info("6. Closing database connections")
        await self._close_database_connections()
        
        logger.info("Graceful shutdown complete")
    
    async def _stop_accepting_requests(self):
        # Mark service as not ready
        pass
    
    async def _wait_inflight_requests(self):
        # Wait for active request count to reach zero
        pass
    
    async def _flush_buffers(self):
        # Flush any pending log buffers, metric batches
        pass
    
    async def _close_streams(self):
        # Send RTSP TEARDOWN, release capture resources
        pass
    
    async def _release_resources(self):
        # Release GPU memory, file handles
        pass
    
    async def _close_database_connections(self):
        # Return connections to pool, close pool
        pass

def setup_signal_handlers(shutdown_manager: GracefulShutdown):
    # Call this from within the running event loop (e.g. at service startup)
    loop = asyncio.get_running_loop()
    
    def handle_signal(sig):
        logger.info("Received signal %s", sig.name)
        shutdown_manager.trigger_shutdown()
        asyncio.create_task(shutdown_manager.shutdown())
    
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, lambda s=sig: handle_signal(s))
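A variant of this wiring keeps the signal handler even thinner: it only sets an event, and the shutdown sequence runs in the main coroutine rather than in a fire-and-forget task. A self-contained sketch (the six-step sequence would run where the comment marks it):

```python
import asyncio
import signal

async def main() -> str:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Handlers only flip the event; no async work happens in the handler itself
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    await stop.wait()
    # ... run the graceful shutdown sequence here (steps 1-6 above) ...
    return "shutdown complete"
```

Because the shutdown runs inside `main()`, the process cannot exit before the sequence finishes, which avoids the race inherent in `asyncio.create_task` from a handler.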

Kubernetes graceful termination:

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: surveillance-api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5 && curl -X POST localhost:8080/shutdown"]

4.3 Crash Recovery & Automatic Restart

| Scenario | Detection | Automatic Action | Manual Intervention |
|---|---|---|---|
| Container exits non-zero | Docker/K8s | Restart with exponential backoff (max 5 min) | If > 5 restarts in 10 min |
| OOM killed | Kernel event | Restart with 25% memory increase (max 3x) | Review memory limits |
| Health check fails | Probe failure | Restart container | If restart loop persists |
| Node failure | Node not ready | Reschedule to healthy node | Investigate failed node |
| Camera stream disconnect | No frames received | Retry with exponential backoff | If > 30 min offline |
| AI model load failure | Inference timeout | Reload model from backup | If model corrupted |
| Database connection lost | Query timeout | Retry connection, use replica | If primary down > 5 min |

Exponential backoff for stream reconnection:

import asyncio
import logging
import random

logger = logging.getLogger(__name__)

async def reconnect_stream(camera_id: str, max_retries: int = 100):
    base_delay = 5  # seconds
    max_delay = 300  # 5 minutes
    
    for attempt in range(1, max_retries + 1):
        delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
        jitter = random.uniform(0, delay * 0.1)
        wait_time = delay + jitter
        
        logger.info("Camera %s: Reconnect attempt %d/%d in %.1fs",
                    camera_id, attempt, max_retries, wait_time)
        await asyncio.sleep(wait_time)
        
        try:
            stream = await connect_stream(camera_id)
            logger.info("Camera %s: Reconnected successfully", camera_id)
            return stream
        except Exception as e:
            logger.warning("Camera %s: Reconnect failed: %s", camera_id, e)
    
    logger.error("Camera %s: Max retries exceeded, stream marked offline", camera_id)
    return None
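Ignoring jitter, the delay schedule produced by the defaults above doubles each attempt and caps out at the 5-minute ceiling from the seventh attempt onward:

```python
def backoff_schedule(attempts, base=5, cap=300):
    """Delay (seconds) before each reconnect attempt, before jitter is added."""
    return [min(base * 2 ** (n - 1), cap) for n in range(1, attempts + 1)]

print(backoff_schedule(8))  # [5, 10, 20, 40, 80, 160, 300, 300]
```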

4.4 Circuit Breaker Pattern

Protect against cascading failures when dependencies are down:

# circuit_breaker.py
from enum import Enum
import asyncio
import time
from dataclasses import dataclass

class CircuitState(Enum):
    CLOSED = "closed"       # Normal operation
    OPEN = "open"          # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 3
    success_threshold: int = 2

class CircuitBreaker:
    def __init__(self, name: str, config: CircuitBreakerConfig = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0
        self.half_open_calls = 0
        self._lock = asyncio.Lock()
    
    async def call(self, func, *args, **kwargs):
        async with self._lock:
            await self._transition_state()
            
            if self.state == CircuitState.OPEN:
                raise CircuitBreakerOpen(
                    f"Circuit breaker '{self.name}' is OPEN"
                )
            
            if self.state == CircuitState.HALF_OPEN:
                if self.half_open_calls >= self.config.half_open_max_calls:
                    raise CircuitBreakerOpen(
                        f"Circuit '{self.name}' half-open limit reached"
                    )
                self.half_open_calls += 1
        
        # Execute outside lock
        try:
            result = await func(*args, **kwargs)
            await self._on_success()
            return result
        except Exception:
            await self._on_failure()
            raise
    
    async def _transition_state(self):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.config.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                self.success_count = 0
    
    async def _on_success(self):
        async with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.config.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0
    
    async def _on_failure(self):
        async with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.config.failure_threshold:
                self.state = CircuitState.OPEN

class CircuitBreakerOpen(Exception):
    pass

Usage:

# Create breakers for each dependency
db_breaker = CircuitBreaker("database", CircuitBreakerConfig(
    failure_threshold=3,
    recovery_timeout=30.0
))

storage_breaker = CircuitBreaker("object_storage", CircuitBreakerConfig(
    failure_threshold=5,
    recovery_timeout=60.0
))

# Use in service calls
async def save_detection(detection):
    return await db_breaker.call(
        db_repository.save_detection, detection
    )

async def store_frame(frame):
    return await storage_breaker.call(
        s3_client.upload, frame
    )

4.5 Bulkhead Pattern — Resource Isolation

Isolate resources to prevent one failing component from consuming all resources:

# bulkhead.py
import asyncio
from asyncio import Semaphore

class Bulkhead:
    """Limits concurrent operations per service/camera."""
    
    def __init__(self, name: str, max_concurrent: int, max_queue: int = 100):
        self.name = name
        self.semaphore = Semaphore(max_concurrent)
        self.max_queue = max_queue
        self.queue_size = 0
        self._lock = asyncio.Lock()
    
    async def execute(self, func, *args, **kwargs):
        async with self._lock:
            if self.queue_size >= self.max_queue:
                raise BulkheadFull(
                    f"Bulkhead '{self.name}' queue full ({self.max_queue})"
                )
            self.queue_size += 1
        
        try:
            async with self.semaphore:
                return await func(*args, **kwargs)
        finally:
            async with self._lock:
                self.queue_size -= 1

class BulkheadFull(Exception):
    pass

# Per-camera bulkheads to isolate failures
camera_bulkheads = {
    f"cam_{i:02d}": Bulkhead(f"cam_{i:02d}", max_concurrent=4)
    for i in range(1, 9)
}

# Per-service bulkheads
db_bulkhead = Bulkhead("database", max_concurrent=20)
storage_bulkhead = Bulkhead("storage", max_concurrent=10)
inference_bulkhead = Bulkhead("inference", max_concurrent=8)
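The semaphore is what enforces the isolation. A standalone sketch confirming that peak concurrency never exceeds the configured limit (4, matching a per-camera bulkhead above):

```python
import asyncio

async def tracked_op(sem: asyncio.Semaphore, state: dict):
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated inference/storage call
        state["active"] -= 1

async def main() -> int:
    sem = asyncio.Semaphore(4)
    state = {"active": 0, "peak": 0}
    # 20 operations contend for 4 slots; the rest queue on the semaphore
    await asyncio.gather(*(tracked_op(sem, state) for _ in range(20)))
    return state["peak"]

print(asyncio.run(main()))  # 4
```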

4.6 Recovery State Persistence

Critical state is persisted to survive restarts:

| State Type | Storage | Recovery Action |
|---|---|---|
| Camera configurations | PostgreSQL | Reload on startup |
| Alert rules | PostgreSQL | Reload on startup |
| Processing offsets | Redis | Resume from last offset |
| In-flight detections | Redis → PostgreSQL | Replay from queue |
| Model version | Object Storage | Load specified version |
| Stream connection state | Local file | Attempt reconnection |
| Audit log buffer | Local file → Async flush | Recover unflushed entries |
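The processing-offset row is the simplest of these to sketch. With an in-memory dict standing in for the Redis hash (production code would use `HSET`/`HGET` against the real store), resume-on-restart reduces to:

```python
store: dict = {}  # stand-in for a Redis hash: key -> last committed offset

def commit_offset(camera_id: str, offset: int) -> None:
    """Persist the last fully processed frame offset for a camera."""
    store[f"offset:{camera_id}"] = offset

def resume_offset(camera_id: str) -> int:
    """On startup, resume from the last committed offset (0 on first run)."""
    return store.get(f"offset:{camera_id}", 0)

commit_offset("cam_01", 1500)
# after a restart, processing picks up where it left off:
print(resume_offset("cam_01"), resume_offset("cam_02"))  # 1500 0
```

The important property is that the offset is committed only after the frame's detections are durably stored, so replay after a crash re-processes at most the uncommitted tail.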

5. Backup Strategy

5.1 Backup Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        BACKUP PIPELINE                          │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │   PostgreSQL │───▶│   pgBackRest │───▶│  S3 (Primary)    │  │
│  │   (Primary)  │    │  (Full/Incr) │    │  us-east-1       │  │
│  └──────────────┘    └──────────────┘    └────────┬─────────┘  │
│                                                     │           │
│                              ┌──────────────────────┘           │
│                              │                                  │
│                              ▼                                  │
│                    ┌──────────────────┐                         │
│                    │  S3 (Secondary)  │   Cross-region          │
│                    │  us-west-2       │   replication           │
│                    └──────────────────┘                         │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │   Object     │───▶│   S3 Cross   │───▶│  Glacier Deep    │  │
│  │   Storage    │    │   Region     │    │  Archive         │  │
│  │   Bucket     │    │   Replication│    │  (7-year)        │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │ Infrastructure│───▶│    Git       │───▶│  Encrypted Git   │  │
│  │   Config     │    │   Repository │    │  Backups         │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

5.2 PostgreSQL Backup (pgBackRest)

Tool: pgBackRest 2.48+ with S3 integration

Backup Schedule:

| Backup Type | Frequency | Start Time (UTC) | Retention |
|---|---|---|---|
| Full backup | Weekly | Sunday 02:00 | 12 weeks |
| Differential | Daily (Mon-Sat) | 02:00 | 30 days |
| WAL archiving | Continuous | Real-time | 30 days |
| Manual backup | On-demand | Any | 90 days |

pgBackRest configuration:

# /etc/pgbackrest/pgbackrest.conf
[surveillance]
pg1-path=/var/lib/postgresql/15/main
pg1-port=5432

[global]
repo1-type=s3
repo1-s3-region=us-east-1
repo1-s3-bucket=surveillance-db-backups
repo1-s3-key=<ACCESS_KEY>
repo1-s3-key-secret=<SECRET_KEY>
repo1-s3-endpoint=s3.amazonaws.com
repo1-path=/pgbackrest
repo1-retention-full=12
repo1-retention-diff=30
repo1-retention-archive=30

# Encryption
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<STRONG_PASSPHRASE>

# Performance
process-max=4
compress-type=zst
compress-level=6

# Logging
log-level-file=detail
log-path=/var/log/pgbackrest

# Notifications
exec-start=/usr/local/bin/pgbackrest-notify.sh

Backup cron schedule:

# /etc/cron.d/pgbackrest
# Full backup every Sunday at 2 AM UTC
0 2 * * 0 postgres /usr/bin/pgbackrest --stanza=surveillance backup --type=full

# Differential backup daily at 2 AM UTC (Mon-Sat)
0 2 * * 1-6 postgres /usr/bin/pgbackrest --stanza=surveillance backup --type=diff

# Verify latest backup at 6 AM UTC daily
0 6 * * * postgres /usr/bin/pgbackrest --stanza=surveillance verify

WAL archiving configuration (postgresql.conf):

wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=surveillance archive-push %p'
max_wal_senders = 3
wal_keep_size = 1GB

5.3 Backup Retention Schedule

Timeline:

Day 1-30:    Daily backups available (full + diffs)
Week 1-12:   Weekly full backups
Month 1-12:  Monthly full backups (last Sunday of each month)
Year 1-7:    Annual snapshot in Glacier Deep Archive

| Tier | Frequency | Copies Kept | Storage Class | Location |
|---|---|---|---|---|
| Daily (hot) | Every 24h | 30 | S3 Standard | Primary region |
| Weekly (warm) | Every Sunday | 12 | S3 Standard-IA | Primary region |
| Monthly (cold) | Last Sunday | 12 | S3 Glacier Flexible | Primary region |
| Annual (archive) | Year-end | 7 | S3 Glacier Deep Archive | Cross-region |
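Summing the tiers gives the number of restorable copies held at steady state, a quick sanity check when sizing the backup bucket:

```python
# Copies kept per tier, per the retention schedule above
tiers = {"daily": 30, "weekly": 12, "monthly": 12, "annual": 7}

total_copies = sum(tiers.values())
print(total_copies)  # 61
```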

5.4 Object Storage Backup

Cross-region replication:

// S3 bucket replication configuration
{
  "Role": "arn:aws:iam::ACCOUNT:role/S3ReplicationRole",
  "Rules": [
    {
      "ID": "surveillance-media-replication",
      "Status": "Enabled",
      "Priority": 1,
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Filter": {
        "And": {
          "Prefix": "media/",
          "Tag": {
            "Key": "replicate",
            "Value": "true"
          }
        }
      },
      "Destination": {
        "Bucket": "arn:aws:s3:::surveillance-media-backup-west",
        "StorageClass": "STANDARD_IA",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": { "Minutes": 15 }
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": { "Minutes": 15 }
        },
        "EncryptionConfiguration": {
          "ReplicaKmsKeyID": "arn:aws:kms:us-west-2:ACCOUNT:key/KEY-ID"
        }
      },
      "SourceSelectionCriteria": {
        "SseKmsEncryptedObjects": { "Status": "Enabled" }
      }
    }
  ]
}

Lifecycle policy for media storage:

{
  "Rules": [
    {
      "ID": "media-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "media/recordings/" },
      "Transitions": [
        {
          "Days": 7,
          "StorageClass": "INTELLIGENT_TIERING"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": { "Days": 2555 }
    },
    {
      "ID": "event-data-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "events/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
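The media-lifecycle rule above maps an object's age to a storage class. A small helper mirroring those thresholds makes the transitions easy to reason about and test (the class names are the S3 values from the policy; `EXPIRED` here is just a marker for the 7-year expiration):

```python
def media_storage_class(age_days: int) -> str:
    """Storage class for media/recordings/ objects under the lifecycle rule above."""
    if age_days >= 2555:   # 7 years: object expires
        return "EXPIRED"
    if age_days >= 365:
        return "DEEP_ARCHIVE"
    if age_days >= 90:
        return "GLACIER_IR"
    if age_days >= 7:
        return "INTELLIGENT_TIERING"
    return "STANDARD"

print(media_storage_class(3), media_storage_class(100))  # STANDARD GLACIER_IR
```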

5.5 Configuration Backup

All infrastructure configuration is stored as code in Git:

surveillance-ops/
├── terraform/
│   ├── main.tf                 # Main infrastructure
│   ├── variables.tf            # Environment variables
│   ├── outputs.tf              # Output definitions
│   ├── modules/
│   │   ├── vpc/                # Network configuration
│   │   ├── eks/                # Kubernetes cluster
│   │   ├── rds/                # PostgreSQL instances
│   │   └── s3/                 # Object storage
│   └── environments/
│       ├── production/         # Production config
│       └── dr/                 # DR site config
├── kubernetes/
│   ├── base/                   # Kustomize base resources
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── postgres/
│   │   ├── redis/
│   │   ├── api/
│   │   ├── inference/
│   │   └── capture/
│   └── overlays/
│       ├── production/
│       ├── staging/
│       └── dr/
├── docker-compose/
│   ├── docker-compose.yml      # Edge deployment
│   └── .env.example
├── ansible/
│   ├── playbook.yml            # Host provisioning
│   └── inventory/
├── monitoring/
│   ├── prometheus/
│   ├── grafana-dashboards/
│   └── alertmanager/
└── docs/
    ├── runbooks/
    ├── postmortems/
    └── architecture/

Git backup to secondary provider:

#!/bin/bash
# /usr/local/bin/backup-git-repos.sh
# Mirrors all critical repos to secondary Git provider

set -euo pipefail

REPOS=(
  "git@github.com:company/surveillance-ops.git"
  "git@github.com:company/surveillance-app.git"
  "git@github.com:company/surveillance-models.git"
)

BACKUP_REMOTE="git@gitlab-backup.company.com:surveillance"
DATE=$(date +%Y%m%d)

for repo in "${REPOS[@]}"; do
  name=$(basename "$repo" .git)
  echo "Backing up $name..."
  
  git clone --mirror "$repo" "/tmp/$name-mirror"
  
  # Push to backup remote (git -C avoids cd-ing into a directory we later delete)
  git -C "/tmp/$name-mirror" remote add backup "$BACKUP_REMOTE/$name.git" 2>/dev/null || true
  git -C "/tmp/$name-mirror" push backup --mirror
  
  # Create dated archive
  tar czf "/backup/git/$name-$DATE.tar.gz" -C "/tmp" "$name-mirror"
  
  rm -rf "/tmp/$name-mirror"
done

# Upload to S3
aws s3 sync /backup/git/ "s3://surveillance-config-backups/git/" --storage-class STANDARD_IA

5.6 Encryption

| Data at Rest | Encryption Method | Key Management |
|---|---|---|
| PostgreSQL backups | AES-256-CBC (pgBackRest native) | AWS KMS CMK |
| S3 object storage | SSE-KMS | AWS KMS CMK with automatic rotation |
| Configuration backups | AES-256-GCM (age tool) | YubiKey HSM stored keys |
| Log archives | SSE-S3 (AES-256) | AWS managed |

KMS key policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Allow pgBackRest",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT:role/BackupServiceRole"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey*"
      ],
      "Resource": "*"
    }
  ]
}

5.7 Backup Verification

Automated integrity checks (daily at 06:00 UTC):

#!/bin/bash
# /usr/local/bin/verify-backup.sh

set -euo pipefail

STANZA="surveillance"
LOG_FILE="/var/log/backup/verify-$(date +%Y%m%d).log"
ALERT_WEBHOOK="https://hooks.slack.com/services/..."

log() {
    echo "[$(date -Iseconds)] $1" | tee -a "$LOG_FILE"
}

# 1. Verify latest backup exists
LATEST=$(pgbackrest --stanza=$STANZA info --output=json | jq -r '.[0].backup[-1].label // empty')
if [ -z "$LATEST" ]; then
    log "ERROR: No backup found!"
    curl -X POST -H 'Content-type: application/json' \
        --data '{"text":"CRITICAL: No database backup found!"}' \
        "$ALERT_WEBHOOK"
    exit 1
fi

log "Latest backup: $LATEST"

# 2. Verify backup integrity
if ! pgbackrest --stanza=$STANZA verify --set=$LATEST >> "$LOG_FILE" 2>&1; then
    log "ERROR: Backup integrity check failed for $LATEST"
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"CRITICAL: Backup integrity check failed for $LATEST\"}" \
        "$ALERT_WEBHOOK"
    exit 1
fi

# 3. Check WAL archive continuity
MISSING=$(pgbackrest --stanza=$STANZA verify 2>&1 | grep -c "missing" || true)
if [ "$MISSING" -gt 0 ]; then
    log "WARNING: $MISSING WAL files missing"
fi

# 4. Verify S3 accessibility
if ! aws s3 ls "s3://surveillance-db-backups/pgbackrest/" > /dev/null 2>&1; then
    log "ERROR: Cannot access S3 backup bucket"
    exit 1
fi

# 5. Check backup age (timestamp.stop is epoch seconds in the pgBackRest JSON)
BACKUP_AGE=$(pgbackrest --stanza=$STANZA info --output=json | \
    jq -r '.[0].backup[-1].timestamp.stop')
BACKUP_AGE_SEC=$(( $(date +%s) - BACKUP_AGE ))

if [ "$BACKUP_AGE_SEC" -gt 90000 ]; then  # > 25 hours
    log "WARNING: Latest backup is older than 25 hours"
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"WARNING: Latest backup is $((BACKUP_AGE_SEC / 3600)) hours old\"}" \
        "$ALERT_WEBHOOK"
fi

log "Backup verification completed successfully"

5.8 Restore Procedures

5.8.1 Point-in-Time Recovery (PITR)

#!/bin/bash
# restore-pitr.sh — Restore to specific point in time

STANZA="surveillance"
TARGET_TIME="$1"  # e.g., "2025-01-15 08:30:00"

# Stop application
kubectl scale deployment surveillance-api --replicas=0

# Stop PostgreSQL before restoring over its data directory
systemctl stop postgresql

# Restore from backup
pgbackrest --stanza=$STANZA restore \
    --type=time \
    --target="$TARGET_TIME" \
    --target-action=promote \
    --delta

# Start PostgreSQL (WAL replay up to the target time runs on startup)
systemctl start postgresql

# Verify database
psql -U surveillance -d surveillance -c "SELECT pg_last_xact_replay_timestamp();"

# Restart application
kubectl scale deployment surveillance-api --replicas=3

# Verify application health
curl -f http://surveillance-api:8080/health/ready

5.8.2 Full Disaster Recovery

#!/bin/bash
# restore-full.sh — Complete database restoration to new instance

STANZA="surveillance"
NEW_DATA_DIR="/var/lib/postgresql/15/main"

# 1. Install PostgreSQL (same version as backup)
apt-get install -y postgresql-15

# 2. Stop PostgreSQL
systemctl stop postgresql

# 3. Clear data directory (glob must be outside the quotes to expand)
rm -rf "$NEW_DATA_DIR"/*

# 4. Restore full backup (the latest backup set is restored by default)
pgbackrest --stanza=$STANZA restore \
    --type=immediate

# 5. Start PostgreSQL
systemctl start postgresql

# 6. Verify
pgbackrest --stanza=$STANZA check

# 7. Run consistency check
psql -U surveillance -d surveillance -c "SELECT count(*) FROM events;"
psql -U surveillance -d surveillance -c "SELECT pg_database_size('surveillance');"

5.9 Monthly Restore Drill

Schedule: First Saturday of each month at 02:00 UTC

Procedure:

  1. Provision isolated restore environment (separate namespace/VM)
  2. Restore latest full backup
  3. Apply differential backups
  4. Verify data integrity (row counts, checksums)
  5. Run application smoke tests
  6. Verify media files accessible
  7. Document results in restore log
  8. Tear down restore environment
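Step 4's row-count verification is a straightforward expected-vs-actual comparison. A sketch (in practice the expected counts would be captured from production just before the drill):

```python
def mismatched_tables(expected: dict, actual: dict) -> list:
    """Return the tables whose restored row counts differ from production."""
    return sorted(t for t in expected if actual.get(t) != expected[t])

expected = {"events": 12_456_789, "cameras": 8, "alerts": 1_234}
print(mismatched_tables(expected, dict(expected)))              # []
print(mismatched_tables(expected, {**expected, "cameras": 7}))  # ['cameras']
```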

Restore drill checklist:

## Restore Drill — 2025-01-04
- [x] Isolated environment provisioned
- [x] Full backup restored (duration: 23 min)
- [x] Differential backup applied (duration: 4 min)
- [x] WAL replay completed (duration: 12 min)
- [x] Database row counts verified
  - events: 12,456,789 (expected: 12,456,789) ✓
  - cameras: 8 (expected: 8) ✓
  - alerts: 1,234 (expected: 1,234) ✓
- [x] Application smoke tests passed
- [x] Media file accessibility verified (100/100 random samples)
- [x] Total RTO: 41 minutes (target: < 60 min) ✓
- [x] Total RPO: 8 minutes (target: < 15 min) ✓
- [x] Environment cleaned up

**Notes:** WAL replay was slower than usual due to high write volume on Jan 3.

6. Data Retention

6.1 Retention Policy Matrix

| Data Category | Retention Period | Action After Retention | Legal Basis |
|---|---|---|---|
| Raw video recordings | 90 days (configurable) | Delete or archive to cold storage | Operational necessity |
| Event clips (alerts) | 1 year | Archive to cold storage for 2 additional years | Incident investigation |
| Detection metadata | 1 year | Anonymize & aggregate | Analytics |
| Audit logs | 1 year | Archive for 6 additional years | Compliance |
| System health logs | 90 days | Delete | Operational monitoring |
| Access logs | 90 days | Delete | Security monitoring |
| Face embeddings (enrolled) | Indefinite until deleted | User-initiated deletion | Authorized personnel database |
| Face embeddings (detected) | Never stored | N/A (computed and discarded immediately) | Privacy by design |
| Alert history | 2 years | Archive | Incident reference |
| Training data | Indefinite | Explicit deletion by admin | AI model improvement |
| Configuration history | 2 years | Archive | Change tracking |
| Backup archives | 7 years (Glacier) | Delete per backup schedule | Disaster recovery |

6.2 Automated Cleanup Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Data Lifecycle Manager                     │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   Retention  │  │   Cleanup    │  │   Archive        │   │
│  │   Policy     │──│   Executor   │──│   Manager        │   │
│  │   Engine     │  │   (CronJob)  │  │   (S3/Glacier)   │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
│         │                 │                   │              │
│         ▼                 ▼                   ▼              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  PostgreSQL  │  │  S3 Object   │  │   Elasticsearch  │   │
│  │  (metadata)  │  │  Storage     │  │   (logs)         │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
└──────────────────────────────────────────────────────────────┘

6.3 Cleanup Job Implementation

# retention_manager.py
from datetime import datetime, timedelta
from typing import List, Optional
import asyncio
import logging

logger = logging.getLogger(__name__)

class RetentionPolicy:
    def __init__(self, name: str, retention_days: int, archive_first: bool = False,
                 archive_days: int = 0, anonymize: bool = False):
        self.name = name
        self.retention_days = retention_days
        self.archive_first = archive_first
        self.archive_days = archive_days
        self.anonymize = anonymize

class DataRetentionManager:
    def __init__(self):
        self.policies = {}
    
    def register_policy(self, policy: RetentionPolicy):
        self.policies[policy.name] = policy
    
    async def execute_cleanup(self, policy_name: str, dry_run: bool = False):
        policy = self.policies.get(policy_name)
        if not policy:
            raise ValueError(f"Unknown policy: {policy_name}")
        
        cutoff_date = datetime.utcnow() - timedelta(days=policy.retention_days)
        logger.info("Executing cleanup for '%s' (cutoff: %s)", 
                     policy_name, cutoff_date.isoformat())
        
        if dry_run:
            count = await self._count_eligible(policy_name, cutoff_date)
            logger.info("[DRY RUN] Would delete %d records", count)
            return count
        
        archived = anonymized = deleted = 0
        
        # Archive before delete if configured
        if policy.archive_first:
            archive_cutoff = datetime.utcnow() - timedelta(
                days=policy.retention_days + policy.archive_days
            )
            archived = await self._archive_records(policy_name, cutoff_date, archive_cutoff)
            logger.info("Archived %d records", archived)
        
        # Anonymize instead of deleting if configured
        if policy.anonymize:
            anonymized = await self._anonymize_records(policy_name, cutoff_date)
            logger.info("Anonymized %d records", anonymized)
        else:
            # Delete expired records
            deleted = await self._delete_records(policy_name, cutoff_date)
            logger.info("Deleted %d records", deleted)
        
        return {"archived": archived, "anonymized": anonymized, "deleted": deleted}

# Register policies
retention = DataRetentionManager()
retention.register_policy(RetentionPolicy("raw_video", retention_days=90, archive_first=True, archive_days=180))
retention.register_policy(RetentionPolicy("event_clips", retention_days=365, archive_first=True, archive_days=730))
retention.register_policy(RetentionPolicy("detection_metadata", retention_days=365, anonymize=True))
retention.register_policy(RetentionPolicy("audit_logs", retention_days=365, archive_first=True, archive_days=2190))
retention.register_policy(RetentionPolicy("system_logs", retention_days=90))
retention.register_policy(RetentionPolicy("access_logs", retention_days=90))

Kubernetes CronJob:

# cleanup-job.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-retention-cleanup
  namespace: surveillance
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM UTC
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: surveillance/retention-manager:2.3.1
              command:
                - python
                - -m
                - retention_manager
                - --execute-all
                - --notify
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
                - name: S3_BUCKET
                  value: surveillance-media
                - name: DRY_RUN
                  value: "false"
              resources:
                requests:
                  cpu: 100m
                  memory: 256Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
          restartPolicy: OnFailure

6.4 Archive to Cold Storage

Before deletion, data is moved to cost-effective cold storage:

| Stage | Storage Class | Cost Factor | Access Time |
|-------|---------------|-------------|-------------|
| Active | S3 Standard | 1x | Immediate |
| 7 days | S3 Intelligent-Tiering | 0.8x | Immediate |
| 90 days | S3 Glacier Instant Retrieval | 0.2x | Milliseconds |
| 1 year | S3 Glacier Flexible Retrieval | 0.08x | Minutes-hours |
| 2 years | S3 Glacier Deep Archive | 0.04x | 12-48 hours |
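The lifecycle table above maps directly to an age-based lookup. This illustrative sketch encodes the day thresholds and storage-class names from the table (the class identifiers match the AWS storage-class constants, but the thresholds are this document's policy, not AWS defaults):

```python
def storage_class_for_age(age_days: int) -> str:
    """Return the target S3 storage class for an object of the given age."""
    if age_days >= 730:
        return "DEEP_ARCHIVE"          # 2 years+: 12-48 hour retrieval
    if age_days >= 365:
        return "GLACIER"               # 1 year+: Flexible Retrieval
    if age_days >= 90:
        return "GLACIER_IR"            # 90 days+: Instant Retrieval
    if age_days >= 7:
        return "INTELLIGENT_TIERING"   # 7 days+: automatic tiering
    return "STANDARD"                  # active data
```

A lifecycle rule on the bucket achieves the same transitions declaratively; this function is useful when tagging objects at write time or auditing tier placement.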

Archive process:

#!/bin/bash
# archive-old-media.sh

BUCKET="surveillance-media"
RETENTION_DAYS=90
CUTOFF=$(date -d "$RETENTION_DAYS days ago" +%Y-%m-%d)

# 1. Identify files to archive
aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "recordings/" \
    --query "Contents[?LastModified<='$CUTOFF'].Key" \
    --output text | tr '\t' '\n' > /tmp/archive-list.txt

# 2. Move to Glacier
while IFS= read -r key; do
    aws s3api copy-object \
        --copy-source "${BUCKET}/${key}" \
        --bucket "$BUCKET" \
        --key "$key" \
        --storage-class GLACIER_IR \
        --metadata-directive COPY
done < /tmp/archive-list.txt

# 3. Log archival
aws s3 cp /tmp/archive-list.txt \
    "s3://${BUCKET}/archive-logs/archive-$(date +%Y%m%d).txt"

# 4. Notify
echo "Archived $(wc -l < /tmp/archive-list.txt) files to Glacier IR"

6.5 Right to Deletion

For privacy compliance (GDPR/CCPA), implement data subject deletion:

async def delete_subject_data(subject_id: str):
    """
    Complete deletion of a data subject:
    1. Remove from enrolled persons database
    2. Delete associated face embeddings
    3. Remove references from detection logs
    4. Delete related event clips
    5. Log deletion for audit
    """
    async with db.transaction():
        # 1. Delete enrolled person
        await db.execute(
            "DELETE FROM enrolled_persons WHERE id = $1",
            subject_id
        )
        
        # 2. Delete embeddings (separate table for encryption)
        await db.execute(
            "DELETE FROM face_embeddings WHERE person_id = $1",
            subject_id
        )
        
        # 3. Anonymize detection references
        await db.execute(
            """UPDATE detections 
                SET person_id = NULL, 
                    person_name = '[REDACTED]',
                    face_embedding = NULL
                WHERE person_id = $1""",
            subject_id
        )
        
        # 4. Queue related event clips for deletion
        clips = await db.fetch(
            "SELECT storage_path FROM event_clips WHERE person_id = $1",
            subject_id
        )
        for clip in clips:
            await s3.delete_object(clip['storage_path'])
        
        # 5. Audit log
        await db.execute(
            """INSERT INTO deletion_audit_log 
                (subject_id, deleted_at, deleted_by, reason)
                VALUES ($1, NOW(), $2, 'data_subject_request')""",
            subject_id, current_user_id()
        )

7. Storage Management

7.1 Storage Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Storage Architecture                       │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   Hot Tier   │  │   Warm Tier  │  │   Cold Tier      │   │
│  │   (NVMe/SSD) │  │   (HDD/S3)   │  │   (Glacier)      │   │
│  │              │  │              │  │                  │   │
│  │  Current     │  │  30-90 day   │  │  90+ day media   │   │
│  │  recordings  │  │  recordings  │  │  long-term       │   │
│  │  Active DB   │  │  Event clips │  │  archive         │   │
│  │  Cache       │  │  90-day logs │  │  compliance      │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
│                                                              │
│  Edge Node (local)  ←── VPN ──→  Cloud (S3/EBS/EFS)        │
└──────────────────────────────────────────────────────────────┘

7.2 Storage Capacity Planning (8 Camera Baseline)

| Data Type | Daily Volume | Compression | Storage/day | Monthly |
|-----------|--------------|-------------|-------------|---------|
| Raw video (8x 1080p@30fps, H.265) | ~800 GB | 50% | ~400 GB | ~12 TB |
| Event clips (alerts) | ~5 GB | None | ~5 GB | ~150 GB |
| Detection metadata | ~500 MB | None | ~500 MB | ~15 GB |
| Audit logs | ~100 MB | 70% | ~30 MB | ~1 GB |
| System metrics | ~200 MB | 80% | ~40 MB | ~1.2 GB |
| Database | ~50 MB | N/A | ~50 MB | ~1.5 GB |
| Model checkpoints | N/A | N/A | N/A | ~2 GB |
| **Total** | | | ~406 GB/day | ~12.2 TB/month |

Annual raw capacity requirement: ~146 TB
With 90-day retention + archive: ~40 TB hot/warm + ~110 TB cold
Recommended provisioned capacity: 200 TB (~33% headroom over the ~150 TB requirement)
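The capacity figures above can be reproduced with simple arithmetic. This back-of-envelope check uses the post-compression daily volumes from the table (GB values are approximations from section 7.2, not measured data):

```python
# Approximate post-compression daily volumes in GB, per the 7.2 table.
daily_gb = {
    "raw_video": 400,
    "event_clips": 5,
    "detection_metadata": 0.5,
    "audit_logs": 0.03,
    "system_metrics": 0.04,
    "database": 0.05,
}

total_per_day = sum(daily_gb.values())   # ~406 GB/day
monthly_tb = total_per_day * 30 / 1000   # ~12.2 TB/month
annual_tb = total_per_day * 365 / 1000   # ~148 TB/year, before archival tiering
```

Scaling the `daily_gb` entries linearly is a reasonable first-order estimate when adding cameras, since raw video dominates the total.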

7.3 Storage Monitoring & Alerting

Prometheus rules:

groups:
  - name: storage-alerts
    rules:
      - alert: StorageWarning70
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.30
        for: 5m
        labels:
          severity: p4
        annotations:
          summary: "Storage at 70% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: StorageHigh85
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
        for: 2m
        labels:
          severity: p2
        annotations:
          summary: "Storage at 85% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: StorageCritical95
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "Storage CRITICAL at 95% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: S3BucketSizeGrowth
        expr: predict_linear(aws_s3_bucket_size_bytes[7d], 30*24*3600) > 
              aws_s3_bucket_quota_bytes * 0.9
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "S3 bucket {{ $labels.bucket }} projected to exceed quota in 30 days"

      - alert: StorageCleanupFailed
        expr: increase(surveillance_cleanup_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Storage cleanup job failed"

7.4 Automated Cleanup Policies

# cleanup-policies.yaml
cleanup_policies:
  raw_video:
    description: "Raw video recordings"
    retention_days: 90
    archive_before_delete: true
    archive_storage_class: GLACIER_IR
    priority: oldest_first
    schedule: "0 2 * * *"
    
  event_clips:
    description: "Alert event video clips"
    retention_days: 365
    archive_before_delete: true
    archive_storage_class: GLACIER
    priority: oldest_first
    schedule: "0 3 * * *"
    
  temp_processing:
    description: "Temporary processing files"
    retention_days: 1
    archive_before_delete: false
    priority: all_expired
    schedule: "*/30 * * * *"
    
  failed_uploads:
    description: "Failed upload artifacts"
    retention_days: 7
    archive_before_delete: false
    priority: all_expired
    schedule: "0 4 * * *"
    
  system_logs:
    description: "Application and system logs"
    retention_days: 90
    archive_before_delete: true
    archive_storage_class: GLACIER_IR
    priority: oldest_first
    schedule: "0 5 * * *"

7.5 Compression Strategy

| Data Age | Compression | Method | Savings |
|----------|-------------|--------|---------|
| 0-7 days | None | Raw H.265 | Baseline |
| 7-30 days | Re-encode | H.265 → H.265 (lower CRF) | 30-40% |
| 30-90 days | Transcode | H.265 → AV1 | 40-50% |
| 90+ days | Archive | AV1 + tarball | 50-60% |
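For capacity forecasting, the schedule above can be turned into an expected-size estimate. This sketch uses the midpoint of each savings range in the table (the midpoints are an assumption for illustration):

```python
# (minimum age in days, fractional savings) — midpoints of the table's ranges.
SAVINGS_BY_AGE = [
    (90, 0.55),  # archive: 50-60%
    (30, 0.45),  # AV1 transcode: 40-50%
    (7, 0.35),   # H.265 re-encode: 30-40%
    (0, 0.0),    # untouched
]

def compressed_size_gb(raw_gb: float, age_days: int) -> float:
    """Expected on-disk size after the age-based compression schedule."""
    for min_age, savings in SAVINGS_BY_AGE:
        if age_days >= min_age:
            return round(raw_gb * (1 - savings), 1)
    return raw_gb
```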

Compression job:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: video-compression
  namespace: surveillance
spec:
  schedule: "0 1 * * *"
  jobTemplate:
    spec:
      parallelism: 2
      template:
        spec:
          containers:
            - name: compressor
              image: surveillance/media-processor:2.3.1
              command:
                - python
                - -m
                - compression
                - --age-days=7
                - --target-crf=30
                - --codec=libx265
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
                limits:
                  cpu: "4"
                  memory: 8Gi
          restartPolicy: OnFailure

7.6 Auto-Scaling Cloud Storage

S3 Auto-scaling: S3 is inherently elastic — no manual scaling needed. Monitor bucket size and cost.

EBS volume scaling:

# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: surveillance-expandable
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: 3000
  throughput: 125
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID"
allowVolumeExpansion: true  # Enable expansion
volumeBindingMode: WaitForFirstConsumer

Automated volume expansion:

#!/bin/bash
# auto-expand-storage.sh

THRESHOLD=80
PVC_NAMES=("postgres-data" "media-storage" "log-storage")
NAMESPACE="surveillance"

for pvc in "${PVC_NAMES[@]}"; do
    # Get current usage
    USAGE=$(kubectl exec -n "$NAMESPACE" deployment/surveillance-api \
        -- df -h "/data/$pvc" | awk 'NR==2 {print $5}' | tr -d '%')
    
    if [ "$USAGE" -gt "$THRESHOLD" ]; then
        CURRENT_SIZE=$(kubectl get pvc "$pvc" -n "$NAMESPACE" \
            -o jsonpath='{.status.capacity.storage}')
        
        # Increase by 50%
        CURRENT_GB=${CURRENT_SIZE%Gi}
        NEW_GB=$((CURRENT_GB + CURRENT_GB / 2))
        
        echo "Expanding $pvc from ${CURRENT_GB}Gi to ${NEW_GB}Gi"
        
        kubectl patch pvc "$pvc" -n "$NAMESPACE" \
            --type merge \
            -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${NEW_GB}Gi\"}}}}"
        
        # Notify
        curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-type: application/json' \
            -d "{\"text\":\"Auto-expanded PVC $pvc to ${NEW_GB}Gi (was ${USAGE}% full)\"}"
    fi
done
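The 50% growth rule from auto-expand-storage.sh can also be expressed in Python, which is handy if the expansion logic is folded into an operator or controller. This sketch parses only Kubernetes "Gi" quantities; other unit suffixes are out of scope for the example:

```python
import re

def grown_size(capacity: str, factor: float = 1.5) -> str:
    """Grow a Kubernetes 'Gi' quantity by the given factor (default +50%)."""
    m = re.fullmatch(r"(\d+)Gi", capacity)
    if not m:
        raise ValueError(f"unsupported quantity: {capacity}")
    return f"{int(int(m.group(1)) * factor)}Gi"
```

This matches the shell script's integer arithmetic (`CURRENT_GB + CURRENT_GB / 2`), truncating fractional gibibytes.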

7.7 Storage Cost Optimization

| Optimization | Monthly Savings | Implementation |
|--------------|-----------------|----------------|
| S3 Intelligent-Tiering | 20-30% | Automatic |
| H.265 re-encode (older content) | 30-40% | Nightly job |
| Glacier IR for 30-90 day content | 60-70% | Lifecycle rule |
| Glacier Deep Archive for 1yr+ | 95% | Lifecycle rule |
| Reserved capacity for predictable workloads | 30-40% | Commitment |

8. Incident Response

8.1 Severity Definitions

| Severity | Name | Definition | Examples | Response Time |
|----------|------|------------|----------|---------------|
| P1 | Critical | Complete service outage; no surveillance capability | All cameras offline; AI pipeline completely down; storage full; database primary down | 15 minutes |
| P2 | High | Major functionality degraded; partial surveillance loss | Single camera offline > 30 min; high error rates; model accuracy degraded; backup failures | 1 hour |
| P3 | Medium | Minor functionality issue; workarounds available | Low FPS on camera; certificate expiry warning; cleanup job failure | 4 hours |
| P4 | Low | Cosmetic or non-urgent issue | High CPU warning; UI glitch; documentation update needed; optimization opportunity | 24 hours |

8.2 Escalation Matrix

P1 (Critical) — 15 min response
├── 0 min: Alert fires → PagerDuty pages on-call engineer
├── 5 min: On-call must acknowledge
├── 15 min: No acknowledge → Escalate to Team Lead (SMS + Call)
├── 30 min: No response → Escalate to Engineering Manager
├── 45 min: No response → Escalate to VP Engineering
└── 60 min: No response → Escalate to CTO

P2 (High) — 1 hour response
├── 0 min: Alert fires → PagerDuty pages on-call engineer
├── 30 min: No acknowledge → Reminder notification
├── 60 min: No response → Escalate to Team Lead
└── 2 hours: No response → Escalate to Engineering Manager

P3 (Medium) — Slack + email only, 4 hour response
├── 0 min: Alert fires → Slack notification
└── 4 hours: No acknowledgment → Escalate to Team Lead

P4 (Low) — Daily digest email, 24 hour response
└── Daily digest at 09:00 UTC
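The P1 ladder above is simple enough to encode as a lookup, which is how a paging tool would evaluate it. This hypothetical helper returns who is paged next, given how many minutes the alert has gone unacknowledged:

```python
# (minutes without response, next escalation target), per the P1 ladder.
P1_LADDER = [
    (60, "CTO"),
    (45, "VP Engineering"),
    (30, "Engineering Manager"),
    (15, "Team Lead"),
    (0, "On-call Engineer"),
]

def p1_escalation_target(minutes_unacknowledged: int) -> str:
    """Return the current escalation target for an unacknowledged P1."""
    for threshold, target in P1_LADDER:
        if minutes_unacknowledged >= threshold:
            return target
    return "On-call Engineer"
```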

Contact Information:

| Role | Primary Contact | Secondary Contact | Notification Method |
|------|-----------------|-------------------|---------------------|
| On-Call Engineer | Rotating (PagerDuty) | PagerDuty | Push + SMS |
| SRE Team Lead | lead-sre@company.com | +1-555-0100 | SMS + Voice Call |
| Engineering Manager | eng-mgr@company.com | +1-555-0101 | SMS + Voice Call |
| VP Engineering | vp-eng@company.com | +1-555-0102 | Voice Call + Email |
| CTO | cto@company.com | +1-555-0103 | Voice Call + Email |

8.3 Incident Response Process

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  DETECT     │───▶│  RESPOND    │───▶│  RESOLVE    │───▶│  REVIEW     │
│  (Alert)    │    │  (Triage &  │    │  (Fix &     │    │  (Post-     │
│             │    │   Mitigate) │    │   Verify)   │    │   mortem)   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                          │
                    ┌─────┴─────┐
                    ▼           ▼
              ┌────────┐  ┌──────────┐
              │Mitigate│  │Communicate│
              │Impact  │  │Stakeholders│
              └────────┘  └──────────┘

Phase 1: Detect

  1. Monitoring alert fires
  2. On-call engineer receives page
  3. Acknowledge alert within 5 minutes
  4. Create incident channel in Slack: #inc-YYYY-MM-DD-brief-description
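The channel-naming convention in step 4 is easy to automate when incidents are opened programmatically. This illustrative helper follows the `#inc-YYYY-MM-DD-brief-description` pattern; the slug rules (lowercase, hyphen-separated) are an assumption, not a stated policy:

```python
from datetime import date
import re

def incident_channel(day: date, description: str) -> str:
    """Build a Slack channel name like #inc-2025-01-15-storage-full."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"
```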

Phase 2: Respond

  1. Assess severity and impact
  2. Execute relevant runbook
  3. Apply immediate mitigation if possible
  4. Update incident timeline every 15 minutes
  5. Communicate to stakeholders

Phase 3: Resolve

  1. Implement fix
  2. Verify service recovery (all health checks pass)
  3. Monitor for 30 minutes post-recovery
  4. Close incident in PagerDuty
  5. Update incident log

Phase 4: Review

  1. Schedule post-mortem within 48 hours for P1/P2
  2. Complete post-mortem document
  3. Identify action items
  4. Track action items to completion

8.4 Runbooks

Runbook: Camera Offline

Detection: SingleCameraDown alert fires
Severity: P2
Initial Response Time: 1 hour

Diagnosis Steps:

# 1. Check camera stream status
curl http://video-capture:8080/api/v1/cameras/{camera_id}/status

# 2. Check camera connectivity
ping <camera_ip>
curl -v rtsp://<camera_ip>:554/stream

# 3. Check video-capture service logs
kubectl logs -l app=video-capture --tail=100 | grep {camera_id}

# 4. Check network path
traceroute <camera_ip>
# Verify firewall rules, VPN tunnel

# 5. Check camera resource usage
kubectl top pod -l app=video-capture

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Camera powered off | Contact site personnel to power cycle | Ping responds |
| Network connectivity | Check switch port, cable, VLAN | Ping + RTSP describe |
| VPN tunnel down | See "VPN Tunnel Down" runbook | Tunnel status |
| Camera firmware issue | Power cycle camera remotely | Stream reconnects |
| Stream URL changed | Update camera configuration | New stream active |
| Video-capture bug | Restart capture container | Stream reconnected |
| Resource exhaustion | Scale up capture resources | CPU/memory normal |

Workaround: If camera cannot be restored within 30 minutes:

  • Mark camera as "maintenance mode" in dashboard
  • Disable alerts for this camera
  • Queue for on-site technician visit

Runbook: AI Pipeline Down

Detection: AIPipelineDown or HighErrorRate alert
Severity: P1
Initial Response Time: 15 minutes

Diagnosis Steps:

# 1. Check inference service health
curl http://ai-inference:8080/health/deep

# 2. Check if model is loaded
curl http://ai-inference:8080/api/v1/model/status

# 3. Check GPU status (if applicable)
nvidia-smi
# OR for CPU inference:
htop

# 4. Check inference logs
kubectl logs -l app=ai-inference --tail=200

# 5. Check resource usage
kubectl top pod -l app=ai-inference
kubectl describe pod -l app=ai-inference

# 6. Check model service
kubectl logs -l app=model-service --tail=100

# 7. Check if inference queue is backing up
redis-cli LLEN inference:queue

# 8. Test inference manually
curl -X POST http://ai-inference:8080/api/v1/inference/test \
  -H "Content-Type: application/json" \
  -d '{"test_image": "base64encoded"}'

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Model not loaded | Restart model-service pod | Model status shows loaded |
| GPU OOM | Restart inference pod; check memory limits | nvidia-smi shows free memory |
| Model corruption | Reload model from S3 backup | Test inference succeeds |
| Inference timeout | Scale inference replicas; check input | Latency returns to normal |
| Queue backup | Scale up consumers; check for dead consumers | Queue depth returns to 0 |
| Bad model update | Rollback to previous model version | Detection accuracy restored |
| Dependency failure | Check circuit breaker status; restart dependencies | All health checks pass |

Immediate Mitigation:

  • If inference cannot be restored in 15 minutes:
    1. Switch to "detection-only" mode (skip recognition)
    2. Enable edge processing as backup
    3. Queue frames for delayed processing

Runbook: VPN Tunnel Down

Detection: Edge node unreachable; camera streams offline
Severity: P2 (P1 if all edge cameras affected)
Initial Response Time: 1 hour

Diagnosis Steps:

# 1. Check tunnel status from cloud side
ping <edge_gateway_ip>

# 2. Check VPN service status
kubectl logs -l app=vpn-gateway --tail=100

# 3. Check tunnel metrics
curl http://vpn-gateway:8080/metrics | grep vpn_tunnel

# 4. Check from edge side (if SSH available)
ssh edge-node "ping <cloud_gateway_ip>"
ssh edge-node "ipsec status"  # or wg show for WireGuard

# 5. Check network path
mtr <edge_gateway_ip>

# 6. Check certificates (if certificate-based VPN)
openssl x509 -in /etc/vpn/cert.pem -text -noout | grep "Not After"

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Edge network down | Contact ISP/site | Ping responds |
| VPN service crash | Restart VPN gateway | Tunnel established |
| Certificate expired | Renew certificates | Valid cert, tunnel up |
| MTU mismatch | Adjust tunnel MTU | No packet fragmentation |
| Firewall change | Restore firewall rules | Tunnel traffic flowing |
| IPsec/IKE failure | Restart IKE daemon; check config | SA established |
| WireGuard key issue | Regenerate keys | Handshake succeeds |

Workaround: If tunnel cannot be restored:

  • Activate local storage mode on edge (store locally, sync later)
  • Switch to cellular backup if available
  • Deploy technician on-site if needed

Runbook: Storage Full

Detection: StorageCritical95 alert fires
Severity: P1
Initial Response Time: 15 minutes

Immediate Actions (within 5 minutes):

# 1. Identify what's consuming space
df -h
ncdu /data/surveillance

# 2. Check if cleanup job is running
kubectl get jobs -n surveillance | grep cleanup

# 3. Temporarily expand storage (cloud)
# AWS EBS (CURRENT_GB is the volume's present size in GiB):
aws ec2 modify-volume --volume-id vol-XXXX --size $((CURRENT_GB + 100))

# 4. Emergency cleanup — delete oldest temp files
find /data/surveillance/temp -type f -mtime +1 -delete
find /data/surveillance/cache -type f -atime +7 -delete

# 5. Force log rotation
logrotate -f /etc/logrotate.d/surveillance

# 6. Truncate oversized logs (>1GB)
find /var/log/surveillance -type f -size +1G -exec truncate -s 0 {} \;

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Normal growth | Expand storage; review retention | Usage < 80% |
| Runaway logs | Fix log source; rotate logs | Log growth rate normal |
| Cleanup job failed | Restart cleanup job; fix root cause | Cleanup completes |
| Retention too long | Reduce retention period | Space freed |
| Camera bitrate high | Adjust camera encoding settings | Bitrate normalized |
| Orphaned temp files | Purge temp directory | Space recovered |

Runbook: Database Connectivity Issues

Detection: DatabaseUnreachable alert
Severity: P1
Initial Response Time: 15 minutes

Diagnosis Steps:

# 1. Check PostgreSQL pod status
kubectl get pods -l app=postgres
kubectl describe pod -l app=postgres

# 2. Check PostgreSQL logs
kubectl logs -l app=postgres --tail=200

# 3. Test connection from application pod
kubectl exec deployment/surveillance-api -- \
  pg_isready -h postgres -U surveillance

# 4. Check connection pool status
kubectl exec deployment/surveillance-api -- \
  python -c "from db import pool; print(pool.size(), pool.available())"

# 5. Check resource usage
kubectl top pod -l app=postgres

# 6. Check disk I/O
iostat -x 1 5

# 7. Check for locks
kubectl exec deployment/postgres -- \
  psql -U surveillance -c "SELECT * FROM pg_locks WHERE NOT granted;"

# 8. Check replication lag
kubectl exec deployment/postgres -- \
  psql -U surveillance -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp()) AS lag_seconds;"

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| PostgreSQL pod crash | Restart pod; check for OOM | Pod running, accepting connections |
| Connection pool exhausted | Increase pool size; check for leaks | Available connections > 0 |
| Disk I/O saturation | Scale storage IOPS; optimize queries | I/O wait < 20% |
| Lock contention | Kill blocking queries; optimize transactions | No waiting locks |
| Replication lag | Check replica resources; restart replication | Lag < 5 seconds |
| Query overload | Enable query caching; kill slow queries | Active queries normal |
| Disk full | See "Storage Full" runbook | Free space available |
| Hardware failure | Failover to replica; replace primary | Replica promoted |

Immediate Mitigation:

  • If primary is down:
    1. Promote replica to primary: pg_ctl promote
    2. Update connection strings
    3. Restart application pods

Runbook: High Error Rates

Detection: HighErrorRate alert fires
Severity: P1
Initial Response Time: 15 minutes

Diagnosis Steps:

# 1. Check error distribution by service
kubectl logs -l app=surveillance --tail=1000 | \
  jq -r '.service + ": " + .level + ": " + .message' | \
  sort | uniq -c | sort -rn | head -20

# 2. Check error rate per service
curl http://prometheus:9090/api/v1/query?query=\
  "rate(surveillance_errors_total[5m])"

# 3. Check for recent deployments
kubectl rollout history deployment/surveillance-api
kubectl rollout history deployment/ai-inference

# 4. Check dependency health
curl http://surveillance-api:8080/health/deep

# 5. Check for resource exhaustion
kubectl top pods

# 6. Review recent changes
# Check CI/CD pipeline, config changes

# 7. Check circuit breaker status
for service in database storage inference; do
  curl "http://surveillance-api:8080/api/v1/circuit-breakers/$service"
done

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Bad deployment | Rollback to previous version | Error rate drops |
| Dependency down | Fix dependency; check circuit breakers | All deps healthy |
| Resource exhaustion | Scale up; optimize resource usage | Usage normal |
| Code bug | Deploy hotfix; or rollback | Errors eliminated |
| Configuration error | Revert config change; validate config | Config valid |
| External API failure | Enable fallback; contact provider | Fallback active |
| Database deadlock | Kill blocking queries; fix code | Deadlocks resolved |

8.5 Post-Incident Review Template

# Post-Incident Review

## Incident Summary

| Field | Value |
|-------|-------|
| Incident ID | INC-2025-001 |
| Date/Time (UTC) | 2025-01-15 03:45 - 2025-01-15 05:20 |
| Severity | P1 |
| Detection Method | Automated alert (StorageCritical95) |
| Affected Systems | All camera streams, event storage |
| Impact | 1h 35m of degraded recording quality |

## Timeline

| Time (UTC) | Event |
|------------|-------|
| 03:42 | Storage usage crosses 95% threshold |
| 03:45 | P1 alert fires; on-call paged |
| 03:48 | On-call engineer acknowledges |
| 03:52 | Diagnosis begins; identified storage full |
| 04:05 | Emergency cleanup initiated; temp files removed |
| 04:15 | Storage expanded by 200GB |
| 04:30 | Cleanup job restarted; oldest files archived |
| 04:45 | All camera streams reconnecting |
| 05:00 | All health checks passing |
| 05:20 | Incident closed; monitoring continues |

## Root Cause Analysis

**5 Whys:**
1. Why did storage fill up? → Cleanup job had been failing for 3 days
2. Why was cleanup failing? → Credential rotation broke S3 access
3. Why didn't credential rotation update cleanup job? → Cleanup job uses hardcoded credentials
4. Why are credentials hardcoded? → Technical debt; not migrated to secret management
5. Why wasn't this caught? → No monitoring on cleanup job success/failure

**Root Cause:** Cleanup job used hardcoded S3 credentials that were not updated during routine credential rotation, causing 3 days of accumulated data without cleanup.

## Contributing Factors
- No alert on cleanup job failures
- Storage growth rate was not monitored
- No auto-expansion configured for media storage

## What Went Well
- Automated P1 alert fired immediately at 95%
- On-call responded within 3 minutes
- Emergency cleanup procedures were effective
- No data loss occurred

## What Went Wrong
- Cleanup job failure went undetected for 3 days
- Manual intervention required for storage expansion
- Edge cameras buffered locally but some frames were lost during reconnect

## Action Items

| ID | Action | Owner | Due Date | Priority |
|----|--------|-------|----------|----------|
| AI-1 | Migrate all jobs to use IAM roles / secret management | @sre-team | 2025-01-22 | High |
| AI-2 | Add alert for cleanup job failures | @sre-team | 2025-01-18 | High |
| AI-3 | Implement auto-expansion for media storage | @sre-team | 2025-01-29 | Medium |
| AI-4 | Add storage growth rate alerting | @sre-team | 2025-01-22 | Medium |
| AI-5 | Improve camera reconnection to reduce frame loss | @eng-team | 2025-02-05 | Low |
| AI-6 | Document hardcoded credential audit procedure | @security | 2025-01-22 | High |

## Lessons Learned
- Any automated job failure must have an alert
- Credential management must be centralized
- Storage monitoring needs predictive capability

## Signatures
- Incident Commander: _________________ Date: ___/___/______
- Engineering Lead: _________________ Date: ___/___/______

9. Upgrades & Maintenance

9.1 Zero-Downtime Deployment Strategy

Deployment Pattern: Rolling updates with readiness gate verification

Phase 1: Deploy new version alongside old version
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │   (serving traffic)
  └──────────┘    └──────────┘    └──────────┘

Phase 2: Add new version pod, verify health
  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │    │  Pod v2  │   (new pod not yet serving)
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
                                                      ▲
                                                health check passes

Phase 3: Route traffic to new pod, drain old pod
  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v2  │    │  Pod v2  │   (traffic shifting)
  └──────────┘    └──────────┘    └──────────┘    └──────────┘

Phase 4: Complete rollout
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v2  │    │  Pod v2  │    │  Pod v2  │   (all pods updated)
  └──────────┘    └──────────┘    └──────────┘

Rollback: Instantly revert to previous ReplicaSet
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │   (rollback in ~30 seconds)
  └──────────┘    └──────────┘    └──────────┘

Kubernetes Deployment Strategy:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: surveillance-api
  namespace: surveillance
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Allow 1 extra pod during update
      maxUnavailable: 0    # Never reduce capacity
  selector:
    matchLabels:
      app: surveillance-api
  template:
    metadata:
      labels:
        app: surveillance-api
        version: "2.3.2"   # Updated with each release
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: surveillance/api:2.3.2@sha256:a1b2c3d4...
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 6
            successThreshold: 2
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - surveillance-api
                topologyKey: kubernetes.io/hostname

9.2 Deployment Pipeline

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Build     │───▶│   Test      │───▶│   Stage     │───▶│   Canary    │───▶│  Production │
│  (CI)       │    │  (Unit/Int) │    │  (E2E)      │    │  (5% traff) │    │  (100%)     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                          │                  │                  │
                          ▼                  ▼                  ▼
                    ┌──────────┐      ┌──────────┐      ┌──────────┐
                    │ Fail =   │      │ Fail =   │      │ Fail =   │
                    │ Block    │      │ Block    │      │ Rollback │
                    └──────────┘      └──────────┘      └──────────┘

Automated promotion gates:

| Gate | Criteria | Auto-promote Timing |
|------|----------|---------------------|
| Build | All tests pass; linting passes; security scan clean | Immediate |
| Staging | E2E tests pass; performance within 10% of baseline | 30 min validation |
| Canary | Error rate < 0.1%; p95 latency < baseline + 20% | 15 min bake time |
| Production | Canary metrics healthy for 30 min | Auto-proceed |

9.3 Database Migrations

Tool: Alembic (SQLAlchemy migrations) with yoyo-migrations for idempotent SQL

Migration rules:

  1. All migrations must be backward-compatible (add-only in one release)
  2. Destructive changes require a 2-phase deployment
  3. Migrations are versioned and reversible
  4. Migrations run automatically as init container before app startup
  5. Migration status exposed via /health/ready
# migrations/env.py — Alembic configuration
from alembic import context
from sqlalchemy import create_engine

config = context.config
target_metadata = None  # set to the application's MetaData (e.g. Base.metadata)

def run_migrations():
    """Run migrations in online mode."""
    connectable = create_engine(config.get_main_option("sqlalchemy.url"))
    
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            transaction_per_migration=True,
            compare_type=True,
        )
        
        with context.begin_transaction():
            context.run_migrations()

# Migration example: add_column (backward-compatible)
# migrations/versions/20250115_add_camera_resolution.py
"""
Add resolution column to cameras table

Revision ID: 20250115_add_camera_resolution
Revises: 20250101_initial
Create Date: 2025-01-15 08:30:00
"""
from alembic import op
import sqlalchemy as sa

revision = '20250115_add_camera_resolution'
down_revision = '20250101_initial'

# Phase 1 (this release): Add column as nullable
def upgrade():
    op.add_column('cameras', sa.Column('resolution', sa.String(20), nullable=True))
    # Backfill existing data
    op.execute("UPDATE cameras SET resolution = '1920x1080' WHERE resolution IS NULL")

# Phase 2 (next release): Make column non-nullable
# def upgrade():
#     op.alter_column('cameras', 'resolution', nullable=False)

def downgrade():
    op.drop_column('cameras', 'resolution')

Migration execution (Kubernetes init container):

initContainers:
  - name: db-migrations
    image: surveillance/api:2.3.2@sha256:a1b2c3d4...
    command:
      - python
      - -m
      - alembic
      - upgrade
      - head
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: url
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
    # Init containers run to completion before the app container starts;
    # restart-on-failure follows the pod-level restartPolicy

Two-phase destructive change example:

Phase 1 (Release N):

def upgrade():
    # Add new column
    op.add_column('detections', sa.Column('confidence_v2', sa.Float(), nullable=True))
    # Create index concurrently (no table lock); CONCURRENTLY cannot run inside
    # a transaction, so use Alembic's autocommit block
    with op.get_context().autocommit_block():
        op.create_index('ix_detections_confidence_v2', 'detections', ['confidence_v2'],
                        postgresql_concurrently=True)
    # Backfill one batch; an out-of-band job repeats this until no rows remain
    op.execute("""
        UPDATE detections
        SET confidence_v2 = confidence
        WHERE id IN (SELECT id FROM detections
                     WHERE confidence_v2 IS NULL LIMIT 10000)
    """)
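The batched backfill is normally driven by a loop that repeats the single-batch UPDATE until no rows remain. A minimal sketch with the executor injected (`execute_batch` is a hypothetical callable; in practice it wraps a DB connection running the UPDATE above and returning the affected row count):

```python
def backfill_in_batches(execute_batch, batch_size: int = 10_000,
                        max_batches: int = 1_000) -> int:
    """Repeatedly run one batched UPDATE until it touches zero rows.
    `execute_batch(limit)` must return the number of rows updated.
    Returns the total number of rows backfilled."""
    total = 0
    for _ in range(max_batches):  # hard cap as a safety valve
        updated = execute_batch(batch_size)
        total += updated
        if updated == 0:
            break
    return total
```

Keeping each batch small bounds lock time on the `detections` table while the application stays online.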

Phase 2 (Release N+1):

def upgrade():
    # Now safe to drop old column (all code reads from new column)
    op.drop_column('detections', 'confidence')
    # Rename new column
    op.alter_column('detections', 'confidence_v2', new_column_name='confidence')

9.4 Model Update Deployment (Blue/Green)

AI model updates use blue/green to enable instant rollback:

Current State:
  ┌──────────────┐
  │  Model v2.1  │  ← Active (Blue)
  │   (Blue)     │
  └──────────────┘
       ▲
   traffic: 100%

Deployment:
  1. Load Model v2.2 alongside v2.1
  2. Warm up v2.2 (run inference tests)
  3. Gradually shift traffic: 10% → 50% → 100%
  4. Monitor accuracy and latency
  
  ┌──────────────┐    ┌──────────────┐
  │  Model v2.1  │    │  Model v2.2  │
  │   (Blue)     │    │   (Green)    │
  └──────────────┘    └──────────────┘
    traffic: 70%         traffic: 30%

Rollback (instant):
  ┌──────────────┐    ┌──────────────┐
  │  Model v2.1  │    │  Model v2.2  │
  │   (Blue)     │    │  (Green)     │
  └──────────────┘    └──────────────┘
   traffic: 100%         traffic: 0%

Model deployment configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-blue
  namespace: surveillance
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-inference
      model: blue
  template:
    metadata:
      labels:
        app: ai-inference
        model: blue    # matched by the Service selector below
    spec:
      containers:
        - name: inference
          image: surveillance/inference:2.3.1
          env:
            - name: MODEL_VERSION
              value: "face-detection-v2.1"
            - name: MODEL_PATH
              value: "/models/v2.1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-green
  namespace: surveillance
spec:
  replicas: 0  # Scaled to 0 by default
  selector:
    matchLabels:
      app: ai-inference
      model: green
  template:
    metadata:
      labels:
        app: ai-inference
        model: green   # matched by the Service selector below
    spec:
      containers:
        - name: inference
          image: surveillance/inference:2.3.1
          env:
            - name: MODEL_VERSION
              value: "face-detection-v2.2"
            - name: MODEL_PATH
              value: "/models/v2.2"
---
# Service routes to active model via label selector
apiVersion: v1
kind: Service
metadata:
  name: ai-inference
  annotations:
    active-model: "blue"
spec:
  selector:
    model: blue  # Changed to "green" for cutover
  ports:
    - port: 8080

Model switch script:

#!/bin/bash
# switch-model.sh — Switch between blue and green model deployments
set -euo pipefail

NAMESPACE="surveillance"
TARGET="${1:?usage: switch-model.sh <blue|green>}"
OLD_VERSION=$([ "$TARGET" == "blue" ] && echo "green" || echo "blue")

# Scale target to match the currently active deployment
CURRENT_REPLICAS=$(kubectl get deployment "ai-inference-$OLD_VERSION" -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}')

echo "Scaling ai-inference-$TARGET to $CURRENT_REPLICAS replicas..."
kubectl scale deployment "ai-inference-$TARGET" --replicas="$CURRENT_REPLICAS" -n "$NAMESPACE"

# Wait for ready
kubectl rollout status "deployment/ai-inference-$TARGET" -n "$NAMESPACE" --timeout=300s

# Update service selector
echo "Switching service to $TARGET..."
kubectl patch service ai-inference -n "$NAMESPACE" \
  --type merge \
  -p "{\"spec\":{\"selector\":{\"model\":\"$TARGET\"}}}"

# Update annotation
kubectl annotate service ai-inference -n "$NAMESPACE" \
  "active-model=$TARGET" --overwrite

# Scale old version to 0
echo "Scaling down ai-inference-$OLD_VERSION..."
kubectl scale deployment "ai-inference-$OLD_VERSION" --replicas=0 -n "$NAMESPACE"

echo "Model switch complete. Active: $TARGET"

9.5 Maintenance Windows

| Window | Schedule | Duration | Allowed Activities |
|--------|----------|----------|--------------------|
| Weekly | Sunday 02:00-06:00 UTC | 4 hours | Patches, minor updates, config changes |
| Monthly | First Sunday 02:00-08:00 UTC | 6 hours | Database maintenance, major upgrades, model updates |
| Quarterly | Scheduled | 8 hours | Infrastructure upgrades, DR drills |
| Emergency | On-demand | As needed | Security patches, critical fixes |

Maintenance mode API:

@app.post("/admin/maintenance")
async def enable_maintenance_mode(
    duration_minutes: int,
    reason: str,
    user: AdminUser = Depends(get_admin_user)
):
    """Enable maintenance mode — disable non-critical processing."""
    await redis.set("maintenance:active", "true", ex=duration_minutes * 60)
    await redis.set("maintenance:reason", reason, ex=duration_minutes * 60)
    
    # Notify all connected clients
    await websocket_manager.broadcast({
        "type": "maintenance",
        "status": "started",
        "reason": reason,
        "estimated_duration_minutes": duration_minutes
    })
    
    # Reduce non-critical processing
    await set_pipeline_mode("minimal")
    
    audit_log.info("Maintenance mode enabled by %s for %d minutes: %s",
                   user.username, duration_minutes, reason)
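While the flag is set, a request filter decides which paths stay available. A sketch of the predicate only (the path prefixes are illustrative assumptions; in a real deployment the allow-list would live in config and the check would run as middleware reading `maintenance:active` from Redis):

```python
# Paths that must keep working during maintenance (illustrative allow-list)
CRITICAL_PREFIXES = ("/health", "/api/v1/alerts", "/admin/maintenance")

def request_allowed(path: str, maintenance_active: bool) -> bool:
    """Allow everything normally; during maintenance, allow only
    health checks, alert delivery, and the maintenance API itself."""
    if not maintenance_active:
        return True
    return path.startswith(CRITICAL_PREFIXES)
```

Rejected requests would receive a 503 with a `Retry-After` header derived from the stored duration.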

9.6 Rollback Capability

Every deployment maintains the previous N versions for instant rollback:

| Rollback Type | Method | Time to Complete | When to Use |
|---------------|--------|------------------|-------------|
| Application rollback | `kubectl rollout undo` | ~30 seconds | Bad deployment |
| Database rollback | `alembic downgrade` | 2-5 minutes | Bad migration |
| Model rollback | Switch service selector | ~10 seconds | Bad model update |
| Configuration rollback | Git revert + apply | 1-2 minutes | Bad config change |
| Infrastructure rollback | Terraform state revert | 5-10 minutes | Bad infra change |
| Full system rollback | DR failover | 15-30 minutes | Catastrophic failure |

Automated rollback triggers:

# rollback-alerts.yaml
- alert: DeploymentRollbackRequired
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      and
      delta(deployment_timestamp[10m]) > 0
    )
  for: 2m
  labels:
    severity: p1
  annotations:
    summary: "High error rate after deployment — rollback recommended"
    runbook_url: "https://wiki.internal/runbooks/auto-rollback"
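An Alertmanager webhook receiver can turn this alert into an automated `kubectl rollout undo`. A sketch of the payload-parsing half only (the `deployment` label is assumed to be attached by the alerting rule; the receiver wiring and the actual rollout call are out of scope here):

```python
def deployments_to_rollback(payload: dict) -> list[str]:
    """Extract deployment names from firing DeploymentRollbackRequired alerts.
    `payload` is a standard Alertmanager webhook body."""
    targets = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if (alert.get("status") == "firing"
                and labels.get("alertname") == "DeploymentRollbackRequired"):
            deployment = labels.get("deployment")
            if deployment:
                targets.append(deployment)
    return targets
```

The returned names would each be fed to `kubectl rollout undo deployment/<name>`, with the action recorded in the audit log.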

9.7 Version Pinning

All container images MUST be pinned to digest, never to floating tags:

# GOOD — pinned to digest
image: surveillance/api:2.3.1@sha256:abc123def456...

# BAD — floating tag
image: surveillance/api:latest

# ACCEPTABLE — semver tag with digest verification
image: surveillance/api:2.3.1
# (digest verified by admission controller)

Image verification admission controller:

# Kyverno policy (a similar rule can be written as an OPA Gatekeeper constraint)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-digest
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "All container images must be pinned to digest"
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"
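The same digest rule can also be enforced in CI before manifests reach the cluster. A minimal sketch of the check (regex only; it validates the reference format, not that the digest actually exists in the registry):

```python
import re

# Image reference pinned to a digest, e.g. repo/name:tag@sha256:<64 hex chars>
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def image_is_pinned(image_ref: str) -> bool:
    """True only if the reference ends in a full sha256 digest."""
    return bool(DIGEST_RE.search(image_ref))
```

A pre-merge lint job would run this over every `image:` field in the rendered manifests and fail the build on the first unpinned reference.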

10. Performance Optimization

10.1 Query Optimization

Slow query monitoring:

-- Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slow queries
SELECT 
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

Alert on slow queries:

- alert: SlowPostgresQueries
  expr: |
    pg_stat_statements_mean_time > 1000
  for: 5m
  labels:
    severity: p3
  annotations:
    summary: "Slow queries detected (>1000ms average)"

Index review (monthly):

-- Check for tables likely missing indexes (heavy sequential scans)
SELECT 
    schemaname,
    relname AS tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    seq_tup_read / NULLIF(seq_scan, 0) AS avg_rows_per_seq_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC;

-- Check for unused indexes
SELECT 
    schemaname,
    relname AS tablename,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Current index strategy:

-- Core indexes for surveillance queries
CREATE INDEX CONCURRENTLY idx_detections_timestamp_camera 
    ON detections (timestamp DESC, camera_id);

CREATE INDEX CONCURRENTLY idx_detections_person_id 
    ON detections (person_id) WHERE person_id IS NOT NULL;

CREATE INDEX CONCURRENTLY idx_events_timestamp_type 
    ON events (timestamp DESC, event_type);

CREATE INDEX CONCURRENTLY idx_alerts_status_created 
    ON alerts (status, created_at DESC) 
    WHERE status IN ('pending', 'sent');

CREATE INDEX CONCURRENTLY idx_recordings_camera_timestamp 
    ON recordings (camera_id, start_time DESC);

-- Partial index for active alerts (most queried)
CREATE INDEX CONCURRENTLY idx_alerts_active 
    ON alerts (created_at DESC, camera_id, severity)
    WHERE status = 'active';

10.2 Cache Strategy (Redis)

| Cache Type | TTL | Invalidation | Purpose |
|------------|-----|--------------|---------|
| Camera configuration | 5 min | On update | Reduce DB reads |
| Person profiles | 10 min | On update | Fast face lookup |
| Recent detections | 1 min | Time-based | Dashboard display |
| Alert rules | 5 min | On update | Rule evaluation |
| API responses (frequent) | 30 sec | On data change | Reduce API load |
| Session data | 24 hours | On logout | User sessions |
| Rate limiting | 1 min | Automatic | API protection |
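With many pods caching the same keys, identical TTLs can expire simultaneously and stampede the database. Adding jitter to the TTLs above is a common mitigation; a small sketch (the ±10% spread is an assumption, tune per cache type):

```python
import random

def jittered_ttl(base_ttl_seconds: int, spread: float = 0.10) -> int:
    """Return the base TTL shifted by a uniform ±spread fraction,
    so cache entries written together do not all expire together."""
    low = int(base_ttl_seconds * (1 - spread))
    high = int(base_ttl_seconds * (1 + spread))
    return random.randint(low, max(low, high))
```

For example, the 5-minute camera-configuration TTL would land anywhere in roughly 270-330 seconds rather than exactly 300.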

Redis configuration:

# redis-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: surveillance
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    appendonly yes
    appendfsync everysec
    save 900 1
    save 300 10
    save 60 10000
    tcp-keepalive 60
    timeout 300

Cache implementation:

# cache.py
import redis.asyncio as redis
import json
import hashlib
from functools import wraps

redis_client = redis.Redis(
    host='redis',
    port=6379,
    db=0,
    decode_responses=True,
    socket_connect_timeout=5,
    socket_timeout=5,
    health_check_interval=30,
)

def cached(ttl_seconds: int, key_prefix: str = "cache"):
    """Decorator to cache async function results (the factory itself is synchronous)."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Generate cache key
            cache_key = f"{key_prefix}:{func.__name__}:{_generate_key(args, kwargs)}"
            
            # Try cache first
            cached_value = await redis_client.get(cache_key)
            if cached_value is not None:
                return json.loads(cached_value)
            
            # Execute and cache
            result = await func(*args, **kwargs)
            await redis_client.setex(
                cache_key,
                ttl_seconds,
                json.dumps(result, default=str)
            )
            return result
        return wrapper
    return decorator

def _generate_key(args, kwargs):
    key_data = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True, default=str)
    return hashlib.sha256(key_data.encode()).hexdigest()[:16]

# Usage
@cached(ttl_seconds=300, key_prefix="camera")
async def get_camera_config(camera_id: str):
    return await db.fetchrow("SELECT * FROM cameras WHERE id = $1", camera_id)

@cached(ttl_seconds=60, key_prefix="detections")
async def get_recent_detections(camera_id: str, limit: int = 50):
    return await db.fetch(
        """SELECT * FROM detections 
           WHERE camera_id = $1 
           ORDER BY timestamp DESC 
           LIMIT $2""",
        camera_id, limit
    )

10.3 CDN Configuration

Static assets and archived media are served via CDN:

# CloudFront / CDN configuration
cdn:
  origins:
    - id: surveillance-media
      domain: surveillance-media.s3.amazonaws.com
      path: /recordings
      
    - id: surveillance-static
      domain: surveillance-static.s3.amazonaws.com
      path: /static
      
  behaviors:
    - path: /recordings/*.mp4
      ttl: 86400
      compress: true
      
    - path: /static/*
      ttl: 604800
      cache_control: "public, max-age=604800, immutable"
      
    - path: /api/*
      ttl: 0  # Don't cache API
      
  signed_urls:
    enabled: true
    key_pair_id: "K..."
    expiration: 3600  # 1 hour

10.4 Connection Pooling

Database Connection Pooling

# database.py
import asyncpg

DB_POOL_CONFIG = {
    "min_size": 5,
    "max_size": 20,
    "max_inactive_connection_lifetime": 300,  # asyncpg's idle-connection timeout
    "max_queries": 50000,
    "command_timeout": 30,
    "server_settings": {
        "jit": "off",
        "application_name": "surveillance-api"
    }
}

pool = None

async def init_pool(database_url: str):
    global pool
    pool = await asyncpg.create_pool(
        database_url,
        **DB_POOL_CONFIG
    )

async def get_connection():
    return await pool.acquire()

async def release_connection(conn):
    await pool.release(conn)

HTTP Connection Pooling (for inter-service communication)

# http_client.py
import httpx

class ServiceClient:
    def __init__(self):
        self.client = httpx.AsyncClient(
            # 30s default timeout with a 5s connect timeout; httpx.Timeout
            # needs a default when only some phases are overridden
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_connections=100,
                max_keepalive_connections=20
            ),
            http2=True,  # requires the httpx[http2] extra
        )
    
    async def get(self, service: str, path: str):
        url = f"http://{service}:8080{path}"
        response = await self.client.get(url)
        response.raise_for_status()
        return response.json()

service_client = ServiceClient()

10.5 Resource Limits

# resource-limits.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: surveillance-limits
  namespace: surveillance
spec:
  limits:
    - default:
        cpu: "1"
        memory: 1Gi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container
---
# Per-service resource allocation
resources:
  # Video capture (I/O bound)
  video-capture:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi

  # AI inference (CPU/GPU bound)
  ai-inference:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 8Gi

  # API (moderate load)
  surveillance-api:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 2Gi

  # Database (high memory)
  postgres:
    requests:
      cpu: "1"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 16Gi

  # Redis (low CPU, moderate memory)
  redis:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi

10.6 Performance Benchmarks

| Metric | Target | Alert Threshold | Critical Threshold |
|--------|--------|-----------------|--------------------|
| Camera stream latency | < 100ms | > 200ms | > 500ms |
| AI inference per frame | < 50ms | > 100ms | > 200ms |
| End-to-end detection latency | < 500ms | > 1000ms | > 2000ms |
| API response time (p50) | < 50ms | > 100ms | > 500ms |
| API response time (p95) | < 200ms | > 500ms | > 1000ms |
| Database query time (p95) | < 10ms | > 50ms | > 200ms |
| Stream processing FPS | 30 FPS | < 25 FPS | < 15 FPS |
| Frame drop rate | < 0.1% | > 1% | > 5% |
| Alert delivery time | < 5s | > 10s | > 30s |
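The benchmark table maps naturally onto a three-level classifier used by the alerting pipeline. A sketch for lower-is-better metrics only (FPS inverts the comparison); thresholds are passed in from the table rather than hard-coded:

```python
def classify_metric(value: float, alert_threshold: float,
                    critical_threshold: float) -> str:
    """Classify a lower-is-better measurement: 'ok', 'alert', or 'critical'."""
    if value > critical_threshold:
        return "critical"
    if value > alert_threshold:
        return "alert"
    return "ok"
```

For example, an API p95 of 700ms exceeds the 500ms alert threshold but not the 1000ms critical threshold, so it classifies as "alert".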

11. Disaster Recovery

11.1 DR Objectives

| Metric | Value | Measurement |
|--------|-------|-------------|
| RTO (Recovery Time Objective) | 1 hour | Time from disaster declaration to service restoration |
| RPO (Recovery Point Objective) | 15 minutes | Maximum acceptable data loss |
| RTO (Database) | 30 minutes | Database failover time |
| RTO (Application) | 15 minutes | Application redeployment time |
| RPO (Database) | < 1 minute | With synchronous replication |
| RPO (Media) | 15 minutes | Cross-region replication lag |

11.2 DR Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        PRODUCTION (us-east-1)                        │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐          │
│  │   EKS        │  │   RDS        │  │   S3             │          │
│  │   Cluster    │  │   PostgreSQL │  │   Primary        │          │
│  │              │  │   (Primary)  │  │   Bucket         │          │
│  │  ┌────────┐  │  │              │  │                  │          │
│  │  │Capture │  │  │  ┌────────┐  │  │  ┌──────────┐    │          │
│  │  │API     │  │  │  │Primary │  │  │  │ Recordings│   │          │
│  │  │Inference│  │  │  │Replica │  │  │  │ Events    │   │          │
│  │  └────────┘  │  │  └────────┘  │  │  │ Models    │   │          │
│  └──────────────┘  └──────────────┘  └──────────────────┘          │
│           │                │                  │                      │
│           ▼                ▼                  ▼                      │
│     ┌─────────────────────────────────────────────────┐              │
│     │           Real-time Replication                  │              │
│     │  (WAL streaming + S3 cross-region replication)   │              │
│     └─────────────────────────────────────────────────┘              │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     DR SITE (us-west-2)                              │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐          │
│  │   EKS        │  │   RDS        │  │   S3             │          │
│  │   (Scaled    │  │   PostgreSQL │  │   Replica        │          │
│  │    to 0)     │  │   (Standby)  │  │   Bucket         │          │
│  │              │  │              │  │                  │          │
│  │  [Ready to   │  │  ┌────────┐  │  │  [Fully         │          │
│  │   scale up]  │  │  │Standby │  │  │   replicated]  │          │
│  │              │  │  │Replica │  │  │                  │          │
│  └──────────────┘  │  └────────┘  │  └──────────────────┘          │
│                    └──────────────┘                                  │
└─────────────────────────────────────────────────────────────────────┘

11.3 Data Replication

Database Replication

# RDS PostgreSQL cross-region read replica
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DRReadReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: surveillance-dr-replica
      DBInstanceClass: db.r6g.xlarge
      Engine: postgres
      EngineVersion: '15.4'
      SourceDBInstanceIdentifier: 
        !Sub 'arn:aws:rds:us-east-1:${AWS::AccountId}:db:surveillance-primary'
      DBSubnetGroupName: !Ref DRSubnetGroup
      VPCSecurityGroups:
        - !Ref DRSecurityGroup
      MultiAZ: false  # Standby only; enable during failover
      StorageEncrypted: true
      KmsKeyId: !Ref DRKMSKey
      BackupRetentionPeriod: 7
      DeletionProtection: true
      Tags:
        - Key: Purpose
          Value: DR-Standby
        - Key: RPO
          Value: 15min

Replication monitoring:

-- Check replication lag (run on primary)
SELECT 
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- Alert if replication lag > 5 minutes
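The threshold check as it might run in a monitoring job; a sketch assuming the `replay_lag` interval has already been converted to seconds, with the 5-minute alert threshold from above and the 15-minute RPO limit from the DR objectives:

```python
# Thresholds: alert at 5 minutes of lag, RPO breach at 15 minutes
ALERT_AFTER_S = 300.0
RPO_LIMIT_S = 900.0

def replication_status(replay_lag_seconds: float) -> str:
    """Map the standby's replay lag onto 'ok', 'alert', or 'rpo_breach'."""
    if replay_lag_seconds > RPO_LIMIT_S:
        return "rpo_breach"
    if replay_lag_seconds > ALERT_AFTER_S:
        return "alert"
    return "ok"
```

An "rpo_breach" result means a failover at that moment would lose more data than the stated 15-minute objective, so it pages at p1 rather than merely alerting.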

Object Storage Replication

S3 Cross-Region Replication (CRR) with 15-minute RPO:

  • All new objects replicated within 15 minutes
  • Replication status tracked per object
  • Failed replication events alerted

Configuration Replication

  • Terraform state stored in S3 with cross-region replication
  • Git repositories mirrored to secondary Git provider
  • Kubernetes manifests stored in Git (GitOps)

11.4 Failover Process

Automated Failover (Database — RDS)

RDS Multi-AZ provides automatic failover:

  1. Health check fails on primary
  2. RDS promotes standby to primary (typically 60-120 seconds)
  3. DNS endpoint updates automatically
  4. Application reconnects via connection pool

Manual DR Failover (Full Site)

#!/bin/bash
# dr-failover.sh — Execute full site failover to DR region

PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
FAILOVER_REASON="$1"
START_TIME=$(date +%s)  # used to report total failover time at the end

log() {
    echo "[$(date -Iseconds)] $1" | tee -a "/var/log/dr/failover-$(date +%Y%m%d).log"
}

log "=== DR FAILOVER INITIATED ==="
log "Reason: $FAILOVER_REASON"
log "From: $PRIMARY_REGION → $DR_REGION"

# 1. Verify DR environment
log "1. Verifying DR environment readiness..."
if ! aws eks describe-cluster --name surveillance-dr --region $DR_REGION > /dev/null 2>&1; then
    log "ERROR: DR EKS cluster not accessible"
    exit 1
fi

# 2. Promote DR database from standby
log "2. Promoting DR database..."
aws rds promote-read-replica \
    --db-instance-identifier surveillance-dr-replica \
    --region $DR_REGION

# Wait for promotion
aws rds wait db-instance-available \
    --db-instance-identifier surveillance-dr-replica \
    --region $DR_REGION
log "   DR database promoted successfully"

# 3. Enable Multi-AZ on DR database
log "3. Enabling Multi-AZ on DR database..."
aws rds modify-db-instance \
    --db-instance-identifier surveillance-dr-replica \
    --multi-az \
    --apply-immediately \
    --region $DR_REGION

# 4. Scale up DR EKS cluster
log "4. Scaling up DR EKS cluster..."
aws eks update-nodegroup-config \
    --cluster-name surveillance-dr \
    --nodegroup-name surveillance-workers \
    --scaling-config minSize=3,maxSize=10,desiredSize=3 \
    --region $DR_REGION

# Wait for nodes (point kubectl at the DR cluster first)
kubectl config use-context surveillance-dr
sleep 120
kubectl wait --for=condition=Ready nodes --all --timeout=300s

# 5. Deploy application to DR
log "5. Deploying application to DR..."
kubectl config use-context surveillance-dr
kubectl apply -k k8s/overlays/dr/

# Wait for deployments
kubectl wait --for=condition=available \
    --all deployments \
    --namespace surveillance \
    --timeout=600s

# 6. Update DNS to point to DR
log "6. Updating DNS to DR region..."
aws route53 change-resource-record-sets \
    --hosted-zone-id $HOSTED_ZONE_ID \
    --change-batch file://dr-dns-update.json

# 7. Verify health
log "7. Running health checks..."
for i in {1..10}; do
    if curl -f https://surveillance.company.com/health/deep > /dev/null 2>&1; then
        log "   Health check PASSED"
        break
    fi
    log "   Health check attempt $i/10..."
    sleep 10
done

# 8. Verify cameras reconnecting
log "8. Verifying camera streams..."
sleep 60
STREAM_COUNT=$(curl -s https://surveillance.company.com/api/v1/cameras/status | \
    jq '[.cameras[] | select(.status == "active")] | length')
log "   Active streams: $STREAM_COUNT/8"

# 9. Send notifications
log "9. Sending notifications..."
curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\":\"DR FAILOVER COMPLETE: Production now running in $DR_REGION. Reason: $FAILOVER_REASON. Active streams: $STREAM_COUNT/8\"}"

log "=== DR FAILOVER COMPLETE ==="
log "Total time: $(($(date +%s) - START_TIME)) seconds"

11.5 DR Testing Schedule

| Test Type | Frequency | Scope | Duration | Validation |
|-----------|-----------|-------|----------|------------|
| Backup restore drill | Monthly | Database + media | 2 hours | Data integrity verified |
| Application redeployment | Monthly | Full application stack | 1 hour | All services healthy |
| Network failover test | Quarterly | VPN, DNS | 30 min | Traffic routes correctly |
| Database failover test | Quarterly | RDS Multi-AZ promotion | 1 hour | Replication lag acceptable |
| Full DR drill | Quarterly | Complete site failover | 4 hours | All RTO/RPO met |
| Tabletop exercise | Semi-annually | Team response procedures | 2 hours | Process gaps identified |

Full DR drill procedure:

  1. Week before: Schedule drill; notify stakeholders; prepare isolated test data
  2. Day of:
    • 09:00 — Initiate failover (simulate primary region failure)
    • 09:05 — DR team executes failover runbook
    • 09:30 — Verify database is promoted and accessible
    • 10:00 — Verify application is deployed and healthy
    • 10:30 — Verify camera streams reconnect
    • 11:00 — Verify alert delivery
    • 11:30 — Run E2E test suite
    • 12:00 — Validate data integrity (sample checks)
    • 12:30 — Measure and document RTO/RPO
    • 13:00 — Initiate failback to primary
    • 14:00 — Verify primary is restored
  3. Week after: Complete DR test report; file action items

DR Test Report Template:

## DR Drill Report — 2025-Q1

| Item | Result |
|------|--------|
| Date | 2025-03-15 |
| Scenario | Complete region failure (us-east-1) |
| Failover RTO Target | 60 minutes |
| Failover RTO Achieved | 42 minutes |
| RPO Target | 15 minutes |
| RPO Achieved | 8 minutes |
| Streams Restored | 8/8 (100%) |
| Data Integrity | PASS |
| E2E Tests | 47/47 PASS |

### Issues Found
1. Camera reconnection took 18 minutes (target: <10 min) — AI-7 filed
2. Alert service required manual restart — AI-8 filed

### Action Items
| ID | Description | Owner | Due |
|----|-------------|-------|-----|
| AI-7 | Optimize camera reconnection sequence | @eng | 2025-04-01 |
| AI-8 | Fix alert service startup dependency | @sre | 2025-03-22 |

11.6 DR Readiness Checklist

Verify monthly (automated where possible):

  • DR database replication lag < 1 minute
  • S3 cross-region replication caught up
  • DR EKS cluster accessible and nodes can scale
  • Latest container images available in DR region registry
  • DR Terraform plan applies without errors (dry-run)
  • Backup integrity verified (latest full backup)
  • Failover runbook accessible and up-to-date
  • DR contact list current
  • VPN/cross-region network paths verified

12. Capacity Planning

12.1 Current Capacity Baseline (8 Cameras)

| Resource | Current Usage | Capacity | Headroom |
|----------|---------------|----------|----------|
| CPU (cloud) | 4 cores avg | 8 cores | 100% |
| Memory (cloud) | 12 GB | 32 GB | 167% |
| GPU (if used) | 40% utilization | 1x GPU | 150% |
| Storage hot tier | 6 TB | 20 TB | 233% |
| Storage warm tier | 18 TB | 50 TB | 178% |
| Database storage | 150 GB | 500 GB | 233% |
| Database connections | 25 | 100 | 300% |
| Network egress | 200 Mbps | 1 Gbps | 400% |
| Inference throughput | 240 FPS (8x30) | 480 FPS | 100% |
| Alert volume | 50/day | 500/day | 900% |
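Headroom in the table is remaining capacity expressed as a percentage of current usage. A one-line helper makes the convention explicit:

```python
def headroom_pct(current: float, capacity: float) -> int:
    """Remaining capacity as a percentage of current usage,
    e.g. 12 GB used of 32 GB gives (32-12)/12, about 167%."""
    return round((capacity - current) / current * 100)
```

Anything below roughly 50% headroom on this convention should feed into the scaling triggers in the next section.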

12.2 Scaling Triggers

| Metric | Scale-Up Trigger | Scale-Down Trigger | Action |
|--------|------------------|--------------------|--------|
| CPU utilization | > 70% for 10 minutes | < 30% for 30 minutes | Add/remove inference pods |
| Memory utilization | > 80% for 10 minutes | < 40% for 30 minutes | Add memory or pods |
| Inference latency | > 100ms p95 for 5 min | < 50ms p95 for 10 min | Scale inference horizontally |
| Queue depth | > 1000 frames | < 100 frames | Adjust consumer count |
| Storage usage | > 70% | N/A (manual) | Expand volume or archive |
| Camera count | > 8 cameras | N/A | Scale per-camera resources |

Horizontal Pod Autoscaler configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
  namespace: surveillance
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: surveillance_pipeline_latency_ms
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

12.3 Camera Addition Process

Step 1: Pre-deployment Assessment (Day -7)
├── Evaluate resource requirements
├── Verify network connectivity
├── Review camera positioning and coverage
└── Update configuration in Git

Step 2: Infrastructure Preparation (Day -3)
├── Calculate additional storage needs
├── Verify scaling headroom
├── Prepare camera configuration
└── Stage network/VPN configuration

Step 3: Deployment (Day 0)
├── Add camera to configuration
├── Deploy updated configuration
├── Verify stream connection
├── Validate AI processing
├── Test alert generation
└── Update dashboards

Step 4: Validation (Day 0-1)
├── Monitor for 24 hours
├── Verify FPS and quality
├── Confirm alerts working
├── Document in camera registry
└── Notify stakeholders

Camera addition checklist:

| Step | Item | Verification |
|------|------|--------------|
| 1 | Camera network reachable | `ping <camera_ip>` |
| 2 | RTSP stream accessible | `ffprobe rtsp://<camera>/stream` |
| 3 | VPN tunnel supports additional bandwidth | Bandwidth check |
| 4 | Configuration added to Git | PR merged |
| 5 | Stream appears in video-capture | Logs show connection |
| 6 | FPS meets target (>25) | Grafana dashboard |
| 7 | AI inference processing frames | Detection metrics |
| 8 | Alerts generated correctly | Test alert |
| 9 | Storage projections updated | Capacity review |
| 10 | Camera documented | Registry updated |

### 12.4 Per-Camera Resource Requirements

| Resource | Per Camera | 8 Cameras | 16 Cameras | 24 Cameras |
|----------|------------|-----------|------------|------------|
| CPU (inference) | 0.5 cores | 4 cores | 8 cores | 12 cores |
| Memory (processing) | 1 GB | 8 GB | 16 GB | 24 GB |
| Storage (hot, daily) | 50 GB/day | 400 GB/day | 800 GB/day | 1.2 TB/day |
| Network (ingress) | 25 Mbps | 200 Mbps | 400 Mbps | 600 Mbps |
| GPU memory | 512 MB | 4 GB | 8 GB | 12 GB |
| Database IOPS | 100 | 800 | 1,600 | 2,400 |
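Because every row scales linearly per camera, projections for intermediate fleet sizes follow directly from the per-camera column. A minimal sketch (baselines copied from the table; `project_resources` is a hypothetical helper, not an existing script):

```shell
# project_resources <camera_count> -- linear projection from the
# per-camera baselines in the table above. CPU is tracked in tenths
# of a core so the arithmetic stays in integers.
project_resources() {
  local n="$1"
  echo "cpu_cores=$(( n * 5 / 10 ))"        # 0.5 cores per camera
  echo "memory_gb=$(( n * 1 ))"             # 1 GB per camera
  echo "storage_gb_day=$(( n * 50 ))"       # 50 GB/day per camera
  echo "network_mbps=$(( n * 25 ))"         # 25 Mbps per camera
  echo "gpu_mem_gb=$(( n * 512 / 1024 ))"   # 512 MB per camera
  echo "db_iops=$(( n * 100 ))"             # 100 IOPS per camera
}

# Example: a 12-camera fleet (Phase 1 of the roadmap below)
project_resources 12
```

For 12 cameras this yields 6 cores, 12 GB RAM, 600 GB/day hot storage, 300 Mbps ingress, 6 GB GPU memory, and 1,200 IOPS; the Phase 1 row below adds headroom on top of these raw figures.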

### 12.5 Scaling Roadmap

| Phase | Cameras | Timeline | Infrastructure Changes |
|-------|---------|----------|------------------------|
| Current | 8 | Now | 3 inference pods, 8 CPU, 32 GB RAM |
| Phase 1 | 12 | Q2 2025 | 4 inference pods, 12 CPU, 48 GB RAM |
| Phase 2 | 16 | Q3 2025 | 6 inference pods, 16 CPU, 64 GB RAM, GPU add |
| Phase 3 | 24 | Q1 2026 | 8 inference pods, 24 CPU, 96 GB RAM, 2 GPU |
| Phase 4 | 32+ | Q3 2026 | Shard by location, dedicated inference cluster |

### 12.6 Performance Benchmarks

The benchmark suite is executed monthly:

```bash
#!/bin/bash
# performance-benchmark.sh

API_URL="https://surveillance.company.com"
RESULTS_FILE="/var/log/benchmarks/$(date +%Y%m%d).json"

echo "{\"timestamp\": \"$(date -Iseconds)\"," > "$RESULTS_FILE"
echo "\"benchmarks\": {" >> "$RESULTS_FILE"

# 1. Health check latency
echo "  Running health check latency test..."
HEALTH_LAT=$(curl -o /dev/null -s -w "%{time_total}" "$API_URL/health")
echo "  \"health_check_latency_ms\": $(echo "$HEALTH_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 2. Deep health check latency
echo "  Running deep health check..."
DEEP_LAT=$(curl -o /dev/null -s -w "%{time_total}" "$API_URL/health/deep")
echo "  \"deep_health_latency_ms\": $(echo "$DEEP_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 3. API response time (events list)
echo "  Running API response time test..."
API_LAT=$(curl -o /dev/null -s -w "%{time_total}" \
  "$API_URL/api/v1/events?limit=100&start=$(date -d '1 hour ago' -Iseconds)")
echo "  \"api_events_latency_ms\": $(echo "$API_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 4. Database query performance
echo "  Running database query test..."
DB_LAT=$(curl -o /dev/null -s -w "%{time_total}" \
  "$API_URL/api/v1/admin/db-performance")
echo "  \"db_query_latency_ms\": $(echo "$DB_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 5. Stream status
echo "  Checking stream status..."
STREAMS=$(curl -s "$API_URL/api/v1/cameras/status" | jq '[.cameras[] | select(.status == "active")] | length')
echo "  \"active_streams\": $STREAMS," >> "$RESULTS_FILE"

# 6. Inference latency (from Prometheus)
# --data-urlencode handles the parentheses and commas in the PromQL query.
echo "  Fetching inference metrics..."
INF_LAT=$(curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.95,rate(surveillance_model_inference_ms_bucket[5m]))" | \
  jq -r '.data.result[0].value[1] // "null"')
echo "  \"inference_p95_latency_ms\": $INF_LAT" >> "$RESULTS_FILE"

echo "}}" >> "$RESULTS_FILE"

echo "Benchmark complete. Results saved to $RESULTS_FILE"
cat "$RESULTS_FILE"
```

Benchmark history tracking:

| Date | Health (ms) | Deep Health (ms) | API (ms) | Inference P95 (ms) | Streams Active |
|------|-------------|------------------|----------|--------------------|----------------|
| 2025-01-01 | 12 | 245 | 89 | 42 | 8/8 |
| 2025-01-08 | 11 | 238 | 92 | 45 | 8/8 |
| 2025-01-15 | 15 | 520 | 156 | 78 | 7/8 (cam_03 offline) |
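The history table only catches regressions if someone compares the rows; a week-over-week check is easy to automate. A minimal sketch (the `check_regression` helper and the 50% tolerance are illustrative, not an agreed SLO):

```shell
# check_regression <metric> <baseline_ms> <current_ms> <max_increase_pct>
# Returns non-zero when the current latency exceeds the baseline by
# more than the allowed percentage (integer math, milliseconds).
check_regression() {
  local metric="$1" baseline="$2" current="$3" max_pct="$4"
  local ceiling=$(( baseline + baseline * max_pct / 100 ))
  if [ "$current" -gt "$ceiling" ]; then
    echo "REGRESSION: $metric ${baseline}ms -> ${current}ms (limit ${ceiling}ms)"
    return 1
  fi
  echo "OK: $metric ${baseline}ms -> ${current}ms"
}

# Comparing the 2025-01-15 run against the 2025-01-08 baseline:
check_regression health       11  15 50
check_regression deep_health 238 520 50 || echo "-> open an incident for deep health"
```

Against the table above, `health` passes (15 ms is within 50% of 11 ms) while `deep_health` is flagged (520 ms against a 357 ms ceiling), which matches the cam_03 incident noted in that row.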

### 12.7 Resource Request & Provisioning Workflow

```
Requestor submits capacity request
        │
        ▼
┌───────────────┐
│ SRE Review    │ ← Assess impact, feasibility, alternatives
│ (2 biz days)  │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Approval      │ ← Engineering Manager + Finance (if >$X)
│ (1 biz day)   │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Implementation│ ← SRE executes change during maintenance window
│ (scheduled)   │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Validation    │ ← Verify performance meets requirements
│ (24-48 hours) │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Close Request │ ← Document in capacity ledger
└───────────────┘
```

## Appendices

### Appendix A: Contact Directory

| Role | Name | Email | Phone | Slack |
|------|------|-------|-------|-------|
| On-Call (rotating) | See PagerDuty | oncall@company.com | Via PagerDuty | #surveillance-oncall |
| SRE Team Lead | [Name] | sre-lead@company.com | +1-555-0100 | @sre-lead |
| Engineering Manager | [Name] | eng-mgr@company.com | +1-555-0101 | @eng-mgr |
| Security Officer | [Name] | security@company.com | +1-555-0104 | @security |
| Product Owner | [Name] | product@company.com | +1-555-0105 | @product |
| VP Engineering | [Name] | vp-eng@company.com | +1-555-0102 | @vp-eng |

### Appendix B: Tooling Inventory

| Category | Tool | Version | Purpose |
|----------|------|---------|---------|
| Monitoring | Prometheus | 2.47+ | Metrics collection |
| Monitoring | Grafana | 10.0+ | Visualization |
| Monitoring | Alertmanager | 0.26+ | Alert routing |
| Logging | Elasticsearch | 8.11+ | Log storage |
| Logging | Filebeat | 8.11+ | Log shipping |
| Logging | Kibana | 8.11+ | Log visualization |
| Orchestration | Kubernetes | 1.28+ | Container orchestration |
| Packaging | Helm | 3.13+ | K8s package management |
| IaC | Terraform | 1.6+ | Infrastructure provisioning |
| GitOps | ArgoCD | 2.9+ | Continuous deployment |
| Backup | pgBackRest | 2.48+ | PostgreSQL backup |
| Secrets | Vault / AWS Secrets Manager | Latest | Secret management |
| Paging | PagerDuty | SaaS | Incident paging |
| Communication | Slack | SaaS | Team communication |

### Appendix C: Network Architecture

```
Internet
    │
    ▼
┌─────────┐    ┌─────────────┐    ┌──────────────────┐
│   CDN   │───▶│  Nginx/ALB  │───▶│  API Gateway     │
│         │    │  (TLS term) │    │  (auth/rate-lim) │
└─────────┘    └─────────────┘    └────────┬─────────┘
                                           │
                    ┌──────────────────────┼──────────────────────┐
                    │                      │                      │
                    ▼                      ▼                      ▼
            ┌──────────┐         ┌──────────────┐      ┌──────────┐
            │ Surveil- │         │   WebSocket  │      │ Grafana  │
            │ lance    │         │   Service    │      │ /Kibana  │
            │ API      │         │              │      │          │
            └────┬─────┘         └──────────────┘      └──────────┘
                 │
        ┌────────┼────────┬──────────────┐
        │        │        │              │
        ▼        ▼        ▼              ▼
   ┌──────────┐ ┌─────┐ ┌──────────┐ ┌───────────┐
   │PostgreSQL│ │Redis│ │ S3/MinIO │ │Prometheus │
   └──────────┘ └─────┘ └──────────┘ └───────────┘

    VPN Tunnel
    ══════════
    ┌──────────────┐
    │  Edge Node   │◀── RTSP ──▶ [Cameras 1-8]
    │  (local proc)│
    └──────────────┘
```

### Appendix D: Document Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-01-15 | SRE Team | Initial comprehensive operations plan covering all 12 domains |

**END OF DOCUMENT**

*This is a living document; review and update it quarterly, and after any significant infrastructure change.*