Operations Plan

Deployment phases, operating model, and go-live execution.

AI Surveillance Platform — 24/7 Operations & Reliability Plan

Version: 1.0
Date: 2025-01-15
Classification: Internal — Operations & Engineering
System: 8-Channel AI Surveillance Platform (Cloud + Edge)
Target: Industrial-grade autonomous operations with minimal human intervention


Table of Contents

  1. Monitoring & Observability
  2. Logging Strategy
  3. Health Checks
  4. Service Restart & Recovery
  5. Backup Strategy
  6. Data Retention
  7. Storage Management
  8. Incident Response
  9. Upgrades & Maintenance
  10. Performance Optimization
  11. Disaster Recovery
  12. Capacity Planning

Document Control

Version Date Author Changes
1.0 2025-01-15 SRE Team Initial comprehensive operations plan

Approval

Role Name Date
Head of Engineering _____________ ____-__-__
Security Officer _____________ ____-__-__
Operations Lead _____________ ____-__-__

1. Monitoring & Observability

1.1 Overview

The monitoring stack provides real-time visibility into all platform components, enabling proactive issue detection and rapid incident response. All metrics are collected at 15-second intervals with 15-month retention.

Tooling Choice: Prometheus + Grafana (primary) with Alertmanager for notification routing.

Architecture:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Node      │     │  Prometheus │     │   Grafana   │
│  Exporter   │────▶│   Server    │────▶│  Dashboards │
│ (per host)  │     │  (TSDB)     │     │  (visualize)│
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────┴───────┐
                    │ Alertmanager │────▶ PagerDuty / OpsGenie / Slack
                    └──────────────┘

1.2 Metrics Collection

1.2.1 System Metrics (Node Exporter + cAdvisor)

Metric Category Specific Metrics Collection Interval Retention
CPU Usage % per core, load average (1m/5m/15m), steal time, iowait 15s 15 months
Memory Used/available/total, swap usage, OOM kills, page faults 15s 15 months
Disk Usage % per volume, IOPS, read/write latency, inode usage 15s 15 months
Network RX/TX bytes/packets/drops per interface, TCP connections, retransmits 15s 15 months
Containers CPU/memory per container, restart count, network IO per container 15s 15 months

Prometheus scrape configuration:

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s
    scrape_timeout: 10s

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  - job_name: 'surveillance-api'
    static_configs:
      - targets: ['surveillance-api:8080']
    scrape_interval: 15s
    metrics_path: /metrics

  - job_name: 'ai-inference'
    static_configs:
      - targets: ['ai-inference:8080']
    scrape_interval: 15s
    metrics_path: /metrics

  - job_name: 'video-processor'
    static_configs:
      - targets: ['video-processor:8080']
    scrape_interval: 15s
    metrics_path: /metrics

1.2.2 Application Metrics (Custom / OpenTelemetry)

Metric Name Type Description Labels
surveillance_fps_per_camera Gauge Current FPS being processed per camera camera_id, location
surveillance_detection_rate Gauge Detections per second per stream camera_id, model_version
surveillance_alert_rate Counter Total alerts generated severity, camera_id, alert_type
surveillance_pipeline_latency_ms Histogram End-to-end processing latency stage, camera_id
surveillance_frame_drop_rate Gauge Percentage of frames dropped camera_id, reason
surveillance_model_inference_ms Histogram AI model inference time model_name, batch_size
surveillance_stream_active Gauge Whether stream is active (1/0) camera_id, source
surveillance_face_recognition_matches Counter Face recognition hits/misses camera_id, match_type

Application instrumentation (Python example):

from prometheus_client import Counter, Histogram, Gauge, generate_latest
from functools import wraps
import time

# Define metrics
DETECTION_COUNTER = Counter(
    'surveillance_detections_total',
    'Total detections by type',
    ['camera_id', 'detection_type', 'model_version']
)

PIPELINE_LATENCY = Histogram(
    'surveillance_pipeline_latency_ms',
    'End-to-end pipeline latency in milliseconds',
    ['stage', 'camera_id'],
    buckets=[10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
)

CAMERA_FPS = Gauge(
    'surveillance_fps_per_camera',
    'Current FPS per camera stream',
    ['camera_id', 'location']
)

STREAM_ACTIVE = Gauge(
    'surveillance_stream_active',
    'Stream connectivity status',
    ['camera_id', 'source']
)

def track_latency(stage, camera_id):
    """Decorator to track function latency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000
                PIPELINE_LATENCY.labels(
                    stage=stage,
                    camera_id=camera_id
                ).observe(elapsed_ms)
        return wrapper
    return decorator

1.2.3 Business Metrics

Metric Name Type Business Purpose Alert Threshold
surveillance_persons_detected_daily Counter Daily person detection volume Anomaly detection
surveillance_unknown_persons Counter Unknown/alerted persons per period Trend analysis
surveillance_alerts_sent Counter Alerts successfully delivered Delivery health
surveillance_alerts_failed Counter Failed alert deliveries > 5 in 5 min = P2
surveillance_camera_uptime_pct Gauge Per-camera uptime percentage < 99% = P3
surveillance_detection_accuracy Gauge Model accuracy score < threshold = P2
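
The "Anomaly detection" threshold above can be made concrete as a PromQL expression. The following is a sketch assuming a trailing 7-day baseline and a 3-sigma band; both the window sizes and the band width are illustrative, not mandated by this plan:

```promql
# Sketch: flag when today's person-detection volume deviates more than
# 3 standard deviations from its trailing 7-day baseline
abs(
  sum(increase(surveillance_persons_detected_daily[1d]))
  -
  avg_over_time(sum(increase(surveillance_persons_detected_daily[1d]))[7d:1h])
)
> 3 * stddev_over_time(sum(increase(surveillance_persons_detected_daily[1d]))[7d:1h])
```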

1.2.4 Error Metrics

Metric Name Type Description Severity
surveillance_errors_total Counter Errors by type and service All
surveillance_stream_errors Counter Stream connection errors P2 if > 10/min
surveillance_model_errors Counter Model inference failures P1 if > 5/min
surveillance_db_errors Counter Database operation failures P1 if > 3/min
surveillance_storage_errors Counter Storage read/write failures P2 if > 5/min

1.3 Alerting Rules

1.3.1 Critical Alerts (P1) — Page Immediately

# /etc/prometheus/alerts/critical.yml
groups:
  - name: critical
    rules:
      - alert: AllStreamsDown
        expr: sum(surveillance_stream_active) == 0
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "ALL camera streams are down"
          description: "No active streams detected for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/all-streams-down"

      - alert: AIPipelineDown
        expr: rate(surveillance_detections_total[5m]) == 0
        for: 2m
        labels:
          severity: p1
        annotations:
          summary: "AI pipeline not producing detections"
          description: "Zero detections in the last 2 minutes across all streams"

      - alert: StorageFull
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "Storage critically low: {{ $labels.mountpoint }}"
          description: "Less than 5% storage remaining on {{ $labels.instance }}"

      - alert: DatabaseUnreachable
        expr: pg_up == 0
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "PostgreSQL database is unreachable"
          description: "Cannot connect to primary database"

      - alert: HighErrorRate
        expr: rate(surveillance_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: p1
        annotations:
          summary: "High error rate across services"
          description: "Error rate exceeds 10 errors per second"

1.3.2 High Severity Alerts (P2) — Page Within 1 Hour

# /etc/prometheus/alerts/high.yml
groups:
  - name: high
    rules:
      - alert: SingleCameraDown
        expr: surveillance_stream_active{camera_id=~"cam.*"} == 0
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Camera {{ $labels.camera_id }} is offline"
          description: "Camera stream has been down for more than 5 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.95,
          rate(surveillance_pipeline_latency_ms_bucket[5m])) > 2000
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Pipeline latency is high"
          description: "P95 latency exceeds 2000ms"

      - alert: ModelAccuracyDegraded
        expr: surveillance_detection_accuracy < 0.85
        for: 10m
        labels:
          severity: p2
        annotations:
          summary: "AI model accuracy degraded"
          description: "Detection accuracy below 85%"

      - alert: MemoryPressure
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: "Memory usage above 90%"

      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Disk space warning: {{ $labels.mountpoint }}"
          description: "Less than 15% disk space remaining"

1.3.3 Medium Severity Alerts (P3) — Respond Within 4 Hours

# /etc/prometheus/alerts/medium.yml
groups:
  - name: medium
    rules:
      - alert: CameraFPSLow
        expr: surveillance_fps_per_camera < 15
        for: 10m
        labels:
          severity: p3
        annotations:
          summary: "Camera {{ $labels.camera_id }} FPS below threshold"

      - alert: FrameDropsHigh
        expr: surveillance_frame_drop_rate > 0.10
        for: 10m
        labels:
          severity: p3
        annotations:
          summary: "High frame drop rate on {{ $labels.camera_id }}"

      - alert: CertificateExpiry
        expr: (ssl_certificate_expiry_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "TLS certificate expiring soon"

      - alert: BackupNotRun
        expr: time() - surveillance_last_backup_timestamp > 90000
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "Database backup has not run in 25+ hours"

1.3.4 Low Severity Alerts (P4) — Respond Within 24 Hours

# /etc/prometheus/alerts/low.yml
groups:
  - name: low
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 30m
        labels:
          severity: p4
        annotations:
          summary: "CPU usage high on {{ $labels.instance }}"

      - alert: ContainerRestartLoop
        expr: rate(container_restarts_total[15m]) > 3
        for: 15m
        labels:
          severity: p4
        annotations:
          summary: "Container restart loop detected"
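
These alert expressions can be unit-tested offline with `promtool test rules`. A minimal sketch for the AllStreamsDown rule follows; the test file name is an assumption, and the rule file path matches the layout above:

```yaml
# alerts_test.yml — run with: promtool test rules alerts_test.yml
rule_files:
  - /etc/prometheus/alerts/critical.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    # The only reporting stream stays at 0 for the whole window,
    # so sum(surveillance_stream_active) == 0
    input_series:
      - series: 'surveillance_stream_active{camera_id="cam_01"}'
        values: '0x20'
    alert_rule_test:
      - eval_time: 2m
        alertname: AllStreamsDown
        exp_alerts:
          - exp_labels:
              severity: p1
            exp_annotations:
              summary: "ALL camera streams are down"
              description: "No active streams detected for more than 1 minute"
              runbook_url: "https://wiki.internal/runbooks/all-streams-down"
```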

1.4 Alertmanager Configuration

# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@surveillance.company.com'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: '<SLACK_WEBHOOK_URL>'

# Inhibit alerts of lower severity when higher severity fires
inhibit_rules:
  - source_match:
      severity: 'p1'
    target_match:
      severity: 'p2'
    equal: ['alertname', 'instance']

route:
  receiver: 'default'
  group_by: ['alertname', 'severity', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # P1 alerts — page immediately, no grouping delay
    - match:
        severity: p1
      receiver: 'p1-critical'
      group_wait: 0s
      repeat_interval: 15m
      continue: true

    # P2 alerts — page within 1 hour
    - match:
        severity: p2
      receiver: 'p2-high'
      group_wait: 2m
      repeat_interval: 1h

    # P3 alerts — Slack + email only
    - match:
        severity: p3
      receiver: 'p3-medium'
      group_wait: 5m
      repeat_interval: 4h

    # P4 alerts — daily digest
    - match:
        severity: p4
      receiver: 'p4-low'
      group_wait: 10m
      repeat_interval: 24h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#surveillance-alerts'
        title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'p1-critical'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_ROUTING_KEY>'
        severity: critical
        description: '{{ .GroupLabels.alertname }}'
    slack_configs:
      - channel: '#surveillance-critical'
        send_resolved: true
        title: 'P1 CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    email_configs:
      - to: 'oncall@company.com'
        subject: '[P1 CRITICAL] Surveillance Platform Alert'

  - name: 'p2-high'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_ROUTING_KEY>'
        severity: error
    slack_configs:
      - channel: '#surveillance-alerts'
        send_resolved: true

  - name: 'p3-medium'
    slack_configs:
      - channel: '#surveillance-warnings'
        send_resolved: true

  - name: 'p4-low'
    email_configs:
      - to: 'ops-team@company.com'
        subject: '[P4 Low] Surveillance Platform — Daily Digest'

1.5 Grafana Dashboards

1.5.1 Dashboard: Infrastructure Overview (ID: infra-overview)

{
  "dashboard": {
    "title": "Infrastructure Overview",
    "tags": ["infrastructure", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage %",
        "type": "timeseries",
        "targets": [{
          "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{ instance }}"
        }],
        "alert": {
          "conditions": [{
            "evaluator": {"params": [85], "type": "gt"},
            "operator": {"type": "and"},
            "query": {"params": ["A", "5m", "now"]},
            "reducer": {"type": "avg"}
          }]
        }
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [{
          "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
          "legendFormat": "{{ instance }}"
        }]
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "targets": [{
          "expr": "100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)"
        }],
        "fieldConfig": {
          "max": 100,
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "yellow", "value": 70},
              {"color": "orange", "value": 85},
              {"color": "red", "value": 95}
            ]
          }
        }
      },
      {
        "title": "Network I/O",
        "type": "timeseries",
        "targets": [
          {"expr": "rate(node_network_receive_bytes_total[5m])", "legendFormat": "RX {{ device }}"},
          {"expr": "rate(node_network_transmit_bytes_total[5m])", "legendFormat": "TX {{ device }}"}
        ]
      },
      {
        "title": "Container Count",
        "type": "stat",
        "targets": [{
          "expr": "count(container_last_seen)"
        }]
      },
      {
        "title": "Container Restarts (15m)",
        "type": "stat",
        "targets": [{
          "expr": "increase(container_restarts_total[15m])"
        }],
        "fieldConfig": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "red", "value": 1}
            ]
          }
        }
      }
    ]
  }
}

1.5.2 Dashboard: Camera Health (ID: camera-health)

Panel Type Query / Data Source
Stream Status Grid Stat grid (8 panels) surveillance_stream_active{camera_id=~"cam.*"}
FPS per Camera Timeseries surveillance_fps_per_camera by camera_id
Frame Drop Rate Timeseries surveillance_frame_drop_rate by camera_id
Camera Uptime % Gauge per camera avg_over_time(surveillance_stream_active[24h]) * 100
Stream Error Count Bar chart increase(surveillance_stream_errors[1h]) by camera_id
Last Frame Timestamp Table Time since last frame per camera
Bitrate per Stream Timeseries surveillance_stream_bitrate_kbps

Camera Health Score Calculation:

# Overall camera health score (0-100); the weights sum to 100,
# and the FPS term assumes a 30 FPS target
(
  avg(surveillance_stream_active) * 50 +
  (1 - avg(surveillance_frame_drop_rate)) * 30 +
  (avg(surveillance_fps_per_camera) / 30) * 20
)
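
If the score is graphed frequently, it can be precomputed as a Prometheus recording rule rather than re-evaluated per dashboard refresh. A sketch follows; the rule file path and record name are assumptions:

```yaml
# /etc/prometheus/rules/camera-health.yml
groups:
  - name: camera-health
    rules:
      - record: surveillance:camera_health_score
        # Weights sum to 100, yielding a 0-100 score
        expr: >
          avg(surveillance_stream_active) * 50
          + (1 - avg(surveillance_frame_drop_rate)) * 30
          + (avg(surveillance_fps_per_camera) / 30) * 20
```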

1.5.3 Dashboard: AI Pipeline Performance (ID: ai-pipeline)

Panel Type Metric
Inference Latency (P50/P95/P99) Timeseries histogram_quantile(0.50/0.95/0.99, rate(surveillance_pipeline_latency_ms_bucket[5m]))
Detections per Second Timeseries rate(surveillance_detections_total[5m])
Model Accuracy Trend Timeseries surveillance_detection_accuracy
Pipeline Throughput Stat Total frames processed/minute
GPU Utilization (if applicable) Gauge nvidia_gpu_utilization_gpu
GPU Memory Usage Timeseries nvidia_gpu_memory_used_bytes
Model Load Status Table Current model version, load time, status
Batch Size Distribution Heatmap Inference batch sizes over time

1.5.4 Dashboard: Alert Delivery Stats (ID: alert-delivery)

Panel Type Query
Alerts Sent Today Stat increase(surveillance_alerts_sent[24h])
Alerts Failed Stat increase(surveillance_alerts_failed[24h])
Delivery Success Rate Gauge alerts_sent / (alerts_sent + alerts_failed)
Alerts by Severity Pie chart surveillance_alerts_sent by severity
Alerts by Camera Bar chart Top cameras by alert count
Notification Channel Status Table Channel health per delivery method
Alert Response Time Histogram Time from detection to notification

1.5.5 Dashboard: Storage Usage Trends (ID: storage-trends)

Panel Type Query
Total Storage Used Stat Sum of all storage volumes
Storage Growth Rate Timeseries Daily increase in bytes
Retention Policy Status Table Days remaining per retention tier
Media vs. Metadata Split Pie chart Storage breakdown by type
Projected Capacity Exhaustion Stat Days until full at current growth rate
Cleanup Job Status Table Last run, records cleaned, errors
Cross-Region Replication Lag Timeseries Replication delay in seconds

1.6 On-Call Rotation

Shift Time (UTC) Primary On-Call Secondary
APAC 00:00 — 08:00 APAC SRE Team EMEA Escalation
EMEA 08:00 — 16:00 EMEA SRE Team Americas Escalation
Americas 16:00 — 00:00 Americas SRE Team APAC Escalation

Escalation Policy (PagerDuty):

  1. Notification: Alert fires → Notify on-call engineer via PagerDuty push + SMS
  2. Acknowledge: 5-minute acknowledge window
  3. Escalation 1: No acknowledge → Escalate to team lead (15 min)
  4. Escalation 2: No response → Escalate to engineering manager (30 min)
  5. Escalation 3: No response → Escalate to VP Engineering (45 min)

2. Logging Strategy

2.1 Log Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│ Application │────▶│  Filebeat   │────▶│  Logstash   │────▶│ Elasticsearch│
│ (JSON logs) │     │  (shipper)  │     │ (processor) │     │   (store)    │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬───────┘
                                                                   │
                                                            ┌──────┴──────┐
                                                            │   Kibana    │
                                                            │ (visualize) │
                                                            └─────────────┘

2.2 Log Levels

Level Numeric Usage Retention Action
DEBUG 10 Detailed diagnostic info 7 days Development only
INFO 20 Normal operational events 90 days Standard operations
WARNING 30 Anomalous but non-critical conditions 90 days Monitor trends
ERROR 40 Operational failures, handled exceptions 1 year Alert if rate > threshold
CRITICAL 50 System-threatening failures 1 year Immediate P1 alert

Production default level: INFO (DEBUG only enabled per-request for troubleshooting)

2.3 Structured Logging Format

All application logs MUST be in JSON format:

{
  "timestamp": "2025-01-15T08:30:15.123456Z",
  "level": "ERROR",
  "logger": "surveillance.video_processor",
  "message": "Failed to connect to camera stream",
  "request_id": "req_abc123def456",
  "trace_id": "trace_789xyz",
  "service": "video-processor",
  "version": "2.3.1",
  "host": "edge-node-01",
  "environment": "production",
  "camera_id": "cam_03_entrance",
  "location": "main_entrance",
  "error": {
    "type": "ConnectionTimeout",
    "message": "Connection to rtsp://192.168.1.103:554/stream timed out after 10s",
    "retry_count": 3,
    "stack_trace": "..."
  },
  "context": {
    "stream_url": "rtsp://***.***.1.***:554/stream",
    "connection_duration_ms": 10000,
    "previous_disconnect": "2025-01-15T08:25:00Z"
  },
  "performance": {
    "processing_time_ms": 0.5,
    "memory_mb": 128.5
  }
}

Python logging configuration:

# logging_config.py
import logging
import os
from datetime import datetime
from pythonjsonlogger import jsonlogger

class StructuredLogFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super().add_fields(log_record, record, message_dict)
        log_record['timestamp'] = datetime.utcnow().isoformat() + 'Z'
        log_record['level'] = record.levelname
        log_record['logger'] = record.name
        log_record['service'] = os.environ.get('SERVICE_NAME', 'unknown')
        log_record['version'] = os.environ.get('SERVICE_VERSION', 'unknown')
        log_record['host'] = os.environ.get('HOSTNAME', 'unknown')
        log_record['environment'] = os.environ.get('ENV', 'production')

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'json': {
            '()': StructuredLogFormatter,
            'format': '%(timestamp)s %(level)s %(message)s'
        }
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'json',
            'stream': 'ext://sys.stdout'
        },
        'file': {
            'class': 'logging.handlers.RotatingFileHandler',
            'formatter': 'json',
            'filename': '/var/log/surveillance/app.log',
            'maxBytes': 104857600,  # 100 MB
            'backupCount': 10
        }
    },
    'loggers': {
        'surveillance': {
            'level': os.environ.get('LOG_LEVEL', 'INFO'),
            'handlers': ['console', 'file'],
            'propagate': False
        }
    }
}
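
Applying the config is a one-liner at service startup (`logging.config.dictConfig`). The sketch below uses a simplified console-only variant so it runs without the pythonjsonlogger dependency; `SimpleJsonFormatter` is an illustrative stand-in, not part of the production config:

```python
import json
import logging
import logging.config
from datetime import datetime, timezone

class SimpleJsonFormatter(logging.Formatter):
    """Stdlib-only stand-in for the JSON formatter above."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

DEMO_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": SimpleJsonFormatter}},
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "json",
            "stream": "ext://sys.stdout",
        }
    },
    "loggers": {
        "surveillance": {"level": "INFO", "handlers": ["console"], "propagate": False}
    },
}

# Apply the config once at startup; all "surveillance.*" loggers inherit it
logging.config.dictConfig(DEMO_CONFIG)
logging.getLogger("surveillance").info("pipeline started")
```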

2.4 Log Correlation

Every request receives a unique request_id and trace_id:

import uuid
import contextvars

# Context variable for request-scoped tracing
request_id_var = contextvars.ContextVar('request_id', default=None)
trace_id_var = contextvars.ContextVar('trace_id', default=None)

def get_current_request_id() -> str:
    req_id = request_id_var.get()
    if req_id is None:
        req_id = f"req_{uuid.uuid4().hex[:16]}"
        request_id_var.set(req_id)
    return req_id

def get_current_trace_id() -> str:
    trace_id = trace_id_var.get()
    if trace_id is None:
        trace_id = f"trace_{uuid.uuid4().hex[:16]}"
        trace_id_var.set(trace_id)
    return trace_id

Propagation across services:

  • HTTP: X-Request-ID and X-Trace-ID headers
  • Message queue: Metadata fields in message envelope
  • gRPC: Custom metadata
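
The HTTP propagation rule can be sketched as a pair of inject/extract helpers around the context variables defined above (reproduced here so the example is self-contained; the function names `inject_headers`/`extract_headers` are illustrative):

```python
import contextvars
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def get_current_request_id() -> str:
    req_id = request_id_var.get()
    if req_id is None:
        req_id = f"req_{uuid.uuid4().hex[:16]}"
        request_id_var.set(req_id)
    return req_id

def get_current_trace_id() -> str:
    trace_id = trace_id_var.get()
    if trace_id is None:
        trace_id = f"trace_{uuid.uuid4().hex[:16]}"
        trace_id_var.set(trace_id)
    return trace_id

def inject_headers(headers: dict) -> dict:
    """Attach correlation IDs to an outgoing HTTP request."""
    headers["X-Request-ID"] = get_current_request_id()
    headers["X-Trace-ID"] = get_current_trace_id()
    return headers

def extract_headers(headers: dict) -> None:
    """Adopt correlation IDs from an incoming request, or mint new ones."""
    request_id_var.set(headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex[:16]}")
    trace_id_var.set(headers.get("X-Trace-ID") or f"trace_{uuid.uuid4().hex[:16]}")
```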

2.5 Log Retention Policy

Log Category Retention Storage Class Compression
Application logs (INFO+) 90 days Hot (SSD) 30d → Warm 60d After 7 days
Error logs (ERROR+) 1 year Warm 90d → Cold 275d After 30 days
Audit logs 1 year Hot 90d → Warm 180d → Cold 95d After 90 days
Debug logs 7 days Hot only None
Access logs 90 days Warm 30d → Cold 60d After 30 days
System logs (syslog/journald) 90 days Warm After 7 days

Elasticsearch Index Lifecycle Management (ILM):

PUT _ilm/policy/surveillance-logs
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d",
            "max_docs": 100000000
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "allocate": {
            "require": { "data": "warm" }
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": { "data": "cold" }
          },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
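
The ILM policy only takes effect once it is attached to indices. A sketch of the matching index template follows; the index pattern and rollover alias names are assumptions:

```json
PUT _index_template/surveillance-logs
{
  "index_patterns": ["surveillance-logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "surveillance-logs",
      "index.lifecycle.rollover_alias": "surveillance-logs"
    }
  }
}
```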

2.6 Sensitive Data Handling

NEVER log:

  • Face embeddings or biometric data
  • Full-resolution images of detected persons
  • PII (names, employee IDs, phone numbers)
  • Credentials, API keys, tokens, passwords
  • Stream URLs with embedded credentials
  • Internal network topology
  • VPN configuration details

Sanitization rules:

import re

SENSITIVE_PATTERNS = [
    (r'rtsp://[^:]+:[^@]+@', 'rtsp://***:***@'),
    (r'password[=:]\s*\S+', 'password=***'),
    (r'api[_-]?key[=:]\s*\S+', 'api_key=***'),
    (r'token[=:]\s*\S+', 'token=***'),
    (r'embedding[=:]\s*\[.*?\]', 'embedding=[REDACTED]'),
    (r'face[_-]?vector[=:]\s*\[.*?\]', 'face_vector=[REDACTED]'),
]

def sanitize_log_message(message: str) -> str:
    for pattern, replacement in SENSITIVE_PATTERNS:
        message = re.sub(pattern, replacement, message, flags=re.IGNORECASE)
    return message
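
Exercising the rules above (patterns reproduced so the snippet stands alone):

```python
import re

SENSITIVE_PATTERNS = [
    (r'rtsp://[^:]+:[^@]+@', 'rtsp://***:***@'),
    (r'password[=:]\s*\S+', 'password=***'),
    (r'api[_-]?key[=:]\s*\S+', 'api_key=***'),
    (r'token[=:]\s*\S+', 'token=***'),
]

def sanitize_log_message(message: str) -> str:
    for pattern, replacement in SENSITIVE_PATTERNS:
        message = re.sub(pattern, replacement, message, flags=re.IGNORECASE)
    return message

# A stream URL with embedded credentials is masked before logging
print(sanitize_log_message("connect failed: rtsp://admin:hunter2@192.168.1.103:554/stream"))
# → connect failed: rtsp://***:***@192.168.1.103:554/stream
```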

3. Health Checks

3.1 Health Check Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Health Check Endpoints                    │
│                                                              │
│  /health        → Liveness probe (Kubernetes/Docker)         │
│  /health/ready  → Readiness probe (accepting traffic)        │
│  /health/deep   → Deep health (full pipeline validation)     │
└──────────────────────────────────────────────────────────────┘

3.2 Endpoint Specifications

3.2.1 Liveness Probe — GET /health

Purpose: Determine if the process is running and not deadlocked.

Response:

{
  "status": "alive",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-api",
  "version": "2.3.1",
  "uptime_seconds": 86400
}

Criteria:

  • Process is running
  • Main thread is not blocked
  • Returns HTTP 200 within 1 second

Failure action: Container orchestrator restarts the container.

Configuration:

# Kubernetes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

# Docker Compose
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 10s
  timeout: 3s
  retries: 3
  start_period: 30s

3.2.2 Readiness Probe — GET /health/ready

Purpose: Determine if the service is ready to accept traffic.

Response:

{
  "status": "ready",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-api",
  "version": "2.3.1",
  "checks": {
    "database": {
      "status": "pass",
      "response_time_ms": 12,
      "message": "Connected to PostgreSQL primary"
    },
    "object_storage": {
      "status": "pass",
      "response_time_ms": 45,
      "message": "S3 bucket accessible"
    },
    "cache": {
      "status": "pass",
      "response_time_ms": 2,
      "message": "Redis connection OK"
    }
  }
}

Criteria:

  • All required dependencies reachable
  • Database connection pool has available connections
  • Object storage accessible
  • Cache layer accessible
  • AI model loaded (for inference services)

Failure response: HTTP 503 with details

{
  "status": "not_ready",
  "timestamp": "2025-01-15T08:30:15Z",
  "checks": {
    "database": {
      "status": "fail",
      "response_time_ms": 5000,
      "message": "Connection timeout after 5000ms"
    },
    "object_storage": { "status": "pass" },
    "cache": { "status": "pass" }
  }
}

Configuration:

# Kubernetes
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 2

3.2.3 Deep Health Check — GET /health/deep

Purpose: Validate the entire processing pipeline end-to-end.

Response:

{
  "status": "healthy",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-platform",
  "version": "2.3.1",
  "checks": {
    "database": {
      "status": "pass",
      "response_time_ms": 8,
      "details": {
        "connection": "ok",
        "query_execution": "ok",
        "replication_lag_seconds": 0
      }
    },
    "object_storage": {
      "status": "pass",
      "response_time_ms": 67,
      "details": {
        "read_test": "ok",
        "write_test": "ok",
        "list_test": "ok"
      }
    },
    "ai_model": {
      "status": "pass",
      "response_time_ms": 145,
      "details": {
        "model_loaded": true,
        "model_version": "face-detection-v2.1",
        "gpu_available": true,
        "test_inference": "ok"
      }
    },
    "streams": {
      "status": "pass",
      "details": {
        "active_streams": 8,
        "expected_streams": 8,
        "streams": [
          {"camera_id": "cam_01", "fps": 30, "status": "active"},
          {"camera_id": "cam_02", "fps": 30, "status": "active"},
          {"camera_id": "cam_03", "fps": 25, "status": "active"},
          {"camera_id": "cam_04", "fps": 30, "status": "active"},
          {"camera_id": "cam_05", "fps": 30, "status": "active"},
          {"camera_id": "cam_06", "fps": 28, "status": "active"},
          {"camera_id": "cam_07", "fps": 30, "status": "active"},
          {"camera_id": "cam_08", "fps": 30, "status": "active"}
        ]
      }
    },
    "cache": {
      "status": "pass",
      "response_time_ms": 1,
      "details": {
        "set_test": "ok",
        "get_test": "ok",
        "memory_usage_pct": 45
      }
    },
    "alert_delivery": {
      "status": "pass",
      "details": {
        "channels_tested": 3,
        "success": 3
      }
    },
    "pipeline_e2e": {
      "status": "pass",
      "response_time_ms": 523,
      "details": {
        "capture": "ok",
        "inference": "ok",
        "alert_generation": "ok",
        "storage": "ok"
      }
    }
  }
}

Execution:

  • Triggered manually or by monitoring every 5 minutes
  • NOT used for Kubernetes probes (too slow)
  • Full pipeline validation takes 1-5 seconds

3.3 Dependency Health Check Matrix

Dependency Check Method Timeout Expected Result Failure Action
PostgreSQL SELECT 1 3s Row returned Return not_ready
Redis Cache PING 2s PONG received Degrade to DB only
S3 / Object Storage List + Put + Get test object 10s All operations succeed Queue for retry
AI Model Load model + test inference 30s Inference completes Report model error
Camera Streams RTSP describe/ping 10s Stream metadata received Mark stream offline
VPN Tunnel ICMP to edge gateway 5s Response received Mark edge offline
SMTP/Notification TCP connect + EHLO 5s SMTP greeting received Queue alerts
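
The SMTP/notification row, for instance, reduces to a timeout-bounded TCP connect. A sketch using asyncio follows; the function name is an assumption, and the default timeout matches the matrix:

```python
import asyncio
import time

async def check_tcp_dependency(host: str, port: int, timeout: float = 5.0) -> dict:
    """Timeout-bounded TCP connect, as used for the SMTP/notification check."""
    start = time.monotonic()
    try:
        _, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout
        )
        # Connection established — close immediately; we only probe reachability
        writer.close()
        await writer.wait_closed()
        return {
            "status": "pass",
            "response_time_ms": round((time.monotonic() - start) * 1000, 2),
        }
    except (asyncio.TimeoutError, OSError) as exc:
        return {"status": "fail", "message": str(exc)}
```

On failure the caller would mark the dependency degraded and queue outbound alerts for retry, per the matrix.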

3.4 Health Check Implementation

# health.py
from enum import Enum
from dataclasses import dataclass, field
from typing import Dict, List
import time
import asyncio

class HealthStatus(Enum):
    PASS = "pass"
    FAIL = "fail"
    WARN = "warn"

@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    response_time_ms: float
    message: str
    details: Dict = field(default_factory=dict)

class HealthChecker:
    def __init__(self):
        self.checks = {}
    
    def register(self, name: str, check_func):
        self.checks[name] = check_func
    
    async def run_all(self, timeout: float = 30.0) -> List[HealthCheckResult]:
        tasks = [
            self._run_check(name, func, timeout)
            for name, func in self.checks.items()
        ]
        return await asyncio.gather(*tasks)
    
    async def _run_check(self, name: str, func, timeout: float) -> HealthCheckResult:
        start = time.monotonic()
        try:
            result = await asyncio.wait_for(func(), timeout=timeout)
            elapsed = (time.monotonic() - start) * 1000
            result.response_time_ms = round(elapsed, 2)
            return result
        except asyncio.TimeoutError:
            return HealthCheckResult(
                name=name,
                status=HealthStatus.FAIL,
                response_time_ms=timeout * 1000,
                message=f"Health check timed out after {timeout}s"
            )
        except Exception as e:
            elapsed = (time.monotonic() - start) * 1000
            return HealthCheckResult(
                name=name,
                status=HealthStatus.FAIL,
                response_time_ms=round(elapsed, 2),
                message=str(e)
            )

# Usage
health_checker = HealthChecker()

# Register checks
health_checker.register("database", check_database)
health_checker.register("object_storage", check_object_storage)
health_checker.register("ai_model", check_ai_model)
health_checker.register("streams", check_all_streams)
health_checker.register("cache", check_cache)

# FastAPI endpoints
from datetime import datetime
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def liveness():
    return {"status": "alive", "timestamp": datetime.utcnow().isoformat()}

@app.get("/health/ready")
async def readiness():
    results = await health_checker.run_all(timeout=5.0)
    all_pass = all(r.status == HealthStatus.PASS for r in results)
    
    status_code = 200 if all_pass else 503
    status = "ready" if all_pass else "not_ready"
    
    return JSONResponse(
        status_code=status_code,
        content={
            "status": status,
            "timestamp": datetime.utcnow().isoformat(),
            "checks": {
                r.name: {
                    "status": r.status.value,
                    "response_time_ms": r.response_time_ms,
                    "message": r.message,
                    **r.details
                }
                for r in results
            }
        }
    )

@app.get("/health/deep")
async def deep_health():
    # Runs the full pipeline check with a longer budget (includes pipeline_e2e)
    results = await health_checker.run_all(timeout=30.0)
    # ... similar to readiness but with pipeline_e2e

4. Service Restart & Recovery

4.1 Service Startup Sequence

Services must start in strict dependency order; Docker Compose `depends_on` conditions or Kubernetes init containers enforce the ordering.

Phase 1: Infrastructure
  ├─ PostgreSQL (primary + replica)
  ├─ Redis Cache
  └─ MinIO / S3 Object Storage

Phase 2: Core Services
  ├─ Message Queue (RabbitMQ / NATS)
  ├─ Configuration Service
  └─ Identity/Auth Service

Phase 3: AI Pipeline
  ├─ Model Service (download & load models)
  ├─ Video Capture Service (connect to cameras)
  ├─ AI Inference Service
  └─ Post-Processing Service

Phase 4: Application Layer
  ├─ API Gateway
  ├─ Surveillance API Service
  ├─ Alert Service
  └─ WebSocket / Real-time Service

Phase 5: Frontend
  ├─ Nginx / Reverse Proxy
  └─ Web Dashboard

Docker Compose startup configuration:

# docker-compose.yml (relevant section)
services:
  postgres:
    image: postgres:15.4@sha256:abc123...
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U surveillance"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7.2@sha256:def456...
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    depends_on:
      postgres:
        condition: service_healthy

  model-service:
    image: surveillance/model-service:2.3.1@sha256:ghi789...
    restart: unless-stopped
    environment:
      - MODEL_PATH=/models
      - DOWNLOAD_IF_MISSING=true
    volumes:
      - model-cache:/models
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 30s
      retries: 10
      start_period: 60s

  video-capture:
    image: surveillance/capture:2.3.1@sha256:jkl012...
    restart: unless-stopped
    depends_on:
      model-service:
        condition: service_healthy
    environment:
      - STREAM_RETRY_MAX=10
      - STREAM_RETRY_DELAY=5
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  ai-inference:
    image: surveillance/inference:2.3.1@sha256:mno345...
    restart: unless-stopped
    depends_on:
      video-capture:
        condition: service_healthy
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 120s

  surveillance-api:
    image: surveillance/api:2.3.1@sha256:pqr678...
    restart: unless-stopped
    depends_on:
      ai-inference:
        condition: service_healthy
    environment:
      - DATABASE_URL=postgresql://...@postgres/surveillance
      - REDIS_URL=redis://redis:6379
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s

  nginx:
    image: nginx:alpine@sha256:stu901...
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      surveillance-api:
        condition: service_healthy

4.2 Graceful Shutdown Procedure

All services must handle SIGTERM for graceful shutdown:

# shutdown_handler.py
import asyncio
import signal
import logging

logger = logging.getLogger(__name__)

class GracefulShutdown:
    def __init__(self, shutdown_timeout: float = 30.0):
        self.shutdown_timeout = shutdown_timeout
        self._shutdown_event = asyncio.Event()
        self._tasks = []
    
    def register_task(self, task):
        self._tasks.append(task)
    
    async def wait_for_shutdown(self):
        await self._shutdown_event.wait()
    
    def trigger_shutdown(self):
        logger.info("Shutdown signal received, initiating graceful shutdown...")
        self._shutdown_event.set()
    
    async def shutdown(self):
        """Execute graceful shutdown sequence."""
        logger.info("Starting graceful shutdown sequence...")
        
        # 1. Stop accepting new requests/connections
        logger.info("1. Stopping request acceptance")
        await self._stop_accepting_requests()
        
        # 2. Wait for in-flight requests to complete
        logger.info("2. Waiting for in-flight requests (timeout: %.0fs)", 
                     self.shutdown_timeout)
        try:
            await asyncio.wait_for(
                self._wait_inflight_requests(),
                timeout=self.shutdown_timeout * 0.6
            )
        except asyncio.TimeoutError:
            logger.warning("In-flight requests did not complete in time")
        
        # 3. Flush buffers and complete pending writes
        logger.info("3. Flushing buffers")
        await self._flush_buffers()
        
        # 4. Close camera streams gracefully
        logger.info("4. Closing camera streams")
        await self._close_streams()
        
        # 5. Release resources
        logger.info("5. Releasing resources")
        await self._release_resources()
        
        # 6. Close database connections
        logger.info("6. Closing database connections")
        await self._close_database_connections()
        
        logger.info("Graceful shutdown complete")
    
    async def _stop_accepting_requests(self):
        # Mark service as not ready
        pass
    
    async def _wait_inflight_requests(self):
        # Wait for active request count to reach zero
        pass
    
    async def _flush_buffers(self):
        # Flush any pending log buffers, metric batches
        pass
    
    async def _close_streams(self):
        # Send RTSP TEARDOWN, release capture resources
        pass
    
    async def _release_resources(self):
        # Release GPU memory, file handles
        pass
    
    async def _close_database_connections(self):
        # Return connections to pool, close pool
        pass

def setup_signal_handlers(shutdown_manager: GracefulShutdown):
    # Call this from within the running event loop (e.g. at service startup)
    loop = asyncio.get_running_loop()
    
    def handle_signal(sig):
        logger.info("Received signal %s", sig.name)
        shutdown_manager.trigger_shutdown()
        asyncio.create_task(shutdown_manager.shutdown())
    
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, lambda s=sig: handle_signal(s))
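A variant of this wiring keeps the signal handler even thinner: it only sets an event, and the shutdown sequence runs in the main coroutine rather than in a fire-and-forget task. A self-contained sketch (the six-step sequence would run where the comment marks it):

```python
import asyncio
import signal

async def main() -> str:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Handlers only flip the event; no async work happens in the handler itself
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    await stop.wait()
    # ... run the graceful shutdown sequence here (steps 1-6 above) ...
    return "shutdown complete"
```

Because the shutdown runs inside `main()`, the process cannot exit before the sequence finishes, which avoids the race inherent in `asyncio.create_task` from a handler.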

Kubernetes graceful termination:

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: surveillance-api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5 && curl -X POST localhost:8080/shutdown"]

4.3 Crash Recovery & Automatic Restart

| Scenario | Detection | Automatic Action | Manual Intervention |
|---|---|---|---|
| Container exits non-zero | Docker/K8s | Restart with exponential backoff (max 5 min) | If > 5 restarts in 10 min |
| OOM killed | Kernel event | Restart with 25% memory increase (max 3x) | Review memory limits |
| Health check fails | Probe failure | Restart container | If restart loop persists |
| Node failure | Node not ready | Reschedule to healthy node | Investigate failed node |
| Camera stream disconnect | No frames received | Retry with exponential backoff | If > 30 min offline |
| AI model load failure | Inference timeout | Reload model from backup | If model corrupted |
| Database connection lost | Query timeout | Retry connection, use replica | If primary down > 5 min |

Exponential backoff for stream reconnection:

import asyncio
import logging
import random

logger = logging.getLogger(__name__)

async def reconnect_stream(camera_id: str, max_retries: int = 100):
    base_delay = 5  # seconds
    max_delay = 300  # 5 minutes
    
    for attempt in range(1, max_retries + 1):
        delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
        jitter = random.uniform(0, delay * 0.1)
        wait_time = delay + jitter
        
        logger.info("Camera %s: Reconnect attempt %d/%d in %.1fs",
                    camera_id, attempt, max_retries, wait_time)
        await asyncio.sleep(wait_time)
        
        try:
            stream = await connect_stream(camera_id)
            logger.info("Camera %s: Reconnected successfully", camera_id)
            return stream
        except Exception as e:
            logger.warning("Camera %s: Reconnect failed: %s", camera_id, e)
    
    logger.error("Camera %s: Max retries exceeded, stream marked offline", camera_id)
    return None
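Ignoring jitter, the delay schedule produced by the defaults above doubles each attempt and caps out at the 5-minute ceiling from the seventh attempt onward:

```python
def backoff_schedule(attempts, base=5, cap=300):
    """Delay (seconds) before each reconnect attempt, before jitter is added."""
    return [min(base * 2 ** (n - 1), cap) for n in range(1, attempts + 1)]

print(backoff_schedule(8))  # [5, 10, 20, 40, 80, 160, 300, 300]
```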

4.4 Circuit Breaker Pattern

Protect against cascading failures when dependencies are down:

# circuit_breaker.py
from enum import Enum
import asyncio
import time
from dataclasses import dataclass

class CircuitState(Enum):
    CLOSED = "closed"       # Normal operation
    OPEN = "open"          # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 3
    success_threshold: int = 2

class CircuitBreaker:
    def __init__(self, name: str, config: CircuitBreakerConfig = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0
        self.half_open_calls = 0
        self._lock = asyncio.Lock()
    
    async def call(self, func, *args, **kwargs):
        async with self._lock:
            await self._transition_state()
            
            if self.state == CircuitState.OPEN:
                raise CircuitBreakerOpen(
                    f"Circuit breaker '{self.name}' is OPEN"
                )
            
            if self.state == CircuitState.HALF_OPEN:
                if self.half_open_calls >= self.config.half_open_max_calls:
                    raise CircuitBreakerOpen(
                        f"Circuit '{self.name}' half-open limit reached"
                    )
                self.half_open_calls += 1
        
        # Execute outside lock
        try:
            result = await func(*args, **kwargs)
            await self._on_success()
            return result
        except Exception:
            await self._on_failure()
            raise
    
    async def _transition_state(self):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.config.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                self.success_count = 0
    
    async def _on_success(self):
        async with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.config.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0
    
    async def _on_failure(self):
        async with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.config.failure_threshold:
                self.state = CircuitState.OPEN

class CircuitBreakerOpen(Exception):
    pass

Usage:

# Create breakers for each dependency
db_breaker = CircuitBreaker("database", CircuitBreakerConfig(
    failure_threshold=3,
    recovery_timeout=30.0
))

storage_breaker = CircuitBreaker("object_storage", CircuitBreakerConfig(
    failure_threshold=5,
    recovery_timeout=60.0
))

# Use in service calls
async def save_detection(detection):
    return await db_breaker.call(
        db_repository.save_detection, detection
    )

async def store_frame(frame):
    return await storage_breaker.call(
        s3_client.upload, frame
    )

4.5 Bulkhead Pattern — Resource Isolation

Isolate resources to prevent one failing component from consuming all resources:

# bulkhead.py
import asyncio
from asyncio import Semaphore

class Bulkhead:
    """Limits concurrent operations per service/camera."""
    
    def __init__(self, name: str, max_concurrent: int, max_queue: int = 100):
        self.name = name
        self.semaphore = Semaphore(max_concurrent)
        self.max_queue = max_queue
        self.queue_size = 0
        self._lock = asyncio.Lock()
    
    async def execute(self, func, *args, **kwargs):
        async with self._lock:
            if self.queue_size >= self.max_queue:
                raise BulkheadFull(
                    f"Bulkhead '{self.name}' queue full ({self.max_queue})"
                )
            self.queue_size += 1
        
        try:
            async with self.semaphore:
                return await func(*args, **kwargs)
        finally:
            async with self._lock:
                self.queue_size -= 1

class BulkheadFull(Exception):
    pass

# Per-camera bulkheads to isolate failures
camera_bulkheads = {
    f"cam_{i:02d}": Bulkhead(f"cam_{i:02d}", max_concurrent=4)
    for i in range(1, 9)
}

# Per-service bulkheads
db_bulkhead = Bulkhead("database", max_concurrent=20)
storage_bulkhead = Bulkhead("storage", max_concurrent=10)
inference_bulkhead = Bulkhead("inference", max_concurrent=8)
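The semaphore is what enforces the isolation. A standalone sketch confirming that peak concurrency never exceeds the configured limit (4, matching a per-camera bulkhead above):

```python
import asyncio

async def tracked_op(sem: asyncio.Semaphore, state: dict):
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated inference/storage call
        state["active"] -= 1

async def main() -> int:
    sem = asyncio.Semaphore(4)
    state = {"active": 0, "peak": 0}
    # 20 operations contend for 4 slots; the rest queue on the semaphore
    await asyncio.gather(*(tracked_op(sem, state) for _ in range(20)))
    return state["peak"]

print(asyncio.run(main()))  # 4
```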

4.6 Recovery State Persistence

Critical state is persisted to survive restarts:

| State Type | Storage | Recovery Action |
|---|---|---|
| Camera configurations | PostgreSQL | Reload on startup |
| Alert rules | PostgreSQL | Reload on startup |
| Processing offsets | Redis | Resume from last offset |
| In-flight detections | Redis → PostgreSQL | Replay from queue |
| Model version | Object Storage | Load specified version |
| Stream connection state | Local file | Attempt reconnection |
| Audit log buffer | Local file → Async flush | Recover unflushed entries |
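The processing-offset row is the simplest of these to sketch. With an in-memory dict standing in for the Redis hash (production code would use `HSET`/`HGET` against the real store), resume-on-restart reduces to:

```python
store: dict = {}  # stand-in for a Redis hash: key -> last committed offset

def commit_offset(camera_id: str, offset: int) -> None:
    """Persist the last fully processed frame offset for a camera."""
    store[f"offset:{camera_id}"] = offset

def resume_offset(camera_id: str) -> int:
    """On startup, resume from the last committed offset (0 on first run)."""
    return store.get(f"offset:{camera_id}", 0)

commit_offset("cam_01", 1500)
# after a restart, processing picks up where it left off:
print(resume_offset("cam_01"), resume_offset("cam_02"))  # 1500 0
```

The important property is that the offset is committed only after the frame's detections are durably stored, so replay after a crash re-processes at most the uncommitted tail.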

5. Backup Strategy

5.1 Backup Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        BACKUP PIPELINE                          │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │   PostgreSQL │───▶│   pgBackRest │───▶│  S3 (Primary)    │  │
│  │   (Primary)  │    │  (Full/Incr) │    │  us-east-1       │  │
│  └──────────────┘    └──────────────┘    └────────┬─────────┘  │
│                                                     │           │
│                              ┌──────────────────────┘           │
│                              │                                  │
│                              ▼                                  │
│                    ┌──────────────────┐                         │
│                    │  S3 (Secondary)  │   Cross-region          │
│                    │  us-west-2       │   replication           │
│                    └──────────────────┘                         │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │   Object     │───▶│   S3 Cross   │───▶│  Glacier Deep    │  │
│  │   Storage    │    │   Region     │    │  Archive         │  │
│  │   Bucket     │    │   Replication│    │  (7-year)        │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │ Infrastructure│───▶│    Git       │───▶│  Encrypted Git   │  │
│  │   Config     │    │   Repository │    │  Backups         │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

5.2 PostgreSQL Backup (pgBackRest)

Tool: pgBackRest 2.48+ with S3 integration

Backup Schedule:

| Backup Type | Frequency | Start Time (UTC) | Retention |
|---|---|---|---|
| Full backup | Weekly | Sunday 02:00 | 12 weeks |
| Differential | Daily (Mon-Sat) | 02:00 | 30 days |
| WAL archiving | Continuous | Real-time | 30 days |
| Manual backup | On-demand | Any | 90 days |

pgBackRest configuration:

# /etc/pgbackrest/pgbackrest.conf
[surveillance]
pg1-path=/var/lib/postgresql/15/main
pg1-port=5432

[global]
repo1-type=s3
repo1-s3-region=us-east-1
repo1-s3-bucket=surveillance-db-backups
repo1-s3-key=<ACCESS_KEY>
repo1-s3-key-secret=<SECRET_KEY>
repo1-s3-endpoint=s3.amazonaws.com
repo1-path=/pgbackrest
repo1-retention-full=12
repo1-retention-diff=30
repo1-retention-archive=30

# Encryption
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<STRONG_PASSPHRASE>

# Performance
process-max=4
compress-type=zst
compress-level=6

# Logging
log-level-file=detail
log-path=/var/log/pgbackrest

# Notifications
exec-start=/usr/local/bin/pgbackrest-notify.sh

Backup cron schedule:

# /etc/cron.d/pgbackrest
# Full backup every Sunday at 2 AM UTC
0 2 * * 0 postgres /usr/bin/pgbackrest --stanza=surveillance backup --type=full

# Differential backup daily at 2 AM UTC (Mon-Sat)
0 2 * * 1-6 postgres /usr/bin/pgbackrest --stanza=surveillance backup --type=diff

# Verify latest backup at 6 AM UTC daily
0 6 * * * postgres /usr/bin/pgbackrest --stanza=surveillance verify

WAL archiving configuration (postgresql.conf):

wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=surveillance archive-push %p'
max_wal_senders = 3
wal_keep_size = 1GB

5.3 Backup Retention Schedule

Timeline:

Day 1-30:    Daily backups available (full + diffs)
Week 1-12:   Weekly full backups
Month 1-12:  Monthly full backups (last Sunday of each month)
Year 1-7:    Annual snapshot in Glacier Deep Archive

| Tier | Frequency | Copies Kept | Storage Class | Location |
|---|---|---|---|---|
| Daily (hot) | Every 24h | 30 | S3 Standard | Primary region |
| Weekly (warm) | Every Sunday | 12 | S3 Standard-IA | Primary region |
| Monthly (cold) | Last Sunday | 12 | S3 Glacier Flexible | Primary region |
| Annual (archive) | Year-end | 7 | S3 Glacier Deep Archive | Cross-region |
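Summing the tiers gives the number of restorable copies held at steady state, a quick sanity check when sizing the backup bucket:

```python
# Copies kept per tier, per the retention schedule above
tiers = {"daily": 30, "weekly": 12, "monthly": 12, "annual": 7}

total_copies = sum(tiers.values())
print(total_copies)  # 61
```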

5.4 Object Storage Backup

Cross-region replication:

// S3 bucket replication configuration
{
  "Role": "arn:aws:iam::ACCOUNT:role/S3ReplicationRole",
  "Rules": [
    {
      "ID": "surveillance-media-replication",
      "Status": "Enabled",
      "Priority": 1,
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Filter": {
        "And": {
          "Prefix": "media/",
          "Tag": {
            "Key": "replicate",
            "Value": "true"
          }
        }
      },
      "Destination": {
        "Bucket": "arn:aws:s3:::surveillance-media-backup-west",
        "StorageClass": "STANDARD_IA",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": { "Minutes": 15 }
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": { "Minutes": 15 }
        },
        "EncryptionConfiguration": {
          "ReplicaKmsKeyID": "arn:aws:kms:us-west-2:ACCOUNT:key/KEY-ID"
        }
      },
      "SourceSelectionCriteria": {
        "SseKmsEncryptedObjects": { "Status": "Enabled" }
      }
    }
  ]
}

Lifecycle policy for media storage:

{
  "Rules": [
    {
      "ID": "media-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "media/recordings/" },
      "Transitions": [
        {
          "Days": 7,
          "StorageClass": "INTELLIGENT_TIERING"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": { "Days": 2555 }
    },
    {
      "ID": "event-data-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "events/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
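The media-lifecycle rule above maps an object's age to a storage class. A small helper mirroring those thresholds makes the transitions easy to reason about and test (the class names are the S3 values from the policy; `EXPIRED` here is just a marker for the 7-year expiration):

```python
def media_storage_class(age_days: int) -> str:
    """Storage class for media/recordings/ objects under the lifecycle rule above."""
    if age_days >= 2555:   # 7 years: object expires
        return "EXPIRED"
    if age_days >= 365:
        return "DEEP_ARCHIVE"
    if age_days >= 90:
        return "GLACIER_IR"
    if age_days >= 7:
        return "INTELLIGENT_TIERING"
    return "STANDARD"

print(media_storage_class(3), media_storage_class(100))  # STANDARD GLACIER_IR
```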

5.5 Configuration Backup

All infrastructure configuration is stored as code in Git:

surveillance-ops/
├── terraform/
│   ├── main.tf                 # Main infrastructure
│   ├── variables.tf            # Environment variables
│   ├── outputs.tf              # Output definitions
│   ├── modules/
│   │   ├── vpc/                # Network configuration
│   │   ├── eks/                # Kubernetes cluster
│   │   ├── rds/                # PostgreSQL instances
│   │   └── s3/                 # Object storage
│   └── environments/
│       ├── production/         # Production config
│       └── dr/                 # DR site config
├── kubernetes/
│   ├── base/                   # Kustomize base resources
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── postgres/
│   │   ├── redis/
│   │   ├── api/
│   │   ├── inference/
│   │   └── capture/
│   └── overlays/
│       ├── production/
│       ├── staging/
│       └── dr/
├── docker-compose/
│   ├── docker-compose.yml      # Edge deployment
│   └── .env.example
├── ansible/
│   ├── playbook.yml            # Host provisioning
│   └── inventory/
├── monitoring/
│   ├── prometheus/
│   ├── grafana-dashboards/
│   └── alertmanager/
└── docs/
    ├── runbooks/
    ├── postmortems/
    └── architecture/

Git backup to secondary provider:

#!/bin/bash
# /usr/local/bin/backup-git-repos.sh
# Mirrors all critical repos to secondary Git provider

set -euo pipefail

REPOS=(
  "git@github.com:company/surveillance-ops.git"
  "git@github.com:company/surveillance-app.git"
  "git@github.com:company/surveillance-models.git"
)

BACKUP_REMOTE="git@gitlab-backup.company.com:surveillance"
DATE=$(date +%Y%m%d)

for repo in "${REPOS[@]}"; do
  name=$(basename "$repo" .git)
  echo "Backing up $name..."
  
  git clone --mirror "$repo" "/tmp/$name-mirror"
  
  # Push to backup remote (git -C avoids cd-ing into a directory we later delete)
  git -C "/tmp/$name-mirror" remote add backup "$BACKUP_REMOTE/$name.git" 2>/dev/null || true
  git -C "/tmp/$name-mirror" push backup --mirror
  
  # Create dated archive
  tar czf "/backup/git/$name-$DATE.tar.gz" -C "/tmp" "$name-mirror"
  
  rm -rf "/tmp/$name-mirror"
done

# Upload to S3
aws s3 sync /backup/git/ "s3://surveillance-config-backups/git/" --storage-class STANDARD_IA

5.6 Encryption

| Data at Rest | Encryption Method | Key Management |
|---|---|---|
| PostgreSQL backups | AES-256-CBC (pgBackRest native) | AWS KMS CMK |
| S3 object storage | SSE-KMS | AWS KMS CMK with automatic rotation |
| Configuration backups | AES-256-GCM (age tool) | YubiKey HSM stored keys |
| Log archives | SSE-S3 (AES-256) | AWS managed |

KMS key policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Allow pgBackRest",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT:role/BackupServiceRole"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey*"
      ],
      "Resource": "*"
    }
  ]
}

5.7 Backup Verification

Automated integrity checks (daily at 06:00 UTC):

#!/bin/bash
# /usr/local/bin/verify-backup.sh

set -euo pipefail

STANZA="surveillance"
LOG_FILE="/var/log/backup/verify-$(date +%Y%m%d).log"
ALERT_WEBHOOK="https://hooks.slack.com/services/..."

log() {
    echo "[$(date -Iseconds)] $1" | tee -a "$LOG_FILE"
}

# 1. Verify latest backup exists
LATEST=$(pgbackrest --stanza=$STANZA info --output=json | jq -r '.[0].backup[-1].label // empty')
if [ -z "$LATEST" ]; then
    log "ERROR: No backup found!"
    curl -X POST -H 'Content-type: application/json' \
        --data '{"text":"CRITICAL: No database backup found!"}' \
        "$ALERT_WEBHOOK"
    exit 1
fi

log "Latest backup: $LATEST"

# 2. Verify backup integrity
if ! pgbackrest --stanza=$STANZA verify --set=$LATEST >> "$LOG_FILE" 2>&1; then
    log "ERROR: Backup integrity check failed for $LATEST"
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"CRITICAL: Backup integrity check failed for $LATEST\"}" \
        "$ALERT_WEBHOOK"
    exit 1
fi

# 3. Check WAL archive continuity
MISSING=$(pgbackrest --stanza=$STANZA verify 2>&1 | grep -c "missing" || true)
if [ "$MISSING" -gt 0 ]; then
    log "WARNING: $MISSING WAL files missing"
fi

# 4. Verify S3 accessibility
if ! aws s3 ls "s3://surveillance-db-backups/pgbackrest/" > /dev/null 2>&1; then
    log "ERROR: Cannot access S3 backup bucket"
    exit 1
fi

# 5. Check backup age (timestamp.stop is epoch seconds in the pgBackRest JSON)
BACKUP_AGE=$(pgbackrest --stanza=$STANZA info --output=json | \
    jq -r '.[0].backup[-1].timestamp.stop')
BACKUP_AGE_SEC=$(( $(date +%s) - BACKUP_AGE ))

if [ "$BACKUP_AGE_SEC" -gt 90000 ]; then  # > 25 hours
    log "WARNING: Latest backup is older than 25 hours"
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"WARNING: Latest backup is $((BACKUP_AGE_SEC / 3600)) hours old\"}" \
        "$ALERT_WEBHOOK"
fi

log "Backup verification completed successfully"

5.8 Restore Procedures

5.8.1 Point-in-Time Recovery (PITR)

#!/bin/bash
# restore-pitr.sh — Restore to specific point in time

STANZA="surveillance"
TARGET_TIME="$1"  # e.g., "2025-01-15 08:30:00"

# Stop application
kubectl scale deployment surveillance-api --replicas=0

# Stop PostgreSQL before restoring over its data directory
systemctl stop postgresql

# Restore from backup
pgbackrest --stanza=$STANZA restore \
    --type=time \
    --target="$TARGET_TIME" \
    --target-action=promote \
    --delta

# Start PostgreSQL (WAL replay up to the target time runs on startup)
systemctl start postgresql

# Verify database
psql -U surveillance -d surveillance -c "SELECT pg_last_xact_replay_timestamp();"

# Restart application
kubectl scale deployment surveillance-api --replicas=3

# Verify application health
curl -f http://surveillance-api:8080/health/ready

5.8.2 Full Disaster Recovery

#!/bin/bash
# restore-full.sh — Complete database restoration to new instance

STANZA="surveillance"
NEW_DATA_DIR="/var/lib/postgresql/15/main"

# 1. Install PostgreSQL (same version as backup)
apt-get install -y postgresql-15

# 2. Stop PostgreSQL
systemctl stop postgresql

# 3. Clear data directory (glob must be outside the quotes to expand)
rm -rf "$NEW_DATA_DIR"/*

# 4. Restore full backup (the latest backup set is restored by default)
pgbackrest --stanza=$STANZA restore \
    --type=immediate

# 5. Start PostgreSQL
systemctl start postgresql

# 6. Verify
pgbackrest --stanza=$STANZA check

# 7. Run consistency check
psql -U surveillance -d surveillance -c "SELECT count(*) FROM events;"
psql -U surveillance -d surveillance -c "SELECT pg_database_size('surveillance');"

5.9 Monthly Restore Drill

Schedule: First Saturday of each month at 02:00 UTC

Procedure:

  1. Provision isolated restore environment (separate namespace/VM)
  2. Restore latest full backup
  3. Apply differential backups
  4. Verify data integrity (row counts, checksums)
  5. Run application smoke tests
  6. Verify media files accessible
  7. Document results in restore log
  8. Tear down restore environment
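Step 4's row-count verification is a straightforward expected-vs-actual comparison. A sketch (in practice the expected counts would be captured from production just before the drill):

```python
def mismatched_tables(expected: dict, actual: dict) -> list:
    """Return the tables whose restored row counts differ from production."""
    return sorted(t for t in expected if actual.get(t) != expected[t])

expected = {"events": 12_456_789, "cameras": 8, "alerts": 1_234}
print(mismatched_tables(expected, dict(expected)))              # []
print(mismatched_tables(expected, {**expected, "cameras": 7}))  # ['cameras']
```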

Restore drill checklist:

## Restore Drill — 2025-01-04
- [x] Isolated environment provisioned
- [x] Full backup restored (duration: 23 min)
- [x] Differential backup applied (duration: 4 min)
- [x] WAL replay completed (duration: 12 min)
- [x] Database row counts verified
  - events: 12,456,789 (expected: 12,456,789) ✓
  - cameras: 8 (expected: 8) ✓
  - alerts: 1,234 (expected: 1,234) ✓
- [x] Application smoke tests passed
- [x] Media file accessibility verified (100/100 random samples)
- [x] Total RTO: 41 minutes (target: < 60 min) ✓
- [x] Total RPO: 8 minutes (target: < 15 min) ✓
- [x] Environment cleaned up

**Notes:** WAL replay was slower than usual due to high write volume on Jan 3.

6. Data Retention

6.1 Retention Policy Matrix

| Data Category | Retention Period | Action After Retention | Legal Basis |
|---|---|---|---|
| Raw video recordings | 90 days (configurable) | Delete or archive to cold storage | Operational necessity |
| Event clips (alerts) | 1 year | Archive to cold storage for 2 additional years | Incident investigation |
| Detection metadata | 1 year | Anonymize & aggregate | Analytics |
| Audit logs | 1 year | Archive for 6 additional years | Compliance |
| System health logs | 90 days | Delete | Operational monitoring |
| Access logs | 90 days | Delete | Security monitoring |
| Face embeddings (enrolled) | Indefinite until deleted | User-initiated deletion | Authorized personnel database |
| Face embeddings (detected) | Never stored | N/A (computed and discarded immediately) | Privacy by design |
| Alert history | 2 years | Archive | Incident reference |
| Training data | Indefinite | Explicit deletion by admin | AI model improvement |
| Configuration history | 2 years | Archive | Change tracking |
| Backup archives | 7 years (Glacier) | Delete per backup schedule | Disaster recovery |

6.2 Automated Cleanup Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Data Lifecycle Manager                     │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   Retention  │  │   Cleanup    │  │   Archive        │   │
│  │   Policy     │──│   Executor   │──│   Manager        │   │
│  │   Engine     │  │   (CronJob)  │  │   (S3/Glacier)   │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
│         │                 │                   │              │
│         ▼                 ▼                   ▼              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  PostgreSQL  │  │  S3 Object   │  │   Elasticsearch  │   │
│  │  (metadata)  │  │  Storage     │  │   (logs)         │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
└──────────────────────────────────────────────────────────────┘

6.3 Cleanup Job Implementation

# retention_manager.py
from datetime import datetime, timedelta
from typing import List, Optional
import asyncio
import logging

logger = logging.getLogger(__name__)

class RetentionPolicy:
    def __init__(self, name: str, retention_days: int, archive_first: bool = False,
                 archive_days: int = 0, anonymize: bool = False):
        self.name = name
        self.retention_days = retention_days
        self.archive_first = archive_first
        self.archive_days = archive_days
        self.anonymize = anonymize

class DataRetentionManager:
    def __init__(self):
        self.policies = {}
    
    def register_policy(self, policy: RetentionPolicy):
        self.policies[policy.name] = policy
    
    async def execute_cleanup(self, policy_name: str, dry_run: bool = False):
        policy = self.policies.get(policy_name)
        if not policy:
            raise ValueError(f"Unknown policy: {policy_name}")
        
        cutoff_date = datetime.utcnow() - timedelta(days=policy.retention_days)
        logger.info("Executing cleanup for '%s' (cutoff: %s)", 
                     policy_name, cutoff_date.isoformat())
        
        if dry_run:
            count = await self._count_eligible(policy_name, cutoff_date)
            logger.info("[DRY RUN] Would delete %d records", count)
            return count
        
        archived = anonymized = deleted = 0
        
        # Archive before delete if configured
        if policy.archive_first:
            archive_cutoff = datetime.utcnow() - timedelta(
                days=policy.retention_days + policy.archive_days
            )
            archived = await self._archive_records(policy_name, cutoff_date, archive_cutoff)
            logger.info("Archived %d records", archived)
        
        # Anonymize instead of deleting if configured
        if policy.anonymize:
            anonymized = await self._anonymize_records(policy_name, cutoff_date)
            logger.info("Anonymized %d records", anonymized)
        else:
            # Delete expired records
            deleted = await self._delete_records(policy_name, cutoff_date)
            logger.info("Deleted %d records", deleted)
        
        return {"archived": archived, "anonymized": anonymized, "deleted": deleted}

# Register policies
retention = DataRetentionManager()
retention.register_policy(RetentionPolicy("raw_video", retention_days=90, archive_first=True, archive_days=180))
retention.register_policy(RetentionPolicy("event_clips", retention_days=365, archive_first=True, archive_days=730))
retention.register_policy(RetentionPolicy("detection_metadata", retention_days=365, anonymize=True))
retention.register_policy(RetentionPolicy("audit_logs", retention_days=365, archive_first=True, archive_days=2190))
retention.register_policy(RetentionPolicy("system_logs", retention_days=90))
retention.register_policy(RetentionPolicy("access_logs", retention_days=90))

Kubernetes CronJob:

# cleanup-job.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-retention-cleanup
  namespace: surveillance
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM UTC
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: surveillance/retention-manager:2.3.1
              command:
                - python
                - -m
                - retention_manager
                - --execute-all
                - --notify
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
                - name: S3_BUCKET
                  value: surveillance-media
                - name: DRY_RUN
                  value: "false"
              resources:
                requests:
                  cpu: 100m
                  memory: 256Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
          restartPolicy: OnFailure

6.4 Archive to Cold Storage

Before deletion, data is moved to cost-effective cold storage:

| Stage | Storage Class | Cost Factor | Access Time |
|-------|---------------|-------------|-------------|
| Active | S3 Standard | 1x | Immediate |
| 7 days | S3 Intelligent-Tiering | 0.8x | Immediate |
| 90 days | S3 Glacier Instant Retrieval | 0.2x | Milliseconds |
| 1 year | S3 Glacier Flexible Retrieval | 0.08x | Minutes-hours |
| 2 years | S3 Glacier Deep Archive | 0.04x | 12-48 hours |
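The lifecycle table above maps directly to an age-based lookup. This illustrative sketch encodes the day thresholds and storage-class names from the table (the class identifiers match the AWS storage-class constants, but the thresholds are this document's policy, not AWS defaults):

```python
def storage_class_for_age(age_days: int) -> str:
    """Return the target S3 storage class for an object of the given age."""
    if age_days >= 730:
        return "DEEP_ARCHIVE"          # 2 years+: 12-48 hour retrieval
    if age_days >= 365:
        return "GLACIER"               # 1 year+: Flexible Retrieval
    if age_days >= 90:
        return "GLACIER_IR"            # 90 days+: Instant Retrieval
    if age_days >= 7:
        return "INTELLIGENT_TIERING"   # 7 days+: automatic tiering
    return "STANDARD"                  # active data
```

A lifecycle rule on the bucket achieves the same transitions declaratively; this function is useful when tagging objects at write time or auditing tier placement.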

Archive process:

#!/bin/bash
# archive-old-media.sh

BUCKET="surveillance-media"
RETENTION_DAYS=90
CUTOFF=$(date -d "$RETENTION_DAYS days ago" +%Y-%m-%d)

# 1. Identify files to archive
aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "recordings/" \
    --query "Contents[?LastModified<='$CUTOFF'].Key" \
    --output text | tr '\t' '\n' > /tmp/archive-list.txt

# 2. Move to Glacier
while IFS= read -r key; do
    aws s3api copy-object \
        --copy-source "${BUCKET}/${key}" \
        --bucket "$BUCKET" \
        --key "$key" \
        --storage-class GLACIER_IR \
        --metadata-directive COPY
done < /tmp/archive-list.txt

# 3. Log archival
aws s3 cp /tmp/archive-list.txt \
    "s3://${BUCKET}/archive-logs/archive-$(date +%Y%m%d).txt"

# 4. Notify
echo "Archived $(wc -l < /tmp/archive-list.txt) files to Glacier IR"

6.5 Right to Deletion

For privacy compliance (GDPR/CCPA), implement data subject deletion:

async def delete_subject_data(subject_id: str):
    """
    Complete deletion of a data subject:
    1. Remove from enrolled persons database
    2. Delete associated face embeddings
    3. Remove references from detection logs
    4. Delete related event clips
    5. Log deletion for audit
    """
    async with db.transaction():
        # 1. Delete enrolled person
        await db.execute(
            "DELETE FROM enrolled_persons WHERE id = $1",
            subject_id
        )
        
        # 2. Delete embeddings (separate table for encryption)
        await db.execute(
            "DELETE FROM face_embeddings WHERE person_id = $1",
            subject_id
        )
        
        # 3. Anonymize detection references
        await db.execute(
            """UPDATE detections 
                SET person_id = NULL, 
                    person_name = '[REDACTED]',
                    face_embedding = NULL
                WHERE person_id = $1""",
            subject_id
        )
        
        # 4. Queue related event clips for deletion
        clips = await db.fetch(
            "SELECT storage_path FROM event_clips WHERE person_id = $1",
            subject_id
        )
        for clip in clips:
            await s3.delete_object(clip['storage_path'])
        
        # 5. Audit log
        await db.execute(
            """INSERT INTO deletion_audit_log 
                (subject_id, deleted_at, deleted_by, reason)
                VALUES ($1, NOW(), $2, 'data_subject_request')""",
            subject_id, current_user_id()
        )

7. Storage Management

7.1 Storage Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Storage Architecture                       │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   Hot Tier   │  │   Warm Tier  │  │   Cold Tier      │   │
│  │   (NVMe/SSD) │  │   (HDD/S3)   │  │   (Glacier)      │   │
│  │              │  │              │  │                  │   │
│  │  Current     │  │  30-90 day   │  │  90+ day media   │   │
│  │  recordings  │  │  recordings  │  │  long-term       │   │
│  │  Active DB   │  │  Event clips │  │  archive         │   │
│  │  Cache       │  │  90-day logs │  │  compliance      │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
│                                                              │
│  Edge Node (local)  ←── VPN ──→  Cloud (S3/EBS/EFS)        │
└──────────────────────────────────────────────────────────────┘

7.2 Storage Capacity Planning (8 Camera Baseline)

| Data Type | Daily Volume | Compression | Storage/day | Monthly |
|-----------|--------------|-------------|-------------|---------|
| Raw video (8x 1080p@30fps, H.265) | ~800 GB | 50% | ~400 GB | ~12 TB |
| Event clips (alerts) | ~5 GB | None | ~5 GB | ~150 GB |
| Detection metadata | ~500 MB | None | ~500 MB | ~15 GB |
| Audit logs | ~100 MB | 70% | ~30 MB | ~1 GB |
| System metrics | ~200 MB | 80% | ~40 MB | ~1.2 GB |
| Database | ~50 MB | N/A | ~50 MB | ~1.5 GB |
| Model checkpoints | N/A | N/A | N/A | ~2 GB |
| **Total** | | | ~406 GB/day | ~12.2 TB/month |

Annual raw capacity requirement: ~146 TB
With 90-day retention + archive: ~40 TB hot/warm + ~110 TB cold
Recommended provisioned capacity: 200 TB (~33% headroom over the ~150 TB requirement)
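The capacity figures above can be reproduced with simple arithmetic. This back-of-envelope check uses the post-compression daily volumes from the table (GB values are approximations from section 7.2, not measured data):

```python
# Approximate post-compression daily volumes in GB, per the 7.2 table.
daily_gb = {
    "raw_video": 400,
    "event_clips": 5,
    "detection_metadata": 0.5,
    "audit_logs": 0.03,
    "system_metrics": 0.04,
    "database": 0.05,
}

total_per_day = sum(daily_gb.values())   # ~406 GB/day
monthly_tb = total_per_day * 30 / 1000   # ~12.2 TB/month
annual_tb = total_per_day * 365 / 1000   # ~148 TB/year, before archival tiering
```

Scaling the `daily_gb` entries linearly is a reasonable first-order estimate when adding cameras, since raw video dominates the total.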

7.3 Storage Monitoring & Alerting

Prometheus rules:

groups:
  - name: storage-alerts
    rules:
      - alert: StorageWarning70
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.30
        for: 5m
        labels:
          severity: p4
        annotations:
          summary: "Storage at 70% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: StorageHigh85
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
        for: 2m
        labels:
          severity: p2
        annotations:
          summary: "Storage at 85% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: StorageCritical95
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "Storage CRITICAL at 95% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: S3BucketSizeGrowth
        expr: predict_linear(aws_s3_bucket_size_bytes[7d], 30*24*3600) > 
              aws_s3_bucket_quota_bytes * 0.9
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "S3 bucket {{ $labels.bucket }} projected to exceed quota in 30 days"

      - alert: StorageCleanupFailed
        expr: increase(surveillance_cleanup_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Storage cleanup job failed"

7.4 Automated Cleanup Policies

# cleanup-policies.yaml
cleanup_policies:
  raw_video:
    description: "Raw video recordings"
    retention_days: 90
    archive_before_delete: true
    archive_storage_class: GLACIER_IR
    priority: oldest_first
    schedule: "0 2 * * *"
    
  event_clips:
    description: "Alert event video clips"
    retention_days: 365
    archive_before_delete: true
    archive_storage_class: GLACIER
    priority: oldest_first
    schedule: "0 3 * * *"
    
  temp_processing:
    description: "Temporary processing files"
    retention_days: 1
    archive_before_delete: false
    priority: all_expired
    schedule: "*/30 * * * *"
    
  failed_uploads:
    description: "Failed upload artifacts"
    retention_days: 7
    archive_before_delete: false
    priority: all_expired
    schedule: "0 4 * * *"
    
  system_logs:
    description: "Application and system logs"
    retention_days: 90
    archive_before_delete: true
    archive_storage_class: GLACIER_IR
    priority: oldest_first
    schedule: "0 5 * * *"

7.5 Compression Strategy

| Data Age | Compression | Method | Savings |
|----------|-------------|--------|---------|
| 0-7 days | None | Raw H.265 | Baseline |
| 7-30 days | Re-encode | H.265 → H.265 (lower CRF) | 30-40% |
| 30-90 days | Transcode | H.265 → AV1 | 40-50% |
| 90+ days | Archive | AV1 + tarball | 50-60% |
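For capacity forecasting, the schedule above can be turned into an expected-size estimate. This sketch uses the midpoint of each savings range in the table (the midpoints are an assumption for illustration):

```python
# (minimum age in days, fractional savings) — midpoints of the table's ranges.
SAVINGS_BY_AGE = [
    (90, 0.55),  # archive: 50-60%
    (30, 0.45),  # AV1 transcode: 40-50%
    (7, 0.35),   # H.265 re-encode: 30-40%
    (0, 0.0),    # untouched
]

def compressed_size_gb(raw_gb: float, age_days: int) -> float:
    """Expected on-disk size after the age-based compression schedule."""
    for min_age, savings in SAVINGS_BY_AGE:
        if age_days >= min_age:
            return round(raw_gb * (1 - savings), 1)
    return raw_gb
```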

Compression job:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: video-compression
  namespace: surveillance
spec:
  schedule: "0 1 * * *"
  jobTemplate:
    spec:
      parallelism: 2
      template:
        spec:
          containers:
            - name: compressor
              image: surveillance/media-processor:2.3.1
              command:
                - python
                - -m
                - compression
                - --age-days=7
                - --target-crf=30
                - --codec=libx265
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
                limits:
                  cpu: "4"
                  memory: 8Gi
          restartPolicy: OnFailure

7.6 Auto-Scaling Cloud Storage

S3 Auto-scaling: S3 is inherently elastic — no manual scaling needed. Monitor bucket size and cost.

EBS volume scaling:

# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: surveillance-expandable
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: 3000
  throughput: 125
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID"
allowVolumeExpansion: true  # Enable expansion
volumeBindingMode: WaitForFirstConsumer

Automated volume expansion:

#!/bin/bash
# auto-expand-storage.sh

THRESHOLD=80
PVC_NAMES=("postgres-data" "media-storage" "log-storage")
NAMESPACE="surveillance"

for pvc in "${PVC_NAMES[@]}"; do
    # Get current usage
    USAGE=$(kubectl exec -n "$NAMESPACE" deployment/surveillance-api \
        -- df -h "/data/$pvc" | awk 'NR==2 {print $5}' | tr -d '%')
    
    if [ "$USAGE" -gt "$THRESHOLD" ]; then
        CURRENT_SIZE=$(kubectl get pvc "$pvc" -n "$NAMESPACE" \
            -o jsonpath='{.status.capacity.storage}')
        
        # Increase by 50%
        CURRENT_GB=${CURRENT_SIZE%Gi}
        NEW_GB=$((CURRENT_GB + CURRENT_GB / 2))
        
        echo "Expanding $pvc from ${CURRENT_GB}Gi to ${NEW_GB}Gi"
        
        kubectl patch pvc "$pvc" -n "$NAMESPACE" \
            --type merge \
            -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${NEW_GB}Gi\"}}}}"
        
        # Notify
        curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-type: application/json' \
            -d "{\"text\":\"Auto-expanded PVC $pvc to ${NEW_GB}Gi (was ${USAGE}% full)\"}"
    fi
done
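The 50% growth rule from auto-expand-storage.sh can also be expressed in Python, which is handy if the expansion logic is folded into an operator or controller. This sketch parses only Kubernetes "Gi" quantities; other unit suffixes are out of scope for the example:

```python
import re

def grown_size(capacity: str, factor: float = 1.5) -> str:
    """Grow a Kubernetes 'Gi' quantity by the given factor (default +50%)."""
    m = re.fullmatch(r"(\d+)Gi", capacity)
    if not m:
        raise ValueError(f"unsupported quantity: {capacity}")
    return f"{int(int(m.group(1)) * factor)}Gi"
```

This matches the shell script's integer arithmetic (`CURRENT_GB + CURRENT_GB / 2`), truncating fractional gibibytes.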

7.7 Storage Cost Optimization

| Optimization | Monthly Savings | Implementation |
|--------------|-----------------|----------------|
| S3 Intelligent-Tiering | 20-30% | Automatic |
| H.265 re-encode (older content) | 30-40% | Nightly job |
| Glacier IR for 30-90 day content | 60-70% | Lifecycle rule |
| Glacier Deep Archive for 1yr+ | 95% | Lifecycle rule |
| Reserved capacity for predictable workloads | 30-40% | Commitment |

8. Incident Response

8.1 Severity Definitions

| Severity | Name | Definition | Examples | Response Time |
|----------|------|------------|----------|---------------|
| P1 | Critical | Complete service outage; no surveillance capability | All cameras offline; AI pipeline completely down; storage full; database primary down | 15 minutes |
| P2 | High | Major functionality degraded; partial surveillance loss | Single camera offline > 30 min; high error rates; model accuracy degraded; backup failures | 1 hour |
| P3 | Medium | Minor functionality issue; workarounds available | Low FPS on camera; certificate expiry warning; cleanup job failure | 4 hours |
| P4 | Low | Cosmetic or non-urgent issue | High CPU warning; UI glitch; documentation update needed; optimization opportunity | 24 hours |

8.2 Escalation Matrix

P1 (Critical) — 15 min response
├── 0 min: Alert fires → PagerDuty pages on-call engineer
├── 5 min: On-call must acknowledge
├── 15 min: No acknowledge → Escalate to Team Lead (SMS + Call)
├── 30 min: No response → Escalate to Engineering Manager
├── 45 min: No response → Escalate to VP Engineering
└── 60 min: No response → Escalate to CTO

P2 (High) — 1 hour response
├── 0 min: Alert fires → PagerDuty pages on-call engineer
├── 30 min: No acknowledge → Reminder notification
├── 60 min: No response → Escalate to Team Lead
└── 2 hours: No response → Escalate to Engineering Manager

P3 (Medium) — Slack + email only, 4 hour response
├── 0 min: Alert fires → Slack notification
└── 4 hours: No acknowledgment → Escalate to Team Lead

P4 (Low) — Daily digest email, 24 hour response
└── Daily digest at 09:00 UTC
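The P1 ladder above is simple enough to encode as a lookup, which is how a paging tool would evaluate it. This hypothetical helper returns who is paged next, given how many minutes the alert has gone unacknowledged:

```python
# (minutes without response, next escalation target), per the P1 ladder.
P1_LADDER = [
    (60, "CTO"),
    (45, "VP Engineering"),
    (30, "Engineering Manager"),
    (15, "Team Lead"),
    (0, "On-call Engineer"),
]

def p1_escalation_target(minutes_unacknowledged: int) -> str:
    """Return the current escalation target for an unacknowledged P1."""
    for threshold, target in P1_LADDER:
        if minutes_unacknowledged >= threshold:
            return target
    return "On-call Engineer"
```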

Contact Information:

| Role | Primary Contact | Secondary Contact | Notification Method |
|------|-----------------|-------------------|---------------------|
| On-Call Engineer | Rotating (PagerDuty) | PagerDuty | Push + SMS |
| SRE Team Lead | lead-sre@company.com | +1-555-0100 | SMS + Voice Call |
| Engineering Manager | eng-mgr@company.com | +1-555-0101 | SMS + Voice Call |
| VP Engineering | vp-eng@company.com | +1-555-0102 | Voice Call + Email |
| CTO | cto@company.com | +1-555-0103 | Voice Call + Email |

8.3 Incident Response Process

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  DETECT     │───▶│  RESPOND    │───▶│  RESOLVE    │───▶│  REVIEW     │
│  (Alert)    │    │  (Triage &  │    │  (Fix &     │    │  (Post-     │
│             │    │   Mitigate) │    │   Verify)   │    │   mortem)   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                          │
                    ┌─────┴─────┐
                    ▼           ▼
              ┌────────┐  ┌──────────┐
              │Mitigate│  │Communicate│
              │Impact  │  │Stakeholders│
              └────────┘  └──────────┘

Phase 1: Detect

  1. Monitoring alert fires
  2. On-call engineer receives page
  3. Acknowledge alert within 5 minutes
  4. Create incident channel in Slack: #inc-YYYY-MM-DD-brief-description
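The channel-naming convention in step 4 is easy to automate when incidents are opened programmatically. This illustrative helper follows the `#inc-YYYY-MM-DD-brief-description` pattern; the slug rules (lowercase, hyphen-separated) are an assumption, not a stated policy:

```python
from datetime import date
import re

def incident_channel(day: date, description: str) -> str:
    """Build a Slack channel name like #inc-2025-01-15-storage-full."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"
```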

Phase 2: Respond

  1. Assess severity and impact
  2. Execute relevant runbook
  3. Apply immediate mitigation if possible
  4. Update incident timeline every 15 minutes
  5. Communicate to stakeholders

Phase 3: Resolve

  1. Implement fix
  2. Verify service recovery (all health checks pass)
  3. Monitor for 30 minutes post-recovery
  4. Close incident in PagerDuty
  5. Update incident log

Phase 4: Review

  1. Schedule post-mortem within 48 hours for P1/P2
  2. Complete post-mortem document
  3. Identify action items
  4. Track action items to completion

8.4 Runbooks

Runbook: Camera Offline

Detection: SingleCameraDown alert fires
Severity: P2
Initial Response Time: 1 hour

Diagnosis Steps:

# 1. Check camera stream status
curl http://video-capture:8080/api/v1/cameras/{camera_id}/status

# 2. Check camera connectivity
ping <camera_ip>
curl -v rtsp://<camera_ip>:554/stream

# 3. Check video-capture service logs
kubectl logs -l app=video-capture --tail=100 | grep {camera_id}

# 4. Check network path
traceroute <camera_ip>
# Verify firewall rules, VPN tunnel

# 5. Check camera resource usage
kubectl top pod -l app=video-capture

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Camera powered off | Contact site personnel to power cycle | Ping responds |
| Network connectivity | Check switch port, cable, VLAN | Ping + RTSP describe |
| VPN tunnel down | See "VPN Tunnel Down" runbook | Tunnel status |
| Camera firmware issue | Power cycle camera remotely | Stream reconnects |
| Stream URL changed | Update camera configuration | New stream active |
| Video-capture bug | Restart capture container | Stream reconnected |
| Resource exhaustion | Scale up capture resources | CPU/memory normal |

Workaround: If camera cannot be restored within 30 minutes:

  • Mark camera as "maintenance mode" in dashboard
  • Disable alerts for this camera
  • Queue for on-site technician visit

Runbook: AI Pipeline Down

Detection: AIPipelineDown or HighErrorRate alert
Severity: P1
Initial Response Time: 15 minutes

Diagnosis Steps:

# 1. Check inference service health
curl http://ai-inference:8080/health/deep

# 2. Check if model is loaded
curl http://ai-inference:8080/api/v1/model/status

# 3. Check GPU status (if applicable)
nvidia-smi
# OR for CPU inference:
htop

# 4. Check inference logs
kubectl logs -l app=ai-inference --tail=200

# 5. Check resource usage
kubectl top pod -l app=ai-inference
kubectl describe pod -l app=ai-inference

# 6. Check model service
kubectl logs -l app=model-service --tail=100

# 7. Check if inference queue is backing up
redis-cli LLEN inference:queue

# 8. Test inference manually
curl -X POST http://ai-inference:8080/api/v1/inference/test \
  -H "Content-Type: application/json" \
  -d '{"test_image": "base64encoded"}'

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Model not loaded | Restart model-service pod | Model status shows loaded |
| GPU OOM | Restart inference pod; check memory limits | nvidia-smi shows free memory |
| Model corruption | Reload model from S3 backup | Test inference succeeds |
| Inference timeout | Scale inference replicas; check input | Latency returns to normal |
| Queue backup | Scale up consumers; check for dead consumers | Queue depth returns to 0 |
| Bad model update | Rollback to previous model version | Detection accuracy restored |
| Dependency failure | Check circuit breaker status; restart dependencies | All health checks pass |

Immediate Mitigation:

  • If inference cannot be restored in 15 minutes:
    1. Switch to "detection-only" mode (skip recognition)
    2. Enable edge processing as backup
    3. Queue frames for delayed processing

Runbook: VPN Tunnel Down

Detection: Edge node unreachable; camera streams offline
Severity: P2 (P1 if all edge cameras affected)
Initial Response Time: 1 hour

Diagnosis Steps:

# 1. Check tunnel status from cloud side
ping <edge_gateway_ip>

# 2. Check VPN service status
kubectl logs -l app=vpn-gateway --tail=100

# 3. Check tunnel metrics
curl http://vpn-gateway:8080/metrics | grep vpn_tunnel

# 4. Check from edge side (if SSH available)
ssh edge-node "ping <cloud_gateway_ip>"
ssh edge-node "ipsec status"  # or wg show for WireGuard

# 5. Check network path
mtr <edge_gateway_ip>

# 6. Check certificates (if certificate-based VPN)
openssl x509 -in /etc/vpn/cert.pem -text -noout | grep "Not After"

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Edge network down | Contact ISP/site | Ping responds |
| VPN service crash | Restart VPN gateway | Tunnel established |
| Certificate expired | Renew certificates | Valid cert, tunnel up |
| MTU mismatch | Adjust tunnel MTU | No packet fragmentation |
| Firewall change | Restore firewall rules | Tunnel traffic flowing |
| IPsec/IKE failure | Restart IKE daemon; check config | SA established |
| WireGuard key issue | Regenerate keys | Handshake succeeds |

Workaround: If tunnel cannot be restored:

  • Activate local storage mode on edge (store locally, sync later)
  • Switch to cellular backup if available
  • Deploy technician on-site if needed

Runbook: Storage Full

Detection: StorageCritical95 alert fires
Severity: P1
Initial Response Time: 15 minutes

Immediate Actions (within 5 minutes):

# 1. Identify what's consuming space
df -h
ncdu /data/surveillance

# 2. Check if cleanup job is running
kubectl get jobs -n surveillance | grep cleanup

# 3. Temporarily expand storage (cloud)
# AWS EBS (CURRENT_GB is the volume's present size in GiB):
aws ec2 modify-volume --volume-id vol-XXXX --size $((CURRENT_GB + 100))

# 4. Emergency cleanup — delete oldest temp files
find /data/surveillance/temp -type f -mtime +1 -delete
find /data/surveillance/cache -type f -atime +7 -delete

# 5. Force log rotation
logrotate -f /etc/logrotate.d/surveillance

# 6. Truncate oversized logs (>1GB)
find /var/log/surveillance -type f -size +1G -exec truncate -s 0 {} \;

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Normal growth | Expand storage; review retention | Usage < 80% |
| Runaway logs | Fix log source; rotate logs | Log growth rate normal |
| Cleanup job failed | Restart cleanup job; fix root cause | Cleanup completes |
| Retention too long | Reduce retention period | Space freed |
| Camera bitrate high | Adjust camera encoding settings | Bitrate normalized |
| Orphaned temp files | Purge temp directory | Space recovered |

Runbook: Database Connectivity Issues

Detection: DatabaseUnreachable alert
Severity: P1
Initial Response Time: 15 minutes

Diagnosis Steps:

# 1. Check PostgreSQL pod status
kubectl get pods -l app=postgres
kubectl describe pod -l app=postgres

# 2. Check PostgreSQL logs
kubectl logs -l app=postgres --tail=200

# 3. Test connection from application pod
kubectl exec deployment/surveillance-api -- \
  pg_isready -h postgres -U surveillance

# 4. Check connection pool status
kubectl exec deployment/surveillance-api -- \
  python -c "from db import pool; print(pool.size(), pool.available())"

# 5. Check resource usage
kubectl top pod -l app=postgres

# 6. Check disk I/O
iostat -x 1 5

# 7. Check for locks
kubectl exec deployment/postgres -- \
  psql -U surveillance -c "SELECT * FROM pg_locks WHERE NOT granted;"

# 8. Check replication lag
kubectl exec deployment/postgres -- \
  psql -U surveillance -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp()) AS lag_seconds;"

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| PostgreSQL pod crash | Restart pod; check for OOM | Pod running, accepting connections |
| Connection pool exhausted | Increase pool size; check for leaks | Available connections > 0 |
| Disk I/O saturation | Scale storage IOPS; optimize queries | I/O wait < 20% |
| Lock contention | Kill blocking queries; optimize transactions | No waiting locks |
| Replication lag | Check replica resources; restart replication | Lag < 5 seconds |
| Query overload | Enable query caching; kill slow queries | Active queries normal |
| Disk full | See "Storage Full" runbook | Free space available |
| Hardware failure | Failover to replica; replace primary | Replica promoted |

Immediate Mitigation:

  • If primary is down:
    1. Promote replica to primary: pg_ctl promote
    2. Update connection strings
    3. Restart application pods

Runbook: High Error Rates

Detection: HighErrorRate alert fires
Severity: P1
Initial Response Time: 15 minutes

Diagnosis Steps:

# 1. Check error distribution by service
kubectl logs -l app=surveillance --tail=1000 | \
  jq -r '.service + ": " + .level + ": " + .message' | \
  sort | uniq -c | sort -rn | head -20

# 2. Check error rate per service
curl http://prometheus:9090/api/v1/query?query=\
  "rate(surveillance_errors_total[5m])"

# 3. Check for recent deployments
kubectl rollout history deployment/surveillance-api
kubectl rollout history deployment/ai-inference

# 4. Check dependency health
curl http://surveillance-api:8080/health/deep

# 5. Check for resource exhaustion
kubectl top pods

# 6. Review recent changes
# Check CI/CD pipeline, config changes

# 7. Check circuit breaker status
for service in database storage inference; do
  curl "http://surveillance-api:8080/api/v1/circuit-breakers/$service"
done

Resolution Steps:

| Issue | Resolution | Verification |
|-------|------------|--------------|
| Bad deployment | Rollback to previous version | Error rate drops |
| Dependency down | Fix dependency; check circuit breakers | All deps healthy |
| Resource exhaustion | Scale up; optimize resource usage | Usage normal |
| Code bug | Deploy hotfix; or rollback | Errors eliminated |
| Configuration error | Revert config change; validate config | Config valid |
| External API failure | Enable fallback; contact provider | Fallback active |
| Database deadlock | Kill blocking queries; fix code | Deadlocks resolved |

8.5 Post-Incident Review Template

# Post-Incident Review

## Incident Summary

| Field | Value |
|-------|-------|
| Incident ID | INC-2025-001 |
| Date/Time (UTC) | 2025-01-15 03:45 - 2025-01-15 05:20 |
| Severity | P1 |
| Detection Method | Automated alert (StorageCritical95) |
| Affected Systems | All camera streams, event storage |
| Impact | 1h 35m of degraded recording quality |

## Timeline

| Time (UTC) | Event |
|------------|-------|
| 03:42 | Storage usage crosses 95% threshold |
| 03:45 | P1 alert fires; on-call paged |
| 03:48 | On-call engineer acknowledges |
| 03:52 | Diagnosis begins; identified storage full |
| 04:05 | Emergency cleanup initiated; temp files removed |
| 04:15 | Storage expanded by 200GB |
| 04:30 | Cleanup job restarted; oldest files archived |
| 04:45 | All camera streams reconnecting |
| 05:00 | All health checks passing |
| 05:20 | Incident closed; monitoring continues |

## Root Cause Analysis

**5 Whys:**
1. Why did storage fill up? → Cleanup job had been failing for 3 days
2. Why was cleanup failing? → Credential rotation broke S3 access
3. Why didn't credential rotation update cleanup job? → Cleanup job uses hardcoded credentials
4. Why are credentials hardcoded? → Technical debt; not migrated to secret management
5. Why wasn't this caught? → No monitoring on cleanup job success/failure

**Root Cause:** Cleanup job used hardcoded S3 credentials that were not updated during routine credential rotation, causing 3 days of accumulated data without cleanup.

## Contributing Factors
- No alert on cleanup job failures
- Storage growth rate was not monitored
- No auto-expansion configured for media storage

## What Went Well
- Automated P1 alert fired immediately at 95%
- On-call responded within 3 minutes
- Emergency cleanup procedures were effective
- No data loss occurred

## What Went Wrong
- Cleanup job failure went undetected for 3 days
- Manual intervention required for storage expansion
- Edge cameras buffered locally but some frames were lost during reconnect

## Action Items

| ID | Action | Owner | Due Date | Priority |
|----|--------|-------|----------|----------|
| AI-1 | Migrate all jobs to use IAM roles / secret management | @sre-team | 2025-01-22 | High |
| AI-2 | Add alert for cleanup job failures | @sre-team | 2025-01-18 | High |
| AI-3 | Implement auto-expansion for media storage | @sre-team | 2025-01-29 | Medium |
| AI-4 | Add storage growth rate alerting | @sre-team | 2025-01-22 | Medium |
| AI-5 | Improve camera reconnection to reduce frame loss | @eng-team | 2025-02-05 | Low |
| AI-6 | Document hardcoded credential audit procedure | @security | 2025-01-22 | High |

## Lessons Learned
- Any automated job failure must have an alert
- Credential management must be centralized
- Storage monitoring needs predictive capability

## Signatures
- Incident Commander: _________________ Date: ___/___/______
- Engineering Lead: _________________ Date: ___/___/______

9. Upgrades & Maintenance

9.1 Zero-Downtime Deployment Strategy

Deployment Pattern: Rolling updates with readiness gate verification

Phase 1: Deploy new version alongside old version
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │   (serving traffic)
  └──────────┘    └──────────┘    └──────────┘

Phase 2: Add new version pod, verify health
  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │    │  Pod v2  │   (new pod not yet serving)
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
                                                      ▲
                                                health check passes

Phase 3: Route traffic to new pod, drain old pod
  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v2  │    │  Pod v2  │   (traffic shifting)
  └──────────┘    └──────────┘    └──────────┘    └──────────┘

Phase 4: Complete rollout
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v2  │    │  Pod v2  │    │  Pod v2  │   (all pods updated)
  └──────────┘    └──────────┘    └──────────┘

Rollback: Instantly revert to previous ReplicaSet
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │   (rollback in ~30 seconds)
  └──────────┘    └──────────┘    └──────────┘

Kubernetes Deployment Strategy:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: surveillance-api
  namespace: surveillance
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Allow 1 extra pod during update
      maxUnavailable: 0    # Never reduce capacity
  selector:
    matchLabels:
      app: surveillance-api
  template:
    metadata:
      labels:
        app: surveillance-api
        version: "2.3.2"   # Updated with each release
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: surveillance/api:2.3.2@sha256:a1b2c3d4...
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 6
            successThreshold: 2
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - surveillance-api
                topologyKey: kubernetes.io/hostname

9.2 Deployment Pipeline

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Build     │───▶│   Test      │───▶│   Stage     │───▶│   Canary    │───▶│  Production │
│  (CI)       │    │  (Unit/Int) │    │  (E2E)      │    │  (5% traff) │    │  (100%)     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                          │                  │                  │
                          ▼                  ▼                  ▼
                    ┌──────────┐      ┌──────────┐      ┌──────────┐
                    │ Fail =   │      │ Fail =   │      │ Fail =   │
                    │ Block    │      │ Block    │      │ Rollback │
                    └──────────┘      └──────────┘      └──────────┘

Automated promotion gates:

| Gate | Criteria | Auto-promote Timing |
|------|----------|---------------------|
| Build | All tests pass; linting passes; security scan clean | Immediate |
| Staging | E2E tests pass; performance within 10% of baseline | 30 min validation |
| Canary | Error rate < 0.1%; p95 latency < baseline + 20% | 15 min bake time |
| Production | Canary metrics healthy for 30 min | Auto-proceed |

9.3 Database Migrations

Tool: Alembic (SQLAlchemy migrations) with yoyo-migrations for idempotent SQL

Migration rules:

  1. All migrations must be backward-compatible (add-only in one release)
  2. Destructive changes require a 2-phase deployment
  3. Migrations are versioned and reversible
  4. Migrations run automatically as init container before app startup
  5. Migration status exposed via /health/ready
# migrations/env.py — Alembic configuration
from alembic import context
from sqlalchemy import create_engine

config = context.config
target_metadata = None  # set to the application's MetaData (e.g. Base.metadata)

def run_migrations():
    """Run migrations in online mode."""
    connectable = create_engine(config.get_main_option("sqlalchemy.url"))
    
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            transaction_per_migration=True,
            compare_type=True,
        )
        
        with context.begin_transaction():
            context.run_migrations()

# Migration example: add_column (backward-compatible)
# migrations/versions/20250115_add_camera_resolution.py
"""
Add resolution column to cameras table

Revision ID: 20250115_add_camera_resolution
Revises: 20250101_initial
Create Date: 2025-01-15 08:30:00
"""
from alembic import op
import sqlalchemy as sa

revision = '20250115_add_camera_resolution'
down_revision = '20250101_initial'

# Phase 1 (this release): Add column as nullable
def upgrade():
    op.add_column('cameras', sa.Column('resolution', sa.String(20), nullable=True))
    # Backfill existing data
    op.execute("UPDATE cameras SET resolution = '1920x1080' WHERE resolution IS NULL")

# Phase 2 (next release): Make column non-nullable
# def upgrade():
#     op.alter_column('cameras', 'resolution', nullable=False)

def downgrade():
    op.drop_column('cameras', 'resolution')

Migration execution (Kubernetes init container):

initContainers:
  - name: db-migrations
    image: surveillance/api:2.3.2@sha256:a1b2c3d4...
    command:
      - python
      - -m
      - alembic
      - upgrade
      - head
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: url
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
    # Init containers run to completion before the app container starts;
    # restart-on-failure follows the pod-level restartPolicy

Two-phase destructive change example:

Phase 1 (Release N):

def upgrade():
    # Add new column
    op.add_column('detections', sa.Column('confidence_v2', sa.Float(), nullable=True))
    # Create index concurrently (no table lock); CONCURRENTLY cannot run inside
    # a transaction, so use Alembic's autocommit block
    with op.get_context().autocommit_block():
        op.create_index('ix_detections_confidence_v2', 'detections', ['confidence_v2'],
                        postgresql_concurrently=True)
    # Backfill one batch; an out-of-band job repeats this until no rows remain
    op.execute("""
        UPDATE detections
        SET confidence_v2 = confidence
        WHERE id IN (SELECT id FROM detections
                     WHERE confidence_v2 IS NULL LIMIT 10000)
    """)
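The batched backfill is normally driven by a loop that repeats the single-batch UPDATE until no rows remain. A minimal sketch with the executor injected (`execute_batch` is a hypothetical callable; in practice it wraps a DB connection running the UPDATE above and returning the affected row count):

```python
def backfill_in_batches(execute_batch, batch_size: int = 10_000,
                        max_batches: int = 1_000) -> int:
    """Repeatedly run one batched UPDATE until it touches zero rows.
    `execute_batch(limit)` must return the number of rows updated.
    Returns the total number of rows backfilled."""
    total = 0
    for _ in range(max_batches):  # hard cap as a safety valve
        updated = execute_batch(batch_size)
        total += updated
        if updated == 0:
            break
    return total
```

Keeping each batch small bounds lock time on the `detections` table while the application stays online.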

Phase 2 (Release N+1):

def upgrade():
    # Now safe to drop old column (all code reads from new column)
    op.drop_column('detections', 'confidence')
    # Rename new column
    op.alter_column('detections', 'confidence_v2', new_column_name='confidence')

9.4 Model Update Deployment (Blue/Green)

AI model updates use blue/green to enable instant rollback:

Current State:
  ┌──────────────┐
  │  Model v2.1  │  ← Active (Blue)
  │   (Blue)     │
  └──────────────┘
       ▲
   traffic: 100%

Deployment:
  1. Load Model v2.2 alongside v2.1
  2. Warm up v2.2 (run inference tests)
  3. Gradually shift traffic: 10% → 50% → 100%
  4. Monitor accuracy and latency
  
  ┌──────────────┐    ┌──────────────┐
  │  Model v2.1  │    │  Model v2.2  │
  │   (Blue)     │    │   (Green)    │
  └──────────────┘    └──────────────┘
    traffic: 70%         traffic: 30%

Rollback (instant):
  ┌──────────────┐    ┌──────────────┐
  │  Model v2.1  │    │  Model v2.2  │
  │   (Blue)     │    │  (Green)     │
  └──────────────┘    └──────────────┘
   traffic: 100%         traffic: 0%

Model deployment configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-blue
  namespace: surveillance
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-inference
      model: blue
  template:
    metadata:
      labels:
        app: ai-inference
        model: blue    # matched by the Service selector below
    spec:
      containers:
        - name: inference
          image: surveillance/inference:2.3.1
          env:
            - name: MODEL_VERSION
              value: "face-detection-v2.1"
            - name: MODEL_PATH
              value: "/models/v2.1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-green
  namespace: surveillance
spec:
  replicas: 0  # Scaled to 0 by default
  selector:
    matchLabels:
      app: ai-inference
      model: green
  template:
    metadata:
      labels:
        app: ai-inference
        model: green   # matched by the Service selector below
    spec:
      containers:
        - name: inference
          image: surveillance/inference:2.3.1
          env:
            - name: MODEL_VERSION
              value: "face-detection-v2.2"
            - name: MODEL_PATH
              value: "/models/v2.2"
---
# Service routes to active model via label selector
apiVersion: v1
kind: Service
metadata:
  name: ai-inference
  annotations:
    active-model: "blue"
spec:
  selector:
    model: blue  # Changed to "green" for cutover
  ports:
    - port: 8080

Model switch script:

#!/bin/bash
# switch-model.sh — Switch between blue and green model deployments
set -euo pipefail

NAMESPACE="surveillance"
TARGET="${1:?usage: switch-model.sh <blue|green>}"
OLD_VERSION=$([ "$TARGET" == "blue" ] && echo "green" || echo "blue")

# Scale target to match the currently active deployment
CURRENT_REPLICAS=$(kubectl get deployment "ai-inference-$OLD_VERSION" -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}')

echo "Scaling ai-inference-$TARGET to $CURRENT_REPLICAS replicas..."
kubectl scale deployment "ai-inference-$TARGET" --replicas="$CURRENT_REPLICAS" -n "$NAMESPACE"

# Wait for ready
kubectl rollout status "deployment/ai-inference-$TARGET" -n "$NAMESPACE" --timeout=300s

# Update service selector
echo "Switching service to $TARGET..."
kubectl patch service ai-inference -n "$NAMESPACE" \
  --type merge \
  -p "{\"spec\":{\"selector\":{\"model\":\"$TARGET\"}}}"

# Update annotation
kubectl annotate service ai-inference -n "$NAMESPACE" \
  "active-model=$TARGET" --overwrite

# Scale old version to 0
echo "Scaling down ai-inference-$OLD_VERSION..."
kubectl scale deployment "ai-inference-$OLD_VERSION" --replicas=0 -n "$NAMESPACE"

echo "Model switch complete. Active: $TARGET"

9.5 Maintenance Windows

| Window | Schedule | Duration | Allowed Activities |
|--------|----------|----------|--------------------|
| Weekly | Sunday 02:00-06:00 UTC | 4 hours | Patches, minor updates, config changes |
| Monthly | First Sunday 02:00-08:00 UTC | 6 hours | Database maintenance, major upgrades, model updates |
| Quarterly | Scheduled | 8 hours | Infrastructure upgrades, DR drills |
| Emergency | On-demand | As needed | Security patches, critical fixes |

Maintenance mode API:

@app.post("/admin/maintenance")
async def enable_maintenance_mode(
    duration_minutes: int,
    reason: str,
    user: AdminUser = Depends(get_admin_user)
):
    """Enable maintenance mode — disable non-critical processing."""
    await redis.set("maintenance:active", "true", ex=duration_minutes * 60)
    await redis.set("maintenance:reason", reason, ex=duration_minutes * 60)
    
    # Notify all connected clients
    await websocket_manager.broadcast({
        "type": "maintenance",
        "status": "started",
        "reason": reason,
        "estimated_duration_minutes": duration_minutes
    })
    
    # Reduce non-critical processing
    await set_pipeline_mode("minimal")
    
    audit_log.info("Maintenance mode enabled by %s for %d minutes: %s",
                   user.username, duration_minutes, reason)
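While the flag is set, a request filter decides which paths stay available. A sketch of the predicate only (the path prefixes are illustrative assumptions; in a real deployment the allow-list would live in config and the check would run as middleware reading `maintenance:active` from Redis):

```python
# Paths that must keep working during maintenance (illustrative allow-list)
CRITICAL_PREFIXES = ("/health", "/api/v1/alerts", "/admin/maintenance")

def request_allowed(path: str, maintenance_active: bool) -> bool:
    """Allow everything normally; during maintenance, allow only
    health checks, alert delivery, and the maintenance API itself."""
    if not maintenance_active:
        return True
    return path.startswith(CRITICAL_PREFIXES)
```

Rejected requests would receive a 503 with a `Retry-After` header derived from the stored duration.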

9.6 Rollback Capability

Every deployment maintains the previous N versions for instant rollback:

| Rollback Type | Method | Time to Complete | When to Use |
|---------------|--------|------------------|-------------|
| Application rollback | `kubectl rollout undo` | ~30 seconds | Bad deployment |
| Database rollback | `alembic downgrade` | 2-5 minutes | Bad migration |
| Model rollback | Switch service selector | ~10 seconds | Bad model update |
| Configuration rollback | Git revert + apply | 1-2 minutes | Bad config change |
| Infrastructure rollback | Terraform state revert | 5-10 minutes | Bad infra change |
| Full system rollback | DR failover | 15-30 minutes | Catastrophic failure |

Automated rollback triggers:

# rollback-alerts.yaml
- alert: DeploymentRollbackRequired
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      and
      delta(deployment_timestamp[10m]) > 0
    )
  for: 2m
  labels:
    severity: p1
  annotations:
    summary: "High error rate after deployment — rollback recommended"
    runbook_url: "https://wiki.internal/runbooks/auto-rollback"
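An Alertmanager webhook receiver can turn this alert into an automated `kubectl rollout undo`. A sketch of the payload-parsing half only (the `deployment` label is assumed to be attached by the alerting rule; the receiver wiring and the actual rollout call are out of scope here):

```python
def deployments_to_rollback(payload: dict) -> list[str]:
    """Extract deployment names from firing DeploymentRollbackRequired alerts.
    `payload` is a standard Alertmanager webhook body."""
    targets = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if (alert.get("status") == "firing"
                and labels.get("alertname") == "DeploymentRollbackRequired"):
            deployment = labels.get("deployment")
            if deployment:
                targets.append(deployment)
    return targets
```

The returned names would each be fed to `kubectl rollout undo deployment/<name>`, with the action recorded in the audit log.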

9.7 Version Pinning

All container images MUST be pinned to digest, never to floating tags:

# GOOD — pinned to digest
image: surveillance/api:2.3.1@sha256:abc123def456...

# BAD — floating tag
image: surveillance/api:latest

# ACCEPTABLE — semver tag with digest verification
image: surveillance/api:2.3.1
# (digest verified by admission controller)

Image verification admission controller:

# Kyverno policy (a similar rule can be written as an OPA Gatekeeper constraint)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-digest
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "All container images must be pinned to digest"
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"
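The same digest rule can also be enforced in CI before manifests reach the cluster. A minimal sketch of the check (regex only; it validates the reference format, not that the digest actually exists in the registry):

```python
import re

# Image reference pinned to a digest, e.g. repo/name:tag@sha256:<64 hex chars>
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def image_is_pinned(image_ref: str) -> bool:
    """True only if the reference ends in a full sha256 digest."""
    return bool(DIGEST_RE.search(image_ref))
```

A pre-merge lint job would run this over every `image:` field in the rendered manifests and fail the build on the first unpinned reference.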

10. Performance Optimization

10.1 Query Optimization

Slow query monitoring:

-- Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slow queries
SELECT 
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

Alert on slow queries:

- alert: SlowPostgresQueries
  expr: |
    pg_stat_statements_mean_time > 1000
  for: 5m
  labels:
    severity: p3
  annotations:
    summary: "Slow queries detected (>1000ms average)"

Index review (monthly):

-- Check for tables likely missing indexes (heavy sequential scans)
SELECT 
    schemaname,
    relname AS tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    seq_tup_read / NULLIF(seq_scan, 0) AS avg_rows_per_seq_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC;

-- Check for unused indexes
SELECT 
    schemaname,
    relname AS tablename,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Current index strategy:

-- Core indexes for surveillance queries
CREATE INDEX CONCURRENTLY idx_detections_timestamp_camera 
    ON detections (timestamp DESC, camera_id);

CREATE INDEX CONCURRENTLY idx_detections_person_id 
    ON detections (person_id) WHERE person_id IS NOT NULL;

CREATE INDEX CONCURRENTLY idx_events_timestamp_type 
    ON events (timestamp DESC, event_type);

CREATE INDEX CONCURRENTLY idx_alerts_status_created 
    ON alerts (status, created_at DESC) 
    WHERE status IN ('pending', 'sent');

CREATE INDEX CONCURRENTLY idx_recordings_camera_timestamp 
    ON recordings (camera_id, start_time DESC);

-- Partial index for active alerts (most queried)
CREATE INDEX CONCURRENTLY idx_alerts_active 
    ON alerts (created_at DESC, camera_id, severity)
    WHERE status = 'active';

10.2 Cache Strategy (Redis)

| Cache Type | TTL | Invalidation | Purpose |
|------------|-----|--------------|---------|
| Camera configuration | 5 min | On update | Reduce DB reads |
| Person profiles | 10 min | On update | Fast face lookup |
| Recent detections | 1 min | Time-based | Dashboard display |
| Alert rules | 5 min | On update | Rule evaluation |
| API responses (frequent) | 30 sec | On data change | Reduce API load |
| Session data | 24 hours | On logout | User sessions |
| Rate limiting | 1 min | Automatic | API protection |
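With many pods caching the same keys, identical TTLs can expire simultaneously and stampede the database. Adding jitter to the TTLs above is a common mitigation; a small sketch (the ±10% spread is an assumption, tune per cache type):

```python
import random

def jittered_ttl(base_ttl_seconds: int, spread: float = 0.10) -> int:
    """Return the base TTL shifted by a uniform ±spread fraction,
    so cache entries written together do not all expire together."""
    low = int(base_ttl_seconds * (1 - spread))
    high = int(base_ttl_seconds * (1 + spread))
    return random.randint(low, max(low, high))
```

For example, the 5-minute camera-configuration TTL would land anywhere in roughly 270-330 seconds rather than exactly 300.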

Redis configuration:

# redis-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: surveillance
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    appendonly yes
    appendfsync everysec
    save 900 1
    save 300 10
    save 60 10000
    tcp-keepalive 60
    timeout 300

Cache implementation:

# cache.py
import redis.asyncio as redis
import json
import hashlib
from functools import wraps

redis_client = redis.Redis(
    host='redis',
    port=6379,
    db=0,
    decode_responses=True,
    socket_connect_timeout=5,
    socket_timeout=5,
    health_check_interval=30,
)

def cached(ttl_seconds: int, key_prefix: str = "cache"):
    """Decorator to cache async function results (the factory itself is synchronous)."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Generate cache key
            cache_key = f"{key_prefix}:{func.__name__}:{_generate_key(args, kwargs)}"
            
            # Try cache first
            cached_value = await redis_client.get(cache_key)
            if cached_value is not None:
                return json.loads(cached_value)
            
            # Execute and cache
            result = await func(*args, **kwargs)
            await redis_client.setex(
                cache_key,
                ttl_seconds,
                json.dumps(result, default=str)
            )
            return result
        return wrapper
    return decorator

def _generate_key(args, kwargs):
    key_data = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True, default=str)
    return hashlib.sha256(key_data.encode()).hexdigest()[:16]

# Usage
@cached(ttl_seconds=300, key_prefix="camera")
async def get_camera_config(camera_id: str):
    return await db.fetchrow("SELECT * FROM cameras WHERE id = $1", camera_id)

@cached(ttl_seconds=60, key_prefix="detections")
async def get_recent_detections(camera_id: str, limit: int = 50):
    return await db.fetch(
        """SELECT * FROM detections 
           WHERE camera_id = $1 
           ORDER BY timestamp DESC 
           LIMIT $2""",
        camera_id, limit
    )

10.3 CDN Configuration

Static assets and archived media are served via CDN:

# CloudFront / CDN configuration
cdn:
  origins:
    - id: surveillance-media
      domain: surveillance-media.s3.amazonaws.com
      path: /recordings
      
    - id: surveillance-static
      domain: surveillance-static.s3.amazonaws.com
      path: /static
      
  behaviors:
    - path: /recordings/*.mp4
      ttl: 86400
      compress: true
      
    - path: /static/*
      ttl: 604800
      cache_control: "public, max-age=604800, immutable"
      
    - path: /api/*
      ttl: 0  # Don't cache API
      
  signed_urls:
    enabled: true
    key_pair_id: "K..."
    expiration: 3600  # 1 hour

10.4 Connection Pooling

Database Connection Pooling

# database.py
import asyncpg

DB_POOL_CONFIG = {
    "min_size": 5,
    "max_size": 20,
    "max_inactive_connection_lifetime": 300,  # asyncpg's idle-connection timeout
    "max_queries": 50000,
    "command_timeout": 30,
    "server_settings": {
        "jit": "off",
        "application_name": "surveillance-api"
    }
}

pool = None

async def init_pool(database_url: str):
    global pool
    pool = await asyncpg.create_pool(
        database_url,
        **DB_POOL_CONFIG
    )

async def get_connection():
    return await pool.acquire()

async def release_connection(conn):
    await pool.release(conn)

HTTP Connection Pooling (for inter-service communication)

# http_client.py
import httpx

class ServiceClient:
    def __init__(self):
        self.client = httpx.AsyncClient(
            # 30s default timeout with a 5s connect timeout; httpx.Timeout
            # needs a default when only some phases are overridden
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_connections=100,
                max_keepalive_connections=20
            ),
            http2=True,  # requires the httpx[http2] extra
        )
    
    async def get(self, service: str, path: str):
        url = f"http://{service}:8080{path}"
        response = await self.client.get(url)
        response.raise_for_status()
        return response.json()

service_client = ServiceClient()

10.5 Resource Limits

# resource-limits.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: surveillance-limits
  namespace: surveillance
spec:
  limits:
    - default:
        cpu: "1"
        memory: 1Gi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container
---
# Per-service resource allocation
resources:
  # Video capture (I/O bound)
  video-capture:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi

  # AI inference (CPU/GPU bound)
  ai-inference:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 8Gi

  # API (moderate load)
  surveillance-api:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 2Gi

  # Database (high memory)
  postgres:
    requests:
      cpu: "1"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 16Gi

  # Redis (low CPU, moderate memory)
  redis:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi

10.6 Performance Benchmarks

| Metric | Target | Alert Threshold | Critical Threshold |
|--------|--------|-----------------|--------------------|
| Camera stream latency | < 100ms | > 200ms | > 500ms |
| AI inference per frame | < 50ms | > 100ms | > 200ms |
| End-to-end detection latency | < 500ms | > 1000ms | > 2000ms |
| API response time (p50) | < 50ms | > 100ms | > 500ms |
| API response time (p95) | < 200ms | > 500ms | > 1000ms |
| Database query time (p95) | < 10ms | > 50ms | > 200ms |
| Stream processing FPS | 30 FPS | < 25 FPS | < 15 FPS |
| Frame drop rate | < 0.1% | > 1% | > 5% |
| Alert delivery time | < 5s | > 10s | > 30s |
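The benchmark table maps naturally onto a three-level classifier used by the alerting pipeline. A sketch for lower-is-better metrics only (FPS inverts the comparison); thresholds are passed in from the table rather than hard-coded:

```python
def classify_metric(value: float, alert_threshold: float,
                    critical_threshold: float) -> str:
    """Classify a lower-is-better measurement: 'ok', 'alert', or 'critical'."""
    if value > critical_threshold:
        return "critical"
    if value > alert_threshold:
        return "alert"
    return "ok"
```

For example, an API p95 of 700ms exceeds the 500ms alert threshold but not the 1000ms critical threshold, so it classifies as "alert".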

11. Disaster Recovery

11.1 DR Objectives

| Metric | Value | Measurement |
|--------|-------|-------------|
| RTO (Recovery Time Objective) | 1 hour | Time from disaster declaration to service restoration |
| RPO (Recovery Point Objective) | 15 minutes | Maximum acceptable data loss |
| RTO (Database) | 30 minutes | Database failover time |
| RTO (Application) | 15 minutes | Application redeployment time |
| RPO (Database) | < 1 minute | With synchronous replication |
| RPO (Media) | 15 minutes | Cross-region replication lag |

11.2 DR Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        PRODUCTION (us-east-1)                        │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐          │
│  │   EKS        │  │   RDS        │  │   S3             │          │
│  │   Cluster    │  │   PostgreSQL │  │   Primary        │          │
│  │              │  │   (Primary)  │  │   Bucket         │          │
│  │  ┌────────┐  │  │              │  │                  │          │
│  │  │Capture │  │  │  ┌────────┐  │  │  ┌──────────┐    │          │
│  │  │API     │  │  │  │Primary │  │  │  │ Recordings│   │          │
│  │  │Inference│  │  │  │Replica │  │  │  │ Events    │   │          │
│  │  └────────┘  │  │  └────────┘  │  │  │ Models    │   │          │
│  └──────────────┘  └──────────────┘  └──────────────────┘          │
│           │                │                  │                      │
│           ▼                ▼                  ▼                      │
│     ┌─────────────────────────────────────────────────┐              │
│     │           Real-time Replication                  │              │
│     │  (WAL streaming + S3 cross-region replication)   │              │
│     └─────────────────────────────────────────────────┘              │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     DR SITE (us-west-2)                              │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐          │
│  │   EKS        │  │   RDS        │  │   S3             │          │
│  │   (Scaled    │  │   PostgreSQL │  │   Replica        │          │
│  │    to 0)     │  │   (Standby)  │  │   Bucket         │          │
│  │              │  │              │  │                  │          │
│  │  [Ready to   │  │  ┌────────┐  │  │  [Fully         │          │
│  │   scale up]  │  │  │Standby │  │  │   replicated]  │          │
│  │              │  │  │Replica │  │  │                  │          │
│  └──────────────┘  │  └────────┘  │  └──────────────────┘          │
│                    └──────────────┘                                  │
└─────────────────────────────────────────────────────────────────────┘

11.3 Data Replication

Database Replication

# RDS PostgreSQL cross-region read replica
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DRReadReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: surveillance-dr-replica
      DBInstanceClass: db.r6g.xlarge
      Engine: postgres
      EngineVersion: '15.4'
      SourceDBInstanceIdentifier: 
        !Sub 'arn:aws:rds:us-east-1:${AWS::AccountId}:db:surveillance-primary'
      DBSubnetGroupName: !Ref DRSubnetGroup
      VPCSecurityGroups:
        - !Ref DRSecurityGroup
      MultiAZ: false  # Standby only; enable during failover
      StorageEncrypted: true
      KmsKeyId: !Ref DRKMSKey
      BackupRetentionPeriod: 7
      DeletionProtection: true
      Tags:
        - Key: Purpose
          Value: DR-Standby
        - Key: RPO
          Value: 15min

Replication monitoring:

-- Check replication lag (run on primary)
SELECT 
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- Alert if replication lag > 5 minutes
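The threshold check as it might run in a monitoring job; a sketch assuming the `replay_lag` interval has already been converted to seconds, with the 5-minute alert threshold from above and the 15-minute RPO limit from the DR objectives:

```python
# Thresholds: alert at 5 minutes of lag, RPO breach at 15 minutes
ALERT_AFTER_S = 300.0
RPO_LIMIT_S = 900.0

def replication_status(replay_lag_seconds: float) -> str:
    """Map the standby's replay lag onto 'ok', 'alert', or 'rpo_breach'."""
    if replay_lag_seconds > RPO_LIMIT_S:
        return "rpo_breach"
    if replay_lag_seconds > ALERT_AFTER_S:
        return "alert"
    return "ok"
```

An "rpo_breach" result means a failover at that moment would lose more data than the stated 15-minute objective, so it pages at p1 rather than merely alerting.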

Object Storage Replication

S3 Cross-Region Replication (CRR) with 15-minute RPO:

  • All new objects replicated within 15 minutes
  • Replication status tracked per object
  • Failed replication events alerted

Configuration Replication

  • Terraform state stored in S3 with cross-region replication
  • Git repositories mirrored to secondary Git provider
  • Kubernetes manifests stored in Git (GitOps)

11.4 Failover Process

Automated Failover (Database — RDS)

RDS Multi-AZ provides automatic failover:

  1. Health check fails on primary
  2. RDS promotes standby to primary (typically 60-120 seconds)
  3. DNS endpoint updates automatically
  4. Application reconnects via connection pool

Manual DR Failover (Full Site)

#!/bin/bash
# dr-failover.sh — Execute full site failover to DR region

PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
FAILOVER_REASON="$1"
START_TIME=$(date +%s)  # used to report total failover time at the end

log() {
    echo "[$(date -Iseconds)] $1" | tee -a "/var/log/dr/failover-$(date +%Y%m%d).log"
}

log "=== DR FAILOVER INITIATED ==="
log "Reason: $FAILOVER_REASON"
log "From: $PRIMARY_REGION → $DR_REGION"

# 1. Verify DR environment
log "1. Verifying DR environment readiness..."
if ! aws eks describe-cluster --name surveillance-dr --region $DR_REGION > /dev/null 2>&1; then
    log "ERROR: DR EKS cluster not accessible"
    exit 1
fi

# 2. Promote DR database from standby
log "2. Promoting DR database..."
aws rds promote-read-replica \
    --db-instance-identifier surveillance-dr-replica \
    --region $DR_REGION

# Wait for promotion
aws rds wait db-instance-available \
    --db-instance-identifier surveillance-dr-replica \
    --region $DR_REGION
log "   DR database promoted successfully"

# 3. Enable Multi-AZ on DR database
log "3. Enabling Multi-AZ on DR database..."
aws rds modify-db-instance \
    --db-instance-identifier surveillance-dr-replica \
    --multi-az \
    --apply-immediately \
    --region $DR_REGION

# 4. Scale up DR EKS cluster
log "4. Scaling up DR EKS cluster..."
aws eks update-nodegroup-config \
    --cluster-name surveillance-dr \
    --nodegroup-name surveillance-workers \
    --scaling-config minSize=3,maxSize=10,desiredSize=3 \
    --region $DR_REGION

# Wait for nodes (point kubectl at the DR cluster first)
kubectl config use-context surveillance-dr
sleep 120
kubectl wait --for=condition=Ready nodes --all --timeout=300s

# 5. Deploy application to DR
log "5. Deploying application to DR..."
kubectl config use-context surveillance-dr
kubectl apply -k k8s/overlays/dr/

# Wait for deployments
kubectl wait --for=condition=available \
    --all deployments \
    --namespace surveillance \
    --timeout=600s

# 6. Update DNS to point to DR
log "6. Updating DNS to DR region..."
aws route53 change-resource-record-sets \
    --hosted-zone-id $HOSTED_ZONE_ID \
    --change-batch file://dr-dns-update.json

# 7. Verify health
log "7. Running health checks..."
for i in {1..10}; do
    if curl -f https://surveillance.company.com/health/deep > /dev/null 2>&1; then
        log "   Health check PASSED"
        break
    fi
    log "   Health check attempt $i/10..."
    sleep 10
done

# 8. Verify cameras reconnecting
log "8. Verifying camera streams..."
sleep 60
STREAM_COUNT=$(curl -s https://surveillance.company.com/api/v1/cameras/status | \
    jq '[.cameras[] | select(.status == "active")] | length')
log "   Active streams: $STREAM_COUNT/8"

# 9. Send notifications
log "9. Sending notifications..."
curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\":\"DR FAILOVER COMPLETE: Production now running in $DR_REGION. Reason: $FAILOVER_REASON. Active streams: $STREAM_COUNT/8\"}"

log "=== DR FAILOVER COMPLETE ==="
log "Total time: $(($(date +%s) - START_TIME)) seconds"

11.5 DR Testing Schedule

| Test Type | Frequency | Scope | Duration | Validation |
|-----------|-----------|-------|----------|------------|
| Backup restore drill | Monthly | Database + media | 2 hours | Data integrity verified |
| Application redeployment | Monthly | Full application stack | 1 hour | All services healthy |
| Network failover test | Quarterly | VPN, DNS | 30 min | Traffic routes correctly |
| Database failover test | Quarterly | RDS Multi-AZ promotion | 1 hour | Replication lag acceptable |
| Full DR drill | Quarterly | Complete site failover | 4 hours | All RTO/RPO met |
| Tabletop exercise | Semi-annually | Team response procedures | 2 hours | Process gaps identified |

Full DR drill procedure:

  1. Week before: Schedule drill; notify stakeholders; prepare isolated test data
  2. Day of:
    • 09:00 — Initiate failover (simulate primary region failure)
    • 09:05 — DR team executes failover runbook
    • 09:30 — Verify database is promoted and accessible
    • 10:00 — Verify application is deployed and healthy
    • 10:30 — Verify camera streams reconnect
    • 11:00 — Verify alert delivery
    • 11:30 — Run E2E test suite
    • 12:00 — Validate data integrity (sample checks)
    • 12:30 — Measure and document RTO/RPO
    • 13:00 — Initiate failback to primary
    • 14:00 — Verify primary is restored
  3. Week after: Complete DR test report; file action items

DR Test Report Template:

## DR Drill Report — 2025-Q1

| Item | Result |
|------|--------|
| Date | 2025-03-15 |
| Scenario | Complete region failure (us-east-1) |
| Failover RTO Target | 60 minutes |
| Failover RTO Achieved | 42 minutes |
| RPO Target | 15 minutes |
| RPO Achieved | 8 minutes |
| Streams Restored | 8/8 (100%) |
| Data Integrity | PASS |
| E2E Tests | 47/47 PASS |

### Issues Found
1. Camera reconnection took 18 minutes (target: <10 min) — AI-7 filed
2. Alert service required manual restart — AI-8 filed

### Action Items
| ID | Description | Owner | Due |
|----|-------------|-------|-----|
| AI-7 | Optimize camera reconnection sequence | @eng | 2025-04-01 |
| AI-8 | Fix alert service startup dependency | @sre | 2025-03-22 |

11.6 DR Readiness Checklist

Verify monthly (automated where possible):

  • DR database replication lag < 1 minute
  • S3 cross-region replication caught up
  • DR EKS cluster accessible and nodes can scale
  • Latest container images available in DR region registry
  • DR Terraform plan applies without errors (dry-run)
  • Backup integrity verified (latest full backup)
  • Failover runbook accessible and up-to-date
  • DR contact list current
  • VPN/cross-region network paths verified

12. Capacity Planning

12.1 Current Capacity Baseline (8 Cameras)

| Resource | Current Usage | Capacity | Headroom |
|----------|---------------|----------|----------|
| CPU (cloud) | 4 cores avg | 8 cores | 100% |
| Memory (cloud) | 12 GB | 32 GB | 167% |
| GPU (if used) | 40% utilization | 1x GPU | 150% |
| Storage hot tier | 6 TB | 20 TB | 233% |
| Storage warm tier | 18 TB | 50 TB | 178% |
| Database storage | 150 GB | 500 GB | 233% |
| Database connections | 25 | 100 | 300% |
| Network egress | 200 Mbps | 1 Gbps | 400% |
| Inference throughput | 240 FPS (8x30) | 480 FPS | 100% |
| Alert volume | 50/day | 500/day | 900% |
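Headroom in the table is remaining capacity expressed as a percentage of current usage. A one-line helper makes the convention explicit:

```python
def headroom_pct(current: float, capacity: float) -> int:
    """Remaining capacity as a percentage of current usage,
    e.g. 12 GB used of 32 GB gives (32-12)/12, about 167%."""
    return round((capacity - current) / current * 100)
```

Anything below roughly 50% headroom on this convention should feed into the scaling triggers in the next section.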

12.2 Scaling Triggers

| Metric | Scale-Up Trigger | Scale-Down Trigger | Action |
|--------|------------------|--------------------|--------|
| CPU utilization | > 70% for 10 minutes | < 30% for 30 minutes | Add/remove inference pods |
| Memory utilization | > 80% for 10 minutes | < 40% for 30 minutes | Add memory or pods |
| Inference latency | > 100ms p95 for 5 min | < 50ms p95 for 10 min | Scale inference horizontally |
| Queue depth | > 1000 frames | < 100 frames | Adjust consumer count |
| Storage usage | > 70% | N/A (manual) | Expand volume or archive |
| Camera count | > 8 cameras | N/A | Scale per-camera resources |

Horizontal Pod Autoscaler configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
  namespace: surveillance
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: surveillance_pipeline_latency_ms
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

12.3 Camera Addition Process

Step 1: Pre-deployment Assessment (Day -7)
├── Evaluate resource requirements
├── Verify network connectivity
├── Review camera positioning and coverage
└── Update configuration in Git

Step 2: Infrastructure Preparation (Day -3)
├── Calculate additional storage needs
├── Verify scaling headroom
├── Prepare camera configuration
└── Stage network/VPN configuration

Step 3: Deployment (Day 0)
├── Add camera to configuration
├── Deploy updated configuration
├── Verify stream connection
├── Validate AI processing
├── Test alert generation
└── Update dashboards

Step 4: Validation (Day 0-1)
├── Monitor for 24 hours
├── Verify FPS and quality
├── Confirm alerts working
├── Document in camera registry
└── Notify stakeholders

Camera addition checklist:

| Step | Item | Verification |
|------|------|--------------|
| 1 | Camera network reachable | `ping <camera_ip>` |
| 2 | RTSP stream accessible | `ffprobe rtsp://<camera>/stream` |
| 3 | VPN tunnel supports additional bandwidth | Bandwidth check |
| 4 | Configuration added to Git | PR merged |
| 5 | Stream appears in video-capture | Logs show connection |
| 6 | FPS meets target (>25) | Grafana dashboard |
| 7 | AI inference processing frames | Detection metrics |
| 8 | Alerts generated correctly | Test alert |
| 9 | Storage projections updated | Capacity review |
| 10 | Camera documented | Registry updated |

### 12.4 Per-Camera Resource Requirements

| Resource | Per Camera | 8 Cameras | 16 Cameras | 24 Cameras |
|----------|------------|-----------|------------|------------|
| CPU (inference) | 0.5 cores | 4 cores | 8 cores | 12 cores |
| Memory (processing) | 1 GB | 8 GB | 16 GB | 24 GB |
| Storage (hot, daily) | 50 GB/day | 400 GB/day | 800 GB/day | 1.2 TB/day |
| Network (ingress) | 25 Mbps | 200 Mbps | 400 Mbps | 600 Mbps |
| GPU memory | 512 MB | 4 GB | 8 GB | 12 GB |
| Database IOPS | 100 | 800 | 1,600 | 2,400 |
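Because every row scales linearly per camera, projections for intermediate fleet sizes follow directly from the per-camera column. A minimal sketch (baselines copied from the table; `project_resources` is a hypothetical helper, not an existing script):

```shell
# project_resources <camera_count> -- linear projection from the
# per-camera baselines in the table above. CPU is tracked in tenths
# of a core so the arithmetic stays in integers.
project_resources() {
  local n="$1"
  echo "cpu_cores=$(( n * 5 / 10 ))"        # 0.5 cores per camera
  echo "memory_gb=$(( n * 1 ))"             # 1 GB per camera
  echo "storage_gb_day=$(( n * 50 ))"       # 50 GB/day per camera
  echo "network_mbps=$(( n * 25 ))"         # 25 Mbps per camera
  echo "gpu_mem_gb=$(( n * 512 / 1024 ))"   # 512 MB per camera
  echo "db_iops=$(( n * 100 ))"             # 100 IOPS per camera
}

# Example: a 12-camera fleet (Phase 1 of the roadmap below)
project_resources 12
```

For 12 cameras this yields 6 cores, 12 GB RAM, 600 GB/day hot storage, 300 Mbps ingress, 6 GB GPU memory, and 1,200 IOPS; the Phase 1 row below adds headroom on top of these raw figures.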

### 12.5 Scaling Roadmap

| Phase | Cameras | Timeline | Infrastructure Changes |
|-------|---------|----------|------------------------|
| Current | 8 | Now | 3 inference pods, 8 CPU, 32 GB RAM |
| Phase 1 | 12 | Q2 2025 | 4 inference pods, 12 CPU, 48 GB RAM |
| Phase 2 | 16 | Q3 2025 | 6 inference pods, 16 CPU, 64 GB RAM, GPU add |
| Phase 3 | 24 | Q1 2026 | 8 inference pods, 24 CPU, 96 GB RAM, 2 GPU |
| Phase 4 | 32+ | Q3 2026 | Shard by location, dedicated inference cluster |

### 12.6 Performance Benchmarks

The benchmark suite is executed monthly:

```bash
#!/bin/bash
# performance-benchmark.sh

API_URL="https://surveillance.company.com"
RESULTS_FILE="/var/log/benchmarks/$(date +%Y%m%d).json"

echo "{\"timestamp\": \"$(date -Iseconds)\"," > "$RESULTS_FILE"
echo "\"benchmarks\": {" >> "$RESULTS_FILE"

# 1. Health check latency
echo "  Running health check latency test..."
HEALTH_LAT=$(curl -o /dev/null -s -w "%{time_total}" "$API_URL/health")
echo "  \"health_check_latency_ms\": $(echo "$HEALTH_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 2. Deep health check latency
echo "  Running deep health check..."
DEEP_LAT=$(curl -o /dev/null -s -w "%{time_total}" "$API_URL/health/deep")
echo "  \"deep_health_latency_ms\": $(echo "$DEEP_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 3. API response time (events list)
echo "  Running API response time test..."
API_LAT=$(curl -o /dev/null -s -w "%{time_total}" \
  "$API_URL/api/v1/events?limit=100&start=$(date -d '1 hour ago' -Iseconds)")
echo "  \"api_events_latency_ms\": $(echo "$API_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 4. Database query performance
echo "  Running database query test..."
DB_LAT=$(curl -o /dev/null -s -w "%{time_total}" \
  "$API_URL/api/v1/admin/db-performance")
echo "  \"db_query_latency_ms\": $(echo "$DB_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 5. Stream status
echo "  Checking stream status..."
STREAMS=$(curl -s "$API_URL/api/v1/cameras/status" | jq '[.cameras[] | select(.status == "active")] | length')
echo "  \"active_streams\": $STREAMS," >> "$RESULTS_FILE"

# 6. Inference latency (from Prometheus)
# --data-urlencode handles the parentheses and commas in the PromQL query.
echo "  Fetching inference metrics..."
INF_LAT=$(curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.95,rate(surveillance_model_inference_ms_bucket[5m]))" | \
  jq -r '.data.result[0].value[1] // "null"')
echo "  \"inference_p95_latency_ms\": $INF_LAT" >> "$RESULTS_FILE"

echo "}}" >> "$RESULTS_FILE"

echo "Benchmark complete. Results saved to $RESULTS_FILE"
cat "$RESULTS_FILE"
```

Benchmark history tracking:

| Date | Health (ms) | Deep Health (ms) | API (ms) | Inference P95 (ms) | Streams Active |
|------|-------------|------------------|----------|--------------------|----------------|
| 2025-01-01 | 12 | 245 | 89 | 42 | 8/8 |
| 2025-01-08 | 11 | 238 | 92 | 45 | 8/8 |
| 2025-01-15 | 15 | 520 | 156 | 78 | 7/8 (cam_03 offline) |
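The history table only catches regressions if someone compares the rows; a week-over-week check is easy to automate. A minimal sketch (the `check_regression` helper and the 50% tolerance are illustrative, not an agreed SLO):

```shell
# check_regression <metric> <baseline_ms> <current_ms> <max_increase_pct>
# Returns non-zero when the current latency exceeds the baseline by
# more than the allowed percentage (integer math, milliseconds).
check_regression() {
  local metric="$1" baseline="$2" current="$3" max_pct="$4"
  local ceiling=$(( baseline + baseline * max_pct / 100 ))
  if [ "$current" -gt "$ceiling" ]; then
    echo "REGRESSION: $metric ${baseline}ms -> ${current}ms (limit ${ceiling}ms)"
    return 1
  fi
  echo "OK: $metric ${baseline}ms -> ${current}ms"
}

# Comparing the 2025-01-15 run against the 2025-01-08 baseline:
check_regression health       11  15 50
check_regression deep_health 238 520 50 || echo "-> open an incident for deep health"
```

Against the table above, `health` passes (15 ms is within 50% of 11 ms) while `deep_health` is flagged (520 ms against a 357 ms ceiling), which matches the cam_03 incident noted in that row.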

### 12.7 Resource Request & Provisioning Workflow

```
Requestor submits capacity request
        │
        ▼
┌───────────────┐
│ SRE Review    │ ← Assess impact, feasibility, alternatives
│ (2 biz days)  │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Approval      │ ← Engineering Manager + Finance (if >$X)
│ (1 biz day)   │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Implementation│ ← SRE executes change during maintenance window
│ (scheduled)   │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Validation    │ ← Verify performance meets requirements
│ (24-48 hours) │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Close Request │ ← Document in capacity ledger
└───────────────┘
```

## Appendices

### Appendix A: Contact Directory

| Role | Name | Email | Phone | Slack |
|------|------|-------|-------|-------|
| On-Call (rotating) | See PagerDuty | oncall@company.com | Via PagerDuty | #surveillance-oncall |
| SRE Team Lead | [Name] | sre-lead@company.com | +1-555-0100 | @sre-lead |
| Engineering Manager | [Name] | eng-mgr@company.com | +1-555-0101 | @eng-mgr |
| Security Officer | [Name] | security@company.com | +1-555-0104 | @security |
| Product Owner | [Name] | product@company.com | +1-555-0105 | @product |
| VP Engineering | [Name] | vp-eng@company.com | +1-555-0102 | @vp-eng |

### Appendix B: Tooling Inventory

| Category | Tool | Version | Purpose |
|----------|------|---------|---------|
| Monitoring | Prometheus | 2.47+ | Metrics collection |
| Monitoring | Grafana | 10.0+ | Visualization |
| Monitoring | Alertmanager | 0.26+ | Alert routing |
| Logging | Elasticsearch | 8.11+ | Log storage |
| Logging | Filebeat | 8.11+ | Log shipping |
| Logging | Kibana | 8.11+ | Log visualization |
| Orchestration | Kubernetes | 1.28+ | Container orchestration |
| Packaging | Helm | 3.13+ | K8s package management |
| IaC | Terraform | 1.6+ | Infrastructure provisioning |
| GitOps | ArgoCD | 2.9+ | Continuous deployment |
| Backup | pgBackRest | 2.48+ | PostgreSQL backup |
| Secrets | Vault / AWS Secrets Manager | Latest | Secret management |
| Paging | PagerDuty | SaaS | Incident paging |
| Communication | Slack | SaaS | Team communication |

### Appendix C: Network Architecture

```
Internet
    │
    ▼
┌─────────┐    ┌─────────────┐    ┌──────────────────┐
│   CDN   │───▶│  Nginx/ALB  │───▶│  API Gateway     │
│         │    │  (TLS term) │    │  (auth/rate-lim) │
└─────────┘    └─────────────┘    └────────┬─────────┘
                                           │
                    ┌──────────────────────┼──────────────────────┐
                    │                      │                      │
                    ▼                      ▼                      ▼
            ┌──────────┐         ┌──────────────┐      ┌──────────┐
            │ Surveil- │         │   WebSocket  │      │ Grafana  │
            │ lance    │         │   Service    │      │ /Kibana  │
            │ API      │         │              │      │          │
            └────┬─────┘         └──────────────┘      └──────────┘
                 │
        ┌────────┼────────┬──────────────┐
        │        │        │              │
        ▼        ▼        ▼              ▼
   ┌──────────┐ ┌─────┐ ┌──────────┐ ┌───────────┐
   │PostgreSQL│ │Redis│ │ S3/MinIO │ │Prometheus │
   └──────────┘ └─────┘ └──────────┘ └───────────┘

    VPN Tunnel
    ══════════
    ┌──────────────┐
    │  Edge Node   │◀── RTSP ──▶ [Cameras 1-8]
    │  (local proc)│
    └──────────────┘
```

### Appendix D: Document Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-01-15 | SRE Team | Initial comprehensive operations plan covering all 12 domains |

**END OF DOCUMENT**

*This is a living document; review and update it quarterly, and after any significant infrastructure change.*