# AI Surveillance Platform — 24/7 Operations & Reliability Plan

**Version:** 1.0  
**Date:** 2025-01-15  
**Classification:** Internal — Operations & Engineering  
**System:** 8-Channel AI Surveillance Platform (Cloud + Edge)  
**Target:** Industrial-grade autonomous operations with minimal human intervention

---

## Table of Contents

1. [Monitoring & Observability](#1-monitoring--observability)
2. [Logging Strategy](#2-logging-strategy)
3. [Health Checks](#3-health-checks)
4. [Service Restart & Recovery](#4-service-restart--recovery)
5. [Backup Strategy](#5-backup-strategy)
6. [Data Retention](#6-data-retention)
7. [Storage Management](#7-storage-management)
8. [Incident Response](#8-incident-response)
9. [Upgrades & Maintenance](#9-upgrades--maintenance)
10. [Performance Optimization](#10-performance-optimization)
11. [Disaster Recovery](#11-disaster-recovery)
12. [Capacity Planning](#12-capacity-planning)

---

## Document Control

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-01-15 | SRE Team | Initial comprehensive operations plan |

### Approval

| Role | Name | Date |
|------|------|------|
| Head of Engineering | _____________ | ___/___/______ |
| Security Officer | _____________ | ___/___/______ |
| Operations Lead | _____________ | ___/___/______ |

---

## 1. Monitoring & Observability

### 1.1 Overview

The monitoring stack provides real-time visibility into all platform components, enabling proactive issue detection and rapid incident response. All metrics are collected at 15-second intervals with 15-month retention.

**Tooling Choice:** Prometheus + Grafana (primary) with Alertmanager for notification routing.

**Architecture:**
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Node      │     │  Prometheus │     │   Grafana   │
│  Exporter   │────▶│   Server    │────▶│  Dashboards │
│ (per host)  │     │  (TSDB)     │     │  (visualize)│
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────┴───────┐
                    │ Alertmanager │────▶ PagerDuty / OpsGenie / Slack
                    └──────────────┘
```

### 1.2 Metrics Collection

#### 1.2.1 System Metrics (Node Exporter + cAdvisor)

| Metric Category | Specific Metrics | Collection Interval | Retention |
|-----------------|-----------------|---------------------|-----------|
| **CPU** | Usage % per core, load average (1m/5m/15m), steal time, iowait | 15s | 15 months |
| **Memory** | Used/available/total, swap usage, OOM kills, page faults | 15s | 15 months |
| **Disk** | Usage % per volume, IOPS, read/write latency, inode usage | 15s | 15 months |
| **Network** | RX/TX bytes/packets/drops per interface, TCP connections, retransmits | 15s | 15 months |
| **Containers** | CPU/memory per container, restart count, network IO per container | 15s | 15 months |

**Prometheus scrape configuration:**
```yaml
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s
    scrape_timeout: 10s

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  - job_name: 'surveillance-api'
    static_configs:
      - targets: ['surveillance-api:8080']
    scrape_interval: 15s
    metrics_path: /metrics

  - job_name: 'ai-inference'
    static_configs:
      - targets: ['ai-inference:8080']
    scrape_interval: 15s
    metrics_path: /metrics

  - job_name: 'video-processor'
    static_configs:
      - targets: ['video-processor:8080']
    scrape_interval: 15s
    metrics_path: /metrics
```

#### 1.2.2 Application Metrics (Custom / OpenTelemetry)

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| `surveillance_fps_per_camera` | Gauge | Current FPS being processed per camera | `camera_id`, `location` |
| `surveillance_detection_rate` | Gauge | Detections per second per stream | `camera_id`, `model_version` |
| `surveillance_alert_rate` | Counter | Total alerts generated | `severity`, `camera_id`, `alert_type` |
| `surveillance_pipeline_latency_ms` | Histogram | End-to-end processing latency | `stage`, `camera_id` |
| `surveillance_frame_drop_rate` | Gauge | Percentage of frames dropped | `camera_id`, `reason` |
| `surveillance_model_inference_ms` | Histogram | AI model inference time | `model_name`, `batch_size` |
| `surveillance_stream_active` | Gauge | Whether stream is active (1/0) | `camera_id`, `source` |
| `surveillance_face_recognition_matches` | Counter | Face recognition hits/misses | `camera_id`, `match_type` |

**Application instrumentation (Python example):**
```python
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from functools import wraps
import time

# Define metrics
DETECTION_COUNTER = Counter(
    'surveillance_detections_total',
    'Total detections by type',
    ['camera_id', 'detection_type', 'model_version']
)

PIPELINE_LATENCY = Histogram(
    'surveillance_pipeline_latency_ms',
    'End-to-end pipeline latency in milliseconds',
    ['stage', 'camera_id'],
    buckets=[10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
)

CAMERA_FPS = Gauge(
    'surveillance_fps_per_camera',
    'Current FPS per camera stream',
    ['camera_id', 'location']
)

STREAM_ACTIVE = Gauge(
    'surveillance_stream_active',
    'Stream connectivity status',
    ['camera_id', 'source']
)

def track_latency(stage, camera_id):
    """Decorator to track function latency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000
                PIPELINE_LATENCY.labels(
                    stage=stage,
                    camera_id=camera_id
                ).observe(elapsed_ms)
        return wrapper
    return decorator
```
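Applied per pipeline stage, the decorator records one histogram observation per call. A minimal, dependency-free sketch of the same pattern (the `StubHistogram` class stands in for `prometheus_client.Histogram` so the example runs without the library, and `run_inference` is a hypothetical stage function):

```python
import time
from functools import wraps

class StubHistogram:
    """Stand-in for prometheus_client.Histogram: records (labels, value) pairs."""
    def __init__(self):
        self.observations = []
        self._labels = {}
    def labels(self, **labels):
        self._labels = labels
        return self
    def observe(self, value):
        self.observations.append((dict(self._labels), value))

PIPELINE_LATENCY = StubHistogram()

def track_latency(stage, camera_id):
    """Same decorator shape as the instrumentation above."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000
                PIPELINE_LATENCY.labels(stage=stage, camera_id=camera_id).observe(elapsed_ms)
        return wrapper
    return decorator

@track_latency(stage="inference", camera_id="cam_01")
def run_inference(frame):
    # Hypothetical stage function; a real one would call the model here.
    return {"detections": []}

run_inference(None)
```

Each decorated stage contributes to the same `surveillance_pipeline_latency_ms` histogram, so P95 queries in the alerting rules automatically cover every stage.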

#### 1.2.3 Business Metrics

| Metric Name | Type | Business Purpose | Alert Threshold |
|-------------|------|-----------------|-----------------|
| `surveillance_persons_detected_daily` | Counter | Daily person detection volume | Anomaly detection |
| `surveillance_unknown_persons` | Counter | Unknown/alerted persons per period | Trend analysis |
| `surveillance_alerts_sent` | Counter | Alerts successfully delivered | Delivery health |
| `surveillance_alerts_failed` | Counter | Failed alert deliveries | > 5 in 5 min = P2 |
| `surveillance_camera_uptime_pct` | Gauge | Per-camera uptime percentage | < 99% = P3 |
| `surveillance_detection_accuracy` | Gauge | Model accuracy score | < threshold = P2 |

#### 1.2.4 Error Metrics

| Metric Name | Type | Description | Severity |
|-------------|------|-------------|----------|
| `surveillance_errors_total` | Counter | Errors by type and service | All |
| `surveillance_stream_errors` | Counter | Stream connection errors | P2 if > 10/min |
| `surveillance_model_errors` | Counter | Model inference failures | P1 if > 5/min |
| `surveillance_db_errors` | Counter | Database operation failures | P1 if > 3/min |
| `surveillance_storage_errors` | Counter | Storage read/write failures | P2 if > 5/min |

### 1.3 Alerting Rules

#### 1.3.1 Critical Alerts (P1) — Page Immediately

```yaml
# /etc/prometheus/alerts/critical.yml
groups:
  - name: critical
    rules:
      - alert: AllStreamsDown
        expr: sum(surveillance_stream_active) == 0
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "ALL camera streams are down"
          description: "No active streams detected for more than 1 minute"
          runbook_url: "https://wiki.internal/runbooks/all-streams-down"

      - alert: AIPipelineDown
        expr: sum(rate(surveillance_detections_total[5m])) == 0
        for: 2m
        labels:
          severity: p1
        annotations:
          summary: "AI pipeline not producing detections"
          description: "Detection rate has been zero across all streams for 2 minutes"

      - alert: StorageFull
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.05
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "Storage critically low: {{ $labels.mountpoint }}"
          description: "Less than 5% storage remaining on {{ $labels.instance }}"

      - alert: DatabaseUnreachable
        expr: pg_up == 0
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "PostgreSQL database is unreachable"
          description: "Cannot connect to primary database"

      - alert: HighErrorRate
        expr: rate(surveillance_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: p1
        annotations:
          summary: "High error rate across services"
          description: "Error rate exceeds 10 errors per second"
```

#### 1.3.2 High Severity Alerts (P2) — Page Within 1 Hour

```yaml
# /etc/prometheus/alerts/high.yml
groups:
  - name: high
    rules:
      - alert: SingleCameraDown
        expr: surveillance_stream_active{camera_id=~"cam.*"} == 0
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Camera {{ $labels.camera_id }} is offline"
          description: "Camera stream has been down for more than 5 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.95,
          rate(surveillance_pipeline_latency_ms_bucket[5m])) > 2000
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Pipeline latency is high"
          description: "P95 latency exceeds 2000ms"

      - alert: ModelAccuracyDegraded
        expr: surveillance_detection_accuracy < 0.85
        for: 10m
        labels:
          severity: p2
        annotations:
          summary: "AI model accuracy degraded"
          description: "Detection accuracy below 85%"

      - alert: MemoryPressure
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: "Memory usage above 90%"

      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.15
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Disk space warning: {{ $labels.mountpoint }}"
          description: "Less than 15% disk space remaining"
```

#### 1.3.3 Medium Severity Alerts (P3) — Respond Within 4 Hours

```yaml
# /etc/prometheus/alerts/medium.yml
groups:
  - name: medium
    rules:
      - alert: CameraFPSLow
        expr: surveillance_fps_per_camera < 15
        for: 10m
        labels:
          severity: p3
        annotations:
          summary: "Camera {{ $labels.camera_id }} FPS below threshold"

      - alert: FrameDropsHigh
        expr: surveillance_frame_drop_rate > 0.10
        for: 10m
        labels:
          severity: p3
        annotations:
          summary: "High frame drop rate on {{ $labels.camera_id }}"

      - alert: CertificateExpiry
        expr: (ssl_certificate_expiry_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "TLS certificate expiring soon"

      - alert: BackupNotRun
        expr: time() - surveillance_last_backup_timestamp > 90000
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "Database backup has not run in 25+ hours"
```

#### 1.3.4 Low Severity Alerts (P4) — Respond Within 24 Hours

```yaml
# /etc/prometheus/alerts/low.yml
groups:
  - name: low
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 30m
        labels:
          severity: p4
        annotations:
          summary: "CPU usage high on {{ $labels.instance }}"

      - alert: ContainerRestartLoop
        expr: increase(container_restarts_total[15m]) > 3
        for: 15m
        labels:
          severity: p4
        annotations:
          summary: "Container restart loop detected"
```

### 1.4 Alertmanager Configuration

```yaml
# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@surveillance.company.com'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: '<SLACK_WEBHOOK_URL>'

# Inhibit alerts of lower severity when higher severity fires
inhibit_rules:
  - source_match:
      severity: 'p1'
    target_match:
      severity: 'p2'
    equal: ['alertname', 'instance']

route:
  receiver: 'default'
  group_by: ['alertname', 'severity', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # P1 alerts — page immediately, no grouping delay
    - match:
        severity: p1
      receiver: 'p1-critical'
      group_wait: 0s
      repeat_interval: 15m
      continue: true

    # P2 alerts — page within 1 hour
    - match:
        severity: p2
      receiver: 'p2-high'
      group_wait: 2m
      repeat_interval: 1h

    # P3 alerts — Slack + email only
    - match:
        severity: p3
      receiver: 'p3-medium'
      group_wait: 5m
      repeat_interval: 4h

    # P4 alerts — daily digest
    - match:
        severity: p4
      receiver: 'p4-low'
      group_wait: 10m
      repeat_interval: 24h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#surveillance-alerts'
        title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'p1-critical'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        severity: critical
        description: '{{ .GroupLabels.alertname }}'
    slack_configs:
      - channel: '#surveillance-critical'
        send_resolved: true
        title: 'P1 CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    email_configs:
      - to: 'oncall@company.com'
        subject: '[P1 CRITICAL] Surveillance Platform Alert'

  - name: 'p2-high'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        severity: error
    slack_configs:
      - channel: '#surveillance-alerts'
        send_resolved: true

  - name: 'p3-medium'
    slack_configs:
      - channel: '#surveillance-warnings'
        send_resolved: true

  - name: 'p4-low'
    email_configs:
      - to: 'ops-team@company.com'
        subject: '[P4 Low] Surveillance Platform — Daily Digest'
```

### 1.5 Grafana Dashboards

#### 1.5.1 Dashboard: Infrastructure Overview (ID: `infra-overview`)

```json
{
  "dashboard": {
    "title": "Infrastructure Overview",
    "tags": ["infrastructure", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage %",
        "type": "timeseries",
        "targets": [{
          "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{ instance }}"
        }],
        "alert": {
          "conditions": [{
            "evaluator": {"params": [85], "type": "gt"},
            "operator": {"type": "and"},
            "query": {"params": ["A", "5m", "now"]},
            "reducer": {"type": "avg"}
          }]
        }
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [{
          "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
          "legendFormat": "{{ instance }}"
        }]
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "targets": [{
          "expr": "100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)"
        }],
        "fieldConfig": {
          "max": 100,
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "yellow", "value": 70},
              {"color": "orange", "value": 85},
              {"color": "red", "value": 95}
            ]
          }
        }
      },
      {
        "title": "Network I/O",
        "type": "timeseries",
        "targets": [
          {"expr": "rate(node_network_receive_bytes_total[5m])", "legendFormat": "RX {{ device }}"},
          {"expr": "rate(node_network_transmit_bytes_total[5m])", "legendFormat": "TX {{ device }}"}
        ]
      },
      {
        "title": "Container Count",
        "type": "stat",
        "targets": [{
          "expr": "count(container_last_seen)"
        }]
      },
      {
        "title": "Container Restarts (15m)",
        "type": "stat",
        "targets": [{
          "expr": "increase(container_restarts_total[15m])"
        }],
        "fieldConfig": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "red", "value": 1}
            ]
          }
        }
      }
    ]
  }
}
```

#### 1.5.2 Dashboard: Camera Health (ID: `camera-health`)

| Panel | Type | Query / Data Source |
|-------|------|---------------------|
| Stream Status Grid | Stat grid (8 panels) | `surveillance_stream_active{camera_id=~"cam.*"}` |
| FPS per Camera | Timeseries | `surveillance_fps_per_camera` by `camera_id` |
| Frame Drop Rate | Timeseries | `surveillance_frame_drop_rate` by `camera_id` |
| Camera Uptime % | Gauge per camera | `avg_over_time(surveillance_stream_active[24h]) * 100` |
| Stream Error Count | Bar chart | `increase(surveillance_stream_errors[1h])` by `camera_id` |
| Last Frame Timestamp | Table | Time since last frame per camera |
| Bitrate per Stream | Timeseries | `surveillance_stream_bitrate_kbps` |

**Camera Health Score Calculation:**
```promql
# Overall camera health score (0-100): the weights sum to 100
avg(surveillance_stream_active) * 50 +
(1 - avg(surveillance_frame_drop_rate)) * 30 +
clamp_max(avg(surveillance_fps_per_camera) / 30, 1) * 20
```
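The same weighting can be sanity-checked offline. A small dependency-free sketch (the sample values are illustrative, not measured):

```python
def camera_health_score(stream_active_avg, frame_drop_avg, fps_avg, target_fps=30):
    """Weighted 0-100 health score: availability 50, frame delivery 30, FPS 20."""
    fps_ratio = min(fps_avg / target_fps, 1.0)  # clamp so FPS above target caps at full credit
    return (stream_active_avg * 50
            + (1 - frame_drop_avg) * 30
            + fps_ratio * 20)

# Example: all streams up, 2% frame drops, averaging 28 FPS against a 30 FPS target.
score = camera_health_score(1.0, 0.02, 28)
```

A fully healthy fleet scores exactly 100; a fleet with every stream down and every frame dropped scores 0, which matches the intended 0-100 scale.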

#### 1.5.3 Dashboard: AI Pipeline Performance (ID: `ai-pipeline`)

| Panel | Type | Metric |
|-------|------|--------|
| Inference Latency (P50/P95/P99) | Timeseries | `histogram_quantile(q, rate(surveillance_model_inference_ms_bucket[5m]))` for q = 0.5, 0.95, 0.99 |
| Detections per Second | Timeseries | `rate(surveillance_detections_total[5m])` |
| Model Accuracy Trend | Timeseries | `surveillance_detection_accuracy` |
| Pipeline Throughput | Stat | Total frames processed/minute |
| GPU Utilization (if applicable) | Gauge | `nvidia_gpu_utilization_gpu` |
| GPU Memory Usage | Timeseries | `nvidia_gpu_memory_used_bytes` |
| Model Load Status | Table | Current model version, load time, status |
| Batch Size Distribution | Heatmap | Inference batch sizes over time |

#### 1.5.4 Dashboard: Alert Delivery Stats (ID: `alert-delivery`)

| Panel | Type | Query |
|-------|------|-------|
| Alerts Sent Today | Stat | `increase(surveillance_alerts_sent[24h])` |
| Alerts Failed | Stat | `increase(surveillance_alerts_failed[24h])` |
| Delivery Success Rate | Gauge | `alerts_sent / (alerts_sent + alerts_failed)` |
| Alerts by Severity | Pie chart | `surveillance_alerts_sent` by `severity` |
| Alerts by Camera | Bar chart | Top cameras by alert count |
| Notification Channel Status | Table | Channel health per delivery method |
| Alert Response Time | Histogram | Time from detection to notification |

#### 1.5.5 Dashboard: Storage Usage Trends (ID: `storage-trends`)

| Panel | Type | Query |
|-------|------|-------|
| Total Storage Used | Stat | Sum of all storage volumes |
| Storage Growth Rate | Timeseries | Daily increase in bytes |
| Retention Policy Status | Table | Days remaining per retention tier |
| Media vs. Metadata Split | Pie chart | Storage breakdown by type |
| Projected Capacity Exhaustion | Stat | Days until full at current growth rate |
| Cleanup Job Status | Table | Last run, records cleaned, errors |
| Cross-Region Replication Lag | Timeseries | Replication delay in seconds |

### 1.6 On-Call Rotation

| Shift | Time (UTC) | Primary On-Call | Secondary |
|-------|-----------|-----------------|-----------|
| APAC | 00:00 — 08:00 | APAC SRE Team | EMEA Escalation |
| EMEA | 08:00 — 16:00 | EMEA SRE Team | Americas Escalation |
| Americas | 16:00 — 00:00 | Americas SRE Team | APAC Escalation |

**Escalation Policy (PagerDuty):**
1. **Notification:** Alert fires → Notify on-call engineer via PagerDuty push + SMS
2. **Acknowledge:** 5-minute acknowledge window
3. **Escalation 1:** No acknowledge → Escalate to team lead (15 min)
4. **Escalation 2:** No response → Escalate to engineering manager (30 min)
5. **Escalation 3:** No response → Escalate to VP Engineering (45 min)

---

## 2. Logging Strategy

### 2.1 Log Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────────┐
│ Application │────▶│  Filebeat   │────▶│  Logstash   │────▶│ Elasticsearch │
│ (JSON logs) │     │  (shipper)  │     │ (processor) │     │    (store)    │
└─────────────┘     └─────────────┘     └─────────────┘     └───────┬───────┘
                                                                    │
                                                             ┌──────┴──────┐
                                                             │   Kibana    │
                                                             │ (visualize) │
                                                             └─────────────┘
```

### 2.2 Log Levels

| Level | Numeric | Usage | Retention | Action |
|-------|---------|-------|-----------|--------|
| **DEBUG** | 10 | Detailed diagnostic info | 7 days | Development only |
| **INFO** | 20 | Normal operational events | 90 days | Standard operations |
| **WARNING** | 30 | Anomalous but non-critical conditions | 90 days | Monitor trends |
| **ERROR** | 40 | Operational failures, handled exceptions | 1 year | Alert if rate > threshold |
| **CRITICAL** | 50 | System-threatening failures | 1 year | Immediate P1 alert |

**Production default level:** INFO (DEBUG only enabled per-request for troubleshooting)
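One way to honor "DEBUG per-request" without lowering the global level is a context-scoped filter; a sketch using only the standard library (the `debug_enabled` variable and filter name are illustrative, not part of the platform code — the flag would typically be set from something like an `X-Debug` header at request ingress):

```python
import contextvars
import logging

# Request-scoped flag: set to True for a single request to enable DEBUG output.
debug_enabled = contextvars.ContextVar("debug_enabled", default=False)

class PerRequestDebugFilter(logging.Filter):
    """Pass DEBUG records only when the current request opted in."""
    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True
        return debug_enabled.get()

logger = logging.getLogger("surveillance.example")
logger.setLevel(logging.DEBUG)               # the logger accepts DEBUG...
handler = logging.StreamHandler()
handler.addFilter(PerRequestDebugFilter())   # ...but the filter gates it per request
logger.addHandler(handler)

# Exercising the filter directly:
f = PerRequestDebugFilter()
rec = logging.LogRecord("surveillance", logging.DEBUG, __file__, 1, "probe", None, None)
dropped = f.filter(rec)     # no opt-in in this context -> record is suppressed
debug_enabled.set(True)
passed = f.filter(rec)      # this request opted in -> record passes
```

Because `ContextVar` is task- and thread-local, enabling DEBUG for one request never floods the logs of concurrent requests.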

### 2.3 Structured Logging Format

All application logs MUST be in JSON format:

```json
{
  "timestamp": "2025-01-15T08:30:15.123456Z",
  "level": "ERROR",
  "logger": "surveillance.video_processor",
  "message": "Failed to connect to camera stream",
  "request_id": "req_abc123def456",
  "trace_id": "trace_789xyz",
  "service": "video-processor",
  "version": "2.3.1",
  "host": "edge-node-01",
  "environment": "production",
  "camera_id": "cam_03_entrance",
  "location": "main_entrance",
  "error": {
    "type": "ConnectionTimeout",
    "message": "Connection to rtsp://192.168.1.103:554/stream timed out after 10s",
    "retry_count": 3,
    "stack_trace": "..."
  },
  "context": {
    "stream_url": "rtsp://***.***.1.***:554/stream",
    "connection_duration_ms": 10000,
    "previous_disconnect": "2025-01-15T08:25:00Z"
  },
  "performance": {
    "processing_time_ms": 0.5,
    "memory_mb": 128.5
  }
}
```

**Python logging configuration:**
```python
# logging_config.py
import logging
import os
from datetime import datetime

from pythonjsonlogger import jsonlogger

class StructuredLogFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super().add_fields(log_record, record, message_dict)
        log_record['timestamp'] = datetime.utcnow().isoformat() + 'Z'
        log_record['level'] = record.levelname
        log_record['logger'] = record.name
        log_record['service'] = os.environ.get('SERVICE_NAME', 'unknown')
        log_record['version'] = os.environ.get('SERVICE_VERSION', 'unknown')
        log_record['host'] = os.environ.get('HOSTNAME', 'unknown')
        log_record['environment'] = os.environ.get('ENV', 'production')

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'json': {
            '()': StructuredLogFormatter,
            'format': '%(timestamp)s %(level)s %(message)s'
        }
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'json',
            'stream': 'ext://sys.stdout'
        },
        'file': {
            'class': 'logging.handlers.RotatingFileHandler',
            'formatter': 'json',
            'filename': '/var/log/surveillance/app.log',
            'maxBytes': 104857600,  # 100 MB
            'backupCount': 10
        }
    },
    'loggers': {
        'surveillance': {
            'level': os.environ.get('LOG_LEVEL', 'INFO'),
            'handlers': ['console', 'file'],
            'propagate': False
        }
    }
}
```
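If `python-json-logger` is unavailable, the same envelope can be produced with a stdlib-only formatter. A minimal sketch (the field set mirrors the config above; the example record values are illustrative):

```python
import json
import logging
import os
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Stdlib-only fallback: emit one JSON object per log record."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": os.environ.get("SERVICE_NAME", "unknown"),
            "environment": os.environ.get("ENV", "production"),
        }
        return json.dumps(payload)

formatter = JsonFormatter()
rec = logging.LogRecord("surveillance.video_processor", logging.ERROR,
                        __file__, 1, "stream lost: %s", ("cam_03",), None)
line = formatter.format(rec)
parsed = json.loads(line)
```

Swapping this class into `LOGGING_CONFIG['formatters']['json']['()']` keeps the rest of the configuration unchanged.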

### 2.4 Log Correlation

Every request receives a unique `request_id` and `trace_id`:

```python
import uuid
import contextvars

# Context variable for request-scoped tracing
request_id_var = contextvars.ContextVar('request_id', default=None)
trace_id_var = contextvars.ContextVar('trace_id', default=None)

def get_current_request_id() -> str:
    req_id = request_id_var.get()
    if req_id is None:
        req_id = f"req_{uuid.uuid4().hex[:16]}"
        request_id_var.set(req_id)
    return req_id

def get_current_trace_id() -> str:
    trace_id = trace_id_var.get()
    if trace_id is None:
        trace_id = f"trace_{uuid.uuid4().hex[:16]}"
        trace_id_var.set(trace_id)
    return trace_id
```

**Propagation across services:**
- HTTP: `X-Request-ID` and `X-Trace-ID` headers
- Message queue: Metadata fields in message envelope
- gRPC: Custom metadata
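For HTTP, propagation amounts to adopting inbound IDs when present and attaching the current IDs to downstream calls. A stdlib-only sketch (it redefines the two context variables from the snippet above so the example is self-contained):

```python
import contextvars
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def adopt_incoming(headers: dict) -> None:
    """Adopt IDs from an upstream caller, or mint fresh ones if absent."""
    request_id_var.set(headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex[:16]}")
    trace_id_var.set(headers.get("X-Trace-ID") or f"trace_{uuid.uuid4().hex[:16]}")

def outgoing_headers() -> dict:
    """Headers to attach to downstream HTTP calls."""
    return {
        "X-Request-ID": request_id_var.get(),
        "X-Trace-ID": trace_id_var.get(),
    }

# An upstream service passed a trace ID but no request ID:
adopt_incoming({"X-Trace-ID": "trace_789xyz"})
headers = outgoing_headers()
```

The trace ID survives the whole call chain while each hop mints its own request ID, which is what makes cross-service log correlation in Kibana possible.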

### 2.5 Log Retention Policy

| Log Category | Retention | Storage Class | Compression |
|--------------|-----------|---------------|-------------|
| Application logs (INFO+) | 90 days | Hot (SSD) 30d → Warm 60d | After 7 days |
| Error logs (ERROR+) | 1 year | Warm 90d → Cold 275d | After 30 days |
| Audit logs | 1 year | Hot 90d → Warm 180d → Cold 95d | After 90 days |
| Debug logs | 7 days | Hot only | None |
| Access logs | 90 days | Warm 30d → Cold 60d | After 30 days |
| System logs (syslog/journald) | 90 days | Warm | After 7 days |

**Elasticsearch Index Lifecycle Management (ILM):**
```json
PUT _ilm/policy/surveillance-logs
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d",
            "max_docs": 100000000
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "allocate": {
            "require": { "data": "warm" }
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": { "data": "cold" }
          },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

### 2.6 Sensitive Data Handling

**NEVER log:**
- Face embeddings or biometric data
- Full-resolution images of detected persons
- PII (names, employee IDs, phone numbers)
- Credentials, API keys, tokens, passwords
- Stream URLs with embedded credentials
- Internal network topology
- VPN configuration details

**Sanitization rules:**
```python
import re

SENSITIVE_PATTERNS = [
    (r'rtsp://[^:]+:[^@]+@', 'rtsp://***:***@'),
    (r'password[=:]\s*\S+', 'password=***'),
    (r'api[_-]?key[=:]\s*\S+', 'api_key=***'),
    (r'token[=:]\s*\S+', 'token=***'),
    (r'embedding[=:]\s*\[.*?\]', 'embedding=[REDACTED]'),
    (r'face[_-]?vector[=:]\s*\[.*?\]', 'face_vector=[REDACTED]'),
]

def sanitize_log_message(message: str) -> str:
    for pattern, replacement in SENSITIVE_PATTERNS:
        message = re.sub(pattern, replacement, message, flags=re.IGNORECASE)
    return message
```
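To guarantee every record passes through sanitization regardless of call site, the helper can be attached as a logging filter. A sketch (the pattern list is abbreviated from the full set above so the snippet runs standalone):

```python
import logging
import re

SENSITIVE_PATTERNS = [
    (r'rtsp://[^:]+:[^@]+@', 'rtsp://***:***@'),
    (r'password[=:]\s*\S+', 'password=***'),
]

def sanitize_log_message(message: str) -> str:
    for pattern, replacement in SENSITIVE_PATTERNS:
        message = re.sub(pattern, replacement, message, flags=re.IGNORECASE)
    return message

class SanitizingFilter(logging.Filter):
    """Rewrite the rendered message before any handler sees it."""
    def filter(self, record):
        record.msg = sanitize_log_message(record.getMessage())
        record.args = None  # args are already interpolated into msg
        return True

f = SanitizingFilter()
rec = logging.LogRecord("surveillance", logging.WARNING, __file__, 1,
                        "retrying rtsp://admin:secret@192.168.1.103:554/stream",
                        None, None)
f.filter(rec)
clean = rec.getMessage()
```

Attaching the filter to every handler (rather than individual loggers) ensures third-party library logs are scrubbed too.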

---

## 3. Health Checks

### 3.1 Health Check Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Health Check Endpoints                    │
│                                                             │
│  /health        → Liveness probe (Kubernetes/Docker)       │
│  /health/ready  → Readiness probe (accepting traffic)       │
│  /health/deep   → Deep health (full pipeline validation)    │
└─────────────────────────────────────────────────────────────┘
```

### 3.2 Endpoint Specifications

#### 3.2.1 Liveness Probe — `GET /health`

**Purpose:** Determine if the process is running and not deadlocked.

**Response:**
```json
{
  "status": "alive",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-api",
  "version": "2.3.1",
  "uptime_seconds": 86400
}
```

**Criteria:**
- Process is running
- Main thread is not blocked
- Returns HTTP 200 within 1 second

**Failure action:** Container orchestrator restarts the container.

**Configuration:**
```yaml
# Kubernetes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

# Docker Compose
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 10s
  timeout: 3s
  retries: 3
  start_period: 30s
```

#### 3.2.2 Readiness Probe — `GET /health/ready`

**Purpose:** Determine if the service is ready to accept traffic.

**Response:**
```json
{
  "status": "ready",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-api",
  "version": "2.3.1",
  "checks": {
    "database": {
      "status": "pass",
      "response_time_ms": 12,
      "message": "Connected to PostgreSQL primary"
    },
    "object_storage": {
      "status": "pass",
      "response_time_ms": 45,
      "message": "S3 bucket accessible"
    },
    "cache": {
      "status": "pass",
      "response_time_ms": 2,
      "message": "Redis connection OK"
    }
  }
}
```

**Criteria:**
- All required dependencies reachable
- Database connection pool has available connections
- Object storage accessible
- Cache layer accessible
- AI model loaded (for inference services)

**Failure response:** HTTP 503 with details
```json
{
  "status": "not_ready",
  "timestamp": "2025-01-15T08:30:15Z",
  "checks": {
    "database": {
      "status": "fail",
      "response_time_ms": 5000,
      "message": "Connection timeout after 5000ms"
    },
    "object_storage": { "status": "pass" },
    "cache": { "status": "pass" }
  }
}
```

**Configuration:**
```yaml
# Kubernetes
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 2
```

#### 3.2.3 Deep Health Check — `GET /health/deep`

**Purpose:** Validate the entire processing pipeline end-to-end.

**Response:**
```json
{
  "status": "healthy",
  "timestamp": "2025-01-15T08:30:15Z",
  "service": "surveillance-platform",
  "version": "2.3.1",
  "checks": {
    "database": {
      "status": "pass",
      "response_time_ms": 8,
      "details": {
        "connection": "ok",
        "query_execution": "ok",
        "replication_lag_seconds": 0
      }
    },
    "object_storage": {
      "status": "pass",
      "response_time_ms": 67,
      "details": {
        "read_test": "ok",
        "write_test": "ok",
        "list_test": "ok"
      }
    },
    "ai_model": {
      "status": "pass",
      "response_time_ms": 145,
      "details": {
        "model_loaded": true,
        "model_version": "face-detection-v2.1",
        "gpu_available": true,
        "test_inference": "ok"
      }
    },
    "streams": {
      "status": "pass",
      "details": {
        "active_streams": 8,
        "expected_streams": 8,
        "streams": [
          {"camera_id": "cam_01", "fps": 30, "status": "active"},
          {"camera_id": "cam_02", "fps": 30, "status": "active"},
          {"camera_id": "cam_03", "fps": 25, "status": "active"},
          {"camera_id": "cam_04", "fps": 30, "status": "active"},
          {"camera_id": "cam_05", "fps": 30, "status": "active"},
          {"camera_id": "cam_06", "fps": 28, "status": "active"},
          {"camera_id": "cam_07", "fps": 30, "status": "active"},
          {"camera_id": "cam_08", "fps": 30, "status": "active"}
        ]
      }
    },
    "cache": {
      "status": "pass",
      "response_time_ms": 1,
      "details": {
        "set_test": "ok",
        "get_test": "ok",
        "memory_usage_pct": 45
      }
    },
    "alert_delivery": {
      "status": "pass",
      "details": {
        "channels_tested": 3,
        "success": 3
      }
    },
    "pipeline_e2e": {
      "status": "pass",
      "response_time_ms": 523,
      "details": {
        "capture": "ok",
        "inference": "ok",
        "alert_generation": "ok",
        "storage": "ok"
      }
    }
  }
}
```

**Execution:**
- Triggered manually or by monitoring every 5 minutes
- NOT used for Kubernetes probes (too slow)
- Full pipeline validation takes 1-5 seconds

### 3.3 Dependency Health Check Matrix

| Dependency | Check Method | Timeout | Expected Result | Failure Action |
|------------|-------------|---------|-----------------|----------------|
| PostgreSQL | `SELECT 1` | 3s | Row returned | Return not_ready |
| Redis Cache | `PING` → `PONG` | 2s | PONG received | Degrade to DB only |
| S3 / Object Storage | List + Put + Get test object | 10s | All operations succeed | Queue for retry |
| AI Model | Load model + test inference | 30s | Inference completes | Report model error |
| Camera Streams | RTSP describe/ping | 10s | Stream metadata received | Mark stream offline |
| VPN Tunnel | ICMP to edge gateway | 5s | Response received | Mark edge offline |
| SMTP/Notification | TCP connect + EHLO | 5s | SMTP greeting received | Queue alerts |

### 3.4 Health Check Implementation

```python
# health.py
from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List
import time
import asyncio

class HealthStatus(Enum):
    PASS = "pass"
    FAIL = "fail"
    WARN = "warn"

@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    response_time_ms: float
    message: str
    details: Dict = field(default_factory=dict)

class HealthChecker:
    def __init__(self):
        self.checks = {}
    
    def register(self, name: str, check_func):
        self.checks[name] = check_func
    
    async def run_all(self, timeout: float = 30.0) -> List[HealthCheckResult]:
        tasks = [
            self._run_check(name, func, timeout)
            for name, func in self.checks.items()
        ]
        return await asyncio.gather(*tasks)
    
    async def _run_check(self, name: str, func, timeout: float) -> HealthCheckResult:
        start = time.monotonic()
        try:
            result = await asyncio.wait_for(func(), timeout=timeout)
            elapsed = (time.monotonic() - start) * 1000
            result.response_time_ms = round(elapsed, 2)
            return result
        except asyncio.TimeoutError:
            return HealthCheckResult(
                name=name,
                status=HealthStatus.FAIL,
                response_time_ms=timeout * 1000,
                message=f"Health check timed out after {timeout}s"
            )
        except Exception as e:
            elapsed = (time.monotonic() - start) * 1000
            return HealthCheckResult(
                name=name,
                status=HealthStatus.FAIL,
                response_time_ms=round(elapsed, 2),
                message=str(e)
            )

# Usage
health_checker = HealthChecker()

# Register checks
health_checker.register("database", check_database)
health_checker.register("object_storage", check_object_storage)
health_checker.register("ai_model", check_ai_model)
health_checker.register("streams", check_all_streams)
health_checker.register("cache", check_cache)

# FastAPI endpoint
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def liveness():
    return {"status": "alive", "timestamp": datetime.utcnow().isoformat()}

@app.get("/health/ready")
async def readiness():
    results = await health_checker.run_all(timeout=5.0)
    all_pass = all(r.status == HealthStatus.PASS for r in results)
    
    status_code = 200 if all_pass else 503
    status = "ready" if all_pass else "not_ready"
    
    return JSONResponse(
        status_code=status_code,
        content={
            "status": status,
            "timestamp": datetime.utcnow().isoformat(),
            "checks": {
                r.name: {
                    "status": r.status.value,
                    "response_time_ms": r.response_time_ms,
                    "message": r.message,
                    **r.details
                }
                for r in results
            }
        }
    )

@app.get("/health/deep")
async def deep_health():
    # Runs full pipeline check
    results = await health_checker.run_all(timeout=30.0)
    # ... similar to readiness but with pipeline_e2e
```
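The registered functions (`check_database`, `check_cache`, …) are application-specific and not shown above. A minimal sketch of the database check from the dependency matrix, with the pool injected so the sketch runs standalone — `FakePool` and the result shape here are illustrative, not the platform's actual types:

```python
# Sketch of one registered check; a real deployment would pass the app's
# asyncpg-style connection pool instead of the fake used here.
import asyncio
import time
from dataclasses import dataclass

@dataclass
class CheckResult:
    status: str              # "pass" | "fail", mirroring HealthStatus
    response_time_ms: float
    message: str = ""

async def check_database(pool, timeout: float = 3.0) -> CheckResult:
    """Run `SELECT 1` (per the dependency matrix) and fail on timeout/error."""
    start = time.monotonic()
    try:
        value = await asyncio.wait_for(pool.fetchval("SELECT 1"), timeout)
        elapsed = (time.monotonic() - start) * 1000
        if value == 1:
            return CheckResult("pass", round(elapsed, 2))
        return CheckResult("fail", round(elapsed, 2), f"unexpected result: {value!r}")
    except Exception as e:
        elapsed = (time.monotonic() - start) * 1000
        return CheckResult("fail", round(elapsed, 2), str(e) or type(e).__name__)

class FakePool:
    """Stands in for a real connection pool in this sketch."""
    async def fetchval(self, query):
        return 1

print(asyncio.run(check_database(FakePool())).status)  # → pass
```

The timeout here mirrors the 3 s budget in the matrix; each check owns its own timeout so a slow dependency cannot consume the whole readiness budget.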

---

## 4. Service Restart & Recovery

### 4.1 Service Startup Sequence

Services must start in strict dependency order; Docker Compose `depends_on` (with `condition: service_healthy`) or Kubernetes init containers enforce this ordering.

```
Phase 1: Infrastructure
  ├─ PostgreSQL (primary + replica)
  ├─ Redis Cache
  └─ MinIO / S3 Object Storage

Phase 2: Core Services
  ├─ Message Queue (RabbitMQ / NATS)
  ├─ Configuration Service
  └─ Identity/Auth Service

Phase 3: AI Pipeline
  ├─ Model Service (download & load models)
  ├─ Video Capture Service (connect to cameras)
  ├─ AI Inference Service
  └─ Post-Processing Service

Phase 4: Application Layer
  ├─ API Gateway
  ├─ Surveillance API Service
  ├─ Alert Service
  └─ WebSocket / Real-time Service

Phase 5: Frontend
  ├─ Nginx / Reverse Proxy
  └─ Web Dashboard
```

**Docker Compose startup configuration:**
```yaml
# docker-compose.yml (relevant section)
services:
  postgres:
    image: postgres:15.4@sha256:abc123...
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U surveillance"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7.2@sha256:def456...
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    depends_on:
      postgres:
        condition: service_healthy

  model-service:
    image: surveillance/model-service:2.3.1@sha256:ghi789...
    restart: unless-stopped
    environment:
      - MODEL_PATH=/models
      - DOWNLOAD_IF_MISSING=true
    volumes:
      - model-cache:/models
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 30s
      retries: 10
      start_period: 60s

  video-capture:
    image: surveillance/capture:2.3.1@sha256:jkl012...
    restart: unless-stopped
    depends_on:
      model-service:
        condition: service_healthy
    environment:
      - STREAM_RETRY_MAX=10
      - STREAM_RETRY_DELAY=5
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  ai-inference:
    image: surveillance/inference:2.3.1@sha256:mno345...
    restart: unless-stopped
    depends_on:
      video-capture:
        condition: service_healthy
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 120s

  surveillance-api:
    image: surveillance/api:2.3.1@sha256:pqr678...
    restart: unless-stopped
    depends_on:
      ai-inference:
        condition: service_healthy
    environment:
      - DATABASE_URL=postgresql://...@postgres/surveillance
      - REDIS_URL=redis://redis:6379
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s

  nginx:
    image: nginx:alpine@sha256:stu901...
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      surveillance-api:
        condition: service_healthy
```

### 4.2 Graceful Shutdown Procedure

All services must handle SIGTERM for graceful shutdown:

```python
# shutdown_handler.py
import asyncio
import signal
import logging

logger = logging.getLogger(__name__)

class GracefulShutdown:
    def __init__(self, shutdown_timeout: float = 30.0):
        self.shutdown_timeout = shutdown_timeout
        self._shutdown_event = asyncio.Event()
        self._tasks = []
    
    def register_task(self, task):
        self._tasks.append(task)
    
    async def wait_for_shutdown(self):
        await self._shutdown_event.wait()
    
    def trigger_shutdown(self):
        logger.info("Shutdown signal received, initiating graceful shutdown...")
        self._shutdown_event.set()
    
    async def shutdown(self):
        """Execute graceful shutdown sequence."""
        logger.info("Starting graceful shutdown sequence...")
        
        # 1. Stop accepting new requests/connections
        logger.info("1. Stopping request acceptance")
        await self._stop_accepting_requests()
        
        # 2. Wait for in-flight requests to complete
        logger.info("2. Waiting for in-flight requests (timeout: %.0fs)", 
                     self.shutdown_timeout)
        try:
            await asyncio.wait_for(
                self._wait_inflight_requests(),
                timeout=self.shutdown_timeout * 0.6
            )
        except asyncio.TimeoutError:
            logger.warning("In-flight requests did not complete in time")
        
        # 3. Flush buffers and complete pending writes
        logger.info("3. Flushing buffers")
        await self._flush_buffers()
        
        # 4. Close camera streams gracefully
        logger.info("4. Closing camera streams")
        await self._close_streams()
        
        # 5. Release resources
        logger.info("5. Releasing resources")
        await self._release_resources()
        
        # 6. Close database connections
        logger.info("6. Closing database connections")
        await self._close_database_connections()
        
        logger.info("Graceful shutdown complete")
    
    async def _stop_accepting_requests(self):
        # Mark service as not ready
        pass
    
    async def _wait_inflight_requests(self):
        # Wait for active request count to reach zero
        pass
    
    async def _flush_buffers(self):
        # Flush any pending log buffers, metric batches
        pass
    
    async def _close_streams(self):
        # Send RTSP TEARDOWN, release capture resources
        pass
    
    async def _release_resources(self):
        # Release GPU memory, file handles
        pass
    
    async def _close_database_connections(self):
        # Return connections to pool, close pool
        pass

def setup_signal_handlers(shutdown_manager: GracefulShutdown):
    # Call from inside the running event loop (e.g. on application startup).
    loop = asyncio.get_running_loop()
    
    def handle_signal(sig):
        logger.info("Received signal %s", sig.name)
        shutdown_manager.trigger_shutdown()
        asyncio.create_task(shutdown_manager.shutdown())
    
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, lambda s=sig: handle_signal(s))
```

**Kubernetes graceful termination:**
```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: surveillance-api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5 && curl -X POST localhost:8080/shutdown"]
```

### 4.3 Crash Recovery & Automatic Restart

| Scenario | Detection | Automatic Action | Manual Intervention |
|----------|-----------|-----------------|---------------------|
| Container exits non-zero | Docker/K8s | Restart with exponential backoff (max 5 min) | If > 5 restarts in 10 min |
| OOM killed | Kernel event | Restart with 25% memory increase (max 3x) | Review memory limits |
| Health check fails | Probe failure | Restart container | If restart loop persists |
| Node failure | Node not ready | Reschedule to healthy node | Investigate failed node |
| Camera stream disconnect | No frames received | Retry with exponential backoff | If > 30 min offline |
| AI model load failure | Inference timeout | Reload model from backup | If model corrupted |
| Database connection lost | Query timeout | Retry connection, use replica | If primary down > 5 min |
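The OOM row's limit escalation (raise the memory limit 25% per OOM restart, at most three increases) is not a built-in Docker/Kubernetes behavior; it would be applied by an operator or restart wrapper. A sketch of the arithmetic, assuming "max 3x" means at most three increases:

```python
def next_memory_limit(base_mb: int, oom_restarts: int,
                      step: float = 0.25, max_steps: int = 3) -> int:
    """Memory limit (MB) to apply after `oom_restarts` OOM kills."""
    steps = min(oom_restarts, max_steps)  # cap the escalation
    return int(base_mb * (1 + step) ** steps)

print(next_memory_limit(8192, 1))  # → 10240
print(next_memory_limit(8192, 5))  # → 16000 (capped after three increases)
```

Capping the escalation matters: an unbounded 25% growth would eventually mask a genuine memory leak instead of surfacing it for review.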

**Exponential backoff for stream reconnection:**
```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)

async def reconnect_stream(camera_id: str, max_retries: int = 100):
    base_delay = 5  # seconds
    max_delay = 300  # 5 minutes
    
    for attempt in range(1, max_retries + 1):
        delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
        jitter = random.uniform(0, delay * 0.1)
        wait_time = delay + jitter
        
        logger.info("Camera %s: Reconnect attempt %d/%d in %.1fs",
                    camera_id, attempt, max_retries, wait_time)
        await asyncio.sleep(wait_time)
        
        try:
            stream = await connect_stream(camera_id)
            logger.info("Camera %s: Reconnected successfully", camera_id)
            return stream
        except Exception as e:
            logger.warning("Camera %s: Reconnect failed: %s", camera_id, e)
    
    logger.error("Camera %s: Max retries exceeded, stream marked offline", camera_id)
    return None
```

### 4.4 Circuit Breaker Pattern

Protect against cascading failures when dependencies are down:

```python
# circuit_breaker.py
from enum import Enum
import asyncio
import time
from dataclasses import dataclass

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 3
    success_threshold: int = 2

class CircuitBreaker:
    def __init__(self, name: str, config: CircuitBreakerConfig = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0
        self.half_open_calls = 0
        self._lock = asyncio.Lock()
    
    async def call(self, func, *args, **kwargs):
        async with self._lock:
            await self._transition_state()
            
            if self.state == CircuitState.OPEN:
                raise CircuitBreakerOpen(
                    f"Circuit breaker '{self.name}' is OPEN"
                )
            
            if self.state == CircuitState.HALF_OPEN:
                if self.half_open_calls >= self.config.half_open_max_calls:
                    raise CircuitBreakerOpen(
                        f"Circuit '{self.name}' half-open limit reached"
                    )
                self.half_open_calls += 1
        
        # Execute outside lock
        try:
            result = await func(*args, **kwargs)
            await self._on_success()
            return result
        except Exception as e:
            await self._on_failure()
            raise
    
    async def _transition_state(self):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.config.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                self.success_count = 0
    
    async def _on_success(self):
        async with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.config.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0
    
    async def _on_failure(self):
        async with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.config.failure_threshold:
                self.state = CircuitState.OPEN

class CircuitBreakerOpen(Exception):
    pass
```

**Usage:**
```python
# Create breakers for each dependency
db_breaker = CircuitBreaker("database", CircuitBreakerConfig(
    failure_threshold=3,
    recovery_timeout=30.0
))

storage_breaker = CircuitBreaker("object_storage", CircuitBreakerConfig(
    failure_threshold=5,
    recovery_timeout=60.0
))

# Use in service calls
async def save_detection(detection):
    return await db_breaker.call(
        db_repository.save_detection, detection
    )

async def store_frame(frame):
    return await storage_breaker.call(
        s3_client.upload, frame
    )
```

### 4.5 Bulkhead Pattern — Resource Isolation

Isolate resources to prevent one failing component from consuming all resources:

```python
# bulkhead.py
import asyncio
from asyncio import Semaphore

class Bulkhead:
    """Limits concurrent operations per service/camera."""
    
    def __init__(self, name: str, max_concurrent: int, max_queue: int = 100):
        self.name = name
        self.semaphore = Semaphore(max_concurrent)
        self.max_queue = max_queue
        self.queue_size = 0
        self._lock = asyncio.Lock()
    
    async def execute(self, func, *args, **kwargs):
        async with self._lock:
            if self.queue_size >= self.max_queue:
                raise BulkheadFull(
                    f"Bulkhead '{self.name}' queue full ({self.max_queue})"
                )
            self.queue_size += 1
        
        try:
            async with self.semaphore:
                return await func(*args, **kwargs)
        finally:
            async with self._lock:
                self.queue_size -= 1

class BulkheadFull(Exception):
    pass

# Per-camera bulkheads to isolate failures
camera_bulkheads = {
    f"cam_{i:02d}": Bulkhead(f"cam_{i:02d}", max_concurrent=4)
    for i in range(1, 9)
}

# Per-service bulkheads
db_bulkhead = Bulkhead("database", max_concurrent=20)
storage_bulkhead = Bulkhead("storage", max_concurrent=10)
inference_bulkhead = Bulkhead("inference", max_concurrent=8)
```

### 4.6 Recovery State Persistence

Critical state is persisted to survive restarts:

| State Type | Storage | Recovery Action |
|------------|---------|-----------------|
| Camera configurations | PostgreSQL | Reload on startup |
| Alert rules | PostgreSQL | Reload on startup |
| Processing offsets | Redis | Resume from last offset |
| In-flight detections | Redis → PostgreSQL | Replay from queue |
| Model version | Object Storage | Load specified version |
| Stream connection state | Local file | Attempt reconnection |
| Audit log buffer | Local file → Async flush | Recover unflushed entries |
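For the processing-offsets row, persistence can be a simple keyed set/get. A sketch with the client injected — a dict-backed fake here; a production deployment would pass a Redis client exposing the same `get`/`set` surface (names illustrative):

```python
class OffsetStore:
    """Persist the last-processed frame offset per camera so a restarted
    capture service resumes instead of reprocessing."""

    def __init__(self, client, prefix: str = "offset:"):
        self.client = client  # any object with get/set (redis-style)
        self.prefix = prefix

    def commit(self, camera_id: str, offset: int) -> None:
        self.client.set(f"{self.prefix}{camera_id}", str(offset))

    def resume_from(self, camera_id: str) -> int:
        raw = self.client.get(f"{self.prefix}{camera_id}")
        return int(raw) if raw is not None else 0

class FakeRedis(dict):
    """Stands in for a real Redis client in this sketch."""
    def set(self, key, value): self[key] = value
    def get(self, key): return super().get(key)

store = OffsetStore(FakeRedis())
store.commit("cam_01", 4822)
print(store.resume_from("cam_01"))  # → 4822
print(store.resume_from("cam_02"))  # → 0 (fresh start)
```

Committing the offset after, not before, the downstream write keeps recovery at-least-once: a crash between write and commit replays one frame rather than dropping it.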


---

## 5. Backup Strategy

### 5.1 Backup Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        BACKUP PIPELINE                          │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │   PostgreSQL │───▶│   pgBackRest │───▶│  S3 (Primary)    │  │
│  │   (Primary)  │    │  (Full/Incr) │    │  us-east-1       │  │
│  └──────────────┘    └──────────────┘    └────────┬─────────┘  │
│                                                     │           │
│                              ┌──────────────────────┘           │
│                              │                                  │
│                              ▼                                  │
│                    ┌──────────────────┐                         │
│                    │  S3 (Secondary)  │  Cross-region           │
│                    │  us-west-2       │  replication            │
│                    └──────────────────┘                         │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │   Object     │───▶│   S3 Cross   │───▶│  Glacier Deep    │  │
│  │   Storage    │    │   Region     │    │  Archive         │  │
│  │   Bucket     │    │   Replication│    │  (7-year)        │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │ Infrastructure│───▶│    Git       │───▶│  Encrypted Git   │  │
│  │   Config     │    │   Repository │    │  Backups         │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

### 5.2 PostgreSQL Backup (pgBackRest)

**Tool:** pgBackRest 2.48+ with S3 integration

**Backup Schedule:**

| Backup Type | Frequency | Start Time (UTC) | Retention |
|-------------|-----------|-----------------|-----------|
| Full backup | Weekly | Sunday 02:00 | 12 weeks |
| Differential | Daily (Mon-Sat) | 02:00 | 30 days |
| WAL archiving | Continuous | Real-time | 30 days |
| Manual backup | On-demand | Any | 90 days |

**pgBackRest configuration:**
```ini
# /etc/pgbackrest/pgbackrest.conf
[surveillance]
pg1-path=/var/lib/postgresql/15/main
pg1-port=5432

[global]
repo1-type=s3
repo1-s3-region=us-east-1
repo1-s3-bucket=surveillance-db-backups
repo1-s3-key=<ACCESS_KEY>
repo1-s3-key-secret=<SECRET_KEY>
repo1-s3-endpoint=s3.amazonaws.com
repo1-path=/pgbackrest
repo1-retention-full=12
repo1-retention-diff=30
repo1-retention-archive=30

# Encryption
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<STRONG_PASSPHRASE>

# Performance
process-max=4
compress-type=zst
compress-level=6

# Logging
log-level-file=detail
log-path=/var/log/pgbackrest

# Notifications: pgBackRest has no built-in command hooks; backup status
# alerts are sent by the cron wrapper/verification scripts instead
```

**Backup cron schedule:**
```bash
# /etc/cron.d/pgbackrest
# Full backup every Sunday at 2 AM UTC
0 2 * * 0 postgres /usr/bin/pgbackrest --stanza=surveillance backup --type=full

# Differential backup daily at 2 AM UTC (Mon-Sat)
0 2 * * 1-6 postgres /usr/bin/pgbackrest --stanza=surveillance backup --type=diff

# Verify latest backup at 6 AM UTC daily
0 6 * * * postgres /usr/bin/pgbackrest --stanza=surveillance verify
```

**WAL archiving configuration (postgresql.conf):**
```ini
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=surveillance archive-push %p'
max_wal_senders = 3
wal_keep_size = 1GB
```

### 5.3 Backup Retention Schedule

```
Timeline:
Day 1-30:    Daily backups available (full + diffs)
Week 1-12:   Weekly full backups
Month 1-12:  Monthly full backups (last Sunday of each month)
Year 1-7:    Annual snapshot in Glacier Deep Archive
```

| Tier | Frequency | Copies Kept | Storage Class | Location |
|------|-----------|-------------|---------------|----------|
| Daily (hot) | Every 24h | 30 | S3 Standard | Primary region |
| Weekly (warm) | Every Sunday | 12 | S3 Standard-IA | Primary region |
| Monthly (cold) | Last Sunday | 12 | S3 Glacier Flexible | Primary region |
| Annual (archive) | Year-end | 7 | S3 Glacier Deep Archive | Cross-region |
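The tiering reads as an age-to-tier mapping; an illustrative helper (day boundaries taken from the table, with 12 weeks ≈ 84 days):

```python
def backup_tier(age_days: int) -> str:
    """Map a backup's age to its storage tier per the schedule above."""
    if age_days <= 30:
        return "daily/S3 Standard"
    if age_days <= 84:            # 12 weekly fulls
        return "weekly/S3 Standard-IA"
    if age_days <= 365:           # 12 monthly fulls
        return "monthly/S3 Glacier Flexible"
    return "annual/S3 Glacier Deep Archive"

print(backup_tier(10))   # → daily/S3 Standard
print(backup_tier(200))  # → monthly/S3 Glacier Flexible
```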

### 5.4 Object Storage Backup

**Cross-region replication:**
```json
// S3 bucket replication configuration
{
  "Role": "arn:aws:iam::ACCOUNT:role/S3ReplicationRole",
  "Rules": [
    {
      "ID": "surveillance-media-replication",
      "Status": "Enabled",
      "Priority": 1,
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Filter": {
        "And": {
          "Prefix": "media/",
          "Tag": {
            "Key": "replicate",
            "Value": "true"
          }
        }
      },
      "Destination": {
        "Bucket": "arn:aws:s3:::surveillance-media-backup-west",
        "StorageClass": "STANDARD_IA",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": { "Minutes": 15 }
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": { "Minutes": 15 }
        },
        "EncryptionConfiguration": {
          "ReplicaKmsKeyID": "arn:aws:kms:us-west-2:ACCOUNT:key/KEY-ID"
        }
      },
      "SourceSelectionCriteria": {
        "SseKmsEncryptedObjects": { "Status": "Enabled" }
      }
    }
  ]
}
```

**Lifecycle policy for media storage:**
```json
{
  "Rules": [
    {
      "ID": "media-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "media/recordings/" },
      "Transitions": [
        {
          "Days": 7,
          "StorageClass": "INTELLIGENT_TIERING"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": { "Days": 2555 }
    },
    {
      "ID": "event-data-lifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "events/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
```

### 5.5 Configuration Backup

All infrastructure configuration is stored as code in Git:

```
surveillance-ops/
├── terraform/
│   ├── main.tf                 # Main infrastructure
│   ├── variables.tf            # Environment variables
│   ├── outputs.tf              # Output definitions
│   ├── modules/
│   │   ├── vpc/                # Network configuration
│   │   ├── eks/                # Kubernetes cluster
│   │   ├── rds/                # PostgreSQL instances
│   │   └── s3/                 # Object storage
│   └── environments/
│       ├── production/         # Production config
│       └── dr/                 # DR site config
├── kubernetes/
│   ├── base/                   # Kustomize base resources
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── postgres/
│   │   ├── redis/
│   │   ├── api/
│   │   ├── inference/
│   │   └── capture/
│   └── overlays/
│       ├── production/
│       ├── staging/
│       └── dr/
├── docker-compose/
│   ├── docker-compose.yml      # Edge deployment
│   └── .env.example
├── ansible/
│   ├── playbook.yml            # Host provisioning
│   └── inventory/
├── monitoring/
│   ├── prometheus/
│   ├── grafana-dashboards/
│   └── alertmanager/
└── docs/
    ├── runbooks/
    ├── postmortems/
    └── architecture/
```

**Git backup to secondary provider:**
```bash
#!/bin/bash
# /usr/local/bin/backup-git-repos.sh
# Mirrors all critical repos to secondary Git provider

set -euo pipefail

REPOS=(
  "git@github.com:company/surveillance-ops.git"
  "git@github.com:company/surveillance-app.git"
  "git@github.com:company/surveillance-models.git"
)

BACKUP_REMOTE="git@gitlab-backup.company.com:surveillance"
DATE=$(date +%Y%m%d)
mkdir -p /backup/git

for repo in "${REPOS[@]}"; do
  name=$(basename "$repo" .git)
  echo "Backing up $name..."

  git clone --mirror "$repo" "/tmp/$name-mirror"

  # Push to backup remote
  git -C "/tmp/$name-mirror" remote add backup "$BACKUP_REMOTE/$name.git" 2>/dev/null || true
  git -C "/tmp/$name-mirror" push backup --mirror

  # Create dated archive
  tar czf "/backup/git/$name-$DATE.tar.gz" -C /tmp "$name-mirror"

  rm -rf "/tmp/$name-mirror"
done

# Upload to S3
aws s3 sync /backup/git/ "s3://surveillance-config-backups/git/" --storage-class STANDARD_IA
```

### 5.6 Encryption

| Data at Rest | Encryption Method | Key Management |
|-------------|-------------------|----------------|
| PostgreSQL backups | AES-256-CBC (pgBackRest native) | AWS KMS CMK |
| S3 object storage | SSE-KMS | AWS KMS CMK with automatic rotation |
| Configuration backups | ChaCha20-Poly1305 (age tool) | Keys stored on YubiKey hardware tokens |
| Log archives | SSE-S3 (AES-256) | AWS managed |

**KMS key policy:**
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Allow pgBackRest",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT:role/BackupServiceRole"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey*"
      ],
      "Resource": "*"
    }
  ]
}
```

### 5.7 Backup Verification

**Automated integrity checks (daily at 06:00 UTC):**
```bash
#!/bin/bash
# /usr/local/bin/verify-backup.sh

set -euo pipefail

STANZA="surveillance"
LOG_FILE="/var/log/backup/verify-$(date +%Y%m%d).log"
ALERT_WEBHOOK="https://hooks.slack.com/services/..."

log() {
    echo "[$(date -Iseconds)] $1" | tee -a "$LOG_FILE"
}

# 1. Verify latest backup exists
LATEST=$(pgbackrest --stanza=$STANZA info --output=json | jq -r '.[0].backup[-1].label')
if [ -z "$LATEST" ] || [ "$LATEST" = "null" ]; then
    log "ERROR: No backup found!"
    curl -X POST -H 'Content-type: application/json' \
        --data '{"text":"CRITICAL: No database backup found!"}' \
        "$ALERT_WEBHOOK"
    exit 1
fi

log "Latest backup: $LATEST"

# 2. Verify backup integrity
if ! pgbackrest --stanza=$STANZA verify --set=$LATEST >> "$LOG_FILE" 2>&1; then
    log "ERROR: Backup integrity check failed for $LATEST"
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"CRITICAL: Backup integrity check failed for $LATEST\"}" \
        "$ALERT_WEBHOOK"
    exit 1
fi

# 3. Check WAL archive continuity
MISSING=$(pgbackrest --stanza=$STANZA verify 2>&1 | grep -c "missing" || true)
if [ "$MISSING" -gt 0 ]; then
    log "WARNING: $MISSING WAL files missing"
fi

# 4. Verify S3 accessibility
if ! aws s3 ls "s3://surveillance-db-backups/pgbackrest/" > /dev/null 2>&1; then
    log "ERROR: Cannot access S3 backup bucket"
    exit 1
fi

# 5. Check backup age (timestamp.stop is a Unix epoch in the JSON output)
BACKUP_STOP=$(pgbackrest --stanza=$STANZA info --output=json | \
    jq -r '.[0].backup[-1].timestamp.stop')
BACKUP_AGE_SEC=$(( $(date +%s) - BACKUP_STOP ))

if [ "$BACKUP_AGE_SEC" -gt 90000 ]; then  # > 25 hours
    log "WARNING: Latest backup is older than 25 hours"
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"WARNING: Latest backup is $((BACKUP_AGE_SEC / 3600)) hours old\"}" \
        "$ALERT_WEBHOOK"
fi

log "Backup verification completed successfully"
```

### 5.8 Restore Procedures

#### 5.8.1 Point-in-Time Recovery (PITR)

```bash
#!/bin/bash
# restore-pitr.sh — Restore to specific point in time

STANZA="surveillance"
TARGET_TIME="$1"  # e.g., "2025-01-15 08:30:00"

# Stop application
kubectl scale deployment surveillance-api --replicas=0

# Stop PostgreSQL (the cluster must be down before restore)
systemctl stop postgresql

# Restore from backup
pgbackrest --stanza=$STANZA restore \
    --type=time \
    --target="$TARGET_TIME" \
    --target-action=promote \
    --delta

# Start PostgreSQL; recovery replays WAL to the target time, then promotes
systemctl start postgresql

# Verify database
psql -U surveillance -d surveillance -c "SELECT pg_last_xact_replay_timestamp();"

# Restart application
kubectl scale deployment surveillance-api --replicas=3

# Verify application health
curl -f http://surveillance-api:8080/health/ready
```

#### 5.8.2 Full Disaster Recovery

```bash
#!/bin/bash
# restore-full.sh — Complete database restoration to new instance

STANZA="surveillance"
NEW_DATA_DIR="/var/lib/postgresql/15/main"

# 1. Install PostgreSQL (same version as backup)
apt-get install -y postgresql-15

# 2. Stop PostgreSQL
systemctl stop postgresql

# 3. Clear data directory (the glob must stay outside the quotes)
rm -rf "${NEW_DATA_DIR:?}"/*

# 4. Restore the latest full backup (the latest set is the default)
pgbackrest --stanza=$STANZA restore \
    --type=immediate

# 5. Start PostgreSQL
systemctl start postgresql

# 6. Verify
pgbackrest --stanza=$STANZA check

# 7. Run consistency check
psql -U surveillance -d surveillance -c "SELECT count(*) FROM events;"
psql -U surveillance -d surveillance -c "SELECT pg_database_size('surveillance');"
```

### 5.9 Monthly Restore Drill

**Schedule:** First Saturday of each month at 02:00 UTC

**Procedure:**
1. Provision isolated restore environment (separate namespace/VM)
2. Restore latest full backup
3. Apply differential backups
4. Verify data integrity (row counts, checksums)
5. Run application smoke tests
6. Verify media files accessible
7. Document results in restore log
8. Tear down restore environment

**Restore drill checklist:**
```markdown
## Restore Drill — 2025-01-04
- [x] Isolated environment provisioned
- [x] Full backup restored (duration: 23 min)
- [x] Differential backup applied (duration: 4 min)
- [x] WAL replay completed (duration: 12 min)
- [x] Database row counts verified
  - events: 12,456,789 (expected: 12,456,789) ✓
  - cameras: 8 (expected: 8) ✓
  - alerts: 1,234 (expected: 1,234) ✓
- [x] Application smoke tests passed
- [x] Media file accessibility verified (100/100 random samples)
- [x] Total RTO: 41 minutes (target: < 60 min) ✓
- [x] Total RPO: 8 minutes (target: < 15 min) ✓
- [x] Environment cleaned up

**Notes:** WAL replay was slower than usual due to high write volume on Jan 3.
```

---

## 6. Data Retention

### 6.1 Retention Policy Matrix

| Data Category | Retention Period | Action After Retention | Legal Basis |
|---------------|-----------------|------------------------|-------------|
| **Raw video recordings** | 90 days (configurable) | Delete or archive to cold storage | Operational necessity |
| **Event clips (alerts)** | 1 year | Archive to cold storage for 2 additional years | Incident investigation |
| **Detection metadata** | 1 year | Anonymize & aggregate | Analytics |
| **Audit logs** | 1 year | Archive for 6 additional years | Compliance |
| **System health logs** | 90 days | Delete | Operational monitoring |
| **Access logs** | 90 days | Delete | Security monitoring |
| **Face embeddings (enrolled)** | Indefinite until deleted | User-initiated deletion | Authorized personnel database |
| **Face embeddings (detected)** | Never stored | N/A — computed and discarded immediately | Privacy by design |
| **Alert history** | 2 years | Archive | Incident reference |
| **Training data** | Indefinite | Explicit deletion by admin | AI model improvement |
| **Configuration history** | 2 years | Archive | Change tracking |
| **Backup archives** | 7 years (Glacier) | Delete per backup schedule | Disaster recovery |

### 6.2 Automated Cleanup Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    Data Lifecycle Manager                     │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   Retention  │  │   Cleanup    │  │   Archive        │   │
│  │   Policy     │──│   Executor   │──│   Manager        │   │
│  │   Engine     │  │   (CronJob)  │  │   (S3/Glacier)   │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
│         │                 │                   │              │
│         ▼                 ▼                   ▼              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  PostgreSQL  │  │  S3 Object   │  │   Elasticsearch  │   │
│  │  (metadata)  │  │  Storage     │  │   (logs)         │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
└──────────────────────────────────────────────────────────────┘
```

### 6.3 Cleanup Job Implementation

```python
# retention_manager.py
from datetime import datetime, timedelta
from typing import List, Optional
import asyncio
import logging

logger = logging.getLogger(__name__)

class RetentionPolicy:
    def __init__(self, name: str, retention_days: int, archive_first: bool = False,
                 archive_days: int = 0, anonymize: bool = False):
        self.name = name
        self.retention_days = retention_days
        self.archive_first = archive_first
        self.archive_days = archive_days
        self.anonymize = anonymize

class DataRetentionManager:
    def __init__(self):
        self.policies = {}
    
    def register_policy(self, policy: RetentionPolicy):
        self.policies[policy.name] = policy
    
    async def execute_cleanup(self, policy_name: str, dry_run: bool = False):
        policy = self.policies.get(policy_name)
        if not policy:
            raise ValueError(f"Unknown policy: {policy_name}")
        
        cutoff_date = datetime.utcnow() - timedelta(days=policy.retention_days)
        logger.info("Executing cleanup for '%s' (cutoff: %s)", 
                     policy_name, cutoff_date.isoformat())
        
        if dry_run:
            count = await self._count_eligible(policy_name, cutoff_date)
            logger.info("[DRY RUN] Would delete %d records", count)
            return count
        
        archived = anonymized = deleted = 0

        # Archive before delete if configured
        if policy.archive_first:
            archive_cutoff = datetime.utcnow() - timedelta(
                days=policy.retention_days + policy.archive_days
            )
            archived = await self._archive_records(policy_name, cutoff_date, archive_cutoff)
            logger.info("Archived %d records", archived)
        
        # Anonymize or delete expired records (mutually exclusive per policy)
        if policy.anonymize:
            anonymized = await self._anonymize_records(policy_name, cutoff_date)
            logger.info("Anonymized %d records", anonymized)
        else:
            deleted = await self._delete_records(policy_name, cutoff_date)
            logger.info("Deleted %d records", deleted)
        
        return {"archived": archived, "anonymized": anonymized, "deleted": deleted}

# Register policies
retention = DataRetentionManager()
retention.register_policy(RetentionPolicy("raw_video", retention_days=90, archive_first=True, archive_days=180))
retention.register_policy(RetentionPolicy("event_clips", retention_days=365, archive_first=True, archive_days=730))
retention.register_policy(RetentionPolicy("detection_metadata", retention_days=365, anonymize=True))
retention.register_policy(RetentionPolicy("audit_logs", retention_days=365, archive_first=True, archive_days=2190))
retention.register_policy(RetentionPolicy("system_logs", retention_days=90))
retention.register_policy(RetentionPolicy("access_logs", retention_days=90))
```
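
The `_count_eligible`, `_archive_records`, `_anonymize_records`, and `_delete_records` hooks are assumed to be implemented against PostgreSQL and S3 elsewhere in the module. A self-contained toy — classes redefined locally and the count hook stubbed so the snippet runs on its own — illustrates the dry-run flow:

```python
import asyncio
from datetime import datetime, timedelta

class RetentionPolicy:
    def __init__(self, name: str, retention_days: int):
        self.name = name
        self.retention_days = retention_days

class StubRetentionManager:
    """Toy stand-in for DataRetentionManager with a stubbed count hook."""
    def __init__(self):
        self.policies = {}

    def register_policy(self, policy):
        self.policies[policy.name] = policy

    async def _count_eligible(self, policy_name, cutoff):
        # Real implementation: SELECT count(*) ... WHERE created_at < cutoff
        return 42

    async def execute_cleanup(self, policy_name, dry_run=False):
        policy = self.policies[policy_name]
        cutoff = datetime.utcnow() - timedelta(days=policy.retention_days)
        if dry_run:
            # Dry run reports what would be deleted without touching data
            return await self._count_eligible(policy_name, cutoff)

mgr = StubRetentionManager()
mgr.register_policy(RetentionPolicy("system_logs", retention_days=90))
print(asyncio.run(mgr.execute_cleanup("system_logs", dry_run=True)))  # 42
```

Running every new policy with `dry_run=True` first (as the CronJob's `DRY_RUN` env var supports) is the cheapest guard against a misconfigured cutoff.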

**Kubernetes CronJob:**
```yaml
# cleanup-job.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-retention-cleanup
  namespace: surveillance
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM UTC
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: surveillance/retention-manager:2.3.1
              command:
                - python
                - -m
                - retention_manager
                - --execute-all
                - --notify
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
                - name: S3_BUCKET
                  value: surveillance-media
                - name: DRY_RUN
                  value: "false"
              resources:
                requests:
                  cpu: 100m
                  memory: 256Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
          restartPolicy: OnFailure
```

### 6.4 Archive to Cold Storage

Before deletion, data is moved to cost-effective cold storage:

| Stage | Storage Class | Cost Factor | Access Time |
|-------|--------------|-------------|-------------|
| Active | S3 Standard | 1x | Immediate |
| 7 days | S3 Intelligent-Tiering | 0.8x | Immediate |
| 90 days | S3 Glacier Instant Retrieval | 0.2x | Milliseconds |
| 1 year | S3 Glacier Flexible Retrieval | 0.08x | Minutes-hours |
| 2 years | S3 Glacier Deep Archive | 0.04x | 12-48 hours |
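
The tier schedule above can also be enforced declaratively with an S3 lifecycle configuration, alongside or instead of the imperative archive script. A sketch of the rule document (bucket prefix matches the script below; the 2555-day expiration matches the 7-year backup ceiling):

```python
import json

# Lifecycle rule document accepted by
# `aws s3api put-bucket-lifecycle-configuration` (or boto3's equivalent call).
lifecycle = {
    "Rules": [
        {
            "ID": "surveillance-media-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "recordings/"},
            "Transitions": [
                {"Days": 7,   "StorageClass": "INTELLIGENT_TIERING"},
                {"Days": 90,  "StorageClass": "GLACIER_IR"},
                {"Days": 365, "StorageClass": "GLACIER"},
                {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},
            ],
            # 7-year ceiling, per the backup retention policy
            "Expiration": {"Days": 2555},
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
```

Apply with `aws s3api put-bucket-lifecycle-configuration --bucket surveillance-media --lifecycle-configuration file://lifecycle.json`. Lifecycle rules transition objects without the per-object copy loop, which matters at this object count.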

**Archive process:**
```bash
#!/bin/bash
# archive-old-media.sh

BUCKET="surveillance-media"
RETENTION_DAYS=90
CUTOFF=$(date -d "$RETENTION_DAYS days ago" +%Y-%m-%d)

# 1. Identify files to archive
aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "recordings/" \
    --query "Contents[?LastModified<='$CUTOFF'].Key" \
    --output text | tr '\t' '\n' > /tmp/archive-list.txt
# (text output joins keys with tabs; tr makes them one per line for the loop)

# 2. Move to Glacier
while IFS= read -r key; do
    aws s3api copy-object \
        --copy-source "${BUCKET}/${key}" \
        --bucket "$BUCKET" \
        --key "$key" \
        --storage-class GLACIER_IR \
        --metadata-directive COPY
done < /tmp/archive-list.txt

# 3. Log archival
aws s3 cp /tmp/archive-list.txt \
    "s3://${BUCKET}/archive-logs/archive-$(date +%Y%m%d).txt"

# 4. Notify
echo "Archived $(wc -l < /tmp/archive-list.txt) files to Glacier IR"
```

### 6.5 Right to Deletion

For privacy compliance (GDPR/CCPA), implement data subject deletion:

```python
async def delete_subject_data(subject_id: str):
    """
    Complete deletion of a data subject:
    1. Remove from enrolled persons database
    2. Delete associated face embeddings
    3. Remove references from detection logs
    4. Delete related event clips
    5. Log deletion for audit
    """
    async with db.transaction():
        # 1. Delete enrolled person
        await db.execute(
            "DELETE FROM enrolled_persons WHERE id = $1",
            subject_id
        )
        
        # 2. Delete embeddings (separate table for encryption)
        await db.execute(
            "DELETE FROM face_embeddings WHERE person_id = $1",
            subject_id
        )
        
        # 3. Anonymize detection references
        await db.execute(
            """UPDATE detections 
                SET person_id = NULL, 
                    person_name = '[REDACTED]',
                    face_embedding = NULL
                WHERE person_id = $1""",
            subject_id
        )
        
        # 4. Queue related event clips for deletion
        clips = await db.fetch(
            "SELECT storage_path FROM event_clips WHERE person_id = $1",
            subject_id
        )
        for clip in clips:
            await s3.delete_object(clip['storage_path'])
        
        # 5. Audit log
        await db.execute(
            """INSERT INTO deletion_audit_log 
                (subject_id, deleted_at, deleted_by, reason)
                VALUES ($1, NOW(), $2, 'data_subject_request')""",
            subject_id, current_user_id()
        )
```

---

## 7. Storage Management

### 7.1 Storage Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    Storage Architecture                       │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   Hot Tier   │  │   Warm Tier  │  │   Cold Tier      │   │
│  │   (NVMe/SSD) │  │   (HDD/S3)   │  │   (Glacier)      │   │
│  │              │  │              │  │                  │   │
│  │  Current     │  │  30-90 day   │  │  90+ day media   │   │
│  │  recordings  │  │  recordings  │  │  long-term       │   │
│  │  Active DB   │  │  Event clips │  │  archive         │   │
│  │  Cache       │  │  90-day logs │  │  compliance      │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
│                                                              │
│  Edge Node (local)  ←── VPN ──→  Cloud (S3/EBS/EFS)        │
└──────────────────────────────────────────────────────────────┘
```

### 7.2 Storage Capacity Planning (8 Camera Baseline)

| Data Type | Daily Volume | Compression | Storage/day | Monthly |
|-----------|-------------|-------------|-------------|---------|
| Raw video (8x 1080p@30fps, H.265) | ~800 GB | 50% | ~400 GB | ~12 TB |
| Event clips (alerts) | ~5 GB | None | ~5 GB | ~150 GB |
| Detection metadata | ~500 MB | None | ~500 MB | ~15 GB |
| Audit logs | ~100 MB | 70% | ~30 MB | ~1 GB |
| System metrics | ~200 MB | 80% | ~40 MB | ~1.2 GB |
| Database | ~50 MB | N/A | ~50 MB | ~1.5 GB |
| Model checkpoints | N/A | N/A | N/A | ~2 GB |
| **Total** | | | **~406 GB/day** | **~12.2 TB/month** |

**Annual raw capacity requirement:** ~146 TB  
**With 90-day retention + archive:** ~40 TB hot/warm + ~110 TB cold  
**Recommended provisioned capacity:** 200 TB (with 50% growth headroom)
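
A quick sanity check on the table: the ~800 GB/day raw figure corresponds to roughly a 9-10 Mbit/s H.265 stream per camera, a plausible encoder setting for 1080p@30fps. A minimal calculator sketch:

```python
def daily_volume_gb(cameras: int, bitrate_mbps: float) -> float:
    """Raw storage per day in GB (decimal) for constant-bitrate streams."""
    bytes_per_day = cameras * bitrate_mbps * 1e6 / 8 * 86_400
    return bytes_per_day / 1e9

raw = daily_volume_gb(cameras=8, bitrate_mbps=9.3)
stored = raw * 0.5  # 50% savings from the compression pipeline (section 7.5)
print(f"raw = {raw:.0f} GB/day; stored = {stored:.0f} GB/day "
      f"= {stored * 30 / 1000:.1f} TB/month")
```

Re-running this with actual camera bitrates (cameras rarely all stream at the same rate) keeps the capacity plan honest as encoder settings change.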

### 7.3 Storage Monitoring & Alerting

**Prometheus rules:**
```yaml
groups:
  - name: storage-alerts
    rules:
      - alert: StorageWarning70
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.30
        for: 5m
        labels:
          severity: p4
        annotations:
          summary: "Storage at 70% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: StorageHigh85
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
        for: 2m
        labels:
          severity: p2
        annotations:
          summary: "Storage at 85% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: StorageCritical95
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
        for: 1m
        labels:
          severity: p1
        annotations:
          summary: "Storage CRITICAL at 95% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: S3BucketSizeGrowth
        expr: predict_linear(aws_s3_bucket_size_bytes[7d], 30*24*3600) > 
              aws_s3_bucket_quota_bytes * 0.9
        for: 1h
        labels:
          severity: p3
        annotations:
          summary: "S3 bucket {{ $labels.bucket }} projected to exceed quota in 30 days"

      - alert: StorageCleanupFailed
        expr: increase(surveillance_cleanup_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: p2
        annotations:
          summary: "Storage cleanup job failed"
```

### 7.4 Automated Cleanup Policies

```yaml
# cleanup-policies.yaml
cleanup_policies:
  raw_video:
    description: "Raw video recordings"
    retention_days: 90
    archive_before_delete: true
    archive_storage_class: GLACIER_IR
    priority: oldest_first
    schedule: "0 2 * * *"
    
  event_clips:
    description: "Alert event video clips"
    retention_days: 365
    archive_before_delete: true
    archive_storage_class: GLACIER
    priority: oldest_first
    schedule: "0 3 * * *"
    
  temp_processing:
    description: "Temporary processing files"
    retention_days: 1
    archive_before_delete: false
    priority: all_expired
    schedule: "*/30 * * * *"
    
  failed_uploads:
    description: "Failed upload artifacts"
    retention_days: 7
    archive_before_delete: false
    priority: all_expired
    schedule: "0 4 * * *"
    
  system_logs:
    description: "Application and system logs"
    retention_days: 90
    archive_before_delete: true
    archive_storage_class: GLACIER_IR
    priority: oldest_first
    schedule: "0 5 * * *"
```
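
A small validator, run in CI before the config is deployed, can catch malformed entries (bad cron strings, missing storage classes). A sketch, with two policies inlined for illustration — in practice the dict would come from `yaml.safe_load` on `cleanup-policies.yaml`:

```python
VALID_CLASSES = {"GLACIER_IR", "GLACIER", "DEEP_ARCHIVE"}

def validate_policy(name: str, policy: dict) -> list[str]:
    """Return a list of human-readable problems (empty list = valid)."""
    problems = []
    if policy.get("retention_days", 0) < 1:
        problems.append(f"{name}: retention_days must be >= 1")
    if len(policy.get("schedule", "").split()) != 5:
        problems.append(f"{name}: schedule must be a 5-field cron expression")
    if policy.get("archive_before_delete") and \
            policy.get("archive_storage_class") not in VALID_CLASSES:
        problems.append(f"{name}: archive requires a valid storage class")
    return problems

policies = {
    "raw_video": {"retention_days": 90, "archive_before_delete": True,
                  "archive_storage_class": "GLACIER_IR", "schedule": "0 2 * * *"},
    "temp_processing": {"retention_days": 1, "archive_before_delete": False,
                        "schedule": "*/30 * * * *"},
}
for name, p in policies.items():
    assert validate_policy(name, p) == [], name
print("all policies valid")
```

This would have flagged the kind of silent misconfiguration that lets a cleanup job no-op for days (see the post-incident review in section 8.5).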

### 7.5 Compression Strategy

| Data Age | Compression | Method | Savings |
|----------|------------|--------|---------|
| 0-7 days | None | Raw H.265 | Baseline |
| 7-30 days | Re-encode | H.265 → H.265 (higher CRF) | 30-40% |
| 30-90 days | Transcode | H.265 → AV1 | 40-50% |
| 90+ days | Archive | AV1 + tarball | 50-60% |
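
The 7-30 day re-encode pass maps to one ffmpeg invocation per file. A sketch of the argument vector the compression job might run (paths are illustrative; `-an` assumes surveillance streams carry no audio worth keeping):

```python
def reencode_cmd(src: str, dst: str, crf: int = 30) -> list[str]:
    """Build an ffmpeg H.265 re-encode command at the given CRF."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx265", "-crf", str(crf),
        "-preset", "slow",   # slower preset = better compression for batch work
        "-an",               # drop audio track
        dst,
    ]

cmd = reencode_cmd("/data/recordings/cam1/2025-01-08.mp4", "/tmp/out.mp4")
print(" ".join(cmd))
```

CRF 30 (vs a typical recording CRF around 23-25) trades barely-visible quality for the 30-40% savings cited above; run via `subprocess.run(cmd, check=True)` rather than a shell string to avoid quoting issues in file names.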

**Compression job:**
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: video-compression
  namespace: surveillance
spec:
  schedule: "0 1 * * *"
  jobTemplate:
    spec:
      parallelism: 2
      template:
        spec:
          containers:
            - name: compressor
              image: surveillance/media-processor:2.3.1
              command:
                - python
                - -m
                - compression
                - --age-days=7
                - --target-crf=30
                - --codec=libx265
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
                limits:
                  cpu: "4"
                  memory: 8Gi
          restartPolicy: OnFailure
```

### 7.6 Auto-Scaling Cloud Storage

**S3 Auto-scaling:** S3 is inherently elastic — no manual scaling needed. Monitor bucket size and cost.

**EBS volume scaling:**
```yaml
# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: surveillance-expandable
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: 3000
  throughput: 125
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID"
allowVolumeExpansion: true  # Enable expansion
volumeBindingMode: WaitForFirstConsumer
```

**Automated volume expansion:**
```bash
#!/bin/bash
# auto-expand-storage.sh

THRESHOLD=80
PVC_NAMES=("postgres-data" "media-storage" "log-storage")
NAMESPACE="surveillance"

for pvc in "${PVC_NAMES[@]}"; do
    # Get current usage
    USAGE=$(kubectl exec -n "$NAMESPACE" deployment/surveillance-api \
        -- df -h "/data/$pvc" | awk 'NR==2 {print $5}' | tr -d '%')
    
    if [ "$USAGE" -gt "$THRESHOLD" ]; then
        CURRENT_SIZE=$(kubectl get pvc "$pvc" -n "$NAMESPACE" \
            -o jsonpath='{.status.capacity.storage}')
        
        # Increase by 50%
        CURRENT_GB=${CURRENT_SIZE%Gi}
        NEW_GB=$((CURRENT_GB + CURRENT_GB / 2))
        
        echo "Expanding $pvc from ${CURRENT_GB}Gi to ${NEW_GB}Gi"
        
        kubectl patch pvc "$pvc" -n "$NAMESPACE" \
            --type merge \
            -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${NEW_GB}Gi\"}}}}"
        
        # Notify
        curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-type: application/json' \
            -d "{\"text\":\"Auto-expanded PVC $pvc to ${NEW_GB}Gi (was ${USAGE}% full)\"}"
    fi
done
```

### 7.7 Storage Cost Optimization

| Optimization | Monthly Savings | Implementation |
|-------------|----------------|----------------|
| S3 Intelligent-Tiering | 20-30% | Automatic |
| H.265 re-encode (older content) | 30-40% | Nightly job |
| Glacier IR for 30-90 day content | 60-70% | Lifecycle rule |
| Glacier Deep Archive for 1yr+ | 95% | Lifecycle rule |
| Reserved capacity for predictable workloads | 30-40% | Commitment |

---

## 8. Incident Response

### 8.1 Severity Definitions

| Severity | Name | Definition | Examples | Response Time |
|----------|------|-----------|----------|---------------|
| **P1** | Critical | Complete service outage; no surveillance capability | All cameras offline; AI pipeline completely down; storage full; database primary down | 15 minutes |
| **P2** | High | Major functionality degraded; partial surveillance loss | Single camera offline > 30 min; high error rates; model accuracy degraded; backup failures | 1 hour |
| **P3** | Medium | Minor functionality issue; workarounds available | Low FPS on camera; certificate expiry warning; cleanup job failure | 4 hours |
| **P4** | Low | Cosmetic or non-urgent issue | High CPU warning; UI glitch; documentation update needed; optimization opportunity | 24 hours |

### 8.2 Escalation Matrix

```
P1 (Critical) — 15 min response
├── 0 min: Alert fires → PagerDuty pages on-call engineer
├── 5 min: On-call must acknowledge
├── 15 min: No acknowledgment → Escalate to Team Lead (SMS + Call)
├── 30 min: No response → Escalate to Engineering Manager
├── 45 min: No response → Escalate to VP Engineering
└── 60 min: No response → Escalate to CTO

P2 (High) — 1 hour response
├── 0 min: Alert fires → PagerDuty pages on-call engineer
├── 30 min: No acknowledgment → Reminder notification
├── 60 min: No response → Escalate to Team Lead
└── 2 hours: No response → Escalate to Engineering Manager

P3 (Medium) — Slack + email only, 4 hour response
├── 0 min: Alert fires → Slack notification
└── 4 hours: No acknowledgment → Escalate to Team Lead

P4 (Low) — Daily digest email, 24 hour response
└── Daily digest at 09:00 UTC
```

**Contact Information:**

| Role | Primary Contact | Secondary Contact | Notification Method |
|------|----------------|-------------------|---------------------|
| On-Call Engineer | Rotating (PagerDuty) | — | PagerDuty Push + SMS |
| SRE Team Lead | lead-sre@company.com | +1-555-0100 | SMS + Voice Call |
| Engineering Manager | eng-mgr@company.com | +1-555-0101 | SMS + Voice Call |
| VP Engineering | vp-eng@company.com | +1-555-0102 | Voice Call + Email |
| CTO | cto@company.com | +1-555-0103 | Voice Call + Email |

### 8.3 Incident Response Process

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  DETECT     │───▶│  RESPOND    │───▶│  RESOLVE    │───▶│  REVIEW     │
│  (Alert)    │    │  (Triage &  │    │  (Fix &     │    │  (Post-     │
│             │    │   Mitigate) │    │   Verify)   │    │   mortem)   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                          │
                    ┌─────┴─────┐
                    ▼           ▼
              ┌────────┐  ┌────────────┐
              │Mitigate│  │Communicate │
              │Impact  │  │Stakeholders│
              └────────┘  └────────────┘
```

**Phase 1: Detect**
1. Monitoring alert fires
2. On-call engineer receives page
3. Acknowledge alert within 5 minutes
4. Create incident channel in Slack: `#inc-YYYY-MM-DD-brief-description`

**Phase 2: Respond**
1. Assess severity and impact
2. Execute relevant runbook
3. Apply immediate mitigation if possible
4. Update incident timeline every 15 minutes
5. Communicate to stakeholders

**Phase 3: Resolve**
1. Implement fix
2. Verify service recovery (all health checks pass)
3. Monitor for 30 minutes post-recovery
4. Close incident in PagerDuty
5. Update incident log

**Phase 4: Review**
1. Schedule post-mortem within 48 hours for P1/P2
2. Complete post-mortem document
3. Identify action items
4. Track action items to completion

### 8.4 Runbooks

#### Runbook: Camera Offline

**Detection:** `SingleCameraDown` alert fires  
**Severity:** P2  
**Initial Response Time:** 1 hour

**Diagnosis Steps:**

```bash
# 1. Check camera stream status
curl http://video-capture:8080/api/v1/cameras/{camera_id}/status

# 2. Check camera connectivity
ping <camera_ip>
curl -v rtsp://<camera_ip>:554/stream

# 3. Check video-capture service logs
kubectl logs -l app=video-capture --tail=100 | grep {camera_id}

# 4. Check network path
traceroute <camera_ip>
# Verify firewall rules, VPN tunnel

# 5. Check camera resource usage
kubectl top pod -l app=video-capture
```

**Resolution Steps:**

| Issue | Resolution | Verification |
|-------|-----------|--------------|
| Camera powered off | Contact site personnel to power cycle | Ping responds |
| Network connectivity | Check switch port, cable, VLAN | Ping + RTSP describe |
| VPN tunnel down | See "VPN Tunnel Down" runbook | Tunnel status |
| Camera firmware issue | Power cycle camera remotely | Stream reconnects |
| Stream URL changed | Update camera configuration | New stream active |
| Video-capture bug | Restart capture container | Stream reconnected |
| Resource exhaustion | Scale up capture resources | CPU/memory normal |

**Workaround:** If camera cannot be restored within 30 minutes:
- Mark camera as "maintenance mode" in dashboard
- Disable alerts for this camera
- Queue for on-site technician visit

---

#### Runbook: AI Pipeline Down

**Detection:** `AIPipelineDown` or `HighErrorRate` alert  
**Severity:** P1  
**Initial Response Time:** 15 minutes

**Diagnosis Steps:**

```bash
# 1. Check inference service health
curl http://ai-inference:8080/health/deep

# 2. Check if model is loaded
curl http://ai-inference:8080/api/v1/model/status

# 3. Check GPU status (if applicable)
nvidia-smi
# OR for CPU inference:
htop

# 4. Check inference logs
kubectl logs -l app=ai-inference --tail=200

# 5. Check resource usage
kubectl top pod -l app=ai-inference
kubectl describe pod -l app=ai-inference

# 6. Check model service
kubectl logs -l app=model-service --tail=100

# 7. Check if inference queue is backing up
redis-cli LLEN inference:queue

# 8. Test inference manually
curl -X POST http://ai-inference:8080/api/v1/inference/test \
  -H "Content-Type: application/json" \
  -d '{"test_image": "base64encoded"}'
```

**Resolution Steps:**

| Issue | Resolution | Verification |
|-------|-----------|--------------|
| Model not loaded | Restart model-service pod | Model status shows loaded |
| GPU OOM | Restart inference pod; check memory limits | nvidia-smi shows free memory |
| Model corruption | Reload model from S3 backup | Test inference succeeds |
| Inference timeout | Scale inference replicas; check input | Latency returns to normal |
| Queue backup | Scale up consumers; check for dead consumers | Queue depth returns to 0 |
| Bad model update | Rollback to previous model version | Detection accuracy restored |
| Dependency failure | Check circuit breaker status; restart dependencies | All health checks pass |

**Immediate Mitigation:**
- If inference cannot be restored in 15 minutes:
  1. Switch to "detection-only" mode (skip recognition)
  2. Enable edge processing as backup
  3. Queue frames for delayed processing

---

#### Runbook: VPN Tunnel Down

**Detection:** Edge node unreachable; camera streams offline  
**Severity:** P2 (P1 if all edge cameras affected)  
**Initial Response Time:** 1 hour

**Diagnosis Steps:**

```bash
# 1. Check tunnel status from cloud side
ping <edge_gateway_ip>

# 2. Check VPN service status
kubectl logs -l app=vpn-gateway --tail=100

# 3. Check tunnel metrics
curl http://vpn-gateway:8080/metrics | grep vpn_tunnel

# 4. Check from edge side (if SSH available)
ssh edge-node "ping <cloud_gateway_ip>"
ssh edge-node "ipsec status"  # or wg show for WireGuard

# 5. Check network path
mtr <edge_gateway_ip>

# 6. Check certificates (if certificate-based VPN)
openssl x509 -in /etc/vpn/cert.pem -text -noout | grep "Not After"
```

**Resolution Steps:**

| Issue | Resolution | Verification |
|-------|-----------|--------------|
| Edge network down | Contact ISP/site | Ping responds |
| VPN service crash | Restart VPN gateway | Tunnel established |
| Certificate expired | Renew certificates | Valid cert, tunnel up |
| MTU mismatch | Adjust tunnel MTU | No packet fragmentation |
| Firewall change | Restore firewall rules | Tunnel traffic flowing |
| IPsec/IKE failure | Restart IKE daemon; check config | SA established |
| WireGuard key issue | Regenerate keys | Handshake succeeds |

**Workaround:** If tunnel cannot be restored:
- Activate local storage mode on edge (store locally, sync later)
- Switch to cellular backup if available
- Deploy technician on-site if needed

---

#### Runbook: Storage Full

**Detection:** `StorageCritical95` alert fires  
**Severity:** P1  
**Initial Response Time:** 15 minutes

**Immediate Actions (within 5 minutes):**

```bash
# 1. Identify what's consuming space
df -h
ncdu /data/surveillance

# 2. Check if cleanup job is running
kubectl get jobs -n surveillance | grep cleanup

# 3. Temporarily expand storage (cloud)
# AWS EBS:
aws ec2 modify-volume --volume-id vol-XXXX --size $((CURRENT + 100))

# 4. Emergency cleanup — delete oldest temp files
find /data/surveillance/temp -type f -mtime +1 -delete
find /data/surveillance/cache -type f -atime +7 -delete

# 5. Force log rotation
logrotate -f /etc/logrotate.d/surveillance

# 6. Truncate oversized logs (>1GB) in place without deleting open file handles
find /var/log/surveillance -type f -size +1G -exec truncate -s 0 {} \;
```

**Resolution Steps:**

| Issue | Resolution | Verification |
|-------|-----------|--------------|
| Normal growth | Expand storage; review retention | Usage < 80% |
| Runaway logs | Fix log source; rotate logs | Log growth rate normal |
| Cleanup job failed | Restart cleanup job; fix root cause | Cleanup completes |
| Retention too long | Reduce retention period | Space freed |
| Camera bitrate high | Adjust camera encoding settings | Bitrate normalized |
| Orphaned temp files | Purge temp directory | Space recovered |

---

#### Runbook: Database Connectivity Issues

**Detection:** `DatabaseUnreachable` alert  
**Severity:** P1  
**Initial Response Time:** 15 minutes

**Diagnosis Steps:**

```bash
# 1. Check PostgreSQL pod status
kubectl get pods -l app=postgres
kubectl describe pod -l app=postgres

# 2. Check PostgreSQL logs
kubectl logs -l app=postgres --tail=200

# 3. Test connection from application pod
kubectl exec deployment/surveillance-api -- \
  pg_isready -h postgres -U surveillance

# 4. Check connection pool status
kubectl exec deployment/surveillance-api -- \
  python -c "from db import pool; print(pool.size(), pool.available())"

# 5. Check resource usage
kubectl top pod -l app=postgres

# 6. Check disk I/O
iostat -x 1 5

# 7. Check for locks
kubectl exec deployment/postgres -- \
  psql -U surveillance -c "SELECT * FROM pg_locks WHERE NOT granted;"

# 8. Check replication lag
kubectl exec deployment/postgres -- \
  psql -U surveillance -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp()) AS lag_seconds;"
```

**Resolution Steps:**

| Issue | Resolution | Verification |
|-------|-----------|--------------|
| PostgreSQL pod crash | Restart pod; check for OOM | Pod running, accepting connections |
| Connection pool exhausted | Increase pool size; check for leaks | Available connections > 0 |
| Disk I/O saturation | Scale storage IOPS; optimize queries | I/O wait < 20% |
| Lock contention | Kill blocking queries; optimize transactions | No waiting locks |
| Replication lag | Check replica resources; restart replication | Lag < 5 seconds |
| Query overload | Enable query caching; kill slow queries | Active queries normal |
| Disk full | See "Storage Full" runbook | Free space available |
| Hardware failure | Failover to replica; replace primary | Replica promoted |

**Immediate Mitigation:**
- If primary is down:
  1. Promote replica to primary: `pg_ctl promote`
  2. Update connection strings
  3. Restart application pods

---

#### Runbook: High Error Rates

**Detection:** `HighErrorRate` alert fires  
**Severity:** P1  
**Initial Response Time:** 15 minutes

**Diagnosis Steps:**

```bash
# 1. Check error distribution by service
kubectl logs -l app=surveillance --tail=1000 | \
  jq -r '.service + ": " + .level + ": " + .message' | \
  sort | uniq -c | sort -rn | head -20

# 2. Check error rate per service
curl http://prometheus:9090/api/v1/query?query=\
  "rate(surveillance_errors_total[5m])"

# 3. Check for recent deployments
kubectl rollout history deployment/surveillance-api
kubectl rollout history deployment/ai-inference

# 4. Check dependency health
curl http://surveillance-api:8080/health/deep

# 5. Check for resource exhaustion
kubectl top pods

# 6. Review recent changes
# Check CI/CD pipeline, config changes

# 7. Check circuit breaker status
for service in database storage inference; do
  curl "http://surveillance-api:8080/api/v1/circuit-breakers/$service"
done
```

**Resolution Steps:**

| Issue | Resolution | Verification |
|-------|-----------|--------------|
| Bad deployment | Rollback to previous version | Error rate drops |
| Dependency down | Fix dependency; check circuit breakers | All deps healthy |
| Resource exhaustion | Scale up; optimize resource usage | Usage normal |
| Code bug | Deploy hotfix; or rollback | Errors eliminated |
| Configuration error | Revert config change; validate config | Config valid |
| External API failure | Enable fallback; contact provider | Fallback active |
| Database deadlock | Kill blocking queries; fix code | Deadlocks resolved |

### 8.5 Post-Incident Review Template

```markdown
# Post-Incident Review

## Incident Summary

| Field | Value |
|-------|-------|
| Incident ID | INC-2025-001 |
| Date/Time (UTC) | 2025-01-15 03:45 - 2025-01-15 05:20 |
| Severity | P1 |
| Detection Method | Automated alert (StorageCritical95) |
| Affected Systems | All camera streams, event storage |
| Impact | 1h 35m of degraded recording quality |

## Timeline

| Time (UTC) | Event |
|------------|-------|
| 03:42 | Storage usage crosses 95% threshold |
| 03:45 | P1 alert fires; on-call paged |
| 03:48 | On-call engineer acknowledges |
| 03:52 | Diagnosis begins; identified storage full |
| 04:05 | Emergency cleanup initiated; temp files removed |
| 04:15 | Storage expanded by 200GB |
| 04:30 | Cleanup job restarted; oldest files archived |
| 04:45 | All camera streams reconnecting |
| 05:00 | All health checks passing |
| 05:20 | Incident closed; monitoring continues |

## Root Cause Analysis

**5 Whys:**
1. Why did storage fill up? → Cleanup job had been failing for 3 days
2. Why was cleanup failing? → Credential rotation broke S3 access
3. Why didn't credential rotation update cleanup job? → Cleanup job uses hardcoded credentials
4. Why are credentials hardcoded? → Technical debt; not migrated to secret management
5. Why wasn't this caught? → No monitoring on cleanup job success/failure

**Root Cause:** Cleanup job used hardcoded S3 credentials that were not updated during routine credential rotation, causing 3 days of accumulated data without cleanup.

## Contributing Factors
- No alert on cleanup job failures
- Storage growth rate was not monitored
- No auto-expansion configured for media storage

## What Went Well
- Automated P1 alert fired immediately at 95%
- On-call responded within 3 minutes
- Emergency cleanup procedures were effective
- No data loss occurred

## What Went Wrong
- Cleanup job failure went undetected for 3 days
- Manual intervention required for storage expansion
- Edge cameras buffered locally but some frames were lost during reconnect

## Action Items

| ID | Action | Owner | Due Date | Priority |
|----|--------|-------|----------|----------|
| AI-1 | Migrate all jobs to use IAM roles / secret management | @sre-team | 2025-01-22 | High |
| AI-2 | Add alert for cleanup job failures | @sre-team | 2025-01-18 | High |
| AI-3 | Implement auto-expansion for media storage | @sre-team | 2025-01-29 | Medium |
| AI-4 | Add storage growth rate alerting | @sre-team | 2025-01-22 | Medium |
| AI-5 | Improve camera reconnection to reduce frame loss | @eng-team | 2025-02-05 | Low |
| AI-6 | Document hardcoded credential audit procedure | @security | 2025-01-22 | High |

## Lessons Learned
- Any automated job failure must have an alert
- Credential management must be centralized
- Storage monitoring needs predictive capability

## Signatures
- Incident Commander: _________________ Date: ___/___/______
- Engineering Lead: _________________ Date: ___/___/______
```


---

## 9. Upgrades & Maintenance

### 9.1 Zero-Downtime Deployment Strategy

**Deployment Pattern:** Rolling updates with readiness gate verification

```
Phase 1: Deploy new version alongside old version
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │   (serving traffic)
  └──────────┘    └──────────┘    └──────────┘

Phase 2: Add new version pod, verify health
  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │    │  Pod v2  │   (new pod not yet serving)
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
                                                      ▲
                                                health check passes

Phase 3: Route traffic to new pod, drain old pod
  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v2  │    │  Pod v2  │   (traffic shifting)
  └──────────┘    └──────────┘    └──────────┘    └──────────┘

Phase 4: Complete rollout
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v2  │    │  Pod v2  │    │  Pod v2  │   (all pods updated)
  └──────────┘    └──────────┘    └──────────┘

Rollback: Instantly revert to previous ReplicaSet
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Pod v1  │    │  Pod v1  │    │  Pod v1  │   (rollback in ~30 seconds)
  └──────────┘    └──────────┘    └──────────┘
```

**Kubernetes Deployment Strategy:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: surveillance-api
  namespace: surveillance
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Allow 1 extra pod during update
      maxUnavailable: 0    # Never reduce capacity
  selector:
    matchLabels:
      app: surveillance-api
  template:
    metadata:
      labels:
        app: surveillance-api
        version: "2.3.2"   # Updated with each release
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: surveillance/api:2.3.2@sha256:a1b2c3d4...
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 6
            successThreshold: 2
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - surveillance-api
                topologyKey: kubernetes.io/hostname
```

### 9.2 Deployment Pipeline

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Build     │───▶│   Test      │───▶│   Stage     │───▶│   Canary    │───▶│  Production │
│  (CI)       │    │  (Unit/Int) │    │  (E2E)      │    │  (5% traff) │    │  (100%)     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                          │                  │                  │
                          ▼                  ▼                  ▼
                    ┌──────────┐      ┌──────────┐      ┌──────────┐
                    │ Fail =   │      │ Fail =   │      │ Fail =   │
                    │ Block    │      │ Block    │      │ Rollback │
                    └──────────┘      └──────────┘      └──────────┘
```

**Automated promotion gates:**

| Gate | Criteria | Auto-promote Timeout |
|------|----------|---------------------|
| Build | All tests pass; linting passes; security scan clean | Immediate |
| Staging | E2E tests pass; performance within 10% of baseline | 30 min validation |
| Canary | Error rate < 0.1%; p95 latency < baseline + 20% | 15 min bake time |
| Production | Canary metrics healthy for 30 min | Auto-proceed |
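The canary gate above can be sketched as a pure decision function. This is illustrative only: the real gate would pull error rate and p95 latency from Prometheus, and the function name and signature are assumptions, not an existing API.

```python
# canary_gate.py — sketch of the canary promotion decision.
# Thresholds mirror the promotion-gates table: error rate < 0.1%,
# p95 latency < baseline + 20%.

def canary_healthy(error_rate: float, p95_latency_ms: float,
                   baseline_p95_ms: float) -> bool:
    """Return True when the canary meets both promotion criteria."""
    return error_rate < 0.001 and p95_latency_ms < baseline_p95_ms * 1.20
```

A CI job could evaluate this once per minute during the 15-minute bake time and promote only if every sample passes.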

### 9.3 Database Migrations

**Tool:** Alembic (SQLAlchemy migrations) with `yoyo-migrations` for idempotent SQL

**Migration rules:**
1. All migrations must be backward-compatible (add-only in one release)
2. Destructive changes require a 2-phase deployment
3. Migrations are versioned and reversible
4. Migrations run automatically as init container before app startup
5. Migration status exposed via `/health/ready`

```python
# migrations/env.py — Alembic configuration
from alembic import context
from sqlalchemy import create_engine

config = context.config
target_metadata = None  # set to Base.metadata when using autogenerate

def run_migrations():
    """Run migrations in online mode."""
    connectable = create_engine(config.get_main_option("sqlalchemy.url"))
    
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            transaction_per_migration=True,
            compare_type=True,
        )
        
        with context.begin_transaction():
            context.run_migrations()

# Migration example: add_column (backward-compatible)
# migrations/versions/20250115_add_camera_resolution.py
"""
Add resolution column to cameras table

Revision ID: 20250115_add_camera_resolution
Revises: 20250101_initial
Create Date: 2025-01-15 08:30:00
"""
from alembic import op
import sqlalchemy as sa

revision = '20250115_add_camera_resolution'
down_revision = '20250101_initial'

# Phase 1 (this release): Add column as nullable
def upgrade():
    op.add_column('cameras', sa.Column('resolution', sa.String(20), nullable=True))
    # Backfill existing data
    op.execute("UPDATE cameras SET resolution = '1920x1080' WHERE resolution IS NULL")

# Phase 2 (next release): Make column non-nullable
# def upgrade():
#     op.alter_column('cameras', 'resolution', nullable=False)

def downgrade():
    op.drop_column('cameras', 'resolution')
```

**Migration execution (Kubernetes init container):**
```yaml
initContainers:
  - name: db-migrations
    image: surveillance/api:2.3.2@sha256:a1b2c3d4...
    command:
      - python
      - -m
      - alembic
      - upgrade
      - head
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: url
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
    # Init containers always run to completion before the app container
    # starts; failures are retried per the pod-level restartPolicy
```

**Two-phase destructive change example:**

Phase 1 (Release N):
```python
def upgrade():
    # Add new column
    op.add_column('detections', sa.Column('confidence_v2', sa.Float(), nullable=True))
    # Create index concurrently (no table lock)
    op.create_index('ix_detections_confidence_v2', 'detections', ['confidence_v2'],
                    postgresql_concurrently=True)
    # Backfill in batches (one batch shown; rerun until the UPDATE affects 0 rows)
    op.execute("""
        UPDATE detections 
        SET confidence_v2 = confidence 
        WHERE confidence_v2 IS NULL
        AND id IN (SELECT id FROM detections WHERE confidence_v2 IS NULL LIMIT 10000)
    """)
```
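The backfill above touches at most one batch per run; a complete backfill loops until no rows remain. A minimal sketch of that loop, shown with SQLite purely so the snippet is self-contained and runnable (production would run the same statement through the Postgres pool):

```python
# batched_backfill.py — illustrative batched backfill loop (SQLite stand-in
# for Postgres; the SQL pattern matches the Phase 1 migration above).
import sqlite3

def backfill_confidence_v2(conn: sqlite3.Connection, batch_size: int = 10000) -> int:
    """Copy confidence into confidence_v2 in batches until no rows remain.
    Returns the total number of rows backfilled."""
    total = 0
    while True:
        cur = conn.execute(
            """UPDATE detections
               SET confidence_v2 = confidence
               WHERE id IN (SELECT id FROM detections
                            WHERE confidence_v2 IS NULL LIMIT ?)""",
            (batch_size,),
        )
        conn.commit()  # short transactions: the table is never locked for long
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

Batching keeps each transaction small, which is why the migration itself only runs a single batch and leaves the repeat loop to an operational job.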

Phase 2 (Release N+1):
```python
def upgrade():
    # Now safe to drop old column (all code reads from new column)
    op.drop_column('detections', 'confidence')
    # Rename new column
    op.alter_column('detections', 'confidence_v2', new_column_name='confidence')
```

### 9.4 Model Update Deployment (Blue/Green)

AI model updates use blue/green to enable instant rollback:

```
Current State:
  ┌──────────────┐
  │  Model v2.1  │  ← Active
  │   (Blue)     │
  └──────────────┘
       ▲
   traffic: 100%

Deployment:
  1. Load Model v2.2 alongside v2.1
  2. Warm up v2.2 (run inference tests)
  3. Gradually shift traffic: 10% → 50% → 100%
  4. Monitor accuracy and latency
  
  ┌──────────────┐    ┌──────────────┐
  │  Model v2.1  │    │  Model v2.2  │
  │   (Blue)     │    │   (Green)    │
  └──────────────┘    └──────────────┘
    traffic: 70%         traffic: 30%

Rollback (instant):
  ┌──────────────┐    ┌──────────────┐
  │  Model v2.1  │    │  Model v2.2  │
  │   (Blue)     │    │  (Green)     │
  └──────────────┘    └──────────────┘
   traffic: 100%         traffic: 0%
```

**Model deployment configuration:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-blue
  namespace: surveillance
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: inference
          image: surveillance/inference:2.3.1
          env:
            - name: MODEL_VERSION
              value: "face-detection-v2.1"
            - name: MODEL_PATH
              value: "/models/v2.1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-green
  namespace: surveillance
spec:
  replicas: 0  # Scaled to 0 by default
  template:
    spec:
      containers:
        - name: inference
          image: surveillance/inference:2.3.1
          env:
            - name: MODEL_VERSION
              value: "face-detection-v2.2"
            - name: MODEL_PATH
              value: "/models/v2.2"
---
# Service routes to active model via label selector
apiVersion: v1
kind: Service
metadata:
  name: ai-inference
  annotations:
    active-model: "blue"
spec:
  selector:
    model: blue  # Changed to "green" for cutover
  ports:
    - port: 8080
```

**Model switch script:**
```bash
#!/bin/bash
# switch-model.sh — Switch between blue and green model deployments

NAMESPACE="surveillance"
TARGET="$1"  # blue or green

if [ "$TARGET" != "blue" ] && [ "$TARGET" != "green" ]; then
  echo "Usage: $0 <blue|green>" >&2
  exit 1
fi

# Scale target to match the currently active deployment (the other color)
ACTIVE=$([ "$TARGET" == "blue" ] && echo "green" || echo "blue")
CURRENT_REPLICAS=$(kubectl get deployment "ai-inference-$ACTIVE" -n "$NAMESPACE" \
  -o jsonpath='{.status.replicas}')

echo "Scaling ai-inference-$TARGET to $CURRENT_REPLICAS replicas..."
kubectl scale deployment "ai-inference-$TARGET" --replicas="$CURRENT_REPLICAS" -n "$NAMESPACE"

# Wait for ready
kubectl rollout status "deployment/ai-inference-$TARGET" -n "$NAMESPACE" --timeout=300s

# Update service selector
echo "Switching service to $TARGET..."
kubectl patch service ai-inference -n "$NAMESPACE" \
  --type merge \
  -p "{\"spec\":{\"selector\":{\"model\":\"$TARGET\"}}}"

# Update annotation
kubectl annotate service ai-inference -n "$NAMESPACE" \
  "active-model=$TARGET" --overwrite

# Scale old version to 0
OLD_VERSION=$([ "$TARGET" == "blue" ] && echo "green" || echo "blue")
echo "Scaling down ai-inference-$OLD_VERSION..."
kubectl scale deployment "ai-inference-$OLD_VERSION" --replicas=0 -n "$NAMESPACE"

echo "Model switch complete. Active: $TARGET"
```

### 9.5 Maintenance Windows

| Window | Schedule | Duration | Allowed Activities |
|--------|----------|----------|-------------------|
| Weekly | Sunday 02:00-06:00 UTC | 4 hours | Patches, minor updates, config changes |
| Monthly | First Sunday 02:00-08:00 UTC | 6 hours | Database maintenance, major upgrades, model updates |
| Quarterly | Scheduled | 8 hours | Infrastructure upgrades, DR drills |
| Emergency | On-demand | As needed | Security patches, critical fixes |

**Maintenance mode API:**
```python
@app.post("/admin/maintenance")
async def enable_maintenance_mode(
    duration_minutes: int,
    reason: str,
    user: AdminUser = Depends(get_admin_user)
):
    """Enable maintenance mode — disable non-critical processing."""
    await redis.set("maintenance:active", "true", ex=duration_minutes * 60)
    await redis.set("maintenance:reason", reason, ex=duration_minutes * 60)
    
    # Notify all connected clients
    await websocket_manager.broadcast({
        "type": "maintenance",
        "status": "started",
        "reason": reason,
        "estimated_duration_minutes": duration_minutes
    })
    
    # Reduce non-critical processing
    await set_pipeline_mode("minimal")
    
    audit_log.info("Maintenance mode enabled by %s for %d minutes: %s",
                   user.username, duration_minutes, reason)
```

### 9.6 Rollback Capability

Every deployment retains its previous revisions (e.g. via the Kubernetes `revisionHistoryLimit`) for instant rollback:

| Rollback Type | Method | Time to Complete | When to Use |
|--------------|--------|-----------------|-------------|
| Application rollback | `kubectl rollout undo` | ~30 seconds | Bad deployment |
| Database rollback | `alembic downgrade` | 2-5 minutes | Bad migration |
| Model rollback | Switch service selector | ~10 seconds | Bad model update |
| Configuration rollback | Git revert + apply | 1-2 minutes | Bad config change |
| Infrastructure rollback | Terraform state revert | 5-10 minutes | Bad infra change |
| Full system rollback | DR failover | 15-30 minutes | Catastrophic failure |

**Automated rollback triggers:**
```yaml
# rollback-alerts.yaml
- alert: DeploymentRollbackRequired
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      and
      delta(deployment_timestamp[10m]) > 0
    )
  for: 2m
  labels:
    severity: p1
  annotations:
    summary: "High error rate after deployment — rollback recommended"
    runbook_url: "https://wiki.internal/runbooks/auto-rollback"
```
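The alert above is advisory; a webhook consumer that actually runs `kubectl rollout undo` would typically add a guard so it only reverts recent deployments. A sketch of that decision logic (the function name and the 30-minute recency window are illustrative assumptions, not part of the Alertmanager payload):

```python
# auto_rollback.py — sketch of the guard a rollback webhook might apply
# before undoing a rollout: only act when the error rate breaches the
# alert threshold AND the deployment is recent enough to be the likely cause.
from datetime import datetime, timedelta, timezone

def should_rollback(error_rate: float, deployed_at: datetime,
                    max_age: timedelta = timedelta(minutes=30)) -> bool:
    """Return True only for a high error rate shortly after a deployment."""
    age = datetime.now(timezone.utc) - deployed_at
    return error_rate > 0.1 and age < max_age
```

Anything older than the window is escalated to a human instead, since the error spike is probably not deployment-related.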

### 9.7 Version Pinning

All container images MUST be pinned to digest, never to floating tags:

```yaml
# GOOD — pinned to digest
image: surveillance/api:2.3.1@sha256:abc123def456...

# BAD — floating tag
image: surveillance/api:latest

# ACCEPTABLE — semver tag, but only where an admission controller
# resolves and verifies the digest before admitting the pod
image: surveillance/api:2.3.1
```

**Image verification admission controller:**
```yaml
# Kyverno / OPA Gatekeeper policy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-digest
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "All container images must be pinned to digest"
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"
```

---

## 10. Performance Optimization

### 10.1 Query Optimization

**Slow query monitoring:**
```sql
-- Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slow queries
SELECT 
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```

**Alert on slow queries:**
```yaml
- alert: SlowPostgresQueries
  expr: |
    pg_stat_statements_mean_time > 1000
  for: 5m
  labels:
    severity: p3
  annotations:
    summary: "Slow queries detected (>1000ms average)"
```

**Index review (monthly):**
```sql
-- Check for tables doing heavy sequential scans (index candidates)
SELECT 
    schemaname,
    relname AS tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    seq_tup_read / NULLIF(seq_scan, 0) AS avg_tup_per_seq_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC;

-- Check for unused indexes
SELECT 
    schemaname,
    tablename,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;
```

**Current index strategy:**
```sql
-- Core indexes for surveillance queries
CREATE INDEX CONCURRENTLY idx_detections_timestamp_camera 
    ON detections (timestamp DESC, camera_id);

CREATE INDEX CONCURRENTLY idx_detections_person_id 
    ON detections (person_id) WHERE person_id IS NOT NULL;

CREATE INDEX CONCURRENTLY idx_events_timestamp_type 
    ON events (timestamp DESC, event_type);

CREATE INDEX CONCURRENTLY idx_alerts_status_created 
    ON alerts (status, created_at DESC) 
    WHERE status IN ('pending', 'sent');

CREATE INDEX CONCURRENTLY idx_recordings_camera_timestamp 
    ON recordings (camera_id, start_time DESC);

-- Partial index for active alerts (most queried)
CREATE INDEX CONCURRENTLY idx_alerts_active 
    ON alerts (created_at DESC, camera_id, severity)
    WHERE status = 'active';
```

### 10.2 Cache Strategy (Redis)

| Cache Type | TTL | Invalidation | Purpose |
|------------|-----|-------------|---------|
| Camera configuration | 5 min | On update | Reduce DB reads |
| Person profiles | 10 min | On update | Fast face lookup |
| Recent detections | 1 min | Time-based | Dashboard display |
| Alert rules | 5 min | On update | Rule evaluation |
| API responses (frequent) | 30 sec | On data change | Reduce API load |
| Session data | 24 hours | On logout | User sessions |
| Rate limiting | 1 min | Automatic | API protection |

**Redis configuration:**
```yaml
# redis-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: surveillance
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    appendonly yes
    appendfsync everysec
    save 900 1
    save 300 10
    save 60 10000
    tcp-keepalive 60
    timeout 300
```

**Cache implementation:**
```python
# cache.py
import redis.asyncio as redis
import json
import hashlib
from functools import wraps

redis_client = redis.Redis(
    host='redis',
    port=6379,
    db=0,
    decode_responses=True,
    socket_connect_timeout=5,
    socket_timeout=5,
    health_check_interval=30,
)

def cached(ttl_seconds: int, key_prefix: str = "cache"):
    """Decorator to cache function results."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Generate cache key
            cache_key = f"{key_prefix}:{func.__name__}:{_generate_key(args, kwargs)}"
            
            # Try cache
            cached = await redis_client.get(cache_key)
            if cached:
                return json.loads(cached)
            
            # Execute and cache
            result = await func(*args, **kwargs)
            await redis_client.setex(
                cache_key,
                ttl_seconds,
                json.dumps(result, default=str)
            )
            return result
        return wrapper
    return decorator

def _generate_key(args, kwargs):
    key_data = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True, default=str)
    return hashlib.sha256(key_data.encode()).hexdigest()[:16]

# Usage
@cached(ttl_seconds=300, key_prefix="camera")
async def get_camera_config(camera_id: str):
    return await db.fetchrow("SELECT * FROM cameras WHERE id = $1", camera_id)

@cached(ttl_seconds=60, key_prefix="detections")
async def get_recent_detections(camera_id: str, limit: int = 50):
    return await db.fetch(
        """SELECT * FROM detections 
           WHERE camera_id = $1 
           ORDER BY timestamp DESC 
           LIMIT $2""",
        camera_id, limit
    )
```
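The "On update" invalidation in the table above requires the write path to delete the exact key the decorator wrote, which means recomputing the same hash. A sketch of that helper (`_generate_key` is duplicated from the module above so the snippet is self-contained; `cache_key_for` is an illustrative name, not an existing function):

```python
# cache_invalidation.py — rebuild the key the @cached decorator would have
# written, so update handlers can delete it (sketch).
import hashlib
import json

def _generate_key(args, kwargs):
    # Duplicated from cache.py above for self-containment
    key_data = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True, default=str)
    return hashlib.sha256(key_data.encode()).hexdigest()[:16]

def cache_key_for(key_prefix: str, func_name: str, *args, **kwargs) -> str:
    """Return the exact cache key @cached(key_prefix=...) produced for
    a call to func_name with these arguments."""
    return f"{key_prefix}:{func_name}:{_generate_key(args, kwargs)}"

# On a camera update, the write path would then call something like:
#   await redis_client.delete(cache_key_for("camera", "get_camera_config", camera_id))
```

Because the key is a hash of the full argument list, both sides must pass identical arguments; a mismatch silently leaves a stale entry until its TTL expires.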

### 10.3 CDN Configuration

Static assets and archived media are served via CDN:

```yaml
# CloudFront / CDN configuration
cdn:
  origins:
    - id: surveillance-media
      domain: surveillance-media.s3.amazonaws.com
      path: /recordings
      
    - id: surveillance-static
      domain: surveillance-static.s3.amazonaws.com
      path: /static
      
  behaviors:
    - path: /recordings/*.mp4
      ttl: 86400
      compress: true
      
    - path: /static/*
      ttl: 604800
      cache_control: "public, max-age=604800, immutable"
      
    - path: /api/*
      ttl: 0  # Don't cache API
      
  signed_urls:
    enabled: true
    key_pair_id: "K..."
    expiration: 3600  # 1 hour
```
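With canned-policy signing, CloudFront appends an `Expires` parameter (epoch seconds) to the URL, so a client or test can check validity without contacting the CDN. An illustrative sketch (the function name is an assumption):

```python
# signed_url_check.py — sketch: is a CloudFront-style signed URL
# (canned policy, `Expires` in epoch seconds) still valid?
import time
from typing import Optional
from urllib.parse import urlparse, parse_qs

def signed_url_expired(url: str, now: Optional[float] = None) -> bool:
    """Return True when the Expires parameter is missing or in the past."""
    params = parse_qs(urlparse(url).query)
    expires = params.get("Expires")
    if not expires:
        return True  # treat unsigned/malformed URLs as invalid
    return (now if now is not None else time.time()) >= int(expires[0])
```

This only checks expiry; the `Signature` itself is still validated by the CDN edge.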

### 10.4 Connection Pooling

#### Database Connection Pooling

```python
# database.py
import asyncpg

DB_POOL_CONFIG = {
    "min_size": 5,
    "max_size": 20,
    "max_inactive_connection_lifetime": 300,
    "max_queries": 50000,
    "command_timeout": 30,
    "server_settings": {
        "jit": "off",
        "application_name": "surveillance-api"
    }
}

pool = None

async def init_pool(database_url: str):
    global pool
    pool = await asyncpg.create_pool(
        database_url,
        **DB_POOL_CONFIG
    )

async def get_connection():
    return await pool.acquire()

async def release_connection(conn):
    await pool.release(conn)
```

#### HTTP Connection Pooling (for inter-service communication)

```python
# http_client.py
import httpx

class ServiceClient:
    def __init__(self):
        self.client = httpx.AsyncClient(
            timeout=httpx.Timeout(10.0, connect=5.0, read=30.0),
            limits=httpx.Limits(
                max_connections=100,
                max_keepalive_connections=20
            ),
            http2=True,
        )
    
    async def get(self, service: str, path: str):
        url = f"http://{service}:8080{path}"
        response = await self.client.get(url)
        response.raise_for_status()
        return response.json()

service_client = ServiceClient()
```

### 10.5 Resource Limits

```yaml
# resource-limits.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: surveillance-limits
  namespace: surveillance
spec:
  limits:
    - default:
        cpu: "1"
        memory: 1Gi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container
---
# Per-service resource allocation
resources:
  # Video capture (I/O bound)
  video-capture:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi

  # AI inference (CPU/GPU bound)
  ai-inference:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 8Gi

  # API (moderate load)
  surveillance-api:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 2Gi

  # Database (high memory)
  postgres:
    requests:
      cpu: "1"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 16Gi

  # Redis (low CPU, moderate memory)
  redis:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi
```

### 10.6 Performance Benchmarks

| Metric | Target | Alert Threshold | Critical Threshold |
|--------|--------|----------------|-------------------|
| Camera stream latency | < 100ms | > 200ms | > 500ms |
| AI inference per frame | < 50ms | > 100ms | > 200ms |
| End-to-end detection latency | < 500ms | > 1000ms | > 2000ms |
| API response time (p50) | < 50ms | > 100ms | > 500ms |
| API response time (p95) | < 200ms | > 500ms | > 1000ms |
| Database query time (p95) | < 10ms | > 50ms | > 200ms |
| Stream processing FPS | 30 FPS | < 25 FPS | < 15 FPS |
| Frame drop rate | < 0.1% | > 1% | > 5% |
| Alert delivery time | < 5s | > 10s | > 30s |
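The alert columns above reduce to a small classifier for higher-is-worse metrics (latency, drop rate), with the thresholds passed in from the table. A sketch:

```python
# benchmark_check.py — classify a measured value against the alert and
# critical thresholds in the benchmarks table (higher is worse).

def classify_metric(value: float, alert_threshold: float,
                    critical_threshold: float) -> str:
    """Return 'ok', 'alert', or 'critical' for a higher-is-worse metric."""
    if value > critical_threshold:
        return "critical"
    if value > alert_threshold:
        return "alert"
    return "ok"

# Example row: end-to-end detection latency (alert > 1000ms, critical > 2000ms)
```

Lower-is-worse metrics such as stream FPS would invert the comparisons.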

---

## 11. Disaster Recovery

### 11.1 DR Objectives

| Metric | Value | Measurement |
|--------|-------|-------------|
| **RTO** (Recovery Time Objective) | 1 hour | Time from disaster declaration to service restoration |
| **RPO** (Recovery Point Objective) | 15 minutes | Maximum acceptable data loss |
| **RTO (Database)** | 30 minutes | Database failover time |
| **RTO (Application)** | 15 minutes | Application redeployment time |
| **RPO (Database)** | < 1 minute | With synchronous replication |
| **RPO (Media)** | 15 minutes | Cross-region replication lag |
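A failover runbook might apply a check like the following before promoting the replica; this is an illustrative sketch, and in practice the input would come from `pg_stat_replication` lag or the S3 replication metrics.

```python
# rpo_check.py — sketch: would failing over now violate the 15-minute RPO?
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=15)

def rpo_violated(last_replicated_at: datetime, now: datetime) -> bool:
    """True when data newer than the RPO window would be lost on failover."""
    return (now - last_replicated_at) > RPO
```

If this returns True, the decision becomes a business call: accept the data loss or wait for replication to catch up.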

### 11.2 DR Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                        PRODUCTION (us-east-1)                        │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐          │
│  │   EKS        │  │   RDS        │  │   S3             │          │
│  │   Cluster    │  │   PostgreSQL │  │   Primary        │          │
│  │              │  │   (Primary)  │  │   Bucket         │          │
│  │  ┌────────┐  │  │              │  │                  │          │
│  │  │Capture │  │  │  ┌────────┐  │  │  ┌──────────┐    │          │
│  │  │API     │  │  │  │Primary │  │  │  │ Recordings│   │          │
│  │  │Inference│  │  │  │Replica │  │  │  │ Events    │   │          │
│  │  └────────┘  │  │  └────────┘  │  │  │ Models    │   │          │
│  └──────────────┘  └──────────────┘  └──────────────────┘          │
│           │                │                  │                      │
│           ▼                ▼                  ▼                      │
│     ┌─────────────────────────────────────────────────┐              │
│     │           Real-time Replication                  │              │
│     │  (WAL streaming + S3 cross-region replication)   │              │
│     └─────────────────────────────────────────────────┘              │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     DR SITE (us-west-2)                              │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐          │
│  │   EKS        │  │   RDS        │  │   S3             │          │
│  │   (Scaled    │  │   PostgreSQL │  │   Replica        │          │
│  │    to 0)     │  │   (Standby)  │  │   Bucket         │          │
│  │              │  │              │  │                  │          │
│  │  [Ready to   │  │  ┌────────┐  │  │  [Fully         │          │
│  │   scale up]  │  │  │Standby │  │  │   replicated]  │          │
│  │              │  │  │Replica │  │  │                  │          │
│  └──────────────┘  │  └────────┘  │  └──────────────────┘          │
│                    └──────────────┘                                  │
└─────────────────────────────────────────────────────────────────────┘
```

### 11.3 Data Replication

#### Database Replication

```yaml
# RDS PostgreSQL cross-region read replica
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DRReadReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: surveillance-dr-replica
      DBInstanceClass: db.r6g.xlarge
      Engine: postgres
      EngineVersion: '15.4'
      SourceDBInstanceIdentifier: 
        !Sub 'arn:aws:rds:us-east-1:${AWS::AccountId}:db:surveillance-primary'
      DBSubnetGroupName: !Ref DRSubnetGroup
      VPCSecurityGroups:
        - !Ref DRSecurityGroup
      MultiAZ: false  # Standby only; enable during failover
      StorageEncrypted: true
      KmsKeyId: !Ref DRKMSKey
      BackupRetentionPeriod: 7
      DeletionProtection: true
      Tags:
        - Key: Purpose
          Value: DR-Standby
        - Key: RPO
          Value: 15min
```

**Replication monitoring:**
```sql
-- Check replication lag (run on primary)
SELECT 
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- Alert if replication lag > 5 minutes
```

#### Object Storage Replication

S3 Cross-Region Replication (CRR) with 15-minute RPO:
- All new objects replicated within 15 minutes
- Replication status tracked per object
- Failed replication events alerted

#### Configuration Replication

- Terraform state stored in S3 with cross-region replication
- Git repositories mirrored to secondary Git provider
- Kubernetes manifests stored in Git (GitOps)

### 11.4 Failover Process

#### Automated Failover (Database — RDS)

RDS Multi-AZ provides automatic failover:
1. Health check fails on primary
2. RDS promotes standby to primary (typically 60-120 seconds)
3. DNS endpoint updates automatically
4. Application reconnects via connection pool
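During the 60-120 second promotion window in step 2, reconnection attempts should back off rather than hammer the old endpoint; asyncpg does not retry failed connection attempts itself, so a pool wrapper would consume a schedule like this. An illustrative sketch:

```python
# reconnect_backoff.py — sketch of a capped exponential backoff schedule
# for reconnecting through an RDS failover (function name is illustrative).

def backoff_schedule(attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    """Delays in seconds: base * 2^i, capped so the total stays within
    the expected 60-120s failover window."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```

Adding random jitter to each delay avoids all pods retrying in lockstep once the new primary is up.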

#### Manual DR Failover (Full Site)

```bash
#!/bin/bash
# dr-failover.sh — Execute full site failover to DR region

PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
FAILOVER_REASON="$1"
START_TIME=$(date +%s)

log() {
    echo "[$(date -Iseconds)] $1" | tee -a /var/log/dr/failover-$(date +%Y%m%d).log
}

log "=== DR FAILOVER INITIATED ==="
log "Reason: $FAILOVER_REASON"
log "From: $PRIMARY_REGION → $DR_REGION"

# 1. Verify DR environment
log "1. Verifying DR environment readiness..."
if ! aws eks describe-cluster --name surveillance-dr --region $DR_REGION > /dev/null 2>&1; then
    log "ERROR: DR EKS cluster not accessible"
    exit 1
fi

# 2. Promote DR database from standby
log "2. Promoting DR database..."
aws rds promote-read-replica \
    --db-instance-identifier surveillance-dr-replica \
    --region $DR_REGION

# Wait for promotion
aws rds wait db-instance-available \
    --db-instance-identifier surveillance-dr-replica \
    --region $DR_REGION
log "   DR database promoted successfully"

# 3. Enable Multi-AZ on DR database
log "3. Enabling Multi-AZ on DR database..."
aws rds modify-db-instance \
    --db-instance-identifier surveillance-dr-replica \
    --multi-az \
    --apply-immediately \
    --region $DR_REGION

# 4. Scale up DR EKS cluster
log "4. Scaling up DR EKS cluster..."
aws eks update-nodegroup-config \
    --cluster-name surveillance-dr \
    --nodegroup-name surveillance-workers \
    --scaling-config minSize=3,maxSize=10,desiredSize=3 \
    --region $DR_REGION

# Wait for nodes (point kubectl at the DR cluster first)
kubectl config use-context surveillance-dr
sleep 120
kubectl wait --for=condition=Ready nodes --all --timeout=300s

# 5. Deploy application to DR
log "5. Deploying application to DR..."
kubectl config use-context surveillance-dr
kubectl apply -k k8s/overlays/dr/

# Wait for deployments
kubectl wait --for=condition=available \
    --all deployments \
    --namespace surveillance \
    --timeout=600s

# 6. Update DNS to point to DR
log "6. Updating DNS to DR region..."
aws route53 change-resource-record-sets \
    --hosted-zone-id $HOSTED_ZONE_ID \
    --change-batch file://dr-dns-update.json

# 7. Verify health
log "7. Running health checks..."
HEALTHY=false
for i in {1..10}; do
    if curl -f https://surveillance.company.com/health/deep > /dev/null 2>&1; then
        log "   Health check PASSED"
        HEALTHY=true
        break
    fi
    log "   Health check attempt $i/10 failed; retrying..."
    sleep 10
done
if [ "$HEALTHY" = false ]; then
    log "   ERROR: health checks still failing after 10 attempts — investigate before declaring failover complete"
fi

# 8. Verify cameras reconnecting
log "8. Verifying camera streams..."
sleep 60
STREAM_COUNT=$(curl -s https://surveillance.company.com/api/v1/cameras/status | \
    jq '[.cameras[] | select(.status == "active")] | length')
log "   Active streams: $STREAM_COUNT/8"

# 9. Send notifications
log "9. Sending notifications..."
curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\":\"DR FAILOVER COMPLETE: Production now running in $DR_REGION. Reason: $FAILOVER_REASON. Active streams: $STREAM_COUNT/8\"}"

log "=== DR FAILOVER COMPLETE ==="
log "Total time: $(($(date +%s) - START_TIME)) seconds"
```
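
The DNS step above reads its change batch from `dr-dns-update.json`. A minimal sketch of generating that file is below; the record type, TTL, and the DR load-balancer hostname passed in by the caller are assumptions for illustration, not values defined in this runbook.

```python
# Sketch: build the Route 53 change batch the failover script submits.
# A low TTL (assumed 60 s here) keeps eventual failback fast.
def build_change_batch(record_name: str, dr_target: str, ttl: int = 60) -> dict:
    """UPSERT the public record so traffic follows the DR load balancer."""
    return {
        "Comment": "DR failover: point production DNS at the DR region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": dr_target}],
            },
        }],
    }
```

Serialize the result with `json.dump(..., indent=2)` into `dr-dns-update.json` before the drill so step 6 never blocks on editing JSON by hand.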

### 11.5 DR Testing Schedule

| Test Type | Frequency | Scope | Duration | Validation |
|-----------|-----------|-------|----------|------------|
| Backup restore drill | Monthly | Database + media | 2 hours | Data integrity verified |
| Application redeployment | Monthly | Full application stack | 1 hour | All services healthy |
| Network failover test | Quarterly | VPN, DNS | 30 min | Traffic routes correctly |
| Database failover test | Quarterly | RDS Multi-AZ promotion | 1 hour | Replication lag acceptable |
| **Full DR drill** | **Quarterly** | **Complete site failover** | **4 hours** | **All RTO/RPO met** |
| Tabletop exercise | Semi-annually | Team response procedures | 2 hours | Process gaps identified |

**Full DR drill procedure:**
1. **Week before:** Schedule drill; notify stakeholders; prepare isolated test data
2. **Day of:**
   - 09:00 — Initiate failover (simulate primary region failure)
   - 09:05 — DR team executes failover runbook
   - 09:30 — Verify database is promoted and accessible
   - 10:00 — Verify application is deployed and healthy
   - 10:30 — Verify camera streams reconnect
   - 11:00 — Verify alert delivery
   - 11:30 — Run E2E test suite
   - 12:00 — Validate data integrity (sample checks)
   - 12:30 — Measure and document RTO/RPO
   - 13:00 — Initiate failback to primary
   - 14:00 — Verify primary is restored
3. **Week after:** Complete DR test report; file action items
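
The 12:30 RTO/RPO measurement reduces to timestamp arithmetic over the drill log. A minimal sketch (the timestamps below are illustrative, chosen to match the drill timeline):

```python
# Sketch: compute achieved RTO/RPO for the drill report from logged timestamps.
from datetime import datetime

def minutes_between(start_iso: str, end_iso: str) -> float:
    """Elapsed minutes between two ISO-8601 timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds() / 60

# RTO: failure declared -> service healthy in DR
rto = minutes_between("2025-03-15T09:00:00", "2025-03-15T09:42:00")
# RPO: last replicated transaction -> failure point
rpo = minutes_between("2025-03-15T08:52:00", "2025-03-15T09:00:00")
```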

**DR Test Report Template:**
```markdown
## DR Drill Report — 2025-Q1

| Item | Result |
|------|--------|
| Date | 2025-03-15 |
| Scenario | Complete region failure (us-east-1) |
| Failover RTO Target | 60 minutes |
| Failover RTO Achieved | 42 minutes |
| RPO Target | 15 minutes |
| RPO Achieved | 8 minutes |
| Streams Restored | 8/8 (100%) |
| Data Integrity | PASS |
| E2E Tests | 47/47 PASS |

### Issues Found
1. Camera reconnection took 18 minutes (target: <10 min) — AI-7 filed
2. Alert service required manual restart — AI-8 filed

### Action Items
| ID | Description | Owner | Due |
|----|-------------|-------|-----|
| AI-7 | Optimize camera reconnection sequence | @eng | 2025-04-01 |
| AI-8 | Fix alert service startup dependency | @sre | 2025-03-22 |
```

### 11.6 DR Readiness Checklist

Verify monthly (automated where possible):

- [ ] DR database replication lag < 1 minute
- [ ] S3 cross-region replication caught up
- [ ] DR EKS cluster accessible and nodes can scale
- [ ] Latest container images available in DR region registry
- [ ] DR Terraform plan applies without errors (dry-run)
- [ ] Backup integrity verified (latest full backup)
- [ ] Failover runbook accessible and up-to-date
- [ ] DR contact list current
- [ ] VPN/cross-region network paths verified
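
A sketch of automating the first checklist item (replica lag < 1 minute); the threshold handling is illustrative, the instance identifier comes from the failover script, and the CloudWatch fetch is shown only as a comment because it needs live credentials:

```shell
#!/bin/bash
# dr-readiness-check.sh — sketch of one automated monthly check.

check_lag() {
    # $1 = replica lag in whole or fractional seconds
    local lag="$1"
    # Compare the integer part against the 60 s checklist threshold
    if [ "${lag%.*}" -gt 60 ]; then
        echo "FAIL: replica lag ${lag}s exceeds 60s"
        return 1
    fi
    echo "OK: replica lag ${lag}s"
}

# In production, lag would come from CloudWatch, e.g. (not executed here):
#   aws cloudwatch get-metric-statistics --namespace AWS/RDS \
#       --metric-name ReplicaLag \
#       --dimensions Name=DBInstanceIdentifier,Value=surveillance-dr-replica \
#       --statistics Maximum --period 300 ...
```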

---

## 12. Capacity Planning

### 12.1 Current Capacity Baseline (8 Cameras)

| Resource | Current Usage | Capacity | Headroom (% of usage) |
|----------|--------------|----------|-----------------------|
| **CPU (cloud)** | 4 cores avg | 8 cores | 100% |
| **Memory (cloud)** | 12 GB | 32 GB | 167% |
| **GPU (if used)** | 40% utilization | 1x GPU | 150% |
| **Storage hot tier** | 6 TB / 20 TB | 20 TB | 233% |
| **Storage warm tier** | 18 TB / 50 TB | 50 TB | 178% |
| **Database storage** | 150 GB | 500 GB | 233% |
| **Database connections** | 25 / 100 | 100 | 300% |
| **Network egress** | 200 Mbps / 1 Gbps | 1 Gbps | 400% |
| **Inference throughput** | 240 FPS (8x30) | 480 FPS | 100% |
| **Alert volume** | 50/day | 500/day | 900% |
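
Headroom in this table is spare capacity expressed as a percentage of *current usage* (not of total capacity). A one-line sketch of the convention, reproducing two rows:

```python
# Sketch: "Headroom" above is (capacity - usage) / usage * 100, i.e. how much
# additional load fits relative to what is consumed today.
def headroom_pct(usage: float, capacity: float) -> int:
    """Spare capacity as a whole percentage of current usage."""
    return round((capacity - usage) / usage * 100)

cpu = headroom_pct(4, 8)        # CPU row: 4 of 8 cores
memory = headroom_pct(12, 32)   # Memory row: 12 of 32 GB
```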

### 12.2 Scaling Triggers

| Metric | Scale-Up Trigger | Scale-Down Trigger | Action |
|--------|-----------------|-------------------|--------|
| **CPU utilization** | > 70% for 10 minutes | < 30% for 30 minutes | Add/remove inference pods |
| **Memory utilization** | > 80% for 10 minutes | < 40% for 30 minutes | Add memory or pods |
| **Inference latency** | > 100ms p95 for 5 min | < 50ms p95 for 10 min | Scale inference horizontally |
| **Queue depth** | > 1000 frames | < 100 frames | Adjust consumer count |
| **Storage usage** | > 70% | N/A (manual) | Expand volume or archive |
| **Camera count** | > 8 cameras | N/A | Scale per-camera resources |

**Horizontal Pod Autoscaler configuration:**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
  namespace: surveillance
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        # Custom metric: requires a metrics adapter (e.g. prometheus-adapter)
        # exposing surveillance_pipeline_latency_ms via the custom metrics API
        metric:
          name: surveillance_pipeline_latency_ms
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
```

### 12.3 Camera Addition Process

```
Step 1: Pre-deployment Assessment (Day -7)
├── Evaluate resource requirements
├── Verify network connectivity
├── Review camera positioning and coverage
└── Update configuration in Git

Step 2: Infrastructure Preparation (Day -3)
├── Calculate additional storage needs
├── Verify scaling headroom
├── Prepare camera configuration
└── Stage network/VPN configuration

Step 3: Deployment (Day 0)
├── Add camera to configuration
├── Deploy updated configuration
├── Verify stream connection
├── Validate AI processing
├── Test alert generation
└── Update dashboards

Step 4: Validation (Day 0-1)
├── Monitor for 24 hours
├── Verify FPS and quality
├── Confirm alerts working
├── Document in camera registry
└── Notify stakeholders
```

**Camera addition checklist:**

| Step | Item | Verification |
|------|------|-------------|
| 1 | Camera network reachable | `ping <camera_ip>` |
| 2 | RTSP stream accessible | `ffprobe rtsp://<camera>/stream` |
| 3 | VPN tunnel supports additional bandwidth | Bandwidth check |
| 4 | Configuration added to Git | PR merged |
| 5 | Stream appears in video-capture | Logs show connection |
| 6 | FPS meets target (>25) | Grafana dashboard |
| 7 | AI inference processing frames | Detection metrics |
| 8 | Alerts generated correctly | Test alert |
| 9 | Storage projections updated | Capacity review |
| 10 | Camera documented | Registry updated |
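
Checklist steps 1-2 can be pre-screened in code before running the full `ffprobe` probe. A minimal sketch, assuming only that the camera speaks RTSP on the standard port; this confirms TCP reachability, not a full RTSP handshake:

```python
# Sketch: confirm the camera host accepts a TCP connection on the RTSP port.
import socket
from urllib.parse import urlparse

def rtsp_reachable(url: str, timeout: float = 2.0) -> bool:
    """TCP-connect to the RTSP endpoint (default port 554)."""
    parsed = urlparse(url)
    port = parsed.port or 554  # RTSP default when the URL omits a port
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns `True` but `ffprobe` still fails, the fault is usually credentials or the stream path rather than the network.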

### 12.4 Per-Camera Resource Requirements

| Resource | Per Camera | 8 Cameras | 16 Cameras | 24 Cameras |
|----------|-----------|-----------|------------|------------|
| **CPU (inference)** | 0.5 cores | 4 cores | 8 cores | 12 cores |
| **Memory (processing)** | 1 GB | 8 GB | 16 GB | 24 GB |
| **Storage (hot, daily)** | 50 GB/day | 400 GB/day | 800 GB/day | 1.2 TB/day |
| **Network (ingress)** | 25 Mbps | 200 Mbps | 400 Mbps | 600 Mbps |
| **GPU memory** | 512 MB | 4 GB | 8 GB | 12 GB |
| **Database IOPS** | 100 | 800 | 1,600 | 2,400 |
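
The columns above are straight linear multiples of the per-camera figures; a small sketch makes the projection reusable for arbitrary camera counts (real deployments may scale sub-linearly thanks to shared overhead, or worse under contention):

```python
# Sketch: linear projection of the per-camera requirements in the table above.
PER_CAMERA = {
    "cpu_cores": 0.5,
    "memory_gb": 1.0,
    "storage_gb_per_day": 50,
    "ingress_mbps": 25,
    "gpu_memory_mb": 512,
    "db_iops": 100,
}

def project(cameras: int) -> dict:
    """Aggregate resource requirements for a given camera count."""
    return {resource: per_unit * cameras for resource, per_unit in PER_CAMERA.items()}
```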

### 12.5 Scaling Roadmap

| Phase | Cameras | Timeline | Infrastructure Changes |
|-------|---------|----------|----------------------|
| **Current** | 8 | Now | 3 inference pods, 8 CPU, 32 GB RAM |
| **Phase 1** | 12 | Q2 2025 | 4 inference pods, 12 CPU, 48 GB RAM |
| **Phase 2** | 16 | Q3 2025 | 6 inference pods, 16 CPU, 64 GB RAM, GPU add |
| **Phase 3** | 24 | Q1 2026 | 8 inference pods, 24 CPU, 96 GB RAM, 2 GPU |
| **Phase 4** | 32+ | Q3 2026 | Shard by location, dedicated inference cluster |

### 12.6 Performance Benchmarks

**Benchmark suite executed monthly:**

```bash
#!/bin/bash
# performance-benchmark.sh

API_URL="https://surveillance.company.com"
RESULTS_FILE="/var/log/benchmarks/$(date +%Y%m%d).json"
mkdir -p "$(dirname "$RESULTS_FILE")"

echo "{\"timestamp\": \"$(date -Iseconds)\"," > "$RESULTS_FILE"
echo "\"benchmarks\": {" >> "$RESULTS_FILE"

# 1. Health check latency
echo "  Running health check latency test..."
HEALTH_LAT=$(curl -o /dev/null -s -w "%{time_total}" "$API_URL/health")
echo "  \"health_check_latency_ms\": $(echo "$HEALTH_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 2. Deep health check latency
echo "  Running deep health check..."
DEEP_LAT=$(curl -o /dev/null -s -w "%{time_total}" "$API_URL/health/deep")
echo "  \"deep_health_latency_ms\": $(echo "$DEEP_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 3. API response time (events list)
echo "  Running API response time test..."
API_LAT=$(curl -o /dev/null -s -w "%{time_total}" \
  "$API_URL/api/v1/events?limit=100&start=$(date -d '1 hour ago' -Iseconds)")
echo "  \"api_events_latency_ms\": $(echo "$API_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 4. Database query performance
echo "  Running database query test..."
DB_LAT=$(curl -o /dev/null -s -w "%{time_total}" \
  "$API_URL/api/v1/admin/db-performance")
echo "  \"db_query_latency_ms\": $(echo "$DB_LAT * 1000" | bc)," >> "$RESULTS_FILE"

# 5. Stream status
echo "  Checking stream status..."
STREAMS=$(curl -s "$API_URL/api/v1/cameras/status" | jq '[.cameras[] | select(.status == "active")] | length')
echo "  \"active_streams\": $STREAMS," >> "$RESULTS_FILE"

# 6. Inference latency (from Prometheus)
echo "  Fetching inference metrics..."
INF_LAT=$(curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, rate(surveillance_model_inference_ms_bucket[5m]))' | \
  jq -r '.data.result[0].value[1] // "null"')
echo "  \"inference_p95_latency_ms\": $INF_LAT" >> "$RESULTS_FILE"

echo "}}" >> "$RESULTS_FILE"

echo "Benchmark complete. Results saved to $RESULTS_FILE"
cat "$RESULTS_FILE"
```
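
Since the script emits one JSON file per run, month-over-month drift can be flagged automatically. A minimal sketch, assuming the field names the script writes; the 1.5x regression threshold is an assumption, not policy:

```python
# Sketch: flag latency regressions between two benchmark runs produced by
# performance-benchmark.sh. Threshold factor is illustrative.
def find_regressions(previous: dict, current: dict, factor: float = 1.5) -> list:
    """Return latency keys where the current run exceeds factor x the previous run."""
    flagged = []
    for key, prev_val in previous.items():
        if key.endswith("_latency_ms") and current.get(key, 0) > prev_val * factor:
            flagged.append(key)
    return flagged

# Illustrative values: a healthy week vs a degraded one
prev = {"deep_health_latency_ms": 238, "api_events_latency_ms": 92}
curr = {"deep_health_latency_ms": 520, "api_events_latency_ms": 156}
```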

**Benchmark history tracking:**

| Date | Health (ms) | Deep Health (ms) | API (ms) | Inference P95 (ms) | Streams Active |
|------|-------------|-------------------|----------|-------------------|----------------|
| 2025-01-01 | 12 | 245 | 89 | 42 | 8/8 |
| 2025-01-08 | 11 | 238 | 92 | 45 | 8/8 |
| 2025-01-15 | 15 | 520 | 156 | 78 | 7/8 (cam_03 offline) |

### 12.7 Resource Request & Provisioning Workflow

```
Requestor submits capacity request
        │
        ▼
┌───────────────┐
│ SRE Review    │ ← Assess impact, feasibility, alternatives
│ (2 biz days)  │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Approval      │ ← Engineering Manager + Finance (if >$X)
│ (1 biz day)   │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Implementation│ ← SRE executes change during maintenance window
│ (scheduled)   │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Validation    │ ← Verify performance meets requirements
│ (24-48 hours) │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Close Request │ ← Document in capacity ledger
└───────────────┘
```

---

## Appendices

### Appendix A: Contact Directory

| Role | Name | Email | Phone | Slack |
|------|------|-------|-------|-------|
| On-Call (rotating) | See PagerDuty | oncall@company.com | Via PagerDuty | #surveillance-oncall |
| SRE Team Lead | [Name] | sre-lead@company.com | +1-555-0100 | @sre-lead |
| Engineering Manager | [Name] | eng-mgr@company.com | +1-555-0101 | @eng-mgr |
| Security Officer | [Name] | security@company.com | +1-555-0104 | @security |
| Product Owner | [Name] | product@company.com | +1-555-0105 | @product |
| VP Engineering | [Name] | vp-eng@company.com | +1-555-0102 | @vp-eng |

### Appendix B: Tooling Inventory

| Category | Tool | Version | Purpose |
|----------|------|---------|---------|
| Monitoring | Prometheus | 2.47+ | Metrics collection |
| Monitoring | Grafana | 10.0+ | Visualization |
| Monitoring | Alertmanager | 0.26+ | Alert routing |
| Logging | Elasticsearch | 8.11+ | Log storage |
| Logging | Filebeat | 8.11+ | Log shipping |
| Logging | Kibana | 8.11+ | Log visualization |
| Orchestration | Kubernetes | 1.28+ | Container orchestration |
| Packaging | Helm | 3.13+ | K8s package management |
| IaC | Terraform | 1.6+ | Infrastructure provisioning |
| GitOps | ArgoCD | 2.9+ | Continuous deployment |
| Backup | pgBackRest | 2.48+ | PostgreSQL backup |
| Secrets | Vault / AWS Secrets Manager | Latest | Secret management |
| Paging | PagerDuty | SaaS | Incident paging |
| Communication | Slack | SaaS | Team communication |

### Appendix C: Network Architecture

```
Internet
    │
    ▼
┌─────────┐    ┌─────────────┐    ┌──────────────────┐
│   CDN   │───▶│  Nginx/ALB  │───▶│  API Gateway     │
│         │    │  (TLS term) │    │  (auth/rate-lim) │
└─────────┘    └─────────────┘    └────────┬─────────┘
                                           │
                    ┌──────────────────────┼──────────────────────┐
                    │                      │                      │
                    ▼                      ▼                      ▼
            ┌──────────┐         ┌──────────────┐      ┌──────────┐
            │ Surveil- │         │   WebSocket  │      │ Grafana  │
            │ lance    │         │   Service    │      │ /Kibana  │
            │ API      │         │              │      │          │
            └────┬─────┘         └──────────────┘      └──────────┘
                 │
        ┌────────┼────────┬──────────────┐
        │        │        │              │
        ▼        ▼        ▼              ▼
   ┌──────────┐ ┌─────┐ ┌──────────┐ ┌────────────┐
   │PostgreSQL│ │Redis│ │ S3/MinIO │ │ Prometheus │
   └──────────┘ └─────┘ └──────────┘ └────────────┘

    VPN Tunnel
    ══════════
    ┌──────────────┐
    │  Edge Node   │◀── RTSP ──▶ [Cameras 1-8]
    │  (local proc)│
    └──────────────┘
```

### Appendix D: Document Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-01-15 | SRE Team | Initial comprehensive operations plan covering all 12 domains |

---

*END OF DOCUMENT*

*This document is a living document and should be reviewed and updated quarterly or after any significant infrastructure change.*
