AI Surveillance Platform — 24/7 Operations & Reliability Plan
Version: 1.0
Date: 2025-01-15
Classification: Internal — Operations & Engineering
System: 8-Channel AI Surveillance Platform (Cloud + Edge)
Target: Industrial-grade autonomous operations with minimal human intervention
Table of Contents
- Monitoring & Observability
- Logging Strategy
- Health Checks
- Service Restart & Recovery
- Backup Strategy
- Data Retention
- Storage Management
- Incident Response
- Upgrades & Maintenance
- Performance Optimization
- Disaster Recovery
- Capacity Planning
Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-01-15 | SRE Team | Initial comprehensive operations plan |
Approval
| Role | Name | Date |
|---|---|---|
| Head of Engineering | _____________ | ____/____/____ |
| Security Officer | _____________ | ____/____/____ |
| Operations Lead | _____________ | ____/____/____ |
1. Monitoring & Observability
1.1 Overview
The monitoring stack provides real-time visibility into all platform components, enabling proactive issue detection and rapid incident response. All metrics are collected at 15-second intervals with 15-month retention.
Tooling Choice: Prometheus + Grafana (primary) with Alertmanager for notification routing.
Architecture:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Node │ │ Prometheus │ │ Grafana │
│ Exporter │────▶│ Server │────▶│ Dashboards │
│ (per host) │ │ (TSDB) │ │ (visualize)│
└─────────────┘ └──────┬──────┘ └─────────────┘
│
┌──────┴──────┐
│ Alertmanager │────▶ PagerDuty / OpsGenie / Slack
└─────────────┘
1.2 Metrics Collection
1.2.1 System Metrics (Node Exporter + cAdvisor)
| Metric Category | Specific Metrics | Collection Interval | Retention |
|---|---|---|---|
| CPU | Usage % per core, load average (1m/5m/15m), steal time, iowait | 15s | 15 months |
| Memory | Used/available/total, swap usage, OOM kills, page faults | 15s | 15 months |
| Disk | Usage % per volume, IOPS, read/write latency, inode usage | 15s | 15 months |
| Network | RX/TX bytes/packets/drops per interface, TCP connections, retransmits | 15s | 15 months |
| Containers | CPU/memory per container, restart count, network IO per container | 15s | 15 months |
Prometheus scrape configuration:
# /etc/prometheus/prometheus.yml
scrape_configs:
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
scrape_interval: 15s
scrape_timeout: 10s
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
scrape_interval: 15s
- job_name: 'surveillance-api'
static_configs:
- targets: ['surveillance-api:8080']
scrape_interval: 15s
metrics_path: /metrics
- job_name: 'ai-inference'
static_configs:
- targets: ['ai-inference:8080']
scrape_interval: 15s
metrics_path: /metrics
- job_name: 'video-processor'
static_configs:
- targets: ['video-processor:8080']
scrape_interval: 15s
metrics_path: /metrics
1.2.2 Application Metrics (Custom / OpenTelemetry)
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| surveillance_fps_per_camera | Gauge | Current FPS being processed per camera | camera_id, location |
| surveillance_detection_rate | Gauge | Detections per second per stream | camera_id, model_version |
| surveillance_alert_rate | Counter | Total alerts generated | severity, camera_id, alert_type |
| surveillance_pipeline_latency_ms | Histogram | End-to-end processing latency | stage, camera_id |
| surveillance_frame_drop_rate | Gauge | Percentage of frames dropped | camera_id, reason |
| surveillance_model_inference_ms | Histogram | AI model inference time | model_name, batch_size |
| surveillance_stream_active | Gauge | Whether stream is active (1/0) | camera_id, source |
| surveillance_face_recognition_matches | Counter | Face recognition hits/misses | camera_id, match_type |
Application instrumentation (Python example):
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from functools import wraps
import time
# Define metrics
DETECTION_COUNTER = Counter(
'surveillance_detections_total',
'Total detections by type',
['camera_id', 'detection_type', 'model_version']
)
PIPELINE_LATENCY = Histogram(
'surveillance_pipeline_latency_ms',
'End-to-end pipeline latency in milliseconds',
['stage', 'camera_id'],
buckets=[10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
)
CAMERA_FPS = Gauge(
'surveillance_fps_per_camera',
'Current FPS per camera stream',
['camera_id', 'location']
)
STREAM_ACTIVE = Gauge(
'surveillance_stream_active',
'Stream connectivity status',
['camera_id', 'source']
)
def track_latency(stage, camera_id):
"""Decorator to track function latency."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
try:
return func(*args, **kwargs)
finally:
elapsed_ms = (time.time() - start) * 1000
PIPELINE_LATENCY.labels(
stage=stage,
camera_id=camera_id
).observe(elapsed_ms)
return wrapper
return decorator
1.2.3 Business Metrics
| Metric Name | Type | Business Purpose | Alert Threshold |
|---|---|---|---|
| surveillance_persons_detected_daily | Counter | Daily person detection volume | Anomaly detection |
| surveillance_unknown_persons | Counter | Unknown/alerted persons per period | Trend analysis |
| surveillance_alerts_sent | Counter | Alerts successfully delivered | Delivery health |
| surveillance_alerts_failed | Counter | Failed alert deliveries | > 5 in 5 min = P2 |
| surveillance_camera_uptime_pct | Gauge | Per-camera uptime percentage | < 99% = P3 |
| surveillance_detection_accuracy | Gauge | Model accuracy score | < threshold = P2 |
1.2.4 Error Metrics
| Metric Name | Type | Description | Severity |
|---|---|---|---|
| surveillance_errors_total | Counter | Errors by type and service | All |
| surveillance_stream_errors | Counter | Stream connection errors | P2 if > 10/min |
| surveillance_model_errors | Counter | Model inference failures | P1 if > 5/min |
| surveillance_db_errors | Counter | Database operation failures | P1 if > 3/min |
| surveillance_storage_errors | Counter | Storage read/write failures | P2 if > 5/min |
1.3 Alerting Rules
1.3.1 Critical Alerts (P1) — Page Immediately
# /etc/prometheus/alerts/critical.yml
groups:
- name: critical
rules:
- alert: AllStreamsDown
expr: sum(surveillance_stream_active) == 0
for: 1m
labels:
severity: p1
annotations:
summary: "ALL camera streams are down"
description: "No active streams detected for more than 1 minute"
runbook_url: "https://wiki.internal/runbooks/all-streams-down"
- alert: AIPipelineDown
expr: rate(surveillance_detections_total[5m]) == 0
for: 2m
labels:
severity: p1
annotations:
summary: "AI pipeline not producing detections"
description: "Zero detections in the last 2 minutes across all streams"
- alert: StorageFull
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
for: 1m
labels:
severity: p1
annotations:
summary: "Storage critically low: {{ $labels.mountpoint }}"
description: "Less than 5% storage remaining on {{ $labels.instance }}"
- alert: DatabaseUnreachable
expr: pg_up == 0
for: 1m
labels:
severity: p1
annotations:
summary: "PostgreSQL database is unreachable"
description: "Cannot connect to primary database"
- alert: HighErrorRate
expr: rate(surveillance_errors_total[5m]) > 10
for: 2m
labels:
severity: p1
annotations:
summary: "High error rate across services"
description: "Error rate exceeds 10 errors per second"
1.3.2 High Severity Alerts (P2) — Page Within 1 Hour
# /etc/prometheus/alerts/high.yml
groups:
- name: high
rules:
- alert: SingleCameraDown
expr: surveillance_stream_active{camera_id=~"cam.*"} == 0
for: 5m
labels:
severity: p2
annotations:
summary: "Camera {{ $labels.camera_id }} is offline"
description: "Camera stream has been down for more than 5 minutes"
- alert: HighLatency
expr: histogram_quantile(0.95,
rate(surveillance_pipeline_latency_ms_bucket[5m])) > 2000
for: 5m
labels:
severity: p2
annotations:
summary: "Pipeline latency is high"
description: "P95 latency exceeds 2000ms"
- alert: ModelAccuracyDegraded
expr: surveillance_detection_accuracy < 0.85
for: 10m
labels:
severity: p2
annotations:
summary: "AI model accuracy degraded"
description: "Detection accuracy below 85%"
- alert: MemoryPressure
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes > 0.90
for: 5m
labels:
severity: p2
annotations:
summary: "Memory pressure on {{ $labels.instance }}"
description: "Memory usage above 90%"
- alert: DiskSpaceWarning
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
for: 5m
labels:
severity: p2
annotations:
summary: "Disk space warning: {{ $labels.mountpoint }}"
description: "Less than 15% disk space remaining"
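The HighLatency rule relies on histogram_quantile over cumulative bucket counts. As a rough illustration of how a P95 value falls out of the buckets defined in 1.2.2, here is a simplified Python sketch; the sample counts are invented, and Prometheus's real implementation handles edge cases (such as the lowest bucket) slightly differently:

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets using
    linear interpolation inside the bucket containing the target rank.
    buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Interpolate the rank's position within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Invented cumulative counts for the latency buckets from section 1.2.2
samples = [(10, 120), (25, 400), (50, 850), (100, 970), (250, 995), (500, 1000)]
p95 = histogram_quantile(0.95, samples)
```

With these sample counts, rank 950 lands in the 50-100ms bucket, so the reported P95 sits between those bounds rather than at an exact observed value, which is why bucket boundaries should bracket the alert thresholds.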
1.3.3 Medium Severity Alerts (P3) — Respond Within 4 Hours
# /etc/prometheus/alerts/medium.yml
groups:
- name: medium
rules:
- alert: CameraFPSLow
expr: surveillance_fps_per_camera < 15
for: 10m
labels:
severity: p3
annotations:
summary: "Camera {{ $labels.camera_id }} FPS below threshold"
- alert: FrameDropsHigh
expr: surveillance_frame_drop_rate > 0.10
for: 10m
labels:
severity: p3
annotations:
summary: "High frame drop rate on {{ $labels.camera_id }}"
- alert: CertificateExpiry
expr: (ssl_certificate_expiry_seconds - time()) / 86400 < 30
for: 1h
labels:
severity: p3
annotations:
summary: "TLS certificate expiring soon"
- alert: BackupNotRun
expr: time() - surveillance_last_backup_timestamp > 90000
for: 1h
labels:
severity: p3
annotations:
summary: "Database backup has not run in 25+ hours"
1.3.4 Low Severity Alerts (P4) — Respond Within 24 Hours
# /etc/prometheus/alerts/low.yml
groups:
- name: low
rules:
- alert: HighCPU
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 30m
labels:
severity: p4
annotations:
summary: "CPU usage high on {{ $labels.instance }}"
- alert: ContainerRestartLoop
        expr: increase(container_restarts_total[15m]) > 3
for: 15m
labels:
severity: p4
annotations:
summary: "Container restart loop detected"
1.4 Alertmanager Configuration
# /etc/alertmanager/alertmanager.yml
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alerts@surveillance.company.com'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
slack_api_url: '<SLACK_WEBHOOK_URL>'
# Inhibit alerts of lower severity when higher severity fires
inhibit_rules:
- source_match:
severity: 'p1'
target_match:
severity: 'p2'
equal: ['alertname', 'instance']
route:
receiver: 'default'
group_by: ['alertname', 'severity', 'instance']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
# P1 alerts — page immediately, no grouping delay
- match:
severity: p1
receiver: 'p1-critical'
group_wait: 0s
repeat_interval: 15m
continue: true
# P2 alerts — page within 1 hour
- match:
severity: p2
receiver: 'p2-high'
group_wait: 2m
repeat_interval: 1h
# P3 alerts — Slack + email only
- match:
severity: p3
receiver: 'p3-medium'
group_wait: 5m
repeat_interval: 4h
# P4 alerts — daily digest
- match:
severity: p4
receiver: 'p4-low'
group_wait: 10m
repeat_interval: 24h
receivers:
- name: 'default'
slack_configs:
- channel: '#surveillance-alerts'
title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'p1-critical'
pagerduty_configs:
- service_key: '<PAGERDUTY_SERVICE_KEY>'
severity: critical
description: '{{ .GroupLabels.alertname }}'
slack_configs:
- channel: '#surveillance-critical'
send_resolved: true
title: 'P1 CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
email_configs:
- to: 'oncall@company.com'
subject: '[P1 CRITICAL] Surveillance Platform Alert'
- name: 'p2-high'
pagerduty_configs:
- service_key: '<PAGERDUTY_SERVICE_KEY>'
severity: error
slack_configs:
- channel: '#surveillance-alerts'
send_resolved: true
- name: 'p3-medium'
slack_configs:
- channel: '#surveillance-warnings'
send_resolved: true
- name: 'p4-low'
email_configs:
- to: 'ops-team@company.com'
subject: '[P4 Low] Surveillance Platform — Daily Digest'
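The route tree above is first-match-wins: an alert walks the child routes in order and stops at the first matching severity unless that route sets continue: true (as the P1 branch does), in which case later siblings are also evaluated. A minimal Python sketch of that matching logic, simplified to severity-only matchers with the receiver names from this config:

```python
def route_alert(severity, routes, default="default"):
    """Walk a flat route list the way Alertmanager walks child routes:
    collect the first matching route's receiver, and keep scanning
    only if that route sets continue=True. Fall back to the root
    (default) receiver when nothing matches."""
    receivers = []
    for route in routes:
        if route["match"].get("severity") == severity:
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                return receivers
    return receivers or [default]

# Condensed from the routes block above
ROUTES = [
    {"match": {"severity": "p1"}, "receiver": "p1-critical", "continue": True},
    {"match": {"severity": "p2"}, "receiver": "p2-high"},
    {"match": {"severity": "p3"}, "receiver": "p3-medium"},
    {"match": {"severity": "p4"}, "receiver": "p4-low"},
]
```

Note that with these matchers, continue: true on the P1 route is effectively a no-op, since no sibling route also matches severity p1; it only matters if a broader catch-all route is added later.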
1.5 Grafana Dashboards
1.5.1 Dashboard: Infrastructure Overview (ID: infra-overview)
{
"dashboard": {
"title": "Infrastructure Overview",
"tags": ["infrastructure", "overview"],
"timezone": "browser",
"panels": [
{
"title": "CPU Usage %",
"type": "timeseries",
"targets": [{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{ instance }}"
}],
"alert": {
"conditions": [{
"evaluator": {"params": [85], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"type": "avg"}
}]
}
},
{
"title": "Memory Usage",
"type": "timeseries",
"targets": [{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
"legendFormat": "{{ instance }}"
}]
},
{
"title": "Disk Usage",
"type": "gauge",
"targets": [{
"expr": "100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)"
}],
"fieldConfig": {
"max": 100,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "orange", "value": 85},
{"color": "red", "value": 95}
]
}
}
},
{
"title": "Network I/O",
"type": "timeseries",
"targets": [
{"expr": "rate(node_network_receive_bytes_total[5m])", "legendFormat": "RX {{ device }}"},
{"expr": "rate(node_network_transmit_bytes_total[5m])", "legendFormat": "TX {{ device }}"}
]
},
{
"title": "Container Count",
"type": "stat",
"targets": [{
"expr": "count(container_last_seen)"
}]
},
{
"title": "Container Restarts (15m)",
"type": "stat",
"targets": [{
"expr": "increase(container_restarts_total[15m])"
}],
"fieldConfig": {
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "red", "value": 1}
]
}
}
}
]
}
}
1.5.2 Dashboard: Camera Health (ID: camera-health)
| Panel | Type | Query / Data Source |
|---|---|---|
| Stream Status Grid | Stat grid (8 panels) | surveillance_stream_active{camera_id=~"cam.*"} |
| FPS per Camera | Timeseries | surveillance_fps_per_camera by camera_id |
| Frame Drop Rate | Timeseries | surveillance_frame_drop_rate by camera_id |
| Camera Uptime % | Gauge per camera | avg_over_time(surveillance_stream_active[24h]) * 100 |
| Stream Error Count | Bar chart | increase(surveillance_stream_errors[1h]) by camera_id |
| Last Frame Timestamp | Table | Time since last frame per camera |
| Bitrate per Stream | Timeseries | surveillance_stream_bitrate_kbps |
Camera Health Score Calculation:
# Overall camera health score (0-100); the weights already sum to 100
(
  avg(surveillance_stream_active) * 50 +
  (1 - avg(surveillance_frame_drop_rate)) * 30 +
  (avg(surveillance_fps_per_camera) / 30) * 20
)
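The same weighted score, as a small Python sketch for unit testing the panel logic offline. It assumes a 30 FPS target (as in the PromQL) and caps the FPS term at its full weight, so a camera briefly running above target cannot push the score past 100:

```python
def camera_health_score(stream_active, frame_drop_rate, fps, target_fps=30):
    """Weighted 0-100 health score: 50 points for availability,
    30 for frame delivery, 20 for FPS relative to target (capped)."""
    fps_ratio = min(fps / target_fps, 1.0)
    score = stream_active * 50 + (1 - frame_drop_rate) * 30 + fps_ratio * 20
    return round(score, 1)
```

For example, an active camera dropping 10% of frames at 15 FPS scores 87.0, which keeps it out of alert range while still visibly degraded on the dashboard.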
1.5.3 Dashboard: AI Pipeline Performance (ID: ai-pipeline)
| Panel | Type | Metric |
|---|---|---|
| Inference Latency (P50/P95/P99) | Timeseries | histogram_quantile(0.50/0.95/0.99, rate(...)) |
| Detections per Second | Timeseries | rate(surveillance_detections_total[5m]) |
| Model Accuracy Trend | Timeseries | surveillance_detection_accuracy |
| Pipeline Throughput | Stat | Total frames processed/minute |
| GPU Utilization (if applicable) | Gauge | nvidia_gpu_utilization_gpu |
| GPU Memory Usage | Timeseries | nvidia_gpu_memory_used_bytes |
| Model Load Status | Table | Current model version, load time, status |
| Batch Size Distribution | Heatmap | Inference batch sizes over time |
1.5.4 Dashboard: Alert Delivery Stats (ID: alert-delivery)
| Panel | Type | Query |
|---|---|---|
| Alerts Sent Today | Stat | increase(surveillance_alerts_sent[24h]) |
| Alerts Failed | Stat | increase(surveillance_alerts_failed[24h]) |
| Delivery Success Rate | Gauge | alerts_sent / (alerts_sent + alerts_failed) |
| Alerts by Severity | Pie chart | surveillance_alerts_sent by severity |
| Alerts by Camera | Bar chart | Top cameras by alert count |
| Notification Channel Status | Table | Channel health per delivery method |
| Alert Response Time | Histogram | Time from detection to notification |
1.5.5 Dashboard: Storage Usage Trends (ID: storage-trends)
| Panel | Type | Query |
|---|---|---|
| Total Storage Used | Stat | Sum of all storage volumes |
| Storage Growth Rate | Timeseries | Daily increase in bytes |
| Retention Policy Status | Table | Days remaining per retention tier |
| Media vs. Metadata Split | Pie chart | Storage breakdown by type |
| Projected Capacity Exhaustion | Stat | Days until full at current growth rate |
| Cleanup Job Status | Table | Last run, records cleaned, errors |
| Cross-Region Replication Lag | Timeseries | Replication delay in seconds |
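The "Projected Capacity Exhaustion" panel is a linear extrapolation from current usage and growth rate. A minimal sketch of that projection (the volume sizes are illustrative, not the platform's actual capacity):

```python
TB = 1024 ** 4  # bytes per tebibyte

def days_until_full(capacity_bytes, used_bytes, daily_growth_bytes):
    """Linear projection of days remaining before the volume fills.
    Returns infinity when usage is flat or shrinking, since the
    panel has nothing meaningful to project in that case."""
    if daily_growth_bytes <= 0:
        return float("inf")
    return (capacity_bytes - used_bytes) / daily_growth_bytes

# Illustrative: 50 TiB volume, 35 TiB used, growing 0.5 TiB/day
remaining = days_until_full(50 * TB, 35 * TB, 0.5 * TB)
```

A linear model is deliberately conservative; if growth is accelerating (for example after adding cameras), the true exhaustion date arrives sooner, so the panel should be read as an upper bound.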
1.6 On-Call Rotation
| Shift | Time (UTC) | Primary On-Call | Secondary |
|---|---|---|---|
| APAC | 00:00 — 08:00 | APAC SRE Team | EMEA Escalation |
| EMEA | 08:00 — 16:00 | EMEA SRE Team | Americas Escalation |
| Americas | 16:00 — 00:00 | Americas SRE Team | APAC Escalation |
Escalation Policy (PagerDuty):
- Notification: Alert fires → Notify on-call engineer via PagerDuty push + SMS
- Acknowledge: 5-minute acknowledge window
- Escalation 1: No acknowledge → Escalate to team lead (15 min)
- Escalation 2: No response → Escalate to engineering manager (30 min)
- Escalation 3: No response → Escalate to VP Engineering (45 min)
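The follow-the-sun handoff in the table above reduces to a simple UTC-hour lookup, which is worth encoding once (for example in the paging glue code) rather than re-deriving per incident. A sketch of that mapping:

```python
def oncall_region(utc_hour):
    """Map a UTC hour to the primary on-call region per the rotation
    table: 00:00-08:00 APAC, 08:00-16:00 EMEA, 16:00-24:00 Americas."""
    if not 0 <= utc_hour < 24:
        raise ValueError("utc_hour must be in [0, 24)")
    if utc_hour < 8:
        return "APAC"
    if utc_hour < 16:
        return "EMEA"
    return "Americas"
```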
2. Logging Strategy
2.1 Log Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Application │────▶│ Filebeat │────▶│ Logstash │────▶│ Elasticsearch│
│ (JSON logs)│ │ (shipper) │ │ (processor) │ │ (store) │
└─────────────┘ └─────────────┘ └─────────────┘ └──────┬──────┘
│
┌──────┴──────┐
│ Kibana │
│ (visualize) │
└─────────────┘
2.2 Log Levels
| Level | Numeric | Usage | Retention | Action |
|---|---|---|---|---|
| DEBUG | 10 | Detailed diagnostic info | 7 days | Development only |
| INFO | 20 | Normal operational events | 90 days | Standard operations |
| WARNING | 30 | Anomalous but non-critical conditions | 90 days | Monitor trends |
| ERROR | 40 | Operational failures, handled exceptions | 1 year | Alert if rate > threshold |
| CRITICAL | 50 | System-threatening failures | 1 year | Immediate P1 alert |
Production default level: INFO (DEBUG only enabled per-request for troubleshooting)
2.3 Structured Logging Format
All application logs MUST be in JSON format:
{
"timestamp": "2025-01-15T08:30:15.123456Z",
"level": "ERROR",
"logger": "surveillance.video_processor",
"message": "Failed to connect to camera stream",
"request_id": "req_abc123def456",
"trace_id": "trace_789xyz",
"service": "video-processor",
"version": "2.3.1",
"host": "edge-node-01",
"environment": "production",
"camera_id": "cam_03_entrance",
"location": "main_entrance",
"error": {
"type": "ConnectionTimeout",
"message": "Connection to rtsp://192.168.1.103:554/stream timed out after 10s",
"retry_count": 3,
"stack_trace": "..."
},
"context": {
"stream_url": "rtsp://***.***.1.***:554/stream",
"connection_duration_ms": 10000,
"previous_disconnect": "2025-01-15T08:25:00Z"
},
"performance": {
"processing_time_ms": 0.5,
"memory_mb": 128.5
}
}
Python logging configuration:
# logging_config.py
import logging
import os
from datetime import datetime
from pythonjsonlogger import jsonlogger
class StructuredLogFormatter(jsonlogger.JsonFormatter):
def add_fields(self, log_record, record, message_dict):
super().add_fields(log_record, record, message_dict)
log_record['timestamp'] = datetime.utcnow().isoformat() + 'Z'
log_record['level'] = record.levelname
log_record['logger'] = record.name
log_record['service'] = os.environ.get('SERVICE_NAME', 'unknown')
log_record['version'] = os.environ.get('SERVICE_VERSION', 'unknown')
log_record['host'] = os.environ.get('HOSTNAME', 'unknown')
log_record['environment'] = os.environ.get('ENV', 'production')
LOGGING_CONFIG = {
'version': 1,
'disable_existing_loggers': False,
'formatters': {
'json': {
'()': StructuredLogFormatter,
'format': '%(timestamp)s %(level)s %(message)s'
}
},
'handlers': {
'console': {
'class': 'logging.StreamHandler',
'formatter': 'json',
'stream': 'ext://sys.stdout'
},
'file': {
'class': 'logging.handlers.RotatingFileHandler',
'formatter': 'json',
'filename': '/var/log/surveillance/app.log',
'maxBytes': 104857600, # 100 MB
'backupCount': 10
}
},
'loggers': {
'surveillance': {
'level': os.environ.get('LOG_LEVEL', 'INFO'),
'handlers': ['console', 'file'],
'propagate': False
}
}
}
2.4 Log Correlation
Every request receives a unique request_id and trace_id:
import uuid
import contextvars
# Context variable for request-scoped tracing
request_id_var = contextvars.ContextVar('request_id', default=None)
trace_id_var = contextvars.ContextVar('trace_id', default=None)
def get_current_request_id() -> str:
req_id = request_id_var.get()
if req_id is None:
req_id = f"req_{uuid.uuid4().hex[:16]}"
request_id_var.set(req_id)
return req_id
def get_current_trace_id() -> str:
trace_id = trace_id_var.get()
if trace_id is None:
trace_id = f"trace_{uuid.uuid4().hex[:16]}"
trace_id_var.set(trace_id)
return trace_id
Propagation across services:
- HTTP: X-Request-ID and X-Trace-ID headers
- Message queue: metadata fields in the message envelope
- gRPC: custom metadata
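For the HTTP case, propagation is just a matter of attaching the current IDs as headers on every outgoing call, minting fresh ones when none are in scope. A sketch using the same ID format as the helpers above:

```python
import uuid

def correlation_headers(request_id=None, trace_id=None):
    """Build the X-Request-ID / X-Trace-ID headers for an outgoing
    HTTP call. Falls back to freshly generated IDs (same req_/trace_
    prefix format as the context-variable helpers) when the caller
    has no IDs in scope."""
    request_id = request_id or f"req_{uuid.uuid4().hex[:16]}"
    trace_id = trace_id or f"trace_{uuid.uuid4().hex[:16]}"
    return {"X-Request-ID": request_id, "X-Trace-ID": trace_id}
```

On the receiving side, a middleware should read these headers and seed the context variables before any log line is emitted, so the whole request shares one correlation ID.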
2.5 Log Retention Policy
| Log Category | Retention | Storage Class | Compression |
|---|---|---|---|
| Application logs (INFO+) | 90 days | Hot (SSD) 30d → Warm 60d | After 7 days |
| Error logs (ERROR+) | 1 year | Warm 90d → Cold 275d | After 30 days |
| Audit logs | 1 year | Hot 90d → Warm 180d → Cold 95d | After 90 days |
| Debug logs | 7 days | Hot only | None |
| Access logs | 90 days | Warm 30d → Cold 60d | After 30 days |
| System logs (syslog/journald) | 90 days | Warm | After 7 days |
Elasticsearch Index Lifecycle Management (ILM):
PUT _ilm/policy/surveillance-logs
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "50GB",
"max_age": "1d",
"max_docs": 100000000
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"allocate": {
"require": { "data": "warm" }
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"require": { "data": "cold" }
},
"freeze": {}
}
},
"delete": {
"min_age": "90d",
"actions": { "delete": {} }
}
}
}
}
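The ILM policy above moves an index through phases purely by age (ignoring the size/doc rollover triggers, which only affect when a new index starts). For quick reasoning about where a given index sits, the phase boundaries reduce to:

```python
def ilm_phase(age_days):
    """Phase of an index under the surveillance-logs ILM policy,
    by age alone: hot until 7d, warm until 30d, cold until 90d,
    then eligible for deletion."""
    if age_days < 7:
        return "hot"
    if age_days < 30:
        return "warm"
    if age_days < 90:
        return "cold"
    return "delete"
```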
2.6 Sensitive Data Handling
NEVER log:
- Face embeddings or biometric data
- Full-resolution images of detected persons
- PII (names, employee IDs, phone numbers)
- Credentials, API keys, tokens, passwords
- Stream URLs with embedded credentials
- Internal network topology
- VPN configuration details
Sanitization rules:
import re
SENSITIVE_PATTERNS = [
(r'rtsp://[^:]+:[^@]+@', 'rtsp://***:***@'),
(r'password[=:]\s*\S+', 'password=***'),
(r'api[_-]?key[=:]\s*\S+', 'api_key=***'),
(r'token[=:]\s*\S+', 'token=***'),
(r'embedding[=:]\s*\[.*?\]', 'embedding=[REDACTED]'),
(r'face[_-]?vector[=:]\s*\[.*?\]', 'face_vector=[REDACTED]'),
]
def sanitize_log_message(message: str) -> str:
for pattern, replacement in SENSITIVE_PATTERNS:
message = re.sub(pattern, replacement, message, flags=re.IGNORECASE)
return message
3. Health Checks
3.1 Health Check Architecture
┌─────────────────────────────────────────────────────────────┐
│ Health Check Endpoints │
│ │
│ /health → Liveness probe (Kubernetes/Docker) │
│ /health/ready → Readiness probe (accepting traffic) │
│ /health/deep → Deep health (full pipeline validation) │
└─────────────────────────────────────────────────────────────┘
3.2 Endpoint Specifications
3.2.1 Liveness Probe — GET /health
Purpose: Determine if the process is running and not deadlocked.
Response:
{
"status": "alive",
"timestamp": "2025-01-15T08:30:15Z",
"service": "surveillance-api",
"version": "2.3.1",
"uptime_seconds": 86400
}
Criteria:
- Process is running
- Main thread is not blocked
- Returns HTTP 200 within 1 second
Failure action: Container orchestrator restarts the container.
Configuration:
# Kubernetes
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
# Docker Compose
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 3s
retries: 3
start_period: 30s
3.2.2 Readiness Probe — GET /health/ready
Purpose: Determine if the service is ready to accept traffic.
Response:
{
"status": "ready",
"timestamp": "2025-01-15T08:30:15Z",
"service": "surveillance-api",
"version": "2.3.1",
"checks": {
"database": {
"status": "pass",
"response_time_ms": 12,
"message": "Connected to PostgreSQL primary"
},
"object_storage": {
"status": "pass",
"response_time_ms": 45,
"message": "S3 bucket accessible"
},
"cache": {
"status": "pass",
"response_time_ms": 2,
"message": "Redis connection OK"
}
}
}
Criteria:
- All required dependencies reachable
- Database connection pool has available connections
- Object storage accessible
- Cache layer accessible
- AI model loaded (for inference services)
Failure response: HTTP 503 with details
{
"status": "not_ready",
"timestamp": "2025-01-15T08:30:15Z",
"checks": {
"database": {
"status": "fail",
"response_time_ms": 5000,
"message": "Connection timeout after 5000ms"
},
"object_storage": { "status": "pass" },
"cache": { "status": "pass" }
}
}
Configuration:
# Kubernetes
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 5
failureThreshold: 3
successThreshold: 2
3.2.3 Deep Health Check — GET /health/deep
Purpose: Validate the entire processing pipeline end-to-end.
Response:
{
"status": "healthy",
"timestamp": "2025-01-15T08:30:15Z",
"service": "surveillance-platform",
"version": "2.3.1",
"checks": {
"database": {
"status": "pass",
"response_time_ms": 8,
"details": {
"connection": "ok",
"query_execution": "ok",
"replication_lag_seconds": 0
}
},
"object_storage": {
"status": "pass",
"response_time_ms": 67,
"details": {
"read_test": "ok",
"write_test": "ok",
"list_test": "ok"
}
},
"ai_model": {
"status": "pass",
"response_time_ms": 145,
"details": {
"model_loaded": true,
"model_version": "face-detection-v2.1",
"gpu_available": true,
"test_inference": "ok"
}
},
"streams": {
"status": "pass",
"details": {
"active_streams": 8,
"expected_streams": 8,
"streams": [
{"camera_id": "cam_01", "fps": 30, "status": "active"},
{"camera_id": "cam_02", "fps": 30, "status": "active"},
{"camera_id": "cam_03", "fps": 25, "status": "active"},
{"camera_id": "cam_04", "fps": 30, "status": "active"},
{"camera_id": "cam_05", "fps": 30, "status": "active"},
{"camera_id": "cam_06", "fps": 28, "status": "active"},
{"camera_id": "cam_07", "fps": 30, "status": "active"},
{"camera_id": "cam_08", "fps": 30, "status": "active"}
]
}
},
"cache": {
"status": "pass",
"response_time_ms": 1,
"details": {
"set_test": "ok",
"get_test": "ok",
"memory_usage_pct": 45
}
},
"alert_delivery": {
"status": "pass",
"details": {
"channels_tested": 3,
"success": 3
}
},
"pipeline_e2e": {
"status": "pass",
"response_time_ms": 523,
"details": {
"capture": "ok",
"inference": "ok",
"alert_generation": "ok",
"storage": "ok"
}
}
}
}
Execution:
- Triggered manually or by monitoring every 5 minutes
- NOT used for Kubernetes probes (too slow)
- Full pipeline validation takes 1-5 seconds
3.3 Dependency Health Check Matrix
| Dependency | Check Method | Timeout | Expected Result | Failure Action |
|---|---|---|---|---|
| PostgreSQL | SELECT 1 | 3s | Row returned | Return not_ready |
| Redis Cache | PING → PONG | 2s | PONG received | Degrade to DB only |
| S3 / Object Storage | List + Put + Get test object | 10s | All operations succeed | Queue for retry |
| AI Model | Load model + test inference | 30s | Inference completes | Report model error |
| Camera Streams | RTSP describe/ping | 10s | Stream metadata received | Mark stream offline |
| VPN Tunnel | ICMP to edge gateway | 5s | Response received | Mark edge offline |
| SMTP/Notification | TCP connect + EHLO | 5s | SMTP greeting received | Queue alerts |
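Most rows in this matrix boil down to "reach the endpoint within a deadline and report pass/fail plus elapsed time". A minimal TCP reachability sketch in that spirit (a real Redis or SMTP check would additionally exchange the protocol greeting, e.g. PING/PONG or EHLO):

```python
import socket
import time

def tcp_check(host, port, timeout_s):
    """Probe a TCP endpoint within a deadline. Returns (ok, elapsed_ms),
    mirroring the pass/fail + response_time_ms shape used by the
    health-check results in section 3.4."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            ok = True
    except OSError:  # refused, unreachable, or timed out
        ok = False
    return ok, round((time.monotonic() - start) * 1000, 1)

# Example: probing a port that should be closed fails fast
ok, elapsed_ms = tcp_check("127.0.0.1", 1, timeout_s=0.5)
```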
3.4 Health Check Implementation
# health.py
from enum import Enum
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import time
import asyncio
class HealthStatus(Enum):
PASS = "pass"
FAIL = "fail"
WARN = "warn"
@dataclass
class HealthCheckResult:
name: str
status: HealthStatus
response_time_ms: float
message: str
details: Dict = field(default_factory=dict)
class HealthChecker:
def __init__(self):
self.checks = {}
def register(self, name: str, check_func):
self.checks[name] = check_func
async def run_all(self, timeout: float = 30.0) -> List[HealthCheckResult]:
tasks = [
self._run_check(name, func, timeout)
for name, func in self.checks.items()
]
return await asyncio.gather(*tasks)
async def _run_check(self, name: str, func, timeout: float) -> HealthCheckResult:
start = time.monotonic()
try:
result = await asyncio.wait_for(func(), timeout=timeout)
elapsed = (time.monotonic() - start) * 1000
result.response_time_ms = round(elapsed, 2)
return result
except asyncio.TimeoutError:
return HealthCheckResult(
name=name,
status=HealthStatus.FAIL,
response_time_ms=timeout * 1000,
message=f"Health check timed out after {timeout}s"
)
except Exception as e:
elapsed = (time.monotonic() - start) * 1000
return HealthCheckResult(
name=name,
status=HealthStatus.FAIL,
response_time_ms=round(elapsed, 2),
message=str(e)
)
# Usage
health_checker = HealthChecker()
# Register checks
health_checker.register("database", check_database)
health_checker.register("object_storage", check_object_storage)
health_checker.register("ai_model", check_ai_model)
health_checker.register("streams", check_all_streams)
health_checker.register("cache", check_cache)
# FastAPI endpoint
from datetime import datetime
from fastapi import FastAPI
from fastapi.responses import JSONResponse
app = FastAPI()
@app.get("/health")
async def liveness():
return {"status": "alive", "timestamp": datetime.utcnow().isoformat()}
@app.get("/health/ready")
async def readiness():
results = await health_checker.run_all(timeout=5.0)
all_pass = all(r.status == HealthStatus.PASS for r in results)
status_code = 200 if all_pass else 503
status = "ready" if all_pass else "not_ready"
return JSONResponse(
status_code=status_code,
content={
"status": status,
"timestamp": datetime.utcnow().isoformat(),
"checks": {
r.name: {
"status": r.status.value,
"response_time_ms": r.response_time_ms,
"message": r.message,
**r.details
}
for r in results
}
}
)
@app.get("/health/deep")
async def deep_health():
# Runs full pipeline check
results = await health_checker.run_all(timeout=30.0)
# ... similar to readiness but with pipeline_e2e
4. Service Restart & Recovery
4.1 Service Startup Sequence
Services must start in strict dependency order. Docker Compose depends_on conditions (service_healthy) or Kubernetes init containers enforce this ordering.
Phase 1: Infrastructure
├─ PostgreSQL (primary + replica)
├─ Redis Cache
└─ MinIO / S3 Object Storage
Phase 2: Core Services
├─ Message Queue (RabbitMQ / NATS)
├─ Configuration Service
└─ Identity/Auth Service
Phase 3: AI Pipeline
├─ Model Service (download & load models)
├─ Video Capture Service (connect to cameras)
├─ AI Inference Service
└─ Post-Processing Service
Phase 4: Application Layer
├─ API Gateway
├─ Surveillance API Service
├─ Alert Service
└─ WebSocket / Real-time Service
Phase 5: Frontend
├─ Nginx / Reverse Proxy
└─ Web Dashboard
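The phase diagram is equivalent to a dependency graph, and a valid startup order is any topological sort of it. A sketch using the standard library's graphlib (Python 3.9+), with the edges condensed from the phases above; the exact edge set here is illustrative:

```python
from graphlib import TopologicalSorter

# node -> set of services that must be healthy first
DEPS = {
    "postgres": set(), "redis": set(), "minio": set(),
    "message-queue": {"postgres"}, "config-service": {"postgres"},
    "model-service": {"redis", "minio"},
    "video-capture": {"model-service"},
    "ai-inference": {"video-capture"},
    "surveillance-api": {"ai-inference", "postgres", "redis"},
    "nginx": {"surveillance-api"},
}

# Any order emitted here respects every dependency edge
startup_order = list(TopologicalSorter(DEPS).static_order())
```

This is also a cheap CI check: if someone adds a depends_on cycle to the compose file, TopologicalSorter raises CycleError instead of silently producing a broken order.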
Docker Compose startup configuration:
# docker-compose.yml (relevant section)
services:
postgres:
image: postgres:15.4@sha256:abc123...
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "pg_isready -U surveillance"]
interval: 5s
timeout: 3s
retries: 5
redis:
image: redis:7.2@sha256:def456...
restart: unless-stopped
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5
depends_on:
postgres:
condition: service_healthy
model-service:
image: surveillance/model-service:2.3.1@sha256:ghi789...
restart: unless-stopped
environment:
- MODEL_PATH=/models
- DOWNLOAD_IF_MISSING=true
volumes:
- model-cache:/models
depends_on:
redis:
condition: service_healthy
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 30s
retries: 10
start_period: 60s
video-capture:
image: surveillance/capture:2.3.1@sha256:jkl012...
restart: unless-stopped
depends_on:
model-service:
condition: service_healthy
environment:
- STREAM_RETRY_MAX=10
- STREAM_RETRY_DELAY=5
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
ai-inference:
image: surveillance/inference:2.3.1@sha256:mno345...
restart: unless-stopped
depends_on:
video-capture:
condition: service_healthy
deploy:
resources:
limits:
cpus: '4.0'
memory: 8G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
interval: 10s
timeout: 10s
retries: 5
start_period: 120s
surveillance-api:
image: surveillance/api:2.3.1@sha256:pqr678...
restart: unless-stopped
depends_on:
ai-inference:
condition: service_healthy
environment:
- DATABASE_URL=postgresql://...@postgres/surveillance
- REDIS_URL=redis://redis:6379
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health/ready"]
interval: 10s
timeout: 5s
retries: 3
start_period: 20s
nginx:
image: nginx:alpine@sha256:stu901...
restart: unless-stopped
ports:
- "80:80"
- "443:443"
depends_on:
surveillance-api:
condition: service_healthy
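For edge deployments that run outside an orchestrator, the same phase ordering can be enforced by a supervisor script that polls each service's health endpoint before starting the next phase. A minimal sketch, where the actual probes (for example an HTTP GET against /health) are supplied by the caller as callables; the function and parameter names are illustrative:

```python
import time
from typing import Callable, Dict, List


def wait_for_phase(checks: Dict[str, Callable[[], bool]],
                   timeout: float = 120.0,
                   interval: float = 2.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Block until every service in the phase reports healthy.

    Returns the names still unhealthy at timeout (empty list on success).
    """
    deadline = clock() + timeout
    pending = dict(checks)
    while pending and clock() < deadline:
        for name, check in list(pending.items()):
            try:
                if check():
                    pending.pop(name)
            except Exception:
                pass  # a probe error counts as "not yet healthy"
        if pending:
            sleep(interval)
    return sorted(pending)
```

Phases 1 through 5 then become a list of check dictionaries processed in order, with startup aborted if any phase returns a non-empty list.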
4.2 Graceful Shutdown Procedure
All services must handle SIGTERM for graceful shutdown:
# shutdown_handler.py
import asyncio
import signal
import logging
from contextlib import asynccontextmanager
logger = logging.getLogger(__name__)
class GracefulShutdown:
def __init__(self, shutdown_timeout: float = 30.0):
self.shutdown_timeout = shutdown_timeout
self._shutdown_event = asyncio.Event()
self._tasks = []
def register_task(self, task):
self._tasks.append(task)
async def wait_for_shutdown(self):
await self._shutdown_event.wait()
def trigger_shutdown(self):
logger.info("Shutdown signal received, initiating graceful shutdown...")
self._shutdown_event.set()
async def shutdown(self):
"""Execute graceful shutdown sequence."""
logger.info("Starting graceful shutdown sequence...")
# 1. Stop accepting new requests/connections
logger.info("1. Stopping request acceptance")
await self._stop_accepting_requests()
# 2. Wait for in-flight requests to complete
logger.info("2. Waiting for in-flight requests (timeout: %.0fs)",
self.shutdown_timeout)
try:
await asyncio.wait_for(
self._wait_inflight_requests(),
timeout=self.shutdown_timeout * 0.6
)
except asyncio.TimeoutError:
logger.warning("In-flight requests did not complete in time")
# 3. Flush buffers and complete pending writes
logger.info("3. Flushing buffers")
await self._flush_buffers()
# 4. Close camera streams gracefully
logger.info("4. Closing camera streams")
await self._close_streams()
# 5. Release resources
logger.info("5. Releasing resources")
await self._release_resources()
# 6. Close database connections
logger.info("6. Closing database connections")
await self._close_database_connections()
logger.info("Graceful shutdown complete")
async def _stop_accepting_requests(self):
# Mark service as not ready
pass
async def _wait_inflight_requests(self):
# Wait for active request count to reach zero
pass
async def _flush_buffers(self):
# Flush any pending log buffers, metric batches
pass
async def _close_streams(self):
# Send RTSP TEARDOWN, release capture resources
pass
async def _release_resources(self):
# Release GPU memory, file handles
pass
async def _close_database_connections(self):
# Return connections to pool, close pool
pass
def setup_signal_handlers(shutdown_manager: GracefulShutdown):
    # Must be called from within the running event loop;
    # asyncio.get_event_loop() is deprecated for this use since Python 3.10.
    loop = asyncio.get_running_loop()
def handle_signal(sig):
logger.info("Received signal %s", sig.name)
shutdown_manager.trigger_shutdown()
asyncio.create_task(shutdown_manager.shutdown())
for sig in (signal.SIGTERM, signal.SIGINT):
loop.add_signal_handler(sig, lambda s=sig: handle_signal(s))
Kubernetes graceful termination:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: surveillance-api
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 5 && curl -X POST localhost:8080/shutdown"]
4.3 Crash Recovery & Automatic Restart
| Scenario | Detection | Automatic Action | Manual Intervention |
|---|---|---|---|
| Container exits non-zero | Docker/K8s | Restart with exponential backoff (max 5 min) | If > 5 restarts in 10 min |
| OOM killed | Kernel event | Restart with 25% memory increase (max 3x) | Review memory limits |
| Health check fails | Probe failure | Restart container | If restart loop persists |
| Node failure | Node not ready | Reschedule to healthy node | Investigate failed node |
| Camera stream disconnect | No frames received | Retry with exponential backoff | If > 30 min offline |
| AI model load failure | Inference timeout | Reload model from backup | If model corrupted |
| Database connection lost | Query timeout | Retry connection, use replica | If primary down > 5 min |
Exponential backoff for stream reconnection:
import asyncio
import logging
import random

logger = logging.getLogger(__name__)
async def reconnect_stream(camera_id: str, max_retries: int = 100):
base_delay = 5 # seconds
max_delay = 300 # 5 minutes
for attempt in range(1, max_retries + 1):
delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
jitter = random.uniform(0, delay * 0.1)
wait_time = delay + jitter
logger.info("Camera %s: Reconnect attempt %d/%d in %.1fs",
camera_id, attempt, max_retries, wait_time)
await asyncio.sleep(wait_time)
try:
stream = await connect_stream(camera_id)
logger.info("Camera %s: Reconnected successfully", camera_id)
return stream
except Exception as e:
logger.warning("Camera %s: Reconnect failed: %s", camera_id, e)
logger.error("Camera %s: Max retries exceeded, stream marked offline", camera_id)
return None
4.4 Circuit Breaker Pattern
Protect against cascading failures when dependencies are down:
# circuit_breaker.py
from enum import Enum
import asyncio
import time
from dataclasses import dataclass
class CircuitState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing fast
HALF_OPEN = "half_open" # Testing recovery
@dataclass
class CircuitBreakerConfig:
failure_threshold: int = 5
recovery_timeout: float = 30.0
half_open_max_calls: int = 3
success_threshold: int = 2
class CircuitBreaker:
def __init__(self, name: str, config: CircuitBreakerConfig = None):
self.name = name
self.config = config or CircuitBreakerConfig()
self.state = CircuitState.CLOSED
self.failure_count = 0
self.success_count = 0
self.last_failure_time = 0
self.half_open_calls = 0
self._lock = asyncio.Lock()
async def call(self, func, *args, **kwargs):
async with self._lock:
await self._transition_state()
if self.state == CircuitState.OPEN:
raise CircuitBreakerOpen(
f"Circuit breaker '{self.name}' is OPEN"
)
if self.state == CircuitState.HALF_OPEN:
if self.half_open_calls >= self.config.half_open_max_calls:
raise CircuitBreakerOpen(
f"Circuit '{self.name}' half-open limit reached"
)
self.half_open_calls += 1
# Execute outside lock
try:
result = await func(*args, **kwargs)
await self._on_success()
return result
except Exception as e:
await self._on_failure()
raise
async def _transition_state(self):
if self.state == CircuitState.OPEN:
if time.time() - self.last_failure_time >= self.config.recovery_timeout:
self.state = CircuitState.HALF_OPEN
self.half_open_calls = 0
self.success_count = 0
async def _on_success(self):
async with self._lock:
if self.state == CircuitState.HALF_OPEN:
self.success_count += 1
if self.success_count >= self.config.success_threshold:
self.state = CircuitState.CLOSED
self.failure_count = 0
else:
self.failure_count = 0
async def _on_failure(self):
async with self._lock:
self.failure_count += 1
self.last_failure_time = time.time()
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.OPEN
elif self.failure_count >= self.config.failure_threshold:
self.state = CircuitState.OPEN
class CircuitBreakerOpen(Exception):
pass
Usage:
# Create breakers for each dependency
db_breaker = CircuitBreaker("database", CircuitBreakerConfig(
failure_threshold=3,
recovery_timeout=30.0
))
storage_breaker = CircuitBreaker("object_storage", CircuitBreakerConfig(
failure_threshold=5,
recovery_timeout=60.0
))
# Use in service calls
async def save_detection(detection):
return await db_breaker.call(
db_repository.save_detection, detection
)
async def store_frame(frame):
return await storage_breaker.call(
s3_client.upload, frame
)
4.5 Bulkhead Pattern — Resource Isolation
Isolate resources to prevent one failing component from consuming all resources:
# bulkhead.py
import asyncio
from asyncio import Semaphore
class Bulkhead:
"""Limits concurrent operations per service/camera."""
def __init__(self, name: str, max_concurrent: int, max_queue: int = 100):
self.name = name
self.semaphore = Semaphore(max_concurrent)
self.max_queue = max_queue
self.queue_size = 0
self._lock = asyncio.Lock()
async def execute(self, func, *args, **kwargs):
async with self._lock:
if self.queue_size >= self.max_queue:
raise BulkheadFull(
f"Bulkhead '{self.name}' queue full ({self.max_queue})"
)
self.queue_size += 1
try:
async with self.semaphore:
return await func(*args, **kwargs)
finally:
async with self._lock:
self.queue_size -= 1
class BulkheadFull(Exception):
pass
# Per-camera bulkheads to isolate failures
camera_bulkheads = {
f"cam_{i:02d}": Bulkhead(f"cam_{i:02d}", max_concurrent=4)
for i in range(1, 9)
}
# Per-service bulkheads
db_bulkhead = Bulkhead("database", max_concurrent=20)
storage_bulkhead = Bulkhead("storage", max_concurrent=10)
inference_bulkhead = Bulkhead("inference", max_concurrent=8)
4.6 Recovery State Persistence
Critical state is persisted to survive restarts:
| State Type | Storage | Recovery Action |
|---|---|---|
| Camera configurations | PostgreSQL | Reload on startup |
| Alert rules | PostgreSQL | Reload on startup |
| Processing offsets | Redis | Resume from last offset |
| In-flight detections | Redis → PostgreSQL | Replay from queue |
| Model version | Object Storage | Load specified version |
| Stream connection state | Local file | Attempt reconnection |
| Audit log buffer | Local file → Async flush | Recover unflushed entries |
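The processing-offset row above can be sketched as a small checkpoint helper. The sketch below accepts any dict-like key-value store so it stays self-contained; in production the store would be a Redis client (for example redis.asyncio), which is an assumption, and the class and key names are illustrative:

```python
from typing import MutableMapping


class OffsetCheckpoint:
    """Persist per-camera processing offsets so a restarted worker
    resumes where it left off instead of reprocessing or skipping frames."""

    def __init__(self, store: MutableMapping[str, str], prefix: str = "offset"):
        self.store = store  # a plain dict here; a Redis key space in production
        self.prefix = prefix

    def _key(self, camera_id: str) -> str:
        return f"{self.prefix}:{camera_id}"

    def commit(self, camera_id: str, frame_index: int) -> None:
        # Call only after the frame (or batch) is fully processed and durable.
        self.store[self._key(camera_id)] = str(frame_index)

    def resume_from(self, camera_id: str) -> int:
        # On startup: resume one past the last committed frame, or from 0.
        raw = self.store.get(self._key(camera_id))
        return int(raw) + 1 if raw is not None else 0
```

Committing only after durable processing means a crash can cause at-most-one batch of rework, never silent frame loss.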
5. Backup Strategy
5.1 Backup Architecture
┌─────────────────────────────────────────────────────────────────┐
│ BACKUP PIPELINE │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ PostgreSQL │───▶│ pgBackRest │───▶│ S3 (Primary) │ │
│ │ (Primary) │ │ (Full/Incr) │ │ us-east-1 │ │
│ └──────────────┘ └──────────────┘ └────────┬─────────┘ │
│ │ │
│ ┌──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ S3 (Secondary) │ │
│ │ us-west-2 │ Cross-region replication│
│ └──────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Object │───▶│ S3 Cross │───▶│ Glacier Deep │ │
│ │ Storage │ │ Region │ │ Archive │ │
│ │ Bucket │ │ Replication│ │ (7-year) │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Infrastructure│───▶│ Git │───▶│ Encrypted Git │ │
│ │ Config │ │ Repository │ │ Backups │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
5.2 PostgreSQL Backup (pgBackRest)
Tool: pgBackRest 2.48+ with S3 integration
Backup Schedule:
| Backup Type | Frequency | Start Time (UTC) | Retention |
|---|---|---|---|
| Full backup | Weekly | Sunday 02:00 | 12 weeks |
| Differential | Daily (Mon-Sat) | 02:00 | 30 days |
| WAL archiving | Continuous | Real-time | 30 days |
| Manual backup | On-demand | Any | 90 days |
pgBackRest configuration:
# /etc/pgbackrest/pgbackrest.conf
[surveillance]
pg1-path=/var/lib/postgresql/15/main
pg1-port=5432
[global]
repo1-type=s3
repo1-s3-region=us-east-1
repo1-s3-bucket=surveillance-db-backups
repo1-s3-key=<ACCESS_KEY>
repo1-s3-key-secret=<SECRET_KEY>
repo1-s3-endpoint=s3.amazonaws.com
repo1-path=/pgbackrest
repo1-retention-full=12
repo1-retention-diff=30
repo1-retention-archive=30
# Encryption
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<STRONG_PASSPHRASE>
# Performance
process-max=4
compress-type=zst
compress-level=6
# Logging
log-level-file=detail
log-path=/var/log/pgbackrest
# Notifications: backup success/failure is reported by the daily
# verification script (see 5.7); pgBackRest has no native notification hook
Backup cron schedule:
# /etc/cron.d/pgbackrest
# Full backup every Sunday at 2 AM UTC
0 2 * * 0 postgres /usr/bin/pgbackrest --stanza=surveillance backup --type=full
# Differential backup daily at 2 AM UTC (Mon-Sat)
0 2 * * 1-6 postgres /usr/bin/pgbackrest --stanza=surveillance backup --type=diff
# Verify latest backup at 6 AM UTC daily
0 6 * * * postgres /usr/bin/pgbackrest --stanza=surveillance verify
WAL archiving configuration (postgresql.conf):
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=surveillance archive-push %p'
max_wal_senders = 3
wal_keep_size = 1GB
5.3 Backup Retention Schedule
Timeline:
Day 1-30: Daily backups available (full + diffs)
Week 1-12: Weekly full backups
Month 1-12: Monthly full backups (last Sunday of each month)
Year 1-7: Annual snapshot in Glacier Deep Archive
| Tier | Frequency | Copies Kept | Storage Class | Location |
|---|---|---|---|---|
| Daily (hot) | Every 24h | 30 | S3 Standard | Primary region |
| Weekly (warm) | Every Sunday | 12 | S3 Standard-IA | Primary region |
| Monthly (cold) | Last Sunday | 12 | S3 Glacier Flexible | Primary region |
| Annual (archive) | Year-end | 7 | S3 Glacier Deep Archive | Cross-region |
5.4 Object Storage Backup
Cross-region replication:
// S3 bucket replication configuration
{
"Role": "arn:aws:iam::ACCOUNT:role/S3ReplicationRole",
"Rules": [
{
"ID": "surveillance-media-replication",
"Status": "Enabled",
"Priority": 1,
"DeleteMarkerReplication": { "Status": "Disabled" },
"Filter": {
"And": {
"Prefix": "media/",
"Tag": {
"Key": "replicate",
"Value": "true"
}
}
},
"Destination": {
"Bucket": "arn:aws:s3:::surveillance-media-backup-west",
"StorageClass": "STANDARD_IA",
"ReplicationTime": {
"Status": "Enabled",
"Time": { "Minutes": 15 }
},
"Metrics": {
"Status": "Enabled",
"EventThreshold": { "Minutes": 15 }
},
"EncryptionConfiguration": {
"ReplicaKmsKeyID": "arn:aws:kms:us-west-2:ACCOUNT:key/KEY-ID"
}
},
"SourceSelectionCriteria": {
"SseKmsEncryptedObjects": { "Status": "Enabled" }
}
}
]
}
Lifecycle policy for media storage:
{
"Rules": [
{
"ID": "media-lifecycle",
"Status": "Enabled",
"Filter": { "Prefix": "media/recordings/" },
"Transitions": [
{
"Days": 7,
"StorageClass": "INTELLIGENT_TIERING"
},
{
"Days": 90,
"StorageClass": "GLACIER_IR"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": { "Days": 2555 }
},
{
"ID": "event-data-lifecycle",
"Status": "Enabled",
"Filter": { "Prefix": "events/" },
"Transitions": [
{ "Days": 90, "StorageClass": "STANDARD_IA" },
{ "Days": 365, "StorageClass": "GLACIER" }
],
"Expiration": { "Days": 730 }
}
]
}
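As a sanity check during restore drills, the media-lifecycle rule above can be mirrored by a pure function that maps an object's age to the storage class the policy should have applied. Thresholds are copied from the JSON rule; the function name is illustrative:

```python
from typing import Optional

# Transition thresholds mirror the media-lifecycle rule above:
# 7d -> INTELLIGENT_TIERING, 90d -> GLACIER_IR, 365d -> DEEP_ARCHIVE,
# expiration at 2555 days (7 years).
MEDIA_TRANSITIONS = [
    (365, "DEEP_ARCHIVE"),
    (90, "GLACIER_IR"),
    (7, "INTELLIGENT_TIERING"),
]
MEDIA_EXPIRATION_DAYS = 2555


def expected_storage_class(age_days: int) -> Optional[str]:
    """Return the storage class an object of this age should be in,
    or None if the lifecycle policy should already have expired it."""
    if age_days >= MEDIA_EXPIRATION_DAYS:
        return None
    for threshold, storage_class in MEDIA_TRANSITIONS:
        if age_days >= threshold:
            return storage_class
    return "STANDARD"
```

Comparing this against `aws s3api head-object` output for sampled keys catches lifecycle rules that were edited or disabled by mistake.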
5.5 Configuration Backup
All infrastructure configuration is stored as code in Git:
surveillance-ops/
├── terraform/
│ ├── main.tf # Main infrastructure
│ ├── variables.tf # Environment variables
│ ├── outputs.tf # Output definitions
│ ├── modules/
│ │ ├── vpc/ # Network configuration
│ │ ├── eks/ # Kubernetes cluster
│ │ ├── rds/ # PostgreSQL instances
│ │ └── s3/ # Object storage
│ └── environments/
│ ├── production/ # Production config
│ └── dr/ # DR site config
├── kubernetes/
│ ├── base/ # Kustomize base resources
│ │ ├── kustomization.yaml
│ │ ├── namespace.yaml
│ │ ├── postgres/
│ │ ├── redis/
│ │ ├── api/
│ │ ├── inference/
│ │ └── capture/
│ └── overlays/
│ ├── production/
│ ├── staging/
│ └── dr/
├── docker-compose/
│ ├── docker-compose.yml # Edge deployment
│ └── .env.example
├── ansible/
│ ├── playbook.yml # Host provisioning
│ └── inventory/
├── monitoring/
│ ├── prometheus/
│ ├── grafana-dashboards/
│ └── alertmanager/
└── docs/
├── runbooks/
├── postmortems/
└── architecture/
Git backup to secondary provider:
#!/bin/bash
# /usr/local/bin/backup-git-repos.sh
# Mirrors all critical repos to secondary Git provider
REPOS=(
"git@github.com:company/surveillance-ops.git"
"git@github.com:company/surveillance-app.git"
"git@github.com:company/surveillance-models.git"
)
BACKUP_REMOTE="git@gitlab-backup.company.com:surveillance"
DATE=$(date +%Y%m%d)
for repo in "${REPOS[@]}"; do
name=$(basename "$repo" .git)
echo "Backing up $name..."
git clone --mirror "$repo" "/tmp/$name-mirror"
cd "/tmp/$name-mirror"
# Push to backup remote
git remote add backup "$BACKUP_REMOTE/$name.git" 2>/dev/null || true
git push backup --mirror
# Create dated archive
tar czf "/backup/git/$name-$DATE.tar.gz" -C "/tmp" "$name-mirror"
rm -rf "/tmp/$name-mirror"
done
# Upload to S3
aws s3 sync /backup/git/ "s3://surveillance-config-backups/git/" --storage-class STANDARD_IA
5.6 Encryption
| Data at Rest | Encryption Method | Key Management |
|---|---|---|
| PostgreSQL backups | AES-256-CBC (pgBackRest native) | AWS KMS CMK |
| S3 object storage | SSE-KMS | AWS KMS CMK with automatic rotation |
| Configuration backups | AES-256-GCM (age tool) | YubiKey HSM stored keys |
| Log archives | SSE-S3 (AES-256) | AWS managed |
KMS key policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Enable IAM User Permissions",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::ACCOUNT:root"
},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "Allow pgBackRest",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::ACCOUNT:role/BackupServiceRole"
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:GenerateDataKey*"
],
"Resource": "*"
}
]
}
5.7 Backup Verification
Automated integrity checks (daily at 06:00 UTC):
#!/bin/bash
# /usr/local/bin/verify-backup.sh
set -euo pipefail
STANZA="surveillance"
LOG_FILE="/var/log/backup/verify-$(date +%Y%m%d).log"
ALERT_WEBHOOK="https://hooks.slack.com/services/..."
log() {
echo "[$(date -Iseconds)] $1" | tee -a "$LOG_FILE"
}
# 1. Verify latest backup exists
LATEST=$(pgbackrest --stanza=$STANZA info --output=json | jq -r '.[0].backup[-1].label')
if [ -z "$LATEST" ]; then
log "ERROR: No backup found!"
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"CRITICAL: No database backup found!"}' \
"$ALERT_WEBHOOK"
exit 1
fi
log "Latest backup: $LATEST"
# 2. Verify backup integrity
if ! pgbackrest --stanza=$STANZA verify --set=$LATEST >> "$LOG_FILE" 2>&1; then
log "ERROR: Backup integrity check failed for $LATEST"
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"CRITICAL: Backup integrity check failed for $LATEST\"}" \
"$ALERT_WEBHOOK"
exit 1
fi
# 3. Check WAL archive continuity
MISSING=$(pgbackrest --stanza=$STANZA verify 2>&1 | grep -c "missing" || true)
if [ "$MISSING" -gt 0 ]; then
log "WARNING: $MISSING WAL files missing"
fi
# 4. Verify S3 accessibility
if ! aws s3 ls "s3://surveillance-db-backups/pgbackrest/" > /dev/null 2>&1; then
log "ERROR: Cannot access S3 backup bucket"
exit 1
fi
# 5. Check backup age (timestamp.stop is epoch seconds in the JSON output)
BACKUP_TS=$(pgbackrest --stanza=$STANZA info --output=json | \
    jq -r '.[0].backup[-1].timestamp.stop')
BACKUP_AGE_SEC=$(( $(date +%s) - BACKUP_TS ))
if [ "$BACKUP_AGE_SEC" -gt 90000 ]; then # > 25 hours
log "WARNING: Latest backup is older than 25 hours"
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"WARNING: Latest backup is $((BACKUP_AGE_SEC / 3600)) hours old\"}" \
"$ALERT_WEBHOOK"
fi
log "Backup verification completed successfully"
5.8 Restore Procedures
5.8.1 Point-in-Time Recovery (PITR)
#!/bin/bash
# restore-pitr.sh — Restore to specific point in time
STANZA="surveillance"
TARGET_TIME="$1" # e.g., "2025-01-15 08:30:00"
# Stop application
kubectl scale deployment surveillance-api --replicas=0
# Restore from backup
pgbackrest --stanza=$STANZA restore \
--type=time \
--target="$TARGET_TIME" \
--target-action=promote \
--delta
# Verify database
psql -U surveillance -d surveillance -c "SELECT pg_last_xact_replay_timestamp();"
# Restart application
kubectl scale deployment surveillance-api --replicas=3
# Verify application health
curl -f http://surveillance-api:8080/health/ready
5.8.2 Full Disaster Recovery
#!/bin/bash
# restore-full.sh — Complete database restoration to new instance
STANZA="surveillance"
NEW_DATA_DIR="/var/lib/postgresql/15/main"
# 1. Install PostgreSQL (same major version as the backup)
apt-get install -y postgresql-15
# 2. Stop PostgreSQL
systemctl stop postgresql
# 3. Clear data directory (glob must be outside the quotes to expand;
#    ${VAR:?} aborts if the variable is unset or empty)
rm -rf "${NEW_DATA_DIR:?}"/*
# 4. Restore the most recent full backup (pgBackRest restores the
#    latest backup when --set is omitted)
pgbackrest --stanza=$STANZA restore \
    --type=immediate
# 5. Start PostgreSQL
systemctl start postgresql
# 6. Verify
pgbackrest --stanza=$STANZA check
# 7. Run consistency check
psql -U surveillance -d surveillance -c "SELECT count(*) FROM events;"
psql -U surveillance -d surveillance -c "SELECT pg_database_size('surveillance');"
5.9 Monthly Restore Drill
Schedule: First Saturday of each month at 02:00 UTC
Procedure:
- Provision isolated restore environment (separate namespace/VM)
- Restore latest full backup
- Apply differential backups
- Verify data integrity (row counts, checksums)
- Run application smoke tests
- Verify media files accessible
- Document results in restore log
- Tear down restore environment
Restore drill checklist:
## Restore Drill — 2025-01-04
- [x] Isolated environment provisioned
- [x] Full backup restored (duration: 23 min)
- [x] Differential backup applied (duration: 4 min)
- [x] WAL replay completed (duration: 12 min)
- [x] Database row counts verified
- events: 12,456,789 (expected: 12,456,789) ✓
- cameras: 8 (expected: 8) ✓
- alerts: 1,234 (expected: 1,234) ✓
- [x] Application smoke tests passed
- [x] Media file accessibility verified (100/100 random samples)
- [x] Total RTO: 41 minutes (target: < 60 min) ✓
- [x] Total RPO: 8 minutes (target: < 15 min) ✓
- [x] Environment cleaned up
**Notes:** WAL replay was slower than usual due to high write volume on Jan 3.
6. Data Retention
6.1 Retention Policy Matrix
| Data Category | Retention Period | Action After Retention | Legal Basis |
|---|---|---|---|
| Raw video recordings | 90 days (configurable) | Delete or archive to cold storage | Operational necessity |
| Event clips (alerts) | 1 year | Archive to cold storage for 2 additional years | Incident investigation |
| Detection metadata | 1 year | Anonymize & aggregate | Analytics |
| Audit logs | 1 year | Archive for 6 additional years | Compliance |
| System health logs | 90 days | Delete | Operational monitoring |
| Access logs | 90 days | Delete | Security monitoring |
| Face embeddings (enrolled) | Indefinite until deleted | User-initiated deletion | Authorized personnel database |
| Face embeddings (detected) | Never stored | N/A — computed and discarded immediately | Privacy by design |
| Alert history | 2 years | Archive | Incident reference |
| Training data | Indefinite | Explicit deletion by admin | AI model improvement |
| Configuration history | 2 years | Archive | Change tracking |
| Backup archives | 7 years (Glacier) | Delete per backup schedule | Disaster recovery |
6.2 Automated Cleanup Architecture
┌──────────────────────────────────────────────────────────────┐
│ Data Lifecycle Manager │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Retention │ │ Cleanup │ │ Archive │ │
│ │ Policy │──│ Executor │──│ Manager │ │
│ │ Engine │ │ (CronJob) │ │ (S3/Glacier) │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ PostgreSQL │ │ S3 Object │ │ Elasticsearch │ │
│ │ (metadata) │ │ Storage │ │ (logs) │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
└──────────────────────────────────────────────────────────────┘
6.3 Cleanup Job Implementation
# retention_manager.py
from datetime import datetime, timedelta
from typing import List, Optional
import asyncio
import logging
logger = logging.getLogger(__name__)
class RetentionPolicy:
def __init__(self, name: str, retention_days: int, archive_first: bool = False,
archive_days: int = 0, anonymize: bool = False):
self.name = name
self.retention_days = retention_days
self.archive_first = archive_first
self.archive_days = archive_days
self.anonymize = anonymize
class DataRetentionManager:
def __init__(self):
self.policies = {}
def register_policy(self, policy: RetentionPolicy):
self.policies[policy.name] = policy
async def execute_cleanup(self, policy_name: str, dry_run: bool = False):
policy = self.policies.get(policy_name)
if not policy:
raise ValueError(f"Unknown policy: {policy_name}")
cutoff_date = datetime.utcnow() - timedelta(days=policy.retention_days)
logger.info("Executing cleanup for '%s' (cutoff: %s)",
policy_name, cutoff_date.isoformat())
if dry_run:
count = await self._count_eligible(policy_name, cutoff_date)
logger.info("[DRY RUN] Would delete %d records", count)
return count
        archived = anonymized = deleted = 0
        # Archive before delete if configured
        if policy.archive_first:
            archive_cutoff = datetime.utcnow() - timedelta(
                days=policy.retention_days + policy.archive_days
            )
            archived = await self._archive_records(policy_name, cutoff_date, archive_cutoff)
            logger.info("Archived %d records", archived)
        # Anonymize if configured; otherwise delete expired records outright
        if policy.anonymize:
            anonymized = await self._anonymize_records(policy_name, cutoff_date)
            logger.info("Anonymized %d records", anonymized)
        else:
            deleted = await self._delete_records(policy_name, cutoff_date)
            logger.info("Deleted %d records", deleted)
        return {"archived": archived, "anonymized": anonymized, "deleted": deleted}
# Register policies
retention = DataRetentionManager()
retention.register_policy(RetentionPolicy("raw_video", retention_days=90, archive_first=True, archive_days=180))
retention.register_policy(RetentionPolicy("event_clips", retention_days=365, archive_first=True, archive_days=730))
retention.register_policy(RetentionPolicy("detection_metadata", retention_days=365, anonymize=True))
retention.register_policy(RetentionPolicy("audit_logs", retention_days=365, archive_first=True, archive_days=2190))
retention.register_policy(RetentionPolicy("system_logs", retention_days=90))
retention.register_policy(RetentionPolicy("access_logs", retention_days=90))
Kubernetes CronJob:
# cleanup-job.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: data-retention-cleanup
namespace: surveillance
spec:
schedule: "0 3 * * *" # Daily at 3 AM UTC
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
containers:
- name: cleanup
image: surveillance/retention-manager:2.3.1
command:
- python
- -m
- retention_manager
- --execute-all
- --notify
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
- name: S3_BUCKET
value: surveillance-media
- name: DRY_RUN
value: "false"
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
restartPolicy: OnFailure
6.4 Archive to Cold Storage
Before deletion, data is moved to cost-effective cold storage:
| Stage | Storage Class | Cost Factor | Access Time |
|---|---|---|---|
| Active | S3 Standard | 1x | Immediate |
| 7 days | S3 Intelligent-Tiering | 0.8x | Immediate |
| 90 days | S3 Glacier Instant Retrieval | 0.2x | Milliseconds |
| 1 year | S3 Glacier Flexible Retrieval | 0.08x | Minutes-hours |
| 2 years | S3 Glacier Deep Archive | 0.04x | 12-48 hours |
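Using the cost factors from the table above, the blended cost of a retention profile can be estimated in "Standard GB" units. A rough sketch, assuming the tier boundaries shown; the function name is illustrative:

```python
# Cost factors from the tier table above, relative to S3 Standard,
# ordered from oldest tier to newest so the first match wins.
TIER_COST_FACTORS = [
    (730, 0.04),  # 2+ years: Glacier Deep Archive
    (365, 0.08),  # 1-2 years: Glacier Flexible Retrieval
    (90, 0.2),    # 90 days - 1 year: Glacier Instant Retrieval
    (7, 0.8),     # 7-90 days: Intelligent-Tiering
    (0, 1.0),     # 0-7 days: Standard
]


def storage_cost_units(gb_by_age_days: dict) -> float:
    """Relative monthly cost (in 'Standard GB' units) for data grouped
    by age in days. Useful for projecting savings from the tiering schedule."""
    total = 0.0
    for age_days, gb in gb_by_age_days.items():
        for threshold, factor in TIER_COST_FACTORS:
            if age_days >= threshold:
                total += gb * factor
                break
    return total
```

For example, 100 GB that is 30 days old costs the equivalent of 80 GB in Standard, and the same data two years later costs the equivalent of 4 GB.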
Archive process:
#!/bin/bash
# archive-old-media.sh
BUCKET="surveillance-media"
RETENTION_DAYS=90
CUTOFF=$(date -d "$RETENTION_DAYS days ago" +%Y-%m-%d)
# 1. Identify files to archive (JSON output piped through jq yields one
#    key per line; the CLI paginates automatically for buckets with
#    more than 1000 objects)
aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "recordings/" \
    --query "Contents[?LastModified<='${CUTOFF}'].Key" \
    --output json | jq -r '.[]?' > /tmp/archive-list.txt
# 2. Move to Glacier IR (in-place copy with a new storage class)
while IFS= read -r key; do
    aws s3api copy-object \
        --copy-source "${BUCKET}/${key}" \
        --bucket "$BUCKET" \
        --key "$key" \
        --storage-class GLACIER_IR \
        --metadata-directive COPY
done < /tmp/archive-list.txt
# 3. Log archival
aws s3 cp /tmp/archive-list.txt \
"s3://${BUCKET}/archive-logs/archive-$(date +%Y%m%d).txt"
# 4. Notify
echo "Archived $(wc -l < /tmp/archive-list.txt) files to Glacier IR"
6.5 Right to Deletion
For privacy compliance (GDPR/CCPA), implement data subject deletion:
async def delete_subject_data(subject_id: str):
"""
Complete deletion of a data subject:
1. Remove from enrolled persons database
2. Delete associated face embeddings
3. Remove references from detection logs
4. Delete related event clips
5. Log deletion for audit
"""
async with db.transaction():
# 1. Delete enrolled person
await db.execute(
"DELETE FROM enrolled_persons WHERE id = $1",
subject_id
)
# 2. Delete embeddings (separate table for encryption)
await db.execute(
"DELETE FROM face_embeddings WHERE person_id = $1",
subject_id
)
# 3. Anonymize detection references
await db.execute(
"""UPDATE detections
SET person_id = NULL,
person_name = '[REDACTED]',
face_embedding = NULL
WHERE person_id = $1""",
subject_id
)
# 4. Queue related event clips for deletion
clips = await db.fetch(
"SELECT storage_path FROM event_clips WHERE person_id = $1",
subject_id
)
for clip in clips:
await s3.delete_object(clip['storage_path'])
# 5. Audit log
await db.execute(
"""INSERT INTO deletion_audit_log
(subject_id, deleted_at, deleted_by, reason)
VALUES ($1, NOW(), $2, 'data_subject_request')""",
subject_id, current_user_id()
)
7. Storage Management
7.1 Storage Architecture
┌──────────────────────────────────────────────────────────────┐
│ Storage Architecture │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Hot Tier │ │ Warm Tier │ │ Cold Tier │ │
│ │ (NVMe/SSD) │ │ (HDD/S3) │ │ (Glacier) │ │
│ │ │ │ │ │ │ │
│ │ Current │ │ 30-90 day │ │ 90+ day media │ │
│ │ recordings │ │ recordings │ │ long-term │ │
│ │ Active DB │ │ Event clips │ │ archive │ │
│ │ Cache │ │ 90-day logs │ │ compliance │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ │
│ Edge Node (local) ←── VPN ──→ Cloud (S3/EBS/EFS) │
└──────────────────────────────────────────────────────────────┘
7.2 Storage Capacity Planning (8 Camera Baseline)
| Data Type | Daily Volume | Compression | Storage/day | Monthly |
|---|---|---|---|---|
| Raw video (8x 1080p@30fps, H.265) | ~800 GB | 50% | ~400 GB | ~12 TB |
| Event clips (alerts) | ~5 GB | None | ~5 GB | ~150 GB |
| Detection metadata | ~500 MB | None | ~500 MB | ~15 GB |
| Audit logs | ~100 MB | 70% | ~30 MB | ~1 GB |
| System metrics | ~200 MB | 80% | ~40 MB | ~1.2 GB |
| Database | ~50 MB | N/A | ~50 MB | ~1.5 GB |
| Model checkpoints | N/A | N/A | N/A | ~2 GB |
| Total | | | ~406 GB/day | ~12.2 TB/month |
Annual raw capacity requirement: ~146 TB
With 90-day retention + archive: ~40 TB hot/warm + ~110 TB cold
Recommended provisioned capacity: 200 TB (with 50% growth headroom)
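The per-day figures above follow directly from camera bitrate. A small helper makes the arithmetic explicit; the 9.3 Mbps effective bitrate used in the test is an assumption chosen to reproduce the ~800 GB/day raw figure and should be replaced with measured bitrates:

```python
def daily_storage_gb(num_cameras: int, bitrate_mbps: float,
                     compression_ratio: float = 1.0) -> float:
    """Storage per day in GB (1 GB = 1e9 bytes) for continuous recording.

    compression_ratio is the fraction retained after post-compression;
    0.5 matches the 50% figure in the table above.
    """
    bytes_per_day = bitrate_mbps * 1e6 / 8 * 86400  # Mbps -> bytes/day
    return num_cameras * bytes_per_day * compression_ratio / 1e9
```

At 9.3 Mbps per camera, 8 cameras produce roughly 800 GB/day raw, or about 400 GB/day after 50% compression, matching the table.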
7.3 Storage Monitoring & Alerting
Prometheus rules:
groups:
- name: storage-alerts
rules:
- alert: StorageWarning70
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.30
for: 5m
labels:
severity: p4
annotations:
summary: "Storage at 70% on {{ $labels.instance }}:{{ $labels.mountpoint }}"
- alert: StorageHigh85
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
for: 2m
labels:
severity: p2
annotations:
summary: "Storage at 85% on {{ $labels.instance }}:{{ $labels.mountpoint }}"
- alert: StorageCritical95
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
for: 1m
labels:
severity: p1
annotations:
summary: "Storage CRITICAL at 95% on {{ $labels.instance }}:{{ $labels.mountpoint }}"
- alert: S3BucketSizeGrowth
expr: predict_linear(aws_s3_bucket_size_bytes[7d], 30*24*3600) >
aws_s3_bucket_quota_bytes * 0.9
for: 1h
labels:
severity: p3
annotations:
summary: "S3 bucket {{ $labels.bucket }} projected to exceed quota in 30 days"
- alert: StorageCleanupFailed
expr: increase(surveillance_cleanup_failures_total[1h]) > 0
for: 5m
labels:
severity: p2
annotations:
summary: "Storage cleanup job failed"
7.4 Automated Cleanup Policies
# cleanup-policies.yaml
cleanup_policies:
raw_video:
description: "Raw video recordings"
retention_days: 90
archive_before_delete: true
archive_storage_class: GLACIER_IR
priority: oldest_first
schedule: "0 2 * * *"
event_clips:
description: "Alert event video clips"
retention_days: 365
archive_before_delete: true
archive_storage_class: GLACIER
priority: oldest_first
schedule: "0 3 * * *"
temp_processing:
description: "Temporary processing files"
retention_days: 1
archive_before_delete: false
priority: all_expired
schedule: "*/30 * * * *"
failed_uploads:
description: "Failed upload artifacts"
retention_days: 7
archive_before_delete: false
priority: all_expired
schedule: "0 4 * * *"
system_logs:
description: "Application and system logs"
retention_days: 90
archive_before_delete: true
archive_storage_class: GLACIER_IR
priority: oldest_first
schedule: "0 5 * * *"
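A minimal sketch of how a cleanup worker might interpret these policies. The retention values mirror cleanup-policies.yaml above; the helper names (`deletion_cutoff`, `expired`) are illustrative, not the production job:

```python
from datetime import datetime, timedelta, timezone

# Policy definitions mirroring cleanup-policies.yaml (subset of fields)
CLEANUP_POLICIES = {
    "raw_video":       {"retention_days": 90,  "archive_before_delete": True},
    "event_clips":     {"retention_days": 365, "archive_before_delete": True},
    "temp_processing": {"retention_days": 1,   "archive_before_delete": False},
    "failed_uploads":  {"retention_days": 7,   "archive_before_delete": False},
    "system_logs":     {"retention_days": 90,  "archive_before_delete": True},
}

def deletion_cutoff(policy_name: str, now: datetime) -> datetime:
    """Return the timestamp before which files governed by this policy expire."""
    return now - timedelta(days=CLEANUP_POLICIES[policy_name]["retention_days"])

def expired(policy_name: str, file_mtime: datetime, now: datetime) -> bool:
    """True if a file with the given mtime is past its retention window."""
    return file_mtime < deletion_cutoff(policy_name, now)
```

A worker scheduled by the cron expressions above would walk each data directory, call `expired()` per file, archive first where `archive_before_delete` is set, then delete.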
7.5 Compression Strategy
| Data Age | Compression | Method | Savings |
|---|---|---|---|
| 0-7 days | None | Raw H.265 | Baseline |
| 7-30 days | Re-encode | H.265 → H.265 (higher CRF, lower bitrate) | 30-40% |
| 30-90 days | Transcode | H.265 → AV1 | 40-50% |
| 90+ days | Archive | AV1 + tarball | 50-60% |
Compression job:
apiVersion: batch/v1
kind: CronJob
metadata:
name: video-compression
namespace: surveillance
spec:
schedule: "0 1 * * *"
jobTemplate:
spec:
parallelism: 2
template:
spec:
containers:
- name: compressor
image: surveillance/media-processor:2.3.1
command:
- python
- -m
- compression
- --age-days=7
- --target-crf=30
- --codec=libx265
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
restartPolicy: OnFailure
7.6 Auto-Scaling Cloud Storage
S3 Auto-scaling: S3 is inherently elastic — no manual scaling needed. Monitor bucket size and cost.
EBS volume scaling:
# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: surveillance-expandable
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: 3000
throughput: 125
encrypted: "true"
kmsKeyId: "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID"
allowVolumeExpansion: true # Enable expansion
volumeBindingMode: WaitForFirstConsumer
Automated volume expansion:
#!/bin/bash
# auto-expand-storage.sh
THRESHOLD=80
PVC_NAMES=("postgres-data" "media-storage" "log-storage")
NAMESPACE="surveillance"
for pvc in "${PVC_NAMES[@]}"; do
# Get current usage
USAGE=$(kubectl exec -n "$NAMESPACE" deployment/surveillance-api \
-- df -h "/data/$pvc" | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
CURRENT_SIZE=$(kubectl get pvc "$pvc" -n "$NAMESPACE" \
-o jsonpath='{.status.capacity.storage}')
# Increase by 50%
CURRENT_GB=${CURRENT_SIZE%Gi}
NEW_GB=$((CURRENT_GB + CURRENT_GB / 2))
echo "Expanding $pvc from ${CURRENT_GB}Gi to ${NEW_GB}Gi"
kubectl patch pvc "$pvc" -n "$NAMESPACE" \
--type merge \
-p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${NEW_GB}Gi\"}}}}"
# Notify
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-type: application/json' \
-d "{\"text\":\"Auto-expanded PVC $pvc to ${NEW_GB}Gi (was ${USAGE}% full)\"}"
fi
done
7.7 Storage Cost Optimization
| Optimization | Monthly Savings | Implementation |
|---|---|---|
| S3 Intelligent-Tiering | 20-30% | Automatic |
| H.265 re-encode (older content) | 30-40% | Nightly job |
| Glacier IR for 30-90 day content | 60-70% | Lifecycle rule |
| Glacier Deep Archive for 1yr+ | 95% | Lifecycle rule |
| Reserved capacity for predictable workloads | 30-40% | Commitment |
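The lifecycle tiers in this table can be expressed as a simple age-to-storage-class mapping. A sketch, using standard S3 storage class names; the table does not pin down the 90-365 day band, so this sketch keeps it in Glacier IR as an assumption:

```python
def storage_class_for_age(age_days: int) -> str:
    """Map object age to the S3 storage class suggested by the cost table above."""
    if age_days < 30:
        return "INTELLIGENT_TIERING"   # fresh content: automatic tiering
    if age_days < 365:
        return "GLACIER_IR"            # 30 days to 1 year (assumed band)
    return "DEEP_ARCHIVE"              # 1 year+: Glacier Deep Archive
```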
8. Incident Response
8.1 Severity Definitions
| Severity | Name | Definition | Examples | Response Time |
|---|---|---|---|---|
| P1 | Critical | Complete service outage; no surveillance capability | All cameras offline; AI pipeline completely down; storage full; database primary down | 15 minutes |
| P2 | High | Major functionality degraded; partial surveillance loss | Single camera offline > 30 min; high error rates; model accuracy degraded; backup failures | 1 hour |
| P3 | Medium | Minor functionality issue; workarounds available | Low FPS on camera; certificate expiry warning; cleanup job failure | 4 hours |
| P4 | Low | Cosmetic or non-urgent issue | High CPU warning; UI glitch; documentation update needed; optimization opportunity | 24 hours |
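The response-time targets above can be encoded directly, so tooling can compute acknowledgment deadlines when an alert fires. A sketch (the mapping values come from the table; the function name is illustrative):

```python
from datetime import datetime, timedelta

# Response-time targets from the severity table above
RESPONSE_SLA = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}

def response_deadline(severity: str, fired_at: datetime) -> datetime:
    """Deadline by which the on-call engineer must respond to an alert."""
    return fired_at + RESPONSE_SLA[severity]
```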
8.2 Escalation Matrix
P1 (Critical) — 15 min response
├── 0 min: Alert fires → PagerDuty pages on-call engineer
├── 5 min: On-call must acknowledge
├── 15 min: No acknowledge → Escalate to Team Lead (SMS + Call)
├── 30 min: No response → Escalate to Engineering Manager
├── 45 min: No response → Escalate to VP Engineering
└── 60 min: No response → Escalate to CTO
P2 (High) — 1 hour response
├── 0 min: Alert fires → PagerDuty pages on-call engineer
├── 30 min: No acknowledge → Reminder notification
├── 60 min: No response → Escalate to Team Lead
└── 2 hours: No response → Escalate to Engineering Manager
P3 (Medium) — Slack + email only, 4 hour response
├── 0 min: Alert fires → Slack notification
└── 4 hours: No acknowledgment → Escalate to Team Lead
P4 (Low) — Daily digest email, 24 hour response
└── Daily digest at 09:00 UTC
Contact Information:
| Role | Primary Contact | Secondary Contact | Notification Method |
|---|---|---|---|
| On-Call Engineer | Rotating (PagerDuty) | — | PagerDuty Push + SMS |
| SRE Team Lead | lead-sre@company.com | +1-555-0100 | SMS + Voice Call |
| Engineering Manager | eng-mgr@company.com | +1-555-0101 | SMS + Voice Call |
| VP Engineering | vp-eng@company.com | +1-555-0102 | Voice Call + Email |
| CTO | cto@company.com | +1-555-0103 | Voice Call + Email |
8.3 Incident Response Process
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ DETECT │───▶│ RESPOND │───▶│ RESOLVE │───▶│ REVIEW │
│ (Alert) │ │ (Triage & │ │ (Fix & │ │ (Post- │
│ │ │ Mitigate) │ │ Verify) │ │ mortem) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│
┌─────┴─────┐
▼ ▼
┌────────┐ ┌──────────┐
│Mitigate│ │Communicate│
│Impact │ │Stakeholders│
└────────┘ └──────────┘
Phase 1: Detect
- Monitoring alert fires
- On-call engineer receives page
- Acknowledge alert within 5 minutes
- Create incident channel in Slack:
#inc-YYYY-MM-DD-brief-description
Phase 2: Respond
- Assess severity and impact
- Execute relevant runbook
- Apply immediate mitigation if possible
- Update incident timeline every 15 minutes
- Communicate to stakeholders
Phase 3: Resolve
- Implement fix
- Verify service recovery (all health checks pass)
- Monitor for 30 minutes post-recovery
- Close incident in PagerDuty
- Update incident log
Phase 4: Review
- Schedule post-mortem within 48 hours for P1/P2
- Complete post-mortem document
- Identify action items
- Track action items to completion
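The incident channel naming convention from Phase 1 (`#inc-YYYY-MM-DD-brief-description`) can be generated mechanically so channels stay consistent. A sketch; the slug rules (lowercase, hyphen-separated) are an assumption:

```python
from datetime import datetime
import re

def incident_channel_name(description: str, when: datetime) -> str:
    """Build a Slack channel name per the #inc-YYYY-MM-DD-brief-description convention."""
    # Lowercase the description and collapse any non-alphanumeric runs to hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{when:%Y-%m-%d}-{slug}"
```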
8.4 Runbooks
Runbook: Camera Offline
Detection: SingleCameraDown alert fires
Severity: P2
Initial Response Time: 1 hour
Diagnosis Steps:
# 1. Check camera stream status
curl http://video-capture:8080/api/v1/cameras/{camera_id}/status
# 2. Check camera connectivity
ping <camera_ip>
curl -v rtsp://<camera_ip>:554/stream
# 3. Check video-capture service logs
kubectl logs -l app=video-capture --tail=100 | grep {camera_id}
# 4. Check network path
traceroute <camera_ip>
# Verify firewall rules, VPN tunnel
# 5. Check camera resource usage
kubectl top pod -l app=video-capture
Resolution Steps:
| Issue | Resolution | Verification |
|---|---|---|
| Camera powered off | Contact site personnel to power cycle | Ping responds |
| Network connectivity | Check switch port, cable, VLAN | Ping + RTSP describe |
| VPN tunnel down | See "VPN Tunnel Down" runbook | Tunnel status |
| Camera firmware issue | Power cycle camera remotely | Stream reconnects |
| Stream URL changed | Update camera configuration | New stream active |
| Video-capture bug | Restart capture container | Stream reconnected |
| Resource exhaustion | Scale up capture resources | CPU/memory normal |
Workaround: If camera cannot be restored within 30 minutes:
- Mark camera as "maintenance mode" in dashboard
- Disable alerts for this camera
- Queue for on-site technician visit
Runbook: AI Pipeline Down
Detection: AIPipelineDown or HighErrorRate alert
Severity: P1
Initial Response Time: 15 minutes
Diagnosis Steps:
# 1. Check inference service health
curl http://ai-inference:8080/health/deep
# 2. Check if model is loaded
curl http://ai-inference:8080/api/v1/model/status
# 3. Check GPU status (if applicable)
nvidia-smi
# OR for CPU inference:
htop
# 4. Check inference logs
kubectl logs -l app=ai-inference --tail=200
# 5. Check resource usage
kubectl top pod -l app=ai-inference
kubectl describe pod -l app=ai-inference
# 6. Check model service
kubectl logs -l app=model-service --tail=100
# 7. Check if inference queue is backing up
redis-cli LLEN inference:queue
# 8. Test inference manually
curl -X POST http://ai-inference:8080/api/v1/inference/test \
-H "Content-Type: application/json" \
-d '{"test_image": "base64encoded"}'
Resolution Steps:
| Issue | Resolution | Verification |
|---|---|---|
| Model not loaded | Restart model-service pod | Model status shows loaded |
| GPU OOM | Restart inference pod; check memory limits | nvidia-smi shows free memory |
| Model corruption | Reload model from S3 backup | Test inference succeeds |
| Inference timeout | Scale inference replicas; check input | Latency returns to normal |
| Queue backup | Scale up consumers; check for dead consumers | Queue depth returns to 0 |
| Bad model update | Rollback to previous model version | Detection accuracy restored |
| Dependency failure | Check circuit breaker status; restart dependencies | All health checks pass |
Immediate Mitigation:
- If inference cannot be restored in 15 minutes:
- Switch to "detection-only" mode (skip recognition)
- Enable edge processing as backup
- Queue frames for delayed processing
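The mitigation steps above imply a simple mode-selection policy. A sketch of that decision; the mode names and the priority order (prefer edge backup over detection-only) are assumptions for illustration:

```python
def select_pipeline_mode(inference_healthy: bool, edge_available: bool) -> str:
    """Pick a processing mode per the mitigation steps above (mode names illustrative)."""
    if inference_healthy:
        return "full"             # normal detection + recognition
    if edge_available:
        return "edge-backup"      # edge nodes take over inference
    return "detection-only"       # skip recognition, queue frames for later
```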
Runbook: VPN Tunnel Down
Detection: Edge node unreachable; camera streams offline
Severity: P2 (P1 if all edge cameras affected)
Initial Response Time: 1 hour
Diagnosis Steps:
# 1. Check tunnel status from cloud side
ping <edge_gateway_ip>
# 2. Check VPN service status
kubectl logs -l app=vpn-gateway --tail=100
# 3. Check tunnel metrics
curl http://vpn-gateway:8080/metrics | grep vpn_tunnel
# 4. Check from edge side (if SSH available)
ssh edge-node "ping <cloud_gateway_ip>"
ssh edge-node "ipsec status" # or wg show for WireGuard
# 5. Check network path
mtr <edge_gateway_ip>
# 6. Check certificates (if certificate-based VPN)
openssl x509 -in /etc/vpn/cert.pem -text -noout | grep "Not After"
Resolution Steps:
| Issue | Resolution | Verification |
|---|---|---|
| Edge network down | Contact ISP/site | Ping responds |
| VPN service crash | Restart VPN gateway | Tunnel established |
| Certificate expired | Renew certificates | Valid cert, tunnel up |
| MTU mismatch | Adjust tunnel MTU | No packet fragmentation |
| Firewall change | Restore firewall rules | Tunnel traffic flowing |
| IPsec/IKE failure | Restart IKE daemon; check config | SA established |
| WireGuard key issue | Regenerate keys | Handshake succeeds |
Workaround: If tunnel cannot be restored:
- Activate local storage mode on edge (store locally, sync later)
- Switch to cellular backup if available
- Deploy technician on-site if needed
Runbook: Storage Full
Detection: StorageCritical95 alert fires
Severity: P1
Initial Response Time: 15 minutes
Immediate Actions (within 5 minutes):
# 1. Identify what's consuming space
df -h
ncdu /data/surveillance
# 2. Check if cleanup job is running
kubectl get jobs -n surveillance | grep cleanup
# 3. Temporarily expand storage (cloud)
# AWS EBS:
aws ec2 modify-volume --volume-id vol-XXXX --size $((CURRENT + 100))
# 4. Emergency cleanup — delete oldest temp files
find /data/surveillance/temp -type f -mtime +1 -delete
find /data/surveillance/cache -type f -atime +7 -delete
# 5. Force log rotation
logrotate -f /etc/logrotate.d/surveillance
# 6. Truncate oversized logs (>1GB)
find /var/log/surveillance -type f -size +1G -exec truncate -s 0 {} \;
Resolution Steps:
| Issue | Resolution | Verification |
|---|---|---|
| Normal growth | Expand storage; review retention | Usage < 80% |
| Runaway logs | Fix log source; rotate logs | Log growth rate normal |
| Cleanup job failed | Restart cleanup job; fix root cause | Cleanup completes |
| Retention too long | Reduce retention period | Space freed |
| Camera bitrate high | Adjust camera encoding settings | Bitrate normalized |
| Orphaned temp files | Purge temp directory | Space recovered |
Runbook: Database Connectivity Issues
Detection: DatabaseUnreachable alert
Severity: P1
Initial Response Time: 15 minutes
Diagnosis Steps:
# 1. Check PostgreSQL pod status
kubectl get pods -l app=postgres
kubectl describe pod -l app=postgres
# 2. Check PostgreSQL logs
kubectl logs -l app=postgres --tail=200
# 3. Test connection from application pod
kubectl exec deployment/surveillance-api -- \
pg_isready -h postgres -U surveillance
# 4. Check connection pool status
kubectl exec deployment/surveillance-api -- \
python -c "from db import pool; print(pool.size(), pool.available())"
# 5. Check resource usage
kubectl top pod -l app=postgres
# 6. Check disk I/O
iostat -x 1 5
# 7. Check for locks
kubectl exec deployment/postgres -- \
psql -U surveillance -c "SELECT * FROM pg_locks WHERE NOT granted;"
# 8. Check replication lag
kubectl exec deployment/postgres -- \
psql -U surveillance -c "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp()) AS lag_seconds;"
Resolution Steps:
| Issue | Resolution | Verification |
|---|---|---|
| PostgreSQL pod crash | Restart pod; check for OOM | Pod running, accepting connections |
| Connection pool exhausted | Increase pool size; check for leaks | Available connections > 0 |
| Disk I/O saturation | Scale storage IOPS; optimize queries | I/O wait < 20% |
| Lock contention | Kill blocking queries; optimize transactions | No waiting locks |
| Replication lag | Check replica resources; restart replication | Lag < 5 seconds |
| Query overload | Enable query caching; kill slow queries | Active queries normal |
| Disk full | See "Storage Full" runbook | Free space available |
| Hardware failure | Failover to replica; replace primary | Replica promoted |
Immediate Mitigation:
- If primary is down:
- Promote replica to primary: pg_ctl promote
- Update connection strings
- Restart application pods
Runbook: High Error Rates
Detection: HighErrorRate alert fires
Severity: P1
Initial Response Time: 15 minutes
Diagnosis Steps:
# 1. Check error distribution by service
kubectl logs -l app=surveillance --tail=1000 | \
jq -r '.service + ": " + .level + ": " + .message' | \
sort | uniq -c | sort -rn | head -20
# 2. Check error rate per service
curl -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(surveillance_errors_total[5m])'
# 3. Check for recent deployments
kubectl rollout history deployment/surveillance-api
kubectl rollout history deployment/ai-inference
# 4. Check dependency health
curl http://surveillance-api:8080/health/deep
# 5. Check for resource exhaustion
kubectl top pods
# 6. Review recent changes
# Check CI/CD pipeline, config changes
# 7. Check circuit breaker status
for service in database storage inference; do
curl "http://surveillance-api:8080/api/v1/circuit-breakers/$service"
done
Resolution Steps:
| Issue | Resolution | Verification |
|---|---|---|
| Bad deployment | Rollback to previous version | Error rate drops |
| Dependency down | Fix dependency; check circuit breakers | All deps healthy |
| Resource exhaustion | Scale up; optimize resource usage | Usage normal |
| Code bug | Deploy hotfix; or rollback | Errors eliminated |
| Configuration error | Revert config change; validate config | Config valid |
| External API failure | Enable fallback; contact provider | Fallback active |
| Database deadlock | Kill blocking queries; fix code | Deadlocks resolved |
8.5 Post-Incident Review Template
# Post-Incident Review
## Incident Summary
| Field | Value |
|-------|-------|
| Incident ID | INC-2025-001 |
| Date/Time (UTC) | 2025-01-15 03:45 - 2025-01-15 05:20 |
| Severity | P1 |
| Detection Method | Automated alert (StorageCritical95) |
| Affected Systems | All camera streams, event storage |
| Impact | 1h 35m of degraded recording quality |
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 03:42 | Storage usage crosses 95% threshold |
| 03:45 | P1 alert fires; on-call paged |
| 03:48 | On-call engineer acknowledges |
| 03:52 | Diagnosis begins; identified storage full |
| 04:05 | Emergency cleanup initiated; temp files removed |
| 04:15 | Storage expanded by 200GB |
| 04:30 | Cleanup job restarted; oldest files archived |
| 04:45 | All camera streams reconnecting |
| 05:00 | All health checks passing |
| 05:20 | Incident closed; monitoring continues |
## Root Cause Analysis
**5 Whys:**
1. Why did storage fill up? → Cleanup job had been failing for 3 days
2. Why was cleanup failing? → Credential rotation broke S3 access
3. Why didn't credential rotation update cleanup job? → Cleanup job uses hardcoded credentials
4. Why are credentials hardcoded? → Technical debt; not migrated to secret management
5. Why wasn't this caught? → No monitoring on cleanup job success/failure
**Root Cause:** Cleanup job used hardcoded S3 credentials that were not updated during routine credential rotation, causing 3 days of accumulated data without cleanup.
## Contributing Factors
- No alert on cleanup job failures
- Storage growth rate was not monitored
- No auto-expansion configured for media storage
## What Went Well
- Automated P1 alert fired immediately at 95%
- On-call responded within 3 minutes
- Emergency cleanup procedures were effective
- No data loss occurred
## What Went Wrong
- Cleanup job failure went undetected for 3 days
- Manual intervention required for storage expansion
- Edge cameras buffered locally but some frames were lost during reconnect
## Action Items
| ID | Action | Owner | Due Date | Priority |
|----|--------|-------|----------|----------|
| AI-1 | Migrate all jobs to use IAM roles / secret management | @sre-team | 2025-01-22 | High |
| AI-2 | Add alert for cleanup job failures | @sre-team | 2025-01-18 | High |
| AI-3 | Implement auto-expansion for media storage | @sre-team | 2025-01-29 | Medium |
| AI-4 | Add storage growth rate alerting | @sre-team | 2025-01-22 | Medium |
| AI-5 | Improve camera reconnection to reduce frame loss | @eng-team | 2025-02-05 | Low |
| AI-6 | Document hardcoded credential audit procedure | @security | 2025-01-22 | High |
## Lessons Learned
- Any automated job failure must have an alert
- Credential management must be centralized
- Storage monitoring needs predictive capability
## Signatures
- Incident Commander: _________________ Date: ___/___/______
- Engineering Lead: _________________ Date: ___/___/______
9. Upgrades & Maintenance
9.1 Zero-Downtime Deployment Strategy
Deployment Pattern: Rolling updates with readiness gate verification
Phase 1: Deploy new version alongside old version
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Pod v1 │ │ Pod v1 │ │ Pod v1 │ (serving traffic)
└──────────┘ └──────────┘ └──────────┘
Phase 2: Add new version pod, verify health
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Pod v1 │ │ Pod v1 │ │ Pod v1 │ │ Pod v2 │ (new pod not yet serving)
└──────────┘ └──────────┘ └──────────┘ └──────────┘
▲
health check passes
Phase 3: Route traffic to new pod, drain old pod
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Pod v1 │ │ Pod v1 │ │ Pod v2 │ │ Pod v2 │ (traffic shifting)
└──────────┘ └──────────┘ └──────────┘ └──────────┘
Phase 4: Complete rollout
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Pod v2 │ │ Pod v2 │ │ Pod v2 │ (all pods updated)
└──────────┘ └──────────┘ └──────────┘
Rollback: Instantly revert to previous ReplicaSet
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Pod v1 │ │ Pod v1 │ │ Pod v1 │ (rollback in ~30 seconds)
└──────────┘ └──────────┘ └──────────┘
Kubernetes Deployment Strategy:
apiVersion: apps/v1
kind: Deployment
metadata:
name: surveillance-api
namespace: surveillance
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow 1 extra pod during update
maxUnavailable: 0 # Never reduce capacity
selector:
matchLabels:
app: surveillance-api
template:
metadata:
labels:
app: surveillance-api
version: "2.3.2" # Updated with each release
spec:
terminationGracePeriodSeconds: 60
containers:
- name: api
image: surveillance/api:2.3.2@sha256:a1b2c3d4...
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 6
successThreshold: 2
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- surveillance-api
topologyKey: kubernetes.io/hostname
9.2 Deployment Pipeline
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Build │───▶│ Test │───▶│ Stage │───▶│ Canary │───▶│ Production │
│ (CI) │ │ (Unit/Int) │ │ (E2E) │ │ (5% traff) │ │ (100%) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Fail = │ │ Fail = │ │ Fail = │
│ Block │ │ Block │ │ Rollback │
└──────────┘ └──────────┘ └──────────┘
Automated promotion gates:
| Gate | Criteria | Auto-promote Timeout |
|---|---|---|
| Build | All tests pass; linting passes; security scan clean | Immediate |
| Staging | E2E tests pass; performance within 10% of baseline | 30 min validation |
| Canary | Error rate < 0.1%; p95 latency < baseline + 20% | 15 min bake time |
| Production | Canary metrics healthy for 30 min | Auto-proceed |
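The canary gate criteria in the table can be evaluated programmatically. A sketch of the decision (thresholds from the table: error rate < 0.1%, p95 < baseline + 20%; the function name is illustrative):

```python
def canary_gate(error_rate: float, p95_latency_ms: float, baseline_p95_ms: float) -> str:
    """Evaluate canary promotion criteria from the gates table above."""
    if error_rate < 0.001 and p95_latency_ms < baseline_p95_ms * 1.2:
        return "promote"
    return "rollback"
```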
9.3 Database Migrations
Tool: Alembic (SQLAlchemy migrations) with yoyo-migrations for idempotent SQL
Migration rules:
- All migrations must be backward-compatible (add-only in one release)
- Destructive changes require a 2-phase deployment
- Migrations are versioned and reversible
- Migrations run automatically as init container before app startup
- Migration status exposed via
/health/ready
# migrations/env.py — Alembic configuration
from alembic import context
from sqlalchemy import create_engine

from app.models import Base  # hypothetical models package; point at your project's metadata

config = context.config
target_metadata = Base.metadata  # consumed by compare_type below
def run_migrations():
"""Run migrations in online mode."""
connectable = create_engine(config.get_main_option("sqlalchemy.url"))
with connectable.connect() as connection:
context.configure(
connection=connection,
target_metadata=target_metadata,
transaction_per_migration=True,
compare_type=True,
)
with context.begin_transaction():
context.run_migrations()
# Migration example: add_column (backward-compatible)
# migrations/versions/20250115_add_camera_resolution.py
"""
Add resolution column to cameras table
Revision ID: 20250115_add_camera_resolution
Revises: 20250101_initial
Create Date: 2025-01-15 08:30:00
"""
from alembic import op
import sqlalchemy as sa
revision = '20250115_add_camera_resolution'
down_revision = '20250101_initial'
# Phase 1 (this release): Add column as nullable
def upgrade():
op.add_column('cameras', sa.Column('resolution', sa.String(20), nullable=True))
# Backfill existing data
op.execute("UPDATE cameras SET resolution = '1920x1080' WHERE resolution IS NULL")
# Phase 2 (next release): Make column non-nullable
# def upgrade():
# op.alter_column('cameras', 'resolution', nullable=False)
def downgrade():
op.drop_column('cameras', 'resolution')
Migration execution (Kubernetes init container):
initContainers:
- name: db-migrations
image: surveillance/api:2.3.2@sha256:a1b2c3d4...
command:
- python
- -m
- alembic
- upgrade
- head
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
resources:
requests:
cpu: 100m
memory: 128Mi
  # Init containers must complete before the main container starts; on failure
  # they are retried per the pod's restartPolicy (Always for Deployments)
Two-phase destructive change example:
Phase 1 (Release N):
def upgrade():
# Add new column
op.add_column('detections', sa.Column('confidence_v2', sa.Float(), nullable=True))
    # Create index concurrently (no table lock). CONCURRENTLY cannot run inside
    # a transaction, so it needs an autocommit block
    with op.get_context().autocommit_block():
        op.create_index('ix_detections_confidence_v2', 'detections', ['confidence_v2'],
                        postgresql_concurrently=True)
    # Backfill in batches (each statement updates one batch; rerun until zero rows remain)
op.execute("""
UPDATE detections
SET confidence_v2 = confidence
WHERE confidence_v2 IS NULL
AND id IN (SELECT id FROM detections WHERE confidence_v2 IS NULL LIMIT 10000)
""")
Phase 2 (Release N+1):
def upgrade():
# Now safe to drop old column (all code reads from new column)
op.drop_column('detections', 'confidence')
# Rename new column
op.alter_column('detections', 'confidence_v2', new_column_name='confidence')
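The batched UPDATE in Phase 1 only processes one batch per statement; a driver loop along these lines repeats it until no rows remain. A sketch, where `execute` stands in for any DB call that runs a statement and returns the affected row count (e.g. `cursor.execute` plus `cursor.rowcount`):

```python
BATCH_SQL = """
    UPDATE detections
    SET confidence_v2 = confidence
    WHERE id IN (
        SELECT id FROM detections
        WHERE confidence_v2 IS NULL
        LIMIT 10000
    )
"""

def backfill_in_batches(execute) -> int:
    """Run the batched UPDATE until it affects zero rows; return total rows updated."""
    total = 0
    while True:
        updated = execute(BATCH_SQL)
        total += updated
        if updated == 0:
            return total
```

Running batches in separate transactions keeps lock durations short and avoids one long table-wide update.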
9.4 Model Update Deployment (Blue/Green)
AI model updates use blue/green to enable instant rollback:
Current State:
┌──────────────┐
│  Model v2.1  │ ← Active (Blue)
│    (Blue)    │
└──────────────┘
▲
traffic: 100%
Deployment:
1. Load Model v2.2 alongside v2.1
2. Warm up v2.2 (run inference tests)
3. Gradually shift traffic: 10% → 50% → 100%
4. Monitor accuracy and latency
┌──────────────┐ ┌──────────────┐
│ Model v2.1 │ │ Model v2.2 │
│ (Blue) │ │ (Green) │
└──────────────┘ └──────────────┘
traffic: 70% traffic: 30%
Rollback (instant):
┌──────────────┐ ┌──────────────┐
│ Model v2.1 │ │ Model v2.2 │
│ (Blue) │ │ (Green) │
└──────────────┘ └──────────────┘
traffic: 100% traffic: 0%
Model deployment configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ai-inference-blue
namespace: surveillance
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-inference
      model: blue
  template:
    metadata:
      labels:
        app: ai-inference
        model: blue
    spec:
containers:
- name: inference
image: surveillance/inference:2.3.1
env:
- name: MODEL_VERSION
value: "face-detection-v2.1"
- name: MODEL_PATH
value: "/models/v2.1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: ai-inference-green
namespace: surveillance
spec:
  replicas: 0 # Scaled to 0 by default
  selector:
    matchLabels:
      app: ai-inference
      model: green
  template:
    metadata:
      labels:
        app: ai-inference
        model: green
    spec:
containers:
- name: inference
image: surveillance/inference:2.3.1
env:
- name: MODEL_VERSION
value: "face-detection-v2.2"
- name: MODEL_PATH
value: "/models/v2.2"
---
# Service routes to active model via label selector
apiVersion: v1
kind: Service
metadata:
name: ai-inference
annotations:
active-model: "blue"
spec:
selector:
model: blue # Changed to "green" for cutover
ports:
- port: 8080
Model switch script:
#!/bin/bash
# switch-model.sh — Switch between blue and green model deployments
NAMESPACE="surveillance"
TARGET="$1" # blue or green
if [ "$TARGET" != "blue" ] && [ "$TARGET" != "green" ]; then
  echo "Usage: $0 <blue|green>" >&2
  exit 1
fi
# Determine which deployment is currently active (the one we are switching away from)
OLD_VERSION=$([ "$TARGET" == "blue" ] && echo "green" || echo "blue")
# Scale target to match the currently active deployment
CURRENT_REPLICAS=$(kubectl get deployment "ai-inference-$OLD_VERSION" -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}')
echo "Scaling ai-inference-$TARGET to $CURRENT_REPLICAS replicas..."
kubectl scale deployment "ai-inference-$TARGET" --replicas="$CURRENT_REPLICAS" -n "$NAMESPACE"
# Wait for ready
kubectl rollout status "deployment/ai-inference-$TARGET" -n "$NAMESPACE" --timeout=300s
# Update service selector
echo "Switching service to $TARGET..."
kubectl patch service ai-inference -n "$NAMESPACE" \
  --type merge \
  -p "{\"spec\":{\"selector\":{\"model\":\"$TARGET\"}}}"
# Update annotation
kubectl annotate service ai-inference -n "$NAMESPACE" \
  "active-model=$TARGET" --overwrite
# Scale old version to 0
echo "Scaling down ai-inference-$OLD_VERSION..."
kubectl scale deployment "ai-inference-$OLD_VERSION" --replicas=0 -n "$NAMESPACE"
echo "Model switch complete. Active: $TARGET"
9.5 Maintenance Windows
| Window | Schedule | Duration | Allowed Activities |
|---|---|---|---|
| Weekly | Sunday 02:00-06:00 UTC | 4 hours | Patches, minor updates, config changes |
| Monthly | First Sunday 02:00-08:00 UTC | 6 hours | Database maintenance, major upgrades, model updates |
| Quarterly | Scheduled | 8 hours | Infrastructure upgrades, DR drills |
| Emergency | On-demand | As needed | Security patches, critical fixes |
Maintenance mode API:
@app.post("/admin/maintenance")
async def enable_maintenance_mode(
duration_minutes: int,
reason: str,
user: AdminUser = Depends(get_admin_user)
):
"""Enable maintenance mode — disable non-critical processing."""
await redis.set("maintenance:active", "true", ex=duration_minutes * 60)
await redis.set("maintenance:reason", reason, ex=duration_minutes * 60)
# Notify all connected clients
await websocket_manager.broadcast({
"type": "maintenance",
"status": "started",
"reason": reason,
"estimated_duration_minutes": duration_minutes
})
# Reduce non-critical processing
await set_pipeline_mode("minimal")
audit_log.info("Maintenance mode enabled by %s for %d minutes: %s",
user.username, duration_minutes, reason)
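The "minimal" pipeline mode implies a policy for which work continues during maintenance. A sketch of that check, used by job schedulers before starting work; the task names in the critical set are illustrative assumptions:

```python
# Tasks that must keep running even in maintenance mode (illustrative set)
CRITICAL_TASKS = {"camera-capture", "alerting", "health-checks"}

def should_defer(task_name: str, maintenance_active: bool) -> bool:
    """During maintenance mode, defer everything except critical surveillance tasks."""
    return maintenance_active and task_name not in CRITICAL_TASKS
```

In practice `maintenance_active` would be read from the `maintenance:active` Redis key set by the endpoint above.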
9.6 Rollback Capability
Every deployment maintains the previous N versions for instant rollback:
| Rollback Type | Method | Time to Complete | When to Use |
|---|---|---|---|
| Application rollback | `kubectl rollout undo` | ~30 seconds | Bad deployment |
| Database rollback | `alembic downgrade` | 2-5 minutes | Bad migration |
| Model rollback | Switch service selector | ~10 seconds | Bad model update |
| Configuration rollback | Git revert + apply | 1-2 minutes | Bad config change |
| Infrastructure rollback | Terraform state revert | 5-10 minutes | Bad infra change |
| Full system rollback | DR failover | 15-30 minutes | Catastrophic failure |
Automated rollback triggers:
# rollback-alerts.yaml
- alert: DeploymentRollbackRequired
expr: |
(
rate(http_requests_total{status=~"5.."}[5m]) > 0.1
and
delta(deployment_timestamp[10m]) > 0
)
for: 2m
labels:
severity: p1
annotations:
summary: "High error rate after deployment — rollback recommended"
runbook_url: "https://wiki.internal/runbooks/auto-rollback"
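A rollback automation could subscribe to this alert via an Alertmanager webhook receiver. A sketch of the payload handling (the payload shape is the standard Alertmanager webhook format; the `deployment` label is an assumption about how the alert would be labelled):

```python
def rollback_targets(payload: dict) -> list:
    """Extract deployment names to roll back from an Alertmanager webhook payload.

    Only firing DeploymentRollbackRequired alerts are considered.
    """
    targets = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if (alert.get("status") == "firing"
                and labels.get("alertname") == "DeploymentRollbackRequired"):
            deployment = labels.get("deployment")  # assumed label on the alert
            if deployment:
                targets.append(deployment)
    return targets
```

A receiver would then run `kubectl rollout undo deployment/<name>` for each target, with appropriate guardrails (rate limiting, audit logging).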
9.7 Version Pinning
All container images MUST be pinned to digest, never to floating tags:
# GOOD — pinned to digest
image: surveillance/api:2.3.1@sha256:abc123def456...
# BAD — floating tag
image: surveillance/api:latest
# ACCEPTABLE — semver tag with digest verification
image: surveillance/api:2.3.1
# (digest verified by admission controller)
Image verification admission controller:
# Kyverno / OPA Gatekeeper policy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-image-digest
spec:
validationFailureAction: Enforce
rules:
- name: check-digest
match:
resources:
kinds:
- Pod
validate:
message: "All container images must be pinned to digest"
pattern:
spec:
containers:
- image: "*@sha256:*"
10. Performance Optimization
10.1 Query Optimization
Slow query monitoring:
-- Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slow queries
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
rows,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
Alert on slow queries:
- alert: SlowPostgresQueries
expr: |
pg_stat_statements_mean_time > 1000
for: 5m
labels:
severity: p3
annotations:
summary: "Slow queries detected (>1000ms average)"
Index review (monthly):
-- Find tables with heavy sequential scans (candidates for missing indexes)
SELECT
    schemaname,
    relname AS tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY seq_tup_read DESC;
-- Check for unused indexes
SELECT
schemaname,
tablename,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;
Current index strategy:
-- Core indexes for surveillance queries
CREATE INDEX CONCURRENTLY idx_detections_timestamp_camera
ON detections (timestamp DESC, camera_id);
CREATE INDEX CONCURRENTLY idx_detections_person_id
ON detections (person_id) WHERE person_id IS NOT NULL;
CREATE INDEX CONCURRENTLY idx_events_timestamp_type
ON events (timestamp DESC, event_type);
CREATE INDEX CONCURRENTLY idx_alerts_status_created
ON alerts (status, created_at DESC)
WHERE status IN ('pending', 'sent');
CREATE INDEX CONCURRENTLY idx_recordings_camera_timestamp
ON recordings (camera_id, start_time DESC);
-- Partial index for active alerts (most queried)
CREATE INDEX CONCURRENTLY idx_alerts_active
ON alerts (created_at DESC, camera_id, severity)
WHERE status = 'active';
10.2 Cache Strategy (Redis)
| Cache Type | TTL | Invalidation | Purpose |
|---|---|---|---|
| Camera configuration | 5 min | On update | Reduce DB reads |
| Person profiles | 10 min | On update | Fast face lookup |
| Recent detections | 1 min | Time-based | Dashboard display |
| Alert rules | 5 min | On update | Rule evaluation |
| API responses (frequent) | 30 sec | On data change | Reduce API load |
| Session data | 24 hours | On logout | User sessions |
| Rate limiting | 1 min | Automatic | API protection |
Redis configuration:
# redis-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: redis-config
namespace: surveillance
data:
redis.conf: |
maxmemory 2gb
maxmemory-policy allkeys-lru
appendonly yes
appendfsync everysec
save 900 1
save 300 10
save 60 10000
tcp-keepalive 60
timeout 300
Cache implementation:
# cache.py
import redis.asyncio as redis
import json
import hashlib
from functools import wraps
redis_client = redis.Redis(
host='redis',
port=6379,
db=0,
decode_responses=True,
socket_connect_timeout=5,
socket_timeout=5,
health_check_interval=30,
)
def cached(ttl_seconds: int, key_prefix: str = "cache"):
    """Decorator to cache async function results.

    The factory must be a plain (non-async) function: it returns the
    decorator and is never awaited itself.
    """
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Generate cache key
            cache_key = f"{key_prefix}:{func.__name__}:{_generate_key(args, kwargs)}"
            # Try cache (compare against None so falsy results still hit)
            cached_value = await redis_client.get(cache_key)
            if cached_value is not None:
                return json.loads(cached_value)
            # Execute and cache
            result = await func(*args, **kwargs)
            await redis_client.setex(
                cache_key,
                ttl_seconds,
                json.dumps(result, default=str)
            )
            return result
        return wrapper
    return decorator

def _generate_key(args, kwargs):
    key_data = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True, default=str)
    return hashlib.sha256(key_data.encode()).hexdigest()[:16]
# Usage
@cached(ttl_seconds=300, key_prefix="camera")
async def get_camera_config(camera_id: str):
    row = await db.fetchrow("SELECT * FROM cameras WHERE id = $1", camera_id)
    return dict(row) if row else None  # asyncpg Record is not JSON-serializable
@cached(ttl_seconds=60, key_prefix="detections")
async def get_recent_detections(camera_id: str, limit: int = 50):
    rows = await db.fetch(
        """SELECT * FROM detections
           WHERE camera_id = $1
           ORDER BY timestamp DESC
           LIMIT $2""",
        camera_id, limit
    )
    return [dict(r) for r in rows]  # convert Records for JSON caching
10.3 CDN Configuration
Static assets and archived media are served via CDN:
# CloudFront / CDN configuration
cdn:
origins:
- id: surveillance-media
domain: surveillance-media.s3.amazonaws.com
path: /recordings
- id: surveillance-static
domain: surveillance-static.s3.amazonaws.com
path: /static
behaviors:
- path: /recordings/*.mp4
ttl: 86400
compress: true
- path: /static/*
ttl: 604800
cache_control: "public, max-age=604800, immutable"
- path: /api/*
ttl: 0 # Don't cache API
signed_urls:
enabled: true
key_pair_id: "K..."
expiration: 3600 # 1 hour
10.4 Connection Pooling
Database Connection Pooling
# database.py
import asyncpg
DB_POOL_CONFIG = {
    "min_size": 5,
    "max_size": 20,
    "max_inactive_connection_lifetime": 300,  # asyncpg's parameter name
    "max_queries": 50000,
    "command_timeout": 30,
    "server_settings": {
        "jit": "off",
        "application_name": "surveillance-api"
    }
}
pool = None
async def init_pool(database_url: str):
global pool
pool = await asyncpg.create_pool(
database_url,
**DB_POOL_CONFIG
)
async def get_connection():
return await pool.acquire()
async def release_connection(conn):
await pool.release(conn)
HTTP Connection Pooling (for inter-service communication)
# http_client.py
import httpx
class ServiceClient:
    def __init__(self):
        self.client = httpx.AsyncClient(
            # httpx.Timeout needs a default value when only some of the
            # individual timeouts are specified
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_connections=100,
                max_keepalive_connections=20
            ),
            http2=True,  # requires the httpx[http2] extra
        )

    async def get(self, service: str, path: str):
        url = f"http://{service}:8080{path}"
        response = await self.client.get(url)
        response.raise_for_status()
        return response.json()
service_client = ServiceClient()
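ServiceClient pools connections but defines no retry policy; a sketch of an exponential-backoff schedule that could drive retries around `client.get()` (the attempt count, base delay, and cap are assumptions, not current settings):

```python
# Illustrative retry schedule for inter-service calls (assumed
# parameters; not part of http_client.py).
import random

def backoff_schedule(attempts: int = 4, base: float = 0.5, cap: float = 8.0,
                     jitter: bool = True) -> list[float]:
    """Delays in seconds before each retry: 0.5, 1, 2, 4, ... capped at `cap`.

    Full jitter (delay drawn from [0, computed]) avoids a thundering herd
    when many pods retry the same failed service simultaneously.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, delay) if jitter else delay)
    return delays
```

A caller would sleep for each delay in turn after a `httpx.TransportError`, giving up once the schedule is exhausted.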
10.5 Resource Limits
# resource-limits.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: surveillance-limits
namespace: surveillance
spec:
limits:
- default:
cpu: "1"
memory: 1Gi
defaultRequest:
cpu: 100m
memory: 128Mi
type: Container
---
# Per-service resource allocation
resources:
# Video capture (I/O bound)
video-capture:
requests:
cpu: "1"
memory: 2Gi
limits:
cpu: "2"
memory: 4Gi
# AI inference (CPU/GPU bound)
ai-inference:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
# API (moderate load)
surveillance-api:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "2"
memory: 2Gi
# Database (high memory)
postgres:
requests:
cpu: "1"
memory: 4Gi
limits:
cpu: "4"
memory: 16Gi
# Redis (low CPU, moderate memory)
redis:
requests:
cpu: 100m
memory: 1Gi
limits:
cpu: "1"
memory: 2Gi
10.6 Performance Benchmarks
| Metric | Target | Alert Threshold | Critical Threshold |
|---|---|---|---|
| Camera stream latency | < 100ms | > 200ms | > 500ms |
| AI inference per frame | < 50ms | > 100ms | > 200ms |
| End-to-end detection latency | < 500ms | > 1000ms | > 2000ms |
| API response time (p50) | < 50ms | > 100ms | > 500ms |
| API response time (p95) | < 200ms | > 500ms | > 1000ms |
| Database query time (p95) | < 10ms | > 50ms | > 200ms |
| Stream processing FPS | 30 FPS | < 25 FPS | < 15 FPS |
| Frame drop rate | < 0.1% | > 1% | > 5% |
| Alert delivery time | < 5s | > 10s | > 30s |
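These thresholds can be checked mechanically against measured values; a small illustrative helper (not part of the platform code) that classifies one row of the table:

```python
# Illustrative: classify a measured value against a row of the
# benchmark table (alert / critical thresholds).

def classify(value: float, alert: float, critical: float,
             higher_is_worse: bool = True) -> str:
    """Return 'ok', 'alert', or 'critical' for one benchmark row.

    For FPS-style metrics where lower values are worse, pass
    higher_is_worse=False and the comparison is inverted.
    """
    if not higher_is_worse:
        value, alert, critical = -value, -alert, -critical
    if value > critical:
        return "critical"
    if value > alert:
        return "alert"
    return "ok"
```

For example, a 600 ms p50 API response exceeds the 500 ms critical threshold, while 20 FPS sits between the 25 FPS alert and 15 FPS critical levels.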
11. Disaster Recovery
11.1 DR Objectives
| Metric | Value | Measurement |
|---|---|---|
| RTO (Recovery Time Objective) | 1 hour | Time from disaster declaration to service restoration |
| RPO (Recovery Point Objective) | 15 minutes | Maximum acceptable data loss |
| RTO (Database) | 30 minutes | Database failover time |
| RTO (Application) | 15 minutes | Application redeployment time |
| RPO (Database) | < 1 minute | With synchronous replication |
| RPO (Media) | 15 minutes | Cross-region replication lag |
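Drill results can be scored against these objectives automatically; an illustrative sketch (the objective values are copied from the table, the function and its name are assumptions):

```python
# Illustrative: compare measured drill results (minutes) against the
# top-level DR objectives from the table above.

OBJECTIVES_MIN = {"rto": 60, "rpo": 15}  # site RTO 1 hour, RPO 15 minutes

def meets_objectives(measured: dict, objectives: dict = OBJECTIVES_MIN) -> dict:
    """Per-objective pass/fail, e.g. {'rto': True, 'rpo': False}."""
    return {name: measured[name] <= limit for name, limit in objectives.items()}
```

The Q1 drill in 11.5 (42 min RTO, 8 min RPO) would pass both checks.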
11.2 DR Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ PRODUCTION (us-east-1) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ EKS │ │ RDS │ │ S3 │ │
│ │ Cluster │ │ PostgreSQL │ │ Primary │ │
│ │ │ │ (Primary) │ │ Bucket │ │
│ │ ┌────────┐ │ │ │ │ │ │
│ │ │Capture │ │ │ ┌────────┐ │ │ ┌──────────┐ │ │
│ │ │API │ │ │ │Primary │ │ │ │ Recordings│ │ │
│ │ │Inference│ │ │ │Replica │ │ │ │ Events │ │ │
│ │ └────────┘ │ │ └────────┘ │ │ │ Models │ │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Real-time Replication │ │
│ │ (WAL streaming + S3 cross-region replication) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ DR SITE (us-west-2) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ EKS │ │ RDS │ │ S3 │ │
│ │ (Scaled │ │ PostgreSQL │ │ Replica │ │
│ │ to 0) │ │ (Standby) │ │ Bucket │ │
│ │ │ │ │ │ │ │
│ │ [Ready to │ │ ┌────────┐ │ │ [Fully │ │
│ │ scale up] │ │ │Standby │ │ │ replicated] │ │
│ │ │ │ │Replica │ │ │ │ │
│ └──────────────┘ │ └────────┘ │ └──────────────────┘ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
11.3 Data Replication
Database Replication
# RDS PostgreSQL cross-region read replica
AWSTemplateFormatVersion: '2010-09-09'
Resources:
DRReadReplica:
Type: AWS::RDS::DBInstance
Properties:
DBInstanceIdentifier: surveillance-dr-replica
DBInstanceClass: db.r6g.xlarge
Engine: postgres
EngineVersion: '15.4'
SourceDBInstanceIdentifier:
!Sub 'arn:aws:rds:us-east-1:${AWS::AccountId}:db:surveillance-primary'
DBSubnetGroupName: !Ref DRSubnetGroup
VPCSecurityGroups:
- !Ref DRSecurityGroup
MultiAZ: false # Standby only; enable during failover
StorageEncrypted: true
KmsKeyId: !Ref DRKMSKey
BackupRetentionPeriod: 7
DeletionProtection: true
Tags:
- Key: Purpose
Value: DR-Standby
- Key: RPO
Value: 15min
Replication monitoring:
-- Check replication lag (run on primary)
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
-- Alert if replication lag > 5 minutes
Object Storage Replication
S3 Cross-Region Replication (CRR) with 15-minute RPO:
- All new objects replicated within 15 minutes
- Replication status tracked per object
- Failed replication events alerted
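Per-object status is surfaced in the S3 HeadObject response; a sketch of interpreting it (a real check would obtain the response via boto3's `s3.head_object`; the helper itself is illustrative):

```python
# Illustrative: interpret the ReplicationStatus field of an S3
# HeadObject response. A real caller would fetch the response with
#   boto3.client("s3").head_object(Bucket=..., Key=...)

def replication_state(head_object_response: dict) -> str:
    """Map S3's ReplicationStatus to an operational state.

    S3 reports PENDING / COMPLETED / FAILED on the source object and
    REPLICA on the destination; objects created before the CRR rule
    was enabled carry no field at all.
    """
    status = head_object_response.get("ReplicationStatus")
    if status is None:
        return "not-replicated"
    if status == "FAILED":
        return "failed (alert)"
    return status.lower()
```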
Configuration Replication
- Terraform state stored in S3 with cross-region replication
- Git repositories mirrored to secondary Git provider
- Kubernetes manifests stored in Git (GitOps)
11.4 Failover Process
Automated Failover (Database — RDS)
RDS Multi-AZ provides automatic failover:
- Health check fails on primary
- RDS promotes standby to primary (typically 60-120 seconds)
- DNS endpoint updates automatically
- Application reconnects via connection pool
Manual DR Failover (Full Site)
#!/bin/bash
# dr-failover.sh — Execute full site failover to DR region
PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
FAILOVER_REASON="$1"
START_TIME=$(date +%s)  # used for the total-time report at the end
log() {
echo "[$(date -Iseconds)] $1" | tee -a /var/log/dr/failover-$(date +%Y%m%d).log
}
log "=== DR FAILOVER INITIATED ==="
log "Reason: $FAILOVER_REASON"
log "From: $PRIMARY_REGION → $DR_REGION"
# 1. Verify DR environment
log "1. Verifying DR environment readiness..."
if ! aws eks describe-cluster --name surveillance-dr --region $DR_REGION > /dev/null 2>&1; then
log "ERROR: DR EKS cluster not accessible"
exit 1
fi
# 2. Promote DR database from standby
log "2. Promoting DR database..."
aws rds promote-read-replica \
--db-instance-identifier surveillance-dr-replica \
--region $DR_REGION
# Wait for promotion
aws rds wait db-instance-available \
--db-instance-identifier surveillance-dr-replica \
--region $DR_REGION
log " DR database promoted successfully"
# 3. Enable Multi-AZ on DR database
log "3. Enabling Multi-AZ on DR database..."
aws rds modify-db-instance \
--db-instance-identifier surveillance-dr-replica \
--multi-az \
--apply-immediately \
--region $DR_REGION
# 4. Scale up DR EKS cluster
log "4. Scaling up DR EKS cluster..."
aws eks update-nodegroup-config \
--cluster-name surveillance-dr \
--nodegroup-name surveillance-workers \
--scaling-config minSize=3,maxSize=10,desiredSize=3 \
--region $DR_REGION
# Wait for nodes (point kubectl at the DR cluster before waiting)
sleep 120
kubectl config use-context surveillance-dr
kubectl wait --for=condition=Ready nodes --all --timeout=300s
# 5. Deploy application to DR
log "5. Deploying application to DR..."
kubectl apply -k k8s/overlays/dr/
# Wait for deployments
kubectl wait --for=condition=available \
--all deployments \
--namespace surveillance \
--timeout=600s
# 6. Update DNS to point to DR
log "6. Updating DNS to DR region..."
aws route53 change-resource-record-sets \
--hosted-zone-id $HOSTED_ZONE_ID \
--change-batch file://dr-dns-update.json
# 7. Verify health
log "7. Running health checks..."
for i in {1..10}; do
if curl -f https://surveillance.company.com/health/deep > /dev/null 2>&1; then
log " Health check PASSED"
break
fi
log " Health check attempt $i/10..."
sleep 10
done
# 8. Verify cameras reconnecting
log "8. Verifying camera streams..."
sleep 60
STREAM_COUNT=$(curl -s https://surveillance.company.com/api/v1/cameras/status | \
jq '[.cameras[] | select(.status == "active")] | length')
log " Active streams: $STREAM_COUNT/8"
# 9. Send notifications
log "9. Sending notifications..."
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-type: application/json' \
-d "{\"text\":\"DR FAILOVER COMPLETE: Production now running in $DR_REGION. Reason: $FAILOVER_REASON. Active streams: $STREAM_COUNT/8\"}"
log "=== DR FAILOVER COMPLETE ==="
log "Total time: $(($(date +%s) - START_TIME)) seconds"
11.5 DR Testing Schedule
| Test Type | Frequency | Scope | Duration | Validation |
|---|---|---|---|---|
| Backup restore drill | Monthly | Database + media | 2 hours | Data integrity verified |
| Application redeployment | Monthly | Full application stack | 1 hour | All services healthy |
| Network failover test | Quarterly | VPN, DNS | 30 min | Traffic routes correctly |
| Database failover test | Quarterly | RDS Multi-AZ promotion | 1 hour | Replication lag acceptable |
| Full DR drill | Quarterly | Complete site failover | 4 hours | All RTO/RPO met |
| Tabletop exercise | Semi-annually | Team response procedures | 2 hours | Process gaps identified |
Full DR drill procedure:
- Week before: Schedule drill; notify stakeholders; prepare isolated test data
- Day of:
- 09:00 — Initiate failover (simulate primary region failure)
- 09:05 — DR team executes failover runbook
- 09:30 — Verify database is promoted and accessible
- 10:00 — Verify application is deployed and healthy
- 10:30 — Verify camera streams reconnect
- 11:00 — Verify alert delivery
- 11:30 — Run E2E test suite
- 12:00 — Validate data integrity (sample checks)
- 12:30 — Measure and document RTO/RPO
- 13:00 — Initiate failback to primary
- 14:00 — Verify primary is restored
- Week after: Complete DR test report; file action items
DR Test Report Template:
## DR Drill Report — 2025-Q1
| Item | Result |
|------|--------|
| Date | 2025-03-15 |
| Scenario | Complete region failure (us-east-1) |
| Failover RTO Target | 60 minutes |
| Failover RTO Achieved | 42 minutes |
| RPO Target | 15 minutes |
| RPO Achieved | 8 minutes |
| Streams Restored | 8/8 (100%) |
| Data Integrity | PASS |
| E2E Tests | 47/47 PASS |
### Issues Found
1. Camera reconnection took 18 minutes (target: <10 min) — AI-7 filed
2. Alert service required manual restart — AI-8 filed
### Action Items
| ID | Description | Owner | Due |
|----|-------------|-------|-----|
| AI-7 | Optimize camera reconnection sequence | @eng | 2025-04-01 |
| AI-8 | Fix alert service startup dependency | @sre | 2025-03-22 |
11.6 DR Readiness Checklist
Verify monthly (automated where possible):
- DR database replication lag < 1 minute
- S3 cross-region replication caught up
- DR EKS cluster accessible and nodes can scale
- Latest container images available in DR region registry
- DR Terraform plan applies without errors (dry-run)
- Backup integrity verified (latest full backup)
- Failover runbook accessible and up-to-date
- DR contact list current
- VPN/cross-region network paths verified
12. Capacity Planning
12.1 Current Capacity Baseline (8 Cameras)
| Resource | Current Usage | Capacity | Headroom (% of current usage) |
|---|---|---|---|
| CPU (cloud) | 4 cores avg | 8 cores | 100% |
| Memory (cloud) | 12 GB | 32 GB | 167% |
| GPU (if used) | 40% utilization | 1x GPU | 150% |
| Storage hot tier | 6 TB / 20 TB | 20 TB | 233% |
| Storage warm tier | 18 TB / 50 TB | 50 TB | 178% |
| Database storage | 150 GB | 500 GB | 233% |
| Database connections | 25 / 100 | 100 | 300% |
| Network egress | 200 Mbps / 1 Gbps | 1 Gbps | 400% |
| Inference throughput | 240 FPS (8x30) | 480 FPS | 100% |
| Alert volume | 50/day | 500/day | 900% |
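Headroom here is measured relative to current usage, i.e. (capacity - usage) / usage; an illustrative helper that reproduces the table's figures:

```python
# Illustrative: headroom as a percentage of current usage, the
# convention used in the capacity baseline table.

def headroom_pct(usage: float, capacity: float) -> int:
    """E.g. 4 of 8 cores used -> 100% headroom (usage could double)."""
    return round((capacity - usage) / usage * 100)
```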
12.2 Scaling Triggers
| Metric | Scale-Up Trigger | Scale-Down Trigger | Action |
|---|---|---|---|
| CPU utilization | > 70% for 10 minutes | < 30% for 30 minutes | Add/remove inference pods |
| Memory utilization | > 80% for 10 minutes | < 40% for 30 minutes | Add memory or pods |
| Inference latency | > 100ms p95 for 5 min | < 50ms p95 for 10 min | Scale inference horizontally |
| Queue depth | > 1000 frames | < 100 frames | Adjust consumer count |
| Storage usage | > 70% | N/A (manual) | Expand volume or archive |
| Camera count | > 8 cameras | N/A | Scale per-camera resources |
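Each trigger pairs a threshold with a sustain window, i.e. hysteresis; an illustrative sketch of that decision for the CPU row (the 15 s sample interval follows the metrics cadence, the windows follow the table; the function itself is an assumption):

```python
# Illustrative: hysteresis check for the CPU scaling trigger
# (>70% for 10 min scales up, <30% for 30 min scales down).

def scaling_action(samples: list[float], interval_s: int = 15,
                   up_threshold: float = 70.0, up_window_s: int = 600,
                   down_threshold: float = 30.0, down_window_s: int = 1800) -> str:
    """Return 'scale-up', 'scale-down', or 'hold' for CPU% samples.

    A breach must be sustained for the whole window, which prevents
    flapping on short spikes.
    """
    up_n = up_window_s // interval_s      # 40 samples at 15 s cadence
    down_n = down_window_s // interval_s  # 120 samples at 15 s cadence
    if len(samples) >= up_n and all(s > up_threshold for s in samples[-up_n:]):
        return "scale-up"
    if len(samples) >= down_n and all(s < down_threshold for s in samples[-down_n:]):
        return "scale-down"
    return "hold"
```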
Horizontal Pod Autoscaler configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ai-inference-hpa
namespace: surveillance
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ai-inference
minReplicas: 2
maxReplicas: 8
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: surveillance_pipeline_latency_ms
target:
type: AverageValue
averageValue: "100"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 2
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 120
12.3 Camera Addition Process
Step 1: Pre-deployment Assessment (Day -7)
├── Evaluate resource requirements
├── Verify network connectivity
├── Review camera positioning and coverage
└── Update configuration in Git
Step 2: Infrastructure Preparation (Day -3)
├── Calculate additional storage needs
├── Verify scaling headroom
├── Prepare camera configuration
└── Stage network/VPN configuration
Step 3: Deployment (Day 0)
├── Add camera to configuration
├── Deploy updated configuration
├── Verify stream connection
├── Validate AI processing
├── Test alert generation
└── Update dashboards
Step 4: Validation (Day 0-1)
├── Monitor for 24 hours
├── Verify FPS and quality
├── Confirm alerts working
├── Document in camera registry
└── Notify stakeholders
Camera addition checklist:
| Step | Item | Verification |
|---|---|---|
| 1 | Camera network reachable | ping <camera_ip> |
| 2 | RTSP stream accessible | ffprobe rtsp://<camera>/stream |
| 3 | VPN tunnel supports additional bandwidth | Bandwidth check |
| 4 | Configuration added to Git | PR merged |
| 5 | Stream appears in video-capture | Logs show connection |
| 6 | FPS meets target (>25) | Grafana dashboard |
| 7 | AI inference processing frames | Detection metrics |
| 8 | Alerts generated correctly | Test alert |
| 9 | Storage projections updated | Capacity review |
| 10 | Camera documented | Registry updated |
12.4 Per-Camera Resource Requirements
| Resource | Per Camera | 8 Cameras | 16 Cameras | 24 Cameras |
|---|---|---|---|---|
| CPU (inference) | 0.5 cores | 4 cores | 8 cores | 12 cores |
| Memory (processing) | 1 GB | 8 GB | 16 GB | 24 GB |
| Storage (hot, daily) | 50 GB/day | 400 GB/day | 800 GB/day | 1.2 TB/day |
| Network (ingress) | 25 Mbps | 200 Mbps | 400 Mbps | 600 Mbps |
| GPU memory | 512 MB | 4 GB | 8 GB | 12 GB |
| Database IOPS | 100 | 800 | 1,600 | 2,400 |
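These per-camera figures scale linearly, so totals for any fleet size follow directly; an illustrative projection helper (constants copied from the table, the helper itself is an assumption):

```python
# Illustrative: project total resource needs from the per-camera
# figures in the table above (linear scaling).

PER_CAMERA = {
    "cpu_cores": 0.5,
    "memory_gb": 1.0,
    "hot_storage_gb_per_day": 50,
    "ingress_mbps": 25,
    "gpu_memory_mb": 512,
    "db_iops": 100,
}

def project(cameras: int) -> dict:
    """Totals for `cameras` cameras, e.g. 16 cameras need 400 Mbps ingress."""
    return {k: v * cameras for k, v in PER_CAMERA.items()}
```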
12.5 Scaling Roadmap
| Phase | Cameras | Timeline | Infrastructure Changes |
|---|---|---|---|
| Current | 8 | Now | 3 inference pods, 8 CPU, 32 GB RAM |
| Phase 1 | 12 | Q2 2025 | 4 inference pods, 12 CPU, 48 GB RAM |
| Phase 2 | 16 | Q3 2025 | 6 inference pods, 16 CPU, 64 GB RAM, GPU add |
| Phase 3 | 24 | Q1 2026 | 8 inference pods, 24 CPU, 96 GB RAM, 2 GPU |
| Phase 4 | 32+ | Q3 2026 | Shard by location, dedicated inference cluster |
12.6 Performance Benchmarks
Benchmark suite executed monthly:
#!/bin/bash
# performance-benchmark.sh
API_URL="https://surveillance.company.com"
RESULTS_FILE="/var/log/benchmarks/$(date +%Y%m%d).json"
echo "{\"timestamp\": \"$(date -Iseconds)\"," > "$RESULTS_FILE"
echo "\"benchmarks\": {" >> "$RESULTS_FILE"
# 1. Health check latency
echo " Running health check latency test..."
HEALTH_LAT=$(curl -o /dev/null -s -w "%{time_total}" "$API_URL/health")
echo " \"health_check_latency_ms\": $(echo "$HEALTH_LAT * 1000" | bc)," >> "$RESULTS_FILE"
# 2. Deep health check latency
echo " Running deep health check..."
DEEP_LAT=$(curl -o /dev/null -s -w "%{time_total}" "$API_URL/health/deep")
echo " \"deep_health_latency_ms\": $(echo "$DEEP_LAT * 1000" | bc)," >> "$RESULTS_FILE"
# 3. API response time (events list)
echo " Running API response time test..."
API_LAT=$(curl -o /dev/null -s -w "%{time_total}" \
"$API_URL/api/v1/events?limit=100&start=$(date -d '1 hour ago' -Iseconds)")
echo " \"api_events_latency_ms\": $(echo "$API_LAT * 1000" | bc)," >> "$RESULTS_FILE"
# 4. Database query performance
echo " Running database query test..."
DB_LAT=$(curl -o /dev/null -s -w "%{time_total}" \
"$API_URL/api/v1/admin/db-performance")
echo " \"db_query_latency_ms\": $(echo "$DB_LAT * 1000" | bc)," >> "$RESULTS_FILE"
# 5. Stream status
echo " Checking stream status..."
STREAMS=$(curl -s "$API_URL/api/v1/cameras/status" | jq '[.cameras[] | select(.status == "active")] | length')
echo " \"active_streams\": $STREAMS," >> "$RESULTS_FILE"
# 6. Inference latency (from Prometheus)
echo " Fetching inference metrics..."
INF_LAT=$(curl -sG "http://prometheus:9090/api/v1/query" \
--data-urlencode 'query=histogram_quantile(0.95, rate(surveillance_model_inference_ms_bucket[5m]))' | \
jq -r '.data.result[0].value[1] // "null"')
echo " \"inference_p95_latency_ms\": $INF_LAT" >> "$RESULTS_FILE"
echo "}}" >> "$RESULTS_FILE"
echo "Benchmark complete. Results saved to $RESULTS_FILE"
cat "$RESULTS_FILE"
Benchmark history tracking:
| Date | Health (ms) | Deep Health (ms) | API (ms) | Inference P95 (ms) | Streams Active |
|---|---|---|---|---|---|
| 2025-01-01 | 12 | 245 | 89 | 42 | 8/8 |
| 2025-01-08 | 11 | 238 | 92 | 45 | 8/8 |
| 2025-01-15 | 15 | 520 | 156 | 78 | 7/8 (cam_03 offline) |
12.7 Resource Request & Provisioning Workflow
Requestor submits capacity request
│
▼
┌───────────────┐
│ SRE Review │ ← Assess impact, feasibility, alternatives
│ (2 biz days) │
└───────┬───────┘
│
▼
┌───────────────┐
│ Approval │ ← Engineering Manager + Finance (if >$X)
│ (1 biz day) │
└───────┬───────┘
│
▼
┌───────────────┐
│ Implementation│ ← SRE executes change during maintenance window
│ (scheduled) │
└───────┬───────┘
│
▼
┌───────────────┐
│ Validation │ ← Verify performance meets requirements
│ (24-48 hours) │
└───────┬───────┘
│
▼
┌───────────────┐
│ Close Request │ ← Document in capacity ledger
└───────────────┘
Appendices
Appendix A: Contact Directory
| Role | Name | Email | Phone | Slack |
|---|---|---|---|---|
| On-Call (rotating) | See PagerDuty | oncall@company.com | Via PagerDuty | #surveillance-oncall |
| SRE Team Lead | [Name] | sre-lead@company.com | +1-555-0100 | @sre-lead |
| Engineering Manager | [Name] | eng-mgr@company.com | +1-555-0101 | @eng-mgr |
| Security Officer | [Name] | security@company.com | +1-555-0104 | @security |
| Product Owner | [Name] | product@company.com | +1-555-0105 | @product |
| VP Engineering | [Name] | vp-eng@company.com | +1-555-0102 | @vp-eng |
Appendix B: Tooling Inventory
| Category | Tool | Version | Purpose |
|---|---|---|---|
| Monitoring | Prometheus | 2.47+ | Metrics collection |
| Monitoring | Grafana | 10.0+ | Visualization |
| Monitoring | Alertmanager | 0.26+ | Alert routing |
| Logging | Elasticsearch | 8.11+ | Log storage |
| Logging | Filebeat | 8.11+ | Log shipping |
| Logging | Kibana | 8.11+ | Log visualization |
| Orchestration | Kubernetes | 1.28+ | Container orchestration |
| Packaging | Helm | 3.13+ | K8s package management |
| IaC | Terraform | 1.6+ | Infrastructure provisioning |
| GitOps | ArgoCD | 2.9+ | Continuous deployment |
| Backup | pgBackRest | 2.48+ | PostgreSQL backup |
| Secrets | Vault / AWS Secrets Manager | Latest | Secret management |
| Paging | PagerDuty | SaaS | Incident paging |
| Communication | Slack | SaaS | Team communication |
Appendix C: Network Architecture
Internet
│
▼
┌─────────┐ ┌─────────────┐ ┌──────────────────┐
│ CDN │───▶│ Nginx/ALB │───▶│ API Gateway │
│ │ │ (TLS term) │ │ (auth/rate-lim) │
└─────────┘ └─────────────┘ └────────┬─────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────┐
│ Surveil- │ │ WebSocket │ │ Grafana │
│ lance │ │ Service │ │ /Kibana │
│ API │ │ │ │ │
└────┬─────┘ └──────────────┘ └──────────┘
│
┌────────┼────────┬──────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌────────┐ ┌─────┐ ┌──────────┐ ┌──────────┐
│PostgreSQL│ │Redis│ │ S3/MinIO│ │ Prometheus│
│ │ │ │ │ │ │ │
└─────────┘ └─────┘ └──────────┘ └───────────┘
VPN Tunnel
══════════
┌──────────────┐
│ Edge Node │◀── RTSP ──▶ [Cameras 1-8]
│ (local proc)│
└──────────────┘
Appendix D: Document Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-01-15 | SRE Team | Initial comprehensive operations plan covering all 12 domains |
END OF DOCUMENT
This document is a living document and should be reviewed and updated quarterly or after any significant infrastructure change.