AI-Powered Industrial Surveillance Platform — System Architecture
Topology, scaling, failover, and technology choices.
Document Information
- Version: 1.0
- Date: 2025-01-20
- Status: Production Architecture Design
- Target Platform: CP PLUS ORANGE Series DVR (CP-UVR-0801E1-CV2)
- Camera Count: 8 channels (scalable to 64+)
- Resolution: 960x1080 per channel
Table of Contents
- Executive Summary
- Deployment Topology
- Network Security Zones
- Service Architecture
- Data Flow Design
- Technology Stack
- Scaling Strategy
- Failover & Reliability
- Security Architecture
- Monitoring & Observability
- Cost Estimation
- Implementation Phases
- Appendices
1. Executive Summary
This document presents the complete system architecture for an AI-powered industrial surveillance platform designed to process 8 camera channels (expandable to 64+) from a CP PLUS ORANGE Series DVR. The architecture follows a cloud+edge hybrid pattern where compute-intensive AI inference runs in the cloud while a local edge gateway handles stream ingestion and site-local concerns. All DVR communication is protected inside a WireGuard VPN tunnel — the DVR has zero public internet exposure.
Key Architectural Decisions
| Decision | Choice | Rationale |
| --- | --- | --- |
| Cloud Provider | AWS (ap-south-1 primary, ap-southeast-1 DR) | Broadest IoT/edge tooling, VPC peering, lowest latency to India region |
| Container Orchestration | Amazon EKS (Kubernetes) | Managed control plane, auto-scaling, GPU node support for AI inference |
| VPN Solution | WireGuard | ~60% faster than OpenVPN, modern crypto, simple setup, NAT traversal |
| Message Queue | Apache Kafka (MSK) | Durable, ordered event log, replay capability, proven at scale |
| Stream Processing | Apache Flink on EKS | Stateful stream processing, exactly-once semantics, windowed operations |
| Reverse Proxy | Traefik (in-cluster) + AWS ALB (ingress) | Native Kubernetes integration, automatic cert management |
| AI Framework | NVIDIA Triton Inference Server + YOLOv8 | GPU-optimized inference, model ensemble, dynamic batching |
| Object Storage | MinIO (on-premises) + AWS S3 (cold archive) | S3-compatible API, local buffering, cost-tiered archival |
| Database | PostgreSQL 16 (RDS) + pgvector extension | Relational integrity for events, native vector support for face embeddings |
| Cache/Queue | Redis 7 Cluster (ElastiCache) | Sub-ms latency, stream data type for real-time pub/sub |
| Edge Hardware | Intel NUC 13 Pro i7 / NVIDIA Jetson Orin NX | x86 preferred for flexibility; Jetson alternative for GPU-at-edge |
2. Deployment Topology
2.1 High-Level Topology Diagram
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ CLOUD (AWS VPC) │
│ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ KUBERNETES CLUSTER (EKS) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ API GW │ │ Stream │ │ AI Inf. │ │ Suspicious Act. │ │ │
│ │ │ (Traefik) │ │ Ingestion │ │ Service │ │ Service │ │ │
│ │ │ :8443 │ │ Service │ │ (Triton) │ │ (Night Mode) │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └─────────────────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ ┌─────────────────────┐ │ │
│ │ │ Web App │ │ Training │ │ Notification│ │ Video Playback │ │ │
│ │ │ (Next.js) │ │ Service │ │ Service │ │ Service (HLS) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ PostgreSQL │ │ Redis │ │ Kafka │ │ MinIO │ │ │
│ │ │ (RDS) │ │ Cluster │ │ (MSK) │ │ (S3-compatible) │ │ │
│ │ │ :5432 │ │ :6379 │ │ :9092 │ │ :9000 │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ AWS APPLICATION LOAD BALANCER (:443) │ │ │
│ │ │ SSL termination, WAF, rate limiting, geo-restriction │ │ │
│ │ └─────────────────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ WireGuard VPN Tunnel (UDP 51820) │
│ │ Site-to-Site encrypted tunnel │
│ │ Cloud peer: 10.200.0.1/32 ←→ Edge peer: 10.200.0.2/32 │
└───────────┼─────────────────────────────────────────────────────────────────────────────────┘
│
┌───────────┴─────────────────────────────────────────────────────────────────────────────────┐
│ EDGE SITE (Local Network) │
│ │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────────────────┐ │
│ │ EDGE GATEWAY │ │ LOCAL NETWORK │ │
│ │ (Intel NUC / Jetson Orin) │ │ (192.168.29.0/24) │ │
│ │ OS: Ubuntu 22.04 LTS │ │ │ │
│ │ WireGuard endpoint │ │ ┌─────────────────────────────────┐ │ │
│ │ K3s lightweight cluster │ │ │ CP PLUS DVR │ │ │
│ │ │ │ │ CP-UVR-0801E1-CV2 │ │ │
│ │ ┌───────────────────────┐ │◄────────►│ │ LAN: 192.168.29.200 │ │ │
│ │ │ Edge Gateway Agent │ │ :554 │ │ RTSP: 554, HTTP: 80/443 │ │ │
│ │ │ - Stream puller │ │ :80 │ │ TCP: 25001, UDP: 25002 │ │ │
│ │ │ - Buffer/forward │ │ │ │ 8 Channels × 960×1080 │ │ │
│ │ │ - Local recording │ │ │ └─────────────────────────────────┘ │ │
│ │ │ - VPN client │ │ │ │ │
│ │ └───────────────────────┘ │ │ ┌─────────────────────────────────┐ │ │
│ │ │ │ │ Local Monitor (optional) │ │ │
│ │ Local Storage: 2TB NVMe │ │ │ 192.168.29.10 │ │ │
│ │ (7-day circular buffer) │ │ └─────────────────────────────────┘ │ │
│ └─────────────────────────────────┘ │ │ │
│ │ CAMERAS (BNC/IP) ──┐ │ │
│ │ │ │ │
│ │ CH1 ──► CH2 ──► CH3 ──► CH4 │ │
│ │ CH5 ──► CH6 ──► CH7 ──► CH8 │ │
│ │ │ │
│ └────────────────────────────────────────┘ │
│ │
│ Network: Edge Gateway has TWO interfaces: │
│ - eth0: 192.168.29.5/24 ←→ Local network (DVR access) │
│ - eth1: DHCP / Static ←→ Internet (VPN tunnel to cloud) │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
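The tunnel addressing above (cloud peer 10.200.0.1/32 ↔ edge peer 10.200.0.2/32 over UDP 51820) maps onto a WireGuard configuration roughly like the following sketch for the edge gateway. Keys and the cloud endpoint hostname are placeholders, and the `AllowedIPs` choice (tunnel peer plus the 10.100.0.0/16 VPC) is an assumption consistent with the VPC CIDR in §2.3:

```ini
# /etc/wireguard/wg0.conf on the edge gateway (illustrative sketch)
[Interface]
Address    = 10.200.0.2/32
PrivateKey = <edge-private-key>
# No ListenPort: the edge initiates the tunnel, so NAT traversal is
# handled by the keepalive below rather than an open inbound port.

[Peer]
PublicKey           = <cloud-public-key>
Endpoint            = <cloud-vpn-endpoint>:51820
# Route only the cloud peer and the AWS VPC through the tunnel
AllowedIPs          = 10.200.0.1/32, 10.100.0.0/16
PersistentKeepalive = 25
```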
2.2 Physical Edge Gateway Specification
| Component | Specification |
| --- | --- |
| Hardware | Intel NUC 13 Pro, Core i7-1360P, 32GB DDR4, 2TB NVMe SSD |
| Alternative | NVIDIA Jetson Orin NX 16GB (for on-edge AI inference) |
| OS | Ubuntu 22.04 LTS Server, minimal install |
| Container Runtime | containerd (via K3s) |
| K8s Distribution | K3s v1.28+ (lightweight, single-node or 2-node HA) |
| Power | UPS-backed, auto-restart on power loss (BIOS setting) |
| Network | Dual Ethernet: one for local DVR segment, one for internet/VPN |
| Local Storage | 2TB NVMe for 7-day circular buffer of all 8 streams |
2.3 Cloud Infrastructure Specification
| Component | Specification |
| --- | --- |
| Region | Primary: ap-south-1 (Mumbai), DR: ap-southeast-1 (Singapore) |
| VPC | 10.100.0.0/16, 3 AZs, private subnets only for workloads |
| EKS | Managed node groups: on-demand for API, spot for batch processing |
| GPU Nodes | g4dn.xlarge (NVIDIA T4) for Triton inference, 1-4 nodes auto-scaled |
| ALB | Internet-facing, WAF v2 attached, Shield Advanced optional |
| RDS | PostgreSQL 16, db.r6g.xlarge, Multi-AZ, encrypted at rest |
| ElastiCache | Redis 7, cluster mode enabled, 2 shards × 2 replicas |
| MSK (Kafka) | 3 broker nodes, kafka.m5.large, 3 AZs |
| S3 | Standard (hot), IA (30 days), Glacier Deep Archive (1 year) |
3. Network Security Zones
3.1 Security Zone Diagram
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ SECURITY ZONES │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 0: INTERNET (UNTRUSTED) │ │
│ │ │ │
│ │ Users/Browsers ──► AWS ALB (:443) ──► WAF ──► Rate Limit ──► Geo-Block │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 1: AWS VPC EDGE (DEMILITARIZED) │ │
│ │ │ │
│ │ ALB ──► Traefik Ingress ──► Public API endpoints only │ │
│ │ Auth: JWT + RBAC, API key for edge gateway │ │
│ │ │ │
│ │ AWS ALB Security Group: Allow 443 from 0.0.0.0/0 │ │
│ │ Traefik SG: Allow 8443 from ALB-SG only │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 2: AWS VPC APPLICATION (TRUSTED, ISOLATED) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Stream Ing. │ │ AI Inference│ │ Suspicious │ │ Training Service │ │ │
│ │ │ Service │ │ Service │ │ Activity │ │ │ │ │
│ │ │ │ │ │ │ Service │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ Pod Security Policies: No root, read-only FS, no privilege escalation │ │
│ │ Network Policies: Ingress only from API GW namespace, egress to data layer │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 3: AWS VPC DATA (HIGHLY RESTRICTED) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ PostgreSQL │ │ Redis │ │ Kafka │ │ MinIO │ │ │
│ │ │ (RDS) │ │ (ElastiC.) │ │ (MSK) │ │ (S3 API) │ │ │
│ │ │ :5432 │ │ :6379 │ │ :9092 │ │ :9000 │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ Security Groups: Allow connections ONLY from Application Zone SGs │ │
│ │ RDS: Encrypted (AWS KMS), no public access, IAM auth enabled │ │
│ │ S3: Bucket policy deny all except VPC endpoint, versioning enabled │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ WireGuard VPN Tunnel │ (UDP 51820, ChaCha20-Poly1305) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 4: EDGE NETWORK (PHYSICALLY ISOLATED) │ │
│ │ │ │
│ │ ┌──────────────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ EDGE GATEWAY AGENT │ │ DVR (192.168.29.200) │ │ │
│ │ │ - K3s node │◄────────►│ NO INTERNET ACCESS │ │ │
│ │ │ - WireGuard peer │ :554 │ Firewall: DROP all non-local│ │ │
│ │ │ - Stream ingestion │ :80 │ │ │ │
│ │ │ - Local buffer │ │ Only 192.168.29.0/24 allowed│ │ │
│ │ └──────────────────────────┘ └─────────────────────────────────┘ │ │
│ │ │ │
│ │ Edge Gateway Firewall (ufw): │ │
│ │ - ALLOW 192.168.29.0/24 → DVR ports (554, 80) │ │
│ │ - ALLOW OUT 51820/udp → Cloud VPN endpoint │ │
│ │ - DENY ALL other incoming │ │
│ │ - No forwarding to local network from VPN (except explicit rules) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
3.2 Firewall Rules
Edge Gateway (UFW)
# Default deny inbound, allow outbound
ufw default deny incoming
ufw default allow outgoing
# Outbound from the gateway to the DVR (the gateway initiates these
# connections; rules are explicit in case outbound is later restricted)
ufw allow out on eth0 to 192.168.29.200 port 554 proto tcp   # RTSP pull
ufw allow out on eth0 to 192.168.29.200 port 80 proto tcp    # HTTP (ONVIF)
# WireGuard VPN to cloud
ufw allow out on eth1 to <cloud-vpn-ip> port 51820 proto udp
# Local admin access (optional, from specific admin IP)
ufw allow from 192.168.29.10 to any port 22 proto tcp        # SSH from admin workstation
AWS Security Groups
| Security Group | Ingress Rules | Egress Rules |
| --- | --- | --- |
| alb-public-sg | TCP 443 from 0.0.0.0/0 | All to VPC |
| traefik-ingress-sg | TCP 8443 from alb-public-sg only | All to VPC |
| app-services-sg | TCP 8080-8090 from traefik-ingress-sg | All to data-layer-sg |
| data-layer-sg | TCP 5432, 6379, 9092, 9000 from app-services-sg only | None |
| vpn-endpoint-sg | UDP 51820 from edge-gateway-ip/32 | All to VPC |
4. Service Architecture
4.1 Service Interaction Diagram
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│ SERVICE ARCHITECTURE │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ API GATEWAY LAYER │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Traefik Ingress Controller (K8s) │ │ │
│ │ │ - Route: /api/* → Backend Service │ │ │
│ │ │ - Route: /ws/* → WebSocket Handler (live video) │ │ │
│ │ │ - Route: / → Next.js Web App │ │ │
│ │ │ - TLS: Let's Encrypt automatic certificates │ │ │
│ │ │ - Middleware: rate limit (100 req/min per IP), JWT validation, CORS │ │ │
│ │ └─────────────────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┼──────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ BACKEND SERVICE │ │ WEB FRONTEND │ │ VIDEO PLAYBACK │ │
│ │ (Go/Gin) │ │ (Next.js 14) │ │ SERVICE (Go) │ │
│ │ :8080 │ │ :3000 │ │ :8085 (HLS) │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ REST API │ │ │ │ React SSR │ │ │ │ HLS Segment │ │ │
│ │ │ - /cameras │ │ │ │ - Dashboard │ │ │ │ Server │ │ │
│ │ │ - /events │ │ │ │ - Live View │ │ │ │ - /live/:id │ │ │
│ │ │ - /alerts │ │ │ │ - Timeline │ │ │ │ - /vod/:id │ │ │
│ │ │ - /search │ │ │ │ - Analytics │ │ │ │ (DASH/HLS) │ │ │
│ │ │ - /training │ │ │ │ - Admin │ │ │ └──────────────┘ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │ │ │
│ │ ┌──────────────┐ │ └──────────────────────┘ └──────────────────────┘ │
│ │ │ gRPC Client │ │ │
│ │ │ (to AI svc) │ │ │
│ │ └──────────────┘ │ │
│ └─────────────────────┘ │
│ │ │
│ │ gRPC (:50051) │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ EVENT & MESSAGE BUS │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Apache Kafka (MSK) │ │ │
│ │ │ │ │ │
│ │ │ Topics: │ │ │
│ │ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────────┐│ │ │
│ │ │ │ streams.raw │ │ ai.detections │ │ alerts.critical ││ │ │
│ │ │ │ (protobuf) │ │ (JSON) │ │ (JSON) ││ │ │
│ │ │ │ - 8 partitions │ │ - 16 partitions │ │ - 4 partitions ││ │ │
│ │ │ │ - 7-day reten. │ │ - 30-day reten. │ │ - 90-day reten. ││ │ │
│ │ │ └─────────────────┘ └─────────────────┘ └─────────────────────────────┘│ │ │
│ │ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────────┐│ │ │
│ │ │ │ training.data │ │ system.metrics │ │ notifications.email ││ │ │
│ │ │ │ (protobuf) │ │ (JSON) │ │ notifications.sms ││ │ │
│ │ │ │ - 30-day reten. │ │ - 7-day reten. │ │ notifications.push ││ │ │
│ │ │ └─────────────────┘ └─────────────────┘ └─────────────────────────────┘│ │ │
│ │ └─────────────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Redis Cluster (ElastiCache) │ │ │
│ │ │ │ │ │
│ │ │ Streams: ┌─────────────────┐ Pub/Sub: ┌──────────────────────┐ │ │ │
│ │ │ │ live:cam:{id} │ │ alert:broadcast │ │ │ │
│ │ │ │ (video chunks) │ │ ws:session:* │ │ │ │
│ │ │ │ cache:api:* │ │ stream:status │ │ │ │
│ │ │ └─────────────────┘ └──────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ STREAM ING. │ │ AI INFERENCE │ │ SUSPICIOUS ACTIVITY │ │
│ │ SERVICE │ │ SERVICE │ │ SERVICE │ │
│ │ (Go/FFmpeg) │ │ (Python/gRPC)│ │ (Go/Python) │ │
│ │ :8081 │ │ :8001 (Triton)│ │ :8083 │ │
│ │ │ │ │ │ │ │
│ │┌────────────┐│ │┌────────────┐│ │┌────────────────────┐│ │
│ ││RTSP Client ││ ││Triton Svr ││ ││Night Mode Analyzer ││ │
│ ││(ffmpeg) ││ ││├─YOLOv8-det││ ││├─Motion detection ││ │
│ ││8 concurrent││ ││├─YOLOv8-face││ ││├─Loitering detect. ││ │
│ ││streams ││ ││├─ArcFace ││ ││├─Perimeter breach ││ │
│ │├────────────┤│ ││└────────────┘│ ││├─Abandoned object ││ │
│ ││Frame Extrac││ ││Model Mgmt. ││ ││├─Crowd detection ││ │
│ ││1 fps anal. ││ ││└────────────┘│ ││└────────────────────┘│ │
│ │├────────────┤│ │├─────────────┤│ │├────────────────────┤│ │
│ ││Kafka Produc.││ ││gRPC API ││ ││Kafka Consumer ││ │
│ ││(raw frames) ││ ││- detect() ││ ││(ai.detections) ││ │
│ │└─────────────┘│ ││- embed() ││ │├────────────────────┤│ │
│ │┌────────────┐ │ ││- compare() ││ ││Rule Engine ││ │
│ ││MinIO Client│ │ │└─────────────┘│ ││├─Time-based rules ││ │
│ ││(video seg.)│ │ └──────────────┘ ││├─Zone-based rules ││ │
│ │└────────────┘ │ ││├─Severity scoring ││ │
│ └───────────────┘ │└────────────────────┘│ │
│ ▲ └──────────────────────┘ │
│ │ WireGuard VPN │
│ │ │
│ ┌──────────────┐ │
│ │ EDGE GATEWAY │ │
│ │ SERVICE │ │
│ │ (Local) │ │
│ └──────────────┘ │
│ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ PostgreSQL 16│ │ pgvector │ │ MinIO │ │ S3 (Cold Archive) │ │ │
│ │ │ (RDS) │ │ Extension │ │ (on-prem + │ │ │ │ │
│ │ │ │ │ │ │ cloud) │ │ │ │ │
│ │ │ cameras │ │ face_embed. │ │ video-seg/ │ │ yearly archive │ │ │
│ │ │ events │ │ table │ │ training-img │ │ compliance storage │ │ │
│ │ │ alerts │ │ (vector) │ │ │ │ │ │ │
│ │ │ audit_log │ │ │ │ lifecycle: │ │ lifecycle: │ │ │
│ │ │ users │ │ HNSW index │ │ 7d local → │ │ 90d → Glacier │ │ │
│ │ │ zones │ │ cos similarity│ │ 30d cloud → │ │ Deep Archive │ │ │
│ │ └──────────────┘ └──────────────┘ │ 1yr archive │ │ │ │ │
│ │ └──────────────┘ └──────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────────────────────┘
4.2 Service Specifications
4.2.1 Edge Gateway Service (Local)
| Attribute | Specification |
| --- | --- |
| Runtime | Go 1.21, compiled binary |
| Deployment | Systemd service on Ubuntu + K3s for containerized components |
| Location | Intel NUC, physically on-site |
| Responsibilities | RTSP stream pull, local recording buffer, VPN tunnel endpoint, heartbeat to cloud |
| Ports | 8080 (HTTP admin), 51820 (WireGuard), 1935 (RTMP relay if needed) |
| Stream Protocol | RTSP over TCP (interleaved) from DVR at 192.168.29.200:554 |
| Local Storage | 2TB NVMe, 7-day circular buffer, ~1.5GB/hour per channel = ~288GB/day for 8ch |
| Reconnect Policy | Exponential backoff: 1s → 2s → 4s → 8s → max 60s, reset on success |
| Heartbeat | Every 30s to cloud Stream Ingestion Service via VPN |
| Failover | Auto-restart via systemd (Restart=always, RestartSec=5) |
4.2.2 Stream Ingestion Service (Cloud)
| Attribute | Specification |
| --- | --- |
| Runtime | Go 1.21 |
| Deployment | Kubernetes Deployment, 3 replicas minimum |
| Responsibilities | Receive frames from edge, decode, produce to Kafka, store segments to MinIO |
| Protocol | gRPC bidirectional streaming from Edge Gateway |
| Frame Rate | 1 fps for AI analysis (decimated from 25fps source) |
| Full Rate | 25 fps for event clips (triggered recordings) |
| Kafka Topic | streams.raw.{camera_id} — protobuf-encoded frame batches |
| Video Segments | 10-second H.264 segments → MinIO bucket video-segments |
| Resource Request | 1 CPU, 2GB RAM per replica |
| HPA | 3-20 replicas based on CPU > 70% |
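The HPA row above corresponds to a manifest along these lines; the object and Deployment names are assumptions for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stream-ingestion          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stream-ingestion
  minReplicas: 3                  # floor from the table above
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out above 70% CPU
```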
4.2.3 AI Inference Service
| Attribute | Specification |
| --- | --- |
| Runtime | NVIDIA Triton Inference Server 2.40+ (Docker) |
| Deployment | Kubernetes Deployment on GPU nodes (g4dn.xlarge) |
| GPU | NVIDIA T4 16GB, 1 GPU per replica |
| Models | YOLOv8x (detection), YOLOv8x-face (face detection), ArcFace (face recognition/embedding) |
| Model Format | TensorRT engines (.plan) for GPU optimization |
| gRPC API | :8001 — Triton native gRPC |
| HTTP API | :8000 — Triton native HTTP |
| Metrics | :8002 — Prometheus metrics endpoint |
| Dynamic Batching | Max batch size: 8 for detection, 16 for face embedding |
| Input | JPEG frames (960×1080) from Kafka topic streams.raw.* |
| Output | Detections (bbox, class, confidence) → Kafka ai.detections |
| Face Embeddings | 512-dim float32 vectors → pgvector (PostgreSQL) |
| Resource | 1× T4 GPU, 4 CPU, 16GB RAM per replica |
| HPA | 1-4 replicas based on GPU utilization > 80% and Kafka consumer lag |
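Comparing two 512-dim ArcFace embeddings reduces to cosine similarity, the same metric pgvector uses in §4.2.12. A minimal sketch of that computation (`CosineSim` is an illustrative helper, not the service's actual compare() implementation):

```go
package main

import (
	"fmt"
	"math"
)

// CosineSim computes cosine similarity between two equal-length
// embedding vectors, e.g. 512-dim ArcFace outputs. Result is in
// [-1, 1]; for face matching, higher means more similar.
func CosineSim(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0 // degenerate zero vector
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	a := []float32{1, 0, 0}
	b := []float32{0, 1, 0}
	fmt.Println(CosineSim(a, a)) // 1
	fmt.Println(CosineSim(a, b)) // 0
}
```

A match threshold (commonly somewhere around 0.5-0.6 for ArcFace, tuned per deployment) would then decide whether two detections are the same person.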
4.2.4 Suspicious Activity Service (Night Mode)
| Attribute | Specification |
| --- | --- |
| Runtime | Python 3.11 (OpenCV, scikit-learn) + Go orchestrator |
| Deployment | Kubernetes Deployment, 2-8 replicas |
| Input | Kafka topic ai.detections + streams.raw.* for motion analysis |
| Responsibilities | Night-mode analysis, loitering detection, perimeter breach, abandoned object, crowd detection |
| Rules Engine | YAML-configured rules per camera, per time schedule |
| Night Schedule | Configurable (default: 22:00 - 06:00), overrides day-mode sensitivity |
| Output | Scored alerts → Kafka alerts.critical + PostgreSQL alerts table |
| ML Models | Background subtraction (MOG2), optical flow for motion tracking, Kalman filters for object tracking |
| Resource Request | 2 CPU, 4GB RAM per replica |
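A per-camera YAML rule file might look like the sketch below. The field names and zone identifiers are illustrative — the rules schema is not specified in this document — but the rule types and the 22:00-06:00 night window match the table above:

```yaml
camera_id: cam-03            # hypothetical camera identifier
schedule:
  night: { start: "22:00", end: "06:00" }
rules:
  - type: loitering
    zone: gate-area
    dwell_seconds: 120       # alert if a person lingers > 2 min
    severity: high
    active: night            # only evaluated during the night window
  - type: perimeter_breach
    zone: fence-line
    alert_on: enter
    severity: critical
    active: always
```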
4.2.5 Training Service
| Attribute | Specification |
| --- | --- |
| Runtime | Python 3.11, PyTorch 2.1, NVIDIA CUDA 12.1 |
| Deployment | Kubernetes Job/CronJob, runs on GPU spot instances |
| Responsibilities | Model retraining, fine-tuning on collected data, A/B model validation |
| Trigger | Weekly scheduled (Sunday 02:00) or manual (API call) |
| Data Source | MinIO bucket training-data (curated positive/negative samples) |
| Output | New TensorRT engines → MinIO bucket model-artifacts |
| A/B Rollout | Blue/green model deployment via Triton model repository |
| Validation | mAP > 0.85 required before promotion to production |
| Resource | 1× V100 GPU (spot), 8 CPU, 32GB RAM |
4.2.6 API Gateway / Backend Service
| Attribute | Specification |
| --- | --- |
| Runtime | Go 1.21, Gin framework |
| Deployment | Kubernetes Deployment, 3-10 replicas |
| Protocol | HTTP/2, REST API + WebSocket for live updates |
| Authentication | JWT (RS256), access token 15min, refresh token 7 days |
| Authorization | RBAC: admin, operator, viewer roles |
| Rate Limiting | 100 req/min per IP, 1000 req/min per API key |
| Endpoints | See API Specification below |
| Caching | Redis for session store and API response caching (TTL 60s) |
| Resource Request | 0.5 CPU, 1GB RAM per replica |
API Endpoints:
| Endpoint | Method | Description | Auth |
| --- | --- | --- | --- |
| /api/v1/auth/login | POST | User authentication | Public |
| /api/v1/auth/refresh | POST | Token refresh | Public |
| /api/v1/cameras | GET | List all cameras | Viewer+ |
| /api/v1/cameras/{id} | GET | Camera details | Viewer+ |
| /api/v1/cameras/{id}/live | GET | Live stream URL (HLS) | Viewer+ |
| /api/v1/events | GET | Query events (paginated, filtered) | Viewer+ |
| /api/v1/events/{id} | GET | Event details with snapshot | Viewer+ |
| /api/v1/alerts | GET | List alerts | Viewer+ |
| /api/v1/alerts/{id}/ack | POST | Acknowledge alert | Operator+ |
| /api/v1/search/faces | POST | Face search by image | Operator+ |
| /api/v1/search/faces/{embedding} | GET | Similar face lookup | Operator+ |
| /api/v1/training/upload | POST | Upload training samples | Admin |
| /api/v1/training/jobs | GET | List training jobs | Admin |
| /api/v1/zones | CRUD | Perimeter zones per camera | Admin |
| /api/v1/reports/daily | GET | Daily activity report | Viewer+ |
| /api/v1/system/health | GET | System health status | Internal |
4.2.7 Web Frontend
| Attribute | Specification |
| --- | --- |
| Framework | Next.js 14 (App Router), React 18, TypeScript |
| Styling | Tailwind CSS + shadcn/ui components |
| State Management | Zustand (client), React Query (server) |
| Video Player | HLS.js for live stream playback, Video.js for VOD |
| Maps | MapLibre GL JS (open source, no API key required) for camera geolocation |
| Real-time | WebSocket connection for alert notifications |
| Build Output | Static export → served via CDN (CloudFront) |
| PWA | Service worker for offline dashboard viewing |
4.2.8 Notification Service
| Attribute | Specification |
| --- | --- |
| Runtime | Go 1.21 |
| Deployment | Kubernetes Deployment, 2-5 replicas |
| Input | Kafka topic alerts.critical |
| Channels | Email (SMTP/AWS SES), SMS (Twilio/AWS SNS), Push (Firebase FCM), Webhook |
| Templates | HTML email templates with event snapshot attachment |
| Rate Limiting | Max 1 SMS per phone per 5 minutes; max 10 emails per address per hour |
| Retry Policy | 3 retries with exponential backoff for each channel; dead-letter after failure |
| Escalation | Unacknowledged critical alerts escalate after 15 minutes (to admin) |
4.2.9 Database — PostgreSQL 16 (RDS)
| Attribute | Specification |
| --- | --- |
| Instance | db.r6g.xlarge (4 vCPU, 32GB RAM) |
| Storage | 500GB gp3, auto-scaling to 2TB |
| Multi-AZ | Enabled for production |
| Extensions | pgvector (face embeddings), PostGIS (zone geometry), pg_stat_statements |
| Backup | Daily automated, 35-day retention |
| Read Replica | 1 read replica for analytics queries |
Schema Overview:
-- Core tables
cameras (id, name, dvr_channel, rtsp_url, location, status, created_at)
events (id, camera_id, event_type, confidence, bounding_box, snapshot_path,
start_time, end_time, severity, metadata JSONB, created_at)
alerts (id, event_id, rule_id, severity, status [new|ack|resolved],
acknowledged_by, acked_at, notification_channels, created_at)
face_embeddings (id, person_name, embedding vector(512), camera_id,
first_seen, last_seen, occurrence_count, metadata JSONB)
users (id, username, password_hash, role, email, phone, created_at)
alert_rules (id, camera_id, rule_type, config JSONB, schedule JSONB,
severity, enabled, created_at)
audit_log (id, user_id, action, resource, details JSONB, ip_address, created_at)
perimeter_zones (id, camera_id, name, polygon GEOMETRY(POLYGON),
alert_on_enter, alert_on_exit, schedule, created_at)
4.2.10 Object Storage — MinIO + S3
| Attribute | Specification |
| --- | --- |
| Local (Edge) | MinIO single-node, 2TB NVMe, 7-day retention |
| Cloud (Primary) | MinIO distributed cluster on EKS, 10TB initial, auto-scaling |
| Archive | AWS S3 with lifecycle: Standard → IA (30d) → Glacier Deep Archive (365d) |
| API | S3-compatible, same SDK for all tiers |
| Buckets | video-segments (10s segments), event-clips (triggered recordings), training-data (curated samples), snapshots (JPEG event frames), model-artifacts (TensorRT engines) |
4.2.11 Redis Cluster
| Attribute | Specification |
| --- | --- |
| Type | ElastiCache for Redis, cluster mode enabled |
| Node Type | cache.r6g.large per shard |
| Shards | 2 shards, 2 replicas per shard |
| Max Memory Policy | allkeys-lru (evict least recently used) |
| Persistence | AOF everysec, RDB every 60min |
| Use Cases | Session store, API cache, real-time pub/sub, stream position tracking |
4.2.12 Vector Store (pgvector)
| Attribute | Specification |
| --- | --- |
| Integration | PostgreSQL extension (same RDS instance) |
| Table | face_embeddings with embedding vector(512) column |
| Index | HNSW (hierarchical navigable small world) for approximate nearest neighbor |
| Index Parameters | m = 16, ef_construction = 64 |
| Similarity Metric | Cosine similarity (<=> operator) |
| Query | SELECT * FROM face_embeddings ORDER BY embedding <=> $1 LIMIT 10 |
| Expected Volume | ~1M vectors per year (8 cameras) |
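The index parameters and similarity query translate to SQL along these lines (columns per the §4.2.9 schema; the index name is an assumption):

```sql
-- HNSW index with the parameters from the table, using cosine distance
CREATE INDEX idx_face_embeddings_hnsw
    ON face_embeddings
 USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Ten nearest faces to a probe embedding $1, with a similarity score:
-- <=> returns cosine distance, so similarity = 1 - distance
SELECT id, person_name,
       1 - (embedding <=> $1) AS cosine_similarity
  FROM face_embeddings
 ORDER BY embedding <=> $1
 LIMIT 10;
```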
5. Data Flow Design
5.1 Complete Data Flow Diagram
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW │
├─────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: CAPTURE & INGESTION │
│ ══════════════════════════ │
│ │
│ CAMERAS (8ch) ──► DVR (192.168.29.200) ──► RTSP (:554) ──► EDGE GATEWAY (192.168.29.5) │
│ │
│ Camera → BNC coax → DVR encoder → H.264 stream → RTSP server (DVR builtin) │
│ │
│ Edge Gateway pulls 8 concurrent RTSP streams: │
│ rtsp://192.168.29.200:554/user=admin&password=&channel=1&stream=0.sdp? │
│ rtsp://192.168.29.200:554/user=admin&password=&channel=2&stream=0.sdp? │
│ ... (channels 1-8) │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ EDGE GATEWAY PROCESSING per stream: │ │
│ │ 1. FFmpeg demux → raw H.264 Annex-B frames │ │
│ │ 2. Segment into 10s chunks → local MinIO (circular buffer) │ │
│ │ 3. Extract 1 fps JPEG frames (960×1080 → 640×640 resize for AI) │ │
│ │ 4. Protobuf-encode frame batches │ │
│ │ 5. Send via gRPC bidirectional stream over WireGuard VPN │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ PATH A: LIVE VIDEO │ │ PATH B: AI ANALYSIS │ │ PATH C: RECORDING │ │
│ │ (WebRTC/HLS path) │ │ (detection pipeline)│ │ (event archival) │ │
│ └─────────────────────┘ └─────────────────────┘ └─────────────────────┘ │
│ │
│ LAYER 2: STREAM PROCESSING (Cloud) │
│ ══════════════════════════════════ │
│ │
│ PATH A: LIVE VIDEO ──────────────────────────────────────────────────────── │
│ │
│ Edge Gateway ──► Stream Ing. Svc ──► Redis Stream (live:cam:{id}) ──► HLS Segment Svc │
│ RTSP (decode) (pub/sub buffer) (m3u8 + .ts) │
│ │ │
│ ▼ │
│ CloudFront CDN │
│ │ │
│ ▼ │
│ Web Browser (HLS.js) │
│ │
│ PATH B: AI ANALYSIS ─────────────────────────────────────────────────────── │
│ │
│ Stream Ing. Svc ──► Kafka (streams.raw.{cam}) ──► AI Inference Svc (Triton) │
│ (frame batches) (ordered, partitioned) (YOLOv8 + ArcFace) │
│ │ │
│ ┌─────────────────────────┼─────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Detections │ │ Face Emb. │ │ Stream to │ │
│ │ (person, │ │ (512-dim) │ │ Suspicious │ │
│ │ vehicle) │ │ │ │ Activity │ │
│ └──────┬───────┘ └──────┬───────┘ │ Service │ │
│ │ │ └──────┬───────┘ │
│ ▼ ▼ │ │
│ Kafka (ai. PostgreSQL Kafka │ │
│ detections) (pgvector) alerts │ │
│ .critical │
│ │
│ PATH C: RECORDING ───────────────────────────────────────────────────────── │
│ │
│ Edge Gateway ──► Local MinIO (7d) ──► Sync ──► Cloud MinIO ──► S3 Lifecycle → Glacier │
│ (10s segments) (hot buffer) (daily) (30d hot) (1yr archive) │
│ │
│ LAYER 3: EVENT PROCESSING │
│ ═════════════════════════ │
│ │
│ AI Inference ──► Kafka (ai.detections) ──► Suspicious Activity Svc │
│ Output - bbox, class, conf - Rule engine evaluation │
│ - timestamp - Loitering detection │
│ - camera_id - Perimeter breach check │
│ - embedding_id - Crowd counting │
│ - Time-of-day scoring │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ PostgreSQL │ │ Kafka │ │ Notification Svc │ │
│ │ (alerts) │ │ (alerts. │ │ - Email (SES) │ │
│ │ (events) │ │ critical) │ │ - SMS (Twilio) │ │
│ └──────────────┘ └──────────────┘ │ - Push (FCM) │ │
│ │ - Webhook │ │
│ └──────────────────────┘ │
│ │
│ LAYER 4: CONSUMPTION │
│ ════════════════════ │
│ │
│ Web Frontend ──► API Gateway ──► Backend Service ──► PostgreSQL/Redis/MinIO │
│ (Next.js) (Traefik) (Go/Gin) (data queries) │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ DASHBOARD VIEWS: │ │ │
│ │ │ - Live View: HLS.js + WebSocket for alert overlay │ │ │
│ │ │ - Event Timeline: Infinite scroll, filters │ │ │
│ │ │ - Alert Management: Ack/Nack, assignment │ │ │
│ │ │ - Face Search: Upload photo → pgvector similarity search │ │ │
│ │ │ - Analytics: Time-series charts (event frequency, heatmaps) │ │ │
│ │ │ - Settings: Camera config, zone drawing, rule management │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────── WebSocket: /ws/alerts ───────────────────────┘ │
│ (real-time alert push) │
│ │
│ LAYER 5: TRAINING DATA FLOW │
│ ═══════════════════════════ │
│ │
│ Events (false positive) ──► Admin review ──► "Add to Training" ──► MinIO (training-data) │
│ Events (missed detect) ──► Manual upload ──► Labeling UI ──► Curated dataset │
│ │ │
│ ▼ │
│ Training Service (weekly CronJob) │
│ - Load dataset from MinIO │
│ - Fine-tune YOLOv8 weights │
│ - Convert to TensorRT engine │
│ - Validate mAP > 0.85 │
│ │ │
│ ▼ │
│ Model Registry (MinIO) │
│ - Blue/green deployment │
│ - Triton model repository │
│ │ │
│ ▼ │
│ AI Inference Svc (rolling update) │
│ │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
5.2 Stream Flow Detail
┌─────────────────────────────────────────────────────────────────────────────┐
│ VIDEO STREAM FLOW (Per Camera) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Camera ──► DVR Encoder ──► RTSP Stream ──► Edge Gateway │
│ (H.264, 25fps, │
│ 960x1080) │
│ │
│ Edge Gateway Processing: │
│ ┌─────────────────────────────────────┐ │
│ │ 1. FFmpeg process per channel │ │
│ │ -input rtsp://dvr/ch{N} │ │
│ │ -c:v copy -f segment │ │
│ │ -segment_time 10 │ │
│ │ /recordings/ch{N}/%d.ts │ │
│ │ │ │
│ │ 2. Parallel: 1 fps extraction │ │
│ │ -vf fps=1,scale=640:640 │ │
│ │ -f image2pipe -vcodec mjpeg │ │
│ │ → AI pipeline │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Local Buffer │ │ AI Frames │ │ Cloud Upload │ │
│ │ (7-day ring) │ │ (1 fps JPEG) │ │ (10s chunks) │ │
│ └──────────────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ ┌────────────────────┘ │ │
│ │ WireGuard VPN │ │
│ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ Cloud Stream │ │ Cloud MinIO │ │
│ │ Ingestion Svc │ │ (30-day hot) │ │
│ └───────┬────────┘ └────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ Kafka (streams.raw) │ │
│ │ Partition = camera_id │ │
│ │ Guarantees ordering │ │
│ │ per camera │ │
│ └───────────┬────────────┘ │
│ │ │
│ ┌────────────┼────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────────┐ │
│ │ AI │ │ HLS │ │ Recording│ │
│ │ Inf. │ │ Seg. │ │ Archival │ │
│ │ Svc │ │ Svc │ │ Svc │ │
│ └──────┘ └──────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5.3 Event/Alert Flow Detail
┌─────────────────────────────────────────────────────────────────────────────┐
│ EVENT & ALERT FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ AI Inference Output: │
│ { │
│ "camera_id": "cam_001", │
│ "timestamp": "2025-01-20T14:30:00Z", │
│ "detections": [ │
│ { "class": "person", "confidence": 0.94, │
│ "bbox": [120, 340, 280, 560], "track_id": 42 }, │
│ { "class": "face", "confidence": 0.89, │
│ "bbox": [150, 360, 200, 420], "embedding_id": "emb_12345" } │
│ ] │
│ } │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Kafka Topic: ai.detections │ │
│ │ (JSON, 16 partitions) │ │
│ └─────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ Face │ │ Suspicious │ │
│ │ Matching │ │ Activity Svc │ │
│ │ (pgvector│ │ │ │
│ │ search) │ │ Rule Eval: │ │
│ │ │ │ - Night mode?│ │
│ └────┬─────┘ │ - Zone │ │
│ │ │ overlap? │ │
│ │ │ - Loitering │ │
│ │ │ > 5 min? │ │
│ │ │ - Crowd │ │
│ │ │ > 5 ppl? │ │
│ │ └──────┬───────┘ │
│ │ │ │
│ │ MATCH FOUND │ ALERT TRIGGERED │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ PostgreSQL │ │
│ │ - events table (all detections) │ │
│ │ - alerts table (triggered alerts) │ │
│ │ - face_embeddings (if new/matched) │ │
│ └───────────────────┬──────────────────────┘ │
│ │ │
│ ┌───────────┼───────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ WebSocket│ │ Kafka │ │ Notification │ │
│ │ Push │ │ alerts. │ │ Service │ │
│ │ (live │ │ critical │ │ │ │
│ │ update) │ │ │ │ - Email │ │
│ └──────────┘ └──────────┘ │ - SMS │ │
│ │ - Push │ │
│ │ - Webhook │ │
│ └──────────────┘ │
│ │
│ Alert Lifecycle: │
│ DETECTED → NEW (insert) → WebSocket push → NOTIFY → ACK/RESOLVE │
│ ↓ │
│ If unacked 15min → ESCALATE to admin │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5.4 Live Video to Browser Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ LIVE VIDEO TO BROWSER FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ BROWSER │ │ CloudFront │ │ EKS HLS │ │
│ │ │ │ CDN │ │ Service │ │
│ │ ┌────────┐ │ │ │ │ │ │
│ │ │HLS.js │ │ │ │ │ ┌────────┐ │ │
│ │ │Player │◄─┼──────┼─── m3u8 ─────┼──────┼──│ Playlist│ │ │
│ │ │ │ │ │ + .ts │ │ │ Builder │ │ │
│ │ └────────┘ │ │ segments │ │ └────┬───┘ │ │
│ │ │ │ │ │ │ │ │
│ │ WebSocket ──┼──────┼──────────────┼──────┼───────┘ │ │
│ │ /ws/alerts │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Redis Stream │ │
│ │ live:cam:{id} │ │
│ │ │ │
│ │ ┌──────────┐ │ │
│ │ │Segment 1 │ │ │
│ │ │Segment 2 │ │ │
│ │ │Segment 3 │──┼──► FIFO │
│ │ └──────────┘ │ (keep 30) │
│ └───────┬────────┘ │
│ │ │
│ ┌─────────────────────┘ │
│ │ WireGuard VPN │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Edge Gateway │ │
│ │ FFmpeg → HLS │ │
│ │ segmenter │ │
│ └──────────────────┘ │
│ │
│ Latency Budget: │
│ - DVR encoding: ~100ms │
│ - RTSP to Edge: ~50ms │
│ - VPN tunnel: ~30-80ms (depending on internet) │
│ - Cloud HLS svc: ~50ms │
│ - CDN delivery: ~20-100ms │
│ - Player buffer: 3-6 segments (~30-60s behind real-time) │
│ TOTAL LIVE LATENCY: ~35-65 seconds (HLS inherent) │
│ │
│ For lower latency: WebRTC mode (optional future): │
│ - Target: < 2 seconds using WHIP/WHEP │
│ - Requires direct edge-to-browser or TURN relay │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5.5 Training Data Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRAINING DATA FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SOURCE 1: Automatic (False Positive Detection) │
│ ────────────────────────────────────────────── │
│ │
│ AI Inference → Confidence 0.3-0.6 range → Flag as "uncertain" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ MinIO bucket: training-data/auto/ │ │
│ │ - Original frame (JPEG) │ │
│ │ - Inference result (JSON) │ │
│ │ - Flagged for review │ │
│ └─────────────────────────────────────────┘ │
│ │
│ SOURCE 2: Manual (Operator Upload) │
│ ────────────────────────────────── │
│ │
│ Operator → Dashboard "Upload Training Image" → Label with bounding boxes │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ MinIO bucket: training-data/manual/ │ │
│ │ - Uploaded image with labels (COCO fmt)│ │
│ └─────────────────────────────────────────┘ │
│ │
│ SOURCE 3: Missed Detection (Post-Incident) │
│ ──────────────────────────────────────────── │
│ │
│ Security review → "AI should have caught this" → Extract from recording │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ MinIO bucket: training-data/incident/ │ │
│ │ - Video clip with manual annotation │ │
│ └─────────────────────────────────────────┘ │
│ │
│ AGGREGATION: │
│ ════════════ │
│ │
│ All sources → Weekly CronJob (Sunday 02:00 UTC) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Training Service Pipeline: │ │
│ │ │ │
│ │ 1. Download all new training data from MinIO │ │
│ │ 2. Deduplicate (perceptual hashing) │ │
│ │ 3. Augment: rotation, brightness, noise (albumentations) │ │
│ │ 4. Validate: train/val/test split (80/10/10) │ │
│ │ 5. Fine-tune YOLOv8x: │ │
│ │ - Base: COCO-pretrained weights │ │
│ │ - Epochs: 100, early stopping patience 10 │ │
│ │ - LR: 0.001 with cosine decay │ │
│ │ - Batch: 8 per GPU │ │
│ │ 6. Validate mAP@0.5 > 0.85 │ │
│ │ 7. Convert to TensorRT engine (FP16, max batch 8) │ │
│ │ 8. Upload to MinIO: model-artifacts/{version}/ │ │
│ │ 9. A/B test: shadow mode for 24 hours │ │
│ │ 10. Promote to production if FP rate < baseline │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ MODEL DEPLOYMENT: │
│ ═════════════════ │
│ │
│ MinIO model-artifacts/ → Triton Model Repository → SIGHUP reload │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Blue/Green Deployment: │ │
│ │ - Triton loads new model as "green" │ │
│ │ - 5% traffic routed for 1 hour │ │
│ │ - Monitor: latency P99, error rate │ │
│ │ - If OK: 100% traffic, "blue" retired │ │
│ │ - If FAIL: automatic rollback │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
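The promotion gate at the end of the pipeline (step 6's mAP threshold plus step 10's false-positive comparison) can be stated as one function; a minimal sketch with illustrative names:

```go
package main

import "fmt"

// PromoteModel applies the two promotion gates described above:
// validation mAP@0.5 must exceed the 0.85 threshold, and the
// shadow-mode false-positive rate must beat the production baseline.
func PromoteModel(mAP, fpRate, baselineFP float64) bool {
	const minMAP = 0.85
	return mAP > minMAP && fpRate < baselineFP
}

func main() {
	fmt.Println(PromoteModel(0.91, 0.020, 0.034)) // true: passes both gates
	fmt.Println(PromoteModel(0.80, 0.020, 0.034)) // false: mAP below threshold
	fmt.Println(PromoteModel(0.91, 0.040, 0.034)) // false: FP rate regressed
}
```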
6. Technology Stack
6.1 Technology Selection Matrix
| Category | Technology | Alternative | Selection Criteria |
| --- | --- | --- | --- |
| Cloud Platform | AWS | GCP, Azure | Best India region coverage (Mumbai), most mature managed Kafka (MSK), broad GPU instance types, VPC endpoints for private service communication |
| Container Orchestration | Amazon EKS | GKE, AKS, self-managed | Managed control plane, GPU device plugin support, Cluster Autoscaler, native IAM integration |
| Edge K8s | K3s | K0s, MicroK8s, Docker Compose | Single binary, lightweight, automatic HA with embedded etcd, built-in Helm, compatible with standard K8s manifests |
| VPN | WireGuard | OpenVPN, IPSec, Tailscale | Modern crypto (Curve25519, ChaCha20, Poly1305), kernel module since Linux 5.6, ~60% faster than OpenVPN, NAT traversal, simple config |
| Message Queue | Apache Kafka (MSK) | RabbitMQ, NATS, AWS SQS | Ordered event log, stream replay, high throughput, exactly-once processing with Flink, managed service reduces ops |
| Stream Processing | Apache Flink on EKS | Kafka Streams, Spark Streaming | Stateful processing, event time semantics, exactly-once, CEP (complex event processing) for multi-frame rules |
| Reverse Proxy | Traefik | NGINX, HAProxy, Envoy | Native Kubernetes Ingress, automatic Let's Encrypt, middleware chains, WebSocket support, Prometheus metrics |
| AI Inference | NVIDIA Triton + YOLOv8 | TorchServe, TensorFlow Serving, custom | Multi-framework support, TensorRT optimization, dynamic batching, model ensemble, Prometheus metrics |
| Database | PostgreSQL 16 (RDS) + pgvector | MySQL, MongoDB, separate vector DB | ACID compliance, mature managed service, pgvector handles 512-dim embeddings at scale, no separate DB to manage |
| Cache | Redis 7 Cluster | Memcached, KeyDB | Data structures (streams, sorted sets), pub/sub, persistence, cluster mode for horizontal scaling |
| Object Storage | MinIO + S3 | Ceph, GlusterFS, pure S3 | S3-compatible API everywhere, local buffering at edge, cloud tiering, cost optimization via lifecycle policies |
| Backend Language | Go 1.21 | Python, Java, Rust | Compiled performance for high-throughput streaming, excellent concurrency (goroutines), small container images |
| Frontend | Next.js 14 + React 18 | Vue, Angular, Svelte | SSR for SEO/performance, React ecosystem, API routes, image optimization, easy deployment to CDN |
| Monitoring | Prometheus + Grafana + Loki | Datadog, New Relic, CloudWatch | Open source, no per-host licensing, powerful alerting, log aggregation with Loki, custom dashboards |
| CI/CD | GitHub Actions + ArgoCD | GitLab CI, Jenkins, Flux | GitOps deployment, automated rollback, drift detection, progressive delivery |
6.2 WireGuard VPN Configuration
# Cloud Server (AWS EC2 bastion / VPN endpoint)
[Interface]
Address = 10.200.0.1/32
ListenPort = 51820
PrivateKey = <cloud-private-key>
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
[Peer]
# Edge Gateway
PublicKey = <edge-public-key>
AllowedIPs = 10.200.0.2/32, 192.168.29.0/24
PersistentKeepalive = 25
# Edge Gateway (Intel NUC)
[Interface]
Address = 10.200.0.2/32
PrivateKey = <edge-private-key>
[Peer]
# Cloud Server
PublicKey = <cloud-public-key>
AllowedIPs = 10.100.0.0/16 # Entire AWS VPC
Endpoint = <cloud-public-ip>:51820
PersistentKeepalive = 25
6.3 Port Reference Table
| Service | Port | Protocol | Location | Notes |
| --- | --- | --- | --- | --- |
| DVR RTSP | 554 | TCP | 192.168.29.200 | Local network only |
| DVR HTTP | 80 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR HTTPS | 443 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR TCP | 25001 | TCP | 192.168.29.200 | Proprietary protocol |
| DVR UDP | 25002 | UDP | 192.168.29.200 | Proprietary protocol |
| DVR NTP | 123 | UDP | 192.168.29.200 | Time sync |
| WireGuard | 51820 | UDP | Cloud + Edge | VPN tunnel |
| Edge Admin | 8080 | TCP | 192.168.29.5 | Local admin UI |
| Edge SSH | 22 | TCP | 192.168.29.5 | Admin access only |
| Traefik HTTP | 8000 | TCP | EKS | Internal HTTP entrypoint |
| Traefik HTTPS | 8443 | TCP | EKS | Internal HTTPS entrypoint |
| ALB HTTPS | 443 | TCP | AWS | Public-facing |
| Backend API | 8080 | TCP | EKS pods | Internal service port |
| Triton HTTP | 8000 | TCP | EKS GPU nodes | Model inference HTTP |
| Triton gRPC | 8001 | TCP | EKS GPU nodes | Model inference gRPC |
| Triton Metrics | 8002 | TCP | EKS GPU nodes | Prometheus metrics |
| PostgreSQL | 5432 | TCP | RDS | VPC-private |
| Redis | 6379 | TCP | ElastiCache | VPC-private |
| Kafka | 9092 | TCP | MSK | VPC-private |
| MinIO API | 9000 | TCP | EKS + Edge | S3-compatible API |
| MinIO Console | 9001 | TCP | EKS + Edge | Admin console |
| Prometheus | 9090 | TCP | EKS | Metrics collection |
| Grafana | 3000 | TCP | EKS | Dashboards |
7. Scaling Strategy
7.1 Camera Scaling Roadmap
┌─────────────────────────────────────────────────────────────────────────────┐
│ CAMERA SCALING ROADMAP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CURRENT: 8 cameras (1 DVR) │
│ ├─ Edge: Intel NUC i7, 32GB RAM │
│ ├─ Streams: 8 × RTSP @ 960×1080 │
│ ├─ Bandwidth: ~16 Mbps upstream (2 Mbps per H.264 stream) │
│ └─ Cloud AI: 1× T4 GPU (handles 8 streams @ 1 fps) │
│ │
│ PHASE 1: 16 cameras (2 DVRs) │
│ ├─ Edge: Intel NUC i7 (sufficient) or 2× NUC │
│ ├─ Add 2nd edge gateway for 2nd DVR site (if different location) │
│ ├─ Streams: 16 × RTSP │
│ ├─ Bandwidth: ~32 Mbps │
│ ├─ Cloud AI: 1× T4 GPU (still sufficient, batch size 8 → 16) │
│ └─ Kafka: 8 partitions → 16 partitions │
│ │
│ PHASE 2: 32 cameras (4 DVRs / 4 sites) │
│ ├─ Edge: 4× Intel NUC (one per site) │
│ ├─ VPN: Hub-spoke model (4 edge peers → 1 cloud endpoint) │
│ ├─ Bandwidth: ~64 Mbps │
│ ├─ Cloud AI: 2× T4 GPUs (HPA: 2-6 replicas) │
│ ├─ Stream Ing.: 6-12 replicas (HPA) │
│ ├─ Kafka: 32 partitions │
│ └─ PostgreSQL: db.r6g.2xlarge (scale up) │
│ │
│ PHASE 3: 64 cameras (8 DVRs / 8 sites) │
│ ├─ Edge: 8× Intel NUC (or NVIDIA Jetson Orin for edge AI pre-filter) │
│ ├─ VPN: WireGuard hub-spoke or mesh (consider Tailscale for simplicity) │
│ ├─ Bandwidth: ~128 Mbps (dedicated internet circuit recommended) │
│ ├─ Cloud AI: 4× T4 GPUs or 2× A10G (g5.2xlarge) │
│ ├─ Stream Ing.: 12-20 replicas │
│ ├─ Kafka: 64 partitions, consider MSK multi-cluster │
│ ├─ PostgreSQL: db.r6g.4xlarge + read replica │
│ ├─ Redis: 4 shards │
│ └─ MinIO: Distributed mode, 4+ nodes │
│ │
│ PHASE 4: 64+ cameras (NVR consolidation) │
│ ├─ Consider NVR-to-edge consolidation (fewer, more powerful recorders) │
│ ├─ Edge AI pre-filtering (Jetson Orin): only send motion frames │
│ ├─ Bandwidth reduction: ~50% via smart filtering │
│ └─ Multi-region cloud deployment for latency optimization │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
7.2 AI Inference Scaling
| Metric | 8 Cameras | 16 Cameras | 32 Cameras | 64 Cameras |
| --- | --- | --- | --- | --- |
| Frame Rate | 8 fps (1 per cam) | 16 fps | 32 fps | 64 fps |
| GPU Replicas | 1× T4 | 1× T4 | 2× T4 | 4× T4 or 2× A10G |
| Inference Latency (P99) | 80ms | 120ms | 150ms | 200ms |
| Kafka Partitions (raw) | 8 | 16 | 32 | 64 |
| Consumer Groups | 3 | 4 | 6 | 8 |
Auto-scaling Triggers:
- GPU utilization > 80% for 2 minutes → scale out
- Kafka consumer lag > 1000 messages for 5 minutes → scale out
- Queue depth < 100 for 10 minutes → scale in (to minimum)
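These triggers map naturally onto a Kubernetes HorizontalPodAutoscaler fed by the Prometheus Adapter. A sketch of what that could look like; the metric names (`gpu_utilization_percent`, `kafka_consumer_lag`) are assumptions about what the adapter exposes, not guaranteed names:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference
  namespace: surveillance
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization_percent   # assumed adapter-exposed metric
      target:
        type: AverageValue
        averageValue: "80"
  - type: External
    external:
      metric:
        name: kafka_consumer_lag        # assumed adapter-exposed metric
      target:
        type: Value
        value: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 min quiet period before scale-in
```

The stabilization window implements the "scale in after 10 minutes" trigger; scale-out reacts as soon as either metric breaches its target.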
7.3 Storage Scaling
┌─────────────────────────────────────────────────────────────────────────────┐
│ STORAGE CAPACITY PLANNING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Per-Camera Storage Profile: │
│ - Continuous recording: ~1.5 GB/hour @ 960×1080 H.264 main profile │
│ - AI snapshots (1 fps): ~50 MB/hour (JPEG compressed) │
│ - Event clips: ~10 MB average per event (30-second clip) │
│ │
│ Total Per Day (8 cameras): │
│ - Video: 8 × 1.5 GB × 24h = 288 GB/day │
│ - Snapshots: 8 × 50 MB × 24h = 9.6 GB/day │
│ - Events (est. 500/day): 500 × 10 MB = 5 GB/day │
│ - TOTAL: ~303 GB/day = ~9 TB/month │
│ │
│ Tiered Storage Strategy: │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TIER 1: EDGE LOCAL (Hot, 7 days) │ │
│ │ - Capacity: 2TB NVMe per edge gateway │ │
│ │ - All 8 streams, full resolution, 10s segments │ │
│ │ - Cost: Hardware (CAPEX) │ │
│ │ - Use: Immediate playback, event export │ │
│ │ │ │
│ │ 7 days × 303 GB = 2.1 TB ✓ (fits in 2TB with compression) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TIER 2: CLOUD MINIO (Warm, 30 days) │ │
│ │ - Capacity: 10TB initial, auto-scaling │ │
│ │ - Full resolution video segments + event snapshots │ │
│ │ - Cost: ~$0.023/GB/month (S3 Standard equivalent) │ │
│ │ - Use: Dashboard playback, search, investigation │ │
│ │ │ │
│ │ 30 days × 303 GB = 9.1 TB │ │
│ │ Cost: 9,100 GB × $0.023 = ~$209/month │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TIER 3: S3 IA (Cool, 31-90 days) │ │
│ │ - Capacity: Auto (lifecycle transition) │ │
│ │ - Cost: ~$0.0125/GB/month │ │
│ │ - Use: Occasional access, compliance review │ │
│ │ │ │
│ │ 60 days × 303 GB = 18.2 TB │ │
│ │ Cost: 18,200 GB × $0.0125 = ~$228/month │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TIER 4: GLACIER DEEP ARCHIVE (Cold, 90+ days) │ │
│ │ - Capacity: Unbounded │ │
│ │ - Cost: ~$0.00099/GB/month │ │
│ │ - Retrieval: 12-48 hours (batch) │ │
│ │ - Use: Long-term compliance, legal hold │ │
│ │ │ │
│ │ Annual accumulation: 303 GB × 365 = 110 TB │ │
│ │ Cost: 110,000 GB × $0.00099 = ~$109/month │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ TOTAL MONTHLY STORAGE COST (8 cameras, steady state): │
│ - Tier 2 (hot): $209 │
│ - Tier 3 (warm): $228 │
│ - Tier 4 (cold): $109 │
│ - TOTAL: ~$546/month │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
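The tier costs above follow directly from the ~303 GB/day figure; a quick arithmetic sanity check (small differences versus the table come from its rounded GB totals):

```go
package main

import "fmt"

// monthlyTierCost returns the steady-state monthly cost of holding
// `days` worth of footage at gbPerDay gigabytes/day for the given
// per-GB-month rate.
func monthlyTierCost(gbPerDay float64, days int, ratePerGBMonth float64) float64 {
	return gbPerDay * float64(days) * ratePerGBMonth
}

func main() {
	const gbPerDay = 303.0
	tier2 := monthlyTierCost(gbPerDay, 30, 0.023)    // cloud MinIO / S3 Standard, 30d
	tier3 := monthlyTierCost(gbPerDay, 60, 0.0125)   // S3 IA, days 31-90
	tier4 := monthlyTierCost(gbPerDay, 365, 0.00099) // Glacier, 1-year accumulation
	fmt.Printf("tier2=$%.0f tier3=$%.0f tier4=$%.0f total=$%.0f\n",
		tier2, tier3, tier4, tier2+tier3+tier4)
	// prints: tier2=$209 tier3=$227 tier4=$109 total=$546
}
```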
7.4 Database Partitioning Strategy
-- Partition events table by month (range partitioning)
CREATE TABLE events (
id BIGSERIAL,
camera_id VARCHAR(32) NOT NULL,
event_type VARCHAR(50) NOT NULL,
confidence DECIMAL(4,3),
bounding_box BOX,
snapshot_path VARCHAR(512),
start_time TIMESTAMPTZ NOT NULL,
end_time TIMESTAMPTZ,
severity VARCHAR(20),
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY RANGE (start_time);
-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... auto-created by cron job
-- Partition pruning ensures queries for specific time ranges
-- only scan relevant partitions
-- Automated partition creation (pg_partman extension)
SELECT partman.create_parent('public.events', 'start_time', 'native', 'monthly');
-- Partition archival
-- Partitions older than 12 months:
--   1. DETACH the partition (ALTER TABLE events DETACH PARTITION ...)
--   2. Export it to S3 (e.g. RDS's aws_s3.query_export_to_s3, or a
--      COPY-based dump uploaded by a scheduled job)
--   3. DROP the detached table (data remains in the cold archive)
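With range partitioning in place, any query bounded on `start_time` is pruned to the matching partitions at plan time. A sketch against the DDL above (the plan wording varies by PostgreSQL version):

```sql
-- Should scan only events_2025_01, not the whole events table
EXPLAIN (COSTS OFF)
SELECT camera_id, event_type, confidence
FROM events
WHERE start_time >= '2025-01-15'
  AND start_time <  '2025-01-16'
  AND camera_id = 'cam_001';
-- Expected: a scan node on events_2025_01 only; other monthly
-- partitions are pruned and never touched.
```

This is why time-bounded dashboard queries stay fast even as the events table accumulates years of data.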
8. Failover & Reliability
8.1 Service Restart Policies
# Kubernetes Deployment - Restart Policy Example
# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: stream-ingestion-service
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero-downtime deployment
template:
spec:
containers:
- name: ingestion
image: surveillance/stream-ingestion:v1.2.3
resources:
requests:
cpu: "1000m"
memory: "2Gi"
limits:
cpu: "2000m"
memory: "4Gi"
livenessProbe:
grpc:
port: 8081
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3 # Restart after 30s of failures
readinessProbe:
grpc:
port: 8081
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
startupProbe:
grpc:
port: 8081
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 30 # 150s max for startup
8.2 Stream Reconnect Logic
// Edge Gateway stream reconnect loop (Go sketch)
func maintainStream(cameraID string, rtspURL string) {
    backoff := &ExponentialBackoff{
        Initial:    1 * time.Second,
        Max:        60 * time.Second,
        Multiplier: 2.0,
        Jitter:     0.1,
    }
    for {
        ctx, cancel := context.WithCancel(context.Background())
        err := connectAndStream(ctx, cameraID, rtspURL) // blocks until the stream drops
        cancel()                                        // always release the context
        if err != nil {
            log.Error("stream disconnected", "camera", cameraID, "error", err)
            // Update health status in Redis
            redis.HSet("stream:health", cameraID, "disconnected")
            // Wait with exponential backoff before reconnecting
            wait := backoff.Next()
            log.Info("reconnecting", "camera", cameraID, "wait", wait)
            time.Sleep(wait)
            continue
        }
        // Clean return: reset backoff and mark healthy
        backoff.Reset()
        redis.HSet("stream:health", cameraID, "connected")
    }
}
// Circuit breaker pattern for cloud connection
type CircuitBreaker struct {
state State // Closed, Open, HalfOpen
failureCount int
failureThreshold int // 5 failures
timeout time.Duration // 60s open state
lastFailureTime time.Time
}
8.3 VPN Tunnel Recovery
#!/bin/bash
# /usr/local/bin/wireguard-watchdog.sh
# Run every minute via cron (cron's minimum granularity; use a
# systemd timer for a true 30-second interval)
CLOUD_ENDPOINT="10.200.0.1"
TUNNEL_INTERFACE="wg0"
LOG_FILE="/var/log/wg-watchdog.log"
# Check tunnel health
ping -c 3 -W 5 -I $TUNNEL_INTERFACE $CLOUD_ENDPOINT > /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "$(date): VPN tunnel unhealthy, restarting..." >> $LOG_FILE
# 1. Restart WireGuard interface
wg-quick down $TUNNEL_INTERFACE
sleep 2
wg-quick up $TUNNEL_INTERFACE
# 2. Verify recovery
sleep 5
ping -c 3 -W 5 -I $TUNNEL_INTERFACE $CLOUD_ENDPOINT > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "$(date): VPN tunnel recovered" >> $LOG_FILE
# Notify cloud of recovery
curl -X POST http://10.200.0.1:8080/api/v1/system/edge-recovery \
-H "Authorization: Bearer $EDGE_TOKEN" \
-d "{\"edge_id\": \"$HOSTNAME\", \"status\": \"recovered\"}"
else
echo "$(date): VPN tunnel recovery FAILED" >> $LOG_FILE
# Escalate: local alert (buzzer/email if available)
fi
fi
8.4 Queue Recovery & Durability
┌─────────────────────────────────────────────────────────────────────────────┐
│ KAFKA DURABILITY CONFIGURATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Producer Configuration (Stream Ingestion Service): │
│ ────────────────────────────────────────────────── │
│ - acks=all # Wait for all replicas │
│ - retries=10 # Aggressive retry │
│ - retry.backoff.ms=1000 # 1 second between retries │
│ - enable.idempotence=true # Exactly-once semantics │
│ - max.in.flight.requests=1 # Preserve ordering during retry │
│ - compression.type=lz4 # Efficient compression │
│ │
│ Topic Configuration: │
│ ──────────────────── │
│ - replication.factor=3 # 3 copies across AZs │
│ - min.insync.replicas=2 # Require 2 acks for producer commit │
│ - retention.ms=604800000 # 7 days for raw streams │
│ - retention.ms=2592000000 # 30 days for detections │
│ - unclean.leader.election.enable=false # Never lose committed data │
│ │
│ Consumer Configuration (AI Inference Service): │
│ ────────────────────────────────────────────── │
│ - enable.auto.commit=false # Manual offset management │
│ - auto.offset.reset=earliest # Replay from beginning on new group │
│ - max.poll.records=100 # Process in batches │
│ - isolation.level=read_committed # Only read committed transactions │
│ │
│ Offset Commit Strategy: │
│ ─────────────────────── │
│ 1. Pull batch from Kafka │
│ 2. Process (run inference) │
│ 3. Write results to PostgreSQL (transaction) │
│ 4. Commit Kafka offset ONLY after DB write succeeds │
│ 5. If any step fails: don't commit, reprocess on next poll │
│ │
│ Dead Letter Queue: │
│ ────────────────── │
│ - Topic: streams.raw.dlq │
│ - After 5 processing failures, message moved to DLQ │
│ - DLQ consumer: alerts admin, manual inspection │
│ - Retention: 30 days │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
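The offset commit strategy (steps 1-5) reduces to "commit only after the database transaction succeeds". A sketch of that invariant against hypothetical interfaces and in-memory fakes — no real Kafka client API is assumed here:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the Kafka consumer and the results store; the
// real services would wrap an actual client and PostgreSQL.
type Batch struct {
	Records []string
	Offset  int64
}

type Store interface {
	WriteResults(records []string) error // one DB transaction
}

type Committer interface {
	Commit(offset int64) error
}

// processBatch writes results first and commits the offset only if the
// write succeeded. On failure nothing is committed, so the batch is
// redelivered on the next poll: at-least-once delivery that, combined
// with idempotent writes, behaves as effectively exactly-once.
func processBatch(b Batch, db Store, c Committer) error {
	if err := db.WriteResults(b.Records); err != nil {
		return err // do NOT commit; batch will be reprocessed
	}
	return c.Commit(b.Offset)
}

// In-memory fakes for demonstration.
type fakeStore struct {
	fail bool
}

func (s *fakeStore) WriteResults(r []string) error {
	if s.fail {
		return errors.New("db down")
	}
	return nil
}

type fakeCommitter struct {
	committed int64
}

func (c *fakeCommitter) Commit(o int64) error { c.committed = o; return nil }

func main() {
	db, com := &fakeStore{}, &fakeCommitter{committed: -1}
	_ = processBatch(Batch{Records: []string{"d1", "d2"}, Offset: 42}, db, com)
	fmt.Println(com.committed) // 42: committed after a successful write

	db.fail = true
	_ = processBatch(Batch{Records: []string{"d3"}, Offset: 43}, db, com)
	fmt.Println(com.committed) // still 42: failed write leaves offset uncommitted
}
```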
8.5 Graceful Degradation
┌─────────────────────────────────────────────────────────────────────────────┐
│ GRACEFUL DEGRADATION MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ FAILURE MODE │ DEGRADATION STRATEGY │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ AI Inference Service DOWN │ Continue recording ALL video │ │
│ │ (GPU failure, model crash) │ - Events stored as "unprocessed"│ │
│ │ │ - No real-time alerts │ │
│ │ │ - Queue frames for later batch │ │
│ │ │ processing when AI recovers │ │
│ │ │ - Dashboard shows "AI OFFLINE" │ │
│ │ │ banner │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ Kafka DOWN │ Edge Gateway buffers locally │ │
│ │ (MSK outage) │ - Local MinIO ring buffer │ │
│ │ │ - Backpressure: reduce to │ │
│ │ │ key frames only (0.2 fps) │ │
│ │ │ - Auto-reconnect with 2x │ │
│ │ │ exponential backoff │ │
│ │ │ - Replay from local buffer │ │
│ │ │ when Kafka recovers │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ VPN Tunnel DOWN │ Full local operation mode │ │
│ │ (internet outage) │ - All recording continues │ │
│ │ │ locally (7-day buffer) │ │
│ │ │ - Local alert buzzer/relay │ │
│ │ │ (configurable) │ │
│ │ │ - No cloud dashboard access │ │
│ │ │ - Auto-sync when VPN recovers │ │
│ │ │ - Queue cloud events for │ │
│ │ │ later replay │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ PostgreSQL DOWN │ Alert queue builds in Kafka │ │
│ │ (RDS outage) │ - Events not lost (Kafka dur.) │ │
│ │ │ - Read-only dashboard mode │ │
│ │ │ - Cached data from Redis │ │
│ │ │ - Alert on-call engineer │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ Notification Service DOWN │ Alerts accumulate in DB │ │
│ │ │ - Retry with exponential backoff│ │
│ │ │ - Dead letter after 24 hours │ │
│ │ │ - Dashboard shows pending count │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ Edge Gateway DOWN │ Cloud dashboard shows │ │
│ │ (power/hardware failure) │ "SITE OFFLINE" │ │
│ │ │ - Last known recordings in │ │
│ │ │ cloud (up to disconnect) │ │
│ │ │ - Alert sent immediately │ │
│ │ │ - UPS on edge: graceful │ │
│ │ │ shutdown, preserve data │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Priority Order (highest first): │
│ 1. Video recording NEVER STOPS (local edge priority) │
│ 2. Critical alerts ALWAYS FIRE (local buzzer + queued cloud alerts) │
│ 3. AI inference gracefully degrades to batch catch-up │
│ 4. Dashboard operates in read-only/cache mode during DB outage │
│ 5. Cloud sync resumes automatically when connectivity restored │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.6 Health Check Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ HEALTH CHECK ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: KUBERNETES PROBES │
│ ────────────────────────── │
│ - Liveness Probe: /health/live → Restart container if failing │
│ - Readiness Probe: /health/ready → Remove from service if failing │
│ - Startup Probe: /health/startup→ Allow long initialization │
│ │
│ LAYER 2: SERVICE-LEVEL HEALTH (Prometheus metrics) │
│ ────────────────────────────────────────────────── │
│ Each service exposes: │
│ - app_health_status{service="X"} 0=healthy, 1=degraded, 2=critical │
│ - app_health_details{check="db"} last check timestamp + result │
│ │
│ LAYER 3: DEPENDENCY HEALTH CHECKS │
│ ──────────────────────────────── │
│ Backend Service checks: │
│ ├─ PostgreSQL: SELECT 1; (timeout 2s) │
│ ├─ Redis: PING (timeout 1s) │
│ ├─ Kafka: ListTopics (timeout 3s) │
│ ├─ MinIO: ListBuckets (timeout 3s) │
│ └─ Triton: ModelReady API (timeout 5s) │
│ │
│ LAYER 4: END-TO-END HEALTH │
│ ────────────────────────── │
│ Synthetic probe: │
│ 1. Upload test image to stream ingestion │
│ 2. Verify AI detection result appears in Kafka │
│ 3. Verify event written to PostgreSQL │
│ 4. Verify alert queryable via API │
│ 5. Verify WebSocket push received │
│ Run: Every 60 seconds from monitoring namespace │
│ │
│ LAYER 5: EDGE HEALTH HEARTBEAT │
│ ──────────────────────────────── │
│ - Edge Gateway sends heartbeat every 30 seconds │
│ - Payload: {edge_id, timestamp, stream_count, disk_free, mem_usage} │
│ - Missed 3 heartbeats (90s) → "EDGE OFFLINE" alert │
│ - Recovers → "EDGE ONLINE" notification │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
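The Layer 2 metric (`app_health_status`: 0=healthy, 1=degraded, 2=critical) has to be derived from the Layer 3 dependency checks somehow; one plausible aggregation, sketched below, splits dependencies into hard (service cannot function) and soft (reduced functionality). The hard/soft split is an assumption of this sketch, not mandated by the document:

```go
package main

import "fmt"

// Status values follow the app_health_status convention above.
const (
	Healthy  = 0
	Degraded = 1
	Critical = 2
)

// AggregateHealth maps dependency check results to one service status:
// any failing hard dependency is critical, any failing soft dependency
// degrades, otherwise the service is healthy.
func AggregateHealth(hardOK, softOK map[string]bool) int {
	for _, ok := range hardOK {
		if !ok {
			return Critical
		}
	}
	for _, ok := range softOK {
		if !ok {
			return Degraded
		}
	}
	return Healthy
}

func main() {
	hard := map[string]bool{"postgres": true, "kafka": true}
	soft := map[string]bool{"redis": true, "minio": false}
	fmt.Println(AggregateHealth(hard, soft)) // 1: degraded (minio failing)
	hard["postgres"] = false
	fmt.Println(AggregateHealth(hard, soft)) // 2: critical
}
```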
9. Security Architecture
9.1 Defense in Depth
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEFENSE IN DEPTH LAYERS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: PERIMETER │
│ ────────────── │
│ - AWS WAF v2: SQL injection, XSS, rate limiting rules │
│ - Geo-restriction: Allow only specific countries │
│ - AWS Shield Standard (DDoS protection) │
│ - ALB access logs → S3 → Athena for analysis │
│ │
│ LAYER 2: TRANSPORT │
│ ────────────── │
│ - TLS 1.3 for all external HTTPS connections │
│ - WireGuard ChaCha20-Poly1305 for VPN tunnel │
│ - mTLS (mutual TLS) for internal service-to-service communication │
│ - Certificate rotation: Let's Encrypt auto (90-day) │
│ │
│ LAYER 3: AUTHENTICATION & AUTHORIZATION │
│ ────────────────────────────────────────── │
│ - JWT with RS256 (asymmetric signing) │
│ - Access token: 15 minutes │
│ - Refresh token: 7 days (stored in httpOnly cookie) │
│ - RBAC: admin, operator, viewer roles │
│ - API keys for edge gateway authentication │
│ - Multi-factor authentication for admin role │
│ │
│ LAYER 4: APPLICATION SECURITY │
│ ──────────────────────────── │
│ - Input validation: strict JSON schemas │
│ - SQL injection: parameterized queries only (pgx) │
│ - XSS prevention: Content Security Policy headers │
│ - CSRF tokens for state-changing operations │
│ - File upload: virus scanning, size limits, type validation │
│ │
│ LAYER 5: DATA SECURITY │
│ ──────────────────── │
│ - RDS: Encryption at rest (AES-256, AWS KMS CMK) │
│ - RDS: Encryption in transit (TLS 1.2+) │
│ - S3: Default encryption (SSE-S3 or SSE-KMS) │
│ - Redis: TLS in transit, no AUTH token exposure │
│ - Face embeddings: stored as vectors, not raw images (privacy) │
│ - Backup encryption: separate KMS key for backups │
│ │
│ LAYER 6: NETWORK SEGMENTATION │
│ ─────────────────────────── │
│ - VPC private subnets for all workloads │
│ - Security groups: least privilege, explicit allow only │
│ - Network Policies: namespace-level isolation in K8s │
│ - DVR: NO public IP, NO internet gateway, local network only │
│ - VPN: Single controlled entry point │
│ │
│ LAYER 7: AUDIT & MONITORING │
│ ───────────────────────── │
│ - All API calls logged with user, IP, timestamp, resource │
│ - PostgreSQL audit_log table (append-only) │
│ - CloudTrail for AWS API calls │
│ - VPC Flow Logs for network analysis │
│ - Alert on abnormal patterns (unusual login times, geo anomalies) │
│ - Log retention: 1 year in S3 Glacier │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
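The Layer 3 token policy (15-minute access tokens, role claims) can be sketched in code. This is an illustrative HS256 implementation using only the Python standard library, since RS256 signing needs an external crypto library; the production design specified above uses RS256 with asymmetric keys, and the claim names here (`sub`, `role`) are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(secret: bytes, sub: str, role: str, ttl_s: int = 900) -> str:
    # ttl_s=900 matches the 15-minute access-token lifetime from Layer 3
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    claims = {"sub": sub, "role": role, "iat": now, "exp": now + ttl_s}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

def verify_token(secret: bytes, token: str) -> dict:
    signing_input, _, sig_b64 = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    # Constant-time comparison to avoid signature timing oracles
    if not hmac.compare_digest(b64url(expected), sig_b64):
        raise ValueError("bad signature")
    claims_b64 = signing_input.split(".")[1]
    claims = json.loads(base64.urlsafe_b64decode(claims_b64 + "=" * (-len(claims_b64) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

The refresh token (7 days, httpOnly cookie) would follow the same structure with a longer `exp` and a separate signing key.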
9.2 Secret Management
# Kubernetes External Secrets (AWS Secrets Manager integration)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: surveillance
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: db-credentials
    creationPolicy: Owner
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: surveillance/production/db
        property: password
    - secretKey: DB_USER
      remoteRef:
        key: surveillance/production/db
        property: username
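A workload then consumes the synced Secret as environment variables. The fragment below is a sketch (selector, labels, and image fields elided); the Deployment name `backend-api` is taken from the resource summary in Appendix D, and the container name is an assumption.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
  namespace: surveillance
spec:
  template:
    spec:
      containers:
        - name: api
          envFrom:
            - secretRef:
                name: db-credentials   # Secret created by the ExternalSecret above
```

Because `refreshInterval: 1h` only updates the Secret object, pods pick up rotated credentials on restart (or via a reloader sidecar/annotation, if one is deployed).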
10. Monitoring & Observability
10.1 Monitoring Stack
| Component | Technology | Purpose |
| --- | --- | --- |
| Metrics | Prometheus + Thanos | Time-series collection, long-term storage |
| Visualization | Grafana | Dashboards for all services |
| Logs | Loki + Promtail | Log aggregation, indexed by labels |
| Traces | Jaeger | Distributed request tracing |
| Alerts | Alertmanager + PagerDuty | Multi-channel alerting |
| Uptime | UptimeRobot (external) | External endpoint monitoring |
10.2 Key Metrics
┌─────────────────────────────────────────────────────────────────────────────┐
│ KEY METRICS DASHBOARD │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STREAM HEALTH │
│ ──────────── │
│ - stream_active{camera_id} Gauge: 0/1 │
│ - stream_fps{camera_id} Gauge: actual FPS │
│ - stream_bitrate{camera_id} Gauge: kbps │
│ - stream_reconnect_total{camera_id} Counter: reconnect events │
│ - stream_latency_seconds{camera_id} Histogram: end-to-end latency │
│ │
│ AI INFERENCE │
│ ──────────── │
│ - ai_inference_duration_seconds Histogram: per-model latency │
│ - ai_detection_total{model,class} Counter: detections by class │
│ - ai_gpu_utilization_percent Gauge: GPU usage │
│ - ai_gpu_memory_used_bytes Gauge: VRAM usage │
│ - ai_batch_size_current Gauge: current batch size │
│ - ai_queue_depth Gauge: pending inference requests │
│ │
│ EVENTS & ALERTS │
│ ─────────────── │
│ - events_total{type,severity} Counter: events processed │
│ - alerts_active{severity} Gauge: unacknowledged alerts │
│ - alert_ack_duration_seconds Histogram: time to acknowledge │
│ - false_positive_rate Gauge: FP ratio (training feedback) │
│ │
│ SYSTEM │
│ ────── │
│ - edge_disk_free_bytes Gauge: local storage remaining │
│ - edge_memory_usage_percent Gauge: RAM usage │
│ - vpn_latency_ms Gauge: tunnel round-trip time │
│ - kafka_consumer_lag{topic,group} Gauge: message backlog │
│ - db_connection_pool_active Gauge: DB connections in use │
│ - api_request_duration_seconds Histogram: API response time │
│ - api_requests_total{status,path} Counter: HTTP status distribution │
│ │
│ BUSINESS │
│ ───────── │
│ - cameras_online_total Gauge: healthy camera count │
│ - daily_events_total Counter: events per day │
│ - alert_response_time_avg Gauge: avg ack time (SLA: <5min) │
│ - storage_cost_daily_usd Gauge: estimated daily cost │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
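Services expose these metrics in the Prometheus text exposition format. The sketch below renders the stream-health gauges using only the standard library to show the label/value layout; in practice a service would use the official `prometheus_client` package rather than hand-formatting.

```python
def render_stream_metrics(streams: dict) -> str:
    """Render stream-health gauges in Prometheus text exposition format.

    streams: {camera_id: {"active": bool, "fps": float}}
    Metric names match the KEY METRICS DASHBOARD list above.
    """
    lines = [
        "# TYPE stream_active gauge",
        "# TYPE stream_fps gauge",
    ]
    for cam, s in sorted(streams.items()):
        # One sample per camera, labelled with camera_id
        lines.append(f'stream_active{{camera_id="{cam}"}} {1 if s["active"] else 0}')
        lines.append(f'stream_fps{{camera_id="{cam}"}} {s["fps"]}')
    return "\n".join(lines) + "\n"

print(render_stream_metrics({"ch1": {"active": True, "fps": 14.8}}))
```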
10.3 Alerting Rules
# Prometheus alerting rules
# alerts.yml
groups:
  - name: surveillance-critical
    rules:
      - alert: CameraStreamDown
        expr: stream_active == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camera {{ $labels.camera_id }} stream is down"
      - alert: EdgeGatewayOffline
        expr: time() - vpn_last_heartbeat > 120
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Edge gateway {{ $labels.edge_id }} is offline"
      - alert: AIInferenceHighLatency
        # histogram_quantile operates on the _bucket series; 0.5 s = 500 ms
        expr: histogram_quantile(0.99, sum(rate(ai_inference_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI inference P99 latency is {{ $value }}s"
      - alert: DiskSpaceLow
        expr: edge_disk_free_bytes / edge_disk_total_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Edge disk usage is above 90%"
      - alert: UnacknowledgedCriticalAlerts
        expr: alerts_active{severity="critical"} > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} critical alerts unacknowledged for >15 minutes"
11. Cost Estimation
11.1 Monthly Cost Breakdown (8 cameras)
| Service | Instance/Type | Monthly Cost (USD) |
| --- | --- | --- |
| EKS Control Plane | Managed | $73 |
| EKS Worker Nodes (on-demand) | 3× t3.large (API, services) | $200 |
| EKS GPU Nodes | 1× g4dn.xlarge (spot when possible) | $350 |
| RDS PostgreSQL | db.r6g.xlarge Multi-AZ | $520 |
| ElastiCache Redis | cache.r6g.large (2 shards) | $260 |
| MSK Kafka | 3× kafka.m5.large | $350 |
| ALB + Data Transfer | ~500 GB/month | $50 |
| S3 Storage | ~10 TB (tiered) | $200 |
| CloudFront CDN | ~200 GB/month | $30 |
| EC2 VPN Endpoint | t3.micro | $15 |
| Edge Hardware | Intel NUC (amortized 3yr) | ~$40 |
| Internet (site) | Business broadband | $50 |
| TOTAL | | ~$2,138/month |
11.2 Cost Optimization Strategies
- Spot Instances: GPU nodes and batch processing on spot (70% savings)
- Reserved Instances: RDS and ElastiCache 1-year reserved (40% savings)
- S3 Lifecycle: Automatic tiering to IA and Glacier
- Right-sizing: Monitor actual usage, adjust requests/limits
- Edge AI: Pre-filter on Jetson Orin to reduce cloud bandwidth (future)
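A back-of-envelope check of the first two strategies against the 11.1 baseline. The percentages are the quoted targets, not guaranteed AWS pricing, and spot interruption risk is ignored.

```python
# Baseline from section 11.1 (~$2,138/month for 8 cameras)
baseline = 2138

savings = (
    350 * 0.70    # GPU node (g4dn.xlarge) moved to spot: 70% savings
    + 520 * 0.40  # RDS 1-year reserved: 40% savings
    + 260 * 0.40  # ElastiCache 1-year reserved: 40% savings
)
optimized = baseline - savings
print(round(savings), round(optimized))  # 557 1581
```

So spot plus reserved capacity alone would bring the monthly bill from ~$2,138 down to roughly $1,580, before S3 lifecycle tiering or right-sizing.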
12. Implementation Phases
Phase 1: Foundation (Weeks 1-4)
Phase 2: Core AI (Weeks 5-8)
Phase 3: Application (Weeks 9-12)
Phase 4: Operations (Weeks 13-16)
13. Appendices
Appendix A: DVR RTSP URL Format
# CP PLUS ORANGE Series RTSP URL format
rtsp://<username>:<password>@<dvr_ip>:554/user=<username>&password=<password>&channel=<1-8>&stream=<0|1>.sdp?
# stream=0: Main stream (higher quality)
# stream=1: Sub stream (lower quality)
# Example:
rtsp://admin:password@192.168.29.200:554/user=admin&password=password&channel=1&stream=0.sdp?
# FFmpeg test command (-strftime 1 is required for the timestamped filename pattern):
ffmpeg -rtsp_transport tcp \
  -i "rtsp://192.168.29.200:554/user=admin&password=&channel=1&stream=0.sdp?" \
  -c copy -f segment -segment_time 10 -reset_timestamps 1 -strftime 1 \
  /recordings/ch1/%Y%m%d_%H%M%S.mkv
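The URL format above can be built programmatically. A minimal sketch (the function name and parameter names are illustrative, not part of any CP PLUS SDK):

```python
from urllib.parse import quote

def cp_plus_rtsp_url(dvr_ip: str, username: str, password: str,
                     channel: int, sub_stream: bool = False,
                     port: int = 554) -> str:
    """Build a CP PLUS ORANGE Series RTSP URL per the Appendix A format.

    channel: 1-8 on the CP-UVR-0801E1-CV2; stream=0 main, stream=1 sub.
    """
    if not 1 <= channel <= 64:
        raise ValueError("channel out of range")
    stream = 1 if sub_stream else 0
    # Percent-encode credentials so special characters survive both the
    # userinfo part and the query-style path this DVR expects
    user, pw = quote(username, safe=""), quote(password, safe="")
    return (f"rtsp://{user}:{pw}@{dvr_ip}:{port}/"
            f"user={user}&password={pw}&channel={channel}&stream={stream}.sdp?")

print(cp_plus_rtsp_url("192.168.29.200", "admin", "password", 1))
```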
Appendix B: WireGuard Full Configuration
# === CLOUD SERVER (AWS EC2) ===
# /etc/wireguard/wg0-cloud.conf
[Interface]
PrivateKey = <cloud-private-key>
Address = 10.200.0.1/24
ListenPort = 51820
# Port blocks must precede the general wg0 accept, or they never match;
# PostDown mirrors every PostUp rule so repeated up/down stays clean
PostUp = iptables -A FORWARD -i wg0 -p tcp --dport 5432 -j DROP; \
         iptables -A FORWARD -i wg0 -p tcp --dport 6379 -j DROP; \
         iptables -A FORWARD -i wg0 -j ACCEPT; \
         iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -p tcp --dport 5432 -j DROP; \
           iptables -D FORWARD -i wg0 -p tcp --dport 6379 -j DROP; \
           iptables -D FORWARD -i wg0 -j ACCEPT; \
           iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
DNS = 10.100.0.2
[Peer]
# Edge Gateway - Site 1
PublicKey = <edge1-public-key>
PresharedKey = <preshared-key-1>
AllowedIPs = 10.200.0.2/32, 192.168.29.0/24
PersistentKeepalive = 25
# === EDGE GATEWAY (Intel NUC) ===
# /etc/wireguard/wg0-edge.conf
[Interface]
PrivateKey = <edge-private-key>
Address = 10.200.0.2/32
DNS = 10.100.0.2
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; \
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; \
iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
[Peer]
# Cloud Server
PublicKey = <cloud-public-key>
PresharedKey = <preshared-key-1>
AllowedIPs = 10.100.0.0/16, 10.200.0.0/24
Endpoint = <cloud-public-ip>:51820
PersistentKeepalive = 25
Appendix C: FFmpeg Stream Processing Command
#!/bin/bash
# Edge Gateway stream processing pipeline
set -euo pipefail

if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <camera_id> <channel>" >&2
    exit 1
fi

CAMERA_ID=$1
CHANNEL=$2
DVR_IP="192.168.29.200"
DVR_USER="admin"
DVR_PASS=""
RTSP_URL="rtsp://${DVR_IP}:554/user=${DVR_USER}&password=${DVR_PASS}&channel=${CHANNEL}&stream=0.sdp?"
RECORDING_DIR="/var/recordings/${CAMERA_ID}"
AI_PIPE="/tmp/ai_pipe_${CAMERA_ID}"
mkdir -p "$RECORDING_DIR"
[ -p "$AI_PIPE" ] || mkfifo "$AI_PIPE"
# Pipeline 1: Recording (10s segments)
ffmpeg -hide_banner -loglevel warning \
-rtsp_transport tcp \
-i "$RTSP_URL" \
-c copy -f segment \
-segment_time 10 \
-segment_format mp4 \
-reset_timestamps 1 \
-strftime 1 \
"${RECORDING_DIR}/%Y%m%d_%H%M%S.mp4" \
2>> /var/log/ffmpeg-${CAMERA_ID}.log &
# Pipeline 2: AI frame extraction (1 fps)
ffmpeg -hide_banner -loglevel warning \
-rtsp_transport tcp \
-i "$RTSP_URL" \
-vf "fps=1,scale=640:640" \
-f image2pipe \
-vcodec mjpeg \
-q:v 5 \
"$AI_PIPE" \
2>> /var/log/ffmpeg-ai-${CAMERA_ID}.log &
# Pipeline 3: Frame batching and gRPC send to cloud
frame-batcher \
--input "$AI_PIPE" \
--camera-id "$CAMERA_ID" \
--batch-size 8 \
--cloud-endpoint "10.200.0.1:8081" \
--vpn-interface wg0 \
>> /var/log/batcher-${CAMERA_ID}.log 2>&1 &
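The `frame-batcher` binary in Pipeline 3 is a custom component; its input side has to split the raw MJPEG stream from the FIFO into complete JPEG frames before batching. A hypothetical sketch of that framing step (simplified: it assumes the `FF D9` end marker does not occur inside entropy-coded data, which MJPEG from FFmpeg generally satisfies):

```python
def split_mjpeg(buf: bytes):
    """Split a raw MJPEG byte stream into complete JPEG frames.

    Frames start with FF D8 and end with FF D9. Returns (frames, remainder),
    where remainder is a trailing partial frame to prepend to the next read.
    """
    frames = []
    while True:
        start = buf.find(b"\xff\xd8")
        if start < 0:
            return frames, b""          # no frame start: discard garbage
        end = buf.find(b"\xff\xd9", start + 2)
        if end < 0:
            return frames, buf[start:]  # incomplete frame: keep for next read
        frames.append(buf[start:end + 2])
        buf = buf[end + 2:]
```

Complete frames would then be accumulated into groups of `--batch-size 8` and shipped over gRPC through the wg0 tunnel.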
Appendix D: Kubernetes Resource Summary
# Complete resource manifest summary
# Namespaces:
# - surveillance: Main application
# - surveillance-data: Database, cache, storage
# - surveillance-monitoring: Prometheus, Grafana
# - surveillance-ops: CI/CD, backup jobs
# Deployments (always running):
# - stream-ingestion: 3-20 replicas, HPA
# - ai-inference: 1-4 replicas (GPU), HPA
# - suspicious-activity: 2-8 replicas, HPA
# - backend-api: 3-10 replicas, HPA
# - video-playback: 2-4 replicas
# - notification-service: 2-5 replicas, HPA
# - web-frontend: 3 replicas (static, CDN-cached)
# - traefik: 2 replicas (DaemonSet preferred)
# StatefulSets:
# - minio: 4 replicas (distributed mode)
# CronJobs:
# - training-service: Weekly (Sundays 02:00)
# - db-backup: Daily (02:00)
# - storage-cleanup: Daily (03:00)
# - partition-maintenance: Monthly (1st, 03:00)
# External Services (AWS managed):
# - PostgreSQL: RDS db.r6g.xlarge Multi-AZ
# - Redis: ElastiCache cluster mode
# - Kafka: MSK 3 brokers
# - ALB: Internet-facing, WAF attached
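Most of the Deployments above scale via HorizontalPodAutoscalers. A sketch for `backend-api` (3-10 replicas per the summary); the CPU utilization target is an assumption to be tuned from Grafana data:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-api
  namespace: surveillance
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target, not specified in this document
```

The GPU-backed `ai-inference` Deployment would instead scale on a custom metric such as `ai_queue_depth`, since CPU utilization is a poor proxy for GPU saturation.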
Document History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2025-01-20 | Solution Architect | Initial complete architecture |
End of Document