System Architecture

Topology, scaling, failover, and technology choices.

AI-Powered Industrial Surveillance Platform — System Architecture

Document Information

  • Version: 1.0
  • Date: 2025-01-20
  • Status: Production Architecture Design
  • Target Platform: CP PLUS ORANGE Series DVR (CP-UVR-0801E1-CV2)
  • Camera Count: 8 channels (scalable to 64+)
  • Resolution: 960x1080 per channel

Table of Contents

  1. Executive Summary
  2. Deployment Topology
  3. Network Security Zones
  4. Service Architecture
  5. Data Flow Design
  6. Technology Stack
  7. Scaling Strategy
  8. Failover & Reliability
  9. Security Architecture
  10. Monitoring & Observability
  11. Cost Estimation
  12. Implementation Phases
  13. Appendices

1. Executive Summary

This document presents the system architecture for an AI-powered industrial surveillance platform that processes 8 camera channels (expandable to 64+) from a CP PLUS ORANGE Series DVR. The architecture follows a cloud-plus-edge hybrid pattern: compute-intensive AI inference runs in the cloud, while a local edge gateway handles stream ingestion and site-local concerns. The edge gateway reaches the DVR over the local network only, and all traffic between the site and the cloud travels inside a WireGuard VPN tunnel, so the DVR has zero public internet exposure.

Key Architectural Decisions

Decision | Choice | Rationale
Cloud Provider | AWS (us-east-1 / ap-south-1) | Broadest IoT/edge tooling, VPC peering, lowest latency to India region
Container Orchestration | Amazon EKS (Kubernetes) | Managed control plane, auto-scaling, GPU node support for AI inference
VPN Solution | WireGuard | ~60% faster than OpenVPN, modern crypto, simple setup, NAT traversal
Message Queue | Apache Kafka (MSK) | Durable, ordered event log, replay capability, proven at scale
Stream Processing | Apache Flink on EKS | Stateful stream processing, exactly-once semantics, windowed operations
Reverse Proxy | Traefik (in-cluster) + AWS ALB (ingress) | Native Kubernetes integration, automatic cert management
AI Framework | NVIDIA Triton Inference Server + YOLOv8 | GPU-optimized inference, model ensemble, dynamic batching
Object Storage | MinIO (on-premises) + AWS S3 (cold archive) | S3-compatible API, local buffering, cost-tiered archival
Database | PostgreSQL 16 (RDS) + pgvector extension | Relational integrity for events, native vector support for face embeddings
Cache/Queue | Redis 7 Cluster (ElastiCache) | Sub-ms latency, stream data type for real-time pub/sub
Edge Hardware | Intel NUC 13 Pro i7 / NVIDIA Jetson Orin NX | x86 preferred for flexibility; Jetson alternative for GPU-at-edge

2. Deployment Topology

2.1 High-Level Topology Diagram

┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    CLOUD (AWS VPC)                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────────────┐    │
│  │                         KUBERNETES CLUSTER (EKS)                                   │    │
│  │                                                                                    │    │
│  │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐  │    │
│  │   │  API GW     │   │  Stream     │   │   AI Inf.   │   │   Suspicious Act.   │  │    │
│  │   │  (Traefik)  │   │  Ingestion  │   │  Service    │   │   Service           │  │    │
│  │   │  :8443      │   │  Service    │   │  (Triton)   │   │   (Night Mode)      │  │    │
│  │   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   └─────────────────────┘  │    │
│  │          │                 │                 │                                    │    │
│  │   ┌──────┴──────┐   ┌──────┴──────┐   ┌──────┴──────┐   ┌─────────────────────┐  │    │
│  │   │  Web App    │   │  Training   │   │ Notification│   │   Video Playback    │  │    │
│  │   │  (Next.js)  │   │  Service    │   │  Service    │   │   Service (HLS)     │  │    │
│  │   └─────────────┘   └─────────────┘   └─────────────┘   └─────────────────────┘  │    │
│  │                                                                                    │    │
│  │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐  │    │
│  │   │  PostgreSQL │   │    Redis    │   │   Kafka     │   │      MinIO          │  │    │
│  │   │  (RDS)      │   │  Cluster    │   │   (MSK)     │   │   (S3-compatible)   │  │    │
│  │   │  :5432      │   │  :6379      │   │  :9092      │   │   :9000             │  │    │
│  │   └─────────────┘   └─────────────┘   └─────────────┘   └─────────────────────┘  │    │
│  │                                                                                    │    │
│  │   ┌─────────────────────────────────────────────────────────────────────────┐      │    │
│  │   │              AWS APPLICATION LOAD BALANCER (:443)                       │      │    │
│  │   │         SSL termination, WAF, rate limiting, geo-restriction            │      │    │
│  │   └─────────────────────────────────────────────────────────────────────────┘      │    │
│  └─────────────────────────────────────────────────────────────────────────────────────┘    │
│           ▲                                                                                 │
│           │ WireGuard VPN Tunnel (UDP 51820)                                                │
│           │ Site-to-Site encrypted tunnel                                                   │
│           │ Cloud peer: 10.200.0.1/32  ←→  Edge peer: 10.200.0.2/32                       │
└───────────┼─────────────────────────────────────────────────────────────────────────────────┘
            │
┌───────────┴─────────────────────────────────────────────────────────────────────────────────┐
│                                 EDGE SITE (Local Network)                                   │
│                                                                                             │
│   ┌─────────────────────────────────┐          ┌─────────────────────────────────────────┐  │
│   │      EDGE GATEWAY               │          │         LOCAL NETWORK                   │  │
│   │   (Intel NUC / Jetson Orin)     │          │     (192.168.29.0/24)                   │  │
│   │   OS: Ubuntu 22.04 LTS          │          │                                         │  │
│   │   WireGuard endpoint            │          │   ┌─────────────────────────────────┐   │  │
│   │   K3s lightweight cluster       │          │   │   CP PLUS DVR                     │   │  │
│   │                                 │          │   │   CP-UVR-0801E1-CV2               │   │  │
│   │   ┌───────────────────────┐     │◄────────►│   │   LAN: 192.168.29.200             │   │  │
│   │   │  Edge Gateway Agent   │     │  :554    │   │   RTSP: 554, HTTP: 80/443         │   │  │
│   │   │  - Stream puller      │     │  :80     │   │   TCP: 25001, UDP: 25002          │   │  │
│   │   │  - Buffer/forward     │     │          │   │   8 Channels × 960×1080           │   │  │
│   │   │  - Local recording    │     │          │   └─────────────────────────────────┘   │  │
│   │   │  - VPN client         │     │          │                                         │  │
│   │   └───────────────────────┘     │          │   ┌─────────────────────────────────┐   │  │
│   │                                 │          │   │   Local Monitor (optional)      │   │  │
│   │   Local Storage: 2TB NVMe       │          │   │   192.168.29.10                 │   │  │
│   │   (7-day circular buffer)       │          │   └─────────────────────────────────┘   │  │
│   └─────────────────────────────────┘          │                                         │  │
│                                                │   CAMERAS (BNC/IP) ──┐                  │  │
│                                                │                      │                  │  │
│                                                │   CH1 ──► CH2 ──► CH3 ──► CH4           │  │
│                                                │   CH5 ──► CH6 ──► CH7 ──► CH8           │  │
│                                                │                                        │  │
│                                                └────────────────────────────────────────┘  │
│                                                                                             │
│   Network: Edge Gateway has TWO interfaces:                                                 │
│   - eth0: 192.168.29.5/24  ←→ Local network (DVR access)                                  │
│   - eth1: DHCP / Static    ←→ Internet (VPN tunnel to cloud)                                │
└─────────────────────────────────────────────────────────────────────────────────────────────┘

2.2 Physical Edge Gateway Specification

Component | Specification
Hardware | Intel NUC 13 Pro, Core i7-1360P, 32GB DDR4, 2TB NVMe SSD
Alternative | NVIDIA Jetson Orin NX 16GB (for on-edge AI inference)
OS | Ubuntu 22.04 LTS Server, minimal install
Container Runtime | containerd (via K3s)
K8s Distribution | K3s v1.28+ (lightweight, single-node or 2-node HA)
Power | UPS-backed, auto-restart on power loss (BIOS setting)
Network | Dual Ethernet: one for local DVR segment, one for internet/VPN
Local Storage | 2TB NVMe for 7-day circular buffer of all 8 streams
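
The 7-day buffer on a 2TB drive can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the ~1.5GB/hour per channel rate quoted for the Edge Gateway Service in section 4.2.1, decimal gigabytes (2TB = 2000GB), and a hypothetical 10% reserve for the OS and filesystem overhead. The function names are not part of the platform.

```python
# Rough storage budget for the edge circular buffer.
# Assumptions (not measured values): 1.5 GB/hour per channel as quoted in
# section 4.2.1, decimal gigabytes, and a hypothetical 10% disk reserve.
GB_PER_HOUR_PER_CHANNEL = 1.5
CHANNELS = 8
RETENTION_DAYS = 7
DISK_GB = 2000  # 2 TB NVMe


def daily_gb(channels=CHANNELS, rate=GB_PER_HOUR_PER_CHANNEL):
    """Gigabytes written per day across all channels."""
    return rate * 24 * channels


def buffer_gb(days=RETENTION_DAYS, channels=CHANNELS, rate=GB_PER_HOUR_PER_CHANNEL):
    """Gigabytes needed to retain `days` of footage."""
    return daily_gb(channels, rate) * days


def max_retention_days(disk_gb=DISK_GB, reserve_fraction=0.10,
                       channels=CHANNELS, rate=GB_PER_HOUR_PER_CHANNEL):
    """Days of footage that fit once a reserve is set aside."""
    usable = disk_gb * (1 - reserve_fraction)
    return usable / daily_gb(channels, rate)


if __name__ == "__main__":
    print(f"per day:       {daily_gb():.0f} GB")
    print(f"7-day buffer:  {buffer_gb():.0f} GB of {DISK_GB} GB")
    print(f"max retention: {max_retention_days():.2f} days with 10% reserve")
```

Under these assumptions the 7-day buffer needs about 2016GB, slightly more than the nominal disk capacity, so the retention window, the recording bitrate, or the disk size may need adjusting once real bitrates are measured.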

2.3 Cloud Infrastructure Specification

Component | Specification
Region | Primary: ap-south-1 (Mumbai), DR: ap-southeast-1 (Singapore)
VPC | 10.100.0.0/16, 3 AZs, private subnets only for workloads
EKS | Managed node groups: on-demand for API, spot for batch processing
GPU Nodes | g4dn.xlarge (NVIDIA T4) for Triton inference, 1-4 nodes auto-scaled
ALB | Internet-facing, WAF v2 attached, Shield Advanced optional
RDS | PostgreSQL 16, db.r6g.xlarge, Multi-AZ, encrypted at rest
ElastiCache | Redis 7, cluster mode enabled, 2 shards × 2 replicas
MSK (Kafka) | 3 broker nodes, kafka.m5.large, 3 AZs
S3 | Standard (hot), IA (30 days), Glacier Deep Archive (1 year)

3. Network Security Zones

3.1 Security Zone Diagram

┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    SECURITY ZONES                                           │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │                        ZONE 0: INTERNET (UNTRUSTED)                             │       │
│  │                                                                                 │       │
│  │   Users/Browsers  ──►  AWS ALB (:443)  ──►  WAF  ──►  Rate Limit  ──►  Geo-Block │    │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                          │                                                  │
│                                          ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │                    ZONE 1: AWS VPC EDGE (DEMILITARIZED)                         │       │
│  │                                                                                 │       │
│  │   ALB ──► Traefik Ingress ──► Public API endpoints only                         │       │
│  │   Auth: JWT + RBAC, API key for edge gateway                                    │       │
│  │                                                                                 │       │
│  │   AWS ALB Security Group: Allow 443 from 0.0.0.0/0                             │       │
│  │   Traefik SG: Allow 8443 from ALB-SG only                                       │       │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                          │                                                  │
│                                          ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │              ZONE 2: AWS VPC APPLICATION (TRUSTED, ISOLATED)                    │       │
│  │                                                                                 │       │
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐   │       │
│  │   │ Stream Ing. │  │ AI Inference│  │ Suspicious  │  │  Training Service   │   │       │
│  │   │  Service    │  │  Service    │  │  Activity   │  │                     │   │       │
│  │   │             │  │             │  │  Service    │  │                     │   │       │
│  │   └─────────────┘  └─────────────┘  └─────────────┘  └─────────────────────┘   │       │
│  │                                                                                 │       │
│  │   Pod Security Admission: No root, read-only FS, no privilege escalation       │       │
│  │   Network Policies: Ingress only from API GW namespace, egress to data layer   │       │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                          │                                                  │
│                                          ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │                ZONE 3: AWS VPC DATA (HIGHLY RESTRICTED)                         │       │
│  │                                                                                 │       │
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐   │       │
│  │   │ PostgreSQL  │  │    Redis    │  │    Kafka    │  │       MinIO         │   │       │
│  │   │   (RDS)     │  │  (ElastiC.) │  │    (MSK)    │  │    (S3 API)         │   │       │
│  │   │   :5432     │  │   :6379     │  │   :9092     │  │    :9000            │   │       │
│  │   └─────────────┘  └─────────────┘  └─────────────┘  └─────────────────────┘   │       │
│  │                                                                                 │       │
│  │   Security Groups: Allow connections ONLY from Application Zone SGs              │       │
│  │   RDS: Encrypted (AWS KMS), no public access, IAM auth enabled                  │       │
│  │   S3: Bucket policy deny all except VPC endpoint, versioning enabled             │       │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                          │                                                  │
│                    WireGuard VPN Tunnel  │  (UDP 51820, ChaCha20-Poly1305)                │
│                                          │                                                  │
│                                          ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │              ZONE 4: EDGE NETWORK (PHYSICALLY ISOLATED)                         │       │
│  │                                                                                 │       │
│  │   ┌──────────────────────────┐          ┌─────────────────────────────────┐     │       │
│  │   │   EDGE GATEWAY AGENT     │          │     DVR (192.168.29.200)        │     │       │
│  │   │   - K3s node              │◄────────►│     NO INTERNET ACCESS          │     │       │
│  │   │   - WireGuard peer        │   :554   │     Firewall: DROP all non-local│     │       │
│  │   │   - Stream ingestion      │   :80    │                                 │     │       │
│  │   │   - Local buffer          │          │     Only 192.168.29.0/24 allowed│     │       │
│  │   └──────────────────────────┘          └─────────────────────────────────┘     │       │
│  │                                                                                 │       │
│  │   Edge Gateway Firewall (ufw):                                                  │       │
│  │   - ALLOW 192.168.29.0/24 → DVR ports (554, 80)                               │       │
│  │   - ALLOW OUT 51820/udp → Cloud VPN endpoint                                  │       │
│  │   - DENY ALL other incoming                                                   │       │
│  │   - No forwarding to local network from VPN (except explicit rules)            │       │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                                                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────┘

3.2 Firewall Rules

Edge Gateway (UFW)

# Default policy: deny inbound, allow outbound
ufw default deny incoming
ufw default allow outgoing

# Gateway-initiated connections to the DVR on the LAN interface (eth0).
# "default allow outgoing" already permits these; they are listed explicitly
# so they survive a later move to "default deny outgoing".
ufw allow out on eth0 to 192.168.29.200 port 554 proto tcp    # RTSP
ufw allow out on eth0 to 192.168.29.200 port 80 proto tcp     # HTTP (ONVIF)

# WireGuard VPN to cloud
ufw allow out on eth1 to <cloud-vpn-ip> port 51820 proto udp

# Local admin access (optional, from specific admin IP)
ufw allow from 192.168.29.10 to any port 22 proto tcp      # SSH from admin workstation

AWS Security Groups

Security Group | Ingress Rules | Egress Rules
alb-public-sg | TCP 443 from 0.0.0.0/0 | All to VPC
traefik-ingress-sg | TCP 8443 from alb-public-sg only | All to VPC
app-services-sg | TCP 8080-8090 from traefik-ingress-sg | All to data-sg
data-layer-sg | TCP 5432, 6379, 9092, 9000 from app-services-sg only | None
vpn-endpoint-sg | UDP 51820 from edge-gateway-ip/32 | All to VPC

4. Service Architecture

4.1 Service Interaction Diagram

┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                      SERVICE ARCHITECTURE                                       │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                                │
│   ┌──────────────────────────────────────────────────────────────────────────────────────┐     │
│   │                              API GATEWAY LAYER                                      │     │
│   │                                                                                      │     │
│   │   ┌─────────────────────────────────────────────────────────────────────────────┐   │     │
│   │   │  Traefik Ingress Controller (K8s)                                           │   │     │
│   │   │  - Route: /api/* → Backend Service                                          │   │     │
│   │   │  - Route: /ws/*  → WebSocket Handler (live video)                          │   │     │
│   │   │  - Route: /     → Next.js Web App                                           │   │     │
│   │   │  - TLS: Let's Encrypt automatic certificates                                │   │     │
│   │   │  - Middleware: rate limit (100 req/min per IP), JWT validation, CORS       │   │     │
│   │   └─────────────────────────────────────────────────────────────────────────────┘   │     │
│   └──────────────────────────────────────────────────────────────────────────────────────┘     │
│                                           │                                                    │
│                    ┌──────────────────────┼──────────────────────┐                             │
│                    │                      │                      │                             │
│                    ▼                      ▼                      ▼                             │
│   ┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐             │
│   │   BACKEND SERVICE    │   │    WEB FRONTEND      │   │   VIDEO PLAYBACK     │             │
│   │   (Go/Gin)           │   │   (Next.js 14)       │   │   SERVICE (Go)       │             │
│   │   :8080              │   │   :3000              │   │   :8085 (HLS)        │             │
│   │                      │   │                      │   │                      │             │
│   │  ┌──────────────┐   │   │  ┌──────────────┐   │   │  ┌──────────────┐   │             │
│   │  │ REST API     │   │   │  │ React SSR    │   │   │  │ HLS Segment  │   │             │
│   │  │ - /cameras   │   │   │  │ - Dashboard  │   │   │  │ Server       │   │             │
│   │  │ - /events    │   │   │  │ - Live View  │   │   │  │ - /live/:id  │   │             │
│   │  │ - /alerts    │   │   │  │ - Timeline   │   │   │  │ - /vod/:id   │   │             │
│   │  │ - /search    │   │   │  │ - Analytics  │   │   │  │ (DASH/HLS)   │   │             │
│   │  │ - /training  │   │   │  │ - Admin      │   │   │  └──────────────┘   │             │
│   │  └──────────────┘   │   │  └──────────────┘   │   │                      │             │
│   │  ┌──────────────┐   │   └──────────────────────┘   └──────────────────────┘             │
│   │  │ gRPC Client  │   │                                                                    │
│   │  │ (to AI svc)  │   │                                                                    │
│   │  └──────────────┘   │                                                                    │
│   └─────────────────────┘                                                                    │
│            │                                                                                  │
│            │ gRPC (:50051)                                                                    │
│            ▼                                                                                  │
│   ┌──────────────────────────────────────────────────────────────────────────────────────┐     │
│   │                           EVENT & MESSAGE BUS                                       │     │
│   │                                                                                      │     │
│   │   ┌─────────────────────────────────────────────────────────────────────────────┐   │     │
│   │   │  Apache Kafka (MSK)                                                          │   │     │
│   │   │                                                                             │   │     │
│   │   │  Topics:                                                                    │   │     │
│   │   │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────────┐│   │     │
│   │   │  │ streams.raw     │  │ ai.detections   │  │ alerts.critical             ││   │     │
│   │   │  │ (protobuf)      │  │ (JSON)          │  │ (JSON)                      ││   │     │
│   │   │  │ - 8 partitions  │  │ - 16 partitions │  │ - 4 partitions              ││   │     │
│   │   │  │ - 7-day reten.  │  │ - 30-day reten. │  │ - 90-day reten.             ││   │     │
│   │   │  └─────────────────┘  └─────────────────┘  └─────────────────────────────┘│   │     │
│   │   │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────────┐│   │     │
│   │   │  │ training.data   │  │ system.metrics  │  │ notifications.email         ││   │     │
│   │   │  │ (protobuf)      │  │ (JSON)          │  │ notifications.sms           ││   │     │
│   │   │  │ - 30-day reten. │  │ - 7-day reten.  │  │ notifications.push          ││   │     │
│   │   │  └─────────────────┘  └─────────────────┘  └─────────────────────────────┘│   │     │
│   │   └─────────────────────────────────────────────────────────────────────────────┘   │     │
│   │                                                                                      │     │
│   │   ┌─────────────────────────────────────────────────────────────────────────────┐   │     │
│   │   │  Redis Cluster (ElastiCache)                                                 │   │     │
│   │   │                                                                             │   │     │
│   │   │  Streams:  ┌─────────────────┐  Pub/Sub:  ┌──────────────────────┐        │   │     │
│   │   │            │ live:cam:{id}   │            │ alert:broadcast      │        │   │     │
│   │   │            │ (video chunks)  │            │ ws:session:*         │        │   │     │
│   │   │            │ cache:api:*     │            │ stream:status        │        │   │     │
│   │   │            └─────────────────┘            └──────────────────────┘        │   │     │
│   │   └─────────────────────────────────────────────────────────────────────────────┘   │     │
│   └──────────────────────────────────────────────────────────────────────────────────────┘     │
│            │                        │                        │                                 │
│            ▼                        ▼                        ▼                                 │
│   ┌──────────────┐        ┌──────────────┐        ┌──────────────────────┐                   │
│   │ STREAM ING.  │        │ AI INFERENCE │        │ SUSPICIOUS ACTIVITY  │                   │
│   │ SERVICE      │        │ SERVICE      │        │ SERVICE              │                   │
│   │ (Go/FFmpeg)  │        │ (Python/gRPC)│        │ (Go/Python)          │                   │
│   │ :8081        │        │ :8001 (Triton)│       │ :8083                │                   │
│   │              │        │              │        │                      │                   │
│   │┌────────────┐│        │┌────────────┐│        │┌────────────────────┐│                   │
│   ││RTSP Client ││        ││Triton Svr  ││        ││Night Mode Analyzer ││                   │
│   ││(ffmpeg)    ││        ││├─YOLOv8-det││        ││├─Motion detection  ││                   │
│   ││8 concurrent││        ││├─YOLOv8-face││       ││├─Loitering detect. ││                   │
│   ││streams     ││        ││├─ArcFace    ││       ││├─Perimeter breach   ││                   │
│   │├────────────┤│        ││└────────────┘│       ││├─Abandoned object  ││                   │
│   ││Frame Extrac││        ││Model Mgmt.  ││       ││├─Crowd detection   ││                   │
│   ││1 fps anal. ││        ││└────────────┘│       ││└────────────────────┘│                   │
│   │├────────────┤│        │├─────────────┤│       │├────────────────────┤│                   │
│   ││Kafka Produc.││        ││gRPC API     ││       ││Kafka Consumer      ││                   │
│   ││(raw frames) ││       ││- detect()   ││       ││(ai.detections)     ││                   │
│   │└─────────────┘│        ││- embed()    ││       │├────────────────────┤│                   │
│   │┌────────────┐ │        ││- compare()  ││       ││Rule Engine         ││                   │
│   ││MinIO Client│ │        │└─────────────┘│       ││├─Time-based rules  ││                   │
│   ││(video seg.)│ │        └──────────────┘       ││├─Zone-based rules  ││                   │
│   │└────────────┘ │                                ││├─Severity scoring  ││                   │
│   └───────────────┘                                │└────────────────────┘│                   │
│            ▲                                       └──────────────────────┘                   │
│            │ WireGuard VPN                                                                     │
│            │                                                                                   │
│   ┌──────────────┐                                                                            │
│   │ EDGE GATEWAY │                                                                            │
│   │ SERVICE      │                                                                            │
│   │ (Local)      │                                                                            │
│   └──────────────┘                                                                            │
│                                                                                               │
│            │                                                                                   │
│            ▼                                                                                   │
│   ┌──────────────────────────────────────────────────────────────────────────────────────┐     │
│   │                           DATA LAYER                                                │     │
│   │                                                                                      │     │
│   │   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │     │
│   │   │ PostgreSQL 16│  │  pgvector    │  │    MinIO     │  │   S3 (Cold Archive)  │   │     │
│   │   │  (RDS)       │  │  Extension   │  │  (on-prem +  │  │                      │   │     │
│   │   │              │  │              │  │   cloud)     │  │                      │   │     │
│   │   │ cameras      │  │ face_embed.  │  │ video-seg/   │  │  yearly archive      │   │     │
│   │   │ events       │  │  table       │  │ training-img │  │  compliance storage  │   │     │
│   │   │ alerts       │  │  (vector)    │  │              │  │                      │   │     │
│   │   │ audit_log    │  │              │  │ lifecycle:   │  │  lifecycle:          │   │     │
│   │   │ users        │  │ HNSW index   │  │ 7d local →   │  │  90d → Glacier       │   │     │
│   │   │ zones        │  │ cos similarity│ │ 30d cloud →  │  │  Deep Archive        │   │     │
│   │   └──────────────┘  └──────────────┘  │ 1yr archive  │  │                      │   │     │
│   │                                         └──────────────┘  └──────────────────────┘   │     │
│   └──────────────────────────────────────────────────────────────────────────────────────┘     │
│                                                                                                │
└────────────────────────────────────────────────────────────────────────────────────────────────┘

4.2 Service Specifications

4.2.1 Edge Gateway Service (Local)

Attribute | Specification
Runtime | Go 1.21, compiled binary
Deployment | Systemd service on Ubuntu + K3s for containerized components
Location | Intel NUC, physically on-site
Responsibilities | RTSP stream pull, local recording buffer, VPN tunnel endpoint, heartbeat to cloud
Ports | 8080 (HTTP admin), 51820 (WireGuard), 1935 (RTMP relay if needed)
Stream Protocol | RTSP over TCP (interleaved) from DVR at 192.168.29.200:554
Local Storage | 2TB NVMe, 7-day circular buffer, ~1.5GB/hour per channel = ~288GB/day for 8ch (~2.0TB over 7 days, which fills the disk)
Reconnect Policy | Exponential backoff: 1s → 2s → 4s → 8s → max 60s, reset on success
Heartbeat | Every 30s to cloud Stream Ingestion Service via VPN
Failover | Auto-restart via systemd (Restart=always, RestartSec=5)
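
The reconnect policy in the table (double the delay from 1s, cap at 60s, reset on success) can be sketched as follows. This is a minimal illustration in Python rather than the gateway's Go code; `backoff_delays` and `connect_with_backoff` are hypothetical names, and the `connect` and `sleep` callables are injected so the logic can be exercised without a real RTSP source.

```python
import itertools


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield 1, 2, 4, 8, ... seconds, never exceeding `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_backoff(connect, sleep, max_attempts=None):
    """Call `connect()` until it succeeds, sleeping between failures.

    A fresh delay sequence is created on every call, so a later outage
    starts again at the base delay ("reset on success").
    """
    delays = backoff_delays()
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except ConnectionError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(next(delays))


if __name__ == "__main__":
    # First eight delays of the schedule described in the table.
    print(list(itertools.islice(backoff_delays(), 8)))
```

Production reconnect loops often add random jitter to each delay so that several gateways do not retry in lockstep; that is omitted here because the specification does not mention it.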

4.2.2 Stream Ingestion Service (Cloud)

Attribute | Specification
Runtime | Go 1.21
Deployment | Kubernetes Deployment, 3 replicas minimum
Responsibilities | Receive frames from edge, decode, produce to Kafka, store segments to MinIO
Protocol | gRPC bidirectional streaming from Edge Gateway
Frame Rate | 1 fps for AI analysis (decimated from 25fps source)
Full Rate | 25 fps for event clips (triggered recordings)
Kafka Topic | streams.raw.{camera_id} (protobuf-encoded frame batches)
Video Segments | 10-second H.264 segments → MinIO bucket video-segments
Resource Request | 1 CPU, 2GB RAM per replica
HPA | 3-20 replicas based on CPU > 70%
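
To make the HPA row concrete, the sketch below applies the replica formula from the Kubernetes Horizontal Pod Autoscaler documentation, desired = ceil(current × currentMetric / targetMetric), with the 3-20 bounds and 70% CPU target from the table. The 0.1 tolerance is the Kubernetes default; the function name is hypothetical, and the real controller also applies stabilization windows and readiness handling that are not modeled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA scaling decision for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance of the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Stream Ingestion Service: 3-20 replicas, target 70% CPU.
    for replicas, cpu in [(3, 140), (3, 72), (18, 140), (5, 35)]:
        new = desired_replicas(replicas, cpu, 70, 3, 20)
        print(f"{replicas} replicas at {cpu}% CPU -> {new} replicas")
```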

4.2.3 AI Inference Service

Attribute Specification
Runtime NVIDIA Triton Inference Server 2.40+ (Docker)
Deployment Kubernetes Deployment on GPU nodes (g4dn.xlarge)
GPU NVIDIA T4 16GB, 1 GPU per replica
Models YOLOv8x (detection), YOLOv8x-face (face detection), ArcFace (face recognition/embedding)
Model Format TensorRT engines (.plan) for GPU optimization
gRPC API :8001 — Triton native gRPC
HTTP API :8000 — Triton native HTTP
Metrics :8002 — Prometheus metrics endpoint
Dynamic Batching Max batch size: 8 for detection, 16 for face embedding
Input JPEG frames (640×640, resized at the edge from 960×1080) from Kafka topic streams.raw.*
Output Detections (bbox, class, confidence) → Kafka ai.detections
Face Embeddings 512-dim float32 vectors → pgvector (PostgreSQL)
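Resource 1× T4 GPU, 4 CPU, 16GB RAM per replica (the full capacity of a g4dn.xlarge; actual pod requests must be slightly lower to leave room for kubelet and system reservations)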
HPA 1-4 replicas based on GPU utilization > 80% and Kafka consumer lag
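
The scaling rule above combines two signals. The sketch below applies the standard Horizontal Pod Autoscaler formula, desired = ceil(current × metric / target), to each metric and takes the largest proposal, clamped to the 1-4 replica range. The Kafka lag target of 1000 messages is a hypothetical value chosen for the example, and the HPA tolerance band is omitted for brevity. GPU utilization and consumer lag are not built-in HPA resource metrics, so in practice they would have to be exposed through a custom or external metrics adapter.

```python
import math


def desired_replicas(current, metrics, min_replicas=1, max_replicas=4):
    """Return the replica count proposed by the HPA formula.

    `metrics` is a list of (current_value, target_value) pairs. Each pair
    yields its own proposal; the largest one wins, then bounds are applied.
    """
    proposals = [math.ceil(current * value / target) for value, target in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


# 2 replicas, GPU at 90% against an 80% target, lag well under its target.
print(desired_replicas(2, [(90, 80), (500, 1000)]))
```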

4.2.4 Suspicious Activity Service (Night Mode)

Attribute Specification
Runtime Python 3.11 (OpenCV, scikit-learn) + Go orchestrator
Deployment Kubernetes Deployment, 2-8 replicas
Input Kafka topic ai.detections + streams.raw.* for motion analysis
Responsibilities Night-mode analysis, loitering detection, perimeter breach, abandoned object, crowd detection
Rules Engine YAML-configured rules per camera, per time schedule
Night Schedule Configurable (default: 22:00 - 06:00), overrides day-mode sensitivity
Output Scored alerts → Kafka alerts.critical + PostgreSQL alerts table
ML Models Background subtraction (MOG2), optical flow for motion tracking, Kalman filters for object tracking
Resource Request 2 CPU, 4GB RAM per replica
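
Because the default night schedule (22:00 - 06:00) crosses midnight, a plain range comparison is not enough. A minimal sketch of the check, with the hypothetical helper name `in_night_window`:

```python
from datetime import time


def in_night_window(now, start=time(22, 0), end=time(6, 0)):
    """Return True if `now` falls inside [start, end), wrapping past midnight."""
    if start <= end:
        # Window lies within a single day, e.g. 09:00 - 17:00.
        return start <= now < end
    # Window wraps around midnight, e.g. 22:00 - 06:00.
    return now >= start or now < end


print(in_night_window(time(23, 30)), in_night_window(time(12, 0)))
```

The end of the window is treated as exclusive here so that 06:00 already counts as day mode; the specification does not state which way the boundary falls.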

4.2.5 Training Service

Attribute Specification
Runtime Python 3.11, PyTorch 2.1, NVIDIA CUDA 12.1
Deployment Kubernetes Job/CronJob, runs on GPU spot instances
Responsibilities Model retraining, fine-tuning on collected data, A/B model validation
Trigger Weekly scheduled (Sunday 02:00) or manual (API call)
Data Source MinIO bucket training-data (curated positive/negative samples)
Output New TensorRT engines → MinIO bucket model-artifacts
A/B Rollout Blue/green model deployment via Triton model repository
Validation mAP > 0.85 required before promotion to production
Resource 1× V100 GPU (spot), 8 CPU, 32GB RAM

4.2.6 API Gateway / Backend Service

Attribute Specification
Runtime Go 1.21, Gin framework
Deployment Kubernetes Deployment, 3-10 replicas
Protocol HTTP/2, REST API + WebSocket for live updates
Authentication JWT (RS256), access token 15min, refresh token 7 days
Authorization RBAC: admin, operator, viewer roles
Rate Limiting 100 req/min per IP, 1000 req/min per API key
Endpoints See API Specification below
Caching Redis for session store and API response caching (TTL 60s)
Resource Request 0.5 CPU, 1GB RAM per replica

API Endpoints:

Endpoint                          Method  Description                         Auth
/api/v1/auth/login                POST    User authentication                 Public
/api/v1/auth/refresh              POST    Token refresh                       Public
/api/v1/cameras                   GET     List all cameras                    Viewer+
/api/v1/cameras/{id}              GET     Camera details                      Viewer+
/api/v1/cameras/{id}/live         GET     Live stream URL (HLS)               Viewer+
/api/v1/events                    GET     Query events (paginated, filtered)  Viewer+
/api/v1/events/{id}               GET     Event details with snapshot         Viewer+
/api/v1/alerts                    GET     List alerts                         Viewer+
/api/v1/alerts/{id}/ack           POST    Acknowledge alert                   Operator+
/api/v1/search/faces              POST    Face search by image                Operator+
/api/v1/search/faces/{embedding}  GET     Similar face lookup                 Operator+
/api/v1/training/upload           POST    Upload training samples             Admin
/api/v1/training/jobs             GET     List training jobs                  Admin
/api/v1/zones                     CRUD    Perimeter zones per camera          Admin
/api/v1/reports/daily             GET     Daily activity report               Viewer+
/api/v1/system/health             GET     System health status                Internal
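
The Auth column implies a role hierarchy in which "Viewer+" means viewer or any higher role. One way to express that check is sketched below. The ranking table and the `is_allowed` helper are illustrative assumptions, as is the choice to deny "Internal" endpoints to every user role.

```python
# Higher number = more privileges. Roles come from the RBAC row above.
ROLE_RANK = {"viewer": 1, "operator": 2, "admin": 3}


def is_allowed(user_role, required):
    """Check a user's role against an Auth column value.

    `required` is one of "Public", "Viewer+", "Operator+", "Admin" or
    "Internal". `user_role` is None for unauthenticated callers.
    """
    if required == "Public":
        return True
    needed = required.rstrip("+").lower()
    if needed not in ROLE_RANK or user_role not in ROLE_RANK:
        # Unknown requirement (e.g. "Internal") or anonymous caller.
        return False
    return ROLE_RANK[user_role] >= ROLE_RANK[needed]


print(is_allowed("operator", "Viewer+"), is_allowed("viewer", "Operator+"))
```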

4.2.7 Web Frontend

Attribute Specification
Framework Next.js 14 (App Router), React 18, TypeScript
Styling Tailwind CSS + shadcn/ui components
State Management Zustand (client), React Query (server)
Video Player HLS.js for live stream playback, Video.js for VOD
Maps MapLibre GL JS (open source, no API key required) for camera geolocation
Real-time WebSocket connection for alert notifications
Build Output Static export → served via CDN (CloudFront)
PWA Service worker for offline dashboard viewing

4.2.8 Notification Service

Attribute Specification
Runtime Go 1.21
Deployment Kubernetes Deployment, 2-5 replicas
Input Kafka topic alerts.critical
Channels Email (SMTP/AWS SES), SMS (Twilio/AWS SNS), Push (Firebase FCM), Webhook
Templates HTML email templates with event snapshot attachment
Rate Limiting Max 1 SMS per phone per 5 minutes; max 10 emails per address per hour
Retry Policy 3 retries with exponential backoff for each channel; message goes to a dead-letter queue after the final retry fails
Escalation Unacknowledged critical alerts escalate after 15 minutes (to admin)
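
The per-recipient limits above (1 SMS per phone per 5 minutes, 10 emails per address per hour) can be modelled with a sliding-window counter. The sketch below keeps state in memory for clarity; the class name `RecipientThrottle` is hypothetical, and a multi-replica deployment would presumably hold this state in Redis instead.

```python
from collections import defaultdict, deque


class RecipientThrottle:
    """Allow at most `max_sends` per recipient within `window_seconds`."""

    def __init__(self, max_sends, window_seconds):
        self.max_sends = max_sends
        self.window = window_seconds
        self._sent = defaultdict(deque)  # recipient -> send timestamps

    def allow(self, recipient, now):
        """Record and permit a send at time `now` (seconds), or refuse it."""
        history = self._sent[recipient]
        # Drop sends that have aged out of the window.
        while history and now - history[0] >= self.window:
            history.popleft()
        if len(history) >= self.max_sends:
            return False
        history.append(now)
        return True


sms = RecipientThrottle(max_sends=1, window_seconds=5 * 60)
email = RecipientThrottle(max_sends=10, window_seconds=60 * 60)
print(sms.allow("+910000000000", now=0), sms.allow("+910000000000", now=120))
```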

4.2.9 Database — PostgreSQL 16 (RDS)

Attribute Specification
Instance db.r6g.xlarge (4 vCPU, 32GB RAM)
Storage 500GB gp3, auto-scaling to 2TB
Multi-AZ Enabled for production
Extensions pgvector (face embeddings), PostGIS (zone geometry), pg_stat_statements
Backup Daily automated, 35-day retention
Read Replica 1 read replica for analytics queries

Schema Overview:

-- Core tables
cameras (id, name, dvr_channel, rtsp_url, location, status, created_at)
events (id, camera_id, event_type, confidence, bounding_box, snapshot_path, 
        start_time, end_time, severity, metadata JSONB, created_at)
alerts (id, event_id, rule_id, severity, status [new|ack|resolved], 
        acknowledged_by, acked_at, notification_channels, created_at)
face_embeddings (id, person_name, embedding vector(512), camera_id, 
                 first_seen, last_seen, occurrence_count, metadata JSONB)
users (id, username, password_hash, role, email, phone, created_at)
alert_rules (id, camera_id, rule_type, config JSONB, schedule JSONB, 
             severity, enabled, created_at)
audit_log (id, user_id, action, resource, details JSONB, ip_address, created_at)
perimeter_zones (id, camera_id, name, polygon GEOMETRY(POLYGON), 
                 alert_on_enter, alert_on_exit, schedule, created_at)

4.2.10 Object Storage — MinIO + S3

Attribute Specification
Local (Edge) MinIO single-node, 2TB NVMe, 7-day retention
Cloud (Primary) MinIO distributed cluster on EKS, 10TB initial, auto-scaling
Archive AWS S3 with lifecycle: Standard → IA (30d) → Glacier Deep Archive (365d)
API S3-compatible, same SDK for all tiers
Buckets video-segments (10s segments), event-clips (triggered recordings), training-data (curated samples), snapshots (JPEG event frames), model-artifacts (TensorRT engines)
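
A quick sizing check for the edge tier, using the ~1.5GB/hour per channel figure from the Edge Gateway specification. The 90% usable-capacity fraction is an assumption made for this example, not a value from the design.

```python
def buffer_size_gb(gb_per_hour, channels, days):
    """Storage needed to retain `days` of continuous recording."""
    return gb_per_hour * 24 * channels * days


def max_retention_days(disk_gb, gb_per_hour, channels, usable_fraction=0.9):
    """Days of recording that fit on a disk, keeping some space free."""
    return disk_gb * usable_fraction / (gb_per_hour * 24 * channels)


per_day = buffer_size_gb(1.5, channels=8, days=1)    # 288 GB per day
per_week = buffer_size_gb(1.5, channels=8, days=7)   # 2016 GB per week
fits = max_retention_days(2000, 1.5, channels=8)     # 6.25 days at 90% fill
print(per_day, per_week, fits)
```

At the stated bitrate a full 7 days comes to roughly 2.0TB, which matches the raw capacity of the 2TB drive, so either the circular buffer trims slightly earlier than 7 days or the recording bitrate needs to be lower than 1.5GB/hour.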

4.2.11 Redis Cluster

Attribute Specification
Type ElastiCache for Redis, cluster mode enabled
Node Type cache.r6g.large per shard
Shards 2 shards, 2 replicas per shard
Max Memory Policy allkeys-lru (evict least recently used)
Persistence AOF everysec, RDB every 60min
Use Cases Session store, API cache, real-time pub/sub, stream position tracking

4.2.12 Vector Store (pgvector)

Attribute Specification
Integration PostgreSQL extension (same RDS instance)
Table face_embeddings with embedding vector(512) column
Index HNSW (hierarchical navigable small world) for approximate nearest neighbor
Index Parameters m = 16, ef_construction = 64
Similarity Metric Cosine distance (<=> operator; cosine similarity = 1 - distance, so smaller values are closer matches)
Query SELECT * FROM face_embeddings ORDER BY embedding <=> $1 LIMIT 10
Expected Volume ~1M vectors per year (8 cameras)
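
To make the ordering in the query above concrete: pgvector's <=> operator returns cosine distance, which is 1 minus cosine similarity, so ascending order puts the best matches first. The sketch below reproduces that ordering with an exact brute-force scan in plain Python. It illustrates the metric only; the HNSW index returns approximate results, and the two-dimensional vectors stand in for the 512-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, limit=10):
    """Mimic ORDER BY embedding <=> query LIMIT n over (id, embedding) rows."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))[:limit]


rows = [("a", [0.0, 1.0]), ("b", [2.0, 0.0]), ("c", [1.0, 1.0])]
print([row_id for row_id, _ in nearest([1.0, 0.0], rows, limit=2)])
```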

5. Data Flow Design

5.1 Complete Data Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                      DATA FLOW                                                  │
├─────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                                 │
│  LAYER 1: CAPTURE & INGESTION                                                                   │
│  ══════════════════════════                                                                     │
│                                                                                                 │
│   CAMERAS (8ch) ──► DVR (192.168.29.200) ──► RTSP (:554) ──► EDGE GATEWAY (192.168.29.5)      │
│                                                                                                 │
│   Camera → BNC coax → DVR encoder → H.264 stream → RTSP server (DVR builtin)                  │
│                                                                                                 │
│   Edge Gateway pulls 8 concurrent RTSP streams:                                                 │
│   rtsp://192.168.29.200:554/user=admin&password=&channel=1&stream=0.sdp?                       │
│   rtsp://192.168.29.200:554/user=admin&password=&channel=2&stream=0.sdp?                       │
│   ... (channels 1-8)                                                                           │
│                                                                                                 │
│   ┌─────────────────────────────────────────────────────────────────────────┐                  │
│   │  EDGE GATEWAY PROCESSING per stream:                                     │                  │
│   │  1. FFmpeg demux → raw H.264 Annex-B frames                            │                  │
│   │  2. Segment into 10s chunks → local MinIO (circular buffer)            │                  │
│   │  3. Extract 1 fps JPEG frames (960×1080 → 640×640 resize for AI)       │                  │
│   │  4. Protobuf-encode frame batches                                        │                  │
│   │  5. Send via gRPC bidirectional stream over WireGuard VPN                │                  │
│   └─────────────────────────────────────────────────────────────────────────┘                  │
│                                          │                                                      │
│                    ┌─────────────────────┼─────────────────────┐                                │
│                    │                     │                     │                                │
│                    ▼                     ▼                     ▼                                │
│   ┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐                 │
│   │ PATH A: LIVE VIDEO  │   │ PATH B: AI ANALYSIS │   │ PATH C: RECORDING   │                 │
│   │ (WebRTC/HLS path)   │   │ (detection pipeline)│   │ (event archival)    │                 │
│   └─────────────────────┘   └─────────────────────┘   └─────────────────────┘                 │
│                                                                                                 │
│  LAYER 2: STREAM PROCESSING (Cloud)                                                             │
│  ══════════════════════════════════                                                             │
│                                                                                                 │
│  PATH A: LIVE VIDEO ────────────────────────────────────────────────────────                    │
│                                                                                                 │
│   Edge Gateway ──► Stream Ing. Svc ──► Redis Stream (live:cam:{id}) ──► HLS Segment Svc      │
│        RTSP              (decode)           (pub/sub buffer)              (m3u8 + .ts)        │
│                                                                            │                    │
│                                                                            ▼                    │
│                                                                     CloudFront CDN              │
│                                                                            │                    │
│                                                                            ▼                    │
│                                                                    Web Browser (HLS.js)        │
│                                                                                                 │
│  PATH B: AI ANALYSIS ───────────────────────────────────────────────────────                    │
│                                                                                                 │
│   Stream Ing. Svc ──► Kafka (streams.raw.{cam}) ──► AI Inference Svc (Triton)                 │
│   (frame batches)         (ordered, partitioned)         (YOLOv8 + ArcFace)                    │
│                                                              │                                  │
│                                    ┌─────────────────────────┼─────────────────────────┐        │
│                                    │                         │                         │        │
│                                    ▼                         ▼                         ▼        │
│                            ┌──────────────┐         ┌──────────────┐         ┌──────────────┐  │
│                            │ Detections   │         │  Face Emb.   │         │  Stream to   │  │
│                            │ (person,     │         │  (512-dim)   │         │  Suspicious  │  │
│                            │  vehicle)    │         │              │         │  Activity    │  │
│                            └──────┬───────┘         └──────┬───────┘         │  Service     │  │
│                                   │                        │                 └──────┬───────┘  │
│                                   ▼                        ▼                        │          │
│                            Kafka (ai.              PostgreSQL                 Kafka  │          │
│                            detections)             (pgvector)                 alerts  │          │
│                                                                                 .critical      │
│                                                                                                 │
│  PATH C: RECORDING ─────────────────────────────────────────────────────────                    │
│                                                                                                 │
│   Edge Gateway ──► Local MinIO (7d) ──► Sync ──► Cloud MinIO ──► S3 Lifecycle → Glacier       │
│   (10s segments)      (hot buffer)    (daily)     (30d hot)        (1yr archive)               │
│                                                                                                 │
│  LAYER 3: EVENT PROCESSING                                                                      │
│  ═════════════════════════                                                                      │
│                                                                                                 │
│   AI Inference ──► Kafka (ai.detections) ──► Suspicious Activity Svc                           │
│   Output              - bbox, class, conf          - Rule engine evaluation                      │
│                       - timestamp                  - Loitering detection                         │
│                       - camera_id                  - Perimeter breach check                      │
│                       - embedding_id               - Crowd counting                              │
│                                                    - Time-of-day scoring                         │
│                                                          │                                      │
│                                    ┌─────────────────────┼─────────────────────┐                │
│                                    │                     │                     │                │
│                                    ▼                     ▼                     ▼                │
│                            ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐   │
│                            │ PostgreSQL   │     │   Kafka      │     │  Notification Svc    │   │
│                            │ (alerts)     │     │ (alerts.     │     │  - Email (SES)       │   │
│                            │ (events)     │     │  critical)   │     │  - SMS (Twilio)      │   │
│                            └──────────────┘     └──────────────┘     │  - Push (FCM)        │   │
│                                                                      │  - Webhook           │   │
│                                                                      └──────────────────────┘   │
│                                                                                                 │
│  LAYER 4: CONSUMPTION                                                                           │
│  ════════════════════                                                                           │
│                                                                                                 │
│   Web Frontend ──► API Gateway ──► Backend Service ──► PostgreSQL/Redis/MinIO                 │
│   (Next.js)          (Traefik)      (Go/Gin)            (data queries)                         │
│      │                                                                        │                │
│      │  ┌─────────────────────────────────────────────────────────────────┐   │                │
│      │  │  DASHBOARD VIEWS:                                               │   │                │
│      │  │  - Live View: HLS.js + WebSocket for alert overlay              │   │                │
│      │  │  - Event Timeline: Infinite scroll, filters                     │   │                │
│      │  │  - Alert Management: Ack/Nack, assignment                       │   │                │
│      │  │  - Face Search: Upload photo → pgvector similarity search        │   │                │
│      │  │  - Analytics: Time-series charts (event frequency, heatmaps)    │   │                │
│      │  │  - Settings: Camera config, zone drawing, rule management       │   │                │
│      │  └─────────────────────────────────────────────────────────────────┘   │                │
│      │                                                                        │                │
│      └─────────────────────────── WebSocket: /ws/alerts ───────────────────────┘                │
│                                    (real-time alert push)                                       │
│                                                                                                 │
│  LAYER 5: TRAINING DATA FLOW                                                                    │
│  ═══════════════════════════                                                                    │
│                                                                                                 │
│   Events (false positive) ──► Admin review ──► "Add to Training" ──► MinIO (training-data)    │
│   Events (missed detect)  ──► Manual upload ──► Labeling UI ──► Curated dataset               │
│                                                                     │                           │
│                                                                     ▼                           │
│                                                          Training Service (weekly CronJob)      │
│                                                          - Load dataset from MinIO              │
│                                                          - Fine-tune YOLOv8 weights             │
│                                                          - Convert to TensorRT engine           │
│                                                          - Validate mAP > 0.85                  │
│                                                                │                                │
│                                                                ▼                                │
│                                                          Model Registry (MinIO)                 │
│                                                          - Blue/green deployment                │
│                                                          - Triton model repository              │
│                                                                │                                │
│                                                                ▼                                │
│                                                          AI Inference Svc (rolling update)      │
│                                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘

5.2 Stream Flow Detail

┌─────────────────────────────────────────────────────────────────────────────┐
│                     VIDEO STREAM FLOW (Per Camera)                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Camera ──► DVR Encoder ──► RTSP Stream ──► Edge Gateway                  │
│                              (H.264, 25fps,                               │
│                               960x1080)                                     │
│                                                                             │
│                              Edge Gateway Processing:                       │
│                              ┌─────────────────────────────────────┐       │
│                              │ 1. FFmpeg process per channel       │       │
│                              │    -i rtsp://dvr/ch{N}              │       │
│                              │    -c:v copy -f segment             │       │
│                              │    -segment_time 10                 │       │
│                              │    /recordings/ch{N}/%d.ts          │       │
│                              │                                     │       │
│                              │ 2. Parallel: 1 fps extraction       │       │
│                              │    -vf fps=1,scale=640:640          │       │
│                              │    -f image2pipe -vcodec mjpeg      │       │
│                              │    → AI pipeline                    │       │
│                              └─────────────────────────────────────┘       │
│                                          │                                  │
│                    ┌─────────────────────┼─────────────────────┐           │
│                    │                     │                     │           │
│                    ▼                     ▼                     ▼           │
│            ┌──────────────┐      ┌──────────────┐      ┌──────────────┐   │
│            │ Local Buffer │      │ AI Frames    │      │ Cloud Upload │   │
│            │ (7-day ring) │      │ (1 fps JPEG) │      │ (10s chunks) │   │
│            └──────────────┘      └──────┬───────┘      └──────┬───────┘   │
│                                         │                     │           │
│                    ┌────────────────────┘                     │           │
│                    │ WireGuard VPN                              │           │
│                    ▼                                          ▼           │
│           ┌────────────────┐                        ┌────────────────┐    │
│           │ Cloud Stream   │                        │ Cloud MinIO    │    │
│           │ Ingestion Svc  │                        │ (30-day hot)   │    │
│           └───────┬────────┘                        └────────────────┘    │
│                   │                                                        │
│                   ▼                                                        │
│     ┌────────────────────────┐                                             │
│     │  Kafka (streams.raw)   │                                             │
│     │  Partition = camera_id │                                             │
│     │  Guarantees ordering   │                                             │
│     │  per camera            │                                             │
│     └───────────┬────────────┘                                             │
│                 │                                                          │
│    ┌────────────┼────────────┐                                             │
│    │            │            │                                             │
│    ▼            ▼            ▼                                             │
│ ┌──────┐   ┌──────┐   ┌──────────┐                                       │
│ │ AI   │   │ HLS  │   │ Recording│                                       │
│ │ Inf. │   │ Seg. │   │ Archival │                                       │
│ │ Svc  │   │ Svc  │   │ Svc      │                                       │
│ └──────┘   └──────┘   └──────────┘                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
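
The two FFmpeg outputs in the diagram above (10-second segments copied without re-encoding, plus a 1 fps JPEG stream for the AI pipeline) can be produced by a single process per channel. The sketch below only assembles the argument list; the helper name `ffmpeg_args` and the use of stdout (`pipe:1`) for the JPEG stream are assumptions, and the RTSP URL is passed in because the diagram shows only a placeholder.

```python
def ffmpeg_args(rtsp_url, channel, recordings_dir="/recordings"):
    """Build an FFmpeg command line with two outputs for one camera channel."""
    return [
        "ffmpeg",
        # RTSP over TCP (interleaved), as specified for the edge gateway.
        "-rtsp_transport", "tcp",
        "-i", rtsp_url,
        # Output 1: stream copy into 10-second MPEG-TS segments.
        "-c:v", "copy", "-f", "segment", "-segment_time", "10",
        f"{recordings_dir}/ch{channel}/%d.ts",
        # Output 2: 1 fps, 640x640 JPEG frames written to stdout.
        "-vf", "fps=1,scale=640:640", "-f", "image2pipe", "-vcodec", "mjpeg",
        "pipe:1",
    ]


print(" ".join(ffmpeg_args("rtsp://dvr/ch1", channel=1)))
```

The list form is intended for `subprocess.Popen(args, stdout=subprocess.PIPE)`, which avoids shell quoting problems with the `&` characters in the DVR's RTSP URLs.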

5.3 Event/Alert Flow Detail

┌─────────────────────────────────────────────────────────────────────────────┐
│                      EVENT & ALERT FLOW                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  AI Inference Output:                                                       │
│  {                                                                         │
│    "camera_id": "cam_001",                                                  │
│    "timestamp": "2025-01-20T14:30:00Z",                                     │
│    "detections": [                                                          │
│      { "class": "person", "confidence": 0.94,                               │
│        "bbox": [120, 340, 280, 560], "track_id": 42 },                     │
│      { "class": "face", "confidence": 0.89,                                 │
│        "bbox": [150, 360, 200, 420], "embedding_id": "emb_12345" }         │
│    ]                                                                        │
│  }                                                                          │
│                                                                             │
│         │                                                                   │
│         ▼                                                                   │
│  ┌─────────────────────────────────────────┐                               │
│  │  Kafka Topic: ai.detections             │                               │
│  │  (JSON, 16 partitions)                  │                               │
│  └─────────────┬───────────────────────────┘                               │
│                │                                                            │
│    ┌───────────┴───────────┐                                               │
│    │                       │                                               │
│    ▼                       ▼                                               │
│ ┌──────────┐        ┌──────────────┐                                      │
│ │ Face     │        │ Suspicious   │                                      │
│ │ Matching │        │ Activity Svc │                                      │
│ │ (pgvector│        │              │                                      │
│ │  search) │        │ Rule Eval:   │                                      │
│ │          │        │ - Night mode?│                                      │
│ └────┬─────┘        │ - Zone       │                                      │
│      │              │   overlap?   │                                      │
│      │              │ - Loitering  │                                      │
│      │              │   > 5 min?   │                                      │
│      │              │ - Crowd      │                                      │
│      │              │   > 5 ppl?   │                                      │
│      │              └──────┬───────┘                                      │
│      │                     │                                              │
│      │    MATCH FOUND      │ ALERT TRIGGERED                             │
│      │         │           │                                              │
│      ▼         ▼           ▼                                              │
│  ┌──────────────────────────────────────────┐                            │
│  │  PostgreSQL                             │                            │
│  │  - events table (all detections)        │                            │
│  │  - alerts table (triggered alerts)      │                            │
│  │  - face_embeddings (if new/matched)     │                            │
│  └───────────────────┬──────────────────────┘                            │
│                      │                                                    │
│          ┌───────────┼───────────┐                                       │
│          │           │           │                                       │
│          ▼           ▼           ▼                                       │
│   ┌──────────┐ ┌──────────┐ ┌──────────────┐                           │
│   │ WebSocket│ │ Kafka    │ │ Notification │                           │
│   │ Push     │ │ alerts.  │ │ Service      │                           │
│   │ (live    │ │ critical │ │              │                           │
│   │  update) │ │          │ │ - Email      │                           │
│   └──────────┘ └──────────┘ │ - SMS        │                           │
│                             │ - Push       │                           │
│                             │ - Webhook    │                           │
│                             └──────────────┘                           │
│                                                                             │
│  Alert Lifecycle:                                                           │
│  DETECTED → NEW (insert) → WebSocket push → NOTIFY → ACK/RESOLVE          │
│                                ↓                                            │
│                           If unacked 15min → ESCALATE to admin             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
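
The rule evaluation step in the flow above can be sketched against the sample payload. This is a simplified illustration that covers only the crowd (> 5 people) and loitering (> 5 minutes) checks; the class name `RuleEvaluator`, the alert labels, and the in-memory track table are hypothetical, and night-mode and zone-overlap rules are left out.

```python
import json
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


class RuleEvaluator:
    def __init__(self, crowd_limit=5, loiter_limit=timedelta(minutes=5)):
        self.crowd_limit = crowd_limit
        self.loiter_limit = loiter_limit
        self._first_seen = {}  # (camera_id, track_id) -> first timestamp

    def evaluate(self, raw):
        """Return alert labels for one ai.detections message (JSON text)."""
        message = json.loads(raw)
        ts = parse_ts(message["timestamp"])
        camera = message["camera_id"]
        people = [d for d in message["detections"] if d["class"] == "person"]
        alerts = []
        if len(people) > self.crowd_limit:
            alerts.append("crowd")
        for det in people:
            track_id = det.get("track_id")
            if track_id is None:
                continue
            # Remember when each track first appeared on this camera.
            first = self._first_seen.setdefault((camera, track_id), ts)
            if ts - first > self.loiter_limit:
                alerts.append(f"loitering:{track_id}")
        return alerts


evaluator = RuleEvaluator()
sample = {"camera_id": "cam_001", "timestamp": "2025-01-20T14:30:00Z",
          "detections": [{"class": "person", "confidence": 0.94,
                          "bbox": [120, 340, 280, 560], "track_id": 42}]}
print(evaluator.evaluate(json.dumps(sample)))
```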

5.4 Live Video to Browser Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                   LIVE VIDEO TO BROWSER FLOW                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐            │
│  │   BROWSER    │      │  CloudFront  │      │   EKS HLS    │            │
│  │              │      │   CDN        │      │   Service    │            │
│  │  ┌────────┐  │      │              │      │              │            │
│  │  │HLS.js  │  │      │              │      │  ┌────────┐  │            │
│  │  │Player  │◄─┼──────┼─── m3u8 ─────┼──────┼──│Playlist│  │            │
│  │  │        │  │      │   + .ts      │      │  │Builder │  │            │
│  │  └────────┘  │      │   segments   │      │  └────┬───┘  │            │
│  │              │      │              │      │       │      │            │
│  │  WebSocket ──┼──────┼──────────────┼──────┼───────┘      │            │
│  │  /ws/alerts  │      │              │      │              │            │
│  └──────────────┘      └──────────────┘      └──────┬───────┘            │
│                                                     │                      │
│                                                     ▼                      │
│                                            ┌────────────────┐             │
│                                            │  Redis Stream  │             │
│                                            │  live:cam:{id} │             │
│                                            │                │             │
│                                            │  ┌──────────┐  │             │
│                                            │  │Segment 1 │  │             │
│                                            │  │Segment 2 │  │             │
│                                            │  │Segment 3 │──┼──► FIFO     │
│                                            │  └──────────┘  │   (keep 30)  │
│                                            └───────┬────────┘             │
│                                                    │                       │
│                              ┌─────────────────────┘                       │
│                              │ WireGuard VPN                               │
│                              ▼                                              │
│                    ┌──────────────────┐                                    │
│                    │  Edge Gateway    │                                    │
│                    │  FFmpeg → HLS    │                                    │
│                    │  segmenter       │                                    │
│                    └──────────────────┘                                    │
│                                                                             │
│  Latency Budget:                                                            │
│  - DVR encoding:  ~100ms                                                    │
│  - RTSP to Edge:  ~50ms                                                     │
│  - VPN tunnel:    ~30-80ms (depending on internet)                         │
│  - Cloud HLS svc: ~50ms                                                     │
│  - CDN delivery:  ~20-100ms                                                 │
│  - Player buffer: 3-6 segments (~30-60s behind real-time)                  │
│  TOTAL LIVE LATENCY: ~30-60 seconds (player buffer dominates; HLS inherent) │
│                                                                             │
│  For lower latency: WebRTC mode (optional future):                          │
│  - Target: < 2 seconds using WHIP/WHEP                                      │
│  - Requires direct edge-to-browser or TURN relay                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
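
The latency budget above can be checked by summing its itemized components. The sketch below is a minimal worked calculation, not part of the platform code: the dictionary keys are illustrative names, and the 10-second segment length is an assumption taken from the segment size used elsewhere in this document. It shows that the transport hops contribute well under one second and that the player buffer accounts for nearly all of the delay.

```python
# Itemized components from the latency budget, as (min, max) in seconds.
TRANSPORT_S = {
    "dvr_encoding": (0.100, 0.100),
    "rtsp_to_edge": (0.050, 0.050),
    "vpn_tunnel": (0.030, 0.080),
    "cloud_hls_service": (0.050, 0.050),
    "cdn_delivery": (0.020, 0.100),
}
SEGMENT_SECONDS = 10        # assumed: 10 s HLS segments
BUFFERED_SEGMENTS = (3, 6)  # player buffer depth, in segments


def transport_range_s():
    """Sum the network and processing hops, excluding the player buffer."""
    low = sum(lo for lo, _ in TRANSPORT_S.values())
    high = sum(hi for _, hi in TRANSPORT_S.values())
    return low, high


def total_range_s():
    """Add the player buffer (segments x segment length) to the transport sum."""
    low, high = transport_range_s()
    return (low + BUFFERED_SEGMENTS[0] * SEGMENT_SECONDS,
            high + BUFFERED_SEGMENTS[1] * SEGMENT_SECONDS)


if __name__ == "__main__":
    t_low, t_high = transport_range_s()
    low, high = total_range_s()
    print(f"transport: {t_low:.2f}-{t_high:.2f} s")
    print(f"total:     {low:.2f}-{high:.2f} s")
```

Because the buffer term scales with segment length, shortening segments or buffering fewer of them is the only change in this model that moves the total noticeably.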

5.5 Training Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                      TRAINING DATA FLOW                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  SOURCE 1: Automatic (False Positive Detection)                            │
│  ──────────────────────────────────────────────                             │
│                                                                             │
│  AI Inference → Confidence 0.3-0.6 range → Flag as "uncertain"            │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────┐                               │
│  │  MinIO bucket: training-data/auto/      │                               │
│  │  - Original frame (JPEG)                │                               │
│  │  - Inference result (JSON)              │                               │
│  │  - Flagged for review                   │                               │
│  └─────────────────────────────────────────┘                               │
│                                                                             │
│  SOURCE 2: Manual (Operator Upload)                                        │
│  ──────────────────────────────────                                         │
│                                                                             │
│  Operator → Dashboard "Upload Training Image" → Label with bounding boxes  │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────┐                               │
│  │  MinIO bucket: training-data/manual/    │                               │
│  │  - Uploaded image with labels (COCO fmt)│                               │
│  └─────────────────────────────────────────┘                               │
│                                                                             │
│  SOURCE 3: Missed Detection (Post-Incident)                                │
│  ────────────────────────────────────────────                               │
│                                                                             │
│  Security review → "AI should have caught this" → Extract from recording   │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────┐                               │
│  │  MinIO bucket: training-data/incident/  │                               │
│  │  - Video clip with manual annotation    │                               │
│  └─────────────────────────────────────────┘                               │
│                                                                             │
│  AGGREGATION:                                                               │
│  ════════════                                                               │
│                                                                             │
│  All sources → Weekly CronJob (Sunday 02:00 UTC)                           │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Training Service Pipeline:                                         │   │
│  │                                                                     │   │
│  │  1. Download all new training data from MinIO                       │   │
│  │  2. Deduplicate (perceptual hashing)                                │   │
│  │  3. Split: train/val/test (80/10/10), before augmenting             │   │
│  │  4. Augment train set: rotation, brightness, noise (albumentations) │   │
│  │  5. Fine-tune YOLOv8x:                                              │   │
│  │     - Base: COCO-pretrained weights                                 │   │
│  │     - Epochs: 100, early stopping patience 10                       │   │
│  │     - LR: 0.001 with cosine decay                                   │   │
│  │     - Batch: 8 per GPU                                              │   │
│  │  6. Validate mAP@0.5 > 0.85                                         │   │
│  │  7. Convert to TensorRT engine (FP16, max batch 8)                  │   │
│  │  8. Upload to MinIO: model-artifacts/{version}/                     │   │
│  │  9. A/B test: shadow mode for 24 hours                              │   │
│  │  10. Promote to production if FP rate < baseline                    │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  MODEL DEPLOYMENT:                                                          │
│  ═════════════════                                                          │
│                                                                             │
│  MinIO model-artifacts/ → Triton Model Repository → load via repository API │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────┐                               │
│  │  Blue/Green Deployment:                 │                               │
│  │  - Triton loads new model as "green"    │                               │
│  │  - 5% traffic routed for 1 hour         │                               │
│  │  - Monitor: latency P99, error rate     │                               │
│  │  - If OK: 100% traffic, "blue" retired  │                               │
│  │  - If FAIL: automatic rollback          │                               │
│  └─────────────────────────────────────────┘                               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
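
Two decision points in the flow above are simple threshold checks: flagging a detection as "uncertain" (Source 1) and deciding whether a retrained model may be promoted (steps 6 and 10). The following is a minimal sketch of both, assuming inclusive bounds on the 0.3-0.6 band; the names `is_uncertain`, `CandidateMetrics`, and `should_promote` are hypothetical and do not come from the platform's codebase.

```python
from dataclasses import dataclass

UNCERTAIN_BAND = (0.3, 0.6)  # confidence range flagged for review (Source 1)
MIN_MAP50 = 0.85             # validation gate from step 6


def is_uncertain(confidence: float) -> bool:
    """True when a detection falls in the review band (bounds assumed inclusive)."""
    low, high = UNCERTAIN_BAND
    return low <= confidence <= high


@dataclass
class CandidateMetrics:
    map50: float           # mAP@0.5 on the validation split
    shadow_fp_rate: float  # false-positive rate measured in shadow mode


def should_promote(candidate: CandidateMetrics, baseline_fp_rate: float) -> bool:
    """Promotion gate: accuracy threshold met and fewer false positives than baseline."""
    return candidate.map50 > MIN_MAP50 and candidate.shadow_fp_rate < baseline_fp_rate
```

For example, `should_promote(CandidateMetrics(0.90, 0.02), baseline_fp_rate=0.03)` passes both checks, while a candidate with mAP@0.5 of 0.84 is rejected regardless of its false-positive rate.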

6. Technology Stack

6.1 Technology Selection Matrix

| Category | Technology | Alternative | Selection Criteria |
|---|---|---|---|
| Cloud Platform | AWS | GCP, Azure | Best India region coverage (Mumbai), most mature managed Kafka (MSK), broad GPU instance types, VPC endpoints for private service communication |
| Container Orchestration | Amazon EKS | GKE, AKS, self-managed | Managed control plane, GPU device plugin support, Cluster Autoscaler, native IAM integration |
| Edge K8s | K3s | K0s, MicroK8s, Docker Compose | Single binary, lightweight, automatic HA with embedded etcd, built-in Helm, compatible with standard K8s manifests |
| VPN | WireGuard | OpenVPN, IPSec, Tailscale | Modern crypto (Curve25519, ChaCha20, Poly1305), kernel module since Linux 5.6, ~60% faster than OpenVPN, NAT traversal, simple config |
| Message Queue | Apache Kafka (MSK) | RabbitMQ, NATS, AWS SQS | Ordered event log, stream replay, high throughput, exactly-once processing with Flink, managed service reduces ops |
| Stream Processing | Apache Flink on EKS | Kafka Streams, Spark Streaming | Stateful processing, event time semantics, exactly-once, CEP (complex event processing) for multi-frame rules |
| Reverse Proxy | Traefik | NGINX, HAProxy, Envoy | Native Kubernetes Ingress, automatic Let's Encrypt, middleware chains, WebSocket support, Prometheus metrics |
| AI Inference | NVIDIA Triton + YOLOv8 | TorchServe, TensorFlow Serving, custom | Multi-framework support, TensorRT optimization, dynamic batching, model ensemble, Prometheus metrics |
| Database | PostgreSQL 16 (RDS) + pgvector | MySQL, MongoDB, separate vector DB | ACID compliance, mature managed service, pgvector handles 512-dim embeddings at scale, no separate DB to manage |
| Cache | Redis 7 Cluster | Memcached, KeyDB | Data structures (streams, sorted sets), pub/sub, persistence, cluster mode for horizontal scaling |
| Object Storage | MinIO + S3 | Ceph, GlusterFS, pure S3 | S3-compatible API everywhere, local buffering at edge, cloud tiering, cost optimization via lifecycle policies |
| Backend Language | Go 1.21 | Python, Java, Rust | Compiled performance for high-throughput streaming, excellent concurrency (goroutines), small container images |
| Frontend | Next.js 14 + React 18 | Vue, Angular, Svelte | SSR for SEO/performance, React ecosystem, API routes, image optimization, easy deployment to CDN |
| Monitoring | Prometheus + Grafana + Loki | Datadog, New Relic, CloudWatch | Open source, no per-host licensing, powerful alerting, log aggregation with Loki, custom dashboards |
| CI/CD | GitHub Actions + ArgoCD | GitLab CI, Jenkins, Flux | GitOps deployment, automated rollback, drift detection, progressive delivery |

6.2 WireGuard VPN Configuration

# Cloud Server (AWS EC2 bastion / VPN endpoint)
[Interface]
Address = 10.200.0.1/32
ListenPort = 51820
PrivateKey = <cloud-private-key>
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
# Edge Gateway
PublicKey = <edge-public-key>
AllowedIPs = 10.200.0.2/32, 192.168.29.0/24
PersistentKeepalive = 25

# Edge Gateway (Intel NUC)
[Interface]
Address = 10.200.0.2/32
PrivateKey = <edge-private-key>

[Peer]
# Cloud Server
PublicKey = <cloud-public-key>
AllowedIPs = 10.200.0.1/32, 10.100.0.0/16  # Cloud tunnel IP + entire AWS VPC
Endpoint = <cloud-public-ip>:51820
PersistentKeepalive = 25
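
WireGuard uses each peer's AllowedIPs both to choose the peer for outbound packets and to filter the source addresses it accepts from that peer, so every address that must be reachable through the tunnel has to fall inside one of the listed networks. The sketch below reproduces that containment check with Python's standard `ipaddress` module; the helper name `allowed` and the sample VPC address are illustrative assumptions.

```python
import ipaddress


def allowed(dest_ip: str, allowed_ips: list[str]) -> bool:
    """True if dest_ip falls inside any AllowedIPs network of the peer."""
    address = ipaddress.ip_address(dest_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_ips)


# AllowedIPs the cloud server holds for the edge peer.
cloud_side = ["10.200.0.2/32", "192.168.29.0/24"]

print(allowed("192.168.29.200", cloud_side))        # DVR on the site LAN
print(allowed("10.100.4.20", ["10.100.0.0/16"]))    # hypothetical VPC address
print(allowed("10.200.0.1", ["10.100.0.0/16"]))     # tunnel IP against a VPC-only list
```

The last check is the case to watch: a peer entry that lists only the VPC range does not cover the cloud tunnel address 10.200.0.1, so health checks aimed at that address would not be carried by the tunnel.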

6.3 Port Reference Table

| Service | Port | Protocol | Location | Notes |
|---|---|---|---|---|
| DVR RTSP | 554 | TCP | 192.168.29.200 | Local network only |
| DVR HTTP | 80 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR HTTPS | 443 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR TCP | 25001 | TCP | 192.168.29.200 | Proprietary protocol |
| DVR UDP | 25002 | UDP | 192.168.29.200 | Proprietary protocol |
| DVR NTP | 123 | UDP | 192.168.29.200 | Time sync |
| WireGuard | 51820 | UDP | Cloud + Edge | VPN tunnel |
| Edge Admin | 8080 | TCP | 192.168.29.5 | Local admin UI |
| Edge SSH | 22 | TCP | 192.168.29.5 | Admin access only |
| Traefik HTTP | 8000 | TCP | EKS | Internal HTTP entrypoint |
| Traefik HTTPS | 8443 | TCP | EKS | Internal HTTPS entrypoint |
| ALB HTTPS | 443 | TCP | AWS | Public-facing |
| Backend API | 8080 | TCP | EKS pods | Internal service port |
| Triton HTTP | 8000 | TCP | EKS GPU nodes | Model inference HTTP |
| Triton gRPC | 8001 | TCP | EKS GPU nodes | Model inference gRPC |
| Triton Metrics | 8002 | TCP | EKS GPU nodes | Prometheus metrics |
| PostgreSQL | 5432 | TCP | RDS | VPC-private |
| Redis | 6379 | TCP | ElastiCache | VPC-private |
| Kafka | 9092 | TCP | MSK | VPC-private |
| MinIO API | 9000 | TCP | EKS + Edge | S3-compatible API |
| MinIO Console | 9001 | TCP | EKS + Edge | Admin console |
| Prometheus | 9090 | TCP | EKS | Metrics collection |
| Grafana | 3000 | TCP | EKS | Dashboards |

7. Scaling Strategy

7.1 Camera Scaling Roadmap

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CAMERA SCALING ROADMAP                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CURRENT: 8 cameras (1 DVR)                                                 │
│  ├─ Edge: Intel NUC i7, 32GB RAM                                          │
│  ├─ Streams: 8 × RTSP @ 960×1080                                          │
│  ├─ Bandwidth: ~16 Mbps upstream (2 Mbps per H.264 stream)                │
│  └─ Cloud AI: 1× T4 GPU (handles 8 streams @ 1 fps)                       │
│                                                                             │
│  PHASE 1: 16 cameras (2 DVRs)                                               │
│  ├─ Edge: Intel NUC i7 (sufficient) or 2× NUC                            │
│  ├─ Add 2nd edge gateway for 2nd DVR site (if different location)        │
│  ├─ Streams: 16 × RTSP                                                    │
│  ├─ Bandwidth: ~32 Mbps                                                   │
│  ├─ Cloud AI: 1× T4 GPU (still sufficient, batch size 8 → 16)            │
│  └─ Kafka: 8 partitions → 16 partitions                                   │
│                                                                             │
│  PHASE 2: 32 cameras (4 DVRs / 4 sites)                                     │
│  ├─ Edge: 4× Intel NUC (one per site)                                     │
│  ├─ VPN: Hub-spoke model (4 edge peers → 1 cloud endpoint)               │
│  ├─ Bandwidth: ~64 Mbps                                                   │
│  ├─ Cloud AI: 2× T4 GPUs (HPA: 2-6 replicas)                             │
│  ├─ Stream Ing.: 6-12 replicas (HPA)                                      │
│  ├─ Kafka: 32 partitions                                                  │
│  └─ PostgreSQL: db.r6g.2xlarge (scale up)                                 │
│                                                                             │
│  PHASE 3: 64 cameras (8 DVRs / 8 sites)                                     │
│  ├─ Edge: 8× Intel NUC (or NVIDIA Jetson Orin for edge AI pre-filter)     │
│  ├─ VPN: WireGuard hub-spoke or mesh (consider Tailscale for simplicity) │
│  ├─ Bandwidth: ~128 Mbps (dedicated internet circuit recommended)        │
│  ├─ Cloud AI: 4× T4 GPUs or 2× A10G (g5.2xlarge)                         │
│  ├─ Stream Ing.: 12-20 replicas                                           │
│  ├─ Kafka: 64 partitions, consider MSK multi-cluster                      │
│  ├─ PostgreSQL: db.r6g.4xlarge + read replica                             │
│  ├─ Redis: 4 shards                                                       │
│  └─ MinIO: Distributed mode, 4+ nodes                                     │
│                                                                             │
│  PHASE 4: 64+ cameras (NVR consolidation)                                   │
│  ├─ Consider NVR-to-edge consolidation (fewer, more powerful recorders)   │
│  ├─ Edge AI pre-filtering (Jetson Orin): only send motion frames         │
│  ├─ Bandwidth reduction: ~50% via smart filtering                         │
│  └─ Multi-region cloud deployment for latency optimization                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
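
The bandwidth figures in the roadmap follow directly from the 2 Mbps per-stream bitrate. The sketch below is a rough planning aid under one added assumption, labeled in the code: each site hosts a single 8-channel DVR. Under that assumption it separates the uplink each site needs from the aggregate ingress the cloud endpoint must absorb, which is the figure the roadmap quotes per phase. The function names are illustrative.

```python
PER_STREAM_MBPS = 2.0  # H.264 bitrate per camera, from the roadmap
CAMERAS_PER_SITE = 8   # assumed: one 8-channel DVR per site


def uplink_mbps(cameras: int) -> float:
    """Upstream bandwidth needed to ship `cameras` streams."""
    return cameras * PER_STREAM_MBPS


def plan(total_cameras: int) -> dict:
    """Split the aggregate figure into per-site uplink and cloud-side ingress."""
    sites = -(-total_cameras // CAMERAS_PER_SITE)  # ceiling division
    return {
        "sites": sites,
        "per_site_mbps": uplink_mbps(min(total_cameras, CAMERAS_PER_SITE)),
        "cloud_ingress_mbps": uplink_mbps(total_cameras),
    }


if __name__ == "__main__":
    for cameras in (8, 16, 32, 64):
        print(cameras, plan(cameras))
```

If several DVRs share one location, as Phase 1 allows, the per-site figure rises accordingly and `CAMERAS_PER_SITE` should be adjusted.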

7.2 AI Inference Scaling

| Metric | 8 Cameras | 16 Cameras | 32 Cameras | 64 Cameras |
|---|---|---|---|---|
| Frame Rate | 8 fps (1 per cam) | 16 fps | 32 fps | 64 fps |
| GPU Replicas | 1× T4 | 1× T4 | 2× T4 | 4× T4 or 2× A10G |
| Inference Latency (P99) | 80ms | 120ms | 150ms | 200ms |
| Kafka Partitions (raw) | 8 | 16 | 32 | 64 |
| Consumer Groups | 3 | 4 | 6 | 8 |

Auto-scaling Triggers:

  • GPU utilization > 80% for 2 minutes → scale out
  • Kafka consumer lag > 1000 messages for 5 minutes → scale out
  • Kafka consumer lag < 100 messages for 10 minutes → scale in (down to the minimum replica count)
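
The triggers above combine a threshold with a duration, which can be expressed as a small decision function. This is a sketch only: it assumes the caller already tracks how long each condition has held, treats the scale-in signal as consumer lag, and moves one replica at a time. None of those choices are specified by the triggers themselves, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    gpu_high_secs: float  # time GPU utilization has stayed above 80%
    lag_high_secs: float  # time consumer lag has stayed above 1000 messages
    lag_low_secs: float   # time consumer lag has stayed below 100 messages


def desired_replicas(obs: Observation, current: int, minimum: int, maximum: int) -> int:
    """Apply the three triggers; scale-out wins if both directions fire."""
    scale_out = obs.gpu_high_secs >= 120 or obs.lag_high_secs >= 300
    scale_in = obs.lag_low_secs >= 600
    if scale_out:
        return min(current + 1, maximum)
    if scale_in:
        return max(current - 1, minimum)
    return current
```

In a Kubernetes deployment the same policy would normally be expressed through an autoscaler rather than application code; the function is only meant to make the trigger logic explicit.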

7.3 Storage Scaling

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STORAGE CAPACITY PLANNING                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Per-Camera Storage Profile:                                                │
│  - Continuous recording: ~1.5 GB/hour @ 960×1080 H.264 main profile       │
│  - AI snapshots (1 fps): ~50 MB/hour (JPEG compressed)                    │
│  - Event clips: ~10 MB average per event (30-second clip)                 │
│                                                                             │
│  Total Per Day (8 cameras):                                                 │
│  - Video: 8 × 1.5 GB × 24h = 288 GB/day                                   │
│  - Snapshots: 8 × 50 MB × 24h = 9.6 GB/day                                │
│  - Events (est. 500/day): 500 × 10 MB = 5 GB/day                          │
│  - TOTAL: ~303 GB/day = ~9 TB/month                                       │
│                                                                             │
│  Tiered Storage Strategy:                                                   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  TIER 1: EDGE LOCAL (Hot, 7 days)                                   │   │
│  │  - Capacity: 2TB NVMe per edge gateway                              │   │
│  │  - All 8 streams, full resolution, 10s segments                     │   │
│  │  - Cost: Hardware (CAPEX)                                           │   │
│  │  - Use: Immediate playback, event export                            │   │
│  │                                                                     │   │
│  │  7 days × 303 GB = 2.1 TB ✗ (exceeds 2TB; use 4TB or ~6 days)      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  TIER 2: CLOUD MINIO (Warm, 30 days)                                │   │
│  │  - Capacity: 10TB initial, auto-scaling                             │   │
│  │  - Full resolution video segments + event snapshots                 │   │
│  │  - Cost: ~$0.023/GB/month (S3 Standard equivalent)                  │   │
│  │  - Use: Dashboard playback, search, investigation                   │   │
│  │                                                                     │   │
│  │  30 days × 303 GB = 9.1 TB                                          │   │
│  │  Cost: 9,100 GB × $0.023 = ~$209/month                              │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  TIER 3: S3 IA (Cool, 31-90 days)                                   │   │
│  │  - Capacity: Auto (lifecycle transition)                            │   │
│  │  - Cost: ~$0.0125/GB/month                                          │   │
│  │  - Use: Occasional access, compliance review                        │   │
│  │                                                                     │   │
│  │  60 days × 303 GB = 18.2 TB                                         │   │
│  │  Cost: 18,200 GB × $0.0125 = ~$228/month                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  TIER 4: GLACIER DEEP ARCHIVE (Cold, 90+ days)                      │   │
│  │  - Capacity: Unbounded                                              │   │
│  │  - Cost: ~$0.00099/GB/month                                         │   │
│  │  - Retrieval: 12-48 hours (batch)                                   │   │
│  │  - Use: Long-term compliance, legal hold                            │   │
│  │                                                                     │   │
│  │  Annual accumulation: 303 GB × 365 = 110 TB                         │   │
│  │  Cost: 110,000 GB × $0.00099 = ~$109/month                          │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  TOTAL MONTHLY STORAGE COST (8 cameras, at end of year 1):                  │
│  - Tier 2 (warm): $209                                                      │
│  - Tier 3 (cool): $228                                                      │
│  - Tier 4 (cold): $109 (one year of archive; grows over time)               │
│  - TOTAL: ~$546/month                                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
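
The capacity plan above is plain arithmetic on the per-camera profile, so it can be recomputed for other camera counts or retention periods. The sketch below uses only the rates and prices quoted in the plan; the function and dictionary names are illustrative. Results can differ by a dollar or so from the diagram, which rounds intermediate volumes.

```python
VIDEO_GB_PER_CAM_HOUR = 1.5      # continuous recording
SNAPSHOT_GB_PER_CAM_HOUR = 0.05  # AI snapshots at 1 fps (50 MB)
EVENT_CLIP_GB = 0.01             # 10 MB average clip
PRICE_PER_GB_MONTH = {"tier2": 0.023, "tier3": 0.0125, "tier4": 0.00099}


def daily_gb(cameras: int, events_per_day: int = 500) -> float:
    """Total data written per day across video, snapshots, and event clips."""
    per_cam_hour = VIDEO_GB_PER_CAM_HOUR + SNAPSHOT_GB_PER_CAM_HOUR
    return cameras * per_cam_hour * 24 + events_per_day * EVENT_CLIP_GB


def monthly_cost(cameras: int, days_held: int, tier: str) -> float:
    """Monthly cost of holding `days_held` days of data in the given tier."""
    return daily_gb(cameras) * days_held * PRICE_PER_GB_MONTH[tier]


def edge_retention_days(cameras: int, capacity_gb: float) -> float:
    """Days of full-rate data that fit on an edge disk of the given size."""
    return capacity_gb / daily_gb(cameras)


if __name__ == "__main__":
    print(f"daily volume: {daily_gb(8):.1f} GB")
    print(f"tier 2: ${monthly_cost(8, 30, 'tier2'):.0f}/month")
    print(f"edge retention on 2000 GB: {edge_retention_days(8, 2000):.1f} days")
```

The edge retention helper is useful for sizing the local disk: it reports how many days of full-rate data fit before the oldest segments must be evicted.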

7.4 Database Partitioning Strategy

-- Partition events table by month (range partitioning)
CREATE TABLE events (
    id BIGSERIAL,
    camera_id VARCHAR(32) NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    confidence DECIMAL(4,3),
    bounding_box BOX,
    snapshot_path VARCHAR(512),
    start_time TIMESTAMPTZ NOT NULL,
    end_time TIMESTAMPTZ,
    severity VARCHAR(20),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY RANGE (start_time);

-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... auto-created by cron job

-- Partition pruning ensures queries for specific time ranges
-- only scan relevant partitions

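-- Automated partition creation (pg_partman extension)
-- pg_partman 5.x syntax; 4.x used p_type 'native' and interval 'monthly'
SELECT partman.create_parent(
    p_parent_table := 'public.events',
    p_control      := 'start_time',
    p_interval     := '1 month'
);

-- Partition archival
-- Partitions older than 12 months:
-- 1. Detach from the parent (ALTER TABLE events DETACH PARTITION ...)
-- 2. Export to S3 (on RDS, for example with aws_s3.query_export_to_s3)
-- 3. Drop local partition (data in cold archive)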
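
Where pg_partman is not available, the monthly partitions can be generated by a scheduled job. The sketch below builds the same DDL as the hand-written partitions above for any year and month, including the December rollover; `partition_ddl` is a hypothetical helper, and executing the statement against the database is left to the caller.

```python
from datetime import date


def partition_ddl(year: int, month: int, table: str = "events") -> str:
    """Return CREATE TABLE DDL for one monthly range partition."""
    start = date(year, month, 1)
    # The upper bound is exclusive: the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{table}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {table}\n"
        f"    FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


if __name__ == "__main__":
    print(partition_ddl(2025, 3))
    print(partition_ddl(2025, 12))
```

Creating partitions a month or two ahead of time avoids insert failures if the job misses a run.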

8. Failover & Reliability

8.1 Service Restart Policies

# Kubernetes Deployment - Restart Policy Example
# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-ingestion-service
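spec:
  replicas: 3
  selector:  # Required; must match the pod template labels (label value is illustrative)
    matchLabels:
      app: stream-ingestion-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero-downtime deployment
  template:
    metadata:
      labels:
        app: stream-ingestion-service
    spec: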
      containers:
        - name: ingestion
          image: surveillance/stream-ingestion:v1.2.3
          resources:
            requests:
              cpu: "1000m"
              memory: "2Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
          livenessProbe:
            grpc:
              port: 8081
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3  # Restart after 30s of failures
          readinessProbe:
            grpc:
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          startupProbe:
            grpc:
              port: 8081
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 30  # 150s max for startup

8.2 Stream Reconnect Logic

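// Edge Gateway Stream Reconnect Logic (Go sketch; HealthStore and StreamFunc are illustrative)
package edge

import (
    "context"
    "log/slog"
    "math/rand"
    "time"
)

// ExponentialBackoff yields growing, jittered waits between retries.
type ExponentialBackoff struct {
    Initial, Max time.Duration
    Multiplier   float64
    Jitter       float64 // fraction of the wait, 0.1 = ±10%
    current      time.Duration
}

// Next returns the next wait and advances the sequence, capped at Max.
func (b *ExponentialBackoff) Next() time.Duration {
    if b.current == 0 {
        b.current = b.Initial
    }
    wait := b.current
    b.current = min(time.Duration(float64(b.current)*b.Multiplier), b.Max)
    return wait + time.Duration((rand.Float64()*2-1)*b.Jitter*float64(wait))
}

// Reset restarts the sequence from Initial.
func (b *ExponentialBackoff) Reset() { b.current = 0 }

// HealthStore records stream status (a Redis hash in production).
type HealthStore interface {
    HSet(key, field, value string)
}

// StreamFunc blocks while the stream is up and returns when it ends.
// It must call onConnected once the RTSP session is established.
type StreamFunc func(ctx context.Context, cameraID, rtspURL string, onConnected func()) error

func maintainStream(ctx context.Context, health HealthStore, connectAndStream StreamFunc, cameraID, rtspURL string) {
    backoff := &ExponentialBackoff{
        Initial:    1 * time.Second,
        Max:        60 * time.Second,
        Multiplier: 2.0,
        Jitter:     0.1,
    }
    for ctx.Err() == nil {
        streamCtx, cancel := context.WithCancel(ctx)
        err := connectAndStream(streamCtx, cameraID, rtspURL, func() {
            // Connected: reset backoff and update health status
            backoff.Reset()
            health.HSet("stream:health", cameraID, "connected")
        })
        cancel() // Release the per-attempt context on every path
        slog.Error("stream disconnected", "camera", cameraID, "error", err)
        health.HSet("stream:health", cameraID, "disconnected")

        // Wait with backoff, but stop early on shutdown
        wait := backoff.Next()
        slog.Info("reconnecting", "camera", cameraID, "wait", wait)
        select {
        case <-time.After(wait):
        case <-ctx.Done():
        }
    }
}

// Circuit breaker pattern for cloud connection
type State int

const (
    Closed State = iota
    Open
    HalfOpen
)

type CircuitBreaker struct {
    state            State
    failureCount     int
    failureThreshold int           // 5 failures
    timeout          time.Duration // 60s open state
    lastFailureTime  time.Time
}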
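
The circuit breaker struct above lists the state it tracks but not the transitions. The following is a minimal sketch of one common way to drive those fields, written in Python for brevity: the circuit opens after the threshold of consecutive failures, moves to half-open once the timeout has elapsed, and closes again on the first success. The method names and the injectable clock are assumptions made for illustration, not the platform's implementation.

```python
import time


class CircuitBreaker:
    """Closed -> Open after N consecutive failures; Open -> HalfOpen after a timeout."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, timeout_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so tests can fake time
        self.state = self.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0

    def allow_request(self) -> bool:
        """Return True if a call to the cloud may be attempted now."""
        if self.state == self.OPEN:
            if self.clock() - self.last_failure_time < self.timeout_s:
                return False
            self.state = self.HALF_OPEN  # timeout elapsed: allow trial calls
        return True

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = self.CLOSED

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = self.clock()
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
```

A caller would check `allow_request()` before each cloud call and report the outcome with `record_success()` or `record_failure()`, buffering locally while the circuit is open.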

8.3 VPN Tunnel Recovery

#!/bin/bash
# /usr/local/bin/wireguard-watchdog.sh
# Runs every minute via cron (cron's finest granularity; use a systemd
# timer if a 30-second interval is required).
# EDGE_TOKEN must be set in the cron environment.

CLOUD_ENDPOINT="10.200.0.1"
TUNNEL_INTERFACE="wg0"
LOG_FILE="/var/log/wg-watchdog.log"

# Succeeds if the cloud endpoint answers at least one of 3 pings over the tunnel
tunnel_healthy() {
    ping -c 3 -W 5 -I "$TUNNEL_INTERFACE" "$CLOUD_ENDPOINT" > /dev/null 2>&1
}

if ! tunnel_healthy; then
    echo "$(date): VPN tunnel unhealthy, restarting..." >> "$LOG_FILE"

    # 1. Restart WireGuard interface
    wg-quick down "$TUNNEL_INTERFACE"
    sleep 2
    wg-quick up "$TUNNEL_INTERFACE"

    # 2. Verify recovery
    sleep 5
    if tunnel_healthy; then
        echo "$(date): VPN tunnel recovered" >> "$LOG_FILE"
        # Notify cloud of recovery
        curl -sS -X POST http://10.200.0.1:8080/api/v1/system/edge-recovery \
            -H "Authorization: Bearer $EDGE_TOKEN" \
            -H "Content-Type: application/json" \
            -d "{\"edge_id\": \"$HOSTNAME\", \"status\": \"recovered\"}"
    else
        echo "$(date): VPN tunnel recovery FAILED" >> "$LOG_FILE"
        # Escalate: local alert (buzzer/email if available)
    fi
fi

8.4 Queue Recovery & Durability

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KAFKA DURABILITY CONFIGURATION                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Producer Configuration (Stream Ingestion Service):                         │
│  ──────────────────────────────────────────────────                         │
│  - acks=all              # Wait for all replicas                            │
│  - retries=10            # Aggressive retry                                 │
│  - retry.backoff.ms=1000 # 1 second between retries                       │
│  - enable.idempotence=true  # No duplicates on producer retry               │
│  - max.in.flight.requests.per.connection=1  # Preserve ordering on retry    │
│  - compression.type=lz4  # Efficient compression                          │
│                                                                             │
│  Topic Configuration:                                                       │
│  ────────────────────                                                       │
│  - replication.factor=3     # 3 copies across AZs                          │
│  - min.insync.replicas=2    # Require 2 acks for producer commit           │
│  - retention.ms=604800000   # 7 days for raw streams                       │
│  - retention.ms=2592000000  # 30 days for detections                       │
│  - unclean.leader.election.enable=false  # Never lose committed data       │
│                                                                             │
│  Consumer Configuration (AI Inference Service):                             │
│  ──────────────────────────────────────────────                             │
│  - enable.auto.commit=false  # Manual offset management                     │
│  - auto.offset.reset=earliest  # Replay from beginning on new group        │
│  - max.poll.records=100      # Process in batches                           │
│  - isolation.level=read_committed  # Only read committed transactions       │
│                                                                             │
│  Offset Commit Strategy:                                                    │
│  ───────────────────────                                                    │
│  1. Pull batch from Kafka                                                   │
│  2. Process (run inference)                                                 │
│  3. Write results to PostgreSQL (transaction)                               │
│  4. Commit Kafka offset ONLY after DB write succeeds                        │
│  5. If any step fails: don't commit, seek back, and reprocess the batch     │
│                                                                             │
│  Dead Letter Queue:                                                         │
│  ──────────────────                                                         │
│  - Topic: streams.raw.dlq                                                   │
│  - After 5 processing failures, message moved to DLQ                        │
│  - DLQ consumer: alerts admin, manual inspection                            │
│  - Retention: 30 days                                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
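
The offset commit strategy and dead letter queue rules above amount to at-least-once processing with a retry cap. The sketch below models that control flow with plain callables in place of a real Kafka client and database, so it runs without any broker: `infer`, `write_results`, `commit`, and `send_to_dlq` are hypothetical stand-ins, and messages are plain dicts with an "offset" key.

```python
from collections import defaultdict

MAX_ATTEMPTS = 5  # processing failures before a message goes to the DLQ


def process_batch(batch, infer, write_results, commit, send_to_dlq, attempts):
    """Handle one polled batch; return True only if the offset was committed.

    `attempts` maps offset -> failure count and must persist across calls.
    """
    if not batch:
        return False
    results = []
    for message in batch:
        try:
            results.append(infer(message))
        except Exception:
            attempts[message["offset"]] += 1
            if attempts[message["offset"]] < MAX_ATTEMPTS:
                return False      # no commit: the batch is retried
            send_to_dlq(message)  # poison message: park it and move on
    try:
        write_results(results)    # one database transaction for the batch
    except Exception:
        return False              # DB write failed: no commit
    commit(batch[-1]["offset"] + 1)  # commit the next offset to read
    return True


if __name__ == "__main__":
    attempts = defaultdict(int)
    ok = process_batch(
        [{"offset": 0}, {"offset": 1}],
        infer=lambda m: m["offset"],
        write_results=lambda rows: None,
        commit=lambda offset: print("committed", offset),
        send_to_dlq=lambda m: None,
        attempts=attempts,
    )
    print("committed batch:", ok)
```

Because a batch can be processed more than once before its offset is committed, the database write should be idempotent, for example keyed on camera ID and frame timestamp.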

8.5 Graceful Degradation

┌─────────────────────────────────────────────────────────────────────────────┐
│                    GRACEFUL DEGRADATION MATRIX                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  FAILURE MODE                    │  DEGRADATION STRATEGY            │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  AI Inference Service DOWN       │  Continue recording ALL video    │   │
│  │  (GPU failure, model crash)      │  - Events stored as "unprocessed"│   │
│  │                                  │  - No real-time alerts           │   │
│  │                                  │  - Queue frames for later batch  │   │
│  │                                  │    processing when AI recovers   │   │
│  │                                  │  - Dashboard shows "AI OFFLINE"  │   │
│  │                                  │    banner                        │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  Kafka DOWN                      │  Edge Gateway buffers locally    │   │
│  │  (MSK outage)                    │  - Local MinIO ring buffer       │   │
│  │                                  │  - Backpressure: reduce to       │   │
│  │                                  │    key frames only (0.2 fps)     │   │
│  │                                  │  - Auto-reconnect with 2x        │   │
│  │                                  │    exponential backoff           │   │
│  │                                  │  - Replay from local buffer      │   │
│  │                                  │    when Kafka recovers           │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  VPN Tunnel DOWN                 │  Full local operation mode       │   │
│  │  (internet outage)               │  - All recording continues       │   │
│  │                                  │    locally (7-day buffer)        │   │
│  │                                  │  - Local alert buzzer/relay      │   │
│  │                                  │    (configurable)                │   │
│  │                                  │  - No cloud dashboard access     │   │
│  │                                  │  - Auto-sync when VPN recovers   │   │
│  │                                  │  - Queue cloud events for        │   │
│  │                                  │    later replay                  │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  PostgreSQL DOWN                 │  Alert queue builds in Kafka     │   │
│  │  (RDS outage)                    │  - Events not lost (Kafka dur.)  │   │
│  │                                  │  - Read-only dashboard mode      │   │
│  │                                  │  - Cached data from Redis        │   │
│  │                                  │  - Alert on-call engineer        │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  Notification Service DOWN       │  Alerts accumulate in DB         │   │
│  │                                  │  - Retry with exponential backoff│   │
│  │                                  │  - Dead letter after 24 hours    │   │
│  │                                  │  - Dashboard shows pending count │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  Edge Gateway DOWN               │  Cloud dashboard shows           │   │
│  │  (power/hardware failure)        │  "SITE OFFLINE"                  │   │
│  │                                  │  - Last known recordings in      │   │
│  │                                  │    cloud (up to disconnect)      │   │
│  │                                  │  - Alert sent immediately        │   │
│  │                                  │  - UPS on edge: graceful         │   │
│  │                                  │    shutdown, preserve data       │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Priority Order (highest first):                                            │
│  1. Video recording NEVER STOPS (local edge priority)                       │
│  2. Critical alerts ALWAYS FIRE (local buzzer + queued cloud alerts)        │
│  3. AI inference gracefully degrades to batch catch-up                      │
│  4. Dashboard operates in read-only/cache mode during DB outage             │
│  5. Cloud sync resumes automatically when connectivity restored             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
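
The "2x exponential backoff" used for Kafka reconnection can be illustrated with a small generator. The 1-second base delay and 60-second cap are assumptions for the sketch; the matrix above only fixes the doubling factor.

```python
import itertools
from typing import Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, base*factor, ... up to cap.

    base and cap are illustrative defaults, not values from the design.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First eight delays with the assumed defaults
print(list(itertools.islice(backoff_delays(), 8)))
```

A production reconnect loop would usually add random jitter to each delay so that several edge gateways do not retry in lockstep after a shared outage.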

8.6 Health Check Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    HEALTH CHECK ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LAYER 1: KUBERNETES PROBES                                                │
│  ──────────────────────────                                                 │
│  - Liveness Probe:  /health/live   → Restart container if failing          │
│  - Readiness Probe: /health/ready  → Remove from service if failing        │
│  - Startup Probe:   /health/startup→ Allow long initialization             │
│                                                                             │
│  LAYER 2: SERVICE-LEVEL HEALTH (Prometheus metrics)                        │
│  ──────────────────────────────────────────────────                         │
│  Each service exposes:                                                      │
│  - app_health_status{service="X"}  0=healthy, 1=degraded, 2=critical      │
│  - app_health_details{check="db"}  last check timestamp + result           │
│                                                                             │
│  LAYER 3: DEPENDENCY HEALTH CHECKS                                          │
│  ────────────────────────────────                                           │
│  Backend Service checks:                                                    │
│  ├─ PostgreSQL: SELECT 1; (timeout 2s)                                     │
│  ├─ Redis: PING (timeout 1s)                                               │
│  ├─ Kafka: ListTopics (timeout 3s)                                         │
│  ├─ MinIO: ListBuckets (timeout 3s)                                        │
│  └─ Triton: ModelReady API (timeout 5s)                                    │
│                                                                             │
│  LAYER 4: END-TO-END HEALTH                                                │
│  ──────────────────────────                                                 │
│  Synthetic probe:                                                           │
│  1. Upload test image to stream ingestion                                   │
│  2. Verify AI detection result appears in Kafka                             │
│  3. Verify event written to PostgreSQL                                      │
│  4. Verify alert queryable via API                                          │
│  5. Verify WebSocket push received                                          │
│  Run: Every 60 seconds from monitoring namespace                            │
│                                                                             │
│  LAYER 5: EDGE HEALTH HEARTBEAT                                            │
│  ────────────────────────────────                                           │
│  - Edge Gateway sends heartbeat every 30 seconds                            │
│  - Payload: {edge_id, timestamp, stream_count, disk_free, mem_usage}       │
│  - Missed 3 heartbeats (90s) → "EDGE OFFLINE" alert                       │
│  - Recovers → "EDGE ONLINE" notification                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
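
The Layer 5 heartbeat rule (30-second interval, offline after 3 missed beats) can be sketched as a small state tracker. `EdgeMonitor` and its method names are hypothetical; timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
from typing import Dict, List, Set, Tuple

HEARTBEAT_INTERVAL_S = 30
MISSED_LIMIT = 3
OFFLINE_AFTER_S = HEARTBEAT_INTERVAL_S * MISSED_LIMIT  # 90 seconds

Event = Tuple[str, str]  # (edge_id, notification text)


class EdgeMonitor:
    """Tracks last-seen times and emits OFFLINE/ONLINE transitions once."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, float] = {}
        self._offline: Set[str] = set()

    def heartbeat(self, edge_id: str, now: float) -> List[Event]:
        """Record a heartbeat; report recovery if the edge was offline."""
        self._last_seen[edge_id] = now
        if edge_id in self._offline:
            self._offline.discard(edge_id)
            return [(edge_id, "EDGE ONLINE")]
        return []

    def sweep(self, now: float) -> List[Event]:
        """Called periodically; flags edges silent for OFFLINE_AFTER_S."""
        events: List[Event] = []
        for edge_id, seen in self._last_seen.items():
            if edge_id not in self._offline and now - seen >= OFFLINE_AFTER_S:
                self._offline.add(edge_id)
                events.append((edge_id, "EDGE OFFLINE"))
        return events
```

Emitting only on state transitions keeps a long outage from generating a new alert on every sweep.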

9. Security Architecture

9.1 Defense in Depth

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DEFENSE IN DEPTH LAYERS                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LAYER 1: PERIMETER                                                         │
│  ──────────────                                                             │
│  - AWS WAF v2: SQL injection, XSS, rate limiting rules                     │
│  - Geo-restriction: Allow only specific countries                          │
│  - AWS Shield Standard (DDoS protection)                                   │
│  - ALB access logs → S3 → Athena for analysis                              │
│                                                                             │
│  LAYER 2: TRANSPORT                                                         │
│  ──────────────                                                             │
│  - TLS 1.3 for all external HTTPS connections                              │
│  - WireGuard ChaCha20-Poly1305 for VPN tunnel                              │
│  - mTLS (mutual TLS) for internal service-to-service communication         │
│  - Certificate rotation: Let's Encrypt auto (90-day)                       │
│                                                                             │
│  LAYER 3: AUTHENTICATION & AUTHORIZATION                                    │
│  ──────────────────────────────────────────                                 │
│  - JWT with RS256 (asymmetric signing)                                     │
│  - Access token: 15 minutes                                                │
│  - Refresh token: 7 days (stored in httpOnly cookie)                       │
│  - RBAC: admin, operator, viewer roles                                     │
│  - API keys for edge gateway authentication                                │
│  - Multi-factor authentication for admin role                              │
│                                                                             │
│  LAYER 4: APPLICATION SECURITY                                              │
│  ────────────────────────────                                               │
│  - Input validation: strict JSON schemas                                   │
│  - SQL injection: parameterized queries only (pgx)                         │
│  - XSS prevention: Content Security Policy headers                         │
│  - CSRF tokens for state-changing operations                               │
│  - File upload: virus scanning, size limits, type validation               │
│                                                                             │
│  LAYER 5: DATA SECURITY                                                     │
│  ────────────────────                                                       │
│  - RDS: Encryption at rest (AES-256, AWS KMS CMK)                          │
│  - RDS: Encryption in transit (TLS 1.2+)                                   │
│  - S3: Default encryption (SSE-S3 or SSE-KMS)                              │
│  - Redis: TLS in transit, no AUTH token exposure                           │
│  - Face embeddings: stored as vectors, not raw images (privacy)            │
│  - Backup encryption: separate KMS key for backups                         │
│                                                                             │
│  LAYER 6: NETWORK SEGMENTATION                                              │
│  ───────────────────────────                                                │
│  - VPC private subnets for all workloads                                   │
│  - Security groups: least privilege, explicit allow only                   │
│  - Network Policies: namespace-level isolation in K8s                      │
│  - DVR: NO public IP, NO internet gateway, local network only              │
│  - VPN: Single controlled entry point                                      │
│                                                                             │
│  LAYER 7: AUDIT & MONITORING                                                │
│  ─────────────────────────                                                  │
│  - All API calls logged with user, IP, timestamp, resource                 │
│  - PostgreSQL audit_log table (append-only)                                │
│  - CloudTrail for AWS API calls                                            │
│  - VPC Flow Logs for network analysis                                      │
│  - Alert on abnormal patterns (unusual login times, geo anomalies)         │
│  - Log retention: 1 year in S3 Glacier                                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
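
As an illustration of the Layer 3 RBAC model, the sketch below maps the three roles named above to permission sets and denies by default. The permission strings are hypothetical; the source defines only the role names.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Role(str, Enum):
    ADMIN = "admin"
    OPERATOR = "operator"
    VIEWER = "viewer"


# Hypothetical permission names, chosen only to show the role hierarchy.
_VIEWER = frozenset({"events:read", "video:view"})
_OPERATOR = _VIEWER | {"alerts:ack"}
_ADMIN = _OPERATOR | {"users:manage", "cameras:configure"}

PERMISSIONS: Dict[Role, FrozenSet[str]] = {
    Role.VIEWER: _VIEWER,
    Role.OPERATOR: _OPERATOR,
    Role.ADMIN: _ADMIN,
}


def is_allowed(role: str, permission: str) -> bool:
    """Return True only if the role exists and grants the permission."""
    try:
        return permission in PERMISSIONS[Role(role)]
    except ValueError:
        return False  # unknown role: deny by default
```

In the described design the role would come from a claim in the verified JWT; that lookup is omitted here.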

9.2 Secret Management

# Kubernetes External Secrets (AWS Secrets Manager integration)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: surveillance
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: db-credentials
    creationPolicy: Owner
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: surveillance/production/db
        property: password
    - secretKey: DB_USER
      remoteRef:
        key: surveillance/production/db
        property: username

10. Monitoring & Observability

10.1 Monitoring Stack

Component       Technology                 Purpose
Metrics         Prometheus + Thanos        Time-series collection, long-term storage
Visualization   Grafana                    Dashboards for all services
Logs            Loki + Promtail            Log aggregation, indexed by labels
Traces          Jaeger                     Distributed request tracing
Alerts          Alertmanager + PagerDuty   Multi-channel alerting
Uptime          UptimeRobot (external)     External endpoint monitoring

10.2 Key Metrics

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KEY METRICS DASHBOARD                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  STREAM HEALTH                                                              │
│  ────────────                                                               │
│  - stream_active{camera_id}           Gauge: 0/1                           │
│  - stream_fps{camera_id}              Gauge: actual FPS                    │
│  - stream_bitrate{camera_id}          Gauge: kbps                          │
│  - stream_reconnect_total{camera_id}  Counter: reconnect events            │
│  - stream_latency_seconds{camera_id}  Histogram: end-to-end latency        │
│                                                                             │
│  AI INFERENCE                                                               │
│  ────────────                                                               │
│  - ai_inference_duration_seconds      Histogram: per-model latency         │
│  - ai_detection_total{model,class}    Counter: detections by class         │
│  - ai_gpu_utilization_percent         Gauge: GPU usage                     │
│  - ai_gpu_memory_used_bytes           Gauge: VRAM usage                    │
│  - ai_batch_size_current              Gauge: current batch size            │
│  - ai_queue_depth                     Gauge: pending inference requests    │
│                                                                             │
│  EVENTS & ALERTS                                                            │
│  ───────────────                                                            │
│  - events_total{type,severity}        Counter: events processed            │
│  - alerts_active{severity}            Gauge: unacknowledged alerts         │
│  - alert_ack_duration_seconds         Histogram: time to acknowledge       │
│  - false_positive_rate                Gauge: FP ratio (training feedback)  │
│                                                                             │
│  SYSTEM                                                                     │
│  ──────                                                                     │
│  - edge_disk_free_bytes               Gauge: local storage remaining       │
│  - edge_memory_usage_percent          Gauge: RAM usage                     │
│  - vpn_latency_ms                     Gauge: tunnel round-trip time        │
│  - kafka_consumer_lag{topic,group}    Gauge: message backlog               │
│  - db_connection_pool_active          Gauge: DB connections in use         │
│  - api_request_duration_seconds       Histogram: API response time         │
│  - api_requests_total{status,path}    Counter: HTTP status distribution    │
│                                                                             │
│  BUSINESS                                                                   │
│  ─────────                                                                  │
│  - cameras_online_total               Gauge: healthy camera count          │
│  - daily_events_total                 Counter: events per day              │
│  - alert_response_time_avg            Gauge: avg ack time (SLA: <5min)     │
│  - storage_cost_daily_usd             Gauge: estimated daily cost          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

10.3 Alerting Rules

# Prometheus alerting rules
# alerts.yml
groups:
  - name: surveillance-critical
    rules:
      - alert: CameraStreamDown
        expr: stream_active == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camera {{ $labels.camera_id }} stream is down"
          
      - alert: EdgeGatewayOffline
        # 3 missed 30-second heartbeats = 90s, matching section 8.6 (Layer 5)
        expr: time() - vpn_last_heartbeat > 90
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Edge gateway {{ $labels.edge_id }} is offline"
          
      - alert: AIInferenceHighLatency
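        # The metric is recorded in seconds, so the 500 ms threshold is 0.5.
        # histogram_quantile needs the per-bucket rate, aggregated by "le".
        expr: histogram_quantile(0.99, sum by (le) (rate(ai_inference_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI inference P99 latency is {{ $value }}s (threshold 0.5s)"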
          
      - alert: DiskSpaceLow
        expr: edge_disk_free_bytes / edge_disk_total_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Edge disk usage is above 90%"
          
      - alert: UnacknowledgedCriticalAlerts
        expr: alerts_active{severity="critical"} > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} critical alerts unacknowledged for >15 minutes"
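
The latency rule relies on how a quantile is estimated from cumulative histogram buckets. The sketch below reproduces the basic linear-interpolation idea in simplified form, to show why the result is expressed in the metric's own unit (seconds). The bucket boundaries and counts are made up, and the function omits several edge cases that Prometheus itself handles.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound_seconds, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # falls in +Inf: report highest finite bound
            span = count - lower_count
            if span == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / span
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical bucket data: 100 inferences, two of them slower than 0.5 s
example = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, example)  # 0.75 (seconds)
```

With this sample data the estimated P99 is 0.75 s, which would exceed a 500 ms target. The estimate can never be more precise than the bucket layout, so bucket boundaries should bracket the alert threshold.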

11. Cost Estimation

11.1 Monthly Cost Breakdown (8 cameras)

Service                        Instance/Type                          Monthly Cost (USD)
EKS Control Plane              Managed                                $73
EKS Worker Nodes (on-demand)   3× t3.large (API, services)            $200
EKS GPU Nodes                  1× g4dn.xlarge (spot when possible)    $350
RDS PostgreSQL                 db.r6g.xlarge Multi-AZ                 $520
ElastiCache Redis              cache.r6g.large (2 shards)             $260
MSK Kafka                      3× kafka.m5.large                      $350
ALB + Data Transfer            ~500 GB/month                          $50
S3 Storage                     ~10 TB (tiered)                        $200
CloudFront CDN                 ~200 GB/month                          $30
EC2 VPN Endpoint               t3.micro                               $15
Edge Hardware                  Intel NUC (amortized 3yr)              ~$40
Internet (site)                Business broadband                     $50
TOTAL                                                                 ~$2,138/month
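
The total can be re-derived from the line items, which also gives a per-camera figure for the 8-camera deployment. The dictionary simply transcribes the table; the variable names are illustrative.

```python
# Monthly cost per line item in USD, transcribed from the table above
MONTHLY_COST_USD = {
    "EKS Control Plane": 73,
    "EKS Worker Nodes (on-demand)": 200,
    "EKS GPU Nodes": 350,
    "RDS PostgreSQL": 520,
    "ElastiCache Redis": 260,
    "MSK Kafka": 350,
    "ALB + Data Transfer": 50,
    "S3 Storage": 200,
    "CloudFront CDN": 30,
    "EC2 VPN Endpoint": 15,
    "Edge Hardware": 40,
    "Internet (site)": 50,
}

CAMERA_COUNT = 8

total_usd = sum(MONTHLY_COST_USD.values())
per_camera_usd = total_usd / CAMERA_COUNT

print(f"Total: ${total_usd}/month, per camera: ${per_camera_usd:.2f}/month")
```

Most of these items are fixed platform costs, so the per-camera figure should fall as cameras are added; it is not a linear scaling estimate.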

11.2 Cost Optimization Strategies

  1. Spot Instances: GPU nodes and batch processing on spot (up to ~70% savings versus on-demand)
  2. Reserved Instances: RDS and ElastiCache 1-year reserved (up to ~40% savings)
  3. S3 Lifecycle: Automatic tiering to IA and Glacier
  4. Right-sizing: Monitor actual usage, adjust requests/limits
  5. Edge AI: Pre-filter on Jetson Orin to reduce cloud bandwidth (future)
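
A rough upper bound on the effect of strategies 1 and 2 can be worked out from the 11.1 figures. This is back-of-the-envelope arithmetic only: it assumes the full quoted discount applies to the whole line item, and the GPU line is already labeled "spot when possible", so part of that saving may already be priced in.

```python
# Monthly figures in USD, taken from the cost table in 11.1
GPU_NODES = 350
RDS = 520
ELASTICACHE = 260
BASELINE_TOTAL = 2138

SPOT_SAVINGS_PCT = 70      # strategy 1, applied to GPU nodes
RESERVED_SAVINGS_PCT = 40  # strategy 2, applied to RDS and ElastiCache


def optimized_total() -> int:
    """Baseline total minus the best-case spot and reserved savings."""
    spot_saving = GPU_NODES * SPOT_SAVINGS_PCT // 100
    reserved_saving = (RDS + ELASTICACHE) * RESERVED_SAVINGS_PCT // 100
    return BASELINE_TOTAL - spot_saving - reserved_saving


print(optimized_total())  # best-case monthly total in USD
```

Under these assumptions the monthly total drops from about $2,138 to about $1,581. Strategies 3 to 5 are not modeled because the source gives no percentages for them.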

12. Implementation Phases

Phase 1: Foundation (Weeks 1-4)

  • Set up AWS VPC, EKS cluster
  • Deploy WireGuard VPN (cloud endpoint + edge gateway)
  • Deploy PostgreSQL, Redis, Kafka, MinIO
  • Build and deploy Edge Gateway Agent
  • Verify RTSP stream capture from all 8 channels
  • Basic stream ingestion to Kafka

Phase 2: Core AI (Weeks 5-8)

  • Deploy NVIDIA Triton with YOLOv8 detection model
  • Build AI Inference Service (Kafka consumer)
  • Implement person/vehicle detection pipeline
  • Build Suspicious Activity Service (night mode)
  • Face detection + embedding extraction
  • Alert generation and storage

Phase 3: Application (Weeks 9-12)

  • Build Backend API Service
  • Build Web Frontend (Next.js dashboard)
  • Implement live video playback (HLS)
  • Event timeline and search
  • Alert management UI
  • Face search by similarity

Phase 4: Operations (Weeks 13-16)

  • Notification Service (email, SMS, push)
  • Training Service + model retraining pipeline
  • Monitoring stack (Prometheus, Grafana, Loki)
  • Security hardening and penetration testing
  • Performance optimization and load testing
  • Documentation and operator training

13. Appendices

Appendix A: DVR RTSP URL Format

# CP PLUS ORANGE Series RTSP URL format
rtsp://<username>:<password>@<dvr_ip>:554/user=<username>&password=<password>&channel=<1-8>&stream=<0|1>.sdp?

# stream=0: Main stream (higher quality)
# stream=1: Sub stream (lower quality)

# Example:
rtsp://admin:password@192.168.29.200:554/user=admin&password=password&channel=1&stream=0.sdp?

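# FFmpeg test command (-strftime 1 is required so the segment muxer expands
# the %Y%m%d_%H%M%S pattern in the output file name):
ffmpeg -i "rtsp://192.168.29.200:554/user=admin&password=&channel=1&stream=0.sdp?" \
       -c copy -f segment -segment_time 10 -reset_timestamps 1 -strftime 1 \
       /recordings/ch1/%Y%m%d_%H%M%S.mkv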
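
The URL format above can be generated programmatically so that channel and stream values are validated before they reach FFmpeg. `build_rtsp_url` is a hypothetical helper. It inserts credentials verbatim, on the assumption that the DVR expects them unencoded as in the documented example.

```python
def build_rtsp_url(dvr_ip: str, username: str, password: str,
                   channel: int, stream: int = 0, port: int = 554) -> str:
    """Return an RTSP URL in the CP PLUS format shown above.

    channel: 1-8 for this DVR; stream: 0 = main stream, 1 = sub stream.
    """
    if not 1 <= channel <= 8:
        raise ValueError(f"channel must be 1-8, got {channel}")
    if stream not in (0, 1):
        raise ValueError(f"stream must be 0 or 1, got {stream}")
    return (
        f"rtsp://{username}:{password}@{dvr_ip}:{port}"
        f"/user={username}&password={password}"
        f"&channel={channel}&stream={stream}.sdp?"
    )


# Reproduces the documented example for channel 1, main stream
print(build_rtsp_url("192.168.29.200", "admin", "password", channel=1))
```

Since the password is embedded in the URL, the result should be kept out of logs and process listings where practical.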

Appendix B: WireGuard Full Configuration

# === CLOUD SERVER (AWS EC2) ===
# /etc/wireguard/wg0-cloud.conf

[Interface]
PrivateKey = <cloud-private-key>
Address = 10.200.0.1/24
ListenPort = 51820
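# wg-quick runs one command per PostUp/PostDown line (backslash continuation
# is not supported); %i expands to the interface name. The DROP rules are
# appended before the ACCEPT so they match first, and PostDown removes them.
PostUp = iptables -A FORWARD -p tcp --dport 5432 -j DROP
PostUp = iptables -A FORWARD -p tcp --dport 6379 -j DROP
PostUp = iptables -A FORWARD -i %i -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -p tcp --dport 5432 -j DROP
PostDown = iptables -D FORWARD -p tcp --dport 6379 -j DROP
PostDown = iptables -D FORWARD -i %i -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE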
DNS = 10.100.0.2

[Peer]
# Edge Gateway - Site 1
PublicKey = <edge1-public-key>
PresharedKey = <preshared-key-1>
AllowedIPs = 10.200.0.2/32, 192.168.29.0/24
PersistentKeepalive = 25

# === EDGE GATEWAY (Intel NUC) ===
# /etc/wireguard/wg0-edge.conf

[Interface]
PrivateKey = <edge-private-key>
Address = 10.200.0.2/32
DNS = 10.100.0.2
# One command per line; %i expands to the interface name
PostUp = iptables -A FORWARD -i %i -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i %i -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
# Cloud Server
PublicKey = <cloud-public-key>
PresharedKey = <preshared-key-1>
AllowedIPs = 10.100.0.0/16, 10.200.0.0/24
Endpoint = <cloud-public-ip>:51820
PersistentKeepalive = 25
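
The `AllowedIPs` entries determine which destinations each side routes through the tunnel. The sketch below uses Python's standard `ipaddress` module to check a destination against the two lists from the configuration above; `routed_via_peer` is a hypothetical helper, not part of WireGuard.

```python
import ipaddress
from typing import Iterable

# AllowedIPs as configured above
CLOUD_SIDE_ALLOWED_IPS = ["10.200.0.2/32", "192.168.29.0/24"]  # peer: edge gateway
EDGE_SIDE_ALLOWED_IPS = ["10.100.0.0/16", "10.200.0.0/24"]     # peer: cloud server


def routed_via_peer(address: str, allowed_ips: Iterable[str]) -> bool:
    """Return True if address falls inside any of the peer's AllowedIPs."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_ips)


# The cloud can reach the DVR on the site LAN through the edge peer
print(routed_via_peer("192.168.29.200", CLOUD_SIDE_ALLOWED_IPS))
# General internet traffic from the edge does not enter the tunnel
print(routed_via_peer("8.8.8.8", EDGE_SIDE_ALLOWED_IPS))
```

The second check shows that this is a split-tunnel setup: only the VPC and VPN ranges are routed over WireGuard, while other site traffic uses the local internet connection.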

Appendix C: FFmpeg Stream Processing Command

#!/bin/bash
# Edge Gateway stream processing pipeline

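# Both arguments are required; abort with a usage message if one is missing
CAMERA_ID="${1:?usage: $0 CAMERA_ID CHANNEL}"
CHANNEL="${2:?usage: $0 CAMERA_ID CHANNEL}"
DVR_IP="192.168.29.200"
DVR_USER="admin"
# Take the password from the environment instead of hard-coding it
# (defaults to empty, as in the original example)
DVR_PASS="${DVR_PASS:-}"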

RTSP_URL="rtsp://${DVR_IP}:554/user=${DVR_USER}&password=${DVR_PASS}&channel=${CHANNEL}&stream=0.sdp?"
RECORDING_DIR="/var/recordings/${CAMERA_ID}"
AI_PIPE="/tmp/ai_pipe_${CAMERA_ID}"

mkdir -p "$RECORDING_DIR"
mkfifo "$AI_PIPE" 2>/dev/null

# Pipeline 1: Recording (10s segments)
ffmpeg -hide_banner -loglevel warning \
    -rtsp_transport tcp \
    -i "$RTSP_URL" \
    -c copy -f segment \
    -segment_time 10 \
    -segment_format mp4 \
    -reset_timestamps 1 \
    -strftime 1 \
    "${RECORDING_DIR}/%Y%m%d_%H%M%S.mp4" \
    2>> /var/log/ffmpeg-${CAMERA_ID}.log &

# Pipeline 2: AI frame extraction (1 fps)
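# -y: the FIFO already exists, so ffmpeg would otherwise ask before writing
# to it; -nostdin: do not read the terminal while running in the background
ffmpeg -hide_banner -loglevel warning -nostdin -y \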
    -rtsp_transport tcp \
    -i "$RTSP_URL" \
    -vf "fps=1,scale=640:640" \
    -f image2pipe \
    -vcodec mjpeg \
    -q:v 5 \
    "$AI_PIPE" \
    2>> /var/log/ffmpeg-ai-${CAMERA_ID}.log &

# Pipeline 3: Frame batching and gRPC send to cloud
frame-batcher \
    --input "$AI_PIPE" \
    --camera-id "$CAMERA_ID" \
    --batch-size 8 \
    --cloud-endpoint "10.200.0.1:8081" \
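    --vpn-interface wg0 \
    >> /var/log/batcher-${CAMERA_ID}.log 2>&1 &

# Stay in the foreground so a supervisor (e.g. systemd) can track and
# restart the three background pipelines
wait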

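The source does not show the internals of `frame-batcher`. As a rough illustration of what Pipeline 3 has to do, the sketch below splits an MJPEG byte stream into individual JPEG frames and groups them into batches of 8. It is a simplified stand-in: the marker scan assumes plain baseline JPEGs with no embedded thumbnails, and the gRPC send is omitted.

```python
from typing import Iterable, Iterator, List

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def split_jpeg_frames(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield complete JPEG frames from arbitrarily chunked pipe reads."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find(SOI)
            if start < 0:
                buf = buf[-1:]  # keep last byte: SOI may span two chunks
                break
            end = buf.find(EOI, start + 2)
            if end < 0:
                buf = buf[start:]  # incomplete frame: wait for more data
                break
            yield buf[start:end + 2]
            buf = buf[end + 2:]


def batch(frames: Iterable[bytes], size: int = 8) -> Iterator[List[bytes]]:
    """Group frames into lists of `size`; the final batch may be shorter."""
    current: List[bytes] = []
    for frame in frames:
        current.append(frame)
        if len(current) == size:
            yield current
            current = []
    if current:
        yield current
```

At 1 fps per camera, a batch of 8 frames from a single camera adds up to 8 seconds of latency before inference, so a real batcher would likely also flush on a timeout or batch across cameras.
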
Appendix D: Kubernetes Resource Summary

# Complete resource manifest summary
# Namespaces:
# - surveillance: Main application
# - surveillance-data: Database, cache, storage
# - surveillance-monitoring: Prometheus, Grafana
# - surveillance-ops: CI/CD, backup jobs

# Deployments (always running):
# - stream-ingestion: 3-20 replicas, HPA
# - ai-inference: 1-4 replicas (GPU), HPA
# - suspicious-activity: 2-8 replicas, HPA
# - backend-api: 3-10 replicas, HPA
# - video-playback: 2-4 replicas
# - notification-service: 2-5 replicas, HPA
# - web-frontend: 3 replicas (static, CDN-cached)
# - traefik: 2 replicas (DaemonSet preferred)

# StatefulSets:
# - minio: 4 replicas (distributed mode)

# CronJobs:
# - training-service: Weekly (Sundays 02:00)
# - db-backup: Daily (02:00)
# - storage-cleanup: Daily (03:00)
# - partition-maintenance: Monthly (1st, 03:00)

# External Services (AWS managed):
# - PostgreSQL: RDS db.r6g.xlarge Multi-AZ
# - Redis: ElastiCache cluster mode
# - Kafka: MSK 3 brokers
# - ALB: Internet-facing, WAF attached
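
The replica ranges listed above are the bounds within which the Horizontal Pod Autoscaler operates. The sketch below applies the standard HPA scaling formula, desired = ceil(current × currentMetric / targetMetric), clamped to those bounds. The metric values in the example are hypothetical, and the real controller also applies a tolerance band and stabilization windows that are not modeled here.

```python
import math
from typing import Dict, Tuple

# (min, max) replicas for the HPA-managed deployments listed above
REPLICA_BOUNDS: Dict[str, Tuple[int, int]] = {
    "stream-ingestion": (3, 20),
    "ai-inference": (1, 4),
    "suspicious-activity": (2, 8),
    "backend-api": (3, 10),
    "notification-service": (2, 5),
}


def desired_replicas(deployment: str, current_replicas: int,
                     current_metric: float, target_metric: float) -> int:
    """Apply the HPA formula and clamp to the deployment's replica bounds."""
    low, high = REPLICA_BOUNDS[deployment]
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(low, min(high, raw))


# 3 replicas at 90% average CPU against a 60% target scale out to 5
print(desired_replicas("stream-ingestion", 3, 90, 60))
```

The upper bound matters most for `ai-inference`: with at most 4 GPU replicas, sustained load beyond that point shows up as growing queue depth rather than additional pods.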

Document History

Version   Date         Author               Changes
1.0       2025-01-20   Solution Architect   Initial complete architecture

End of Document