AI-Powered Industrial Surveillance Platform — System Architecture
Topology, scaling, failover, and technology choices.
Document Information
- Version: 1.0
- Date: 2025-01-20
- Status: Production Architecture Design
- Target Platform: CP PLUS ORANGE Series DVR (CP-UVR-0801E1-CV2)
- Camera Count: 8 channels (scalable to 64+)
- Resolution: 960x1080 per channel
Table of Contents
- Executive Summary
- Deployment Topology
- Network Security Zones
- Service Architecture
- Data Flow Design
- Technology Stack
- Scaling Strategy
- Failover & Reliability
- Security Architecture
- Monitoring & Observability
- Cost Estimation
- Implementation Phases
- Appendices
1. Executive Summary
This document presents the complete system architecture for an AI-powered industrial surveillance platform designed to process 8 camera channels (expandable to 64+) from a CP PLUS ORANGE Series DVR. The architecture follows a cloud+edge hybrid pattern where compute-intensive AI inference runs in the cloud while a local edge gateway handles stream ingestion and site-local concerns. All DVR communication is protected inside a WireGuard VPN tunnel — the DVR has zero public internet exposure.
Key Architectural Decisions
| Decision | Choice | Rationale |
| --- | --- | --- |
| Cloud Provider | AWS (ap-south-1 primary, ap-southeast-1 DR) | Broadest IoT/edge tooling, VPC peering, lowest latency to India region |
| Container Orchestration | Amazon EKS (Kubernetes) | Managed control plane, auto-scaling, GPU node support for AI inference |
| VPN Solution | WireGuard | ~60% faster than OpenVPN, modern crypto, simple setup, NAT traversal |
| Message Queue | Apache Kafka (MSK) | Durable, ordered event log, replay capability, proven at scale |
| Stream Processing | Apache Flink on EKS | Stateful stream processing, exactly-once semantics, windowed operations |
| Reverse Proxy | Traefik (in-cluster) + AWS ALB (ingress) | Native Kubernetes integration, automatic cert management |
| AI Framework | NVIDIA Triton Inference Server + YOLOv8 | GPU-optimized inference, model ensemble, dynamic batching |
| Object Storage | MinIO (on-premises) + AWS S3 (cold archive) | S3-compatible API, local buffering, cost-tiered archival |
| Database | PostgreSQL 16 (RDS) + pgvector extension | Relational integrity for events, native vector support for face embeddings |
| Cache/Queue | Redis 7 Cluster (ElastiCache) | Sub-ms latency, stream data type for real-time pub/sub |
| Edge Hardware | Intel NUC 13 Pro i7 / NVIDIA Jetson Orin NX | x86 preferred for flexibility; Jetson alternative for GPU-at-edge |
2. Deployment Topology
2.1 High-Level Topology Diagram
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ CLOUD (AWS VPC) │
│ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ KUBERNETES CLUSTER (EKS) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ API GW │ │ Stream │ │ AI Inf. │ │ Suspicious Act. │ │ │
│ │ │ (Traefik) │ │ Ingestion │ │ Service │ │ Service │ │ │
│ │ │ :8443 │ │ Service │ │ (Triton) │ │ (Night Mode) │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └─────────────────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ ┌─────────────────────┐ │ │
│ │ │ Web App │ │ Training │ │ Notification│ │ Video Playback │ │ │
│ │ │ (Next.js) │ │ Service │ │ Service │ │ Service (HLS) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ PostgreSQL │ │ Redis │ │ Kafka │ │ MinIO │ │ │
│ │ │ (RDS) │ │ Cluster │ │ (MSK) │ │ (S3-compatible) │ │ │
│ │ │ :5432 │ │ :6379 │ │ :9092 │ │ :9000 │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ AWS APPLICATION LOAD BALANCER (:443) │ │ │
│ │ │ SSL termination, WAF, rate limiting, geo-restriction │ │ │
│ │ └─────────────────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ WireGuard VPN Tunnel (UDP 51820) │
│ │ Site-to-Site encrypted tunnel │
│ │ Cloud peer: 10.200.0.1/32 ←→ Edge peer: 10.200.0.2/32 │
└───────────┼─────────────────────────────────────────────────────────────────────────────────┘
│
┌───────────┴─────────────────────────────────────────────────────────────────────────────────┐
│ EDGE SITE (Local Network) │
│ │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────────────────┐ │
│ │ EDGE GATEWAY │ │ LOCAL NETWORK │ │
│ │ (Intel NUC / Jetson Orin) │ │ (192.168.29.0/24) │ │
│ │ OS: Ubuntu 22.04 LTS │ │ │ │
│ │ WireGuard endpoint │ │ ┌─────────────────────────────────┐ │ │
│ │ K3s lightweight cluster │ │ │ CP PLUS DVR │ │ │
│ │ │ │ │ CP-UVR-0801E1-CV2 │ │ │
│ │ ┌───────────────────────┐ │◄────────►│ │ LAN: 192.168.29.200 │ │ │
│ │ │ Edge Gateway Agent │ │ :554 │ │ RTSP: 554, HTTP: 80/443 │ │ │
│ │ │ - Stream puller │ │ :80 │ │ TCP: 25001, UDP: 25002 │ │ │
│ │ │ - Buffer/forward │ │ │ │ 8 Channels × 960×1080 │ │ │
│ │ │ - Local recording │ │ │ └─────────────────────────────────┘ │ │
│ │ │ - VPN client │ │ │ │ │
│ │ └───────────────────────┘ │ │ ┌─────────────────────────────────┐ │ │
│ │ │ │ │ Local Monitor (optional) │ │ │
│ │ Local Storage: 2TB NVMe │ │ │ 192.168.29.10 │ │ │
│ │ (7-day circular buffer) │ │ └─────────────────────────────────┘ │ │
│ └─────────────────────────────────┘ │ │ │
│ │ CAMERAS (BNC/IP) ──┐ │ │
│ │ │ │ │
│ │ CH1 ──► CH2 ──► CH3 ──► CH4 │ │
│ │ CH5 ──► CH6 ──► CH7 ──► CH8 │ │
│ │ │ │
│ └────────────────────────────────────────┘ │
│ │
│ Network: Edge Gateway has TWO interfaces: │
│ - eth0: 192.168.29.5/24 ←→ Local network (DVR access) │
│ - eth1: DHCP / Static ←→ Internet (VPN tunnel to cloud) │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
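The tunnel addressing above (cloud peer 10.200.0.1/32 ↔ edge peer 10.200.0.2/32 over UDP 51820) maps onto a WireGuard configuration roughly like the following sketch for the edge gateway. Keys and the cloud endpoint hostname are placeholders, and the `AllowedIPs` choice (tunnel peer plus the 10.100.0.0/16 VPC) is an assumption consistent with the VPC CIDR in §2.3:

```ini
# /etc/wireguard/wg0.conf on the edge gateway (illustrative sketch)
[Interface]
Address    = 10.200.0.2/32
PrivateKey = <edge-private-key>
# No ListenPort: the edge initiates the tunnel, so NAT traversal is
# handled by the keepalive below rather than an open inbound port.

[Peer]
PublicKey           = <cloud-public-key>
Endpoint            = <cloud-vpn-endpoint>:51820
# Route only the cloud peer and the AWS VPC through the tunnel
AllowedIPs          = 10.200.0.1/32, 10.100.0.0/16
PersistentKeepalive = 25
```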
2.2 Physical Edge Gateway Specification
| Component | Specification |
| --- | --- |
| Hardware | Intel NUC 13 Pro, Core i7-1360P, 32GB DDR4, 2TB NVMe SSD |
| Alternative | NVIDIA Jetson Orin NX 16GB (for on-edge AI inference) |
| OS | Ubuntu 22.04 LTS Server, minimal install |
| Container Runtime | containerd (via K3s) |
| K8s Distribution | K3s v1.28+ (lightweight, single-node or 2-node HA) |
| Power | UPS-backed, auto-restart on power loss (BIOS setting) |
| Network | Dual Ethernet: one for local DVR segment, one for internet/VPN |
| Local Storage | 2TB NVMe for 7-day circular buffer of all 8 streams |
2.3 Cloud Infrastructure Specification
| Component | Specification |
| --- | --- |
| Region | Primary: ap-south-1 (Mumbai), DR: ap-southeast-1 (Singapore) |
| VPC | 10.100.0.0/16, 3 AZs, private subnets only for workloads |
| EKS | Managed node groups: on-demand for API, spot for batch processing |
| GPU Nodes | g4dn.xlarge (NVIDIA T4) for Triton inference, 1-4 nodes auto-scaled |
| ALB | Internet-facing, WAF v2 attached, Shield Advanced optional |
| RDS | PostgreSQL 16, db.r6g.xlarge, Multi-AZ, encrypted at rest |
| ElastiCache | Redis 7, cluster mode enabled, 2 shards × 2 replicas |
| MSK (Kafka) | 3 broker nodes, kafka.m5.large, 3 AZs |
| S3 | Standard (hot), IA (30 days), Glacier Deep Archive (1 year) |
3. Network Security Zones
3.1 Security Zone Diagram
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ SECURITY ZONES │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 0: INTERNET (UNTRUSTED) │ │
│ │ │ │
│ │ Users/Browsers ──► AWS ALB (:443) ──► WAF ──► Rate Limit ──► Geo-Block │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 1: AWS VPC EDGE (DEMILITARIZED) │ │
│ │ │ │
│ │ ALB ──► Traefik Ingress ──► Public API endpoints only │ │
│ │ Auth: JWT + RBAC, API key for edge gateway │ │
│ │ │ │
│ │ AWS ALB Security Group: Allow 443 from 0.0.0.0/0 │ │
│ │ Traefik SG: Allow 8443 from ALB-SG only │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 2: AWS VPC APPLICATION (TRUSTED, ISOLATED) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Stream Ing. │ │ AI Inference│ │ Suspicious │ │ Training Service │ │ │
│ │ │ Service │ │ Service │ │ Activity │ │ │ │ │
│ │ │ │ │ │ │ Service │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ Pod Security Policies: No root, read-only FS, no privilege escalation │ │
│ │ Network Policies: Ingress only from API GW namespace, egress to data layer │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 3: AWS VPC DATA (HIGHLY RESTRICTED) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ PostgreSQL │ │ Redis │ │ Kafka │ │ MinIO │ │ │
│ │ │ (RDS) │ │ (ElastiC.) │ │ (MSK) │ │ (S3 API) │ │ │
│ │ │ :5432 │ │ :6379 │ │ :9092 │ │ :9000 │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ Security Groups: Allow connections ONLY from Application Zone SGs │ │
│ │ RDS: Encrypted (AWS KMS), no public access, IAM auth enabled │ │
│ │ S3: Bucket policy deny all except VPC endpoint, versioning enabled │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ WireGuard VPN Tunnel │ (UDP 51820, ChaCha20-Poly1305) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ ZONE 4: EDGE NETWORK (PHYSICALLY ISOLATED) │ │
│ │ │ │
│ │ ┌──────────────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ EDGE GATEWAY AGENT │ │ DVR (192.168.29.200) │ │ │
│ │ │ - K3s node │◄────────►│ NO INTERNET ACCESS │ │ │
│ │ │ - WireGuard peer │ :554 │ Firewall: DROP all non-local│ │ │
│ │ │ - Stream ingestion │ :80 │ │ │ │
│ │ │ - Local buffer │ │ Only 192.168.29.0/24 allowed│ │ │
│ │ └──────────────────────────┘ └─────────────────────────────────┘ │ │
│ │ │ │
│ │ Edge Gateway Firewall (ufw): │ │
│ │ - ALLOW 192.168.29.0/24 → DVR ports (554, 80) │ │
│ │ - ALLOW OUT 51820/udp → Cloud VPN endpoint │ │
│ │ - DENY ALL other incoming │ │
│ │ - No forwarding to local network from VPN (except explicit rules) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
3.2 Firewall Rules
Edge Gateway (UFW)
# Default deny inbound, allow outbound
ufw default deny incoming
ufw default allow outgoing
# Outbound from the gateway to the DVR (the gateway initiates these
# connections; rules are explicit in case outbound is later restricted)
ufw allow out on eth0 to 192.168.29.200 port 554 proto tcp   # RTSP pull
ufw allow out on eth0 to 192.168.29.200 port 80 proto tcp    # HTTP (ONVIF)
# WireGuard VPN to cloud
ufw allow out on eth1 to <cloud-vpn-ip> port 51820 proto udp
# Local admin access (optional, from specific admin IP)
ufw allow from 192.168.29.10 to any port 22 proto tcp        # SSH from admin workstation
AWS Security Groups
| Security Group | Ingress Rules | Egress Rules |
| --- | --- | --- |
| alb-public-sg | TCP 443 from 0.0.0.0/0 | All to VPC |
| traefik-ingress-sg | TCP 8443 from alb-public-sg only | All to VPC |
| app-services-sg | TCP 8080-8090 from traefik-ingress-sg | All to data-layer-sg |
| data-layer-sg | TCP 5432, 6379, 9092, 9000 from app-services-sg only | None |
| vpn-endpoint-sg | UDP 51820 from edge-gateway-ip/32 | All to VPC |
4. Service Architecture
4.1 Service Interaction Diagram
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│ SERVICE ARCHITECTURE │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ API GATEWAY LAYER │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Traefik Ingress Controller (K8s) │ │ │
│ │ │ - Route: /api/* → Backend Service │ │ │
│ │ │ - Route: /ws/* → WebSocket Handler (live video) │ │ │
│ │ │ - Route: / → Next.js Web App │ │ │
│ │ │ - TLS: Let's Encrypt automatic certificates │ │ │
│ │ │ - Middleware: rate limit (100 req/min per IP), JWT validation, CORS │ │ │
│ │ └─────────────────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┼──────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ BACKEND SERVICE │ │ WEB FRONTEND │ │ VIDEO PLAYBACK │ │
│ │ (Go/Gin) │ │ (Next.js 14) │ │ SERVICE (Go) │ │
│ │ :8080 │ │ :3000 │ │ :8085 (HLS) │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ REST API │ │ │ │ React SSR │ │ │ │ HLS Segment │ │ │
│ │ │ - /cameras │ │ │ │ - Dashboard │ │ │ │ Server │ │ │
│ │ │ - /events │ │ │ │ - Live View │ │ │ │ - /live/:id │ │ │
│ │ │ - /alerts │ │ │ │ - Timeline │ │ │ │ - /vod/:id │ │ │
│ │ │ - /search │ │ │ │ - Analytics │ │ │ │ (DASH/HLS) │ │ │
│ │ │ - /training │ │ │ │ - Admin │ │ │ └──────────────┘ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │ │ │
│ │ ┌──────────────┐ │ └──────────────────────┘ └──────────────────────┘ │
│ │ │ gRPC Client │ │ │
│ │ │ (to AI svc) │ │ │
│ │ └──────────────┘ │ │
│ └─────────────────────┘ │
│ │ │
│ │ gRPC (:50051) │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ EVENT & MESSAGE BUS │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Apache Kafka (MSK) │ │ │
│ │ │ │ │ │
│ │ │ Topics: │ │ │
│ │ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────────┐│ │ │
│ │ │ │ streams.raw │ │ ai.detections │ │ alerts.critical ││ │ │
│ │ │ │ (protobuf) │ │ (JSON) │ │ (JSON) ││ │ │
│ │ │ │ - 8 partitions │ │ - 16 partitions │ │ - 4 partitions ││ │ │
│ │ │ │ - 7-day reten. │ │ - 30-day reten. │ │ - 90-day reten. ││ │ │
│ │ │ └─────────────────┘ └─────────────────┘ └─────────────────────────────┘│ │ │
│ │ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────────┐│ │ │
│ │ │ │ training.data │ │ system.metrics │ │ notifications.email ││ │ │
│ │ │ │ (protobuf) │ │ (JSON) │ │ notifications.sms ││ │ │
│ │ │ │ - 30-day reten. │ │ - 7-day reten. │ │ notifications.push ││ │ │
│ │ │ └─────────────────┘ └─────────────────┘ └─────────────────────────────┘│ │ │
│ │ └─────────────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Redis Cluster (ElastiCache) │ │ │
│ │ │ │ │ │
│ │ │ Streams: ┌─────────────────┐ Pub/Sub: ┌──────────────────────┐ │ │ │
│ │ │ │ live:cam:{id} │ │ alert:broadcast │ │ │ │
│ │ │ │ (video chunks) │ │ ws:session:* │ │ │ │
│ │ │ │ cache:api:* │ │ stream:status │ │ │ │
│ │ │ └─────────────────┘ └──────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ STREAM ING. │ │ AI INFERENCE │ │ SUSPICIOUS ACTIVITY │ │
│ │ SERVICE │ │ SERVICE │ │ SERVICE │ │
│ │ (Go/FFmpeg) │ │ (Python/gRPC)│ │ (Go/Python) │ │
│ │ :8081 │ │ :8001 (Triton)│ │ :8083 │ │
│ │ │ │ │ │ │ │
│ │┌────────────┐│ │┌────────────┐│ │┌────────────────────┐│ │
│ ││RTSP Client ││ ││Triton Svr ││ ││Night Mode Analyzer ││ │
│ ││(ffmpeg) ││ ││├─YOLOv8-det││ ││├─Motion detection ││ │
│ ││8 concurrent││ ││├─YOLOv8-face││ ││├─Loitering detect. ││ │
│ ││streams ││ ││├─ArcFace ││ ││├─Perimeter breach ││ │
│ │├────────────┤│ ││└────────────┘│ ││├─Abandoned object ││ │
│ ││Frame Extrac││ ││Model Mgmt. ││ ││├─Crowd detection ││ │
│ ││1 fps anal. ││ ││└────────────┘│ ││└────────────────────┘│ │
│ │├────────────┤│ │├─────────────┤│ │├────────────────────┤│ │
│ ││Kafka Produc.││ ││gRPC API ││ ││Kafka Consumer ││ │
│ ││(raw frames) ││ ││- detect() ││ ││(ai.detections) ││ │
│ │└─────────────┘│ ││- embed() ││ │├────────────────────┤│ │
│ │┌────────────┐ │ ││- compare() ││ ││Rule Engine ││ │
│ ││MinIO Client│ │ │└─────────────┘│ ││├─Time-based rules ││ │
│ ││(video seg.)│ │ └──────────────┘ ││├─Zone-based rules ││ │
│ │└────────────┘ │ ││├─Severity scoring ││ │
│ └───────────────┘ │└────────────────────┘│ │
│ ▲ └──────────────────────┘ │
│ │ WireGuard VPN │
│ │ │
│ ┌──────────────┐ │
│ │ EDGE GATEWAY │ │
│ │ SERVICE │ │
│ │ (Local) │ │
│ └──────────────┘ │
│ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ PostgreSQL 16│ │ pgvector │ │ MinIO │ │ S3 (Cold Archive) │ │ │
│ │ │ (RDS) │ │ Extension │ │ (on-prem + │ │ │ │ │
│ │ │ │ │ │ │ cloud) │ │ │ │ │
│ │ │ cameras │ │ face_embed. │ │ video-seg/ │ │ yearly archive │ │ │
│ │ │ events │ │ table │ │ training-img │ │ compliance storage │ │ │
│ │ │ alerts │ │ (vector) │ │ │ │ │ │ │
│ │ │ audit_log │ │ │ │ lifecycle: │ │ lifecycle: │ │ │
│ │ │ users │ │ HNSW index │ │ 7d local → │ │ 90d → Glacier │ │ │
│ │ │ zones │ │ cos similarity│ │ 30d cloud → │ │ Deep Archive │ │ │
│ │ └──────────────┘ └──────────────┘ │ 1yr archive │ │ │ │ │
│ │ └──────────────┘ └──────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────────────────────┘
4.2 Service Specifications
4.2.1 Edge Gateway Service (Local)
| Attribute | Specification |
| --- | --- |
| Runtime | Go 1.21, compiled binary |
| Deployment | Systemd service on Ubuntu + K3s for containerized components |
| Location | Intel NUC, physically on-site |
| Responsibilities | RTSP stream pull, local recording buffer, VPN tunnel endpoint, heartbeat to cloud |
| Ports | 8080 (HTTP admin), 51820 (WireGuard), 1935 (RTMP relay if needed) |
| Stream Protocol | RTSP over TCP (interleaved) from DVR at 192.168.29.200:554 |
| Local Storage | 2TB NVMe, 7-day circular buffer, ~1.5GB/hour per channel = ~288GB/day for 8ch |
| Reconnect Policy | Exponential backoff: 1s → 2s → 4s → 8s → max 60s, reset on success |
| Heartbeat | Every 30s to cloud Stream Ingestion Service via VPN |
| Failover | Auto-restart via systemd (Restart=always, RestartSec=5) |
4.2.2 Stream Ingestion Service (Cloud)
| Attribute | Specification |
| --- | --- |
| Runtime | Go 1.21 |
| Deployment | Kubernetes Deployment, 3 replicas minimum |
| Responsibilities | Receive frames from edge, decode, produce to Kafka, store segments to MinIO |
| Protocol | gRPC bidirectional streaming from Edge Gateway |
| Frame Rate | 1 fps for AI analysis (decimated from 25fps source) |
| Full Rate | 25 fps for event clips (triggered recordings) |
| Kafka Topic | streams.raw.{camera_id} — protobuf-encoded frame batches |
| Video Segments | 10-second H.264 segments → MinIO bucket video-segments |
| Resource Request | 1 CPU, 2GB RAM per replica |
| HPA | 3-20 replicas based on CPU > 70% |
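The HPA row above corresponds to a manifest along these lines; the object and Deployment names are assumptions for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stream-ingestion          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stream-ingestion
  minReplicas: 3                  # floor from the table above
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out above 70% CPU
```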
4.2.3 AI Inference Service
| Attribute | Specification |
| --- | --- |
| Runtime | NVIDIA Triton Inference Server 2.40+ (Docker) |
| Deployment | Kubernetes Deployment on GPU nodes (g4dn.xlarge) |
| GPU | NVIDIA T4 16GB, 1 GPU per replica |
| Models | YOLOv8x (detection), YOLOv8x-face (face detection), ArcFace (face recognition/embedding) |
| Model Format | TensorRT engines (.plan) for GPU optimization |
| gRPC API | :8001 — Triton native gRPC |
| HTTP API | :8000 — Triton native HTTP |
| Metrics | :8002 — Prometheus metrics endpoint |
| Dynamic Batching | Max batch size: 8 for detection, 16 for face embedding |
| Input | JPEG frames (960×1080) from Kafka topic streams.raw.* |
| Output | Detections (bbox, class, confidence) → Kafka ai.detections |
| Face Embeddings | 512-dim float32 vectors → pgvector (PostgreSQL) |
| Resource | 1× T4 GPU, 4 CPU, 16GB RAM per replica |
| HPA | 1-4 replicas based on GPU utilization > 80% and Kafka consumer lag |
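Comparing two 512-dim ArcFace embeddings reduces to cosine similarity, the same metric pgvector uses in §4.2.12. A minimal sketch of that computation (`CosineSim` is an illustrative helper, not the service's actual compare() implementation):

```go
package main

import (
	"fmt"
	"math"
)

// CosineSim computes cosine similarity between two equal-length
// embedding vectors, e.g. 512-dim ArcFace outputs. Result is in
// [-1, 1]; for face matching, higher means more similar.
func CosineSim(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0 // degenerate zero vector
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	a := []float32{1, 0, 0}
	b := []float32{0, 1, 0}
	fmt.Println(CosineSim(a, a)) // 1
	fmt.Println(CosineSim(a, b)) // 0
}
```

A match threshold (commonly somewhere around 0.5-0.6 for ArcFace, tuned per deployment) would then decide whether two detections are the same person.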
4.2.4 Suspicious Activity Service (Night Mode)
| Attribute | Specification |
| --- | --- |
| Runtime | Python 3.11 (OpenCV, scikit-learn) + Go orchestrator |
| Deployment | Kubernetes Deployment, 2-8 replicas |
| Input | Kafka topic ai.detections + streams.raw.* for motion analysis |
| Responsibilities | Night-mode analysis, loitering detection, perimeter breach, abandoned object, crowd detection |
| Rules Engine | YAML-configured rules per camera, per time schedule |
| Night Schedule | Configurable (default: 22:00 - 06:00), overrides day-mode sensitivity |
| Output | Scored alerts → Kafka alerts.critical + PostgreSQL alerts table |
| ML Models | Background subtraction (MOG2), optical flow for motion tracking, Kalman filters for object tracking |
| Resource Request | 2 CPU, 4GB RAM per replica |
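A per-camera YAML rule file might look like the sketch below. The field names and zone identifiers are illustrative — the rules schema is not specified in this document — but the rule types and the 22:00-06:00 night window match the table above:

```yaml
camera_id: cam-03            # hypothetical camera identifier
schedule:
  night: { start: "22:00", end: "06:00" }
rules:
  - type: loitering
    zone: gate-area
    dwell_seconds: 120       # alert if a person lingers > 2 min
    severity: high
    active: night            # only evaluated during the night window
  - type: perimeter_breach
    zone: fence-line
    alert_on: enter
    severity: critical
    active: always
```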
4.2.5 Training Service
| Attribute | Specification |
| --- | --- |
| Runtime | Python 3.11, PyTorch 2.1, NVIDIA CUDA 12.1 |
| Deployment | Kubernetes Job/CronJob, runs on GPU spot instances |
| Responsibilities | Model retraining, fine-tuning on collected data, A/B model validation |
| Trigger | Weekly scheduled (Sunday 02:00) or manual (API call) |
| Data Source | MinIO bucket training-data (curated positive/negative samples) |
| Output | New TensorRT engines → MinIO bucket model-artifacts |
| A/B Rollout | Blue/green model deployment via Triton model repository |
| Validation | mAP > 0.85 required before promotion to production |
| Resource | 1× V100 GPU (spot), 8 CPU, 32GB RAM |
4.2.6 API Gateway / Backend Service
| Attribute | Specification |
| --- | --- |
| Runtime | Go 1.21, Gin framework |
| Deployment | Kubernetes Deployment, 3-10 replicas |
| Protocol | HTTP/2, REST API + WebSocket for live updates |
| Authentication | JWT (RS256), access token 15min, refresh token 7 days |
| Authorization | RBAC: admin, operator, viewer roles |
| Rate Limiting | 100 req/min per IP, 1000 req/min per API key |
| Endpoints | See API Specification below |
| Caching | Redis for session store and API response caching (TTL 60s) |
| Resource Request | 0.5 CPU, 1GB RAM per replica |
API Endpoints:
| Endpoint | Method | Description | Auth |
| --- | --- | --- | --- |
| /api/v1/auth/login | POST | User authentication | Public |
| /api/v1/auth/refresh | POST | Token refresh | Public |
| /api/v1/cameras | GET | List all cameras | Viewer+ |
| /api/v1/cameras/{id} | GET | Camera details | Viewer+ |
| /api/v1/cameras/{id}/live | GET | Live stream URL (HLS) | Viewer+ |
| /api/v1/events | GET | Query events (paginated, filtered) | Viewer+ |
| /api/v1/events/{id} | GET | Event details with snapshot | Viewer+ |
| /api/v1/alerts | GET | List alerts | Viewer+ |
| /api/v1/alerts/{id}/ack | POST | Acknowledge alert | Operator+ |
| /api/v1/search/faces | POST | Face search by image | Operator+ |
| /api/v1/search/faces/{embedding} | GET | Similar face lookup | Operator+ |
| /api/v1/training/upload | POST | Upload training samples | Admin |
| /api/v1/training/jobs | GET | List training jobs | Admin |
| /api/v1/zones | CRUD | Perimeter zones per camera | Admin |
| /api/v1/reports/daily | GET | Daily activity report | Viewer+ |
| /api/v1/system/health | GET | System health status | Internal |
4.2.7 Web Frontend
| Attribute | Specification |
| --- | --- |
| Framework | Next.js 14 (App Router), React 18, TypeScript |
| Styling | Tailwind CSS + shadcn/ui components |
| State Management | Zustand (client), React Query (server) |
| Video Player | HLS.js for live stream playback, Video.js for VOD |
| Maps | MapLibre GL JS (open source, no API key required) for camera geolocation |
| Real-time | WebSocket connection for alert notifications |
| Build Output | Static export → served via CDN (CloudFront) |
| PWA | Service worker for offline dashboard viewing |
4.2.8 Notification Service
| Attribute | Specification |
| --- | --- |
| Runtime | Go 1.21 |
| Deployment | Kubernetes Deployment, 2-5 replicas |
| Input | Kafka topic alerts.critical |
| Channels | Email (SMTP/AWS SES), SMS (Twilio/AWS SNS), Push (Firebase FCM), Webhook |
| Templates | HTML email templates with event snapshot attachment |
| Rate Limiting | Max 1 SMS per phone per 5 minutes; max 10 emails per address per hour |
| Retry Policy | 3 retries with exponential backoff for each channel; dead-letter after failure |
| Escalation | Unacknowledged critical alerts escalate after 15 minutes (to admin) |
4.2.9 Database — PostgreSQL 16 (RDS)
| Attribute | Specification |
| --- | --- |
| Instance | db.r6g.xlarge (4 vCPU, 32GB RAM) |
| Storage | 500GB gp3, auto-scaling to 2TB |
| Multi-AZ | Enabled for production |
| Extensions | pgvector (face embeddings), PostGIS (zone geometry), pg_stat_statements |
| Backup | Daily automated, 35-day retention |
| Read Replica | 1 read replica for analytics queries |
Schema Overview:
-- Core tables
cameras (id, name, dvr_channel, rtsp_url, location, status, created_at)
events (id, camera_id, event_type, confidence, bounding_box, snapshot_path,
start_time, end_time, severity, metadata JSONB, created_at)
alerts (id, event_id, rule_id, severity, status [new|ack|resolved],
acknowledged_by, acked_at, notification_channels, created_at)
face_embeddings (id, person_name, embedding vector(512), camera_id,
first_seen, last_seen, occurrence_count, metadata JSONB)
users (id, username, password_hash, role, email, phone, created_at)
alert_rules (id, camera_id, rule_type, config JSONB, schedule JSONB,
severity, enabled, created_at)
audit_log (id, user_id, action, resource, details JSONB, ip_address, created_at)
perimeter_zones (id, camera_id, name, polygon GEOMETRY(POLYGON),
alert_on_enter, alert_on_exit, schedule, created_at)
4.2.10 Object Storage — MinIO + S3
| Attribute | Specification |
| --- | --- |
| Local (Edge) | MinIO single-node, 2TB NVMe, 7-day retention |
| Cloud (Primary) | MinIO distributed cluster on EKS, 10TB initial, auto-scaling |
| Archive | AWS S3 with lifecycle: Standard → IA (30d) → Glacier Deep Archive (365d) |
| API | S3-compatible, same SDK for all tiers |
| Buckets | video-segments (10s segments), event-clips (triggered recordings), training-data (curated samples), snapshots (JPEG event frames), model-artifacts (TensorRT engines) |
4.2.11 Redis Cluster
| Attribute | Specification |
| --- | --- |
| Type | ElastiCache for Redis, cluster mode enabled |
| Node Type | cache.r6g.large per shard |
| Shards | 2 shards, 2 replicas per shard |
| Max Memory Policy | allkeys-lru (evict least recently used) |
| Persistence | AOF everysec, RDB every 60min |
| Use Cases | Session store, API cache, real-time pub/sub, stream position tracking |
4.2.12 Vector Store (pgvector)
| Attribute | Specification |
| --- | --- |
| Integration | PostgreSQL extension (same RDS instance) |
| Table | face_embeddings with embedding vector(512) column |
| Index | HNSW (hierarchical navigable small world) for approximate nearest neighbor |
| Index Parameters | m = 16, ef_construction = 64 |
| Similarity Metric | Cosine similarity (<=> operator) |
| Query | SELECT * FROM face_embeddings ORDER BY embedding <=> $1 LIMIT 10 |
| Expected Volume | ~1M vectors per year (8 cameras) |
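The index parameters and similarity query translate to SQL along these lines (columns per the §4.2.9 schema; the index name is an assumption):

```sql
-- HNSW index with the parameters from the table, using cosine distance
CREATE INDEX idx_face_embeddings_hnsw
    ON face_embeddings
 USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Ten nearest faces to a probe embedding $1, with a similarity score:
-- <=> returns cosine distance, so similarity = 1 - distance
SELECT id, person_name,
       1 - (embedding <=> $1) AS cosine_similarity
  FROM face_embeddings
 ORDER BY embedding <=> $1
 LIMIT 10;
```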
5. Data Flow Design
5.1 Complete Data Flow Diagram
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW │
├─────────────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: CAPTURE & INGESTION │
│ ══════════════════════════ │
│ │
│ CAMERAS (8ch) ──► DVR (192.168.29.200) ──► RTSP (:554) ──► EDGE GATEWAY (192.168.29.5) │
│ │
│ Camera → BNC coax → DVR encoder → H.264 stream → RTSP server (DVR builtin) │
│ │
│ Edge Gateway pulls 8 concurrent RTSP streams: │
│ rtsp://192.168.29.200:554/user=admin&password=&channel=1&stream=0.sdp? │
│ rtsp://192.168.29.200:554/user=admin&password=&channel=2&stream=0.sdp? │
│ ... (channels 1-8) │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ EDGE GATEWAY PROCESSING per stream: │ │
│ │ 1. FFmpeg demux → raw H.264 Annex-B frames │ │
│ │ 2. Segment into 10s chunks → local MinIO (circular buffer) │ │
│ │ 3. Extract 1 fps JPEG frames (960×1080 → 640×640 resize for AI) │ │
│ │ 4. Protobuf-encode frame batches │ │
│ │ 5. Send via gRPC bidirectional stream over WireGuard VPN │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ PATH A: LIVE VIDEO │ │ PATH B: AI ANALYSIS │ │ PATH C: RECORDING │ │
│ │ (WebRTC/HLS path) │ │ (detection pipeline)│ │ (event archival) │ │
│ └─────────────────────┘ └─────────────────────┘ └─────────────────────┘ │
│ │
│ LAYER 2: STREAM PROCESSING (Cloud) │
│ ══════════════════════════════════ │
│ │
│ PATH A: LIVE VIDEO ──────────────────────────────────────────────────────── │
│ │
│ Edge Gateway ──► Stream Ing. Svc ──► Redis Stream (live:cam:{id}) ──► HLS Segment Svc │
│ RTSP (decode) (pub/sub buffer) (m3u8 + .ts) │
│ │ │
│ ▼ │
│ CloudFront CDN │
│ │ │
│ ▼ │
│ Web Browser (HLS.js) │
│ │
│ PATH B: AI ANALYSIS ─────────────────────────────────────────────────────── │
│ │
│ Stream Ing. Svc ──► Kafka (streams.raw.{cam}) ──► AI Inference Svc (Triton) │
│ (frame batches) (ordered, partitioned) (YOLOv8 + ArcFace) │
│ │ │
│ ┌─────────────────────────┼─────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Detections │ │ Face Emb. │ │ Stream to │ │
│ │ (person, │ │ (512-dim) │ │ Suspicious │ │
│ │ vehicle) │ │ │ │ Activity │ │
│ └──────┬───────┘ └──────┬───────┘ │ Service │ │
│ │ │ └──────┬───────┘ │
│ ▼ ▼ │ │
│ Kafka (ai. PostgreSQL Kafka │ │
│ detections) (pgvector) alerts │ │
│ .critical │
│ │
│ PATH C: RECORDING ───────────────────────────────────────────────────────── │
│ │
│ Edge Gateway ──► Local MinIO (7d) ──► Sync ──► Cloud MinIO ──► S3 Lifecycle → Glacier │
│ (10s segments) (hot buffer) (daily) (30d hot) (1yr archive) │
│ │
│ LAYER 3: EVENT PROCESSING │
│ ═════════════════════════ │
│ │
│ AI Inference ──► Kafka (ai.detections) ──► Suspicious Activity Svc │
│ Output - bbox, class, conf - Rule engine evaluation │
│ - timestamp - Loitering detection │
│ - camera_id - Perimeter breach check │
│ - embedding_id - Crowd counting │
│ - Time-of-day scoring │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ PostgreSQL │ │ Kafka │ │ Notification Svc │ │
│ │ (alerts) │ │ (alerts. │ │ - Email (SES) │ │
│ │ (events) │ │ critical) │ │ - SMS (Twilio) │ │
│ └──────────────┘ └──────────────┘ │ - Push (FCM) │ │
│ │ - Webhook │ │
│ └──────────────────────┘ │
│ │
│ LAYER 4: CONSUMPTION │
│ ════════════════════ │
│ │
│ Web Frontend ──► API Gateway ──► Backend Service ──► PostgreSQL/Redis/MinIO │
│ (Next.js) (Traefik) (Go/Gin) (data queries) │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ DASHBOARD VIEWS: │ │ │
│ │ │ - Live View: HLS.js + WebSocket for alert overlay │ │ │
│ │ │ - Event Timeline: Infinite scroll, filters │ │ │
│ │ │ - Alert Management: Ack/Nack, assignment │ │ │
│ │ │ - Face Search: Upload photo → pgvector similarity search │ │ │
│ │ │ - Analytics: Time-series charts (event frequency, heatmaps) │ │ │
│ │ │ - Settings: Camera config, zone drawing, rule management │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────── WebSocket: /ws/alerts ───────────────────────┘ │
│ (real-time alert push) │
│ │
│ LAYER 5: TRAINING DATA FLOW │
│ ═══════════════════════════ │
│ │
│ Events (false positive) ──► Admin review ──► "Add to Training" ──► MinIO (training-data) │
│ Events (missed detect) ──► Manual upload ──► Labeling UI ──► Curated dataset │
│ │ │
│ ▼ │
│ Training Service (weekly CronJob) │
│ - Load dataset from MinIO │
│ - Fine-tune YOLOv8 weights │
│ - Convert to TensorRT engine │
│ - Validate mAP > 0.85 │
│ │ │
│ ▼ │
│ Model Registry (MinIO) │
│ - Blue/green deployment │
│ - Triton model repository │
│ │ │
│ ▼ │
│ AI Inference Svc (rolling update) │
│ │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
5.2 Stream Flow Detail
┌─────────────────────────────────────────────────────────────────────────────┐
│ VIDEO STREAM FLOW (Per Camera) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Camera ──► DVR Encoder ──► RTSP Stream ──► Edge Gateway │
│ (H.264, 25fps, │
│ 960x1080) │
│ │
│ Edge Gateway Processing: │
│ ┌─────────────────────────────────────┐ │
│ │ 1. FFmpeg process per channel │ │
│ │ -input rtsp://dvr/ch{N} │ │
│ │ -c:v copy -f segment │ │
│ │ -segment_time 10 │ │
│ │ /recordings/ch{N}/%d.ts │ │
│ │ │ │
│ │ 2. Parallel: 1 fps extraction │ │
│ │ -vf fps=1,scale=640:640 │ │
│ │ -f image2pipe -vcodec mjpeg │ │
│ │ → AI pipeline │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Local Buffer │ │ AI Frames │ │ Cloud Upload │ │
│ │ (7-day ring) │ │ (1 fps JPEG) │ │ (10s chunks) │ │
│ └──────────────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ ┌────────────────────┘ │ │
│ │ WireGuard VPN │ │
│ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ Cloud Stream │ │ Cloud MinIO │ │
│ │ Ingestion Svc │ │ (30-day hot) │ │
│ └───────┬────────┘ └────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ Kafka (streams.raw) │ │
│ │ Partition = camera_id │ │
│ │ Guarantees ordering │ │
│ │ per camera │ │
│ └───────────┬────────────┘ │
│ │ │
│ ┌────────────┼────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────────┐ │
│ │ AI │ │ HLS │ │ Recording│ │
│ │ Inf. │ │ Seg. │ │ Archival │ │
│ │ Svc │ │ Svc │ │ Svc │ │
│ └──────┘ └──────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5.3 Event/Alert Flow Detail
┌─────────────────────────────────────────────────────────────────────────────┐
│ EVENT & ALERT FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ AI Inference Output: │
│ { │
│ "camera_id": "cam_001", │
│ "timestamp": "2025-01-20T14:30:00Z", │
│ "detections": [ │
│ { "class": "person", "confidence": 0.94, │
│ "bbox": [120, 340, 280, 560], "track_id": 42 }, │
│ { "class": "face", "confidence": 0.89, │
│ "bbox": [150, 360, 200, 420], "embedding_id": "emb_12345" } │
│ ] │
│ } │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Kafka Topic: ai.detections │ │
│ │ (JSON, 16 partitions) │ │
│ └─────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ Face │ │ Suspicious │ │
│ │ Matching │ │ Activity Svc │ │
│ │ (pgvector│ │ │ │
│ │ search) │ │ Rule Eval: │ │
│ │ │ │ - Night mode?│ │
│ └────┬─────┘ │ - Zone │ │
│ │ │ overlap? │ │
│ │ │ - Loitering │ │
│ │ │ > 5 min? │ │
│ │ │ - Crowd │ │
│ │ │ > 5 ppl? │ │
│ │ └──────┬───────┘ │
│ │ │ │
│ │ MATCH FOUND │ ALERT TRIGGERED │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ PostgreSQL │ │
│ │ - events table (all detections) │ │
│ │ - alerts table (triggered alerts) │ │
│ │ - face_embeddings (if new/matched) │ │
│ └───────────────────┬──────────────────────┘ │
│ │ │
│ ┌───────────┼───────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ WebSocket│ │ Kafka │ │ Notification │ │
│ │ Push │ │ alerts. │ │ Service │ │
│ │ (live │ │ critical │ │ │ │
│ │ update) │ │ │ │ - Email │ │
│ └──────────┘ └──────────┘ │ - SMS │ │
│ │ - Push │ │
│ │ - Webhook │ │
│ └──────────────┘ │
│ │
│ Alert Lifecycle: │
│ DETECTED → NEW (insert) → WebSocket push → NOTIFY → ACK/RESOLVE │
│ ↓ │
│ If unacked 15min → ESCALATE to admin │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5.4 Live Video to Browser Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ LIVE VIDEO TO BROWSER FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ BROWSER │ │ CloudFront │ │ EKS HLS │ │
│ │ │ │ CDN │ │ Service │ │
│ │ ┌────────┐ │ │ │ │ │ │
│ │ │HLS.js │ │ │ │ │ ┌────────┐ │ │
│ │ │Player │◄─┼──────┼─── m3u8 ─────┼──────┼──│ Playlist│ │ │
│ │ │ │ │ │ + .ts │ │ │ Builder │ │ │
│ │ └────────┘ │ │ segments │ │ └────┬───┘ │ │
│ │ │ │ │ │ │ │ │
│ │ WebSocket ──┼──────┼──────────────┼──────┼───────┘ │ │
│ │ /ws/alerts │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Redis Stream │ │
│ │ live:cam:{id} │ │
│ │ │ │
│ │ ┌──────────┐ │ │
│ │ │Segment 1 │ │ │
│ │ │Segment 2 │ │ │
│ │ │Segment 3 │──┼──► FIFO │
│ │ └──────────┘ │ (keep 30) │
│ └───────┬────────┘ │
│ │ │
│ ┌─────────────────────┘ │
│ │ WireGuard VPN │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Edge Gateway │ │
│ │ FFmpeg → HLS │ │
│ │ segmenter │ │
│ └──────────────────┘ │
│ │
│ Latency Budget: │
│ - DVR encoding: ~100ms │
│ - RTSP to Edge: ~50ms │
│ - VPN tunnel: ~30-80ms (depending on internet) │
│ - Cloud HLS svc: ~50ms │
│ - CDN delivery: ~20-100ms │
│ - Player buffer: 3-6 segments (~30-60s behind real-time) │
│ TOTAL LIVE LATENCY: ~35-65 seconds (HLS inherent) │
│ │
│ For lower latency: WebRTC mode (optional future): │
│ - Target: < 2 seconds using WHIP/WHEP │
│ - Requires direct edge-to-browser or TURN relay │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5.5 Training Data Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRAINING DATA FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SOURCE 1: Automatic (False Positive Detection) │
│ ────────────────────────────────────────────── │
│ │
│ AI Inference → Confidence 0.3-0.6 range → Flag as "uncertain" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ MinIO bucket: training-data/auto/ │ │
│ │ - Original frame (JPEG) │ │
│ │ - Inference result (JSON) │ │
│ │ - Flagged for review │ │
│ └─────────────────────────────────────────┘ │
│ │
│ SOURCE 2: Manual (Operator Upload) │
│ ────────────────────────────────── │
│ │
│ Operator → Dashboard "Upload Training Image" → Label with bounding boxes │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ MinIO bucket: training-data/manual/ │ │
│ │ - Uploaded image with labels (COCO fmt)│ │
│ └─────────────────────────────────────────┘ │
│ │
│ SOURCE 3: Missed Detection (Post-Incident) │
│ ──────────────────────────────────────────── │
│ │
│ Security review → "AI should have caught this" → Extract from recording │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ MinIO bucket: training-data/incident/ │ │
│ │ - Video clip with manual annotation │ │
│ └─────────────────────────────────────────┘ │
│ │
│ AGGREGATION: │
│ ════════════ │
│ │
│ All sources → Weekly CronJob (Sunday 02:00 UTC) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Training Service Pipeline: │ │
│ │ │ │
│ │ 1. Download all new training data from MinIO │ │
│ │ 2. Deduplicate (perceptual hashing) │ │
│ │ 3. Augment: rotation, brightness, noise (albumentations) │ │
│ │ 4. Validate: train/val/test split (80/10/10) │ │
│ │ 5. Fine-tune YOLOv8x: │ │
│ │ - Base: COCO-pretrained weights │ │
│ │ - Epochs: 100, early stopping patience 10 │ │
│ │ - LR: 0.001 with cosine decay │ │
│ │ - Batch: 8 per GPU │ │
│ │ 6. Validate mAP@0.5 > 0.85 │ │
│ │ 7. Convert to TensorRT engine (FP16, max batch 8) │ │
│ │ 8. Upload to MinIO: model-artifacts/{version}/ │ │
│ │ 9. A/B test: shadow mode for 24 hours │ │
│ │ 10. Promote to production if FP rate < baseline │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ MODEL DEPLOYMENT: │
│ ═════════════════ │
│ │
│ MinIO model-artifacts/ → Triton Model Repository → SIGHUP reload │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Blue/Green Deployment: │ │
│ │ - Triton loads new model as "green" │ │
│ │ - 5% traffic routed for 1 hour │ │
│ │ - Monitor: latency P99, error rate │ │
│ │ - If OK: 100% traffic, "blue" retired │ │
│ │ - If FAIL: automatic rollback │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
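The promotion gate at the end of the pipeline (step 6's mAP threshold plus step 10's false-positive comparison) can be stated as one function; a minimal sketch with illustrative names:

```go
package main

import "fmt"

// PromoteModel applies the two promotion gates described above:
// validation mAP@0.5 must exceed the 0.85 threshold, and the
// shadow-mode false-positive rate must beat the production baseline.
func PromoteModel(mAP, fpRate, baselineFP float64) bool {
	const minMAP = 0.85
	return mAP > minMAP && fpRate < baselineFP
}

func main() {
	fmt.Println(PromoteModel(0.91, 0.020, 0.034)) // true: passes both gates
	fmt.Println(PromoteModel(0.80, 0.020, 0.034)) // false: mAP below threshold
	fmt.Println(PromoteModel(0.91, 0.040, 0.034)) // false: FP rate regressed
}
```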
6. Technology Stack
6.1 Technology Selection Matrix
| Category | Technology | Alternative | Selection Criteria |
| --- | --- | --- | --- |
| Cloud Platform | AWS | GCP, Azure | Best India region coverage (Mumbai), most mature managed Kafka (MSK), broad GPU instance types, VPC endpoints for private service communication |
| Container Orchestration | Amazon EKS | GKE, AKS, self-managed | Managed control plane, GPU device plugin support, Cluster Autoscaler, native IAM integration |
| Edge K8s | K3s | K0s, MicroK8s, Docker Compose | Single binary, lightweight, automatic HA with embedded etcd, built-in Helm, compatible with standard K8s manifests |
| VPN | WireGuard | OpenVPN, IPSec, Tailscale | Modern crypto (Curve25519, ChaCha20, Poly1305), kernel module since Linux 5.6, ~60% faster than OpenVPN, NAT traversal, simple config |
| Message Queue | Apache Kafka (MSK) | RabbitMQ, NATS, AWS SQS | Ordered event log, stream replay, high throughput, exactly-once processing with Flink, managed service reduces ops |
| Stream Processing | Apache Flink on EKS | Kafka Streams, Spark Streaming | Stateful processing, event time semantics, exactly-once, CEP (complex event processing) for multi-frame rules |
| Reverse Proxy | Traefik | NGINX, HAProxy, Envoy | Native Kubernetes Ingress, automatic Let's Encrypt, middleware chains, WebSocket support, Prometheus metrics |
| AI Inference | NVIDIA Triton + YOLOv8 | TorchServe, TensorFlow Serving, custom | Multi-framework support, TensorRT optimization, dynamic batching, model ensemble, Prometheus metrics |
| Database | PostgreSQL 16 (RDS) + pgvector | MySQL, MongoDB, separate vector DB | ACID compliance, mature managed service, pgvector handles 512-dim embeddings at scale, no separate DB to manage |
| Cache | Redis 7 Cluster | Memcached, KeyDB | Data structures (streams, sorted sets), pub/sub, persistence, cluster mode for horizontal scaling |
| Object Storage | MinIO + S3 | Ceph, GlusterFS, pure S3 | S3-compatible API everywhere, local buffering at edge, cloud tiering, cost optimization via lifecycle policies |
| Backend Language | Go 1.21 | Python, Java, Rust | Compiled performance for high-throughput streaming, excellent concurrency (goroutines), small container images |
| Frontend | Next.js 14 + React 18 | Vue, Angular, Svelte | SSR for SEO/performance, React ecosystem, API routes, image optimization, easy deployment to CDN |
| Monitoring | Prometheus + Grafana + Loki | Datadog, New Relic, CloudWatch | Open source, no per-host licensing, powerful alerting, log aggregation with Loki, custom dashboards |
| CI/CD | GitHub Actions + ArgoCD | GitLab CI, Jenkins, Flux | GitOps deployment, automated rollback, drift detection, progressive delivery |
6.2 WireGuard VPN Configuration
# Cloud Server (AWS EC2 bastion / VPN endpoint)
[Interface]
Address = 10.200.0.1/32
ListenPort = 51820
PrivateKey = <cloud-private-key>
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
[Peer]
# Edge Gateway
PublicKey = <edge-public-key>
AllowedIPs = 10.200.0.2/32, 192.168.29.0/24
PersistentKeepalive = 25
# Edge Gateway (Intel NUC)
[Interface]
Address = 10.200.0.2/32
PrivateKey = <edge-private-key>
[Peer]
# Cloud Server
PublicKey = <cloud-public-key>
AllowedIPs = 10.100.0.0/16 # Entire AWS VPC
Endpoint = <cloud-public-ip>:51820
PersistentKeepalive = 25
6.3 Port Reference Table
| Service | Port | Protocol | Location | Notes |
| --- | --- | --- | --- | --- |
| DVR RTSP | 554 | TCP | 192.168.29.200 | Local network only |
| DVR HTTP | 80 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR HTTPS | 443 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR TCP | 25001 | TCP | 192.168.29.200 | Proprietary protocol |
| DVR UDP | 25002 | UDP | 192.168.29.200 | Proprietary protocol |
| DVR NTP | 123 | UDP | 192.168.29.200 | Time sync |
| WireGuard | 51820 | UDP | Cloud + Edge | VPN tunnel |
| Edge Admin | 8080 | TCP | 192.168.29.5 | Local admin UI |
| Edge SSH | 22 | TCP | 192.168.29.5 | Admin access only |
| Traefik HTTP | 8000 | TCP | EKS | Internal HTTP entrypoint |
| Traefik HTTPS | 8443 | TCP | EKS | Internal HTTPS entrypoint |
| ALB HTTPS | 443 | TCP | AWS | Public-facing |
| Backend API | 8080 | TCP | EKS pods | Internal service port |
| Triton HTTP | 8000 | TCP | EKS GPU nodes | Model inference HTTP |
| Triton gRPC | 8001 | TCP | EKS GPU nodes | Model inference gRPC |
| Triton Metrics | 8002 | TCP | EKS GPU nodes | Prometheus metrics |
| PostgreSQL | 5432 | TCP | RDS | VPC-private |
| Redis | 6379 | TCP | ElastiCache | VPC-private |
| Kafka | 9092 | TCP | MSK | VPC-private |
| MinIO API | 9000 | TCP | EKS + Edge | S3-compatible API |
| MinIO Console | 9001 | TCP | EKS + Edge | Admin console |
| Prometheus | 9090 | TCP | EKS | Metrics collection |
| Grafana | 3000 | TCP | EKS | Dashboards |
7. Scaling Strategy
7.1 Camera Scaling Roadmap
┌─────────────────────────────────────────────────────────────────────────────┐
│ CAMERA SCALING ROADMAP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CURRENT: 8 cameras (1 DVR) │
│ ├─ Edge: Intel NUC i7, 32GB RAM │
│ ├─ Streams: 8 × RTSP @ 960×1080 │
│ ├─ Bandwidth: ~16 Mbps upstream (2 Mbps per H.264 stream) │
│ └─ Cloud AI: 1× T4 GPU (handles 8 streams @ 1 fps) │
│ │
│ PHASE 1: 16 cameras (2 DVRs) │
│ ├─ Edge: Intel NUC i7 (sufficient) or 2× NUC │
│ ├─ Add 2nd edge gateway for 2nd DVR site (if different location) │
│ ├─ Streams: 16 × RTSP │
│ ├─ Bandwidth: ~32 Mbps │
│ ├─ Cloud AI: 1× T4 GPU (still sufficient, batch size 8 → 16) │
│ └─ Kafka: 8 partitions → 16 partitions │
│ │
│ PHASE 2: 32 cameras (4 DVRs / 4 sites) │
│ ├─ Edge: 4× Intel NUC (one per site) │
│ ├─ VPN: Hub-spoke model (4 edge peers → 1 cloud endpoint) │
│ ├─ Bandwidth: ~64 Mbps │
│ ├─ Cloud AI: 2× T4 GPUs (HPA: 2-6 replicas) │
│ ├─ Stream Ing.: 6-12 replicas (HPA) │
│ ├─ Kafka: 32 partitions │
│ └─ PostgreSQL: db.r6g.2xlarge (scale up) │
│ │
│ PHASE 3: 64 cameras (8 DVRs / 8 sites) │
│ ├─ Edge: 8× Intel NUC (or NVIDIA Jetson Orin for edge AI pre-filter) │
│ ├─ VPN: WireGuard hub-spoke or mesh (consider Tailscale for simplicity) │
│ ├─ Bandwidth: ~128 Mbps (dedicated internet circuit recommended) │
│ ├─ Cloud AI: 4× T4 GPUs or 2× A10G (g5.2xlarge) │
│ ├─ Stream Ing.: 12-20 replicas │
│ ├─ Kafka: 64 partitions, consider MSK multi-cluster │
│ ├─ PostgreSQL: db.r6g.4xlarge + read replica │
│ ├─ Redis: 4 shards │
│ └─ MinIO: Distributed mode, 4+ nodes │
│ │
│ PHASE 4: 64+ cameras (NVR consolidation) │
│ ├─ Consider NVR-to-edge consolidation (fewer, more powerful recorders) │
│ ├─ Edge AI pre-filtering (Jetson Orin): only send motion frames │
│ ├─ Bandwidth reduction: ~50% via smart filtering │
│ └─ Multi-region cloud deployment for latency optimization │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
7.2 AI Inference Scaling
| Metric | 8 Cameras | 16 Cameras | 32 Cameras | 64 Cameras |
| --- | --- | --- | --- | --- |
| Frame Rate | 8 fps (1 per cam) | 16 fps | 32 fps | 64 fps |
| GPU Replicas | 1× T4 | 1× T4 | 2× T4 | 4× T4 or 2× A10G |
| Inference Latency (P99) | 80ms | 120ms | 150ms | 200ms |
| Kafka Partitions (raw) | 8 | 16 | 32 | 64 |
| Consumer Groups | 3 | 4 | 6 | 8 |
Auto-scaling Triggers:
- GPU utilization > 80% for 2 minutes → scale out
- Kafka consumer lag > 1000 messages for 5 minutes → scale out
- Queue depth < 100 for 10 minutes → scale in (to minimum)
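These triggers map naturally onto a Kubernetes HorizontalPodAutoscaler fed by the Prometheus Adapter. A sketch of what that could look like; the metric names (`gpu_utilization_percent`, `kafka_consumer_lag`) are assumptions about what the adapter exposes, not guaranteed names:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference
  namespace: surveillance
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization_percent   # assumed adapter-exposed metric
      target:
        type: AverageValue
        averageValue: "80"
  - type: External
    external:
      metric:
        name: kafka_consumer_lag        # assumed adapter-exposed metric
      target:
        type: Value
        value: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 min quiet period before scale-in
```

The stabilization window implements the "scale in after 10 minutes" trigger; scale-out reacts as soon as either metric breaches its target.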
7.3 Storage Scaling
┌─────────────────────────────────────────────────────────────────────────────┐
│ STORAGE CAPACITY PLANNING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Per-Camera Storage Profile: │
│ - Continuous recording: ~1.5 GB/hour @ 960×1080 H.264 main profile │
│ - AI snapshots (1 fps): ~50 MB/hour (JPEG compressed) │
│ - Event clips: ~10 MB average per event (30-second clip) │
│ │
│ Total Per Day (8 cameras): │
│ - Video: 8 × 1.5 GB × 24h = 288 GB/day │
│ - Snapshots: 8 × 50 MB × 24h = 9.6 GB/day │
│ - Events (est. 500/day): 500 × 10 MB = 5 GB/day │
│ - TOTAL: ~303 GB/day = ~9 TB/month │
│ │
│ Tiered Storage Strategy: │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TIER 1: EDGE LOCAL (Hot, 7 days) │ │
│ │ - Capacity: 2TB NVMe per edge gateway │ │
│ │ - All 8 streams, full resolution, 10s segments │ │
│ │ - Cost: Hardware (CAPEX) │ │
│ │ - Use: Immediate playback, event export │ │
│ │ │ │
│ │ 7 days × 303 GB = 2.1 TB ✓ (fits in 2TB with compression) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TIER 2: CLOUD MINIO (Warm, 30 days) │ │
│ │ - Capacity: 10TB initial, auto-scaling │ │
│ │ - Full resolution video segments + event snapshots │ │
│ │ - Cost: ~$0.023/GB/month (S3 Standard equivalent) │ │
│ │ - Use: Dashboard playback, search, investigation │ │
│ │ │ │
│ │ 30 days × 303 GB = 9.1 TB │ │
│ │ Cost: 9,100 GB × $0.023 = ~$209/month │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TIER 3: S3 IA (Cool, 31-90 days) │ │
│ │ - Capacity: Auto (lifecycle transition) │ │
│ │ - Cost: ~$0.0125/GB/month │ │
│ │ - Use: Occasional access, compliance review │ │
│ │ │ │
│ │ 60 days × 303 GB = 18.2 TB │ │
│ │ Cost: 18,200 GB × $0.0125 = ~$228/month │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TIER 4: GLACIER DEEP ARCHIVE (Cold, 90+ days) │ │
│ │ - Capacity: Unbounded │ │
│ │ - Cost: ~$0.00099/GB/month │ │
│ │ - Retrieval: 12-48 hours (batch) │ │
│ │ - Use: Long-term compliance, legal hold │ │
│ │ │ │
│ │ Annual accumulation: 303 GB × 365 = 110 TB │ │
│ │ Cost: 110,000 GB × $0.00099 = ~$109/month │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ TOTAL MONTHLY STORAGE COST (8 cameras, steady state): │
│ - Tier 2 (hot): $209 │
│ - Tier 3 (warm): $228 │
│ - Tier 4 (cold): $109 │
│ - TOTAL: ~$546/month │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
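The tier costs above follow directly from the ~303 GB/day figure; a quick arithmetic sanity check (small differences versus the table come from its rounded GB totals):

```go
package main

import "fmt"

// monthlyTierCost returns the steady-state monthly cost of holding
// `days` worth of footage at gbPerDay gigabytes/day for the given
// per-GB-month rate.
func monthlyTierCost(gbPerDay float64, days int, ratePerGBMonth float64) float64 {
	return gbPerDay * float64(days) * ratePerGBMonth
}

func main() {
	const gbPerDay = 303.0
	tier2 := monthlyTierCost(gbPerDay, 30, 0.023)    // cloud MinIO / S3 Standard, 30d
	tier3 := monthlyTierCost(gbPerDay, 60, 0.0125)   // S3 IA, days 31-90
	tier4 := monthlyTierCost(gbPerDay, 365, 0.00099) // Glacier, 1-year accumulation
	fmt.Printf("tier2=$%.0f tier3=$%.0f tier4=$%.0f total=$%.0f\n",
		tier2, tier3, tier4, tier2+tier3+tier4)
	// prints: tier2=$209 tier3=$227 tier4=$109 total=$546
}
```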
7.4 Database Partitioning Strategy
-- Partition events table by month (range partitioning)
CREATE TABLE events (
id BIGSERIAL,
camera_id VARCHAR(32) NOT NULL,
event_type VARCHAR(50) NOT NULL,
confidence DECIMAL(4,3),
bounding_box BOX,
snapshot_path VARCHAR(512),
start_time TIMESTAMPTZ NOT NULL,
end_time TIMESTAMPTZ,
severity VARCHAR(20),
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY RANGE (start_time);
-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... auto-created by cron job
-- Partition pruning ensures queries for specific time ranges
-- only scan relevant partitions
-- Automated partition creation (pg_partman extension)
SELECT partman.create_parent('public.events', 'start_time', 'native', 'monthly');
-- Partition archival
-- Partitions older than 12 months:
--   1. DETACH the partition (ALTER TABLE events DETACH PARTITION ...)
--   2. Export it to S3 (e.g. RDS's aws_s3.query_export_to_s3, or a
--      COPY-based dump uploaded by a scheduled job)
--   3. DROP the detached table (data remains in the cold archive)
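With range partitioning in place, any query bounded on `start_time` is pruned to the matching partitions at plan time. A sketch against the DDL above (the plan wording varies by PostgreSQL version):

```sql
-- Should scan only events_2025_01, not the whole events table
EXPLAIN (COSTS OFF)
SELECT camera_id, event_type, confidence
FROM events
WHERE start_time >= '2025-01-15'
  AND start_time <  '2025-01-16'
  AND camera_id = 'cam_001';
-- Expected: a scan node on events_2025_01 only; other monthly
-- partitions are pruned and never touched.
```

This is why time-bounded dashboard queries stay fast even as the events table accumulates years of data.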
8. Failover & Reliability
8.1 Service Restart Policies
# Kubernetes Deployment - Restart Policy Example
# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: stream-ingestion-service
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero-downtime deployment
template:
spec:
containers:
- name: ingestion
image: surveillance/stream-ingestion:v1.2.3
resources:
requests:
cpu: "1000m"
memory: "2Gi"
limits:
cpu: "2000m"
memory: "4Gi"
livenessProbe:
grpc:
port: 8081
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3 # Restart after 30s of failures
readinessProbe:
grpc:
port: 8081
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
startupProbe:
grpc:
port: 8081
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 30 # 150s max for startup
8.2 Stream Reconnect Logic
// Edge Gateway stream reconnect loop (Go sketch)
func maintainStream(cameraID string, rtspURL string) {
    backoff := &ExponentialBackoff{
        Initial:    1 * time.Second,
        Max:        60 * time.Second,
        Multiplier: 2.0,
        Jitter:     0.1,
    }
    for {
        ctx, cancel := context.WithCancel(context.Background())
        err := connectAndStream(ctx, cameraID, rtspURL) // blocks until the stream drops
        cancel()                                        // always release the context
        if err != nil {
            log.Error("stream disconnected", "camera", cameraID, "error", err)
            // Update health status in Redis
            redis.HSet("stream:health", cameraID, "disconnected")
            // Wait with exponential backoff before reconnecting
            wait := backoff.Next()
            log.Info("reconnecting", "camera", cameraID, "wait", wait)
            time.Sleep(wait)
            continue
        }
        // Clean return: reset backoff and mark healthy
        backoff.Reset()
        redis.HSet("stream:health", cameraID, "connected")
    }
}
// Circuit breaker pattern for cloud connection
type CircuitBreaker struct {
state State // Closed, Open, HalfOpen
failureCount int
failureThreshold int // 5 failures
timeout time.Duration // 60s open state
lastFailureTime time.Time
}
8.3 VPN Tunnel Recovery
#!/bin/bash
# /usr/local/bin/wireguard-watchdog.sh
# Run every minute via cron (cron's minimum granularity; use a
# systemd timer for a true 30-second interval)
CLOUD_ENDPOINT="10.200.0.1"
TUNNEL_INTERFACE="wg0"
LOG_FILE="/var/log/wg-watchdog.log"
# Check tunnel health
ping -c 3 -W 5 -I $TUNNEL_INTERFACE $CLOUD_ENDPOINT > /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "$(date): VPN tunnel unhealthy, restarting..." >> $LOG_FILE
# 1. Restart WireGuard interface
wg-quick down $TUNNEL_INTERFACE
sleep 2
wg-quick up $TUNNEL_INTERFACE
# 2. Verify recovery
sleep 5
ping -c 3 -W 5 -I $TUNNEL_INTERFACE $CLOUD_ENDPOINT > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "$(date): VPN tunnel recovered" >> $LOG_FILE
# Notify cloud of recovery
curl -X POST http://10.200.0.1:8080/api/v1/system/edge-recovery \
-H "Authorization: Bearer $EDGE_TOKEN" \
-d "{\"edge_id\": \"$HOSTNAME\", \"status\": \"recovered\"}"
else
echo "$(date): VPN tunnel recovery FAILED" >> $LOG_FILE
# Escalate: local alert (buzzer/email if available)
fi
fi
8.4 Queue Recovery & Durability
┌─────────────────────────────────────────────────────────────────────────────┐
│ KAFKA DURABILITY CONFIGURATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Producer Configuration (Stream Ingestion Service): │
│ ────────────────────────────────────────────────── │
│ - acks=all # Wait for all replicas │
│ - retries=10 # Aggressive retry │
│ - retry.backoff.ms=1000 # 1 second between retries │
│ - enable.idempotence=true # Exactly-once semantics │
│ - max.in.flight.requests=1 # Preserve ordering during retry │
│ - compression.type=lz4 # Efficient compression │
│ │
│ Topic Configuration: │
│ ──────────────────── │
│ - replication.factor=3 # 3 copies across AZs │
│ - min.insync.replicas=2 # Require 2 acks for producer commit │
│ - retention.ms=604800000 # 7 days for raw streams │
│ - retention.ms=2592000000 # 30 days for detections │
│ - unclean.leader.election.enable=false # Never lose committed data │
│ │
│ Consumer Configuration (AI Inference Service): │
│ ────────────────────────────────────────────── │
│ - enable.auto.commit=false # Manual offset management │
│ - auto.offset.reset=earliest # Replay from beginning on new group │
│ - max.poll.records=100 # Process in batches │
│ - isolation.level=read_committed # Only read committed transactions │
│ │
│ Offset Commit Strategy: │
│ ─────────────────────── │
│ 1. Pull batch from Kafka │
│ 2. Process (run inference) │
│ 3. Write results to PostgreSQL (transaction) │
│ 4. Commit Kafka offset ONLY after DB write succeeds │
│ 5. If any step fails: don't commit, reprocess on next poll │
│ │
│ Dead Letter Queue: │
│ ────────────────── │
│ - Topic: streams.raw.dlq │
│ - After 5 processing failures, message moved to DLQ │
│ - DLQ consumer: alerts admin, manual inspection │
│ - Retention: 30 days │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
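The offset commit strategy (steps 1-5) reduces to "commit only after the database transaction succeeds". A sketch of that invariant against hypothetical interfaces and in-memory fakes — no real Kafka client API is assumed here:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the Kafka consumer and the results store; the
// real services would wrap an actual client and PostgreSQL.
type Batch struct {
	Records []string
	Offset  int64
}

type Store interface {
	WriteResults(records []string) error // one DB transaction
}

type Committer interface {
	Commit(offset int64) error
}

// processBatch writes results first and commits the offset only if the
// write succeeded. On failure nothing is committed, so the batch is
// redelivered on the next poll: at-least-once delivery that, combined
// with idempotent writes, behaves as effectively exactly-once.
func processBatch(b Batch, db Store, c Committer) error {
	if err := db.WriteResults(b.Records); err != nil {
		return err // do NOT commit; batch will be reprocessed
	}
	return c.Commit(b.Offset)
}

// In-memory fakes for demonstration.
type fakeStore struct {
	fail bool
}

func (s *fakeStore) WriteResults(r []string) error {
	if s.fail {
		return errors.New("db down")
	}
	return nil
}

type fakeCommitter struct {
	committed int64
}

func (c *fakeCommitter) Commit(o int64) error { c.committed = o; return nil }

func main() {
	db, com := &fakeStore{}, &fakeCommitter{committed: -1}
	_ = processBatch(Batch{Records: []string{"d1", "d2"}, Offset: 42}, db, com)
	fmt.Println(com.committed) // 42: committed after a successful write

	db.fail = true
	_ = processBatch(Batch{Records: []string{"d3"}, Offset: 43}, db, com)
	fmt.Println(com.committed) // still 42: failed write leaves offset uncommitted
}
```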
8.5 Graceful Degradation
┌─────────────────────────────────────────────────────────────────────────────┐
│ GRACEFUL DEGRADATION MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ FAILURE MODE │ DEGRADATION STRATEGY │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ AI Inference Service DOWN │ Continue recording ALL video │ │
│ │ (GPU failure, model crash) │ - Events stored as "unprocessed"│ │
│ │ │ - No real-time alerts │ │
│ │ │ - Queue frames for later batch │ │
│ │ │ processing when AI recovers │ │
│ │ │ - Dashboard shows "AI OFFLINE" │ │
│ │ │ banner │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ Kafka DOWN │ Edge Gateway buffers locally │ │
│ │ (MSK outage) │ - Local MinIO ring buffer │ │
│ │ │ - Backpressure: reduce to │ │
│ │ │ key frames only (0.2 fps) │ │
│ │ │ - Auto-reconnect with 2x │ │
│ │ │ exponential backoff │ │
│ │ │ - Replay from local buffer │ │
│ │ │ when Kafka recovers │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ VPN Tunnel DOWN │ Full local operation mode │ │
│ │ (internet outage) │ - All recording continues │ │
│ │ │ locally (7-day buffer) │ │
│ │ │ - Local alert buzzer/relay │ │
│ │ │ (configurable) │ │
│ │ │ - No cloud dashboard access │ │
│ │ │ - Auto-sync when VPN recovers │ │
│ │ │ - Queue cloud events for │ │
│ │ │ later replay │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ PostgreSQL DOWN │ Alert queue builds in Kafka │ │
│ │ (RDS outage) │ - Events not lost (Kafka dur.) │ │
│ │ │ - Read-only dashboard mode │ │
│ │ │ - Cached data from Redis │ │
│ │ │ - Alert on-call engineer │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ Notification Service DOWN │ Alerts accumulate in DB │ │
│ │ │ - Retry with exponential backoff│ │
│ │ │ - Dead letter after 24 hours │ │
│ │ │ - Dashboard shows pending count │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ Edge Gateway DOWN │ Cloud dashboard shows │ │
│ │ (power/hardware failure) │ "SITE OFFLINE" │ │
│ │ │ - Last known recordings in │ │
│ │ │ cloud (up to disconnect) │ │
│ │ │ - Alert sent immediately │ │
│ │ │ - UPS on edge: graceful │ │
│ │ │ shutdown, preserve data │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Priority Order (highest first): │
│ 1. Video recording NEVER STOPS (local edge priority) │
│ 2. Critical alerts ALWAYS FIRE (local buzzer + queued cloud alerts) │
│ 3. AI inference gracefully degrades to batch catch-up │
│ 4. Dashboard operates in read-only/cache mode during DB outage │
│ 5. Cloud sync resumes automatically when connectivity restored │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.6 Health Check Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ HEALTH CHECK ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: KUBERNETES PROBES │
│ ────────────────────────── │
│ - Liveness Probe: /health/live → Restart container if failing │
│ - Readiness Probe: /health/ready → Remove from service if failing │
│ - Startup Probe: /health/startup→ Allow long initialization │
│ │
│ LAYER 2: SERVICE-LEVEL HEALTH (Prometheus metrics) │
│ ────────────────────────────────────────────────── │
│ Each service exposes: │
│ - app_health_status{service="X"} 0=healthy, 1=degraded, 2=critical │
│ - app_health_details{check="db"} last check timestamp + result │
│ │
│ LAYER 3: DEPENDENCY HEALTH CHECKS │
│ ──────────────────────────────── │
│ Backend Service checks: │
│ ├─ PostgreSQL: SELECT 1; (timeout 2s) │
│ ├─ Redis: PING (timeout 1s) │
│ ├─ Kafka: ListTopics (timeout 3s) │
│ ├─ MinIO: ListBuckets (timeout 3s) │
│ └─ Triton: ModelReady API (timeout 5s) │
│ │
│ LAYER 4: END-TO-END HEALTH │
│ ────────────────────────── │
│ Synthetic probe: │
│ 1. Upload test image to stream ingestion │
│ 2. Verify AI detection result appears in Kafka │
│ 3. Verify event written to PostgreSQL │
│ 4. Verify alert queryable via API │
│ 5. Verify WebSocket push received │
│ Run: Every 60 seconds from monitoring namespace │
│ │
│ LAYER 5: EDGE HEALTH HEARTBEAT │
│ ──────────────────────────────── │
│ - Edge Gateway sends heartbeat every 30 seconds │
│ - Payload: {edge_id, timestamp, stream_count, disk_free, mem_usage} │
│ - Missed 3 heartbeats (90s) → "EDGE OFFLINE" alert │
│ - Recovers → "EDGE ONLINE" notification │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
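The Layer 2 metric (`app_health_status`: 0=healthy, 1=degraded, 2=critical) has to be derived from the Layer 3 dependency checks somehow; one plausible aggregation, sketched below, splits dependencies into hard (service cannot function) and soft (reduced functionality). The hard/soft split is an assumption of this sketch, not mandated by the document:

```go
package main

import "fmt"

// Status values follow the app_health_status convention above.
const (
	Healthy  = 0
	Degraded = 1
	Critical = 2
)

// AggregateHealth maps dependency check results to one service status:
// any failing hard dependency is critical, any failing soft dependency
// degrades, otherwise the service is healthy.
func AggregateHealth(hardOK, softOK map[string]bool) int {
	for _, ok := range hardOK {
		if !ok {
			return Critical
		}
	}
	for _, ok := range softOK {
		if !ok {
			return Degraded
		}
	}
	return Healthy
}

func main() {
	hard := map[string]bool{"postgres": true, "kafka": true}
	soft := map[string]bool{"redis": true, "minio": false}
	fmt.Println(AggregateHealth(hard, soft)) // 1: degraded (minio failing)
	hard["postgres"] = false
	fmt.Println(AggregateHealth(hard, soft)) // 2: critical
}
```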
9. Security Architecture
9.1 Defense in Depth
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEFENSE IN DEPTH LAYERS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: PERIMETER │
│ ────────────── │
│ - AWS WAF v2: SQL injection, XSS, rate limiting rules │
│ - Geo-restriction: Allow only specific countries │
│ - AWS Shield Standard (DDoS protection) │
│ - ALB access logs → S3 → Athena for analysis │
│ │
│ LAYER 2: TRANSPORT │
│ ────────────── │
│ - TLS 1.3 for all external HTTPS connections │
│ - WireGuard ChaCha20-Poly1305 for VPN tunnel │
│ - mTLS (mutual TLS) for internal service-to-service communication │
│ - Certificate rotation: Let's Encrypt auto (90-day) │
│ │
│ LAYER 3: AUTHENTICATION & AUTHORIZATION │
│ ────────────────────────────────────────── │
│ - JWT with RS256 (asymmetric signing) │
│ - Access token: 15 minutes │
│ - Refresh token: 7 days (stored in httpOnly cookie) │
│ - RBAC: admin, operator, viewer roles │
│ - API keys for edge gateway authentication │
│ - Multi-factor authentication for admin role │
│ │
│ LAYER 4: APPLICATION SECURITY │
│ ──────────────────────────── │
│ - Input validation: strict JSON schemas │
│ - SQL injection: parameterized queries only (pgx) │
│ - XSS prevention: Content Security Policy headers │
│ - CSRF tokens for state-changing operations │
│ - File upload: virus scanning, size limits, type validation │
│ │
│ LAYER 5: DATA SECURITY │
│ ──────────────────── │
│ - RDS: Encryption at rest (AES-256, AWS KMS CMK) │
│ - RDS: Encryption in transit (TLS 1.2+) │
│ - S3: Default encryption (SSE-S3 or SSE-KMS) │
│ - Redis: TLS in transit, no AUTH token exposure │
│ - Face embeddings: stored as vectors, not raw images (privacy) │
│ - Backup encryption: separate KMS key for backups │
│ │
│ LAYER 6: NETWORK SEGMENTATION │
│ ─────────────────────────── │
│ - VPC private subnets for all workloads │
│ - Security groups: least privilege, explicit allow only │
│ - Network Policies: namespace-level isolation in K8s │
│ - DVR: NO public IP, NO internet gateway, local network only │
│ - VPN: Single controlled entry point │
│ │
│ LAYER 7: AUDIT & MONITORING │
│ ───────────────────────── │
│ - All API calls logged with user, IP, timestamp, resource │
│ - PostgreSQL audit_log table (append-only) │
│ - CloudTrail for AWS API calls │
│ - VPC Flow Logs for network analysis │
│ - Alert on abnormal patterns (unusual login times, geo anomalies) │
│ - Log retention: 1 year in S3 Glacier │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
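The Layer 3 token policy (15-minute access tokens, role claims) can be sketched in code. This is an illustrative HS256 implementation using only the Python standard library, since RS256 signing needs an external crypto library; the production design specified above uses RS256 with asymmetric keys, and the claim names here (`sub`, `role`) are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(secret: bytes, sub: str, role: str, ttl_s: int = 900) -> str:
    # ttl_s=900 matches the 15-minute access-token lifetime from Layer 3
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    claims = {"sub": sub, "role": role, "iat": now, "exp": now + ttl_s}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

def verify_token(secret: bytes, token: str) -> dict:
    signing_input, _, sig_b64 = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    # Constant-time comparison to avoid signature timing oracles
    if not hmac.compare_digest(b64url(expected), sig_b64):
        raise ValueError("bad signature")
    claims_b64 = signing_input.split(".")[1]
    claims = json.loads(base64.urlsafe_b64decode(claims_b64 + "=" * (-len(claims_b64) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

The refresh token (7 days, httpOnly cookie) would follow the same structure with a longer `exp` and a separate signing key.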
9.2 Secret Management
# Kubernetes External Secrets (AWS Secrets Manager integration)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: surveillance
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: db-credentials
    creationPolicy: Owner
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: surveillance/production/db
        property: password
    - secretKey: DB_USER
      remoteRef:
        key: surveillance/production/db
        property: username
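A workload then consumes the synced Secret as environment variables. The fragment below is a sketch (selector, labels, and image fields elided); the Deployment name `backend-api` is taken from the resource summary in Appendix D, and the container name is an assumption.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
  namespace: surveillance
spec:
  template:
    spec:
      containers:
        - name: api
          envFrom:
            - secretRef:
                name: db-credentials   # Secret created by the ExternalSecret above
```

Because `refreshInterval: 1h` only updates the Secret object, pods pick up rotated credentials on restart (or via a reloader sidecar/annotation, if one is deployed).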
10. Monitoring & Observability
10.1 Monitoring Stack
| Component | Technology | Purpose |
| --- | --- | --- |
| Metrics | Prometheus + Thanos | Time-series collection, long-term storage |
| Visualization | Grafana | Dashboards for all services |
| Logs | Loki + Promtail | Log aggregation, indexed by labels |
| Traces | Jaeger | Distributed request tracing |
| Alerts | Alertmanager + PagerDuty | Multi-channel alerting |
| Uptime | UptimeRobot (external) | External endpoint monitoring |
10.2 Key Metrics
┌─────────────────────────────────────────────────────────────────────────────┐
│ KEY METRICS DASHBOARD │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STREAM HEALTH │
│ ──────────── │
│ - stream_active{camera_id} Gauge: 0/1 │
│ - stream_fps{camera_id} Gauge: actual FPS │
│ - stream_bitrate{camera_id} Gauge: kbps │
│ - stream_reconnect_total{camera_id} Counter: reconnect events │
│ - stream_latency_seconds{camera_id} Histogram: end-to-end latency │
│ │
│ AI INFERENCE │
│ ──────────── │
│ - ai_inference_duration_seconds Histogram: per-model latency │
│ - ai_detection_total{model,class} Counter: detections by class │
│ - ai_gpu_utilization_percent Gauge: GPU usage │
│ - ai_gpu_memory_used_bytes Gauge: VRAM usage │
│ - ai_batch_size_current Gauge: current batch size │
│ - ai_queue_depth Gauge: pending inference requests │
│ │
│ EVENTS & ALERTS │
│ ─────────────── │
│ - events_total{type,severity} Counter: events processed │
│ - alerts_active{severity} Gauge: unacknowledged alerts │
│ - alert_ack_duration_seconds Histogram: time to acknowledge │
│ - false_positive_rate Gauge: FP ratio (training feedback) │
│ │
│ SYSTEM │
│ ────── │
│ - edge_disk_free_bytes Gauge: local storage remaining │
│ - edge_memory_usage_percent Gauge: RAM usage │
│ - vpn_latency_ms Gauge: tunnel round-trip time │
│ - kafka_consumer_lag{topic,group} Gauge: message backlog │
│ - db_connection_pool_active Gauge: DB connections in use │
│ - api_request_duration_seconds Histogram: API response time │
│ - api_requests_total{status,path} Counter: HTTP status distribution │
│ │
│ BUSINESS │
│ ───────── │
│ - cameras_online_total Gauge: healthy camera count │
│ - daily_events_total Counter: events per day │
│ - alert_response_time_avg Gauge: avg ack time (SLA: <5min) │
│ - storage_cost_daily_usd Gauge: estimated daily cost │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
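Services expose these metrics in the Prometheus text exposition format. The sketch below renders the stream-health gauges using only the standard library to show the label/value layout; in practice a service would use the official `prometheus_client` package rather than hand-formatting.

```python
def render_stream_metrics(streams: dict) -> str:
    """Render stream-health gauges in Prometheus text exposition format.

    streams: {camera_id: {"active": bool, "fps": float}}
    Metric names match the KEY METRICS DASHBOARD list above.
    """
    lines = [
        "# TYPE stream_active gauge",
        "# TYPE stream_fps gauge",
    ]
    for cam, s in sorted(streams.items()):
        # One sample per camera, labelled with camera_id
        lines.append(f'stream_active{{camera_id="{cam}"}} {1 if s["active"] else 0}')
        lines.append(f'stream_fps{{camera_id="{cam}"}} {s["fps"]}')
    return "\n".join(lines) + "\n"

print(render_stream_metrics({"ch1": {"active": True, "fps": 14.8}}))
```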
10.3 Alerting Rules
# Prometheus alerting rules
# alerts.yml
groups:
  - name: surveillance-critical
    rules:
      - alert: CameraStreamDown
        expr: stream_active == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camera {{ $labels.camera_id }} stream is down"
      - alert: EdgeGatewayOffline
        expr: time() - vpn_last_heartbeat > 120
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Edge gateway {{ $labels.edge_id }} is offline"
      - alert: AIInferenceHighLatency
        # histogram_quantile operates on the _bucket series; 0.5 s = 500 ms
        expr: histogram_quantile(0.99, sum(rate(ai_inference_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI inference P99 latency is {{ $value }}s"
      - alert: DiskSpaceLow
        expr: edge_disk_free_bytes / edge_disk_total_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Edge disk usage is above 90%"
      - alert: UnacknowledgedCriticalAlerts
        expr: alerts_active{severity="critical"} > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} critical alerts unacknowledged for >15 minutes"
11. Cost Estimation
11.1 Monthly Cost Breakdown (8 cameras)
| Service | Instance/Type | Monthly Cost (USD) |
| --- | --- | --- |
| EKS Control Plane | Managed | $73 |
| EKS Worker Nodes (on-demand) | 3× t3.large (API, services) | $200 |
| EKS GPU Nodes | 1× g4dn.xlarge (spot when possible) | $350 |
| RDS PostgreSQL | db.r6g.xlarge Multi-AZ | $520 |
| ElastiCache Redis | cache.r6g.large (2 shards) | $260 |
| MSK Kafka | 3× kafka.m5.large | $350 |
| ALB + Data Transfer | ~500 GB/month | $50 |
| S3 Storage | ~10 TB (tiered) | $200 |
| CloudFront CDN | ~200 GB/month | $30 |
| EC2 VPN Endpoint | t3.micro | $15 |
| Edge Hardware | Intel NUC (amortized 3yr) | ~$40 |
| Internet (site) | Business broadband | $50 |
| TOTAL | | ~$2,138/month |
11.2 Cost Optimization Strategies
- Spot Instances: GPU nodes and batch processing on spot (70% savings)
- Reserved Instances: RDS and ElastiCache 1-year reserved (40% savings)
- S3 Lifecycle: Automatic tiering to IA and Glacier
- Right-sizing: Monitor actual usage, adjust requests/limits
- Edge AI: Pre-filter on Jetson Orin to reduce cloud bandwidth (future)
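A back-of-envelope check of the first two strategies against the 11.1 baseline. The percentages are the quoted targets, not guaranteed AWS pricing, and spot interruption risk is ignored.

```python
# Baseline from section 11.1 (~$2,138/month for 8 cameras)
baseline = 2138

savings = (
    350 * 0.70    # GPU node (g4dn.xlarge) moved to spot: 70% savings
    + 520 * 0.40  # RDS 1-year reserved: 40% savings
    + 260 * 0.40  # ElastiCache 1-year reserved: 40% savings
)
optimized = baseline - savings
print(round(savings), round(optimized))  # 557 1581
```

So spot plus reserved capacity alone would bring the monthly bill from ~$2,138 down to roughly $1,580, before S3 lifecycle tiering or right-sizing.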
12. Implementation Phases
Phase 1: Foundation (Weeks 1-4)
Phase 2: Core AI (Weeks 5-8)
Phase 3: Application (Weeks 9-12)
Phase 4: Operations (Weeks 13-16)
13. Appendices
Appendix A: DVR RTSP URL Format
# CP PLUS ORANGE Series RTSP URL format
rtsp://<username>:<password>@<dvr_ip>:554/user=<username>&password=<password>&channel=<1-8>&stream=<0|1>.sdp?
# stream=0: Main stream (higher quality)
# stream=1: Sub stream (lower quality)
# Example:
rtsp://admin:password@192.168.29.200:554/user=admin&password=password&channel=1&stream=0.sdp?
# FFmpeg test command (-strftime 1 is required for the timestamped filename pattern):
ffmpeg -rtsp_transport tcp \
  -i "rtsp://192.168.29.200:554/user=admin&password=&channel=1&stream=0.sdp?" \
  -c copy -f segment -segment_time 10 -reset_timestamps 1 -strftime 1 \
  /recordings/ch1/%Y%m%d_%H%M%S.mkv
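The URL format above can be built programmatically. A minimal sketch (the function name and parameter names are illustrative, not part of any CP PLUS SDK):

```python
from urllib.parse import quote

def cp_plus_rtsp_url(dvr_ip: str, username: str, password: str,
                     channel: int, sub_stream: bool = False,
                     port: int = 554) -> str:
    """Build a CP PLUS ORANGE Series RTSP URL per the Appendix A format.

    channel: 1-8 on the CP-UVR-0801E1-CV2; stream=0 main, stream=1 sub.
    """
    if not 1 <= channel <= 64:
        raise ValueError("channel out of range")
    stream = 1 if sub_stream else 0
    # Percent-encode credentials so special characters survive both the
    # userinfo part and the query-style path this DVR expects
    user, pw = quote(username, safe=""), quote(password, safe="")
    return (f"rtsp://{user}:{pw}@{dvr_ip}:{port}/"
            f"user={user}&password={pw}&channel={channel}&stream={stream}.sdp?")

print(cp_plus_rtsp_url("192.168.29.200", "admin", "password", 1))
```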
Appendix B: WireGuard Full Configuration
# === CLOUD SERVER (AWS EC2) ===
# /etc/wireguard/wg0-cloud.conf
[Interface]
PrivateKey = <cloud-private-key>
Address = 10.200.0.1/24
ListenPort = 51820
# Port blocks must precede the general wg0 accept, or they never match;
# PostDown mirrors every PostUp rule so repeated up/down stays clean
PostUp = iptables -A FORWARD -i wg0 -p tcp --dport 5432 -j DROP; \
         iptables -A FORWARD -i wg0 -p tcp --dport 6379 -j DROP; \
         iptables -A FORWARD -i wg0 -j ACCEPT; \
         iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -p tcp --dport 5432 -j DROP; \
           iptables -D FORWARD -i wg0 -p tcp --dport 6379 -j DROP; \
           iptables -D FORWARD -i wg0 -j ACCEPT; \
           iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
DNS = 10.100.0.2
[Peer]
# Edge Gateway - Site 1
PublicKey = <edge1-public-key>
PresharedKey = <preshared-key-1>
AllowedIPs = 10.200.0.2/32, 192.168.29.0/24
PersistentKeepalive = 25
# === EDGE GATEWAY (Intel NUC) ===
# /etc/wireguard/wg0-edge.conf
[Interface]
PrivateKey = <edge-private-key>
Address = 10.200.0.2/32
DNS = 10.100.0.2
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; \
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; \
iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
[Peer]
# Cloud Server
PublicKey = <cloud-public-key>
PresharedKey = <preshared-key-1>
AllowedIPs = 10.100.0.0/16, 10.200.0.0/24
Endpoint = <cloud-public-ip>:51820
PersistentKeepalive = 25
Appendix C: FFmpeg Stream Processing Command
#!/bin/bash
# Edge Gateway stream processing pipeline
set -euo pipefail

if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <camera_id> <channel>" >&2
    exit 1
fi

CAMERA_ID=$1
CHANNEL=$2
DVR_IP="192.168.29.200"
DVR_USER="admin"
DVR_PASS=""
RTSP_URL="rtsp://${DVR_IP}:554/user=${DVR_USER}&password=${DVR_PASS}&channel=${CHANNEL}&stream=0.sdp?"
RECORDING_DIR="/var/recordings/${CAMERA_ID}"
AI_PIPE="/tmp/ai_pipe_${CAMERA_ID}"
mkdir -p "$RECORDING_DIR"
[ -p "$AI_PIPE" ] || mkfifo "$AI_PIPE"
# Pipeline 1: Recording (10s segments)
ffmpeg -hide_banner -loglevel warning \
-rtsp_transport tcp \
-i "$RTSP_URL" \
-c copy -f segment \
-segment_time 10 \
-segment_format mp4 \
-reset_timestamps 1 \
-strftime 1 \
"${RECORDING_DIR}/%Y%m%d_%H%M%S.mp4" \
2>> /var/log/ffmpeg-${CAMERA_ID}.log &
# Pipeline 2: AI frame extraction (1 fps)
ffmpeg -hide_banner -loglevel warning \
-rtsp_transport tcp \
-i "$RTSP_URL" \
-vf "fps=1,scale=640:640" \
-f image2pipe \
-vcodec mjpeg \
-q:v 5 \
"$AI_PIPE" \
2>> /var/log/ffmpeg-ai-${CAMERA_ID}.log &
# Pipeline 3: Frame batching and gRPC send to cloud
frame-batcher \
--input "$AI_PIPE" \
--camera-id "$CAMERA_ID" \
--batch-size 8 \
--cloud-endpoint "10.200.0.1:8081" \
--vpn-interface wg0 \
>> /var/log/batcher-${CAMERA_ID}.log 2>&1 &
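The `frame-batcher` binary in Pipeline 3 is a custom component; its input side has to split the raw MJPEG stream from the FIFO into complete JPEG frames before batching. A hypothetical sketch of that framing step (simplified: it assumes the `FF D9` end marker does not occur inside entropy-coded data, which MJPEG from FFmpeg generally satisfies):

```python
def split_mjpeg(buf: bytes):
    """Split a raw MJPEG byte stream into complete JPEG frames.

    Frames start with FF D8 and end with FF D9. Returns (frames, remainder),
    where remainder is a trailing partial frame to prepend to the next read.
    """
    frames = []
    while True:
        start = buf.find(b"\xff\xd8")
        if start < 0:
            return frames, b""          # no frame start: discard garbage
        end = buf.find(b"\xff\xd9", start + 2)
        if end < 0:
            return frames, buf[start:]  # incomplete frame: keep for next read
        frames.append(buf[start:end + 2])
        buf = buf[end + 2:]
```

Complete frames would then be accumulated into groups of `--batch-size 8` and shipped over gRPC through the wg0 tunnel.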
Appendix D: Kubernetes Resource Summary
# Complete resource manifest summary
# Namespaces:
# - surveillance: Main application
# - surveillance-data: Database, cache, storage
# - surveillance-monitoring: Prometheus, Grafana
# - surveillance-ops: CI/CD, backup jobs
# Deployments (always running):
# - stream-ingestion: 3-20 replicas, HPA
# - ai-inference: 1-4 replicas (GPU), HPA
# - suspicious-activity: 2-8 replicas, HPA
# - backend-api: 3-10 replicas, HPA
# - video-playback: 2-4 replicas
# - notification-service: 2-5 replicas, HPA
# - web-frontend: 3 replicas (static, CDN-cached)
# - traefik: 2 replicas (DaemonSet preferred)
# StatefulSets:
# - minio: 4 replicas (distributed mode)
# CronJobs:
# - training-service: Weekly (Sundays 02:00)
# - db-backup: Daily (02:00)
# - storage-cleanup: Daily (03:00)
# - partition-maintenance: Monthly (1st, 03:00)
# External Services (AWS managed):
# - PostgreSQL: RDS db.r6g.xlarge Multi-AZ
# - Redis: ElastiCache cluster mode
# - Kafka: MSK 3 brokers
# - ALB: Internet-facing, WAF attached
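Most of the Deployments above scale via HorizontalPodAutoscalers. A sketch for `backend-api` (3-10 replicas per the summary); the CPU utilization target is an assumption to be tuned from Grafana data:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-api
  namespace: surveillance
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target, not specified in this document
```

The GPU-backed `ai-inference` Deployment would instead scale on a custom metric such as `ai_queue_depth`, since CPU utilization is a poor proxy for GPU saturation.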
Document History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2025-01-20 | Solution Architect | Initial complete architecture |
End of Document