# AI-Powered Industrial Surveillance Platform — System Architecture

## Document Information
- **Version**: 1.0
- **Date**: 2025-01-20
- **Status**: Production Architecture Design
- **Target Platform**: CP PLUS ORANGE Series DVR (CP-UVR-0801E1-CV2)
- **Camera Count**: 8 channels (scalable to 64+)
- **Resolution**: 960x1080 per channel

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Deployment Topology](#2-deployment-topology)
3. [Network Security Zones](#3-network-security-zones)
4. [Service Architecture](#4-service-architecture)
5. [Data Flow Design](#5-data-flow-design)
6. [Technology Stack](#6-technology-stack)
7. [Scaling Strategy](#7-scaling-strategy)
8. [Failover & Reliability](#8-failover--reliability)
9. [Security Architecture](#9-security-architecture)
10. [Monitoring & Observability](#10-monitoring--observability)
11. [Cost Estimation](#11-cost-estimation)
12. [Implementation Phases](#12-implementation-phases)
13. [Appendices](#13-appendices)

---

## 1. Executive Summary

This document presents the complete system architecture for an AI-powered industrial surveillance platform designed to process 8 camera channels (expandable to 64+) from a CP PLUS ORANGE Series DVR. The architecture follows a **cloud+edge hybrid pattern** where compute-intensive AI inference runs in the cloud while a local edge gateway handles stream ingestion and site-local concerns. All traffic between the edge site and the cloud travels inside a WireGuard VPN tunnel, and the DVR is reachable only from the local network segment: it has **zero public internet exposure**.

### Key Architectural Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Cloud Provider | AWS (ap-south-1 primary / ap-southeast-1 DR) | Broadest IoT/edge tooling, VPC peering, lowest latency to the India site |
| Container Orchestration | Amazon EKS (Kubernetes) | Managed control plane, auto-scaling, GPU node support for AI inference |
| VPN Solution | WireGuard | Typically higher throughput and lower overhead than OpenVPN, modern crypto (ChaCha20-Poly1305), simple setup, NAT traversal |
| Message Queue | Apache Kafka (MSK) | Durable, ordered event log, replay capability, proven at scale |
| Stream Processing | Apache Flink on EKS | Stateful stream processing, exactly-once semantics, windowed operations |
| Reverse Proxy | Traefik (in-cluster) + AWS ALB (ingress) | Native Kubernetes integration, automatic cert management |
| AI Framework | NVIDIA Triton Inference Server + YOLOv8 | GPU-optimized inference, model ensemble, dynamic batching |
| Object Storage | MinIO (on-premises) + AWS S3 (cold archive) | S3-compatible API, local buffering, cost-tiered archival |
| Database | PostgreSQL 16 (RDS) + pgvector extension | Relational integrity for events, native vector support for face embeddings |
| Cache/Queue | Redis 7 Cluster (ElastiCache) | Sub-ms latency; Streams for live video chunks, Pub/Sub for real-time broadcasts |
| Edge Hardware | Intel NUC 13 Pro i7 / NVIDIA Jetson Orin NX | x86 preferred for flexibility; Jetson alternative for GPU-at-edge |

---

## 2. Deployment Topology

### 2.1 High-Level Topology Diagram

```
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    CLOUD (AWS VPC)                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────────────┐    │
│  │                         KUBERNETES CLUSTER (EKS)                                   │    │
│  │                                                                                    │    │
│  │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐  │    │
│  │   │  API GW     │   │  Stream     │   │   AI Inf.   │   │   Suspicious Act.   │  │    │
│  │   │  (Traefik)  │   │  Ingestion  │   │  Service    │   │   Service           │  │    │
│  │   │  :8443      │   │  Service    │   │  (Triton)   │   │   (Night Mode)      │  │    │
│  │   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   └─────────────────────┘  │    │
│  │          │                 │                 │                                    │    │
│  │   ┌──────┴──────┐   ┌──────┴──────┐   ┌──────┴──────┐   ┌─────────────────────┐  │    │
│  │   │  Web App    │   │  Training   │   │ Notification│   │   Video Playback    │  │    │
│  │   │  (Next.js)  │   │  Service    │   │  Service    │   │   Service (HLS)     │  │    │
│  │   └─────────────┘   └─────────────┘   └─────────────┘   └─────────────────────┘  │    │
│  │                                                                                    │    │
│  │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐  │    │
│  │   │  PostgreSQL │   │    Redis    │   │   Kafka     │   │      MinIO          │  │    │
│  │   │  (RDS)      │   │  Cluster    │   │   (MSK)     │   │   (S3-compatible)   │  │    │
│  │   │  :5432      │   │  :6379      │   │  :9092      │   │   :9000             │  │    │
│  │   └─────────────┘   └─────────────┘   └─────────────┘   └─────────────────────┘  │    │
│  │                                                                                    │    │
│  │   ┌─────────────────────────────────────────────────────────────────────────┐      │    │
│  │   │              AWS APPLICATION LOAD BALANCER (:443)                       │      │    │
│  │   │         SSL termination, WAF, rate limiting, geo-restriction            │      │    │
│  │   └─────────────────────────────────────────────────────────────────────────┘      │    │
│  └─────────────────────────────────────────────────────────────────────────────────────┘    │
│           ▲                                                                                 │
│           │ WireGuard VPN Tunnel (UDP 51820)                                                │
│           │ Site-to-Site encrypted tunnel                                                   │
│           │ Cloud peer: 10.200.0.1/32  ←→  Edge peer: 10.200.0.2/32                       │
└───────────┼─────────────────────────────────────────────────────────────────────────────────┘
            │
┌───────────┴─────────────────────────────────────────────────────────────────────────────────┐
│                                 EDGE SITE (Local Network)                                   │
│                                                                                             │
│   ┌─────────────────────────────────┐          ┌─────────────────────────────────────────┐  │
│   │      EDGE GATEWAY               │          │         LOCAL NETWORK                   │  │
│   │   (Intel NUC / Jetson Orin)     │          │     (192.168.29.0/24)                   │  │
│   │   OS: Ubuntu 22.04 LTS          │          │                                         │  │
│   │   WireGuard endpoint            │          │   ┌─────────────────────────────────┐   │  │
│   │   K3s lightweight cluster       │          │   │   CP PLUS DVR                   │   │  │
│   │                                 │          │   │   CP-UVR-0801E1-CV2             │   │  │
│   │   ┌───────────────────────┐     │◄────────►│   │   LAN: 192.168.29.200           │   │  │
│   │   │  Edge Gateway Agent   │     │  :554    │   │   RTSP: 554, HTTP: 80/443       │   │  │
│   │   │  - Stream puller      │     │  :80     │   │   TCP: 25001, UDP: 25002        │   │  │
│   │   │  - Buffer/forward     │     │          │   │   8 Channels × 960×1080         │   │  │
│   │   │  - Local recording    │     │          │   └─────────────────────────────────┘   │  │
│   │   │  - VPN client         │     │          │                                         │  │
│   │   └───────────────────────┘     │          │   ┌─────────────────────────────────┐   │  │
│   │                                 │          │   │   Local Monitor (optional)      │   │  │
│   │   Local Storage: 2TB NVMe       │          │   │   192.168.29.10                 │   │  │
│   │   (7-day circular buffer)       │          │   └─────────────────────────────────┘   │  │
│   └─────────────────────────────────┘          │                                         │  │
│                                                │   CAMERAS (BNC/IP) ──┐                  │  │
│                                                │                      │                  │  │
│                                                │   CH1 ──► CH2 ──► CH3 ──► CH4           │  │
│                                                │   CH5 ──► CH6 ──► CH7 ──► CH8           │  │
│                                                │                                        │  │
│                                                └────────────────────────────────────────┘  │
│                                                                                             │
│   Network: Edge Gateway has TWO interfaces:                                                 │
│   - eth0: 192.168.29.5/24  ←→ Local network (DVR access)                                  │
│   - eth1: DHCP / Static    ←→ Internet (VPN tunnel to cloud)                                │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Physical Edge Gateway Specification

| Component | Specification |
|-----------|--------------|
| Hardware | Intel NUC 13 Pro, Core i7-1360P, 32GB DDR4, 2TB NVMe SSD |
| Alternative | NVIDIA Jetson Orin NX 16GB (for on-edge AI inference) |
| OS | Ubuntu 22.04 LTS Server, minimal install |
| Container Runtime | containerd (via K3s) |
| K8s Distribution | K3s v1.28+ (lightweight; single-node, or HA, where embedded etcd needs 3 server nodes and a 2-node setup needs an external datastore) |
| Power | UPS-backed, auto-restart on power loss (BIOS setting) |
| Network | Dual Ethernet: one for local DVR segment, one for internet/VPN |
| Local Storage | 2TB NVMe circular buffer for all 8 streams (7-day target; see the sizing in Section 4.2.1) |
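
The buffer target can be sanity-checked with a few lines of arithmetic. The sketch below uses the ~1.5 GB/hour per channel figure from Section 4.2.1; the `reserved_gb` allowance for the OS, container images, and filesystem overhead is an assumed value, not a measured one.

```python
def daily_volume_gb(gb_per_hour_per_channel: float, channels: int) -> float:
    """Recording volume per day across all channels, in GB."""
    return gb_per_hour_per_channel * 24 * channels


def buffer_days(capacity_gb: float, gb_per_hour_per_channel: float,
                channels: int, reserved_gb: float = 0.0) -> float:
    """Days of continuous recording that fit in the circular buffer."""
    usable_gb = capacity_gb - reserved_gb
    return usable_gb / daily_volume_gb(gb_per_hour_per_channel, channels)


daily = daily_volume_gb(1.5, 8)                       # 8 channels at 1.5 GB/hour
raw_days = buffer_days(2000, 1.5, 8)                  # whole 2 TB drive
usable_days = buffer_days(2000, 1.5, 8, reserved_gb=100)  # assumed 100 GB reserved

print(f"{daily:.0f} GB/day, {raw_days:.2f} days raw, {usable_days:.2f} days usable")
```

Under these assumptions a 2TB drive holds slightly less than 7 days, so the 7-day target depends on either a lower recording bitrate or a larger drive.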

### 2.3 Cloud Infrastructure Specification

| Component | Specification |
|-----------|--------------|
| Region | Primary: ap-south-1 (Mumbai), DR: ap-southeast-1 (Singapore) |
| VPC | 10.100.0.0/16, 3 AZs, private subnets only for workloads |
| EKS | Managed node groups: `on-demand` for API, `spot` for batch processing |
| GPU Nodes | g4dn.xlarge (NVIDIA T4) for Triton inference, 1-4 nodes auto-scaled |
| ALB | Internet-facing, WAF v2 attached, Shield Advanced optional |
| RDS | PostgreSQL 16, db.r6g.xlarge, Multi-AZ, encrypted at rest |
| ElastiCache | Redis 7, cluster mode enabled, 2 shards × 2 replicas |
| MSK (Kafka) | 3 broker nodes, kafka.m5.large, 3 AZs |
| S3 | Standard (hot), IA (30 days), Glacier Deep Archive (1 year) |
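
Site-to-site routing over the tunnel only works cleanly if the address ranges used in Sections 2.1 and 2.3 do not overlap. A minimal sketch of that check with Python's standard `ipaddress` module follows; the dictionary keys are illustrative labels, while the CIDRs are the ones stated in this document.

```python
import ipaddress
from itertools import combinations

# CIDRs taken from the topology diagram and the VPC specification.
ADDRESS_PLAN = {
    "edge-lan": ipaddress.ip_network("192.168.29.0/24"),
    "cloud-vpc": ipaddress.ip_network("10.100.0.0/16"),
    "wg-cloud-peer": ipaddress.ip_network("10.200.0.1/32"),
    "wg-edge-peer": ipaddress.ip_network("10.200.0.2/32"),
}


def find_overlaps(plan):
    """Return every pair of labels whose networks share addresses."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(plan.items(), 2)
        if net_a.overlaps(net_b)
    ]


overlaps = find_overlaps(ADDRESS_PLAN)
dvr = ipaddress.ip_address("192.168.29.200")
gateway = ipaddress.ip_address("192.168.29.5")

print("overlaps:", overlaps)
print("DVR on edge LAN:", dvr in ADDRESS_PLAN["edge-lan"])
```

Running the same check again whenever a new site or subnet is added would catch conflicts before they reach the WireGuard `AllowedIPs` configuration.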

---

## 3. Network Security Zones

### 3.1 Security Zone Diagram

```
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    SECURITY ZONES                                           │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │                        ZONE 0: INTERNET (UNTRUSTED)                             │       │
│  │                                                                                 │       │
│  │   Users/Browsers  ──►  AWS ALB (:443)  ──►  WAF  ──►  Rate Limit  ──►  Geo-Block │    │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                          │                                                  │
│                                          ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │                    ZONE 1: AWS VPC EDGE (DEMILITARIZED)                         │       │
│  │                                                                                 │       │
│  │   ALB ──► Traefik Ingress ──► Public API endpoints only                         │       │
│  │   Auth: JWT + RBAC, API key for edge gateway                                    │       │
│  │                                                                                 │       │
│  │   AWS ALB Security Group: Allow 443 from 0.0.0.0/0                             │       │
│  │   Traefik SG: Allow 8443 from ALB-SG only                                       │       │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                          │                                                  │
│                                          ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │              ZONE 2: AWS VPC APPLICATION (TRUSTED, ISOLATED)                    │       │
│  │                                                                                 │       │
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐   │       │
│  │   │ Stream Ing. │  │ AI Inference│  │ Suspicious  │  │  Training Service   │   │       │
│  │   │  Service    │  │  Service    │  │  Activity   │  │                     │   │       │
│  │   │             │  │             │  │  Service    │  │                     │   │       │
│  │   └─────────────┘  └─────────────┘  └─────────────┘  └─────────────────────┘   │       │
│  │                                                                                 │       │
│  │   Pod Security Standards: No root, read-only FS, no privilege escalation       │       │
│  │   Network Policies: Ingress only from API GW namespace, egress to data layer   │       │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                          │                                                  │
│                                          ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │                ZONE 3: AWS VPC DATA (HIGHLY RESTRICTED)                         │       │
│  │                                                                                 │       │
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐   │       │
│  │   │ PostgreSQL  │  │    Redis    │  │    Kafka    │  │       MinIO         │   │       │
│  │   │   (RDS)     │  │  (ElastiC.) │  │    (MSK)    │  │    (S3 API)         │   │       │
│  │   │   :5432     │  │   :6379     │  │   :9092     │  │    :9000            │   │       │
│  │   └─────────────┘  └─────────────┘  └─────────────┘  └─────────────────────┘   │       │
│  │                                                                                 │       │
│  │   Security Groups: Allow connections ONLY from Application Zone SGs              │       │
│  │   RDS: Encrypted (AWS KMS), no public access, IAM auth enabled                  │       │
│  │   S3: Bucket policy deny all except VPC endpoint, versioning enabled             │       │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                          │                                                  │
│                    WireGuard VPN Tunnel  │  (UDP 51820, ChaCha20-Poly1305)                │
│                                          │                                                  │
│                                          ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐       │
│  │              ZONE 4: EDGE NETWORK (PHYSICALLY ISOLATED)                         │       │
│  │                                                                                 │       │
│  │   ┌──────────────────────────┐          ┌─────────────────────────────────┐     │       │
│  │   │   EDGE GATEWAY AGENT     │          │     DVR (192.168.29.200)        │     │       │
│  │   │   - K3s node              │◄────────►│     NO INTERNET ACCESS          │     │       │
│  │   │   - WireGuard peer        │   :554   │     Firewall: DROP all non-local│     │       │
│  │   │   - Stream ingestion      │   :80    │                                 │     │       │
│  │   │   - Local buffer          │          │     Only 192.168.29.0/24 allowed│     │       │
│  │   └──────────────────────────┘          └─────────────────────────────────┘     │       │
│  │                                                                                 │       │
│  │   Edge Gateway Firewall (ufw):                                                  │       │
│  │   - ALLOW 192.168.29.0/24 → DVR ports (554, 80)                               │       │
│  │   - ALLOW OUT 51820/udp → Cloud VPN endpoint                                  │       │
│  │   - DENY ALL other incoming                                                   │       │
│  │   - No forwarding to local network from VPN (except explicit rules)            │       │
│  │                                                                                 │       │
│  └─────────────────────────────────────────────────────────────────────────────────┘       │
│                                                                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
```

### 3.2 Firewall Rules

#### Edge Gateway (UFW)

```bash
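# Default policy: drop unsolicited inbound traffic, allow outbound
ufw default deny incoming
ufw default allow outgoing

# Gateway -> DVR. The gateway is the client, so these are outbound rules;
# replies are admitted by connection tracking. They are stated explicitly
# so they keep working if the egress default is later tightened.
ufw allow out on eth0 to 192.168.29.200 port 554 proto tcp   # RTSP
ufw allow out on eth0 to 192.168.29.200 port 80 proto tcp    # HTTP (ONVIF)

# WireGuard VPN to cloud
ufw allow out on eth1 to <cloud-vpn-ip> port 51820 proto udp

# Local admin access (optional, from specific admin IP)
ufw allow in on eth0 from 192.168.29.10 to any port 22 proto tcp   # SSH from admin workstation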
```

#### AWS Security Groups

| Security Group | Ingress Rules | Egress Rules |
|----------------|--------------|--------------|
| `alb-public-sg` | TCP 443 from 0.0.0.0/0 | All to VPC |
| `traefik-ingress-sg` | TCP 8443 from alb-public-sg only | All to VPC |
| `app-services-sg` | TCP 3000, 8080-8090 from traefik-ingress-sg; TCP 8000-8002, 50051 from app-services-sg (service-to-service) | All to data-layer-sg and app-services-sg |
| `data-layer-sg` | TCP 5432, 6379, 9092, 9000 from app-services-sg only | None |
| `vpn-endpoint-sg` | UDP 51820 from edge-gateway-ip/32 | All to VPC |
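
The intent of the table is a strict chain: internet to ALB, ALB to Traefik, Traefik to application services, application services to the data layer. The sketch below models a subset of those ingress rules as plain data and checks which hops are permitted. The `Rule` type and `allowed` helper are illustrative names, not part of any AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str   # source security group name, or "0.0.0.0/0"
    ports: range  # TCP ports admitted from that source


DATA_PORTS = (5432, 6379, 9092, 9000)

# Ingress rules keyed by destination security group (subset of the table).
INGRESS = {
    "alb-public-sg": [Rule("0.0.0.0/0", range(443, 444))],
    "traefik-ingress-sg": [Rule("alb-public-sg", range(8443, 8444))],
    "app-services-sg": [Rule("traefik-ingress-sg", range(8080, 8091))],
    "data-layer-sg": [Rule("app-services-sg", range(p, p + 1)) for p in DATA_PORTS],
}


def allowed(source: str, dest: str, port: int) -> bool:
    """True if some ingress rule on dest admits this source and port."""
    return any(
        rule.source == source and port in rule.ports
        for rule in INGRESS.get(dest, [])
    )


print(allowed("alb-public-sg", "traefik-ingress-sg", 8443))  # hop in the chain
print(allowed("alb-public-sg", "data-layer-sg", 5432))       # skips two zones
```

A check like this can run in CI against the infrastructure definitions to confirm that no rule lets traffic skip a zone.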

---

## 4. Service Architecture

### 4.1 Service Interaction Diagram

```
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                      SERVICE ARCHITECTURE                                       │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                                │
│   ┌──────────────────────────────────────────────────────────────────────────────────────┐     │
│   │                              API GATEWAY LAYER                                      │     │
│   │                                                                                      │     │
│   │   ┌─────────────────────────────────────────────────────────────────────────────┐   │     │
│   │   │  Traefik Ingress Controller (K8s)                                           │   │     │
│   │   │  - Route: /api/* → Backend Service                                          │   │     │
│   │   │  - Route: /ws/*  → WebSocket Handler (live video)                          │   │     │
│   │   │  - Route: /     → Next.js Web App                                           │   │     │
│   │   │  - TLS: Let's Encrypt automatic certificates                                │   │     │
│   │   │  - Middleware: rate limit (100 req/min per IP), JWT validation, CORS       │   │     │
│   │   └─────────────────────────────────────────────────────────────────────────────┘   │     │
│   └──────────────────────────────────────────────────────────────────────────────────────┘     │
│                                           │                                                    │
│                    ┌──────────────────────┼──────────────────────┐                             │
│                    │                      │                      │                             │
│                    ▼                      ▼                      ▼                             │
│   ┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐             │
│   │   BACKEND SERVICE    │   │    WEB FRONTEND      │   │   VIDEO PLAYBACK     │             │
│   │   (Go/Gin)           │   │   (Next.js 14)       │   │   SERVICE (Go)       │             │
│   │   :8080              │   │   :3000              │   │   :8085 (HLS)        │             │
│   │                      │   │                      │   │                      │             │
│   │  ┌──────────────┐   │   │  ┌──────────────┐   │   │  ┌──────────────┐   │             │
│   │  │ REST API     │   │   │  │ React SSR    │   │   │  │ HLS Segment  │   │             │
│   │  │ - /cameras   │   │   │  │ - Dashboard  │   │   │  │ Server       │   │             │
│   │  │ - /events    │   │   │  │ - Live View  │   │   │  │ - /live/:id  │   │             │
│   │  │ - /alerts    │   │   │  │ - Timeline   │   │   │  │ - /vod/:id   │   │             │
│   │  │ - /search    │   │   │  │ - Analytics  │   │   │  │ (DASH/HLS)   │   │             │
│   │  │ - /training  │   │   │  │ - Admin      │   │   │  └──────────────┘   │             │
│   │  └──────────────┘   │   │  └──────────────┘   │   │                      │             │
│   │  ┌──────────────┐   │   └──────────────────────┘   └──────────────────────┘             │
│   │  │ gRPC Client  │   │                                                                    │
│   │  │ (to AI svc)  │   │                                                                    │
│   │  └──────────────┘   │                                                                    │
│   └─────────────────────┘                                                                    │
│            │                                                                                  │
│            │ gRPC (:50051)                                                                    │
│            ▼                                                                                  │
│   ┌──────────────────────────────────────────────────────────────────────────────────────┐     │
│   │                           EVENT & MESSAGE BUS                                       │     │
│   │                                                                                      │     │
│   │   ┌─────────────────────────────────────────────────────────────────────────────┐   │     │
│   │   │  Apache Kafka (MSK)                                                          │   │     │
│   │   │                                                                             │   │     │
│   │   │  Topics:                                                                    │   │     │
│   │   │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────────┐│   │     │
│   │   │  │ streams.raw     │  │ ai.detections   │  │ alerts.critical             ││   │     │
│   │   │  │ (protobuf)      │  │ (JSON)          │  │ (JSON)                      ││   │     │
│   │   │  │ - 8 partitions  │  │ - 16 partitions │  │ - 4 partitions              ││   │     │
│   │   │  │ - 7-day reten.  │  │ - 30-day reten. │  │ - 90-day reten.             ││   │     │
│   │   │  └─────────────────┘  └─────────────────┘  └─────────────────────────────┘│   │     │
│   │   │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────────┐│   │     │
│   │   │  │ training.data   │  │ system.metrics  │  │ notifications.email         ││   │     │
│   │   │  │ (protobuf)      │  │ (JSON)          │  │ notifications.sms           ││   │     │
│   │   │  │ - 30-day reten. │  │ - 7-day reten.  │  │ notifications.push          ││   │     │
│   │   │  └─────────────────┘  └─────────────────┘  └─────────────────────────────┘│   │     │
│   │   └─────────────────────────────────────────────────────────────────────────────┘   │     │
│   │                                                                                      │     │
│   │   ┌─────────────────────────────────────────────────────────────────────────────┐   │     │
│   │   │  Redis Cluster (ElastiCache)                                                 │   │     │
│   │   │                                                                             │   │     │
│   │   │  Streams:  ┌─────────────────┐  Pub/Sub:  ┌──────────────────────┐        │   │     │
│   │   │            │ live:cam:{id}   │            │ alert:broadcast      │        │   │     │
│   │   │            │ (video chunks)  │            │ ws:session:*         │        │   │     │
│   │   │            │ cache:api:*     │            │ stream:status        │        │   │     │
│   │   │            └─────────────────┘            └──────────────────────┘        │   │     │
│   │   └─────────────────────────────────────────────────────────────────────────────┘   │     │
│   └──────────────────────────────────────────────────────────────────────────────────────┘     │
│            │                        │                        │                                 │
│            ▼                        ▼                        ▼                                 │
│   ┌──────────────┐        ┌──────────────┐        ┌──────────────────────┐                   │
│   │ STREAM ING.  │        │ AI INFERENCE │        │ SUSPICIOUS ACTIVITY  │                   │
│   │ SERVICE      │        │ SERVICE      │        │ SERVICE              │                   │
│   │ (Go/FFmpeg)  │        │ (Python/gRPC)│        │ (Go/Python)          │                   │
│   │ :8081        │        │ :8001 (Triton)│       │ :8083                │                   │
│   │              │        │              │        │                      │                   │
│   │┌────────────┐│        │┌────────────┐│        │┌────────────────────┐│                   │
│   ││RTSP Client ││        ││Triton Svr  ││        ││Night Mode Analyzer ││                   │
│   ││(ffmpeg)    ││        ││├─YOLOv8-det││        ││├─Motion detection  ││                   │
│   ││8 concurrent││        ││├─YOLOv8-face││       ││├─Loitering detect. ││                   │
│   ││streams     ││        ││├─ArcFace    ││       ││├─Perimeter breach   ││                   │
│   │├────────────┤│        ││└────────────┘│       ││├─Abandoned object  ││                   │
│   ││Frame Extrac││        ││Model Mgmt.  ││       ││├─Crowd detection   ││                   │
│   ││1 fps anal. ││        ││└────────────┘│       ││└────────────────────┘│                   │
│   │├────────────┤│        │├─────────────┤│       │├────────────────────┤│                   │
│   ││Kafka Produc.││        ││gRPC API     ││       ││Kafka Consumer      ││                   │
│   ││(raw frames) ││       ││- detect()   ││       ││(ai.detections)     ││                   │
│   │└─────────────┘│        ││- embed()    ││       │├────────────────────┤│                   │
│   │┌────────────┐ │        ││- compare()  ││       ││Rule Engine         ││                   │
│   ││MinIO Client│ │        │└─────────────┘│       ││├─Time-based rules  ││                   │
│   ││(video seg.)│ │        └──────────────┘       ││├─Zone-based rules  ││                   │
│   │└────────────┘ │                                ││├─Severity scoring  ││                   │
│   └───────────────┘                                │└────────────────────┘│                   │
│            ▲                                       └──────────────────────┘                   │
│            │ WireGuard VPN                                                                     │
│            │                                                                                   │
│   ┌──────────────┐                                                                            │
│   │ EDGE GATEWAY │                                                                            │
│   │ SERVICE      │                                                                            │
│   │ (Local)      │                                                                            │
│   └──────────────┘                                                                            │
│                                                                                               │
│            │                                                                                   │
│            ▼                                                                                   │
│   ┌──────────────────────────────────────────────────────────────────────────────────────┐     │
│   │                           DATA LAYER                                                │     │
│   │                                                                                      │     │
│   │   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │     │
│   │   │ PostgreSQL 16│  │  pgvector    │  │    MinIO     │  │   S3 (Cold Archive)  │   │     │
│   │   │  (RDS)       │  │  Extension   │  │  (on-prem +  │  │                      │   │     │
│   │   │              │  │              │  │   cloud)     │  │                      │   │     │
│   │   │ cameras      │  │ face_embed.  │  │ video-seg/   │  │  yearly archive      │   │     │
│   │   │ events       │  │  table       │  │ training-img │  │  compliance storage  │   │     │
│   │   │ alerts       │  │  (vector)    │  │              │  │                      │   │     │
│   │   │ audit_log    │  │              │  │ lifecycle:   │  │  lifecycle:          │   │     │
│   │   │ users        │  │ HNSW index   │  │ 7d local →   │  │  90d → Glacier       │   │     │
│   │   │ zones        │  │ cos similarity│ │ 30d cloud →  │  │  Deep Archive        │   │     │
│   │   └──────────────┘  └──────────────┘  │ 1yr archive  │  │                      │   │     │
│   │                                         └──────────────┘  └──────────────────────┘   │     │
│   └──────────────────────────────────────────────────────────────────────────────────────┘     │
│                                                                                                │
└────────────────────────────────────────────────────────────────────────────────────────────────┘
```
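
The data layer in the diagram stores face embeddings in pgvector with an HNSW index and cosine similarity. As a rough illustration of what that comparison computes, the sketch below implements cosine similarity and a brute-force nearest match in plain Python. The gallery contents and identity names are made up for the example, and real embeddings would have 512 dimensions rather than 3. pgvector's `<=>` operator returns cosine distance, which is `1 - cosine_similarity`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form used for ordering: smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)


def best_match(probe, gallery):
    """Exhaustive search; an HNSW index approximates this result faster."""
    return max(
        ((name, cosine_similarity(probe, vec)) for name, vec in gallery.items()),
        key=lambda item: item[1],
    )


# Hypothetical 3-dimensional embeddings, for illustration only.
gallery = {"person-a": [1.0, 0.0, 0.0], "person-b": [0.0, 1.0, 0.0]}
name, score = best_match([0.9, 0.1, 0.0], gallery)
print(name, round(score, 4))
```

Any acceptance threshold on the score would have to be tuned on site data; this document does not specify one.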

### 4.2 Service Specifications

#### 4.2.1 Edge Gateway Service (Local)

| Attribute | Specification |
|-----------|--------------|
| Runtime | Go 1.21, compiled binary |
| Deployment | Systemd service on Ubuntu + K3s for containerized components |
| Location | Intel NUC, physically on-site |
| Responsibilities | RTSP stream pull, local recording buffer, VPN tunnel endpoint, heartbeat to cloud |
| Ports | 8080 (HTTP admin), 51820 (WireGuard), 1935 (RTMP relay if needed) |
| Stream Protocol | RTSP over TCP (interleaved) from DVR at 192.168.29.200:554 |
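| Local Storage | 2TB NVMe circular buffer, 7-day target; ~1.5GB/hour per channel = ~288GB/day for 8ch, so 7 days needs ~2.0TB and a 2TB drive holds just under 7 days before OS and filesystem overhead |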
| Reconnect Policy | Exponential backoff: 1s → 2s → 4s → 8s → max 60s, reset on success |
| Heartbeat | Every 30s to cloud Stream Ingestion Service via VPN |
| Failover | Auto-restart via systemd (Restart=always, RestartSec=5) |
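
The reconnect policy in the table can be sketched as a small stateful helper. This is an illustration of the stated schedule (1s, doubling, capped at 60s, reset on success), written in Python for brevity even though the gateway itself is specified as Go. The class name is hypothetical, and jitter is left out because the table does not mention it.

```python
class ReconnectBackoff:
    """Exponential backoff with a cap and an explicit reset."""

    def __init__(self, base: float = 1.0, factor: float = 2.0, cap: float = 60.0):
        self.base = base
        self.factor = factor
        self.cap = cap
        self._next = base

    def next_delay(self) -> float:
        """Return the delay before the next attempt and advance the schedule."""
        delay = self._next
        # Multiply step by step so the value never grows past the cap.
        self._next = min(self._next * self.factor, self.cap)
        return delay

    def reset(self) -> None:
        """Call after a successful connection."""
        self._next = self.base


backoff = ReconnectBackoff()
schedule = [backoff.next_delay() for _ in range(8)]
backoff.reset()
after_reset = backoff.next_delay()
print(schedule, after_reset)
```

With 8 RTSP sessions reconnecting at once after a DVR reboot, adding a small random jitter to each delay would spread the load, at the cost of a less predictable schedule.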

#### 4.2.2 Stream Ingestion Service (Cloud)

| Attribute | Specification |
|-----------|--------------|
| Runtime | Go 1.21 |
| Deployment | Kubernetes Deployment, 3 replicas minimum |
| Responsibilities | Receive frames from edge, decode, produce to Kafka, store segments to MinIO |
| Protocol | gRPC bidirectional streaming from Edge Gateway |
| Frame Rate | 1 fps for AI analysis (decimated from 25fps source) |
| Full Rate | 25 fps for event clips (triggered recordings) |
| Kafka Topic | `streams.raw.{camera_id}` — protobuf-encoded frame batches |
| Video Segments | 10-second H.264 segments → MinIO bucket `video-segments` |
| Resource Request | 1 CPU, 2GB RAM per replica |
| HPA | 3-20 replicas based on CPU > 70% |
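
As a rough illustration of the 25 fps → 1 fps decimation listed above, the sketch below keeps one frame out of every 25. The function name and parameters are hypothetical. In the data flow of section 5 the edge FFmpeg process performs this step, so the code only demonstrates the ratio.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def decimate(frames: Iterable[T], source_fps: int = 25, target_fps: int = 1) -> Iterator[T]:
    """Keep one frame out of every source_fps // target_fps frames."""
    if target_fps <= 0 or source_fps < target_fps:
        raise ValueError("need 0 < target_fps <= source_fps")
    step = source_fps // target_fps
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield frame


# Two seconds of frame indices at 25 fps reduce to two frames for AI analysis.
kept = list(decimate(range(50)))
print(kept)  # [0, 25]
```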

#### 4.2.3 AI Inference Service

| Attribute | Specification |
|-----------|--------------|
| Runtime | NVIDIA Triton Inference Server 2.40+ (Docker) |
| Deployment | Kubernetes Deployment on GPU nodes (g4dn.xlarge) |
| GPU | NVIDIA T4 16GB, 1 GPU per replica |
| Models | YOLOv8x (detection), YOLOv8x-face (face detection), ArcFace (face recognition/embedding) |
| Model Format | TensorRT engines (.plan) for GPU optimization |
| gRPC API | `:8001` — Triton native gRPC |
| HTTP API | `:8000` — Triton native HTTP |
| Metrics | `:8002` — Prometheus metrics endpoint |
| Dynamic Batching | Max batch size: 8 for detection, 16 for face embedding |
| Input | JPEG frames (640×640, resized at the edge from the 960×1080 source) from Kafka topic `streams.raw.*` |
| Output | Detections (bbox, class, confidence) → Kafka `ai.detections` |
| Face Embeddings | 512-dim float32 vectors → pgvector (PostgreSQL) |
| Resource | 1× T4 GPU, 4 CPU, 16GB RAM per replica |
| HPA | 1-4 replicas based on GPU utilization > 80% and Kafka consumer lag |
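
Section 5 has the edge resize each 960×1080 frame to 640×640, which scales the two axes by different factors. Detection boxes therefore need to be mapped back to source pixels before they are drawn on the original video. The sketch below assumes boxes are `[x1, y1, x2, y2]` in pixels and that the frame was stretched rather than letterboxed; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def to_source_coords(
    bbox: Sequence[float],
    model_size: Tuple[int, int] = (640, 640),
    source_size: Tuple[int, int] = (960, 1080),
) -> Tuple[float, float, float, float]:
    """Map an [x1, y1, x2, y2] box from model pixels to source pixels.

    Assumes the frame was stretched (not letterboxed) to model_size.
    """
    sx = source_size[0] / model_size[0]  # 960 / 640 = 1.5
    sy = source_size[1] / model_size[1]  # 1080 / 640 = 1.6875
    x1, y1, x2, y2 = bbox
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


print(to_source_coords([320, 320, 640, 640]))  # (480.0, 540.0, 960.0, 1080.0)
```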

#### 4.2.4 Suspicious Activity Service (Night Mode)

| Attribute | Specification |
|-----------|--------------|
| Runtime | Python 3.11 (OpenCV, scikit-learn) + Go orchestrator |
| Deployment | Kubernetes Deployment, 2-8 replicas |
| Input | Kafka topic `ai.detections` + `streams.raw.*` for motion analysis |
| Responsibilities | Night-mode analysis, loitering detection, perimeter breach, abandoned object, crowd detection |
| Rules Engine | YAML-configured rules per camera, per time schedule |
| Night Schedule | Configurable (default: 22:00 - 06:00), overrides day-mode sensitivity |
| Output | Scored alerts → Kafka `alerts.critical` + PostgreSQL `alerts` table |
| Algorithms | Background subtraction (MOG2), optical flow for motion tracking, Kalman filters for object tracking |
| Resource Request | 2 CPU, 4GB RAM per replica |
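
The default night schedule (22:00 - 06:00) wraps past midnight, so a plain `start <= now < end` comparison would never match. A minimal sketch of the check follows. It assumes site-local time and an exclusive end bound, neither of which the table specifies, and the function name is hypothetical.

```python
from datetime import time


def in_night_window(now: time, start: time = time(22, 0), end: time = time(6, 0)) -> bool:
    """Return True if `now` falls inside [start, end), allowing the window to wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


print(in_night_window(time(23, 30)))  # True
print(in_night_window(time(5, 59)))   # True
print(in_night_window(time(6, 0)))    # False
print(in_night_window(time(14, 0)))   # False
```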

#### 4.2.5 Training Service

| Attribute | Specification |
|-----------|--------------|
| Runtime | Python 3.11, PyTorch 2.1, NVIDIA CUDA 12.1 |
| Deployment | Kubernetes Job/CronJob, runs on GPU spot instances |
| Responsibilities | Model retraining, fine-tuning on collected data, A/B model validation |
| Trigger | Weekly scheduled (Sunday 02:00) or manual (API call) |
| Data Source | MinIO bucket `training-data` (curated positive/negative samples) |
| Output | New TensorRT engines → MinIO bucket `model-artifacts` |
| A/B Rollout | Blue/green model deployment via Triton model repository |
| Validation | mAP > 0.85 required before promotion to production |
| Resource | 1× V100 GPU (spot), 8 CPU, 32GB RAM |

#### 4.2.6 API Gateway / Backend Service

| Attribute | Specification |
|-----------|--------------|
| Runtime | Go 1.21, Gin framework |
| Deployment | Kubernetes Deployment, 3-10 replicas |
| Protocol | HTTP/2, REST API + WebSocket for live updates |
| Authentication | JWT (RS256), access token 15min, refresh token 7 days |
| Authorization | RBAC: admin, operator, viewer roles |
| Rate Limiting | 100 req/min per IP, 1000 req/min per API key |
| Endpoints | See the API Endpoints table below |
| Caching | Redis for session store and API response caching (TTL 60s) |
| Resource Request | 0.5 CPU, 1GB RAM per replica |
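
One way to read the "100 req/min per IP" limit is as a sliding window over request timestamps. The sketch below keeps that state in process memory for clarity. With 3-10 replicas the counters would presumably need to live in shared storage such as the Redis cluster, which is an assumption and not something the table states. The class and method names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit: int, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


per_ip = SlidingWindowLimiter(limit=100)
results = [per_ip.allow("203.0.113.7", now=float(i) * 0.1) for i in range(101)]
print(results.count(True), results[-1])        # 100 False
print(per_ip.allow("203.0.113.7", now=60.0))   # True: the hit at t=0.0 has expired
```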

**API Endpoints:**

| Endpoint | Method | Description | Auth |
|----------|--------|-------------|------|
| `/api/v1/auth/login` | POST | User authentication | Public |
| `/api/v1/auth/refresh` | POST | Token refresh | Public |
| `/api/v1/cameras` | GET | List all cameras | Viewer+ |
| `/api/v1/cameras/{id}` | GET | Camera details | Viewer+ |
| `/api/v1/cameras/{id}/live` | GET | Live stream URL (HLS) | Viewer+ |
| `/api/v1/events` | GET | Query events (paginated, filtered) | Viewer+ |
| `/api/v1/events/{id}` | GET | Event details with snapshot | Viewer+ |
| `/api/v1/alerts` | GET | List alerts | Viewer+ |
| `/api/v1/alerts/{id}/ack` | POST | Acknowledge alert | Operator+ |
| `/api/v1/search/faces` | POST | Face search by image | Operator+ |
| `/api/v1/search/faces/{embedding_id}` | GET | Similar face lookup by stored embedding ID | Operator+ |
| `/api/v1/training/upload` | POST | Upload training samples | Admin |
| `/api/v1/training/jobs` | GET | List training jobs | Admin |
| `/api/v1/zones` | CRUD | Perimeter zones per camera | Admin |
| `/api/v1/reports/daily` | GET | Daily activity report | Viewer+ |
| `/api/v1/system/health` | GET | System health status | Internal |
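
The "Viewer+" and "Operator+" notation in the Auth column suggests a strictly ordered role hierarchy (viewer < operator < admin). Under that assumption, an authorization check reduces to a comparison against the minimum role for the route. The names below are hypothetical and only a few rows of the table are included.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    OPERATOR = 2
    ADMIN = 3


def is_allowed(user_role: Role, minimum: Role) -> bool:
    """'Viewer+' means viewer or higher; 'Admin' means admin only."""
    return user_role >= minimum


# Minimum role per (method, path template), taken from a few rows of the table above.
REQUIRED = {
    ("GET", "/api/v1/cameras"): Role.VIEWER,
    ("POST", "/api/v1/alerts/{id}/ack"): Role.OPERATOR,
    ("POST", "/api/v1/training/upload"): Role.ADMIN,
}

print(is_allowed(Role.OPERATOR, REQUIRED[("GET", "/api/v1/cameras")]))           # True
print(is_allowed(Role.OPERATOR, REQUIRED[("POST", "/api/v1/training/upload")]))  # False
```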

#### 4.2.7 Web Frontend

| Attribute | Specification |
|-----------|--------------|
| Framework | Next.js 14 (App Router), React 18, TypeScript |
| Styling | Tailwind CSS + shadcn/ui components |
| State Management | Zustand (client), React Query (server) |
| Video Player | HLS.js for live stream playback, Video.js for VOD |
| Maps | MapLibre GL JS (open source, no API key required) for camera geolocation |
| Real-time | WebSocket connection for alert notifications |
| Build Output | Static export → served via CDN (CloudFront) |
| PWA | Service worker for offline dashboard viewing |

#### 4.2.8 Notification Service

| Attribute | Specification |
|-----------|--------------|
| Runtime | Go 1.21 |
| Deployment | Kubernetes Deployment, 2-5 replicas |
| Input | Kafka topic `alerts.critical` |
| Channels | Email (SMTP/AWS SES), SMS (Twilio/AWS SNS), Push (Firebase FCM), Webhook |
| Templates | HTML email templates with event snapshot attachment |
| Rate Limiting | Max 1 SMS per phone per 5 minutes; max 10 emails per address per hour |
| Retry Policy | 3 retries with exponential backoff per channel; message goes to a dead-letter queue after the final retry fails |
| Escalation | Unacknowledged critical alerts escalate to an admin after 15 minutes |
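
The escalation rule can be expressed as a check over the `created_at` and `acked_at` columns of the `alerts` table. This is a sketch only: it assumes timezone-aware UTC timestamps and treats "after 15 minutes" as inclusive, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ESCALATE_AFTER = timedelta(minutes=15)


def should_escalate(created_at: datetime, acked_at: Optional[datetime], now: datetime) -> bool:
    """True when a critical alert is still unacknowledged 15 minutes after creation."""
    return acked_at is None and now - created_at >= ESCALATE_AFTER


created = datetime(2025, 1, 20, 14, 30, tzinfo=timezone.utc)
print(should_escalate(created, None, created + timedelta(minutes=14)))  # False
print(should_escalate(created, None, created + timedelta(minutes=15)))  # True
print(should_escalate(created, created + timedelta(minutes=3), created + timedelta(minutes=20)))  # False
```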

#### 4.2.9 Database — PostgreSQL 16 (RDS)

| Attribute | Specification |
|-----------|--------------|
| Instance | db.r6g.xlarge (4 vCPU, 32GB RAM) |
| Storage | 500GB gp3, auto-scaling to 2TB |
| Multi-AZ | Enabled for production |
| Extensions | pgvector (face embeddings), PostGIS (zone geometry), pg_stat_statements |
| Backup | Daily automated, 35-day retention |
| Read Replica | 1 read replica for analytics queries |

**Schema Overview:**

```sql
-- Core tables (column overview, not executable DDL)
cameras (id, name, dvr_channel, rtsp_url, location, status, created_at)
events (id, camera_id, event_type, confidence, bounding_box, snapshot_path, 
        start_time, end_time, severity, metadata JSONB, created_at)
alerts (id, event_id, rule_id, severity, status [new|ack|resolved], 
        acknowledged_by, acked_at, notification_channels, created_at)
face_embeddings (id, person_name, embedding vector(512), camera_id, 
                 first_seen, last_seen, occurrence_count, metadata JSONB)
users (id, username, password_hash, role, email, phone, created_at)
alert_rules (id, camera_id, rule_type, config JSONB, schedule JSONB, 
             severity, enabled, created_at)
audit_log (id, user_id, action, resource, details JSONB, ip_address, created_at)
perimeter_zones (id, camera_id, name, polygon GEOMETRY(POLYGON), 
                 alert_on_enter, alert_on_exit, schedule, created_at)
```

#### 4.2.10 Object Storage — MinIO + S3

| Attribute | Specification |
|-----------|--------------|
| Local (Edge) | MinIO single-node, 2TB NVMe, 7-day retention |
| Cloud (Primary) | MinIO distributed cluster on EKS, 10TB initial, auto-scaling |
| Archive | AWS S3 with lifecycle: Standard → IA (30d) → Glacier Deep Archive (365d) |
| API | S3-compatible, same SDK for all tiers |
| Buckets | `video-segments` (10s segments), `event-clips` (triggered recordings), `training-data` (curated samples), `snapshots` (JPEG event frames), `model-artifacts` (TensorRT engines) |
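
A quick sizing check for the edge tier, using only the figures from section 4.2.1 (~1.5GB/hour per channel, 8 channels, 7-day retention, 2TB NVMe). It treats 2TB as 2000 decimal gigabytes and ignores the operating system and filesystem overhead, so the real margin would be smaller.

```python
GB_PER_HOUR_PER_CHANNEL = 1.5   # from the Edge Gateway table
CHANNELS = 8
RETENTION_DAYS = 7
DISK_GB = 2000                  # 2 TB NVMe, decimal gigabytes

daily_gb = GB_PER_HOUR_PER_CHANNEL * 24 * CHANNELS
buffer_gb = daily_gb * RETENTION_DAYS
days_that_fit = DISK_GB / daily_gb

print(daily_gb)                 # 288.0
print(buffer_gb)                # 2016.0
print(round(days_that_fit, 2))  # 6.94
```

On these numbers the 7-day buffer slightly exceeds the drive, which suggests either a lower effective bitrate, a retention closer to 6 days, or a larger disk.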

#### 4.2.11 Redis Cluster

| Attribute | Specification |
|-----------|--------------|
| Type | ElastiCache for Redis, cluster mode enabled |
| Node Type | cache.r6g.large per shard |
| Shards | 2 shards, 2 replicas per shard |
| Max Memory Policy | allkeys-lru (evict least recently used) |
| Persistence | AOF everysec, RDB every 60min |
| Use Cases | Session store, API cache, real-time pub/sub, stream position tracking |

#### 4.2.12 Vector Store (pgvector)

| Attribute | Specification |
|-----------|--------------|
| Integration | PostgreSQL extension (same RDS instance) |
| Table | `face_embeddings` with `embedding vector(512)` column |
| Index | HNSW (hierarchical navigable small world) for approximate nearest neighbor |
| Index Parameters | `m = 16`, `ef_construction = 64` |
| Similarity Metric | Cosine distance (`<=>` operator; cosine similarity = 1 − distance) |
| Query | `SELECT * FROM face_embeddings ORDER BY embedding <=> $1 LIMIT 10` |
| Expected Volume | ~1M vectors per year (8 cameras) |
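
pgvector's `<=>` operator returns cosine distance, so the `ORDER BY embedding <=> $1` query above lists the closest faces first (smallest distance). The pure-Python sketch below shows the quantity being compared; the function name is hypothetical and the 2-dimensional vectors stand in for the 512-dim embeddings.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated
print(same, orthogonal)  # 0.0 1.0
```

For the HNSW index to serve this query, it would need to be built with pgvector's `vector_cosine_ops` operator class. Any match threshold on the distance is a tuning choice that this document does not specify.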

---

## 5. Data Flow Design

### 5.1 Complete Data Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                      DATA FLOW                                                  │
├─────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                                 │
│  LAYER 1: CAPTURE & INGESTION                                                                   │
│  ══════════════════════════                                                                     │
│                                                                                                 │
│   CAMERAS (8ch) ──► DVR (192.168.29.200) ──► RTSP (:554) ──► EDGE GATEWAY (192.168.29.5)      │
│                                                                                                 │
│   Camera → BNC coax → DVR encoder → H.264 stream → RTSP server (DVR builtin)                  │
│                                                                                                 │
│   Edge Gateway pulls 8 concurrent RTSP streams:                                                 │
│   rtsp://192.168.29.200:554/user=admin&password=&channel=1&stream=0.sdp?                       │
│   rtsp://192.168.29.200:554/user=admin&password=&channel=2&stream=0.sdp?                       │
│   ... (channels 1-8)                                                                           │
│                                                                                                 │
│   ┌─────────────────────────────────────────────────────────────────────────┐                  │
│   │  EDGE GATEWAY PROCESSING per stream:                                     │                  │
│   │  1. FFmpeg demux → raw H.264 Annex-B frames                            │                  │
│   │  2. Segment into 10s chunks → local MinIO (circular buffer)            │                  │
│   │  3. Extract 1 fps JPEG frames (960×1080 → 640×640 resize for AI)       │                  │
│   │  4. Protobuf-encode frame batches                                        │                  │
│   │  5. Send via gRPC bidirectional stream over WireGuard VPN                │                  │
│   └─────────────────────────────────────────────────────────────────────────┘                  │
│                                          │                                                      │
│                    ┌─────────────────────┼─────────────────────┐                                │
│                    │                     │                     │                                │
│                    ▼                     ▼                     ▼                                │
│   ┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐                 │
│   │ PATH A: LIVE VIDEO  │   │ PATH B: AI ANALYSIS │   │ PATH C: RECORDING   │                 │
│   │ (WebRTC/HLS path)   │   │ (detection pipeline)│   │ (event archival)    │                 │
│   └─────────────────────┘   └─────────────────────┘   └─────────────────────┘                 │
│                                                                                                 │
│  LAYER 2: STREAM PROCESSING (Cloud)                                                             │
│  ══════════════════════════════════                                                             │
│                                                                                                 │
│  PATH A: LIVE VIDEO ────────────────────────────────────────────────────────                    │
│                                                                                                 │
│   Edge Gateway ──► Stream Ing. Svc ──► Redis Stream (live:cam:{id}) ──► HLS Segment Svc      │
│        RTSP              (decode)           (pub/sub buffer)              (m3u8 + .ts)        │
│                                                                            │                    │
│                                                                            ▼                    │
│                                                                     CloudFront CDN              │
│                                                                            │                    │
│                                                                            ▼                    │
│                                                                    Web Browser (HLS.js)        │
│                                                                                                 │
│  PATH B: AI ANALYSIS ───────────────────────────────────────────────────────                    │
│                                                                                                 │
│   Stream Ing. Svc ──► Kafka (streams.raw.{cam}) ──► AI Inference Svc (Triton)                 │
│   (frame batches)         (ordered, partitioned)         (YOLOv8 + ArcFace)                    │
│                                                              │                                  │
│                                    ┌─────────────────────────┼─────────────────────────┐        │
│                                    │                         │                         │        │
│                                    ▼                         ▼                         ▼        │
│                            ┌──────────────┐         ┌──────────────┐         ┌──────────────┐  │
│                            │ Detections   │         │  Face Emb.   │         │  Stream to   │  │
│                            │ (person,     │         │  (512-dim)   │         │  Suspicious  │  │
│                            │  vehicle)    │         │              │         │  Activity    │  │
│                            └──────┬───────┘         └──────┬───────┘         │  Service     │  │
│                                   │                        │                 └──────┬───────┘  │
│                                   ▼                        ▼                        ▼          │
│                            Kafka (ai.              PostgreSQL                 Kafka (alerts.   │
│                            detections)             (pgvector)                 critical)        │
│                                                                                                 │
│  PATH C: RECORDING ─────────────────────────────────────────────────────────                    │
│                                                                                                 │
│   Edge Gateway ──► Local MinIO (7d) ──► Sync ──► Cloud MinIO ──► S3 Lifecycle → Glacier       │
│   (10s segments)      (hot buffer)    (daily)     (30d hot)        (1yr archive)               │
│                                                                                                 │
│  LAYER 3: EVENT PROCESSING                                                                      │
│  ═════════════════════════                                                                      │
│                                                                                                 │
│   AI Inference ──► Kafka (ai.detections) ──► Suspicious Activity Svc                           │
│   Output              - bbox, class, conf          - Rule engine evaluation                      │
│                       - timestamp                  - Loitering detection                         │
│                       - camera_id                  - Perimeter breach check                      │
│                       - embedding_id               - Crowd counting                              │
│                                                    - Time-of-day scoring                         │
│                                                          │                                      │
│                                    ┌─────────────────────┼─────────────────────┐                │
│                                    │                     │                     │                │
│                                    ▼                     ▼                     ▼                │
│                            ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐   │
│                            │ PostgreSQL   │     │   Kafka      │     │  Notification Svc    │   │
│                            │ (alerts)     │     │ (alerts.     │     │  - Email (SES)       │   │
│                            │ (events)     │     │  critical)   │     │  - SMS (Twilio)      │   │
│                            └──────────────┘     └──────────────┘     │  - Push (FCM)        │   │
│                                                                      │  - Webhook           │   │
│                                                                      └──────────────────────┘   │
│                                                                                                 │
│  LAYER 4: CONSUMPTION                                                                           │
│  ════════════════════                                                                           │
│                                                                                                 │
│   Web Frontend ──► API Gateway ──► Backend Service ──► PostgreSQL/Redis/MinIO                 │
│   (Next.js)          (Traefik)      (Go/Gin)            (data queries)                         │
│      │                                                                        │                │
│      │  ┌─────────────────────────────────────────────────────────────────┐   │                │
│      │  │  DASHBOARD VIEWS:                                               │   │                │
│      │  │  - Live View: HLS.js + WebSocket for alert overlay              │   │                │
│      │  │  - Event Timeline: Infinite scroll, filters                     │   │                │
│      │  │  - Alert Management: Ack/Nack, assignment                       │   │                │
│      │  │  - Face Search: Upload photo → pgvector similarity search        │   │                │
│      │  │  - Analytics: Time-series charts (event frequency, heatmaps)    │   │                │
│      │  │  - Settings: Camera config, zone drawing, rule management       │   │                │
│      │  └─────────────────────────────────────────────────────────────────┘   │                │
│      │                                                                        │                │
│      └─────────────────────────── WebSocket: /ws/alerts ───────────────────────┘                │
│                                    (real-time alert push)                                       │
│                                                                                                 │
│  LAYER 5: TRAINING DATA FLOW                                                                    │
│  ═══════════════════════════                                                                    │
│                                                                                                 │
│   Events (false positive) ──► Admin review ──► "Add to Training" ──► MinIO (training-data)    │
│   Events (missed detect)  ──► Manual upload ──► Labeling UI ──► Curated dataset               │
│                                                                     │                           │
│                                                                     ▼                           │
│                                                          Training Service (weekly CronJob)      │
│                                                          - Load dataset from MinIO              │
│                                                          - Fine-tune YOLOv8 weights             │
│                                                          - Convert to TensorRT engine           │
│                                                          - Validate mAP > 0.85                  │
│                                                                │                                │
│                                                                ▼                                │
│                                                          Model Registry (MinIO)                 │
│                                                          - Blue/green deployment                │
│                                                          - Triton model repository              │
│                                                                │                                │
│                                                                ▼                                │
│                                                          AI Inference Svc (rolling update)      │
│                                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
```

### 5.2 Stream Flow Detail

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                     VIDEO STREAM FLOW (Per Camera)                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Camera ──► DVR Encoder ──► RTSP Stream ──► Edge Gateway                  │
│                              (H.264, 25fps,                               │
│                               960x1080)                                     │
│                                                                             │
│                              Edge Gateway Processing:                       │
│                              ┌─────────────────────────────────────┐       │
│                              │ 1. FFmpeg process per channel       │       │
│                              │    -i rtsp://dvr/ch{N}              │       │
│                              │    -c:v copy -f segment             │       │
│                              │    -segment_time 10                 │       │
│                              │    /recordings/ch{N}/%d.ts          │       │
│                              │                                     │       │
│                              │ 2. Parallel: 1 fps extraction       │       │
│                              │    -vf fps=1,scale=640:640          │       │
│                              │    -f image2pipe -vcodec mjpeg      │       │
│                              │    → AI pipeline                    │       │
│                              └─────────────────────────────────────┘       │
│                                          │                                  │
│                    ┌─────────────────────┼─────────────────────┐           │
│                    │                     │                     │           │
│                    ▼                     ▼                     ▼           │
│            ┌──────────────┐      ┌──────────────┐      ┌──────────────┐   │
│            │ Local Buffer │      │ AI Frames    │      │ Cloud Upload │   │
│            │ (7-day ring) │      │ (1 fps JPEG) │      │ (10s chunks) │   │
│            └──────────────┘      └──────┬───────┘      └──────┬───────┘   │
│                                         │                     │           │
│                    ┌────────────────────┘                     │           │
│                    │ WireGuard VPN                              │           │
│                    ▼                                          ▼           │
│           ┌────────────────┐                        ┌────────────────┐    │
│           │ Cloud Stream   │                        │ Cloud MinIO    │    │
│           │ Ingestion Svc  │                        │ (30-day hot)   │    │
│           └───────┬────────┘                        └────────────────┘    │
│                   │                                                        │
│                   ▼                                                        │
│     ┌────────────────────────┐                                             │
│     │  Kafka (streams.raw)   │                                             │
│     │  Partition = camera_id │                                             │
│     │  Guarantees ordering   │                                             │
│     │  per camera            │                                             │
│     └───────────┬────────────┘                                             │
│                 │                                                          │
│    ┌────────────┼────────────┐                                             │
│    │            │            │                                             │
│    ▼            ▼            ▼                                             │
│ ┌──────┐   ┌──────┐   ┌──────────┐                                       │
│ │ AI   │   │ HLS  │   │ Recording│                                       │
│ │ Inf. │   │ Seg. │   │ Archival │                                       │
│ │ Svc  │   │ Svc  │   │ Svc      │                                       │
│ └──────┘   └──────┘   └──────────┘                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
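
The two FFmpeg invocations in the diagram can be written out as argument lists suitable for `subprocess.Popen`. This is a sketch under assumptions: the function names and the placeholder URL are hypothetical, `-rtsp_transport tcp` is added to match the "RTSP over TCP" setting in section 4.2.1, and `pipe:1` sends the JPEG frames to stdout.

```python
from typing import List


def segment_cmd(rtsp_url: str, channel: int, out_root: str = "/recordings") -> List[str]:
    """Copy the H.264 stream into 10-second MPEG-TS segments without re-encoding."""
    return [
        "ffmpeg", "-rtsp_transport", "tcp", "-i", rtsp_url,
        "-c:v", "copy", "-f", "segment", "-segment_time", "10",
        f"{out_root}/ch{channel}/%d.ts",
    ]


def ai_frames_cmd(rtsp_url: str) -> List[str]:
    """Emit 1 fps MJPEG frames, stretched to 640x640, on stdout."""
    return [
        "ffmpeg", "-rtsp_transport", "tcp", "-i", rtsp_url,
        "-vf", "fps=1,scale=640:640",
        "-f", "image2pipe", "-vcodec", "mjpeg", "pipe:1",
    ]


url = "rtsp://dvr.example/ch1"  # placeholder, not the DVR's real URL format
print(" ".join(segment_cmd(url, 1)))
print(" ".join(ai_frames_cmd(url)))
```

Running the two commands separately opens two RTSP sessions per channel. A single FFmpeg process with two outputs would avoid that, at the cost of coupling the recording and AI paths.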

### 5.3 Event/Alert Flow Detail

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      EVENT & ALERT FLOW                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  AI Inference Output:                                                       │
│  {                                                                         │
│    "camera_id": "cam_001",                                                  │
│    "timestamp": "2025-01-20T14:30:00Z",                                     │
│    "detections": [                                                          │
│      { "class": "person", "confidence": 0.94,                               │
│        "bbox": [120, 340, 280, 560], "track_id": 42 },                     │
│      { "class": "face", "confidence": 0.89,                                 │
│        "bbox": [150, 360, 200, 420], "embedding_id": "emb_12345" }         │
│    ]                                                                        │
│  }                                                                          │
│                                                                             │
│         │                                                                   │
│         ▼                                                                   │
│  ┌─────────────────────────────────────────┐                               │
│  │  Kafka Topic: ai.detections             │                               │
│  │  (JSON, 16 partitions)                  │                               │
│  └─────────────┬───────────────────────────┘                               │
│                │                                                            │
│    ┌───────────┴───────────┐                                               │
│    │                       │                                               │
│    ▼                       ▼                                               │
│ ┌──────────┐        ┌──────────────┐                                      │
│ │ Face     │        │ Suspicious   │                                      │
│ │ Matching │        │ Activity Svc │                                      │
│ │ (pgvector│        │              │                                      │
│ │  search) │        │ Rule Eval:   │                                      │
│ │          │        │ - Night mode?│                                      │
│ └────┬─────┘        │ - Zone       │                                      │
│      │              │   overlap?   │                                      │
│      │              │ - Loitering  │                                      │
│      │              │   > 5 min?   │                                      │
│      │              │ - Crowd      │                                      │
│      │              │   > 5 ppl?   │                                      │
│      │              └──────┬───────┘                                      │
│      │                     │                                              │
│      │    MATCH FOUND      │ ALERT TRIGGERED                             │
│      │         │           │                                              │
│      ▼         ▼           ▼                                              │
│  ┌──────────────────────────────────────────┐                            │
│  │  PostgreSQL                             │                            │
│  │  - events table (all detections)        │                            │
│  │  - alerts table (triggered alerts)      │                            │
│  │  - face_embeddings (if new/matched)     │                            │
│  └───────────────────┬──────────────────────┘                            │
│                      │                                                    │
│          ┌───────────┼───────────┐                                       │
│          │           │           │                                       │
│          ▼           ▼           ▼                                       │
│   ┌──────────┐ ┌──────────┐ ┌──────────────┐                           │
│   │ WebSocket│ │ Kafka    │ │ Notification │                           │
│   │ Push     │ │ alerts.  │ │ Service      │                           │
│   │ (live    │ │ critical │ │              │                           │
│   │  update) │ │          │ │ - Email      │                           │
│   └──────────┘ └──────────┘ │ - SMS        │                           │
│                             │ - Push       │                           │
│                             │ - Webhook    │                           │
│                             └──────────────┘                           │
│                                                                             │
│  Alert Lifecycle:                                                           │
│  DETECTED → NEW (insert) → WebSocket push → NOTIFY → ACK/RESOLVE          │
│                                ↓                                            │
│                           If unacked 15min → ESCALATE to admin             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
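
The alert lifecycle at the bottom of the diagram can be read as a small state machine with one timed branch. The sketch below is one hedged way to express the 15-minute escalation check in Go. The state names follow the diagram loosely, while `Alert`, `AgeMinutes`, and `NeedsEscalation` are illustrative names and not part of the platform's actual schema.

```go
package main

import (
	"fmt"
	"time"
)

// AlertState loosely mirrors the lifecycle in the diagram:
// NEW -> NOTIFIED -> ACKED/RESOLVED, with ESCALATED for unacknowledged alerts.
type AlertState string

const (
	StateNew       AlertState = "NEW"
	StateNotified  AlertState = "NOTIFIED"
	StateAcked     AlertState = "ACKED"
	StateResolved  AlertState = "RESOLVED"
	StateEscalated AlertState = "ESCALATED"
)

// escalateAfter is the 15-minute acknowledgement window from the diagram.
const escalateAfter = 15 * time.Minute

// Alert is a hypothetical in-memory view of a row in the alerts table.
type Alert struct {
	State      AlertState
	AgeMinutes int // minutes since the alert was inserted
}

// NeedsEscalation reports whether an alert is still unacknowledged once the
// escalation window has elapsed.
func (a Alert) NeedsEscalation() bool {
	unacked := a.State == StateNew || a.State == StateNotified
	return unacked && time.Duration(a.AgeMinutes)*time.Minute >= escalateAfter
}

func main() {
	alerts := []Alert{
		{State: StateNotified, AgeMinutes: 5},
		{State: StateNotified, AgeMinutes: 20},
		{State: StateAcked, AgeMinutes: 45},
	}
	for _, a := range alerts {
		fmt.Printf("%s after %d min: escalate=%v\n", a.State, a.AgeMinutes, a.NeedsEscalation())
	}
}
```

A periodic job could run this check against open alerts and hand the matches to the Notification Service with the admin as recipient.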

### 5.4 Live Video to Browser Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                   LIVE VIDEO TO BROWSER FLOW                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐            │
│  │   BROWSER    │      │  CloudFront  │      │   EKS HLS    │            │
│  │              │      │   CDN        │      │   Service    │            │
│  │  ┌────────┐  │      │              │      │              │            │
│  │  │HLS.js  │  │      │              │      │  ┌────────┐  │            │
│  │  │Player  │◄─┼──────┼─── m3u8 ─────┼──────┼──│Playlist│  │            │
│  │  │        │  │      │   + .ts      │      │  │Builder │  │            │
│  │  └────────┘  │      │   segments   │      │  └────┬───┘  │            │
│  │              │      │              │      │       │      │            │
│  │  WebSocket ──┼──────┼──────────────┼──────┼───────┘      │            │
│  │  /ws/alerts  │      │              │      │              │            │
│  └──────────────┘      └──────────────┘      └──────┬───────┘            │
│                                                     │                      │
│                                                     ▼                      │
│                                            ┌────────────────┐             │
│                                            │  Redis Stream  │             │
│                                            │  live:cam:{id} │             │
│                                            │                │             │
│                                            │  ┌──────────┐  │             │
│                                            │  │Segment 1 │  │             │
│                                            │  │Segment 2 │  │             │
│                                            │  │Segment 3 │──┼──► FIFO     │
│                                            │  └──────────┘  │   (keep 30) │
│                                            └───────┬────────┘             │
│                                                    │                       │
│                              ┌─────────────────────┘                       │
│                              │ WireGuard VPN                               │
│                              ▼                                              │
│                    ┌──────────────────┐                                    │
│                    │  Edge Gateway    │                                    │
│                    │  FFmpeg → HLS    │                                    │
│                    │  segmenter       │                                    │
│                    └──────────────────┘                                    │
│                                                                             │
│  Latency Budget:                                                            │
│  - DVR encoding:  ~100ms                                                    │
│  - RTSP to Edge:  ~50ms                                                     │
│  - VPN tunnel:    ~30-80ms (depending on internet)                         │
│  - Cloud HLS svc: ~50ms                                                     │
│  - CDN delivery:  ~20-100ms                                                 │
│  - Player buffer: 3-6 segments (~30-60s behind real-time)                  │
│  TOTAL LIVE LATENCY: ~35-65 seconds (HLS inherent)                         │
│                                                                             │
│  Lower-latency option (future): WebRTC mode                                 │
│  - Target: < 2 seconds using WHIP/WHEP                                      │
│  - Requires direct edge-to-browser or TURN relay                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
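
The "keep 30" FIFO and the Playlist Builder together amount to a sliding-window live playlist. The following is a minimal in-memory sketch of that idea, not the service's implementation: it assumes 10-second segments (consistent with 3-6 segments being 30-60 s), and the `SegmentWindow` type and the `<camera>_<seq>.ts` file naming are hypothetical. In the architecture above the window would be backed by the `live:cam:{id}` Redis stream instead of local fields.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	windowSize     = 30 // segments retained per camera ("keep 30")
	targetDuration = 10 // seconds per segment (assumed)
)

// SegmentWindow tracks the most recent segment sequence numbers as a FIFO.
type SegmentWindow struct {
	first int // sequence number of the oldest retained segment
	count int // number of retained segments
}

// Push records a new segment and evicts the oldest once the window is full.
func (w *SegmentWindow) Push() {
	if w.count < windowSize {
		w.count++
		return
	}
	w.first++
}

// Playlist renders a live HLS media playlist for the retained segments.
func (w *SegmentWindow) Playlist(cameraID string) string {
	var b strings.Builder
	b.WriteString("#EXTM3U\n#EXT-X-VERSION:3\n")
	fmt.Fprintf(&b, "#EXT-X-TARGETDURATION:%d\n", targetDuration)
	fmt.Fprintf(&b, "#EXT-X-MEDIA-SEQUENCE:%d\n", w.first)
	for seq := w.first; seq < w.first+w.count; seq++ {
		fmt.Fprintf(&b, "#EXTINF:%d.0,\n%s_%d.ts\n", targetDuration, cameraID, seq)
	}
	return b.String()
}

func main() {
	var w SegmentWindow
	for i := 0; i < 35; i++ {
		w.Push()
	}
	fmt.Print(w.Playlist("cam1"))
}
```

Because `#EXT-X-MEDIA-SEQUENCE` advances as old segments are evicted, HLS.js can keep polling the same playlist URL and pick up only the new entries.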

### 5.5 Training Data Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      TRAINING DATA FLOW                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  SOURCE 1: Automatic (Low-Confidence Detections)                           │
│  ───────────────────────────────────────────────                            │
│                                                                             │
│  AI Inference → Confidence 0.3-0.6 range → Flag as "uncertain"            │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────┐                               │
│  │  MinIO bucket: training-data/auto/      │                               │
│  │  - Original frame (JPEG)                │                               │
│  │  - Inference result (JSON)              │                               │
│  │  - Flagged for review                   │                               │
│  └─────────────────────────────────────────┘                               │
│                                                                             │
│  SOURCE 2: Manual (Operator Upload)                                        │
│  ──────────────────────────────────                                         │
│                                                                             │
│  Operator → Dashboard "Upload Training Image" → Label with bounding boxes  │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────┐                               │
│  │  MinIO bucket: training-data/manual/    │                               │
│  │  - Uploaded image with labels (COCO fmt)│                               │
│  └─────────────────────────────────────────┘                               │
│                                                                             │
│  SOURCE 3: Missed Detection (Post-Incident)                                │
│  ────────────────────────────────────────────                               │
│                                                                             │
│  Security review → "AI should have caught this" → Extract from recording   │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────┐                               │
│  │  MinIO bucket: training-data/incident/  │                               │
│  │  - Video clip with manual annotation    │                               │
│  └─────────────────────────────────────────┘                               │
│                                                                             │
│  AGGREGATION:                                                               │
│  ════════════                                                               │
│                                                                             │
│  All sources → Weekly CronJob (Sunday 02:00 UTC)                           │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Training Service Pipeline:                                         │   │
│  │                                                                     │   │
│  │  1. Download all new training data from MinIO                       │   │
│  │  2. Deduplicate (perceptual hashing)                                │   │
│  │  3. Split: train/val/test (80/10/10), before augmentation           │   │
│  │  4. Augment train set: rotation, brightness, noise (albumentations) │   │
│  │  5. Fine-tune YOLOv8x:                                              │   │
│  │     - Base: COCO-pretrained weights                                 │   │
│  │     - Epochs: 100, early stopping patience 10                       │   │
│  │     - LR: 0.001 with cosine decay                                   │   │
│  │     - Batch: 8 per GPU                                              │   │
│  │  6. Validate mAP@0.5 > 0.85                                         │   │
│  │  7. Convert to TensorRT engine (FP16, max batch 8)                  │   │
│  │  8. Upload to MinIO: model-artifacts/{version}/                     │   │
│  │  9. A/B test: shadow mode for 24 hours                               │   │
│  │  10. Promote to production if FP rate < baseline                    │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  MODEL DEPLOYMENT:                                                          │
│  ═════════════════                                                          │
│                                                                             │
│  MinIO model-artifacts/ → Triton Model Repository → explicit load API      │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────┐                               │
│  │  Blue/Green Deployment (canary ramp):   │                               │
│  │  - Triton loads new model as "green"    │                               │
│  │  - 5% traffic routed for 1 hour         │                               │
│  │  - Monitor: latency P99, error rate     │                               │
│  │  - If OK: 100% traffic, "blue" retired  │                               │
│  │  - If FAIL: automatic rollback          │                               │
│  └─────────────────────────────────────────┘                               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
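
Two numeric gates in this flow decide what gets collected and what gets promoted: the 0.3-0.6 "uncertain" confidence band, and the mAP@0.5 and false-positive checks in steps 6 and 10. A hedged sketch of both follows. The function names are illustrative, and treating the band boundaries as inclusive is an assumption the diagram does not settle.

```go
package main

import "fmt"

// Thresholds taken from the training data flow above.
const (
	uncertainLow  = 0.3  // lower bound of the "uncertain" confidence band
	uncertainHigh = 0.6  // upper bound of the "uncertain" confidence band
	minMAP50      = 0.85 // step 6: required mAP@0.5
)

// IsUncertain reports whether a detection should be copied to
// training-data/auto/ for review (bounds assumed inclusive).
func IsUncertain(confidence float64) bool {
	return confidence >= uncertainLow && confidence <= uncertainHigh
}

// ShouldPromote combines step 6 (mAP@0.5 above the threshold) and step 10
// (shadow-mode false-positive rate below the current baseline).
func ShouldPromote(mAP50, shadowFPRate, baselineFPRate float64) bool {
	return mAP50 > minMAP50 && shadowFPRate < baselineFPRate
}

func main() {
	for _, c := range []float64{0.25, 0.45, 0.9} {
		fmt.Printf("confidence %.2f: uncertain=%v\n", c, IsUncertain(c))
	}
	fmt.Println("promote candidate:", ShouldPromote(0.88, 0.021, 0.030))
}
```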

---

## 6. Technology Stack

### 6.1 Technology Selection Matrix

| Category | Technology | Alternative | Selection Criteria |
|----------|-----------|-------------|-------------------|
| **Cloud Platform** | **AWS** | GCP, Azure | Best India region coverage (Mumbai), most mature managed Kafka (MSK), broad GPU instance types, VPC endpoints for private service communication |
| **Container Orchestration** | **Amazon EKS** | GKE, AKS, self-managed | Managed control plane, GPU device plugin support, Cluster Autoscaler, native IAM integration |
| **Edge K8s** | **K3s** | K0s, MicroK8s, Docker Compose | Single binary, lightweight, optional HA with embedded etcd, bundled Helm controller, compatible with standard K8s manifests |
| **VPN** | **WireGuard** | OpenVPN, IPSec, Tailscale | Modern crypto (Curve25519, ChaCha20, Poly1305), in the mainline Linux kernel since 5.6, ~60% faster than OpenVPN, works behind NAT with PersistentKeepalive, simple config |
| **Message Queue** | **Apache Kafka (MSK)** | RabbitMQ, NATS, AWS SQS | Ordered event log, stream replay, high throughput, exactly-once processing with Flink, managed service reduces ops |
| **Stream Processing** | **Apache Flink on EKS** | Kafka Streams, Spark Streaming | Stateful processing, event time semantics, exactly-once, CEP (complex event processing) for multi-frame rules |
| **Reverse Proxy** | **Traefik** | NGINX, HAProxy, Envoy | Native Kubernetes Ingress, automatic Let's Encrypt, middleware chains, WebSocket support, Prometheus metrics |
| **AI Inference** | **NVIDIA Triton + YOLOv8** | TorchServe, TensorFlow Serving, custom | Multi-framework support, TensorRT optimization, dynamic batching, model ensemble, Prometheus metrics |
| **Database** | **PostgreSQL 16 (RDS) + pgvector** | MySQL, MongoDB, separate vector DB | ACID compliance, mature managed service, pgvector handles 512-dim embeddings at scale, no separate DB to manage |
| **Cache** | **Redis 7 Cluster** | Memcached, KeyDB | Data structures (streams, sorted sets), pub/sub, persistence, cluster mode for horizontal scaling |
| **Object Storage** | **MinIO + S3** | Ceph, GlusterFS, pure S3 | S3-compatible API everywhere, local buffering at edge, cloud tiering, cost optimization via lifecycle policies |
| **Backend Language** | **Go 1.21** | Python, Java, Rust | Compiled performance for high-throughput streaming, excellent concurrency (goroutines), small container images |
| **Frontend** | **Next.js 14 + React 18** | Vue, Angular, Svelte | SSR for SEO/performance, React ecosystem, API routes, image optimization, easy deployment to CDN |
| **Monitoring** | **Prometheus + Grafana + Loki** | Datadog, New Relic, CloudWatch | Open source, no per-host licensing, powerful alerting, log aggregation with Loki, custom dashboards |
| **CI/CD** | **GitHub Actions + ArgoCD** | GitLab CI, Jenkins, Flux | GitOps deployment, automated rollback, drift detection, progressive delivery |

### 6.2 WireGuard VPN Configuration

```
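# Cloud Server (AWS EC2 bastion / VPN endpoint): /etc/wireguard/wg0.conf
# The FORWARD/MASQUERADE rules below also need net.ipv4.ip_forward=1 on this host.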
[Interface]
Address = 10.200.0.1/32
ListenPort = 51820
PrivateKey = <cloud-private-key>
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
# Edge Gateway
PublicKey = <edge-public-key>
AllowedIPs = 10.200.0.2/32, 192.168.29.0/24
PersistentKeepalive = 25

# Edge Gateway (Intel NUC): /etc/wireguard/wg0.conf (separate file on the edge host)
[Interface]
Address = 10.200.0.2/32
PrivateKey = <edge-private-key>

[Peer]
# Cloud Server
PublicKey = <cloud-public-key>
AllowedIPs = 10.200.0.1/32, 10.100.0.0/16  # Cloud tunnel IP + entire AWS VPC
Endpoint = <cloud-public-ip>:51820
PersistentKeepalive = 25
```
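
WireGuard uses each peer's `AllowedIPs` both as a routing table for outbound packets and as a source filter for inbound ones, so a destination outside every listed prefix never enters the tunnel. The sketch below is a small, hedged way to sanity-check a peer's prefix list before deploying it. `routedThroughPeer` is a hypothetical helper built on Go's standard `net/netip` package, and the prefixes are those of the edge peer as configured on the cloud server.

```go
package main

import (
	"fmt"
	"net/netip"
)

// routedThroughPeer reports whether dst falls inside any AllowedIPs prefix
// of a peer, i.e. whether WireGuard would send it to that peer.
func routedThroughPeer(allowedIPs []string, dst string) (bool, error) {
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return false, err
	}
	for _, cidr := range allowedIPs {
		prefix, err := netip.ParsePrefix(cidr)
		if err != nil {
			return false, err
		}
		if prefix.Contains(addr) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// AllowedIPs of the edge peer, as listed on the cloud server.
	edgePeer := []string{"10.200.0.2/32", "192.168.29.0/24"}
	for _, dst := range []string{"192.168.29.200", "10.200.0.2", "192.168.30.1"} {
		ok, err := routedThroughPeer(edgePeer, dst)
		if err != nil {
			fmt.Println("invalid input:", err)
			continue
		}
		fmt.Printf("%-15s via edge peer: %v\n", dst, ok)
	}
}
```

The same check applies in the other direction: any address the edge must reach through the tunnel has to be covered by the cloud peer's `AllowedIPs` in the edge configuration.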

### 6.3 Port Reference Table

| Service | Port | Protocol | Location | Notes |
|---------|------|----------|----------|-------|
| DVR RTSP | 554 | TCP | 192.168.29.200 | Local network only |
| DVR HTTP | 80 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR HTTPS | 443 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR TCP | 25001 | TCP | 192.168.29.200 | Proprietary protocol |
| DVR UDP | 25002 | UDP | 192.168.29.200 | Proprietary protocol |
| DVR NTP | 123 | UDP | 192.168.29.200 | Time sync |
| WireGuard | 51820 | UDP | Cloud + Edge | VPN tunnel |
| Edge Admin | 8080 | TCP | 192.168.29.5 | Local admin UI |
| Edge SSH | 22 | TCP | 192.168.29.5 | Admin access only |
| Traefik HTTP | 8000 | TCP | EKS | Internal HTTP entrypoint |
| Traefik HTTPS | 8443 | TCP | EKS | Internal HTTPS entrypoint |
| ALB HTTPS | 443 | TCP | AWS | Public-facing |
| Backend API | 8080 | TCP | EKS pods | Internal service port |
| Triton HTTP | 8000 | TCP | EKS GPU nodes | Model inference HTTP |
| Triton gRPC | 8001 | TCP | EKS GPU nodes | Model inference gRPC |
| Triton Metrics | 8002 | TCP | EKS GPU nodes | Prometheus metrics |
| PostgreSQL | 5432 | TCP | RDS | VPC-private |
| Redis | 6379 | TCP | ElastiCache | VPC-private |
| Kafka | 9092 | TCP | MSK | VPC-private; plaintext listener (the MSK TLS listener is 9094) |
| MinIO API | 9000 | TCP | EKS + Edge | S3-compatible API |
| MinIO Console | 9001 | TCP | EKS + Edge | Admin console |
| Prometheus | 9090 | TCP | EKS | Metrics collection |
| Grafana | 3000 | TCP | EKS | Dashboards |

---

## 7. Scaling Strategy

### 7.1 Camera Scaling Roadmap

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CAMERA SCALING ROADMAP                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CURRENT: 8 cameras (1 DVR)                                                 │
│  ├─ Edge: Intel NUC i7, 32GB RAM                                          │
│  ├─ Streams: 8 × RTSP @ 960×1080                                          │
│  ├─ Bandwidth: ~16 Mbps upstream (2 Mbps per H.264 stream)                │
│  └─ Cloud AI: 1× T4 GPU (handles 8 streams @ 1 fps)                       │
│                                                                             │
│  PHASE 1: 16 cameras (2 DVRs)                                               │
│  ├─ Edge: Intel NUC i7 (sufficient) or 2× NUC                            │
│  ├─ Add 2nd edge gateway for 2nd DVR site (if different location)        │
│  ├─ Streams: 16 × RTSP                                                    │
│  ├─ Bandwidth: ~32 Mbps                                                   │
│  ├─ Cloud AI: 1× T4 GPU (still sufficient, batch size 8 → 16)            │
│  └─ Kafka: 8 partitions → 16 partitions                                   │
│                                                                             │
│  PHASE 2: 32 cameras (4 DVRs / 4 sites)                                     │
│  ├─ Edge: 4× Intel NUC (one per site)                                     │
│  ├─ VPN: Hub-spoke model (4 edge peers → 1 cloud endpoint)               │
│  ├─ Bandwidth: ~64 Mbps                                                   │
│  ├─ Cloud AI: 2× T4 GPUs (HPA: 2-6 replicas)                             │
│  ├─ Stream Ing.: 6-12 replicas (HPA)                                      │
│  ├─ Kafka: 32 partitions                                                  │
│  └─ PostgreSQL: db.r6g.2xlarge (scale up)                                 │
│                                                                             │
│  PHASE 3: 64 cameras (8 DVRs / 8 sites)                                     │
│  ├─ Edge: 8× Intel NUC (or NVIDIA Jetson Orin for edge AI pre-filter)     │
│  ├─ VPN: WireGuard hub-spoke or mesh (consider Tailscale for simplicity) │
│  ├─ Bandwidth: ~128 Mbps (dedicated internet circuit recommended)        │
│  ├─ Cloud AI: 4× T4 GPUs or 2× A10G (g5.2xlarge)                         │
│  ├─ Stream Ing.: 12-20 replicas                                           │
│  ├─ Kafka: 64 partitions, consider MSK multi-cluster                      │
│  ├─ PostgreSQL: db.r6g.4xlarge + read replica                             │
│  ├─ Redis: 4 shards                                                       │
│  └─ MinIO: Distributed mode, 4+ nodes                                     │
│                                                                             │
│  PHASE 4: 64+ cameras (NVR consolidation)                                   │
│  ├─ Consider NVR-to-edge consolidation (fewer, more powerful recorders)   │
│  ├─ Edge AI pre-filtering (Jetson Orin): only send motion frames         │
│  ├─ Bandwidth reduction: ~50% via smart filtering                         │
│  └─ Multi-region cloud deployment for latency optimization                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
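
The bandwidth figures in the roadmap are a straight multiplication of camera count by the 2 Mbps per-stream bitrate. The sketch below reproduces them and converts a bitrate into storage per hour, which is handy for cross-checking. The helper names are illustrative. Note that 2 Mbps sustained works out to about 0.9 GB per hour, so a recording figure of 1.5 GB per hour per camera would correspond to roughly 3.3 Mbps, a higher bitrate than the upload stream assumed here.

```go
package main

import "fmt"

// upstreamMbps is the aggregate uplink needed for a number of camera streams.
func upstreamMbps(cameras int, perStreamMbps float64) float64 {
	return float64(cameras) * perStreamMbps
}

// gbPerHour converts a sustained bitrate in Mbps to decimal gigabytes per hour.
func gbPerHour(mbps float64) float64 {
	return mbps * 3600 / 8 / 1000
}

func main() {
	const perStream = 2.0 // Mbps per H.264 stream, from the roadmap
	for _, cams := range []int{8, 16, 32, 64} {
		fmt.Printf("%2d cameras: %5.0f Mbps upstream\n", cams, upstreamMbps(cams, perStream))
	}
	fmt.Printf("%.1f Mbps sustained = %.2f GB/hour per camera\n", perStream, gbPerHour(perStream))
}
```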

### 7.2 AI Inference Scaling

| Metric | 8 Cameras | 16 Cameras | 32 Cameras | 64 Cameras |
|--------|-----------|------------|------------|------------|
| Aggregate Frame Rate | 8 fps (1 per cam) | 16 fps | 32 fps | 64 fps |
| GPU Replicas | 1× T4 | 1× T4 | 2× T4 | 4× T4 or 2× A10G |
| Inference Latency (P99) | 80ms | 120ms | 150ms | 200ms |
| Kafka Partitions (raw) | 8 | 16 | 32 | 64 |
| Consumer Groups | 3 | 4 | 6 | 8 |

**Auto-scaling Triggers:**
- GPU utilization > 80% for 2 minutes → scale out
- Kafka consumer lag > 1000 messages for 5 minutes → scale out
- Kafka consumer lag < 100 messages for 10 minutes → scale in (down to the minimum replica count)
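
The triggers above can be sketched as a single decision function. This is an illustration of the rules only: in the cluster the decision would normally be made by the HPA rather than by application code, and the `Sustained` type, the `Decide` function, and the rule that scale-out takes priority over scale-in are assumptions.

```go
package main

import "fmt"

// Thresholds from the auto-scaling triggers listed above.
const (
	gpuHighMinutes = 2  // GPU utilization > 80% sustained this long: scale out
	lagHighMinutes = 5  // consumer lag > 1000 sustained this long: scale out
	lagLowMinutes  = 10 // low backlog sustained this long: scale in
)

// Sustained records for how many consecutive minutes each condition has held.
type Sustained struct {
	GPUAbove80   int
	LagAbove1000 int
	LagBelow100  int
}

// Decide returns "scale-out", "scale-in", or "hold". Scale-out wins when both
// directions qualify, and scale-in never goes below minReplicas.
func Decide(s Sustained, replicas, minReplicas int) string {
	switch {
	case s.GPUAbove80 >= gpuHighMinutes || s.LagAbove1000 >= lagHighMinutes:
		return "scale-out"
	case s.LagBelow100 >= lagLowMinutes && replicas > minReplicas:
		return "scale-in"
	default:
		return "hold"
	}
}

func main() {
	fmt.Println(Decide(Sustained{GPUAbove80: 3}, 2, 2))   // GPU hot for 3 minutes
	fmt.Println(Decide(Sustained{LagBelow100: 12}, 4, 2)) // idle with spare replicas
	fmt.Println(Decide(Sustained{LagBelow100: 12}, 2, 2)) // idle but already at minimum
}
```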

### 7.3 Storage Scaling

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    STORAGE CAPACITY PLANNING                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Per-Camera Storage Profile:                                                │
│  - Continuous recording: ~1.5 GB/hour @ 960×1080 H.264 main profile       │
│  - AI snapshots (1 fps): ~50 MB/hour (JPEG compressed)                    │
│  - Event clips: ~10 MB average per event (30-second clip)                 │
│                                                                             │
│  Total Per Day (8 cameras):                                                 │
│  - Video: 8 × 1.5 GB × 24h = 288 GB/day                                   │
│  - Snapshots: 8 × 50 MB × 24h = 9.6 GB/day                                │
│  - Events (est. 500/day): 500 × 10 MB = 5 GB/day                          │
│  - TOTAL: ~303 GB/day = ~9 TB/month                                       │
│                                                                             │
│  Tiered Storage Strategy:                                                   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  TIER 1: EDGE LOCAL (Hot, 7 days)                                   │   │
│  │  - Capacity: 2TB NVMe per edge gateway                              │   │
│  │  - All 8 streams, full resolution, 10s segments                     │   │
│  │  - Cost: Hardware (CAPEX)                                           │   │
│  │  - Use: Immediate playback, event export                            │   │
│  │                                                                     │   │
│  │  7 days × 303 GB = 2.1 TB, which exceeds 2TB (H.264 video does not  │   │
│  │  compress further): use a larger NVMe or keep 6 days (~1.8 TB)      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  TIER 2: CLOUD MINIO (Warm, 30 days)                                │   │
│  │  - Capacity: 10TB initial, auto-scaling                             │   │
│  │  - Full resolution video segments + event snapshots                 │   │
│  │  - Cost: ~$0.023/GB/month (S3 Standard equivalent)                  │   │
│  │  - Use: Dashboard playback, search, investigation                   │   │
│  │                                                                     │   │
│  │  30 days × 303 GB = 9.1 TB                                          │   │
│  │  Cost: 9,100 GB × $0.023 = ~$209/month                              │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  TIER 3: S3 IA (Cool, 31-90 days)                                   │   │
│  │  - Capacity: Auto (lifecycle transition)                            │   │
│  │  - Cost: ~$0.0125/GB/month                                          │   │
│  │  - Use: Occasional access, compliance review                        │   │
│  │                                                                     │   │
│  │  60 days × 303 GB = 18.2 TB                                         │   │
│  │  Cost: 18,200 GB × $0.0125 = ~$228/month                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  TIER 4: GLACIER DEEP ARCHIVE (Cold, 90+ days)                      │   │
│  │  - Capacity: Unbounded                                              │   │
│  │  - Cost: ~$0.00099/GB/month                                         │   │
│  │  - Retrieval: 12-48 hours (batch)                                   │   │
│  │  - Use: Long-term compliance, legal hold                            │   │
│  │                                                                     │   │
│  │  Annual accumulation: 303 GB × 365 = 110 TB                         │   │
│  │  Cost: 110,000 GB × $0.00099 = ~$109/month                          │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  TOTAL MONTHLY STORAGE COST (8 cameras, steady state):                      │
│  - Tier 2 (warm): $209                                                      │
│  - Tier 3 (cool): $228                                                      │
│  - Tier 4 (cold): $109                                                      │
│  - TOTAL: ~$546/month                                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
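
The capacity plan is simple arithmetic, so it is easy to re-run for other camera counts or prices. The sketch below recomputes the daily ingest, the per-tier monthly cost, and how many days of recording fit on a given edge disk. Function names are illustrative, and holding the 500 events/day estimate constant regardless of camera count is an assumption. Results differ from the table by a dollar or two because the table rounds the daily volume up to 303 GB before multiplying.

```go
package main

import "fmt"

// Per-camera and per-site figures from the storage profile above.
const (
	videoGBPerHour    = 1.5  // continuous recording, per camera
	snapshotGBPerHour = 0.05 // 50 MB/hour of 1 fps JPEG snapshots, per camera
	eventsPerDay      = 500  // site-wide estimate (assumed constant)
	gbPerEvent        = 0.01 // 10 MB average event clip
)

// dailyGB estimates total data written per day for a site.
func dailyGB(cameras int) float64 {
	return float64(cameras)*(videoGBPerHour+snapshotGBPerHour)*24 + eventsPerDay*gbPerEvent
}

// tierCostUSD is the steady-state monthly cost of holding `days` of data in a tier.
func tierCostUSD(days int, gbPerDay, usdPerGBMonth float64) float64 {
	return float64(days) * gbPerDay * usdPerGBMonth
}

// retentionDays is how many days of recording fit in a given capacity.
func retentionDays(capacityGB, gbPerDay float64) float64 {
	return capacityGB / gbPerDay
}

func main() {
	daily := dailyGB(8)
	warm := tierCostUSD(30, daily, 0.023)
	cool := tierCostUSD(60, daily, 0.0125)
	cold := tierCostUSD(365, daily, 0.00099)
	fmt.Printf("daily ingest:        %.1f GB\n", daily)
	fmt.Printf("tier 2 / 3 / 4:      $%.0f / $%.0f / $%.0f per month\n", warm, cool, cold)
	fmt.Printf("total:               $%.0f per month\n", warm+cool+cold)
	fmt.Printf("2 TB edge retention: %.1f days\n", retentionDays(2000, daily))
}
```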

### 7.4 Database Partitioning Strategy

```sql
-- Partition events table by month (range partitioning)
CREATE TABLE events (
    id BIGSERIAL,
    camera_id VARCHAR(32) NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    confidence DECIMAL(4,3),
    bounding_box BOX,
    snapshot_path VARCHAR(512),
    start_time TIMESTAMPTZ NOT NULL,
    end_time TIMESTAMPTZ,
    severity VARCHAR(20),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY RANGE (start_time);

-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... auto-created by cron job

-- Partition pruning ensures queries for specific time ranges
-- only scan relevant partitions

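-- Automated partition creation (pg_partman extension).
-- The argument form below is the pg_partman 4.x one; 5.x changed this
-- signature and expects interval values such as '1 month', so check the
-- installed version first.
SELECT partman.create_parent('public.events', 'start_time', 'native', 'monthly');

-- Partition archival
-- Partitions older than 12 months:
-- 1. Export the rows to S3 (on RDS, for example with the aws_s3 extension)
-- 2. Detach the partition (ALTER TABLE events DETACH PARTITION ...)
-- 3. Drop the detached table (data now lives in cold archive)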
```
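
If partitions are created by a scheduled job instead of pg_partman, the job only has to generate statements of the shape shown above. The following is a minimal sketch of such a generator. `partitionDDL` is a hypothetical helper, and adding `IF NOT EXISTS` so that the job can be re-run safely is a design choice, not something the schema above requires.

```go
package main

import (
	"fmt"
	"time"
)

// partitionDDL returns the CREATE TABLE statement for the monthly partition
// of the events table covering the given year and month (1-12).
func partitionDDL(year, month int) string {
	start := time.Date(year, time.Month(month), 1, 0, 0, 0, 0, time.UTC)
	end := start.AddDate(0, 1, 0) // first day of the following month
	return fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS events_%s PARTITION OF events\n    FOR VALUES FROM ('%s') TO ('%s');",
		start.Format("2006_01"), start.Format("2006-01-02"), end.Format("2006-01-02"))
}

func main() {
	// Create next month's partition ahead of time; December rolls over to January.
	fmt.Println(partitionDDL(2025, 3))
	fmt.Println(partitionDDL(2025, 12))
}
```

Running the job a few weeks ahead of each month boundary avoids inserts failing because no partition matches the incoming `start_time`.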

---

## 8. Failover & Reliability

### 8.1 Service Restart Policies

```yaml
# Kubernetes Deployment - Restart Policy Example
# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-ingestion-service
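spec:
  replicas: 3
  selector:
    matchLabels:
      app: stream-ingestion-service  # label name is illustrative
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero-downtime deployment
  template:
    metadata:
      labels:
        app: stream-ingestion-service  # must match spec.selector
    spec: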
      containers:
        - name: ingestion
          image: surveillance/stream-ingestion:v1.2.3
          resources:
            requests:
              cpu: "1000m"
              memory: "2Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
          livenessProbe:
            grpc:
              port: 8081
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3  # Restart after 30s of failures
          readinessProbe:
            grpc:
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          startupProbe:
            grpc:
              port: 8081
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 30  # 150s max for startup
```

### 8.2 Stream Reconnect Logic

```go
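// Edge Gateway stream reconnect logic (Go sketch).
// connectAndStream and setHealth are supplied by the gateway (RTSP client and
// a Redis writer such as HSET stream:health <cameraID> <state>).
package edge

import (
	"context"
	"log/slog"
	"math/rand"
	"time"
)

var (
	connectAndStream func(ctx context.Context, cameraID, rtspURL string, onConnected func()) error
	setHealth        func(cameraID, state string)
)

// ExponentialBackoff grows the wait by Multiplier after each failure, up to Max.
type ExponentialBackoff struct {
	Initial, Max time.Duration
	Multiplier   float64
	Jitter       float64 // fraction of the wait; 0.1 means +/-10%
	current      time.Duration
}

// Next returns the next wait with jitter applied and advances the backoff.
func (b *ExponentialBackoff) Next() time.Duration {
	if b.current == 0 {
		b.current = b.Initial
	}
	wait := b.current
	b.current = time.Duration(float64(b.current) * b.Multiplier)
	if b.current > b.Max {
		b.current = b.Max
	}
	return wait + time.Duration((rand.Float64()*2-1)*b.Jitter*float64(wait))
}

// Reset returns the backoff to its initial wait.
func (b *ExponentialBackoff) Reset() { b.current = 0 }

func maintainStream(ctx context.Context, cameraID, rtspURL string) {
	backoff := &ExponentialBackoff{
		Initial:    1 * time.Second,
		Max:        60 * time.Second,
		Multiplier: 2.0,
		Jitter:     0.1,
	}
	// Reset the backoff when a session is established, not when it ends.
	onConnected := func() {
		backoff.Reset()
		setHealth(cameraID, "connected")
	}

	for {
		// Blocks until the stream drops or ctx is cancelled.
		err := connectAndStream(ctx, cameraID, rtspURL, onConnected)
		if ctx.Err() != nil {
			return // gateway is shutting down
		}
		slog.Error("stream disconnected", "camera", cameraID, "error", err)
		setHealth(cameraID, "disconnected")

		wait := backoff.Next()
		slog.Info("reconnecting", "camera", cameraID, "wait", wait)
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return
		}
	}
}

// Circuit breaker pattern for cloud connection
type State int // Closed, Open, HalfOpen

type CircuitBreaker struct {
	state            State
	failureCount     int
	failureThreshold int           // 5 failures
	timeout          time.Duration // 60s open state
	lastFailureTime  time.Time
}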
```
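
The `CircuitBreaker` struct above only lists the fields. One possible set of transitions is sketched below, under the stated parameters (open after 5 failures, stay open for 60 seconds). The method names `Allow`, `RecordSuccess`, and `RecordFailure` are illustrative, and the sketch is not goroutine-safe; a real gateway would guard the state with a mutex.

```go
package main

import (
	"fmt"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

type CircuitBreaker struct {
	state            State
	failureCount     int
	failureThreshold int
	timeout          time.Duration
	lastFailureTime  time.Time
}

// NewCircuitBreaker uses the values given in the struct comments above.
func NewCircuitBreaker() *CircuitBreaker {
	return &CircuitBreaker{failureThreshold: 5, timeout: 60 * time.Second}
}

// Allow reports whether a call to the cloud may proceed at time now.
func (cb *CircuitBreaker) Allow(now time.Time) bool {
	if cb.state == Open {
		if now.Sub(cb.lastFailureTime) < cb.timeout {
			return false
		}
		cb.state = HalfOpen // timeout elapsed: allow probe calls
	}
	return true
}

// RecordSuccess closes the breaker and clears the failure count.
func (cb *CircuitBreaker) RecordSuccess() {
	cb.state = Closed
	cb.failureCount = 0
}

// RecordFailure opens the breaker at the threshold, or immediately when a
// half-open probe fails.
func (cb *CircuitBreaker) RecordFailure(now time.Time) {
	cb.failureCount++
	cb.lastFailureTime = now
	if cb.state == HalfOpen || cb.failureCount >= cb.failureThreshold {
		cb.state = Open
	}
}

// runScenario drives the breaker through open, half-open, and closed using
// synthetic timestamps and reports what was observed at each stage.
func runScenario() (blockedWhileOpen, allowedAfterTimeout, closedAfterSuccess bool) {
	cb := NewCircuitBreaker()
	t0 := time.Date(2025, 1, 20, 12, 0, 0, 0, time.UTC)
	for i := 0; i < 5; i++ {
		cb.RecordFailure(t0)
	}
	blockedWhileOpen = !cb.Allow(t0.Add(30 * time.Second))
	allowedAfterTimeout = cb.Allow(t0.Add(61 * time.Second))
	cb.RecordSuccess()
	closedAfterSuccess = cb.state == Closed
	return
}

func main() {
	blocked, probed, closed := runScenario()
	fmt.Println("blocked while open:", blocked)
	fmt.Println("probe allowed after timeout:", probed)
	fmt.Println("closed after success:", closed)
}
```

While the breaker is open, the edge gateway can keep buffering locally instead of spending its retry budget on a cloud endpoint that is known to be unreachable.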

### 8.3 VPN Tunnel Recovery

```bash
#!/bin/bash
# /usr/local/bin/wireguard-watchdog.sh
# Intended to run every 30 seconds. Cron cannot schedule below one minute,
# so use a systemd timer (or a cron entry that runs the check twice).
# EDGE_TOKEN must be set in the environment of whatever runs this script.

CLOUD_ENDPOINT="10.200.0.1"
TUNNEL_INTERFACE="wg0"
LOG_FILE="/var/log/wg-watchdog.log"

# Succeeds if the cloud endpoint answers at least one of three pings.
tunnel_healthy() {
    ping -c 3 -W 5 -I "$TUNNEL_INTERFACE" "$CLOUD_ENDPOINT" > /dev/null 2>&1
}

if ! tunnel_healthy; then
    echo "$(date): VPN tunnel unhealthy, restarting..." >> "$LOG_FILE"

    # 1. Restart WireGuard interface
    wg-quick down "$TUNNEL_INTERFACE"
    sleep 2
    wg-quick up "$TUNNEL_INTERFACE"

    # 2. Verify recovery
    sleep 5
    if tunnel_healthy; then
        echo "$(date): VPN tunnel recovered" >> "$LOG_FILE"
        # Notify cloud of recovery
        curl -sS -X POST "http://$CLOUD_ENDPOINT:8080/api/v1/system/edge-recovery" \
            -H "Authorization: Bearer $EDGE_TOKEN" \
            -H "Content-Type: application/json" \
            -d "{\"edge_id\": \"$HOSTNAME\", \"status\": \"recovered\"}"
    else
        echo "$(date): VPN tunnel recovery FAILED" >> "$LOG_FILE"
        # Escalate: local alert (buzzer/email if available)
    fi
fi
```

### 8.4 Queue Recovery & Durability

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    KAFKA DURABILITY CONFIGURATION                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Producer Configuration (Stream Ingestion Service):                         │
│  ──────────────────────────────────────────────────                         │
│  - acks=all              # Wait for all replicas                            │
│  - retries=10            # Aggressive retry                                 │
│  - retry.backoff.ms=1000 # 1 second between retries                       │
│  - enable.idempotence=true  # No duplicates on retry                      │
│  - max.in.flight.requests.per.connection=1  # Preserve ordering on retry  │
│  - compression.type=lz4  # Efficient compression                          │
│                                                                             │
│  Topic Configuration:                                                       │
│  ────────────────────                                                       │
│  - replication.factor=3     # 3 copies across AZs                          │
│  - min.insync.replicas=2    # Require 2 acks for producer commit           │
│  - retention.ms=604800000   # 7 days for raw streams                       │
│  - retention.ms=2592000000  # 30 days for detections                       │
│  - unclean.leader.election.enable=false  # Never lose committed data       │
│                                                                             │
│  Consumer Configuration (AI Inference Service):                             │
│  ──────────────────────────────────────────────                             │
│  - enable.auto.commit=false  # Manual offset management                     │
│  - auto.offset.reset=earliest  # Replay from beginning on new group        │
│  - max.poll.records=100      # Process in batches                           │
│  - isolation.level=read_committed  # Only read committed transactions       │
│                                                                             │
│  Offset Commit Strategy:                                                    │
│  ───────────────────────                                                    │
│  1. Pull batch from Kafka                                                   │
│  2. Process (run inference)                                                 │
│  3. Write results to PostgreSQL (transaction)                               │
│  4. Commit Kafka offset ONLY after DB write succeeds                        │
│  5. If any step fails: don't commit, reprocess on next poll                 │
│                                                                             │
│  Dead Letter Queue:                                                         │
│  ──────────────────                                                         │
│  - Topic: streams.raw.dlq                                                   │
│  - After 5 processing failures, message moved to DLQ                        │
│  - DLQ consumer: alerts admin, manual inspection                            │
│  - Retention: 30 days                                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
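
The offset commit strategy above gives at-least-once delivery: an offset moves forward only once the database write has succeeded. The sketch below illustrates that control flow with in-memory stand-ins. `InMemoryLog`, `Message`, `process_batch`, and the `infer`/`db_write` callables are hypothetical names for illustration, not the platform's code, and a real consumer would use a Kafka client library instead.

```python
from dataclasses import dataclass

MAX_FAILURES = 5  # "After 5 processing failures, message moved to DLQ"


@dataclass
class Message:
    offset: int
    payload: str


@dataclass
class InMemoryLog:
    """Stand-in for one Kafka partition; a message's offset is its list index."""
    messages: list
    committed: int = 0  # offset of the next message to read

    def poll(self, max_records):
        return self.messages[self.committed:self.committed + max_records]

    def commit(self, next_offset):
        self.committed = next_offset


def process_batch(log, infer, db_write, dlq, failures, max_records=100):
    """Run one poll cycle and return the number of results written."""
    batch = log.poll(max_records)                  # step 1: pull batch
    if not batch:
        return 0
    results = []
    for msg in batch:
        try:
            results.append((msg.offset, infer(msg.payload)))  # step 2
        except Exception:
            failures[msg.offset] = failures.get(msg.offset, 0) + 1
            if failures[msg.offset] < MAX_FAILURES:
                return 0          # step 5: no commit, batch is polled again
            dlq.append(msg)       # poison message: park it and carry on
    try:
        db_write(results)         # step 3: one database transaction
    except Exception:
        return 0                  # step 5: database down, offset unchanged
    log.commit(batch[-1].offset + 1)   # step 4: commit only after the write
    return len(results)
```

Because a failed batch is processed again, the same frame can reach the database twice. Under this assumption `db_write` should be idempotent, for example an upsert keyed on topic, partition, and offset.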

### 8.5 Graceful Degradation

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    GRACEFUL DEGRADATION MATRIX                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  FAILURE MODE                    │  DEGRADATION STRATEGY            │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  AI Inference Service DOWN       │  Continue recording ALL video    │   │
│  │  (GPU failure, model crash)      │  - Events stored as "unprocessed"│   │
│  │                                  │  - No real-time alerts           │   │
│  │                                  │  - Queue frames for later batch  │   │
│  │                                  │    processing when AI recovers   │   │
│  │                                  │  - Dashboard shows "AI OFFLINE"  │   │
│  │                                  │    banner                        │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  Kafka DOWN                      │  Edge Gateway buffers locally    │   │
│  │  (MSK outage)                    │  - Local MinIO ring buffer       │   │
│  │                                  │  - Backpressure: reduce to       │   │
│  │                                  │    key frames only (0.2 fps)     │   │
│  │                                  │  - Auto-reconnect with 2x        │   │
│  │                                  │    exponential backoff           │   │
│  │                                  │  - Replay from local buffer      │   │
│  │                                  │    when Kafka recovers           │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  VPN Tunnel DOWN                 │  Full local operation mode       │   │
│  │  (internet outage)               │  - All recording continues       │   │
│  │                                  │    locally (7-day buffer)        │   │
│  │                                  │  - Local alert buzzer/relay      │   │
│  │                                  │    (configurable)                │   │
│  │                                  │  - No cloud dashboard access     │   │
│  │                                  │  - Auto-sync when VPN recovers   │   │
│  │                                  │  - Queue cloud events for        │   │
│  │                                  │    later replay                  │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  PostgreSQL DOWN                 │  Alert queue builds in Kafka     │   │
│  │  (RDS outage)                    │  - Events not lost (Kafka dur.)  │   │
│  │                                  │  - Read-only dashboard mode      │   │
│  │                                  │  - Cached data from Redis        │   │
│  │                                  │  - Alert on-call engineer        │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  Notification Service DOWN       │  Alerts accumulate in DB         │   │
│  │                                  │  - Retry with exponential backoff│   │
│  │                                  │  - Dead letter after 24 hours    │   │
│  │                                  │  - Dashboard shows pending count │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │  Edge Gateway DOWN               │  Cloud dashboard shows           │   │
│  │  (power/hardware failure)        │  "SITE OFFLINE"                  │   │
│  │                                  │  - Last known recordings in      │   │
│  │                                  │    cloud (up to disconnect)      │   │
│  │                                  │  - Alert sent immediately        │   │
│  │                                  │  - UPS on edge: graceful         │   │
│  │                                  │    shutdown, preserve data       │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Priority Order (highest first):                                            │
│  1. Video recording NEVER STOPS (local edge priority)                       │
│  2. Critical alerts ALWAYS FIRE (local buzzer + queued cloud alerts)        │
│  3. AI inference gracefully degrades to batch catch-up                      │
│  4. Dashboard operates in read-only/cache mode during DB outage             │
│  5. Cloud sync resumes automatically when connectivity restored             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
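
The matrix above calls for auto-reconnect with 2x exponential backoff when Kafka is unreachable. A minimal sketch of that retry schedule follows. The 1-second base delay, the 60-second cap, and the `connect`/`sleep` callables are assumptions made for illustration; the document does not specify them.

```python
def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield reconnect delays in seconds: base, base*factor, ... capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def reconnect(connect, sleep, max_attempts=10, **backoff_kwargs):
    """Call connect() until it succeeds; return the number of attempts used.

    `connect` is expected to raise ConnectionError on failure. `sleep` is
    injected (time.sleep in production) so the schedule can be tested
    without waiting.
    """
    delays = backoff_delays(**backoff_kwargs)
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(next(delays))
```

Capping the delay keeps the edge gateway from waiting many minutes after a long outage. Adding random jitter is a common refinement when many sites reconnect at once.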

### 8.6 Health Check Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HEALTH CHECK ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LAYER 1: KUBERNETES PROBES                                                │
│  ──────────────────────────                                                 │
│  - Liveness Probe:  /health/live   → Restart container if failing          │
│  - Readiness Probe: /health/ready  → Remove from service if failing        │
│  - Startup Probe:   /health/startup→ Allow long initialization             │
│                                                                             │
│  LAYER 2: SERVICE-LEVEL HEALTH (Prometheus metrics)                        │
│  ──────────────────────────────────────────────────                         │
│  Each service exposes:                                                      │
│  - app_health_status{service="X"}  0=healthy, 1=degraded, 2=critical      │
│  - app_health_details{check="db"}  last check timestamp + result           │
│                                                                             │
│  LAYER 3: DEPENDENCY HEALTH CHECKS                                          │
│  ────────────────────────────────                                           │
│  Backend Service checks:                                                    │
│  ├─ PostgreSQL: SELECT 1; (timeout 2s)                                     │
│  ├─ Redis: PING (timeout 1s)                                               │
│  ├─ Kafka: ListTopics (timeout 3s)                                         │
│  ├─ MinIO: ListBuckets (timeout 3s)                                        │
│  └─ Triton: ModelReady API (timeout 5s)                                    │
│                                                                             │
│  LAYER 4: END-TO-END HEALTH                                                │
│  ──────────────────────────                                                 │
│  Synthetic probe:                                                           │
│  1. Upload test image to stream ingestion                                   │
│  2. Verify AI detection result appears in Kafka                             │
│  3. Verify event written to PostgreSQL                                      │
│  4. Verify alert queryable via API                                          │
│  5. Verify WebSocket push received                                          │
│  Run: Every 60 seconds from monitoring namespace                            │
│                                                                             │
│  LAYER 5: EDGE HEALTH HEARTBEAT                                            │
│  ────────────────────────────────                                           │
│  - Edge Gateway sends heartbeat every 30 seconds                            │
│  - Payload: {edge_id, timestamp, stream_count, disk_free, mem_usage}       │
│  - Missed 3 heartbeats (90s) → "EDGE OFFLINE" alert                       │
│  - Recovers → "EDGE ONLINE" notification                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
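
The Layer 5 rule (a heartbeat every 30 seconds, offline after 3 missed) can be sketched as a small state tracker. `HeartbeatMonitor` and its method names are hypothetical. Timestamps are passed in explicitly so the logic can be exercised without a clock.

```python
HEARTBEAT_INTERVAL_S = 30      # "sends heartbeat every 30 seconds"
MISSED_BEFORE_OFFLINE = 3      # "Missed 3 heartbeats (90s)"


class HeartbeatMonitor:
    """Tracks the last heartbeat per edge gateway and flags state changes."""

    def __init__(self, interval_s=HEARTBEAT_INTERVAL_S, missed=MISSED_BEFORE_OFFLINE):
        self.timeout_s = interval_s * missed
        self.last_seen = {}    # edge_id -> timestamp of last heartbeat
        self.offline = set()   # edge_ids currently considered offline

    def record(self, edge_id, now):
        """Register a heartbeat; return "EDGE ONLINE" if the edge had been offline."""
        self.last_seen[edge_id] = now
        if edge_id in self.offline:
            self.offline.discard(edge_id)
            return "EDGE ONLINE"
        return None

    def sweep(self, now):
        """Return edge_ids that crossed the offline threshold since the last sweep."""
        newly_offline = [
            edge_id for edge_id, seen in self.last_seen.items()
            if now - seen >= self.timeout_s and edge_id not in self.offline
        ]
        self.offline.update(newly_offline)
        return newly_offline
```

Each edge is reported once per outage: `sweep` returns an edge only on the transition to offline, so a periodic sweep does not re-send the "EDGE OFFLINE" alert.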

---

## 9. Security Architecture

### 9.1 Defense in Depth

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    DEFENSE IN DEPTH LAYERS                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LAYER 1: PERIMETER                                                         │
│  ──────────────                                                             │
│  - AWS WAF v2: SQL injection, XSS, rate limiting rules                     │
│  - Geo-restriction: Allow only specific countries                          │
│  - AWS Shield Standard (DDoS protection)                                   │
│  - ALB access logs → S3 → Athena for analysis                              │
│                                                                             │
│  LAYER 2: TRANSPORT                                                         │
│  ──────────────                                                             │
│  - TLS 1.3 for all external HTTPS connections                              │
│  - WireGuard ChaCha20-Poly1305 for VPN tunnel                              │
│  - mTLS (mutual TLS) for internal service-to-service communication         │
│  - Certificate rotation: Let's Encrypt auto (90-day)                       │
│                                                                             │
│  LAYER 3: AUTHENTICATION & AUTHORIZATION                                    │
│  ──────────────────────────────────────────                                 │
│  - JWT with RS256 (asymmetric signing)                                     │
│  - Access token: 15 minutes                                                │
│  - Refresh token: 7 days (stored in httpOnly cookie)                       │
│  - RBAC: admin, operator, viewer roles                                     │
│  - API keys for edge gateway authentication                                │
│  - Multi-factor authentication for admin role                              │
│                                                                             │
│  LAYER 4: APPLICATION SECURITY                                              │
│  ────────────────────────────                                               │
│  - Input validation: strict JSON schemas                                   │
│  - SQL injection: parameterized queries only (pgx)                         │
│  - XSS prevention: Content Security Policy headers                         │
│  - CSRF tokens for state-changing operations                               │
│  - File upload: virus scanning, size limits, type validation               │
│                                                                             │
│  LAYER 5: DATA SECURITY                                                     │
│  ────────────────────                                                       │
│  - RDS: Encryption at rest (AES-256, AWS KMS CMK)                          │
│  - RDS: Encryption in transit (TLS 1.2+)                                   │
│  - S3: Default encryption (SSE-S3 or SSE-KMS)                              │
│  - Redis: TLS in transit, no AUTH token exposure                           │
│  - Face embeddings: stored as vectors, not raw images (privacy)            │
│  - Backup encryption: separate KMS key for backups                         │
│                                                                             │
│  LAYER 6: NETWORK SEGMENTATION                                              │
│  ───────────────────────────                                                │
│  - VPC private subnets for all workloads                                   │
│  - Security groups: least privilege, explicit allow only                   │
│  - Network Policies: namespace-level isolation in K8s                      │
│  - DVR: NO public IP, NO internet gateway, local network only              │
│  - VPN: Single controlled entry point                                      │
│                                                                             │
│  LAYER 7: AUDIT & MONITORING                                                │
│  ─────────────────────────                                                  │
│  - All API calls logged with user, IP, timestamp, resource                 │
│  - PostgreSQL audit_log table (append-only)                                │
│  - CloudTrail for AWS API calls                                            │
│  - VPC Flow Logs for network analysis                                      │
│  - Alert on abnormal patterns (unusual login times, geo anomalies)         │
│  - Log retention: 1 year in S3 Glacier                                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
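
As an illustration of the Layer 3 RBAC roles, the sketch below treats `viewer`, `operator`, and `admin` as a strict hierarchy and denies anything not explicitly listed. The hierarchy itself and the action names in `REQUIRED_ROLE` are assumptions for this example; the document names the roles but not their permissions.

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles from Layer 3, ordered so a higher value implies more privilege."""
    VIEWER = 1
    OPERATOR = 2
    ADMIN = 3


# Hypothetical mapping of action -> minimum role; not taken from the document.
REQUIRED_ROLE = {
    "events:read": Role.VIEWER,
    "alerts:acknowledge": Role.OPERATOR,
    "users:manage": Role.ADMIN,
}


def is_allowed(role, action):
    """Deny by default: actions missing from the table are rejected for every role."""
    required = REQUIRED_ROLE.get(action)
    return required is not None and role >= required
```

The deny-by-default lookup mirrors the "explicit allow only" stance the document takes for security groups.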

### 9.2 Secret Management

```yaml
# Kubernetes External Secrets (AWS Secrets Manager integration)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: surveillance
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: db-credentials
    creationPolicy: Owner
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: surveillance/production/db
        property: password
    - secretKey: DB_USER
      remoteRef:
        key: surveillance/production/db
        property: username
```

---

## 10. Monitoring & Observability

### 10.1 Monitoring Stack

| Component | Technology | Purpose |
|-----------|-----------|---------|
| Metrics | Prometheus + Thanos | Time-series collection, long-term storage |
| Visualization | Grafana | Dashboards for all services |
| Logs | Loki + Promtail | Log aggregation, indexed by labels |
| Traces | Jaeger | Distributed request tracing |
| Alerts | Alertmanager + PagerDuty | Multi-channel alerting |
| Uptime | UptimeRobot (external) | External endpoint monitoring |

### 10.2 Key Metrics

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    KEY METRICS DASHBOARD                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  STREAM HEALTH                                                              │
│  ────────────                                                               │
│  - stream_active{camera_id}           Gauge: 0/1                           │
│  - stream_fps{camera_id}              Gauge: actual FPS                    │
│  - stream_bitrate{camera_id}          Gauge: kbps                          │
│  - stream_reconnect_total{camera_id}  Counter: reconnect events            │
│  - stream_latency_seconds{camera_id}  Histogram: end-to-end latency        │
│                                                                             │
│  AI INFERENCE                                                               │
│  ────────────                                                               │
│  - ai_inference_duration_seconds      Histogram: per-model latency         │
│  - ai_detection_total{model,class}    Counter: detections by class         │
│  - ai_gpu_utilization_percent         Gauge: GPU usage                     │
│  - ai_gpu_memory_used_bytes           Gauge: VRAM usage                    │
│  - ai_batch_size_current              Gauge: current batch size            │
│  - ai_queue_depth                     Gauge: pending inference requests    │
│                                                                             │
│  EVENTS & ALERTS                                                            │
│  ───────────────                                                            │
│  - events_total{type,severity}        Counter: events processed            │
│  - alerts_active{severity}            Gauge: unacknowledged alerts         │
│  - alert_ack_duration_seconds         Histogram: time to acknowledge       │
│  - false_positive_rate                Gauge: FP ratio (training feedback)  │
│                                                                             │
│  SYSTEM                                                                     │
│  ──────                                                                     │
│  - edge_disk_free_bytes               Gauge: local storage remaining       │
│  - edge_memory_usage_percent          Gauge: RAM usage                     │
│  - vpn_latency_ms                     Gauge: tunnel round-trip time        │
│  - kafka_consumer_lag{topic,group}    Gauge: message backlog               │
│  - db_connection_pool_active          Gauge: DB connections in use         │
│  - api_request_duration_seconds       Histogram: API response time         │
│  - api_requests_total{status,path}    Counter: HTTP status distribution    │
│                                                                             │
│  BUSINESS                                                                   │
│  ─────────                                                                  │
│  - cameras_online_total               Gauge: healthy camera count          │
│  - daily_events_total                 Counter: events per day              │
│  - alert_response_time_avg            Gauge: avg ack time (SLA: <5min)     │
│  - storage_cost_daily_usd             Gauge: estimated daily cost          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### 10.3 Alerting Rules

```yaml
# Prometheus alerting rules
# alerts.yml
groups:
  - name: surveillance-critical
    rules:
      - alert: CameraStreamDown
        expr: stream_active == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camera {{ $labels.camera_id }} stream is down"
          
      - alert: EdgeGatewayOffline
        # 3 missed 30-second heartbeats = 90s, matching the Layer 5 rule in Section 8.6
        expr: time() - vpn_last_heartbeat > 90
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Edge gateway {{ $labels.edge_id }} is offline"
          
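      - alert: AIInferenceHighLatency
        # histogram_quantile needs per-bucket rates; the metric is in seconds, so 0.5 = 500 ms
        expr: histogram_quantile(0.99, sum by (le) (rate(ai_inference_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI inference P99 latency is {{ $value }}s"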
          
      - alert: DiskSpaceLow
        expr: edge_disk_free_bytes / edge_disk_total_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Edge disk usage is above 90%"
          
      - alert: UnacknowledgedCriticalAlerts
        expr: alerts_active{severity="critical"} > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} critical alerts unacknowledged for >15 minutes"
```
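
The latency rule above depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate for classic histograms in plain Python so the behaviour is easier to reason about. It is an illustration, not Prometheus code, and the bucket boundaries in the usage note are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound, contain at least one finite
    bucket, and end with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

With sample buckets `[(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]`, rank 99 falls halfway through the 0.5 to 1.0 bucket, so the estimated P99 is 0.75 seconds. The estimate can only be as precise as the bucket boundaries, which is worth keeping in mind when choosing buckets around the alert threshold.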

---

## 11. Cost Estimation

### 11.1 Monthly Cost Breakdown (8 cameras)

| Service | Instance/Type | Monthly Cost (USD) |
|---------|--------------|-------------------|
| EKS Control Plane | Managed | $73 |
| EKS Worker Nodes (on-demand) | 3× t3.large (API, services) | $200 |
| EKS GPU Nodes | 1× g4dn.xlarge (spot when possible) | $350 |
| RDS PostgreSQL | db.r6g.xlarge Multi-AZ | $520 |
| ElastiCache Redis | cache.r6g.large (2 shards) | $260 |
| MSK Kafka | 3× kafka.m5.large | $350 |
| ALB + Data Transfer | ~500 GB/month | $50 |
| S3 Storage | ~10 TB (tiered) | $200 |
| CloudFront CDN | ~200 GB/month | $30 |
| EC2 VPN Endpoint | t3.micro | $15 |
| Edge Hardware | Intel NUC (amortized 3yr) | ~$40 |
| Internet (site) | Business broadband | $50 |
| **TOTAL** | | **~$2,138/month** |
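
The total can be checked by summing the line items. The short sketch below does that and derives an average cost per camera. The per-camera figure is a plain division by 8 and is an assumption-free average, not a scaling projection, since most of these services do not grow linearly with camera count.

```python
# Monthly line items in USD, copied from the table above.
MONTHLY_COSTS_USD = {
    "EKS Control Plane": 73,
    "EKS Worker Nodes": 200,
    "EKS GPU Nodes": 350,
    "RDS PostgreSQL": 520,
    "ElastiCache Redis": 260,
    "MSK Kafka": 350,
    "ALB + Data Transfer": 50,
    "S3 Storage": 200,
    "CloudFront CDN": 30,
    "EC2 VPN Endpoint": 15,
    "Edge Hardware": 40,
    "Internet (site)": 50,
}
CAMERA_COUNT = 8

total_usd = sum(MONTHLY_COSTS_USD.values())
per_camera_usd = total_usd / CAMERA_COUNT

print(f"Total: ${total_usd:,}/month, average ${per_camera_usd:.2f} per camera")
```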

### 11.2 Cost Optimization Strategies

1. **Spot Instances**: GPU nodes and batch processing on spot (up to ~70% savings, varies with spot pricing)
2. **Reserved Instances**: RDS and ElastiCache 1-year reserved (roughly 40% savings)
3. **S3 Lifecycle**: Automatic tiering to IA and Glacier
4. **Right-sizing**: Monitor actual usage, adjust requests/limits
5. **Edge AI**: Pre-filter on Jetson Orin to reduce cloud bandwidth (future)

---

## 12. Implementation Phases

### Phase 1: Foundation (Weeks 1-4)
- [ ] Set up AWS VPC, EKS cluster
- [ ] Deploy WireGuard VPN (cloud endpoint + edge gateway)
- [ ] Deploy PostgreSQL, Redis, Kafka, MinIO
- [ ] Build and deploy Edge Gateway Agent
- [ ] Verify RTSP stream capture from all 8 channels
- [ ] Basic stream ingestion to Kafka

### Phase 2: Core AI (Weeks 5-8)
- [ ] Deploy NVIDIA Triton with YOLOv8 detection model
- [ ] Build AI Inference Service (Kafka consumer)
- [ ] Implement person/vehicle detection pipeline
- [ ] Build Suspicious Activity Service (night mode)
- [ ] Face detection + embedding extraction
- [ ] Alert generation and storage

### Phase 3: Application (Weeks 9-12)
- [ ] Build Backend API Service
- [ ] Build Web Frontend (Next.js dashboard)
- [ ] Implement live video playback (HLS)
- [ ] Event timeline and search
- [ ] Alert management UI
- [ ] Face search by similarity

### Phase 4: Operations (Weeks 13-16)
- [ ] Notification Service (email, SMS, push)
- [ ] Training Service + model retraining pipeline
- [ ] Monitoring stack (Prometheus, Grafana, Loki)
- [ ] Security hardening and penetration testing
- [ ] Performance optimization and load testing
- [ ] Documentation and operator training

---

## 13. Appendices

### Appendix A: DVR RTSP URL Format

```
# CP PLUS ORANGE Series RTSP URL format
rtsp://<username>:<password>@<dvr_ip>:554/user=<username>&password=<password>&channel=<1-8>&stream=<0|1>.sdp?

# stream=0: Main stream (higher quality)
# stream=1: Sub stream (lower quality)

# Example:
rtsp://admin:password@192.168.29.200:554/user=admin&password=password&channel=1&stream=0.sdp?

# FFmpeg test command (-strftime 1 is required to expand %Y%m%d_%H%M%S in segment names):
ffmpeg -i "rtsp://192.168.29.200:554/user=admin&password=&channel=1&stream=0.sdp?" \
       -c copy -f segment -segment_time 10 -reset_timestamps 1 -strftime 1 \
       /recordings/ch1/%Y%m%d_%H%M%S.mkv
```
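
A small helper can build these URLs per channel and reject out-of-range values. This is a sketch that follows the format shown above. `build_rtsp_url` is a hypothetical name, and percent-encoding only the `user:password@` part while leaving the path untouched is an assumption, because the DVR's handling of encoded characters in the path is not documented here.

```python
from urllib.parse import quote


def build_rtsp_url(host, username, password, channel, stream=0, port=554):
    """Return the RTSP URL for one DVR channel in the format shown above.

    channel: 1-8, stream: 0 (main stream) or 1 (sub stream).
    """
    if not 1 <= channel <= 8:
        raise ValueError("channel must be between 1 and 8")
    if stream not in (0, 1):
        raise ValueError("stream must be 0 (main) or 1 (sub)")
    # Encode credentials in the userinfo part so ':' or '@' cannot break the URL.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return (
        f"rtsp://{userinfo}@{host}:{port}"
        f"/user={username}&password={password}"
        f"&channel={channel}&stream={stream}.sdp?"
    )
```

For example, `build_rtsp_url("192.168.29.200", "admin", "password", 1)` reproduces the example URL above.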

### Appendix B: WireGuard Full Configuration

```bash
# === CLOUD SERVER (AWS EC2) ===
# /etc/wireguard/wg0-cloud.conf

[Interface]
PrivateKey = <cloud-private-key>
Address = 10.200.0.1/24
ListenPort = 51820
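# wg-quick does not support "\" line continuation, so each command gets its own line.
# %i expands to the interface name. DROP rules come before ACCEPT so they take effect.
PostUp = iptables -A FORWARD -p tcp --dport 5432 -j DROP
PostUp = iptables -A FORWARD -p tcp --dport 6379 -j DROP
PostUp = iptables -A FORWARD -i %i -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -p tcp --dport 5432 -j DROP
PostDown = iptables -D FORWARD -p tcp --dport 6379 -j DROP
PostDown = iptables -D FORWARD -i %i -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE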
DNS = 10.100.0.2

[Peer]
# Edge Gateway - Site 1
PublicKey = <edge1-public-key>
PresharedKey = <preshared-key-1>
AllowedIPs = 10.200.0.2/32, 192.168.29.0/24
PersistentKeepalive = 25

# === EDGE GATEWAY (Intel NUC) ===
# /etc/wireguard/wg0-edge.conf

[Interface]
PrivateKey = <edge-private-key>
Address = 10.200.0.2/32
DNS = 10.100.0.2
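# One command per line (wg-quick has no "\" continuation); %i expands to the interface name.
# Forwarding to the DVR LAN also requires net.ipv4.ip_forward=1 on the gateway.
PostUp = iptables -A FORWARD -i %i -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i %i -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE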

[Peer]
# Cloud Server
PublicKey = <cloud-public-key>
PresharedKey = <preshared-key-1>
AllowedIPs = 10.100.0.0/16, 10.200.0.0/24
Endpoint = <cloud-public-ip>:51820
PersistentKeepalive = 25
```

### Appendix C: FFmpeg Stream Processing Command

```bash
#!/bin/bash
# Edge Gateway stream processing pipeline

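# Fail fast with a usage message if either argument is missing
CAMERA_ID="${1:?usage: $0 <camera_id> <channel>}"
CHANNEL="${2:?usage: $0 <camera_id> <channel>}"
DVR_IP="192.168.29.200"
DVR_USER="admin"
DVR_PASS="${DVR_PASS:-}"   # taken from the environment if set; defaults to empty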

RTSP_URL="rtsp://${DVR_IP}:554/user=${DVR_USER}&password=${DVR_PASS}&channel=${CHANNEL}&stream=0.sdp?"
RECORDING_DIR="/var/recordings/${CAMERA_ID}"
AI_PIPE="/tmp/ai_pipe_${CAMERA_ID}"

mkdir -p "$RECORDING_DIR"
mkfifo "$AI_PIPE" 2>/dev/null

# Pipeline 1: Recording (10s segments)
ffmpeg -hide_banner -loglevel warning \
    -rtsp_transport tcp \
    -i "$RTSP_URL" \
    -c copy -f segment \
    -segment_time 10 \
    -segment_format mp4 \
    -reset_timestamps 1 \
    -strftime 1 \
    "${RECORDING_DIR}/%Y%m%d_%H%M%S.mp4" \
    2>> /var/log/ffmpeg-${CAMERA_ID}.log &

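# Pipeline 2: AI frame extraction (1 fps)
# -y: the FIFO already exists, and ffmpeg refuses to write to an existing path without it
# -nostdin: a backgrounded ffmpeg must not try to read from the terminal
ffmpeg -hide_banner -loglevel warning -nostdin -y \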
    -rtsp_transport tcp \
    -i "$RTSP_URL" \
    -vf "fps=1,scale=640:640" \
    -f image2pipe \
    -vcodec mjpeg \
    -q:v 5 \
    "$AI_PIPE" \
    2>> /var/log/ffmpeg-ai-${CAMERA_ID}.log &

# Pipeline 3: Frame batching and gRPC send to cloud
frame-batcher \
    --input "$AI_PIPE" \
    --camera-id "$CAMERA_ID" \
    --batch-size 8 \
    --cloud-endpoint "10.200.0.1:8081" \
    --vpn-interface wg0 \
    >> /var/log/batcher-${CAMERA_ID}.log 2>&1 &
```
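
The `frame-batcher` step reads concatenated JPEG images from the named pipe and groups them before sending. Its implementation is not shown in this document, so the following is only a sketch of the splitting and batching logic. The function names are hypothetical and the gRPC send is left out. It assumes each frame is a plain JPEG without an embedded thumbnail, so the first end-of-image marker after a start-of-image marker closes the frame.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def iter_jpeg_frames(stream, chunk_size=65536):
    """Yield complete JPEG images from a concatenated MJPEG byte stream."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return  # end of stream; an incomplete trailing frame is dropped
        buf += chunk
        while True:
            start = buf.find(SOI)
            if start == -1:
                break
            end = buf.find(EOI, start + 2)
            if end == -1:
                break  # frame not complete yet; wait for more data
            yield buf[start:end + 2]
            buf = buf[end + 2:]


def batched(frames, size=8):
    """Group frames into lists of `size`; the final list may be shorter."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

In use, the pipe would be opened in binary mode, for example `open("/tmp/ai_pipe_cam1", "rb")`, and each batch from `batched(iter_jpeg_frames(pipe), size=8)` handed to the sender. The batch size of 8 matches the `--batch-size 8` flag above.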

### Appendix D: Kubernetes Resource Summary

```yaml
# Complete resource manifest summary
# Namespaces:
# - surveillance: Main application
# - surveillance-data: Database, cache, storage
# - surveillance-monitoring: Prometheus, Grafana
# - surveillance-ops: CI/CD, backup jobs

# Deployments (always running):
# - stream-ingestion: 3-20 replicas, HPA
# - ai-inference: 1-4 replicas (GPU), HPA
# - suspicious-activity: 2-8 replicas, HPA
# - backend-api: 3-10 replicas, HPA
# - video-playback: 2-4 replicas
# - notification-service: 2-5 replicas, HPA
# - web-frontend: 3 replicas (static, CDN-cached)
# - traefik: 2 replicas (DaemonSet preferred)

# StatefulSets:
# - minio: 4 replicas (distributed mode)

# CronJobs:
# - training-service: Weekly (Sundays 02:00)
# - db-backup: Daily (02:00)
# - storage-cleanup: Daily (03:00)
# - partition-maintenance: Monthly (1st, 03:00)

# External Services (AWS managed):
# - PostgreSQL: RDS db.r6g.xlarge Multi-AZ
# - Redis: ElastiCache cluster mode
# - Kafka: MSK 3 brokers
# - ALB: Internet-facing, WAF attached
```

---

## Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-01-20 | Solution Architect | Initial complete architecture |

---

*End of Document*
