# AI-Powered Industrial Surveillance Platform

## Unified Technical Blueprint — Part A: Sections 1-10

| **Document Property** | **Value** |
|---|---|
| **Version** | 1.0.0 |
| **Classification** | Technical Blueprint — Production Design |
| **Target DVR** | CP PLUS ORANGE CP-UVR-0801E1-CV2 |
| **Channels** | 8 active (scalable to 64+) |
| **Resolution** | 960 x 1080 per channel |
| **DVR Network** | 192.168.29.200/24, RTSP port 554 |
| **Date** | 2025 |

---

> **Cross-Reference Guide**: This unified blueprint synthesizes six specialist design documents. For detailed specifications on any subsystem, refer to:
> - `architecture.md` — Full architecture, scaling, failover, cost estimation
> - `video_ingestion.md` — RTSP configuration, FFmpeg commands, edge gateway specs
> - `ai_vision.md` — Model configurations, inference code, benchmarks
> - `database_schema.md` — Complete DDL, triggers, views, RLS policies
> - `suspicious_activity.md` — Detection algorithms, scoring engine pseudocode
> - `training_system.md` — Training pipelines, quality gates, versioning logic

---

## Table of Contents

- [Section 1: Executive Summary](#section-1-executive-summary)
- [Section 2: Kimi Swarm Team and Agent Responsibilities](#section-2-kimi-swarm-team-and-agent-responsibilities)
- [Section 3: Assumptions](#section-3-assumptions)
- [Section 4: Full Architecture](#section-4-full-architecture)
- [Section 5: Data Flow from DVR to Cloud to Dashboard](#section-5-data-flow-from-dvr-to-cloud-to-dashboard)
- [Section 6: Recommended Tech Stack](#section-6-recommended-tech-stack)
- [Section 7: Database Schema](#section-7-database-schema)
- [Section 8: AI Model and Training Strategy](#section-8-ai-model-and-training-strategy)
- [Section 9: Suspicious Activity Night-Mode Design](#section-9-suspicious-activity-night-mode-design)
- [Section 10: Live Video Streaming Design](#section-10-live-video-streaming-design)

---

## Section 1: Executive Summary

### 1.1 Project Objective

This blueprint defines the complete technical design for an **AI-powered industrial surveillance platform** that turns a legacy CP PLUS 8-channel DVR system into an AI-assisted security operations center. The platform processes real-time video from 8 camera channels, applies person detection, face recognition, and tracking models, detects suspicious activity during night hours (22:00-06:00), and gives security operators a unified dashboard. Reliability, security, and data privacy are treated as design constraints throughout.

The system is designed around a **cloud+edge hybrid architecture** where all compute-intensive AI inference runs in the cloud (AWS Mumbai), while a local edge gateway handles stream ingestion, buffering, and site-local concerns. A WireGuard VPN tunnel protects all communication between edge and cloud, ensuring the DVR has **zero public internet exposure**.

### 1.2 Key Capabilities

| **Capability** | **Description** | **Technology** |
|---|---|---|
| **Human Detection** | Real-time person detection across all 8 channels at 15-20 FPS | YOLO11m + TensorRT FP16, 640x640 |
| **Face Detection** | Accurate face localization with 5-point landmarks for alignment | SCRFD-500M-BNKPS, 640x640 |
| **Face Recognition** | 512-D embedding extraction with 99.83% LFW accuracy | ArcFace R100 IR-SE100 (MS1MV3) |
| **Person Tracking** | Persistent identity tracking across frames with occlusion recovery | ByteTrack (Kalman + IoU), 80.3% MOTA |
| **Unknown Clustering** | Automatic grouping of unknown faces for operator review | HDBSCAN + DBSCAN fallback, 89.5% purity |
| **Night Mode Surveillance** | Suspicious activity analysis across 10 detection modules (22:00-06:00) | Composite scoring engine with time-decay |
| **AI Vibe Controls** | Three intuitive presets (Relaxed/Balanced/Strict) mapping to 4 confidence levels | Dynamic threshold adjustment |
| **Safe Self-Learning** | Three-mode training system with conflict detection and approval workflows | MLflow + Airflow + Manual Review |
| **24/7 Reliability** | Graceful degradation: video never stops, AI catch-up on recovery | Tiered storage + circuit breakers + replay |
| **Real-Time Alerts** | 6-level escalation (NONE to EMERGENCY) with multi-channel notifications | Telegram, WhatsApp, Email, Webhook |
| **Live Dashboard** | Multi-camera grid with HLS streaming and single-camera low-latency WebRTC | Next.js 14 + HLS.js + WebRTC |

### 1.3 Architecture Approach

The platform follows a **cloud+edge+VPN hybrid pattern** with five network security zones:

```
Cameras (8ch) --> DVR (local) --> Edge Gateway (local) --> WireGuard VPN --> AWS Cloud (EKS)
                                      |                        |
                                      | 2TB NVMe buffer         | Encrypted tunnel
                                      | 7-day ring buffer       | UDP 51820
                                      | FFmpeg ingestion        | ChaCha20-Poly1305
```

**Key architectural decisions:**

| **Decision** | **Choice** | **Rationale** |
|---|---|---|
| Cloud Provider | AWS ap-south-1 (Mumbai) | Low latency for sites in India, mature managed services |
| Container Orchestration | Amazon EKS + K3s edge | Managed control plane, GPU node support, lightweight edge |
| VPN | WireGuard | ~60% faster than OpenVPN, modern crypto, simple setup |
| Message Queue | Apache Kafka (MSK) | Durable ordered log, replay capability, proven at scale |
| AI Inference | NVIDIA Triton + TensorRT | GPU-optimized, dynamic batching, model ensemble |
| Database | PostgreSQL 16 + pgvector | ACID compliance, native 512-D vector support |
| Object Storage | MinIO (edge+cloud) + S3 (archive) | S3-compatible API, tiered cost optimization |

### 1.4 Target Environment

The platform targets a **CP PLUS ORANGE CP-UVR-0801E1-CV2** DVR with the following characteristics:

| **Property** | **Value** | **Impact on Design** |
|---|---|---|
| Brand/Model | CP PLUS ORANGE CP-UVR-0801E1-CV2 | Dahua-compatible RTSP URL scheme |
| Channels | 8 active | Initial deployment scope |
| Resolution | 960 x 1080 per channel | AI input: letterbox to 640x640 |
| LAN IP | 192.168.29.200/24 | Edge gateway on same subnet |
| RTSP Port | 554 | TCP interleaved mandatory |
| ONVIF | V2.6.1.867657 (Server V19.06) | Auto-discovery supported |
| DVR Disk | FULL (0 bytes free) | All archival is edge-managed; no DVR recording |
| VPN Access | WireGuard-secured | No public exposure; all traffic encrypted |

> **Critical Design Impact**: The DVR disk being full means the system cannot rely on DVR-side recording or playback features. All archival storage is managed by the edge gateway's 2TB NVMe buffer and cloud tiering.
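
A back-of-the-envelope check of the 7-day edge buffer can be sketched as follows. The 2 Mbps per-channel bitrate is an assumption (the 16 Mbps upload budget from NW-02 divided across 8 channels), and the helper names are illustrative; actual recorded bitrates should be measured on site.

```python
SECONDS_PER_DAY = 86_400

def buffer_bytes(channels: int, mbps_per_channel: float, days: float) -> float:
    """Bytes needed to hold `days` of continuous video at a constant bitrate."""
    bytes_per_second = channels * mbps_per_channel * 1_000_000 / 8
    return bytes_per_second * days * SECONDS_PER_DAY

def retention_days(capacity_bytes: float, channels: int, mbps_per_channel: float) -> float:
    """Days of video that fit in `capacity_bytes` at the given bitrate."""
    return capacity_bytes / buffer_bytes(channels, mbps_per_channel, 1)

# Assumed 2 Mbps per channel; 2 TB is taken as 2e12 bytes.
needed_tb = buffer_bytes(8, 2.0, 7) / 1e12
days_on_2tb = retention_days(2e12, 8, 2.0)
print(f"7 days needs ~{needed_tb:.2f} TB; 2 TB holds ~{days_on_2tb:.1f} days")
```

Under these assumptions, 7 days of 8-channel video needs roughly 1.21 TB, which leaves headroom on a 2 TB disk. Doubling the per-channel bitrate to 4 Mbps would cut retention on the same disk to under 6 days.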

### 1.5 Key Differentiators

**1. AI Vibe Controls**
Instead of exposing raw threshold parameters to operators, the system provides three "vibe" presets (**Relaxed**, **Balanced**, and **Strict**) that map internally to tuned configurations for detection sensitivity and face-match strictness. Non-technical security staff can adjust system behavior without editing individual model thresholds.
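
One way to picture the preset mapping is a small lookup table. This is only a sketch: the threshold values, the `Thresholds` fields, and the reading that Strict means both higher detection sensitivity and a stricter face match are assumptions for illustration, not the tuned configuration.

```python
from dataclasses import dataclass
from enum import Enum

class Vibe(Enum):
    RELAXED = "relaxed"
    BALANCED = "balanced"
    STRICT = "strict"

@dataclass(frozen=True)
class Thresholds:
    person_conf: float  # minimum detector confidence to keep a person box
    face_match: float   # minimum cosine similarity to accept an identity match

# Hypothetical values for illustration only.
PRESETS = {
    Vibe.RELAXED: Thresholds(person_conf=0.60, face_match=0.45),
    Vibe.BALANCED: Thresholds(person_conf=0.45, face_match=0.55),
    Vibe.STRICT: Thresholds(person_conf=0.30, face_match=0.65),
}

def thresholds_for(vibe: str) -> Thresholds:
    """Resolve an operator-facing preset name to internal thresholds."""
    return PRESETS[Vibe(vibe.strip().lower())]

print(thresholds_for("Balanced"))
```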

**2. Safe Self-Learning Training System**
The platform captures operator corrections (confirmations, corrections, merges, rejections) and feeds them back into model improvement through a carefully designed three-mode learning pipeline: **Manual Only**, **Suggested Learning** (recommended), and **Approved Auto-Update**. A synchronous conflict detector blocks five types of label conflicts before they reach the training dataset, ensuring model integrity.
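
The five conflict types are specified in `training_system.md` and are not reproduced here. As a minimal sketch of the idea, the check below covers one hypothetical conflict type only: the same face sample being assigned to more than one person. Function names and the `(face_id, person_id)` tuple shape are assumptions.

```python
from collections import defaultdict

def conflicting_faces(labels):
    """Map each face_id to its person_ids when more than one was assigned."""
    seen = defaultdict(set)
    for face_id, person_id in labels:
        seen[face_id].add(person_id)
    return {face: ids for face, ids in seen.items() if len(ids) > 1}

def split_for_training(labels):
    """Return (clean_labels, conflicts); conflicted samples never reach training."""
    labels = list(labels)
    conflicts = conflicting_faces(labels)
    clean = [(face, person) for face, person in labels if face not in conflicts]
    return clean, conflicts

corrections = [("f1", "alice"), ("f2", "bob"), ("f1", "carol")]
clean, conflicts = split_for_training(corrections)
print(clean, conflicts)
```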

**3. 24/7 Reliability with Graceful Degradation**
The system is architected around a single priority: **video recording never stops**. If the AI inference service fails, recording continues locally with queued catch-up processing on recovery. If the VPN tunnel fails, the edge gateway maintains 7 days of local buffer. If the cloud database fails, alerts accumulate in Kafka's durable log. Every failure mode has a defined degradation strategy.
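
The "record first, infer if possible" rule can be sketched with a simple circuit breaker. The class and function names, the failure limit of 3, and the 30-second cool-down are illustrative assumptions rather than values from the detailed design.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let calls through again to test for recovery.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

def handle_frame(frame, record, infer, breaker, backlog):
    """Always record; run inference only while the breaker allows it."""
    record(frame)  # recording is never skipped
    if not breaker.allow():
        backlog.append(frame)  # queued for catch-up processing on recovery
        return None
    try:
        result = infer(frame)
    except Exception:
        breaker.record_failure()
        backlog.append(frame)
        return None
    breaker.record_success()
    return result
```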

**4. 10-Module Night Surveillance**
The suspicious activity detection system goes beyond simple motion detection to provide comprehensive behavioral analysis through 10 specialized detection modules — from intrusion and loitering to abandoned objects and repeated re-entry patterns — all combined through a composite scoring engine with exponential time-decay.
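
To illustrate how a composite score with exponential time-decay might behave, the sketch below sums per-module contributions and halves each one every `half_life_s` seconds. The weights, half-life, and level cut-offs are hypothetical placeholders; the actual scoring formula lives in `suspicious_activity.md`.

```python
import math

LEVELS = ["NONE", "LOW", "MEDIUM", "HIGH", "CRITICAL", "EMERGENCY"]
# Hypothetical cut-offs: a score at or above CUTOFFS[i] reaches LEVELS[i + 1].
CUTOFFS = [0.2, 0.4, 0.6, 0.8, 0.95]

def composite_score(events, now, half_life_s=300.0):
    """Sum weight * confidence over events, decayed exponentially by age.

    events: iterable of (weight, confidence, timestamp_s) tuples.
    """
    decay = math.log(2) / half_life_s
    total = 0.0
    for weight, confidence, ts in events:
        age = max(0.0, now - ts)
        total += weight * confidence * math.exp(-decay * age)
    return min(total, 1.0)  # clamp so the score stays in [0, 1]

def level_for(score):
    """Map a score in [0, 1] to one of the six escalation levels."""
    return LEVELS[sum(score >= cutoff for cutoff in CUTOFFS)]

events = [
    (0.5, 0.9, 0.0),    # e.g. an intrusion detection at t = 0 s
    (0.3, 0.8, 240.0),  # e.g. a loitering detection at t = 240 s
]
print(level_for(composite_score(events, now=300.0)))
```

With decay, an old detection contributes less than a recent one of the same weight, so a burst of activity raises the level quickly while an isolated event fades on its own.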

### 1.6 Production Readiness Assessment

| **Dimension** | **Status** | **Notes** |
|---|---|---|
| Architecture Completeness | Production-Ready | All 12 services fully specified with resource allocations |
| AI Model Selection | Production-Ready | Industry-standard models with published benchmarks |
| Database Design | Production-Ready | 29 tables, 4 views, 8 triggers, partitioning, RLS |
| Security Architecture | Production-Ready | 7-layer defense in depth, encrypted credentials, VPN-only |
| Scaling Path | Defined | 8 -> 16 -> 32 -> 64+ cameras with concrete resource allocations |
| Failover Design | Production-Ready | Graceful degradation matrix for all failure modes |
| Estimated Timeline | 14 weeks | 4 implementation phases defined |
| Estimated Monthly Cost | ~$2,140 USD | 8-camera deployment at steady state |

---

## Section 2: Kimi Swarm Team and Agent Responsibilities

The unified blueprint was synthesized from the outputs of 11 specialist agents, each responsible for a specific domain of the platform design.

### 2.1 Agent Responsibility Matrix

| **#** | **Agent** | **Responsibility** | **Key Deliverables** |
|---|---|---|---|
| 1 | **Requirements Analyst** | Elicited and structured all functional/non-functional requirements | Requirements traceability matrix, user stories, acceptance criteria |
| 2 | **System Architect** | Designed overall cloud+edge+VPN topology and service interactions | Deployment topology, 5 security zones, scaling roadmap, failover matrix |
| 3 | **Video Ingestion Engineer** | Specified RTSP configuration, edge gateway, and stream processing | RTSP URL patterns, FFmpeg commands, auto-reconnect logic, HLS generation |
| 4 | **AI Vision Scientist** | Selected and configured all CV/AI models for the inference pipeline | Model selection table, inference pipeline architecture, confidence handling |
| 5 | **Database Architect** | Designed complete data model with partitioning, indexing, and security | 29 tables + 4 views + 8 triggers, pgvector HNSW index, RLS policies |
| 6 | **Suspicious Activity Designer** | Designed 10 detection modules and composite scoring engine | Detection algorithms, scoring formula, YAML configuration schema |
| 7 | **Training System Engineer** | Designed self-learning pipeline with safety controls | 3 learning modes, conflict detection, quality gates, versioning |
| 8 | **Frontend Developer** | Designed Next.js dashboard with real-time video and alerts | Component architecture, HLS.js integration, WebSocket alerts |
| 9 | **DevOps Engineer** | Specified CI/CD, monitoring, and infrastructure-as-code | GitHub Actions + ArgoCD, Prometheus/Grafana, alerting rules |
| 10 | **Security Architect** | Designed defense-in-depth security across all layers | 7 security layers, secret management, encryption standards |
| 11 | **Technical Writer** (this document) | Synthesized all specialist outputs into unified blueprint | 10-section unified document with cross-references |

### 2.2 Agent Interaction Flow

```
+-----------------------------------------------------------------------------+
|                         KIMI SWARM TEAM ORCHESTRATION                        |
+-----------------------------------------------------------------------------+
|                                                                              |
|   Requirements Analyst                                                       |
|        |                                                                     |
|        v                                                                     |
|   +---------+    +---------+    +---------+    +---------+                  |
|   | System  |<-->| Video   |<-->| AI      |<-->| Database|                  |
|   |Architect|    |Ingestion|    |Vision   |    |Architect|                  |
|   +---------+    +---------+    +---------+    +---------+                  |
|        ^                                              |                      |
|        |           +---------+    +---------+        |                      |
|        +---------->|Suspicious|<-->|Training |<-------+                      |
|                    |Activity  |    |System   |                               |
|                    |Designer  |    |Engineer |                               |
|                    +---------+    +---------+                               |
|                        |                                                     |
|                        v                                                     |
|                   +---------+    +---------+    +---------+                  |
|                   |Frontend |    |DevOps   |    |Security |                  |
|                   |Developer|    |Engineer |    |Architect|                  |
|                   +---------+    +---------+    +---------+                  |
|                        |                                                     |
|                        v                                                     |
|                   +---------------------+                                    |
|                   | Technical Writer    |                                    |
|                   | (Unified Blueprint) |                                    |
|                   +---------------------+                                    |
|                                                                              |
+-----------------------------------------------------------------------------+
```

### 2.3 Cross-Agent Design Consistency

The following cross-cutting concerns were harmonized across all agent outputs during synthesis:

| **Concern** | **Resolution** | **Agents Coordinated** |
|---|---|---|
| Video latency budget | < 100ms end-to-end (AI); ~35-65s (HLS live) | Video Ingestion, AI Vision, Frontend |
| Face embedding storage | 512-D float32, pgvector HNSW index, cosine similarity | Database, AI Vision, Training |
| Event data retention | 90 days hot (MinIO), 1 year cold (Glacier), 7 days edge | Database, Architecture, Video Ingestion |
| Alert escalation | 6 levels: NONE -> LOW -> MEDIUM -> HIGH -> CRITICAL -> EMERGENCY | Suspicious Activity, Database, Frontend |
| Model versioning | Semantic MAJOR.MINOR.PATCH with MLflow registry | Training, AI Vision, Architecture |
| Graceful degradation | Video never stops; AI catch-up on recovery | Architecture, Video Ingestion, AI Vision |
| Security zones | 5 zones: Internet -> ALB -> Application -> Data -> Edge | Architecture, Security, Video Ingestion |
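
The cosine-similarity comparison used for face embeddings can be sketched in plain Python. The function names and the 0.5 threshold are assumptions for illustration; production embeddings are 512-D, and in PostgreSQL the pgvector `<=>` operator returns cosine distance, which is 1 minus the similarity computed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, gallery, threshold=0.5):
    """Return (person_id, similarity), or (None, similarity) when below threshold."""
    best_id, best_sim = None, -1.0
    for person_id, embedding in gallery.items():
        sim = cosine_similarity(query, embedding)
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    if best_sim < threshold:
        return None, best_sim
    return best_id, best_sim

# Toy 3-D vectors stand in for 512-D embeddings.
gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(best_match([0.9, 0.1, 0.0], gallery))
```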

---

## Section 3: Assumptions

All assumptions made across the specialist designs are consolidated below. These should be validated before implementation begins.

### 3.1 Network and Hardware Assumptions

| **ID** | **Assumption** | **Validation Method** | **Risk if Invalid** |
|---|---|---|---|
| NW-01 | Edge gateway has dual Ethernet: one for local DVR subnet (192.168.29.0/24), one for internet/VPN | Physical site survey | Cannot bridge DVR to VPN |
| NW-02 | Site internet bandwidth >= 16 Mbps sustained upload for 8 channels | ISP speed test | Video drops, AI delays |
| NW-03 | WireGuard UDP port 51820 is not blocked by site firewall | Firewall rule check | VPN cannot establish |
| NW-04 | DVR RTSP server supports TCP interleaved transport (`rtsp_transport tcp`) | FFmpeg test probe | UDP fallback has packet loss |
| NW-05 | DVR supports 16+ concurrent RTSP sessions (8 channels x 2 streams) | Session stress test | Stream contention |
| NW-06 | MTU 1400 is viable through site NAT/firewall for WireGuard tunnel | Ping with DF bit test | Fragmentation issues |
| HW-01 | Intel NUC 13 Pro (i5-1340P, 16GB RAM, 2TB NVMe) is available for edge gateway | Hardware procurement | May need Jetson Orin alternative |
| HW-02 | Edge gateway has UPS backup for graceful shutdown on power loss | Electrical survey | Data corruption on hard power-off |
| HW-03 | AWS g4dn.xlarge (T4 GPU) instances are available in ap-south-1 | AWS EC2 capacity check | Need alternative GPU instance |
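
For NW-06, the DF-bit ping test can be sized with a little arithmetic. This sketch assumes an IPv4 underlay and Linux `ping` (iputils) syntax; the host name is a placeholder. WireGuard adds 60 bytes per packet over IPv4, so a tunnel MTU of 1400 needs an outer path MTU of at least 1460.

```python
WG_OVERHEAD_IPV4 = 60    # 20 (IPv4) + 8 (UDP) + 32 (WireGuard data header and auth tag)
ICMP_OVERHEAD_IPV4 = 28  # 20 (IPv4) + 8 (ICMP echo header)

def required_path_mtu(tunnel_mtu: int) -> int:
    """Smallest outer path MTU that carries full-size tunnel packets unfragmented."""
    return tunnel_mtu + WG_OVERHEAD_IPV4

def df_ping_command(host: str, path_mtu: int, count: int = 3) -> list[str]:
    """Build a Linux ping command that probes `path_mtu` with the DF bit set."""
    payload = path_mtu - ICMP_OVERHEAD_IPV4
    return ["ping", "-M", "do", "-c", str(count), "-s", str(payload), host]

# "vpn.example.com" is a placeholder for the cloud WireGuard endpoint.
print(" ".join(df_ping_command("vpn.example.com", required_path_mtu(1400))))
```

If the probe fails with a "message too long" error, the tunnel MTU would need to be lowered until the corresponding probe succeeds.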

### 3.2 DVR Capabilities Assumptions

| **ID** | **Assumption** | **Validation Method** | **Risk if Invalid** |
|---|---|---|---|
| DVR-01 | DVR RTSP streams are accessible at `rtsp://admin:{PASS}@192.168.29.200:554/cam/realmonitor?channel=N&subtype=M` | FFmpeg connectivity test | Need alternative URL format |
| DVR-02 | DVR continues serving RTSP streams even with disk full (0 bytes free) | 24-hour stream stability test | Streams may stall |
| DVR-03 | DVR sub-stream (subtype=1) provides sufficient quality for AI inference (typically 352x288 to 704x576) | Frame quality inspection | May need main stream for AI |
| DVR-04 | DVR ONVIF server supports device discovery and stream URI retrieval | ONVIF Device Manager test | Manual camera configuration needed |
| DVR-05 | DVR channel numbering is 1-indexed (1-8) | ONVIF profile enumeration | Off-by-one errors in configuration |
| DVR-06 | DVR Digest authentication works with the provided credentials | RTSP DESCRIBE request test | May need Basic auth or different scheme |
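
The URL pattern in DVR-01 and the probe in NW-04 can be exercised with a small helper. This is a sketch: the function names are illustrative, `subtype=0` meaning main stream follows the Dahua convention and should be confirmed on the device, and the password is percent-encoded so special characters do not break the URL.

```python
from urllib.parse import quote

def rtsp_url(host: str, channel: int, subtype: int, user: str, password: str,
             port: int = 554) -> str:
    """Build a Dahua-style RTSP URL for one DVR channel (1-indexed, per DVR-05)."""
    if not 1 <= channel <= 8:
        raise ValueError("channel must be in 1..8")
    if subtype not in (0, 1):
        raise ValueError("subtype must be 0 (main, assumed) or 1 (sub)")
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return (f"rtsp://{credentials}@{host}:{port}"
            f"/cam/realmonitor?channel={channel}&subtype={subtype}")

def probe_command(url: str) -> list[str]:
    """ffprobe invocation that forces TCP interleaved transport."""
    return ["ffprobe", "-v", "error", "-rtsp_transport", "tcp", "-show_streams", url]

# Placeholder credentials; real ones should come from a secret store.
urls = [rtsp_url("192.168.29.200", ch, 1, "admin", "p@ss/w") for ch in range(1, 9)]
print(urls[0])
```

Running `probe_command(url)` through `subprocess.run` for each channel gives a quick pass/fail check for DVR-01, DVR-06, and NW-04 in one step.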

### 3.3 Environmental Assumptions

| **ID** | **Assumption** | **Impact if Invalid** |
|---|---|---|
| ENV-01 | Scene lighting is adequate for face recognition during night hours (minimum 10 lux at face distance) | Face recognition accuracy degrades; may need IR illumination |
| ENV-02 | Camera angles allow frontal face capture at entry/exit points (yaw < 45 degrees) | Face recognition miss rate increases |
| ENV-03 | Indoor industrial environment with minimal weather interference | False positive rate from rain and shadows increases |
| ENV-04 | Maximum person-to-camera distance is within 10 meters for face recognition | Faces may be too small (< 20px) for reliable detection |
| ENV-05 | Camera positions are stable (no PTZ movement during normal operation) | Zone calibration becomes invalid and must be redone |

### 3.4 Operational Assumptions

| **ID** | **Assumption** | **Impact if Invalid** |
|---|---|---|
| OPS-01 | Security operators will review unknown face clusters and provide identity labels daily | Unknown person database grows without enrichment |
| OPS-02 | Admin will review training suggestions at least weekly in "Suggested Learning" mode | Training queue backlog accumulates |
| OPS-03 | Site has authorized personnel who can access edge gateway for maintenance (SSH, physical) | Remote troubleshooting limited |
| OPS-04 | Alert fatigue is a genuine concern — false positive rate > 20% leads to ignored alerts | AI vibe controls and suppression tuned accordingly |
| OPS-05 | Incident video review requires 10-second pre-event and 30-second post-event clips | Clip pre/post durations must be reconfigured |
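
For OPS-05, selecting which buffered segments make up an incident clip is simple index arithmetic. The sketch assumes fixed 2-second segments numbered from 0 at the start of the buffered stream; the function name and numbering scheme are illustrative.

```python
import math

def clip_segments(event_ts: float, stream_start_ts: float, segment_s: float = 2.0,
                  pre_s: float = 10.0, post_s: float = 30.0) -> list[int]:
    """Indices of the segments that overlap [event - pre_s, event + post_s]."""
    start = max(event_ts - pre_s, stream_start_ts)  # cannot reach before the buffer
    end = event_ts + post_s
    first = int((start - stream_start_ts) // segment_s)
    last = math.ceil((end - stream_start_ts) / segment_s) - 1
    return list(range(first, last + 1))

segments = clip_segments(event_ts=100.0, stream_start_ts=0.0)
print(segments[0], segments[-1], len(segments))
```

An event 100 seconds into the stream maps to segments 45 through 64, which is 20 segments or 40 seconds of video.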

### 3.5 Security Assumptions

| **ID** | **Assumption** | **Impact if Invalid** |
|---|---|---|
| SEC-01 | WireGuard encryption (ChaCha20-Poly1305) meets organizational security requirements | May need additional encryption layer |
| SEC-02 | Hosting in AWS ap-south-1 (Mumbai), in a VPC with private subnets, satisfies data residency requirements for India | Compliance review needed |
| SEC-03 | Face embeddings (512-D vectors) do not constitute PII under applicable regulations | Legal review needed for biometric data handling |
| SEC-04 | Edge gateway physical security is equivalent to server room security | Tampering risk if edge is physically accessible |
| SEC-05 | DVR credentials can be stored encrypted (AES-256) in cloud database | Key management infrastructure required |

### 3.6 AI Performance Assumptions

| **ID** | **Assumption** | **Impact if Invalid** |
|---|---|---|
| AI-01 | YOLO11m TensorRT FP16 achieves > 75% person AP@50 on surveillance footage | May need fine-tuning on site-specific data |
| AI-02 | ArcFace R100 achieves > 98% Rank-1 accuracy on enrolled persons with 5+ reference images | Enrollment quality gates ensure minimum samples |
| AI-03 | HDBSCAN achieves > 89% cluster purity on 512-D face embeddings from this camera setup | Fallback to DBSCAN if density varies too much |
| AI-04 | ByteTrack maintains < 2 ID switches per 100 frames in industrial environment with occlusion | May need BoT-SORT upgrade for complex scenes |
| AI-05 | GPU (T4) can sustain 15-20 FPS processing per stream across 8 streams with batching | CPU fallback at 5-8 FPS if GPU unavailable |

---


## Section 4: Full Architecture

### 4.1 High-Level System Architecture

The platform employs a **cloud+edge hybrid architecture** with five network security zones. Video streams are ingested at the edge, processed by AI in the cloud, and presented through a web-based dashboard. A WireGuard VPN tunnel provides encrypted, zero-exposure connectivity between edge and cloud.

```
+=============================================================================+
|                         CLOUD+EDGE+VPN ARCHITECTURE                          |
+=============================================================================+
|                                                                              |
|   ZONE 0: INTERNET (UNTRUSTED)                                               |
|   +---------------------+                                                    |
|   |  Users / Browsers   |                                                    |
|   |  HTTPS :443         |                                                    |
|   +----------+----------+                                                    |
|              |                                                               |
|              v                                                               |
|   ZONE 1: AWS VPC EDGE (DEMILITARIZED)                                       |
|   +--------------------------------------------------------------+          |
|   |  AWS ALB (:443) + WAF v2 + Rate Limit + Geo-Restriction      |          |
|   |       |                                                      |          |
|   |       v                                                      |          |
|   |  Traefik Ingress Controller (:8443)                          |          |
|   |  - Route: /api/*  -> Backend Service                         |          |
|   |  - Route: /ws/*   -> WebSocket Handler                       |          |
|   |  - Route: /       -> Next.js Web App                         |          |
|   |  - TLS: Let's Encrypt auto certificates                      |          |
|   +--------------------------------------------------------------+          |
|              |                                                               |
|              v                                                               |
|   ZONE 2: AWS VPC APPLICATION (TRUSTED)                                      |
|   +--------------------------------------------------------------+          |
|   |  +-------------+  +-------------+  +---------------------+   |          |
|   |  | Stream      |  | AI Inference|  | Suspicious Activity |   |          |
|   |  | Ingestion   |  | Service     |  | Service (Night Mode)|   |          |
|   |  | (Go/FFmpeg) |  | (Triton)    |  | (Go/Python)         |   |          |
|   |  | :8081       |  | :8001 gRPC  |  | :8083               |   |          |
|   |  +-------------+  +-------------+  +---------------------+   |          |
|   |  +-------------+  +-------------+  +---------------------+   |          |
|   |  | Backend API |  | Training    |  | Notification        |   |          |
|   |  | (Go/Gin)    |  | Service     |  | Service             |   |          |
|   |  | :8080       |  | (PyTorch)   |  | (Go)                |   |          |
|   |  +-------------+  +-------------+  +---------------------+   |          |
|   |  +--------------------+                                      |          |
|   |  | Web Frontend       |  HLS Playback Service               |          |
|   |  | (Next.js 14 :3000) |  (Go :8085)                         |          |
|   |  +--------------------+                                      |          |
|   +--------------------------------------------------------------+          |
|              |                                                               |
|              v                                                               |
|   ZONE 3: AWS VPC DATA (HIGHLY RESTRICTED)                                   |
|   +--------------------------------------------------------------+          |
|   |  +-------------+  +-------------+  +-------------+           |          |
|   |  | PostgreSQL  |  | Redis       |  | Kafka       |           |          |
|   |  | 16 (RDS)    |  | 7 Cluster   |  | (MSK)       |           |          |
|   |  | :5432       |  | :6379       |  | :9092       |           |          |
|   |  | pgvector    |  | Pub/Sub     |  | 3 brokers   |           |          |
|   |  | HNSW index  |  | Streams     |  | 3 AZs       |           |          |
|   |  +-------------+  +-------------+  +-------------+           |          |
|   |  +-------------+  +-----------------------------------+      |          |
|   |  | MinIO       |  | S3 (Cold Archive)                 |      |          |
|   |  | (S3-compat) |  | - Standard (30d)                  |      |          |
|   |  | :9000       |  | - IA (31-90d)                     |      |          |
|   |  | 10 TB       |  | - Glacier Deep Archive (90d+)     |      |          |
|   |  +-------------+  +-----------------------------------+      |          |
|   +--------------------------------------------------------------+          |
|              |                                                               |
|              | WireGuard VPN Tunnel (UDP 51820)                                |
|              | ChaCha20-Poly1305 encryption                                    |
|              | Cloud peer: 10.200.0.1/32 <-> Edge peer: 10.200.0.2/32         |
|              v                                                               |
|   ZONE 4: EDGE NETWORK (PHYSICALLY ISOLATED)                                 |
|   +--------------------------------------------------------------+          |
|   |  +--------------------------------------------------------+  |          |
|   |  |              EDGE GATEWAY (Intel NUC)                  |  |          |
|   |  |  Ubuntu 22.04 LTS | K3s v1.28+ | 2TB NVMe             |  |          |
|   |  |                                                          |  |          |
|   |  |  +-----------------+  +-----------------+                |  |          |
|   |  |  | Stream Manager  |  | HLS Segmenter   |                |  |          |
|   |  |  | (Python/asyncio)|  | (FFmpeg/nginx)  |                |  |          |
|   |  |  | 8x RTSP feeds   |  | 2s segments     |                |  |          |
|   |  |  +-----------------+  +-----------------+                |  |          |
|   |  |  +-----------------+  +-----------------+                |  |          |
|   |  |  | Frame Extractor |  | Buffer Manager  |                |  |          |
|   |  |  | (AI decimation) |  | (20GB ring buf) |                |  |          |
|   |  |  +-----------------+  +-----------------+                |  |          |
|   |  |  +--------------------------------------------------+    |  |          |
|   |  |  | VPN Client (WireGuard)  |  Health Monitor         |    |  |          |
|   |  |  +--------------------------------------------------+    |  |          |
|   |  +--------------------------------------------------------+  |          |
|   |                            |                                             |
|   |   Local Network (192.168.29.0/24)                                       |
|   |   +------------------+    +------------------+                           |
|   |   | CP PLUS DVR      |    | Local Monitor    |                           |
|   |   | 192.168.29.200   |    | 192.168.29.10    |                           |
|   |   | 8ch | RTSP :554  |    | (optional)       |                           |
|   |   +------------------+    +------------------+                           |
|   |   CH1 CH2 CH3 CH4 CH5 CH6 CH7 CH8                                      |
|   +--------------------------------------------------------------+          |
|                                                                              |
+=============================================================================+
```
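
The tunnel parameters shown in the diagram (peer addresses, UDP 51820) plus the MTU from NW-06 map onto a `wg-quick` configuration for the edge peer. The renderer below is a sketch: the function name, the key placeholders, the endpoint host, and the 25-second keepalive are assumptions, and `AllowedIPs` may need to include additional cloud subnets in a real deployment.

```python
def render_edge_wg_conf(private_key: str, cloud_public_key: str,
                        cloud_endpoint_host: str, mtu: int = 1400,
                        keepalive_s: int = 25) -> str:
    """Render a wg-quick style config for the edge gateway peer."""
    lines = [
        "[Interface]",
        f"PrivateKey = {private_key}",
        "Address = 10.200.0.2/32",      # edge peer address from the diagram
        f"MTU = {mtu}",
        "",
        "[Peer]",
        f"PublicKey = {cloud_public_key}",
        f"Endpoint = {cloud_endpoint_host}:51820",
        "AllowedIPs = 10.200.0.1/32",   # cloud peer address from the diagram
        f"PersistentKeepalive = {keepalive_s}",  # keeps the NAT mapping alive
    ]
    return "\n".join(lines) + "\n"

# Placeholder values; real keys must never be committed to source control.
conf = render_edge_wg_conf("<edge-private-key>", "<cloud-public-key>", "vpn.example.com")
print(conf)
```

Because the edge initiates the connection outbound, no inbound port needs to be opened on the site firewall, which is what keeps the DVR off the public internet.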

### 4.2 Service Interaction Diagram

```
+-----------------------------------------------------------------------------+
|                           SERVICE INTERACTIONS                               |
+-----------------------------------------------------------------------------+
|                                                                              |
|   INTERNET USERS                                                             |
|        |                                                                     |
|        | HTTPS :443                                                          |
|        v                                                                     |
|   +---------+      +----------+      +----------+                           |
|   | AWS ALB |----->| Traefik  |----->| Next.js  |  Web Frontend             |
|   | +WAF    |      | Ingress  |      | (SSR)    |  Dashboard                |
|   +---------+      +----------+      +----+-----+                           |
|                                             |                                |
|                        +--------------------+--------------------+           |
|                        |                    |                    |           |
|                        v                    v                    v           |
|                   +---------+       +------------+      +----------+       |
|                   |Backend  |       | WebSocket  |      | HLS      |       |
|                   |API (Go) |       | Handler    |      | Playback |       |
|                   |:8080    |       | /ws/alerts |      | Service  |       |
|                   +----+----+       +------------+      +----+-----+       |
|                        |                                                     |
|                        | gRPC :50051                                         |
|                        v                                                     |
|   +---------+    +------------+    +----------+    +----------+              |
|   | Stream  |    | AI         |    |Suspicious|    |Training  |              |
|   |Ingestion|<-->| Inference  |<-->| Activity |    |Service   |              |
|   |(Go)     |    |(Triton)    |    |(Night)   |    |(PyTorch) |              |
|   +----+----+    +------+-----+    +----+-----+    +----+-----+              |
|        |                |               |               |                    |
|        v                v               v               v                    |
|   +---------------------------------------------------------------+          |
|   |                        KAFKA (MSK)                            |          |
|   |  streams.raw (8 parts)  ai.detections (16 parts)              |          |
|   |  alerts.critical (4 parts)  training.data (30-day ret.)       |          |
|   |  notifications.*  system.metrics (7-day ret.)                 |          |
|   +---------------------------------------------------------------+          |
|        |                |               |               |                    |
|        v                v               v               v                    |
|   +----------+   +------------+    +----------+    +----------+              |
|   |PostgreSQL|   | Redis      |    | MinIO    |    | MLflow   |              |
|   |16 +pgvec |   |7 Cluster   |    |S3-compat |    | Model    |              |
|   |:5432     |   |:6379       |    |:9000     |    | Registry |              |
|   +----------+   +------------+    +----------+    +----------+              |
|                                                                              |
|   Edge Gateway: WireGuard peer at 10.200.0.2/32                            |
|   Stream Ingestion pulls frames via VPN -> sends to Kafka                   |
|                                                                              |
+-----------------------------------------------------------------------------+
```

### 4.3 Network Security Zones

Five security zones provide defense in depth, from the public internet to the physically isolated edge network.

```
+=============================================================================+
|                         NETWORK SECURITY ZONES                               |
+=============================================================================+
|                                                                              |
|  +---------------------------------------------------------------------+    |
|  |  ZONE 0: INTERNET (UNTRUSTED)                                        |    |
|  |  - Public users, any source IP                                        |    |
|  |  - AWS Shield Standard DDoS protection                               |    |
|  |  - Geo-restriction: allow specific countries only                    |    |
|  +---------------------------+-----------------------------------------+    |
|                              |                                               |
|                              | HTTPS :443                                    |
|                              v                                               |
|  +---------------------------------------------------------------------+    |
|  |  ZONE 1: AWS VPC EDGE (DEMILITARIZED)                                |    |
|  |  - ALB + WAF v2 (SQL injection, XSS, rate limiting rules)           |    |
|  |  - Traefik Ingress (:8443)                                          |    |
|  |  - Auth: JWT + RBAC, API keys for edge gateway                     |    |
|  |  - Public API endpoints ONLY                                        |    |
|  |  SG: alb-public-sg: 443 from 0.0.0.0/0                             |    |
|  |  SG: traefik-sg: 8443 from alb-sg ONLY                              |    |
|  +---------------------------+-----------------------------------------+    |
|                              |                                               |
|                              | Internal :8080-8090                         |
|                              v                                               |
|  +---------------------------------------------------------------------+    |
|  |  ZONE 2: AWS VPC APPLICATION (TRUSTED, ISOLATED)                     |    |
|  |  - Stream Ingestion, AI Inference, Suspicious Activity              |    |
|  |  - Training, Backend API, Notification Services                     |    |
|  |  - Pod Security: No root, read-only FS, no privilege escalation    |    |
|  |  - Network Policies: Ingress only from API GW namespace            |    |
|  |  SG: app-sg: 8080-8090 from traefik-sg ONLY                         |    |
|  +---------------------------+-----------------------------------------+    |
|                              |                                               |
|                              | Data Layer :5432, :6379, :9092, :9000       |
|                              v                                               |
|  +---------------------------------------------------------------------+    |
|  |  ZONE 3: AWS VPC DATA (HIGHLY RESTRICTED)                            |    |
|  |  - PostgreSQL (RDS), Redis (ElastiCache), Kafka (MSK)               |    |
|  |  - MinIO object storage, S3 cold archive                            |    |
|  |  - Security Groups: ONLY from app-sg                                |    |
|  |  - RDS: Encrypted at rest (AWS KMS), no public access              |    |
|  |  - S3: Bucket policy deny all except VPC endpoint                   |    |
|  +---------------------------+-----------------------------------------+    |
|                              |                                               |
|                              | WireGuard VPN (UDP 51820)                     |
|                              | ChaCha20-Poly1305                             |
|                              v                                               |
|  +---------------------------------------------------------------------+    |
|  |  ZONE 4: EDGE NETWORK (PHYSICALLY ISOLATED)                          |    |
|  |  - Edge Gateway (Intel NUC), K3s node                                |    |
|  |  - WireGuard peer, stream ingestion, local buffer                    |    |
|  |  - DVR (192.168.29.200): NO internet access, local ONLY             |    |
|  |  - Edge Firewall: ALLOW 192.168.29.0/24 -> DVR :554,:80           |    |
|  |                   ALLOW OUT 51820/udp -> Cloud VPN endpoint        |    |
|  |                   DENY ALL other incoming                           |    |
|  +---------------------------------------------------------------------+    |
|                                                                              |
+=============================================================================+
```

### 4.4 Service Descriptions

| **#** | **Service** | **Purpose** | **Technology** | **Port** | **Replicas** |
|---|---|---|---|---|---|
| 1 | **Edge Gateway Agent** | RTSP stream pull, local recording, VPN endpoint, heartbeat | Go 1.21, systemd + K3s | 8080, 51820 | 1 (per site) |
| 2 | **Stream Ingestion** | Receive frames from edge, decode, produce to Kafka, store segments | Go 1.21, FFmpeg | 8081 | 3-20 (HPA) |
| 3 | **AI Inference** | GPU-accelerated detection, face recognition, embedding | Triton 2.40, TensorRT | 8000, 8001, 8002 | 1-4 (GPU HPA) |
| 4 | **Suspicious Activity** | Night-mode analysis, 10 detection modules, scoring engine | Python 3.11, OpenCV | 8083 | 2-8 (HPA) |
| 5 | **Training Service** | Model retraining, fine-tuning, A/B validation | PyTorch 2.1, CUDA 12.1 | 8084 | 0-1 (GPU spot) |
| 6 | **Backend API** | REST API, authentication, business logic | Go 1.21, Gin | 8080 | 3-10 (HPA) |
| 7 | **Web Frontend** | Dashboard, live view, timeline, analytics | Next.js 14, React 18 | 3000 | 3 (CDN) |
| 8 | **Notification** | Multi-channel alert dispatch (Telegram, WhatsApp, Email) | Go 1.21 | 8086 | 2-5 (HPA) |
| 9 | **HLS Playback** | HLS segment serving for dashboard live view | Go 1.21 | 8085 | 2-4 (HPA) |
| 10 | **PostgreSQL** | Primary database with pgvector for embeddings | PostgreSQL 16 (RDS) | 5432 | 1 (Multi-AZ) |
| 11 | **Redis** | Session store, cache, pub/sub, stream tracking | Redis 7 (ElastiCache) | 6379 | 2 shards x 2 replicas |
| 12 | **Kafka** | Event bus, durable log, stream replay | Apache Kafka (MSK) | 9092 | 3 brokers x 3 AZs |
| 13 | **MinIO** | Object storage for video, snapshots, model artifacts | MinIO (S3-compatible) | 9000, 9001 | Edge: 1, Cloud: 4 |

### 4.5 Physical Edge Gateway Specification

| **Component** | **Specification** |
|---|---|
| Hardware | Intel NUC 13 Pro, Core i5-1340P (12 cores, 16 threads) |
| Alternative | NVIDIA Jetson Orin NX 16GB (for on-edge AI inference) |
| RAM | 16GB DDR4-3200 (32GB recommended for 16+ channels) |
| Storage | 2TB NVMe SSD (7-day circular buffer for all 8 streams) |
| LAN | Intel i226-V 2.5GbE (local DVR subnet) |
| WAN | Second Ethernet or WiFi (internet for VPN) |
| OS | Ubuntu 22.04.4 LTS Server (no GUI) |
| Container Runtime | Docker CE 25.x + Docker Compose 2.x |
| K8s Distribution | K3s v1.28+ (lightweight, single-node or 2-node HA) |
| Power | UPS-backed, auto-restart on power loss (BIOS setting) |
| Network | Dual interface: eth0 for local DVR, eth1 for internet/VPN |

### 4.6 Cloud Infrastructure Specification

| **Component** | **Specification** |
|---|---|
| Region | Primary: ap-south-1 (Mumbai), DR: ap-southeast-1 (Singapore) |
| VPC | 10.100.0.0/16, 3 AZs, private subnets only for workloads |
| EKS | Managed node groups: `on-demand` for API, `spot` for batch/GPU |
| GPU Nodes | g4dn.xlarge (NVIDIA T4) for Triton inference, 1-4 auto-scaled |
| ALB | Internet-facing, WAF v2 attached, Shield Advanced optional |
| RDS | PostgreSQL 16, db.r6g.xlarge, Multi-AZ, encrypted at rest |
| ElastiCache | Redis 7, cluster mode enabled, 2 shards x 2 replicas |
| MSK (Kafka) | 3 broker nodes, kafka.m5.large, 3 AZs |
| S3 | Standard (hot 30d), IA (31-90d), Glacier Deep Archive (90d+) |

### 4.7 Scaling Approach

The system scales from the initial 8-camera deployment to 64+ cameras through well-defined phases:

```
+-----------------------------------------------------------------------------+
|                        CAMERA SCALING ROADMAP                                |
+-----------------------------------------------------------------------------+
|                                                                              |
|  CURRENT: 8 cameras (1 DVR)                                                  |
|  +-- Edge: Intel NUC i7, 32GB RAM                                           |
|  +-- Bandwidth: ~16 Mbps upstream (2 Mbps per H.264 stream)                 |
|  +-- Cloud AI: 1x T4 GPU (8 streams @ 1 fps, batch=8)                       |
|  +-- Kafka: 8 partitions (streams.raw)                                      |
|  +-- PostgreSQL: db.r6g.xlarge                                              |
|  +-- Monthly cost: ~$2,140                                                  |
|                                                                              |
|  PHASE 1: 16 cameras (2 DVRs / 2 sites)                                      |
|  +-- Edge: 2x Intel NUC (one per site)                                      |
|  +-- Bandwidth: ~32 Mbps                                                    |
|  +-- Cloud AI: 1x T4 GPU (batch=16, still sufficient)                       |
|  +-- Kafka: 16 partitions                                                   |
|  +-- Monthly cost: ~$3,200                                                  |
|                                                                              |
|  PHASE 2: 32 cameras (4 DVRs / 4 sites)                                      |
|  +-- Edge: 4x Intel NUC                                                     |
|  +-- VPN: Hub-spoke model (4 edge peers -> 1 cloud endpoint)                |
|  +-- Bandwidth: ~64 Mbps                                                    |
|  +-- Cloud AI: 2x T4 GPUs (HPA: 2-6 replicas)                               |
|  +-- Kafka: 32 partitions                                                   |
|  +-- PostgreSQL: db.r6g.2xlarge                                             |
|  +-- Monthly cost: ~$5,500                                                  |
|                                                                              |
|  PHASE 3: 64 cameras (8 DVRs / 8 sites)                                      |
|  +-- Edge: 8x Intel NUC (or Jetson Orin for edge AI pre-filter)              |
|  +-- Bandwidth: ~128 Mbps (dedicated circuit recommended)                   |
|  +-- Cloud AI: 4x T4 GPUs or 2x A10G (g5.2xlarge)                           |
|  +-- Kafka: 64 partitions, consider MSK multi-cluster                        |
|  +-- PostgreSQL: db.r6g.4xlarge + read replica                              |
|  +-- Monthly cost: ~$9,800                                                  |
|                                                                              |
+-----------------------------------------------------------------------------+
```
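
The bandwidth and partition figures in the roadmap follow a simple linear rule: roughly 2 Mbps of upstream and one `streams.raw` partition per camera, with one 8-channel DVR per site. The sketch below reproduces those figures for planning purposes. It assumes the linear rule also holds between the listed phases, and the names `estimate_capacity` and `CapacityEstimate` are illustrative, not part of the platform.

```python
from dataclasses import dataclass

MBPS_PER_STREAM = 2   # roadmap figure: ~2 Mbps per H.264 stream
CAMERAS_PER_DVR = 8   # roadmap figure: one 8-channel DVR per site


@dataclass(frozen=True)
class CapacityEstimate:
    cameras: int
    dvrs: int
    upstream_mbps: int
    raw_partitions: int


def estimate_capacity(cameras: int) -> CapacityEstimate:
    """Linear estimate of edge/Kafka sizing for a given camera count."""
    if cameras <= 0:
        raise ValueError("cameras must be positive")
    dvrs = -(-cameras // CAMERAS_PER_DVR)  # ceiling division
    return CapacityEstimate(
        cameras=cameras,
        dvrs=dvrs,
        upstream_mbps=cameras * MBPS_PER_STREAM,
        raw_partitions=cameras,  # one streams.raw partition per camera
    )
```

Calling `estimate_capacity(32)` yields 4 DVRs, 64 Mbps, and 32 partitions, matching Phase 2. GPU count and cost are deliberately left out because the roadmap does not scale them linearly.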

### 4.8 Failover and Reliability Design

The graceful degradation matrix defines the fallback behavior for each major component failure:

```
+=============================================================================+
|                     GRACEFUL DEGRADATION MATRIX                              |
+=============================================================================+
|                                                                              |
|  Failure Mode              | Degradation Strategy                            |
|  ------------------------- | ----------------------------------------------- |
|  AI Inference Service DOWN | Continue recording ALL video locally            |
|  (GPU failure, model crash)| Events stored as "unprocessed"                  |
|                            | No real-time alerts                             |
|                            | Queue frames for later batch processing         |
|                            | Dashboard shows "AI OFFLINE" banner             |
|                                                                              |
|  Kafka DOWN (MSK outage)   | Edge Gateway buffers locally (20GB ring buffer) |
|                            | Backpressure: reduce to key frames only (0.2fps)|
|                            | Auto-reconnect with 2x exponential backoff      |
|                            | Replay from local buffer when Kafka recovers    |
|                                                                              |
|  VPN Tunnel DOWN           | Full local operation mode                       |
|  (internet outage)         | All recording continues locally (7-day buffer)  |
|                            | Local alert buzzer/relay (configurable)         |
|                            | No cloud dashboard access                       |
|                            | Auto-sync when VPN recovers                     |
|                                                                              |
|  PostgreSQL DOWN (RDS)     | Alert queue builds in Kafka (durable log)       |
|                            | Events not lost (Kafka 7-day retention)         |
|                            | Read-only dashboard mode (Redis cache)          |
|                            | Alert on-call engineer                          |
|                                                                              |
|  Notification Service DOWN | Alerts accumulate in DB                         |
|                            | Retry with exponential backoff                  |
|                            | Dead letter after 24 hours                      |
|                            | Dashboard shows pending count                   |
|                                                                              |
|  Edge Gateway DOWN (power) | Cloud dashboard shows "SITE OFFLINE"            |
|                            | Last known recordings in cloud                  |
|                            | Alert sent immediately                          |
|                            | UPS: graceful shutdown, preserve data           |
|                                                                              |
+=============================================================================+
```

**Priority Order (highest first):**
1. Video recording NEVER STOPS (local edge priority)
2. Critical alerts ALWAYS FIRE (local buzzer + queued cloud alerts)
3. AI inference gracefully degrades to batch catch-up on recovery
4. Dashboard operates in read-only/cache mode during DB outage
5. Cloud sync resumes automatically when connectivity is restored

**Reliability Mechanisms:**

| **Mechanism** | **Implementation** | **Target** |
|---|---|---|
| Stream Reconnect | Exponential backoff: 1s -> 2s -> 4s -> 8s -> max 30s | < 60s recovery |
| Circuit Breaker | 5 failures -> OPEN (60s) -> HALF_OPEN (3 test calls) -> CLOSED | Prevent cascade failures |
| VPN Watchdog | Ping every 30s, restart WireGuard on 3 consecutive failures | < 90s VPN recovery |
| Kafka Producer | `acks=all`, `retries=10`, `enable.idempotence=true`, LZ4 compression | No loss of acknowledged messages |
| Kafka Consumer | Manual offset commit AFTER DB write success | At-least-once processing (idempotent DB writes absorb replays) |
| Health Checks | 5-layer: K8s probes -> Service metrics -> Dependency checks -> E2E synthetic -> Edge heartbeat | < 2 min detection |
| Auto-scaling | GPU util > 80% for 2 min -> scale out; Kafka lag > 1000 for 5 min -> scale out | Proactive capacity |
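
The Stream Reconnect and Circuit Breaker rows above can be sketched as follows. This is a minimal illustration, not the service implementation. It assumes the breaker counts consecutive failures, that HALF_OPEN closes only after three successful test calls, and that any failure while HALF_OPEN re-opens the circuit. The class and function names are hypothetical.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield reconnect delays that double each attempt and are capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


class CircuitBreaker:
    """CLOSED -> OPEN after N consecutive failures -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 60.0,
                 half_open_calls: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.half_open_calls = half_open_calls
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0
        self._trial_successes = 0

    def allow(self) -> bool:
        """Return True if a call may proceed; moves OPEN -> HALF_OPEN after the cool-down."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.open_seconds:
                return False
            self.state = "HALF_OPEN"
            self._trial_successes = 0
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self._trial_successes += 1
            if self._trial_successes >= self.half_open_calls:
                self.state = "CLOSED"
                self._failures = 0
        else:
            self._failures = 0  # a success resets the consecutive-failure count

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._trip()  # assumption: one failed test call re-opens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()
        self._failures = 0
```

A caller would check `allow()` before each downstream request and report the outcome with `record_success()` or `record_failure()`. Injecting `clock` keeps the state machine testable without real waiting.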

---


## Section 5: Data Flow from DVR to Cloud to Dashboard

This section traces the complete data journey from camera capture through AI processing to user presentation.

### 5.1 Overview: Seven Data Flows

```
+=============================================================================+
|                        SEVEN DATA FLOW PATHWAYS                              |
+=============================================================================+
|                                                                              |
|  Flow 1: Camera --> DVR --> Edge Gateway                                    |
|          [Analog/Digital] -> [H.264 Encode] -> [RTSP Server]                |
|                                                                              |
|  Flow 2: Edge Gateway --> VPN --> Cloud Kafka                               |
|          [FFmpeg ingest] -> [Frame extract] -> [Kafka Producer]             |
|                                                                              |
|  Flow 3: Stream Ingestion --> AI Inference                                  |
|          [Kafka Consumer] -> [GPU Batch] -> [Detection + Face Recog.]       |
|                                                                              |
|  Flow 4: AI Inference --> Events --> Database                               |
|          [Detection results] -> [Event enrich] -> [PostgreSQL]              |
|                                                                              |
|  Flow 5: Events --> Alerts --> Notifications                                |
|          [Scoring engine] -> [Alert create] -> [Multi-channel send]         |
|                                                                              |
|  Flow 6: Live Streams --> Browser Dashboard                                 |
|          [HLS segmenter] -> [Nginx relay] -> [HLS.js player]                |
|                                                                              |
|  Flow 7: Training Feedback Loop                                             |
|          [Operator review] -> [Conflict detect] -> [Model update]           |
|                                                                              |
+=============================================================================+
```

### 5.2 Flow 1: Camera to DVR to Edge Gateway

**Path:** Analog/Digital Camera -> DVR internal encoder -> DVR RTSP server -> Edge Gateway FFmpeg client

**Protocol Stack:**

| **Layer** | **Technology** | **Details** |
|---|---|---|
| Camera Interface | Analog BNC / CVBS / AHD | CP PLUS DVR supports multiple analog standards |
| DVR Encoding | H.264 High Profile | Hardware encoder, real-time, low latency |
| DVR Storage | Internal HDD (currently FULL) | 0 bytes free — no local recording possible |
| Network Transport | RTSP over TCP (interleaved) | Mandatory for reliable NAT/VPN traversal |
| URL Pattern | `rtsp://admin:{PASS}@192.168.29.200:554/cam/realmonitor?channel=N&subtype=M` | N=1-8, M=0(main)/1(sub) |
| Client | FFmpeg 6.0+ | `-rtsp_transport tcp -timeout 5000000` (socket timeout in microseconds; the older `-stimeout` option was removed in FFmpeg 5.0) |
| Frame Rate | 25 FPS (PAL) or 30 FPS (NTSC) | Configurable per channel |
| Resolution (main) | 960 x 1080 (per channel) | Full resolution |
| Resolution (sub) | 352 x 288 to 704 x 576 | Lower bandwidth for AI |
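
The URL pattern in the table can be generated per channel instead of being written by hand. The helper below is a sketch that assumes the DVR address and port from the table and percent-encodes the credentials so that characters such as `@` or `:` in a password do not break the URL. The function name `rtsp_url` is illustrative.

```python
from urllib.parse import quote

DVR_HOST = "192.168.29.200"
RTSP_PORT = 554


def rtsp_url(channel: int, subtype: int, user: str, password: str) -> str:
    """Build the DVR RTSP URL for a channel (1-8) and subtype (0=main, 1=sub)."""
    if not 1 <= channel <= 8:
        raise ValueError("channel must be in 1..8")
    if subtype not in (0, 1):
        raise ValueError("subtype must be 0 (main) or 1 (sub)")
    # Percent-encode credentials; safe="" also encodes "/" and "@".
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return (
        f"rtsp://{credentials}@{DVR_HOST}:{RTSP_PORT}"
        f"/cam/realmonitor?channel={channel}&subtype={subtype}"
    )
```

For example, `rtsp_url(3, 1, "admin", secret)` addresses the sub stream of channel 3. In practice the password would come from a secret store rather than source code.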

**FFmpeg RTSP Connection Command:**

```bash
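# Record channel 1 (main stream) into 60-second MKV segments without re-encoding.
# DVR_PASS is read from the environment. -timeout is the socket timeout in
# microseconds; it replaces -stimeout, which was removed in FFmpeg 5.0.
ffmpeg -hide_banner -loglevel warning \
    -rtsp_transport tcp \
    -timeout 5000000 \
    -fflags +genpts+discardcorrupt+igndts+ignidx \
    -reorder_queue_size 64 \
    -i "rtsp://admin:${DVR_PASS}@192.168.29.200:554/cam/realmonitor?channel=1&subtype=0" \
    -c copy -f segment -segment_time 60 -reset_timestamps 1 \
    -strftime 1 "/data/buffer/ch1/%Y%m%d_%H%M%S.mkv"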
```

**Latency Budget:**

| **Stage** | **Latency** |
|---|---|
| Camera -> DVR (analog) | ~1-5 ms |
| DVR encoding | ~50-100 ms |
| RTSP over LAN | ~1-2 ms |
| **Total (camera to edge gateway)** | **~52-107 ms** |

### 5.3 Flow 2: Edge Gateway to VPN Tunnel to Cloud

**Path:** Edge Gateway FFmpeg -> Frame extraction -> JPEG encoding -> Kafka Producer -> WireGuard VPN -> Cloud MSK

**Frame Processing Pipeline:**

```
+------------+    +-------------+    +---------------+    +-------------+    +-----------+
| Raw RTSP   | -> | FFmpeg      | -> | Frame         | -> | JPEG        | -> | Kafka     |
| H.264      |    | Demux/Decode|    | Decimation    |    | Encoder     |    | Producer  |
| 25 FPS     |    |             |    | (1 fps)       |    | Quality 85  |    | (LZ4)     |
| 960x1080   |    |             |    | 640x640 pad   |    |             |    |           |
+------------+    +-------------+    +---------------+    +-------------+    +-----------+
```

**FFmpeg Frame Extraction for AI:**

```bash
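# Decimate channel 1 to 1 fps, letterbox to 640x640, and write JPEG frames to stdout.
# DVR_PASS is read from the environment; -timeout replaces -stimeout (removed in FFmpeg 5.0).
ffmpeg -hide_banner -loglevel warning \
    -rtsp_transport tcp -timeout 5000000 \
    -fflags +genpts+discardcorrupt -reorder_queue_size 64 \
    -i "rtsp://admin:${DVR_PASS}@192.168.29.200:554/cam/realmonitor?channel=1&subtype=0" \
    -vf "fps=1,scale=640:640:force_original_aspect_ratio=decrease,pad=640:640:(ow-iw)/2:(oh-ih)/2:black" \
    -q:v 5 -f image2pipe -vcodec mjpeg pipe:1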
```
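
Because the `scale` and `pad` filters above letterbox the 960x1080 frame into 640x640, detection boxes returned by the model are expressed in padded coordinates and need to be mapped back before they are drawn on the original frame. The sketch below computes the letterbox geometry and the inverse mapping. It is an approximation: FFmpeg's own rounding may differ by a pixel, and both function names are illustrative.

```python
from typing import Tuple


def letterbox(src_w: int, src_h: int, dst: int = 640) -> Tuple[int, int, int, int]:
    """Return (scaled_w, scaled_h, pad_x, pad_y) for an aspect-preserving fit."""
    scale = min(dst / src_w, dst / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_x = (dst - new_w) // 2  # left padding, mirrors (ow-iw)/2
    pad_y = (dst - new_h) // 2  # top padding, mirrors (oh-ih)/2
    return new_w, new_h, pad_x, pad_y


def to_source_coords(x: float, y: float, src_w: int, src_h: int,
                     dst: int = 640) -> Tuple[float, float]:
    """Map a point from the padded dst x dst image back to source pixels."""
    new_w, new_h, pad_x, pad_y = letterbox(src_w, src_h, dst)
    return (x - pad_x) * src_w / new_w, (y - pad_y) * src_h / new_h
```

For the 960x1080 main stream the image is scaled to about 569x640 with roughly 35 pixels of black padding on the left and right.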

**WireGuard VPN Tunnel Configuration:**

| **Parameter** | **Value** |
|---|---|
| Protocol | UDP 51820 |
| Encryption | ChaCha20-Poly1305 |
| Key Exchange | Curve25519 (ECDH) |
| Preshared Key | Enabled per-peer |
| Keepalive | 25 seconds |
| MTU | 1400 (to account for WireGuard + IP headers) |
| Cloud Endpoint | 10.200.0.1/32 (EC2 WireGuard host; an ALB cannot forward UDP, so use an NLB if a load balancer is required) |
| Edge Endpoint | 10.200.0.2/32 |
| Route | 10.100.0.0/16 (AWS VPC, per Section 4.6) accessible from edge |

**VPN watchdog script** runs every 30 seconds; restarts WireGuard on 3 consecutive ping failures.
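
A minimal sketch of that watchdog logic follows. It assumes a Linux host with iputils `ping`, a WireGuard interface named `wg0` managed by the `wg-quick@` systemd unit, and the cloud peer address from the table. The interface name and all identifiers are assumptions made for illustration.

```python
import subprocess
import time
from typing import Callable


def ping_ok(host: str) -> bool:
    """Send one ICMP echo with a 2-second timeout (Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_wireguard(interface: str = "wg0") -> None:
    """Restart the tunnel; the interface name is an assumption."""
    subprocess.run(["systemctl", "restart", f"wg-quick@{interface}"], check=False)


class VpnWatchdog:
    """Restart the tunnel after `max_failures` consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], restart: Callable[[], None],
                 max_failures: int = 3) -> None:
        self._probe = probe
        self._restart = restart
        self._max_failures = max_failures
        self._failures = 0

    def tick(self) -> bool:
        """Run one probe; return True if a restart was triggered."""
        if self._probe():
            self._failures = 0
            return False
        self._failures += 1
        if self._failures >= self._max_failures:
            self._restart()
            self._failures = 0
            return True
        return False


def run_forever(peer: str = "10.200.0.1", interval_s: float = 30.0) -> None:
    """Probe the cloud peer every `interval_s` seconds."""
    watchdog = VpnWatchdog(lambda: ping_ok(peer), restart_wireguard)
    while True:
        watchdog.tick()
        time.sleep(interval_s)
```

Keeping the probe and restart actions injectable makes the counting logic testable without touching the network or systemd.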

**Latency Budget:**

| **Stage** | **Latency** |
|---|---|
| Frame extraction (FFmpeg) | ~50-100 ms |
| JPEG encoding | ~5-10 ms |
| Kafka produce (local) | ~1-2 ms |
| WireGuard tunnel | ~5-15 ms (Mumbai -> India site) |
| MSK broker | ~1-2 ms |
| **Total (edge to cloud Kafka)** | **~62-129 ms** |
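
The total row is the sum of the per-stage lower and upper bounds. Keeping the budget as data makes it easy to re-check when a stage estimate changes. The dictionary keys below are illustrative labels for the stages in the table.

```python
from typing import Dict, Tuple

# (low_ms, high_ms) per stage, copied from the Flow 2 latency budget
FLOW2_BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "frame_extraction": (50, 100),
    "jpeg_encoding": (5, 10),
    "kafka_produce_local": (1, 2),
    "wireguard_tunnel": (5, 15),
    "msk_broker": (1, 2),
}


def total_budget(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the lower and upper bounds of all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high
```

`total_budget(FLOW2_BUDGET_MS)` returns `(62, 129)`, which agrees with the stated total.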

### 5.4 Flow 3: Stream Ingestion to AI Inference

**Path:** Kafka `streams.raw` topic -> Stream Ingestion consumer -> Triton Inference Server -> Kafka `ai.detections` topic

**Pipeline Architecture:**

```
+------------+    +-------------------+    +------------------+    +---------------+
| streams.raw| -> | Stream Ingestion  | -> | NVIDIA Triton    | -> | ai.detections |
| (8 parts)  |    | (Go consumer)     |    | (GPU inference)  |    | (16 parts)    |
| JPEG frames|    | Batch aggregator  |    | gRPC :8001       |    | Detection     |
| + metadata |    | (batch=8, timeout)|    | Dynamic batching |    | + embeddings  |
+------------+    +-------------------+    +------------------+    +---------------+
```
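
The batch aggregator in the diagram flushes either when 8 frames have accumulated or when a timeout expires, so a quiet camera cannot stall the batch. The production consumer is written in Go; the Python sketch below only illustrates the size-or-timeout rule. The 50 ms default is an assumed value, since the diagram does not state the timeout, and the class name is hypothetical.

```python
import time
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class BatchAggregator(Generic[T]):
    """Collect items and release them as a batch on size or age."""

    def __init__(self, max_size: int = 8, max_wait_s: float = 0.05,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size
        self.max_wait_s = max_wait_s  # assumed value, not from the blueprint
        self._clock = clock
        self._items: List[T] = []
        self._first_at: Optional[float] = None

    def add(self, item: T) -> Optional[List[T]]:
        """Add an item; return a full batch once max_size is reached."""
        if not self._items:
            self._first_at = self._clock()
        self._items.append(item)
        if len(self._items) >= self.max_size:
            return self._flush()
        return None

    def poll(self) -> Optional[List[T]]:
        """Return a partial batch if the oldest item has waited max_wait_s."""
        if self._items and self._first_at is not None:
            if self._clock() - self._first_at >= self.max_wait_s:
                return self._flush()
        return None

    def _flush(self) -> List[T]:
        batch, self._items = self._items, []
        self._first_at = None
        return batch
```

The consumer loop would call `add()` for every frame and `poll()` on a timer, sending any returned batch to Triton in one request.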

**Triton Model Configuration:**

| **Model** | **Inputs** | **Outputs** | **GPU Memory** | **Latency (P50)** |
|---|---|---|---|---|
| YOLO11m-det (TensorRT FP16) | 3x640x640 float16 | Bboxes, scores, labels | ~2.1 GB | 12 ms |
| SCRFD-500M (TensorRT FP16) | 3x640x640 float16 | Bboxes, landmarks, scores | ~1.8 GB | 8 ms |
| ArcFace R100 (TensorRT FP16) | 3x112x112 float16 | 512-D embedding | ~3.2 GB | 5 ms |

**Total GPU memory: ~7.1 GB (fits in T4 16 GB with 8 streams)**

**Latency Budget:**

| **Stage** | **Latency** |
|---|---|
| Kafka consume (batch) | ~10-50 ms |
| Preprocessing (resize, normalize) | ~5-15 ms |
| YOLO11m inference (GPU) | ~12 ms (P50) |
| SCRFD face detection (GPU) | ~8 ms (P50) |
| ArcFace embedding (GPU, per face) | ~5 ms (P50) |
| Post-processing (NMS, matching) | ~10-30 ms |
| Kafka produce (results) | ~1-2 ms |
| **Total (Kafka to detection output)** | **~51-122 ms** |

### 5.5 Flow 4: AI Inference to Events to Database

**Path:** AI Detection results -> Event enricher -> PostgreSQL (multiple tables)

**Data Transformation:**

```
+------------+    +-------------------+    +---------------------+    +------------+
| Detection  | -> | Event Enricher    | -> | PostgreSQL Writer   | -> | events     |
| results    |    | - Add camera_id   |    | - UPSERT person     |    | persons    |
| (raw)      |    | - Match person    |    | - INSERT event      |    | embeddings |
|            |    | - Check whitelist |    | - INSERT embedding  |    | face_crops |
+------------+    +-------------------+    +---------------------+    +------------+
```

**Database Write Operations per Detection:**

| **Operation** | **Table** | **Type** | **Notes** |
|---|---|---|---|
| Insert event record | `events` | INSERT | With bounding box, confidence, timestamp |
| Upsert person | `persons` | INSERT/UPDATE | If new face, create person record |
| Insert face crop | `face_crops` | INSERT | S3 URL, bounding box, quality score |
| Upsert embedding | `face_embeddings` | INSERT/UPDATE | 512-D vector, pgvector HNSW index |
| Increment counters | `camera_stats` | UPDATE | Daily aggregation |

### 5.6 Flow 5: Events to Alerts to Notifications

**Path:** AI events -> Suspicious Activity scoring engine -> Alert creation -> Notification dispatch

**Scoring and Escalation:**

```
+------------+    +---------------------+    +------------------+    +--------------+
| AI events  | -> | Suspicious Activity | -> | Alert Manager    | -> | Notification |
| (persons,  |    | Scoring Engine      |    | - Deduplicate    |    | Service      |
|  faces)    |    | - 10 modules        |    | - Rate limit     |    | - Telegram   |
|            |    | - Composite score   |    | - Suppress dup   |    | - WhatsApp   |
|            |    | - Time decay        |    | - Escalation     |    | - Email      |
+------------+    +---------------------+    +------------------+    +--------------+
```

**Alert Escalation Matrix:**

| **Score** | **Level** | **Color** | **Notification** | **Action** |
|---|---|---|---|---|
| 0.00 to < 0.20 | NONE | Gray | None | Log only |
| 0.20 to < 0.40 | LOW | Blue | Dashboard only | Log + indicator |
| 0.40 to < 0.60 | MEDIUM | Yellow | Dashboard + App push | Alert dispatched |
| 0.60 to < 0.80 | HIGH | Orange | All of above + Telegram | Immediate alert |
| 0.80 to 1.00 | CRITICAL | Red | All of above + WhatsApp + Email | Critical alert |
| > 1.00 | EMERGENCY | Purple + flashing | All channels + SMS | Emergency dispatch |
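
The matrix maps directly to a threshold lookup. The sketch below assumes that each band includes its lower bound, that a score of exactly 1.00 is still CRITICAL, and that scores are never negative. The function name `alert_level` is illustrative.

```python
def alert_level(score: float) -> str:
    """Map a composite suspicion score to its escalation level."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score < 0.20:
        return "NONE"
    if score < 0.40:
        return "LOW"
    if score < 0.60:
        return "MEDIUM"
    if score < 0.80:
        return "HIGH"
    if score <= 1.00:
        return "CRITICAL"
    return "EMERGENCY"  # only reachable when the composite score exceeds 1.00
```

For instance, `alert_level(0.65)` returns `"HIGH"`, which per the matrix triggers dashboard, app push, and Telegram notifications.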

### 5.7 Flow 6: Live Streams to Browser Dashboard

**Path:** DVR RTSP -> Edge Gateway FFmpeg -> HLS segmenter -> Nginx -> CDN -> Browser HLS.js

```
+--------+    +---------------+    +---------------+    +---------+    +----------+
| DVR    | -> | Edge Gateway  | -> | HLS Segmenter | -> | Nginx   | -> | Browser  |
| RTSP   |    | FFmpeg        |    | (2s segments) |    | (relay) |    | HLS.js   |
| 25 FPS |    | -copyts       |    | H.264 + AAC   |    | HTTPS   |    | Video tag|
+--------+    +---------------+    +---------------+    +---------+    +----------+
```

**HLS Configuration:**

| **Parameter** | **Value** |
|---|---|
| Segment duration | 2 seconds |
| Segment list size | 5 segments (10-second sliding window) |
| Playlist type | Live (no #EXT-X-ENDLIST) |
| Codec | H.264 High Profile + AAC-LC |
| Adaptive bitrate | 3 variants: high (3 Mbps), mid (1 Mbps), low (500 Kbps) |

**Latency:**

| **Stage** | **Latency** |
|---|---|
| DVR encoding | ~50-100 ms |
| RTSP to edge | ~1-2 ms |
| FFmpeg demux/remux | ~20-50 ms |
| HLS segmenting (2s) | ~2000 ms |
| Nginx relay | ~1-5 ms |
| CDN propagation | ~10-50 ms |
| HLS.js buffer | ~1-2 segments (2-4s) |
| Browser decode | ~20-50 ms |
| **Total (camera to eye)** | **~4.1 - 6.3 seconds** |

### 5.8 Flow 7: Training Feedback Loop

**Path:** Operator review actions -> Conflict detection -> Training dataset -> Model training -> Quality gates -> Deployment

```
+------------+    +------------------+    +----------------+    +-------------+    +------------+
| Operator   | -> | Conflict         | -> | Training       | -> | Quality     | -> | Deployment |
| Review     |    | Detection        |    | Dataset        |    | Gates       |    | (A/B test) |
| (confirm,  |    | (5 types)        |    | - Curate       |    | - Precision |    |            |
|  correct,  |    | - Block conflicts|    | - Label        |    |   >= 0.97   |    |            |
|  merge,    |    | - Queue safe     |    | - Augment      |    | - Recall    |    |            |
|  reject)   |    |   additions      |    | - Version      |    |   >= 0.95   |    |            |
+------------+    +------------------+    +----------------+    +-------------+    +------------+
```
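
The quality gates in the diagram reduce to two threshold checks on the candidate model's evaluation metrics. The sketch below shows one way to express them, assuming precision and recall are measured on a held-out validation set. The type and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List

MIN_PRECISION = 0.97  # gate from the diagram
MIN_RECALL = 0.95     # gate from the diagram


@dataclass(frozen=True)
class EvalMetrics:
    precision: float
    recall: float


def failed_gates(metrics: EvalMetrics) -> List[str]:
    """Return a description of every gate the candidate model fails."""
    failures: List[str] = []
    if metrics.precision < MIN_PRECISION:
        failures.append(f"precision {metrics.precision:.3f} < {MIN_PRECISION}")
    if metrics.recall < MIN_RECALL:
        failures.append(f"recall {metrics.recall:.3f} < {MIN_RECALL}")
    return failures


def passes_quality_gates(metrics: EvalMetrics) -> bool:
    """A model may proceed to A/B deployment only if no gate fails."""
    return not failed_gates(metrics)
```

Returning the list of failed gates, rather than a bare boolean, gives the pipeline something specific to report when a training run is rejected.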

**Training Data Flow:**

| **Stage** | **Frequency** | **Trigger** |
|---|---|---|
| Review action collection | Continuous | Operator clicks on dashboard |
| Conflict detection | Immediate (synchronous) | Every review action |
| Training dataset build | Weekly (or on-demand) | Queue threshold or manual |
| Model training | On dataset build | Airflow DAG trigger |
| Quality gate evaluation | After training | Automated pipeline |
| A/B deployment | After quality pass | Admin approval |
| Full production | After A/B success | Auto-promote at 48h |

---

## Section 6: Recommended Tech Stack

### 6.1 Technology Selection Matrix

| **Layer** | **Technology** | **Version** | **Purpose** | **Rationale** |
|---|---|---|---|---|
| **Cloud Platform** | AWS | 2025 | Infrastructure (ap-south-1 Mumbai) | Best India region latency, mature managed services |
| **Container Orchestration** | Amazon EKS | v1.28+ | Managed Kubernetes control plane | GPU node support, Cluster Autoscaler |
| **Edge K8s** | K3s | v1.28+ | Lightweight Kubernetes at edge | Single binary, resource-efficient |
| **VPN** | WireGuard | v1.0+ | Encrypted tunnel between edge and cloud | ~60% faster than OpenVPN, modern crypto |
| **Reverse Proxy** | Traefik | v2.10+ | Kubernetes Ingress controller | Native K8s integration, automatic TLS |
| **AI Inference** | NVIDIA Triton | 2.40 | GPU model serving, dynamic batching | Multi-framework, TensorRT optimization |
| **CV Framework** | OpenCV | 4.8+ | Image processing, pre/post-processing | Industry standard, Python/Go bindings |
| **AI/ML Framework** | PyTorch | 2.1+ | Model training, custom inference | Ecosystem, CUDA 12 support |
| **Deep Learning** | TensorRT | 8.6+ | GPU-optimized inference for YOLO, SCRFD, ArcFace | FP16 support, 3-5x speedup |
| **Language: AI** | Python | 3.11 | AI inference, training, suspicious activity detection | Ecosystem, scientific computing |
| **Language: Services** | Go | 1.21 | Stream ingestion, backend API, notifications | Performance, concurrency, small binaries |
| **Language: Frontend** | TypeScript | 5.2 | Web dashboard | Type safety, React ecosystem |
| **Web Framework** | Next.js | 14 (App Router) | React SSR dashboard | Server components, streaming |
| **UI Library** | React | 18 | Component-based UI | Concurrent features, Suspense |
| **Styling** | Tailwind CSS | 3.4 | Utility-first CSS | Rapid development, consistent design |
| **Video Player** | HLS.js | 1.4 | Browser HLS playback | MSE-based, adaptive bitrate |
| **Database** | PostgreSQL | 16 | Primary database, vector storage | ACID, pgvector extension |
| **Vector Search** | pgvector | 0.5+ | HNSW index for 512-D face embeddings | Native PostgreSQL, ivfflat+hnsw |
| **Cache/Session** | Redis | 7 | Session store, pub/sub, rate limiting | Data structures, cluster mode |
| **Message Queue** | Apache Kafka | 3.6+ (MSK) | Durable event log, stream replay | Exactly-once, retention, partitions |
| **Object Storage** | MinIO | latest (RELEASE.2024) | S3-compatible hot storage | Edge + cloud, erasure coding |
| **Cold Archive** | Amazon S3 | Standard/IA/Glacier | Tiered archival (30d/90d/365d) | Cost optimization |
| **Model Registry** | MLflow | 2.8+ | Model versioning, experiment tracking | Open source, S3 artifact store |
| **Orchestration** | Apache Airflow | 2.7+ | Training pipeline DAGs | Backfill, retries, observability |
| **Monitoring** | Prometheus | 2.47+ | Metrics collection | Pull-based, K8s service discovery |
| **Visualization** | Grafana | 10.1+ | Dashboards, alerting | Panels, annotations, shared links |
| **Log Aggregation** | Grafana Loki | 2.9+ | Centralized logging | Label-based, cost-effective |
| **CI/CD** | GitHub Actions | v4 | Build, test, lint pipelines | Native GitHub integration |
| **GitOps** | ArgoCD | 2.9+ | Kubernetes continuous delivery | Declarative, drift detection |
| **Infrastructure** | Terraform | 1.6+ | IaC for AWS resources | State management, modules |
| **Secrets** | AWS Secrets Manager | - | Encrypted credential storage | Rotation, IAM integration |

### 6.2 Hardware Requirements

#### Edge Gateway (Per Site)

| **Component** | **Minimum** | **Recommended** | **High Availability** |
|---|---|---|---|
| CPU | Intel i5-1340P (12 cores) | Intel i7-1370P (14 cores) | 2x Intel i7 (HA cluster) |
| RAM | 16 GB DDR4-3200 | 32 GB DDR4-3200 | 32 GB per node |
| Storage | 1 TB NVMe SSD | 2 TB NVMe SSD | 2 TB per node + NAS sync |
| Network | 1 Gbps Ethernet | 2.5 Gbps Ethernet | Dual NIC + bonding |
| GPU (optional) | None | NVIDIA Jetson Orin NX 16GB | On-edge AI pre-filtering |
| Power | UPS 600VA | UPS 1000VA | Dual PSU + generator |

#### Cloud GPU Nodes (AI Inference)

| **Cameras** | **GPU** | **VRAM** | **Streams** | **Cost/month (spot)** |
|---|---|---|---|---|
| 1-8 | g4dn.xlarge (T4) | 16 GB | 8 | ~$200-350 |
| 9-16 | g4dn.xlarge (T4) | 16 GB | 16 | ~$350-500 |
| 17-32 | g4dn.2xlarge (T4) | 16 GB | 32 | ~$600-900 |
| 33-64 | g5.2xlarge (A10G) | 24 GB | 64 | ~$1200-1800 |
| 65+ | p4d.24xlarge (8x A100) | 8 x 40 GB | 128 | ~$5000-8000 |
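
The sizing table can be read as a simple lookup from camera count to instance type. The sketch below is illustrative only: `GPU_TIERS` and `pick_gpu_tier` are hypothetical names, and assigning a boundary count (8, 16, 32, 64) to the smaller tier is an assumption rather than a stated rule.

```python
# Tier table copied from the sizing table above: (max cameras, instance type, GPU).
# The last tier has no upper bound, so None marks "everything above 64 cameras".
GPU_TIERS = [
    (8, "g4dn.xlarge", "T4"),
    (16, "g4dn.xlarge", "T4"),
    (32, "g4dn.2xlarge", "T4"),
    (64, "g5.2xlarge", "A10G"),
    (None, "p4d.24xlarge", "A100"),
]


def pick_gpu_tier(cameras: int) -> tuple[str, str]:
    """Return (instance_type, gpu) for a camera count, preferring the smaller tier."""
    if cameras < 1:
        raise ValueError("camera count must be at least 1")
    for max_cameras, instance_type, gpu in GPU_TIERS:
        if max_cameras is None or cameras <= max_cameras:
            return instance_type, gpu
    raise AssertionError("unreachable: the last tier is unbounded")


for count in (8, 24, 40, 100):
    print(count, pick_gpu_tier(count))
```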

### 6.3 Software Versions Summary

| **Category** | **Software** | **Version** |
|---|---|---|
| Operating System | Ubuntu Server LTS | 22.04.4 |
| Container Runtime | Docker CE | 25.x |
| Container Orchestration | Kubernetes (EKS/K3s) | 1.28+ |
| AI Serving | NVIDIA Triton Inference Server | 2.40 |
| GPU Runtime | CUDA | 12.1+ |
| GPU Driver | NVIDIA Driver | 535+ |
| Deep Learning Optimization | TensorRT | 8.6+ |
| AI Framework | PyTorch | 2.1+ |
| Computer Vision | OpenCV | 4.8+ |
| Video Processing | FFmpeg | 6.0+ |
| Service Language | Go | 1.21+ |
| AI/Training Language | Python | 3.11+ |
| Frontend Framework | Next.js | 14 |
| UI Library | React | 18 |
| Database | PostgreSQL | 16 |
| Message Queue | Apache Kafka | 3.6+ |
| Cache | Redis | 7 |
| Object Storage | MinIO | 2024+ |
| CI/CD | GitHub Actions | v4 |
| GitOps | ArgoCD | 2.9+ |
| Monitoring | Prometheus + Grafana | 2.47+ / 10.1+ |
| Logging | Grafana Loki | 2.9+ |
| VPN | WireGuard | 1.0+ |
| Model Registry | MLflow | 2.8+ |
| Orchestration | Apache Airflow | 2.7+ |
| Infrastructure | Terraform | 1.6+ |

### 6.4 Port Reference

| **Service** | **Port** | **Protocol** | **Location** | **Notes** |
|---|---|---|---|---|
| DVR RTSP | 554 | TCP | 192.168.29.200 | Local network only |
| DVR HTTP | 80 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR HTTPS | 443 | TCP | 192.168.29.200 | Admin UI, local only |
| DVR TCP | 25001 | TCP | 192.168.29.200 | Proprietary protocol |
| DVR UDP | 25002 | UDP | 192.168.29.200 | Proprietary protocol |
| DVR NTP | 123 | UDP | 192.168.29.200 | Time sync |
| WireGuard | 51820 | UDP | Cloud + Edge | VPN tunnel |
| Edge Admin | 8080 | TCP | 192.168.29.5 | Local admin UI |
| Edge SSH | 22 | TCP | 192.168.29.5 | Admin access only |
| Traefik HTTP | 8000 | TCP | EKS | Internal HTTP entrypoint |
| Traefik HTTPS | 8443 | TCP | EKS | Internal HTTPS entrypoint |
| ALB HTTPS | 443 | TCP | AWS | Public-facing |
| Backend API | 8080 | TCP | EKS pods | Internal service port |
| Triton HTTP | 8000 | TCP | EKS GPU nodes | Model inference HTTP |
| Triton gRPC | 8001 | TCP | EKS GPU nodes | Model inference gRPC |
| Triton Metrics | 8002 | TCP | EKS GPU nodes | Prometheus metrics |
| PostgreSQL | 5432 | TCP | RDS | VPC-private |
| Redis | 6379 | TCP | ElastiCache | VPC-private |
| Kafka | 9092 | TCP | MSK | VPC-private |
| MinIO API | 9000 | TCP | EKS + Edge | S3-compatible API |
| MinIO Console | 9001 | TCP | EKS + Edge | Admin console |
| Prometheus | 9090 | TCP | EKS | Metrics collection |
| Grafana | 3000 | TCP | EKS | Dashboards |

---


## Section 7: Database Schema

### 7.1 Schema Overview

The database is designed around a **relational core (PostgreSQL 16)** with **pgvector** extension for 512-dimensional face embedding storage and similarity search. The schema consists of **29 tables**, **4 views**, and **8 trigger functions**, organized into 10 logical domains.

**Schema Philosophy:**
- **Strict normalization** for reference data (cameras, persons, rules) to ensure data integrity
- **JSONB flexibility** for event metadata and configuration to accommodate evolving AI outputs
- **Partitioning** on all high-volume time-series tables for query performance and lifecycle management
- **pgvector HNSW indexing** for sub-10ms face similarity search at scale
- **Row-level security (RLS)** for multi-tenant site isolation
- **AES-256 encryption** for all stored credentials (DVR passwords, API tokens)

### 7.2 Entity Relationship Overview

```
+=============================================================================+
|                    ENTITY RELATIONSHIP DIAGRAM                               |
+=============================================================================+
|                                                                              |
|   SITE (1) --------------------< (N) DVR                                     |
|    |                              |                                          |
|    |                              | (1)                                      |
|    |                              v                                          |
|    |                           CAMERA (N) <------------------< (N) ALERT_RULE|
|    |                              |                              |           |
|    |                              | (N)                            | (1)      |
|    |                              v                              v           |
|    |   +---------------------------------------------------------+           |
|    |   | EVENT (N) -->--(1) PERSON (1)--< (N) FACE_EMBEDDING               |
|    |   |   |                                                      |         |
|    |   |   | (N)                                                  | (N)     |
|    |   |   v                                                      v         |
|    |   | FACE_CROP (N)                                    PERSON_CLUSTER     |
|    |   |   |                                                                  |
|    |   |   | (N)                                                  +---------+|
|    |   |   v                                                      | Training||
|    |   | MEDIA_FILE (1) ----------------------------------------->| Dataset  ||
|    |   |                                                          |---------||
|    |   +--------------------------------------------------------->| Job      ||
|    |                                                              | Model    ||
|    |                              +---------+                     | Version  ||
|    |                              | Review  |                     +---------+|
|    |                              | Action  |                                |
|    |                              +---------+                                |
|    |                                    ^                                    |
|    |                                    | (N)                                |
|    +------------------------------------+                                    |
|   USER (N) -->--(N) ROLE_PERMISSION                                          |
|    |                                                                         |
|    | (1)                                                                     |
|    v                                                                         |
|   WATCHLIST (N) -->--(N) WATCHLIST_ENTRY                                     |
|                                                                              |
|   +---------+    +---------+    +---------+    +---------+                  |
|   | Telegram|    |WhatsApp |    | Email   |    |Webhook  |                  |
|   | Config  |    | Config  |    | Config  |    | Config  |                  |
|   +---------+    +---------+    +---------+    +---------+                  |
|        ^              ^             ^              ^                         |
|        |              |             |              |                         |
|        +--------------+-------------+--------------+                         |
|                         |                                                    |
|                   NOTIFICATION_CHANNEL                                         |
|                         |                                                    |
|                         | (1)                                                |
|                         v                                                    |
|                   NOTIFICATION_LOG                                             |
|                                                                              |
|   +---------+    +---------+    +---------+                                  |
|   | Audit   |    | System  |    | Device  |                                  |
|   | Log     |    | Health  |    | Connect.|                                  |
|   |(partitioned) |  Log    |    |  Log    |                                  |
|   +---------+    +---------+    +---------+                                  |
|                                                                              |
+=============================================================================+
```

### 7.3 Core Tables Summary

#### 7.3.1 Site and Infrastructure Tables

| **Table** | **Purpose** | **Key Fields** | **Rows (est.)** |
|---|---|---|---|
| `sites` | Physical locations (factories, warehouses) | id, name, location, timezone, settings | 1-10 |
| `dvrs` | DVR/NVR devices per site | id, site_id, ip_address, port, username, password_encrypted, model, channels, status | 1-10 |
| `cameras` | Individual camera channels | id, dvr_id, channel_number, name, rtsp_url, resolution, fps, status, zone_config, zone_description | 8-64 |

#### 7.3.2 AI Detection and Identity Tables

| **Table** | **Purpose** | **Key Fields** | **Rows (est.)** |
|---|---|---|---|
| `events` | All AI detection events (partitioned monthly) | id, camera_id, event_type, timestamp, confidence, bounding_box, person_id, face_crop_id, track_id | 1M-10M/month |
| `persons` | Known and unknown individuals | id, name, status (known/unknown/blacklisted), role, company, notes, created_at | 100-10,000 |
| `face_crops` | Cropped face images metadata | id, event_id, person_id, storage_path, bounding_box, quality_score, blur_score, pose_yaw, pose_pitch | 500K-5M/month |
| `face_embeddings` | 512-D face embeddings (pgvector) | id, person_id, face_crop_id, embedding (vector(512)), model_version, is_primary | 500K-5M |
| `person_clusters` | Unknown person cluster groups | id, cluster_label, representative_embedding_id, sample_count, first_seen, last_seen, status | 10-1,000 |
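
As a rough capacity check for `face_embeddings`, the raw vector payload can be estimated from the row counts above. pgvector stores a `vector` value in 4 bytes per dimension plus 8 bytes of header. The sketch below covers only that payload; row overhead, the other columns, and the HNSW index are excluded, and the function names are illustrative.

```python
def vector_bytes(dimensions: int) -> int:
    """Bytes pgvector needs for one vector value: 4 bytes per float plus an 8-byte header."""
    return 4 * dimensions + 8


def embedding_storage_gib(rows: int, dimensions: int = 512) -> float:
    """Raw vector payload in GiB for a given row count (index and row overhead excluded)."""
    return rows * vector_bytes(dimensions) / 2**30


for rows in (500_000, 5_000_000):
    print(f"{rows:>9,} rows -> {embedding_storage_gib(rows):.2f} GiB of raw vector data")
```

The HNSW index adds further memory on top of this and should be sized separately.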

#### 7.3.3 Alert and Notification Tables

| **Table** | **Purpose** | **Key Fields** | **Rows (est.)** |
|---|---|---|---|
| `alert_rules` | Per-camera alert configuration | id, camera_id, rule_type, name, config_json, schedule, enabled | 50-500 |
| `alerts` | Generated alert records | id, camera_id, rule_id, person_id, alert_type, severity, status, message | 1K-50K/month |
| `notification_channels` | Alert destination endpoints | id, name, channel_type, config_json, is_active | 5-20 |
| `telegram_configs` | Telegram Bot API credentials | id, channel_id, bot_token_encrypted, chat_id | 1-5 |
| `whatsapp_configs` | WhatsApp Business API credentials | id, channel_id, api_key_encrypted, phone_number_id | 1-5 |
| `notification_log` | Delivery status per notification | id, alert_id, channel_id, status, sent_at, error_message | 1K-50K/month |

#### 7.3.4 Watchlist and Access Control Tables

| **Table** | **Purpose** | **Key Fields** | **Rows (est.)** |
|---|---|---|---|
| `users` | Dashboard users and operators | id, username, email, password_hash, role, is_active | 5-50 |
| `roles` | Permission roles | id, name, permissions_json | 3-10 |
| `watchlists` | Named monitoring lists | id, name, watch_type (vip/blacklist/custom), is_active | 5-20 |
| `watchlist_entries` | Persons on watchlists | id, watchlist_id, person_id, added_by, added_at | 10-1,000 |

#### 7.3.5 Training and ML Pipeline Tables

| **Table** | **Purpose** | **Key Fields** | **Rows (est.)** |
|---|---|---|---|
| `training_datasets` | Curated face datasets for training | id, name, description, person_ids_json, sample_count, version, status | 10-100 |
| `training_jobs` | Model training job tracking | id, dataset_id, model_version_from, model_version_to, status, metrics_json | 10-100 |
| `model_versions` | Registry of trained model versions | id, version_string, training_job_id, metrics_json, is_production, is_rollback_available | 10-50 |
| `review_actions` | Operator review decisions | id, event_id, reviewer_id, action, from_person_id, to_person_id, notes | 1K-100K |

#### 7.3.6 Media and Storage Tables

| **Table** | **Purpose** | **Key Fields** | **Rows (est.)** |
|---|---|---|---|
| `media_files` | Registry of stored video/images | id, file_type, storage_path, size_bytes, checksum, camera_id, event_id, retention_until | 100K-1M |
| `video_clips` | Video clip metadata for incidents | id, media_file_id, start_time, end_time, camera_id, event_id, duration_seconds | 10K-100K |

#### 7.3.7 Audit and Monitoring Tables (Partitioned)

| **Table** | **Purpose** | **Partition** | **Retention** |
|---|---|---|---|
| `audit_logs` | All user and system actions | Monthly by timestamp | 1 year (Glacier) |
| `system_health_logs` | Component health metrics | Monthly by timestamp | 90 days |
| `device_connectivity_logs` | Camera/DVR connectivity events | Monthly by timestamp | 90 days |

### 7.4 Indexing Strategy

#### 7.4.1 pgvector HNSW Index (Critical Path)

```sql
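-- HNSW index for sub-10ms face similarity search
-- m and ef_construction are build-time parameters; ef_search is set per session at query time
CREATE INDEX idx_face_embeddings_hnsw
ON face_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);

-- ef_search controls recall/speed tradeoff (higher = more accurate, slower)
SET hnsw.ef_search = 128;

-- Query: Find top-K similar faces
-- $1 is the 512-D query embedding, bound by the application as a vector(512) parameter
-- Note: with an approximate index the is_primary filter is applied after the index scan,
-- so the query can return fewer than LIMIT rows
SELECT person_id, 1 - (embedding <=> $1) AS similarity
FROM face_embeddings
WHERE is_primary = true
ORDER BY embedding <=> $1
LIMIT 5;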
```

| **Parameter** | **Value** | **Rationale** |
|---|---|---|
| `m` | 16 | Number of bi-directional links per node (higher = better recall, more memory) |
| `ef_construction` | 128 | Build-time exploration factor (higher = better index quality) |
| `ef_search` (runtime SET) | 64-256 | Search-time exploration factor (SET hnsw.ef_search = 128) |
| Distance metric | Cosine distance (`<=>`); similarity = 1 - distance | Optimal for normalized face embeddings |
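
Because HNSW is an approximate index, the chosen `ef_search` value is usually validated against an exact search on a sample of embeddings. The following is a minimal sketch of such a check, not the platform's tooling: `top_k_exact` and `recall_at_k` are hypothetical helpers, and the 2-D vectors stand in for real 512-D embeddings.

```python
import math


def cosine_distance(a, b):
    """Same quantity the pgvector <=> operator returns: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_exact(query, rows, k=5):
    """Brute-force reference. rows is an iterable of (person_id, embedding)."""
    scored = [(pid, 1.0 - cosine_distance(query, emb)) for pid, emb in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-K ids that the approximate result also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0


rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
exact = top_k_exact([1.0, 0.0], rows, k=2)
print(exact)
# Compare against ids returned by the HNSW query (hard-coded here for illustration).
print(recall_at_k(["a", "b"], [pid for pid, _ in exact]))
```

If recall on the sample is too low, raising `hnsw.ef_search` trades latency for accuracy.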

#### 7.4.2 B-Tree Indexes (Standard Queries)

| **Table** | **Index** | **Purpose** |
|---|---|---|
| `events` | `(camera_id, timestamp DESC)` | Time-range queries per camera |
| `events` | `(event_type, timestamp DESC)` | Filter by event type |
| `events` | `(person_id)` WHERE person_id IS NOT NULL | Person event lookup |
| `face_crops` | `(person_id, quality_score DESC)` | Best quality face per person |
| `alerts` | `(status, created_at DESC)` | Pending alerts by age |
| `alerts` | `(severity, status)` | Critical alert dashboard |
| `persons` | `(status, name)` | Person directory with status filter |
| `persons` | `(created_at DESC)` | Recently added persons |
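| `media_files` | `(retention_until)` WHERE retention_until IS NOT NULL | Expiring media cleanup (the job filters on `retention_until < NOW() + INTERVAL '7 days'`; a partial-index predicate cannot call `NOW()` because it is not immutable) |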

### 7.5 Partitioning Strategy

All high-volume time-series tables are partitioned **monthly** using pg_partman for automated partition management.

```
+-----------------------------------------------------------------------------+
|                    PARTITIONING ARCHITECTURE                                 |
+-----------------------------------------------------------------------------+
|                                                                              |
|   events (parent, empty)                                                     |
|   +-- events_y2024m01   (Jan 2024 data)                                     |
|   +-- events_y2024m02   (Feb 2024 data)                                     |
|   +-- events_y2024m03   (Mar 2024 data)                                     |
|   +-- events_y2024m04   (Apr 2024 data)                                     |
|   +-- events_y2024m05   (May 2024 data)  <-- Hot (in memory)               |
|   +-- events_default    (fallback)                                          |
|                                                                              |
|   Partition pruning: WHERE timestamp >= '2024-05-01'                        |
|                        AND timestamp <  '2024-06-01'                        |
|                      -> Only scans events_y2024m05                           |
|                      -> ~30x faster for time-range queries                  |
|                                                                              |
|   Managed by: pg_partman extension                                          |
|   - Auto-create: 2 months ahead                                             |
|   - Auto-drop: After retention period (detach + archive)                    |
|                                                                              |
+-----------------------------------------------------------------------------+
```

**Partitioned Tables:**

| **Table** | **Partition Key** | **Partition Type** | **Retention** |
|---|---|---|---|
| `events` | `timestamp` | Monthly RANGE | 90 days hot, 1 year archive |
| `audit_logs` | `timestamp` | Monthly RANGE | 1 year total |
| `system_health_logs` | `timestamp` | Monthly RANGE | 90 days |
| `device_connectivity_logs` | `timestamp` | Monthly RANGE | 90 days |
| `face_crops` | `created_at` | Monthly RANGE | 90 days hot, 1 year archive |
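
To make the monthly RANGE layout concrete, the sketch below generates the DDL for one partition using the `events_y2024m05` naming shown in the diagram. It only illustrates what the automation produces. It is not pg_partman's API, and `month_bounds` and `partition_ddl` are hypothetical helper names.

```python
from datetime import date


def month_bounds(year: int, month: int) -> tuple[date, date]:
    """Return the inclusive start and exclusive end of a calendar month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def partition_ddl(parent: str, year: int, month: int) -> str:
    """Build CREATE TABLE ... PARTITION OF for one monthly range partition."""
    start, end = month_bounds(year, month)
    name = f"{parent}_y{year}m{month:02d}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


print(partition_ddl("events", 2024, 5))
print(partition_ddl("audit_logs", 2024, 12))
```

The upper bound is exclusive, so adjacent partitions never overlap at month boundaries.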

### 7.6 Retention Policies

| **Data Tier** | **Storage** | **Duration** | **Lifecycle** |
|---|---|---|---|
| **Hot Tier** | PostgreSQL + MinIO | 0-30 days | Fast query, indexed, in-memory cache |
| **Warm Tier** | S3 Standard | 30-90 days | Available on-demand, still indexed |
| **Cold Tier** | S3 Infrequent Access | 90-365 days | Retrieval within minutes |
| **Archive Tier** | Glacier Deep Archive | 1-7 years | Retrieval within 12-48 hours |
| **Compliance** | Glacier Vault Lock | 7+ years | Immutable, legal hold |
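
The tier boundaries can be expressed as a function of record age. This is a simplified sketch under two assumptions: boundaries are half-open (day 30 is already warm), and tier depends on age alone. In practice the compliance tier would likely also depend on legal-hold status. `storage_tier` is a hypothetical name.

```python
def storage_tier(age_days: int) -> str:
    """Map a record's age in days to the retention tier from the table above."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 30:
        return "hot"         # PostgreSQL + MinIO
    if age_days < 90:
        return "warm"        # S3 Standard
    if age_days < 365:
        return "cold"        # S3 Infrequent Access
    if age_days < 7 * 365:
        return "archive"     # Glacier Deep Archive
    return "compliance"      # Glacier Vault Lock


for age in (3, 45, 200, 800, 3000):
    print(age, storage_tier(age))
```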

**Automated Cleanup:**

| **Task** | **Frequency** | **Mechanism** |
|---|---|---|
| Expire old event partitions | Daily (pg_partman) | DETACH PARTITION + S3 upload |
| Delete expired media files | Daily | Cron job: DELETE from media_files + MinIO removal |
| Purge old notification logs | Weekly | DELETE WHERE created_at < NOW() - INTERVAL '90 days' |
| Archive face crops to S3 | Daily | Lambda: copy to S3 IA, update storage_path |
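| Compress audit logs | Monthly | Dump detached partitions with zstd compression (`pg_dump --compress=zstd`, PostgreSQL 16+); in-table TOAST compression supports only pglz and lz4 |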
| Vacuum and analyze | Weekly (auto-vacuum) | PostgreSQL autovacuum daemon |

### 7.7 Security Considerations

#### 7.7.1 Credential Encryption

All sensitive credentials stored with **AES-256** encryption:

| **Table** | **Encrypted Field** | **Encryption** |
|---|---|---|
| `dvrs` | `password_encrypted` | AES-256-CBC, key from AWS Secrets Manager |
| `telegram_configs` | `bot_token_encrypted` | AES-256-CBC |
| `whatsapp_configs` | `api_key_encrypted` | AES-256-CBC |

#### 7.7.2 Row-Level Security (RLS)

For multi-site deployments, RLS policies enforce that users only see data for sites they have access to:

```sql
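-- Enable RLS on critical tables
-- Note: once RLS is enabled, a table with no policy denies all rows to non-owner roles,
-- so persons and alerts need their own policies alongside the events policy below
ALTER TABLE events ENABLE ROW LEVEL SECURITY;
ALTER TABLE persons ENABLE ROW LEVEL SECURITY;
ALTER TABLE alerts ENABLE ROW LEVEL SECURITY;

-- Policy: Users see only data from their assigned sites
-- The application must set app.current_user_id on the session or transaction before querying
CREATE POLICY site_isolation_events ON events
    USING (camera_id IN (
        SELECT c.id FROM cameras c
        JOIN dvrs d ON c.dvr_id = d.id
        JOIN site_users su ON d.site_id = su.site_id
        WHERE su.user_id = current_setting('app.current_user_id')::UUID
    ));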
```
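
The policy relies on the application setting `app.current_user_id` before each query. One way to do this is PostgreSQL's `set_config(name, value, is_local)` function, sketched below. The helper name `rls_context_statement` is hypothetical, and the `%s` placeholders assume a psycopg-style driver. The sketch only builds the statement and does not open a connection.

```python
import uuid


def rls_context_statement(user_id: str) -> tuple[str, tuple[str, str]]:
    """Build a parameterized statement that sets app.current_user_id for one transaction."""
    # Validate early: the policy casts the setting to UUID and would fail on bad input.
    canonical = str(uuid.UUID(user_id))
    # is_local = true scopes the value to the current transaction, which matters
    # when pooled connections are reused across different users.
    return "SELECT set_config(%s, %s, true)", ("app.current_user_id", canonical)


sql, params = rls_context_statement("123e4567-e89b-12d3-a456-426614174000")
print(sql, params)
```

The statement would be executed at the start of each transaction, before any query against the RLS-protected tables.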

#### 7.7.3 Access Control

| **Role** | **Permissions** |
|---|---|
| `super_admin` | Full access to all sites, all operations |
| `site_admin` | Full access to assigned sites, user management |
| `operator` | View dashboards, acknowledge alerts, review persons |
| `viewer` | Read-only access to dashboards and events |

#### 7.7.4 Audit Trail

The `audit_logs` table (partitioned monthly) captures every significant action:

| **Action** | **Captured Data** |
|---|---|
| `login` | User, IP, timestamp, MFA status, success/failure |
| `person_create` | Creator, name, initial status, source event |
| `person_update` | Updater, changed fields, old/new values |
| `alert_acknowledge` | Acknowledger, alert ID, timestamp |
| `alert_resolve` | Resolver, resolution notes |
| `training_approve` | Approver, model version, dataset version |
| `model_deploy` | Deployer, version, A/B split percentage |
| `config_change` | Changer, changed parameters, old/new values |

#### 7.7.5 Backup Strategy

| **Component** | **Method** | **Frequency** | **Retention** |
|---|---|---|---|
| PostgreSQL | RDS automated backups | Daily | 35 days |
| PostgreSQL | Manual snapshots | Before any schema change | 90 days |
| MinIO/S3 | Cross-region replication | Continuous | 90 days in DR region |
| Face embeddings | pg_dump + vector export | Weekly | 90 days |
| Model artifacts | MLflow artifact store | On training completion | Indefinite |

> **Reference**: For complete DDL including all CREATE TABLE statements, triggers, views, and functions, see `database_schema.md` — Sections 2 through 15 contain the full schema definition with comments and constraints.

---


## Section 8: AI Model and Training Strategy

### 8.1 AI Model Selection

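The inference pipeline is built on three complementary deep learning models, for human detection, face detection, and face recognition, each optimized with TensorRT for GPU inference. ByteTrack tracking, HDBSCAN clustering, and two auxiliary YOLOv8 models for suspicious-activity detection complete the stack. All TensorRT models share a single NVIDIA T4 GPU with dynamic batching.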

| **Component** | **Model** | **Framework** | **Input Size** | **FPS (T4)** | **Accuracy** |
|---|---|---|---|---|---|
| **Human Detection** | YOLO11m (Ultralytics) | PyTorch -> ONNX -> TensorRT FP16 | 640 x 640 | 213 | mAP@50: 80.5% (COCO) |
| **Face Detection** | SCRFD-500M-BNKPS (InsightFace) | PyTorch -> ONNX -> TensorRT FP16 | 640 x 640 | ~400 | AP_medium: 87.2% (WIDERFace) |
| **Face Recognition** | ArcFace R100 (IR-SE100) | PyTorch -> ONNX -> TensorRT FP16 | 112 x 112 | ~800 | 99.83% (LFW), 98.35% (MegaFace) |
| **Person Tracking** | ByteTrack | Native Python + NumPy | N/A | N/A | 80.3% MOTA (MOT17) |
| **Unknown Clustering** | HDBSCAN + DBSCAN fallback | scikit-learn | 512-D vectors | N/A | 89.5% purity, 0.855 BCubed F |
| **Fall Detection** | YOLOv8n-pose | TensorRT FP16 | 640 x 640 | ~300 | Part of suspicious activity |
| **Object Detection** | YOLOv8s | TensorRT FP16 | 640 x 640 | ~450 | Abandoned object detection |

#### 8.1.1 Human Detection: YOLO11m

| **Property** | **Value** |
|---|---|
| Architecture | CSPDarknet backbone + PANet neck + Decoupled head |
| Parameters | 20.1 M (Ultralytics published figure) |
| FLOPs | 68.0 B (at 640x640) |
| TensorRT Optimization | FP16, dynamic batch (1-16), layer fusion |
| GPU Memory | ~2.1 GB at batch=8 |
| Person class priority | Highest NMS score weighting for person class |
| Preprocessing | Letterbox resize to 640x640, normalize [0,1] |

**Export pipeline:**
```bash
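# PyTorch -> ONNX -> TensorRT Engine
# dynamic=True exports a dynamic batch axis, which the trtexec shape profile below requires;
# it cannot be combined with half=True, so FP16 is applied by trtexec (--fp16) instead
yolo export model=yolo11m.pt format=onnx imgsz=640 dynamic=True opset=17 simplify=True
trtexec --onnx=yolo11m.onnx --saveEngine=yolo11m.engine --fp16 \
  --minShapes=images:1x3x640x640 --optShapes=images:8x3x640x640 --maxShapes=images:16x3x640x640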
```
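
The letterbox preprocessing listed in the property table scales the frame to fit 640x640 while preserving aspect ratio, then pads the remainder. The sketch below computes only the geometry for the 960 x 1080 DVR channels and maps a detection back to source coordinates. The function names are illustrative, and the even split of padding between both sides is an assumption.

```python
def letterbox_params(src_w: int, src_h: int, dst: int = 640):
    """Return (scale, new_w, new_h, pad_left, pad_top) for an aspect-preserving resize."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_left = (dst - new_w) // 2
    pad_top = (dst - new_h) // 2
    return scale, new_w, new_h, pad_left, pad_top


def unletterbox_box(box, scale, pad_left, pad_top):
    """Map an (x1, y1, x2, y2) box from 640x640 model space back to source pixels."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_left) / scale,
        (y1 - pad_top) / scale,
        (x2 - pad_left) / scale,
        (y2 - pad_top) / scale,
    )


scale, new_w, new_h, pad_left, pad_top = letterbox_params(960, 1080)
print(new_w, new_h, pad_left, pad_top)
print(unletterbox_box((35, 0, 604, 640), scale, pad_left, pad_top))
```

Detections from the model must be passed through the inverse mapping before they are stored as `bounding_box` values in source-frame pixels.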

#### 8.1.2 Face Detection: SCRFD-500M-BNKPS

| **Property** | **Value** |
|---|---|
| Architecture | Single-stage detector with FPN, BN+KPS head |
| Model size | ~0.57 M parameters; "500M" in the model name denotes ~500 MFLOPs at 640x640 (lightweight variant) |
| Detects | Face bounding box + 5 facial landmarks |
| Minimum face size | 20 x 20 pixels (configurable) |
| NMS threshold | 0.45 (IoU) |
| Confidence threshold | 0.5 (minimum detection score) |
| GPU Memory | ~1.8 GB at batch=32 |
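
The three thresholds in the table (confidence 0.5, NMS IoU 0.45, minimum face size 20 x 20) combine into a post-processing step along the following lines. This is a plain-Python sketch of greedy NMS under those settings, not the InsightFace implementation, and `filter_faces` is a hypothetical name.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_faces(detections, conf_thresh=0.5, iou_thresh=0.45, min_size=20):
    """detections: list of (x1, y1, x2, y2, score). Returns the kept boxes, best first."""
    candidates = [
        d for d in detections
        if d[4] >= conf_thresh
        and (d[2] - d[0]) >= min_size
        and (d[3] - d[1]) >= min_size
    ]
    candidates.sort(key=lambda d: d[4], reverse=True)
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(det, k) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


faces = [
    (0, 0, 100, 100, 0.9),      # kept
    (5, 5, 105, 105, 0.8),      # suppressed: overlaps the first box
    (200, 200, 210, 210, 0.95), # dropped: smaller than 20 x 20
    (300, 300, 400, 400, 0.4),  # dropped: below confidence threshold
]
print(filter_faces(faces))
```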

#### 8.1.3 Face Recognition: ArcFace R100 (IR-SE100)

| **Property** | **Value** |
|---|---|
| Backbone | IR-SE100 (Improved ResNet-100 with SE blocks) |
| Training data | MS1MV3 (5.8M images, 85K identities) |
| Loss function | ArcFace additive angular margin (m=0.5) |
| Embedding dimension | 512 (float32, L2-normalized) |
| Distance metric | Cosine similarity (1 - cosine_distance) |
| Matching threshold (strict) | 0.60 |
| Matching threshold (balanced) | 0.45 |
| Matching threshold (relaxed) | 0.30 |
| GPU Memory | ~3.2 GB at batch=64 |
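
The three matching thresholds apply to the cosine similarity between L2-normalized embeddings. A minimal sketch of that decision follows. The `THRESHOLDS` dictionary mirrors the table, the inclusive `>=` comparison is an assumption, and the 2-D vectors stand in for 512-D embeddings.

```python
import math

# Matching thresholds from the table above, keyed by mode.
THRESHOLDS = {"strict": 0.60, "balanced": 0.45, "relaxed": 0.30}


def l2_normalize(vector):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def is_match(probe, reference, mode="balanced"):
    """True when similarity reaches the threshold for the selected mode."""
    return cosine_similarity(probe, reference) >= THRESHOLDS[mode]


probe = [1.0, 0.0]
reference = [0.5, 0.866025]  # about 60 degrees from the probe, similarity ~0.5
for mode in THRESHOLDS:
    print(mode, is_match(probe, reference, mode))
```

A similarity of about 0.5 matches under the balanced and relaxed modes but is rejected in strict mode, which shows how the mode shifts the false-accept and false-reject balance.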

**Published benchmarks on standard datasets:**

| **Dataset** | **Accuracy** | **Notes** |
|---|---|---|
| LFW (Labeled Faces in the Wild) | 99.83% | Unconstrained face verification |
| CFP-FP (Frontal-Profile) | 99.17% | Cross-pose evaluation |
| AgeDB-30 | 98.28% | Age-invariant recognition |
| MegaFace (1M distractors) | 98.35% | Large-scale recognition |
| IJB-C | 96.18% (TAR@FAR=1e-4) | Template-based verification |

### 8.2 Inference Pipeline Architecture

```
+=============================================================================+
|                    REAL-TIME INFERENCE PIPELINE                              |
+=============================================================================+
|                                                                              |
|  INPUT: RTSP Frame (640x640, 1 fps per stream)                              |
|       |                                                                      |
|       v                                                                      |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Frame Preprocessor| -> | YOLO11m Detector  | -> | Person Detection  |    |
|  | - Resize          |    | (TensorRT FP16)   |    | Results:          |    |
|  | - Normalize       |    | GPU: 12ms (P50)   |    | - bbox (x1,y1,x2, |    |
|  | - NCHW layout     |    | Batch: 1-16       |    |   y2)             |    |
|  +-------------------+    +-------------------+    | - confidence      |    |
|                                                     | - class (person)  |    |
|                                                     +---------+---------+    |
|                                                               |              |
|                                                               v              |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Face Crop Extract | <- | SCRFD-500M        | <- | Face Detection    |    |
|  | (ROI from person  |    | (TensorRT FP16)   |    | Results:          |    |
|  |  bounding box)    |    | GPU: 8ms (P50)    |    | - face bbox       |    |
|  |                   |    | Batch: per-face   |    | - 5 landmarks     |    |
|  +-------------------+    +-------------------+    | - confidence      |    |
|                                                     +---------+---------+    |
|                                                               |              |
|                                                               v              |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Face Alignment    | <- | ArcFace R100      | <- | Embedding Vector  |    |
|  | (5-point affine   |    | (TensorRT FP16)   |    | 512-D float32,    |    |
|  |  transform to     |    | GPU: 5ms (P50)    |    | L2-normalized     |    |
|  |  112x112)         |    | Batch: 1-64       |    |                   |    |
|  +-------------------+    +-------------------+    +---------+---------+    |
|                                                               |              |
|                                                               v              |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Face Matching     | <- | Person Tracking   | <- | Track-to-Person   |    |
|  | (cosine similarity|    | (ByteTrack)       |    | Association       |    |
|  |  vs. known DB)    |    | CPU: 2ms/frame    |    | - Match embedding |    |
|  +-------------------+    +-------------------+    |   to known persons|    |
|       |  |  |                                      | - Create/update   |    |
|       |  |  |                                      |   track           |    |
|       v  v  v                                      +-------------------+    |
|  +-------------------+                                                        |
|  | Confidence Scorer |                                                        |
|  | (aggregate score  |                                                        |
|  |  for all detect)  |                                                        |
|  +-------------------+                                                        |
|       |                                                                       |
|       v                                                                       |
|  OUTPUT: DetectionEvent (JSON)                                               |
|  { person_id, track_id, confidence, bbox, face_crop,                         |
|    embedding, recognized_name?, quality_scores }                             |
|                                                                              |
+=============================================================================+
```
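
The `DetectionEvent` output at the bottom of the diagram could be modeled as follows. Only the field names come from the diagram. The field types, the optional defaults, and the choice to carry `face_crop` as a storage path are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class DetectionEvent:
    track_id: int
    confidence: float
    bbox: tuple[float, float, float, float]      # x1, y1, x2, y2 in source pixels
    person_id: Optional[str] = None              # None until a match is made
    face_crop: Optional[str] = None              # assumed: storage path of the crop
    embedding: Optional[list[float]] = None      # 512-D, L2-normalized
    recognized_name: Optional[str] = None        # optional, marked "?" in the diagram
    quality_scores: dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize to compact JSON for publishing on the event stream."""
        return json.dumps(asdict(self), separators=(",", ":"))


event = DetectionEvent(
    track_id=7,
    confidence=0.91,
    bbox=(10.0, 20.0, 110.0, 220.0),
    quality_scores={"blur": 0.12},
)
print(event.to_json())
```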

**End-to-end latency budget per frame:**

| **Stage** | **GPU** | **CPU Fallback** |
|---|---|---|
| Frame preprocessing | 2-5 ms | 5-10 ms |
| YOLO11m detection | 12 ms (P50) | 35-56 ms (ONNX+OpenVINO) |
| SCRFD face detection | 8 ms (P50) | 15-25 ms |
| ArcFace embedding (per face) | 5 ms (P50) | 12-18 ms |
| ByteTrack tracking | 2 ms | 2-5 ms |
| Post-processing | 5-10 ms | 10-20 ms |
| **Total (no face)** | **~29-37 ms** | **~67-116 ms** |
| **Total (1 face)** | **~34-42 ms** | **~79-134 ms** |
| **Total (5 faces)** | **~54-62 ms** | **~127-206 ms** |
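
The totals can be recomputed from the per-stage rows by summing each stage once and the ArcFace stage once per face. The sketch below does that arithmetic. It assumes faces are embedded sequentially, so batched ArcFace inference (batch 1-64) would come in lower, and the dictionary and function names are illustrative.

```python
# Per-stage latency ranges in milliseconds (low, high), copied from the table above.
GPU_STAGES = {
    "preprocess": (2, 5),
    "yolo11m": (12, 12),
    "scrfd": (8, 8),
    "arcface_per_face": (5, 5),
    "bytetrack": (2, 2),
    "postprocess": (5, 10),
}
CPU_STAGES = {
    "preprocess": (5, 10),
    "yolo11m": (35, 56),
    "scrfd": (15, 25),
    "arcface_per_face": (12, 18),
    "bytetrack": (2, 5),
    "postprocess": (10, 20),
}


def frame_latency_ms(stages: dict, faces: int) -> tuple[int, int]:
    """Sum stage ranges; the ArcFace stage is paid once per detected face."""
    low = high = 0
    for name, (stage_low, stage_high) in stages.items():
        repeats = faces if name == "arcface_per_face" else 1
        low += stage_low * repeats
        high += stage_high * repeats
    return low, high


for faces in (0, 1, 5):
    print(faces, frame_latency_ms(GPU_STAGES, faces), frame_latency_ms(CPU_STAGES, faces))
```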

### 8.3 Face Recognition Matching Strategy

#### 8.3.1 Known Person Matching

```
+-----------------------------------------------------------------------------+
|                    FACE RECOGNITION MATCHING FLOW                            |
+-----------------------------------------------------------------------------+
|                                                                              |
|  New Face Embedding (512-D)                                                  |
|       |                                                                      |
|       v                                                                      |
|  +-------------------+                                                       |
|  | L2 Normalize      |  embedding = embedding / ||embedding||_2              |
|  +-------------------+                                                       |
|       |                                                                      |
|       v                                                                      |
|  +-------------------+    +-------------------+                              |
|  | pgvector HNSW     | -> | Top-5 Candidates  |                              |
|  | Similarity Search |    | (cosine distance) |                              |
|  | ef_search=128     |    +-------------------+                              |
|  +-------------------+            |                                          |
|                                   v                                          |
|  +-------------------+    +-------------------+                              |
|  | Threshold Check   | <- | Best Match Score  |                              |
|  | (per AI Vibe)     |    +-------------------+                              |
|  +-------------------+            |                                          |
|       |                          |                                          |
|       +------------+-------------+                                          |
|                    |                                                        |
|         +----------+----------+                                             |
|         |                     |                                             |
|         v                     v                                             |
|    Above threshold      Below threshold                                     |
|    (Recognized)         (Unknown)                                           |
|         |                     |                                             |
|         v                     v                                             |
|  +------------+       +------------------+                                 |
|  | Assign to  |       | Check against    |                                 |
|  | known      |       | recent unknown   |                                 |
|  | person_id  |       | embeddings       |                                 |
|  | (with      |       | (5-min window)   |                                 |
|  | confidence)|       +--------+---------+                                 |
|  +------------+                |                                            |
|                                |                                            |
|                       +--------+--------+                                   |
|                       |                 |                                   |
|                       v                 v                                   |
|                  Similar unknown    No similar unknown                      |
|                  (same person)      (new unknown)                           |
|                       |                 |                                   |
|                       v                 v                                   |
|                  Reuse person_id   Create new                              |
|                  Update centroid   unknown person                           |
|                                    record                                   |
|                                                                              |
+-----------------------------------------------------------------------------+
```
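
The known-person branch of the flow above can be sketched in plain Python. This is a minimal illustration, not the production path: the in-memory `gallery` dictionary stands in for the pgvector HNSW index, and the function names (`l2_normalize`, `match_face`) are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (first box in the flow)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """For unit vectors the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def match_face(embedding, gallery, threshold, top_k=5):
    """Return (person_id, score) if the best candidate is above threshold,
    otherwise (None, best_score).

    `gallery` maps person_id -> L2-normalized reference embedding and is an
    assumed stand-in for the top-k result of the pgvector HNSW search.
    """
    query = l2_normalize(embedding)
    candidates = sorted(
        ((cosine_similarity(query, ref), pid) for pid, ref in gallery.items()),
        reverse=True,
    )[:top_k]
    if not candidates:
        return None, 0.0
    best_score, best_id = candidates[0]
    if best_score > threshold:
        return best_id, best_score   # recognized
    return None, best_score          # unknown: continue with the 5-minute unknown check

gallery = {
    "alice": l2_normalize([1.0, 0.0, 0.0]),
    "bob": l2_normalize([0.0, 1.0, 0.0]),
}
result = match_face([0.9, 0.1, 0.0], gallery, threshold=0.45)
```

A `None` identifier signals the caller to proceed to the recent-unknown comparison on the right-hand branch of the diagram.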

#### 8.3.2 AI Vibe Threshold Mapping

The AI Vibe system maps three intuitive presets to internal confidence thresholds:

| **Vibe** | **Face Match Threshold** | **Detection Confidence** | **Use Case** |
|---|---|---|---|
| **Relaxed** | 0.30 cosine similarity | 0.40 minimum | Known persons re-identified more easily; more false positives acceptable |
| **Balanced** | 0.45 cosine similarity | 0.55 minimum | Default; good precision-recall tradeoff |
| **Strict** | 0.60 cosine similarity | 0.70 minimum | High-security scenarios; minimize false positives |

**Per-stream Vibe Selection:**
- Vibe can be set per camera via dashboard
- Night mode automatically applies Strict vibe
- Alert-triggered cameras automatically upgrade to Strict for 5 minutes
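
The selection rules above could be resolved per camera roughly as follows. The threshold values come from the Vibe table; the `CameraVibeState` class and both function names are hypothetical, and the precedence (night mode first, then the alert upgrade) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# (face match cosine similarity, detection confidence) per Vibe, from the table.
VIBE_THRESHOLDS = {
    "relaxed": (0.30, 0.40),
    "balanced": (0.45, 0.55),
    "strict": (0.60, 0.70),
}
ALERT_UPGRADE_SECONDS = 5 * 60  # alert-triggered cameras stay Strict for 5 minutes

@dataclass
class CameraVibeState:
    configured_vibe: str = "balanced"       # set per camera via the dashboard
    last_alert_ts: Optional[float] = None   # epoch seconds of the most recent alert

def effective_vibe(state, now_ts, night_mode):
    """Apply the automatic Strict overrides before the configured Vibe."""
    if night_mode:
        return "strict"
    if state.last_alert_ts is not None:
        if 0 <= now_ts - state.last_alert_ts < ALERT_UPGRADE_SECONDS:
            return "strict"
    return state.configured_vibe

def thresholds_for(state, now_ts, night_mode):
    """Return (face_match_threshold, detection_confidence) for this camera."""
    return VIBE_THRESHOLDS[effective_vibe(state, now_ts, night_mode)]
```

For example, a camera configured as Relaxed that raised an alert 100 seconds ago resolves to the Strict thresholds until the 5-minute window expires.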

### 8.4 Unknown Person Clustering Approach

Unknown persons (faces that don't match any known person above threshold) are automatically clustered to help operators identify recurring visitors.

#### 8.4.1 Clustering Pipeline

```
+-----------------------------------------------------------------------------+
|                    UNKNOWN PERSON CLUSTERING PIPELINE                        |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Unknown Face Embeddings (streaming)                                         |
|       |                                                                      |
|       v                                                                      |
|  +-------------------+                                                       |
|  | Sliding Window    |  Keep last N embeddings in memory (configurable)     |
|  | Buffer (500)      |  + persistent storage for long-term clustering       |
|  +-------------------+                                                       |
|       |                                                                      |
|       v                                                                      |
|  +-------------------+    +-------------------+                              |
|  | HDBSCAN Clustering| -> | Primary clusters  |  min_cluster_size=5        |
|  | (density-based)   |    | formed             |  min_samples=2             |
|  | metric=cosine     |    +-------------------+  eps=auto                   |
|  +-------------------+            |                                          |
|       | (fallback)                |                                          |
|       v                           v                                          |
|  +-------------------+    +-------------------+                              |
|  | DBSCAN Fallback   |    | Merge with        |  Check: temporal gap       |
|  | (if HDBSCAN finds |    | existing clusters |  < 30 days, cosine sim     |
|  |  no structure)    |    | - centroid        |  > 0.85                    |
|  +-------------------+    |   distance        |                            |
|                           +-------------------+                            |
|                                   |                                          |
|                                   v                                          |
|                           +-------------------+                              |
|                           | Operator Review   |  Dashboard shows clusters   |
|                           | Queue             |  pending identification     |
|                           +-------------------+                              |
|                                                                              |
+-----------------------------------------------------------------------------+
```

#### 8.4.2 Clustering Parameters

| **Parameter** | **Value** | **Description** |
|---|---|---|
| Algorithm | HDBSCAN (primary), DBSCAN (fallback) | Density-based for irregular cluster shapes |
| Distance metric | Cosine distance (1 - cosine similarity) | Consistent with the cosine matching used on L2-normalized face embeddings |
| Minimum cluster size | 5 embeddings | Minimum to form a cluster |
| Minimum samples | 2 | Core point density threshold |
| Merge threshold | 0.85 cosine similarity | Merge clusters if centroids are close |
| Temporal window | 30 days | Maximum gap between cluster appearances |
| Review trigger | 10+ embeddings | Send to operator review queue |
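
The merge rule (centroid similarity above 0.85 and a temporal gap under 30 days) can be sketched as below. The helper names are hypothetical, and comparing the last sighting of one cluster against the first sighting of the other is an assumed reading of "temporal gap".

```python
import math
from datetime import datetime, timedelta

MERGE_SIMILARITY = 0.85               # merge threshold from the table
TEMPORAL_WINDOW = timedelta(days=30)  # maximum gap between cluster appearances

def centroid(embeddings):
    """Mean of the embeddings, re-normalized to unit length."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def should_merge(centroid_a, last_seen_a, centroid_b, first_seen_b):
    """True when two clusters are close in both embedding space and time."""
    gap = abs(first_seen_b - last_seen_a)
    similar = cosine_similarity(centroid_a, centroid_b) > MERGE_SIMILARITY
    return gap < TEMPORAL_WINDOW and similar

a = centroid([[1.0, 0.0], [0.9, 0.1]])
b = centroid([[1.0, 0.05]])
merge_now = should_merge(a, datetime(2025, 1, 1), b, datetime(2025, 1, 10))
```

The sketch does not guard against an all-zero mean vector; real embeddings from a face model should not produce one, but a production version would check.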

#### 8.4.3 Clustering Quality Targets

| **Metric** | **Target** | **Measurement** |
|---|---|---|
| Cluster Purity | > 89% | % of embeddings in a cluster belonging to the same person |
| BCubed F-Measure | > 0.85 | Harmonic mean of precision and recall for clustering |
| Silhouette Score | > 0.3 | Separation quality between clusters |
| False Merge Rate | < 5% | Different persons incorrectly merged |
| Split Rate | < 15% | Same person split into multiple clusters |
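
Cluster purity as defined in the table can be computed on a labeled evaluation set with a few lines. This is a sketch: the function name is hypothetical, and treating label `-1` as noise follows the convention used by common HDBSCAN and DBSCAN implementations.

```python
from collections import Counter

def cluster_purity(cluster_labels, true_ids):
    """Fraction of clustered embeddings that carry their cluster's majority identity.

    Embeddings labeled -1 (noise) are excluded from the computation.
    """
    clusters = {}
    for label, person in zip(cluster_labels, true_ids):
        if label == -1:
            continue
        clusters.setdefault(label, []).append(person)
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    majority = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return majority / total

purity = cluster_purity([0, 0, 0, 1, 1, -1], ["a", "a", "b", "c", "c", "d"])
```

In this toy input, cluster 0 contains one embedding from a second person, so four of the five clustered embeddings match their cluster majority.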

### 8.5 Confidence Handling

#### 8.5.1 Confidence Score Computation

Each detection event carries an aggregate confidence score computed from multiple signals:

```
confidence_aggregate = weighted_average(
    detection_confidence:    0.35 * yolo_confidence,
    face_detection_quality:  0.25 * scrfd_confidence,
    face_recognition_score:  0.25 * (1 - cosine_distance_to_match),
    face_quality_score:      0.15 * quality_composite
)

Where quality_composite = average(
    1.0 - blur_score,       # Sharpness (higher is better)
    1.0 - abs(pose_yaw)/90, # Frontal preference
    illumination_score,      # Well-lit face
    resolution_adequacy      # Sufficient pixels for face
)
```
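
The formula above translates directly into Python. The weights and terms are taken from the pseudocode; clamping the yaw to 90 degrees and the final score to [0, 1] are assumptions added to keep out-of-range inputs from producing a nonsensical result.

```python
def quality_composite(blur_score, pose_yaw_deg, illumination_score, resolution_adequacy):
    """Average of the four face-quality terms, each expected in [0, 1]."""
    terms = [
        1.0 - blur_score,                             # sharpness
        1.0 - min(abs(pose_yaw_deg), 90.0) / 90.0,    # frontal preference
        illumination_score,                           # well-lit face
        resolution_adequacy,                          # sufficient pixels for face
    ]
    return sum(terms) / len(terms)

def confidence_aggregate(yolo_confidence, scrfd_confidence,
                         cosine_distance_to_match, quality):
    """Weighted sum with weights 0.35 / 0.25 / 0.25 / 0.15 (they sum to 1.0)."""
    score = (
        0.35 * yolo_confidence
        + 0.25 * scrfd_confidence
        + 0.25 * (1.0 - cosine_distance_to_match)
        + 0.15 * quality
    )
    return max(0.0, min(1.0, score))

quality = quality_composite(blur_score=0.2, pose_yaw_deg=18.0,
                            illumination_score=0.9, resolution_adequacy=1.0)
score = confidence_aggregate(0.9, 0.8, 0.3, quality)
```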

#### 8.5.2 Confidence Levels

| **Level** | **Score Range** | **Color** | **Action** |
|---|---|---|---|
| **High Confidence** | 0.80 - 1.00 | Green | Auto-accept, no review needed |
| **Medium Confidence** | 0.60 - 0.79 | Yellow | Accepted, flagged for periodic review |
| **Low Confidence** | 0.40 - 0.59 | Orange | Requires operator review within 24h |
| **Very Low Confidence** | 0.00 - 0.39 | Red | Rejected, not used for training |
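
A sketch of the level lookup follows. It treats each band as half-open (for example, Medium covers 0.60 up to but not including 0.80) so that scores such as 0.795 fall into exactly one level; that boundary handling and the action identifiers are assumptions.

```python
def confidence_level(score):
    """Map an aggregate confidence score to (level, action)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0, 1]")
    if score >= 0.80:
        return "High Confidence", "auto_accept"
    if score >= 0.60:
        return "Medium Confidence", "accept_and_flag_for_periodic_review"
    if score >= 0.40:
        return "Low Confidence", "operator_review_within_24h"
    return "Very Low Confidence", "reject_and_exclude_from_training"
```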

### 8.6 Training Workflow Overview

The safe self-learning system captures operator feedback and converts it into model improvements through a gated pipeline: conflict detection, dataset curation, training, quality gates, and staged deployment.

#### 8.6.1 Three Learning Modes

| **Mode** | **Description** | **Use Case** | **Risk Level** |
|---|---|---|---|
| **Manual Only** | Operator explicitly triggers training runs | Highly regulated environments | Lowest |
| **Suggested Learning** (Recommended) | System suggests training candidates; operator approves | Standard production deployment | Low |
| **Approved Auto-Update** | Auto-training triggers after admin approval threshold | Mature deployment with trusted operators | Medium |

#### 8.6.2 Training Pipeline Architecture

```
+=============================================================================+
|                    SAFE SELF-LEARNING PIPELINE                               |
+=============================================================================+
|                                                                              |
|  STEP 1: COLLECTION                                                          |
|  +-------------------+                                                       |
|  | Operator Review   |  confirm, correct_name, merge, reject                |
|  | Actions           |  + automatic high-confidence acceptances              |
|  +-------------------+                                                       |
|       |                                                                      |
|       v                                                                      |
|  STEP 2: CONFLICT DETECTION (Synchronous, blocks immediately)               |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Label Conflict    | -> | If conflict found | -> | Block from training |   |
|  | Detector          |    | (5 types)         |    | dataset, alert admin |   |
|  | - Same face, diff |    +-------------------+    +-------------------+    |
|  |   names           |                                                       |
|  | - Diff faces, same|                                                       |
|  |   name            |                                                       |
|  | - Merge circular  |                                                       |
|  |   reference       |                                                       |
|  | - Name to already-|                                                       |
|  |   deleted person  |                                                       |
|  | - Quality below   |                                                       |
|  |   threshold       |                                                       |
|  +-------------------+                                                       |
|       |                                                                      |
|       v                                                                      |
|  STEP 3: DATASET CURATION                                                    |
|  +-------------------+                                                       |
|  | Training Dataset  |  - Collect approved examples                         |
|  | Builder           |  - Balance classes (min 5 per person)                |
|  |                   |  - Augmentation (flip, rotate, brightness)           |
|  |                   |  - Quality filter (blur, pose, illumination)         |
|  |                   |  - Train/val split (80/20)                            |
|  +-------------------+                                                       |
|       |                                                                      |
|       v                                                                      |
|  STEP 4: MODEL TRAINING                                                      |
|  +-------------------+                                                       |
|  | Training Job      |  - ArcFace R100 backbone                              |
|  | (Airflow DAG)     |  - Fine-tuning on curated dataset                     |
|  |                   |  - Cosine annealing LR schedule                        |
|  |                   |  - Early stopping (patience=10)                       |
|  |                   |  - Mixed precision (AMP)                              |
|  |                   |  - Typical duration: 2-8 hours on V100                |
|  +-------------------+                                                       |
|       |                                                                      |
|       v                                                                      |
|  STEP 5: QUALITY GATES                                                       |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Gate 1: Hold-out  | -> | Gate 2: Compare   | -> | Gate 3: Identity  |    |
|  |    evaluation     |    |    vs current     |    |    accuracy       |    |
|  |    (precision,    |    |    production     |    |    (100% known)   |    |
|  |     recall, f1)   |    |    (no >2% regress)|   |                   |    |
|  +-------------------+    +-------------------+    +-------------------+    |
|       |                          |                          |                |
|       +------------+-------------+--------------------------+                |
|                    |                                                          |
|         +----------+----------+                                              |
|         |                     |                                              |
|         v                     v                                              |
|     ALL PASSED            ANY FAILED                                       |
|         |                     |                                              |
|         v                     v                                              |
|  +------------+       +------------------+                                 |
|  | Proceed to |       | REJECT           |                                 |
|  | Deployment |       | - Log failure    |                                 |
|  +------------+       | - Alert admin    |                                 |
|                       | - Keep in staging|                                 |
|                       +------------------+                                 |
|                                                                              |
|  STEP 6: DEPLOYMENT                                                          |
|  +-------------------+                                                       |
|  | A/B Testing       |  - Shadow mode: 0% traffic (validation)              |
|  | (gradual rollout) |  - Canary: 5% traffic for 24h                        |
|  |                   |  - Monitor: latency, error rate, FP rate              |
|  |                   |  - Full rollout: 100% traffic                         |
|  |                   |  - Rollback: < 60 seconds to previous version         |
|  +-------------------+                                                       |
|                                                                              |
+=============================================================================+
```

### 8.7 Model Versioning and Rollback

#### 8.7.1 Semantic Versioning

| **Version Component** | **Increment When** | **Example** |
|---|---|---|
| **MAJOR (X.0.0)** | Full retraining, architecture change, breaking embedding change | 1.0.0 -> 2.0.0 (new backbone) |
| **MINOR (x.Y.0)** | Fine-tuning, significant new data (>50 new identities) | 1.0.0 -> 1.1.0 (new employees) |
| **PATCH (x.y.Z)** | Incremental update, centroid update, hotfix | 1.0.0 -> 1.0.1 (new photos added) |
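
The increment rules can be expressed as a small helper. The function name and the `change` labels are hypothetical; deciding which label applies to a given training run is left to the caller.

```python
def bump_version(version, change):
    """Return the next semantic version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":    # full retraining, architecture or embedding change
        return f"{major + 1}.0.0"
    if change == "minor":    # fine-tuning, significant new data
        return f"{major}.{minor + 1}.0"
    if change == "patch":    # incremental update, centroid update, hotfix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```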

#### 8.7.2 Version States

| **State** | **Description** | **Transition** |
|---|---|---|
| `TRAINING` | Model is being trained | Auto -> STAGING on completion |
| `STAGING` | Awaiting quality gate evaluation | Auto -> AWAITING_APPROVAL on pass |
| `AWAITING_APPROVAL` | Pending admin approval | Manual -> CANARY on approve |
| `CANARY` | 5% traffic, monitoring | Auto -> PRODUCTION on success (24h) |
| `PRODUCTION` | 100% traffic, active serving | Manual -> ARCHIVED on new version deploy |
| `ARCHIVED` | Kept for rollback, no traffic | Auto -> ROLLBACK_AVAILABLE after 30 days |
| `ROLLBACK_AVAILABLE` | Can be rolled back to | Manual -> PRODUCTION on rollback trigger |
| `DEPRECATED` | Cannot be rolled back to | Final state |
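
The transitions listed in the table can be enforced with a lookup. This sketch encodes only the transitions the table names; failure paths (for example, a rejected STAGING model) and the route into `DEPRECATED` are not specified there and are deliberately left out. The `transition` function is hypothetical.

```python
# Allowed next states, taken row by row from the Version States table.
ALLOWED_TRANSITIONS = {
    "TRAINING": {"STAGING"},
    "STAGING": {"AWAITING_APPROVAL"},
    "AWAITING_APPROVAL": {"CANARY"},
    "CANARY": {"PRODUCTION"},
    "PRODUCTION": {"ARCHIVED"},
    "ARCHIVED": {"ROLLBACK_AVAILABLE"},
    "ROLLBACK_AVAILABLE": {"PRODUCTION"},
    "DEPRECATED": set(),  # final state
}

def transition(current, target):
    """Return the target state if the move is allowed, otherwise raise."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```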

#### 8.7.3 Rollback Procedure

```
+-----------------------------------------------------------------------------+
|                    EMERGENCY ROLLBACK PROCEDURE                              |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Trigger: Admin initiates rollback or automatic rollback on failure         |
|                                                                              |
|  Step 1: Validate target version exists and is in ROLLBACK_AVAILABLE state  |
|  Step 2: Load target model artifacts from S3/MinIO (pre-warm GPU)          |
|  Step 3: Atomic switch: update model reference in Triton config             |
|  Step 4: Reload via Triton model repository load API (zero-downtime swap)   |
|  Step 5: Validate: send test inference requests, check latency              |
|  Step 6: If validation fails -> auto-revert to previous production          |
|  Step 7: If validation passes -> update database model version records      |
|  Step 8: Log rollback event in audit_logs                                   |
|                                                                              |
|  Maximum rollback time: < 60 seconds                                        |
|  Zero inference downtime during rollback                                    |
|                                                                              |
+-----------------------------------------------------------------------------+
```

### 8.8 Quality Gates

#### 8.8.1 Gate Thresholds

| **Gate** | **Metric** | **Minimum** | **Maximum** | **Critical** |
|---|---|---|---|---|
| Hold-out Evaluation | Precision | 0.97 | — | Yes (cannot override) |
| Hold-out Evaluation | Recall | 0.95 | — | Yes |
| Hold-out Evaluation | F1 Score | 0.96 | — | Yes |
| No Regression | Metric regression vs production | — | 2% | No (admin can override) |
| Identity Accuracy | Known identity recall | 100% | — | Yes |
| Latency | P99 inference latency | — | 150 ms | Yes |
| Confusion Analysis | False positive rate | — | 5% | No |
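
A minimal evaluator for the thresholds above might look like this. The limits and the critical flags come from the table; the metric key names, the `overrides` mechanism for non-critical gates, and the function name are assumptions for illustration.

```python
# (gate, metric key, bound type, limit, critical)
GATES = [
    ("holdout", "precision", "min", 0.97, True),
    ("holdout", "recall", "min", 0.95, True),
    ("holdout", "f1_score", "min", 0.96, True),
    ("no_regression", "max_regression_pct", "max", 2.0, False),
    ("identity_accuracy", "known_identity_recall", "min", 1.0, True),
    ("latency", "p99_latency_ms", "max", 150.0, True),
    ("confusion_analysis", "false_positive_rate", "max", 0.05, False),
]

def evaluate_gates(metrics, overrides=frozenset()):
    """Return (overall_result, failures).

    `overrides` lists metric keys an admin has waived; a waiver is honored
    only for non-critical gates.
    """
    failures = []
    for gate, key, bound, limit, critical in GATES:
        value = metrics[key]
        passed = value >= limit if bound == "min" else value <= limit
        if passed:
            continue
        if not critical and key in overrides:
            continue
        failures.append({"gate": gate, "metric": key, "value": value, "limit": limit})
    return ("PASSED" if not failures else "FAILED", failures)

candidate = {
    "precision": 0.9842, "recall": 0.9678, "f1_score": 0.9759,
    "max_regression_pct": 0.8, "known_identity_recall": 1.0,
    "p99_latency_ms": 128, "false_positive_rate": 0.03,  # FP rate is illustrative
}
overall, failed = evaluate_gates(candidate)
```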

#### 8.8.2 Quality Gate Report Example

```json
{
  "gate_run_id": "550e8400-e29b-41d4-a716-446655440000",
  "candidate_model_version": "1.2.0",
  "baseline_model_version": "1.1.0",
  "timestamp": "2024-01-25T10:30:00Z",
  "overall_result": "PASSED",
  "gates": [
    {
      "name": "holdout_performance",
      "status": "PASSED",
      "critical": true,
      "metrics": {
        "precision": 0.9842,
        "recall": 0.9678,
        "f1_score": 0.9759
      }
    },
    {
      "name": "no_regression",
      "status": "PASSED",
      "metrics": {
        "max_regression_pct": 0.8,
        "per_metric": {
          "precision": 0.003,
          "recall": -0.008,
          "f1_score": -0.002
        }
      }
    },
    {
      "name": "known_identity_accuracy",
      "status": "PASSED",
      "metrics": {
        "known_identities_tested": 142,
        "perfect_accuracy": 142,
        "accuracy_below_threshold": 0
      }
    },
    {
      "name": "latency_requirement",
      "status": "PASSED",
      "metrics": {
        "p50_latency_ms": 45,
        "p99_latency_ms": 128,
        "threshold_ms": 150
      }
    }
  ]
}
```

#### 8.8.3 Embedding Update Strategies

The face embedding database must be updated whenever a new model version is deployed or rolled back, identities are merged, or identities gain new examples. Five strategies are available:

| **Strategy** | **When to Use** | **Duration** | **Impact** |
|---|---|---|---|
| **Centroid Update** | Few new examples (<10 per identity), same model | Seconds | Update running mean only |
| **Incremental Add** | Many new examples (10-100 per identity), same model | Minutes | Add new embeddings, keep existing |
| **Full Reindex** | Model version changed, or >10% of identities updated | Hours | Recompute all embeddings |
| **Merge and Update** | Identity merge operation | Seconds | Weighted centroid merge |
| **Rollback Reindex** | Model rollback | Minutes | Restore previous embeddings |

**Decision Matrix:**

```
+-----------------------------------------------------------------------------+
|                    EMBEDDING UPDATE STRATEGY SELECTION                       |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Model changed?                                                              |
|       |                                                                      |
|       +-- YES -> FULL_REINDEX (required, embeddings are model-dependent)     |
|       |                                                                      |
|       NO -> What changed?                                                    |
|               |                                                              |
|               +-- Identity merge -> MERGE_AND_UPDATE                         |
|               |                                                              |
|               +-- Rollback -> ROLLBACK_REINDEX                               |
|               |                                                              |
|               +-- New examples?                                              |
|                       |                                                      |
|                       +-- < 10 per identity, < 10% total -> CENTROID_UPDATE |
|                       |                                                      |
|                       +-- Otherwise -> INCREMENTAL_ADD                       |
|                                                                              |
+-----------------------------------------------------------------------------+
```
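
The decision matrix maps onto a short function. This sketch follows the diagram branch by branch; the `operation` labels are hypothetical, and reading "< 10% total" as the fraction of identities that received new examples is an assumption.

```python
def select_update_strategy(model_changed, operation,
                           new_per_identity=0, fraction_identities_updated=0.0):
    """Pick an embedding update strategy following the decision matrix."""
    if model_changed:
        return "FULL_REINDEX"          # embeddings are model-dependent
    if operation == "identity_merge":
        return "MERGE_AND_UPDATE"
    if operation == "rollback":
        return "ROLLBACK_REINDEX"
    if operation == "new_examples":
        if new_per_identity < 10 and fraction_identities_updated < 0.10:
            return "CENTROID_UPDATE"
        return "INCREMENTAL_ADD"
    raise ValueError(f"unknown operation: {operation}")
```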

> **Reference**: For complete model export commands, INT8 calibration scripts, performance benchmarks, and the full Python module structure, see `ai_vision.md` — Sections 10-14. For the complete training pipeline code, Airflow DAG definitions, and quality gate implementations, see `training_system.md` — Sections 5-10.

---


## Section 9: Suspicious Activity Night-Mode Design

### 9.1 Overview

The suspicious activity detection system provides comprehensive behavioral analysis during night hours (22:00-06:00 by default) through **10 specialized detection modules**. Each module operates on the output of the AI inference pipeline (detected persons, tracked positions, and face identities) to identify anomalous behavior patterns.

The system features a **composite scoring engine** that combines signals from all modules with exponential time-decay, enabling unified threat assessment and intelligent escalation. Each camera can be independently configured with custom zones, thresholds, and schedules.

### 9.2 Ten Detection Modules Summary

| **#** | **Module** | **Description** | **Severity** | **Key CV Model** |
|---|---|---|---|---|
| 1 | **Intrusion Detection** | Detects persons entering restricted polygon zones | HIGH (default) | YOLO11m detections + zone polygon |
| 2 | **Loitering Detection** | Flags persons dwelling in an area longer than threshold | MEDIUM (default) | ByteTrack + timer per track |
| 3 | **Running Detection** | Identifies abnormally fast movement | MEDIUM (default) | YOLOv8n-pose + optical flow speed |
| 4 | **Crowding Detection** | Alerts when group density exceeds threshold | HIGH (default) | DBSCAN spatial clustering |
| 5 | **Fall Detection** | Detects persons falling or collapsing | CRITICAL | YOLOv8n-pose keypoint analysis |
| 6 | **Abandoned Object** | Identifies unattended objects left behind | HIGH (default) | YOLOv8s + MOG2 background subtraction |
| 7 | **After-Hours Presence** | Detects any person presence during night hours | MEDIUM (default) | YOLO11m person class only |
| 8 | **Zone Breach** | Triggers on crossing virtual boundary lines | MEDIUM (default) | ByteTrack + line crossing algorithm |
| 9 | **Repeated Re-entry** | Flags patterns of entering/exiting an area multiple times | MEDIUM (default) | ByteTrack + entry/exit state machine |
| 10 | **Suspicious Dwell Time** | Alerts on extended presence near sensitive areas | MEDIUM (configurable) | ByteTrack + per-zone timers |

### 9.3 Module Details

#### 9.3.1 Module 1: Intrusion Detection

Detects when a person enters a user-defined restricted polygon zone.

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `confidence_threshold` | 0.55 | 0.3-0.9 | Minimum person detection confidence |
| `overlap_threshold` | 0.30 | 0.1-0.9 | Min IoU between person bbox and zone |
| `cooldown_seconds` | 60 | 0-3600 | Cooldown before re-alerting same zone |
| `zone_severity` | HIGH | LOW/MEDIUM/HIGH | Per-zone configurable |

**Algorithm:**
```
For each detected person:
    For each restricted zone polygon:
        Compute IoU(person_bbox, zone_polygon)
        If IoU > overlap_threshold AND confidence > confidence_threshold:
            If zone not in cooldown:
                Trigger INTRUSION alert
                Start cooldown timer
```
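
A dependency-free sketch of the loop above follows. Note one substitution, stated plainly: instead of a true IoU, it estimates the fraction of the person bounding box that lies inside the zone by sampling a grid of points, because a bbox fully inside a large zone has a small IoU. The class and function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def bbox_fraction_in_zone(bbox, polygon, grid=10):
    """Approximate share of bbox area inside the polygon (grid of cell centers)."""
    x_min, y_min, x_max, y_max = bbox
    hits = 0
    for i in range(grid):
        for j in range(grid):
            px = x_min + (i + 0.5) * (x_max - x_min) / grid
            py = y_min + (j + 0.5) * (y_max - y_min) / grid
            hits += point_in_polygon(px, py, polygon)
    return hits / (grid * grid)

class IntrusionDetector:
    def __init__(self, overlap_threshold=0.30, confidence_threshold=0.55,
                 cooldown_seconds=60):
        self.overlap_threshold = overlap_threshold
        self.confidence_threshold = confidence_threshold
        self.cooldown_seconds = cooldown_seconds
        self._last_alert = {}  # zone_id -> timestamp of the last alert

    def check(self, bbox, confidence, zones, now):
        """Return the zone ids that should raise an INTRUSION alert."""
        alerts = []
        for zone_id, polygon in zones.items():
            if confidence <= self.confidence_threshold:
                continue
            if bbox_fraction_in_zone(bbox, polygon) <= self.overlap_threshold:
                continue
            last = self._last_alert.get(zone_id)
            if last is not None and now - last < self.cooldown_seconds:
                continue  # zone is still in cooldown
            self._last_alert[zone_id] = now
            alerts.append(zone_id)
        return alerts

zone = [(0, 0), (100, 0), (100, 100), (0, 100)]
detector = IntrusionDetector()
first = detector.check((40, 40, 140, 140), 0.9, {"dock": zone}, now=0.0)
```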

#### 9.3.2 Module 2: Loitering Detection

Flags persons who remain in an area longer than a threshold.

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `dwell_time_threshold_seconds` | 300 | 30-1800 | Time before triggering loitering alert |
| `movement_tolerance_pixels` | 50 | 10-200 | Max centroid movement to still count as "stationary" |
| `cooldown_seconds` | 300 | 0-3600 | Cooldown after alert |

**Algorithm:**
```
For each active track:
    If track centroid moved < tolerance in last N seconds:
        Increment dwell timer
        If dwell_timer > threshold:
            Trigger LOITERING alert
            Reset timer (or hold until movement detected)
    Else:
        Reset dwell timer
```
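
The dwell timer can be sketched per track as follows. This is a simplification: movement is measured against the position where the track last came to rest (an anchor point) rather than over a rolling window of N seconds, and the alert cooldown is omitted. The class name is hypothetical.

```python
import math

class LoiteringTracker:
    def __init__(self, dwell_time_threshold_seconds=300, movement_tolerance_pixels=50):
        self.threshold = dwell_time_threshold_seconds
        self.tolerance = movement_tolerance_pixels
        self._anchor = {}  # track_id -> (x, y, timestamp when dwell started)

    def update(self, track_id, centroid, now):
        """Return True once the track has stayed near its anchor past the threshold."""
        x, y = centroid
        anchor = self._anchor.get(track_id)
        if anchor is None or math.hypot(x - anchor[0], y - anchor[1]) >= self.tolerance:
            self._anchor[track_id] = (x, y, now)  # moved: reset the dwell timer
            return False
        return now - anchor[2] > self.threshold
```

As written, `update` keeps returning `True` on every frame after the threshold is crossed; applying `cooldown_seconds` to suppress repeats would be the caller's job.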

#### 9.3.3 Module 3: Running Detection

Identifies abnormally fast movement using pose keypoints and optical flow.

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `speed_threshold_pixels_per_second` | 150 | 50-500 | Pixel speed threshold |
| `speed_threshold_kmh` | 15.0 | 5-40 | Real-world speed (requires calibration) |
| `confirmation_frames` | 3 | 1-10 | Consecutive frames to confirm running |

**Algorithm:**
```
For each active track:
    Compute torso keypoint displacement between frames
    Convert pixel speed to km/h (if calibration available)
    Apply Farneback optical flow for refinement
    If speed > threshold for confirmation_frames:
        Trigger RUNNING alert
```
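
The pixel-speed branch with frame confirmation can be sketched as below. It covers only keypoint displacement; the km/h conversion and the Farneback optical flow refinement are left out. The class name is hypothetical.

```python
import math

class RunningDetector:
    def __init__(self, speed_threshold_pixels_per_second=150, confirmation_frames=3):
        self.speed_threshold = speed_threshold_pixels_per_second
        self.confirmation_frames = confirmation_frames
        self._last = {}    # track_id -> (x, y, timestamp)
        self._streak = {}  # track_id -> consecutive frames above the threshold

    def update(self, track_id, torso_xy, ts):
        """Return True when speed exceeded the threshold for enough consecutive frames."""
        prev = self._last.get(track_id)
        self._last[track_id] = (torso_xy[0], torso_xy[1], ts)
        if prev is None or ts <= prev[2]:
            return False
        distance = math.hypot(torso_xy[0] - prev[0], torso_xy[1] - prev[1])
        speed = distance / (ts - prev[2])
        if speed > self.speed_threshold:
            self._streak[track_id] = self._streak.get(track_id, 0) + 1
        else:
            self._streak[track_id] = 0
        return self._streak[track_id] >= self.confirmation_frames
```

With the defaults, a torso keypoint moving 20 pixels per frame at 10 fps (about 200 px/s) raises the alert on the third consecutive fast frame.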

#### 9.3.4 Module 4: Crowding Detection

Alerts when the density of a group of persons exceeds a threshold.

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `count_threshold` | 5 | 2-50 | Minimum person count in cluster |
| `area_threshold` | 0.15 | 0.05-0.5 | Fraction of frame covered by group |
| `density_threshold` | 0.05 | 0.01-0.2 | Persons per square meter (calibrated) |
| `dbscan_eps` | 0.08 | 0.01-0.3 | DBSCAN neighborhood radius (normalized) |

**Algorithm:**
```
Collect all person centroids in current frame
Run DBSCAN(eps=dbscan_eps, min_samples=2) on centroids
For each cluster:
    If cluster_size >= count_threshold OR cluster_area >= area_threshold:
        Trigger CROWDING alert
```
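
The grouping step can be illustrated without a clustering library. With `min_samples=2` (counting the point itself, as common implementations do), DBSCAN clusters coincide with the connected components of the graph that links centroids within `eps`, with isolated points treated as noise; the sketch relies on that equivalence. Using the bounding-box area of the centroids as `cluster_area` is an assumption.

```python
import math

def eps_clusters(points, eps):
    """Connected components of points linked when their distance is <= eps.

    Singletons are dropped, matching DBSCAN noise for min_samples=2.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) >= 2]

def crowding_alerts(centroids, dbscan_eps=0.08, count_threshold=5, area_threshold=0.15):
    """Return one entry per cluster that meets the count or area condition."""
    alerts = []
    for members in eps_clusters(centroids, dbscan_eps):
        xs = [centroids[i][0] for i in members]
        ys = [centroids[i][1] for i in members]
        area = (max(xs) - min(xs)) * (max(ys) - min(ys))  # normalized frame units
        if len(members) >= count_threshold or area >= area_threshold:
            alerts.append({"size": len(members), "area": area})
    return alerts

row = [(0.10, 0.5), (0.15, 0.5), (0.20, 0.5), (0.25, 0.5), (0.30, 0.5), (0.9, 0.9)]
alerts = crowding_alerts(row)
```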

#### 9.3.5 Module 5: Fall Detection

Detects persons falling or collapsing using pose keypoint analysis.

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `fall_score_threshold` | 0.75 | 0.5-0.95 | Combined fall confidence score |
| `min_keypoint_confidence` | 0.30 | 0.1-0.5 | Minimum keypoint detection confidence |
| `torso_angle_threshold_deg` | 45 | 30-75 | Torso angle from vertical to trigger |
| `aspect_ratio_threshold` | 1.2 | 0.8-2.0 | Width/height ratio of person bbox |
| `temporal_confirmation_ms` | 1000 | 500-3000 | Duration to confirm fall (not just bend) |

**Algorithm:**
```
For each detected person with pose keypoints:
    Compute torso angle from vertical (using shoulder-hip line)
    Compute bbox aspect ratio
    Check if person is on ground (feet keypoint confidence drops)
    Calculate fall_score = weighted_combination(angle, aspect_ratio, ground_contact)
    If fall_score > threshold AND duration > confirmation_ms:
        Trigger FALL alert (CRITICAL severity)
```
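
One way to realize the scoring and temporal confirmation is sketched below. The source does not give the `weighted_combination` weights, so the values `(0.5, 0.3, 0.2)` and the normalization of each term by its threshold are purely illustrative assumptions; the function and class names are hypothetical.

```python
import math

def torso_angle_deg(shoulder_mid, hip_mid):
    """Angle of the shoulder-hip line from vertical in degrees (0 = upright)."""
    dx = shoulder_mid[0] - hip_mid[0]
    dy = shoulder_mid[1] - hip_mid[1]
    return math.degrees(math.atan2(abs(dx), abs(dy)))

def fall_score(angle_deg, aspect_ratio, ground_contact,
               torso_angle_threshold_deg=45.0, aspect_ratio_threshold=1.2,
               weights=(0.5, 0.3, 0.2)):
    """Weighted combination of three cues, each scaled into [0, 1]."""
    angle_term = min(angle_deg / torso_angle_threshold_deg, 1.0)
    aspect_term = min(aspect_ratio / aspect_ratio_threshold, 1.0)
    w_angle, w_aspect, w_ground = weights
    return w_angle * angle_term + w_aspect * aspect_term + w_ground * ground_contact

class FallConfirmer:
    def __init__(self, fall_score_threshold=0.75, temporal_confirmation_ms=1000):
        self.threshold = fall_score_threshold
        self.confirmation_ms = temporal_confirmation_ms
        self._since = {}  # track_id -> timestamp when the score first exceeded the threshold

    def update(self, track_id, score, ts_ms):
        """Return True when the score has stayed above threshold long enough."""
        if score <= self.threshold:
            self._since.pop(track_id, None)  # posture recovered: reset
            return False
        start = self._since.setdefault(track_id, ts_ms)
        return ts_ms - start > self.confirmation_ms
```

The confirmation window is what separates a fall from a brief bend: the score must stay above the threshold continuously, and any frame below it resets the timer.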

#### 9.3.6 Module 6: Abandoned Object Detection

Identifies unattended objects using background subtraction and object detection.

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `unattended_time_threshold_seconds` | 60 | 10-600 | Time before object is considered abandoned |
| `proximity_threshold_pixels` | 100 | 20-300 | Max distance from owner before "unattended" |
| `watchlist_classes` | ["backpack", "suitcase", "box", "bag"] | — | Object classes to monitor |
| `bg_learning_rate` | 0.005 | 0.001-0.01 | MOG2 background model learning rate |

**Algorithm:**
```
Run YOLOv8s to detect objects in watchlist_classes
Run MOG2 background subtraction to identify static foreground
For each detected object:
    Track owner proximity (nearest person)
    If owner distance > threshold AND object stationary > time_threshold:
        Trigger ABANDONED_OBJECT alert
```

#### 9.3.7 Module 7: After-Hours Presence

Any person detected during night hours triggers an alert once the detection persists for `min_detection_frames` consecutive frames.

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `detection_confidence_threshold` | 0.50 | 0.3-0.9 | Minimum person detection confidence |
| `min_detection_frames` | 5 | 1-30 | Frames to confirm (avoid false positives) |
| `check_authorized_personnel` | false | true/false | If true, check against known persons whitelist |

#### 9.3.8 Module 8: Zone Breach

Detects crossing of virtual boundary lines (directional or bidirectional).

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `boundary_lines` | [] (user-defined) | — | Array of {start, end, direction, severity} |
| `allowed_direction` | "both" | both/a_to_b/b_to_a | Which direction is allowed |
| `crossing_threshold_pixels` | 20 | 5-100 | Min distance past line to trigger |
| `cooldown_seconds` | 30 | 0-3600 | Cooldown per (track, line) pair |

**Algorithm:**
```
For each active track:
    For each boundary line:
        Check if track centroid crosses line in forbidden direction
        Using line equation: ax + by + c = 0, check sign change
        If crossed AND distance_past_line > threshold:
            Trigger ZONE_BREACH alert
```
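
The sign-change test can be sketched with a signed distance to the boundary line. The naming of the two sides ("a" for negative distance, "b" for positive) is an assumed convention, the line is treated as infinite (no check that the crossing falls between `start` and `end`), and comparing the returned direction against `allowed_direction` is left to the caller.

```python
import math

def signed_distance(point, start, end):
    """Signed perpendicular distance from point to the line through start and end."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    length = math.hypot(x2 - x1, y2 - y1)
    return ((x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)) / length

def crossing_direction(prev_pt, curr_pt, start, end, crossing_threshold_pixels=20):
    """Return 'a_to_b', 'b_to_a', or None if no confirmed crossing occurred."""
    d_prev = signed_distance(prev_pt, start, end)
    d_curr = signed_distance(curr_pt, start, end)
    if d_prev * d_curr >= 0:
        return None  # same side (or on the line): no sign change
    if abs(d_curr) <= crossing_threshold_pixels:
        return None  # not far enough past the line yet
    return "a_to_b" if d_prev < 0 else "b_to_a"

line_start, line_end = (100, 0), (100, 200)
direction = crossing_direction((150, 100), (50, 100), line_start, line_end)
```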

#### 9.3.9 Module 9: Repeated Re-entry Patterns

Detects suspicious patterns of entering and exiting an area multiple times.

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `reentry_zone` | Full frame | polygon | Area to monitor for entries/exits |
| `time_window_seconds` | 600 | 60-3600 | Time window for counting cycles |
| `reentry_threshold` | 3 | 2-10 | Min entry/exit cycles to trigger |
| `min_cycle_duration_seconds` | 30 | 5-300 | Min duration of one cycle |

**State Machine:**
```
For each track:
    Track state: OUTSIDE -> ENTERING -> INSIDE -> EXITING -> OUTSIDE
    Each complete cycle (entry + exit) increments counter
    If cycle_count >= threshold within time_window:
        Trigger REENTRY_PATTERN alert
```
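
A simplified sketch of the cycle counter is shown below. It collapses the transitional ENTERING/EXITING states into a single inside/outside flag and interprets `min_cycle_duration_seconds` as the minimum time spent inside the zone; both simplifications, and the class name `ReentryCounter`, are assumptions made for illustration.

```python
from collections import deque

class ReentryCounter:
    """Counts completed entry/exit cycles for one track inside a sliding window."""

    def __init__(self, time_window_s=600, reentry_threshold=3, min_cycle_s=30):
        self.time_window_s = time_window_s
        self.reentry_threshold = reentry_threshold
        self.min_cycle_s = min_cycle_s
        self.inside = False
        self.entered_at = None
        self.cycle_times = deque()  # timestamps at which a cycle completed

    def update(self, in_zone, now):
        """Feed one observation; return True when REENTRY_PATTERN should fire."""
        if in_zone and not self.inside:            # OUTSIDE -> INSIDE
            self.inside, self.entered_at = True, now
        elif not in_zone and self.inside:          # INSIDE -> OUTSIDE
            self.inside = False
            if now - self.entered_at >= self.min_cycle_s:
                self.cycle_times.append(now)
        # Drop cycles that fell out of the time window
        while self.cycle_times and now - self.cycle_times[0] > self.time_window_s:
            self.cycle_times.popleft()
        return len(self.cycle_times) >= self.reentry_threshold
```

One counter instance would be kept per (track, zone) pair, with `now` expressed in seconds.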

#### 9.3.10 Module 10: Suspicious Dwell Time

Extended presence near sensitive areas (different from general loitering).

| **Parameter** | **Default** | **Range** | **Description** |
|---|---|---|---|
| `sensitive_zones` | [] (user-defined) | — | Zones with custom dwell thresholds |
| `default_dwell_threshold_seconds` | 120 | 10-1800 | Default threshold |
| `max_gap_seconds` | 5.0 | 1.0-30.0 | Max disappearance gap before timer reset |

**Predefined zone types with default thresholds:**

| **Zone Type** | **Default Threshold** | **Default Severity** |
|---|---|---|
| `main_entrance` | 60s | MEDIUM |
| `emergency_exit` | 30s | HIGH |
| `equipment_room` | 45s | HIGH |
| `storage_area` | 120s | MEDIUM |
| `elevator_bank` | 90s | LOW |
| `parking_access` | 60s | MEDIUM |
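
The interaction between the dwell threshold and `max_gap_seconds` can be sketched as a small per-track timer. This is an illustrative sketch only; the class name `DwellTimer` and the choice to restart the timer from the current observation after a long gap are assumptions.

```python
class DwellTimer:
    """Tracks continuous presence of one track in one sensitive zone."""

    def __init__(self, threshold_s=120, max_gap_s=5.0):
        self.threshold_s = threshold_s
        self.max_gap_s = max_gap_s
        self.first_seen = None
        self.last_seen = None

    def update(self, now):
        """Call whenever the track is observed in the zone.

        Returns True once continuous dwell reaches the threshold.
        """
        if self.last_seen is None or now - self.last_seen > self.max_gap_s:
            self.first_seen = now          # gap too long: restart the timer
        self.last_seen = now
        return now - self.first_seen >= self.threshold_s
```

A zone of type `emergency_exit` would, for example, construct the timer with `threshold_s=30`.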

### 9.4 Activity Scoring Engine

#### 9.4.1 Composite Score Formula

All 10 modules feed into a unified scoring engine that produces a single suspicious activity score per camera:

```
S_total(t) = SUM_i( weight_i * signal_i(t) * decay(t - t_i) ) + bonus_cross_module

Where:
    weight_i: module-specific weight (see table below)
    signal_i(t): normalized signal value from module i (nominally [0, 1]; ratio-based signals can exceed 1.0)
    decay(delta_t): exponential time-decay function
    bonus_cross_module: extra score when multiple modules fire simultaneously
    t_i: timestamp of most recent event from module i
```

#### 9.4.2 Module Weights

| **Module** | **Weight** | **Signal Source** | **Signal Range** |
|---|---|---|---|
| Intrusion Detection | 0.25 | overlap_ratio * confidence | 0.0 - 1.0 |
| Loitering Detection | 0.15 | dwell_ratio (dwell_time / threshold) | 0.0 - 1.0+ |
| Running Detection | 0.10 | speed_ratio normalized | 0.0 - 1.0+ |
| Crowding Detection | 0.12 | crowd_density_score | 0.0 - 1.0 |
| Fall Detection | 0.20 | fall_confidence_score | 0.0 - 1.0 |
| Abandoned Object | 0.18 | unattended_ratio (duration / threshold) | 0.0 - 1.0+ |
| After-Hours Presence | 0.05 | binary (1 if detected) * zone_severity_multiplier | 0.0 - 1.0 |
| Zone Breach | 0.12 | severity_mapped (LOW=0.3, MED=0.6, HIGH=1.0) | 0.0 - 1.0 |
| Re-entry Patterns | 0.10 | cycle_ratio (count / threshold) | 0.0 - 1.0+ |
| Suspicious Dwell | 0.13 | dwell_ratio (duration / zone_threshold) | 0.0 - 1.0+ |

**Note:** Weights sum to 1.40 — this is intentional to allow cross-module amplification when multiple modules fire simultaneously.
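
As a worked example of the formula in 9.4.1, the sketch below combines two module signals. It assumes a 5-minute half-life for the decay term and takes the cross-module bonus as a plain number; the `events` dictionary shape and the function name `composite_score` are hypothetical.

```python
import math

# Weights copied from the table above (subset shown for brevity)
WEIGHTS = {"intrusion": 0.25, "loitering": 0.15, "fall": 0.20}

def time_decay(delta_t_seconds, half_life=300):
    """Exponential decay: the contribution halves every half_life seconds."""
    return math.exp(-math.log(2) * delta_t_seconds / half_life)

def composite_score(events, now, bonus_cross_module=0.0):
    """events: {module_name: (signal_value, last_event_timestamp)}."""
    total = 0.0
    for module, (signal, t_i) in events.items():
        total += WEIGHTS[module] * signal * time_decay(now - t_i)
    return total + bonus_cross_module

# Intrusion firing now (signal 0.8) plus loitering from 5 minutes ago (signal 1.0)
score = composite_score({"intrusion": (0.8, 1000.0), "loitering": (1.0, 700.0)}, now=1000.0)
print(round(score, 3))  # 0.25*0.8*1.0 + 0.15*1.0*0.5 = 0.275
```

The loitering event contributes only half of its weighted value because exactly one half-life has elapsed since it fired.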

#### 9.4.3 Time-Decay Function

```python
import math

def time_decay(delta_t_seconds, half_life=300):
    """Exponential decay with 5-minute half-life by default."""
    # exp(-ln(2) * dt / half_life) halves the contribution every half_life seconds
    return math.exp(-math.log(2) * delta_t_seconds / half_life)

# Decay reference:
#   0 min -> 1.000 (full contribution)
#   1 min -> 0.871
#   5 min -> 0.500
#  10 min -> 0.250
#  20 min -> 0.063
#  30 min -> 0.016 (effectively zero)
```

#### 9.4.4 Cross-Module Amplification Bonus

When multiple modules detect simultaneously for the same track or in close proximity:

```python
def compute_cross_module_bonus(active_signals, n_same_track_signals=0,
                               n_same_zone_signals=0, proximity_weight=0.15):
    """Extra score when several modules fire together.

    n_same_track_signals / n_same_zone_signals: number of active signals that
    share one track or one zone (the caller is assumed to compute these counts).
    """
    n_modules = len(active_signals)
    if n_modules <= 1:
        return 0.0

    # Base bonus: +0.15 per additional module
    base_bonus = proximity_weight * (n_modules - 1)

    # Track overlap: same person triggering multiple rules -> higher threat
    track_bonus = 0.10 * (n_same_track_signals - 1) if n_same_track_signals >= 2 else 0.0

    # Zone overlap: multiple signals in same zone -> higher threat
    zone_bonus = 0.08 * (n_same_zone_signals - 1) if n_same_zone_signals >= 2 else 0.0

    return min(base_bonus + track_bonus + zone_bonus, 0.50)  # Cap at +0.50
```

#### 9.4.5 Escalation Thresholds

| **Score Range** | **Threat Level** | **Color** | **Actions** |
|---|---|---|---|
| 0.00 - 0.20 | NONE | Gray | Log only, no alert |
| 0.20 - 0.40 | LOW | Blue | Log + dashboard indicator |
| 0.40 - 0.60 | MEDIUM | Yellow | Log + non-urgent alert dispatch |
| 0.60 - 0.80 | HIGH | Orange | Log + immediate alert + highlight |
| 0.80 - 1.00 | CRITICAL | Red | Log + all channels + security dispatch recommendation |
| > 1.00 | EMERGENCY | Purple/Flashing | All channels + automatic escalation to security lead |
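
The table can be turned into a lookup as sketched below. Because adjacent ranges share their boundary values, this sketch assumes each band includes its lower bound (so 0.20 maps to LOW), while exactly 1.00 stays CRITICAL as the table states; that tie-breaking rule and the function name `threat_level` are assumptions.

```python
from bisect import bisect_right

_BOUNDS = [0.20, 0.40, 0.60, 0.80]
_LEVELS = ["NONE", "LOW", "MEDIUM", "HIGH", "CRITICAL"]

def threat_level(score):
    """Map a composite score to a threat level (lower bound inclusive)."""
    if score > 1.00:
        return "EMERGENCY"
    return _LEVELS[bisect_right(_BOUNDS, score)]
```

For example, `threat_level(0.59)` returns `"MEDIUM"` and `threat_level(1.2)` returns `"EMERGENCY"`.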

### 9.5 Night Mode Scheduler

#### 9.5.1 Automatic Schedule

| **Parameter** | **Default** | **Configurable** |
|---|---|---|
| Start time | 22:00 (10 PM) | Yes, per camera |
| End time | 06:00 (6 AM) | Yes, per camera |
| Gradual transition | 15 minutes | Yes (0-60 min) |
| Timezone | Local site timezone | Yes |
| Override | Manual toggle available | Admin only |
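
Because the default window (22:00 to 06:00) wraps past midnight, a simple `start <= now < end` comparison is not enough. The sketch below shows one way to handle the wrap, assuming `now` is already expressed in the site's local timezone; the function name `in_night_window` is hypothetical.

```python
from datetime import time

def in_night_window(now, start=time(22, 0), end=time(6, 0)):
    """True if `now` (a datetime.time in site-local time) is within [start, end)."""
    if start <= end:                      # same-day window, e.g. 01:00-05:00
        return start <= now < end
    return now >= start or now < end      # window wraps past midnight
```

A per-camera custom schedule would pass its own `start` and `end` values.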

#### 9.5.2 Gradual Transition

During the 15-minute transition window, sensitivity ramps linearly:

```
Transition Start (21:45)          Night Full (22:00)         Transition End (22:15)
      |                                  |                           |
      v                                  v                           v
Sensitivity: 0% ---- 25% ---- 50% ---- 75% ---- 100% ---- 100% ---- 100%
              |__________|__________|__________|__________|__________|
                  Ramp up to full night sensitivity over 15 minutes
```

This prevents sudden spikes in alerts when night mode activates.
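
A linear ramp of this kind can be sketched as a clamped ratio. The function below is illustrative only: the moment at which the ramp begins is passed in as `ramp_start` rather than derived from the schedule, and the name `transition_sensitivity` is an assumption.

```python
from datetime import datetime, timedelta

def transition_sensitivity(now, ramp_start, ramp_minutes=15):
    """Fraction of full night sensitivity (0.0-1.0) during a linear ramp."""
    if ramp_minutes <= 0:
        return 1.0 if now >= ramp_start else 0.0
    elapsed = (now - ramp_start) / timedelta(minutes=ramp_minutes)
    return max(0.0, min(1.0, elapsed))
```

The returned fraction could scale the night-mode adjustments (for example the weight boosts) so that they phase in gradually instead of switching on at once.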

#### 9.5.3 Night Mode Behavior Changes

| **Aspect** | **Day Mode** | **Night Mode** |
|---|---|---|
| Detection modules | Intrusion, Crowding, Fall, Abandoned Object | All 10 modules active |
| AI Vibe preset | Per-camera setting | Automatically Strict |
| Confidence threshold | Per-camera setting | +0.10 (stricter) |
| Scoring engine weights | Standard weights | +25% intrusion, +20% fall |
| Alert suppression | 5-minute cooldown | 2-minute cooldown (faster alerts) |
| After-hours detection | Disabled | Enabled (primary night function) |

### 9.6 Per-Camera Configuration

Each camera has independent configuration for all detection modules:

```yaml
# Example: Camera 1 - Main Entrance
cam_01:
  enabled: true
  location: "Main Entrance Lobby"
  night_mode:
    enabled: true
    custom_schedule: null        # Use system default (22:00-06:00)
    sensitivity_multiplier: 1.0   # Standard sensitivity

  intrusion_detection:
    enabled: true
    confidence_threshold: 0.65
    overlap_threshold: 0.30
    cooldown_seconds: 30
    restricted_zones:
      - zone_id: "server_room_door"
        polygon: [[0.65,0.20], [0.85,0.20], [0.85,0.60], [0.65,0.60]]
        severity: "HIGH"

  loitering_detection:
    enabled: true
    dwell_time_threshold_seconds: 300
    movement_tolerance_pixels: 50

  running_detection:
    enabled: true
    speed_threshold_pixels_per_second: 150
    confirmation_frames: 3

  fall_detection:
    enabled: true
    fall_score_threshold: 0.75
    temporal_confirmation_ms: 1000

  # ... (all 10 modules configured)
```
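
The polygon values in the example appear to be normalized to the frame (0.0 to 1.0), which is an assumption here rather than something the configuration states. Under that assumption, the sketch below converts a pixel position to normalized coordinates and runs a standard ray-casting point-in-polygon test; both function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of [x, y] vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross the edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def to_normalized(px, py, frame_width=960, frame_height=1080):
    """Convert pixel coordinates to frame-relative coordinates."""
    return px / frame_width, py / frame_height

zone = [[0.65, 0.20], [0.85, 0.20], [0.85, 0.60], [0.65, 0.60]]
print(point_in_polygon(*to_normalized(720, 432), zone))  # True: (0.75, 0.40)
```

The intrusion module itself uses an overlap ratio between the bounding box and the zone (`overlap_threshold`), so a single-point test like this one is only a simplified stand-in, for example for a bounding-box foot point.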

### 9.7 Alert Generation Logic

#### 9.7.1 Alert Lifecycle

```
+------------+    +------------+    +------------+    +------------+
|  DETECTED  | -> | SUPPRESSED | -> |  EVIDENCE  | -> | DISPATCHED |
| (Rule fire)|    | (Dedup)    |    | (Capture)  |    | (Send)     |
+------------+    +------------+    +------------+    +------------+
                                                          |
                                                          v
                                                   +------------+
                                                   | ACKNOWLEDGE|
                                                   | or AUTO    |
                                                   +------------+
```

#### 9.7.2 Suppression Rules

| **Condition** | **Action** | **Reason** |
|---|---|---|
| Duplicate within suppression window | Log + increment counter | Prevent alert spam |
| Detection confidence < rule minimum | Log only | Insufficient evidence |
| Threat score < LOW threshold | Log only | Below alert threshold |
| Max alerts/hour for camera exceeded | Log + rate-limit flag | Prevent overflow |
| Composite score indicates low overall threat | Log + dashboard only | Reduce noise |

#### 9.7.3 Suppression Configuration

| **Parameter** | **Default** | **Range** |
|---|---|---|
| Default suppression window | 5 minutes | 0-60 minutes |
| Max alerts per hour per camera | 20 | 5-100 |
| Max alerts per hour per rule | 10 | 5-50 |
| Evidence snapshot frames before | 5 frames | 1-30 frames |
| Evidence snapshot frames after | 10 frames | 1-30 frames |
| Evidence clip duration | 10 seconds | 5-60 seconds |
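
The suppression window and the hourly cap can be combined as sketched below. The deduplication key `(camera_id, rule, track_id)`, the class name `AlertSuppressor`, and the returned status strings are assumptions made for illustration; the per-rule hourly cap is omitted for brevity.

```python
from collections import deque

class AlertSuppressor:
    """Deduplicates alerts per key and enforces an hourly cap per camera."""

    def __init__(self, window_s=300, max_per_hour=20):
        self.window_s = window_s
        self.max_per_hour = max_per_hour
        self.last_sent = {}        # (camera_id, rule, track_id) -> timestamp
        self.sent_times = {}       # camera_id -> deque of dispatch timestamps

    def decide(self, camera_id, rule, track_id, now):
        """Return 'dispatch', 'duplicate', or 'rate_limited'."""
        key = (camera_id, rule, track_id)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return "duplicate"
        recent = self.sent_times.setdefault(camera_id, deque())
        while recent and now - recent[0] >= 3600:
            recent.popleft()       # forget dispatches older than one hour
        if len(recent) >= self.max_per_hour:
            return "rate_limited"
        self.last_sent[key] = now
        recent.append(now)
        return "dispatch"
```

Alerts that come back as `duplicate` or `rate_limited` would still be logged, matching the suppression rules in 9.7.2.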

#### 9.7.4 Severity Assignment

Final alert severity considers both the triggering module and the composite score context:

```python
def assign_alert_severity(detection_event, composite_score):
    base_severity = detection_event['severity']  # From module config
    severity_levels = {'LOW': 1, 'MEDIUM': 2, 'HIGH': 3, 'CRITICAL': 4}
    base_level = severity_levels.get(base_severity, 2)

    # Escalation: high composite score bumps severity up one level
    if composite_score >= 0.80 and base_level < 3:
        base_level = min(base_level + 1, 4)

    # Escalation: multiple concurrent detections for same track
    if detection_event.get('concurrent_detections_count', 0) >= 2:
        base_level = min(base_level + 1, 4)

    # Zone-specific escalation override
    if detection_event.get('zone_severity_override'):
        zone_level = severity_levels.get(detection_event['zone_severity_override'], base_level)
        base_level = max(base_level, zone_level)

    reverse_levels = {v: k for k, v in severity_levels.items()}
    return reverse_levels.get(base_level, 'MEDIUM')
```

### 9.8 Integration with Main AI Pipeline

The suspicious activity service consumes detection events from the main AI pipeline:

```
+-----------------------------------------------------------------------------+
|               SUSPICIOUS ACTIVITY INTEGRATION WITH MAIN PIPELINE             |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Main AI Pipeline Output:                                                    |
|  { person_id, track_id, bbox, keypoints, face_embedding, timestamp,        |
|    camera_id, confidence, face_crop_path }                                  |
|       |                                                                      |
|       v                                                                      |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Kafka Topic       | -> | Suspicious Activity| -> | Scoring Engine    |    |
|  | ai.detections     |    | Service            |    | (per camera)      |    |
|  | (JSON events)     |    | - 10 modules       |    | - Composite score |    |
|  +-------------------+    | - Per-camera config|    | - Time decay      |    |
|                           | - Zone polygons    |    | - Cross-module    |    |
|                           +-------------------+    |   bonus           |    |
|                                                     +---------+---------+    |
|                                                               |              |
|                                                               v              |
|                           +-------------------+    +-------------------+    |
|                           | Alert Manager     | <- | Scoring Output    |    |
|                           | - Deduplicate     |    | - Score [0, 1.5]  |    |
|                           | - Rate limit      |    | - Threat level    |    |
|                           | - Severity assign |    | - Active signals  |    |
|                           +---------+---------+    +-------------------+    |
|                                     |                                        |
|                                     v                                        |
|                           +-------------------+                             |
|                           | Alerts Table (DB) |                             |
|                           | Notification Svc  |                             |
|                           +-------------------+                             |
|                                                                              |
+-----------------------------------------------------------------------------+
```

**Key integration points:**
- Suspicious Activity Service is a **Kafka consumer** on the `ai.detections` topic
- Processes events **after** face recognition (has access to person identity)
- Produces alert records to the `alerts.critical` topic for notification dispatch
- Updates the composite score in **Redis** (with TTL = 2 * half_life) for dashboard real-time display
- Stores all alert records in **PostgreSQL** for history and analytics

> **Reference**: For complete detection algorithm pseudocode, zone configuration YAML schema, scoring engine implementation, and evidence capture logic, see `suspicious_activity.md` — Sections 2-6.

---

## Section 10: Live Video Streaming Design

### 10.1 RTSP Stream Configuration for CP PLUS DVR

#### 10.1.1 URL Format

The CP PLUS ORANGE DVR uses a Dahua-compatible RTSP URL scheme:

```
rtsp://admin:{password}@{dvr_ip}:554/cam/realmonitor?channel={N}&subtype={M}

Where:
    N = channel number (1-8)
    M = stream type (0 = main stream, 1 = sub stream)
```
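
A small helper can build these URLs and avoid malformed results when the password contains reserved characters such as `@` or `:`. This is a sketch under the URL scheme shown above; the function name `build_rtsp_url` and its parameters are hypothetical.

```python
from urllib.parse import quote

def build_rtsp_url(dvr_ip, password, channel, subtype=0, user="admin", port=554):
    """Build a Dahua-style RTSP URL; credentials are percent-encoded."""
    if not 1 <= channel <= 8:
        raise ValueError("channel must be in 1-8")
    if subtype not in (0, 1):
        raise ValueError("subtype must be 0 (main) or 1 (sub)")
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return (f"rtsp://{creds}@{dvr_ip}:{port}"
            f"/cam/realmonitor?channel={channel}&subtype={subtype}")
```

For instance, a password of `p@ss` is encoded as `p%40ss`, so the `@` separating credentials from the host remains unambiguous.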

**Example URLs for all 8 channels:**

| **Channel** | **Main Stream** | **Sub Stream** |
|---|---|---|
| CH1 | `rtsp://admin:password@192.168.29.200:554/cam/realmonitor?channel=1&subtype=0` | `rtsp://admin:password@192.168.29.200:554/cam/realmonitor?channel=1&subtype=1` |
| CH2 | `...channel=2&subtype=0` | `...channel=2&subtype=1` |
| CH3 | `...channel=3&subtype=0` | `...channel=3&subtype=1` |
| CH4 | `...channel=4&subtype=0` | `...channel=4&subtype=1` |
| CH5 | `...channel=5&subtype=0` | `...channel=5&subtype=1` |
| CH6 | `...channel=6&subtype=0` | `...channel=6&subtype=1` |
| CH7 | `...channel=7&subtype=0` | `...channel=7&subtype=1` |
| CH8 | `...channel=8&subtype=0` | `...channel=8&subtype=1` |

#### 10.1.2 Stream Properties

| **Property** | **Main Stream (subtype=0)** | **Sub Stream (subtype=1)** |
|---|---|---|
| Resolution | 960 x 1080 | 352 x 288 to 704 x 576 |
| Frame rate | 25 FPS (PAL) | 25 FPS |
| Video codec | H.264 High Profile | H.264 Baseline/Main |
| Bitrate | ~4 Mbps per channel | ~1 Mbps per channel |
| Audio | G.711/AAC (optional) | None |
| Use case | Fullscreen viewing, evidence clips | AI inference, multi-camera grid |

#### 10.1.3 Stream Discovery

The edge gateway can auto-discover streams via ONVIF:

```python
from onvif import ONVIFCamera

camera = ONVIFCamera('192.168.29.200', 80, 'admin', 'password')
media_service = camera.create_media_service()
profiles = media_service.GetProfiles()

for profile in profiles:
    stream_uri = media_service.GetStreamUri({
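        # ONVIF StreamSetup: Stream is 'RTP-Unicast'; Transport is a struct with a Protocol field
        'StreamSetup': {'Stream': 'RTP-Unicast', 'Transport': {'Protocol': 'RTSP'}},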
        'ProfileToken': profile.token
    })
    print(f"Channel: {profile.token}, URI: {stream_uri.Uri}")
```

### 10.2 Edge Gateway Stream Handling

#### 10.2.1 FFmpeg Ingestion Pipeline

The edge gateway runs one FFmpeg process per camera stream:

```bash
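# Main stream: HLS generation for live viewing
# -stimeout (microseconds) is the FFmpeg 4.x RTSP socket timeout; FFmpeg 5.0+ uses -timeout instead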
ffmpeg -hide_banner -loglevel warning \
    -rtsp_transport tcp -stimeout 5000000 \
    -fflags +genpts+discardcorrupt+igndts+ignidx \
    -reorder_queue_size 64 -buffer_size 655360 \
    -i "rtsp://admin:password@192.168.29.200:554/cam/realmonitor?channel=1&subtype=0" \
    -c:v copy -c:a copy \
    -f hls -hls_time 2 -hls_list_size 5 -hls_delete_threshold 2 \
    -hls_flags delete_segments+omit_endlist+program_date_time \
    -hls_segment_filename "/data/hls/ch1_%04d.ts" \
    "/data/hls/ch1.m3u8" \
    2>> /var/log/ffmpeg_ch1.log
```

#### 10.2.2 Stream Health Monitoring

| **Check** | **Frequency** | **Failure Action** |
|---|---|---|
| FFmpeg process alive | Every 5s | Restart process |
| RTSP connection health | Every 10s | Reconnect with backoff |
| Frame rate validation | Every 30s | Alert if FPS < 20 |
| Bitrate validation | Every 30s | Alert if bitrate < 50% expected |
| Disk space check | Every 60s | Alert if < 10% free, emergency if < 5% |

#### 10.2.3 Auto-Reconnect Logic

```python
import random


class StreamReconnectManager:
    """Handles RTSP stream reconnection with exponential backoff."""

    INITIAL_BACKOFF = 1.0       # seconds
    MAX_BACKOFF = 60.0          # seconds
    BACKOFF_MULTIPLIER = 2.0
    JITTER = 0.1                # 10% random jitter

    def __init__(self):
        self.current_backoff = self.INITIAL_BACKOFF
        self.consecutive_failures = 0

    def on_disconnect(self):
        """Record a failure and return how long to wait before retrying."""
        self.consecutive_failures += 1
        # First failure waits INITIAL_BACKOFF, then 2x, 4x, ... up to MAX_BACKOFF
        wait_time = min(
            self.current_backoff * (self.BACKOFF_MULTIPLIER ** (self.consecutive_failures - 1)),
            self.MAX_BACKOFF
        )
        # Add jitter to prevent thundering herd
        wait_time *= (1 + random.uniform(-self.JITTER, self.JITTER))
        return wait_time

    def on_success(self):
        self.current_backoff = self.INITIAL_BACKOFF
        self.consecutive_failures = 0

    def should_circuit_break(self):
        return self.consecutive_failures >= 5  # Open circuit after 5 failures
```

### 10.3 HLS Generation for Dashboard

#### 10.3.1 HLS Segment Configuration

| **Parameter** | **Value** | **Rationale** |
|---|---|---|
| Segment duration (`-hls_time`) | 2 seconds | Balance between latency and segment count |
| Playlist size (`-hls_list_size`) | 5 segments | 10-second sliding window for live playback |
| Delete threshold | 2 segments beyond playlist size | Disk cleanup |
| Flags | `delete_segments+omit_endlist+program_date_time` | Live mode, no end list, accurate timing |
| Segment naming | `ch{N}_%04d.ts` | Sequential numbering for cache busting |
| Segment path | `/data/hls/` | Fast NVMe storage |

#### 10.3.2 Multi-Bitrate HLS (Optional)

For adaptive bitrate streaming, three variants are generated per channel:

```bash
# High quality (main stream, copy codec)
ffmpeg -i "rtsp://...channel=1&subtype=0" -c:v copy -f hls -hls_time 2 \
    -hls_segment_filename "ch1_high_%04d.ts" "ch1_high.m3u8"

# Medium quality (transcoded)
ffmpeg -i "rtsp://...channel=1&subtype=0" -c:v libx264 -preset fast -crf 23 \
    -vf "scale=640:480" -f hls -hls_time 2 \
    -hls_segment_filename "ch1_mid_%04d.ts" "ch1_mid.m3u8"

# Low quality (sub stream)
ffmpeg -i "rtsp://...channel=1&subtype=1" -c:v copy -f hls -hls_time 2 \
    -hls_segment_filename "ch1_low_%04d.ts" "ch1_low.m3u8"
```

**Master playlist:**
```m3u8
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=4000000,RESOLUTION=960x1080
ch1_high.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=640x480
ch1_mid.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=500000,RESOLUTION=352x288
ch1_low.m3u8
```

#### 10.3.3 HLS Latency Budget

| **Stage** | **Latency** |
|---|---|
| DVR encoding | 50-100 ms |
| RTSP to edge | 1-2 ms |
| FFmpeg demux/remux | 20-50 ms |
| HLS segment duration | 2000 ms (2-second segments) |
| Nginx/CDN delivery | 10-50 ms |
| HLS.js buffer | 2000-4000 ms (1-2 segments) |
| Browser decode + render | 20-50 ms |
| **Total (camera to eye)** | **~4.1 - 6.3 seconds** |

### 10.4 WebRTC for Low-Latency Single Camera

For single-camera fullscreen viewing where low latency is critical, WebRTC provides sub-second delivery.

#### 10.4.1 WebRTC Architecture

```
+------------+    +-------------------+    +-------------------+    +--------+
| Browser    |    | Edge Gateway      |    | FFmpeg            |    | DVR    |
| (WebRTC    |<-->| (WHIP/WHEP        |<-->| (decode RTSP,     |<-->| RTSP   |
|  client)   |    |  bridge)          |    |  encode VP8/H.264)|    | Server |
+------------+    +-------------------+    +-------------------+    +--------+
```

#### 10.4.2 WebRTC Configuration

| **Parameter** | **Value** |
|---|---|
| Signaling protocol | WHIP (ingress) / WHEP (egress) |
| Video codec | H.264 (hardware) or VP8 (software) |
| Latency target | < 500 ms end-to-end |
| ICE servers | STUN only (both peers behind NAT) |
| Max bitrate | 3 Mbps |
| Resolution | 960x1080 (main stream) |

#### 10.4.3 WebRTC Latency Budget

| **Stage** | **Latency** |
|---|---|
| DVR encoding | 50-100 ms |
| RTSP to edge | 1-2 ms |
| FFmpeg decode + WebRTC encode | 30-80 ms |
| Network (edge to browser via VPN) | 100-200 ms |
| Browser decode | 20-50 ms |
| **Total** | **~200-430 ms** |

### 10.5 Multi-Camera Grid Layout

#### 10.5.1 Layout Configurations

| **Layout** | **Cameras** | **Stream Used** | **Per-Camera Resolution** | **Total Bandwidth** |
|---|---|---|---|---|
| 1x1 (fullscreen) | 1 | Main (subtype=0) | 960x1080 | ~4 Mbps |
| 2x2 grid | 4 | Sub (subtype=1) | 352x288 | ~4 Mbps total |
| 3x3 grid | 8+1 empty | Sub (subtype=1) | 352x288 | ~8 Mbps total |
| 4x2 grid | 8 | Sub (subtype=1) | 352x288 | ~8 Mbps total |
| Custom | User-defined | Mixed | Mixed | Sum of selected |

**Smart stream selection:** The dashboard automatically switches streams based on layout:
- Fullscreen single camera -> Main stream (high quality)
- Grid layout -> Sub stream (bandwidth-efficient)
- Camera clicked for fullscreen -> Dynamically switch to main stream
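
The selection rule above can be sketched as a function of the visible channels. The per-stream bitrates are the approximate figures from 10.1.2, and the function name `select_streams` is hypothetical; custom mixed layouts are not covered by this sketch.

```python
MAIN_MBPS, SUB_MBPS = 4.0, 1.0   # approximate per-channel bitrates from 10.1.2

def select_streams(visible_channels):
    """Return ({channel: subtype}, estimated total Mbps) for a dashboard layout."""
    subtype = 0 if len(visible_channels) == 1 else 1   # main stream only in fullscreen
    per_stream = MAIN_MBPS if subtype == 0 else SUB_MBPS
    return {ch: subtype for ch in visible_channels}, per_stream * len(visible_channels)
```

Clicking a grid tile would call the function again with a single channel, which switches that camera to `subtype=0`.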

#### 10.5.2 Grid Rendering

```
+-----------------------------------------------------------------------------+
|                         DASHBOARD GRID LAYOUTS                               |
+-----------------------------------------------------------------------------+
|                                                                              |
|  1x1 Layout:                         2x2 Layout:                            |
|  +------------------------+          +----------+----------+                 |
|  |                        |          | CH1      | CH2      |                 |
|  |   Camera 1             |          | (sub)    | (sub)    |                 |
|  |   Main stream          |          |          |          |                 |
|  |   960x1080             |          +----------+----------+                 |
|  |   ~4 Mbps              |          | CH3      | CH4      |                 |
|  +------------------------+          | (sub)    | (sub)    |                 |
|                                      |          |          |                 |
|                                      +----------+----------+                 |
|                                                                              |
|  3x3 Layout (8 cameras):                                                     |
|  +----------+----------+----------+                                          |
|  | CH1      | CH2      | CH3      |                                          |
|  | (sub)    | (sub)    | (sub)    |                                          |
|  +----------+----------+----------+                                          |
|  | CH4      | CH5      | CH6      |                                          |
|  | (sub)    | (sub)    | (sub)    |                                          |
|  +----------+----------+----------+                                          |
|  | CH7      | CH8      | [Empty]  |                                          |
|  | (sub)    | (sub)    |          |                                          |
|  +----------+----------+----------+                                          |
|                                                                              |
|  Bandwidth: ~8 Mbps total for 3x3 layout (8 x ~1 Mbps sub streams)          |
|                                                                              |
+-----------------------------------------------------------------------------+
```

### 10.6 Bandwidth Optimization

#### 10.6.1 Total Bandwidth Budget

| **Traffic Type** | **Direction** | **Bandwidth** | **Notes** |
|---|---|---|---|
| 8x RTSP ingestion | DVR -> Edge (local) | ~32 Mbps receive | Local LAN only |
| 8x HLS upload to cloud | Edge -> Cloud (via VPN) | ~8-16 Mbps upload | Transcoded and compressed |
| AI frames to cloud | Edge -> Cloud (via VPN) | ~2-4 Mbps upload | 1 FPS, JPEG compressed |
| Dashboard HLS playback | Cloud -> Browser | ~8 Mbps per user | Cached at CDN |
| Control/management | Bidirectional | < 1 Mbps | WebSocket, API calls |
| **Total edge upload** | | **~10-20 Mbps** | Primary concern for site bandwidth |

#### 10.6.2 Optimization Techniques

| **Technique** | **Savings** | **Implementation** |
|---|---|---|
| Sub-stream for grid view | 75% bandwidth reduction | Use subtype=1 (352x288) instead of subtype=0 (960x1080) |
| H.264 copy (no re-encode) for main stream | No transcoding CPU cost (remux only) | `-c:v copy` when no format change needed |
| JPEG quality tuning for AI frames | 50-70% size reduction | Quality 70-85 depending on scene complexity |
| Frame deduplication for AI | 10-30% frame reduction | Skip frames with < 2% pixel change |
| HLS segment caching at edge | Reduces cloud upload spikes | 5-segment buffer smooths burstiness |
| Compression for API/WebSocket | 60-80% reduction | `Content-Encoding: gzip` for HTTP APIs; `permessage-deflate` for WebSocket |
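
The frame deduplication rule can be sketched in pure Python as below. It assumes frames are equal-length sequences of 8-bit grayscale values and that a pixel counts as changed when it moves by more than a small tolerance (`pixel_delta`, an illustrative parameter); production code would more likely use NumPy or OpenCV on decoded frames.

```python
def changed_fraction(prev, curr, pixel_delta=10):
    """Fraction of pixels whose grayscale value moved by more than pixel_delta."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same size")
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_delta)
    return changed / len(curr)

def should_send(prev, curr, min_change=0.02):
    """Skip frames where fewer than 2% of pixels changed since the last sent frame."""
    return prev is None or changed_fraction(prev, curr) >= min_change
```

Comparing against the last frame that was actually sent, rather than the immediately preceding frame, keeps slow gradual changes from being skipped indefinitely.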

### 10.7 Fallback Handling

#### 10.7.1 Stream Failure Fallback Chain

```
Step 1: RTSP connection fails
    +-> Retry with exponential backoff (3 attempts)
    +-> Try UDP transport if TCP fails
    +-> Circuit breaker opens after 5 consecutive failures
    |
Step 2: Stream stall detected (no frames for 10s)
    +-> Kill FFmpeg process
    +-> Restart with fresh connection
    |
Step 3: Camera marked OFFLINE
    +-> Dashboard shows "Camera Offline" placeholder
    +-> HLS endpoint serves the static offline playlist (see 10.7.2)
    +-> Last known frame displayed with timestamp overlay
    +-> Alert sent to operations team
    |
Step 4: Camera recovers
    +-> Circuit breaker transitions to HALF_OPEN
    +-> Test stream pulled for 10 seconds
    +-> On success: circuit CLOSED, stream resumes
    +-> Dashboard auto-refreshes
```
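
The circuit-breaker transitions in the chain above can be sketched as a small state machine. The class and method names are hypothetical, and the timing of the 10-second test pull is left to the caller.

```python
class StreamCircuitBreaker:
    """CLOSED -> OPEN after repeated failures -> HALF_OPEN probe -> CLOSED."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the circuit
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"          # camera is treated as OFFLINE

    def begin_probe(self):
        if self.state == "OPEN":
            self.state = "HALF_OPEN"     # pull a test stream

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"            # stream resumes
```

While the state is `OPEN`, the gateway would stop spawning FFmpeg for that channel and only call `begin_probe()` on a recovery timer.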

#### 10.7.2 Offline Placeholder

When a camera is offline, the HLS endpoint returns a static playlist:

```m3u8
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-ERROR: "Camera OFFLINE - Channel 1"
#EXTINF:2.000,
offline_placeholder.ts
```

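The dashboard detects the `#EXT-X-ERROR` tag and displays a camera offline indicator with the last known timestamp. This tag is a custom extension rather than part of the HLS specification, so players that do not recognize it should simply ignore it.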

#### 10.7.3 Edge Buffer Management

The 2TB NVMe edge storage is partitioned for circular buffer operation:

| **Directory** | **Max Size** | **Retention** | **Cleanup** |
|---|---|---|---|
| `/data/hls/` | 20 GB | Rolling (5 segments) | Automatic via FFmpeg |
| `/data/buffer/ch1-ch8/` | 1.5 TB | 7 days circular | Age-based FIFO |
| `/data/buffer/ai_frames/` | 100 GB | 24 hours | Age-based |
| `/data/buffer/evidence/` | 200 GB | 30 days | Event-linked retention |
| `/data/logs/` | 10 GB | 30 days | Logrotate |
| `/data/tmp/` | 50 GB | On process exit | Cleanup on restart |
| **Total reserved** | **~1.88 TB** | — | Fits in 2TB NVMe |
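
As a sanity check on the `/data/buffer/ch1-ch8/` row, the retention that a given capacity supports can be estimated from the stream bitrate. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a constant bitrate per channel; the function name `retention_days` is hypothetical.

```python
def retention_days(capacity_tb, channels, mbps_per_channel):
    """Days of continuous recording that fit in capacity_tb (decimal TB)."""
    bytes_per_second = channels * mbps_per_channel * 1_000_000 / 8
    seconds = capacity_tb * 1e12 / bytes_per_second
    return seconds / 86_400

print(round(retention_days(1.5, 8, 4.0), 2))   # 4.34
print(round(retention_days(1.5, 8, 2.4), 2))   # 7.23
```

At a constant ~4 Mbps per channel, 1.5 TB holds roughly 4.3 days of footage from 8 channels, so the 7-day figure presumably relies on a lower average bitrate (about 2.4 Mbps per channel or less). This is worth verifying against measured bitrates before sizing the buffer.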

**Buffer exhaustion handling:**
1. At 80% capacity: Alert admin, begin aggressive cleanup of old non-evidence data
2. At 90% capacity: Stop non-critical buffering (AI frames), preserve HLS + evidence only
3. At 95% capacity: Emergency mode — evidence-only recording, all other buffers purged
4. Never delete evidence clips linked to unresolved alerts
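
The capacity tiers above can be sketched as a simple threshold lookup. The tier names returned here are hypothetical labels, and `shutil.disk_usage` is used only to show where the usage figure might come from.

```python
import shutil

def buffer_mode(used_fraction):
    """Map disk usage (0.0-1.0) to the handling tier listed above."""
    if used_fraction >= 0.95:
        return "EMERGENCY"     # evidence-only recording, other buffers purged
    if used_fraction >= 0.90:
        return "RESTRICTED"    # stop AI-frame buffering, keep HLS + evidence
    if used_fraction >= 0.80:
        return "CLEANUP"       # alert admin, aggressive cleanup of old data
    return "NORMAL"

def current_used_fraction(path="/data"):
    """Used share of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

Whatever the tier, the cleanup routine would still need to skip evidence clips linked to unresolved alerts, as rule 4 requires.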

#### 10.7.4 DVR Full Disk Mitigation

Since the DVR disk is full (0 bytes free), the system does not rely on DVR-side recording:

| **Function** | **Traditional** | **Our Design** |
|---|---|---|
| Continuous recording | DVR internal HDD | Edge gateway 2TB NVMe buffer |
| Event/alert clips | DVR playback export | Cloud MinIO + S3 archival |
| Long-term storage | DVR disk rotation | AWS S3 tiered lifecycle |
| Playback | DVR web UI | Cloud dashboard with timeline |

> **Reference**: For complete FFmpeg commands including multi-output tee muxer, frame extraction for AI, WebRTC bridge code, and the ring buffer implementation, see `video_ingestion.md` — Sections 4-7.

---

*End of Part A (Sections 1-10)*

*This unified technical blueprint synthesizes outputs from 11 specialist agents across 6 domain-specific design documents. For detailed implementation code, DDL, algorithms, and configuration, refer to the individual specialist documents listed in the cross-reference guide at the top of this document.*

| **Document** | **Path** | **Content** |
|---|---|---|
| Architecture | `architecture.md` | Full deployment specs, scaling, cost, failover |
| Video Ingestion | `video_ingestion.md` | RTSP config, FFmpeg, edge gateway, HLS, WebRTC |
| AI Vision | `ai_vision.md` | Model configs, inference pipeline, benchmarks |
| Database Schema | `database_schema.md` | Complete DDL, triggers, views, RLS |
| Suspicious Activity | `suspicious_activity.md` | 10 detection modules, scoring engine |
| Training System | `training_system.md` | Learning pipeline, quality gates, versioning |