# Industrial Surveillance AI Vision Pipeline — Complete Technical Design

## Document Information
| Property | Value |
|----------|-------|
| Version | 1.0.0 |
| Date | 2025-07-24 |
| DVR | CP PLUS 8-Channel, 960x1080 per channel |
| Environment | Indoor / Industrial Mixed |
| Target Streams | 8 simultaneous RTSP feeds |
| Edge Compute | Moderate (NVIDIA Jetson or x86 + T4 class GPU) |
| Cloud Compute | GPU-backed (V100 / A10 / T4) |

---

## 1. Human Detection Module

### 1.1 Model Selection: YOLO11-Medium (YOLO11m)

**Primary Choice: `YOLO11m` (Ultralytics)**

Rationale: YOLO11 strikes the optimal balance between accuracy and inference speed for industrial surveillance. The medium variant provides sufficient capacity to detect partially occluded humans at mid-range distances while maintaining real-time throughput across 8 streams.

| Attribute | Specification |
|-----------|--------------|
| Model | YOLO11m (Ultralytics release 8.3.x) |
| Backbone | C3k2 bottleneck + C2PSA attention module |
| Neck | PANet with compact feature aggregation |
| Head | Anchor-free decoupled detection head |
| Parameters | 20.1 M |
| Input Resolution | 640 x 640 (letterboxed from 960x1080) |
| mAP@50-95 (COCO) | 51.5% |
| Person Class AP | ~75-80% (estimated, COCO "person") |

**Alternative for GPU-constrained edge:** `YOLO11s` (9.4M params, 47.0% mAP, 2.5ms T4 TensorRT)
**Alternative for maximum accuracy:** `RT-DETR-L` (53.4% mAP, 6.8ms T4 TensorRT, transformer-based)

### 1.2 Inference Configuration

```yaml
# yolo11m_detection.yaml
model:
  weights: "yolo11m.pt"
  class_filter: ["person"]          # Only detect human class (COCO idx 0)
  confidence_threshold: 0.35         # Balanced sensitivity
  iou_threshold: 0.45                # NMS IoU threshold
  max_detections: 50                 # Max persons per frame
  imgsz: 640                         # Square input
  half: true                         # FP16 inference

dataloader:
  batch_size: 8                      # One frame per stream
  workers: 4
  pin_memory: true
```

### 1.3 Quantization & Optimization Strategy

| Optimization Stage | Target | Expected Speedup | Accuracy Impact |
|-------------------|--------|-----------------|-----------------|
| PyTorch FP32 | Baseline | 1.0x | Baseline (51.5% mAP) |
| ONNX Export | Interop | 1.1x | Negligible |
| TensorRT FP16 | Production GPU | **2.8x** (~4.7ms T4) | -0.1% mAP |
| TensorRT INT8 | Maximum throughput | **3.5x** (~3.8ms T4) | -0.3% to -0.5% mAP |
| INT8 + DLA | Jetson Orin DLA | **4.0x** | -0.5% mAP |

**Recommended production path:** TensorRT FP16 on GPU (best accuracy/speed tradeoff). INT8 for edge gateway with calibration dataset of 500+ representative surveillance frames.

### 1.4 Frame Preprocessing

```python
# Input: 960 x 1080 from DVR (960H resolution, 960x1080)
# Step 1: Resize to 640x640 with letterboxing (maintain aspect ratio)
# Step 2: Normalize: divide by 255.0, mean=[0.0,0.0,0.0], std=[1.0,1.0,1.0]
# Step 3: HWC -> CHW format
# Step 4: Batch: [8, 3, 640, 640]

# Expected output: person bounding boxes [x1, y1, x2, y2, conf, class_id]
```

### 1.5 Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Latency (single frame) | < 5ms @ T4 TensorRT FP16 | YOLO11m at 640 |
| Throughput (8 streams) | > 160 FPS aggregate | Batch=8, processing 20 FPS per stream |
| Person AP@50 | > 75% | On surveillance test set |
| Small person detection | > 60% AP | For persons > 30px tall |
| Occlusion handling | > 50% AP | Partial visibility (occlusion level 1-2) |

---

## 2. Face Detection Module

### 2.1 Model Selection: SCRFD-500MF-640GPU

**Primary Choice: `SCRFD_500M_BNKPS` (InsightFace Model Zoo)**

SCRFD (Sample and Computation Redistribution for Efficient Face Detection) achieves the best speed-accuracy tradeoff for face detection on GPU. The 500MF variant is optimized for 640px inputs and provides 5-point facial keypoints (eyes, nose, mouth corners) critical for face alignment prior to recognition.

| Attribute | Specification |
|-----------|--------------|
| Model | SCRFD_500M_BNKPS (ONNX) |
| Source | deepinsight/insightface model zoo |
| Input Resolution | 640 x 640 |
| FLOPs | 500 MFLOPs |
| Parameters | ~1.5 M |
| WIDERFACE AP (Easy) | 0.906 |
| WIDERFACE AP (Medium) | 0.870 |
| WIDERFACE AP (Hard) | 0.720 |
| Keypoint Output | 5 facial landmarks (eyes x2, nose, mouth corners x2) |

**Alternative (CPU/Edge):** `YuNet` (OpenCV DNN, ~1ms CPU, AP_Easy 0.884, AP_Medium 0.866)
**Alternative (Maximum accuracy):** `RetinaFace-R50` (higher AP but 5-8x slower)

### 2.2 Inference Configuration

```yaml
# scrfd_face_detection.yaml
model:
  onnx_file: "scrfd_500m_bnkps.onnx"
  input_size: 640
  confidence_threshold: 0.45          # Face detection minimum confidence
  nms_threshold: 0.4
  top_k: 100                           # Max faces per frame
  min_face_size: 20                    # Minimum face pixel height (20px)
  scale_factor: [8, 16, 32]           # Feature pyramid strides

# Face quality scoring
quality:
  blur_threshold: 50.0                 # Laplacian variance threshold
  pose_max_yaw: 45.0                   # Degrees - reject profile faces
  pose_max_pitch: 30.0                 # Degrees
  min_face_width: 20                   # Pixels - ignore tiny faces
  max_face_width: 300                  # Pixels - ignore giant close-ups
```

### 2.3 Face Quality Assessment

Each detected face is scored on multiple dimensions before proceeding to recognition:

| Quality Metric | Method | Threshold | Rejection Rate |
|---------------|--------|-----------|----------------|
| **Sharpness/Blur** | Laplacian variance | Var > 50 | ~15% of detections |
| **Face Size** | Bounding box height | > 20px, < 300px | ~10% of detections |
| **Head Pose** | 5-point landmark geometry | Yaw < 45, Pitch < 30 | ~20% of detections |
| **Face Confidence** | SCRFD detection score | > 0.45 | ~5% of detections |
| **Illumination** | Mean face ROI intensity | 40 < mean < 240 | ~5% of detections |

Only faces passing all quality gates proceed to the recognition module.

### 2.4 Face Alignment

Using the 5-point landmarks from SCRFD, each face is aligned to a canonical pose using a similarity transform:

```python
# Alignment target landmarks (112x112 template)
TARGET_LANDMARKS = np.array([
    [38.2946, 51.6963],   # Left eye
    [73.5318, 51.5014],   # Right eye
    [56.0252, 71.7366],   # Nose
    [41.5493, 92.3655],   # Left mouth
    [70.7299, 92.2041],   # Right mouth
], dtype=np.float32)

# Apply similarity transform (scale, rotation, translation)
# Output: aligned 112x112 face crop ready for ArcFace
```

### 2.5 Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Latency (single face) | < 3ms @ T4 TensorRT FP16 | SCRFD-500M |
| Latency (batch 32 faces) | < 12ms | Batch processing |
| WIDERFACE AP (Hard) | > 0.70 | Challenging angles, lighting |
| Min detectable face size | 20x20 pixels | ~10m distance at 1080p |
| 5-landmark accuracy | < 3px NME | Normalized mean error |

---

## 3. Face Recognition Module

### 3.1 Model Selection: ArcFace R100 (MS1MV3)

**Primary Choice: `ArcFace R100` (IResNet100, InsightFace)**

ArcFace with Additive Angular Margin Loss is the industry-standard face recognition model. The ResNet100 (IR-SE100) backbone trained on MS1MV3 (MS1M-V3) provides state-of-the-art accuracy with 512-dimensional embeddings.

| Attribute | Specification |
|-----------|--------------|
| Model | ArcFace with IResNet100 (IR-SE100) backbone |
| Loss Function | Additive Angular Margin (ArcFace) |
| Training Data | MS1MV3 (~5.1M images, 93K identities) |
| Input Size | 3 x 112 x 112 (aligned face crop) |
| Embedding Dimension | 512 (float32) |
| Parameters | ~65 M |
| LFW Accuracy | 99.83% |
| CFP-FP Accuracy | 98.27% |
| AgeDB-30 Accuracy | 98.28% |
| IJB-C (TPR@FPR=1e-4) | 96.1% |

**Alternative (speed-focused):** `ArcFace R50` (IR-SE50, 25M params, LFW 99.80%, ~2x faster)
**Alternative (edge/mobile):** `MobileFaceNet` (4M params, 128-D embedding, LFW 99.28%)

### 3.2 Embedding Extraction Pipeline

```python
# Pipeline per detected face:
# 1. Crop face from original frame using SCRFD bounding box
# 2. Align using 5-point landmarks -> 112x112 normalized crop
# 3. Normalize pixel values: (pixel - 127.5) / 128.0
# 4. Forward pass through ArcFace R100
# 5. L2-normalize the 512-D embedding vector
# 6. Store embedding + metadata for matching

# Output: 512-D unit vector representing facial identity
```

### 3.3 Similarity Computation & Matching

| Parameter | Value | Description |
|-----------|-------|-------------|
| Similarity Metric | Cosine Similarity | dot(u, v) / (||u|| * ||v||) |
| Embedding Dim | 512 | Float32 per vector = 2KB storage |
| Distance Metric | 1 - Cosine Similarity | Range [0.0, 2.0] |
| Top-K Query | K=5 | Return top 5 candidates |
| **Strict Match Threshold** | 0.58 (cosine) / 0.42 (distance) | High confidence ID |
| **Balanced Match Threshold** | 0.50 (cosine) / 0.50 (distance) | Standard confidence |
| **Relaxed Match Threshold** | 0.42 (cosine) / 0.58 (distance) | Maximum recall |

### 3.4 Face Database Structure

```python
# Known Person Database (Milvus/FAISS vector store)
known_persons_db = {
    "person_id": "uuid-string",          # Unique person identifier
    "name": "John Doe",                  # Display name (optional)
    "employee_id": "EMP001",             # External reference
    "embeddings": [                      # Multiple reference embeddings
        {
            "vector": np.array(512),       # L2-normalized embedding
            "source_camera": "CAM_01",
            "timestamp": "2025-07-01T10:00:00Z",
            "face_quality": 0.92,
            "pose_yaw": 5.2,               # Head pose at capture
        }
    ],
    "created_at": "2025-07-01T00:00:00Z",
    "updated_at": "2025-07-20T15:30:00Z",
    "enrollment_count": 3,               # Number of reference photos
}
```

### 3.5 Top-K Matching Strategy

```python
def match_face(embedding: np.ndarray, db: VectorStore, k: int = 5) -> MatchResult:
    """
    1. Query vector DB for top-K nearest neighbors (cosine similarity)
    2. Compute similarity scores for all K candidates
    3. Apply threshold-based classification:
       - Highest score >= strict_threshold   -> CONFIDENT_MATCH
       - Highest score >= balanced_threshold -> PROBABLE_MATCH
       - Highest score >= relaxed_threshold  -> POSSIBLE_MATCH
       - All scores < relaxed_threshold       -> UNKNOWN
    4. For CONFIDENT_MATCH: return person_id with confidence
    5. For UNKNOWN: route to clustering module for unknown identity grouping
    """
```

### 3.6 Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Latency (single face) | < 8ms @ T4 TensorRT FP16 | ArcFace R100 at 112x112 |
| Latency (batch 32 faces) | < 25ms | Batch processing |
| LFW Verification | > 99.8% | Standard benchmark |
| CFP-FP (frontal-profile) | > 98.0% | Pose variation robustness |
| False Acceptance Rate | < 0.1% @ 99% TPR | For access control scenarios |
| Embedding Throughput | > 4,000 faces/sec | GPU batch inference |

---

## 4. Person Tracking Module

### 4.1 Model Selection: ByteTrack

**Primary Choice: `ByteTrack` (Peize Sun et al., ByteDance)**

ByteTrack achieves the best accuracy-speed tradeoff for surveillance tracking. Its dual-threshold association mechanism recovers objects from low-confidence detections, dramatically reducing ID switches during occlusions — a critical requirement for industrial environments with shelving, machinery, and partial obstructions.

| Attribute | Specification |
|-----------|--------------|
| Algorithm | ByteTrack (BYTE association) |
| Motion Model | Kalman Filter (constant velocity) |
| Similarity Metric | IoU (first association), IoU (second association) |
| Detection Threshold (high) | 0.6 |
| Detection Threshold (low) | 0.1 |
| Track Buffer (lost frames) | 30 frames (~1 sec @ 30 FPS) |
| IoU Match Threshold | 0.2 (reject matches below) |
| FPS (V100) | 30 FPS (detection + tracking) |
| MOTA (MOT17) | 80.3% |
| IDF1 (MOT17) | 77.3% |
| HOTA (MOT17) | 63.1% |

**Alternative (accuracy-focused):** `BoT-SORT` (+1% MOTA, +MOTP, includes Camera Motion Compensation, ~35 FPS)
**Alternative (edge/CPU):** `OC-SORT` (hundreds of FPS on CPU, handles non-linear motion)

### 4.2 Tracking Pipeline Configuration

```yaml
# bytetrack_config.yaml
bytetrack:
  track_thresh: 0.6              # High-confidence detection threshold
  track_buffer: 30               # Max frames to keep lost tracks alive
  match_thresh: 0.8              # IoU matching threshold (first stage)
  det_thresh_low: 0.1            # Low-confidence threshold for second association
  iou_thresh_reject: 0.2         # Minimum IoU to accept a match
  min_box_area: 100              # Ignore detections smaller than 10x10 px
  aspect_ratio_thresh: 10.0      # Reject extreme aspect ratios
  mot20: false                   # Standard density mode
```

### 4.3 Track ID Management

```python
class TrackManager:
    """Manages track lifecycle across all camera streams."""

    def __init__(self):
        self.next_track_id = 0          # Monotonically increasing
        self.active_tracks = {}         # track_id -> TrackState
        self.lost_tracks = {}           # Recently lost, may recover
        self.archived_tracks = {}       # Finalized trajectories

    def create_track(self, detection, camera_id):
        """Initialize new track from high-confidence detection."""
        track_id = self.next_track_id
        self.next_track_id += 1
        # Initialize Kalman filter state
        # Store: bbox, confidence, camera_id, first_seen, last_seen
        return track_id

    def update_track(self, track_id, detection):
        """Update existing track with matched detection."""
        # Update Kalman filter
        # Update last_seen timestamp
        # Increment hit count

    def mark_lost(self, track_id):
        """Track not matched in current frame."""
        # Increment lost count
        # If lost > track_buffer, archive track

    def get_track_summary(self, track_id) -> dict:
        """Return track metadata: duration, camera span, entry/exit zones."""
```

### 4.4 Cross-Camera Track Association

For multi-camera scenarios (8 channels), a secondary association layer links tracks across cameras using:

1. **Temporal proximity** — tracks appearing on different cameras within a time window
2. **Appearance features** — ArcFace embedding similarity for re-identification
3. **Zone transition rules** — predefined camera adjacency graph (CAM_01 -> CAM_02)

```python
def associate_cross_camera(track_cam_a, track_cam_b, max_time_gap=60):
    """
    Associate tracks across cameras using:
    - Time gap between track end (A) and track start (B) < max_time_gap seconds
    - Embedding cosine similarity > 0.65 (relaxed threshold for ReID)
    - Camera adjacency is valid in zone graph
    """
```

### 4.5 Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| MOTA | > 75% | Multi-object tracking accuracy |
| IDF1 | > 70% | Identity preservation across frames |
| ID Switches | < 2 per 100 frames | Per camera stream |
| Fragmentation | < 3 per track | Track splits per person per session |
| Track Recovery | > 80% within 1 sec | Re-acquire after brief occlusion |
| Latency overhead | < 1ms per frame | Tracking association cost |

---

## 5. Unknown Person Clustering Module

### 5.1 Model Selection: HDBSCAN + Chinese Whisper Ensemble

**Primary Choice: `HDBSCAN`** (Hierarchical Density-Based Spatial Clustering)

For unknown face embedding clustering, HDBSCAN outperforms DBSCAN by not requiring a global density parameter (`eps`) and naturally handling variable-density clusters — critical for surveillance where some individuals appear frequently and others only once.

| Attribute | Specification |
|-----------|--------------|
| Clustering Algorithm | HDBSCAN (primary) + DBSCAN (fallback) |
| Embedding Input | 512-D L2-normalized ArcFace embeddings |
| Distance Metric | Cosine distance (1 - cosine similarity) |
| Min Cluster Size | 3 | Minimum embeddings to form a cluster |
| Min Samples | 2 | Core point neighborhood parameter |
| Cluster Selection Method | eom (Excess of Mass) |
| Allow Single Cluster | True |

### 5.2 Clustering Pipeline

```python
class UnknownPersonClustering:
    """Clusters unknown person embeddings to identify recurring visitors."""

    def __init__(self):
        self.clusters = {}              # cluster_id -> ClusterProfile
        self.noise_embeddings = []      # Unclustered (single-appearance)
        self.merge_candidates = []      # Pairs flagged for merge review
        self.dbscan_eps = 0.28          # Fallback DBSCAN parameter
        self.dbscan_min_samples = 2

    def add_embedding(self, embedding: np.ndarray, metadata: dict) -> str:
        """
        1. Try HDBSCAN fit_predict on accumulated embeddings
        2. If HDBSCAN fails (all noise), fall back to DBSCAN
        3. Assign embedding to cluster or mark as noise (-1)
        4. If cluster assignment: update cluster centroid and metadata
        5. Check for cluster merge opportunities
        6. Return: cluster_id or "noise"
        """

    def merge_clusters(self, cluster_a: str, cluster_b: str) -> str:
        """
        Merge two clusters that belong to the same person.
        Trigger: centroid distance < 0.25 (cosine distance)
                 OR temporal overlap analysis
                 OR manual operator confirmation
        """

    def get_recurring_unknowns(self, min_appearances: int = 3) -> list:
        """Return unknown persons seen at least N times (potential enrollment candidates)."""

    def compute_cluster_centroid(self, cluster_id: str) -> np.ndarray:
        """L2-normalized mean of all embeddings in cluster."""
```

### 5.3 Cluster Data Structure

```python
@dataclass
class ClusterProfile:
    cluster_id: str                     # UUID
    centroid: np.ndarray                # 512-D mean embedding (L2-normalized)
    embeddings: List[np.ndarray]        # All member embeddings
    metadata: List[dict]                # Source info per embedding
    first_seen: datetime
    last_seen: datetime
    appearance_count: int               # Total embeddings in cluster
    camera_span: Set[str]               # Which cameras observed this person
    quality_score: float                # Average face quality (0-1)
    best_face_crop: str                 # Path to highest quality crop
    is_named: bool = False              # Flag when promoted to known person
    person_name: Optional[str] = None   # Assigned name (if promoted)
```

### 5.4 Merge Logic & Cluster Maintenance

| Trigger | Action | Threshold |
|---------|--------|-----------|
| Centroid distance | Auto-merge clusters | cosine distance < 0.20 |
| Centroid distance | Flag for review | cosine distance 0.20-0.30 |
| Temporal overlap | Prevent merge | Same time on different cameras |
| Cluster size | Auto-archive | > 100 embeddings, compress to centroid |
| Age | Archive old clusters | No activity for 90 days |

### 5.5 Three-Tier Identity Classification

```
┌─────────────────────────────────────────────────────────────┐
│                    IDENTITY CLASSIFICATION                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────┐    cosine >= 0.58                 │
│  │   KNOWN PERSON      │◄──────────────────────────────┐    │
│  │   (Database Match)  │                                  │
│  └─────────────────────┘                                  │
│           ▲                                               │
│           │                                               │
│  ┌────────┴────────────┐    0.35 <= cosine < 0.58        │
│  │  UNKNOWN RECURRING  │◄──────────────────────────────┐   │
│  │  (Cluster Match)    │                                  │
│  └─────────────────────┘                                  │
│           ▲                                               │
│           │                                               │
│  ┌────────┴────────────┐    cosine < 0.35                │
│  │   NEW UNKNOWN       │◄──────────────────────────────┐  │
│  │   (Noise / New)     │                                 │
│  └─────────────────────┘                                 │
│                                                             │
│  ┌─────────────────────┐                                  │
│  │   REVIEW QUEUE      │◄── Low quality / Low confidence  │
│  │  (Operator Review)  │                                  │
│  └─────────────────────┘                                  │
└─────────────────────────────────────────────────────────────┘
```

### 5.6 Clustering Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Cluster Purity | > 89% | Same person in same cluster (HDBSCAN benchmark) |
| BCubed F-Measure | > 0.85 | Precision-recall balanced clustering |
| Clustering Latency | < 100ms | Per batch of 50 new embeddings |
| False Merge Rate | < 5% | Different people in same cluster |
| Memory per cluster | ~4 KB | Centroid + metadata |

---

## 6. Evidence Capture Module

### 6.1 Capture Triggers

Evidence is captured (face crop + metadata saved) on the following events:

| Event Type | Trigger Condition | Priority |
|-----------|-------------------|----------|
| `KNOWN_PERSON_DETECTED` | Face match confidence >= 0.50 | Medium |
| `UNKNOWN_PERSON_DETECTED` | New cluster formed, 3rd appearance | High |
| `REVIEW_NEEDED` | Low confidence match OR low quality face | High |
| `ZONE_VIOLATION` | Person enters restricted zone | Critical |
| `TAILGATING` | Two persons detected on single credential swipe | Critical |
| `AFTER_HOURS` | Person detected outside authorized hours | High |
| `SUSPICIOUS_BEHAVIOR` | Loitering (>5 min in same area) | Medium |

### 6.2 Evidence Record Structure

```python
@dataclass
class EvidenceRecord:
    # Unique identifiers
    evidence_id: str                    # UUID v4
    event_id: str                       # Links to event log
    camera_id: str                      # CAM_01 .. CAM_08
    stream_id: str                      # DVR channel identifier

    # Temporal
    timestamp_utc: datetime
    timestamp_local: datetime
    frame_number: int
    video_segment: str                  # Path to 10-sec video clip

    # Person identity
    identity_type: str                  # "known" | "unknown_recurring" | "unknown_new" | "review"
    person_id: Optional[str]            # Track ID or cluster ID
    person_name: Optional[str]          # Known person name
    match_confidence: float             # Face recognition confidence (0-1)

    # Face crop
    face_crop_path: str                 # /evidence/faces/2025/07/24/{id}.jpg
    face_crop_dimensions: tuple         # (w, h) of crop
    face_quality_score: float           # Combined quality metric
    face_landmarks: np.ndarray          # 5-point landmarks
    head_pose: dict                     # {yaw, pitch, roll}

    # Full frame reference
    full_frame_path: str                # /evidence/frames/2025/07/24/{id}.jpg
    bounding_box: tuple                 # (x1, y1, x2, y2) in original frame

    # AI confidence levels
    detection_confidence: float         # YOLO person detection confidence
    face_detection_confidence: float    # SCRFD face detection confidence
    recognition_confidence: float       # ArcFace match confidence

    # Vibe settings at capture time
    detection_sensitivity: str          # "low" | "balanced" | "high"
    face_match_strictness: str          # "relaxed" | "balanced" | "strict"

    # Review state
    review_status: str                  # "pending" | "reviewed" | "confirmed" | "false_positive"
    reviewed_by: Optional[str]
    review_notes: Optional[str]
```

### 6.3 Deduplication Strategy

To avoid storing duplicate evidence of the same person within short time windows:

```python
class EvidenceDeduplicator:
    """Prevents duplicate evidence capture using time-based gating."""

    DEDUP_WINDOW_KNOWN = 300        # 5 minutes between captures of same known person
    DEDUP_WINDOW_UNKNOWN = 60       # 1 minute between captures of same unknown person
    DEDUP_WINDOW_EVENT = 10         # 10 seconds between same event type

    def should_capture(self, person_id: str, event_type: str,
                       camera_id: str, timestamp: datetime) -> bool:
        """
        1. Check last capture time for this person_id + camera_id
        2. If within dedup window: skip capture, increment visit counter
        3. If outside window: allow capture, update last capture time
        4. Special: always capture if event_type is CRITICAL priority
        """
```

### 6.4 Storage Layout

```
/evidence/
  faces/
    2025/07/24/
      {evidence_id}_{camera_id}_{person_id}_face.jpg      # 112x112 aligned crop
      {evidence_id}_{camera_id}_{person_id}_full.jpg       # Full bounding box crop
  frames/
    2025/07/24/
      {evidence_id}_{camera_id}_frame.jpg                  # Full frame with annotation overlay
  video_clips/
    2025/07/24/
      {evidence_id}_{camera_id}_{timestamp}.mp4            # 10-second H.264 clip
  metadata/
    2025/07/24/
      {evidence_id}.json                                   # Full EvidenceRecord as JSON
```

### 6.5 Storage Requirements Estimate

| Content Type | Size Each | Daily (8 cams) | Monthly |
|-------------|-----------|----------------|---------|
| Face crop (112x112 JPEG) | ~8 KB | ~50 MB | ~1.5 GB |
| Full crop (200x300 JPEG) | ~25 KB | ~150 MB | ~4.5 GB |
| Frame snapshot (960x1080 JPEG) | ~150 KB | ~900 MB | ~27 GB |
| 10-sec video clip (H.264) | ~500 KB | ~3 GB | ~90 GB |
| Metadata JSON | ~2 KB | ~12 MB | ~360 MB |
| **Total (all media)** | — | **~4.1 GB** | **~123 GB** |

**Recommended:** Store face crops + metadata for all events. Full frames and video clips only for priority events (review_needed, zone_violation, after_hours).

---

## 7. Confidence Handling & Thresholds

### 7.1 Confidence Level Definitions

| Level | Aggregate Score | Color | Action |
|-------|----------------|-------|--------|
| **HIGH** | >= 0.75 | Green | Auto-process, no review needed |
| **MEDIUM** | 0.50 - 0.75 | Yellow | Process with confidence label, flag for spot-check |
| **LOW** | 0.35 - 0.50 | Orange | Capture evidence, mark for review |
| **REVIEW_NEEDED** | < 0.35 | Red | Always queue for operator review |

### 7.2 Aggregate Confidence Score

The aggregate confidence is computed as a weighted combination:

```python
def compute_aggregate_confidence(det_conf: float, face_conf: float,
                                  match_conf: float, quality_score: float) -> float:
    """
    Aggregate = 0.25 * det_conf + 0.20 * face_conf + 0.35 * match_conf + 0.20 * quality_score

    Where:
    - det_conf:     YOLO person detection confidence (0-1)
    - face_conf:    SCRFD face detection confidence (0-1)
    - match_conf:   ArcFace recognition match confidence (0-1), 0.0 for unknowns
    - quality_score: Face quality composite score (0-1)
    """
```

### 7.3 AI Vibe Settings Mapping

The system exposes three "vibe" settings that internally map to threshold configurations:

**Detection Sensitivity (applies to YOLO + SCRFD):**

| Setting | YOLO Conf Threshold | SCRFD Conf Threshold | Effect |
|---------|-------------------|---------------------|--------|
| **Low** | 0.50 | 0.55 | Fewer detections, lower false positive rate |
| **Balanced** | 0.35 | 0.45 | Standard detection rate |
| **High** | 0.20 | 0.35 | Maximum detection, higher false positive rate |

**Face Match Strictness (applies to ArcFace matching):**

| Setting | Strict Threshold | Balanced Threshold | Relaxed Threshold | Effect |
|---------|-----------------|-------------------|-------------------|--------|
| **Relaxed** | 0.50 | 0.42 | 0.35 | High recall, more false matches |
| **Balanced** | 0.58 | 0.50 | 0.42 | Balanced precision-recall |
| **Strict** | 0.65 | 0.58 | 0.50 | High precision, stricter matching |

### 7.4 Vibe Configuration Matrix

```yaml
# vibe_presets.yaml
vibe_presets:
  access_control:                    # High security area
    detection_sensitivity: "balanced"
    face_match_strictness: "strict"

  general_surveillance:              # Standard monitoring
    detection_sensitivity: "balanced"
    face_match_strictness: "balanced"

  perimeter_monitoring:              # Catching all activity
    detection_sensitivity: "high"
    face_match_strictness: "relaxed"

  after_hours:                       # Night mode
    detection_sensitivity: "high"
    face_match_strictness: "balanced"

  privacy_mode:                      # Minimal detection
    detection_sensitivity: "low"
    face_match_strictness: "strict"
```

### 7.5 Threshold Auto-Tuning Strategy

```python
class ThresholdTuner:
    """Periodically adjusts thresholds based on operational feedback."""

    def analyze_feedback(self, review_results: list):
        """
        1. Collect operator review labels on REVIEW_NEEDED items
        2. Track false positive rate and false negative rate
        3. If FP rate > 10%: increase confidence thresholds by 5%
        4. If FN rate > 10%: decrease confidence thresholds by 5%
        5. Only adjust within +/- 15% of baseline values
        6. Log all threshold changes with rationale
        """

    def weekly_report(self) -> dict:
        """Generate confidence distribution and threshold effectiveness report."""
```

---

## 8. Inference Pipeline Architecture

### 8.1 Per-Stream Processing Pipeline

```
┌─────────────────────────────────────────────────────────────────┐
│                    PER-STREAM PIPELINE                           │
│                    (Executed per camera frame)                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐       │
│  │  RTSP    │    │   Frame      │    │   Frame Queue    │       │
│  │  Stream  │───▶│   Decode     │───▶│   (ring buffer)  │       │
│  │  (H.264) │    │   (960x1080) │    │   max 30 frames  │       │
│  └──────────┘    └──────────────┘    └──────────────────┘       │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 1: HUMAN DETECTION (YOLO11m TensorRT FP16)          │   │
│  │  Input: 640x640 batch tensor                               │   │
│  │  Output: person bboxes [N x 6] (x1,y1,x2,y2,conf,cls)    │   │
│  │  Latency: ~4.7ms per frame (T4)                           │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 2: FACE DETECTION (SCRFD-500M TensorRT FP16)        │   │
│  │  Input: Cropped person regions from Step 1                 │   │
│  │  Output: face bboxes + 5 landmarks per face                │   │
│  │  Latency: ~2.5ms per face (T4)                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 3: FACE ALIGNMENT & QUALITY CHECK                   │   │
│  │  Input: Face crop + 5 landmarks                            │   │
│  │  Process: Similarity transform -> 112x112 aligned crop     │   │
│  │  Quality: Blur, pose, illumination checks                  │   │
│  │  Latency: ~0.3ms (OpenCV CPU)                             │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 4: FACE RECOGNITION (ArcFace R100 TensorRT FP16)    │   │
│  │  Input: 112x112 aligned face crop (batch)                  │   │
│  │  Output: 512-D L2-normalized embedding                     │   │
│  │  Latency: ~6ms per face (T4, batch=8)                     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 5: IDENTITY MATCHING (FAISS/Milvus vector search)   │   │
│  │  Input: 512-D embedding                                    │   │
│  │  Output: Top-K matches with similarity scores              │   │
│  │  Latency: < 5ms (in-memory, <10K identities)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 6: PERSON TRACKING (ByteTrack)                      │   │
│  │  Input: Person detections + face embeddings               │   │
│  │  Output: Persistent track IDs with identity labels         │   │
│  │  Latency: ~1ms per frame                                  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 7: UNKNOWN CLUSTERING (HDBSCAN)                     │   │
│  │  Input: Embeddings of unmatched faces                     │   │
│  │  Output: Cluster assignments for recurring unknowns        │   │
│  │  Latency: ~50ms (batch update, every 30 sec)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 8: EVIDENCE CAPTURE & EVENT GENERATION              │   │
│  │  Input: Track results + identity + confidence             │   │
│  │  Output: Evidence records, event log entries, alerts       │   │
│  │  Latency: ~5ms (async I/O)                                │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  OUTPUT: Structured event stream to central system        │   │
│  │  { track_id, identity, confidence, bbox, timestamp,       │   │
│  │    camera_id, event_type, evidence_refs }                  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

### 8.2 Multi-Stream Orchestration

```python
class MultiStreamPipeline:
    """Orchestrates inference across 8 simultaneous camera streams."""

    def __init__(self, config: PipelineConfig):
        # 4 inference workers (each processes 2 streams)
        self.workers = [InferenceWorker(gpu_id=i % 2) for i in range(4)]

        # Stream assignments: worker -> [stream_ids]
        self.stream_map = {
            0: ["CAM_01", "CAM_02"],
            1: ["CAM_03", "CAM_04"],
            2: ["CAM_05", "CAM_06"],
            3: ["CAM_07", "CAM_08"],
        }

        # Shared components (thread-safe)
        self.tracker_pool = {cam: ByteTrack(config.track) for cam in ALL_CAMERAS}
        self.face_db = VectorDatabase(config.db)          # Milvus/FAISS
        self.clustering = UnknownPersonClustering(config.cluster)
        self.evidence = EvidenceCaptureManager(config.evidence)

    def process_frame(self, camera_id: str, frame: np.ndarray, timestamp: datetime):
        """Process a single frame through the complete pipeline."""
        # STEP 1: Human Detection
        person_dets = self.yolo_detector.detect(frame)

        # STEP 2: Face Detection (within person regions)
        face_dets = []
        for det in person_dets:
            person_crop = crop_region(frame, det.bbox)
            faces = self.face_detector.detect(person_crop)
            face_dets.extend(faces)

        # STEP 3: Face Alignment + Quality
        aligned_faces = []
        for face in face_dets:
            aligned = align_face(frame, face.landmarks)
            quality = self.quality_checker.score(aligned)
            if quality.passed:
                aligned_faces.append((aligned, quality.score, face))

        # STEP 4: Face Recognition (batch)
        if aligned_faces:
            embeddings = self.face_recognizer.embed(
                [f[0] for f in aligned_faces]
            )

            # STEP 5: Identity Matching
            for emb, (aligned, quality, face) in zip(embeddings, aligned_faces):
                matches = self.face_db.search(emb, top_k=5)
                identity = self.classify_identity(emb, matches)
                face.identity = identity

        # STEP 6: Person Tracking
        tracks = self.tracker_pool[camera_id].update(person_dets)

        # STEP 7: Associate face identities with person tracks
        self.associate_faces_with_tracks(tracks, face_dets)

        # STEP 8: Unknown clustering (periodic batch)
        self.clustering.update_periodic()

        # STEP 9: Evidence capture
        self.evidence.capture_events(tracks, camera_id, timestamp)

        return tracks
```

### 8.3 Batch Processing Strategy

For GPU efficiency, frames are processed in batched groups:

| Batch Type | Batch Size | Frequency | GPU Utilization |
|-----------|-----------|-----------|----------------|
| Human Detection | 8 frames | Every frame decode | ~85% |
| Face Detection | Variable (up to 32 faces) | Per 2 frames | ~60% |
| Face Recognition | Up to 32 faces | Per 2 frames | ~75% |
| Tracking | Per stream | Every frame | CPU-bound |

### 8.4 GPU Utilization Strategy

```
GPU 0 (Primary - T4 / A10):
  ├─ Stream 0-1: YOLO11m detection
  ├─ Stream 0-1: SCRFD face detection
  ├─ Stream 0-1: ArcFace R100 recognition
  └─ TensorRT Context 0: All models (shared)

GPU 1 (Optional - V100 / A100 for scale):
  ├─ Stream 2-3: Same pipeline
  └─ TensorRT Context 1: Dedicated context

CPU (x86_64):
  ├─ Stream decode (FFmpeg, 8 threads)
  ├─ ByteTrack association (all streams)
  ├─ Face alignment + quality (OpenCV)
  ├─ HDBSCAN clustering (background thread)
  ├─ Evidence I/O (async thread pool)
  └─ API server (FastAPI, 4 workers)
```

### 8.5 Performance Budget (Per 8-Stream System)

| Pipeline Stage | Per-Frame Cost | 8-Stream Aggregate | GPU % |
|---------------|---------------|-------------------|-------|
| Frame decode | ~2ms | 16ms (parallel) | — |
| YOLO11m detection | ~4.7ms | ~37.6ms (batched) | 35% |
| SCRFD face detection | ~2.5ms avg | ~20ms (batched) | 20% |
| Face alignment + quality | ~0.3ms | ~2.4ms (CPU) | — |
| ArcFace R100 recognition | ~6ms avg | ~48ms (batched) | 45% |
| ByteTrack tracking | ~1ms | ~8ms (CPU) | — |
| Vector search | ~1ms | ~8ms (CPU) | — |
| Evidence capture | ~2ms | ~16ms (async I/O) | — |
| **Total effective** | — | **~30-35ms end-to-end** | — |
| **Effective throughput** | — | **~28 FPS per stream** | **100%** |

**Target: 15-20 FPS processing per stream at 960x1080 with batching optimizations.**

---

## 9. Model Selection Summary Table

| Component | Model Choice | Framework | Input Size | FPS Target (T4) | Accuracy Metric |
|-----------|-------------|-----------|------------|-----------------|-----------------|
| **Human Detection** | **YOLO11m** (Ultralytics) | TensorRT FP16 | 640 x 640 | **213 FPS** (batch=8) | 51.5% mAP@50-95 COCO; ~78% person AP |
| **Face Detection** | **SCRFD-500M-BNKPS** (InsightFace) | TensorRT FP16 | 640 x 640 | **~400 FPS** (batch=32) | 90.6% AP-Easy, 87.0% AP-Med, 72.0% AP-Hard (WIDERFACE) |
| **Face Recognition** | **ArcFace R100** IR-SE100 (InsightFace, MS1MV3) | TensorRT FP16 | 112 x 112 | **~170 FPS** (batch=32) | 99.83% LFW, 98.27% CFP-FP, 96.1% IJB-C@1e-4 |
| **Person Tracking** | **ByteTrack** (BYTE association, Kalman filter) | NumPy/OpenCV | — | **>500 FPS** (association only) | 80.3% MOTA, 77.3% IDF1, 63.1% HOTA (MOT17) |
| **Unknown Clustering** | **HDBSCAN** (hdbscan library) + DBSCAN fallback | scikit-learn/hdbscan | 512-D embeddings | **<100ms per batch** | 89.5% cluster purity, BCubed F > 0.85 |
| **Vector Search** | **FAISS** (IndexFlatIP) or **Milvus** | FAISS/Milvus | 512-D vectors | **<5ms per query** | Exact nearest neighbor (cosine) |

---

## 10. Technology Stack

### 10.1 Deep Learning Framework

| Layer | Technology | Version | Purpose |
|-------|-----------|---------|---------|
| Training | PyTorch | 2.2+ | Model fine-tuning, research |
| Export | ONNX | 1.15+ | Model portability |
| GPU Inference | TensorRT | 8.6+ / 10.0+ | Production inference optimization |
| CPU Inference | ONNX Runtime | 1.16+ | CPU fallback for edge |
| CPU (Intel) | OpenVINO | 2024.0+ | Intel-optimized inference |

### 10.2 Model Serving Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Docker Container: ai-vision-pipeline                      │  │
│  │  Base: nvidia/cuda:12.1-runtime-ubuntu22.04               │  │
│  │                                                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │  │
│  │  │  TensorRT   │  │  OpenCV     │  │  FastAPI        │   │  │
│  │  │  Engine     │  │  4.9+       │  │  Server         │   │  │
│  │  │  (TRT 10)   │  │  (CUDA)     │  │  (uvicorn)      │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────┘   │  │
│  │                                                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │  │
│  │  │  FAISS      │  │  hdbscan    │  │  Kafka / Redis  │   │  │
│  │  │  (vectors)  │  │  (cluster)  │  │  (event bus)    │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────┘   │  │
│  │                                                             │  │
│  │  ┌──────────────────────────────────────────────────────┐  │  │
│  │  │  Pipeline Orchestrator (Python asyncio)              │  │  │
│  │  │  - Stream reader threads (8x FFmpeg)                 │  │  │
│  │  │  - GPU inference queue                                 │  │  │
│  │  │  - CPU post-processing workers                         │  │  │
│  │  │  - Evidence async writer                               │  │  │
│  │  └──────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Docker Container: ai-vision-api                           │  │
│  │  - REST API for configuration                              │  │
│  │  - WebSocket for real-time events                          │  │
│  │  - Database: PostgreSQL + pgvector                         │  │
│  │  - Object storage: MinIO (evidence media)                  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

### 10.3 GPU Requirements

| Deployment Mode | Minimum GPU | Recommended GPU | Notes |
|----------------|-------------|-----------------|-------|
| Edge Gateway | NVIDIA Jetson Orin Nano 8GB | Jetson Orin NX 16GB | INT8 quantization, 5-8 FPS per stream |
| Edge Server | NVIDIA T4 16GB | NVIDIA A10 24GB | FP16, full 8-stream real-time |
| Cloud Processing | NVIDIA T4 16GB | NVIDIA V100 32GB | FP16, 8+ streams, batching |
| Development | NVIDIA RTX 3080 10GB | NVIDIA RTX 4090 24GB | Full pipeline debugging |

### 10.4 CPU Fallback Options

When GPU is unavailable, the pipeline falls back to CPU-optimized models:

| Component | GPU Model | CPU Fallback | CPU Latency |
|-----------|-----------|-------------|-------------|
| Human Detection | YOLO11m TensorRT | YOLO11n ONNX + OpenVINO | ~56ms/frame |
| Face Detection | SCRFD TensorRT | YuNet OpenCV DNN | ~3ms/frame |
| Face Recognition | ArcFace R100 TensorRT | ArcFace MobileFaceNet ONNX | ~15ms/face |
| Tracking | ByteTrack (CPU) | ByteTrack (CPU) | ~2ms/frame |

**Note:** CPU fallback processes at ~5-8 FPS per stream. For full 8-stream real-time, GPU acceleration is required.

### 10.5 Docker Compose Configuration

```yaml
# docker-compose.yml
version: '3.8'

services:
  ai-vision-pipeline:
    image: surveillance/ai-vision-pipeline:1.0.0
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - PIPELINE_WORKERS=4
      - STREAM_COUNT=8
      - DETECTION_MODEL=/models/yolo11m.engine
      - FACE_MODEL=/models/scrfd_500m.engine
      - RECOGNITION_MODEL=/models/arcface_r100.engine
      - DETECTION_SENSITIVITY=balanced
      - FACE_MATCH_STRICTNESS=balanced
    volumes:
      - ./models:/models:ro
      - ./evidence:/evidence
      - ./config:/config:ro
    ports:
      - "8080:8080"        # REST API
      - "8081:8081"        # WebSocket events
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    depends_on:
      - redis
      - minio
      - postgres

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: surveillance
      POSTGRES_USER: ai_pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - miniodata:/data
    ports:
      - "9000:9000"
      - "9001:9001"

volumes:
  pgdata:
  miniodata:
```

### 10.6 Python Module Structure

```
ai_vision_pipeline/
├── pyproject.toml                    # Poetry/pip dependencies
├── Dockerfile
├── docker-compose.yml
├── config/
│   ├── pipeline.yaml                 # Main pipeline configuration
│   ├── yolo11m_detection.yaml
│   ├── scrfd_face_detection.yaml
│   ├── arcface_recognition.yaml
│   ├── bytetrack.yaml
│   ├── clustering.yaml
│   └── vibe_presets.yaml
├── models/
│   ├── yolo11m.engine                # TensorRT engine (YOLO11m)
│   ├── scrfd_500m_bnkps.engine       # TensorRT engine (SCRFD)
│   ├── arcface_r100.engine           # TensorRT engine (ArcFace)
│   └── yunet.onnx                    # CPU fallback (YuNet)
├── src/
│   ├── __init__.py
│   ├── main.py                       # Entry point
│   ├── config.py                     # Configuration loader
│   ├── pipeline/
│   │   ├── __init__.py
│   │   ├── orchestrator.py           # MultiStreamPipeline
│   │   ├── stream_reader.py          # RTSP/FFmpeg frame capture
│   │   └── frame_buffer.py           # Ring buffer management
│   ├── detection/
│   │   ├── __init__.py
│   │   ├── yolo_detector.py          # YOLO11m inference wrapper
│   │   └── detector_base.py          # Abstract detector interface
│   ├── face/
│   │   ├── __init__.py
│   │   ├── face_detector.py          # SCRFD inference wrapper
│   │   ├── face_recognizer.py        # ArcFace inference wrapper
│   │   ├── face_aligner.py           # 5-point alignment
│   │   ├── quality_checker.py        # Blur/pose/illumination
│   │   └── embedding_store.py        # Vector DB operations
│   ├── tracking/
│   │   ├── __init__.py
│   │   ├── bytetrack.py              # ByteTrack implementation
│   │   ├── kalman_filter.py          # Kalman filter
│   │   ├── track_manager.py          # Track lifecycle management
│   │   └── matching.py               # IoU / embedding matching
│   ├── clustering/
│   │   ├── __init__.py
│   │   ├── hdbscan_engine.py         # HDBSCAN wrapper
│   │   ├── cluster_manager.py        # Cluster CRUD + merge logic
│   │   └── cluster_profile.py        # Cluster data model
│   ├── evidence/
│   │   ├── __init__.py
│   │   ├── capture_manager.py        # Evidence capture orchestrator
│   │   ├── deduplicator.py           # Deduplication logic
│   │   ├── storage.py                # File system + object storage
│   │   └── metadata.py               # EvidenceRecord dataclass
│   ├── confidence/
│   │   ├── __init__.py
│   │   ├── scorer.py                 # Aggregate confidence computation
│   │   ├── threshold_manager.py      # Dynamic threshold adjustment
│   │   └── vibe_mapper.py            # Vibe settings -> thresholds
│   ├── inference/
│   │   ├── __init__.py
│   │   ├── tensorrt_wrapper.py       # Generic TensorRT inference
│   │   ├── onnx_wrapper.py           # ONNX Runtime inference
│   │   └── batch_processor.py        # Dynamic batching logic
│   ├── api/
│   │   ├── __init__.py
│   │   ├── server.py                 # FastAPI application
│   │   ├── routes/
│   │   │   ├── detection.py          # Detection config API
│   │   │   ├── faces.py              # Face database API
│   │   │   ├── tracks.py             # Track query API
│   │   │   ├── evidence.py           # Evidence retrieval API
│   │   │   └── settings.py           # Vibe settings API
│   │   └── websocket.py              # Real-time event streaming
│   └── utils/
│       ├── __init__.py
│       ├── logger.py                 # Structured logging
│       ├── metrics.py                # Prometheus metrics
│       ├── time_utils.py             # Timestamp handling
│       └── image_utils.py            # Crop, resize, encode
├── tests/
│   ├── unit/
│   ├── integration/
│   └── benchmarks/
└── scripts/
    ├── export_tensorrt.py            # Convert .pt -> .onnx -> .engine
    ├── calibrate_int8.py             # INT8 calibration with custom data
    ├── benchmark_pipeline.py         # End-to-end benchmark
    └── setup_vector_db.py            # Initialize FAISS/Milvus index
```

### 10.7 Core Inference Code Architecture

```python
# src/inference/tensorrt_wrapper.py — Generic TensorRT inference engine

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

class TensorRTInference:
    """Generic TensorRT inference wrapper supporting dynamic batch sizes."""

    def __init__(self, engine_path: str, max_batch_size: int = 32):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)

        with open(engine_path, "rb") as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()
        self.max_batch_size = max_batch_size
        self.stream = cuda.Stream()

        # Allocate GPU buffers
        self.inputs = []
        self.outputs = []
        self.bindings = []
        self._allocate_buffers()

    def _allocate_buffers(self):
        """Allocate pinned host and device memory for all I/O bindings."""
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            mode = self.engine.get_tensor_mode(name)
            shape = self.engine.get_tensor_shape(name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))

            size = trt.volume(shape) * self.max_batch_size
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))

            if mode == trt.TensorIOMode.INPUT:
                self.inputs.append({"name": name, "host": host_mem,
                                    "device": device_mem, "shape": shape, "dtype": dtype})
            else:
                self.outputs.append({"name": name, "host": host_mem,
                                     "device": device_mem, "shape": shape, "dtype": dtype})

    def infer(self, input_batch: np.ndarray) -> list[np.ndarray]:
        """Execute inference on a batched input."""
        batch_size = input_batch.shape[0]

        # Copy input to pinned memory
        np.copyto(self.inputs[0]["host"][:input_batch.size], input_batch.ravel())

        # Set dynamic batch size
        input_shape = list(self.inputs[0]["shape"])
        input_shape[0] = batch_size
        self.context.set_input_shape(self.inputs[0]["name"], input_shape)

        # Transfer H2D
        cuda.memcpy_htod_async(self.inputs[0]["device"],
                               self.inputs[0]["host"], self.stream)

        # Execute
        self.context.execute_async_v3(stream_handle=self.stream.handle)

        # Transfer D2H
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out["host"], out["device"], self.stream)

        self.stream.synchronize()

        # Reshape outputs
        results = []
        for out in self.outputs:
            out_shape = list(out["shape"])
            out_shape[0] = batch_size
            results.append(out["host"][:np.prod(out_shape)].reshape(out_shape))

        return results

    def __del__(self):
        self.stream.synchronize()
```

### 10.8 Key Dependencies

```toml
# pyproject.toml dependencies
[tool.poetry.dependencies]
python = "^3.10"
torch = "^2.2.0"
torchvision = "^2.2.0"
tensorrt = "^10.0.0"
pycuda = "^2024.1"
onnxruntime-gpu = "^1.16.0"
opencv-python = "^4.9.0"
numpy = "^1.26.0"
scipy = "^1.12.0"
scikit-learn = "^1.4.0"
hdbscan = "^0.8.33"
faiss-gpu = "^1.7.4"
pydantic = "^2.6.0"
fastapi = "^0.109.0"
uvicorn = "^0.27.0"
websockets = "^12.0"
aioredis = "^2.0.0"
asyncpg = "^0.29.0"
minio = "^7.2.0"
prometheus-client = "^0.20.0"
structlog = "^24.1.0"
python-multipart = "^0.0.9"
pillow = "^10.2.0"
```

---

## 11. Performance Summary & Benchmarks

### 11.1 Target System Performance

| Metric | Target | Notes |
|--------|--------|-------|
| **Processed FPS per stream** | 15-20 FPS | At 960x1080 input |
| **Total system throughput** | 120-160 FPS aggregate | 8 streams simultaneously |
| **End-to-end latency** | < 100ms | Frame in -> result out |
| **GPU memory** | < 10 GB | All 3 TensorRT engines loaded |
| **System RAM** | < 16 GB | Buffers + clustering + API |
| **Storage growth** | ~100 GB/month | With selective full-frame storage |
| **Concurrent API clients** | 50+ | WebSocket event subscribers |

### 11.2 Accuracy Targets on Surveillance Data

| Task | Metric | Target |
|------|--------|--------|
| Human Detection | mAP@50 (person) | > 75% |
| Human Detection | Recall@0.5IoU | > 85% |
| Face Detection | AP (medium) | > 85% |
| Face Detection | Min face size | 20x20 px |
| Face Recognition | Rank-1 accuracy (known persons) | > 98% |
| Face Recognition | False acceptance rate | < 0.1% |
| Tracking | MOTA | > 75% |
| Tracking | IDF1 | > 70% |
| Tracking | ID switches / 100 frames | < 2 |
| Clustering | Purity | > 89% |
| Clustering | BCubed F-Measure | > 0.85 |

### 11.3 Failure Modes & Mitigations

| Failure Mode | Detection | Mitigation |
|-------------|-----------|------------|
| GPU memory exhaustion | Monitor nvidia-smi | Reduce batch size, enable model streaming |
| Frame drop in decode | Monitor FFmpeg buffer | Increase ring buffer, enable HW decode |
| High false positive rate | Track review queue | Auto-increase detection threshold |
| Track fragmentation | Monitor ID switches | Tune ByteTrack track_buffer parameter |
| Cluster contamination | Monitor cluster purity | Lower DBSCAN eps, enable merge review |
| Vector DB latency growth | Query latency histogram | Switch from IndexFlat to IndexIVF |
| Disk space exhaustion | Storage capacity alert | Auto-archive evidence > 90 days |

---

## 12. Appendix A: Model Export Commands

```bash
# 1. Export YOLO11m to TensorRT
python -c "
from ultralytics import YOLO
model = YOLO('yolo11m.pt')
model.export(format='onnx', imgsz=640, opset=17, dynamic=True, simplify=True)
"
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolo11m.onnx \
  --saveEngine=yolo11m.engine \
  --fp16 \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:8x3x640x640 \
  --maxShapes=images:16x3x640x640

# 2. Export SCRFD-500M to TensorRT (via ONNX)
python scripts/export_scrfd_onnx.py \
  --config configs/scrfd_500m_bnkps.py \
  --checkpoint scrfd_500m_bnkps.pth \
  --input-img test.jpg \
  --shape 640 640 \
  --show
/usr/src/tensorrt/bin/trtexec \
  --onnx=scrfd_500m.onnx \
  --saveEngine=scrfd_500m.engine \
  --fp16

# 3. Export ArcFace R100 to TensorRT
python -c "
import onnx
from insightface.model_zoo import get_model
model = get_model('arcface_r100_v1')
model.export_onnx('arcface_r100.onnx')
"
/usr/src/tensorrt/bin/trtexec \
  --onnx=arcface_r100.onnx \
  --saveEngine=arcface_r100.engine \
  --fp16 \
  --minShapes=input.1:1x3x112x112 \
  --optShapes=input.1:32x3x112x112 \
  --maxShapes=input.1:64x3x112x112
```

## 13. Appendix B: INT8 Calibration

```python
# scripts/calibrate_int8.py
import tensorrt as trt
from src.inference.calibrator import SurveillanceCalibrator

calibrator = SurveillanceCalibrator(
    calibration_data_dir="/data/calibration/surveillance_500frames",
    cache_file="yolo11m_calibration.cache",
    input_shape=(8, 3, 640, 640),
    max_batches=100
)

config = {
    "onnx_file": "yolo11m.onnx",
    "engine_file": "yolo11m_int8.engine",
    "precision": "int8",
    "calibrator": calibrator,
    "max_batch_size": 16,
    "workspace_mb": 4096,
}
# INT8 engine provides 3.5x speedup with <0.5% mAP drop
# Requires 500+ representative frames from target cameras
```

## 14. Appendix C: Performance Benchmark Script

```python
# scripts/benchmark_pipeline.py
import time
import statistics
from src.pipeline.orchestrator import MultiStreamPipeline

BENCHMARK_DURATION = 300  # 5 minutes
WARMUP_FRAMES = 60

def benchmark():
    pipeline = MultiStreamPipeline.from_config("config/pipeline.yaml")

    # Warmup
    for _ in range(WARMUP_FRAMES):
        pipeline.process_frame("CAM_01", dummy_frame, datetime.now())

    # Benchmark
    latencies = []
    start = time.monotonic()
    while time.monotonic() - start < BENCHMARK_DURATION:
        t0 = time.perf_counter()
        pipeline.process_frame("CAM_01", dummy_frame, datetime.now())
        latencies.append((time.perf_counter() - t0) * 1000)  # ms

    print(f"Mean latency: {statistics.mean(latencies):.1f}ms")
    print(f"P50 latency: {statistics.median(latencies):.1f}ms")
    print(f"P95 latency: {sorted(latencies)[int(len(latencies)*0.95)]:.1f}ms")
    print(f"P99 latency: {sorted(latencies)[int(len(latencies)*0.99)]:.1f}ms")
    print(f"Throughput: {len(latencies) / BENCHMARK_DURATION:.1f} FPS")

if __name__ == "__main__":
    benchmark()
```

---

*Document Version: 1.0.0 | Generated for CP PLUS 8-Channel DVR Surveillance Platform*
*All model specifications and benchmarks reflect publicly available data as of July 2025*
