AI Vision Pipeline

Detection, tracking, face recognition, and inference details.

Industrial Surveillance AI Vision Pipeline — Complete Technical Design

Document Information

Property Value
Version 1.0.0
Date 2025-07-24
DVR CP PLUS 8-Channel, 960x1080 per channel
Environment Indoor / Industrial Mixed
Target Streams 8 simultaneous RTSP feeds
Edge Compute Moderate (NVIDIA Jetson or x86 + T4 class GPU)
Cloud Compute GPU-backed (V100 / A10 / T4)

1. Human Detection Module

1.1 Model Selection: YOLO11-Medium (YOLO11m)

Primary Choice: YOLO11m (Ultralytics)

Rationale: YOLO11 strikes the optimal balance between accuracy and inference speed for industrial surveillance. The medium variant provides sufficient capacity to detect partially occluded humans at mid-range distances while maintaining real-time throughput across 8 streams.

Attribute Specification
Model YOLO11m (Ultralytics release 8.3.x)
Backbone C3k2 bottleneck + C2PSA attention module
Neck PANet with compact feature aggregation
Head Anchor-free decoupled detection head
Parameters 20.1 M
Input Resolution 640 x 640 (letterboxed from 960x1080)
mAP@50-95 (COCO) 51.5%
Person Class AP ~75-80% (estimated, COCO "person")

Alternative for GPU-constrained edge: YOLO11s (9.4M params, 47.0% mAP, 2.5ms T4 TensorRT)
Alternative for maximum accuracy: RT-DETR-L (53.4% mAP, 6.8ms T4 TensorRT, transformer-based)

1.2 Inference Configuration

# yolo11m_detection.yaml
model:
  weights: "yolo11m.pt"
  class_filter: ["person"]          # Only detect human class (COCO idx 0)
  confidence_threshold: 0.35         # Balanced sensitivity
  iou_threshold: 0.45                # NMS IoU threshold
  max_detections: 50                 # Max persons per frame
  imgsz: 640                         # Square input
  half: true                         # FP16 inference

dataloader:
  batch_size: 8                      # One frame per stream
  workers: 4
  pin_memory: true

1.3 Quantization & Optimization Strategy

Optimization Stage Target Expected Speedup Accuracy Impact
PyTorch FP32 Baseline 1.0x Baseline (51.5% mAP)
ONNX Export Interop 1.1x Negligible
TensorRT FP16 Production GPU 2.8x (~4.7ms T4) -0.1% mAP
TensorRT INT8 Maximum throughput 3.5x (~3.8ms T4) -0.3% to -0.5% mAP
INT8 + DLA Jetson Orin DLA 4.0x -0.5% mAP

Recommended production path: TensorRT FP16 on GPU (best accuracy/speed tradeoff). INT8 for edge gateway with calibration dataset of 500+ representative surveillance frames.

1.4 Frame Preprocessing

# Input: 960 x 1080 per-channel frames from the DVR (960H-class resolution)
# Step 1: Resize to 640x640 with letterboxing (maintain aspect ratio)
# Step 2: Normalize: divide by 255.0, mean=[0.0,0.0,0.0], std=[1.0,1.0,1.0]
# Step 3: HWC -> CHW format
# Step 4: Batch: [8, 3, 640, 640]

# Expected output: person bounding boxes [x1, y1, x2, y2, conf, class_id]
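The four preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration, not the Ultralytics implementation: the function names are made up, and a production build would use cv2.resize instead of the nearest-neighbor resize used here to keep the example self-contained.

```python
import numpy as np

def letterbox(frame: np.ndarray, size: int = 640) -> np.ndarray:
    """Resize with preserved aspect ratio and pad to a square canvas
    (nearest-neighbor resize for brevity; use cv2.resize in production)."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = frame[ys][:, xs]
    canvas = np.full((size, size, 3), 114, dtype=frame.dtype)  # gray padding
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas

def preprocess_batch(frames: list) -> np.ndarray:
    """960x1080 BGR frames -> [N, 3, 640, 640] float32 batch in [0, 1]."""
    batch = np.stack([letterbox(f) for f in frames])   # [N, 640, 640, 3]
    batch = batch.astype(np.float32) / 255.0           # Step 2: normalize
    return batch.transpose(0, 3, 1, 2)                 # Step 3: HWC -> CHW
```

With 8 streams, `preprocess_batch` produces the `[8, 3, 640, 640]` tensor from Step 4.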

1.5 Performance Targets

Metric Target Notes
Latency (single frame) < 5ms @ T4 TensorRT FP16 YOLO11m at 640
Throughput (8 streams) > 160 FPS aggregate Batch=8, processing 20 FPS per stream
Person AP@50 > 75% On surveillance test set
Small person detection > 60% AP For persons > 30px tall
Occlusion handling > 50% AP Partial visibility (occlusion level 1-2)

2. Face Detection Module

2.1 Model Selection: SCRFD-500MF (640 input, GPU)

Primary Choice: SCRFD_500M_BNKPS (InsightFace Model Zoo)

SCRFD (Sample and Computation Redistribution for Efficient Face Detection) achieves the best speed-accuracy tradeoff for face detection on GPU. The 500MF variant is optimized for 640px inputs and provides 5-point facial keypoints (eyes, nose, mouth corners) critical for face alignment prior to recognition.

Attribute Specification
Model SCRFD_500M_BNKPS (ONNX)
Source deepinsight/insightface model zoo
Input Resolution 640 x 640
FLOPs 500 MFLOPs
Parameters ~1.5 M
WIDERFACE AP (Easy) 0.906
WIDERFACE AP (Medium) 0.870
WIDERFACE AP (Hard) 0.720
Keypoint Output 5 facial landmarks (eyes x2, nose, mouth corners x2)

Alternative (CPU/Edge): YuNet (OpenCV DNN, ~1ms CPU, AP_Easy 0.884, AP_Medium 0.866)
Alternative (Maximum accuracy): RetinaFace-R50 (higher AP but 5-8x slower)

2.2 Inference Configuration

# scrfd_face_detection.yaml
model:
  onnx_file: "scrfd_500m_bnkps.onnx"
  input_size: 640
  confidence_threshold: 0.45          # Face detection minimum confidence
  nms_threshold: 0.4
  top_k: 100                           # Max faces per frame
  min_face_size: 20                    # Minimum face pixel height (20px)
  scale_factor: [8, 16, 32]           # Feature pyramid strides

# Face quality scoring
quality:
  blur_threshold: 50.0                 # Laplacian variance threshold
  pose_max_yaw: 45.0                   # Degrees - reject profile faces
  pose_max_pitch: 30.0                 # Degrees
  min_face_width: 20                   # Pixels - ignore tiny faces
  max_face_width: 300                  # Pixels - ignore giant close-ups

2.3 Face Quality Assessment

Each detected face is scored on multiple dimensions before proceeding to recognition:

Quality Metric Method Threshold Rejection Rate
Sharpness/Blur Laplacian variance Var > 50 ~15% of detections
Face Size Bounding box height > 20px, < 300px ~10% of detections
Head Pose 5-point landmark geometry Yaw < 45, Pitch < 30 ~20% of detections
Face Confidence SCRFD detection score > 0.45 ~5% of detections
Illumination Mean face ROI intensity 40 < mean < 240 ~5% of detections

Only faces passing all quality gates proceed to the recognition module.
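The gates in the table can be sketched as a single pass/fail function. Thresholds are taken from the table and from the config in 2.2; `laplacian_variance` is a NumPy stand-in for `cv2.Laplacian(img, cv2.CV_64F).var()`, and all function names are illustrative.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Blur metric: variance of the 4-neighbor Laplacian response."""
    g = gray.astype(np.float64)
    lap = (-4 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def passes_quality_gates(gray_face: np.ndarray, det_conf: float,
                         yaw: float, pitch: float) -> bool:
    """Apply the gates from the table: size, confidence, pose, blur, illumination."""
    h, w = gray_face.shape
    if not (20 <= w <= 300):                        # face size gate
        return False
    if det_conf <= 0.45:                            # SCRFD confidence gate
        return False
    if abs(yaw) >= 45.0 or abs(pitch) >= 30.0:      # head pose gate
        return False
    if laplacian_variance(gray_face) <= 50.0:       # sharpness gate
        return False
    mean = float(gray_face.mean())
    return 40.0 < mean < 240.0                      # illumination gate
```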

2.4 Face Alignment

Using the 5-point landmarks from SCRFD, each face is aligned to a canonical pose using a similarity transform:

# Alignment target landmarks (112x112 template)
TARGET_LANDMARKS = np.array([
    [38.2946, 51.6963],   # Left eye
    [73.5318, 51.5014],   # Right eye
    [56.0252, 71.7366],   # Nose
    [41.5493, 92.3655],   # Left mouth
    [70.7299, 92.2041],   # Right mouth
], dtype=np.float32)

# Apply similarity transform (scale, rotation, translation)
# Output: aligned 112x112 face crop ready for ArcFace
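The similarity transform itself can be estimated with the standard Umeyama least-squares method (the same approach scikit-image's `SimilarityTransform` implements). A minimal NumPy sketch, with the final warp left to `cv2.warpAffine`:

```python
import numpy as np

def similarity_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares similarity (Umeyama) mapping src landmarks onto dst.
    Returns a 2x3 affine matrix suitable for cv2.warpAffine."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])                       # reflection fix
    R = U @ D @ Vt                              # rotation
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])   # 2x3 matrix

# M = similarity_transform(detected_landmarks, TARGET_LANDMARKS)
# aligned = cv2.warpAffine(frame, M, (112, 112))   # final 112x112 crop
```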

2.5 Performance Targets

Metric Target Notes
Latency (single face) < 3ms @ T4 TensorRT FP16 SCRFD-500M
Latency (batch 32 faces) < 12ms Batch processing
WIDERFACE AP (Hard) > 0.70 Challenging angles, lighting
Min detectable face size 20x20 pixels ~10m distance at 1080p
5-landmark accuracy < 3px NME Normalized mean error

3. Face Recognition Module

3.1 Model Selection: ArcFace R100 (MS1MV3)

Primary Choice: ArcFace R100 (IResNet100, InsightFace)

ArcFace with Additive Angular Margin Loss is the industry-standard face recognition model. The IResNet100 backbone trained on MS1MV3 provides state-of-the-art accuracy with 512-dimensional embeddings.

Attribute Specification
Model ArcFace with IResNet100 (IR-SE100) backbone
Loss Function Additive Angular Margin (ArcFace)
Training Data MS1MV3 (~5.1M images, 93K identities)
Input Size 3 x 112 x 112 (aligned face crop)
Embedding Dimension 512 (float32)
Parameters ~65 M
LFW Accuracy 99.83%
CFP-FP Accuracy 98.27%
AgeDB-30 Accuracy 98.28%
IJB-C (TPR@FPR=1e-4) 96.1%

Alternative (speed-focused): ArcFace R50 (IR-SE50, 25M params, LFW 99.80%, ~2x faster)
Alternative (edge/mobile): MobileFaceNet (4M params, 128-D embedding, LFW 99.28%)

3.2 Embedding Extraction Pipeline

# Pipeline per detected face:
# 1. Crop face from original frame using SCRFD bounding box
# 2. Align using 5-point landmarks -> 112x112 normalized crop
# 3. Normalize pixel values: (pixel - 127.5) / 128.0
# 4. Forward pass through ArcFace R100
# 5. L2-normalize the 512-D embedding vector
# 6. Store embedding + metadata for matching

# Output: 512-D unit vector representing facial identity
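Steps 3 and 5 are simple tensor operations; a minimal NumPy sketch (the function names are illustrative, not part of the InsightFace API):

```python
import numpy as np

def prepare_input(aligned_face: np.ndarray) -> np.ndarray:
    """Step 3: map uint8 pixels to roughly [-1, 1] as expected by ArcFace."""
    x = (aligned_face.astype(np.float32) - 127.5) / 128.0
    return x.transpose(2, 0, 1)[None]           # [1, 3, 112, 112]

def postprocess_embedding(raw: np.ndarray) -> np.ndarray:
    """Step 5: L2-normalize so cosine similarity reduces to a dot product."""
    return raw / np.linalg.norm(raw)
```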

3.3 Similarity Computation & Matching

Parameter Value Description
Similarity Metric Cosine Similarity dot(u, v) / (||u|| * ||v||); reduces to dot(u, v) for L2-normalized embeddings
Embedding Dim 512 Float32 per vector = 2KB storage
Distance Metric 1 - Cosine Similarity Range [0.0, 2.0]
Top-K Query K=5 Return top 5 candidates
Strict Match Threshold 0.58 (cosine) / 0.42 (distance) High confidence ID
Balanced Match Threshold 0.50 (cosine) / 0.50 (distance) Standard confidence
Relaxed Match Threshold 0.42 (cosine) / 0.58 (distance) Maximum recall
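Because the embeddings are L2-normalized, cosine similarity is just a dot product and the thresholds above apply directly to it. A brute-force sketch of the matching step (a stand-in for the FAISS/Milvus query; names are illustrative):

```python
import numpy as np

STRICT, BALANCED, RELAXED = 0.58, 0.50, 0.42   # cosine thresholds from the table

def classify_match(embedding: np.ndarray, gallery: np.ndarray, k: int = 5):
    """Top-K cosine search over a gallery of L2-normalized embeddings.
    Returns (label, best_index, best_score)."""
    sims = gallery @ embedding                  # dot product == cosine here
    top = np.argsort(sims)[::-1][:k]            # top-K candidate indices
    best, score = int(top[0]), float(sims[top[0]])
    if score >= STRICT:
        return "CONFIDENT_MATCH", best, score
    if score >= BALANCED:
        return "PROBABLE_MATCH", best, score
    if score >= RELAXED:
        return "POSSIBLE_MATCH", best, score
    return "UNKNOWN", best, score
```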

3.4 Face Database Structure

# Known Person Database (Milvus/FAISS vector store)
known_persons_db = {
    "person_id": "uuid-string",          # Unique person identifier
    "name": "John Doe",                  # Display name (optional)
    "employee_id": "EMP001",             # External reference
    "embeddings": [                      # Multiple reference embeddings
        {
            "vector": np.array(512),       # L2-normalized embedding
            "source_camera": "CAM_01",
            "timestamp": "2025-07-01T10:00:00Z",
            "face_quality": 0.92,
            "pose_yaw": 5.2,               # Head pose at capture
        }
    ],
    "created_at": "2025-07-01T00:00:00Z",
    "updated_at": "2025-07-20T15:30:00Z",
    "enrollment_count": 3,               # Number of reference photos
}

3.5 Top-K Matching Strategy

def match_face(embedding: np.ndarray, db: VectorStore, k: int = 5) -> MatchResult:
    """
    1. Query vector DB for top-K nearest neighbors (cosine similarity)
    2. Compute similarity scores for all K candidates
    3. Apply threshold-based classification:
       - Highest score >= strict_threshold   -> CONFIDENT_MATCH
       - Highest score >= balanced_threshold -> PROBABLE_MATCH
       - Highest score >= relaxed_threshold  -> POSSIBLE_MATCH
       - All scores < relaxed_threshold       -> UNKNOWN
    4. For CONFIDENT_MATCH: return person_id with confidence
    5. For UNKNOWN: route to clustering module for unknown identity grouping
    """

3.6 Performance Targets

Metric Target Notes
Latency (single face) < 8ms @ T4 TensorRT FP16 ArcFace R100 at 112x112
Latency (batch 32 faces) < 25ms Batch processing
LFW Verification > 99.8% Standard benchmark
CFP-FP (frontal-profile) > 98.0% Pose variation robustness
False Acceptance Rate < 0.1% @ 99% TPR For access control scenarios
Embedding Throughput > 4,000 faces/sec GPU batch inference

4. Person Tracking Module

4.1 Model Selection: ByteTrack

Primary Choice: ByteTrack (Peize Sun et al., ByteDance)

ByteTrack achieves the best accuracy-speed tradeoff for surveillance tracking. Its dual-threshold association mechanism recovers objects from low-confidence detections, dramatically reducing ID switches during occlusions — a critical requirement for industrial environments with shelving, machinery, and partial obstructions.

Attribute Specification
Algorithm ByteTrack (BYTE association)
Motion Model Kalman Filter (constant velocity)
Similarity Metric IoU (first association), IoU (second association)
Detection Threshold (high) 0.6
Detection Threshold (low) 0.1
Track Buffer (lost frames) 30 frames (~1 sec @ 30 FPS)
IoU Match Threshold 0.2 (reject matches below)
FPS (V100) 30 FPS (detection + tracking)
MOTA (MOT17) 80.3%
IDF1 (MOT17) 77.3%
HOTA (MOT17) 63.1%

Alternative (accuracy-focused): BoT-SORT (+1% MOTA, improved MOTP, includes Camera Motion Compensation, ~35 FPS)
Alternative (edge/CPU): OC-SORT (hundreds of FPS on CPU, handles non-linear motion)

4.2 Tracking Pipeline Configuration

# bytetrack_config.yaml
bytetrack:
  track_thresh: 0.6              # High-confidence detection threshold
  track_buffer: 30               # Max frames to keep lost tracks alive
  match_thresh: 0.8              # IoU matching threshold (first stage)
  det_thresh_low: 0.1            # Low-confidence threshold for second association
  iou_thresh_reject: 0.2         # Minimum IoU to accept a match
  min_box_area: 100              # Ignore detections smaller than 10x10 px
  aspect_ratio_thresh: 10.0      # Reject extreme aspect ratios
  mot20: false                   # Standard density mode

4.3 Track ID Management

class TrackManager:
    """Manages track lifecycle across all camera streams."""

    def __init__(self):
        self.next_track_id = 0          # Monotonically increasing
        self.active_tracks = {}         # track_id -> TrackState
        self.lost_tracks = {}           # Recently lost, may recover
        self.archived_tracks = {}       # Finalized trajectories

    def create_track(self, detection, camera_id):
        """Initialize new track from high-confidence detection."""
        track_id = self.next_track_id
        self.next_track_id += 1
        # Initialize Kalman filter state
        # Store: bbox, confidence, camera_id, first_seen, last_seen
        return track_id

    def update_track(self, track_id, detection):
        """Update existing track with matched detection."""
        # Update Kalman filter
        # Update last_seen timestamp
        # Increment hit count

    def mark_lost(self, track_id):
        """Track not matched in current frame."""
        # Increment lost count
        # If lost > track_buffer, archive track

    def get_track_summary(self, track_id) -> dict:
        """Return track metadata: duration, camera span, entry/exit zones."""

4.4 Cross-Camera Track Association

For multi-camera scenarios (8 channels), a secondary association layer links tracks across cameras using:

  1. Temporal proximity — tracks appearing on different cameras within a time window
  2. Appearance features — ArcFace embedding similarity for re-identification
  3. Zone transition rules — predefined camera adjacency graph (CAM_01 -> CAM_02)

def associate_cross_camera(track_cam_a, track_cam_b, max_time_gap=60):
    """
    Associate tracks across cameras using:
    - Time gap between track end (A) and track start (B) < max_time_gap seconds
    - Embedding cosine similarity > 0.65 (relaxed threshold for ReID)
    - Camera adjacency is valid in zone graph
    """

4.5 Performance Targets

Metric Target Notes
MOTA > 75% Multi-object tracking accuracy
IDF1 > 70% Identity preservation across frames
ID Switches < 2 per 100 frames Per camera stream
Fragmentation < 3 per track Track splits per person per session
Track Recovery > 80% within 1 sec Re-acquire after brief occlusion
Latency overhead < 1ms per frame Tracking association cost

5. Unknown Person Clustering Module

5.1 Model Selection: HDBSCAN (with DBSCAN Fallback)

Primary Choice: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)

For unknown face embedding clustering, HDBSCAN outperforms DBSCAN by not requiring a global density parameter (eps) and naturally handling variable-density clusters — critical for surveillance where some individuals appear frequently and others only once.

Attribute Specification
Clustering Algorithm HDBSCAN (primary) + DBSCAN (fallback)
Embedding Input 512-D L2-normalized ArcFace embeddings
Distance Metric Cosine distance (1 - cosine similarity)
Min Cluster Size 3
Min Samples 2
Cluster Selection Method eom (Excess of Mass)
Allow Single Cluster True

5.2 Clustering Pipeline

class UnknownPersonClustering:
    """Clusters unknown person embeddings to identify recurring visitors."""

    def __init__(self):
        self.clusters = {}              # cluster_id -> ClusterProfile
        self.noise_embeddings = []      # Unclustered (single-appearance)
        self.merge_candidates = []      # Pairs flagged for merge review
        self.dbscan_eps = 0.28          # Fallback DBSCAN parameter
        self.dbscan_min_samples = 2

    def add_embedding(self, embedding: np.ndarray, metadata: dict) -> str:
        """
        1. Try HDBSCAN fit_predict on accumulated embeddings
        2. If HDBSCAN fails (all noise), fall back to DBSCAN
        3. Assign embedding to cluster or mark as noise (-1)
        4. If cluster assignment: update cluster centroid and metadata
        5. Check for cluster merge opportunities
        6. Return: cluster_id or "noise"
        """

    def merge_clusters(self, cluster_a: str, cluster_b: str) -> str:
        """
        Merge two clusters that belong to the same person.
        Trigger: centroid distance < 0.25 (cosine distance)
                 OR temporal overlap analysis
                 OR manual operator confirmation
        """

    def get_recurring_unknowns(self, min_appearances: int = 3) -> list:
        """Return unknown persons seen at least N times (potential enrollment candidates)."""

    def compute_cluster_centroid(self, cluster_id: str) -> np.ndarray:
        """L2-normalized mean of all embeddings in cluster."""

5.3 Cluster Data Structure

@dataclass
class ClusterProfile:
    cluster_id: str                     # UUID
    centroid: np.ndarray                # 512-D mean embedding (L2-normalized)
    embeddings: List[np.ndarray]        # All member embeddings
    metadata: List[dict]                # Source info per embedding
    first_seen: datetime
    last_seen: datetime
    appearance_count: int               # Total embeddings in cluster
    camera_span: Set[str]               # Which cameras observed this person
    quality_score: float                # Average face quality (0-1)
    best_face_crop: str                 # Path to highest quality crop
    is_named: bool = False              # Flag when promoted to known person
    person_name: Optional[str] = None   # Assigned name (if promoted)

5.4 Merge Logic & Cluster Maintenance

Trigger Action Threshold
Centroid distance Auto-merge clusters cosine distance < 0.20
Centroid distance Flag for review cosine distance 0.20-0.30
Temporal overlap Prevent merge Same time on different cameras
Cluster size Auto-archive > 100 embeddings, compress to centroid
Age Archive old clusters No activity for 90 days

5.5 Three-Tier Identity Classification

┌────────────────────────────────────────────────────────────┐
│                  IDENTITY CLASSIFICATION                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌─────────────────────┐                                   │
│  │   KNOWN PERSON      │ ◄── cosine >= 0.58                │
│  │   (Database Match)  │                                   │
│  └──────────▲──────────┘                                   │
│             │                                              │
│  ┌──────────┴──────────┐                                   │
│  │  UNKNOWN RECURRING  │ ◄── 0.35 <= cosine < 0.58         │
│  │  (Cluster Match)    │                                   │
│  └──────────▲──────────┘                                   │
│             │                                              │
│  ┌──────────┴──────────┐                                   │
│  │   NEW UNKNOWN       │ ◄── cosine < 0.35                 │
│  │   (Noise / New)     │                                   │
│  └─────────────────────┘                                   │
│                                                            │
│  ┌─────────────────────┐                                   │
│  │   REVIEW QUEUE      │ ◄── Low quality / Low confidence  │
│  │  (Operator Review)  │                                   │
│  └─────────────────────┘                                   │
└────────────────────────────────────────────────────────────┘

5.6 Clustering Performance Targets

Metric Target Notes
Cluster Purity > 89% Same person in same cluster (HDBSCAN benchmark)
BCubed F-Measure > 0.85 Precision-recall balanced clustering
Clustering Latency < 100ms Per batch of 50 new embeddings
False Merge Rate < 5% Different people in same cluster
Memory per cluster ~4 KB Centroid + metadata

6. Evidence Capture Module

6.1 Capture Triggers

Evidence is captured (face crop + metadata saved) on the following events:

Event Type Trigger Condition Priority
KNOWN_PERSON_DETECTED Face match confidence >= 0.50 Medium
UNKNOWN_PERSON_DETECTED New cluster formed, 3rd appearance High
REVIEW_NEEDED Low confidence match OR low quality face High
ZONE_VIOLATION Person enters restricted zone Critical
TAILGATING Two persons detected on single credential swipe Critical
AFTER_HOURS Person detected outside authorized hours High
SUSPICIOUS_BEHAVIOR Loitering (>5 min in same area) Medium

6.2 Evidence Record Structure

@dataclass
class EvidenceRecord:
    # Unique identifiers
    evidence_id: str                    # UUID v4
    event_id: str                       # Links to event log
    camera_id: str                      # CAM_01 .. CAM_08
    stream_id: str                      # DVR channel identifier

    # Temporal
    timestamp_utc: datetime
    timestamp_local: datetime
    frame_number: int
    video_segment: str                  # Path to 10-sec video clip

    # Person identity
    identity_type: str                  # "known" | "unknown_recurring" | "unknown_new" | "review"
    person_id: Optional[str]            # Track ID or cluster ID
    person_name: Optional[str]          # Known person name
    match_confidence: float             # Face recognition confidence (0-1)

    # Face crop
    face_crop_path: str                 # /evidence/faces/2025/07/24/{id}.jpg
    face_crop_dimensions: tuple         # (w, h) of crop
    face_quality_score: float           # Combined quality metric
    face_landmarks: np.ndarray          # 5-point landmarks
    head_pose: dict                     # {yaw, pitch, roll}

    # Full frame reference
    full_frame_path: str                # /evidence/frames/2025/07/24/{id}.jpg
    bounding_box: tuple                 # (x1, y1, x2, y2) in original frame

    # AI confidence levels
    detection_confidence: float         # YOLO person detection confidence
    face_detection_confidence: float    # SCRFD face detection confidence
    recognition_confidence: float       # ArcFace match confidence

    # Vibe settings at capture time
    detection_sensitivity: str          # "low" | "balanced" | "high"
    face_match_strictness: str          # "relaxed" | "balanced" | "strict"

    # Review state
    review_status: str                  # "pending" | "reviewed" | "confirmed" | "false_positive"
    reviewed_by: Optional[str]
    review_notes: Optional[str]

6.3 Deduplication Strategy

To avoid storing duplicate evidence of the same person within short time windows:

class EvidenceDeduplicator:
    """Prevents duplicate evidence capture using time-based gating."""

    DEDUP_WINDOW_KNOWN = 300        # 5 minutes between captures of same known person
    DEDUP_WINDOW_UNKNOWN = 60       # 1 minute between captures of same unknown person
    DEDUP_WINDOW_EVENT = 10         # 10 seconds between same event type

    def should_capture(self, person_id: str, event_type: str,
                       camera_id: str, timestamp: datetime) -> bool:
        """
        1. Check last capture time for this person_id + camera_id
        2. If within dedup window: skip capture, increment visit counter
        3. If outside window: allow capture, update last capture time
        4. Special: always capture if event_type is CRITICAL priority
        """

6.4 Storage Layout

/evidence/
  faces/
    2025/07/24/
      {evidence_id}_{camera_id}_{person_id}_face.jpg      # 112x112 aligned crop
      {evidence_id}_{camera_id}_{person_id}_full.jpg       # Full bounding box crop
  frames/
    2025/07/24/
      {evidence_id}_{camera_id}_frame.jpg                  # Full frame with annotation overlay
  video_clips/
    2025/07/24/
      {evidence_id}_{camera_id}_{timestamp}.mp4            # 10-second H.264 clip
  metadata/
    2025/07/24/
      {evidence_id}.json                                   # Full EvidenceRecord as JSON

6.5 Storage Requirements Estimate

Content Type Size Each Daily (8 cams) Monthly
Face crop (112x112 JPEG) ~8 KB ~50 MB ~1.5 GB
Full crop (200x300 JPEG) ~25 KB ~150 MB ~4.5 GB
Frame snapshot (960x1080 JPEG) ~150 KB ~900 MB ~27 GB
10-sec video clip (H.264) ~500 KB ~3 GB ~90 GB
Metadata JSON ~2 KB ~12 MB ~360 MB
Total (all media) ~4.1 GB ~123 GB

Recommended: Store face crops + metadata for all events. Full frames and video clips only for priority events (review_needed, zone_violation, after_hours).


7. Confidence Handling & Thresholds

7.1 Confidence Level Definitions

Level Aggregate Score Color Action
HIGH >= 0.75 Green Auto-process, no review needed
MEDIUM 0.50 - 0.75 Yellow Process with confidence label, flag for spot-check
LOW 0.35 - 0.50 Orange Capture evidence, mark for review
REVIEW_NEEDED < 0.35 Red Always queue for operator review
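This mapping is a straightforward cascade of threshold checks:

```python
def confidence_level(score: float) -> str:
    """Map an aggregate confidence score to the levels in the table above."""
    if score >= 0.75:
        return "HIGH"
    if score >= 0.50:
        return "MEDIUM"
    if score >= 0.35:
        return "LOW"
    return "REVIEW_NEEDED"
```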

7.2 Aggregate Confidence Score

The aggregate confidence is computed as a weighted combination:

def compute_aggregate_confidence(det_conf: float, face_conf: float,
                                  match_conf: float, quality_score: float) -> float:
    """
    Aggregate = 0.25 * det_conf + 0.20 * face_conf + 0.35 * match_conf + 0.20 * quality_score

    Where:
    - det_conf:      YOLO person detection confidence (0-1)
    - face_conf:     SCRFD face detection confidence (0-1)
    - match_conf:    ArcFace recognition match confidence (0-1), 0.0 for unknowns
    - quality_score: Face quality composite score (0-1)
    """
    return (0.25 * det_conf + 0.20 * face_conf
            + 0.35 * match_conf + 0.20 * quality_score)

7.3 AI Vibe Settings Mapping

The system exposes three "vibe" settings that internally map to threshold configurations:

Detection Sensitivity (applies to YOLO + SCRFD):

Setting YOLO Conf Threshold SCRFD Conf Threshold Effect
Low 0.50 0.55 Fewer detections, lower false positive rate
Balanced 0.35 0.45 Standard detection rate
High 0.20 0.35 Maximum detection, higher false positive rate

Face Match Strictness (applies to ArcFace matching):

Setting Strict Threshold Balanced Threshold Relaxed Threshold Effect
Relaxed 0.50 0.42 0.35 High recall, more false matches
Balanced 0.58 0.50 0.42 Balanced precision-recall
Strict 0.65 0.58 0.50 High precision, stricter matching
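Both tables can be expressed as lookup dictionaries that a vibe preset resolves against at runtime (the dictionary and key names here are illustrative):

```python
# Threshold tables from this section, as lookup dictionaries
DETECTION_SENSITIVITY = {
    "low":      {"yolo_conf": 0.50, "scrfd_conf": 0.55},
    "balanced": {"yolo_conf": 0.35, "scrfd_conf": 0.45},
    "high":     {"yolo_conf": 0.20, "scrfd_conf": 0.35},
}

FACE_MATCH_STRICTNESS = {
    "relaxed":  {"strict": 0.50, "balanced": 0.42, "relaxed": 0.35},
    "balanced": {"strict": 0.58, "balanced": 0.50, "relaxed": 0.42},
    "strict":   {"strict": 0.65, "balanced": 0.58, "relaxed": 0.50},
}

def resolve_vibe(detection_sensitivity: str, face_match_strictness: str) -> dict:
    """Resolve a vibe preset into its concrete runtime thresholds."""
    return {**DETECTION_SENSITIVITY[detection_sensitivity],
            **FACE_MATCH_STRICTNESS[face_match_strictness]}
```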

7.4 Vibe Configuration Matrix

# vibe_presets.yaml
vibe_presets:
  access_control:                    # High security area
    detection_sensitivity: "balanced"
    face_match_strictness: "strict"

  general_surveillance:              # Standard monitoring
    detection_sensitivity: "balanced"
    face_match_strictness: "balanced"

  perimeter_monitoring:              # Catching all activity
    detection_sensitivity: "high"
    face_match_strictness: "relaxed"

  after_hours:                       # Night mode
    detection_sensitivity: "high"
    face_match_strictness: "balanced"

  privacy_mode:                      # Minimal detection
    detection_sensitivity: "low"
    face_match_strictness: "strict"

7.5 Threshold Auto-Tuning Strategy

class ThresholdTuner:
    """Periodically adjusts thresholds based on operational feedback."""

    def analyze_feedback(self, review_results: list):
        """
        1. Collect operator review labels on REVIEW_NEEDED items
        2. Track false positive rate and false negative rate
        3. If FP rate > 10%: increase confidence thresholds by 5%
        4. If FN rate > 10%: decrease confidence thresholds by 5%
        5. Only adjust within +/- 15% of baseline values
        6. Log all threshold changes with rationale
        """

    def weekly_report(self) -> dict:
        """Generate confidence distribution and threshold effectiveness report."""

8. Inference Pipeline Architecture

8.1 Per-Stream Processing Pipeline

┌─────────────────────────────────────────────────────────────────┐
│                    PER-STREAM PIPELINE                           │
│                    (Executed per camera frame)                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐       │
│  │  RTSP    │    │   Frame      │    │   Frame Queue    │       │
│  │  Stream  │───▶│   Decode     │───▶│   (ring buffer)  │       │
│  │  (H.264) │    │   (960x1080) │    │   max 30 frames  │       │
│  └──────────┘    └──────────────┘    └──────────────────┘       │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 1: HUMAN DETECTION (YOLO11m TensorRT FP16)          │   │
│  │  Input: 640x640 batch tensor                               │   │
│  │  Output: person bboxes [N x 6] (x1,y1,x2,y2,conf,cls)    │   │
│  │  Latency: ~4.7ms per frame (T4)                           │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 2: FACE DETECTION (SCRFD-500M TensorRT FP16)        │   │
│  │  Input: Cropped person regions from Step 1                 │   │
│  │  Output: face bboxes + 5 landmarks per face                │   │
│  │  Latency: ~2.5ms per face (T4)                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 3: FACE ALIGNMENT & QUALITY CHECK                   │   │
│  │  Input: Face crop + 5 landmarks                            │   │
│  │  Process: Similarity transform -> 112x112 aligned crop     │   │
│  │  Quality: Blur, pose, illumination checks                  │   │
│  │  Latency: ~0.3ms (OpenCV CPU)                             │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 4: FACE RECOGNITION (ArcFace R100 TensorRT FP16)    │   │
│  │  Input: 112x112 aligned face crop (batch)                  │   │
│  │  Output: 512-D L2-normalized embedding                     │   │
│  │  Latency: ~6ms per face (T4, batch=8)                     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 5: IDENTITY MATCHING (FAISS/Milvus vector search)   │   │
│  │  Input: 512-D embedding                                    │   │
│  │  Output: Top-K matches with similarity scores              │   │
│  │  Latency: < 5ms (in-memory, <10K identities)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 6: PERSON TRACKING (ByteTrack)                      │   │
│  │  Input: Person detections + face embeddings               │   │
│  │  Output: Persistent track IDs with identity labels         │   │
│  │  Latency: ~1ms per frame                                  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 7: UNKNOWN CLUSTERING (HDBSCAN)                     │   │
│  │  Input: Embeddings of unmatched faces                     │   │
│  │  Output: Cluster assignments for recurring unknowns        │   │
│  │  Latency: ~50ms (batch update, every 30 sec)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 8: EVIDENCE CAPTURE & EVENT GENERATION              │   │
│  │  Input: Track results + identity + confidence             │   │
│  │  Output: Evidence records, event log entries, alerts       │   │
│  │  Latency: ~5ms (async I/O)                                │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  OUTPUT: Structured event stream to central system        │   │
│  │  { track_id, identity, confidence, bbox, timestamp,       │   │
│  │    camera_id, event_type, evidence_refs }                  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
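The event record emitted at the end of the diagram can be sketched as a dataclass. This is illustrative only: the field names follow the diagram's output schema, while `PipelineEvent`, `to_json`, and the example values are assumptions, not the production data model.

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json

@dataclass
class PipelineEvent:
    """One entry in the structured event stream (fields from the diagram)."""
    track_id: int
    identity: str                    # matched label, or "unknown"
    confidence: float                # aggregate match confidence, 0..1
    bbox: tuple                      # (x1, y1, x2, y2) in frame pixels
    timestamp: str                   # ISO-8601 UTC
    camera_id: str
    event_type: str                  # e.g. "person_identified"
    evidence_refs: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

event = PipelineEvent(
    track_id=17, identity="EMP-0042", confidence=0.91,
    bbox=(120, 80, 260, 420),
    timestamp=datetime.now(timezone.utc).isoformat(),
    camera_id="CAM_03", event_type="person_identified",
)
```

A flat JSON-serializable record like this maps directly onto the Kafka/Redis event bus and the WebSocket stream described later.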

8.2 Multi-Stream Orchestration

class MultiStreamPipeline:
    """Orchestrates inference across 8 simultaneous camera streams."""

    def __init__(self, config: PipelineConfig):
        # 4 inference workers (each processes 2 streams)
        self.workers = [InferenceWorker(gpu_id=i % 2) for i in range(4)]

        # Stream assignments: worker -> [stream_ids]
        self.stream_map = {
            0: ["CAM_01", "CAM_02"],
            1: ["CAM_03", "CAM_04"],
            2: ["CAM_05", "CAM_06"],
            3: ["CAM_07", "CAM_08"],
        }

        # Per-stage model wrappers (see src/detection and src/face)
        self.yolo_detector = YOLODetector(config.detection)
        self.face_detector = FaceDetector(config.face)
        self.face_recognizer = FaceRecognizer(config.recognition)
        self.quality_checker = QualityChecker(config.quality)

        # Shared components (thread-safe)
        self.tracker_pool = {cam: ByteTrack(config.track) for cam in ALL_CAMERAS}
        self.face_db = VectorDatabase(config.db)          # Milvus/FAISS
        self.clustering = UnknownPersonClustering(config.cluster)
        self.evidence = EvidenceCaptureManager(config.evidence)

    def process_frame(self, camera_id: str, frame: np.ndarray, timestamp: datetime):
        """Process a single frame through the complete pipeline."""
        # STEP 1: Human Detection
        person_dets = self.yolo_detector.detect(frame)

        # STEP 2: Face Detection (within person regions)
        face_dets = []
        for det in person_dets:
            person_crop = crop_region(frame, det.bbox)
            faces = self.face_detector.detect(person_crop)
            for face in faces:
                # Shift boxes/landmarks from crop-local to frame coordinates
                face.translate(det.bbox[0], det.bbox[1])
            face_dets.extend(faces)

        # STEP 3: Face Alignment + Quality
        aligned_faces = []
        for face in face_dets:
            aligned = align_face(frame, face.landmarks)
            quality = self.quality_checker.score(aligned)
            if quality.passed:
                aligned_faces.append((aligned, quality.score, face))

        # STEP 4: Face Recognition (batch)
        if aligned_faces:
            embeddings = self.face_recognizer.embed(
                [f[0] for f in aligned_faces]
            )

            # STEP 5: Identity Matching
            for emb, (aligned, quality, face) in zip(embeddings, aligned_faces):
                matches = self.face_db.search(emb, top_k=5)
                identity = self.classify_identity(emb, matches)
                face.identity = identity

        # STEP 6: Person Tracking + face-to-track identity association
        tracks = self.tracker_pool[camera_id].update(person_dets)
        self.associate_faces_with_tracks(tracks, face_dets)

        # STEP 7: Unknown clustering (periodic batch)
        self.clustering.update_periodic()

        # STEP 8: Evidence capture
        self.evidence.capture_events(tracks, camera_id, timestamp)

        return tracks

8.3 Batch Processing Strategy

For GPU efficiency, frames are processed in batched groups:

| Batch Type | Batch Size | Frequency | GPU Utilization |
|---|---|---|---|
| Human Detection | 8 frames | Every frame decode | ~85% |
| Face Detection | Variable (up to 32 faces) | Per 2 frames | ~60% |
| Face Recognition | Up to 32 faces | Per 2 frames | ~75% |
| Tracking | Per stream | Every frame | CPU-bound |
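The batching policy above can be sketched as a collector that releases a batch once it is full or once the oldest queued frame exceeds a deadline. This is a minimal sketch; `BatchCollector` and the `max_batch` / `max_wait_ms` values are illustrative, not the production implementation.

```python
import time
from collections import deque

class BatchCollector:
    """Groups per-stream frames into one GPU batch, releasing it when it
    is full or a deadline passes. max_batch / max_wait_ms are illustrative."""

    def __init__(self, max_batch: int = 8, max_wait_ms: float = 10.0):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue = deque()              # (camera_id, frame) tuples
        self.first_enqueue = None         # arrival time of oldest queued frame

    def add(self, camera_id, frame):
        if not self.queue:
            self.first_enqueue = time.monotonic()
        self.queue.append((camera_id, frame))

    def ready(self) -> bool:
        if not self.queue:
            return False
        full = len(self.queue) >= self.max_batch
        stale = time.monotonic() - self.first_enqueue >= self.max_wait_s
        return full or stale

    def drain(self):
        n = min(self.max_batch, len(self.queue))
        batch = [self.queue.popleft() for _ in range(n)]
        self.first_enqueue = time.monotonic() if self.queue else None
        return batch
```

The deadline keeps worst-case added latency bounded (here ~10 ms) even when only a few streams have new frames.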

8.4 GPU Utilization Strategy

GPU 0 (Primary - T4 / A10):
  ├─ Streams CAM_01-CAM_04: YOLO11m detection
  ├─ Streams CAM_01-CAM_04: SCRFD face detection
  ├─ Streams CAM_01-CAM_04: ArcFace R100 recognition
  └─ TensorRT Context 0: All models (shared)

GPU 1 (Optional - V100 / A100 for scale):
  ├─ Streams CAM_05-CAM_08: Same pipeline
  └─ TensorRT Context 1: Dedicated context

CPU (x86_64):
  ├─ Stream decode (FFmpeg, 8 threads)
  ├─ ByteTrack association (all streams)
  ├─ Face alignment + quality (OpenCV)
  ├─ HDBSCAN clustering (background thread)
  ├─ Evidence I/O (async thread pool)
  └─ API server (FastAPI, 4 workers)

8.5 Performance Budget (Per 8-Stream System)

| Pipeline Stage | Per-Frame Cost | 8-Stream Aggregate | GPU % |
|---|---|---|---|
| Frame decode | ~2ms | 16ms (parallel) | |
| YOLO11m detection | ~4.7ms | ~37.6ms (batched) | 35% |
| SCRFD face detection | ~2.5ms avg | ~20ms (batched) | 20% |
| Face alignment + quality | ~0.3ms | ~2.4ms (CPU) | |
| ArcFace R100 recognition | ~6ms avg | ~48ms (batched) | 45% |
| ByteTrack tracking | ~1ms | ~8ms (CPU) | |
| Vector search | ~1ms | ~8ms (CPU) | |
| Evidence capture | ~2ms | ~16ms (async I/O) | |
| **Total effective** | ~30-35ms end-to-end | | |
| **Effective throughput** | ~28 FPS per stream | | 100% |

Target: 15-20 FPS processed per stream at 960x1080 with batching optimizations. This sits deliberately below the ~28 FPS theoretical ceiling above, leaving headroom for decode jitter and detection bursts.


9. Model Selection Summary Table

| Component | Model Choice | Framework | Input Size | FPS Target (T4) | Accuracy Metric |
|---|---|---|---|---|---|
| Human Detection | YOLO11m (Ultralytics) | TensorRT FP16 | 640 x 640 | 213 FPS (batch=8) | 51.5% mAP@50-95 COCO; ~78% person AP |
| Face Detection | SCRFD-500M-BNKPS (InsightFace) | TensorRT FP16 | 640 x 640 | ~400 FPS (batch=32) | 90.6% AP-Easy, 87.0% AP-Med, 72.0% AP-Hard (WIDERFACE) |
| Face Recognition | ArcFace R100 IR-SE100 (InsightFace, MS1MV3) | TensorRT FP16 | 112 x 112 | ~170 FPS (batch=32) | 99.83% LFW, 98.27% CFP-FP, 96.1% IJB-C@1e-4 |
| Person Tracking | ByteTrack (BYTE association, Kalman filter) | NumPy/OpenCV | Per stream | >500 FPS (association only) | 80.3% MOTA, 77.3% IDF1, 63.1% HOTA (MOT17) |
| Unknown Clustering | HDBSCAN (hdbscan library) + DBSCAN fallback | scikit-learn/hdbscan | 512-D embeddings | <100ms per batch | 89.5% cluster purity, BCubed F > 0.85 |
| Vector Search | FAISS (IndexFlatIP) or Milvus | FAISS/Milvus | 512-D vectors | <5ms per query | Exact nearest neighbor (cosine) |
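The vector-search row is exact inner-product search; on L2-normalized embeddings the inner product equals cosine similarity. A NumPy sketch of what `IndexFlatIP` computes (FAISS accelerates this same scan; the random gallery and the 0.01 noise level are synthetic illustration, not enrollment data):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cosine_top_k(gallery: np.ndarray, query: np.ndarray, k: int = 5):
    """Exact top-k cosine search, i.e. what IndexFlatIP computes on
    L2-normalized embeddings."""
    sims = (normalize(gallery) @ normalize(query.reshape(1, -1)).T).ravel()
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 512)).astype(np.float32)   # enrolled identities
query = gallery[42] + 0.01 * rng.normal(size=512).astype(np.float32)
idx, sims = cosine_top_k(gallery, query, k=5)
```

Because the scan is exact, accuracy is identical to brute force; the <5ms query budget holds for galleries under ~10K identities, after which an IVF index becomes worthwhile.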

10. Technology Stack

10.1 Deep Learning Framework

| Layer | Technology | Version | Purpose |
|---|---|---|---|
| Training | PyTorch | 2.2+ | Model fine-tuning, research |
| Export | ONNX | 1.15+ | Model portability |
| GPU Inference | TensorRT | 8.6+ / 10.0+ | Production inference optimization |
| CPU Inference | ONNX Runtime | 1.16+ | CPU fallback for edge |
| CPU (Intel) | OpenVINO | 2024.0+ | Intel-optimized inference |

10.2 Model Serving Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Docker Container: ai-vision-pipeline                      │  │
│  │  Base: nvidia/cuda:12.1-runtime-ubuntu22.04               │  │
│  │                                                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │  │
│  │  │  TensorRT   │  │  OpenCV     │  │  FastAPI        │   │  │
│  │  │  Engine     │  │  4.9+       │  │  Server         │   │  │
│  │  │  (TRT 10)   │  │  (CUDA)     │  │  (uvicorn)      │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────┘   │  │
│  │                                                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │  │
│  │  │  FAISS      │  │  hdbscan    │  │  Kafka / Redis  │   │  │
│  │  │  (vectors)  │  │  (cluster)  │  │  (event bus)    │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────┘   │  │
│  │                                                             │  │
│  │  ┌──────────────────────────────────────────────────────┐  │  │
│  │  │  Pipeline Orchestrator (Python asyncio)              │  │  │
│  │  │  - Stream reader threads (8x FFmpeg)                 │  │  │
│  │  │  - GPU inference queue                                 │  │  │
│  │  │  - CPU post-processing workers                         │  │  │
│  │  │  - Evidence async writer                               │  │  │
│  │  └──────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Docker Container: ai-vision-api                           │  │
│  │  - REST API for configuration                              │  │
│  │  - WebSocket for real-time events                          │  │
│  │  - Database: PostgreSQL + pgvector                         │  │
│  │  - Object storage: MinIO (evidence media)                  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

10.3 GPU Requirements

| Deployment Mode | Minimum GPU | Recommended GPU | Notes |
|---|---|---|---|
| Edge Gateway | NVIDIA Jetson Orin Nano 8GB | Jetson Orin NX 16GB | INT8 quantization, 5-8 FPS per stream |
| Edge Server | NVIDIA T4 16GB | NVIDIA A10 24GB | FP16, full 8-stream real-time |
| Cloud Processing | NVIDIA T4 16GB | NVIDIA V100 32GB | FP16, 8+ streams, batching |
| Development | NVIDIA RTX 3080 10GB | NVIDIA RTX 4090 24GB | Full pipeline debugging |

10.4 CPU Fallback Options

When GPU is unavailable, the pipeline falls back to CPU-optimized models:

| Component | GPU Model | CPU Fallback | CPU Latency |
|---|---|---|---|
| Human Detection | YOLO11m TensorRT | YOLO11n ONNX + OpenVINO | ~56ms/frame |
| Face Detection | SCRFD TensorRT | YuNet OpenCV DNN | ~3ms/frame |
| Face Recognition | ArcFace R100 TensorRT | ArcFace MobileFaceNet ONNX | ~15ms/face |
| Tracking | ByteTrack (CPU) | ByteTrack (CPU) | ~2ms/frame |

Note: CPU fallback processes at ~5-8 FPS per stream. For full 8-stream real-time, GPU acceleration is required.
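The fallback chain can be expressed as a provider preference list handed to `onnxruntime.InferenceSession`. The helper below is a sketch (the `select_providers` name and policy order are assumptions; the provider strings and `get_available_providers` are standard ONNX Runtime API):

```python
def select_providers(available: list[str]) -> list[str]:
    """Build an ONNX Runtime provider preference list: CUDA first when
    present, then OpenVINO on Intel hosts, with the default CPU provider
    as the guaranteed last resort."""
    preference = [
        "CUDAExecutionProvider",        # GPU path (onnxruntime-gpu)
        "OpenVINOExecutionProvider",    # Intel-optimized CPU path
        "CPUExecutionProvider",         # always available
    ]
    chosen = [p for p in preference if p in available]
    return chosen or ["CPUExecutionProvider"]

# In production:
#   import onnxruntime as ort
#   session = ort.InferenceSession(model_path,
#       providers=select_providers(ort.get_available_providers()))
```

ONNX Runtime itself falls through the list in order, so the same session code serves both GPU and CPU deployments.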

10.5 Docker Compose Configuration

# docker-compose.yml
version: '3.8'

services:
  ai-vision-pipeline:
    image: surveillance/ai-vision-pipeline:1.0.0
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - PIPELINE_WORKERS=4
      - STREAM_COUNT=8
      - DETECTION_MODEL=/models/yolo11m.engine
      - FACE_MODEL=/models/scrfd_500m.engine
      - RECOGNITION_MODEL=/models/arcface_r100.engine
      - DETECTION_SENSITIVITY=balanced
      - FACE_MATCH_STRICTNESS=balanced
    volumes:
      - ./models:/models:ro
      - ./evidence:/evidence
      - ./config:/config:ro
    ports:
      - "8080:8080"        # REST API
      - "8081:8081"        # WebSocket events
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    depends_on:
      - redis
      - minio
      - postgres

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: surveillance
      POSTGRES_USER: ai_pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - miniodata:/data
    ports:
      - "9000:9000"
      - "9001:9001"

volumes:
  pgdata:
  miniodata:

10.6 Python Module Structure

ai_vision_pipeline/
├── pyproject.toml                    # Poetry/pip dependencies
├── Dockerfile
├── docker-compose.yml
├── config/
│   ├── pipeline.yaml                 # Main pipeline configuration
│   ├── yolo11m_detection.yaml
│   ├── scrfd_face_detection.yaml
│   ├── arcface_recognition.yaml
│   ├── bytetrack.yaml
│   ├── clustering.yaml
│   └── vibe_presets.yaml
├── models/
│   ├── yolo11m.engine                # TensorRT engine (YOLO11m)
│   ├── scrfd_500m_bnkps.engine       # TensorRT engine (SCRFD)
│   ├── arcface_r100.engine           # TensorRT engine (ArcFace)
│   └── yunet.onnx                    # CPU fallback (YuNet)
├── src/
│   ├── __init__.py
│   ├── main.py                       # Entry point
│   ├── config.py                     # Configuration loader
│   ├── pipeline/
│   │   ├── __init__.py
│   │   ├── orchestrator.py           # MultiStreamPipeline
│   │   ├── stream_reader.py          # RTSP/FFmpeg frame capture
│   │   └── frame_buffer.py           # Ring buffer management
│   ├── detection/
│   │   ├── __init__.py
│   │   ├── yolo_detector.py          # YOLO11m inference wrapper
│   │   └── detector_base.py          # Abstract detector interface
│   ├── face/
│   │   ├── __init__.py
│   │   ├── face_detector.py          # SCRFD inference wrapper
│   │   ├── face_recognizer.py        # ArcFace inference wrapper
│   │   ├── face_aligner.py           # 5-point alignment
│   │   ├── quality_checker.py        # Blur/pose/illumination
│   │   └── embedding_store.py        # Vector DB operations
│   ├── tracking/
│   │   ├── __init__.py
│   │   ├── bytetrack.py              # ByteTrack implementation
│   │   ├── kalman_filter.py          # Kalman filter
│   │   ├── track_manager.py          # Track lifecycle management
│   │   └── matching.py               # IoU / embedding matching
│   ├── clustering/
│   │   ├── __init__.py
│   │   ├── hdbscan_engine.py         # HDBSCAN wrapper
│   │   ├── cluster_manager.py        # Cluster CRUD + merge logic
│   │   └── cluster_profile.py        # Cluster data model
│   ├── evidence/
│   │   ├── __init__.py
│   │   ├── capture_manager.py        # Evidence capture orchestrator
│   │   ├── deduplicator.py           # Deduplication logic
│   │   ├── storage.py                # File system + object storage
│   │   └── metadata.py               # EvidenceRecord dataclass
│   ├── confidence/
│   │   ├── __init__.py
│   │   ├── scorer.py                 # Aggregate confidence computation
│   │   ├── threshold_manager.py      # Dynamic threshold adjustment
│   │   └── vibe_mapper.py            # Vibe settings -> thresholds
│   ├── inference/
│   │   ├── __init__.py
│   │   ├── tensorrt_wrapper.py       # Generic TensorRT inference
│   │   ├── onnx_wrapper.py           # ONNX Runtime inference
│   │   └── batch_processor.py        # Dynamic batching logic
│   ├── api/
│   │   ├── __init__.py
│   │   ├── server.py                 # FastAPI application
│   │   ├── routes/
│   │   │   ├── detection.py          # Detection config API
│   │   │   ├── faces.py              # Face database API
│   │   │   ├── tracks.py             # Track query API
│   │   │   ├── evidence.py           # Evidence retrieval API
│   │   │   └── settings.py           # Vibe settings API
│   │   └── websocket.py              # Real-time event streaming
│   └── utils/
│       ├── __init__.py
│       ├── logger.py                 # Structured logging
│       ├── metrics.py                # Prometheus metrics
│       ├── time_utils.py             # Timestamp handling
│       └── image_utils.py            # Crop, resize, encode
├── tests/
│   ├── unit/
│   ├── integration/
│   └── benchmarks/
└── scripts/
    ├── export_tensorrt.py            # Convert .pt -> .onnx -> .engine
    ├── calibrate_int8.py             # INT8 calibration with custom data
    ├── benchmark_pipeline.py         # End-to-end benchmark
    └── setup_vector_db.py            # Initialize FAISS/Milvus index

10.7 Core Inference Code Architecture

# src/inference/tensorrt_wrapper.py — Generic TensorRT inference engine

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

class TensorRTInference:
    """Generic TensorRT inference wrapper supporting dynamic batch sizes."""

    def __init__(self, engine_path: str, max_batch_size: int = 32):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)

        with open(engine_path, "rb") as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()
        self.max_batch_size = max_batch_size
        self.stream = cuda.Stream()

        # Allocate GPU buffers
        self.inputs = []
        self.outputs = []
        self.bindings = []
        self._allocate_buffers()

    def _allocate_buffers(self):
        """Allocate pinned host and device memory for all I/O bindings."""
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            mode = self.engine.get_tensor_mode(name)
            shape = list(self.engine.get_tensor_shape(name))
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))

            # Dynamic axes are reported as -1; size buffers at max batch
            if shape[0] == -1:
                shape[0] = self.max_batch_size
            size = trt.volume(shape)
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)

            # execute_async_v3 reads addresses registered per tensor name,
            # not the legacy bindings list
            self.context.set_tensor_address(name, int(device_mem))
            self.bindings.append(int(device_mem))

            buf = {"name": name, "host": host_mem,
                   "device": device_mem, "shape": shape, "dtype": dtype}
            if mode == trt.TensorIOMode.INPUT:
                self.inputs.append(buf)
            else:
                self.outputs.append(buf)

    def infer(self, input_batch: np.ndarray) -> list[np.ndarray]:
        """Execute inference on a batched input."""
        batch_size = input_batch.shape[0]

        # Copy input to pinned memory (cast to the engine's input dtype)
        flat = input_batch.astype(self.inputs[0]["dtype"], copy=False).ravel()
        np.copyto(self.inputs[0]["host"][:flat.size], flat)

        # Set dynamic batch size
        input_shape = list(self.inputs[0]["shape"])
        input_shape[0] = batch_size
        self.context.set_input_shape(self.inputs[0]["name"], input_shape)

        # Transfer H2D
        cuda.memcpy_htod_async(self.inputs[0]["device"],
                               self.inputs[0]["host"], self.stream)

        # Execute
        self.context.execute_async_v3(stream_handle=self.stream.handle)

        # Transfer D2H
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out["host"], out["device"], self.stream)

        self.stream.synchronize()

        # Reshape outputs
        results = []
        for out in self.outputs:
            out_shape = list(out["shape"])
            out_shape[0] = batch_size
            results.append(out["host"][:np.prod(out_shape)].reshape(out_shape))

        return results

    def __del__(self):
        try:
            self.stream.synchronize()
        except Exception:
            pass  # CUDA context may already be torn down at interpreter exit

10.8 Key Dependencies

# pyproject.toml dependencies
[tool.poetry.dependencies]
python = "^3.10"
torch = "^2.2.0"
torchvision = "^0.17.0"            # torchvision 0.17 pairs with torch 2.2
tensorrt = "^10.0.0"
pycuda = "^2024.1"
onnxruntime-gpu = "^1.16.0"
opencv-python = "^4.9.0"
numpy = "^1.26.0"
scipy = "^1.12.0"
scikit-learn = "^1.4.0"
hdbscan = "^0.8.33"
faiss-gpu = "^1.7.4"
pydantic = "^2.6.0"
fastapi = "^0.109.0"
uvicorn = "^0.27.0"
websockets = "^12.0"
aioredis = "^2.0.0"
asyncpg = "^0.29.0"
minio = "^7.2.0"
prometheus-client = "^0.20.0"
structlog = "^24.1.0"
python-multipart = "^0.0.9"
pillow = "^10.2.0"

11. Performance Summary & Benchmarks

11.1 Target System Performance

| Metric | Target | Notes |
|---|---|---|
| Processed FPS per stream | 15-20 FPS | At 960x1080 input |
| Total system throughput | 120-160 FPS aggregate | 8 streams simultaneously |
| End-to-end latency | < 100ms | Frame in -> result out |
| GPU memory | < 10 GB | All 3 TensorRT engines loaded |
| System RAM | < 16 GB | Buffers + clustering + API |
| Storage growth | ~100 GB/month | With selective full-frame storage |
| Concurrent API clients | 50+ | WebSocket event subscribers |

11.2 Accuracy Targets on Surveillance Data

| Task | Metric | Target |
|---|---|---|
| Human Detection | mAP@50 (person) | > 75% |
| Human Detection | Recall@0.5IoU | > 85% |
| Face Detection | AP (medium) | > 85% |
| Face Detection | Min face size | 20x20 px |
| Face Recognition | Rank-1 accuracy (known persons) | > 98% |
| Face Recognition | False acceptance rate | < 0.1% |
| Tracking | MOTA | > 75% |
| Tracking | IDF1 | > 70% |
| Tracking | ID switches / 100 frames | < 2 |
| Clustering | Purity | > 89% |
| Clustering | BCubed F-Measure | > 0.85 |
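Holding the < 0.1% false-acceptance target means deriving the cosine match threshold from impostor scores rather than hard-coding it: pick the threshold as the (1 - FAR) quantile of impostor similarities. A sketch with synthetic score distributions (the normal parameters below are illustrative, not measured on surveillance data):

```python
import numpy as np

def threshold_for_far(impostor_scores: np.ndarray, far: float) -> float:
    """Cosine threshold at which at most `far` of impostor pairs match."""
    return float(np.quantile(impostor_scores, 1.0 - far))

rng = np.random.default_rng(1)
# Synthetic similarity distributions, NOT measured data:
impostors = rng.normal(0.10, 0.08, 100_000)   # non-matching pairs
genuines = rng.normal(0.75, 0.10, 10_000)     # matching pairs

thr = threshold_for_far(impostors, far=0.001)      # FAR < 0.1% target
accept_rate = float((genuines >= thr).mean())      # TAR at that FAR
```

Recomputing the threshold from impostor pairs sampled on the target cameras keeps the FAR guarantee valid as lighting and demographics shift.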

11.3 Failure Modes & Mitigations

| Failure Mode | Detection | Mitigation |
|---|---|---|
| GPU memory exhaustion | Monitor nvidia-smi | Reduce batch size, enable model streaming |
| Frame drop in decode | Monitor FFmpeg buffer | Increase ring buffer, enable HW decode |
| High false positive rate | Track review queue | Auto-increase detection threshold |
| Track fragmentation | Monitor ID switches | Tune ByteTrack track_buffer parameter |
| Cluster contamination | Monitor cluster purity | Lower DBSCAN eps, enable merge review |
| Vector DB latency growth | Query latency histogram | Switch from IndexFlat to IndexIVF |
| Disk space exhaustion | Storage capacity alert | Auto-archive evidence > 90 days |
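The "auto-increase detection threshold" mitigation can be sketched as a bounded feedback controller: raise the confidence threshold while the reviewed false-positive rate exceeds target, relax it when the rate is comfortably low. All constants below are illustrative, and `ThresholdAdjuster` is a hypothetical name, not the `threshold_manager` module's actual API:

```python
class ThresholdAdjuster:
    """Bounded feedback on the detection confidence threshold, driven by
    the false-positive rate observed in the track review queue."""

    def __init__(self, base=0.35, target_fp_rate=0.05, step=0.02, max_thr=0.60):
        self.base = base                  # matches the configured 0.35 default
        self.threshold = base
        self.target = target_fp_rate
        self.step = step
        self.max_thr = max_thr            # never suppress recall entirely

    def update(self, fp_rate: float) -> float:
        if fp_rate > self.target:
            self.threshold = min(self.threshold + self.step, self.max_thr)
        elif fp_rate < self.target / 2:
            self.threshold = max(self.threshold - self.step, self.base)
        return self.threshold
```

Clamping to `[base, max_thr]` keeps the controller from oscillating into a regime where recall targets (Section 11.2) can no longer be met.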

12. Appendix A: Model Export Commands

# 1. Export YOLO11m to TensorRT
python -c "
from ultralytics import YOLO
model = YOLO('yolo11m.pt')
model.export(format='onnx', imgsz=640, opset=17, dynamic=True, simplify=True)
"
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolo11m.onnx \
  --saveEngine=yolo11m.engine \
  --fp16 \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:8x3x640x640 \
  --maxShapes=images:16x3x640x640

# 2. Export SCRFD-500M to TensorRT (via ONNX)
python scripts/export_scrfd_onnx.py \
  --config configs/scrfd_500m_bnkps.py \
  --checkpoint scrfd_500m_bnkps.pth \
  --input-img test.jpg \
  --shape 640 640 \
  --show
/usr/src/tensorrt/bin/trtexec \
  --onnx=scrfd_500m.onnx \
  --saveEngine=scrfd_500m.engine \
  --fp16

# 3. Export ArcFace R100 to TensorRT
# InsightFace distributes its recognition models as ONNX in the model zoo;
# use the R100 ONNX weights directly rather than re-exporting from PyTorch.
/usr/src/tensorrt/bin/trtexec \
  --onnx=arcface_r100.onnx \
  --saveEngine=arcface_r100.engine \
  --fp16 \
  --minShapes=input.1:1x3x112x112 \
  --optShapes=input.1:32x3x112x112 \
  --maxShapes=input.1:64x3x112x112

13. Appendix B: INT8 Calibration

# scripts/calibrate_int8.py
import tensorrt as trt
from src.inference.calibrator import SurveillanceCalibrator

calibrator = SurveillanceCalibrator(
    calibration_data_dir="/data/calibration/surveillance_500frames",
    cache_file="yolo11m_calibration.cache",
    input_shape=(8, 3, 640, 640),
    max_batches=100
)

config = {
    "onnx_file": "yolo11m.onnx",
    "engine_file": "yolo11m_int8.engine",
    "precision": "int8",
    "calibrator": calibrator,
    "max_batch_size": 16,
    "workspace_mb": 4096,
}
# `config` feeds the TensorRT engine-build step (builder helper not shown).
# INT8 engine provides ~3.5x speedup with <0.5% mAP drop.
# Requires 500+ representative frames from the target cameras.

14. Appendix C: Performance Benchmark Script

# scripts/benchmark_pipeline.py
import time
import statistics
import numpy as np
from datetime import datetime
from src.pipeline.orchestrator import MultiStreamPipeline

BENCHMARK_DURATION = 300  # seconds (5 minutes)
WARMUP_FRAMES = 60
dummy_frame = np.zeros((1080, 960, 3), dtype=np.uint8)  # blank 960x1080 frame

def benchmark():
    pipeline = MultiStreamPipeline.from_config("config/pipeline.yaml")

    # Warmup
    for _ in range(WARMUP_FRAMES):
        pipeline.process_frame("CAM_01", dummy_frame, datetime.now())

    # Benchmark
    latencies = []
    start = time.monotonic()
    while time.monotonic() - start < BENCHMARK_DURATION:
        t0 = time.perf_counter()
        pipeline.process_frame("CAM_01", dummy_frame, datetime.now())
        latencies.append((time.perf_counter() - t0) * 1000)  # ms

    print(f"Mean latency: {statistics.mean(latencies):.1f}ms")
    latencies.sort()
    print(f"P50 latency: {statistics.median(latencies):.1f}ms")
    print(f"P95 latency: {latencies[int(len(latencies) * 0.95)]:.1f}ms")
    print(f"P99 latency: {latencies[int(len(latencies) * 0.99)]:.1f}ms")
    print(f"Throughput: {len(latencies) / BENCHMARK_DURATION:.1f} FPS")

if __name__ == "__main__":
    benchmark()

Document Version: 1.0.0 | Generated for CP PLUS 8-Channel DVR Surveillance Platform All model specifications and benchmarks reflect publicly available data as of July 2025