AI Vision Pipeline

Detection, tracking, face recognition, and inference details.

Industrial Surveillance AI Vision Pipeline — Complete Technical Design

Document Information

Property Value
Version 1.0.0
Date 2025-07-24
DVR CP PLUS 8-Channel, 960x1080 per channel
Environment Indoor / Industrial Mixed
Target Streams 8 simultaneous RTSP feeds
Edge Compute Moderate (NVIDIA Jetson or x86 + T4 class GPU)
Cloud Compute GPU-backed (V100 / A10 / T4)

1. Human Detection Module

1.1 Model Selection: YOLO11-Medium (YOLO11m)

Primary Choice: YOLO11m (Ultralytics)

Rationale: YOLO11 strikes the optimal balance between accuracy and inference speed for industrial surveillance. The medium variant provides sufficient capacity to detect partially occluded humans at mid-range distances while maintaining real-time throughput across 8 streams.

Attribute Specification
Model YOLO11m (Ultralytics release 8.3.x)
Backbone C3k2 bottleneck + C2PSA attention module
Neck PANet with compact feature aggregation
Head Anchor-free decoupled detection head
Parameters 20.1 M
Input Resolution 640 x 640 (letterboxed from 960x1080)
mAP@50-95 (COCO) 51.5%
Person Class AP ~75-80% (estimated, COCO "person")

Alternative for GPU-constrained edge: YOLO11s (9.4M params, 47.0% mAP, 2.5ms T4 TensorRT)
Alternative for maximum accuracy: RT-DETR-L (53.4% mAP, 6.8ms T4 TensorRT, transformer-based)

1.2 Inference Configuration

# yolo11m_detection.yaml
model:
  weights: "yolo11m.pt"
  class_filter: ["person"]          # Only detect human class (COCO idx 0)
  confidence_threshold: 0.35         # Balanced sensitivity
  iou_threshold: 0.45                # NMS IoU threshold
  max_detections: 50                 # Max persons per frame
  imgsz: 640                         # Square input
  half: true                         # FP16 inference

dataloader:
  batch_size: 8                      # One frame per stream
  workers: 4
  pin_memory: true

1.3 Quantization & Optimization Strategy

Optimization Stage Target Expected Speedup Accuracy Impact
PyTorch FP32 Baseline 1.0x Baseline (51.5% mAP)
ONNX Export Interop 1.1x Negligible
TensorRT FP16 Production GPU 2.8x (~4.7ms T4) -0.1% mAP
TensorRT INT8 Maximum throughput 3.5x (~3.8ms T4) -0.3% to -0.5% mAP
INT8 + DLA Jetson Orin DLA 4.0x -0.5% mAP

Recommended production path: TensorRT FP16 on GPU (best accuracy/speed tradeoff). INT8 for edge gateway with calibration dataset of 500+ representative surveillance frames.

1.4 Frame Preprocessing

# Input: 960 x 1080 per-channel frames from the DVR (960H-class resolution)
# Step 1: Resize to 640x640 with letterboxing (maintain aspect ratio)
# Step 2: Normalize: divide by 255.0, mean=[0.0,0.0,0.0], std=[1.0,1.0,1.0]
# Step 3: HWC -> CHW format
# Step 4: Batch: [8, 3, 640, 640]

# Expected output: person bounding boxes [x1, y1, x2, y2, conf, class_id]
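The four preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration, not the Ultralytics implementation: the function names are made up, and a production build would use cv2.resize instead of the nearest-neighbor resize used here to keep the example self-contained.

```python
import numpy as np

def letterbox(frame: np.ndarray, size: int = 640) -> np.ndarray:
    """Resize with preserved aspect ratio and pad to a square canvas
    (nearest-neighbor resize for brevity; use cv2.resize in production)."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = frame[ys][:, xs]
    canvas = np.full((size, size, 3), 114, dtype=frame.dtype)  # gray padding
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas

def preprocess_batch(frames: list) -> np.ndarray:
    """960x1080 BGR frames -> [N, 3, 640, 640] float32 batch in [0, 1]."""
    batch = np.stack([letterbox(f) for f in frames])   # [N, 640, 640, 3]
    batch = batch.astype(np.float32) / 255.0           # Step 2: normalize
    return batch.transpose(0, 3, 1, 2)                 # Step 3: HWC -> CHW
```

With 8 streams, `preprocess_batch` produces the `[8, 3, 640, 640]` tensor from Step 4.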

1.5 Performance Targets

Metric Target Notes
Latency (single frame) < 5ms @ T4 TensorRT FP16 YOLO11m at 640
Throughput (8 streams) > 160 FPS aggregate Batch=8, processing 20 FPS per stream
Person AP@50 > 75% On surveillance test set
Small person detection > 60% AP For persons > 30px tall
Occlusion handling > 50% AP Partial visibility (occlusion level 1-2)

2. Face Detection Module

2.1 Model Selection: SCRFD-500MF (640 input, GPU)

Primary Choice: SCRFD_500M_BNKPS (InsightFace Model Zoo)

SCRFD (Sample and Computation Redistribution for Efficient Face Detection) achieves the best speed-accuracy tradeoff for face detection on GPU. The 500MF variant is optimized for 640px inputs and provides 5-point facial keypoints (eyes, nose, mouth corners) critical for face alignment prior to recognition.

Attribute Specification
Model SCRFD_500M_BNKPS (ONNX)
Source deepinsight/insightface model zoo
Input Resolution 640 x 640
FLOPs 500 MFLOPs
Parameters ~1.5 M
WIDERFACE AP (Easy) 0.906
WIDERFACE AP (Medium) 0.870
WIDERFACE AP (Hard) 0.720
Keypoint Output 5 facial landmarks (eyes x2, nose, mouth corners x2)

Alternative (CPU/Edge): YuNet (OpenCV DNN, ~1ms CPU, AP_Easy 0.884, AP_Medium 0.866)
Alternative (Maximum accuracy): RetinaFace-R50 (higher AP but 5-8x slower)

2.2 Inference Configuration

# scrfd_face_detection.yaml
model:
  onnx_file: "scrfd_500m_bnkps.onnx"
  input_size: 640
  confidence_threshold: 0.45          # Face detection minimum confidence
  nms_threshold: 0.4
  top_k: 100                           # Max faces per frame
  min_face_size: 20                    # Minimum face pixel height (20px)
  scale_factor: [8, 16, 32]           # Feature pyramid strides

# Face quality scoring
quality:
  blur_threshold: 50.0                 # Laplacian variance threshold
  pose_max_yaw: 45.0                   # Degrees - reject profile faces
  pose_max_pitch: 30.0                 # Degrees
  min_face_width: 20                   # Pixels - ignore tiny faces
  max_face_width: 300                  # Pixels - ignore giant close-ups

2.3 Face Quality Assessment

Each detected face is scored on multiple dimensions before proceeding to recognition:

Quality Metric Method Threshold Rejection Rate
Sharpness/Blur Laplacian variance Var > 50 ~15% of detections
Face Size Bounding box height > 20px, < 300px ~10% of detections
Head Pose 5-point landmark geometry Yaw < 45, Pitch < 30 ~20% of detections
Face Confidence SCRFD detection score > 0.45 ~5% of detections
Illumination Mean face ROI intensity 40 < mean < 240 ~5% of detections

Only faces passing all quality gates proceed to the recognition module.
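The gates in the table can be sketched as a single pass/fail function. Thresholds are taken from the table and from the config in 2.2; `laplacian_variance` is a NumPy stand-in for `cv2.Laplacian(img, cv2.CV_64F).var()`, and all function names are illustrative.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Blur metric: variance of the 4-neighbor Laplacian response."""
    g = gray.astype(np.float64)
    lap = (-4 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def passes_quality_gates(gray_face: np.ndarray, det_conf: float,
                         yaw: float, pitch: float) -> bool:
    """Apply the gates from the table: size, confidence, pose, blur, illumination."""
    h, w = gray_face.shape
    if not (20 <= w <= 300):                        # face size gate
        return False
    if det_conf <= 0.45:                            # SCRFD confidence gate
        return False
    if abs(yaw) >= 45.0 or abs(pitch) >= 30.0:      # head pose gate
        return False
    if laplacian_variance(gray_face) <= 50.0:       # sharpness gate
        return False
    mean = float(gray_face.mean())
    return 40.0 < mean < 240.0                      # illumination gate
```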

2.4 Face Alignment

Using the 5-point landmarks from SCRFD, each face is aligned to a canonical pose using a similarity transform:

# Alignment target landmarks (112x112 template)
TARGET_LANDMARKS = np.array([
    [38.2946, 51.6963],   # Left eye
    [73.5318, 51.5014],   # Right eye
    [56.0252, 71.7366],   # Nose
    [41.5493, 92.3655],   # Left mouth
    [70.7299, 92.2041],   # Right mouth
], dtype=np.float32)

# Apply similarity transform (scale, rotation, translation)
# Output: aligned 112x112 face crop ready for ArcFace
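The similarity transform itself can be estimated with the standard Umeyama least-squares method (the same approach scikit-image's `SimilarityTransform` implements). A minimal NumPy sketch, with the final warp left to `cv2.warpAffine`:

```python
import numpy as np

def similarity_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares similarity (Umeyama) mapping src landmarks onto dst.
    Returns a 2x3 affine matrix suitable for cv2.warpAffine."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])                       # reflection fix
    R = U @ D @ Vt                              # rotation
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])   # 2x3 matrix

# M = similarity_transform(detected_landmarks, TARGET_LANDMARKS)
# aligned = cv2.warpAffine(frame, M, (112, 112))   # final 112x112 crop
```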

2.5 Performance Targets

Metric Target Notes
Latency (single face) < 3ms @ T4 TensorRT FP16 SCRFD-500M
Latency (batch 32 faces) < 12ms Batch processing
WIDERFACE AP (Hard) > 0.70 Challenging angles, lighting
Min detectable face size 20x20 pixels ~10m distance at 1080p
5-landmark accuracy < 3px NME Normalized mean error

3. Face Recognition Module

3.1 Model Selection: ArcFace R100 (MS1MV3)

Primary Choice: ArcFace R100 (IResNet100, InsightFace)

ArcFace with Additive Angular Margin Loss is the industry-standard face recognition model. The IResNet100 backbone trained on MS1MV3 provides state-of-the-art accuracy with 512-dimensional embeddings.

Attribute Specification
Model ArcFace with IResNet100 (IR-SE100) backbone
Loss Function Additive Angular Margin (ArcFace)
Training Data MS1MV3 (~5.1M images, 93K identities)
Input Size 3 x 112 x 112 (aligned face crop)
Embedding Dimension 512 (float32)
Parameters ~65 M
LFW Accuracy 99.83%
CFP-FP Accuracy 98.27%
AgeDB-30 Accuracy 98.28%
IJB-C (TPR@FPR=1e-4) 96.1%

Alternative (speed-focused): ArcFace R50 (IR-SE50, 25M params, LFW 99.80%, ~2x faster)
Alternative (edge/mobile): MobileFaceNet (4M params, 128-D embedding, LFW 99.28%)

3.2 Embedding Extraction Pipeline

# Pipeline per detected face:
# 1. Crop face from original frame using SCRFD bounding box
# 2. Align using 5-point landmarks -> 112x112 normalized crop
# 3. Normalize pixel values: (pixel - 127.5) / 128.0
# 4. Forward pass through ArcFace R100
# 5. L2-normalize the 512-D embedding vector
# 6. Store embedding + metadata for matching

# Output: 512-D unit vector representing facial identity
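Steps 3 and 5 are simple tensor operations; a minimal NumPy sketch (the function names are illustrative, not part of the InsightFace API):

```python
import numpy as np

def prepare_input(aligned_face: np.ndarray) -> np.ndarray:
    """Step 3: map uint8 pixels to roughly [-1, 1] as expected by ArcFace."""
    x = (aligned_face.astype(np.float32) - 127.5) / 128.0
    return x.transpose(2, 0, 1)[None]           # [1, 3, 112, 112]

def postprocess_embedding(raw: np.ndarray) -> np.ndarray:
    """Step 5: L2-normalize so cosine similarity reduces to a dot product."""
    return raw / np.linalg.norm(raw)
```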

3.3 Similarity Computation & Matching

Parameter Value Description
Similarity Metric Cosine Similarity dot(u, v) / (||u|| * ||v||); reduces to dot(u, v) for L2-normalized embeddings
Embedding Dim 512 Float32 per vector = 2KB storage
Distance Metric 1 - Cosine Similarity Range [0.0, 2.0]
Top-K Query K=5 Return top 5 candidates
Strict Match Threshold 0.58 (cosine) / 0.42 (distance) High confidence ID
Balanced Match Threshold 0.50 (cosine) / 0.50 (distance) Standard confidence
Relaxed Match Threshold 0.42 (cosine) / 0.58 (distance) Maximum recall
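Because the embeddings are L2-normalized, cosine similarity is just a dot product and the thresholds above apply directly to it. A brute-force sketch of the matching step (a stand-in for the FAISS/Milvus query; names are illustrative):

```python
import numpy as np

STRICT, BALANCED, RELAXED = 0.58, 0.50, 0.42   # cosine thresholds from the table

def classify_match(embedding: np.ndarray, gallery: np.ndarray, k: int = 5):
    """Top-K cosine search over a gallery of L2-normalized embeddings.
    Returns (label, best_index, best_score)."""
    sims = gallery @ embedding                  # dot product == cosine here
    top = np.argsort(sims)[::-1][:k]            # top-K candidate indices
    best, score = int(top[0]), float(sims[top[0]])
    if score >= STRICT:
        return "CONFIDENT_MATCH", best, score
    if score >= BALANCED:
        return "PROBABLE_MATCH", best, score
    if score >= RELAXED:
        return "POSSIBLE_MATCH", best, score
    return "UNKNOWN", best, score
```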

3.4 Face Database Structure

# Known Person Database (Milvus/FAISS vector store)
known_persons_db = {
    "person_id": "uuid-string",          # Unique person identifier
    "name": "John Doe",                  # Display name (optional)
    "employee_id": "EMP001",             # External reference
    "embeddings": [                      # Multiple reference embeddings
        {
            "vector": np.array(512),       # L2-normalized embedding
            "source_camera": "CAM_01",
            "timestamp": "2025-07-01T10:00:00Z",
            "face_quality": 0.92,
            "pose_yaw": 5.2,               # Head pose at capture
        }
    ],
    "created_at": "2025-07-01T00:00:00Z",
    "updated_at": "2025-07-20T15:30:00Z",
    "enrollment_count": 3,               # Number of reference photos
}

3.5 Top-K Matching Strategy

def match_face(embedding: np.ndarray, db: VectorStore, k: int = 5) -> MatchResult:
    """
    1. Query vector DB for top-K nearest neighbors (cosine similarity)
    2. Compute similarity scores for all K candidates
    3. Apply threshold-based classification:
       - Highest score >= strict_threshold   -> CONFIDENT_MATCH
       - Highest score >= balanced_threshold -> PROBABLE_MATCH
       - Highest score >= relaxed_threshold  -> POSSIBLE_MATCH
       - All scores < relaxed_threshold       -> UNKNOWN
    4. For CONFIDENT_MATCH: return person_id with confidence
    5. For UNKNOWN: route to clustering module for unknown identity grouping
    """

3.6 Performance Targets

Metric Target Notes
Latency (single face) < 8ms @ T4 TensorRT FP16 ArcFace R100 at 112x112
Latency (batch 32 faces) < 25ms Batch processing
LFW Verification > 99.8% Standard benchmark
CFP-FP (frontal-profile) > 98.0% Pose variation robustness
False Acceptance Rate < 0.1% @ 99% TPR For access control scenarios
Embedding Throughput > 4,000 faces/sec GPU batch inference

4. Person Tracking Module

4.1 Model Selection: ByteTrack

Primary Choice: ByteTrack (Peize Sun et al., ByteDance)

ByteTrack achieves the best accuracy-speed tradeoff for surveillance tracking. Its dual-threshold association mechanism recovers objects from low-confidence detections, dramatically reducing ID switches during occlusions — a critical requirement for industrial environments with shelving, machinery, and partial obstructions.

Attribute Specification
Algorithm ByteTrack (BYTE association)
Motion Model Kalman Filter (constant velocity)
Similarity Metric IoU (first association), IoU (second association)
Detection Threshold (high) 0.6
Detection Threshold (low) 0.1
Track Buffer (lost frames) 30 frames (~1 sec @ 30 FPS)
IoU Match Threshold 0.2 (reject matches below)
FPS (V100) 30 FPS (detection + tracking)
MOTA (MOT17) 80.3%
IDF1 (MOT17) 77.3%
HOTA (MOT17) 63.1%

Alternative (accuracy-focused): BoT-SORT (+1% MOTA, improved MOTP, includes Camera Motion Compensation, ~35 FPS)
Alternative (edge/CPU): OC-SORT (hundreds of FPS on CPU, handles non-linear motion)

4.2 Tracking Pipeline Configuration

# bytetrack_config.yaml
bytetrack:
  track_thresh: 0.6              # High-confidence detection threshold
  track_buffer: 30               # Max frames to keep lost tracks alive
  match_thresh: 0.8              # IoU matching threshold (first stage)
  det_thresh_low: 0.1            # Low-confidence threshold for second association
  iou_thresh_reject: 0.2         # Minimum IoU to accept a match
  min_box_area: 100              # Ignore detections smaller than 10x10 px
  aspect_ratio_thresh: 10.0      # Reject extreme aspect ratios
  mot20: false                   # Standard density mode

4.3 Track ID Management

class TrackManager:
    """Manages track lifecycle across all camera streams."""

    def __init__(self):
        self.next_track_id = 0          # Monotonically increasing
        self.active_tracks = {}         # track_id -> TrackState
        self.lost_tracks = {}           # Recently lost, may recover
        self.archived_tracks = {}       # Finalized trajectories

    def create_track(self, detection, camera_id):
        """Initialize new track from high-confidence detection."""
        track_id = self.next_track_id
        self.next_track_id += 1
        # Initialize Kalman filter state
        # Store: bbox, confidence, camera_id, first_seen, last_seen
        return track_id

    def update_track(self, track_id, detection):
        """Update existing track with matched detection."""
        # Update Kalman filter
        # Update last_seen timestamp
        # Increment hit count

    def mark_lost(self, track_id):
        """Track not matched in current frame."""
        # Increment lost count
        # If lost > track_buffer, archive track

    def get_track_summary(self, track_id) -> dict:
        """Return track metadata: duration, camera span, entry/exit zones."""

4.4 Cross-Camera Track Association

For multi-camera scenarios (8 channels), a secondary association layer links tracks across cameras using:

  1. Temporal proximity — tracks appearing on different cameras within a time window
  2. Appearance features — ArcFace embedding similarity for re-identification
  3. Zone transition rules — predefined camera adjacency graph (CAM_01 -> CAM_02)

def associate_cross_camera(track_cam_a, track_cam_b, max_time_gap=60):
    """
    Associate tracks across cameras using:
    - Time gap between track end (A) and track start (B) < max_time_gap seconds
    - Embedding cosine similarity > 0.65 (relaxed threshold for ReID)
    - Camera adjacency is valid in zone graph
    """

4.5 Performance Targets

Metric Target Notes
MOTA > 75% Multi-object tracking accuracy
IDF1 > 70% Identity preservation across frames
ID Switches < 2 per 100 frames Per camera stream
Fragmentation < 3 per track Track splits per person per session
Track Recovery > 80% within 1 sec Re-acquire after brief occlusion
Latency overhead < 1ms per frame Tracking association cost

5. Unknown Person Clustering Module

5.1 Model Selection: HDBSCAN (with DBSCAN Fallback)

Primary Choice: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)

For unknown face embedding clustering, HDBSCAN outperforms DBSCAN by not requiring a global density parameter (eps) and naturally handling variable-density clusters — critical for surveillance where some individuals appear frequently and others only once.

Attribute Specification
Clustering Algorithm HDBSCAN (primary) + DBSCAN (fallback)
Embedding Input 512-D L2-normalized ArcFace embeddings
Distance Metric Cosine distance (1 - cosine similarity)
Min Cluster Size 3
Min Samples 2
Cluster Selection Method eom (Excess of Mass)
Allow Single Cluster True

5.2 Clustering Pipeline

class UnknownPersonClustering:
    """Clusters unknown person embeddings to identify recurring visitors."""

    def __init__(self):
        self.clusters = {}              # cluster_id -> ClusterProfile
        self.noise_embeddings = []      # Unclustered (single-appearance)
        self.merge_candidates = []      # Pairs flagged for merge review
        self.dbscan_eps = 0.28          # Fallback DBSCAN parameter
        self.dbscan_min_samples = 2

    def add_embedding(self, embedding: np.ndarray, metadata: dict) -> str:
        """
        1. Try HDBSCAN fit_predict on accumulated embeddings
        2. If HDBSCAN fails (all noise), fall back to DBSCAN
        3. Assign embedding to cluster or mark as noise (-1)
        4. If cluster assignment: update cluster centroid and metadata
        5. Check for cluster merge opportunities
        6. Return: cluster_id or "noise"
        """

    def merge_clusters(self, cluster_a: str, cluster_b: str) -> str:
        """
        Merge two clusters that belong to the same person.
        Trigger: centroid distance < 0.25 (cosine distance)
                 OR temporal overlap analysis
                 OR manual operator confirmation
        """

    def get_recurring_unknowns(self, min_appearances: int = 3) -> list:
        """Return unknown persons seen at least N times (potential enrollment candidates)."""

    def compute_cluster_centroid(self, cluster_id: str) -> np.ndarray:
        """L2-normalized mean of all embeddings in cluster."""

5.3 Cluster Data Structure

@dataclass
class ClusterProfile:
    cluster_id: str                     # UUID
    centroid: np.ndarray                # 512-D mean embedding (L2-normalized)
    embeddings: List[np.ndarray]        # All member embeddings
    metadata: List[dict]                # Source info per embedding
    first_seen: datetime
    last_seen: datetime
    appearance_count: int               # Total embeddings in cluster
    camera_span: Set[str]               # Which cameras observed this person
    quality_score: float                # Average face quality (0-1)
    best_face_crop: str                 # Path to highest quality crop
    is_named: bool = False              # Flag when promoted to known person
    person_name: Optional[str] = None   # Assigned name (if promoted)

5.4 Merge Logic & Cluster Maintenance

Trigger Action Threshold
Centroid distance Auto-merge clusters cosine distance < 0.20
Centroid distance Flag for review cosine distance 0.20-0.30
Temporal overlap Prevent merge Same time on different cameras
Cluster size Auto-archive > 100 embeddings, compress to centroid
Age Archive old clusters No activity for 90 days

5.5 Three-Tier Identity Classification

┌────────────────────────────────────────────────────────────┐
│                  IDENTITY CLASSIFICATION                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌─────────────────────┐                                   │
│  │   KNOWN PERSON      │ ◄── cosine >= 0.58                │
│  │   (Database Match)  │                                   │
│  └──────────▲──────────┘                                   │
│             │                                              │
│  ┌──────────┴──────────┐                                   │
│  │  UNKNOWN RECURRING  │ ◄── 0.35 <= cosine < 0.58         │
│  │  (Cluster Match)    │                                   │
│  └──────────▲──────────┘                                   │
│             │                                              │
│  ┌──────────┴──────────┐                                   │
│  │   NEW UNKNOWN       │ ◄── cosine < 0.35                 │
│  │   (Noise / New)     │                                   │
│  └─────────────────────┘                                   │
│                                                            │
│  ┌─────────────────────┐                                   │
│  │   REVIEW QUEUE      │ ◄── Low quality / Low confidence  │
│  │  (Operator Review)  │                                   │
│  └─────────────────────┘                                   │
└────────────────────────────────────────────────────────────┘

5.6 Clustering Performance Targets

Metric Target Notes
Cluster Purity > 89% Same person in same cluster (HDBSCAN benchmark)
BCubed F-Measure > 0.85 Precision-recall balanced clustering
Clustering Latency < 100ms Per batch of 50 new embeddings
False Merge Rate < 5% Different people in same cluster
Memory per cluster ~4 KB Centroid + metadata

6. Evidence Capture Module

6.1 Capture Triggers

Evidence is captured (face crop + metadata saved) on the following events:

Event Type Trigger Condition Priority
KNOWN_PERSON_DETECTED Face match confidence >= 0.50 Medium
UNKNOWN_PERSON_DETECTED New cluster formed, 3rd appearance High
REVIEW_NEEDED Low confidence match OR low quality face High
ZONE_VIOLATION Person enters restricted zone Critical
TAILGATING Two persons detected on single credential swipe Critical
AFTER_HOURS Person detected outside authorized hours High
SUSPICIOUS_BEHAVIOR Loitering (>5 min in same area) Medium

6.2 Evidence Record Structure

@dataclass
class EvidenceRecord:
    # Unique identifiers
    evidence_id: str                    # UUID v4
    event_id: str                       # Links to event log
    camera_id: str                      # CAM_01 .. CAM_08
    stream_id: str                      # DVR channel identifier

    # Temporal
    timestamp_utc: datetime
    timestamp_local: datetime
    frame_number: int
    video_segment: str                  # Path to 10-sec video clip

    # Person identity
    identity_type: str                  # "known" | "unknown_recurring" | "unknown_new" | "review"
    person_id: Optional[str]            # Track ID or cluster ID
    person_name: Optional[str]          # Known person name
    match_confidence: float             # Face recognition confidence (0-1)

    # Face crop
    face_crop_path: str                 # /evidence/faces/2025/07/24/{id}.jpg
    face_crop_dimensions: tuple         # (w, h) of crop
    face_quality_score: float           # Combined quality metric
    face_landmarks: np.ndarray          # 5-point landmarks
    head_pose: dict                     # {yaw, pitch, roll}

    # Full frame reference
    full_frame_path: str                # /evidence/frames/2025/07/24/{id}.jpg
    bounding_box: tuple                 # (x1, y1, x2, y2) in original frame

    # AI confidence levels
    detection_confidence: float         # YOLO person detection confidence
    face_detection_confidence: float    # SCRFD face detection confidence
    recognition_confidence: float       # ArcFace match confidence

    # Vibe settings at capture time
    detection_sensitivity: str          # "low" | "balanced" | "high"
    face_match_strictness: str          # "relaxed" | "balanced" | "strict"

    # Review state
    review_status: str                  # "pending" | "reviewed" | "confirmed" | "false_positive"
    reviewed_by: Optional[str]
    review_notes: Optional[str]

6.3 Deduplication Strategy

To avoid storing duplicate evidence of the same person within short time windows:

class EvidenceDeduplicator:
    """Prevents duplicate evidence capture using time-based gating."""

    DEDUP_WINDOW_KNOWN = 300        # 5 minutes between captures of same known person
    DEDUP_WINDOW_UNKNOWN = 60       # 1 minute between captures of same unknown person
    DEDUP_WINDOW_EVENT = 10         # 10 seconds between same event type

    def should_capture(self, person_id: str, event_type: str,
                       camera_id: str, timestamp: datetime) -> bool:
        """
        1. Check last capture time for this person_id + camera_id
        2. If within dedup window: skip capture, increment visit counter
        3. If outside window: allow capture, update last capture time
        4. Special: always capture if event_type is CRITICAL priority
        """

6.4 Storage Layout

/evidence/
  faces/
    2025/07/24/
      {evidence_id}_{camera_id}_{person_id}_face.jpg      # 112x112 aligned crop
      {evidence_id}_{camera_id}_{person_id}_full.jpg       # Full bounding box crop
  frames/
    2025/07/24/
      {evidence_id}_{camera_id}_frame.jpg                  # Full frame with annotation overlay
  video_clips/
    2025/07/24/
      {evidence_id}_{camera_id}_{timestamp}.mp4            # 10-second H.264 clip
  metadata/
    2025/07/24/
      {evidence_id}.json                                   # Full EvidenceRecord as JSON

6.5 Storage Requirements Estimate

Content Type Size Each Daily (8 cams) Monthly
Face crop (112x112 JPEG) ~8 KB ~50 MB ~1.5 GB
Full crop (200x300 JPEG) ~25 KB ~150 MB ~4.5 GB
Frame snapshot (960x1080 JPEG) ~150 KB ~900 MB ~27 GB
10-sec video clip (H.264) ~500 KB ~3 GB ~90 GB
Metadata JSON ~2 KB ~12 MB ~360 MB
Total (all media) ~4.1 GB ~123 GB

Recommended: Store face crops + metadata for all events. Full frames and video clips only for priority events (review_needed, zone_violation, after_hours).


7. Confidence Handling & Thresholds

7.1 Confidence Level Definitions

Level Aggregate Score Color Action
HIGH >= 0.75 Green Auto-process, no review needed
MEDIUM 0.50 - 0.75 Yellow Process with confidence label, flag for spot-check
LOW 0.35 - 0.50 Orange Capture evidence, mark for review
REVIEW_NEEDED < 0.35 Red Always queue for operator review
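This mapping is a straightforward cascade of threshold checks:

```python
def confidence_level(score: float) -> str:
    """Map an aggregate confidence score to the levels in the table above."""
    if score >= 0.75:
        return "HIGH"
    if score >= 0.50:
        return "MEDIUM"
    if score >= 0.35:
        return "LOW"
    return "REVIEW_NEEDED"
```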

7.2 Aggregate Confidence Score

The aggregate confidence is computed as a weighted combination:

def compute_aggregate_confidence(det_conf: float, face_conf: float,
                                  match_conf: float, quality_score: float) -> float:
    """
    Aggregate = 0.25 * det_conf + 0.20 * face_conf + 0.35 * match_conf + 0.20 * quality_score

    Where:
    - det_conf:      YOLO person detection confidence (0-1)
    - face_conf:     SCRFD face detection confidence (0-1)
    - match_conf:    ArcFace recognition match confidence (0-1), 0.0 for unknowns
    - quality_score: Face quality composite score (0-1)
    """
    return (0.25 * det_conf + 0.20 * face_conf
            + 0.35 * match_conf + 0.20 * quality_score)

7.3 AI Vibe Settings Mapping

The system exposes three "vibe" settings that internally map to threshold configurations:

Detection Sensitivity (applies to YOLO + SCRFD):

Setting YOLO Conf Threshold SCRFD Conf Threshold Effect
Low 0.50 0.55 Fewer detections, lower false positive rate
Balanced 0.35 0.45 Standard detection rate
High 0.20 0.35 Maximum detection, higher false positive rate

Face Match Strictness (applies to ArcFace matching):

Setting Strict Threshold Balanced Threshold Relaxed Threshold Effect
Relaxed 0.50 0.42 0.35 High recall, more false matches
Balanced 0.58 0.50 0.42 Balanced precision-recall
Strict 0.65 0.58 0.50 High precision, stricter matching
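Both tables can be expressed as lookup dictionaries that a vibe preset resolves against at runtime (the dictionary and key names here are illustrative):

```python
# Threshold tables from this section, as lookup dictionaries
DETECTION_SENSITIVITY = {
    "low":      {"yolo_conf": 0.50, "scrfd_conf": 0.55},
    "balanced": {"yolo_conf": 0.35, "scrfd_conf": 0.45},
    "high":     {"yolo_conf": 0.20, "scrfd_conf": 0.35},
}

FACE_MATCH_STRICTNESS = {
    "relaxed":  {"strict": 0.50, "balanced": 0.42, "relaxed": 0.35},
    "balanced": {"strict": 0.58, "balanced": 0.50, "relaxed": 0.42},
    "strict":   {"strict": 0.65, "balanced": 0.58, "relaxed": 0.50},
}

def resolve_vibe(detection_sensitivity: str, face_match_strictness: str) -> dict:
    """Resolve a vibe preset into its concrete runtime thresholds."""
    return {**DETECTION_SENSITIVITY[detection_sensitivity],
            **FACE_MATCH_STRICTNESS[face_match_strictness]}
```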

7.4 Vibe Configuration Matrix

# vibe_presets.yaml
vibe_presets:
  access_control:                    # High security area
    detection_sensitivity: "balanced"
    face_match_strictness: "strict"

  general_surveillance:              # Standard monitoring
    detection_sensitivity: "balanced"
    face_match_strictness: "balanced"

  perimeter_monitoring:              # Catching all activity
    detection_sensitivity: "high"
    face_match_strictness: "relaxed"

  after_hours:                       # Night mode
    detection_sensitivity: "high"
    face_match_strictness: "balanced"

  privacy_mode:                      # Minimal detection
    detection_sensitivity: "low"
    face_match_strictness: "strict"

7.5 Threshold Auto-Tuning Strategy

class ThresholdTuner:
    """Periodically adjusts thresholds based on operational feedback."""

    def analyze_feedback(self, review_results: list):
        """
        1. Collect operator review labels on REVIEW_NEEDED items
        2. Track false positive rate and false negative rate
        3. If FP rate > 10%: increase confidence thresholds by 5%
        4. If FN rate > 10%: decrease confidence thresholds by 5%
        5. Only adjust within +/- 15% of baseline values
        6. Log all threshold changes with rationale
        """

    def weekly_report(self) -> dict:
        """Generate confidence distribution and threshold effectiveness report."""

8. Inference Pipeline Architecture

8.1 Per-Stream Processing Pipeline

┌─────────────────────────────────────────────────────────────────┐
│                    PER-STREAM PIPELINE                           │
│                    (Executed per camera frame)                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐       │
│  │  RTSP    │    │   Frame      │    │   Frame Queue    │       │
│  │  Stream  │───▶│   Decode     │───▶│   (ring buffer)  │       │
│  │  (H.264) │    │   (960x1080) │    │   max 30 frames  │       │
│  └──────────┘    └──────────────┘    └──────────────────┘       │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 1: HUMAN DETECTION (YOLO11m TensorRT FP16)          │   │
│  │  Input: 640x640 batch tensor                               │   │
│  │  Output: person bboxes [N x 6] (x1,y1,x2,y2,conf,cls)    │   │
│  │  Latency: ~4.7ms per frame (T4)                           │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 2: FACE DETECTION (SCRFD-500M TensorRT FP16)        │   │
│  │  Input: Cropped person regions from Step 1                 │   │
│  │  Output: face bboxes + 5 landmarks per face                │   │
│  │  Latency: ~2.5ms per face (T4)                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 3: FACE ALIGNMENT & QUALITY CHECK                   │   │
│  │  Input: Face crop + 5 landmarks                            │   │
│  │  Process: Similarity transform -> 112x112 aligned crop     │   │
│  │  Quality: Blur, pose, illumination checks                  │   │
│  │  Latency: ~0.3ms (OpenCV CPU)                             │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 4: FACE RECOGNITION (ArcFace R100 TensorRT FP16)    │   │
│  │  Input: 112x112 aligned face crop (batch)                  │   │
│  │  Output: 512-D L2-normalized embedding                     │   │
│  │  Latency: ~6ms per face (T4, batch=8)                     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 5: IDENTITY MATCHING (FAISS/Milvus vector search)   │   │
│  │  Input: 512-D embedding                                    │   │
│  │  Output: Top-K matches with similarity scores              │   │
│  │  Latency: < 5ms (in-memory, <10K identities)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 6: PERSON TRACKING (ByteTrack)                      │   │
│  │  Input: Person detections + face embeddings               │   │
│  │  Output: Persistent track IDs with identity labels         │   │
│  │  Latency: ~1ms per frame                                  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 7: UNKNOWN CLUSTERING (HDBSCAN)                     │   │
│  │  Input: Embeddings of unmatched faces                     │   │
│  │  Output: Cluster assignments for recurring unknowns        │   │
│  │  Latency: ~50ms (batch update, every 30 sec)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  STEP 8: EVIDENCE CAPTURE & EVENT GENERATION              │   │
│  │  Input: Track results + identity + confidence             │   │
│  │  Output: Evidence records, event log entries, alerts       │   │
│  │  Latency: ~5ms (async I/O)                                │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                               │                  │
│                                               ▼                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  OUTPUT: Structured event stream to central system        │   │
│  │  { track_id, identity, confidence, bbox, timestamp,       │   │
│  │    camera_id, event_type, evidence_refs }                  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
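The event record emitted at the end of the diagram can be sketched as a dataclass. This is illustrative only: the field names follow the diagram's output schema, while `PipelineEvent`, `to_json`, and the example values are assumptions, not the production data model.

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json

@dataclass
class PipelineEvent:
    """One entry in the structured event stream (fields from the diagram)."""
    track_id: int
    identity: str                    # matched label, or "unknown"
    confidence: float                # aggregate match confidence, 0..1
    bbox: tuple                      # (x1, y1, x2, y2) in frame pixels
    timestamp: str                   # ISO-8601 UTC
    camera_id: str
    event_type: str                  # e.g. "person_identified"
    evidence_refs: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

event = PipelineEvent(
    track_id=17, identity="EMP-0042", confidence=0.91,
    bbox=(120, 80, 260, 420),
    timestamp=datetime.now(timezone.utc).isoformat(),
    camera_id="CAM_03", event_type="person_identified",
)
```

A flat JSON-serializable record like this maps directly onto the Kafka/Redis event bus and the WebSocket stream described later.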

8.2 Multi-Stream Orchestration

class MultiStreamPipeline:
    """Orchestrates inference across 8 simultaneous camera streams."""

    def __init__(self, config: PipelineConfig):
        # 4 inference workers (each processes 2 streams)
        self.workers = [InferenceWorker(gpu_id=i % 2) for i in range(4)]

        # Stream assignments: worker -> [stream_ids]
        self.stream_map = {
            0: ["CAM_01", "CAM_02"],
            1: ["CAM_03", "CAM_04"],
            2: ["CAM_05", "CAM_06"],
            3: ["CAM_07", "CAM_08"],
        }

        # Per-stage model wrappers (see src/detection and src/face)
        self.yolo_detector = YOLODetector(config.detection)
        self.face_detector = FaceDetector(config.face)
        self.face_recognizer = FaceRecognizer(config.recognition)
        self.quality_checker = QualityChecker(config.quality)

        # Shared components (thread-safe)
        self.tracker_pool = {cam: ByteTrack(config.track) for cam in ALL_CAMERAS}
        self.face_db = VectorDatabase(config.db)          # Milvus/FAISS
        self.clustering = UnknownPersonClustering(config.cluster)
        self.evidence = EvidenceCaptureManager(config.evidence)

    def process_frame(self, camera_id: str, frame: np.ndarray, timestamp: datetime):
        """Process a single frame through the complete pipeline."""
        # STEP 1: Human Detection
        person_dets = self.yolo_detector.detect(frame)

        # STEP 2: Face Detection (within person regions)
        face_dets = []
        for det in person_dets:
            person_crop = crop_region(frame, det.bbox)
            faces = self.face_detector.detect(person_crop)
            for face in faces:
                # Shift boxes/landmarks from crop-local to frame coordinates
                face.translate(det.bbox[0], det.bbox[1])
            face_dets.extend(faces)

        # STEP 3: Face Alignment + Quality
        aligned_faces = []
        for face in face_dets:
            aligned = align_face(frame, face.landmarks)
            quality = self.quality_checker.score(aligned)
            if quality.passed:
                aligned_faces.append((aligned, quality.score, face))

        # STEP 4: Face Recognition (batch)
        if aligned_faces:
            embeddings = self.face_recognizer.embed(
                [f[0] for f in aligned_faces]
            )

            # STEP 5: Identity Matching
            for emb, (aligned, quality, face) in zip(embeddings, aligned_faces):
                matches = self.face_db.search(emb, top_k=5)
                identity = self.classify_identity(emb, matches)
                face.identity = identity

        # STEP 6: Person Tracking + face-to-track identity association
        tracks = self.tracker_pool[camera_id].update(person_dets)
        self.associate_faces_with_tracks(tracks, face_dets)

        # STEP 7: Unknown clustering (periodic batch)
        self.clustering.update_periodic()

        # STEP 8: Evidence capture
        self.evidence.capture_events(tracks, camera_id, timestamp)

        return tracks

8.3 Batch Processing Strategy

For GPU efficiency, frames are processed in batched groups:

| Batch Type | Batch Size | Frequency | GPU Utilization |
|---|---|---|---|
| Human Detection | 8 frames | Every frame decode | ~85% |
| Face Detection | Variable (up to 32 faces) | Per 2 frames | ~60% |
| Face Recognition | Up to 32 faces | Per 2 frames | ~75% |
| Tracking | Per stream | Every frame | CPU-bound |
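The batching policy above can be sketched as a collector that releases a batch once it is full or once the oldest queued frame exceeds a deadline. This is a minimal sketch; `BatchCollector` and the `max_batch` / `max_wait_ms` values are illustrative, not the production implementation.

```python
import time
from collections import deque

class BatchCollector:
    """Groups per-stream frames into one GPU batch, releasing it when it
    is full or a deadline passes. max_batch / max_wait_ms are illustrative."""

    def __init__(self, max_batch: int = 8, max_wait_ms: float = 10.0):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue = deque()              # (camera_id, frame) tuples
        self.first_enqueue = None         # arrival time of oldest queued frame

    def add(self, camera_id, frame):
        if not self.queue:
            self.first_enqueue = time.monotonic()
        self.queue.append((camera_id, frame))

    def ready(self) -> bool:
        if not self.queue:
            return False
        full = len(self.queue) >= self.max_batch
        stale = time.monotonic() - self.first_enqueue >= self.max_wait_s
        return full or stale

    def drain(self):
        n = min(self.max_batch, len(self.queue))
        batch = [self.queue.popleft() for _ in range(n)]
        self.first_enqueue = time.monotonic() if self.queue else None
        return batch
```

The deadline keeps worst-case added latency bounded (here ~10 ms) even when only a few streams have new frames.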

8.4 GPU Utilization Strategy

GPU 0 (Primary - T4 / A10):
  ├─ Streams CAM_01-CAM_04: YOLO11m detection
  ├─ Streams CAM_01-CAM_04: SCRFD face detection
  ├─ Streams CAM_01-CAM_04: ArcFace R100 recognition
  └─ TensorRT Context 0: All models (shared)

GPU 1 (Optional - V100 / A100 for scale):
  ├─ Streams CAM_05-CAM_08: Same pipeline
  └─ TensorRT Context 1: Dedicated context

CPU (x86_64):
  ├─ Stream decode (FFmpeg, 8 threads)
  ├─ ByteTrack association (all streams)
  ├─ Face alignment + quality (OpenCV)
  ├─ HDBSCAN clustering (background thread)
  ├─ Evidence I/O (async thread pool)
  └─ API server (FastAPI, 4 workers)

8.5 Performance Budget (Per 8-Stream System)

| Pipeline Stage | Per-Frame Cost | 8-Stream Aggregate | GPU % |
|---|---|---|---|
| Frame decode | ~2ms | 16ms (parallel) | |
| YOLO11m detection | ~4.7ms | ~37.6ms (batched) | 35% |
| SCRFD face detection | ~2.5ms avg | ~20ms (batched) | 20% |
| Face alignment + quality | ~0.3ms | ~2.4ms (CPU) | |
| ArcFace R100 recognition | ~6ms avg | ~48ms (batched) | 45% |
| ByteTrack tracking | ~1ms | ~8ms (CPU) | |
| Vector search | ~1ms | ~8ms (CPU) | |
| Evidence capture | ~2ms | ~16ms (async I/O) | |
| **Total effective** | ~30-35ms end-to-end | | |
| **Effective throughput** | ~28 FPS per stream | | 100% |

Target: 15-20 FPS processed per stream at 960x1080 with batching optimizations. This sits deliberately below the ~28 FPS theoretical ceiling above, leaving headroom for decode jitter and detection bursts.


9. Model Selection Summary Table

| Component | Model Choice | Framework | Input Size | FPS Target (T4) | Accuracy Metric |
|---|---|---|---|---|---|
| Human Detection | YOLO11m (Ultralytics) | TensorRT FP16 | 640 x 640 | 213 FPS (batch=8) | 51.5% mAP@50-95 COCO; ~78% person AP |
| Face Detection | SCRFD-500M-BNKPS (InsightFace) | TensorRT FP16 | 640 x 640 | ~400 FPS (batch=32) | 90.6% AP-Easy, 87.0% AP-Med, 72.0% AP-Hard (WIDERFACE) |
| Face Recognition | ArcFace R100 IR-SE100 (InsightFace, MS1MV3) | TensorRT FP16 | 112 x 112 | ~170 FPS (batch=32) | 99.83% LFW, 98.27% CFP-FP, 96.1% IJB-C@1e-4 |
| Person Tracking | ByteTrack (BYTE association, Kalman filter) | NumPy/OpenCV | Per stream | >500 FPS (association only) | 80.3% MOTA, 77.3% IDF1, 63.1% HOTA (MOT17) |
| Unknown Clustering | HDBSCAN (hdbscan library) + DBSCAN fallback | scikit-learn/hdbscan | 512-D embeddings | <100ms per batch | 89.5% cluster purity, BCubed F > 0.85 |
| Vector Search | FAISS (IndexFlatIP) or Milvus | FAISS/Milvus | 512-D vectors | <5ms per query | Exact nearest neighbor (cosine) |
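The vector-search row is exact inner-product search; on L2-normalized embeddings the inner product equals cosine similarity. A NumPy sketch of what `IndexFlatIP` computes (FAISS accelerates this same scan; the random gallery and the 0.01 noise level are synthetic illustration, not enrollment data):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cosine_top_k(gallery: np.ndarray, query: np.ndarray, k: int = 5):
    """Exact top-k cosine search, i.e. what IndexFlatIP computes on
    L2-normalized embeddings."""
    sims = (normalize(gallery) @ normalize(query.reshape(1, -1)).T).ravel()
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 512)).astype(np.float32)   # enrolled identities
query = gallery[42] + 0.01 * rng.normal(size=512).astype(np.float32)
idx, sims = cosine_top_k(gallery, query, k=5)
```

Because the scan is exact, accuracy is identical to brute force; the <5ms query budget holds for galleries under ~10K identities, after which an IVF index becomes worthwhile.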

10. Technology Stack

10.1 Deep Learning Framework

| Layer | Technology | Version | Purpose |
|---|---|---|---|
| Training | PyTorch | 2.2+ | Model fine-tuning, research |
| Export | ONNX | 1.15+ | Model portability |
| GPU Inference | TensorRT | 8.6+ / 10.0+ | Production inference optimization |
| CPU Inference | ONNX Runtime | 1.16+ | CPU fallback for edge |
| CPU (Intel) | OpenVINO | 2024.0+ | Intel-optimized inference |

10.2 Model Serving Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Docker Container: ai-vision-pipeline                      │  │
│  │  Base: nvidia/cuda:12.1-runtime-ubuntu22.04               │  │
│  │                                                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │  │
│  │  │  TensorRT   │  │  OpenCV     │  │  FastAPI        │   │  │
│  │  │  Engine     │  │  4.9+       │  │  Server         │   │  │
│  │  │  (TRT 10)   │  │  (CUDA)     │  │  (uvicorn)      │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────┘   │  │
│  │                                                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │  │
│  │  │  FAISS      │  │  hdbscan    │  │  Kafka / Redis  │   │  │
│  │  │  (vectors)  │  │  (cluster)  │  │  (event bus)    │   │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────┘   │  │
│  │                                                             │  │
│  │  ┌──────────────────────────────────────────────────────┐  │  │
│  │  │  Pipeline Orchestrator (Python asyncio)              │  │  │
│  │  │  - Stream reader threads (8x FFmpeg)                 │  │  │
│  │  │  - GPU inference queue                                 │  │  │
│  │  │  - CPU post-processing workers                         │  │  │
│  │  │  - Evidence async writer                               │  │  │
│  │  └──────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Docker Container: ai-vision-api                           │  │
│  │  - REST API for configuration                              │  │
│  │  - WebSocket for real-time events                          │  │
│  │  - Database: PostgreSQL + pgvector                         │  │
│  │  - Object storage: MinIO (evidence media)                  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

10.3 GPU Requirements

| Deployment Mode | Minimum GPU | Recommended GPU | Notes |
|---|---|---|---|
| Edge Gateway | NVIDIA Jetson Orin Nano 8GB | Jetson Orin NX 16GB | INT8 quantization, 5-8 FPS per stream |
| Edge Server | NVIDIA T4 16GB | NVIDIA A10 24GB | FP16, full 8-stream real-time |
| Cloud Processing | NVIDIA T4 16GB | NVIDIA V100 32GB | FP16, 8+ streams, batching |
| Development | NVIDIA RTX 3080 10GB | NVIDIA RTX 4090 24GB | Full pipeline debugging |

10.4 CPU Fallback Options

When GPU is unavailable, the pipeline falls back to CPU-optimized models:

| Component | GPU Model | CPU Fallback | CPU Latency |
|---|---|---|---|
| Human Detection | YOLO11m TensorRT | YOLO11n ONNX + OpenVINO | ~56ms/frame |
| Face Detection | SCRFD TensorRT | YuNet OpenCV DNN | ~3ms/frame |
| Face Recognition | ArcFace R100 TensorRT | ArcFace MobileFaceNet ONNX | ~15ms/face |
| Tracking | ByteTrack (CPU) | ByteTrack (CPU) | ~2ms/frame |

Note: CPU fallback processes at ~5-8 FPS per stream. For full 8-stream real-time, GPU acceleration is required.
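The fallback chain can be expressed as a provider preference list handed to `onnxruntime.InferenceSession`. The helper below is a sketch (the `select_providers` name and policy order are assumptions; the provider strings and `get_available_providers` are standard ONNX Runtime API):

```python
def select_providers(available: list[str]) -> list[str]:
    """Build an ONNX Runtime provider preference list: CUDA first when
    present, then OpenVINO on Intel hosts, with the default CPU provider
    as the guaranteed last resort."""
    preference = [
        "CUDAExecutionProvider",        # GPU path (onnxruntime-gpu)
        "OpenVINOExecutionProvider",    # Intel-optimized CPU path
        "CPUExecutionProvider",         # always available
    ]
    chosen = [p for p in preference if p in available]
    return chosen or ["CPUExecutionProvider"]

# In production:
#   import onnxruntime as ort
#   session = ort.InferenceSession(model_path,
#       providers=select_providers(ort.get_available_providers()))
```

ONNX Runtime itself falls through the list in order, so the same session code serves both GPU and CPU deployments.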

10.5 Docker Compose Configuration

# docker-compose.yml
version: '3.8'

services:
  ai-vision-pipeline:
    image: surveillance/ai-vision-pipeline:1.0.0
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - PIPELINE_WORKERS=4
      - STREAM_COUNT=8
      - DETECTION_MODEL=/models/yolo11m.engine
      - FACE_MODEL=/models/scrfd_500m.engine
      - RECOGNITION_MODEL=/models/arcface_r100.engine
      - DETECTION_SENSITIVITY=balanced
      - FACE_MATCH_STRICTNESS=balanced
    volumes:
      - ./models:/models:ro
      - ./evidence:/evidence
      - ./config:/config:ro
    ports:
      - "8080:8080"        # REST API
      - "8081:8081"        # WebSocket events
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    depends_on:
      - redis
      - minio
      - postgres

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: surveillance
      POSTGRES_USER: ai_pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
    volumes:
      - miniodata:/data
    ports:
      - "9000:9000"
      - "9001:9001"

volumes:
  pgdata:
  miniodata:

10.6 Python Module Structure

ai_vision_pipeline/
├── pyproject.toml                    # Poetry/pip dependencies
├── Dockerfile
├── docker-compose.yml
├── config/
│   ├── pipeline.yaml                 # Main pipeline configuration
│   ├── yolo11m_detection.yaml
│   ├── scrfd_face_detection.yaml
│   ├── arcface_recognition.yaml
│   ├── bytetrack.yaml
│   ├── clustering.yaml
│   └── vibe_presets.yaml
├── models/
│   ├── yolo11m.engine                # TensorRT engine (YOLO11m)
│   ├── scrfd_500m_bnkps.engine       # TensorRT engine (SCRFD)
│   ├── arcface_r100.engine           # TensorRT engine (ArcFace)
│   └── yunet.onnx                    # CPU fallback (YuNet)
├── src/
│   ├── __init__.py
│   ├── main.py                       # Entry point
│   ├── config.py                     # Configuration loader
│   ├── pipeline/
│   │   ├── __init__.py
│   │   ├── orchestrator.py           # MultiStreamPipeline
│   │   ├── stream_reader.py          # RTSP/FFmpeg frame capture
│   │   └── frame_buffer.py           # Ring buffer management
│   ├── detection/
│   │   ├── __init__.py
│   │   ├── yolo_detector.py          # YOLO11m inference wrapper
│   │   └── detector_base.py          # Abstract detector interface
│   ├── face/
│   │   ├── __init__.py
│   │   ├── face_detector.py          # SCRFD inference wrapper
│   │   ├── face_recognizer.py        # ArcFace inference wrapper
│   │   ├── face_aligner.py           # 5-point alignment
│   │   ├── quality_checker.py        # Blur/pose/illumination
│   │   └── embedding_store.py        # Vector DB operations
│   ├── tracking/
│   │   ├── __init__.py
│   │   ├── bytetrack.py              # ByteTrack implementation
│   │   ├── kalman_filter.py          # Kalman filter
│   │   ├── track_manager.py          # Track lifecycle management
│   │   └── matching.py               # IoU / embedding matching
│   ├── clustering/
│   │   ├── __init__.py
│   │   ├── hdbscan_engine.py         # HDBSCAN wrapper
│   │   ├── cluster_manager.py        # Cluster CRUD + merge logic
│   │   └── cluster_profile.py        # Cluster data model
│   ├── evidence/
│   │   ├── __init__.py
│   │   ├── capture_manager.py        # Evidence capture orchestrator
│   │   ├── deduplicator.py           # Deduplication logic
│   │   ├── storage.py                # File system + object storage
│   │   └── metadata.py               # EvidenceRecord dataclass
│   ├── confidence/
│   │   ├── __init__.py
│   │   ├── scorer.py                 # Aggregate confidence computation
│   │   ├── threshold_manager.py      # Dynamic threshold adjustment
│   │   └── vibe_mapper.py            # Vibe settings -> thresholds
│   ├── inference/
│   │   ├── __init__.py
│   │   ├── tensorrt_wrapper.py       # Generic TensorRT inference
│   │   ├── onnx_wrapper.py           # ONNX Runtime inference
│   │   └── batch_processor.py        # Dynamic batching logic
│   ├── api/
│   │   ├── __init__.py
│   │   ├── server.py                 # FastAPI application
│   │   ├── routes/
│   │   │   ├── detection.py          # Detection config API
│   │   │   ├── faces.py              # Face database API
│   │   │   ├── tracks.py             # Track query API
│   │   │   ├── evidence.py           # Evidence retrieval API
│   │   │   └── settings.py           # Vibe settings API
│   │   └── websocket.py              # Real-time event streaming
│   └── utils/
│       ├── __init__.py
│       ├── logger.py                 # Structured logging
│       ├── metrics.py                # Prometheus metrics
│       ├── time_utils.py             # Timestamp handling
│       └── image_utils.py            # Crop, resize, encode
├── tests/
│   ├── unit/
│   ├── integration/
│   └── benchmarks/
└── scripts/
    ├── export_tensorrt.py            # Convert .pt -> .onnx -> .engine
    ├── calibrate_int8.py             # INT8 calibration with custom data
    ├── benchmark_pipeline.py         # End-to-end benchmark
    └── setup_vector_db.py            # Initialize FAISS/Milvus index

10.7 Core Inference Code Architecture

# src/inference/tensorrt_wrapper.py — Generic TensorRT inference engine

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

class TensorRTInference:
    """Generic TensorRT inference wrapper supporting dynamic batch sizes."""

    def __init__(self, engine_path: str, max_batch_size: int = 32):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)

        with open(engine_path, "rb") as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()
        self.max_batch_size = max_batch_size
        self.stream = cuda.Stream()

        # Allocate GPU buffers
        self.inputs = []
        self.outputs = []
        self.bindings = []
        self._allocate_buffers()

    def _allocate_buffers(self):
        """Allocate pinned host and device memory for all I/O bindings."""
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            mode = self.engine.get_tensor_mode(name)
            shape = list(self.engine.get_tensor_shape(name))
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))

            # Dynamic axes are reported as -1; size buffers at max batch
            if shape[0] == -1:
                shape[0] = self.max_batch_size
            size = trt.volume(shape)
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)

            # execute_async_v3 reads addresses registered per tensor name,
            # not the legacy bindings list
            self.context.set_tensor_address(name, int(device_mem))
            self.bindings.append(int(device_mem))

            buf = {"name": name, "host": host_mem,
                   "device": device_mem, "shape": shape, "dtype": dtype}
            if mode == trt.TensorIOMode.INPUT:
                self.inputs.append(buf)
            else:
                self.outputs.append(buf)

    def infer(self, input_batch: np.ndarray) -> list[np.ndarray]:
        """Execute inference on a batched input."""
        batch_size = input_batch.shape[0]

        # Copy input to pinned memory (cast to the engine's input dtype)
        flat = input_batch.astype(self.inputs[0]["dtype"], copy=False).ravel()
        np.copyto(self.inputs[0]["host"][:flat.size], flat)

        # Set dynamic batch size
        input_shape = list(self.inputs[0]["shape"])
        input_shape[0] = batch_size
        self.context.set_input_shape(self.inputs[0]["name"], input_shape)

        # Transfer H2D
        cuda.memcpy_htod_async(self.inputs[0]["device"],
                               self.inputs[0]["host"], self.stream)

        # Execute
        self.context.execute_async_v3(stream_handle=self.stream.handle)

        # Transfer D2H
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out["host"], out["device"], self.stream)

        self.stream.synchronize()

        # Reshape outputs
        results = []
        for out in self.outputs:
            out_shape = list(out["shape"])
            out_shape[0] = batch_size
            results.append(out["host"][:np.prod(out_shape)].reshape(out_shape))

        return results

    def __del__(self):
        try:
            self.stream.synchronize()
        except Exception:
            pass  # CUDA context may already be torn down at interpreter exit

10.8 Key Dependencies

# pyproject.toml dependencies
[tool.poetry.dependencies]
python = "^3.10"
torch = "^2.2.0"
torchvision = "^0.17.0"            # torchvision 0.17 pairs with torch 2.2
tensorrt = "^10.0.0"
pycuda = "^2024.1"
onnxruntime-gpu = "^1.16.0"
opencv-python = "^4.9.0"
numpy = "^1.26.0"
scipy = "^1.12.0"
scikit-learn = "^1.4.0"
hdbscan = "^0.8.33"
faiss-gpu = "^1.7.4"
pydantic = "^2.6.0"
fastapi = "^0.109.0"
uvicorn = "^0.27.0"
websockets = "^12.0"
aioredis = "^2.0.0"
asyncpg = "^0.29.0"
minio = "^7.2.0"
prometheus-client = "^0.20.0"
structlog = "^24.1.0"
python-multipart = "^0.0.9"
pillow = "^10.2.0"

11. Performance Summary & Benchmarks

11.1 Target System Performance

| Metric | Target | Notes |
|---|---|---|
| Processed FPS per stream | 15-20 FPS | At 960x1080 input |
| Total system throughput | 120-160 FPS aggregate | 8 streams simultaneously |
| End-to-end latency | < 100ms | Frame in -> result out |
| GPU memory | < 10 GB | All 3 TensorRT engines loaded |
| System RAM | < 16 GB | Buffers + clustering + API |
| Storage growth | ~100 GB/month | With selective full-frame storage |
| Concurrent API clients | 50+ | WebSocket event subscribers |

11.2 Accuracy Targets on Surveillance Data

| Task | Metric | Target |
|---|---|---|
| Human Detection | mAP@50 (person) | > 75% |
| Human Detection | Recall@0.5IoU | > 85% |
| Face Detection | AP (medium) | > 85% |
| Face Detection | Min face size | 20x20 px |
| Face Recognition | Rank-1 accuracy (known persons) | > 98% |
| Face Recognition | False acceptance rate | < 0.1% |
| Tracking | MOTA | > 75% |
| Tracking | IDF1 | > 70% |
| Tracking | ID switches / 100 frames | < 2 |
| Clustering | Purity | > 89% |
| Clustering | BCubed F-Measure | > 0.85 |
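Holding the < 0.1% false-acceptance target means deriving the cosine match threshold from impostor scores rather than hard-coding it: pick the threshold as the (1 - FAR) quantile of impostor similarities. A sketch with synthetic score distributions (the normal parameters below are illustrative, not measured on surveillance data):

```python
import numpy as np

def threshold_for_far(impostor_scores: np.ndarray, far: float) -> float:
    """Cosine threshold at which at most `far` of impostor pairs match."""
    return float(np.quantile(impostor_scores, 1.0 - far))

rng = np.random.default_rng(1)
# Synthetic similarity distributions, NOT measured data:
impostors = rng.normal(0.10, 0.08, 100_000)   # non-matching pairs
genuines = rng.normal(0.75, 0.10, 10_000)     # matching pairs

thr = threshold_for_far(impostors, far=0.001)      # FAR < 0.1% target
accept_rate = float((genuines >= thr).mean())      # TAR at that FAR
```

Recomputing the threshold from impostor pairs sampled on the target cameras keeps the FAR guarantee valid as lighting and demographics shift.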

11.3 Failure Modes & Mitigations

| Failure Mode | Detection | Mitigation |
|---|---|---|
| GPU memory exhaustion | Monitor nvidia-smi | Reduce batch size, enable model streaming |
| Frame drop in decode | Monitor FFmpeg buffer | Increase ring buffer, enable HW decode |
| High false positive rate | Track review queue | Auto-increase detection threshold |
| Track fragmentation | Monitor ID switches | Tune ByteTrack track_buffer parameter |
| Cluster contamination | Monitor cluster purity | Lower DBSCAN eps, enable merge review |
| Vector DB latency growth | Query latency histogram | Switch from IndexFlat to IndexIVF |
| Disk space exhaustion | Storage capacity alert | Auto-archive evidence > 90 days |
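The "auto-increase detection threshold" mitigation can be sketched as a bounded feedback controller: raise the confidence threshold while the reviewed false-positive rate exceeds target, relax it when the rate is comfortably low. All constants below are illustrative, and `ThresholdAdjuster` is a hypothetical name, not the `threshold_manager` module's actual API:

```python
class ThresholdAdjuster:
    """Bounded feedback on the detection confidence threshold, driven by
    the false-positive rate observed in the track review queue."""

    def __init__(self, base=0.35, target_fp_rate=0.05, step=0.02, max_thr=0.60):
        self.base = base                  # matches the configured 0.35 default
        self.threshold = base
        self.target = target_fp_rate
        self.step = step
        self.max_thr = max_thr            # never suppress recall entirely

    def update(self, fp_rate: float) -> float:
        if fp_rate > self.target:
            self.threshold = min(self.threshold + self.step, self.max_thr)
        elif fp_rate < self.target / 2:
            self.threshold = max(self.threshold - self.step, self.base)
        return self.threshold
```

Clamping to `[base, max_thr]` keeps the controller from oscillating into a regime where recall targets (Section 11.2) can no longer be met.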

12. Appendix A: Model Export Commands

# 1. Export YOLO11m to TensorRT
python -c "
from ultralytics import YOLO
model = YOLO('yolo11m.pt')
model.export(format='onnx', imgsz=640, opset=17, dynamic=True, simplify=True)
"
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolo11m.onnx \
  --saveEngine=yolo11m.engine \
  --fp16 \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:8x3x640x640 \
  --maxShapes=images:16x3x640x640

# 2. Export SCRFD-500M to TensorRT (via ONNX)
python scripts/export_scrfd_onnx.py \
  --config configs/scrfd_500m_bnkps.py \
  --checkpoint scrfd_500m_bnkps.pth \
  --input-img test.jpg \
  --shape 640 640 \
  --show
/usr/src/tensorrt/bin/trtexec \
  --onnx=scrfd_500m.onnx \
  --saveEngine=scrfd_500m.engine \
  --fp16

# 3. Export ArcFace R100 to TensorRT
# InsightFace distributes its recognition models as ONNX in the model zoo;
# use the R100 ONNX weights directly rather than re-exporting from PyTorch.
/usr/src/tensorrt/bin/trtexec \
  --onnx=arcface_r100.onnx \
  --saveEngine=arcface_r100.engine \
  --fp16 \
  --minShapes=input.1:1x3x112x112 \
  --optShapes=input.1:32x3x112x112 \
  --maxShapes=input.1:64x3x112x112

13. Appendix B: INT8 Calibration

# scripts/calibrate_int8.py
import tensorrt as trt
from src.inference.calibrator import SurveillanceCalibrator

calibrator = SurveillanceCalibrator(
    calibration_data_dir="/data/calibration/surveillance_500frames",
    cache_file="yolo11m_calibration.cache",
    input_shape=(8, 3, 640, 640),
    max_batches=100
)

config = {
    "onnx_file": "yolo11m.onnx",
    "engine_file": "yolo11m_int8.engine",
    "precision": "int8",
    "calibrator": calibrator,
    "max_batch_size": 16,
    "workspace_mb": 4096,
}
# `config` feeds the TensorRT engine-build step (builder helper not shown).
# INT8 engine provides ~3.5x speedup with <0.5% mAP drop.
# Requires 500+ representative frames from the target cameras.

14. Appendix C: Performance Benchmark Script

# scripts/benchmark_pipeline.py
import time
import statistics
import numpy as np
from datetime import datetime
from src.pipeline.orchestrator import MultiStreamPipeline

BENCHMARK_DURATION = 300  # seconds (5 minutes)
WARMUP_FRAMES = 60
dummy_frame = np.zeros((1080, 960, 3), dtype=np.uint8)  # blank 960x1080 frame

def benchmark():
    pipeline = MultiStreamPipeline.from_config("config/pipeline.yaml")

    # Warmup
    for _ in range(WARMUP_FRAMES):
        pipeline.process_frame("CAM_01", dummy_frame, datetime.now())

    # Benchmark
    latencies = []
    start = time.monotonic()
    while time.monotonic() - start < BENCHMARK_DURATION:
        t0 = time.perf_counter()
        pipeline.process_frame("CAM_01", dummy_frame, datetime.now())
        latencies.append((time.perf_counter() - t0) * 1000)  # ms

    print(f"Mean latency: {statistics.mean(latencies):.1f}ms")
    latencies.sort()
    print(f"P50 latency: {statistics.median(latencies):.1f}ms")
    print(f"P95 latency: {latencies[int(len(latencies) * 0.95)]:.1f}ms")
    print(f"P99 latency: {latencies[int(len(latencies) * 0.99)]:.1f}ms")
    print(f"Throughput: {len(latencies) / BENCHMARK_DURATION:.1f} FPS")

if __name__ == "__main__":
    benchmark()

Document Version: 1.0.0 | Generated for CP PLUS 8-Channel DVR Surveillance Platform All model specifications and benchmarks reflect publicly available data as of July 2025