Industrial Surveillance AI Vision Pipeline — Complete Technical Design
Document Information
| Property | Value |
|---|---|
| Version | 1.0.0 |
| Date | 2025-07-24 |
| DVR | CP PLUS 8-Channel, 960x1080 per channel |
| Environment | Indoor / Industrial Mixed |
| Target Streams | 8 simultaneous RTSP feeds |
| Edge Compute | Moderate (NVIDIA Jetson or x86 + T4 class GPU) |
| Cloud Compute | GPU-backed (V100 / A10 / T4) |
1. Human Detection Module
1.1 Model Selection: YOLO11-Medium (YOLO11m)
Primary Choice: YOLO11m (Ultralytics)
Rationale: YOLO11 strikes the optimal balance between accuracy and inference speed for industrial surveillance. The medium variant provides sufficient capacity to detect partially occluded humans at mid-range distances while maintaining real-time throughput across 8 streams.
| Attribute | Specification |
|---|---|
| Model | YOLO11m (Ultralytics release 8.3.x) |
| Backbone | C3k2 bottleneck + C2PSA attention module |
| Neck | PANet with compact feature aggregation |
| Head | Anchor-free decoupled detection head |
| Parameters | 20.1 M |
| Input Resolution | 640 x 640 (letterboxed from 960x1080) |
| mAP@50-95 (COCO) | 51.5% |
| Person Class AP | ~75-80% (estimated, COCO "person") |
Alternative for GPU-constrained edge: YOLO11s (9.4M params, 47.0% mAP, 2.5ms T4 TensorRT)
Alternative for maximum accuracy: RT-DETR-L (53.4% mAP, 6.8ms T4 TensorRT, transformer-based)
1.2 Inference Configuration
# yolo11m_detection.yaml
model:
weights: "yolo11m.pt"
class_filter: ["person"] # Only detect human class (COCO idx 0)
confidence_threshold: 0.35 # Balanced sensitivity
iou_threshold: 0.45 # NMS IoU threshold
max_detections: 50 # Max persons per frame
imgsz: 640 # Square input
half: true # FP16 inference
dataloader:
batch_size: 8 # One frame per stream
workers: 4
pin_memory: true
1.3 Quantization & Optimization Strategy
| Optimization Stage | Target | Expected Speedup | Accuracy Impact |
|---|---|---|---|
| PyTorch FP32 | Baseline | 1.0x | Baseline (51.5% mAP) |
| ONNX Export | Interop | 1.1x | Negligible |
| TensorRT FP16 | Production GPU | 2.8x (~4.7ms T4) | -0.1% mAP |
| TensorRT INT8 | Maximum throughput | 3.5x (~3.8ms T4) | -0.3% to -0.5% mAP |
| INT8 + DLA | Jetson Orin DLA | 4.0x | -0.5% mAP |
Recommended production path: TensorRT FP16 on GPU (best accuracy/speed tradeoff). INT8 for edge gateway with calibration dataset of 500+ representative surveillance frames.
1.4 Frame Preprocessing
# Input: 960 x 1080 frame from a DVR channel (960H-class source)
# Step 1: Resize to 640x640 with letterboxing (maintain aspect ratio)
# Step 2: Normalize: divide by 255.0, mean=[0.0,0.0,0.0], std=[1.0,1.0,1.0]
# Step 3: HWC -> CHW format
# Step 4: Batch: [8, 3, 640, 640]
# Expected output: person bounding boxes [x1, y1, x2, y2, conf, class_id]
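The letterbox geometry in Step 1 can be sketched in a few lines. This is a minimal illustration of the scale/padding arithmetic only (the actual resize would use an image library); `letterbox_geometry` is a hypothetical helper name:

```python
def letterbox_geometry(src_w: int, src_h: int, dst: int = 640):
    """Compute the scale and padding that letterbox a src_w x src_h frame
    into a square dst x dst input while preserving aspect ratio."""
    scale = min(dst / src_w, dst / src_h)          # shrink to fit the longer side
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst - new_w) // 2                     # left/right gray bars
    pad_y = (dst - new_h) // 2                     # top/bottom gray bars
    return scale, (new_w, new_h), (pad_x, pad_y)
```

For the 960x1080 DVR frames this gives a scale of 640/1080 (about 0.593), a 569x640 resized image, and roughly 35 px of horizontal padding on each side.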
1.5 Performance Targets
| Metric | Target | Notes |
|---|---|---|
| Latency (single frame) | < 5ms @ T4 TensorRT FP16 | YOLO11m at 640 |
| Throughput (8 streams) | > 160 FPS aggregate | Batch=8, processing 20 FPS per stream |
| Person AP@50 | > 75% | On surveillance test set |
| Small person detection | > 60% AP | For persons > 30px tall |
| Occlusion handling | > 50% AP | Partial visibility (occlusion level 1-2) |
2. Face Detection Module
2.1 Model Selection: SCRFD-500M (BNKPS)
Primary Choice: SCRFD_500M_BNKPS (InsightFace Model Zoo)
SCRFD (Sample and Computation Redistribution for Efficient Face Detection) achieves the best speed-accuracy tradeoff for face detection on GPU. The 500MF variant is optimized for 640px inputs and provides 5-point facial keypoints (eyes, nose, mouth corners) critical for face alignment prior to recognition.
| Attribute | Specification |
|---|---|
| Model | SCRFD_500M_BNKPS (ONNX) |
| Source | deepinsight/insightface model zoo |
| Input Resolution | 640 x 640 |
| FLOPs | 500 MFLOPs |
| Parameters | ~1.5 M |
| WIDERFACE AP (Easy) | 0.906 |
| WIDERFACE AP (Medium) | 0.870 |
| WIDERFACE AP (Hard) | 0.720 |
| Keypoint Output | 5 facial landmarks (eyes x2, nose, mouth corners x2) |
Alternative (CPU/Edge): YuNet (OpenCV DNN, ~1ms CPU, AP_Easy 0.884, AP_Medium 0.866)
Alternative (Maximum accuracy): RetinaFace-R50 (higher AP but 5-8x slower)
2.2 Inference Configuration
# scrfd_face_detection.yaml
model:
onnx_file: "scrfd_500m_bnkps.onnx"
input_size: 640
confidence_threshold: 0.45 # Face detection minimum confidence
nms_threshold: 0.4
top_k: 100 # Max faces per frame
min_face_size: 20 # Minimum face pixel height (20px)
scale_factor: [8, 16, 32] # Feature pyramid strides
# Face quality scoring
quality:
blur_threshold: 50.0 # Laplacian variance threshold
pose_max_yaw: 45.0 # Degrees - reject profile faces
pose_max_pitch: 30.0 # Degrees
min_face_width: 20 # Pixels - ignore tiny faces
max_face_width: 300 # Pixels - ignore giant close-ups
2.3 Face Quality Assessment
Each detected face is scored on multiple dimensions before proceeding to recognition:
| Quality Metric | Method | Threshold | Rejection Rate |
|---|---|---|---|
| Sharpness/Blur | Laplacian variance | Var > 50 | ~15% of detections |
| Face Size | Bounding box height | > 20px, < 300px | ~10% of detections |
| Head Pose | 5-point landmark geometry | Yaw < 45, Pitch < 30 | ~20% of detections |
| Face Confidence | SCRFD detection score | > 0.45 | ~5% of detections |
| Illumination | Mean face ROI intensity | 40 < mean < 240 | ~5% of detections |
Only faces passing all quality gates proceed to the recognition module.
2.4 Face Alignment
Using the 5-point landmarks from SCRFD, each face is aligned to a canonical pose using a similarity transform:
# Alignment target landmarks (112x112 template)
TARGET_LANDMARKS = np.array([
[38.2946, 51.6963], # Left eye
[73.5318, 51.5014], # Right eye
[56.0252, 71.7366], # Nose
[41.5493, 92.3655], # Left mouth
[70.7299, 92.2041], # Right mouth
], dtype=np.float32)
# Apply similarity transform (scale, rotation, translation)
# Output: aligned 112x112 face crop ready for ArcFace
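The similarity transform can be estimated from the 5 detected landmarks with a least-squares (Umeyama) fit. Below is a minimal NumPy sketch; `estimate_similarity` is an illustrative name, and production code would typically call an OpenCV or scikit-image equivalent before warping the crop:

```python
import numpy as np

def estimate_similarity(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares similarity transform (scale, rotation, translation)
    mapping src landmarks onto dst. Returns a 2x3 affine matrix."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)               # cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, d])                          # guard against reflection
    R = U @ D @ Vt                                 # optimal rotation
    scale = (S * np.diag(D)).sum() / src_c.var(axis=0).sum()
    t = mu_d - scale * (R @ mu_s)
    return np.hstack([scale * R, t[:, None]])      # 2x3 matrix for warpAffine
```

Applying the returned matrix to the detected landmarks maps them onto the 112x112 template above.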
2.5 Performance Targets
| Metric | Target | Notes |
|---|---|---|
| Latency (single face) | < 3ms @ T4 TensorRT FP16 | SCRFD-500M |
| Latency (batch 32 faces) | < 12ms | Batch processing |
| WIDERFACE AP (Hard) | > 0.70 | Challenging angles, lighting |
| Min detectable face size | 20x20 pixels | ~10m distance at 1080p |
| 5-landmark accuracy | < 3px NME | Normalized mean error |
3. Face Recognition Module
3.1 Model Selection: ArcFace R100 (MS1MV3)
Primary Choice: ArcFace R100 (IResNet100, InsightFace)
ArcFace with Additive Angular Margin Loss is the industry-standard face recognition model. The ResNet100 (IR-SE100) backbone trained on MS1MV3 (MS1M-V3) provides state-of-the-art accuracy with 512-dimensional embeddings.
| Attribute | Specification |
|---|---|
| Model | ArcFace with IResNet100 (IR-SE100) backbone |
| Loss Function | Additive Angular Margin (ArcFace) |
| Training Data | MS1MV3 (~5.1M images, 93K identities) |
| Input Size | 3 x 112 x 112 (aligned face crop) |
| Embedding Dimension | 512 (float32) |
| Parameters | ~65 M |
| LFW Accuracy | 99.83% |
| CFP-FP Accuracy | 98.27% |
| AgeDB-30 Accuracy | 98.28% |
| IJB-C (TPR@FPR=1e-4) | 96.1% |
Alternative (speed-focused): ArcFace R50 (IR-SE50, 25M params, LFW 99.80%, ~2x faster)
Alternative (edge/mobile): MobileFaceNet (4M params, 128-D embedding, LFW 99.28%)
3.2 Embedding Extraction Pipeline
# Pipeline per detected face:
# 1. Crop face from original frame using SCRFD bounding box
# 2. Align using 5-point landmarks -> 112x112 normalized crop
# 3. Normalize pixel values: (pixel - 127.5) / 128.0
# 4. Forward pass through ArcFace R100
# 5. L2-normalize the 512-D embedding vector
# 6. Store embedding + metadata for matching
# Output: 512-D unit vector representing facial identity
3.3 Similarity Computation & Matching
| Parameter | Value | Description |
|---|---|---|
| Similarity Metric | Cosine Similarity | dot(u, v) / (norm(u) * norm(v)) |
| Embedding Dim | 512 | Float32 per vector = 2KB storage |
| Distance Metric | 1 - Cosine Similarity | Range [0.0, 2.0] |
| Top-K Query | K=5 | Return top 5 candidates |
| Strict Match Threshold | 0.58 (cosine) / 0.42 (distance) | High confidence ID |
| Balanced Match Threshold | 0.50 (cosine) / 0.50 (distance) | Standard confidence |
| Relaxed Match Threshold | 0.42 (cosine) / 0.58 (distance) | Maximum recall |
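Because the stored embeddings are L2-normalized, cosine similarity reduces to a dot product, so an exact top-K query is a single matrix multiply. A brute-force NumPy sketch of what FAISS/Milvus provide at scale (function and variable names are illustrative):

```python
import numpy as np

def topk_cosine(query, gallery: np.ndarray, k: int = 5):
    """Exact top-K cosine search. gallery is an (N, D) matrix whose rows
    are already L2-normalized; query may be raw."""
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    sims = gallery @ q                    # cosine == dot product for unit vectors
    order = np.argsort(-sims)[:k]         # indices of the K best matches
    return order, sims[order]
```

For galleries under ~10K identities this exact scan stays comfortably inside the < 5ms matching budget in §8.1.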
3.4 Face Database Structure
# Known Person Database (Milvus/FAISS vector store)
known_persons_db = {
"person_id": "uuid-string", # Unique person identifier
"name": "John Doe", # Display name (optional)
"employee_id": "EMP001", # External reference
"embeddings": [ # Multiple reference embeddings
{
"vector": np.array(512), # L2-normalized embedding
"source_camera": "CAM_01",
"timestamp": "2025-07-01T10:00:00Z",
"face_quality": 0.92,
"pose_yaw": 5.2, # Head pose at capture
}
],
"created_at": "2025-07-01T00:00:00Z",
"updated_at": "2025-07-20T15:30:00Z",
"enrollment_count": 3, # Number of reference photos
}
3.5 Top-K Matching Strategy
def match_face(embedding: np.ndarray, db: VectorStore, k: int = 5) -> MatchResult:
"""
1. Query vector DB for top-K nearest neighbors (cosine similarity)
2. Compute similarity scores for all K candidates
3. Apply threshold-based classification:
- Highest score >= strict_threshold -> CONFIDENT_MATCH
- Highest score >= balanced_threshold -> PROBABLE_MATCH
- Highest score >= relaxed_threshold -> POSSIBLE_MATCH
- All scores < relaxed_threshold -> UNKNOWN
4. For CONFIDENT_MATCH: return person_id with confidence
5. For UNKNOWN: route to clustering module for unknown identity grouping
"""
3.6 Performance Targets
| Metric | Target | Notes |
|---|---|---|
| Latency (single face) | < 8ms @ T4 TensorRT FP16 | ArcFace R100 at 112x112 |
| Latency (batch 32 faces) | < 25ms | Batch processing |
| LFW Verification | > 99.8% | Standard benchmark |
| CFP-FP (frontal-profile) | > 98.0% | Pose variation robustness |
| False Acceptance Rate | < 0.1% @ 99% TPR | For access control scenarios |
| Embedding Throughput | > 4,000 faces/sec | GPU batch inference |
4. Person Tracking Module
4.1 Model Selection: ByteTrack
Primary Choice: ByteTrack (Peize Sun et al., ByteDance)
ByteTrack achieves the best accuracy-speed tradeoff for surveillance tracking. Its dual-threshold association mechanism recovers objects from low-confidence detections, dramatically reducing ID switches during occlusions — a critical requirement for industrial environments with shelving, machinery, and partial obstructions.
| Attribute | Specification |
|---|---|
| Algorithm | ByteTrack (BYTE association) |
| Motion Model | Kalman Filter (constant velocity) |
| Similarity Metric | IoU (first association), IoU (second association) |
| Detection Threshold (high) | 0.6 |
| Detection Threshold (low) | 0.1 |
| Track Buffer (lost frames) | 30 frames (~1 sec @ 30 FPS) |
| IoU Match Threshold | 0.2 (reject matches below) |
| FPS (V100) | 30 FPS (detection + tracking) |
| MOTA (MOT17) | 80.3% |
| IDF1 (MOT17) | 77.3% |
| HOTA (MOT17) | 63.1% |
Alternative (accuracy-focused): BoT-SORT (~+1% MOTA, improved MOTP, adds camera motion compensation; ~35 FPS)
Alternative (edge/CPU): OC-SORT (hundreds of FPS on CPU, handles non-linear motion)
4.2 Tracking Pipeline Configuration
# bytetrack_config.yaml
bytetrack:
track_thresh: 0.6 # High-confidence detection threshold
track_buffer: 30 # Max frames to keep lost tracks alive
match_thresh: 0.8 # IoU matching threshold (first stage)
det_thresh_low: 0.1 # Low-confidence threshold for second association
iou_thresh_reject: 0.2 # Minimum IoU to accept a match
min_box_area: 100 # Ignore detections smaller than 10x10 px
aspect_ratio_thresh: 10.0 # Reject extreme aspect ratios
mot20: false # Standard density mode
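Both BYTE association stages score detection-track pairs with plain bounding-box IoU; a minimal reference implementation (`box_iou` is an illustrative helper name):

```python
def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0
```

Pairs scoring below `iou_thresh_reject` (0.2 above) are discarded before assignment.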
4.3 Track ID Management
class TrackManager:
"""Manages track lifecycle across all camera streams."""
def __init__(self):
self.next_track_id = 0 # Monotonically increasing
self.active_tracks = {} # track_id -> TrackState
self.lost_tracks = {} # Recently lost, may recover
self.archived_tracks = {} # Finalized trajectories
def create_track(self, detection, camera_id):
"""Initialize new track from high-confidence detection."""
track_id = self.next_track_id
self.next_track_id += 1
# Initialize Kalman filter state
# Store: bbox, confidence, camera_id, first_seen, last_seen
return track_id
def update_track(self, track_id, detection):
"""Update existing track with matched detection."""
# Update Kalman filter
# Update last_seen timestamp
# Increment hit count
def mark_lost(self, track_id):
"""Track not matched in current frame."""
# Increment lost count
# If lost > track_buffer, archive track
def get_track_summary(self, track_id) -> dict:
"""Return track metadata: duration, camera span, entry/exit zones."""
4.4 Cross-Camera Track Association
For multi-camera scenarios (8 channels), a secondary association layer links tracks across cameras using:
- Temporal proximity — tracks appearing on different cameras within a time window
- Appearance features — ArcFace embedding similarity for re-identification
- Zone transition rules — predefined camera adjacency graph (CAM_01 -> CAM_02)
def associate_cross_camera(track_cam_a, track_cam_b, max_time_gap=60):
"""
Associate tracks across cameras using:
- Time gap between track end (A) and track start (B) < max_time_gap seconds
- Embedding cosine similarity > 0.65 (relaxed threshold for ReID)
- Camera adjacency is valid in zone graph
"""
4.5 Performance Targets
| Metric | Target | Notes |
|---|---|---|
| MOTA | > 75% | Multi-object tracking accuracy |
| IDF1 | > 70% | Identity preservation across frames |
| ID Switches | < 2 per 100 frames | Per camera stream |
| Fragmentation | < 3 per track | Track splits per person per session |
| Track Recovery | > 80% within 1 sec | Re-acquire after brief occlusion |
| Latency overhead | < 1ms per frame | Tracking association cost |
5. Unknown Person Clustering Module
5.1 Model Selection: HDBSCAN + Chinese Whisper Ensemble
Primary Choice: HDBSCAN (Hierarchical Density-Based Spatial Clustering)
For unknown face embedding clustering, HDBSCAN outperforms DBSCAN by not requiring a global density parameter (eps) and naturally handling variable-density clusters — critical for surveillance where some individuals appear frequently and others only once.
| Attribute | Specification |
|---|---|
| Clustering Algorithm | HDBSCAN (primary) + DBSCAN (fallback) |
| Embedding Input | 512-D L2-normalized ArcFace embeddings |
| Distance Metric | Cosine distance (1 - cosine similarity) |
| Min Cluster Size | 3 |
| Min Samples | 2 |
| Cluster Selection Method | eom (Excess of Mass) |
| Allow Single Cluster | True |
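The DBSCAN fallback path can be sketched with scikit-learn, running cosine distance over the normalized embeddings with the eps value from §5.2. This is an illustrative stand-in for the full HDBSCAN flow and assumes scikit-learn is available:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_embeddings(embeddings, eps: float = 0.28, min_samples: int = 2):
    """Cluster L2-normalized face embeddings with the DBSCAN fallback.
    Returns an array of labels; -1 marks noise (single appearances)."""
    X = np.asarray(embeddings, dtype=np.float32)
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="cosine").fit_predict(X)
```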
5.2 Clustering Pipeline
class UnknownPersonClustering:
"""Clusters unknown person embeddings to identify recurring visitors."""
def __init__(self):
self.clusters = {} # cluster_id -> ClusterProfile
self.noise_embeddings = [] # Unclustered (single-appearance)
self.merge_candidates = [] # Pairs flagged for merge review
self.dbscan_eps = 0.28 # Fallback DBSCAN parameter
self.dbscan_min_samples = 2
def add_embedding(self, embedding: np.ndarray, metadata: dict) -> str:
"""
1. Try HDBSCAN fit_predict on accumulated embeddings
2. If HDBSCAN fails (all noise), fall back to DBSCAN
3. Assign embedding to cluster or mark as noise (-1)
4. If cluster assignment: update cluster centroid and metadata
5. Check for cluster merge opportunities
6. Return: cluster_id or "noise"
"""
def merge_clusters(self, cluster_a: str, cluster_b: str) -> str:
"""
Merge two clusters that belong to the same person.
Trigger: centroid distance < 0.25 (cosine distance)
OR temporal overlap analysis
OR manual operator confirmation
"""
def get_recurring_unknowns(self, min_appearances: int = 3) -> list:
"""Return unknown persons seen at least N times (potential enrollment candidates)."""
def compute_cluster_centroid(self, cluster_id: str) -> np.ndarray:
"""L2-normalized mean of all embeddings in cluster."""
5.3 Cluster Data Structure
@dataclass
class ClusterProfile:
cluster_id: str # UUID
centroid: np.ndarray # 512-D mean embedding (L2-normalized)
embeddings: List[np.ndarray] # All member embeddings
metadata: List[dict] # Source info per embedding
first_seen: datetime
last_seen: datetime
appearance_count: int # Total embeddings in cluster
camera_span: Set[str] # Which cameras observed this person
quality_score: float # Average face quality (0-1)
best_face_crop: str # Path to highest quality crop
is_named: bool = False # Flag when promoted to known person
person_name: Optional[str] = None # Assigned name (if promoted)
5.4 Merge Logic & Cluster Maintenance
| Trigger | Action | Threshold |
|---|---|---|
| Centroid distance | Auto-merge clusters | cosine distance < 0.20 |
| Centroid distance | Flag for review | cosine distance 0.20-0.30 |
| Temporal overlap | Prevent merge | Same time on different cameras |
| Cluster size | Auto-archive | > 100 embeddings, compress to centroid |
| Age | Archive old clusters | No activity for 90 days |
5.5 Three-Tier Identity Classification
┌─────────────────────────────────────────────────────────────┐
│ IDENTITY CLASSIFICATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ cosine >= 0.58 │
│ │ KNOWN PERSON │◄──────────────────────────────┐ │
│ │ (Database Match) │ │
│ └─────────────────────┘ │
│ ▲ │
│ │ │
│ ┌────────┴────────────┐ 0.35 <= cosine < 0.58 │
│ │ UNKNOWN RECURRING │◄──────────────────────────────┐ │
│ │ (Cluster Match) │ │
│ └─────────────────────┘ │
│ ▲ │
│ │ │
│ ┌────────┴────────────┐ cosine < 0.35 │
│ │ NEW UNKNOWN │◄──────────────────────────────┐ │
│ │ (Noise / New) │ │
│ └─────────────────────┘ │
│ │
│ ┌─────────────────────┐ │
│ │ REVIEW QUEUE │◄── Low quality / Low confidence │
│ │ (Operator Review) │ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
5.6 Clustering Performance Targets
| Metric | Target | Notes |
|---|---|---|
| Cluster Purity | > 89% | Same person in same cluster (HDBSCAN benchmark) |
| BCubed F-Measure | > 0.85 | Precision-recall balanced clustering |
| Clustering Latency | < 100ms | Per batch of 50 new embeddings |
| False Merge Rate | < 5% | Different people in same cluster |
| Memory per cluster | ~4 KB | Centroid + metadata |
6. Evidence Capture Module
6.1 Capture Triggers
Evidence is captured (face crop + metadata saved) on the following events:
| Event Type | Trigger Condition | Priority |
|---|---|---|
| KNOWN_PERSON_DETECTED | Face match confidence >= 0.50 | Medium |
| UNKNOWN_PERSON_DETECTED | New cluster formed, 3rd appearance | High |
| REVIEW_NEEDED | Low confidence match OR low quality face | High |
| ZONE_VIOLATION | Person enters restricted zone | Critical |
| TAILGATING | Two persons detected on single credential swipe | Critical |
| AFTER_HOURS | Person detected outside authorized hours | High |
| SUSPICIOUS_BEHAVIOR | Loitering (>5 min in same area) | Medium |
6.2 Evidence Record Structure
@dataclass
class EvidenceRecord:
# Unique identifiers
evidence_id: str # UUID v4
event_id: str # Links to event log
camera_id: str # CAM_01 .. CAM_08
stream_id: str # DVR channel identifier
# Temporal
timestamp_utc: datetime
timestamp_local: datetime
frame_number: int
video_segment: str # Path to 10-sec video clip
# Person identity
identity_type: str # "known" | "unknown_recurring" | "unknown_new" | "review"
person_id: Optional[str] # Track ID or cluster ID
person_name: Optional[str] # Known person name
match_confidence: float # Face recognition confidence (0-1)
# Face crop
face_crop_path: str # /evidence/faces/2025/07/24/{id}.jpg
face_crop_dimensions: tuple # (w, h) of crop
face_quality_score: float # Combined quality metric
face_landmarks: np.ndarray # 5-point landmarks
head_pose: dict # {yaw, pitch, roll}
# Full frame reference
full_frame_path: str # /evidence/frames/2025/07/24/{id}.jpg
bounding_box: tuple # (x1, y1, x2, y2) in original frame
# AI confidence levels
detection_confidence: float # YOLO person detection confidence
face_detection_confidence: float # SCRFD face detection confidence
recognition_confidence: float # ArcFace match confidence
# Vibe settings at capture time
detection_sensitivity: str # "low" | "balanced" | "high"
face_match_strictness: str # "relaxed" | "balanced" | "strict"
# Review state
review_status: str # "pending" | "reviewed" | "confirmed" | "false_positive"
reviewed_by: Optional[str]
review_notes: Optional[str]
6.3 Deduplication Strategy
To avoid storing duplicate evidence of the same person within short time windows:
class EvidenceDeduplicator:
"""Prevents duplicate evidence capture using time-based gating."""
DEDUP_WINDOW_KNOWN = 300 # 5 minutes between captures of same known person
DEDUP_WINDOW_UNKNOWN = 60 # 1 minute between captures of same unknown person
DEDUP_WINDOW_EVENT = 10 # 10 seconds between same event type
def should_capture(self, person_id: str, event_type: str,
camera_id: str, timestamp: datetime) -> bool:
"""
1. Check last capture time for this person_id + camera_id
2. If within dedup window: skip capture, increment visit counter
3. If outside window: allow capture, update last capture time
4. Special: always capture if event_type is CRITICAL priority
"""
6.4 Storage Layout
/evidence/
faces/
2025/07/24/
{evidence_id}_{camera_id}_{person_id}_face.jpg # 112x112 aligned crop
{evidence_id}_{camera_id}_{person_id}_full.jpg # Full bounding box crop
frames/
2025/07/24/
{evidence_id}_{camera_id}_frame.jpg # Full frame with annotation overlay
video_clips/
2025/07/24/
{evidence_id}_{camera_id}_{timestamp}.mp4 # 10-second H.264 clip
metadata/
2025/07/24/
{evidence_id}.json # Full EvidenceRecord as JSON
6.5 Storage Requirements Estimate
| Content Type | Size Each | Daily (8 cams) | Monthly |
|---|---|---|---|
| Face crop (112x112 JPEG) | ~8 KB | ~50 MB | ~1.5 GB |
| Full crop (200x300 JPEG) | ~25 KB | ~150 MB | ~4.5 GB |
| Frame snapshot (960x1080 JPEG) | ~150 KB | ~900 MB | ~27 GB |
| 10-sec video clip (H.264) | ~500 KB | ~3 GB | ~90 GB |
| Metadata JSON | ~2 KB | ~12 MB | ~360 MB |
| Total (all media) | — | ~4.1 GB | ~123 GB |
Recommended: Store face crops + metadata for all events. Full frames and video clips only for priority events (review_needed, zone_violation, after_hours).
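The per-type figures follow from an assumed daily event rate; for instance, the ~50 MB/day face-crop estimate is consistent with roughly 6,400 captures/day at 8 KB each (the event rate is my illustrative assumption, not stated in the table). A small helper reproduces the arithmetic:

```python
def storage_estimate(size_kb: float, events_per_day: int, days: int = 30):
    """Return (daily MB, monthly GB) for one evidence content type."""
    daily_mb = size_kb * events_per_day / 1024
    monthly_gb = daily_mb * days / 1024
    return daily_mb, monthly_gb
```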
7. Confidence Handling & Thresholds
7.1 Confidence Level Definitions
| Level | Aggregate Score | Color | Action |
|---|---|---|---|
| HIGH | >= 0.75 | Green | Auto-process, no review needed |
| MEDIUM | 0.50 - 0.75 | Yellow | Process with confidence label, flag for spot-check |
| LOW | 0.35 - 0.50 | Orange | Capture evidence, mark for review |
| REVIEW_NEEDED | < 0.35 | Red | Always queue for operator review |
7.2 Aggregate Confidence Score
The aggregate confidence is computed as a weighted combination:
def compute_aggregate_confidence(det_conf: float, face_conf: float,
match_conf: float, quality_score: float) -> float:
"""
Aggregate = 0.25 * det_conf + 0.20 * face_conf + 0.35 * match_conf + 0.20 * quality_score
Where:
- det_conf: YOLO person detection confidence (0-1)
- face_conf: SCRFD face detection confidence (0-1)
- match_conf: ArcFace recognition match confidence (0-1), 0.0 for unknowns
- quality_score: Face quality composite score (0-1)
"""
7.3 AI Vibe Settings Mapping
The system exposes three "vibe" settings that internally map to threshold configurations:
Detection Sensitivity (applies to YOLO + SCRFD):
| Setting | YOLO Conf Threshold | SCRFD Conf Threshold | Effect |
|---|---|---|---|
| Low | 0.50 | 0.55 | Fewer detections, lower false positive rate |
| Balanced | 0.35 | 0.45 | Standard detection rate |
| High | 0.20 | 0.35 | Maximum detection, higher false positive rate |
Face Match Strictness (applies to ArcFace matching):
| Setting | Strict Threshold | Balanced Threshold | Relaxed Threshold | Effect |
|---|---|---|---|---|
| Relaxed | 0.50 | 0.42 | 0.35 | High recall, more false matches |
| Balanced | 0.58 | 0.50 | 0.42 | Balanced precision-recall |
| Strict | 0.65 | 0.58 | 0.50 | High precision, stricter matching |
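The strictness rows map naturally onto a preset table that the matcher can look up at runtime (a sketch; the dictionary layout is illustrative):

```python
# Face-match strictness presets: setting -> (strict, balanced, relaxed) thresholds
STRICTNESS_PRESETS = {
    "relaxed":  {"strict": 0.50, "balanced": 0.42, "relaxed": 0.35},
    "balanced": {"strict": 0.58, "balanced": 0.50, "relaxed": 0.42},
    "strict":   {"strict": 0.65, "balanced": 0.58, "relaxed": 0.50},
}

def match_thresholds(strictness: str) -> dict:
    """Resolve a face-match strictness setting to its threshold triple."""
    return STRICTNESS_PRESETS[strictness]
```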
7.4 Vibe Configuration Matrix
# vibe_presets.yaml
vibe_presets:
access_control: # High security area
detection_sensitivity: "balanced"
face_match_strictness: "strict"
general_surveillance: # Standard monitoring
detection_sensitivity: "balanced"
face_match_strictness: "balanced"
perimeter_monitoring: # Catching all activity
detection_sensitivity: "high"
face_match_strictness: "relaxed"
after_hours: # Night mode
detection_sensitivity: "high"
face_match_strictness: "balanced"
privacy_mode: # Minimal detection
detection_sensitivity: "low"
face_match_strictness: "strict"
7.5 Threshold Auto-Tuning Strategy
class ThresholdTuner:
"""Periodically adjusts thresholds based on operational feedback."""
def analyze_feedback(self, review_results: list):
"""
1. Collect operator review labels on REVIEW_NEEDED items
2. Track false positive rate and false negative rate
3. If FP rate > 10%: increase confidence thresholds by 5%
4. If FN rate > 10%: decrease confidence thresholds by 5%
5. Only adjust within +/- 15% of baseline values
6. Log all threshold changes with rationale
"""
def weekly_report(self) -> dict:
"""Generate confidence distribution and threshold effectiveness report."""
8. Inference Pipeline Architecture
8.1 Per-Stream Processing Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ PER-STREAM PIPELINE │
│ (Executed per camera frame) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ RTSP │ │ Frame │ │ Frame Queue │ │
│ │ Stream │───▶│ Decode │───▶│ (ring buffer) │ │
│ │ (H.264) │ │ (960x1080) │ │ max 30 frames │ │
│ └──────────┘ └──────────────┘ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ STEP 1: HUMAN DETECTION (YOLO11m TensorRT FP16) │ │
│ │ Input: 640x640 batch tensor │ │
│ │ Output: person bboxes [N x 6] (x1,y1,x2,y2,conf,cls) │ │
│ │ Latency: ~4.7ms per frame (T4) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ STEP 2: FACE DETECTION (SCRFD-500M TensorRT FP16) │ │
│ │ Input: Cropped person regions from Step 1 │ │
│ │ Output: face bboxes + 5 landmarks per face │ │
│ │ Latency: ~2.5ms per face (T4) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ STEP 3: FACE ALIGNMENT & QUALITY CHECK │ │
│ │ Input: Face crop + 5 landmarks │ │
│ │ Process: Similarity transform -> 112x112 aligned crop │ │
│ │ Quality: Blur, pose, illumination checks │ │
│ │ Latency: ~0.3ms (OpenCV CPU) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ STEP 4: FACE RECOGNITION (ArcFace R100 TensorRT FP16) │ │
│ │ Input: 112x112 aligned face crop (batch) │ │
│ │ Output: 512-D L2-normalized embedding │ │
│ │ Latency: ~6ms per face (T4, batch=8) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ STEP 5: IDENTITY MATCHING (FAISS/Milvus vector search) │ │
│ │ Input: 512-D embedding │ │
│ │ Output: Top-K matches with similarity scores │ │
│ │ Latency: < 5ms (in-memory, <10K identities) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ STEP 6: PERSON TRACKING (ByteTrack) │ │
│ │ Input: Person detections + face embeddings │ │
│ │ Output: Persistent track IDs with identity labels │ │
│ │ Latency: ~1ms per frame │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ STEP 7: UNKNOWN CLUSTERING (HDBSCAN) │ │
│ │ Input: Embeddings of unmatched faces │ │
│ │ Output: Cluster assignments for recurring unknowns │ │
│ │ Latency: ~50ms (batch update, every 30 sec) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ STEP 8: EVIDENCE CAPTURE & EVENT GENERATION │ │
│ │ Input: Track results + identity + confidence │ │
│ │ Output: Evidence records, event log entries, alerts │ │
│ │ Latency: ~5ms (async I/O) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ OUTPUT: Structured event stream to central system │ │
│ │ { track_id, identity, confidence, bbox, timestamp, │ │
│ │ camera_id, event_type, evidence_refs } │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
8.2 Multi-Stream Orchestration
class MultiStreamPipeline:
"""Orchestrates inference across 8 simultaneous camera streams."""
def __init__(self, config: PipelineConfig):
# 4 inference workers (each processes 2 streams)
self.workers = [InferenceWorker(gpu_id=i % 2) for i in range(4)]
# Stream assignments: worker -> [stream_ids]
self.stream_map = {
0: ["CAM_01", "CAM_02"],
1: ["CAM_03", "CAM_04"],
2: ["CAM_05", "CAM_06"],
3: ["CAM_07", "CAM_08"],
}
# Shared components (thread-safe)
self.tracker_pool = {cam: ByteTrack(config.track) for cam in ALL_CAMERAS}
self.face_db = VectorDatabase(config.db) # Milvus/FAISS
self.clustering = UnknownPersonClustering(config.cluster)
self.evidence = EvidenceCaptureManager(config.evidence)
def process_frame(self, camera_id: str, frame: np.ndarray, timestamp: datetime):
"""Process a single frame through the complete pipeline."""
# STEP 1: Human Detection
person_dets = self.yolo_detector.detect(frame)
# STEP 2: Face Detection (within person regions)
face_dets = []
for det in person_dets:
person_crop = crop_region(frame, det.bbox)
faces = self.face_detector.detect(person_crop)
face_dets.extend(faces)
# STEP 3: Face Alignment + Quality
aligned_faces = []
for face in face_dets:
aligned = align_face(frame, face.landmarks)
quality = self.quality_checker.score(aligned)
if quality.passed:
aligned_faces.append((aligned, quality.score, face))
# STEP 4: Face Recognition (batch)
if aligned_faces:
embeddings = self.face_recognizer.embed(
[f[0] for f in aligned_faces]
)
# STEP 5: Identity Matching
for emb, (aligned, quality, face) in zip(embeddings, aligned_faces):
matches = self.face_db.search(emb, top_k=5)
identity = self.classify_identity(emb, matches)
face.identity = identity
# STEP 6: Person Tracking
tracks = self.tracker_pool[camera_id].update(person_dets)
# STEP 7: Associate face identities with person tracks
self.associate_faces_with_tracks(tracks, face_dets)
# STEP 8: Unknown clustering (periodic batch)
self.clustering.update_periodic()
# STEP 9: Evidence capture
self.evidence.capture_events(tracks, camera_id, timestamp)
return tracks
8.3 Batch Processing Strategy
For GPU efficiency, frames are processed in batched groups:
| Batch Type | Batch Size | Frequency | GPU Utilization |
|---|---|---|---|
| Human Detection | 8 frames | Every frame decode | ~85% |
| Face Detection | Variable (up to 32 faces) | Per 2 frames | ~60% |
| Face Recognition | Up to 32 faces | Per 2 frames | ~75% |
| Tracking | Per stream | Every frame | CPU-bound |
8.4 GPU Utilization Strategy
GPU 0 (Primary - T4 / A10):
├─ Stream 0-1: YOLO11m detection
├─ Stream 0-1: SCRFD face detection
├─ Stream 0-1: ArcFace R100 recognition
└─ TensorRT Context 0: All models (shared)
GPU 1 (Optional - V100 / A100 for scale):
├─ Stream 2-3: Same pipeline
└─ TensorRT Context 1: Dedicated context
CPU (x86_64):
├─ Stream decode (FFmpeg, 8 threads)
├─ ByteTrack association (all streams)
├─ Face alignment + quality (OpenCV)
├─ HDBSCAN clustering (background thread)
├─ Evidence I/O (async thread pool)
└─ API server (FastAPI, 4 workers)
8.5 Performance Budget (Per 8-Stream System)
| Pipeline Stage | Per-Frame Cost | 8-Stream Aggregate | GPU % |
|---|---|---|---|
| Frame decode | ~2ms | 16ms (parallel) | — |
| YOLO11m detection | ~4.7ms | ~37.6ms (batched) | 35% |
| SCRFD face detection | ~2.5ms avg | ~20ms (batched) | 20% |
| Face alignment + quality | ~0.3ms | ~2.4ms (CPU) | — |
| ArcFace R100 recognition | ~6ms avg | ~48ms (batched) | 45% |
| ByteTrack tracking | ~1ms | ~8ms (CPU) | — |
| Vector search | ~1ms | ~8ms (CPU) | — |
| Evidence capture | ~2ms | ~16ms (async I/O) | — |
| Total effective | — | ~30-35ms end-to-end | — |
| Effective throughput | — | ~28 FPS per stream | 100% |
Target: 15-20 FPS processed per stream at 960x1080 with batching optimizations; the ~28 FPS ceiling above leaves headroom for load spikes.
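As a sanity check on the budget, the throughput ceiling follows directly from the end-to-end latency figure, assuming one frame in flight per stream:

```python
# Derived from the budget table: ~35 ms worst-case end-to-end latency.
end_to_end_ms = 35.0
fps_ceiling = 1000.0 / end_to_end_ms  # one frame in flight per stream
print(f"per-stream ceiling: {fps_ceiling:.1f} FPS")  # ~28.6, vs. the 15-20 FPS target
```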
9. Model Selection Summary Table
| Component | Model Choice | Framework | Input Size | FPS Target (T4) | Accuracy Metric |
|---|---|---|---|---|---|
| Human Detection | YOLO11m (Ultralytics) | TensorRT FP16 | 640 x 640 | 213 FPS (batch=8) | 51.5% mAP@50-95 COCO; ~78% person AP |
| Face Detection | SCRFD-500M-BNKPS (InsightFace) | TensorRT FP16 | 640 x 640 | ~400 FPS (batch=32) | 90.6% AP-Easy, 87.0% AP-Med, 72.0% AP-Hard (WIDERFACE) |
| Face Recognition | ArcFace R100 IR-SE100 (InsightFace, MS1MV3) | TensorRT FP16 | 112 x 112 | ~170 FPS (batch=32) | 99.83% LFW, 98.27% CFP-FP, 96.1% IJB-C@1e-4 |
| Person Tracking | ByteTrack (BYTE association, Kalman filter) | NumPy/OpenCV | — | >500 FPS (association only) | 80.3% MOTA, 77.3% IDF1, 63.1% HOTA (MOT17) |
| Unknown Clustering | HDBSCAN (hdbscan library) + DBSCAN fallback | scikit-learn/hdbscan | 512-D embeddings | <100ms per batch | 89.5% cluster purity, BCubed F > 0.85 |
| Vector Search | FAISS (IndexFlatIP) or Milvus | FAISS/Milvus | 512-D vectors | <5ms per query | Exact nearest neighbor (cosine) |
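For the vector-search row: on L2-normalized embeddings, the inner product that `IndexFlatIP` computes equals cosine similarity, so exact search is a single matrix-vector product. A NumPy sketch of the equivalent operation (`cosine_search` is our name, not a FAISS API):

```python
import numpy as np


def cosine_search(query: np.ndarray, gallery: np.ndarray, top_k: int = 5):
    """Exact cosine top-k over a gallery of embeddings.

    Equivalent to FAISS IndexFlatIP when both sides are L2-normalized,
    which is what the summary table specifies.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                       # inner product == cosine after normalization
    idx = np.argsort(-sims)[:top_k]    # highest similarity first
    return idx, sims[idx]
```

Brute-force search stays under the table's 5 ms budget for galleries up to the low hundreds of thousands of 512-D vectors; beyond that, the failure-mode table's `IndexIVF` migration applies.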
10. Technology Stack
10.1 Deep Learning Framework
| Layer | Technology | Version | Purpose |
|---|---|---|---|
| Training | PyTorch | 2.2+ | Model fine-tuning, research |
| Export | ONNX | 1.15+ | Model portability |
| GPU Inference | TensorRT | 8.6+ / 10.0+ | Production inference optimization |
| CPU Inference | ONNX Runtime | 1.16+ | CPU fallback for edge |
| CPU (Intel) | OpenVINO | 2024.0+ | Intel-optimized inference |
10.2 Model Serving Architecture
┌─────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Docker Container: ai-vision-pipeline │ │
│ │ Base: nvidia/cuda:12.1-runtime-ubuntu22.04 │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ │
│ │ │ TensorRT │ │ OpenCV │ │ FastAPI │ │ │
│ │ │ Engine │ │ 4.9+ │ │ Server │ │ │
│ │ │ (TRT 10) │ │ (CUDA) │ │ (uvicorn) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ │
│ │ │ FAISS │ │ hdbscan │ │ Kafka / Redis │ │ │
│ │ │ (vectors) │ │ (cluster) │ │ (event bus) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Pipeline Orchestrator (Python asyncio) │ │ │
│ │ │ - Stream reader threads (8x FFmpeg) │ │ │
│ │ │ - GPU inference queue │ │ │
│ │ │ - CPU post-processing workers │ │ │
│ │ │ - Evidence async writer │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Docker Container: ai-vision-api │ │
│ │ - REST API for configuration │ │
│ │ - WebSocket for real-time events │ │
│ │ - Database: PostgreSQL + pgvector │ │
│ │ - Object storage: MinIO (evidence media) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
10.3 GPU Requirements
| Deployment Mode | Minimum GPU | Recommended GPU | Notes |
|---|---|---|---|
| Edge Gateway | NVIDIA Jetson Orin Nano 8GB | Jetson Orin NX 16GB | INT8 quantization, 5-8 FPS per stream |
| Edge Server | NVIDIA T4 16GB | NVIDIA A10 24GB | FP16, full 8-stream real-time |
| Cloud Processing | NVIDIA T4 16GB | NVIDIA V100 32GB | FP16, 8+ streams, batching |
| Development | NVIDIA RTX 3080 10GB | NVIDIA RTX 4090 24GB | Full pipeline debugging |
10.4 CPU Fallback Options
When GPU is unavailable, the pipeline falls back to CPU-optimized models:
| Component | GPU Model | CPU Fallback | CPU Latency |
|---|---|---|---|
| Human Detection | YOLO11m TensorRT | YOLO11n ONNX + OpenVINO | ~56ms/frame |
| Face Detection | SCRFD TensorRT | YuNet OpenCV DNN | ~3ms/frame |
| Face Recognition | ArcFace R100 TensorRT | ArcFace MobileFaceNet ONNX | ~15ms/face |
| Tracking | ByteTrack (CPU) | ByteTrack (CPU) | ~2ms/frame |
Note: CPU fallback processes at ~5-8 FPS per stream. For full 8-stream real-time, GPU acceleration is required.
10.5 Docker Compose Configuration
# docker-compose.yml
version: '3.8'
services:
ai-vision-pipeline:
image: surveillance/ai-vision-pipeline:1.0.0
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=0
- CUDA_VISIBLE_DEVICES=0
- PIPELINE_WORKERS=4
- STREAM_COUNT=8
- DETECTION_MODEL=/models/yolo11m.engine
- FACE_MODEL=/models/scrfd_500m.engine
- RECOGNITION_MODEL=/models/arcface_r100.engine
- DETECTION_SENSITIVITY=balanced
- FACE_MATCH_STRICTNESS=balanced
volumes:
- ./models:/models:ro
- ./evidence:/evidence
- ./config:/config:ro
ports:
- "8080:8080" # REST API
- "8081:8081" # WebSocket events
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
depends_on:
- redis
- minio
- postgres
redis:
image: redis:7-alpine
ports:
- "6379:6379"
postgres:
image: pgvector/pgvector:pg16
environment:
POSTGRES_DB: surveillance
POSTGRES_USER: ai_pipeline
POSTGRES_PASSWORD: ${DB_PASSWORD}
volumes:
- pgdata:/var/lib/postgresql/data
ports:
- "5432:5432"
minio:
image: minio/minio:latest
command: server /data --console-address ":9001"
environment:
MINIO_ROOT_USER: ${MINIO_USER}
MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}
volumes:
- miniodata:/data
ports:
- "9000:9000"
- "9001:9001"
volumes:
pgdata:
miniodata:
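A minimal sketch of how the pipeline container might parse the environment variables defined above into typed configuration. `PipelineEnv` and the preset values are assumptions for illustration; in the actual design the sensitivity-to-threshold mapping lives in `vibe_presets.yaml`:

```python
import os
from dataclasses import dataclass

# Hypothetical preset values; the real mapping lives in vibe_presets.yaml.
SENSITIVITY_PRESETS = {"relaxed": 0.50, "balanced": 0.35, "aggressive": 0.25}


@dataclass
class PipelineEnv:
    stream_count: int
    detection_model: str
    confidence_threshold: float

    @classmethod
    def from_env(cls) -> "PipelineEnv":
        """Read the docker-compose environment block into typed fields."""
        preset = os.environ.get("DETECTION_SENSITIVITY", "balanced")
        return cls(
            stream_count=int(os.environ.get("STREAM_COUNT", "8")),
            detection_model=os.environ.get("DETECTION_MODEL", "/models/yolo11m.engine"),
            confidence_threshold=SENSITIVITY_PRESETS[preset],
        )
```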
10.6 Python Module Structure
ai_vision_pipeline/
├── pyproject.toml # Poetry/pip dependencies
├── Dockerfile
├── docker-compose.yml
├── config/
│ ├── pipeline.yaml # Main pipeline configuration
│ ├── yolo11m_detection.yaml
│ ├── scrfd_face_detection.yaml
│ ├── arcface_recognition.yaml
│ ├── bytetrack.yaml
│ ├── clustering.yaml
│ └── vibe_presets.yaml
├── models/
│ ├── yolo11m.engine # TensorRT engine (YOLO11m)
│ ├── scrfd_500m_bnkps.engine # TensorRT engine (SCRFD)
│ ├── arcface_r100.engine # TensorRT engine (ArcFace)
│ └── yunet.onnx # CPU fallback (YuNet)
├── src/
│ ├── __init__.py
│ ├── main.py # Entry point
│ ├── config.py # Configuration loader
│ ├── pipeline/
│ │ ├── __init__.py
│ │ ├── orchestrator.py # MultiStreamPipeline
│ │ ├── stream_reader.py # RTSP/FFmpeg frame capture
│ │ └── frame_buffer.py # Ring buffer management
│ ├── detection/
│ │ ├── __init__.py
│ │ ├── yolo_detector.py # YOLO11m inference wrapper
│ │ └── detector_base.py # Abstract detector interface
│ ├── face/
│ │ ├── __init__.py
│ │ ├── face_detector.py # SCRFD inference wrapper
│ │ ├── face_recognizer.py # ArcFace inference wrapper
│ │ ├── face_aligner.py # 5-point alignment
│ │ ├── quality_checker.py # Blur/pose/illumination
│ │ └── embedding_store.py # Vector DB operations
│ ├── tracking/
│ │ ├── __init__.py
│ │ ├── bytetrack.py # ByteTrack implementation
│ │ ├── kalman_filter.py # Kalman filter
│ │ ├── track_manager.py # Track lifecycle management
│ │ └── matching.py # IoU / embedding matching
│ ├── clustering/
│ │ ├── __init__.py
│ │ ├── hdbscan_engine.py # HDBSCAN wrapper
│ │ ├── cluster_manager.py # Cluster CRUD + merge logic
│ │ └── cluster_profile.py # Cluster data model
│ ├── evidence/
│ │ ├── __init__.py
│ │ ├── capture_manager.py # Evidence capture orchestrator
│ │ ├── deduplicator.py # Deduplication logic
│ │ ├── storage.py # File system + object storage
│ │ └── metadata.py # EvidenceRecord dataclass
│ ├── confidence/
│ │ ├── __init__.py
│ │ ├── scorer.py # Aggregate confidence computation
│ │ ├── threshold_manager.py # Dynamic threshold adjustment
│ │ └── vibe_mapper.py # Vibe settings -> thresholds
│ ├── inference/
│ │ ├── __init__.py
│ │ ├── tensorrt_wrapper.py # Generic TensorRT inference
│ │ ├── onnx_wrapper.py # ONNX Runtime inference
│ │ └── batch_processor.py # Dynamic batching logic
│ ├── api/
│ │ ├── __init__.py
│ │ ├── server.py # FastAPI application
│ │ ├── routes/
│ │ │ ├── detection.py # Detection config API
│ │ │ ├── faces.py # Face database API
│ │ │ ├── tracks.py # Track query API
│ │ │ ├── evidence.py # Evidence retrieval API
│ │ │ └── settings.py # Vibe settings API
│ │ └── websocket.py # Real-time event streaming
│ └── utils/
│ ├── __init__.py
│ ├── logger.py # Structured logging
│ ├── metrics.py # Prometheus metrics
│ ├── time_utils.py # Timestamp handling
│ └── image_utils.py # Crop, resize, encode
├── tests/
│ ├── unit/
│ ├── integration/
│ └── benchmarks/
└── scripts/
├── export_tensorrt.py # Convert .pt -> .onnx -> .engine
├── calibrate_int8.py # INT8 calibration with custom data
├── benchmark_pipeline.py # End-to-end benchmark
└── setup_vector_db.py # Initialize FAISS/Milvus index
10.7 Core Inference Code Architecture
# src/inference/tensorrt_wrapper.py — Generic TensorRT inference engine
import numpy as np
import pycuda.autoinit  # noqa: F401 — initializes the CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class TensorRTInference:
    """Generic TensorRT inference wrapper supporting dynamic batch sizes."""

    def __init__(self, engine_path: str, max_batch_size: int = 32):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        with open(engine_path, "rb") as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.max_batch_size = max_batch_size
        self.stream = cuda.Stream()
        # Allocate GPU buffers
        self.inputs = []
        self.outputs = []
        self._allocate_buffers()

    def _allocate_buffers(self):
        """Allocate pinned host and device memory for all I/O tensors."""
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            mode = self.engine.get_tensor_mode(name)
            shape = self.engine.get_tensor_shape(name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            # Dynamic dims are reported as -1; size buffers for max_batch_size.
            per_item = trt.volume([d if d > 0 else 1 for d in shape])
            host_mem = cuda.pagelocked_empty(per_item * self.max_batch_size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # execute_async_v3 reads tensor addresses registered on the context.
            self.context.set_tensor_address(name, int(device_mem))
            buf = {"name": name, "host": host_mem,
                   "device": device_mem, "shape": shape, "dtype": dtype}
            if mode == trt.TensorIOMode.INPUT:
                self.inputs.append(buf)
            else:
                self.outputs.append(buf)

    def infer(self, input_batch: np.ndarray) -> list[np.ndarray]:
        """Execute inference on a batched input."""
        batch_size = input_batch.shape[0]
        # Copy input to pinned memory
        np.copyto(self.inputs[0]["host"][:input_batch.size], input_batch.ravel())
        # Set the dynamic batch dimension for this call
        input_shape = list(self.inputs[0]["shape"])
        input_shape[0] = batch_size
        self.context.set_input_shape(self.inputs[0]["name"], input_shape)
        # Transfer H2D
        cuda.memcpy_htod_async(self.inputs[0]["device"],
                               self.inputs[0]["host"], self.stream)
        # Execute
        self.context.execute_async_v3(stream_handle=self.stream.handle)
        # Transfer D2H
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out["host"], out["device"], self.stream)
        self.stream.synchronize()
        # Reshape outputs to the actual batch size
        results = []
        for out in self.outputs:
            out_shape = [d if d > 0 else batch_size for d in out["shape"]]
            out_shape[0] = batch_size
            results.append(out["host"][:np.prod(out_shape)].reshape(out_shape))
        return results

    def __del__(self):
        try:
            self.stream.synchronize()
        except Exception:
            pass  # CUDA context may already be torn down at interpreter exit
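The engines above expect letterboxed 640x640 NCHW batches (Section 1.2). A dependency-free preprocessing sketch; production code would use `cv2.resize`, but nearest-neighbor indexing keeps this self-contained:

```python
import numpy as np


def letterbox_batch(frames: list, size: int = 640) -> np.ndarray:
    """Scale-and-pad BGR frames into a square NCHW float16 batch.

    Illustrative sketch of the letterboxing step (960x1080 -> 640x640)
    that feeds TensorRTInference.infer(); gray padding value 114 follows
    the common YOLO convention.
    """
    batch = np.full((len(frames), 3, size, size), 114 / 255.0, dtype=np.float16)
    for i, img in enumerate(frames):
        h, w = img.shape[:2]
        scale = size / max(h, w)
        nh, nw = int(round(h * scale)), int(round(w * scale))
        # Nearest-neighbor resize via integer index maps (cv2-free).
        ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
        xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
        resized = img[ys][:, xs]
        top, left = (size - nh) // 2, (size - nw) // 2
        chw = resized.transpose(2, 0, 1)[::-1] / 255.0  # BGR->RGB, HWC->CHW, 0-1
        batch[i, :, top:top + nh, left:left + nw] = chw
    return batch
```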
10.8 Key Dependencies
# pyproject.toml dependencies
[tool.poetry.dependencies]
python = "^3.10"
torch = "^2.2.0"
torchvision = "^0.17.0"  # release paired with torch 2.2
tensorrt = "^10.0.0"
pycuda = "^2024.1"
onnxruntime-gpu = "^1.16.0"
opencv-python = "^4.9.0"
numpy = "^1.26.0"
scipy = "^1.12.0"
scikit-learn = "^1.4.0"
hdbscan = "^0.8.33"
faiss-gpu = "^1.7.2"  # PyPI wheels stop at 1.7.2; newer GPU builds ship via conda
pydantic = "^2.6.0"
fastapi = "^0.109.0"
uvicorn = "^0.27.0"
websockets = "^12.0"
redis = "^5.0.0"  # aioredis was merged into redis-py; use redis.asyncio
asyncpg = "^0.29.0"
minio = "^7.2.0"
prometheus-client = "^0.20.0"
structlog = "^24.1.0"
python-multipart = "^0.0.9"
pillow = "^10.2.0"
11. Performance Summary & Benchmarks
11.1 Target System Performance
| Metric | Target | Notes |
|---|---|---|
| Processed FPS per stream | 15-20 FPS | At 960x1080 input |
| Total system throughput | 120-160 FPS aggregate | 8 streams simultaneously |
| End-to-end latency | < 100ms | Frame in -> result out |
| GPU memory | < 10 GB | All 3 TensorRT engines loaded |
| System RAM | < 16 GB | Buffers + clustering + API |
| Storage growth | ~100 GB/month | With selective full-frame storage |
| Concurrent API clients | 50+ | WebSocket event subscribers |
11.2 Accuracy Targets on Surveillance Data
| Task | Metric | Target |
|---|---|---|
| Human Detection | mAP@50 (person) | > 75% |
| Human Detection | Recall@0.5IoU | > 85% |
| Face Detection | AP (medium) | > 85% |
| Face Detection | Min face size | 20x20 px |
| Face Recognition | Rank-1 accuracy (known persons) | > 98% |
| Face Recognition | False acceptance rate | < 0.1% |
| Tracking | MOTA | > 75% |
| Tracking | IDF1 | > 70% |
| Tracking | ID switches / 100 frames | < 2 |
| Clustering | Purity | > 89% |
| Clustering | BCubed F-Measure | > 0.85 |
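The FAR target above can be turned into a concrete match threshold by calibrating on impostor-pair similarity scores. `threshold_for_far` is an illustrative helper, not pipeline code:

```python
import numpy as np


def threshold_for_far(impostor_scores: np.ndarray, target_far: float = 0.001) -> float:
    """Cosine-similarity threshold at which roughly target_far of
    impostor pairs would be (falsely) accepted.

    Sketch of calibrating toward the <0.1% FAR target; with discrete
    score samples the achieved FAR can land slightly above the target.
    """
    # Take the (1 - target_far) quantile of impostor scores, so only the
    # top target_far fraction of impostors scores at or above it.
    return float(np.quantile(impostor_scores, 1.0 - target_far))
```

In practice the impostor scores come from cross-matching known-distinct identities in the face database; the resulting threshold then feeds the strictness presets.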
11.3 Failure Modes & Mitigations
| Failure Mode | Detection | Mitigation |
|---|---|---|
| GPU memory exhaustion | Monitor nvidia-smi | Reduce batch size, enable model streaming |
| Frame drop in decode | Monitor FFmpeg buffer | Increase ring buffer, enable HW decode |
| High false positive rate | Track review queue | Auto-increase detection threshold |
| Track fragmentation | Monitor ID switches | Tune ByteTrack track_buffer parameter |
| Cluster contamination | Monitor cluster purity | Lower DBSCAN eps, enable merge review |
| Vector DB latency growth | Query latency histogram | Switch from IndexFlat to IndexIVF |
| Disk space exhaustion | Storage capacity alert | Auto-archive evidence > 90 days |
12. Appendix A: Model Export Commands
# 1. Export YOLO11m to TensorRT
python -c "
from ultralytics import YOLO
model = YOLO('yolo11m.pt')
model.export(format='onnx', imgsz=640, opset=17, dynamic=True, simplify=True)
"
/usr/src/tensorrt/bin/trtexec \
--onnx=yolo11m.onnx \
--saveEngine=yolo11m.engine \
--fp16 \
--minShapes=images:1x3x640x640 \
--optShapes=images:8x3x640x640 \
--maxShapes=images:16x3x640x640
# 2. Export SCRFD-500M to TensorRT (via ONNX)
python scripts/export_scrfd_onnx.py \
--config configs/scrfd_500m_bnkps.py \
--checkpoint scrfd_500m_bnkps.pth \
--input-img test.jpg \
--shape 640 640 \
--show
/usr/src/tensorrt/bin/trtexec \
--onnx=scrfd_500m.onnx \
--saveEngine=scrfd_500m.engine \
--fp16
# 3. Export ArcFace R100 to TensorRT
# Recent InsightFace model packs ship ArcFace weights as ONNX directly
# (e.g. the antelopev2 pack includes glintr100.onnx, a ResNet-100 ArcFace
# model); place the file at arcface_r100.onnx before building the engine.
/usr/src/tensorrt/bin/trtexec \
--onnx=arcface_r100.onnx \
--saveEngine=arcface_r100.engine \
--fp16 \
--minShapes=input.1:1x3x112x112 \
--optShapes=input.1:32x3x112x112 \
--maxShapes=input.1:64x3x112x112
13. Appendix B: INT8 Calibration
# scripts/calibrate_int8.py
import tensorrt as trt
from src.inference.calibrator import SurveillanceCalibrator
calibrator = SurveillanceCalibrator(
calibration_data_dir="/data/calibration/surveillance_500frames",
cache_file="yolo11m_calibration.cache",
input_shape=(8, 3, 640, 640),
max_batches=100
)
config = {
"onnx_file": "yolo11m.onnx",
"engine_file": "yolo11m_int8.engine",
"precision": "int8",
"calibrator": calibrator,
"max_batch_size": 16,
"workspace_mb": 4096,
}
# INT8 engine provides 3.5x speedup with <0.5% mAP drop
# Requires 500+ representative frames from target cameras
14. Appendix C: Performance Benchmark Script
# scripts/benchmark_pipeline.py
import time
import statistics
from src.pipeline.orchestrator import MultiStreamPipeline
BENCHMARK_DURATION = 300 # 5 minutes
WARMUP_FRAMES = 60
def benchmark():
pipeline = MultiStreamPipeline.from_config("config/pipeline.yaml")
# Warmup
for _ in range(WARMUP_FRAMES):
pipeline.process_frame("CAM_01", dummy_frame, datetime.now())
# Benchmark
latencies = []
start = time.monotonic()
while time.monotonic() - start < BENCHMARK_DURATION:
t0 = time.perf_counter()
pipeline.process_frame("CAM_01", dummy_frame, datetime.now())
latencies.append((time.perf_counter() - t0) * 1000) # ms
print(f"Mean latency: {statistics.mean(latencies):.1f}ms")
print(f"P50 latency: {statistics.median(latencies):.1f}ms")
print(f"P95 latency: {sorted(latencies)[int(len(latencies)*0.95)]:.1f}ms")
print(f"P99 latency: {sorted(latencies)[int(len(latencies)*0.99)]:.1f}ms")
print(f"Throughput: {len(latencies) / BENCHMARK_DURATION:.1f} FPS")
if __name__ == "__main__":
benchmark()
Document Version: 1.0.0 | Generated for CP PLUS 8-Channel DVR Surveillance Platform All model specifications and benchmarks reflect publicly available data as of July 2025