# Self-Learning & Training System Design for Surveillance AI Platform

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [Data Collection Pipeline](#2-data-collection-pipeline)
3. [Candidate Training Examples](#3-candidate-training-examples)
4. [Label Conflict Detection](#4-label-conflict-detection)
5. [Approval Workflow](#5-approval-workflow)
6. [Retraining Pipeline](#6-retraining-pipeline)
7. [Model Versioning](#7-model-versioning)
8. [Quality Gates](#8-quality-gates)
9. [Embedding Update Strategy](#9-embedding-update-strategy)
10. [Training Dataset Management](#10-training-dataset-management)
11. [Performance Monitoring](#11-performance-monitoring)
12. [UI for Training Management](#12-ui-for-training-management)
13. [Database Schema](#13-database-schema)
14. [API Specification](#14-api-specification)
15. [Implementation Roadmap](#15-implementation-roadmap)

---

## 1. Architecture Overview

### 1.1 System Architecture Diagram

```
+-------------------+     +-------------------+     +-------------------+
|  Face Recognition |     |  User Correction  |     |   Admin Console   |
|  Pipeline (Prod)  |     |  UI / Actions     |     |  (Training Mgmt)  |
+--------+----------+     +--------+----------+     +--------+----------+
         |                         |                         |
         v                         v                         v
+-------------------+     +-------------------+     +-------------------+
|  Detection Store  |     |  Training Event   |     |  Training Queue   |
|  (Face Detections)|     |  Collector        |     |  & Scheduler      |
+-------------------+     +--------+----------+     +--------+----------+
         |                         |                         |
         v                         v                         v
+-------------------+     +-------------------+     +-------------------+
|  Embedding Index  |<--->|  Training Data    |     |  Training Engine  |
|  (FAISS/Milvus)   |     |  Warehouse        |     |  (GPU Workers)    |
+-------------------+     +--------+----------+     +--------+----------+
         |                         |                         |
         v                         v                         v
+-------------------+     +-------------------+     +-------------------+
|  Identity DB      |     |  Conflict Detector|     |  Model Registry   |
|  (Postgres)       |     |  (Auto-blocker)   |     |  (MLflow)         |
+-------------------+     +-------------------+     +-------------------+
                                                               |
                                                         +-----+-----+
                                                         v           v
                                                  +-----------+ +-----------+
                                                  |  Quality  | |  A/B Test |
                                                  |  Gates    | |  Manager  |
                                                  +-----------+ +-----------+
```

### 1.2 Core Principles

| Principle | Description |
|-----------|-------------|
| **Never Silent** | All training actions are explicit, logged, and auditable |
| **Safety First** | Conflicting data is automatically blocked from training |
| **Rollback Ready** | Every deployed model can be rolled back within 60 seconds |
| **Human in Loop** | User corrections drive learning, but humans approve changes |
| **Incremental by Default** | Prefer embedding updates over full retraining |
| **Observable** | All training jobs, model versions, and metrics are tracked |

### 1.3 Technology Stack

| Component | Technology | Purpose |
|-----------|-----------|---------|
| Training Engine | PyTorch + PyTorch Lightning | Model training and fine-tuning |
| Model Registry | MLflow | Model versioning and artifact storage |
| GPU Orchestration | Kubernetes + KEDA | Auto-scaling GPU training jobs |
| Embedding Store | FAISS (local) / Milvus (distributed) | Vector search for face matching |
| Metadata DB | PostgreSQL | Training events, labels, model versions |
| Object Storage | MinIO / S3 | Training images and model artifacts |
| Message Queue | Redis / Apache Kafka | Training job queue and events |
| Pipeline Orchestration | Apache Airflow | Scheduled retraining workflows |
| Monitoring | Prometheus + Grafana | Model performance metrics |

---

## 2. Data Collection Pipeline

### 2.1 Training Event Types

The system captures five categories of training events from user interactions:

```
+------------------+------------------------------------------------------+
| Event Type       | Description                                          |
+------------------+------------------------------------------------------+
| IDENTITY_NAMED   | User assigns a name to an unknown face detection     |
| MATCH_CONFIRMED  | User confirms an identity match is correct           |
| MATCH_CORRECTED  | User corrects a wrong match to a different identity  |
| IDENTITY_MERGED  | User merges two duplicate identities into one        |
| MATCH_REJECTED   | User explicitly rejects a match suggestion           |
+------------------+------------------------------------------------------+
```
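These categories map naturally onto an enum. A minimal sketch (only the names come from the table above; the string values are assumptions):

```python
from enum import Enum

class TrainingEventType(str, Enum):
    """Categories of training events captured from user interactions."""
    IDENTITY_NAMED = "identity_named"    # User names an unknown face
    MATCH_CONFIRMED = "match_confirmed"  # User confirms a correct match
    MATCH_CORRECTED = "match_corrected"  # User reassigns a wrong match
    IDENTITY_MERGED = "identity_merged"  # User merges duplicate identities
    MATCH_REJECTED = "match_rejected"    # User rejects a match suggestion
```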

### 2.2 Event Collection Flow

```
User Action in UI
       |
       v
+--------------+     +------------------+     +---------------------+
| Capture Event| --> | Enrich Metadata  | --> | Write to Training   |
| (API Call)   |     | (face quality,   |     | Event Queue (Redis) |
+--------------+     |  source, user)   |     +---------------------+
                     +------------------+              |
                                                       v
                                              +------------------+
                                              | Async Processor  |
                                              | (Celery Worker)  |
                                              +--------+---------+
                                                       |
                              +------------------------+------------------------+
                              |                        |                        |
                              v                        v                        v
                       +-------------+          +-------------+          +-------------+
                       | Quality     |          | Label Index |          | Conflict    |
                       | Filter      |          | Update      |          | Detector    |
                       +-------------+          +-------------+          +-------------+
```
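The capture-and-enqueue step of this flow can be sketched as follows, assuming a redis-py style client and a hypothetical `training:event_queue` key:

```python
import json
import uuid
from datetime import datetime, timezone

def enqueue_training_event(redis_client, event_type: str, detection_id: str,
                           identity_id: str, user_id: str) -> str:
    """Capture a user action, enrich it with basic metadata, and push it
    onto the Redis training-event queue for the async Celery processor."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "detection_id": detection_id,
        "identity_id": identity_id,
        "reviewed_by_user_id": user_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # LPUSH here plus BRPOP on the worker side gives a simple FIFO queue
    redis_client.lpush("training:event_queue", json.dumps(event))
    return event["event_id"]
```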

### 2.3 Quality Filtering Criteria

Every training example must pass the following quality gates before being accepted:

```python
class QualityFilterConfig:
    """Quality thresholds for training data acceptance."""

    # Face quality thresholds
    min_face_size_px: int = 80               # Minimum face dimension in pixels
    min_confidence: float = 0.85             # Face detection confidence
    min_image_quality_score: float = 0.6     # Blur/lighting quality score

    # Pose thresholds
    max_yaw_degrees: float = 45.0            # Maximum head yaw angle
    max_pitch_degrees: float = 30.0          # Maximum head pitch angle
    max_roll_degrees: float = 30.0           # Maximum head roll angle

    # Occlusion thresholds
    max_occlusion_ratio: float = 0.3         # Maximum occluded face area

    # Consistency thresholds
    min_embedding_quality: float = 0.7       # Embedding vector quality metric

    # Acceptance logic
    require_frontal: bool = False            # If true, only near-frontal faces
    require_eyes_visible: bool = True        # Both eyes must be visible
```
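Applying these thresholds is a single predicate. A self-contained sketch (the `FaceSample` shape is illustrative, not the production schema):

```python
from dataclasses import dataclass

@dataclass
class FaceSample:
    """Minimal view of a detection used for filtering (illustrative shape)."""
    face_size_px: int
    confidence: float
    image_quality: float
    yaw: float
    pitch: float
    roll: float
    occlusion_ratio: float

def passes_quality_filter(sample: FaceSample) -> bool:
    """Apply the QualityFilterConfig thresholds above as one predicate."""
    return (
        sample.face_size_px >= 80            # min_face_size_px
        and sample.confidence >= 0.85        # min_confidence
        and sample.image_quality >= 0.6      # min_image_quality_score
        and abs(sample.yaw) <= 45.0          # max_yaw_degrees
        and abs(sample.pitch) <= 30.0        # max_pitch_degrees
        and abs(sample.roll) <= 30.0         # max_roll_degrees
        and sample.occlusion_ratio <= 0.3    # max_occlusion_ratio
    )
```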

### 2.4 Training Event Schema

```python
class TrainingEvent(BaseModel):
    """A single training data capture event."""

    event_id: UUID                    # Unique event identifier
    event_type: TrainingEventType     # NAMED, CONFIRMED, CORRECTED, MERGED, REJECTED
    timestamp: datetime               # Event timestamp (UTC)

    # Source detection
    detection_id: UUID                # Reference to original face detection
    face_image_path: str              # S3/MinIO path to cropped face image
    full_frame_path: str              # S3/MinIO path to source frame (optional)

    # Identity information
    identity_id: UUID                 # Assigned/corrected identity ID
    previous_identity_id: UUID | None # For corrections: previous (wrong) identity
    is_unknown: bool                  # Whether this was previously unknown

    # Embedding data
    embedding_vector: list[float]     # 512-dim face embedding
    embedding_model_version: str      # Model version that produced embedding

    # Quality metrics
    face_quality_score: float         # Combined quality score (0-1)
    face_size_px: int                 # Face bounding box size
    detection_confidence: float       # Face detection confidence
    pose_angles: dict[str, float]     # yaw, pitch, roll

    # Metadata
    source_camera_id: str | None      # Camera that captured the face
    captured_at: datetime             # Original capture timestamp
    reviewed_by_user_id: UUID         # User who performed the action
    review_timestamp: datetime        # When user reviewed
    review_method: str                # "manual_click", "bulk_approve", "auto"

    # Training status
    training_status: EventStatus      # pending, approved, rejected, conflicted
    conflict_id: UUID | None          # Set if a conflict was detected
    approved_at: datetime | None
```

### 2.5 Event Collection API

```python
# POST /api/v1/training/events
class SubmitTrainingEventRequest(BaseModel):
    event_type: TrainingEventType
    detection_id: UUID
    identity_id: UUID
    previous_identity_id: UUID | None = None
    reviewed_by_user_id: UUID
    review_method: str = "manual_click"
    confidence_override: float | None = None  # User confidence in correction
```

### 2.6 Bulk Event Collection

For batch operations (e.g., bulk confirmation of matches):

```python
# POST /api/v1/training/events/bulk
class BulkTrainingEventRequest(BaseModel):
    events: list[SubmitTrainingEventRequest]
    common_fields: dict               # Shared fields (user_id, timestamp, etc.)
    auto_approve: bool = False        # If True, skip suggestion queue

# Processing: Events are validated, quality-filtered, and queued
# Response: { accepted: int, rejected: int, conflicts: int, event_ids: list[UUID] }
```
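The validate/filter/queue step and the documented response shape can be sketched as follows; the `quality_check` and `conflict_check` callables stand in for the real filters:

```python
from typing import Callable

def process_bulk_events(events: list[dict],
                        quality_check: Callable[[dict], bool],
                        conflict_check: Callable[[dict], bool]) -> dict:
    """Validate a batch of events and summarize the outcome in the same
    shape as the bulk endpoint's response."""
    accepted, rejected, conflicts, event_ids = 0, 0, 0, []
    for e in events:
        if not quality_check(e):
            rejected += 1          # failed quality filtering
        elif conflict_check(e):
            conflicts += 1         # blocked pending conflict resolution
        else:
            accepted += 1
            event_ids.append(e["event_id"])
    return {"accepted": accepted, "rejected": rejected,
            "conflicts": conflicts, "event_ids": event_ids}
```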

---

## 3. Candidate Training Examples

### 3.1 Dataset Structure Per Identity

For each identity, the system maintains a structured dataset:

```
Identity: "John Smith" (ID: uuid-1234)
|
|-- Positive Examples (confirmed faces of John Smith)
|   |-- Auto-captured: high-confidence detections confirmed by system
|   |-- User-confirmed: matches user explicitly confirmed
|   |-- User-named: original naming of unknown person
|   |-- Merged-in: faces from merged duplicate identity
|   |
|   +-- Embedding centroid (running average)
|   +-- Confidence radius (max distance from centroid)
|
|-- Negative Examples (faces confirmed NOT John Smith)
|   |-- Corrected-away: faces user moved to different identity
|   |-- Rejected matches: system suggestions user rejected
|   |-- Hard negatives: similar-looking faces from other identities
|   |
|   +-- Hard negative mining pool
|
|-- Statistics
    |-- Total positive examples
    |-- Total negative examples
    |-- Last trained timestamp
    |-- Model version last trained on
    |-- Average embedding quality
```
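The embedding centroid and confidence radius above can be maintained incrementally. A pure-Python sketch (production code would operate on vectors in the embedding index, not plain lists):

```python
import math

def update_centroid(centroid: list[float], n: int,
                    new_embedding: list[float]) -> tuple[list[float], int]:
    """Fold a newly confirmed embedding into the identity's running-average
    centroid without re-reading all past samples."""
    updated = [(c * n + x) / (n + 1) for c, x in zip(centroid, new_embedding)]
    return updated, n + 1

def confidence_radius(embeddings: list[list[float]],
                      centroid: list[float]) -> float:
    """Max Euclidean distance of any confirmed positive from the centroid."""
    return max(
        math.sqrt(sum((x - c) ** 2 for x, c in zip(e, centroid)))
        for e in embeddings
    )
```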

### 3.2 Dataset Balance Logic

```python
class IdentityDatasetManager:
    """Manages training examples per identity with balancing."""

    MIN_POSITIVE_EXAMPLES: int = 5       # Min positives before identity included
    MAX_POSITIVE_EXAMPLES: int = 500     # Cap to prevent over-representation
    MAX_NEGATIVE_EXAMPLES: int = 200     # Cap negatives per identity
    NEGATIVE_POSITIVE_RATIO: float = 0.5  # Negatives per positive (target)

    def build_training_batch(self, identity_id: UUID) -> TrainingBatch:
        """Build a balanced training batch for an identity."""
        positives = self.get_positive_examples(
            identity_id,
            max_count=self.MAX_POSITIVE_EXAMPLES,
            prefer_high_quality=True,
            prefer_recent=True,
            ensure_diversity=True  # Spread across time/cameras
        )

        if len(positives) < self.MIN_POSITIVE_EXAMPLES:
            raise InsufficientSamplesError(
                f"Identity {identity_id} has only {len(positives)} positive "
                f"examples (min: {self.MIN_POSITIVE_EXAMPLES})"
            )

        target_negatives = int(len(positives) * self.NEGATIVE_POSITIVE_RATIO)
        negatives = self.select_negative_examples(
            identity_id,
            count=min(target_negatives, self.MAX_NEGATIVE_EXAMPLES),
            include_hard_negatives=True,
            diversity_weight=0.3
        )

        return TrainingBatch(
            identity_id=identity_id,
            positive_examples=positives,
            negative_examples=negatives,
            positive_count=len(positives),
            negative_count=len(negatives)
        )
```

### 3.3 Hard Negative Mining

```python
class HardNegativeMiner:
    """Find hard negative examples to improve model discrimination."""

    def mine_hard_negatives(
        self,
        identity_id: UUID,
        model_version: str,
        top_k: int = 50
    ) -> list[TrainingExample]:
        """
        Find faces that are similar to the target identity but
        belong to other people. These are 'hard' because the model
        might confuse them.
        """
        centroid = self.get_embedding_centroid(identity_id)

        # Search embedding space for nearest neighbors
        # that belong to OTHER identities
        candidates = self.embedding_index.search(
            query=centroid,
            top_k=top_k * 3,  # Over-fetch to filter
            exclude_identity=identity_id
        )

        # Filter to those within the identity's confidence radius
        radius = self.get_confidence_radius(identity_id)
        hard_negatives = [
            c for c in candidates
            if c.distance < radius * 1.5  # Within 1.5x the identity radius
            and c.identity_id != identity_id
        ]

        # Score by 'hardness' (closer = harder)
        hard_negatives.sort(key=lambda x: x.distance)

        return hard_negatives[:top_k]
```

### 3.4 Dataset Statistics Per Identity

| Metric | Threshold | Action if Not Met |
|--------|-----------|-------------------|
| Positive examples | >= 5 | Identity excluded from training |
| Negative examples | >= 2 (or ratio 0.2) | Use global negatives |
| Embedding quality | >= 0.7 avg | Flag for review |
| Age of newest sample | < 90 days | Consider identity stale |
| Diversity score | >= 0.4 (cameras/time) | Add diversity warning |
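The diversity score is not pinned down above; one plausible definition is the normalized Shannon entropy of the camera distribution (an assumption, not a requirement of the design):

```python
import math
from collections import Counter

def camera_diversity_score(camera_ids: list[str]) -> float:
    """Normalized entropy of the camera distribution: 0.0 when every
    sample comes from one camera, 1.0 when samples are evenly spread."""
    counts = Counter(camera_ids)
    if len(counts) <= 1:
        return 0.0
    total = len(camera_ids)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return entropy / math.log2(len(counts))
```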

---

## 4. Label Conflict Detection

### 4.1 Conflict Types

```
+-------------------+------------------------------------------------------+
| Conflict Type     | Description & Example                                |
+-------------------+------------------------------------------------------+
| SAME_FACE_DIFF_ID | Same face image assigned to different identities     |
|                   | Ex: Face-A labeled as "John" then corrected to "Bob" |
+-------------------+------------------------------------------------------+
| TRANSITIVE_CHAIN  | A->B and B->C implies A->C which may be wrong        |
|                   | Ex: "John" merged into "Bob", but some "John" faces  |
|                   | were previously corrected to "Alice"                 |
+-------------------+------------------------------------------------------+
| SELF_CONTRADICTION| Same user contradicts their own previous label       |
|                   | Ex: User confirmed face as "John", later says "Bob"  |
+-------------------+------------------------------------------------------+
| TEMPORAL_DRIFT    | Identity labels drift over time for same person      |
|                   | Ex: Early captures labeled "Unknown", later "John"   |
+-------------------+------------------------------------------------------+
| MERGE_COLLISION   | Merging identities creates conflicting embeddings    |
|                   | Ex: Two identities with very different face clusters |
+-------------------+------------------------------------------------------+
```
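Corrections and merges form a redirect graph, so TRANSITIVE_CHAIN detection reduces to following each chain (A->B, B->C) to its terminal identity and flagging disagreements or cycles. A sketch of the chain-following step:

```python
def resolve_correction_chain(start: str, redirects: dict[str, str]) -> str:
    """Follow identity corrections/merges to the terminal identity.
    A cycle means the chain is self-contradictory and must be flagged."""
    seen: set[str] = set()
    current = start
    while current in redirects:
        if current in seen:
            raise ValueError(f"Cycle in correction chain at {current}")
        seen.add(current)
        current = redirects[current]
    return current
```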

### 4.2 Conflict Detection Pipeline

```python
class ConflictDetector:
    """Detects and manages label conflicts in training data."""

    async def check_for_conflicts(
        self,
        new_event: TrainingEvent
    ) -> list[LabelConflict]:
        """
        Check if a new training event creates any label conflicts.
        Called synchronously on every training event submission.
        """
        conflicts = []

        # 1. Check SAME_FACE_DIFF_ID
        same_face_events = await self.db.find_events_by_detection(
            new_event.detection_id
        )
        conflicts.extend(
            self._detect_same_face_different_identity(
                new_event, same_face_events
            )
        )

        # 2. Check if new assignment contradicts existing corrections
        if new_event.event_type == TrainingEventType.MATCH_CORRECTED:
            conflicts.extend(
                await self._detect_transitive_conflicts(new_event)
            )

        # 3. Check for self-contradiction by same user
        conflicts.extend(
            await self._detect_user_self_contradiction(new_event)
        )

        # 4. Check merge compatibility
        if new_event.event_type == TrainingEventType.IDENTITY_MERGED:
            conflicts.extend(
                await self._detect_merge_collisions(new_event)
            )

        return conflicts

    def _detect_same_face_different_identity(
        self,
        new_event: TrainingEvent,
        existing_events: list[TrainingEvent]
    ) -> list[LabelConflict]:
        conflicts = []
        for existing in existing_events:
            if existing.identity_id != new_event.identity_id:
                conflict = LabelConflict(
                    conflict_type=ConflictType.SAME_FACE_DIFF_ID,
                    severity=ConflictSeverity.HIGH,
                    detection_id=new_event.detection_id,
                    face_image_path=new_event.face_image_path,
                    conflicting_labels=[
                        (existing.event_id, existing.identity_id),
                        (new_event.event_id, new_event.identity_id)
                    ],
                    description=(
                        f"Same face assigned to both "
                        f"{existing.identity_id} and {new_event.identity_id}"
                    ),
                    blocks_training=True,  # Hard block
                    detected_at=datetime.utcnow()
                )
                conflicts.append(conflict)
        return conflicts
```

### 4.3 Conflict Resolution Workflow

```
+-----------------------------------------+
| Conflict Detected                       |
+-----------------------------------------+
                    |
                    v
+-----------------------------------------+
| 1. Set conflict_id on affected events   |
| 2. Set training_status = CONFLICTED     |
| 3. Block all conflicting events from    |
|    training dataset                     |
+-----------------------------------------+
                    |
                    v
+-----------------------------------------+
| 4. Create conflict ticket in review UI  |
| 5. Notify admin (in-app + email)        |
+-----------------------------------------+
                    |
                    v
+-----------------------------------------+
| Admin reviews conflict in UI            |
+-----------------------------------------+
          |                     |
          v                     v
+----------------+   +-------------------+
| RESOLVE: Pick  |   | ESCALATE: Flag for|
| correct label  |   | further review    |
+--------+-------+   +--------+----------+
         |                    |
         v                    v
+----------------+   +-------------------+
| Winning label  |   | Assigned to senior|
| marked as      |   | reviewer          |
| APPROVED,      |   +-------------------+
| losing marked  |
| as REJECTED    |
+--------+-------+
         |
         v
+----------------+
| All events now |
| non-conflicted |
| Unblock for    |
| training       |
+----------------+
```
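The RESOLVE branch above can be sketched in memory as follows (the dict shapes are illustrative; production code would update the rows defined in Section 4.4):

```python
from datetime import datetime, timezone

def resolve_conflict(conflict: dict, events: list[dict],
                     winning_identity_id: str, resolver_id: str) -> None:
    """Pick the winning label, mark the losing events rejected, and lift
    the training block, mirroring the workflow diagram above."""
    for e in events:
        if e["identity_id"] == winning_identity_id:
            e["training_status"] = "approved"   # winning label
        else:
            e["training_status"] = "rejected"   # losing label
        e["conflict_id"] = None                 # event no longer conflicted
    conflict.update(
        status="resolved",
        resolution="label_selected",
        winning_identity_id=winning_identity_id,
        resolved_by_user_id=resolver_id,
        blocks_training=False,                  # unblock for training
        resolved_at=datetime.now(timezone.utc),
    )
```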

### 4.4 Conflict Database Schema

```sql
CREATE TABLE label_conflicts (
    conflict_id UUID PRIMARY KEY,
    conflict_type VARCHAR(32) NOT NULL,
    severity VARCHAR(16) NOT NULL,  -- LOW, MEDIUM, HIGH, CRITICAL
    status VARCHAR(16) DEFAULT 'open',  -- open, under_review, resolved, escalated

    -- References
    detection_id UUID REFERENCES detections(detection_id),
    face_image_path TEXT NOT NULL,

    -- Conflicting assignments (JSON array of event_id + identity_id pairs)
    conflicting_labels JSONB NOT NULL,

    -- Resolution
    resolution VARCHAR(16),  -- label_selected, both_valid, escalated, auto_resolved
    resolved_by_user_id UUID REFERENCES users(user_id),
    winning_identity_id UUID REFERENCES identities(identity_id),
    resolution_notes TEXT,

    -- Timestamps
    detected_at TIMESTAMP NOT NULL DEFAULT NOW(),
    resolved_at TIMESTAMP,

    -- Training block
    blocks_training BOOLEAN NOT NULL DEFAULT TRUE,
    training_unblocked_at TIMESTAMP
);

-- Index for fast lookups
CREATE INDEX idx_label_conflicts_status ON label_conflicts(status);
CREATE INDEX idx_label_conflicts_detection ON label_conflicts(detection_id);
CREATE INDEX idx_label_conflicts_severity ON label_conflicts(severity);
```

---

## 5. Approval Workflow

### 5.1 Three Learning Modes

```
+==========================================================================+
| MODE 1: MANUAL ONLY                                                      |
+==========================================================================+
|                                                                          |
|  User Corrections      Training Events          Admin                    |
|        |                    |                    |                       |
|        v                    v                    v                       |
|   +---------+          +---------+         +----------+                  |
|   | Capture |--------->| Store   |-------->| Queue    |                  |
|   | Events  |          | (DB)    |         | (Wait)   |                  |
|   +---------+          +---------+         +----------+                  |
|                                                   |                      |
|                                                   | Admin manually       |
|                                                   | triggers training    |
|                                                   v                      |
|                                            +------------+                |
|                                            | Training   |                |
|                                            | Job        |                |
|                                            +------------+                |
+==========================================================================+
| Best for: High-security environments, regulated industries               |
+==========================================================================+

+==========================================================================+
| MODE 2: SUGGESTED LEARNING (Recommended)                                 |
+==========================================================================+
|                                                                          |
|  User Corrections      Training Events          Suggestion Engine        |
|        |                    |                    |                       |
|        v                    v                    v                       |
|   +---------+          +---------+         +------------+                |
|   | Capture |--------->| Quality |-------->| Generate   |                |
|   | Events  |          | Filter  |         | Suggestions|                |
|   +---------+          +---------+         +------------+                |
|                                                   |                      |
|                                                   v                      |
|                                            +----------+                  |
|                                            | Review   |                  |
|                                            | Queue    |                  |
|                                            +----------+                  |
|                                                   |                      |
|                              +--------------------+------------------+   |
|                              |                    |                  |   |
|                              v                    v                  v   |
|                         +---------+         +---------+        +--------+|
|                         | Approve |         | Reject  |        | Correct||
|                         | (Train) |         | (Drop)  |        | Label  ||
|                         +---------+         +---------+        +--------+|
|                                                                          |
+==========================================================================+
| Best for: Most production environments. Human reviews every suggestion   |
+==========================================================================+

+==========================================================================+
| MODE 3: APPROVED AUTO-UPDATE                                             |
+==========================================================================+
|                                                                          |
|  +-----------------------------------------------------------+           |
|  | Trusted Users / Auto-Approval Rules                       |           |
|  +-----------------------------------------------------------+           |
|           |                                     |                        |
|           v                                     v                        |
|  +--------------------+  Auto-approve if:  +---------------------+       |
|  | > High-confidence  |                    | > From trusted user |       |
|  |   system matches   |                    | > No conflicts      |       |
|  |   (score > 0.95)   |                    | > Identity has 10+  |       |
|  | > User has >100    |                    |   existing samples  |       |
|  |   prior approvals  |                    | > Single correction |       |
|  +--------------------+                    +---------------------+       |
|           |                                     |                        |
|           v                                     v                        |
|     Direct to Training Dataset          Into Suggestion Queue            |
|     (Skip Review)                       (Mode 2 behavior)                |
|                                                                          |
|  Automatic retraining triggers on thresholds:                            |
|  - >50 new approved examples total                                       |
|  - >10 new examples for any single identity                              |
|  - Daily schedule at 02:00 AM                                            |
|  - Performance degradation detected                                      |
|                                                                          |
+==========================================================================+
| Best for: Mature systems with well-established identity databases        |
+==========================================================================+
```
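The Mode 3 routing decision can be sketched as a single predicate over the rules listed above (the field and parameter names are assumptions):

```python
def auto_approve_eligible(event: dict, user_prior_approvals: int,
                          identity_sample_count: int,
                          has_conflicts: bool) -> bool:
    """Route an event straight to the training dataset only when every
    auto-approval rule holds; otherwise it falls back to the Mode 2
    suggestion queue."""
    return (
        not has_conflicts                            # no open conflicts
        and event.get("match_score", 0.0) > 0.95     # high-confidence match
        and user_prior_approvals > 100               # trusted user
        and identity_sample_count >= 10              # established identity
    )
```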

### 5.2 Suggestion Generation Engine

```python
class TrainingSuggestionEngine:
    """Generates training suggestions from captured events."""

    async def generate_suggestions(self) -> list[TrainingSuggestion]:
        """
        Aggregate training events into actionable suggestions.
        Grouped by identity for efficient review.
        """
        # Get all pending training events
        pending_events = await self.db.get_pending_events(
            status=EventStatus.PENDING_REVIEW,
            min_quality_score=0.7,
            exclude_conflicted=True
        )

        # Group by target identity
        by_identity = self._group_by_identity(pending_events)

        suggestions = []
        for identity_id, events in by_identity.items():
            suggestion = TrainingSuggestion(
                suggestion_id=uuid4(),
                identity_id=identity_id,
                identity_name=await self._get_identity_name(identity_id),

                # Grouped changes
                new_positive_examples=[
                    e for e in events
                    if e.event_type in (
                        TrainingEventType.IDENTITY_NAMED,
                        TrainingEventType.MATCH_CONFIRMED,
                    )
                ],
                corrected_examples=[
                    e for e in events
                    if e.event_type == TrainingEventType.MATCH_CORRECTED
                ],
                merged_examples=[
                    e for e in events
                    if e.event_type == TrainingEventType.IDENTITY_MERGED
                ],

                # Impact assessment
                estimated_accuracy_change=self._estimate_impact(events),
                conflict_risk_score=self._assess_conflict_risk(events),
                sample_diversity_score=self._assess_diversity(events),

                # Metadata
                generated_at=datetime.utcnow(),
                expires_at=datetime.utcnow() + timedelta(days=7),
                auto_approve_eligible=self._check_auto_approve(events),

                # Current state
                existing_positive_count=await self._get_positive_count(identity_id),
                existing_negative_count=await self._get_negative_count(identity_id)
            )
            suggestions.append(suggestion)

        # Sort by priority (impact x risk)
        suggestions.sort(
            key=lambda s: s.estimated_accuracy_change * (1 - s.conflict_risk_score),
            reverse=True
        )

        return suggestions
```

### 5.3 Suggestion Review API

```python
# GET /api/v1/training/suggestions
# Returns paginated list of pending suggestions

# POST /api/v1/training/suggestions/{id}/approve
class ApproveSuggestionRequest(BaseModel):
    review_method: str = "manual"       # manual, bulk, auto
    notes: str | None = None
    # On approval: events marked as APPROVED, added to training dataset

# POST /api/v1/training/suggestions/{id}/reject
class RejectSuggestionRequest(BaseModel):
    reason: str                         # Required: why rejected
    reject_events: list[UUID] | None = None  # Reject specific events, or all

# POST /api/v1/training/suggestions/{id}/correct
class CorrectSuggestionRequest(BaseModel):
    corrections: list[EventCorrection]  # Fix specific event labels
    # Events are updated and re-checked for conflicts

# POST /api/v1/training/suggestions/bulk-approve
class BulkApproveRequest(BaseModel):
    suggestion_ids: list[UUID]
    review_method: str = "bulk"

# GET /api/v1/training/suggestions/{id}/details
# Returns full details: all events, face thumbnails, impact analysis
```

### 5.4 Suggestion Review UI State Machine

```
                    +-------------+
                    |   PENDING   |
                    |   (Queue)   |
                    +------+------+
                           |
          +----------------+----------------+
          |                |                |
          v                v                v
    +---------+      +---------+      +---------+
    | APPROVED|      | REJECTED|      |CORRECTED|
    +----+----+      +----+----+      +----+----+
         |                |                |
         v                v                v
    +---------+      +---------+      +---------+
    | Added to|      | Dropped |      | Re-check|
    | Training|      | (Log)   |      | Conflict|
    | Dataset |      +---------+      +---------+
    +---------+
         |
         v
    +---------+
    | Marked  |
    | TRAINED |
    +---------+
```
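
The transitions in the diagram above can be enforced with a small allow-list; a minimal sketch (state names follow the diagram, but the re-queue of CORRECTED back to PENDING after the conflict re-check is an assumption):

```python
from enum import Enum

class SuggestionState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    CORRECTED = "corrected"
    TRAINED = "trained"

# Legal transitions, mirroring the state machine diagram
ALLOWED = {
    SuggestionState.PENDING: {
        SuggestionState.APPROVED, SuggestionState.REJECTED, SuggestionState.CORRECTED
    },
    SuggestionState.APPROVED: {SuggestionState.TRAINED},
    SuggestionState.CORRECTED: {SuggestionState.PENDING},  # re-queued after conflict re-check
    SuggestionState.REJECTED: set(),  # terminal (dropped, logged)
    SuggestionState.TRAINED: set(),   # terminal
}

def transition(current: SuggestionState, target: SuggestionState) -> SuggestionState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition {current.value} -> {target.value}")
    return target
```

Encoding the machine as data keeps the API handlers free of ad-hoc `if` chains and makes illegal transitions fail loudly.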

---

## 6. Retraining Pipeline

### 6.1 Pipeline Architecture

```
+------------------------------------------------------------------+
|                    RETRAINING PIPELINE                            |
+------------------------------------------------------------------+
|                                                                   |
|  +----------+  +----------+  +----------+  +----------+          |
|  | Trigger  |  | Dataset  |  | Training |  | Quality  |          |
|  | Sources  |->| Builder  |->|  Engine  |->|  Gates   |          |
|  +----------+  +----------+  +----------+  +----+-----+          |
|       |                                         |                 |
|       |                                         v                 |
|       |                                    +----------+           |
|       |                                    | Deploy?  |           |
|       |                                    +----+-----+           |
|       |                                         |                 |
|       |                              +----------+----------+     |
|       |                              |                     |     |
|       |                              v                     v     |
|       |                         +---------+          +---------+ |
|       |                         | DEPLOY  |          |  HOLD   | |
|       |                         | (New    |          | (Keep   | |
|       |                         | Model)  |          | Current)| |
|       |                         +----+----+          +---------+ |
|       |                              |                           |
|       |                              v                           |
|       |                         +---------+                      |
|       |                         | Rollback|                      |
|       |                         | Ready   |                      |
|       |                         +---------+                      |
|       |                                                          |
|  +----+------------------------------------------------------+  |
|  |                    TRIGGER SOURCES                          |  |
|  +------------------------------------------------------------+  |
|  | 1. SCHEDULED: Daily at 02:00, Weekly on Sunday              |  |
|  | 2. THRESHOLD: >50 new approved examples since last train   |  |
|  | 3. MANUAL: Admin triggers via UI/API                       |  |
|  | 4. DEGRADATION: Performance monitor signals accuracy drop  |  |
|  | 5. NEW_IDENTITY: New identity with enough samples added    |  |
|  +------------------------------------------------------------+  |
+------------------------------------------------------------------+
```

### 6.2 Training Job Orchestration

```python
class TrainingOrchestrator:
    """Orchestrates the complete retraining pipeline."""

    async def create_training_job(
        self,
        trigger: TrainingTrigger,
        config: TrainingConfig
    ) -> TrainingJob:
        """
        Create and queue a new training job.
        """
        job = TrainingJob(
            job_id=uuid4(),
            trigger=trigger,  # scheduled, threshold, manual, degradation
            status=JobStatus.QUEUED,
            config=config,
            created_at=datetime.utcnow(),
            gpu_requirement=config.gpu_type or GPUType.V100,
            priority=self._calculate_priority(trigger)
        )

        # Write to job queue (Redis sorted set by priority)
        await self.job_queue.enqueue(job)

        # Log
        logger.info(f"Training job {job.job_id} queued with priority {job.priority}")

        return job

    async def execute_training_job(self, job: TrainingJob):
        """
        Execute a training job end-to-end.
        """
        gpu_pod: GPUPod | None = None
        try:
            # Phase 1: Acquire GPU resources
            await self._update_status(job, JobStatus.ACQUIRING_RESOURCES)
            gpu_pod = await self.gpu_manager.acquire_gpu(
                gpu_type=job.config.gpu_type,
                max_wait_seconds=3600
            )

            # Phase 2: Build training dataset
            await self._update_status(job, JobStatus.BUILDING_DATASET)
            dataset = await self._build_training_dataset(job)
            dataset_version = await self._version_dataset(dataset)

            # Phase 3: Determine training strategy
            await self._update_status(job, JobStatus.DETERMINING_STRATEGY)
            strategy = self._determine_training_strategy(
                job=job,
                dataset=dataset,
                current_model=await self._get_current_production_model()
            )

            # Phase 4: Execute training
            if strategy == TrainingStrategy.INCREMENTAL_EMBEDDING:
                await self._update_status(job, JobStatus.INCREMENTAL_UPDATE)
                result = await self._run_incremental_update(job, dataset, gpu_pod)
            elif strategy == TrainingStrategy.FULL_RETRAIN:
                await self._update_status(job, JobStatus.FULL_TRAINING)
                result = await self._run_full_retrain(job, dataset, gpu_pod)
            elif strategy == TrainingStrategy.FINE_TUNE:
                await self._update_status(job, JobStatus.FINE_TUNING)
                result = await self._run_fine_tuning(job, dataset, gpu_pod)
            else:
                raise ValueError(f"Unknown training strategy: {strategy}")

            # Phase 5: Quality gates
            await self._update_status(job, JobStatus.QUALITY_CHECK)
            passed = await self._run_quality_gates(result)

            if not passed:
                await self._update_status(job, JobStatus.QUALITY_FAILED)
                await self._handle_quality_failure(job, result)
                return

            # Phase 6: Register model
            await self._update_status(job, JobStatus.REGISTERING)
            model_version = await self._register_model(result, dataset_version)

            # Phase 7: Deploy (if auto-deploy enabled) or hold for approval
            if job.config.auto_deploy and job.trigger != TrainingTrigger.MANUAL:
                await self._update_status(job, JobStatus.DEPLOYING)
                await self._deploy_model(model_version)
                await self._update_status(job, JobStatus.COMPLETED)
            else:
                await self._update_status(
                    job, JobStatus.AWAITING_DEPLOYMENT_APPROVAL
                )

        except Exception as e:
            await self._update_status(job, JobStatus.FAILED, error=str(e))
            logger.error(f"Training job {job.job_id} failed: {e}")
            raise

        finally:
            # Release only if the pod was actually acquired; a failure before
            # acquisition must not trigger a release of an undefined pod
            if gpu_pod is not None:
                await self.gpu_manager.release_gpu(gpu_pod)
```

### 6.3 Training Strategy Decision Logic

```python
def determine_training_strategy(
    self,
    new_examples_count: int,
    new_identities_count: int,
    days_since_last_train: int,
    current_model_age_days: int,
    drift_detected: bool
) -> TrainingStrategy:
    """
    Decide between incremental update, fine-tuning, or full retraining.

    Decision Matrix:
    +----------------------+------------------------+------------------------+
    | Scenario             | Strategy               | Reason                 |
    +----------------------+------------------------+------------------------+
    | < 100 new examples,  | INCREMENTAL_EMBEDDING  | Just update centroid   |
    | existing identities, |                        | vectors; no model      |
    | no drift             |                        | retraining needed      |
    +----------------------+------------------------+------------------------+
    | 100-1000 examples,   | FINE_TUNE              | Update top layers,     |
    | some new identities, |                        | keep base frozen       |
    | minor drift          |                        |                        |
    +----------------------+------------------------+------------------------+
    | > 1000 examples,     | FULL_RETRAIN           | Significant new data;  |
    | many new identities, |                        | retrain from scratch   |
    | major drift, or      |                        | or from checkpoint     |
    | model > 30 days old  |                        |                        |
    +----------------------+------------------------+------------------------+
    """
    if drift_detected and current_model_age_days > 30:
        return TrainingStrategy.FULL_RETRAIN

    if new_examples_count > 1000 or new_identities_count > 50:
        return TrainingStrategy.FULL_RETRAIN

    if new_examples_count > 100:
        return TrainingStrategy.FINE_TUNE

    return TrainingStrategy.INCREMENTAL_EMBEDDING
```

### 6.4 Training Job States

```
+---------+   +-----------+   +------------+   +--------------+
| QUEUED  |-->| ACQUIRING |-->| BUILDING   |-->| DETERMINING  |
|         |   | RESOURCES |   | DATASET    |   | STRATEGY     |
+---------+   +-----------+   +------------+   +------+-------+
                                                       |
              +----------------------------------------+--------------------------------+
              |                                        |                                |
              v                                        v                                v
       +-------------+                          +------------+                  +------------+
       | INCREMENTAL |                          | FINE_TUNING|                  | FULL       |
       | UPDATE      |                          |            |                  | TRAINING   |
       +------+------+                          +------+-----+                  +------+-----+
              |                                        |                                |
              +------------------------+----------------+-------------------------------+
                                       |
                                       v
                              +--------------+
                              | QUALITY_CHECK|
                              +------+-------+
                                     |
                          +----------+----------+
                          |                     |
                          v                     v
                    +---------+          +-----------+
                    | PASSED  |          | FAILED    |
                    +----+----+          +-----+-----+
                         |                     |
                         v                     v
                    +---------+          +-----------+
                    |REGISTER |          | NOTIFIED  |
                    | MODEL   |          | (Admin)   |
                    +----+----+          +-----------+
                         |
                         v
              +----------+-----------+
              |                      |
              v                      v
       +-------------+       +-------------------+
       | DEPLOYED    |       | AWAITING_APPROVAL |
       | (Auto)      |       | (Manual)          |
       +-------------+       +-------------------+
```
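
The states above map naturally onto the `JobStatus` enum the orchestrator references, plus a terminal-state predicate for the scheduler; a minimal sketch (the exact member set and terminal classification are assumptions consistent with the diagram):

```python
from enum import Enum

class JobStatus(Enum):
    QUEUED = "queued"
    ACQUIRING_RESOURCES = "acquiring_resources"
    BUILDING_DATASET = "building_dataset"
    DETERMINING_STRATEGY = "determining_strategy"
    INCREMENTAL_UPDATE = "incremental_update"
    FINE_TUNING = "fine_tuning"
    FULL_TRAINING = "full_training"
    QUALITY_CHECK = "quality_check"
    QUALITY_FAILED = "quality_failed"
    REGISTERING = "registering"
    DEPLOYING = "deploying"
    AWAITING_DEPLOYMENT_APPROVAL = "awaiting_deployment_approval"
    COMPLETED = "completed"
    FAILED = "failed"

# States from which no further *automatic* transition occurs
# (AWAITING_DEPLOYMENT_APPROVAL advances only on manual approval)
TERMINAL = {
    JobStatus.COMPLETED, JobStatus.FAILED,
    JobStatus.QUALITY_FAILED, JobStatus.AWAITING_DEPLOYMENT_APPROVAL,
}

def is_terminal(status: JobStatus) -> bool:
    """True if the scheduler should stop advancing this job."""
    return status in TERMINAL
```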

### 6.5 GPU Resource Management

```python
class GPUResourceManager:
    """Manages GPU allocation for training jobs."""

    # Configuration
    GPU_POOL_CONFIG = {
        GPUType.V100: {"count": 4, "memory_gb": 32, "cost_per_hour": 2.5},
        GPUType.A100: {"count": 2, "memory_gb": 80, "cost_per_hour": 3.5},
        GPUType.T4:   {"count": 8, "memory_gb": 16, "cost_per_hour": 0.35},
    }

    # Job to GPU mapping
    JOB_GPU_REQUIREMENTS = {
        TrainingStrategy.INCREMENTAL_EMBEDDING: GPUType.T4,    # Fast, light
        TrainingStrategy.FINE_TUNE: GPUType.V100,              # Moderate
        TrainingStrategy.FULL_RETRAIN: GPUType.A100,           # Heavy
    }

    async def acquire_gpu(
        self,
        gpu_type: GPUType,
        max_wait_seconds: int = 3600
    ) -> GPUPod:
        """
        Acquire a GPU for training with fair queuing.
        """
        # Check if GPU available
        available = await self.kubernetes.get_available_gpus(gpu_type)
        if available > 0:
            pod = await self.kubernetes.create_training_pod(
                image="face-training:latest",
                gpu_type=gpu_type,
                resources={"memory": "32Gi", "cpu": 8}
            )
            return pod

        # Queue and wait with priority
        queue_position = await self.waiting_queue.add({
            "gpu_type": gpu_type,
            "requested_at": datetime.utcnow().isoformat()
        })

        logger.info(f"GPU {gpu_type.value} not available, queued at position {queue_position}")

        # Wait with timeout
        pod = await self._wait_for_gpu(gpu_type, timeout=max_wait_seconds)
        return pod

    async def auto_scale_gpu_pool(self):
        """
        Auto-scale GPU pool based on queue depth and cost constraints.
        Uses KEDA for Kubernetes event-driven autoscaling.
        """
        queue_depth = await self.waiting_queue.depth()
        running_jobs = await self.kubernetes.get_running_training_jobs()
        current_cost = self._calculate_current_cost(running_jobs)

        # Scale up if queue is backed up
        if queue_depth > 5 and current_cost < self.max_hourly_budget:
            await self.kubernetes.scale_gpu_node_pool(
                target_replicas=min(queue_depth // 2, self.max_gpus)
            )

        # Scale down if idle
        if queue_depth == 0 and len(running_jobs) == 0:
            await self.kubernetes.scale_gpu_node_pool(target_replicas=1)
```
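
The `_calculate_current_cost` helper referenced above is essentially a sum over the pool table; a sketch against the same config dict (the `GPUType` enum values and the list-of-types job representation are assumptions):

```python
from enum import Enum

class GPUType(Enum):
    V100 = "v100"
    A100 = "a100"
    T4 = "t4"

GPU_POOL_CONFIG = {
    GPUType.V100: {"count": 4, "memory_gb": 32, "cost_per_hour": 2.5},
    GPUType.A100: {"count": 2, "memory_gb": 80, "cost_per_hour": 3.5},
    GPUType.T4:   {"count": 8, "memory_gb": 16, "cost_per_hour": 0.35},
}

def calculate_current_cost(running_jobs: list[GPUType]) -> float:
    """Hourly cost of the GPUs currently held by running training jobs."""
    return sum(GPU_POOL_CONFIG[g]["cost_per_hour"] for g in running_jobs)
```

The autoscaler compares this figure against `max_hourly_budget` before scaling up, so the budget cap is enforced at allocation time rather than after the fact.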

---

## 7. Model Versioning

### 7.1 Model Registry Structure

```
Model Registry (MLflow-like)
|
|-- face-recognition-model (Registered Model)
|   |
|   |-- Version 1.0.0 (Production, 2024-01-15)
|   |   |-- Model artifact (PyTorch .pt)
|   |   |-- Training dataset version: dataset-v1.0.0
|   |   |-- Training config JSON
|   |   |-- Performance metrics JSON
|   |   |-- Stage: Production
|   |   |-- Deployment date: 2024-01-15T08:30:00Z
|   |   +-- Rollback ready: Yes
|   |
|   |-- Version 1.1.0 (Staging, 2024-01-22)
|   |   |-- Model artifact
|   |   |-- Dataset: dataset-v1.1.0
|   |   |-- Stage: Staging (awaiting quality gates)
|   |   +-- A/B test: 10% traffic
|   |
|   |-- Version 1.1.1 (Archived, 2024-01-23)
|   |   |-- Stage: Archived
|   |   +-- Reason: Failed quality gate (precision drop)
|   |
|   +-- Version 1.2.0 (Staging, 2024-01-25)
|       |-- Dataset: dataset-v1.2.0
|       |-- Stage: Staging
|       +-- Changelog: "Added 45 new identities, fine-tuned embeddings"
```

### 7.2 Versioning Schema

```python
class ModelVersion(BaseModel):
    """A versioned model in the registry."""

    version_id: UUID
    version_string: str                    # Semantic: 1.2.3

    # Model metadata
    model_name: str = "face-recognition"
    model_type: str                        # "arcface", "facenet", "custom"
    architecture: str                      # "iresnet100", "mobilenet", etc.
    embedding_dimension: int = 512

    # Training lineage
    training_job_id: UUID
    training_dataset_version: str          # References dataset version
    training_config: TrainingConfig
    base_model_version: str | None         # For fine-tuning: what we started from
    training_strategy: TrainingStrategy

    # Model artifacts
    model_artifact_path: str               # S3/MinIO path to .pt file
    checkpoint_path: str                   # Full training checkpoint
    onnx_export_path: str | None           # ONNX optimized version
    tensorrt_path: str | None              # TensorRT optimized (GPU)

    # Performance metrics
    training_metrics: TrainingMetrics
    evaluation_metrics: EvaluationMetrics  # On hold-out test set
    identity_performance: dict[UUID, IdentityMetrics]  # Per-identity breakdown

    # Deployment state
    stage: ModelStage                      # STAGING, PRODUCTION, ARCHIVED
    deployment_date: datetime | None
    traffic_percentage: float = 0.0        # For A/B testing

    # Rollback
    can_rollback: bool = True
    rollback_to_version: str | None        # Previous version

    # Timestamps
    created_at: datetime
    updated_at: datetime

    # Audit
    created_by_user_id: UUID
    approved_by_user_id: UUID | None
    approval_notes: str | None
```

### 7.3 Semantic Versioning for Models

```
Version Format: MAJOR.MINOR.PATCH

MAJOR (X.0.0): Full retraining, architecture change, or breaking embedding change
  - New base model (e.g., ArcFace -> FaceNet)
  - Embedding dimension change
  - All embeddings invalidated, full database rebuild required

MINOR (x.Y.0): Fine-tuning or significant new training data
  - 50+ new identities added
  - Fine-tuning with substantial new data
  - Embeddings compatible, incremental update possible

PATCH (x.y.Z): Incremental update, hotfix
  - New examples for existing identities
  - Centroid updates only
  - Backward compatible, rolling deployment safe
```
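
Under this scheme the bump level follows mechanically from the kind of change; a minimal sketch of the version arithmetic (the level names map to the rules above; tying each level to a training strategy is left to the caller):

```python
def bump_version(current: str, level: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version string at the given level."""
    major, minor, patch = (int(p) for p in current.split("."))
    if level == "major":    # architecture / embedding-dimension change
        return f"{major + 1}.0.0"
    if level == "minor":    # fine-tune, 50+ new identities
        return f"{major}.{minor + 1}.0"
    if level == "patch":    # incremental centroid update, hotfix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown bump level: {level}")
```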

### 7.4 Rollback Mechanism

```python
class RollbackManager:
    """Manages model rollback with zero-downtime switching."""

    ROLLBACK_TIMEOUT_SECONDS = 60

    async def rollback_to_version(
        self,
        target_version: str,
        reason: str,
        initiated_by: UUID
    ) -> RollbackResult:
        """
        Rollback to a previous model version.
        """
        # 1. Validate target version exists and is rollback-eligible
        target = await self.model_registry.get_version(target_version)
        if not target.can_rollback:
            raise RollbackError(f"Version {target_version} is not rollback-eligible")

        current = await self.model_registry.get_production_version()

        # 2. Pre-warm: Load target model into memory
        logger.info(f"Pre-warming model {target_version} for rollback")
        await self.inference_service.load_model(target_version)

        # 3. Atomic switch with traffic mirroring
        await self._atomic_model_switch(
            from_version=current.version_string,
            to_version=target.version_string,
            mirror_traffic=True,  # Send to both, compare results
            mirror_duration_seconds=30
        )

        # 4. Validate switch health
        health = await self._check_model_health(target_version)
        if not health.healthy:
            # Auto-revert to current
            logger.error(f"Rollback health check failed, reverting to {current.version_string}")
            await self._atomic_model_switch(
                from_version=target.version_string,
                to_version=current.version_string
            )
            raise RollbackError(f"Rollback health check failed: {health.details}")

        # 5. Commit the rollback
        await self.model_registry.update_stage(
            version=target.version_string,
            stage=ModelStage.PRODUCTION,
            traffic_percentage=100.0
        )
        await self.model_registry.update_stage(
            version=current.version_string,
            stage=ModelStage.ARCHIVED,
            archive_reason=f"Rolled back to {target.version_string}: {reason}"
        )

        # 6. Log rollback event
        await self.audit_log.record(RollbackEvent(
            from_version=current.version_string,
            to_version=target.version_string,
            reason=reason,
            initiated_by=initiated_by,
            completed_at=datetime.utcnow(),
            duration_seconds=health.switch_duration_ms / 1000  # actual switch time, not the timeout
        ))

        return RollbackResult(
            success=True,
            previous_version=current.version_string,
            new_version=target.version_string,
            switch_duration_ms=health.switch_duration_ms
        )
```

### 7.5 A/B Testing Framework

```python
class ABTestManager:
    """Manages A/B testing of model versions."""

    async def start_ab_test(
        self,
        new_version: str,
        traffic_split: dict[str, float],  # {"current": 0.9, "new": 0.1}
        test_duration_hours: int = 48,
        success_criteria: ABTestCriteria | None = None
    ) -> ABTest:
        """
        Start an A/B test between current and new model.
        """
        test = ABTest(
            test_id=uuid4(),
            new_version=new_version,
            traffic_split=traffic_split,
            started_at=datetime.utcnow(),
            ends_at=datetime.utcnow() + timedelta(hours=test_duration_hours),
            criteria=success_criteria or ABTestCriteria(
                min_accuracy_improvement=0.005,  # 0.5% improvement required
                max_latency_regression=0.1,      # 10% latency increase max
                max_false_positive_increase=0.05  # 5% FP increase max
            ),
            status=ABTestStatus.RUNNING
        )

        # Configure traffic routing
        await self.traffic_router.set_split(
            model_a=(await self.model_registry.get_production_version()).version_string,
            model_b=new_version,
            split_a=traffic_split.get("current", 0.9),
            split_b=traffic_split.get("new", 0.1)
        )

        # Start metrics collection
        await self.metrics_collector.start_comparison(test)

        return test

    async def evaluate_ab_test(self, test_id: UUID) -> ABTestResult:
        """
        Evaluate whether A/B test passed or failed.
        """
        test = await self.ab_test_store.get(test_id)
        metrics = await self.metrics_collector.get_comparison_metrics(test)

        passed = (
            metrics.accuracy_improvement >= test.criteria.min_accuracy_improvement
            and metrics.latency_regression <= test.criteria.max_latency_regression
            and metrics.false_positive_change <= test.criteria.max_false_positive_increase
        )

        if passed:
            await self._promote_to_full(test)
        else:
            await self._revert_traffic(test)

        return ABTestResult(
            test_id=test_id,
            passed=passed,
            metrics=metrics,
            recommendation="promote" if passed else "revert",
            confidence=metrics.statistical_confidence
        )
```
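
The `statistical_confidence` field above implies a significance test on the two variants' match-success proportions. One common choice is a pooled two-proportion z-test; a sketch using only the standard library (the test statistic is standard, but treating it as the platform's confidence metric is an assumption):

```python
import math

def two_proportion_confidence(success_a: int, n_a: int,
                              success_b: int, n_b: int) -> float:
    """One-sided confidence that variant B's success rate exceeds A's,
    via a pooled two-proportion z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.5  # no observed variance: no evidence either way
    z = (p_b - p_a) / se
    # Standard normal CDF via erf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

A 48-hour test on busy cameras typically yields thousands of recognition events per arm, which is enough for this test to separate a 0.5% accuracy improvement from noise.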

---

## 8. Quality Gates

### 8.1 Quality Gate Pipeline

```
+----------------+     +----------------+     +----------------+
|  Hold-Out Test |     |  Model Compare |     |  Identity      |
|  Evaluation    | --> |  vs Current    | --> |  Regression    |
|                |     |  Production    |     |  Check         |
+----------------+     +----------------+     +----------------+
         |                      |                      |
         v                      v                      v
+--------+---------+   +--------+---------+   +--------+---------+
| PASS/FAIL        |   | PASS/FAIL        |   | PASS/FAIL        |
| Threshold:       |   | Threshold:       |   | Threshold:       |
| Precision >= 0.97|   | No regression >  |   | 100% recall on   |
| Recall >= 0.95   |   | 2% on any metric |   | all known IDs    |
+------------------+   +------------------+   +------------------+
         |                      |                      |
         +----------+-----------+----------------------+
                    |
                    v
         +----------+-----------+
         |  ALL PASSED?         |
         +----------+-----------+
                    |
           +--------+--------+
           |                 |
           v                 v
    +-------------+   +------------------+
    | DEPLOY      |   | REJECT & NOTIFY  |
    | (Proceed)   |   | (Log + Alert)    |
    +-------------+   +------------------+
```

### 8.2 Quality Gate Implementation

```python
class QualityGates:
    """Quality gate checks before model deployment."""

    # Thresholds
    MIN_PRECISION: float = 0.97
    MIN_RECALL: float = 0.95
    MIN_F1_SCORE: float = 0.96
    MAX_LATENCY_MS: float = 150.0
    MAX_REGRESSION_PCT: float = 0.02  # 2% regression allowed
    MIN_KNOWN_IDENTITY_ACCURACY: float = 1.0  # 100% on known IDs

    async def run_all_gates(
        self,
        candidate_model: ModelVersion,
        current_model: ModelVersion,
        test_dataset: TestDataset
    ) -> QualityGateResult:
        """Run all quality gates and return combined result."""
        results = []

        # Gate 1: Hold-out test performance
        results.append(await self._gate_holdout_performance(
            candidate_model, test_dataset
        ))

        # Gate 2: Comparison against production model
        results.append(await self._gate_no_regression(
            candidate_model, current_model, test_dataset
        ))

        # Gate 3: Known identity accuracy
        results.append(await self._gate_known_identity_accuracy(
            candidate_model, test_dataset
        ))

        # Gate 4: Latency check
        results.append(await self._gate_latency_requirement(
            candidate_model
        ))

        # Gate 5: Confusion matrix analysis
        results.append(await self._gate_confusion_analysis(
            candidate_model, test_dataset
        ))

        # Combine results
        all_passed = all(r.passed for r in results)

        return QualityGateResult(
            passed=all_passed,
            gate_results=results,
            overall_recommendation="deploy" if all_passed else "reject",
            can_override=not any(r.critical and not r.passed for r in results),
            details={r.gate_name: r.details for r in results}
        )

    async def _gate_holdout_performance(
        self,
        model: ModelVersion,
        test_dataset: TestDataset
    ) -> GateResult:
        """Gate 1: Must meet minimum performance on hold-out set."""
        metrics = await self.evaluator.evaluate(model, test_dataset)

        passed = (
            metrics.precision >= self.MIN_PRECISION
            and metrics.recall >= self.MIN_RECALL
            and metrics.f1_score >= self.MIN_F1_SCORE
        )

        return GateResult(
            gate_name="holdout_performance",
            passed=passed,
            critical=True,  # Cannot override
            details={
                "precision": metrics.precision,
                "precision_threshold": self.MIN_PRECISION,
                "recall": metrics.recall,
                "recall_threshold": self.MIN_RECALL,
                "f1_score": metrics.f1_score,
                "f1_threshold": self.MIN_F1_SCORE,
                "false_positive_rate": metrics.fpr,
                "false_negative_rate": metrics.fnr
            },
            failure_reason=None if passed else (
                f"Precision {metrics.precision:.4f} < {self.MIN_PRECISION} "
                f"or Recall {metrics.recall:.4f} < {self.MIN_RECALL}"
            )
        )

    async def _gate_no_regression(
        self,
        candidate: ModelVersion,
        current: ModelVersion,
        test_dataset: TestDataset
    ) -> GateResult:
        """Gate 2: New model must not regress on any key metric."""
        candidate_metrics = await self.evaluator.evaluate(candidate, test_dataset)
        current_metrics = await self.evaluator.evaluate(current, test_dataset)

        regressions = {}
        for metric_name in ["precision", "recall", "f1_score"]:
            candidate_val = getattr(candidate_metrics, metric_name)
            current_val = getattr(current_metrics, metric_name)
            regression = (current_val - candidate_val) / current_val if current_val > 0 else 0
            regressions[metric_name] = regression

        max_regression = max(regressions.values())
        passed = max_regression <= self.MAX_REGRESSION_PCT

        return GateResult(
            gate_name="no_regression",
            passed=passed,
            critical=False,  # Can be overridden by admin
            details={
                "max_regression_pct": max_regression * 100,
                "threshold_pct": self.MAX_REGRESSION_PCT * 100,
                "per_metric_regression": regressions,
                "candidate_metrics": candidate_metrics.dict(),
                "current_metrics": current_metrics.dict()
            },
            failure_reason=None if passed else (
                f"Maximum regression {max_regression*100:.2f}% exceeds "
                f"threshold {self.MAX_REGRESSION_PCT*100:.2f}%"
            )
        )

    async def _gate_known_identity_accuracy(
        self,
        model: ModelVersion,
        test_dataset: TestDataset
    ) -> GateResult:
        """Gate 3: Must have 100% accuracy on known identities."""
        known_identities = test_dataset.get_known_identities()
        per_identity_metrics = await self.evaluator.evaluate_per_identity(
            model, known_identities
        )

        failing_identities = [
            (id_id, m.accuracy)
            for id_id, m in per_identity_metrics.items()
            if m.accuracy < self.MIN_KNOWN_IDENTITY_ACCURACY
        ]

        passed = len(failing_identities) == 0

        return GateResult(
            gate_name="known_identity_accuracy",
            passed=passed,
            critical=True,
            details={
                "total_known_identities": len(known_identities),
                "failing_identities": [
                    {"identity_id": str(i), "accuracy": a}
                    for i, a in failing_identities
                ]
            },
            failure_reason=None if passed else (
                f"{len(failing_identities)} known identities have "
                f"accuracy below {self.MIN_KNOWN_IDENTITY_ACCURACY}"
            )
        )
```

### 8.3 Quality Gate Report

```json
{
  "gate_run_id": "uuid-string",
  "candidate_model_version": "1.2.0",
  "baseline_model_version": "1.1.0",
  "timestamp": "2024-01-25T10:30:00Z",
  "overall_result": "PASSED",
  "gates": [
    {
      "name": "holdout_performance",
      "status": "PASSED",
      "critical": true,
      "metrics": {
        "precision": 0.9842,
        "recall": 0.9678,
        "f1_score": 0.9759
      }
    },
    {
      "name": "no_regression",
      "status": "PASSED",
      "metrics": {
        "max_regression_pct": 0.3,
        "per_metric": {
          "precision": 0.003,
          "recall": -0.008,
          "f1_score": -0.002
        }
      }
    },
    {
      "name": "known_identity_accuracy",
      "status": "PASSED",
      "metrics": {
        "known_identities_tested": 142,
        "perfect_accuracy": 142,
        "accuracy_below_threshold": 0
      }
    },
    {
      "name": "latency_requirement",
      "status": "PASSED",
      "metrics": {
        "p50_latency_ms": 45,
        "p99_latency_ms": 128,
        "threshold_ms": 150
      }
    }
  ]
}
```
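
A consumer of this report only needs each gate's status and `critical` flag to decide whether the model can ship or whether an admin override is even possible; a hedged sketch (field names follow the JSON above):

```python
import json

def summarize_gate_report(report_json: str) -> dict:
    """Summarize a quality-gate report: overall pass, failed gate names,
    and whether a manual override is permitted (no critical failures)."""
    report = json.loads(report_json)
    failed = [g["name"] for g in report["gates"] if g["status"] != "PASSED"]
    critical_failed = [g["name"] for g in report["gates"]
                       if g["status"] != "PASSED" and g.get("critical", False)]
    return {
        "passed": not failed,
        "failed_gates": failed,
        "can_override": not critical_failed,
    }
```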

---

## 9. Embedding Update Strategy

### 9.1 Update Strategy Decision Matrix

```python
class EmbeddingUpdateStrategy(Enum):
    """Strategies for updating the embedding database."""
    CENTROID_UPDATE = "centroid_update"       # Update running mean only
    FULL_REINDEX = "full_reindex"             # Recompute all embeddings
    INCREMENTAL_ADD = "incremental_add"       # Add new, keep existing
    MERGE_AND_UPDATE = "merge_and_update"     # Handle identity merges
    ROLLBACK_REINDEX = "rollback_reindex"     # Restore previous embeddings


def select_embedding_strategy(
    change_type: str,
    affected_identities: int,
    total_identities: int,
    model_version_changed: bool
) -> EmbeddingUpdateStrategy:
    """
    Select the optimal embedding update strategy.

    +------------------+------------------+------------------+------------------+
    | Scenario         | Model Same       | Model Changed    | Action           |
    +------------------+------------------+------------------+------------------+
    | Few new examples | Centroid update  | Full reindex     | CENTROID_UPDATE  |
    | (< 10/identity)  | (fast, online)   | (required)       | or FULL_REINDEX  |
    +------------------+------------------+------------------+------------------+
    | Many new examples| Incremental add  | Full reindex     | INCREMENTAL_ADD  |
    | (10-100/identity)| + centroid       |                  | or FULL_REINDEX  |
    +------------------+------------------+------------------+------------------+
    | Identity merge   | Merge & update   | Full reindex     | MERGE_AND_UPDATE |
    +------------------+------------------+------------------+------------------+
    | Model rollback   | N/A              | Rollback         | ROLLBACK_REINDEX |
    +------------------+------------------+------------------+------------------+
    """
    if model_version_changed:
        return EmbeddingUpdateStrategy.FULL_REINDEX

    if change_type == "merge":
        return EmbeddingUpdateStrategy.MERGE_AND_UPDATE

    if change_type == "rollback":
        return EmbeddingUpdateStrategy.ROLLBACK_REINDEX

    if affected_identities / total_identities > 0.1:  # > 10% changed
        return EmbeddingUpdateStrategy.FULL_REINDEX

    # Only a handful of identities affected: a cheap online update suffices
    if change_type == "new_examples" and affected_identities < 10:
        return EmbeddingUpdateStrategy.CENTROID_UPDATE

    return EmbeddingUpdateStrategy.INCREMENTAL_ADD
```
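As a usage sketch, the decision order can be exercised directly. The condensed `select` below mirrors the precedence of `select_embedding_strategy` (model change first, then the 10% blast-radius check) but omits the merge/rollback branches for brevity:

```python
from enum import Enum

class Strategy(Enum):
    """Condensed stand-in for EmbeddingUpdateStrategy."""
    CENTROID_UPDATE = "centroid_update"
    FULL_REINDEX = "full_reindex"
    INCREMENTAL_ADD = "incremental_add"

def select(change_type: str, affected: int, total: int, model_changed: bool) -> Strategy:
    # Precedence matches select_embedding_strategy above.
    if model_changed or affected / total > 0.1:
        return Strategy.FULL_REINDEX
    if change_type == "new_examples" and affected < 10:
        return Strategy.CENTROID_UPDATE
    return Strategy.INCREMENTAL_ADD

# 5 of 500 identities touched, same model: stay online.
assert select("new_examples", 5, 500, False) is Strategy.CENTROID_UPDATE
# 60 of 500 (12%) crosses the blast-radius threshold: full reindex.
assert select("new_examples", 60, 500, False) is Strategy.FULL_REINDEX
```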

### 9.2 Centroid Update (Online, Fast)

```python
class CentroidUpdater:
    """
    Fast online update of identity centroids without full model inference.
    Suitable for small batches of new confirmed examples.
    """

    async def update_centroid(
        self,
        identity_id: UUID,
        new_embeddings: list[np.ndarray]
    ) -> UpdatedIdentity:
        """
        Update the centroid embedding for an identity with new examples.
        Uses running average with configurable decay.
        """
        identity = await self.identity_store.get(identity_id)

        # Get current centroid and count
        current_centroid = identity.embedding_centroid
        current_count = identity.positive_example_count

        # Compute new embeddings centroid
        new_centroid = np.mean(new_embeddings, axis=0)
        new_count = len(new_embeddings)

        # Weighted average with configurable decay
        # Higher alpha = faster adaptation to new examples
        alpha = self.config.centroid_update_alpha  # e.g., 0.3

        updated_centroid = (
            (1 - alpha) * current_centroid * current_count +
            alpha * new_centroid * new_count
        ) / ((1 - alpha) * current_count + alpha * new_count)

        # Normalize (embeddings are typically L2 normalized)
        updated_centroid = updated_centroid / np.linalg.norm(updated_centroid)

        # Update confidence radius
        distances = [
            np.linalg.norm(emb - updated_centroid)
            for emb in new_embeddings
        ]
        new_radius = max(distances) if distances else identity.confidence_radius
        updated_radius = max(
            identity.confidence_radius * 0.95,  # Decay old radius slightly
            new_radius
        )

        # Update in database and index
        await self.identity_store.update_embedding(
            identity_id=identity_id,
            centroid=updated_centroid.tolist(),
            confidence_radius=updated_radius,
            example_count=current_count + new_count,
            last_updated=datetime.utcnow()
        )

        # Update FAISS index
        await self.embedding_index.update_vector(
            identity_id=identity_id,
            new_vector=updated_centroid
        )

        return UpdatedIdentity(
            identity_id=identity_id,
            previous_centroid=current_centroid.tolist(),
            new_centroid=updated_centroid.tolist(),
            update_type="centroid_online",
            examples_added=new_count,
            processing_time_ms=0  # Near-instant
        )
```
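A worked example of the decayed weighted average (2-D vectors for readability; real face embeddings are 512-D or similar):

```python
import numpy as np

alpha = 0.3                                # centroid_update_alpha
current_centroid = np.array([1.0, 0.0])    # existing unit-norm centroid
current_count = 9
new_embeddings = np.array([[0.6, 0.8]])    # one new confirmed embedding

new_centroid = new_embeddings.mean(axis=0)
new_count = len(new_embeddings)

updated = (
    (1 - alpha) * current_centroid * current_count
    + alpha * new_centroid * new_count
) / ((1 - alpha) * current_count + alpha * new_count)
updated /= np.linalg.norm(updated)
# One example against nine barely moves the centroid: the old direction
# dominates, which is exactly what the decay weighting is for.
```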

### 9.3 Full Reindex (Batch, Comprehensive)

```python
class FullReindexer:
    """
    Complete reindexing of all embeddings. Required when:
    - Model version changes
    - Large batch updates (> 10% of identities)
    - Embedding dimension changes
    - Periodic consistency maintenance
    """

    async def full_reindex(
        self,
        model_version: str,
        batch_size: int = 256,
        parallel_workers: int = 4
    ) -> ReindexResult:
        """
        Recompute all embeddings for all training examples using
        the specified model version. Build a new FAISS index.
        """
        start_time = datetime.utcnow()

        # Phase 1: Collect all training examples
        all_examples = await self.training_store.get_all_approved_examples()

        # Phase 2: Group by identity
        by_identity = defaultdict(list)
        for ex in all_examples:
            by_identity[ex.identity_id].append(ex)

        # Phase 3: Recompute embeddings in batches
        new_embeddings = {}
        model = await self.model_registry.load_model(model_version)

        for identity_id, examples in tqdm(by_identity.items()):
            # Load face images
            images = await self._load_face_images(
                [ex.face_image_path for ex in examples]
            )

            # Batch inference
            embeddings = await self._batch_inference(model, images, batch_size)

            # Compute new centroid
            centroid = np.mean(embeddings, axis=0)
            centroid = centroid / np.linalg.norm(centroid)

            # Compute confidence radius
            distances = [np.linalg.norm(e - centroid) for e in embeddings]
            radius = np.percentile(distances, 95)  # 95th percentile

            new_embeddings[identity_id] = {
                "centroid": centroid,
                "radius": radius,
                "count": len(examples),
                "all_embeddings": embeddings
            }

        # Phase 4: Build new FAISS index
        new_index = self._build_faiss_index(new_embeddings)

        # Phase 5: Atomic swap (dual-index strategy)
        await self._atomic_index_swap(
            old_index=self.embedding_index.current_index(),
            new_index=new_index
        )

        # Phase 6: Update identity records
        for identity_id, data in new_embeddings.items():
            await self.identity_store.update_embedding(
                identity_id=identity_id,
                centroid=data["centroid"].tolist(),
                confidence_radius=data["radius"],
                example_count=data["count"],
                model_version=model_version,
                last_reindexed=datetime.utcnow()
            )

        # Phase 7: Consistency check
        consistency = await self._run_consistency_check(new_index, new_embeddings)

        duration = (datetime.utcnow() - start_time).total_seconds()

        return ReindexResult(
            model_version=model_version,
            identities_reindexed=len(new_embeddings),
            total_examples=len(all_examples),
            index_build_time_seconds=duration,
            consistency_check=consistency,
            index_size_mb=new_index.size() / (1024 * 1024)
        )
```
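Phase 5's dual-index swap can be sketched as below: readers keep whatever index reference they already grabbed, and new lookups see the new index only once it is fully built. Class and method names here are illustrative, not the actual `embedding_index` API:

```python
import threading

class DualIndexHolder:
    """Dual-index swap: readers never observe a partially built index."""

    def __init__(self, initial_index):
        self._lock = threading.Lock()
        self._active = initial_index

    def current_index(self):
        # Reading a reference is atomic in CPython; the lock is kept for clarity.
        with self._lock:
            return self._active

    def swap(self, new_index):
        """Atomically publish the new index; return the old one for retirement."""
        with self._lock:
            old, self._active = self._active, new_index
        return old
```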

### 9.4 Identity Merge Embedding Update

```python
class MergeEmbeddingUpdater:
    """Handle embedding updates when identities are merged."""

    async def handle_identity_merge(
        self,
        source_identity_id: UUID,
        target_identity_id: UUID,
        merge_method: str = "weighted_centroid"
    ) -> MergeResult:
        """
        Merge embeddings from source identity into target identity.
        """
        source = await self.identity_store.get(source_identity_id)
        target = await self.identity_store.get(target_identity_id)

        start_time = datetime.utcnow()
        total = source.positive_example_count + target.positive_example_count

        if merge_method == "weighted_centroid":
            # Weighted average based on example counts
            merged_centroid = (
                np.array(source.embedding_centroid) * source.positive_example_count +
                np.array(target.embedding_centroid) * target.positive_example_count
            ) / total
            merged_centroid = merged_centroid / np.linalg.norm(merged_centroid)

        elif merge_method == "recompute_all":
            # Recompute from all combined examples (slower but more accurate)
            all_embeddings = (
                source.get_all_embeddings() +
                target.get_all_embeddings()
            )
            merged_centroid = np.mean(all_embeddings, axis=0)
            merged_centroid = merged_centroid / np.linalg.norm(merged_centroid)

        else:
            raise ValueError(f"Unknown merge method: {merge_method}")

        # Update target identity
        await self.identity_store.update_embedding(
            identity_id=target_identity_id,
            centroid=merged_centroid.tolist(),
            confidence_radius=max(source.confidence_radius, target.confidence_radius),
            example_count=total,
            merged_from=[source_identity_id]
        )

        # Remove source from embedding index
        await self.embedding_index.delete_vector(source_identity_id)

        # Update target in embedding index
        await self.embedding_index.update_vector(
            identity_id=target_identity_id,
            new_vector=merged_centroid
        )

        # Mark source identity as merged
        await self.identity_store.mark_merged(
            source_identity_id,
            merged_into=target_identity_id
        )

        duration_ms = (datetime.utcnow() - start_time).total_seconds() * 1000

        return MergeResult(
            source_identity_id=source_identity_id,
            target_identity_id=target_identity_id,
            combined_example_count=total,
            new_centroid=merged_centroid.tolist(),
            processing_time_ms=duration_ms
        )
```
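A worked example of the weighted-centroid merge, where example counts act as weights (again 2-D for readability):

```python
import numpy as np

source_centroid = np.array([1.0, 0.0]); source_count = 30
target_centroid = np.array([0.0, 1.0]); target_count = 10

total = source_count + target_count
merged = (source_centroid * source_count + target_centroid * target_count) / total
merged /= np.linalg.norm(merged)
# The better-evidenced source (30 examples vs 10) pulls the merged
# centroid toward its direction before renormalization.
```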

### 9.5 Consistency Checks

```python
class EmbeddingConsistencyChecker:
    """Verify embedding database consistency after updates."""

    async def run_consistency_check(self) -> ConsistencyReport:
        """
        Run post-update consistency verification.
        """
        issues = []

        # Check 1: Every identity has a centroid
        identities = await self.identity_store.get_all_active()
        for identity in identities:
            if identity.embedding_centroid is None:
                issues.append(ConsistencyIssue(
                    type="missing_centroid",
                    identity_id=identity.identity_id,
                    severity="critical",
                    message=f"Identity {identity.identity_id} has no embedding centroid"
                ))

        # Check 2: Every centroid is in the FAISS index
        index = self.embedding_index.current_index()
        indexed_ids = set(index.get_all_ids())
        for identity in identities:
            if identity.identity_id not in indexed_ids:
                issues.append(ConsistencyIssue(
                    type="missing_from_index",
                    identity_id=identity.identity_id,
                    severity="critical",
                    message=f"Identity {identity.identity_id} not found in FAISS index"
                ))

        # Check 3: Centroid matches index vector
        for identity in identities:
            if identity.identity_id in indexed_ids:
                index_vector = index.get_vector(identity.identity_id)
                centroid = np.array(identity.embedding_centroid)
                if not np.allclose(index_vector, centroid, atol=1e-5):
                    issues.append(ConsistencyIssue(
                        type="centroid_mismatch",
                        identity_id=identity.identity_id,
                        severity="high",
                        message="Database centroid differs from index vector"
                    ))

        # Check 4: No orphaned index entries
        active_ids = {i.identity_id for i in identities}
        orphaned = indexed_ids - active_ids
        for orphan_id in orphaned:
            issues.append(ConsistencyIssue(
                type="orphaned_index_entry",
                identity_id=orphan_id,
                severity="medium",
                message=f"FAISS entry for deleted identity: {orphan_id}"
            ))

        # Check 5: Embedding dimension consistency
        expected_dim = self.config.embedding_dimension
        for identity in identities:
            if len(identity.embedding_centroid) != expected_dim:
                issues.append(ConsistencyIssue(
                    type="dimension_mismatch",
                    identity_id=identity.identity_id,
                    severity="critical",
                    message=f"Expected dim {expected_dim}, got {len(identity.embedding_centroid)}"
                ))

        # Auto-fix minor issues
        fixed = await self._auto_fix_issues([i for i in issues if i.severity == "medium"])

        return ConsistencyReport(
            total_identities=len(identities),
            total_issues=len(issues),
            critical_issues=len([i for i in issues if i.severity == "critical"]),
            auto_fixed=len(fixed),
            requires_manual_fix=len([i for i in issues if i.severity in ("high", "critical")]),
            details=issues
        )
```
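`_auto_fix_issues` is not shown above; for the medium-severity orphaned entries from Check 4, an auto-fix can be as simple as deleting the stale vectors. A sketch, with `delete_vector` standing in for the index API:

```python
def auto_fix_orphans(indexed_ids: set, active_ids: set, delete_vector) -> list:
    """Delete index entries whose identity no longer exists; return what was fixed."""
    fixed = []
    for orphan in sorted(indexed_ids - active_ids):
        delete_vector(orphan)   # remove the stale vector from the index
        fixed.append(orphan)
    return fixed
```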

---

## 10. Training Dataset Management

### 10.1 Dataset Versioning

```python
class TrainingDatasetVersion(BaseModel):
    """A versioned snapshot of the training dataset."""

    version_id: str                    # Semantic version: "1.2.3"
    created_at: datetime
    created_by: UUID                   # User or system

    # Composition
    total_identities: int
    total_positive_examples: int
    total_negative_examples: int
    total_examples: int

    # Per-identity breakdown
    identity_breakdown: list[IdentityDatasetStats]

    # Source events
    included_event_ids: list[UUID]     # All training events in this version
    excluded_event_ids: list[UUID]     # Events excluded (quality/conflict)

    # Quality summary
    avg_face_quality_score: float
    avg_embedding_quality: float
    conflict_count: int                # Should be 0 (conflicts resolved before versioning)

    # Storage
    dataset_manifest_path: str         # S3 path to manifest.json
    images_tar_path: str               # S3 path to images tarball
    embeddings_path: str               # S3 path to precomputed embeddings

    # Audit
    parent_version: str | None         # Previous version this was derived from
    changelog: str                     # Human-readable changes
    retention_until: datetime          # When this version can be deleted


class IdentityDatasetStats(BaseModel):
    """Statistics for a single identity in a dataset version."""
    identity_id: UUID
    identity_name: str
    positive_examples: int
    negative_examples: int
    avg_quality_score: float
    camera_diversity: int              # Number of unique cameras
    time_span_days: int                # Time span of examples
    newest_example_age_days: int
```
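How the semantic `version_id` advances is not pinned down above. One plausible convention (an assumption, mirroring semver): bump major for schema or label-taxonomy changes, minor when identities are added or removed, patch when examples are added to existing identities:

```python
def bump_version(parent: str, change: str) -> str:
    """Derive the next dataset version from its parent (assumed semver convention)."""
    major, minor, patch = (int(p) for p in parent.split("."))
    if change == "schema":
        return f"{major + 1}.0.0"
    if change == "identities":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"   # examples added to existing identities
```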

### 10.2 Dataset Export

```python
class DatasetExporter:
    """Export training datasets in various formats."""

    async def export_dataset(
        self,
        version: str,
        format: ExportFormat,
        include_images: bool = True,
        anonymize: bool = False,
        filter_criteria: ExportFilter | None = None
    ) -> ExportResult:
        """
        Export a dataset version to a portable format.

        Formats:
        - STRUCTURED_JSON: Organized by identity with metadata
        - COCO_FORMAT: Standard COCO detection format
        - REID_FORMAT: Person re-identification format
        - RAW_IMAGES: Flat folder structure with label files
        """
        dataset = await self.dataset_store.load_version(version)

        if filter_criteria:
            dataset = self._apply_filter(dataset, filter_criteria)

        # Build manifest
        manifest = self._build_manifest(dataset, format)

        # Export images (if requested)
        if include_images:
            image_archive = await self._export_images(
                dataset, anonymize=anonymize
            )
        else:
            image_archive = None

        # Write to export location
        export_path = f"exports/{version}/{format.value}_{datetime.utcnow().isoformat()}.tar.gz"

        await self.object_store.upload(
            path=export_path,
            data={"manifest": manifest, "images": image_archive}
        )

        return ExportResult(
            export_path=export_path,
            format=format,
            total_identities=len(dataset.identities),
            total_examples=dataset.total_examples,
            file_size_mb=await self.object_store.get_size(export_path)
        )
```

### 10.3 Retention Policy

```python
class RetentionPolicy:
    """Manage training data retention with privacy controls."""

    DEFAULT_POLICY = {
        # Event retention
        "rejected_events_days": 30,        # Keep rejected events for 30 days
        "approved_events_years": 2,        # Keep approved events for 2 years
        "conflicted_events_years": 1,      # Keep conflict records for 1 year

        # Dataset versions
        "dataset_versions_keep": 10,       # Keep last 10 dataset versions
        "dataset_versions_min_days": 90,   # Minimum retention for any version

        # Images
        "training_images_years": 2,        # Keep training images for 2 years
        "deleted_identity_images_days": 30, # Keep images of deleted identities 30d

        # Model artifacts
        "model_versions_keep": 20,         # Keep last 20 model versions
        "archived_models_days": 365,       # Archived models kept 1 year

        # Audit
        "audit_logs_years": 5,             # Audit logs kept 5 years
    }

    async def apply_retention(self):
        """
        Apply retention policy: soft-delete expired data,
        hard-delete after grace period.
        """
        # Soft-delete expired training events
        expired_events = await self.db.find_expired_events(
            before=datetime.utcnow() - timedelta(days=self.policy["rejected_events_days"])
        )
        for event in expired_events:
            await self.db.soft_delete_event(event.event_id)

        # Hard-delete after grace period
        grace_period_events = await self.db.find_soft_deleted_before(
            before=datetime.utcnow() - timedelta(days=30)
        )
        for event in grace_period_events:
            # Delete image files
            await self.object_store.delete(event.face_image_path)
            await self.db.hard_delete_event(event.event_id)

        # Cleanup old dataset versions
        versions = await self.dataset_store.list_versions()
        if len(versions) > self.policy["dataset_versions_keep"]:
            to_delete = versions[self.policy["dataset_versions_keep"]:]
            for version in to_delete:
                if version.age_days > self.policy["dataset_versions_min_days"]:
                    await self.dataset_store.archive_version(version.version_id)
```
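The policy values translate into deletion cutoffs with plain `timedelta` arithmetic (years approximated as 365-day blocks, which is adequate for retention purposes):

```python
from datetime import datetime, timedelta, timezone

POLICY = {"rejected_events_days": 30, "approved_events_years": 2}

def retention_cutoff(now: datetime, *, days: int = 0, years: int = 0) -> datetime:
    """Timestamps older than the returned cutoff are eligible for deletion."""
    return now - timedelta(days=days + 365 * years)

now = datetime(2024, 3, 1, tzinfo=timezone.utc)
rejected_cutoff = retention_cutoff(now, days=POLICY["rejected_events_days"])
approved_cutoff = retention_cutoff(now, years=POLICY["approved_events_years"])
```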

### 10.4 Privacy Controls

```python
class PrivacyControls:
    """Access control and anonymization for training data."""

    PERMISSION_MATRIX = {
        # Role -> [permissions]
        "admin": ["view_all", "export", "delete", "anonymize", "change_retention"],
        "trainer": ["view_approved", "export_filtered", "request_delete"],
        "reviewer": ["view_pending", "approve_reject", "view_own_actions"],
        "auditor": ["view_analytics", "view_audit_log", "export_metrics_only"],
        "viewer": ["view_own_identities", "view_own_examples"],
    }

    async def can_view_training_image(
        self,
        user_id: UUID,
        image_path: str,
        identity_id: UUID
    ) -> bool:
        """
        Check if user can view a specific training image.
        """
        user = await self.auth.get_user(user_id)

        # Admins can view all
        if user.role == "admin":
            return True

        # Users can view images they reviewed
        if await self.db.user_reviewed_image(user_id, image_path):
            return True

        # Users can view images for identities they manage
        if await self.db.user_manages_identity(user_id, identity_id):
            return True

        # Training role can view approved examples (with audit)
        if user.role == "trainer":
            await self.audit_log.record_image_view(user_id, image_path)
            return True

        return False

    async def anonymize_identity(
        self,
        identity_id: UUID,
        method: str = "blur_faces"
    ) -> AnonymizeResult:
        """
        Anonymize all training images for an identity.
        Used for GDPR/privacy compliance.
        """
        examples = await self.training_store.get_identity_examples(identity_id)

        for example in examples:
            if method == "blur_faces":
                await self._apply_face_blur(example.face_image_path)
            elif method == "remove_images":
                await self.object_store.delete(example.face_image_path)
                await self.training_store.mark_image_deleted(example.event_id)
            elif method == "synthetic_replace":
                # Replace with synthetic face preserving attributes
                await self._replace_with_synthetic(example)

        return AnonymizeResult(
            identity_id=identity_id,
            method=method,
            examples_processed=len(examples),
            completion_time=datetime.utcnow()
        )
```

---

## 11. Performance Monitoring

### 11.1 Monitoring Architecture

```
+------------------------------------------------------------------+
|                    PERFORMANCE MONITORING                         |
+------------------------------------------------------------------+
|                                                                   |
|  +------------------+  +------------------+  +------------------+ |
|  | Detection Events |  | Training Events  |  | Model Predictions| |
|  +--------+---------+  +--------+---------+  +--------+---------+ |
|           |                     |                     |           |
|           v                     v                     v           |
|  +--------+---------+  +--------+---------+  +--------+---------+ |
|  | Metrics          |  | Metrics          |  | Metrics          | |
|  | Collector        |  | Collector        |  | Collector        | |
|  +--------+---------+  +--------+---------+  +--------+---------+ |
|           |                     |                     |           |
|           +----------+----------+----------+----------+           |
|                      |                               |             |
|                      v                               v             |
|              +-------+-------+              +--------+-------+     |
|              |  Prometheus   |              |   Alerting     |     |
|              |  Time-Series  |              |   Manager      |     |
|              |  Database     |              +--------+-------+     |
|              +-------+-------+                       |             |
|                      |                               v             |
|                      v                      +--------+-------+     |
|              +-------+-------+              | Slack/Email/   |     |
|              |  Grafana      |              | PagerDuty      |     |
|              |  Dashboards   |              +----------------+     |
|              +---------------+                                      |
+------------------------------------------------------------------+
```

### 11.2 Key Metrics

```python
class PerformanceMetrics:
    """Metrics tracked for the face recognition system."""

    # Per-identity metrics (tracked over time)
    class IdentityMetrics(BaseModel):
        identity_id: UUID
        timestamp: datetime

        # Accuracy
        true_positives: int
        false_positives: int
        true_negatives: int
        false_negatives: int
        precision: float
        recall: float
        f1_score: float

        # Confidence
        avg_match_confidence: float
        confidence_std: float

        # Drift
        embedding_centroid_drift: float  # Distance from training centroid
        match_distance_trend: float      # Increasing = model struggling

        # Coverage
        detection_rate: float            # % of appearances detected
        unique_cameras_detected: int

    # Global metrics
    class GlobalMetrics(BaseModel):
        timestamp: datetime
        model_version: str

        # Overall accuracy
        overall_precision: float
        overall_recall: float
        overall_f1: float

        # Error rates
        false_positive_rate: float
        false_negative_rate: float
        false_discovery_rate: float

        # System health
        avg_inference_latency_ms: float
        p99_inference_latency_ms: float
        embedding_index_size: int
        active_identities: int

        # Training
        total_training_examples: int
        examples_last_24h: int
        pending_suggestions: int
        open_conflicts: int

    # Drift detection
    class DriftMetrics(BaseModel):
        timestamp: datetime
        model_version: str

        # Embedding drift
        avg_centroid_drift: float        # Average drift across all identities
        max_centroid_drift: float        # Worst-case drift
        drift_alert_threshold: float

        # Performance drift
        accuracy_trend: float            # Slope over last 7 days
        precision_trend: float
        recall_trend: float

        # Data distribution drift
        input_distribution_shift: float  # KL divergence from training distribution
        new_identity_rate: float         # % of detections that are new faces

        # Recommendations
        retrain_recommended: bool
        retrain_reason: str | None
        estimated_accuracy_if_retrained: float | None
```
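The derived fields (`precision`, `recall`, `f1_score`) follow directly from the raw counts; a guard against empty denominators keeps low-traffic identities from raising `ZeroDivisionError`:

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, guarding zero denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```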

### 11.3 Drift Detection

```python
class ModelDriftDetector:
    """Detects when model performance is degrading over time."""

    DRIFT_THRESHOLDS = {
        "accuracy_drop_pct": 0.03,        # 3% drop triggers alert
        "latency_increase_pct": 0.20,     # 20% latency increase triggers alert
        "centroid_drift_threshold": 0.15,  # Average drift > 0.15 triggers alert
        "new_identity_rate_threshold": 0.3, # >30% unknowns suggests model gap
        "consecutive_degraded_days": 3,   # 3 days of degradation triggers retrain
    }

    async def detect_drift(self) -> DriftReport:
        """
        Run drift detection and return report with recommendations.
        """
        report = DriftReport(timestamp=datetime.utcnow())

        # 1. Compare recent accuracy vs. baseline
        recent_accuracy = await self._get_recent_accuracy(days=7)
        baseline_accuracy = await self._get_baseline_accuracy()
        accuracy_drop = baseline_accuracy - recent_accuracy

        if accuracy_drop > self.DRIFT_THRESHOLDS["accuracy_drop_pct"]:
            report.alerts.append(DriftAlert(
                type="accuracy_degradation",
                severity="high",
                message=f"Accuracy dropped {accuracy_drop*100:.1f}% from baseline",
                recent_value=recent_accuracy,
                baseline_value=baseline_accuracy
            ))

        # 2. Check embedding centroid drift
        avg_drift = await self._calculate_avg_centroid_drift(days=7)
        if avg_drift > self.DRIFT_THRESHOLDS["centroid_drift_threshold"]:
            report.alerts.append(DriftAlert(
                type="embedding_drift",
                severity="medium",
                message=f"Average centroid drift: {avg_drift:.3f}",
                recent_value=avg_drift,
                baseline_value=0.0
            ))

        # 3. Check latency trends
        recent_latency = await self._get_recent_p99_latency(days=7)
        baseline_latency = await self._get_baseline_latency()
        latency_increase = (recent_latency - baseline_latency) / baseline_latency

        if latency_increase > self.DRIFT_THRESHOLDS["latency_increase_pct"]:
            report.alerts.append(DriftAlert(
                type="latency_increase",
                severity="medium",
                message=f"P99 latency increased {latency_increase*100:.1f}%",
                recent_value=recent_latency,
                baseline_value=baseline_latency
            ))

        # 4. Check for consecutive degradation
        consecutive_degraded = await self._check_consecutive_degraded_days()
        if consecutive_degraded >= self.DRIFT_THRESHOLDS["consecutive_degraded_days"]:
            report.recommendations.append(RetrainRecommendation(
                trigger="consecutive_degradation",
                urgency="high",
                reason=f"{consecutive_degraded} consecutive days of performance degradation",
                estimated_benefit=f"Restore accuracy to ~{baseline_accuracy:.3f}"
            ))

        # Determine overall drift status
        report.is_drifting = len(report.alerts) > 0
        report.retrain_recommended = any(
            r.urgency == "high" for r in report.recommendations
        )

        return report
```
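The `accuracy_trend` feeding the consecutive-degradation check can be a simple least-squares slope over the window (a sketch; any robust trend estimator would also do):

```python
import numpy as np

def accuracy_trend(daily_accuracy: list[float]) -> float:
    """Least-squares slope of accuracy over the window, in accuracy units per day.

    A sustained negative slope is what the consecutive-degradation
    check is looking for.
    """
    x = np.arange(len(daily_accuracy), dtype=float)
    slope, _intercept = np.polyfit(x, np.asarray(daily_accuracy), 1)
    return float(slope)
```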

### 11.4 Alert Configuration

```python
class AlertConfiguration:
    """Alert rules for the training system."""

    ALERT_RULES = [
        # Performance alerts
        {
            "name": "accuracy_drop_critical",
            "condition": "overall_precision < 0.95",
            "severity": "critical",
            "channels": ["pagerduty", "slack"],
            "cooldown_minutes": 30
        },
        {
            "name": "accuracy_drop_warning",
            "condition": "0.95 <= overall_precision < 0.97",
            "severity": "warning",
            "channels": ["slack"],
            "cooldown_minutes": 60
        },
        {
            "name": "identity_accuracy_degradation",
            "condition": "any(identity.precision < 0.9 for identity in identities)",
            "severity": "warning",
            "channels": ["slack"],
            "cooldown_minutes": 120
        },
        {
            "name": "drift_detected",
            "condition": "drift_report.is_drifting == True",
            "severity": "warning",
            "channels": ["slack"],
            "cooldown_minutes": 360
        },
        {
            "name": "high_retrain_recommended",
            "condition": "drift_report.retrain_recommended == True",
            "severity": "info",
            "channels": ["slack", "email"],
            "cooldown_minutes": 1440
        },
        # Training queue alerts
        {
            "name": "training_queue_backlog",
            "condition": "pending_suggestions > 500",
            "severity": "warning",
            "channels": ["slack"],
            "cooldown_minutes": 120
        },
        {
            "name": "label_conflicts_open",
            "condition": "open_conflicts > 10",
            "severity": "warning",
            "channels": ["slack"],
            "cooldown_minutes": 240
        },
        {
            "name": "gpu_resource_exhausted",
            "condition": "gpu_queue_wait_time_minutes > 60",
            "severity": "warning",
            "channels": ["slack"],
            "cooldown_minutes": 60
        },
    ]
```

---

## 12. UI for Training Management

### 12.1 Training Dashboard Layout

```
+------------------------------------------------------------------+
|  [Logo]  Surveillance AI    Training Center    [User: Admin]     |
+------------------------------------------------------------------+
|                                                                    |
|  +------------------+  +------------------+  +------------------+ |
|  | Training Status  |  | Suggestions      |  | Model Health     | |
|  |                  |  | Pending Review   |  |                  | |
|  | [=====>  ] 67%   |  |                  |  | Precision: 98.4% | |
|  | Job: #1247       |  | 42 suggestions   |  | Recall:   96.7%  | |
|  | ETA: 12 min      |  | from 8 identities|  | F1:       97.5%  | |
|  |                  |  |                  |  |                  | |
|  | [Pause] [Cancel] |  | [Review All]     |  | [View Details]   | |
|  +------------------+  +------------------+  +------------------+ |
|                                                                    |
|  +------------------+  +------------------+  +------------------+ |
|  | Recent Activity  |  | Conflicts        |  | Model Version    | |
|  |                  |  |                  |  |                  | |
|  | + Identity       |  | 3 open conflicts |  | Current: v1.2.0  | |
|  |   'John Smith'   |  | (needs review)   |  | Staging: v1.3.0  | |
|  |   +12 examples   |  |                  |  |                  | |
|  |                  |  | [View Conflicts] |  | [View History]   | |
|  | + 2 corrections  |  |                  |  | [Trigger Train]  | |
|  |   processed      |  |                  |  |                  | |
|  +------------------+  +------------------+  +------------------+ |
|                                                                    |
+------------------------------------------------------------------+
|  [Dashboard] [Suggestions] [Conflicts] [Models] [Datasets] [Logs] |
+------------------------------------------------------------------+
```

### 12.2 Suggestion Review Interface

```
+------------------------------------------------------------------+
|  Suggestions > Identity: John Smith   [Learning Mode: Suggested] |
+------------------------------------------------------------------+
|                                                                    |
|  Identity: John Smith (#uuid-1234)                                |
|  Current examples: 23 positive, 8 negative                         |
|  Estimated accuracy impact: +1.2%                                 |
|                                                                    |
|  +------------------+  +------------------+  +------------------+ |
|  | [Thumbnail 1]    |  | [Thumbnail 2]    |  | [Thumbnail 3]    | |
|  | Confidence: 0.97 |  | Confidence: 0.94 |  | Confidence: 0.91 | |
|  | Camera: Cam-01   |  | Camera: Cam-03   |  | Camera: Cam-01   | |
|  | 2024-01-20 14:32 |  | 2024-01-21 09:15 |  | 2024-01-22 16:45 | |
|  |                  |  |                  |  |                  | |
|  | [x] Approve      |  | [x] Approve      |  | [ ] Approve      | |
|  | [ ] Reject       |  | [ ] Reject       |  | [x] Reject       | |
|  | [ ] Correct...   |  | [ ] Correct...   |  | [ ] Correct...   | |
|  +------------------+  +------------------+  +------------------+ |
|                                                                    |
|  +------------------+  +------------------+  +------------------+ |
|  | [Thumbnail 4]    |  | [Thumbnail 5]    |  | [Thumbnail 6]    | |
|  | Confidence: 0.89 |  | Confidence: 0.87 |  | Confidence: 0.85 | |
|  |                  |  |                  |  |                  | |
|  | [x] Approve      |  | [ ] Approve      |  | [ ] Approve      | |
|  | [ ] Reject       |  | [ ] Reject       |  | [ ] Reject       | |
|  | [ ] Correct...   |  | [ ] Correct...   |  | [ ] Correct...   | |
|  +------------------+  +------------------+  +------------------+ |
|                                                                    |
|  [Select All] [Select None]  [Approve Selected] [Reject Selected] |
|                                                                    |
|  Conflict Risk: LOW  |  Diversity Score: 0.73  |  Quality: GOOD   |
+------------------------------------------------------------------+
```
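The footer's Diversity Score can be derived from how widely the candidate examples spread across cameras and capture days. A sketch of one plausible formula (the entropy-based weighting and the one-week saturation are assumptions, not the platform's actual metric):

```python
import math
from collections import Counter

def diversity_score(examples):
    """Sketch: score in [0, 1] combining camera spread (normalized entropy)
    and day-of-capture spread. The 50/50 weighting is an assumption."""
    if len(examples) < 2:
        return 0.0
    cameras = Counter(e["camera_id"] for e in examples)
    total = sum(cameras.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in cameras.values())
    if len(cameras) > 1:
        camera_spread = entropy / math.log2(len(cameras))  # normalize to [0, 1]
    else:
        camera_spread = 0.0                                # single camera: no spread
    days = {e["captured_at"][:10] for e in examples}       # ISO date prefix
    day_spread = min(len(days) / 7.0, 1.0)                 # saturate at one week
    return round(0.5 * camera_spread + 0.5 * day_spread, 2)
```

Surfacing this next to Conflict Risk helps reviewers spot batches drawn from a single camera or a single day, which overfit the identity to one setting.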

### 12.3 Conflict Resolution Interface

```
+------------------------------------------------------------------+
|  Conflicts > #conf-456 (SAME_FACE_DIFF_ID)                      |
+------------------------------------------------------------------+
|                                                                    |
|  ALERT: Same face assigned to different identities                 |
|  Face Detection: #det-789  |  Quality: 0.94                       |
|                                                                    |
|  +--------------------+       +--------------------+              |
|  | [Large Face Image] |       | Assignment History |              |
|  |                    |       |                    |              |
|  |                    |       | + 2024-01-15       |              |
|  |                    |       |   By: user_a       |              |
|  |   High-res face    |       |   Assigned to:     |              |
|  |   bounding box     |       |   "John Smith"     |              |
|  |   with landmarks   |       |   [View Event]     |              |
|  |                    |       |                    |              |
|  |                    |       | + 2024-01-22       |              |
|  |                    |       |   By: user_b       |              |
|  |                    |       |   Corrected to:    |              |
|  |                    |       |   "Bob Johnson"    |              |
|  |                    |       |   [View Event]     |              |
|  +--------------------+       +--------------------+              |
|                                                                    |
|  Resolution:                                                       |
|  ( ) This face is "John Smith" - keep first assignment            |
|  ( ) This face is "Bob Johnson" - accept the correction           |
|  ( ) This is a different person - create new identity             |
|  ( ) Cannot determine - escalate to senior reviewer               |
|                                                                    |
|  [Submit Resolution]  [View Similar Faces]  [Escalate]            |
+------------------------------------------------------------------+
```
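Submitting a resolution has three effects: the conflict row is closed, the losing training events are rejected, and the detection stops blocking training. A sketch over in-memory records (field names mirror the schema in section 13; the helper itself is illustrative):

```python
def resolve_conflict(conflict, events, winning_identity_id, resolver_id, notes=""):
    """Sketch: apply a SAME_FACE_DIFF_ID resolution. `conflict` and `events`
    are dicts mirroring label_conflicts / training_events rows."""
    if conflict["status"] == "resolved":
        raise ValueError("conflict already resolved")
    for event in events:                      # events cited in conflicting_labels
        if event["identity_id"] == winning_identity_id:
            event["status"] = "approved"      # winning label becomes trainable
        else:
            event["status"] = "rejected"      # losing labels are dropped
    conflict.update(
        status="resolved",
        resolution="label_selected",
        winning_identity_id=winning_identity_id,
        resolved_by_user_id=resolver_id,
        resolution_notes=notes,
        blocks_training=False,                # detection can re-enter training
    )
    return conflict
```

In production this would run inside one database transaction so the conflict and its events never disagree.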

### 12.4 Model Version History

```
+------------------------------------------------------------------+
|  Models > Version History                                          |
+------------------------------------------------------------------+
|                                                                    |
|  [Filter: All] [Sort: Newest]  [Compare] [Export Report]          |
|                                                                    |
|  +------+--------+--------+-----------+--------+---------------------+ |
|  | Ver  | Date   | Status | Precision | Recall | Actions             | |
|  +------+--------+--------+-----------+--------+---------------------+ |
|  | 1.3.0| Jan 25 | Staging| 98.4%     | 96.8%  | [View][Deploy][A/B] | |
|  | 1.2.1| Jan 23 | Prod   | 98.2%     | 96.7%  | [View][Rollback]    | |
|  | 1.2.0| Jan 20 | Arch   | 97.8%     | 96.1%  | [View][Restore]     | |
|  | 1.1.0| Jan 15 | Arch   | 97.5%     | 95.8%  | [View]              | |
|  | 1.0.0| Jan 10 | Arch   | 96.2%     | 94.1%  | [View]              | |
|  +------+--------+--------+-----------+--------+---------------------+ |
|                                                                    |
|  Version 1.3.0 Details:                                           |
|  - Training data: dataset-v1.3.0 (1,245 examples, 89 identities)  |
|  - Training strategy: Fine-tune (from v1.2.0 checkpoint)          |
|  - Training time: 45 minutes                                        |
|  - GPU used: NVIDIA V100 x 2                                        |
|  - New identities added: 12                                        |
|  - Quality gates: ALL PASSED                                        |
|                                                                    |
|  [Download Model] [Download Dataset] [View Training Log]          |
+------------------------------------------------------------------+
```

### 12.5 Training Trigger UI

```
+------------------------------------------------------------------+
|  Manual Training Trigger                                          |
+------------------------------------------------------------------+
|                                                                    |
|  Training Mode:                                                    |
|  ( ) Incremental Update - Fast, update embeddings only            |
|  (*) Fine-tuning - Moderate, update model weights                 |
|  ( ) Full Retraining - Slow, retrain from scratch                 |
|                                                                    |
|  Data Selection:                                                   |
|  [x] Include all approved examples since last training             |
|  [x] Include new identities (5 new since last train)               |
|  [ ] Force retrain on all identities                               |
|  [ ] Use custom dataset version: [________] [Browse]               |
|                                                                    |
|  Advanced Options:                                                 |
|  [ ] Override quality gates (admin only)                           |
|  [ ] Auto-deploy on success                                        |
|  [ ] Enable A/B testing: [__10__]% traffic to new model            |
|                                                                    |
|  Estimated:                                                        |
|  - Training time: ~45 minutes                                      |
|  - GPU requirement: V100 x 2                                       |
|  - Dataset size: ~1,245 examples                                   |
|                                                                    |
|  [Preview Changes]  [Schedule for Later]  [START TRAINING NOW]     |
+------------------------------------------------------------------+
```
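Server-side, this form maps to a job-creation request that should be validated before anything is queued. A sketch of the checks the form implies (field names such as `ab_traffic_percent` are assumptions):

```python
VALID_MODES = {"incremental", "fine_tune", "full_retrain"}

def validate_trigger(request, is_admin):
    """Sketch of server-side checks for the manual-trigger form."""
    errors = []
    if request.get("training_mode") not in VALID_MODES:
        errors.append("training_mode must be one of %s" % sorted(VALID_MODES))
    ab = request.get("ab_traffic_percent")
    if ab is not None and not (0 < ab <= 100):
        errors.append("ab_traffic_percent must be in (0, 100]")
    if request.get("override_quality_gates") and not is_admin:
        errors.append("override_quality_gates requires admin role")
    return errors
```

Validating on the server rather than trusting the form keeps the admin-only gate override enforceable even for direct API callers.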

---

## 13. Database Schema

### 13.1 Core Tables

```sql
-- ============================================================
-- TRAINING EVENTS
-- Captures every user action that could generate training data
-- ============================================================
CREATE TABLE training_events (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type VARCHAR(32) NOT NULL,        -- IDENTITY_NAMED, MATCH_CONFIRMED, etc.
    status VARCHAR(16) NOT NULL DEFAULT 'pending',
                                            -- pending, approved, rejected, conflicted, trained

    -- Source detection reference
    detection_id UUID NOT NULL,
    face_image_path TEXT NOT NULL,
    full_frame_path TEXT,

    -- Identity assignment
    identity_id UUID NOT NULL,
    previous_identity_id UUID,              -- For corrections: old identity
    is_unknown BOOLEAN NOT NULL DEFAULT FALSE,

    -- Embedding (at time of event)
    embedding_vector VECTOR(512),           -- pgvector extension
    embedding_model_version VARCHAR(32) NOT NULL,

    -- Quality metrics
    face_quality_score FLOAT NOT NULL,
    face_size_px INTEGER NOT NULL,
    detection_confidence FLOAT NOT NULL,
    pose_yaw FLOAT,
    pose_pitch FLOAT,
    pose_roll FLOAT,

    -- Metadata
    source_camera_id VARCHAR(64),
    captured_at TIMESTAMP NOT NULL,
    reviewed_by_user_id UUID NOT NULL,
    review_timestamp TIMESTAMP NOT NULL,
    review_method VARCHAR(32) NOT NULL DEFAULT 'manual_click',

    -- Training lifecycle
    suggestion_id UUID,                     -- Links to training suggestion
    conflict_id UUID,                       -- Links to label conflict
    approved_at TIMESTAMP,
    trained_at TIMESTAMP,
    model_version_trained VARCHAR(32),

    -- Timestamps
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),

    -- Indexes
    CONSTRAINT valid_status CHECK (status IN (
        'pending', 'approved', 'rejected', 'conflicted', 'trained'
    )),
    CONSTRAINT valid_event_type CHECK (event_type IN (
        'IDENTITY_NAMED', 'MATCH_CONFIRMED', 'MATCH_CORRECTED',
        'IDENTITY_MERGED', 'MATCH_REJECTED'
    ))
);

CREATE INDEX idx_training_events_identity ON training_events(identity_id, status);
CREATE INDEX idx_training_events_detection ON training_events(detection_id);
CREATE INDEX idx_training_events_status ON training_events(status) WHERE status = 'pending';
CREATE INDEX idx_training_events_suggestion ON training_events(suggestion_id);
CREATE INDEX idx_training_events_conflict ON training_events(conflict_id);
CREATE INDEX idx_training_events_created ON training_events(created_at);

-- HNSW index for vector similarity search on embeddings
CREATE INDEX idx_training_events_embedding ON training_events
    USING hnsw (embedding_vector vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- ============================================================
-- LABEL CONFLICTS
-- Tracks detected label conflicts requiring resolution
-- ============================================================
CREATE TABLE label_conflicts (
    conflict_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    conflict_type VARCHAR(32) NOT NULL,     -- SAME_FACE_DIFF_ID, TRANSITIVE_CHAIN, etc.
    severity VARCHAR(16) NOT NULL,          -- LOW, MEDIUM, HIGH, CRITICAL
    status VARCHAR(16) NOT NULL DEFAULT 'open',
                                            -- open, under_review, resolved, escalated

    detection_id UUID NOT NULL,
    face_image_path TEXT NOT NULL,
    conflicting_labels JSONB NOT NULL,      -- Array of {event_id, identity_id, user_id}

    -- Resolution
    resolution VARCHAR(16),                 -- label_selected, both_valid, escalated
    resolved_by_user_id UUID,
    winning_identity_id UUID,
    resolution_notes TEXT,

    blocks_training BOOLEAN NOT NULL DEFAULT TRUE,

    detected_at TIMESTAMP NOT NULL DEFAULT NOW(),
    resolved_at TIMESTAMP,

    CONSTRAINT valid_severity CHECK (severity IN ('LOW', 'MEDIUM', 'HIGH', 'CRITICAL')),
    CONSTRAINT valid_status CHECK (status IN ('open', 'under_review', 'resolved', 'escalated'))
);

CREATE INDEX idx_label_conflicts_status ON label_conflicts(status);
CREATE INDEX idx_label_conflicts_detection ON label_conflicts(detection_id);

-- ============================================================
-- TRAINING SUGGESTIONS
-- Aggregated suggestions for admin review
-- ============================================================
CREATE TABLE training_suggestions (
    suggestion_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    identity_id UUID NOT NULL,
    identity_name VARCHAR(256) NOT NULL,

    status VARCHAR(16) NOT NULL DEFAULT 'pending',
                                            -- pending, approved, rejected, corrected

    -- Content summary
    new_positive_count INTEGER NOT NULL DEFAULT 0,
    corrected_count INTEGER NOT NULL DEFAULT 0,
    merged_count INTEGER NOT NULL DEFAULT 0,

    -- Impact assessment
    estimated_accuracy_change FLOAT,        -- Predicted accuracy delta
    conflict_risk_score FLOAT NOT NULL,     -- 0-1, higher = more risk
    diversity_score FLOAT,                  -- 0-1, camera/time diversity

    -- Current state
    existing_positive_count INTEGER NOT NULL,
    existing_negative_count INTEGER NOT NULL,

    -- Review
    reviewed_by UUID,
    reviewed_at TIMESTAMP,
    review_notes TEXT,

    auto_approve_eligible BOOLEAN DEFAULT FALSE,

    generated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMP,                   -- Suggestions expire after 7 days

    CONSTRAINT valid_status CHECK (status IN ('pending', 'approved', 'rejected', 'corrected'))
);

CREATE INDEX idx_training_suggestions_status ON training_suggestions(status);
CREATE INDEX idx_training_suggestions_identity ON training_suggestions(identity_id);

-- ============================================================
-- MODEL VERSIONS
-- Registry of all trained model versions
-- ============================================================
CREATE TABLE model_versions (
    version_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    version_string VARCHAR(16) NOT NULL UNIQUE,  -- Semantic: 1.2.3

    model_name VARCHAR(64) NOT NULL DEFAULT 'face-recognition',
    model_type VARCHAR(32) NOT NULL,             -- arcface, facenet, custom
    architecture VARCHAR(32) NOT NULL,           -- iresnet100, mobilenet, etc.
    embedding_dimension INTEGER NOT NULL DEFAULT 512,

    -- Training lineage
    training_job_id UUID NOT NULL,
    training_dataset_version VARCHAR(32) NOT NULL,
    base_model_version VARCHAR(16),              -- Previous version (for fine-tuning)
    training_strategy VARCHAR(32) NOT NULL,      -- incremental, fine_tune, full_retrain

    -- Artifacts
    model_artifact_path TEXT NOT NULL,
    checkpoint_path TEXT,
    onnx_export_path TEXT,
    tensorrt_path TEXT,

    -- Performance (from quality gates)
    test_precision FLOAT,
    test_recall FLOAT,
    test_f1_score FLOAT,
    test_false_positive_rate FLOAT,
    test_false_negative_rate FLOAT,
    test_latency_p50_ms FLOAT,
    test_latency_p99_ms FLOAT,

    -- Deployment state
    stage VARCHAR(16) NOT NULL DEFAULT 'staging',
                                            -- staging, production, archived
    deployment_date TIMESTAMP,
    traffic_percentage FLOAT NOT NULL DEFAULT 0.0,

    -- Rollback
    can_rollback BOOLEAN NOT NULL DEFAULT TRUE,
    rollback_to_version VARCHAR(16),

    -- Audit
    created_by UUID NOT NULL,
    approved_by UUID,
    approval_notes TEXT,

    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),

    CONSTRAINT valid_stage CHECK (stage IN ('staging', 'production', 'archived'))
);

CREATE INDEX idx_model_versions_stage ON model_versions(stage);
CREATE INDEX idx_model_versions_string ON model_versions(version_string);

-- ============================================================
-- TRAINING JOBS
-- Track all training job executions
-- ============================================================
CREATE TABLE training_jobs (
    job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    trigger_type VARCHAR(32) NOT NULL,       -- scheduled, threshold, manual, degradation

    status VARCHAR(32) NOT NULL DEFAULT 'queued',
                                            -- See state machine below

    -- Configuration
    training_mode VARCHAR(32) NOT NULL,      -- incremental, fine_tune, full_retrain
    dataset_version VARCHAR(32),
    gpu_type VARCHAR(16),
    gpu_count INTEGER NOT NULL DEFAULT 1,
    auto_deploy BOOLEAN DEFAULT FALSE,
    override_quality_gates BOOLEAN DEFAULT FALSE,

    -- Progress tracking
    current_phase VARCHAR(32),
    progress_percent FLOAT DEFAULT 0.0,
    current_step TEXT,

    -- Results
    model_version_id UUID REFERENCES model_versions(version_id),
    quality_gate_result JSONB,
    training_metrics JSONB,
    error_message TEXT,

    -- Timing
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    duration_seconds INTEGER,

    -- Audit
    created_by UUID NOT NULL,

    CONSTRAINT valid_trigger CHECK (trigger_type IN (
        'scheduled', 'threshold', 'manual', 'degradation', 'new_identity'
    )),
    CONSTRAINT valid_status CHECK (status IN (
        'queued', 'acquiring_resources', 'building_dataset', 'determining_strategy',
        'incremental_update', 'fine_tuning', 'full_training', 'quality_check',
        'quality_failed', 'registering', 'deploying', 'completed',
        'awaiting_deployment_approval', 'failed', 'cancelled'
    ))
);

CREATE INDEX idx_training_jobs_status ON training_jobs(status);
CREATE INDEX idx_training_jobs_created ON training_jobs(created_at);

-- ============================================================
-- DATASET VERSIONS
-- Track dataset snapshots used for training
-- ============================================================
CREATE TABLE dataset_versions (
    version_id VARCHAR(32) PRIMARY KEY,      -- Semantic: "dataset-v1.2.3"
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    created_by UUID NOT NULL,

    total_identities INTEGER NOT NULL,
    total_positive_examples INTEGER NOT NULL,
    total_negative_examples INTEGER NOT NULL,

    avg_face_quality_score FLOAT,
    avg_embedding_quality FLOAT,
    conflict_count INTEGER NOT NULL DEFAULT 0,

    -- Storage
    manifest_path TEXT NOT NULL,
    images_archive_path TEXT,
    embeddings_path TEXT,

    -- Lineage
    parent_version VARCHAR(32) REFERENCES dataset_versions(version_id),
    changelog TEXT,

    -- Retention
    retention_until TIMESTAMP NOT NULL
);

-- ============================================================
-- PERFORMANCE METRICS (Time-series, optional: use TimescaleDB)
-- ============================================================
CREATE TABLE identity_performance_metrics (
    metric_id BIGSERIAL PRIMARY KEY,
    identity_id UUID NOT NULL,
    model_version VARCHAR(16) NOT NULL,
    timestamp TIMESTAMP NOT NULL DEFAULT NOW(),

    true_positives INTEGER NOT NULL DEFAULT 0,
    false_positives INTEGER NOT NULL DEFAULT 0,
    true_negatives INTEGER NOT NULL DEFAULT 0,
    false_negatives INTEGER NOT NULL DEFAULT 0,

    precision FLOAT,
    recall FLOAT,
    f1_score FLOAT,

    avg_match_confidence FLOAT,
    embedding_centroid_drift FLOAT,
    detection_rate FLOAT,

    CONSTRAINT positive_metrics CHECK (
        true_positives >= 0 AND false_positives >= 0 AND
        true_negatives >= 0 AND false_negatives >= 0
    )
);

-- Optional: if the TimescaleDB extension is installed, convert to a hypertable
SELECT create_hypertable('identity_performance_metrics', 'timestamp');

CREATE INDEX idx_identity_perf_identity ON identity_performance_metrics(identity_id, timestamp);
CREATE INDEX idx_identity_perf_model ON identity_performance_metrics(model_version, timestamp);

-- ============================================================
-- AUDIT LOG
-- All administrative actions on training system
-- ============================================================
CREATE TABLE training_audit_log (
    audit_id BIGSERIAL PRIMARY KEY,
    action VARCHAR(64) NOT NULL,
    actor_id UUID NOT NULL,
    actor_role VARCHAR(32) NOT NULL,
    target_type VARCHAR(32) NOT NULL,       -- model, dataset, event, conflict, suggestion
    target_id UUID NOT NULL,
    details JSONB,
    ip_address INET,
    timestamp TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_audit_log_target ON training_audit_log(target_type, target_id);
CREATE INDEX idx_audit_log_actor ON training_audit_log(actor_id, timestamp);
CREATE INDEX idx_audit_log_timestamp ON training_audit_log(timestamp);
```
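The `valid_status` CHECK on `training_jobs` fixes the vocabulary but not the ordering; legal transitions belong in application code. A sketch of one plausible transition table (the edge set is an assumption consistent with the state names above, not a specification):

```python
# Allowed training_jobs status transitions (illustrative edge set).
TRANSITIONS = {
    "queued": {"acquiring_resources", "cancelled"},
    "acquiring_resources": {"building_dataset", "failed", "cancelled"},
    "building_dataset": {"determining_strategy", "failed", "cancelled"},
    "determining_strategy": {"incremental_update", "fine_tuning", "full_training"},
    "incremental_update": {"quality_check", "failed"},
    "fine_tuning": {"quality_check", "failed"},
    "full_training": {"quality_check", "failed"},
    "quality_check": {"registering", "quality_failed"},
    "quality_failed": {"failed"},
    "registering": {"deploying", "awaiting_deployment_approval", "completed"},
    "awaiting_deployment_approval": {"deploying", "completed"},
    "deploying": {"completed", "failed"},
}

def advance(job, new_status):
    """Move a job dict to new_status, rejecting illegal jumps."""
    if new_status not in TRANSITIONS.get(job["status"], set()):
        raise ValueError(f"illegal transition {job['status']} -> {new_status}")
    job["status"] = new_status
    return job
```

Terminal states (`completed`, `failed`, `cancelled`) have no outgoing edges, so a finished job can never be silently resurrected.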

---

## 14. API Specification

### 14.1 Training Events API

```yaml
openapi: 3.0.0
paths:
  /api/v1/training/events:
    post:
      summary: Submit a training event
      requestBody:
        content:
          application/json:
            schema:
              type: object
              required: [event_type, detection_id, identity_id, reviewed_by_user_id]
              properties:
                event_type:
                  type: string
                  enum: [IDENTITY_NAMED, MATCH_CONFIRMED, MATCH_CORRECTED, IDENTITY_MERGED, MATCH_REJECTED]
                detection_id:
                  type: string
                  format: uuid
                identity_id:
                  type: string
                  format: uuid
                previous_identity_id:
                  type: string
                  format: uuid
                reviewed_by_user_id:
                  type: string
                  format: uuid
                review_method:
                  type: string
                  enum: [manual_click, bulk_approve, auto, api]
      responses:
        201:
          description: Event created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/TrainingEvent'
        409:
          description: Conflict detected - requires resolution

  /api/v1/training/events/bulk:
    post:
      summary: Submit multiple training events in bulk
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                events:
                  type: array
                  items:
                    $ref: '#/components/schemas/SingleEvent'
                auto_approve:
                  type: boolean
      responses:
        201:
          description: Events processed
          content:
            application/json:
              schema:
                type: object
                properties:
                  accepted:
                    type: integer
                  rejected:
                    type: integer
                  conflicts:
                    type: integer
                  event_ids:
                    type: array
                    items:
                      type: string
                      format: uuid

  /api/v1/training/suggestions:
    get:
      summary: List pending training suggestions
      parameters:
        - name: status
          in: query
          schema:
            type: string
            enum: [pending, approved, rejected, all]
        - name: identity_id
          in: query
          schema:
            type: string
            format: uuid
        - name: limit
          in: query
          schema:
            type: integer
            default: 50
        - name: offset
          in: query
          schema:
            type: integer
            default: 0
      responses:
        200:
          description: List of suggestions

  /api/v1/training/suggestions/{id}/approve:
    post:
      summary: Approve a training suggestion
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                notes:
                  type: string
      responses:
        200:
          description: Suggestion approved

  /api/v1/training/suggestions/{id}/reject:
    post:
      summary: Reject a training suggestion
      requestBody:
        content:
          application/json:
            schema:
              type: object
              required: [reason]
              properties:
                reason:
                  type: string
      responses:
        200:
          description: Suggestion rejected

  /api/v1/training/suggestions/bulk-approve:
    post:
      summary: Bulk approve multiple suggestions
      requestBody:
        content:
          application/json:
            schema:
              type: object
              required: [suggestion_ids]
              properties:
                suggestion_ids:
                  type: array
                  items:
                    type: string
                    format: uuid
      responses:
        200:
          description: Suggestions approved

  /api/v1/training/jobs:
    post:
      summary: Trigger a new training job
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                training_mode:
                  type: string
                  enum: [incremental, fine_tune, full_retrain]
                dataset_version:
                  type: string
                auto_deploy:
                  type: boolean
                ab_test_config:
                  type: object
      responses:
        201:
          description: Training job created

    get:
      summary: List training jobs
      parameters:
        - name: status
          in: query
          schema:
            type: string
        - name: limit
          in: query
          schema:
            type: integer
            default: 20
      responses:
        200:
          description: List of training jobs

  /api/v1/training/jobs/{id}:
    get:
      summary: Get training job details and status
      responses:
        200:
          description: Job details

    delete:
      summary: Cancel a running training job
      responses:
        200:
          description: Job cancelled

  /api/v1/models:
    get:
      summary: List model versions
      responses:
        200:
          description: List of models

  /api/v1/models/{version}/rollback:
    post:
      summary: Rollback to a specific model version
      requestBody:
        content:
          application/json:
            schema:
              type: object
              required: [reason]
              properties:
                reason:
                  type: string
      responses:
        200:
          description: Rollback initiated

  /api/v1/models/{version}/ab-test:
    post:
      summary: Start A/B test for a model version
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                traffic_split:
                  type: object
                  properties:
                    current:
                      type: number
                    new:
                      type: number
                duration_hours:
                  type: integer
      responses:
        201:
          description: A/B test started

  /api/v1/metrics/drift:
    get:
      summary: Get drift detection report
      responses:
        200:
          description: Drift report

  /api/v1/metrics/performance:
    get:
      summary: Get model performance metrics
      parameters:
        - name: model_version
          in: query
          schema:
            type: string
        - name: identity_id
          in: query
          schema:
            type: string
            format: uuid
        - name: from
          in: query
          schema:
            type: string
            format: date-time
        - name: to
          in: query
          schema:
            type: string
            format: date-time
      responses:
        200:
          description: Performance metrics

  /api/v1/conflicts:
    get:
      summary: List label conflicts
      parameters:
        - name: status
          in: query
          schema:
            type: string
            enum: [open, under_review, resolved, all]
      responses:
        200:
          description: List of conflicts

  /api/v1/conflicts/{id}/resolve:
    post:
      summary: Resolve a label conflict
      requestBody:
        content:
          application/json:
            schema:
              type: object
              required: [resolution, winning_identity_id]
              properties:
                resolution:
                  type: string
                  enum: [label_selected, both_valid, escalated]
                winning_identity_id:
                  type: string
                  format: uuid
                notes:
                  type: string
      responses:
        200:
          description: Conflict resolved
```
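Clients can mirror the `POST /api/v1/training/events` schema to reject malformed payloads before the round-trip. A sketch covering required fields, the enum, and UUID formats (validation only; the HTTP call itself is omitted):

```python
import uuid

EVENT_TYPES = {"IDENTITY_NAMED", "MATCH_CONFIRMED", "MATCH_CORRECTED",
               "IDENTITY_MERGED", "MATCH_REJECTED"}
REQUIRED = ("event_type", "detection_id", "identity_id", "reviewed_by_user_id")

def validate_event(payload):
    """Sketch: client-side check mirroring the POST /training/events schema."""
    errors = [f"missing {field}" for field in REQUIRED if field not in payload]
    if payload.get("event_type") not in EVENT_TYPES:
        errors.append("event_type not in enum")
    for field in ("detection_id", "identity_id", "reviewed_by_user_id"):
        try:
            uuid.UUID(str(payload.get(field)))   # raises ValueError on bad input
        except ValueError:
            errors.append(f"{field} is not a UUID")
    return errors
```

Note the server can still answer 409 (conflict detected) for a structurally valid payload; this check only rules out 4xx errors the client can catch for free.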

### 14.2 WebSocket Events

```yaml
TrainingWebSocket:
  /ws/training:
    events:
      job_status_update:
        description: Real-time training job progress
        payload:
          job_id: uuid
          status: string
          progress_percent: float
          current_step: string
          eta_seconds: int

      suggestion_created:
        description: New training suggestion available
        payload:
          suggestion_id: uuid
          identity_id: uuid
          identity_name: string
          new_example_count: int

      conflict_detected:
        description: New label conflict detected
        payload:
          conflict_id: uuid
          conflict_type: string
          severity: string
          detection_id: uuid

      quality_gate_result:
        description: Quality gate completed for training job
        payload:
          job_id: uuid
          model_version: string
          passed: boolean
          gate_results: object

      model_deployed:
        description: New model deployed to production
        payload:
          version: string
          previous_version: string
          deployment_method: string  # auto, manual, ab_test_promoted
```
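On the client side, these events map naturally onto a handler registry keyed by event name. A sketch that assumes an `{"event": ..., "payload": ...}` message envelope (the envelope shape is an assumption; the spec above only names the events and payload fields):

```python
import json

class TrainingSocketClient:
    """Sketch: dispatch /ws/training messages to registered handlers."""

    def __init__(self):
        self.handlers = {}

    def on(self, event_name):
        """Decorator-style registration: client.on("job_status_update")(fn)."""
        def register(fn):
            self.handlers[event_name] = fn
            return fn
        return register

    def dispatch(self, raw_message):
        message = json.loads(raw_message)
        handler = self.handlers.get(message["event"])
        if handler is None:
            return False                 # unknown event: ignore, stay forward-compatible
        handler(message["payload"])
        return True
```

Ignoring unknown event names lets the server add new event types without breaking older dashboard builds.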

---

## 15. Implementation Roadmap

### 15.1 Phase 1: Foundation (Weeks 1-4)

| Week | Deliverable | Description |
|------|-------------|-------------|
| 1 | Database schema | Create all core tables, indexes, constraints |
| 1 | Event collection API | Submit and bulk-submit training events |
| 2 | Quality filter | Face quality assessment pipeline |
| 2 | Conflict detector | Automatic conflict detection on event submission |
| 3 | Conflict UI | Conflict review and resolution interface |
| 3 | Suggestion engine | Aggregate events into reviewable suggestions |
| 4 | Suggestion UI | Admin review interface for suggestions |
| 4 | Learning mode config | Support manual-only and suggested modes |

### 15.2 Phase 2: Training Pipeline (Weeks 5-8)

| Week | Deliverable | Description |
|------|-------------|-------------|
| 5 | Dataset builder | Build training datasets from approved events |
| 5 | Dataset versioning | Snapshot and version datasets |
| 6 | Training orchestrator | Job creation, queuing, GPU allocation |
| 6 | Incremental updater | Centroid-based embedding updates |
| 7 | Full retrain pipeline | Complete model retraining workflow |
| 7 | Fine-tuning pipeline | Fine-tune from existing checkpoint |
| 8 | Quality gates | All 5 gate checks implemented |
| 8 | Model registry | Version registration and artifact storage |

### 15.3 Phase 3: Deployment & Safety (Weeks 9-11)

| Week | Deliverable | Description |
|------|-------------|-------------|
| 9 | Model deployment | Atomic model switching |
| 9 | Rollback system | One-click rollback to previous version |
| 10 | A/B testing | Traffic splitting and comparison |
| 10 | Auto-deploy config | Configurable auto-deployment rules |
| 11 | Embedding migration | Safe embedding index updates |
| 11 | Consistency checker | Post-update verification |
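
The week-10 A/B testing deliverable needs a traffic-splitting rule that keeps each camera stream pinned to one model for the duration of an experiment. One common approach, sketched here with illustrative names, is a deterministic hash of the stream ID:

```python
# Deterministic traffic split for model A/B tests: hashing the stream ID
# pins each camera to one variant for the whole experiment. Function and
# parameter names are illustrative.
import hashlib

def assign_variant(stream_id: str, treatment_pct: float, salt: str = "ab-exp-1") -> str:
    """Route stream_id to 'treatment' (candidate model) or 'control' (current model)."""
    digest = hashlib.sha256(f"{salt}:{stream_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_pct else "control"
```

Changing the salt reshuffles assignments for a new experiment without any stored state, and the split stays stable across process restarts.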

### 15.4 Phase 4: Monitoring & Polish (Weeks 12-14)

| Week | Deliverable | Description |
|------|-------------|-------------|
| 12 | Performance metrics | Per-identity and global metrics collection |
| 12 | Drift detector | Automatic drift detection and alerts |
| 13 | Alerting system | Slack/email/PagerDuty integration |
| 13 | Training dashboard | Full admin dashboard |
| 14 | Dataset export | Multiple export formats |
| 14 | Retention policy | Automated data lifecycle management |
| 14 | Documentation | Full API docs and runbooks |

---

## Appendix A: Key Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Embedding store | FAISS + pgvector | Fast ANN search with relational metadata |
| Model training | PyTorch Lightning | Structured training with built-in checkpointing |
| Model registry | MLflow | Industry standard, supports versioning and lineage |
| GPU scheduling | Kubernetes + KEDA | Auto-scaling based on queue depth |
| Orchestration | Apache Airflow | Robust DAG-based pipeline execution |
| Conflicts | Hard block on detection | Never train on conflicting data |
| Rollback | Dual-index strategy | Zero-downtime atomic switching |
| Updates | Centroid-first | 95% of updates are incremental; full retrain only when needed |
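
The centroid-first decision can be illustrated concretely. Using the `alpha` adaptation rate from the configuration reference (Appendix C), an incremental update blends a newly approved embedding into the identity centroid and re-normalizes; pure-Python vectors keep the sketch dependency-free, whereas production code would operate on the FAISS index directly.

```python
# Sketch of a centroid-first incremental update. alpha matches the
# centroid_update.alpha setting in training_config.yaml; the function
# itself is illustrative.
import math

def update_centroid(centroid: list[float], new_embedding: list[float],
                    alpha: float = 0.3) -> list[float]:
    """Blend one new example into the identity centroid, then re-normalize
    so cosine similarity search against the index stays well-behaved."""
    blended = [(1 - alpha) * c + alpha * e for c, e in zip(centroid, new_embedding)]
    norm = math.sqrt(sum(v * v for v in blended)) or 1.0
    return [v / norm for v in blended]
```

Because each update is O(d) in the embedding dimension, the 95% incremental path avoids GPU time entirely; a full retrain is only scheduled when the thresholds in Appendix C are crossed.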

## Appendix B: Failure Mode Handling

| Failure | Detection | Mitigation |
|---------|-----------|------------|
| Training job fails | Job status = FAILED, alert sent | Auto-cleanup, no model deployed |
| Quality gate fails | Gate report = FAILED | Model held, admin notified, error logged |
| GPU OOM | Worker crash, timeout | Job retried with smaller batch size |
| Conflict detected | Synchronous on event submission | Training blocked, conflict ticket created |
| Drift detected | Periodic drift analysis | Retrain recommended, alert sent |
| Model deployment fails | Health check failure | Automatic rollback to previous version |
| Embedding index corruption | Consistency checker | Rebuild from dataset version |
| Database connection lost | Connection timeout | Retry with exponential backoff |

## Appendix C: Configuration Reference

```yaml
# training_config.yaml
training:
  learning_mode: "suggested"           # manual, suggested, approved_auto
  min_positive_examples: 5
  max_positive_examples: 500
  negative_positive_ratio: 0.5

  centroid_update:
    alpha: 0.3                         # Centroid adaptation rate
    auto_trigger_threshold: 10         # New examples to trigger incremental update

  scheduled_training:
    daily_time: "02:00"
    weekly_day: "sunday"
    enabled: true

  thresholds:
    full_retrain_examples: 1000
    full_retrain_identities: 50
    full_retrain_model_age_days: 30
    drift_retrain_accuracy_drop: 0.03
    consecutive_degraded_days: 3

quality_gates:
  min_precision: 0.97
  min_recall: 0.95
  min_f1_score: 0.96
  max_regression_pct: 0.02
  max_latency_p99_ms: 150

  known_identity:
    require_perfect_accuracy: true
    test_all_identities: true

gpu:
  pool:
    v100_count: 4
    a100_count: 2
    t4_count: 8
  auto_scale:
    enabled: true
    max_hourly_cost: 50.0
    max_queue_depth: 20

  job_timeouts:
    incremental_minutes: 30
    fine_tune_minutes: 120
    full_retrain_minutes: 480

embedding_index:
  type: "faiss"
  metric: "cosine"
  nlist: 100                           # IVF clusters
  nprobe: 10                           # Clusters to search
  hnsw_m: 16                           # HNSW connections
  ef_construction: 64
  ef_search: 32

retention:
  rejected_events_days: 30
  approved_events_years: 2
  dataset_versions_keep: 10
  training_images_years: 2
  model_versions_keep: 20

monitoring:
  metrics_interval_minutes: 5
  drift_check_interval_hours: 6
  alert_channels:
    - type: "slack"
      webhook_url: "${SLACK_WEBHOOK}"
    - type: "pagerduty"
      service_key: "${PAGERDUTY_KEY}"
```
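
The `quality_gates` section maps directly onto the week-8 gate checks: a candidate model is held unless every threshold passes. The evaluation helper below is an illustrative sketch; the dictionary keys follow the config above, but the function itself is not part of the spec.

```python
# Sketch of evaluating the quality_gates config against a candidate
# model's metrics. Key names follow training_config.yaml; the function
# is illustrative.
def evaluate_quality_gates(gates: dict, metrics: dict) -> dict[str, bool]:
    """Return per-gate pass/fail; deployment proceeds only if all gates pass."""
    return {
        "precision": metrics["precision"] >= gates["min_precision"],
        "recall": metrics["recall"] >= gates["min_recall"],
        "f1_score": metrics["f1_score"] >= gates["min_f1_score"],
        "regression": metrics["regression_pct"] <= gates["max_regression_pct"],
        "latency": metrics["latency_p99_ms"] <= gates["max_latency_p99_ms"],
    }
```

The per-gate result dictionary is what the `quality_gate_result` WebSocket event would carry in its `gate_results` payload field.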
