video_translate/docs/plans/2026-03-19-4-plus-1-subtitle-pipeline-design.md
Song367 · 85065cbca3 · Build multi-stage subtitle and dubbing pipeline · 2026-03-20 20:55:40 +08:00

# 4+1 Subtitle Pipeline Design
## Goal
Replace the current one-shot subtitle generation flow with a staged `4+1` pipeline so transcription fidelity, translation quality, and voice selection can be improved independently.
## Scope
- Redesign the backend subtitle generation pipeline around five explicit stages.
- Keep the current upload flow, async job flow, and editor entry points intact for the first implementation pass.
- Preserve the current final payload shape for the editor, while adding richer intermediate metadata for debugging and review.
- Keep Doubao and Gemini provider support, but stop asking one model call to do transcription, translation, and voice matching in a single response.
## Non-Goals
- Do not add a new human review UI in this step.
- Do not replace the current editor or dubbing UI.
- Do not require a new third-party ASR vendor for the first pass. Stage 1 can still use the current multimodal provider if its prompt and output contract are narrowed to transcription only.
## Problem Summary
The current pipeline asks a single provider call to:
- watch and listen to the video
- transcribe dialogue
- split subtitle segments
- translate to English
- translate to the TTS language
- infer speaker metadata
- select a voice id
This creates two major problems:
1. When the model mishears the original dialogue, every downstream field is wrong.
2. It is hard to tell whether a bad result came from transcription, segmentation, translation, or voice matching.
The new design fixes that by isolating each responsibility.
## Design
### Stage overview
The new pipeline contains four production stages plus one validation stage:
1. `Stage 1: Transcription`
2. `Stage 2: Segmentation`
3. `Stage 3: Translation`
4. `Stage 4: Voice Matching`
5. `Stage 5: Validation`
Each stage receives a narrow input contract and returns a narrow output contract. Later stages must never invent or overwrite core facts from earlier stages.
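The narrow-contract rule can be made explicit in code. The following is a minimal sketch only; `Stage` and `chain` are illustrative names, not the final API:

```ts
// Illustrative only: each stage is an async function with a narrow input/output contract.
type Stage<I, O> = (input: I, ctx: { requestId: string }) => Promise<O>;

// Stages compose left to right; a later stage only sees the previous stage's
// output, so it cannot overwrite facts it never received.
function chain<A, B, C>(first: Stage<A, B>, second: Stage<B, C>): Stage<A, C> {
  return async (input, ctx) => second(await first(input, ctx), ctx);
}
```

Because the composition is typed, a stage that tries to emit fields outside its contract fails at compile time rather than at review time.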
### Stage 1: Transcription
**Purpose**
Extract the source dialogue from the video as faithfully as possible.
**Input**
- local video path or remote `fileId`
- provider configuration
- request id
**Output**
```ts
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
```
**Rules**
- Only transcribe audible dialogue.
- Do not translate.
- Do not rewrite or polish.
- Prefer conservative output when unclear.
- Mark low-confidence segments with `needsReview`.
**Notes**
For the first pass, this stage can still call the current multimodal provider, but with a transcription-only prompt and schema. That gives the pipeline separation immediately without forcing a provider migration on day one.
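Narrowing the provider's output contract mostly means strict parsing on our side. A sketch of that boundary, where the review threshold and the default fallbacks are assumptions to tune, not decided values:

```ts
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  confidence?: number;
  needsReview?: boolean;
}

const REVIEW_THRESHOLD = 0.7; // assumed cutoff; tune against real transcripts

// Reject anything that is not a plain transcript row, and flag low-confidence
// rows instead of letting the model "repair" them downstream.
function parseTranscriptResponse(raw: unknown): TranscriptSegment[] {
  if (!Array.isArray(raw)) throw new Error("transcription output must be an array");
  return raw.map((row, i) => {
    const r = row as Partial<TranscriptSegment>;
    if (
      typeof r.originalText !== "string" ||
      typeof r.startTime !== "number" ||
      typeof r.endTime !== "number" ||
      r.endTime <= r.startTime
    ) {
      throw new Error(`invalid transcript segment at index ${i}`);
    }
    return {
      id: r.id ?? `seg-${i}`,
      startTime: r.startTime,
      endTime: r.endTime,
      originalText: r.originalText,
      speakerId: r.speakerId ?? "unknown",
      confidence: r.confidence,
      needsReview: r.confidence !== undefined && r.confidence < REVIEW_THRESHOLD,
    };
  });
}
```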
### Stage 2: Segmentation
**Purpose**
Turn raw transcript segments into subtitle-friendly chunks without changing meaning.
**Input**
- `TranscriptSegment[]`
**Output**
```ts
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
```
**Rules**
- May split or merge segments for readability.
- Must not paraphrase `originalText`.
- Must preserve chronological order and non-overlap.
- Should reuse existing normalization and sentence reconstruction helpers where possible.
**Notes**
This stage should absorb logic that is currently mixed between [subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts) and provider prompts.
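To make the "no paraphrasing" rule concrete, here is a minimal merge-only sketch; the duration and gap thresholds are assumed values, and splitting long segments (which needs timing interpolation) is left out:

```ts
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

const MIN_DURATION_S = 1.0; // assumed readability floor
const MAX_GAP_S = 0.3;      // assumed max silence to merge across

// Merge very short segments into the previous one when the speaker matches and
// the gap is small. Text is concatenated verbatim, never paraphrased.
function mergeShortSegments(segments: SegmentedSubtitle[]): SegmentedSubtitle[] {
  const out: SegmentedSubtitle[] = [];
  for (const seg of segments) {
    const prev = out[out.length - 1];
    const duration = seg.endTime - seg.startTime;
    if (
      prev &&
      duration < MIN_DURATION_S &&
      prev.speakerId === seg.speakerId &&
      seg.startTime - prev.endTime <= MAX_GAP_S
    ) {
      prev.endTime = seg.endTime;
      prev.originalText = `${prev.originalText} ${seg.originalText}`;
    } else {
      out.push({ ...seg });
    }
  }
  return out;
}
```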
### Stage 3: Translation
**Purpose**
Translate already-confirmed source dialogue into display subtitles and dubbing text.
**Input**
- `SegmentedSubtitle[]`
- subtitle language settings
- TTS language
**Output**
```ts
interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}
```
**Rules**
- `translatedText` is always English for on-screen subtitles.
- `ttsText` is always the requested TTS language.
- Translation must derive from `originalText` only.
- This stage must not edit timestamps or speaker identity.
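The timestamp and speaker rules can be enforced by construction rather than by review: the stage spreads every upstream field untouched and only adds new ones. A sketch, where `TranslateFn` is a hypothetical injected provider hook, not an existing function:

```ts
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}

// Hypothetical provider hook: translates one source line into one target language.
type TranslateFn = (text: string, targetLang: string) => Promise<string>;

// Copies every upstream field as-is and only adds translation fields, so
// timestamps and speaker identity cannot drift in this stage.
async function translateStage(
  segments: SegmentedSubtitle[],
  ttsLanguage: string,
  translate: TranslateFn,
): Promise<TranslatedSubtitle[]> {
  return Promise.all(
    segments.map(async (seg) => ({
      ...seg,
      translatedText: await translate(seg.originalText, "en"),
      ttsText: await translate(seg.originalText, ttsLanguage),
      ttsLanguage,
    })),
  );
}
```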
### Stage 4: Voice Matching
**Purpose**
Assign the most suitable `voiceId` to each subtitle segment.
**Input**
- `TranslatedSubtitle[]`
- available voices for `ttsLanguage`
**Output**
```ts
interface VoiceMatchedSubtitle extends TranslatedSubtitle {
  voiceId: string;
}
```
**Rules**
- Only select from the provided voice catalog.
- Use `speaker`, `gender`, and tone hints when available.
- Must not change transcript or translation fields.
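A deterministic baseline for this stage needs no model call at all: filter the catalog by gender when known, then rotate through the pool so distinct speakers get distinct voices. A sketch under those assumptions; the real stage may layer tone hints on top:

```ts
interface VoiceOption {
  voiceId: string;
  gender: "male" | "female" | "unknown";
}

interface Matchable {
  speakerId: string;
  gender?: "male" | "female" | "unknown";
}

// Assign one stable voice per speaker, never touching other fields.
function matchVoices<T extends Matchable>(
  segments: T[],
  catalog: VoiceOption[],
): (T & { voiceId: string })[] {
  if (catalog.length === 0) throw new Error("empty voice catalog");
  const assigned = new Map<string, string>();
  let cursor = 0;
  return segments.map((seg) => {
    let voiceId = assigned.get(seg.speakerId);
    if (!voiceId) {
      const pool = catalog.filter(
        (v) => !seg.gender || seg.gender === "unknown" || v.gender === seg.gender,
      );
      const pick = pool.length > 0 ? pool : catalog; // fallback: any catalog voice
      voiceId = pick[cursor++ % pick.length].voiceId;
      assigned.set(seg.speakerId, voiceId);
    }
    return { ...seg, voiceId };
  });
}
```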
### Stage 5: Validation
**Purpose**
Check internal consistency before returning the final result to the editor.
**Input**
- `VoiceMatchedSubtitle[]`
**Output**
```ts
interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}
```
**Rules**
- Do not rewrite content in this stage.
- Report warnings and errors separately.
- Only block final success on true contract failures, not soft quality warnings.
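A minimal read-only pass covering three of the issue codes might look like this; the severity assignments here are assumptions to confirm (e.g. whether an empty translation should block or only warn):

```ts
interface ValidationIssue {
  subtitleId: string;
  code: "timing_overlap" | "missing_tts_text" | "empty_translation";
  message: string;
  severity: "warning" | "error";
}

interface FinalSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  translatedText: string;
  ttsText: string;
}

// Reports issues but never mutates subtitles; the orchestrator decides what blocks.
function validate(subtitles: FinalSubtitle[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  subtitles.forEach((sub, i) => {
    const prev = subtitles[i - 1];
    if (prev && sub.startTime < prev.endTime) {
      issues.push({
        subtitleId: sub.id,
        code: "timing_overlap",
        message: `overlaps previous subtitle ${prev.id}`,
        severity: "error",
      });
    }
    if (sub.ttsText.trim() === "") {
      issues.push({
        subtitleId: sub.id,
        code: "missing_tts_text",
        message: "tts text is empty",
        severity: "error",
      });
    }
    if (sub.translatedText.trim() === "") {
      issues.push({
        subtitleId: sub.id,
        code: "empty_translation",
        message: "translated text is empty",
        severity: "warning", // assumed soft; may need to be an error
      });
    }
  });
  return issues;
}
```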
## Data Model
### Final subtitle shape
The editor should continue to receive `SubtitlePipelineResult`, but the result will now be built from staged outputs rather than a single provider response.
Existing final subtitle fields remain:
- `id`
- `startTime`
- `endTime`
- `originalText`
- `translatedText`
- `ttsText`
- `ttsLanguage`
- `speaker`
- `speakerId`
- `confidence`
- `voiceId`
### New metadata
Add optional pipeline metadata to the final result:
```ts
interface SubtitlePipelineDiagnostics {
  validationIssues?: ValidationIssue[];
  stageDurationsMs?: Partial<
    Record<
      'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation',
      number
    >
  >;
}
```
This metadata is primarily for logging, debugging, and future review UI. The editor does not need to block on it.
## Runtime Architecture
### New server modules
Add stage-focused modules under a new folder:
- `src/server/subtitleStages/transcriptionStage.ts`
- `src/server/subtitleStages/segmentationStage.ts`
- `src/server/subtitleStages/translationStage.ts`
- `src/server/subtitleStages/voiceMatchingStage.ts`
- `src/server/subtitleStages/validationStage.ts`
Add one orchestrator:
- `src/server/multiStageSubtitleGeneration.ts`
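The orchestrator's job is sequencing plus bookkeeping. One way to populate `stageDurationsMs` is a small timing wrapper around each stage call; `runStage` and the stage signature below are sketch-level assumptions, not the final API:

```ts
type StageName =
  | "transcription"
  | "segmentation"
  | "translation"
  | "voiceMatching"
  | "validation";

// Run one stage, recording its wall-clock duration into the diagnostics map.
// Duration is recorded even when the stage throws, so failed runs still show
// where time went.
async function runStage<I, O>(
  name: StageName,
  durations: Partial<Record<StageName, number>>,
  stage: (input: I) => Promise<O>,
  input: I,
): Promise<O> {
  const started = Date.now();
  try {
    return await stage(input);
  } finally {
    durations[name] = Date.now() - started;
  }
}
```

The orchestrator would call this five times in sequence, threading each stage's output into the next and returning the durations map alongside the final payload.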
### Existing modules to adapt
- [src/server/subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
  - stop calling the one-shot all-in-one generator
  - call the new orchestrator instead
- [src/server/videoSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/videoSubtitleGeneration.ts)
  - shrink into a stage-specific transcription helper, or split its reusable provider code into lower-level helpers
- [src/server/subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts)
  - reuse its normalization logic in Stage 2 and in final payload assembly
- [src/types.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/types.ts)
  - add stage-specific types and validation metadata
- [server.ts](/E:/Downloads/ai-video-dubbing-&-translation/server.ts)
  - optionally expose finer-grained async job progress messages
## Job Progress
The async job API should keep the same public contract but expose more precise stage labels:
- `transcribing`
- `segmenting`
- `translating`
- `matching_voice`
- `validating`
This can be done either by extending `SubtitleJobStage` or mapping internal stage names back to the existing progress system.
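If `SubtitleJobStage` is not extended, the mapping option is a plain lookup table. The public label names below are placeholders, not the actual `SubtitleJobStage` values:

```ts
// Assumed existing public labels; replace with the real SubtitleJobStage values.
type PublicStage = "processing" | "finalizing";

type InternalStage =
  | "transcribing"
  | "segmenting"
  | "translating"
  | "matching_voice"
  | "validating";

// Collapse the five internal stages onto the existing coarse-grained labels so
// current clients keep working while logs retain the internal stage name.
const STAGE_LABELS: Record<InternalStage, PublicStage> = {
  transcribing: "processing",
  segmenting: "processing",
  translating: "processing",
  matching_voice: "processing",
  validating: "finalizing",
};
```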
## Error Handling
### Hard failures
- unreadable video input
- provider request failure
- invalid stage output schema
- no subtitles returned after transcription
- no valid `voiceId` in Stage 4
These should fail the async job.
### Soft warnings
- low transcript confidence
- suspected mistranscription
- missing speaker gender
- fallback default voice applied
These should be attached to diagnostics and surfaced later, but should not block the result.
## Testing
### Unit tests
- Stage 1 prompt and parsing tests
- Stage 2 segmentation behavior tests
- Stage 3 translation contract tests
- Stage 4 voice catalog matching tests
- Stage 5 validation issue generation tests
### Integration tests
- full orchestration success path
- transcription low-confidence warning path
- translation failure path
- voice mismatch validation path
- async job progress updates across all five stages
## Rollout Strategy
### Phase 1
- Add stage contracts and orchestrator behind the existing `/generate-subtitles` entry point.
- Keep the final API shape stable.
### Phase 2
- Store validation issues in the result payload.
- Surface review warnings in the editor.
### Phase 3
- Optionally swap Stage 1 to a stronger dedicated ASR service without changing translation, voice matching, or editor code.
## Recommendation
Implement the 4+1 pipeline in the backend first, while keeping the current frontend contract stable. That gives immediate gains in debuggability and transcription discipline, and it creates a clean seam for future ASR upgrades.