video_translate/docs/plans/2026-03-19-4-plus-1-subtitle-pipeline-design.md

4+1 Subtitle Pipeline Design

Goal

Replace the current one-shot subtitle generation flow with a staged 4+1 pipeline so transcription fidelity, translation quality, and voice selection can be improved independently.

Scope

  • Redesign the backend subtitle generation pipeline around five explicit stages.
  • Keep the current upload flow, async job flow, and editor entry points intact for the first implementation pass.
  • Preserve the current final payload shape for the editor, while adding richer intermediate metadata for debugging and review.
  • Keep Doubao and Gemini provider support, but stop asking one model call to do transcription, translation, and voice matching in a single response.

Non-Goals

  • Do not add a new human review UI in this step.
  • Do not replace the current editor or dubbing UI.
  • Do not require a new third-party ASR vendor for the first pass. Stage 1 can still use the current multimodal provider if its prompt and output contract are narrowed to transcription only.

Problem Summary

The current pipeline asks a single provider call to:

  • watch and listen to the video
  • transcribe dialogue
  • split subtitle segments
  • translate to English
  • translate to the TTS language
  • infer speaker metadata
  • select a voice id

This creates two major problems:

  1. When the model mishears the original dialogue, every downstream field is wrong.
  2. It is hard to tell whether a bad result came from transcription, segmentation, translation, or voice matching.

The new design addresses both problems by isolating each responsibility into its own stage.

Design

Stage overview

The new pipeline contains four production stages plus one validation stage:

  1. Stage 1: Transcription
  2. Stage 2: Segmentation
  3. Stage 3: Translation
  4. Stage 4: Voice Matching
  5. Stage 5: Validation

Each stage receives a narrow input contract and returns a narrow output contract. Later stages must never invent or overwrite core facts from earlier stages.

Stage 1: Transcription

Purpose

Extract the source dialogue from the video as faithfully as possible.

Input

  • local video path or remote fileId
  • provider configuration
  • request id

Output

interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}

Rules

  • Only transcribe audible dialogue.
  • Do not translate.
  • Do not rewrite or polish.
  • Prefer conservative output when unclear.
  • Mark low-confidence segments with needsReview.

Notes

For the first pass, this stage can still call the current multimodal provider, but with a transcription-only prompt and schema. That gives the pipeline separation immediately without forcing a provider migration on day one.
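
The transcription-only contract above can be enforced at the parsing boundary. The sketch below is illustrative, not the project's actual parser: the function name and the confidence threshold are assumptions, but it shows how raw provider output could be validated into TranscriptSegment and how low-confidence segments get flagged per the rules.

```typescript
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  confidence?: number;
  needsReview?: boolean;
}

// Assumption: threshold would be tuned per provider.
const LOW_CONFIDENCE_THRESHOLD = 0.7;

function parseTranscriptSegment(raw: any, index: number): TranscriptSegment {
  // Reject output that violates the transcription-only schema.
  if (typeof raw.originalText !== "string" || raw.originalText.trim() === "") {
    throw new Error(`segment ${index}: missing originalText`);
  }
  if (
    typeof raw.startTime !== "number" ||
    typeof raw.endTime !== "number" ||
    raw.endTime <= raw.startTime
  ) {
    throw new Error(`segment ${index}: invalid timing`);
  }
  const confidence = typeof raw.confidence === "number" ? raw.confidence : undefined;
  return {
    id: raw.id ?? `seg-${index}`,
    startTime: raw.startTime,
    endTime: raw.endTime,
    originalText: raw.originalText.trim(),
    speakerId: raw.speakerId ?? "speaker-1",
    confidence,
    // Per the stage rules: mark low-confidence segments with needsReview.
    needsReview: confidence !== undefined && confidence < LOW_CONFIDENCE_THRESHOLD,
  };
}
```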

Stage 2: Segmentation

Purpose

Turn raw transcript segments into subtitle-friendly chunks without changing meaning.

Input

  • TranscriptSegment[]

Output

interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}

Rules

  • May split or merge segments for readability.
  • Must not paraphrase originalText.
  • Must preserve chronological order and non-overlap.
  • Should reuse existing normalization and sentence reconstruction helpers where possible.

Notes

This stage should absorb logic that is currently mixed between subtitlePipeline.ts and provider prompts.
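
One concrete segmentation policy consistent with the rules above is merging adjacent same-speaker fragments when the gap is short and the combined line stays readable. This is a sketch under assumed thresholds (a 42-character line limit, a 0.3 s gap), not the project's actual normalization helpers:

```typescript
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

const MAX_CHARS = 42;        // assumption: one-line subtitle limit
const MAX_GAP_SECONDS = 0.3; // assumption: pause that still reads as one line

function mergeShortSegments(segments: SegmentedSubtitle[]): SegmentedSubtitle[] {
  const out: SegmentedSubtitle[] = [];
  for (const seg of segments) {
    const prev = out[out.length - 1];
    const canMerge =
      prev !== undefined &&
      prev.speakerId === seg.speakerId &&
      seg.startTime - prev.endTime <= MAX_GAP_SECONDS &&
      (prev.originalText + " " + seg.originalText).length <= MAX_CHARS;
    if (canMerge) {
      // Concatenate verbatim: the rules forbid paraphrasing originalText.
      prev.originalText = prev.originalText + " " + seg.originalText;
      prev.endTime = seg.endTime;
    } else {
      out.push({ ...seg });
    }
  }
  return out;
}
```

Because merging only concatenates and extends endTime, chronological order and non-overlap are preserved by construction.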

Stage 3: Translation

Purpose

Translate already-confirmed source dialogue into display subtitles and dubbing text.

Input

  • SegmentedSubtitle[]
  • subtitle language settings
  • TTS language

Output

interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}

Rules

  • translatedText is always English and is used for on-screen subtitles.
  • ttsText is always in the requested TTS language.
  • Translation must derive from originalText only.
  • This stage must not edit timestamps or speaker identity.
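
A minimal sketch of this contract, with the translator injected as a callback so the stage itself cannot touch timing or speaker fields. In production the translation call would be async and provider-backed; it is kept synchronous here for brevity, and the function names are assumptions:

```typescript
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}

type Translate = (text: string, targetLanguage: string) => string;

function translateStage(
  segments: SegmentedSubtitle[],
  ttsLanguage: string,
  translate: Translate,
): TranslatedSubtitle[] {
  return segments.map((seg) => ({
    ...seg, // timestamps and speakerId pass through unchanged
    // Rule: both outputs derive from originalText only.
    translatedText: translate(seg.originalText, "en"),
    ttsText: translate(seg.originalText, ttsLanguage),
    ttsLanguage,
  }));
}
```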

Stage 4: Voice Matching

Purpose

Assign the most suitable voiceId to each subtitle segment.

Input

  • TranslatedSubtitle[]
  • available voices for ttsLanguage

Output

interface VoiceMatchedSubtitle extends TranslatedSubtitle {
  voiceId: string;
}

Rules

  • Only select from the provided voice catalog.
  • Use speaker, gender, and tone hints when available.
  • Must not change transcript or translation fields.
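
A possible matcher under these rules: filter the catalog to the segment's ttsLanguage, prefer a gender match, and fall back to the first language-compatible voice. The Voice shape and the fallback policy are assumptions for this sketch:

```typescript
interface Voice {
  voiceId: string;
  language: string;
  gender: "male" | "female";
}

interface SubtitleForMatching {
  id: string;
  ttsLanguage: string;
  gender?: "male" | "female" | "unknown";
}

function matchVoice(sub: SubtitleForMatching, catalog: Voice[]): string {
  // Rule: only select from the provided voice catalog.
  const sameLanguage = catalog.filter((v) => v.language === sub.ttsLanguage);
  if (sameLanguage.length === 0) {
    // A hard failure per the error-handling section: no valid voiceId.
    throw new Error(`no voice available for language ${sub.ttsLanguage}`);
  }
  // Use the gender hint when available; otherwise fall back.
  const genderMatch = sameLanguage.find((v) => v.gender === sub.gender);
  return (genderMatch ?? sameLanguage[0]).voiceId;
}
```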

Stage 5: Validation

Purpose

Check internal consistency before returning the final result to the editor.

Input

  • VoiceMatchedSubtitle[]

Output

interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}

Rules

  • Do not rewrite content in this stage.
  • Report warnings and errors separately.
  • Only block final success on true contract failures, not soft quality warnings.
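
A validator covering a subset of the issue codes above (empty_translation, missing_tts_text, timing_overlap) might look like this; it reports issues without rewriting content, and it assumes input is already sorted by startTime:

```typescript
interface ValidationIssue {
  subtitleId: string;
  code: "timing_overlap" | "missing_tts_text" | "empty_translation";
  message: string;
  severity: "warning" | "error";
}

interface FinalSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  translatedText: string;
  ttsText: string;
}

// Assumption: subs arrive sorted by startTime (Stage 2 guarantees order).
function validateSubtitles(subs: FinalSubtitle[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (let i = 0; i < subs.length; i++) {
    const s = subs[i];
    if (s.translatedText.trim() === "") {
      issues.push({ subtitleId: s.id, code: "empty_translation",
        message: "translatedText is empty", severity: "error" });
    }
    if (s.ttsText.trim() === "") {
      issues.push({ subtitleId: s.id, code: "missing_tts_text",
        message: "ttsText is empty", severity: "error" });
    }
    if (i > 0 && s.startTime < subs[i - 1].endTime) {
      issues.push({ subtitleId: s.id, code: "timing_overlap",
        message: `overlaps previous subtitle ${subs[i - 1].id}`, severity: "warning" });
    }
  }
  return issues;
}
```

Splitting severities this way lets the orchestrator block only on errors while attaching warnings to diagnostics.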

Data Model

Final subtitle shape

The editor should continue to receive SubtitlePipelineResult, but the result will now be built from staged outputs rather than a single provider response.

Existing final subtitle fields remain:

  • id
  • startTime
  • endTime
  • originalText
  • translatedText
  • ttsText
  • ttsLanguage
  • speaker
  • speakerId
  • confidence
  • voiceId

New metadata

Add optional pipeline metadata to the final result:

interface SubtitlePipelineDiagnostics {
  validationIssues?: ValidationIssue[];
  stageDurationsMs?: Partial<Record<'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation', number>>;
}

This metadata is primarily for logging, debugging, and future review UI. The editor does not need to block on it.

Runtime Architecture

New server modules

Add stage-focused modules under a new folder:

  • src/server/subtitleStages/transcriptionStage.ts
  • src/server/subtitleStages/segmentationStage.ts
  • src/server/subtitleStages/translationStage.ts
  • src/server/subtitleStages/voiceMatchingStage.ts
  • src/server/subtitleStages/validationStage.ts

Add one orchestrator:

  • src/server/multiStageSubtitleGeneration.ts
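
The orchestrator's core loop can be quite small: run the stages in order, pass each output to the next input, and record per-stage durations for SubtitlePipelineDiagnostics. This skeleton is a sketch — the stage bodies are injected, and the loose typing is a deliberate simplification:

```typescript
type StageName =
  | "transcription"
  | "segmentation"
  | "translation"
  | "voiceMatching"
  | "validation";

interface StageResult {
  output: unknown;
  stageDurationsMs: Partial<Record<StageName, number>>;
}

function runStages(
  input: unknown,
  stages: Array<[StageName, (value: any) => any]>,
): StageResult {
  const stageDurationsMs: Partial<Record<StageName, number>> = {};
  let value: any = input;
  for (const [name, stage] of stages) {
    const started = Date.now();
    // An exception here is a hard failure: it fails the async job.
    value = stage(value);
    stageDurationsMs[name] = Date.now() - started;
  }
  return { output: value, stageDurationsMs };
}
```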

Existing modules to adapt

Job Progress

The async job API should keep the same public contract but expose more precise stage labels:

  • transcribing
  • segmenting
  • translating
  • matching_voice
  • validating

This can be done either by extending SubtitleJobStage or by mapping internal stage names onto the existing progress system.

Error Handling

Hard failures

  • unreadable video input
  • provider request failure
  • invalid stage output schema
  • no subtitles returned after transcription
  • no valid voiceId in Stage 4

These should fail the async job.

Soft warnings

  • low transcript confidence
  • suspected mistranscription
  • missing speaker gender
  • fallback default voice applied

These should be attached to diagnostics and surfaced later, but should not block the result.
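
One way to encode this split is a type-level divide between failure and warning codes, so the orchestrator can decide whether to fail the job or attach to diagnostics. The code names below echo the two lists above but are assumptions, not the project's actual identifiers:

```typescript
type FailureCode =
  | "unreadable_input"
  | "provider_error"
  | "invalid_stage_output"
  | "empty_transcript"
  | "no_voice_match";

type WarningCode =
  | "low_confidence"
  | "suspected_mistranscription"
  | "missing_gender"
  | "fallback_voice";

// Hard failures fail the async job; warnings go into diagnostics.
function isHardFailure(code: FailureCode | WarningCode): code is FailureCode {
  const hard: ReadonlySet<string> = new Set([
    "unreadable_input",
    "provider_error",
    "invalid_stage_output",
    "empty_transcript",
    "no_voice_match",
  ]);
  return hard.has(code);
}
```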

Testing

Unit tests

  • Stage 1 prompt and parsing tests
  • Stage 2 segmentation behavior tests
  • Stage 3 translation contract tests
  • Stage 4 voice catalog matching tests
  • Stage 5 validation issue generation tests

Integration tests

  • full orchestration success path
  • transcription low-confidence warning path
  • translation failure path
  • voice mismatch validation path
  • async job progress updates across all five stages

Rollout Strategy

Phase 1

  • Add stage contracts and orchestrator behind the existing /generate-subtitles entry point.
  • Keep the final API shape stable.

Phase 2

  • Store validation issues in the result payload.
  • Surface review warnings in the editor.

Phase 3

  • Optionally swap Stage 1 to a stronger dedicated ASR service without changing translation, voice matching, or editor code.

Recommendation

Implement the 4+1 pipeline in the backend first, while keeping the current frontend contract stable. That gives immediate gains in debuggability and transcription discipline, and it creates a clean seam for future ASR upgrades.