# 4+1 Subtitle Pipeline Design

## Goal

Replace the current one-shot subtitle generation flow with a staged `4+1` pipeline so that transcription fidelity, translation quality, and voice selection can be improved independently.

## Scope

- Redesign the backend subtitle generation pipeline around five explicit stages.
- Keep the current upload flow, async job flow, and editor entry points intact for the first implementation pass.
- Preserve the current final payload shape for the editor, while adding richer intermediate metadata for debugging and review.
- Keep Doubao and Gemini provider support, but stop asking one model call to do transcription, translation, and voice matching in a single response.

## Non-Goals

- Do not add a new human review UI in this step.
- Do not replace the current editor or dubbing UI.
- Do not require a new third-party ASR vendor for the first pass. Stage 1 can still use the current multimodal provider if its prompt and output contract are narrowed to transcription only.

## Problem Summary

The current pipeline asks a single provider call to:

- watch and listen to the video
- transcribe dialogue
- split subtitle segments
- translate to English
- translate to the TTS language
- infer speaker metadata
- select a voice id

This creates two major problems:

1. When the model mishears the original dialogue, every downstream field is wrong.
2. It is hard to tell whether a bad result came from transcription, segmentation, translation, or voice matching.

The new design fixes this by isolating each responsibility.

## Design

### Stage overview

The new pipeline contains four production stages plus one validation stage:

1. `Stage 1: Transcription`
2. `Stage 2: Segmentation`
3. `Stage 3: Translation`
4. `Stage 4: Voice Matching`
5. `Stage 5: Validation`

Each stage receives a narrow input contract and returns a narrow output contract. Later stages must never invent or overwrite core facts from earlier stages.
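Concretely, each stage can be expressed as a small async function, and the orchestrator is nothing more than sequential composition: a stage only ever sees the previous stage's output, so it physically cannot rewrite facts it never receives. The sketch below illustrates that seam with two toy stages; all type, function, and field names here are illustrative stubs, not the final module API.

```typescript
// Illustrative narrow contracts (subset of the real stage types below).
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

interface TranslatedSegment extends TranscriptSegment {
  translatedText: string;
}

type Stage<I, O> = (input: I) => Promise<O>;

// Stub Stage 1: in the real pipeline this calls the provider with a
// transcription-only prompt; here it returns a canned segment.
const transcribe: Stage<string, TranscriptSegment[]> = async (_videoPath) => [
  { id: 's1', startTime: 0, endTime: 1.2, originalText: '你好', speakerId: 'spk1' },
];

// Stub Stage 3: derives translatedText from originalText only, and copies
// all earlier-stage fields through unchanged.
const translate: Stage<TranscriptSegment[], TranslatedSegment[]> = async (segments) =>
  segments.map((s) => ({ ...s, translatedText: 'Hello' }));

// Orchestrator seam: plain sequential composition of narrow contracts.
async function runPipeline(videoPath: string): Promise<TranslatedSegment[]> {
  const transcript = await transcribe(videoPath);
  return translate(transcript);
}
```

Because each stage is an ordinary async function, unit tests can exercise one stage in isolation with fixture input, which is exactly the debuggability the staged design is after.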
### Stage 1: Transcription

**Purpose**

Extract the source dialogue from the video as faithfully as possible.

**Input**

- local video path or remote `fileId`
- provider configuration
- request id

**Output**

```ts
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
```

**Rules**

- Only transcribe audible dialogue.
- Do not translate.
- Do not rewrite or polish.
- Prefer conservative output when unclear.
- Mark low-confidence segments with `needsReview`.

**Notes**

For the first pass, this stage can still call the current multimodal provider, but with a transcription-only prompt and schema. That gives the pipeline separation immediately without forcing a provider migration on day one.

### Stage 2: Segmentation

**Purpose**

Turn raw transcript segments into subtitle-friendly chunks without changing their meaning.

**Input**

- `TranscriptSegment[]`

**Output**

```ts
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
```

**Rules**

- May split or merge segments for readability.
- Must not paraphrase `originalText`.
- Must preserve chronological order and non-overlap.
- Should reuse existing normalization and sentence reconstruction helpers where possible.

**Notes**

This stage should absorb logic that is currently mixed between [subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts) and provider prompts.

### Stage 3: Translation

**Purpose**

Translate already-confirmed source dialogue into display subtitles and dubbing text.
**Input**

- `SegmentedSubtitle[]`
- subtitle language settings
- TTS language

**Output**

```ts
interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}
```

**Rules**

- `translatedText` is always English, for on-screen subtitles.
- `ttsText` is always in the requested TTS language.
- Translation must derive from `originalText` only.
- This stage must not edit timestamps or speaker identity.

### Stage 4: Voice Matching

**Purpose**

Assign the most suitable `voiceId` to each subtitle segment.

**Input**

- `TranslatedSubtitle[]`
- available voices for `ttsLanguage`

**Output**

```ts
interface VoiceMatchedSubtitle extends TranslatedSubtitle {
  voiceId: string;
}
```

**Rules**

- Only select from the provided voice catalog.
- Use `speaker`, `gender`, and tone hints when available.
- Must not change transcript or translation fields.

### Stage 5: Validation

**Purpose**

Check internal consistency before returning the final result to the editor.

**Input**

- `VoiceMatchedSubtitle[]`

**Output**

```ts
interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}
```

**Rules**

- Do not rewrite content in this stage.
- Report warnings and errors separately.
- Only block final success on true contract failures, not soft quality warnings.

## Data Model

### Final subtitle shape

The editor should continue to receive `SubtitlePipelineResult`, but the result will now be built from staged outputs rather than a single provider response.
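That assembly step can be a pure function over the Stage 4 and Stage 5 outputs: nothing is rewritten, only packaged. A minimal sketch, assuming `SubtitlePipelineResult` wraps a subtitle array plus optional diagnostics (the wrapper shape and the narrowed types here are illustrative; adjust to the real type in `src/types.ts`):

```typescript
// Narrowed stand-ins for the staged output types.
interface VoiceMatchedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
  speakerId: string;
  voiceId: string;
}

interface ValidationIssue {
  subtitleId: string;
  code: string;
  message: string;
  severity: 'warning' | 'error';
}

// Assumed wrapper shape for illustration only.
interface SubtitlePipelineResult {
  subtitles: VoiceMatchedSubtitle[];
  diagnostics?: { validationIssues?: ValidationIssue[] };
}

// Pure assembly: staged outputs are packaged, never edited, so any bad
// field can be traced back to exactly one stage.
function assembleResult(
  subtitles: VoiceMatchedSubtitle[],
  issues: ValidationIssue[],
): SubtitlePipelineResult {
  return {
    subtitles,
    diagnostics: issues.length > 0 ? { validationIssues: issues } : undefined,
  };
}
```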
Existing final subtitle fields remain:

- `id`
- `startTime`
- `endTime`
- `originalText`
- `translatedText`
- `ttsText`
- `ttsLanguage`
- `speaker`
- `speakerId`
- `confidence`
- `voiceId`

### New metadata

Add optional pipeline metadata to the final result:

```ts
interface SubtitlePipelineDiagnostics {
  validationIssues?: ValidationIssue[];
  // Keyed by stage name, e.g. 'transcription' or 'validation'.
  stageDurationsMs?: Partial<Record<string, number>>;
}
```

This metadata is primarily for logging, debugging, and a future review UI. The editor does not need to block on it.

## Runtime Architecture

### New server modules

Add stage-focused modules under a new folder:

- `src/server/subtitleStages/transcriptionStage.ts`
- `src/server/subtitleStages/segmentationStage.ts`
- `src/server/subtitleStages/translationStage.ts`
- `src/server/subtitleStages/voiceMatchingStage.ts`
- `src/server/subtitleStages/validationStage.ts`

Add one orchestrator:

- `src/server/multiStageSubtitleGeneration.ts`

### Existing modules to adapt

- [src/server/subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
  - stop calling one all-in-one generator
  - call the new orchestrator instead
- [src/server/videoSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/videoSubtitleGeneration.ts)
  - shrink into a stage-specific transcription helper, or split its reusable provider code into lower-level helpers
- [src/server/subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts)
  - reuse normalization logic in Stage 2 and final payload assembly
- [src/types.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/types.ts)
  - add stage-specific types and validation metadata
- [server.ts](/E:/Downloads/ai-video-dubbing-&-translation/server.ts)
  - optionally expose finer-grained async job progress messages

## Job Progress

The async job API should keep the same public contract but expose more precise stage labels:

- `transcribing`
- `segmenting`
- `translating`
- `matching_voice`
- `validating`

This can be
done either by extending `SubtitleJobStage` or by mapping internal stage names back to the existing progress system.

## Error Handling

### Hard failures

- unreadable video input
- provider request failure
- invalid stage output schema
- no subtitles returned after transcription
- no valid `voiceId` in Stage 4

These should fail the async job.

### Soft warnings

- low transcript confidence
- suspected mistranscription
- missing speaker gender
- fallback default voice applied

These should be attached to diagnostics and surfaced later, but should not block the result.

## Testing

### Unit tests

- Stage 1 prompt and parsing tests
- Stage 2 segmentation behavior tests
- Stage 3 translation contract tests
- Stage 4 voice catalog matching tests
- Stage 5 validation issue generation tests

### Integration tests

- full orchestration success path
- transcription low-confidence warning path
- translation failure path
- voice mismatch validation path
- async job progress updates across all five stages

## Rollout Strategy

### Phase 1

- Add stage contracts and the orchestrator behind the existing `/generate-subtitles` entry point.
- Keep the final API shape stable.

### Phase 2

- Store validation issues in the result payload.
- Surface review warnings in the editor.

### Phase 3

- Optionally swap Stage 1 to a stronger dedicated ASR service without changing translation, voice matching, or editor code.

## Recommendation

Implement the 4+1 pipeline in the backend first, while keeping the current frontend contract stable. That gives immediate gains in debuggability and transcription discipline, and it creates a clean seam for future ASR upgrades.
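As a closing illustration of the hard-failure versus soft-warning split described under Error Handling, a minimal Stage 5 pass might look like the sketch below. It reports issues without rewriting any content, and only `error`-severity issues block the job; the specific severity assigned to each code here is an assumption, not a final policy.

```typescript
interface ValidationIssue {
  subtitleId: string;
  code: string;
  message: string;
  severity: 'warning' | 'error';
}

// Narrowed stand-in for the Stage 4 output fed into validation.
interface FinalSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  translatedText: string;
  ttsText: string;
}

// Sketch of Stage 5: inspect, never mutate. Severity classification per
// code is illustrative.
function validate(subtitles: FinalSubtitle[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  let prevEnd = -Infinity;
  for (const s of subtitles) {
    if (s.startTime < prevEnd) {
      issues.push({
        subtitleId: s.id,
        code: 'timing_overlap',
        message: 'segment overlaps the previous one',
        severity: 'error', // contract failure: timelines must not overlap
      });
    }
    prevEnd = s.endTime;
    if (s.ttsText.trim() === '') {
      issues.push({
        subtitleId: s.id,
        code: 'missing_tts_text',
        message: 'no TTS text for dubbing',
        severity: 'error', // contract failure: dubbing needs text
      });
    }
    if (s.translatedText.trim() === '') {
      issues.push({
        subtitleId: s.id,
        code: 'empty_translation',
        message: 'empty display translation',
        severity: 'warning', // soft quality issue: surfaced, not blocking
      });
    }
  }
  return issues;
}

// Only true contract failures fail the async job.
const shouldFailJob = (issues: ValidationIssue[]): boolean =>
  issues.some((i) => i.severity === 'error');
```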