# 4+1 Subtitle Pipeline Design

## Goal

Replace the current one-shot subtitle generation flow with a staged `4+1` pipeline so transcription fidelity, translation quality, and voice selection can be improved independently.

## Scope

- Redesign the backend subtitle generation pipeline around five explicit stages.
- Keep the current upload flow, async job flow, and editor entry points intact for the first implementation pass.
- Preserve the current final payload shape for the editor, while adding richer intermediate metadata for debugging and review.
- Keep Doubao and Gemini provider support, but stop asking one model call to do transcription, translation, and voice matching in a single response.

## Non-Goals

- Do not add a new human review UI in this step.
- Do not replace the current editor or dubbing UI.
- Do not require a new third-party ASR vendor for the first pass. Stage 1 can still use the current multimodal provider if its prompt and output contract are narrowed to transcription only.

## Problem Summary

The current pipeline asks a single provider call to:

- watch and listen to the video
- transcribe dialogue
- split subtitle segments
- translate to English
- translate to the TTS language
- infer speaker metadata
- select a voice id

This creates two major problems:

1. When the model mishears the original dialogue, every downstream field is wrong.
2. It is hard to tell whether a bad result came from transcription, segmentation, translation, or voice matching.

The new design fixes that by isolating each responsibility.

## Design

### Stage overview

The new pipeline contains four production stages plus one validation stage:

1. `Stage 1: Transcription`
2. `Stage 2: Segmentation`
3. `Stage 3: Translation`
4. `Stage 4: Voice Matching`
5. `Stage 5: Validation`

Each stage receives a narrow input contract and returns a narrow output contract. Later stages must never invent or overwrite core facts from earlier stages.

### Stage 1: Transcription

**Purpose**

Extract the source dialogue from the video as faithfully as possible.

**Input**

- local video path or remote `fileId`
- provider configuration
- request id

**Output**

```ts
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
```

**Rules**

- Only transcribe audible dialogue.
- Do not translate.
- Do not rewrite or polish.
- Prefer conservative output when unclear.
- Mark low-confidence segments with `needsReview`.

**Notes**

For the first pass, this stage can still call the current multimodal provider, but with a transcription-only prompt and schema. That gives the pipeline separation immediately without forcing a provider migration on day one.
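
The `needsReview` rule above could be applied as a small post-processing step on the provider output. The sketch below is only illustrative: the function name `flagLowConfidence`, the `0.6` threshold, and the choice to flag segments with no confidence score are assumptions, not part of the current codebase.

```typescript
// Minimal subset of TranscriptSegment needed for this sketch.
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  confidence?: number;
  needsReview?: boolean;
}

// Hypothetical threshold; it would need tuning against real provider output.
const REVIEW_CONFIDENCE_THRESHOLD = 0.6;

function flagLowConfidence(
  segments: TranscriptSegment[],
  threshold: number = REVIEW_CONFIDENCE_THRESHOLD,
): TranscriptSegment[] {
  return segments.map((segment) => {
    // Assumption: a missing score is treated as unknown and flagged (conservative).
    const lowConfidence =
      segment.confidence === undefined || segment.confidence < threshold;
    // Never clear a flag the provider already set; only add to it.
    return { ...segment, needsReview: segment.needsReview === true || lowConfidence };
  });
}
```

The function returns new objects rather than mutating its input, so the raw provider output stays available for logging.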

### Stage 2: Segmentation

**Purpose**

Turn raw transcript segments into subtitle-friendly chunks without changing meaning.

**Input**

- `TranscriptSegment[]`

**Output**

```ts
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
```

**Rules**

- May split or merge segments for readability.
- Must not paraphrase `originalText`.
- Must preserve chronological order and non-overlap.
- Should reuse existing normalization and sentence reconstruction helpers where possible.

**Notes**

This stage should absorb logic that is currently mixed between [subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts) and provider prompts.
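
The ordering and non-overlap rule for this stage lends itself to a mechanical check after splitting or merging. The following is a minimal sketch; `findTimingViolations` and the `TimedSegment` shape are hypothetical names used only for illustration.

```typescript
// Only the fields needed to check timing.
interface TimedSegment {
  id: string;
  startTime: number;
  endTime: number;
}

// Returns a human-readable problem per violation; an empty array means the
// segments are chronologically ordered and do not overlap.
function findTimingViolations(segments: TimedSegment[]): string[] {
  const problems: string[] = [];
  let previous: TimedSegment | undefined;
  for (const current of segments) {
    if (current.endTime <= current.startTime) {
      problems.push(`${current.id}: endTime must be greater than startTime`);
    }
    if (previous !== undefined) {
      if (current.startTime < previous.startTime) {
        problems.push(`${current.id}: starts before ${previous.id} (out of order)`);
      } else if (current.startTime < previous.endTime) {
        problems.push(`${current.id}: overlaps ${previous.id}`);
      }
    }
    previous = current;
  }
  return problems;
}
```

A segment that starts exactly when the previous one ends is accepted here; whether back-to-back subtitles need a minimum gap is left open.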

### Stage 3: Translation

**Purpose**

Translate already-confirmed source dialogue into display subtitles and dubbing text.

**Input**

- `SegmentedSubtitle[]`
- subtitle language settings
- TTS language

**Output**

```ts
interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}
```

**Rules**

- `translatedText` is always in English, for on-screen subtitles.
- `ttsText` is always in the requested TTS language.
- Translation must derive from `originalText` only.
- This stage must not edit timestamps or speaker identity.
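
One way to enforce the rule that translation leaves timestamps and speaker identity alone is to compare the stage input and output field by field. This is a sketch under assumptions: `assertCoreFieldsPreserved` and `CORE_KEYS` are hypothetical names, and it assumes the translation stage returns exactly one output per input in the same order.

```typescript
// Fields that Stage 3 must pass through unchanged.
interface CoreFields {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

const CORE_KEYS = ['id', 'startTime', 'endTime', 'originalText', 'speakerId'] as const;

// Throws if the stage output dropped, added, or altered any core field.
function assertCoreFieldsPreserved(input: CoreFields[], output: CoreFields[]): void {
  if (input.length !== output.length) {
    throw new Error(`Expected ${input.length} subtitles, got ${output.length}`);
  }
  input.forEach((before, index) => {
    const after = output[index];
    if (after === undefined) {
      throw new Error(`Missing output for subtitle ${before.id}`);
    }
    for (const key of CORE_KEYS) {
      if (before[key] !== after[key]) {
        throw new Error(`Translation stage changed ${key} on subtitle ${before.id}`);
      }
    }
  });
}
```

The same guard could run after Stage 4, since voice matching carries the same restriction on transcript and translation fields.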

### Stage 4: Voice Matching

**Purpose**

Assign the most suitable `voiceId` to each subtitle segment.

**Input**

- `TranslatedSubtitle[]`
- available voices for `ttsLanguage`

**Output**

```ts
interface VoiceMatchedSubtitle extends TranslatedSubtitle {
  voiceId: string;
}
```

**Rules**

- Only select from the provided voice catalog.
- Use `speaker`, `gender`, and tone hints when available.
- Must not change transcript or translation fields.
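
A catalog-only selection with a gender hint might look like the sketch below. The `VoiceOption` shape and the `pickVoiceId` function are hypothetical, since this document does not define the voice catalog format, and tone hints are left out for brevity.

```typescript
type Gender = 'male' | 'female' | 'unknown';

// Hypothetical catalog entry; the real catalog shape is not specified here.
interface VoiceOption {
  voiceId: string;
  language: string;
  gender: Gender;
}

interface VoiceRequest {
  ttsLanguage: string;
  gender?: Gender;
}

// Returns a voiceId from the catalog, or undefined when the catalog has no
// voice for the requested language (the caller decides how to fail).
function pickVoiceId(request: VoiceRequest, catalog: VoiceOption[]): string | undefined {
  const sameLanguage = catalog.filter((voice) => voice.language === request.ttsLanguage);
  const genderMatch = sameLanguage.find(
    (voice) =>
      request.gender !== undefined &&
      request.gender !== 'unknown' &&
      voice.gender === request.gender,
  );
  // Fall back to the first voice in the language when no gender hint matches.
  return (genderMatch ?? sameLanguage[0])?.voiceId;
}
```

Because the result is always drawn from `catalog`, the stage cannot return an id the TTS provider does not know about.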

### Stage 5: Validation

**Purpose**

Check internal consistency before returning the final result to the editor.

**Input**

- `VoiceMatchedSubtitle[]`

**Output**

```ts
interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}
```

**Rules**

- Do not rewrite content in this stage.
- Report warnings and errors separately.
- Only block final success on true contract failures, not soft quality warnings.
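
A read-only validation pass over the final subtitles could be sketched as follows. The severity assigned to each code is an assumption, as is the `CheckedSubtitle` subset type; the `voice_language_mismatch` check is omitted because it needs the voice catalog.

```typescript
interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}

// Subset of VoiceMatchedSubtitle that these checks read.
interface CheckedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  translatedText: string;
  ttsText: string;
  needsReview?: boolean;
}

function validateSubtitles(subtitles: CheckedSubtitle[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  let previousEnd = Number.NEGATIVE_INFINITY;
  for (const s of subtitles) {
    if (s.needsReview === true) {
      issues.push({ subtitleId: s.id, code: 'low_confidence_transcript', message: 'Transcript flagged for review', severity: 'warning' });
    }
    if (s.translatedText.trim() === '') {
      issues.push({ subtitleId: s.id, code: 'empty_translation', message: 'translatedText is empty', severity: 'error' });
    }
    if (s.ttsText.trim() === '') {
      issues.push({ subtitleId: s.id, code: 'missing_tts_text', message: 'ttsText is empty', severity: 'error' });
    }
    if (s.startTime < previousEnd) {
      issues.push({ subtitleId: s.id, code: 'timing_overlap', message: 'Starts before an earlier subtitle ends', severity: 'error' });
    }
    previousEnd = Math.max(previousEnd, s.endTime);
  }
  return issues;
}

// Only issues with severity 'error' should block final success.
function hasBlockingIssue(issues: ValidationIssue[]): boolean {
  return issues.some((issue) => issue.severity === 'error');
}
```

The function only reads its input and reports, which keeps it in line with the rule that this stage does not rewrite content.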

## Data Model

### Final subtitle shape

The editor should continue to receive `SubtitlePipelineResult`, but the result will now be built from staged outputs rather than a single provider response.

Existing final subtitle fields remain:

- `id`
- `startTime`
- `endTime`
- `originalText`
- `translatedText`
- `ttsText`
- `ttsLanguage`
- `speaker`
- `speakerId`
- `confidence`
- `voiceId`

### New metadata

Add optional pipeline metadata to the final result:

```ts
interface SubtitlePipelineDiagnostics {
  validationIssues?: ValidationIssue[];
  stageDurationsMs?: Partial<Record<'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation', number>>;
}
```

This metadata is primarily for logging, debugging, and future review UI. The editor does not need to block on it.
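
The `stageDurationsMs` field could be filled by wrapping each stage call in a small timing helper. The sketch below assumes a hypothetical `timeStage` function; the injectable `now` parameter exists only to make the helper easy to test.

```typescript
type StageName = 'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation';
type StageDurations = Partial<Record<StageName, number>>;

// Runs one stage and records its wall-clock duration in milliseconds,
// even when the stage throws.
async function timeStage<T>(
  name: StageName,
  durations: StageDurations,
  run: () => Promise<T>,
  now: () => number = Date.now,
): Promise<T> {
  const startedAt = now();
  try {
    return await run();
  } finally {
    durations[name] = now() - startedAt;
  }
}
```

Recording the duration in `finally` means a failed stage still shows up in diagnostics, which helps when a job fails on a slow provider call.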

## Runtime Architecture

### New server modules

Add stage-focused modules under a new folder:

- `src/server/subtitleStages/transcriptionStage.ts`
- `src/server/subtitleStages/segmentationStage.ts`
- `src/server/subtitleStages/translationStage.ts`
- `src/server/subtitleStages/voiceMatchingStage.ts`
- `src/server/subtitleStages/validationStage.ts`

Add one orchestrator:

- `src/server/multiStageSubtitleGeneration.ts`
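
As a rough illustration of what the orchestrator might do, the sketch below chains five injected stage functions in order and reports a progress label before each one. The `PipelineStages` interface, the `runSubtitlePipeline` name, and the simplified stage signatures are assumptions; real stages would also take provider configuration, the request id, and language settings.

```typescript
type StageLabel = 'transcribing' | 'segmenting' | 'translating' | 'matching_voice' | 'validating';

// Hypothetical, simplified stage signatures for illustration only.
interface PipelineStages<Raw, Segmented, Translated, Voiced, Issue> {
  transcribe: () => Promise<Raw[]>;
  segment: (input: Raw[]) => Promise<Segmented[]>;
  translate: (input: Segmented[]) => Promise<Translated[]>;
  matchVoices: (input: Translated[]) => Promise<Voiced[]>;
  validate: (input: Voiced[]) => Promise<Issue[]>;
}

async function runSubtitlePipeline<Raw, Segmented, Translated, Voiced, Issue>(
  stages: PipelineStages<Raw, Segmented, Translated, Voiced, Issue>,
  onStage: (label: StageLabel) => void = () => {},
): Promise<{ subtitles: Voiced[]; validationIssues: Issue[] }> {
  onStage('transcribing');
  const transcript = await stages.transcribe();
  if (transcript.length === 0) {
    // An empty transcript is treated as a hard failure.
    throw new Error('No subtitles returned after transcription');
  }
  onStage('segmenting');
  const segmented = await stages.segment(transcript);
  onStage('translating');
  const translated = await stages.translate(segmented);
  onStage('matching_voice');
  const voiced = await stages.matchVoices(translated);
  onStage('validating');
  const validationIssues = await stages.validate(voiced);
  return { subtitles: voiced, validationIssues };
}
```

Injecting the stages as functions keeps the orchestrator free of provider details and makes it straightforward to substitute stubs in integration tests.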

### Existing modules to adapt

- [src/server/subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
  - stop calling one all-in-one generator
  - call the new orchestrator instead
- [src/server/videoSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/videoSubtitleGeneration.ts)
  - shrink into a stage-specific transcription helper or split its reusable provider code into lower-level helpers
- [src/server/subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts)
  - reuse normalization logic in Stage 2 and final payload assembly
- [src/types.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/types.ts)
  - add stage-specific types and validation metadata
- [server.ts](/E:/Downloads/ai-video-dubbing-&-translation/server.ts)
  - optionally expose finer async job progress messages

## Job Progress

The async job API should keep the same public contract but expose more precise stage labels:

- `transcribing`
- `segmenting`
- `translating`
- `matching_voice`
- `validating`

This can be done either by extending `SubtitleJobStage` or by mapping internal stage names back to the existing progress system.
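
The mapping option could be sketched as below. The existing `SubtitleJobStage` values are not listed in this document, so `LegacyJobStage` and its single `'generating'` value are placeholders, and the equal-weight percentage is likewise an assumption.

```typescript
type PipelineStageLabel = 'transcribing' | 'segmenting' | 'translating' | 'matching_voice' | 'validating';

// Placeholder for the existing SubtitleJobStage values, which are not shown here.
type LegacyJobStage = 'generating';

interface JobProgressUpdate {
  stage: LegacyJobStage;
  detail: PipelineStageLabel;
  percent: number;
}

const STAGE_ORDER: readonly PipelineStageLabel[] = [
  'transcribing',
  'segmenting',
  'translating',
  'matching_voice',
  'validating',
];

// Maps a fine-grained label onto the existing coarse stage, with the label
// carried as a detail field and a naive equal-weight percentage.
function toProgressUpdate(label: PipelineStageLabel): JobProgressUpdate {
  const index = STAGE_ORDER.indexOf(label);
  return {
    stage: 'generating',
    detail: label,
    percent: Math.round((index * 100) / STAGE_ORDER.length),
  };
}
```

Clients that only understand the coarse stage keep working, while newer clients can read `detail` for the precise label.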

## Error Handling

### Hard failures

- unreadable video input
- provider request failure
- invalid stage output schema
- no subtitles returned after transcription
- no valid `voiceId` in Stage 4

These should fail the async job.

### Soft warnings

- low transcript confidence
- suspected mistranscription
- missing speaker gender
- fallback default voice applied

These should be attached to diagnostics and surfaced later, but should not block the result.
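
The split between the two categories can be made explicit in the type of the job outcome. In this sketch every code string and the `settleJob` function are hypothetical names chosen to mirror the lists above.

```typescript
// Hypothetical codes mirroring the hard-failure and soft-warning lists.
type HardFailureCode =
  | 'unreadable_video'
  | 'provider_request_failed'
  | 'invalid_stage_output'
  | 'empty_transcription'
  | 'no_valid_voice';

type SoftWarningCode =
  | 'low_transcript_confidence'
  | 'suspected_mistranscription'
  | 'missing_speaker_gender'
  | 'fallback_default_voice';

type JobOutcome =
  | { status: 'failed'; failure: HardFailureCode }
  | { status: 'succeeded'; warnings: SoftWarningCode[] };

// A hard failure always fails the job; warnings never do.
function settleJob(
  failure: HardFailureCode | undefined,
  warnings: SoftWarningCode[],
): JobOutcome {
  if (failure !== undefined) {
    return { status: 'failed', failure };
  }
  return { status: 'succeeded', warnings: [...warnings] };
}
```

Keeping the two code sets as separate types means a warning cannot be passed where a failure is expected without a compile error.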

## Testing

### Unit tests

- Stage 1 prompt and parsing tests
- Stage 2 segmentation behavior tests
- Stage 3 translation contract tests
- Stage 4 voice catalog matching tests
- Stage 5 validation issue generation tests

### Integration tests

- full orchestration success path
- transcription low-confidence warning path
- translation failure path
- voice mismatch validation path
- async job progress updates across all five stages

## Rollout Strategy

### Phase 1

- Add stage contracts and orchestrator behind the existing `/generate-subtitles` entry point.
- Keep the final API shape stable.

### Phase 2

- Store validation issues in the result payload.
- Surface review warnings in the editor.

### Phase 3

- Optionally swap Stage 1 to a stronger dedicated ASR service without changing translation, voice matching, or editor code.

## Recommendation

Implement the 4+1 pipeline in the backend first, while keeping the current frontend contract stable. That gives immediate gains in debuggability and transcription discipline, and it creates a clean seam for future ASR upgrades.