Compare commits


2 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Song367 | 85065cbca3 | Build multi-stage subtitle and dubbing pipeline (Gitea Actions: Explore-Gitea-Actions succeeded in 1m8s) | 2026-03-20 20:55:40 +08:00 |
| Song367 | 7cbc6c697c | Configure prompts | 2026-03-19 22:56:53 +08:00 |

37 changed files with 6312 additions and 322 deletions

View File

@@ -4,9 +4,27 @@ GEMINI_API_KEY="MY_GEMINI_API_KEY"
# ARK_API_KEY: Required when the editor LLM is set to Doubao.
ARK_API_KEY="YOUR_ARK_API_KEY"
# VITE_ARK_API_KEY: Required only if the browser uploads videos directly to Ark Files API.
# This exposes the key to the frontend and should only be used in trusted environments.
# VITE_ARK_API_KEY="YOUR_ARK_API_KEY"
# VOLCENGINE_ASR_APP_KEY: Required for Stage 1 audio transcription when using Doubao.
VOLCENGINE_ASR_APP_KEY="YOUR_VOLCENGINE_ASR_APP_KEY"
# VOLCENGINE_ASR_ACCESS_KEY: Required for Stage 1 audio transcription when using Doubao.
VOLCENGINE_ASR_ACCESS_KEY="YOUR_VOLCENGINE_ASR_ACCESS_KEY"
# VOLCENGINE_ASR_RESOURCE_ID: Optional override for flash ASR resource id.
# Defaults to volc.bigasr.auc_turbo.
# VOLCENGINE_ASR_RESOURCE_ID="volc.bigasr.auc_turbo"
# VOLCENGINE_ASR_BASE_URL: Optional override for flash ASR endpoint.
# Defaults to https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash
# VOLCENGINE_ASR_BASE_URL="https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash"
# VOLCENGINE_ASR_MODEL_NAME: Optional override for flash ASR model_name.
# Defaults to bigmodel.
# VOLCENGINE_ASR_MODEL_NAME="bigmodel"
# VOLCENGINE_ASR_TIMEOUT_MS: Optional timeout for flash ASR requests in milliseconds.
# Defaults to DOUBAO_TIMEOUT_MS or 600000.
# VOLCENGINE_ASR_TIMEOUT_MS="600000"
# DEFAULT_LLM_PROVIDER: Optional editor default. Supported values: doubao, gemini.
# Defaults to doubao.

View File

@@ -0,0 +1,348 @@
# 4+1 Subtitle Pipeline Design
## Goal
Replace the current one-shot subtitle generation flow with a staged `4+1` pipeline so transcription fidelity, translation quality, and voice selection can be improved independently.
## Scope
- Redesign the backend subtitle generation pipeline around five explicit stages.
- Keep the current upload flow, async job flow, and editor entry points intact for the first implementation pass.
- Preserve the current final payload shape for the editor, while adding richer intermediate metadata for debugging and review.
- Keep Doubao and Gemini provider support, but stop asking one model call to do transcription, translation, and voice matching in a single response.
## Non-Goals
- Do not add a new human review UI in this step.
- Do not replace the current editor or dubbing UI.
- Do not require a new third-party ASR vendor for the first pass. Stage 1 can still use the current multimodal provider if its prompt and output contract are narrowed to transcription only.
## Problem Summary
The current pipeline asks a single provider call to:
- watch and listen to the video
- transcribe dialogue
- split subtitle segments
- translate to English
- translate to the TTS language
- infer speaker metadata
- select a voice id
This creates two major problems:
1. When the model mishears the original dialogue, every downstream field is wrong.
2. It is hard to tell whether a bad result came from transcription, segmentation, translation, or voice matching.
The new design fixes that by isolating each responsibility.
## Design
### Stage overview
The new pipeline contains four production stages plus one validation stage:
1. `Stage 1: Transcription`
2. `Stage 2: Segmentation`
3. `Stage 3: Translation`
4. `Stage 4: Voice Matching`
5. `Stage 5: Validation`
Each stage receives a narrow input contract and returns a narrow output contract. Later stages must never invent or overwrite core facts from earlier stages.
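The staged contracts can be sketched as one typed composition seam. This is a minimal illustration with stand-in signatures, not the project's real modules (the actual stage types live in the design sections below):

```typescript
// Illustrative contracts only; the real stage modules are planned under
// src/server/subtitleStages and the real types under stageTypes.ts.
interface TranscriptSegment { id: string; startTime: number; endTime: number; originalText: string; speakerId: string; }
interface TranslatedSubtitle extends TranscriptSegment { translatedText: string; ttsText: string; ttsLanguage: string; }
interface VoiceMatchedSubtitle extends TranslatedSubtitle { voiceId: string; }
interface ValidationIssue { subtitleId: string; code: string; message: string; severity: 'warning' | 'error'; }

interface Stages {
  transcribe: (videoPath: string) => TranscriptSegment[];
  segment: (s: TranscriptSegment[]) => TranscriptSegment[];
  translate: (s: TranscriptSegment[], ttsLanguage: string) => TranslatedSubtitle[];
  matchVoices: (s: TranslatedSubtitle[]) => VoiceMatchedSubtitle[];
  validate: (s: VoiceMatchedSubtitle[]) => ValidationIssue[];
}

// Each stage only extends the previous stage's output; validation inspects
// the final array but never rewrites it.
function runPipeline(videoPath: string, ttsLanguage: string, stages: Stages) {
  const transcript = stages.transcribe(videoPath);
  const segmented = stages.segment(transcript);
  const translated = stages.translate(segmented, ttsLanguage);
  const matched = stages.matchVoices(translated);
  return { subtitles: matched, issues: stages.validate(matched) };
}
```

Because later stages only widen the record type, "never overwrite core facts" can be enforced structurally: a stage that tried to change `originalText` or timestamps would be visible in review as a spread-and-mutate rather than a plain extension.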
### Stage 1: Transcription
**Purpose**
Extract the source dialogue from the video as faithfully as possible.
**Input**
- local video path or remote `fileId`
- provider configuration
- request id
**Output**
```ts
interface TranscriptSegment {
id: string;
startTime: number;
endTime: number;
originalText: string;
speakerId: string;
speaker?: string;
gender?: 'male' | 'female' | 'unknown';
confidence?: number;
needsReview?: boolean;
}
```
**Rules**
- Only transcribe audible dialogue.
- Do not translate.
- Do not rewrite or polish.
- Prefer conservative output when unclear.
- Mark low-confidence segments with `needsReview`.
**Notes**
For the first pass, this stage can still call the current multimodal provider, but with a transcription-only prompt and schema. That gives the pipeline separation immediately without forcing a provider migration on day one.
### Stage 2: Segmentation
**Purpose**
Turn raw transcript segments into subtitle-friendly chunks without changing meaning.
**Input**
- `TranscriptSegment[]`
**Output**
```ts
interface SegmentedSubtitle {
id: string;
startTime: number;
endTime: number;
originalText: string;
speakerId: string;
speaker?: string;
gender?: 'male' | 'female' | 'unknown';
confidence?: number;
needsReview?: boolean;
}
```
**Rules**
- May split or merge segments for readability.
- Must not paraphrase `originalText`.
- Must preserve chronological order and non-overlap.
- Should reuse existing normalization and sentence reconstruction helpers where possible.
**Notes**
This stage should absorb logic that is currently mixed between [subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts) and provider prompts.
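A deterministic split that respects the "no paraphrasing" rule can be sketched as follows. The 42-character budget and proportional timestamp allocation are assumptions for illustration, not values from the design:

```typescript
interface Seg { id: string; startTime: number; endTime: number; originalText: string; speakerId: string; }

const MAX_CHARS = 42; // assumed subtitle line budget, not specified by the design

// Splits an over-long segment on word boundaries; children keep the exact
// original words, the speaker identity, and chronological, non-overlapping timing.
function splitSegment(seg: Seg): Seg[] {
  if (seg.originalText.length <= MAX_CHARS) return [seg];
  const chunks: string[] = [];
  let current = '';
  for (const word of seg.originalText.split(' ')) {
    if (current && (current + ' ' + word).length > MAX_CHARS) { chunks.push(current); current = word; }
    else current = current ? current + ' ' + word : word;
  }
  if (current) chunks.push(current);
  const total = seg.endTime - seg.startTime;
  const totalChars = chunks.reduce((n, c) => n + c.length, 0);
  let cursor = seg.startTime;
  return chunks.map((text, i) => {
    const span = total * (text.length / totalChars); // proportional to text length
    const child = { ...seg, id: `${seg.id}-${i + 1}`, startTime: cursor, endTime: cursor + span, originalText: text };
    cursor += span;
    return child;
  });
}
```

Rejoining the children's `originalText` with spaces reproduces the parent text exactly, which is the invariant a segmentation test should lock down.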
### Stage 3: Translation
**Purpose**
Translate already-confirmed source dialogue into display subtitles and dubbing text.
**Input**
- `SegmentedSubtitle[]`
- subtitle language settings
- TTS language
**Output**
```ts
interface TranslatedSubtitle extends SegmentedSubtitle {
translatedText: string;
ttsText: string;
ttsLanguage: string;
}
```
**Rules**
- `translatedText` is always English for on-screen subtitles.
- `ttsText` is always the requested TTS language.
- Translation must derive from `originalText` only.
- This stage must not edit timestamps or speaker identity.
### Stage 4: Voice Matching
**Purpose**
Assign the most suitable `voiceId` to each subtitle segment.
**Input**
- `TranslatedSubtitle[]`
- available voices for `ttsLanguage`
**Output**
```ts
interface VoiceMatchedSubtitle extends TranslatedSubtitle {
voiceId: string;
}
```
**Rules**
- Only select from the provided voice catalog.
- Use `speaker`, `gender`, and tone hints when available.
- Must not change transcript or translation fields.
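A pure matcher satisfying these rules might look like the sketch below. The `Voice` shape is assumed for illustration and may differ from the project's real catalog type in `voices.ts`:

```typescript
interface Voice { id: string; language: string; gender?: 'male' | 'female' | 'unknown'; }

// Picks a voiceId strictly from the provided catalog: prefer a gender match
// within the TTS language, otherwise fall back to the first in-language voice
// (validation can flag that fallback later as a soft warning).
function matchVoice(
  gender: 'male' | 'female' | 'unknown' | undefined,
  ttsLanguage: string,
  catalog: Voice[],
): string {
  const inLanguage = catalog.filter(v => v.language === ttsLanguage);
  if (inLanguage.length === 0) throw new Error(`no voices available for ${ttsLanguage}`);
  const byGender = inLanguage.find(v => v.gender === gender);
  return (byGender ?? inLanguage[0]).id;
}
```

Keeping the matcher pure (catalog in, `voiceId` out) makes the "must not change transcript or translation fields" rule trivial to test: the stage only spreads `{ ...subtitle, voiceId }`.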
### Stage 5: Validation
**Purpose**
Check internal consistency before returning the final result to the editor.
**Input**
- `VoiceMatchedSubtitle[]`
**Output**
```ts
interface ValidationIssue {
subtitleId: string;
code:
| 'low_confidence_transcript'
| 'timing_overlap'
| 'missing_tts_text'
| 'voice_language_mismatch'
| 'empty_translation';
message: string;
severity: 'warning' | 'error';
}
```
**Rules**
- Do not rewrite content in this stage.
- Report warnings and errors separately.
- Only block final success on true contract failures, not soft quality warnings.
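Two of the listed checks can be sketched as a pure validator that reports without rewriting. The subtitle shape is narrowed to the fields these checks need:

```typescript
interface Sub { id: string; startTime: number; endTime: number; translatedText: string; }
interface ValidationIssue { subtitleId: string; code: string; message: string; severity: 'warning' | 'error'; }

// Inspects the final subtitles and reports issues; never mutates its input.
// Empty translation is a hard contract failure; a small timing overlap is a
// soft quality warning that should not block the result.
function validate(subs: Sub[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (let i = 0; i < subs.length; i++) {
    if (!subs[i].translatedText.trim()) {
      issues.push({ subtitleId: subs[i].id, code: 'empty_translation', message: 'translation is empty', severity: 'error' });
    }
    if (i > 0 && subs[i].startTime < subs[i - 1].endTime) {
      issues.push({ subtitleId: subs[i].id, code: 'timing_overlap', message: 'overlaps previous subtitle', severity: 'warning' });
    }
  }
  return issues;
}
```

The orchestrator can then fail the job only when `issues.some(i => i.severity === 'error')` is true and attach the warnings to diagnostics.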
## Data Model
### Final subtitle shape
The editor should continue to receive `SubtitlePipelineResult`, but the result will now be built from staged outputs rather than a single provider response.
Existing final subtitle fields remain:
- `id`
- `startTime`
- `endTime`
- `originalText`
- `translatedText`
- `ttsText`
- `ttsLanguage`
- `speaker`
- `speakerId`
- `confidence`
- `voiceId`
### New metadata
Add optional pipeline metadata to the final result:
```ts
interface SubtitlePipelineDiagnostics {
validationIssues?: ValidationIssue[];
stageDurationsMs?: Partial<Record<'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation', number>>;
}
```
This metadata is primarily for logging, debugging, and future review UI. The editor does not need to block on it.
## Runtime Architecture
### New server modules
Add stage-focused modules under a new folder:
- `src/server/subtitleStages/transcriptionStage.ts`
- `src/server/subtitleStages/segmentationStage.ts`
- `src/server/subtitleStages/translationStage.ts`
- `src/server/subtitleStages/voiceMatchingStage.ts`
- `src/server/subtitleStages/validationStage.ts`
Add one orchestrator:
- `src/server/multiStageSubtitleGeneration.ts`
### Existing modules to adapt
- [src/server/subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
- stop calling one all-in-one generator
- call the new orchestrator instead
- [src/server/videoSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/videoSubtitleGeneration.ts)
- shrink into a stage-specific transcription helper or split its reusable provider code into lower-level helpers
- [src/server/subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts)
- reuse normalization logic in Stage 2 and final payload assembly
- [src/types.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/types.ts)
- add stage-specific types and validation metadata
- [server.ts](/E:/Downloads/ai-video-dubbing-&-translation/server.ts)
- optionally expose finer async job progress messages
## Job Progress
The async job API should keep the same public contract but expose more precise stage labels:
- `transcribing`
- `segmenting`
- `translating`
- `matching_voice`
- `validating`
This can be done either by extending `SubtitleJobStage` or mapping internal stage names back to the existing progress system.
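The mapping option can be a single lookup table; the label strings come from the list above, while the internal stage key names are assumptions mirroring the module names:

```typescript
type InternalStage = 'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation';

// Maps internal orchestrator stage keys onto the public progress labels,
// so the async job API contract stays unchanged.
const stageLabel: Record<InternalStage, string> = {
  transcription: 'transcribing',
  segmentation: 'segmenting',
  translation: 'translating',
  voiceMatching: 'matching_voice',
  validation: 'validating',
};
```

Using a `Record<InternalStage, string>` makes the mapping exhaustive: adding a sixth internal stage later fails type-checking until a label is chosen.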
## Error Handling
### Hard failures
- unreadable video input
- provider request failure
- invalid stage output schema
- no subtitles returned after transcription
- no valid `voiceId` in Stage 4
These should fail the async job.
### Soft warnings
- low transcript confidence
- suspected mistranscription
- missing speaker gender
- fallback default voice applied
These should be attached to diagnostics and surfaced later, but should not block the result.
## Testing
### Unit tests
- Stage 1 prompt and parsing tests
- Stage 2 segmentation behavior tests
- Stage 3 translation contract tests
- Stage 4 voice catalog matching tests
- Stage 5 validation issue generation tests
### Integration tests
- full orchestration success path
- transcription low-confidence warning path
- translation failure path
- voice mismatch validation path
- async job progress updates across all five stages
## Rollout Strategy
### Phase 1
- Add stage contracts and orchestrator behind the existing `/generate-subtitles` entry point.
- Keep the final API shape stable.
### Phase 2
- Store validation issues in the result payload.
- Surface review warnings in the editor.
### Phase 3
- Optionally swap Stage 1 to a stronger dedicated ASR service without changing translation, voice matching, or editor code.
## Recommendation
Implement the 4+1 pipeline in the backend first, while keeping the current frontend contract stable. That gives immediate gains in debuggability and transcription discipline, and it creates a clean seam for future ASR upgrades.

View File

@@ -0,0 +1,282 @@
# 4+1 Subtitle Pipeline Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Replace the current one-shot subtitle generation flow with a 4+1 staged pipeline that isolates transcription, segmentation, translation, voice matching, and validation.
**Architecture:** Introduce stage-specific server modules under a new `subtitleStages` folder and route the existing `/generate-subtitles` backend entry point through a new orchestrator. Keep the final `SubtitlePipelineResult` contract stable for the editor while adding internal stage contracts and diagnostics.
**Tech Stack:** TypeScript, Node server pipeline, Vitest, React client services, async subtitle job polling
---
### Task 1: Define stage contracts and lock them with tests
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\types.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\stageTypes.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\stageTypes.test.ts`
**Step 1: Write the failing test**
- Assert stage types support:
- transcription output with `confidence` and `needsReview`
- translated output with `ttsText` and `ttsLanguage`
- validation issue output with `code` and `severity`
**Step 2: Run test to verify it fails**
Run:
`npm run test -- src/server/subtitleStages/stageTypes.test.ts`
Expected:
FAIL because the new stage files and contracts do not exist yet.
**Step 3: Write minimal implementation**
- Create `stageTypes.ts` with:
- `TranscriptSegment`
- `SegmentedSubtitle`
- `TranslatedSubtitle`
- `VoiceMatchedSubtitle`
- `ValidationIssue`
- any stage diagnostics helpers needed by the orchestrator
- Extend `src/types.ts` only where the public result contract needs optional diagnostics.
**Step 4: Run test to verify it passes**
Run:
`npm run test -- src/server/subtitleStages/stageTypes.test.ts`
Expected:
PASS
**Step 5: Commit**
Skip commit for now.
### Task 2: Add the transcription stage
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\transcriptionStage.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\transcriptionStage.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\videoSubtitleGeneration.ts`
**Step 1: Write the failing tests**
- Assert the transcription stage prompt asks only for:
- faithful transcription
- timestamps
- speaker metadata
- Assert it does not request translation or voice selection.
- Assert parser output normalizes low-confidence and missing speaker fields safely.
**Step 2: Run tests to verify they fail**
Run:
`npm run test -- src/server/subtitleStages/transcriptionStage.test.ts src/server/videoSubtitleGeneration.test.ts`
Expected:
FAIL because the transcription stage does not exist and the current prompt is still all-in-one.
**Step 3: Write minimal implementation**
- Extract provider-specific transcription logic from `videoSubtitleGeneration.ts` into `transcriptionStage.ts`.
- Narrow the transcription prompt and JSON schema to transcription-only fields.
- Return `TranscriptSegment[]`.
**Step 4: Run tests to verify they pass**
Run:
`npm run test -- src/server/subtitleStages/transcriptionStage.test.ts src/server/videoSubtitleGeneration.test.ts`
Expected:
PASS
**Step 5: Commit**
Skip commit for now.
### Task 3: Add the segmentation stage
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\segmentationStage.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\segmentationStage.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.ts`
**Step 1: Write the failing tests**
- Assert long transcript segments are split into subtitle-friendly chunks.
- Assert segmentation preserves `originalText`, timing order, and speaker identity.
- Assert no paraphrasing occurs during segmentation.
**Step 2: Run tests to verify they fail**
Run:
`npm run test -- src/server/subtitleStages/segmentationStage.test.ts src/server/subtitlePipeline.test.ts`
Expected:
FAIL because there is no explicit segmentation stage.
**Step 3: Write minimal implementation**
- Reuse normalization helpers from `subtitlePipeline.ts`.
- Implement deterministic segmentation that:
- preserves chronology
- keeps original text intact
- marks impossible cases for later review instead of rewriting
**Step 4: Run tests to verify they pass**
Run:
`npm run test -- src/server/subtitleStages/segmentationStage.test.ts src/server/subtitlePipeline.test.ts`
Expected:
PASS
**Step 5: Commit**
Skip commit for now.
### Task 4: Add the translation stage
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\translationStage.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\translationStage.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleGeneration.ts`
**Step 1: Write the failing tests**
- Assert translation stage input is `originalText` from segmentation, not raw provider output.
- Assert it returns:
- `translatedText`
- `ttsText`
- `ttsLanguage`
- Assert it never changes timestamps.
**Step 2: Run tests to verify they fail**
Run:
`npm run test -- src/server/subtitleStages/translationStage.test.ts src/server/subtitleGeneration.test.ts`
Expected:
FAIL because translation is not separated yet.
**Step 3: Write minimal implementation**
- Build a translation-only stage that consumes segmented subtitles.
- Keep English subtitle generation and TTS-language generation separate but paired.
- Return `TranslatedSubtitle[]`.
**Step 4: Run tests to verify they pass**
Run:
`npm run test -- src/server/subtitleStages/translationStage.test.ts src/server/subtitleGeneration.test.ts`
Expected:
PASS
**Step 5: Commit**
Skip commit for now.
### Task 5: Add voice matching and validation stages
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\voiceMatchingStage.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\voiceMatchingStage.test.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\validationStage.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\validationStage.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\voices.ts`
**Step 1: Write the failing tests**
- Assert voice matching only picks from the current language-specific catalog.
- Assert it falls back safely when gender or speaker tone is missing.
- Assert validation returns warnings for:
- low confidence transcript
- voice language mismatch
- empty translation
- timing overlap
**Step 2: Run tests to verify they fail**
Run:
`npm run test -- src/server/subtitleStages/voiceMatchingStage.test.ts src/server/subtitleStages/validationStage.test.ts`
Expected:
FAIL because neither stage exists yet.
**Step 3: Write minimal implementation**
- Implement a pure voice matcher that adds `voiceId` and never rewrites text.
- Implement a validator that inspects final subtitles and returns `ValidationIssue[]`.
**Step 4: Run tests to verify they pass**
Run:
`npm run test -- src/server/subtitleStages/voiceMatchingStage.test.ts src/server/subtitleStages/validationStage.test.ts`
Expected:
PASS
**Step 5: Commit**
Skip commit for now.
### Task 6: Integrate the orchestrator and async job progress
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleGeneration.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleJobs.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\services\subtitleService.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\types.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
**Step 1: Write the failing tests**
- Assert the orchestrator runs stages in order:
- transcription
- segmentation
- translation
- voice matching
- validation
- Assert async progress updates expose stage-specific messages.
- Assert final `SubtitlePipelineResult` stays backward compatible for the editor.
**Step 2: Run tests to verify they fail**
Run:
`npm run test -- src/server/multiStageSubtitleGeneration.test.ts src/server/subtitleJobs.test.ts src/services/subtitleService.test.ts`
Expected:
FAIL because the orchestrator and new stage progress do not exist yet.
**Step 3: Write minimal implementation**
- Add `multiStageSubtitleGeneration.ts`.
- Route existing backend entry points through the orchestrator.
- Keep `/generate-subtitles` and polling payloads stable.
- Include optional validation diagnostics in the final result.
**Step 4: Run tests to verify they pass**
Run:
`npm run test -- src/server/multiStageSubtitleGeneration.test.ts src/server/subtitleJobs.test.ts src/services/subtitleService.test.ts`
Expected:
PASS
**Step 5: Run focused regression tests**
Run:
`npm run test -- src/server/videoSubtitleGeneration.test.ts src/server/subtitleGeneration.test.ts src/server/subtitleJobs.test.ts src/services/subtitleService.test.ts src/components/EditorScreen.test.tsx`
Expected:
PASS
**Step 6: Commit**
Skip commit for now.

View File

@@ -0,0 +1,356 @@
# Speaker Casting Stage Design
**Date:** 2026-03-20
## Goal
Add a dedicated `Speaker Casting` stage after ASR so the pipeline can analyze the full episode transcript plus the original video, merge segments that belong to the same character, infer speaker gender, and assign one stable `voiceId` per character for the whole episode.
## Problem
The current pipeline still has two major gaps after the Stage 1 ASR upgrade:
1. `speakerId` is still synthetic and sequential.
- Stage 1 currently maps each ASR utterance to `speaker-1`, `speaker-2`, `speaker-3`, and so on.
- This is not real speaker tracking or character clustering.
2. Voice assignment is still local and shallow.
- The current `voiceMatching` stage mostly filters by `ttsLanguage`.
- When `gender` is `unknown`, it often falls back to the first voice in that language.
- This is why the UI can show the same voice, such as `Santa_Claus`, across multiple unrelated lines.
The user wants a more vertical agent that does one narrow but important job:
- read the whole ASR result
- look at the original video
- inspect the available voices
- decide which transcript segments belong to the same speaker
- assign one stable voice per speaker for the whole episode
## Recommendation
Upgrade the current `4+1` pipeline into a `5+1` pipeline by inserting a dedicated `Speaker Casting` stage between `Transcription` and `Segmentation`.
New pipeline order:
1. `Transcription`
2. `Speaker Casting`
3. `Segmentation`
4. `Translation`
5. `Voice Matching` fallback
6. `Validation`
This keeps the current strengths of the staged pipeline while moving speaker identity and voice assignment to a stage that can reason across the full episode instead of one subtitle at a time.
## Scope
### In Scope
- Add a new `speakerCasting` stage to the server pipeline.
- Feed the stage:
- the original video
- full-episode ASR output
- `ttsLanguage`
- language-filtered voice catalog
- Let the stage return:
- canonical `speakerId`
- speaker label
- speaker gender
- stable `voiceId`
- segment-to-speaker assignments
- Preserve the existing frontend subtitle payload shape.
- Keep `voiceMatching` as a fallback safety stage rather than the primary casting decision maker.
- Add full structured logs for stage input, model output, and normalized result.
### Out of Scope
- Do not add a new manual character management UI in this step.
- Do not rewrite ASR text in this stage.
- Do not move translation into the casting stage.
- Do not require true biometric speaker diarization or voiceprint clustering in this first version.
## Why a Dedicated Casting Stage
This stage is valuable because it can use information that no current stage has in one place:
- the full ASR result across the whole episode
- recurring character behavior and dialogue patterns
- visual context from the original video
- the exact current voice catalog for the chosen `ttsLanguage`
That lets the model answer three questions together:
1. Which segments belong to the same speaker?
2. Is this speaker more likely male, female, or unknown?
3. Which provided voice best matches this speaker for the whole episode?
This is intentionally similar to the old one-shot agent in capability, but much narrower in responsibility. It no longer transcribes or translates. It only performs speaker consolidation and casting.
## Data Flow
### Stage 1: Transcription
Unchanged responsibility:
- extract dialogue from audio
- return `TranscriptSegment[]`
Example output today:
```ts
interface TranscriptSegment {
id: string;
startTime: number;
endTime: number;
originalText: string;
speakerId: string;
speaker?: string;
gender?: 'male' | 'female' | 'unknown';
confidence?: number;
needsReview?: boolean;
}
```
### Stage 2: Speaker Casting
New responsibility:
- merge transcript segments into canonical speakers
- assign `gender`
- assign one episode-level `voiceId` per canonical speaker
Input:
```ts
interface SpeakerCastingInput {
videoPath: string;
transcriptSegments: TranscriptSegment[];
ttsLanguage: string;
availableVoices: Voice[];
}
```
Output:
```ts
interface CanonicalSpeaker {
speakerId: string;
label: string;
gender: 'male' | 'female' | 'unknown';
voiceId?: string;
confidence?: number;
reason?: string;
}
interface SpeakerAssignment {
segmentId: string;
speakerId: string;
}
interface SpeakerCastingResult {
speakers: CanonicalSpeaker[];
segmentAssignments: SpeakerAssignment[];
}
```
Normalized segment output after casting:
```ts
interface CastTranscriptSegment extends TranscriptSegment {
speakerId: string;
speaker?: string;
gender?: 'male' | 'female' | 'unknown';
voiceId?: string;
}
```
### Stage 3: Segmentation
Unchanged core purpose, with one important inheritance rule:
- if a transcript segment is split into multiple subtitle-friendly chunks, every child chunk inherits:
- `speakerId`
- `speaker`
- `gender`
- `voiceId`
### Stage 4: Translation
Still only translates text.
It must not change:
- `speakerId`
- `speaker`
- `gender`
- `voiceId`
### Stage 5: Voice Matching Fallback
This stage stays in the pipeline, but its role changes:
- if `Speaker Casting` already assigned a valid `voiceId`, keep it
- if a segment or speaker is missing `voiceId`, fill it in
- if the chosen `voiceId` is invalid for the current language, replace it with a safe fallback
This keeps the pipeline resilient without making fallback matching the main casting system.
### Stage 6: Validation
Add checks for the new casting layer:
- segment missing canonical speaker assignment
- canonical speaker missing `voiceId`
- returned `voiceId` not present in the provided catalog
- `voiceId` language mismatch
- suspiciously low casting confidence
## Prompt Design
The `Speaker Casting` prompt should be strict and narrow.
### System Prompt Responsibilities
- analyze the original video and the provided ASR transcript
- group transcript segments by the same recurring speaker or character
- assign one canonical `speakerId` per speaker
- choose one stable `voiceId` for each canonical speaker from the provided voice list
- infer `gender` conservatively as `male`, `female`, or `unknown`
- never rewrite transcript text
- never invent voices outside the provided list
- return JSON only
### User Prompt Inputs
- `ttsLanguage`
- filtered `availableVoices`
- transcript segments with `id`, `time range`, and `originalText`
- original video content
### Output Contract
```json
{
"speakers": [
{
"speakerId": "speaker-a",
"label": "female-lead",
"gender": "female",
"voiceId": "Charming_Lady",
"confidence": 0.88,
"reason": "young female lead with bright assertive tone"
}
],
"segmentAssignments": [
{
"segmentId": "segment-1",
"speakerId": "speaker-a"
}
]
}
```
`reason` is for logs and debugging, not required for frontend rendering.
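Stamping a casting result back onto transcript segments can be sketched as a pure merge. The interfaces are repeated locally so the sketch is self-contained; the real contracts belong in `stageTypes.ts`:

```typescript
interface TranscriptSegment { id: string; speakerId: string; speaker?: string; gender?: 'male' | 'female' | 'unknown'; }
interface CanonicalSpeaker { speakerId: string; label: string; gender: 'male' | 'female' | 'unknown'; voiceId?: string; }
interface SpeakerAssignment { segmentId: string; speakerId: string; }
interface SpeakerCastingResult { speakers: CanonicalSpeaker[]; segmentAssignments: SpeakerAssignment[]; }

// Applies canonical identity to each segment. A segment with no assignment
// keeps its synthetic ASR speakerId and no voiceId, which routes it to the
// fallback voiceMatching stage instead of failing the pipeline.
function applyCasting(segments: TranscriptSegment[], result: SpeakerCastingResult) {
  const speakers = new Map(result.speakers.map(s => [s.speakerId, s]));
  const assigned = new Map(result.segmentAssignments.map(a => [a.segmentId, a.speakerId]));
  return segments.map(seg => {
    const speakerId = assigned.get(seg.id) ?? seg.speakerId;
    const canon = speakers.get(speakerId);
    return {
      ...seg,
      speakerId,
      speaker: canon?.label ?? seg.speaker,
      gender: canon?.gender ?? seg.gender,
      voiceId: canon?.voiceId,
    };
  });
}
```

Because the merge never touches `originalText` or timing, it automatically honors the "never rewrite transcript text" rule from the prompt design.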
## Fallback Rules
### Full Stage Failure
If the entire `Speaker Casting` stage fails:
- keep the original ASR-generated synthetic `speakerId`
- leave `speaker` and `gender` unchanged where possible
- run the existing `voiceMatching` stage as the primary fallback
- mark the result with casting diagnostics so we know episode-level casting did not run successfully
### Partial Failure
If only some segments or speakers fail normalization:
- keep valid assignments
- mark invalid assignments for review
- let fallback voice matching fill missing `voiceId`
### Invalid Voice
If the model returns a `voiceId` not found in the provided catalog:
- discard it
- do not trust the invalid value
- let fallback voice matching assign a safe valid voice
### Uncertain Gender
If gender is ambiguous:
- normalize to `unknown`
- do not force `male` or `female`
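The invalid-voice and uncertain-gender rules reduce to two small normalizers, sketched here under the assumption that the catalog entries expose an `id` field:

```typescript
type Gender = 'male' | 'female' | 'unknown';

// Discards a voiceId the provided catalog does not contain, so the fallback
// voiceMatching stage refills it with a safe valid voice.
function sanitizeVoiceId(voiceId: string | undefined, catalog: { id: string }[]): string | undefined {
  return voiceId && catalog.some(v => v.id === voiceId) ? voiceId : undefined;
}

// Collapses anything outside the allowed set to 'unknown' rather than
// forcing a male/female guess.
function normalizeGender(raw: unknown): Gender {
  return raw === 'male' || raw === 'female' ? raw : 'unknown';
}
```

Running both normalizers before the casting result is accepted means the downstream stages only ever see catalog-valid voices and a closed gender union.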
## Logging
The new stage should log full stage payloads in the same spirit as the ASR logs.
Required logs:
- stage started
- model request built
- model response received
- normalized casting result
- stage failed
Important fields:
- `requestId`
- `stage`
- `ttsLanguage`
- number of transcript segments
- number of available voices
- full raw model response
- normalized `speakers`
- normalized `segmentAssignments`
This is important because casting mistakes will otherwise look like translation or TTS mistakes in the UI.
## Files Expected to Change
Primary backend files:
- [stageTypes.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/stageTypes.ts)
- [multiStageSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/multiStageSubtitleGeneration.ts)
- [voiceMatchingStage.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/voiceMatchingStage.ts)
- [transcriptionStage.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/transcriptionStage.ts)
- [llmProvider.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/llmProvider.ts)
New files:
- `src/server/subtitleStages/speakerCastingStage.ts`
- `src/server/subtitleStages/speakerCastingStage.test.ts`
Possible supporting updates:
- [types.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/types.ts)
- [subtitleJobs.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleJobs.ts)
## Testing Strategy
### Unit Tests
- prompt contract for `Speaker Casting`
- normalization of model result
- invalid `voiceId` fallback behavior
- partial assignment handling
- inheritance of `speakerId` and `voiceId` through segmentation
### Integration Tests
- full pipeline success with casting stage enabled
- fallback path when casting stage fails
- stage progress reporting includes `speaker_casting`
- final subtitle payload remains backward compatible for the editor
## Success Criteria
- The pipeline no longer treats each ASR utterance as a permanently separate speaker.
- Same-character lines across the full episode resolve to the same canonical `speakerId`.
- Same canonical speaker gets the same `voiceId` across the episode.
- The editor still receives a stable subtitle payload without frontend breakage.
- Logs clearly show how each segment was assigned to a canonical speaker and voice.

View File

@ -0,0 +1,342 @@
# Speaker Casting Stage Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Add a dedicated `Speaker Casting` stage that uses the full ASR transcript, the original video, and the filtered voice catalog to assign canonical speaker identities and one stable voice per speaker for the whole episode.
**Architecture:** Insert a new `speakerCasting` stage between `transcription` and `segmentation`. This stage will call an LLM with the original video plus transcript metadata, normalize the result into canonical speakers and segment assignments, and stamp stable `speakerId`, `speaker`, `gender`, and `voiceId` onto transcript segments before segmentation and translation continue. The existing `voiceMatching` stage remains as a fallback validator/filler rather than the primary casting engine.
**Tech Stack:** TypeScript, Node server pipeline, Volcengine ASR Stage 1, existing LLM provider abstraction, Vitest, async subtitle job orchestration
---
### Task 1: Extend stage contracts for canonical speakers and casting assignments
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\stageTypes.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\types.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\stageTypes.test.ts`
**Step 1: Write the failing test**
Add tests that expect stage types for:
- `CanonicalSpeaker`
- `SpeakerAssignment`
- `SpeakerCastingResult`
- transcript segments that can carry optional `voiceId`
- pipeline stage keys that include `speakerCasting`
**Step 2: Run test to verify it fails**
Run:
`npm.cmd run test -- src/server/subtitleStages/stageTypes.test.ts`
Expected:
FAIL because the new contracts and stage key are missing.
**Step 3: Write minimal implementation**
- Add canonical speaker and assignment interfaces to `stageTypes.ts`.
- Extend public types only where needed so final subtitle payloads can safely carry speaker-casting diagnostics without breaking the editor.
- Add the new stage key to any stage-duration typing and progress unions.
**Step 4: Run test to verify it passes**
Run:
`npm.cmd run test -- src/server/subtitleStages/stageTypes.test.ts`
Expected:
PASS
**Step 5: Commit**
```bash
git add src/server/subtitleStages/stageTypes.ts src/types.ts src/server/subtitleStages/stageTypes.test.ts
git commit -m "feat: add speaker casting stage contracts"
```
### Task 2: Build the speaker casting stage prompt, parser, and normalizer
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\speakerCastingStage.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\speakerCastingStage.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\llmProvider.ts`
**Step 1: Write the failing tests**
Add tests that assert:
- the prompt requests only speaker clustering, gender inference, and voice selection
- transcript text is treated as read-only input
- returned JSON must contain `speakers` and `segmentAssignments`
- invalid JSON or incomplete output is rejected safely
**Step 2: Run tests to verify they fail**
Run:
`npm.cmd run test -- src/server/subtitleStages/speakerCastingStage.test.ts src/server/llmProvider.test.ts`
Expected:
FAIL because the new stage does not exist.
**Step 3: Write minimal implementation**
- Create `speakerCastingStage.ts` with:
- system prompt builder
- user prompt builder
- raw result parser
- normalization for canonical speakers and assignments
- Reuse the existing provider config path so Doubao and Gemini can both support the stage if needed.
- Keep `reason` optional and internal.
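The parse/normalize step above can be sketched as follows. This is a minimal sketch using simplified stand-ins for the contracts in `stageTypes.ts`; the function name mirrors the planned export, but the exact field handling is illustrative:

```typescript
// Simplified stand-ins for the real stage contracts (illustrative only).
interface CanonicalSpeaker {
  speakerId: string;
  label: string;
  gender: 'male' | 'female' | 'unknown';
  voiceId?: string;
}

interface SpeakerCastingResult {
  speakers: CanonicalSpeaker[];
  segmentAssignments: Array<{ segmentId: string; speakerId: string }>;
}

// Reject anything that is not a complete casting result instead of letting
// a half-parsed payload leak into segmentation.
function parseSpeakerCastingResponse(raw: string): SpeakerCastingResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error('Speaker casting response is not valid JSON.');
  }
  const value = parsed as Partial<SpeakerCastingResult>;
  if (!Array.isArray(value.speakers) || !Array.isArray(value.segmentAssignments)) {
    throw new Error('Speaker casting response must contain speakers and segmentAssignments.');
  }
  return {
    speakers: value.speakers.map((speaker) => ({
      speakerId: String(speaker.speakerId),
      label: String(speaker.label ?? speaker.speakerId),
      gender:
        speaker.gender === 'male' || speaker.gender === 'female' ? speaker.gender : 'unknown',
      ...(speaker.voiceId ? { voiceId: String(speaker.voiceId) } : {}),
    })),
    segmentAssignments: value.segmentAssignments.map((assignment) => ({
      segmentId: String(assignment.segmentId),
      speakerId: String(assignment.speakerId),
    })),
  };
}
```

Normalizing gender to `unknown` and keeping `voiceId` optional matches the "invalid or incomplete output is rejected safely" requirement without inventing data the model did not return.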
**Step 4: Run tests to verify they pass**
Run:
`npm.cmd run test -- src/server/subtitleStages/speakerCastingStage.test.ts src/server/llmProvider.test.ts`
Expected:
PASS
**Step 5: Commit**
```bash
git add src/server/subtitleStages/speakerCastingStage.ts src/server/subtitleStages/speakerCastingStage.test.ts src/server/llmProvider.ts src/server/llmProvider.test.ts
git commit -m "feat: add speaker casting stage"
```
### Task 3: Stamp canonical speaker data back onto transcript segments
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\transcriptionStage.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\segmentationStage.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\segmentationStage.test.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\speakerCastingStage.test.ts`
**Step 1: Write the failing tests**
Add tests that assert:
- a `segmentAssignment` rewrites synthetic `speakerId` to canonical `speakerId`
- canonical `speaker`, `gender`, and `voiceId` are copied onto transcript segments
- segmentation preserves inherited `speakerId`, `speaker`, `gender`, and `voiceId` when splitting segments
**Step 2: Run tests to verify they fail**
Run:
`npm.cmd run test -- src/server/subtitleStages/speakerCastingStage.test.ts src/server/subtitleStages/segmentationStage.test.ts`
Expected:
FAIL because canonical speaker inheritance is not implemented.
**Step 3: Write minimal implementation**
- Add a helper that applies `SpeakerCastingResult` to `TranscriptSegment[]`.
- Ensure segmentation duplicates inherited speaker metadata whenever one transcript segment becomes multiple subtitle segments.
- Do not let segmentation overwrite canonical speaker identity.
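The helper described above can be sketched like this. The interfaces are simplified stand-ins for the real contracts, and `applyCastingToSegments` is a hypothetical name for the planned helper:

```typescript
// Simplified stand-ins for the real contracts in stageTypes.ts.
interface CastSpeaker {
  speakerId: string;
  label: string;
  gender: 'male' | 'female' | 'unknown';
  voiceId?: string;
}

interface CastingResult {
  speakers: CastSpeaker[];
  segmentAssignments: Array<{ segmentId: string; speakerId: string }>;
}

interface TranscriptSegment {
  id: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  voiceId?: string;
}

// Rewrite synthetic per-utterance speaker ids to canonical ones and copy the
// canonical label, gender, and voiceId onto each assigned segment.
function applyCastingToSegments(
  segments: TranscriptSegment[],
  casting: CastingResult,
): TranscriptSegment[] {
  const speakerById = new Map(casting.speakers.map((s) => [s.speakerId, s]));
  const assignedSpeakerId = new Map(
    casting.segmentAssignments.map((a) => [a.segmentId, a.speakerId]),
  );
  return segments.map((segment) => {
    const canonicalId = assignedSpeakerId.get(segment.id);
    const speaker = canonicalId ? speakerById.get(canonicalId) : undefined;
    if (!speaker) {
      return segment; // unassigned segments keep their synthetic identity
    }
    return {
      ...segment,
      speakerId: speaker.speakerId,
      speaker: speaker.label,
      gender: speaker.gender,
      ...(speaker.voiceId ? { voiceId: speaker.voiceId } : {}),
    };
  });
}
```

Segmentation then only needs to copy these inherited fields when it splits one transcript segment into several subtitle segments; it never recomputes them.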
**Step 4: Run tests to verify they pass**
Run:
`npm.cmd run test -- src/server/subtitleStages/speakerCastingStage.test.ts src/server/subtitleStages/segmentationStage.test.ts`
Expected:
PASS
**Step 5: Commit**
```bash
git add src/server/subtitleStages/transcriptionStage.ts src/server/subtitleStages/segmentationStage.ts src/server/subtitleStages/speakerCastingStage.test.ts src/server/subtitleStages/segmentationStage.test.ts
git commit -m "feat: inherit canonical speakers through segmentation"
```
### Task 4: Integrate the new stage into the orchestrator and progress system
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleGeneration.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleJobs.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.test.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleJobs.test.ts`
**Step 1: Write the failing tests**
Add tests that assert:
- stage order becomes `transcription -> speakerCasting -> segmentation -> translation -> voiceMatching -> validation`
- async job progress reports the `speakerCasting` stage
- final payload remains backward compatible for the editor
**Step 2: Run tests to verify they fail**
Run:
`npm.cmd run test -- src/server/multiStageSubtitleGeneration.test.ts src/server/subtitleJobs.test.ts`
Expected:
FAIL because the orchestrator does not know about `speakerCasting`.
**Step 3: Write minimal implementation**
- Call `speakerCastingStage` immediately after transcription.
- Pass:
- `videoPath`
- `transcriptSegments`
- `ttsLanguage`
- filtered voice catalog
- Apply casting assignments before segmentation runs.
- Extend progress reporting and stage duration diagnostics.
**Step 4: Run tests to verify they pass**
Run:
`npm.cmd run test -- src/server/multiStageSubtitleGeneration.test.ts src/server/subtitleJobs.test.ts`
Expected:
PASS
**Step 5: Commit**
```bash
git add src/server/multiStageSubtitleGeneration.ts src/server/subtitleGeneration.ts src/server/subtitleJobs.ts server.ts src/server/multiStageSubtitleGeneration.test.ts src/server/subtitleJobs.test.ts
git commit -m "feat: integrate speaker casting stage"
```
### Task 5: Turn voice matching into a fallback stage
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\voiceMatchingStage.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\voiceMatchingStage.test.ts`
**Step 1: Write the failing tests**
Add tests that assert:
- an existing valid `voiceId` from speaker casting is preserved
- missing `voiceId` still falls back to language/gender matching
- invalid `voiceId` is replaced with a safe valid fallback
**Step 2: Run tests to verify they fail**
Run:
`npm.cmd run test -- src/server/subtitleStages/voiceMatchingStage.test.ts`
Expected:
FAIL because voice matching still assumes it is the primary source of `voiceId`.
**Step 3: Write minimal implementation**
- Update `voiceMatchingStage` so it first validates existing `voiceId`.
- Only assign a new `voiceId` when the current one is missing or invalid.
- Keep speaker-level reuse behavior as a fallback only.
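The fallback order above can be sketched as a small resolver. `resolveVoiceId` and `VoiceOption` are hypothetical names for illustration; the real stage works against the filtered voice catalog:

```typescript
interface VoiceOption {
  voiceId: string;
  language: string;
  gender: 'male' | 'female' | 'unknown';
}

// Validate a voiceId chosen by speaker casting first; only fall back to
// language/gender matching when it is missing or not in the catalog.
function resolveVoiceId(
  currentVoiceId: string | undefined,
  language: string,
  gender: 'male' | 'female' | 'unknown',
  catalog: VoiceOption[],
): string | undefined {
  if (currentVoiceId && catalog.some((v) => v.voiceId === currentVoiceId)) {
    return currentVoiceId; // preserve the casting stage's choice
  }
  const byGender = catalog.find((v) => v.language === language && v.gender === gender);
  const byLanguage = catalog.find((v) => v.language === language);
  return (byGender ?? byLanguage)?.voiceId;
}
```

Keeping the valid-id check first is what demotes voice matching from primary caster to safety net.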
**Step 4: Run tests to verify they pass**
Run:
`npm.cmd run test -- src/server/subtitleStages/voiceMatchingStage.test.ts`
Expected:
PASS
**Step 5: Commit**
```bash
git add src/server/subtitleStages/voiceMatchingStage.ts src/server/subtitleStages/voiceMatchingStage.test.ts
git commit -m "feat: make voice matching a speaker casting fallback"
```
### Task 6: Add full input/output logging and validation coverage
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\speakerCastingStage.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\validationStage.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\validationStage.test.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.test.ts`
**Step 1: Write the failing tests**
Add tests that assert:
- full raw speaker-casting model output is logged
- normalized `speakers` and `segmentAssignments` are logged
- validation reports:
- missing speaker assignment
- invalid voice id
- voice language mismatch
- low speaker-casting confidence
**Step 2: Run tests to verify they fail**
Run:
`npm.cmd run test -- src/server/subtitleStages/validationStage.test.ts src/server/multiStageSubtitleGeneration.test.ts`
Expected:
FAIL because casting-specific logs and validations do not exist.
**Step 3: Write minimal implementation**
- Add structured full-response logs for `speakerCasting`.
- Extend validation with casting-specific checks.
- Preserve existing validation output shape.
**Step 4: Run tests to verify they pass**
Run:
`npm.cmd run test -- src/server/subtitleStages/validationStage.test.ts src/server/multiStageSubtitleGeneration.test.ts`
Expected:
PASS
**Step 5: Commit**
```bash
git add src/server/subtitleStages/speakerCastingStage.ts src/server/subtitleStages/validationStage.ts src/server/multiStageSubtitleGeneration.ts src/server/subtitleStages/validationStage.test.ts src/server/multiStageSubtitleGeneration.test.ts
git commit -m "feat: add speaker casting diagnostics and validation"
```
### Task 7: Run targeted verification and restart the dev server
**Files:**
- No production file changes required
**Step 1: Run targeted tests**
Run:
```bash
npm.cmd run test -- src/server/subtitleStages/stageTypes.test.ts src/server/subtitleStages/speakerCastingStage.test.ts src/server/subtitleStages/segmentationStage.test.ts src/server/subtitleStages/voiceMatchingStage.test.ts src/server/subtitleStages/validationStage.test.ts src/server/multiStageSubtitleGeneration.test.ts src/server/subtitleGeneration.test.ts src/server/subtitleJobs.test.ts src/services/subtitleService.test.ts src/components/EditorScreen.test.tsx
```
Expected:
PASS
**Step 2: Run lint**
Run:
`npm.cmd run lint`
Expected:
PASS
**Step 3: Restart the dev server**
- Stop the existing listener on port `3000`
- Start the dev server again
- Confirm `http://localhost:3000` is listening
**Step 4: Manual validation**
Use the known failing sample and verify:
- the same real character gets the same `speakerId`
- the same real character gets the same `voiceId`
- the UI shows a stable voice name instead of random per-line drift
- logs show complete speaker-casting request and response payloads
**Step 5: Commit**
```bash
git add .
git commit -m "feat: add episode-level speaker casting stage"
```

View File

@ -0,0 +1,233 @@
# Volcengine ASR Stage-1 Replacement Design
**Date:** 2026-03-20
**Goal**
Replace the current Stage 1 `transcription` agent with Volcengine's flash ASR API using `audio.data` base64 input so original dialogue recognition is based on dedicated ASR instead of a general multimodal model.
## Problem
The current Stage 1 pipeline uses a general model to transcribe dialogue from uploaded media. Even after changing the request to audio-only input, recognition quality is still limited by a model whose primary job is not ASR. When Stage 1 drifts from the real dialogue, the downstream translation and TTS stages faithfully amplify the mistake.
We need a Stage 1 that is optimized for:
- faithful speech recognition
- utterance-level timestamps
- speaker separation
- stable, repeatable audio-only behavior
## Recommendation
Adopt Volcengine's flash ASR API only for Stage 1 and keep the rest of the `4+1` pipeline unchanged:
1. `Transcription` -> Volcengine ASR
2. `Segmentation` -> existing local segmenter
3. `Translation` -> existing LLM translation stage
4. `Voice Matching` -> existing matcher
5. `Validation` -> existing validator
This gives us a better ASR foundation without rewriting the rest of the subtitle stack.
## Why This Fits
- The API is purpose-built for recorded audio recognition.
- It returns utterance-level timing data that maps well to our subtitle model.
- It supports speaker info and gender detection, which match our Stage 1 output shape.
- It accepts `audio.data` as base64, which avoids temporary public URL hosting.
- It returns results in a single request, so Stage 1 is simpler than the standard submit/query flow.
## Scope
### In Scope
- Replace Stage 1 transcription provider for Doubao/Volcengine path with flash ASR API.
- Extract audio from uploaded or temp video before Stage 1.
- Send extracted audio as `audio.data` base64 in a single ASR request.
- Map ASR `utterances` into internal `TranscriptSegment[]`.
- Preserve existing downstream stages and existing editor UI contract.
- Add detailed Stage 1 logs for the ASR request lifecycle and result mapping.
### Out of Scope
- Replacing translation, TTS, or voice matching providers.
- Changing the editor data contract beyond Stage 1 diagnostics.
- Reworking trim/export behavior outside Stage 1 input preparation.
- Automatic fallback to another ASR vendor in this first iteration.
## Current State
Relevant files:
- [transcriptionStage.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/transcriptionStage.ts)
- [multiStageSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/multiStageSubtitleGeneration.ts)
- [subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
- [subtitleService.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/services/subtitleService.ts)
- [stageTypes.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/stageTypes.ts)
Current Stage 1 responsibilities:
- extract audio locally when using Doubao
- call a model endpoint directly
- parse model JSON into `TranscriptSegment[]`
This means recognition logic and model prompt logic are still coupled together in one module.
## Target Architecture
### Stage 1 Flow
1. Receive video file on server.
2. Extract normalized WAV audio using `ffmpeg`.
3. Base64-encode the WAV audio.
4. Send the audio to `recognize/flash`.
5. Map `result.utterances` into `TranscriptSegment[]`.
6. Pass those segments to the existing `segmentation` stage.
### Proposed Modules
- `src/server/subtitleStages/transcriptionStage.ts`
- keep the stage entrypoint
- delegate provider-specific ASR work to a helper
- `src/server/volcengineAsr.ts`
- send flash recognition request
- parse and massage API payloads
## Input and Output Mapping
### ASR Input
Stage 1 should send:
- extracted WAV audio
- language hint when configured
- `show_utterances=true`
- `enable_speaker_info=true`
- `enable_gender_detection=true`
- conservative transcription options
We should avoid options that aggressively rewrite spoken language for readability in Stage 1.
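The request shape implied above can be sketched as a pure builder. The header and body field names here follow our reading of the Volcengine flash ASR schema and should be verified against the vendor documentation before relying on them:

```typescript
interface FlashAsrConfig {
  appKey: string;
  accessKey: string;
  resourceId: string;
  modelName: string;
}

// Build the single-shot flash recognition request. Field names are
// assumptions based on the flash ASR schema, not a verified contract.
function buildFlashAsrRequest(
  config: FlashAsrConfig,
  audioBase64: string,
  requestId: string,
) {
  return {
    headers: {
      'Content-Type': 'application/json',
      'X-Api-App-Key': config.appKey,
      'X-Api-Access-Key': config.accessKey,
      'X-Api-Resource-Id': config.resourceId,
      'X-Api-Request-Id': requestId,
    },
    body: {
      user: { uid: requestId },
      audio: { data: audioBase64, format: 'wav' },
      request: {
        model_name: config.modelName,
        show_utterances: true,
        enable_speaker_info: true,
        enable_gender_detection: true,
        // leave aggressive readability rewrites disabled in Stage 1
      },
    },
  };
}
```

A pure builder like this keeps the request schema testable without any network calls.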
### ASR Output -> Internal Mapping
From ASR:
- `utterances[].text` -> `originalText`
- `utterances[].start_time` -> `startTime`
- `utterances[].end_time` -> `endTime`
- speaker info -> `speaker`, `speakerId`
- gender info -> `gender`
Internal notes:
- If numeric confidence is unavailable, set `confidence` to a safe default and rely on `needsReview` heuristics from other signals.
- If gender is missing, normalize to `unknown`.
- If utterances are missing, fall back to a single full-text segment only as a last resort.
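The mapping and the internal notes above can be sketched together. The `additions` nesting for speaker and gender info is an assumption about the payload shape, and the confidence default is illustrative:

```typescript
interface FlashUtterance {
  text: string;
  start_time: number; // milliseconds
  end_time: number;
  additions?: { speaker?: string; gender?: string }; // assumed nesting
}

interface TranscriptSegment {
  id: string;
  startTime: number; // seconds
  endTime: number;
  originalText: string;
  speaker: string;
  speakerId: string;
  gender: 'male' | 'female' | 'unknown';
  confidence: number;
  needsReview: boolean;
}

// Safe default when the ASR result carries no numeric confidence.
const DEFAULT_CONFIDENCE = 0.9;

function mapUtterancesToSegments(utterances: FlashUtterance[]): TranscriptSegment[] {
  return utterances.map((utterance, index) => {
    const rawSpeaker = utterance.additions?.speaker?.trim();
    // Synthetic per-utterance id; the casting stage resolves it later.
    const speaker = rawSpeaker || `speaker_${index + 1}`;
    const rawGender = utterance.additions?.gender;
    return {
      id: `seg_${index + 1}`,
      startTime: utterance.start_time / 1000,
      endTime: utterance.end_time / 1000,
      originalText: utterance.text,
      speaker,
      speakerId: speaker,
      gender: rawGender === 'male' || rawGender === 'female' ? rawGender : 'unknown',
      confidence: DEFAULT_CONFIDENCE,
      needsReview: false,
    };
  });
}
```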
## Temporary Audio URL Strategy
The flash ASR API supports `audio.data`, so Stage 1 can avoid temporary public URLs entirely.
Recommended implementation:
- extract WAV to a temp file
- read the file into base64
- send the base64 string in `audio.data`
- delete the temp file immediately after extraction
This avoids network reachability issues and is the simplest option for local and server environments.
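The temp-file lifecycle above can be sketched with a `try/finally` so cleanup runs even when extraction or reading fails. `extractWav` stands in for the real ffmpeg helper:

```typescript
import { promises as fs } from 'node:fs';
import os from 'node:os';
import path from 'node:path';

// Extract audio to a temp WAV, read it back as base64, and always delete
// the temp file. `extractWav` is a stand-in for the ffmpeg extraction step.
async function withExtractedAudioBase64(
  extractWav: (outPath: string) => Promise<void>,
): Promise<string> {
  const tempPath = path.join(os.tmpdir(), `asr-${process.pid}-${Date.now()}.wav`);
  try {
    await extractWav(tempPath);
    const wav = await fs.readFile(tempPath);
    return wav.toString('base64');
  } finally {
    await fs.rm(tempPath, { force: true }); // cleanup regardless of success
  }
}
```

`force: true` makes cleanup a no-op when extraction failed before the file existed, so the error surfaced to the caller is the original one.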
## Logging
Add explicit Stage 1 ASR logs:
- request started
- request finished
- result mapping summary
- cleanup status
Log fields should include:
- `requestId`
- `provider`
- `resourceId`
- API endpoint
- temp audio path presence
- base64 audio size
- request duration
- utterance count
We should continue logging summaries by default, not full raw transcripts, unless the user explicitly asks for full payload logging.
## Error Handling
Stage 1 should fail clearly for:
- ffmpeg extraction failure
- ASR request rejection
- ASR service busy response
- malformed ASR result payload
Preferred behavior:
- fail Stage 1
- mark subtitle job as failed
- preserve enough context in logs to identify whether failure happened in extraction, the ASR request, or result mapping
## Configuration
Add environment-driven config for the ASR API:
- app key
- access key
- resource id
- flash URL
- request timeout
- optional language hint
These should be separate from the existing Doubao LLM config because this is no longer the same provider call shape.
## Testing Strategy
### Unit Tests
- audio extraction helper returns WAV base64
- ASR submit request body matches expected schema
- ASR response parsing handles success and API error codes
- ASR result mapping produces valid `TranscriptSegment[]`
### Integration-Level Tests
- `transcriptionStage` uses ASR helper when configured
- `multiStageSubtitleGeneration` still receives valid transcript output
- `subtitleService` keeps frontend contract unchanged
### Regression Focus
- no change to translation or voice matching contracts
- no regression in subtitle job stages and progress messages
- no regression in editor auto-generation flow
## Risks
### 1. Confidence mismatch
If the ASR result does not provide a numeric confidence comparable to the current stage contract, we need a fallback policy. We should not invent precise confidence values.
### 2. Speaker metadata variance
Speaker labels and gender fields may differ from the current Stage 1 output. Downstream code already tolerates `unknown`, but we should normalize carefully.
## Rollout Plan
1. Implement the flash ASR client and mapping behind Stage 1.
2. Keep old transcription path available behind a feature flag or fallback branch during transition.
3. Validate with the known failing sample video.
4. Remove the old direct multimodal transcription path once logs and results are stable.
## Success Criteria
- Stage 1 no longer sends video content to a general model for transcription.
- Logs show flash ASR request lifecycle for each transcription request.
- The known failing sample produces original dialogue closer to the ground truth than the current Stage 1.
- Downstream translation and voice stages continue to work without UI contract changes.

View File

@ -0,0 +1,239 @@
# Volcengine ASR Stage-1 Replacement Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Replace the current Stage 1 transcription agent with Volcengine's flash ASR API using `audio.data` base64 while keeping the rest of the `4+1` subtitle pipeline unchanged.
**Architecture:** The server will extract WAV audio from uploaded video, base64-encode it, send it to Volcengine's `recognize/flash` endpoint, and map the result into the existing `TranscriptSegment[]` shape. `Segmentation`, `Translation`, `Voice Matching`, and `Validation` will continue to use their existing contracts.
**Tech Stack:** Node.js, Express, TypeScript, fluent-ffmpeg, Vitest, existing server job orchestration
---
### Task 1: Add ASR Configuration Surface
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\llmProvider.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\.env.example`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\llmProvider.test.ts`
**Step 1: Write the failing test**
Add a test that expects Volcengine ASR config to resolve from env with:
- app key
- access key
- resource id
- flash endpoint URL
- model name
- timeout
**Step 2: Run test to verify it fails**
Run: `npm.cmd run test -- src/server/llmProvider.test.ts`
Expected: FAIL because ASR config fields are missing.
**Step 3: Write minimal implementation**
Add environment parsing for ASR config without disturbing existing LLM provider resolution.
**Step 4: Run test to verify it passes**
Run: `npm.cmd run test -- src/server/llmProvider.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/llmProvider.ts src/server/llmProvider.test.ts .env.example
git commit -m "feat: add volcengine asr config"
```
### Task 2: Build the Volcengine Flash ASR Client
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\volcengineAsr.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\volcengineAsr.test.ts`
**Step 1: Write the failing test**
Add tests for:
- request header shape
- request body with `audio.data`
- API error code handling
- success result parsing
**Step 2: Run test to verify it fails**
Run: `npm.cmd run test -- src/server/volcengineAsr.test.ts`
Expected: FAIL because the client does not exist.
**Step 3: Write minimal implementation**
Implement the flash recognition request/response helper.
**Step 4: Run test to verify it passes**
Run: `npm.cmd run test -- src/server/volcengineAsr.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/volcengineAsr.ts src/server/volcengineAsr.test.ts
git commit -m "feat: add volcengine flash asr client"
```
### Task 3: Map Flash ASR Result to Transcript Segments
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\transcriptionStage.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\stageTypes.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\transcriptionStage.test.ts`
**Step 1: Write the failing test**
Add a test that feeds a flash ASR result payload with `utterances` and expects normalized `TranscriptSegment[]`.
**Step 2: Run test to verify it fails**
Run: `npm.cmd run test -- src/server/subtitleStages/transcriptionStage.test.ts`
Expected: FAIL because Stage 1 still expects model JSON output.
**Step 3: Write minimal implementation**
Refactor `transcriptionStage.ts` so Stage 1:
- extracts WAV audio
- base64-encodes the audio
- calls `volcengineAsr.ts`
- maps `utterances` into `TranscriptSegment[]`
**Step 4: Run test to verify it passes**
Run: `npm.cmd run test -- src/server/subtitleStages/transcriptionStage.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/subtitleStages/transcriptionStage.ts src/server/subtitleStages/stageTypes.ts src/server/subtitleStages/transcriptionStage.test.ts
git commit -m "feat: switch stage 1 to flash asr"
```
### Task 4: Wire Stage 1 Logging and Cleanup
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\transcriptionStage.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.test.ts`
**Step 1: Write the failing test**
Add assertions that the transcription path logs:
- ASR request start and finish
- API status code failures
- mapped utterance summary
**Step 2: Run test to verify it fails**
Run: `npm.cmd run test -- src/server/multiStageSubtitleGeneration.test.ts`
Expected: FAIL because these log events are not present yet.
**Step 3: Write minimal implementation**
Add structured logs and ensure temp audio cleanup still runs after extraction.
**Step 4: Run test to verify it passes**
Run: `npm.cmd run test -- src/server/multiStageSubtitleGeneration.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/subtitleStages/transcriptionStage.ts src/server/multiStageSubtitleGeneration.ts src/server/multiStageSubtitleGeneration.test.ts
git commit -m "feat: log flash asr stage lifecycle"
```
### Task 5: Keep the Frontend Contract Stable
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleStages\transcriptionStage.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\multiStageSubtitleGeneration.test.ts`
**Step 1: Write the failing test**
Add assertions that the final subtitle payload keeps the editor contract:
- stable segment ids and timing fields
- `originalText`, `translatedText`, `ttsText`, and `voiceId` present
- unchanged progress stage names
**Step 2: Run test to verify it fails**
Run: `npm.cmd run test -- src/server/multiStageSubtitleGeneration.test.ts`
Expected: FAIL if the ASR-derived segments drift from the existing payload shape.
**Step 3: Write minimal implementation**
Normalize Stage 1 output so downstream stages and the editor receive the same fields as before the ASR switch.
**Step 4: Run test to verify it passes**
Run: `npm.cmd run test -- src/server/multiStageSubtitleGeneration.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/subtitleStages/transcriptionStage.ts src/server/multiStageSubtitleGeneration.ts src/server/multiStageSubtitleGeneration.test.ts
git commit -m "feat: keep editor contract stable with flash asr"
```
### Task 6: Run Full Targeted Verification
**Files:**
- No production file changes required
**Step 1: Run targeted server and pipeline tests**
Run:
```bash
npm.cmd run test -- src/server/llmProvider.test.ts src/server/tempAudioStore.test.ts src/server/volcengineAsr.test.ts src/server/subtitleStages/transcriptionStage.test.ts src/server/multiStageSubtitleGeneration.test.ts src/server/subtitleGeneration.test.ts src/services/subtitleService.test.ts src/components/EditorScreen.test.tsx
```
Expected: PASS
**Step 2: Restart the dev server**
Restart the local dev server and confirm port `3000` is listening.
**Step 3: Manual validation**
Use the known failing sample video and verify:
- Stage 1 logs show the flash ASR request lifecycle
- `originalText` is closer to the actual dialogue than before
- downstream translation and dubbing still complete
**Step 4: Commit**
```bash
git add .
git commit -m "feat: replace stage 1 transcription with flash asr"
```

View File

@ -25,6 +25,8 @@ describe('llmProvider', () => {
expect(
resolveLlmProviderConfig('doubao', {
ARK_API_KEY: 'ark-key',
VOLCENGINE_ASR_APP_KEY: 'asr-app-key',
VOLCENGINE_ASR_ACCESS_KEY: 'asr-access-key',
}),
).toEqual({
provider: 'doubao',
@ -32,6 +34,14 @@ describe('llmProvider', () => {
model: DEFAULT_DOUBAO_MODEL,
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
timeoutMs: DEFAULT_DOUBAO_TIMEOUT_MS,
asr: {
appKey: 'asr-app-key',
accessKey: 'asr-access-key',
resourceId: 'volc.bigasr.auc_turbo',
baseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
modelName: 'bigmodel',
timeoutMs: DEFAULT_DOUBAO_TIMEOUT_MS,
},
});
});
@ -39,6 +49,8 @@ describe('llmProvider', () => {
expect(
resolveLlmProviderConfig('doubao', {
ARK_API_KEY: 'ark-key',
VOLCENGINE_ASR_APP_KEY: 'asr-app-key',
VOLCENGINE_ASR_ACCESS_KEY: 'asr-access-key',
DOUBAO_TIMEOUT_MS: '600000',
}),
).toEqual({
@ -47,6 +59,14 @@ describe('llmProvider', () => {
model: DEFAULT_DOUBAO_MODEL,
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
timeoutMs: 600000,
asr: {
appKey: 'asr-app-key',
accessKey: 'asr-access-key',
resourceId: 'volc.bigasr.auc_turbo',
baseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
modelName: 'bigmodel',
timeoutMs: 600000,
},
});
});

View File

@ -2,16 +2,30 @@ export const DEFAULT_LLM_PROVIDER = 'doubao';
export const DEFAULT_DOUBAO_MODEL = 'doubao-seed-2-0-pro-260215';
export const DEFAULT_GEMINI_MODEL = 'gemini-2.5-flash';
export const DEFAULT_DOUBAO_RESPONSES_URL = 'https://ark.cn-beijing.volces.com/api/v3/responses';
export const DEFAULT_VOLCENGINE_ASR_URL =
'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash';
export const DEFAULT_VOLCENGINE_ASR_RESOURCE_ID = 'volc.bigasr.auc_turbo';
export const DEFAULT_VOLCENGINE_ASR_MODEL_NAME = 'bigmodel';
export const DEFAULT_DOUBAO_TIMEOUT_MS = 600000;
export type LlmProvider = 'doubao' | 'gemini';
export interface VolcengineAsrConfig {
appKey: string;
accessKey: string;
resourceId: string;
baseUrl: string;
modelName: string;
timeoutMs: number;
}
export interface DoubaoProviderConfig {
provider: 'doubao';
apiKey: string;
model: string;
baseUrl: string;
timeoutMs: number;
asr: VolcengineAsrConfig;
}
export interface GeminiProviderConfig {
@ -54,12 +68,30 @@ export const resolveLlmProviderConfig = (
throw new Error('ARK_API_KEY is required for Doubao subtitle generation.');
}
const asrAppKey = env.VOLCENGINE_ASR_APP_KEY?.trim();
if (!asrAppKey) {
throw new Error('VOLCENGINE_ASR_APP_KEY is required for Doubao subtitle transcription.');
}
const asrAccessKey = env.VOLCENGINE_ASR_ACCESS_KEY?.trim();
if (!asrAccessKey) {
throw new Error('VOLCENGINE_ASR_ACCESS_KEY is required for Doubao subtitle transcription.');
}
return {
provider,
apiKey,
model: env.DOUBAO_MODEL?.trim() || DEFAULT_DOUBAO_MODEL,
baseUrl: (env.DOUBAO_BASE_URL?.trim() || DEFAULT_DOUBAO_RESPONSES_URL).replace(/\/+$/, ''),
timeoutMs: resolveDoubaoTimeoutMs(env.DOUBAO_TIMEOUT_MS),
asr: {
appKey: asrAppKey,
accessKey: asrAccessKey,
resourceId: env.VOLCENGINE_ASR_RESOURCE_ID?.trim() || DEFAULT_VOLCENGINE_ASR_RESOURCE_ID,
baseUrl: (env.VOLCENGINE_ASR_BASE_URL?.trim() || DEFAULT_VOLCENGINE_ASR_URL).replace(/\/+$/, ''),
modelName: env.VOLCENGINE_ASR_MODEL_NAME?.trim() || DEFAULT_VOLCENGINE_ASR_MODEL_NAME,
timeoutMs: resolveDoubaoTimeoutMs(env.VOLCENGINE_ASR_TIMEOUT_MS || env.DOUBAO_TIMEOUT_MS),
},
};
}

File diff suppressed because it is too large

View File

@ -0,0 +1,537 @@
import fs from 'node:fs';
import path from 'node:path';
import { GoogleGenAI } from '@google/genai';
import { SubtitleGenerationProgress, SubtitlePipelineResult } from '../types';
import { MINIMAX_VOICES } from '../voices';
import { logEvent, serializeError } from './errorLogging';
import { LlmProviderConfig } from './llmProvider';
import {
applySpeakerCastingResultToTranscriptSegments,
buildSpeakerCastingSystemPrompt,
buildSpeakerCastingUserPrompt,
logSpeakerCastingNormalizedOutput,
logSpeakerCastingRawOutput,
normalizeSpeakerCastingResult,
parseSpeakerCastingResponse,
} from './subtitleStages/speakerCastingStage';
import { generateTranscriptFromVideo } from './subtitleStages/transcriptionStage';
import { segmentTranscriptSegments } from './subtitleStages/segmentationStage';
import { translateSubtitleSegments } from './subtitleStages/translationStage';
import { validateSubtitles } from './subtitleStages/validationStage';
import { matchSubtitleVoices, normalizeVoiceLanguageCode } from './subtitleStages/voiceMatchingStage';
import {
SpeakerCastingResult,
SubtitlePipelineStageDurations,
TranscriptSegment,
ValidationIssue,
VoiceMatchedSubtitle,
} from './subtitleStages/stageTypes';
type ProgressStage = Extract<
SubtitleGenerationProgress['stage'],
'preparing' | 'transcribing' | 'speakerCasting' | 'segmenting' | 'translating' | 'matching_voice' | 'validating'
>;
const SAMPLE_LIMIT = 3;
const truncateText = (value?: string, maxLength = 80) => {
const text = value?.trim() || '';
if (text.length <= maxLength) {
return text;
}
return `${text.slice(0, maxLength)}...`;
};
const summarizeTranscriptSegments = (segments: TranscriptSegment[]) => ({
segmentCount: segments.length,
sampleSegments: segments.slice(0, SAMPLE_LIMIT).map((segment) => ({
id: segment.id,
startTime: segment.startTime,
endTime: segment.endTime,
originalText: truncateText(segment.originalText),
speaker: segment.speaker,
speakerId: segment.speakerId,
gender: segment.gender,
confidence: segment.confidence,
needsReview: segment.needsReview,
})),
});
const summarizeTranslatedSegments = (segments: Array<VoiceMatchedSubtitle | any>) => ({
segmentCount: segments.length,
sampleSegments: segments.slice(0, SAMPLE_LIMIT).map((segment) => ({
id: segment.id,
originalText: truncateText(segment.originalText),
translatedText: truncateText(segment.translatedText),
ttsText: truncateText(segment.ttsText),
ttsLanguage: segment.ttsLanguage,
speakerId: segment.speakerId,
voiceId: segment.voiceId,
})),
});
const summarizeValidationIssues = (issues: ValidationIssue[]) => ({
issueCount: issues.length,
sampleIssues: issues.slice(0, SAMPLE_LIMIT).map((issue) => ({
subtitleId: issue.subtitleId,
code: issue.code,
severity: issue.severity,
message: truncateText(issue.message),
})),
});
const summarizeSpeakerCastingResult = (result: SpeakerCastingResult) => ({
speakerCount: result.speakers.length,
assignmentCount: result.segmentAssignments.length,
sampleSpeakers: result.speakers.slice(0, SAMPLE_LIMIT).map((speaker) => ({
speakerId: speaker.speakerId,
label: speaker.label,
gender: speaker.gender,
voiceId: speaker.voiceId,
confidence: speaker.confidence,
})),
sampleAssignments: result.segmentAssignments.slice(0, SAMPLE_LIMIT).map((assignment) => ({
segmentId: assignment.segmentId,
speakerId: assignment.speakerId,
})),
});
const reportProgress = (
onProgress: ((progress: Omit<SubtitleGenerationProgress, 'jobId' | 'requestId'>) => void) | undefined,
stage: ProgressStage,
) => {
if (!onProgress) {
return;
}
const details: Record<ProgressStage, { progress: number; message: string }> = {
preparing: { progress: 30, message: 'Preparing subtitle generation' },
transcribing: { progress: 35, message: 'Transcribing dialogue' },
segmenting: { progress: 45, message: 'Segmenting subtitles' },
speakerCasting: { progress: 55, message: 'Casting speakers' },
translating: { progress: 70, message: 'Translating subtitles' },
matching_voice: { progress: 85, message: 'Matching subtitle voices' },
validating: { progress: 95, message: 'Validating subtitle result' },
};
onProgress({
status: 'running',
stage,
progress: details[stage].progress,
message: details[stage].message,
});
};
const buildSpeakerTracks = (segments: TranscriptSegment[]) => {
const seen = new Map<string, { speakerId: string; label: string; gender?: 'male' | 'female' | 'unknown' }>();
for (const segment of segments) {
if (!seen.has(segment.speakerId)) {
seen.set(segment.speakerId, {
speakerId: segment.speakerId,
label: segment.speaker || segment.speakerId,
...(segment.gender ? { gender: segment.gender } : {}),
});
}
}
return Array.from(seen.values());
};
const buildFinalSubtitles = (subtitles: VoiceMatchedSubtitle[]) =>
subtitles.map((subtitle) => ({
id: subtitle.id,
startTime: subtitle.startTime,
endTime: subtitle.endTime,
originalText: subtitle.originalText,
translatedText: subtitle.translatedText,
ttsText: subtitle.ttsText,
ttsLanguage: subtitle.ttsLanguage,
speaker: subtitle.speaker || subtitle.speakerId,
speakerId: subtitle.speakerId,
words: [],
confidence: subtitle.confidence ?? 0,
voiceId: subtitle.voiceId,
}));
const deriveQuality = (subtitles: VoiceMatchedSubtitle[], issues: ValidationIssue[]) => {
if (subtitles.length === 0) {
return 'fallback' as const;
}
if (issues.length > 0 || subtitles.some((subtitle) => subtitle.needsReview)) {
return 'partial' as const;
}
return 'full' as const;
};
const filterVoiceCatalogForTtsLanguage = (ttsLanguage: string) => {
const languageCode = normalizeVoiceLanguageCode(ttsLanguage);
const matchingVoices = MINIMAX_VOICES.filter((voice) => normalizeVoiceLanguageCode(voice.language) === languageCode);
return matchingVoices.length > 0 ? matchingVoices : MINIMAX_VOICES;
};
const extractDoubaoTextOutput = (payload: any): string => {
const output = Array.isArray(payload?.output) ? payload.output : [];
const parts = output.flatMap((item: any) => {
if (!Array.isArray(item?.content)) {
return [];
}
return item.content
.map((part: any) => (typeof part?.text === 'string' ? part.text : ''))
.filter(Boolean);
});
return parts.join('').trim();
};
const resolveSpeakerCastingVideoMimeType = (videoPath: string) => {
const extension = path.extname(videoPath).toLowerCase();
switch (extension) {
case '.mov':
return 'video/quicktime';
case '.webm':
return 'video/webm';
case '.mp4':
default:
return 'video/mp4';
}
};
const buildSpeakerCastingVideoInput = (videoPath?: string, fileId?: string) => {
if (fileId) {
return { type: 'input_video', file_id: fileId };
}
if (!videoPath || !fs.existsSync(videoPath)) {
throw new Error(
'Speaker casting requires a readable video file path or fileId when using Doubao.',
);
}
const videoBase64 = fs.readFileSync(videoPath).toString('base64');
return {
type: 'input_video',
video_url: `data:${resolveSpeakerCastingVideoMimeType(videoPath)};base64,${videoBase64}`,
};
};
const defaultSpeakerCastingStage = async ({
providerConfig,
transcriptSegments,
ttsLanguage,
videoPath,
fileId,
requestId,
fetchImpl = fetch,
}: {
providerConfig: LlmProviderConfig;
transcriptSegments: TranscriptSegment[];
ttsLanguage: string;
videoPath?: string;
fileId?: string;
requestId?: string;
fetchImpl?: typeof fetch;
}): Promise<SpeakerCastingResult> => {
if (transcriptSegments.length === 0) {
return {
speakers: [],
segmentAssignments: [],
};
}
const availableVoices = filterVoiceCatalogForTtsLanguage(ttsLanguage);
const systemPrompt = buildSpeakerCastingSystemPrompt();
const userPrompt = buildSpeakerCastingUserPrompt({
ttsLanguage,
transcriptSegments,
availableVoices,
});
if (providerConfig.provider === 'doubao') {
const speakerCastingVideoInput = buildSpeakerCastingVideoInput(videoPath, fileId);
const response = await fetchImpl(providerConfig.baseUrl, {
method: 'POST',
signal: AbortSignal.timeout(providerConfig.timeoutMs),
headers: {
Authorization: `Bearer ${providerConfig.apiKey}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: providerConfig.model,
input: [
{
role: 'system',
content: [{ type: 'input_text', text: systemPrompt }],
},
{
role: 'user',
content: [
...(speakerCastingVideoInput ? [speakerCastingVideoInput] : []),
{ type: 'input_text', text: userPrompt },
],
},
],
}),
});
if (!response.ok) {
throw new Error(`Doubao speaker casting request failed (${response.status}).`);
}
const rawModelOutputText = extractDoubaoTextOutput(await response.json());
logSpeakerCastingRawOutput({
requestId,
provider: providerConfig.provider,
ttsLanguage,
rawModelOutputText,
});
const normalizedOutput = normalizeSpeakerCastingResult(parseSpeakerCastingResponse(rawModelOutputText));
logSpeakerCastingNormalizedOutput({
requestId,
provider: providerConfig.provider,
ttsLanguage,
normalizedOutput,
});
return normalizedOutput;
}
const ai = new GoogleGenAI({ apiKey: providerConfig.apiKey });
const response = await ai.models.generateContent({
model: providerConfig.model,
contents: [
{
role: 'user',
parts: [
{
text: `${systemPrompt}\n\n${userPrompt}`,
},
],
},
],
});
const rawModelOutputText = response.text || '';
logSpeakerCastingRawOutput({
requestId,
provider: providerConfig.provider,
ttsLanguage,
rawModelOutputText,
});
const normalizedOutput = normalizeSpeakerCastingResult(parseSpeakerCastingResponse(rawModelOutputText));
logSpeakerCastingNormalizedOutput({
requestId,
provider: providerConfig.provider,
ttsLanguage,
normalizedOutput,
});
return normalizedOutput;
};
export const generateMultiStageSubtitles = async ({
videoPath,
fileId,
targetLanguage,
ttsLanguage,
providerConfig,
fetchImpl,
requestId,
onProgress,
deps,
}: {
videoPath?: string;
fileId?: string;
targetLanguage: string;
ttsLanguage: string;
providerConfig: LlmProviderConfig;
fetchImpl?: typeof fetch;
requestId?: string;
onProgress?: (progress: Omit<SubtitleGenerationProgress, 'jobId' | 'requestId'>) => void;
deps?: {
generateTranscriptFromVideo?: typeof generateTranscriptFromVideo;
speakerCastingStage?: typeof defaultSpeakerCastingStage;
segmentTranscriptSegments?: typeof segmentTranscriptSegments;
translateSubtitleSegments?: typeof translateSubtitleSegments;
matchSubtitleVoices?: typeof matchSubtitleVoices;
validateSubtitles?: typeof validateSubtitles;
};
}): Promise<SubtitlePipelineResult> => {
const stageDurationsMs: SubtitlePipelineStageDurations = {};
const runStage = async <T,>(
stageKey: keyof SubtitlePipelineStageDurations,
inputSummary: Record<string, unknown>,
summarizeOutput: (value: T) => Record<string, unknown>,
runner: () => Promise<T> | T,
) => {
logEvent({
level: 'info',
message: '[subtitle] stage started',
context: {
requestId,
provider: providerConfig.provider,
stage: stageKey,
},
details: {
input: inputSummary,
},
});
const startedAt = Date.now();
try {
const value = await runner();
const durationMs = Date.now() - startedAt;
stageDurationsMs[stageKey] = durationMs;
logEvent({
level: 'info',
message: '[subtitle] stage finished',
context: {
requestId,
provider: providerConfig.provider,
stage: stageKey,
durationMs,
},
details: {
input: inputSummary,
output: summarizeOutput(value),
},
});
return value;
} catch (error) {
const durationMs = Date.now() - startedAt;
logEvent({
level: 'error',
message: '[subtitle] stage failed',
context: {
requestId,
provider: providerConfig.provider,
stage: stageKey,
durationMs,
},
details: {
input: inputSummary,
error: serializeError(error),
},
});
throw error;
}
};
reportProgress(onProgress, 'preparing');
const generateTranscript = deps?.generateTranscriptFromVideo || generateTranscriptFromVideo;
const speakerCaster = deps?.speakerCastingStage || defaultSpeakerCastingStage;
const segmenter = deps?.segmentTranscriptSegments || segmentTranscriptSegments;
const translator = deps?.translateSubtitleSegments || translateSubtitleSegments;
const voiceMatcher = deps?.matchSubtitleVoices || matchSubtitleVoices;
const validator = deps?.validateSubtitles || validateSubtitles;
reportProgress(onProgress, 'transcribing');
const transcript = await runStage(
'transcription',
{
hasVideoPath: Boolean(videoPath),
hasFileId: Boolean(fileId),
targetLanguage,
ttsLanguage,
},
(value) => ({
sourceLanguage: value.sourceLanguage,
...summarizeTranscriptSegments(value.segments),
}),
() =>
generateTranscript({
providerConfig,
videoPath,
fileId,
fetchImpl,
requestId,
}),
);
reportProgress(onProgress, 'segmenting');
const segmented = await runStage(
'segmentation',
summarizeTranscriptSegments(transcript.segments),
(value) => summarizeTranscriptSegments(value),
() => segmenter(transcript.segments),
);
reportProgress(onProgress, 'speakerCasting');
const speakerCasting = await runStage(
'speakerCasting',
{
videoPath,
hasFileId: Boolean(fileId),
ttsLanguage,
availableVoiceCount: filterVoiceCatalogForTtsLanguage(ttsLanguage).length,
...summarizeTranscriptSegments(segmented),
},
(value) => summarizeSpeakerCastingResult(value),
() =>
speakerCaster({
providerConfig,
transcriptSegments: segmented,
ttsLanguage,
videoPath,
fileId,
requestId,
...(fetchImpl ? { fetchImpl } : {}),
}),
);
const castedTranscriptSegments = applySpeakerCastingResultToTranscriptSegments(segmented, speakerCasting);
reportProgress(onProgress, 'translating');
const translated = await runStage(
'translation',
{
ttsLanguage,
...summarizeTranscriptSegments(castedTranscriptSegments),
},
(value) => summarizeTranslatedSegments(value),
() =>
translator({
providerConfig,
segments: castedTranscriptSegments,
ttsLanguage,
fetchImpl,
}),
);
reportProgress(onProgress, 'matching_voice');
const voiceMatched = await runStage(
'voiceMatching',
summarizeTranslatedSegments(translated),
(value) => summarizeTranslatedSegments(value),
() => voiceMatcher(translated),
);
reportProgress(onProgress, 'validating');
const validationIssues = await runStage(
'validation',
summarizeTranslatedSegments(voiceMatched),
(value) => summarizeValidationIssues(value),
() => validator(voiceMatched, { speakerCasting }),
);
return {
subtitles: buildFinalSubtitles(voiceMatched),
speakers: buildSpeakerTracks(castedTranscriptSegments),
quality: deriveQuality(voiceMatched, validationIssues),
sourceLanguage: transcript.sourceLanguage,
targetLanguage,
ttsLanguage,
duration: voiceMatched.length > 0 ? voiceMatched[voiceMatched.length - 1].endTime : 0,
alignmentEngine: `multi-stage-${providerConfig.provider}`,
diagnostics: {
validationIssues,
speakerCasting,
stageDurationsMs,
},
};
};

View File

@@ -17,6 +17,14 @@ describe('createSentenceTranslator', () => {
model: 'doubao-seed-2-0-lite-260215',
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
timeoutMs: 600000,
asr: {
appKey: 'asr-app-key',
accessKey: 'asr-access-key',
resourceId: 'volc.bigasr.auc_turbo',
baseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
modelName: 'bigmodel',
timeoutMs: 600000,
},
});
expect(translator).toBe('doubao-translator');

View File

@@ -10,7 +10,7 @@ describe('generateSubtitlePipeline', () => {
quality: 'fallback',
targetLanguage: 'English',
};
const generateSubtitlesFromVideo = vi.fn(async () => subtitleResult);
const generateMultiStageSubtitles = vi.fn(async () => subtitleResult);
await generateSubtitlePipeline({
videoPath: 'clip.mp4',
@@ -21,11 +21,11 @@ describe('generateSubtitlePipeline', () => {
ARK_API_KEY: 'ark-key',
},
deps: {
generateSubtitlesFromVideo,
generateMultiStageSubtitles,
},
});
expect(generateSubtitlesFromVideo).toHaveBeenCalledWith(
expect(generateMultiStageSubtitles).toHaveBeenCalledWith(
expect.objectContaining({
videoPath: 'clip.mp4',
targetLanguage: 'English',
@@ -45,7 +45,7 @@ describe('generateSubtitlePipeline', () => {
quality: 'fallback',
targetLanguage: 'English',
};
const generateSubtitlesFromVideo = vi.fn(async () => subtitleResult);
const generateMultiStageSubtitles = vi.fn(async () => subtitleResult);
await generateSubtitlePipeline({
videoPath: 'clip.mp4',
@@ -53,13 +53,15 @@ describe('generateSubtitlePipeline', () => {
env: {
DEFAULT_LLM_PROVIDER: 'doubao',
ARK_API_KEY: 'ark-key',
VOLCENGINE_ASR_APP_KEY: 'asr-app-key',
VOLCENGINE_ASR_ACCESS_KEY: 'asr-access-key',
},
deps: {
generateSubtitlesFromVideo,
generateMultiStageSubtitles,
},
});
expect(generateSubtitlesFromVideo).toHaveBeenCalledWith(
expect(generateMultiStageSubtitles).toHaveBeenCalledWith(
expect.objectContaining({
providerConfig: {
provider: 'doubao',
@@ -67,6 +69,14 @@ describe('generateSubtitlePipeline', () => {
model: 'doubao-seed-2-0-pro-260215',
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
timeoutMs: 600000,
asr: {
appKey: 'asr-app-key',
accessKey: 'asr-access-key',
resourceId: 'volc.bigasr.auc_turbo',
baseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
modelName: 'bigmodel',
timeoutMs: 600000,
},
},
}),
);
@@ -79,7 +89,7 @@ describe('generateSubtitlePipeline', () => {
quality: 'fallback',
targetLanguage: 'English',
};
const generateSubtitlesFromVideo = vi.fn(async () => subtitleResult);
const generateMultiStageSubtitles = vi.fn(async () => subtitleResult);
const fetchImpl = vi.fn<typeof fetch>();
await generateSubtitlePipeline({
@@ -88,15 +98,17 @@ describe('generateSubtitlePipeline', () => {
provider: 'doubao',
env: {
ARK_API_KEY: 'ark-key',
VOLCENGINE_ASR_APP_KEY: 'asr-app-key',
VOLCENGINE_ASR_ACCESS_KEY: 'asr-access-key',
},
fetchImpl,
requestId: 'req-123',
deps: {
generateSubtitlesFromVideo,
generateMultiStageSubtitles,
},
});
expect(generateSubtitlesFromVideo).toHaveBeenCalledWith(
expect(generateMultiStageSubtitles).toHaveBeenCalledWith(
expect.objectContaining({
fetchImpl,
requestId: 'req-123',
@@ -111,7 +123,7 @@ describe('generateSubtitlePipeline', () => {
quality: 'fallback',
targetLanguage: 'English',
};
const generateSubtitlesFromVideo = vi.fn(async () => subtitleResult);
const generateMultiStageSubtitles = vi.fn(async () => subtitleResult);
await generateSubtitlePipeline({
fileId: 'file-123',
@@ -119,13 +131,15 @@ describe('generateSubtitlePipeline', () => {
provider: 'doubao',
env: {
ARK_API_KEY: 'ark-key',
VOLCENGINE_ASR_APP_KEY: 'asr-app-key',
VOLCENGINE_ASR_ACCESS_KEY: 'asr-access-key',
},
deps: {
generateSubtitlesFromVideo,
generateMultiStageSubtitles,
},
});
expect(generateSubtitlesFromVideo).toHaveBeenCalledWith(
expect(generateMultiStageSubtitles).toHaveBeenCalledWith(
expect.objectContaining({
fileId: 'file-123',
videoPath: undefined,
@@ -140,7 +154,7 @@ describe('generateSubtitlePipeline', () => {
quality: 'fallback',
targetLanguage: 'English',
};
const generateSubtitlesFromVideo = vi.fn(async () => subtitleResult);
const generateMultiStageSubtitles = vi.fn(async () => subtitleResult);
await generateSubtitlePipeline({
videoPath: 'clip.mp4',
@@ -149,13 +163,15 @@ describe('generateSubtitlePipeline', () => {
provider: 'doubao',
env: {
ARK_API_KEY: 'ark-key',
VOLCENGINE_ASR_APP_KEY: 'asr-app-key',
VOLCENGINE_ASR_ACCESS_KEY: 'asr-access-key',
},
deps: {
generateSubtitlesFromVideo,
generateMultiStageSubtitles,
},
} as any);
expect(generateSubtitlesFromVideo).toHaveBeenCalledWith(
expect(generateMultiStageSubtitles).toHaveBeenCalledWith(
expect.objectContaining({
targetLanguage: 'English',
ttsLanguage: 'fr',

View File

@@ -2,7 +2,7 @@ import { SubtitleGenerationProgress } from '../types';
import { resolveAudioPipelineConfig } from './audioPipelineConfig';
import { logEvent, serializeError } from './errorLogging';
import { resolveLlmProviderConfig, normalizeLlmProvider } from './llmProvider';
import { generateSubtitlesFromVideo as defaultGenerateSubtitlesFromVideo } from './videoSubtitleGeneration';
import { generateMultiStageSubtitles as defaultGenerateMultiStageSubtitles } from './multiStageSubtitleGeneration';
export interface GenerateSubtitlePipelineOptions {
videoPath?: string;
@@ -15,7 +15,7 @@ export interface GenerateSubtitlePipelineOptions {
requestId?: string;
onProgress?: (progress: Omit<SubtitleGenerationProgress, 'jobId' | 'requestId'>) => void;
deps?: {
generateSubtitlesFromVideo?: typeof defaultGenerateSubtitlesFromVideo;
generateMultiStageSubtitles?: typeof defaultGenerateMultiStageSubtitles;
};
}
@@ -41,8 +41,9 @@ export const generateSubtitlePipeline = async ({
? normalizeLlmProvider(provider)
: audioPipelineConfig.defaultProvider;
const providerConfig = resolveLlmProviderConfig(selectedProvider, env);
const generateSubtitlesFromVideo =
deps?.generateSubtitlesFromVideo || defaultGenerateSubtitlesFromVideo;
const generateMultiStageSubtitles =
deps?.generateMultiStageSubtitles || defaultGenerateMultiStageSubtitles;
const resolvedTtsLanguage = ttsLanguage?.trim() || targetLanguage;
onProgress?.({
status: 'running',
@@ -59,34 +60,22 @@ export const generateSubtitlePipeline = async ({
requestedProvider: provider || undefined,
selectedProvider,
targetLanguage,
ttsLanguage: resolvedTtsLanguage,
hasVideoPath: Boolean(videoPath),
hasFileId: Boolean(fileId),
},
});
try {
onProgress?.({
status: 'running',
stage: 'calling_provider',
progress: 70,
message: 'Calling subtitle provider',
});
const result = await generateSubtitlesFromVideo({
const result = await generateMultiStageSubtitles({
providerConfig,
videoPath,
fileId,
targetLanguage,
ttsLanguage,
requestId,
ttsLanguage: resolvedTtsLanguage,
...(fetchImpl ? { fetchImpl } : {}),
});
onProgress?.({
status: 'running',
stage: 'processing_result',
progress: 90,
message: 'Processing subtitle result',
requestId,
onProgress,
});
logEvent({

View File

@@ -3,6 +3,7 @@ import {
createSubtitleJobStore,
createSubtitleJob,
pruneExpiredSubtitleJobs,
updateSubtitleJob,
toSubtitleJobResponse,
} from './subtitleJobs';
@@ -70,4 +71,24 @@ describe('subtitleJobs', () => {
pollTimeoutMs: 900000,
});
});
it('reports speaker casting progress with the expected stage defaults', () => {
const store = createSubtitleJobStore();
const job = createSubtitleJob(store, {
requestId: 'req-1',
provider: 'doubao',
targetLanguage: 'English',
});
const updated = updateSubtitleJob(store, job.id, {
status: 'running',
stage: 'speakerCasting',
});
expect(updated).toMatchObject({
stage: 'speakerCasting',
progress: 40,
message: 'Casting speakers',
});
});
});

View File

@@ -4,6 +4,12 @@ const DEFAULT_PROGRESS_BY_STAGE: Record<SubtitleJobStage, number> = {
queued: 5,
upload_received: 15,
preparing: 30,
transcribing: 35,
speakerCasting: 40,
segmenting: 50,
translating: 70,
matching_voice: 85,
validating: 95,
calling_provider: 70,
processing_result: 90,
succeeded: 100,
@@ -56,6 +62,18 @@ const defaultMessageForStage = (stage: SubtitleJobStage) => {
return 'Upload received';
case 'preparing':
return 'Preparing subtitle generation';
case 'transcribing':
return 'Transcribing dialogue';
case 'speakerCasting':
return 'Casting speakers';
case 'segmenting':
return 'Segmenting subtitles';
case 'translating':
return 'Translating subtitles';
case 'matching_voice':
return 'Matching subtitle voices';
case 'validating':
return 'Validating subtitle result';
case 'calling_provider':
return 'Calling subtitle provider';
case 'processing_result':

View File

@@ -0,0 +1,66 @@
import { describe, expect, it } from 'vitest';
import { segmentTranscriptSegments } from './segmentationStage';
describe('segmentationStage', () => {
it('splits long transcript segments on strong pauses without paraphrasing', () => {
const segmented = segmentTranscriptSegments([
{
id: 'segment-1',
startTime: 0,
endTime: 4,
originalText: '这么迟才到。你是不是又忘了?',
speakerId: 'speaker-1',
speaker: 'Young Woman',
gender: 'female',
confidence: 0.92,
voiceId: 'voice-1',
needsReview: false,
},
]);
expect(segmented).toEqual([
expect.objectContaining({
id: 'segment-1-part-1',
originalText: '这么迟才到。',
speakerId: 'speaker-1',
speaker: 'Young Woman',
gender: 'female',
voiceId: 'voice-1',
}),
expect.objectContaining({
id: 'segment-1-part-2',
originalText: '你是不是又忘了?',
speakerId: 'speaker-1',
speaker: 'Young Woman',
gender: 'female',
voiceId: 'voice-1',
}),
]);
expect(segmented[0].startTime).toBe(0);
expect(segmented[0].endTime).toBeLessThanOrEqual(segmented[1].startTime);
expect(segmented[1].endTime).toBe(4);
});
it('preserves short transcript segments as-is', () => {
const segmented = segmentTranscriptSegments([
{
id: 'segment-2',
startTime: 4,
endTime: 5.5,
originalText: '我真的不是故意的',
speakerId: 'speaker-2',
confidence: 0.88,
},
]);
expect(segmented).toEqual([
expect.objectContaining({
id: 'segment-2',
startTime: 4,
endTime: 5.5,
originalText: '我真的不是故意的',
speakerId: 'speaker-2',
}),
]);
});
});

View File

@@ -0,0 +1,49 @@
import { SegmentedSubtitle, TranscriptSegment } from './stageTypes';
// Split after strong sentence-final punctuation, keeping the delimiter with the preceding part.
const STRONG_BREAK_SPLIT_REGEX = /(?<=[。！？!?])/u;
const roundToMilliseconds = (value: number) => Math.round(value * 1000) / 1000;
const inheritSpeakerMetadata = (segment: TranscriptSegment) => ({
speakerId: segment.speakerId,
...(typeof segment.speaker === 'string' ? { speaker: segment.speaker } : {}),
...(typeof segment.gender === 'string' ? { gender: segment.gender } : {}),
...(typeof segment.voiceId === 'string' ? { voiceId: segment.voiceId } : {}),
});
const splitOnStrongBreaks = (text: string) =>
text
.split(STRONG_BREAK_SPLIT_REGEX)
.map((part) => part.trim())
.filter(Boolean);
export const segmentTranscriptSegments = (segments: TranscriptSegment[]): SegmentedSubtitle[] =>
segments.flatMap((segment) => {
const parts = splitOnStrongBreaks(segment.originalText);
if (parts.length <= 1) {
return [{ ...segment, ...inheritSpeakerMetadata(segment) }];
}
const totalDuration = Math.max(0, segment.endTime - segment.startTime);
const totalCharacters = parts.reduce((sum, part) => sum + part.length, 0) || parts.length;
let currentStart = segment.startTime;
return parts.map((part, index) => {
const isLast = index === parts.length - 1;
const ratio = part.length / totalCharacters;
const proposedEnd = isLast ? segment.endTime : currentStart + totalDuration * ratio;
const nextEnd = roundToMilliseconds(Math.max(currentStart, proposedEnd));
const item: SegmentedSubtitle = {
...segment,
id: `${segment.id}-part-${index + 1}`,
startTime: roundToMilliseconds(currentStart),
endTime: isLast ? segment.endTime : nextEnd,
originalText: part,
...inheritSpeakerMetadata(segment),
};
currentStart = item.endTime;
return item;
});
});

View File

@@ -0,0 +1,269 @@
import { describe, expect, it } from 'vitest';
import { Voice } from '../../types';
import type { SpeakerCastingRawResponse } from './speakerCastingStage';
import type { SpeakerCastingResult } from './stageTypes';
import {
applySpeakerCastingResultToTranscriptSegments,
buildSpeakerCastingSystemPrompt,
buildSpeakerCastingUserPrompt,
normalizeSpeakerCastingResult,
parseSpeakerCastingResponse,
} from './speakerCastingStage';
describe('speakerCastingStage', () => {
it('builds a prompt that only asks for speaker clustering, gender inference, and voice selection', () => {
const systemPrompt = buildSpeakerCastingSystemPrompt();
expect(systemPrompt).toContain('speaker clustering');
expect(systemPrompt).toContain('gender inference');
expect(systemPrompt).toContain('voice selection');
expect(systemPrompt).toContain('"speakers"');
expect(systemPrompt).toContain('"segmentAssignments"');
expect(systemPrompt).not.toContain('translate');
expect(systemPrompt).not.toContain('re-transcribe');
expect(systemPrompt).not.toContain('rewrite');
});
it('treats transcript text as read-only input in the user prompt', () => {
const availableVoices = [
{
id: 'voice-1',
name: 'Voice 1',
tag: 'v1',
avatar: 'avatar-1',
gender: 'female',
language: 'en',
},
] satisfies Voice[];
const userPrompt = buildSpeakerCastingUserPrompt({
ttsLanguage: 'en',
transcriptSegments: [
{
id: 'segment-1',
startTime: 0,
endTime: 1,
originalText: 'Do not change this line',
speakerId: 'speaker-1',
speaker: 'Speaker 1',
},
],
availableVoices,
});
expect(userPrompt).toContain('read-only');
expect(userPrompt).toContain('"originalText": "Do not change this line"');
expect(userPrompt).toContain('Voice 1');
expect(userPrompt).not.toContain('rewrite the transcript');
});
it('rejects invalid JSON and incomplete results safely', () => {
expect(() => parseSpeakerCastingResponse('not json')).toThrow();
expect(() => parseSpeakerCastingResponse('{"speakers":[]}')).toThrow();
expect(() => parseSpeakerCastingResponse('{"speakers":[],"segmentAssignments":[],"extra":1}')).toThrow();
expect(() =>
normalizeSpeakerCastingResult({
speakers: [
{
speakerId: 'speaker-a',
label: 'Lead',
gender: 'female',
voiceId: 'voice-a',
},
],
} as any),
).toThrow();
});
it('rejects segment assignments that reference missing speakers', () => {
expect(() =>
normalizeSpeakerCastingResult({
speakers: [
{
speakerId: 'speaker-a',
label: 'Lead',
gender: 'female',
voiceId: 'voice-a',
},
],
segmentAssignments: [
{
segmentId: 'segment-1',
speakerId: 'speaker-b',
},
],
} as SpeakerCastingRawResponse),
).toThrow(/unknown speakerId/i);
});
it('rejects duplicate speaker IDs and duplicate segment assignments', () => {
expect(() =>
normalizeSpeakerCastingResult({
speakers: [
{
speakerId: 'speaker-a',
label: 'Lead',
gender: 'female',
},
{
speakerId: 'speaker-a',
label: 'Lead 2',
gender: 'female',
},
],
segmentAssignments: [
{
segmentId: 'segment-1',
speakerId: 'speaker-a',
},
],
} as SpeakerCastingRawResponse),
).toThrow(/duplicate speaker IDs/i);
expect(() =>
normalizeSpeakerCastingResult({
speakers: [
{
speakerId: 'speaker-a',
label: 'Lead',
gender: 'female',
voiceId: 'voice-a',
},
],
segmentAssignments: [
{
segmentId: 'segment-1',
speakerId: 'speaker-a',
},
{
segmentId: 'segment-1',
speakerId: 'speaker-a',
},
],
} as SpeakerCastingRawResponse),
).toThrow(/duplicate segment assignment/i);
});
it('normalizes canonical speakers and assignments', () => {
const normalized = normalizeSpeakerCastingResult({
speakers: [
{
speakerId: ' speaker-a ',
label: ' Lead ',
gender: 'female',
voiceId: ' voice-1 ',
confidence: '0.8',
reason: 'internal note',
},
],
segmentAssignments: [
{
segmentId: ' segment-1 ',
speakerId: ' speaker-a ',
},
],
} as any);
expect(normalized).toEqual({
speakers: [
{
speakerId: 'speaker-a',
label: 'Lead',
gender: 'female',
voiceId: 'voice-1',
confidence: 0.8,
},
],
segmentAssignments: [
{
segmentId: 'segment-1',
speakerId: 'speaker-a',
},
],
});
});
it('applies canonical speaker data back onto transcript segments', () => {
const updated = applySpeakerCastingResultToTranscriptSegments(
[
{
id: 'segment-1',
startTime: 0,
endTime: 1,
originalText: 'Hello there',
speakerId: 'speaker-1',
speaker: 'Speaker 1',
gender: 'unknown',
},
],
{
speakers: [
{
speakerId: 'speaker-a',
label: 'Lead',
gender: 'female',
voiceId: 'voice-a',
},
],
segmentAssignments: [
{
segmentId: 'segment-1',
speakerId: 'speaker-a',
},
],
} as SpeakerCastingResult,
);
expect(updated).toEqual([
expect.objectContaining({
id: 'segment-1',
speakerId: 'speaker-a',
speaker: 'Lead',
gender: 'female',
voiceId: 'voice-a',
}),
]);
});
it('drops stale voiceId when the canonical speaker has no voiceId', () => {
const updated = applySpeakerCastingResultToTranscriptSegments(
[
{
id: 'segment-1',
startTime: 0,
endTime: 1,
originalText: 'Hello there',
speakerId: 'speaker-1',
speaker: 'Speaker 1',
gender: 'unknown',
voiceId: 'stale-voice',
},
],
{
speakers: [
{
speakerId: 'speaker-a',
label: 'Lead',
gender: 'female',
},
],
segmentAssignments: [
{
segmentId: 'segment-1',
speakerId: 'speaker-a',
},
],
} as SpeakerCastingResult,
);
expect(updated).toEqual([
expect.objectContaining({
id: 'segment-1',
speakerId: 'speaker-a',
speaker: 'Lead',
gender: 'female',
}),
]);
expect(updated[0]).not.toHaveProperty('voiceId');
});
});

View File

@@ -0,0 +1,377 @@
import type { Voice } from '../../types';
import { logEvent } from '../errorLogging';
import type {
CanonicalSpeaker,
SpeakerAssignment,
SpeakerCastingResult,
StageSpeakerGender,
TranscriptSegment,
} from './stageTypes';
export interface SpeakerCastingRawSpeaker {
speakerId?: string;
label?: string;
gender?: string;
voiceId?: string;
confidence?: number | string;
reason?: string;
}
export interface SpeakerCastingRawAssignment {
segmentId?: string;
speakerId?: string;
}
export interface SpeakerCastingRawResponse {
speakers: SpeakerCastingRawSpeaker[];
segmentAssignments: SpeakerCastingRawAssignment[];
}
export interface SpeakerCastingLogContext {
requestId?: string;
provider?: string;
ttsLanguage?: string;
}
const stripJsonFences = (text: string) => text.replace(/```json\n?|\n?```/g, '').trim();
const extractBalancedJsonBlock = (text: string) => {
const startIndexes = [text.indexOf('{'), text.indexOf('[')].filter((index) => index >= 0);
const start = startIndexes.length > 0 ? Math.min(...startIndexes) : -1;
if (start < 0) {
return text;
}
const opening = text[start];
const closing = opening === '{' ? '}' : ']';
let depth = 0;
let inString = false;
let escaped = false;
for (let index = start; index < text.length; index += 1) {
const char = text[index];
if (inString) {
if (escaped) {
escaped = false;
} else if (char === '\\') {
escaped = true;
} else if (char === '"') {
inString = false;
}
continue;
}
if (char === '"') {
inString = true;
continue;
}
if (char === opening) {
depth += 1;
continue;
}
if (char === closing) {
depth -= 1;
if (depth === 0) {
return text.slice(start, index + 1);
}
}
}
return text.slice(start);
};
const normalizeGender = (value: unknown): StageSpeakerGender => {
if (typeof value !== 'string') {
return 'unknown';
}
const normalized = value.trim().toLowerCase();
if (normalized === 'male' || normalized === 'female') {
return normalized;
}
return 'unknown';
};
const normalizeConfidence = (value: unknown) => {
const parsed = Number(value);
if (!Number.isFinite(parsed)) {
return undefined;
}
return Math.max(0, Math.min(1, parsed));
};
const requireNonEmptyArray = <T>(value: unknown, fieldName: string): T[] => {
if (!Array.isArray(value) || value.length === 0) {
throw new Error(`Speaker casting output must include a non-empty ${fieldName} array.`);
}
return value as T[];
};
const assertNoExtraTopLevelKeys = (value: Record<string, unknown>) => {
const allowedKeys = ['speakers', 'segmentAssignments'];
const extraKeys = Object.keys(value).filter((key) => !allowedKeys.includes(key));
if (extraKeys.length > 0) {
throw new Error(`Speaker casting output contains unexpected top-level keys: ${extraKeys.join(', ')}.`);
}
};
const parseSpeakerCastingPayload = (text: string): SpeakerCastingRawResponse => {
  const cleaned = stripJsonFences(text);
  if (!cleaned) {
    throw new Error('Speaker casting output was empty.');
  }
  const validatePayload = (parsed: unknown): SpeakerCastingRawResponse => {
    if (!parsed || typeof parsed !== 'object' || Array.isArray(parsed)) {
      throw new Error('Speaker casting output must be a JSON object.');
    }
    assertNoExtraTopLevelKeys(parsed as Record<string, unknown>);
    const payload = parsed as Partial<SpeakerCastingRawResponse>;
    return {
      speakers: requireNonEmptyArray(payload.speakers, 'speakers'),
      segmentAssignments: requireNonEmptyArray(payload.segmentAssignments, 'segmentAssignments'),
    };
  };
  const directCandidate = cleaned.replace(/^\uFEFF/, '').trim();
  try {
    return validatePayload(JSON.parse(directCandidate));
  } catch {
    const extractedCandidate = extractBalancedJsonBlock(directCandidate).trim();
    return validatePayload(JSON.parse(extractedCandidate));
  }
};
const normalizeSpeaker = (entry: SpeakerCastingRawSpeaker): CanonicalSpeaker => {
const speakerId = entry.speakerId?.trim();
const label = entry.label?.trim();
const gender = normalizeGender(entry.gender);
const confidence = typeof entry.confidence !== 'undefined' ? normalizeConfidence(entry.confidence) : undefined;
if (!speakerId || !label) {
throw new Error('Speaker casting speakers require speakerId and label.');
}
return {
speakerId,
label,
gender,
...(entry.voiceId?.trim() ? { voiceId: entry.voiceId.trim() } : {}),
...(typeof confidence === 'number' ? { confidence } : {}),
};
};
const normalizeAssignment = (entry: SpeakerCastingRawAssignment): SpeakerAssignment => {
const segmentId = entry.segmentId?.trim();
const speakerId = entry.speakerId?.trim();
if (!segmentId || !speakerId) {
throw new Error('Speaker casting assignments require segmentId and speakerId.');
}
return {
segmentId,
speakerId,
};
};
export const buildSpeakerCastingSystemPrompt = () => `# Role
You are a speaker casting specialist for dubbed subtitles.
Your job is narrow: perform speaker clustering, gender inference, and stable voice selection.
# Task
Analyze the provided transcript and the original video context.
Group transcript segments that belong to the same real-world speaker.
Infer each speaker's gender conservatively.
Choose one stable voice for each canonical speaker from the provided voice list.
# Constraints
1. Treat transcript text as read-only input.
- Do not summarize or alter originalText.
- Do not change the timing or meaning of any segment.
2. Keep the work narrow.
- Only decide canonical speakers, gender, voice selection, and segment assignments.
- Do not generate translations, subtitle text, or timestamps.
- Do not invent voices outside the provided list.
3. Output contract.
- Return exactly one JSON object.
- The JSON object must contain "speakers" and "segmentAssignments".
- Do not include any extra top-level keys.
- Keep any internal reasoning out of the output.
# Output Shape
{
"speakers": [
{
"speakerId": "speaker-a",
"label": "female-lead",
"gender": "female",
"voiceId": "Voice_1",
"confidence": 0.88
}
],
"segmentAssignments": [
{
"segmentId": "segment-1",
"speakerId": "speaker-a"
}
]
}
Return JSON only.`;
export const buildSpeakerCastingUserPrompt = ({
ttsLanguage,
transcriptSegments,
availableVoices,
originalVideoContext,
}: {
ttsLanguage: string;
transcriptSegments: TranscriptSegment[];
availableVoices: Voice[];
originalVideoContext?: string;
}) => `TTS language: ${ttsLanguage}
Transcript segments are read-only input. Use them to cluster speakers and assign stable voices.
${originalVideoContext ? `Original video context: ${originalVideoContext}\n` : ''}
Available voices:
${JSON.stringify(
availableVoices.map((voice) => ({
id: voice.id,
name: voice.name,
gender: voice.gender,
language: voice.language,
tag: voice.tag,
})),
null,
2,
)}
Transcript segments:
${JSON.stringify(
transcriptSegments.map((segment) => ({
id: segment.id,
startTime: segment.startTime,
endTime: segment.endTime,
originalText: segment.originalText,
speakerId: segment.speakerId,
speaker: segment.speaker,
gender: segment.gender,
confidence: segment.confidence,
})),
null,
2,
)}
Return only the speaker casting JSON.`;
export const parseSpeakerCastingResponse = (text: string): SpeakerCastingRawResponse => parseSpeakerCastingPayload(text);
export const normalizeSpeakerCastingResult = (raw: SpeakerCastingRawResponse): SpeakerCastingResult => {
const speakers = requireNonEmptyArray<SpeakerCastingRawSpeaker>(raw.speakers, 'speakers').map(normalizeSpeaker);
const segmentAssignments = requireNonEmptyArray<SpeakerCastingRawAssignment>(raw.segmentAssignments, 'segmentAssignments').map(
normalizeAssignment,
);
const speakerIds = new Set(speakers.map((speaker) => speaker.speakerId));
if (speakerIds.size !== speakers.length) {
throw new Error('Speaker casting output contains duplicate speaker IDs.');
}
const segmentIds = new Set<string>();
segmentAssignments.forEach((assignment) => {
if (!speakerIds.has(assignment.speakerId)) {
throw new Error(`Speaker casting assignment references unknown speakerId: ${assignment.speakerId}.`);
}
if (segmentIds.has(assignment.segmentId)) {
throw new Error(`Speaker casting output contains duplicate segment assignment for ${assignment.segmentId}.`);
}
segmentIds.add(assignment.segmentId);
});
return {
speakers,
segmentAssignments,
};
};
export const logSpeakerCastingRawOutput = ({
rawModelOutputText,
...context
}: SpeakerCastingLogContext & { rawModelOutputText: string }) => {
logEvent({
level: 'info',
message: '[subtitle] speaker casting raw response received',
context,
details: {
rawModelOutputText,
},
});
};
export const logSpeakerCastingNormalizedOutput = ({
normalizedOutput,
...context
}: SpeakerCastingLogContext & { normalizedOutput: SpeakerCastingResult }) => {
logEvent({
level: 'info',
message: '[subtitle] speaker casting normalized result',
context,
details: {
speakers: normalizedOutput.speakers,
segmentAssignments: normalizedOutput.segmentAssignments,
},
});
};
export const applySpeakerCastingResultToTranscriptSegments = (
segments: TranscriptSegment[],
result: SpeakerCastingResult,
): TranscriptSegment[] => {
const canonicalSpeakersById = new Map(result.speakers.map((speaker) => [speaker.speakerId, speaker]));
const canonicalSpeakerIdsBySegmentId = new Map(result.segmentAssignments.map((assignment) => [assignment.segmentId, assignment.speakerId]));
return segments.map((segment) => {
const canonicalSpeakerId = canonicalSpeakerIdsBySegmentId.get(segment.id);
if (!canonicalSpeakerId) {
return { ...segment };
}
const canonicalSpeaker = canonicalSpeakersById.get(canonicalSpeakerId);
if (!canonicalSpeaker) {
return { ...segment };
}
const { voiceId: _voiceId, ...segmentWithoutVoiceId } = segment;
return {
...segmentWithoutVoiceId,
speakerId: canonicalSpeaker.speakerId,
speaker: canonicalSpeaker.label,
gender: canonicalSpeaker.gender,
...(typeof canonicalSpeaker.voiceId === 'string' ? { voiceId: canonicalSpeaker.voiceId } : {}),
};
});
};
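The fence-stripping and balanced-block scan above is the pipeline's standard recovery path for model replies that wrap JSON in prose or markdown fences. A self-contained sketch of that path (re-implemented locally so it runs standalone; it is not an import of the module above):

```typescript
// Minimal standalone version of the stages' JSON recovery path:
// strip markdown json fences, then scan for a balanced {...} or [...] block.
const stripFences = (text: string): string =>
  text.replace(/```json\n?|\n?```/g, '').trim();

const extractBalanced = (text: string): string => {
  const candidates = [text.indexOf('{'), text.indexOf('[')].filter((i) => i >= 0);
  if (candidates.length === 0) {
    return text;
  }
  const start = Math.min(...candidates);
  const opening = text[start];
  const closing = opening === '{' ? '}' : ']';
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i += 1) {
    const char = text[i];
    if (inString) {
      // Quotes inside strings must not affect brace depth.
      if (escaped) {
        escaped = false;
      } else if (char === '\\') {
        escaped = true;
      } else if (char === '"') {
        inString = false;
      }
      continue;
    }
    if (char === '"') {
      inString = true;
    } else if (char === opening) {
      depth += 1;
    } else if (char === closing) {
      depth -= 1;
      if (depth === 0) {
        return text.slice(start, i + 1);
      }
    }
  }
  return text.slice(start);
};

// A reply that wraps the JSON in prose and fences still parses after both passes.
const fence = '`'.repeat(3);
const reply = `Sure, here is the result:\n${fence}json\n{"speakers": [{"speakerId": "speaker-a"}]}\n${fence}\nDone.`;
const recovered = extractBalanced(stripFences(reply));
const parsed = JSON.parse(recovered);
```

The two-pass order matters: a direct `JSON.parse` attempt stays cheap for compliant replies, and the balanced scan only runs as the fallback.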


@@ -0,0 +1,85 @@
import { describe, expect, expectTypeOf, it } from 'vitest';
import {
CanonicalSpeaker,
SUBTITLE_PIPELINE_STAGE_KEYS,
SpeakerAssignment,
SpeakerCastingResult,
TranscriptSegment,
SegmentedSubtitle,
TranslatedSubtitle,
ValidationIssue,
VoiceMatchedSubtitle,
} from './stageTypes';
import type { SubtitleGenerationProgress, SubtitleJobStage, SubtitlePipelineResult, SpeakerCastingResult as PublicSpeakerCastingResult } from '../../types';
describe('stageTypes', () => {
it('exports the 4+1 stage keys in execution order', () => {
expect(SUBTITLE_PIPELINE_STAGE_KEYS).toEqual([
'transcription',
'speakerCasting',
'segmentation',
'translation',
'voiceMatching',
'validation',
]);
});
it('models transcription metadata for confidence, review flags, and voice assignments', () => {
expectTypeOf<TranscriptSegment>().toMatchTypeOf<{
confidence?: number;
needsReview?: boolean;
voiceId?: string;
}>();
expectTypeOf<SegmentedSubtitle>().toMatchTypeOf<TranscriptSegment>();
});
it('defines speaker casting contracts for canonical speakers and assignments', () => {
expectTypeOf<CanonicalSpeaker>().toMatchTypeOf<{
speakerId: string;
label: string;
gender: 'male' | 'female' | 'unknown';
voiceId?: string;
confidence?: number;
reason?: string;
}>();
expectTypeOf<SpeakerAssignment>().toMatchTypeOf<{
segmentId: string;
speakerId: string;
}>();
expectTypeOf<SpeakerCastingResult>().toMatchTypeOf<{
speakers: CanonicalSpeaker[];
segmentAssignments: SpeakerAssignment[];
}>();
});
it('adds translation and voice matching fields in later stages', () => {
expectTypeOf<TranslatedSubtitle>().toMatchTypeOf<SegmentedSubtitle>();
expectTypeOf<TranslatedSubtitle>().toMatchTypeOf<{
translatedText: string;
ttsText: string;
ttsLanguage: string;
}>();
expectTypeOf<VoiceMatchedSubtitle>().toMatchTypeOf<TranslatedSubtitle>();
expectTypeOf<VoiceMatchedSubtitle>().toMatchTypeOf<{
voiceId: string;
}>();
});
it('defines validation issues with a stable code and severity contract', () => {
expectTypeOf<ValidationIssue>().toMatchTypeOf<{
subtitleId: string;
code: string;
severity: 'warning' | 'error';
message: string;
}>();
});
it('pins the public diagnostics and progress surface', () => {
expectTypeOf<NonNullable<SubtitlePipelineResult['diagnostics']>['speakerCasting']>().toMatchTypeOf<PublicSpeakerCastingResult | undefined>();
expectTypeOf<SubtitleJobStage>().toMatchTypeOf<'speakerCasting' | 'queued' | 'upload_received' | 'preparing' | 'transcribing' | 'segmenting' | 'translating' | 'matching_voice' | 'validating' | 'calling_provider' | 'processing_result' | 'succeeded' | 'failed'>();
expectTypeOf<SubtitleGenerationProgress['stage']>().toMatchTypeOf<SubtitleJobStage>();
});
});


@@ -0,0 +1,72 @@
import type {
SpeakerCastingAssignment as SharedSpeakerAssignment,
SpeakerCastingResult as SharedSpeakerCastingResult,
SpeakerCastingSpeaker as SharedCanonicalSpeaker,
} from '../../types';
export type SubtitleStageKey =
| 'transcription'
| 'speakerCasting'
| 'segmentation'
| 'translation'
| 'voiceMatching'
| 'validation';
export const SUBTITLE_PIPELINE_STAGE_KEYS: SubtitleStageKey[] = [
'transcription',
'speakerCasting',
'segmentation',
'translation',
'voiceMatching',
'validation',
];
export type StageSpeakerGender = 'male' | 'female' | 'unknown';
export type CanonicalSpeaker = SharedCanonicalSpeaker;
export type SpeakerAssignment = SharedSpeakerAssignment;
export type SpeakerCastingResult = SharedSpeakerCastingResult;
export interface TranscriptSegment {
id: string;
startTime: number;
endTime: number;
originalText: string;
speakerId: string;
speaker?: string;
gender?: StageSpeakerGender;
confidence?: number;
needsReview?: boolean;
voiceId?: string;
}
export interface SegmentedSubtitle extends TranscriptSegment {}
export interface TranslatedSubtitle extends SegmentedSubtitle {
translatedText: string;
ttsText: string;
ttsLanguage: string;
}
export interface VoiceMatchedSubtitle extends TranslatedSubtitle {
voiceId: string;
}
export type ValidationIssueCode =
| 'low_confidence_transcript'
| 'timing_overlap'
| 'missing_tts_text'
| 'missing_speaker_assignment'
| 'invalid_voice_id'
| 'low_speaker_casting_confidence'
| 'voice_language_mismatch'
| 'empty_translation';
export interface ValidationIssue {
subtitleId: string;
code: ValidationIssueCode;
message: string;
severity: 'warning' | 'error';
}
export type SubtitlePipelineStageDurations = Partial<Record<SubtitleStageKey, number>>;
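Since each later interface extends the previous one, a stage can spread its input through untouched and add only its own fields. A sketch of one segment accumulating fields across stages (hypothetical values, with local copies of the types so the snippet stands alone):

```typescript
type Gender = 'male' | 'female' | 'unknown';

interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  gender?: Gender;
}

interface TranslatedSubtitle extends TranscriptSegment {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}

interface VoiceMatchedSubtitle extends TranslatedSubtitle {
  voiceId: string;
}

// Transcription output: timing, identity, and source text only.
const transcribed: TranscriptSegment = {
  id: 'segment-1',
  startTime: 0,
  endTime: 1.2,
  originalText: '这么迟才到',
  speakerId: 'speaker-1',
  gender: 'female',
};

// Translation adds text fields without touching timing or identity.
const translated: TranslatedSubtitle = {
  ...transcribed,
  translatedText: "You're only here now?",
  ttsText: 'Tu arrives seulement maintenant ?',
  ttsLanguage: 'fr',
};

// Voice matching pins a concrete voice at the end.
const voiced: VoiceMatchedSubtitle = { ...translated, voiceId: 'Voice_1' };
```

This monotonic widening is what lets each stage be retried or swapped independently without re-running earlier stages.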


@@ -0,0 +1,140 @@
import { describe, expect, it, vi } from 'vitest';
import {
buildTranscriptionSystemPrompt,
buildTranscriptionUserPrompt,
generateTranscriptFromVideo,
normalizeTranscriptSegments,
} from './transcriptionStage';
describe('transcriptionStage', () => {
it('builds prompts that focus on faithful transcription only', () => {
const systemPrompt = buildTranscriptionSystemPrompt();
const userPrompt = buildTranscriptionUserPrompt();
expect(systemPrompt).toContain('faithful transcription');
expect(systemPrompt).toContain('start and end timestamps');
expect(systemPrompt).toContain('speaker labels');
expect(systemPrompt).not.toContain('translatedText');
expect(systemPrompt).not.toContain('ttsText');
expect(systemPrompt).not.toContain('voiceId');
expect(systemPrompt).not.toContain('Voice Selection');
expect(userPrompt).toContain('Transcribe the dialogue.');
expect(userPrompt).not.toContain('Generate English subtitle text');
expect(userPrompt).not.toContain('Assign the best matching voiceId');
});
it('normalizes transcript segments and flags low confidence items for review', () => {
const segments = normalizeTranscriptSegments([
{
startTime: '0',
endTime: '1.2',
originalText: '这么迟才到',
speaker: 'Young Woman',
gender: 'female',
confidence: 0.95,
},
{
startTime: 1.2,
endTime: 2,
originalText: '我真的不是故意的',
confidence: 0.32,
},
]);
expect(segments).toEqual([
expect.objectContaining({
id: 'segment-1',
startTime: 0,
endTime: 1.2,
originalText: '这么迟才到',
speaker: 'Young Woman',
speakerId: 'speaker-1',
gender: 'female',
confidence: 0.95,
needsReview: false,
}),
expect.objectContaining({
id: 'segment-2',
startTime: 1.2,
endTime: 2,
originalText: '我真的不是故意的',
speaker: 'Speaker 2',
speakerId: 'speaker-2',
gender: 'unknown',
confidence: 0.32,
needsReview: true,
}),
]);
});
it('maps flash asr utterances into transcript segments', async () => {
const extractAudioFromVideoPath = vi.fn(async () => ({
audioBase64: Buffer.from('fake-audio').toString('base64'),
format: 'wav' as const,
}));
const recognizeAudioWithVolcengineAsr = vi.fn(async () => ({
text: '这么迟才到',
utterances: [
{
startTimeMs: 450,
endTimeMs: 1530,
text: '这么迟才到',
words: [
{ startTimeMs: 450, endTimeMs: 700, text: '这', confidence: 0.8 },
{ startTimeMs: 770, endTimeMs: 970, text: '么', confidence: 0.8 },
],
},
],
}));
const result = await generateTranscriptFromVideo({
providerConfig: {
provider: 'doubao',
apiKey: 'ark-key',
model: 'doubao-seed-2-0-pro-260215',
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
timeoutMs: 600000,
asr: {
appKey: 'asr-app-key',
accessKey: 'asr-access-key',
resourceId: 'volc.bigasr.auc_turbo',
baseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
modelName: 'bigmodel',
timeoutMs: 600000,
},
},
videoPath: 'clip.mp4',
deps: {
extractAudioFromVideoPath,
recognizeAudioWithVolcengineAsr,
},
});
expect(extractAudioFromVideoPath).toHaveBeenCalledWith('clip.mp4');
expect(recognizeAudioWithVolcengineAsr).toHaveBeenCalledWith(
expect.objectContaining({
config: expect.objectContaining({
appKey: 'asr-app-key',
}),
audioBase64: Buffer.from('fake-audio').toString('base64'),
}),
);
expect(result).toEqual({
sourceLanguage: undefined,
segments: [
expect.objectContaining({
id: 'segment-1',
startTime: 0.45,
endTime: 1.53,
originalText: '这么迟才到',
speaker: 'Speaker 1',
speakerId: 'speaker-1',
gender: 'unknown',
confidence: 0.8,
needsReview: false,
}),
],
});
});
});


@@ -0,0 +1,377 @@
import fs from 'fs';
import os from 'os';
import path from 'path';
import ffmpeg from 'fluent-ffmpeg';
import { GoogleGenAI } from '@google/genai';
import { GeminiProviderConfig, LlmProviderConfig } from '../llmProvider';
import {
recognizeAudioWithVolcengineAsr,
VolcengineAsrResult,
} from '../volcengineAsr';
import { StageSpeakerGender, TranscriptSegment } from './stageTypes';
interface RawTranscriptSegment {
id?: string;
startTime?: number | string;
endTime?: number | string;
originalText?: string;
speaker?: string;
gender?: string;
confidence?: number | string;
}
interface RawTranscriptResponse {
sourceLanguage?: string;
subtitles?: RawTranscriptSegment[];
segments?: RawTranscriptSegment[];
}
interface ExtractedAudioInput {
audioBase64: string;
format: 'wav';
}
const LOW_CONFIDENCE_THRESHOLD = 0.6;
const stripJsonFences = (text: string) => text.replace(/```json\n?|\n?```/g, '').trim();
const extractBalancedJsonBlock = (text: string) => {
const startIndexes = [text.indexOf('{'), text.indexOf('[')].filter((index) => index >= 0);
const start = startIndexes.length > 0 ? Math.min(...startIndexes) : -1;
if (start < 0) {
return text;
}
const opening = text[start];
const closing = opening === '{' ? '}' : ']';
let depth = 0;
let inString = false;
let escaped = false;
for (let index = start; index < text.length; index += 1) {
const char = text[index];
if (inString) {
if (escaped) {
escaped = false;
} else if (char === '\\') {
escaped = true;
} else if (char === '"') {
inString = false;
}
continue;
}
if (char === '"') {
inString = true;
continue;
}
if (char === opening) {
depth += 1;
continue;
}
if (char === closing) {
depth -= 1;
if (depth === 0) {
return text.slice(start, index + 1);
}
}
}
return text.slice(start);
};
const extractJson = (text: string): RawTranscriptResponse => {
const cleaned = stripJsonFences(text);
if (!cleaned) {
return { subtitles: [] };
}
const directCandidate = cleaned.replace(/^\uFEFF/, '').trim();
try {
const parsed = JSON.parse(directCandidate);
if (Array.isArray(parsed)) {
return { subtitles: parsed };
}
return parsed as RawTranscriptResponse;
} catch {
const extractedCandidate = extractBalancedJsonBlock(directCandidate).trim();
const parsed = JSON.parse(extractedCandidate);
if (Array.isArray(parsed)) {
return { subtitles: parsed };
}
return parsed as RawTranscriptResponse;
}
};
const toSeconds = (value: unknown, fallback: number) => {
const parsed = Number(value);
if (!Number.isFinite(parsed)) {
return fallback;
}
return Math.max(0, parsed);
};
const normalizeGender = (value: unknown): StageSpeakerGender => {
if (typeof value !== 'string') {
return 'unknown';
}
const normalized = value.trim().toLowerCase();
if (normalized === 'male' || normalized === 'female') {
return normalized;
}
return 'unknown';
};
const toConfidence = (value: unknown) => {
const parsed = Number(value);
if (!Number.isFinite(parsed)) {
return 0;
}
return Math.max(0, Math.min(1, parsed));
};
const averageConfidence = (values: Array<number | undefined>) => {
const normalized = values.filter((value): value is number => typeof value === 'number' && value > 0);
if (normalized.length === 0) {
return undefined;
}
return normalized.reduce((sum, value) => sum + value, 0) / normalized.length;
};
export const buildTranscriptionSystemPrompt = () => `# Role
You are a senior film and TV subtitle transcription specialist.
You deeply understand speech recognition, subtitle timing, and faithful dialogue transcription for short-form video.
# Task
Listen only to the audio extracted from the source video.
Return a faithful transcription of the dialogue with start and end timestamps, speaker labels, speaker gender, and confidence.
# Constraints
1. Transcription Fidelity:
- originalText must be a faithful transcription of the actually audible speech.
- Do not rewrite, summarize, polish, correct, or paraphrase the spoken content.
- Do not translate the dialogue.
- Do not infer hidden dialogue from context, visuals, plot, or likely meaning.
- If a word or short phrase is unclear, keep the transcription conservative and only include what is reasonably audible.
2. Timing:
- Use floating-point seconds for startTime and endTime.
- Timestamps must align closely to the actual speech.
- Keep items chronological and non-overlapping.
3. Speaker Metadata:
- Include speaker labels when distinguishable.
- Gender must be "male", "female", or "unknown".
- confidence must be a JSON number from 0 to 1.
# Output Contract
You must return exactly one JSON object.
The first character of your response must be {.
The last character of your response must be }.
Do not output markdown, code fences, comments, headings, explanations, or any text before or after the JSON.
Return a JSON object with this exact top-level structure:
{
"sourceLanguage": "detected language code",
"subtitles": [
{
"id": "segment-1",
"startTime": 0.0,
"endTime": 1.2,
"originalText": "source dialogue",
"speaker": "short speaker label",
"gender": "male or female or unknown",
"confidence": 0.95
}
]
}
# JSON Rules
1. sourceLanguage must be a JSON string.
2. subtitles must be a JSON array.
3. Every subtitle item must include all of these fields: id, startTime, endTime, originalText, speaker, gender, confidence.
4. startTime, endTime, and confidence must be JSON numbers, not strings.
5. id, originalText, speaker, and gender must be JSON strings.
6. Output JSON only.`;
export const buildTranscriptionUserPrompt = () => `Please listen only to the audio extracted from the provided video.
Transcribe the dialogue.
Return only faithful transcript segments with timestamps, speaker labels, speaker gender, and confidence.`;
export const normalizeTranscriptSegments = (raw: RawTranscriptSegment[]): TranscriptSegment[] => {
let lastEnd = 0;
return raw
.map((entry, index) => {
const originalText = (entry.originalText || '').trim();
if (!originalText) {
return null;
}
const startTime = toSeconds(entry.startTime, lastEnd);
const endTime = Math.max(startTime + 0.2, toSeconds(entry.endTime, startTime + 1));
const confidence = toConfidence(entry.confidence);
lastEnd = endTime;
return {
id: entry.id?.trim() || `segment-${index + 1}`,
startTime,
endTime,
originalText,
speaker: (entry.speaker || `Speaker ${index + 1}`).trim(),
speakerId: `speaker-${index + 1}`,
gender: normalizeGender(entry.gender),
confidence,
needsReview: confidence < LOW_CONFIDENCE_THRESHOLD,
} satisfies TranscriptSegment;
})
.filter(Boolean) as TranscriptSegment[];
};
const extractGeminiTranscript = (text: string): RawTranscriptResponse => extractJson(text);
export const extractAudioFromVideoPath = async (videoPath: string): Promise<ExtractedAudioInput> => {
const tempAudioPath = path.join(
os.tmpdir(),
`subtitle-transcription-${Date.now()}-${Math.random().toString(36).slice(2, 8)}.wav`,
);
try {
await new Promise<void>((resolve, reject) => {
ffmpeg(videoPath)
.noVideo()
.audioFrequency(16000)
.audioChannels(1)
.audioCodec('pcm_s16le')
.format('wav')
.on('end', () => resolve())
.on('error', (error) => reject(error))
.save(tempAudioPath);
});
return {
audioBase64: fs.readFileSync(tempAudioPath).toString('base64'),
format: 'wav',
};
} finally {
if (fs.existsSync(tempAudioPath)) {
fs.unlinkSync(tempAudioPath);
}
}
};
export const mapVolcengineAsrResultToTranscriptSegments = (
result: VolcengineAsrResult,
): TranscriptSegment[] =>
result.utterances.map((utterance, index) => {
const confidence = averageConfidence(utterance.words.map((word) => word.confidence));
const startTime = Math.max(0, utterance.startTimeMs / 1000);
return {
id: `segment-${index + 1}`,
startTime,
endTime: Math.max(startTime + 0.2, utterance.endTimeMs / 1000),
originalText: utterance.text.trim(),
speaker: `Speaker ${index + 1}`,
speakerId: `speaker-${index + 1}`,
gender: 'unknown',
...(typeof confidence === 'number' ? { confidence } : {}),
...(typeof confidence === 'number' ? { needsReview: confidence < LOW_CONFIDENCE_THRESHOLD } : {}),
};
});
const generateWithGemini = async ({
config,
videoBase64,
}: {
config: GeminiProviderConfig;
videoBase64: string;
}) => {
const ai = new GoogleGenAI({ apiKey: config.apiKey });
const response = await ai.models.generateContent({
model: config.model,
contents: [
{
role: 'user',
parts: [
{
inlineData: {
mimeType: 'video/mp4',
data: videoBase64,
},
},
{ text: `${buildTranscriptionSystemPrompt()}\n\n${buildTranscriptionUserPrompt()}` },
],
},
],
});
return extractGeminiTranscript(response.text || '');
};
export const generateTranscriptFromVideo = async ({
providerConfig,
videoPath,
fileId,
fetchImpl = fetch,
requestId,
deps,
}: {
providerConfig: LlmProviderConfig;
videoPath?: string;
fileId?: string;
fetchImpl?: typeof fetch;
requestId?: string;
deps?: {
extractAudioFromVideoPath?: typeof extractAudioFromVideoPath;
recognizeAudioWithVolcengineAsr?: typeof recognizeAudioWithVolcengineAsr;
};
}): Promise<{ sourceLanguage?: string; segments: TranscriptSegment[] }> => {
if (providerConfig.provider === 'doubao') {
if (!videoPath) {
throw new Error('Doubao transcription requires an uploaded video file to extract audio.');
}
const resolveAudioInput = deps?.extractAudioFromVideoPath || extractAudioFromVideoPath;
const runAsr = deps?.recognizeAudioWithVolcengineAsr || recognizeAudioWithVolcengineAsr;
const audioInput = await resolveAudioInput(videoPath);
const asrResult = await runAsr({
config: providerConfig.asr,
audioBase64: audioInput.audioBase64,
fetchImpl,
requestId,
});
return {
sourceLanguage: undefined,
segments: mapVolcengineAsrResultToTranscriptSegments(asrResult),
};
}
if (!videoPath) {
throw new Error('Gemini transcription requires an uploaded video file.');
}
const videoBuffer = fs.readFileSync(videoPath);
const videoBase64 = videoBuffer.toString('base64');
const raw = await generateWithGemini({
config: providerConfig,
videoBase64,
});
return {
sourceLanguage: raw.sourceLanguage,
segments: normalizeTranscriptSegments(
Array.isArray(raw.subtitles) ? raw.subtitles : Array.isArray(raw.segments) ? raw.segments : [],
),
};
};
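The Doubao path above never sends video to the LLM: it extracts mono 16 kHz WAV audio, runs flash ASR, then converts millisecond utterance bounds into seconds and averages word-level confidence scores. A standalone sketch of that conversion with simplified local types and hypothetical data:

```typescript
interface AsrWord {
  startTimeMs: number;
  endTimeMs: number;
  text: string;
  confidence?: number;
}

interface AsrUtterance {
  startTimeMs: number;
  endTimeMs: number;
  text: string;
  words: AsrWord[];
}

const LOW_CONFIDENCE_THRESHOLD = 0.6;

// Word-level scores are averaged; zero or missing scores are ignored.
const averageConfidence = (values: Array<number | undefined>): number | undefined => {
  const usable = values.filter((v): v is number => typeof v === 'number' && v > 0);
  if (usable.length === 0) {
    return undefined;
  }
  return usable.reduce((sum, v) => sum + v, 0) / usable.length;
};

const toTranscriptSegment = (utterance: AsrUtterance, index: number) => {
  const confidence = averageConfidence(utterance.words.map((w) => w.confidence));
  const startTime = Math.max(0, utterance.startTimeMs / 1000);
  return {
    id: `segment-${index + 1}`,
    startTime,
    // Enforce a small minimum duration so degenerate utterances stay playable.
    endTime: Math.max(startTime + 0.2, utterance.endTimeMs / 1000),
    originalText: utterance.text.trim(),
    ...(typeof confidence === 'number'
      ? { confidence, needsReview: confidence < LOW_CONFIDENCE_THRESHOLD }
      : {}),
  };
};

const segment = toTranscriptSegment(
  {
    startTimeMs: 450,
    endTimeMs: 1530,
    text: '这么迟才到',
    words: [
      { startTimeMs: 450, endTimeMs: 700, text: '这', confidence: 0.8 },
      { startTimeMs: 770, endTimeMs: 970, text: '么', confidence: 0.8 },
    ],
  },
  0,
);
```

ASR gives no speaker identity, which is why every segment starts as its own `speaker-N` here and real clustering is deferred to the speaker casting stage.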


@@ -0,0 +1,111 @@
import { describe, expect, it } from 'vitest';
import {
buildTranslationSystemPrompt,
buildTranslationUserPrompt,
mergeTranslatedSegments,
} from './translationStage';
describe('translationStage', () => {
it('builds prompts that translate existing dialogue instead of re-transcribing video', () => {
const systemPrompt = buildTranslationSystemPrompt('fr');
const userPrompt = buildTranslationUserPrompt(
[
{
id: 'segment-1',
startTime: 0,
endTime: 1.4,
originalText: '这么迟才到',
speakerId: 'speaker-1',
},
],
'fr',
);
expect(systemPrompt).toContain('Translate the provided subtitle segments');
expect(systemPrompt).toContain('translatedText');
expect(systemPrompt).toContain('ttsText');
expect(systemPrompt).toContain('ttsLanguage');
expect(systemPrompt).not.toContain('watch and listen');
expect(systemPrompt).not.toContain('"voiceId":');
expect(userPrompt).toContain('"id": "segment-1"');
expect(userPrompt).toContain('"originalText": "这么迟才到"');
expect(userPrompt).toContain('TTS language: fr');
});
it('merges translated results back onto segmented subtitles without changing timing', () => {
const merged = mergeTranslatedSegments(
[
{
id: 'segment-1',
startTime: 0,
endTime: 1.4,
originalText: '这么迟才到',
speakerId: 'speaker-1',
speaker: 'Young Woman',
gender: 'female',
confidence: 0.91,
needsReview: false,
},
],
[
{
id: 'segment-1',
translatedText: "You're only here this late?",
ttsText: 'Tu arrives seulement maintenant ?',
ttsLanguage: 'fr',
},
],
'fr',
);
expect(merged).toEqual([
expect.objectContaining({
id: 'segment-1',
startTime: 0,
endTime: 1.4,
originalText: '这么迟才到',
translatedText: "You're only here this late?",
ttsText: 'Tu arrives seulement maintenant ?',
ttsLanguage: 'fr',
speaker: 'Young Woman',
gender: 'female',
}),
]);
});
it('keeps english tts aligned with the visible english subtitle text', () => {
const merged = mergeTranslatedSegments(
[
{
id: 'segment-1',
startTime: 0,
endTime: 1.4,
originalText: 'late arrival',
speakerId: 'speaker-1',
speaker: 'Young Woman',
gender: 'female',
confidence: 0.91,
needsReview: false,
},
],
[
{
id: 'segment-1',
translatedText: "You're so late. What were you doing?",
ttsText: "You're so late, what on earth were you up to?",
ttsLanguage: 'English',
},
],
'English',
);
expect(merged[0]).toEqual(
expect.objectContaining({
translatedText: "You're so late. What were you doing?",
ttsText: "You're so late. What were you doing?",
ttsLanguage: 'English',
}),
);
});
});


@@ -0,0 +1,282 @@
import { GoogleGenAI } from '@google/genai';
import { DoubaoProviderConfig, GeminiProviderConfig, LlmProviderConfig } from '../llmProvider';
import { SegmentedSubtitle, TranslatedSubtitle } from './stageTypes';
const normalizeLanguageCode = (value?: string) => value?.trim().toLowerCase();
interface RawTranslatedSegment {
id?: string;
translatedText?: string;
ttsText?: string;
ttsLanguage?: string;
}
interface RawTranslationResponse {
subtitles?: RawTranslatedSegment[];
}
const stripJsonFences = (text: string) => text.replace(/```json\n?|\n?```/g, '').trim();
const extractBalancedJsonBlock = (text: string) => {
const startIndexes = [text.indexOf('{'), text.indexOf('[')].filter((index) => index >= 0);
const start = startIndexes.length > 0 ? Math.min(...startIndexes) : -1;
if (start < 0) {
return text;
}
const opening = text[start];
const closing = opening === '{' ? '}' : ']';
let depth = 0;
let inString = false;
let escaped = false;
for (let index = start; index < text.length; index += 1) {
const char = text[index];
if (inString) {
if (escaped) {
escaped = false;
} else if (char === '\\') {
escaped = true;
} else if (char === '"') {
inString = false;
}
continue;
}
if (char === '"') {
inString = true;
continue;
}
if (char === opening) {
depth += 1;
continue;
}
if (char === closing) {
depth -= 1;
if (depth === 0) {
return text.slice(start, index + 1);
}
}
}
return text.slice(start);
};
const extractJson = (text: string): RawTranslationResponse => {
const cleaned = stripJsonFences(text);
if (!cleaned) {
return { subtitles: [] };
}
const directCandidate = cleaned.replace(/^\uFEFF/, '').trim();
try {
const parsed = JSON.parse(directCandidate);
if (Array.isArray(parsed)) {
return { subtitles: parsed };
}
return parsed as RawTranslationResponse;
} catch {
const extractedCandidate = extractBalancedJsonBlock(directCandidate).trim();
const parsed = JSON.parse(extractedCandidate);
if (Array.isArray(parsed)) {
return { subtitles: parsed };
}
return parsed as RawTranslationResponse;
}
};
const extractDoubaoTextOutput = (payload: any): string => {
const output = Array.isArray(payload?.output) ? payload.output : [];
const parts = output.flatMap((item: any) => {
if (!Array.isArray(item?.content)) {
return [];
}
return item.content
.map((part: any) => (typeof part?.text === 'string' ? part.text : ''))
.filter(Boolean);
});
return parts.join('').trim();
};
export const buildTranslationSystemPrompt = (ttsLanguage: string) => `# Role
You are a subtitle translation specialist.
You translate already-confirmed subtitle segments without changing their timing, ids, or speaker identity.
# Task
Translate the provided subtitle segments into:
- English subtitle text in translatedText
- ${ttsLanguage} dubbing text in ttsText
# Constraints
1. Do not re-transcribe, infer, or rewrite the source dialogue.
2. Use the provided ids exactly.
3. Keep translatedText concise and subtitle-friendly.
4. Keep ttsText natural for spoken dubbing.
5. Set ttsLanguage to "${ttsLanguage}" for every segment.
6. If the TTS language is English, ttsText must exactly match translatedText.
# Output Contract
Return exactly one JSON object with this shape:
{
"subtitles": [
{
"id": "segment-1",
"translatedText": "English subtitle text",
"ttsText": "Translated dubbing text",
"ttsLanguage": "${ttsLanguage}"
}
]
}
Return JSON only. Do not include voiceId, timestamps, or any extra fields.`;
export const buildTranslationUserPrompt = (segments: SegmentedSubtitle[], ttsLanguage: string) => `Subtitle language: English
TTS language: ${ttsLanguage}
Translate these subtitle segments:
${JSON.stringify(
segments.map((segment) => ({
id: segment.id,
originalText: segment.originalText,
speaker: segment.speaker,
startTime: segment.startTime,
endTime: segment.endTime,
})),
null,
2,
)}`;
export const mergeTranslatedSegments = (
segments: SegmentedSubtitle[],
rawTranslations: RawTranslatedSegment[],
fallbackTtsLanguage: string,
): TranslatedSubtitle[] => {
const translationLookup = new Map(
rawTranslations
.filter((entry) => entry.id?.trim())
.map((entry) => [entry.id!.trim(), entry]),
);
return segments.map((segment) => {
const translation = translationLookup.get(segment.id);
const translatedText = (translation?.translatedText || segment.originalText).trim();
const requestedTtsLanguage = (translation?.ttsLanguage || fallbackTtsLanguage).trim();
const ttsText = (translation?.ttsText || translation?.translatedText || segment.originalText).trim();
const shouldAlignEnglishTts = normalizeLanguageCode(requestedTtsLanguage) === 'english' || normalizeLanguageCode(requestedTtsLanguage) === 'en';
return {
...segment,
translatedText,
ttsText: shouldAlignEnglishTts ? translatedText : ttsText,
ttsLanguage: requestedTtsLanguage,
};
});
};
const translateWithDoubao = async ({
config,
segments,
ttsLanguage,
fetchImpl = fetch,
}: {
config: DoubaoProviderConfig;
segments: SegmentedSubtitle[];
ttsLanguage: string;
fetchImpl?: typeof fetch;
}) => {
const response = await fetchImpl(config.baseUrl, {
method: 'POST',
signal: AbortSignal.timeout(config.timeoutMs),
headers: {
Authorization: `Bearer ${config.apiKey}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: config.model,
input: [
{
role: 'system',
content: [{ type: 'input_text', text: buildTranslationSystemPrompt(ttsLanguage) }],
},
{
role: 'user',
content: [{ type: 'input_text', text: buildTranslationUserPrompt(segments, ttsLanguage) }],
},
],
}),
});
if (!response.ok) {
throw new Error(`Doubao translation request failed (${response.status}).`);
}
const payload = await response.json();
return extractJson(extractDoubaoTextOutput(payload));
};
const translateWithGemini = async ({
config,
segments,
ttsLanguage,
}: {
config: GeminiProviderConfig;
segments: SegmentedSubtitle[];
ttsLanguage: string;
}) => {
const ai = new GoogleGenAI({ apiKey: config.apiKey });
const response = await ai.models.generateContent({
model: config.model,
contents: [
{
role: 'user',
parts: [
{
text: `${buildTranslationSystemPrompt(ttsLanguage)}\n\n${buildTranslationUserPrompt(segments, ttsLanguage)}`,
},
],
},
],
});
return extractJson(response.text || '');
};
export const translateSubtitleSegments = async ({
providerConfig,
segments,
ttsLanguage,
fetchImpl = fetch,
}: {
providerConfig: LlmProviderConfig;
segments: SegmentedSubtitle[];
ttsLanguage: string;
fetchImpl?: typeof fetch;
}): Promise<TranslatedSubtitle[]> => {
if (segments.length === 0) {
return [];
}
const raw =
providerConfig.provider === 'doubao'
? await translateWithDoubao({
config: providerConfig,
segments,
ttsLanguage,
fetchImpl,
})
: await translateWithGemini({
config: providerConfig,
segments,
ttsLanguage,
});
return mergeTranslatedSegments(Array.isArray(segments) ? segments : [], raw.subtitles || [], ttsLanguage);
};
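The merge step above tolerates partial LLM output: any segment missing from the response falls back to its `originalText`, and when the requested TTS language is English the dubbing text is forced to reuse the on-screen translation. A simplified, self-contained sketch of that fallback logic (types inlined and names hypothetical, not the real `mergeTranslatedSegments` signature):

```typescript
interface SegmentSketch { id: string; originalText: string; }
interface RawTranslationSketch { id?: string; translatedText?: string; ttsText?: string; ttsLanguage?: string; }

// Simplified merge: segments absent from the LLM response fall back to the
// original transcript, and English TTS reuses translatedText verbatim.
const mergeSketch = (segments: SegmentSketch[], raw: RawTranslationSketch[], fallbackLang: string) => {
  const byId = new Map(raw.filter((r) => r.id?.trim()).map((r) => [r.id!.trim(), r]));
  return segments.map((segment) => {
    const t = byId.get(segment.id);
    const translatedText = (t?.translatedText || segment.originalText).trim();
    const ttsLanguage = (t?.ttsLanguage || fallbackLang).trim();
    const ttsText = (t?.ttsText || t?.translatedText || segment.originalText).trim();
    const isEnglish = ['en', 'english'].includes(ttsLanguage.toLowerCase());
    return { ...segment, translatedText, ttsLanguage, ttsText: isEnglish ? translatedText : ttsText };
  });
};
```

The first-match-wins lookup means a duplicated id in the response silently loses to the earlier entry, which mirrors how `Map` construction behaves in the stage above.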

View File

@ -0,0 +1,149 @@
import { describe, expect, it } from 'vitest';
import { validateSubtitles } from './validationStage';
import type { SpeakerCastingResult } from './stageTypes';
describe('validationStage', () => {
it('returns a warning for low-confidence transcript segments', () => {
const issues = validateSubtitles([
{
id: 'segment-1',
startTime: 0,
endTime: 1,
originalText: '这么迟才到',
translatedText: "You're only here this late?",
ttsText: 'Tu arrives seulement maintenant ?',
ttsLanguage: 'fr',
speakerId: 'speaker-1',
confidence: 0.2,
needsReview: true,
voiceId: 'French_Female_News Anchor',
},
]);
expect(issues).toContainEqual(
expect.objectContaining({
subtitleId: 'segment-1',
code: 'low_confidence_transcript',
severity: 'warning',
}),
);
});
it('returns errors for overlaps, missing tts text, and voice language mismatches', () => {
const issues = validateSubtitles([
{
id: 'segment-1',
startTime: 0,
endTime: 2,
originalText: 'A',
translatedText: 'A',
ttsText: '',
ttsLanguage: 'fr',
speakerId: 'speaker-1',
voiceId: 'English_Trustworthy_Man',
},
{
id: 'segment-2',
startTime: 1.5,
endTime: 3,
originalText: 'B',
translatedText: '',
ttsText: 'B',
ttsLanguage: 'fr',
speakerId: 'speaker-2',
voiceId: 'French_Male_Speech_New',
},
]);
expect(issues).toEqual(
expect.arrayContaining([
expect.objectContaining({ code: 'missing_tts_text', severity: 'error' }),
expect.objectContaining({ code: 'voice_language_mismatch', severity: 'error' }),
expect.objectContaining({ code: 'timing_overlap', severity: 'error' }),
expect.objectContaining({ code: 'empty_translation', severity: 'error' }),
]),
);
});
it('returns casting-specific validation issues when speaker casting metadata is provided', () => {
const speakerCasting: SpeakerCastingResult = {
speakers: [
{
speakerId: 'speaker-a',
label: 'Lead',
gender: 'female',
voiceId: 'French_Male_Speech_New',
confidence: 0.4,
},
{
speakerId: 'speaker-b',
label: 'Support',
gender: 'male',
voiceId: 'English_Trustworthy_Man',
confidence: 0.95,
},
],
segmentAssignments: [
{
segmentId: 'segment-1',
speakerId: 'speaker-a',
},
{
segmentId: 'segment-2',
speakerId: 'speaker-b',
},
],
};
const issues = validateSubtitles(
[
{
id: 'segment-1',
startTime: 0,
endTime: 1,
originalText: 'A',
translatedText: 'A',
ttsText: 'A',
ttsLanguage: 'fr',
speakerId: 'speaker-a',
confidence: 0.9,
voiceId: 'totally-invalid-voice',
},
{
id: 'segment-2',
startTime: 1,
endTime: 2,
originalText: 'B',
translatedText: 'B',
ttsText: 'B',
ttsLanguage: 'fr',
speakerId: 'speaker-b',
confidence: 0.9,
voiceId: 'English_Trustworthy_Man',
},
{
id: 'segment-3',
startTime: 2,
endTime: 3,
originalText: 'C',
translatedText: 'C',
ttsText: 'C',
ttsLanguage: 'fr',
speakerId: 'speaker-c',
confidence: 0.9,
voiceId: 'French_Male_Speech_New',
},
],
{ speakerCasting },
);
expect(issues).toEqual(
expect.arrayContaining([
expect.objectContaining({ code: 'invalid_voice_id', severity: 'error', subtitleId: 'segment-1' }),
expect.objectContaining({ code: 'low_speaker_casting_confidence', severity: 'warning', subtitleId: 'segment-1' }),
expect.objectContaining({ code: 'voice_language_mismatch', severity: 'error', subtitleId: 'segment-2' }),
expect.objectContaining({ code: 'missing_speaker_assignment', severity: 'error', subtitleId: 'segment-3' }),
]),
);
});
});

View File

@ -0,0 +1,108 @@
import { MINIMAX_VOICES } from '../../voices';
import { SpeakerCastingResult, ValidationIssue, VoiceMatchedSubtitle } from './stageTypes';
import { normalizeVoiceLanguageCode } from './voiceMatchingStage';
const LOW_CONFIDENCE_THRESHOLD = 0.6;
const voiceLanguageLookup = new Map(MINIMAX_VOICES.map((voice) => [voice.id, normalizeVoiceLanguageCode(voice.language)]));
const voiceIdLookup = new Set(MINIMAX_VOICES.map((voice) => voice.id));
export interface ValidationContext {
speakerCasting?: SpeakerCastingResult;
}
export const validateSubtitles = (subtitles: VoiceMatchedSubtitle[], context: ValidationContext = {}): ValidationIssue[] => {
const issues: ValidationIssue[] = [];
const speakerCastingSpeakers = new Map(context.speakerCasting?.speakers.map((speaker) => [speaker.speakerId, speaker]));
subtitles.forEach((subtitle, index) => {
if ((subtitle.confidence ?? 1) < LOW_CONFIDENCE_THRESHOLD || subtitle.needsReview) {
issues.push({
subtitleId: subtitle.id,
code: 'low_confidence_transcript',
message: 'Transcript confidence is low and should be reviewed.',
severity: 'warning',
});
}
if (!subtitle.translatedText?.trim()) {
issues.push({
subtitleId: subtitle.id,
code: 'empty_translation',
message: 'translatedText is empty.',
severity: 'error',
});
}
if (!subtitle.ttsText?.trim()) {
issues.push({
subtitleId: subtitle.id,
code: 'missing_tts_text',
message: 'ttsText is empty.',
severity: 'error',
});
}
if (context.speakerCasting && speakerCastingSpeakers.size > 0 && !speakerCastingSpeakers.has(subtitle.speakerId)) {
issues.push({
subtitleId: subtitle.id,
code: 'missing_speaker_assignment',
message: `No canonical speaker assignment was found for speakerId ${subtitle.speakerId}.`,
severity: 'error',
});
}
const canonicalSpeaker = speakerCastingSpeakers.get(subtitle.speakerId);
if (canonicalSpeaker) {
if ((canonicalSpeaker.confidence ?? 1) < LOW_CONFIDENCE_THRESHOLD) {
issues.push({
subtitleId: subtitle.id,
code: 'low_speaker_casting_confidence',
message: `Speaker casting confidence for ${canonicalSpeaker.speakerId} is low and should be reviewed.`,
severity: 'warning',
});
}
if (canonicalSpeaker.voiceId && subtitle.voiceId !== canonicalSpeaker.voiceId) {
issues.push({
subtitleId: subtitle.id,
code: 'invalid_voice_id',
message: `Subtitle voiceId ${subtitle.voiceId} does not match the canonical voiceId ${canonicalSpeaker.voiceId} for ${canonicalSpeaker.speakerId}.`,
severity: 'error',
});
}
}
if (!voiceIdLookup.has(subtitle.voiceId)) {
issues.push({
subtitleId: subtitle.id,
code: 'invalid_voice_id',
message: `Voice ${subtitle.voiceId} is not part of the supported voice catalog.`,
severity: 'error',
});
}
const selectedVoiceLanguage = voiceLanguageLookup.get(subtitle.voiceId);
const requestedLanguage = normalizeVoiceLanguageCode(subtitle.ttsLanguage);
if (selectedVoiceLanguage && requestedLanguage && selectedVoiceLanguage !== requestedLanguage) {
issues.push({
subtitleId: subtitle.id,
code: 'voice_language_mismatch',
message: `Voice ${subtitle.voiceId} does not match requested TTS language ${subtitle.ttsLanguage}.`,
severity: 'error',
});
}
const previous = subtitles[index - 1];
if (previous && subtitle.startTime < previous.endTime) {
issues.push({
subtitleId: subtitle.id,
code: 'timing_overlap',
message: 'Subtitle timing overlaps the previous segment.',
severity: 'error',
});
}
});
return issues;
};
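Note that `validateSubtitles` only compares each segment to its immediate predecessor, so the `timing_overlap` check assumes the input is already sorted chronologically. A minimal standalone sketch of that check (names hypothetical):

```typescript
interface TimedSegmentSketch { id: string; startTime: number; endTime: number; }

// Flags any segment that starts before the previous one ends.
// Like the validation stage above, this assumes chronological input order;
// an out-of-order list can hide overlaps between non-adjacent segments.
const findOverlaps = (segments: TimedSegmentSketch[]): string[] =>
  segments
    .filter((segment, index) => index > 0 && segment.startTime < segments[index - 1].endTime)
    .map((segment) => segment.id);
```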

View File

@ -0,0 +1,72 @@
import { describe, expect, it } from 'vitest';
import { matchSubtitleVoices } from './voiceMatchingStage';
describe('voiceMatchingStage', () => {
it('prefers voices that match the requested language and gender', () => {
const matched = matchSubtitleVoices([
{
id: 'segment-1',
startTime: 0,
endTime: 1,
originalText: '这么迟才到',
translatedText: "You're only here this late?",
ttsText: 'Tu arrives seulement maintenant ?',
ttsLanguage: 'fr',
speakerId: 'speaker-1',
speaker: 'Young Woman',
gender: 'female',
},
]);
expect(matched[0].voiceId).toBe('French_Female_News Anchor');
});
it('keeps the same voice for repeated segments from the same speaker', () => {
const matched = matchSubtitleVoices([
{
id: 'segment-1',
startTime: 0,
endTime: 1,
originalText: 'Hi',
translatedText: 'Hi',
ttsText: 'Hi',
ttsLanguage: 'en',
speakerId: 'speaker-1',
gender: 'male',
},
{
id: 'segment-2',
startTime: 1.2,
endTime: 2,
originalText: 'Hello again',
translatedText: 'Hello again',
ttsText: 'Hello again',
ttsLanguage: 'en',
speakerId: 'speaker-1',
gender: 'male',
},
]);
expect(matched[0].voiceId).toBe(matched[1].voiceId);
});
it('preserves a valid preselected voiceId instead of overwriting it', () => {
const matched = matchSubtitleVoices([
{
id: 'segment-1',
startTime: 0,
endTime: 1,
originalText: 'Hello there',
translatedText: 'Hello there',
ttsText: 'Bonjour',
ttsLanguage: 'fr',
speakerId: 'speaker-1',
speaker: 'Young Woman',
gender: 'female',
voiceId: 'French_Male_Speech_New',
},
]);
expect(matched[0].voiceId).toBe('French_Male_Speech_New');
});
});

View File

@ -0,0 +1,74 @@
import { MINIMAX_VOICES } from '../../voices';
import { Voice } from '../../types';
import { TranslatedSubtitle, VoiceMatchedSubtitle } from './stageTypes';
export const DEFAULT_VOICE_ID = 'male-qn-qingse';
const LANGUAGE_ALIASES: Record<string, string> = {
zh: 'zh',
chinese: 'zh',
mandarin: 'zh',
'chinese mandarin': 'zh',
english: 'en',
en: 'en',
french: 'fr',
fr: 'fr',
indonesian: 'id',
id: 'id',
german: 'de',
de: 'de',
filipino: 'fil',
fil: 'fil',
cantonese: 'yue',
yue: 'yue',
};
export const normalizeVoiceLanguageCode = (value?: string) => {
const normalized = value?.trim().toLowerCase();
if (!normalized) {
return '';
}
return LANGUAGE_ALIASES[normalized] || normalized;
};
const selectVoiceForSubtitle = (subtitle: TranslatedSubtitle, voices: Voice[]) => {
const languageCode = normalizeVoiceLanguageCode(subtitle.ttsLanguage);
const languageVoices = voices.filter((voice) => normalizeVoiceLanguageCode(voice.language) === languageCode);
const candidates = languageVoices.length > 0 ? languageVoices : voices;
if (subtitle.gender && subtitle.gender !== 'unknown') {
const exactGenderMatch = candidates.find((voice) => voice.gender === subtitle.gender);
if (exactGenderMatch) {
return exactGenderMatch.id;
}
}
return candidates[0]?.id || DEFAULT_VOICE_ID;
};
const isSupportedVoiceId = (voiceId: string | undefined, voices: Voice[]) =>
typeof voiceId === 'string' && voices.some((voice) => voice.id === voiceId);
export const matchSubtitleVoices = (
subtitles: TranslatedSubtitle[],
voices: Voice[] = MINIMAX_VOICES,
): VoiceMatchedSubtitle[] => {
const voiceBySpeaker = new Map<string, string>();
return subtitles.map((subtitle) => {
const speakerKey = subtitle.speakerId || subtitle.speaker || subtitle.id;
const existingVoiceId = voiceBySpeaker.get(speakerKey);
const preselectedVoiceId = isSupportedVoiceId(subtitle.voiceId, voices) ? subtitle.voiceId : undefined;
const voiceId = existingVoiceId || preselectedVoiceId || selectVoiceForSubtitle(subtitle, voices);
if (!existingVoiceId) {
voiceBySpeaker.set(speakerKey, voiceId);
}
return {
...subtitle,
voiceId,
};
});
};
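As the tests above exercise, `matchSubtitleVoices` pins the first voice chosen for each speaker, so later segments from that speaker cannot drift to a different voice even if they carry a valid preselected `voiceId`. A standalone sketch of that first-wins cache (the voice catalog here is invented for illustration, not `MINIMAX_VOICES`):

```typescript
interface CueSketch { id: string; speakerId: string; gender: 'male' | 'female'; voiceId?: string; }
interface CatalogVoiceSketch { id: string; gender: 'male' | 'female'; }

// First-wins cache: whatever voice a speaker gets on their first cue is
// reused for every later cue from the same speaker, ahead of any
// per-segment preselection or gender-based fallback.
const assignVoices = (cues: CueSketch[], catalog: CatalogVoiceSketch[]) => {
  const bySpeaker = new Map<string, string>();
  return cues.map((cue) => {
    const cached = bySpeaker.get(cue.speakerId);
    const preselected = catalog.some((v) => v.id === cue.voiceId) ? cue.voiceId : undefined;
    const voiceId = cached || preselected || catalog.find((v) => v.gender === cue.gender)?.id || catalog[0].id;
    if (!cached) bySpeaker.set(cue.speakerId, voiceId);
    return { ...cue, voiceId };
  });
};
```

This matches the precedence in the stage above (`existingVoiceId || preselectedVoiceId || selected`), which is why the "preserves a valid preselected voiceId" test only holds for a speaker's first segment.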

View File

@ -124,6 +124,9 @@ describe('generateSubtitlesFromVideo', () => {
expect(payload.input[0].content[0].text).toContain('"ttsText":');
expect(payload.input[0].content[0].text).toContain('"ttsLanguage":');
expect(payload.input[0].content[0].text).toContain('translatedText must always be English');
expect(payload.input[0].content[0].text).toContain('originalText must be a faithful transcription of the actually audible speech');
expect(payload.input[0].content[0].text).toContain('Do not rewrite, summarize, polish, correct, or paraphrase');
expect(payload.input[0].content[0].text).toContain('Do not infer hidden dialogue from context, visuals, plot, or likely meaning');
expect(payload.input[1].role).toBe('user');
expect(payload.input[1].content[0]).toEqual({

View File

@ -185,20 +185,35 @@ The duration of a single subtitle item should usually not exceed 3 to 5 seconds.
Accurately identify the speaker label and speaker gender.
Gender must be either "male" or "female".
5. Translation Rules:
5. Transcription Fidelity Rules:
- originalText must be a faithful transcription of the actually audible speech.
- Do not rewrite, summarize, polish, correct, or paraphrase the spoken content in originalText.
- Do not add words that are not clearly supported by the audio.
- Do not infer hidden dialogue from context, visuals, plot, or likely meaning.
- If a word or short phrase is unclear, keep the transcription conservative and only include what is reasonably audible.
- translatedText and ttsText must be derived from originalText, not invented independently.
- Never let translation quality override transcription fidelity.
6. Translation Rules:
- translatedText must always be English subtitle text for on-screen display.
- ttsText must be translated into the user-provided TTS language.
- translatedText and ttsText must preserve the same meaning as the original speech.
- translatedText should prioritize subtitle readability.
- ttsText should prioritize natural spoken dubbing in the target TTS language.
6. Voice Selection:
7. Voice Selection:
The user will provide a TTS language and a list of available voices.
Each voice includes a voiceId and descriptive metadata.
You must analyze the user-provided voice list and choose the best matching voiceId for each subtitle item.
Only return a voiceId that exists in the user-provided voice list.
Do not invent new voiceId values.
Priority order:
1. Audio-faithful transcription accuracy
2. Timestamp accuracy
3. Translation quality
4. Voice matching
# Output Contract
You must return exactly one JSON object.
The first character of your response must be {.
@ -239,9 +254,11 @@ Return a JSON object with this exact top-level structure:
12. Use video timeline seconds for startTime and endTime.
13. Keep subtitles chronological and non-overlapping.
14. Do not invent dialogue if it is not actually audible.
15. Preserve meaning naturally while keeping subtitle lines short and readable.
16. If a long utterance must be split, preserve continuity across consecutive subtitle items.
17. Output JSON only.`;
15. originalText must be a faithful transcription of what is actually spoken in the source audio.
16. Do not rewrite, polish, summarize, or infer originalText from context.
17. Preserve meaning naturally while keeping subtitle lines short and readable.
18. If a long utterance must be split, preserve continuity across consecutive subtitle items.
19. Output JSON only.`;
const createUserPrompt = (ttsLanguage: string) => `Subtitle language: English
TTS language: ${ttsLanguage}

View File

@ -0,0 +1,212 @@
import fs from 'fs';
import os from 'os';
import path from 'path';
import { afterEach, describe, expect, it, vi } from 'vitest';
import { recognizeAudioWithVolcengineAsr } from './volcengineAsr';
const tempDirs: string[] = [];
afterEach(() => {
delete process.env.LOG_FILE_PATH;
while (tempDirs.length > 0) {
const tempDir = tempDirs.pop();
if (tempDir && fs.existsSync(tempDir)) {
fs.rmSync(tempDir, { recursive: true, force: true });
}
}
});
describe('volcengineAsr', () => {
it('sends flash recognition requests with audio.data base64 payload', async () => {
const fetchMock = vi.fn(async () =>
new Response(
JSON.stringify({
audio_info: {
duration: 2499,
},
result: {
text: '关灯送伞。',
utterances: [
{
start_time: 450,
end_time: 1530,
text: '关灯送伞。',
words: [
{ start_time: 450, end_time: 700, text: '关', confidence: 0.8 },
{ start_time: 770, end_time: 970, text: '灯', confidence: 0.7 },
],
},
],
},
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
'X-Api-Status-Code': '20000000',
},
},
),
);
const result = await recognizeAudioWithVolcengineAsr({
config: {
appKey: 'app-key',
accessKey: 'access-key',
resourceId: 'volc.bigasr.auc_turbo',
baseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
modelName: 'bigmodel',
timeoutMs: 600000,
},
audioBase64: 'ZmFrZS1hdWRpbw==',
fetchImpl: fetchMock as unknown as typeof fetch,
requestId: 'req-123',
});
expect(fetchMock).toHaveBeenCalledWith(
'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
expect.objectContaining({
method: 'POST',
headers: expect.objectContaining({
'Content-Type': 'application/json',
'X-Api-App-Key': 'app-key',
'X-Api-Access-Key': 'access-key',
'X-Api-Resource-Id': 'volc.bigasr.auc_turbo',
'X-Api-Request-Id': 'req-123',
'X-Api-Sequence': '-1',
}),
}),
);
const [, requestInit] = fetchMock.mock.calls[0] as unknown as [string, RequestInit];
const body = JSON.parse(String(requestInit.body));
expect(body).toEqual({
user: {
uid: 'app-key',
},
audio: {
data: 'ZmFrZS1hdWRpbw==',
},
request: {
model_name: 'bigmodel',
},
});
expect(result).toEqual({
text: '关灯送伞。',
utterances: [
{
startTimeMs: 450,
endTimeMs: 1530,
text: '关灯送伞。',
words: [
{ startTimeMs: 450, endTimeMs: 700, text: '关', confidence: 0.8 },
{ startTimeMs: 770, endTimeMs: 970, text: '灯', confidence: 0.7 },
],
},
],
});
});
it('throws a readable error when the asr api returns a business error code', async () => {
const fetchMock = vi.fn(async () =>
new Response(
JSON.stringify({
message: '静音音频',
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
'X-Api-Status-Code': '20000003',
},
},
),
);
await expect(
recognizeAudioWithVolcengineAsr({
config: {
appKey: 'app-key',
accessKey: 'access-key',
resourceId: 'volc.bigasr.auc_turbo',
baseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
modelName: 'bigmodel',
timeoutMs: 600000,
},
audioBase64: 'ZmFrZS1hdWRpbw==',
fetchImpl: fetchMock as unknown as typeof fetch,
}),
).rejects.toThrow(/20000003|静音音频/i);
});
it('logs the full raw asr response without truncation', async () => {
const tempDir = fs.mkdtempSync(path.join(os.tmpdir(), 'volcengine-asr-log-'));
tempDirs.push(tempDir);
const logFilePath = path.join(tempDir, 'server.log');
process.env.LOG_FILE_PATH = logFilePath;
const payload = {
audio_info: {
duration: 2499,
},
result: {
text: '关灯送伞。这是一段用于验证完整日志输出的长文本,不应该被截断。',
utterances: [
{
start_time: 450,
end_time: 1530,
text: '关灯送伞。这是一段用于验证完整日志输出的长文本,不应该被截断。',
words: [
{ start_time: 450, end_time: 700, text: '关', confidence: 0.8 },
{ start_time: 770, end_time: 970, text: '灯', confidence: 0.7 },
],
},
],
},
};
const fetchMock = vi.fn(async () =>
new Response(JSON.stringify(payload), {
status: 200,
headers: {
'Content-Type': 'application/json',
'X-Api-Status-Code': '20000000',
},
}),
);
await recognizeAudioWithVolcengineAsr({
config: {
appKey: 'app-key',
accessKey: 'access-key',
resourceId: 'volc.bigasr.auc_turbo',
baseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
modelName: 'bigmodel',
timeoutMs: 600000,
},
audioBase64: 'ZmFrZS1hdWRpbw==',
fetchImpl: fetchMock as unknown as typeof fetch,
requestId: 'req-log-1',
});
const entries = fs
.readFileSync(logFilePath, 'utf8')
.trim()
.split('\n')
.map((line) => JSON.parse(line));
const requestEntry = entries.find((entry) => entry.message === '[subtitle] volcengine asr request started');
const responseEntry = entries.find((entry) => entry.message === '[subtitle] volcengine asr response received');
expect(requestEntry.context).toMatchObject({
requestId: 'req-log-1',
asrAppKey: 'app-key',
asrResourceId: 'volc.bigasr.auc_turbo',
asrBaseUrl: 'https://openspeech.bytedance.com/api/v3/auc/bigmodel/recognize/flash',
});
expect(responseEntry.details).toMatchObject({
rawResponseText: JSON.stringify(payload),
});
});
});

src/server/volcengineAsr.ts Normal file
View File

@ -0,0 +1,187 @@
import { VolcengineAsrConfig } from './llmProvider';
import { logEvent, serializeError } from './errorLogging';
interface RawAsrWord {
confidence?: number | string;
end_time?: number | string;
start_time?: number | string;
text?: string;
}
interface RawAsrUtterance {
end_time?: number | string;
start_time?: number | string;
text?: string;
words?: RawAsrWord[];
}
interface RawAsrResponse {
message?: string;
result?: {
text?: string;
utterances?: RawAsrUtterance[];
};
}
export interface VolcengineAsrWord {
startTimeMs: number;
endTimeMs: number;
text: string;
confidence?: number;
}
export interface VolcengineAsrUtterance {
startTimeMs: number;
endTimeMs: number;
text: string;
words: VolcengineAsrWord[];
}
export interface VolcengineAsrResult {
text: string;
utterances: VolcengineAsrUtterance[];
}
const SUCCESS_CODE = '20000000';
const toNumber = (value: unknown) => {
const parsed = Number(value);
return Number.isFinite(parsed) ? parsed : 0;
};
const toOptionalConfidence = (value: unknown) => {
const parsed = Number(value);
if (!Number.isFinite(parsed) || parsed <= 0) {
return undefined;
}
return parsed;
};
const normalizeWords = (words: RawAsrWord[] | undefined): VolcengineAsrWord[] =>
Array.isArray(words)
? words
.map((word) => ({
startTimeMs: toNumber(word.start_time),
endTimeMs: toNumber(word.end_time),
text: String(word.text || '').trim(),
confidence: toOptionalConfidence(word.confidence),
}))
.filter((word) => word.text)
: [];
const normalizeUtterances = (utterances: RawAsrUtterance[] | undefined): VolcengineAsrUtterance[] =>
Array.isArray(utterances)
? utterances
.map((utterance) => ({
startTimeMs: toNumber(utterance.start_time),
endTimeMs: toNumber(utterance.end_time),
text: String(utterance.text || '').trim(),
words: normalizeWords(utterance.words),
}))
.filter((utterance) => utterance.text)
: [];
export const recognizeAudioWithVolcengineAsr = async ({
config,
audioBase64,
fetchImpl = fetch,
requestId,
}: {
config: VolcengineAsrConfig;
audioBase64: string;
fetchImpl?: typeof fetch;
requestId?: string;
}): Promise<VolcengineAsrResult> => {
const resolvedRequestId = requestId || crypto.randomUUID();
const requestBody = {
user: {
uid: config.appKey,
},
audio: {
data: audioBase64,
},
request: {
model_name: config.modelName,
},
};
logEvent({
level: 'info',
message: '[subtitle] volcengine asr request started',
context: {
requestId: resolvedRequestId,
asrAppKey: config.appKey,
asrResourceId: config.resourceId,
asrBaseUrl: config.baseUrl,
audioBase64Length: audioBase64.length,
hasAccessKey: Boolean(config.accessKey),
},
details: {
requestBody,
},
});
let response: Response;
try {
response = await fetchImpl(config.baseUrl, {
method: 'POST',
signal: AbortSignal.timeout(config.timeoutMs),
headers: {
'Content-Type': 'application/json',
'X-Api-App-Key': config.appKey,
'X-Api-Access-Key': config.accessKey,
'X-Api-Resource-Id': config.resourceId,
'X-Api-Request-Id': resolvedRequestId,
'X-Api-Sequence': '-1',
},
body: JSON.stringify(requestBody),
});
} catch (error) {
logEvent({
level: 'error',
message: '[subtitle] volcengine asr request failed',
context: {
requestId: resolvedRequestId,
asrAppKey: config.appKey,
asrResourceId: config.resourceId,
asrBaseUrl: config.baseUrl,
},
details: serializeError(error),
});
throw error;
}
const rawText = await response.text();
const statusCode = response.headers.get('X-Api-Status-Code')?.trim() || '';
const payload = rawText ? (JSON.parse(rawText) as RawAsrResponse) : {};
logEvent({
level: response.ok && (!statusCode || statusCode === SUCCESS_CODE) ? 'info' : 'error',
message: '[subtitle] volcengine asr response received',
context: {
requestId: resolvedRequestId,
asrAppKey: config.appKey,
asrResourceId: config.resourceId,
asrBaseUrl: config.baseUrl,
httpStatus: response.status,
apiStatusCode: statusCode || undefined,
},
details: {
rawResponseText: rawText,
},
});
if (!response.ok) {
throw new Error(`Volcengine ASR request failed (${response.status}).`);
}
if (statusCode && statusCode !== SUCCESS_CODE) {
throw new Error(`Volcengine ASR failed with code ${statusCode}: ${payload.message || 'Unknown error'}`);
}
return {
text: payload.result?.text?.trim() || '',
utterances: normalizeUtterances(payload.result?.utterances),
};
};
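The flash ASR call is a single JSON POST whose authentication travels in `X-Api-*` headers rather than in the body, and success is signalled by the `X-Api-Status-Code` response header (`20000000`) rather than by HTTP status alone. A sketch that only assembles the request the function above sends (no network; credential values are placeholders and the helper name is hypothetical):

```typescript
// Assembles the headers and body that recognizeAudioWithVolcengineAsr posts.
// appKey/accessKey/requestId are caller-supplied placeholders here.
const buildFlashAsrRequest = (appKey: string, accessKey: string, audioBase64: string, requestId: string) => ({
  headers: {
    'Content-Type': 'application/json',
    'X-Api-App-Key': appKey,
    'X-Api-Access-Key': accessKey,
    'X-Api-Resource-Id': 'volc.bigasr.auc_turbo',
    'X-Api-Request-Id': requestId,
    'X-Api-Sequence': '-1', // -1 marks a single, non-streaming request
  },
  body: JSON.stringify({
    user: { uid: appKey },
    audio: { data: audioBase64 }, // raw audio, base64-encoded inline
    request: { model_name: 'bigmodel' },
  }),
});
```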

View File

@ -145,51 +145,10 @@ describe('generateSubtitlePipeline', () => {
expect(formData.get('ttsLanguage')).toBe('fr');
});
it('uploads doubao videos to ark files before requesting subtitles', async () => {
it('submits doubao videos to the server as multipart so transcription can extract audio', async () => {
vi.useFakeTimers();
const fetchMock = vi
.fn()
.mockResolvedValueOnce(
new Response(
JSON.stringify({
id: 'file-123',
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
},
},
),
)
.mockResolvedValueOnce(
new Response(
JSON.stringify({
id: 'file-123',
status: 'processing',
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
},
},
),
)
.mockResolvedValueOnce(
new Response(
JSON.stringify({
id: 'file-123',
status: 'active',
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
},
},
),
)
.mockResolvedValueOnce(
new Response(
JSON.stringify({
@ -242,102 +201,21 @@ describe('generateSubtitlePipeline', () => {
expect(fetchMock).toHaveBeenNthCalledWith(
1,
'https://ark.cn-beijing.volces.com/api/v3/files',
'/api/generate-subtitles',
expect.objectContaining({
method: 'POST',
body: expect.any(FormData),
}),
);
expect(fetchMock).toHaveBeenNthCalledWith(
2,
'https://ark.cn-beijing.volces.com/api/v3/files/file-123',
expect.objectContaining({
method: 'GET',
headers: {
Authorization: 'Bearer ark-key',
},
}),
);
const [, subtitleRequest] = fetchMock.mock.calls[0] as unknown as [string, RequestInit];
const formData = subtitleRequest.body as FormData;
expect(fetchMock).toHaveBeenNthCalledWith(
3,
'https://ark.cn-beijing.volces.com/api/v3/files/file-123',
expect.objectContaining({
method: 'GET',
headers: {
Authorization: 'Bearer ark-key',
},
}),
);
const [, subtitleRequest] = fetchMock.mock.calls[3] as unknown as [string, RequestInit];
const subtitleBody = JSON.parse(String(subtitleRequest.body));
expect(fetchMock).toHaveBeenNthCalledWith(
4,
'/api/generate-subtitles',
expect.objectContaining({
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
}),
);
expect(subtitleBody).toEqual({
fileId: 'file-123',
provider: 'doubao',
targetLanguage: 'English',
ttsLanguage: 'English',
});
expect(fetchMock).toHaveBeenNthCalledWith(5, '/api/generate-subtitles/job-1', { method: 'GET' });
});
it('stops when ark reports file preprocessing failure', async () => {
const fetchMock = vi
.fn()
.mockResolvedValueOnce(
new Response(
JSON.stringify({
id: 'file-123',
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
},
},
),
)
.mockResolvedValueOnce(
new Response(
JSON.stringify({
id: 'file-123',
status: 'failed',
error: {
message: 'video preprocess failed',
},
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
},
},
),
);
await expect(
generateSubtitlePipeline(
new File(['video'], 'clip.mp4', { type: 'video/mp4' }),
'English',
'doubao',
null,
fetchMock as unknown as typeof fetch,
),
).rejects.toThrow(/video preprocess failed/i);
expect(fetchMock).toHaveBeenCalledTimes(2);
expect(formData.get('video')).toBeInstanceOf(File);
expect(formData.get('provider')).toBe('doubao');
expect(formData.get('targetLanguage')).toBe('English');
expect(formData.get('ttsLanguage')).toBe('English');
expect(fetchMock).toHaveBeenNthCalledWith(2, '/api/generate-subtitles/job-1', { method: 'GET' });
});
it('keeps multipart uploads for gemini requests', async () => {
@ -502,33 +380,6 @@ describe('generateSubtitlePipeline', () => {
vi.useFakeTimers();
const fetchMock = vi
.fn()
.mockResolvedValueOnce(
new Response(
JSON.stringify({
id: 'file-123',
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
},
},
),
)
.mockResolvedValueOnce(
new Response(
JSON.stringify({
id: 'file-123',
status: 'active',
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
},
},
),
)
.mockResolvedValueOnce(
new Response(
JSON.stringify({
@ -567,9 +418,9 @@ describe('generateSubtitlePipeline', () => {
);
const promise = generateSubtitlePipeline(
new File(['video'], 'clip.mp4', { type: 'video/mp4' }),
'English',
'doubao',
new File(['video'], 'clip.mp4', { type: 'video/mp4' }),
'English',
'doubao',
null,
fetchMock as unknown as typeof fetch,
);

View File

@ -5,9 +5,6 @@ type JsonResponseResult<T> =
| { ok: true; status: number; data: T }
| { ok: false; status: number; error: string };
const ARK_FILES_URL = 'https://ark.cn-beijing.volces.com/api/v3/files';
const ARK_FILE_STATUS_POLL_INTERVAL_MS = 1000;
const ARK_FILE_STATUS_TIMEOUT_MS = 120000;
const SUBTITLE_JOB_POLL_INTERVAL_MS = 5000;
const SUBTITLE_JOB_TIMEOUT_MS = 20 * 60 * 1000;
@ -54,41 +51,6 @@ const readJsonResponseOnce = async <T>(resp: Response): Promise<JsonResponseResu
};
};
const uploadDoubaoVideoFile = async (
videoFile: File,
fetchImpl: typeof fetch,
): Promise<{ fileId: string; apiKey: string }> => {
const apiKey = import.meta.env.VITE_ARK_API_KEY?.trim();
if (!apiKey) {
throw new Error('VITE_ARK_API_KEY is required for frontend Doubao file uploads.');
}
const formData = new FormData();
formData.append('purpose', 'user_data');
formData.append('file', videoFile);
formData.append('preprocess_configs[video][fps]', '1');
const resp = await fetchImpl(ARK_FILES_URL, {
method: 'POST',
headers: {
Authorization: `Bearer ${apiKey}`,
},
body: formData,
});
const parsed = await readJsonResponseOnce<{ id?: string }>(resp);
if (parsed.ok === false) {
throw new Error(parsed.error);
}
const fileId = parsed.data.id?.trim();
if (!fileId) {
throw new Error('Ark Files API did not return a file id.');
}
return { fileId, apiKey };
};
const sleep = (durationMs: number) =>
new Promise((resolve) => {
setTimeout(resolve, durationMs);
@ -148,50 +110,6 @@ const pollSubtitleJob = async (
}
};
const waitForArkFileToBecomeActive = async (
fileId: string,
apiKey: string,
fetchImpl: typeof fetch,
): Promise<void> => {
const deadline = Date.now() + ARK_FILE_STATUS_TIMEOUT_MS;
while (true) {
const resp = await fetchImpl(`${ARK_FILES_URL}/${fileId}`, {
method: 'GET',
headers: {
Authorization: `Bearer ${apiKey}`,
},
});
const parsed = await readJsonResponseOnce<{
status?: string;
error?: { message?: string } | string;
}>(resp);
if (parsed.ok === false) {
throw new Error(parsed.error);
}
const status = parsed.data.status?.trim().toLowerCase();
if (status === 'active') {
return;
}
if (status === 'failed') {
const errorMessage =
typeof parsed.data.error === 'string'
? parsed.data.error
: parsed.data.error?.message || 'Ark file preprocessing failed.';
throw new Error(errorMessage);
}
if (Date.now() >= deadline) {
throw new Error('Timed out while waiting for Ark file preprocessing to complete.');
}
await sleep(ARK_FILE_STATUS_POLL_INTERVAL_MS);
}
};
export const generateSubtitlePipeline = async (
videoFile: File,
targetLanguage: string,
@ -207,41 +125,6 @@ export const generateSubtitlePipeline = async (
const resolvedTtsLanguage = ttsLanguage?.trim() || targetLanguage;
if (provider === 'doubao') {
const { fileId, apiKey } = await uploadDoubaoVideoFile(videoFile, fetchImpl);
await waitForArkFileToBecomeActive(fileId, apiKey, fetchImpl);
const resp = await fetchImpl(apiUrl('/generate-subtitles'), {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
fileId,
targetLanguage,
ttsLanguage: resolvedTtsLanguage,
provider,
...(trimRange ? { trimRange } : {}),
}),
});
const parsed = await readJsonResponseOnce<SubtitleJobResponse>(resp);
if (parsed.ok === false) {
const error = new Error(parsed.error);
(error as any).status = resp.status;
throw error;
}
const job = parsed.data as unknown as SubtitleJobResponse;
onProgress?.(job);
return pollSubtitleJob(
job.jobId,
targetLanguage,
resolvedTtsLanguage,
job.pollTimeoutMs ?? SUBTITLE_JOB_TIMEOUT_MS,
fetchImpl,
onProgress,
);
}
const formData = new FormData();
formData.append('video', videoFile);
formData.append('targetLanguage', targetLanguage);

View File

@ -29,6 +29,12 @@ export type SubtitleJobStage =
| 'queued'
| 'upload_received'
| 'preparing'
| 'transcribing'
| 'speakerCasting'
| 'segmenting'
| 'translating'
| 'matching_voice'
| 'validating'
| 'calling_provider'
| 'processing_result'
| 'succeeded'
@ -48,6 +54,25 @@ export interface SpeakerTrack {
gender?: 'male' | 'female' | 'unknown';
}
export interface SpeakerCastingSpeaker {
speakerId: string;
label: string;
gender: 'male' | 'female' | 'unknown';
voiceId?: string;
confidence?: number;
reason?: string;
}
export interface SpeakerCastingAssignment {
segmentId: string;
speakerId: string;
}
export interface SpeakerCastingResult {
speakers: SpeakerCastingSpeaker[];
segmentAssignments: SpeakerCastingAssignment[];
}
export interface SubtitlePipelineResult {
subtitles: Subtitle[];
speakers: SpeakerTrack[];
@ -57,6 +82,18 @@ export interface SubtitlePipelineResult {
ttsLanguage?: string;
duration?: number;
alignmentEngine?: string;
diagnostics?: {
validationIssues?: Array<{
subtitleId: string;
code: string;
message: string;
severity: 'warning' | 'error';
}>;
speakerCasting?: SpeakerCastingResult;
stageDurationsMs?: Partial<
Record<'transcription' | 'segmentation' | 'translation' | 'speakerCasting' | 'voiceMatching' | 'validation', number>
>;
};
}
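The `stageDurationsMs` diagnostics map implies each pipeline stage is timed individually. One hypothetical way to populate it — `timeStage` is invented for illustration; only the stage-name union comes from the types above:

```typescript
type StageName = 'transcription' | 'segmentation' | 'translation' | 'speakerCasting' | 'voiceMatching' | 'validation';

// Runs one stage and records its wall-clock duration into the diagnostics
// map, even when the stage throws, so partial runs still report timings.
const timeStage = async <T>(
  name: StageName,
  durations: Partial<Record<StageName, number>>,
  run: () => Promise<T>,
): Promise<T> => {
  const startedAt = Date.now();
  try {
    return await run();
  } finally {
    durations[name] = Date.now() - startedAt;
  }
};
```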
export interface SubtitleGenerationProgress {