# Precise Dialogue Localization Design

**Date:** 2026-03-17

**Goal:** Upgrade the subtitle pipeline so sentence boundaries are more accurate, word-level timings are available, and speaker attribution is based on audio rather than LLM guesses.
## Current State

The current implementation has two subtitle generation paths:

1. The primary path in `server.ts` extracts audio, calls Whisper with `timestamp_granularities: ['segment']`, then asks an LLM to translate and infer `speaker` and `gender`.
2. The fallback path in `src/services/geminiService.ts` uses Gemini to infer subtitles from video or sampled frames.

This is enough for rough subtitle generation, but it has three hard limits:

1. Sentence timing is only segment-level, so start and end times drift at pause boundaries.
2. Word-level timestamps do not exist, so precise editing and karaoke-style highlighting are impossible.
3. Speaker identity is inferred from text, not measured from audio, so diarization quality is unreliable.
## Chosen Approach

Adopt a high-precision pipeline with a dedicated alignment layer:

1. Extract clean mono audio from the uploaded video.
2. Use voice activity detection (VAD) to isolate speech regions.
3. Run ASR for rough transcription.
4. Run forced alignment to refine every word boundary against the audio.
5. Run speaker diarization to assign stable `speakerId` values.
6. Rebuild editable subtitle sentences from aligned words.
7. Translate only the sentence text while preserving timestamps and speaker assignments.

The existing Node service remains the entry point, but it becomes an orchestration layer instead of doing all timing work itself.
## Architecture

### Frontend

The React editor continues to call `/api/process-audio-pipeline`, but it now receives richer subtitle objects:

1. Sentence-level timing for the timeline.
2. Word-level timing for precise playback feedback.
3. Stable `speakerId` values for speaker-aware UI and voice assignment.

The current editor can remain backward compatible by continuing to render sentence-level fields first and gradually enabling word-level behavior.
### Node Orchestration Layer

`server.ts` keeps responsibility for:

1. Receiving uploaded video data.
2. Extracting audio with FFmpeg.
3. Calling the alignment service.
4. Translating sentence text.
5. Returning a normalized payload to the frontend.

The Node layer must not allow translation to rewrite timing or speaker assignments.
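That last constraint can be enforced structurally rather than by convention: the translation step only ever writes `translatedText` and copies every other field verbatim. A minimal sketch, where `translateBatch` is a hypothetical stand-in for the Gemini call:

```typescript
type AlignedSentence = {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
};

type TranslatedSentence = AlignedSentence & { translatedText: string };

// Translation can only add translatedText; timing and speaker fields are
// spread through untouched, so the LLM response cannot rewrite them.
async function translateSentences(
  sentences: AlignedSentence[],
  translateBatch: (texts: string[]) => Promise<string[]>, // hypothetical helper
): Promise<TranslatedSentence[]> {
  const translations = await translateBatch(sentences.map((s) => s.originalText));
  return sentences.map((s, i) => ({
    ...s,
    // Fall back to the original text if the batch came back short.
    translatedText: translations[i] ?? s.originalText,
  }));
}
```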
### Alignment Layer

This layer owns all timing-critical operations:

1. VAD
2. ASR
3. Forced alignment
4. Speaker diarization

It can be implemented as a local Python service or a separately managed service, as long as it returns deterministic, machine-readable JSON.
## Data Model

The current `Subtitle` type should be extended rather than replaced.

```ts
type WordTiming = {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
};

type Subtitle = {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker: string;
  speakerId: string;
  voiceId: string;
  words: WordTiming[];
  confidence: number;
  audioUrl?: string;
  volume?: number;
};

type SpeakerTrack = {
  speakerId: string;
  label: string;
  gender?: 'male' | 'female' | 'unknown';
};
```
Rules:

1. `speakerId` is the stable machine identifier, for example `spk_0`.
2. `speaker` is a user-facing label and can be renamed.
3. Sentence `startTime` and `endTime` are derived from the first and last aligned words.
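Rule 3 can be expressed directly over the aligned words; a small sketch of the derivation:

```typescript
type AlignedWord = { text: string; startTime: number; endTime: number };

// Derive a sentence's span from its first and last aligned words (rule 3).
function sentenceSpan(words: AlignedWord[]): { startTime: number; endTime: number } {
  if (words.length === 0) {
    throw new Error("a sentence must contain at least one aligned word");
  }
  return {
    startTime: words[0].startTime,
    endTime: words[words.length - 1].endTime,
  };
}
```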
## Processing Rules

### Audio Preparation

1. Convert the uploaded video to `16 kHz` mono WAV.
2. Optionally create a denoised or vocal-enhanced copy when the source contains heavy music.
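In the Node layer, step 1 is a thin FFmpeg invocation. A sketch of one way to wrap it; `ffmpegExtractArgs` and `extractAudio` are hypothetical helper names, and FFmpeg is assumed to be on `PATH`:

```typescript
import { spawn } from "node:child_process";

// Build the FFmpeg argument list for 16 kHz mono WAV extraction.
function ffmpegExtractArgs(videoPath: string, wavPath: string): string[] {
  return [
    "-y",            // overwrite any previous output
    "-i", videoPath, // input video
    "-vn",           // drop the video stream
    "-ac", "1",      // downmix to mono
    "-ar", "16000",  // resample to 16 kHz
    "-f", "wav",
    wavPath,
  ];
}

// Run the extraction, resolving only on a clean exit.
function extractAudio(videoPath: string, wavPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const ff = spawn("ffmpeg", ffmpegExtractArgs(videoPath, wavPath));
    ff.on("error", reject);
    ff.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`)),
    );
  });
}
```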
### VAD

Use VAD to identify speech windows and pad each detected region by about `0.2s`.
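Padding can make adjacent regions overlap, so a merge pass belongs with it. A sketch, assuming times in seconds:

```typescript
type Region = { start: number; end: number };

// Pad each VAD region by `pad` seconds, then merge regions that now overlap,
// so the ASR stage receives a minimal set of non-overlapping speech windows.
function padAndMerge(regions: Region[], pad = 0.2): Region[] {
  const padded = regions
    .map((r) => ({ start: Math.max(0, r.start - pad), end: r.end + pad }))
    .sort((a, b) => a.start - b.start);
  const merged: Region[] = [];
  for (const r of padded) {
    const last = merged[merged.length - 1];
    if (last && r.start <= last.end) {
      last.end = Math.max(last.end, r.end); // overlap: extend the previous region
    } else {
      merged.push({ ...r });
    }
  }
  return merged;
}
```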
### ASR and Forced Alignment

1. Use ASR for text hypotheses and rough word order.
2. Use forced alignment to compute accurate `startTime` and `endTime` for each word.
3. Treat forced alignment as the source of truth for timing whenever available.
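Rule 3 reduces to a per-word preference when the two outputs are merged. A sketch, assuming the aligner returns a parallel array with `null` where it produced no timing for a word:

```typescript
type TimedWord = { text: string; startTime: number; endTime: number };

// Per word, prefer the forced-alignment timing; fall back to the ASR
// estimate only when the aligner produced nothing for that position.
function mergeTimings(
  asrWords: TimedWord[],
  aligned: (TimedWord | null)[],
): TimedWord[] {
  return asrWords.map((w, i) => aligned[i] ?? w);
}
```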
### Diarization

1. Run diarization separately and produce speaker segments.
2. Assign each word to the speaker with the highest overlap.
3. If a sentence crosses speakers, split it rather than forcing a mixed-speaker sentence.
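The overlap assignment in rule 2 is a short interval computation; a sketch:

```typescript
type Word = { startTime: number; endTime: number };
type SpeakerSegment = { speakerId: string; start: number; end: number };

// Duration of the intersection between a word and a diarization segment.
function overlap(w: Word, s: SpeakerSegment): number {
  return Math.max(0, Math.min(w.endTime, s.end) - Math.max(w.startTime, s.start));
}

// Assign the word to the speaker with the largest overlap; words that touch
// no segment stay "unknown" rather than being forced onto a speaker.
function assignSpeaker(word: Word, segments: SpeakerSegment[]): string {
  let best = "unknown";
  let bestOverlap = 0;
  for (const seg of segments) {
    const o = overlap(word, seg);
    if (o > bestOverlap) {
      bestOverlap = o;
      best = seg.speakerId;
    }
  }
  return best;
}
```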
### Sentence Reconstruction

Build sentence subtitles from words using conservative rules:

1. Keep words together only when `speakerId` is the same.
2. Split when adjacent word gaps exceed `0.45s`.
3. Split when sentence duration would exceed `8s`.
4. Split on strong punctuation or long pauses.
5. Avoid returning sentences shorter than `0.6s` unless the source is actually brief.
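Rules 1–4 compose into a single split predicate over consecutive words. A sketch (the `0.6s` minimum from rule 5 is left out for brevity, and the punctuation set is an assumption):

```typescript
type SentenceWord = { text: string; startTime: number; endTime: number; speakerId: string };

const MAX_GAP = 0.45;     // rule 2
const MAX_DURATION = 8;   // rule 3
const STRONG_PUNCT = /[.!?。！？]$/; // rule 4: assumed punctuation set

// Group aligned words into sentences, splitting whenever any rule fires.
function rebuildSentences(words: SentenceWord[]): SentenceWord[][] {
  const sentences: SentenceWord[][] = [];
  let current: SentenceWord[] = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    const shouldSplit =
      prev !== undefined &&
      (w.speakerId !== prev.speakerId ||                    // rule 1
        w.startTime - prev.endTime > MAX_GAP ||             // rule 2
        w.endTime - current[0].startTime > MAX_DURATION ||  // rule 3
        STRONG_PUNCT.test(prev.text));                      // rule 4
    if (shouldSplit) {
      sentences.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
}
```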
## API Design

Reuse `/api/process-audio-pipeline`, but upgrade its payload to:

```json
{
  "subtitles": [],
  "speakers": [],
  "sourceLanguage": "zh",
  "targetLanguage": "en",
  "duration": 123.45,
  "quality": "full",
  "alignmentEngine": "whisperx+pyannote"
}
```
Quality levels:

1. `full`: sentence timings, word timings, and diarization are all available.
2. `partial`: word timings are available but diarization is missing or unreliable.
3. `fallback`: high-precision alignment failed, so the app returns rough timing from the existing path.
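The level can be derived mechanically from what the alignment layer actually returned, so no stage ever has to guess it. A sketch with a hypothetical helper name:

```typescript
type Quality = "full" | "partial" | "fallback";

// Map the presence of word timings and diarization onto the quality ladder.
function resolveQuality(hasWordTimings: boolean, hasDiarization: boolean): Quality {
  if (hasWordTimings && hasDiarization) return "full";
  if (hasWordTimings) return "partial";
  return "fallback";
}
```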
## Frontend Behavior

The current editor in `src/components/EditorScreen.tsx` should evolve incrementally:

1. Keep the existing sentence-based timeline as the default view.
2. Add word-level highlighting during playback when `words` exist.
3. Add speaker-aware styling and filtering when `speakers` exist.
4. Preserve manual timeline editing and snap dragged sentence edges to the nearest word boundaries when possible.
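The snapping in point 4 is a nearest-boundary search with a tolerance, so that drags far from any word stay free-form. A sketch, with the threshold value chosen for illustration:

```typescript
// Snap a dragged edge time to the nearest word boundary within `threshold`
// seconds; return the original time unchanged if no boundary is close enough.
function snapToWordBoundary(t: number, boundaries: number[], threshold = 0.15): number {
  let best = t;
  let bestDist = threshold;
  for (const b of boundaries) {
    const d = Math.abs(b - t);
    if (d <= bestDist) {
      bestDist = d;
      best = b;
    }
  }
  return best;
}
```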
Fallback behavior:

1. If `quality` is `full`, enable all precision UI.
2. If `quality` is `partial`, disable speaker-specific UI and keep timing features.
3. If `quality` is `fallback`, continue with the current editor and show a low-precision notice.
## Error Handling and Degradation

The product must remain usable even when the high-precision path is incomplete.

1. If forced alignment fails, return sentence-level ASR output instead of failing the whole request.
2. If diarization fails, keep timings and mark `speakerId` as `unknown`.
3. If translation fails, return original text with timings intact.
4. If the alignment layer is unavailable, fall back to the existing visual pipeline and set `quality: "fallback"`.
5. Preserve low-confidence words and expose their confidence rather than dropping them silently.
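Rules 1–4 share one shape: run a stage, and on failure degrade instead of propagating. A small generic wrapper (the name `tryStage` is hypothetical) lets each caller choose its own downgrade:

```typescript
// A stage either succeeds with a value or fails without throwing past
// this boundary; the caller decides what degraded result to substitute.
type StageResult<T> = { ok: true; value: T } | { ok: false };

async function tryStage<T>(stage: () => Promise<T>): Promise<StageResult<T>> {
  try {
    return { ok: true, value: await stage() };
  } catch {
    return { ok: false };
  }
}
```

For example, a failed diarization stage would yield `{ ok: false }`, and the orchestrator would then keep timings and stamp `speakerId: "unknown"` rather than rejecting the request.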
## Testing Strategy

Coverage should focus on deterministic logic:

1. Sentence reconstruction from aligned words.
2. Speaker assignment from overlapping diarization segments.
3. API normalization and fallback handling.
4. Frontend word-highlighting and snapping helpers.

End-to-end manual verification should include:

1. Single-speaker clip with pauses.
2. Two-speaker dialogue with interruptions.
3. Music-heavy clip.
4. Alignment failure fallback.
## Rollout Plan

1. Extend types and response normalization first.
2. Introduce the alignment adapter behind a feature flag or environment guard.
3. Return richer payloads while keeping the current UI backward compatible.
4. Add word-level highlighting and speaker-aware UI after the backend contract stabilizes.
## Constraints and Notes

1. This workspace is not a Git repository, so the required design-document commit could not be performed here.
2. The current project does not yet include a test runner, so the implementation plan includes test infrastructure setup before feature work.
## Implementation Status

Implemented in this workspace:

1. Test infrastructure using Vitest, jsdom, and Testing Library.
2. Shared subtitle pipeline helpers for normalization, sentence reconstruction, speaker assignment, word highlighting, and timeline snapping.
3. A backend subtitle orchestration layer plus an alignment-service adapter boundary for local ASR / alignment backends.
4. Gemini-based sentence translation in the audio pipeline, without relying on OpenAI for ASR or translation.
5. Frontend pipeline mapping, precision notices, word-level playback feedback, and speaker-aware presentation.

Automated verification completed:

1. `npm test -- --run`
2. `npm run lint`
3. `npm run build`

Manual verification still pending:

1. Single-speaker clip with pauses.
2. Two-speaker dialogue with interruptions.
3. Music-heavy clip.
4. Alignment-service unavailable fallback using a real upload.
|