# Precise Dialogue Localization Design

**Date:** 2026-03-17

**Goal:** Upgrade the subtitle pipeline so sentence boundaries are more accurate, word-level timings are available, and speaker attribution is based on audio rather than LLM guesses.
## Current State

The current implementation has two subtitle generation paths:

1. The primary path in `server.ts` extracts audio, calls Whisper with `timestamp_granularities: ['segment']`, then asks an LLM to translate and infer `speaker` and `gender`.
2. The fallback path in `src/services/geminiService.ts` uses Gemini to infer subtitles from video or sampled frames.

This is enough for rough subtitle generation, but it has three hard limits:

1. Sentence timing is only segment-level, so start and end times drift at pause boundaries.
2. Word-level timestamps do not exist, so precise editing and karaoke-style highlighting are impossible.
3. Speaker identity is inferred from text, not measured from audio, so diarization quality is unreliable.
## Chosen Approach

Adopt a high-precision pipeline with a dedicated alignment layer:

1. Extract clean mono audio from the uploaded video.
2. Use voice activity detection (VAD) to isolate speech regions.
3. Run ASR for rough transcription.
4. Run forced alignment to refine every word boundary against the audio.
5. Run speaker diarization to assign stable `speakerId` values.
6. Rebuild editable subtitle sentences from aligned words.
7. Translate only the sentence text while preserving timestamps and speaker assignments.

The existing Node service remains the entry point, but it becomes an orchestration layer instead of doing all timing work itself.
## Architecture

### Frontend

The React editor continues to call `/api/process-audio-pipeline`, but it now receives richer subtitle objects:

1. Sentence-level timing for the timeline.
2. Word-level timing for precise playback feedback.
3. Stable `speakerId` values for speaker-aware UI and voice assignment.

The current editor can remain backward compatible by continuing to render sentence-level fields first and gradually enabling word-level behavior.
### Node Orchestration Layer

`server.ts` keeps responsibility for:

1. Receiving uploaded video data.
2. Extracting audio with FFmpeg.
3. Calling the alignment service.
4. Translating sentence text.
5. Returning a normalized payload to the frontend.

The Node layer must not allow translation to rewrite timing or speaker assignments.
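That last constraint can be enforced structurally rather than by convention: the translation step only ever writes `translatedText` and copies every other field verbatim. A minimal sketch, where `translateBatch` is a hypothetical stand-in for the Gemini call:

```typescript
type AlignedSentence = {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
};

type TranslatedSentence = AlignedSentence & { translatedText: string };

// Translation can only add translatedText; timing and speaker fields are
// spread through untouched, so the LLM response cannot rewrite them.
async function translateSentences(
  sentences: AlignedSentence[],
  translateBatch: (texts: string[]) => Promise<string[]>, // hypothetical helper
): Promise<TranslatedSentence[]> {
  const translations = await translateBatch(sentences.map((s) => s.originalText));
  return sentences.map((s, i) => ({
    ...s,
    // Fall back to the original text if the batch came back short.
    translatedText: translations[i] ?? s.originalText,
  }));
}
```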
### Alignment Layer

This layer owns all timing-critical operations:

1. VAD
2. ASR
3. Forced alignment
4. Speaker diarization

It can be implemented as a local Python service or a separately managed service, as long as it returns deterministic, machine-readable JSON.
## Data Model

The current `Subtitle` type should be extended rather than replaced.

```ts
type WordTiming = {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
};

type Subtitle = {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker: string;
  speakerId: string;
  voiceId: string;
  words: WordTiming[];
  confidence: number;
  audioUrl?: string;
  volume?: number;
};

type SpeakerTrack = {
  speakerId: string;
  label: string;
  gender?: 'male' | 'female' | 'unknown';
};
```
Rules:

1. `speakerId` is the stable machine identifier, for example `spk_0`.
2. `speaker` is a user-facing label and can be renamed.
3. Sentence `startTime` and `endTime` are derived from the first and last aligned words.
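Rule 3 can be expressed directly over the aligned words; a small sketch of the derivation:

```typescript
type AlignedWord = { text: string; startTime: number; endTime: number };

// Derive a sentence's span from its first and last aligned words (rule 3).
function sentenceSpan(words: AlignedWord[]): { startTime: number; endTime: number } {
  if (words.length === 0) {
    throw new Error("a sentence must contain at least one aligned word");
  }
  return {
    startTime: words[0].startTime,
    endTime: words[words.length - 1].endTime,
  };
}
```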
## Processing Rules

### Audio Preparation

1. Convert the uploaded video to `16 kHz` mono WAV.
2. Optionally create a denoised or vocal-enhanced copy when the source contains heavy music.
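In the Node layer, step 1 is a thin FFmpeg invocation. A sketch of one way to wrap it; `ffmpegExtractArgs` and `extractAudio` are hypothetical helper names, and FFmpeg is assumed to be on `PATH`:

```typescript
import { spawn } from "node:child_process";

// Build the FFmpeg argument list for 16 kHz mono WAV extraction.
function ffmpegExtractArgs(videoPath: string, wavPath: string): string[] {
  return [
    "-y",            // overwrite any previous output
    "-i", videoPath, // input video
    "-vn",           // drop the video stream
    "-ac", "1",      // downmix to mono
    "-ar", "16000",  // resample to 16 kHz
    "-f", "wav",
    wavPath,
  ];
}

// Run the extraction, resolving only on a clean exit.
function extractAudio(videoPath: string, wavPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const ff = spawn("ffmpeg", ffmpegExtractArgs(videoPath, wavPath));
    ff.on("error", reject);
    ff.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`)),
    );
  });
}
```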
### VAD

Use VAD to identify speech windows and pad each detected region by about `0.2s`.
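Padding can make adjacent regions overlap, so a merge pass belongs with it. A sketch, assuming times in seconds:

```typescript
type Region = { start: number; end: number };

// Pad each VAD region by `pad` seconds, then merge regions that now overlap,
// so the ASR stage receives a minimal set of non-overlapping speech windows.
function padAndMerge(regions: Region[], pad = 0.2): Region[] {
  const padded = regions
    .map((r) => ({ start: Math.max(0, r.start - pad), end: r.end + pad }))
    .sort((a, b) => a.start - b.start);
  const merged: Region[] = [];
  for (const r of padded) {
    const last = merged[merged.length - 1];
    if (last && r.start <= last.end) {
      last.end = Math.max(last.end, r.end); // overlap: extend the previous region
    } else {
      merged.push({ ...r });
    }
  }
  return merged;
}
```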
### ASR and Forced Alignment

1. Use ASR for text hypotheses and rough word order.
2. Use forced alignment to compute accurate `startTime` and `endTime` for each word.
3. Treat forced alignment as the source of truth for timing whenever available.
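Rule 3 reduces to a per-word preference when the two outputs are merged. A sketch, assuming the aligner returns a parallel array with `null` where it produced no timing for a word:

```typescript
type TimedWord = { text: string; startTime: number; endTime: number };

// Per word, prefer the forced-alignment timing; fall back to the ASR
// estimate only when the aligner produced nothing for that position.
function mergeTimings(
  asrWords: TimedWord[],
  aligned: (TimedWord | null)[],
): TimedWord[] {
  return asrWords.map((w, i) => aligned[i] ?? w);
}
```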
### Diarization

1. Run diarization separately and produce speaker segments.
2. Assign each word to the speaker with the highest overlap.
3. If a sentence crosses speakers, split it rather than forcing a mixed-speaker sentence.
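The overlap assignment in rule 2 is a short interval computation; a sketch:

```typescript
type Word = { startTime: number; endTime: number };
type SpeakerSegment = { speakerId: string; start: number; end: number };

// Duration of the intersection between a word and a diarization segment.
function overlap(w: Word, s: SpeakerSegment): number {
  return Math.max(0, Math.min(w.endTime, s.end) - Math.max(w.startTime, s.start));
}

// Assign the word to the speaker with the largest overlap; words that touch
// no segment stay "unknown" rather than being forced onto a speaker.
function assignSpeaker(word: Word, segments: SpeakerSegment[]): string {
  let best = "unknown";
  let bestOverlap = 0;
  for (const seg of segments) {
    const o = overlap(word, seg);
    if (o > bestOverlap) {
      bestOverlap = o;
      best = seg.speakerId;
    }
  }
  return best;
}
```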
### Sentence Reconstruction

Build sentence subtitles from words using conservative rules:

1. Keep words together only when `speakerId` is the same.
2. Split when adjacent word gaps exceed `0.45s`.
3. Split when sentence duration would exceed `8s`.
4. Split on strong punctuation or long pauses.
5. Avoid returning sentences shorter than `0.6s` unless the source is actually brief.
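Rules 1–4 compose into a single split predicate over consecutive words. A sketch (the `0.6s` minimum from rule 5 is left out for brevity, and the punctuation set is an assumption):

```typescript
type SentenceWord = { text: string; startTime: number; endTime: number; speakerId: string };

const MAX_GAP = 0.45;     // rule 2
const MAX_DURATION = 8;   // rule 3
const STRONG_PUNCT = /[.!?。！？]$/; // rule 4: assumed punctuation set

// Group aligned words into sentences, splitting whenever any rule fires.
function rebuildSentences(words: SentenceWord[]): SentenceWord[][] {
  const sentences: SentenceWord[][] = [];
  let current: SentenceWord[] = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    const shouldSplit =
      prev !== undefined &&
      (w.speakerId !== prev.speakerId ||                    // rule 1
        w.startTime - prev.endTime > MAX_GAP ||             // rule 2
        w.endTime - current[0].startTime > MAX_DURATION ||  // rule 3
        STRONG_PUNCT.test(prev.text));                      // rule 4
    if (shouldSplit) {
      sentences.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
}
```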
## API Design

Reuse `/api/process-audio-pipeline`, but upgrade its payload to:

```json
{
  "subtitles": [],
  "speakers": [],
  "sourceLanguage": "zh",
  "targetLanguage": "en",
  "duration": 123.45,
  "quality": "full",
  "alignmentEngine": "whisperx+pyannote"
}
```
Quality levels:

1. `full`: sentence timings, word timings, and diarization are all available.
2. `partial`: word timings are available but diarization is missing or unreliable.
3. `fallback`: high-precision alignment failed, so the app returns rough timing from the existing path.
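The level can be derived mechanically from what the alignment layer actually returned, so no stage ever has to guess it. A sketch with a hypothetical helper name:

```typescript
type Quality = "full" | "partial" | "fallback";

// Map the presence of word timings and diarization onto the quality ladder.
function resolveQuality(hasWordTimings: boolean, hasDiarization: boolean): Quality {
  if (hasWordTimings && hasDiarization) return "full";
  if (hasWordTimings) return "partial";
  return "fallback";
}
```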
## Frontend Behavior

The current editor in `src/components/EditorScreen.tsx` should evolve incrementally:

1. Keep the existing sentence-based timeline as the default view.
2. Add word-level highlighting during playback when `words` exist.
3. Add speaker-aware styling and filtering when `speakers` exist.
4. Preserve manual timeline editing and snap dragged sentence edges to the nearest word boundaries when possible.
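The snapping in point 4 is a nearest-boundary search with a tolerance, so that drags far from any word stay free-form. A sketch, with the threshold value chosen for illustration:

```typescript
// Snap a dragged edge time to the nearest word boundary within `threshold`
// seconds; return the original time unchanged if no boundary is close enough.
function snapToWordBoundary(t: number, boundaries: number[], threshold = 0.15): number {
  let best = t;
  let bestDist = threshold;
  for (const b of boundaries) {
    const d = Math.abs(b - t);
    if (d <= bestDist) {
      bestDist = d;
      best = b;
    }
  }
  return best;
}
```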
Fallback behavior:

1. If `quality` is `full`, enable all precision UI.
2. If `quality` is `partial`, disable speaker-specific UI and keep timing features.
3. If `quality` is `fallback`, continue with the current editor and show a low-precision notice.
## Error Handling and Degradation

The product must remain usable even when the high-precision path is incomplete.

1. If forced alignment fails, return sentence-level ASR output instead of failing the whole request.
2. If diarization fails, keep timings and mark `speakerId` as `unknown`.
3. If translation fails, return original text with timings intact.
4. If the alignment layer is unavailable, fall back to the existing visual pipeline and set `quality: "fallback"`.
5. Preserve low-confidence words and expose their confidence rather than dropping them silently.
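Rules 1–4 share one shape: run a stage, and on failure degrade instead of propagating. A small generic wrapper (the name `tryStage` is hypothetical) lets each caller choose its own downgrade:

```typescript
// A stage either succeeds with a value or fails without throwing past
// this boundary; the caller decides what degraded result to substitute.
type StageResult<T> = { ok: true; value: T } | { ok: false };

async function tryStage<T>(stage: () => Promise<T>): Promise<StageResult<T>> {
  try {
    return { ok: true, value: await stage() };
  } catch {
    return { ok: false };
  }
}
```

For example, a failed diarization stage would yield `{ ok: false }`, and the orchestrator would then keep timings and stamp `speakerId: "unknown"` rather than rejecting the request.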
## Testing Strategy

Coverage should focus on deterministic logic:

1. Sentence reconstruction from aligned words.
2. Speaker assignment from overlapping diarization segments.
3. API normalization and fallback handling.
4. Frontend word-highlighting and snapping helpers.

End-to-end manual verification should include:

1. Single-speaker clip with pauses.
2. Two-speaker dialogue with interruptions.
3. Music-heavy clip.
4. Alignment failure fallback.
## Rollout Plan

1. Extend types and response normalization first.
2. Introduce the alignment adapter behind a feature flag or environment guard.
3. Return richer payloads while keeping the current UI backward compatible.
4. Add word-level highlighting and speaker-aware UI after the backend contract stabilizes.
## Constraints and Notes

1. This workspace is not a Git repository, so the required design-document commit could not be performed here.
2. The current project does not yet include a test runner, so the implementation plan includes test infrastructure setup before feature work.
## Implementation Status

Implemented in this workspace:

1. Test infrastructure using Vitest, jsdom, and Testing Library.
2. Shared subtitle pipeline helpers for normalization, sentence reconstruction, speaker assignment, word highlighting, and timeline snapping.
3. A backend subtitle orchestration layer plus an alignment-service adapter boundary for local ASR / alignment backends.
4. Gemini-based sentence translation in the audio pipeline, without relying on OpenAI for ASR or translation.
5. Frontend pipeline mapping, precision notices, word-level playback feedback, and speaker-aware presentation.

Automated verification completed:

1. `npm test -- --run`
2. `npm run lint`
3. `npm run build`

Manual verification still pending:

1. Single-speaker clip with pauses.
2. Two-speaker dialogue with interruptions.
3. Music-heavy clip.
4. Alignment-service unavailable fallback using a real upload.
|