video_translate/docs/plans/2026-03-17-precise-dialogue-localization.md
2026-03-18 11:42:00 +08:00


# Precise Dialogue Localization Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Build a high-precision subtitle pipeline that returns accurate sentence boundaries, word-level timings, and real speaker attribution while preserving the current editor flow.
**Architecture:** Keep the React app and `server.ts` as the public entry points, but move timing-critical work into a dedicated alignment adapter. The backend normalizes aligned words into sentence subtitles, translates text without changing timing, and returns quality metadata so the editor can enable or disable precision UI safely.
**Tech Stack:** React 19, TypeScript, Vite, Express, FFmpeg, OpenAI SDK, a new test runner (`vitest`), and a high-precision alignment backend adapter.
---
### Task 1: Add Test Infrastructure
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\package.json`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\vitest.config.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\test\setup.ts`
**Step 1: Write the failing test**
Create a minimal smoke test first so the test runner has a real target.
```ts
import { describe, expect, it } from 'vitest';

describe('test harness', () => {
  it('runs vitest in this workspace', () => {
    expect(true).toBe(true);
  });
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run`
Expected: FAIL because no `test` script or Vitest config exists yet.
**Step 3: Write minimal implementation**
1. Add `test` and `test:watch` scripts to `package.json`.
2. Add dev dependencies for `vitest`.
3. Create `vitest.config.ts` with a Node environment default.
4. Add `src/test/setup.ts` for shared setup.
```ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    environment: 'node',
    setupFiles: ['./src/test/setup.ts'],
  },
});
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run`
Expected: PASS with the smoke test.
**Step 5: Commit**
```bash
git add package.json vitest.config.ts src/test/setup.ts
git commit -m "test: add vitest infrastructure"
```
### Task 2: Extract Subtitle Pipeline Types and Normalizers
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\types.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.test.ts`
**Step 1: Write the failing test**
Write tests for normalization from aligned word payloads to UI-ready subtitles.
```ts
it('derives subtitle boundaries from first and last word', () => {
  const result = normalizeAlignedSentence({
    id: 's1',
    speakerId: 'spk_0',
    words: [
      { text: 'Hello', startTime: 1.2, endTime: 1.5, speakerId: 'spk_0', confidence: 0.99 },
      { text: 'world', startTime: 1.6, endTime: 2.0, speakerId: 'spk_0', confidence: 0.98 },
    ],
    originalText: 'Hello world',
    translatedText: '你好世界',
  });

  expect(result.startTime).toBe(1.2);
  expect(result.endTime).toBe(2.0);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/subtitlePipeline.test.ts`
Expected: FAIL because the new module and extended types do not exist.
**Step 3: Write minimal implementation**
1. Extend `Subtitle` in `src/types.ts` with `speakerId`, `words`, and `confidence`.
2. Create a pure helper module that normalizes backend payloads into frontend subtitles.
```ts
export const deriveSubtitleBounds = (words: WordTiming[]) => ({
  startTime: words[0]?.startTime ?? 0,
  endTime: words[words.length - 1]?.endTime ?? 0,
});
```
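Building on that bounds helper, a minimal sketch of `normalizeAlignedSentence` could look like the following. The `AlignedSentence` and `WordTiming` shapes are assumptions inferred from the test fixture above, and `deriveSubtitleBounds` is restated so the sketch is self-contained:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

// Assumed input shape, inferred from the test fixture in Step 1.
interface AlignedSentence {
  id: string;
  speakerId: string;
  words: WordTiming[];
  originalText: string;
  translatedText: string;
}

export const deriveSubtitleBounds = (words: WordTiming[]) => ({
  startTime: words[0]?.startTime ?? 0,
  endTime: words[words.length - 1]?.endTime ?? 0,
});

// Normalize a backend-aligned sentence into a UI-ready subtitle whose
// start/end come from the first and last word timings.
export const normalizeAlignedSentence = (sentence: AlignedSentence) => ({
  id: sentence.id,
  speakerId: sentence.speakerId,
  originalText: sentence.originalText,
  translatedText: sentence.translatedText,
  words: sentence.words,
  ...deriveSubtitleBounds(sentence.words),
});
```

Keeping the helper pure (no I/O, no translation) is what lets the Step 1 test run without any server or network setup.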
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/subtitlePipeline.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/types.ts src/lib/subtitlePipeline.ts src/lib/subtitlePipeline.test.ts
git commit -m "feat: add subtitle pipeline normalizers"
```
### Task 3: Implement Sentence Reconstruction Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.test.ts`
**Step 1: Write the failing test**
Cover pause splitting and speaker splitting.
```ts
it('splits sentences when speaker changes', () => {
  const result = rebuildSentences([
    { text: 'Hi', startTime: 0.0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
    { text: 'there', startTime: 0.25, endTime: 0.5, speakerId: 'spk_0', confidence: 0.9 },
    { text: 'no', startTime: 0.55, endTime: 0.7, speakerId: 'spk_1', confidence: 0.9 },
  ]);

  expect(result).toHaveLength(2);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts`
Expected: FAIL because the helper module is missing.
**Step 3: Write minimal implementation**
Implement pure splitting rules:
1. Split on `speakerId` change.
2. Split when the gap between consecutive words exceeds `0.45` seconds.
3. Split when the accumulated sentence duration exceeds `8` seconds.
```ts
if (nextWord.speakerId !== currentSpeakerId) {
  flushSentence();
}
```
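Putting the three rules together, a self-contained sketch of `rebuildSentences` might look like this. The `0.45`-second gap and `8`-second duration thresholds come from the rules above; the `WordTiming` shape matches the test fixture:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

const MAX_WORD_GAP = 0.45; // seconds of silence between words before splitting
const MAX_SENTENCE_DURATION = 8; // seconds before a sentence is force-split

// Group a flat word stream into sentences, splitting on speaker change,
// long pauses, and over-long sentences. Pure function: easy to unit test.
export const rebuildSentences = (words: WordTiming[]): WordTiming[][] => {
  const sentences: WordTiming[][] = [];
  let current: WordTiming[] = [];

  for (const word of words) {
    const last = current[current.length - 1];
    const shouldSplit =
      last !== undefined &&
      (word.speakerId !== last.speakerId ||
        word.startTime - last.endTime > MAX_WORD_GAP ||
        word.endTime - current[0].startTime > MAX_SENTENCE_DURATION);

    if (shouldSplit) {
      sentences.push(current);
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
};
```

This is a sketch, not the final rule set; real dialogue may also want punctuation-aware splitting, which can be layered on without changing the function's signature.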
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/alignment/sentenceReconstruction.ts src/lib/alignment/sentenceReconstruction.test.ts
git commit -m "feat: add sentence reconstruction rules"
```
### Task 4: Implement Speaker Assignment Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.test.ts`
**Step 1: Write the failing test**
Test overlap-based speaker assignment.
```ts
it('assigns each word to the speaker segment with maximum overlap', () => {
  const word = { text: 'hello', startTime: 1.0, endTime: 1.4 };
  const speakers = [
    { speakerId: 'spk_0', startTime: 0.8, endTime: 1.1 },
    { speakerId: 'spk_1', startTime: 1.1, endTime: 1.6 },
  ];

  expect(assignSpeakerToWord(word, speakers)).toBe('spk_1');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/alignment/speakerAssignment.test.ts`
Expected: FAIL because speaker assignment logic does not exist.
**Step 3: Write minimal implementation**
Add a pure overlap calculator and default to `unknown` when no segment overlaps.
```ts
const overlap = Math.max(
  0,
  Math.min(word.endTime, segment.endTime) - Math.max(word.startTime, segment.startTime),
);
```
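One way to finish the helper around that overlap calculation is sketched below. The input shapes are assumptions taken from the test fixture; words with no overlapping segment fall back to `'unknown'`, matching the default described above:

```typescript
interface TimedSpan {
  startTime: number;
  endTime: number;
}

interface SpeakerSegment extends TimedSpan {
  speakerId: string;
}

// Pick the speaker segment with the largest temporal overlap with the word;
// default to 'unknown' when nothing overlaps at all.
export const assignSpeakerToWord = (word: TimedSpan, segments: SpeakerSegment[]): string => {
  let bestId = 'unknown';
  let bestOverlap = 0;

  for (const segment of segments) {
    const overlap = Math.max(
      0,
      Math.min(word.endTime, segment.endTime) - Math.max(word.startTime, segment.startTime),
    );
    if (overlap > bestOverlap) {
      bestOverlap = overlap;
      bestId = segment.speakerId;
    }
  }
  return bestId;
};
```

Because diarization segments and alignment words come from different models, small boundary disagreements are expected; maximum overlap is a simple, deterministic tiebreaker.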
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/alignment/speakerAssignment.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/alignment/speakerAssignment.ts src/lib/alignment/speakerAssignment.test.ts
git commit -m "feat: add speaker assignment helpers"
```
### Task 5: Isolate Backend Pipeline Logic from `server.ts`
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
**Step 1: Write the failing test**
Add tests for orchestration-level fallback behavior.
```ts
it('returns partial quality when diarization is unavailable', async () => {
  const result = await buildSubtitlePayload({
    alignmentResult: {
      words: [{ text: 'hi', startTime: 0, endTime: 0.2, speakerId: 'unknown', confidence: 0.9 }],
      speakerSegments: [],
      quality: 'partial',
    },
  });

  expect(result.quality).toBe('partial');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/server/subtitlePipeline.test.ts`
Expected: FAIL because orchestration code is still embedded in `server.ts`.
**Step 3: Write minimal implementation**
1. Move payload-building logic into `src/server/subtitlePipeline.ts`.
2. Make `server.ts` call the helper and only handle HTTP concerns.
```ts
export const buildSubtitlePayload = async (deps: SubtitlePipelineDeps) => {
  // normalize alignment result
  // translate text
  // return { subtitles, speakers, quality, ... }
};
```
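A fuller sketch of that orchestration is shown below. The `SubtitlePipelineDeps` shape is an assumption, and translation is modeled as an injected function so the helper stays testable without a live API; the real helper would first run the words through `rebuildSentences` rather than treating everything as one sentence:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

interface AlignmentResult {
  words: WordTiming[];
  speakerSegments: { speakerId: string; startTime: number; endTime: number }[];
  quality: 'full' | 'partial' | 'fallback';
}

// Assumed dependency shape; translate is injected for testability.
interface SubtitlePipelineDeps {
  alignmentResult: AlignmentResult;
  translate?: (text: string) => Promise<string>;
}

export const buildSubtitlePayload = async ({ alignmentResult, translate }: SubtitlePipelineDeps) => {
  // Simplified: one sentence per payload. The real implementation would
  // call rebuildSentences() and normalize each sentence individually.
  const originalText = alignmentResult.words.map((w) => w.text).join(' ');
  const translatedText = translate ? await translate(originalText) : originalText;

  return {
    subtitles: [
      {
        id: 'sub_1',
        startTime: alignmentResult.words[0]?.startTime ?? 0,
        endTime: alignmentResult.words[alignmentResult.words.length - 1]?.endTime ?? 0,
        originalText,
        translatedText,
      },
    ],
    speakers: alignmentResult.speakerSegments,
    quality: alignmentResult.quality, // propagate so the UI can gate precision features
  };
};
```

Propagating `quality` unchanged is the key contract: the HTTP layer in `server.ts` never needs to know how the quality level was decided.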
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/server/subtitlePipeline.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/server/subtitlePipeline.ts src/server/subtitlePipeline.test.ts server.ts
git commit -m "refactor: isolate subtitle pipeline orchestration"
```
### Task 6: Add an Alignment Service Adapter
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
**Step 1: Write the failing test**
Test that the adapter maps raw alignment responses into normalized internal types.
```ts
it('maps aligned words and speaker segments from the adapter response', async () => {
  const result = await parseAlignmentResponse({
    words: [{ word: 'hello', start: 1.0, end: 1.2, speaker: 'spk_0', score: 0.95 }],
    speakers: [{ speaker: 'spk_0', start: 0.8, end: 1.6 }],
  });

  expect(result.words[0].speakerId).toBe('spk_0');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/server/alignmentAdapter.test.ts`
Expected: FAIL because no adapter exists.
**Step 3: Write minimal implementation**
Create an adapter boundary with one public function such as `requestAlignedTranscript(audioPath)`.
```ts
export const requestAlignedTranscript = async (audioPath: string) => {
  // call local or remote alignment backend
  // normalize response shape
};
```
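Since the test exercises `parseAlignmentResponse` directly, a minimal mapping sketch could look like this. The raw field names (`word`, `start`, `end`, `speaker`, `score`) are taken from the test fixture and a real alignment backend may differ; the function is shown synchronous here, and `await`-ing it as in the test still works because `await` on a plain value resolves immediately:

```typescript
interface RawAlignedWord {
  word: string;
  start: number;
  end: number;
  speaker?: string;
  score?: number;
}

interface RawSpeakerSegment {
  speaker: string;
  start: number;
  end: number;
}

// Assumed raw response shape from the alignment backend.
interface RawAlignmentResponse {
  words: RawAlignedWord[];
  speakers: RawSpeakerSegment[];
}

// Translate the backend's field names into the normalized internal types
// used by the rest of the pipeline, with safe defaults for missing fields.
export const parseAlignmentResponse = (raw: RawAlignmentResponse) => ({
  words: raw.words.map((w) => ({
    text: w.word,
    startTime: w.start,
    endTime: w.end,
    speakerId: w.speaker ?? 'unknown',
    confidence: w.score ?? 0,
  })),
  speakerSegments: raw.speakers.map((s) => ({
    speakerId: s.speaker,
    startTime: s.start,
    endTime: s.end,
  })),
});
```

Keeping the field translation in one place means a backend swap only touches this adapter, never the sentence reconstruction or speaker assignment code.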
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/server/alignmentAdapter.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/server/alignmentAdapter.ts src/server/alignmentAdapter.test.ts server.ts
git commit -m "feat: add alignment service adapter"
```
### Task 7: Upgrade `/api/process-audio-pipeline` Response Shape
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.test.ts`
**Step 1: Write the failing test**
Add a client-side test for parsing `quality`, `speakers`, and `words`.
```ts
it('maps the enriched audio pipeline response into subtitle objects', async () => {
  const payload = {
    subtitles: [
      {
        id: 'sub_1',
        startTime: 1,
        endTime: 2,
        originalText: 'Hello',
        translatedText: '你好',
        speaker: 'Speaker 1',
        speakerId: 'spk_0',
        words: [{ text: 'Hello', startTime: 1, endTime: 2, speakerId: 'spk_0', confidence: 0.9 }],
        confidence: 0.9,
      },
    ],
    speakers: [{ speakerId: 'spk_0', label: 'Speaker 1' }],
    quality: 'full',
  };

  expect(mapPipelineResponse(payload).subtitles[0].words).toHaveLength(1);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/services/geminiService.test.ts`
Expected: FAIL because the mapping helper does not exist.
**Step 3: Write minimal implementation**
1. Add a response-mapping helper in `src/services/geminiService.ts`.
2. Preserve the existing fallback path.
3. Carry `quality` metadata to the UI.
```ts
const quality = data.quality ?? 'fallback';
const subtitles = (data.subtitles ?? []).map(mapSubtitleFromApi);
```
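Those two lines could grow into a small self-contained mapper like the sketch below. The payload field names come from the test fixture; the API shapes (`ApiSubtitle`, `PipelinePayload`) are illustrative assumptions, and defaults are chosen so the legacy fallback path keeps working when the new fields are absent:

```typescript
interface ApiWord {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

// Assumed wire shape; optional fields cover the legacy fallback payload.
interface ApiSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker?: string;
  speakerId?: string;
  words?: ApiWord[];
  confidence?: number;
}

interface PipelinePayload {
  subtitles?: ApiSubtitle[];
  speakers?: { speakerId: string; label: string }[];
  quality?: 'full' | 'partial' | 'fallback';
}

const mapSubtitleFromApi = (sub: ApiSubtitle) => ({
  ...sub,
  speakerId: sub.speakerId ?? 'unknown',
  words: sub.words ?? [],
  confidence: sub.confidence ?? 0,
});

// Map the enriched response, defaulting to the legacy fallback shape when
// the new precision fields are missing.
export const mapPipelineResponse = (data: PipelinePayload) => ({
  subtitles: (data.subtitles ?? []).map(mapSubtitleFromApi),
  speakers: data.speakers ?? [],
  quality: data.quality ?? 'fallback',
});
```

Defaulting `quality` to `'fallback'` rather than throwing means an older server build degrades gracefully instead of breaking the editor.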
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/services/geminiService.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add server.ts src/services/geminiService.ts src/services/geminiService.test.ts
git commit -m "feat: return enriched subtitle pipeline payloads"
```
### Task 8: Add Precision Metadata to Editor State
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.test.tsx`
**Step 1: Write the failing test**
Add a test for rendering a fallback warning when `quality` is low.
```tsx
it('shows a low-precision notice for fallback subtitle results', () => {
  render(<EditorScreen ... />);

  expect(screen.getByText(/low-precision/i)).toBeInTheDocument();
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/components/EditorScreen.test.tsx`
Expected: FAIL because the component does not track pipeline quality yet.
**Step 3: Write minimal implementation**
1. Add state for `quality` and `speakers`.
2. Surface a small status badge or warning banner.
3. Keep the existing sentence list and timeline intact.
```tsx
{quality === 'fallback' && (
  <p className="text-xs text-amber-700">Low-precision timing detected. Manual review recommended.</p>
)}
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/components/EditorScreen.test.tsx`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/components/EditorScreen.tsx src/components/EditorScreen.test.tsx
git commit -m "feat: surface subtitle precision status in editor"
```
### Task 9: Add Word-Level Playback Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
**Step 1: Write the failing test**
Test the active-word lookup helper.
```ts
it('returns the active word for the current playback time', () => {
  const activeWord = getActiveWord(
    [{ text: 'Hello', startTime: 1, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 }],
    1.1,
  );

  expect(activeWord?.text).toBe('Hello');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/playback/wordHighlight.test.ts`
Expected: FAIL because playback helpers do not exist.
**Step 3: Write minimal implementation**
1. Create a pure helper for active-word lookup.
2. Use it in `EditorScreen.tsx` to render highlighted word spans when `words` are present.
```ts
export const getActiveWord = (words: WordTiming[], currentTime: number) =>
  words.find((word) => currentTime >= word.startTime && currentTime <= word.endTime);
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/playback/wordHighlight.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/playback/wordHighlight.ts src/lib/playback/wordHighlight.test.ts src/components/EditorScreen.tsx
git commit -m "feat: add word-level playback highlighting"
```
### Task 10: Snap Timeline Edges to Word Boundaries
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
**Step 1: Write the failing test**
Test snapping to nearest word edges.
```ts
it('snaps a dragged start edge to the nearest word boundary', () => {
  const next = snapTimeToNearestWordBoundary(1.34, [
    { text: 'Hello', startTime: 1.0, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
    { text: 'world', startTime: 1.35, endTime: 1.8, speakerId: 'spk_0', confidence: 0.9 },
  ]);

  expect(next).toBe(1.35);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/timeline/snapToWords.test.ts`
Expected: FAIL because no snapping helper exists.
**Step 3: Write minimal implementation**
1. Add a pure snapping helper with a small tolerance window.
2. Use it in the left and right resize timeline handlers.
```ts
export const snapTimeToNearestWordBoundary = (time: number, words: WordTiming[]) => {
  // choose nearest start or end boundary within tolerance
};
```
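A possible implementation with an explicit tolerance window is sketched below. The `0.08`-second default is illustrative, not a value from the design; times outside the window are returned unchanged so free-form dragging still works:

```typescript
interface WordTiming {
  startTime: number;
  endTime: number;
}

const SNAP_TOLERANCE = 0.08; // seconds; illustrative default, tune against real drags

// Snap a dragged time to the closest word start or end within tolerance;
// return the original time unchanged when nothing is close enough.
export const snapTimeToNearestWordBoundary = (
  time: number,
  words: WordTiming[],
  tolerance: number = SNAP_TOLERANCE,
): number => {
  let best = time;
  let bestDistance = tolerance;

  for (const word of words) {
    for (const boundary of [word.startTime, word.endTime]) {
      const distance = Math.abs(boundary - time);
      if (distance <= bestDistance) {
        bestDistance = distance;
        best = boundary;
      }
    }
  }
  return best;
};
```

Exposing `tolerance` as an optional parameter keeps the helper pure and lets the timeline pass a zoom-dependent window later without an API change.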
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/timeline/snapToWords.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/timeline/snapToWords.ts src/lib/timeline/snapToWords.test.ts src/components/EditorScreen.tsx
git commit -m "feat: snap subtitle edits to word boundaries"
```
### Task 11: Add Speaker-Aware UI State
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\voices.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.test.ts`
**Step 1: Write the failing test**
Test stable color and label generation for speaker tracks.
```ts
it('creates stable display metadata for each speaker id', () => {
  const speaker = buildSpeakerPresentation({ speakerId: 'spk_0', label: 'Speaker 1' });

  expect(speaker.color).toMatch(/^#/);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/speakers/speakerPresentation.test.ts`
Expected: FAIL because no speaker presentation helper exists.
**Step 3: Write minimal implementation**
1. Create a helper that derives display color and fallback label from `speakerId`.
2. Use it to color sentence chips or timeline items.
3. Keep voice assignment behavior backward compatible.
```ts
export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
  speakerId,
  label,
  color: '#1677ff',
});
```
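The fixed color above is enough to pass the test; if a genuinely distinct, stable color per speaker is wanted later, one hedged sketch is to hash the `speakerId` into a fixed palette. The palette values and hash below are illustrative assumptions, not part of the design:

```typescript
interface SpeakerTrack {
  speakerId: string;
  label?: string;
}

// Illustrative palette; stable because the index depends only on speakerId.
const SPEAKER_COLORS = ['#1677ff', '#fa541c', '#52c41a', '#722ed1', '#eb2f96'];

// Simple deterministic string hash (unsigned 32-bit).
const hashString = (value: string): number => {
  let hash = 0;
  for (let i = 0; i < value.length; i += 1) {
    hash = (hash * 31 + value.charCodeAt(i)) >>> 0;
  }
  return hash;
};

export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
  speakerId,
  label: label ?? speakerId, // fall back to the raw id when no label was assigned
  color: SPEAKER_COLORS[hashString(speakerId) % SPEAKER_COLORS.length],
});
```

Because the color depends only on `speakerId`, re-running the pipeline never reshuffles speaker colors in the editor.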
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/speakers/speakerPresentation.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/components/EditorScreen.tsx src/voices.ts src/lib/speakers/speakerPresentation.ts src/lib/speakers/speakerPresentation.test.ts
git commit -m "feat: add speaker-aware editor presentation"
```
### Task 12: Verify End-to-End Behavior and Update Docs
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\README.md`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\docs\plans\2026-03-17-precise-dialogue-localization-design.md`
**Step 1: Write the failing test**
Write down the manual verification checklist before changing docs so the release criteria are explicit.
```md
- [ ] Single-speaker clip returns `quality: full`
- [ ] Two-speaker clip shows distinct speaker IDs
- [ ] Fallback path shows low-precision notice
- [ ] Timeline resize snaps to word boundaries
```
**Step 2: Run test to verify it fails**
Run: `npm run lint`
Expected: lint may pass or fail depending on in-progress code; either way, the task remains incomplete until every checklist item above has been executed manually and the results recorded.
**Step 3: Write minimal implementation**
1. Update `README.md` with new environment requirements and pipeline description.
2. Record the manual verification results in the design document or a linked note.
```md
## High-Precision Subtitle Mode
Set the alignment backend environment variables before running the app.
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run`
Expected: PASS.
Run: `npm run lint`
Expected: PASS.
Run: `npm run build`
Expected: PASS.
**Step 5: Commit**
```bash
git add README.md docs/plans/2026-03-17-precise-dialogue-localization-design.md
git commit -m "docs: document precise dialogue localization workflow"
```
## Notes for Execution
1. This workspace currently has no `.git` directory, so commit steps cannot be executed until the project is placed in a real Git checkout.
2. Introduce the alignment backend behind environment-based configuration so existing demos can still use the current fallback path.
3. Prefer pure functions for sentence reconstruction, speaker assignment, snapping, and word-highlighting logic so they remain easy to test.