video translate initial commit

This commit is contained in:
Song367 2026-03-18 11:42:00 +08:00
commit 0e9738b495
76 changed files with 11208 additions and 0 deletions

.env.example (Normal file, 26 lines)

@@ -0,0 +1,26 @@
# GEMINI_API_KEY: Required when the editor LLM is set to Gemini.
GEMINI_API_KEY="YOUR_GEMINI_API_KEY"
# ARK_API_KEY: Required when the editor LLM is set to Doubao.
ARK_API_KEY="YOUR_ARK_API_KEY"
# DEFAULT_LLM_PROVIDER: Optional editor default. Supported values: doubao, gemini.
# Defaults to doubao.
DEFAULT_LLM_PROVIDER="doubao"
# DOUBAO_MODEL: Optional override for the Ark model used by Doubao subtitle generation.
# Defaults to doubao-seed-2-0-pro-260215.
DOUBAO_MODEL="doubao-seed-2-0-pro-260215"
# MINIMAX_API_KEY: Required for MiniMax TTS API calls.
# Use a MiniMax API secret key that has TTS access enabled.
MINIMAX_API_KEY="YOUR_MINIMAX_API_KEY"
# MINIMAX_API_HOST: Optional override for the MiniMax API host.
# Defaults to https://api.minimaxi.com
MINIMAX_API_HOST="https://api.minimaxi.com"
# APP_URL: The URL where this applet is hosted.
# AI Studio automatically injects this at runtime with the Cloud Run service URL.
# Used for self-referential links, OAuth callbacks, and API endpoints.
APP_URL="YOUR_APP_URL"

.gitignore (vendored, Normal file, 8 lines)

@@ -0,0 +1,8 @@
node_modules/
build/
dist/
coverage/
.DS_Store
*.log
.env*
!.env.example

README.md (Normal file, 38 lines)

@@ -0,0 +1,38 @@
<div align="center">
<img width="1200" height="475" alt="GHBanner" src="https://github.com/user-attachments/assets/0aa67016-6eaf-458a-adb2-6e31a0763ed6" />
</div>
# Run and deploy your AI Studio app
This repository contains everything you need to run the app locally.
View your app in AI Studio: https://ai.studio/apps/a38a3cd5-7f82-49f0-a26e-99be4d77f863
## Run Locally
**Prerequisites:** Node.js
1. Install dependencies:
`npm install`
2. Configure [.env](.env) with:
`ARK_API_KEY`
`GEMINI_API_KEY`
`MINIMAX_API_KEY`
3. Optional defaults:
`DEFAULT_LLM_PROVIDER=doubao`
`DOUBAO_MODEL=doubao-seed-2-0-pro-260215`
4. Run the app:
`npm run dev`
## Model Switching
1. Subtitle generation now runs through the server and supports `Doubao` and `Gemini`.
2. The editor shows an `LLM` selector and defaults to `Doubao`.
3. `TTS` stays fixed on `MiniMax` regardless of the selected LLM.
4. All provider keys are read from `.env`; the browser no longer calls LLM providers directly.
## Subtitle Generation
1. Subtitle generation is now driven by server-side multimodal LLM calls on the uploaded video file.
2. No separate local alignment/ASR backend is required for `/api/generate-subtitles`.


@@ -0,0 +1,249 @@
# Doubao LLM Provider Design
**Date:** 2026-03-17
**Goal:** Add a user-visible LLM switcher so subtitle generation can use either Doubao or Gemini, default to Doubao, and keep TTS fixed on MiniMax.
## Current State
The current project is effectively Gemini-only for subtitle generation and translation.
1. `src/services/geminiService.ts` calls Gemini directly from the browser for subtitle generation and Gemini fallback TTS.
2. `src/server/geminiTranslation.ts` translates sentence text on the server with Gemini.
3. `src/server/audioPipelineConfig.ts` only validates `GEMINI_API_KEY`.
4. `src/components/EditorScreen.tsx` imports a Gemini-specific service and has no model selector.
5. MiniMax is already independent and used only for TTS through `/api/tts`.
This makes provider switching hard because the LLM choice is not isolated behind a shared contract.
## Product Requirements
1. The editor must show a visible LLM selector.
2. Available LLM options are `Doubao` and `Gemini`.
3. The default LLM must be `Doubao`.
4. TTS must remain fixed to MiniMax and must not participate in provider switching.
5. API keys must only come from `.env`.
6. The app must not silently fall back from one LLM provider to the other.
## Chosen Approach
Use a server-side provider abstraction for subtitle generation and translation, with a frontend selector that passes the chosen provider to the server.
This approach keeps secrets on the server, avoids browser-side provider drift, and gives the project one place to add or change LLM providers later.
## Why This Approach
### Option A: Server-side provider abstraction with frontend selector
Recommended.
1. Frontend sends `provider: 'doubao' | 'gemini'`.
2. Server reads the matching API key from `.env`.
3. Server routes subtitle text generation through a provider adapter.
4. Time-critical audio extraction and timeline logic stay outside the provider-specific layer.
Pros:
1. Keeps API keys off the client.
2. Produces one consistent API contract for the editor.
3. Makes default-provider behavior easy to enforce.
4. Prevents Gemini-specific code from leaking further into the app.
Cons:
1. Requires moving browser-side subtitle generation behavior into a server-owned path.
2. Touches both frontend and backend.
### Option B: Keep Gemini in the browser and add Doubao as a separate server path
Rejected.
Pros:
1. Faster initial implementation.
Cons:
1. Two subtitle-generation architectures would coexist.
2. Provider behavior would drift over time.
3. It violates the requirement that keys come only from `.env`.
### Option C: Client-side provider switching
Rejected.
Pros:
1. Minimal backend work.
Cons:
1. Exposes secrets to the browser.
2. Conflicts with the `.env`-only requirement.
## Architecture
### Frontend
The editor adds an `LLM` selector with the values:
1. `Doubao`
2. `Gemini`
The default selected value is `Doubao`.
When the user clicks subtitle generation, the frontend sends:
1. the uploaded video
2. the target language
3. the selected LLM provider
4. optional trim metadata if the current flow needs it
The frontend no longer needs to know how Gemini or Doubao are called. It only consumes a normalized subtitle payload.
### Server
The server becomes the single owner of LLM subtitle generation.
Responsibilities:
1. validate the incoming provider
2. read provider credentials from `.env`
3. extract audio and prepare subtitle-generation inputs
4. call the chosen provider adapter
5. normalize the result into the existing subtitle shape
### Provider Layer
Create a provider abstraction around LLM calls:
1. `resolveLlmProvider(provider, env)`
2. `geminiProvider`
3. `doubaoProvider`
Each provider must accept the same logical input and return the same logical output so the rest of the app is provider-agnostic.
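The shared contract can be sketched as a small TypeScript interface. The names below (`SubtitleDraft`, `LlmSubtitleProvider`, the registry parameter) are illustrative, not taken from the codebase:

```typescript
// Illustrative sketch of the provider contract; names are assumptions.
interface SubtitleDraft {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
}

interface LlmSubtitleProvider {
  name: 'doubao' | 'gemini';
  generateSubtitles(input: {
    transcript: string;
    targetLanguage: string;
  }): Promise<SubtitleDraft[]>;
}

// Picks the adapter for the requested provider, defaulting to Doubao
// as the design requires; unknown values fall through to the default.
const resolveLlmProvider = (
  provider: string | undefined,
  registry: Record<'doubao' | 'gemini', LlmSubtitleProvider>,
): LlmSubtitleProvider =>
  provider?.toLowerCase() === 'gemini' ? registry.gemini : registry.doubao;
```

Because both adapters satisfy the same interface, the rest of the server can call `generateSubtitles` without knowing which provider is behind it.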
## API Design
Add a dedicated subtitle-generation endpoint rather than overloading the existing audio-extraction endpoint.
### Request
`POST /api/generate-subtitles`
Multipart or JSON payload fields:
1. `video`
2. `targetLanguage`
3. `provider`
4. optional `trimRange`
### Response
Return the same normalized subtitle structure the editor already understands.
At minimum each subtitle object should include:
1. `id`
2. `startTime`
3. `endTime`
4. `originalText`
5. `translatedText`
6. `speaker`
7. `voiceId`
8. `volume`
If richer timeline metadata already exists in the current server subtitle pipeline, keep it in the response rather than trimming it away.
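As a sketch, the server could fill defaults for the minimum fields listed above before responding. The default values here (speaker label, unit volume) are assumptions for illustration, not the app's actual defaults:

```typescript
// Hypothetical normalizer for the minimum subtitle shape the editor expects.
type SubtitleRecord = {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker: string;
  voiceId: string;
  volume: number;
};

const normalizeSubtitle = (
  raw: Partial<SubtitleRecord>,
  index: number,
): SubtitleRecord => ({
  id: raw.id ?? String(index + 1),        // stable fallback id per position
  startTime: raw.startTime ?? 0,
  endTime: raw.endTime ?? 0,
  originalText: raw.originalText ?? '',
  translatedText: raw.translatedText ?? '',
  speaker: raw.speaker ?? 'Speaker 1',    // assumed default label
  voiceId: raw.voiceId ?? '',
  volume: raw.volume ?? 1,                // assumed full volume default
});
```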
## Subtitle Generation Strategy
The provider switch should affect only LLM reasoning; it must not touch the MiniMax TTS path.
The cleanest boundary is:
1. audio extraction and timeline preparation stay on the server
2. LLM provider handles translation and label generation
3. MiniMax remains the only TTS engine
This reduces the risk that switching providers changes subtitle timing behavior unpredictably.
## Doubao Integration Notes
Use the Ark Responses API on the server:
1. host: `https://ark.cn-beijing.volces.com/api/v3/responses`
2. auth: `Authorization: Bearer ${ARK_API_KEY}`
3. model: configurable, defaulting to `doubao-seed-2-0-pro-260215`
The provider should treat Doubao as a text-generation backend and extract normalized text from the response payload before JSON parsing.
Implementation detail:
1. the response parser should not assume SDK-specific helpers
2. it should read the returned response envelope and collect the textual output fragments
3. the final result should be parsed as JSON only after the output text is reconstructed
This is an implementation inference based on the official Ark Responses API response shape and is meant to keep the parser resilient to wrapper differences.
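A request-builder sketch for the Ark call, kept pure so it can be tested without the network. The payload field names (`model`, `input`) follow the public Ark Responses API shape and should be verified against the current Ark documentation before use:

```typescript
// Builds the fetch arguments for an Ark Responses API call.
// Field names in the body are assumptions based on the public API shape.
const buildDoubaoRequest = (apiKey: string, model: string, prompt: string) => ({
  url: 'https://ark.cn-beijing.volces.com/api/v3/responses',
  init: {
    method: 'POST' as const,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model, input: prompt }),
  },
});
```

The caller would pass `buildDoubaoRequest(...)` straight to `fetch`, then reconstruct the output text from the response envelope before JSON-parsing it, as described above.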
## Configuration
Environment variables:
1. `ARK_API_KEY` for Doubao
2. `GEMINI_API_KEY` for Gemini
3. `MINIMAX_API_KEY` for TTS
4. optional `DOUBAO_MODEL` for server-side model override
5. optional `DEFAULT_LLM_PROVIDER` with a default value of `doubao`
Rules:
1. No API keys may be embedded in frontend code.
2. No provider may silently reuse another provider's key.
3. If the selected provider is missing its key, return a clear error.
## Error Handling
Provider failures must be explicit.
1. If `provider` is invalid, return `400`.
2. If the selected provider key is missing, return `400`.
3. If the selected provider returns an auth failure, return `401` or a mapped upstream auth error.
4. If the selected provider fails unexpectedly, return `502` or `500` with a provider-specific error message.
5. Do not auto-fallback from Doubao to Gemini or from Gemini to Doubao.
The UI should show which provider failed so the user is never misled about which model generated a subtitle result.
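The rules above can be captured in one small mapping helper; the failure-kind names and messages below are hypothetical, not existing code:

```typescript
// Illustrative mapping from provider failure classes to HTTP statuses,
// mirroring the numbered rules above. No cross-provider fallback exists.
type ProviderFailure =
  | { kind: 'invalid-provider' }
  | { kind: 'missing-key'; provider: string }
  | { kind: 'upstream-auth'; provider: string }
  | { kind: 'upstream-error'; provider: string; detail: string };

const mapProviderFailure = (
  failure: ProviderFailure,
): { status: number; message: string } => {
  switch (failure.kind) {
    case 'invalid-provider':
      return { status: 400, message: 'Unknown LLM provider.' };
    case 'missing-key':
      return { status: 400, message: `${failure.provider} API key is not configured.` };
    case 'upstream-auth':
      // Name the failing provider so the UI can surface it honestly.
      return { status: 401, message: `${failure.provider} rejected the configured API key.` };
    case 'upstream-error':
      return { status: 502, message: `${failure.provider} failed: ${failure.detail}` };
  }
};
```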
## Frontend UX
Add the selector in the editor near the subtitle-generation controls so the choice is visible at generation time.
Rules:
1. Default selection is `Doubao`.
2. The selector affects each generation request immediately.
3. The selector does not affect previously generated subtitles until the user regenerates.
4. The selector does not affect MiniMax TTS generation.
## Testing Strategy
Coverage should focus on deterministic seams.
1. Provider resolution defaults to Doubao.
2. Invalid provider is rejected.
3. Missing `ARK_API_KEY` or `GEMINI_API_KEY` returns clear errors.
4. Doubao response parsing turns Ark response content into normalized subtitle JSON.
5. Gemini and Doubao providers both satisfy the same interface contract.
6. Editor defaults to Doubao and sends the selected provider on regenerate.
7. TTS behavior remains unchanged when the LLM provider changes.
## Rollout Notes
1. Introduce the new endpoint and provider abstraction first.
2. Switch the editor to the new endpoint second.
3. Keep MiniMax TTS untouched except for regression checks.
4. Leave any deeper visual fallback provider work for a later pass if needed.
## Constraints
1. This workspace is not a Git repository, so the design document cannot be committed here.
2. The user provided an Ark key in chat, but the implementation must still read provider secrets from `.env` and not hardcode them into source files.


@@ -0,0 +1,472 @@
# Doubao LLM Provider Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Add a user-visible LLM switcher that lets subtitle generation use Doubao or Gemini, defaults to Doubao, and keeps TTS fixed on MiniMax with all provider keys sourced from `.env`.
**Architecture:** Move subtitle generation behind a new server endpoint, introduce a provider abstraction for Gemini and Doubao, and update the editor to send the selected provider while continuing to use the existing subtitle shape. Keep MiniMax TTS separate and untouched except for regression coverage.
**Tech Stack:** React, TypeScript, Express, multer, fetch, Vitest
---
### Task 1: Add provider types and configuration resolution
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\llmProvider.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\llmProvider.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\audioPipelineConfig.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\audioPipelineConfig.test.ts`
**Step 1: Write the failing test**
```ts
import { describe, expect, it } from 'vitest';
import { normalizeLlmProvider, resolveLlmProviderConfig } from './llmProvider';
describe('llmProvider config', () => {
it('defaults to doubao when no provider override is set', () => {
expect(normalizeLlmProvider(undefined)).toBe('doubao');
});
it('returns the selected provider key from env', () => {
expect(
resolveLlmProviderConfig('doubao', {
ARK_API_KEY: 'ark-key',
GEMINI_API_KEY: 'gemini-key',
}),
).toEqual(expect.objectContaining({ provider: 'doubao', apiKey: 'ark-key' }));
});
});
```
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/llmProvider.test.ts src/server/audioPipelineConfig.test.ts`
Expected: FAIL because `llmProvider.ts` does not exist and `audioPipelineConfig.ts` still only exposes Gemini config.
**Step 3: Write minimal implementation**
```ts
export type LlmProvider = 'doubao' | 'gemini';
export const normalizeLlmProvider = (value?: string): LlmProvider =>
  // Unknown or missing values fall back to the Doubao default; explicitly
  // invalid input should still be rejected at the route layer per the design.
  value?.toLowerCase() === 'gemini' ? 'gemini' : 'doubao';
export const resolveLlmProviderConfig = (
provider: LlmProvider,
env: NodeJS.ProcessEnv,
) => {
if (provider === 'doubao') {
const apiKey = env.ARK_API_KEY?.trim();
if (!apiKey) throw new Error('ARK_API_KEY is required for Doubao subtitle generation.');
return {
provider,
apiKey,
model: env.DOUBAO_MODEL?.trim() || 'doubao-seed-2-0-pro-260215',
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
};
}
const apiKey = env.GEMINI_API_KEY?.trim();
if (!apiKey) throw new Error('GEMINI_API_KEY is required for Gemini subtitle generation.');
return {
provider,
apiKey,
model: 'gemini-2.5-flash',
};
};
```
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/llmProvider.test.ts src/server/audioPipelineConfig.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/llmProvider.ts src/server/llmProvider.test.ts src/server/audioPipelineConfig.ts src/server/audioPipelineConfig.test.ts
git commit -m "feat: add llm provider configuration"
```
### Task 2: Add the Doubao provider parser and contract tests
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\doubaoProvider.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\doubaoProvider.test.ts`
**Step 1: Write the failing test**
```ts
import { describe, expect, it } from 'vitest';
import { extractDoubaoTextOutput } from './doubaoProvider';
describe('extractDoubaoTextOutput', () => {
it('reconstructs text from the Ark output array', () => {
const text = extractDoubaoTextOutput({
output: [
{
type: 'message',
content: [{ type: 'output_text', text: '[{"id":"1","translatedText":"你好"}]' }],
},
],
});
expect(text).toContain('translatedText');
});
});
```
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/doubaoProvider.test.ts`
Expected: FAIL because `doubaoProvider.ts` does not exist.
**Step 3: Write minimal implementation**
```ts
export const extractDoubaoTextOutput = (payload: any): string =>
(payload?.output ?? [])
.flatMap((item: any) => item?.content ?? [])
.filter((part: any) => part?.type === 'output_text')
.map((part: any) => part.text ?? '')
.join('')
.trim();
```
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/doubaoProvider.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/doubaoProvider.ts src/server/doubaoProvider.test.ts
git commit -m "feat: add doubao response parsing"
```
### Task 3: Add provider-backed translation adapters
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\geminiTranslation.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\providerTranslation.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\providerTranslation.test.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\geminiTranslation.test.ts`
**Step 1: Write the failing test**
```ts
import { describe, expect, it } from 'vitest';
import { createSentenceTranslator } from './providerTranslation';
describe('createSentenceTranslator', () => {
it('returns a Doubao translator when provider is doubao', () => {
const translator = createSentenceTranslator({
provider: 'doubao',
apiKey: 'ark-key',
model: 'doubao-seed-2-0-pro-260215',
});
expect(typeof translator).toBe('function');
});
});
```
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/providerTranslation.test.ts src/server/geminiTranslation.test.ts`
Expected: FAIL because the provider selection layer does not exist.
**Step 3: Write minimal implementation**
```ts
// ProviderConfig and the per-provider translator factories are assumed to be
// exported by the provider modules created in Tasks 1 and 2.
export const createSentenceTranslator = (config: ProviderConfig) => {
if (config.provider === 'doubao') {
return createDoubaoSentenceTranslator(config);
}
return createGeminiSentenceTranslator(config);
};
```
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/providerTranslation.test.ts src/server/geminiTranslation.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/providerTranslation.ts src/server/providerTranslation.test.ts src/server/geminiTranslation.ts src/server/geminiTranslation.test.ts
git commit -m "feat: add provider-based translation adapters"
```
### Task 4: Add a dedicated subtitle-generation endpoint
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleRequest.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitleRequest.test.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.test.ts`
**Step 1: Write the failing test**
```ts
import { describe, expect, it } from 'vitest';
import { parseSubtitleRequest } from './subtitleRequest';
describe('parseSubtitleRequest', () => {
it('defaults provider to doubao', () => {
expect(parseSubtitleRequest({ body: {} as any }).provider).toBe('doubao');
});
});
```
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/subtitleRequest.test.ts src/server/subtitlePipeline.test.ts`
Expected: FAIL because the request parser does not exist.
**Step 3: Write minimal implementation**
```ts
import { normalizeLlmProvider } from './llmProvider';

export const parseSubtitleRequest = (req: { body: Record<string, unknown> }) => ({
  provider: normalizeLlmProvider(String(req.body.provider || 'doubao')),
  targetLanguage: String(req.body.targetLanguage || ''),
});
```
Then update `server.ts` to expose `POST /api/generate-subtitles`, validate input, resolve provider config, and return normalized subtitles.
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/subtitleRequest.test.ts src/server/subtitlePipeline.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add server.ts src/server/subtitleRequest.ts src/server/subtitleRequest.test.ts src/server/subtitlePipeline.test.ts
git commit -m "feat: add subtitle generation endpoint"
```
### Task 5: Update the frontend subtitle service to use the new endpoint
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\services\subtitleService.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\services\subtitleService.test.ts`
- Test: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.test.tsx`
**Step 1: Write the failing test**
```ts
import { describe, expect, it, vi } from 'vitest';
import { generateSubtitles } from './subtitleService';
describe('generateSubtitles', () => {
it('posts the selected provider to the server', async () => {
const fetchMock = vi.fn(async () => ({
ok: true,
json: async () => ({ subtitles: [] }),
}));
await generateSubtitles(new File(['x'], 'clip.mp4'), 'English', 'doubao', null, fetchMock as any);
expect(fetchMock.mock.calls[0][0]).toBe('/api/generate-subtitles');
});
});
```
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/services/subtitleService.test.ts src/components/EditorScreen.test.tsx`
Expected: FAIL because the new service does not exist and the editor still uses the Gemini-specific service directly.
**Step 3: Write minimal implementation**
```ts
export const generateSubtitles = async (
videoFile: File,
targetLanguage: string,
provider: 'doubao' | 'gemini',
trimRange?: { start: number; end: number } | null,
fetchImpl: typeof fetch = fetch,
) => {
const formData = new FormData();
formData.append('video', videoFile);
formData.append('targetLanguage', targetLanguage);
formData.append('provider', provider);
if (trimRange) {
formData.append('trimRange', JSON.stringify(trimRange));
}
const response = await fetchImpl('/api/generate-subtitles', {
method: 'POST',
body: formData,
});
return response.json();
};
```
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/services/subtitleService.test.ts src/components/EditorScreen.test.tsx`
Expected: PASS
**Step 5: Commit**
```bash
git add src/services/subtitleService.ts src/services/subtitleService.test.ts src/services/geminiService.ts src/components/EditorScreen.test.tsx
git commit -m "feat: route subtitle generation through the server"
```
### Task 6: Add the editor LLM selector and default it to Doubao
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.test.tsx`
**Step 1: Write the failing test**
```tsx
it('defaults the llm selector to Doubao', () => {
render(<EditorScreen videoFile={file} targetLanguage="English" onBack={() => {}} />);
expect(screen.getByLabelText(/llm/i)).toHaveValue('doubao');
});
```
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/components/EditorScreen.test.tsx`
Expected: FAIL because the selector does not exist.
**Step 3: Write minimal implementation**
```tsx
const [llmProvider, setLlmProvider] = useState<'doubao' | 'gemini'>('doubao');
<label>
LLM
<select
aria-label="LLM"
value={llmProvider}
onChange={(event) => setLlmProvider(event.target.value as 'doubao' | 'gemini')}
>
<option value="doubao">Doubao</option>
<option value="gemini">Gemini</option>
</select>
</label>
```
Then pass `llmProvider` into the subtitle-generation service.
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/components/EditorScreen.test.tsx`
Expected: PASS
**Step 5: Commit**
```bash
git add src/components/EditorScreen.tsx src/components/EditorScreen.test.tsx
git commit -m "feat: add llm selector to the editor"
```
### Task 7: Add end-to-end provider and regression coverage
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\server\minimaxTts.test.ts`
**Step 1: Write the failing test**
```ts
it('does not change TTS behavior when the llm provider changes', async () => {
  // Placeholder assertion; replace with real checks that the TTS request
  // payload is identical regardless of the selected LLM provider.
  expect(true).toBe(true);
});
```
**Step 2: Run test to verify it fails meaningfully**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/subtitlePipeline.test.ts src/services/geminiService.test.ts src/server/minimaxTts.test.ts`
Expected: FAIL or require stronger assertions until the new provider path is covered.
**Step 3: Write minimal implementation**
Add regression tests that prove:
1. selected provider is forwarded correctly
2. Doubao auth failures surface clearly
3. Gemini still works when selected
4. MiniMax TTS tests continue to pass unchanged
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/llmProvider.test.ts src/server/doubaoProvider.test.ts src/server/providerTranslation.test.ts src/server/subtitleRequest.test.ts src/server/subtitlePipeline.test.ts src/services/subtitleService.test.ts src/components/EditorScreen.test.tsx src/server/minimaxTts.test.ts src/services/geminiService.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add src/server/llmProvider.test.ts src/server/doubaoProvider.test.ts src/server/providerTranslation.test.ts src/server/subtitleRequest.test.ts src/server/subtitlePipeline.test.ts src/services/subtitleService.test.ts src/components/EditorScreen.test.tsx src/server/minimaxTts.test.ts src/services/geminiService.test.ts
git commit -m "test: cover llm provider switching"
```
### Task 8: Verify the live app behavior
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\.env.example`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\README.md`
**Step 1: Write the failing doc check**
Add docs assertions by inspection:
1. `.env.example` documents `ARK_API_KEY` and optional `DOUBAO_MODEL`
2. README explains the editor LLM switcher and that MiniMax remains the TTS engine
**Step 2: Run verification commands**
Run: `node .\node_modules\vitest\vitest.mjs run`
Expected: PASS for the new targeted suites or clear identification of pre-existing unrelated failures.
Run: `Invoke-WebRequest -UseBasicParsing http://localhost:3000/`
Expected: `200`
Run manual checks:
1. open the editor
2. confirm the `LLM` selector defaults to `Doubao`
3. generate subtitles with `Doubao`
4. switch to `Gemini`
5. generate subtitles again
6. confirm TTS still uses MiniMax
**Step 3: Write minimal documentation updates**
Document:
1. required env keys
2. default provider
3. how the editor switcher works
**Step 4: Re-run verification**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/llmProvider.test.ts src/server/doubaoProvider.test.ts src/server/providerTranslation.test.ts src/server/subtitleRequest.test.ts src/services/subtitleService.test.ts src/components/EditorScreen.test.tsx src/server/minimaxTts.test.ts`
Expected: PASS
**Step 5: Commit**
```bash
git add .env.example README.md
git commit -m "docs: document llm provider switching"
```
## Notes
1. This workspace is not a Git repository, so the commit steps may not be executable here.
2. Existing unrelated TypeScript baseline issues in `src/lib/*` and `src/server/*` should be treated as pre-existing unless the new work touches them directly.


@@ -0,0 +1,76 @@
# Export Preview Parity Design
**Date:** 2026-03-17
**Goal:** Make exported videos match the editor preview for audio mixing, subtitle timing, and visible subtitle styling.
## Current State
The editor preview and the export pipeline currently render the same edit session through different implementations:
1. The preview in `src/components/EditorScreen.tsx` overlays subtitle text with React and plays audio through the browser's media elements plus per-subtitle `Audio` instances.
2. The export in `server.ts` rebuilds subtitles as SRT, mixes audio with FFmpeg, and trims the final output after subtitle timing and TTS delays have already been computed.
This creates three deterministic mismatches:
1. Export mixes original audio even when the preview has muted it because instrumental BGM is present.
2. Export uses relative subtitle times from the trimmed editor session but trims the final video afterward, shifting or cutting subtitle/TTS timing.
3. Export ignores `textStyles`, so the rendered subtitle look differs from the preview.
## Chosen Approach
Adopt preview-first export semantics:
1. Treat the editor state as the source of truth.
2. Serialize the preview-visible subtitle data, text styles, and audio volume data explicitly.
3. Convert preview-relative subtitle timing into export timeline timing before FFmpeg rendering.
4. Generate styled subtitle overlays in the backend instead of relying on FFmpeg defaults.
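Step 3 above amounts to shifting preview-relative times by the trim start. A hedged sketch, assuming preview times are measured from the start of the trimmed clip:

```typescript
// Shifts editor-relative subtitle times back onto the full-video timeline.
// Types are illustrative; the real payload fields may differ.
const toExportTimeline = (
  subtitle: { startTime: number; endTime: number },
  trimRange: { start: number; end: number } | null,
): { startTime: number; endTime: number } =>
  trimRange
    ? {
        startTime: subtitle.startTime + trimRange.start,
        endTime: subtitle.endTime + trimRange.start,
      }
    : subtitle;
```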
## Architecture
### Frontend
The editor passes a richer export payload:
1. Subtitle text
2. Subtitle timing
3. Subtitle audio volume
4. Global text style settings
5. Trim range
6. Instrumental BGM base64 when present
The preview itself stays unchanged and remains the reference behavior.
### Backend Export Layer
The export route should move the parity-sensitive logic into pure helpers:
1. Build an export subtitle timeline that shifts relative editor timings back onto the full-video timeline when trimming is enabled.
2. Build an audio mix plan that mirrors preview rules:
- Use instrumental BGM at preview volume when present.
- Exclude original source audio when instrumental BGM is present.
- Otherwise keep original source audio at preview volume.
- Apply each subtitle TTS clip at its configured volume.
3. Generate ASS subtitle content so font, color, alignment, bold, italic, and underline can be rendered intentionally.
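The preview-mirroring mix rules above can be sketched as a pure planner; the type and field names here are illustrative, not the actual export payload:

```typescript
// Plans which audio sources enter the FFmpeg mix, mirroring preview rules:
// instrumental BGM replaces the original track; TTS clips are always added.
type MixSource = { kind: 'bgm' | 'original' | 'tts'; volume: number; startTime?: number };

const planAudioMix = (input: {
  hasInstrumentalBgm: boolean;
  bgmVolume: number;
  originalVolume: number;
  ttsClips: { startTime: number; volume: number }[];
}): MixSource[] => {
  const sources: MixSource[] = [];
  if (input.hasInstrumentalBgm) {
    // Preview rule: BGM present means the original source audio is excluded.
    sources.push({ kind: 'bgm', volume: input.bgmVolume });
  } else {
    sources.push({ kind: 'original', volume: input.originalVolume });
  }
  for (const clip of input.ttsClips) {
    sources.push({ kind: 'tts', volume: clip.volume, startTime: clip.startTime });
  }
  return sources;
};
```

Keeping the planner pure lets the regression tests in the plan assert on the mix decision without invoking FFmpeg.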
## Data Flow
1. `EditorScreen` passes `textStyles` into `ExportModal`.
2. `ExportModal` builds a structured export payload instead of manually shaping subtitle fields inline.
3. `server.ts` parses `textStyles`, normalizes subtitle timing for export, builds ASS subtitle content, and applies the preview-equivalent audio plan.
4. FFmpeg burns styled subtitles and mixes the planned audio sources.
## Testing Strategy
Add regression coverage around pure helpers instead of FFmpeg end-to-end tests:
1. Frontend payload builder includes style and volume fields.
2. Export timeline normalization shifts subtitle timing correctly for trimmed clips.
3. Audio mix planning excludes original audio when BGM is present and keeps it at preview volume when BGM is absent.
4. ASS subtitle generation reflects the selected style settings.
## Risks
1. ASS subtitle rendering may still not be pixel-perfect relative to browser CSS.
2. Existing exports without style payload should remain backward compatible by falling back to safe defaults.
3. FFmpeg filter graph assembly becomes slightly more complex, so helper-level tests are required before touching route logic.


@@ -0,0 +1,127 @@
# Export Preview Parity Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Make exported videos match the editor preview for audio behavior, subtitle timing, and visible subtitle styling.
**Architecture:** Keep the editor preview as the source of truth and teach the export pipeline to consume the same state explicitly. Extract pure helpers for export payload building, subtitle timeline normalization, audio mix planning, and ASS subtitle generation so we can lock parity with tests before wiring FFmpeg.
**Tech Stack:** React 19, TypeScript, Express, FFmpeg, Vitest.
---
### Task 1: Add Export Payload Builder Coverage
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\exportPayload.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\exportPayload.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\ExportModal.tsx`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
**Step 1: Write the failing test**
Cover that export payloads include subtitle audio volume and global text styles.
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/lib/exportPayload.test.ts`
Expected: FAIL because the helper does not exist yet.
**Step 3: Write minimal implementation**
Create a small pure builder and wire `ExportModal` to use it. Pass `textStyles` from `EditorScreen`.
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/lib/exportPayload.test.ts`
Expected: PASS.
### Task 2: Add Export Backend Planning Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\exportVideo.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\exportVideo.test.ts`
**Step 1: Write the failing test**
Cover:
1. Subtitle times shift by `trimRange.start` for export.
2. Original source audio is excluded when BGM is present.
3. Original source audio is kept at preview volume when BGM is absent.
4. ASS subtitle output reflects selected styles.
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/exportVideo.test.ts`
Expected: FAIL because helper module does not exist yet.
**Step 3: Write minimal implementation**
Implement pure helpers for:
1. Subtitle timeline normalization
2. Audio mix planning
3. ASS subtitle generation
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/exportVideo.test.ts`
Expected: PASS.
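Two of the behaviors under test can be sketched as pure helpers. Names and shapes below are illustrative assumptions:

```typescript
type ExportSubtitle = { startTime: number; endTime: number };

// Rule 1: shift subtitle times so the trimmed clip starts at zero.
export const normalizeSubtitleTimeline = (subtitles: ExportSubtitle[], trimStart: number) =>
  subtitles.map((subtitle) => ({
    startTime: Math.max(0, subtitle.startTime - trimStart),
    endTime: Math.max(0, subtitle.endTime - trimStart),
  }));

// Rules 2 and 3: exclude the original source audio when BGM is present,
// otherwise keep it at the preview volume.
export const planAudioMix = (options: { hasBgm: boolean; previewVolume: number }) =>
  options.hasBgm
    ? { includeOriginalAudio: false, originalVolume: 0 }
    : { includeOriginalAudio: true, originalVolume: options.previewVolume };
```

Because both helpers are deterministic, the tests can lock in the mix rules before any FFmpeg wiring happens.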
### Task 3: Wire Backend Export Route to Helpers
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
**Step 1: Write the failing integration-leaning test**
Extend `src/server/exportVideo.test.ts` if needed to assert the route-facing helper contract.
**Step 2: Run test to verify it fails**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/exportVideo.test.ts`
Expected: FAIL because current route behavior still assumes SRT/default mixing.
**Step 3: Write minimal implementation**
Update the route to:
1. Parse `textStyles`
2. Use normalized subtitle times for export
3. Generate `.ass` instead of `.srt`
4. Apply preview-equivalent audio mix rules
**Step 4: Run test to verify it passes**
Run: `node .\node_modules\vitest\vitest.mjs run src/server/exportVideo.test.ts`
Expected: PASS.
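For reference, a minimal ASS document builder might look like the sketch below. The `[V4+ Styles]` format line is heavily trimmed (a real ASS style row has many more columns), and the mapping from editor text styles to `fontName`/`fontSize` is an assumption:

```typescript
type AssCue = { startTime: number; endTime: number; text: string };

// Format seconds as H:MM:SS.CC, the timestamp shape ASS Dialogue lines use.
const toAssTime = (seconds: number) => {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = seconds % 60;
  return `${h}:${String(m).padStart(2, '0')}:${s.toFixed(2).padStart(5, '0')}`;
};

export const buildAssDocument = (cues: AssCue[], fontName: string, fontSize: number) =>
  [
    '[Script Info]',
    'ScriptType: v4.00+',
    '',
    '[V4+ Styles]',
    'Format: Name, Fontname, Fontsize',
    `Style: Default,${fontName},${fontSize}`,
    '',
    '[Events]',
    'Format: Layer, Start, End, Style, Text',
    ...cues.map(
      (cue) =>
        `Dialogue: 0,${toAssTime(cue.startTime)},${toAssTime(cue.endTime)},Default,${cue.text}`,
    ),
  ].join('\n');
```

Generating `.ass` instead of `.srt` is what lets the export carry the visible styling that SRT cannot express.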
### Task 4: Verify End-to-End Regressions
**Files:**
- Modify if needed: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.test.ts`
- Modify if needed: `E:\Downloads\ai-video-dubbing-&-translation\src\server\minimaxTts.test.ts`
**Step 1: Run focused regression suite**
Run:
```bash
node .\node_modules\vitest\vitest.mjs run src/lib/exportPayload.test.ts src/server/exportVideo.test.ts src/services/geminiService.test.ts src/server/minimaxTts.test.ts
```
Expected: PASS.
**Step 2: Run TypeScript check**
Run: `node .\node_modules\typescript\bin\tsc --noEmit`
Expected: Existing baseline errors may remain; no new export-parity errors should appear.
**Step 3: Smoke-check the running app**
1. Restart the local server.
2. Export a trimmed clip with BGM and TTS.
3. Confirm the exported audio and subtitle timing now match preview expectations.
Plan complete and saved to `docs/plans/2026-03-17-export-preview-parity.md`. Defaulting to execution in this session using the plan directly.


# Precise Dialogue Localization Design
**Date:** 2026-03-17
**Goal:** Upgrade the subtitle pipeline so sentence boundaries are more accurate, word-level timings are available, and speaker attribution is based on audio rather than LLM guesses.
## Current State
The current implementation has two subtitle generation paths:
1. The primary path in `server.ts` extracts audio, calls Whisper with `timestamp_granularities: ['segment']`, then asks an LLM to translate and infer `speaker` and `gender`.
2. The fallback path in `src/services/geminiService.ts` uses Gemini to infer subtitles from video or sampled frames.
This is enough for rough subtitle generation, but it has three hard limits:
1. Sentence timing is only segment-level, so start and end times drift at pause boundaries.
2. Word-level timestamps do not exist, so precise editing and karaoke-style highlighting are impossible.
3. Speaker identity is inferred from text, not measured from audio, so diarization quality is unreliable.
## Chosen Approach
Adopt a high-precision pipeline with a dedicated alignment layer:
1. Extract clean mono audio from the uploaded video.
2. Use voice activity detection (VAD) to isolate speech regions.
3. Run ASR for rough transcription.
4. Run forced alignment to refine every word boundary against the audio.
5. Run speaker diarization to assign stable `speakerId` values.
6. Rebuild editable subtitle sentences from aligned words.
7. Translate only the sentence text while preserving timestamps and speaker assignments.
The existing Node service remains the entry point, but it becomes an orchestration layer instead of doing all timing work itself.
## Architecture
### Frontend
The React editor continues to call `/api/process-audio-pipeline`, but it now receives richer subtitle objects:
1. Sentence-level timing for the timeline.
2. Word-level timing for precise playback feedback.
3. Stable `speakerId` values for speaker-aware UI and voice assignment.
The current editor can remain backward compatible by continuing to render sentence-level fields first and gradually enabling word-level behavior.
### Node Orchestration Layer
`server.ts` keeps responsibility for:
1. Receiving uploaded video data.
2. Extracting audio with FFmpeg.
3. Calling the alignment service.
4. Translating sentence text.
5. Returning a normalized payload to the frontend.
The Node layer must not allow translation to rewrite timing or speaker assignments.
### Alignment Layer
This layer owns all timing-critical operations:
1. VAD
2. ASR
3. Forced alignment
4. Speaker diarization
It can be implemented as a local Python service or a separately managed service as long as it returns deterministic machine-readable JSON.
## Data Model
The current `Subtitle` type should be extended rather than replaced.
```ts
type WordTiming = {
text: string;
startTime: number;
endTime: number;
speakerId: string;
confidence: number;
};
type Subtitle = {
id: string;
startTime: number;
endTime: number;
originalText: string;
translatedText: string;
speaker: string;
speakerId: string;
voiceId: string;
words: WordTiming[];
confidence: number;
audioUrl?: string;
volume?: number;
};
type SpeakerTrack = {
speakerId: string;
label: string;
gender?: 'male' | 'female' | 'unknown';
};
```
Rules:
1. `speakerId` is the stable machine identifier, for example `spk_0`.
2. `speaker` is a user-facing label and can be renamed.
3. Sentence `startTime` and `endTime` are derived from the first and last aligned words.
## Processing Rules
### Audio Preparation
1. Convert uploaded video to `16kHz` mono WAV.
2. Optionally create a denoised or vocal-enhanced copy when the source contains heavy music.
### VAD
Use VAD to identify speech windows and pad each detected region by about `0.2s`.
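A sketch of this padding rule, with assumed names, clamped so regions never extend past the clip bounds:

```typescript
// Pad a detected speech region by ~0.2s on each side, clamped to the clip.
export const padSpeechRegion = (
  region: { start: number; end: number },
  clipDuration: number,
  padding = 0.2,
) => ({
  start: Math.max(0, region.start - padding),
  end: Math.min(clipDuration, region.end + padding),
});
```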
### ASR and Forced Alignment
1. Use ASR for text hypotheses and rough word order.
2. Use forced alignment to compute accurate `startTime` and `endTime` for each word.
3. Treat forced alignment as the source of truth for timing whenever available.
### Diarization
1. Run diarization separately and produce speaker segments.
2. Assign each word to the speaker with the highest overlap.
3. If a sentence crosses speakers, split it rather than forcing a mixed-speaker sentence.
### Sentence Reconstruction
Build sentence subtitles from words using conservative rules:
1. Keep words together only when `speakerId` is the same.
2. Split when adjacent word gaps exceed `0.45s`.
3. Split when sentence duration would exceed `8s`.
4. Split on strong punctuation or long pauses.
5. Avoid returning sentences shorter than `0.6s` unless the source is actually brief.
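These rules can be sketched as a single pass over the aligned words. This is a simplified illustration; the punctuation and minimum-duration rules are omitted:

```typescript
type Word = { text: string; startTime: number; endTime: number; speakerId: string };

const MAX_GAP = 0.45; // rule 2
const MAX_DURATION = 8; // rule 3

export const rebuildSentences = (words: Word[]): Word[][] => {
  const sentences: Word[][] = [];
  let current: Word[] = [];
  for (const word of words) {
    const last = current[current.length - 1];
    const speakerChanged = last !== undefined && last.speakerId !== word.speakerId;
    const gapTooLong = last !== undefined && word.startTime - last.endTime > MAX_GAP;
    const tooLong = current.length > 0 && word.endTime - current[0].startTime > MAX_DURATION;
    if (speakerChanged || gapTooLong || tooLong) {
      sentences.push(current);
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
};
```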
## API Design
Reuse `/api/process-audio-pipeline`, but upgrade its payload to:
```json
{
"subtitles": [],
"speakers": [],
"sourceLanguage": "zh",
"targetLanguage": "en",
"duration": 123.45,
"quality": "full",
"alignmentEngine": "whisperx+pyannote"
}
```
Quality levels:
1. `full`: sentence timings, word timings, and diarization are all available.
2. `partial`: word timings are available but diarization is missing or unreliable.
3. `fallback`: high-precision alignment failed, so the app returned rough timing from the existing path.
## Frontend Behavior
The current editor in `src/components/EditorScreen.tsx` should evolve incrementally:
1. Keep the existing sentence-based timeline as the default view.
2. Add word-level highlighting during playback when `words` exist.
3. Add speaker-aware styling and filtering when `speakers` exist.
4. Preserve manual timeline editing and snap dragged sentence edges to nearest word boundaries when possible.
Fallback behavior:
1. If `quality` is `full`, enable all precision UI.
2. If `quality` is `partial`, disable speaker-specific UI and keep timing features.
3. If `quality` is `fallback`, continue with the current editor and show a low-precision notice.
## Error Handling and Degradation
The product must remain usable even when the high-precision path is incomplete.
1. If forced alignment fails, return sentence-level ASR output instead of failing the whole request.
2. If diarization fails, keep timings and mark `speakerId` as `unknown`.
3. If translation fails, return original text with timings intact.
4. If the alignment layer is unavailable, fall back to the existing visual pipeline and set `quality: "fallback"`.
5. Preserve low-confidence words and expose their confidence rather than dropping them silently.
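The alignment- and diarization-related rules above condense into a small quality resolver; the flag names here are assumptions:

```typescript
export type PipelineQuality = 'full' | 'partial' | 'fallback';

// Degradation ladder: no alignment -> fallback path; alignment without
// reliable diarization -> partial; both available -> full.
export const resolveQuality = (state: {
  alignmentSucceeded: boolean;
  diarizationSucceeded: boolean;
}): PipelineQuality =>
  !state.alignmentSucceeded ? 'fallback' : state.diarizationSucceeded ? 'full' : 'partial';
```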
## Testing Strategy
Coverage should focus on deterministic logic:
1. Sentence reconstruction from aligned words.
2. Speaker assignment from overlapping diarization segments.
3. API normalization and fallback handling.
4. Frontend word-highlighting and snapping helpers.
End-to-end manual verification should include:
1. Single-speaker clip with pauses.
2. Two-speaker dialogue with interruptions.
3. Music-heavy clip.
4. Alignment failure fallback.
## Rollout Plan
1. Extend types and response normalization first.
2. Introduce the alignment adapter behind a feature flag or environment guard.
3. Return richer payloads while keeping the current UI backward compatible.
4. Add word-level highlighting and speaker-aware UI after the backend contract stabilizes.
## Constraints and Notes
1. This workspace is not a Git repository, so the required design-document commit could not be performed here.
2. The current project does not yet include a test runner, so the implementation plan includes test infrastructure setup before feature work.
## Implementation Status
Implemented in this workspace:
1. Test infrastructure using Vitest, jsdom, and Testing Library.
2. Shared subtitle pipeline helpers for normalization, sentence reconstruction, speaker assignment, word highlighting, and timeline snapping.
3. A backend subtitle orchestration layer plus an alignment-service adapter boundary for local ASR / alignment backends.
4. Gemini-based sentence translation in the audio pipeline, without relying on OpenAI for ASR or translation.
5. Frontend pipeline mapping, precision notices, word-level playback feedback, and speaker-aware presentation.
Automated verification completed:
1. `npm test -- --run`
2. `npm run lint`
3. `npm run build`
Manual verification still pending:
1. Single-speaker clip with pauses.
2. Two-speaker dialogue with interruptions.
3. Music-heavy clip.
4. Alignment-service unavailable fallback using a real upload.

# Precise Dialogue Localization Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Build a high-precision subtitle pipeline that returns accurate sentence boundaries, word-level timings, and real speaker attribution while preserving the current editor flow.
**Architecture:** Keep the React app and `server.ts` as the public entry points, but move timing-critical work into a dedicated alignment adapter. The backend normalizes aligned words into sentence subtitles, translates text without changing timing, and returns quality metadata so the editor can enable or disable precision UI safely.
**Tech Stack:** React 19, TypeScript, Vite, Express, FFmpeg, OpenAI SDK, a new test runner (`vitest`), and a high-precision alignment backend adapter.
---
### Task 1: Add Test Infrastructure
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\package.json`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\vitest.config.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\test\setup.ts`
**Step 1: Write the failing test**
Create a minimal smoke test first so the test runner has a real target.
```ts
import { describe, expect, it } from 'vitest';
describe('test harness', () => {
it('runs vitest in this workspace', () => {
expect(true).toBe(true);
});
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run`
Expected: FAIL because no `test` script or Vitest config exists yet.
**Step 3: Write minimal implementation**
1. Add `test` and `test:watch` scripts to `package.json`.
2. Add dev dependencies for `vitest`.
3. Create `vitest.config.ts` with a Node environment default.
4. Add `src/test/setup.ts` for shared setup.
```ts
import { defineConfig } from 'vitest/config';
export default defineConfig({
test: {
environment: 'node',
setupFiles: ['./src/test/setup.ts'],
},
});
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run`
Expected: PASS with the smoke test.
**Step 5: Commit**
```bash
git add package.json vitest.config.ts src/test/setup.ts
git commit -m "test: add vitest infrastructure"
```
### Task 2: Extract Subtitle Pipeline Types and Normalizers
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\types.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.test.ts`
**Step 1: Write the failing test**
Write tests for normalization from aligned word payloads to UI-ready subtitles.
```ts
it('derives subtitle boundaries from first and last word', () => {
const result = normalizeAlignedSentence({
id: 's1',
speakerId: 'spk_0',
words: [
{ text: 'Hello', startTime: 1.2, endTime: 1.5, speakerId: 'spk_0', confidence: 0.99 },
{ text: 'world', startTime: 1.6, endTime: 2.0, speakerId: 'spk_0', confidence: 0.98 },
],
originalText: 'Hello world',
translatedText: '你好世界',
});
expect(result.startTime).toBe(1.2);
expect(result.endTime).toBe(2.0);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/subtitlePipeline.test.ts`
Expected: FAIL because the new module and extended types do not exist.
**Step 3: Write minimal implementation**
1. Extend `Subtitle` in `src/types.ts` with `speakerId`, `words`, and `confidence`.
2. Create a pure helper module that normalizes backend payloads into frontend subtitles.
```ts
export const deriveSubtitleBounds = (words: WordTiming[]) => ({
startTime: words[0]?.startTime ?? 0,
endTime: words[words.length - 1]?.endTime ?? 0,
});
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/subtitlePipeline.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/types.ts src/lib/subtitlePipeline.ts src/lib/subtitlePipeline.test.ts
git commit -m "feat: add subtitle pipeline normalizers"
```
### Task 3: Implement Sentence Reconstruction Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.test.ts`
**Step 1: Write the failing test**
Cover pause splitting and speaker splitting.
```ts
it('splits sentences when speaker changes', () => {
const result = rebuildSentences([
{ text: 'Hi', startTime: 0.0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'there', startTime: 0.25, endTime: 0.5, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'no', startTime: 0.55, endTime: 0.7, speakerId: 'spk_1', confidence: 0.9 },
]);
expect(result).toHaveLength(2);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts`
Expected: FAIL because the helper module is missing.
**Step 3: Write minimal implementation**
Implement pure splitting rules:
1. Split on `speakerId` change.
2. Split when word gaps exceed `0.45`.
3. Split when sentence duration exceeds `8`.
```ts
if (nextWord.speakerId !== currentSpeakerId) {
flushSentence();
}
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/alignment/sentenceReconstruction.ts src/lib/alignment/sentenceReconstruction.test.ts
git commit -m "feat: add sentence reconstruction rules"
```
### Task 4: Implement Speaker Assignment Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.test.ts`
**Step 1: Write the failing test**
Test overlap-based speaker assignment.
```ts
it('assigns each word to the speaker segment with maximum overlap', () => {
const word = { text: 'hello', startTime: 1.0, endTime: 1.4 };
const speakers = [
{ speakerId: 'spk_0', startTime: 0.8, endTime: 1.1 },
{ speakerId: 'spk_1', startTime: 1.1, endTime: 1.6 },
];
expect(assignSpeakerToWord(word, speakers)).toBe('spk_1');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/alignment/speakerAssignment.test.ts`
Expected: FAIL because speaker assignment logic does not exist.
**Step 3: Write minimal implementation**
Add a pure overlap calculator and default to `unknown` when no segment overlaps.
```ts
const overlap = Math.max(
0,
Math.min(word.endTime, segment.endTime) - Math.max(word.startTime, segment.startTime),
);
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/alignment/speakerAssignment.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/alignment/speakerAssignment.ts src/lib/alignment/speakerAssignment.test.ts
git commit -m "feat: add speaker assignment helpers"
```
### Task 5: Isolate Backend Pipeline Logic from `server.ts`
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
**Step 1: Write the failing test**
Add tests for orchestration-level fallback behavior.
```ts
it('returns partial quality when diarization is unavailable', async () => {
const result = await buildSubtitlePayload({
alignmentResult: {
words: [{ text: 'hi', startTime: 0, endTime: 0.2, speakerId: 'unknown', confidence: 0.9 }],
speakerSegments: [],
quality: 'partial',
},
});
expect(result.quality).toBe('partial');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/server/subtitlePipeline.test.ts`
Expected: FAIL because orchestration code is still embedded in `server.ts`.
**Step 3: Write minimal implementation**
1. Move payload-building logic into `src/server/subtitlePipeline.ts`.
2. Make `server.ts` call the helper and only handle HTTP concerns.
```ts
export const buildSubtitlePayload = async (deps: SubtitlePipelineDeps) => {
// normalize alignment result
// translate text
// return { subtitles, speakers, quality, ... }
};
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/server/subtitlePipeline.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/server/subtitlePipeline.ts src/server/subtitlePipeline.test.ts server.ts
git commit -m "refactor: isolate subtitle pipeline orchestration"
```
### Task 6: Add an Alignment Service Adapter
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
**Step 1: Write the failing test**
Test that the adapter maps raw alignment responses into normalized internal types.
```ts
it('maps aligned words and speaker segments from the adapter response', async () => {
const result = await parseAlignmentResponse({
words: [{ word: 'hello', start: 1.0, end: 1.2, speaker: 'spk_0', score: 0.95 }],
speakers: [{ speaker: 'spk_0', start: 0.8, end: 1.6 }],
});
expect(result.words[0].speakerId).toBe('spk_0');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/server/alignmentAdapter.test.ts`
Expected: FAIL because no adapter exists.
**Step 3: Write minimal implementation**
Create an adapter boundary with one public function such as `requestAlignedTranscript(audioPath)`.
```ts
export const requestAlignedTranscript = async (audioPath: string) => {
// call local or remote alignment backend
// normalize response shape
};
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/server/alignmentAdapter.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/server/alignmentAdapter.ts src/server/alignmentAdapter.test.ts server.ts
git commit -m "feat: add alignment service adapter"
```
### Task 7: Upgrade `/api/process-audio-pipeline` Response Shape
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.test.ts`
**Step 1: Write the failing test**
Add a client-side test for parsing `quality`, `speakers`, and `words`.
```ts
it('maps the enriched audio pipeline response into subtitle objects', async () => {
const payload = {
subtitles: [
{
id: 'sub_1',
startTime: 1,
endTime: 2,
originalText: 'Hello',
translatedText: '你好',
speaker: 'Speaker 1',
speakerId: 'spk_0',
words: [{ text: 'Hello', startTime: 1, endTime: 2, speakerId: 'spk_0', confidence: 0.9 }],
confidence: 0.9,
},
],
speakers: [{ speakerId: 'spk_0', label: 'Speaker 1' }],
quality: 'full',
};
expect(mapPipelineResponse(payload).subtitles[0].words).toHaveLength(1);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/services/geminiService.test.ts`
Expected: FAIL because the mapping helper does not exist.
**Step 3: Write minimal implementation**
1. Add a response-mapping helper in `src/services/geminiService.ts`.
2. Preserve the existing fallback path.
3. Carry `quality` metadata to the UI.
```ts
const quality = data.quality ?? 'fallback';
const subtitles = (data.subtitles ?? []).map(mapSubtitleFromApi);
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/services/geminiService.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add server.ts src/services/geminiService.ts src/services/geminiService.test.ts
git commit -m "feat: return enriched subtitle pipeline payloads"
```
### Task 8: Add Precision Metadata to Editor State
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.test.tsx`
**Step 1: Write the failing test**
Add a test for rendering a fallback warning when `quality` is low.
```tsx
it('shows a low-precision notice for fallback subtitle results', () => {
render(<EditorScreen ... />);
expect(screen.getByText(/low-precision/i)).toBeInTheDocument();
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/components/EditorScreen.test.tsx`
Expected: FAIL because the component does not track pipeline quality yet.
**Step 3: Write minimal implementation**
1. Add state for `quality` and `speakers`.
2. Surface a small status badge or warning banner.
3. Keep the existing sentence list and timeline intact.
```tsx
{quality === 'fallback' && (
<p className="text-xs text-amber-700">Low-precision timing detected. Manual review recommended.</p>
)}
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/components/EditorScreen.test.tsx`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/components/EditorScreen.tsx src/components/EditorScreen.test.tsx
git commit -m "feat: surface subtitle precision status in editor"
```
### Task 9: Add Word-Level Playback Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
**Step 1: Write the failing test**
Test the active-word lookup helper.
```ts
it('returns the active word for the current playback time', () => {
const activeWord = getActiveWord([
{ text: 'Hello', startTime: 1, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
], 1.1);
expect(activeWord?.text).toBe('Hello');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/playback/wordHighlight.test.ts`
Expected: FAIL because playback helpers do not exist.
**Step 3: Write minimal implementation**
1. Create a pure helper for active-word lookup.
2. Use it in `EditorScreen.tsx` to render highlighted word spans when `words` are present.
```ts
export const getActiveWord = (words: WordTiming[], currentTime: number) =>
words.find((word) => currentTime >= word.startTime && currentTime <= word.endTime);
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/playback/wordHighlight.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/playback/wordHighlight.ts src/lib/playback/wordHighlight.test.ts src/components/EditorScreen.tsx
git commit -m "feat: add word-level playback highlighting"
```
### Task 10: Snap Timeline Edges to Word Boundaries
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
**Step 1: Write the failing test**
Test snapping to nearest word edges.
```ts
it('snaps a dragged start edge to the nearest word boundary', () => {
const next = snapTimeToNearestWordBoundary(
1.34,
[
{ text: 'Hello', startTime: 1.0, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'world', startTime: 1.35, endTime: 1.8, speakerId: 'spk_0', confidence: 0.9 },
],
);
expect(next).toBe(1.35);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/timeline/snapToWords.test.ts`
Expected: FAIL because no snapping helper exists.
**Step 3: Write minimal implementation**
1. Add a pure snapping helper with a small tolerance window.
2. Use it in the left and right resize timeline handlers.
```ts
export const snapTimeToNearestWordBoundary = (time: number, words: WordTiming[], tolerance = 0.1) => {
  // Choose the nearest word start or end boundary within the tolerance window.
  const boundaries = words.flatMap((word) => [word.startTime, word.endTime]);
  const nearest = boundaries.reduce((best, b) => (Math.abs(b - time) < Math.abs(best - time) ? b : best), Infinity);
  return Math.abs(nearest - time) <= tolerance ? nearest : time;
};
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/timeline/snapToWords.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/timeline/snapToWords.ts src/lib/timeline/snapToWords.test.ts src/components/EditorScreen.tsx
git commit -m "feat: snap subtitle edits to word boundaries"
```
### Task 11: Add Speaker-Aware UI State
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\voices.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.test.ts`
**Step 1: Write the failing test**
Test stable color and label generation for speaker tracks.
```ts
it('creates stable display metadata for each speaker id', () => {
const speaker = buildSpeakerPresentation({ speakerId: 'spk_0', label: 'Speaker 1' });
expect(speaker.color).toMatch(/^#/);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/speakers/speakerPresentation.test.ts`
Expected: FAIL because no speaker presentation helper exists.
**Step 3: Write minimal implementation**
1. Create a helper that derives display color and fallback label from `speakerId`.
2. Use it to color sentence chips or timeline items.
3. Keep voice assignment behavior backward compatible.
```ts
export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
speakerId,
label,
color: '#1677ff',
});
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/speakers/speakerPresentation.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/components/EditorScreen.tsx src/voices.ts src/lib/speakers/speakerPresentation.ts src/lib/speakers/speakerPresentation.test.ts
git commit -m "feat: add speaker-aware editor presentation"
```
### Task 12: Verify End-to-End Behavior and Update Docs
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\README.md`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\docs\plans\2026-03-17-precise-dialogue-localization-design.md`
**Step 1: Write the failing test**
Write down the manual verification checklist before changing docs so the release criteria are explicit.
```md
- [ ] Single-speaker clip returns `quality: full`
- [ ] Two-speaker clip shows distinct speaker IDs
- [ ] Fallback path shows low-precision notice
- [ ] Timeline resize snaps to word boundaries
```
**Step 2: Run test to verify it fails**
Run: `npm run lint`
Expected: may pass or fail depending on in-progress code; either way, manual verification remains incomplete until the checklist is executed.
**Step 3: Write minimal implementation**
1. Update `README.md` with new environment requirements and pipeline description.
2. Record the manual verification results in the design document or a linked note.
```md
## High-Precision Subtitle Mode
Set the alignment backend environment variables before running the app.
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run`
Expected: PASS.
Run: `npm run lint`
Expected: PASS.
Run: `npm run build`
Expected: PASS.
**Step 5: Commit**
```bash
git add README.md docs/plans/2026-03-17-precise-dialogue-localization-design.md
git commit -m "docs: document precise dialogue localization workflow"
```
## Notes for Execution
1. This workspace currently has no `.git` directory, so commit steps cannot be executed until the project is placed in a real Git checkout.
2. Introduce the alignment backend behind environment-based configuration so existing demos can still use the current fallback path.
3. Prefer pure functions for sentence reconstruction, speaker assignment, snapping, and word-highlighting logic so they remain easy to test.

# Alignment Fallback Safety Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Stop subtitle generation from silently falling back into the Studio project workflow unless an explicit environment flag enables it.
**Architecture:** Keep the existing alignment adapter and Studio fallback code path, but gate that fallback behind a parsed boolean config from `.env`. When the alignment endpoint returns `404` or `405` and the flag is not enabled, fail fast with a clear error instead of returning unrelated subtitles.
**Tech Stack:** TypeScript, Vitest, Express, Node fetch/FormData APIs
---
### Task 1: Add failing tests for safe-by-default fallback behavior
**Files:**
- Modify: `E:/Downloads/ai-video-dubbing-&-translation/src/server/alignmentAdapter.test.ts`
- Modify: `E:/Downloads/ai-video-dubbing-&-translation/src/server/audioPipelineConfig.test.ts`
- Modify: `E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.test.ts`
**Step 1: Write the failing tests**
- Add a test asserting `requestAlignedTranscript()` throws a clear error on `404` when Studio fallback is not explicitly enabled.
- Update the existing fallback test to pass only when `allowStudioProjectFallback` is set to `true`.
- Add config tests for parsing `ALLOW_STUDIO_PROJECT_FALLBACK` with a default of `false`.
- Add a pipeline test asserting `generateSubtitlePipeline()` forwards the parsed fallback flag into `requestAlignedTranscript()`.
**Step 2: Run targeted tests to verify they fail**
Run: `node ./node_modules/vitest/vitest.mjs run src/server/alignmentAdapter.test.ts src/server/audioPipelineConfig.test.ts src/server/subtitleGeneration.test.ts`
Expected: FAIL because the fallback flag does not exist yet and `requestAlignedTranscript()` still auto-falls back.
### Task 2: Implement the minimal fallback gate
**Files:**
- Modify: `E:/Downloads/ai-video-dubbing-&-translation/src/server/audioPipelineConfig.ts`
- Modify: `E:/Downloads/ai-video-dubbing-&-translation/src/server/alignmentAdapter.ts`
- Modify: `E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts`
**Step 1: Add config parsing**
- Extend `AudioPipelineConfig` with `allowStudioProjectFallback: boolean`.
- Parse `ALLOW_STUDIO_PROJECT_FALLBACK` from `.env`, defaulting to `false`.
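A minimal version of that parsing might look like the following; `parseBooleanFlag` and `resolveFallbackConfig` are illustrative names, not the project's actual helpers:

```typescript
// Illustrative env-flag parsing; the real audioPipelineConfig.ts may
// normalize values differently.
function parseBooleanFlag(value: string | undefined, defaultValue = false): boolean {
  if (value === undefined || value.trim() === "") return defaultValue;
  return ["true", "1", "yes"].includes(value.trim().toLowerCase());
}

interface FallbackConfig {
  allowStudioProjectFallback: boolean;
}

function resolveFallbackConfig(env: Record<string, string | undefined>): FallbackConfig {
  return {
    // Safe by default: the Studio fallback only runs when explicitly enabled.
    allowStudioProjectFallback: parseBooleanFlag(env.ALLOW_STUDIO_PROJECT_FALLBACK),
  };
}
```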
**Step 2: Gate fallback execution**
- Extend `RequestAlignedTranscriptOptions` with `allowStudioProjectFallback`.
- When the alignment root returns `404` or `405` and the flag is `false`, throw a clear error that points to `ALLOW_STUDIO_PROJECT_FALLBACK=true` or a proper alignment backend.
- When the flag is `true`, keep the current Studio project workflow unchanged.
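In the adapter, the gate reduces to one decision before the existing fallback code path; the names below are illustrative, and the real `RequestAlignedTranscriptOptions` carries more fields:

```typescript
interface GateOptions {
  allowStudioProjectFallback: boolean;
}

// Decide whether the Studio project workflow may run after the alignment
// root answered with `status`; throws instead of silently switching paths.
function shouldUseStudioFallback(status: number, opts: GateOptions): boolean {
  const endpointMissing = status === 404 || status === 405;
  if (!endpointMissing) return false; // normal alignment path
  if (!opts.allowStudioProjectFallback) {
    throw new Error(
      `Alignment endpoint returned ${status} and the Studio project fallback is disabled. ` +
        "Set ALLOW_STUDIO_PROJECT_FALLBACK=true to opt in, or configure a proper alignment backend.",
    );
  }
  return true; // flag is on: keep the current Studio project workflow unchanged
}
```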
**Step 3: Thread config through the pipeline**
- Pass `allowStudioProjectFallback` from `resolveAudioPipelineConfig()` into `requestAlignedTranscript()` inside `generateSubtitlePipeline()`.
### Task 3: Update docs and verify
**Files:**
- Modify: `E:/Downloads/ai-video-dubbing-&-translation/.env.example`
- Modify: `E:/Downloads/ai-video-dubbing-&-translation/README.md`
**Step 1: Document the new flag**
- Add `ALLOW_STUDIO_PROJECT_FALLBACK="false"` to `.env.example`.
- Clarify in `README.md` that subtitle generation now fails fast by default when the alignment endpoint is unavailable, and that the Studio project fallback is opt-in via `ALLOW_STUDIO_PROJECT_FALLBACK=true`.
**Step 2: Run verification**
Run: `node ./node_modules/vitest/vitest.mjs run src/server/alignmentAdapter.test.ts src/server/audioPipelineConfig.test.ts src/server/subtitleGeneration.test.ts`
Expected: PASS
Run: `node ./node_modules/vitest/vitest.mjs run`
Expected: PASS
Run: `cmd /c npm run lint`
Expected: PASS

13
index.html Normal file
@@ -0,0 +1,13 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>My Google AI Studio App</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>

5
metadata.json Normal file
@@ -0,0 +1,5 @@
{
"name": "AI Video Dubbing & Translation",
"description": "Professional AI video translation and dubbing tool with vocal separation and TTS (Version 1.0).",
"requestFramePermissions": []
}

3742
package-lock.json generated Normal file

File diff suppressed because it is too large

49
package.json Normal file
@@ -0,0 +1,49 @@
{
"name": "react-example",
"private": true,
"version": "1.0.0",
"type": "module",
"scripts": {
"dev": "node ./node_modules/tsx/dist/cli.mjs server.ts",
"build": "node ./node_modules/vite/bin/vite.js build",
"preview": "node ./node_modules/vite/bin/vite.js preview",
"clean": "rm -rf dist",
"lint": "node ./node_modules/typescript/bin/tsc --noEmit",
"test": "node ./node_modules/vitest/vitest.mjs run"
},
"dependencies": {
"@google-cloud/speech": "^7.3.0",
"@google/genai": "^1.45.0",
"@tailwindcss/vite": "^4.1.14",
"@vitejs/plugin-react": "^5.0.4",
"axios": "^1.13.6",
"cors": "^2.8.6",
"dotenv": "^17.3.1",
"express": "^4.22.1",
"ffmpeg-static": "^5.3.0",
"ffprobe-static": "^3.1.0",
"fluent-ffmpeg": "^2.1.3",
"lucide-react": "^0.546.0",
"motion": "^12.23.24",
"multer": "^2.1.1",
"openai": "^6.29.0",
"react": "^19.0.0",
"react-dom": "^19.0.0",
"vite": "^6.2.0"
},
"devDependencies": {
"@testing-library/jest-dom": "^6.9.1",
"@testing-library/react": "^16.3.2",
"@types/cors": "^2.8.19",
"@types/express": "^4.17.25",
"@types/multer": "^2.1.0",
"@types/node": "^22.14.0",
"autoprefixer": "^10.4.21",
"jsdom": "^29.0.0",
"tailwindcss": "^4.1.14",
"tsx": "^4.21.0",
"typescript": "~5.8.2",
"vite": "^6.2.0",
"vitest": "^4.1.0"
}
}

415
server.ts Normal file
@@ -0,0 +1,415 @@
import express from 'express';
import cors from 'cors';
import dotenv from 'dotenv';
import { createServer as createViteServer } from 'vite';
import path from 'path';
import fs from 'fs';
import ffmpeg from 'fluent-ffmpeg';
import ffmpegInstaller from 'ffmpeg-static';
import ffprobeInstaller from 'ffprobe-static';
import axios from 'axios';
import multer from 'multer';
import {
createMiniMaxTtsUrl,
getMiniMaxTtsHttpStatus,
resolveMiniMaxTtsConfig,
} from './src/server/minimaxTts';
import { generateSubtitlePipeline } from './src/server/subtitleGeneration';
import { parseSubtitleRequest } from './src/server/subtitleRequest';
import {
buildAssSubtitleContent,
buildExportAudioPlan,
DEFAULT_EXPORT_TEXT_STYLES,
shiftSubtitlesToExportTimeline,
} from './src/server/exportVideo';
import { TextStyles } from './src/types';
const upload = multer({
dest: 'uploads/',
limits: {
fileSize: 1024 * 1024 * 1024, // 1GB file limit
fieldSize: 1024 * 1024 * 500 // 500MB field limit for base64 strings
}
});
if (!fs.existsSync('uploads')) {
fs.mkdirSync('uploads');
}
if (ffmpegInstaller) {
ffmpeg.setFfmpegPath(ffmpegInstaller);
}
if (ffprobeInstaller.path) {
ffmpeg.setFfprobePath(ffprobeInstaller.path);
}
dotenv.config();
async function startServer() {
const app = express();
const PORT = 3000;
app.use(cors());
app.use(express.json({ limit: '500mb' }));
app.use(express.urlencoded({ limit: '500mb', extended: true }));
// MiniMax TTS Endpoint
app.post('/api/tts', async (req, res) => {
try {
const { text, voiceId } = req.body;
if (!text) return res.status(400).json({ error: 'No text provided' });
const { apiHost, apiKey } = resolveMiniMaxTtsConfig(process.env);
const response = await axios.post(
createMiniMaxTtsUrl(apiHost),
{
model: "speech-2.8-hd",
text: text,
stream: false,
output_format: "hex",
voice_setting: {
voice_id: voiceId || 'male-qn-qingse',
speed: 1.0,
vol: 1.0,
pitch: 0
},
audio_setting: {
sample_rate: 32000,
bitrate: 128000,
format: "mp3",
channel: 1,
}
},
{
headers: {
'Authorization': `Bearer ${apiKey}`,
'Content-Type': 'application/json'
}
}
);
if (response.data?.base_resp?.status_code !== 0) {
console.error('MiniMax API Error:', response.data?.base_resp);
return res
.status(getMiniMaxTtsHttpStatus(response.data?.base_resp))
.json({ error: response.data?.base_resp?.status_msg || 'MiniMax TTS failed' });
}
const hexAudio = response.data.data.audio;
const audioBuffer = Buffer.from(hexAudio, 'hex');
const audioBase64 = audioBuffer.toString('base64');
res.json({ audio: audioBase64 });
} catch (error: any) {
if (error instanceof Error && error.message.includes('MINIMAX_API_KEY')) {
console.error('TTS Config Error:', error.message);
return res.status(400).json({ error: error.message });
}
console.error('TTS Error:', error.response?.data || error.message);
res
.status(getMiniMaxTtsHttpStatus(error.response?.data?.base_resp))
.json({ error: error.response?.data?.base_resp?.status_msg || error.message || 'Failed to generate TTS' });
}
});
// Vocal Separation Endpoint
app.post('/api/separate-vocal', upload.single('video'), async (req, res) => {
const videoPath = req.file?.path;
const timestamp = Date.now();
const instrumentalPath = path.join(process.cwd(), `temp_instrumental_${timestamp}.mp3`);
try {
if (!videoPath) return res.status(400).json({ error: 'No video file provided' });
// Simple vocal reduction using FFmpeg (Center-panned vocal removal trick)
// This is a basic fallback as true AI separation requires specialized models.
await new Promise((resolve, reject) => {
ffmpeg(videoPath)
.noVideo()
.audioFilters('pan=stereo|c0=c0-c1|c1=c1-c0') // Basic vocal reduction
.format('mp3')
.on('end', resolve)
.on('error', reject)
.save(instrumentalPath);
});
const instrumentalBuffer = fs.readFileSync(instrumentalPath);
const instrumentalBase64 = instrumentalBuffer.toString('base64');
// Cleanup
if (fs.existsSync(instrumentalPath)) fs.unlinkSync(instrumentalPath);
if (fs.existsSync(videoPath)) fs.unlinkSync(videoPath);
res.json({ instrumental: instrumentalBase64 });
} catch (error: any) {
console.error('Vocal Separation Error:', error);
res.status(500).json({ error: error.message || 'Failed to separate vocals' });
} finally {
// Cleanup
if (instrumentalPath && fs.existsSync(instrumentalPath)) fs.unlinkSync(instrumentalPath);
if (videoPath && fs.existsSync(videoPath)) fs.unlinkSync(videoPath);
}
});
app.post('/api/process-audio-pipeline', upload.single('video'), async (req, res) => {
const videoPath = req.file?.path;
const timestamp = Date.now();
const audioPath = path.join(process.cwd(), `temp_audio_${timestamp}.wav`);
try {
if (!videoPath) return res.status(400).json({ error: 'No video file provided' });
// 1. Extract Audio (16kHz, Mono, WAV)
await new Promise((resolve, reject) => {
ffmpeg(videoPath)
.noVideo()
.audioFrequency(16000)
.audioChannels(1)
.format('wav')
.on('end', resolve)
.on('error', reject)
.save(audioPath);
});
const audioFile = fs.readFileSync(audioPath);
const audioBase64 = audioFile.toString('base64');
// Cleanup
if (fs.existsSync(audioPath)) fs.unlinkSync(audioPath);
if (fs.existsSync(videoPath)) fs.unlinkSync(videoPath);
res.json({ audioBase64 });
} catch (error: any) {
console.error('Audio Extraction Error:', error);
res.status(500).json({ error: error.message || 'Failed to extract audio' });
} finally {
// Cleanup
if (audioPath && fs.existsSync(audioPath)) fs.unlinkSync(audioPath);
if (videoPath && fs.existsSync(videoPath)) fs.unlinkSync(videoPath);
}
});
app.post('/api/generate-subtitles', upload.single('video'), async (req, res) => {
const videoPath = req.file?.path;
try {
if (!videoPath) {
return res.status(400).json({ error: 'No video file provided' });
}
const { provider, targetLanguage } = parseSubtitleRequest(req.body);
const result = await generateSubtitlePipeline({
videoPath,
provider,
targetLanguage,
env: process.env,
});
res.json({
...result,
provider,
});
} catch (error: any) {
const message = error instanceof Error ? error.message : 'Failed to generate subtitles';
const lowerMessage = message.toLowerCase();
const status =
lowerMessage.includes('target language') ||
lowerMessage.includes('unsupported llm provider') ||
lowerMessage.includes('_api_key is required') ||
lowerMessage.includes('studio project fallback is disabled')
? 400
: lowerMessage.includes('unauthorized') ||
lowerMessage.includes('authentication') ||
lowerMessage.includes('auth fail') ||
lowerMessage.includes('status 401')
? 401
: 502;
console.error('Subtitle Generation Error:', error);
res.status(status).json({ error: message });
} finally {
if (videoPath && fs.existsSync(videoPath)) fs.unlinkSync(videoPath);
}
});
app.post('/api/export-video', upload.single('video'), async (req, res) => {
const tempFiles: string[] = [];
try {
const { subtitles: subtitlesStr, bgmBase64, trimRange: trimRangeStr, textStyles: textStylesStr } = req.body;
const videoFile = req.file;
if (!videoFile) return res.status(400).json({ error: 'No video file provided' });
const subtitles = subtitlesStr ? JSON.parse(subtitlesStr) : [];
const trimRange = trimRangeStr ? JSON.parse(trimRangeStr) : null;
const textStyles: TextStyles = textStylesStr
? { ...DEFAULT_EXPORT_TEXT_STYLES, ...JSON.parse(textStylesStr) }
: DEFAULT_EXPORT_TEXT_STYLES;
const timestamp = Date.now();
const inputPath = videoFile.path;
const outputPath = path.join(process.cwd(), `output_${timestamp}.mp4`);
const subtitlePath = path.join(process.cwd(), `subs_${timestamp}.ass`);
tempFiles.push(subtitlePath, outputPath, inputPath);
// Probe input streams to plan the audio/subtitle filters
const probeData: any = await new Promise((resolve, reject) => {
ffmpeg.ffprobe(inputPath, (err, metadata) => {
if (err) reject(err);
else resolve(metadata);
});
});
const hasAudio = probeData.streams.some((s: any) => s.codec_type === 'audio');
const videoStream = probeData.streams.find((s: any) => s.codec_type === 'video');
const videoWidth = videoStream?.width || 1080;
const videoHeight = videoStream?.height || 1920;
const exportSubtitles = shiftSubtitlesToExportTimeline(subtitles || [], trimRange);
const hasSubtitles = exportSubtitles.length > 0;
if (hasSubtitles) {
const assContent = buildAssSubtitleContent({
subtitles: exportSubtitles,
textStyles,
videoWidth,
videoHeight,
});
fs.writeFileSync(subtitlePath, assContent);
}
let command = ffmpeg(inputPath);
const filterComplexParts: string[] = [];
const audioMixInputs: string[] = [];
let inputIndex = 1;
const audioPlan = buildExportAudioPlan({
hasSourceAudio: hasAudio,
hasBgm: Boolean(bgmBase64),
subtitles: exportSubtitles,
});
if (bgmBase64) {
const bgmPath = path.join(process.cwd(), `bgm_${timestamp}.mp3`);
fs.writeFileSync(bgmPath, Buffer.from(bgmBase64, 'base64'));
command = command.input(bgmPath);
tempFiles.push(bgmPath);
filterComplexParts.push(`[${inputIndex}:a]volume=${audioPlan.bgmVolume ?? 0.5}[bgm]`);
audioMixInputs.push('[bgm]');
inputIndex++;
}
if (audioPlan.includeSourceAudio) {
filterComplexParts.push(`[0:a]volume=${audioPlan.sourceAudioVolume ?? 0.3}[sourcea]`);
audioMixInputs.push('[sourcea]');
}
for (let i = 0; i < audioPlan.ttsTracks.length; i++) {
const track = audioPlan.ttsTracks[i];
if (track.audioUrl) {
const base64Data = track.audioUrl.split(',')[1];
const isWav = track.audioUrl.includes('audio/wav');
const ext = isWav ? 'wav' : 'mp3';
const ttsPath = path.join(process.cwd(), `tts_${timestamp}_${i}.${ext}`);
fs.writeFileSync(ttsPath, Buffer.from(base64Data, 'base64'));
command = command.input(ttsPath);
tempFiles.push(ttsPath);
filterComplexParts.push(
`[${inputIndex}:a]volume=${track.volume},adelay=${track.delayMs}|${track.delayMs}[tts${i}]`,
);
audioMixInputs.push(`[tts${i}]`);
inputIndex++;
}
}
const escapedSubtitlePath = subtitlePath.replace(/\\/g, '/').replace(/:/g, '\\:');
if (hasSubtitles) {
filterComplexParts.push(`[0:v]subtitles='${escapedSubtitlePath}'[vout]`);
}
let audioMap: string | null = null;
if (audioMixInputs.length > 1) {
filterComplexParts.push(
`${audioMixInputs.join('')}amix=inputs=${audioMixInputs.length}:duration=first:dropout_transition=2[aout]`,
);
audioMap = '[aout]';
} else if (audioMixInputs.length === 1) {
audioMap = audioMixInputs[0];
}
if (filterComplexParts.length > 0) {
command = command.complexFilter(filterComplexParts);
}
const outputMaps = [`-map ${hasSubtitles ? '[vout]' : '0:v'}`];
if (audioMap) {
outputMaps.push(`-map ${audioMap}`);
}
command = command.outputOptions(outputMaps);
if (trimRange) {
command = command.outputOptions([
`-ss ${trimRange.start}`,
`-t ${trimRange.end - trimRange.start}`
]);
}
await new Promise((resolve, reject) => {
command
.output(outputPath)
.on('end', resolve)
.on('error', (err, stdout, stderr) => {
console.error('FFmpeg export error:', err);
console.error('FFmpeg stderr:', stderr);
reject(new Error(`FFmpeg error: ${err.message}. Stderr: ${stderr}`));
})
.run();
});
if (!fs.existsSync(outputPath)) {
throw new Error('FFmpeg finished but output file was not created');
}
const outputBuffer = fs.readFileSync(outputPath);
console.log(`Exported video size: ${outputBuffer.length} bytes`);
const outputBase64 = outputBuffer.toString('base64');
const dataUrl = `data:video/mp4;base64,${outputBase64}`;
res.json({ videoUrl: dataUrl });
} catch (error: any) {
console.error('Export Error:', error);
res.status(500).json({ error: error.message || 'Failed to export video' });
} finally {
// Cleanup
for (const file of tempFiles) {
if (fs.existsSync(file)) {
try {
fs.unlinkSync(file);
} catch (e) {
console.error(`Failed to delete temp file ${file}:`, e);
}
}
}
}
});
if (process.env.NODE_ENV !== 'production') {
const vite = await createViteServer({
server: { middlewareMode: true },
appType: 'spa',
});
app.use(vite.middlewares);
} else {
const distPath = path.join(process.cwd(), 'dist');
app.use(express.static(distPath));
app.get('*', (req, res) => {
res.sendFile(path.join(distPath, 'index.html'));
});
}
app.listen(PORT, '0.0.0.0', () => {
console.log(`Server running on http://localhost:${PORT}`);
});
}
startServer();

43
src/App.tsx Normal file
@@ -0,0 +1,43 @@
/**
* @license
* SPDX-License-Identifier: Apache-2.0
*/
import React, { useState } from 'react';
import UploadScreen from './components/UploadScreen';
import EditorScreen from './components/EditorScreen';
function App() {
const [currentView, setCurrentView] = useState<'upload' | 'editor'>('upload');
const [videoFile, setVideoFile] = useState<File | null>(null);
const [targetLanguage, setTargetLanguage] = useState<string>('en');
const [trimRange, setTrimRange] = useState<{start: number, end: number} | null>(null);
const handleVideoUpload = (file: File, lang: string, startTime?: number, endTime?: number) => {
setVideoFile(file);
setTargetLanguage(lang);
if (startTime !== undefined && endTime !== undefined) {
setTrimRange({ start: startTime, end: endTime });
} else {
setTrimRange(null);
}
setCurrentView('editor');
};
return (
<div className="min-h-screen bg-gray-50 text-gray-800 font-sans">
{currentView === 'upload' ? (
<UploadScreen onUpload={handleVideoUpload} />
) : (
<EditorScreen
videoFile={videoFile}
targetLanguage={targetLanguage}
trimRange={trimRange}
onBack={() => setCurrentView('upload')}
/>
)}
</div>
);
}
export default App;

src/components/EditorScreen.test.tsx Normal file
@@ -0,0 +1,98 @@
// @vitest-environment jsdom
import React from 'react';
import { cleanup, fireEvent, render, screen, waitFor } from '@testing-library/react';
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
import EditorScreen from './EditorScreen';
const { generateSubtitlePipelineMock, generateTTSMock } = vi.hoisted(() => ({
generateSubtitlePipelineMock: vi.fn(),
generateTTSMock: vi.fn(),
}));
vi.mock('../services/subtitleService', () => ({
generateSubtitlePipeline: generateSubtitlePipelineMock,
}));
vi.mock('../services/ttsService', () => ({
generateTTS: generateTTSMock,
}));
describe('EditorScreen', () => {
afterEach(() => {
cleanup();
});
beforeEach(() => {
generateSubtitlePipelineMock.mockReset();
generateSubtitlePipelineMock.mockResolvedValue({
subtitles: [],
speakers: [],
quality: 'fallback',
});
generateTTSMock.mockReset();
Object.defineProperty(URL, 'createObjectURL', {
writable: true,
value: vi.fn(() => 'blob:video'),
});
Object.defineProperty(URL, 'revokeObjectURL', {
writable: true,
value: vi.fn(),
});
class ResizeObserverMock {
observe() {}
disconnect() {}
}
vi.stubGlobal('ResizeObserver', ResizeObserverMock);
});
it('shows a low-precision notice for fallback subtitle results', async () => {
render(
<EditorScreen
videoFile={new File(['video'], 'clip.mp4', { type: 'video/mp4' })}
targetLanguage="en"
trimRange={null}
onBack={() => {}}
/>,
);
expect(screen.getAllByLabelText(/llm/i)[0]).toHaveValue('doubao');
expect(await screen.findByText(/low-precision/i)).toBeInTheDocument();
});
it('regenerates subtitles with the selected llm provider', async () => {
render(
<EditorScreen
videoFile={new File(['video'], 'clip.mp4', { type: 'video/mp4' })}
targetLanguage="en"
trimRange={null}
onBack={() => {}}
/>,
);
await waitFor(() =>
expect(generateSubtitlePipelineMock).toHaveBeenCalledWith(
expect.any(File),
'en',
'doubao',
null,
),
);
fireEvent.change(screen.getAllByLabelText(/llm/i)[0], {
target: { value: 'gemini' },
});
await waitFor(() =>
expect(generateSubtitlePipelineMock).toHaveBeenLastCalledWith(
expect.any(File),
'en',
'gemini',
null,
),
);
});
});

src/components/EditorScreen.tsx Normal file
@@ -0,0 +1,946 @@
import React, { useState, useRef, useEffect, useMemo, useCallback } from 'react';
import axios from 'axios';
import { ChevronLeft, Play, Pause, Volume2, Settings, Download, Save, LayoutTemplate, Type, Image as ImageIcon, Music, Scissors, Plus, Trash2, Maximize2, Loader2 } from 'lucide-react';
import VoiceMarketModal from './VoiceMarketModal';
import ExportModal from './ExportModal';
import { LlmProvider, PipelineQuality, Subtitle, TextStyles } from '../types';
import { generateSubtitlePipeline } from '../services/subtitleService';
import { generateTTS } from '../services/ttsService';
import { MINIMAX_VOICES } from '../voices';
export default function EditorScreen({ videoFile, targetLanguage, trimRange, onBack }: { videoFile: File | null; targetLanguage: string; trimRange?: {start: number, end: number} | null; onBack: () => void }) {
const [subtitles, setSubtitles] = useState<Subtitle[]>([]);
const [activeSubtitleId, setActiveSubtitleId] = useState<string>('');
const [showVoiceMarket, setShowVoiceMarket] = useState(false);
const [voiceMarketTargetId, setVoiceMarketTargetId] = useState<string | null>(null);
const [showExportModal, setShowExportModal] = useState(false);
const [isGenerating, setIsGenerating] = useState(false);
const [isDubbingGenerating, setIsDubbingGenerating] = useState(false);
const [generatingAudioIds, setGeneratingAudioIds] = useState<Set<string>>(new Set());
const [generationError, setGenerationError] = useState<string | null>(null);
const [subtitleQuality, setSubtitleQuality] = useState<PipelineQuality>('fallback');
const [llmProvider, setLlmProvider] = useState<LlmProvider>('doubao');
// Video Player State
const videoRef = useRef<HTMLVideoElement>(null);
const audioRef = useRef<HTMLAudioElement | null>(null);
const bgmRef = useRef<HTMLAudioElement | null>(null);
const audioContextRef = useRef<AudioContext | null>(null);
const lastPlayedSubId = useRef<string | null>(null);
const [isPlaying, setIsPlaying] = useState(false);
const [bgmUrl, setBgmUrl] = useState<string | null>(null);
const [bgmBase64, setBgmBase64] = useState<string | null>(null);
const [isSeparating, setIsSeparating] = useState(false);
const [currentTime, setCurrentTime] = useState(0);
const [duration, setDuration] = useState(30); // Default 30s, updated on load
const [videoUrl, setVideoUrl] = useState<string>('');
const [videoAspectRatio, setVideoAspectRatio] = useState<number>(16/9);
const containerRef = useRef<HTMLDivElement>(null);
const [renderedVideoWidth, setRenderedVideoWidth] = useState<number | '100%'>('100%');
// Timeline Dragging State
const [draggingId, setDraggingId] = useState<string | null>(null);
const [dragType, setDragType] = useState<'move' | 'resize-left' | 'resize-right' | null>(null);
const [dragStartX, setDragStartX] = useState(0);
const [initialSubTimes, setInitialSubTimes] = useState<{startTime: number, endTime: number} | null>(null);
const [isDraggingPlayhead, setIsDraggingPlayhead] = useState(false);
const timelineTrackRef = useRef<HTMLDivElement>(null);
useEffect(() => {
if (!containerRef.current) return;
const observer = new ResizeObserver((entries) => {
for (let entry of entries) {
const { width, height } = entry.contentRect;
const containerAspect = width / height;
if (containerAspect > videoAspectRatio) {
// Container is wider than video. Video height is 100%, width is scaled.
setRenderedVideoWidth(height * videoAspectRatio);
} else {
// Container is taller than video. Video width is 100%.
setRenderedVideoWidth(width);
}
}
});
observer.observe(containerRef.current);
return () => observer.disconnect();
}, [videoAspectRatio]);
useEffect(() => {
if (!videoFile) {
setVideoUrl('');
return;
}
const url = URL.createObjectURL(videoFile);
setVideoUrl(url);
return () => {
URL.revokeObjectURL(url);
};
}, [videoFile]);
const fetchSubtitles = useCallback(async () => {
if (!videoFile) return;
setIsGenerating(true);
setGenerationError(null);
try {
const pipelineResult = await generateSubtitlePipeline(
videoFile,
targetLanguage,
llmProvider,
trimRange,
);
const generatedSubs = pipelineResult.subtitles;
setSubtitleQuality(pipelineResult.quality);
let adjustedSubs = generatedSubs;
if (trimRange) {
adjustedSubs = generatedSubs
.filter(sub => sub.endTime > trimRange.start && sub.startTime < trimRange.end)
.map(sub => ({
...sub,
startTime: Math.max(0, sub.startTime - trimRange.start),
endTime: Math.min(trimRange.end - trimRange.start, sub.endTime - trimRange.start)
}));
}
setSubtitles(adjustedSubs);
if (adjustedSubs.length > 0) {
setActiveSubtitleId(adjustedSubs[0].id);
}
} catch (err: any) {
console.error("Failed to generate subtitles:", err);
setSubtitleQuality('fallback');
let errorMessage = "Failed to generate subtitles. Please try again or check your API key.";
const errString = err instanceof Error ? err.message : JSON.stringify(err);
if (
errString.includes("429") ||
errString.includes("quota") ||
errString.includes("RESOURCE_EXHAUSTED") ||
err?.status === 429 ||
err?.error?.code === 429
) {
errorMessage = "You have exceeded your Volcengine API quota. Please check your plan and billing details.";
} else if (err instanceof Error) {
errorMessage = err.message;
}
setGenerationError(errorMessage);
setSubtitles([]);
setActiveSubtitleId('');
} finally {
setIsGenerating(false);
}
}, [videoFile, targetLanguage, trimRange, llmProvider]);
// Generate subtitles on mount
useEffect(() => {
fetchSubtitles();
}, [fetchSubtitles]);
const [textStyles, setTextStyles] = useState<TextStyles>({
fontFamily: 'MiSans-Late',
fontSize: 24,
color: '#FFFFFF',
backgroundColor: 'transparent',
alignment: 'center',
isBold: false,
isItalic: false,
isUnderline: false,
});
const togglePlay = async () => {
if (videoRef.current) {
if (isPlaying) {
videoRef.current.pause();
if (bgmRef.current) bgmRef.current.pause();
if (audioRef.current) audioRef.current.pause();
} else {
if (trimRange && (videoRef.current.currentTime < trimRange.start || videoRef.current.currentTime >= trimRange.end)) {
videoRef.current.currentTime = trimRange.start;
}
videoRef.current.play();
if (bgmRef.current) {
bgmRef.current.currentTime = videoRef.current.currentTime;
bgmRef.current.play();
}
}
setIsPlaying(!isPlaying);
}
};
const handleGenerateDubbing = async () => {
if (subtitles.length === 0) return;
setIsDubbingGenerating(true);
// Step 1: Vocal Separation (Scheme A: Server-side AI-like separation)
if (!bgmUrl && videoFile) {
setIsSeparating(true);
try {
const formData = new FormData();
formData.append('video', videoFile);
const response = await axios.post('/api/separate-vocal', formData, {
headers: {
'Content-Type': 'multipart/form-data'
}
});
const { instrumental } = response.data;
const blob = await (await fetch(`data:audio/mp3;base64,${instrumental}`)).blob();
const url = URL.createObjectURL(blob);
setBgmUrl(url);
setBgmBase64(instrumental);
console.log("Vocal separation completed (Instrumental extracted via Scheme A)");
} catch (err) {
console.error("Vocal separation failed:", err);
} finally {
setIsSeparating(false);
}
}
// Step 2: TTS Generation
const toGenerate = subtitles.filter(s => !s.audioUrl).map(s => s.id);
setGeneratingAudioIds(new Set(toGenerate));
try {
const updatedSubtitles = [...subtitles];
for (let i = 0; i < updatedSubtitles.length; i++) {
const sub = updatedSubtitles[i];
if (sub.audioUrl) continue;
try {
const textToSpeak = sub.translatedText || sub.text;
const audioUrl = await generateTTS(textToSpeak, sub.voiceId);
updatedSubtitles[i] = { ...sub, audioUrl };
// Update state incrementally so user sees progress
setSubtitles([...updatedSubtitles]);
} catch (err) {
console.error(`Failed to generate TTS for subtitle ${sub.id}:`, err);
} finally {
setGeneratingAudioIds(prev => {
const next = new Set(prev);
next.delete(sub.id);
return next;
});
}
}
} catch (err) {
console.error("Failed to generate dubbing:", err);
} finally {
setIsDubbingGenerating(false);
setGeneratingAudioIds(new Set());
if (subtitles.some(s => s.audioUrl)) {
console.log("Dubbing generation completed.");
}
}
};
const handleTimeUpdate = () => {
if (videoRef.current) {
let time = videoRef.current.currentTime;
if (trimRange) {
if (time < trimRange.start) {
videoRef.current.currentTime = trimRange.start;
time = trimRange.start;
} else if (time >= trimRange.end) {
videoRef.current.pause();
if (bgmRef.current) bgmRef.current.pause();
if (audioRef.current) audioRef.current.pause();
setIsPlaying(false);
videoRef.current.currentTime = trimRange.start;
time = trimRange.start;
}
}
setCurrentTime(time);
const displayTime = trimRange ? time - trimRange.start : time;
// Auto-select active subtitle based on time
const activeSub = subtitles.find(s => displayTime >= s.startTime && displayTime <= s.endTime);
// Sync BGM with video
if (bgmRef.current && videoRef.current && isPlaying) {
if (Math.abs(bgmRef.current.currentTime - videoRef.current.currentTime) > 0.3) {
bgmRef.current.currentTime = videoRef.current.currentTime;
}
}
if (activeSub) {
if (activeSub.id !== activeSubtitleId) {
setActiveSubtitleId(activeSub.id);
}
// Play dubbing if available and not already playing for this sub
if (activeSub.audioUrl && lastPlayedSubId.current !== activeSub.id && isPlaying) {
if (audioRef.current) {
audioRef.current.pause();
}
const audio = new Audio(activeSub.audioUrl);
audio.volume = activeSub.volume !== undefined ? activeSub.volume : 1.0;
audioRef.current = audio;
audio.play().catch(e => console.error("Audio playback failed:", e));
lastPlayedSubId.current = activeSub.id;
}
} else {
lastPlayedSubId.current = null;
}
}
};
useEffect(() => {
if (videoRef.current) {
videoRef.current.volume = bgmUrl ? 0 : 0.3;
}
}, [bgmUrl]);
const handleLoadedMetadata = () => {
if (videoRef.current) {
setDuration(videoRef.current.duration);
videoRef.current.volume = bgmUrl ? 0 : 0.3; // Mute if BGM (instrumental) is present
if (bgmRef.current) {
bgmRef.current.volume = 0.5; // BGM volume
}
if (videoRef.current.videoHeight > 0) {
setVideoAspectRatio(videoRef.current.videoWidth / videoRef.current.videoHeight);
}
if (trimRange) {
videoRef.current.currentTime = trimRange.start;
setCurrentTime(trimRange.start);
}
}
};
const handleTimelineMouseDown = (e: React.MouseEvent, subId: string, type: 'move' | 'resize-left' | 'resize-right') => {
e.stopPropagation();
const sub = subtitles.find(s => s.id === subId);
if (!sub) return;
setDraggingId(subId);
setDragType(type);
setDragStartX(e.clientX);
setInitialSubTimes({ startTime: sub.startTime, endTime: sub.endTime });
setActiveSubtitleId(subId);
};
const handleTimelineMouseMove = useCallback((e: MouseEvent) => {
if (!timelineTrackRef.current) return;
const rect = timelineTrackRef.current.getBoundingClientRect();
const timelineWidth = rect.width - 32; // Subtract padding (1rem each side)
const displayDuration = trimRange ? trimRange.end - trimRange.start : duration;
if (isDraggingPlayhead) {
const deltaX = e.clientX - rect.left - 16; // 1rem padding
const percent = Math.max(0, Math.min(1, deltaX / timelineWidth));
const newTime = percent * displayDuration;
if (videoRef.current) {
videoRef.current.currentTime = newTime + (trimRange?.start || 0);
}
return;
}
if (!draggingId || !initialSubTimes) return;
const deltaX = e.clientX - dragStartX;
const deltaSeconds = (deltaX / timelineWidth) * displayDuration;
setSubtitles(prev => prev.map(sub => {
if (sub.id !== draggingId) return sub;
let newStart = sub.startTime;
let newEnd = sub.endTime;
if (dragType === 'move') {
newStart = Math.max(0, Math.min(displayDuration - (initialSubTimes.endTime - initialSubTimes.startTime), initialSubTimes.startTime + deltaSeconds));
newEnd = newStart + (initialSubTimes.endTime - initialSubTimes.startTime);
} else if (dragType === 'resize-left') {
newStart = Math.max(0, Math.min(initialSubTimes.endTime - 0.2, initialSubTimes.startTime + deltaSeconds));
} else if (dragType === 'resize-right') {
newEnd = Math.max(initialSubTimes.startTime + 0.2, Math.min(displayDuration, initialSubTimes.endTime + deltaSeconds));
}
return { ...sub, startTime: newStart, endTime: newEnd };
}));
}, [draggingId, dragType, dragStartX, initialSubTimes, duration, trimRange, isDraggingPlayhead]);
const handleTimelineMouseUp = useCallback(() => {
setDraggingId(null);
setDragType(null);
setInitialSubTimes(null);
setIsDraggingPlayhead(false);
}, []);
useEffect(() => {
if (draggingId || isDraggingPlayhead) {
window.addEventListener('mousemove', handleTimelineMouseMove);
window.addEventListener('mouseup', handleTimelineMouseUp);
} else {
window.removeEventListener('mousemove', handleTimelineMouseMove);
window.removeEventListener('mouseup', handleTimelineMouseUp);
}
return () => {
window.removeEventListener('mousemove', handleTimelineMouseMove);
window.removeEventListener('mouseup', handleTimelineMouseUp);
};
}, [draggingId, isDraggingPlayhead, handleTimelineMouseMove, handleTimelineMouseUp]);
const formatTime = (seconds: number) => {
if (isNaN(seconds) || seconds < 0) seconds = 0;
const hours = Math.floor(seconds / 3600);
const mins = Math.floor((seconds % 3600) / 60);
const secs = Math.floor(seconds % 60);
return `${hours.toString().padStart(2, '0')}:${mins.toString().padStart(2, '0')}:${secs.toString().padStart(2, '0')}`;
};
const displayDuration = trimRange ? trimRange.end - trimRange.start : duration;
const displayCurrentTime = trimRange ? Math.max(0, currentTime - trimRange.start) : currentTime;
// Calculate playhead position percentage
const playheadPercent = displayDuration > 0 ? (displayCurrentTime / displayDuration) * 100 : 0;
// Get current subtitle text
const currentSubtitleText = subtitles.find(s => displayCurrentTime >= s.startTime && displayCurrentTime <= s.endTime)?.translatedText || '';
return (
<div className="h-[100dvh] flex flex-col bg-white overflow-hidden">
{/* Top Header */}
<header className="h-14 border-b border-gray-200 flex items-center justify-between px-4 shrink-0">
<div className="flex items-center gap-4">
<button onClick={onBack} className="p-2 hover:bg-gray-100 rounded-md">
<ChevronLeft className="w-5 h-5" />
</button>
<div className="flex items-center gap-2">
<div className="w-6 h-6 bg-orange-500 rounded flex items-center justify-center text-white font-bold text-xs">M</div>
<span className="font-medium text-sm">Translate 1.0</span>
</div>
</div>
<div className="flex items-center gap-4">
<button className="p-2 hover:bg-gray-100 rounded-md text-gray-600">
<LayoutTemplate className="w-5 h-5" />
</button>
<button className="p-2 hover:bg-gray-100 rounded-md text-gray-600">
<Type className="w-5 h-5" />
</button>
<div className="h-4 w-px bg-gray-300 mx-2"></div>
<button
onClick={togglePlay}
className="flex items-center gap-2 px-3 py-1.5 bg-red-50 text-red-600 rounded-md text-sm font-medium hover:bg-red-100"
>
<Play className="w-4 h-4 fill-current" />
Watch Video
</button>
<button className="flex items-center gap-2 px-3 py-1.5 text-gray-600 hover:bg-gray-100 rounded-md text-sm font-medium">
<Save className="w-4 h-4" />
Save Editing
</button>
<button
onClick={() => setShowExportModal(true)}
className="flex items-center gap-2 px-4 py-1.5 bg-[#52c41a] text-white rounded-md text-sm font-medium hover:bg-[#46a616]"
>
<Download className="w-4 h-4" />
Export
</button>
</div>
</header>
{/* Main Workspace */}
<div className="flex-1 flex overflow-hidden min-h-0">
{/* Left Sidebar - Subtitles */}
<div className="w-80 border-r border-gray-200 flex flex-col bg-gray-50 shrink-0">
<div className="p-4 border-b border-gray-200 bg-white shrink-0">
<div className="flex bg-gray-100 p-1 rounded-md mb-4">
<button className="flex-1 py-1.5 bg-white shadow-sm rounded text-sm font-medium">AI Dub</button>
<button className="flex-1 py-1.5 text-gray-600 text-sm font-medium">Voice Clone</button>
</div>
<p className="text-xs text-gray-500 mb-4">
Tip: After editing the subtitle text, regenerate the dubbing for the change to take effect.
</p>
<div className="mb-4">
<label htmlFor="llm-provider" className="block text-xs font-medium text-gray-500 mb-1">
LLM
</label>
<select
id="llm-provider"
aria-label="LLM"
value={llmProvider}
onChange={(e) => setLlmProvider(e.target.value as LlmProvider)}
className="w-full border border-gray-300 rounded-md px-3 py-2 text-sm focus:outline-none focus:border-blue-500 bg-white"
>
<option value="doubao">Doubao</option>
<option value="gemini">Gemini</option>
</select>
</div>
<button
onClick={handleGenerateDubbing}
disabled={isDubbingGenerating || subtitles.length === 0}
className="w-full py-2 bg-blue-500 hover:bg-blue-600 disabled:bg-blue-300 text-white rounded-md text-sm font-medium transition-colors flex items-center justify-center gap-2"
>
{isDubbingGenerating && <Loader2 className="w-4 h-4 animate-spin" />}
{isSeparating ? 'Separating Vocals...' : isDubbingGenerating ? 'Generating TTS...' : 'Generate Dubbing'}
</button>
</div>
<div className="flex-1 overflow-y-auto p-2 space-y-2 min-h-0">
{isGenerating ? (
<div className="flex flex-col items-center justify-center h-full text-gray-500 space-y-4">
<Loader2 className="w-8 h-8 animate-spin text-blue-500" />
<p className="text-sm font-medium">AI is analyzing and translating...</p>
<p className="text-xs text-center px-4">This may take a minute depending on the video length.</p>
</div>
) : (
<>
{generationError && (
<div className="p-3 mb-2 text-xs text-red-600 bg-red-50 border border-red-200 rounded-md flex flex-col gap-2">
<p>{generationError}</p>
<button
onClick={() => fetchSubtitles()}
className="px-3 py-1.5 bg-red-100 hover:bg-red-200 text-red-700 rounded font-medium transition-colors self-start"
>
Retry Generation
</button>
</div>
)}
{!generationError && subtitleQuality === 'fallback' && (
<div className="p-3 mb-2 text-xs text-amber-700 bg-amber-50 border border-amber-200 rounded-md">
Low-precision subtitle timing is active for this generation. You can still edit subtitles before dubbing.
</div>
)}
{subtitles.map((sub, index) => (
<div
key={sub.id}
className={`p-3 rounded-lg border transition-all ${
activeSubtitleId === sub.id
? 'border-blue-500 bg-blue-50/50 shadow-sm'
: sub.audioUrl
? 'border-green-200 bg-white'
: 'border-gray-200 bg-white'
} cursor-pointer relative group`}
onClick={() => {
setActiveSubtitleId(sub.id);
if (videoRef.current) {
videoRef.current.currentTime = sub.startTime + (trimRange?.start || 0);
}
}}
>
{sub.audioUrl && (
<div className="absolute -right-1 -top-1 w-4 h-4 bg-green-500 rounded-full flex items-center justify-center text-white shadow-sm">
<Music className="w-2.5 h-2.5" />
</div>
)}
{generatingAudioIds.has(sub.id) && (
<div className="absolute right-2 top-2">
<Loader2 className="w-3 h-3 animate-spin text-blue-500" />
</div>
)}
<div className="flex items-center justify-between mb-2">
<span className="text-xs font-medium text-gray-500">
{index + 1}. {formatTime(sub.startTime)} - {formatTime(sub.endTime)}
</span>
<div className="flex items-center gap-2">
{sub.audioUrl && (
<div className="flex items-center gap-1 bg-gray-100 px-1.5 py-0.5 rounded">
<Volume2 className="w-3 h-3 text-gray-500" />
<input
type="range"
min="0"
max="1"
step="0.1"
value={sub.volume ?? 1.0}
onChange={(e) => {
  const v = parseFloat(e.target.value);
  // Update immutably; mutating the subtitle object inside a shallow
  // array copy would change state in place and can skip re-renders.
  setSubtitles(prev => prev.map(s => s.id === sub.id ? { ...s, volume: v } : s));
}}
className="w-12 h-1 bg-gray-300 rounded-lg appearance-none cursor-pointer accent-blue-500"
onClick={(e) => e.stopPropagation()}
/>
</div>
)}
<button
className="flex items-center gap-1 text-xs text-blue-600 bg-blue-50 px-2 py-1 rounded"
onClick={(e) => {
e.stopPropagation();
setVoiceMarketTargetId(sub.id);
setShowVoiceMarket(true);
}}
>
<div className="w-4 h-4 rounded-full bg-orange-200 overflow-hidden">
<img src={`https://api.dicebear.com/7.x/avataaars/svg?seed=${sub.voiceId}`} alt="avatar" />
</div>
{MINIMAX_VOICES.find(v => v.id === sub.voiceId)?.name || 'Select Voice'}
</button>
</div>
</div>
<textarea
className="w-full text-sm bg-transparent border-none resize-none focus:ring-0 p-0 text-gray-500 mb-2"
rows={2}
value={sub.originalText}
readOnly
/>
<textarea
className="w-full text-sm bg-transparent border-none resize-none focus:ring-0 p-0 text-gray-800 font-medium"
rows={2}
value={sub.translatedText}
onChange={(e) => {
  const text = e.target.value;
  // Update immutably; mutating the subtitle object inside a shallow
  // array copy would change state in place and can skip re-renders.
  setSubtitles(prev => prev.map(s => s.id === sub.id ? { ...s, translatedText: text } : s));
}}
/>
</div>
))}
</>
)}
</div>
</div>
{/* Center - Video Player */}
<div className="flex-1 flex flex-col bg-gray-100 relative min-w-0 min-h-0">
<div className="flex-1 flex items-center justify-center p-8 relative min-h-0">
{/* Video Container */}
<div ref={containerRef} className="relative w-full h-full bg-black rounded-lg overflow-hidden shadow-lg flex items-center justify-center min-h-0">
{videoUrl ? (
<video
ref={videoRef}
src={videoUrl}
className="w-full h-full object-contain"
onTimeUpdate={handleTimeUpdate}
onLoadedMetadata={handleLoadedMetadata}
onEnded={() => {
setIsPlaying(false);
if (bgmRef.current) bgmRef.current.pause();
if (audioRef.current) audioRef.current.pause();
}}
onClick={togglePlay}
playsInline
preload="metadata"
/>
) : (
<div className="text-gray-500 flex flex-col items-center justify-center h-full w-full">
<ImageIcon className="w-12 h-12 mb-2 opacity-50" />
<p>No video loaded</p>
</div>
)}
{/* BGM Audio Element */}
{bgmUrl && (
<audio
ref={bgmRef}
src={bgmUrl}
className="hidden"
onEnded={() => {
if (videoRef.current) videoRef.current.pause();
setIsPlaying(false);
}}
/>
)}
{/* Subtitle Overlay */}
{currentSubtitleText && (
<div
className="absolute bottom-[10%] left-1/2 -translate-x-1/2 flex justify-center pointer-events-none px-4"
style={{ width: renderedVideoWidth }}
>
<p
className="text-white text-base md:text-lg font-bold drop-shadow-md break-words whitespace-pre-wrap text-center max-w-[90%]"
style={{
fontFamily: textStyles.fontFamily,
color: textStyles.color,
textAlign: textStyles.alignment,
fontWeight: textStyles.isBold ? 'bold' : 'normal',
fontStyle: textStyles.isItalic ? 'italic' : 'normal',
textDecoration: textStyles.isUnderline ? 'underline' : 'none',
}}
>
{currentSubtitleText}
</p>
</div>
)}
</div>
</div>
{/* Player Controls */}
<div className="h-12 bg-white border-t border-gray-200 flex items-center justify-between px-4 shrink-0">
<div className="flex items-center gap-4">
<button className="p-1.5 hover:bg-gray-100 rounded-full" onClick={togglePlay}>
{isPlaying ? <Pause className="w-5 h-5" /> : <Play className="w-5 h-5" />}
</button>
<span className="text-sm font-medium font-mono">
{formatTime(displayCurrentTime)} / {formatTime(displayDuration)}
</span>
</div>
<div className="flex items-center gap-4">
{bgmUrl && (
<div className="flex items-center gap-2 bg-gray-100 px-2 py-1 rounded-md">
<Music className="w-4 h-4 text-green-600" />
<span className="text-xs font-medium text-gray-600">BGM</span>
<input
type="range"
min="0"
max="1"
step="0.1"
defaultValue="0.5"
onChange={(e) => {
if (bgmRef.current) bgmRef.current.volume = parseFloat(e.target.value);
}}
className="w-16 h-1 bg-gray-300 rounded-lg appearance-none cursor-pointer accent-green-500"
/>
</div>
)}
<Maximize2 className="w-4 h-4 text-gray-500" />
</div>
</div>
</div>
{/* Right Sidebar - Properties */}
<div className="w-72 border-l border-gray-200 bg-white flex flex-col shrink-0 overflow-y-auto">
<div className="p-4 border-b border-gray-200">
<label className="flex items-center gap-2 text-sm text-gray-700 cursor-pointer">
<input type="checkbox" className="rounded text-blue-600 focus:ring-blue-500" defaultChecked />
Apply to all subtitles
</label>
</div>
<div className="p-4 border-b border-gray-200">
<div className="flex items-center gap-2 mb-3">
<div className="w-6 h-6 rounded-full bg-green-100 flex items-center justify-center text-green-600 text-xs font-bold">
W
</div>
<span className="text-sm font-medium text-gray-700">Wife</span>
</div>
<input
type="text"
value="Husband"
className="w-full border border-gray-300 rounded-md px-3 py-1.5 text-sm focus:outline-none focus:border-blue-500"
readOnly
/>
</div>
<div className="p-4">
<h3 className="text-sm font-medium text-gray-700 mb-3">Text Styles</h3>
{/* Style Presets Grid */}
<div className="grid grid-cols-4 gap-2 mb-6">
{Array.from({ length: 12 }).map((_, i) => (
  <button key={i} className={`aspect-square rounded border flex items-center justify-center text-lg font-bold ${i === 8 ? 'border-blue-500 bg-blue-50 text-blue-600' : 'border-gray-200 bg-gray-50 text-gray-700 hover:bg-gray-100'}`}>
    T
  </button>
))}
</div>
{/* Font Family */}
<div className="mb-4">
<select
className="w-full border border-gray-300 rounded-md px-3 py-2 text-sm focus:outline-none focus:border-blue-500"
value={textStyles.fontFamily}
onChange={(e) => setTextStyles({...textStyles, fontFamily: e.target.value})}
>
<option value="MiSans-Late">MiSans-Late</option>
<option value="Arial">Arial</option>
<option value="Roboto">Roboto</option>
<option value="serif">Serif</option>
</select>
</div>
{/* Font Size & Alignment */}
<div className="flex gap-2 mb-4">
<select className="flex-1 border border-gray-300 rounded-md px-3 py-2 text-sm focus:outline-none focus:border-blue-500">
<option>Normal</option>
<option>Large</option>
<option>Small</option>
</select>
<div className="flex border border-gray-300 rounded-md overflow-hidden">
<button
className={`px-3 py-2 border-r border-gray-300 font-bold ${textStyles.isBold ? 'bg-gray-200' : 'bg-gray-50 hover:bg-gray-100'}`}
onClick={() => setTextStyles({...textStyles, isBold: !textStyles.isBold})}
>B</button>
<button
className={`px-3 py-2 border-r border-gray-300 italic ${textStyles.isItalic ? 'bg-gray-200' : 'bg-gray-50 hover:bg-gray-100'}`}
onClick={() => setTextStyles({...textStyles, isItalic: !textStyles.isItalic})}
>I</button>
<button
className={`px-3 py-2 underline ${textStyles.isUnderline ? 'bg-gray-200' : 'bg-gray-50 hover:bg-gray-100'}`}
onClick={() => setTextStyles({...textStyles, isUnderline: !textStyles.isUnderline})}
>U</button>
</div>
</div>
{/* Colors */}
<div className="space-y-3">
<div className="flex items-center justify-between">
<span className="text-sm text-gray-600">Color</span>
<div className="flex items-center gap-2">
<input
type="color"
value={textStyles.color}
onChange={(e) => setTextStyles({...textStyles, color: e.target.value})}
className="w-6 h-6 rounded border border-gray-300 p-0 cursor-pointer"
/>
<span className="text-sm text-gray-500">100%</span>
</div>
</div>
<div className="flex items-center justify-between">
<span className="text-sm text-gray-600">Stroke</span>
<div className="flex items-center gap-2">
<div className="w-6 h-6 rounded border border-gray-300 bg-black"></div>
<span className="text-sm text-gray-500">100%</span>
</div>
</div>
</div>
</div>
</div>
</div>
{/* Bottom Timeline */}
<div className="h-48 border-t border-gray-200 bg-white flex flex-col shrink-0">
{/* Timeline Toolbar */}
<div className="h-10 border-b border-gray-100 flex items-center px-4 gap-4">
<button className="p-1.5 hover:bg-gray-100 rounded text-gray-600"><Scissors className="w-4 h-4" /></button>
<button className="p-1.5 hover:bg-gray-100 rounded text-gray-600"><Plus className="w-4 h-4" /></button>
<button className="p-1.5 hover:bg-gray-100 rounded text-gray-600"><Trash2 className="w-4 h-4" /></button>
<div className="h-4 w-px bg-gray-300 mx-2"></div>
<span className="text-xs text-orange-500 bg-orange-50 px-2 py-1 rounded border border-orange-200">
Stretch the dubbing to control the speed
</span>
<div className="flex-1"></div>
{/* Zoom slider placeholder */}
<div className="w-32 h-1 bg-gray-200 rounded-full relative">
<div className="absolute left-1/2 top-1/2 -translate-y-1/2 w-3 h-3 bg-white border border-gray-400 rounded-full"></div>
</div>
</div>
{/* Timeline Tracks */}
<div ref={timelineTrackRef} className="flex-1 overflow-x-auto overflow-y-hidden relative custom-scrollbar bg-gray-50">
{/* Time Ruler */}
<div className="h-6 border-b border-gray-200 flex items-end px-4 relative min-w-[1000px] bg-white">
{[0, 20, 40, 60, 80, 100].map(percent => (
<span key={percent} className="absolute text-[10px] text-gray-400" style={{ left: `${percent}%` }}>
{formatTime((displayDuration * percent) / 100)}
</span>
))}
</div>
{/* Video Track */}
<div className="h-12 border-b border-gray-200 flex items-center px-4 min-w-[1000px] relative">
<div className="absolute left-4 right-4 h-10 bg-blue-50 rounded overflow-hidden flex border border-blue-100 items-center justify-center text-xs text-blue-400">
Video Track
</div>
</div>
{/* Subtitle Track */}
<div className="h-12 border-b border-gray-200 flex items-center px-4 min-w-[1000px] relative">
{subtitles.map((sub) => {
const leftPercent = (sub.startTime / displayDuration) * 100;
const widthPercent = ((sub.endTime - sub.startTime) / displayDuration) * 100;
return (
<div
key={sub.id}
className={`absolute h-10 rounded flex flex-col justify-start px-2 cursor-pointer border transition-colors select-none overflow-hidden ${
activeSubtitleId === sub.id ? 'bg-[#e6f4ff] border-[#1677ff] z-10' : 'bg-white border-gray-200 hover:border-blue-300'
}`}
style={{ left: `calc(1rem + ${leftPercent}%)`, width: `${widthPercent}%` }}
onClick={() => {
setActiveSubtitleId(sub.id);
if (videoRef.current) videoRef.current.currentTime = sub.startTime + (trimRange?.start || 0);
}}
onMouseDown={(e) => handleTimelineMouseDown(e, sub.id, 'move')}
>
<div
className="absolute left-0 top-0 bottom-0 w-1.5 cursor-ew-resize hover:bg-blue-400/50 z-20"
onMouseDown={(e) => handleTimelineMouseDown(e, sub.id, 'resize-left')}
/>
<div
className="absolute right-0 top-0 bottom-0 w-1.5 cursor-ew-resize hover:bg-blue-400/50 z-20"
onMouseDown={(e) => handleTimelineMouseDown(e, sub.id, 'resize-right')}
/>
<span className="text-[9px] font-bold text-blue-800 truncate pointer-events-none mt-0.5">T {sub.speaker}</span>
<span className="text-[9px] text-blue-600 truncate pointer-events-none leading-tight">{sub.translatedText}</span>
</div>
);
})}
</div>
{/* Audio Track */}
<div className="h-12 flex items-center px-4 min-w-[1000px] relative">
{subtitles.map((sub) => {
if (!sub.audioUrl && !generatingAudioIds.has(sub.id)) return null;
const leftPercent = (sub.startTime / displayDuration) * 100;
const widthPercent = ((sub.endTime - sub.startTime) / displayDuration) * 100;
return (
<div
key={`audio-${sub.id}`}
className={`absolute h-8 border rounded flex items-center justify-center overflow-hidden transition-all ${
generatingAudioIds.has(sub.id)
? 'bg-blue-50 border-blue-200 animate-pulse'
: 'bg-white border-green-200'
}`}
style={{ left: `calc(1rem + ${leftPercent}%)`, width: `${widthPercent}%` }}
>
{generatingAudioIds.has(sub.id) ? (
<Loader2 className="w-3 h-3 animate-spin text-blue-400" />
) : (
<svg width="100%" height="100%" preserveAspectRatio="none" viewBox="0 0 100 100">
<path d="M0,50 Q10,20 20,50 T40,50 T60,50 T80,50 T100,50" stroke="#52c41a" strokeWidth="1" fill="none" />
<path d="M0,50 Q10,80 20,50 T40,50 T60,50 T80,50 T100,50" stroke="#52c41a" strokeWidth="1" fill="none" />
</svg>
)}
</div>
);
})}
</div>
{/* Playhead */}
<div
className="absolute top-0 bottom-0 w-px bg-red-500 z-30 cursor-ew-resize group"
style={{ left: `calc(1rem + ${playheadPercent}%)` }}
onMouseDown={(e) => {
e.stopPropagation();
setIsDraggingPlayhead(true);
}}
>
<div className="absolute -top-1 -translate-x-1/2 w-3 h-3 bg-red-500 rotate-45 shadow-sm"></div>
<div className="absolute top-0 bottom-0 -left-1 -right-1 cursor-ew-resize"></div>
</div>
</div>
</div>
{/* Modals */}
{showVoiceMarket && (
<VoiceMarketModal
onClose={() => setShowVoiceMarket(false)}
onSelect={(voiceId) => {
if (voiceMarketTargetId) {
const newSubs = subtitles.map(s => s.id === voiceMarketTargetId ? { ...s, voiceId, audioUrl: undefined } : s);
setSubtitles(newSubs);
setShowVoiceMarket(false);
}
}}
onSelectAll={(voiceId) => {
const newSubs = subtitles.map(s => ({ ...s, voiceId, audioUrl: undefined }));
setSubtitles(newSubs);
setShowVoiceMarket(false);
}}
/>
)}
{showExportModal && (
<ExportModal
onClose={() => setShowExportModal(false)}
videoFile={videoFile}
subtitles={subtitles}
bgmUrl={bgmUrl}
bgmBase64={bgmBase64}
textStyles={textStyles}
trimRange={trimRange}
/>
)}
</div>
);
}
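The drag handler above clamps subtitle times in three modes (`move`, `resize-left`, `resize-right`), enforcing a 0.2s minimum length and the `[0, displayDuration]` bounds. The same rules as a pure function, for reference only; `clampDrag` is a hypothetical name and the component keeps this logic inline:

```typescript
type DragType = 'move' | 'resize-left' | 'resize-right';

interface Times { startTime: number; endTime: number; }

// Sketch of the clamping rules in handleTimelineMouseMove (not part of the codebase).
function clampDrag(
  initial: Times,
  deltaSeconds: number,
  dragType: DragType,
  displayDuration: number,
  minLength = 0.2 // subtitles may not shrink below 0.2s
): Times {
  const length = initial.endTime - initial.startTime;
  if (dragType === 'move') {
    // Keep the whole block inside [0, displayDuration] while preserving its length.
    const start = Math.max(0, Math.min(displayDuration - length, initial.startTime + deltaSeconds));
    return { startTime: start, endTime: start + length };
  }
  if (dragType === 'resize-left') {
    // The left edge may not pass 0 or come within minLength of the right edge.
    const start = Math.max(0, Math.min(initial.endTime - minLength, initial.startTime + deltaSeconds));
    return { startTime: start, endTime: initial.endTime };
  }
  // resize-right: mirror of resize-left against displayDuration.
  const end = Math.max(initial.startTime + minLength, Math.min(displayDuration, initial.endTime + deltaSeconds));
  return { startTime: initial.startTime, endTime: end };
}
```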

View File

@ -0,0 +1,220 @@
import React, { useState, useEffect } from 'react';
import { X, Download, Loader2, CheckCircle2, AlertCircle } from 'lucide-react';
import axios from 'axios';
import { Subtitle, TextStyles } from '../types';
import { buildExportPayload } from '../lib/exportPayload';
interface ExportModalProps {
onClose: () => void;
videoFile: File | null;
subtitles: Subtitle[];
bgmUrl: string | null;
bgmBase64?: string | null;
textStyles: TextStyles;
trimRange?: { start: number; end: number } | null;
}
export default function ExportModal({ onClose, videoFile, subtitles, bgmUrl, bgmBase64, textStyles, trimRange }: ExportModalProps) {
const [isExporting, setIsExporting] = useState(false);
const [progress, setProgress] = useState(0);
const [isDone, setIsDone] = useState(false);
const [error, setError] = useState<string | null>(null);
const [exportedVideoUrl, setExportedVideoUrl] = useState<string | null>(null);
const [thumbnailUrl, setThumbnailUrl] = useState<string>('');
useEffect(() => {
if (videoFile) {
// Simple thumbnail from video file (first frame)
const url = URL.createObjectURL(videoFile);
setThumbnailUrl(url);
return () => URL.revokeObjectURL(url);
}
}, [videoFile]);
  const handleExport = async () => {
    if (!videoFile) return;
    setIsExporting(true);
    setProgress(10);
    setError(null);
    // Declared outside the try block so the simulated-progress interval is
    // always cleared, even when the request throws.
    let progressInterval: ReturnType<typeof setInterval> | undefined;
    try {
      setProgress(30);
      const exportPayload = buildExportPayload({
        subtitles,
        textStyles,
        bgmBase64: bgmBase64 || null,
        trimRange: trimRange || null,
      });
      // Call the backend to composite the video
      const formData = new FormData();
      formData.append('video', videoFile);
      formData.append('subtitles', JSON.stringify(exportPayload.subtitles));
      formData.append('textStyles', JSON.stringify(exportPayload.textStyles));
      if (exportPayload.bgmBase64) formData.append('bgmBase64', exportPayload.bgmBase64);
      if (exportPayload.trimRange) formData.append('trimRange', JSON.stringify(exportPayload.trimRange));
      // Simulated progress: the backend call reports no progress events.
      progressInterval = setInterval(() => {
        setProgress(p => (p < 90 ? p + 2 : p));
      }, 500);
      const response = await axios.post('/api/export-video', formData, {
        headers: {
          'Content-Type': 'multipart/form-data'
        }
      });
      console.log('Export Response:', response.data);
      setProgress(100);
      if (response.data && response.data.videoUrl) {
        setExportedVideoUrl(response.data.videoUrl);
        setIsDone(true);
      } else {
        console.error('Invalid Export Response:', response.data);
        throw new Error('Export failed: No video URL returned from server');
      }
    } catch (err: any) {
      console.error('Export Error:', err);
      setError(err.response?.data?.error || err.message || 'An error occurred during export');
    } finally {
      if (progressInterval) clearInterval(progressInterval);
      setIsExporting(false);
    }
  };
const handleDownload = () => {
if (exportedVideoUrl) {
const link = document.createElement('a');
link.href = exportedVideoUrl;
link.download = `translated_video_${Date.now()}.mp4`;
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
}
};
return (
<div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
<div className="bg-white rounded-lg w-[600px] flex flex-col shadow-xl overflow-hidden">
{/* Header */}
<div className="flex items-center justify-between p-4 border-b border-gray-100 bg-gray-50/50">
<h2 className="text-lg font-semibold text-gray-800">Export Video</h2>
<button onClick={onClose} className="text-gray-400 hover:text-gray-600">
<X className="w-5 h-5" />
</button>
</div>
{/* Content */}
<div className="p-6 flex gap-6">
{/* Thumbnail */}
<div className="w-40 h-60 bg-gray-100 rounded-md overflow-hidden shrink-0 border border-gray-200 relative">
{thumbnailUrl ? (
<video src={thumbnailUrl} className="w-full h-full object-cover" />
) : (
<div className="w-full h-full flex items-center justify-center text-gray-400">
No Preview
</div>
)}
<div className="absolute inset-0 bg-black/20 flex items-center justify-center">
<CheckCircle2 className={`w-12 h-12 text-white opacity-0 transition-opacity ${isDone ? 'opacity-100' : ''}`} />
</div>
</div>
{/* Form */}
<div className="flex-1 space-y-4">
<div>
<label className="block text-xs text-gray-500 mb-1">File Name</label>
<input
type="text"
value={`${videoFile?.name?.replace(/\.[^/.]+$/, "") ?? ""}_translated`}
className="w-full border border-gray-300 rounded px-3 py-2 text-sm bg-gray-50 text-gray-600"
readOnly
/>
</div>
<div className="grid grid-cols-2 gap-4">
<div>
<label className="block text-xs text-gray-500 mb-1">Format</label>
<input type="text" value="MP4 (H.264)" className="w-full border border-gray-300 rounded px-3 py-2 text-sm bg-gray-50 text-gray-600" readOnly />
</div>
<div>
<label className="block text-xs text-gray-500 mb-1">Resolution</label>
<select className="w-full border border-gray-300 rounded px-3 py-2 text-sm focus:outline-none focus:border-blue-500">
<option>Original</option>
<option>1080P</option>
<option>720P</option>
</select>
</div>
</div>
{error && (
<div className="p-3 bg-red-50 border border-red-100 rounded-md flex items-start gap-2 text-red-600 text-xs">
<AlertCircle className="w-4 h-4 shrink-0" />
<p>{error}</p>
</div>
)}
{isExporting && (
<div className="space-y-2">
<div className="flex justify-between text-xs">
<span className="text-blue-600 font-medium">Exporting...</span>
<span className="text-blue-600 font-medium">{progress}%</span>
</div>
<div className="w-full h-2 bg-gray-100 rounded-full overflow-hidden">
<div
className="h-full bg-blue-500 transition-all duration-300"
style={{ width: `${progress}%` }}
/>
</div>
</div>
)}
</div>
</div>
{/* Footer */}
<div className="p-4 border-t border-gray-100 bg-gray-50/50 flex flex-col items-end gap-3">
<p className="text-xs text-gray-500 w-full text-left">
Compositing video with subtitles and AI dubbing. This may take a few minutes depending on the video length.
</p>
<div className="flex gap-3">
<button
onClick={onClose}
className="px-4 py-2 text-sm text-gray-500 hover:text-gray-700 font-medium"
disabled={isExporting}
>
{isDone ? 'Close' : 'Cancel'}
</button>
{isDone ? (
<button
onClick={handleDownload}
className="bg-[#52c41a] text-white px-6 py-2 rounded-md text-sm font-medium flex items-center gap-2 hover:bg-[#46a616] transition-colors shadow-sm"
>
<Download className="w-4 h-4" />
Download Video
</button>
) : (
<button
onClick={handleExport}
disabled={isExporting}
className={`bg-blue-600 text-white px-8 py-2 rounded-md text-sm font-medium transition-colors shadow-sm ${isExporting ? 'opacity-50 cursor-not-allowed' : 'hover:bg-blue-700'}`}
>
{isExporting ? (
<span className="flex items-center gap-2">
<Loader2 className="w-4 h-4 animate-spin" />
Processing...
</span>
) : 'Start Export'}
</button>
)}
</div>
</div>
</div>
</div>
);
}
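The File Name field in ExportModal derives its read-only default by stripping the final extension from the uploaded file's name and appending `_translated`. A minimal sketch of that derivation; `buildExportName` is a hypothetical helper, not part of the codebase:

```typescript
// Hypothetical helper mirroring the File Name default in ExportModal:
// strip the final extension, then append the "_translated" suffix.
// Falls back to an empty base when no file is loaded.
function buildExportName(fileName: string | undefined): string {
  const base = (fileName ?? '').replace(/\.[^/.]+$/, '');
  return `${base}_translated`;
}
```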

View File

@ -0,0 +1,268 @@
import React, { useState, useRef, useEffect, useMemo } from 'react';
import { X, Play, Pause } from 'lucide-react';
interface TrimModalProps {
file: File;
onClose: () => void;
onConfirm: (file: File, startTime: number, endTime: number) => void;
}
export default function TrimModal({ file, onClose, onConfirm }: TrimModalProps) {
const videoRef = useRef<HTMLVideoElement>(null);
const timelineRef = useRef<HTMLDivElement>(null);
const [duration, setDuration] = useState(0);
const [startTime, setStartTime] = useState(0);
const [endTime, setEndTime] = useState(0);
const [currentTime, setCurrentTime] = useState(0);
const [isPlaying, setIsPlaying] = useState(false);
const [isDragging, setIsDragging] = useState<'start' | 'end' | null>(null);
const [videoUrl, setVideoUrl] = useState<string>('');
useEffect(() => {
if (!file) return;
const url = URL.createObjectURL(file);
setVideoUrl(url);
return () => {
URL.revokeObjectURL(url);
};
}, [file]);
const handleLoadedMetadata = () => {
if (videoRef.current) {
const d = videoRef.current.duration;
setDuration(d);
// Default to full video duration
setEndTime(d);
}
};
const handleTimeUpdate = () => {
if (videoRef.current) {
const time = videoRef.current.currentTime;
setCurrentTime(time);
// Pause and reset to the trim start once playback reaches the end of the trimmed section
if (time >= endTime && isPlaying) {
videoRef.current.pause();
setIsPlaying(false);
videoRef.current.currentTime = startTime;
}
}
};
const togglePlay = () => {
if (videoRef.current) {
if (isPlaying) {
videoRef.current.pause();
} else {
// If we are at or past the end time, restart from start time
if (currentTime >= endTime || currentTime < startTime) {
videoRef.current.currentTime = startTime;
}
videoRef.current.play();
}
setIsPlaying(!isPlaying);
}
};
// Handle dragging of trim handles
useEffect(() => {
const handleMouseMove = (e: MouseEvent) => {
if (!isDragging || !timelineRef.current || duration === 0) return;
const rect = timelineRef.current.getBoundingClientRect();
let percent = (e.clientX - rect.left) / rect.width;
percent = Math.max(0, Math.min(1, percent));
const newTime = percent * duration;
if (isDragging === 'start') {
const newStart = Math.min(newTime, endTime - 1); // Maintain at least 1s gap
setStartTime(newStart);
if (videoRef.current) {
videoRef.current.currentTime = newStart;
setCurrentTime(newStart);
}
} else if (isDragging === 'end') {
const newEnd = Math.max(newTime, startTime + 1);
setEndTime(newEnd);
if (videoRef.current) {
videoRef.current.currentTime = newEnd;
setCurrentTime(newEnd);
}
}
};
const handleMouseUp = () => {
setIsDragging(null);
};
if (isDragging) {
window.addEventListener('mousemove', handleMouseMove);
window.addEventListener('mouseup', handleMouseUp);
}
return () => {
window.removeEventListener('mousemove', handleMouseMove);
window.removeEventListener('mouseup', handleMouseUp);
};
}, [isDragging, duration, startTime, endTime]);
  const formatTime = (seconds: number) => {
    if (isNaN(seconds) || seconds < 0) seconds = 0;
    // Split into HH:MM:SS; the previous version hardcoded "00:" for hours,
    // which mislabeled times past 59 minutes.
    const hrs = Math.floor(seconds / 3600);
    const mins = Math.floor((seconds % 3600) / 60);
    const secs = Math.floor(seconds % 60);
    return `${hrs.toString().padStart(2, '0')}:${mins.toString().padStart(2, '0')}:${secs.toString().padStart(2, '0')}`;
  };
const startPercent = duration > 0 ? (startTime / duration) * 100 : 0;
const endPercent = duration > 0 ? (endTime / duration) * 100 : 100;
const currentPercent = duration > 0 ? (currentTime / duration) * 100 : 0;
const trimDuration = endTime - startTime;
return (
<div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
<div className="bg-white rounded-lg w-[800px] max-w-[90vw] flex flex-col overflow-hidden shadow-2xl">
{/* Header */}
<div className="flex justify-end p-4">
<button onClick={onClose} className="text-gray-400 hover:text-gray-600 transition-colors">
<X className="w-5 h-5" />
</button>
</div>
{/* Content */}
<div className="p-6 flex flex-col items-center">
{/* Video Preview */}
<div className="w-full max-w-[600px] h-[350px] bg-black rounded-md mb-4 relative overflow-hidden flex items-center justify-center shadow-inner">
{videoUrl && (
<video
ref={videoRef}
src={videoUrl}
className="w-full h-full object-contain"
onLoadedMetadata={handleLoadedMetadata}
onTimeUpdate={handleTimeUpdate}
onClick={togglePlay}
playsInline
preload="metadata"
/>
)}
<div className="absolute top-2 left-1/2 -translate-x-1/2 bg-black/60 text-white text-xs px-2 py-1 rounded backdrop-blur-sm">
{formatTime(currentTime)}
</div>
</div>
{/* Controls */}
<div className="flex items-center gap-4 mb-8">
<span className="text-sm font-medium text-gray-600">{formatTime(startTime)}</span>
<button
onClick={togglePlay}
className="w-10 h-10 rounded-full border border-gray-300 flex items-center justify-center hover:bg-gray-50 transition-colors text-blue-600"
>
{isPlaying ? <Pause className="w-4 h-4" /> : <Play className="w-4 h-4 ml-1" />}
</button>
<span className="text-sm font-medium text-gray-600">{formatTime(endTime)}</span>
</div>
{/* Timeline */}
<div
ref={timelineRef}
className="w-full h-24 bg-gray-50 rounded-lg relative border border-gray-200 overflow-hidden mb-6 select-none shadow-sm"
>
{/* Time markers */}
<div className="absolute top-0 left-0 w-full flex justify-between px-4 text-[9px] text-gray-500 pt-1 pointer-events-none font-mono">
{[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1].map(p => (
<span key={p}>{formatTime(duration * p)}</span>
))}
</div>
{/* Ticks */}
<div className="absolute top-4 left-0 w-full flex justify-between px-4 pointer-events-none h-4 items-end">
{Array.from({length: 60}).map((_, i) => (
<div key={i} className={`w-px ${i % 6 === 0 ? 'h-3 bg-green-500' : 'h-1.5 bg-gray-300'}`}></div>
))}
</div>
{/* Video Track Background (Simulated Thumbnails) */}
<div className="absolute bottom-2 left-4 right-4 h-14 flex opacity-40 pointer-events-none bg-gray-200 rounded overflow-hidden">
{Array.from({length: 15}).map((_, i) => (
<div key={i} className="flex-1 border-r border-white/20 relative">
<div className="absolute inset-1 bg-gray-400 rounded-sm"></div>
<div className="absolute inset-0 flex items-center justify-center">
<div className="w-4 h-4 rounded-full bg-white/10"></div>
</div>
</div>
))}
</div>
{/* Unselected Overlay (Left); the 0.32 factor rescales percentages to account for the 32px (2 x 1rem) of horizontal track padding */}
<div
className="absolute bottom-2 top-10 left-4 bg-black/40 backdrop-blur-[1px] z-10 pointer-events-none"
style={{ width: `calc(${startPercent}% - ${startPercent * 0.32}px)` }}
/>
{/* Unselected Overlay (Right) */}
<div
className="absolute bottom-2 top-10 right-4 bg-black/40 backdrop-blur-[1px] z-10 pointer-events-none"
style={{ width: `calc(${100 - endPercent}% - ${(100 - endPercent) * 0.32}px)` }}
/>
{/* Selection Box */}
<div
className="absolute bottom-2 top-10 border-2 border-white shadow-[0_0_15px_rgba(0,0,0,0.3)] z-20 rounded-sm"
style={{ left: `calc(1rem + ${startPercent}% - ${startPercent * 0.32}px)`, right: `calc(1rem + ${100 - endPercent}% - ${(100 - endPercent) * 0.32}px)` }}
>
{/* Left Handle */}
<div
className="absolute left-0 top-0 bottom-0 w-4 bg-white cursor-ew-resize flex flex-col items-center justify-center -translate-x-full rounded-l-md border-r border-gray-200 shadow-md"
onMouseDown={() => setIsDragging('start')}
>
<div className="flex flex-col gap-0.5">
<div className="w-1 h-1 bg-gray-400 rounded-full"></div>
<div className="w-1 h-1 bg-gray-400 rounded-full"></div>
<div className="w-1 h-1 bg-gray-400 rounded-full"></div>
</div>
{/* Red Marker at top */}
<div className="absolute -top-2 left-1/2 -translate-x-1/2 w-0 h-0 border-l-[6px] border-l-transparent border-r-[6px] border-r-transparent border-t-[8px] border-t-red-500"></div>
</div>
{/* Right Handle */}
<div
className="absolute right-0 top-0 bottom-0 w-4 bg-white cursor-ew-resize flex flex-col items-center justify-center translate-x-full rounded-r-md border-l border-gray-200 shadow-md"
onMouseDown={() => setIsDragging('end')}
>
<div className="flex flex-col gap-0.5">
<div className="w-1 h-1 bg-gray-400 rounded-full"></div>
<div className="w-1 h-1 bg-gray-400 rounded-full"></div>
<div className="w-1 h-1 bg-gray-400 rounded-full"></div>
</div>
</div>
</div>
{/* Playhead */}
<div
className="absolute top-8 bottom-2 w-0.5 bg-red-500 z-30 pointer-events-none"
style={{ left: `calc(1rem + ${currentPercent}% - ${currentPercent * 0.32}px)` }}
>
<div className="absolute -top-1 -translate-x-1/2 w-0 h-0 border-l-[4px] border-l-transparent border-r-[4px] border-r-transparent border-t-[6px] border-t-red-500"></div>
</div>
</div>
{/* Footer */}
<div className="w-full flex items-center justify-between border-t border-gray-100 pt-4">
<div className="flex items-center gap-4">
<span className="text-sm text-gray-600">
Selected duration: <strong className="text-gray-800">{Math.round(trimDuration)}s</strong>
</span>
</div>
<button
onClick={() => onConfirm(file, startTime, endTime)}
className="px-6 py-2 rounded-md text-sm font-medium transition-colors border border-gray-300 text-gray-700 hover:bg-gray-50"
>
Confirm
</button>
</div>
</div>
</div>
</div>
);
}
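The `calc(1rem + p% - 0.32p px)` expressions above mix a percentage with a pixel correction for the track's 1rem side padding. A minimal sketch, assuming a hypothetical container width, showing why the CSS form lands on the same pixel as measuring p% along the padded track:

```typescript
// The timeline track is inset 1rem (16px) on each side, so the usable track
// width is W - 32. A handle at p% of the track sits at:
//   16 + (W - 32) * p / 100
// which expands to the CSS form used above: 16 + W*p/100 - 0.32*p.
const trackPosition = (containerWidth: number, percent: number) =>
  16 + ((containerWidth - 32) * percent) / 100;

const cssPosition = (containerWidth: number, percent: number) =>
  16 + (containerWidth * percent) / 100 - 0.32 * percent; // calc(1rem + p% - 0.32p px)

// With a hypothetical 1000px-wide container, both forms agree:
console.log(trackPosition(1000, 50)); // 500
console.log(cssPosition(1000, 50));   // 500
```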


@ -0,0 +1,173 @@
import React, { useState } from 'react';
import { Upload, CheckCircle2, Circle } from 'lucide-react';
import TrimModal from './TrimModal';
const LANGUAGES = [
{ code: 'ar', name: 'Arabic', group: 'A' },
{ code: 'zh', name: 'Chinese', group: 'C' },
{ code: 'yue', name: 'Cantonese', group: 'C' },
{ code: 'cs', name: 'Czech', group: 'C' },
{ code: 'zh-TW', name: 'Traditional Chinese', group: 'T' },
{ code: 'th', name: 'Thai', group: 'T' },
{ code: 'tr', name: 'Turkish', group: 'T' },
{ code: 'nl', name: 'Dutch', group: 'D' },
{ code: 'en', name: 'English', group: 'E' },
{ code: 'ru', name: 'Russian', group: 'R' },
{ code: 'ro', name: 'Romanian', group: 'R' },
{ code: 'ja', name: 'Japanese', group: 'J' },
{ code: 'ko', name: 'Korean', group: 'K' },
{ code: 'ms', name: 'Malay', group: 'M' },
{ code: 'fr', name: 'French', group: 'F' },
];
export default function UploadScreen({ onUpload }: { onUpload: (file: File, lang: string, startTime?: number, endTime?: number) => void }) {
const [mode, setMode] = useState<'editing' | 'simple'>('editing');
const [selectedLang, setSelectedLang] = useState('en');
const [showTrimModal, setShowTrimModal] = useState(false);
const [tempFile, setTempFile] = useState<File | null>(null);
const handleFileChange = (e: React.ChangeEvent<HTMLInputElement>) => {
if (e.target.files && e.target.files[0]) {
setTempFile(e.target.files[0]);
setShowTrimModal(true);
}
};
const handleTrimConfirm = (file: File, startTime: number, endTime: number) => {
setShowTrimModal(false);
const langName = LANGUAGES.find(l => l.code === selectedLang)?.name || 'English';
onUpload(file, langName, startTime, endTime);
};
return (
<div className="max-w-6xl mx-auto p-8 flex gap-8 h-screen items-center">
{/* Left: Upload Area */}
<div className="flex-1 bg-white rounded-lg shadow-sm border border-gray-200 p-8 flex flex-col items-center justify-center min-h-[400px]">
<div className="w-full h-full border-2 border-dashed border-gray-300 rounded-lg flex flex-col items-center justify-center bg-gray-50 relative">
<input
type="file"
className="absolute inset-0 w-full h-full opacity-0 cursor-pointer"
accept="video/mp4,video/quicktime,video/webm"
onChange={handleFileChange}
/>
<Upload className="w-16 h-16 text-gray-400 mb-4" />
<p className="text-gray-600 mb-6">Click to upload or drag files here</p>
<button className="bg-[#52c41a] hover:bg-[#46a616] text-white px-8 py-3 rounded-md font-medium flex items-center gap-2 transition-colors w-full max-w-md justify-center pointer-events-none">
<Upload className="w-5 h-5" />
Upload Video
</button>
</div>
<p className="text-sm text-gray-500 mt-4 w-full text-left">
Supported formats: MP4/MOV/WEBM. Maximum file size is 500MB.
</p>
</div>
{/* Right: Settings */}
<div className="w-[400px] flex flex-col gap-6">
{/* Mode Selection */}
<div className="flex gap-4">
<div
className={`flex-1 p-4 rounded-lg border-2 cursor-pointer transition-colors ${
mode === 'editing' ? 'border-blue-500 bg-blue-50' : 'border-gray-200 hover:border-blue-200'
}`}
onClick={() => setMode('editing')}
>
<div className="flex items-center gap-2 mb-1">
{mode === 'editing' ? (
<CheckCircle2 className="w-5 h-5 text-blue-500" />
) : (
<Circle className="w-5 h-5 text-gray-300" />
)}
<span className="font-semibold text-gray-800">Editing Mode</span>
</div>
<p className="text-xs text-gray-500 ml-7">Supports further editing and more precise translation</p>
</div>
<div
className={`flex-1 p-4 rounded-lg border-2 cursor-pointer transition-colors ${
mode === 'simple' ? 'border-blue-500 bg-blue-50' : 'border-gray-200 hover:border-blue-200'
}`}
onClick={() => setMode('simple')}
>
<div className="flex items-center gap-2 mb-1">
{mode === 'simple' ? (
<CheckCircle2 className="w-5 h-5 text-blue-500" />
) : (
<Circle className="w-5 h-5 text-gray-300" />
)}
<span className="font-semibold text-gray-800">Simple Mode</span>
</div>
<p className="text-xs text-gray-500 ml-7">One-click video translation for beginners</p>
</div>
</div>
{/* Language Selection */}
<div className="bg-white rounded-lg shadow-sm border border-gray-200 p-6 flex-1 flex flex-col">
<h3 className="font-semibold text-gray-800 mb-1">Select Translation Language</h3>
<p className="text-xs text-gray-500 mb-4">The AI detects the video's source language automatically.</p>
{/* Alphabet Tabs */}
<div className="flex gap-4 border-b border-gray-100 pb-2 mb-4 text-sm text-gray-500 overflow-x-auto">
<button className="font-medium text-blue-600 border-b-2 border-blue-600 pb-2 -mb-[9px]">Popular</button>
<button className="hover:text-gray-800">ABC</button>
<button className="hover:text-gray-800">DEF</button>
<button className="hover:text-gray-800">GHI</button>
<button className="hover:text-gray-800">JKL</button>
<button className="hover:text-gray-800">MN</button>
<button className="hover:text-gray-800">OPQ</button>
<button className="hover:text-gray-800">RST</button>
<button className="hover:text-gray-800">UVW</button>
<button className="hover:text-gray-800">XYZ</button>
</div>
{/* Language List */}
<div className="flex-1 overflow-y-auto pr-2 custom-scrollbar">
{['A', 'C', 'T', 'D', 'E', 'R', 'J', 'K', 'M', 'F'].map((letter) => (
<div key={letter} className="flex border-b border-gray-100 py-3 last:border-0">
<div className="w-8 text-green-600 font-medium">{letter}</div>
<div className="flex flex-wrap gap-x-6 gap-y-2 flex-1">
{LANGUAGES.filter((l) => l.group === letter).map((lang) => (
<button
key={lang.code}
className={`text-sm hover:text-blue-600 transition-colors ${
selectedLang === lang.code
? 'bg-green-600 text-white px-2 py-0.5 rounded'
: lang.code === 'zh' || lang.code === 'yue' || lang.code === 'ja' || lang.code === 'ko'
? 'text-orange-500'
: 'text-gray-700'
}`}
onClick={() => setSelectedLang(lang.code)}
>
{lang.name}
</button>
))}
</div>
</div>
))}
</div>
<button
className={`w-full py-3 rounded-md font-medium mt-4 transition-colors ${
tempFile
? 'bg-[#52c41a] hover:bg-[#46a616] text-white'
: 'bg-gray-200 text-gray-400 cursor-not-allowed'
}`}
onClick={() => {
if (tempFile) setShowTrimModal(true);
}}
disabled={!tempFile}
>
Generate Translated Video
</button>
</div>
</div>
{showTrimModal && tempFile && (
<TrimModal
file={tempFile}
onClose={() => setShowTrimModal(false)}
onConfirm={handleTrimConfirm}
/>
)}
</div>
);
}

View File

@ -0,0 +1,122 @@
import React, { useState } from 'react';
import { X, Play, Search } from 'lucide-react';
import { MINIMAX_VOICES } from '../voices';
export default function VoiceMarketModal({
onClose,
onSelect,
onSelectAll
}: {
onClose: () => void;
onSelect: (voiceId: string) => void;
onSelectAll: (voiceId: string) => void;
}) {
const [searchQuery, setSearchQuery] = useState('');
const [selectedLanguage, setSelectedLanguage] = useState<string>('all');
const [selectedGender, setSelectedGender] = useState<string>('all');
const filteredVoices = MINIMAX_VOICES.filter(voice => {
const matchesSearch = voice.name.toLowerCase().includes(searchQuery.toLowerCase()) ||
voice.tag.toLowerCase().includes(searchQuery.toLowerCase());
const matchesLanguage = selectedLanguage === 'all' || voice.language === selectedLanguage;
const matchesGender = selectedGender === 'all' || voice.gender === selectedGender;
return matchesSearch && matchesLanguage && matchesGender;
});
const languages = Array.from(new Set(MINIMAX_VOICES.map(v => v.language)));
return (
<div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
<div className="bg-white rounded-lg w-[1000px] max-w-[95vw] h-[85vh] flex flex-col shadow-xl overflow-hidden">
{/* Header */}
<div className="flex items-center justify-between p-4 border-b border-gray-100 shrink-0">
<div className="flex items-center gap-4">
<h2 className="text-lg font-semibold text-gray-800">Voice Market</h2>
<div className="flex items-center gap-2 bg-gray-100 px-3 py-1.5 rounded-md">
<Search className="w-4 h-4 text-gray-400" />
<input
type="text"
placeholder="Search voices..."
className="bg-transparent border-none focus:ring-0 text-sm w-48"
value={searchQuery}
onChange={(e) => setSearchQuery(e.target.value)}
/>
</div>
<select
className="text-sm border-gray-200 rounded-md py-1.5"
value={selectedLanguage}
onChange={(e) => setSelectedLanguage(e.target.value)}
>
<option value="all">All Languages</option>
{languages.map(lang => (
<option key={lang} value={lang}>{lang.toUpperCase()}</option>
))}
</select>
<select
className="text-sm border-gray-200 rounded-md py-1.5"
value={selectedGender}
onChange={(e) => setSelectedGender(e.target.value)}
>
<option value="all">All Genders</option>
<option value="male">Male</option>
<option value="female">Female</option>
<option value="neutral">Neutral</option>
</select>
</div>
<button onClick={onClose} className="text-gray-400 hover:text-gray-600">
<X className="w-5 h-5" />
</button>
</div>
{/* Content */}
<div className="flex-1 p-6 overflow-y-auto bg-gray-50/50">
<div className="bg-blue-50 text-blue-800 text-sm p-3 rounded-md mb-6 border border-blue-100">
Tip: Select a voice for the current sentence, or apply it to all sentences in the project.
</div>
<div className="grid grid-cols-4 gap-4">
{filteredVoices.map((voice) => (
<div key={voice.id} className="bg-white border border-gray-200 rounded-lg p-4 flex flex-col items-center hover:shadow-md transition-shadow relative group">
<span className="absolute top-2 left-2 text-[10px] font-bold text-orange-500 bg-orange-50 px-1.5 py-0.5 rounded border border-orange-100">
{voice.tag}
</span>
<div className="w-16 h-16 rounded-full bg-gray-100 mb-3 overflow-hidden relative">
<img src={`https://api.dicebear.com/7.x/avataaars/svg?seed=${voice.id}`} alt={voice.name} />
<button className="absolute inset-0 bg-black/40 flex items-center justify-center opacity-0 group-hover:opacity-100 transition-opacity">
<Play className="w-6 h-6 text-white fill-current" />
</button>
</div>
<h3 className="text-sm font-medium text-gray-800 mb-1 text-center w-full truncate px-2" title={voice.name}>{voice.name}</h3>
<p className="text-[10px] text-gray-400 mb-4 uppercase">{voice.language} | {voice.gender}</p>
<div className="flex gap-2 w-full mt-auto">
<button
onClick={() => onSelect(voice.id)}
className="flex-1 py-1.5 rounded text-xs font-medium border border-[#52c41a] text-[#52c41a] hover:bg-green-50 transition-colors"
>
Choose
</button>
<button
onClick={() => onSelectAll(voice.id)}
className="flex-1 py-1.5 rounded text-xs font-medium border border-gray-300 text-gray-600 hover:bg-gray-50"
>
Apply All
</button>
</div>
</div>
))}
</div>
{filteredVoices.length === 0 && (
<div className="flex flex-col items-center justify-center py-20 text-gray-400">
<Search className="w-12 h-12 mb-2 opacity-20" />
<p>No voices found matching your criteria</p>
</div>
)}
</div>
</div>
</div>
);
}

19
src/index.css Normal file

@ -0,0 +1,19 @@
@import "tailwindcss";
.custom-scrollbar::-webkit-scrollbar {
width: 6px;
height: 6px;
}
.custom-scrollbar::-webkit-scrollbar-track {
background: transparent;
}
.custom-scrollbar::-webkit-scrollbar-thumb {
background-color: #d1d5db;
border-radius: 20px;
}
.custom-scrollbar::-webkit-scrollbar-thumb:hover {
background-color: #9ca3af;
}


@ -0,0 +1,27 @@
import { describe, expect, it } from 'vitest';
import { rebuildSentences } from './sentenceReconstruction';
describe('rebuildSentences', () => {
it('splits sentences when the speaker changes', () => {
const result = rebuildSentences([
{ text: 'Hi', startTime: 0.0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'there', startTime: 0.25, endTime: 0.5, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'no', startTime: 0.55, endTime: 0.7, speakerId: 'spk_1', confidence: 0.9 },
]);
expect(result).toHaveLength(2);
expect(result[0].originalText).toBe('Hi there');
expect(result[1].originalText).toBe('no');
});
it('splits sentences when the pause exceeds the threshold', () => {
const result = rebuildSentences([
{ text: 'Hello', startTime: 0.0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'again', startTime: 0.8, endTime: 1.0, speakerId: 'spk_0', confidence: 0.9 },
]);
expect(result).toHaveLength(2);
expect(result[0].startTime).toBe(0.0);
expect(result[1].startTime).toBe(0.8);
});
});


@ -0,0 +1,84 @@
import { WordTiming } from '../../types';
import { deriveSubtitleBounds } from '../subtitlePipeline';
export interface ReconstructedSentence {
id: string;
speakerId: string;
originalText: string;
startTime: number;
endTime: number;
words: WordTiming[];
confidence: number;
}
export interface SentenceReconstructionOptions {
pauseSplitThreshold: number;
maxSentenceDuration: number;
}
const DEFAULT_OPTIONS: SentenceReconstructionOptions = {
pauseSplitThreshold: 0.45,
maxSentenceDuration: 8,
};
const averageConfidence = (words: WordTiming[]) => {
if (words.length === 0) {
return 0;
}
const total = words.reduce((sum, word) => sum + word.confidence, 0);
return total / words.length;
};
const buildSentence = (words: WordTiming[], index: number): ReconstructedSentence => {
const bounds = deriveSubtitleBounds(words);
return {
id: `sentence-${index + 1}`,
speakerId: words[0]?.speakerId ?? 'unknown',
originalText: words.map((word) => word.text).join(' '),
startTime: bounds.startTime,
endTime: bounds.endTime,
words,
confidence: averageConfidence(words),
};
};
export const rebuildSentences = (
words: WordTiming[],
options: Partial<SentenceReconstructionOptions> = {},
): ReconstructedSentence[] => {
if (words.length === 0) {
return [];
}
const resolvedOptions = { ...DEFAULT_OPTIONS, ...options };
const sentences: ReconstructedSentence[] = [];
let currentSentenceWords: WordTiming[] = [words[0]];
for (let index = 1; index < words.length; index += 1) {
const previousWord = words[index - 1];
const currentWord = words[index];
const currentSentenceStart = currentSentenceWords[0].startTime;
const gap = currentWord.startTime - previousWord.endTime;
const nextDuration = currentWord.endTime - currentSentenceStart;
const shouldSplit =
currentWord.speakerId !== previousWord.speakerId ||
gap > resolvedOptions.pauseSplitThreshold ||
nextDuration > resolvedOptions.maxSentenceDuration;
if (shouldSplit) {
sentences.push(buildSentence(currentSentenceWords, sentences.length));
currentSentenceWords = [currentWord];
continue;
}
currentSentenceWords.push(currentWord);
}
if (currentSentenceWords.length > 0) {
sentences.push(buildSentence(currentSentenceWords, sentences.length));
}
return sentences;
};
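The tests above cover the speaker-change and pause splits; the `maxSentenceDuration` condition can be exercised with a standalone copy of the split predicate (names here are illustrative, not exported by the module):

```typescript
// Mirrors the shouldSplit logic in rebuildSentences: a new sentence starts on a
// speaker change, a long pause, or when the sentence would exceed the max duration.
const shouldSplit = (
  prev: { endTime: number; speakerId: string },
  next: { startTime: number; endTime: number; speakerId: string },
  sentenceStart: number,
  options = { pauseSplitThreshold: 0.45, maxSentenceDuration: 8 },
) =>
  next.speakerId !== prev.speakerId ||
  next.startTime - prev.endTime > options.pauseSplitThreshold ||
  next.endTime - sentenceStart > options.maxSentenceDuration;

// Same speaker and only a 0.1s gap, but the sentence would grow to 8.6s: split anyway.
console.log(shouldSplit(
  { endTime: 7.9, speakerId: 'spk_0' },
  { startTime: 8.0, endTime: 8.6, speakerId: 'spk_0' },
  0,
)); // true
```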


@ -0,0 +1,25 @@
import { describe, expect, it } from 'vitest';
import { assignSpeakerToWord, assignSpeakersToWords } from './speakerAssignment';
describe('speakerAssignment', () => {
it('assigns each word to the speaker segment with the maximum overlap', () => {
const speakerId = assignSpeakerToWord(
{ text: 'hello', startTime: 1.0, endTime: 1.4, speakerId: 'unknown', confidence: 0.95 },
[
{ speakerId: 'spk_0', startTime: 0.8, endTime: 1.1 },
{ speakerId: 'spk_1', startTime: 1.1, endTime: 1.6 },
],
);
expect(speakerId).toBe('spk_1');
});
it('falls back to unknown when a word has no overlapping speaker segments', () => {
const words = assignSpeakersToWords(
[{ text: 'hello', startTime: 1.0, endTime: 1.4, speakerId: 'unknown', confidence: 0.95 }],
[{ speakerId: 'spk_0', startTime: 2.0, endTime: 2.5 }],
);
expect(words[0].speakerId).toBe('unknown');
});
});


@ -0,0 +1,39 @@
import { WordTiming } from '../../types';
export interface SpeakerSegment {
speakerId: string;
startTime: number;
endTime: number;
}
const getOverlap = (
word: Pick<WordTiming, 'startTime' | 'endTime'>,
segment: SpeakerSegment,
) => Math.max(0, Math.min(word.endTime, segment.endTime) - Math.max(word.startTime, segment.startTime));
export const assignSpeakerToWord = (
word: WordTiming,
segments: SpeakerSegment[],
) => {
let bestSpeakerId = 'unknown';
let bestOverlap = 0;
for (const segment of segments) {
const overlap = getOverlap(word, segment);
if (overlap > bestOverlap) {
bestOverlap = overlap;
bestSpeakerId = segment.speakerId;
}
}
return bestSpeakerId;
};
export const assignSpeakersToWords = (
words: WordTiming[],
segments: SpeakerSegment[],
): WordTiming[] =>
words.map((word) => ({
...word,
speakerId: assignSpeakerToWord(word, segments),
}));
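The interval-overlap formula used by `getOverlap` can be checked by hand. A self-contained sketch replaying the values from the test above:

```typescript
// Overlap of two intervals: max(0, min(ends) - max(starts)).
const overlap = (aStart: number, aEnd: number, bStart: number, bEnd: number) =>
  Math.max(0, Math.min(aEnd, bEnd) - Math.max(aStart, bStart));

// Word [1.0, 1.4] against the two speaker segments from the test:
console.log(overlap(1.0, 1.4, 0.8, 1.1).toFixed(2)); // "0.10" with spk_0
console.log(overlap(1.0, 1.4, 1.1, 1.6).toFixed(2)); // "0.30" with spk_1, the larger overlap wins
console.log(overlap(1.0, 1.4, 2.0, 2.5));            // 0: disjoint, so the word falls back to "unknown"
```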


@ -0,0 +1,92 @@
import { describe, expect, it } from 'vitest';
import { buildExportPayload } from './exportPayload';
import { Subtitle, TextStyles } from '../types';
describe('buildExportPayload', () => {
it('includes preview-visible subtitle data, volume, styles, and trim info', () => {
const subtitles: Subtitle[] = [
{
id: 'sub-1',
startTime: 1,
endTime: 2,
originalText: 'hello',
translatedText: '你好',
speaker: 'Speaker 1',
voiceId: 'voice-1',
audioUrl: 'data:audio/mp3;base64,AAAA',
volume: 0.75,
},
];
const textStyles: TextStyles = {
fontFamily: 'MiSans-Late',
fontSize: 24,
color: '#FFFFFF',
backgroundColor: 'transparent',
alignment: 'center',
isBold: true,
isItalic: false,
isUnderline: false,
};
const payload = buildExportPayload({
subtitles,
textStyles,
bgmBase64: 'instrumental-base64',
trimRange: { start: 10, end: 20 },
});
expect(payload).toEqual({
subtitles: [
{
startTime: 1,
endTime: 2,
text: '你好',
audioUrl: 'data:audio/mp3;base64,AAAA',
volume: 0.75,
},
],
textStyles,
bgmBase64: 'instrumental-base64',
trimRange: { start: 10, end: 20 },
});
});
it('falls back to original text and default volume when subtitle fields are missing', () => {
const payload = buildExportPayload({
subtitles: [
{
id: 'sub-2',
startTime: 3,
endTime: 4,
originalText: 'fallback',
translatedText: '',
speaker: 'Speaker 2',
voiceId: 'voice-2',
},
],
textStyles: {
fontFamily: 'Arial',
fontSize: 18,
color: '#FF0000',
backgroundColor: 'transparent',
alignment: 'left',
isBold: false,
isItalic: true,
isUnderline: true,
},
bgmBase64: null,
trimRange: null,
});
expect(payload.subtitles[0]).toEqual({
startTime: 3,
endTime: 4,
text: 'fallback',
audioUrl: undefined,
volume: 1,
});
expect(payload.bgmBase64).toBeNull();
expect(payload.trimRange).toBeNull();
});
});

39
src/lib/exportPayload.ts Normal file

@ -0,0 +1,39 @@
import { Subtitle, TextStyles } from '../types';
export interface ExportPayloadSubtitle {
startTime: number;
endTime: number;
text: string;
audioUrl?: string;
volume: number;
}
export interface ExportPayload {
subtitles: ExportPayloadSubtitle[];
textStyles: TextStyles;
bgmBase64: string | null;
trimRange: { start: number; end: number } | null;
}
export const buildExportPayload = ({
subtitles,
textStyles,
bgmBase64,
trimRange,
}: {
subtitles: Subtitle[];
textStyles: TextStyles;
bgmBase64: string | null;
trimRange: { start: number; end: number } | null;
}): ExportPayload => ({
subtitles: subtitles.map((subtitle) => ({
startTime: subtitle.startTime,
endTime: subtitle.endTime,
text: subtitle.translatedText || subtitle.originalText,
audioUrl: subtitle.audioUrl,
volume: subtitle.volume ?? 1,
})),
textStyles,
bgmBase64,
trimRange,
});


@ -0,0 +1,21 @@
import { describe, expect, it } from 'vitest';
import { getActiveWord } from './wordHighlight';
describe('getActiveWord', () => {
it('returns the active word for the current playback time', () => {
const activeWord = getActiveWord([
{ text: 'Hello', startTime: 1, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'world', startTime: 1.35, endTime: 1.8, speakerId: 'spk_0', confidence: 0.9 },
], 1.1);
expect(activeWord?.text).toBe('Hello');
});
it('returns undefined when no word is active', () => {
const activeWord = getActiveWord([
{ text: 'Hello', startTime: 1, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
], 2);
expect(activeWord).toBeUndefined();
});
});


@ -0,0 +1,6 @@
import { WordTiming } from '../../types';
export const getActiveWord = (
words: WordTiming[],
currentTime: number,
) => words.find((word) => currentTime >= word.startTime && currentTime <= word.endTime);
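Both comparisons in the predicate are inclusive, so a playhead sitting exactly where one word ends and the next begins matches both words; `Array.prototype.find` then returns the earlier one. A self-contained sketch (types trimmed to the fields used):

```typescript
interface Word { text: string; startTime: number; endTime: number }

// Same predicate as getActiveWord: inclusive on both boundaries.
const getActive = (words: Word[], t: number) =>
  words.find((w) => t >= w.startTime && t <= w.endTime);

const words: Word[] = [
  { text: 'Hello', startTime: 1.0, endTime: 1.3 },
  { text: 'world', startTime: 1.3, endTime: 1.8 },
];

// 1.3 satisfies both predicates, but find returns the first hit.
console.log(getActive(words, 1.3)?.text); // "Hello"
```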


@ -0,0 +1,18 @@
import { describe, expect, it } from 'vitest';
import { buildSpeakerPresentation } from './speakerPresentation';
describe('buildSpeakerPresentation', () => {
it('creates stable display metadata for each speaker id', () => {
const speaker = buildSpeakerPresentation({ speakerId: 'spk_0', label: 'Speaker 1' });
expect(speaker.speakerId).toBe('spk_0');
expect(speaker.label).toBe('Speaker 1');
expect(speaker.color).toMatch(/^#/);
});
it('falls back to a readable label when one is missing', () => {
const speaker = buildSpeakerPresentation({ speakerId: 'spk_9', label: '' });
expect(speaker.label).toBe('spk_9');
});
});


@ -0,0 +1,12 @@
import { SpeakerTrack } from '../../types';
const SPEAKER_COLORS = ['#1677ff', '#13c2c2', '#fa8c16', '#eb2f96', '#52c41a', '#722ed1'];
const hashSpeakerId = (speakerId: string) =>
speakerId.split('').reduce((hash, char) => hash + char.charCodeAt(0), 0);
export const buildSpeakerPresentation = (speaker: SpeakerTrack) => ({
speakerId: speaker.speakerId,
label: speaker.label || speaker.speakerId || 'Unknown speaker',
color: SPEAKER_COLORS[hashSpeakerId(speaker.speakerId) % SPEAKER_COLORS.length],
});
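The character-code hash keeps the color assignment deterministic: the same speaker id always lands on the same palette slot. A quick standalone check replaying the palette above:

```typescript
const SPEAKER_COLORS = ['#1677ff', '#13c2c2', '#fa8c16', '#eb2f96', '#52c41a', '#722ed1'];

// Sum of char codes, same as the module's hashSpeakerId.
const hashSpeakerId = (speakerId: string) =>
  speakerId.split('').reduce((hash, char) => hash + char.charCodeAt(0), 0);

// 'spk_0' -> 115+112+107+95+48 = 477, and 477 % 6 = 3 -> '#eb2f96'.
console.log(SPEAKER_COLORS[hashSpeakerId('spk_0') % SPEAKER_COLORS.length]); // "#eb2f96"
// Adjacent ids land on adjacent slots: 'spk_1' -> 478 % 6 = 4 -> '#52c41a'.
console.log(SPEAKER_COLORS[hashSpeakerId('spk_1') % SPEAKER_COLORS.length]); // "#52c41a"
```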


@ -0,0 +1,52 @@
import { describe, expect, it } from 'vitest';
import { deriveSubtitleBounds, normalizeAlignedSentence } from './subtitlePipeline';
describe('subtitlePipeline', () => {
it('derives subtitle boundaries from the first and last word', () => {
const result = deriveSubtitleBounds([
{ text: 'Hello', startTime: 1.2, endTime: 1.5, speakerId: 'spk_0', confidence: 0.99 },
{ text: 'world', startTime: 1.6, endTime: 2.0, speakerId: 'spk_0', confidence: 0.98 },
]);
expect(result).toEqual({ startTime: 1.2, endTime: 2.0 });
});
it('normalizes an aligned sentence into a subtitle shape', () => {
const subtitle = normalizeAlignedSentence({
id: 's1',
speakerId: 'spk_0',
speaker: 'Speaker 1',
words: [
{ text: 'Hello', startTime: 1.2, endTime: 1.5, speakerId: 'spk_0', confidence: 0.99 },
{ text: 'world', startTime: 1.6, endTime: 2.0, speakerId: 'spk_0', confidence: 0.98 },
],
originalText: 'Hello world',
translatedText: '你好世界',
voiceId: 'male-qn-qingse',
});
expect(subtitle.startTime).toBe(1.2);
expect(subtitle.endTime).toBe(2.0);
expect(subtitle.speakerId).toBe('spk_0');
expect(subtitle.words).toHaveLength(2);
expect(subtitle.confidence).toBeCloseTo(0.985);
});
it('keeps explicit bounds when the alignment service only returns sentence-level segments', () => {
const subtitle = normalizeAlignedSentence({
id: 's2',
speakerId: 'unknown',
words: [],
startTime: 3.1,
endTime: 4.6,
originalText: 'Sentence only',
translatedText: '仅句段',
voiceId: 'male-qn-qingse',
});
expect(subtitle.startTime).toBe(3.1);
expect(subtitle.endTime).toBe(4.6);
expect(subtitle.words).toEqual([]);
expect(subtitle.confidence).toBe(0);
});
});


@ -0,0 +1,58 @@
import { Subtitle, WordTiming } from '../types';
export interface NormalizedAlignedSentenceInput {
id: string;
speakerId: string;
speaker?: string;
words: WordTiming[];
startTime?: number;
endTime?: number;
originalText: string;
translatedText: string;
voiceId: string;
audioUrl?: string;
volume?: number;
}
export const deriveSubtitleBounds = (
words: WordTiming[],
fallbackStartTime = 0,
fallbackEndTime = fallbackStartTime,
) => ({
startTime: words[0]?.startTime ?? fallbackStartTime,
endTime: words[words.length - 1]?.endTime ?? fallbackEndTime,
});
const deriveConfidence = (words: WordTiming[]) => {
if (words.length === 0) {
return 0;
}
const total = words.reduce((sum, word) => sum + word.confidence, 0);
return total / words.length;
};
export const normalizeAlignedSentence = (
sentence: NormalizedAlignedSentenceInput,
): Subtitle => {
const bounds = deriveSubtitleBounds(
sentence.words,
sentence.startTime,
sentence.endTime,
);
return {
id: sentence.id,
startTime: bounds.startTime,
endTime: bounds.endTime,
originalText: sentence.originalText,
translatedText: sentence.translatedText,
speaker: sentence.speaker ?? sentence.speakerId,
speakerId: sentence.speakerId,
words: sentence.words,
confidence: deriveConfidence(sentence.words),
voiceId: sentence.voiceId,
audioUrl: sentence.audioUrl,
volume: sentence.volume,
};
};


@ -0,0 +1,29 @@
import { describe, expect, it } from 'vitest';
import { snapTimeToNearestWordBoundary } from './snapToWords';
describe('snapTimeToNearestWordBoundary', () => {
it('snaps a dragged edge to the nearest word boundary within tolerance', () => {
const next = snapTimeToNearestWordBoundary(
1.34,
[
{ text: 'Hello', startTime: 1.0, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'world', startTime: 1.35, endTime: 1.8, speakerId: 'spk_0', confidence: 0.9 },
],
);
expect(next).toBe(1.35);
});
it('leaves the time unchanged when no boundary is close enough', () => {
const next = snapTimeToNearestWordBoundary(
1.55,
[
{ text: 'Hello', startTime: 1.0, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'world', startTime: 1.35, endTime: 1.8, speakerId: 'spk_0', confidence: 0.9 },
],
0.02,
);
expect(next).toBe(1.55);
});
});


@ -0,0 +1,22 @@
import { WordTiming } from '../../types';
export const snapTimeToNearestWordBoundary = (
time: number,
words: WordTiming[],
tolerance = 0.05,
) => {
const boundaries = words.flatMap((word) => [word.startTime, word.endTime]);
let nearestBoundary = time;
let nearestDistance = tolerance;
for (const boundary of boundaries) {
const distance = Math.abs(boundary - time);
if (distance <= nearestDistance) {
nearestBoundary = boundary;
nearestDistance = distance;
}
}
return nearestBoundary;
};
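A standalone copy of the routine with a worked call: 1.33 is within the 0.05 tolerance of both 1.3 and 1.35, and the closer boundary (1.35) wins.

```typescript
interface Word { startTime: number; endTime: number }

// Same logic as snapTimeToNearestWordBoundary, inlined for illustration.
const snap = (time: number, words: Word[], tolerance = 0.05) => {
  const boundaries = words.flatMap((w) => [w.startTime, w.endTime]);
  let nearestBoundary = time;
  let nearestDistance = tolerance;
  for (const boundary of boundaries) {
    const distance = Math.abs(boundary - time);
    if (distance <= nearestDistance) {
      nearestBoundary = boundary;
      nearestDistance = distance;
    }
  }
  return nearestBoundary;
};

const words: Word[] = [
  { startTime: 1.0, endTime: 1.3 },
  { startTime: 1.35, endTime: 1.8 },
];

console.log(snap(1.33, words)); // 1.35 (closest boundary within tolerance)
console.log(snap(1.55, words)); // 1.55 (no boundary within 0.05, left unchanged)
```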

10
src/main.tsx Normal file

@ -0,0 +1,10 @@
import {StrictMode} from 'react';
import {createRoot} from 'react-dom/client';
import App from './App.tsx';
import './index.css';
createRoot(document.getElementById('root')!).render(
<StrictMode>
<App />
</StrictMode>,
);


@ -0,0 +1,162 @@
import fs from 'fs';
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
import { parseAlignmentResponse, requestAlignedTranscript } from './alignmentAdapter';
const jsonResponse = (payload: unknown, status = 200) =>
new Response(JSON.stringify(payload), {
status,
headers: {
'Content-Type': 'application/json',
},
});
describe('parseAlignmentResponse', () => {
beforeEach(() => {
vi.spyOn(fs, 'readFileSync').mockReturnValue(Buffer.from('audio'));
});
afterEach(() => {
vi.restoreAllMocks();
});
it('maps aligned words and speaker segments from the adapter response', () => {
const result = parseAlignmentResponse({
words: [{ word: 'hello', start: 1.0, end: 1.2, speaker: 'spk_0', score: 0.95 }],
speakers: [{ speaker: 'spk_0', start: 0.8, end: 1.6 }],
sourceLanguage: 'en',
duration: 1.6,
quality: 'full',
alignmentEngine: 'whisperx+pyannote',
});
expect(result.words[0].speakerId).toBe('spk_0');
expect(result.words[0].confidence).toBe(0.95);
expect(result.speakers).toEqual([{ speakerId: 'spk_0', label: 'Speaker 1' }]);
expect(result.quality).toBe('full');
});
it('defaults missing speakers to partial quality with empty speaker tracks', () => {
const result = parseAlignmentResponse({
words: [{ word: 'hello', start: 1.0, end: 1.2, score: 0.95 }],
duration: 1.2,
alignmentEngine: 'whisperx',
});
expect(result.words[0].speakerId).toBe('unknown');
expect(result.speakers).toEqual([]);
expect(result.quality).toBe('partial');
});
it('falls back to the project workflow and keeps segment text untranslated for provider re-translation', async () => {
const fetchImpl = vi
.fn<typeof fetch>()
.mockResolvedValueOnce(new Response(null, { status: 404 }))
.mockResolvedValueOnce(
jsonResponse(
{
data: {
id: 'project-1',
},
},
201,
),
)
.mockResolvedValueOnce(
jsonResponse(
{
data: {
id: 'asset-1',
},
},
201,
),
)
.mockResolvedValueOnce(
jsonResponse(
{
data: {
job_id: 'job-1',
status: 'queued',
},
},
202,
),
)
.mockResolvedValueOnce(
jsonResponse({
data: {
job_id: 'job-1',
status: 'succeeded',
},
}),
)
.mockResolvedValueOnce(
jsonResponse({
data: [
{
id: 'subtitle-1',
source_text: 'Hello there',
translated_text: '你好',
start_ms: 1000,
end_ms: 2500,
},
],
}),
);
const result = await requestAlignedTranscript({
audioPath: 'clip.wav',
serviceUrl: 'http://127.0.0.1:8000',
targetLanguage: 'zh',
allowStudioProjectFallback: true,
fetchImpl,
pollIntervalMs: 0,
maxPollAttempts: 1,
});
expect(fetchImpl).toHaveBeenCalledTimes(6);
expect(fetchImpl.mock.calls[1][0]).toBe('http://127.0.0.1:8000/api/projects');
expect(fetchImpl.mock.calls[2][0]).toBe(
'http://127.0.0.1:8000/api/projects/project-1/assets/upload',
);
expect(fetchImpl.mock.calls[3][0]).toBe(
'http://127.0.0.1:8000/api/projects/project-1/translate',
);
expect(fetchImpl.mock.calls[4][0]).toBe('http://127.0.0.1:8000/api/jobs/job-1');
expect(fetchImpl.mock.calls[5][0]).toBe(
'http://127.0.0.1:8000/api/projects/project-1/subtitles',
);
expect(result.words).toEqual([]);
expect(result.segments).toEqual([
{
id: 'subtitle-1',
originalText: 'Hello there',
translatedText: undefined,
startTime: 1,
endTime: 2.5,
speakerId: 'unknown',
confidence: 0,
words: [],
},
]);
expect(result.quality).toBe('fallback');
expect(result.alignmentEngine).toBe('video-translation-studio');
});
it('fails fast when studio fallback is explicitly disabled', async () => {
const fetchImpl = vi.fn<typeof fetch>().mockResolvedValueOnce(new Response(null, { status: 404 }));
await expect(
requestAlignedTranscript({
audioPath: 'clip.wav',
serviceUrl: 'http://127.0.0.1:8000',
targetLanguage: 'zh',
allowStudioProjectFallback: false,
fetchImpl,
}),
).rejects.toThrow(/studio project fallback is disabled/i);
expect(fetchImpl).toHaveBeenCalledTimes(1);
});
});


@ -0,0 +1,317 @@
import fs from 'fs';
import { SpeakerTrack, WordTiming } from '../types';
import { AlignedSegment, AlignmentResult } from './subtitlePipeline';
interface RawAlignedWord {
word: string;
start: number;
end: number;
speaker?: string;
score?: number;
}
interface RawSpeakerSegment {
speaker: string;
start: number;
end: number;
}
interface RawAlignedSegment {
id?: string;
originalText?: string;
translatedText?: string;
startTime: number;
endTime: number;
speakerId?: string;
speaker?: string;
confidence?: number;
words?: RawAlignedWord[];
}
interface StudioEnvelope<T> {
data?: T;
}
interface StudioProject {
id: string;
}
interface StudioJob {
job_id: string;
status: string;
error_message?: string | null;
}
interface StudioSubtitle {
id: string;
source_text: string;
translated_text?: string;
start_ms: number;
end_ms: number;
}
export interface RawAlignmentResponse {
words?: RawAlignedWord[];
speakers?: RawSpeakerSegment[];
segments?: RawAlignedSegment[];
sourceLanguage?: string;
duration?: number;
quality?: AlignmentResult['quality'];
alignmentEngine?: string;
}
const buildSpeakerTracks = (speakerSegments: RawSpeakerSegment[] = []): SpeakerTrack[] => {
const labels = new Map<string, string>();
speakerSegments.forEach((segment) => {
if (!labels.has(segment.speaker)) {
labels.set(segment.speaker, `Speaker ${labels.size + 1}`);
}
});
return Array.from(labels.entries()).map(([speakerId, label]) => ({
speakerId,
label,
}));
};
const mapWords = (words: RawAlignedWord[] = []): WordTiming[] =>
words.map((word) => ({
text: word.word,
startTime: Number(word.start) || 0,
endTime: Number(word.end) || 0,
speakerId: word.speaker || 'unknown',
confidence: Number(word.score) || 0,
}));
const mapSegments = (segments: RawAlignedSegment[] = []): AlignedSegment[] =>
segments.map((segment, index) => ({
id: segment.id || `segment-${index + 1}`,
originalText: segment.originalText || '',
translatedText: segment.translatedText,
startTime: Number(segment.startTime) || 0,
endTime: Number(segment.endTime) || 0,
speakerId: segment.speakerId || 'unknown',
speaker: segment.speaker,
confidence: Number(segment.confidence) || 0,
words: mapWords(segment.words),
}));
const deriveSegmentDuration = (segments: AlignedSegment[]) =>
segments.reduce((maxDuration, segment) => Math.max(maxDuration, segment.endTime), 0);
export const parseAlignmentResponse = (
response: RawAlignmentResponse,
): AlignmentResult => {
const speakers = buildSpeakerTracks(response.speakers);
const segments = mapSegments(response.segments);
const words = mapWords(response.words);
return {
words,
segments,
speakers,
quality:
response.quality ??
(speakers.length > 0
? 'full'
: segments.length > 0 && words.length === 0
? 'fallback'
: 'partial'),
sourceLanguage: response.sourceLanguage,
duration: response.duration ?? deriveSegmentDuration(segments),
alignmentEngine: response.alignmentEngine,
};
};
export interface RequestAlignedTranscriptOptions {
audioPath: string;
serviceUrl?: string;
targetLanguage: string;
allowStudioProjectFallback?: boolean;
fetchImpl?: typeof fetch;
pollIntervalMs?: number;
maxPollAttempts?: number;
}
const createAudioFormData = (audioBuffer: Buffer) => {
const formData = new FormData();
formData.append('audio', new Blob([audioBuffer], { type: 'audio/wav' }), 'audio.wav');
return formData;
};
const createUploadFormData = (audioBuffer: Buffer) => {
const formData = new FormData();
formData.append('file', new Blob([audioBuffer], { type: 'audio/wav' }), 'audio.wav');
return formData;
};
const normalizeServiceUrl = (serviceUrl: string) => serviceUrl.replace(/\/+$/, '');
const DEFAULT_STUDIO_SERVICE_URL = 'http://127.0.0.1:8000';
const joinServiceUrl = (serviceUrl: string, path: string) =>
`${normalizeServiceUrl(serviceUrl)}${path.startsWith('/') ? path : `/${path}`}`;
const shouldFallbackToProjectWorkflow = (status: number) =>
status === 404 || status === 405;
const isTerminalJobStatus = (status: string) =>
status === 'succeeded' || status === 'failed' || status === 'cancelled';
const sleep = (durationMs: number) =>
new Promise((resolve) => {
setTimeout(resolve, durationMs);
});
const parseStudioEnvelope = async <T>(
response: Response,
errorLabel: string,
): Promise<T> => {
if (!response.ok) {
throw new Error(`${errorLabel} failed with status ${response.status}`);
}
const payload = (await response.json()) as StudioEnvelope<T>;
if (payload.data === undefined || payload.data === null) {
throw new Error(`${errorLabel} returned an empty payload`);
}
return payload.data;
};
const mapStudioSubtitles = (subtitles: StudioSubtitle[]): AlignedSegment[] =>
subtitles.map((subtitle) => ({
id: subtitle.id,
originalText: subtitle.source_text,
// Force the selected LLM provider to own translation so Gemini/Doubao switching
// stays meaningful even when the alignment service can also translate.
translatedText: undefined,
startTime: Number(subtitle.start_ms) / 1000 || 0,
endTime: Number(subtitle.end_ms) / 1000 || 0,
speakerId: 'unknown',
confidence: 0,
words: [],
}));
const requestStudioTranscript = async ({
audioBuffer,
serviceUrl,
targetLanguage,
fetchImpl,
pollIntervalMs,
maxPollAttempts,
}: {
audioBuffer: Buffer;
serviceUrl: string;
targetLanguage: string;
fetchImpl: typeof fetch;
pollIntervalMs: number;
maxPollAttempts: number;
}): Promise<AlignmentResult> => {
const project = await parseStudioEnvelope<StudioProject>(
await fetchImpl(joinServiceUrl(serviceUrl, '/api/projects'), {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
name: `Codex Audio Import ${Date.now()}`,
target_language: targetLanguage,
mode: 'editing',
}),
}),
'Project creation',
);
await parseStudioEnvelope(
await fetchImpl(joinServiceUrl(serviceUrl, `/api/projects/${project.id}/assets/upload`), {
method: 'POST',
body: createUploadFormData(audioBuffer),
}),
'Asset upload',
);
const translationJob = await parseStudioEnvelope<StudioJob>(
await fetchImpl(joinServiceUrl(serviceUrl, `/api/projects/${project.id}/translate`), {
method: 'POST',
}),
'Project translation',
);
let jobStatus = translationJob.status;
let attempts = 0;
while (!isTerminalJobStatus(jobStatus) && attempts < maxPollAttempts) {
attempts += 1;
await sleep(pollIntervalMs);
const polledJob = await parseStudioEnvelope<StudioJob>(
await fetchImpl(joinServiceUrl(serviceUrl, `/api/jobs/${translationJob.job_id}`)),
'Job status',
);
jobStatus = polledJob.status;
if (polledJob.error_message) {
throw new Error(`Project translation failed: ${polledJob.error_message}`);
}
}
if (jobStatus !== 'succeeded') {
throw new Error(`Project translation did not succeed (status: ${jobStatus})`);
}
const studioSubtitles = await parseStudioEnvelope<StudioSubtitle[]>(
await fetchImpl(joinServiceUrl(serviceUrl, `/api/projects/${project.id}/subtitles`)),
'Subtitle fetch',
);
const segments = mapStudioSubtitles(studioSubtitles);
return {
words: [],
segments,
speakers: [],
quality: 'fallback',
duration: deriveSegmentDuration(segments),
alignmentEngine: 'video-translation-studio',
};
};
export const requestAlignedTranscript = async ({
audioPath,
serviceUrl = DEFAULT_STUDIO_SERVICE_URL,
targetLanguage,
allowStudioProjectFallback = true,
fetchImpl = fetch,
pollIntervalMs = 1_000,
maxPollAttempts = 30,
}: RequestAlignedTranscriptOptions): Promise<AlignmentResult> => {
const audioBuffer = fs.readFileSync(audioPath);
const formData = createAudioFormData(audioBuffer);
const response = await fetchImpl(normalizeServiceUrl(serviceUrl), {
method: 'POST',
body: formData,
});
if (!response.ok) {
if (shouldFallbackToProjectWorkflow(response.status)) {
if (!allowStudioProjectFallback) {
throw new Error(
`Alignment service returned status ${response.status} and Studio project fallback is disabled. Set ALLOW_STUDIO_PROJECT_FALLBACK=true to opt in.`,
);
}
return requestStudioTranscript({
audioBuffer,
serviceUrl,
targetLanguage,
fetchImpl,
pollIntervalMs,
maxPollAttempts,
});
}
throw new Error(`Alignment service failed with status ${response.status}`);
}
const payload = (await response.json()) as RawAlignmentResponse;
return parseAlignmentResponse(payload);
};
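
The URL handling above tolerates trailing slashes on the service URL and missing leading slashes on paths, and Studio subtitles arrive with millisecond timestamps that the mapper converts to seconds. A minimal standalone sketch (the helpers are re-inlined copies of `normalizeServiceUrl` and `joinServiceUrl` from the module above so the snippet runs on its own):

```typescript
// Re-inlined from the module above: strip trailing slashes from the base URL.
const normalizeServiceUrl = (serviceUrl: string) => serviceUrl.replace(/\/+$/, '');

// Re-inlined from the module above: join base URL and path with exactly one slash.
const joinServiceUrl = (serviceUrl: string, path: string) =>
  `${normalizeServiceUrl(serviceUrl)}${path.startsWith('/') ? path : `/${path}`}`;

// Every combination of slashes builds the same request URL.
const a = joinServiceUrl('http://127.0.0.1:8000/', 'api/projects');
const b = joinServiceUrl('http://127.0.0.1:8000', '/api/projects');
// a === b === 'http://127.0.0.1:8000/api/projects'

// Studio subtitles carry start_ms/end_ms; mapStudioSubtitles divides by 1000.
const startSeconds = Number(2500) / 1000 || 0; // 2.5
```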

View File

@ -0,0 +1,21 @@
import { describe, expect, it } from 'vitest';
import { resolveAudioPipelineConfig } from './audioPipelineConfig';
describe('resolveAudioPipelineConfig', () => {
it('defaults provider to doubao', () => {
const config = resolveAudioPipelineConfig({});
expect(config).toEqual({
defaultProvider: 'doubao',
});
});
it('reads an explicit default provider override', () => {
const config = resolveAudioPipelineConfig({
DEFAULT_LLM_PROVIDER: 'gemini',
});
expect(config.defaultProvider).toBe('gemini');
});
});

View File

@ -0,0 +1,13 @@
import { LlmProvider, normalizeLlmProvider } from './llmProvider';
export interface AudioPipelineConfig {
defaultProvider: LlmProvider;
}
export const resolveAudioPipelineConfig = (
env: NodeJS.ProcessEnv,
): AudioPipelineConfig => {
return {
defaultProvider: normalizeLlmProvider(env.DEFAULT_LLM_PROVIDER),
};
};

View File

@ -0,0 +1,95 @@
import { describe, expect, it, vi } from 'vitest';
import {
extractDoubaoTextOutput,
translateSentencesWithDoubao,
} from './doubaoTranslation';
describe('doubaoTranslation', () => {
it('reconstructs text from the Ark output content array', () => {
const text = extractDoubaoTextOutput({
output: [
{
type: 'message',
content: [
{
type: 'output_text',
text: '[{"id":"sentence-1","translatedText":"你好"}]',
},
],
},
],
});
expect(text).toContain('translatedText');
});
it('parses Doubao JSON translation results', async () => {
const requestResponse = vi.fn(async () => ({
output: [
{
type: 'message',
content: [
{
type: 'output_text',
text: JSON.stringify([
{ id: 'sentence-1', translatedText: '你好', speaker: 'Speaker 1' },
]),
},
],
},
],
}));
const result = await translateSentencesWithDoubao({
targetLanguage: 'zh',
sentences: [
{
id: 'sentence-1',
originalText: 'hello',
speakerId: 'spk_0',
startTime: 0,
endTime: 1,
},
],
model: 'doubao-seed-2-0-pro-260215',
requestResponse,
});
expect(result).toEqual([
{ id: 'sentence-1', translatedText: '你好', speaker: 'Speaker 1' },
]);
});
it('accepts fenced json from Doubao output', async () => {
const requestResponse = vi.fn(async () => ({
output: [
{
type: 'message',
content: [
{
type: 'output_text',
text: '```json\n[{"id":"sentence-1","translatedText":"你好"}]\n```',
},
],
},
],
}));
const result = await translateSentencesWithDoubao({
targetLanguage: 'zh',
sentences: [
{
id: 'sentence-1',
originalText: 'hello',
speakerId: 'spk_0',
startTime: 0,
endTime: 1,
},
],
model: 'doubao-seed-2-0-pro-260215',
requestResponse,
});
expect(result[0].translatedText).toBe('你好');
});
});

View File

@ -0,0 +1,122 @@
import { TranslationSentenceInput, TranslationSentenceResult } from './subtitlePipeline';
export interface TranslateSentencesWithDoubaoOptions {
targetLanguage: string;
sentences: TranslationSentenceInput[];
model: string;
requestResponse: (request: unknown) => Promise<unknown>;
}
const stripJsonFences = (text: string) => text.replace(/```json\n?|\n?```/g, '').trim();
export const extractDoubaoTextOutput = (payload: any): string => {
const output = Array.isArray(payload?.output) ? payload.output : [];
const parts = output.flatMap((item: any) => {
if (typeof item?.text === 'string') {
return [item.text];
}
if (!Array.isArray(item?.content)) {
return [];
}
return item.content
.map((part: any) => {
if (typeof part?.text === 'string') {
return part.text;
}
return '';
})
.filter(Boolean);
});
return parts.join('').trim();
};
export const translateSentencesWithDoubao = async ({
targetLanguage,
sentences,
model,
requestResponse,
}: TranslateSentencesWithDoubaoOptions): Promise<TranslationSentenceResult[]> => {
const prompt = `Translate the following subtitle segments into ${targetLanguage}.
CRITICAL INSTRUCTIONS:
1. You MUST return a valid JSON array of objects.
2. Keep the EXACT same "id" and "originalText" for each segment.
3. Add a "translatedText" field with the translation.
4. Add a "speaker" field if you can infer a useful display label.
5. DO NOT wrap in markdown blocks. Return ONLY JSON.
Input JSON:
${JSON.stringify(sentences, null, 2)}`;
const response = await requestResponse({
model,
input: [
{
role: 'user',
content: [
{
type: 'input_text',
text: prompt,
},
],
},
],
});
const text = stripJsonFences(extractDoubaoTextOutput(response));
return JSON.parse(text || '[]') as TranslationSentenceResult[];
};
export const createDoubaoSentenceTranslator = ({
apiKey,
model,
baseUrl,
fetchImpl = fetch,
}: {
apiKey: string;
model: string;
baseUrl: string;
fetchImpl?: typeof fetch;
}) => {
return async (
sentences: TranslationSentenceInput[],
targetLanguage: string,
): Promise<TranslationSentenceResult[]> =>
translateSentencesWithDoubao({
targetLanguage,
sentences,
model,
requestResponse: async (request) => {
const response = await fetchImpl(baseUrl, {
method: 'POST',
headers: {
Authorization: `Bearer ${apiKey}`,
'Content-Type': 'application/json',
},
body: JSON.stringify(request),
});
if (!response.ok) {
let errorMessage = `Doubao request failed with status ${response.status}`;
try {
const errorPayload = await response.json();
errorMessage =
errorPayload?.error?.message ||
errorPayload?.message ||
errorPayload?.error?.code ||
errorMessage;
} catch {
// Fall back to the HTTP status message when the upstream body is not JSON.
}
throw new Error(errorMessage);
}
return response.json();
},
});
};
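
Ark's Responses API returns `output` as a list of message items whose `content` parts each carry a `text` field, and a single reply may be split across several parts; `extractDoubaoTextOutput` flattens and joins them in order. A standalone sketch of that flattening (a trimmed re-inline of the function above):

```typescript
// Trimmed re-inline of extractDoubaoTextOutput from the module above.
const extractText = (payload: any): string => {
  const output = Array.isArray(payload?.output) ? payload.output : [];
  const parts = output.flatMap((item: any) => {
    if (typeof item?.text === 'string') return [item.text];
    if (!Array.isArray(item?.content)) return [];
    return item.content
      .map((part: any) => (typeof part?.text === 'string' ? part.text : ''))
      .filter(Boolean);
  });
  return parts.join('').trim();
};

// Text split across two output_text parts is concatenated in order.
const text = extractText({
  output: [
    {
      type: 'message',
      content: [
        { type: 'output_text', text: '[{"id":"s-1",' },
        { type: 'output_text', text: '"translatedText":"你好"}]' },
      ],
    },
  ],
});
// text === '[{"id":"s-1","translatedText":"你好"}]'
```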

View File

@ -0,0 +1,93 @@
import { describe, expect, it } from 'vitest';
import {
buildAssSubtitleContent,
buildExportAudioPlan,
shiftSubtitlesToExportTimeline,
} from './exportVideo';
import { TextStyles } from '../types';
const defaultTextStyles: TextStyles = {
fontFamily: 'MiSans-Late',
fontSize: 24,
color: '#FFFFFF',
backgroundColor: 'transparent',
alignment: 'center',
isBold: true,
isItalic: false,
isUnderline: false,
};
describe('shiftSubtitlesToExportTimeline', () => {
it('keeps subtitle times relative when there is no trim range', () => {
const shifted = shiftSubtitlesToExportTimeline(
[{ startTime: 1, endTime: 2, text: 'hello', volume: 1 }],
null,
);
expect(shifted).toEqual([{ startTime: 1, endTime: 2, text: 'hello', volume: 1 }]);
});
it('moves subtitle times onto the full video timeline when trim is enabled', () => {
const shifted = shiftSubtitlesToExportTimeline(
[{ startTime: 1, endTime: 2.5, text: 'hello', volume: 1 }],
{ start: 10, end: 20 },
);
expect(shifted).toEqual([{ startTime: 11, endTime: 12.5, text: 'hello', volume: 1 }]);
});
});
describe('buildExportAudioPlan', () => {
it('drops original source audio when instrumental BGM is present', () => {
const plan = buildExportAudioPlan({
hasSourceAudio: true,
hasBgm: true,
subtitles: [
{ startTime: 4, endTime: 5, text: 'hello', audioUrl: 'data:audio/mp3;base64,AAAA', volume: 0.8 },
],
});
expect(plan.includeSourceAudio).toBe(false);
expect(plan.bgmVolume).toBe(0.5);
expect(plan.ttsTracks[0]).toMatchObject({
delayMs: 4000,
volume: 0.8,
});
});
it('keeps original source audio at preview volume when no BGM exists', () => {
const plan = buildExportAudioPlan({
hasSourceAudio: true,
hasBgm: false,
subtitles: [],
});
expect(plan.includeSourceAudio).toBe(true);
expect(plan.sourceAudioVolume).toBe(0.3);
expect(plan.bgmVolume).toBeNull();
});
});
describe('buildAssSubtitleContent', () => {
it('renders ASS styles from the selected text settings', () => {
const ass = buildAssSubtitleContent({
subtitles: [{ startTime: 1, endTime: 2, text: 'Hello world', volume: 1 }],
textStyles: {
...defaultTextStyles,
color: '#00FF88',
alignment: 'right',
isItalic: true,
isUnderline: true,
},
videoWidth: 1080,
videoHeight: 1920,
});
expect(ass).toContain('PlayResX: 1080');
expect(ass).toContain('PlayResY: 1920');
expect(ass).toContain('Style: Default,MiSans-Late,24');
expect(ass).toContain('&H0088FF00');
expect(ass).toContain(',3,');
expect(ass).toContain('Dialogue: 0,0:00:01.00,0:00:02.00,Default,,0,0,0,,Hello world');
});
});

148
src/server/exportVideo.ts Normal file
View File

@ -0,0 +1,148 @@
import { TextStyles } from '../types';
import { ExportPayloadSubtitle } from '../lib/exportPayload';
export const DEFAULT_EXPORT_TEXT_STYLES: TextStyles = {
fontFamily: 'MiSans-Late',
fontSize: 24,
color: '#FFFFFF',
backgroundColor: 'transparent',
alignment: 'center',
isBold: false,
isItalic: false,
isUnderline: false,
};
export interface ExportAudioPlan {
includeSourceAudio: boolean;
sourceAudioVolume: number | null;
bgmVolume: number | null;
ttsTracks: Array<{
delayMs: number;
volume: number;
audioUrl: string;
}>;
}
export const shiftSubtitlesToExportTimeline = (
subtitles: ExportPayloadSubtitle[],
trimRange: { start: number; end: number } | null,
): ExportPayloadSubtitle[] => {
if (!trimRange) {
return subtitles.map((subtitle) => ({ ...subtitle }));
}
return subtitles.map((subtitle) => ({
...subtitle,
startTime: subtitle.startTime + trimRange.start,
endTime: subtitle.endTime + trimRange.start,
}));
};
export const buildExportAudioPlan = ({
hasSourceAudio,
hasBgm,
subtitles,
}: {
hasSourceAudio: boolean;
hasBgm: boolean;
subtitles: ExportPayloadSubtitle[];
}): ExportAudioPlan => ({
includeSourceAudio: hasSourceAudio && !hasBgm,
sourceAudioVolume: hasSourceAudio && !hasBgm ? 0.3 : null,
bgmVolume: hasBgm ? 0.5 : null,
ttsTracks: subtitles
.filter((subtitle) => Boolean(subtitle.audioUrl))
.map((subtitle) => ({
delayMs: Math.floor(subtitle.startTime * 1000),
volume: subtitle.volume ?? 1,
audioUrl: subtitle.audioUrl as string,
})),
});
const formatAssTime = (seconds: number): string => {
const safeSeconds = Math.max(0, seconds);
const hours = Math.floor(safeSeconds / 3600);
const minutes = Math.floor((safeSeconds % 3600) / 60);
const wholeSeconds = Math.floor(safeSeconds % 60);
const centiseconds = Math.floor((safeSeconds % 1) * 100);
return `${hours}:${minutes.toString().padStart(2, '0')}:${wholeSeconds.toString().padStart(2, '0')}.${centiseconds.toString().padStart(2, '0')}`;
};
const assEscape = (text: string): string =>
text
.replace(/\r\n|\r|\n/g, '\\N')
.replace(/{/g, '\\{')
.replace(/}/g, '\\}');
const toAssColor = (hexColor: string): string => {
const normalized = hexColor.replace('#', '').padEnd(6, '0').slice(0, 6);
const rr = normalized.slice(0, 2);
const gg = normalized.slice(2, 4);
const bb = normalized.slice(4, 6);
return `&H00${bb}${gg}${rr}`;
};
const mapAlignmentToAss = (alignment: TextStyles['alignment']): number => {
if (alignment === 'left') return 1;
if (alignment === 'right') return 3;
return 2;
};
export const buildAssSubtitleContent = ({
subtitles,
textStyles,
videoWidth,
videoHeight,
}: {
subtitles: ExportPayloadSubtitle[];
textStyles: TextStyles;
videoWidth: number;
videoHeight: number;
}): string => {
const styleLine = [
'Style: Default',
textStyles.fontFamily,
textStyles.fontSize,
toAssColor(textStyles.color),
'&H00000000',
'&H00000000',
'&H64000000',
textStyles.isBold ? -1 : 0,
textStyles.isItalic ? -1 : 0,
textStyles.isUnderline ? -1 : 0,
0,
100,
100,
0,
0,
1,
1,
2,
mapAlignmentToAss(textStyles.alignment),
48,
48,
Math.max(36, Math.round(videoHeight * 0.1)),
1,
].join(',');
const dialogueLines = subtitles.map((subtitle) =>
`Dialogue: 0,${formatAssTime(subtitle.startTime)},${formatAssTime(subtitle.endTime)},Default,,0,0,0,,${assEscape(subtitle.text || '')}`,
);
return [
'[Script Info]',
'ScriptType: v4.00+',
'WrapStyle: 0',
`PlayResX: ${videoWidth}`,
`PlayResY: ${videoHeight}`,
'',
'[V4+ Styles]',
'Format: Name,Fontname,Fontsize,PrimaryColour,SecondaryColour,OutlineColour,BackColour,Bold,Italic,Underline,StrikeOut,ScaleX,ScaleY,Spacing,Angle,BorderStyle,Outline,Shadow,Alignment,MarginL,MarginR,MarginV,Encoding',
styleLine,
'',
'[Events]',
'Format: Layer,Start,End,Style,Name,MarginL,MarginR,MarginV,Effect,Text',
...dialogueLines,
'',
].join('\n');
};
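
ASS stores colours as `&HAABBGGRR` (alpha first, channels reversed relative to CSS `#RRGGBB`) and timestamps as `H:MM:SS.CC` with centisecond precision. A standalone sketch of the two conversions, re-inlined from the module above:

```typescript
// Re-inlined from the module above: CSS #RRGGBB -> ASS &HAABBGGRR (alpha 00 = opaque).
const toAssColor = (hexColor: string): string => {
  const normalized = hexColor.replace('#', '').padEnd(6, '0').slice(0, 6);
  return `&H00${normalized.slice(4, 6)}${normalized.slice(2, 4)}${normalized.slice(0, 2)}`;
};

// Re-inlined from the module above: seconds -> H:MM:SS.CC.
const formatAssTime = (seconds: number): string => {
  const s = Math.max(0, seconds);
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const whole = Math.floor(s % 60);
  const centi = Math.floor((s % 1) * 100);
  return `${hours}:${String(minutes).padStart(2, '0')}:${String(whole).padStart(2, '0')}.${String(centi).padStart(2, '0')}`;
};

const green = toAssColor('#00FF88'); // '&H0088FF00' — note the reversed channel order
const stamp = formatAssTime(61.5);   // '0:01:01.50'
```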

View File

@ -0,0 +1,52 @@
import { describe, expect, it, vi } from 'vitest';
import { translateSentencesWithGemini } from './geminiTranslation';
describe('translateSentencesWithGemini', () => {
it('parses Gemini JSON translation results', async () => {
const generateContent = vi.fn(async () => ({
text: JSON.stringify([
{ id: 'sentence-1', translatedText: '你好', speaker: 'Speaker 1' },
]),
}));
const result = await translateSentencesWithGemini({
targetLanguage: 'zh',
sentences: [
{
id: 'sentence-1',
originalText: 'hello',
speakerId: 'spk_0',
startTime: 0,
endTime: 1,
},
],
generateContent,
});
expect(result).toEqual([
{ id: 'sentence-1', translatedText: '你好', speaker: 'Speaker 1' },
]);
});
it('accepts fenced JSON and strips the markdown wrapper', async () => {
const generateContent = vi.fn(async () => ({
text: '```json\n[{"id":"sentence-1","translatedText":"你好"}]\n```',
}));
const result = await translateSentencesWithGemini({
targetLanguage: 'zh',
sentences: [
{
id: 'sentence-1',
originalText: 'hello',
speakerId: 'spk_0',
startTime: 0,
endTime: 1,
},
],
generateContent,
});
expect(result[0].translatedText).toBe('你好');
});
});

View File

@ -0,0 +1,71 @@
import { GoogleGenAI, Type } from '@google/genai';
import { TranslationSentenceInput, TranslationSentenceResult } from './subtitlePipeline';
export interface TranslateSentencesWithGeminiOptions {
targetLanguage: string;
sentences: TranslationSentenceInput[];
generateContent: (request: any) => Promise<{ text?: string | null }>;
}
const stripJsonFences = (text: string) => text.replace(/```json\n?|\n?```/g, '').trim();
export const translateSentencesWithGemini = async ({
targetLanguage,
sentences,
generateContent,
}: TranslateSentencesWithGeminiOptions): Promise<TranslationSentenceResult[]> => {
const prompt = `Translate the following subtitle segments into ${targetLanguage}.
CRITICAL INSTRUCTIONS:
1. You MUST return a valid JSON array of objects.
2. Keep the EXACT same "id" and "originalText" for each segment.
3. Add a "translatedText" field with the translation.
4. Add a "speaker" field if you can infer a useful display label.
5. DO NOT wrap in markdown blocks. Return ONLY JSON.
Input JSON:
${JSON.stringify(sentences, null, 2)}`;
const response = await generateContent({
model: 'gemini-2.5-flash',
contents: [{ role: 'user', parts: [{ text: prompt }] }],
config: {
responseMimeType: 'application/json',
responseSchema: {
type: Type.ARRAY,
items: {
type: Type.OBJECT,
properties: {
id: { type: Type.STRING },
translatedText: { type: Type.STRING },
speaker: { type: Type.STRING },
},
required: ['id', 'translatedText'],
},
},
},
});
const text = stripJsonFences(response.text || '[]');
return JSON.parse(text) as TranslationSentenceResult[];
};
export const createGeminiSentenceTranslator = ({
apiKey,
}: {
apiKey: string;
}) => {
const ai = new GoogleGenAI({ apiKey });
return async (
sentences: TranslationSentenceInput[],
targetLanguage: string,
): Promise<TranslationSentenceResult[]> =>
translateSentencesWithGemini({
targetLanguage,
sentences,
generateContent: async (request) => {
const response = await ai.models.generateContent(request);
return { text: response.text };
},
});
};
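
Even with `responseMimeType: 'application/json'` and a `responseSchema`, models occasionally wrap output in a markdown fence, so the pipeline strips it defensively before `JSON.parse`. A standalone sketch of that stripping, re-inlined from the module above:

```typescript
// Re-inlined from the module above: removes a ```json ... ``` wrapper if present.
const stripJsonFences = (text: string) => text.replace(/```json\n?|\n?```/g, '').trim();

const FENCE = '`'.repeat(3);
const fenced = `${FENCE}json\n[{"id":"sentence-1","translatedText":"你好"}]\n${FENCE}`;
const parsed = JSON.parse(stripJsonFences(fenced) || '[]');
// parsed[0].translatedText === '你好'
```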

View File

@ -0,0 +1,47 @@
import { describe, expect, it } from 'vitest';
import {
DEFAULT_DOUBAO_MODEL,
DEFAULT_LLM_PROVIDER,
normalizeLlmProvider,
resolveLlmProviderConfig,
} from './llmProvider';
describe('llmProvider', () => {
it('defaults to doubao when the incoming provider is missing', () => {
expect(normalizeLlmProvider(undefined)).toBe(DEFAULT_LLM_PROVIDER);
});
it('normalizes a supported provider name', () => {
expect(normalizeLlmProvider('Gemini')).toBe('gemini');
expect(normalizeLlmProvider('doubao')).toBe('doubao');
});
it('rejects unsupported providers', () => {
expect(() => normalizeLlmProvider('gpt')).toThrow(/Unsupported LLM provider/);
});
it('returns the selected doubao provider config from env', () => {
expect(
resolveLlmProviderConfig('doubao', {
ARK_API_KEY: 'ark-key',
}),
).toEqual({
provider: 'doubao',
apiKey: 'ark-key',
model: DEFAULT_DOUBAO_MODEL,
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
});
});
it('returns the selected gemini provider config from env', () => {
expect(
resolveLlmProviderConfig('gemini', {
GEMINI_API_KEY: 'gemini-key',
}),
).toEqual({
provider: 'gemini',
apiKey: 'gemini-key',
model: 'gemini-2.5-flash',
});
});
});

64
src/server/llmProvider.ts Normal file
View File

@ -0,0 +1,64 @@
export const DEFAULT_LLM_PROVIDER = 'doubao';
export const DEFAULT_DOUBAO_MODEL = 'doubao-seed-2-0-pro-260215';
export const DEFAULT_GEMINI_MODEL = 'gemini-2.5-flash';
export const DEFAULT_DOUBAO_RESPONSES_URL = 'https://ark.cn-beijing.volces.com/api/v3/responses';
export type LlmProvider = 'doubao' | 'gemini';
export interface DoubaoProviderConfig {
provider: 'doubao';
apiKey: string;
model: string;
baseUrl: string;
}
export interface GeminiProviderConfig {
provider: 'gemini';
apiKey: string;
model: string;
}
export type LlmProviderConfig = DoubaoProviderConfig | GeminiProviderConfig;
export const normalizeLlmProvider = (value?: string | null): LlmProvider => {
if (!value) {
return DEFAULT_LLM_PROVIDER;
}
const normalized = value.trim().toLowerCase();
if (normalized === 'doubao' || normalized === 'gemini') {
return normalized;
}
throw new Error(`Unsupported LLM provider: ${value}`);
};
export const resolveLlmProviderConfig = (
provider: LlmProvider,
env: NodeJS.ProcessEnv,
): LlmProviderConfig => {
if (provider === 'doubao') {
const apiKey = env.ARK_API_KEY?.trim();
if (!apiKey) {
throw new Error('ARK_API_KEY is required for Doubao subtitle generation.');
}
return {
provider,
apiKey,
model: env.DOUBAO_MODEL?.trim() || DEFAULT_DOUBAO_MODEL,
baseUrl: (env.DOUBAO_BASE_URL?.trim() || DEFAULT_DOUBAO_RESPONSES_URL).replace(/\/+$/, ''),
};
}
const apiKey = env.GEMINI_API_KEY?.trim();
if (!apiKey) {
throw new Error('GEMINI_API_KEY is required for Gemini subtitle generation.');
}
return {
provider,
apiKey,
model: env.GEMINI_MODEL?.trim() || DEFAULT_GEMINI_MODEL,
};
};
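
Provider names are accepted case-insensitively, an absent value falls back to the Doubao default, and anything unrecognised fails loudly rather than silently picking a provider. A standalone sketch of that normalization, re-inlined from the module above:

```typescript
type LlmProvider = 'doubao' | 'gemini';

// Re-inlined from the module above.
const normalizeLlmProvider = (value?: string | null): LlmProvider => {
  if (!value) return 'doubao';
  const normalized = value.trim().toLowerCase();
  if (normalized === 'doubao' || normalized === 'gemini') return normalized;
  throw new Error(`Unsupported LLM provider: ${value}`);
};

const fromRequest = normalizeLlmProvider(' Gemini '); // 'gemini'
const fromDefault = normalizeLlmProvider(undefined);  // 'doubao'

let rejected = false;
try {
  normalizeLlmProvider('gpt');
} catch {
  rejected = true; // unsupported names throw instead of guessing
}
```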

View File

@ -0,0 +1,48 @@
import { describe, expect, it } from 'vitest';
import {
DEFAULT_MINIMAX_TTS_API_HOST,
getMiniMaxTtsHttpStatus,
resolveMiniMaxTtsConfig,
} from './minimaxTts';
describe('resolveMiniMaxTtsConfig', () => {
it('defaults to the official MiniMax TTS host', () => {
const config = resolveMiniMaxTtsConfig({
MINIMAX_API_KEY: 'secret-key',
});
expect(config).toEqual({
apiKey: 'secret-key',
apiHost: DEFAULT_MINIMAX_TTS_API_HOST,
});
});
it('uses the configured MiniMax host when provided', () => {
const config = resolveMiniMaxTtsConfig({
MINIMAX_API_KEY: 'secret-key',
MINIMAX_API_HOST: ' https://api.example.com/tts/ ',
});
expect(config.apiHost).toBe('https://api.example.com/tts');
});
});
describe('getMiniMaxTtsHttpStatus', () => {
it('maps login failures to 401 so the client can fallback immediately', () => {
expect(
getMiniMaxTtsHttpStatus({
status_code: 1004,
status_msg: "login fail: Please carry the API secret key in the 'Authorization' field of the request header",
}),
).toBe(401);
});
it('preserves rate limit style failures as 429', () => {
expect(
getMiniMaxTtsHttpStatus({
status_code: 429,
status_msg: 'too many requests',
}),
).toBe(429);
});
});

51
src/server/minimaxTts.ts Normal file
View File

@ -0,0 +1,51 @@
export const DEFAULT_MINIMAX_TTS_API_HOST = 'https://api.minimaxi.com';
export interface MiniMaxTtsConfig {
apiKey: string;
apiHost: string;
}
interface MiniMaxBaseResponseLike {
status_code?: number;
status_msg?: string;
}
export const resolveMiniMaxTtsConfig = (
env: NodeJS.ProcessEnv,
): MiniMaxTtsConfig => {
const apiKey = env.MINIMAX_API_KEY?.trim();
if (!apiKey) {
throw new Error('MINIMAX_API_KEY is required for MiniMax TTS.');
}
const rawHost = env.MINIMAX_API_HOST?.trim();
const apiHost = (rawHost || DEFAULT_MINIMAX_TTS_API_HOST).replace(/\/+$/, '');
return {
apiKey,
apiHost,
};
};
export const getMiniMaxTtsHttpStatus = (
baseResp?: MiniMaxBaseResponseLike | null,
): number => {
const statusCode = baseResp?.status_code;
const message = baseResp?.status_msg?.toLowerCase() || '';
if (
statusCode === 1004 ||
message.includes('login fail') ||
message.includes('authorization')
) {
return 401;
}
if (
statusCode === 429 ||
message.includes('too many requests') ||
message.includes('rate limit')
) {
return 429;
}
return 502;
};
export const createMiniMaxTtsUrl = (apiHost: string): string => `${apiHost}/v1/t2a_v2`;
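
MiniMax reports failures in a `base_resp` body with vendor status codes; the mapper above translates those into HTTP statuses the client already handles (401 for auth failures, 429 for throttling, 502 for everything else). A standalone sketch, re-inlined from the module above:

```typescript
// Re-inlined from the module above.
const getStatus = (
  baseResp?: { status_code?: number; status_msg?: string } | null,
): number => {
  const statusCode = baseResp?.status_code;
  const message = baseResp?.status_msg?.toLowerCase() || '';
  if (
    statusCode === 1004 ||
    message.includes('login fail') ||
    message.includes('authorization')
  ) {
    return 401;
  }
  if (
    statusCode === 429 ||
    message.includes('too many requests') ||
    message.includes('rate limit')
  ) {
    return 429;
  }
  return 502;
};

const authFailure = getStatus({ status_code: 1004, status_msg: 'login fail' });        // 401
const throttled = getStatus({ status_code: 429, status_msg: 'too many requests' });    // 429
const upstreamError = getStatus({ status_code: 2013, status_msg: 'invalid params' });  // 502
```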

View File

@ -0,0 +1,33 @@
import { describe, expect, it, vi } from 'vitest';
import { createSentenceTranslator } from './providerTranslation';
vi.mock('./geminiTranslation', () => ({
createGeminiSentenceTranslator: vi.fn(() => 'gemini-translator'),
}));
vi.mock('./doubaoTranslation', () => ({
createDoubaoSentenceTranslator: vi.fn(() => 'doubao-translator'),
}));
describe('createSentenceTranslator', () => {
it('returns a doubao translator when provider is doubao', () => {
const translator = createSentenceTranslator({
provider: 'doubao',
apiKey: 'ark-key',
model: 'doubao-seed-2-0-pro-260215',
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
});
expect(translator).toBe('doubao-translator');
});
it('returns a gemini translator when provider is gemini', () => {
const translator = createSentenceTranslator({
provider: 'gemini',
apiKey: 'gemini-key',
model: 'gemini-2.5-flash',
});
expect(translator).toBe('gemini-translator');
});
});

View File

@ -0,0 +1,11 @@
import { createDoubaoSentenceTranslator } from './doubaoTranslation';
import { createGeminiSentenceTranslator } from './geminiTranslation';
import { LlmProviderConfig } from './llmProvider';
export const createSentenceTranslator = (config: LlmProviderConfig) => {
if (config.provider === 'doubao') {
return createDoubaoSentenceTranslator(config);
}
return createGeminiSentenceTranslator(config);
};

View File

@ -0,0 +1,103 @@
import { describe, expect, it, vi } from 'vitest';
import { generateSubtitlePipeline } from './subtitleGeneration';
import { SubtitlePipelineResult } from '../types';
describe('generateSubtitlePipeline', () => {
it('uses the requested provider and video path when building subtitles', async () => {
const subtitleResult: SubtitlePipelineResult = {
subtitles: [],
speakers: [],
quality: 'fallback',
targetLanguage: 'English',
};
const generateSubtitlesFromVideo = vi.fn(async () => subtitleResult);
await generateSubtitlePipeline({
videoPath: 'clip.mp4',
targetLanguage: 'English',
provider: 'gemini',
env: {
GEMINI_API_KEY: 'gemini-key',
ARK_API_KEY: 'ark-key',
},
deps: {
generateSubtitlesFromVideo,
},
});
expect(generateSubtitlesFromVideo).toHaveBeenCalledWith(
expect.objectContaining({
videoPath: 'clip.mp4',
targetLanguage: 'English',
providerConfig: {
provider: 'gemini',
apiKey: 'gemini-key',
model: 'gemini-2.5-flash',
},
}),
);
});
it('falls back to the env default provider when the request omits it', async () => {
const subtitleResult: SubtitlePipelineResult = {
subtitles: [],
speakers: [],
quality: 'fallback',
targetLanguage: 'English',
};
const generateSubtitlesFromVideo = vi.fn(async () => subtitleResult);
await generateSubtitlePipeline({
videoPath: 'clip.mp4',
targetLanguage: 'English',
env: {
DEFAULT_LLM_PROVIDER: 'doubao',
ARK_API_KEY: 'ark-key',
},
deps: {
generateSubtitlesFromVideo,
},
});
expect(generateSubtitlesFromVideo).toHaveBeenCalledWith(
expect.objectContaining({
providerConfig: {
provider: 'doubao',
apiKey: 'ark-key',
model: 'doubao-seed-2-0-pro-260215',
baseUrl: 'https://ark.cn-beijing.volces.com/api/v3/responses',
},
}),
);
});
it('passes fetch implementation into video subtitle generation', async () => {
const subtitleResult: SubtitlePipelineResult = {
subtitles: [],
speakers: [],
quality: 'fallback',
targetLanguage: 'English',
};
const generateSubtitlesFromVideo = vi.fn(async () => subtitleResult);
const fetchImpl = vi.fn<typeof fetch>();
await generateSubtitlePipeline({
videoPath: 'clip.mp4',
targetLanguage: 'English',
provider: 'doubao',
env: {
ARK_API_KEY: 'ark-key',
},
fetchImpl,
deps: {
generateSubtitlesFromVideo,
},
});
expect(generateSubtitlesFromVideo).toHaveBeenCalledWith(
expect.objectContaining({
fetchImpl,
}),
);
});
});

View File

@ -0,0 +1,38 @@
import { resolveAudioPipelineConfig } from './audioPipelineConfig';
import { resolveLlmProviderConfig, normalizeLlmProvider } from './llmProvider';
import { generateSubtitlesFromVideo as defaultGenerateSubtitlesFromVideo } from './videoSubtitleGeneration';
export interface GenerateSubtitlePipelineOptions {
videoPath: string;
targetLanguage: string;
provider?: string | null;
env: NodeJS.ProcessEnv;
fetchImpl?: typeof fetch;
deps?: {
generateSubtitlesFromVideo?: typeof defaultGenerateSubtitlesFromVideo;
};
}
export const generateSubtitlePipeline = async ({
videoPath,
targetLanguage,
provider,
env,
fetchImpl,
deps,
}: GenerateSubtitlePipelineOptions) => {
const audioPipelineConfig = resolveAudioPipelineConfig(env);
const selectedProvider = provider
? normalizeLlmProvider(provider)
: audioPipelineConfig.defaultProvider;
const providerConfig = resolveLlmProviderConfig(selectedProvider, env);
const generateSubtitlesFromVideo =
deps?.generateSubtitlesFromVideo || defaultGenerateSubtitlesFromVideo;
return generateSubtitlesFromVideo({
providerConfig,
videoPath,
targetLanguage,
...(fetchImpl ? { fetchImpl } : {}),
});
};
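A minimal sketch of the provider-selection order used above: an explicit request value wins, otherwise the env default applies. The inlined `normalize` here is an assumption standing in for `normalizeLlmProvider` (which, per the tests elsewhere in this commit, lowercases the value and defaults to `doubao`); it is illustrative only, not the real helper.

```typescript
// Assumed stand-in for normalizeLlmProvider: lowercase, default to 'doubao'.
const normalize = (value?: string | null): 'doubao' | 'gemini' =>
  value?.trim().toLowerCase() === 'gemini' ? 'gemini' : 'doubao';

// Mirrors the selectedProvider logic above: request override beats env default.
const selectProvider = (requested: string | null | undefined, envDefault?: string) =>
  requested ? normalize(requested) : normalize(envDefault);

console.log(selectProvider(null, 'gemini')); // gemini (env default applies)
console.log(selectProvider('Doubao', 'gemini')); // doubao (override wins)
```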

View File

@@ -0,0 +1,115 @@
import { describe, expect, it, vi } from 'vitest';
import { buildSubtitlePayload } from './subtitlePipeline';
describe('buildSubtitlePayload', () => {
it('returns partial quality when diarization is unavailable', async () => {
const result = await buildSubtitlePayload({
alignmentResult: {
words: [
{ text: 'hi', startTime: 0, endTime: 0.2, speakerId: 'unknown', confidence: 0.9 },
],
quality: 'partial',
speakers: [],
duration: 0.2,
alignmentEngine: 'unit-test',
},
targetLanguage: 'en',
translateSentences: async (sentences) =>
sentences.map((sentence) => ({
id: sentence.id,
translatedText: sentence.originalText,
speaker: 'Speaker 1',
})),
});
expect(result.quality).toBe('partial');
expect(result.subtitles).toHaveLength(1);
expect(result.speakers).toEqual([]);
});
it('preserves original text when translation is unavailable', async () => {
const result = await buildSubtitlePayload({
alignmentResult: {
words: [
{ text: 'hello', startTime: 0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'world', startTime: 0.25, endTime: 0.5, speakerId: 'spk_0', confidence: 0.8 },
],
quality: 'full',
speakers: [{ speakerId: 'spk_0', label: 'Speaker 1' }],
duration: 0.5,
alignmentEngine: 'unit-test',
},
targetLanguage: 'zh',
translateSentences: async () => {
throw new Error('translation unavailable');
},
});
expect(result.subtitles[0].translatedText).toBe('hello world');
expect(result.subtitles[0].speakerId).toBe('spk_0');
expect(result.quality).toBe('full');
});
it('throws when translation is unavailable in strict mode', async () => {
await expect(
buildSubtitlePayload({
alignmentResult: {
words: [
{ text: 'hello', startTime: 0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
],
quality: 'full',
speakers: [{ speakerId: 'spk_0', label: 'Speaker 1' }],
duration: 0.2,
alignmentEngine: 'unit-test',
},
targetLanguage: 'zh',
translateSentences: async () => {
throw new Error('translation unavailable');
},
strictTranslation: true,
}),
).rejects.toThrow(/translation unavailable/);
});
it('uses pre-segmented subtitles from the alignment service without rebuilding words', async () => {
const translateSentences = vi.fn();
const result = await buildSubtitlePayload({
alignmentResult: {
words: [],
segments: [
{
id: 'subtitle-1',
originalText: 'Hello there',
translatedText: '你好',
startTime: 1,
endTime: 2.5,
speakerId: 'unknown',
confidence: 0,
words: [],
},
],
quality: 'fallback',
speakers: [],
duration: 2.5,
alignmentEngine: 'video-translation-studio',
},
targetLanguage: 'zh',
translateSentences,
});
expect(translateSentences).not.toHaveBeenCalled();
expect(result.subtitles).toEqual([
expect.objectContaining({
id: 'subtitle-1',
originalText: 'Hello there',
translatedText: '你好',
startTime: 1,
endTime: 2.5,
speakerId: 'unknown',
words: [],
}),
]);
expect(result.quality).toBe('fallback');
});
});

View File

@@ -0,0 +1,180 @@
import { normalizeAlignedSentence } from '../lib/subtitlePipeline';
import { rebuildSentences } from '../lib/alignment/sentenceReconstruction';
import { PipelineQuality, SpeakerTrack, SubtitlePipelineResult, WordTiming } from '../types';
const DEFAULT_VOICE_ID = 'male-qn-qingse';
export interface AlignedSegment {
id: string;
originalText: string;
translatedText?: string;
speakerId: string;
speaker?: string;
startTime: number;
endTime: number;
confidence: number;
words: WordTiming[];
}
export interface AlignmentResult {
words: WordTiming[];
segments?: AlignedSegment[];
quality: PipelineQuality;
speakers: SpeakerTrack[];
sourceLanguage?: string;
duration?: number;
alignmentEngine?: string;
}
export interface TranslationSentenceInput {
id: string;
originalText: string;
speakerId: string;
startTime: number;
endTime: number;
}
export interface TranslationSentenceResult {
id: string;
translatedText: string;
speaker?: string;
}
export interface BuildSubtitlePayloadOptions {
alignmentResult: AlignmentResult;
targetLanguage: string;
strictTranslation?: boolean;
translateSentences: (
sentences: TranslationSentenceInput[],
targetLanguage: string,
) => Promise<TranslationSentenceResult[]>;
}
const createSpeakerLookup = (speakers: SpeakerTrack[]) =>
new Map(speakers.map((speaker) => [speaker.speakerId, speaker.label]));
const buildSegmentPayload = async ({
alignmentResult,
targetLanguage,
strictTranslation = false,
translateSentences,
}: BuildSubtitlePayloadOptions): Promise<SubtitlePipelineResult> => {
const segments = alignmentResult.segments ?? [];
const speakerLookup = createSpeakerLookup(alignmentResult.speakers);
const untranslatedSegments = segments.filter((segment) => !segment.translatedText?.trim());
let translations = new Map<string, TranslationSentenceResult>();
if (untranslatedSegments.length > 0) {
try {
const translatedSegments = await translateSentences(
untranslatedSegments.map((segment) => ({
id: segment.id,
originalText: segment.originalText,
speakerId: segment.speakerId,
startTime: segment.startTime,
endTime: segment.endTime,
})),
targetLanguage,
);
translations = new Map(translatedSegments.map((segment) => [segment.id, segment]));
} catch (error) {
if (strictTranslation) {
throw error;
}
translations = new Map();
}
}
return {
subtitles: segments.map((segment) => {
const translation = translations.get(segment.id);
return normalizeAlignedSentence({
id: segment.id,
speakerId: segment.speakerId,
speaker:
translation?.speaker ??
segment.speaker ??
speakerLookup.get(segment.speakerId) ??
segment.speakerId,
words: segment.words,
startTime: segment.startTime,
endTime: segment.endTime,
originalText: segment.originalText,
translatedText:
translation?.translatedText ??
segment.translatedText ??
segment.originalText,
voiceId: DEFAULT_VOICE_ID,
});
}),
speakers: alignmentResult.speakers,
quality: alignmentResult.quality,
sourceLanguage: alignmentResult.sourceLanguage,
targetLanguage,
duration: alignmentResult.duration,
alignmentEngine: alignmentResult.alignmentEngine,
};
};
export const buildSubtitlePayload = async ({
alignmentResult,
targetLanguage,
strictTranslation = false,
translateSentences,
}: BuildSubtitlePayloadOptions): Promise<SubtitlePipelineResult> => {
if ((alignmentResult.segments?.length ?? 0) > 0) {
return buildSegmentPayload({
alignmentResult,
targetLanguage,
strictTranslation,
translateSentences,
});
}
const sentences = rebuildSentences(alignmentResult.words);
const speakerLookup = createSpeakerLookup(alignmentResult.speakers);
let translations = new Map<string, TranslationSentenceResult>();
try {
const translatedSentences = await translateSentences(
sentences.map((sentence) => ({
id: sentence.id,
originalText: sentence.originalText,
speakerId: sentence.speakerId,
startTime: sentence.startTime,
endTime: sentence.endTime,
})),
targetLanguage,
);
translations = new Map(translatedSentences.map((sentence) => [sentence.id, sentence]));
} catch (error) {
if (strictTranslation) {
throw error;
}
translations = new Map();
}
return {
subtitles: sentences.map((sentence) => {
const translation = translations.get(sentence.id);
return normalizeAlignedSentence({
id: sentence.id,
speakerId: sentence.speakerId,
speaker:
translation?.speaker ??
speakerLookup.get(sentence.speakerId) ??
sentence.speakerId,
words: sentence.words,
originalText: sentence.originalText,
translatedText: translation?.translatedText ?? sentence.originalText,
voiceId: DEFAULT_VOICE_ID,
});
}),
speakers: alignmentResult.speakers,
quality: alignmentResult.quality,
sourceLanguage: alignmentResult.sourceLanguage,
targetLanguage,
duration: alignmentResult.duration,
alignmentEngine: alignmentResult.alignmentEngine,
};
};

View File

@@ -0,0 +1,29 @@
import { describe, expect, it } from 'vitest';
import { parseSubtitleRequest } from './subtitleRequest';
describe('parseSubtitleRequest', () => {
it('defaults provider to doubao', () => {
expect(parseSubtitleRequest({ targetLanguage: 'English' })).toEqual({
provider: 'doubao',
targetLanguage: 'English',
});
});
it('normalizes a valid provider override', () => {
expect(
parseSubtitleRequest({
targetLanguage: 'English',
provider: 'Gemini',
}),
).toEqual({
provider: 'gemini',
targetLanguage: 'English',
});
});
it('rejects an empty target language', () => {
expect(() => parseSubtitleRequest({ targetLanguage: ' ' })).toThrow(
/target language/i,
);
});
});

View File

@@ -0,0 +1,25 @@
import { LlmProvider, normalizeLlmProvider } from './llmProvider';
export interface SubtitleRequestBody {
provider?: string | null;
targetLanguage?: string | null;
}
export interface ParsedSubtitleRequest {
provider: LlmProvider;
targetLanguage: string;
}
export const parseSubtitleRequest = (
body: SubtitleRequestBody,
): ParsedSubtitleRequest => {
const targetLanguage = body.targetLanguage?.trim();
if (!targetLanguage) {
throw new Error('Target language is required.');
}
return {
provider: normalizeLlmProvider(body.provider),
targetLanguage,
};
};

View File

@@ -0,0 +1,247 @@
import fs from 'fs';
import { GoogleGenAI } from '@google/genai';
import { SubtitlePipelineResult } from '../types';
import { DoubaoProviderConfig, GeminiProviderConfig, LlmProviderConfig } from './llmProvider';
interface RawModelSubtitle {
id?: string;
startTime?: number | string;
endTime?: number | string;
originalText?: string;
translatedText?: string;
speaker?: string;
voiceId?: string;
}
interface RawModelResponse {
sourceLanguage?: string;
subtitles?: RawModelSubtitle[];
}
const DEFAULT_VOICE_ID = 'male-qn-qingse';
const SUPPORTED_VOICE_IDS = new Set([
DEFAULT_VOICE_ID,
'female-shaonv',
'female-yujie',
'male-qn-jingying',
'male-qn-badao',
]);
const stripJsonFences = (text: string) => text.replace(/```json\n?|\n?```/g, '').trim();
const extractJson = (text: string): RawModelResponse => {
const cleaned = stripJsonFences(text);
if (!cleaned) {
return { subtitles: [] };
}
const parsed = JSON.parse(cleaned);
if (Array.isArray(parsed)) {
return { subtitles: parsed };
}
return parsed as RawModelResponse;
};
const toSeconds = (value: unknown, fallback: number) => {
const parsed = Number(value);
if (!Number.isFinite(parsed)) {
return fallback;
}
return Math.max(0, parsed);
};
const sanitizeVoiceId = (value: unknown) => {
if (typeof value !== 'string') {
return DEFAULT_VOICE_ID;
}
return SUPPORTED_VOICE_IDS.has(value) ? value : DEFAULT_VOICE_ID;
};
const createPrompt = (targetLanguage: string) => `You are a subtitle localization engine.
Analyze the input video and output STRICT JSON only.
Return an object:
{
"sourceLanguage": "detected language code",
"subtitles": [
{
"id": "segment-1",
"startTime": 0.0,
"endTime": 1.2,
"originalText": "source dialogue",
"translatedText": "translated dialogue in ${targetLanguage}",
"speaker": "short speaker label",
"voiceId": "one of: male-qn-qingse, female-shaonv, female-yujie, male-qn-jingying, male-qn-badao"
}
]
}
Rules:
1. Use video timeline seconds for startTime/endTime.
2. Keep subtitles chronological and non-overlapping.
3. Do not invent dialogue if not audible.
4. translatedText must be in ${targetLanguage}.
5. Do not include markdown. JSON only.`;
const normalizeSubtitles = (raw: RawModelSubtitle[]) => {
let lastEnd = 0;
const subtitles = raw
.map((entry, index) => {
const originalText = (entry.originalText || '').trim();
if (!originalText) {
return null;
}
const startTime = toSeconds(entry.startTime, lastEnd);
const endTime = Math.max(startTime + 0.4, toSeconds(entry.endTime, startTime + 1.2));
lastEnd = endTime;
return {
id: entry.id?.trim() || `segment-${index + 1}`,
startTime,
endTime,
originalText,
translatedText: (entry.translatedText || originalText).trim(),
speaker: (entry.speaker || `Speaker ${index + 1}`).trim(),
speakerId: `speaker-${index + 1}`,
voiceId: sanitizeVoiceId(entry.voiceId),
words: [],
confidence: 0,
};
})
.filter(Boolean);
return subtitles as SubtitlePipelineResult['subtitles'];
};
const extractDoubaoTextOutput = (payload: any): string => {
const output = Array.isArray(payload?.output) ? payload.output : [];
const parts = output.flatMap((item: any) => {
if (!Array.isArray(item?.content)) {
return [];
}
return item.content
.map((part: any) => (typeof part?.text === 'string' ? part.text : ''))
.filter(Boolean);
});
return parts.join('').trim();
};
const generateWithDoubao = async ({
config,
videoDataUrl,
targetLanguage,
fetchImpl = fetch,
}: {
config: DoubaoProviderConfig;
videoDataUrl: string;
targetLanguage: string;
fetchImpl?: typeof fetch;
}) => {
const response = await fetchImpl(config.baseUrl, {
method: 'POST',
headers: {
Authorization: `Bearer ${config.apiKey}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: config.model,
input: [
{
role: 'user',
content: [
{ type: 'input_video', video_url: videoDataUrl },
{ type: 'input_text', text: createPrompt(targetLanguage) },
],
},
],
}),
});
if (!response.ok) {
const payload = await response.text();
throw new Error(`Doubao subtitle request failed (${response.status}): ${payload}`);
}
const payload = await response.json();
const text = extractDoubaoTextOutput(payload);
return extractJson(text);
};
const generateWithGemini = async ({
config,
videoBase64,
targetLanguage,
}: {
config: GeminiProviderConfig;
videoBase64: string;
targetLanguage: string;
}) => {
const ai = new GoogleGenAI({ apiKey: config.apiKey });
const response = await ai.models.generateContent({
model: config.model,
contents: [
{
role: 'user',
parts: [
{
inlineData: {
mimeType: 'video/mp4',
data: videoBase64,
},
},
{ text: createPrompt(targetLanguage) },
],
},
],
});
return extractJson(response.text || '');
};
export const generateSubtitlesFromVideo = async ({
providerConfig,
videoPath,
targetLanguage,
fetchImpl = fetch,
}: {
providerConfig: LlmProviderConfig;
videoPath: string;
targetLanguage: string;
fetchImpl?: typeof fetch;
}): Promise<SubtitlePipelineResult> => {
const videoBuffer = fs.readFileSync(videoPath);
const videoBase64 = videoBuffer.toString('base64');
const videoDataUrl = `data:video/mp4;base64,${videoBase64}`;
const raw =
providerConfig.provider === 'doubao'
? await generateWithDoubao({
config: providerConfig,
videoDataUrl,
targetLanguage,
fetchImpl,
})
: await generateWithGemini({
config: providerConfig,
videoBase64,
targetLanguage,
});
const subtitles = normalizeSubtitles(Array.isArray(raw.subtitles) ? raw.subtitles : []);
return {
subtitles,
speakers: [],
quality: subtitles.length > 0 ? 'full' : 'fallback',
sourceLanguage: raw.sourceLanguage,
targetLanguage,
duration: subtitles.length > 0 ? subtitles[subtitles.length - 1].endTime : 0,
alignmentEngine: `llm-video-${providerConfig.provider}`,
};
};
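A self-contained sketch of the model-output cleanup above: the reply arrives wrapped in a ```json fence, gets stripped before `JSON.parse`, and timestamps are coerced with a 0.4s minimum duration. The coercion here is a simplified approximation of `toSeconds`/`normalizeSubtitles` (it uses `|| fallback` rather than `Number.isFinite`), for illustration only.

```typescript
// Build a fenced model reply without embedding literal triple backticks.
const fence = '`'.repeat(3);
const reply = `${fence}json\n{"sourceLanguage":"en","subtitles":[{"id":"segment-1","startTime":"0","endTime":0.1,"originalText":" hi "}]}\n${fence}`;

// Same shape as stripJsonFences above, with the fence built dynamically.
const stripJsonFences = (text: string) =>
  text.replace(new RegExp(`${fence}json\\n?|\\n?${fence}`, 'g'), '').trim();

const parsed = JSON.parse(stripJsonFences(reply));
const entry = parsed.subtitles[0];

// Simplified timestamp repair: coerce strings, enforce a 0.4s floor.
const startTime = Math.max(0, Number(entry.startTime) || 0);
const endTime = Math.max(startTime + 0.4, Number(entry.endTime) || startTime + 1.2);
console.log(startTime, endTime); // 0 0.4
```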

View File

@@ -0,0 +1,45 @@
// @vitest-environment jsdom
import { describe, expect, it, vi } from 'vitest';
import { generateSubtitlePipeline } from './subtitleService';
describe('generateSubtitlePipeline', () => {
it('posts the selected provider to the server', async () => {
const fetchMock = vi.fn(async () =>
new Response(
JSON.stringify({
subtitles: [],
speakers: [],
quality: 'fallback',
}),
{
status: 200,
headers: {
'Content-Type': 'application/json',
},
},
),
);
await generateSubtitlePipeline(
new File(['video'], 'clip.mp4', { type: 'video/mp4' }),
'English',
'doubao',
null,
fetchMock as unknown as typeof fetch,
);
expect(fetchMock).toHaveBeenCalledWith(
'/api/generate-subtitles',
expect.objectContaining({
method: 'POST',
body: expect.any(FormData),
}),
);
const [, requestInit] = fetchMock.mock.calls[0] as unknown as [string, RequestInit];
const formData = requestInit.body as FormData;
expect(formData.get('targetLanguage')).toBe('English');
expect(formData.get('provider')).toBe('doubao');
});
});

View File

@@ -0,0 +1,81 @@
import { LlmProvider, PipelineQuality, SubtitlePipelineResult } from '../types';
type JsonResponseResult<T> =
| { ok: true; status: number; data: T }
| { ok: false; status: number; error: string };
const normalizePipelineQuality = (value: unknown): PipelineQuality => {
if (value === 'full' || value === 'partial' || value === 'fallback') {
return value;
}
return 'fallback';
};
const readJsonResponseOnce = async <T>(resp: Response): Promise<JsonResponseResult<T>> => {
if (!resp.ok) {
try {
const errorData = await resp.json();
return {
ok: false,
status: resp.status,
error: errorData.error || `HTTP ${resp.status}`,
};
} catch {
return {
ok: false,
status: resp.status,
error: `HTTP ${resp.status}`,
};
}
}
return {
ok: true,
status: resp.status,
data: await resp.json(),
};
};
export const generateSubtitlePipeline = async (
videoFile: File,
targetLanguage: string,
provider: LlmProvider = 'doubao',
trimRange?: { start: number; end: number } | null,
fetchImpl: typeof fetch = fetch,
): Promise<SubtitlePipelineResult> => {
if (!targetLanguage.trim()) {
throw new Error('Target language is required.');
}
const formData = new FormData();
formData.append('video', videoFile);
formData.append('targetLanguage', targetLanguage);
formData.append('provider', provider);
if (trimRange) {
formData.append('trimRange', JSON.stringify(trimRange));
}
const resp = await fetchImpl('/api/generate-subtitles', {
method: 'POST',
body: formData,
});
const parsed = await readJsonResponseOnce<Partial<SubtitlePipelineResult>>(resp);
if (parsed.ok === false) {
const error = new Error(parsed.error);
(error as any).status = resp.status;
throw error;
}
return {
subtitles: Array.isArray(parsed.data.subtitles) ? parsed.data.subtitles : [],
speakers: Array.isArray(parsed.data.speakers) ? parsed.data.speakers : [],
quality: normalizePipelineQuality(parsed.data.quality),
sourceLanguage: parsed.data.sourceLanguage,
targetLanguage: parsed.data.targetLanguage || targetLanguage,
duration:
typeof parsed.data.duration === 'number' ? parsed.data.duration : undefined,
alignmentEngine: parsed.data.alignmentEngine,
};
};

View File

@@ -0,0 +1,52 @@
// @vitest-environment jsdom
import { beforeEach, describe, expect, it, vi } from 'vitest';
import { generateTTS } from './ttsService';
function createSingleReadJsonResponse(body: unknown, status: number) {
let hasBeenRead = false;
const jsonMock = vi.fn(async () => {
if (hasBeenRead) {
throw new TypeError("Failed to execute 'json' on 'Response': body stream already read");
}
hasBeenRead = true;
return body;
});
return {
jsonMock,
response: {
ok: status >= 200 && status < 300,
status,
json: jsonMock,
} as unknown as Response,
};
}
describe('ttsService', () => {
beforeEach(() => {
vi.restoreAllMocks();
vi.stubGlobal('fetch', vi.fn());
});
it('returns a MiniMax mp3 data url when the request succeeds', async () => {
vi.mocked(fetch).mockResolvedValue({
ok: true,
status: 200,
json: async () => ({ audio: 'AAAA' }),
} as Response);
const result = await generateTTS('hello world');
expect(result).toBe('data:audio/mp3;base64,AAAA');
});
it('reads a failed MiniMax response body only once and surfaces the error', async () => {
const { response, jsonMock } = createSingleReadJsonResponse({ error: 'login fail' }, 401);
vi.mocked(fetch).mockResolvedValue(response);
await expect(generateTTS('hello world')).rejects.toThrow(/login fail/i);
expect(jsonMock).toHaveBeenCalledTimes(1);
});
});

View File

@@ -0,0 +1,86 @@
type JsonResponseResult<T> =
| { ok: true; status: number; data: T }
| { ok: false; status: number; error: string };
const isTransientStatus = (status?: number) =>
status === 429 || status === 500 || status === 502 || status === 503 || status === 504;
const withRetry = async <T>(
fn: () => Promise<T>,
maxRetries = 4,
delay = 1500,
): Promise<T> => {
let lastError: unknown;
for (let attempt = 0; attempt < maxRetries; attempt += 1) {
try {
return await fn();
} catch (error) {
lastError = error;
const status = Number((error as any)?.status) || undefined;
if (!isTransientStatus(status) || attempt === maxRetries - 1) {
throw error;
}
await new Promise((resolve) => setTimeout(resolve, delay));
delay *= 2;
}
}
throw lastError;
};
const readJsonResponseOnce = async <T>(resp: Response): Promise<JsonResponseResult<T>> => {
if (!resp.ok) {
try {
const errorData = await resp.json();
return {
ok: false,
status: resp.status,
error: errorData.error || `HTTP ${resp.status}`,
};
} catch {
return {
ok: false,
status: resp.status,
error: `HTTP ${resp.status}`,
};
}
}
return {
ok: true,
status: resp.status,
data: await resp.json(),
};
};
export const generateTTS = async (
text: string,
voiceId: string = 'male-qn-qingse',
): Promise<string> => {
if (!text || text.trim().length === 0) {
throw new Error('Text is empty, cannot generate TTS');
}
const audio = await withRetry(async () => {
const resp = await fetch('/api/tts', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({ text, voiceId }),
});
const parsed = await readJsonResponseOnce<{ audio: string }>(resp);
if (parsed.ok === false) {
const error = new Error(parsed.error);
(error as any).status = resp.status;
throw error;
}
return parsed.data.audio;
});
return `data:audio/mp3;base64,${audio}`;
};
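A sketch exercising the same retry shape as `withRetry` above: four attempts, delay doubling from 1500ms, retrying only on transient statuses. The real `setTimeout` wait is replaced with a recorded delay so the schedule is observable; this is a test harness for the pattern, not the production helper.

```typescript
type TransientError = { status: number };
const isTransient = (s?: number) =>
  s === 429 || s === 500 || s === 502 || s === 503 || s === 504;

const delays: number[] = [];
const retry = async <T>(fn: () => Promise<T>, maxRetries = 4, delay = 1500): Promise<T> => {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxRetries; attempt += 1) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (!isTransient((error as TransientError).status) || attempt === maxRetries - 1) {
        throw error;
      }
      delays.push(delay); // stand-in for the real setTimeout backoff wait
      delay *= 2;
    }
  }
  throw lastError;
};

let calls = 0;
await retry(async () => {
  calls += 1;
  // Fail twice with a retryable 503, then succeed on the third attempt.
  if (calls < 3) throw Object.assign(new Error('busy'), { status: 503 });
  return 'ok';
});
console.log(calls, delays); // 3 [ 1500, 3000 ]
```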

1
src/test/setup.ts Normal file
View File

@@ -0,0 +1 @@
import '@testing-library/jest-dom/vitest';

7
src/test/smoke.test.ts Normal file
View File

@@ -0,0 +1,7 @@
import { describe, expect, it } from 'vitest';
describe('test harness', () => {
it('runs vitest in this workspace', () => {
expect(true).toBe(true);
});
});

65
src/types.ts Normal file
View File

@@ -0,0 +1,65 @@
export interface Subtitle {
id: string;
startTime: number;
endTime: number;
originalText: string;
translatedText: string;
speaker: string;
speakerId?: string;
words?: WordTiming[];
confidence?: number;
voiceId: string;
audioUrl?: string;
volume?: number;
age?: string;
voiceCharacteristics?: string;
emotion?: string;
}
export type LlmProvider = 'doubao' | 'gemini';
export type PipelineQuality = 'full' | 'partial' | 'fallback';
export interface WordTiming {
text: string;
startTime: number;
endTime: number;
speakerId: string;
confidence: number;
}
export interface SpeakerTrack {
speakerId: string;
label: string;
gender?: 'male' | 'female' | 'unknown';
}
export interface SubtitlePipelineResult {
subtitles: Subtitle[];
speakers: SpeakerTrack[];
quality: PipelineQuality;
sourceLanguage?: string;
targetLanguage?: string;
duration?: number;
alignmentEngine?: string;
}
export interface TextStyles {
fontFamily: string;
fontSize: number;
color: string;
backgroundColor: string;
alignment: 'left' | 'center' | 'right';
isBold: boolean;
isItalic: boolean;
isUnderline: boolean;
}
export interface Voice {
id: string;
name: string;
tag: string;
avatar: string;
gender: 'male' | 'female' | 'neutral';
language: string;
}

117
src/voices.ts Normal file
View File

@@ -0,0 +1,117 @@
import { Voice } from './types';
export const MINIMAX_VOICES: Voice[] = [
// Chinese (Mandarin)
{ id: 'male-qn-qingse', name: '青涩青年音色', tag: 'ProVoice', avatar: 'male-1', gender: 'male', language: 'zh' },
{ id: 'male-qn-jingying', name: '精英青年音色', tag: 'ProVoice', avatar: 'male-2', gender: 'male', language: 'zh' },
{ id: 'male-qn-badao', name: '霸道青年音色', tag: 'ProVoice', avatar: 'male-3', gender: 'male', language: 'zh' },
{ id: 'male-qn-daxuesheng', name: '青年大学生音色', tag: 'ProVoice', avatar: 'male-4', gender: 'male', language: 'zh' },
{ id: 'female-shaonv', name: '少女音色', tag: 'ProVoice', avatar: 'female-1', gender: 'female', language: 'zh' },
{ id: 'female-yujie', name: '御姐音色', tag: 'ProVoice', avatar: 'female-2', gender: 'female', language: 'zh' },
{ id: 'female-chengshu', name: '成熟女性音色', tag: 'ProVoice', avatar: 'female-3', gender: 'female', language: 'zh' },
{ id: 'female-tianmei', name: '甜美女性音色', tag: 'ProVoice', avatar: 'female-4', gender: 'female', language: 'zh' },
{ id: 'male-qn-qingse-jingpin', name: '青涩青年音色-beta', tag: 'Beta', avatar: 'male-5', gender: 'male', language: 'zh' },
{ id: 'male-qn-jingying-jingpin', name: '精英青年音色-beta', tag: 'Beta', avatar: 'male-6', gender: 'male', language: 'zh' },
{ id: 'male-qn-badao-jingpin', name: '霸道青年音色-beta', tag: 'Beta', avatar: 'male-7', gender: 'male', language: 'zh' },
{ id: 'male-qn-daxuesheng-jingpin', name: '青年大学生音色-beta', tag: 'Beta', avatar: 'male-8', gender: 'male', language: 'zh' },
{ id: 'female-shaonv-jingpin', name: '少女音色-beta', tag: 'Beta', avatar: 'female-5', gender: 'female', language: 'zh' },
{ id: 'female-yujie-jingpin', name: '御姐音色-beta', tag: 'Beta', avatar: 'female-6', gender: 'female', language: 'zh' },
{ id: 'female-chengshu-jingpin', name: '成熟女性音色-beta', tag: 'Beta', avatar: 'female-7', gender: 'female', language: 'zh' },
{ id: 'female-tianmei-jingpin', name: '甜美女性音色-beta', tag: 'Beta', avatar: 'female-8', gender: 'female', language: 'zh' },
{ id: 'clever_boy', name: '聪明男童', tag: 'Child', avatar: 'boy-1', gender: 'male', language: 'zh' },
{ id: 'cute_boy', name: '可爱男童', tag: 'Child', avatar: 'boy-2', gender: 'male', language: 'zh' },
{ id: 'lovely_girl', name: '萌萌女童', tag: 'Child', avatar: 'girl-1', gender: 'female', language: 'zh' },
{ id: 'cartoon_pig', name: '卡通猪小琪', tag: 'Cartoon', avatar: 'pig-1', gender: 'neutral', language: 'zh' },
{ id: 'bingjiao_didi', name: '病娇弟弟', tag: 'Character', avatar: 'male-9', gender: 'male', language: 'zh' },
{ id: 'junlang_nanyou', name: '俊朗男友', tag: 'Character', avatar: 'male-10', gender: 'male', language: 'zh' },
{ id: 'chunzhen_xuedi', name: '纯真学弟', tag: 'Character', avatar: 'male-11', gender: 'male', language: 'zh' },
{ id: 'lengdan_xiongzhang', name: '冷淡学长', tag: 'Character', avatar: 'male-12', gender: 'male', language: 'zh' },
{ id: 'badao_shaoye', name: '霸道少爷', tag: 'Character', avatar: 'male-13', gender: 'male', language: 'zh' },
{ id: 'tianxin_xiaoling', name: '甜心小玲', tag: 'Character', avatar: 'female-9', gender: 'female', language: 'zh' },
{ id: 'qiaopi_mengmei', name: '俏皮萌妹', tag: 'Character', avatar: 'female-10', gender: 'female', language: 'zh' },
{ id: 'wumei_yujie', name: '妩媚御姐', tag: 'Character', avatar: 'female-11', gender: 'female', language: 'zh' },
{ id: 'diadia_xuemei', name: '嗲嗲学妹', tag: 'Character', avatar: 'female-12', gender: 'female', language: 'zh' },
{ id: 'danya_xuejie', name: '淡雅学姐', tag: 'Character', avatar: 'female-13', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Reliable_Executive', name: '沉稳高管', tag: 'Professional', avatar: 'male-14', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_News_Anchor', name: '新闻女声', tag: 'News', avatar: 'female-14', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Mature_Woman', name: '傲娇御姐', tag: 'Character', avatar: 'female-15', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Unrestrained_Young_Man', name: '不羁青年', tag: 'Character', avatar: 'male-15', gender: 'male', language: 'zh' },
{ id: 'Arrogant_Miss', name: '嚣张小姐', tag: 'Character', avatar: 'female-16', gender: 'female', language: 'zh' },
{ id: 'Robot_Armor', name: '机械战甲', tag: 'Robot', avatar: 'robot-1', gender: 'neutral', language: 'zh' },
{ id: 'Chinese (Mandarin)_Kind-hearted_Antie', name: '热心大婶', tag: 'Elder', avatar: 'female-17', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_HK_Flight_Attendant', name: '港普空姐', tag: 'Professional', avatar: 'female-18', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Humorous_Elder', name: '搞笑大爷', tag: 'Elder', avatar: 'male-16', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Gentleman', name: '温润男声', tag: 'Professional', avatar: 'male-17', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Warm_Bestie', name: '温暖闺蜜', tag: 'Friend', avatar: 'female-19', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Male_Announcer', name: '播报男声', tag: 'News', avatar: 'male-18', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Sweet_Lady', name: '甜美女声', tag: 'Sweet', avatar: 'female-20', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Southern_Young_Man', name: '南方小哥', tag: 'Regional', avatar: 'male-19', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Wise_Women', name: '阅历姐姐', tag: 'Wise', avatar: 'female-21', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Gentle_Youth', name: '温润青年', tag: 'Youth', avatar: 'male-20', gender: 'neutral', language: 'zh' },
{ id: 'Chinese (Mandarin)_Warm_Girl', name: '温暖少女', tag: 'Warm', avatar: 'female-22', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Kind-hearted_Elder', name: '花甲奶奶', tag: 'Elder', avatar: 'female-23', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Cute_Spirit', name: '憨憨萌兽', tag: 'Cartoon', avatar: 'spirit-1', gender: 'neutral', language: 'zh' },
{ id: 'Chinese (Mandarin)_Radio_Host', name: '电台男主播', tag: 'News', avatar: 'male-21', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Lyrical_Voice', name: '抒情男声', tag: 'Lyrical', avatar: 'male-22', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Straightforward_Boy', name: '率真弟弟', tag: 'Youth', avatar: 'male-23', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Sincere_Adult', name: '真诚青年', tag: 'Sincere', avatar: 'male-24', gender: 'neutral', language: 'zh' },
{ id: 'Chinese (Mandarin)_Gentle_Senior', name: '温柔学姐', tag: 'Youth', avatar: 'female-24', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Stubborn_Friend', name: '嘴硬竹马', tag: 'Friend', avatar: 'male-25', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Crisp_Girl', name: '清脆少女', tag: 'Youth', avatar: 'female-25', gender: 'female', language: 'zh' },
{ id: 'Chinese (Mandarin)_Pure-hearted_Boy', name: '清澈邻家弟弟', tag: 'Youth', avatar: 'male-26', gender: 'male', language: 'zh' },
{ id: 'Chinese (Mandarin)_Soft_Girl', name: '柔和少女', tag: 'Youth', avatar: 'female-26', gender: 'female', language: 'zh' },
// Chinese (Cantonese)
{ id: 'Cantonese_ProfessionalHostF)', name: '专业女主持', tag: 'Cantonese', avatar: 'female-27', gender: 'female', language: 'yue' },
{ id: 'Cantonese_GentleLady', name: '温柔女声', tag: 'Cantonese', avatar: 'female-28', gender: 'female', language: 'yue' },
{ id: 'Cantonese_ProfessionalHostM)', name: '专业男主持', tag: 'Cantonese', avatar: 'male-27', gender: 'male', language: 'yue' },
{ id: 'Cantonese_PlayfulMan', name: '活泼男声', tag: 'Cantonese', avatar: 'male-28', gender: 'male', language: 'yue' },
{ id: 'Cantonese_CuteGirl', name: '可爱女孩', tag: 'Cantonese', avatar: 'female-29', gender: 'female', language: 'yue' },
{ id: 'Cantonese_KindWoman', name: '善良女声', tag: 'Cantonese', avatar: 'female-30', gender: 'female', language: 'yue' },
// English
{ id: 'Santa_Claus', name: '圣诞老人', tag: 'English', avatar: 'male-29', gender: 'male', language: 'en' },
{ id: 'Grinch', name: '格林奇', tag: 'English', avatar: 'male-30', gender: 'male', language: 'en' },
{ id: 'Rudolph', name: '鲁道夫', tag: 'English', avatar: 'male-31', gender: 'male', language: 'en' },
{ id: 'Arnold', name: '阿诺德', tag: 'English', avatar: 'male-32', gender: 'male', language: 'en' },
{ id: 'Charming_Santa', name: '迷人圣诞老人', tag: 'English', avatar: 'male-33', gender: 'male', language: 'en' },
{ id: 'Charming_Lady', name: '迷人女士', tag: 'English', avatar: 'female-31', gender: 'female', language: 'en' },
{ id: 'Sweet_Girl', name: '甜美女孩', tag: 'English', avatar: 'female-32', gender: 'female', language: 'en' },
{ id: 'Cute_Elf', name: '可爱精灵', tag: 'English', avatar: 'elf-1', gender: 'neutral', language: 'en' },
{ id: 'Attractive_Girl', name: '迷人女孩', tag: 'English', avatar: 'female-33', gender: 'female', language: 'en' },
{ id: 'Serene_Woman', name: '沉静女士', tag: 'English', avatar: 'female-34', gender: 'female', language: 'en' },
{ id: 'English_Trustworthy_Man', name: '可靠男声', tag: 'English', avatar: 'male-34', gender: 'male', language: 'en' },
{ id: 'English_Graceful_Lady', name: '优雅女士', tag: 'English', avatar: 'female-35', gender: 'female', language: 'en' },
{ id: 'English_Aussie_Bloke', name: '澳洲小哥', tag: 'English', avatar: 'male-35', gender: 'male', language: 'en' },
{ id: 'English_Whispering_girl', name: '耳语女孩', tag: 'English', avatar: 'female-36', gender: 'female', language: 'en' },
{ id: 'English_Diligent_Man', name: '勤勉男声', tag: 'English', avatar: 'male-36', gender: 'male', language: 'en' },
{ id: 'English_Gentle-voiced_man', name: '温柔男声', tag: 'English', avatar: 'male-37', gender: 'male', language: 'en' },
// French
{ id: 'French_Male_Speech_New', name: '沉稳男声', tag: 'French', avatar: 'male-38', gender: 'male', language: 'fr' },
{ id: 'French_Female_News Anchor', name: '耐心女主持', tag: 'French', avatar: 'female-37', gender: 'female', language: 'fr' },
{ id: 'French_CasualMan', name: '随性男声', tag: 'French', avatar: 'male-39', gender: 'male', language: 'fr' },
{ id: 'French_MovieLeadFemale', name: '电影女主角', tag: 'French', avatar: 'female-38', gender: 'female', language: 'fr' },
{ id: 'French_FemaleAnchor', name: '女主播', tag: 'French', avatar: 'female-39', gender: 'female', language: 'fr' },
{ id: 'French_MaleNarrator', name: '男旁白', tag: 'French', avatar: 'male-40', gender: 'male', language: 'fr' },
// Indonesian
{ id: 'Indonesian_SweetGirl', name: '甜美女孩', tag: 'Indonesian', avatar: 'female-40', gender: 'female', language: 'id' },
{ id: 'Indonesian_ReservedYoungMan', name: '内敛青年男声', tag: 'Indonesian', avatar: 'male-41', gender: 'male', language: 'id' },
{ id: 'Indonesian_CharmingGirl', name: '迷人女孩', tag: 'Indonesian', avatar: 'female-41', gender: 'female', language: 'id' },
{ id: 'Indonesian_CalmWoman', name: '沉静女士', tag: 'Indonesian', avatar: 'female-42', gender: 'female', language: 'id' },
{ id: 'Indonesian_ConfidentWoman', name: '自信女士', tag: 'Indonesian', avatar: 'female-43', gender: 'female', language: 'id' },
{ id: 'Indonesian_CaringMan', name: '暖心男声', tag: 'Indonesian', avatar: 'male-42', gender: 'male', language: 'id' },
{ id: 'Indonesian_BossyLeader', name: '强势领导', tag: 'Indonesian', avatar: 'leader-1', gender: 'neutral', language: 'id' },
{ id: 'Indonesian_DeterminedBoy', name: '坚定男孩', tag: 'Indonesian', avatar: 'boy-3', gender: 'male', language: 'id' },
{ id: 'Indonesian_GentleGirl', name: '温柔女孩', tag: 'Indonesian', avatar: 'female-44', gender: 'female', language: 'id' },
// German
{ id: 'German_FriendlyMan', name: '友善男声', tag: 'German', avatar: 'male-43', gender: 'male', language: 'de' },
{ id: 'German_SweetLady', name: '甜美女士', tag: 'German', avatar: 'female-45', gender: 'female', language: 'de' },
{ id: 'German_PlayfulMan', name: '活泼男声', tag: 'German', avatar: 'male-44', gender: 'male', language: 'de' },
// Filipino
{ id: 'Filipino_male_1_v1', name: 'Cheerful Man', tag: 'Filipino', avatar: 'male-45', gender: 'male', language: 'fil' },
{ id: 'Filipino_female_1_v1', name: 'Gentle Woman', tag: 'Filipino', avatar: 'female-46', gender: 'female', language: 'fil' },
];
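The catalogue above is a flat array keyed by `language` and `gender`, so voice selection reduces to a filtered lookup. A minimal sketch of how such a lookup could work — the `VoiceOption` interface and `pickVoice` helper are illustrative assumptions, not part of this commit:

```typescript
// Shape matching the entries in the catalogue above.
interface VoiceOption {
  id: string;
  name: string;
  tag: string;
  avatar: string;
  gender: 'male' | 'female' | 'neutral';
  language: string;
}

// Hypothetical helper: return the first voice matching the target language
// (and optionally a gender); fall back to the first English voice so the
// caller always gets a usable id for the TTS request.
function pickVoice(
  voices: VoiceOption[],
  language: string,
  gender?: VoiceOption['gender'],
): VoiceOption | undefined {
  return (
    voices.find(
      (v) => v.language === language && (gender === undefined || v.gender === gender),
    ) ?? voices.find((v) => v.language === 'en')
  );
}
```

The fallback matters because the catalogue covers only a handful of languages; a translation target outside the list still needs some voice id to send to the TTS endpoint.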

4
start-dev.cmd Normal file

@@ -0,0 +1,4 @@
@echo off
setlocal
cd /d "%~dp0"
node ".\node_modules\tsx\dist\cli.mjs" server.ts

BIN
tmp_export_bgm.mp3 Normal file

Binary file not shown.

BIN
tmp_export_test.mp4 Normal file

Binary file not shown.

BIN
tmp_silence.wav Normal file

Binary file not shown.

26
tsconfig.json Normal file

@@ -0,0 +1,26 @@
{
"compilerOptions": {
"target": "ES2022",
"experimentalDecorators": true,
"useDefineForClassFields": false,
"module": "ESNext",
"lib": [
"ES2022",
"DOM",
"DOM.Iterable"
],
"skipLibCheck": true,
"moduleResolution": "bundler",
"isolatedModules": true,
"moduleDetection": "force",
"allowJs": true,
"jsx": "react-jsx",
"paths": {
"@/*": [
"./*"
]
},
"allowImportingTsExtensions": true,
"noEmit": true
}
}

24
vite.config.ts Normal file

@@ -0,0 +1,24 @@
import tailwindcss from '@tailwindcss/vite';
import react from '@vitejs/plugin-react';
import path from 'path';
import {defineConfig, loadEnv} from 'vite';
export default defineConfig(({mode}) => {
const env = loadEnv(mode, '.', '');
return {
plugins: [react(), tailwindcss()],
define: {
'process.env.GEMINI_API_KEY': JSON.stringify(env.GEMINI_API_KEY),
},
resolve: {
alias: {
'@': path.resolve(__dirname, '.'),
},
},
server: {
// HMR is disabled in AI Studio via DISABLE_HMR env var.
// Do not modify—file watching is disabled to prevent flickering during agent edits.
hmr: process.env.DISABLE_HMR !== 'true',
},
};
});
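The `define` entry in vite.config.ts is a compile-time text substitution: Vite replaces each literal occurrence of `process.env.GEMINI_API_KEY` in the client bundle with the JSON-stringified value loaded from `.env`, so the browser code never needs a Node `process` global. A sketch of how client code might consume it — the `getGeminiKey` helper is an illustrative assumption, not a file in this repo:

```typescript
// Illustrative helper (not in the repo): read the statically replaced value.
// After bundling, `process.env.GEMINI_API_KEY` below is a plain string
// literal, so the check runs fine in the browser.
function getGeminiKey(): string {
  const key = process.env.GEMINI_API_KEY;
  if (!key) {
    throw new Error('GEMINI_API_KEY is not set; copy .env.example to .env');
  }
  return key;
}
```

Because the substitution happens at build time, the key is baked into the shipped bundle; that is acceptable for an AI Studio applet but worth keeping in mind before deploying the bundle publicly.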

8
vitest.config.ts Normal file

@@ -0,0 +1,8 @@
import { defineConfig } from 'vitest/config';
export default defineConfig({
test: {
environment: 'node',
setupFiles: ['./src/test/setup.ts'],
},
});