# Doubao LLM Provider Design

**Date:** 2026-03-17

**Goal:** Add a user-visible LLM switcher so subtitle generation can use either Doubao or Gemini, default to Doubao, and keep TTS fixed on MiniMax.

## Current State

The current project is effectively Gemini-only for subtitle generation and translation.

1. `src/services/geminiService.ts` calls Gemini directly from the browser for subtitle generation and Gemini fallback TTS.
2. `src/server/geminiTranslation.ts` translates sentence text on the server with Gemini.
3. `src/server/audioPipelineConfig.ts` only validates `GEMINI_API_KEY`.
4. `src/components/EditorScreen.tsx` imports a Gemini-specific service and has no model selector.
5. MiniMax is already independent and used only for TTS through `/api/tts`.

This makes provider switching hard because the LLM choice is not isolated behind a shared contract.

## Product Requirements

1. The editor must show a visible LLM selector.
2. Available LLM options are `Doubao` and `Gemini`.
3. The default LLM must be `Doubao`.
4. TTS must remain fixed to MiniMax and must not participate in provider switching.
5. API keys must only come from `.env`.
6. The app must not silently fall back from one LLM provider to the other.

## Chosen Approach

Use a server-side provider abstraction for subtitle generation and translation, with a frontend selector that passes the chosen provider to the server.

This approach keeps secrets on the server, avoids browser-side provider drift, and gives the project one place to add or change LLM providers later.

## Why This Approach

### Option A: Server-side provider abstraction with frontend selector

Recommended.

1. Frontend sends `provider: 'doubao' | 'gemini'`.
2. Server reads the matching API key from `.env`.
3. Server routes subtitle text generation through a provider adapter.
4. Time-critical audio extraction and timeline logic stay outside the provider-specific layer.

Pros:

1. Keeps API keys off the client.
2. Produces one consistent API contract for the editor.
3. Makes default-provider behavior easy to enforce.
4. Prevents Gemini-specific code from leaking further into the app.

Cons:

1. Requires moving browser-side subtitle generation behavior into a server-owned path.
2. Touches both frontend and backend.

### Option B: Keep Gemini in the browser and add Doubao as a separate server path

Rejected.

Pros:

1. Faster initial implementation.

Cons:

1. Two subtitle-generation architectures would coexist.
2. Provider behavior would drift over time.
3. It violates the requirement that keys come only from `.env`, because the browser-side Gemini path still needs a key on the client.

### Option C: Client-side provider switching

Rejected.

Pros:

1. Minimal backend work.

Cons:

1. Exposes secrets to the browser.
2. Conflicts with the `.env`-only requirement.

## Architecture

### Frontend

The editor adds an `LLM` selector with the values:

1. `Doubao`
2. `Gemini`

The default selected value is `Doubao`.

When the user clicks subtitle generation, the frontend sends:

1. the uploaded video
2. the target language
3. the selected LLM provider
4. optional trim metadata if the current flow needs it

The frontend no longer needs to know how Gemini or Doubao are called. It only consumes a normalized subtitle payload.

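The selector state and the non-file request fields could be sketched as follows. This is only an illustration: `LLM_OPTIONS`, `DEFAULT_LLM_PROVIDER`, and `buildSubtitleRequestFields` are hypothetical names, and encoding `trimRange` as a JSON string with seconds-based `start` and `end` is an assumption. The video file would be attached to the request body separately.

```typescript
// Hypothetical names; the design only fixes the two provider values and the default.
type LlmProvider = 'doubao' | 'gemini';

interface LlmOption {
  value: LlmProvider;
  label: string;
}

const LLM_OPTIONS: readonly LlmOption[] = [
  { value: 'doubao', label: 'Doubao' },
  { value: 'gemini', label: 'Gemini' },
];

const DEFAULT_LLM_PROVIDER: LlmProvider = 'doubao';

interface TrimRange {
  start: number; // seconds (assumed unit)
  end: number;
}

// Collects the text fields for one generation request.
function buildSubtitleRequestFields(
  targetLanguage: string,
  provider: LlmProvider = DEFAULT_LLM_PROVIDER,
  trimRange?: TrimRange,
): Record<string, string> {
  const fields: Record<string, string> = { targetLanguage, provider };
  if (trimRange !== undefined) {
    // Assumed encoding: trim metadata travels as a JSON string.
    fields['trimRange'] = JSON.stringify(trimRange);
  }
  return fields;
}
```

Because the provider is read at call time, changing the selector affects the next request only and leaves earlier results untouched.
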
### Server

The server becomes the single owner of LLM subtitle generation.

Responsibilities:

1. validate the incoming provider
2. read provider credentials from `.env`
3. extract audio and prepare subtitle-generation inputs
4. call the chosen provider adapter
5. normalize the result into the existing subtitle shape

### Provider Layer

Create a provider abstraction around LLM calls:

1. `resolveLlmProvider(provider, env)`
2. `geminiProvider`
3. `doubaoProvider`

Each provider must accept the same logical input and return the same logical output so the rest of the app is provider-agnostic.

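One possible shape for this layer is sketched below. The input and output interfaces are assumptions, since the real shapes come from the existing subtitle pipeline, and the third `factories` parameter is a hypothetical addition that stands in for the real `doubaoProvider` and `geminiProvider` modules so the example is self-contained.

```typescript
type LlmProvider = 'doubao' | 'gemini';

// Assumed logical input and output; the real shapes come from the existing pipeline.
interface LlmSubtitleInput {
  targetLanguage: string;
  sentences: string[];
}

interface LlmSubtitleOutput {
  translations: string[];
}

// Every adapter exposes the same method so callers stay provider-agnostic.
interface LlmProviderAdapter {
  readonly name: LlmProvider;
  generate(input: LlmSubtitleInput): Promise<LlmSubtitleOutput>;
}

type Env = Record<string, string | undefined>;
type AdapterFactory = (apiKey: string) => LlmProviderAdapter;

const KEY_VARS: Record<LlmProvider, string> = {
  doubao: 'ARK_API_KEY',
  gemini: 'GEMINI_API_KEY',
};

function isLlmProvider(value: unknown): value is LlmProvider {
  return value === 'doubao' || value === 'gemini';
}

// `factories` is a hypothetical parameter, not part of the documented signature.
function resolveLlmProvider(
  provider: unknown,
  env: Env,
  factories: Record<LlmProvider, AdapterFactory>,
): LlmProviderAdapter {
  const chosen = provider === undefined ? 'doubao' : provider;
  if (!isLlmProvider(chosen)) {
    throw new Error(`Unsupported LLM provider: ${String(chosen)}`);
  }
  const keyVar = KEY_VARS[chosen];
  const apiKey = env[keyVar];
  if (!apiKey) {
    // No fallback: a missing key for the chosen provider is an error.
    throw new Error(`${keyVar} is not set for provider "${chosen}"`);
  }
  return factories[chosen](apiKey);
}
```

Passing `env` in as a plain record rather than reading a global keeps the resolver deterministic, which suits the unit tests listed under Testing Strategy.
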
## API Design

Add a dedicated subtitle-generation endpoint rather than overloading the existing audio-extraction endpoint.

### Request

`POST /api/generate-subtitles`

Multipart or JSON payload fields:

1. `video`
2. `targetLanguage`
3. `provider`
4. optional `trimRange`

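Validation of the non-file fields might look like the sketch below. The function names are hypothetical, and several details are assumptions rather than decisions from this design: a missing `provider` falls back to `doubao`, `trimRange` is an object or JSON string with `start` and `end` in seconds, and every validation failure maps to `400`.

```typescript
type LlmProvider = 'doubao' | 'gemini';

interface TrimRange {
  start: number; // seconds (assumed unit)
  end: number;
}

interface GenerateSubtitlesFields {
  targetLanguage: string;
  provider: LlmProvider;
  trimRange?: TrimRange;
}

type ParseResult =
  | { ok: true; value: GenerateSubtitlesFields }
  | { ok: false; status: 400; error: string };

function fail(error: string): ParseResult {
  return { ok: false, status: 400, error };
}

function isLlmProvider(value: unknown): value is LlmProvider {
  return value === 'doubao' || value === 'gemini';
}

// Accepts an object or a JSON string, since multipart fields arrive as text.
function parseTrimRange(raw: unknown): TrimRange | null {
  let candidate: unknown = raw;
  if (typeof raw === 'string') {
    try {
      candidate = JSON.parse(raw);
    } catch {
      return null;
    }
  }
  if (typeof candidate !== 'object' || candidate === null) return null;
  const { start, end } = candidate as { start?: unknown; end?: unknown };
  if (typeof start !== 'number' || typeof end !== 'number') return null;
  return start >= 0 && end > start ? { start, end } : null;
}

function parseGenerateSubtitlesFields(raw: Record<string, unknown>): ParseResult {
  const targetLanguage = raw['targetLanguage'];
  if (typeof targetLanguage !== 'string' || targetLanguage.trim() === '') {
    return fail('targetLanguage is required');
  }
  const provider = raw['provider'] ?? 'doubao'; // assumed default when omitted
  if (!isLlmProvider(provider)) {
    return fail('provider must be "doubao" or "gemini"');
  }
  const value: GenerateSubtitlesFields = { targetLanguage, provider };
  const rawTrim = raw['trimRange'];
  if (rawTrim !== undefined && rawTrim !== '') {
    const trimRange = parseTrimRange(rawTrim);
    if (trimRange === null) return fail('trimRange must be {start, end} with 0 <= start < end');
    value.trimRange = trimRange;
  }
  return { ok: true, value };
}
```
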
### Response

Return the same normalized subtitle structure the editor already understands.

At minimum each subtitle object should include:

1. `id`
2. `startTime`
3. `endTime`
4. `originalText`
5. `translatedText`
6. `speaker`
7. `voiceId`
8. `volume`

If richer timeline metadata already exists in the current server subtitle pipeline, keep it in the response rather than trimming it away.

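As a rough illustration of the minimum shape, the sketch below types the eight fields and passes any extra metadata through untouched. The field types, the time unit, and every fallback value (for example a `volume` of `1` and an empty `voiceId`) are assumptions; the existing editor types should take precedence where they differ.

```typescript
// Assumed field types; times are taken to be seconds.
interface Subtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker: string;
  voiceId: string;
  volume: number;
  [extra: string]: unknown; // richer timeline metadata is preserved as-is
}

function asString(value: unknown, fallback: string): string {
  return typeof value === 'string' ? value : fallback;
}

function asNumber(value: unknown, fallback: number): number {
  return typeof value === 'number' && Number.isFinite(value) ? value : fallback;
}

// Hypothetical normalizer: fills missing required fields and keeps the rest.
function normalizeSubtitle(raw: Record<string, unknown>, index: number): Subtitle {
  const startTime = asNumber(raw['startTime'], 0);
  return {
    ...raw,
    id: asString(raw['id'], `sub-${index + 1}`),
    startTime,
    // Guard against an end time that precedes the start time.
    endTime: Math.max(asNumber(raw['endTime'], startTime), startTime),
    originalText: asString(raw['originalText'], ''),
    translatedText: asString(raw['translatedText'], ''),
    speaker: asString(raw['speaker'], ''),
    voiceId: asString(raw['voiceId'], ''),
    volume: asNumber(raw['volume'], 1),
  };
}
```
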
## Subtitle Generation Strategy

The provider switch should affect LLM reasoning, not TTS and not the MiniMax path.

The cleanest boundary is:

1. audio extraction and timeline preparation stay on the server
2. LLM provider handles translation and label generation
3. MiniMax remains the only TTS engine

This reduces the risk that switching providers changes subtitle timing behavior unpredictably.

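One way to enforce this boundary in code is to let the provider return only text fields keyed by segment id, then merge them onto the server-prepared timeline. The sketch below assumes that "label" means the speaker label and that segments carry a stable `id`; all type and function names are hypothetical.

```typescript
// Prepared on the server before any LLM call.
interface TimelineSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
}

// The only fields a provider is allowed to contribute (assumed set).
interface LlmSegmentResult {
  id: string;
  translatedText: string;
  speaker: string;
}

interface TimedSubtitle extends TimelineSegment {
  translatedText: string;
  speaker: string;
}

function applyLlmResults(
  segments: readonly TimelineSegment[],
  results: readonly LlmSegmentResult[],
): TimedSubtitle[] {
  const byId = new Map(results.map((result) => [result.id, result] as const));
  // Iterate over segments, not results, so order and timing come from the timeline.
  return segments.map((segment) => {
    const result = byId.get(segment.id);
    if (result === undefined) {
      throw new Error(`LLM output is missing segment ${segment.id}`);
    }
    return { ...segment, translatedText: result.translatedText, speaker: result.speaker };
  });
}
```

Since `startTime` and `endTime` are never read from the provider output, two providers that return the same ids cannot produce different timing.
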
## Doubao Integration Notes

Use the Ark Responses API on the server:

1. endpoint: `https://ark.cn-beijing.volces.com/api/v3/responses`
2. auth: `Authorization: Bearer ${ARK_API_KEY}`
3. model: configurable, defaulting to `doubao-seed-2-0-pro-260215`

The provider should treat Doubao as a text-generation backend and extract normalized text from the response payload before JSON parsing.

Implementation detail:

1. the response parser should not assume SDK-specific helpers
2. it should read the returned response envelope and collect the textual output fragments
3. the final result should be parsed as JSON only after the output text is reconstructed

This is an implementation inference based on the official Ark Responses API response shape and is meant to keep the parser resilient to wrapper differences.

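A parser along these lines could be sketched as follows. It assumes the envelope exposes an `output` array whose items hold a `content` array of parts shaped like `{ type: 'output_text', text }`; that layout is an assumption and should be checked against the Ark documentation before use. The function names are hypothetical.

```typescript
// Assumed part shape inside output[].content[]; verify against the Ark docs.
interface TextPart {
  type?: unknown;
  text?: unknown;
}

// Walks the envelope defensively and concatenates every text fragment in order.
function extractOutputText(envelope: unknown): string {
  if (typeof envelope !== 'object' || envelope === null) return '';
  const output = (envelope as { output?: unknown }).output;
  if (!Array.isArray(output)) return '';
  const fragments: string[] = [];
  for (const item of output as unknown[]) {
    if (typeof item !== 'object' || item === null) continue;
    const content = (item as { content?: unknown }).content;
    if (!Array.isArray(content)) continue; // e.g. items that carry no message content
    for (const part of content as unknown[]) {
      if (typeof part !== 'object' || part === null) continue;
      const { type, text } = part as TextPart;
      if (type === 'output_text' && typeof text === 'string') fragments.push(text);
    }
  }
  return fragments.join('');
}

// JSON parsing happens only after the full output text is reconstructed.
function parseSubtitleJson(envelope: unknown): unknown {
  const text = extractOutputText(envelope).trim();
  if (text === '') {
    throw new Error('Doubao response contained no output text');
  }
  try {
    return JSON.parse(text);
  } catch {
    throw new Error('Doubao output text was not valid JSON');
  }
}
```

Joining fragments before parsing matters because a single JSON document may be split across several parts, so parsing each part on its own would fail.
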
## Configuration

Environment variables:

1. `ARK_API_KEY` for Doubao
2. `GEMINI_API_KEY` for Gemini
3. `MINIMAX_API_KEY` for TTS
4. optional `DOUBAO_MODEL` for server-side model override
5. optional `DEFAULT_LLM_PROVIDER` with a default value of `doubao`

Rules:

1. No API keys may be embedded in frontend code.
2. No provider may silently reuse another provider's key.
3. If the selected provider is missing its key, return a clear error.

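The variables and rules above could be loaded with something like the sketch below. `loadServerConfig`, `requireApiKey`, and the `ServerConfig` shape are hypothetical, and treating blank values as unset is an assumption. The environment is passed in as a plain record (for example `process.env` on the server) so the function stays easy to test.

```typescript
type Env = Record<string, string | undefined>;
type LlmProvider = 'doubao' | 'gemini';

interface ServerConfig {
  defaultProvider: LlmProvider;
  doubaoModel: string;
  apiKeys: Record<LlmProvider, string | undefined>;
  minimaxApiKey: string | undefined;
}

const FALLBACK_DOUBAO_MODEL = 'doubao-seed-2-0-pro-260215';

function isLlmProvider(value: string): value is LlmProvider {
  return value === 'doubao' || value === 'gemini';
}

// Assumption: blank or whitespace-only values count as unset.
function nonEmpty(value: string | undefined): string | undefined {
  const trimmed = value?.trim();
  return trimmed ? trimmed : undefined;
}

function loadServerConfig(env: Env): ServerConfig {
  const rawDefault = nonEmpty(env['DEFAULT_LLM_PROVIDER']) ?? 'doubao';
  if (!isLlmProvider(rawDefault)) {
    throw new Error(`DEFAULT_LLM_PROVIDER must be "doubao" or "gemini", got "${rawDefault}"`);
  }
  return {
    defaultProvider: rawDefault,
    doubaoModel: nonEmpty(env['DOUBAO_MODEL']) ?? FALLBACK_DOUBAO_MODEL,
    // Each provider maps to exactly one variable, so keys are never shared.
    apiKeys: {
      doubao: nonEmpty(env['ARK_API_KEY']),
      gemini: nonEmpty(env['GEMINI_API_KEY']),
    },
    minimaxApiKey: nonEmpty(env['MINIMAX_API_KEY']),
  };
}

function requireApiKey(config: ServerConfig, provider: LlmProvider): string {
  const key = config.apiKeys[provider];
  if (key === undefined) {
    throw new Error(`Missing API key for provider "${provider}"`);
  }
  return key;
}
```
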
## Error Handling

Provider failures must be explicit.

1. If `provider` is invalid, return `400`.
2. If the selected provider key is missing, return `400`.
3. If the selected provider returns an auth failure, return `401` or a mapped upstream auth error.
4. If the selected provider fails unexpectedly, return `502` or `500` with a provider-specific error message.
5. Do not auto-fallback from Doubao to Gemini or from Gemini to Doubao.

The UI should show which provider failed so the user is never misled about which model generated a subtitle result.

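The status mapping above can be expressed as a small pure function. In this sketch the `ProviderFailure` union and the response body shape are hypothetical, and where the rules allow two statuses it picks `401` for auth failures and `502` for unexpected upstream failures.

```typescript
type LlmProvider = 'doubao' | 'gemini';

// Hypothetical failure categories, one per rule in the list above.
type ProviderFailure =
  | { kind: 'invalid_provider'; received: string }
  | { kind: 'missing_key'; provider: LlmProvider }
  | { kind: 'upstream_auth'; provider: LlmProvider }
  | { kind: 'upstream_failure'; provider: LlmProvider; detail: string };

interface ErrorResponse {
  status: 400 | 401 | 502;
  // `provider` lets the UI name the provider that failed.
  body: { error: string; provider?: LlmProvider };
}

function toErrorResponse(failure: ProviderFailure): ErrorResponse {
  switch (failure.kind) {
    case 'invalid_provider':
      return {
        status: 400,
        body: { error: `Unknown LLM provider "${failure.received}"` },
      };
    case 'missing_key':
      return {
        status: 400,
        body: {
          provider: failure.provider,
          error: `No API key configured for ${failure.provider}`,
        },
      };
    case 'upstream_auth':
      return {
        status: 401,
        body: {
          provider: failure.provider,
          error: `${failure.provider} rejected the configured API key`,
        },
      };
    case 'upstream_failure':
      return {
        status: 502,
        body: {
          provider: failure.provider,
          error: `${failure.provider} request failed: ${failure.detail}`,
        },
      };
  }
}
```

No branch retries with the other provider, which keeps the no-fallback rule visible in one place.
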
## Frontend UX

Add the selector in the editor near the subtitle-generation controls so the choice is visible at generation time.

Rules:

1. Default selection is `Doubao`.
2. The selector affects each generation request immediately.
3. The selector does not affect previously generated subtitles until the user regenerates.
4. The selector does not affect MiniMax TTS generation.

## Testing Strategy

Coverage should focus on deterministic seams.

1. Provider resolution defaults to Doubao.
2. Invalid provider is rejected.
3. Missing `ARK_API_KEY` or `GEMINI_API_KEY` returns clear errors.
4. Doubao response parsing turns Ark response content into normalized subtitle JSON.
5. Gemini and Doubao providers both satisfy the same interface contract.
6. Editor defaults to Doubao and sends the selected provider on regenerate.
7. TTS behavior remains unchanged when the LLM provider changes.

## Rollout Notes

1. Introduce the new endpoint and provider abstraction first.
2. Switch the editor to the new endpoint second.
3. Keep MiniMax TTS untouched except for regression checks.
4. Leave any deeper visual fallback provider work for a later pass if needed.

## Constraints

1. This workspace is not a Git repository, so the design document cannot be committed here.
2. The user provided an Ark key in chat, but the implementation must still read provider secrets from `.env` and not hardcode them into source files.