Doubao LLM Provider Design
Date: 2026-03-17
Goal: Add a user-visible LLM switcher so subtitle generation can use either Doubao or Gemini, default to Doubao, and keep TTS fixed on MiniMax.
Current State
The current project is effectively Gemini-only for subtitle generation and translation.
- src/services/geminiService.ts calls Gemini directly from the browser for subtitle generation and Gemini fallback TTS.
- src/server/geminiTranslation.ts translates sentence text on the server with Gemini.
- src/server/audioPipelineConfig.ts only validates GEMINI_API_KEY.
- src/components/EditorScreen.tsx imports a Gemini-specific service and has no model selector.
- MiniMax is already independent and used only for TTS through /api/tts.
This makes provider switching hard because the LLM choice is not isolated behind a shared contract.
Product Requirements
- The editor must show a visible LLM selector.
- Available LLM options are Doubao and Gemini.
- The default LLM must be Doubao.
- TTS must remain fixed to MiniMax and must not participate in provider switching.
- API keys must only come from .env.
- The app must not silently fall back from one LLM provider to the other.
Chosen Approach
Use a server-side provider abstraction for subtitle generation and translation, with a frontend selector that passes the chosen provider to the server.
This approach keeps secrets on the server, avoids browser-side provider drift, and gives the project one place to add or change LLM providers later.
Why This Approach
Option A: Server-side provider abstraction with frontend selector
Recommended.
- Frontend sends provider: 'doubao' | 'gemini'.
- Server reads the matching API key from .env.
- Server routes subtitle text generation through a provider adapter.
- Time-critical audio extraction and timeline logic stay outside the provider-specific layer.
Pros:
- Keeps API keys off the client.
- Produces one consistent API contract for the editor.
- Makes default-provider behavior easy to enforce.
- Prevents Gemini-specific code from leaking further into the app.
Cons:
- Requires moving browser-side subtitle generation behavior into a server-owned path.
- Touches both frontend and backend.
Option B: Keep Gemini in the browser and add Doubao as a separate server path
Rejected.
Pros:
- Faster initial implementation.
Cons:
- Two subtitle-generation architectures would coexist.
- Provider behavior would drift over time.
- It violates the requirement that keys come only from .env.
Option C: Client-side provider switching
Rejected.
Pros:
- Minimal backend work.
Cons:
- Exposes secrets to the browser.
- Conflicts with the .env-only requirement.
Architecture
Frontend
The editor adds an LLM selector with the values:
- Doubao
- Gemini
The default selected value is Doubao.
When the user clicks subtitle generation, the frontend sends:
- the uploaded video
- the target language
- the selected LLM provider
- optional trim metadata if the current flow needs it
The frontend no longer needs to know how Gemini or Doubao are called. It only consumes a normalized subtitle payload.
Server
The server becomes the single owner of LLM subtitle generation.
Responsibilities:
- validate the incoming provider
- read provider credentials from .env
- extract audio and prepare subtitle-generation inputs
- call the chosen provider adapter
- normalize the result into the existing subtitle shape
Provider Layer
Create a provider abstraction around LLM calls:
- resolveLlmProvider(provider, env)
- geminiProvider
- doubaoProvider
Each provider must accept the same logical input and return the same logical output so the rest of the app is provider-agnostic.
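A minimal sketch of what that shared contract could look like. The design names only resolveLlmProvider, geminiProvider, and doubaoProvider; every type, field, and signature below (LlmProvider, SubtitleLlmInput, apiKeyEnvVar, the stub adapters) is a hypothetical stand-in, and the adapters return placeholder text instead of calling Ark or Gemini.

```typescript
// Hypothetical contract: names other than resolveLlmProvider, geminiProvider
// and doubaoProvider are assumptions made for illustration.
type LlmProviderId = 'doubao' | 'gemini';
type Env = Record<string, string | undefined>;

interface SubtitleLlmInput {
  sentences: string[]; // source sentences produced by the timeline step
  targetLanguage: string;
}

interface SubtitleLlmOutput {
  translations: string[]; // one entry per input sentence, in the same order
}

interface LlmProvider {
  readonly id: LlmProviderId;
  readonly apiKeyEnvVar: string;
  generate(input: SubtitleLlmInput, apiKey: string): Promise<SubtitleLlmOutput>;
}

// Placeholder adapters: real ones would call Ark or Gemini inside generate().
function makeStubProvider(id: LlmProviderId, apiKeyEnvVar: string): LlmProvider {
  return {
    id,
    apiKeyEnvVar,
    async generate(input) {
      return { translations: input.sentences.map((s) => `[${id}:${input.targetLanguage}] ${s}`) };
    },
  };
}

const doubaoProvider = makeStubProvider('doubao', 'ARK_API_KEY');
const geminiProvider = makeStubProvider('gemini', 'GEMINI_API_KEY');
const PROVIDERS: Record<LlmProviderId, LlmProvider> = {
  doubao: doubaoProvider,
  gemini: geminiProvider,
};

function isLlmProviderId(value: unknown): value is LlmProviderId {
  return value === 'doubao' || value === 'gemini';
}

// Returns the adapter together with its own key; it never borrows another
// provider's key and never falls back to a different provider.
function resolveLlmProvider(provider: unknown, env: Env): { provider: LlmProvider; apiKey: string } {
  const id = provider === undefined ? 'doubao' : provider;
  if (!isLlmProviderId(id)) {
    throw new Error(`Unknown LLM provider: ${String(id)}`);
  }
  const adapter = PROVIDERS[id];
  const apiKey = env[adapter.apiKeyEnvVar];
  if (!apiKey) {
    throw new Error(`${adapter.apiKeyEnvVar} is not set for provider "${id}"`);
  }
  return { provider: adapter, apiKey };
}
```

Because both adapters satisfy the same interface, the route handler can call generate() without branching on the provider name.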
API Design
Add a dedicated subtitle-generation endpoint rather than overloading the existing audio-extraction endpoint.
Request
POST /api/generate-subtitles
Multipart or JSON payload fields:
- video
- targetLanguage
- provider
- optional trimRange
Response
Return the same normalized subtitle structure the editor already understands.
At minimum each subtitle object should include:
- id
- startTime
- endTime
- originalText
- translatedText
- speaker
- voiceId
- volume
If richer timeline metadata already exists in the current server subtitle pipeline, keep it in the response rather than trimming it away.
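The minimum shape above could be checked at the server boundary with something like the following sketch. The field names come from the list; the field types (string ids, numeric times and volume) and the helper names are assumptions, since the existing subtitle type is not shown here. Extra fields are left alone so richer timeline metadata passes through.

```typescript
// Field names follow the response description; the types are assumptions.
interface Subtitle {
  id: string;
  startTime: number; // assumed to be seconds from the start of the video
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker: string;
  voiceId: string;
  volume: number;
}

const STRING_FIELDS = ['id', 'originalText', 'translatedText', 'speaker', 'voiceId'] as const;
const NUMBER_FIELDS = ['startTime', 'endTime', 'volume'] as const;

// Returns a list of problems; an empty list means the minimum shape is present.
// Unknown extra fields are ignored rather than rejected.
function findSubtitleProblems(value: unknown): string[] {
  if (typeof value !== 'object' || value === null) return ['not an object'];
  const record = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const field of STRING_FIELDS) {
    if (typeof record[field] !== 'string') problems.push(`${field} must be a string`);
  }
  for (const field of NUMBER_FIELDS) {
    const v = record[field];
    if (typeof v !== 'number' || !isFinite(v)) problems.push(`${field} must be a finite number`);
  }
  if (problems.length === 0 && (record.endTime as number) < (record.startTime as number)) {
    problems.push('endTime must not precede startTime');
  }
  return problems;
}

function isSubtitle(value: unknown): value is Subtitle {
  return findSubtitleProblems(value).length === 0;
}
```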
Subtitle Generation Strategy
The provider switch should affect LLM reasoning, not TTS and not the MiniMax path.
The cleanest boundary is:
- audio extraction and timeline preparation stay on the server
- LLM provider handles translation and label generation
- MiniMax remains the only TTS engine
This reduces the risk that switching providers changes subtitle timing behavior unpredictably.
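One way to picture that boundary is a merge step in which timing comes only from the server's timeline and the provider contributes only text and labels. This is a sketch under assumptions: TimedSegment, LlmLine, and mergeLlmOutput are hypothetical names, and treating "label" as a speaker label is a guess.

```typescript
// Hypothetical types: timing is owned by the server pipeline, text by the LLM.
interface TimedSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
}

interface LlmLine {
  id: string;
  translatedText: string;
  speaker: string; // assumed meaning of "label generation"
}

interface MergedSubtitle extends TimedSegment {
  translatedText: string;
  speaker: string;
}

// The LLM output can only add text to a segment; it cannot move, drop, or
// add segments, so switching providers cannot change timing.
function mergeLlmOutput(segments: TimedSegment[], lines: LlmLine[]): MergedSubtitle[] {
  const byId: Record<string, LlmLine | undefined> = {};
  for (const line of lines) byId[line.id] = line;
  return segments.map((segment) => {
    const line = byId[segment.id];
    if (!line) throw new Error(`LLM output is missing segment ${segment.id}`);
    return { ...segment, translatedText: line.translatedText, speaker: line.speaker };
  });
}
```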
Doubao Integration Notes
Use the Ark Responses API on the server:
- endpoint: https://ark.cn-beijing.volces.com/api/v3/responses
- auth: Authorization: Bearer ${ARK_API_KEY}
- model: configurable, defaulting to doubao-seed-2-0-pro-260215
The provider should treat Doubao as a text-generation backend and extract normalized text from the response payload before JSON parsing.
Implementation detail:
- the response parser should not assume SDK-specific helpers
- it should read the returned response envelope and collect the textual output fragments
- the final result should be parsed as JSON only after the output text is reconstructed
This is an implementation inference based on the official Ark Responses API response shape and is meant to keep the parser resilient to wrapper differences.
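The parsing steps above might look like the sketch below. It assumes the envelope has the form { output: [{ type: 'message', content: [{ type: 'output_text', text }] }] }; that shape is an assumption to verify against the Ark Responses API reference, and collectOutputText and parseDoubaoJson are hypothetical helper names.

```typescript
// Assumed envelope: { output: [{ type: 'message', content: [{ type: 'output_text', text }] }] }.
// Confirm the field names against the Ark Responses API reference before use.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Walks the envelope without SDK helpers and joins every text fragment in order.
// Items that are not messages (for example reasoning items) are skipped.
function collectOutputText(envelope: unknown): string {
  if (!isRecord(envelope) || !Array.isArray(envelope.output)) {
    throw new Error('Doubao response has no output array');
  }
  const fragments: string[] = [];
  for (const item of envelope.output) {
    if (!isRecord(item) || item.type !== 'message' || !Array.isArray(item.content)) continue;
    for (const part of item.content) {
      if (isRecord(part) && part.type === 'output_text' && typeof part.text === 'string') {
        fragments.push(part.text);
      }
    }
  }
  if (fragments.length === 0) throw new Error('Doubao response contained no text output');
  return fragments.join('');
}

// JSON parsing happens only after the full output text has been reconstructed.
function parseDoubaoJson(envelope: unknown): unknown {
  const text = collectOutputText(envelope).trim();
  try {
    return JSON.parse(text);
  } catch {
    throw new Error('Doubao output was not valid JSON');
  }
}
```

Keeping text collection separate from JSON parsing means a change in the envelope only touches collectOutputText.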
Configuration
Environment variables:
- ARK_API_KEY for Doubao
- GEMINI_API_KEY for Gemini
- MINIMAX_API_KEY for TTS
- optional DOUBAO_MODEL for server-side model override
- optional DEFAULT_LLM_PROVIDER with a default value of doubao
Rules:
- No API keys may be embedded in frontend code.
- No provider may silently reuse another provider's key.
- If the selected provider is missing its key, return a clear error.
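A sketch of how the server could read these variables, assuming the environment is passed in as a plain object. The variable names and default values come from the list above; readLlmSettings, requireKey, and KEY_VARS are hypothetical names.

```typescript
type Env = Record<string, string | undefined>;

interface LlmSettings {
  defaultProvider: 'doubao' | 'gemini';
  doubaoModel: string;
}

const DEFAULT_DOUBAO_MODEL = 'doubao-seed-2-0-pro-260215';

// Optional variables fall back to the documented defaults when unset or empty.
function readLlmSettings(env: Env): LlmSettings {
  const rawDefault = env.DEFAULT_LLM_PROVIDER || 'doubao';
  if (rawDefault !== 'doubao' && rawDefault !== 'gemini') {
    throw new Error(`DEFAULT_LLM_PROVIDER must be "doubao" or "gemini", got "${rawDefault}"`);
  }
  return { defaultProvider: rawDefault, doubaoModel: env.DOUBAO_MODEL || DEFAULT_DOUBAO_MODEL };
}

// Each service maps to exactly one variable, so a missing key can never be
// satisfied by another provider's key.
const KEY_VARS = {
  doubao: 'ARK_API_KEY',
  gemini: 'GEMINI_API_KEY',
  minimax: 'MINIMAX_API_KEY',
} as const;

function requireKey(service: keyof typeof KEY_VARS, env: Env): string {
  const name = KEY_VARS[service];
  const value = env[name];
  if (!value) throw new Error(`${name} is missing from .env (required for ${service})`);
  return value;
}
```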
Error Handling
Provider failures must be explicit.
- If provider is invalid, return 400.
- If the selected provider key is missing, return 400.
- If the selected provider returns an auth failure, return 401 or a mapped upstream auth error.
- If the selected provider fails unexpectedly, return 502 or 500 with a provider-specific error message.
- Do not auto-fallback from Doubao to Gemini or from Gemini to Doubao.
The UI should show which provider failed so the user is never misled about which model generated a subtitle result.
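The status rules could be centralized in one mapping function, sketched below. Where the rules allow a choice, this sketch picks 401 for auth failures and 502 for unexpected upstream failures. ProviderFailure, ErrorResponse, and toErrorResponse are hypothetical names, and the response body shape is an assumption.

```typescript
type LlmProviderId = 'doubao' | 'gemini';

// Hypothetical failure kinds, one per rule in the list above.
type ProviderFailure =
  | { kind: 'invalid_provider'; received: string }
  | { kind: 'missing_key'; provider: LlmProviderId }
  | { kind: 'upstream_auth'; provider: LlmProviderId }
  | { kind: 'upstream_error'; provider: LlmProviderId; detail: string };

interface ErrorResponse {
  status: number;
  body: { error: string; provider?: LlmProviderId };
}

// Every branch that knows the provider includes it in the body so the UI can
// show which provider failed. There is no branch that retries another provider.
function toErrorResponse(failure: ProviderFailure): ErrorResponse {
  switch (failure.kind) {
    case 'invalid_provider':
      return { status: 400, body: { error: `Unknown provider "${failure.received}"` } };
    case 'missing_key':
      return {
        status: 400,
        body: { error: `API key for ${failure.provider} is not configured`, provider: failure.provider },
      };
    case 'upstream_auth':
      return {
        status: 401,
        body: { error: `${failure.provider} rejected the configured API key`, provider: failure.provider },
      };
    case 'upstream_error':
      return {
        status: 502,
        body: { error: `${failure.provider} request failed: ${failure.detail}`, provider: failure.provider },
      };
  }
}
```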
Frontend UX
Add the selector in the editor near the subtitle-generation controls so the choice is visible at generation time.
Rules:
- Default selection is Doubao.
- The selector affects each generation request immediately.
- The selector does not affect previously generated subtitles until the user regenerates.
- The selector does not affect MiniMax TTS generation.
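These rules can be expressed as a small piece of editor state, independent of any UI framework. The sketch below is illustrative only: EditorLlmState and its helper functions are hypothetical, and the subtitles field is a stand-in for the real subtitle objects.

```typescript
type LlmProviderId = 'doubao' | 'gemini';

interface EditorLlmState {
  selectedProvider: LlmProviderId;   // what the selector currently shows
  subtitles: string[];               // stand-in for the real subtitle objects
  generatedBy: LlmProviderId | null; // provider that produced `subtitles`
}

const initialEditorLlmState: EditorLlmState = {
  selectedProvider: 'doubao',
  subtitles: [],
  generatedBy: null,
};

// Changing the selector leaves existing subtitles and their origin untouched.
function selectProvider(state: EditorLlmState, provider: LlmProviderId): EditorLlmState {
  return { ...state, selectedProvider: provider };
}

// Each request reads the selector at click time, so a change applies immediately.
function buildGenerateRequest(state: EditorLlmState, targetLanguage: string) {
  return { provider: state.selectedProvider, targetLanguage };
}

// Only a completed generation replaces the subtitles and records their provider.
function applyGenerated(
  state: EditorLlmState,
  provider: LlmProviderId,
  subtitles: string[],
): EditorLlmState {
  return { ...state, subtitles, generatedBy: provider };
}
```

No TTS state appears here, which reflects the rule that the selector never reaches the MiniMax path.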
Testing Strategy
Coverage should focus on deterministic seams.
- Provider resolution defaults to Doubao.
- Invalid provider is rejected.
- Missing ARK_API_KEY or GEMINI_API_KEY returns clear errors.
- Doubao response parsing turns Ark response content into normalized subtitle JSON.
- Gemini and Doubao providers both satisfy the same interface contract.
- Editor defaults to Doubao and sends the selected provider on regenerate.
- TTS behavior remains unchanged when the LLM provider changes.
Rollout Notes
- Introduce the new endpoint and provider abstraction first.
- Switch the editor to the new endpoint second.
- Keep MiniMax TTS untouched except for regression checks.
- Leave any deeper visual fallback provider work for a later pass if needed.
Constraints
- This workspace is not a Git repository, so the design document cannot be committed here.
- The user provided an Ark key in chat, but the implementation must still read provider secrets from .env and not hardcode them into source files.