diff --git a/.openclaw-sync/source.json b/.openclaw-sync/source.json
index 5ec6579c9..2b40728d6 100644
--- a/.openclaw-sync/source.json
+++ b/.openclaw-sync/source.json
@@ -1,15 +1,15 @@
 {
   "repository": "openclaw/openclaw",
-  "sha": "30e079dd89b451ca22cb360ca887bd9367cc7939",
+  "sha": "5457462e62670839b1b7d793e22f7f38a76b8b0c",
   "sources": {
     "openclaw": {
       "repository": "openclaw/openclaw",
-      "sha": "30e079dd89b451ca22cb360ca887bd9367cc7939"
+      "sha": "5457462e62670839b1b7d793e22f7f38a76b8b0c"
     },
     "clawhub": {
       "repository": "openclaw/clawhub",
       "sha": "38c21345906ab1f107a91b33bb86b63667d96643"
     }
   },
-  "syncedAt": "2026-05-08T13:03:40.138Z"
+  "syncedAt": "2026-05-08T13:17:34.583Z"
 }
diff --git a/docs/channels/discord.md b/docs/channels/discord.md
index aec9b12ea..e27675fa4 100644
--- a/docs/channels/discord.md
+++ b/docs/channels/discord.md
@@ -1172,6 +1172,7 @@ Auto-join example:
     discord: {
       voice: {
         enabled: true,
+        mode: "stt-tts",
         model: "openai/gpt-5.4-mini",
         autoJoin: [
           {
@@ -1199,8 +1200,10 @@ Auto-join example:
 Notes:
 
 - `voice.tts` overrides `messages.tts` for voice playback only.
-- `voice.model` overrides the LLM used for Discord voice channel responses only. Leave it unset to inherit the routed agent model. Do not set this to `gpt-realtime-2`; Discord voice channels use STT plus TTS playback, not the OpenAI Realtime session transport.
-- STT uses `tools.media.audio`; `voice.model` does not affect transcription.
+- `voice.mode` controls the conversation path: `stt-tts` keeps the existing batch STT plus TTS flow, `talk-buffer` uses a realtime voice shell for turn timing/transcription/playback while the OpenClaw agent produces the answer, and `bidi` lets the realtime model converse directly while exposing `openclaw_agent_consult` for the OpenClaw brain.
+- `voice.model` overrides the OpenClaw agent brain for Discord voice responses and realtime consults. Leave it unset to inherit the routed agent model. It is separate from `voice.realtime.model`.
+- In `stt-tts` mode, STT uses `tools.media.audio`; `voice.model` does not affect transcription.
+- In realtime modes, `voice.realtime.provider`, `voice.realtime.model`, and `voice.realtime.voice` configure the realtime audio session. For OpenAI Realtime 2 plus the Codex brain, use `voice.realtime.model: "gpt-realtime-2"` and `voice.model: "openai-codex/gpt-5.5"`.
 - For an OpenAI voice on Discord playback, set `voice.tts.provider: "openai"` and choose a Text-to-speech voice under `voice.tts.openai.voice` or `voice.tts.providers.openai.voice`. `cedar` is a good masculine-sounding choice on the current OpenAI TTS model.
 - Per-channel Discord `systemPrompt` overrides apply to voice transcript turns for that voice channel.
 - Voice transcript turns derive owner status from Discord `allowFrom` (or `dm.allowFrom`); non-owner speakers cannot access owner-only tools (for example `gateway` and `cron`).
@@ -1211,7 +1214,7 @@ Notes:
 - `@discordjs/voice` defaults are `daveEncryption=true` and `decryptionFailureTolerance=24` if unset.
 - `voice.connectTimeoutMs` controls the initial `@discordjs/voice` Ready wait for `/vc join` and auto-join attempts. Default: `30000`.
 - `voice.reconnectGraceMs` controls how long OpenClaw waits for a disconnected voice session to begin reconnecting before destroying it. Default: `15000`.
-- Voice playback does not stop just because another user starts speaking. To avoid feedback loops, OpenClaw ignores new voice capture while TTS is playing; speak after playback finishes for the next turn.
+- In `stt-tts` mode, voice playback does not stop just because another user starts speaking. To avoid feedback loops, OpenClaw ignores new voice capture while TTS is playing; speak after playback finishes for the next turn. Realtime modes forward speaker starts as barge-in signals to the realtime provider.
 - `voice.captureSilenceGraceMs` controls how long OpenClaw waits after Discord reports a speaker has stopped before finalizing that audio segment for STT. Default: `2500`; raise this if Discord splits normal pauses into choppy partial transcripts.
 - When ElevenLabs is the selected TTS provider, Discord voice playback uses streaming TTS and starts from the provider response stream. Providers without streaming support fall back to the synthesized temp-file path.
 - OpenClaw also watches receive decrypt failures and auto-recovers by leaving/rejoining the voice channel after repeated failures in a short window.
@@ -1219,7 +1222,7 @@ Notes:
 - `The operation was aborted` receive events are expected when OpenClaw finalizes a captured speaker segment; they are verbose diagnostics, not warnings.
 - Verbose Discord voice logs include a bounded one-line STT transcript preview for each accepted speaker segment, so debugging shows both the user side and the agent reply side without dumping unbounded transcript text.
 
-Voice channel pipeline:
+STT plus TTS pipeline:
 
 - Discord PCM capture is converted to a WAV temp file.
 - `tools.media.audio` handles STT, for example `openai/gpt-4o-mini-transcribe`.
@@ -1227,7 +1230,51 @@ Voice channel pipeline:
 - `voice.model`, when set, overrides only the response LLM for this voice-channel turn.
 - `voice.tts` is merged over `messages.tts`; streaming-capable providers feed the player directly, otherwise the resulting audio file is played in the joined channel.
 
-Credentials are resolved per component: LLM route auth for `voice.model`, STT auth for `tools.media.audio`, and TTS auth for `messages.tts`/`voice.tts`.
+Realtime talk-buffer example:
+
+```json5
+{
+  channels: {
+    discord: {
+      voice: {
+        enabled: true,
+        mode: "talk-buffer",
+        model: "openai-codex/gpt-5.5",
+        realtime: {
+          provider: "openai",
+          model: "gpt-realtime-2",
+          voice: "cedar",
+        },
+      },
+    },
+  },
+}
+```
+
+Realtime bidi example:
+
+```json5
+{
+  channels: {
+    discord: {
+      voice: {
+        enabled: true,
+        mode: "bidi",
+        model: "openai-codex/gpt-5.5",
+        realtime: {
+          provider: "openai",
+          model: "gpt-realtime-2",
+          voice: "cedar",
+          toolPolicy: "safe-read-only",
+          consultPolicy: "always",
+        },
+      },
+    },
+  },
+}
+```
+
+Credentials are resolved per component: LLM route auth for `voice.model`, STT auth for `tools.media.audio`, TTS auth for `messages.tts`/`voice.tts`, and realtime provider auth for `voice.realtime.providers` or the provider's normal auth config.
 
 ### Voice messages
 
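
Reviewer sketch, not part of the patch: the per-component credential sentence above implies a mapping from `voice.mode` to the auth components a session needs. The TypeScript below is one way to picture that mapping. The type names, the `requiredAuth` helper, and the component labels are hypothetical and are not OpenClaw's actual schema or code. It also assumes that an unset `mode` behaves like `stt-tts`, and that realtime modes need only the agent-brain route plus the realtime provider, since the notes describe the realtime session as handling transcription and playback.

```typescript
// Hypothetical types: field names mirror the config keys in this diff,
// but this is an illustration, not OpenClaw's real schema.
type VoiceMode = "stt-tts" | "talk-buffer" | "bidi";

interface RealtimeConfig {
  provider?: string;
  model?: string;
  voice?: string;
}

interface VoiceConfig {
  enabled?: boolean;
  mode?: VoiceMode;
  model?: string; // agent brain override (voice.model)
  realtime?: RealtimeConfig; // realtime audio session (voice.realtime.*)
}

// Hypothetical labels for the credential groups named in the docs.
type AuthComponent = "llm" | "stt" | "tts" | "realtime";

// Assumption: an unset mode falls back to "stt-tts", the pre-existing flow.
function requiredAuth(voice: VoiceConfig): AuthComponent[] {
  const mode: VoiceMode = voice.mode ?? "stt-tts";
  if (mode === "stt-tts") {
    // Batch pipeline: tools.media.audio (STT), agent LLM, then voice.tts.
    return ["llm", "stt", "tts"];
  }
  // Assumption: in "talk-buffer" and "bidi" the realtime session covers
  // audio in and out, and the agent brain is reached for answers or consults.
  return ["llm", "realtime"];
}

const bidiExample: VoiceConfig = {
  enabled: true,
  mode: "bidi",
  model: "openai-codex/gpt-5.5",
  realtime: { provider: "openai", model: "gpt-realtime-2", voice: "cedar" },
};

console.log(requiredAuth(bidiExample).join(", "));
```

If the real implementation still resolves TTS auth in realtime modes (for example as a fallback), the second branch would need a `"tts"` entry; the docs in this diff do not say either way.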